EMR
Introduction
Amazon EMR (Elastic MapReduce) helps create and manage clusters of EC2 instances to analyze and process vast amounts of data.
A cluster can consist of hundreds of EC2 instances running in a VPC.
Supports Spark, HBase, Presto, Flink, Hive, etc.
Features
By default, EMR storage is temporary: HDFS runs on the cluster's EBS volumes within a single AZ, and the data is lost when the cluster terminates.
For persistent storage, or storage that is not tied to a single AZ, use EMRFS, which integrates with S3 for permanent storage (and supports server-side encryption).
Apache Hive on EMR can read from DynamoDB.
Node types (avoid using Spot Instances for Master and Core nodes)
Master Node: manages the cluster, coordinates work, and monitors health.
Core Node: runs tasks and stores data.
Task Node (optional): only runs tasks.
Purchasing options:
On-demand
Reserved Instances: cost savings in exchange for a commitment of at least 1 year
Spot Instances: less reliable, because instances can be terminated unexpectedly
Clusters can be long-running or transient (terminated once the work is done).
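The node types, purchasing options, and cluster lifetime above map roughly onto the request accepted by boto3's EMR `run_job_flow` call. The following is a minimal sketch under stated assumptions, not a definitive setup: the helper `build_cluster_request`, the cluster name, the release label, the instance types and counts, and the use of the default EMR role names are all illustrative. It only builds the request dictionary; nothing is sent to AWS.

```python
def build_cluster_request(transient: bool) -> dict:
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All names, instance types, and counts here are illustrative assumptions.
    """
    instance_groups = [
        # Master and Core nodes use On-Demand: losing them to a Spot
        # interruption can take down the cluster or lose HDFS data.
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 2},
        # Task nodes only run tasks, so Spot interruptions are tolerable.
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": "m5.xlarge", "InstanceCount": 4},
    ]
    return {
        "Name": "example-cluster",            # assumed name
        "ReleaseLabel": "emr-6.15.0",         # assumed release
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": instance_groups,
            # False -> transient: the cluster terminates when no steps remain.
            # True  -> long-running: the cluster stays up and waits for work.
            "KeepJobFlowAliveWhenNoSteps": not transient,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(transient=True)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect or review the purchasing option chosen for each node type before any cluster is launched.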
Instance Configuration
Uniform instance groups
Select a single instance type and purchasing option for each node type.
Supports automatic scaling with custom scaling policies.
Instance Fleet
Mix instance types and purchasing options for each node type.
No automatic scaling with custom policies (EMR managed scaling is still available).
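The difference between the two configurations can be sketched with the dictionary shapes that boto3's `run_job_flow` accepts under `Instances["InstanceGroups"]` and `Instances["InstanceFleets"]`. This is a hedged illustration only: the helper names, instance types, counts, and capacities are assumptions chosen for the example, and a Task node is used so that Spot capacity stays off the Master and Core nodes.

```python
def uniform_task_group() -> dict:
    """Uniform instance group: one instance type and one purchasing option."""
    return {
        "Name": "task-group",          # assumed name
        "InstanceRole": "TASK",
        "Market": "SPOT",              # single purchasing option
        "InstanceType": "m5.xlarge",   # single instance type
        "InstanceCount": 4,
    }


def task_fleet() -> dict:
    """Instance fleet: several instance types, mixed On-Demand and Spot."""
    return {
        "Name": "task-fleet",          # assumed name
        "InstanceFleetType": "TASK",
        # Target capacities are expressed in units of WeightedCapacity.
        "TargetOnDemandCapacity": 2,
        "TargetSpotCapacity": 6,
        "InstanceTypeConfigs": [
            {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
            {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
        ],
    }


group = uniform_task_group()
fleet = task_fleet()
# A cluster uses one style or the other, never both:
#   Instances={"InstanceGroups": [group, ...]}  or
#   Instances={"InstanceFleets": [fleet, ...]}
```

A fleet lets EMR pick from whichever listed instance types are available to meet the target capacities, which is mainly useful when Spot capacity for a single type is scarce.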