# EMR

**Introduction**

* Elastic MapReduce (EMR) helps create and manage EC2 clusters that analyze and process vast amounts of data
* A cluster can consist of hundreds of EC2 instances in a VPC
* Supports: Spark, HBase, Presto, Flink, Hive, etc.

**Features**

* **The default storage for EMR is temporary**: HDFS runs on the cluster's EBS volumes in a single AZ, so the data is lost when the cluster terminates.
  * If you need persistent storage, or storage that is not tied to one AZ, use EMRFS, which integrates with S3 for permanent storage (supports server-side encryption).
* Can run Apache Hive on EMR to read from DynamoDB
* Node types (avoid using Spot Instances for Master and Core nodes)
  * Master Node: manages the cluster, coordinates work, and monitors health.
  * Core Node: runs tasks and stores data.
  * Task Node (optional): only runs tasks.
* Purchasing options:
  * On-demand
  * Reserved Instances (minimum 1-year term) for cost savings
  * Spot Instances: less reliable due to unexpected termination
* Clusters can be long-running or transient (a transient cluster terminates automatically once its work finishes)
* Instance Configuration
  * Uniform instance groups
    * Select a single instance type and purchasing option for each node type.
    * With auto scaling
  * Instance Fleet
    * Mix instance types and purchasing options for each node type.
    * No custom auto scaling policies (EMR managed scaling can still be used)
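
The node types, purchasing options, and uniform instance groups above can be sketched as a cluster request. This is a minimal illustration, not a definitive setup: the helper names (`instance_group`, `build_cluster_request`), the instance types, the counts, the release label, and the bucket name are all assumptions chosen for the example. The dictionary follows the parameter shape of boto3's `run_job_flow`, but nothing here calls AWS.

```python
from typing import Any, Dict


def instance_group(role: str, instance_type: str, count: int, market: str) -> Dict[str, Any]:
    """Describe one uniform instance group: a single instance type and purchasing option."""
    return {
        "Name": f"{role.lower()}-group",
        "InstanceRole": role,          # MASTER, CORE, or TASK
        "Market": market,              # ON_DEMAND or SPOT
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


def build_cluster_request(name: str, log_bucket: str, transient: bool = True) -> Dict[str, Any]:
    """Build a request dict for a cluster with one group per node type.

    Master and Core nodes stay On-Demand; only the optional Task nodes use Spot,
    since losing a Task node does not lose HDFS data.
    """
    groups = [
        instance_group("MASTER", "m5.xlarge", 1, "ON_DEMAND"),  # assumed type and count
        instance_group("CORE", "m5.xlarge", 2, "ON_DEMAND"),    # assumed type and count
        instance_group("TASK", "m5.xlarge", 4, "SPOT"),         # assumed type and count
    ]
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick the one you need
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "LogUri": f"s3://{log_bucket}/emr-logs/",  # logs go to S3, outside the temporary HDFS
        "Instances": {
            "InstanceGroups": groups,
            # False -> transient cluster (terminates once all steps finish)
            # True  -> long-running cluster
            "KeepJobFlowAliveWhenNoSteps": not transient,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names; yours may differ
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_cluster_request("demo-cluster", "my-example-bucket")
    for group in request["Instances"]["InstanceGroups"]:
        print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

With credentials and the roles in place, a request like this could be submitted with `boto3.client("emr").run_job_flow(**request)`. An Instance Fleet configuration would replace `InstanceGroups` with `InstanceFleets` and list several instance types per node type.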

**Running Jobs on AWS**

![Running Jobs on AWS](https://3303577320-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4cDbT2F2VmcAohuhSN%2F-MEWbe-G4g3yZkb9FtMn%2F-MEWdqj5dn9P3Gal2ctX%2FScreen%20Shot%202020-08-12%20at%203.49.19%20PM.png?alt=media\&token=9b883eb9-ebc8-4627-8d21-c561108339e7)
