EMR

Introduction

  • Elastic MapReduce (EMR) helps create and manage EC2 clusters to process and analyze vast amounts of data

  • A cluster can consist of hundreds of EC2 instances in a VPC

  • Supports: Spark, HBase, Presto, Flink, Hive, etc.

Features

  • The default storage for EMR is temporary: HDFS runs on the cluster's EBS volumes in a single AZ, so the data is lost when the cluster terminates.

    • If you need persistent storage, or storage that spans AZs, use EMRFS, which integrates with S3 for permanent storage (supports server-side encryption).

  • Can run Apache Hive on EMR to read from DynamoDB

  • Node types (avoid using Spot Instances for Master and Core nodes)

    • Master Node: manages the cluster, coordinates work, and monitors health.

    • Core Node: runs tasks and stores data.

    • Task Node (optional): only runs tasks; stores no data.

  • Purchasing options:

    • On-demand

    • Reserved Instances (min 1 year) for cost savings

    • Spot Instances: less reliable because instances can be terminated unexpectedly

  • A cluster can be long-running or transient (terminated once its work is done)

  • Instance Configuration

    • Uniform instance groups

      • Select a single instance type and purchasing option for each node type.

      • With auto scaling

    • Instance Fleet

      • Mix instance types and purchasing options for each node type.

      • No custom automatic scaling policies (EMR managed scaling is still available)
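
The node types, purchasing options, and the two instance configurations above can be sketched as plain Python dictionaries. This is a minimal illustration, not a definitive setup: the function names, group/fleet names, instance types, and counts are all assumptions chosen for the example. The key names follow the shape of the `Instances` argument that boto3's EMR `run_job_flow` call accepts, but nothing here contacts AWS.

```python
# Hypothetical helpers: names, instance types and counts are illustrative only.

def uniform_instance_groups(instance_type="m5.xlarge", core_count=2, task_count=2):
    """Uniform instance groups: one instance type and one purchasing option per node type."""
    def group(role, market, count):
        return {
            "Name": f"{role.lower()}-group",
            "InstanceRole": role,          # MASTER, CORE or TASK
            "Market": market,              # ON_DEMAND or SPOT
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return {
        "InstanceGroups": [
            group("MASTER", "ON_DEMAND", 1),        # avoid Spot for the master node
            group("CORE", "ON_DEMAND", core_count), # core nodes store data, so avoid Spot
            group("TASK", "SPOT", task_count),      # task nodes only run tasks
        ],
        # False makes the cluster transient: it terminates when no steps remain.
        "KeepJobFlowAliveWhenNoSteps": False,
    }


def instance_fleets():
    """Instance fleets: mix instance types and purchasing options per node type."""
    return {
        "InstanceFleets": [
            {
                "Name": "master-fleet",
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,
                "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
            },
            {
                "Name": "core-fleet",
                "InstanceFleetType": "CORE",
                "TargetOnDemandCapacity": 2,
                "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
            },
            {
                "Name": "task-fleet",
                "InstanceFleetType": "TASK",
                "TargetSpotCapacity": 4,
                # Several candidate instance types for the same fleet.
                "InstanceTypeConfigs": [
                    {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
                    {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
                ],
            },
        ],
        # True keeps the cluster alive after its steps finish (long-running).
        "KeepJobFlowAliveWhenNoSteps": True,
    }
```

Either dictionary could be passed as the `Instances` argument of `run_job_flow`, together with the other required arguments (cluster name, release label, IAM roles), which are omitted here.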
