Spark Memo

Transformation Operations

  • Functions

    • filter(predicate) - keeps only the elements for which the predicate returns true

    • flatMap - similar to map, but flatMap may return 0, 1, or more output elements per input element. It is applied to each element of the RDD and returns the result as a new RDD.

    • flatMapValues - similar to flatMap, but operates on the value only.

    • map - operates on the entire record

    • mapValues - operates on the value only

    • reduceByKey - combines the values of each key with the given function (e.g. summing them)

      • rdd.reduceByKey((x, y) => x + y) // Scala: x and y are two values with the same key

      • rdd.reduceByKey(lambda x, y: x + y) # the same in Python, with a lambda

  • Remark

    • With key/value data, use mapValues() / flatMapValues() instead of map() / flatMap() when the transformation does not change the keys; this is more efficient.

    • About partitioning: if you applied custom partitioning to your RDD (e.g. using partitionBy), map "forgets" that partitioner (the result reverts to default partitioning) because the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.

Creating an RDD from a collection

  • parallelize

    • The optional 2nd parameter sets the number of partitions.

  • makeRDD (Scala only) - similar to parallelize; one overload also lets you specify preferred locations for each partition.

Shared Variables

  • Broadcast Variables

    • A read-only variable cached on each machine, for example to give every node a copy of a large input dataset in an efficient manner.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("PopularMovies")
    sc = SparkContext(conf = conf)

    def loadMovieNames():
        movieNames = {}
        with open("ml-100k/u.ITEM") as f:
            for line in f:
                fields = line.split('|')
                movieNames[int(fields[0])] = fields[1]
        return movieNames

    nameDict = sc.broadcast(loadMovieNames())
    # on worker nodes, read the content with nameDict.value

  • Accumulators

    • Only "added" to through an associative and commutative operation, so they can be efficiently supported in parallel.

    • Methods supported: .add() (called from tasks) and .value (read on the driver only).

Last updated 5 years ago
