Memo

Transformation Operations

  • Functions

    • filter(predicate) - keeps the elements for which the predicate returns true

    • flatMap - similar to map, but the function may return 0, 1, or more elements per input element. It applies to each element of the RDD and returns the flattened results as a new RDD.

    • flatMapValues - similar to flatMap, but operates on the values only, leaving keys unchanged.

    • map - operates on the entire record

    • mapValues - operates on the value only

    • reduceByKey - combines the values that share a key with the given function (e.g. summing them)

      • rdd.reduceByKey((x, y) => x + y) // Scala syntax; x and y are two of the values being merged

      • rdd.reduceByKey(lambda x, y: x + y)  # the same in Python lambda style

  • Remark

    • With key/value data, use mapValues() / flatMapValues() instead of map() / flatMap() when the transformation doesn't affect the keys; it is more efficient.

    • About partitioning: if you applied any custom partitioning to your RDD (e.g. with partitionBy), using map would "forget" that partitioner (the result reverts to default partitioning) because the keys might have changed; mapValues, however, preserves any partitioner set on the RDD. See the sketch below.
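
    • A minimal sketch of the transformations above, assuming local mode and made-up example data:

    from pyspark import SparkContext

    sc = SparkContext("local", "TransformationExamples")

    lines = sc.parallelize(["a b", "c"])

    # map: exactly one output element per input, operating on the whole record
    pairs = lines.map(lambda line: (len(line.split()), line))  # [(2, 'a b'), (1, 'c')]

    # flatMap: 0, 1, or more output elements per input, flattened into one RDD
    words = lines.flatMap(lambda line: line.split())           # ['a', 'b', 'c']

    # filter: keep only the elements for which the predicate returns True
    single = words.filter(lambda w: len(w) == 1)               # ['a', 'b', 'c']

    # flatMapValues: like flatMap, but applied to the values of a pair RDD
    expanded = pairs.flatMapValues(lambda v: v.split())        # [(2, 'a'), (2, 'b'), (1, 'c')]

    # reduceByKey: merge values that share a key
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)

    # mapValues preserves a custom partitioner; map forgets it
    part = counts.partitionBy(2)
    print(part.mapValues(lambda v: v + 1).partitioner)  # partitioner preserved
    print(part.map(lambda kv: kv).partitioner)          # None - reverts to default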

Creating an RDD from a collection

  • parallelize

    • The optional second parameter sets the number of partitions, as in the sketch after this list.

  • makeRDD (Scala API; behaves like parallelize)
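
    • A minimal sketch, assuming an existing SparkContext named sc:

    # Distribute a local collection across 3 partitions (the 2nd argument)
    rdd = sc.parallelize([1, 2, 3, 4, 5], 3)
    print(rdd.getNumPartitions())  # 3

    # Scala equivalent using makeRDD: sc.makeRDD(Seq(1, 2, 3, 4, 5), 3)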

Shared Variables

  • Broadcast Variables

    • A read-only variable cached on each machine.

    • For example, to give every node a copy of a large input dataset in an efficient manner, as in the snippet below.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("PopularMovies")
    sc = SparkContext(conf = conf)

    # Build a movieID -> movie name dictionary on the driver
    def loadMovieNames():
      movieNames = {}
      with open("ml-100k/u.ITEM") as f:
          for line in f:
              fields = line.split('|')
              movieNames[int(fields[0])] = fields[1]
      return movieNames

    # Ship one read-only copy to every executor; workers read it
    # through nameDict.value, e.g. nameDict.value[movieID]
    nameDict = sc.broadcast(loadMovieNames())

  • Accumulators

    • Variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel.

    • Supported operations: .add() on the workers, and the .value property, readable only by the driver, as shown in the sketch below.
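
    • A minimal sketch, assuming an existing SparkContext named sc:

    # Count blank lines cluster-wide with an accumulator
    blankLines = sc.accumulator(0)

    def countBlank(line):
        if line == "":
            blankLines.add(1)  # tasks may only add to it

    sc.parallelize(["a", "", "b", ""]).foreach(countBlank)
    print(blankLines.value)  # only the driver can read .value -> prints 2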
