Memo

Transformation Operations

  • Functions

    • filter(predicate) - keeps the elements for which the predicate returns true

    • flatMap - similar to map, but the function may return 0, 1, or more elements per input element. It applies to each element of the RDD and returns the flattened results as a new RDD.

    • flatMapValues - similar to flatMap, but operates on the values only, leaving keys unchanged.

    • map - operates on the entire record

    • mapValues - operates on the value only

    • reduceByKey - combines the values that share a key with the given function (e.g. summing them)

      • rdd.reduceByKey((x, y) => x + y) // Scala syntax; x and y are two of the values being merged

      • rdd.reduceByKey(lambda x, y: x + y)  # the same in Python lambda style

  • Remark

    • With key/value data, use mapValues() / flatMapValues() instead of map() / flatMap() when the transformation doesn't affect the keys; it is more efficient.

    • About partitioning: if you applied any custom partitioning to your RDD (e.g. with partitionBy), using map would "forget" that partitioner (the result reverts to default partitioning) because the keys might have changed; mapValues, however, preserves any partitioner set on the RDD. See the sketch below.
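
    • A minimal sketch of the transformations above, assuming local mode and made-up example data:

    from pyspark import SparkContext

    sc = SparkContext("local", "TransformationExamples")

    lines = sc.parallelize(["a b", "c"])

    # map: exactly one output element per input, operating on the whole record
    pairs = lines.map(lambda line: (len(line.split()), line))  # [(2, 'a b'), (1, 'c')]

    # flatMap: 0, 1, or more output elements per input, flattened into one RDD
    words = lines.flatMap(lambda line: line.split())           # ['a', 'b', 'c']

    # filter: keep only the elements for which the predicate returns True
    single = words.filter(lambda w: len(w) == 1)               # ['a', 'b', 'c']

    # flatMapValues: like flatMap, but applied to the values of a pair RDD
    expanded = pairs.flatMapValues(lambda v: v.split())        # [(2, 'a'), (2, 'b'), (1, 'c')]

    # reduceByKey: merge values that share a key
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)

    # mapValues preserves a custom partitioner; map forgets it
    part = counts.partitionBy(2)
    print(part.mapValues(lambda v: v + 1).partitioner)  # partitioner preserved
    print(part.map(lambda kv: kv).partitioner)          # None - reverts to default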

Creating an RDD from a collection

  • parallelize

    • The optional second parameter sets the number of partitions, as in the sketch after this list.

  • makeRDD (Scala API; behaves like parallelize)
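
    • A minimal sketch, assuming an existing SparkContext named sc:

    # Distribute a local collection across 3 partitions (the 2nd argument)
    rdd = sc.parallelize([1, 2, 3, 4, 5], 3)
    print(rdd.getNumPartitions())  # 3

    # Scala equivalent using makeRDD: sc.makeRDD(Seq(1, 2, 3, 4, 5), 3)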

Shared Variables

  • Broadcast Variables

    • A read-only variable cached on each machine.

    • For example, to give every node a copy of a large input dataset in an efficient manner, as in the snippet below.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("PopularMovies")
    sc = SparkContext(conf = conf)

    # Build a movieID -> movie name dictionary on the driver
    def loadMovieNames():
      movieNames = {}
      with open("ml-100k/u.ITEM") as f:
          for line in f:
              fields = line.split('|')
              movieNames[int(fields[0])] = fields[1]
      return movieNames

    # Ship one read-only copy to every executor; workers read it
    # through nameDict.value, e.g. nameDict.value[movieID]
    nameDict = sc.broadcast(loadMovieNames())

  • Accumulators

    • Variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel.

    • Supported operations: .add() on the workers, and the .value property, readable only by the driver, as shown in the sketch below.
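
    • A minimal sketch, assuming an existing SparkContext named sc:

    # Count blank lines cluster-wide with an accumulator
    blankLines = sc.accumulator(0)

    def countBlank(line):
        if line == "":
            blankLines.add(1)  # tasks may only add to it

    sc.parallelize(["a", "", "b", ""]).foreach(countBlank)
    print(blankLines.value)  # only the driver can read .value -> prints 2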
