Memo
Transformation Operations
Functions
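filter(func) - keeps only the elements for which func returns true.
flatMap - similar to map, but the function may return 0, 1, or more elements for each input element. It is applied to each element of the RDD, and the flattened results form a new RDD.
flatMapValues - similar to flatMap, but operates on the value only; the key is left unchanged.
map - operates on the entire record and returns exactly one output element per input element.
mapValues - similar to map, but operates on the value only; the key is left unchanged.
reduceByKey - merges the values for each key with an associative and commutative function, for example summing the values by key: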
rdd.reduceByKey((x, y) => x + y) // Scala: x is one of two values that share a key, y is the other.
rdd.reduceByKey(lambda x, y: x + y) # Python: the same reduction written with a lambda.
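The differences between these functions are easier to see side by side. The following is a minimal plain-Python sketch of their semantics on an ordinary list, not Spark code: the sample records and the helper reduce_by_key are made up for illustration, and the PySpark call each step is meant to mirror is given in a comment.

```python
from collections import defaultdict
from functools import reduce

# Illustrative (key, value) records; in Spark these would live in an RDD.
records = [("a", "1 2"), ("b", "3"), ("a", "4 5 6")]

# filter: keep records for which the predicate is true.
# PySpark form: rdd.filter(lambda kv: kv[0] == "a")
only_a = [kv for kv in records if kv[0] == "a"]

# map: exactly one output per input; the function sees the whole record.
# PySpark form: rdd.map(lambda kv: (kv[0].upper(), kv[1]))
upper_keys = [(k.upper(), v) for k, v in records]

# mapValues: one output per input; only the value is transformed.
# PySpark form: rdd.mapValues(lambda v: len(v.split()))
counts = [(k, len(v.split())) for k, v in records]

# flatMap: zero or more outputs per input, flattened into one collection.
# PySpark form: rdd.flatMap(lambda kv: kv[1].split())
tokens = [tok for _, v in records for tok in v.split()]

# flatMapValues: like flatMap, but every output keeps the original key.
# PySpark form: rdd.flatMapValues(lambda v: v.split())
keyed_tokens = [(k, tok) for k, v in records for tok in v.split()]


def reduce_by_key(pairs, func):
    """Hypothetical stand-in for reduceByKey: merge the values of each key."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return {k: reduce(func, vs) for k, vs in grouped.items()}


# PySpark form: rdd.reduceByKey(lambda x, y: x + y)
sums = reduce_by_key([(k, int(t)) for k, t in keyed_tokens], lambda x, y: x + y)
```

Note that map and mapValues return as many records as they receive, while flatMap and flatMapValues can return more or fewer.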
Remark
With key/value data, prefer mapValues() / flatMapValues() over map() / flatMap() when the transformation does not change the keys; this is more efficient.
About partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), map "forgets" that partitioner because the keys might have changed, so the result no longer carries a partitioner and later key-based operations may shuffle the data again. mapValues, however, preserves any partitioner set on the RDD.
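Why map cannot safely keep a partitioner can be shown with a toy model. This is a plain-Python sketch under stated assumptions, not Spark's implementation: partition_of is a hypothetical stand-in for a hash partitioner (integer keys modulo 3), and partitions are modeled as plain lists.

```python
NUM_PARTITIONS = 3


def partition_of(key):
    # Hypothetical partitioner: integer keys keep the example deterministic.
    return key % NUM_PARTITIONS


def partition_by(pairs):
    # Place every (key, value) record in the partition its key maps to.
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for k, v in pairs:
        parts[partition_of(k)].append((k, v))
    return parts


def is_consistent(parts):
    # True when every record sits in the partition its key maps to.
    return all(
        partition_of(k) == i for i, part in enumerate(parts) for k, _ in part
    )


parts = partition_by([(k, k * 10) for k in range(6)])

# mapValues-style step: keys are untouched, so the layout stays valid.
after_map_values = [[(k, v + 1) for k, v in part] for part in parts]

# map-style step: the function may change keys (here k -> k + 1),
# but records are not moved, so the old layout can no longer be trusted.
after_map = [[(k + 1, v) for k, v in part] for part in parts]

print(is_consistent(after_map_values))  # True
print(is_consistent(after_map))         # False
```

Since Spark cannot know whether the function passed to map changes the keys, dropping the partitioner is the safe choice.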
Creating RDD with collection
parallelize
The optional 2nd parameter sets the number of partitions.
makeRDD - Scala API only; for a plain collection it behaves the same as parallelize.
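To get a feel for what the number of partitions means, here is a rough plain-Python sketch that cuts a local collection into contiguous, near-equal slices. The helper split_into_slices is hypothetical, and the exact layout Spark produces may differ; in PySpark the corresponding calls would be sc.parallelize(range(10), 3) followed by rdd.glom().collect() to inspect the partitions.

```python
def split_into_slices(data, num_slices):
    """Hypothetical helper: cut data into num_slices contiguous chunks."""
    n = len(data)
    return [
        data[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


# Ten elements spread over three partitions.
slices = split_into_slices(list(range(10)), 3)
print(slices)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Each slice would be processed by its own task, so the number of partitions bounds how much of the work can run in parallel.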
Shared Variables
Broadcast Variables
A read-only variable cached on each machine, rather than shipped with every task.
For example, a broadcast variable can give every node a copy of a large input dataset in an efficient manner.
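A typical use is a lookup table that every task reads but never modifies. The sketch below models that pattern in plain Python, with MappingProxyType standing in for the read-only guarantee; the table contents and the expand function are made up for illustration. In PySpark the table would be created with sc.broadcast(...) and read inside tasks through its .value attribute.

```python
from types import MappingProxyType

# Read-only lookup table shared by all tasks.
# PySpark form: bc = sc.broadcast({"us": "United States", "fr": "France"})
country_names = MappingProxyType({"us": "United States", "fr": "France"})


def expand(record, lookup):
    # In PySpark the task would read the table through bc.value.
    code, amount = record
    return (lookup.get(code, "unknown"), amount)


# Two partitions, each handled by its own "task".
partitions = [[("us", 10), ("fr", 5)], [("de", 7), ("us", 1)]]

# Every task reuses the same table instead of receiving its own copy per record.
result = [[expand(r, country_names) for r in part] for part in partitions]
```

This replaces a join against a small dataset with a local dictionary lookup in each task.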
Accumulators
Variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel.
Methods supported: .add() to update the accumulator from a task, and .value to read the result on the driver.
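The associative and commutative requirement is what lets partial results be merged in any order. A small plain-Python sketch of the idea, with made-up data: each partition produces its own partial count, and every merge order gives the same total. In PySpark the counter would be created with sc.accumulator(0), updated in tasks with .add(1), and read on the driver through .value.

```python
from functools import reduce
from itertools import permutations

# Three partitions of illustrative data.
partitions = [[3, -1, 4, -7], [5], [9, -2, 6]]

# Each task only adds to its own partial count (negative values here).
partials = [sum(1 for x in part if x < 0) for part in partitions]

# Merge the partial counts in every possible order and collect the totals.
totals = {
    reduce(lambda a, b: a + b, order, 0) for order in permutations(partials)
}

print(partials)  # [2, 0, 1]
print(totals)    # {3}
```

An operation such as subtraction would not qualify, because the total would then depend on the order in which tasks finish.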