ShuffleDependency

The ShuffleDependency defines how an RDD is partitioned and how the output in each partition is organized.

If an RDD has a ShuffleDependency, a shuffle is required between it and its upstream (parent) RDD. The RDD definition itself includes the information on how to retrieve the parent RDD's partitions to construct its own; ShuffledRowRDD is one example.

keyOrdering: Orders records by partition first and then by key. SortShuffleWriter uses it to sort outgoing records on the map side, and BlockStoreShuffleReader uses it to sort incoming records on the reduce side.
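The "partition, then key" ordering can be illustrated with a minimal pure-Python sketch. It is not Spark code: sort_for_shuffle and modulo_partition are hypothetical names, and integer keys with a modulo partition function are assumed only to keep the example deterministic.

```python
def modulo_partition(key, num_partitions):
    """Hypothetical partition function for integer keys (an assumption for this sketch)."""
    return key % num_partitions


def sort_for_shuffle(records, num_partitions, partition_of=modulo_partition):
    """Sort (key, value) records by partition id first, then by key.

    All records of one partition end up contiguous, and keys are
    ordered within each partition.
    """
    return sorted(
        records,
        key=lambda kv: (partition_of(kv[0], num_partitions), kv[0]),
    )


records = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c")]
ordered = sort_for_shuffle(records, num_partitions=2)
```

With two partitions, the even keys (partition 0) come first in key order, followed by the odd keys (partition 1) in key order.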

partitioner: Decides which partition each new record is written to.
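As a rough illustration of what a partitioner does, the sketch below assigns each key to a partition by hashing it and taking the result modulo the partition count. This is a pure-Python model, not Spark's implementation; stable_hash, get_partition, and partition_records are hypothetical names, and CRC32 is an assumed stand-in for a key's hash code.

```python
import zlib


def stable_hash(key):
    """Deterministic stand-in for a hash code (assumption: CRC32 of repr(key)).

    Python's built-in hash() for strings is salted per process, so it is
    not used here.
    """
    return zlib.crc32(repr(key).encode("utf-8"))


def get_partition(key, num_partitions):
    """Map a key to a partition id in the range [0, num_partitions)."""
    return stable_hash(key) % num_partitions


def partition_records(records, num_partitions):
    """Distribute (key, value) records into one bucket per partition."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[get_partition(key, num_partitions)].append((key, value))
    return buckets


buckets = partition_records([("a", 1), ("b", 2), ("a", 3), ("c", 4)], 3)
```

The property that matters for a shuffle is that every record with the same key lands in the same partition, so a single reduce task sees all values for that key.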

serializer: The serializer used for the shuffle data. It can be determined by SerializerManager at construction.

aggregator: Used for aggregation during the shuffle phase. It consists of three functions: createCombiner initializes the combiner data structure [C] from a new record [V], mergeValue merges a new record [V] into an existing combiner [C], and mergeCombiners merges two combiners [C, C]. SortShuffleWriter performs the actual aggregation (in practice, ExternalSorter does the work).
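The three aggregator functions can be sketched in plain Python as follows. This is a simplified model rather than Spark's Aggregator class: the snake_case names and the two helper functions are assumptions for illustration, and the example computes a per-key (sum, count) pair.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Aggregator:
    """Simplified model of the three aggregation functions."""
    create_combiner: Callable[[Any], Any]        # V -> C
    merge_value: Callable[[Any, Any], Any]       # (C, V) -> C
    merge_combiners: Callable[[Any, Any], Any]   # (C, C) -> C


def combine_values_by_key(records, agg):
    """Fold raw (key, value) records into one combiner per key."""
    combiners = {}
    for key, value in records:
        if key in combiners:
            combiners[key] = agg.merge_value(combiners[key], value)
        else:
            combiners[key] = agg.create_combiner(value)
    return combiners


def combine_combiners_by_key(partials, agg):
    """Merge per-key combiners produced by several map outputs."""
    merged = {}
    for partial in partials:
        for key, combiner in partial.items():
            if key in merged:
                merged[key] = agg.merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged


# Example aggregator: track (sum, count) per key.
sum_count = Aggregator(
    create_combiner=lambda v: (v, 1),
    merge_value=lambda c, v: (c[0] + v, c[1] + 1),
    merge_combiners=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)

map_output_1 = combine_values_by_key([("a", 1), ("b", 2), ("a", 3)], sum_count)
map_output_2 = combine_values_by_key([("a", 5), ("b", 4)], sum_count)
result = combine_combiners_by_key([map_output_1, map_output_2], sum_count)
```

Here create_combiner and merge_value run while a single task consumes raw records, and merge_combiners joins partial results that were computed independently.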

mapSideCombine: Whether the map side needs to perform the combine. Even if the dependency has an aggregator, this value may still be false, in which case the map-side combine is skipped.
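The effect of this flag can be illustrated with a small pure-Python sketch that sums values per key. It is a toy model under stated assumptions, not Spark code: map_output and reduce_side are hypothetical names, and summation stands in for whatever the aggregator would do.

```python
def map_output(records, map_side_combine):
    """Return the records one map task would hand to the shuffle.

    With map_side_combine=True, values are pre-summed per key, so at most
    one record per key leaves the task. With False, records pass through
    unchanged and all combining is left to the reduce side.
    """
    if not map_side_combine:
        return list(records)
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


def reduce_side(map_outputs):
    """Sum values per key across the outputs of all map tasks."""
    totals = {}
    for output in map_outputs:
        for key, value in output:
            totals[key] = totals.get(key, 0) + value
    return totals


task_1 = [("a", 1), ("a", 1), ("b", 1)]
task_2 = [("a", 1), ("b", 1), ("b", 1)]

with_combine = [map_output(t, True) for t in (task_1, task_2)]
without_combine = [map_output(t, False) for t in (task_1, task_2)]
```

Both settings yield the same final totals; the difference is how many records cross the shuffle boundary, which is four with the combine and six without it in this example.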
