Correct Answer: A
Explanation
A shuffle operation returns 200 partitions if not explicitly set.
Correct. 200 is the default value for the Spark property spark.sql.shuffle.partitions. This property determines how many partitions Spark uses when shuffling data for joins or aggregations.
The coalesce() method should be used to increase the number of partitions.
Incorrect. The coalesce() method can only be used to decrease the number of partitions.
Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No. For narrow transformations, fewer partitions usually result in a longer overall runtime, if more executors are available than partitions.
A narrow transformation does not include a shuffle, so no data needs to be exchanged between executors.
Shuffles are expensive and can be a bottleneck for executing Spark workloads.
Narrow transformations, however, are executed on a per-partition basis, blocking one executor per partition.
So, it matters how many executors are available to perform work in parallel relative to the number of partitions. If the number of executors is greater than the number of partitions, then some executors are idle while others process the partitions. On the flip side, if the number of executors is smaller than the number of partitions, the entire operation can only be finished after some executors have processed multiple partitions, one after the other. To minimize the overall runtime, one would want to have the number of partitions equal to the number of executors (but not more).
So, for the scenario at hand, increasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
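The reasoning above can be checked with a small back-of-the-envelope model. This is only a sketch in plain Python, not Spark code: the function name estimated_runtime and the numbers are made up for illustration, and it assumes that the work is split evenly over the partitions, that each executor processes one partition at a time, and that there is no scheduling overhead.

```python
import math


def estimated_runtime(num_partitions, num_executors, total_work_seconds):
    """Idealized runtime of a narrow transformation (hypothetical model).

    Assumptions: work is split evenly over the partitions, each executor
    processes one partition at a time, and scheduling overhead is ignored.
    """
    # Time one executor needs to process a single partition.
    seconds_per_partition = total_work_seconds / num_partitions
    # Number of rounds needed until every partition has been processed.
    waves = math.ceil(num_partitions / num_executors)
    return waves * seconds_per_partition


# 8 executors and 120 seconds of total work, with varying partition counts.
for partitions in (4, 8, 12):
    print(partitions, estimated_runtime(partitions, 8, 120.0))
```

In this model, 4 partitions on 8 executors leave half of the executors idle and take 30 seconds, while 8 partitions take 15 seconds. With 12 partitions, a second round is needed for the remaining 4 partitions, so the estimate rises to 20 seconds.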
No data is exchanged between executors when coalesce() is run.
No. While coalesce() avoids a full shuffle, it may still cause a partial shuffle, resulting in data exchange between executors.
Short partition processing times are indicative of low skew.
Incorrect. Data skew means that data is distributed unevenly over the partitions of a dataset. Low skew therefore means that data is distributed evenly.
Partition processing time, the time that executors take to process partitions, can be indicative of skew if some executors take a long time to process a partition, but others do not. However, a short processing time is not per se indicative of low skew: it may simply be short because the partition is small.
A situation indicative of low skew may be when all executors finish processing their partitions in the same timeframe. High skew may be indicated by some executors taking much longer to finish their partitions than others. But the answer does not make any comparison, so by itself it does not provide enough information to make any assessment about skew.
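To make the point about comparison concrete, here is a minimal sketch in plain Python. The skew_ratio helper is a hypothetical heuristic for illustration, not a metric that Spark reports, and the processing times are invented.

```python
from statistics import median


def skew_ratio(partition_times):
    """Slowest partition time divided by the median partition time.

    Hypothetical heuristic: a value close to 1 suggests evenly sized
    partitions, a large value suggests that a few partitions dominate.
    """
    return max(partition_times) / median(partition_times)


# Invented per-partition processing times in seconds.
even_and_fast = [2.0, 2.0, 2.0, 2.0]
even_and_slow = [60.0, 60.0, 60.0, 60.0]
skewed = [2.0, 2.0, 2.0, 40.0]

for times in (even_and_fast, even_and_slow, skewed):
    print(times, skew_ratio(times))
```

The fast and the slow evenly distributed cases both give a ratio of 1.0, while the skewed case gives 20.0 even though most of its partitions finish quickly. Only the relative comparison between partitions says something about skew, not the absolute processing time.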
More info: "Spark Repartition & Coalesce - Explained" and "Performance Tuning - Spark 3.1.2 Documentation"