Databricks-Certified-Professional-Data-Engineer exam question and answer, shared by EduDump.com.
A data engineer is attempting to execute the following PySpark code:

    from pyspark.sql.functions import sum  # Spark's column aggregate, not Python's built-in sum

    df = spark.read.table("sales")
    result = df.groupBy("region").agg(sum("revenue"))

However, upon inspecting the execution plan and profiling the Spark job, they observe excessive data shuffling during the aggregation phase. Which technique should be applied to reduce shuffling during the groupBy aggregation?
Correct Answer: B
Explanation: According to the Databricks documentation, a shuffle occurs when Spark redistributes data across partitions for grouping or joining. Repartitioning by the grouping key (region) places all rows with the same key in the same partition, so the aggregation that follows runs on data that is already co-located and does not need a further exchange. Note that repartition() is itself a shuffle, so the gain is largest when the repartitioned DataFrame is reused for several operations keyed on region. Caching speeds up reuse of a DataFrame but does not reduce shuffle volume. coalesce() only merges existing partitions to lower their count; it does not co-locate rows by key, so it cannot prevent the shuffle. Broadcast joins apply to joins against a small table, not to single-table aggregations. Of the techniques considered here, explicit repartitioning by the grouping column is therefore the one that targets shuffle in this aggregation.
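The co-location argument can be illustrated without a cluster. The following is a minimal, Spark-free sketch in plain Python; the helper names (hash_partition, aggregate_partition), the sample rows, and the use of CRC32 as the hash are all assumptions made for illustration and are not Spark's actual implementation. It shows that once rows are hash-partitioned by region, each partition can be summed independently and the per-partition results are already final.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Place each row in a partition chosen by hashing its key value.

    Hypothetical stand-in for partitioning by a column; CRC32 is used only
    to get a deterministic hash for this sketch.
    """
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        index = zlib.crc32(row[key].encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def aggregate_partition(partition, key, value):
    """Sum `value` per `key` using only the rows inside one partition."""
    totals = defaultdict(int)
    for row in partition:
        totals[row[key]] += row[value]
    return dict(totals)


# Made-up sample data standing in for the "sales" table.
rows = [
    {"region": "east", "revenue": 10},
    {"region": "west", "revenue": 20},
    {"region": "east", "revenue": 30},
    {"region": "north", "revenue": 5},
    {"region": "west", "revenue": 5},
]

partitions = hash_partition(rows, "region", num_partitions=3)
per_partition = [aggregate_partition(p, "region", "revenue") for p in partitions]

# Every region lives in exactly one partition, so merging the per-partition
# totals never has to combine two partial sums for the same key.
result = {}
for totals in per_partition:
    result.update(totals)

print(result)
```

In PySpark the corresponding step would be calling df.repartition("region") before the groupBy; comparing the output of explain() with and without that call is a reasonable way to check whether the aggregation still introduces its own exchange.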