Databricks-Certified-Professional-Data-Engineer Exam Dumps


Question 21/82

A data engineer is attempting to execute the following PySpark code:
from pyspark.sql.functions import sum
df = spark.read.table("sales")
result = df.groupBy("region").agg(sum("revenue"))
However, upon inspecting the execution plan and profiling the Spark job, they observe excessive data shuffling during the aggregation phase.
Which technique should be applied to reduce shuffling during the groupBy aggregation operation?

A. Cache the DataFrame df.

B. Repartition by region before aggregation.

C. Use coalesce() after the aggregation.

D. Use a broadcast join.
