A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto Optimize and Auto Compaction cannot be used. Which strategy will yield the best performance without shuffling data?
Correct Answer: B
The key to efficiently converting a large JSON dataset to Parquet files of a target size without shuffling data lies in controlling the partition size at read time, which carries through to the output files when no shuffle occurs.
* Setting spark.sql.files.maxPartitionBytes to 512 MB configures Spark to read the source data in chunks of roughly 512 MB per partition. This setting directly influences the size of the part-files in the output, aligning with the target file size.
* Narrow transformations (which do not move data across partitions) can then be applied without changing that partitioning.
* Writing the data out to Parquet produces part-files that are approximately the size specified by spark.sql.files.maxPartitionBytes, in this case 512 MB.
* The other options either introduce unnecessary shuffles or repartitions, or rely on a setting that does not control output file size for this requirement.
References:
* Apache Spark Documentation: Configuration - spark.sql.files.maxPartitionBytes
* Databricks Documentation: Data Sources Guide
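The approach can be illustrated with a minimal PySpark sketch. The paths, column names, and application name below are placeholders invented for illustration; the only setting that matters for the question is spark.sql.files.maxPartitionBytes, and the transformations are kept narrow so the read-time partitioning carries through to the Parquet part-files.

    from pyspark.sql import SparkSession

    # Configure the maximum bytes packed into a single read partition
    # (512 MB) before reading the source data.
    spark = (
        SparkSession.builder
        .appName("json-to-parquet")  # placeholder app name
        .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
        .getOrCreate()
    )

    # Read the large JSON dataset (path is a placeholder).
    df = spark.read.json("/mnt/raw/events_json/")

    # Apply only narrow transformations (no shuffle), e.g. column
    # selection and filtering; column names are hypothetical.
    cleaned = df.select("id", "event_type", "timestamp").filter("event_type IS NOT NULL")

    # Write straight to Parquet; the part-file layout follows the
    # read-time partitioning because no wide transformation has
    # repartitioned the data.
    cleaned.write.mode("overwrite").parquet("/mnt/curated/events_parquet/")

By contrast, inserting a repartition() before the write would also control the number of output files, but it triggers a full shuffle, which is exactly what the question rules out.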