Valid Professional-Machine-Learning-Engineer Dumps shared by ExamDiscuss.com for Helping Passing Professional-Machine-Learning-Engineer Exam! ExamDiscuss.com now offer the newest Professional-Machine-Learning-Engineer exam dumps, the ExamDiscuss.com Professional-Machine-Learning-Engineer exam questions have been updated and answers have been corrected get the newest ExamDiscuss.com Professional-Machine-Learning-Engineer dumps with Test Engine here:
You are experimenting with a built-in distributed XGBoost model in Vertex AI Workbench user-managed notebooks. You use BigQuery to split your data into training and validation sets using the following queries: CREATE OR REPLACE TABLE 'myproject.mydataset.training' AS (SELECT * FROM 'myproject.mydataset.mytable' WHERE RAND() <= 0.8); CREATE OR REPLACE TABLE 'myproject.mydataset.validation' AS (SELECT * FROM 'myproject.mydataset.mytable' WHERE RAND() <= 0.2); After training the model, you achieve an area under the receiver operating characteristic curve (AUC ROC) value of 0.8, but after deploying the model to production, you notice that your model performance has dropped to an AUC ROC value of 0.65. What problem is most likely occurring?
Correct Answer: C
The most likely problem is that the tables that you created to hold your training and validation records share some records, and you may not be using all the data in your initial table. This is because the RAND() function generates a random number between 0 and 1 for each row, and the probability of a row being in both the training and validation tables is 0.2 * 0.8 = 0.16, which is not negligible. This means that some of the records that you use to validate your model are also used to train your model, which can lead to overfitting and poor generalization. Moreover, the probability of a row being in neither the training nor the validation table is 0.2 * 0.2 = 0.04, which means that you are wasting some of the data in your initial table and reducing the size of your datasets. A better way to split your data into training and validation sets is to use a hash function on a unique identifier column, such as the following queries: CREATE OR REPLACE TABLE 'myproject.mydataset.training' AS (SELECT * FROM 'myproject.mydataset.mytable' WHERE MOD(FARM_FINGERPRINT(id), 10) < 8); CREATE OR REPLACE TABLE 'myproject.mydataset.validation' AS (SELECT * FROM 'myproject.mydataset.mytable' WHERE MOD(FARM_FINGERPRINT(id), 10) >= 8); This way, you can ensure that each row has a fixed 80% chance of being in the training table and a 20% chance of being in the validation table, without any overlap or omission. References: * Professional ML Engineer Exam Guide * Preparing for Google Cloud Certification: Machine Learning Engineer Professional Certificate * Google Cloud launches machine learning engineer certification * BigQuery ML: Splitting data for training and testing * BigQuery: FARM_FINGERPRINT function