Databricks Databricks-Certified-Professional-Data-Engineer Exam Dumps [2023] Practice Valid Exam Dumps Question
NEW QUESTION # 10
A table is registered with the following code:
Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?
- A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
- B. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
- C. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.
- D. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.
- E. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
Answer: E
Explanation:
This is the correct answer because Delta Lake supports time travel, which allows users to query data as of a specific version or timestamp. The code uses the VERSION AS OF syntax to specify the version of each source table to be used in the join. The result of querying recent_orders will be the same as joining those versions of the source tables at query time. The query will use snapshot isolation, which means it will use a consistent snapshot of the table at the time the query began, regardless of any concurrent updates or deletes.
Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Query an older snapshot of a table (time travel)" section.
NEW QUESTION # 11
Which of the following developer operations in CI/CD flow can be implemented in Databricks Repos?
- A. Trigger Databricks CICD pipeline
- B. Delete branch
- C. Create a pull request
- D. Commit and push code
- E. Approve the pull request
Answer: D
Explanation:
The answer is Commit and push code.
The diagram below shows the roles that Databricks Repos and the Git provider play when building a CI/CD workflow.
All the steps highlighted in yellow can be done in Databricks Repos; all the steps highlighted in gray are done in a Git provider such as GitHub or Azure DevOps.
Exam focus: Study the diagram carefully to understand which steps of the CI/CD flow are implemented in Databricks Repos and which in the Git provider; the exam may ask different types of questions based on this flow.
NEW QUESTION # 12
You have noticed that Databricks SQL queries are running slowly, and you are asked to find out why and identify steps to improve performance. When you looked into the issue, you noticed that all the queries are running in parallel and using a SQL endpoint(SQL Warehouse) with a single cluster. Which of the following steps can be taken to improve the performance/response times of the queries?
*Please note Databricks recently renamed SQL endpoint to SQL warehouse.
- A. They can increase the warehouse size from 2X-Small to 4X-Large of the SQL endpoint(SQL warehouse).
- B. They can turn on the Serverless feature for the SQL endpoint(SQL warehouse).
- C. They can increase the maximum bound of the SQL endpoint(SQL warehouse)'s scaling range
- D. They can turn on the Serverless feature for the SQL endpoint(SQL warehouse) and change the Spot Instance Policy to "Reliability Optimized."
- E. They can turn on the Auto Stop feature for the SQL endpoint(SQL warehouse).
Answer: C
Explanation:
The answer is: They can increase the maximum bound of the SQL endpoint's scaling range. When you increase the maximum of the scaling range, more clusters can be added, so queries can start running on available clusters instead of waiting in the queue. See below for more explanation.
The question tests whether you know how to scale a SQL endpoint (SQL warehouse). Look for cue words that tell you whether the queries are running sequentially or concurrently. If the queries are running sequentially, scale up (increase the cluster size, for example from 2X-Small to 4X-Large); if the queries are running concurrently or there are more users, scale out (add more clusters).
SQL Endpoint (SQL Warehouse) overview (please read all of the points below and review the diagram):
1. A SQL warehouse must have at least one cluster.
2. A cluster comprises one driver node and one or more worker nodes.
3. The number of worker nodes in a cluster is determined by the size of the cluster (2X-Small -> 1 worker, X-Small -> 2 workers, ... up to 4X-Large -> 256 workers). Increasing the size is called scale up.
4. A single cluster, irrespective of cluster size (2X-Small to 4X-Large), can run only 10 queries at any given time. If a user submits 20 queries all at once to a warehouse with 3X-Large cluster size and cluster scaling (min 1, max 1), 10 queries start running while the remaining 10 wait in a queue for those 10 to finish.
5. Increasing the warehouse cluster size can improve the performance of a query. For example, if a query runs for 1 minute on a 2X-Small warehouse, it may run in 30 seconds on an X-Small warehouse. This is because 2X-Small has 1 worker node and X-Small has 2 worker nodes, so more of the query's tasks run in parallel and it finishes faster. (Note: this is an ideal-case example; how query performance scales depends on many factors and is not always linear.)
6. A warehouse can have more than one cluster; this is called scale out. If a warehouse is configured with X-Small cluster size and cluster scaling (min 1, max 2), Databricks spins up an additional cluster when it detects queries waiting in the queue. For example, if a user submits 20 queries, 10 queries start running and the rest are held in the queue; Databricks automatically starts the second cluster and redirects the 10 queued queries to it.
7. A single query will not span more than one cluster. Once a query is submitted to a cluster, it remains on that cluster until execution finishes, irrespective of how many clusters are available to scale.
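The queueing behaviour in points 4 and 6 can be sketched in a few lines of plain Python. This is only a toy model: the function name `schedule` and the constant are made up for illustration, the 10-queries-per-cluster limit is taken from point 4, and cluster start-up time is ignored.

```python
QUERIES_PER_CLUSTER = 10  # per-cluster concurrency limit quoted in point 4


def schedule(submitted, min_clusters, max_clusters):
    """Return (clusters_used, running, queued) for a burst of queries."""
    # Clusters needed to run everything at once (ceiling division).
    needed = -(-submitted // QUERIES_PER_CLUSTER)
    # Autoscaling stays inside the configured [min, max] range.
    clusters = max(min_clusters, min(needed, max_clusters))
    running = min(submitted, clusters * QUERIES_PER_CLUSTER)
    return clusters, running, submitted - running


print(schedule(20, 1, 1))  # (1, 10, 10): half of the queries wait in the queue
print(schedule(20, 1, 2))  # (2, 20, 0): the second cluster absorbs the queue
```

Raising the maximum of the scaling range changes the second and third numbers; changing the cluster size does not, which is why scale-out is the fix for concurrent workloads.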
Please review the below diagram to understand the above concepts:
A SQL endpoint (SQL warehouse) scales horizontally (scale-out) and vertically (scale-up); you have to understand when to use which.
Scale-out -> add more clusters to a SQL endpoint by changing the maximum number of clusters. If you are trying to improve throughput, that is, to run as many queries as possible at once, additional clusters will improve performance.
Databricks SQL scales out automatically as soon as it detects queries in the queued state. In this example, scaling is set to min 1 and max 3, which means the warehouse can grow to a total of three clusters if it detects queries are waiting.
During warehouse creation or afterwards, you can change the warehouse size (2X-Small to 4X-Large) to improve query performance, and the maximum of the scaling range to add more clusters to a SQL endpoint (SQL warehouse) for scale-out. If you are changing an existing warehouse, you may have to restart it for the changes to take effect.
How do you know how many clusters you need(How to set Max cluster size)?
When you click on an existing warehouse and select the Monitoring tab, you can see warehouse utilization information (see below). Two graphs provide important information on how the warehouse is being utilized; if you see queries being queued, your warehouse can benefit from additional clusters. Please review the additional DBU cost associated with adding clusters so you can make a well-balanced decision between cost and performance.
NEW QUESTION # 13
The data engineering team uses a SQL query to review data completeness every day to monitor the ETL job, and the query output is used in multiple dashboards. Which of the following approaches can be used to set up a schedule and automate this process?
- A. They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL
- B. They can schedule the query to refresh every day from the query's page in Databricks SQL
- C. They can schedule the query to refresh every day from the SQL endpoint's page in Databricks SQL.
- D. They can schedule the query to run every 12 hours from the Jobs UI.
- E. They can schedule the query to run every day from the Jobs UI.
Answer: B
Explanation:
The answer is: They can schedule the query to refresh every day from the query's page in Databricks SQL. The query pane view in the Databricks SQL workspace provides the ability to add, edit, and schedule individual queries to run.
You can use scheduled query executions to keep your dashboards updated or to enable routine alerts. By default, your queries do not have a schedule.
Note
If your query is used by an alert, the alert runs on its own refresh schedule and does not use the query schedule.
To set the schedule:
1. Click the query info tab.
2. Click the link to the right of Refresh Schedule to open a picker with schedule intervals.
3. Set the schedule. The picker scrolls and allows you to choose:
   * An interval: 1-30 minutes, 1-12 hours, 1 or 30 days, 1 or 2 weeks
   * A time. The time selector displays in the picker only when the interval is greater than 1 day, and the day selection only when the interval is greater than 1 week. When you schedule a specific time, Databricks SQL takes input in your computer's timezone and converts it to UTC. If you want a query to run at a certain time in UTC, you must adjust the picker by your local offset. For example, if you want a query to execute at 00:00 UTC each day, but your current timezone is PDT (UTC-7), you should select 17:00 in the picker.
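The offset arithmetic in the PDT example can be checked with the Python standard library. This is a small sketch, not part of Databricks SQL: the helper name `picker_time_for_utc` is hypothetical, and PDT is modelled as a fixed UTC-7 offset.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-7 offset standing in for PDT, as in the example above.
PDT = timezone(timedelta(hours=-7), name="PDT")


def picker_time_for_utc(hour_utc, minute_utc, local_tz):
    """Return the local wall-clock time ("HH:MM") that equals the given UTC time."""
    # The date is arbitrary; only the time of day matters here.
    utc_moment = datetime(2023, 1, 1, hour_utc, minute_utc, tzinfo=timezone.utc)
    return utc_moment.astimezone(local_tz).strftime("%H:%M")


print(picker_time_for_utc(0, 0, PDT))  # 17:00
```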
NEW QUESTION # 14
When writing streaming data, which of the following output modes does Spark Structured Streaming support?
- A. Append, overwrite, Continuous
- B. Complete, Incremental, Update
- C. Append, Delta, Complete
- D. Delta, Complete, Continuous
- E. Append, Complete, Update
Answer: E
Explanation:
The answer is Append, Complete, Update
* Append mode (default) - Only the new rows added to the Result Table since the last trigger are output to the sink. This is supported only for queries where rows added to the Result Table are never going to change. Hence, this mode guarantees that each row will be output only once (assuming a fault-tolerant sink). For example, queries with only select, where, map, flatMap, filter, join, etc. support Append mode.
* Complete mode - The whole Result Table is output to the sink after every trigger. This is supported for aggregation queries.
* Update mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger are output to the sink. More information to be added in future releases.
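The difference between the three modes can be illustrated with a toy model in plain Python. This is not Spark code: the function `emit` and the dictionaries are invented for illustration, and the Result Table is modelled as a simple key-to-value mapping before and after one trigger.

```python
def emit(previous, current, mode):
    """Rows written to the sink after one trigger, for a given output mode.

    previous / current map a row key to its value in the Result Table
    before and after the trigger.
    """
    if mode == "complete":
        # Whole Result Table on every trigger.
        return dict(current)
    if mode == "update":
        # Rows that are new or whose value changed since the last trigger.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Only rows added since the last trigger; existing rows never change.
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown output mode: {mode}")


before = {"a": 1, "b": 2}
after = {"a": 1, "b": 5, "c": 3}

print(emit(before, after, "complete"))  # {'a': 1, 'b': 5, 'c': 3}
print(emit(before, after, "update"))    # {'b': 5, 'c': 3}
print(emit(before, after, "append"))    # {'c': 3}
```

In real Structured Streaming, Append mode would not accept a query whose existing rows change (as row "b" does here); the model only shows which rows each mode would select.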
NEW QUESTION # 15
The data engineering team has provided 10 queries and asked the data analyst team to build a dashboard and refresh the data every day at 8 AM. Identify the best approach to set up the data refresh for this dashboard.
- A. Set up a job with linear dependency to load all 10 queries into a table so the dashboard can be refreshed at once.
- B. Each query requires a separate task; set up 10 tasks under a single job to run at 8 AM to refresh the dashboard.
- C. The entire dashboard with 10 queries can be refreshed at once; a single schedule needs to be set up to refresh at 8 AM.
- D. A dashboard can only refresh one query at a time, so 10 schedules are needed to set up the refresh.
- E. Use Incremental refresh to run at 8 AM every day.
Answer: C
Explanation:
The answer is: The entire dashboard with 10 queries can be refreshed at once; a single schedule needs to be set up to refresh at 8 AM.
Automatically refresh a dashboard
A dashboard's owner and users with the Can Edit permission can configure a dashboard to automatically refresh on a schedule. To automatically refresh a dashboard:
1. Click the Schedule button at the top right of the dashboard. The scheduling dialog appears.
2. In the Refresh every drop-down, select a period.
3. In the SQL Warehouse drop-down, optionally select a SQL warehouse to use for all the queries. If you don't select a warehouse, the queries execute on the last used SQL warehouse.
4. Next to Subscribers, optionally enter a list of email addresses to notify when the dashboard is automatically updated. Each email address you enter must be associated with an Azure Databricks account or configured as an alert destination.
5. Click Save. The Schedule button label changes to Scheduled.
NEW QUESTION # 16
A data engineer has developed a code block to perform a streaming read on a data source. The code block is below:
(spark
  .read
  .schema(schema)
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .load(dataSource)
)
The code block is returning an error.
Which of the following changes should be made to the code block to configure the block to successfully perform a streaming read?
- A. A new .stream line should be added after the .load(dataSource) line
- B. A new .stream line should be added after the .read line
- C. The .read line should be replaced with .readStream
- D. The .format("cloudFiles") line should be replaced with .format("stream")
- E. A new .stream line should be added after the spark line
Answer: C
NEW QUESTION # 17
The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.
What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?
- A. Can Read
- B. Can Run
- C. Can Edit
- D. No permissions
- E. Can Manage
Answer: A
Explanation:
Can Read is the correct answer because it is the maximum notebook permission that can be granted to the user without allowing accidental changes to production code or data. Notebook permissions control access to notebooks in Databricks workspaces. There are four notebook permission levels: Can Manage, Can Edit, Can Run, and Can Read. Can Manage allows full control over the notebook, including editing, running, deleting, exporting, and changing permissions. Can Edit allows modifying and running the notebook, but not changing permissions or deleting it. Can Run allows executing commands in an existing cluster attached to the notebook, but not modifying or exporting it. Can Read allows viewing the notebook content, but not running or modifying it. In this case, granting Can Read permission lets the user review the production logic in the notebook without being able to make any changes to it or run any commands that may affect production data. Verified References: [Databricks Certified Data Engineer Professional], under "Databricks Workspace" section; Databricks Documentation, under "Notebook permissions" section.
NEW QUESTION # 18
The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".
The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.
Which code block accomplishes this task while minimizing potential compute costs?
- A. preds.write.format("delta").save("/preds/churn_preds")
- B.

- C.

- D. preds.write.mode("append").saveAsTable("churn_preds")
- E.

Answer: B
Explanation:
This is the correct answer because it will save the predictions to a Delta Lake table with the ability to compare all predictions across time. The code uses the mergeInto method to perform an upsert operation, which means it will insert new records or update existing records based on the customer_id and date columns. This way, the table will always contain the latest predictions for each customer and date, and also keep the history of previous predictions. The code also uses a new job cluster to run the job, which will minimize the compute costs as it will be created and terminated for each run. Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Upsert into a table using merge" section.
NEW QUESTION # 19
Which statement describes Delta Lake Auto Compaction?
- A. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.
- B. Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
- C. Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.
- D. Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
- E. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.
Answer: E
Explanation:
This is the correct answer because it describes the behavior of Delta Lake Auto Compaction, which is a feature that automatically optimizes the layout of Delta Lake tables by coalescing small files into larger ones. Auto Compaction runs as an asynchronous job after a write to a table has succeeded and checks if files within a partition can be further compacted. If yes, it runs an optimize job with a default target file size of 128 MB.
Auto Compaction only compacts files that have not been compacted previously. Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Auto Compaction for Delta Lake on Databricks" section.
NEW QUESTION # 20
What is the output of the function below when executed with input parameters 1 and 3?
def check_input(x, y):
    if x < y:
        x = x + 1
        if x < y:
            x = x + 1
            if x < y:
                x = x + 1
    return x

check_input(1, 3)
- A. 0
- B. 3
- C. 1
- D. 2
- E. 3
Answer: B
NEW QUESTION # 21
The data analyst team had put together queries that identify items that are out of stock based on orders and replenishment, but when the queries run together to produce the final output, they take a really long time. You were asked to find out why the queries are running slowly and identify steps to improve performance. When you looked into it, you noticed that all the queries are running sequentially and using a SQL endpoint cluster. Which of the following steps can be taken to resolve the issue?
Here is the example query:
-- Get order summary
create or replace table orders_summary
as
select product_id, sum(order_count) order_count
from
  (
    select product_id, order_count from orders_instore
    union all
    select product_id, order_count from orders_online
  )
group by product_id;

-- Get supply summary
create or replace table supply_summary
as
select product_id, sum(supply_count) supply_count
from supply
group by product_id;

-- Get on hand based on orders summary and supply summary
with stock_cte
as (
  select nvl(s.product_id, o.product_id) as product_id,
         nvl(supply_count, 0) - nvl(order_count, 0) as on_hand
  from supply_summary s
  full outer join orders_summary o
    on s.product_id = o.product_id
)
select *
from stock_cte
where on_hand = 0;
- A. Increase the maximum bound of the SQL endpoint's scaling range.
- B. Turn on the Serverless feature for the SQL endpoint.
- C. Increase the cluster size of the SQL endpoint.
- D. Turn on the Auto Stop feature for the SQL endpoint.
- E. Turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."
Answer: C
Explanation:
The answer is to increase the cluster size of the SQL endpoint. Here the queries are running sequentially, and since a single query cannot span more than one cluster, adding more clusters won't speed it up; increasing the cluster size will improve performance because the query can use the additional compute in the warehouse.
In the exam, additional context will not be given; instead, look for cue words that tell you whether the queries are running sequentially or concurrently. If the queries are running sequentially, scale up (more nodes); if the queries are running concurrently (more users), scale out (more clusters).
Below is the snippet from Azure; as you can see, increasing the cluster size adds more worker nodes.
A SQL endpoint scales horizontally (scale-out) and vertically (scale-up); you have to understand when to use which.
Scale-up -> Increase the size of the cluster from X-Small to Small, to Medium, X-Large, ...
If you are trying to improve the performance of a single query, additional memory, nodes, and CPU in the cluster will improve it.
Scale-out -> Add more clusters by changing the maximum number of clusters.
If you are trying to improve throughput, that is, to run as many queries as possible at once, additional clusters will improve performance.
NEW QUESTION # 22
A junior data engineer has ingested a JSON file into a table raw_table with the following schema:
cart_id STRING,
items ARRAY<item_id:STRING>
The junior data engineer would like to unnest the items column in raw_table to result in a new table with the following schema:
cart_id STRING,
item_id STRING
Which of the following commands should the junior data engineer run to complete this task?
- A. SELECT cart_id, flatten(items) AS item_id FROM raw_table;
- B. SELECT cart_id, explode(items) AS item_id FROM raw_table;
- C. SELECT cart_id, reduce(items) AS item_id FROM raw_table;
- D. SELECT cart_id, slice(items) AS item_id FROM raw_table;
- E. SELECT cart_id, filter(items) AS item_id FROM raw_table;
Answer: B
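What explode does to an array column can be modelled in plain Python. This is only an illustration of the row-multiplying behaviour, not Spark code; the function name `explode_rows` and the sample data are made up.

```python
def explode_rows(rows):
    """Yield one (cart_id, item_id) pair per element of each row's items array."""
    for cart_id, items in rows:
        for item_id in items:
            yield cart_id, item_id


raw_table = [
    ("cart-1", ["apple", "pear"]),
    ("cart-2", ["milk"]),
    ("cart-3", []),  # empty array: produces no output rows, as with explode
]

print(list(explode_rows(raw_table)))
# [('cart-1', 'apple'), ('cart-1', 'pear'), ('cart-2', 'milk')]
```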
NEW QUESTION # 23
Which Python variable contains a list of directories to be searched when trying to locate required modules?
- A. importlib.resource path
- B. pypi.path
- C. os.path
- D. sys.path
- E. pylib.source
Answer: D
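A short sketch of how sys.path is typically inspected and extended. The directory "/tmp/my_modules" is a hypothetical path used only for illustration.

```python
import sys

# sys.path is an ordinary list of directory strings, searched in order on import.
for directory in sys.path:
    print(directory)

# Appending a directory makes modules stored there importable.
# "/tmp/my_modules" is a hypothetical path used only for illustration.
sys.path.append("/tmp/my_modules")
```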
NEW QUESTION # 24
The Delta Live Tables pipeline is configured to run in Development mode using the Triggered pipeline mode.
What is the expected outcome after clicking Start to update the pipeline?
- A. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional development and testing
- B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist after the pipeline is stopped to allow for additional development and testing
- C. All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline
- D. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped
- E. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated
Answer: A
Explanation:
The answer is: All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional development and testing.
A DLT pipeline supports two modes, Development and Production; you can switch between the two based on the stage of your development and deployment lifecycle.
Development and production modes
When you run your pipeline in development mode, the Delta Live Tables system:
* Reuses a cluster to avoid the overhead of restarts.
* Disables pipeline retries so you can immediately detect and fix errors.
In production mode, the Delta Live Tables system:
* Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
* Retries execution in the event of specific errors, for example, a failure to start a cluster.
Use the buttons in the Pipelines UI to switch between development and production modes. By default, pipelines run in development mode.
Switching between development and production modes only controls cluster and pipeline execution behavior.
Storage locations must be configured as part of pipeline settings and are not affected when switching between modes.
Please review additional DLT concepts using below link
https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-concepts.html#delta-live-tables-c
NEW QUESTION # 25
While investigating a data issue in a Delta table, you want to review logs to see when the table was updated and by whom. What is the best way to review this data?
- A. Check Databricks SQL Audit logs
- B. Review workspace audit logs
- C. Review event logs in the Workspace
- D. Run SQL SHOW HISTORY table_name
- E. Run SQL command DESCRIBE HISTORY table_name
Answer: E
Explanation:
The answer is: Run the SQL command DESCRIBE HISTORY table_name.
Here is sample output of DESCRIBE HISTORY table_name:
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|version|          timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|    operationMetrics|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|      5|2019-07-29 14:07:47|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          4|  Serializable|        false|[numTotalRows -> ...|
|      4|2019-07-29 14:07:41|  null|    null|   UPDATE|[predicate -> (id...|null|    null|     null|          3|  Serializable|        false|[numTotalRows -> ...|
|      3|2019-07-29 14:07:29|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          2|  Serializable|        false|[numTotalRows -> ...|
|      2|2019-07-29 14:06:56|  null|    null|   UPDATE|[predicate -> (id...|null|    null|     null|          1|  Serializable|        false|[numTotalRows -> ...|
|      1|2019-07-29 14:04:31|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          0|  Serializable|        false|[numTotalRows -> ...|
|      0|2019-07-29 14:01:40|  null|    null|    WRITE|[mode -> ErrorIfE...|null|    null|     null|       null|  Serializable|         true|[numFiles -> 2, n...|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
NEW QUESTION # 26
What is the top-level object in Unity Catalog?
- A. Database
- B. Catalog
- C. Table
- D. Workspace
- E. Metastore
Answer: E
Explanation:
The metastore is the top-level container in Unity Catalog; it holds catalogs, which hold schemas (databases), which in turn hold tables and views.
See: Key concepts - Azure Databricks | Microsoft Docs
NEW QUESTION # 27
Your team has hundreds of jobs running, but it is difficult to track the cost of each job run. You are asked to recommend how to monitor and track cost across various workloads.
- A. Use tags during job creation so cost can be easily tracked
- B. Use a single cluster for all the jobs, so cost can be easily tracked
- C. Create jobs in different workspaces, so we can track the cost easily
- D. Use job logs to monitor and track the costs
- E. Use workspace admin reporting
Answer: A
Explanation:
The answer is: Use tags during job creation so cost can be easily tracked. Review the link below for more details:
https://docs.databricks.com/administration-guide/account-settings/usage-detail-tags-aws.html
The diagram in the linked documentation shows how tags are propagated from pools to clusters and to clusters without pools.
NEW QUESTION # 28
How do you upgrade an existing workspace managed table to a Unity Catalog table?
- A. Create or replace table_name format = UNITY using deep clone old_table_name
- B. Create table table_name as select * from hive_metastore.old_schema.old_table
- C. ALTER TABLE table_name SET UNITY_CATALOG = TRUE
- D. Create table table_name format = UNITY as select * from old_table_name
- E. Create table catalog_name.schema_name.table_name as select * from hive_metastore.old_schema.old_table
Answer: E
Explanation:
The answer is: Create table catalog_name.schema_name.table_name as select * from hive_metastore.old_schema.old_table. Essentially, we are moving the data from the workspace's internal Hive metastore to a metastore and catalog registered in Unity Catalog.
Note: for a managed table, the data is copied to a different storage account; for large tables this can take a lot of time. For an external table the process is different.
Managed table: Upgrade a managed table to Unity Catalog
External table: Upgrade an external table to Unity Catalog
NEW QUESTION # 29
Which of the following python statements can be used to replace the schema name and table name in the query?
- A. table_name = "sales"
  query = "select * from {schema_name}.{table_name}"
- B. table_name = "sales"
  schema_name = "bronze"
  query = f"select * from schema_name.table_name"
- C. table_name = "sales"
  query = f"select * from + schema_name +"."+table_name"
- D. table_name = "sales"
  query = f"select * from {schema_name}.{table_name}"
Answer: D
Explanation:
The answer is:
table_name = "sales"
query = f"select * from {schema_name}.{table_name}"
f-strings are generally the clearest way to substitute Python variables into a string, compared with string concatenation.
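A runnable version of the answer is sketched below. The answer option does not define schema_name, so the value "bronze" is borrowed from option B as an assumption; the variable not_formatted is added only to show what happens without the f prefix.

```python
# schema_name is not defined in the answer option; "bronze" is borrowed
# from option B purely for illustration.
schema_name = "bronze"
table_name = "sales"

# The f prefix makes Python substitute the variables inside the braces.
query = f"select * from {schema_name}.{table_name}"
print(query)  # select * from bronze.sales

# Without the f prefix the braces are kept literally (option A's mistake).
not_formatted = "select * from {schema_name}.{table_name}"
print(not_formatted)  # select * from {schema_name}.{table_name}
```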
NEW QUESTION # 30
You are trying to create an object by joining two tables that is accessible to the data science team and does not get dropped if the cluster restarts or if the notebook is detached. What type of object are you trying to create?
- A. External view
- B. Global Temporary view with cache option
- C. View
- D. Global Temporary view
- E. Temporary view
Answer: C
Explanation:
The answer is View. A view can be used to join multiple tables, and it is persisted in the metastore so others can access it.
NEW QUESTION # 31
Which of the following is the correct statement for a session scoped temporary view?
- A. Temporary views stored in memory
- B. Temporary views are created in local_temp database
- C. Temporary views can be still accessed even if the notebook is detached and attached
- D. Temporary views can be still accessed even if cluster is restarted
- E. Temporary views are lost once the notebook is detached and re-attached
Answer: E
Explanation:
The answer is: Temporary views are lost once the notebook is detached and re-attached. There are two types of temporary views that can be created: session scoped and global.
* A local/session-scoped temporary view is only available within a Spark session, so another notebook on the same cluster cannot access it. If a notebook is detached and reattached, the local temporary view is lost.
* A global temporary view is available to all the notebooks on the cluster. If the cluster restarts, the global temporary view is lost.
NEW QUESTION # 32
You would like to build a Spark streaming process to read from a Kafka queue and write to a Delta table every 15 minutes. What is the correct trigger option?
- A. trigger(process "15 minutes")
- B. trigger("15 minutes")
- C. trigger(processingTime = "15 Minutes")
- D. trigger(processingTime = 15)
- E. trigger(15)
Answer: C
Explanation:
The answer is trigger(processingTime = "15 Minutes")
Triggers:
*Unspecified
This is the default. It is equivalent to using processingTime="500ms".
*Fixed-interval micro-batches: .trigger(processingTime="2 minutes")
The query is executed in micro-batches that are kicked off at the user-specified interval.
*One-time micro-batch: .trigger(once=True)
The query executes a single micro-batch to process all the available data and then stops on its own.
*Available-now micro-batches: .trigger(availableNow=True)
A newer, improved alternative to once=True. Databricks supports trigger(availableNow=True) in Databricks Runtime 10.2 and above for Delta Lake and Auto Loader sources. This option combines the batch processing approach of trigger once with the ability to configure batch size, resulting in multiple parallelized batches that give greater control for right-sizing batches and the resultant files.
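A minimal sketch of the fixed-interval trigger from the answer, assuming stream_df is a streaming DataFrame that has already been read from Kafka. The function name, the checkpoint path, and the target table name are hypothetical and supplied by the caller.

```python
def start_delta_stream(stream_df, checkpoint_path, target_table):
    """Start a micro-batch every 15 minutes, appending to a Delta table.

    `stream_df` is assumed to be a streaming DataFrame (for example one
    read from Kafka). The path and table name are supplied by the caller.
    """
    return (
        stream_df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        # Fixed-interval micro-batches; the interval is passed as a
        # string through the processingTime keyword argument.
        .trigger(processingTime="15 minutes")
        .toTable(target_table)
    )
```

Swapping the trigger line for .trigger(availableNow=True) would instead process everything currently available and then stop.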
NEW QUESTION # 33
You have been asked to build a data pipeline. You have noticed that the data source has many data quality issues, and you need to monitor and enforce data quality as part of the data ingestion process. Which of the following tools can be used to address this problem?
- A. UNITY Catalog and Data Governance
- B. JOBS and TASKS
- C. AUTO LOADER
- D. DELTA LIVE TABLES
- E. STRUCTURED STREAMING with MULTI HOP
Answer: D
Explanation:
The answer is DELTA LIVE TABLES.
Delta Live Tables expectations can be used to identify and quarantine bad data. All data quality metrics are stored in the event log, which can later be analyzed and monitored.
Delta Live Tables expectations
There are three types of expectations; pay attention to the differences between them.
Retain invalid records:
Use the expect operator when you want to keep records that violate the expectation. Records that violate the expectation are added to the target dataset along with valid records:
Python
@dlt.expect("valid timestamp", "timestamp > '2012-01-01'")
SQL
CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01')
Drop invalid records:
Use the expect_or_drop operator to prevent the processing of invalid records. Records that violate the expectation are dropped from the target dataset:
Python
@dlt.expect_or_drop("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")
SQL
CONSTRAINT valid_current_page EXPECT (current_page_id IS NOT NULL and current_page_title IS NOT NULL) ON VIOLATION DROP ROW
Fail on invalid records:
When invalid records are unacceptable, use the expect_or_fail operator to halt execution immediately when a record fails validation. If the operation is a table update, the system atomically rolls back the transaction:
Python
@dlt.expect_or_fail("valid_count", "count > 0")
SQL
CONSTRAINT valid_count EXPECT (count > 0) ON VIOLATION FAIL UPDATE
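To make the difference between the three policies concrete, here is a plain-Python sketch that mimics them on a list of records. It is not the Delta Live Tables API: apply_expectation, its on_violation values, and the sample rows are all invented for illustration.

```python
def apply_expectation(records, predicate, on_violation="retain"):
    """Mimic the three expectation policies on a list of dicts.

    on_violation: "retain" (like EXPECT), "drop" (like ON VIOLATION
    DROP ROW) or "fail" (like ON VIOLATION FAIL UPDATE).
    Returns (kept_records, number_of_violations).
    """
    if on_violation not in ("retain", "drop", "fail"):
        raise ValueError(f"unknown policy: {on_violation}")
    violations = [r for r in records if not predicate(r)]
    if on_violation == "fail" and violations:
        # Nothing is returned when any record fails validation.
        raise ValueError(f"{len(violations)} record(s) failed validation")
    if on_violation == "drop":
        kept = [r for r in records if predicate(r)]
    else:
        # "retain", or "fail" with no violations: every record is kept.
        kept = list(records)
    return kept, len(violations)


def is_valid(row):
    return row["count"] > 0


rows = [{"count": 3}, {"count": 0}, {"count": 7}]

print(apply_expectation(rows, is_valid, "retain"))
print(apply_expectation(rows, is_valid, "drop"))
```

In every policy the violation is counted; the policies differ only in whether the offending record is kept, dropped, or causes the whole update to stop.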
NEW QUESTION # 34
The data engineering team maintains the following code:
Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?
- A. No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.
- B. The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.
- C. An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.
- D. An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.
- E. A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.
Answer: B
Explanation:
This is the correct answer because the code performs a full batch overwrite of the target table. The code uses three Delta Lake tables as input sources: accounts, orders, and order_items. These tables are joined using SQL to create a view called new_enriched_itemized_orders_by_account, which contains information about each order item and its associated account details. The code then uses write.format("delta").mode("overwrite") to overwrite the target table enriched_itemized_orders_by_account with the data from the view. Every time this code is executed, it replaces all existing data in the target table with new data based on the current valid version of each of the three input tables.
Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Write to Delta tables" section.
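The overwrite step can be sketched roughly as below. This is not the code from the question, which is not reproduced on this page. It assumes spark is an active SparkSession, and the use of saveAsTable is an assumption, since the explanation only mentions write.format("delta").mode("overwrite"). The function name is hypothetical.

```python
def overwrite_enriched_table(spark):
    """Fully replace the target Delta table with the view's current rows.

    `spark` is assumed to be an active SparkSession. Writing with
    saveAsTable is an assumption made for this sketch.
    """
    (
        spark.table("new_enriched_itemized_orders_by_account")
        .write.format("delta")
        .mode("overwrite")  # replaces all existing rows in the target
        .saveAsTable("enriched_itemized_orders_by_account")
    )
```

Nothing in this pattern is incremental: each run reads the current state of the view and rewrites the whole target table.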
NEW QUESTION # 35
......