Pass Your Databricks Certification Databricks-Certified-Professional-Data-Engineer Exam on Apr 06, 2024 with 98 Questions [Q17-Q39]

Databricks-Certified-Professional-Data-Engineer Free Exam Study Guide! (Updated 98 Questions)


Databricks Certified Professional Data Engineer exam is designed to test the skills and knowledge of individuals who work with big data and cloud computing technologies. Databricks-Certified-Professional-Data-Engineer exam is primarily focused on assessing candidates’ abilities to design, build, and maintain big data solutions using the Apache Spark platform. Databricks Certified Professional Data Engineer Exam certification is highly valued in the industry and can help individuals demonstrate their proficiency in managing big data projects.


Databricks Certified Professional Data Engineer is an exam designed for professionals who are willing to demonstrate their expertise in building and managing big data pipelines using Databricks. Databricks is a unified analytics platform that provides a collaborative environment for processing large-scale data. The Databricks Certified Professional Data Engineer exam validates the candidate's ability to design, build, and deploy large-scale data processing solutions using Databricks.

 

NEW QUESTION # 17
The data governance team has instituted a requirement that all tables containing Personal Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.
The following SQL DDL statement is executed to create a new table:

Which command allows manual confirmation that these three requirements have been met?

  • A. DESCRIBE EXTENDED dev.pii_test
  • B. SHOW TBLPROPERTIES dev.pii_test
  • C. DESCRIBE HISTORY dev.pii_test
  • D. DESCRIBE DETAIL dev.pii_test
  • E. SHOW TABLES dev

Answer: A

Explanation:
DESCRIBE EXTENDED is the only listed command that lets you confirm all three requirements at once. The requirements are that every table containing Personal Identifiable Information (PII) has column comments, a table comment, and the custom table property "contains_pii" = true. DESCRIBE EXTENDED displays detailed information about a table, including its schema with column comments, its location, its table properties, and its table comment. Running it on the dev.pii_test table therefore shows whether the table was created with the column comments, table comment, and custom table property specified in the SQL DDL statement. Verified References: [Databricks Certified Data Engineer Professional], under "Lakehouse" section; Databricks Documentation, under "DESCRIBE EXTENDED" section.

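As a rough sketch of how the three annotations and the check fit together, the snippet below issues a DDL statement and then runs DESCRIBE EXTENDED on the result. The column names (id, name), the comment strings, and the function name are assumptions made for illustration, and `spark` is assumed to be an active SparkSession.

```python
def create_and_describe(spark):
    """Create an annotated table, then return its extended description.

    `spark` is assumed to be an active SparkSession; the columns and
    comment text are hypothetical and chosen only for illustration.
    """
    spark.sql("""
        CREATE TABLE dev.pii_test (
            id INT,
            name STRING COMMENT 'PII'
        )
        COMMENT 'Contains PII'
        TBLPROPERTIES ('contains_pii' = true)
    """)
    # DESCRIBE EXTENDED lists column comments in the schema rows and the
    # table comment and table properties in the detailed-information rows.
    return spark.sql("DESCRIBE EXTENDED dev.pii_test")
```

SHOW TBLPROPERTIES would confirm only the property, which is why it is not sufficient on its own.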

NEW QUESTION # 18
What is the main difference between the two commands below?
INSERT OVERWRITE table_name
SELECT * FROM table

CREATE OR REPLACE TABLE table_name
AS SELECT * FROM table

  • A. INSERT OVERWRITE maintains historical data versions by default, CREATE OR REPLACE clears the historical data versions by default
  • B. INSERT OVERWRITE replaces data by default, CREATE OR REPLACE replaces data and schema by default
  • C. INSERT OVERWRITE replaces data and schema by default, CREATE OR REPLACE replaces data by default
  • D. Both are the same and result in identical outcomes
  • E. INSERT OVERWRITE clears historical data versions by default, CREATE OR REPLACE maintains the historical data versions by default

Answer: B

Explanation:
The main difference between INSERT OVERWRITE and CREATE OR REPLACE TABLE ... AS SELECT (CRAS) is that CRAS can modify the schema of the table, i.e. it can add new columns or change the data types of existing columns. By default, INSERT OVERWRITE only overwrites the data.
INSERT OVERWRITE can also overwrite the schema, but only when spark.databricks.delta.schema.autoMerge.enabled is set to true. If this option is not enabled and there is a schema mismatch, the command fails.


NEW QUESTION # 19
While investigating a data issue in a Delta table, you want to review logs to see when the table was updated and who updated it. What is the best way to review this data?

  • A. Review event logs in the Workspace
  • B. Check Databricks SQL Audit logs
  • C. Run SQL command DESCRIBE HISTORY table_name
  • D. Review workspace audit logs
  • E. Run SQL SHOW HISTORY table_name

Answer: C

Explanation:
The answer is: Run SQL command DESCRIBE HISTORY table_name.
Here is a sample of what the output of DESCRIBE HISTORY table_name looks like:
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|version|          timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|    operationMetrics|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|      5|2019-07-29 14:07:47|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          4|  Serializable|        false|[numTotalRows -> ...|
|      4|2019-07-29 14:07:41|  null|    null|   UPDATE|[predicate -> (id...|null|    null|     null|          3|  Serializable|        false|[numTotalRows -> ...|
|      3|2019-07-29 14:07:29|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          2|  Serializable|        false|[numTotalRows -> ...|
|      2|2019-07-29 14:06:56|  null|    null|   UPDATE|[predicate -> (id...|null|    null|     null|          1|  Serializable|        false|[numTotalRows -> ...|
|      1|2019-07-29 14:04:31|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          0|  Serializable|        false|[numTotalRows -> ...|
|      0|2019-07-29 14:01:40|  null|    null|    WRITE|[mode -> ErrorIfE...|null|    null|     null|       null|  Serializable|         true|[numFiles -> 2, n...|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
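To pull just the "who and when" columns out of that history, something like the following could be used from a notebook. This is a minimal sketch: the function name and the `limit` parameter are illustrative, and `spark` is assumed to be an active SparkSession on a cluster with Delta Lake.

```python
def recent_changes(spark, table_name, limit=10):
    """Return who changed a Delta table and when, newest version first.

    `spark` is assumed to be an active SparkSession with Delta Lake.
    """
    history = spark.sql(f"DESCRIBE HISTORY {table_name} LIMIT {limit}")
    # Keep only the audit-style columns from the history output.
    return history.select("version", "timestamp", "userName", "operation")
```

Calling `recent_changes(spark, "table_name").show()` would then print one row per table version.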


NEW QUESTION # 20
A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

  • A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
  • B. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.
  • C. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
  • D. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
  • E. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

Answer: E


NEW QUESTION # 21
A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

Which kind of test does the above line exemplify?

  • A. Unit
  • B. Integration
  • C. Manual
  • D. Functional

Answer: A

Explanation:
A unit test is designed to verify the correctness of a small, isolated piece of code, typically a single function.
Testing a mathematical function that calculates the area under a curve is an example of a unit test because it tests a specific, individual function to ensure it operates as expected.
References:
* Software Testing Fundamentals: Unit Testing
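For illustration, a unit test of this kind might look like the sketch below. The function `area_under_curve` and its trapezoidal-rule implementation are assumptions made for the example; they are not the code from the question.

```python
def area_under_curve(f, a, b, steps=1000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width


def test_area_under_parabola():
    # The exact integral of x^2 on [0, 3] is 9; allow a small numeric tolerance.
    assert abs(area_under_curve(lambda x: x * x, 0, 3) - 9) < 1e-3
```

The test exercises one function in isolation with a known input and expected output, which is what distinguishes it from an integration or functional test.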


NEW QUESTION # 22
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW
What is the expected behavior when a batch of data containing data that violates these constraints is processed?

  • A. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
  • B. Records that violate the expectation cause the job to fail.
  • C. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
  • D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
  • E. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.

Answer: A

Explanation:
The answer is: Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
Delta Live Tables supports three types of expectations for handling bad data in DLT pipelines: retain invalid records (the default when no ON VIOLATION clause is given), drop invalid records (ON VIOLATION DROP ROW), and fail the update (ON VIOLATION FAIL UPDATE).
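The difference between the three expectation behaviors (retain, drop, fail) can be modeled in plain Python. The sketch below is not the Delta Live Tables API; `apply_expectation` and its `on_violation` values are hypothetical names used only to mimic the behaviors on a list of rows.

```python
def apply_expectation(rows, predicate, on_violation="warn"):
    """Model the three expectation behaviors on a list of dict rows.

    Returns (rows written to the target, number of violations logged).
    """
    violations = [r for r in rows if not predicate(r)]
    if on_violation == "fail" and violations:
        # Mirrors ON VIOLATION FAIL UPDATE: the update stops.
        raise ValueError(f"{len(violations)} record(s) violated the expectation")
    if on_violation == "drop":
        # Mirrors ON VIOLATION DROP ROW: bad rows never reach the target.
        kept = [r for r in rows if predicate(r)]
    else:
        # Default behavior: bad rows are kept, only the metric is recorded.
        kept = list(rows)
    return kept, len(violations)


batch = [{"timestamp": "2019-06-01"}, {"timestamp": "2021-03-15"}]
valid_timestamp = lambda r: r["timestamp"] > "2020-01-01"
kept, logged = apply_expectation(batch, valid_timestamp, on_violation="drop")
```

With `on_violation="drop"`, only the 2021 row is kept while one violation is still counted, which matches answer A.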


NEW QUESTION # 23
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:
SELECT * FROM email_ltv
Which statement describes the results returned by this query?

  • A. Three columns will be returned, but one column will be named "redacted" and contain only null values.
  • B. The email, age, and ltv columns will be returned with the values in user_ltv.
  • C. Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.
  • D. Only the email and ltv columns will be returned; the email column will contain all null values.
  • E. The email and ltv columns will be returned with the values in user_ltv.

Answer: C

Explanation:
The code creates a view called email_ltv that selects the email and ltv columns from a table called user_ltv, which has the following schema: email STRING, age INT, ltv INT. The code also uses a CASE WHEN expression to replace the email values with the string "REDACTED" if the user is not a member of the marketing group. The user who executes the query is not a member of the marketing group, so they will only see the email and ltv columns, and the email column will contain the string "REDACTED" in each row.
Verified References: [Databricks Certified Data Engineer Professional], under "Lakehouse" section; Databricks Documentation, under "CASE expression" section.
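A view with this behavior could be written roughly as follows. The view text is a hypothetical reconstruction based on the explanation, not the original definition; it relies on the Databricks SQL function `is_member`, and `spark` is assumed to be an active SparkSession.

```python
def create_email_ltv_view(spark):
    """Create a view that redacts email for users outside 'marketing'.

    `spark` is assumed to be an active SparkSession on Databricks; the
    view text is a hypothetical reconstruction, not the original.
    """
    spark.sql("""
        CREATE OR REPLACE VIEW email_ltv AS
        SELECT
            CASE
                WHEN is_member('marketing') THEN email
                ELSE 'REDACTED'
            END AS email,
            ltv
        FROM user_ltv
    """)
```

Because `is_member` is evaluated for the user running the query, the same view returns real addresses to marketing users and the redacted string to everyone else.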


NEW QUESTION # 24
A junior data engineer on your team has implemented the following code block.

The view new_events contains a batch of records with the same schema as the events Delta table.
The event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the same event_id as an existing record?

  • A. They are ignored.
  • B. They are deleted.
  • C. They are updated.
  • D. They are merged.
  • E. They are inserted.

Answer: A

Explanation:
Records in new_events whose event_id already exists in events are ignored. This answer and the documentation quoted below describe a MERGE INTO statement that matches on event_id and contains only an insert clause for unmatched rows: a matched pair meets no WHEN MATCHED clause, so the existing target row is left unchanged and the incoming duplicate is not written. A plain INSERT INTO would not behave this way, because it appends every source row without checking for duplicate keys. Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "MERGE INTO" section.
"If none of the WHEN MATCHED conditions evaluate to true for a source and target row pair that matches the merge_condition, then the target row is left unchanged." https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html#:~:text=If%20none%20o

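The "ignored" outcome for an insert-only merge can be modeled on plain Python lists. This is a sketch of the semantics only, not Delta Lake code; `merge_insert_only` is a hypothetical helper name.

```python
def merge_insert_only(target, source, key="event_id"):
    """Model MERGE ... WHEN NOT MATCHED THEN INSERT on lists of dict rows.

    Source rows whose key already exists in the target are ignored;
    existing target rows are never updated or deleted.
    """
    existing = {row[key] for row in target}
    return list(target) + [row for row in source if row[key] not in existing]


events = [{"event_id": 1, "payload": "old"}]
new_events = [{"event_id": 1, "payload": "new"}, {"event_id": 2, "payload": "new"}]
merged = merge_insert_only(events, new_events)
```

After the call, event 1 still carries its old payload and only event 2 has been added.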

NEW QUESTION # 25
What is the main difference between AUTO LOADER and COPY INTO?

  • A. AUTO LOADER supports file notification when performing incremental loads.
  • B. COPY INTO supports schema evolution.
  • C. AUTO LOADER supports reading data from Apache Kafka
  • D. AUTO LOADER supports schema evolution.
  • E. COPY INTO supports file notification when performing incremental loads.

Answer: A

Explanation:
Auto Loader supports both directory listing and file notification, but COPY INTO only supports directory listing.
Auto loader file notification will automatically set up a notification service and queue service that subscribe to file events from the input directory in cloud object storage like Azure blob storage or S3. File notification mode is more performant and scalable for large input directories or a high volume of files.

Auto Loader and Cloud Storage Integration
Auto Loader supports two ways to ingest data incrementally:
1. Directory listing: lists the input directory and maintains state in RocksDB; supports incremental file listing.
2. File notification: uses a trigger and a queue to store file notifications, which can later be used to retrieve the files. Unlike directory listing, file notification can scale up to millions of files per day.
[OPTIONAL]
Auto Loader vs COPY INTO?
Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup. Auto Loader provides a new Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.
When to use Auto Loader instead of the COPY INTO?
* You want to load data from a file location that contains files in the order of millions or higher. Auto Loader can discover files more efficiently than the COPY INTO SQL command and can split file processing into multiple batches.
* You do not plan to load subsets of previously uploaded files. With Auto Loader, it can be more difficult to reprocess subsets of files. However, you can use the COPY INTO SQL command to reload subsets of files while an Auto Loader stream is simultaneously running.
Here are some additional notes on when to use COPY INTO vs Auto Loader:
When to use COPY INTO: https://docs.databricks.com/delta/delta-ingest.html#copy-into-sql-command
When to use Auto Loader: https://docs.databricks.com/delta/delta-ingest.html#auto-loader
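As a sketch of what enabling file notification mode looks like, the function below builds an Auto Loader reader with the `cloudFiles.useNotifications` option switched on. The function name, the JSON file format, and both path arguments are assumptions for the example; `spark` is assumed to be a SparkSession on Databricks.

```python
def read_with_file_notifications(spark, source_path, schema_path):
    """Build an Auto Loader stream that uses file notification mode.

    `spark` is assumed to be a SparkSession on Databricks; both paths are
    placeholders supplied by the caller.
    """
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Switch from the default directory listing to file notifications.
        .option("cloudFiles.useNotifications", "true")
        .option("cloudFiles.schemaLocation", schema_path)
        .load(source_path)
    )
```

COPY INTO has no equivalent switch, which is the distinction the question is testing.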


NEW QUESTION # 26
Which of the following is a correct statement on how the data is organized in storage when managing a Delta table?

  • A. All of the data is broken down into one or many parquet files, log files are broken down into one or many JSON files, and each transaction creates a new data file(s) and log file.
  • B. All of the data is broken down into one or many parquet files, log file is removed once the transaction is committed.
  • C. All of the data is stored into one parquet file, log files are broken down into one or many json files.
  • D. All of the data is broken down into one or many parquet files, but the log file is stored as a single json file, and every transaction creates a new data file(s) and log file gets appended.
  • E. All of the data and log are stored in a single parquet file

Answer: A

Explanation:
The answer is: All of the data is broken down into one or many parquet files, log files are broken down into one or many JSON files, and each transaction creates a new data file(s) and log file.
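A rough picture of the on-disk layout can be generated with a few lines of Python. The sketch assumes the usual Delta convention of a `_delta_log` directory holding one JSON commit file per version, named with a zero-padded 20-digit version number; the Parquet file names are simplified placeholders, and both function names are hypothetical.

```python
def delta_log_file(version):
    """Relative path of the JSON commit file for a given table version."""
    # Commit files are named by the zero-padded 20-digit version number.
    return f"_delta_log/{version:020d}.json"


def sketch_layout(table_root, data_files_per_commit):
    """List the paths a table might contain after a series of commits.

    `data_files_per_commit` gives how many Parquet files each commit wrote;
    the data file names here are simplified placeholders.
    """
    paths = []
    for version, file_count in enumerate(data_files_per_commit):
        for n in range(file_count):
            paths.append(f"{table_root}/part-{version:05d}-{n:05d}.snappy.parquet")
        paths.append(f"{table_root}/{delta_log_file(version)}")
    return paths
```

For example, `sketch_layout("/mnt/sales", [2, 1])` lists three Parquet data files and two JSON log files, one log file per transaction.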


NEW QUESTION # 27
Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.
Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

  • A. Executor's detail screen and Executor's log files
  • B. Driver's and Executor's log files
  • C. Stage's detail screen and Executor's files
  • D. Stage's detail screen and Query's detail screen

Answer: D

Explanation:
In Apache Spark's UI, indicators of data spilling to disk during the execution of wide transformations can be found in the Stage's detail screen and the Query's detail screen. These screens provide detailed metrics about each stage of a Spark job, including information about memory usage and spill data. If a task is spilling data to disk, it indicates that the data being processed exceeds the available memory, causing Spark to spill data to disk to free up memory. This is an important performance metric as excessive spill can significantly slow down the processing.
References:
* Apache Spark Monitoring and Instrumentation: Spark Monitoring Guide
* Spark UI Explained: Spark UI Documentation


NEW QUESTION # 28
All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the "registration" topic contains Personal Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely.
Which of the following solutions meets the requirements?

  • A. Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.
  • B. Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.
  • C. Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken.
  • D. Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.
  • E. All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.

Answer: B

Explanation:
Partitioning the data by the topic field allows the company to apply different access control policies and retention policies for different topics. For example, the company can use the Table Access Control feature to grant or revoke permissions to the registration topic based on user roles or groups. The company can also use the DELETE command to remove records from the registration topic that are older than 14 days, while keeping the records from other topics indefinitely. Partitioning by the topic field also improves the performance of queries that filter by the topic field, as they can skip reading irrelevant partitions.
References:
* Table Access Control: https://docs.databricks.com/security/access-control/table-acls/index.html
* DELETE: https://docs.databricks.com/delta/delta-update.html#delete-from-a-table
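The 14-day retention rule could be expressed as a DELETE that touches only the registration partition. The sketch below just builds the SQL text; it assumes the LONG `timestamp` column holds epoch milliseconds (as Kafka record timestamps do) and that this value is an acceptable stand-in for ingestion time. The function and table names are hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def pii_retention_delete(table_name, now_ms, retention_days=14):
    """Build a DELETE that removes 'registration' rows older than the window.

    Assumes the LONG `timestamp` column holds epoch milliseconds; adjust
    the unit if the table stores seconds instead.
    """
    cutoff_ms = now_ms - retention_days * MS_PER_DAY
    return (
        f"DELETE FROM {table_name} "
        f"WHERE topic = 'registration' AND timestamp < {cutoff_ms}"
    )
```

Because the predicate filters on the partition column, the delete only needs to touch files under the registration partition and leaves the other four topics alone.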


NEW QUESTION # 29
The Spark command below creates a summary table of how many times each customerId appears in the events_log Delta table and writes it to a target table as a one-time micro-batch. Fill in the blanks to complete the query.
spark._________
    .format("delta")
    .table("events_log")
    .groupBy("customerId")
    .count()
    ._______
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
    .trigger(______)
    .table("target_table")

  • A. readStream, writeStream, once
  • B. writeStream, readStream, once
  • C. writeStream, readStream, once = True
  • D. writeStream, processingTime = once
  • E. readStream, writeStream, once = True

Answer: E

Explanation:
The answer is readStream, writeStream, once = True.
(spark.readStream
    .format("delta")
    .table("events_log")
    .groupBy("customerId")
    .count()
    .writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
    .trigger(once=True)
    .table("target_table"))


NEW QUESTION # 30
The data engineering team is using a set of SQL queries to review data quality and monitor the ETL job every day. Which of the following approaches can be used to set up a schedule and automate this process?

  • A. They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL.
  • B. They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL
  • C. They can schedule the query to run every 1 day from the Jobs UI
  • D. They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.
  • E. They can schedule the query to run every 12 hours from the Jobs UI.

Answer: D

Explanation:
Individual queries can be refreshed on a schedule.
To set the schedule:
1. Click the query info tab.
2. Click the link to the right of Refresh Schedule to open a picker with schedule intervals.
3. Set the schedule.
The picker scrolls and allows you to choose:
* An interval: 1-30 minutes, 1-12 hours, 1 or 30 days, 1 or 2 weeks
* A time. The time selector displays in the picker only when the interval is greater than 1 day and the day selection is greater than 1 week. When you schedule a specific time, Databricks SQL takes input in your computer's timezone and converts it to UTC. If you want a query to run at a certain time in UTC, you must adjust the picker by your local offset. For example, if you want a query to execute at 00:00 UTC each day, but your current timezone is PDT (UTC-7), you should select 17:00 in the picker.
4. Click OK.
Your query will run automatically.
If you experience a scheduled query not executing according to its schedule, you should manually trigger the query to make sure it doesn't fail. However, you should be aware of the following:
* If you schedule an interval (for example, "every 15 minutes"), the interval is calculated from the last successful execution. If you manually execute a query, the scheduled query will not be executed until the interval has passed.
* If you schedule a time, Databricks SQL waits for the results to be "outdated". For example, if you have a query set to refresh every Thursday and you manually execute it on Wednesday, by Thursday the results will still be considered "valid", so the query wouldn't be scheduled for a new execution. Thus, for example, when setting a weekly schedule, check the last query execution time and expect the scheduled query to be executed on the selected day after that execution is a week old. Make sure not to manually execute the query during this time.
If a query execution fails, Databricks SQL retries with a back-off algorithm. The more failures the further away the next retry will be (and it might be beyond the refresh interval).
Refer to the documentation for additional information:
https://docs.microsoft.com/en-us/azure/databricks/sql/user/queries/schedule-query
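The local-offset adjustment described for the time picker (00:00 UTC while working in PDT) is easy to check with the standard library. This is a small sketch; `picker_time_for_utc` is a hypothetical helper, and PDT is modeled as a fixed UTC-7 offset.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")


def picker_time_for_utc(utc_hour, utc_minute, local_tz):
    """Local wall-clock time to select so the query runs at the given UTC time."""
    # The calendar date is arbitrary; only the time of day matters here.
    utc_moment = datetime(2024, 1, 1, utc_hour, utc_minute, tzinfo=timezone.utc)
    return utc_moment.astimezone(local_tz).strftime("%H:%M")


print(picker_time_for_utc(0, 0, PDT))  # 17:00
```

The result agrees with the 17:00 value given in the scheduling steps.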


NEW QUESTION # 31
At the end of the inventory process, a file is uploaded to cloud object storage. You are asked to build a process that ingests the data incrementally. The schema of the file is expected to change over time, and the ingestion process should handle these changes automatically.
Below is the Auto Loader command to load the data. Fill in the blanks for successful execution of the code.
(spark.readStream
    .format("cloudFiles")
    .option("_______", "csv")
    .option("_______", "dbfs:/location/checkpoint/")
    .load(data_source)
    .writeStream
    .option("_______", "dbfs:/location/checkpoint/")
    .option("_______", "true")
    .table(table_name))

  • A. format, checkpointlocation, schemalocation, overwrite
  • B. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema
  • C. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, append
  • D. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, overwrite
  • E. cloudfiles.format, checkpointlocation, cloudfiles.schemalocation, overwrite

Answer: B

Explanation:
The answer is cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema.
Here is the end-to-end syntax of the streaming ELT, shown with the capitalization used in the Databricks documentation. The complete list of options is in "Auto Loader options | Databricks on AWS".
(spark.readStream
    .format("cloudFiles")  # Returns a streaming data source that reads files as they arrive, based on the trigger
    .option("cloudFiles.format", "csv")  # Format of the incoming files
    .option("cloudFiles.schemaLocation", "dbfs:/location/checkpoint/")  # Location to store the inferred schema and subsequent changes
    .load(data_source)
    .writeStream
    .option("checkpointLocation", "dbfs:/location/checkpoint/")  # Location of the stream's checkpoint
    .option("mergeSchema", "true")  # Let the target Delta table's schema evolve when new columns appear
    .table(table_name))  # Target table


NEW QUESTION # 32
You have written a notebook to generate a summary data set for reporting. The notebook was scheduled using a job cluster, but you realized it takes 8 minutes to start the cluster. What feature can be used to start the cluster in a timely fashion so your job can run immediately?

  • A. Disable auto termination so the cluster is always running
  • B. Set up an additional job to run ahead of the actual job so the cluster is running when the second job starts
  • C. Use Databricks Premium edition instead of Databricks standard edition
  • D. Pin the cluster in the cluster UI page so it is always available to the jobs
  • E. Use the Databricks cluster pools feature to reduce the startup time

Answer: E

Explanation:
Cluster pools let you reserve VMs ahead of time; when a new job cluster is created, its VMs are taken from the pool. Note: while the VMs wait in the pool to be used by a cluster, the only cost incurred is the cloud provider's VM cost (Azure in this example). Databricks runtime cost is billed only once a VM is allocated to a cluster.
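In a job definition, using a pool comes down to pointing the job cluster at the pool's ID. The helper below is a sketch that builds such a cluster spec as a plain dictionary; the function name and all argument values are placeholders, and it assumes the `instance_pool_id` field of the Databricks cluster specification.

```python
def job_cluster_from_pool(pool_id, spark_version, num_workers):
    """Build a job-cluster spec that draws its VMs from an instance pool.

    The argument values are placeholders supplied by the caller; with a
    pool, the node type comes from the pool rather than the cluster spec.
    """
    return {
        "spark_version": spark_version,
        "instance_pool_id": pool_id,
        "num_workers": num_workers,
    }
```

The job still gets a fresh job cluster on every run; only the VM acquisition step is shortened because idle instances are already waiting in the pool.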


NEW QUESTION # 33
What is the output of the function below when executed with input parameters 1, 3?
def check_input(x, y):
    if x < y:
        x = x + 1
        if x < y:
            x = x + 1
            if x < y:
                x = x + 1
    return x

check_input(1, 3)

  • A. 0
  • B. 1
  • C. 3
  • D. 2
  • E. 3

Answer: C
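The result can be traced step by step. The sketch below assumes the `if` statements are nested; for this particular input, three sequential `if` statements would produce the same result.

```python
def check_input(x, y):
    if x < y:
        x = x + 1      # 1 < 3 is true, so x becomes 2
        if x < y:
            x = x + 1  # 2 < 3 is true, so x becomes 3
            if x < y:
                x = x + 1  # 3 < 3 is false, so this line is skipped
    return x


assert check_input(1, 3) == 3
```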


NEW QUESTION # 34
The data analyst team put together queries that identify items that are out of stock based on orders and replenishment, but when the queries run together for the final output, they take a really long time. You were asked to find out why the queries are slow and to identify steps to improve performance. You noticed that all the queries run sequentially on a SQL endpoint cluster. Which of the following steps can be taken to resolve the issue?
Here is the example query:
-- Get order summary
create or replace table orders_summary
as
select product_id, sum(order_count) order_count
from
  (
    select product_id, order_count from orders_instore
    union all
    select product_id, order_count from orders_online
  )
group by product_id;

-- Get supply summary
create or replace table supply_summary
as
select product_id, sum(supply_count) supply_count
from supply
group by product_id;

-- Get on hand based on orders summary and supply summary
with stock_cte
as (
  select nvl(s.product_id, o.product_id) as product_id,
         nvl(supply_count, 0) - nvl(order_count, 0) as on_hand
  from supply_summary s
  full outer join orders_summary o
    on s.product_id = o.product_id
)
select *
from stock_cte
where on_hand = 0;

  • A. Turn on the Serverless feature for the SQL endpoint.
  • B. Increase the maximum bound of the SQL endpoint's scaling range.
  • C. Turn on the Auto Stop feature for the SQL endpoint.
  • D. Turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."
  • E. Increase the cluster size of the SQL endpoint.

Answer: E

Explanation:
The answer is to increase the cluster size of the SQL endpoint. The queries here run sequentially, and a single query cannot span more than one cluster, so adding more clusters will not speed it up. Increasing the cluster size gives each query additional compute in the warehouse and improves performance.
In the exam, additional context will not be given. Instead, look for cue words, or work out whether the queries run sequentially or concurrently. If the queries run sequentially, scale up (a larger cluster with more worker nodes); if they run concurrently (more users), scale out (more clusters).
Increasing the cluster size adds more worker nodes.

A SQL endpoint scales horizontally (scale-out) and vertically (scale-up); you have to understand when to use which.
Scale-up: increase the size of the cluster from X-Small to Small, Medium, X-Large, and so on.
If you are trying to improve the performance of a single query, the additional memory, nodes, and CPU in the cluster will improve the performance.
Scale-out: add more clusters by changing the maximum number of clusters.
If you are trying to improve throughput, that is, to run as many queries as possible at once, additional clusters will improve the performance.


NEW QUESTION # 35
You are currently working with the application team to set up a SQL endpoint. Once the team started consuming the SQL endpoint, you noticed that during peak hours, as the number of concurrent users increases, query performance degrades and the same queries take longer to run. Which of the following steps can be taken to resolve the issue?

  • A. They can increase the maximum bound of the SQL endpoint's scaling range.
  • B. They can turn on the Serverless feature for the SQL endpoint.
  • C. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy from "Cost optimized" to "Reliability Optimized."
  • D. They can turn on the Auto Stop feature for the SQL endpoint.
  • E. They can increase the cluster size(2X-Small to 4X-Large) of the SQL endpoint.

Answer: A

Explanation:
The answer is: They can increase the maximum bound of the SQL endpoint's scaling range. When you increase the maximum of the scaling range, more clusters are added, so queries can start running on the available clusters instead of waiting in the queue.
The question tests whether you know how to scale a SQL endpoint (SQL warehouse). Look for cue words, or work out whether the queries run sequentially or concurrently. If the queries run sequentially, scale up (increase the size of the cluster, from 2X-Small up to 4X-Large); if the queries run concurrently or with more users, scale out (add more clusters).
SQL endpoint (SQL warehouse) overview:
1. A SQL warehouse should have at least one cluster.
2. A cluster comprises one driver node and one or many worker nodes.
3. The number of worker nodes in a cluster is determined by the size of the cluster (2X-Small -> 1 worker, X-Small -> 2 workers, ... up to 4X-Large -> 128 workers). Changing the size is called scale-up.
4. A single cluster, irrespective of cluster size (2X-Small to 4X-Large), can only run 10 queries at any given time. If a user submits 20 queries all at once to a warehouse with 3X-Large cluster size and cluster scaling (min 1, max 1), 10 queries start running while the remaining 10 wait in a queue for them to finish.
5. Increasing the warehouse cluster size can improve the performance of a query. For example, if a query runs for 1 minute on a 2X-Small warehouse, it may run in 30 seconds on an X-Small warehouse, because 2X-Small has 1 worker node and X-Small has 2 worker nodes, so the query's tasks are spread across more workers. (Note: this is an ideal case; query performance depends on many factors and does not always scale linearly.)
6. A warehouse can have more than one cluster; this is called scale-out. If a warehouse is configured with X-Small cluster size and cluster scaling (min 1, max 2), Databricks spins up an additional cluster when it detects queries waiting in the queue. For example, if a user submits 20 queries, 10 start running and the rest are held in the queue; Databricks automatically starts the second cluster and redirects the 10 queued queries to it.
7. A single query will not span more than one cluster. Once a query is submitted to a cluster, it remains on that cluster until execution finishes, irrespective of how many clusters are available.
Please review the below diagram to understand the above concepts:

A SQL endpoint (SQL Warehouse) scales horizontally (scale-out) and vertically (scale-up); you have to understand when to use which.
Scale-out -> Add more clusters to a SQL endpoint by changing the max number of clusters. If you are trying to improve throughput, that is, to run as many queries as possible concurrently, additional clusters will improve the performance.
Databricks SQL automatically scales out as soon as it detects queries in a queued state. In this example, scaling is set to min 1 and max 3, which means the warehouse can grow to three clusters if it detects queries are waiting.
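The queueing behaviour described in this explanation can be sketched with a small model. This is only an illustration under the explanation's own assumption of 10 concurrent queries per cluster; the function name `dispatch` and its parameters are hypothetical and are not part of any Databricks API.

```python
import math

SLOTS_PER_CLUSTER = 10  # per-cluster concurrency limit, as stated in the explanation


def dispatch(num_queries, min_clusters, max_clusters, slots=SLOTS_PER_CLUSTER):
    """Return (clusters, running, queued) once the warehouse has scaled out
    as far as its scaling range allows. Toy model, not a Databricks API."""
    needed = math.ceil(num_queries / slots) if num_queries else 0
    # Never fewer than min_clusters, never more than max_clusters.
    clusters = max(min_clusters, min(needed, max_clusters))
    running = min(num_queries, clusters * slots)
    return clusters, running, num_queries - running


print(dispatch(20, 1, 1))  # (1, 10, 10): ten queries wait in the queue
print(dispatch(20, 1, 2))  # (2, 20, 0): the second cluster drains the queue
print(dispatch(45, 1, 3))  # (3, 30, 15): the max bound still leaves a queue
```

Raising `max_clusters` is what shrinks the queue in this model; changing the cluster size would not alter the number of concurrent slots.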

During warehouse creation or afterwards, you can change the warehouse size (2X-Small to 4X-Large) to improve query performance, and change the maximum of the scaling range to add more clusters to a SQL Endpoint (SQL Warehouse) for scale-out. If you are changing an existing warehouse, you may have to restart the warehouse for the changes to take effect.


NEW QUESTION # 36
A Delta Lake table was created with the below query:

Realizing that the original query had a typographical error, the below code was executed:
ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store
Which result will occur after running the second command?

  • A. A new Delta transaction log is created for the renamed table.
  • B. The table name change is recorded in the Delta transaction log.
  • C. The table reference in the metastore is updated and no data is changed.
  • D. The table reference in the metastore is updated and all data files are moved.
  • E. All related files and metadata are dropped and recreated in a single ACID transaction.

Answer: C

Explanation:
The query uses the CREATE TABLE USING DELTA syntax to create a Delta Lake table from an existing Parquet file stored in DBFS. The query also uses the LOCATION keyword to specify the path to the Parquet file as /mnt/finance_eda_bucket/tx_sales.parquet. By using the LOCATION keyword, the query creates an external table, which is a table that is stored outside of the default warehouse directory and whose data files are not managed by Databricks. An external table can be created from an existing directory in a cloud storage system, such as DBFS or S3, that contains data files in a supported format, such as Parquet or CSV.
The result of running the second command is that the table reference in the metastore is updated and no data is changed. The metastore is a service that stores metadata about tables, such as their schema, location, properties, and partitions. The metastore allows users to access tables using SQL commands or Spark APIs without knowing their physical location or format. When an external table is renamed with the ALTER TABLE RENAME TO command, only the table reference in the metastore is updated with the new name; no data files or directories are moved or changed in the storage system. The table still points to the same location and uses the same format as before. However, if renaming a managed table, which is a table whose metadata and data are both managed by Databricks, both the table reference in the metastore and the data files in the default warehouse directory are moved and renamed accordingly.
Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "ALTER TABLE RENAME TO" section; Databricks Documentation, under "Metastore" section; Databricks Documentation, under "Managed and external tables" section.
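The rename behaviour for an external table can be pictured with a toy model. This is a minimal sketch, not Databricks code: `ToyMetastore`, the `storage` dictionary, and the path `/mnt/example/sales` are all hypothetical stand-ins for the metastore and cloud storage.

```python
class ToyMetastore:
    """Toy stand-in for a metastore: maps table names to storage locations."""

    def __init__(self):
        self.tables = {}

    def create_external(self, name, location):
        self.tables[name] = location

    def rename(self, old, new):
        if new in self.tables:
            raise ValueError(f"table {new} already exists")
        # Only the name -> location mapping changes; storage is never touched.
        self.tables[new] = self.tables.pop(old)


# Hypothetical storage listing for the table's directory.
storage = {"/mnt/example/sales": ["part-0000.parquet", "_delta_log/"]}

ms = ToyMetastore()
ms.create_external("prod.sales_by_stor", "/mnt/example/sales")
files_before = list(storage["/mnt/example/sales"])

ms.rename("prod.sales_by_stor", "prod.sales_by_store")

print(ms.tables)                                     # new name, same location
print(storage["/mnt/example/sales"] == files_before)  # files are unchanged
```

In this model the rename is purely a metadata operation, which mirrors option C: the reference is updated and no data is changed.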


NEW QUESTION # 37
A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day.
At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

  • A. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
  • B. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
  • C. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
  • D. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
  • E. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

Answer: E

Explanation:
Decreasing the trigger interval to 5 seconds is the adjustment that can meet the requirement. With a 10-second trigger interval, a record that arrives just after a microbatch starts waits up to 10 seconds for the next trigger and then still needs roughly 3 seconds of processing, so it can miss the 10-second target even under normal load. During peak hours, each 10-second batch also accumulates more records, and large batches can spill to disk, which is consistent with execution times that sometimes exceed 30 seconds. Triggering every 5 seconds keeps each microbatch smaller, which may prevent records from backing up and large batches from causing spill. Option D does not help: the trigger once option processes all available data in a single batch and then stops, which suits scheduled or on-demand runs, and launching a Databricks job every 10 seconds adds job startup overhead on top of the same large batches. The trigger interval can also be changed without modifying the checkpoint directory, so option C is incorrect. Verified References: [Databricks Certified Data Engineer Professional], under "Structured Streaming" section; Databricks Documentation, under "Trigger Once" section.
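The trigger settings named in the options correspond to keyword arguments of PySpark's `DataStreamWriter.trigger` (`processingTime` and `once`). The sketch below only builds those arguments and adds a rough latency estimate; the helper names `trigger_kwargs` and `worst_case_latency` are hypothetical, and the latency formula is a simplifying assumption (a record arrives just after a batch starts and is picked up at the next trigger), not a guarantee about Spark's behaviour.

```python
def trigger_kwargs(mode, seconds=None):
    """Build keyword arguments for DataStreamWriter.trigger(...)."""
    if mode == "interval":
        if seconds is None or seconds <= 0:
            raise ValueError("interval mode needs a positive number of seconds")
        return {"processingTime": f"{seconds} seconds"}
    if mode == "once":
        return {"once": True}
    raise ValueError(f"unknown trigger mode: {mode}")


def worst_case_latency(trigger_seconds, batch_seconds):
    """Rough upper bound: wait for the next trigger, then process the batch."""
    return trigger_seconds + batch_seconds


# Usage against a streaming DataFrame `df` (hypothetical):
#   df.writeStream.trigger(**trigger_kwargs("interval", 5)).start(...)
print(trigger_kwargs("interval", 5))
print(trigger_kwargs("once"))
print(worst_case_latency(10, 3), worst_case_latency(5, 3))
```

Under this rough model, only trigger intervals short enough that the interval plus the batch time stays under 10 seconds keep records within the target.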


NEW QUESTION # 38
The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.
Which statement is a possible explanation for this behavior?

  • A. %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
  • B. Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
  • C. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
  • D. Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
  • E. %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

Answer: E

Explanation:
https://www.databricks.com/blog/2020/08/31/introducing-the-databricks-web-terminal.html
The code uses %sh to execute shell code on the driver node, so it does not take advantage of the worker nodes or Databricks-optimized Spark. This is why the code takes so long to execute. A better approach is to read and write the data with Spark APIs, so that the work is distributed across the cluster's worker nodes and benefits from Spark's parallelism.


NEW QUESTION # 39
......


Databricks Certified Professional Data Engineer exam is a rigorous and comprehensive assessment of a candidate's skills in designing, building, and maintaining data pipelines on the Databricks platform. Databricks-Certified-Professional-Data-Engineer exam covers a wide range of topics, including data storage and retrieval, data processing, data transformation, and data visualization. Candidates are tested on their ability to design and implement scalable and reliable data architectures, as well as their proficiency in troubleshooting and optimizing data pipelines.

 

Databricks-Certified-Professional-Data-Engineer Dumps for Databricks Certification Certified Exam Questions and Answer: https://www.dumpstests.com/Databricks-Certified-Professional-Data-Engineer-latest-test-dumps.html