Home > Databricks-Certified-Professional-Data-Engineer / Databricks

Free Databricks (Databricks-Certified-Professional-Data-Engineer) Certification Sample Questions with Online Practice Test [Q94-Q114]

Published date July 24, 2023

5/5 - (3 votes)

Free Databricks (Databricks-Certified-Professional-Data-Engineer) Certification Sample Questions with Online Practice Test

Databricks-Certified-Professional-Data-Engineer Certification Study Guide Pass Databricks-Certified-Professional-Data-Engineer Fast

Databricks Certified Professional Data Engineer (DCPDE) is a certification program designed to validate the skills and knowledge of data professionals on the Databricks platform. Databricks Certified Professional Data Engineer Exam certification is aimed at professionals who design, build, and maintain data processing systems using Apache Spark and Databricks. The DCPDE certification demonstrates a comprehensive understanding of the Databricks platform and the ability to design and implement data processing solutions using Spark.

QUESTION 94
You are asked to setup two tasks in a databricks job, the first task runs a notebook to download the data from a remote system, and the second task is a DLT pipeline that can process this data, how do you plan to configure this in Jobs UI

Single job cannot have a notebook task and DLT Pipeline task, use two different jobs with linear dependency.

Jobs UI does not support DTL pipeline, setup the first task using jobs UI and setup the DLT to run in continuous mode.

Jobs UI does not support DTL pipeline, setup the first task using jobs UI and setup the DLT to run in trigger mode.

Single job can be used to setup both notebook and DLT pipeline, use two different tasks with linear dependency.

Add first step in the DLT pipeline and run the DLT pipeline as triggered mode in JOBS UI

Explanation
The answer is Single job can be used to set up both notebook and DLT pipeline, use two different tasks with linear dependency, Here is the JOB UI
1.Create a notebook task
2.Create DLT task
a.add notebook task as dependency
3.Final view
Create the notebook task
Graphical user interface, text, application, email Description automatically generated

DLT task
Graphical user interface, text, application, email Description automatically generated

Final view
Graphical user interface, text, application, PowerPoint Description automatically generated

Bottom of Form
Top of Form

QUESTION 95
Which of the following describes how Databricks Repos can help facilitate CI/CD workflows on the Databricks Lakehouse Platform?

Databricks Repos can facilitate the pull request, review, and approval process before merging branches

Databricks Repos can merge changes from a secondary Git branch into a main Git branch

Databricks Repos can be used to design, develop, and trigger Git automation pipelines

Databricks Repos can store the single-source-of-truth Git repository

Databricks Repos can commit or push code changes to trigger a CI/CD process

Explanation
Answer is Databricks Repos can commit or push code changes to trigger a CI/CD process See below diagram to understand the role Databricks Repos and Git provider plays when building a CI/CD workdlow.
All the steps highlighted in yellow can be done Databricks Repo, all the steps highlighted in Gray are done in a git provider like Github or Azure Devops.
Diagram Description automatically generated

QUESTION 96
A data engineering team is in the process of converting their existing data pipeline to utilize Auto Loader for
incremental processing in the ingestion of JSON files. One data engineer comes across the following code
block in the Auto Loader documentation:
1. (streaming_df = spark.readStream.format(“cloudFiles”)
2. .option(“cloudFiles.format”, “json”)
3. .option(“cloudFiles.schemaLocation”, schemaLocation)
4. .load(sourcePath))
Assuming that schemaLocation and sourcePath have been set correctly, which of the following changes does
the data engineer need to make to convert this code block to use Auto Loader to ingest the data?

There is no change required. The inclusion of format(“cloudFiles”) enables the use of Auto Loader

There is no change required. Databricks automatically uses Auto Loader for streaming reads

The data engineer needs to change the format(“cloudFiles”) line to format(“autoLoader”)

The data engineer needs to add the .autoLoader line before the .load(sourcePath) line

There is no change required. The data engineer needs to ask their administrator to turn on Auto Loader

QUESTION 97
The data analyst team had put together queries that identify items that are out of stock based on orders and replenishment but when they run all together for final output the team noticed it takes a really long time, you were asked to look at the reason why queries are running slow and identify steps to improve the performance and when you looked at it you noticed all the code queries are running sequentially and using a SQL endpoint cluster. Which of the following steps can be taken to resolve the issue?
Here is the example query
1.— Get order summary
2.create or replace table orders_summary
3.as
4.select product_id, sum(order_count) order_count
5.from
6. (
7. select product_id,order_count from orders_instore
8. union all
9. select product_id,order_count from orders_online
10. )
11.group by product_id
12.– get supply summary
13.create or repalce tabe supply_summary
14.as
15.select product_id, sum(supply_count) supply_count
16.from supply
17.group by product_id
18.
19.– get on hand based on orders summary and supply summary
20.
21.with stock_cte
22.as (
23.select nvl(s.product_id,o.product_id) as product_id,
24. nvl(supply_count,0) – nvl(order_count,0) as on_hand
25.from supply_summary s
26.full outer join orders_summary o
27. on s.product_id = o.product_id
28.)
29.select *
30.from
31.stock_cte
32.where on_hand = 0

Turn on the Serverless feature for the SQL endpoint.

Increase the maximum bound of the SQL endpoint’s scaling range.

Increase the cluster size of the SQL endpoint.

Turn on the Auto Stop feature for the SQL endpoint.

Turn on the Serverless feature for the SQL endpoint and change the Spot Instance Pol-icy to “Reliability Optimized.”

Explanation
The answer is to increase the cluster size of the SQL Endpoint, here queries are running sequentially and since the single query can not span more than one cluster adding more clusters won’t improve the query but rather increasing the cluster size will improve performance so it can use additional compute in a warehouse.
In the exam please note that additional context will not be given instead you have to look for cue words or need to understand if the queries are running sequentially or concurrently. if the que-ries are running sequentially then scale up(more nodes) if the queries are running concurrently (more users) then scale out(more clusters).
Below is the snippet from Azure, as you can see by increasing the cluster size you are able to add more worker nodes.

SQL endpoint scales horizontally(scale-out) and vertically (scale-up), you have to understand when to use what.
Scale-up-> Increase the size of the cluster from x-small to small, to medium, X Large….
If you are trying to improve the performance of a single query having additional memory, additional nodes and cpu in the cluster will improve the performance.
Scale-out -> Add more clusters, change max number of clusters
If you are trying to improve the throughput, being able to run as many queries as possible then having an additional cluster(s) will improve the performance.
SQL endpoint
A picture containing diagram Description automatically generated

QUESTION 98
You would like to build a spark streaming process to read from a Kafka queue and write to a Delta table every
15 minutes, what is the correct trigger option

trigger(“15 minutes”)

trigger(process “15 minutes”)

trigger(processingTime = 15)

trigger(processingTime = “15 Minutes”)

trigger(15)

Explanation
The answer is trigger(processingTime = “15 Minutes”)
Triggers:
*Unspecified
This is the default. This is equivalent to using processingTime=”500ms”
*Fixed interval micro-batches .trigger(processingTime=”2 minutes”)
The query will be executed in micro-batches and kicked off at the user-specified intervals
*One-time micro-batch .trigger(once=True)
The query will execute a single micro-batch to process all the available data and then stop on its own
*One-time micro-batch.trigger .trigger(availableNow=True) — New feature a better version of (once=True) Databricks supports trigger(availableNow=True) in Databricks Runtime 10.2 and above for Delta Lake and Auto Loader sources. This functionality combines the batch processing approach of trigger once with the ability to configure batch size, resulting in multiple parallelized batches that give greater control for right-sizing batches and the resultant files.

QUESTION 99
A new data engineer has started at a company. The data engineer has recently been added to the company’s
Databricks workspace as [email protected]. The data engineer needs to be able to query the table
sales in the database retail. The new data engineer already has been granted USAGE on the database retail.
Which of the following commands can be used to grant the appropriate permissions to the new data engineer?

GRANT USAGE ON TABLE sales TO [email protected];

GRANT SELECT ON TABLE [email protected] TO sales;

GRANT SELECT ON TABLE sales TO [email protected];

GRANT CREATE ON TABLE sales TO [email protected];

GRANT USAGE ON TABLE [email protected] TO sales;

QUESTION 100
Which of the following tool provides Data Access control, Access Audit, Data Lineage, and Data discovery?

DELTA LIVE Pipelines

Unity Catalog

Data Governance

DELTA lake

Lakehouse

QUESTION 101
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE.
Three datasets are defined against Delta Lake table sources using LIVE TABLE . The table is configured to
run in Development mode using the Triggered Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after
clicking Start to update the pipeline?

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to
allow for additional testing

All datasets will be updated once and the pipeline will shut down. The compute resources will be
terminated

All datasets will be updated continuously and the pipeline will not shut down. The compute resources
will persist with the pipeline

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will
persist after the pipeline is stopped to allow for additional testing

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will
be deployed for the update and terminated when the pipeline is stopped

QUESTION 102
You are noticing job cluster is taking 6 to 8 mins to start which is delaying your job to finish on time, what steps you can take to reduce the amount of time cluster startup time

Setup a second job ahead of first job to start the cluster, so the cluster is ready with re-sources when the job starts

Use All purpose cluster instead to reduce cluster start up time

Reduce the size of the cluster, smaller the cluster size shorter it takes to start the clus-ter

Use cluster pools to reduce the startup time of the jobs

Use SQL endpoints to reduce the startup time

Explanation
The answer is, Use cluster pools to reduce the startup time of the jobs.
Cluster pools allow us to reserve VM’s ahead of time, when a new job cluster is created VM are grabbed from the pool. Note: when the VM’s are waiting to be used by the cluster only cost incurred is Azure. Databricks run time cost is only billed once VM is allocated to a cluster.
Here is a demo of how to setup and follow some best practices,
https://www.youtube.com/watch?v=FVtITxOabxg&ab_channel=DatabricksAcademy

QUESTION 103
A junior data engineer needs to create a Spark SQL table my_table for which Spark manages both the data and
the metadata. The metadata and data should also be stored in the Databricks Filesystem (DBFS).
Which of the following commands should a senior data engineer share with the junior data engineer to
complete this task?

1. CREATE MANAGED TABLE my_table (id STRING, value STRING) USING
2. org.apache.spark.sql.parquet OPTIONS (PATH “storage-path”);

1. CREATE TABLE my_table (id STRING, value STRING) USING DBFS;

1. CREATE TABLE my_table (id STRING, value STRING) USING
2. org.apache.spark.sql.parquet OPTIONS (PATH “storage-path”)

1. CREATE TABLE my_table (id STRING, value STRING);

1. CREATE MANAGED TABLE my_table (id STRING, value STRING);

QUESTION 104
What is the type of table created when you issue SQL DDL command CREATE TABLE sales (id int, units int)

Query fails due to missing location

Query fails due to missing format

Managed Delta table

External Table

Managed Parquet table

Explanation
Answer is Managed Delta table
Anytime a table is created without the Location keyword it is considered a managed table, by de-fault all managed tables DELTA tables Syntax CREATE TABLE table_name ( column column_data_type…)

QUESTION 105
How does Lakehouse replace the dependency on using Data lakes and Data warehouses in a Data and Analytics solution?

Open, direct access to data stored in standard data formats.

Supports ACID transactions.

Supports BI and Machine learning workloads

Support for end-to-end streaming and batch workloads

All the above

Explanation
Lakehouse combines the benefits of a data warehouse and data lakes,
Lakehouse = Data Lake + DataWarehouse
Here are some of the major benefits of a lakehouse
Text, letter Description automatically generated

Lakehouse = Data Lake + DataWarehouse
A picture containing text, blackboard Description automatically generated

QUESTION 106
How do you check the location of an existing schema in Delta Lake?

Run SQL command SHOW LOCATION schema_name

Check unity catalog UI

Use Data explorer

Run SQL command DESCRIBE SCHEMA EXTENDED schema_name
E Schemas are internally in-store external hive meta stores like MySQL or SQL Server

Explanation
Here is an example of how it looks
Graphical user interface, text, application, email Description automatically generated

QUESTION 107
How do you create a delta live tables pipeline and deploy using DLT UI?

Within the Workspace UI, click on Workflows, select Delta Live tables and create a pipeline and select the notebook with DLT code.

Under Cluster UI, select SPARK UI and select Structured Streaming and click create pipeline and select the notebook with DLT code.

There is no UI, you can only setup DELTA LIVE TABLES using Python and SQL API and select the notebook with DLT code.

Use VS Code and download DBX plugin, once the plugin is loaded you can build DLT pipelines and select the notebook with DLT code.

Within the Workspace UI, click on SQL Endpoint, select Delta Live tables and create pipelinea and select the notebook with DLT code.

Explanation
The answer is, Within the Workspace UI, click on Workflows, select Delta Live tables and create a pipeline and select the notebook with DLT code.
https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-quickstart.html

QUESTION 108
you are currently working on creating a spark stream process to read and write in for a one-time micro batch, and also rewrite the existing target table, fill in the blanks to complete the below command sucesfully.
1.spark.table(“source_table”)
2..writeStream
3..option(“____”, “dbfs:/location/silver”)
4..outputMode(“____”)
5..trigger(Once=____)
6..table(“target_table”)

checkpointlocation, complete, True

targetlocation, overwrite, True

checkpointlocation, True, overwrite

checkpointlocation, True, complete

checkpointlocation, overwrite, True

QUESTION 109
What is the purpose of the bronze layer in a Multi-hop architecture?

Can be used to eliminate duplicate records

Used as a data source for Machine learning applications.

Perform data quality checks, corrupt data quarantined

Contains aggregated data that is to be consumed into Silver

Provides efficient storage and querying of full unprocessed history of data

Explanation
The answer is Provides efficient storage and querying of full unprocessed history of data Medallion Architecture – Databricks Bronze Layer:
1.Raw copy of ingested data
2.Replaces traditional data lake
3.Provides efficient storage and querying of full, unprocessed history of data
4.No schema is applied at this layer
Exam focus: Please review the below image and understand the role of each layer(bronze, silver, gold) in medallion architecture, you will see varying questions targeting each layer and its purpose.
Sorry I had to add the watermark some people in Udemy are copying my content.

QUESTION 110
A data engineering team has been using a Databricks SQL query to monitor the performance of an ELT job.
The ELT job is triggered by a specific number of input records being ready to process. The Databricks SQL
query returns the number of minutes since the job’s most recent runtime.
Which of the following approaches can enable the data engineering team to be notified if the ELT job has not
been run in an hour?

They can set up an Alert for the query to notify when the ELT job fails

They can set up an Alert for the accompanying dashboard to notify them if the returned value is greater
than 60

They can set up an Alert for the accompanying dashboard to notify when it has not re-freshed in 60
minutes

They can set up an Alert for the query to notify them if the returned value is greater than 60

This type of alerting is not possible in Databricks

QUESTION 111
Which of the following command can be used to drop a managed delta table and the underlying files in the storage?

DROP TABLE table_name CASCADE

DROP TABLE table_name

Use DROP TABLE table_name command and manually delete files using com-mand dbutils.fs.rm(“/path”,True)

DROP TABLE table_name INCLUDE_FILES

DROP TABLE table and run VACUUM command

Explanation
The answer is DROP TABLE table_name,
When a managed table is dropped, the table definition is dropped from metastore and everything including data, metadata, and history are also dropped from storage.

QUESTION 112
At the end of the inventory process, a file gets uploaded to the cloud object storage, you are asked to build a process to ingest data which of the following method can be used to ingest the data in-crementally, schema of the file is expected to change overtime ingestion process should be able to handle these changes automatically.
Below is the auto loader to command to load the data, fill in the blanks for successful execution of below code.
1.spark.readStream
2..format(“cloudfiles”)
3..option(“_______”,”csv)
4..option(“_______”, ‘dbfs:/location/checkpoint/’)
5..load(data_source)
6..writeStream
7..option(“_______”,’ dbfs:/location/checkpoint/’)
8..option(“_______”, “true”)
9..table(table_name))

format, checkpointlocation, schemalocation, overwrite

cloudfiles.format, checkpointlocation, cloudfiles.schemalocation, overwrite

cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema

cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, overwrite

cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, append

Explanation
The answer is cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema.
Here is the end to end syntax of streaming ELT, below link contains complete options Auto Loader options | Databricks on AWS
1.spark.readStream
2..format(“cloudfiles”) # Returns a stream data source, reads data as it arrives based on the trigger.
3..option(“cloudfiles.format”,”csv”) # Format of the incoming files
4..option(“cloudfiles.schemalocation”, “dbfs:/location/checkpoint/”) The location to store the inferred schema and subsequent changes
5..load(data_source)
6..writeStream
7..option(“checkpointlocation”,”dbfs:/location/checkpoint/”) # The location of the stream’s checkpoint
8..option(“mergeSchema”, “true”) # Infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema.
9..table(table_name)) # target table

QUESTION 113
You are working on a process to load external CSV files into a delta table by leveraging the COPY INTO command, but after running the command for the second time no data was loaded into the table name, why is that?
1.COPY INTO table_name
2.FROM ‘dbfs:/mnt/raw/*.csv’
3.FILEFORMAT = CSV

COPY INTO only works one time data load

Run REFRESH TABLE sales before running COPY INTO

COPY INTO did not detect new files after the last load

Use incremental = TRUE option to load new files

COPY INTO does not support incremental load, use AUTO LOADER

Explanation
The answer is COPY INTO did not detect new files after the last load,
COPY INTO keeps track of files that were successfully loaded into the table, the next time when the COPY INTO runs it skips them.
FYI, you can change this behavior by using COPY_OPTIONS ‘force’= ‘true’, when this option is enabled all files in the path/pattern are loaded.
1.COPY INTO table_identifier
2. FROM [ file_location | (SELECT identifier_list FROM file_location) ]
3. FILEFORMAT = data_source
4. [FILES = [file_name, … | PATTERN = ‘regex_pattern’]
5. [FORMAT_OPTIONS (‘data_source_reader_option’ = ‘value’, …)]
6. [COPY_OPTIONS ‘force’ = (‘false’|’true’)]

QUESTION 114
What is the purpose of a silver layer in Multi hop architecture?

Replaces a traditional data lake

Efficient storage and querying of full and unprocessed history of data

A schema is enforced, with data quality checks.

Refined views with aggregated data

Optimized query performance for business-critical data

Explanation
The answer is, A schema is enforced, with data quality checks.
Medallion Architecture – Databricks
Silver Layer:
1.Reduces data storage complexity, latency, and redundency
2.Optimizes ETL throughput and analytic query performance
3.Preserves grain of original data (without aggregation)
4.Eliminates duplicate records
5.production schema enforced
6.Data quality checks, quarantine corrupt data
Exam focus: Please review the below image and understand the role of each layer(bronze, silver, gold) in medallion architecture, you will see varying questions targeting each layer and its purpose.
Sorry I had to add the watermark some people in Udemy are copying my content.

Loading …

Get Perfect Results with Premium Databricks-Certified-Professional-Data-Engineer Dumps Updated 220 Questions: https://www.passtestking.com/Databricks/Databricks-Certified-Professional-Data-Engineer-practice-exam-dumps.html

Categories:Databricks-Certified-Professional-Data-EngineerDatabricks

Tags:Databricks-Certified-Professional-Data-Engineer latest free study questions Databricks-Certified-Professional-Data-Engineer new practice questions free Databricks-Certified-Professional-Data-Engineer practical information Databricks-Certified-Professional-Data-Engineer valid test dumps questions

admin

Databricks-Certified-Data-Engineer-Associate / Databricks

[Oct 12, 2024] Databricks-Certified-Data-Engineer-Associate PDF Recently Updated Questions Dumps to Improve Exam Score [Q30-Q45]

Databricks-Certified-Professional-Data-Engineer Practice Tests

Free Databricks (Databricks-Certified-Professional-Data-Engineer) Certification Sample Questions with Online Practice Test [Q94-Q114]

Related Certifications

Databricks-Certified-Professional-Data-Engineer
Databricks-Certified-Data-Engineer-Associate

Exam Practice Questions

Free Databricks (Databricks-Certified-Professional-Data-Engineer) Certification Sample Questions with Online Practice Test [Q94-Q114]

Leave a Reply Cancel reply

Related Posts

[Oct 12, 2024] Databricks-Certified-Data-Engineer-Associate PDF Recently Updated Questions Dumps to Improve Exam Score [Q30-Q45]

Leave a Reply Cancel reply