[Mar 06, 2023] Valid Associate-Developer-Apache-Spark Test Answers & Associate-Developer-Apache-Spark Exam PDF [Q95-Q112]

Valid Databricks Certification Associate-Developer-Apache-Spark Dumps Ensure Your Passing

NEW QUESTION 95
The code block shown below should return a single-column DataFrame with a column named consonant_ct that, for each row, shows the number of consonants in column itemName of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.select(__1__(__2__(__3__(__4__), "a|e|i|o|u|\s", "")).__5__("consonant_ct"))

  • A. 1. length
    2. regexp_replace
    3. lower
    4. col("itemName")
    5. alias
  • B. 1. size
    2. regexp_replace
    3. lower
    4. "itemName"
    5. alias
  • C. 1. size
    2. regexp_extract
    3. lower
    4. col("itemName")
    5. alias
  • D. 1. length
    2. regexp_extract
    3. upper
    4. col("itemName")
    5. as
  • E. 1. lower
    2. regexp_replace
    3. length
    4. "itemName"
    5. alias

Answer: A

Explanation:
Correct code block:
itemsDf.select(length(regexp_replace(lower(col("itemName")), "a|e|i|o|u|\s", "")).alias("consonant_ct"))
Returned DataFrame:
+------------+
|consonant_ct|
+------------+
|          19|
|          16|
|          10|
+------------+
This question tries to make you think about the string functions Spark provides and in which order they should be applied. Arguably the most difficult part, the regular expression "a|e|i|o|u|\s", is not a numbered blank. However, if you are not familiar with the string functions, it may be a good idea to review those before the exam.
The size operator and the length operator can easily be confused. size works on arrays, while length works on strings. Luckily, this is something you can read up about in the documentation.
The code block works by first converting all uppercase letters in column itemName into lowercase (the lower() part). Then, it replaces all vowels and whitespace characters with an empty string "" (the regexp_replace() part). Now, only lowercase consonants without spaces remain in each value. Then, per row, the length operator counts these remaining characters. Note that column itemName in itemsDf does not include any numbers or other characters, so we do not need to make any provisions for these. Finally, by using the alias() operator, we rename the resulting column to consonant_ct.
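The same three steps can be checked without a Spark session. The following is a plain-Python sketch, not the Spark code itself: it assumes Python's re module treats the pattern the same way Spark's regexp_replace does for this simple expression, and the helper name consonant_count is made up for illustration.

```python
import re

# Values of column itemName, copied from the sample DataFrame in the question.
item_names = [
    "Thick Coat for Walking in the Snow",
    "Elegant Outdoors Summer Dress",
    "Outdoors Backpack",
]


def consonant_count(name):
    """Mirror length(regexp_replace(lower(...), pattern, "")) in plain Python."""
    lowered = name.lower()                              # the lower() step
    stripped = re.sub(r"a|e|i|o|u|\s", "", lowered)     # the regexp_replace() step
    return len(stripped)                                # the length() step


consonant_ct = [consonant_count(name) for name in item_names]
print(consonant_ct)  # [19, 16, 10]
```

The printed counts match the consonant_ct column in the returned DataFrame.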
More info:
- lower: pyspark.sql.functions.lower - PySpark 3.1.2 documentation
- regexp_replace: pyspark.sql.functions.regexp_replace - PySpark 3.1.2 documentation
- length: pyspark.sql.functions.length - PySpark 3.1.2 documentation
- alias: pyspark.sql.Column.alias - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 96
Which of the following is the deepest level in Spark's execution hierarchy?

  • A. Task
  • B. Slot
  • C. Job
  • D. Executor
  • E. Stage

Answer: A

Explanation:
The hierarchy is, from top to bottom: Job, Stage, Task.
Executors and slots facilitate the execution of tasks, but they are not directly part of the hierarchy. Executors are launched by the driver on worker nodes for the purpose of running a specific Spark application. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.

 

NEW QUESTION 97
The code block displayed below contains an error. The code block should configure Spark to split data in 20 parts when exchanging data between executors for joins or aggregations. Find the error.
Code block:
spark.conf.set(spark.sql.shuffle.partitions, 20)

  • A. The code block is missing a parameter.
  • B. The code block sets the wrong option.
  • C. The code block sets the incorrect number of parts.
  • D. The code block uses the wrong command for setting an option.
  • E. The code block expresses the option incorrectly.

Answer: E

Explanation:
Correct code block:
spark.conf.set("spark.sql.shuffle.partitions", 20)
The code block expresses the option incorrectly.
Correct! The option should be expressed as a string.
The code block sets the wrong option.
No, spark.sql.shuffle.partitions is the correct option for the use case in the question.
The code block sets the incorrect number of parts.
Wrong, the code block correctly states 20 parts.
The code block uses the wrong command for setting an option.
No, in PySpark spark.conf.set() is the correct command for setting an option.
The code block is missing a parameter.
Incorrect, spark.conf.set() takes two parameters.
More info: Configuration - Spark 3.1.2 Documentation

 

NEW QUESTION 98
Which of the following code blocks returns a copy of DataFrame transactionsDf in which column productId has been renamed to productNumber?

  • A. transactionsDf.withColumnRenamed(col(productId), col(productNumber))
  • B. transactionsDf.withColumnRenamed(productId, productNumber)
  • C. transactionsDf.withColumnRenamed("productNumber", "productId")
  • D. transactionsDf.withColumn("productId", "productNumber")
  • E. transactionsDf.withColumnRenamed("productId", "productNumber")

Answer: E

Explanation:
More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 99
Which of the following describes Spark actions?

  • A. The driver receives data upon request by actions.
  • B. Stage boundaries are commonly established by actions.
  • C. Writing data to disk is the primary purpose of actions.
  • D. Actions are Spark's way of modifying RDDs.
  • E. Actions are Spark's way of exchanging data between executors.

Answer: A

Explanation:
The driver receives data upon request by actions.
Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.
Actions are Spark's way of exchanging data between executors.
No. In Spark, data is exchanged between executors via shuffles.
Writing data to disk is the primary purpose of actions.
No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.
Actions are Spark's way of modifying RDDs.
Incorrect. Firstly, RDDs are immutable - they cannot be modified. Secondly, Spark generates new RDDs via transformations and not actions.
Stage boundaries are commonly established by actions.
Wrong. A stage boundary is commonly established by a shuffle, for example caused by a wide transformation.

 

NEW QUESTION 100
Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?

  • A. transactionsDf.select(col('value'), col('productId')).agg({'*': 'count'})
  • B. transactionsDf.select('value', 'productId').distinct()
  • C. transactionsDf.select('value').join(transactionsDf.select('productId'), col('value')==col('productId'), 'outer')
  • D. transactionsDf.select('value').union(transactionsDf.select('productId')).distinct()
  • E. transactionsDf.agg({'value': 'collect_set', 'productId': 'collect_set'})

Answer: D

Explanation:
transactionsDf.select('value').union(transactionsDf.select('productId')).distinct()
Correct. This code block uses a common pattern for finding the unique values across multiple columns: union and distinct. In fact, it is so common that it is even mentioned in the Spark documentation for the union command (link below).
transactionsDf.select('value', 'productId').distinct()
Wrong. This code block returns unique rows, but not unique values.
transactionsDf.agg({'value': 'collect_set', 'productId': 'collect_set'})
Incorrect. This code block will output a one-row, two-column DataFrame where each cell has an array of unique values in the respective column (even omitting any nulls).
transactionsDf.select(col('value'), col('productId')).agg({'*': 'count'})
No. This command will count the number of rows, but will not return unique values.
transactionsDf.select('value').join(transactionsDf.select('productId'), col('value')==col('productId'), 'outer')
Wrong. This command will perform an outer join of the value and productId columns. As such, it will return a two-column DataFrame. If you picked this answer, it might be a good idea for you to read up on the difference between union and join; a link is posted below.
More info: pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation, sql - What is the difference between JOIN and UNION? - Stack Overflow
Static notebook | Dynamic notebook: See test 3
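To see why unique rows and unique values differ, here is a small plain-Python analogy. It is only a sketch: the three rows are invented sample data, and sets and lists stand in for the Spark operations named in the comments.

```python
# Hypothetical rows with the two columns from the question.
rows = [
    {"value": 1, "productId": 3},
    {"value": 3, "productId": 3},
    {"value": 1, "productId": 2},
]

# Analogy for select('value', 'productId').distinct():
# uniqueness is judged per (value, productId) pair, so the result keeps two columns.
distinct_rows = {(row["value"], row["productId"]) for row in rows}

# Analogy for select('value').union(select('productId')).distinct():
# both columns are stacked into a single column first, then deduplicated.
stacked = [row["value"] for row in rows] + [row["productId"] for row in rows]
distinct_values = sorted(set(stacked))

print(len(distinct_rows))  # 3
print(distinct_values)     # [1, 2, 3]
```

All three rows survive the row-level deduplication, while the stacked column collapses to three single values.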

 

NEW QUESTION 101
The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.
Code block:
transactionsDf.coalesce(14, ("storeId", "transactionDate"))

  • A. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.
  • B. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.
  • C. The parentheses around the column names need to be removed and .select() needs to be appended to the code block.
  • D. Operator coalesce needs to be replaced by repartition.
  • E. Operator coalesce needs to be replaced by repartition and the parentheses around the column names need to be replaced by square brackets.

Answer: B

Explanation:
Correct code block:
transactionsDf.repartition(14, "storeId", "transactionDate").count()
Since we do not know how many partitions DataFrame transactionsDf has, we cannot safely use coalesce, since it would not make any change if the current number of partitions is smaller than 14.
So, we need to use repartition.
In the Spark documentation, the call structure for repartition is shown like this:
DataFrame.repartition(numPartitions, *cols). The * operator means that any argument after numPartitions will be interpreted as a column. Therefore, the parentheses around the column names need to be removed.
Finally, the question specifies that after the execution the DataFrame should be divided. So, indirectly, this question is asking us to append an action to the code block. Since .select() is a transformation, the only possible choice here is .count().
More info: pyspark.sql.DataFrame.repartition - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
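The effect of the * in the signature can be tried in plain Python. The function below is a hypothetical stand-in that copies only the call shape repartition(numPartitions, *cols); it makes no claim about what PySpark does internally with its arguments.

```python
def repartition(num_partitions, *cols):
    """Hypothetical stand-in: return the arguments exactly as Python binds them."""
    return num_partitions, cols


# Column names wrapped in parentheses arrive as ONE argument (a tuple).
wrapped = repartition(14, ("storeId", "transactionDate"))

# Column names passed separately arrive as TWO arguments, one per column.
separate = repartition(14, "storeId", "transactionDate")

print(wrapped)   # (14, (('storeId', 'transactionDate'),))
print(separate)  # (14, ('storeId', 'transactionDate'))
```

With the parentheses in place, cols holds a single tuple rather than two column names, which is why the answer asks for them to be removed.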

 

NEW QUESTION 102
The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.
Code block:
transactionsDf.agg("storeId").avg("value")

  • A. Instead of avg("value"), avg(col("value")) should be used.
  • B. The avg("value") should be specified as a second argument to agg() instead of being appended to it.
  • C. "storeId" and "value" should be swapped.
  • D. All column names should be wrapped in col() operators.
  • E. agg should be replaced by groupBy.

Answer: E

Explanation:
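Correct code block:
transactionsDf.groupBy("storeId").avg("value")
groupBy("storeId") groups the rows by store, and avg("value") then computes the average of column value per group.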
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/30.html ,
https://bit.ly/sparkpracticeexams_import_instructions)

 

NEW QUESTION 103
The code block displayed below contains an error. The code block should combine data from DataFrames itemsDf and transactionsDf, showing all rows of DataFrame itemsDf that have a matching value in column itemId with a value in column transactionId of DataFrame transactionsDf. Find the error.
Code block:
itemsDf.join(itemsDf.itemId==transactionsDf.transactionId)

  • A. The union method should be used instead of join.
  • B. The join expression is malformed.
  • C. The join method is inappropriate.
  • D. The merge method should be used instead of join.
  • E. The join statement is incomplete.

Answer: E

Explanation:
Correct code block:
itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.transactionId)
The join statement is incomplete.
Correct! If you look at the documentation of DataFrame.join() (linked below), you see that the very first argument of join should be the DataFrame that should be joined with. This first argument is missing in the code block.
The join method is inappropriate.
No. By default, DataFrame.join() uses an inner join. This method is appropriate for the scenario described in the question.
The join expression is malformed.
Incorrect. The join expression itemsDf.itemId==transactionsDf.transactionId is correct syntax.
The merge method should be used instead of join.
False. There is no DataFrame.merge() method in PySpark.
The union method should be used instead of join.
Wrong. DataFrame.union() merges rows, but not columns as requested in the question.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation, pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3

 

NEW QUESTION 104
Which of the following code blocks returns a DataFrame with a single column in which all items in column attributes of DataFrame itemsDf are listed that contain the letter i?
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+

  • A. itemsDf.explode(attributes).alias("attributes_exploded").filter(col("attributes_exploded").contains("i"))
  • B. itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(col("attributes_exploded").contain
  • C. itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(attributes_exploded.contains("i"))
  • D. itemsDf.select(explode("attributes")).filter("attributes_exploded".contains("i"))
  • E. itemsDf.select(col("attributes").explode().alias("attributes_exploded")).filter(col("attributes_exploded").co

Answer: B

Explanation:
Result of correct code block:
+-------------------+
|attributes_exploded|
+-------------------+
|             winter|
|            cooling|
+-------------------+
To solve this question, you need to know about explode(). This operation helps you to split up arrays into single rows. If you did not have a chance to familiarize yourself with this method yet, find more examples in the documentation (link below).
Note that explode() is a method made available through pyspark.sql.functions - it is not available as a method of a DataFrame or a Column, as written in some of the answer options.
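As a rough plain-Python analogy (not Spark code), exploding an array column resembles flattening a list of lists into one item per row, and the filter then keeps the items that contain the letter i. The variable names are chosen for illustration; the attribute lists are copied from the sample DataFrame.

```python
# Column attributes from the sample DataFrame, one list per row.
attributes_by_item = [
    ["blue", "winter", "cozy"],
    ["red", "summer", "fresh", "cooling"],
    ["green", "summer", "travel"],
]

# Analogy for explode("attributes"): one output element per array item.
attributes_exploded = [attr for attrs in attributes_by_item for attr in attrs]

# Analogy for filter(col("attributes_exploded").contains("i")).
with_i = [attr for attr in attributes_exploded if "i" in attr]

print(len(attributes_exploded))  # 10
print(with_i)                    # ['winter', 'cooling']
```

The two remaining items are the same ones shown in the result of the correct code block.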
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 105
The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.
Code block:
transactionsDf.format("parquet").option("mode", "append").save(path)

  • A. save() is evaluated lazily and needs to be followed by an action.
  • B. Given that the DataFrame should be saved as parquet file, path is being passed to the wrong method.
  • C. The mode option should be omitted so that the command uses the default mode.
  • D. The code block is missing a bucketBy command that takes care of partitions.
  • E. The code block is missing a reference to the DataFrameWriter.

Answer: E

Explanation:
Correct code block:
transactionsDf.write.format("parquet").option("mode", "append").save(path)

 

NEW QUESTION 106
The code block shown below should return only the average prediction error (column predError) of a random subset, without replacement, of approximately 15% of rows in DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, __3__).__4__(avg('predError'))

  • A. 1. sample
    2. True
    3. 0.15
    4. filter
  • B. 1. sample
    2. False
    3. 0.15
    4. select
  • C. 1. fraction
    2. False
    3. 0.85
    4. select
  • D. 1. sample
    2. 0.85
    3. False
    4. select
  • E. 1. fraction
    2. 0.15
    3. True
    4. where

Answer: B

Explanation:
Correct code block:
transactionsDf.sample(withReplacement=False, fraction=0.15).select(avg('predError'))
You should remember that getting a random subset of rows means sampling. This, in turn, should point you to the DataFrame.sample() method. Once you know this, you can look up the correct order of arguments in the documentation (link below).
Lastly, you have to decide whether to use filter, where or select. where is just an alias for filter(). filter() is not the correct method to use here, since it would only allow you to filter rows based on some condition. However, the question asks to return only the average prediction error. You can control the columns that a query returns with the select() method - so this is the correct method to use here.
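The word "approximately" in the question can be illustrated with a simplified plain-Python sketch. It assumes that sampling without replacement keeps each row independently with probability equal to the fraction, which is a simplification for illustration and not Spark's actual implementation. The helper bernoulli_sample, the seed, and the sample values are all hypothetical.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (no row is kept twice)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# Hypothetical predError values, one per row.
pred_errors = list(range(1000))

subset = bernoulli_sample(pred_errors, fraction=0.15, seed=42)

# Analogy for select(avg('predError')): a single aggregated number is returned.
average = sum(subset) / len(subset) if subset else None

print(len(subset))  # close to 150, but usually not exactly 150
print(average)
```

Because every row is decided independently, the subset size varies around 15% of the rows instead of hitting it exactly.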
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 107
Which of the following describes characteristics of the Dataset API?

  • A. In Python, the Dataset API mainly resembles Pandas' DataFrame API.
  • B. The Dataset API does not support unstructured data.
  • C. The Dataset API is available in Scala, but it is not available in Python.
  • D. In Python, the Dataset API's schema is constructed via type hints.
  • E. The Dataset API does not provide compile-time type safety.

Answer: C

Explanation:
The Dataset API is available in Scala, but it is not available in Python.
Correct. The Dataset API uses fixed typing and is typically used for object-oriented programming. It is available when Spark is used with the Scala programming language, but not for Python. In Python, you use the DataFrame API, which is based on the Dataset API.
The Dataset API does not provide compile-time type safety.
No - in fact, depending on the use case, the type safety that the Dataset API provides is an advantage.
The Dataset API does not support unstructured data.
Wrong, the Dataset API supports structured and unstructured data.
In Python, the Dataset API's schema is constructed via type hints.
No, this is not applicable since the Dataset API is not available in Python.
In Python, the Dataset API mainly resembles Pandas' DataFrame API.
The Dataset API does not exist in Python, only in Scala and Java.

 

NEW QUESTION 108
The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf. Find the error.
Code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

  • A. The Python function is unable to handle null values, resulting in the code block crashing on execution.
  • B. The operator used for adding the column does not add column predErrorAdded to the DataFrame.
  • C. UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
  • D. The udf() method does not declare a return type.
  • E. Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.

Answer: B

Explanation:
Correct code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError"))).show()
Instead of withColumnRenamed, you should use the withColumn operator.
The udf() method does not declare a return type.
It is fine that the udf() method does not declare a return type, this is not a required argument. However, the default return type is StringType. This may not be the ideal return type for numeric, nullable data - but the code will run without specified return type nevertheless.
The Python function is unable to handle null values, resulting in the code block crashing on execution.
The Python function is able to handle null values, this is what the statement if x is None does.
UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
No, they are available through the Python API. The code in the code block that concerns UDFs is correct.
Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
You may choose to use the transactionsDf.predError syntax, but the col("predError") syntax is fine.
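The null handling of the Python function can be verified on its own, without wrapping it in udf(). The sketch below calls the function from the question directly; the input values are made up for illustration.

```python
def add_2_if_geq_3(x):
    # None stands in for a null value in the nullable column predError.
    if x is None:
        return x
    elif x >= 3:
        return x + 2
    return x


# Hypothetical inputs: a null, a value below 3, the boundary value, and a larger value.
results = [add_2_if_geq_3(value) for value in [None, 2.5, 3, 7.0]]
print(results)  # [None, 2.5, 5, 9.0]
```

The None input passes through unchanged instead of raising an error, which is why the option about null values is wrong.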

 

NEW QUESTION 109
The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.
Code block:
transactionsDf.filter(col('predError').in([3, 6])).count()

  • A. Instead of a list, the values need to be passed as single arguments to the in operator.
  • B. The method used on column predError is incorrect.
  • C. The number of rows cannot be determined with the count() operator.
  • D. Numbers 3 and 6 need to be passed as string variables.
  • E. Instead of filter, the select method should be used.

Answer: B

Explanation:
Correct code block:
transactionsDf.filter(col('predError').isin([3, 6])).count()
The isin method is the correct one to use here - the in method does not exist for the Column object.
More info: pyspark.sql.Column.isin - PySpark 3.1.2 documentation
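There is also a Python-level reason why a method named in cannot be called this way: in is a reserved keyword, so the original code block does not even parse. The sketch below checks this with the standard library only. The helper compiles is hypothetical, and compile() only parses the expression, so col does not need to be defined.

```python
import keyword


def compiles(expression):
    """Return True if the expression parses as valid Python, False on a SyntaxError."""
    try:
        compile(expression, "<example>", "eval")
        return True
    except SyntaxError:
        return False


in_is_keyword = keyword.iskeyword("in")
in_call_compiles = compiles("col('predError').in([3, 6])")
isin_call_compiles = compiles("col('predError').isin([3, 6])")

print(in_is_keyword)       # True
print(in_call_compiles)    # False
print(isin_call_compiles)  # True
```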

 

NEW QUESTION 110
Which of the following describes the role of tasks in the Spark execution hierarchy?

  • A. Within one task, the slots are the unit of work done for each partition of the data.
  • B. Tasks with wide dependencies can be grouped into one stage.
  • C. Stages with narrow dependencies can be grouped into one task.
  • D. Tasks are the smallest element in the execution hierarchy.
  • E. Tasks are the second-smallest element in the execution hierarchy.

Answer: D

Explanation:
Stages with narrow dependencies can be grouped into one task.
Wrong, tasks with narrow dependencies can be grouped into one stage.
Tasks with wide dependencies can be grouped into one stage.
Wrong, since a wide transformation causes a shuffle which always marks the boundary of a stage. So, you cannot bundle multiple tasks that have wide dependencies into a stage.
Tasks are the second-smallest element in the execution hierarchy.
No, they are the smallest element in the execution hierarchy.
Within one task, the slots are the unit of work done for each partition of the data.
No, tasks are the unit of work done per partition. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.

 

NEW QUESTION 111
Which of the following code blocks displays various aggregated statistics of all columns in DataFrame transactionsDf, including the standard deviation and minimum of values in each column?

  • A. transactionsDf.summary().show()
  • B. transactionsDf.summary("count", "mean", "stddev", "25%", "50%", "75%", "max").show()
  • C. transactionsDf.agg("count", "mean", "stddev", "25%", "50%", "75%", "min").show()
  • D. transactionsDf.summary()
  • E. transactionsDf.agg("count", "mean", "stddev", "25%", "50%", "75%", "min")

Answer: A

Explanation:
The DataFrame.summary() command is very practical for quickly calculating statistics of a DataFrame. You need to call .show() to display the results of the calculation. By default, the command calculates various statistics (see documentation linked below), including standard deviation and minimum.
Note that the answer that lists many options in the summary() parentheses does not include the minimum, which is asked for in the question.
Answer options that include agg() do not work here as shown, since DataFrame.agg() expects more complex, column-specific instructions on how to aggregate values.
More info:
- pyspark.sql.DataFrame.summary - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.agg - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3

 

NEW QUESTION 112
......

Associate-Developer-Apache-Spark Dumps Real Exam Questions Test Engine Dumps Training: https://examcollection.realvce.com/Associate-Developer-Apache-Spark-original-questions.html