Enroll Here: Spark MLlib Cognitive Class Exam Quiz Answers
Spark MLlib Cognitive Class Certification Answers
Module 1 – Spark MLlib Data Types Quiz Answers – Cognitive Class
Question 1: Sparse Data generally contains many non-zero values, and few zero values.
- True
- False
Question 2: Local matrices are generally stored in distributed systems and rarely on single machines.
- True
- False
Question 3: Which of the following are distributed matrices?
- Row Matrix
- Column Matrix
- Coordinate Matrix
- Spherical Matrix
- Row Matrix and Coordinate Matrix
- All of the Above
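For reference, the data types this module covers (dense and sparse local vectors, local matrices, and the distributed RowMatrix and CoordinateMatrix) can all be constructed with the RDD-based pyspark.mllib API. The sketch below is illustrative only and assumes a local Spark installation; the values are made up.

```python
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors, Matrices
from pyspark.mllib.linalg.distributed import RowMatrix, CoordinateMatrix, MatrixEntry

sc = SparkContext.getOrCreate()

# Local vectors: a dense vector stores every entry, a sparse vector only the non-zeros
dense_vec = Vectors.dense([1.0, 0.0, 3.0])
sparse_vec = Vectors.sparse(3, [0, 2], [1.0, 3.0])   # size, indices, values

# Local matrix: stored on a single machine (values given in column-major order)
local_mat = Matrices.dense(2, 2, [1.0, 2.0, 3.0, 4.0])

# Distributed matrices: backed by RDDs spread across the cluster
row_mat = RowMatrix(sc.parallelize([dense_vec, Vectors.dense([0.0, 1.0, 0.0])]))
coord_mat = CoordinateMatrix(sc.parallelize([MatrixEntry(0, 1, 2.5), MatrixEntry(2, 0, 1.0)]))
print(row_mat.numRows(), coord_mat.numCols())
```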
Module 2 – Review Algorithms Quiz Answers – Cognitive Class
Question 1: Logistic Regression is an algorithm used for predicting numerical values.
- True
- False
Question 2: The SVM algorithm maximizes the margins between the generated hyperplane and two clusters of data.
- True
- False
Question 3: Which of the following is true about Gaussian Mixture Clustering?
- The closer a data point is to a particular centroid, the more likely that data point is to be clustered with that centroid.
- The Gaussian of a centroid determines the probability that a data point is clustered with that centroid.
- The probability of a data point being clustered with a centroid is a function of distance from the point to the centroid.
- Gaussian Mixture Clustering uses multiple centroids to cluster data points.
- All of the Above
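As a quick refresher on the algorithms referenced in this module, the sketch below trains a logistic regression model (a classifier for categorical/binary responses) and an SVM (which fits a maximum-margin hyperplane) with the RDD-based pyspark.mllib API. The tiny dataset is invented purely for illustration; Gaussian Mixture clustering is sketched after Module 4.

```python
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, SVMWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

# Toy binary-labeled data: LabeledPoint(label, [features])
labeled = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(0.0, [0.5, 0.8]),
    LabeledPoint(1.0, [2.0, 0.5]),
    LabeledPoint(1.0, [2.5, 1.0]),
])

# Logistic regression predicts a categorical (here binary) response
lr_model = LogisticRegressionWithLBFGS.train(labeled)

# SVM fits the hyperplane that maximizes the margin between the two classes
svm_model = SVMWithSGD.train(labeled, iterations=100)

print(lr_model.predict([2.2, 0.7]), svm_model.predict([0.2, 0.9]))
```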
Module 3 – Spark MLlib Decision Trees and Random Forests Quiz Answers – Cognitive Class
Question 1: Which of the following is a stopping parameter in a Decision Tree?
- The number of nodes in the tree reaches a specific value.
- The depth of the tree reaches a specific value.
- The breadth of the tree reaches a specific value.
- All of the Above
Question 2: When using a regression type of Decision Tree or Random Forest, the value for impurity can be measured as either ‘entropy’ or ‘variance’.
- True
- False
Question 3: In a Random Forest, featureSubsetStrategy is considered a stopping parameter, but not a tunable parameter.
- True
- False
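The tree parameters this module asks about (impurity, maxDepth, maxBins, featureSubsetStrategy) appear directly in the pyspark.mllib training calls. Below is a minimal, illustrative sketch with invented data.

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, RandomForest

sc = SparkContext.getOrCreate()
labeled = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]), LabeledPoint(0.0, [0.5, 0.8]),
    LabeledPoint(1.0, [2.0, 0.5]), LabeledPoint(1.0, [2.5, 1.0]),
])

# Classification trees use 'gini' or 'entropy' impurity; regression trees use 'variance'.
# maxDepth acts as a stopping parameter, maxBins caps the number of split candidates.
dt_model = DecisionTree.trainClassifier(
    labeled, numClasses=2, categoricalFeaturesInfo={},
    impurity="entropy", maxDepth=5, maxBins=32)

# featureSubsetStrategy is a tunable Random Forest parameter, not a stopping criterion
rf_model = RandomForest.trainClassifier(
    labeled, numClasses=2, categoricalFeaturesInfo={},
    numTrees=10, featureSubsetStrategy="auto",
    impurity="gini", maxDepth=4, maxBins=32, seed=42)
```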
Module 4 – Spark MLlib Clustering Quiz Answers – Cognitive Class
Question 1: In Spark MLlib, the initialization mode for the K-Means training method is called
- k-means–
- k-means++
- k-means||
- k-means
Question 2: In K-Means, the “runs” parameter determines the number of data points allowed in each cluster.
- True
- False
Question 3: In Gaussian Mixture Clustering, the sum of all values outputted from the “weights” function must equal 1.
- True
- False
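The K-Means and Gaussian Mixture parameters mentioned in this module (initializationMode, epsilon, convergenceTol, weights) map onto the pyspark.mllib training calls as in the sketch below; the points are invented and the parameter values shown are simply the library defaults.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, GaussianMixture

sc = SparkContext.getOrCreate()
points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
                         [9.0, 9.0], [9.1, 9.2], [8.9, 9.1]])

# "k-means||" is the default initialization mode; epsilon is a convergence threshold
km = KMeans.train(points, k=2, maxIterations=20,
                  initializationMode="k-means||", epsilon=1e-4)

# convergenceTol plays the analogous stopping role for Gaussian mixtures
gmm = GaussianMixture.train(points, k=2, convergenceTol=1e-3, maxIterations=100)
print(gmm.weights)                      # mixture weights; they sum to 1
print(gmm.predictSoft([0.05, 0.05]))    # membership values for every Gaussian
```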
Spark MLlib Final Exam Answers – Cognitive Class
Question 1: In Gaussian Mixture Clustering, the predictSoft function provides membership values from the top three Gaussians only.
- True
- False
Question 2: In Decision Trees, what is true about the size of a dataset?
- Large datasets create “bins” on splits, which can be specified with the maxBins parameter.
- Large datasets sort feature values, then use the ordered values as split calculations.
- Small datasets create split candidates based on quantile calculations on a sample of the data.
- Small datasets split on random values for the feature.
Question 3: A Logistic Regression algorithm is ineffective as a binary response predictor.
- True
- False
Question 4: What is the Row Pointer for a Matrix with the following Row Indices: [5, 1 | 6 | 2, 8, 10]
- [1, 6]
- [0, 2, 3, 6]
- [0, 2, 3, 5]
- [2, 3]
Question 5: For multiclass classification, try to use (M-1) Decision Tree split candidates whenever possible.
- True
- False
Question 6: In a Decision Tree, choosing a very large maxDepth value can:
- Increase accuracy
- Increase the risk of overfitting to the training set
- Increase the cost of training
- All of the Above
- Increase the risk of overfitting and increase the cost of training
Question 7: In Gaussian Mixture Clustering, a large value returned from the weights function represents a large precedence of that Gaussian.
- True
- False
Question 8: Increasing the value of epsilon when creating the K-Means Clustering model can:
- Decrease training cost and decrease the number of iterations that the model undergoes
- Decrease training cost and increase the number of iterations that the model undergoes
- Increase training cost and decrease the number of iterations that the model undergoes
- Increase training cost and increase the number of iterations that the model undergoes
Question 9: In order to train a machine learning model in Spark MLlib, the dataset must be in the form of a(n)
- Python List
- Textfile
- CSV file
- RDD
Question 10: What is true about Dense and Sparse Vectors?
- A Dense Vector can be created using a csc_matrix, and a Sparse Vector can be created using a Python List.
- A Dense Vector can be created using a SciPy csc_matrix, and a Sparse Vector can be created using a SciPy NumPy Array.
- A Dense Vector can be created using a Python List, and a Sparse Vector can be created using a SciPy csc_matrix.
- A Dense Vector can be created using a SciPy NumPy Array, and a Sparse Vector can be created using a Python List.
Question 11: In a Decision Tree, increasing the maxBins parameter allows for more splitting candidates.
- True
- False
Question 12: In classification models, the value for the numClasses parameter does not depend on the data, and can change to increase model accuracy.
- True
- False
Question 13: What is true about Labeled Points?
- A – A labeled point is used with supervised machine learning, and can be made using a dense local vector.
- B – A labeled point is used with unsupervised machine learning, and can be made using a dense local vector.
- C – A labeled point is used with supervised machine learning, and can be made using a sparse local vector.
- D – A labeled point is used with unsupervised machine learning, and can be made using a sparse local vector.
- All of the Above
- A and C only
Question 14: In the Gaussian Mixture Clustering model, the convergenceTol value is a stopping parameter that can be tuned, similar to epsilon in k-means clustering.
- True
- False
Question 15: In Gaussian Mixture Clustering, the “Gaussians” function outputs the coordinates of the largest Gaussian, as well as the standard deviation for each Gaussian in the mixture.
- True
- False
Question 16: What is true about the maxDepth parameter for Random Forests?
- A large maxDepth value is preferred since tree averaging yields a decrease in overall bias.
- A large maxDepth value is preferred since tree averaging yields a decrease in overall variance.
- A large maxDepth value is preferred since tree averaging yields an increase in overall bias.
- A large maxDepth value is preferred since tree averaging yields an increase in overall variance.
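Several of the final-exam questions hinge on how labeled points and vectors are constructed. The sketch below (with made-up values) shows the combinations the MLlib documentation describes: dense features from a Python list or NumPy array, and sparse features from an MLlib SparseVector or a single-column SciPy csc_matrix.

```python
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Dense features from a Python list or a NumPy array
dense_lp = LabeledPoint(1.0, [1.0, 0.0, 3.0])
numpy_lp = LabeledPoint(0.0, np.array([1.0, 0.0, 3.0]))

# Sparse features from MLlib's SparseVector or a single-column SciPy csc_matrix
sparse_lp = LabeledPoint(1.0, Vectors.sparse(3, [0, 2], [1.0, 3.0]))
scipy_lp = LabeledPoint(1.0, sps.csc_matrix(
    (np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1)))
```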
Introduction to Spark MLlib
Spark MLlib (Machine Learning Library) is Apache Spark’s scalable and distributed machine learning library. It provides a set of high-level APIs built on top of Spark that simplifies the development of scalable machine learning applications. MLlib supports various machine learning algorithms and tools for data preprocessing, feature engineering, and model evaluation. Here are key aspects of Spark MLlib:
1. Scalability and Distributed Computing:
- MLlib is designed to scale horizontally and can efficiently handle large-scale machine learning tasks by distributing computations across a Spark cluster.
2. High-Level APIs:
- MLlib provides high-level APIs in Scala, Java, Python, and R, making it accessible to a broad audience of developers and data scientists.
3. Algorithms:
- Supervised Learning:
- Classification: Logistic Regression, Decision Trees, Random Forests, Gradient-Boosted Trees, Support Vector Machines (SVM), etc.
- Regression: Linear Regression, Generalized Linear Regression, etc.
- Unsupervised Learning:
- Clustering: K-Means, Gaussian Mixture Model (GMM), Bisecting K-Means, etc.
- Dimensionality Reduction: Principal Component Analysis (PCA), Singular Value Decomposition (SVD), etc.
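To make the algorithm list above concrete, here is a minimal, illustrative sketch that trains one of the listed classifiers (logistic regression) with the DataFrame-based API; the tiny dataset is invented.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1])),
    (0.0, Vectors.dense([0.5, 0.3])),
    (1.0, Vectors.dense([2.0, 1.0])),
    (1.0, Vectors.dense([2.2, 1.5])),
], ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
model.transform(train).select("label", "prediction").show()
```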
4. Data Preprocessing and Feature Engineering:
- MLlib provides tools for feature extraction, transformation, and selection, including vectorization, normalization, and handling missing values.
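For example, VectorAssembler and StandardScaler are two of the built-in feature tools; the sketch below (invented columns and values) assembles raw columns into a feature vector and scales it.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 0.5, 3.0), (2.0, 1.5, 0.0), (0.5, 2.5, 1.0)],
                           ["f1", "f2", "f3"])

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)

assembled = assembler.transform(df)
scaler.fit(assembled).transform(assembled).select("scaledFeatures").show(truncate=False)
```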
5. Pipelines:
- MLlib supports the concept of Pipelines, a high-level API for constructing, tuning, and deploying machine learning workflows. Pipelines help streamline the development process and ensure consistency.
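A Pipeline simply chains transformers and an estimator so they are fit and applied in order. A minimal sketch with invented data:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.0, 0.1), (0.0, 0.9, 0.4), (1.0, 0.0, 2.3), (1.0, 0.2, 2.0)],
    ["label", "x1", "x2"])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),
    LogisticRegression(maxIter=10),
])
pipeline_model = pipeline.fit(df)    # fits every stage in order
pipeline_model.transform(df).select("label", "prediction").show()
```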
6. Model Persistence:
- MLlib allows you to save and load machine learning models for later use. This is crucial for deploying models in production environments.
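Persistence is a write/load round trip on the fitted model, as in the sketch below; the save path is purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame([(0.0, Vectors.dense([0.0, 1.0])),
                               (1.0, Vectors.dense([2.0, 1.5]))],
                              ["label", "features"])

model = LogisticRegression(maxIter=5).fit(train)
model.write().overwrite().save("/tmp/lr-model")          # illustrative path
reloaded = LogisticRegressionModel.load("/tmp/lr-model")
print(reloaded.coefficients)
```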
7. Integration with Spark Ecosystem:
- MLlib seamlessly integrates with other Spark components such as Spark SQL for data manipulation, Spark Streaming for real-time data processing, and Spark GraphX for graph analytics.
8. Hyperparameter Tuning:
- MLlib includes tools for hyperparameter tuning, enabling you to optimize model performance by searching through different combinations of hyperparameters.
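For instance, ParamGridBuilder and CrossValidator search over combinations of hyperparameters; the grid values and the tiny dataset below are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.0])), (0.0, Vectors.dense([0.3, 0.8])),
    (0.0, Vectors.dense([0.2, 1.2])), (1.0, Vectors.dense([2.0, 1.5])),
    (1.0, Vectors.dense([1.8, 1.2])), (1.0, Vectors.dense([2.3, 1.1])),
], ["label", "features"])

lr = LogisticRegression(maxIter=10)
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=2)
best_model = cv.fit(train).bestModel
```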
9. Model Evaluation:
- MLlib provides metrics for evaluating the performance of machine learning models, including classification metrics, regression metrics, and clustering metrics.
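Evaluators only need a DataFrame containing prediction and label columns; the toy values below are invented.

```python
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator, MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()

reg_preds = spark.createDataFrame([(0.0, 0.2), (1.0, 0.8), (2.0, 2.1)],
                                  ["label", "prediction"])
print(RegressionEvaluator(metricName="rmse").evaluate(reg_preds))

cls_preds = spark.createDataFrame([(0.0, 0.0), (1.0, 1.0), (1.0, 0.0)],
                                  ["label", "prediction"])
print(MulticlassClassificationEvaluator(metricName="accuracy").evaluate(cls_preds))
```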
10. Distributed Machine Learning:
- MLlib is designed for distributed machine learning, allowing you to leverage the power of Spark clusters for parallel and distributed processing.
11. Supported Data and File Formats:
- MLlib works with Spark DataFrames and RDDs (Resilient Distributed Datasets) as its core data abstractions, and supports the LIBSVM text format for large-scale training datasets.
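Reading LIBSVM-formatted data works through either API; the file path below is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.util import MLUtils

spark = SparkSession.builder.getOrCreate()

# DataFrame API: yields a DataFrame with "label" and "features" columns
df = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")  # illustrative path

# RDD API: yields an RDD of LabeledPoint objects
rdd = MLUtils.loadLibSVMFile(spark.sparkContext, "data/sample_libsvm_data.txt")
```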
12. Community and Contributions:
- MLlib benefits from the active Apache Spark community, which contributes to its development and maintenance. The library continues to evolve with new algorithms and features.
13. Extensibility:
- While MLlib provides a rich set of built-in algorithms, it is extensible, allowing you to implement custom algorithms and transformers.
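As a rough illustration of extensibility, a custom transformer can subclass pyspark.ml.Transformer and override _transform. The ColumnDoubler class below is hypothetical and kept minimal; a production version would also use the Params mixins and DefaultParamsWritable so it can be tuned and persisted.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Transformer

class ColumnDoubler(Transformer):
    """Hypothetical custom transformer: doubles a numeric column."""
    def __init__(self, inputCol, outputCol):
        super().__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        # Add the output column by doubling the input column
        return dataset.withColumn(self.outputCol, F.col(self.inputCol) * 2)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.5,)], ["x"])
ColumnDoubler("x", "x_doubled").transform(df).show()
```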