Introduction to Machine Learning Coursera Quiz Answers

Week 1 Comprehensive

1.
Question 1
Which of the following are necessary for supervised machine learning? (Choose all that are correct.)

1 point

A model
Learning from data
Labeled training data
Human to teach the machine

2.
Question 2
What decision boundary can logistic regression provide?

1 point

Arbitrarily complex functions
Jagged edges
Smooth curves
Linear

3.
Question 3
What is the primary advantage of using multiple filters?

1 point

More complexity is always better.
This requires less compute power.
This allows the model to look for subtypes of the classification.
This is simpler to implement.

4.
Question 4
Which one of the following best describes transfer learning in the context of document analysis?

1 point

All parameters of the model are different between individuals.
Parameters at the bottom of the model are transferable across all people and documents, while the parameters at the top are different between individuals.
All parameters of the model are transferable across all people and documents.
Parameters at the top of the model are transferable across all people and documents, while the parameters at the bottom are different between individuals.

5.
Question 5
Given the following image of data classifications, which of the following models would you choose?

1 point

Logistic regression
Multilayer perceptron

6.
Question 6
What new feature did neural networks acquire in 2010?

1 point

A new computational platform: the GPU
A new application: image search
A new operation: convolution
A new name: Deep Learning

7.
Question 7
Which of the following is convolved with layer 2 features, or sub-motifs?

1 point

Layer 2 feature map
Layer 1 feature map
Layer 3 feature map

8.
Question 8
Which of the following gives the best conceptual meaning of convolution?

1 point

Surveying a feature map for a high-level motif.
Selecting an atomic element from an image.
Stacking a collection of feature maps.
Shifting a filter to every location in an image.

9.
Question 9
What does transfer learning mean in the context of medical imaging?

1 point

Just as assigning categories to images in ImageNet required millions of images, so too does analyzing medical images require millions of labeled medical images.
Sufficient labeled radiological images can be used to learn all of the model parameters, so they can be used for ophthalmological or dermatological images.
Once the convolutional layers are learned from labeled medical images, the top layers can be inferred from the parameters found with data from ImageNet.
Weights of convolutional layers learned from ImageNet transfer to medical images, so we only need to learn new parameters at the top of the network.

10.
Question 10
What is the primary advantage of having a deep architecture?

1 point

There is a higher probability that each motif is used in the classifier.
The model shares knowledge between motifs through their shared substructures.
A model can learn each top-level motif in isolation.
The parameters of a deep architecture are less expensive to compute.

Week 2 Comprehensive

1.
Question 1
What does the equation for the loss function do conceptually?

1 point

Mathematically define network outputs
Penalize overconfidence
Ignore historical statistical developments
Reward indecision

2.
Question 2
What is overfitting?

1 point

Overfitting refers to the fact that more complexity is always better, which is why deep learning works.
Model complexity fits too well to training data and will not generalize in the real world.
Model complexity is perfectly matched to the data.
Model complexity is not enough to capture the nuance of the data and will underperform in the real world.

3.
Question 3
Why should the test set only be used once?

1 point

More than one use can lead to bias.
More than one use can lead to overfitting.
The model cannot learn anything new from subsequent uses.
It is expensive to use more than once.

4.
Question 4
Which two of the following describe the purpose of a validation set?

1 point

To estimate the performance of a model.
To pick the best performing model.
To test the performance in lieu of real-world data.
To learn the model parameters.

5.
Question 5
How do we learn our network?

1 point

Gradient descent
Downhill skiing
Monte Carlo simulation
Analytically determine global minimum

6.
Question 6
What technique is used to minimize loss for a large data set?

1 point

Newton’s method
Taylor series expansion
Stochastic gradient descent
Gradient descent

7.
Question 7
Which of the following are benefits of stochastic gradient descent?

1 point

With stochastic gradient descent, the update time does not scale with data size.
Stochastic gradient descent finds the solution more accurately.
Stochastic gradient descent can update many more times than gradient descent.
Stochastic gradient descent gets near the solution quickly.
Stochastic gradient descent finds a more exact gradient than gradient descent.

8.
Question 8
Why is gradient descent computationally expensive for large data sets?

1 point

Large data sets do not permit computing the loss function, so a more expensive measure is used.
Calculating the gradient requires looking at every single data point.
Large data sets require deeper models, which have more parameters.
There are too many local minima for an algorithm to find.
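
For readers reviewing questions 6 to 8, the contrast between gradient descent and stochastic gradient descent can be sketched in a few lines. This is a minimal illustration, not course material: the one-parameter model y ≈ w*x, the toy data, the function names, and the learning rate are all assumptions chosen for the example.

```python
import random

def full_gradient(w, xs, ys):
    """Gradient of the mean squared error for y ~ w*x, touching every data point."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def minibatch_gradient(w, xs, ys, batch_size, rng):
    """Noisy estimate of the same gradient from a random subset of the data."""
    idx = rng.sample(range(len(xs)), batch_size)
    return sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / batch_size

def sgd(xs, ys, steps=200, lr=0.01, batch_size=8, seed=0):
    """Stochastic gradient descent: each update costs batch_size points, not len(xs)."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        w -= lr * minibatch_gradient(w, xs, ys, batch_size, rng)
    return w

# Toy data (hypothetical): y = 3x with no noise, so the loss is minimized at w = 3.
xs = [i / 10 for i in range(1, 101)]
ys = [3 * x for x in xs]
w_hat = sgd(xs, ys)
print(w_hat)
```

The cost of `full_gradient` grows with the number of data points, while the cost of one `minibatch_gradient` call depends only on `batch_size`.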

9.
Question 9
What are the two main benefits of early stopping?

1 point

It helps save computation cost.
It performs better in the real world.
It improves the training loss.
There is rigorous statistical theory on it.

10.
Question 10
Why are optimization and validation at odds?

1 point

Optimization seeks to do as well as possible on a training set, while validation seeks to generalize to the real world.
Optimization seeks to generalize to the real world, while validation seeks to do as well as possible on a validation set.
Optimization seeks to do as well as possible on a training set, while validation seeks to do as well as possible on a validation set.
They are not at odds—they have the same goal.

Week 3 Comprehensive

1.
Question 1
Which of the following indicates whether a doctor or machine is doing well at finding positive examples in a data set?

1 point

Positive Predictive Value
Likelihood Ratio
Sensitivity
Specificity

2.
Question 2
Which of the following is used to distinguish the false positive rate from the false negative rate?

1 point

Sensitivity
False Negative
Negative Predictive Value
Specificity
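
As a study aid for questions 1 and 2, the four metrics named in the options can be computed from the counts of a confusion matrix. The sketch below uses the standard textbook definitions; the function name and the example counts are made up for illustration.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts.

    tp/fn: positive cases that were found / missed.
    tn/fp: negative cases that were cleared / wrongly flagged.
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of actual positives that were found
        "specificity": tn / (tn + fp),  # fraction of actual negatives that were cleared
        "ppv": tp / (tp + fp),          # fraction of positive calls that were right
        "npv": tn / (tn + fn),          # fraction of negative calls that were right
    }

# Hypothetical counts: 100 positive cases and 100 negative cases.
m = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)
print(m)
```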

3.
Question 3
Which of the following is the best conceptual definition of one dimensional convolution?

1 point

“Inverting” of a shape, where the inversion matches a feature.
“Sliding” of two signals, where a matched feature gives a high value of convolution.
“Intertwining” of two signals, where one wraps around the other to form a feature.
“Distortion” of one signal, according to the feature shape.
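
The idea of sliding a filter along a signal can be sketched as below. This is an illustrative assumption about how to picture the operation: it computes the unflipped sliding dot product (cross-correlation) that deep learning libraries usually call convolution, and the signal and filter values are invented for the example.

```python
def slide_filter(signal, kernel):
    """Slide `kernel` along `signal`, taking a dot product at each offset."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Hypothetical signal containing one bump that has the same shape as the filter.
signal = [0, 0, 1, 2, 1, 0, 0]
kernel = [1, 2, 1]

response = slide_filter(signal, kernel)       # [1, 4, 6, 4, 1]
best_offset = response.index(max(response))   # 2, where the filter lines up with the bump
print(response, best_offset)
```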

4.
Question 4
Which of the following can a user choose when designing a convolutional layer? (Choose all that are correct.)

1 point

Filter depth
Filter size
Filter number
Filter stride
Filter weights

5.
Question 5
What is a fully connected readout?

1 point

A layer with ten classifications.
A layer with connections to all feature maps.
The vectorization of a pooling layer.
A layer with a single neuron for each output class.

6.
Question 6
Why are nonlinear activation functions preferable?

1 point

Nonlinear activation functions are preferable because they are used in generalized linear models in statistics.
Nonlinear activation functions increase the functional capacity of the neural network by allowing the representation of nonlinear relationships between features in the input.
Nonlinear activation functions are preferable because they have been used historically.
Nonlinear activation functions are NOT preferable to linear ones, as they lose information in systems with high variance.

7.
Question 7
Which of the following are benefits of pooling? (Choose all that are correct.)

1 point

Decreases bias.
Combats overfitting.
Vectorizes the data.
Encourages translational invariance.
Reduces computational complexity.
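
To make the pooling operation concrete, here is a minimal sketch of 2x2 max pooling with stride 2 on a small feature map. The window size, stride, and the values in the map are assumptions for illustration; the quiz does not specify them.

```python
def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]) - 1, 2)
        ]
        for r in range(0, len(fmap) - 1, 2)
    ]

# Hypothetical 4x4 feature map.
fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(fmap)  # [[4, 2], [2, 8]]
print(pooled)
```

The 16 input values shrink to 4, so later layers have less to process, and a strong response that moves by one pixel inside a window leaves the pooled value unchanged.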

8.
Question 8
How are parameters that minimize the loss function found in practice?

1 point

Fractal geometry
Gradient descent
Simplex algorithm
Stochastic gradient descent

9.
Question 9
Which of the following is an advantage of hierarchical representation of image features?

1 point

Eliminating bias.
Decreasing the computational complexity.
Better leveraging all training data.
Decreasing variance in the model.

10.
Question 10
Why does transfer learning work?

1 point

Top-level features are specialized for a particular task, while low-level features are universal to all images.
All layers of filters can be learned by studying the mammalian receptive fields.
Low-level features are specialized for a particular task, while top-level features are universal to all images.
All images are composed of pixels with three color channels.

Week 4 Comprehensive

1.
Question 1
What is meant by “word vector”?

1 point

The latitude and longitude of the place a word originated.
A vector of numbers associated with a word.
Assigning a corresponding number to each word.
A vector consisting of all words in a vocabulary.

2.
Question 2
Which word is a synonym for “word vector”?

1 point

Norm
Array
Embedding
Stack

3.
Question 3
What is the term for a set of vectors, with one vector for each word in the vocabulary?

1 point

Space
Array
Codebook
Embedding

4.
Question 4
What is natural language processing?

1 point

Making natural text conform to formal language standards.
Translating natural text characters to unicode representations.
Translating human-readable code to machine-readable instructions.
Taking natural text and making inferences and predictions.

5.
Question 5
What is the goal of learning word vectors?

1 point

Find the hidden or latent features in a text.
Labelling a text corpus, so a human doesn’t have to do it.
Determine the vocabulary in the codebook.
Given a word, predict which words are in its vicinity.

6.
Question 6
What function is the generalization of the logistic function to multiple dimensions?

1 point

Hyperbolic tangent function
Exponential log likelihood
Squash function
Softmax function
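
The relationship between the logistic function and its multi-class counterpart can be checked numerically. The sketch below uses the standard definitions of both functions; the scores passed in are arbitrary example values.

```python
import math

def sigmoid(z):
    """Logistic function: maps one real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Maps a list of real scores to probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # three classes
two_class = softmax([1.5, 0.0])    # two classes, second score fixed at 0
print(probs, two_class[0], sigmoid(1.5))
```

With two classes and the second score fixed at 0, the first softmax output equals `sigmoid(1.5)`.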

7.
Question 7
What is the continuous bag of words (CBOW) approach?

1 point

Vectors for the neighborhood of words are averaged and used to predict word n.
Word n is used to predict the words in the neighborhood of word n.
Word n is learned from a large corpus of words, which a human has labeled.
The code for word n is fed through a CNN and categorized with a softmax.
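
One common way to picture the continuous bag of words setup is sketched below: the vectors of the neighboring words are averaged, and the average is scored against every word in the vocabulary. Everything here is a hypothetical illustration: the two-dimensional vectors are made up rather than learned, and reusing the same vectors for input and output is a simplification.

```python
import math

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def cbow_probabilities(context_words, input_vecs, output_vecs):
    """Average the context vectors, then softmax the dot products with each word."""
    ctx = average([input_vecs[w] for w in context_words])
    scores = {
        w: sum(a * b for a, b in zip(ctx, v)) for w, v in output_vecs.items()
    }
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Hypothetical, untrained 2-dimensional word vectors.
vecs = {
    "the": [0.1, 0.0],
    "cat": [0.9, 0.2],
    "sat": [0.7, 0.4],
    "mat": [0.8, 0.1],
}
probs = cbow_probabilities(["the", "cat", "mat"], vecs, vecs)
print(probs)
```

Training would adjust the vectors so that the word actually found in the middle of each context receives a high probability.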

8.
Question 8
What is the Skip-Gram approach?

1 point