**Introduction to Machine Learning Quiz Answer**

**Week-1**

Week 1 Comprehensive

1.

Question 1

Which of the following are necessary for supervised machine learning? (Choose all that are correct)

1 point

- **A model**
- **Learning from data**
- **Labeled training data**
- Human to teach the machine

2.

Question 2

What decision boundary can logistic regression provide?

1 point

- Arbitrarily complex functions
- Jagged edges
- Smooth curves
- **Linear**
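
The linear boundary can be checked with a minimal sketch. The weights `w`, the bias `b`, and the sample points below are hypothetical values chosen for illustration, not values from the course. Logistic regression applies a sigmoid to a linear score, so the 0.5-probability contour is the line where that score equals zero.

```python
import math

def predict_proba(w, b, x):
    """Logistic regression: sigmoid applied to the linear score w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights for a two-feature problem (illustrative only).
w, b = [2.0, -1.0], 0.5

# The decision boundary is the straight line 2*x1 - x2 + 0.5 = 0.
on_boundary = predict_proba(w, b, [1.0, 2.5])  # score = 0
one_side = predict_proba(w, b, [2.0, 2.5])     # score = +2
other_side = predict_proba(w, b, [0.0, 2.5])   # score = -2

print(on_boundary, one_side, other_side)
```

Points on one side of the line get a probability above 0.5 and points on the other side get one below it, which is why the model cannot carve out curved or jagged regions by itself.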

3.

Question 3

What is the primary advantage of using multiple filters?

1 point

- More complexity is always better.
- This requires less compute power.
- **This allows the model to look for subtypes of the classification.**
- This is simpler to implement.

4.

Question 4

Which one of the following best describes transfer learning in the context of document analysis?

1 point

- All parameters of the model are different between individuals.
- **Parameters at the bottom of the model are transferable across all people and documents, while the parameters at the top are different between individuals.**
- All parameters of the model are transferable across all people and documents.
- Parameters at the top of the model are transferable across all people and documents, while the parameters at the bottom are different between individuals.

5.

Question 5

Given the following image of data classifications, which of the following models would you choose?

1 point

- **Logistic regression**
- Multilayer perceptron

6.

Question 6

What new feature did neural networks acquire in 2010?

1 point

- A new computational platform: the GPU
- A new application: image search
- A new operation: convolution
- **A new name: Deep Learning**

7.

Question 7

Which of the following is convolved with layer 2 features, or sub-motifs?

1 point

- Layer 2 feature map
- **Layer 1 feature map**
- Layer 3 feature map

8.

Question 8

Which of the following gives the best conceptual meaning of convolution?

1 point

- Surveying a feature map for high-level motif.
- Selecting an atomic element from an image.
- Stacking a collection of feature maps.
- **Shifting a filter to every location in an image.**
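
"Shifting a filter to every location" can be sketched in a few lines of plain Python. The function name `convolve2d`, the toy image, and the 2x2 filter are assumptions made for illustration. Like most CNN libraries, this sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over every valid location in `image` (no padding, stride 1)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for r in range(ih - kh + 1):          # every vertical shift
        row = []
        for c in range(iw - kw + 1):      # every horizontal shift
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Toy 4x4 image containing a 2x2 bright square (hypothetical data).
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
kernel = [[1, 1], [1, 1]]  # filter that matches a 2x2 bright patch

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
```

The largest value sits at the shift where the filter lines up exactly with the square, which is the sense in which a feature map records where a motif occurs.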

9.

Question 9

What does transfer learning mean in the context of medical imaging?

1 point

- Just as assigning categories to images in ImageNet required millions of images, so too does analyzing medical images require millions of labeled medical images.
- Sufficient labeled radiological images can be used to learn all of the model parameters, so they can be used for ophthalmological or dermatological images.
- Once the convolutional layers are learned from labeled medical images, the top layers can be inferred from the parameters found with data from ImageNet.
- **Weights of convolutional layers learned from ImageNet transfer to medical images, so we only need learn new parameters at the top of the network.**

10.

Question 10

What is the primary advantage of having a deep architecture?

1 point

- There is a higher probability that each motif is used in the classifier.
- **The model shares knowledge between motifs through their shared substructures.**
- A model can learn each top-level motif in isolation.
- The parameters of a deep architecture are less expensive to compute.

Week 2 Comprehensive

1.

Question 1

What does the equation for the loss function do conceptually?

1 point

- Mathematically define network outputs
- **Penalize overconfidence**
- Ignore historical statistical developments
- Reward indecision

2.

Question 2

What is overfitting?

1 point

- Overfitting refers to the fact that more complexity is always better, which is why deep learning works.
- **Model complexity fits too well to training data and will not generalize in the real world.**
- Model complexity is perfectly matched to the data.
- Model complexity is not enough to capture the nuance of the data and will under-perform in the real world.

3.

Question 3

Why should the test set only be used once?

1 point

- **More than one use can lead to bias.**
- More than one use can lead to overfitting.
- The model cannot learn anything new from subsequent uses.
- It is expensive to use more than once.

4.

Question 4

Which two of the following describe the purpose of a validation set?

1 point

- To estimate the performance of a model.
- **To pick the best performing model.**
- To test the performance in lieu of real-world data.
- To learn the model parameters.

5.

Question 5

How do we learn our network?

1 point

- **Gradient descent**
- Downhill skiing
- Monte Carlo simulation
- Analytically determine global minimum

6.

Question 6

What technique is used to minimize loss for a large data set?

1 point

- Newton’s method
- Taylor series expansion
- **Stochastic gradient descent**
- Gradient descent

7.

Question 7

Which of the following are benefits of stochastic gradient descent?

1 point

- **With stochastic gradient descent, the update time does not scale with data size.**
- Stochastic gradient descent finds the solution more accurately.
- **Stochastic gradient descent can update many more times than gradient descent.**
- Stochastic gradient descent gets near the solution quickly.
- Stochastic gradient descent finds a more exact gradient than gradient descent.
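
The cost difference between the two methods can be sketched on a toy problem. Everything below is an assumption made for illustration: the one-parameter model `y = w * x`, the noise-free data, the learning rate, and the function names. The full gradient loops over every data point, while the stochastic gradient looks at a single randomly chosen point, so one update costs the same no matter how large the data set is.

```python
import random

def full_gradient(w, data):
    """Gradient of mean squared error; touches every data point."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def stochastic_gradient(w, data, rng):
    """Noisy gradient estimate from one randomly chosen data point."""
    x, y = rng.choice(data)
    return 2 * (w * x - y) * x

# Hypothetical noise-free data generated from y = 3x.
data = [(0.1 * k, 3.0 * (0.1 * k)) for k in range(1, 101)]

rng = random.Random(0)   # fixed seed so the run is repeatable
w = 0.0
learning_rate = 0.005    # illustrative value, not tuned

for _ in range(2000):    # 2000 cheap updates, one data point each
    w -= learning_rate * stochastic_gradient(w, data, rng)

print(round(w, 6))
```

Each stochastic step is only an estimate of the true gradient, so it is less exact than a full-gradient step, but many more steps fit in the same compute budget.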

8.

Question 8

Why is gradient descent computationally expensive for large data sets?

1 point

- Large data sets do not permit computing the loss function, so a more expensive measure is used.
- **Calculating the gradient requires looking at every single data point.**
- Large data sets require deeper models, which have more parameters.
- There are too many local minima for an algorithm to find.

9.

Question 9

What are the two main benefits of early stopping?

1 point

- **It helps save computation cost.**
- **It performs better in the real world.**
- It improves the training loss.
- There is rigorous statistical theory on it.

10.

Question 10

Why are optimization and validation at odds?

1 point

- **Optimization seeks to do as well as possible on a training set, while validation seeks to generalize to the real world.**
- Optimization seeks to generalize to the real world, while validation seeks to do as well as possible on a validation set.
- Optimization seeks to do as well as possible on a training set, while validation seeks to do as well as possible on a validation set.
- They are not at odds—they have the same goal.

Week 3 Comprehensive

1.

Question 1

Which of the following indicates whether a doctor or machine is doing well at finding positive examples in a data set?

1 point

- Positive Predictive Value
- Likelihood Ratio
- **Sensitivity**
- Specificity

2.

Question 2

Which of the following is used to distinguish the false positive rate from the false negative rate?

1 point

- Sensitivity
- False Negative
- Negative Predictive Value
- **Specificity**
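
Sensitivity and specificity, the two measures in Questions 1 and 2, are simple ratios of confusion-matrix counts. The sketch below uses the standard definitions; the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sensitivity(tp, fn):
    """True positive rate: share of actual positives that were found."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of actual negatives correctly ruled out."""
    return tn / (tn + fp)

# Hypothetical counts: 100 diseased patients, 200 healthy patients.
tp, fn = 90, 10    # diseased: 90 detected, 10 missed
tn, fp = 160, 40   # healthy: 160 cleared, 40 false alarms

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(sens, spec)  # 0.9 0.8
```

The false negative rate is `1 - sens` and the false positive rate is `1 - spec`, so the pair separates the two kinds of error.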

3.

Question 3

Which of the following is the best conceptual definition of one dimensional convolution?

1 point

- “Inverting” of a shape, where the inversion matches a feature.
- **“Sliding” of two signals, where a matched feature gives a high value of convolution.**
- “Intertwining” of two signals, where one wraps around the other to form a feature.
- “Distortion” of one signal, according to the feature shape.
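
The "sliding" picture can be sketched directly. The function name `slide`, the signal, and the feature are hypothetical. As in most neural network code, the feature is not reversed before sliding, which makes this a cross-correlation in the strict signal-processing sense.

```python
def slide(signal, feature):
    """Slide `feature` along `signal`; return the match score at each shift."""
    shifts = len(signal) - len(feature) + 1
    return [
        sum(signal[s + k] * feature[k] for k in range(len(feature)))
        for s in range(shifts)
    ]

signal = [0, 0, 1, 2, 1, 0, 0]   # toy signal with a bump in the middle
feature = [1, 2, 1]              # feature shaped like that bump

response = slide(signal, feature)
best_shift = response.index(max(response))
print(response, best_shift)      # [1, 4, 6, 4, 1] 2
```

The response peaks at the shift where the feature overlaps the bump exactly, and falls off as the two move out of alignment.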

4.

Question 4

Which of the following can a user choose when designing a convolutional layer? (Choose all that are correct.)

1 point

- **Filter depth**
- **Filter size**
- **Filter number**
- **Filter stride**
- Filter weights

5.

Question 5

What is a fully connected readout?

1 point

- A layer with ten classifications.
- A layer with connections to all feature maps.
- The vectorization of a pooling layer.
- **A layer with a single neuron for each output class.**

6.

Question 6

Why are nonlinear activation functions preferable?

1 point

- Nonlinear activation functions are preferable because they are used in generalized linear models in statistics.
- **Nonlinear activation functions increase the functional capacity of the neural network by allowing the representation of nonlinear relationships between features in input.**
- Nonlinear activation functions are preferable because they have been used historically.
- Nonlinear activation functions are NOT preferable to linear ones, as they lose information in systems with high variance.

7.

Question 7

Which of the following are benefits of pooling? (Choose all that are correct.)

1 point

- **Decreases bias.**
- **Combats overfitting.**
- **Vectorizes the data.**
- **Encourages translational invariance.**
- Reduces computational complexity.

8.

Question 8

How are parameters that minimize the loss function found in practice?

1 point

- Fractal geometry
- Gradient descent
- Simplex algorithm
- **Stochastic gradient descent**

9.

Question 9

Which of the following is an advantage of hierarchical representation of image features?

1 point

- Eliminating bias.
- Decreasing the computational complexity.
- **Better leveraging all training data.**
- Decreasing variance in the model.

10.

Question 10

Why does transfer learning work?

1 point

- **Top-level features are specialized for a particular task, while low-level features are universal to all images.**
- All layers of filters can be learned by studying the mammalian receptive fields.
- Low-level features are specialized for a particular task, while top-level features are universal to all images.
- All images are composed of pixels with three color channels.

Week 4 Comprehensive

1.

Question 1

What is meant by “word vector”?

1 point

- The latitude and longitude of the place a word originated.
- **A vector of numbers associated with a word.**
- Assigning a corresponding number to each word.
- A vector consisting of all words in a vocabulary.

2.

Question 2

Which word is a synonym for “word vector”?

1 point

- Norm
- Array
- **Embedding**
- Stack

3.

Question 3

What is the term for a set of vectors, with one vector for each word in the vocabulary?

1 point

- Space
- Array
- **Codebook**
- Embedding

4.

Question 4

What is natural language processing?

1 point

- Making natural text conform to formal language standards.
- Translating natural text characters to unicode representations.
- Translating human-readable code to machine-readable instructions.
- **Taking natural text and making inferences and predictions.**

5.

Question 5

What is the goal of learning word vectors?

1 point

- Find the hidden or latent features in a text.
- Labelling a text corpus, so a human doesn’t have to do it.
- Determine the vocabulary in the codebook.
- **Given a word, predict which words are in its vicinity.**

6.

Question 6

What function is the generalization of the logistic function to multiple dimensions?

1 point

- Hyperbolic tangent function
- Exponential log likelihood
- Squash function
- **Softmax function**
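
A short numerical check shows the sense in which softmax generalizes the logistic function. The score `z = 1.7` is an arbitrary illustrative value. With two classes whose scores are `z` and `0`, the softmax probability of the first class equals the logistic function of `z`.

```python
import math

def softmax(scores):
    """Turn a list of real scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(z):
    """The two-class special case: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

z = 1.7                                   # arbitrary illustrative score
two_class = softmax([z, 0.0])
many_class = softmax([2.0, 1.0, 0.1])     # works for any number of classes

print(two_class[0], logistic(z))
```

Both printed values agree up to floating-point rounding, and the multi-class output is a full probability distribution over the classes.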

7.

Question 7

What is the continuous bag of words (CBOW) approach?

1 point

- **Vectors for the neighborhood of words are averaged and used to predict word n.**
- Word n is used to predict the words in the neighborhood of word n.
- Word n is learned from a large corpus of words, which a human has labeled.
- The code for word n is fed through a CNN and categorized with a softmax.

8.

Question 8

What is the Skip-Gram approach?

1 point

- **Word n is used to predict the words in the neighborhood of word n.**
- The code for word n is fed through a CNN and categorized with a softmax.
- Word n is learned from a large corpus of words, which a human has labeled.
- Vectors for the neighborhood of words are averaged and used to predict word n.
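
The difference between CBOW (Question 7) and Skip-Gram (Question 8) is easiest to see in the training examples each one builds from the same text. The sketch below only constructs those examples and does not train any vectors. The function names, the window size, and the four-word sentence are hypothetical.

```python
def neighborhood(tokens, n, window):
    """Words within `window` positions of word n, excluding word n itself."""
    return tokens[max(0, n - window):n] + tokens[n + 1:n + 1 + window]

def cbow_examples(tokens, window):
    """CBOW: the neighborhood (later averaged as vectors) predicts word n."""
    return [(neighborhood(tokens, n, window), word)
            for n, word in enumerate(tokens)
            if neighborhood(tokens, n, window)]

def skipgram_examples(tokens, window):
    """Skip-Gram: word n predicts each word in its neighborhood."""
    return [(word, neighbor)
            for n, word in enumerate(tokens)
            for neighbor in neighborhood(tokens, n, window)]

tokens = ["the", "cat", "sat", "down"]

cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[1])       # (['the', 'sat'], 'cat')
print(skipgram[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

The two approaches use the same neighborhoods and simply reverse the direction of prediction.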

9.

Question 9

What is the goal of the recurrent neural network?

1 point

- Learn a series of images that form a video.
- Predict words more efficiently than Skip-Gram.
- **Synthesize a sequence of words.**
- Classify an unlabeled image.

10.

Question 10

Which model is the state-of-the-art for text synthesis?

1 point

- **Long short-term memory**
- CNN
- Multilayer perceptron
- CBOW