1. What is a SequenceFile?

A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.

B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.

C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.

D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.

Answer: D
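A minimal sketch of writing a SequenceFile with the classic org.apache.hadoop.io API (the path and key/value types below are illustrative): every key is an IntWritable and every value a Text, which is exactly the "same key type, same value type" constraint described in answer D.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/demo.seq");  // illustrative path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
    try {
      // Every key is an IntWritable and every value is a Text; mixing types would fail.
      for (int i = 0; i < 3; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    } finally {
      writer.close();
    }
  }
}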

 

2. Is there a map input format?

A. Yes, but only in Hadoop 0.22+.

B. Yes, there is a special format for map files.

C. No, but sequence file input format can read map files.

D. Both B and C are correct answers.

Answer: C

 

3. In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?

A. Increase the parameter that controls minimum split size in the job configuration.

B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.

C. Set the number of mappers equal to the number of input files you want to process.

D. Write a custom FileInputFormat and override the method isSplitable to always return false.

Answer: D
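For reference, a minimal sketch of the isSplitable approach described in option D, using the classic org.apache.hadoop.mapred API; the class name is illustrative. Returning false keeps each file in a single split, so one map task processes the whole file regardless of how many HDFS blocks it occupies.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Illustrative name; extends TextInputFormat so records are still read line by line.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // never split: one file -> one input split -> one map task
  }
}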

 

4. Which of the following best describes the workings of TextInputFormat?

A. Input file splits may cross line breaks. A line that crosses file splits is ignored.

B. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.

C. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.

D. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.

E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.

Answer: E

 

5. Which of the following statements most accurately describes the relationship between MapReduce and Pig?

A. Pig provides additional capabilities that allow certain types of data manipulation not possible with MapReduce.

B. Pig provides no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.

C. Pig programs rely on MapReduce but are extensible, allowing developers to do special-purpose processing not provided by MapReduce.

D. Pig provides the additional capability of allowing you to control the flow of multiple MapReduce jobs.

Answer: D

 

6. You need to import a portion of a relational database every day as files to HDFS, and generate Java classes to interact with your imported data. Which of the following tools should you use to accomplish this?

A. Pig

B. Hue

C. Hive

D. Flume

E. Sqoop

F. Oozie

G. fuse-dfs

Answer: E

 

7. You have an employee who is a Data Analyst and is very comfortable with SQL. He would like to run ad-hoc analysis on data in your HDFS cluster. Which of the following is a data warehousing software built on top of Apache Hadoop that defines a simple SQL-like query language well-suited for this kind of user?

A. Pig

B. Hue

C. Hive

D. Sqoop

E. Oozie

F. Flume

G. Hadoop Streaming

Answer: C

 

8. Workflows expressed in Oozie can contain:

A. Iterative repetition of MapReduce jobs until a desired answer or state is reached.

B. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.

C. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.

D. Sequences of MapReduce and Pig. These sequences can be combined with other actions including forks, decision points, and path joins.

Answer: D

 

9. You need a distributed, scalable data store that allows you random, real-time read/write access to hundreds of terabytes of data. Which of the following would you use?

A. Hue

B. Pig

C. Hive

D. Oozie

E. HBase

F. Flume

G. Sqoop

Answer: E

 

10. Which of the following utilities allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?

A. Oozie

B. Sqoop

C. Flume

D. Hadoop Streaming

Answer: D

 

11. You are running a Hadoop cluster with all monitoring facilities properly configured. Which scenario will go undetected?

A. Map or reduce tasks that are stuck in an infinite loop.

B. HDFS is almost full.

C. The NameNode goes down.

D. A DataNode is disconnected from the cluster.

E. MapReduce jobs that are causing excessive memory swaps.

Answer: C

 

12. Which of the following scenarios makes HDFS unavailable?

A. JobTracker failure

B. TaskTracker failure

C. DataNode failure

D. NameNode failure

E. Secondary NameNode failure

Answer: D

 

13. Which MapReduce stage serves as a barrier, where all previous stages must be completed before it may proceed?

A. Combine

B. Group (a.k.a. 'shuffle')

C. Reduce

D. Write

Answer: C

 

14. Which of the following statements most accurately describes the general approach to error recovery when using MapReduce?

A. Ranger

B. Longhorn

C. Lonestar

D. Spur

Answer: A

 

15. The Combine stage, if present, must perform the same aggregation operation as Reduce.

A. True

B. False

Answer: B

 

16. What is the implementation language of the Hadoop MapReduce framework?

A. Java

B. C

C. FORTRAN

D. Python

Answer: A

 

17. Which of the following MapReduce execution frameworks focuses on execution in shared-memory environments?

A. Hadoop

B. Twister

C. Phoenix

Answer: C

 

18. How can a distributed filesystem such as HDFS provide opportunities for optimization of a MapReduce operation?

A. Data represented in a distributed filesystem is already sorted.

B. Distributed filesystems must always be resident in memory, which is much faster than disk.

C. Data storage and processing can be co-located on the same node, so that most input data relevant to Map or Reduce will be present on local disks or cache.

D. A distributed filesystem makes random access faster because of the presence of a dedicated node serving file metadata.

Answer: C

 

19. What is the input to the Reduce function?

A. One key and a list of all values associated with that key.

B. One key and a list of some values associated with that key.

C. An arbitrarily sized list of key/value pairs.

Answer: A
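In the classic org.apache.hadoop.mapred API this is visible directly in the reduce signature: one key plus an iterator over all of that key's values. A minimal sum reducer as a sketch (names are illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {       // all values the framework grouped under this key
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}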

 

20. Which MapReduce phase is theoretically able to utilize features of the underlying file system in order to optimize parallel execution?

A. Split

B. Map

C. Combine

Answer: A

 

21. The default size of a block in HDFS is

A. 512 bytes

B. 64 MB

C. 1024 KB

D. None of the above

Answer: B

 

22. The switch given to the “hadoop fs” command for detailed help is

A. -show

B. -help

C. -?

D. None of the above

Answer: B

 

23. RPC means

A. Remote processing call

B. Remote process call

C. Remote procedure call

D. None of the above

Answer: C

 

24. Which method of the FileSystem object is used for reading a file in HDFS?

A. open()

B. access()

C. select()

D. None of the above

Answer: A
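A minimal sketch of option A using the FileSystem API; the path is illustrative. open() returns an FSDataInputStream for reading the file's contents.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    InputStream in = null;
    try {
      in = fs.open(new Path("/tmp/input.txt"));        // illustrative path
      IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}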

 

25. How many methods does the Writable interface define?

A. Two

B. Four

C. Three

D. None of the above

Answer: A
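The answer "Two" refers to the two methods declared by the org.apache.hadoop.io.Writable interface, which every Hadoop key and value type implements:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
  void write(DataOutput out) throws IOException;      // serialize this object's fields
  void readFields(DataInput in) throws IOException;   // deserialize this object's fields
}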

 

26. What are the supported programming languages for MapReduce?

A. The most common programming language is Java, but scripting languages are also supported via Hadoop streaming.

B. Any programming language that can comply with Map Reduce concept can be supported.

C. Only Java supported since Hadoop was written in Java.

D. Currently Map Reduce supports Java, C, C++ and COBOL.

Answer: A

 

28. What are sequence files and why are they important?

A. Sequence files are binary format files that are compressed and are splittable. They are often used in high-performance map-reduce jobs

B. Sequence files are a type of the file in the Hadoop framework that allow data to be sorted 

C. Sequence files are intermediate files that are created by Hadoop after the map step

D. Both B and C are correct

Answer: A

 

29. What are map files and why are they important?

A. Map files are stored on the namenode and capture the metadata for all blocks on a particular rack. This is how Hadoop is "rack aware".

B. Map files are the files that show how the data is distributed in the Hadoop cluster.

C. Map files are generated by Map-Reduce after the reduce step. They show the task distribution during job execution

D. Map files are sorted sequence files that also have an index. The index allows fast data look up.

Answer: D

 

30. How can you use binary data in MapReduce?

A. Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file.

B. Binary data cannot be used by the Hadoop framework. Binary data should be converted to a Hadoop compatible format prior to loading.

C. Binary data can be used in map-reduce only with very limited functionality. It cannot be used as a key, for example.

D. Hadoop can freely use binary files with map-reduce jobs so long as the files have headers

Answer: A

 

31. What is a map-side join?

A. Map-side join is done in the map phase and done in memory

B. Map-side join is a technique in which data is eliminated at the map step

C. Map-side join is a form of map-reduce API which joins data from different locations

D. None of these answers are correct

Answer: A

 

32. What is a reduce-side join?

A. Reduce-side join is a technique to eliminate data from initial data set at reduce step

B. Reduce-side join is a technique for merging data from different sources based on a specific key.

C. Reduce-side join is a set of API to merge data from different sources.

D. None of these answers are correct

Answer: B

 

34. What is PIG?

A. Pig is a subset of the Hadoop API for data processing

B. Pig is a part of the Apache Hadoop project that provides a C-like scripting language interface for data processing

C. Pig is a part of the Apache Hadoop project. It is a "PL-SQL" interface for data processing in Hadoop cluster

D. PIG is the third most popular form of meat in the US behind poultry and beef.

Answer: B

 

35. How can you disable the reduce step?

A. The Hadoop administrator has to set the number of the reducer slot to zero on all slave nodes. This will disable the reduce step.

B. It is impossible to disable the reduce step since it is a critical part of the Map-Reduce abstraction.

C. A developer can always set the number of the reducers to zero. That will completely disable the reduce step.

D. While you cannot completely disable reducers you can set output to one. There needs to be at least one reduce step in Map-Reduce abstraction.

Answer: C
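A minimal driver-side sketch of option C with the classic JobConf API; paths are illustrative and IdentityMapper simply passes records through. With zero reducers the map output is written directly to HDFS as the job output.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only");
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);                 // disables the reduce step entirely
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(conf, new Path("/tmp/in"));     // illustrative
    FileOutputFormat.setOutputPath(conf, new Path("/tmp/out"));  // illustrative
    JobClient.runJob(conf);
  }
}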

 

36. Why would a developer create a map-reduce job without the reduce step?

A. Developers should design Map-Reduce jobs without reducers only if no reduce slots are available on the cluster.

B. Developers should never design Map-Reduce jobs without reducers. An error will occur upon compile.

C. There is a CPU intensive step that occurs between the map and reduce steps. Disabling the reduce step speeds up data processing.

D. It is not possible to create a map-reduce job without at least one reduce step. A developer may decide to limit to one reducer for debugging purposes.

Answer: C

 

37. What is the default input format?

A. The default input format is xml. Developer can specify other input formats as appropriate if xml is not the correct input.

B. There is no default input format. The input format always should be specified.

C. The default input format is a sequence file format. The data needs to be preprocessed before using the default input format.

D. The default input format is TextInputFormat with byte offset as a key and entire line as a value.

Answer: D

 

38. How can you overwrite the default input format?

A. In order to overwrite default input format, the Hadoop administrator has to change default settings in config file.

B. In order to overwrite default input format, a developer has to set new input format on job config before submitting the job to a cluster.

C. The default input format is controlled by each individual mapper and each line needs to be parsed individually.

D. None of these answers are correct.

Answer: B
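A minimal driver-side sketch of option B with the classic JobConf API (the newer org.apache.hadoop.mapreduce API uses job.setInputFormatClass instead); the format chosen here is just an example.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class InputFormatConfigDemo {
  public static void main(String[] args) {
    JobConf conf = new JobConf(InputFormatConfigDemo.class);
    // Replaces the default TextInputFormat before the job is submitted.
    conf.setInputFormat(KeyValueTextInputFormat.class);
  }
}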

 

39. What are the common problems with map-side join?

A. The most common problem with map-side joins is introducing a high level of code complexity. This complexity has several downsides: increased risk of bugs and performance degradation. Developers are cautioned to rarely use map-side joins.

B. The most common problem with map-side joins is lack of available map slots, since map-side joins require a lot of mappers.

C. The most common problems with map-side joins are out of memory exceptions on slave nodes.

D. The most common problem with map-side join is not clearly specifying the primary index in the join. This can lead to very slow performance on large datasets.

Answer: C

 

40. Which is faster: Map-side join or Reduce-side join? Why?

A. Both techniques have about the same performance expectations.

B. Reduce-side join because join operation is done on HDFS.

C. Map-side join is faster because join operation is done in memory.

D. Reduce-side join because it is executed on the namenode, which will have a faster CPU and more memory.

Answer: C

 

41. Will settings using Java API overwrite values in configuration files?

A. No. The configuration settings in the configuration file take precedence

B. Yes. The configuration settings using Java API take precedence

C. It depends when the developer reads the configuration file. If it is read first then no.

D. Only global configuration settings are captured in configuration files on namenode. There are only a very few job parameters that can be set using Java API.

Answer: B

 

42. What is AVRO?

A. Avro is a Java serialization library

B. Avro is a Java compression library

C. Avro is a Java library that creates splittable files

D. None of these answers are correct

Answer: A

 

43. Can you run Map - Reduce jobs directly on Avro data?

A. Yes, Avro was specifically designed for data processing via Map-Reduce

B. Yes, but additional extensive coding is required

C. No, Avro was specifically designed for data storage only

D. Avro specifies metadata that allows easier data access. This data cannot be used as part of map-reduce execution, only for input specification.

Answer: A

 

44. What is the distributed cache?

A. The distributed cache is a special component on the namenode that will cache frequently used data for faster client response. It is used during the reduce step.

B. The distributed cache is a special component on the datanode that will cache frequently used data for faster client response. It is used during the map step.

C. The distributed cache is a component that caches java objects.

D. The distributed cache is a component that allows developers to deploy jars for Map-Reduce processing.

Answer: D

 

45. What is the best performance one can expect from a Hadoop cluster?

A. The best performance expectation one can have is measured in seconds. This is because Hadoop can only be used for batch processing

B. The best performance expectation one can have is measured in milliseconds. This is because Hadoop executes in parallel across so many machines

C. The best performance expectation one can have is measured in minutes. This is because Hadoop can only be used for batch processing

D. It depends on the design of the map-reduce program, how many machines are in the cluster, and the amount of data being retrieved

Answer: D

 

46. What is writable?

A. Writable is a Java interface that needs to be implemented for streaming data to remote servers.

B. Writable is a Java interface that needs to be implemented for HDFS writes.

C. Writable is a Java interface that needs to be implemented for MapReduce processing.

D. None of these answers are correct.

Answer: C

 

47. The Hadoop API uses wrapper types such as LongWritable, Text, and IntWritable in place of basic Java types. They have almost the same features as the default Java classes. What are these Writable data types optimized for?

A. Writable data types are specifically optimized for network transmissions

B. Writable data types are specifically optimized for file system storage

C. Writable data types are specifically optimized for map-reduce processing

D. Writable data types are specifically optimized for data retrieval

Answer: A

 

48. Can a custom type for data Map-Reduce processing be implemented?

A. No, Hadoop does not provide techniques for custom datatypes.

B. Yes, but only for mappers.

C. Yes, custom data types can be implemented as long as they implement the Writable interface.

D. Yes, but only for reducers.

Answer: C
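A minimal sketch of option C: a hypothetical custom value type that implements Writable by providing the write() and readFields() methods (to be used as a key it would implement WritableComparable and add compareTo()).

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom type usable as a MapReduce value.
public class PointWritable implements Writable {
  private int x;
  private int y;

  public PointWritable() { }  // no-arg constructor required by the framework
  public PointWritable(int x, int y) { this.x = x; this.y = y; }

  public void write(DataOutput out) throws IOException {  // serialize the fields
    out.writeInt(x);
    out.writeInt(y);
  }

  public void readFields(DataInput in) throws IOException {  // deserialize the fields
    x = in.readInt();
    y = in.readInt();
  }
}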

 

49. What happens if mapper output does not match reducer input?

A. No, Hadoop does not provide techniques for custom datatypes.

B. Yes, but only for mappers.

C. Yes, custom data types can be implemented as long as they implement writable interface.

D. Yes, but only for reducers.

Answer: C

 

50. Can you provide multiple input paths to a map-reduce job?

A. Yes, but only in Hadoop 0.22+.

B. No, Hadoop always operates on one input directory.

C. Yes, developers can add any number of input paths.

D. Yes, but the limit is currently capped at 10 input paths.

Answer: C
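A minimal driver-side sketch with the classic API; the paths are illustrative. Each addInputPath call appends one more path, and FileInputFormat.setInputPaths accepts a whole comma-separated list at once.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MultiInputDemo {
  public static void main(String[] args) {
    JobConf conf = new JobConf(MultiInputDemo.class);
    // Any number of input paths can be added; all of them feed the same job.
    FileInputFormat.addInputPath(conf, new Path("/data/logs/day1"));  // illustrative
    FileInputFormat.addInputPath(conf, new Path("/data/logs/day2"));  // illustrative
    FileInputFormat.addInputPath(conf, new Path("/data/extra"));      // illustrative
  }
}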

 

51. In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?

A. Increase the parameter that controls minimum split size in the job configuration.

B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.

C. Set the number of mappers equal to the number of input files you want to process.

D. Write a custom FileInputFormat and override the method isSplitable to always return false.

Answer: D

 

52. Which process describes the lifecycle of a Mapper?

A. The JobTracker calls the TaskTracker’s configure() method, then its map() method and finally its close() method.

B. The TaskTracker spawns a new Mapper to process all records in a single input split.

C. The TaskTracker spawns a new Mapper to process each key-value pair.

D. The JobTracker spawns a new Mapper to process all records in a single file.

Answer: B

 

53. Which best describes when the reduce method is first called in a MapReduce job?

A. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.

B. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.

C. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only or reduce-only jobs.

D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.

Answer: B

 

54. You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:

output.collect(new Text("Apple"), new Text("Red"));
output.collect(new Text("Banana"), new Text("Yellow"));
output.collect(new Text("Apple"), new Text("Yellow"));
output.collect(new Text("Cherry"), new Text("Red"));
output.collect(new Text("Apple"), new Text("Green"));

How many times will the Reducer’s reduce method be invoked?

A. 6

B. 3

C. 1

D. 0

E. 5

Answer: B
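The five collect calls emit only three distinct keys (Apple, Banana, Cherry), and the framework invokes reduce once per unique key, so the reduce method runs three times.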

 

55. To process input key-value pairs, your mapper needs to load a 512 MB data file into memory. What is the best way to accomplish this?

A. Serialize the data file, insert in it the JobConf object, and read the data into memory in the configure method of the mapper.

B. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.

C. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.

D. Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.

Answer: D
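A minimal sketch of the DistributedCache-plus-configure() approach described in option D, using the classic org.apache.hadoop.mapred API; the cached file path and the parsing logic are illustrative. configure() runs once per task, so the 512 MB file is loaded a single time rather than once per record in map().

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CacheMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // In the driver: DistributedCache.addCacheFile(new java.net.URI("/lookup/data.txt"), conf);

  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (cached != null && cached.length > 0) {
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        // ... read the cached file into an in-memory structure here ...
        reader.close();
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // use the data loaded in configure() for every input record
  }
}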

 

56. In a MapReduce job, the reducer receives all values associated with the same key. Which statement best describes the ordering of these values?

A. The values are in sorted order.

B. The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job.

C. The values are arbitrarily ordered, but multiple runs of the same MapReduce job will always have the same ordering.

D. Since the values come from mapper outputs, the reducers will receive contiguous sections of sorted values.

Answer: B

 

57. You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?

A. Processor and network I/O

B. Disk I/O and network I/O

C. Processor and RAM

D. Processor and disk I/O

Answer: B

 

58. You want to count the number of occurrences for each unique word in the supplied input data.

You’ve decided to implement this by having your mapper tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully implementing this, it occurs to you that you could optimize this by specifying a combiner. Will you be able to reuse your existing Reducer as your combiner in this case and why or why not?

A. Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match.

B. No, because the sum operation in the reducer is incompatible with the operation of a Combiner.

C. No, because the Reducer and Combiner are separate interfaces.

D. No, because the Combiner is incompatible with a mapper which doesn’t use the same data type for both the key and value.

E. Yes, because Java is a polymorphic object-oriented language and thus reducer code can be reused as a combiner.

Answer: A
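A minimal driver-side sketch of option A, assuming a sum reducer like the SumReducer sketched under question 19; because summing is associative and commutative and the reducer's input and output types are both (Text, IntWritable), the same class can be registered as the combiner.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setReducerClass(SumReducer.class);    // sums the literal 1s emitted per word
    conf.setCombinerClass(SumReducer.class);   // the same class reused as the combiner
  }
}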

 

59. Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.

A. TaskTracker

B. NameNode

C. DataNode

D. JobTracker

E. Secondary NameNode

Answer: D

 

60. Which project gives you a distributed, scalable data store that allows you random, real-time read/write access to hundreds of terabytes of data?

A. HBase

B. Hue

C. Pig

D. Hive

E. Oozie

F. Flume

G. Sqoop

Answer: A

 

61. What is a SequenceFile?

A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.

B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.

C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.

D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.

Answer: D

 

62. Given a directory of files with the following structure: line number, tab character, string:

Example:

abialkjfjkaoasdfjksdlkjhqweroij

kadf jhuwqounahagtnbvaswslmnbfgy

kjfteiomndscxeqalkzhtopedkfslkj

You want to send each line as one record to your Mapper. Which InputFormat would you use to complete the line: setInputFormat(________.class);

A. BDBInputFormat

B. KeyValueTextInputFormat

C. SequenceFileInputFormat

D. SequenceFileAsTextInputFormat

Answer: B

 

63. In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?

A. Increase the parameter that controls minimum split size in the job configuration.

B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.

C. Set the number of mappers equal to the number of input files you want to process.

D. Write a custom FileInputFormat and override the method isSplitable to always return false.

Answer: D

 

64. Which of the following best describes the workings of TextInputFormat?

A. Input file splits may cross line breaks. A line that crosses file splits is ignored.

B. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.

C. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.

D. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.

E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.

Answer: E

 

65. Which of the following statements most accurately describes the relationship between MapReduce and Pig?

A. Pig provides additional capabilities that allow certain types of data manipulation not possible with MapReduce.

B. Pig provides no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.

C. Pig programs rely on MapReduce but are extensible, allowing developers to do special-purpose processing not provided by MapReduce.

D. Pig provides the additional capability of allowing you to control the flow of multiple MapReduce jobs.

Answer: D

 

66. You need to import a portion of a relational database every day as files to HDFS, and generate Java classes to interact with your imported data. Which of the following tools should you use to accomplish this?

A. Pig

B. Hue

C. Hive

D. Flume

E. Sqoop

F. Oozie

G. fuse-dfs

Answer: E

 

67. You have an employee who is a Data Analyst and is very comfortable with SQL. He would like to run ad-hoc analysis on data in your HDFS cluster. Which of the following is a data warehousing software built on top of Apache Hadoop that defines a simple SQL-like query language well-suited for this kind of user?

A. Pig

B. Hue

C. Hive

D. Sqoop

E. Oozie

F. Flume

G. Hadoop Streaming

Answer: C

 

68. Workflows expressed in Oozie can contain:

A. Iterative repetition of MapReduce jobs until a desired answer or state is reached.

B. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.

C. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.

D. Sequences of MapReduce and Pig. These sequences can be combined with other actions including forks, decision points, and path joins.

Answer: D

 

69. You need a distributed, scalable data store that allows you random, real-time read/write access to hundreds of terabytes of data. Which of the following would you use?

A. Hue

B. Pig

C. Hive

D. Oozie

E. HBase

F. Flume

G. Sqoop

Answer: E

 

70. Which of the following utilities allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?

A. Oozie

B. Sqoop

C. Flume

D. Hadoop Streaming

Answer: D