1. What is the difference between a Hadoop database and a Relational Database?
Hadoop is not a database; it is a framework built around a distributed file system called HDFS. Data is stored in HDFS as files, with no predefined containers.
A relational database stores data in predefined containers (tables).
2. What is HDFS?
HDFS stands for Hadoop Distributed File System. It is a framework in which many machines store large amounts of data as files across a Hadoop cluster.
3. What is MapReduce?
MapReduce is a programming model, and the set of programs built on it, used to access and manipulate large data sets across a Hadoop cluster.
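To make the answer above concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the task. It does not use Hadoop itself; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a real job would run these steps in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on Hadoop", "Hadoop stores big files"]))
```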
4. What is an InputSplit in MapReduce?
An InputSplit is the slice of data to be processed by a single Mapper. It is generally the size of an HDFS block, as stored on a DataNode.
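As a rough illustration of the answer above, the sketch below cuts a file into splits of one block each and returns the offset and length of every split, so each entry corresponds to one Mapper. The function name split_ranges is an assumption made for this example, and the real split computation in Hadoop handles further details that are ignored here.

```python
MB = 1024 * 1024


def split_ranges(file_size, split_size):
    """Return (offset, length) for each split of a file, all sizes in bytes.

    Every split is split_size bytes long except possibly the last one,
    which holds whatever remains of the file.
    """
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


if __name__ == "__main__":
    # A 300 MB file with 128 MB splits needs three Mappers in this model.
    for offset, length in split_ranges(300 * MB, 128 * MB):
        print(f"split at {offset // MB} MB, length {length // MB} MB")
```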
5. What is the meaning of replication factor?
The replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3. This also means that you need 3 times the amount of storage that the data itself requires. Each file is split into data blocks that are spread across the cluster.
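The storage arithmetic in the answer above can be written out as a small sketch. The helper names raw_storage_needed and usable_capacity are assumptions for this example, and the numbers ignore any overhead besides replication.

```python
def raw_storage_needed(data_size_gb, replication_factor=3):
    """Raw cluster storage consumed when every block is stored replication_factor times."""
    return data_size_gb * replication_factor


def usable_capacity(raw_capacity_gb, replication_factor=3):
    """Amount of data a cluster can hold once replication is accounted for."""
    return raw_capacity_gb / replication_factor


if __name__ == "__main__":
    print(raw_storage_needed(100))  # 100 GB of data at the default factor of 3
    print(usable_capacity(900))     # data that fits in 900 GB of raw disk
```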
6. What is the default replication factor in HDFS?
Hadoop ships with a default replication factor of 3. You can set the replication level individually for each file in HDFS. In addition to providing fault tolerance, having replicas allows jobs that consume the same data to run in parallel. Also, if there are replicas of the data, Hadoop can attempt to run multiple copies of the same task and take whichever finishes first (speculative execution). This is useful if, for some reason, a machine is running slowly.
Most Hadoop administrators set the default replication factor for their files to three. The main assumption here is that if you keep three copies of the data, your data is safe. This has held true in the big clusters that we manage and operate.
7. What is the typical block size of an HDFS block?
The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later), and 128 MB is typical.
8. What is the NameNode?
The NameNode is one of the daemons that run on the master node. It holds the metadata recording where each chunk of data resides (i.e. on which DataNode). Based on this metadata, an incoming job is mapped to the corresponding DataNode...
9. How does the master-slave architecture work in Hadoop?
In total, 5 daemons run in the Hadoop master-slave architecture.
On the master node: NameNode, JobTracker and Secondary NameNode
On the slave nodes: DataNode and TaskTracker
However, it is recommended to run the Secondary NameNode on a separate machine with the same capacity as the master node.
10. What are compute and storage nodes?
Hadoop can be divided into 2 parts:
Distributed processing: MapReduce
Distributed storage: HDFS
The NameNode holds the metadata, and the DataNodes hold the actual data and run the MapReduce programs on it.
11. Explain the input and output data formats of the Hadoop framework.
FileInputFormat, TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat and WholeFileInputFormat are input formats used with the Hadoop framework.
How can we control which reducer a particular key goes to?
By using a custom partitioner.
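The idea of a custom partitioner can be sketched outside Hadoop as a function that maps a key to a reducer index. In the example below, default_partition imitates hash-based partitioning (zlib.crc32 stands in for Java's hashCode(), which is an assumption of this sketch), and custom_partition pins chosen keys to chosen reducers. The function names and the pinned mapping are hypothetical; in a real job the same logic would go into a subclass of Hadoop's Partitioner.

```python
import zlib


def default_partition(key, num_reducers):
    """Hash-based partitioning: spreads keys across reducers.

    zlib.crc32 is used only because it is deterministic across runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def custom_partition(key, num_reducers, pinned=None):
    """Send pinned keys to a chosen reducer; hash all other keys as usual."""
    pinned = pinned or {}
    if key in pinned:
        # The modulo keeps the result valid even if the pinned index is too large.
        return pinned[key] % num_reducers
    return default_partition(key, num_reducers)


if __name__ == "__main__":
    routing = {"error": 0}  # hypothetical rule: all "error" records go to reducer 0
    for key in ["error", "warn", "info"]:
        print(key, "-> reducer", custom_partition(key, 4, routing))
```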
12. What happens if the number of reducers is 0?
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
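The behaviour described above can be imitated in a few lines. This is a toy, single-process model rather than Hadoop code: run_job and tokenize are names invented for this sketch. With no reducer, the map output is returned exactly as emitted, unsorted; with a reducer, the pairs are sorted and grouped by key first.

```python
def tokenize(line):
    """Example mapper: emit (word, 1) for each word in a line."""
    return [(word, 1) for word in line.split()]


def run_job(records, mapper, reducer=None):
    """Toy job runner: reducer=None models a job with zero reduce tasks."""
    map_output = [pair for record in records for pair in mapper(record)]
    if reducer is None:
        # Zero reducers: map output is the final output, in emission order, no sort.
        return map_output
    grouped = {}
    for key, value in sorted(map_output, key=lambda pair: pair[0]):
        grouped.setdefault(key, []).append(value)
    return [reducer(key, values) for key, values in grouped.items()]


if __name__ == "__main__":
    lines = ["b a", "c a"]
    print(run_job(lines, tokenize))                             # map-only
    print(run_job(lines, tokenize, lambda k, vs: (k, sum(vs))))  # with a reducer
```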
13. How many instances of JobTracker can run on a Hadoop cluster?
One. There can only be one JobTracker in the cluster. This can be run on the same machine running the NameNode.
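14. How does the NameNode handle DataNode failures?
The NameNode detects failed DataNodes through periodic heartbeats: a DataNode that stops sending heartbeats is marked dead and its blocks are re-replicated on other nodes. Corrupted data is detected through checksums. Every piece of data is stored with a checksum, and if the checksum computed on read does not match the original, a data corruption error is reported.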
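A minimal sketch of the checksum idea, assuming each record is stored together with a CRC32 of its bytes. The helper names store, verify and read are made up for this example, and zlib.crc32 is used purely for illustration; it is not a statement about the exact checksum HDFS computes.

```python
import zlib


def store(data: bytes):
    """Keep the data together with a checksum computed at write time."""
    return data, zlib.crc32(data)


def verify(data: bytes, expected_checksum: int) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(data) == expected_checksum


def read(record):
    """Return the data, or raise an error if it no longer matches its checksum."""
    data, checksum = record
    if not verify(data, checksum):
        raise IOError("data corrupted: checksum mismatch")
    return data


if __name__ == "__main__":
    record = store(b"hello hdfs")
    print(read(record))
    damaged = (b"hellO hdfs", record[1])  # one byte changed after writing
    try:
        read(damaged)
    except IOError as error:
        print(error)
```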
15. Can I set the number of reducers to zero?
Yes, the number of reducers can be set to zero. In that case the mapper output is the final output and is stored in HDFS.