In recent years, especially in the last decade or so, the IT sector has gone through significant changes. In particular, we have seen massive growth in data and in the ways it is managed.
Today, an enormous amount of data is shared and accessed every day. Like many buzzwords, the concept of “Big Data” is not always clearly defined. At its core, Big Data describes data-related problems that cannot be solved with traditional tools and technologies. The difficulty comes not only from the sheer volume of data involved, but also from its variety and from the speed at which users need to process it.
Analyzing and learning from Big Data has opened several doors of opportunity. Many technologies and platforms have been introduced to learn from the enormous amounts of data collected from different sources. Apache Hadoop has been the talk of the town when it comes to managing Big Data, and it is considered one of the best and most widely used open-source frameworks for data handling.
In this article, we will go through the Hadoop Cheat Sheet, which can be used as a ready reference and guide while learning Hadoop.
Big Data and Hadoop
In a technology-driven world, data drives the growth of organizations. Companies across the globe ingest raw data in significant volumes from many different sources, but screening out the useful and insightful data from such a large pool is tricky. That is where Big Data tools come into play. Hadoop is an open-source framework introduced by Apache, which is used to process Big Data.
The potential of Big Data in the coming years cannot be ignored, and its applications span a wide range of areas. Some of them are listed below:
| Areas | Application of Big Data |
| --- | --- |
| Targeting Consumers | With the help of Big Data, companies can analyze and understand consumer behavior and target consumers in a personalized fashion. |
| Science and Research | Big Data plays a vital role in the fields of science and research. |
| Finance | With the help of Big Data algorithms, companies and individuals analyze markets and trading opportunities. |
| Health and Medical | Big Data helps in managing the health records and medical conditions of individuals. |
| Security | Security agencies around the world use Big Data as a tool to keep track of terrorists and other security risks. |
| Government | Governments use Big Data in many forms, for example, to keep track of population data, identity-related information, and family records. |
What is Apache Hadoop?
Hadoop is a Big Data management solution based on a framework built by the Apache Software Foundation. Hadoop is open-source software. Some of the biggest organizations around the world use Hadoop extensively for distributed storage and processing of data that is enormous in volume. To handle data at that scale, Hadoop distributes its storage and processing across large clusters of commodity hardware. Some of the key features of the Hadoop platform are:
- Data storage
- Data Processing
- Data Access
- Data Analysis and Governance
- Data Security
- Operations and Deployment
Hadoop is a top-level Apache project, built and used by a diverse community of developers, users, and contributors from around the world under the umbrella of the Apache Software Foundation. It is released and governed under the Apache License 2.0.
Because of the considerable amount of data involved, Hadoop can operate on thousands of nodes, and the probability of some node failing at any given time is very high. However, the platform is resilient to this problem thanks to the Hadoop Distributed File System, which replicates data blocks across multiple nodes. When a node failure is detected, the system serves and re-replicates the affected data from the remaining nodes, allowing the whole platform to keep operating without a hassle.
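As a quick illustration (assuming a running cluster; the file path below is only a placeholder), you can check overall DataNode health and see where a file's blocks are replicated with:
hdfs dfsadmin -report
hdfs fsck /user/hadoop/file1 -files -blocks -locations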
Basic Terms and Definitions
Big Data: Big Data comprises large sets of data that cannot be processed using traditional computing tools and techniques. These data sets are characterized by huge volume, high velocity, and a wide variety of formats.
Hadoop: An open-source framework by Apache, written in Java. The platform allows the distributed processing of large datasets across clusters of computers using simple programming models.
Hadoop Common: The libraries and utilities, written in Java, that are required by the other Hadoop modules, along with the scripts and files needed to start Hadoop.
Hadoop YARN: Hadoop YARN is a framework that is used for scheduling jobs and managing different cluster resources.
Hadoop Distributed File System: Simply known as HDFS, the Hadoop Distributed File System is a Java-based file system that provides scalable and reliable data storage as well as high-throughput access to application data.
Hadoop MapReduce: A framework for easily writing applications that process large amounts of data in parallel on large clusters.
Apache Hive: A software infrastructure that is used for data warehousing for Hadoop.
Apache Oozie: Also written in Java, this application is mainly responsible for scheduling Hadoop jobs.
Apache Pig: A data-flow platform used to write and execute MapReduce jobs through its scripting language, Pig Latin.
Apache Spark: Another open-source framework that is used for cluster computing.
Flume: Flume is also an open-source framework that is used for the collection and transportation of data from source to destination.
HBase: To store Big Data in a scalable way, Apache HBase is used, which is a column-oriented database of Hadoop.
Sqoop: Sqoop is an interface application that works through various commands to transfer data between Hadoop and relational databases.
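To give a feel for how some of these tools are invoked from the command line (the connection string, database, table, and query below are hypothetical placeholders), typical one-liners look like this:
sqoop import --connect jdbc:mysql://dbhost/sales --table orders --target-dir /user/hadoop/orders
hive -e "SELECT COUNT(*) FROM orders;"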
Hadoop Ecosystem
The Hadoop ecosystem comprises the different components of the Apache software stack. It can be divided into the following categories:
Top-Level Interface:
- ETL Tools
- BI Reporting
- RDBMS
Top-Level Abstraction:
- PIG
- HIVE
- Sqoop
Distributed Data Processing:
- MapReduce
- HBASE (Database with real-time access)
Self-Healing Clustered Storage System:
- Hadoop Distributed File System
Hadoop Distributed File System Shell Commands
The Hadoop shell is a set of commands administered through the command line of your operating system. The shell provides two sets of commands:
- commands for file manipulation (similar in purpose and syntax to Linux commands)
- commands for Hadoop administration.
The following list sums up the commands, their usage with examples:
cat: Copies the specified source paths to standard output
Usage:
hdfs dfs -cat URI [URI …]
Example:
hdfs dfs -cat hdfs://<path>/file5
hdfs dfs -cat file:///file6 /user/hadoop/file7
chgrp: It changes the group relationship of files
Usage:
hdfs dfs -chgrp [-R] GROUP URI [URI …]
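Example (the group name hadoopgroup here is just a placeholder):
hdfs dfs -chgrp -R hadoopgroup /user/hadoop/dir1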
chmod: It changes the access permissions of files
Usage:
hdfs dfs -chmod [-R] <MODE[,MODE]… | OCTALMODE> URI [URI …]
Example:
hdfs dfs -chmod 666 test/data1.txt
chown: This command changes the owner of files
Usage:
hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]
Example:
hdfs dfs -chown -R hduser3 /opt/hadoop/logs
copyFromLocal: It works just like the put command, but the source is restricted to a local file reference.
Usage:
hdfs dfs -copyFromLocal <localsrc> URI
Example:
hdfs dfs -copyFromLocal input/docs/data1.txt hdfs://localhost/user/rosemary/data1.txt
copyToLocal: Works like the get command, but the destination is restricted to a local file reference.
Usage:
hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Example:
hdfs dfs -copyToLocal data1.txt data1.copy.txt
count: This command counts the number of directories, files, and bytes
Usage:
hdfs dfs -count [-q] <paths>
Example:
hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
cp: Copies one or more files from a specified source to a specified destination. When multiple sources are given, the destination must be a directory.
Usage:
hdfs dfs -cp URI [URI …] <dest>
Example:
hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
du: Displays the size of the specified file, or the sizes of the files and directories contained in the specified directory. The -s option displays an aggregate summary of file sizes, while the -h option formats the sizes in a human-readable way.
Usage:
hdfs dfs -du [-s] [-h] URI [URI …]
Example:
hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1
expunge: This command empties the trash. When a file is deleted, it isn't removed from HDFS immediately; instead, it is renamed to a file in the trash directory.
Usage:
hdfs dfs -expunge
get: It copies files to the local file system. By specifying the -ignorecrc option, files that fail a cyclic redundancy check (CRC) can still be copied. CRC checksum files have the .crc extension and are used to verify the data integrity of another file.
Usage:
hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
Example:
hdfs dfs -get /user/hadoop/file3 localfile
getmerge: Concatenates the files in src and writes the result to the specified local destination file. To add a newline character at the end of each file, specify the addnl option.
Usage:
hdfs dfs -getmerge <src> <localdst> [addnl]
Example:
hdfs dfs -getmerge /user/hadoop/mydir/ ~/result_file addnl
ls: Returns the statistics of the specified files or directories.
Usage:
hdfs dfs -ls <args>
Example:
hdfs dfs -ls /user/hadoop/file1
lsr: This serves as the recursive version of ls, similar to the Unix command ls -R.
Usage:
hdfs dfs -lsr <args>
Example:
hdfs dfs -lsr /user/hadoop
mkdir: It creates directories on one or more specified paths. Its behavior is similar to the Unix mkdir -p command, which creates all directories leading up to the specified directory if they don't already exist.
Usage:
hdfs dfs -mkdir <paths>
Example:
hdfs dfs -mkdir /user/hadoop/dir5/temp
moveFromLocal: It works the way the put command works, except that the source is deleted after it has been copied.
Usage:
hdfs dfs -moveFromLocal <localsrc> <dest>
Example:
hdfs dfs -moveFromLocal localfile1 localfile2 /user/hadoop/hadoopdir
mv: It moves one or more files from a specified source to a specified destination. Moving files across file systems isn't permitted.
Usage:
hdfs dfs -mv URI [URI …] <dest>
Example:
hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2
put: Copies files from the local file system to the destination file system. It can also read input from stdin and write it to the destination file system.
Usage:
hdfs dfs -put <localsrc> … <dest>
Example:
hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir; hdfs dfs -put - /user/hadoop/hadoopdir
rm: This command is used to delete one or multiple files
Usage:
hdfs dfs -rm [-skipTrash] URI [URI …]
Example:
hdfs dfs -rm hdfs://nn.example.com/file8
rmr: It works as the recursive version of rm
Usage:
hdfs dfs -rmr [-skipTrash] URI [URI …]
Example:
hdfs dfs -rmr /user/hadoop/dir
setrep: It changes the replication factor for a specified file or directory. Using -R makes the change recursively through the directory structure.
Usage:
hdfs dfs -setrep [-R] <rep> <path>
Example:
hdfs dfs -setrep -R 3 /user/hadoop/dir1
stat: Used to display information about any specified path
Usage:
hdfs dfs -stat URI [URI …]
Example:
hdfs dfs -stat /user/hadoop/dir1
tail: It shows the last kilobyte of a specified file on stdout. The command supports the Unix -f option, which allows the specified file to be monitored: as new lines are added to the file by another process, tail updates the display accordingly.
Usage:
hdfs dfs -tail [-f] URI
Example:
hdfs dfs -tail /user/hadoop/dir1
test: The test command checks a property of the specified file or directory and reports the result through its exit status. The different options are as under:
- -e : checks whether the file or directory exists
- -z : checks whether the file is zero length
- -d : checks whether the path is a directory
Usage:
hdfs dfs -test -[ezd] URI
Example:
hdfs dfs -test -e /user/hadoop/dir1
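Since test reports its result through the exit status rather than printing output, a typical way to use it from a shell script (the path is only a placeholder) is:
hdfs dfs -test -e /user/hadoop/dir1 && echo "dir1 exists"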
text: This command outputs a specified source file in text format. Valid input file formats used are zip and TextRecordInputStream.
Usage:
hdfs dfs -text <src>
Example:
hdfs dfs -text /user/hadoop/file8.zip
Hadoop Administration Commands
Hadoop administration commands are a comprehensive set of commands used for cluster administration. The following list summarizes the most important commands:
balancer: It runs the cluster-balancing utility
Syntax:
hadoop balancer [-threshold <threshold>]
Example:
hadoop balancer -threshold 20
daemonlog: Used to get or set the log level for each daemon (service). It connects to http://host:port/logLevel?log=name and prints or sets the log level of the daemon running at host:port.
Syntax:
hadoop daemonlog -getlevel <host:port> <name>; hadoop daemonlog -setlevel <host:port> <name> <level>
Example:
hadoop daemonlog -getlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker
datanode: It starts and runs the HDFS DataNode service, which coordinates storage on each slave node. By specifying -rollback, the DataNode is reverted to the previous version.
Syntax:
hadoop datanode [-rollback]
Example:
hadoop datanode -rollback
dfsadmin: It runs a number of HDFS administrative operations. Use the -help option to get a list of all supported options.
Syntax:
hadoop dfsadmin [GENERIC_OPTIONS] [-report] [-safemode enter | leave | get | wait] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota> <dirname>…<dirname>] [-clrQuota <dirname>…<dirname>] [-restoreFailedStorage true|false|check] [-help [cmd]]
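For example (assuming a running cluster and administrator privileges), two commonly used dfsadmin invocations are:
hadoop dfsadmin -report
hadoop dfsadmin -refreshNodes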
mradmin: This command runs various MapReduce administrative operations. Use the -help option to see a list of all supported options. Some of the operations are as under:
- -refreshServiceAcl: reloads the service-level authorization policy file
- -refreshQueues: It reloads the queue ACLs and state
- -refreshNodes: refreshes the hosts’ information
- -refreshUserToGroupsMappings: refreshes user-to-groups mappings
- -refreshSuperUserGroupsConfiguration: refreshes superuser proxy groups mappings
- -help [cmd]: displays help for the given command or all commands in general
Syntax:
hadoop mradmin [ GENERIC_OPTIONS ] [-refreshServiceAcl] [-refreshQueues] [-refreshNodes] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-help [cmd]]
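Example (reloading the queue configuration on a classic MRv1 cluster):
hadoop mradmin -refreshQueues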
jobtracker: This command runs the MapReduce JobTracker node, which coordinates the data processing system for Hadoop. When -dumpConfiguration is specified, the configuration used by the JobTracker and the queue configuration are written to standard output in JSON format.
Syntax:
hadoop jobtracker [-dumpConfiguration]
Example:
hadoop jobtracker -dumpConfiguration
namenode: Executes the NameNode, which coordinates the storage for the whole Hadoop cluster. Some other functions are as under:
- -format: NameNode is started, formatted, and stopped
- -upgrade: NameNode starts with the upgrade option
- -rollback: The NameNode is rolled back to the previous version
- -finalize: The previous state of the file system is removed, which makes the most recent upgrade permanent. Rollback is no longer available, and the NameNode is eventually stopped
- -importCheckpoint: The image is loaded from the checkpoint directory and saved into the current directory.
Syntax:
hadoop namenode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]
Example:
hadoop namenode -finalize
tasktracker: Executes a MapReduce TaskTracker node
Syntax:
hadoop tasktracker
Hadoop dfsadmin Commands
The dfsadmin tools are a particular set of tools intended to help you find and retrieve information about the Hadoop Distributed File System (HDFS). In addition, these commands can be used to perform administration operations on HDFS.
| Command | Function |
| --- | --- |
| -report | Gives basic file system information and statistics. |
| -safemode enter \| leave \| get \| wait | Manages safe mode, a state of the NameNode in which changes to the namespace are not accepted and blocks can be neither replicated nor deleted. The NameNode normally runs in safe mode during start-up so that it doesn't prematurely start replicating blocks even when there are enough replicas in the cluster. |
| -refreshNodes | Forces the NameNode to reread its configuration, including the dfs.hosts.exclude file. The NameNode decommissions nodes after their blocks have been replicated onto machines that will remain active. |
| -finalizeUpgrade | Completes the HDFS upgrade process. DataNodes and the NameNode delete the working directories from the previous version. |
| -upgradeProgress status \| details \| force | Requests the current status, or detailed status, of the distributed upgrade, or forces the upgrade to proceed. |
| -metasave filename | Saves the NameNode's primary data structures to filename in a directory specified by the hadoop.log.dir property. The file, which is overwritten if it already exists, contains one line for each of these items: a) DataNodes that are exchanging heartbeats with the NameNode, b) blocks waiting to be replicated, c) blocks currently being replicated, and d) blocks waiting to be deleted. |
| -setQuota | Sets an upper limit on the number of names in the directory tree. This limit can be set for one or more directories simultaneously. |
| -clrQuota | Clears the upper limit on the number of names in the directory tree. The limit can be removed for one or more directories simultaneously. |
| -restoreFailedStorage true \| false \| check | Turns automatic attempts to restore failed storage replicas on or off. If a failed storage location becomes available again, the system tries to restore the edits and the fsimage during a checkpoint. The check option returns the current setting. |
| -help [cmd] | Displays help information for the given command, or for all commands if none is specified. |
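For instance (assuming administrator privileges on a running cluster), you can check the current safe mode state and then leave safe mode with:
hadoop dfsadmin -safemode get
hadoop dfsadmin -safemode leave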
Hadoop YARN Basic Commands
| Commands | Tasks |
| --- | --- |
| yarn | Shows the yarn help |
| yarn [--config confdir] | Used to define the configuration file |
| yarn [--loglevel loglevel] | Used to define the log level, which can be fatal, error, warn, info, debug, or trace |
| yarn classpath | Shows the Hadoop classpath |
| yarn application | Used to display and kill Hadoop applications |
| yarn applicationattempt | Shows the application attempt |
| yarn container | Displays the container information |
| yarn node | Shows the node information |
| yarn queue | Shows information about the queue |
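As a short illustration (the application ID below is only a placeholder), typical YARN CLI invocations look like this:
yarn application -list
yarn node -list
yarn logs -applicationId application_1234567890123_0001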
Conclusion
With the rapidly advancing IT sector and ever-expanding data volumes, it is no exaggeration to say that data drives modern organizations around the world. Understanding and mining this data unravels patterns and reveals unseen linkages within the vast sky of data, which is becoming a critical and rewarding journey for organizations.
Entities all over the world are now sensing the importance of, and the need for, converting Big Data into Business Intelligence to reap the benefits. A better pool of data leads to better decision making and an enriched way to plan for organizations, irrespective of their size, location, products and services, and other aspects. Hadoop, without any doubt, has become the preferred platform for working with vast volumes of data.
If you are looking to start a career in the Big Data and data management field, getting your hands on Hadoop will empower you as a professional to excel in the field of Big Data.
Start your 30-day FREE TRIAL with DataScienceAcademy.io and become an expert in Hadoop, one of the most in-demand skills for Big Data and data science career success.