Hadoop Cheat Sheet: 2020 Edition

Over the last decade or so, the IT sector has gone through significant changes, and one of the most striking has been the massive growth in data and in the systems needed to manage it.

Today, an enormous amount of data is shared and accessed every day. Like many buzzwords, the concept of “Big Data” is not always clearly defined. At its core, Big Data describes data-related problems that are impossible to solve with traditional tools and technologies. The difficulty comes partly from the sheer volume of data involved, but the variety of that data and the speed at which users need it processed also make Big Data hard to handle.

Analyzing and learning from Big Data has opened many doors of opportunity. A range of technologies and platforms have been introduced to extract value from the enormous amounts of data collected from different sources. Apache Hadoop has been the talk of the town when it comes to managing Big Data, and it is considered one of the best and most widely preferred open-source frameworks for the job.

In this article, we will go through the Hadoop Cheat Sheet, which can be used as a ready reference and guide while learning Hadoop.

Big Data and Hadoop

In a technology-driven world, data drives the growth of organizations. Companies across the globe ingest raw data in significant volumes from many different sources, but screening the useful, insightful data out of that large pool is tricky. That is where Big Data technologies come into play. Hadoop is an open-source framework from Apache that is used to store and process Big Data.

The potential of Big Data in the coming years cannot be ignored. Its applications span a wide range of areas, including the following:

  • Targeting Consumers: With the help of Big Data, companies can analyze and understand consumer behavior and target consumers in a personalized fashion.
  • Science and Research: Big Data has been playing a vital role in the fields of science and research.
  • Finance: With the help of Big Data algorithms, companies and individuals analyze markets and trading opportunities.
  • Health and Medical: Big Data has been beneficial in managing the health records and medical conditions of individuals.
  • Security: Security agencies around the world use Big Data as a tool to keep track of terrorists and other security risks.
  • Government: Governments use Big Data in many forms, for example, to keep track of population figures, identity-related information, and family records.

 

What is Apache Hadoop?

Hadoop is a Big Data management solution based on a framework built by the Apache Software Foundation. It is open-source software, and some of the biggest organizations around the world use it extensively for the distributed storage and processing of data that is enormous in terms of volume. Given the amount of data involved, Hadoop typically runs its processing on large clusters of commodity hardware. Some of the key features of the Hadoop platform are:

  • Data storage
  • Data Processing
  • Data Access
  • Data Analysis and Governance
  • Data Security
  • Operations and Deployment.

Hadoop is a top-level Apache project, built and used by a diverse group of developers, users, and contributors from around the world under the umbrella of the Apache Software Foundation. It is released and governed under the Apache License 2.0.

Because Hadoop operates on clusters of hundreds or thousands of nodes, the probability that some node will fail at any given time is very high. The platform is resilient to this problem thanks to the Hadoop Distributed File System, which replicates each block of data across several nodes; when a node fails, requests are immediately redirected to the surviving replicas, and the platform keeps operating without a hassle.
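
As a quick, illustrative check (the path below is just an example, not one from this cheat sheet), the fsck tool reports how HDFS has spread a file's blocks and replicas across the cluster:

hdfs fsck /user/hadoop/file1 -files -blocks -locations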

 

Basic Terms and Definitions

Big Data: Big Data comprises large data sets that cannot be processed using traditional computing tools and techniques. These data sets are characterized by huge volume, high velocity, and a wide variety of formats.

Hadoop: An open-source framework by Apache, which is written in Java. This platform allows the distributed processing of large datasets across various clusters of computers using simple programming techniques.

Hadoop Common: The Java libraries and utilities required by the other Hadoop modules, together with the scripts and files needed to start Hadoop.

Hadoop YARN: Hadoop YARN is a framework that is used for scheduling jobs and managing different cluster resources.

Hadoop Distributed File System: Simply known as HDFS, the Hadoop Distributed File System is a Java-based system that provides scalable and reliable data storage. It also offers high-throughput access to application data.

Hadoop MapReduce: A framework that makes it easier to write applications that process large amounts of data in parallel on large clusters.

Apache Hive: A software infrastructure that is used for data warehousing for Hadoop.

Apache Oozie: Also written in Java, this application is mainly responsible for scheduling Hadoop jobs.

Apache Pig: A high-level data-flow platform whose scripts are executed as MapReduce jobs.

Apache Spark: Another open-source framework that is used for cluster computing.

Flume: Flume is also an open-source framework that is used for the collection and transportation of data from source to destination.

HBase: To store Big Data in a scalable way, Apache HBase is used, which is a column-oriented database of Hadoop.

Sqoop: Sqoop is an interface application that works through various commands to transfer data between Hadoop and relational databases.
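
As a quick illustration of the Sqoop interface described above, a typical import from a relational database into HDFS looks roughly like the following (the connection string, table, and paths are placeholders rather than values from this cheat sheet):

sqoop import --connect jdbc:mysql://<dbhost>/<database> --table <table> --username <user> -P --target-dir /user/hadoop/<table>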

 

Hadoop Ecosystem

The Hadoop ecosystem comprises the different components and tools built around the core Apache Hadoop framework. It can be divided into the following categories:

Top-Level Interface:

  • ETL Tools
  • BI Reporting
  • RDBMS

Top-Level Abstraction:

  • PIG
  • HIVE
  • Sqoop

Distributed Data Processing:

  • MapReduce
  • HBASE (Database with real-time access)

Self-Healing Clustered Storage System:

  • Hadoop Distributed File System

 

Hadoop Distributed File System Shell Commands

The Hadoop shell is a family of commands that you run from your operating system's command line. The shell has two sets of commands:

  • File manipulation commands (similar in purpose and syntax to Linux commands)
  • Hadoop administration commands.

The following list sums up the commands, their usage, and examples:

cat: Copies the specified source paths to standard output (stdout).

Usage:

 

hdfs dfs -cat URI [URI …]

 

Example:

 

hdfs dfs -cat hdfs://<path>/file5

 

hdfs dfs -cat file:///file6 /user/hadoop/file7

 

 

chgrp: Changes the group association of files. With -R, the change is made recursively through the directory structure.

Usage:

 

hdfs dfs -chgrp [-R] GROUP URI [URI …]
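
Example (a minimal illustration; the group name here is a placeholder rather than one taken from this cheat sheet):

hdfs dfs -chgrp -R <group> /user/hadoop/dir1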

 

 

chmod: It changes the access permissions of files

Usage:

 

hdfs dfs -chmod [-R] <MODE[,MODE]… | OCTALMODE> URI [URI …]

 

Example:

 

hdfs dfs -chmod 666 test/data1.txt

 

 

chown: This command changes the owner of files

Usage:

 

hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]

 

Example:

 

hdfs dfs -chown -R hduser3 /opt/hadoop/logs

copyFromLocal: It works just like the put command, but the source is restricted to a local file reference.

Usage:

 

hdfs dfs -copyFromLocal <localsrc> URI

 

Example:

 

hdfs dfs -copyFromLocal input/docs/data1.txt hdfs://localhost/user/rosemary/data1.txt

 

 

copyToLocal: Works like the get command, but the destination is restricted to a local file reference.

Usage:

 

hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

 

Example:

 

hdfs dfs -copyToLocal data1.txt data1.copy.txt

 

 

count: This command counts the number of directories, files, and bytes

Usage:

 

hdfs dfs -count [-q] <paths>

 

Example:

 

hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2

 

 

cp: Copies one or more files from a specified source to a designated destination. When multiple sources are given, the destination must be a directory.

Usage:

 

hdfs dfs -cp URI [URI …] <dest>

 

Example:

 

hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

 

 

du: Displays the size of the specified file, or the sizes of the files and directories contained in the specified directory. The -s option displays an aggregate summary of the file sizes, while the -h option formats the sizes so they are human-readable.

Usage:

 

hdfs dfs -du [-s] [-h] URI [URI …]

 

Example:

 

hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1

 

 

expunge: This command empties the trash. When a file is deleted, it isn't removed immediately from HDFS; instead, it is renamed and moved into a trash directory, and expunge permanently reclaims that space.

Usage:

 

hdfs dfs -expunge
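
Example (an illustration; the file path is a placeholder): a file deleted with -rm only stops occupying space once the trash is emptied with -expunge.

hdfs dfs -rm /user/hadoop/file1
hdfs dfs -expunge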

 

 

get: Copies files to the local file system. Files that fail a cyclic redundancy check (CRC) can still be copied by specifying the -ignorecrc option. CRC checksum files have the .crc extension and are used to verify the data integrity of another file.

Usage:

 

hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

 

Example:

 

hdfs dfs -get /user/hadoop/file3 localfile

 

 

getmerge: Concatenates the files in src and writes the result to the specified local destination file. To add a newline character at the end of each file, specify the addnl option.

Usage:

 

hdfs dfs -getmerge <src> <localdst> [addnl]

 

Example:

 

hdfs dfs -getmerge /user/hadoop/mydir/ ~/result_file addnl

 

 

ls: Returns the statistics of the specified files or directories.

Usage:

 

hdfs dfs -ls <args>

 

Example:

 

hdfs dfs -ls /user/hadoop/file1

 

 

lsr: This is the recursive version of ls, similar to the Unix command ls -R.

Usage:

 

hdfs dfs -lsr <args>

 

Example:

 

hdfs dfs -lsr /user/hadoop

 

 

mkdir: Creates directories on one or more specified paths. It behaves like the Unix mkdir -p command, creating all directories leading up to the specified directory if they don't already exist.

Usage:

 

hdfs dfs -mkdir <paths>

 

Example:

 

hdfs dfs -mkdir /user/hadoop/dir5/temp

 

 

moveFromLocal: It works in a way the put command works, with the exception that the source is deleted after being copied.

Usage:

 

hdfs dfs -moveFromLocal <localsrc> <dest>

 

Example:

 

hdfs dfs -moveFromLocal localfile1 localfile2 /user/hadoop/hadoopdir

 

 

mv: Moves one or more files from a specified source to a specified destination. Moving files across file systems is not permitted.

Usage:

 

hdfs dfs -mv URI [URI …] <dest>

 

Example:

 

hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2

 

 

put: Copies files from the local file system to the destination file system. It can also read input from stdin and write it to the destination file system.

Usage:

 

hdfs dfs -put <localsrc> … <dest>

 

Example:

 

hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir; hdfs dfs -put - /user/hadoop/hadoopdir

 

 

rm: This command is used to delete one or multiple files

Usage:

 

hdfs dfs -rm [-skipTrash] URI [URI …]

 

Example:

 

hdfs dfs -rm hdfs://nn.example.com/file8

 

 

rmr: This is the recursive version of -rm.

Usage:

 

hdfs dfs -rmr [-skipTrash] URI [URI …]

 

Example:

 

hdfs dfs -rmr /user/hadoop/dir

 

 

setrep: Changes the replication factor for a specified file or directory. Using -R makes the change recursively through the directory structure.

Usage:

 

hdfs dfs -setrep <rep> [-R] <path>

 

Example:

 

hdfs dfs -setrep 3 -R /user/hadoop/dir1

 


 

stat: Used to display information about any specified path

Usage:

 

hdfs dfs -stat URI [URI …]

 

Example:

 

hdfs dfs -stat /user/hadoop/dir1

 

 

tail: Shows the last kilobyte of a specified file on stdout. The command supports the Unix -f option, which allows the specified file to be monitored: as new lines are added to the file by another process, tail updates the display accordingly.

Usage:

 

hdfs dfs -tail [-f] URI

 

Example:

 

hdfs dfs -tail /user/hadoop/file1

 

 

test: Returns attributes of the specified file or directory. The supported options are:

  • -e : Used to decide whether the file or directory exists
  • -z : To decide whether the file or directory is empty
  • -d : Determines whether the URI is a directory

Usage:

 

hdfs dfs -test -[ezd] URI

 

Example:

 

hdfs dfs -test -e /user/hadoop/dir1

 

 

text: This command outputs a specified source file in text format. Valid input file formats used are zip and TextRecordInputStream.

Usage:

 

hdfs dfs -text <src>

 

Example:

 

hdfs dfs -text /user/hadoop/file8.zip

 

 

Hadoop Administration Commands

Hadoop administration commands are a comprehensive set of commands used for cluster administration. The following list summarizes the most important commands:

 

balancer: It runs the cluster-balancing utility

Syntax:

 

hadoop balancer [-threshold <threshold>]

 

Example:

 

hadoop balancer -threshold 20

 

 

daemonlog: Gets or sets the log level for each daemon (service). It connects to http://host:port/logLevel?log=name and prints or sets the log level of the particular daemon running at host:port.

Syntax:

 

hadoop daemonlog -getlevel <host:port> <name>; hadoop daemonlog -setlevel <host:port> <name> <level>

 

Example:

 

hadoop daemonlog -getlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker

 

 

datanode: It starts and runs the HDFS DataNode service. This service coordinates storage on each slave node. By specifying -rollback, the DataNode is reverted to the previous version.

Syntax:

 

hadoop datanode [-rollback]

 

Example:

 

hadoop datanode -rollback

 

 

dfsadmin: Runs a number of HDFS administrative operations. Use the -help option to get a list of all supported options.

Syntax:

 

hadoop dfsadmin [GENERIC_OPTIONS] [-report] [-safemode enter | leave | get | wait] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota> <dirname>…<dirname>] [-clrQuota <dirname>…<dirname>] [-restoreFailedStorage true|false|check] [-help [cmd]]
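
Example (two common, read-only calls, shown purely as an illustration):

hadoop dfsadmin -report
hadoop dfsadmin -safemode get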

 

 

mradmin: Runs a number of MapReduce administrative operations. Use the -help option to see a list of all supported options. Some of the operations are as follows:

  • -refreshServiceAcl: reloads the service-level authorization policy file
  • -refreshQueues: It reloads the queue ACLs and state
  • -refreshNodes: refreshes the hosts’ information
  • -refreshUserToGroupsMappings: refreshes user-to-groups mappings
  • -refreshSuperUserGroupsConfiguration: refreshes superuser proxy groups mappings
  • -help [cmd]: displays help for the given command or all commands in general

Syntax:

 

hadoop mradmin [ GENERIC_OPTIONS ] [-refreshServiceAcl] [-refreshQueues] [-refreshNodes] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-help [cmd]]
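
Example (an illustrative call that simply reloads the queue configuration):

hadoop mradmin -refreshQueues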

 

 

jobtracker: Runs the MapReduce JobTracker node, which coordinates the data processing system for Hadoop. With -dumpConfiguration, the configuration used by the JobTracker and the queue configuration are written to standard output in JSON format.

Syntax:

 

hadoop jobtracker [-dumpConfiguration]

 

Example:

 

hadoop jobtracker -dumpConfiguration

 

 

namenode: Runs the NameNode, which coordinates the storage for the whole Hadoop cluster. Its options are as follows:

  • -format: The NameNode is started, formatted, and then stopped
  • -upgrade: The NameNode starts with the upgrade option
  • -rollback: The NameNode is rolled back to the previous version
  • -finalize: The previous state of the file system is removed, which makes the most recent upgrade permanent; rollback is no longer available, and the NameNode is then stopped
  • -importCheckpoint: An image is loaded from the checkpoint directory and saved into the current directory.

Syntax:

 

hadoop namenode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]

 

Example:

 

hadoop namenode -finalize

 

 

tasktracker: Executes a MapReduce TaskTracker node

Syntax:

 

hadoop tasktracker

 

 

Hadoop dfsadmin Commands

The dfsadmin tools are a specific set of tools designed to help you find information about the Hadoop Distributed File System (HDFS). They can also be used to perform administration operations on HDFS.

-report: Reports basic file system information and statistics.

-safemode enter | leave | get | wait: Manages safe mode, a state of the NameNode in which changes to the namespace are not accepted and blocks can be neither replicated nor deleted. The NameNode runs in safe mode during start-up so that it doesn't prematurely start replicating blocks even though there are already enough replicas in the cluster.

-refreshNodes: Forces the NameNode to reread its configuration, including the dfs.hosts.exclude file. The NameNode decommissions nodes after their blocks have been replicated onto machines that will remain active.

-finalizeUpgrade: Completes the HDFS upgrade process. DataNodes and the NameNode delete the working directories from the previous version.

-upgradeProgress status | details | force: Requests the standard or detailed current status of the distributed upgrade, or forces the upgrade to proceed.

-metasave filename: Saves the NameNode's primary data structures to filename in a directory specified by the hadoop.log.dir property. The file, which is overwritten if it already exists, contains one line for each of these items: a) DataNodes that are exchanging heartbeats with the NameNode; b) blocks that are waiting to be replicated; c) blocks that are currently being replicated; d) blocks that are waiting to be deleted.

-setQuota <quota> <dirname>…<dirname>: Sets an upper limit on the number of names in the directory tree. This limit can be set for one or more directories simultaneously (see the example after this list).

-clrQuota <dirname>…<dirname>: Clears the upper limit on the number of names in the directory tree. This limit can be cleared for one or more directories simultaneously.

-restoreFailedStorage true | false | check: Turns the automatic attempts to restore failed storage replicas on or off. If a failed storage location becomes available again, the system attempts to restore edits and the fsimage during a checkpoint. The check option returns the current setting.

-help [cmd]: Displays help information for the given command, or for all commands if none is specified.
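
For instance, to cap a directory at 10,000 names and later remove that quota again (the limit and path are illustrative placeholders):

hadoop dfsadmin -setQuota 10000 /user/hadoop/dir1
hadoop dfsadmin -clrQuota /user/hadoop/dir1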

 

Hadoop YARN Basic Commands

yarn: Shows the yarn help

yarn [--config confdir]: Defines the configuration file to be used

yarn [--loglevel loglevel]: Defines the log level, which can be fatal, error, warn, info, debug, or trace

yarn classpath: Shows the Hadoop classpath

yarn application: Displays applications and can also be used to kill them (see the example after this list)

yarn applicationattempt: Shows the application attempt

yarn container: Displays the container information

yarn node: Displays the node information

yarn queue: Shows information about the queue
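
Example (a minimal, illustrative sequence; the application ID is a placeholder):

yarn application -list
yarn application -kill <application_id>
yarn node -list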

 

Conclusion

With the rapidly advancing IT sector and ever-expanding data volumes, it is no exaggeration to say that data drives modern organizations around the world. Understanding and mining this data to unravel patterns and reveal unseen linkages is becoming a critical and rewarding undertaking for organizations.

Entities all over the world now sense the importance of converting Big Data into business intelligence in order to reap its benefits. Better data leads to better decision-making and better planning for organizations, irrespective of their size, location, or products and services. Hadoop, without any doubt, has become the preferred platform for working with vast volumes of data.

If you are looking to start a career in Big Data and data management, getting hands-on with Hadoop will empower you as a professional to excel in the field.

Start your 30-day FREE TRIAL with DataScienceAcademy.io and become an expert in Hadoop, one of the most in-demand technologies for Big Data and data science career success.
