Data Science Cheat Sheet - 2021 Edition
The data science field spans many disciplines and languages, which makes it difficult to write a cheat sheet that covers most of them. A few come to mind: popular programming languages like Python, SQL, and R. Our courses go into great detail about these languages and much more.
With this in mind, we compiled basic facts about R, Python, and SQL. The following commands may be helpful when brushing up on these common programming languages.
R Commands
Import, export, and quick checks
- dat1=read.csv("name.csv") to import a standard CSV file (the first row contains the variable names).
- attach(dat1) to attach a table so its variables can be referenced directly by name. Use detach(dat1) to release it.
- dat1=read.delim("name.txt") to import a standard tab-delimited file.
- dat1=read.fwf("name.prn", widths=c(8,8,8)) fixed width (3 variables, 8 characters wide).
- ?read.table will help you find more options about importing non-standard data files.
- dat1=read.dbf("name.dbf") requires installation of the foreign package to import DBF files.
- head(dat1) to check the first few rows and variable names of the data table you imported.
- names(dat1) to list variable names in quotation marks (useful for copy and paste to code).
- data.frame(names(dat1)) returns a list of your variables with their column numbers indicated, which can be useful for sub-setting a data table.
- nrow(dat1) and ncol(dat1) return the number of rows and columns of a data table.
- length(dat1$VAR1[!is.na(dat1$VAR1)]) returns a count of non-missing values in a variable.
- str(dat1) to check variable types, which is useful to see if the import executed correctly.
- write.csv(results, "myresults.csv", na="", row.names=F) to export data. Without any of the option statements, missing values are represented by NA and row numbers will be written out.
Data types and basic data table manipulations
- There are three important variable types: numeric, character and factor (a categorical variable stored as integer codes with character labels). You can always query or convert types: is.factor() or as.factor().
- If you import a data table, variables that contain one or more character entries will be read as factor (or as character, depending on your R version and the stringsAsFactors setting). You can force them to numeric using the following: as.numeric(as.character(dat1$VAR1))
- After subsetting or modification, you might want to refresh factor levels with droplevels(dat1)
- Data tables can be converted with as.data.frame(), as.matrix(), as.dist()
- names(dat1)=c("ID", "X", "Y", "Z") renames variables. Note that the length of the vector must match the number of variables you have (four in our example).
- row.names(dat1)=dat1$ID assigns an ID field to row names. Note that the default row names are consecutive numbers. For this to work properly, each value in the ID field has to be unique.
- To generate unique and descriptive row names that may serve as IDs, you can combine two or more variables: row.names(dat1)=paste(dat1$SITE, dat1$PLOT, sep="-")
- If you only have numerical values in your data table, you can transpose it (switch rows and columns): dat1_t=t(dat1). Row names become variables, so run the row.names() function found above first.
- dat1[order(X),] orders rows by variable X. dat1[order(X,Y),] orders rows by variable X, then variable Y. dat1[order(X,-Y),] orders rows by variable X, then descending by variable Y.
- fix(dat1) to open the entire data table as a spreadsheet and edit cells with a double-click.
Creating systematic data and data tables
- c(1:10) uses the generic concatenate function c() to create a vector, here the numbers from 1 to 10.
- seq(0, 100, 10) generates a sequence from 0 to 100 in steps of 10.
- rep(5,10) replicates 5, 10 times. rep(c(1,2,3),2) gives 1 2 3 1 2 3. rep(c(1,2,3), each=2) gives 1 1 2 2 3 3. This can be helpful when creating data entry sheets for experimental designs.
- data.frame(VAR1=c(1:10), VAR2=seq(10, 100, 10), VAR3=rep( c("this","that"),5)) creates a data frame from a number of vectors.
- expand.grid(SITE=c("A","B"),TREAT=c("low","med","high"), REP=c(1:5)) is an elegant method to create systematic data tables.
Creating random data and random sampling
- rnorm(10) takes 10 random samples from a normal distribution with a mean of zero and a standard deviation of 1.
- runif(10) takes 10 random samples from a uniform distribution between zero and one.
- round(rnorm(10)*3+15) takes 10 random samples from a normal distribution with a mean of 15 and a standard deviation of 3, with decimals removed by the rounding function.
- round(runif(10)*5+15) returns random integers between 15 and 20, uniformly distributed.
- sample(c("A","B","C"), 10, replace=TRUE) returns a random sample from any custom vector or variable with replacement.
- sample1=dat1[sample(1:nrow(dat1), 50, replace=FALSE),] takes 50 random rows from dat1 (without duplicate sampling). This can be useful to run quick test analyses on subsets of enormous datasets or for bootstrapping.
Sub-setting data tables, conditional subsets
- dat1[1:10, 1:5] returns the first 10 rows and the first 5 columns of table dat1.
- dat2=dat1[50:70,] returns a subset of rows 50 to 70.
- cleandata=dat1[-c(2,7,15),] removes rows 2, 7 and 15.
- selectvars=dat1[,c("ID","YIELD")] sub-sets the variables ID and YIELD.
- selectrows=dat1[dat1$VAR1=="Site 1",] sub-sets entries that were measured at Site 1. Possible conditional operators are == equal, != non-equal, > larger, < smaller, >= larger or equal, <= smaller or equal, & AND, | OR, ! NOT, () brackets to order complex conditional statements.
- selecttreats=dat1[dat1$TREAT %in% c("CTRL", "N", "P", "K"),] can replace multiple conditional == statements linked together by OR.
Transforming variables in data tables, conditional transformations
- dat2=transform(dat1, VAR1=VAR1*0.4) multiplies VAR1 by 0.4.
- dat2=transform(dat1, VAR2=VAR1*2) creates variable VAR2 that is twice the value of VAR1.
- dat2=transform(dat1, VAR1=ifelse(VAR3=="Site 1", VAR1*0.4, VAR1)) multiplies VAR1 by 0.4 for entries measured at Site 1. For other sites the value stays the same. The general format is ifelse(condition, value if true, value if false).
The vegan package offers many useful standard transformations for variables or an entire data table:
- dat2=decostand(dat1,"standardize") Check out ?decostand to see all transformations.
Merging data tables
- dat3=merge(dat1,dat2,by="ID") merge two tables by ID field.
- dat3=merge(dat1,dat2,by.x="ID",by.y="STN") merge by an ID field that is differently named in the two datasets.
- dat3=merge(dat1,dat2,by=c("LAT","LONG")) merge by multiple ID fields.
- dat3=merge(dat1,dat2,by.x="ID",by.y="ID",all.x=T,all.y=F) left merge; all.x=F, all.y=T right merge; all.x=T,all.y=T keep all rows; all.x=F,all.y=F keep matching rows.
- cbind(dat1,dat2) On very rare occasions, you may merge data without a key (ID) field. This is risky, as the command simply binds the two tables together without checking that the rows are in the same order.
Python Commands
The following is a list of basic pandas commands; because they are so common, you will find them in many places online.
Importing Data
Data analysis starts with gathering the data you want to interpret. pandas gives you a good selection of functions for getting data into your Python workspace:
- pd.read_csv(filename) # From a CSV file
- pd.read_table(filename) # From a delimited text file (like TSV)
- pd.read_excel(filename) # From an Excel file
- pd.read_sql(query, connection_object) # Reads from a SQL table/database
- pd.read_json(json_string) # Reads from a JSON formatted string, URL or file.
- pd.read_html(url) # Parses an html URL, string or file and extracts tables to a list of dataframes
- pd.read_clipboard() # Retrieves the contents of your clipboard and passes it to read_table()
- pd.DataFrame(dict) # From a dict; keys for column names, values for data as lists
Exploring Data
After you have imported your data into a pandas DataFrame, you can look at it in the following ways (a short worked sketch follows the list):
- df.shape # Number of rows and columns in the DataFrame (an attribute, no parentheses)
- df.head(n) # Prints first n rows of the DataFrame
- df.tail(n) # Prints last n rows of the DataFrame
- df.info() # Index, Datatype and Memory information
- df.describe() # Summary statistics for numerical columns
- s.value_counts(dropna=False) # Views unique values and counts
- df.apply(pd.Series.value_counts) # Unique values and counts for all columns
- df.mean() # Returns the mean of all columns
- df.corr() # Returns the correlation between columns in a DataFrame
- df.count() # Returns the number of non-null values in each DataFrame column
- df.max() # Returns the highest value in each column
- df.min() # Returns the lowest value in each column
- df.median() # Returns the median of each column
- df.std() # Returns the standard deviation of each column
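As a minimal worked sketch of the import and exploration commands above (the file name sales.csv and the column amount are hypothetical, not part of the original cheat sheet):
import pandas as pd
df = pd.read_csv('sales.csv')  # hypothetical CSV file
print(df.shape)  # (number of rows, number of columns)
print(df.head(5))  # first 5 rows
df.info()  # index, dtypes and memory information
print(df.describe())  # summary statistics for numerical columns
print(df['amount'].value_counts(dropna=False))  # unique values and counts, including NaN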
Selecting
You’ll usually need to select an element or a subset of the data for further analysis. The following commands are useful (a short sketch follows the list):
- df[col] # Returns column with label col as Series
- df[[col1, col2]] # Returns Columns as a new DataFrame
- s.iloc[0] # Selection by position (selects first element)
- s.loc[0] # Selection by index label (selects the element with index label 0)
- df.iloc[0,:] # First row
- df.iloc[0,0] # First element of first column
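A minimal selection sketch, assuming a hypothetical DataFrame with columns name and score:
import pandas as pd
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1.0, 2.5, 3.5]})
s = df['score']  # a single column as a Series
sub = df[['name', 'score']]  # multiple columns as a new DataFrame
first_value = s.iloc[0]  # first element of the Series by position
first_row = df.iloc[0, :]  # first row of the DataFrame
first_cell = df.iloc[0, 0]  # first element of the first column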
Data Cleaning
Real-world data almost always needs cleaning. The following commands are useful for data cleaning (a short sketch follows the list):
- df.columns = ['a','b','c'] # Renames columns
- pd.isnull(obj) # Checks for null values, returns a Boolean array
- pd.notnull(obj) # Opposite of pd.isnull()
- df.dropna() # Drops all rows that contain null values
- df.dropna(axis=1) # Drops all columns that contain null values
- df.dropna(axis=1,thresh=n) # Drops all columns that have fewer than n non-null values
- df.fillna(x) # Replaces all null values with x
- s.fillna(s.mean()) # Replaces all null values with the mean (mean can be replaced with almost any function from the statistics section)
- s.astype(float) # Converts the datatype of the series to float
- s.replace(1,'one') # Replaces all values equal to 1 with 'one'
- s.replace([1,3],['one','three']) # Replaces all 1 with 'one' and 3 with 'three'
- df.rename(columns=lambda x: x + 1) # Mass renaming of columns
- df.rename(columns={'old_name': 'new_name'}) # Selective renaming
- df.set_index('column_one') # Changes the index
- df.rename(index=lambda x: x + 1) # Mass renaming of index
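A minimal cleaning sketch, assuming a hypothetical DataFrame that contains missing values:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': ['x', 'y', None]})
df.columns = ['value', 'label']  # rename all columns
print(pd.isnull(df))  # Boolean mask of missing values
df['value'] = df['value'].fillna(df['value'].mean())  # impute missing values with the mean
df['value'] = df['value'].astype(float)  # make sure the dtype is float
df = df.dropna()  # drop any remaining rows with nulls
df = df.rename(columns={'label': 'category'})  # selective renaming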
Filter, Sort, and Group By
The following includes ways for filtering, sorting, and grouping data (a short sketch follows the list):
- df[df[col] > 0.5] # Rows where the col column is greater than 0.5
- df[(df[col] > 0.5) & (df[col] < 0.7)] # Rows where 0.5 < col < 0.7
- df.sort_values(col1) # Sorts values by col1 in ascending order
- df.sort_values(col2,ascending=False) # Sorts values by col2 in descending order
- df.sort_values([col1,col2], ascending=[True,False]) # Sorts values by col1 in ascending order then col2 in descending order
- df.groupby(col) # Returns a groupby object for values from one column
- df.groupby([col1,col2]) # Returns a groupby object values from multiple columns
- df.groupby(col1)[col2].mean() # Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
- df.pivot_table(index=col1, values=[col2,col3], aggfunc=np.mean) # Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
- df.groupby(col1).agg(np.mean) # Finds the average of all columns for every unique value of col1
- df.apply(np.mean) # Applies a function across each column
- df.apply(np.max, axis=1) # Applies a function across each row
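A minimal sketch that combines filtering, sorting, and grouping, assuming a hypothetical DataFrame of sites and yields:
import numpy as np
import pandas as pd
df = pd.DataFrame({'site': ['A', 'A', 'B', 'B'],
                   'treat': ['ctrl', 'n', 'ctrl', 'n'],
                   'yield': [2.1, 3.4, 1.8, 2.9]})
high = df[df['yield'] > 2.0]  # rows where yield is greater than 2.0
ranked = df.sort_values('yield', ascending=False)  # sort by yield, descending
site_means = df.groupby('site')['yield'].mean()  # mean yield per site
table = df.pivot_table(index='site', values='yield', aggfunc=np.mean)  # the same summary as a pivot table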
Joining and Combining
The following commands can help you combine two DataFrames (a short sketch follows the list):
- df1.append(df2) # Adds the rows of df2 to the end of df1 (columns should be identical; newer pandas versions use pd.concat([df1, df2]) instead)
- pd.concat([df1, df2],axis=1) # Adds the columns of df2 to the end of df1 (rows should be identical)
- df1.join(df2,on=col1,how='inner') # SQL-style join of the columns in df1 with the columns of df2 where the values of col1 match. how can be one of 'left', 'right', 'outer', 'inner'
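A minimal combining sketch, assuming two hypothetical DataFrames that share an id column:
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3], 'x': [10, 20, 30]})
df2 = pd.DataFrame({'id': [2, 3, 4], 'y': [200, 300, 400]})
stacked = pd.concat([df1, df2])  # stack rows (columns should match)
side_by_side = pd.concat([df1, df2], axis=1)  # place columns side by side (rows should align)
joined = pd.merge(df1, df2, on='id', how='inner')  # SQL-style inner join on id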
Writing Data
Once you have results from the entire process, it’s time to export the data (a short sketch follows the list):
- df.to_csv(filename) # Writes to a CSV file
- df.to_excel(filename) # Writes to an Excel file
- df.to_sql(table_name, connection_object) # Writes to a SQL table
- df.to_json(filename) # Writes to a file in JSON format
- df.to_html(filename) # Saves as an HTML table
- df.to_clipboard() # Writes to the clipboard
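A minimal export sketch (the file names are hypothetical; index=False keeps the row index out of the CSV):
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv('results.csv', index=False)  # write a CSV without the index column
df.to_json('results.json')  # write JSON
df.to_html('results.html')  # write an HTML table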
Machine Learning
The Scikit-Learn library has many helpful ways to train and apply machine learning models. The following short Scikit-Learn tutorial gives more information for the code below.
If you would like a full list of the Supervised Learning, Unsupervised Learning, Dataset Transformation and Model Evaluation modules in Scikit-Learn, visit its user guide for more information.
# Import libraries and modules
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # in older Scikit-Learn versions: from sklearn.externals import joblib
# Load red wine data.
dataset_url = 'https://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')
# Split data into training and test sets
y = data.quality
X = data.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=123,
stratify=y)
# Declare data preprocessing steps
pipeline = make_pipeline(preprocessing.StandardScaler(),
RandomForestRegressor(n_estimators=100))
# Declare hyperparameters to tune
hyperparameters = { 'randomforestregressor__max_features' : ['auto', 'sqrt', 'log2'],
'randomforestregressor__max_depth': [None, 5, 3, 1]}
# Tune model using cross-validation pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)
# Refit on the entire training set
# No additional code needed if clf.refit == True (default is True)
# Evaluate model pipeline on test data
pred = clf.predict(X_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))
# Save model for future use
joblib.dump(clf, 'rf_regressor.pkl')
# To load: clf2 = joblib.load('rf_regressor.pkl')
SQL
Here are popular SQL statements:
Query data in columns c1, c2 from a table
SELECT c1, c2 FROM t;
Query all rows and columns from a table
SELECT * FROM t;
Query data and filter rows with a condition
SELECT c1, c2 FROM t
WHERE condition;
Query distinct rows from a table
SELECT DISTINCT c1 FROM t
WHERE condition;
Sort the result set in ascending or descending order
SELECT c1, c2 FROM t
ORDER BY c1 ASC [DESC];
Skip offset of rows and return the next n rows
SELECT c1, c2 FROM t
ORDER BY c1
LIMIT n OFFSET offset;
Group rows using an aggregate function
SELECT c1, aggregate(c2)
FROM t
GROUP BY c1;
Filter groups using HAVING clause
SELECT c1, aggregate(c2)
FROM t
GROUP BY c1
HAVING condition;
Querying from multiple tables
Inner join t1 and t2
SELECT c1, c2
FROM t1
INNER JOIN t2 ON condition;
Left join t1 and t2
SELECT c1, c2
FROM t1
LEFT JOIN t2 ON condition;
Right join t1 and t2
SELECT c1, c2
FROM t1
RIGHT JOIN t2 ON condition;
Perform full outer join
SELECT c1, c2
FROM t1
FULL OUTER JOIN t2 ON condition;
Produce a Cartesian product of rows in tables
SELECT c1, c2
FROM t1
CROSS JOIN t2;
Another way to perform cross join
SELECT c1, c2
FROM t1, t2;
Join t1 to itself using INNER JOIN clause
SELECT c1, c2
FROM t1 A
INNER JOIN t1 B ON condition;
Using SQL Operators
Combine rows from two queries
SELECT c1, c2 FROM t1
UNION [ALL]
SELECT c1, c2 FROM t2;
Return the intersection of two queries
SELECT c1, c2 FROM t1
INTERSECT
SELECT c1, c2 FROM t2;
Subtract a result set from another result set
SELECT c1, c2 FROM t1
MINUS
SELECT c1, c2 FROM t2;
Query rows using pattern matching %, _
SELECT c1, c2 FROM t1
WHERE c1 [NOT] LIKE pattern;
Query rows in a list
SELECT c1, c2 FROM t
WHERE c1 [NOT] IN value_list;
Query rows between two values
SELECT c1, c2 FROM t
WHERE c1 BETWEEN low AND high;
Check whether values in a column are NULL or not
SELECT c1, c2 FROM t
WHERE c1 IS [NOT] NULL;
Managing tables
Create a new table with three columns
CREATE TABLE t (
id INT PRIMARY KEY,
name VARCHAR NOT NULL,
price INT DEFAULT 0
);
Delete the table from the database
DROP TABLE t ;
Add a new column to the table
ALTER TABLE t ADD column;
Drop column c from the table
ALTER TABLE t DROP COLUMN c ;
Add a constraint
ALTER TABLE t ADD constraint;
Drop a constraint
ALTER TABLE t DROP constraint;
Rename a table from t1 to t2
ALTER TABLE t1 RENAME TO t2;
Rename column c1 to c2
ALTER TABLE t1 RENAME c1 TO c2 ;
Remove all data in a table
TRUNCATE TABLE t;
Using SQL constraints
Set c1 and c2 as a primary key
CREATE TABLE t(
c1 INT, c2 INT, c3 VARCHAR,
PRIMARY KEY (c1,c2)
);
Set c2 column as a foreign key
CREATE TABLE t1(
c1 INT PRIMARY KEY,
c2 INT,
FOREIGN KEY (c2) REFERENCES t2(c2)
);
Make the values in c1 and c2 unique
CREATE TABLE t(
c1 INT, c2 INT,
UNIQUE(c1,c2)
);
Ensure c1 > 0 and values in c1 >= c2
CREATE TABLE t(
c1 INT, c2 INT,
CHECK(c1> 0 AND c1 >= c2)
);
Set values in c2 column not NULL
CREATE TABLE t(
c1 INT PRIMARY KEY,
c2 VARCHAR NOT NULL
);
Modifying Data
Insert one row into a table
INSERT INTO t(column_list)
VALUES(value_list);
Insert multiple rows into a table
INSERT INTO t(column_list)
VALUES (value_list),
(value_list), …;
Insert rows from t2 into t1
INSERT INTO t1(column_list)
SELECT column_list
FROM t2;
Update the value of column c1 for all rows
UPDATE t
SET c1 = new_value;
Update values in columns c1 and c2 for rows that match the condition
UPDATE t
SET c1 = new_value,
c2 = new_value
WHERE condition;
Delete all data in a table
DELETE FROM t;
Delete subset of rows in a table
DELETE FROM t
WHERE condition;
Managing Views
Create a new view that consists of c1 and c2
CREATE VIEW v(c1,c2)
AS
SELECT c1, c2
FROM t;
Create a new view with check option
CREATE VIEW v(c1,c2)
AS
SELECT c1, c2
FROM t
WITH [CASCADED | LOCAL] CHECK OPTION;
Create a recursive view
CREATE RECURSIVE VIEW v
AS
select-statement -- anchor part
UNION [ALL]
select-statement; -- recursive part
Create a temporary view
CREATE TEMPORARY VIEW v
AS
SELECT c1, c2
FROM t;
Delete a view
DROP VIEW view_name;
Managing indexes
Create an index on c1 and c2 of the t table
CREATE INDEX idx_name
ON t(c1,c2);
Create a unique index on c3, c4 of the t table
CREATE UNIQUE INDEX idx_name
ON t(c3,c4);
Drop an index
DROP INDEX idx_name;
Managing triggers
Create or modify a trigger
CREATE OR MODIFY TRIGGER trigger_name
WHEN EVENT
ON table_name TRIGGER_TYPE
EXECUTE stored_procedure;
WHEN
- BEFORE – invoke before the event occurs
- AFTER – invoke after the event occurs
EVENT
- INSERT – invoke for INSERT
- UPDATE – invoke for UPDATE
- DELETE – invoke for DELETE
TRIGGER_TYPE
- FOR EACH ROW
- FOR EACH STATEMENT
Delete a specific trigger
DROP TRIGGER trigger_name;
Data Science Training
We hope you found this short cheat sheet useful in your studies, whether for a certification test or an interview. Our courses go into even greater detail about these languages and much more.