Data Science Cheat Sheet - 2021 Edition
The data science field spans many disciplines and languages, which makes it difficult to write a cheat sheet that covers most of them. A few come to mind: popular programming languages like Python, SQL, and R. Our courses go into great detail about these languages and much more.
With this in mind, we compiled basic facts about R, Python, and SQL. The following commands may be helpful when brushing up on these common programming languages.
R Commands
Import, export, and quick checks
- dat1=read.csv("name.csv") to import a standard CSV file (the first row contains the variable names).
- attach(dat1) to attach a table so its variables can be referenced directly by name. Use detach(dat1) to release it.
- dat1=read.delim("name.txt") to import a standard tab-delimited file.
- dat1=read.fwf("name.prn", widths=c(8,8,8)) fixed width (3 variables, 8 characters wide).
- ?read.table will help you find more options about importing non-standard data files.
- dat1=read.dbf("name.dbf") requires installation of the foreign package to import DBF files.
- head(dat1) to check the first few rows and variable names of the data table you imported.
- names(dat1) to list variable names in quotation marks (useful for copy and paste to code).
- data.frame(names(dat1)) returns a list of your variables with their column numbers indicated, which can be useful for sub-setting a data table.
- nrow(dat1) and ncol(dat1) return the number of rows and columns of a data table.
- length(dat1$VAR1[!is.na(dat1$VAR1)]) returns a count of non-missing values in a variable.
- str(dat1) to check variable types, which is useful to see if the import executed correctly.
- write.csv(results, "myresults.csv", na="", row.names=F) to export data. Without any of the option statements, missing values are represented by NA and row numbers will be written out.
Data types and basic data table manipulations
- There are three important variable types: numeric, character and factor (a categorical variable stored as integer codes with character labels). You can always query or convert types: is.factor() or as.factor().
- If you import a data table, variables that contain one or more character entries will be read as factor (or as character, depending on your R version and the stringsAsFactors setting). You can force them to numeric using the following: as.numeric(as.character(dat1$VAR1))
- After subsetting or modification, you might want to refresh factor levels with droplevels(dat1)
- Data tables can be converted with as.data.frame(), as.matrix(), as.dist()
- names(dat1)=c("ID", "X", "Y", "Z") renames variables. Note that the length of the vector must match the number of variables you have (four in our example).
- row.names(dat1)=dat1$ID assigns an ID field to row names. Note that the default row names are consecutive numbers. For this to work properly, each value in the ID field has to be unique.
- To generate unique and descriptive row names that may serve as IDs, you can combine two or more variables: row.names(dat1)=paste(dat1$SITE, dat1$PLOT, sep="-")
- If you only have numerical values in your data table, you can transpose it (switch rows and columns): dat1_t=t(dat1). Row names become variables, so run the row.names() function found above first.
- dat1[order(X),] orders rows by variable X. dat1[order(X,Y),] orders rows by variable X, then variable Y. dat1[order(X,-Y),] orders rows by variable X, then descending by variable Y.
- fix(dat1) to open the entire data table as a spreadsheet and edit cells with a double-click.
Creating systematic data and data tables
- c(1:10) uses the generic concatenate function c() to create a vector, here the numbers from 1 to 10.
- seq(0, 100, 10) generates a sequence from 0 to 100 in steps of 10.
- rep(5,10) replicates 5, 10 times. rep(c(1,2,3),2) gives 1 2 3 1 2 3. rep(c(1,2,3), each=2) gives 1 1 2 2 3 3. This can be helpful when creating data entry sheets for experimental designs.
- data.frame(VAR1=c(1:10), VAR2=seq(10, 100, 10), VAR3=rep( c("this","that"),5)) creates a data frame from a number of vectors.
- expand.grid(SITE=c("A","B"),TREAT=c("low","med","high"), REP=c(1:5)) is an elegant method to create systematic data tables.
Creating random data and random sampling
- rnorm(10) takes 10 random samples from a normal distribution with a mean of zero and a standard deviation of 1.
- runif(10) takes 10 random samples from a uniform distribution between zero and one.
- round(rnorm(10)*3+15) takes 10 random samples from a normal distribution with a mean of 15 and a standard deviation of 3, with decimals removed by the rounding function.
- round(runif(10)*5+15) returns random integers between 15 and 20, uniformly distributed.
- sample(c("A","B","C"), 10, replace=TRUE) returns a random sample from any custom vector or variable with replacement.
- sample1=dat1[sample(1:nrow(dat1), 50, replace=FALSE),] takes 50 random rows from dat1 (without duplicate sampling). This can be useful to run quick test analyses on subsets of enormous datasets or for bootstrapping.
Sub-setting data tables, conditional subsets
- dat1[1:10, 1:5] returns the first 10 rows and the first 5 columns of table dat1.
- dat2=dat1[50:70,] returns a subset of rows 50 to 70.
- cleandata=dat1[-c(2,7,15),] removes rows 2, 7 and 15.
- selectvars=dat1[,c("ID","YIELD")] sub-sets the variables ID and YIELD.
- selectrows=dat1[dat1$VAR1=="Site 1",] sub-sets entries that were measured at Site 1. Possible conditional operators are == equal, != non-equal, > larger, < smaller, >= larger or equal, <= smaller or equal, & AND, | OR, ! NOT, () brackets to order complex conditional statements.
- selecttreats=dat1[dat1$TREAT %in% c("CTRL", "N", "P", "K"),] can replace multiple conditional == statements linked together by OR.
Transforming variables in data tables, conditional transformations
- dat2=transform(dat1, VAR1=VAR1*0.4) multiplies VAR1 by 0.4.
- dat2=transform(dat1, VAR2=VAR1*2) creates variable VAR2 that is twice the value of VAR1.
- dat2=transform(dat1, VAR1=ifelse(VAR3=="Site 1", VAR1*0.4, VAR1)) multiplies VAR1 by 0.4 for entries measured at Site 1. For other sites the value stays the same. The general format is ifelse(condition, value if true, value if false).
The vegan package offers many useful standard transformations for variables or an entire data table:
- dat2=decostand(dat1,"standardize") Check out ?decostand to see all transformations.
Merging data tables
- dat3=merge(dat1,dat2,by="ID") merge two tables by ID field.
- dat3=merge(dat1,dat2,by.x="ID",by.y="STN") merge by an ID field that is differently named in the two datasets.
- dat3=merge(dat1,dat2,by=c("LAT","LONG")) merge by multiple ID fields.
- dat3=merge(dat1,dat2,by.x="ID",by.y="ID",all.x=T,all.y=F) left merge; all.x=F, all.y=T right merge; all.x=T,all.y=T keep all rows; all.x=F,all.y=F keep matching rows.
- cbind(dat1,dat2) On very rare occasions, you may merge data without a key (ID) field. This is risky, as the command simply binds the two tables together without checking that the rows are in the same order.
Python Commands
The following is a list of basic pandas commands; because they are so common, you will find them in many places online.
Importing Data
Data analysis starts with gathering the data you want to interpret. pandas gives you a good selection of functions for getting data into your Python workspace:
- pd.read_csv(filename) # From a CSV file
- pd.read_table(filename) # From a delimited text file (like TSV)
- pd.read_excel(filename) # From an Excel file
- pd.read_sql(query, connection_object) # Reads from a SQL table/database
- pd.read_json(json_string) # Reads from a JSON formatted string, URL or file.
- pd.read_html(url) # Parses an html URL, string or file and extracts tables to a list of dataframes
- pd.read_clipboard() # Retrieves the contents of your clipboard and passes it to read_table()
- pd.DataFrame(dict) # From a dict; keys for column names, values for data as lists
Exploring Data
After you have imported your data into a pandas DataFrame, you can look at it in the following ways (a short worked sketch follows the list):
- df.shape # Number of rows and columns in the DataFrame (an attribute, no parentheses)
- df.head(n) # Prints first n rows of the DataFrame
- df.tail(n) # Prints last n rows of the DataFrame
- df.info() # Index, Datatype and Memory information
- df.describe() # Summary statistics for numerical columns
- s.value_counts(dropna=False) # Views unique values and counts
- df.apply(pd.Series.value_counts) # Unique values and counts for all columns
- df.mean() # Returns the mean of all columns
- df.corr() # Returns the correlation between columns in a DataFrame
- df.count() # Returns the number of non-null values in each DataFrame column
- df.max() # Returns the highest value in each column
- df.min() # Returns the lowest value in each column
- df.median() # Returns the median of each column
- df.std() # Returns the standard deviation of each column
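As a minimal worked sketch of the import and exploration commands above (the file name sales.csv and the column amount are hypothetical, not part of the original cheat sheet):
import pandas as pd
df = pd.read_csv('sales.csv')  # hypothetical CSV file
print(df.shape)  # (number of rows, number of columns)
print(df.head(5))  # first 5 rows
df.info()  # index, dtypes and memory information
print(df.describe())  # summary statistics for numerical columns
print(df['amount'].value_counts(dropna=False))  # unique values and counts, including NaN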
Selecting
You’ll usually need to select an element or a subset of the data for further analysis. The following commands are useful (a short sketch follows the list):
- df[col] # Returns column with label col as Series
- df[[col1, col2]] # Returns Columns as a new DataFrame
- s.iloc[0] # Selection by position (selects first element)
- s.loc[0] # Selection by index label (selects the element with index label 0)
- df.iloc[0,:] # First row
- df.iloc[0,0] # First element of first column
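A minimal selection sketch, assuming a hypothetical DataFrame with columns name and score:
import pandas as pd
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1.0, 2.5, 3.5]})
s = df['score']  # a single column as a Series
sub = df[['name', 'score']]  # multiple columns as a new DataFrame
first_value = s.iloc[0]  # first element of the Series by position
first_row = df.iloc[0, :]  # first row of the DataFrame
first_cell = df.iloc[0, 0]  # first element of the first column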
Data Cleaning
Real-world data almost always needs cleaning. The following commands are useful for data cleaning (a short sketch follows the list):
- df.columns = ['a','b','c'] # Renames columns
- pd.isnull(obj) # Checks for null values, returns a Boolean array
- pd.notnull(obj) # Opposite of pd.isnull()
- df.dropna() # Drops all rows that contain null values
- df.dropna(axis=1) # Drops all columns that contain null values
- df.dropna(axis=1,thresh=n) # Drops all columns that have fewer than n non-null values
- df.fillna(x) # Replaces all null values with x
- s.fillna(s.mean()) # Replaces all null values with the mean (mean can be replaced with almost any function from the statistics section)
- s.astype(float) # Converts the datatype of the series to float
- s.replace(1,'one') # Replaces all values equal to 1 with 'one'
- s.replace([1,3],['one','three']) # Replaces all 1 with 'one' and 3 with 'three'
- df.rename(columns=lambda x: x + 1) # Mass renaming of columns
- df.rename(columns={'old_name': 'new_name'}) # Selective renaming
- df.set_index('column_one') # Changes the index
- df.rename(index=lambda x: x + 1) # Mass renaming of index
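A minimal cleaning sketch, assuming a hypothetical DataFrame that contains missing values:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': ['x', 'y', None]})
df.columns = ['value', 'label']  # rename all columns
print(pd.isnull(df))  # Boolean mask of missing values
df['value'] = df['value'].fillna(df['value'].mean())  # impute missing values with the mean
df['value'] = df['value'].astype(float)  # make sure the dtype is float
df = df.dropna()  # drop any remaining rows with nulls
df = df.rename(columns={'label': 'category'})  # selective renaming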
Filter, Sort, and Group By
The following includes ways for filtering, sorting, and grouping data (a short sketch follows the list):
- df[df[col] > 0.5] # Rows where the col column is greater than 0.5
- df[(df[col] > 0.5) & (df[col] < 0.7)] # Rows where 0.5 < col < 0.7
- df.sort_values(col1) # Sorts values by col1 in ascending order
- df.sort_values(col2,ascending=False) # Sorts values by col2 in descending order
- df.sort_values([col1,col2], ascending=[True,False]) # Sorts values by col1 in ascending order then col2 in descending order
- df.groupby(col) # Returns a groupby object for values from one column
- df.groupby([col1,col2]) # Returns a groupby object values from multiple columns
- df.groupby(col1)[col2].mean() # Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
- df.pivot_table(index=col1, values=[col2,col3], aggfunc=np.mean) # Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
- df.groupby(col1).agg(np.mean) # Finds the average of all columns for every unique value of col1
- df.apply(np.mean) # Applies a function across each column
- df.apply(np.max, axis=1) # Applies a function across each row
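A minimal sketch that combines filtering, sorting, and grouping, assuming a hypothetical DataFrame of sites and yields:
import numpy as np
import pandas as pd
df = pd.DataFrame({'site': ['A', 'A', 'B', 'B'],
                   'treat': ['ctrl', 'n', 'ctrl', 'n'],
                   'yield': [2.1, 3.4, 1.8, 2.9]})
high = df[df['yield'] > 2.0]  # rows where yield is greater than 2.0
ranked = df.sort_values('yield', ascending=False)  # sort by yield, descending
site_means = df.groupby('site')['yield'].mean()  # mean yield per site
table = df.pivot_table(index='site', values='yield', aggfunc=np.mean)  # the same summary as a pivot table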
Joining and Combining
The following commands can help you combine two DataFrames (a short sketch follows the list):
- df1.append(df2) # Adds the rows of df2 to the end of df1 (columns should be identical; newer pandas versions use pd.concat([df1, df2]) instead)
- pd.concat([df1, df2],axis=1) # Adds the columns of df2 to the end of df1 (rows should be identical)
- df1.join(df2,on=col1,how='inner') # SQL-style join of the columns in df1 with the columns of df2 where the values of col1 match. how can be one of 'left', 'right', 'outer', 'inner'
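A minimal combining sketch, assuming two hypothetical DataFrames that share an id column:
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3], 'x': [10, 20, 30]})
df2 = pd.DataFrame({'id': [2, 3, 4], 'y': [200, 300, 400]})
stacked = pd.concat([df1, df2])  # stack rows (columns should match)
side_by_side = pd.concat([df1, df2], axis=1)  # place columns side by side (rows should align)
joined = pd.merge(df1, df2, on='id', how='inner')  # SQL-style inner join on id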
Writing Data
Once you have results from the entire process, it’s time to export the data (a short sketch follows the list):
- df.to_csv(filename) # Writes to a CSV file
- df.to_excel(filename) # Writes to an Excel file
- df.to_sql(table_name, connection_object) # Writes to a SQL table
- df.to_json(filename) # Writes to a file in JSON format
- df.to_html(filename) # Saves as an HTML table
- df.to_clipboard() # Writes to the clipboard
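A minimal export sketch (the file names are hypothetical; index=False keeps the row index out of the CSV):
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv('results.csv', index=False)  # write a CSV without the index column
df.to_json('results.json')  # write JSON
df.to_html('results.html')  # write an HTML table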
Machine Learning
The Scikit-Learn library has many helpful ways to train and apply machine learning models. The following short Scikit-Learn tutorial gives more information for the code below.
If you would like a full list of the Supervised Learning, Unsupervised Learning, Dataset Transformation and Model Evaluation modules in Scikit-Learn, visit its user guide for more information.
# Import libraries and modules
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # in older Scikit-Learn versions: from sklearn.externals import joblib
# Load red wine data.
dataset_url = 'https://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')
# Split data into training and test sets
y = data.quality
X = data.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=123,
stratify=y)
# Declare data preprocessing steps
pipeline = make_pipeline(preprocessing.StandardScaler(),
RandomForestRegressor(n_estimators=100))
# Declare hyperparameters to tune
hyperparameters = { 'randomforestregressor__max_features' : ['auto', 'sqrt', 'log2'],
'randomforestregressor__max_depth': [None, 5, 3, 1]}
# Tune model using cross-validation pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)
# Refit on the entire training set
# No additional code needed if clf.refit == True (default is True)
# Evaluate model pipeline on test data
pred = clf.predict(X_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))
# Save model for future use
joblib.dump(clf, 'rf_regressor.pkl')
# To load: clf2 = joblib.load('rf_regressor.pkl')
SQL
Here are popular SQL statements:
Query data in columns c1, c2 from a table
SELECT c1, c2 FROM t;
Query all rows and columns from a table
SELECT * FROM t;
Query data and filter rows with a condition
SELECT c1, c2 FROM t
WHERE condition;
Query distinct rows from a table
SELECT DISTINCT c1 FROM t
WHERE condition;
Sort the result set in ascending or descending order
SELECT c1, c2 FROM t
ORDER BY c1 ASC [DESC];
Skip offset of rows and return the next n rows
SELECT c1, c2 FROM t
ORDER BY c1
LIMIT n OFFSET offset;
Group rows using an aggregate function
SELECT c1, aggregate(c2)
FROM t
GROUP BY c1;
Filter groups using HAVING clause
SELECT c1, aggregate(c2)
FROM t
GROUP BY c1
HAVING condition;
Querying from multiple tables
Inner join t1 and t2
SELECT c1, c2
FROM t1
INNER JOIN t2 ON condition;
Left join t1 and t2
SELECT c1, c2
FROM t1
LEFT JOIN t2 ON condition;
Right join t1 and t2
SELECT c1, c2
FROM t1
RIGHT JOIN t2 ON condition;
Perform full outer join
SELECT c1, c2
FROM t1
FULL OUTER JOIN t2 ON condition;
Produce a Cartesian product of rows in tables
SELECT c1, c2
FROM t1
CROSS JOIN t2;
Another way to perform cross join
SELECT c1, c2
FROM t1, t2;
Join t1 to itself using INNER JOIN clause
SELECT c1, c2
FROM t1 A
INNER JOIN t1 B ON condition;
Using SQL Operators
Combine rows from two queries
SELECT c1, c2 FROM t1
UNION [ALL]
SELECT c1, c2 FROM t2;
Return the intersection of two queries
SELECT c1, c2 FROM t1
INTERSECT
SELECT c1, c2 FROM t2;
Subtract a result set from another result set
SELECT c1, c2 FROM t1
MINUS
SELECT c1, c2 FROM t2;
Query rows using pattern matching %, _
SELECT c1, c2 FROM t1
WHERE c1 [NOT] LIKE pattern;
Query rows in a list
SELECT c1, c2 FROM t
WHERE c1 [NOT] IN value_list;
Query rows between two values
SELECT c1, c2 FROM t
WHERE c1 BETWEEN low AND high;
Check whether values in a column are NULL or not
SELECT c1, c2 FROM t
WHERE c1 IS [NOT] NULL;
Managing tables
Create a new table with three columns
CREATE TABLE t (
id INT PRIMARY KEY,
name VARCHAR NOT NULL,
price INT DEFAULT 0
);
Delete the table from the database
DROP TABLE t ;
Add a new column to the table
ALTER TABLE t ADD column;
Drop column c from the table
ALTER TABLE t DROP COLUMN c ;
Add a constraint
ALTER TABLE t ADD constraint;
Drop a constraint
ALTER TABLE t DROP constraint;
Rename a table from t1 to t2
ALTER TABLE t1 RENAME TO t2;
Rename column c1 to c2
ALTER TABLE t1 RENAME c1 TO c2 ;
Remove all data in a table
TRUNCATE TABLE t;
Using SQL constraints
Set c1 and c2 as a primary key
CREATE TABLE t(
c1 INT, c2 INT, c3 VARCHAR,
PRIMARY KEY (c1,c2)
);
Set c2 column as a foreign key
CREATE TABLE t1(
c1 INT PRIMARY KEY,
c2 INT,
FOREIGN KEY (c2) REFERENCES t2(c2)
);
Make the values in c1 and c2 unique
CREATE TABLE t(
c1 INT, c2 INT,
UNIQUE(c1,c2)
);
Ensure c1 > 0 and values in c1 >= c2
CREATE TABLE t(
c1 INT, c2 INT,
CHECK(c1> 0 AND c1 >= c2)
);
Set values in c2 column not NULL
CREATE TABLE t(
c1 INT PRIMARY KEY,
c2 VARCHAR NOT NULL
);
Modifying Data
Insert one row into a table
INSERT INTO t(column_list)
VALUES(value_list);
Insert multiple rows into a table
INSERT INTO t(column_list)
VALUES (value_list),
(value_list), …;
Insert rows from t2 into t1
INSERT INTO t1(column_list)
SELECT column_list
FROM t2;
Update the value of column c1 for all rows
UPDATE t
SET c1 = new_value;
Update values in columns c1 and c2 for rows that match the condition
UPDATE t
SET c1 = new_value,
c2 = new_value
WHERE condition;
Delete all data in a table
DELETE FROM t;
Delete subset of rows in a table
DELETE FROM t
WHERE condition;
Managing Views
Create a new view that consists of c1 and c2
CREATE VIEW v(c1,c2)
AS
SELECT c1, c2
FROM t;
Create a new view with check option
CREATE VIEW v(c1,c2)
AS
SELECT c1, c2
FROM t
WITH [CASCADED | LOCAL] CHECK OPTION;
Create a recursive view
CREATE RECURSIVE VIEW v
AS
select-statement -- anchor part
UNION [ALL]
select-statement; -- recursive part
Create a temporary view
CREATE TEMPORARY VIEW v
AS
SELECT c1, c2
FROM t;
Delete a view
DROP VIEW view_name;
Managing indexes
Create an index on c1 and c2 of the t table
CREATE INDEX idx_name
ON t(c1,c2);
Create a unique index on c3, c4 of the t table
CREATE UNIQUE INDEX idx_name
ON t(c3,c4);
Drop an index
DROP INDEX idx_name;
Managing triggers
Create or modify a trigger
CREATE OR MODIFY TRIGGER trigger_name
WHEN EVENT
ON table_name TRIGGER_TYPE
EXECUTE stored_procedure;
WHEN
- BEFORE – invoke before the event occurs
- AFTER – invoke after the event occurs
EVENT
- INSERT – invoke for INSERT
- UPDATE – invoke for UPDATE
- DELETE – invoke for DELETE
TRIGGER_TYPE
- FOR EACH ROW
- FOR EACH STATEMENT
Delete a specific trigger
DROP TRIGGER trigger_name;
Data Science Training
We hope you found this short cheat sheet useful in your studies, whether for a certification test or an interview. Our courses go into even greater detail about these languages and much more.