Data Science Cheat Sheet - 2021 Edition

The data science field spans many disciplines and languages, which makes it difficult to write a cheat sheet that covers most of them. A few come to mind immediately: popular programming languages such as Python, SQL, and R. Our courses go into great detail about these languages and much more.

Start Your 30-Day FREE TRIAL with Data Science Academy to gain access to over 80 free courses.

With this in mind, we compiled basic facts about R, Python, and SQL. The following commands may be helpful when brushing up on these common programming languages.

R Commands

Import, export, and quick checks

  •  dat1=read.csv("name.csv") to import a standard CSV file (the first row contains the variable names).
  •  attach(dat1) to set a data table as the default place to look for variables. Use detach(dat1) to release it.
  •  dat1=read.delim("name.txt") to import a standard tab-delimited file.
  •  dat1=read.fwf("name.prn", widths=c(8,8,8)) fixed width (3 variables, 8 characters wide).
  •  ?read.table will help you find more options about importing non-standard data files.
  •  dat1=read.dbf("name.dbf") requires installation of the foreign package to import DBF files.
  •  head(dat1) to check the first few rows and variable names of the data table you imported.
  •  names(dat1) to list variable names in quotation marks (useful for copy and paste to code).
  •  data.frame(names(dat1)) hands you a list of your variables with the column number indicated, and this can be useful for sub-setting a data table.
  •  nrow(dat1) and ncol(dat1) returns the number of rows and columns of a data table.
  •  length(dat1$VAR1[!is.na(dat1$VAR1)]) returns a count of non-missing values in a variable.
  •  str(dat1) to check variable types, which is useful to see if the import executed correctly.
  •  write.csv(results, "myresults.csv", na="", row.names=F) to export data. Without any of the option statements, missing values are represented by NA and row numbers will be written out.

Data types and basic data table manipulations

  •  There are three important variable types: numeric, character, and factor (a categorical variable that stores each value as an integer code with a character label). You can always query or convert types, e.g. is.factor() or as.factor().
  •  If you import a data table, variables that contain one or more character entries will be set to factor. You can force them to numeric using the following: as.numeric(as.character(dat1$VAR1))
  •  After subsetting or modification, you might want to refresh factor levels with droplevels(dat1)
  •  Data tables can be converted with as.data.frame(), as.matrix(), or as.dist().
  •  names(dat1)=c("ID", "X", "Y", "Z") renames variables. Note that the length of the vector must match the number of variables you have (four in our example).
  •  row.names(dat1)=dat1$ID assigns an ID field to row names. Note that the default row names are consecutive numbers. For this to work properly, each value in the ID field has to be unique.
  •  To generate unique and descriptive row names that may serve as IDs, you can combine two or more variables: row.names(dat1)=paste(dat1$SITE, dat1$PLOT, sep="-")
  •  If you only have numerical values in your data table, you can transpose it (switch rows and columns): dat1_t=t(dat1). Row names become variables, so run the row.names() function found above first.
  •  dat1[order(X),] orders rows by variable X. dat1[order(X,Y),] orders rows by variable X, then by variable Y. dat1[order(X,-Y),] orders rows by variable X, then descending by variable Y.
  • fix(dat1) to open the entire data table as a spreadsheet and edit cells with a double-click.

Creating systematic data and data tables

  •  c(1:10) is a generic concatenate function to create a vector, here numbers from 1 to 10.
  •  seq(0, 100, 10) generates a sequence from 0 to 100 in steps of 10.
  •  rep(5,10) replicates 5, 10 times. rep(c(1,2,3),2) gives 1 2 3 1 2 3. rep(c(1,2,3), each=2) gives 1 1 2 2 3 3. This can be helpful when creating data entry sheets for experimental designs.
  •  data.frame(VAR1=c(1:10), VAR2=seq(10, 100, 10), VAR3=rep( c("this","that"),5)) creates a data frame from a number of vectors.
  •  expand.grid(SITE=c("A","B"),TREAT=c("low","med","high"), REP=c(1:5)) is an elegant method to create systematic data tables.

Creating random data and random sampling

  •  rnorm(10) takes 10 random samples from a normal distribution with a mean of zero and a standard deviation of 1
  •  runif(10) takes 10 random samples from a uniform distribution between zero and one.
  •  round(rnorm(10)*3+15) takes 10 random samples from a normal distribution with a mean of 15 and a standard deviation of 3, and with decimals removed by the rounding function.
  •  round(runif(10)*5+15) returns random integers between 15 and 20, uniformly distributed.
  •  sample(c("A","B","C"), 10, replace=TRUE) returns a random sample from any custom vector or variable with replacement.
  •  sample1=dat1[sample(1:nrow(dat1), 50, replace=FALSE),] takes 50 random rows from dat1 (without duplicate sampling). This can be useful to run quick test analyses on subsets of enormous datasets or for bootstrapping.

Sub-setting data tables, conditional subsets

  •  dat1[1:10, 1:5] returns the first 10 rows and the first 5 columns of table dat1.
  •  dat2=dat1[50:70,] returns a subset of rows 50 to 70.
  •  cleandata=dat1[-c(2,7,15),] removes rows 2, 7 and 15.
  •  selectvars=dat1[,c("ID","YIELD")] sub-sets the variables ID and YIELD
  •  selectrows=dat1[dat1$VAR1=="Site 1",] sub-sets entries that were measured at Site 1. Possible conditional operators are == equal, != non-equal, > larger, < smaller, >= larger or equal, <= smaller or equal, & AND, | OR, ! NOT, () brackets to order complex conditional statements.
  •  selecttreats=dat1[dat1$TREAT %in% c("CTRL", "N", "P", "K"),] can replace multiple conditional == statements linked together by OR.

Transforming variables in data tables, conditional transformations

  •  dat2=transform(dat1, VAR1=VAR1*0.4). Multiplies VAR1 by 0.4
  •  dat2=transform(dat1, VAR2=VAR1*2). Creates variable VAR2 that is twice the value of VAR1
  •  dat2=transform(dat1, VAR1=ifelse(VAR3=="Site 1", VAR1*0.4, VAR1)) multiplies VAR1 by 0.4 for entries measured at Site 1. For other sites the value stays the same. The general format is ifelse(condition, value if true, value if false).

 The vegan package offers many useful standard transformations for variables or an entire data table:

  •  dat2=decostand(dat1,"standardize"). Check out ?decostand to see all transformations.

Merging data tables
  •  dat3=merge(dat1,dat2,by="ID") merge two tables by ID field.
  •  dat3=merge(dat1,dat2,by.x="ID",by.y="STN") merge by an ID field that is differently named in the two datasets.
  •  dat3=merge(dat1,dat2,by=c("LAT","LONG")) merge by multiple ID fields.
  •  dat3=merge(dat1,dat2,by.x="ID",by.y="ID",all.x=T,all.y=F) left merge; all.x=F, all.y=T right merge; all.x=T,all.y=T keep all rows; all.x=F,all.y=F keep matching rows.
  •  cbind(dat1,dat2) combines two tables without a matching criterion (ID). This is rarely needed and can be dangerous, because the command pastes the tables together positionally without checking that the rows are in the same order.

Python Commands

The following is a list of basic pandas commands; since they are so common, you will find them in many places online.

Importing Data

Data analysis starts with gathering the data you want to interpret. Pandas gives you a good selection of functions for loading data into your Python workspace (a short runnable sketch follows the list):

  • pd.read_csv(filename) # From a CSV file
  • pd.read_table(filename) # From a delimited text file (like TSV)
  • pd.read_excel(filename) # From an Excel file
  • pd.read_sql(query, connection_object) # Reads from a SQL table/database
  • pd.read_json(json_string) # Reads from a JSON formatted string, URL or file.
  • pd.read_html(url) # Parses an html URL, string or file and extracts tables to a list of dataframes
  • pd.read_clipboard() # Retrieves the contents of your clipboard and passes it to read_table()
  • pd.DataFrame(dict) # From a dict, keys for columns names, values for data as lists
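
As a quick illustration of a couple of these readers, here is a minimal sketch that builds a DataFrame from a dict and round-trips it through a CSV file; the file name sample.csv and the column names are made up for the example.

import pandas as pd

# Build a small DataFrame from a dict (keys become column names)
df = pd.DataFrame({'id': [1, 2, 3], 'yield': [4.2, 5.1, 3.9]})

# Write it out and read it back with the CSV reader listed above
df.to_csv('sample.csv', index=False)
print(pd.read_csv('sample.csv'))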

Exploring Data

After you have imported your data into a pandas DataFrame, you can explore it in the following ways (a small worked example follows the list):

  • df.shape # Number of rows and columns of the DataFrame (an attribute, not a method)
  • df.head(n) # Prints first n rows of the DataFrame
  • df.tail(n) # Prints last n rows of the DataFrame
  • df.info() # Index, Datatype and Memory information
  • df.describe() # Summary statistics for numerical columns
  • s.value_counts(dropna=False) # Views unique values and counts
  • df.apply(pd.Series.value_counts) # Unique values and counts for all columns
  • df.mean() # Returns the mean of all columns
  • df.corr() # Returns the correlation between columns in a DataFrame
  • df.count() # Returns the number of non-null values in each DataFrame column
  • df.max() # Returns the highest value in each column
  • df.min() # Returns the lowest value in each column
  • df.median() # Returns the median of each column
  • df.std() # Returns the standard deviation of each column
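
As a minimal sketch of a few of these calls, the toy DataFrame below (column names invented for the example) can be explored as follows:

import pandas as pd

df = pd.DataFrame({'site': ['A', 'A', 'B', 'B'],
                   'yield': [4.2, 5.1, 3.9, None]})

print(df.shape)                               # (4, 2): rows and columns
print(df.head(2))                             # first 2 rows
print(df.describe())                          # summary statistics for numeric columns
print(df['site'].value_counts(dropna=False))  # unique values and counts
print(df.count())                             # non-null values per column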

Selecting

You’ll usually need to select an element or a subset of the data for further analysis. The following commands are useful (a short sketch follows the list):

  • df[col] # Returns column with label col as Series
  • df[[col1, col2]] # Returns Columns as a new DataFrame
  • s.iloc[0] # Selection by position (selects first element)
  • s.loc[0] # Selection by index label (selects the element whose index label is 0)
  • df.iloc[0,:] # First row
  • df.iloc[0,0] # First element of first column
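
A short sketch of these selection patterns on a toy DataFrame (names are invented for the example):

import pandas as pd

df = pd.DataFrame({'id': [10, 20, 30], 'yield': [4.2, 5.1, 3.9]})
s = df['yield']             # a single column as a Series

print(s.iloc[0])            # first element by position -> 4.2
print(s.loc[0])             # element whose index label is 0 -> 4.2
print(df.iloc[0, :])        # first row
print(df.iloc[0, 0])        # first element of first column -> 10
print(df[['id', 'yield']])  # two columns as a new DataFrame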

Data Cleaning

If you work with real data, you will almost always need to clean it. The following commands are useful for data cleaning (a short sketch follows the list):

  • df.columns = ['a','b','c'] # Renames columns
  • pd.isnull(obj) # Checks for null values, returns a Boolean array
  • pd.notnull(obj) # Opposite of pd.isnull()
  • df.dropna() # Drops all rows that contain null values
  • df.dropna(axis=1) # Drops all columns that contain null values
  • df.dropna(axis=1,thresh=n) # Drops all columns that have fewer than n non-null values
  • df.fillna(x) # Replaces all null values with x
  • s.fillna(s.mean()) # Replaces all null values with the mean (mean can be replaced with almost any function from the statistics section)
  • s.astype(float) # Converts the datatype of the series to float
  • s.replace(1,'one') # Replaces all values equal to 1 with 'one'
  • s.replace([1,3],['one','three']) # Replaces all 1 with 'one' and 3 with 'three'
  • df.rename(columns=lambda x: x + 1) # Mass renaming of columns
  • df.rename(columns={'old_name': 'new_name'}) # Selective renaming
  • df.set_index('column_one') # Changes the index
  • df.rename(index=lambda x: x + 1) # Mass renaming of index
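
A small cleaning sketch on a toy DataFrame with missing values (column names invented for the example):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})

print(df.dropna())                     # keep only rows with no null values
print(df.fillna(0))                    # replace nulls with 0
print(df['a'].fillna(df['a'].mean()))  # replace nulls in one column with its mean
df = df.rename(columns={'a': 'alpha', 'b': 'beta'})  # selective renaming
print(df.set_index('alpha'))           # use a column as the index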

Filter, Sort, and Group By

The following commands are useful for filtering, sorting, and grouping data (a short sketch follows the list):

  • df[df[col] > 0.5] # Rows where the col column is greater than 0.5
  • df[(df[col] > 0.5) & (df[col] < 0.7)] # Rows where 0.5 < col < 0.7
  • df.sort_values(col1) # Sorts values by col1 in ascending order
  • df.sort_values(col2,ascending=False) # Sorts values by col2 in descending order
  • df.sort_values([col1,col2], ascending=[True,False]) # Sorts values by col1 in ascending order then col2 in descending order
  • df.groupby(col) # Returns a groupby object for values from one column
  • df.groupby([col1,col2]) # Returns a groupby object for values from multiple columns
  • df.groupby(col1)[col2].mean() # Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
  • df.pivot_table(index=col1, values=[col2,col3], aggfunc=np.mean) # Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
  • df.groupby(col1).agg(np.mean) # Finds the average across all columns for every unique col1 group
  • df.apply(np.mean) # Applies a function across each column
  • df.apply(np.max, axis=1) # Applies a function across each row
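
A compact sketch that combines filtering, sorting, and grouping on a toy DataFrame (the site/treat/yield columns are invented for the example):

import pandas as pd
import numpy as np

df = pd.DataFrame({'site': ['A', 'A', 'B', 'B'],
                   'treat': ['low', 'high', 'low', 'high'],
                   'yield': [4.2, 5.1, 3.9, 4.8]})

print(df[df['yield'] > 4.0])                     # filter rows by a condition
print(df.sort_values('yield', ascending=False))  # sort descending
print(df.groupby('site')['yield'].mean())        # mean yield per site
print(df.pivot_table(index='site', values='yield', aggfunc=np.mean))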

Joining and Combining

The following commands can help you combine two DataFrames (a short sketch follows the list):

  • df1.append(df2) # Adds the rows of df2 to the end of df1 (columns should be identical)
  • pd.concat([df1, df2],axis=1) # Adds the columns of df2 to the end of df1 (rows should be identical)
  • df1.join(df2,on=col1,how='inner') # SQL-style join of the columns in df1 with the columns in df2 where the rows of col1 have identical values. how can be one of 'left', 'right', 'outer', 'inner'
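
A brief sketch of row-wise and key-based combination; pd.merge is shown here as the common key-based alternative to df.join, and the id column is invented for the example:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'x': [10, 20, 30]})
df2 = pd.DataFrame({'id': [2, 3, 4], 'y': [200, 300, 400]})

print(pd.concat([df1, df2]))                     # stack rows (unshared columns become NaN)
print(pd.merge(df1, df2, on='id', how='inner'))  # SQL-style inner join on id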

Writing Data

After you have results from the entire process, it’s time to export the data (a short round-trip sketch follows the list):

  • df.to_csv(filename) # Writes to a CSV file
  • df.to_excel(filename) # Writes to an Excel file
  • df.to_sql(table_name, connection_object) # Writes to a SQL table
  • df.to_json(filename) # Writes to a file in JSON format
  • df.to_html(filename) # Saves as an HTML table
  • df.to_clipboard() # Writes to the clipboard
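
A short round-trip sketch; the file names are placeholders for the example:

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'score': [0.5, 0.7]})
df.to_csv('results.csv', index=False)  # write without the row index
df.to_json('results.json')             # the same data as JSON
print(pd.read_csv('results.csv'))      # read it back to confirm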

Machine Learning

The Scikit-Learn library has many helpful tools for training and applying machine learning models. The short Scikit-Learn example below walks through a typical workflow.

If you would like a full list of the Supervised Learning, Unsupervised Learning, Dataset Transformation and Model Evaluation modules in Scikit-Learn, visit its user guide for more information.

# Import libraries and modules
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # in older scikit-learn versions: from sklearn.externals import joblib

# Load red wine data
dataset_url = 'https://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

# Split data into training and test sets
y = data.quality
X = data.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=123,
                                                    stratify=y)

# Declare data preprocessing steps
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))

# Declare hyperparameters to tune
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

# Tune model using cross-validation pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)

# Refit on the entire training set
# No additional code needed if clf.refit == True (default is True)

# Evaluate model pipeline on test data
pred = clf.predict(X_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))

# Save model for future use
joblib.dump(clf, 'rf_regressor.pkl')
# To load: clf2 = joblib.load('rf_regressor.pkl')

SQL

Here are popular SQL statements:

Query data in columns c1, c2 from a table

SELECT c1, c2 FROM t;

Query all rows and columns from a table

SELECT * FROM t;

Query data and filter rows with a condition

SELECT c1, c2 FROM t

WHERE condition;

Query distinct rows from a table

SELECT DISTINCT c1 FROM t

WHERE condition;

Sort the result set in ascending or descending order

SELECT c1, c2 FROM t

ORDER BY c1 ASC [DESC];

Skip offset of rows and return the next n rows

SELECT c1, c2 FROM t

ORDER BY c1

LIMIT n OFFSET offset;

Group rows using an aggregate function

SELECT c1, aggregate(c2)

FROM t

GROUP BY c1;

Filter groups using HAVING clause

SELECT c1, aggregate(c2)

FROM t

GROUP BY c1

HAVING condition;
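
One convenient way to try these statements is SQLite through Python's built-in sqlite3 module; the table t and its columns c1, c2 below are created just for the demonstration.

import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute('CREATE TABLE t (c1 TEXT, c2 INT)')
conn.executemany('INSERT INTO t (c1, c2) VALUES (?, ?)',
                 [('A', 10), ('A', 20), ('B', 30)])

# Filter and sort rows, then group and filter groups, as in the statements above
print(conn.execute('SELECT c1, c2 FROM t WHERE c2 > 15 ORDER BY c2 DESC').fetchall())
print(conn.execute('SELECT c1, SUM(c2) FROM t GROUP BY c1 HAVING SUM(c2) > 25').fetchall())
conn.close()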

Querying from multiple tables

Inner join t1 and t2

SELECT c1, c2

FROM t1

INNER JOIN t2 ON condition;

Left join t1 and t2

SELECT c1, c2

FROM t1

LEFT JOIN t2 ON condition;

Right join t1 and t2

SELECT c1, c2

FROM t1

RIGHT JOIN t2 ON condition;

Perform full outer join

SELECT c1, c2

FROM t1

FULL OUTER JOIN t2 ON condition;

Produce a Cartesian product of rows in tables

SELECT c1, c2

FROM t1

CROSS JOIN t2;

Another way to perform cross join

SELECT c1, c2

FROM t1, t2;

Join t1 to itself using INNER JOIN clause

SELECT c1, c2

FROM t1 A

INNER JOIN t1 B ON condition;
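
The same sqlite3 approach can demonstrate joins; the tables t1 and t2 and their id columns are invented for the example (SQLite supports INNER and LEFT joins directly).

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t1 (id INT, c1 TEXT)')
conn.execute('CREATE TABLE t2 (id INT, c2 TEXT)')
conn.executemany('INSERT INTO t1 VALUES (?, ?)', [(1, 'a'), (2, 'b')])
conn.executemany('INSERT INTO t2 VALUES (?, ?)', [(2, 'x'), (3, 'y')])

# Inner join keeps only ids present in both tables; left join keeps all t1 rows
print(conn.execute('SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id').fetchall())
print(conn.execute('SELECT t1.id, c1, c2 FROM t1 LEFT JOIN t2 ON t1.id = t2.id').fetchall())
conn.close()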

Using SQL Operators

Combine rows from two queries

SELECT c1, c2 FROM t1

UNION [ALL]

SELECT c1, c2 FROM t2;

Return the intersection of two queries

SELECT c1, c2 FROM t1

INTERSECT

SELECT c1, c2 FROM t2;

Subtract a result set from another result set

SELECT c1, c2 FROM t1

MINUS

SELECT c1, c2 FROM t2;

Query rows using pattern matching %, _

SELECT c1, c2 FROM t1

WHERE c1 [NOT] LIKE pattern;

Query rows in a list

SELECT c1, c2 FROM t

WHERE c1 [NOT] IN value_list;

Query rows between two values

SELECT c1, c2 FROM t

WHERE  c1 BETWEEN low AND high;

Check whether values in a table are NULL or not

SELECT c1, c2 FROM t

WHERE  c1 IS [NOT] NULL;

Managing tables

Create a new table with three columns

CREATE TABLE t (

     id INT PRIMARY KEY,

     name VARCHAR NOT NULL,

     price INT DEFAULT 0

);

Delete the table from the database

DROP TABLE t ;

Add a new column to the table

ALTER TABLE t ADD column;

Drop column c from the table

ALTER TABLE t DROP COLUMN c ;

Add a constraint

ALTER TABLE t ADD constraint;

Drop a constraint

ALTER TABLE t DROP constraint;

Rename a table from t1 to t2

ALTER TABLE t1 RENAME TO t2;

Rename column c1 to c2

ALTER TABLE t1 RENAME COLUMN c1 TO c2;

Remove all data in a table

TRUNCATE TABLE t;

Using SQL constraints

Set c1 and c2 as a primary key

CREATE TABLE t(

    c1 INT, c2 INT, c3 VARCHAR,

    PRIMARY KEY (c1,c2)

);

Set c2 column as a foreign key

CREATE TABLE t1(

    c1 INT PRIMARY KEY,

    c2 INT,

    FOREIGN KEY (c2) REFERENCES t2(c2)

);

Make the values in c1 and c2 unique

CREATE TABLE t(

    c1 INT, c2 INT,

    UNIQUE(c1,c2)

);

Ensure c1 > 0 and values in c1 >= c2

CREATE TABLE t(

  c1 INT, c2 INT,

  CHECK(c1> 0 AND c1 >= c2)

);

Set values in c2 column not NULL

CREATE TABLE t(

     c1 INT PRIMARY KEY,

     c2 VARCHAR NOT NULL

);

Modifying Data

Insert one row into a table

INSERT INTO t(column_list)

VALUES(value_list);

Insert multiple rows into a table

INSERT INTO t(column_list)

VALUES (value_list),

       (value_list), …;

Insert rows from t2 into t1

INSERT INTO t1(column_list)

SELECT column_list

FROM t2;

Update new value in the column c1 for all rows

UPDATE t

SET c1 = new_value;

Update values in the column c1, c2 that match the condition

UPDATE t

SET c1 = new_value,

        c2 = new_value

WHERE condition;

Delete all data in a table

DELETE FROM t;

Delete subset of rows in a table

DELETE FROM t

WHERE condition;
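
A minimal sqlite3 sketch of the modification statements above; the table t and its columns are created only for the demonstration.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (id INT, price INT)')
conn.execute('INSERT INTO t (id, price) VALUES (?, ?)', (1, 10))   # insert one row
conn.execute('UPDATE t SET price = 20 WHERE id = 1')               # update matching rows
print(conn.execute('SELECT * FROM t').fetchall())                  # [(1, 20)]
conn.execute('DELETE FROM t WHERE id = 1')                         # delete matching rows
conn.close()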

Managing Views

Create a new view that consists of c1 and c2

CREATE VIEW v(c1,c2)

AS

SELECT c1, c2

FROM t;

Create a new view with check option

CREATE VIEW v(c1,c2)

AS

SELECT c1, c2

FROM t

WITH [CASCADED | LOCAL] CHECK OPTION;

Create a recursive view

CREATE RECURSIVE VIEW v

AS

select-statement -- anchor part

UNION [ALL]

select-statement; -- recursive part

Create a temporary view

CREATE TEMPORARY VIEW v

AS

SELECT c1, c2

FROM t;

Delete a view

DROP VIEW view_name;

Managing indexes

Create an index on c1 and c2 of the t table

CREATE INDEX idx_name

ON t(c1,c2);

Create a unique index on c3, c4 of the t table

CREATE UNIQUE INDEX idx_name

ON t(c3,c4);

Drop an index

DROP INDEX idx_name;

Managing triggers

Create or modify a trigger

CREATE OR MODIFY TRIGGER trigger_name

WHEN EVENT

ON table_name TRIGGER_TYPE

EXECUTE stored_procedure;

WHEN

  • BEFORE – invoke before the event occurs
  • AFTER – invoke after the event occurs

EVENT

  • INSERT – invoke for INSERT
  • UPDATE – invoke for UPDATE
  • DELETE – invoke for DELETE

TRIGGER_TYPE

  • FOR EACH ROW
  • FOR EACH STATEMENT

Delete a specific trigger

DROP TRIGGER trigger_name;

Data Science Training

We hope you found this short cheat sheet useful in your studies, whether for a certification test or an interview. Our courses go into even greater detail about these languages and much more.

Connect with our experts to learn more about our data science courses.
