I regularly work in both SQL and Python, but Pandas' syntax is quite different from SQL's. In SQL, you declare what you want in a statement that almost reads like English. In Pandas, you apply operations to the dataset and chain them to transform and reshape the data the way you want it.

A phrasebook therefore comes in handy in my everyday work. Here I record the equivalents in SQL and Pandas side by side.

Import Data in Pandas

import pandas as pd

airports = pd.read_csv('data/airports.csv')
airport_freq = pd.read_csv('data/airport-frequencies.csv')
runways = pd.read_csv('data/runways.csv')
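
If the data already lives in a database rather than in CSV files, pandas can also run a query against it directly; a minimal sketch, assuming a SQLite file at a hypothetical path:

import sqlite3

conn = sqlite3.connect('data/airports.db')  # hypothetical database file
airports = pd.read_sql('SELECT * FROM airports', conn)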

Phrases

SELECT, WHERE, DISTINCT, LIMIT

SQL:    SELECT * FROM airports
Pandas: airports

SQL:    SELECT * FROM airports LIMIT 3
Pandas: airports.head(3)

SQL:    SELECT id FROM airports WHERE ident = 'KLAX'
Pandas: airports[airports.ident == 'KLAX'].id

SQL:    SELECT DISTINCT type FROM airports
Pandas: airports.type.unique()

SELECT with Multiple Conditions

SQL:    SELECT * FROM airports WHERE iso_region = 'US-CA' AND type = 'seaplane_base'
Pandas: airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]

SQL:    SELECT ident, name, municipality FROM airports WHERE iso_region = 'US-CA' OR type = 'large_airport'
Pandas: airports[(airports.iso_region == 'US-CA') | (airports.type == 'large_airport')][['ident', 'name', 'municipality']]

SQL:    SELECT video_id, title FROM df WHERE likes >= 10000
Pandas: df.loc[df['likes'] >= 10000, ['video_id', 'title']]

SQL:    SELECT video_id, title FROM df WHERE likes >= 10000 AND dislikes < 5000
Pandas: df.loc[(df['likes'] >= 10000) & (df['dislikes'] < 5000), ['video_id', 'title']]

As an example, the code below returns all videos that contain the word ‘math’ in their description.

SELECT  video_id
        ,title 
FROM    df 
WHERE   description LIKE '%math%';
# filter out null descriptions first, since str.contains() returns NaN for them
df_notnull = df[df['description'].notna()]

df_notnull.loc[df_notnull['description'].str.contains('math'), ['video_id', 'title']].drop_duplicates()

IS (NOT) NULL

SQL:    SELECT * FROM frame WHERE col2 IS NULL
Pandas: frame[frame['col2'].isna()]

SQL:    SELECT * FROM frame WHERE col2 IS NOT NULL
Pandas: frame[frame['col2'].notna()]

ORDER BY

SQL:    SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type
Pandas: airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type')

SQL:    SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type DESC
Pandas: airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False)

IN… NOT IN

SQL:    SELECT * FROM airports WHERE type IN ('heliport', 'balloonport')
Pandas: airports[airports.type.isin(['heliport', 'balloonport'])]

SQL:    SELECT * FROM airports WHERE type NOT IN ('heliport', 'balloonport')
Pandas: airports[~airports.type.isin(['heliport', 'balloonport'])]
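
The same isin() trick also covers a subquery-style IN, using a column from another frame. For example, to keep only airports that appear in the frequency table (using the airport_ref foreign key from the data loaded above):

# SELECT * FROM airports WHERE id IN (SELECT airport_ref FROM airport_freq)
airports[airports.id.isin(airport_freq.airport_ref)]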

COUNT, GROUP BY, ORDER BY

SQL:    SELECT iso_country, type, COUNT(*) FROM airports GROUP BY iso_country, type ORDER BY iso_country, type
Pandas: airports.groupby(['iso_country', 'type']).size()

SQL:    SELECT iso_country, type, COUNT(*) FROM airports GROUP BY iso_country, type ORDER BY iso_country, COUNT(*) DESC
Pandas: airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])

Notice that in the pandas code we used size() and not count(). This is because count() in pandas applies the function to each column, returning the number of non-null records within each.

In [19]: tips.groupby('sex').count()
Out[19]: 
        total_bill  tip  smoker  day  time  size
sex                                             
Female          87   87      87   87    87    87
Male           157  157     157  157   157   157

Alternatively, we could have applied the count() method to an individual column:

In [20]: tips.groupby('sex')['total_bill'].count()
Out[20]: 
sex
Female     87
Male      157
Name: total_bill, dtype: int64

HAVING

SQL:    SELECT type, COUNT(*) FROM airports WHERE iso_country = 'US' GROUP BY type HAVING COUNT(*) > 1000 ORDER BY COUNT(*) DESC
Pandas: airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)
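
The chained double groupby above works, but it is easy to misread. An arguably clearer route is to compute the counts first and then filter them, which should give the same result:

# count US airports per type, then keep only the frequent types
counts = airports[airports.iso_country == 'US'].groupby('type').size()
counts[counts > 1000].sort_values(ascending=False)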

Top N rows with LIMIT OFFSET

SQL:    SELECT iso_country FROM by_country ORDER BY airport_count DESC LIMIT 10
Pandas: by_country.nlargest(10, columns='airport_count')

SQL:    SELECT iso_country FROM by_country ORDER BY airport_count DESC LIMIT 10 OFFSET 10
Pandas: by_country.nlargest(20, columns='airport_count').tail(10)
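
If the frame is already sorted, plain positional slicing expresses LIMIT ... OFFSET just as well; a sketch against the same by_country frame:

# ORDER BY airport_count DESC LIMIT 10 OFFSET 10
by_country.sort_values('airport_count', ascending=False).iloc[10:20]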

Top N rows per group with ROW_NUMBER()

SELECT * FROM (
  SELECT t.*
        ,ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [36]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
   ....:                     .groupby(['day'])
   ....:                     .cumcount() + 1)
   ....:      .query('rn < 3')
   ....:      .sort_values(['day', 'rn']))
   ....: 
Out[36]: 
     total_bill    tip     sex smoker   day    time  size  rn
95        40.17   4.73    Male    Yes   Fri  Dinner     4   1
90        28.97   3.00    Male    Yes   Fri  Dinner     2   2
170       50.81  10.00    Male    Yes   Sat  Dinner     3   1
212       48.33   9.00    Male     No   Sat  Dinner     4   2
156       48.17   5.00    Male     No   Sun  Dinner     6   1
182       45.35   3.50    Male    Yes   Sun  Dinner     3   2
197       43.11   5.00  Female    Yes  Thur   Lunch     4   1
142       41.19   5.00    Male     No  Thur   Lunch     5   2
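
If you only need the rows themselves and not the rank, a shorter pandas idiom is to sort first and take the head of each group; this should return the same rows as the ROW_NUMBER() query above (tips is the same dataset):

# top 2 bills per day, without materializing a row-number column
tips.sort_values('total_bill', ascending=False).groupby('day').head(2)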

Aggregate functions (MIN, MAX, MEAN, SUM)

SQL:    SELECT MIN(views) FROM table
Pandas: table.loc[:, ['views']].min()

SQL:    SELECT MAX(length_ft), MIN(length_ft), AVG(length_ft), MEDIAN(length_ft) FROM runways
Pandas: runways.agg({'length_ft': ['min', 'max', 'mean', 'median']})

SQL:    SELECT channel_title, SUM(views), SUM(likes), SUM(dislikes) FROM df GROUP BY channel_title
Pandas: df.loc[:, ['channel_title', 'views', 'likes', 'dislikes']].groupby(['channel_title']).sum()

Multiple functions can also be applied at once.

SQL:    SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day
Pandas: tips.groupby('day').agg({'tip': 'mean', 'day': 'size'})
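
Since pandas 0.25 there is also named aggregation, which lets you name the output columns directly (avg_tip and n are names I chose here):

tips.groupby('day').agg(avg_tip=('tip', 'mean'), n=('tip', 'size'))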

Grouping by more than one column is done by passing a list of columns to the groupby() method.

SQL:    SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUP BY smoker, day
Pandas: tips.groupby(['smoker', 'day']).agg({'tip': ['size', 'mean']})

JOIN

SQL:    SELECT airport_ident, type, description, frequency_mhz FROM airport_freq JOIN airports ON airport_freq.airport_ref = airports.id WHERE airports.ident = 'KLAX'
Pandas: airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref', right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']]

SQL:    SELECT * FROM df1 LEFT JOIN df2 ON df1.key = df2.key
Pandas: pd.merge(df1, df2, on='key', how='left')

SQL:    SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key
Pandas: pd.merge(df1, df2, on='key', how='outer')
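
A tiny self-contained example (df1 and df2 are made up for illustration) shows how unmatched keys surface as NaN in a left join and as extra rows in a full outer join:

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'other': [20, 30, 40]})

pd.merge(df1, df2, on='key', how='left')   # keeps A, with other = NaN
pd.merge(df1, df2, on='key', how='outer')  # keeps both A and D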

UNION ALL & UNION

SQL:    SELECT name, municipality FROM airports WHERE ident = 'KLAX' UNION ALL SELECT name, municipality FROM airports WHERE ident = 'KLGB'
Pandas: pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']], airports[airports.ident == 'KLGB'][['name', 'municipality']]])

To deduplicate things (equivalent of UNION), you’d also have to add .drop_duplicates():

pd.concat([df1, df2]).drop_duplicates()

INSERT

SQL:
CREATE TABLE heroes (id INTEGER, name TEXT);
INSERT INTO heroes VALUES (1, 'Harry Potter');
INSERT INTO heroes VALUES (2, 'Ron Weasley');
INSERT INTO heroes VALUES (3, 'Hermione Granger');

Pandas:
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Harry Potter', 'Ron Weasley']})
df2 = pd.DataFrame({'id': [3], 'name': ['Hermione Granger']})
pd.concat([df1, df2]).reset_index(drop=True)
To insert a single row into an existing frame, build a one-row DataFrame and concatenate it (DataFrame.append was deprecated and has been removed in pandas 2.0, so pd.concat is the way to go):

new_row = pd.DataFrame({'video_id': ['EkZGBdY0vlg'],
                        'channel_title': ['Professor Leonard'],
                        'title': ['Calculus 3 Lecture 13.3: Partial Derivatives']})
df = pd.concat([df, new_row], ignore_index=True)

UPDATE

SQL:    UPDATE airports SET home_link = 'http://www.lawa.org/welcomelax.aspx' WHERE ident = 'KLAX'
Pandas: airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx'

ALTER

SQL:    ALTER TABLE df ADD COLUMN like_ratio REAL
Pandas: df['like_ratio'] = df['likes'] / (df['likes'] + df['dislikes'])

DELETE

SQL:    DELETE FROM lax_freq WHERE type = 'MISC'
Pandas: lax_freq = lax_freq[lax_freq.type != 'MISC']
Pandas: lax_freq = lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index)

In pandas, we can use the .drop() method to remove the rows whose index labels we pass in. Unlike other methods, this one doesn't accept boolean arrays as input, so we must first convert our condition's output to indices. One way is NumPy's np.where() function; note that it returns positional indices, which only match .drop()'s label-based lookup when the frame has the default RangeIndex.

In the example below we delete all rows where channel_title != '3Blue1Brown'.

DELETE FROM df WHERE channel_title != '3Blue1Brown';

import numpy as np

# only keep the opposite
df = df[df.channel_title == '3Blue1Brown']
# or drop based on the condition (assumes the default RangeIndex)
df = df.drop(np.where(df['channel_title'] != '3Blue1Brown')[0])

Immutability

One important thing deserves a mention: immutability. By default, most operations on a Pandas dataframe return a new object. Some methods accept an inplace=True parameter, so you can modify the original dataframe instead. For example, here is how you would reset an index in place:

df.reset_index(drop=True, inplace=True)

However, the .loc assignment in the UPDATE example above locates the records to update and changes their values in place. The same is true if you overwrite every value in a column:

df['url'] = 'http://google.com'

or add a new calculated column:

df['total_cost'] = df['price'] * df['quantity']
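
Both of these assignments modify df itself. If you want to leave the original frame untouched, work on an explicit copy; a minimal sketch:

df_copy = df.copy()                    # an independent copy, not a view
df_copy['url'] = 'http://google.com'   # the original df is unaffected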

Pandas Export

df.to_csv(...)  # csv file
df.to_hdf(...)  # HDF5 file
df.to_pickle(...)  # serialized object
df.to_sql(...)  # to SQL database
df.to_excel(...)  # to Excel sheet
df.to_json(...)  # to JSON string
df.to_html(...)  # render as HTML table
df.to_feather(...)  # binary feather-format
df.to_latex(...)  # tabular environment table
df.to_stata(...)  # Stata binary data files
df.to_gbq(...)  # to a Google BigQuery table.
df.to_string(...)  # console-friendly tabular output.
df.to_clipboard(...) # clipboard that can be pasted into Excel
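
Since this is an SQL phrasebook, the most fitting export is to_sql itself. A minimal round trip through an in-memory SQLite database (the table name 'airports' is just for illustration):

import sqlite3

conn = sqlite3.connect(':memory:')
airports.to_sql('airports', conn, index=False)        # write the dataframe as a table
pd.read_sql('SELECT * FROM airports LIMIT 3', conn)   # query it back with plain SQL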
