From time to time, I work on tasks in both SQL and Python, and Pandas' syntax is quite different from SQL's. With SQL, you declare what you want in a statement that almost reads like English. In Pandas, you apply operations to the dataset and chain them together to transform and reshape the data into the form you want.
A phrasebook therefore comes in handy in my everyday work. Here I record equivalent expressions in SQL and Pandas.
Import Data in Pandas
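The examples below assume the datasets have already been loaded into dataframes. A minimal sketch, assuming the airports, airport_freq, and runways data are available as local CSV files (the file paths are placeholders):

```python
import pandas as pd

# Load each CSV file into a dataframe; the paths are hypothetical.
airports = pd.read_csv('data/airports.csv')
airport_freq = pd.read_csv('data/airport-frequencies.csv')
runways = pd.read_csv('data/runways.csv')
```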
Phrases
SELECT, WHERE, DISTINCT, LIMIT
| SQL | Pandas |
| --- | --- |
| SELECT * FROM airports | airports |
| SELECT * FROM airports LIMIT 3 | airports.head(3) |
| SELECT id FROM airports WHERE ident = 'KLAX' | airports[airports.ident == 'KLAX'].id |
| SELECT DISTINCT type FROM airports | airports.type.unique() |
SELECT with Multiple Conditions
| SQL | Pandas |
| --- | --- |
| SELECT * FROM airports WHERE iso_region = 'US-CA' AND type = 'seaplane_base' | airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')] |
| SELECT ident, name, municipality FROM airports WHERE iso_region = 'US-CA' OR type = 'large_airport' | airports[(airports.iso_region == 'US-CA') \| (airports.type == 'large_airport')][['ident', 'name', 'municipality']] |
| SELECT video_id, title FROM df WHERE likes >= 10000 | df.loc[df['likes'] >= 10000, ['video_id', 'title']] |
| SELECT video_id, title FROM df WHERE likes >= 10000 AND dislikes < 5000 | df.loc[(df['likes'] >= 10000) & (df['dislikes'] < 5000), ['video_id', 'title']] |
As an example, the code below returns all videos that contain the word 'math' in their description.
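A minimal sketch, assuming df has a string column named description:

```python
# Rows whose description mentions 'math'; na=False treats missing
# descriptions as non-matches instead of raising an error.
df[df['description'].str.contains('math', na=False)]
```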
IS (NOT) NULL
| SQL | Pandas |
| --- | --- |
| SELECT * FROM frame WHERE col2 IS NULL | frame[frame['col2'].isna()] |
| SELECT * FROM frame WHERE col2 IS NOT NULL | frame[frame['col2'].notna()] |
ORDER BY
| SQL | Pandas |
| --- | --- |
| SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type | airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type') |
| SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type DESC | airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False) |
IN… NOT IN
| SQL | Pandas |
| --- | --- |
| SELECT * FROM airports WHERE type IN ('heliport', 'balloonport') | airports[airports.type.isin(['heliport', 'balloonport'])] |
| SELECT * FROM airports WHERE type NOT IN ('heliport', 'balloonport') | airports[~airports.type.isin(['heliport', 'balloonport'])] |
COUNT, GROUP BY, ORDER BY
| SQL | Pandas |
| --- | --- |
| SELECT iso_country, type, COUNT(*) FROM airports GROUP BY iso_country, type ORDER BY iso_country, type | airports.groupby(['iso_country', 'type']).size() |
| SELECT iso_country, type, COUNT(*) FROM airports GROUP BY iso_country, type ORDER BY iso_country, COUNT(*) DESC | airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False]) |
Notice that in the Pandas code we used size() and not count(). This is because count() in Pandas is applied to each column separately, returning the number of non-null records in each.
Alternatively, we could have applied the count() method to an individual column:
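A minimal sketch, counting the ident column within each group (any non-null column would do):

```python
# count() on a single column returns its non-null count per group.
airports.groupby(['iso_country', 'type'])['ident'].count()
```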
HAVING
| SQL | Pandas |
| --- | --- |
| SELECT type, COUNT(*) FROM airports WHERE iso_country = 'US' GROUP BY type HAVING COUNT(*) > 1000 ORDER BY COUNT(*) DESC | airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False) |
Top N rows with LIMIT OFFSET
| SQL | Pandas |
| --- | --- |
| SELECT iso_country FROM by_country ORDER BY airport_count DESC LIMIT 10 | by_country.nlargest(10, columns='airport_count') |
| SELECT iso_country FROM by_country ORDER BY airport_count DESC LIMIT 10 OFFSET 10 | by_country.nlargest(20, columns='airport_count').tail(10) |
Top N rows per group with ROW_NUMBER()
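Pandas has no direct ROW_NUMBER(), but sorting within groups and keeping the first N rows gives the same result. A minimal sketch, assuming we want the two longest runways per airport (reusing the runways dataframe from the examples above; the airport_ident and length_ft column names are assumptions):

```python
# SQL equivalent (sketch):
#   SELECT * FROM (
#     SELECT *, ROW_NUMBER() OVER (
#       PARTITION BY airport_ident ORDER BY length_ft DESC) AS rn
#     FROM runways) t
#   WHERE rn <= 2
top_two = (runways
           .sort_values('length_ft', ascending=False)
           .groupby('airport_ident')
           .head(2))

# If you need the row number itself, cumcount() numbers rows within each group.
runways_sorted = runways.sort_values(['airport_ident', 'length_ft'],
                                     ascending=[True, False])
runways_sorted['rn'] = runways_sorted.groupby('airport_ident').cumcount() + 1
```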
Aggregate functions (MIN, MAX, MEAN, SUM)
| SQL | Pandas |
| --- | --- |
| SELECT MIN(views) FROM table | table.loc[:, ['views']].min() |
| SELECT MAX(length_ft), MIN(length_ft), AVG(length_ft), MEDIAN(length_ft) FROM runways | runways.agg({'length_ft': ['min', 'max', 'mean', 'median']}) |
| SELECT channel_title, SUM(views), SUM(likes), SUM(dislikes) FROM df GROUP BY channel_title | df.loc[:, ['channel_title', 'views', 'likes', 'dislikes']].groupby(['channel_title']).sum() |
Multiple functions can also be applied at once.
| SQL | Pandas |
| --- | --- |
| SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day | tips.groupby('day').agg({'tip': ['mean', 'size']}) |
Grouping by more than one column is done by passing a list of columns to the groupby() method.
| SQL | Pandas |
| --- | --- |
| SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUP BY smoker, day; | tips.groupby(['smoker', 'day']).agg({'tip': ['size', 'mean']}) |
JOIN
| SQL | Pandas |
| --- | --- |
| SELECT airport_ident, type, description, frequency_mhz FROM airport_freq JOIN airports ON airport_freq.airport_ref = airports.id WHERE airports.ident = 'KLAX' | airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref', right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']] |
| SELECT * FROM df1 LEFT JOIN df2 ON df1.key = df2.key; | pd.merge(df1, df2, on='key', how='left') |
| SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key; | pd.merge(df1, df2, on='key', how='outer') |
UNION ALL & UNION
| SQL | Pandas |
| --- | --- |
| SELECT name, municipality FROM airports WHERE ident = 'KLAX' UNION ALL SELECT name, municipality FROM airports WHERE ident = 'KLGB' | pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']], airports[airports.ident == 'KLGB'][['name', 'municipality']]]) |
To deduplicate rows (the equivalent of UNION), you'd also have to add .drop_duplicates():
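```python
# UNION rather than UNION ALL: drop duplicate rows after concatenating.
pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']],
           airports[airports.ident == 'KLGB'][['name', 'municipality']]]).drop_duplicates()
```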
INSERT
| SQL | Pandas |
| --- | --- |
| CREATE TABLE heroes (id INTEGER, name TEXT); | df1 = pd.DataFrame({'id': [1, 2], 'name': ['Harry Potter', 'Ron Weasley']}) |
| INSERT INTO heroes VALUES (1, 'Harry Potter'); | df2 = pd.DataFrame({'id': [3], 'name': ['Hermione Granger']}) |
| INSERT INTO heroes VALUES (2, 'Ron Weasley'); |  |
| INSERT INTO heroes VALUES (3, 'Hermione Granger'); | pd.concat([df1, df2]).reset_index(drop=True) |
UPDATE
| SQL | Pandas |
| --- | --- |
| UPDATE airports SET home_link = 'http://www.lawa.org/welcomelax.aspx' WHERE ident = 'KLAX' | airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx' |
ALTER
| SQL | Pandas |
| --- | --- |
| ALTER TABLE df ADD COLUMN like_ratio FLOAT; UPDATE df SET like_ratio = likes / (likes + dislikes); | df['like_ratio'] = df['likes'] / (df['likes'] + df['dislikes']) |
DELETE
| SQL | Pandas |
| --- | --- |
| DELETE FROM lax_freq WHERE type = 'MISC' | lax_freq = lax_freq[lax_freq.type != 'MISC'] |
|  | lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index) |
In Pandas we can also use the .drop() method to remove rows by the index labels we pass in. Unlike boolean filtering, .drop() doesn't accept a boolean array as input, so we first have to turn our condition into indices, either by selecting the matching rows and passing their .index (as in the table above), or with NumPy's np.where() function (which returns positional indices, so it matches the labels only when the dataframe has a default integer index).
In the example below, we delete all rows where channel_title != '3Blue1Brown'.
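A minimal sketch, assuming df has a default integer index:

```python
import numpy as np

# Positions of the rows to delete; with a default RangeIndex these
# positions coincide with the index labels that .drop() expects.
rows_to_drop = np.where(df['channel_title'] != '3Blue1Brown')[0]
df = df.drop(rows_to_drop)
```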
Immutability
I need to mention one important thing: immutability. By default, most operations applied to a Pandas dataframe return a new object. Some of them accept an inplace=True parameter, so you can modify the original dataframe instead. For example, here is how you would reset an index in place:
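```python
# Modifies df in place instead of returning a new dataframe.
df.reset_index(drop=True, inplace=True)
```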
However, the .loc operator in the UPDATE example above simply locates the records to update, and their values are changed in place. The same is true if you update all values in a column:
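```python
# Assigning to an existing column replaces its values in the original dataframe.
df['views'] = 0
```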
or add a new calculated column:
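```python
# Adding a calculated column also mutates the original dataframe,
# just like the ALTER example above.
df['like_ratio'] = df['likes'] / (df['likes'] + df['dislikes'])
```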
Pandas Export
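A minimal sketch of writing a dataframe back out, assuming a local output path (the file name is a placeholder):

```python
# Write the dataframe to CSV; index=False omits the index column.
df.to_csv('output.csv', index=False)
```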