Rewrite SQL Queries in Pandas

Simon Huang


From time to time I switch between SQL and Python for data tasks, and Pandas' syntax is quite different from SQL. In SQL you declare what you want in a statement that almost reads like English; in Pandas you apply operations to the dataset and chain them to transform and reshape the data the way you want it. A phrasebook is therefore handy in everyday work, so here I record equivalent expressions in both SQL and Pandas.

Import Data in Pandas

import pandas as pd

airports = pd.read_csv('data/airports.csv')
airport_freq = pd.read_csv('data/airport-frequencies.csv')
runways = pd.read_csv('data/runways.csv')

Phrases

SELECT, WHERE, DISTINCT, LIMIT

SQL: SELECT * FROM airports
Pandas: airports

SQL: SELECT * FROM airports LIMIT 3
Pandas: airports.head(3)

SQL: SELECT id FROM airports WHERE ident = 'KLAX'
Pandas: airports[airports.ident == 'KLAX'].id

SQL: SELECT DISTINCT type FROM airports
Pandas: airports.type.unique()
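As a self-contained illustration, here are those phrases on a tiny made-up stand-in for the airports frame (the real CSV is not needed):

```python
import pandas as pd

# Tiny stand-in for the airports table (made-up rows, for illustration only)
airports = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'ident': ['KLAX', 'KJFK', 'KSFO', '00A'],
    'type': ['large_airport', 'large_airport', 'large_airport', 'heliport'],
})

first_three = airports.head(3)                    # SELECT * FROM airports LIMIT 3
klax_id = airports[airports.ident == 'KLAX'].id   # SELECT id ... WHERE ident = 'KLAX'
distinct_types = airports.type.unique()           # SELECT DISTINCT type FROM airports
```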

SELECT with Multiple Conditions

SQL: SELECT * FROM airports WHERE iso_region = 'US-CA' AND type = 'seaplane_base'
Pandas: airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]

SQL: SELECT ident, name, municipality FROM airports WHERE iso_region = 'US-CA' OR type = 'large_airport'
Pandas: airports[(airports.iso_region == 'US-CA') | (airports.type == 'large_airport')][['ident', 'name', 'municipality']]

SQL: SELECT video_id, title FROM df WHERE likes >= 10000
Pandas: df.loc[df['likes'] >= 10000, ['video_id', 'title']]

SQL: SELECT video_id, title FROM df WHERE likes >= 10000 AND dislikes < 5000
Pandas: df.loc[(df['likes'] >= 10000) & (df['dislikes'] < 5000), ['video_id', 'title']]
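One pitfall worth demonstrating: & and | replace AND and OR, and each comparison must be wrapped in its own parentheses because & and | bind more tightly than ==. A runnable sketch with made-up rows:

```python
import pandas as pd

airports = pd.DataFrame({
    'ident': ['KLAX', 'KJFK', 'S01'],
    'iso_region': ['US-CA', 'US-NY', 'US-CA'],
    'type': ['large_airport', 'large_airport', 'seaplane_base'],
})

# AND -> & ; each comparison needs its own parentheses
ca_seaplanes = airports[(airports.iso_region == 'US-CA')
                        & (airports.type == 'seaplane_base')]

# OR -> |
ca_or_large = airports[(airports.iso_region == 'US-CA')
                       | (airports.type == 'large_airport')]
```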

As an example, the code below returns all videos that contain the word 'math' in their description.

SELECT  video_id
        ,title
FROM    df
WHERE   description LIKE '%math%';
df_notnull = df.loc[~pd.isnull(df['description']), :]

df_notnull.loc[df_notnull['description'].str.contains('math'), ['video_id', 'title']].drop_duplicates()

IS (NOT) NULL

SQL: SELECT * FROM frame WHERE col2 IS NULL
Pandas: frame[frame['col2'].isna()]

SQL: SELECT * FROM frame WHERE col2 IS NOT NULL
Pandas: frame[frame['col2'].notna()]
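A minimal runnable version of these two phrases, on a made-up frame:

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame({'col1': ['A', 'B', 'C'],
                      'col2': [1.0, np.nan, 3.0]})

nulls = frame[frame['col2'].isna()]        # WHERE col2 IS NULL
not_nulls = frame[frame['col2'].notna()]   # WHERE col2 IS NOT NULL
```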

ORDER BY

SQL: SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type
Pandas: airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type')

SQL: SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type DESC
Pandas: airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False)
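The same phrases on a small made-up frequencies frame:

```python
import pandas as pd

airport_freq = pd.DataFrame({'airport_ident': ['KLAX', 'KLAX', 'KLAX'],
                             'type': ['TWR', 'ATIS', 'GND']})

ascending = airport_freq.sort_values('type')                    # ORDER BY type
descending = airport_freq.sort_values('type', ascending=False)  # ORDER BY type DESC
```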

IN… NOT IN

SQL: SELECT * FROM airports WHERE type IN ('heliport', 'balloonport')
Pandas: airports[airports.type.isin(['heliport', 'balloonport'])]

SQL: SELECT * FROM airports WHERE type NOT IN ('heliport', 'balloonport')
Pandas: airports[~airports.type.isin(['heliport', 'balloonport'])]
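Note that NOT IN is just the ~ negation of the same isin() mask. A runnable sketch on made-up rows:

```python
import pandas as pd

airports = pd.DataFrame({'ident': ['H1', 'B1', 'KLAX'],
                         'type': ['heliport', 'balloonport', 'large_airport']})

in_set = airports[airports.type.isin(['heliport', 'balloonport'])]       # IN (...)
not_in_set = airports[~airports.type.isin(['heliport', 'balloonport'])]  # NOT IN (...)
```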

COUNT, GROUP BY, ORDER BY

SQL: SELECT iso_country, type, COUNT(*) FROM airports GROUP BY iso_country, type ORDER BY iso_country, type
Pandas: airports.groupby(['iso_country', 'type']).size()

SQL: SELECT iso_country, type, COUNT(*) FROM airports GROUP BY iso_country, type ORDER BY iso_country, COUNT(*) DESC
Pandas: airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])
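The second, longer chain is worth stepping through with a small made-up frame: size() produces a Series, to_frame() and reset_index() turn it back into a flat table, and sort_values() reproduces the ORDER BY:

```python
import pandas as pd

airports = pd.DataFrame({
    'iso_country': ['US', 'US', 'US', 'CA'],
    'type': ['heliport', 'heliport', 'small_airport', 'small_airport'],
})

counts = (airports.groupby(['iso_country', 'type'])
                  .size()             # COUNT(*) per group, as a Series
                  .to_frame('size')   # back to a DataFrame with a 'size' column
                  .reset_index()      # group keys become regular columns
                  .sort_values(['iso_country', 'size'], ascending=[True, False]))
```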

Notice that in the pandas code we used size() and not count(). This is because count() in pandas applies the function to each column, returning the number of non-null records within each.

In [19]: tips.groupby('sex').count()
Out[19]:
        total_bill  tip  smoker  day  time  size
sex
Female          87   87      87   87    87    87
Male           157  157     157  157   157   157

Alternatively, we could have applied the count() method to an individual column:

In [20]: tips.groupby('sex')['total_bill'].count()
Out[20]:
sex
Female     87
Male      157
Name: total_bill, dtype: int64

HAVING

SQL: SELECT type, COUNT(*) FROM airports WHERE iso_country = 'US' GROUP BY type HAVING COUNT(*) > 1000 ORDER BY COUNT(*) DESC
Pandas: airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)
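HAVING has no direct pandas keyword. As an alternative to the filter() chain above (sketched here on made-up data), you can compute the group sizes first and then filter the resulting Series, which avoids grouping twice:

```python
import pandas as pd

airports = pd.DataFrame({'type': ['heliport', 'heliport', 'heliport', 'small_airport']})

counts = airports.groupby('type').size()                    # GROUP BY type, COUNT(*)
frequent = counts[counts > 2].sort_values(ascending=False)  # HAVING COUNT(*) > 2
```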

Top N rows with LIMIT OFFSET

SQL: SELECT iso_country FROM by_country ORDER BY airport_count DESC LIMIT 10
Pandas: by_country.nlargest(10, columns='airport_count')

SQL: SELECT iso_country FROM by_country ORDER BY airport_count DESC LIMIT 10 OFFSET 10
Pandas: by_country.nlargest(20, columns='airport_count').tail(10)
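The OFFSET trick is: take the top (limit + offset) rows, then keep only the tail. On a small made-up by_country frame:

```python
import pandas as pd

# Made-up per-country airport counts
by_country = pd.DataFrame({
    'iso_country': ['US', 'BR', 'CA', 'AU', 'RU'],
    'airport_count': [21501, 3747, 2867, 1908, 920],
})

top3 = by_country.nlargest(3, columns='airport_count')  # LIMIT 3
# LIMIT 2 OFFSET 2: take the top 4, then keep only the last 2
rows_3_and_4 = by_country.nlargest(4, columns='airport_count').tail(2)
```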

Top N rows per group with ROW_NUMBER()

SELECT * FROM (
  SELECT t.*
        ,ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [36]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
   ....:                     .groupby(['day'])
   ....:                     .cumcount() + 1)
   ....:      .query('rn < 3')
   ....:      .sort_values(['day', 'rn']))
   ....:
Out[36]:
     total_bill    tip     sex smoker   day    time  size  rn
95        40.17   4.73    Male    Yes   Fri  Dinner     4   1
90        28.97   3.00    Male    Yes   Fri  Dinner     2   2
170       50.81  10.00    Male    Yes   Sat  Dinner     3   1
212       48.33   9.00    Male     No   Sat  Dinner     4   2
156       48.17   5.00    Male     No   Sun  Dinner     6   1
182       45.35   3.50    Male    Yes   Sun  Dinner     3   2
197       43.11   5.00  Female    Yes  Thur   Lunch     4   1
142       41.19   5.00    Male     No  Thur   Lunch     5   2

Aggregate functions (MIN, MAX, MEAN, SUM)

SQL: SELECT MIN(views) FROM table
Pandas: table.loc[:, ['views']].min()

SQL: SELECT MIN(length_ft), MAX(length_ft), AVG(length_ft), MEDIAN(length_ft) FROM runways
Pandas: runways.agg({'length_ft': ['min', 'max', 'mean', 'median']})

SQL: SELECT channel_title, SUM(views), SUM(likes), SUM(dislikes) FROM df GROUP BY channel_title
Pandas: df.loc[:, ['channel_title', 'views', 'likes', 'dislikes']].groupby(['channel_title']).sum()
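The agg() dictionary form returns one row per aggregate. A runnable sketch with a made-up runways frame:

```python
import pandas as pd

runways = pd.DataFrame({'length_ft': [8000, 12000, 6000, 10000]})

# One row per aggregate function, one column per aggregated column
stats = runways.agg({'length_ft': ['min', 'max', 'mean', 'median']})
```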

Multiple functions can also be applied at once.

SQL: SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day
Pandas: tips.groupby('day').agg({'tip': 'mean', 'day': 'size'})

Grouping by more than one column is done by passing a list of columns to the groupby() method.

SQL: SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUP BY smoker, day;
Pandas: tips.groupby(['smoker', 'day']).agg({'tip': ['size', 'mean']})
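With a list of functions for one column, the result gets a two-level column index. A runnable sketch on a tiny made-up tips frame:

```python
import pandas as pd

tips = pd.DataFrame({'smoker': ['No', 'No', 'Yes'],
                     'day': ['Sun', 'Sun', 'Sat'],
                     'tip': [1.0, 3.0, 2.0]})

# COUNT(*) and AVG(tip) for every (smoker, day) combination
out = tips.groupby(['smoker', 'day']).agg({'tip': ['size', 'mean']})
```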

JOIN

SQL: SELECT airport_ident, type, description, frequency_mhz FROM airport_freq JOIN airports ON airport_freq.airport_ref = airports.id WHERE airports.ident = 'KLAX'
Pandas: airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref', right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']]

SQL: SELECT * FROM df1 LEFT JOIN df2 ON df1.key = df2.key;
Pandas: pd.merge(df1, df2, on='key', how='left')

SQL: SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;
Pandas: pd.merge(df1, df2, on='key', how='outer')
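A self-contained sketch of the last two phrases, with two small made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'right_val': [20, 30, 40]})

left = pd.merge(df1, df2, on='key', how='left')    # LEFT JOIN: keeps all df1 rows
outer = pd.merge(df1, df2, on='key', how='outer')  # FULL OUTER JOIN: keeps all keys
```

Rows with no match on the other side get NaN in the missing columns, which is the pandas counterpart of SQL's NULLs in an outer join.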

UNION ALL & UNION

SQL: SELECT name, municipality FROM airports WHERE ident = 'KLAX' UNION ALL SELECT name, municipality FROM airports WHERE ident = 'KLGB'
Pandas: pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']], airports[airports.ident == 'KLGB'][['name', 'municipality']]])

To deduplicate things (equivalent of UNION), you’d also have to add .drop_duplicates():

pd.concat([df1, df2]).drop_duplicates()
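Side by side, on two tiny made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['LAX', 'LGB']})
df2 = pd.DataFrame({'name': ['LGB', 'SNA']})

union_all = pd.concat([df1, df2])                # UNION ALL: duplicates kept
union = pd.concat([df1, df2]).drop_duplicates()  # UNION: duplicates removed
```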

INSERT

SQL:
CREATE TABLE heroes (id integer, name text);
INSERT INTO heroes VALUES (1, 'Harry Potter');
INSERT INTO heroes VALUES (2, 'Ron Weasley');
INSERT INTO heroes VALUES (3, 'Hermione Granger');

Pandas:
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Harry Potter', 'Ron Weasley']})
df2 = pd.DataFrame({'id': [3], 'name': ['Hermione Granger']})
pd.concat([df1, df2]).reset_index(drop=True)

A single row can be inserted the same way. (DataFrame.append was removed in pandas 2.0, so use pd.concat instead.)

new_row = pd.DataFrame({'video_id': ['EkZGBdY0vlg'],
                        'channel_title': ['Professor Leonard'],
                        'title': ['Calculus 3 Lecture 13.3: Partial Derivatives']})
df = pd.concat([df, new_row], ignore_index=True)

UPDATE

SQL: UPDATE airports SET home_link = 'http://www.lawa.org/welcomelax.aspx' WHERE ident = 'KLAX'
Pandas: airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx'

ALTER

SQL: ALTER TABLE table ADD column
Pandas: df['like_ratio'] = df['likes'] / (df['likes'] + df['dislikes'])

DELETE

SQL: DELETE FROM lax_freq WHERE type = 'MISC'
Pandas: lax_freq = lax_freq[lax_freq.type != 'MISC']

Alternatively: lax_freq = lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index)

In pandas we can use the .drop() method to remove rows by their index labels. Unlike boolean-mask filtering, .drop() does not accept a boolean array, so we first turn the condition into index labels via .index. In the example below we delete all rows where channel_title != '3Blue1Brown'.

DELETE FROM df WHERE channel_title != '3Blue1Brown';
# only keep the opposite
df = df[df.channel_title == '3Blue1Brown']
# or drop the matching rows by their index labels:
df = df.drop(df[df['channel_title'] != '3Blue1Brown'].index)

Immutability

One important thing to mention is immutability. By default, most operations on a Pandas dataframe return a new object. Many methods accept a parameter inplace=True, so you can modify the original dataframe instead. For example, here is how you would reset an index in place:

df.reset_index(drop=True, inplace=True)
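A small sketch of the difference between the two styles (made-up data):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2]}, index=[10, 20])

out = df.reset_index(drop=True)          # returns a new frame; df is unchanged
assert df.index.tolist() == [10, 20]

df.reset_index(drop=True, inplace=True)  # modifies df itself and returns None
```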

However, the .loc assignment in the UPDATE example above locates the records to update and changes their values in place. The same is true if you update all values in a column:

df['url'] = 'http://google.com'

or added a new calculated column:

df['total_cost'] = df['price'] * df['quantity']

Both of these assignments modify the dataframe in place.

Pandas Export

df.to_csv(...)  # csv file
df.to_hdf(...)  # HDF5 file
df.to_pickle(...)  # serialized object
df.to_sql(...)  # to SQL database
df.to_excel(...)  # to Excel sheet
df.to_json(...)  # to JSON string
df.to_html(...)  # render as HTML table
df.to_feather(...)  # binary feather-format
df.to_latex(...)  # tabular environment table
df.to_stata(...)  # Stata binary data files
df.to_gbq(...)  # to a Google BigQuery table.
df.to_string(...)  # console-friendly tabular output.
df.to_clipboard(...) # clipboard that can be pasted into Excel
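For example, to_csv() returns the CSV text as a string when no path is given, which makes a quick round trip easy to check:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# to_csv with no path returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
round_trip = pd.read_csv(StringIO(csv_text))
```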
