From time to time, I work on tasks in both SQL and Python, and Pandas' syntax is quite different from SQL's. With SQL, you declare what you want in a statement that almost reads like English. In Pandas, you apply operations to the dataset and chain them together to transform and reshape the data into the form you want.
A phrasebook therefore comes in handy in my everyday work. Here I record equivalent expressions in SQL and Pandas.
Import Data in Pandas
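The examples below assume the datasets have already been loaded into dataframes. A minimal sketch, assuming the airports, airport_freq, and runways data are available as local CSV files (the file paths are placeholders):

```python
import pandas as pd

# Load each CSV file into a dataframe; the paths are hypothetical.
airports = pd.read_csv('data/airports.csv')
airport_freq = pd.read_csv('data/airport-frequencies.csv')
runways = pd.read_csv('data/runways.csv')
```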
Phrases
SELECT, WHERE, DISTINCT, LIMIT
| SQL | Pandas |
| --- | --- |
| SELECT * FROM airports | airports |
| SELECT * FROM airports LIMIT 3 | airports.head(3) |
| SELECT id FROM airports WHERE ident = 'KLAX' | airports[airports.ident == 'KLAX'].id |
| SELECT DISTINCT type FROM airports | airports.type.unique() |
SELECT with Multiple Conditions
| SQL | Pandas |
| --- | --- |
| SELECT * FROM airports WHERE iso_region = 'US-CA' AND type = 'seaplane_base' | airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')] |
| SELECT ident, name, municipality FROM airports WHERE iso_region = 'US-CA' OR type = 'large_airport' | airports[(airports.iso_region == 'US-CA') \| (airports.type == 'large_airport')][['ident', 'name', 'municipality']] |
| SELECT video_id, title FROM df WHERE likes >= 10000 | df.loc[df['likes'] >= 10000, ['video_id', 'title']] |
| SELECT video_id, title FROM df WHERE likes >= 10000 AND dislikes < 5000 | df.loc[(df['likes'] >= 10000) & (df['dislikes'] < 5000), ['video_id', 'title']] |
As an example, the code below returns all videos that contain the word 'math' in their description.
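A minimal sketch, assuming df has a string column named description:

```python
# Rows whose description mentions 'math'; na=False treats missing
# descriptions as non-matches instead of raising an error.
df[df['description'].str.contains('math', na=False)]
```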
IS (NOT) NULL
| SQL | Pandas |
| --- | --- |
| SELECT * FROM frame WHERE col2 IS NULL | frame[frame['col2'].isna()] |
| SELECT * FROM frame WHERE col2 IS NOT NULL | frame[frame['col2'].notna()] |
ORDER BY
| SQL | Pandas |
| --- | --- |
| SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type | airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type') |
| SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type DESC | airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False) |
IN… NOT IN
| SQL | Pandas |
| --- | --- |
| SELECT * FROM airports WHERE type IN ('heliport', 'balloonport') | airports[airports.type.isin(['heliport', 'balloonport'])] |
| SELECT * FROM airports WHERE type NOT IN ('heliport', 'balloonport') | airports[~airports.type.isin(['heliport', 'balloonport'])] |
COUNT, GROUP BY, ORDER BY
| SQL | Pandas |
| --- | --- |
| SELECT iso_country, type, COUNT(*) FROM airports GROUP BY iso_country, type ORDER BY iso_country, type | airports.groupby(['iso_country', 'type']).size() |
| SELECT iso_country, type, COUNT(*) FROM airports GROUP BY iso_country, type ORDER BY iso_country, COUNT(*) DESC | airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False]) |
Notice that in the Pandas code we used size() and not count(). This is because count() in Pandas is applied to each column separately, returning the number of non-null records in each.
Alternatively, we could have applied the count() method to an individual column:
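A minimal sketch, counting the ident column within each group (any non-null column would do):

```python
# count() on a single column returns its non-null count per group.
airports.groupby(['iso_country', 'type'])['ident'].count()
```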
HAVING
| SQL | Pandas |
| --- | --- |
| SELECT type, COUNT(*) FROM airports WHERE iso_country = 'US' GROUP BY type HAVING COUNT(*) > 1000 ORDER BY COUNT(*) DESC | airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False) |
Top N rows with LIMIT OFFSET
| SQL | Pandas |
| --- | --- |
| SELECT iso_country FROM by_country ORDER BY airport_count DESC LIMIT 10 | by_country.nlargest(10, columns='airport_count') |
| SELECT iso_country FROM by_country ORDER BY airport_count DESC LIMIT 10 OFFSET 10 | by_country.nlargest(20, columns='airport_count').tail(10) |
Top N rows per group with ROW_NUMBER()
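Pandas has no direct ROW_NUMBER(), but sorting within groups and keeping the first N rows gives the same result. A minimal sketch, assuming we want the two longest runways per airport (reusing the runways dataframe from the examples above; the airport_ident and length_ft column names are assumptions):

```python
# SQL equivalent (sketch):
#   SELECT * FROM (
#     SELECT *, ROW_NUMBER() OVER (
#       PARTITION BY airport_ident ORDER BY length_ft DESC) AS rn
#     FROM runways) t
#   WHERE rn <= 2
top_two = (runways
           .sort_values('length_ft', ascending=False)
           .groupby('airport_ident')
           .head(2))

# If you need the row number itself, cumcount() numbers rows within each group.
runways_sorted = runways.sort_values(['airport_ident', 'length_ft'],
                                     ascending=[True, False])
runways_sorted['rn'] = runways_sorted.groupby('airport_ident').cumcount() + 1
```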
Aggregate functions (MIN, MAX, MEAN, SUM)
| SQL | Pandas |
| --- | --- |
| SELECT MIN(views) FROM table | table.loc[:, ['views']].min() |
| SELECT MAX(length_ft), MIN(length_ft), AVG(length_ft), MEDIAN(length_ft) FROM runways | runways.agg({'length_ft': ['min', 'max', 'mean', 'median']}) |
| SELECT channel_title, SUM(views), SUM(likes), SUM(dislikes) FROM df GROUP BY channel_title | df.loc[:, ['channel_title', 'views', 'likes', 'dislikes']].groupby(['channel_title']).sum() |
Multiple functions can also be applied at once.
| SQL | Pandas |
| --- | --- |
| SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day | tips.groupby('day').agg({'tip': ['mean', 'size']}) |
Grouping by more than one column is done by passing a list of columns to the groupby() method.
| SQL | Pandas |
| --- | --- |
| SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUP BY smoker, day; | tips.groupby(['smoker', 'day']).agg({'tip': ['size', 'mean']}) |
JOIN
| SQL | Pandas |
| --- | --- |
| SELECT airport_ident, type, description, frequency_mhz FROM airport_freq JOIN airports ON airport_freq.airport_ref = airports.id WHERE airports.ident = 'KLAX' | airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref', right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']] |
| SELECT * FROM df1 LEFT JOIN df2 ON df1.key = df2.key; | pd.merge(df1, df2, on='key', how='left') |
| SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key; | pd.merge(df1, df2, on='key', how='outer') |
UNION ALL & UNION
| SQL | Pandas |
| --- | --- |
| SELECT name, municipality FROM airports WHERE ident = 'KLAX' UNION ALL SELECT name, municipality FROM airports WHERE ident = 'KLGB' | pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']], airports[airports.ident == 'KLGB'][['name', 'municipality']]]) |
To deduplicate rows (the equivalent of UNION), you'd also have to add .drop_duplicates():
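```python
# UNION rather than UNION ALL: drop duplicate rows after concatenating.
pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']],
           airports[airports.ident == 'KLGB'][['name', 'municipality']]]).drop_duplicates()
```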
INSERT
| SQL | Pandas |
| --- | --- |
| CREATE TABLE heroes (id INTEGER, name TEXT); | df1 = pd.DataFrame({'id': [1, 2], 'name': ['Harry Potter', 'Ron Weasley']}) |
| INSERT INTO heroes VALUES (1, 'Harry Potter'); | df2 = pd.DataFrame({'id': [3], 'name': ['Hermione Granger']}) |
| INSERT INTO heroes VALUES (2, 'Ron Weasley'); |  |
| INSERT INTO heroes VALUES (3, 'Hermione Granger'); | pd.concat([df1, df2]).reset_index(drop=True) |
UPDATE
| SQL | Pandas |
| --- | --- |
| UPDATE airports SET home_link = 'http://www.lawa.org/welcomelax.aspx' WHERE ident = 'KLAX' | airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx' |
ALTER
| SQL | Pandas |
| --- | --- |
| ALTER TABLE df ADD COLUMN like_ratio FLOAT; UPDATE df SET like_ratio = likes / (likes + dislikes); | df['like_ratio'] = df['likes'] / (df['likes'] + df['dislikes']) |
DELETE
| SQL | Pandas |
| --- | --- |
| DELETE FROM lax_freq WHERE type = 'MISC' | lax_freq = lax_freq[lax_freq.type != 'MISC'] |
|  | lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index) |
In Pandas we can also use the .drop() method to remove rows by the index labels we pass in. Unlike boolean filtering, .drop() doesn't accept a boolean array as input, so we first have to turn our condition into indices, either by selecting the matching rows and passing their .index (as in the table above), or with NumPy's np.where() function (which returns positional indices, so it matches the labels only when the dataframe has a default integer index).
In the example below, we delete all rows where channel_title != '3Blue1Brown'.
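A minimal sketch, assuming df has a default integer index:

```python
import numpy as np

# Positions of the rows to delete; with a default RangeIndex these
# positions coincide with the index labels that .drop() expects.
rows_to_drop = np.where(df['channel_title'] != '3Blue1Brown')[0]
df = df.drop(rows_to_drop)
```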
Immutability
I need to mention one important thing: immutability. By default, most operations applied to a Pandas dataframe return a new object. Some of them accept an inplace=True parameter, so you can modify the original dataframe instead. For example, here is how you would reset an index in place:
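```python
# Modifies df in place instead of returning a new dataframe.
df.reset_index(drop=True, inplace=True)
```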
However, the .loc operator in the UPDATE example above simply locates the records to update, and their values are changed in place. The same is true if you update all values in a column:
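```python
# Assigning to an existing column replaces its values in the original dataframe.
df['views'] = 0
```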
or add a new calculated column:
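```python
# Adding a calculated column also mutates the original dataframe,
# just like the ALTER example above.
df['like_ratio'] = df['likes'] / (df['likes'] + df['dislikes'])
```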
Pandas Export
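A minimal sketch of writing a dataframe back out, assuming a local output path (the file name is a placeholder):

```python
# Write the dataframe to CSV; index=False omits the index column.
df.to_csv('output.csv', index=False)
```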