Data Cleaning: Filter Records Base on Conditions
Simon Huang
@simon
Take the following as an example to filter a data frame based on certain conditions:
There are 2 users and some of their corresponding bill dates and bill amounts, we want to only keep ==one record for each user== with their ==billing’s starting date== and the ==average bill amount==.
Dataframe
import pandas as pd
raw_data = {'User':[1,1,1,2,2,2,1],
'Time':['2014-01-01','2014-02-01','2014-03-01','2018-02-01','2018-01-01','2018-03-01','2014-04-01'],
'Amount':[32,13,48,98,64,23,27]}
data = pd.DataFrame(raw_data)
data | User | Time | Amount | |
|---|---|---|---|
| 0 | 1 | 2014-01-01 | 32 |
| 1 | 1 | 2014-02-01 | 13 |
| 2 | 1 | 2014-03-01 | 48 |
| 3 | 2 | 2018-02-01 | 98 |
| 4 | 2 | 2018-01-01 | 64 |
| 5 | 2 | 2018-03-01 | 23 |
| 6 | 1 | 2014-04-01 | 27 |
Solution 1
Package: Pandas Function: dataframe.groupby() Docs: pandas.DataFrame.groupby
# generate new columns for grouped results
data_amount = round(data.groupby(['User'], as_index=False)['Amount'].mean(),2)
data_time = data.groupby('User', as_index=False)['Time'].min() # merge columns to main dataframe
data_merge1 = pd.merge(data,data_time,on='User',how='left')
data_merge2 = pd.merge(data_merge1,data_amount,on='User',how='left')
data_merge2 | User | Time_x | Amount_x | Time_y | Amount_y | |
|---|---|---|---|---|---|
| 0 | 1 | 2014-01-01 | 32 | 2014-01-01 | 30.00 |
| 1 | 1 | 2014-02-01 | 13 | 2014-01-01 | 30.00 |
| 2 | 1 | 2014-03-01 | 48 | 2014-01-01 | 30.00 |
| 3 | 2 | 2018-02-01 | 98 | 2018-01-01 | 61.67 |
| 4 | 2 | 2018-01-01 | 64 | 2018-01-01 | 61.67 |
| 5 | 2 | 2018-03-01 | 23 | 2018-01-01 | 61.67 |
| 6 | 1 | 2014-04-01 | 27 | 2014-01-01 | 30.00 |
# drop old columns and rename new columns
data_new = data_merge2.drop(columns=['Time_x','Amount_x'])
data_new = data_new.drop_duplicates(['User'])
data_new = data_new.rename(columns={'Amount_y': 'Amount_mean','Time_y':'Date'})
data_new | User | Date | Amount_mean | |
|---|---|---|---|
| 0 | 1 | 2014-01-01 | 30.00 |
| 3 | 2 | 2018-01-01 | 61.67 |
Alternative
Alternatively, we could also drop_duplicates first then merge new columns:
data2 = data.drop_duplicates(['User'])
data2_merge1 = pd.merge(data2,data_time,on='User',how='left')
data2_merge2 = pd.merge(data2_merge1,data_amount,on='User',how='left')
data2_merge2 | User | Time_x | Amount_x | Time_y | Amount_y | |
|---|---|---|---|---|---|
| 0 | 1 | 2014-01-01 | 32 | 2014-01-01 | 30.00 |
| 1 | 2 | 2018-02-01 | 98 | 2018-01-01 | 61.67 |
data2_new = data2_merge2.drop(columns=['Time_x','Amount_x'])
data2_new = data2_new.drop_duplicates(['User'])
data2_new = data2_new.rename(columns={'Amount_y': 'Amount_mean','Time_y':'Date'})
data2_new | User | Date | Amount_mean | |
|---|---|---|---|
| 0 | 1 | 2014-01-01 | 30.00 |
| 1 | 2 | 2018-01-01 | 61.67 |
Solution 2
Here we sort the value by User and Time ascendingly, then drop_cuplicates. We get a dataframe ordered by User then Time.
data3 = data.sort_values(by=['User','Time'],ascending=True).drop_duplicates(['User'])
data3 | User | Time | Amount | |
|---|---|---|---|
| 0 | 1 | 2014-01-01 | 32 |
| 4 | 2 | 2018-01-01 | 64 |
Now we only need to merge the Amount_mean to the dataframe:
data3 = pd.merge(data3,data_amount,on='User',how='left')
data3 | User | Time | Amount_x | Amount_y | |
|---|---|---|---|---|
| 0 | 1 | 2014-01-01 | 32 | 30.00 |
| 1 | 2 | 2018-01-01 | 64 | 61.67 |
Lastly, drop old column and rename new column:
data3_new = data3.drop(columns=['Amount_x'])
data3_new = data3_new.rename(columns={'Amount_y': 'Amount_mean','Time':'Date'})
data3_new | User | Date | Amount_mean | |
|---|---|---|---|
| 0 | 1 | 2014-01-01 | 30.00 |
| 1 | 2 | 2018-01-01 | 61.67 |