Data Cleaning: Filter Records Base on Conditions

Simon Huang

Simon Huang

@simon

Take the following as an example to filter a data frame based on certain conditions:

There are 2 users and some of their corresponding bill dates and bill amounts, we want to only keep ==one record for each user== with their ==billing’s starting date== and the ==average bill amount==.

Dataframe

import pandas as pd

raw_data = {'User':[1,1,1,2,2,2,1],
        'Time':['2014-01-01','2014-02-01','2014-03-01','2018-02-01','2018-01-01','2018-03-01','2014-04-01'],
        'Amount':[32,13,48,98,64,23,27]}
data = pd.DataFrame(raw_data)
data
UserTimeAmount
012014-01-0132
112014-02-0113
212014-03-0148
322018-02-0198
422018-01-0164
522018-03-0123
612014-04-0127

Solution 1

Package: Pandas Function: dataframe.groupby() Docs: pandas.DataFrame.groupby

# generate new columns for grouped results
data_amount = round(data.groupby(['User'], as_index=False)['Amount'].mean(),2)
data_time = data.groupby('User', as_index=False)['Time'].min()
# merge columns to main dataframe
data_merge1 = pd.merge(data,data_time,on='User',how='left')
data_merge2 = pd.merge(data_merge1,data_amount,on='User',how='left')
data_merge2
UserTime_xAmount_xTime_yAmount_y
012014-01-01322014-01-0130.00
112014-02-01132014-01-0130.00
212014-03-01482014-01-0130.00
322018-02-01982018-01-0161.67
422018-01-01642018-01-0161.67
522018-03-01232018-01-0161.67
612014-04-01272014-01-0130.00
# drop old columns and rename new columns
data_new = data_merge2.drop(columns=['Time_x','Amount_x'])
data_new = data_new.drop_duplicates(['User'])
data_new = data_new.rename(columns={'Amount_y': 'Amount_mean','Time_y':'Date'})
data_new
UserDateAmount_mean
012014-01-0130.00
322018-01-0161.67

Alternative

Alternatively, we could also drop_duplicates first then merge new columns:

data2 = data.drop_duplicates(['User'])
data2_merge1 = pd.merge(data2,data_time,on='User',how='left')
data2_merge2 = pd.merge(data2_merge1,data_amount,on='User',how='left')
data2_merge2
UserTime_xAmount_xTime_yAmount_y
012014-01-01322014-01-0130.00
122018-02-01982018-01-0161.67
data2_new = data2_merge2.drop(columns=['Time_x','Amount_x'])
data2_new = data2_new.drop_duplicates(['User'])
data2_new = data2_new.rename(columns={'Amount_y': 'Amount_mean','Time_y':'Date'})
data2_new
UserDateAmount_mean
012014-01-0130.00
122018-01-0161.67

Solution 2

Here we sort the value by User and Time ascendingly, then drop_cuplicates. We get a dataframe ordered by User then Time.

data3 = data.sort_values(by=['User','Time'],ascending=True).drop_duplicates(['User'])
data3
UserTimeAmount
012014-01-0132
422018-01-0164

Now we only need to merge the Amount_mean to the dataframe:

data3 = pd.merge(data3,data_amount,on='User',how='left')
data3
UserTimeAmount_xAmount_y
012014-01-013230.00
122018-01-016461.67

Lastly, drop old column and rename new column:

data3_new = data3.drop(columns=['Amount_x'])
data3_new = data3_new.rename(columns={'Amount_y': 'Amount_mean','Time':'Date'})
data3_new
UserDateAmount_mean
012014-01-0130.00
122018-01-0161.67

Source Code

Fizzy

© 2025 Fizzy All Rights Reserved.