Data Cleaning: Filter Records Base on Conditions

Take the following as an example to filter a data frame based on certain conditions:

There are 2 users and some of their corresponding bill dates and bill amounts, we want to only keep one record for each user with their billing's starting date and the average bill amount.

Dataframe

import pandas as pd

raw_data = {'User':[1,1,1,2,2,2,1], 
        'Time':['2014-01-01','2014-02-01','2014-03-01','2018-02-01','2018-01-01','2018-03-01','2014-04-01'],
        'Amount':[32,13,48,98,64,23,27]}
data = pd.DataFrame(raw_data)
data

	User	Time	Amount
0	1	2014-01-01	32
1	1	2014-02-01	13
2	1	2014-03-01	48
3	2	2018-02-01	98
4	2	2018-01-01	64
5	2	2018-03-01	23
6	1	2014-04-01	27

Solution 1

Package: Pandas
Function: dataframe.groupby()
Docs: pandas.DataFrame.groupby

# generate new columns for grouped results
data_amount = round(data.groupby(['User'], as_index=False)['Amount'].mean(),2)
data_time = data.groupby('User', as_index=False)['Time'].min()

# merge columns to main dataframe
data_merge1 = pd.merge(data,data_time,on='User',how='left')
data_merge2 = pd.merge(data_merge1,data_amount,on='User',how='left')
data_merge2

	User	Time_x	Amount_x	Time_y	Amount_y
0	1	2014-01-01	32	2014-01-01	30.00
1	1	2014-02-01	13	2014-01-01	30.00
2	1	2014-03-01	48	2014-01-01	30.00
3	2	2018-02-01	98	2018-01-01	61.67
4	2	2018-01-01	64	2018-01-01	61.67
5	2	2018-03-01	23	2018-01-01	61.67
6	1	2014-04-01	27	2014-01-01	30.00

# drop old columns and rename new columns
data_new = data_merge2.drop(columns=['Time_x','Amount_x'])
data_new = data_new.drop_duplicates(['User'])
data_new = data_new.rename(columns={'Amount_y': 'Amount_mean','Time_y':'Date'})
data_new

	User	Date	Amount_mean
0	1	2014-01-01	30.00
3	2	2018-01-01	61.67

Alternative

Alternatively, we could also drop_duplicates first then merge new columns:

data2 = data.drop_duplicates(['User'])
data2_merge1 = pd.merge(data2,data_time,on='User',how='left')
data2_merge2 = pd.merge(data2_merge1,data_amount,on='User',how='left')
data2_merge2

User	Time_x	Amount_x	Time_y	Amount_y
0	1	2014-01-01	32	2014-01-01
1	2	2018-02-01	98	2018-01-01

data2_new = data2_merge2.drop(columns=['Time_x','Amount_x'])
data2_new = data2_new.drop_duplicates(['User'])
data2_new = data2_new.rename(columns={'Amount_y': 'Amount_mean','Time_y':'Date'})
data2_new

	User	Date	Amount_mean
0	1	2014-01-01	30.00
1	2	2018-01-01	61.67

Solution 2

Here we sort the value by User and Time ascendingly, then drop_cuplicates. We get a dataframe ordered by User then Time.

data3 = data.sort_values(by=['User','Time'],ascending=True).drop_duplicates(['User'])
data3

	User	Time	Amount
0	1	2014-01-01	32
4	2	2018-01-01	64

Now we only need to merge the Amount_mean to the dataframe:

data3 = pd.merge(data3,data_amount,on='User',how='left')
data3

	User	Time	Amount_x	Amount_y
0	1	2014-01-01	32	30.00
1	2	2018-01-01	64	61.67

Lastly, drop old column and rename new column:

data3_new = data3.drop(columns=['Amount_x'])
data3_new = data3_new.rename(columns={'Amount_y': 'Amount_mean','Time':'Date'})
data3_new

	User	Date	Amount_mean
0	1	2014-01-01	30.00
1	2	2018-01-01	61.67

Source Code

Simon

Hangzhou, China

I am a human learner. Yes, you can call me Simon.

Dataframe

Solution 1

Alternative

Solution 2