Data Cleaning: Filter Records Base on Conditions

Take the following as an example to filter a data frame based on certain conditions:

There are 2 users and some of their corresponding bill dates and bill amounts, we want to only keep ==one record for each user== with their ==billing’s starting date== and the ==average bill amount==.

Dataframe

import pandas as pd

raw_data = {'User':[1,1,1,2,2,2,1],
        'Time':['2014-01-01','2014-02-01','2014-03-01','2018-02-01','2018-01-01','2018-03-01','2014-04-01'],
        'Amount':[32,13,48,98,64,23,27]}
data = pd.DataFrame(raw_data)
data

	User	Time	Amount
0	1	2014-01-01	32
1	1	2014-02-01	13
2	1	2014-03-01	48
3	2	2018-02-01	98
4	2	2018-01-01	64
5	2	2018-03-01	23
6	1	2014-04-01	27

Solution 1

Package: Pandas Function: dataframe.groupby() Docs: pandas.DataFrame.groupby

# generate new columns for grouped results
data_amount = round(data.groupby(['User'], as_index=False)['Amount'].mean(),2)
data_time = data.groupby('User', as_index=False)['Time'].min()

# merge columns to main dataframe
data_merge1 = pd.merge(data,data_time,on='User',how='left')
data_merge2 = pd.merge(data_merge1,data_amount,on='User',how='left')
data_merge2

	User	Time_x	Amount_x	Time_y	Amount_y
0	1	2014-01-01	32	2014-01-01	30.00
1	1	2014-02-01	13	2014-01-01	30.00
2	1	2014-03-01	48	2014-01-01	30.00
3	2	2018-02-01	98	2018-01-01	61.67
4	2	2018-01-01	64	2018-01-01	61.67
5	2	2018-03-01	23	2018-01-01	61.67
6	1	2014-04-01	27	2014-01-01	30.00

# drop old columns and rename new columns
data_new = data_merge2.drop(columns=['Time_x','Amount_x'])
data_new = data_new.drop_duplicates(['User'])
data_new = data_new.rename(columns={'Amount_y': 'Amount_mean','Time_y':'Date'})
data_new

	User	Date	Amount_mean
0	1	2014-01-01	30.00
3	2	2018-01-01	61.67

Alternative

Alternatively, we could also drop_duplicates first then merge new columns:

data2 = data.drop_duplicates(['User'])
data2_merge1 = pd.merge(data2,data_time,on='User',how='left')
data2_merge2 = pd.merge(data2_merge1,data_amount,on='User',how='left')
data2_merge2

User	Time_x	Amount_x	Time_y	Amount_y
0	1	2014-01-01	32	2014-01-01	30.00
1	2	2018-02-01	98	2018-01-01	61.67

data2_new = data2_merge2.drop(columns=['Time_x','Amount_x'])
data2_new = data2_new.drop_duplicates(['User'])
data2_new = data2_new.rename(columns={'Amount_y': 'Amount_mean','Time_y':'Date'})
data2_new

	User	Date	Amount_mean
0	1	2014-01-01	30.00
1	2	2018-01-01	61.67

Solution 2

Here we sort the value by User and Time ascendingly, then drop_cuplicates. We get a dataframe ordered by User then Time.

data3 = data.sort_values(by=['User','Time'],ascending=True).drop_duplicates(['User'])
data3

	User	Time	Amount
0	1	2014-01-01	32
4	2	2018-01-01	64

Now we only need to merge the Amount_mean to the dataframe:

data3 = pd.merge(data3,data_amount,on='User',how='left')
data3

	User	Time	Amount_x	Amount_y
0	1	2014-01-01	32	30.00
1	2	2018-01-01	64	61.67

Lastly, drop old column and rename new column:

data3_new = data3.drop(columns=['Amount_x'])
data3_new = data3_new.rename(columns={'Amount_y': 'Amount_mean','Time':'Date'})
data3_new

	User	Date	Amount_mean
0	1	2014-01-01	30.00
1	2	2018-01-01	61.67

Source Code

Dataframe

Solution 1

Alternative

Solution 2

✦ Fizzy

Resources

About