The following example shows how to filter a data frame based on certain conditions.
There are two users, each with several bill dates and bill amounts. We want to keep only one record per user, holding the date of their first bill and their average bill amount.
Dataframe
|   | User | Time | Amount |
|---|---|---|---|
| 0 | 1 | 2014-01-01 | 32 |
| 1 | 1 | 2014-02-01 | 13 |
| 2 | 1 | 2014-03-01 | 48 |
| 3 | 2 | 2018-02-01 | 98 |
| 4 | 2 | 2018-01-01 | 64 |
| 5 | 2 | 2018-03-01 | 23 |
| 6 | 1 | 2014-04-01 | 27 |
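To follow along, the table above can be rebuilt as a pandas DataFrame. This is a minimal sketch: the variable name `df` and the choice to parse `Time` as datetimes are assumptions, not something the original states.

```python
import pandas as pd

# Rebuild the example table. Time is parsed to datetime (an assumption)
# so that min() and sorting work chronologically.
df = pd.DataFrame(
    {
        "User": [1, 1, 1, 2, 2, 2, 1],
        "Time": pd.to_datetime(
            [
                "2014-01-01", "2014-02-01", "2014-03-01",
                "2018-02-01", "2018-01-01", "2018-03-01",
                "2014-04-01",
            ]
        ),
        "Amount": [32, 13, 48, 98, 64, 23, 27],
    }
)

print(df)
```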
Solution 1
Package: Pandas
Function: DataFrame.groupby()
Docs: pandas.DataFrame.groupby
First compute each user's earliest Time and mean Amount with groupby, then merge the result back onto the original dataframe on User:
|   | User | Time_x | Amount_x | Time_y | Amount_y |
|---|---|---|---|---|---|
| 0 | 1 | 2014-01-01 | 32 | 2014-01-01 | 30.00 |
| 1 | 1 | 2014-02-01 | 13 | 2014-01-01 | 30.00 |
| 2 | 1 | 2014-03-01 | 48 | 2014-01-01 | 30.00 |
| 3 | 2 | 2018-02-01 | 98 | 2018-01-01 | 61.67 |
| 4 | 2 | 2018-01-01 | 64 | 2018-01-01 | 61.67 |
| 5 | 2 | 2018-03-01 | 23 | 2018-01-01 | 61.67 |
| 6 | 1 | 2014-04-01 | 27 | 2014-01-01 | 30.00 |
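Then drop the duplicate users, drop the old columns and rename the new ones:
|   | User | Date | Amount_mean |
|---|---|---|---|
| 0 | 1 | 2014-01-01 | 30.00 |
| 3 | 2 | 2018-01-01 | 61.67 |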
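The two tables above can be reproduced with a sketch along these lines. The original code is not shown, so the variable names (`agg`, `merged`, `result`) and the rounding to two decimals are assumptions. The `_x` / `_y` suffixes are what `merge` adds by default to overlapping column names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "User": [1, 1, 1, 2, 2, 2, 1],
        "Time": pd.to_datetime(
            ["2014-01-01", "2014-02-01", "2014-03-01", "2018-02-01",
             "2018-01-01", "2018-03-01", "2014-04-01"]
        ),
        "Amount": [32, 13, 48, 98, 64, 23, 27],
    }
)

# Per-user aggregate: earliest Time and mean Amount.
agg = df.groupby("User", as_index=False).agg({"Time": "min", "Amount": "mean"})
agg["Amount"] = agg["Amount"].round(2)  # match the two-decimal display

# Merge back on User; Time and Amount exist on both sides,
# so they come out as Time_x/Amount_x (original) and Time_y/Amount_y (aggregate).
merged = df.merge(agg, on="User")

# Keep the first row per user, drop the original columns, rename the new ones.
result = (
    merged.drop_duplicates(subset="User")
    .drop(columns=["Time_x", "Amount_x"])
    .rename(columns={"Time_y": "Date", "Amount_y": "Amount_mean"})
)

print(merged)
print(result)
```

Because every row of a user carries the same `Time_y` and `Amount_y` after the merge, it does not matter which duplicate `drop_duplicates` keeps.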
Alternative
Alternatively, we could call drop_duplicates first and then merge in the new columns:
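|   | User | Time_x | Amount_x | Time_y | Amount_y |
|---|---|---|---|---|---|
| 0 | 1 | 2014-01-01 | 32 | 2014-01-01 | 30.00 |
| 1 | 2 | 2018-02-01 | 98 | 2018-01-01 | 61.67 |

Dropping the old columns and renaming the new ones gives the same result:
|   | User | Date | Amount_mean |
|---|---|---|---|
| 0 | 1 | 2014-01-01 | 30.00 |
| 1 | 2 | 2018-01-01 | 61.67 |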
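A sketch of this alternative order of operations is below. As before, the variable names and the rounding are assumptions for illustration. Here `drop_duplicates` runs on the original frame, so the merge only has to process one row per user.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "User": [1, 1, 1, 2, 2, 2, 1],
        "Time": pd.to_datetime(
            ["2014-01-01", "2014-02-01", "2014-03-01", "2018-02-01",
             "2018-01-01", "2018-03-01", "2014-04-01"]
        ),
        "Amount": [32, 13, 48, 98, 64, 23, 27],
    }
)

# One row per user: the first occurrence in the original order.
first_rows = df.drop_duplicates(subset="User")

# Per-user aggregate: earliest Time and mean Amount.
agg = df.groupby("User", as_index=False).agg({"Time": "min", "Amount": "mean"})
agg["Amount"] = agg["Amount"].round(2)  # match the two-decimal display

# Merging resets the index, which is why the rows are now labelled 0 and 1.
merged = first_rows.merge(agg, on="User")

result = merged.drop(columns=["Time_x", "Amount_x"]).rename(
    columns={"Time_y": "Date", "Amount_y": "Amount_mean"}
)

print(merged)
print(result)
```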
Solution 2
Here we sort the values by User and Time in ascending order, then call drop_duplicates, which keeps the first (earliest) row for each user. The result is a dataframe ordered by User, then Time.
|   | User | Time | Amount |
|---|---|---|---|
| 0 | 1 | 2014-01-01 | 32 |
| 4 | 2 | 2018-01-01 | 64 |
Now we only need to merge Amount_mean into the dataframe:
|   | User | Time | Amount_x | Amount_y |
|---|---|---|---|---|
| 0 | 1 | 2014-01-01 | 32 | 30.00 |
| 1 | 2 | 2018-01-01 | 64 | 61.67 |
Lastly, drop the old Amount_x column and rename the remaining columns:
|   | User | Date | Amount_mean |
|---|---|---|---|
| 0 | 1 | 2014-01-01 | 30.00 |
| 1 | 2 | 2018-01-01 | 61.67 |
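The three steps of Solution 2 could look roughly like this. It is a sketch under the same assumptions as before: the names `first`, `mean`, `merged` and `result` are illustrative, and the rounding is only there to match the displayed values.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "User": [1, 1, 1, 2, 2, 2, 1],
        "Time": pd.to_datetime(
            ["2014-01-01", "2014-02-01", "2014-03-01", "2018-02-01",
             "2018-01-01", "2018-03-01", "2014-04-01"]
        ),
        "Amount": [32, 13, 48, 98, 64, 23, 27],
    }
)

# Step 1: sort by User, then Time, and keep the earliest row per user.
first = df.sort_values(["User", "Time"]).drop_duplicates(subset="User")

# Step 2: per-user mean Amount, merged in on User.
# Amount exists on both sides, so it becomes Amount_x (old) and Amount_y (mean).
mean = df.groupby("User")["Amount"].mean().round(2).reset_index()
merged = first.merge(mean, on="User")

# Step 3: drop the old amount and rename the remaining columns.
result = merged.drop(columns="Amount_x").rename(
    columns={"Time": "Date", "Amount_y": "Amount_mean"}
)

print(first)
print(merged)
print(result)
```

Compared with Solution 1, the starting date here comes from the sort rather than from a `min` aggregate, so only one column needs to be merged in.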
Source Code