Pandas GroupBy and Top by Group

So, often working on a data project you will have to report the top items in a group, based on some value. This could be states (group), cities (item), and poverty rate (value). This could be applications (group), metrics (item), and count of occurrence (value). The point is we want to find items in a group that are interesting based on a value.

The setup

I will use a random dataframe with 3 columns: category, item, and value. Some of the problem domains fit the random data frame, and for some you will have to imagine that the items are completely different names. So imagine that foo_baz is a stand-in for any value.

Random Dataframe Creation

df

Random Dataframe Display

A simple Top N by group

Suppose you want to show someone the Top 1 or Top 10 items by group. Here is what we do: sort the values in a descending order, groupby, and then take the head(N).

Top N by group

Top N by group output

Top as a percentage of the group's value

Suppose the dataset is count (value) of votes for a mayoral candidate (item) by city (group). We might be interested in the count of votes if it is above a certain percentage of the total votes cast in that city. We first have to calculate the total votes cast in the city.

Groupby and Sum in Pandas

df

Groupby and Sum in Pandas output

Then we use this grouped-by sum to compute the pct of the item's value in each group. So that is which candidates got over 2.5% of the vote in their city.

We use apply on the dataframe and use axis=1 to apply over the columns, for each row.

Above 2.5% of the group's total output

df

Above 2.5% of the group's total output

Top Ntile for the group

Suppose the dataset is counts (value) of people living in poverty by county (item) and by state (group). We might want to see the # of counties that are in the top 10% of counties in that state, for count of people living in poverty.

Top 10% ntile by group

The important part is the lambda with the quantile function and the apply with > and choosing the right ntile for each group.

df

Top 10% ntile by group

Conclusion

We just went through three ways to give the top items in a group based on a value. This applies to many situations and is generalized in three techniques here: Top N, Top above percentage, and top above an ntile.