When should you use group by in general? I would say group by is a good idea any time you want to analyse a pandas Series or DataFrame by some category. Group by in Python Pandas essentially splits the data into different groups depending on a variable/category of your choice. For example, the expression data.groupby('year') will split our current DataFrame by year.
The groupby() function returns a GroupBy object, which essentially describes how the rows of the original dataset have been split. The GroupBy object's groups attribute is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group. Let's have a closer look.
If you simply run df.groupby('column_for_grouping'), you will get a Python object that looks similar to <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd69e143208>. You may want to know what a DataFrameGroupBy object looks like internally, so let's print the groups split by continent within our DataFrameGroupBy object by iterating through them.
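As a minimal sketch of that iteration, assuming a small made-up DataFrame (the column names continent, country and score, and all values, are invented for illustration and are not the original dataset):

```python
import pandas as pd

# Hypothetical sample data: column names and values are invented for illustration.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia", "Africa"],
    "country": ["France", "Spain", "Japan", "India", "Kenya"],
    "score": [10, 20, 30, 40, 50],
})

g = df.groupby("continent")

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
for continent, frame in g:
    print(continent)
    print(frame)

# The groups attribute maps each key to the row labels of that group.
group_labels = {key: list(labels) for key, labels in g.groups.items()}
print(group_labels)
```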
Now you can see that each key is a continent and each value is the DataFrame corresponding to that continent.
You can also access a specific DataFrame from a DataFrameGroupBy object using get_group().
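A short sketch of pulling out a single group with get_group(), again on made-up data (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical sample data: column names and values are invented for illustration.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Europe", "Asia", "Africa"],
    "country": ["France", "Spain", "Italy", "Japan", "Kenya"],
    "score": [10, 20, 30, 40, 50],
})

g = df.groupby("continent")

# get_group returns the sub-DataFrame for one key; head() keeps the first rows only.
europe = g.get_group("Europe").head()
print(europe)
```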
In the example above we got the DataFrame that corresponds to Europe by running g.get_group('Europe'). Note that I used .head() in order to shorten the list of countries.
What Group By can do
“Group by” is used in a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
Out of these, the most straightforward step is the split. In fact, in many situations, we may wish to split the data set into groups and do something with those groups. In the apply step, the following actions are the most common:
Aggregation: compute a summary statistic for each group.
- Compute group sums or means.
- Compute group sizes / counts.
Transformation: perform some group-specific computations and return a like-indexed object.
- Standardize data within a group.
- Fill NAs within groups with a value derived from each group.
Filtration: discard some groups, according to a group-wise computation that evaluates True or False.
- Discard data that belongs to groups with only a few members.
- Filter out data based on the group sum or mean.
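The three kinds of apply step can be sketched roughly as follows. The data and the column names team and score are hypothetical, and the transformation shown is simple mean-centering rather than full standardization:

```python
import pandas as pd

# Hypothetical sample data: column names and values are invented for illustration.
df = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b", "c"],
    "score": [1.0, 2.0, 3.0, 10.0, 20.0, 5.0],
})
g = df.groupby("team")

# Aggregation: one summary value per group.
sums = g["score"].sum()
sizes = g.size()

# Transformation: the result is indexed like the original rows.
# Here each score is centered on its own group's mean.
centered = g["score"].transform(lambda s: s - s.mean())

# Filtration: keep only rows that belong to groups with at least two members.
big_groups = g.filter(lambda frame: len(frame) >= 2)

print(sums, sizes, centered, big_groups, sep="\n\n")
```

Note that aggregation returns one row per group, while transformation and filtration return rows that line up with the original DataFrame.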
GroupBy using multiple columns
On a DataFrame, we obtain a GroupBy object by calling groupby(). We can group by either one column of the DataFrame or multiple columns by passing a list of column names.
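A minimal sketch of grouping by two columns, assuming hypothetical columns named job_title, company and salary with made-up values:

```python
import pandas as pd

# Hypothetical sample data: column names and values are invented for illustration.
df = pd.DataFrame({
    "job_title": ["Engineer", "Engineer", "Analyst", "Analyst", "Engineer"],
    "company": ["Acme", "Acme", "Acme", "Globex", "Globex"],
    "salary": [100, 120, 80, 90, 110],
})

# Passing a list of column names groups by every distinct combination of values.
group = df.groupby(["job_title", "company"])

# Each group key is now a (job_title, company) tuple.
keys = set(group.groups.keys())
print(keys)

# Aggregating gives a result indexed by both grouping columns.
mean_salary = group["salary"].mean()
print(mean_salary)
```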
Now we have split the data into groups by job title and company and saved the result as a GroupBy object called "group".
Similarity to SQL
If you are familiar with SQL, GroupBy in Pandas will be no stranger to you. After all, df.groupby('city') uses the same logic as the GROUP BY clause in a query such as SELECT city, COUNT(*) FROM city_data GROUP BY city; where city_data is the name of the table being grouped by the column city.
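To make the comparison concrete, here is a rough sketch that runs a GROUP BY query through Python's built-in sqlite3 module and the corresponding pandas call on the same rows. The temperature column and all values are assumptions made up for the example:

```python
import sqlite3

import pandas as pd

# Hypothetical rows: the temperature column and its values are invented for illustration.
rows = [("Oslo", 5.0), ("Oslo", 7.0), ("Lima", 20.0), ("Lima", 22.0), ("Lima", 24.0)]

# SQL side: load the rows into an in-memory SQLite table and average per city.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_data (city TEXT, temperature REAL)")
conn.executemany("INSERT INTO city_data VALUES (?, ?)", rows)
sql_rows = conn.execute(
    "SELECT city, AVG(temperature) FROM city_data GROUP BY city ORDER BY city"
).fetchall()
conn.close()

# pandas side: the same split-and-aggregate on a DataFrame.
df = pd.DataFrame(rows, columns=["city", "temperature"])
pandas_means = df.groupby("city")["temperature"].mean()

print(sql_rows)
print(pandas_means)
```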
Aggregation - Compute a summary statistic for each group
Computing group sums or means is a very common task in data analysis, so let's see how this can be done.
Now that we have a DataFrameGroupBy object called "g", to get the mean of a specific column you just need to add .your_column.mean().
Or, if you want to see the means for all columns, you just need to run g.mean() (in recent pandas versions, pass numeric_only=True if the DataFrame also contains non-numeric columns).
The same can be done with the GroupBy object "group", which we grouped by multiple columns.
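A small sketch of these mean calculations, using hypothetical columns (continent, country, year, life_exp) and made-up values:

```python
import pandas as pd

# Hypothetical sample data: column names and values are invented for illustration.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "country": ["A", "B", "C", "D"],
    "year": [2000, 2010, 2000, 2000],
    "life_exp": [70.0, 80.0, 60.0, 62.0],
})

g = df.groupby("continent")

# Mean of one column per group (g.life_exp.mean() is the attribute-style equivalent).
life_mean = g["life_exp"].mean()

# Means of all numeric columns; numeric_only=True skips the text column "country".
all_means = g.mean(numeric_only=True)

# The same idea with a GroupBy object built from several columns.
group = df.groupby(["continent", "year"])
multi_mean = group["life_exp"].mean()

print(life_mean, all_means, multi_mean, sep="\n\n")
```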
Applying multiple functions at once with .agg()
Often we need to compute more than one function per group. Doing this one function at a time is tedious, which is why .agg() exists.
With a DataFrameGroupBy object you can also pass a list or dict of functions to aggregate with, and the output is a DataFrame:
g.agg(['count', 'min', 'max', 'mean'])
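As a sketch of both forms, with hypothetical columns life_exp and population and made-up values. A numeric column is selected before passing the list, since functions like 'mean' may fail on text columns in recent pandas versions:

```python
import pandas as pd

# Hypothetical sample data: column names and values are invented for illustration.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "life_exp": [70.0, 80.0, 60.0, 62.0],
    "population": [10, 20, 30, 40],
})
g = df.groupby("continent")

# List form: every listed function is applied to the selected column.
stats = g["life_exp"].agg(["count", "min", "max", "mean"])

# Dict form: choose a different function for each column.
per_column = g.agg({"life_exp": "mean", "population": "sum"})

print(stats)
print(per_column)
```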
Group By function and .describe()
You can also use .describe() with the Group By function, but in contrast to .agg() you can't choose which summary functions to run.
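For instance, a minimal sketch on made-up data (the column names continent and life_exp are assumptions) shows the fixed set of statistics that .describe() returns per group:

```python
import pandas as pd

# Hypothetical sample data: column names and values are invented for illustration.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "life_exp": [70.0, 80.0, 60.0, 62.0],
})
g = df.groupby("continent")

# describe() computes count, mean, std, min, quartiles and max for each group.
summary = g["life_exp"].describe()
print(summary)
```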
Now you understand the basics of the GroupBy functionality in Pandas. If you want to go deeper into the subject, check out the official Pandas documentation for the function.
There are plenty of resources online explaining how to use the GroupBy function, and I’d recommend conquering this syntax if you’re serious about using Pandas.