The ultimate beginner's guide to Group by in Python Pandas
When should you use group by? I would say group by is a good idea any time you want to analyse a pandas Series by some category. Group by in Python Pandas essentially splits the data into different groups depending on a variable/category of your choice. For example, the expression data.groupby('year') will split our current DataFrame by year.
GroupBy object
The groupby() method returns a GroupBy object, which essentially describes how the rows of the original dataset have been split. The GroupBy object's groups attribute is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group. Let's have a closer look.
If you simply run df.groupby('column_for_grouping') you will get a Python object that looks similar to <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd69e143208>. You may want to know what a DataFrameGroupBy object looks like internally. So let's print the groups, split by continent, within our DataFrameGroupBy object by iterating through them.
Now you can see that we have a continent as a key and a DataFrame corresponding to each continent as a value.
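The iteration described above can be sketched roughly as follows. The DataFrame here is a small hypothetical stand-in (the column names and values are assumptions, not the dataset used in this article), but the groupby calls are the same.

```python
import pandas as pd

# Hypothetical sample data with illustrative values, not real statistics.
df = pd.DataFrame({
    "country": ["France", "Germany", "Japan", "India", "Kenya"],
    "continent": ["Europe", "Europe", "Asia", "Asia", "Africa"],
    "population": [10, 20, 30, 40, 50],
})

g = df.groupby("continent")

# .groups maps each group key to the row labels that belong to that group.
group_labels = {key: list(labels) for key, labels in g.groups.items()}
print(group_labels)

# Iterating over the GroupBy object yields (key, sub-DataFrame) pairs.
for continent, frame in g:
    print(continent)
    print(frame)
```

Each pass through the loop gives one continent name together with the rows of df that belong to it.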
get_group()
You can also access a specific DataFrame from a DataFrameGroupBy object using get_group().
In the example above we got the DataFrame that corresponds to Europe by running g.get_group('Europe'). Note that I used .head() in order to shorten the list of countries.
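As a minimal sketch of the get_group() call, again using a hypothetical DataFrame (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical sample data with illustrative values, not real statistics.
df = pd.DataFrame({
    "country": ["France", "Germany", "Japan", "India", "Kenya"],
    "continent": ["Europe", "Europe", "Asia", "Asia", "Africa"],
    "population": [10, 20, 30, 40, 50],
})

g = df.groupby("continent")

# Pull out only the rows whose continent is "Europe".
europe = g.get_group("Europe")

# .head() limits the output to the first rows (5 by default).
print(europe.head())
```

The result is an ordinary DataFrame, so anything you can do with df you can also do with europe.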
What Group By can do
“Group by” is used in a process involving one or more of the following steps:
Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
Out of these, the most straightforward step is the split. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, the following actions are the most common:
Aggregation: compute a summary statistic for each group.
Examples:
Compute group sums or means.
Compute group sizes / counts.
Transformation: perform some group-specific computations and return a like-indexed object.
Examples:
Standardize data within a group.
Filling NAs within groups with a value derived from each group.
Filtration: discard some groups, according to a group-wise computation that evaluates True or False.
Examples:
Discard data that belongs to groups with only a few members.
Filter out data based on the group sum or mean.
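The three kinds of apply step listed above can be sketched side by side. The team and score columns are hypothetical and only there to make the example runnable.

```python
import pandas as pd

# Hypothetical data: three groups of different sizes, one missing score.
df = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b", "c"],
    "score": [1.0, 2.0, None, 4.0, 6.0, 10.0],
})
g = df.groupby("team")

# Aggregation: one summary value per group.
sums = g["score"].sum()   # missing values are skipped
sizes = g.size()          # number of rows in each group

# Transformation: the result has the same index as the input.
# Here each missing score is filled with the mean of its own group.
filled = g["score"].transform(lambda s: s.fillna(s.mean()))

# Filtration: keep only the rows of groups with at least two members.
big = g.filter(lambda frame: len(frame) >= 2)

print(sums, sizes, filled, big, sep="\n\n")
```

Note the difference in shape: aggregation returns one row per group, transformation returns one row per original row, and filtration returns a subset of the original rows.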
GroupBy using multiple columns
On a DataFrame, we obtain a GroupBy object by calling groupby(). We can group by either one column of the DataFrame or by multiple columns using df.groupby(['column1', 'column2']).
Now we have split the data into groups by job title and company and saved the result as a GroupBy object called "group".
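A minimal sketch of grouping by job title and company might look like this. The column names job_title, company and salary, and all the values, are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data; names and numbers are made up for illustration.
df = pd.DataFrame({
    "job_title": ["Analyst", "Analyst", "Engineer", "Engineer", "Engineer"],
    "company": ["Acme", "Acme", "Acme", "Globex", "Globex"],
    "salary": [50, 60, 80, 90, 100],
})

# Passing a list of columns creates one group per unique combination.
group = df.groupby(["job_title", "company"])

# With several grouping columns, each group key is a tuple.
keys = list(group.groups.keys())
print(keys)

# Number of rows in each (job_title, company) combination.
sizes = group.size()
print(sizes)
```

Only combinations that actually occur in the data become groups, so there is no ("Analyst", "Globex") group here.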
Similarity to SQL
If you are familiar with SQL, GroupBy in Pandas will be no stranger to you. After all, df.groupby('column_for_grouping') uses the same logic as the GROUP BY clause in a query such as SELECT city, COUNT(*) FROM city_data GROUP BY city; where city_data is the name of the table being grouped by the column city.
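To make the comparison concrete: in SQL, GROUP BY is normally paired with an aggregate, for instance SELECT city, COUNT(*) AS n FROM city_data GROUP BY city. A rough pandas counterpart, assuming a hypothetical city_data DataFrame with a city column, could be:

```python
import pandas as pd

# Hypothetical stand-in for the city_data table.
city_data = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Cairo", "Lima", "Oslo"],
})

# Roughly: SELECT city, COUNT(*) AS n FROM city_data GROUP BY city
counts = city_data.groupby("city").size().reset_index(name="n")
print(counts)
```

size() plays the role of COUNT(*), and reset_index(name="n") turns the result back into a flat table with a named count column.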
Aggregation - Compute a summary statistic for each group
Computing group sums or means is a very common thing in data analysis. So let's see how this can be done.
Now that we have a DataFrameGroupBy object called "g", to get the mean of a specific column you just need to add .your_column.mean():
g.your_column.mean()
Or, if you want to see the means for all columns, you just need to run g.mean(). In recent pandas versions, pass numeric_only=True if the DataFrame also contains non-numeric columns.
The same can be done with a GroupBy object "group" which we grouped by multiple values.
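Here is a small sketch of these mean calculations. The columns (continent, income, life_exp, population) and their values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical data with illustrative values, not real statistics.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "country": ["France", "Germany", "Japan", "India"],
    "income": ["high", "high", "high", "low"],
    "life_exp": [80.0, 82.0, 84.0, 70.0],
    "population": [10, 20, 30, 40],
})

g = df.groupby("continent")

# Mean of a single column per group.
col_mean = g["life_exp"].mean()

# Means of every numeric column; text columns are left out.
all_means = g.mean(numeric_only=True)

# The same call works on a GroupBy object built from several columns.
group = df.groupby(["continent", "income"])
group_mean = group["life_exp"].mean()

print(col_mean, all_means, group_mean, sep="\n\n")
```

The bracket form g["life_exp"] is equivalent to g.life_exp and also works for column names that contain spaces or clash with method names.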
Applying multiple functions at once with .agg([])
Often we need to calculate more than one function for each group. Doing this one function at a time is time-consuming, which is why .agg([]) was created.
With a DataFrameGroupBy object you can also pass a list or dict of functions to aggregate with, and the output is a DataFrame:
g.agg(['count', 'min', 'max', 'mean'])
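Both forms, the list and the dict, can be sketched as follows. The DataFrame is hypothetical; the list version is applied to a single numeric column here so that text columns do not get in the way.

```python
import pandas as pd

# Hypothetical data with illustrative values, not real statistics.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "life_exp": [80.0, 82.0, 84.0, 70.0],
    "population": [10, 20, 30, 40],
})

g = df.groupby("continent")

# A list applies every function to the selected column:
# one row per group, one column per function.
summary = g["life_exp"].agg(["count", "min", "max", "mean"])
print(summary)

# A dict chooses a different function for each column.
per_column = g.agg({"life_exp": "mean", "population": "sum"})
print(per_column)
```

The dict form is handy when a sum makes sense for one column but a mean makes sense for another.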
Group By function and .describe()
You can also use .describe() with the Group By function, but in contrast to .agg([]) you can't manually choose which summary functions to run.
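A short sketch of .describe() on a grouped column, using the same kind of hypothetical data as before:

```python
import pandas as pd

# Hypothetical data with illustrative values, not real statistics.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "life_exp": [80.0, 82.0, 84.0, 70.0],
})

g = df.groupby("continent")

# describe() returns a fixed set of statistics for each group:
# count, mean, std, min, 25%, 50%, 75%, max
desc = g["life_exp"].describe()
print(desc)
```

If you only need a few of these statistics, .agg([]) with an explicit list gives a narrower table.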
Conclusion
Now you understand the basics of the GroupBy functionality in Pandas. If you want to go deeper into the subject, check out the official Pandas documentation for the function.
There are plenty of resources online explaining how to use the GroupBy function, and I’d recommend conquering this syntax if you’re serious about using Pandas.