The ultimate beginner's guide to Group by in Python Pandas

August 8, 2019

When should you use group by in general? I would say group by is a good idea any time you want to analyse a pandas Series by some category. Group by in Python Pandas essentially splits the data into different groups depending on a variable or category of your choice. For example, the expression data.groupby('year') will split our current DataFrame by year.

GroupBy object

The groupby() function returns a GroupBy object, which essentially describes how the rows of the original dataset have been split. The GroupBy object's groups attribute is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group. Let's have a closer look.
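To make the groups dictionary concrete, here is a minimal sketch. The DataFrame, its column names, and the numbers are made up for illustration and are not the dataset used in this guide.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "country": ["France", "Germany", "Japan", "China", "Kenya"],
    "continent": ["Europe", "Europe", "Asia", "Asia", "Africa"],
    "population_m": [67, 83, 126, 1400, 53],
})

g = df.groupby("continent")

# .groups maps each group key to the index labels of the rows in that group
groups = {key: list(labels) for key, labels in g.groups.items()}
print(groups)
```

Each key is a continent and each value lists the row labels of the original DataFrame that fall into that group.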

If you simply run df.groupby('column_for_grouping') you will get a Python object that looks similar to <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd69e143208>. You may want to know what a DataFrameGroupBy object looks like internally. So let's print the groups split by continent within our DataFrameGroupBy object by iterating through them.
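As a minimal sketch of that iteration, assuming a small made-up DataFrame with a continent column (the data and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "country": ["France", "Germany", "Japan", "China", "Kenya"],
    "continent": ["Europe", "Europe", "Asia", "Asia", "Africa"],
    "population_m": [67, 83, 126, 1400, 53],
})

g = df.groupby("continent")
print(type(g).__name__)

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs
keys = []
for continent, frame in g:
    keys.append(continent)
    print(continent)
    print(frame)
```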

Now you can see that we have a continent as a key and a DataFrame corresponding to each continent as a value.

get_group()

You can also access a specific DataFrame from a DataFrameGroupBy object using get_group().
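A short sketch of get_group(), again on made-up data (the DataFrame and its columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "country": ["France", "Germany", "Japan", "China", "Kenya"],
    "continent": ["Europe", "Europe", "Asia", "Asia", "Africa"],
    "population_m": [67, 83, 126, 1400, 53],
})

g = df.groupby("continent")

# Pull out the sub-DataFrame for a single group key
europe = g.get_group("Europe")

# .head() limits the output to the first rows of that group
print(europe.head())
```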

In the example above we got the DataFrame that corresponds to Europe by running g.get_group('Europe'). Note that I used .head() to shorten the list of countries.

What Group By can do

“Group by” is used in a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria.

  • Applying a function to each group independently.

  • Combining the results into a data structure.

Out of these, the most straightforward step is the split. In fact, in many situations, we may wish to split the data set into groups and do something with those groups. In the apply step, the following actions are the most common:

Aggregation: compute a summary statistic for each group.

Examples:

  • Compute group sums or means.

  • Compute group sizes / counts.

Transformation: perform some group-specific computations and return a like-indexed object.

Examples:

  • Standardize data within a group.

  • Fill NAs within groups with a value derived from each group.
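Both transformation examples can be sketched with transform(), which returns a result aligned with the original rows. The team and score columns below are hypothetical.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b"],
    "score": [1.0, 2.0, None, 10.0, 30.0],
})

by_team = df.groupby("team")["score"]

# transform() broadcasts each group's statistic back to every row of the group
group_mean = by_team.transform("mean")
group_std = by_team.transform("std")

# Fill each missing score with the mean of its own group
df["score_filled"] = df["score"].fillna(group_mean)

# Standardize within each group: subtract the group mean, divide by the group std
df["score_z"] = (df["score"] - group_mean) / group_std

print(df)
```

The result has the same number of rows and the same index as the input, which is what "like-indexed" means here.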

Filtration: discard some groups, according to a group-wise computation that evaluates True or False.

Examples:

  • Discard data that belongs to groups with only a few members.

  • Filter out data based on the group sum or mean.
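The two filtration examples can be sketched with filter(), which keeps or drops whole groups depending on whether the function returns True or False. The data and thresholds below are made up for illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "team": ["a", "a", "a", "b", "c", "c"],
    "score": [1, 2, 3, 50, 4, 5],
})

# Keep only rows that belong to groups with at least two members
big_enough = df.groupby("team").filter(lambda grp: len(grp) >= 2)

# Keep only rows that belong to groups whose mean score is above 3
high_mean = df.groupby("team").filter(lambda grp: grp["score"].mean() > 3)

print(big_enough)
print(high_mean)
```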

GroupBy using multiple columns

On a DataFrame, we obtain a GroupBy object by calling groupby(). We can group by either one column of the DataFrame or by multiple columns using df.groupby(['column1', 'column2']).
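A minimal sketch of grouping by two columns, assuming a made-up DataFrame with job_title, company and salary columns (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "job_title": ["Analyst", "Analyst", "Engineer", "Engineer", "Engineer"],
    "company": ["Acme", "Acme", "Acme", "Globex", "Globex"],
    "salary": [50, 60, 80, 90, 100],
})

# Passing a list of column names creates one group per unique combination
group = df.groupby(["job_title", "company"])

# size() counts the rows in each (job_title, company) group
sizes = group.size()
print(sizes)
```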

Now we have split the data into groups by job title and company and saved the result as a GroupBy object called "group".

Similarity to SQL

If you are familiar with SQL, GroupBy in Pandas will be no stranger to you. After all, df.groupby('column_for_grouping') follows the same logic as the GROUP BY clause in a query such as SELECT city, COUNT(*) FROM city_data GROUP BY city; where city_data is the name of the table and city is the column you are grouping by.
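To illustrate the parallel, here is a rough pandas counterpart of a SQL GROUP BY query. The city_data DataFrame and its sales column are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical table, used only for illustration
city_data = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome"],
    "sales": [10, 20, 5],
})

# Roughly equivalent to:
#   SELECT city, COUNT(sales) AS n, SUM(sales) AS total
#   FROM city_data GROUP BY city;
result = (
    city_data.groupby("city")
    .agg(n=("sales", "count"), total=("sales", "sum"))
    .reset_index()
)
print(result)
```

As in SQL, the grouping alone produces nothing to look at; it is the aggregate functions applied per group that give the result.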

Aggregation - Compute a summary statistic for each group

Computing group sums or means is a very common thing in data analysis. So let's see how this can be done.

Now that we have a DataFrameGroupBy object called "g", to get the mean of a specific column you just need to add .your_column.mean():

g.your_column.mean()

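Or if you want to see the means for all columns, just run g.mean(). In pandas 2.0 and later, pass numeric_only=True if the DataFrame also has non-numeric columns, since they are no longer dropped automatically.

The same can be done with the GroupBy object "group", which we grouped by multiple columns.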
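A minimal sketch of these mean calculations on made-up data. The column names and values are hypothetical, and "group" here is grouped by continent and year rather than by the columns used earlier.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "country": ["France", "Germany", "France", "Japan", "Japan"],
    "continent": ["Europe", "Europe", "Europe", "Asia", "Asia"],
    "year": [2000, 2000, 2010, 2000, 2010],
    "population_m": [60, 80, 100, 120, 130],
})

g = df.groupby("continent")

# Mean of one column per group
mean_pop = g["population_m"].mean()

# Mean of every numeric column per group (the text column is skipped)
all_means = g.mean(numeric_only=True)

# The same call on a GroupBy object built from several columns
group = df.groupby(["continent", "year"])
multi_mean = group["population_m"].mean()

print(mean_pop)
print(all_means)
print(multi_mean)
```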

Applying multiple functions at once .agg([])

Often we need to calculate more than one function per group. Doing this one call at a time is time-consuming, which is why .agg([]) was created.

With a DataFrameGroupBy object you can also pass a list or dict of functions to aggregate with, which outputs a DataFrame:

g.agg(['count', 'min', 'max', 'mean'])
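Here is a short sketch of both forms, a list of functions and a dict that maps columns to functions. The data is made up, and selecting a numeric column first is a choice made here so that 'mean' is not applied to text columns.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "country": ["France", "Germany", "France", "Japan", "Japan"],
    "continent": ["Europe", "Europe", "Europe", "Asia", "Asia"],
    "year": [2000, 2000, 2010, 2000, 2010],
    "population_m": [60, 80, 100, 120, 130],
})

g = df.groupby("continent")

# A list applies every function to the selected column
summary = g["population_m"].agg(["count", "min", "max", "mean"])

# A dict chooses different functions for different columns
per_column = g.agg({"population_m": "sum", "year": ["min", "max"]})

print(summary)
print(per_column)
```

With the dict form the resulting columns are a two-level index of (column, function) pairs.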

Group By function and .describe()

You can also use .describe() with the Group By function, but in contrast to .agg([]) you can't manually choose which summary functions to run.
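A minimal sketch of .describe() on a grouped column, using made-up data (the DataFrame and column names are assumptions):

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Europe", "Asia", "Asia"],
    "population_m": [60, 80, 100, 120, 130],
})

g = df.groupby("continent")

# describe() returns a fixed set of summary statistics for each group
stats = g["population_m"].describe()
print(stats)
```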

Conclusion

Now you understand the basics of the GroupBy functionality in Pandas. If you want to go deeper into the subject, check out the official Pandas documentation for the function.

There are plenty of resources online explaining how to use the GroupBy function, and I’d recommend conquering this syntax if you’re serious about using Pandas.
