Group

The groupby method allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This functions much like the “pivot table” feature found in most spreadsheets programs.

Let’s use it to total up the accidents by helicopter make and model. First, let’s prepare our data:

import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()

Now let’s group by make and model. You start by passing the field you want to group on to the function:

accident_list.groupby("latimes_make_and_model")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7c8564b05e80>

That’s a nice start, but you’ll notice you don’t get much back. The data has been grouped, but we haven’t chosen what to do with it yet. If we want the total by model, we can use the size method:

accident_list.groupby("latimes_make_and_model").size()
latimes_make_and_model
AGUSTA 109                2
AIRBUS 130                1
AIRBUS 135                4
AIRBUS 350               29
BELL 206                 30
BELL 407                 13
HUGHES 369               13
MCDONNELL DOUGLAS 369     6
ROBINSON R22             20
ROBINSON R44             38
SCHWEIZER 269             5
SIKORSKY 76               2
dtype: int64

The result is much like value_counts, but we’re allowed to run all kinds of statistical operations on the group, like sum, mean and std. For instance, we could sum up the number of fatalities for each maker by stringing that field on the end followed by the statistical method:

accident_list.groupby("latimes_make_and_model")["total_fatalities"].sum()
latimes_make_and_model
AGUSTA 109                5
AIRBUS 130                1
AIRBUS 135               11
AIRBUS 350               81
BELL 206                 61
BELL 407                 35
HUGHES 369               19
MCDONNELL DOUGLAS 369     7
ROBINSON R22             27
ROBINSON R44             71
SCHWEIZER 269             7
SIKORSKY 76              11
Name: total_fatalities, dtype: int64

Or we could calculate the mean number of fatalities per accident for each model:

accident_list.groupby("latimes_make_and_model")["total_fatalities"].mean()
latimes_make_and_model
AGUSTA 109               2.500000
AIRBUS 130               1.000000
AIRBUS 135               2.750000
AIRBUS 350               2.793103
BELL 206                 2.033333
BELL 407                 2.692308
HUGHES 369               1.461538
MCDONNELL DOUGLAS 369    1.166667
ROBINSON R22             1.350000
ROBINSON R44             1.868421
SCHWEIZER 269            1.400000
SIKORSKY 76              5.500000
Name: total_fatalities, dtype: float64

Saving your groups

You can save grouped data to a variable just like any other pandas operation. This is useful when you want to perform multiple operations on the same grouping:

# Save the grouped data
by_model = accident_list.groupby("latimes_make_and_model")

# Now we can run multiple operations
print("Total accidents by model:")
print(by_model.size())
print("\nTotal fatalities by model:")
print(by_model["total_fatalities"].sum())
Total accidents by model:
latimes_make_and_model
AGUSTA 109                2
AIRBUS 130                1
AIRBUS 135                4
AIRBUS 350               29
BELL 206                 30
BELL 407                 13
HUGHES 369               13
MCDONNELL DOUGLAS 369     6
ROBINSON R22             20
ROBINSON R44             38
SCHWEIZER 269             5
SIKORSKY 76               2
dtype: int64

Total fatalities by model:
latimes_make_and_model
AGUSTA 109                5
AIRBUS 130                1
AIRBUS 135               11
AIRBUS 350               81
BELL 206                 61
BELL 407                 35
HUGHES 369               19
MCDONNELL DOUGLAS 369     7
ROBINSON R22             27
ROBINSON R44             71
SCHWEIZER 269             7
SIKORSKY 76              11
Name: total_fatalities, dtype: int64

Multiple statistics at once

You can calculate multiple statistics at once using the agg method:

accident_list.groupby("latimes_make_and_model")["total_fatalities"].agg(['count', 'sum', 'mean'])
count sum mean
latimes_make_and_model
AGUSTA 109 2 5 2.500000
AIRBUS 130 1 1 1.000000
AIRBUS 135 4 11 2.750000
AIRBUS 350 29 81 2.793103
BELL 206 30 61 2.033333
BELL 407 13 35 2.692308
HUGHES 369 13 19 1.461538
MCDONNELL DOUGLAS 369 6 7 1.166667
ROBINSON R22 20 27 1.350000
ROBINSON R44 38 71 1.868421
SCHWEIZER 269 5 7 1.400000
SIKORSKY 76 2 11 5.500000

This gives us a comprehensive view of accidents and fatalities by model. We can see the count of accidents, total fatalities, and average fatalities per accident for each helicopter model.

Now that we understand grouping, let’s learn how to sort our results to find the models with the most accidents or fatalities.