Group¶
The groupby method allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This functions much like the “pivot table” feature found in most spreadsheets programs.
Let’s use it to total up the accidents by helicopter make and model. First, let’s prepare our data:
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
Now let’s group by make and model. You start by passing the field you want to group on to the function:
accident_list.groupby("latimes_make_and_model")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7c8564b05e80>
That’s a nice start, but you’ll notice you don’t get much back. The data has been grouped, but we haven’t chosen what to do with it yet. If we want the total by model, we can use the size method:
accident_list.groupby("latimes_make_and_model").size()
latimes_make_and_model
AGUSTA 109 2
AIRBUS 130 1
AIRBUS 135 4
AIRBUS 350 29
BELL 206 30
BELL 407 13
HUGHES 369 13
MCDONNELL DOUGLAS 369 6
ROBINSON R22 20
ROBINSON R44 38
SCHWEIZER 269 5
SIKORSKY 76 2
dtype: int64
The result is much like value_counts, but we’re allowed to run all kinds of statistical operations on the group, like sum, mean and std. For instance, we could sum up the number of fatalities for each maker by stringing that field on the end followed by the statistical method:
accident_list.groupby("latimes_make_and_model")["total_fatalities"].sum()
latimes_make_and_model
AGUSTA 109 5
AIRBUS 130 1
AIRBUS 135 11
AIRBUS 350 81
BELL 206 61
BELL 407 35
HUGHES 369 19
MCDONNELL DOUGLAS 369 7
ROBINSON R22 27
ROBINSON R44 71
SCHWEIZER 269 7
SIKORSKY 76 11
Name: total_fatalities, dtype: int64
Or we could calculate the mean number of fatalities per accident for each model:
accident_list.groupby("latimes_make_and_model")["total_fatalities"].mean()
latimes_make_and_model
AGUSTA 109 2.500000
AIRBUS 130 1.000000
AIRBUS 135 2.750000
AIRBUS 350 2.793103
BELL 206 2.033333
BELL 407 2.692308
HUGHES 369 1.461538
MCDONNELL DOUGLAS 369 1.166667
ROBINSON R22 1.350000
ROBINSON R44 1.868421
SCHWEIZER 269 1.400000
SIKORSKY 76 5.500000
Name: total_fatalities, dtype: float64
Saving your groups¶
You can save grouped data to a variable just like any other pandas operation. This is useful when you want to perform multiple operations on the same grouping:
# Save the grouped data
by_model = accident_list.groupby("latimes_make_and_model")
# Now we can run multiple operations
print("Total accidents by model:")
print(by_model.size())
print("\nTotal fatalities by model:")
print(by_model["total_fatalities"].sum())
Total accidents by model:
latimes_make_and_model
AGUSTA 109 2
AIRBUS 130 1
AIRBUS 135 4
AIRBUS 350 29
BELL 206 30
BELL 407 13
HUGHES 369 13
MCDONNELL DOUGLAS 369 6
ROBINSON R22 20
ROBINSON R44 38
SCHWEIZER 269 5
SIKORSKY 76 2
dtype: int64
Total fatalities by model:
latimes_make_and_model
AGUSTA 109 5
AIRBUS 130 1
AIRBUS 135 11
AIRBUS 350 81
BELL 206 61
BELL 407 35
HUGHES 369 19
MCDONNELL DOUGLAS 369 7
ROBINSON R22 27
ROBINSON R44 71
SCHWEIZER 269 7
SIKORSKY 76 11
Name: total_fatalities, dtype: int64
Multiple statistics at once¶
You can calculate multiple statistics at once using the agg method:
accident_list.groupby("latimes_make_and_model")["total_fatalities"].agg(['count', 'sum', 'mean'])
| count | sum | mean | |
|---|---|---|---|
| latimes_make_and_model | |||
| AGUSTA 109 | 2 | 5 | 2.500000 |
| AIRBUS 130 | 1 | 1 | 1.000000 |
| AIRBUS 135 | 4 | 11 | 2.750000 |
| AIRBUS 350 | 29 | 81 | 2.793103 |
| BELL 206 | 30 | 61 | 2.033333 |
| BELL 407 | 13 | 35 | 2.692308 |
| HUGHES 369 | 13 | 19 | 1.461538 |
| MCDONNELL DOUGLAS 369 | 6 | 7 | 1.166667 |
| ROBINSON R22 | 20 | 27 | 1.350000 |
| ROBINSON R44 | 38 | 71 | 1.868421 |
| SCHWEIZER 269 | 5 | 7 | 1.400000 |
| SIKORSKY 76 | 2 | 11 | 5.500000 |
This gives us a comprehensive view of accidents and fatalities by model. We can see the count of accidents, total fatalities, and average fatalities per accident for each helicopter model.
Now that we understand grouping, let’s learn how to sort our results to find the models with the most accidents or fatalities.