--- jupytext: text_representatiThat's a nice start, but you'll notice you don't get much back. The data has been grouped, but we haven't chosen what to do with it yet. If we want the total by model, we can use the `size` method. ```python # %% print(accident_list.groupby("latimes_make_and_model").size()) ``` The result is much like `value_counts`, but we're allowed run to all kinds of statistical operations on the group, like `sum`, `mean` and `std`. For instance, we could sum up the number of fatalities for each maker by stringing that field on the end followed by the statistical method. ```python # %% print(accident_list.groupby("latimes_make_and_model")["total_fatalities"].sum()) ```sion: .md format_name: myst format_version: '0.8' jupytext_version: '1.4.1' kernelspec: display_name: Python 3 language: python name: python3 --- # Group
The [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This functions much like the ["pivot table"](https://en.wikipedia.org/wiki/Pivot_table) feature found in most spreadsheets programs. Let's use it to total up the accidents by helicopter make and model. You start by passing the field you want to group on to the function. ```python # %% import pandas as pd accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv") accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper() # Group by make and model print(accident_list.groupby("latimes_make_and_model")) ``` That's a nice start, but you’ll notice you don’t get much back. The data has been grouped, but we haven’t chosen what to do with it yet. If we want the total by model, we can use the `size` method. ```{code-cell} :tags: [show-input] accident_list.groupby("latimes_make_and_model").size() ``` The result is much like `value_counts`, but we're allowed run to all kinds of statistical operations on the group, like `sum`, `mean` and `std`. For instance, we could sum up the number of fatalities for each maker by stringing that field on the end followed by the statistical method. ```{code-cell} :tags: [show-input] accident_list.groupby("latimes_make_and_model")["total_fatalities"].sum() ``` ## Reset a DataFrame You may notice that even though the result of a `groupby` has two columns, pandas does not return a clean-looking table the same way other operations like `head` do. In most instances, you can convert ugly tables like the ones above into a pretty DataFrame by tacking the `reset_index` method onto the end of your code. ```python # %% result = accident_list.groupby("latimes_make_and_model").size().reset_index() print(result) ``` Why doesn't `groupby` return a DataFrame? Why does `reset_index` have such a weird name? Like so much in computer programming, the answer is simply, “because the people who created the library said so.” It’s important to learn that all open-source programming tools are made by humans, and humans have their quirks. Over time you’ll see pandas has more than a few. As a beginner, you should just accept the oddities and keep moving. As you get more advanced, if there’s something about the system you think could be improved, you should consider [contributing](https://pandas.pydata.org/pandas-docs/stable/development/contributing.html) to the Python code that operates the library. You can clean up the `0` column name assigned by pandas with the `rename` method. ```python # %% result = accident_list.groupby("latimes_make_and_model").size().rename("accidents").reset_index() print(result) ``` Now save that as a variable. ```python # %% accident_counts = accident_list.groupby("latimes_make_and_model").size().rename("accidents").reset_index() ``` That will return a DataFrame with the accident totals we need to calculate a rate. Inspect it with `head`. ```python # %% print(accident_counts.head()) ``` Now we‘ve got a ranking we can work with.