{ "cells": [ { "cell_type": "markdown", "id": "d5e1c188", "metadata": {}, "source": [ "# Group\n", "\n", "The [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This functions much like the [\"pivot table\"](https://en.wikipedia.org/wiki/Pivot_table) feature found in most spreadsheets programs.\n", "\n", "Let's use it to total up the accidents by helicopter make and model. First, let's prepare our data:" ] }, { "cell_type": "code", "execution_count": null, "id": "846aa3c2", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "accident_list = pd.read_csv(\"https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv\")\n", "accident_list[\"latimes_make_and_model\"] = accident_list[\"latimes_make_and_model\"].str.upper()" ] }, { "cell_type": "markdown", "id": "5a51763c", "metadata": {}, "source": [ "Now let's group by make and model. You start by passing the field you want to group on to the function:" ] }, { "cell_type": "code", "execution_count": null, "id": "fbebbcfb", "metadata": {}, "outputs": [], "source": [ "accident_list.groupby(\"latimes_make_and_model\")" ] }, { "cell_type": "markdown", "id": "f0718b3b", "metadata": {}, "source": [ "That's a nice start, but you'll notice you don't get much back. The data has been grouped, but we haven't chosen what to do with it yet. If we want the total by model, we can use the `size` method:" ] }, { "cell_type": "code", "execution_count": null, "id": "1ea102d1", "metadata": {}, "outputs": [], "source": [ "accident_list.groupby(\"latimes_make_and_model\").size()" ] }, { "cell_type": "markdown", "id": "b46146f9", "metadata": {}, "source": [ "The result is much like `value_counts`, but we're allowed to run all kinds of statistical operations on the group, like `sum`, `mean` and `std`. For instance, we could sum up the number of fatalities for each maker by stringing that field on the end followed by the statistical method:" ] }, { "cell_type": "code", "execution_count": null, "id": "bb609108", "metadata": {}, "outputs": [], "source": [ "accident_list.groupby(\"latimes_make_and_model\")[\"total_fatalities\"].sum()" ] }, { "cell_type": "markdown", "id": "4025ba54", "metadata": {}, "source": [ "Or we could calculate the mean number of fatalities per accident for each model:" ] }, { "cell_type": "code", "execution_count": null, "id": "142058b9", "metadata": {}, "outputs": [], "source": [ "accident_list.groupby(\"latimes_make_and_model\")[\"total_fatalities\"].mean()" ] }, { "cell_type": "markdown", "id": "bc747678", "metadata": {}, "source": [ "## Saving your groups\n", "\n", "You can save grouped data to a variable just like any other pandas operation. This is useful when you want to perform multiple operations on the same grouping:" ] }, { "cell_type": "code", "execution_count": null, "id": "4923c4c2", "metadata": {}, "outputs": [], "source": [ "# Save the grouped data\n", "by_model = accident_list.groupby(\"latimes_make_and_model\")\n", "\n", "# Now we can run multiple operations\n", "print(\"Total accidents by model:\")\n", "print(by_model.size())\n", "print(\"\\nTotal fatalities by model:\")\n", "print(by_model[\"total_fatalities\"].sum())" ] }, { "cell_type": "markdown", "id": "79ed97c9", "metadata": {}, "source": [ "## Multiple statistics at once\n", "\n", "You can calculate multiple statistics at once using the `agg` method:" ] }, { "cell_type": "code", "execution_count": null, "id": "ee0a451c", "metadata": {}, "outputs": [], "source": [ "accident_list.groupby(\"latimes_make_and_model\")[\"total_fatalities\"].agg(['count', 'sum', 'mean'])" ] }, { "cell_type": "markdown", "id": "8d2a7111", "metadata": {}, "source": [ "This gives us a comprehensive view of accidents and fatalities by model. We can see the count of accidents, total fatalities, and average fatalities per accident for each helicopter model.\n", "\n", "Now that we understand grouping, let's learn how to sort our results to find the models with the most accidents or fatalities." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }