{ "cells": [ { "cell_type": "markdown", "id": "8c994d47", "metadata": {}, "source": [ "# Sort\n", "\n", "Another simple but common technique for analyzing data is sorting. This can be useful for ranking the DataFrame to show the first and last members of the table according to a particular column.\n", "\n", "Let's start by preparing our data:" ] }, { "cell_type": "code", "execution_count": null, "id": "5d303773", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Load accident data\n", "accident_list = pd.read_csv(\"https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv\")\n", "accident_list[\"latimes_make_and_model\"] = accident_list[\"latimes_make_and_model\"].str.upper()\n", "\n", "# Create accident counts\n", "accident_counts = accident_list.groupby(\"latimes_make_and_model\").size().reset_index().rename(columns={0: \"accidents\"})\n", "\n", "# Load survey data and merge\n", "survey = pd.read_csv(\"https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv\")\n", "survey[\"latimes_make_and_model\"] = survey[\"latimes_make_and_model\"].str.upper()\n", "merged_list = pd.merge(accident_counts, survey, on=\"latimes_make_and_model\")\n", "\n", "# Calculate accident rates\n", "merged_list[\"per_hour\"] = merged_list.accidents / merged_list.total_hours\n", "merged_list[\"per_100k_hours\"] = (merged_list.accidents / merged_list.total_hours) * 100_000\n", "\n", "print(\"Data prepared for sorting analysis\")\n", "merged_list.head()" ] }, { "cell_type": "markdown", "id": "6cdfea16", "metadata": {}, "source": [ "## Basic sorting\n", "\n", "The [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method is how pandas sorts DataFrames. It expects you to provide it with the name of the column to sort by in quotes. Try sorting by our computed field:" ] }, { "cell_type": "code", "execution_count": null, "id": "5808099e", "metadata": {}, "outputs": [], "source": [ "merged_list.sort_values(\"per_100k_hours\")" ] }, { "cell_type": "markdown", "id": "51da884b", "metadata": {}, "source": [ "Note that by default `sort_values` returns the DataFrame sorted in ascending order from lowest to highest. You can show the largest values first by passing in an optional keyword argument called `ascending`. When it is set to `False`, the DataFrame is sorted in descending order:" ] }, { "cell_type": "code", "execution_count": null, "id": "7b6e0d92", "metadata": {}, "outputs": [], "source": [ "merged_list.sort_values(\"per_100k_hours\", ascending=False)" ] }, { "cell_type": "markdown", "id": "2b28525c", "metadata": {}, "source": [ "## Sorting by multiple columns\n", "\n", "You can also sort by multiple columns by passing a list of column names. This is useful when you want a primary and secondary sort:" ] }, { "cell_type": "code", "execution_count": null, "id": "52d6b3ad", "metadata": {}, "outputs": [], "source": [ "# Sort by accidents (descending) then by total hours (descending)\n", "merged_list.sort_values([\"accidents\", \"total_hours\"], ascending=[False, False])" ] }, { "cell_type": "markdown", "id": "98118c3a", "metadata": {}, "source": [ "## Finding the top and bottom values\n", "\n", "Sometimes you just want the highest or lowest values. Pandas provides convenient methods for this:" ] }, { "cell_type": "code", "execution_count": null, "id": "5a5f9678", "metadata": {}, "outputs": [], "source": [ "# Top 5 models by accident rate\n", "print(\"Top 5 models by accident rate per 100k hours:\")\n", "print(merged_list.nlargest(5, \"per_100k_hours\")[[\"latimes_make_and_model\", \"per_100k_hours\"]])" ] }, { "cell_type": "code", "execution_count": null, "id": "fa8ebb4d", "metadata": {}, "outputs": [], "source": [ "# Bottom 5 models by accident rate\n", "print(\"Bottom 5 models by accident rate per 100k hours:\")\n", "print(merged_list.nsmallest(5, \"per_100k_hours\")[[\"latimes_make_and_model\", \"per_100k_hours\"]])" ] }, { "cell_type": "markdown", "id": "1de68a8a", "metadata": {}, "source": [ "## Practice exercises\n", "\n", "Congratulations! With sorting, you've covered most of the basic skills necessary to access and analyze data with pandas. Here are some practice questions you can answer using the techniques we've learned:\n", "\n", "1. What's the date of the most recent fatal helicopter accident in Texas?\n", "2. How many fatalities occurred in Texas accidents?\n", "3. What helicopter model logged the most flight hours?\n", "4. Where did the accident with the NTSB number `ERA13LA057` occur?\n", "\n", "Try to answer these questions using the sorting and filtering techniques we've learned!" ] }, { "cell_type": "code", "execution_count": null, "id": "1951c9ef", "metadata": {}, "outputs": [], "source": [ "# Space for your practice exercises\n", "# Try to answer the questions above using pandas methods" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }