{ "cells": [ { "cell_type": "markdown", "id": "f2859827", "metadata": {}, "source": [ "# Merge\n", "\n", "One of the most powerful features of pandas is the ability to merge datasets together. This is essential for data journalism, where you often need to combine information from multiple sources to tell a complete story.\n", "\n", "In our helicopter accident analysis, we want to combine accident data with flight hour data to calculate accident rates - a much more meaningful metric than raw accident counts." ] }, { "cell_type": "markdown", "id": "205001f1", "metadata": {}, "source": [ "## Setup: Load both datasets\n", "\n", "Let's start by loading both our accident data and the FAA survey data that contains flight hours:" ] }, { "cell_type": "code", "execution_count": null, "id": "3589a5d5", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Load accident data\n", "accident_list = pd.read_csv(\"https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv\")\n", "accident_list[\"latimes_make_and_model\"] = accident_list[\"latimes_make_and_model\"].str.upper()\n", "\n", "print(f\"Loaded {len(accident_list)} accidents\")\n", "accident_list.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "12ad5c62", "metadata": {}, "outputs": [], "source": [ "# Load FAA survey data with flight hours\n", "survey = pd.read_csv(\"https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv\")\n", "survey[\"latimes_make_and_model\"] = survey[\"latimes_make_and_model\"].str.upper()\n", "\n", "print(f\"Loaded {len(survey)} helicopter models with flight hour data\")\n", "survey.head()" ] }, { "cell_type": "markdown", "id": "327a789a", "metadata": {}, "source": [ "## Prepare data for merging\n", "\n", "Before merging, we need to create a summary of accidents by helicopter model:" ] }, { "cell_type": "code", "execution_count": null, "id": "1fd349f3", "metadata": {}, "outputs": [], "source": [ "# Count accidents by make and model\n", "accident_counts = accident_list.groupby([\"latimes_make\", \"latimes_make_and_model\"]).size().rename(\"accidents\").reset_index()\n", "\n", "print(f\"Accident counts for {len(accident_counts)} helicopter models\")\n", "accident_counts.head()" ] }, { "cell_type": "markdown", "id": "27dea1a5", "metadata": {}, "source": [ "## Basic merge operation\n", "\n", "Now we can merge the accident counts with the flight hour data. We'll use the `merge` function to combine the datasets on the `latimes_make_and_model` column:" ] }, { "cell_type": "code", "execution_count": null, "id": "c8a0b8ef", "metadata": {}, "outputs": [], "source": [ "# Merge the datasets\n", "merged_list = pd.merge(accident_counts, survey, on=\"latimes_make_and_model\")\n", "\n", "print(f\"Merged dataset has {len(merged_list)} helicopter models\")\n", "merged_list.head()" ] }, { "cell_type": "markdown", "id": "316c3fa8", "metadata": {}, "source": [ "## Understanding the merge\n", "\n", "Let's examine what happened during the merge:" ] }, { "cell_type": "code", "execution_count": null, "id": "a57e456a", "metadata": {}, "outputs": [], "source": [ "print(f\"Original accident counts: {len(accident_counts)} models\")\n", "print(f\"Original survey data: {len(survey)} models\")\n", "print(f\"Merged result: {len(merged_list)} models\")\n", "print(\"\\nThe merge kept only models that appear in both datasets (inner join)\")" ] }, { "cell_type": "markdown", "id": "bd72621a", "metadata": {}, "source": [ "## Calculate accident rates\n", "\n", "Now that we have both accident counts and flight hours in the same dataset, we can calculate meaningful accident rates:" ] }, { "cell_type": "code", "execution_count": null, "id": "48c477d7", "metadata": {}, "outputs": [], "source": [ "# Calculate accident rate per flight hour\n", "merged_list[\"per_hour\"] = merged_list[\"accidents\"] / merged_list[\"total_hours\"]\n", "\n", "# Calculate accident rate per 100,000 flight hours (more readable)\n", "merged_list[\"per_100k_hours\"] = (merged_list[\"accidents\"] / merged_list[\"total_hours\"]) * 100_000\n", "\n", "print(\"Added accident rate calculations:\")\n", "merged_list[[\"latimes_make_and_model\", \"accidents\", \"total_hours\", \"per_100k_hours\"]].head()" ] }, { "cell_type": "markdown", "id": "3430d0df", "metadata": {}, "source": [ "## Analyze the results\n", "\n", "Now we can see which helicopter models have the highest accident rates:" ] }, { "cell_type": "code", "execution_count": null, "id": "7ca91c02", "metadata": {}, "outputs": [], "source": [ "# Sort by accident rate to see the most dangerous models\n", "dangerous_models = merged_list.sort_values(\"per_100k_hours\", ascending=False)\n", "print(\"Top 10 helicopter models by accident rate (per 100k flight hours):\")\n", "dangerous_models[[\"latimes_make_and_model\", \"accidents\", \"total_hours\", \"per_100k_hours\"]].head(10)" ] }, { "cell_type": "markdown", "id": "def339f0", "metadata": {}, "source": [ "## Different types of merges\n", "\n", "Pandas supports different types of merges. Let's explore what happens with different join types:" ] }, { "cell_type": "code", "execution_count": null, "id": "22f06fcf", "metadata": {}, "outputs": [], "source": [ "# Left join - keep all accident data, even without flight hours\n", "left_merge = pd.merge(accident_counts, survey, on=\"latimes_make_and_model\", how=\"left\")\n", "print(f\"Left merge result: {len(left_merge)} models (includes all accident data)\")\n", "\n", "# Check how many models don't have flight hour data\n", "missing_hours = left_merge[\"total_hours\"].isnull().sum()\n", "print(f\"Models with accidents but no flight hour data: {missing_hours}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "5d3ccced", "metadata": {}, "outputs": [], "source": [ "# Right join - keep all survey data, even without accidents\n", "right_merge = pd.merge(accident_counts, survey, on=\"latimes_make_and_model\", how=\"right\")\n", "print(f\"Right merge result: {len(right_merge)} models (includes all survey data)\")\n", "\n", "# Fill missing accident counts with 0\n", "right_merge[\"accidents\"] = right_merge[\"accidents\"].fillna(0)\n", "no_accidents = (right_merge[\"accidents\"] == 0).sum()\n", "print(f\"Models with flight hours but no recorded accidents: {no_accidents}\")" ] }, { "cell_type": "markdown", "id": "3e64b29d", "metadata": {}, "source": [ "## Key insights from the merge\n", "\n", "The merge operation revealed important insights about helicopter safety:" ] }, { "cell_type": "code", "execution_count": null, "id": "01e65f3b", "metadata": {}, "outputs": [], "source": [ "# Summary statistics\n", "print(\"Summary of accident rates:\")\n", "print(f\"Average accident rate: {merged_list['per_100k_hours'].mean():.2f} per 100k hours\")\n", "print(f\"Median accident rate: {merged_list['per_100k_hours'].median():.2f} per 100k hours\")\n", "print(f\"Highest accident rate: {merged_list['per_100k_hours'].max():.2f} per 100k hours\")\n", "print(f\"Lowest accident rate: {merged_list['per_100k_hours'].min():.2f} per 100k hours\")" ] }, { "cell_type": "markdown", "id": "3832f35d", "metadata": {}, "source": [ "Merging datasets is a fundamental skill in data journalism. It allows you to combine information from different sources to create more meaningful analyses, like calculating rates instead of just raw counts. This helicopter accident analysis demonstrates how merging accident data with flight hour data provides much more insight than looking at accidents alone." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }