Merge¶
One of the most powerful features of pandas is the ability to merge datasets together. This is essential for data journalism, where you often need to combine information from multiple sources to tell a complete story.
In our helicopter accident analysis, we want to combine accident data with flight hour data to calculate accident rates - a much more meaningful metric than raw accident counts.
Setup: Load both datasets¶
Let’s start by loading both our accident data and the FAA survey data that contains flight hours:
import pandas as pd
# Load accident data
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
print(f"Loaded {len(accident_list)} accidents")
accident_list.head()
Loaded 163 accidents
| event_id | ntsb_make | ntsb_model | ntsb_number | year | date | city | state | country | total_fatalities | latimes_make | latimes_model | latimes_make_and_model | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20061222X01838 | BELL | 407 | NYC07FA048 | 2006 | 12/14/2006 00:00:00 | DAGSBORO | DE | USA | 2 | BELL | 407 | BELL 407 |
| 1 | 20060817X01187 | ROBINSON | R22 BETA | LAX06LA257 | 2006 | 08/10/2006 00:00:00 | TUCSON | AZ | USA | 1 | ROBINSON | R22 | ROBINSON R22 |
| 2 | 20060111X00044 | ROBINSON | R44 | MIA06FA039 | 2006 | 01/01/2006 00:00:00 | GRAND RIDGE | FL | USA | 3 | ROBINSON | R44 | ROBINSON R44 |
| 3 | 20060419X00461 | ROBINSON | R44 II | DFW06FA102 | 2006 | 04/13/2006 00:00:00 | FREDERICKSBURG | TX | USA | 2 | ROBINSON | R44 | ROBINSON R44 |
| 4 | 20060208X00181 | ROBINSON | R44 | SEA06LA052 | 2006 | 02/06/2006 00:00:00 | HELENA | MT | USA | 1 | ROBINSON | R44 | ROBINSON R44 |
# Load FAA survey data with flight hours
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
print(f"Loaded {len(survey)} helicopter models with flight hour data")
survey.head()
Loaded 12 helicopter models with flight hour data
| latimes_make_and_model | total_hours | |
|---|---|---|
| 0 | AGUSTA 109 | 362172 |
| 1 | AIRBUS 130 | 1053786 |
| 2 | AIRBUS 135 | 884596 |
| 3 | AIRBUS 350 | 3883490 |
| 4 | BELL 206 | 5501308 |
Prepare data for merging¶
Before merging, we need to create a summary of accidents by helicopter model:
# Count accidents by make and model
accident_counts = accident_list.groupby(["latimes_make", "latimes_make_and_model"]).size().rename("accidents").reset_index()
print(f"Accident counts for {len(accident_counts)} helicopter models")
accident_counts.head()
Accident counts for 12 helicopter models
| latimes_make | latimes_make_and_model | accidents | |
|---|---|---|---|
| 0 | AGUSTA | AGUSTA 109 | 2 |
| 1 | AIRBUS | AIRBUS 130 | 1 |
| 2 | AIRBUS | AIRBUS 135 | 4 |
| 3 | AIRBUS | AIRBUS 350 | 29 |
| 4 | BELL | BELL 206 | 30 |
Basic merge operation¶
Now we can merge the accident counts with the flight hour data. We’ll use the merge function to combine the datasets on the latimes_make_and_model column:
# Merge the datasets
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")
print(f"Merged dataset has {len(merged_list)} helicopter models")
merged_list.head()
Merged dataset has 12 helicopter models
| latimes_make | latimes_make_and_model | accidents | total_hours | |
|---|---|---|---|---|
| 0 | AGUSTA | AGUSTA 109 | 2 | 362172 |
| 1 | AIRBUS | AIRBUS 130 | 1 | 1053786 |
| 2 | AIRBUS | AIRBUS 135 | 4 | 884596 |
| 3 | AIRBUS | AIRBUS 350 | 29 | 3883490 |
| 4 | BELL | BELL 206 | 30 | 5501308 |
Understanding the merge¶
Let’s examine what happened during the merge:
print(f"Original accident counts: {len(accident_counts)} models")
print(f"Original survey data: {len(survey)} models")
print(f"Merged result: {len(merged_list)} models")
print("\nThe merge kept only models that appear in both datasets (inner join)")
Original accident counts: 12 models
Original survey data: 12 models
Merged result: 12 models
The merge kept only models that appear in both datasets (inner join)
Calculate accident rates¶
Now that we have both accident counts and flight hours in the same dataset, we can calculate meaningful accident rates:
# Calculate accident rate per flight hour
merged_list["per_hour"] = merged_list["accidents"] / merged_list["total_hours"]
# Calculate accident rate per 100,000 flight hours (more readable)
merged_list["per_100k_hours"] = (merged_list["accidents"] / merged_list["total_hours"]) * 100_000
print("Added accident rate calculations:")
merged_list[["latimes_make_and_model", "accidents", "total_hours", "per_100k_hours"]].head()
Added accident rate calculations:
| latimes_make_and_model | accidents | total_hours | per_100k_hours | |
|---|---|---|---|---|
| 0 | AGUSTA 109 | 2 | 362172 | 0.552224 |
| 1 | AIRBUS 130 | 1 | 1053786 | 0.094896 |
| 2 | AIRBUS 135 | 4 | 884596 | 0.452184 |
| 3 | AIRBUS 350 | 29 | 3883490 | 0.746751 |
| 4 | BELL 206 | 30 | 5501308 | 0.545325 |
Analyze the results¶
Now we can see which helicopter models have the highest accident rates:
# Sort by accident rate to see the most dangerous models
dangerous_models = merged_list.sort_values("per_100k_hours", ascending=False)
print("Top 10 helicopter models by accident rate (per 100k flight hours):")
dangerous_models[["latimes_make_and_model", "accidents", "total_hours", "per_100k_hours"]].head(10)
Top 10 helicopter models by accident rate (per 100k flight hours):
| latimes_make_and_model | accidents | total_hours | per_100k_hours | |
|---|---|---|---|---|
| 9 | ROBINSON R44 | 38 | 2359729 | 1.610354 |
| 7 | MCDONNELL DOUGLAS 369 | 6 | 550689 | 1.089544 |
| 6 | HUGHES 369 | 13 | 1201688 | 1.081812 |
| 3 | AIRBUS 350 | 29 | 3883490 | 0.746751 |
| 8 | ROBINSON R22 | 20 | 2970806 | 0.673218 |
| 5 | BELL 407 | 13 | 2113788 | 0.615010 |
| 0 | AGUSTA 109 | 2 | 362172 | 0.552224 |
| 4 | BELL 206 | 30 | 5501308 | 0.545325 |
| 2 | AIRBUS 135 | 4 | 884596 | 0.452184 |
| 10 | SCHWEIZER 269 | 5 | 1139326 | 0.438856 |
Different types of merges¶
Pandas supports different types of merges. Let’s explore what happens with different join types:
# Left join - keep all accident data, even without flight hours
left_merge = pd.merge(accident_counts, survey, on="latimes_make_and_model", how="left")
print(f"Left merge result: {len(left_merge)} models (includes all accident data)")
# Check how many models don't have flight hour data
missing_hours = left_merge["total_hours"].isnull().sum()
print(f"Models with accidents but no flight hour data: {missing_hours}")
Left merge result: 12 models (includes all accident data)
Models with accidents but no flight hour data: 0
# Right join - keep all survey data, even without accidents
right_merge = pd.merge(accident_counts, survey, on="latimes_make_and_model", how="right")
print(f"Right merge result: {len(right_merge)} models (includes all survey data)")
# Fill missing accident counts with 0
right_merge["accidents"] = right_merge["accidents"].fillna(0)
no_accidents = (right_merge["accidents"] == 0).sum()
print(f"Models with flight hours but no recorded accidents: {no_accidents}")
Right merge result: 12 models (includes all survey data)
Models with flight hours but no recorded accidents: 0
Key insights from the merge¶
The merge operation revealed important insights about helicopter safety:
# Summary statistics
print("Summary of accident rates:")
print(f"Average accident rate: {merged_list['per_100k_hours'].mean():.2f} per 100k hours")
print(f"Median accident rate: {merged_list['per_100k_hours'].median():.2f} per 100k hours")
print(f"Highest accident rate: {merged_list['per_100k_hours'].max():.2f} per 100k hours")
print(f"Lowest accident rate: {merged_list['per_100k_hours'].min():.2f} per 100k hours")
Summary of accident rates:
Average accident rate: 0.68 per 100k hours
Median accident rate: 0.58 per 100k hours
Highest accident rate: 1.61 per 100k hours
Lowest accident rate: 0.09 per 100k hours
Merging datasets is a fundamental skill in data journalism. It allows you to combine information from different sources to create more meaningful analyses, like calculating rates instead of just raw counts. This helicopter accident analysis demonstrates how merging accident data with flight hour data provides much more insight than looking at accidents alone.