Merge¶

One of the most powerful features of pandas is the ability to merge datasets together. This is essential for data journalism, where you often need to combine information from multiple sources to tell a complete story.

In our helicopter accident analysis, we want to combine accident data with flight hour data to calculate accident rates - a much more meaningful metric than raw accident counts.

Setup: Load both datasets¶

Let’s start by loading both our accident data and the FAA survey data that contains flight hours:

import pandas as pd

# Load accident data
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()

print(f"Loaded {len(accident_list)} accidents")
accident_list.head()

Loaded 163 accidents

	event_id	ntsb_make	ntsb_model	ntsb_number	year	date	city	state	country	total_fatalities	latimes_make	latimes_model	latimes_make_and_model
0	20061222X01838	BELL	407	NYC07FA048	2006	12/14/2006 00:00:00	DAGSBORO	DE	USA	2	BELL	407	BELL 407
1	20060817X01187	ROBINSON	R22 BETA	LAX06LA257	2006	08/10/2006 00:00:00	TUCSON	AZ	USA	1	ROBINSON	R22	ROBINSON R22
2	20060111X00044	ROBINSON	R44	MIA06FA039	2006	01/01/2006 00:00:00	GRAND RIDGE	FL	USA	3	ROBINSON	R44	ROBINSON R44
3	20060419X00461	ROBINSON	R44 II	DFW06FA102	2006	04/13/2006 00:00:00	FREDERICKSBURG	TX	USA	2	ROBINSON	R44	ROBINSON R44
4	20060208X00181	ROBINSON	R44	SEA06LA052	2006	02/06/2006 00:00:00	HELENA	MT	USA	1	ROBINSON	R44	ROBINSON R44

# Load FAA survey data with flight hours
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()

print(f"Loaded {len(survey)} helicopter models with flight hour data")
survey.head()

Loaded 12 helicopter models with flight hour data

	latimes_make_and_model	total_hours
0	AGUSTA 109	362172
1	AIRBUS 130	1053786
2	AIRBUS 135	884596
3	AIRBUS 350	3883490
4	BELL 206	5501308

Prepare data for merging¶

Before merging, we need to create a summary of accidents by helicopter model:

# Count accidents by make and model
accident_counts = accident_list.groupby(["latimes_make", "latimes_make_and_model"]).size().rename("accidents").reset_index()

print(f"Accident counts for {len(accident_counts)} helicopter models")
accident_counts.head()

Accident counts for 12 helicopter models

	latimes_make	latimes_make_and_model	accidents
0	AGUSTA	AGUSTA 109	2
1	AIRBUS	AIRBUS 130	1
2	AIRBUS	AIRBUS 135	4
3	AIRBUS	AIRBUS 350	29
4	BELL	BELL 206	30

Basic merge operation¶

Now we can merge the accident counts with the flight hour data. We’ll use the merge function to combine the datasets on the latimes_make_and_model column:

# Merge the datasets
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")

print(f"Merged dataset has {len(merged_list)} helicopter models")
merged_list.head()

Merged dataset has 12 helicopter models

	latimes_make	latimes_make_and_model	accidents	total_hours
0	AGUSTA	AGUSTA 109	2	362172
1	AIRBUS	AIRBUS 130	1	1053786
2	AIRBUS	AIRBUS 135	4	884596
3	AIRBUS	AIRBUS 350	29	3883490
4	BELL	BELL 206	30	5501308

Understanding the merge¶

Let’s examine what happened during the merge:

print(f"Original accident counts: {len(accident_counts)} models")
print(f"Original survey data: {len(survey)} models")
print(f"Merged result: {len(merged_list)} models")
print("\nThe merge kept only models that appear in both datasets (inner join)")

Original accident counts: 12 models
Original survey data: 12 models
Merged result: 12 models

The merge kept only models that appear in both datasets (inner join)

Calculate accident rates¶

Now that we have both accident counts and flight hours in the same dataset, we can calculate meaningful accident rates:

# Calculate accident rate per flight hour
merged_list["per_hour"] = merged_list["accidents"] / merged_list["total_hours"]

# Calculate accident rate per 100,000 flight hours (more readable)
merged_list["per_100k_hours"] = (merged_list["accidents"] / merged_list["total_hours"]) * 100_000

print("Added accident rate calculations:")
merged_list[["latimes_make_and_model", "accidents", "total_hours", "per_100k_hours"]].head()

Added accident rate calculations:

	latimes_make_and_model	accidents	total_hours	per_100k_hours
0	AGUSTA 109	2	362172	0.552224
1	AIRBUS 130	1	1053786	0.094896
2	AIRBUS 135	4	884596	0.452184
3	AIRBUS 350	29	3883490	0.746751
4	BELL 206	30	5501308	0.545325

Analyze the results¶

Now we can see which helicopter models have the highest accident rates:

# Sort by accident rate to see the most dangerous models
dangerous_models = merged_list.sort_values("per_100k_hours", ascending=False)
print("Top 10 helicopter models by accident rate (per 100k flight hours):")
dangerous_models[["latimes_make_and_model", "accidents", "total_hours", "per_100k_hours"]].head(10)

Top 10 helicopter models by accident rate (per 100k flight hours):

	latimes_make_and_model	accidents	total_hours	per_100k_hours
9	ROBINSON R44	38	2359729	1.610354
7	MCDONNELL DOUGLAS 369	6	550689	1.089544
6	HUGHES 369	13	1201688	1.081812
3	AIRBUS 350	29	3883490	0.746751
8	ROBINSON R22	20	2970806	0.673218
5	BELL 407	13	2113788	0.615010
0	AGUSTA 109	2	362172	0.552224
4	BELL 206	30	5501308	0.545325
2	AIRBUS 135	4	884596	0.452184
10	SCHWEIZER 269	5	1139326	0.438856

Different types of merges¶

Pandas supports different types of merges. Let’s explore what happens with different join types:

# Left join - keep all accident data, even without flight hours
left_merge = pd.merge(accident_counts, survey, on="latimes_make_and_model", how="left")
print(f"Left merge result: {len(left_merge)} models (includes all accident data)")

# Check how many models don't have flight hour data
missing_hours = left_merge["total_hours"].isnull().sum()
print(f"Models with accidents but no flight hour data: {missing_hours}")

Left merge result: 12 models (includes all accident data)
Models with accidents but no flight hour data: 0

# Right join - keep all survey data, even without accidents
right_merge = pd.merge(accident_counts, survey, on="latimes_make_and_model", how="right")
print(f"Right merge result: {len(right_merge)} models (includes all survey data)")

# Fill missing accident counts with 0
right_merge["accidents"] = right_merge["accidents"].fillna(0)
no_accidents = (right_merge["accidents"] == 0).sum()
print(f"Models with flight hours but no recorded accidents: {no_accidents}")

Right merge result: 12 models (includes all survey data)
Models with flight hours but no recorded accidents: 0

Key insights from the merge¶

The merge operation revealed important insights about helicopter safety:

# Summary statistics
print("Summary of accident rates:")
print(f"Average accident rate: {merged_list['per_100k_hours'].mean():.2f} per 100k hours")
print(f"Median accident rate: {merged_list['per_100k_hours'].median():.2f} per 100k hours")
print(f"Highest accident rate: {merged_list['per_100k_hours'].max():.2f} per 100k hours")
print(f"Lowest accident rate: {merged_list['per_100k_hours'].min():.2f} per 100k hours")

Summary of accident rates:
Average accident rate: 0.68 per 100k hours
Median accident rate: 0.58 per 100k hours
Highest accident rate: 1.61 per 100k hours
Lowest accident rate: 0.09 per 100k hours

Merging datasets is a fundamental skill in data journalism. It allows you to combine information from different sources to create more meaningful analyses, like calculating rates instead of just raw counts. This helicopter accident analysis demonstrates how merging accident data with flight hour data provides much more insight than looking at accidents alone.