Merge

One of the most powerful features of pandas is the ability to merge datasets together. This is essential for data journalism, where you often need to combine information from multiple sources to tell a complete story.

In our helicopter accident analysis, we want to combine accident data with flight hour data to calculate accident rates - a much more meaningful metric than raw accident counts.

Setup: Load both datasets

Let’s start by loading both our accident data and the FAA survey data that contains flight hours:

import pandas as pd

# Load accident data
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()

print(f"Loaded {len(accident_list)} accidents")
accident_list.head()
Loaded 163 accidents
event_id ntsb_make ntsb_model ntsb_number year date city state country total_fatalities latimes_make latimes_model latimes_make_and_model
0 20061222X01838 BELL 407 NYC07FA048 2006 12/14/2006 00:00:00 DAGSBORO DE USA 2 BELL 407 BELL 407
1 20060817X01187 ROBINSON R22 BETA LAX06LA257 2006 08/10/2006 00:00:00 TUCSON AZ USA 1 ROBINSON R22 ROBINSON R22
2 20060111X00044 ROBINSON R44 MIA06FA039 2006 01/01/2006 00:00:00 GRAND RIDGE FL USA 3 ROBINSON R44 ROBINSON R44
3 20060419X00461 ROBINSON R44 II DFW06FA102 2006 04/13/2006 00:00:00 FREDERICKSBURG TX USA 2 ROBINSON R44 ROBINSON R44
4 20060208X00181 ROBINSON R44 SEA06LA052 2006 02/06/2006 00:00:00 HELENA MT USA 1 ROBINSON R44 ROBINSON R44
# Load FAA survey data with flight hours
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()

print(f"Loaded {len(survey)} helicopter models with flight hour data")
survey.head()
Loaded 12 helicopter models with flight hour data
latimes_make_and_model total_hours
0 AGUSTA 109 362172
1 AIRBUS 130 1053786
2 AIRBUS 135 884596
3 AIRBUS 350 3883490
4 BELL 206 5501308

Prepare data for merging

Before merging, we need to create a summary of accidents by helicopter model:

# Count accidents by make and model
accident_counts = accident_list.groupby(["latimes_make", "latimes_make_and_model"]).size().rename("accidents").reset_index()

print(f"Accident counts for {len(accident_counts)} helicopter models")
accident_counts.head()
Accident counts for 12 helicopter models
latimes_make latimes_make_and_model accidents
0 AGUSTA AGUSTA 109 2
1 AIRBUS AIRBUS 130 1
2 AIRBUS AIRBUS 135 4
3 AIRBUS AIRBUS 350 29
4 BELL BELL 206 30

Basic merge operation

Now we can merge the accident counts with the flight hour data. We’ll use the merge function to combine the datasets on the latimes_make_and_model column:

# Merge the datasets
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")

print(f"Merged dataset has {len(merged_list)} helicopter models")
merged_list.head()
Merged dataset has 12 helicopter models
latimes_make latimes_make_and_model accidents total_hours
0 AGUSTA AGUSTA 109 2 362172
1 AIRBUS AIRBUS 130 1 1053786
2 AIRBUS AIRBUS 135 4 884596
3 AIRBUS AIRBUS 350 29 3883490
4 BELL BELL 206 30 5501308

Understanding the merge

Let’s examine what happened during the merge:

print(f"Original accident counts: {len(accident_counts)} models")
print(f"Original survey data: {len(survey)} models")
print(f"Merged result: {len(merged_list)} models")
print("\nThe merge kept only models that appear in both datasets (inner join)")
Original accident counts: 12 models
Original survey data: 12 models
Merged result: 12 models

The merge kept only models that appear in both datasets (inner join)

Calculate accident rates

Now that we have both accident counts and flight hours in the same dataset, we can calculate meaningful accident rates:

# Calculate accident rate per flight hour
merged_list["per_hour"] = merged_list["accidents"] / merged_list["total_hours"]

# Calculate accident rate per 100,000 flight hours (more readable)
merged_list["per_100k_hours"] = (merged_list["accidents"] / merged_list["total_hours"]) * 100_000

print("Added accident rate calculations:")
merged_list[["latimes_make_and_model", "accidents", "total_hours", "per_100k_hours"]].head()
Added accident rate calculations:
latimes_make_and_model accidents total_hours per_100k_hours
0 AGUSTA 109 2 362172 0.552224
1 AIRBUS 130 1 1053786 0.094896
2 AIRBUS 135 4 884596 0.452184
3 AIRBUS 350 29 3883490 0.746751
4 BELL 206 30 5501308 0.545325

Analyze the results

Now we can see which helicopter models have the highest accident rates:

# Sort by accident rate to see the most dangerous models
dangerous_models = merged_list.sort_values("per_100k_hours", ascending=False)
print("Top 10 helicopter models by accident rate (per 100k flight hours):")
dangerous_models[["latimes_make_and_model", "accidents", "total_hours", "per_100k_hours"]].head(10)
Top 10 helicopter models by accident rate (per 100k flight hours):
latimes_make_and_model accidents total_hours per_100k_hours
9 ROBINSON R44 38 2359729 1.610354
7 MCDONNELL DOUGLAS 369 6 550689 1.089544
6 HUGHES 369 13 1201688 1.081812
3 AIRBUS 350 29 3883490 0.746751
8 ROBINSON R22 20 2970806 0.673218
5 BELL 407 13 2113788 0.615010
0 AGUSTA 109 2 362172 0.552224
4 BELL 206 30 5501308 0.545325
2 AIRBUS 135 4 884596 0.452184
10 SCHWEIZER 269 5 1139326 0.438856

Different types of merges

Pandas supports different types of merges. Let’s explore what happens with different join types:

# Left join - keep all accident data, even without flight hours
left_merge = pd.merge(accident_counts, survey, on="latimes_make_and_model", how="left")
print(f"Left merge result: {len(left_merge)} models (includes all accident data)")

# Check how many models don't have flight hour data
missing_hours = left_merge["total_hours"].isnull().sum()
print(f"Models with accidents but no flight hour data: {missing_hours}")
Left merge result: 12 models (includes all accident data)
Models with accidents but no flight hour data: 0
# Right join - keep all survey data, even without accidents
right_merge = pd.merge(accident_counts, survey, on="latimes_make_and_model", how="right")
print(f"Right merge result: {len(right_merge)} models (includes all survey data)")

# Fill missing accident counts with 0
right_merge["accidents"] = right_merge["accidents"].fillna(0)
no_accidents = (right_merge["accidents"] == 0).sum()
print(f"Models with flight hours but no recorded accidents: {no_accidents}")
Right merge result: 12 models (includes all survey data)
Models with flight hours but no recorded accidents: 0

Key insights from the merge

The merge operation revealed important insights about helicopter safety:

# Summary statistics
print("Summary of accident rates:")
print(f"Average accident rate: {merged_list['per_100k_hours'].mean():.2f} per 100k hours")
print(f"Median accident rate: {merged_list['per_100k_hours'].median():.2f} per 100k hours")
print(f"Highest accident rate: {merged_list['per_100k_hours'].max():.2f} per 100k hours")
print(f"Lowest accident rate: {merged_list['per_100k_hours'].min():.2f} per 100k hours")
Summary of accident rates:
Average accident rate: 0.68 per 100k hours
Median accident rate: 0.58 per 100k hours
Highest accident rate: 1.61 per 100k hours
Lowest accident rate: 0.09 per 100k hours

Merging datasets is a fundamental skill in data journalism. It allows you to combine information from different sources to create more meaningful analyses, like calculating rates instead of just raw counts. This helicopter accident analysis demonstrates how merging accident data with flight hour data provides much more insight than looking at accidents alone.