Concat¶
Concatenation allows you to combine multiple DataFrames by stacking them vertically (adding rows) or horizontally (adding columns). This is useful when you have data split across multiple files or sources that need to be combined.
Setup data¶
Let’s create some sample datasets to demonstrate concatenation:
import pandas as pd
# Load the main accident data
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
print(f"Loaded {len(accident_list)} total accidents")
print(f"Date range: {accident_list['date'].min()} to {accident_list['date'].max()}")
Loaded 163 total accidents
Date range: 01/01/2006 00:00:00 to 12/31/2014 00:00:00
Vertical concatenation (combining rows)¶
Let’s split our data and then recombine it to demonstrate vertical concatenation:
# Split data by state for demonstration
california_accidents = accident_list[accident_list["state"] == "CA"]
texas_accidents = accident_list[accident_list["state"] == "TX"]
florida_accidents = accident_list[accident_list["state"] == "FL"]
print(f"California accidents: {len(california_accidents)}")
print(f"Texas accidents: {len(texas_accidents)}")
print(f"Florida accidents: {len(florida_accidents)}")
California accidents: 16
Texas accidents: 15
Florida accidents: 9
# Concatenate the state datasets back together
combined_states = pd.concat([california_accidents, texas_accidents, florida_accidents])
print(f"Combined dataset: {len(combined_states)} accidents")
print("\nState breakdown:")
print(combined_states["state"].value_counts())
Combined dataset: 40 accidents
State breakdown:
state
CA 16
TX 15
FL 9
Name: count, dtype: int64
Adding row labels during concatenation¶
You can add labels to identify which dataset each row came from:
# Concatenate with keys to identify source
labeled_concat = pd.concat(
[california_accidents, texas_accidents, florida_accidents],
keys=["CA", "TX", "FL"]
)
print(f"Labeled concatenation shape: {labeled_concat.shape}")
print("\nIndex structure:")
print(labeled_concat.index.names)
print(labeled_concat.head())
Labeled concatenation shape: (40, 13)
Index structure:
[None, None]
event_id ntsb_make ntsb_model ntsb_number year \
CA 7 20060916X01352 HUGHES 369D LAX06LA288 2006
9 20060306X00268 SCHWEIZER 269C-1 LAX06LA123 2006
16 20060510X00540 ROBINSON R44 RAVEN II LAX06FA156 2006
36 20080529X00746 AEROSPATIALE AS-350-D SEA08MA136 2008
48 20080710X01015 SCHWEIZER 269 C-1 LAX08LA213 2008
date city state country total_fatalities \
CA 7 09/06/2006 00:00:00 SOMIS CA USA 2
9 02/25/2006 00:00:00 VALENCIA CA USA 1
16 05/01/2006 00:00:00 DESERT CENTER CA USA 2
36 05/24/2008 00:00:00 AVALON CA USA 3
48 07/04/2008 00:00:00 SUNOL CA USA 2
latimes_make latimes_model latimes_make_and_model
CA 7 HUGHES 369 HUGHES 369
9 SCHWEIZER 269 SCHWEIZER 269
16 ROBINSON R44 ROBINSON R44
36 AIRBUS 350 AIRBUS 350
48 SCHWEIZER 269 SCHWEIZER 269
Horizontal concatenation (combining columns)¶
Let’s create additional data to demonstrate horizontal concatenation:
# Create summary statistics for each accident
accident_summary = pd.DataFrame({
'total_people': accident_list['total_fatalities'] +
accident_list['total_serious_injuries'] +
accident_list['total_minor_injuries'],
'severity_score': accident_list['total_fatalities'] * 3 +
accident_list['total_serious_injuries'] * 2 +
accident_list['total_minor_injuries'] * 1
}, index=accident_list.index)
print("Summary statistics:")
accident_summary.head()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
3811 try:
-> 3812 return self._engine.get_loc(casted_key)
3813 except KeyError as err:
File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'total_serious_injuries'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[5], line 4
1 # Create summary statistics for each accident
2 accident_summary = pd.DataFrame({
3 'total_people': accident_list['total_fatalities'] +
----> 4 accident_list['total_serious_injuries'] +
5 accident_list['total_minor_injuries'],
6 'severity_score': accident_list['total_fatalities'] * 3 +
7 accident_list['total_serious_injuries'] * 2 +
8 accident_list['total_minor_injuries'] * 1
9 }, index=accident_list.index)
11 print("Summary statistics:")
12 accident_summary.head()
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/frame.py:4107, in DataFrame.__getitem__(self, key)
4105 if self.columns.nlevels > 1:
4106 return self._getitem_multilevel(key)
-> 4107 indexer = self.columns.get_loc(key)
4108 if is_integer(indexer):
4109 indexer = [indexer]
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
3814 if isinstance(casted_key, slice) or (
3815 isinstance(casted_key, abc.Iterable)
3816 and any(isinstance(x, slice) for x in casted_key)
3817 ):
3818 raise InvalidIndexError(key)
-> 3819 raise KeyError(key) from err
3820 except TypeError:
3821 # If we have a listlike key, _check_indexing_error will raise
3822 # InvalidIndexError. Otherwise we fall through and re-raise
3823 # the TypeError.
3824 self._check_indexing_error(key)
KeyError: 'total_serious_injuries'
# Concatenate horizontally to add new columns
enhanced_data = pd.concat([accident_list, accident_summary], axis=1)
print(f"Original columns: {len(accident_list.columns)}")
print(f"Enhanced columns: {len(enhanced_data.columns)}")
print(f"New columns added: {list(accident_summary.columns)}")
Handling missing data during concatenation¶
When DataFrames have different columns, concatenation will fill missing values with NaN:
# Create DataFrames with different columns
basic_info = accident_list[['accident_number', 'date', 'state']].head(3)
detailed_info = accident_list[['accident_number', 'location', 'total_fatalities']].head(3)
print("Basic info columns:", list(basic_info.columns))
print("Detailed info columns:", list(detailed_info.columns))
# Concatenate DataFrames with different columns
mixed_concat = pd.concat([basic_info, detailed_info])
print("\nConcatenated result:")
print(mixed_concat)
Ignoring index during concatenation¶
Sometimes you want to reset the index when concatenating:
# Concatenate and reset index
reset_index_concat = pd.concat(
[california_accidents.head(2), texas_accidents.head(2)],
ignore_index=True
)
print("Concatenation with reset index:")
print(reset_index_concat[['accident_number', 'state', 'location']])
Practical example: Combining multiple CSV files¶
This is a common scenario when you have data split across multiple files:
# Simulate reading multiple files and combining them
# In practice, you might do something like:
# files = ['data2020.csv', 'data2021.csv', 'data2022.csv']
# dataframes = [pd.read_csv(f) for f in files]
# combined = pd.concat(dataframes, ignore_index=True)
# For demonstration, split by year and recombine
accident_list['date'] = pd.to_datetime(accident_list['date'])
accident_list['year'] = accident_list['date'].dt.year
year_2015 = accident_list[accident_list['year'] == 2015]
year_2016 = accident_list[accident_list['year'] == 2016]
year_2017 = accident_list[accident_list['year'] == 2017]
# Combine years with source tracking
multi_year = pd.concat(
[year_2015, year_2016, year_2017],
keys=['2015', '2016', '2017'],
names=['source_year', 'original_index']
)
print(f"Combined multi-year data: {len(multi_year)} accidents")
print("\nAccidents by source year:")
print(multi_year.groupby(level=0).size())
Concatenation is essential when working with data from multiple sources or files. It allows you to combine datasets efficiently while maintaining data integrity and providing options for handling mismatched structures.