Compute¶
Computing new values from existing data is one of the most common tasks in data analysis. Whether you’re calculating rates, percentages, or creating categorical variables, pandas provides powerful tools for creating new columns and transforming your data.
Setup data¶
import pandas as pd
# Load and prepare the data
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
print(f"Loaded {len(accident_list)} accidents")
accident_list.head()
Loaded 163 accidents
| event_id | ntsb_make | ntsb_model | ntsb_number | year | date | city | state | country | total_fatalities | latimes_make | latimes_model | latimes_make_and_model | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20061222X01838 | BELL | 407 | NYC07FA048 | 2006 | 12/14/2006 00:00:00 | DAGSBORO | DE | USA | 2 | BELL | 407 | BELL 407 |
| 1 | 20060817X01187 | ROBINSON | R22 BETA | LAX06LA257 | 2006 | 08/10/2006 00:00:00 | TUCSON | AZ | USA | 1 | ROBINSON | R22 | ROBINSON R22 |
| 2 | 20060111X00044 | ROBINSON | R44 | MIA06FA039 | 2006 | 01/01/2006 00:00:00 | GRAND RIDGE | FL | USA | 3 | ROBINSON | R44 | ROBINSON R44 |
| 3 | 20060419X00461 | ROBINSON | R44 II | DFW06FA102 | 2006 | 04/13/2006 00:00:00 | FREDERICKSBURG | TX | USA | 2 | ROBINSON | R44 | ROBINSON R44 |
| 4 | 20060208X00181 | ROBINSON | R44 | SEA06LA052 | 2006 | 02/06/2006 00:00:00 | HELENA | MT | USA | 1 | ROBINSON | R44 | ROBINSON R44 |
Basic arithmetic operations¶
You can perform mathematical operations on columns to create new computed fields:
# Calculate total people involved (fatalities + injuries)
accident_list["total_people"] = accident_list["total_fatalities"] + accident_list["total_serious_injuries"] + accident_list["total_minor_injuries"]
print("Added total_people column:")
accident_list[["total_fatalities", "total_serious_injuries", "total_minor_injuries", "total_people"]].head()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
3811 try:
-> 3812 return self._engine.get_loc(casted_key)
3813 except KeyError as err:
File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'total_serious_injuries'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[2], line 2
1 # Calculate total people involved (fatalities + injuries)
----> 2 accident_list["total_people"] = accident_list["total_fatalities"] + accident_list["total_serious_injuries"] + accident_list["total_minor_injuries"]
4 print("Added total_people column:")
5 accident_list[["total_fatalities", "total_serious_injuries", "total_minor_injuries", "total_people"]].head()
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/frame.py:4107, in DataFrame.__getitem__(self, key)
4105 if self.columns.nlevels > 1:
4106 return self._getitem_multilevel(key)
-> 4107 indexer = self.columns.get_loc(key)
4108 if is_integer(indexer):
4109 indexer = [indexer]
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
3814 if isinstance(casted_key, slice) or (
3815 isinstance(casted_key, abc.Iterable)
3816 and any(isinstance(x, slice) for x in casted_key)
3817 ):
3818 raise InvalidIndexError(key)
-> 3819 raise KeyError(key) from err
3820 except TypeError:
3821 # If we have a listlike key, _check_indexing_error will raise
3822 # InvalidIndexError. Otherwise we fall through and re-raise
3823 # the TypeError.
3824 self._check_indexing_error(key)
KeyError: 'total_serious_injuries'
Conditional calculations¶
Use numpy.where() or boolean indexing for conditional calculations:
import numpy as np
# Create severity categories
accident_list["severity"] = np.where(
accident_list["total_fatalities"] > 0, "Fatal",
np.where(accident_list["total_serious_injuries"] > 0, "Serious", "Minor")
)
print("Severity categories:")
accident_list["severity"].value_counts()
Working with dates¶
Convert date strings to datetime objects for date calculations:
# Convert date column to datetime
accident_list["date"] = pd.to_datetime(accident_list["date"])
# Extract year, month, day
accident_list["year"] = accident_list["date"].dt.year
accident_list["month"] = accident_list["date"].dt.month
accident_list["day_of_week"] = accident_list["date"].dt.day_name()
print("Accidents by year:")
print(accident_list["year"].value_counts().sort_index())
Using apply() for complex calculations¶
For more complex computations, use the apply() method with custom functions:
def calculate_fatality_rate(row):
"""Calculate what percentage of people involved died"""
if row["total_people"] == 0:
return 0
return (row["total_fatalities"] / row["total_people"]) * 100
accident_list["fatality_rate"] = accident_list.apply(calculate_fatality_rate, axis=1)
print("Fatality rate statistics:")
print(accident_list["fatality_rate"].describe())
Ranking and percentiles¶
Create rankings and percentile scores:
# Rank accidents by total people involved
accident_list["severity_rank"] = accident_list["total_people"].rank(ascending=False, method="dense")
# Show the most severe accidents
most_severe = accident_list.nsmallest(10, "severity_rank")
print("Top 10 most severe accidents by people involved:")
most_severe[["accident_number", "date", "location", "total_people", "total_fatalities", "severity_rank"]].head()
Computing new values from your data is essential for analysis. These techniques allow you to transform raw data into meaningful insights and create the specific metrics you need for your investigations.