Compute¶

Computing new values from existing data is one of the most common tasks in data analysis. Whether you’re calculating rates, percentages, or creating categorical variables, pandas provides powerful tools for creating new columns and transforming your data.

Setup data¶

import pandas as pd

# Load and prepare the data
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()

print(f"Loaded {len(accident_list)} accidents")
accident_list.head()

Loaded 163 accidents

	event_id	ntsb_make	ntsb_model	ntsb_number	year	date	city	state	country	total_fatalities	latimes_make	latimes_model	latimes_make_and_model
0	20061222X01838	BELL	407	NYC07FA048	2006	12/14/2006 00:00:00	DAGSBORO	DE	USA	2	BELL	407	BELL 407
1	20060817X01187	ROBINSON	R22 BETA	LAX06LA257	2006	08/10/2006 00:00:00	TUCSON	AZ	USA	1	ROBINSON	R22	ROBINSON R22
2	20060111X00044	ROBINSON	R44	MIA06FA039	2006	01/01/2006 00:00:00	GRAND RIDGE	FL	USA	3	ROBINSON	R44	ROBINSON R44
3	20060419X00461	ROBINSON	R44 II	DFW06FA102	2006	04/13/2006 00:00:00	FREDERICKSBURG	TX	USA	2	ROBINSON	R44	ROBINSON R44
4	20060208X00181	ROBINSON	R44	SEA06LA052	2006	02/06/2006 00:00:00	HELENA	MT	USA	1	ROBINSON	R44	ROBINSON R44

Basic arithmetic operations¶

You can perform mathematical operations on columns to create new computed fields:

# Calculate total people involved (fatalities + injuries)
accident_list["total_people"] = accident_list["total_fatalities"] + accident_list["total_serious_injuries"] + accident_list["total_minor_injuries"]

print("Added total_people column:")
accident_list[["total_fatalities", "total_serious_injuries", "total_minor_injuries", "total_people"]].head()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811 try:
-> 3812     return self._engine.get_loc(casted_key)
   3813 except KeyError as err:

File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'total_serious_injuries'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[2], line 2
      1 # Calculate total people involved (fatalities + injuries)
----> 2 accident_list["total_people"] = accident_list["total_fatalities"] + accident_list["total_serious_injuries"] + accident_list["total_minor_injuries"]
      4 print("Added total_people column:")
      5 accident_list[["total_fatalities", "total_serious_injuries", "total_minor_injuries", "total_people"]].head()

File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/frame.py:4107, in DataFrame.__getitem__(self, key)
   4105 if self.columns.nlevels > 1:
   4106     return self._getitem_multilevel(key)
-> 4107 indexer = self.columns.get_loc(key)
   4108 if is_integer(indexer):
   4109     indexer = [indexer]

File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     #  InvalidIndexError. Otherwise we fall through and re-raise
   3823     #  the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 'total_serious_injuries'

Conditional calculations¶

Use numpy.where() or boolean indexing for conditional calculations:

import numpy as np

# Create severity categories
accident_list["severity"] = np.where(
    accident_list["total_fatalities"] > 0, "Fatal",
    np.where(accident_list["total_serious_injuries"] > 0, "Serious", "Minor")
)

print("Severity categories:")
accident_list["severity"].value_counts()

Working with dates¶

Convert date strings to datetime objects for date calculations:

# Convert date column to datetime
accident_list["date"] = pd.to_datetime(accident_list["date"])

# Extract year, month, day
accident_list["year"] = accident_list["date"].dt.year
accident_list["month"] = accident_list["date"].dt.month
accident_list["day_of_week"] = accident_list["date"].dt.day_name()

print("Accidents by year:")
print(accident_list["year"].value_counts().sort_index())

Using apply() for complex calculations¶

For more complex computations, use the apply() method with custom functions:

def calculate_fatality_rate(row):
    """Calculate what percentage of people involved died"""
    if row["total_people"] == 0:
        return 0
    return (row["total_fatalities"] / row["total_people"]) * 100

accident_list["fatality_rate"] = accident_list.apply(calculate_fatality_rate, axis=1)

print("Fatality rate statistics:")
print(accident_list["fatality_rate"].describe())

Ranking and percentiles¶

Create rankings and percentile scores:

# Rank accidents by total people involved
accident_list["severity_rank"] = accident_list["total_people"].rank(ascending=False, method="dense")

# Show the most severe accidents
most_severe = accident_list.nsmallest(10, "severity_rank")
print("Top 10 most severe accidents by people involved:")
most_severe[["accident_number", "date", "location", "total_people", "total_fatalities", "severity_rank"]].head()

Computing new values from your data is essential for analysis. These techniques allow you to transform raw data into meaningful insights and create the specific metrics you need for your investigations.