Columns¶
One of the most powerful features of pandas is the ability to access and analyze individual columns from your DataFrame. This allows you to focus on specific variables and perform detailed analysis on the data that matters most to your investigation.
Setup data¶
Let’s start by loading our accident data:
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
print(f"Loaded {len(accident_list)} accidents")
print(f"Columns: {list(accident_list.columns)}")
Loaded 163 accidents
Columns: ['event_id', 'ntsb_make', 'ntsb_model', 'ntsb_number', 'year', 'date', 'city', 'state', 'country', 'total_fatalities', 'latimes_make', 'latimes_model', 'latimes_make_and_model']
Accessing individual columns¶
We’ll begin with the latimes_make_and_model column, which records the standardized name of each helicopter that crashed. To access its contents separate from the rest of the DataFrame, append a pair of square brackets with the column’s name in quotes inside:
# Access a specific column
accident_list["latimes_make_and_model"]
0 BELL 407
1 ROBINSON R22
2 ROBINSON R44
3 ROBINSON R44
4 ROBINSON R44
...
158 BELL 407
159 SCHWEIZER 269
160 BELL 206
161 AIRBUS 350
162 ROBINSON R44
Name: latimes_make_and_model, Length: 163, dtype: object
That will list the column out as a Series, just like the ones we created from scratch earlier. Just as we did then, you can now start tacking on additional methods that will analyze the contents of the column.
Analyzing column data¶
There’s a built-in pandas tool that will total up the frequency of values in a column. The method is called value_counts and it’s just as easy to use as sum, min or max. All you need to do is add a period after the column name and chain it on the tail end of your code:
accident_list["latimes_make_and_model"].value_counts()
latimes_make_and_model
ROBINSON R44 38
AIRBUS 350 29
BELL 206 28
ROBINSON R22 20
BELL 407 13
HUGHES 369 13
MCDONNELL DOUGLAS 369 6
SCHWEIZER 269 5
AIRBUS 135 4
bell 206 2
SIKORSKY 76 2
AGUSTA 109 2
AIRBUS 130 1
Name: count, dtype: int64
Working with numeric columns¶
Let’s look at the fatalities column and see what statistical methods we can apply:
# Basic statistics for total fatalities
fatalities = accident_list["total_fatalities"]
print(f"Total fatalities across all accidents: {fatalities.sum()}")
print(f"Average fatalities per accident: {fatalities.mean():.2f}")
print(f"Maximum fatalities in single accident: {fatalities.max()}")
print(f"Minimum fatalities: {fatalities.min()}")
Total fatalities across all accidents: 336
Average fatalities per accident: 2.06
Maximum fatalities in single accident: 9
Minimum fatalities: 1
String operations on columns¶
For text columns, pandas provides special string methods through the .str accessor:
# Convert make and model to uppercase for consistency
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
print("Converted to uppercase:")
accident_list["latimes_make_and_model"].value_counts().head()
Converted to uppercase:
latimes_make_and_model
ROBINSON R44 38
BELL 206 30
AIRBUS 350 29
ROBINSON R22 20
BELL 407 13
Name: count, dtype: int64
# Check if location contains certain words
airport_mask = accident_list["location"].str.contains("Airport", na=False)
print(f"Accidents at locations containing 'Airport': {airport_mask.sum()}")
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
3811 try:
-> 3812 return self._engine.get_loc(casted_key)
3813 except KeyError as err:
File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'location'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[6], line 2
1 # Check if location contains certain words
----> 2 airport_mask = accident_list["location"].str.contains("Airport", na=False)
3 print(f"Accidents at locations containing 'Airport': {airport_mask.sum()}")
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/frame.py:4107, in DataFrame.__getitem__(self, key)
4105 if self.columns.nlevels > 1:
4106 return self._getitem_multilevel(key)
-> 4107 indexer = self.columns.get_loc(key)
4108 if is_integer(indexer):
4109 indexer = [indexer]
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
3814 if isinstance(casted_key, slice) or (
3815 isinstance(casted_key, abc.Iterable)
3816 and any(isinstance(x, slice) for x in casted_key)
3817 ):
3818 raise InvalidIndexError(key)
-> 3819 raise KeyError(key) from err
3820 except TypeError:
3821 # If we have a listlike key, _check_indexing_error will raise
3822 # InvalidIndexError. Otherwise we fall through and re-raise
3823 # the TypeError.
3824 self._check_indexing_error(key)
KeyError: 'location'
Creating new columns¶
You can create new columns by assigning values to a new column name:
# Create a column indicating if accident was fatal
accident_list["is_fatal"] = accident_list["total_fatalities"] > 0
print("Fatal vs non-fatal accidents:")
accident_list["is_fatal"].value_counts()
# Create a severity category based on fatalities
def categorize_severity(fatalities):
if fatalities == 0:
return "Non-fatal"
elif fatalities <= 2:
return "Low fatality"
else:
return "High fatality"
accident_list["severity_category"] = accident_list["total_fatalities"].apply(categorize_severity)
print("Accident severity categories:")
accident_list["severity_category"].value_counts()
Selecting multiple columns¶
You can select multiple columns by passing a list of column names:
# Select specific columns for analysis
key_columns = ["accident_number", "date", "state", "latimes_make_and_model", "total_fatalities", "severity_category"]
summary_data = accident_list[key_columns]
summary_data.head()
Column information¶
You can get information about your columns and their data types:
# Get column data types
print("Column data types:")
print(accident_list.dtypes)
# Check for missing values in each column
print("Missing values per column:")
print(accident_list.isnull().sum())
Working with individual columns is fundamental to pandas data analysis. These techniques allow you to clean, transform, and analyze your data column by column, building up to more complex analyses that combine multiple variables.