Columns¶

One of the most powerful features of pandas is the ability to access and analyze individual columns from your DataFrame. This allows you to focus on specific variables and perform detailed analysis on the data that matters most to your investigation.

Setup data¶

Let’s start by loading our accident data:

import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")

print(f"Loaded {len(accident_list)} accidents")
print(f"Columns: {list(accident_list.columns)}")

Loaded 163 accidents
Columns: ['event_id', 'ntsb_make', 'ntsb_model', 'ntsb_number', 'year', 'date', 'city', 'state', 'country', 'total_fatalities', 'latimes_make', 'latimes_model', 'latimes_make_and_model']

Accessing individual columns¶

We’ll begin with the latimes_make_and_model column, which records the standardized name of each helicopter that crashed. To access its contents separate from the rest of the DataFrame, append a pair of square brackets with the column’s name in quotes inside:

# Access a specific column
accident_list["latimes_make_and_model"]

         BELL 407
     ROBINSON R22
     ROBINSON R44
     ROBINSON R44
     ROBINSON R44
           ...      
       BELL 407
  SCHWEIZER 269
       BELL 206
     AIRBUS 350
   ROBINSON R44
Name: latimes_make_and_model, Length: 163, dtype: object

That will list the column out as a Series, just like the ones we created from scratch earlier. Just as we did then, you can now start tacking on additional methods that will analyze the contents of the column.

Analyzing column data¶

There’s a built-in pandas tool that will total up the frequency of values in a column. The method is called value_counts and it’s just as easy to use as sum, min or max. All you need to do is add a period after the column name and chain it on the tail end of your code:

accident_list["latimes_make_and_model"].value_counts()

latimes_make_and_model
ROBINSON R44             38
AIRBUS 350               29
BELL 206                 28
ROBINSON R22             20
BELL 407                 13
HUGHES 369               13
MCDONNELL DOUGLAS 369     6
SCHWEIZER 269             5
AIRBUS 135                4
bell 206                  2
SIKORSKY 76               2
AGUSTA 109                2
AIRBUS 130                1
Name: count, dtype: int64

Working with numeric columns¶

Let’s look at the fatalities column and see what statistical methods we can apply:

# Basic statistics for total fatalities
fatalities = accident_list["total_fatalities"]
print(f"Total fatalities across all accidents: {fatalities.sum()}")
print(f"Average fatalities per accident: {fatalities.mean():.2f}")
print(f"Maximum fatalities in single accident: {fatalities.max()}")
print(f"Minimum fatalities: {fatalities.min()}")

Total fatalities across all accidents: 336
Average fatalities per accident: 2.06
Maximum fatalities in single accident: 9
Minimum fatalities: 1

String operations on columns¶

For text columns, pandas provides special string methods through the .str accessor:

# Convert make and model to uppercase for consistency
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
print("Converted to uppercase:")
accident_list["latimes_make_and_model"].value_counts().head()

Converted to uppercase:

latimes_make_and_model
ROBINSON R44    38
BELL 206        30
AIRBUS 350      29
ROBINSON R22    20
BELL 407        13
Name: count, dtype: int64

# Check if location contains certain words
airport_mask = accident_list["location"].str.contains("Airport", na=False)
print(f"Accidents at locations containing 'Airport': {airport_mask.sum()}")

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811 try:
-> 3812     return self._engine.get_loc(casted_key)
   3813 except KeyError as err:

File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'location'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[6], line 2
      1 # Check if location contains certain words
----> 2 airport_mask = accident_list["location"].str.contains("Airport", na=False)
      3 print(f"Accidents at locations containing 'Airport': {airport_mask.sum()}")

File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/frame.py:4107, in DataFrame.__getitem__(self, key)
   4105 if self.columns.nlevels > 1:
   4106     return self._getitem_multilevel(key)
-> 4107 indexer = self.columns.get_loc(key)
   4108 if is_integer(indexer):
   4109     indexer = [indexer]

File ~/checkouts/readthedocs.org/user_builds/first-python-notebook-vscode/envs/latest/lib/python3.13/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     #  InvalidIndexError. Otherwise we fall through and re-raise
   3823     #  the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 'location'

Creating new columns¶

You can create new columns by assigning values to a new column name:

# Create a column indicating if accident was fatal
accident_list["is_fatal"] = accident_list["total_fatalities"] > 0
print("Fatal vs non-fatal accidents:")
accident_list["is_fatal"].value_counts()

# Create a severity category based on fatalities
def categorize_severity(fatalities):
    if fatalities == 0:
        return "Non-fatal"
    elif fatalities <= 2:
        return "Low fatality"
    else:
        return "High fatality"

accident_list["severity_category"] = accident_list["total_fatalities"].apply(categorize_severity)
print("Accident severity categories:")
accident_list["severity_category"].value_counts()

Selecting multiple columns¶

You can select multiple columns by passing a list of column names:

# Select specific columns for analysis
key_columns = ["accident_number", "date", "state", "latimes_make_and_model", "total_fatalities", "severity_category"]
summary_data = accident_list[key_columns]
summary_data.head()

Column information¶

You can get information about your columns and their data types:

# Get column data types
print("Column data types:")
print(accident_list.dtypes)

# Check for missing values in each column
print("Missing values per column:")
print(accident_list.isnull().sum())

Working with individual columns is fundamental to pandas data analysis. These techniques allow you to clean, transform, and analyze your data column by column, building up to more complex analyses that combine multiple variables.