{ "cells": [ { "cell_type": "markdown", "id": "fa3f9f7b", "metadata": {}, "source": [ "# Columns\n", "\n", "One of the most powerful features of pandas is the ability to access and analyze individual columns from your DataFrame. This allows you to focus on specific variables and perform detailed analysis on the data that matters most to your investigation." ] }, { "cell_type": "markdown", "id": "b3cd2124", "metadata": {}, "source": [ "## Setup data\n", "\n", "Let's start by loading our accident data:" ] }, { "cell_type": "code", "execution_count": null, "id": "b4a65ea9", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "accident_list = pd.read_csv(\"https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv\")\n", "\n", "print(f\"Loaded {len(accident_list)} accidents\")\n", "print(f\"Columns: {list(accident_list.columns)}\")" ] }, { "cell_type": "markdown", "id": "8caa2cf0", "metadata": {}, "source": [ "## Accessing individual columns\n", "\n", "We'll begin with the `latimes_make_and_model` column, which records the standardized name of each helicopter that crashed. To access its contents separate from the rest of the DataFrame, append a pair of square brackets with the column's name in quotes inside:" ] }, { "cell_type": "code", "execution_count": null, "id": "60015f15", "metadata": {}, "outputs": [], "source": [ "# Access a specific column\n", "accident_list[\"latimes_make_and_model\"]" ] }, { "cell_type": "markdown", "id": "793fe724", "metadata": {}, "source": [ "That will list the column out as a `Series`, just like the ones we created from scratch earlier. Just as we did then, you can now start tacking on additional methods that will analyze the contents of the column." ] }, { "cell_type": "markdown", "id": "cefb536a", "metadata": {}, "source": [ "## Analyzing column data\n", "\n", "There's a built-in pandas tool that will total up the frequency of values in a column. The method is called [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html) and it's just as easy to use as `sum`, `min` or `max`. All you need to do is add a period after the column name and chain it on the tail end of your code:" ] }, { "cell_type": "code", "execution_count": null, "id": "a85f28a8", "metadata": {}, "outputs": [], "source": [ "accident_list[\"latimes_make_and_model\"].value_counts()" ] }, { "cell_type": "markdown", "id": "a0d3ba61", "metadata": {}, "source": [ "## Working with numeric columns\n", "\n", "Let's look at the fatalities column and see what statistical methods we can apply:" ] }, { "cell_type": "code", "execution_count": null, "id": "ac1988bb", "metadata": {}, "outputs": [], "source": [ "# Basic statistics for total fatalities\n", "fatalities = accident_list[\"total_fatalities\"]\n", "print(f\"Total fatalities across all accidents: {fatalities.sum()}\")\n", "print(f\"Average fatalities per accident: {fatalities.mean():.2f}\")\n", "print(f\"Maximum fatalities in single accident: {fatalities.max()}\")\n", "print(f\"Minimum fatalities: {fatalities.min()}\")" ] }, { "cell_type": "markdown", "id": "36031420", "metadata": {}, "source": [ "## String operations on columns\n", "\n", "For text columns, pandas provides special string methods through the `.str` accessor:" ] }, { "cell_type": "code", "execution_count": null, "id": "6de654bd", "metadata": {}, "outputs": [], "source": [ "# Convert make and model to uppercase for consistency\n", "accident_list[\"latimes_make_and_model\"] = accident_list[\"latimes_make_and_model\"].str.upper()\n", "print(\"Converted to uppercase:\")\n", "accident_list[\"latimes_make_and_model\"].value_counts().head()" ] }, { "cell_type": "code", "execution_count": null, "id": "117cbb92", "metadata": {}, "outputs": [], "source": [ "# Check if location contains certain words\n", "airport_mask = accident_list[\"location\"].str.contains(\"Airport\", na=False)\n", "print(f\"Accidents at locations containing 'Airport': {airport_mask.sum()}\")" ] }, { "cell_type": "markdown", "id": "6eb7bb4e", "metadata": {}, "source": [ "## Creating new columns\n", "\n", "You can create new columns by assigning values to a new column name:" ] }, { "cell_type": "code", "execution_count": null, "id": "62bb3cf4", "metadata": {}, "outputs": [], "source": [ "# Create a column indicating if accident was fatal\n", "accident_list[\"is_fatal\"] = accident_list[\"total_fatalities\"] > 0\n", "print(\"Fatal vs non-fatal accidents:\")\n", "accident_list[\"is_fatal\"].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "id": "9f59cf00", "metadata": {}, "outputs": [], "source": [ "# Create a severity category based on fatalities\n", "def categorize_severity(fatalities):\n", " if fatalities == 0:\n", " return \"Non-fatal\"\n", " elif fatalities <= 2:\n", " return \"Low fatality\"\n", " else:\n", " return \"High fatality\"\n", "\n", "accident_list[\"severity_category\"] = accident_list[\"total_fatalities\"].apply(categorize_severity)\n", "print(\"Accident severity categories:\")\n", "accident_list[\"severity_category\"].value_counts()" ] }, { "cell_type": "markdown", "id": "b3ef6873", "metadata": {}, "source": [ "## Selecting multiple columns\n", "\n", "You can select multiple columns by passing a list of column names:" ] }, { "cell_type": "code", "execution_count": null, "id": "e366b874", "metadata": {}, "outputs": [], "source": [ "# Select specific columns for analysis\n", "key_columns = [\"accident_number\", \"date\", \"state\", \"latimes_make_and_model\", \"total_fatalities\", \"severity_category\"]\n", "summary_data = accident_list[key_columns]\n", "summary_data.head()" ] }, { "cell_type": "markdown", "id": "db474e99", "metadata": {}, "source": [ "## Column information\n", "\n", "You can get information about your columns and their data types:" ] }, { "cell_type": "code", "execution_count": null, "id": "cc424676", "metadata": {}, "outputs": [], "source": [ "# Get column data types\n", "print(\"Column data types:\")\n", "print(accident_list.dtypes)" ] }, { "cell_type": "code", "execution_count": null, "id": "27bcbd04", "metadata": {}, "outputs": [], "source": [ "# Check for missing values in each column\n", "print(\"Missing values per column:\")\n", "print(accident_list.isnull().sum())" ] }, { "cell_type": "markdown", "id": "5948650c", "metadata": {}, "source": [ "Working with individual columns is fundamental to pandas data analysis. These techniques allow you to clean, transform, and analyze your data column by column, building up to more complex analyses that combine multiple variables." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }