{ "cells": [ { "cell_type": "markdown", "id": "81018940", "metadata": {}, "source": [ "# Concat\n", "\n", "Concatenation allows you to combine multiple DataFrames by stacking them vertically (adding rows) or horizontally (adding columns). This is useful when you have data split across multiple files or sources that need to be combined." ] }, { "cell_type": "markdown", "id": "956945d2", "metadata": {}, "source": [ "## Setup data\n", "\n", "Let's create some sample datasets to demonstrate concatenation:" ] }, { "cell_type": "code", "execution_count": null, "id": "5f95b9d4", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Load the main accident data\n", "accident_list = pd.read_csv(\"https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv\")\n", "\n", "print(f\"Loaded {len(accident_list)} total accidents\")\n", "print(f\"Date range: {accident_list['date'].min()} to {accident_list['date'].max()}\")" ] }, { "cell_type": "markdown", "id": "2674789a", "metadata": {}, "source": [ "## Vertical concatenation (combining rows)\n", "\n", "Let's split our data and then recombine it to demonstrate vertical concatenation:" ] }, { "cell_type": "code", "execution_count": null, "id": "d866ab64", "metadata": {}, "outputs": [], "source": [ "# Split data by state for demonstration\n", "california_accidents = accident_list[accident_list[\"state\"] == \"CA\"]\n", "texas_accidents = accident_list[accident_list[\"state\"] == \"TX\"]\n", "florida_accidents = accident_list[accident_list[\"state\"] == \"FL\"]\n", "\n", "print(f\"California accidents: {len(california_accidents)}\")\n", "print(f\"Texas accidents: {len(texas_accidents)}\")\n", "print(f\"Florida accidents: {len(florida_accidents)}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d45fb046", "metadata": {}, "outputs": [], "source": [ "# Concatenate the state datasets back together\n", "combined_states = pd.concat([california_accidents, texas_accidents, florida_accidents])\n", "\n", "print(f\"Combined dataset: {len(combined_states)} accidents\")\n", "print(\"\\nState breakdown:\")\n", "print(combined_states[\"state\"].value_counts())" ] }, { "cell_type": "markdown", "id": "11b23053", "metadata": {}, "source": [ "## Adding row labels during concatenation\n", "\n", "You can add labels to identify which dataset each row came from:" ] }, { "cell_type": "code", "execution_count": null, "id": "9e4eb1a0", "metadata": {}, "outputs": [], "source": [ "# Concatenate with keys to identify source\n", "labeled_concat = pd.concat(\n", " [california_accidents, texas_accidents, florida_accidents],\n", " keys=[\"CA\", \"TX\", \"FL\"]\n", ")\n", "\n", "print(f\"Labeled concatenation shape: {labeled_concat.shape}\")\n", "print(\"\\nIndex structure:\")\n", "print(labeled_concat.index.names)\n", "print(labeled_concat.head())" ] }, { "cell_type": "markdown", "id": "d188f6d4", "metadata": {}, "source": [ "## Horizontal concatenation (combining columns)\n", "\n", "Let's create additional data to demonstrate horizontal concatenation:" ] }, { "cell_type": "code", "execution_count": null, "id": "9e0f68a4", "metadata": {}, "outputs": [], "source": [ "# Create summary statistics for each accident\n", "accident_summary = pd.DataFrame({\n", " 'total_people': accident_list['total_fatalities'] + \n", " accident_list['total_serious_injuries'] + \n", " accident_list['total_minor_injuries'],\n", " 'severity_score': accident_list['total_fatalities'] * 3 + \n", " accident_list['total_serious_injuries'] * 2 + \n", " accident_list['total_minor_injuries'] * 1\n", "}, index=accident_list.index)\n", "\n", "print(\"Summary statistics:\")\n", "accident_summary.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "50ba6f6c", "metadata": {}, "outputs": [], "source": [ "# Concatenate horizontally to add new columns\n", "enhanced_data = pd.concat([accident_list, accident_summary], axis=1)\n", "\n", "print(f\"Original columns: {len(accident_list.columns)}\")\n", "print(f\"Enhanced columns: {len(enhanced_data.columns)}\")\n", "print(f\"New columns added: {list(accident_summary.columns)}\")" ] }, { "cell_type": "markdown", "id": "2913df57", "metadata": {}, "source": [ "## Handling missing data during concatenation\n", "\n", "When DataFrames have different columns, concatenation will fill missing values with NaN:" ] }, { "cell_type": "code", "execution_count": null, "id": "babb829c", "metadata": {}, "outputs": [], "source": [ "# Create DataFrames with different columns\n", "basic_info = accident_list[['accident_number', 'date', 'state']].head(3)\n", "detailed_info = accident_list[['accident_number', 'location', 'total_fatalities']].head(3)\n", "\n", "print(\"Basic info columns:\", list(basic_info.columns))\n", "print(\"Detailed info columns:\", list(detailed_info.columns))\n", "\n", "# Concatenate DataFrames with different columns\n", "mixed_concat = pd.concat([basic_info, detailed_info])\n", "print(\"\\nConcatenated result:\")\n", "print(mixed_concat)" ] }, { "cell_type": "markdown", "id": "6329294b", "metadata": {}, "source": [ "## Ignoring index during concatenation\n", "\n", "Sometimes you want to reset the index when concatenating:" ] }, { "cell_type": "code", "execution_count": null, "id": "d9ec3c14", "metadata": {}, "outputs": [], "source": [ "# Concatenate and reset index\n", "reset_index_concat = pd.concat(\n", " [california_accidents.head(2), texas_accidents.head(2)], \n", " ignore_index=True\n", ")\n", "\n", "print(\"Concatenation with reset index:\")\n", "print(reset_index_concat[['accident_number', 'state', 'location']])" ] }, { "cell_type": "markdown", "id": "e71c0f63", "metadata": {}, "source": [ "## Practical example: Combining multiple CSV files\n", "\n", "This is a common scenario when you have data split across multiple files:" ] }, { "cell_type": "code", "execution_count": null, "id": "59b0f054", "metadata": {}, "outputs": [], "source": [ "# Simulate reading multiple files and combining them\n", "# In practice, you might do something like:\n", "# files = ['data2020.csv', 'data2021.csv', 'data2022.csv']\n", "# dataframes = [pd.read_csv(f) for f in files]\n", "# combined = pd.concat(dataframes, ignore_index=True)\n", "\n", "# For demonstration, split by year and recombine\n", "accident_list['date'] = pd.to_datetime(accident_list['date'])\n", "accident_list['year'] = accident_list['date'].dt.year\n", "\n", "year_2015 = accident_list[accident_list['year'] == 2015]\n", "year_2016 = accident_list[accident_list['year'] == 2016]\n", "year_2017 = accident_list[accident_list['year'] == 2017]\n", "\n", "# Combine years with source tracking\n", "multi_year = pd.concat(\n", " [year_2015, year_2016, year_2017],\n", " keys=['2015', '2016', '2017'],\n", " names=['source_year', 'original_index']\n", ")\n", "\n", "print(f\"Combined multi-year data: {len(multi_year)} accidents\")\n", "print(\"\\nAccidents by source year:\")\n", "print(multi_year.groupby(level=0).size())" ] }, { "cell_type": "markdown", "id": "4df25eb8", "metadata": {}, "source": [ "Concatenation is essential when working with data from multiple sources or files. It allows you to combine datasets efficiently while maintaining data integrity and providing options for handling mismatched structures." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }