{ "cells": [ { "cell_type": "markdown", "id": "e0a525d5", "metadata": {}, "source": [ "# Data\n", "\n", "In 2018, the Los Angeles Times published an investigation headlined, [\"The Robinson R44, the world's best-selling civilian helicopter, has a long history of deadly crashes.\"](https://www.latimes.com/projects/la-me-robinson-helicopters/)\n", "\n", "It reported that Robinson's R44 led all major models with the highest fatal accident rate from 2006 to 2016. The analysis was [published on GitHub](https://github.com/datadesk/helicopter-accident-analysis) as a series of Jupyter notebooks.\n", "\n", "The findings were drawn from two key datasets:\n", "\n", "1. The National Transportation Safety Board's [Aviation Accident Database](https://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx)\n", "2. The Federal Aviation Administration's [General Aviation and Part 135 Activity Survey](https://www.faa.gov/data_research/aviation_data_statistics/general_aviation/)\n", "\n", "After a significant amount of work gathering and cleaning the source data, the number of accidents for each helicopter model were normalized using the flight hour estimates in the survey. For the purposes of this demonstration, we will read in tidied versions of each file that are ready for analysis.\n", "\n", "The data are structured in rows of comma-separated values. This is known as a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values). It is the most common way you will find data published online. The pandas library is able to read in files from a variety formats, including CSV." ] }, { "cell_type": "code", "execution_count": null, "id": "805cad17", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "e331bbce", "metadata": {}, "source": [ "## Reading a CSV file\n", "\n", "The pandas library can read files in many different formats, including CSV. For this class, we will use a dataset containing a record of helicopter accidents released by the National Transportation Safety Board." ] }, { "cell_type": "code", "execution_count": null, "id": "4ba49fc7", "metadata": {}, "outputs": [], "source": [ "accident_list = pd.read_csv(\"https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv\")" ] }, { "cell_type": "markdown", "id": "e4f48448", "metadata": {}, "source": [ "```{warning}\n", "You need the full URL path displayed in the example to access the file. While you could laboriously type it out, feel free to copy and paste it into your code.\n", "```\n", "\n", "After you run the code, you should see a DataFrame where pandas has structured the CSV data into rows and columns, just like Excel or other spreadsheet software might. Take a moment to look at the columns and rows in the output, which contain the data we'll use in our analysis.\n", "\n", "You shouldn't see anything after running the import. That's a good thing. It means our DataFrame has been saved under the name `accident_list`, which we can now begin interacting with.\n", "\n", "We can do this by calling [\"methods\"](https://en.wikipedia.org/wiki/Method_(computer_programming)) that pandas makes available to all DataFrames. You may not have known it at the time, but `read_csv` is one of these methods. There are dozens more that can do all sorts of interesting things. Let's start with some easy ones that analysts use all the time." ] }, { "cell_type": "markdown", "id": "f5dd8594", "metadata": {}, "source": [ "## The `head` method\n", "\n", "To preview the first few rows of the dataset, try the [`head`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method." ] }, { "cell_type": "code", "execution_count": null, "id": "0ac232f7", "metadata": {}, "outputs": [], "source": [ "accident_list.head()" ] }, { "cell_type": "markdown", "id": "99df9adc", "metadata": {}, "source": [ "The `head` method is one of pandas' most basic tools for inspecting a DataFrame. By default, it returns the first five rows.\n", "\n", "You can see how it has structured the data into a table with columns like `accident_number`, `date`, `location`, `latimes_make_and_model`, etc.\n", "\n", "Notice how pandas has automatically numbered each row (called the \"index\") starting from zero. This is how pandas keeps track of your data." ] }, { "cell_type": "markdown", "id": "46024984", "metadata": {}, "source": [ "## The `info` method\n", "\n", "The [`info`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) method can give you a high-level overview of the dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "740cfdbc", "metadata": {}, "outputs": [], "source": [ "accident_list.info()" ] }, { "cell_type": "markdown", "id": "cbce0aae", "metadata": {}, "source": [ "This summary can tell you quite a bit about your dataset. You can see:\n", "\n", "- How many rows and columns are in the DataFrame\n", "- The data type of each column (object, int64, float64, etc.)\n", "- How much memory the DataFrame is using\n", "- Whether there are any missing values (non-null count)\n", "\n", "This information is often helpful when you're first exploring a dataset." ] }, { "cell_type": "markdown", "id": "e59f1f3a", "metadata": {}, "source": [ "## The `shape` property\n", "\n", "If you just want to know how big the dataset is, the [`shape`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shape.html) property returns the size as a tuple with the number of rows first and the number of columns second." ] }, { "cell_type": "code", "execution_count": null, "id": "9bc37ed6", "metadata": {}, "outputs": [], "source": [ "accident_list.shape" ] }, { "cell_type": "markdown", "id": "9b0c74b5", "metadata": {}, "source": [ "So our dataset has 6,101 rows and 22 columns.\n", "\n", "Notice that we didn't add parentheses after `shape` like we did with `head()` and `info()`. That's because `shape` is a property, not a method. The distinction may seem arbitrary now, but you'll learn to recognize the difference as you get more practice with pandas." ] }, { "cell_type": "markdown", "id": "2ef35465", "metadata": {}, "source": [ "Now let's start analyzing this data. In the next section, we'll learn how to filter, sort and group our DataFrame to find insights in the helicopter accident data." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }