Data

In 2018, the Los Angeles Times published an investigation headlined, “The Robinson R44, the world’s best-selling civilian helicopter, has a long history of deadly crashes.”

It reported that Robinson’s R44 had the highest fatal accident rate of any major model from 2006 to 2016. The analysis was published on GitHub as a series of Jupyter notebooks.

The findings were drawn from two key datasets:

  1. The National Transportation Safety Board’s Aviation Accident Database

  2. The Federal Aviation Administration’s General Aviation and Part 135 Activity Survey

After a significant amount of work gathering and cleaning the source data, the number of accidents for each helicopter model was normalized using the flight hour estimates in the survey. For the purposes of this demonstration, we will read in tidied versions of each file that are ready for analysis.
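As a rough sketch of what normalizing by flight hours means, the calculation divides each model's accident count by its estimated hours in the air and scales the result to a common base. The model names, figures and column names below are invented for illustration and are not taken from the NTSB or FAA files. The base of 100,000 hours is also an assumption.

```python
import pandas as pd

# Hypothetical figures, invented for illustration only
example = pd.DataFrame(
    {
        "model": ["MODEL A", "MODEL B"],
        "fatal_accidents": [10, 4],
        "flight_hours": [2_000_000, 400_000],
    }
)

# Accidents per 100,000 flight hours, so models that fly very
# different amounts can be compared on the same scale
example["rate_per_100k_hours"] = (
    example["fatal_accidents"] * 100_000 / example["flight_hours"]
)

print(example)
```

In this made-up case, MODEL B has fewer accidents than MODEL A but twice the rate, because it logged far fewer hours.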

The data are structured in rows of comma-separated values. This is known as a CSV file. It is the most common way you will find data published online. The pandas library is able to read in files from a variety of formats, including CSV.
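To make the format concrete, here is a minimal sketch that parses a tiny CSV held in a text string rather than a file. The three rows are invented for illustration, and the variable names csv_text and sample are placeholders.

```python
import io

import pandas as pd

# A tiny CSV, invented for illustration: a header row, then one line per record
csv_text = """model,year,total_fatalities
R44,2006,3
407,2006,2
"""

# read_csv accepts a file path, a URL or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
```

The first line of the text becomes the column names, and each line after it becomes a row.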

import pandas as pd

Reading a CSV file

For this class, we will use a dataset containing a record of helicopter accidents released by the National Transportation Safety Board. Pass its URL to the pandas read_csv function to load it.

accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")

Warning

You need the full URL path displayed in the example to access the file. While you could laboriously type it out, feel free to copy and paste it into your code.

You shouldn’t see anything after running the code. That’s a good thing. It means pandas has structured the CSV data into a DataFrame of rows and columns, just like Excel or other spreadsheet software might, and saved it under the name accident_list, which we can now begin interacting with.

We can do this by calling “methods” that pandas makes available to all DataFrames. The read_csv function we just used belongs to pandas itself, but the DataFrame it returned comes with dozens of methods that can do all sorts of interesting things. Let’s start with some easy ones that analysts use all the time.

The head method

To preview the first few rows of the dataset, try the head method.

accident_list.head()
   event_id        ntsb_make  ntsb_model  ntsb_number  year  date                 city            state  country  total_fatalities  latimes_make  latimes_model  latimes_make_and_model
0  20061222X01838  BELL       407         NYC07FA048   2006  12/14/2006 00:00:00  DAGSBORO        DE     USA      2                 BELL          407            BELL 407
1  20060817X01187  ROBINSON   R22 BETA    LAX06LA257   2006  08/10/2006 00:00:00  TUCSON          AZ     USA      1                 ROBINSON      R22            ROBINSON R22
2  20060111X00044  ROBINSON   R44         MIA06FA039   2006  01/01/2006 00:00:00  GRAND RIDGE     FL     USA      3                 ROBINSON      R44            ROBINSON R44
3  20060419X00461  ROBINSON   R44 II      DFW06FA102   2006  04/13/2006 00:00:00  FREDERICKSBURG  TX     USA      2                 ROBINSON      R44            ROBINSON R44
4  20060208X00181  ROBINSON   R44         SEA06LA052   2006  02/06/2006 00:00:00  HELENA          MT     USA      1                 ROBINSON      R44            ROBINSON R44

The head method is one of pandas’ most basic tools for inspecting a DataFrame. By default, it returns the first five rows.
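If five rows is not what you want, head accepts a number, and the related tail method does the same from the bottom of the table. The sketch below uses a small stand-in DataFrame named demo, invented for illustration, so it runs without downloading anything.

```python
import pandas as pd

# A small stand-in DataFrame, invented for illustration
demo = pd.DataFrame(
    {
        "year": list(range(2006, 2014)),
        "total_fatalities": [2, 1, 3, 2, 1, 4, 1, 2],
    }
)

first_three = demo.head(3)  # pass a number to change how many rows come back
last_two = demo.tail(2)     # tail counts from the bottom instead

print(first_three)
print(last_two)
```

The same calls should work on accident_list, for example accident_list.head(10).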

You can see how it has structured the data into a table with columns like ntsb_number, date, city, latimes_make_and_model, etc.

Notice how pandas has automatically numbered each row (called the “index”) starting from zero. This is how pandas keeps track of your data.
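Here is a short sketch of how that zero-based index can be used to pull out a single row or value. The demo DataFrame is a stand-in built from three city names that appear in the preview above.

```python
import pandas as pd

# A stand-in DataFrame using three city names from the preview
demo = pd.DataFrame({"city": ["DAGSBORO", "TUCSON", "GRAND RIDGE"]})

# The default index simply counts up from zero
print(demo.index)

# iloc selects by position, so 0 is the first row
first_row = demo.iloc[0]
print(first_row)

# loc selects by index label, which here matches the position
third_city = demo.loc[2, "city"]
print(third_city)
```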

The info method

The info method can give you a high-level overview of the dataset.

accident_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   event_id                163 non-null    object
 1   ntsb_make               163 non-null    object
 2   ntsb_model              163 non-null    object
 3   ntsb_number             163 non-null    object
 4   year                    163 non-null    int64 
 5   date                    163 non-null    object
 6   city                    163 non-null    object
 7   state                   162 non-null    object
 8   country                 163 non-null    object
 9   total_fatalities        163 non-null    int64 
 10  latimes_make            163 non-null    object
 11  latimes_model           163 non-null    object
 12  latimes_make_and_model  163 non-null    object
dtypes: int64(2), object(11)
memory usage: 16.7+ KB

This summary can tell you quite a bit about your dataset. You can see:

  • How many rows and columns are in the DataFrame

  • The data type of each column (object, int64, float64, etc.)

  • How much memory the DataFrame is using

  • Whether there are any missing values (non-null count)

This information is often helpful when you’re first exploring a dataset.
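In the output above, the state column shows 162 non-null values out of 163 entries, which means one row is missing a state. One way to count and locate gaps like that is sketched below, using a small DataFrame with invented values and one deliberately blank state.

```python
import pandas as pd

# Invented values for illustration, with one state left blank on purpose
demo = pd.DataFrame(
    {
        "city": ["CITY A", "CITY B", "CITY C"],
        "state": ["DE", None, "MT"],
    }
)

# isna marks each missing cell as True; sum then counts them per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Filter down to the rows where the state is missing
rows_missing_state = demo[demo["state"].isna()]
print(rows_missing_state)
```

Run against accident_list, the same two steps should point to the single record without a state.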

The shape property

If you just want to know how big the dataset is, the shape property returns the size as a tuple with the number of rows first and the number of columns second.

accident_list.shape
(163, 13)

So our dataset has 163 rows and 13 columns.

Notice that we didn’t add parentheses after shape like we did with head() and info(). That’s because shape is a property, not a method. The distinction may seem arbitrary now, but you’ll learn to recognize the difference as you get more practice with pandas.
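Because shape is an ordinary tuple, you can unpack it into two variables. The sketch below uses a small stand-in DataFrame, and the names row_count and column_count are placeholders chosen for illustration.

```python
import pandas as pd

# A small stand-in DataFrame with three rows and two columns
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# No parentheses after shape, since it is a property
row_count, column_count = demo.shape
print(row_count, column_count)

# len is a shortcut when you only need the number of rows
print(len(demo))
```

Writing demo.shape() with parentheses raises a TypeError, because Python would be trying to call a tuple as if it were a function.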

Now let’s start analyzing this data. In the next section, we’ll learn how to filter, sort and group our DataFrame to find insights in the helicopter accident data.