Data¶
In 2018, the Los Angeles Times published an investigation headlined, “The Robinson R44, the world’s best-selling civilian helicopter, has a long history of deadly crashes.”
It reported that Robinson’s R44 had the highest fatal accident rate of any major model from 2006 to 2016. The analysis was published on GitHub as a series of Jupyter notebooks.
The findings were drawn from two key datasets:
The National Transportation Safety Board’s Aviation Accident Database
The Federal Aviation Administration’s General Aviation and Part 135 Activity Survey
After a significant amount of work gathering and cleaning the source data, the number of accidents for each helicopter model was normalized using the flight hour estimates in the survey. For the purposes of this demonstration, we will read in tidied versions of each file that are ready for analysis.
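To make the idea of normalizing concrete, here is a minimal sketch of the arithmetic. The model names and figures are hypothetical, not the published data, and expressing the rate per 100,000 flight hours is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical numbers for illustration only; these are not the published figures.
totals = pd.DataFrame(
    {
        "model": ["MODEL A", "MODEL B"],
        "fatal_accidents": [10, 4],
        "flight_hours": [2_000_000, 500_000],
    }
)

# Divide accidents by flight hours so models that fly more are not penalized
# for it. The per-100,000-hours scale is an assumption of this sketch.
totals["rate_per_100k_hours"] = (
    totals["fatal_accidents"] * 100_000 / totals["flight_hours"]
)

print(totals)
```

In this made-up example, MODEL B has fewer accidents than MODEL A but a higher rate, because it logged far fewer hours in the air. That is the kind of difference a raw count would hide.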
The data are structured in rows of comma-separated values, a format known as CSV. It is one of the most common ways you will find data published online. The pandas library can read files in a variety of formats, including CSV.
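If you have never opened a CSV file in a text editor, the sketch below shows roughly what one looks like inside. It keeps a few comma-separated lines in a Python string and hands them to pandas, so it runs without downloading anything. The rows are a trimmed-down copy of values from the accident data used in this lesson, and the names csv_text and sample are made up for this example.

```python
import io

import pandas as pd

# The first line holds the column names; each later line is one row.
# These rows are a small excerpt, trimmed to four columns for readability.
csv_text = """ntsb_make,ntsb_model,year,total_fatalities
BELL,407,2006,2
ROBINSON,R22 BETA,2006,1
ROBINSON,R44,2006,3
"""

# read_csv accepts a file path, a URL or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
```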
import pandas as pd
Reading a CSV file¶
The read_csv function in pandas loads a CSV file into a table called a DataFrame. For this class, we will use it to read a record of helicopter accidents released by the National Transportation Safety Board.
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
Warning
You need the full URL path displayed in the example to access the file. While you could laboriously type it out, feel free to copy and paste it into your code.
You shouldn’t see anything after running that line. That’s a good thing. It means pandas has structured the CSV data into a DataFrame, a table of rows and columns much like you would see in Excel or other spreadsheet software, and saved it under the name accident_list, which we can now begin interacting with.
We can do this by calling “methods” that pandas makes available to all DataFrames. There are dozens of them, and they can do all sorts of interesting things. (The read_csv call you just ran is a close cousin: a function provided by the pandas library itself rather than by a DataFrame.) Let’s start with some easy ones that analysts use all the time.
The head method¶
To preview the first few rows of the dataset, try the head method.
accident_list.head()
| | event_id | ntsb_make | ntsb_model | ntsb_number | year | date | city | state | country | total_fatalities | latimes_make | latimes_model | latimes_make_and_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20061222X01838 | BELL | 407 | NYC07FA048 | 2006 | 12/14/2006 00:00:00 | DAGSBORO | DE | USA | 2 | BELL | 407 | BELL 407 |
| 1 | 20060817X01187 | ROBINSON | R22 BETA | LAX06LA257 | 2006 | 08/10/2006 00:00:00 | TUCSON | AZ | USA | 1 | ROBINSON | R22 | ROBINSON R22 |
| 2 | 20060111X00044 | ROBINSON | R44 | MIA06FA039 | 2006 | 01/01/2006 00:00:00 | GRAND RIDGE | FL | USA | 3 | ROBINSON | R44 | ROBINSON R44 |
| 3 | 20060419X00461 | ROBINSON | R44 II | DFW06FA102 | 2006 | 04/13/2006 00:00:00 | FREDERICKSBURG | TX | USA | 2 | ROBINSON | R44 | ROBINSON R44 |
| 4 | 20060208X00181 | ROBINSON | R44 | SEA06LA052 | 2006 | 02/06/2006 00:00:00 | HELENA | MT | USA | 1 | ROBINSON | R44 | ROBINSON R44 |
The head method is one of pandas’ most basic tools for inspecting a DataFrame. By default, it returns the first five rows.
You can see how it has structured the data into a table with columns like event_id, date, city, latimes_make_and_model, etc.
Notice how pandas has automatically numbered each row (called the “index”) starting from zero. This is how pandas keeps track of your data.
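If five rows is more or fewer than you want, head accepts a number. The sketch below tries that on a small stand-in table so it can run on its own; with the real data you would call the same methods on accident_list. The name sample is made up for this example.

```python
import pandas as pd

# A small stand-in table; the values are copied from the preview above.
sample = pd.DataFrame(
    {
        "ntsb_number": [
            "NYC07FA048", "LAX06LA257", "MIA06FA039", "DFW06FA102", "SEA06LA052",
        ],
        "latimes_make_and_model": [
            "BELL 407", "ROBINSON R22", "ROBINSON R44", "ROBINSON R44", "ROBINSON R44",
        ],
        "total_fatalities": [2, 1, 3, 2, 1],
    }
)

first_two = sample.head(2)  # pass a number to change how many rows come back
last_two = sample.tail(2)   # tail works the same way from the other end

print(first_two)
print(last_two)
print(list(sample.index))   # the automatic row numbers, starting from zero
```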
The info method¶
The info method can give you a high-level overview of the dataset.
accident_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 event_id 163 non-null object
1 ntsb_make 163 non-null object
2 ntsb_model 163 non-null object
3 ntsb_number 163 non-null object
4 year 163 non-null int64
5 date 163 non-null object
6 city 163 non-null object
7 state 162 non-null object
8 country 163 non-null object
9 total_fatalities 163 non-null int64
10 latimes_make 163 non-null object
11 latimes_model 163 non-null object
12 latimes_make_and_model 163 non-null object
dtypes: int64(2), object(11)
memory usage: 16.7+ KB
This summary can tell you quite a bit about your dataset. You can see:
How many rows and columns are in the DataFrame
The data type of each column (object, int64, float64, etc.)
How much memory the DataFrame is using
Whether there are any missing values (compare each column’s non-null count with the total number of entries; here state has 162 of 163)
This information is often helpful when you’re first exploring a dataset.
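When info hints at a gap, a common next step is to count the missing values and pull up the affected rows. The following is one possible sketch on a tiny stand-in table. The third row, including its blank state, is invented for this example, and the names sample and missing_state are hypothetical.

```python
import pandas as pd

# A small stand-in table. The last row is invented here to mimic the one
# missing "state" value that info() reports in the real data.
sample = pd.DataFrame(
    {
        "city": ["DAGSBORO", "TUCSON", "EXAMPLE CITY"],
        "state": ["DE", "AZ", None],
        "total_fatalities": [2, 1, 1],
    }
)

print(sample.dtypes)        # the data type of each column
print(sample.isna().sum())  # how many values are missing in each column

# Keep only the rows where state is missing.
missing_state = sample[sample["state"].isna()]
print(missing_state)
```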
The shape property¶
If you just want to know how big the dataset is, the shape property returns the size as a tuple with the number of rows first and the number of columns second.
accident_list.shape
(163, 13)
So our dataset has 163 rows and 13 columns.
Notice that we didn’t add parentheses after shape like we did with head() and info(). That’s because shape is a property, not a method. The distinction may seem arbitrary now, but you’ll learn to recognize the difference as you get more practice with pandas.
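Because shape hands back an ordinary tuple, you can unpack it into two variables. Here is a short sketch on a stand-in table, with made-up variable names, that also shows the property and method styles side by side.

```python
import pandas as pd

# A tiny stand-in table so the example runs on its own.
sample = pd.DataFrame(
    {
        "year": [2006, 2006, 2006],
        "total_fatalities": [2, 1, 3],
    }
)

rows, columns = sample.shape  # no parentheses: shape is a property
print(rows, columns)

print(len(sample))            # len() gives the row count on its own
print(sample.head(1))         # parentheses: head is a method, so it is called
```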
Now let’s start analyzing this data. In the next section, we’ll learn how to filter, sort and group our DataFrame to find insights in the helicopter accident data.