Data¶

In 2018, the Los Angeles Times published an investigation headlined, “The Robinson R44, the world’s best-selling civilian helicopter, has a long history of deadly crashes.”

It reported that Robinson’s R44 led all major models with the highest fatal accident rate from 2006 to 2016. The analysis was published on GitHub as a series of Jupyter notebooks.

The findings were drawn from two key datasets:

The National Transportation Safety Board’s Aviation Accident Database
The Federal Aviation Administration’s General Aviation and Part 135 Activity Survey

After a significant amount of work gathering and cleaning the source data, the number of accidents for each helicopter model was normalized using the flight hour estimates in the survey. For the purposes of this demonstration, we will read in tidied versions of each file that are ready for analysis.
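Normalizing here means dividing each model's accident count by its flight hours. Below is a minimal sketch of that arithmetic. The figures are made up, and the column names (`fatal_accidents`, `flight_hours`) and the per-100,000-hours scale are assumptions chosen for illustration, not taken from the real files.

```python
import pandas as pd

# Made-up figures for illustration only; not the real NTSB or FAA numbers.
example = pd.DataFrame(
    {
        "model": ["MODEL A", "MODEL B"],
        "fatal_accidents": [10, 4],
        "flight_hours": [1_000_000, 800_000],
    }
)

# Accidents per 100,000 flight hours, so that models flown for very
# different amounts of time can be compared on the same scale.
example["rate_per_100k_hours"] = (
    example["fatal_accidents"] * 100_000 / example["flight_hours"]
)

print(example)
```

Without this step, a model that simply flies more would look more dangerous than it is.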

The data are structured in rows of comma-separated values. This is known as a CSV file. It is the most common way you will find data published online. The pandas library is able to read in files from a variety of formats, including CSV.

import pandas as pd

Reading a CSV file¶

For this class, we will use a dataset containing a record of helicopter accidents released by the National Transportation Safety Board. The read_csv function loads it directly from a URL.

accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")

Warning

You need the full URL path displayed in the example to access the file. While you could laboriously type it out, feel free to copy and paste it into your code.

When the code runs, pandas structures the CSV data into a DataFrame of rows and columns, just like Excel or other spreadsheet software might.

You shouldn’t see anything after running the cell. That’s a good thing. It means our DataFrame has been saved under the name accident_list, which we can now begin interacting with.
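To try the same pattern without a network connection, here is a small sketch that reads a CSV from an in-memory string. The two rows are trimmed from the accident file to four columns, and `io.StringIO` stands in for the URL. The variable names `csv_text` and `sample` are only for illustration.

```python
import io

import pandas as pd

# Two rows trimmed from the accident file, held in a string instead of a URL.
csv_text = """ntsb_make,ntsb_model,year,total_fatalities
BELL,407,2006,2
ROBINSON,R22 BETA,2006,1
"""

# Assigning the result to a variable displays nothing in a notebook.
sample = pd.read_csv(io.StringIO(csv_text))

# Printing the variable (or putting its name on the last line of a cell)
# is what actually shows the table.
print(sample)
```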

We can do this by calling “methods” that pandas makes available to all DataFrames. (The read_csv tool we just used is a little different: it is a function of the pandas library itself, which is why we called it on pd rather than on a DataFrame.) There are dozens of methods that can do all sorts of interesting things. Let’s start with some easy ones that analysts use all the time.

The `head` method¶

To preview the first few rows of the dataset, try the head method.

accident_list.head()

	event_id	ntsb_make	ntsb_model	ntsb_number	year	date	city	state	country	total_fatalities	latimes_make	latimes_model	latimes_make_and_model
0	20061222X01838	BELL	407	NYC07FA048	2006	12/14/2006 00:00:00	DAGSBORO	DE	USA	2	BELL	407	BELL 407
1	20060817X01187	ROBINSON	R22 BETA	LAX06LA257	2006	08/10/2006 00:00:00	TUCSON	AZ	USA	1	ROBINSON	R22	ROBINSON R22
2	20060111X00044	ROBINSON	R44	MIA06FA039	2006	01/01/2006 00:00:00	GRAND RIDGE	FL	USA	3	ROBINSON	R44	ROBINSON R44
3	20060419X00461	ROBINSON	R44 II	DFW06FA102	2006	04/13/2006 00:00:00	FREDERICKSBURG	TX	USA	2	ROBINSON	R44	ROBINSON R44
4	20060208X00181	ROBINSON	R44	SEA06LA052	2006	02/06/2006 00:00:00	HELENA	MT	USA	1	ROBINSON	R44	ROBINSON R44

The head method is one of pandas’ most basic tools for inspecting a DataFrame. By default, it returns the first five rows.

You can see how it has structured the data into a table with columns like event_id, date, city, state, latimes_make_and_model, etc.

Notice how pandas has automatically numbered each row (called the “index”) starting from zero. This is how pandas keeps track of your data.
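The head method also accepts the number of rows to return, and its counterpart tail works from the end of the table. A small sketch on a stand-in DataFrame (the `numbers` data is made up for illustration):

```python
import pandas as pd

# A stand-in DataFrame with ten rows; the values are made up.
numbers = pd.DataFrame({"value": range(10)})

print(numbers.head())    # first five rows, the default
print(numbers.head(3))   # first three rows
print(numbers.tail(2))   # last two rows

# The automatic row numbers (the index) start at zero.
print(list(numbers.head(3).index))
```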

The `info` method¶

The info method can give you a high-level overview of the dataset.

accident_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   event_id                163 non-null    object
 1   ntsb_make               163 non-null    object
 2   ntsb_model              163 non-null    object
 3   ntsb_number             163 non-null    object
 4   year                    163 non-null    int64 
 5   date                    163 non-null    object
 6   city                    163 non-null    object
 7   state                   162 non-null    object
 8   country                 163 non-null    object
 9   total_fatalities        163 non-null    int64 
 10  latimes_make            163 non-null    object
 11  latimes_model           163 non-null    object
 12  latimes_make_and_model  163 non-null    object
dtypes: int64(2), object(11)
memory usage: 16.7+ KB

This summary can tell you quite a bit about your dataset. You can see:

How many rows and columns are in the DataFrame
The data type of each column (object, int64, float64, etc.)
How much memory the DataFrame is using
Whether there are any missing values (non-null count)

This information is often helpful when you’re first exploring a dataset.
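The non-null counts are what reveal missing values: in the output above, state has 162 non-null entries out of 163, so one row has no state. If you want the count of missing values directly, the isna method can help. A minimal sketch on a made-up three-row stand-in (the third row is invented for illustration):

```python
import pandas as pd

# A tiny stand-in table; the third row is invented and has no state.
sample = pd.DataFrame(
    {
        "city": ["DAGSBORO", "TUCSON", "SOMEWHERE"],
        "state": ["DE", "AZ", None],
    }
)

# Count the missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Show only the rows where state is missing.
print(sample[sample["state"].isna()])
```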

The `shape` property¶

If you just want to know how big the dataset is, the shape property returns the size as a tuple with the number of rows first and the number of columns second.

accident_list.shape

(163, 13)

So our dataset has 163 rows and 13 columns.

Notice that we didn’t add parentheses after shape like we did with head() and info(). That’s because shape is a property, not a method. The distinction may seem arbitrary now, but you’ll learn to recognize the difference as you get more practice with pandas.
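Because shape is an ordinary tuple, you can unpack it into two variables. A short sketch on a made-up stand-in DataFrame:

```python
import pandas as pd

# A stand-in table with three rows and two columns; the values are made up.
sample = pd.DataFrame({"year": [2006, 2006, 2007], "total_fatalities": [2, 1, 3]})

# No parentheses: shape is a property, not a method.
rows, columns = sample.shape
print(rows, columns)

# len() gives just the number of rows.
print(len(sample))

# head() is a method, so it needs parentheses; its result has a shape too.
print(sample.head(2).shape)
```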

Now let’s start analyzing this data. In the next section, we’ll learn how to filter, sort and group our DataFrame to find insights in the helicopter accident data.
