Sort¶

Another simple but common technique for analyzing data is sorting. This can be useful for ranking the DataFrame to show the first and last members of the table according to a particular column.

Let’s start by preparing our data:

import pandas as pd

# Load accident data
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()

# Create accident counts
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})

# Load survey data and merge
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")

# Calculate accident rates
merged_list["per_hour"] = merged_list.accidents / merged_list.total_hours
merged_list["per_100k_hours"] = (merged_list.accidents / merged_list.total_hours) * 100_000

print("Data prepared for sorting analysis")
merged_list.head()

Data prepared for sorting analysis

	latimes_make_and_model	accidents	total_hours	per_hour	per_100k_hours
0	AGUSTA 109	2	362172	5.522238e-06	0.552224
1	AIRBUS 130	1	1053786	9.489593e-07	0.094896
2	AIRBUS 135	4	884596	4.521838e-06	0.452184
3	AIRBUS 350	29	3883490	7.467510e-06	0.746751
4	BELL 206	30	5501308	5.453249e-06	0.545325

Basic sorting¶

The sort_values method is how pandas sorts DataFrames. It expects you to provide it with the name of the column to sort by in quotes. Try sorting by our computed field:

merged_list.sort_values("per_100k_hours")

	latimes_make_and_model	accidents	total_hours	per_hour	per_100k_hours
1	AIRBUS 130	1	1053786	9.489593e-07	0.094896
11	SIKORSKY 76	2	915515	2.184563e-06	0.218456
10	SCHWEIZER 269	5	1139326	4.388560e-06	0.438856
2	AIRBUS 135	4	884596	4.521838e-06	0.452184
4	BELL 206	30	5501308	5.453249e-06	0.545325
0	AGUSTA 109	2	362172	5.522238e-06	0.552224
5	BELL 407	13	2113788	6.150096e-06	0.615010
8	ROBINSON R22	20	2970806	6.732180e-06	0.673218
3	AIRBUS 350	29	3883490	7.467510e-06	0.746751
6	HUGHES 369	13	1201688	1.081812e-05	1.081812
7	MCDONNELL DOUGLAS 369	6	550689	1.089544e-05	1.089544
9	ROBINSON R44	38	2359729	1.610354e-05	1.610354

Note that by default sort_values returns the DataFrame sorted in ascending order from lowest to highest. You can show the largest values first by passing in an optional keyword argument called ascending. When it is set to False, the DataFrame is sorted in descending order:

merged_list.sort_values("per_100k_hours", ascending=False)

	latimes_make_and_model	accidents	total_hours	per_hour	per_100k_hours
9	ROBINSON R44	38	2359729	1.610354e-05	1.610354
7	MCDONNELL DOUGLAS 369	6	550689	1.089544e-05	1.089544
6	HUGHES 369	13	1201688	1.081812e-05	1.081812
3	AIRBUS 350	29	3883490	7.467510e-06	0.746751
8	ROBINSON R22	20	2970806	6.732180e-06	0.673218
5	BELL 407	13	2113788	6.150096e-06	0.615010
0	AGUSTA 109	2	362172	5.522238e-06	0.552224
4	BELL 206	30	5501308	5.453249e-06	0.545325
2	AIRBUS 135	4	884596	4.521838e-06	0.452184
10	SCHWEIZER 269	5	1139326	4.388560e-06	0.438856
11	SIKORSKY 76	2	915515	2.184563e-06	0.218456
1	AIRBUS 130	1	1053786	9.489593e-07	0.094896

Sorting by multiple columns¶

You can also sort by multiple columns by passing a list of column names. This is useful when you want a primary and secondary sort:

# Sort by accidents (descending) then by total hours (descending)
merged_list.sort_values(["accidents", "total_hours"], ascending=[False, False])

	latimes_make_and_model	accidents	total_hours	per_hour	per_100k_hours
9	ROBINSON R44	38	2359729	1.610354e-05	1.610354
4	BELL 206	30	5501308	5.453249e-06	0.545325
3	AIRBUS 350	29	3883490	7.467510e-06	0.746751
8	ROBINSON R22	20	2970806	6.732180e-06	0.673218
5	BELL 407	13	2113788	6.150096e-06	0.615010
6	HUGHES 369	13	1201688	1.081812e-05	1.081812
7	MCDONNELL DOUGLAS 369	6	550689	1.089544e-05	1.089544
10	SCHWEIZER 269	5	1139326	4.388560e-06	0.438856
2	AIRBUS 135	4	884596	4.521838e-06	0.452184
11	SIKORSKY 76	2	915515	2.184563e-06	0.218456
0	AGUSTA 109	2	362172	5.522238e-06	0.552224
1	AIRBUS 130	1	1053786	9.489593e-07	0.094896

Finding the top and bottom values¶

Sometimes you just want the highest or lowest values. Pandas provides convenient methods for this:

# Top 5 models by accident rate
print("Top 5 models by accident rate per 100k hours:")
print(merged_list.nlargest(5, "per_100k_hours")[["latimes_make_and_model", "per_100k_hours"]])

Top 5 models by accident rate per 100k hours:
  latimes_make_and_model  per_100k_hours
         ROBINSON R44        1.610354
MCDONNELL DOUGLAS 369        1.089544
           HUGHES 369        1.081812
           AIRBUS 350        0.746751
         ROBINSON R22        0.673218

# Bottom 5 models by accident rate
print("Bottom 5 models by accident rate per 100k hours:")
print(merged_list.nsmallest(5, "per_100k_hours")[["latimes_make_and_model", "per_100k_hours"]])

Bottom 5 models by accident rate per 100k hours:
   latimes_make_and_model  per_100k_hours
            AIRBUS 130        0.094896
          SIKORSKY 76        0.218456
        SCHWEIZER 269        0.438856
            AIRBUS 135        0.452184
              BELL 206        0.545325

Practice exercises¶

Congratulations! With sorting, you’ve covered most of the basic skills necessary to access and analyze data with pandas. Here are some practice questions you can answer using the techniques we’ve learned:

What’s the date of the most recent fatal helicopter accident in Texas?
How many fatalities occurred in Texas accidents?
What helicopter model logged the most flight hours?
Where did the accident with the NTSB number ERA13LA057 occur?

Try to answer these questions using the sorting and filtering techniques we’ve learned!

# Space for your practice exercises
# Try to answer the questions above using pandas methods

Sort¶

Basic sorting¶

Sorting by multiple columns¶

Finding the top and bottom values¶

Practice exercises¶

First Python Notebook VSCode

Navigation