Sort

Another simple but common technique for analyzing data is sorting. This can be useful for ranking the DataFrame to show the first and last members of the table according to a particular column.

Let’s start by preparing our data:

import pandas as pd

# Load accident data
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()

# Create accident counts
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})

# Load survey data and merge
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")

# Calculate accident rates
merged_list["per_hour"] = merged_list.accidents / merged_list.total_hours
merged_list["per_100k_hours"] = (merged_list.accidents / merged_list.total_hours) * 100_000

print("Data prepared for sorting analysis")
merged_list.head()
Data prepared for sorting analysis
latimes_make_and_model accidents total_hours per_hour per_100k_hours
0 AGUSTA 109 2 362172 5.522238e-06 0.552224
1 AIRBUS 130 1 1053786 9.489593e-07 0.094896
2 AIRBUS 135 4 884596 4.521838e-06 0.452184
3 AIRBUS 350 29 3883490 7.467510e-06 0.746751
4 BELL 206 30 5501308 5.453249e-06 0.545325

Basic sorting

The sort_values method is how pandas sorts DataFrames. It expects you to provide it with the name of the column to sort by in quotes. Try sorting by our computed field:

merged_list.sort_values("per_100k_hours")
latimes_make_and_model accidents total_hours per_hour per_100k_hours
1 AIRBUS 130 1 1053786 9.489593e-07 0.094896
11 SIKORSKY 76 2 915515 2.184563e-06 0.218456
10 SCHWEIZER 269 5 1139326 4.388560e-06 0.438856
2 AIRBUS 135 4 884596 4.521838e-06 0.452184
4 BELL 206 30 5501308 5.453249e-06 0.545325
0 AGUSTA 109 2 362172 5.522238e-06 0.552224
5 BELL 407 13 2113788 6.150096e-06 0.615010
8 ROBINSON R22 20 2970806 6.732180e-06 0.673218
3 AIRBUS 350 29 3883490 7.467510e-06 0.746751
6 HUGHES 369 13 1201688 1.081812e-05 1.081812
7 MCDONNELL DOUGLAS 369 6 550689 1.089544e-05 1.089544
9 ROBINSON R44 38 2359729 1.610354e-05 1.610354

Note that by default sort_values returns the DataFrame sorted in ascending order from lowest to highest. You can show the largest values first by passing in an optional keyword argument called ascending. When it is set to False, the DataFrame is sorted in descending order:

merged_list.sort_values("per_100k_hours", ascending=False)
latimes_make_and_model accidents total_hours per_hour per_100k_hours
9 ROBINSON R44 38 2359729 1.610354e-05 1.610354
7 MCDONNELL DOUGLAS 369 6 550689 1.089544e-05 1.089544
6 HUGHES 369 13 1201688 1.081812e-05 1.081812
3 AIRBUS 350 29 3883490 7.467510e-06 0.746751
8 ROBINSON R22 20 2970806 6.732180e-06 0.673218
5 BELL 407 13 2113788 6.150096e-06 0.615010
0 AGUSTA 109 2 362172 5.522238e-06 0.552224
4 BELL 206 30 5501308 5.453249e-06 0.545325
2 AIRBUS 135 4 884596 4.521838e-06 0.452184
10 SCHWEIZER 269 5 1139326 4.388560e-06 0.438856
11 SIKORSKY 76 2 915515 2.184563e-06 0.218456
1 AIRBUS 130 1 1053786 9.489593e-07 0.094896

Sorting by multiple columns

You can also sort by multiple columns by passing a list of column names. This is useful when you want a primary and secondary sort:

# Sort by accidents (descending) then by total hours (descending)
merged_list.sort_values(["accidents", "total_hours"], ascending=[False, False])
latimes_make_and_model accidents total_hours per_hour per_100k_hours
9 ROBINSON R44 38 2359729 1.610354e-05 1.610354
4 BELL 206 30 5501308 5.453249e-06 0.545325
3 AIRBUS 350 29 3883490 7.467510e-06 0.746751
8 ROBINSON R22 20 2970806 6.732180e-06 0.673218
5 BELL 407 13 2113788 6.150096e-06 0.615010
6 HUGHES 369 13 1201688 1.081812e-05 1.081812
7 MCDONNELL DOUGLAS 369 6 550689 1.089544e-05 1.089544
10 SCHWEIZER 269 5 1139326 4.388560e-06 0.438856
2 AIRBUS 135 4 884596 4.521838e-06 0.452184
11 SIKORSKY 76 2 915515 2.184563e-06 0.218456
0 AGUSTA 109 2 362172 5.522238e-06 0.552224
1 AIRBUS 130 1 1053786 9.489593e-07 0.094896

Finding the top and bottom values

Sometimes you just want the highest or lowest values. Pandas provides convenient methods for this:

# Top 5 models by accident rate
print("Top 5 models by accident rate per 100k hours:")
print(merged_list.nlargest(5, "per_100k_hours")[["latimes_make_and_model", "per_100k_hours"]])
Top 5 models by accident rate per 100k hours:
  latimes_make_and_model  per_100k_hours
9           ROBINSON R44        1.610354
7  MCDONNELL DOUGLAS 369        1.089544
6             HUGHES 369        1.081812
3             AIRBUS 350        0.746751
8           ROBINSON R22        0.673218
# Bottom 5 models by accident rate
print("Bottom 5 models by accident rate per 100k hours:")
print(merged_list.nsmallest(5, "per_100k_hours")[["latimes_make_and_model", "per_100k_hours"]])
Bottom 5 models by accident rate per 100k hours:
   latimes_make_and_model  per_100k_hours
1              AIRBUS 130        0.094896
11            SIKORSKY 76        0.218456
10          SCHWEIZER 269        0.438856
2              AIRBUS 135        0.452184
4                BELL 206        0.545325

Practice exercises

Congratulations! With sorting, you’ve covered most of the basic skills necessary to access and analyze data with pandas. Here are some practice questions you can answer using the techniques we’ve learned:

  1. What’s the date of the most recent fatal helicopter accident in Texas?

  2. How many fatalities occurred in Texas accidents?

  3. What helicopter model logged the most flight hours?

  4. Where did the accident with the NTSB number ERA13LA057 occur?

Try to answer these questions using the sorting and filtering techniques we’ve learned!

# Space for your practice exercises
# Try to answer the questions above using pandas methods