Sort¶
Another simple but common technique for analyzing data is sorting. This can be useful for ranking the DataFrame to show the first and last members of the table according to a particular column.
Let’s start by preparing our data:
import pandas as pd
# Load accident data
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
# Create accident counts
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})
# Load survey data and merge
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")
# Calculate accident rates
merged_list["per_hour"] = merged_list.accidents / merged_list.total_hours
merged_list["per_100k_hours"] = (merged_list.accidents / merged_list.total_hours) * 100_000
print("Data prepared for sorting analysis")
merged_list.head()
Data prepared for sorting analysis
| latimes_make_and_model | accidents | total_hours | per_hour | per_100k_hours | |
|---|---|---|---|---|---|
| 0 | AGUSTA 109 | 2 | 362172 | 5.522238e-06 | 0.552224 |
| 1 | AIRBUS 130 | 1 | 1053786 | 9.489593e-07 | 0.094896 |
| 2 | AIRBUS 135 | 4 | 884596 | 4.521838e-06 | 0.452184 |
| 3 | AIRBUS 350 | 29 | 3883490 | 7.467510e-06 | 0.746751 |
| 4 | BELL 206 | 30 | 5501308 | 5.453249e-06 | 0.545325 |
Basic sorting¶
The sort_values method is how pandas sorts DataFrames. It expects you to provide it with the name of the column to sort by in quotes. Try sorting by our computed field:
merged_list.sort_values("per_100k_hours")
| latimes_make_and_model | accidents | total_hours | per_hour | per_100k_hours | |
|---|---|---|---|---|---|
| 1 | AIRBUS 130 | 1 | 1053786 | 9.489593e-07 | 0.094896 |
| 11 | SIKORSKY 76 | 2 | 915515 | 2.184563e-06 | 0.218456 |
| 10 | SCHWEIZER 269 | 5 | 1139326 | 4.388560e-06 | 0.438856 |
| 2 | AIRBUS 135 | 4 | 884596 | 4.521838e-06 | 0.452184 |
| 4 | BELL 206 | 30 | 5501308 | 5.453249e-06 | 0.545325 |
| 0 | AGUSTA 109 | 2 | 362172 | 5.522238e-06 | 0.552224 |
| 5 | BELL 407 | 13 | 2113788 | 6.150096e-06 | 0.615010 |
| 8 | ROBINSON R22 | 20 | 2970806 | 6.732180e-06 | 0.673218 |
| 3 | AIRBUS 350 | 29 | 3883490 | 7.467510e-06 | 0.746751 |
| 6 | HUGHES 369 | 13 | 1201688 | 1.081812e-05 | 1.081812 |
| 7 | MCDONNELL DOUGLAS 369 | 6 | 550689 | 1.089544e-05 | 1.089544 |
| 9 | ROBINSON R44 | 38 | 2359729 | 1.610354e-05 | 1.610354 |
Note that by default sort_values returns the DataFrame sorted in ascending order from lowest to highest. You can show the largest values first by passing in an optional keyword argument called ascending. When it is set to False, the DataFrame is sorted in descending order:
merged_list.sort_values("per_100k_hours", ascending=False)
| latimes_make_and_model | accidents | total_hours | per_hour | per_100k_hours | |
|---|---|---|---|---|---|
| 9 | ROBINSON R44 | 38 | 2359729 | 1.610354e-05 | 1.610354 |
| 7 | MCDONNELL DOUGLAS 369 | 6 | 550689 | 1.089544e-05 | 1.089544 |
| 6 | HUGHES 369 | 13 | 1201688 | 1.081812e-05 | 1.081812 |
| 3 | AIRBUS 350 | 29 | 3883490 | 7.467510e-06 | 0.746751 |
| 8 | ROBINSON R22 | 20 | 2970806 | 6.732180e-06 | 0.673218 |
| 5 | BELL 407 | 13 | 2113788 | 6.150096e-06 | 0.615010 |
| 0 | AGUSTA 109 | 2 | 362172 | 5.522238e-06 | 0.552224 |
| 4 | BELL 206 | 30 | 5501308 | 5.453249e-06 | 0.545325 |
| 2 | AIRBUS 135 | 4 | 884596 | 4.521838e-06 | 0.452184 |
| 10 | SCHWEIZER 269 | 5 | 1139326 | 4.388560e-06 | 0.438856 |
| 11 | SIKORSKY 76 | 2 | 915515 | 2.184563e-06 | 0.218456 |
| 1 | AIRBUS 130 | 1 | 1053786 | 9.489593e-07 | 0.094896 |
Sorting by multiple columns¶
You can also sort by multiple columns by passing a list of column names. This is useful when you want a primary and secondary sort:
# Sort by accidents (descending) then by total hours (descending)
merged_list.sort_values(["accidents", "total_hours"], ascending=[False, False])
| latimes_make_and_model | accidents | total_hours | per_hour | per_100k_hours | |
|---|---|---|---|---|---|
| 9 | ROBINSON R44 | 38 | 2359729 | 1.610354e-05 | 1.610354 |
| 4 | BELL 206 | 30 | 5501308 | 5.453249e-06 | 0.545325 |
| 3 | AIRBUS 350 | 29 | 3883490 | 7.467510e-06 | 0.746751 |
| 8 | ROBINSON R22 | 20 | 2970806 | 6.732180e-06 | 0.673218 |
| 5 | BELL 407 | 13 | 2113788 | 6.150096e-06 | 0.615010 |
| 6 | HUGHES 369 | 13 | 1201688 | 1.081812e-05 | 1.081812 |
| 7 | MCDONNELL DOUGLAS 369 | 6 | 550689 | 1.089544e-05 | 1.089544 |
| 10 | SCHWEIZER 269 | 5 | 1139326 | 4.388560e-06 | 0.438856 |
| 2 | AIRBUS 135 | 4 | 884596 | 4.521838e-06 | 0.452184 |
| 11 | SIKORSKY 76 | 2 | 915515 | 2.184563e-06 | 0.218456 |
| 0 | AGUSTA 109 | 2 | 362172 | 5.522238e-06 | 0.552224 |
| 1 | AIRBUS 130 | 1 | 1053786 | 9.489593e-07 | 0.094896 |
Finding the top and bottom values¶
Sometimes you just want the highest or lowest values. Pandas provides convenient methods for this:
# Top 5 models by accident rate
print("Top 5 models by accident rate per 100k hours:")
print(merged_list.nlargest(5, "per_100k_hours")[["latimes_make_and_model", "per_100k_hours"]])
Top 5 models by accident rate per 100k hours:
latimes_make_and_model per_100k_hours
9 ROBINSON R44 1.610354
7 MCDONNELL DOUGLAS 369 1.089544
6 HUGHES 369 1.081812
3 AIRBUS 350 0.746751
8 ROBINSON R22 0.673218
# Bottom 5 models by accident rate
print("Bottom 5 models by accident rate per 100k hours:")
print(merged_list.nsmallest(5, "per_100k_hours")[["latimes_make_and_model", "per_100k_hours"]])
Bottom 5 models by accident rate per 100k hours:
latimes_make_and_model per_100k_hours
1 AIRBUS 130 0.094896
11 SIKORSKY 76 0.218456
10 SCHWEIZER 269 0.438856
2 AIRBUS 135 0.452184
4 BELL 206 0.545325
Practice exercises¶
Congratulations! With sorting, you’ve covered most of the basic skills necessary to access and analyze data with pandas. Here are some practice questions you can answer using the techniques we’ve learned:
What’s the date of the most recent fatal helicopter accident in Texas?
How many fatalities occurred in Texas accidents?
What helicopter model logged the most flight hours?
Where did the accident with the NTSB number
ERA13LA057occur?
Try to answer these questions using the sorting and filtering techniques we’ve learned!
# Space for your practice exercises
# Try to answer the questions above using pandas methods