Data Analysis: Intermediate#
After polishing off the intro tutorial, you’re ready to devour some more techniques to simplify your life as a data analyst.
Getting Started#
import numpy as np
import pandas as pd
import geopandas as gpd
Create a New Column Using a Dictionary to Map the Values#
Sometimes, you want to create a new column by converting one set of values into a different set of values. We could write a function, or we could use the map function to add the new column. For our df, we want a new column that shows the state.
df: person and birthplace

| Person | Birthplace |
|---|---|
| Leslie Knope | Eagleton, Indiana |
| Tom Haverford | South Carolina |
| Ann Perkins | Michigan |
| Ben Wyatt | Partridge, Minnesota |
Write a Function#
# Create a function called state_abbrev.
def state_abbrev(row):
    # The find function returns the index of where 'Indiana' is found in
    # the column. If it cannot find it, it returns -1.
    if row.Birthplace.find('Indiana') != -1:
        return 'IN'
    elif row.Birthplace.find('South Carolina') != -1:
        return 'SC'
    # For an exact match, we would write it this way.
    elif row.Birthplace == 'Michigan':
        return 'MI'
    elif row.Birthplace.find('Minnesota') != -1:
        return 'MN'

# Apply this function and create the State column.
df['State'] = df.apply(state_abbrev, axis=1)
Use a Dictionary to Map the Values#
But writing a function can take up a lot of space, especially with all the if-elif-else statements. Alternatively, a dictionary also works: we can map the four different city-state values to their state abbreviations.
state_abbrev1 = {'Eagleton, Indiana': 'IN', 'South Carolina': 'SC',
'Michigan': 'MI', 'Partridge, Minnesota': 'MN'}
df['State'] = df.Birthplace.map(state_abbrev1)
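One thing to watch for with map: any value that is missing from the dictionary becomes NaN. A minimal, self-contained sketch (the df below is hypothetical, with a birthplace our dictionary doesn't cover):

```python
import pandas as pd

df = pd.DataFrame({'Birthplace': ['Eagleton, Indiana', 'South Carolina', 'Unknown']})
state_abbrev = {'Eagleton, Indiana': 'IN', 'South Carolina': 'SC'}

# 'Unknown' is not a key in the dictionary, so map() returns NaN for it.
# fillna() swaps in a placeholder if NaN is not acceptable.
df['State'] = df.Birthplace.map(state_abbrev).fillna('??')
print(df['State'].tolist())  # → ['IN', 'SC', '??']
```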
But if we wanted to avoid writing out every possible combination, we would first extract the state portion of the city-state text. Then we could map the state’s full name to its abbreviation.
# The split function splits at the comma and expands the columns.
# Everything is stored in a new df called 'fullname'.
fullname = df['Birthplace'].str.split(",", expand=True)

# Add the City column to our df by extracting the first column (0) from fullname.
df['City'] = fullname[0]

# Add the state by extracting the second column (1) from fullname.
df['State_full'] = fullname[1]

# Tom Haverford's birthplace is South Carolina. We don't have city information.
# So, the City column would be incorrectly filled in with South Carolina, and
# State_full would say None.
# Fix these so the Nones actually display the state information correctly.
df['State_full'] = df.apply(lambda row: row.City if row.State_full is None else
                            row.State_full, axis=1)

# Strip the leading space left over from the split
# ("Eagleton, Indiana" splits into "Eagleton" and " Indiana").
df['State_full'] = df['State_full'].str.strip()

# Now, use a dictionary to map the values.
state_abbrev2 = {'Indiana': 'IN', 'South Carolina': 'SC',
                 'Michigan': 'MI', 'Minnesota': 'MN'}
df['State'] = df.State_full.map(state_abbrev2)
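The split-then-map steps above can also be chained into one expression: str[-1] grabs the last piece of the split, which is the state whether or not a city is present. A sketch, assuming the same four birthplaces:

```python
import pandas as pd

df = pd.DataFrame({'Birthplace': ['Eagleton, Indiana', 'South Carolina',
                                  'Michigan', 'Partridge, Minnesota']})
state_abbrev2 = {'Indiana': 'IN', 'South Carolina': 'SC',
                 'Michigan': 'MI', 'Minnesota': 'MN'}

# Split at the comma, take the last piece (the state), strip stray spaces,
# then map the full state name to its abbreviation.
df['State'] = (df['Birthplace'].str.split(',')
               .str[-1].str.strip().map(state_abbrev2))
print(df['State'].tolist())  # → ['IN', 'SC', 'MI', 'MN']
```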
All 3 methods would give us this df:

| Person | Birthplace | State |
|---|---|---|
| Leslie Knope | Eagleton, Indiana | IN |
| Tom Haverford | South Carolina | SC |
| Ann Perkins | Michigan | MI |
| Ben Wyatt | Partridge, Minnesota | MN |
Loop over Columns with a List#
If the same operation or data transformation needs to be performed on multiple columns, a simple way to do it is to loop over a list of column names.
columns = ['colA', 'colB', 'colC']

for c in columns:
    # Fill in missing values with zeros
    df[c] = df[c].fillna(0)
    # Multiply each column by 0.5
    df[c] = df[c] * 0.5
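For simple, uniform operations like these, pandas can also act on a whole column selection at once, with no explicit loop. A sketch with a hypothetical df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [1.0, np.nan],
                   'colB': [2.0, 4.0],
                   'colC': [np.nan, 6.0]})
columns = ['colA', 'colB', 'colC']

# fillna and multiplication both broadcast across the selected columns,
# so the loop collapses into one line.
df[columns] = df[columns].fillna(0) * 0.5
```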
Loop over Dataframes with a Dictionary#
It’s easier and more efficient to use a loop to perform the same operations over several dataframes (dfs). Here, we want to find the number of Pawnee businesses and Tom Haverford businesses located in each Council District.

This type of question is perfect for a loop. Each df will be spatially joined to the geodataframe council_district, followed by some aggregation.
business: list of Pawnee stores

| Business | longitude | latitude | Sales_millions | geometry |
|---|---|---|---|---|
| Paunch Burger | x1 | y1 | 5 | Point(x1, y1) |
| Sweetums | x2 | y2 | 30 | Point(x2, y2) |
| Jurassic Fork | x3 | y3 | 2 | Point(x3, y3) |
| Gryzzl | x4 | y4 | 40 | Point(x4, y4) |
tom: list of Tom Haverford businesses

| Business | longitude | latitude | Sales_millions | geometry |
|---|---|---|---|---|
| Tom’s Bistro | x1 | y1 | 30 | Point(x1, y1) |
| Entertainment 720 | x2 | y2 | 1 | Point(x2, y2) |
| Rent-A-Swag | x3 | y3 | 4 | Point(x3, y3) |
# Save our existing dfs into a dictionary. The business df is named
# 'pawnee'; the tom df is named 'tom'.
dfs = {'pawnee': business, 'tom': tom}

# Create an empty dictionary called summary_dfs to hold the results.
summary_dfs = {}

# Loop over key-value pairs
## Keys: pawnee, tom (names given to the dataframes)
## Values: business, tom (the dataframes themselves)
for key, value in dfs.items():
    # Spatial join each df to council_district.
    # (predicate= replaces the deprecated op= in newer geopandas versions.)
    join_df = gpd.sjoin(value, council_district,
                        how='inner', predicate='intersects')
    # Calculate summary stats with groupby-agg, then save the result into
    # summary_dfs under the name 'pawnee' or 'tom'.
    summary_dfs[key] = join_df.groupby('ID').agg(
        {'Business': 'count', 'Sales_millions': 'sum'})
Now, our summary_dfs dictionary contains 2 items, which are the 2 dataframes with everything aggregated.
# To view the contents of this dictionary
for key, value in summary_dfs.items():
display(key)
display(value)
# To access the df
summary_dfs["pawnee"]
summary_dfs["tom"]
join_tom: result of spatial join between tom and council_district

| Business | longitude | latitude | Sales_millions | geometry | ID |
|---|---|---|---|---|---|
| Tom’s Bistro | x1 | y1 | 30 | Point(x1, y1) | 1 |
| Entertainment 720 | x2 | y2 | 1 | Point(x2, y2) | 3 |
| Rent-A-Swag | x3 | y3 | 4 | Point(x3, y3) | 3 |
summary_dfs["tom"]: result of counting the number of Tom’s businesses by Council District

| ID | Business | Sales_millions |
|---|---|---|
| 1 | 1 | 30 |
| 3 | 2 | 5 |
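If you later want the two summary tables stacked into a single dataframe, pd.concat accepts a dictionary directly and turns its keys into an outer index level. A self-contained sketch, with hypothetical aggregated dfs standing in for the sjoin results above:

```python
import pandas as pd

# Stand-ins for the aggregated results of the spatial joins above.
summary_dfs = {
    'pawnee': pd.DataFrame({'Business': [2, 2], 'Sales_millions': [35, 42]},
                           index=pd.Index([1, 2], name='ID')),
    'tom': pd.DataFrame({'Business': [1, 2], 'Sales_millions': [30, 5]},
                        index=pd.Index([1, 3], name='ID')),
}

# The dictionary keys become the first level of a MultiIndex,
# so each row remembers which df it came from.
combined = pd.concat(summary_dfs, names=['source'])
print(combined.loc[('tom', 3), 'Sales_millions'])  # → 5
```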