Data Analysis: Intermediate#

After polishing off the intro tutorial, you’re ready to devour some more techniques to simplify your life as a data analyst.

Getting Started#

import numpy as np
import pandas as pd
import geopandas as gpd

Create a New Column Using a Dictionary to Map the Values#

Sometimes, you want to create a new column by converting one set of values into a different set of values. We could write a function or we could use the map function to add a new column. For our df, we want a new column that shows the state.

df: person and birthplace

Person

Birthplace

Leslie Knope

Eagleton, Indiana

Tom Haverford

South Carolina

Ann Perkins

Michigan

Ben Wyatt

Partridge, Minnesota

Write a Function#

Quick refresher on functions

# Create a function called state_abbrev.
def state_abbrev(row):
    # The find function returns the index of where 'Indiana' is found in
    #the column. If it cannot find it, it returns -1.
    if row.Birthplace.find('Indiana') != -1:
        return 'IN'
    elif row.Birthplace.find('South Carolina') != -1:
        return 'SC'
    # For an exact match, we would write it this way.
    elif row.Birthplace == 'Michigan':
        return 'MI'
    elif row.Birthplace.find('Minnesota') != -1:
        return 'MI'

# Apply this function and create the State column.
df['State'] = df.apply(state_abbrev, axis = 1)

Use a Dictionary to Map the Values#

But, writing a function could take up a lot of space, especially with all the if-elif-else statements. Alternatively, a dictionary would also work. We could use a dictionary and map the four different city-state values into the state abbreviation.

state_abbrev1 = {'Eagleton, Indiana': 'IN', 'South Carolina': 'SC',
                'Michigan': 'MI', 'Partridge, Minnesota': 'MN'}

df['State'] = df.Birthplace.map(state_abbrev1)

But, if we wanted to avoid writing out all the possible combinations, we would first extract the state portion of the city-state text. Then we could map the state’s full name with its abbreviation.

# The split function splits at the comma and expand the columns.
# Everything is stored in a new df called 'fullname'.
fullname = df['Birthplace'].str.split(",", expand = True)

# Add the City column into our df by extracting the first column (0) from fullname.
df['City'] = fullname[0]

# Add the State column by extracting the second column (1) from fullname.
df['State_full'] = fullname[1]


# Tom Haverford's birthplace is South Carolina. We don't have city information.
# So, the City column would be incorrectly filled in with South Carolina, and
# the State would say None.
# Fix these so the Nones actually display the state information correctly.

df['State_full'] = df.apply(lambda row: row.City if row.State == None else
                    row.State_full, axis = 1)

# Now, use a dictionary to map the values.
state_abbrev2 = {'Indiana': 'IN', 'South Carolina': 'SC',
                'Michigan': 'MI', 'Minnesota': 'MN'}

df['State'] = df.Birthplace.map(state_abbrev2)

All 3 methods would give us this df:

Person

Birthplace

State

Leslie Knope

Eagleton, Indiana

IN

Tom Haverford

South Carolina

SC

Ann Perkins

Michigan

MI

Ben Wyatt

Partridge, Minnesota

MN

Loop over Columns with a Dictionary#

If there are operations or data transformations that need to be performed on multiple columns, the best way to do that is with a loop.

columns = ['colA', 'colB', 'colC']

for c in columns:
    # Fill in missing values for all columns with zeros
    df[c] = df[c].fillna(0)
    # Multiply all columns by 0.5
    df[c] = df[c] * 0.5

Loop over Dataframes with a Dictionary#

It’s easier and more efficient to use a loop to do the same operations over the different dataframes (df). Here, we want to find the number of Pawnee businesses and Tom Haverford businesses located in each Council District.

This type of question is perfect for a loop. Each df will be spatially joined to the geodataframe council_district, followed by some aggregation.

business: list of Pawnee stores

Business

longitude

latitude

Sales_millions

geometry

Paunch Burger

x1

y1

5

Point(x1, y1)

Sweetums

x2

y2

30

Point(x2, y2)

Jurassic Fork

x3

y3

2

Point(x3, y3)

Gryzzl

x4

y4

40

Point(x4, y4)

tom: list of Tom Haverford businesses

Business

longitude

latitude

Sales_millions

geometry

Tom’s Bistro

x1

y1

30

Point(x1, y1)

Entertainment 720

x2

y2

1

Point(x2, y2)

Rent-A-Swag

x3

y3

4

Point(x3, y3)

# Save our existing dfs into a dictionary. The business df is named
# 'pawnee"; the tom df is named 'tom'.
dfs = {'pawnee': business, 'tom': tom}

# Create an empty dictionary called summary_dfs to hold the results
summary_dfs = {}

# Loop over key-value pairs
## Keys: pawnee, tom (names given to dataframes)
## Values: business, tom (dataframes)

for key, value in dfs.items():
    # Use f string to define a variable join_df (result of our spatial join)
    ## join_{key} would be join_pawnee or join_tom in the loop
    join_df = "join_{key}"

    # Spatial join
    join_df = gpd.sjoin(value, council_district, how = 'inner', op = 'intersects')

    # Calculate summary stats with groupby, agg, then save it into summary_dfs,
    # naming it 'pawnee' or 'tom'.
    summary_dfs[key] = join.groupby('ID').agg(
        {'Business': 'count', 'Sales_millions': 'sum'})

Now, our summary_dfs dictionary contains 2 items, which are the 2 dataframes with everything aggregated.

# To view the contents of this dictionary
for key, value in summary_dfs.items():
    display(key)
    display(value)

# To access the df
summary_dfs["pawnee"]
summary_dfs["tom"]

join_tom: result of spatial join between tom and council_district

Business

longitude

latitude

Sales_millions

geometry

ID

Tom’s Bistro

x1

y1

30

Point(x1, y1)

1

Entertainment 720

x2

y2

1

Point(x2, y2)

3

Rent-A-Swag

x3

y3

4

Point(x3, y3)

3

summary_dfs["tom"]: result of the counting number of Tom’s businesses by CD

ID

Business

Sales_millions

1

1

30

3

2

5