Data Analysis: Intro

Data Analysis: Intro#

Below are Python tutorials covering the basics of data cleaning and wrangling. Chris Albon’s guide is particularly helpful. Rather than reinventing the wheel, this tutorial instead highlights specific methods and operations that might make your life easier as a data analyst.

Import and export data in Python
Merge tabular and geospatial data
Functions
Grouping
Aggregating
Export aggregated output

Getting Started#

import numpy as np
import pandas as pd
import geopandas as gpd

Import and Export Data in Python for Data Analysis#

Local files#

We import a tabular dataframe my_csv.csv and an Excel spreadsheet my_excel.xlsx.

df = pd.read_csv('./folder/my_csv.csv')

df = pd.read_excel('./folder/my_excel.xlsx', sheet_name = 'Sheet1')

GCS#

The data we use outside of the warehouse can be stored in GCS buckets.

# Read from GCS
df = pd.read_csv('gs://calitp-analytics-data/data-analyses/bucket-name/df_csv.csv')

#Write to GCS
df.to_csv('gs://calitp-analytics-data/data-analyses/bucket-name/df_csv.csv')

Refer to the Data Management best practices and Basics of Working with Geospatial Data for additional information on importing various file types.

Merge Tabular and Geospatial Data for Data Analysis#

Merging data from multiple sources creates one large dataframe (df) to perform data analysis. Let’s say there are 3 sources of data that need to be merged:

Dataframe #1: council_population (tabular)

CD	Council_Member	Population
1	Leslie Knope	1,500
2	Jeremy Jamm	2,000
3	Douglass Howser	2,250

Dataframe #2: paunch_locations (geospatial)

Store	City	Sales_millions	CD	Geometry
1	Pawnee	$5	1	(x1,y1)
2	Pawnee	$2.5	2	(x2, y2)
3	Pawnee	$2.5	3	(x3, y3)
4	Eagleton	$2		(x4, y4)
5	Pawnee	$4	1	(x5, y5)
6	Pawnee	$6	2	(x6, y6)
7	Indianapolis	$7		(x7, y7)

If paunch_locations did not come with the council district information, use a spatial join to attach the council district within which the store falls. More on spatial joins here.

Dataframe #3: council_boundaries (geospatial)

District	Geometry
1	polygon
2	polygon
3	polygon

First, merge paunch_locations with council_population using the CD column, which they have in common.

merge1 = pd.merge(
    paunch_locations,
    council_population,
    on = 'CD',
    how = 'inner',
    validate = 'm:1'
)

# m:1 many-to-1 merge means that CD appears multiple times in
# paunch_locations, but only once in council_population.

Next, merge merge1 and council_boundaries. Columns don’t have to have the same names to be matched on, as long as they hold the same values.

merge2 = pd.merge(
    merge1,
    council_boundaries,
    left_on = 'CD',
    right_on = 'District',
    how = 'left',
    validate = 'm:1'
)

Here are some things to know about merge2:

merge2 is a geodataframe (gdf) because the base, paunch_locations, is a gdf.
Pandas allows the merge to take place even if the Geometry column appears in both dfs. The resulting df contains 2 renamed Geometry columns; Geometry_x corresponds to the left df Geometry and Geometry_y for the right df.
Geopandas still designates a geometry to use. To see what which geometry column is set, type merge2.geometry.name. To change the geometry to a different column, type merge2 = merge2.set_geometry('new_column').

merge2 looks like this:

Store	City	Sales_millions	CD	Geometry_x	Council_Member	Population	Geometry_y
1	Pawnee	$5	1	(x1,y1)	Leslie Knope	1,500	polygon
2	Pawnee	$2.5	2	(x2, y2)	Jeremy Jamm	2,000	polygon
3	Pawnee	$2.5	3	(x3, y3)	Douglass Howser	2,250	polygon
5	Pawnee	$4	1	(x5, y5)	Leslie Knope	1,500	polygon
6	Pawnee	$6	2	(x6, y6)	Jeremy Jamm	2,000	polygon

Functions#

A function is a set of instructions to do something. It can be as simple as changing values in a column or as complicated as a series of steps to clean, group, aggregate, and plot the data.

Lambda Functions#

Lambda functions are quick and dirty. You don’t even have to name the function! These are used for one-off functions that you don’t need to save for repeated use within the script or notebook. You can use it for any simple function (e.g., if-else statements, etc) you want to apply to all rows of the df.

df: Andy Dwyer’s band names and number of songs played under that name

Band	Songs
Mouse Rat	30
Scarecrow Boat	15
Jet Black Pope	4
Nothing Rhymes with Orange	6

If-Else Statements#

# Create column called duration. If Songs > 10, duration is 'long'.
# Otherwise, duration is 'short'.
df['duration'] = df.apply(lambda row: 'long' if row.Songs > 10
                else 'short', axis = 1)

# Create column called famous. If Band is 'Mouse Rat', famous is 1,
# otherwise 0.
df['famous'] = df.apply(lambda row: 1 if row.Band == 'Mouse Rat'
                else 0, axis = 1)

# An equivalent full function would be:
def tag_famous(row):
    if row.Band == 'Mouse Rat':
        return 1
    else:
        return 0

df['famous'] = df.apply(tag_famous, axis = 1)

df

Band	Songs	duration	famous
Mouse Rat	30	long	1
Scarecrow Boat	15	long	0
Jet Black Pope	4	short	0
Nothing Rhymes with Orange	6	short	0

Other Lambda Functions#

# Split the band name at the spaces
# [1] means we want to extract the second word
# [0:2] means we want to start at the first character
# and stop at (but not include) the 3rd character
df['word2_start'] = df.apply(lambda x:
                    x.Band.split(" ")[1][0:2], axis = 1)
df

Band	Songs	word2_start
Mouse Rat	30	Ra
Scarecrow Boat	15	Bo
Jet Black Pope	4	Po
Nothing Rhymes with Orange	6	Or

Apply over Dataframe#

You should use a full function when a function is too complicated to be a lambda function. These functions are defined by a name and are called upon to operate on the rows of a dataframe. You can also write more complex functions that bundle together all the steps (including nesting more functions) you want to execute over the dataframe.

df.apply is one common usage of a function.

def years_active(row):
    if row.Band == 'Mouse Rat':
        return '2009-2014'
    elif row.Band == 'Scarecrow Boat':
        return '2009'
    elif (row.Band == 'Jet Black Pope') or (row.Band ==
    'Nothing Rhymes with Orange'):
        return '2008'

df['Active'] = df.apply(years_active, axis = 1)
df

Band	Songs	Active
Mouse Rat	30	2009-2014
Scarecrow Boat	15	2009
Jet Black Pope	4	2008
Nothing Rhymes with Orange	6	2008

Grouping#

Sometimes it’s necessary to create a new column to group together certain values of a column. Here are two ways to accomplish this:

Method #1: Write a function using if-else statement and apply it using a lambda function.

# The function is called elected_year, and it operates on every row.
def elected_year(row):
    # For each row, if Council_Member says 'Leslie Knope', then return 2012
    # as the value.
    if row.Council_Member == 'Leslie Knope':
        return 2012
    elif row.Council_Member == 'Jeremy Jamm':
        return 2008
    elif row.Council_Member == 'Douglass Howser':
        return 2006

# Use a lambda function to apply the elected_year function to all rows in the df.
# Don't forget axis = 1 (apply function to all rows)!
council_population['Elected'] = council_population.apply(lambda row:
    elected_year(row), axis = 1)

council_population

CD	Council_Member	Population	Elected
1	Leslie Knope	1,500	2012
2	Jeremy Jamm	2,000	2008
3	Douglass Howser	2,250	2006

Method #2: Loop over every value, fill in the new column value, then attach that new column.

# Create a list to store the new column
sales_group = []

for row in paunch_locations['Sales_millions']:
    # If sales are more than $3M, but less than $5M, tag as moderate.
    if (row >= 3) and (row <= 5):
        sales_group.append('moderate')
    # If sales are more than $5M, tag as high.
    elif row >=5:
        sales_group.append('high')
    # Anything else, aka, if sales are less than $3M, tag as low.
    else:
        sales_group.append('low')


paunch_locations['sales_group'] = sales_group

paunch_locations

Store	City	Sales_millions	CD	Geometry	sales_group
1	Pawnee	$5	1	(x1,y1)	moderate
2	Pawnee	$2.5	2	(x2, y2)	low
3	Pawnee	$2.5	3	(x3, y3)	low
4	Eagleton	$2		(x4, y4)	low
5	Pawnee	$4	1	(x5, y5)	moderate
6	Pawnee	$6	2	(x6, y6)	high
7	Indianapolis	$7		(x7, y7)	high

Aggregating#

One of the most common form of summary statistics is aggregating by groups. In Excel, it’s called a pivot table. In ArcGIS, it’s doing a dissolve and calculating summary statistics. There are two ways to do it in Python: groupby and agg or pivot_table.

To answer the question of how many Paunch Burger locations there are per Council District and the sales generated per resident,

# Method #1: groupby and agg
pivot = (merge2.groupby(['CD'])
        .agg({'Sales_millions': 'sum',
              'Store': 'count',
              'Population': 'mean'}
             ).reset_index()
        )

# Method #2: pivot table
pivot = merge2.pivot_table(
        index= ['CD'],
        values = ['Sales_millions', 'Store', 'Population'],
        aggfunc= {
            'Sales_millions': 'sum',
            'Store': 'count',
            'Population': 'mean'}
       ).reset_index()

    # to only find one type of summary statistic, use aggfunc = 'sum'

# reset_index() will compress the headers of the table, forcing them to appear
# in 1 row rather than 2 separate rows

pivot looks like this:

CD	Sales_millions	Store	Council_Member	Population
1	$9	2	Leslie Knope	1,500
2	$8.5	2	Jeremy Jamm	2,000
3	$2.5	1	Douglass Howser	2,250

Export Aggregated Output#

Python can do most of the heavy lifting for data cleaning, transformations, and general wrangling. But, for charts or tables, it might be preferable to finish in Excel so that visualizations conform to the corporate style guide.

Dataframes can be exported into Excel and written into multiple sheets.

import xlsxwriter

# initiate a writer
writer = pd.ExcelWriter('../outputs/filename.xlsx', engine='xlsxwriter')

council_population.to_excel(writer, sheet_name = 'council_pop')
paunch_locations.to_excel(writer, sheet_name = 'paunch_locations')
merge2.to_excel(writer, sheet_name = 'merged_data')
pivot.to_excel(writer, sheet_name = 'pivot')

# Close the Pandas Excel writer and output the Excel file.
writer.save()

Geodataframes can be exported as a shapefile or GeoJSON to visualize in ArcGIS/QGIS.

gdf.to_file(driver = 'ESRI Shapefile', filename = '../folder/my_shapefile.shp' )

gdf.to_file(driver = 'GeoJSON', filename = '../folder/my_geojson.geojson')