Helpful Links#

Here are some resources that data analysts have collected and referenced. We hope they help you in your work.

Data Analysis
- Python
- Pandas
- Summarizing
- Merging
- Dates
- Monetary Values
- Tidy Data
Visualization
- Charts
- Maps
- DataFrames
- Ipywidgets
- Markdown
- ReviewNB

Data Analysis#

Python#

Pandas#

Summarizing#

Merging#

When the “merge on” column is a string data type, it can be difficult to get the DataFrames to join, because the same entity is often written differently in each one. For example, df1 lists County of Sonoma, Human Services Department, Adult and Aging Division, but df2 refers to the same department as County of Sonoma (Human Services Department).
- Potential Solution #1: fill in a column in one DataFrame that has a partial match with the string values in another one.
- Potential Solution #2: use the package fuzzymatcher. This will require you to carefully comb through for any bad matches.
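Potential Solution #1 could be sketched roughly as follows. This is a minimal illustration, not a definitive approach: the sample data, the column names (agency, clients, budget, merge_key), the keywords list, and the find_keyword helper are all assumptions made up for this example.

```python
import pandas as pd

# Hypothetical example data; column names are assumptions for illustration.
df1 = pd.DataFrame({
    "agency": ["County of Sonoma, Human Services Department, Adult and Aging Division"],
    "clients": [120],
})
df2 = pd.DataFrame({
    "agency": ["County of Sonoma (Human Services Department)"],
    "budget": [500000],
})

# Keywords that identify the same department in both DataFrames (assumed list).
keywords = ["Human Services Department"]


def find_keyword(name):
    """Return the first keyword contained in name, or None if nothing matches."""
    for keyword in keywords:
        if keyword.lower() in name.lower():
            return keyword
    return None


# Fill in a shared key column on both sides, then merge on that column.
df1["merge_key"] = df1["agency"].apply(find_keyword)
df2["merge_key"] = df2["agency"].apply(find_keyword)

merged = df1.merge(df2, on="merge_key", how="left", suffixes=("_df1", "_df2"))
```

Rows whose merge_key is None did not match any keyword and are worth reviewing by hand before relying on the merged result.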

Dates#

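Use shift to calculate the number of days between the date in each row and the date in the previous row.

# Sort by date first so that each row is compared with the one before it
df['n_days_between'] = (df['prepared_date'] - df['prepared_date'].shift(1)).dt.days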
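A small worked example of the shift approach, using made-up dates, might look like this. The sample values are assumptions for illustration only; the first row has no previous row, so its result is missing (NaN).

```python
import pandas as pd

# Hypothetical sample dates for illustration.
df = pd.DataFrame({
    "prepared_date": pd.to_datetime(["2021-01-01", "2021-01-15", "2021-03-01"])
})

# Sort so that each row is compared with the chronologically previous one.
df = df.sort_values("prepared_date").reset_index(drop=True)

# Subtract the previous row's date and convert the difference to whole days.
df["n_days_between"] = (df["prepared_date"] - df["prepared_date"].shift(1)).dt.days
```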

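Assign a fiscal year to a date. The example below assumes a fiscal year that starts on April 1 and is labeled by the calendar year in which it starts; change the month cutoff to match your fiscal calendar.

# Make sure your column is a datetime object, e.g. by converting it with pd.to_datetime()
df['financial_year'] = df['base_date'].map(lambda x: x.year if x.month > 3 else x.year - 1)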
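The same rule can also be written without a lambda. The sketch below is one possible vectorized equivalent, assuming the same April 1 cutoff; the sample dates are made up for illustration.

```python
import pandas as pd

# Hypothetical sample dates on either side of the April 1 cutoff.
df = pd.DataFrame({"base_date": ["2021-03-31", "2021-04-01", "2022-01-15"]})
df["base_date"] = pd.to_datetime(df["base_date"])

year = df["base_date"].dt.year
month = df["base_date"].dt.month

# Keep the calendar year from April onward; otherwise use the previous year.
df["financial_year"] = year.where(month > 3, year - 1)
```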

Monetary Values#

Reformat values that are in scientific notation into millions or thousands.
- Example in notebook.

    x=alt.X("Funding Amount", axis=alt.Axis(format="$.2s", title="Obligated Funding ($2021)"))

Reformat values from 19000000 to $19.0M.
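Outside of a chart, a small helper can produce the same kind of label. The function below is a sketch only; the name format_dollars and the choice of K, M, and B suffixes are assumptions for this example, not part of any library.

```python
def format_dollars(value):
    """Format a number as a short dollar string, e.g. 19000000 -> '$19.0M'."""
    sign = "-" if value < 0 else ""
    amount = abs(value)
    # Check the largest unit first and stop at the first one that fits.
    for threshold, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if amount >= threshold:
            return f"{sign}${amount / threshold:.1f}{suffix}"
    # Values under one thousand are shown as whole dollars.
    return f"{sign}${amount:,.0f}"


label = format_dollars(19000000)
```

To apply it to a DataFrame column, something like df["Funding Amount"].apply(format_dollars) should work, keeping in mind that the result is a string column meant for display rather than further arithmetic.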
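Adjust for inflation. The function below converts the requested amounts to 2021 dollars using CPI values from the cpi package.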

# The cpi package must be installed for this function to work.
import cpi


def adjust_prices(df):
    # Dollar columns to adjust
    cols = ["total_requested",
            "fed_requested",
            "ac_requested"]

    def inflation_table(base_year):
        # Download the latest CPI data
        cpi.update()
        series_df = cpi.series.get(area="U.S. city average").to_dataframe()
        # Average the CPI values within each year, from 2008 onward
        inflation_df = (series_df[series_df.year >= 2008]
                        .pivot_table(index='year', values='value', aggfunc='mean')
                        .reset_index()
                       )
        denominator = inflation_df.value.loc[inflation_df.year == base_year].iloc[0]

        # Inflation factor of each year relative to the base year
        inflation_df = inflation_df.assign(
            inflation=inflation_df.value.divide(denominator)
        )

        return inflation_df

    # Get the CPI table. It is named cpi_table so that it does not
    # shadow the cpi module used inside inflation_table.
    cpi_table = inflation_table(2021)
    cpi_dict = dict(zip(cpi_table['year'], cpi_table['value']))

    # Look up the CPI value for the year of each row
    multiplier = df["prepared_y"].map(cpi_dict)

    for col in cols:
        # Using 270.97 as the CPI value for 2021 dollars
        df[f"adjusted_{col}"] = (df[col] * 270.97) / multiplier
    return df

Tidy Data#

Tidy data follows a set of principles that make the data easy to work with, especially when using tools like pandas and matplotlib. The primary rules of tidy data are:

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.

Because tidy data is structured consistently, functions like groupby(), pivot(), and melt() work more intuitively, which simplifies data manipulation in pandas and plotting with matplotlib or seaborn. Tidy data also suits vectorized operations in pandas, allowing for efficient analysis on entire columns or rows at once.
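As a rough illustration of these rules, the sketch below reshapes a wide table, where each year is its own column, into tidy form with melt() and then summarizes it with groupby(). The counties, years, and funding figures are made up for this example.

```python
import pandas as pd

# Hypothetical wide table: the "year" variable is spread across column names.
wide = pd.DataFrame({
    "county": ["Sonoma", "Marin"],
    "2020": [100, 80],
    "2021": [110, 85],
})

# Tidy form: one column per variable (county, year, funding),
# one row per observation (a county in a given year).
tidy = wide.melt(id_vars="county", var_name="year", value_name="funding")
tidy["year"] = tidy["year"].astype(int)

# With tidy data, a summary by year is a single groupby.
totals = tidy.groupby("year")["funding"].sum()
```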

Learn more about Tidy Data here.

Visualization#

Charts#

Altair#

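# Add a two-field tooltip to an existing Altair chart.
# tooltip1 and tooltip2 are column names (or alt.Tooltip objects).
def add_tooltip(chart, tooltip1, tooltip2):
    return chart.encode(tooltip=[tooltip1, tooltip2])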
