Helpful Links#
Here are some resources that data analysts have collected and referenced. We hope they help you in your work.
Data Analysis#
Python#
Pandas#
Summarizing#
Merging#
When the "merge on" column is a string data type, it can be difficult to get the DataFrames to join because the same entity is often spelled differently in each data set. For example, df1 lists County of Sonoma, Human Services Department, Adult and Aging Division, but df2 references the same department as County of Sonoma (Human Services Department).
Potential Solution #1: add a column to one DataFrame and fill it in wherever its string values partially match the values in the other DataFrame, then merge on that column.
Potential Solution #2: use the package fuzzymatcher. Review the results carefully, since fuzzy matching can produce bad matches.
Potential Solution #3: if you don't have too many values, use a dictionary that maps one spelling to the other.
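As a rough illustration of Solution #3, the sketch below standardizes one DataFrame's spelling with a dictionary before merging. The column names (`agency`, `requested`, `awarded`), the values, and the `crosswalk` dictionary are hypothetical; only the two department spellings come from the example above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "agency": ["County of Sonoma, Human Services Department, Adult and Aging Division"],
    "requested": [100],  # hypothetical value
})
df2 = pd.DataFrame({
    "agency": ["County of Sonoma (Human Services Department)"],
    "awarded": [80],  # hypothetical value
})

# Hypothetical crosswalk: map each df2 spelling to the df1 spelling.
crosswalk = {
    "County of Sonoma (Human Services Department)":
        "County of Sonoma, Human Services Department, Adult and Aging Division",
}

# replace() leaves any value that is not a key in the dictionary unchanged.
df2_clean = df2.assign(agency=df2["agency"].replace(crosswalk))

# indicator=True adds a "_merge" column showing which rows matched.
merged = df1.merge(df2_clean, on="agency", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
```

Checking `unmatched` after the merge is a quick way to find spellings that still need an entry in the dictionary.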
Dates#
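# Make sure your columns are datetime objects (for example, via pd.to_datetime)
# Number of days between each row's prepared_date and the previous row's
df['n_days_between'] = (df['prepared_date'] - df['prepared_date'].shift(1)).dt.days
# Financial year starting in April: January through March count toward the previous year
df['financial_year'] = df['base_date'].map(lambda x: x.year if x.month > 3 else x.year - 1)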
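The two date snippets above can be tried on a small, made-up DataFrame. This is only a sketch: the dates are invented, and it uses `diff()` and `where()` as vectorized equivalents of the `shift` and `map` calls.

```python
import pandas as pd

# Made-up dates for illustration.
df = pd.DataFrame({
    "prepared_date": ["2021-01-01", "2021-01-15", "2021-03-01"],
    "base_date": ["2021-03-31", "2021-04-01", "2022-01-10"],
})

# Convert strings to datetime so the .dt accessor is available.
df["prepared_date"] = pd.to_datetime(df["prepared_date"])
df["base_date"] = pd.to_datetime(df["base_date"])

# diff() subtracts the previous row, same as col - col.shift(1).
df["n_days_between"] = df["prepared_date"].diff().dt.days

# Keep the calendar year when the month is after March, else use the prior year.
year = df["base_date"].dt.year
df["financial_year"] = year.where(df["base_date"].dt.month > 3, year - 1)
```

The first row of `n_days_between` is missing because there is no previous row to compare against.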
Monetary Values#
x=alt.X("Funding Amount", axis=alt.Axis(format="$.2s", title="Obligated Funding ($2021)"))
Adjust for inflation.
# Must install and import the cpi package for the function to work.
import cpi

def adjust_prices(df):
    cols = ["total_requested",
            "fed_requested",
            "ac_requested"]

    def inflation_table(base_year):
        # Refresh the local CPI data, then average the values by year.
        cpi.update()
        series_df = cpi.series.get(area="U.S. city average").to_dataframe()
        inflation_df = (series_df[series_df.year >= 2008]
                        .pivot_table(index='year', values='value', aggfunc='mean')
                        .reset_index()
                       )
        denominator = inflation_df.value.loc[inflation_df.year==base_year].iloc[0]
        inflation_df = inflation_df.assign(
            inflation = inflation_df.value.divide(denominator)
        )
        return inflation_df

    # Get the CPI table. Use a name other than `cpi` so the module is not shadowed.
    cpi_table = inflation_table(2021)
    cpi_dict = dict(zip(cpi_table['year'], cpi_table['value']))

    # CPI value for the year each row was prepared
    multiplier = df["prepared_y"].map(cpi_dict)
    for col in cols:
        # Using 270.97 for 2021 dollars
        df[f"adjusted_{col}"] = ((df[col] * 270.97) / multiplier)
    return df
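The arithmetic behind the adjustment is amount × (base-year index ÷ index for the row's year). The sketch below shows that step on its own, without downloading CPI data. The index values in `cpi_by_year` are round placeholders, not real CPI figures, and `to_base_year_dollars` is a hypothetical helper name.

```python
import pandas as pd

# Placeholder index values for illustration only; these are NOT real CPI figures.
cpi_by_year = {2019: 200.0, 2020: 220.0, 2021: 250.0}
BASE_YEAR = 2021

df = pd.DataFrame({
    "prepared_y": [2019, 2020, 2021],
    "total_requested": [1000.0, 1100.0, 500.0],
})

def to_base_year_dollars(df, col, year_col="prepared_y"):
    """Scale `col` by the ratio of the base-year index to each row's year index."""
    yearly_index = df[year_col].map(cpi_by_year)
    return df[col] * cpi_by_year[BASE_YEAR] / yearly_index

df["adjusted_total_requested"] = to_base_year_dollars(df, "total_requested")
```

Rows already in the base year are unchanged, which is a useful sanity check on the result.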
Tidy Data#
Tidy Data follows a set of principles that ensure the data is easy to work with, especially when using tools like pandas and matplotlib. Primary rules of tidy data are:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
Tidy data ensures consistency, making it easier to work with tools like pandas, matplotlib, or seaborn. It also simplifies data manipulation, as functions like groupby(), pivot(), and melt() work more intuitively when the data is structured properly. Additionally, tidy data enables vectorized operations in pandas, allowing for efficient analysis on entire columns or rows at once.
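As a small sketch of these rules, the example below reshapes a wide table (one column per year) into tidy form with `melt()`, after which `groupby()` works directly. The county names and amounts are made up for illustration.

```python
import pandas as pd

# Wide layout: the "year" variable is spread across column headers.
wide = pd.DataFrame({
    "county": ["Sonoma", "Napa"],
    "2020": [10, 20],
    "2021": [15, 25],
})

# Tidy layout: one row per county-year observation, one column per variable.
tidy = wide.melt(id_vars="county", var_name="year", value_name="amount")

# With tidy data, a grouped summary is a single call.
totals = tidy.groupby("county")["amount"].sum()
```

In the tidy version, `county`, `year`, and `amount` each have their own column, and each county-year pair has its own row.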
Learn more about Tidy Data here.
Visualization#
Charts#
Altair#
Manually concatenate a bar chart and line chart to create a dual axis graph.
Resolving the error "TypeError: Object of type 'Timestamp' is not JSON serializable"
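The linked fix is not reproduced here. One common workaround is to convert the timestamps to strings before passing them to the chart. The sketch below uses the standard `json` module instead of Altair to show the underlying error and the conversion. The column name `date` and the values are made up.

```python
import json
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2021-01-01", "2021-02-01"])})

# A list of pandas Timestamp objects cannot be serialized to JSON directly.
try:
    json.dumps(df["date"].tolist())
    raised = False
except TypeError:
    raised = True

# Converting the column to ISO-formatted strings makes it serializable.
date_strings = df["date"].dt.strftime("%Y-%m-%d").tolist()
serialized = json.dumps(date_strings)
```

If the chart needs a temporal axis, the string column can still be encoded with Altair's `:T` type shorthand.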
Add tooltip to chart functions.
def add_tooltip(chart, tooltip1, tooltip2):
    # Show both fields when hovering over a mark.
    chart = chart.encode(tooltip=[tooltip1, tooltip2])
    return chart
Maps#
DataFrames#
ipywidgets#
Tabs#
Create tabs to switch between different views.