This notebook covers the process of acquiring and cleaning data.

A. Data Extraction
B. Data Wrangling

In [1]:
import requests
import json
import pandas as pd
import numpy as np

A. Data Extraction¶

The data were collected by making API requests to the U.S. Census Bureau’s ACS 2023 1-Year datasets. Information about ACS table types and variables: LINK

To construct the API request URLs, we first need to identify the variable codes and the geographic code (ucgid) for county-level data.
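The request URLs used below all follow the same general pattern; for example, with a single variable code filled in:

https://api.census.gov/data/2023/acs/acs1/subject?get=NAME,S1701_C03_001E&ucgid=pseudo(0100000US$0500000)&descriptive=true

The descriptive=true parameter adds a row of human-readable column labels to the response, as seen in the outputs below.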

  1. Identifying variables:
  • Variables are gathered from three different types of tables: Subject Tables, Detailed Tables, and Data Profiles.

  • A variable code starts with a table ID whose prefix indicates the table type: "S" for Subject Tables, "B" for Detailed Tables, and "DP" for Data Profiles. For example: S1701_C03_001E, DP03_0009PE, etc.

  • Relevant variables were selected by browsing tables and variable labels on https://data.census.gov. Each selected variable label was then mapped to its corresponding variable code using the following variable metadata documentation:

    • Detailed Tables
    • Subject Tables
    • Data Profiles
  • A more reliable way to retrieve variable codes is to use retrieving_variable_codes.ipynb (see the sketch after the table below).

    Description of selected variables and corresponding variable codes:

| Column | Description | Original table | Variable code |
| --- | --- | --- | --- |
| county | County name and state | All tables | NAME |
| poverty_rate | Poverty rate in percentage | S1701 | S1701_C03_001E |
| health_insurance | The percentage of people having health insurance | S2701 | S2701_C03_001E |
| median_household_income | Median household income in dollars | S1901 | S1901_C01_012E |
| bachelor_18-24 | The number of people aged 18–24 with a bachelor's degree | S1501 | S1501_C01_005E |
| bachelor_25+ | The number of people aged 25 and over with a bachelor's degree | S1501 | S1501_C01_012E |
| population_18-24 | The number of people aged 18–24 | S1501 | S1501_C01_001E |
| population_25+ | The number of people aged 25 and over | S1501 | S1501_C01_006E |
| unemployment_rate | The percentage of people who are unemployed | DP03 | DP03_0009PE |
| public_transit | The percentage of workers who commute to work by public transit | DP03 | DP03_0021PE |
| with_public_assistance | The number of households receiving public assistance income | B19057 | B19057_002E |
| total_households | The total number of households | B19057 | B19057_001E |
| median_house_value | Median house value in dollars | B25077 | B25077_001E |
| median_gross_rent | Median gross rent in dollars | B25064 | B25064_001E |
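As noted above, variable codes can also be looked up programmatically from the Census API's variable metadata. A minimal sketch, assuming the standard variables.json metadata endpoint for the Subject Tables dataset (the search term 'poverty' is only an example):

import requests

# Download the variable metadata for the ACS 2023 1-Year Subject Tables
meta = requests.get('https://api.census.gov/data/2023/acs/acs1/subject/variables.json').json()

# Print estimate variables whose label mentions the phrase of interest
for code, info in meta['variables'].items():
    if 'poverty' in info.get('label', '').lower() and code.endswith('E'):
        print(code, info['label'])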
  2. Identifying the geographic code:
  • Documentation for ucgid: LINK
  • The ucgid that returns data for all counties within the U.S. is 0100000US$0500000, passed to the API as pseudo(0100000US$0500000) in the request URLs below:
    • 0100000US: Fully qualified GEOID of the United States (parent geography)
    • $: A separator that links parent and child geographic levels
    • 0500000: The summary level code for counties
In [2]:
# Retrieve data from Subject tables
subject_col_ids = ",".join(['NAME','S1701_C03_001E','S2701_C03_001E','S1901_C01_012E','S1501_C01_005E','S1501_C01_012E','S1501_C01_001E','S1501_C01_006E'])
url = 'https://api.census.gov/data/2023/acs/acs1/subject'
response = requests.get(f'{url}?get={subject_col_ids}&ucgid=pseudo(0100000US$0500000)&descriptive=true')
content = response.json()

# Store the raw data in local data store
with open('data1.json', 'w') as file:
    json.dump(content, file)

df1 = pd.read_json('data1.json')
df1.head()
Out[2]:
0 1 2 3 4 5 6 7 8
0 NAME S1701_C03_001E S2701_C03_001E S1901_C01_012E S1501_C01_005E S1501_C01_012E S1501_C01_001E S1501_C01_006E ucgid
1 Geographic Area Name Estimate!!Percent below poverty level!!Populat... Estimate!!Percent Insured!!Civilian noninstitu... Estimate!!Households!!Median income (dollars) Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Geography
2 Baldwin County, Alabama 9.7 93.2 72915 2213 39152 15552 184852 0500000US01003
3 Calhoun County, Alabama 21.3 91.1 50780 605 10275 13499 78420 0500000US01015
4 Cullman County, Alabama 18.3 89.4 58923 661 7026 7223 64307 0500000US01043
In [3]:
# Retrieve data from Data Profiles tables
dp_col_ids = ",".join(['NAME','DP03_0009PE','DP03_0021PE'])
url = 'https://api.census.gov/data/2023/acs/acs1/profile'
response = requests.get(f'{url}?get={dp_col_ids}&ucgid=pseudo(0100000US$0500000)&descriptive=true')
content = response.json()

# Store the raw data in local data store
with open('data2.json', 'w') as file:
    json.dump(content, file)

df2 = pd.read_json('data2.json')
df2.head()
Out[3]:
0 1 2 3
0 NAME DP03_0009PE DP03_0021PE ucgid
1 Geographic Area Name Percent!!EMPLOYMENT STATUS!!Civilian labor for... Percent!!COMMUTING TO WORK!!Workers 16 years a... Geography
2 DeKalb County, Alabama 3.8 0.0 0500000US01049
3 Baldwin County, Alabama 2.0 0.0 0500000US01003
4 Calhoun County, Alabama 5.4 0.0 0500000US01015
In [4]:
# Retrieve data from Detailed Tables
db_col_ids = ",".join(['NAME','B19057_002E','B19057_001E','B25077_001E','B25064_001E'])
url = 'https://api.census.gov/data/2023/acs/acs1'
response = requests.get(f'{url}?get={db_col_ids}&ucgid=pseudo(0100000US$0500000)&descriptive=true')
content = response.json()

# Store the raw data in local data store
with open('data3.json', 'w') as file:
    json.dump(content, file)

df3 = pd.read_json('data3.json')
df3.head()
Out[4]:
0 1 2 3 4 5
0 NAME B19057_002E B19057_001E B25077_001E B25064_001E ucgid
1 Geographic Area Name Estimate!!Total:!!With public assistance income Estimate!!Total: Estimate!!Median value (dollars) Estimate!!Median gross rent Geography
2 Baldwin County, Alabama 618 105698 307000 1286 0500000US01003
3 Calhoun County, Alabama 766 46310 160900 782 0500000US01015
4 Cullman County, Alabama 483 35961 204000 788 0500000US01043
In [5]:
print(df1.shape)
print(df2.shape)
print(df3.shape)
(856, 9)
(856, 4)
(856, 6)
In [6]:
# Drop the duplicate ucgid columns in df1 and df2 (the ucgid column in df3 is kept)
df1.drop(columns=[8], inplace=True)
df2.drop(columns=[3], inplace=True)
In [7]:
# Merge the three dataframes on column 0 (county name)
merged_df = df1.merge(df2, on=0).merge(df3, on=0)
merged_df.head()
Out[7]:
0 1_x 2_x 3_x 4_x 5_x 6 7 1_y 2_y 1 2 3_y 4_y 5_y
0 NAME S1701_C03_001E S2701_C03_001E S1901_C01_012E S1501_C01_005E S1501_C01_012E S1501_C01_001E S1501_C01_006E DP03_0009PE DP03_0021PE B19057_002E B19057_001E B25077_001E B25064_001E ucgid
1 Geographic Area Name Estimate!!Percent below poverty level!!Populat... Estimate!!Percent Insured!!Civilian noninstitu... Estimate!!Households!!Median income (dollars) Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Percent!!EMPLOYMENT STATUS!!Civilian labor for... Percent!!COMMUTING TO WORK!!Workers 16 years a... Estimate!!Total:!!With public assistance income Estimate!!Total: Estimate!!Median value (dollars) Estimate!!Median gross rent Geography
2 Baldwin County, Alabama 9.7 93.2 72915 2213 39152 15552 184852 2.0 0.0 618 105698 307000 1286 0500000US01003
3 Calhoun County, Alabama 21.3 91.1 50780 605 10275 13499 78420 5.4 0.0 766 46310 160900 782 0500000US01015
4 Cullman County, Alabama 18.3 89.4 58923 661 7026 7223 64307 3.8 0.3 483 35961 204000 788 0500000US01043
In [8]:
merged_df.shape
Out[8]:
(856, 15)
In [9]:
merged_df.to_csv('raw_data.csv', index=False)

B. Data Wrangling¶

In [10]:
cleaned_df = merged_df.copy()

1. Tidiness Issue: Column headers are codes¶

In [11]:
# Promote the first row (variable codes) to column headers, then drop that row
cleaned_df.columns = cleaned_df.loc[0]
cleaned_df.drop(0, inplace=True)
cleaned_df.reset_index(drop=True, inplace=True)
cleaned_df.head(1)
Out[11]:
NAME S1701_C03_001E S2701_C03_001E S1901_C01_012E S1501_C01_005E S1501_C01_012E S1501_C01_001E S1501_C01_006E DP03_0009PE DP03_0021PE B19057_002E B19057_001E B25077_001E B25064_001E ucgid
0 Geographic Area Name Estimate!!Percent below poverty level!!Populat... Estimate!!Percent Insured!!Civilian noninstitu... Estimate!!Households!!Median income (dollars) Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT... Percent!!EMPLOYMENT STATUS!!Civilian labor for... Percent!!COMMUTING TO WORK!!Workers 16 years a... Estimate!!Total:!!With public assistance income Estimate!!Total: Estimate!!Median value (dollars) Estimate!!Median gross rent Geography

Issue: Column headers are alphanumeric codes that are hard to understand. These headers should be converted into short descriptive names.

Solution: Convert column headers into readable column names.

The first row contains descriptive labels for each column; inspecting them helps map each variable code to a readable name.

In [12]:
# Check for the column labels
for i in cleaned_df.columns:
    label = cleaned_df.loc[0,i]
    print(f'{i}: {label}')
NAME: Geographic Area Name
S1701_C03_001E: Estimate!!Percent below poverty level!!Population for whom poverty status is determined
S2701_C03_001E: Estimate!!Percent Insured!!Civilian noninstitutionalized population
S1901_C01_012E: Estimate!!Households!!Median income (dollars)
S1501_C01_005E: Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!Bachelor's degree or higher
S1501_C01_012E: Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Bachelor's degree
S1501_C01_001E: Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years
S1501_C01_006E: Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over
DP03_0009PE: Percent!!EMPLOYMENT STATUS!!Civilian labor force!!Unemployment Rate
DP03_0021PE: Percent!!COMMUTING TO WORK!!Workers 16 years and over!!Public transportation (excluding taxicab)
B19057_002E: Estimate!!Total:!!With public assistance income
B19057_001E: Estimate!!Total:
B25077_001E: Estimate!!Median value (dollars)
B25064_001E: Estimate!!Median gross rent
ucgid: Geography
In [13]:
# Create column names
col_names = ['county','poverty_rate','health_insurance','median_household_income','bachelor_18-24','bachelor_25+','population_18-24','population_25+',
             'unemployment_rate','public_transit','with_public_assistance','total_households','median_house_value','median_gross_rent','ucgid']

# Apply column names
cleaned_df.columns = col_names

# Drop the column label row
cleaned_df.drop(0, inplace=True)
cleaned_df.reset_index(drop=True, inplace=True)
cleaned_df.head(2)
Out[13]:
county poverty_rate health_insurance median_household_income bachelor_18-24 bachelor_25+ population_18-24 population_25+ unemployment_rate public_transit with_public_assistance total_households median_house_value median_gross_rent ucgid
0 Baldwin County, Alabama 9.7 93.2 72915 2213 39152 15552 184852 2.0 0.0 618 105698 307000 1286 0500000US01003
1 Calhoun County, Alabama 21.3 91.1 50780 605 10275 13499 78420 5.4 0.0 766 46310 160900 782 0500000US01015

2. Quality Issue: Incorrect data types¶

In [14]:
cleaned_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 854 entries, 0 to 853
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   county                   854 non-null    object
 1   poverty_rate             854 non-null    object
 2   health_insurance         854 non-null    object
 3   median_household_income  854 non-null    object
 4   bachelor_18-24           854 non-null    object
 5   bachelor_25+             854 non-null    object
 6   population_18-24         854 non-null    object
 7   population_25+           854 non-null    object
 8   unemployment_rate        854 non-null    object
 9   public_transit           854 non-null    object
 10  with_public_assistance   854 non-null    object
 11  total_households         854 non-null    object
 12  median_house_value       854 non-null    object
 13  median_gross_rent        854 non-null    object
 14  ucgid                    854 non-null    object
dtypes: object(15)
memory usage: 100.2+ KB

Issue: Some columns are expected to be integer or float instead of object.

Solution: Convert these columns to numeric types (integer or float).

In [15]:
# Identify integer columns
int_cols = ['median_household_income','bachelor_18-24','bachelor_25+','population_18-24','population_25+',
            'with_public_assistance','total_households','median_house_value','median_gross_rent']
# Identify float columns
float_cols = ['poverty_rate','health_insurance','unemployment_rate','public_transit']
# Convert from object to numeric. pd.to_numeric infers int64 or float64 and will raise a
# ValueError if a column contains non-numeric characters.
for col in int_cols + float_cols:
    cleaned_df[col] = pd.to_numeric(cleaned_df[col], errors='raise')
cleaned_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 854 entries, 0 to 853
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   county                   854 non-null    object 
 1   poverty_rate             854 non-null    float64
 2   health_insurance         854 non-null    float64
 3   median_household_income  854 non-null    int64  
 4   bachelor_18-24           854 non-null    int64  
 5   bachelor_25+             854 non-null    int64  
 6   population_18-24         854 non-null    int64  
 7   population_25+           854 non-null    int64  
 8   unemployment_rate        854 non-null    float64
 9   public_transit           854 non-null    float64
 10  with_public_assistance   854 non-null    int64  
 11  total_households         854 non-null    int64  
 12  median_house_value       854 non-null    int64  
 13  median_gross_rent        854 non-null    int64  
 14  ucgid                    854 non-null    object 
dtypes: float64(4), int64(9), object(2)
memory usage: 100.2+ KB

3. Quality Issue: Completeness¶

When converting a column from object to int or float, pandas raises a ValueError if the column contains non-numeric strings, because strings cannot be converted to numbers. The code above executed without any error, meaning no string annotation is used to mark missing values.
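As a minimal illustration of this behavior on a toy Series (not the Census data):

s = pd.Series(['9.7', '21.3', 'N'])         # 'N' stands in for a hypothetical string annotation
# pd.to_numeric(s, errors='raise')          # would raise a ValueError because 'N' cannot be parsed
print(pd.to_numeric(s, errors='coerce'))    # with errors='coerce', 'N' would become NaN instead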

In [16]:
cleaned_df.isna().sum()
Out[16]:
county                     0
poverty_rate               0
health_insurance           0
median_household_income    0
bachelor_18-24             0
bachelor_25+               0
population_18-24           0
population_25+             0
unemployment_rate          0
public_transit             0
with_public_assistance     0
total_households           0
median_house_value         0
median_gross_rent          0
ucgid                      0
dtype: int64

There appear to be no null values in the dataset. However, a value of 0 might indicate a missing value, so let's check whether any 0's occur in the dataset.

In [17]:
# Identify integer columns with at least one 0
for col in int_cols:
    if (cleaned_df[col] == 0).any():
        print(col)
bachelor_18-24
with_public_assistance
In [18]:
# Identify float columns with at least one 0
for col in float_cols:
    if (cleaned_df[col] == 0).any():
        print(col)
public_transit

bachelor_18-24 and with_public_assistance need further inspection to see whether their 0 values are actual observations or missing values.
It's normal for public_transit to have 0 values, as this variable represents the percentage of people commuting to work by public transit, and many U.S. counties don't have a public transit system.

Inspecting bachelor_18-24:
This column represents the number of bachelor's degree holders aged 18-24.

In [19]:
# Identify rows where 0 occurs in 'bachelor_18-24'
cleaned_df[cleaned_df['bachelor_18-24'] == 0]
Out[19]:
county poverty_rate health_insurance median_household_income bachelor_18-24 bachelor_25+ population_18-24 population_25+ unemployment_rate public_transit with_public_assistance total_households median_house_value median_gross_rent ucgid
24 Cochise County, Arizona 15.9 90.2 52025 0 13568 9920 89455 6.3 0.0 1512 50919 235200 988 0500000US04003
38 Lonoke County, Arkansas 8.4 95.1 75327 0 8512 5355 52158 5.0 0.0 435 29388 189300 1065 0500000US05085
119 Columbia County, Florida 16.8 83.2 60812 0 5873 4627 53112 7.8 -999999999.0 511 27280 230300 998 0500000US12023
186 Spalding County, Georgia 13.9 85.2 56169 0 6052 6546 47107 3.1 0.0 885 26898 240400 1073 0500000US13255
200 Twin Falls County, Idaho 9.3 92.5 66643 0 8520 8634 62141 2.6 0.0 1157 35050 329000 1142 0500000US16083
277 McCracken County, Kentucky 15.6 95.4 60720 0 7386 4572 48018 1.3 -999999999.0 206 28934 179600 764 0500000US21145

The value 0 occurs 6 times in bachelor_18-24; each occurrence needs further inspection to determine whether it is a missing value or an actual observation.

We can cross-check these 0's against the ACS 2022 1-Year data from the U.S. Census Bureau, which covers the same 6 counties.

In [20]:
# Retrieve data in 2022
ucgid = ','.join(cleaned_df['ucgid'][cleaned_df['bachelor_18-24'] == 0])
url = f'https://api.census.gov/data/2022/acs/acs1/subject?get=NAME,S1501_C01_001E,S1501_C01_005E&ucgid={ucgid}'
data = requests.get(url).json()
crosscheck = pd.DataFrame(data,columns=['county','population_18-24','bachelor_18-24','ucgid'])
crosscheck
Out[20]:
county population_18-24 bachelor_18-24 ucgid
0 NAME S1501_C01_001E S1501_C01_005E ucgid
1 Cochise County, Arizona 10485 849 0500000US04003
2 Lonoke County, Arkansas 6009 90 0500000US05085
3 Columbia County, Florida 5922 116 0500000US12023
4 Spalding County, Georgia 6533 27 0500000US13255
5 Twin Falls County, Idaho 7574 607 0500000US16083
6 McCracken County, Kentucky 5663 512 0500000US21145

The "crosscheck" dataframe shows population_18-24 and bachelor_18-24 for the six counties in 2022. Each of these counties had individuals aged 18-24 with a bachelor's degree in 2022, so it is very unlikely that they had none in 2023.

Issue: The 0's in bachelor_18-24 are missing values, not actual observations.

Solution: Impute the missing values using the 2022 data. Each imputed value is the county's 2023 population aged 18-24 multiplied by the county's 2022 ratio of bachelor's degree holders aged 18-24 to its population aged 18-24.
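For example, for Cochise County, Arizona (using the 2023 row above and the 2022 crosscheck row):

bachelor_18-24 (2023, imputed) = population_18-24 (2023) * bachelor_18-24 (2022) / population_18-24 (2022)
                               = 9920 * 849 / 10485 ≈ 803

which matches the imputed value shown in the output below.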

In [21]:
crosscheck.drop(0, inplace=True) # Remove the first row in "crosscheck"
# Convert datatype in "crosscheck" before calculation
crosscheck['bachelor_18-24'] = pd.to_numeric(crosscheck['bachelor_18-24'])
crosscheck['population_18-24'] = pd.to_numeric(crosscheck['population_18-24'])

county_list = []

for i in range(1,7):
    ratio_22 = crosscheck.loc[i,'bachelor_18-24']/crosscheck.loc[i,'population_18-24']
    
    county = crosscheck.loc[i,'county']
    population_23 = cleaned_df['population_18-24'][cleaned_df['county'] == county]
    
    impute_val = (population_23 * ratio_22).astype(int)
    
    cleaned_df.loc[cleaned_df['county'] == county, 'bachelor_18-24'] = impute_val
    county_list.append(county)

cleaned_df[['county','bachelor_18-24']][cleaned_df['county'].isin(county_list)]
Out[21]:
county bachelor_18-24
24 Cochise County, Arizona 803
38 Lonoke County, Arkansas 80
119 Columbia County, Florida 90
186 Spalding County, Georgia 27
200 Twin Falls County, Idaho 691
277 McCracken County, Kentucky 413

Inspecting with_public_assistance:
This column represents the number of households receiving public assistance income.

In [22]:
# Identify rows where 0 occurs in 'with_public_assistance'
cleaned_df[cleaned_df['with_public_assistance'] == 0]
Out[22]:
county poverty_rate health_insurance median_household_income bachelor_18-24 bachelor_25+ population_18-24 population_25+ unemployment_rate public_transit with_public_assistance total_households median_house_value median_gross_rent ucgid
225 Boone County, Indiana 2.7 96.0 103592 680 14078 5001 51586 0.8 0.0 0 29659 369300 1303 0500000US18011
656 Georgetown County, South Carolina 11.9 89.7 71313 340 9992 5044 49831 3.2 0.0 0 28720 322200 1517 0500000US45043

As with the bachelor_18-24 column, I will cross-check the two 0's against the 2022 data.

In [23]:
# Retrieve data in 2022
url = 'https://api.census.gov/data/2022/acs/acs1?get=NAME,B19057_002E,B19057_001E&ucgid=0500000US18011,0500000US45043'
data = requests.get(url).json()
crosscheck = pd.DataFrame(data,columns=['county','with_public_assistance','total_households','ucgid'])
crosscheck.drop(0, inplace=True) # Remove the first row which contains column IDs
# Convert datatype in "crosscheck"
crosscheck['with_public_assistance'] = crosscheck['with_public_assistance'].astype('int')
crosscheck['total_households'] = crosscheck['total_households'].astype('int')
crosscheck
Out[23]:
county with_public_assistance total_households ucgid
1 Boone County, Indiana 257 28756 0500000US18011

In 2022, 257 households in Boone County, Indiana received public assistance income, while there is no record for Georgetown County, South Carolina.

Issue: Boone County's 0 is a missing value, not an actual observation.

Solution: The 2023 with_public_assistance value for Boone County will be imputed the same way as bachelor_18-24 earlier, while Georgetown County's 0 will be kept as is.

In [24]:
# Impute 0 value for Boone County
ratio_22 = crosscheck.loc[1,'with_public_assistance']/crosscheck.loc[1,'total_households']
total_households_23 = cleaned_df['total_households'][cleaned_df['ucgid'] == '0500000US18011']
impute_val = (total_households_23 * ratio_22).astype(int)
cleaned_df.loc[cleaned_df['ucgid'] == '0500000US18011', 'with_public_assistance'] = impute_val
# Check the imputed value
cleaned_df[['county','with_public_assistance']][cleaned_df['ucgid'] == '0500000US18011']
Out[24]:
county with_public_assistance
225 Boone County, Indiana 265

4. Quality Issue: Inconsistency¶

Annotation values:

Annotation values are special values that the Census Bureau uses in place of real estimates. They can be non-numeric characters or special numbers. Since the possibility of non-numeric characters has already been ruled out, I will check for the special numbers.

A list of annotation values can be found here: Link
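As a broader sanity check (a minimal sketch; the codes listed are the same annotation values checked in the cell below), every numeric column can be scanned for these special values:

annotation_values = [-666666666, -999999999, -888888888, -222222222, -333333333, -555555555]
# Count occurrences of annotation values in every numeric column
for col in int_cols + float_cols:
    n = cleaned_df[col].isin(annotation_values).sum()
    if n > 0:
        print(f'{col}: {n} annotation value(s)')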

In [25]:
# Rows where public_transit holds one of the special annotation values
cleaned_df[['county','public_transit','ucgid']][cleaned_df['public_transit'].isin([-666666666,-999999999,-888888888,-222222222,-333333333,-555555555])]
Out[25]:
county public_transit ucgid
37 Jefferson County, Arkansas -999999999.0 0500000US05069
119 Columbia County, Florida -999999999.0 0500000US12023
168 Effingham County, Georgia -999999999.0 0500000US13103
187 Troup County, Georgia -999999999.0 0500000US13285
188 Walker County, Georgia -999999999.0 0500000US13295
277 McCracken County, Kentucky -999999999.0 0500000US21145
562 Marion County, Ohio -999999999.0 0500000US39101
733 Orange County, Texas -999999999.0 0500000US48361
843 Arecibo Municipio, Puerto Rico -999999999.0 0500000US72013
851 Toa Alta Municipio, Puerto Rico -999999999.0 0500000US72135

Issue: The value -999999999 is an annotation indicating a missing estimate due to an insufficient number of sample cases in the selected geographic area.

Solution: I will apply the same cross-check-and-impute method using the 2022 data.

In [26]:
# Retrieve data in 2022
ucgid = ','.join(cleaned_df['ucgid'][cleaned_df['public_transit'] == -999999999])
url = f'https://api.census.gov/data/2022/acs/acs1/profile?get=NAME,DP03_0021PE&ucgid={ucgid}'
data = requests.get(url).json()
crosscheck = pd.DataFrame(data,columns=['county','public_transit','ucgid'])
crosscheck.drop(0, inplace=True)
crosscheck.reset_index(drop=True, inplace=True)
crosscheck['public_transit'] = pd.to_numeric(crosscheck['public_transit'])
crosscheck
Out[26]:
county public_transit ucgid
0 Jefferson County, Arkansas 0.5 0500000US05069
1 Columbia County, Florida 0.0 0500000US12023
2 Effingham County, Georgia 0.3 0500000US13103
3 Troup County, Georgia 0.0 0500000US13285
4 Walker County, Georgia 0.0 0500000US13295
5 McCracken County, Kentucky 0.5 0500000US21145
6 Marion County, Ohio 0.8 0500000US39101
7 Orange County, Texas 0.0 0500000US48361
8 Arecibo Municipio, Puerto Rico 0.0 0500000US72013
9 Toa Alta Municipio, Puerto Rico 0.0 0500000US72135
In [27]:
# Impute annotation values with 2022 data.
# Positional assignment: the crosscheck rows are in the same ucgid order as the matching rows of cleaned_df (see the outputs above).
cleaned_df.loc[cleaned_df['ucgid'].isin(crosscheck['ucgid']),'public_transit'] = crosscheck['public_transit'].values
cleaned_df[['county','public_transit']][cleaned_df['ucgid'].isin(crosscheck['ucgid'])]
Out[27]:
county public_transit
37 Jefferson County, Arkansas 0.5
119 Columbia County, Florida 0.0
168 Effingham County, Georgia 0.3
187 Troup County, Georgia 0.0
188 Walker County, Georgia 0.0
277 McCracken County, Kentucky 0.5
562 Marion County, Ohio 0.8
733 Orange County, Texas 0.0
843 Arecibo Municipio, Puerto Rico 0.0
851 Toa Alta Municipio, Puerto Rico 0.0

Other unexpected values:

To make sure the data are consistent, I will check whether any values are implausibly large.

In [28]:
cleaned_df.describe()
Out[28]:
poverty_rate health_insurance median_household_income bachelor_18-24 bachelor_25+ population_18-24 population_25+ unemployment_rate public_transit with_public_assistance total_households median_house_value median_gross_rent
count 854.000000 854.000000 854.000000 854.000000 8.540000e+02 854.000000 8.540000e+02 854.000000 854.000000 854.000000 8.540000e+02 8.540000e+02 854.000000
mean 12.770843 92.588056 76512.345433 4446.871194 5.373878e+04 31078.072600 2.331190e+05 4.208431 1.528220 3132.231850 1.316149e+05 3.309796e+05 1258.251756
std 5.401620 3.849054 20022.532998 8536.502992 9.874003e+04 51766.925148 4.031053e+05 1.727863 4.359279 7422.502216 2.172988e+05 1.743939e+05 372.467925
min 2.000000 70.700000 16836.000000 14.000000 3.078000e+03 3531.000000 3.539600e+04 0.500000 0.000000 0.000000 1.900400e+04 5.270000e+04 447.000000
25% 9.100000 90.400000 62032.250000 724.500000 1.154025e+04 8554.750000 6.646075e+04 3.100000 0.100000 649.000000 3.809475e+04 2.173750e+05 983.000000
50% 12.000000 93.400000 72994.000000 1665.000000 2.244350e+04 15247.000000 1.132020e+05 4.000000 0.500000 1282.500000 6.376150e+04 2.909500e+05 1170.000000
75% 15.675000 95.400000 87054.000000 4476.000000 5.332825e+04 34961.500000 2.361080e+05 5.000000 1.300000 2725.500000 1.370802e+05 3.917750e+05 1478.750000
max 53.900000 98.500000 174148.000000 124490.000000 1.581628e+06 853064.000000 6.857511e+06 19.800000 52.500000 144731.000000 3.450470e+06 1.512200e+06 2797.000000
In [29]:
cleaned_df.max()
Out[29]:
county                     Yuma County, Arizona
poverty_rate                               53.9
health_insurance                           98.5
median_household_income                  174148
bachelor_18-24                           124490
bachelor_25+                            1581628
population_18-24                         853064
population_25+                          6857511
unemployment_rate                          19.8
public_transit                             52.5
with_public_assistance                   144731
total_households                        3450470
median_house_value                      1512200
median_gross_rent                          2797
ucgid                            0500000US72139
dtype: object

The data appear to fall within reasonable ranges.

5. Tidiness Issue: Unnecessary Variables¶

Issue: ucgid is not needed for further analysis.

Solution: I will remove ucgid.

In [30]:
cleaned_df.drop('ucgid', axis=1, inplace=True)
cleaned_df.head(2)
Out[30]:
county poverty_rate health_insurance median_household_income bachelor_18-24 bachelor_25+ population_18-24 population_25+ unemployment_rate public_transit with_public_assistance total_households median_house_value median_gross_rent
0 Baldwin County, Alabama 9.7 93.2 72915 2213 39152 15552 184852 2.0 0.0 618 105698 307000 1286
1 Calhoun County, Alabama 21.3 91.1 50780 605 10275 13499 78420 5.4 0.0 766 46310 160900 782

6. Tidiness Issue: Column headers are values, not variable names¶

Issue:

  • population_18-24 and population_25+ are two values of a single variable representing the total population aged 18+
  • bachelor_18-24 and bachelor_25+ are two values of a single variable representing bachelor's degree holders aged 18+.

Solution: The columns that are values should be combined into distinct variables:

  • population_18+ = population_18-24 + population_25+
  • bachelor_18+ = bachelor_18-24 + bachelor_25+

This solution is preferred over unpivoting because the two new variables will be used directly in later analysis.

In [31]:
# Compute the total number of people aged 18+ in each county
population_18_over = cleaned_df['population_18-24'] + cleaned_df['population_25+']
# Compute the total number of people aged 18+ with a bachelor's degree or more in each county
bachelor_18_over = cleaned_df['bachelor_18-24'] + cleaned_df['bachelor_25+']

7. Derived Variable Creation¶

Two variables needed for later analysis are not directly available in the raw dataset but can be computed from existing variables:

  • bachelor_holders: The percentage of people aged 18+ with at least a Bachelor's degree
    bachelor_holders = (bachelor_18+/population_18+)*100
  • public_assistance: The percentage of households receiving public assistance income
    public_assistance = (with_public_assistance/total_households)*100
In [32]:
# Create a column for the percentage of bachelor's degree holders
cleaned_df['bachelor_holders'] = ((bachelor_18_over / population_18_over) * 100).round(2)
# Remove unnecessary columns
cleaned_df = cleaned_df.drop(columns=['population_18-24', 'population_25+', 'bachelor_18-24', 'bachelor_25+'])
cleaned_df.head(3)
Out[32]:
county poverty_rate health_insurance median_household_income unemployment_rate public_transit with_public_assistance total_households median_house_value median_gross_rent bachelor_holders
0 Baldwin County, Alabama 9.7 93.2 72915 2.0 0.0 618 105698 307000 1286 20.64
1 Calhoun County, Alabama 21.3 91.1 50780 5.4 0.0 766 46310 160900 782 11.84
2 Cullman County, Alabama 18.3 89.4 58923 3.8 0.3 483 35961 204000 788 10.75
In [33]:
# Create a column for the percentage of households receiving public assistance
cleaned_df['public_assistance'] = ((cleaned_df['with_public_assistance']/cleaned_df['total_households']) * 100).round(2)
# Remove unnecessary columns
cleaned_df = cleaned_df.drop(columns=['with_public_assistance','total_households'])
cleaned_df.head(3)
Out[33]:
county poverty_rate health_insurance median_household_income unemployment_rate public_transit median_house_value median_gross_rent bachelor_holders public_assistance
0 Baldwin County, Alabama 9.7 93.2 72915 2.0 0.0 307000 1286 20.64 0.58
1 Calhoun County, Alabama 21.3 91.1 50780 5.4 0.0 160900 782 11.84 1.65
2 Cullman County, Alabama 18.3 89.4 58923 3.8 0.3 204000 788 10.75 1.34

8. Update Data Store¶

In [34]:
cleaned_df.to_csv('cleaned_data.csv', index=False)
# Check the saved CSV file
df = pd.read_csv('cleaned_data.csv')
df.head()
Out[34]:
county poverty_rate health_insurance median_household_income unemployment_rate public_transit median_house_value median_gross_rent bachelor_holders public_assistance
0 Baldwin County, Alabama 9.7 93.2 72915 2.0 0.0 307000 1286 20.64 0.58
1 Calhoun County, Alabama 21.3 91.1 50780 5.4 0.0 160900 782 11.84 1.65
2 Cullman County, Alabama 18.3 89.4 58923 3.8 0.3 204000 788 10.75 1.34
3 DeKalb County, Alabama 24.7 86.1 43509 3.8 0.0 156800 659 9.33 2.66
4 Elmore County, Alabama 12.9 92.4 72478 2.7 0.0 229900 933 17.73 0.17
In [ ]: