Basic notebook setup: install and load a few basic packages.
rm(list=ls(all.names=TRUE))
knitr::opts_chunk$set(comment = NA)
knitr::opts_knit$set(global.par = TRUE)
par(cex=0.8); options(scipen=20, digits=4, width=90)
if(!require(pacman)) install.packages("pacman")
pacman::p_load(dplyr)
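Employment statistics are among the most important indicators policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment with the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we apply the topics reviewed in the lectures, and a few others, to the September 2013 edition of this nationally representative dataset. The observations in the dataset represent people who actually completed the survey in the September 2013 CPS. The full dataset has 385 variables, but in this exercise we use a reduced version, CPSData.csv, which has the following variables: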
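PeopleInHousehold: The number of people in the interviewee's household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code identifying the metropolitan area where the interviewee lives (NA if the interviewee does not live in a metropolitan area). The mapping from codes to metropolitan area names is provided in MetroAreaCodes.csv.
Age: The interviewee's age in years. 80 represents people aged 80-84, and 85 represents people aged 85 and older.
Married: The interviewee's marital status.
Sex: The interviewee's sex.
Education: The highest level of education the interviewee has obtained.
Race: The interviewee's race.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the interviewee's country of birth. The mapping from codes to country names is provided in CountryCodes.csv.
Citizenship: The interviewee's citizenship status.
EmploymentStatus: The interviewee's employment status.
Industry: The interviewee's industry of employment (only available if employed).
§ 1.1 How many interviewees are in the dataset?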
A = read.csv('data/CPSData.csv')
MetroAreaMap = read.csv("data/MetroAreaCodes.csv")
CountryMap = read.csv("data/CountryCodes.csv")
nrow(A)
[1] 131302
§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
table(A$Industry) %>% sort
                               Armed forces                             Mining
                                         29                                550
Agriculture, forestry, fishing, and hunting                        Information
                                       1307                               1328
                      Public administration                     Other services
                                       3186                               3224
               Transportation and utilities                          Financial
                                       3260                               4347
                               Construction            Leisure and hospitality
                                       4387                               6364
                              Manufacturing Professional and business services
                                       6791                               7519
                                      Trade    Educational and health services
                                       8933                              15017
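The most common category can also be extracted programmatically rather than read off the end of the sorted table. A minimal sketch on a toy vector (the `industry` values below are invented for illustration and are not taken from CPSData.csv):

```r
# Toy data; hypothetical values, not from CPSData.csv
industry <- c("Trade", "Mining", "Trade", NA, "Financial", "Trade", "Mining")

counts <- table(industry)          # table() ignores NA by default
top <- names(which.max(counts))    # label of the most frequent category

print(sort(counts, decreasing = TRUE))
print(top)
```

Applied to the real data, the same idea would be `names(which.max(table(A$Industry)))`.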
§ 1.3 Which state has the fewest interviewees?
table(A$State) %>% sort %>% head
   New Mexico       Montana   Mississippi       Alabama West Virginia      Arkansas
         1102          1214          1230          1376          1409          1421
Which state has the largest number of interviewees?
table(A$State) %>% sort %>% tail
    Illinois Pennsylvania      Florida     New York        Texas   California
        3912         3930         5149         5595         7077        11570
§ 1.4 What proportion of interviewees are citizens of the United States?
table(A$Citizenship) %>% prop.table
     Citizen, Native Citizen, Naturalized          Non-Citizen
             0.88833              0.05387              0.05781
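Both native and naturalized interviewees are citizens, so the two shares are added: 0.88833 + 0.05387 = 0.9422. The same number can be computed in one step as the share of rows that are not "Non-Citizen". A small sketch on toy data (hypothetical values, assuming the same three category labels as above):

```r
# Toy data; hypothetical values, not from CPSData.csv
citizenship <- c("Citizen, Native", "Citizen, Native", "Citizen, Naturalized",
                 "Non-Citizen", "Citizen, Native")

# The mean of a logical vector is the proportion of TRUE values
citizen_share <- mean(citizenship != "Non-Citizen")
print(citizen_share)
```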
§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)
tapply(A$Hispanic, A$Race, sum) %>% sort
Pacific Islander            Asian  American Indian      Multiracial            Black
              77              113              304              448              621
           White
           16731
§ 2.1 Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)
summary(A)
PeopleInHousehold Region            State                MetroAreaCode   Age
Min.   : 1.00     Midwest  :30684   California  :11570   Min.   :10420   Min.   : 0.0
1st Qu.: 2.00     Northeast:25939   Texas       : 7077   1st Qu.:21780   1st Qu.:19.0
Median : 3.00     South    :41502   New York    : 5595   Median :34740   Median :39.0
Mean   : 3.28     West     :33177   Florida     : 5149   Mean   :35075   Mean   :38.8
3rd Qu.: 4.00                       Pennsylvania: 3930   3rd Qu.:41860   3rd Qu.:57.0
Max.   :15.00                       Illinois    : 3912   Max.   :79600   Max.   :85.0
                                    (Other)     :94069   NA's   :34238

Married                Sex            Education
Divorced     :11151    Female:67481   High school            :30906
Married      :55509    Male  :63821   Bachelor's degree      :19443
Never Married:30772                   Some college, no degree:18863
Separated    : 2027                   No high school diploma :16095
Widowed      : 6505                   Associate degree       : 9913
NA's         :25338                   (Other)                :10744
                                      NA's                   :25338

Race                      Hispanic        CountryOfBirthCode
American Indian :  1433   Min.   :0.000   Min.   : 57.0
Asian           :  6520   1st Qu.:0.000   1st Qu.: 57.0
Black           : 13913   Median :0.000   Median : 57.0
Multiracial     :  2897   Mean   :0.139   Mean   : 82.7
Pacific Islander:   618   3rd Qu.:0.000   3rd Qu.: 57.0
White           :105921   Max.   :1.000   Max.   :555.0

Citizenship                   EmploymentStatus
Citizen, Native     :116639   Disabled          : 5712
Citizen, Naturalized:  7073   Employed          :61733
Non-Citizen         :  7590   Not in Labor Force:15246
                              Retired           :18619
                              Unemployed        : 4203
                              NA's              :25789

Industry
Educational and health services   :15017
Trade                             : 8933
Professional and business services: 7519
Manufacturing                     : 6791
Leisure and hospitality           : 6364
(Other)                           :21618
NA's                              :65060
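In the summary above, an NA's row appears under MetroAreaCode, Married, Education, EmploymentStatus, and Industry. Rather than scanning the summary by eye, the variables with missing values can be listed directly. A minimal sketch on a toy data frame (the column values are hypothetical):

```r
# Toy data frame; hypothetical values, not from CPSData.csv
df <- data.frame(
  Age = c(25, 40, 3),
  Married = c("Married", NA, NA),
  Industry = c("Trade", NA, NA),
  stringsAsFactors = FALSE
)

na_counts <- colSums(is.na(df))                  # number of NAs per column
vars_with_na <- names(na_counts)[na_counts > 0]  # columns with at least one NA

print(na_counts)
print(vars_with_na)
```

On the real data this would be `colSums(is.na(A))`.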
§ 2.2 Which is the most accurate:
tapply(is.na(A$Married), A$Region, mean)
Midwest Northeast South West
0.1980 0.1738 0.1920 0.2046
tapply(is.na(A$Married), A$Sex, mean)
Female Male
0.1810 0.2056
tapply(is.na(A$Married), A$Age, mean)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 85
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tapply(is.na(A$Married), A$Citizenship, mean)
     Citizen, Native Citizen, Naturalized          Non-Citizen
             0.21162              0.02305              0.06482
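The four tables above all use the same pattern: `is.na()` yields a logical vector, and its mean within each group is the proportion of missing values in that group. Only the Age breakdown splits cleanly into 1s and 0s (Married is missing for ages 0 to 14 and present from age 15 on). A small sketch of the pattern on toy data (hypothetical values):

```r
# Toy data; hypothetical values, not from CPSData.csv
df <- data.frame(
  Age = c(5, 5, 14, 15, 30, 30),
  Married = c(NA, NA, NA, "Never Married", "Married", NA)
)

# Proportion of missing Married values within each age
prop_missing <- tapply(is.na(df$Married), df$Age, mean)
print(prop_missing)
```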
§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
tapply(is.na(A$MetroAreaCode), A$State, mean) %>% sort
District of Columbia           New Jersey         Rhode Island           California
             0.00000              0.00000              0.00000              0.02048
             Florida        Massachusetts             Maryland             New York
             0.03923              0.06492              0.06938              0.08061
         Connecticut             Illinois             Colorado              Arizona
             0.08568              0.11222              0.12991              0.13154
              Nevada                Texas            Louisiana         Pennsylvania
             0.13308              0.14370              0.16138              0.17430
            Michigan           Washington              Georgia             Virginia
             0.17826              0.18132              0.19843              0.19844
                Utah               Oregon             Delaware           New Mexico
             0.21010              0.21822              0.23397              0.24501
              Hawaii                 Ohio              Alabama              Indiana
             0.24917              0.25122              0.25872              0.29142
           Wisconsin       South Carolina            Minnesota             Oklahoma
             0.29933              0.31303              0.31507              0.32764
            Missouri            Tennessee               Kansas       North Carolina
             0.32867              0.35594              0.36227              0.37304
                Iowa             Arkansas                Idaho             Kentucky
             0.48695              0.49050              0.49868              0.50679
       New Hampshire             Nebraska                Maine              Vermont
             0.56875              0.58132              0.59832              0.65238
         Mississippi         South Dakota         North Dakota        West Virginia
             0.69431              0.70250              0.73739              0.75586
             Montana               Alaska              Wyoming
             0.83608              1.00000              1.00000
How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
#
#
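Both counts can be read from the sorted table in § 2.3: a proportion of 1 means every interviewee is non-metropolitan (Alaska and Wyoming), and a proportion of 0 means every interviewee is metropolitan (District of Columbia, New Jersey, and Rhode Island). A sketch of counting these programmatically, on toy data with hypothetical state labels:

```r
# Toy data; the states "A", "B", "C" and the codes are hypothetical
df <- data.frame(
  State = c("A", "A", "B", "B", "C", "C"),
  MetroAreaCode = c(NA, NA, 100, 200, 300, NA)
)

# Proportion of non-metropolitan (NA code) interviewees per state
p_nonmetro <- tapply(is.na(df$MetroAreaCode), df$State, mean)

n_all_nonmetro <- sum(p_nonmetro == 1)  # states where everyone is non-metro
n_all_metro <- sum(p_nonmetro == 0)     # states where everyone is metro

print(c(all_nonmetro = n_all_nonmetro, all_metro = n_all_metro))
```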
§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
tapply(is.na(A$MetroAreaCode), A$Region, mean) %>% sort
Northeast South West Midwest
0.2162 0.2378 0.2437 0.3479
§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
#
#
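From the sorted state table in § 2.3, the candidates near 30% are Indiana (0.29142), Wisconsin (0.29933), and South Carolina (0.31303). One way to pick the closest is to minimize the absolute distance from 0.30. A sketch using just those three values copied from the output above:

```r
# Three proportions copied from the § 2.3 output
p_nonmetro <- c(Indiana = 0.29142, Wisconsin = 0.29933, `South Carolina` = 0.31303)

# Name of the entry with the smallest absolute distance from 0.30
closest <- names(which.min(abs(p_nonmetro - 0.30)))
print(closest)
```

On the full data, `p_nonmetro` would be the result of `tapply(is.na(A$MetroAreaCode), A$State, mean)`.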
Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
#
#
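In the sorted table from § 2.3, the last entries are West Virginia (0.75586), Montana (0.83608), Alaska (1.00000), and Wyoming (1.00000). Dropping the entries equal to 1 and taking the maximum of what remains answers the question. A sketch using those four values:

```r
# Four proportions copied from the § 2.3 output
p_nonmetro <- c(`West Virginia` = 0.75586, Montana = 0.83608,
                Alaska = 1, Wyoming = 1)

below_one <- p_nonmetro[p_nonmetro < 1]    # drop fully non-metropolitan states
largest <- names(which.max(below_one))
print(largest)
```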
§ 3.1 How many observations (codes for metropolitan areas) are there in MetroAreaMap?
nrow(MetroAreaMap)
[1] 271
How many observations (codes for countries) are there in CountryMap?
nrow(CountryMap)
[1] 149
§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?
A = merge(A, CountryMap, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
A = merge(A, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
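With `all.x=TRUE`, `merge()` keeps every row of the left data frame and fills the new column with NA where no code matches, which is why the count of missing MetroArea values equals the count of missing MetroAreaCode values in the earlier summary (34238). A minimal sketch on toy tables (the names `people` and `codes` and all values are hypothetical):

```r
# Toy tables; hypothetical values, not from the CPS files
people <- data.frame(id = 1:4, MetroAreaCode = c(200, NA, 100, 999))
codes <- data.frame(Code = c(100, 200), MetroArea = c("Town A", "Town B"))

# Left join: keep all rows of `people`, even those without a matching Code
merged <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

print(merged)
print(setdiff(names(merged), names(people)))  # column added by the merge
print(sum(is.na(merged$MetroArea)))           # rows with no matching code
```

Note that `merge()` sorts the result by the key column by default, so the row order of the merged data frame generally differs from the original.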
How many interviewees have a missing value for the new metropolitan area variable?
sum(is.na(A$MetroArea))
[1] 34238
§ 3.3 Which metropolitan area has the largest number of interviewees?
table(A$MetroArea) %>% sort %>% tail
Providence-Fall River-Warwick, MA-RI
2284
Chicago-Naperville-Joliet, IN-IN-WI
2772
Philadelphia-Camden-Wilmington, PA-NJ-DE
2855
Los Angeles-Long Beach-Santa Ana, CA
4102
Washington-Arlington-Alexandria, DC-VA-MD-WV
4177
New York-Northern New Jersey-Long Island, NY-NJ-PA
5409
§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?
tapply(A$Hispanic, A$MetroArea, mean) %>% sort %>% tail
           San Antonio, TX              El Centro, CA                El Paso, TX
                    0.6442                     0.6869                     0.7910
 Brownsville-Harlingen, TX McAllen-Edinburg-Pharr, TX                 Laredo, TX
                    0.7975                     0.9487                     0.9663
§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
tapply(A$Race == "Asian", A$MetroArea, mean) %>% sort %>% tail
                 Warner Robins, GA                         Fresno, CA
                            0.1667                             0.1848
             Vallejo-Fairfield, CA San Jose-Sunnyvale-Santa Clara, CA
                            0.2030                             0.2418
 San Francisco-Oakland-Fremont, CA                       Honolulu, HI
                            0.2468                             0.5019
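Because the table is sorted, every area not shown lies below Fresno (0.1848), so four areas reach the 20% threshold: Vallejo-Fairfield, San Jose-Sunnyvale-Santa Clara, San Francisco-Oakland-Fremont, and Honolulu. The count can also be computed directly by comparing the proportions with 0.2. A sketch on toy data (the areas "X", "Y", "Z" and all values are hypothetical):

```r
# Toy data; hypothetical values, not from CPSData.csv
df <- data.frame(
  Race = c("Asian", "White", "Asian", "Black", "White", "White", "White", "Asian"),
  MetroArea = c("X", "X", "Y", "Y", "Y", "Z", "Z", NA)
)

# Share of Asian interviewees per area; rows with NA MetroArea are left out
share_asian <- tapply(df$Race == "Asian", df$MetroArea, mean)

n_areas <- sum(share_asian >= 0.2)  # number of areas at or above 20%
print(share_asian)
print(n_areas)
```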
§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
tapply(A$Education=='No high school diploma', A$MetroArea, mean, na.rm=T) %>%
sort %>% head
           Iowa City, IA        Bowling Green, KY    Kalamazoo-Portage, MI
                 0.02913                  0.03704                  0.05051
    Champaign-Urbana, IL Bremerton-Silverdale, WA             Lawrence, KS
                 0.05155                  0.05405                  0.05952
§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?
#
#
How many interviewees have a missing value for the new Country variable?
sum(is.na(A$Country))
[1] 176
§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?
table(A$Country) %>% sort %>% tail(10)
         Cuba       Germany       Vietnam   El Salvador   Puerto Rico         China
          426           438           458           477           518           581
        India   Philippines        Mexico United States
          770           839          3921        115063
§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?
table(A$Country[A$MetroArea=="New York-Northern New Jersey-Long Island, NY-NJ-PA"] ==
"United States") %>% prop.table
FALSE TRUE
0.3087 0.6913
§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?
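# Caution: without na.rm=TRUE, sum() returns NA for any metropolitan area that
# contains a missing Country, and sort() then drops that area, so the ranking
# below may omit areas. The same applies to the Brazil and Somalia queries.
tapply(A$Country=="India", A$MetroArea, sum) %>% sort %>% tail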
                      Kansas City, MO-KS        Milwaukee-Waukesha-West Allis, WI
                                      11                                       12
                              Fresno, CA       San Jose-Sunnyvale-Santa Clara, CA
                                      16                                       19
Hartford-West Hartford-East Hartford, CT               Detroit-Warren-Livonia, MI
                                      26                                       30
In Brazil?
tapply(A$Country=="Brazil", A$MetroArea, sum) %>% sort %>% tail
Sacramento-Arden-Arcade-Roseville, CA                  Canton-Massillon, OH
                                    2                                     3
          Phoenix-Mesa-Scottsdale, AZ   Davenport-Moline-Rock Island, IA-IL
                                    3                                     4
Miami-Fort Lauderdale-Miami Beach, FL        Boston-Cambridge-Quincy, MA-NH
                                   16                                    18
In Somalia?
tapply(A$Country=="Somalia", A$MetroArea, sum) %>% sort %>% tail
              York-Hanover, PA Youngstown-Warren-Boardman, OH
                             0                              0
                    Dayton, OH                   Richmond, VA
                             1                              1
   Phoenix-Mesa-Scottsdale, AZ                  St. Cloud, MN
                             7                              7
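The three counts in § 4.4 call `sum` without `na.rm=TRUE`. Since 176 interviewees have a missing Country, any metropolitan area that contains one of them sums to NA and is then dropped by `sort()`, so such areas cannot appear in the tails above. Whether that changes the answers depends on where the missing values fall. The sketch below shows the effect on toy data (areas "X" and "Y" and all values are hypothetical):

```r
# Toy data; hypothetical values, not from CPSData.csv
df <- data.frame(
  Country = c("India", "India", NA, "United States", "India"),
  MetroArea = c("X", "X", "X", "Y", "Y")
)

# Without na.rm: area X contains an NA, so its sum is NA
strict <- tapply(df$Country == "India", df$MetroArea, sum)

# With na.rm = TRUE: NAs are skipped and X is counted correctly
lenient <- tapply(df$Country == "India", df$MetroArea, sum, na.rm = TRUE)

sorted_strict <- sort(strict)  # sort() removes NA entries by default

print(strict)
print(lenient)
print(sorted_strict)
```

In the toy data, area X has the most India-born rows but disappears from the sorted strict result. Re-running the § 4.4 queries with `na.rm=TRUE` would show whether the same thing happens in the real data.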