Demographics and Employment in the United States
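Employment statistics are among the most important indicators policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will apply the topics reviewed in the lecture, along with a few others, to the September 2013 version of this nationally representative dataset (available online).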
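The observations in the dataset represent people who actually completed the survey in the September 2013 CPS. While the full dataset has 385 variables, in this exercise we will use the CPSData.csv version of the dataset, which has the following variables: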
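PeopleInHousehold: The number of people in the interviewee's household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.
Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and older.
Married: The marital status of the interviewee.
Sex: The sex of the interviewee.
Education: The maximum level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code that identifies the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.
Citizenship: The United States citizenship status of the interviewee.
EmploymentStatus: The employment status of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).
§ 1.1 How many interviewees are in the dataset?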
library(magrittr)  # provides the %>% pipe used throughout (dplyr re-exports it as well)
CPS = read.csv("../data/CPSData.csv")
nrow(CPS)
[1] 131302
§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
sort(table(CPS$Industry)) %>% tail # Educational and health services
                      Construction            Leisure and hospitality
                              4387                               6364
                     Manufacturing Professional and business services
                              6791                               7519
                             Trade    Educational and health services
                              8933                              15017
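As a minimal sketch of the `sort(table(...))` idiom used here, the following runs on a small made-up vector (`industry` and its values are hypothetical, not taken from CPSData.csv). `table()` drops `NA` by default, which is why only interviewees with a reported industry are counted.

```r
# Hypothetical toy data standing in for CPS$Industry (not the real survey values)
industry <- c("Trade", "Construction", "Trade", NA,
              "Manufacturing", "Trade", NA, "Construction")

# table() counts each distinct value and ignores NA by default
counts <- table(industry)

# sort() orders the counts ascending, so the most common value comes last
sorted_counts <- sort(counts)
most_common <- names(sorted_counts)[length(sorted_counts)]

print(sorted_counts)
print(most_common)
```

Piping the sorted table into `tail`, as the exercise does, simply shows the last few (largest) entries.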
§ 1.3 Which state has the fewest interviewees?
head(sort(table(CPS$State))) # New Mexico
   New Mexico       Montana   Mississippi       Alabama West Virginia      Arkansas
         1102          1214          1230          1376          1409          1421
Which state has the largest number of interviewees?
tail(sort(table(CPS$State))) # California
    Illinois Pennsylvania      Florida     New York        Texas   California
        3912         3930         5149         5595         7077        11570
§ 1.4 What proportion of interviewees are citizens of the United States?
table(CPS$Citizenship) %>% prop.table
     Citizen, Native Citizen, Naturalized          Non-Citizen
             0.88833              0.05387              0.05781
table(CPS$Citizenship == "Non-Citizen") %>% prop.table # 0.942194
FALSE TRUE
0.94219 0.05781
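The two computations above can be sketched on made-up data as follows (`citizenship` is a hypothetical vector, not the real column). It relies on the fact that the mean of a logical vector is the proportion of `TRUE` values.

```r
# Hypothetical toy data standing in for CPS$Citizenship
citizenship <- c("Citizen, Native", "Citizen, Native", "Citizen, Naturalized",
                 "Non-Citizen", "Citizen, Native")

# Share of each category
shares <- prop.table(table(citizenship))

# Share of citizens: everyone who is not a "Non-Citizen"
citizen_share <- mean(citizenship != "Non-Citizen")

print(shares)
print(citizen_share)
```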
§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)
table(CPS$Race, CPS$Hispanic)
0 1
American Indian 1129 304
Asian 6407 113
Black 13292 621
Multiracial 2449 448
Pacific Islander 541 77
White 89190 16731
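Rather than reading the answer off the printed table, the qualifying races can be selected programmatically. A minimal sketch on made-up vectors (`race`, `hispanic`, and the threshold of 2 are hypothetical; the exercise itself uses 250):

```r
# Hypothetical toy data standing in for CPS$Race and CPS$Hispanic
race     <- c("White", "White", "Black", "Asian", "White", "Black")
hispanic <- c(1, 0, 1, 0, 1, 0)

# Two-way table: rows are races, columns are the Hispanic indicator (0 or 1)
tab <- table(race, hispanic)

# Column "1" holds the Hispanic counts; keep races at or above the threshold
threshold <- 2  # toy value, the exercise uses 250
qualifying <- rownames(tab)[tab[, "1"] >= threshold]

print(tab)
print(qualifying)
```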
§ 2.1 Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)
colSums(is.na(CPS))
 PeopleInHousehold             Region              State      MetroAreaCode
                 0                  0                  0              34238
               Age            Married                Sex          Education
                 0              25338                  0              25338
              Race           Hispanic CountryOfBirthCode        Citizenship
                 0                  0                  0                  0
  EmploymentStatus           Industry
             25789              65060
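A minimal sketch of how `colSums(is.na(...))` arrives at these counts, using a made-up data frame (`toy` and its columns are hypothetical):

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(
  Age      = c(34, 7, 51, 12),
  Married  = c("Married", NA, "Divorced", NA),
  Industry = c("Trade", NA, NA, NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column
na_counts <- colSums(is.na(toy))

# Names of the columns that contain at least one NA
cols_with_na <- names(na_counts)[na_counts > 0]

print(na_counts)
print(cols_with_na)
```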
§ 2.2 Which is the most accurate:
lapply(CPS[c('Region','Sex','Age','Citizenship')],
       function(x) table(is.na(CPS$Married), x))
$Region
x
Midwest Northeast South West
FALSE 24609 21432 33535 26388
TRUE 6075 4507 7967 6789
$Sex
x
Female Male
FALSE 55264 50700
TRUE 12217 13121
$Age
x
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
FALSE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1795
TRUE 1283 1559 1574 1693 1695 1795 1721 1681 1729 1748 1750 1721 1797 1802 1790 0
x
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
FALSE 1751 1764 1596 1517 1398 1525 1536 1638 1627 1604 1643 1657 1736 1645 1854 1762
TRUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
x
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
FALSE 1790 1804 1653 1716 1663 1531 1530 1542 1571 1673 1711 1819 1764 1749 1665 1647
TRUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
x
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
FALSE 1791 1989 1966 1931 1935 1994 1912 1895 1935 1827 1874 1758 1746 1735 1595 1596
TRUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
x
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
FALSE 1519 1569 1577 1227 1130 1062 1195 1031 941 896 842 763 729 698 659 661
TRUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
x
80 85
FALSE 2664 2446
TRUE 0 0
$Citizenship
x
Citizen, Native Citizen, Naturalized Non-Citizen
FALSE 91956 6910 7098
TRUE 24683 163 492
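In the output above, Married is missing for every interviewee aged 0-14 and for no one aged 15 or older, while the missing values are spread across every region, sex, and citizenship category. The same kind of cross-tabulation can be sketched on made-up data (`toy` is hypothetical and built to show that pattern):

```r
# Hypothetical toy data: Married is recorded only for ages 15 and over
toy <- data.frame(
  Age     = c(3, 9, 14, 15, 30, 62),
  Sex     = c("Female", "Male", "Male", "Female", "Male", "Female"),
  Married = c(NA, NA, NA, "Never Married", "Married", "Widowed")
)

# Rows: is Married missing? Columns: the variable being compared
by_age <- table(is.na(toy$Married), toy$Age < 15)
by_sex <- table(is.na(toy$Married), toy$Sex)

print(by_age)  # off-diagonal cells are 0: missingness lines up with age
print(by_sex)  # every cell is non-zero: no clean split by sex
```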
§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
table(is.na(CPS$MetroAreaCode), CPS$State) # 2
      Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
FALSE    1020      0    1327      724      11333     2545        2593     1696
TRUE      356   1590     201      697        237      380         243      518
      District of Columbia Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas
FALSE                 1791    4947    2250   1576   761     3473    1420 1297   1234
TRUE                     0     202     557    523   757      439     584 1231    701
      Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi
FALSE      908      1216   909     2978          1858     2517      2150         376
TRUE       933       234  1354      222           129      546       989         854
      Missouri Montana Nebraska Nevada New Hampshire New Jersey New Mexico New York
FALSE     1440     199      816   1609          1148       2567        832     5144
TRUE       705    1015     1133    247          1514          0        270      451
      North Carolina North Dakota Ohio Oklahoma Oregon Pennsylvania Rhode Island
FALSE           1642          432 2754     1024   1519         3245         2209
TRUE             977         1213  924      499    424          685            0
      South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington
FALSE           1139          595      1149  6060 1455     657     2367       1937
TRUE             519         1405       635  1017  387    1233      586        429
      West Virginia Wisconsin Wyoming
FALSE           344      1882       0
TRUE           1065       804    1624
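Scanning a wide table for zeros is error-prone, so both counts in this question could also be computed directly. A minimal sketch, assuming made-up vectors `state` and `metro_code` in place of the real columns:

```r
# Hypothetical toy data standing in for CPS$State and CPS$MetroAreaCode
state      <- c("A", "A", "B", "B", "C", "C")
metro_code <- c(NA, NA, 101, 102, 103, NA)

# TRUE for a state only if every one of its codes is missing
all_non_metro <- tapply(is.na(metro_code), state, all)

# TRUE for a state only if none of its codes is missing
all_metro <- tapply(!is.na(metro_code), state, all)

print(names(all_non_metro)[all_non_metro])
print(names(all_metro)[all_metro])
print(c(sum(all_non_metro), sum(all_metro)))
```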
How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
# 3
§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
tapply(is.na(CPS$MetroAreaCode), CPS$Region, mean) %>% sort
Northeast     South      West   Midwest
   0.2162    0.2378    0.2437    0.3479
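The grouped proportion works because `tapply()` splits the logical vector by group and averages each piece. A minimal sketch on made-up data (`region` and `metro_code` are hypothetical):

```r
# Hypothetical toy data standing in for CPS$Region and CPS$MetroAreaCode
region     <- c("Midwest", "Midwest", "South", "South", "South", "West")
metro_code <- c(NA, 10, 11, 12, NA, 13)

# Proportion of missing codes within each region, sorted ascending
prop_sorted <- sort(tapply(is.na(metro_code), region, mean))

# The last name belongs to the region with the largest proportion
largest <- tail(names(prop_sorted), 1)

print(prop_sorted)
print(largest)
```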
§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
abs(0.3 - tapply(is.na(CPS$MetroAreaCode), CPS$State, mean)) %>%
  sort %>% head # Wisconsin
     Wisconsin        Indiana South Carolina      Minnesota       Oklahoma       Missouri
     0.0006701      0.0085828      0.0130277      0.0150685      0.0276428      0.0286713
Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
tapply(is.na(CPS$MetroAreaCode), CPS$State, mean) %>%
  sort %>% tail # Montana
 South Dakota  North Dakota West Virginia       Montana        Alaska       Wyoming
       0.7025        0.7374        0.7559        0.8361        1.0000        1.0000
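Both parts of this question can be answered without reading a sorted list by eye. A minimal sketch, assuming a made-up named vector of per-state proportions (`prop_non_metro` and its values are hypothetical):

```r
# Hypothetical per-state proportions (made-up numbers, not from the CPS)
prop_non_metro <- c(A = 0.12, B = 0.31, C = 0.27, D = 0.55, E = 1.0)

# Closest to 30%: smallest absolute distance from 0.3
closest <- names(which.min(abs(prop_non_metro - 0.3)))

# Largest proportion, ignoring states that are entirely non-metropolitan
below_one <- prop_non_metro[prop_non_metro < 1]
largest_partial <- names(which.max(below_one))

print(closest)
print(largest_partial)
```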
§ 3.1 How many observations (codes for metropolitan areas) are there in MetroAreaMap?
Metro = read.csv("../data/MetroAreaCodes.csv")
nrow(Metro)
[1] 271
How many observations (codes for countries) are there in CountryMap?
Country = read.csv("../data/CountryCodes.csv")
nrow(Country)
[1] 149
§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?
CPS = merge(CPS, Metro, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
# MetroArea
How many interviewees have a missing value for the new metropolitan area variable?
sum(is.na(CPS$MetroArea))
[1] 34238
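The role of `all.x=TRUE` in the merge can be sketched on two made-up tables (`people` and `metro_map` are hypothetical stand-ins for CPS and the code map). It keeps every row of the left table and fills the new column with `NA` where no code matches, which is why the number of missing MetroArea values equals the number of missing MetroAreaCode values.

```r
# Hypothetical toy tables standing in for CPS and the metro code map
people <- data.frame(Id = 1:4, MetroAreaCode = c(100, NA, 200, 100))
metro_map <- data.frame(Code = c(100, 200, 300),
                        MetroArea = c("Area X", "Area Y", "Area Z"))

# Left outer join: unmatched rows of `people` are kept, with MetroArea = NA
merged <- merge(people, metro_map,
                by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

# Default (inner) join: rows of `people` without a matching code are dropped
inner <- merge(people, metro_map, by.x = "MetroAreaCode", by.y = "Code")

print(merged)
print(setdiff(names(merged), names(people)))  # the column added by the merge
print(c(nrow(merged), nrow(inner)))
```

Note that `merge()` reorders rows by the key column, so the result should not be assumed to follow the original row order.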
§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?
table(CPS$MetroArea) %>% sort %>% tail(10) # Boston-Cambridge-Quincy, MA-NH (largest among the question's answer choices, which are not reproduced here)
Houston-Baytown-Sugar Land, TX
1649
Dallas-Fort Worth-Arlington, TX
1863
Minneapolis-St Paul-Bloomington, MN-WI
1942
Boston-Cambridge-Quincy, MA-NH
2229
Providence-Fall River-Warwick, MA-RI
2284
Chicago-Naperville-Joliet, IN-IN-WI
2772
Philadelphia-Camden-Wilmington, PA-NJ-DE
2855
Los Angeles-Long Beach-Santa Ana, CA
4102
Washington-Arlington-Alexandria, DC-VA-MD-WV
4177
New York-Northern New Jersey-Long Island, NY-NJ-PA
5409
§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?
tapply(CPS$Hispanic, CPS$MetroArea, mean) %>% sort %>% tail # Laredo, TX
           San Antonio, TX              El Centro, CA                El Paso, TX
                    0.6442                     0.6869                     0.7910
 Brownsville-Harlingen, TX McAllen-Edinburg-Pharr, TX                 Laredo, TX
                    0.7975                     0.9487                     0.9663
§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
tapply(CPS$Race == "Asian", CPS$MetroArea, mean) %>% sort %>% tail # 4
                 Warner Robins, GA                         Fresno, CA
                            0.1667                             0.1848
             Vallejo-Fairfield, CA San Jose-Sunnyvale-Santa Clara, CA
                            0.2030                             0.2418
 San Francisco-Oakland-Fremont, CA                       Honolulu, HI
                            0.2468                             0.5019
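Instead of counting entries in the sorted tail, the number of areas at or above the threshold can be computed directly. A minimal sketch on made-up data (`race` and `area` are hypothetical):

```r
# Hypothetical toy data standing in for CPS$Race and CPS$MetroArea
race <- c("Asian", "White", "Asian", "White",   # area X: 2 of 4 Asian
          "White", "Black", "White", "Asian",   # area Y: 1 of 4 Asian
          "White", "Black")                     # area Z: 0 of 2 Asian
area <- c(rep("X", 4), rep("Y", 4), rep("Z", 2))

# Proportion of Asian interviewees in each area
prop_asian <- tapply(race == "Asian", area, mean)

# Number of areas where that proportion is at least 20%
n_areas <- sum(prop_asian >= 0.2)

print(prop_asian)
print(n_areas)
```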
§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean, na.rm=TRUE) %>%
  sort %>% head # Iowa City, IA
           Iowa City, IA        Bowling Green, KY    Kalamazoo-Portage, MI
                 0.02913                  0.03704                  0.05051
    Champaign-Urbana, IL Bremerton-Silverdale, WA             Lawrence, KS
                 0.05155                  0.05405                  0.05952
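Why `na.rm=TRUE` is needed here can be seen on a small made-up example (`education` and `area` are hypothetical). Extra arguments given to `tapply()` are passed on to the summary function, so each group is averaged with `mean(..., na.rm = TRUE)`.

```r
# Hypothetical toy data standing in for CPS$Education and CPS$MetroArea
education <- c("Bachelor's degree", NA, "No high school diploma", NA, NA, "High school")
area      <- c("X", "X", "Y", "Y", "Z", "Z")

no_diploma <- education == "No high school diploma"  # NA where Education is missing

# Without na.rm, a single NA in a group makes that group's mean NA
without_rm <- tapply(no_diploma, area, mean)

# With na.rm = TRUE, missing values are dropped before averaging
with_rm <- tapply(no_diploma, area, mean, na.rm = TRUE)

print(without_rm)
print(with_rm)
```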
§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?
CPS = merge(CPS, Country, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
# Country
How many interviewees have a missing value for the new country of birth variable?
sum(is.na(CPS$Country))
[1] 176
§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?
table(CPS$Country) %>% sort %>% tail # Philippines
  Puerto Rico         China         India   Philippines        Mexico United States
          518           581           770           839          3921        115063
§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?
area = "New York-Northern New Jersey-Long Island, NY-NJ-PA"
mean(CPS$Country[CPS$MetroArea==area] != "United States", na.rm=TRUE)
[1] 0.3087
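A minimal sketch of what this expression does with missing values, on made-up vectors (`country`, `area`, and the label "NYC" are hypothetical). Indexing with a logical vector that contains `NA` produces `NA` elements, and `na.rm = TRUE` then drops those along with interviewees whose country is unknown.

```r
# Hypothetical toy data standing in for CPS$Country and CPS$MetroArea
country <- c("United States", "India", NA, "Mexico", "United States", "United States")
area    <- c("NYC", "NYC", "NYC", "NYC", "Other", NA)

# TRUE for rows in the target area, NA where the area itself is missing
in_area <- area == "NYC"

# Proportion born outside the United States among rows with known values
foreign_share <- mean(country[in_area] != "United States", na.rm = TRUE)

print(foreign_share)
```

Here two of the three interviewees with a known country in the target area were born abroad, so the share is 2/3.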
§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?
tapply(CPS$Country == 'India', CPS$MetroArea, sum, na.rm=TRUE) %>%
  sort %>% tail
San Francisco-Oakland-Fremont, CA
27
Detroit-Warren-Livonia, MI
30
Chicago-Naperville-Joliet, IN-IN-WI
31
Philadelphia-Camden-Wilmington, PA-NJ-DE
32
Washington-Arlington-Alexandria, DC-VA-MD-WV
50
New York-Northern New Jersey-Long Island, NY-NJ-PA
96
# New York-Northern New Jersey-Long Island, NY-NJ-PA
tapply(CPS$Country == 'Brazil', CPS$MetroArea, sum, na.rm=TRUE) %>% sort %>% tail
Bridgeport-Stamford-Norwalk, CT
7
New York-Northern New Jersey-Long Island, NY-NJ-PA
7
Washington-Arlington-Alexandria, DC-VA-MD-WV
8
Los Angeles-Long Beach-Santa Ana, CA
9
Miami-Fort Lauderdale-Miami Beach, FL
16
Boston-Cambridge-Quincy, MA-NH
18
# Boston-Cambridge-Quincy, MA-NH
tapply(CPS$Country == 'Somalia', CPS$MetroArea, sum, na.rm=TRUE) %>% sort %>% tail
                          Columbus, OH                           Fargo, ND-MN
                                     5                                      5
           Phoenix-Mesa-Scottsdale, AZ            Seattle-Tacoma-Bellevue, WA
                                     7                                      7
                         St. Cloud, MN Minneapolis-St Paul-Bloomington, MN-WI
                                     7                                     17
# Minneapolis-St Paul-Bloomington, MN-WI
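The three lookups above repeat the same pattern with a different country, so they could be wrapped in a small helper. A minimal sketch on made-up data (`country`, `area`, and the helper names `count_in` and `top_area` are hypothetical, not part of the original exercise):

```r
# Hypothetical toy data standing in for CPS$Country and CPS$MetroArea
country <- c("India", "India", "Brazil", NA, "India", "Brazil", "Brazil")
area    <- c("X", "X", "X", "Y", "Y", "Y", "Y")

# Number of interviewees born in `target` within each area
count_in <- function(target) {
  tapply(country == target, area, sum, na.rm = TRUE)
}

# Area with the largest count (which.max returns the first one on ties)
top_area <- function(target) {
  names(which.max(count_in(target)))
}

print(count_in("India"))
print(top_area("India"))
print(top_area("Brazil"))
```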