AS3-3: Demographics and Employment in the United States

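Employment statistics are among the most important measures that policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will apply the topics reviewed in the lectures, and a few more, to the September 2013 version of this nationally representative dataset (available online).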

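The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed the survey. While the full dataset has 385 variables, in this exercise we will use a reduced version, CPSData.csv, which has the following variables: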

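PeopleInHousehold: The number of people in the interviewee's household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to metropolitan area names is provided in the file MetroAreaCodes.csv.
Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and older.
Married: The marital status of the interviewee.
Sex: The sex of the interviewee.
Education: The highest level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code that identifies the interviewee's country of birth. The mapping from codes to country names is provided in the file CountryCodes.csv.
Citizenship: The United States citizenship status of the interviewee.
EmploymentStatus: The employment status of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).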

Section-1 Loading and Summarizing the Dataset

§ 1.1 How many interviewees are in the dataset?

library(magrittr)  # provides the %>% pipe used throughout
CPS = read.csv("../data/CPSData.csv")
nrow(CPS)

[1] 131302

§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.

sort(table(CPS$Industry)) %>% tail  # Educational and health services


                      Construction            Leisure and hospitality 
                              4387                               6364 
                     Manufacturing Professional and business services 
                              6791                               7519 
                             Trade    Educational and health services 
                              8933                              15017

§ 1.3 Which state has the fewest interviewees?

head(sort(table(CPS$State)))  # New Mexico


   New Mexico       Montana   Mississippi       Alabama West Virginia      Arkansas 
         1102          1214          1230          1376          1409          1421

Which state has the largest number of interviewees?

tail(sort(table(CPS$State))) # California


    Illinois Pennsylvania      Florida     New York        Texas   California 
        3912         3930         5149         5595         7077        11570

§ 1.4 What proportion of interviewees are citizens of the United States?

table(CPS$Citizenship) %>% prop.table


     Citizen, Native Citizen, Naturalized          Non-Citizen 
             0.88833              0.05387              0.05781

table(CPS$Citizenship == "Non-Citizen") %>% prop.table  # citizens are the FALSE share: 0.94219


  FALSE    TRUE 
0.94219 0.05781
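
The citizen share can also be computed in one step, because the mean of a logical vector is the proportion of TRUE values. A minimal sketch on a made-up vector; `citizenship` and its values are hypothetical toy data, not the CPS file:

```r
# Toy stand-in for CPS$Citizenship (hypothetical values, not the real data)
citizenship <- c("Citizen, Native", "Citizen, Native", "Citizen, Naturalized",
                 "Non-Citizen", "Citizen, Native")

# TRUE counts as 1 and FALSE as 0, so the mean is the share of citizens
citizen_share <- mean(citizenship != "Non-Citizen")
print(citizen_share)
```

Applied to the real data, the same expression on `CPS$Citizenship` should agree with the FALSE column of the table above.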

§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)

American Indian
Asian
Black
Multiracial
Pacific Islander
White

table(CPS$Race, CPS$Hispanic)

                  
                       0     1
  American Indian   1129   304
  Asian             6407   113
  Black            13292   621
  Multiracial       2449   448
  Pacific Islander   541    77
  White            89190 16731

Section-2 Evaluating Missing Values

§ 2.1 Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)

PeopleInHousehold
Region
State
MetroAreaCode
Age
Married
Sex
Education
Race
Hispanic
CountryOfBirthCode
Citizenship
EmploymentStatus
Industry

colSums(is.na(CPS))

 PeopleInHousehold             Region              State      MetroAreaCode 
                 0                  0                  0              34238 
               Age            Married                Sex          Education 
                 0              25338                  0              25338 
              Race           Hispanic CountryOfBirthCode        Citizenship 
                 0                  0                  0                  0 
  EmploymentStatus           Industry 
             25789              65060
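
Rather than scanning the counts by eye, the names of the affected columns can be extracted directly. A small sketch on a toy data frame; `toy` and its contents are assumptions made for illustration:

```r
# Toy data frame (hypothetical), with NAs in two of its three columns
toy <- data.frame(
  Age      = c(10, 35, 62),
  Married  = c(NA, "Married", "Widowed"),
  Industry = c(NA, "Trade", NA)
)

na_counts <- colSums(is.na(toy))                 # number of NAs per column
cols_with_na <- names(na_counts[na_counts > 0])  # keep columns with any NA
print(cols_with_na)
```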

§ 2.2 Which is the most accurate:

The Married variable being missing is related to the Region value for the interviewee.
The Married variable being missing is related to the Sex value for the interviewee.
The Married variable being missing is related to the Age value for the interviewee.
The Married variable being missing is related to the Citizenship value for the interviewee.
The Married variable being missing is not related to the Region, Sex, Age, or Citizenship value for the interviewee.

lapply(CPS[c('Region','Sex','Age','Citizenship')], 
       function(x) table(is.na(CPS$Married), x))

$Region
       x
        Midwest Northeast South  West
  FALSE   24609     21432 33535 26388
  TRUE     6075      4507  7967  6789

$Sex
       x
        Female  Male
  FALSE  55264 50700
  TRUE   12217 13121

$Age
       x
           0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
  FALSE    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 1795
  TRUE  1283 1559 1574 1693 1695 1795 1721 1681 1729 1748 1750 1721 1797 1802 1790    0
       x
          16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31
  FALSE 1751 1764 1596 1517 1398 1525 1536 1638 1627 1604 1643 1657 1736 1645 1854 1762
  TRUE     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
       x
          32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47
  FALSE 1790 1804 1653 1716 1663 1531 1530 1542 1571 1673 1711 1819 1764 1749 1665 1647
  TRUE     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
       x
          48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63
  FALSE 1791 1989 1966 1931 1935 1994 1912 1895 1935 1827 1874 1758 1746 1735 1595 1596
  TRUE     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
       x
          64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
  FALSE 1519 1569 1577 1227 1130 1062 1195 1031  941  896  842  763  729  698  659  661
  TRUE     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
       x
          80   85
  FALSE 2664 2446
  TRUE     0    0

$Citizenship
       x
        Citizen, Native Citizen, Naturalized Non-Citizen
  FALSE           91956                 6910        7098
  TRUE            24683                  163         492
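
In the Age table above, Married is missing for every interviewee aged 0 to 14 and for none aged 15 or older, while the other three variables show missing values in every category. The long Age table can be condensed into an age range per missingness group. A sketch on toy data; the data frame `toy` is hypothetical:

```r
# Toy data (hypothetical): Married is NA only for the youngest rows
toy <- data.frame(
  Age     = c(3, 9, 14, 15, 28, 47),
  Married = c(NA, NA, NA, "Never Married", "Married", "Divorced")
)

# Split ages by whether Married is missing, then take min and max of each group
age_range <- lapply(split(toy$Age, is.na(toy$Married)), range)
print(age_range)
```

If the two ranges do not overlap, as here, missingness is fully determined by age.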

§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).

table(is.na(CPS$MetroAreaCode), CPS$State) # 2 (Alaska, Wyoming)

       
        Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
  FALSE    1020      0    1327      724      11333     2545        2593     1696
  TRUE      356   1590     201      697        237      380         243      518
       
        District of Columbia Florida Georgia Hawaii Idaho Illinois Indiana  Iowa Kansas
  FALSE                 1791    4947    2250   1576   761     3473    1420  1297   1234
  TRUE                     0     202     557    523   757      439     584  1231    701
       
        Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi
  FALSE      908      1216   909     2978          1858     2517      2150         376
  TRUE       933       234  1354      222           129      546       989         854
       
        Missouri Montana Nebraska Nevada New Hampshire New Jersey New Mexico New York
  FALSE     1440     199      816   1609          1148       2567        832     5144
  TRUE       705    1015     1133    247          1514          0        270      451
       
        North Carolina North Dakota  Ohio Oklahoma Oregon Pennsylvania Rhode Island
  FALSE           1642          432  2754     1024   1519         3245         2209
  TRUE             977         1213   924      499    424          685            0
       
        South Carolina South Dakota Tennessee Texas  Utah Vermont Virginia Washington
  FALSE           1139          595      1149  6060  1455     657     2367       1937
  TRUE             519         1405       635  1017   387    1233      586        429
       
        West Virginia Wisconsin Wyoming
  FALSE           344      1882       0
  TRUE           1065       804    1624

How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.

# 3 (District of Columbia, New Jersey, Rhode Island)
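
Both counts can also be computed instead of read off the wide table: a state whose non-metropolitan share is exactly 1 has no metropolitan interviewees, and a share of exactly 0 means all are metropolitan. A sketch on toy data; the states "A", "B", "C" and their codes are made up:

```r
# Toy data (hypothetical): A is all non-metro, B is all metro, C is mixed
toy <- data.frame(
  State         = c("A", "A", "B", "B", "C", "C"),
  MetroAreaCode = c(NA, NA, 100, 200, 300, NA)
)

# Share of interviewees with a missing code, per state
nonmetro_share <- tapply(is.na(toy$MetroAreaCode), toy$State, mean)

all_nonmetro <- names(nonmetro_share)[nonmetro_share == 1]
all_metro    <- names(nonmetro_share)[nonmetro_share == 0]
print(all_nonmetro)
print(all_metro)
```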

§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?

tapply(is.na(CPS$MetroAreaCode), CPS$Region, mean) %>% sort

Northeast     South      West   Midwest 
   0.2162    0.2378    0.2437    0.3479

§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?

abs(0.3 - tapply(is.na(CPS$MetroAreaCode), CPS$State, mean)) %>% 
  sort %>% head  # Wisconsin

     Wisconsin        Indiana South Carolina      Minnesota       Oklahoma       Missouri 
     0.0006701      0.0085828      0.0130277      0.0150685      0.0276428      0.0286713

Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?

tapply(is.na(CPS$MetroAreaCode), CPS$State, mean) %>% 
  sort %>% tail  # Montana

 South Dakota  North Dakota West Virginia       Montana        Alaska       Wyoming 
       0.7025        0.7374        0.7559        0.8361        1.0000        1.0000

Section-3 Integrating Metropolitan Area Data

§ 3.1 How many observations (codes for metropolitan areas) are there in MetroAreaMap (loaded here as Metro)?

Metro = read.csv("../data/MetroAreaCodes.csv")
nrow(Metro)

[1] 271

How many observations (codes for countries) are there in CountryMap (loaded here as Country)?

Country = read.csv("../data/CountryCodes.csv")
nrow(Country)

[1] 149

§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?

CPS = merge(CPS, Metro, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
# MetroArea

How many interviewees have a missing value for the new metropolitan area variable?

sum(is.na(CPS$MetroArea))

[1] 34238
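
The count of missing names matches the count of missing codes because `all.x=TRUE` keeps every row of the left data frame, even those without a match. A small sketch of the difference on toy data; `people`, `codes`, and the area names are hypothetical:

```r
# Toy tables (hypothetical): one person has no metropolitan code
people <- data.frame(Id = 1:4, MetroAreaCode = c(10, 20, NA, 10))
codes  <- data.frame(Code = c(10, 20), MetroArea = c("Alpha, XX", "Beta, YY"))

# Left join: unmatched rows of `people` are kept, with NA for MetroArea
left <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

# Default (inner) join: unmatched rows are dropped
inner <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code")

print(nrow(left))
print(nrow(inner))
print(sum(is.na(left$MetroArea)))
```

Note that `merge()` may reorder rows, so row positions in the merged frame should not be assumed to match the original.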

§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?

table(CPS$MetroArea) %>% sort %>% tail(10)  # Boston-Cambridge-Quincy, MA-NH, among the question's answer choices (not listed here)


                    Houston-Baytown-Sugar Land, TX 
                                              1649 
                   Dallas-Fort Worth-Arlington, TX 
                                              1863 
            Minneapolis-St Paul-Bloomington, MN-WI 
                                              1942 
                    Boston-Cambridge-Quincy, MA-NH 
                                              2229 
              Providence-Fall River-Warwick, MA-RI 
                                              2284 
               Chicago-Naperville-Joliet, IN-IN-WI 
                                              2772 
          Philadelphia-Camden-Wilmington, PA-NJ-DE 
                                              2855 
              Los Angeles-Long Beach-Santa Ana, CA 
                                              4102 
      Washington-Arlington-Alexandria, DC-VA-MD-WV 
                                              4177 
New York-Northern New Jersey-Long Island, NY-NJ-PA 
                                              5409

§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?

tapply(CPS$Hispanic, CPS$MetroArea, mean) %>% sort %>% tail  # Laredo, TX

           San Antonio, TX              El Centro, CA                El Paso, TX 
                    0.6442                     0.6869                     0.7910 
 Brownsville-Harlingen, TX McAllen-Edinburg-Pharr, TX                 Laredo, TX 
                    0.7975                     0.9487                     0.9663

§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.

tapply(CPS$Race == "Asian", CPS$MetroArea, mean) %>% sort %>% tail  # 4

                 Warner Robins, GA                         Fresno, CA 
                            0.1667                             0.1848 
             Vallejo-Fairfield, CA San Jose-Sunnyvale-Santa Clara, CA 
                            0.2030                             0.2418 
 San Francisco-Oakland-Fremont, CA                       Honolulu, HI 
                            0.2468                             0.5019
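
Instead of counting entries in the sorted tail, the number of areas meeting the threshold can be computed by summing a logical comparison. A sketch on toy data; the areas "A", "B", "C" and their rows are made up:

```r
# Toy data (hypothetical): Asian share is 0.25 in A, 0.4 in B, 0 in C
toy <- data.frame(
  MetroArea = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C"),
  Race      = c("Asian", "White", "White", "White",
                "Asian", "Asian", "Black", "White", "White",
                "White", "Black")
)

asian_share <- tapply(toy$Race == "Asian", toy$MetroArea, mean)

# Each TRUE counts as 1, so the sum is the number of areas at or above 20%
n_areas <- sum(asian_share >= 0.2, na.rm = TRUE)
print(n_areas)
```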

§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.

tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean, na.rm=T) %>% 
  sort %>% head  # Iowa City, IA

           Iowa City, IA        Bowling Green, KY    Kalamazoo-Portage, MI 
                 0.02913                  0.03704                  0.05051 
    Champaign-Urbana, IL Bremerton-Silverdale, WA             Lawrence, KS 
                 0.05155                  0.05405                  0.05952
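
The `na.rm` argument matters here because Education has missing values: without it, any area containing a single NA gets an NA mean, and `sort()` then drops that area from the result by default. A sketch on toy data; `toy` is hypothetical:

```r
# Toy data (hypothetical): area A contains one missing Education value
toy <- data.frame(
  MetroArea = c("A", "A", "A", "B", "B"),
  Education = c("No high school diploma", NA, "Bachelor's degree",
                "High school", "No high school diploma")
)

no_hs <- toy$Education == "No high school diploma"  # NA where Education is NA

with_na    <- tapply(no_hs, toy$MetroArea, mean)                # A is NA
without_na <- tapply(no_hs, toy$MetroArea, mean, na.rm = TRUE)  # A is 1/2
print(with_na)
print(without_na)
```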

Section-4 Integrating Country of Birth Data

§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?

CPS = merge(CPS, Country, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
# Country

How many interviewees have a missing value for the new country of birth variable?

sum(is.na(CPS$Country))

[1] 176

§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?

table(CPS$Country) %>% sort %>% tail  # Philippines


  Puerto Rico         China         India   Philippines        Mexico United States 
          518           581           770           839          3921        115063

§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?

area = "New York-Northern New Jersey-Long Island, NY-NJ-PA"
mean(CPS$Country[CPS$MetroArea==area] != "United States", na.rm=T)

[1] 0.3087

§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?

tapply(CPS$Country == 'India', CPS$MetroArea, sum, na.rm=T) %>% 
  sort %>% tail

                 San Francisco-Oakland-Fremont, CA 
                                                27 
                        Detroit-Warren-Livonia, MI 
                                                30 
               Chicago-Naperville-Joliet, IN-IN-WI 
                                                31 
          Philadelphia-Camden-Wilmington, PA-NJ-DE 
                                                32 
      Washington-Arlington-Alexandria, DC-VA-MD-WV 
                                                50 
New York-Northern New Jersey-Long Island, NY-NJ-PA 
                                                96

# New York-Northern New Jersey-Long Island, NY-NJ-PA

In Brazil?

tapply(CPS$Country == 'Brazil', CPS$MetroArea, sum, na.rm=T) %>% sort %>% tail

                   Bridgeport-Stamford-Norwalk, CT 
                                                 7 
New York-Northern New Jersey-Long Island, NY-NJ-PA 
                                                 7 
      Washington-Arlington-Alexandria, DC-VA-MD-WV 
                                                 8 
              Los Angeles-Long Beach-Santa Ana, CA 
                                                 9 
             Miami-Fort Lauderdale-Miami Beach, FL 
                                                16 
                    Boston-Cambridge-Quincy, MA-NH 
                                                18

# Boston-Cambridge-Quincy, MA-NH

In Somalia?

tapply(CPS$Country == 'Somalia', CPS$MetroArea, sum, na.rm=T) %>% sort %>% tail

                          Columbus, OH                           Fargo, ND-MN 
                                     5                                      5 
           Phoenix-Mesa-Scottsdale, AZ            Seattle-Tacoma-Bellevue, WA 
                                     7                                      7 
                         St. Cloud, MN Minneapolis-St Paul-Bloomington, MN-WI 
                                     7                                     17

# Minneapolis-St Paul-Bloomington, MN-WI
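
The three lookups above repeat one pattern, and the top area can be pulled out by name with `which.max()` instead of reading the sorted tail. A sketch on toy data; `toy` and the helper `top_area_for()` are hypothetical and not part of the assignment:

```r
# Toy data (hypothetical): one row lacks a metro area, one lacks a country
toy <- data.frame(
  MetroArea = c("A", "A", "B", "B", "B", NA),
  Country   = c("India", "Brazil", "India", "India", NA, "India")
)

# Hypothetical helper: area with the most interviewees born in `country`
top_area_for <- function(df, country) {
  counts <- tapply(df$Country == country, df$MetroArea, sum, na.rm = TRUE)
  names(which.max(counts))  # on ties, which.max returns the first maximum
}

top_india  <- top_area_for(toy, "India")
top_brazil <- top_area_for(toy, "Brazil")
print(top_india)
print(top_brazil)
```

Rows with a missing MetroArea are dropped by `tapply()`, so non-metropolitan interviewees do not affect the counts.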

Group 0

2019-03-16 21:42:25
