AS2-3(a): U.S. Demographics and Employment Data

Basic notebook setup: install and load a few basic packages

rm(list=ls(all=T))
knitr::opts_chunk$set(comment = NA)
knitr::opts_knit$set(global.par = TRUE)
par(cex=0.8); options(scipen=20, digits=4, width=90)
if(!require(pacman)) install.packages("pacman")
pacman::p_load(dplyr)

Please do not modify the code above.

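Employment statistics are among the most important indicators policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment with the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans every month. In this exercise we apply the topics reviewed in the lectures to the September 2013 edition of this nationally representative dataset. Each observation represents a person in the September 2013 CPS who actually completed the survey. The full dataset has 385 columns, but this exercise uses a reduced version, CPSData.csv, with the following columns: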

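PeopleInHousehold: The number of people in the interviewee's household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code identifying the metropolitan area where the interviewee lives; NA if the interviewee does not live in a metropolitan area. The mapping from codes to metropolitan area names is provided in MetroAreaCodes.csv.
Age: The interviewee's age in years. 80 represents people aged 80-84, and 85 represents people aged 85 and older.
Married: The interviewee's marital status.
Sex: The interviewee's sex.
Education: The highest level of education the interviewee has attained.
Race: The interviewee's race.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the interviewee's country of birth. The mapping from codes to country names is provided in CountryCodes.csv.
Citizenship: The interviewee's citizenship status.
EmploymentStatus: The interviewee's employment status.
Industry: The industry in which the interviewee is employed (available only for employed interviewees).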
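The top-coding rule described for Age can be made concrete with a small sketch. The helper `topcode_age` and the raw ages below are hypothetical and written only for illustration; the CPS file already arrives top-coded, so nothing like this needs to be run on the real data.

```r
# Hypothetical helper that reproduces the Age coding rule described above:
# ages 80-84 are recorded as 80, ages 85 and older are recorded as 85.
topcode_age = function(age) {
  ifelse(age >= 85, 85, ifelse(age >= 80, 80, age))
}

raw_age = c(34, 79, 80, 83, 85, 97)   # made-up ages, not CPS data
topcode_age(raw_age)                  # 34 79 80 80 85 85
```

One consequence is that a value of 85 in Age means "85 or older", so summaries such as the maximum or the mean understate the true ages of the oldest interviewees.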

Section-1 Loading and Summarizing the Dataset

§ 1.1 How many interviewees are in the dataset?

A = read.csv('data/CPSData.csv')
MetroAreaMap = read.csv("data/MetroAreaCodes.csv")
CountryMap = read.csv("data/CountryCodes.csv")
nrow(A)

[1] 131302

§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.

table(A$Industry) %>% sort


                               Armed forces                                      Mining 
                                         29                                         550 
Agriculture, forestry, fishing, and hunting                                 Information 
                                       1307                                        1328 
                      Public administration                              Other services 
                                       3186                                        3224 
               Transportation and utilities                                   Financial 
                                       3260                                        4347 
                               Construction                     Leisure and hospitality 
                                       4387                                        6364 
                              Manufacturing          Professional and business services 
                                       6791                                        7519 
                                      Trade             Educational and health services 
                                       8933                                       15017

§ 1.3 Which state has the fewest interviewees?

table(A$State) %>% sort %>% head


   New Mexico       Montana   Mississippi       Alabama West Virginia      Arkansas 
         1102          1214          1230          1376          1409          1421

Which state has the largest number of interviewees?

table(A$State) %>% sort %>% tail


    Illinois Pennsylvania      Florida     New York        Texas   California 
        3912         3930         5149         5595         7077        11570

§ 1.4 What proportion of interviewees are citizens of the United States?

table(A$Citizenship) %>% prop.table


     Citizen, Native Citizen, Naturalized          Non-Citizen 
             0.88833              0.05387              0.05781
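The question asks for the share of citizens, which spans two of the three categories. A minimal sketch, with the shares copied by hand from the output above rather than recomputed from the data:

```r
# Shares copied from the prop.table output above
share = c("Citizen, Native"      = 0.88833,
          "Citizen, Naturalized" = 0.05387,
          "Non-Citizen"          = 0.05781)

# Citizens are the native and naturalized categories combined
citizen_share = sum(share[c("Citizen, Native", "Citizen, Naturalized")])
citizen_share   # about 0.942
```

On the full data frame, `mean(A$Citizenship != "Non-Citizen")` should give the same figure in one step, since the summary in § 2.1 shows no missing values for Citizenship.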

§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)

American Indian
Asian
Black
Multiracial
Pacific Islander
White

tapply(A$Hispanic, A$Race, sum) %>% sort

Pacific Islander            Asian  American Indian      Multiracial            Black 
              77              113              304              448              621 
           White 
           16731
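Picking out the races that meet the threshold can be done with a logical filter instead of reading the table by eye. A small sketch, with the counts copied from the output above:

```r
# Hispanic interviewee counts per race, copied from the tapply output above
hisp = c("Pacific Islander" = 77, "Asian" = 113, "American Indian" = 304,
         "Multiracial" = 448, "Black" = 621, "White" = 16731)

# Keep only the races with at least 250 Hispanic interviewees
names(hisp)[hisp >= 250]
# "American Indian" "Multiracial" "Black" "White"
```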

Section-2 Evaluating Missing Values

§ 2.1 Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)

PeopleInHousehold
Region
State
MetroAreaCode
Age
Married
Sex
Education
Race
Hispanic
CountryOfBirthCode
Citizenship
EmploymentStatus
Industry

summary(A)

 PeopleInHousehold       Region               State       MetroAreaCode        Age      
 Min.   : 1.00     Midwest  :30684   California  :11570   Min.   :10420   Min.   : 0.0  
 1st Qu.: 2.00     Northeast:25939   Texas       : 7077   1st Qu.:21780   1st Qu.:19.0  
 Median : 3.00     South    :41502   New York    : 5595   Median :34740   Median :39.0  
 Mean   : 3.28     West     :33177   Florida     : 5149   Mean   :35075   Mean   :38.8  
 3rd Qu.: 4.00                       Pennsylvania: 3930   3rd Qu.:41860   3rd Qu.:57.0  
 Max.   :15.00                       Illinois    : 3912   Max.   :79600   Max.   :85.0  
                                     (Other)     :94069   NA's   :34238                 
          Married          Sex                          Education    
 Divorced     :11151   Female:67481   High school            :30906  
 Married      :55509   Male  :63821   Bachelor's degree      :19443  
 Never Married:30772                  Some college, no degree:18863  
 Separated    : 2027                  No high school diploma :16095  
 Widowed      : 6505                  Associate degree       : 9913  
 NA's         :25338                  (Other)                :10744  
                                      NA's                   :25338  
               Race           Hispanic     CountryOfBirthCode
 American Indian :  1433   Min.   :0.000   Min.   : 57.0     
 Asian           :  6520   1st Qu.:0.000   1st Qu.: 57.0     
 Black           : 13913   Median :0.000   Median : 57.0     
 Multiracial     :  2897   Mean   :0.139   Mean   : 82.7     
 Pacific Islander:   618   3rd Qu.:0.000   3rd Qu.: 57.0     
 White           :105921   Max.   :1.000   Max.   :555.0     
                                                             
               Citizenship               EmploymentStatus
 Citizen, Native     :116639   Disabled          : 5712  
 Citizen, Naturalized:  7073   Employed          :61733  
 Non-Citizen         :  7590   Not in Labor Force:15246  
                               Retired           :18619  
                               Unemployed        : 4203  
                               NA's              :25789  
                                                         
                               Industry    
 Educational and health services   :15017  
 Trade                             : 8933  
 Professional and business services: 7519  
 Manufacturing                     : 6791  
 Leisure and hospitality           : 6364  
 (Other)                           :21618  
 NA's                              :65060
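Scanning the `summary()` output for `NA's` rows works, but counting missing values per column is more direct. A minimal sketch on a made-up three-row data frame (the values are illustrative, not CPS records):

```r
# Toy data frame with some missing values (hypothetical values)
toy = data.frame(
  Age           = c(12, 40, 67),
  Married       = c(NA, "Married", "Widowed"),
  MetroAreaCode = c(NA, 35620, NA)
)

na_count = colSums(is.na(toy))   # number of NA values in each column
na_count                         # Age 0, Married 1, MetroAreaCode 2
names(na_count)[na_count > 0]    # "Married" "MetroAreaCode"
```

Applied to the real data, `colSums(is.na(A))` would list the same variables that show an `NA's` row in the summary above.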

§ 2.2 Which is the most accurate:

The Married variable being missing is related to the Region value for the interviewee.
The Married variable being missing is related to the Sex value for the interviewee.
The Married variable being missing is related to the Age value for the interviewee.
The Married variable being missing is related to the Citizenship value for the interviewee.
The Married variable being missing is not related to the Region, Sex, Age, or Citizenship value for the interviewee.

tapply(is.na(A$Married), A$Region, mean)

  Midwest Northeast     South      West 
   0.1980    0.1738    0.1920    0.2046

tapply(is.na(A$Married), A$Sex, mean)

Female   Male 
0.1810 0.2056

tapply(is.na(A$Married), A$Age, mean)

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 85 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

tapply(is.na(A$Married), A$Citizenship, mean)

     Citizen, Native Citizen, Naturalized          Non-Citizen 
             0.21162              0.02305              0.06482
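Of the four breakdowns, only Age shows an all-or-nothing pattern: the missing rate is 1 for ages 0 to 14 and 0 from age 15 on, whereas the other variables only shift the rate by degree. The sketch below reproduces that pattern on a made-up data frame by collapsing Age into a single under-15 flag:

```r
# Toy data (hypothetical values): Married is missing exactly for the under-15 rows
toy = data.frame(
  Age     = c(3, 10, 14, 15, 30, 62),
  Married = c(NA, NA, NA, "Never Married", "Married", "Widowed")
)

# Share of missing Married values, split by whether the person is under 15
miss_by_age = tapply(is.na(toy$Married), toy$Age < 15, mean)
miss_by_age   # FALSE 0, TRUE 1
```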

§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).

tapply(is.na(A$MetroAreaCode), A$State, mean) %>% sort

District of Columbia           New Jersey         Rhode Island           California 
             0.00000              0.00000              0.00000              0.02048 
             Florida        Massachusetts             Maryland             New York 
             0.03923              0.06492              0.06938              0.08061 
         Connecticut             Illinois             Colorado              Arizona 
             0.08568              0.11222              0.12991              0.13154 
              Nevada                Texas            Louisiana         Pennsylvania 
             0.13308              0.14370              0.16138              0.17430 
            Michigan           Washington              Georgia             Virginia 
             0.17826              0.18132              0.19843              0.19844 
                Utah               Oregon             Delaware           New Mexico 
             0.21010              0.21822              0.23397              0.24501 
              Hawaii                 Ohio              Alabama              Indiana 
             0.24917              0.25122              0.25872              0.29142 
           Wisconsin       South Carolina            Minnesota             Oklahoma 
             0.29933              0.31303              0.31507              0.32764 
            Missouri            Tennessee               Kansas       North Carolina 
             0.32867              0.35594              0.36227              0.37304 
                Iowa             Arkansas                Idaho             Kentucky 
             0.48695              0.49050              0.49868              0.50679 
       New Hampshire             Nebraska                Maine              Vermont 
             0.56875              0.58132              0.59832              0.65238 
         Mississippi         South Dakota         North Dakota        West Virginia 
             0.69431              0.70250              0.73739              0.75586 
             Montana               Alaska              Wyoming 
             0.83608              1.00000              1.00000

How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.

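p = tapply(is.na(A$MetroAreaCode), A$State, mean)
names(p)[p == 0]   # states where every interviewee lives in a metropolitan area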

§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?

tapply(is.na(A$MetroAreaCode), A$Region, mean) %>% sort

Northeast     South      West   Midwest 
   0.2162    0.2378    0.2437    0.3479

§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?

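p = tapply(is.na(A$MetroAreaCode), A$State, mean)
p[which.min(abs(p - 0.3))]   # state whose non-metropolitan share is closest to 30%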

Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?

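p = tapply(is.na(A$MetroAreaCode), A$State, mean)
p[p < 1] %>% sort %>% tail(1)   # largest share among states that are not entirely non-metropolitan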

Section-3 Integrating Metropolitan Area Data

§ 3.1 How many observations (codes for metropolitan areas) are there in MetroAreaMap?

nrow(MetroAreaMap)

[1] 271

How many observations (codes for countries) are there in CountryMap?

nrow(CountryMap)

[1] 149

§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?

# all.x=TRUE makes each merge a left join: every interviewee is kept, and unmatched codes get NA
A = merge(A, CountryMap, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)

A = merge(A, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)

How many interviewees have a missing value for the new metropolitan area variable?

sum(is.na(A$MetroArea))

[1] 34238
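The count of missing MetroArea values equals the number of missing MetroAreaCode values reported in § 2.1 (34238), which is what a left join should produce when every non-missing code has a match. A minimal sketch of that behaviour, using made-up codes and area names rather than the real lookup table:

```r
# Toy interviewees: one missing code and one code absent from the lookup table
people = data.frame(id = 1:4, MetroAreaCode = c(100, NA, 200, 999))

# Toy lookup table (hypothetical codes and names)
codes = data.frame(Code = c(100, 200), MetroArea = c("Area A", "Area B"))

# Left join: keep every row of people, fill MetroArea with NA when no code matches
merged = merge(people, codes, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

nrow(merged)                    # 4: no interviewee is lost
sum(is.na(merged$MetroArea))    # 2: the NA code and the unmatched code 999
```

Note that `merge()` sorts the result by the key column, so row order in the merged data frame generally differs from the original.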

§ 3.3 Which metropolitan area has the largest number of interviewees?

table(A$MetroArea) %>% sort %>% tail


              Providence-Fall River-Warwick, MA-RI 
                                              2284 
               Chicago-Naperville-Joliet, IN-IN-WI 
                                              2772 
          Philadelphia-Camden-Wilmington, PA-NJ-DE 
                                              2855 
              Los Angeles-Long Beach-Santa Ana, CA 
                                              4102 
      Washington-Arlington-Alexandria, DC-VA-MD-WV 
                                              4177 
New York-Northern New Jersey-Long Island, NY-NJ-PA 
                                              5409

§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?

tapply(A$Hispanic, A$MetroArea, mean) %>% sort %>% tail

           San Antonio, TX              El Centro, CA                El Paso, TX 
                    0.6442                     0.6869                     0.7910 
 Brownsville-Harlingen, TX McAllen-Edinburg-Pharr, TX                 Laredo, TX 
                    0.7975                     0.9487                     0.9663

§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.

tapply(A$Race == "Asian", A$MetroArea, mean) %>% sort %>% tail

                 Warner Robins, GA                         Fresno, CA 
                            0.1667                             0.1848 
             Vallejo-Fairfield, CA San Jose-Sunnyvale-Santa Clara, CA 
                            0.2030                             0.2418 
 San Francisco-Oakland-Fremont, CA                       Honolulu, HI 
                            0.2468                             0.5019
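Because the output is sorted and the second-smallest value shown (Fresno, 0.1848) is already below 20%, every area not shown is below 20% as well, so the count can be taken from these six values. A small sketch with the proportions copied from the output above:

```r
# Asian share per metropolitan area, copied from the tail() output above
asian = c("Warner Robins, GA"                  = 0.1667,
          "Fresno, CA"                         = 0.1848,
          "Vallejo-Fairfield, CA"              = 0.2030,
          "San Jose-Sunnyvale-Santa Clara, CA" = 0.2418,
          "San Francisco-Oakland-Fremont, CA"  = 0.2468,
          "Honolulu, HI"                       = 0.5019)

n_asian_20 = sum(asian >= 0.2)   # number of areas with at least 20% Asian interviewees
n_asian_20                       # 4
```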

§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.

tapply(A$Education=='No high school diploma', A$MetroArea, mean, na.rm=T) %>% 
  sort %>% head

           Iowa City, IA        Bowling Green, KY    Kalamazoo-Portage, MI 
                 0.02913                  0.03704                  0.05051 
    Champaign-Urbana, IL Bremerton-Silverdale, WA             Lawrence, KS 
                 0.05155                  0.05405                  0.05952

Section-4 Integrating Country of Birth Data

§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?

# CountryMap was already merged into A in § 3.2; the column it added is Country
names(CountryMap)

How many interviewees have a missing value for the new Country variable?

sum(is.na(A$Country))

[1] 176

§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?

table(A$Country) %>% sort %>% tail(10)


         Cuba       Germany       Vietnam   El Salvador   Puerto Rico         China 
          426           438           458           477           518           581 
        India   Philippines        Mexico United States 
          770           839          3921        115063
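The table still includes North American countries, so they have to be set aside before taking the maximum. The sketch below uses the counts copied from the output above; which entries count as North American is an assumption made here (the list `north_america` is not part of the dataset). The top ten suffice because any country not shown has at most 426 interviewees.

```r
# Birth-country counts copied from the output above
birth = c("Cuba" = 426, "Germany" = 438, "Vietnam" = 458, "El Salvador" = 477,
          "Puerto Rico" = 518, "China" = 581, "India" = 770, "Philippines" = 839,
          "Mexico" = 3921, "United States" = 115063)

# Assumption: these entries are treated as part of North America
north_america = c("United States", "Mexico", "Puerto Rico", "El Salvador", "Cuba")

outside = birth[!names(birth) %in% north_america]
top_outside = names(outside)[which.max(outside)]
top_outside   # "Philippines"
```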

§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?

table(A$Country[A$MetroArea=="New York-Northern New Jersey-Long Island, NY-NJ-PA"] == 
  "United States") %>% prop.table


 FALSE   TRUE 
0.3087 0.6913
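The same proportion can be computed as a single number with `mean()`, provided missing values in both MetroArea and Country are handled explicitly. A sketch on a made-up data frame (area names "X" and "Y" are placeholders):

```r
# Toy data (hypothetical values) with one missing MetroArea and one missing Country
toy = data.frame(
  MetroArea = c("X", "X", "X", "X", NA, "Y"),
  Country   = c("United States", "India", NA, "United States", "Mexico", "United States")
)

# Select area X without letting NA metro areas leak into the subset
in_x = !is.na(toy$MetroArea) & toy$MetroArea == "X"

# Share born outside the United States, ignoring missing countries
share_foreign = mean(toy$Country[in_x] != "United States", na.rm = TRUE)
share_foreign   # 1/3
```

Like `table()`, which drops NA by default, this excludes interviewees with a missing Country from the denominator.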

§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?

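# Caution: without na.rm=TRUE, any metropolitan area containing a missing Country sums to NA and is dropped by sort(), here and in the Brazil and Somalia queries below.
tapply(A$Country=="India", A$MetroArea, sum) %>% sort %>% tail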

                      Kansas City, MO-KS        Milwaukee-Waukesha-West Allis, WI 
                                      11                                       12 
                              Fresno, CA       San Jose-Sunnyvale-Santa Clara, CA 
                                      16                                       19 
Hartford-West Hartford-East Hartford, CT               Detroit-Warren-Livonia, MI 
                                      26                                       30

In Brazil?

tapply(A$Country=="Brazil", A$MetroArea, sum) %>% sort %>% tail

Sacramento-Arden-Arcade-Roseville, CA                  Canton-Massillon, OH 
                                    2                                     3 
          Phoenix-Mesa-Scottsdale, AZ   Davenport-Moline-Rock Island, IA-IL 
                                    3                                     4 
Miami-Fort Lauderdale-Miami Beach, FL        Boston-Cambridge-Quincy, MA-NH 
                                   16                                    18

In Somalia?

tapply(A$Country=="Somalia", A$MetroArea, sum) %>% sort %>% tail

              York-Hanover, PA Youngstown-Warren-Boardman, OH 
                             0                              0 
                    Dayton, OH                   Richmond, VA 
                             1                              1 
   Phoenix-Mesa-Scottsdale, AZ                  St. Cloud, MN 
                             7                              7
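The three country queries above call `sum` without `na.rm=TRUE`, while Country has 176 missing values. The sketch below shows, on a made-up data frame, how that can remove whole metropolitan areas from the sorted result (area names "X" and "Y" are placeholders):

```r
# Toy data (hypothetical values): area X has one interviewee with a missing Country
toy = data.frame(
  MetroArea = c("X", "X", "X", "Y", "Y"),
  Country   = c("India", "India", NA, "India", "Mexico")
)

by_default = tapply(toy$Country == "India", toy$MetroArea, sum)
with_na_rm = tapply(toy$Country == "India", toy$MetroArea, sum, na.rm = TRUE)

sort(by_default)   # only Y = 1: X sums to NA, and sort() drops NA entries
sort(with_na_rm)   # Y = 1, X = 2: X is now counted and ranks first
```

Adding `na.rm=TRUE` to the queries above would therefore keep areas with missing Country values in the ranking, which may change which area comes out on top.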


Group 0

2019-09-25 09:44:07
