Basic notebook setup: install and load a few basic packages

rm(list=ls(all=T))
knitr::opts_chunk$set(comment = NA)
knitr::opts_knit$set(global.par = TRUE)
par(cex=0.8); options(scipen=20, digits=4, width=90)
if(!require(pacman)) install.packages("pacman")
pacman::p_load(magrittr, d3heatmap)

Please do not modify the code above.


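Crime is an issue of international concern, but it is recorded and handled differently in different countries. In the United States, the Federal Bureau of Investigation (FBI) records violent crimes and property crimes. In addition, each city records crime, and some cities publish data on crime rates. The city of Chicago, Illinois publishes crime data online, with records starting in 2001.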

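Chicago is the third most populous city in the United States, with more than 2.7 million residents. In this assignment we focus on one specific type of property crime, motor vehicle theft, and use some basic data analysis in R to understand Chicago's motor vehicle theft records. Please load the file "data/mvtWeek1.csv". Its 11 fields are ID, Date, LocationDescription, Arrest, Domestic, Beat, District, CommunityArea, Year, Latitude, and Longitude.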



Section-1 Loading the Data

【1.1】How many rows of data (observations) are in this dataset?

D = read.csv("data/mvtWeek1.csv", stringsAsFactors=F)
nrow(D)
[1] 191641
ncol(D)
[1] 11

Check the data type of each column

summary(D)
       ID              Date           LocationDescription   Arrest         Domestic      
 Min.   :1310022   Length:191641      Length:191641       Mode :logical   Mode :logical  
 1st Qu.:2832144   Class :character   Class :character    FALSE:176105    FALSE:191226   
 Median :4762956   Mode  :character   Mode  :character    TRUE :15536     TRUE :415      
 Mean   :4968629                                                                         
 3rd Qu.:7201878                                                                         
 Max.   :9181151                                                                         
                                                                                         
      Beat         District     CommunityArea        Year         Latitude   
 Min.   : 111   Min.   : 1      Min.   : 0      Min.   :2001   Min.   :41.6  
 1st Qu.: 722   1st Qu.: 6      1st Qu.:22      1st Qu.:2003   1st Qu.:41.8  
 Median :1121   Median :10      Median :32      Median :2006   Median :41.9  
 Mean   :1259   Mean   :12      Mean   :38      Mean   :2006   Mean   :41.8  
 3rd Qu.:1733   3rd Qu.:17      3rd Qu.:60      3rd Qu.:2009   3rd Qu.:41.9  
 Max.   :2535   Max.   :31      Max.   :77      Max.   :2012   Max.   :42.0  
                NA's   :43056   NA's   :24616                  NA's   :2276  
   Longitude    
 Min.   :-87.9  
 1st Qu.:-87.7  
 Median :-87.7  
 Mean   :-87.7  
 3rd Qu.:-87.6  
 Max.   :-87.5  
 NA's   :2276   
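
The summary above reports NA's for District, CommunityArea, Latitude, and Longitude. A minimal sketch of counting missing values per column directly, using a small hypothetical data frame `toy` in place of `D` (the values are made up for illustration):

```r
# Hypothetical stand-in for D; the values are invented for this sketch
toy <- data.frame(
  District = c(1, NA, 6, NA),
  Latitude = c(41.8, 41.9, NA, 41.7)
)

na_counts <- colSums(is.na(toy))   # number of missing values in each column
na_share  <- colMeans(is.na(toy))  # proportion of missing values in each column

print(na_counts)
print(na_share)
```

Applied to the real data, the same idea would be `colSums(is.na(D))`.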

Factor versus character
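
A small sketch of the difference, using a hypothetical vector rather than the real LocationDescription column. With `stringsAsFactors=F` the text columns stay character, which is why `summary(D)` only shows Length/Class/Mode for them. Since R 4.0.0, `read.csv` defaults to `stringsAsFactors = FALSE` anyway.

```r
# Hypothetical values, for illustration only
x_chr <- c("STREET", "ALLEY", "STREET", "GAS STATION")
x_fct <- factor(x_chr)

class(x_chr)        # "character"
class(x_fct)        # "factor"

levels(x_fct)       # the distinct values, sorted
as.integer(x_fct)   # the integer codes stored behind the labels

summary(x_chr)      # Length / Class / Mode only
summary(x_fct)      # counts per level
```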

【1.2】How many variables are in this dataset?

ncol(D)
[1] 11

【1.3】Using the “max” function, what is the maximum value of the variable “ID”?

max(D$ID)
[1] 9181151

【1.4】 What is the minimum value of the variable “Beat”?

min(D$Beat)
[1] 111

【1.5】 How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?

sum(D$Arrest)
[1] 15536
mean(D$Arrest)
[1] 0.08107

【1.6】 How many observations have a LocationDescription value of ALLEY?

sum(D$LocationDescription == "ALLEY")
[1] 2308

Use sum() and mean() on the result of a logical expression to get the count and the proportion of TRUE values
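
This works because TRUE is treated as 1 and FALSE as 0 in arithmetic. A minimal sketch with hypothetical vectors `arrest` and `location` (not the real columns):

```r
# Hypothetical data, for illustration only
arrest   <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
location <- c("ALLEY", "STREET", "ALLEY", "STREET", "STREET")

n_arrest <- sum(arrest)                 # count of TRUE values
p_arrest <- mean(arrest)                # proportion of TRUE values
n_alley  <- sum(location == "ALLEY")    # count of rows matching a condition

# Proportion of arrests among the rows where location is ALLEY
p_arrest_alley <- mean(arrest[location == "ALLEY"])

print(c(n_arrest, p_arrest, n_alley, p_arrest_alley))
```

If a logical vector contains NA, both functions return NA unless `na.rm = TRUE` is passed.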



Section-2 Understanding Dates in R

【2.1】 In what format are the entries in the variable Date?

head(D$Date)  # Month/Day/Year Hour:Minute
[1] "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 21:30"
[6] "12/31/12 20:30"
ts = as.POSIXct(D$Date, format="%m/%d/%y %H:%M")
par(cex=0.7)
hist(ts,"year",las=2,freq=T,xlab="")
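
A short sketch of how the format string behaves, on hypothetical input strings. Strings that do not match the format become NA, so it is worth counting them after parsing. The `tz = "UTC"` argument is an assumption added here to make the result independent of the session time zone; the notebook itself relies on the local time zone.

```r
# Hypothetical inputs: two valid timestamps and one malformed string
raw <- c("12/31/12 23:15", "01/01/01 00:01", "not a date")

# %m = month, %d = day, %y = two-digit year, %H = hour (00-23), %M = minute
parsed <- as.POSIXct(raw, format = "%m/%d/%y %H:%M", tz = "UTC")

n_failed <- sum(is.na(parsed))   # how many strings failed to parse
print(n_failed)
print(format(parsed, "%Y-%m-%d %H:%M"))
```

On input, `%y` reads two-digit years 00 to 68 as 20xx and 69 to 99 as 19xx, which is fine for data covering 2001 to 2012.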

Two-way table()

table(format(ts,'%w'))

    0     1     2     3     4     5     6 
26316 27397 26791 27416 27319 29284 27118 
table(format(ts,'%m'))

   01    02    03    04    05    06    07    08    09    10    11    12 
16047 13511 15758 15280 16035 16002 16801 16572 16060 17086 16063 16426 
table(weekday=format(ts,'%w'), month=format(ts,'%m'))
       month
weekday   01   02   03   04   05   06   07   08   09   10   11   12
      0 2110 1837 2075 2070 2168 2239 2339 2304 2352 2424 2254 2144
      1 2395 1937 2200 2323 2359 2187 2457 2288 2258 2399 2323 2271
      2 2317 1885 2270 2118 2222 2183 2412 2251 2142 2416 2258 2317
      3 2259 2007 2242 2060 2345 2347 2408 2428 2239 2484 2182 2415
      4 2334 1904 2263 2099 2402 2190 2385 2464 2320 2280 2253 2425
      5 2392 2036 2443 2388 2340 2566 2459 2591 2390 2692 2475 2512
      6 2240 1905 2265 2222 2199 2290 2341 2246 2359 2391 2318 2342
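
In the tables above, `%w` codes weekdays as 0 (Sunday) through 6 (Saturday). A minimal sketch of attaching readable labels with `factor()`, using three hypothetical timestamps; the English abbreviations are a choice made here and avoid the locale dependence of `weekdays()`:

```r
# Hypothetical timestamps: one Sunday and two on a Monday
ts_toy <- as.POSIXct(c("2012-12-30 10:00", "2012-12-31 11:00", "2012-12-31 23:15"),
                     tz = "UTC")

# Map %w codes 0..6 to labels; fixing the levels keeps empty weekdays in the table
wd <- factor(format(ts_toy, "%w"),
             levels = as.character(0:6),
             labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))

tab <- table(weekday = wd)
print(tab)
```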

Visualizing a two-dimensional matrix

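# Count thefts by weekday (%u: 1 = Monday ... 7 = Sunday) and hour of day (%H);
# Rowv/Colv = FALSE keep rows and columns in their original order (no clustering)
table(format(ts,"%u"), format(ts,"%H")) %>% 
  as.data.frame.matrix %>% 
  d3heatmap(Rowv=FALSE, Colv=FALSE, colors=colorRamp(c('seagreen','lightyellow','red')))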
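
In case d3heatmap is not available in your environment, a static version of the same weekday-by-hour view can be sketched with base R's `heatmap()`. The timestamps below are randomly generated and purely hypothetical; with the real data you would build the matrix from `ts` instead.

```r
# Hypothetical data: 500 random timestamps spread over four weeks
set.seed(1)
start  <- as.POSIXct("2012-01-01 00:00", tz = "UTC")
ts_toy <- start + sample(0:(28 * 24 * 3600 - 1), 500, replace = TRUE)

# Fix the levels so every weekday (1 = Monday) and every hour gets a row/column
wd <- factor(format(ts_toy, "%u"), levels = as.character(1:7))
hr <- factor(format(ts_toy, "%H"), levels = sprintf("%02d", 0:23))
m  <- unclass(table(weekday = wd, hour = hr))   # 7 x 24 count matrix

# Rowv = NA and Colv = NA disable clustering; scale = "none" plots raw counts
heatmap(m, Rowv = NA, Colv = NA, scale = "none",
        col = colorRampPalette(c("seagreen", "lightyellow", "red"))(64),
        xlab = "hour", ylab = "weekday (1 = Monday)")
```

Note that `heatmap()` needs a vector of colors, so `colorRampPalette()` is used here instead of `colorRamp()`.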

【2.2】 What is the month and year of the median date in our dataset?

median(ts)
[1] "2006-05-21 12:30:00 CST"
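
To read off just the month and year, the median can be passed through `format()`. A minimal sketch on three hypothetical timestamps; the time zone is fixed to UTC here as an assumption, whereas the output above was printed in the session's local time zone (CST):

```r
# Hypothetical timestamps, for illustration only
ts_toy <- as.POSIXct(c("2001-01-15 08:00", "2006-05-21 12:30", "2012-12-31 23:15"),
                     tz = "UTC")

mid <- median(ts_toy)                          # the middle timestamp
mid_label <- format(mid, "%Y-%m", tz = "UTC")  # year and month only
print(mid_label)
```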

【2.3】 In which month did the fewest motor vehicle thefts occur?

sort(table(format(ts,"%m")))

   02    04    03    06    05    01    09    11    12    08    07    10 
13511 15280 15758 16002 16035 16047 16060 16063 16426 16572 16801 17086 

【2.4】 On which weekday did the most motor vehicle thefts occur?

format(ts,"%w") %>% table %>% sort
.
    0     2     6     4     1     3     5 
26316 26791 27118 27319 27397 27416 29284 

【2.5】 Which month has the largest number of motor vehicle thefts for which an arrest was made?

ts[D$Arrest] %>% format('%m') %>% table %>% sort
.
  05   06   02   09   04   11   03   07   08   10   12   01 
1187 1230 1238 1248 1252 1256 1298 1324 1329 1342 1397 1435
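
The count of arrests per month depends partly on how many thefts each month has, so it can also help to look at the arrest rate. A minimal sketch with `tapply()` on hypothetical vectors `month` and `arrest`; with the real data these would be `format(ts, '%m')` and `D$Arrest`:

```r
# Hypothetical data, for illustration only
month  <- c("01", "01", "01", "02", "02", "03")
arrest <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

n_arrest <- tapply(arrest, month, sum)    # number of arrests per month
rate     <- tapply(arrest, month, mean)   # share of thefts with an arrest per month

top_month <- names(which.max(n_arrest))   # month with the most arrests
print(n_arrest)
print(round(rate, 3))
print(top_month)
```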