Academic year: 2021
(1)

http://www.hmwu.idv.tw

Han-Ming Wu, Department of Statistics, National Chengchi University

Data Manipulation (資料處理)

B01-1

(2)

Chapter Outline & Learning Objectives

Understand the concept of data manipulation (資料調處).

Understand and apply table-handling functions: rbind {base}, cbind {base}, table {base}, xtabs {stats}, expand.table {epitools}, tabulate {base}, ftable {stats}, xtable {xtable}, stack {utils}.

Understand and apply data-manipulation functions: aggregate {stats}, by {base}, cut {base}, with {base}, merge {base}, split {base}.

Understand and apply the apply family for data manipulation: apply, tapply, lapply, sapply, mapply, rapply.

Understand and apply R packages for data manipulation: plyr, dplyr, tidyr, reshape2, data.table.

2/113

(3)

Data Manipulation (資料處理)

Some Terms:

Data Cleaning: https://en.wikipedia.org/wiki/Data_cleansing

Data Integration: https://en.wikipedia.org/wiki/Data_integration

Data Manipulation, Data Preprocessing

Data Munging (data wrangling): manipulating raw data to achieve a final, usable form; this can mean parsing or filtering the data, or the many other steps required before the data can be used.

3/113

National Academy for Educational Research bilingual glossary: 資料調處 (data manipulation)

(4)

Data Cleaning (資料清理)

Data cleaning is one part of data quality. It aims at:

Accuracy (data is recorded correctly)

Completeness (all relevant data is recorded)

Uniqueness (no duplicated data record)

Timeliness (the data is not old)

Consistency (the data is coherent)

Data cleaning attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

Data cleaning is usually an iterative two-step process consisting of discrepancy detection and data transformation.

4/113

GIGO (garbage in, garbage out)
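As a minimal sketch of these quality dimensions in base R (the data frame here is hypothetical, not from the slides):

```r
# Hypothetical records with a duplicate, missing values, and a suspicious value
d <- data.frame(id    = c(1, 2, 2, 3, 4),
                score = c(10, NA, NA, 300, 25))

sum(duplicated(d))           # uniqueness: number of duplicated records
colSums(is.na(d))            # completeness: missing values per column
boxplot.stats(d$score)$out   # accuracy: values flagged as outliers

d <- unique(d)                                            # drop exact duplicates
d$score[is.na(d$score)] <- median(d$score, na.rm = TRUE)  # fill in NAs
```

Filling NAs with the median is just one of the imputation choices mentioned above; discrepancy detection (the first three lines) should always precede the transformation step.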

(5)

Some R Functions for Data Manipulation

R functions

aggregate{stats} : Compute Summary Statistics of Data Subsets

by{base} : Apply a Function to a Data Frame Split by Factors

cut{base} : Convert Numeric to Factor

with{base} : Evaluate an Expression in a Data Environment.

merge{base} : Merge Two Data Frames

split{base} : Divide into Groups and Reassemble

Table-handling functions: rbind{base}, cbind{base}, table{base}, xtabs{stats}, expand.table{epitools}, tabulate{base}, ftable{stats}, xtable{xtable}, stack{utils}.

The apply family: apply, tapply, sapply, lapply, rapply, mapply.

5/113

(6)

tidyverse : Easily Install and Load 'Tidyverse' Packages

Core tidyverse packages

ggplot2: data visualisation.

dplyr: data manipulation.

tidyr: data tidying.

readr: data import.

purrr: functional programming.

tibble: tibbles, a modern re-imagining of data frames.

Other packages for data manipulation:

hms: times.

stringr: strings.

lubridate: date/times.

forcats: factors.

Data import:

DBI: databases.

haven: SPSS, SAS and Stata files.

httr: web APIs.

jsonlite: JSON.

readxl: .xls and .xlsx files.

rvest: web scraping.

xml2: XML.

Modelling:

modelr: simple modelling within a pipeline

broom: turning models into tidy data

6/113

> install.packages("tidyverse")

> library(tidyverse)

(7)

Tidy Data

Tidy data is a standard way of mapping the meaning of a dataset to its structure.

A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

Each variable forms a column.

Each observation forms a row.

Each type of observational unit forms a table. (Each value is placed in its own cell)

Tidy data is particularly well suited for vectorised programming languages like R, because the layout ensures that values of different variables from the same observation are always paired.

7/113

https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
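A minimal sketch of tidying with tidyr (assuming the package is installed; the data frame is hypothetical): the year columns of a messy wide table become a single year variable, so each row is one observation.

```r
library(tidyr)

# Messy: one row per country, one column per year
messy <- data.frame(country = c("TW", "JP"),
                    `2019` = c(10, 20), `2020` = c(11, 21),
                    check.names = FALSE)

# Tidy: each variable (country, year, cases) forms a column,
# each observation (one country in one year) forms a row
tidy <- pivot_longer(messy, cols = c("2019", "2020"),
                     names_to = "year", values_to = "cases")
tidy
```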

(8)

Data Transformation with dplyr Cheat Sheet by RStudio

8/113

https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf

(9)

Data Transformation with dplyr Cheat Sheet by RStudio

9/113

(10)

Data Wrangling with dplyr and tidyr Cheat Sheet by RStudio

10/113

(11)

Data Wrangling with dplyr and tidyr Cheat Sheet by RStudio

11/113

(12)

rbind

and

cbind

12/113

> begin.experiment <- data.frame(name=c("A", "B", "C", "D", "E", "F"), + weights=c(270, 263, 294, 218, 305, 261))

> middle.experiment <- data.frame(name=c("G", "H", "I"), + weights=c(169, 181, 201))

> end.experiment <- data.frame(name=c("C", "D", "A", "H", "I"), + weights=c(107, 104, 104, 102, 100))

> # merge the data for those who started and finished the experiment

> (common <- intersect(begin.experiment$name, end.experiment$name)) [1] "A" "C" "D"

> (b.at <- is.element(begin.experiment$name, common)) [1] TRUE FALSE TRUE TRUE FALSE FALSE

> (e.at <- is.element(end.experiment$name, common)) [1] TRUE TRUE TRUE FALSE FALSE

> experiment <- rbind(cbind(begin.experiment[b.at,], time="begin"), + cbind(end.experiment[e.at,], time="end"))

> experiment

   name weights  time
1     A     270 begin
3     C     294 begin
4     D     218 begin
11    C     107   end
2     D     104   end
31    A     104   end

> tapply(experiment$weights, experiment$time, mean)
   begin      end
260.6667 105.0000

(13)

table {base} :

Cross Tabulation and Table Creation

13/113

Functions in the table {base} family: tabulate {base}, ftable {stats}, xtabs {stats}, xtable {xtable}.

Description: table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

Usage: table(..., exclude = if (useNA == "no") c(NA, NaN),
             useNA = c("no", "ifany", "always"),
             dnn = list.names(...), deparse.level = 1)

> set.seed(12345)

> grade <- as.factor(sample(c("大一", "大二", "大三", "大四"), 50, replace=T))

> bloodtype <- as.factor(sample(c("A","AB","B","O"), 50, replace=T))

> record <- data.frame(grade, bloodtype)

> head(record) grade bloodtype

1 大三 O

2 大四 O

3 大四 AB

4 大四 A

5 大二 B

6 大一 AB

> record.t <- table(record)

> record.t
      bloodtype
grade  A AB B O
  大一 3  2 5 2
  大二 4  1 7 3
  大三 3  3 1 4
  大四 3  2 2 5

> as.data.frame(record.t) grade bloodtype Freq

1 大一 A 3

2 大二 A 4

3 大三 A 3

4 大四 A 3

5 大一 AB 2

6 大二 AB 1

7 大三 AB 3

8 大四 AB 2

9 大一 B 5

10 大二 B 7

11 大三 B 1

12 大四 B 2

13 大一 O 2

14 大二 O 3

15 大三 O 4

16 大四 O 5

Cross-tabulated data can be produced from raw data using table.

(14)

table {base} :

Cross Tabulation and Table Creation

14/113

> margin.table(record.t, 1) grade

大一 大二 大三 大四 12 15 11 12

> margin.table(record.t, 2) bloodtype

A AB B O 13 8 15 14

>

> colSums(record.t) A AB B O

13 8 15 14

> rowSums(record.t) 大一 大二 大三 大四

12 15 11 12

>

> colMeans(record.t) A AB B O 3.25 2.00 3.75 3.50

> rowMeans(record.t) 大一 大二 大三 大四 3.00 3.75 2.75 3.00

> prop.table(record.t) bloodtype

grade A AB B O 大一 0.06 0.04 0.10 0.04 大二 0.08 0.02 0.14 0.06 大三 0.06 0.06 0.02 0.08 大四 0.06 0.04 0.04 0.10

> prop.table(record.t, margin=1) # row margin bloodtype

grade A AB B O 大一 0.25000000 0.16666667 0.41666667 0.16666667 大二 0.26666667 0.06666667 0.46666667 0.20000000 大三 0.27272727 0.27272727 0.09090909 0.36363636 大四 0.25000000 0.16666667 0.16666667 0.41666667

> prop.table(record.t, margin=2) # column margin bloodtype

grade A AB B O 大一 0.23076923 0.25000000 0.33333333 0.14285714 大二 0.30769231 0.12500000 0.46666667 0.21428571 大三 0.23076923 0.37500000 0.06666667 0.28571429 大四 0.23076923 0.25000000 0.13333333 0.35714286

> set.seed(12345)

> (x <- sample(1:10, 5, replace=T)) [1] 8 9 8 9 5

> (y <- tabulate(x)) [1] 0 0 0 0 1 0 0 2 2

> names(y) <- as.character(1:max(x))

> y
1 2 3 4 5 6 7 8 9
0 0 0 0 1 0 0 2 2

tabulate {base}: Tabulation for Vectors

Description: tabulate takes the integer-valued vector bin and counts the number of times each integer occurs in it.

Usage: tabulate(bin, nbins = max(1, bin, na.rm = TRUE))

(15)

xtabs {stats} : Cross Tabulation

15/113

Description: Create a contingency table (optionally a sparse matrix) from cross-classifying factors, usually contained in a data frame, using a formula interface.

Usage: xtabs(formula = ~., data = parent.frame(), subset, sparse = FALSE, na.action, exclude = c(NA, NaN), drop.unused.levels = FALSE)

> Titanic

, , Age = Child, Survived = No Sex

Class Male Female 1st 0 0 2nd 0 0 3rd 35 17 Crew 0 0 ...

, , Age = Adult, Survived = Yes Sex

Class Male Female 1st 57 140 2nd 14 80 3rd 75 76 Crew 192 20

> Titanic.df <- as.data.frame(Titanic)

> Titanic.df

Class Sex Age Survived Freq 1 1st Male Child No 0 2 2nd Male Child No 0 3 3rd Male Child No 35 4 Crew Male Child No 0 5 1st Female Child No 0 ...

31 3rd Female Adult Yes 76 32 Crew Female Adult Yes 20

> xtabs(Freq ~ Sex + Age, data = Titanic.df) Age

Sex Child Adult Male 64 1667 Female 45 425

> xtabs(Freq ~ Sex + Age, data = Titanic.df, + subset = Class %in% c("1st", "2nd"))

Age

Sex Child Adult Male 16 343 Female 14 237

Cross-tabulated data can be produced from aggregate data using xtabs.

(16)

> sale <- read.table("itemsale.csv", sep=",", header=T)

> sale
  Item     Date Count
1    A 20170328     2
2    A 20170329     6
3    A 20170330     4
4    A 20170329     9
5    B 20170331     6
6    C 20170329     1
7    C 20170330     7
8    C 20170331     0

> attach(sale)

> tb <- xtabs(Count ~ Item + Date)

> rbind(cbind(tb, row.total=margin.table(tb, 1)),
+       col.total=c(margin.table(tb, 2), sum(tb)))
          20170328 20170329 20170330 20170331 row.total
A                2       15        4        0        21
B                0        0        0        6         6
C                0        1        7        0         8
col.total        2       16       11        6        35

> detach(sale)

Pivot-table analysis (樞紐分析) 16/113

See also: http://www.cookbook-r.com/Manipulating_data/Converting_between_data_frames_and_contingency_tables/

(17)

expand.table {epitools}

Expand contingency table into individual-level data set

17/113

> survey <- array(0, dim=c(3, 2, 1))

> survey[,1,1] <- c(2, 0, 1)

> survey[,2,1] <- c(3, 2, 4)

> Satisfactory <- c("Good", "Fair", "Bad")

> Sex <- c("Female", "Male")

> Times <- c("First")

> dimnames(survey) <- list(Satisfactory, Sex, Times)

> names(dimnames(survey)) <- c("Satisfactory", "Sex", "Times")

> survey

, , Times = First Sex

Satisfactory Female Male Good 2 3 Fair 0 2 Bad 1 4

> library(epitools)

> (survey.ex <- expand.table(survey))
   Satisfactory    Sex Times
1          Good Female First
2          Good Female First
3          Good   Male First
4          Good   Male First
5          Good   Male First
6          Fair   Male First
7          Fair   Male First
8           Bad Female First
9           Bad   Male First
10          Bad   Male First
11          Bad   Male First
12          Bad   Male First

expand.table {epitools}: Expand contingency table into individual-level data set

• Usage: expand.table(x)

• Arguments: x: table or array with dimnames(x) and

names(dimnames(x))

(18)

In-class exercise: expand.table

18/113

> data(HairEyeColor)

> HairEyeColor , , Sex = Male

Eye

Hair Brown Blue Hazel Green Black 32 11 10 3 Brown 53 50 25 15 Red 10 10 7 7 Blond 3 30 5 8 , , Sex = Female

Eye

Hair Brown Blue Hazel Green Black 36 9 5 2 Brown 66 34 29 14 Red 16 7 7 7 Blond 4 64 5 8

> # Convert into individual-level data frame

> HairEyeColor.ex <- expand.table(HairEyeColor)

> HairEyeColor.ex
     Hair   Eye    Sex
1   Black Brown   Male
2   Black Brown   Male    (32 identical rows in total)
3   Black Brown   Male
...
589 Blond Green Female
590 Blond Green Female
591 Blond Green Female
592 Blond Green Female

> # Convert into group-level data frame

> as.data.frame(HairEyeColor) Hair Eye Sex Freq 1 Black Brown Male 32 2 Brown Brown Male 53 3 Red Brown Male 10 ...

30 Brown Green Female 14 31 Red Green Female 7 32 Blond Green Female 8

(19)

stack {utils}

Stack or Unstack Vectors from a Data Frame or List

Stacking vectors concatenates multiple vectors into a single vector along with a factor indicating where each observation originated.

Unstacking reverses this operation.

19/113

> elections <- read.csv('elections-2000.csv')

> elections

County Gore Bush Buchanan Nader 1 ALACHUA 47365 34124 263 3226 2 BAKER 2392 5610 73 53 3 BAY 18850 38637 248 828 ...

65 WAKULLA 3838 4512 46 149 66 WALTON 5642 12182 120 265 67 WASHINGTON 2798 4994 88 93

>

> elections.stacked <- cbind(stack(elections[,-1]), + county = elections$County)

> elections.stacked

values ind county 1 47365 Gore ALACHUA 2 2392 Gore BAKER 3 18850 Gore BAY ...

266 149 Nader WAKULLA 267 265 Nader WALTON 268 93 Nader WASHINGTON

plot(elections.stacked[, c(2, 1)])
boxplot(elections[,-1])

(20)

Stack Character Vectors 20/113

> mydata <- data.frame(Area1=c("A", "B", "B", "C"), Area2=c("A", "D", "E", "B"))

> rownames(mydata) <- paste("rater", 1:4, sep="-")

> mydata

Area1 Area2 rater-1 A A rater-2 B D rater-3 B E rater-4 C B

>

> stack(mydata)

Error in stack.data.frame(mydata) : no vector columns were selected

> mydata.stack <- stack(lapply(mydata, as.character))

> colnames(mydata.stack) <- c("Rate", "Area")

> mydata.stack
  Rate  Area
1    A Area1
2    B Area1
3    B Area1
4    C Area1
5    A Area2
6    D Area2
7    E Area2
8    B Area2

(21)

aggregate {stats} :

Compute Summary Statistics of Data Subsets

21/113

Usage: aggregate(x, by, FUN, ..., simplify = TRUE)


> head(state.x77)

Population Income Illiteracy Life Exp Murder HS Grad Frost Area Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708 Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432 Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417 Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945 California 21198 5114 1.1 71.71 10.3 62.6 20 156361 Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766

> dim(state.x77) [1] 50 8

> state.region

[1] South West West South West [6] West Northeast South South South ...

[46] South West South North Central West Levels: Northeast South North Central West

> aggregate(state.x77, list(Region = state.region), mean)

Region Population Income Illiteracy Life Exp Murder HS Grad Frost Area 1 Northeast 5495.111 4570.222 1.000000 71.26444 4.722222 53.96667 132.7778 18141.00 2 South 4208.125 4011.938 1.737500 69.70625 10.581250 44.34375 64.6250 54605.12 3 North Central 4803.000 4611.083 0.700000 71.76667 5.275000 54.51667 138.8333 62652.00 4 West 2915.308 4702.615 1.023077 71.23462 7.215385 62.00000 102.1538 134463.00

state{datasets}: US State Facts and Figures, Data sets related to the 50 states of the United States of America: state.abb, state.area, state.center, state.division,

state.name, state.region, state.x77

Note: tapply() can return a list, whereas aggregate() returns only vectors or matrices.
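The difference can be seen with the same state data: a tapply() whose FUN returns several values keeps a list-like result per group, while aggregate() collapses each group into summary values in a data frame.

```r
# tapply() with a FUN returning several values gives a list (one element per group)
income.range <- tapply(state.x77[, "Income"], state.region, range)
income.range[["South"]]      # min and max income among the Southern states

# aggregate() collapses each group to summary values arranged in a data frame
aggregate(data.frame(Income = state.x77[, "Income"]),
          by = list(Region = state.region), FUN = mean)
```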

(22)

aggregate {stats} , Customized Statistics

22/113

> ## Compute the averages according to region and the occurrence of more

> ## than 130 days of frost.

> aggregate(state.x77,

+ by = list(Region = state.region,

+ Cold = state.x77[,"Frost"] > 130), + FUN = function(x){round(mean(x), 2)})

Region Cold Population Income Illiteracy Life Exp Murder HS Grad Frost Area 1 Northeast FALSE 8802.80 4780.40 1.18 71.13 5.58 52.06 110.60 21838.60 2 South FALSE 4208.12 4011.94 1.74 69.71 10.58 44.34 64.62 54605.12 3 North Central FALSE 7233.83 4633.33 0.78 70.96 8.28 53.37 120.00 56736.50 4 West FALSE 4582.57 4550.14 1.26 71.70 6.83 60.11 51.00 91863.71 5 Northeast TRUE 1360.50 4307.50 0.78 71.44 3.65 56.35 160.50 13519.00 6 North Central TRUE 2372.17 4588.83 0.62 72.58 2.27 55.67 157.67 68567.50 7 West TRUE 970.17 4880.50 0.75 70.69 7.67 64.20 161.83 184162.17

> aggregate(state.x77,

+ by = list(Region = state.region, + Cold = state.x77[,"Frost"] > 130),

+ FUN = function(x){ round(sqrt(sum(x^2)), 2)})

Region Cold Population Income Illiteracy Life Exp Murder HS Grad Frost Area 1 Northeast FALSE 23576.34 10707.22 2.66 159.05 14.13 116.74 250.13 66548.16 2 South FALSE 19980.44 16218.16 7.27 278.85 43.53 178.76 285.52 313220.52 3 North Central FALSE 19487.79 11367.76 1.94 173.82 21.00 130.97 294.40 144108.91 4 West FALSE 21791.15 12108.37 3.74 189.72 19.07 159.20 184.01 269467.52 5 Northeast TRUE 3407.56 8708.52 1.60 142.88 7.62 112.72 322.09 33868.12 6 North Central TRUE 6918.96 11260.58 1.53 177.78 5.73 136.59 388.46 169705.71 7 West TRUE 3013.82 12089.92 2.02 173.19 19.98 157.40 398.41 617310.24

(23)

aggregate {stats} , FUN with Arguments

23/113

> # Compute the average annual approval ratings for American presidents.

> presidents

Qtr1 Qtr2 Qtr3 Qtr4 1945 NA 87 82 75 1946 63 50 43 32 1947 35 60 54 55 ...

1974 28 25 24 24

> aggregate(presidents, nfrequency = 1, FUN = mean) Time Series:

Start = 1945 End = 1974 Frequency = 1

[1] NA 47.00 51.00 NA 58.50 41.75 28.75 NA 67.00 65.00 72.75 72.25 65.25 52.25 [15] 61.50 62.75 76.00 71.50 64.75 72.75 66.50 52.25 45.00 41.00 61.25 58.00 50.50 NA [29] 44.75 25.25

> # Give the summer less weight.

> aggregate(presidents, nfrequency = 1, FUN = weighted.mean, w = c(1, 1, 0.5, 1)) Time Series:

Start = 1945 End = 1974 Frequency = 1

[1] NA 47.57143 50.57143 NA 58.71429 41.14286 28.28571 NA 65.85714 [10] 64.14286 71.85714 73.00000 65.57143 52.85714 61.57143 63.00000 76.71429 72.85714 [19] 65.14286 73.28571 66.14286 51.71429 46.00000 41.85714 60.71429 57.57143 50.00000 [28] NA 45.42857 25.42857

nfrequency: new number of observations per unit of time; must be a divisor of the frequency of x.

(24)

aggregate {stats} ,

Example with Character Variables and NAs

24/113

> testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),

+ v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99))

> by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)

> by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)

> aggregate(x = testDF, by = list(by1, by2), FUN = "mean") Group.1 Group.2 v1 v2

1 1 95 5 55 2 2 95 7 77 3 1 99 5 55 4 2 99 NA NA 5 big damp 3 33 6 blue dry 3 33 7 red red 4 44 8 red wet 1 11

> # Treat NAs as a group

> fby1 <- factor(by1, exclude = "")

> fby2 <- factor(by2, exclude = "")

> aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean") Group.1 Group.2 v1 v2

1 1 95 5.0 55.0 2 2 95 7.0 77.0 3 1 99 5.0 55.0 4 2 99 NA NA 5 big damp 3.0 33.0 6 blue dry 3.0 33.0 7 red red 4.0 44.0 8 red wet 1.0 11.0 9 12 <NA> 9.0 99.0 10 <NA> <NA> 7.5 82.5

(25)

aggregate {stats} ,

Formulas, one ~ one, one ~ many

25/113

> aggregate(weight ~ feed, data = chickwts, mean)

feed weight 1 casein 323.5833 2 horsebean 160.2000 3 linseed 218.7500 4 meatmeal 276.9091 5 soybean 246.4286 6 sunflower 328.9167

> summary(chickwts)

weight feed Min. :108.0 casein :12 1st Qu.:204.5 horsebean:10 Median :258.0 linseed :12 Mean :261.3 meatmeal :11 3rd Qu.:323.5 soybean :14 Max. :423.0 sunflower:12

> aggregate(breaks ~ wool + tension, data = warpbreaks, mean) wool tension breaks

1 A L 44.55556 2 B L 28.22222 3 A M 24.00000 4 B M 28.77778 5 A H 24.55556 6 B H 18.77778

> summary(warpbreaks)

breaks wool tension Min. :10.00 A:27 L:18 1st Qu.:18.25 B:27 M:18 Median :26.00 H:18 Mean :28.15 3rd Qu.:34.00 Max. :70.00

(26)

aggregate {stats} ,

Formulas, many ~ one, and many ~ many

26/113

> summary(esoph)

agegp alcgp tobgp ncases ncontrols 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 1.00 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 3.00 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 6.00 55-64:16 120+ :21 30+ :20 Mean : 2.273 Mean :11.08 65-74:15 3rd Qu.: 4.000 3rd Qu.:14.00 75+ :11 Max. :17.000 Max. :60.00

>

> aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum) alcgp tobgp ncases ncontrols

1 0-39g/day 0-9g/day 9 261 2 40-79 0-9g/day 34 179 3 80-119 0-9g/day 19 61 4 120+ 0-9g/day 16 24 5 0-39g/day 10-19 10 84 6 40-79 10-19 17 85 7 80-119 10-19 19 49 8 120+ 10-19 12 18 9 0-39g/day 20-29 5 42 10 40-79 20-29 15 62 11 80-119 20-29 6 16 12 120+ 20-29 7 12 13 0-39g/day 30+ 5 28 14 40-79 30+ 9 29 15 80-119 30+ 7 12 16 120+ 30+ 10 13

(27)

by {base}:

Apply a Function to a Data Frame Split by Factors

27/113

by(data, INDICES, FUN, ..., simplify = TRUE)

> by(iris[,1:4] , iris$Species , summary) iris$Species: setosa

Sepal.Length Sepal.Width Petal.Length Petal.Width Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 Median :5.000 Median :3.400 Median :1.500 Median :0.200 Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300 Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600

--- iris$Species: versicolor

Sepal.Length Sepal.Width Petal.Length Petal.Width Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 Median :5.900 Median :2.800 Median :4.35 Median :1.300 Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500 Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800

--- iris$Species: virginica

Sepal.Length Sepal.Width Petal.Length Petal.Width Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800 Median :6.500 Median :3.000 Median :5.550 Median :2.000 Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300 Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500

(28)

by {base} , Example

28/113

> by(iris[,1:4] , iris$Species , mean) iris$Species: setosa

[1] NA

--- iris$Species: versicolor

[1] NA

--- iris$Species: virginica

[1] NA

Warning messages:

1: In mean.default(data[x, , drop = FALSE], ...) : argument is not numeric or logical: returning NA 2: In mean.default(data[x, , drop = FALSE], ...) : argument is not numeric or logical: returning NA 3: In mean.default(data[x, , drop = FALSE], ...) : argument is not numeric or logical: returning NA

The problem here is that the function being applied does not work on a data frame. In effect you are calling something like the following, i.e. passing a data frame of 4 columns that contains the rows of the original where Species == "setosa":

> mean(iris[iris$Species == "setosa", 1:4]) [1] NA

Warning message:

In mean.default(iris[iris$Species == "setosa", 1:4]) : argument is not numeric or logical: returning NA

(29)

by {base} , Example

29/113

> by(iris[,1] , iris$Species , mean) iris$Species: setosa

[1] 5.006

--- iris$Species: versicolor

[1] 5.936

--- iris$Species: virginica

[1] 6.588

For by() you need to do this variable by variable.

> by(iris[,1:4] , iris$Species , colMeans) iris$Species: setosa

Sepal.Length Sepal.Width Petal.Length Petal.Width 5.006 3.428 1.462 0.246 --- iris$Species: versicolor

Sepal.Length Sepal.Width Petal.Length Petal.Width 5.936 2.770 4.260 1.326 --- iris$Species: virginica

Sepal.Length Sepal.Width Petal.Length Petal.Width 6.588 2.974 5.552 2.026

Use colMeans() instead of mean() as the FUN applied.

(30)

by {base} , Example

30/113

> varMean <- function(x, ...) sapply(x, mean, ...)

> by(iris[, 1:4], iris$Species, varMean) iris$Species: setosa

Sepal.Length Sepal.Width Petal.Length Petal.Width 5.006 3.428 1.462 0.246

--- iris$Species: versicolor

Sepal.Length Sepal.Width Petal.Length Petal.Width 5.936 2.770 4.260 1.326

--- iris$Species: virginica

Sepal.Length Sepal.Width Petal.Length Petal.Width 6.588 2.974 5.552 2.026

Write a wrapper around sapply().

Use aggregate() :

> with(iris, aggregate(iris[,1:4], list(Species = iris$Species), FUN = mean)) Species Sepal.Length Sepal.Width Petal.Length Petal.Width

1 setosa 5.006 3.428 1.462 0.246 2 versicolor 5.936 2.770 4.260 1.326 3 virginica 6.588 2.974 5.552 2.026

(31)

cut {base} : Convert Numeric to Factor

cut{base} divides the range of x into intervals and codes the values in x according to the interval they fall into. The leftmost interval corresponds to level one, the next leftmost to level two, and so on.

31/113

cut(x, breaks, labels = NULL,

include.lowest = FALSE, right = TRUE, dig.lab = 3, ordered_result = FALSE, ...)

> x <- rnorm(50)

> (x.cut1 <- cut(x, breaks = -5:5))

[1] (-1,0] (-2,-1] (-2,-1] (-1,0] (-1,0] (-2,-1] (0,1] (0,1] (-1,0] (1,2] (0,1]

...

[45] (1,2] (0,1] (-1,0] (-2,-1] (0,1] (0,1]

Levels: (-5,-4] (-4,-3] (-3,-2] (-2,-1] (-1,0] (0,1] (1,2] (2,3] (3,4] (4,5]

> table(x.cut1) x.cut1

(-5,-4] (-4,-3] (-3,-2] (-2,-1] (-1,0] (0,1] (1,2] (2,3] (3,4] (4,5]

0 0 1 10 18 13 8 0 0 0

> (x.cut2 <- cut(x, breaks = -5:5, labels = FALSE))

[1] 5 4 4 5 5 4 6 6 5 7 6 5 7 5 7 6 4 7 7 4 5 6 5 5 5 6 5 6 5 4 7 ...

[47] 5 4 6 6

> table(x.cut2) x.cut2

3 4 5 6 7 1 10 18 13 8

> hist(x, breaks = -5:5, plot = FALSE)$counts [1] 0 0 1 10 18 13 8 0 0 0

(32)

cut {base}

Examples 32/113

> #the outer limits are moved away by 0.1% of the range

> cut(0:10, 5)

[1] (-0.01,2] (-0.01,2] (-0.01,2] (2,4] (2,4] (4,6] (4,6] (6,8]

[9] (6,8] (8,10] (8,10]

Levels: (-0.01,2] (2,4] (4,6] (6,8] (8,10]

>

> age <- sample(0:80, 50, replace=T)

> summary(age)

Min. 1st Qu. Median Mean 3rd Qu. Max.

1.00 21.00 35.00 38.16 52.75 80.00

> cut(age, 5)

[1] (48.4,64.2] (16.8,32.6] (16.8,32.6] (48.4,64.2] (16.8,32.6] (32.6,48.4]

...

[49] (16.8,32.6] (48.4,64.2]

Levels: (0.921,16.8] (16.8,32.6] (32.6,48.4] (48.4,64.2] (64.2,80.1]

> mygroup <- c(0, 15, 20, 50, 60, 80)

> (x.cut <- cut(age, mygroup))

[1] (50,60] (20,50] (15,20] (20,50] (20,50] (20,50] (15,20] (0,15] (0,15] (60,80]

...

Levels: (0,15] (15,20] (20,50] (50,60] (60,80]

> table(x.cut) x.cut

(0,15] (15,20] (20,50] (50,60] (60,80]

7 5 22 8 8

Note: Instead of table(cut(x, br)), hist(x, br, plot = FALSE) is more efficient and less memory-hungry. Instead of cut(*, labels = FALSE), findInterval() is more efficient.
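The equivalence in that note can be checked directly on values away from the break points (cut() uses right-closed intervals, findInterval() left-closed ones, so they differ only exactly at a break):

```r
x  <- c(0.3, 1.7, -2.2, 4.9)
br <- -5:5

cut(x, breaks = br, labels = FALSE)  # bin indices: 6 7 3 10
findInterval(x, br)                  # same indices here: 6 7 3 10
```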

(33)

with {base} :

Evaluate an Expression in a Data Environment

33/113

> with(iris, {

+ iris.lm <- lm(Sepal.Length ~ Petal.Length) + summary(iris.lm)})

Call:

lm(formula = Sepal.Length ~ Petal.Length) Residuals:

Min 1Q Median 3Q Max -1.24675 -0.29657 -0.01515 0.27676 1.00269 Coefficients:

Estimate Std. Error t value Pr(>|t|) (Intercept) 4.30660 0.07839 54.94 <2e-16 ***

Petal.Length 0.40892 0.01889 21.65 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4071 on 148 degrees of freedom

Multiple R-squared: 0.76, Adjusted R-squared: 0.7583 F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16

with(data, expr, ...)

People often use the attach() and detach() functions to set up "search paths" for variable names in R, but because this alters global state that is hard to keep track of, with() is recommended instead: it sets up a temporary alteration to the search path for the duration of a single expression.

(34)

merge {base} : Merge Two Data Frames

Merge (i.e. add variables to a dataset) two data frames horizontally by common columns or row names (key variables, either string or numeric), or perform other versions of database join operations.

34/113

https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

merge(x, y, by = intersect(names(x), names(y)),

by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"),

incomparables = NULL, ...)

# merge two data frames by ID

total <- merge(data.frame.A, data.frame.B, by="ID")

# merge two data frames by ID and Country

total <- merge(data.frame.A, data.frame.B, by=c("ID","Country"))

(35)

merge {base},

Merging Rules

Merging rules for merge{base}:

(default) the data frames are merged on the columns with names they both have.

The rows in the two data frames that match on the specified columns are extracted, and joined together.

If there is more than one match, all possible matches contribute one row each.

Important arguments of merge{base}:

by, by.x, by.y : The names of the columns that are common to both x and y.

The default is to use the columns with common names between the two data frames.

all.x, (all.y): logical;

if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y.

These rows will have NAs in those columns that are usually filled with values from y.

The default is FALSE, so that only rows with data from both x and y are included in the output.

all = TRUE (FALSE) is shorthand for all.x = TRUE (FALSE) and all.y = TRUE (FALSE); these logical values specify the type of merge. The default is all = FALSE, meaning that only the matching rows are returned.

35/113

(36)

Different Types of Merge

Natural join: To keep only rows that match from the data frames, specify the argument all=FALSE (by default).

Full outer join: To keep all rows from both data frames, specify

all=TRUE. Note that this performs the complete merge and fills the columns with NA values where there is no matching data.

Left outer join: To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE.

Right outer join: To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE.

36/113

natural join:     all = FALSE   (dplyr: inner_join)
full outer join:  all = TRUE    (dplyr: full_join)
left outer join:  all.x = TRUE  (dplyr: left_join)
right outer join: all.y = TRUE  (dplyr: right_join)
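These four variants can be compared side by side on a pair of small hypothetical data frames sharing the key column "id":

```r
df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

merge(df1, df2)                # natural join: only ids 2 and 3
merge(df1, df2, all = TRUE)    # full outer join: ids 1 to 4, NAs filled in
merge(df1, df2, all.x = TRUE)  # left outer join: all rows of df1
merge(df1, df2, all.y = TRUE)  # right outer join: all rows of df2
```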

(37)

merge {base} , Example (1)

37/113

> authors <- data.frame(

+ surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")), + nationality = c("US", "Australia", "US", "UK", "Australia"),

+ deceased = c("yes", rep("no", 4)))

> books <- data.frame(

+ name = I(c("Tukey", "Venables", "Tierney",

+ "Ripley", "Ripley", "McNeil", "R Core")), + title = c("Exploratory Data Analysis",

+ "Modern Applied Statistics ...", + "LISP-STAT",

+ "Spatial Statistics", "Stochastic Simulation", + "Interactive Data Analysis",

+ "An Introduction to R"),

+ other.author = c(NA, "Ripley", NA, NA, NA, NA, "Venables & Smith"))

> authors

surname nationality deceased 1 Tukey US yes 2 Venables Australia no 3 Tierney US no 4 Ripley UK no 5 McNeil Australia no

> books

name title other.author 1 Tukey Exploratory Data Analysis <NA>

2 Venables Modern Applied Statistics ... Ripley 3 Tierney LISP-STAT <NA>

4 Ripley Spatial Statistics <NA>

5 Ripley Stochastic Simulation <NA>

6 McNeil Interactive Data Analysis <NA>

7 R Core An Introduction to R Venables & Smith

https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

(38)

merge {base} , Example (1)

38/113

> (m1 <- merge(authors, books, by.x = "surname", by.y = "name"))

surname nationality deceased title other.author 1 McNeil Australia no Interactive Data Analysis <NA>

2 Ripley UK no Spatial Statistics <NA>

3 Ripley UK no Stochastic Simulation <NA>

4 Tierney US no LISP-STAT <NA>

5 Tukey US yes Exploratory Data Analysis <NA>

6 Venables Australia no Modern Applied Statistics ... Ripley

> (m2 <- merge(books, authors, by.x = "name", by.y = "surname"))

name title other.author nationality deceased 1 McNeil Interactive Data Analysis <NA> Australia no 2 Ripley Spatial Statistics <NA> UK no 3 Ripley Stochastic Simulation <NA> UK no 4 Tierney LISP-STAT <NA> US no 5 Tukey Exploratory Data Analysis <NA> US yes 6 Venables Modern Applied Statistics ... Ripley Australia no

> merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)

surname nationality deceased title other.author 1 McNeil Australia no Interactive Data Analysis <NA>

2 R Core <NA> <NA> An Introduction to R Venables & Smith 3 Ripley UK no Spatial Statistics <NA>

4 Ripley UK no Stochastic Simulation <NA>

5 Tierney US no LISP-STAT <NA>

6 Tukey US yes Exploratory Data Analysis <NA>

7 Venables Australia no Modern Applied Statistics ... Ripley

https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

(39)

merge {base} , Example (2)

39/113

> (x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)) k1 k2 data

1 NA 1 1 2 NA NA 2 3 3 NA 3 4 4 4 4 5 5 5 5

> (y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)) k1 k2 data

1 NA NA 1 2 2 NA 2 3 NA 3 3 4 4 4 4 5 5 5 5

> merge(x, y, by = c("k1","k2")) # NA's match k1 k2 data.x data.y

1 4 4 4 4 2 5 5 5 5 3 NA NA 2 1

> merge(x, y, by = "k1") # NA's match, so 6 rows k1 k2.x data.x k2.y data.y

1 4 4 4 4 4 2 5 5 5 5 5 3 NA 1 1 NA 1 4 NA 1 1 3 3 5 NA NA 2 NA 1 6 NA NA 2 3 3

> merge(x, y, by = "k2", incomparables = NA) # 2 rows k2 k1.x data.x k1.y data.y

1 4 4 4 4 4 2 5 5 5 5

(40)

merge {base} , Example (3)

40/113

> stories <- read.table(header=TRUE, text=' + storyid title

+ 1 lions + 2 tigers + 3 bears + ')

> data <- read.table(header=TRUE, text=' + subject storyid rating

+ 1 1 6.7 + 1 2 4.5 + 1 3 3.7 + 2 2 3.3 + 2 3 4.1 + 2 1 5.2 + ')

>

> merge(stories, data, by="storyid") storyid title subject rating

1 1 lions 1 6.7 2 1 lions 2 5.2 3 2 tigers 1 4.5 4 2 tigers 2 3.3 5 3 bears 1 3.7 6 3 bears 2 4.1

> stories2 <- read.table(header=TRUE, text=' + id title

+ 1 lions + 2 tigers + 3 bears + ')

>

> merge(stories2, data, by.x="id", by.y="storyid") id title subject rating

1 1 lions 1 6.7 2 1 lions 2 5.2 3 2 tigers 1 4.5 4 2 tigers 2 3.3 5 3 bears 1 3.7 6 3 bears 2 4.1

http://www.cookbook-r.com/Manipulating_data/Merging_data_frames/


Merge on Multiple Columns

41/113

> animals <- read.table(header=T, text='
+   size  type name
+   small cat  lynx
+   big   cat  tiger
+   small dog  chihuahua
+   big   dog  "great dane"
+ ')
>
> observations <- read.table(header=T, text='
+   number size  type
+   1      big   cat
+   2      small dog
+   3      small dog
+   4      big   dog
+ ')
>
> merge(observations, animals, c("size","type"))
   size type number       name
1   big  cat      1      tiger
2   big  dog      4 great dane
3 small  dog      2  chihuahua
4 small  dog      3  chihuahua


split {base}: Divide into Groups and Reassemble

split{base} divides the data in the vector x into the groups defined by f. The replacement forms replace values corresponding to such a division. unsplit{base} reverses the effect of split.

42/113

split(x, f, drop = FALSE, ...)

split(x, f, drop = FALSE, ...) <- value
unsplit(value, f, drop = FALSE)

> n <- 10

> edu <- factor(sample(1:4, n, replace=T))

> score <- sample(0:100, n)

> cbind(edu, score)
      edu score
 [1,]   1    54
 [2,]   2    50
 [3,]   1     8
 [4,]   3    14
 [5,]   4    43
 [6,]   3     7
 [7,]   4    92
 [8,]   4    16
 [9,]   3    49
[10,]   4    51

> score.edu <- split(score, edu)

> score.edu

$`1`

[1] 54 8

$`2`

[1] 50

$`3`

[1] 14 7 49

$`4`

[1] 43 92 16 51

> unsplit(score.edu, edu)

[1] 54 50 8 14 43 7 92 16 49 51

> sort(edu)

 [1] 1 1 2 3 3 3 4 4 4 4
Levels: 1 2 3 4

> unsplit(score.edu, sort(edu))
 [1] 54  8 50 14  7 49 43 92 16 51
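The replacement form split(x, f) <- value listed in the usage above is not demonstrated on the slide; a minimal sketch (the vectors x, f and the per-group scaling are illustrative, not from the slides): each group is modified and the results are written back into the original positions of x.

```r
x <- c(10, 20, 30, 40)
f <- factor(c("a", "b", "a", "b"))

# divide x by the maximum of its own group, in place
split(x, f) <- lapply(split(x, f), function(g) g / max(g))
x
# [1] 0.3333333 0.5000000 1.0000000 1.0000000
```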


split {base}, Example (1)

43/113

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

> month <- airquality$Month

> airquality.month <- split(airquality, month)

>

> mydata <- lapply(airquality.month, transform,
+                  Oz.Z = scale(Ozone))

> airquality2 <- unsplit(mydata, month)

> head(airquality2)
  Ozone Solar.R Wind Temp Month Day       Oz.Z
1    41     190  7.4   67     5   1  0.7822293
2    36     118  8.0   72     5   2  0.5572518
3    12     149 12.6   74     5   3 -0.5226399
4    18     313 11.5   62     5   4 -0.2526670
5    NA      NA 14.3   56     5   5         NA
6    28      NA 14.9   66     5   6  0.1972879

See also:

transform(airquality, Ozone = -Ozone)

transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)
attach(airquality)

transform(Ozone, logOzone = log(Ozone))

> airquality.month
$`5`
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
...
31    37     279  7.4   76     5  31

$`6`
   Ozone Solar.R Wind Temp Month Day
32    NA     286  8.6   78     6   1
...
61    NA     138  8.0   83     6  30
...
$`9`
    Ozone Solar.R Wind Temp Month Day
124    96     167  6.9   91     9   1
...
153    20     223 11.5   68     9  30
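As an aside (not on the slide): the split/lapply/unsplit pattern above can be collapsed into a single call to ave {stats}, which applies a function within groups and returns a vector already aligned with the original rows, so no unsplit() is needed. A sketch reproducing the slide's Oz.Z column:

```r
aq <- airquality
# standardize Ozone within each Month; c() flattens scale()'s
# one-column matrix back to a plain vector
aq$Oz.Z <- ave(aq$Ozone, aq$Month, FUN = function(z) c(scale(z)))
head(aq$Oz.Z)
# [1]  0.7822293  0.5572518 -0.5226399 -0.2526670         NA  0.1972879
```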


split {base}, Example (2)

44/113

> # Split a matrix into a list by columns

> mat <- cbind(x = 1:10, y = (-4:5)^2)

> cbind(mat, col(mat))
       x  y
 [1,]  1 16 1 2
 [2,]  2  9 1 2
 [3,]  3  4 1 2
 [4,]  4  1 1 2
 [5,]  5  0 1 2
 [6,]  6  1 1 2
 [7,]  7  4 1 2
 [8,]  8  9 1 2
 [9,]  9 16 1 2
[10,] 10 25 1 2

> split(mat, col(mat))

$`1`

[1] 1 2 3 4 5 6 7 8 9 10

$`2`

[1] 16 9 4 1 0 1 4 9 16 25

> split(1:10, 1:2)

$`1`

[1] 1 3 5 7 9

$`2`

[1] 2 4 6 8 10
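Beyond a single factor, f may also be a list of factors; groups are then formed by their interaction, and drop = TRUE removes empty combinations. A hedged sketch (the grouping vectors g1 and g2 are illustrative):

```r
g1 <- c("a", "a", "b", "b")
g2 <- c("x", "y", "x", "x")

# four interaction groups: a.x, b.x, a.y, b.y (b.y is empty)
split(1:4, list(g1, g2))

# drop = TRUE discards the empty b.y group, leaving three groups
split(1:4, list(g1, g2), drop = TRUE)
```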
