tidyverse 框架最大的优点在于所有函数的输入输出都是数据框,从而可以通过管道操作链式运算。
dplyr 包中filter()
, mutate()
等函数参数中的表达式本质上都是以列为单位的向量化运算
return a subset of rows as a new data
filter()
条件筛选行slice
索引切片行distinct()
去重,指定以哪些列作为评价基准(其他列舍弃)top_n
选择最大的n行,要指定用以比较的列arrange()
排序add_row
添加行top_n(iris, 5, Sepal.Width) %>% setDT()) # 因为并列第5都是3.9,所以筛出来6行 (
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.4 3.9 1.7 0.4 setosa
#> 2: 5.8 4.0 1.2 0.2 setosa
#> 3: 5.7 4.4 1.5 0.4 setosa
#> 4: 5.4 3.9 1.3 0.4 setosa
#> 5: 5.2 4.1 1.5 0.1 setosa
#> 6: 5.5 4.2 1.4 0.2 setosa
<- as_tibble(faithful)
faithful add_row(faithful, eruptions = 1, waiting = 1)
#> # A tibble: 273 × 2
#> eruptions waiting
#> <dbl> <dbl>
#> 1 3.6 79
#> 2 1.8 54
#> 3 3.33 74
#> 4 2.28 62
#> 5 4.53 85
#> 6 2.88 55
#> 7 4.7 88
#> 8 3.6 85
#> 9 1.95 51
#> 10 4.35 85
#> # … with 263 more rows
filter()
筛选行可以用下标实现,如flights[8:12,]
,head(x, n)
和tail(x, n)
也能取最前或后若干行。但dplyr::filter()
是最灵活的。
# 选取1月1日的航班
<- filter(flights, month == 1, day == 1)
new1 head(new1)
#> # A tibble: 6 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
# 选取11月或12月的航班,注意%in%的用法
<- flights %>% filter(month %in% c(11, 12))
nov_dec nov_dec
#> # A tibble: 55,403 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 11 1 5 2359 6 352 345
#> 2 2013 11 1 35 2250 105 123 2356
#> 3 2013 11 1 455 500 -5 641 651
#> 4 2013 11 1 539 545 -6 856 827
#> 5 2013 11 1 542 545 -3 831 855
#> 6 2013 11 1 549 600 -11 912 923
#> 7 2013 11 1 550 600 -10 705 659
#> 8 2013 11 1 554 600 -6 659 701
#> 9 2013 11 1 554 600 -6 826 827
#> 10 2013 11 1 554 600 -6 749 751
#> # … with 55,393 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# 使用管道操作
<- flights %>%
not_cancelled filter(!is.na(dep_delay), !is.na(arr_delay))
filter() 只能筛选出条件为 TRUE 的行;它会排除那些条件为 FALSE 和 NA 的行。如果想保留缺失值,需要明确指出。
silce()
dplyr::slice(data, ...)
可以选择指定序号的行子集,正的序号表示保留,负的序号表示排除。如:
%>% slice(3:5) flights
#> # A tibble: 3 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 542 540 2 923 850
#> 2 2013 1 1 544 545 -1 1004 1022
#> 3 2013 1 1 554 600 -6 812 837
#> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
distinct()
dplyr::distinct()
可以对数据框指定若干变量,然后筛选出所有不同值,每组不同值仅保留一行。
%>% distinct(sex, age, .keep_all = TRUE) d.class
#> # A tibble: 11 × 5
#> name sex age height weight
#> <chr> <fct> <dbl> <dbl> <dbl>
#> 1 Alice F 13 56.5 84
#> 2 Gail F 14 64.3 90
#> 3 Karen F 12 56.3 77
#> 4 Mary F 15 66.5 112
#> 5 Sandy F 11 51.3 50.5
#> 6 Alfred M 14 69 112.
#> 7 Guido M 15 67 133
#> 8 James M 12 57.3 83
#> 9 Jeffrey M 13 62.5 84
#> 10 Philip M 16 72 150
#> 11 Thomas M 11 57.5 85
# .keep_all=TRUE 保留数据框中其它变量
arrange()
# 依次按year, month, day排序;默认升序
<- arrange(flights, year, month, day)
new2 head(new2)
#> # A tibble: 6 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
# 使用desc()可以按列进行降序排序
<- arrange(flights, desc(arr_delay))
new2 head(new2)
#> # A tibble: 6 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 9 641 900 1301 1242 1530
#> 2 2013 6 15 1432 1935 1137 1607 2120
#> 3 2013 1 10 1121 1635 1126 1239 1810
#> 4 2013 9 20 1139 1845 1014 1457 2210
#> 5 2013 7 22 845 1600 1005 1044 1815
#> 6 2013 4 10 1100 1900 960 1342 2211
#> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
排序时不论升序还是降序,所有的缺失值都自动排到末尾。
pull()
选出一列并从数据框转化为向量,其实它就是操作符$
的函数化
%>% pull(name) d.class
#> [1] "Alice" "Becka" "Gail" "Karen" "Kathy" "Mary" "Sandy"
#> [8] "Sharon" "Tammy" "Alfred" "Duke" "Guido" "James" "Jeffrey"
#> [15] "John" "Philip" "Robert" "Thomas" "William"
pull()
可以指定单个变量名,也可以指定变量序号,负的变量序号从最后一个变量数起。
select()
<- select(flights, year, month, day)
new3 head(new3)
#> # A tibble: 6 × 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
# 冒号":"选择 year 和 arr_time 之间的所有列
%>%
flights select(year:arr_time) %>%
head()
#> # A tibble: 6 × 7
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> 6 2013 1 1 554 558 -4 740
# 也可以用数字序号表示范围
%>%
flights select(1:7) %>%
head()
#> # A tibble: 6 × 7
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> 6 2013 1 1 554 558 -4 740
# 负号或!表示扣除
<- select(flights, -(year:day))
new3 head(new3)
#> # A tibble: 6 × 16
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 517 515 2 830 819 11 UA
#> 2 533 529 4 850 830 20 UA
#> 3 542 540 2 923 850 33 AA
#> 4 544 545 -1 1004 1022 -18 B6
#> 5 554 600 -6 812 837 -25 DL
#> 6 554 558 -4 740 728 12 UA
#> # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>
可以在select()的参数中使用一些辅助函数,界定符合条件的变量
<- select(flights, starts_with("c"))
new3 head(new3)
#> # A tibble: 6 × 1
#> carrier
#> <chr>
#> 1 UA
#> 2 UA
#> 3 AA
#> 4 B6
#> 5 DL
#> 6 UA
# 如果要选择的变量名已经保存为一个字符型向量
# 可以用 one_of() 引用,直接引用向量名会报错
<- c("name", "sex")
vars %>%
d.class select(one_of(vars))
#> # A tibble: 19 × 2
#> name sex
#> <chr> <fct>
#> 1 Alice F
#> 2 Becka F
#> 3 Gail F
#> 4 Karen F
#> 5 Kathy F
#> 6 Mary F
#> 7 Sandy F
#> 8 Sharon F
#> 9 Tammy F
#> 10 Alfred M
#> 11 Duke M
#> 12 Guido M
#> 13 James M
#> 14 Jeffrey M
#> 15 John M
#> 16 Philip M
#> 17 Robert M
#> 18 Thomas M
#> 19 William M
select(..., everything())
将select()函数和everything()辅助函数结合起来使用。当想要将几个变量移到数据框开头时,这种用法非常奏效。
<- select(flights, time_hour, air_time, everything())
new3 head(new3)
#> # A tibble: 6 × 19
#> time_hour air_time year month day dep_time sched_dep_time
#> <dttm> <dbl> <int> <int> <int> <int> <int>
#> 1 2013-01-01 05:00:00 227 2013 1 1 517 515
#> 2 2013-01-01 05:00:00 227 2013 1 1 533 529
#> 3 2013-01-01 05:00:00 160 2013 1 1 542 540
#> 4 2013-01-01 05:00:00 183 2013 1 1 544 545
#> 5 2013-01-01 06:00:00 116 2013 1 1 554 600
#> 6 2013-01-01 05:00:00 150 2013 1 1 554 558
#> # … with 12 more variables: dep_delay <dbl>, arr_time <int>,
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, hour <dbl>,
#> # minute <dbl>
rename()
# 这个语法有点怪,前者是新列名,后者是旧列名
%>%
flights rename(tail_num = tailnum) %>%
head()
#> # A tibble: 6 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
注意这样改名字不是对原始数据框修改而是返回改了名字后的新数据框。
mutate()
以下都是向量化函数,一行一行地依次操作
mutate()
,新变量添加在最后一列,且创建的新变量可以立即使用在参数中transmute()
,
如果只想保留新变量和一部分旧变量,可以使用transmute()函数,未被提及的变量不会被保留。mutate_all()
, 计算新列覆盖同名旧列mutate_if()
, 按一定条件计算新列覆盖同名旧列add_column()
,合并列%>%
flights select(year:day, ends_with("delay"), distance, air_time) %>%
mutate(
gain = arr_delay - dep_delay,
hours = air_time / 60, speed = distance / hours
)
#> # A tibble: 336,776 × 10
#> year month day dep_delay arr_delay distance air_time gain hours speed
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2013 1 1 2 11 1400 227 9 3.78 370.
#> 2 2013 1 1 4 20 1416 227 16 3.78 374.
#> 3 2013 1 1 2 33 1089 160 31 2.67 408.
#> 4 2013 1 1 -1 -18 1576 183 -17 3.05 517.
#> 5 2013 1 1 -6 -25 762 116 -19 1.93 394.
#> 6 2013 1 1 -4 12 719 150 16 2.5 288.
#> 7 2013 1 1 -5 19 1065 158 24 2.63 404.
#> 8 2013 1 1 -3 -14 229 53 -11 0.883 259.
#> 9 2013 1 1 -3 -8 944 140 -5 2.33 405.
#> 10 2013 1 1 -2 8 733 138 10 2.3 319.
#> # … with 336,766 more rows
%>%
d.class mutate(
cheight = height - mean(height)
)
#> # A tibble: 19 × 6
#> name sex age height weight cheight
#> <chr> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 Alice F 13 56.5 84 -5.84
#> 2 Becka F 13 65.3 98 2.96
#> 3 Gail F 14 64.3 90 1.96
#> 4 Karen F 12 56.3 77 -6.04
#> 5 Kathy F 12 59.8 84.5 -2.54
#> 6 Mary F 15 66.5 112 4.16
#> 7 Sandy F 11 51.3 50.5 -11.0
#> 8 Sharon F 15 62.5 112. 0.163
#> 9 Tammy F 14 62.8 102. 0.463
#> 10 Alfred M 14 69 112. 6.66
#> 11 Duke M 14 63.5 102. 1.16
#> 12 Guido M 15 67 133 4.66
#> 13 James M 12 57.3 83 -5.04
#> 14 Jeffrey M 13 62.5 84 0.163
#> 15 John M 12 59 99.5 -3.34
#> 16 Philip M 16 72 150 9.66
#> 17 Robert M 12 64.8 128 2.46
#> 18 Thomas M 11 57.5 85 -4.84
#> 19 William M 15 66.5 112 4.16
%>% transmute(dep_time,
flights hour = dep_time %/% 60,
minute = dep_time %% 60
)
#> # A tibble: 336,776 × 3
#> dep_time hour minute
#> <int> <dbl> <dbl>
#> 1 517 8 37
#> 2 533 8 53
#> 3 542 9 2
#> 4 544 9 4
#> 5 554 9 14
#> 6 554 9 14
#> 7 555 9 15
#> 8 557 9 17
#> 9 557 9 17
#> 10 558 9 18
#> # … with 336,766 more rows
mutate_all(faithful, funs(log(.), log2(.)))
#> # A tibble: 272 × 6
#> eruptions waiting eruptions_log waiting_log eruptions_log2 waiting_log2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3.6 79 1.28 4.37 1.85 6.30
#> 2 1.8 54 0.588 3.99 0.848 5.75
#> 3 3.33 74 1.20 4.30 1.74 6.21
#> 4 2.28 62 0.825 4.13 1.19 5.95
#> 5 4.53 85 1.51 4.44 2.18 6.41
#> 6 2.88 55 1.06 4.01 1.53 5.78
#> 7 4.7 88 1.55 4.48 2.23 6.46
#> 8 3.6 85 1.28 4.44 1.85 6.41
#> 9 1.95 51 0.668 3.93 0.963 5.67
#> 10 4.35 85 1.47 4.44 2.12 6.41
#> # … with 262 more rows
<- as_tibble(iris)
iris mutate_if(iris, is.numeric, funs(log(.)))
#> # A tibble: 150 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1.63 1.25 0.336 -1.61 setosa
#> 2 1.59 1.10 0.336 -1.61 setosa
#> 3 1.55 1.16 0.262 -1.61 setosa
#> 4 1.53 1.13 0.405 -1.61 setosa
#> 5 1.61 1.28 0.336 -1.61 setosa
#> 6 1.69 1.36 0.531 -0.916 setosa
#> 7 1.53 1.22 0.336 -1.20 setosa
#> 8 1.61 1.22 0.405 -1.61 setosa
#> 9 1.48 1.06 0.336 -1.61 setosa
#> 10 1.59 1.13 0.405 -2.30 setosa
#> # … with 140 more rows
mutate_at(iris, vars(-Species), funs(log(.)))
#> # A tibble: 150 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1.63 1.25 0.336 -1.61 setosa
#> 2 1.59 1.10 0.336 -1.61 setosa
#> 3 1.55 1.16 0.262 -1.61 setosa
#> 4 1.53 1.13 0.405 -1.61 setosa
#> 5 1.61 1.28 0.336 -1.61 setosa
#> 6 1.69 1.36 0.531 -0.916 setosa
#> 7 1.53 1.22 0.336 -1.20 setosa
#> 8 1.61 1.22 0.405 -1.61 setosa
#> 9 1.48 1.06 0.336 -1.61 setosa
#> 10 1.59 1.13 0.405 -2.30 setosa
#> # … with 140 more rows
add_column(mtcars, new = 1:32)
#> mpg cyl disp hp drat wt qsec vs am gear carb new
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 9
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 17
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 18
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 20
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 21
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 22
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 23
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 24
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 25
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 26
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 27
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 29
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 30
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
rename(iris, Length = Sepal.Length)
#> # A tibble: 150 × 5
#> Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
这些函数配合 mutate()
和
transmutate()
,input一个列向量,output一个列向量
seq_along(along.with)
根据输入的列的长度,产生一列从1到N的顺序序号,很像 data.table
包中的1:.N
dplyr::lag()
后退偏移dplyr::lead()
前进偏移cumsum()
累计和,分组后计算1-12月累计值非常有用cumprod()
累计积cummin()/cummax
累计最小/大值dplyr::cummean()
累计平均值dplyr::cumall()
累计all(逻辑判断)dplyr::cumany()
累计any(逻辑判断)见 cheatsheet
见 cheatsheet
见 cheatsheet
<- tibble(
df name = c("Alice", "Alice", "Bob", "Bob", "Carol", "Carol"),
type = c("english", "math", "english", "math", "english", "math"),
score = c(60.2, 90.5, 92.2, 98.8, 82.5, 74.6)
)
# if_else( , , ),用 ifelse( ? : )会报错
# 因为 ifelse() 不是向量化运算函数
%>% mutate(
df assess = if_else(score > 85, "very_good", "good")
)
#> # A tibble: 6 × 4
#> name type score assess
#> <chr> <chr> <dbl> <chr>
#> 1 Alice english 60.2 good
#> 2 Alice math 90.5 very_good
#> 3 Bob english 92.2 very_good
#> 4 Bob math 98.8 very_good
#> 5 Carol english 82.5 good
#> 6 Carol math 74.6 good
%>% mutate(
df assess = case_when(
< 70 ~ "general",
score >= 70 & score < 80 ~ "good",
score >= 80 & score < 90 ~ "very_good",
score >= 90 ~ "best",
score TRUE ~ "other"
) )
#> # A tibble: 6 × 4
#> name type score assess
#> <chr> <chr> <dbl> <chr>
#> 1 Alice english 60.2 general
#> 2 Alice math 90.5 best
#> 3 Bob english 92.2 best
#> 4 Bob math 98.8 best
#> 5 Carol english 82.5 very_good
#> 6 Carol math 74.6 good
group_by()
创建一个分好组的原数据的copy,等待进一步指令,然后分组执行。
ungroup()
逆操作,生成一个数据的 ungrouped copy
%>%
mtcars group_by(cyl) %>%
summarise(avg = mean(mpg))
#> # A tibble: 3 × 2
#> cyl avg
#> <dbl> <dbl>
#> 1 4 26.7
#> 2 6 19.7
#> 3 8 15.1
<- group_by(iris, Species)
g_iris ungroup(g_iris)
#> # A tibble: 150 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
虽然group_by()与summarize()结合起来使用是最有效的,但分组也可以与filter()和mutate()结合,以完成非常便捷的操作:如
# 每天到达延误时间最长的航班
%>%
flights select(year:day, ends_with("delay"), distance, air_time) %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
#> # A tibble: 3,306 × 7
#> # Groups: year, month, day [365]
#> year month day dep_delay arr_delay distance air_time
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2013 1 1 853 851 184 41
#> 2 2013 1 1 290 338 1134 213
#> 3 2013 1 1 260 263 266 46
#> 4 2013 1 1 157 174 213 60
#> 5 2013 1 1 216 222 708 121
#> 6 2013 1 1 255 250 589 115
#> 7 2013 1 1 285 246 1085 146
#> 8 2013 1 1 192 191 199 44
#> 9 2013 1 1 379 456 1092 222
#> 10 2013 1 2 224 207 550 94
#> # … with 3,296 more rows
# 大于365个航班的航线
<- flights %>%
popular_dests group_by(dest) %>%
filter(n() > 365)
popular_dests
#> # A tibble: 332,577 × 19
#> # Groups: dest [77]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # … with 332,567 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# 对数据进行标准化以计算分组指标
%>%
popular_dests filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
#> # A tibble: 131,106 × 6
#> # Groups: dest [77]
#> year month day dest arr_delay prop_delay
#> <int> <int> <int> <chr> <dbl> <dbl>
#> 1 2013 1 1 IAH 11 0.000111
#> 2 2013 1 1 IAH 20 0.000201
#> 3 2013 1 1 MIA 33 0.000235
#> 4 2013 1 1 ORD 12 0.0000424
#> 5 2013 1 1 FLL 19 0.0000938
#> 6 2013 1 1 ORD 8 0.0000283
#> 7 2013 1 1 LAX 7 0.0000344
#> 8 2013 1 1 DFW 31 0.000282
#> 9 2013 1 1 ATL 12 0.0000400
#> 10 2013 1 1 DTW 16 0.000116
#> # … with 131,096 more rows
当使用多个变量进行分组时,每次的摘要统计会用掉一个分组变量。这样就可以轻松地对数据集进行循序渐进的分析。
<- group_by(flights, year, month, day)
daily <- summarize(daily, flights = n())) (per_day
#> # A tibble: 365 × 4
#> # Groups: year, month [12]
#> year month day flights
#> <int> <int> <int> <int>
#> 1 2013 1 1 842
#> 2 2013 1 2 943
#> 3 2013 1 3 914
#> 4 2013 1 4 915
#> 5 2013 1 5 720
#> 6 2013 1 6 832
#> 7 2013 1 7 933
#> 8 2013 1 8 899
#> 9 2013 1 9 902
#> 10 2013 1 10 932
#> # … with 355 more rows
<- summarize(per_day, flights = sum(flights))) (per_month
#> # A tibble: 12 × 3
#> # Groups: year [1]
#> year month flights
#> <int> <int> <int>
#> 1 2013 1 27004
#> 2 2013 2 24951
#> 3 2013 3 28834
#> 4 2013 4 28330
#> 5 2013 5 28796
#> 6 2013 6 28243
#> 7 2013 7 29425
#> 8 2013 8 29327
#> 9 2013 9 27574
#> 10 2013 10 28889
#> 11 2013 11 27268
#> 12 2013 12 28135
<- summarize(per_month, flights = sum(flights))) (per_year
#> # A tibble: 1 × 2
#> year flights
#> <int> <int>
#> 1 2013 336776
取消分组,并回到未分组的数据继续操作,可以使用 ungroup()
%>%
daily ungroup() %>% # 取消分组
summarize(flights = n()) # 全年所有航班
#> # A tibble: 1 × 1
#> flights
#> <int>
#> 1 336776
summarise()
统计%>% summarise(avg = mean(mpg)) mtcars
#> avg
#> 1 20.09062
count()
按该列值的 level 分组计数<- tibble(
df name = c("Alice", "Alice", "Bob", "Bob", "Carol", "Carol"),
type = c("english", "math", "english", "math", "english", "math"),
score = c(60.2, 90.5, 92.2, 98.8, 82.5, 74.6)
)
# 加权 count
%>% count(name,
df sort = TRUE,
wt = score,
name = "total_score"
)
#> # A tibble: 3 × 2
#> name total_score
#> <chr> <dbl>
#> 1 Bob 191
#> 2 Carol 157.
#> 3 Alice 151.
# 等价于
%>%
df group_by(name) %>%
summarise(
n = n(),
total_score = sum(score, na.rm = TRUE)
%>%
) arrange(desc(total_score))
#> # A tibble: 3 × 3
#> name n total_score
#> <chr> <int> <dbl>
#> 1 Bob 2 191
#> 2 Carol 2 157.
#> 3 Alice 2 151.
add_count()
增加一个 count 列增加一列,代表每人参加的考试次数
%>% add_count(name) df
#> # A tibble: 6 × 4
#> name type score n
#> <chr> <chr> <dbl> <int>
#> 1 Alice english 60.2 2
#> 2 Alice math 90.5 2
#> 3 Bob english 92.2 2
#> 4 Bob math 98.8 2
#> 5 Carol english 82.5 2
#> 6 Carol math 74.6 2
# 等价于
%>%
df group_by(name) %>%
mutate(n = n()) %>%
ungroup()
#> # A tibble: 6 × 4
#> name type score n
#> <chr> <chr> <dbl> <int>
#> 1 Alice english 60.2 2
#> 2 Alice math 90.5 2
#> 3 Bob english 92.2 2
#> 4 Bob math 98.8 2
#> 5 Carol english 82.5 2
#> 6 Carol math 74.6 2
配合在 summarise()
内部使用,占据第二个参数位置。input一个向量,output一个值。
sum()
, mean()
, cumsum()
n()
, n_distinct()
,
sum(!is.na())
first()
, last()
,
nth()
min()
, max()
,
quantile(x, 0.25)
sd()
、var()
, mad()
,
IQR()
例1 观察航班数据的若干统计量
# by_date是flights按年月日分组后的新数据
<- group_by(flights, year, month, day)
by_date # 由此得到每一天的平均延误时间。
summarize(by_date, delay = mean(dep_delay, na.rm = T))
#> # A tibble: 365 × 4
#> # Groups: year, month [12]
#> year month day delay
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.5
#> 2 2013 1 2 13.9
#> 3 2013 1 3 11.0
#> 4 2013 1 4 8.95
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> 7 2013 1 7 5.42
#> 8 2013 1 8 2.55
#> 9 2013 1 9 2.28
#> 10 2013 1 10 2.84
#> # … with 355 more rows
# 也可以用管道操作写成
%>%
flights group_by(year, month, day) %>%
summarize(delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 365 × 4
#> # Groups: year, month [12]
#> year month day delay
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.5
#> 2 2013 1 2 13.9
#> 3 2013 1 3 11.0
#> 4 2013 1 4 8.95
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> 7 2013 1 7 5.42
#> 8 2013 1 8 2.55
#> 9 2013 1 9 2.28
#> 10 2013 1 10 2.84
#> # … with 355 more rows
# 也可以预先去掉存在NA(航班取消)的行,写成
<- flights %>%
not_cancelled filter(!is.na(dep_delay), !is.na(arr_delay))
%>%
not_cancelled group_by(year, month, day) %>%
summarize(delay = mean(dep_delay))
#> # A tibble: 365 × 4
#> # Groups: year, month [12]
#> year month day delay
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.4
#> 2 2013 1 2 13.7
#> 3 2013 1 3 10.9
#> 4 2013 1 4 8.97
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> 7 2013 1 7 5.42
#> 8 2013 1 8 2.56
#> 9 2013 1 9 2.30
#> 10 2013 1 10 2.84
#> # … with 355 more rows
# 观察平均正延误时间
%>%
not_cancelled group_by(year, month, day) %>%
summarize(delay_pos = mean(dep_delay[dep_delay > 0]))
#> # A tibble: 365 × 4
#> # Groups: year, month [12]
#> year month day delay_pos
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 32.6
#> 2 2013 1 2 32.4
#> 3 2013 1 3 28.3
#> 4 2013 1 4 26.2
#> 5 2013 1 5 21.8
#> 6 2013 1 6 22.1
#> 7 2013 1 7 26.2
#> 8 2013 1 8 21.3
#> 9 2013 1 9 26.5
#> 10 2013 1 10 32.6
#> # … with 355 more rows
例2 按目的地分组,按目的地分组,研究平均飞行距离与平均延误时间之间的关系
# 根据目的地分组
<- group_by(flights, dest)
by_dest
# 用mean()统计平均飞行距离和平均延误时间
# 用n()统计航班数,sum(!is_na())非缺失值记数
<- summarize(by_dest,
sum_data count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
# 筛选出航班数大于20(去除小样本)且目的地不为'HNL'的行
# HNL这个机场在夏威夷,距离太远属于异常值
<- filter(sum_data, count > 29, dest != "HNL")
delay
# 作图
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1 / 3) +
geom_smooth(se = FALSE)
# 管道操作
# 这样写代码很清晰,每一行是一步操作
%>%
flights group_by(dest) %>% # 分组
summarize(
count = n(), dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
%>% # 统计
) filter(count > 29, dest != "HNL") %>% # 筛选观测
ggplot(mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1 / 3) +
geom_smooth(se = FALSE) # 作图
可见,750英里内,平均延误时间会随着距离的增加而增加,接着会随着距离的增加而减少。或许,随着飞行距离的增加,延误时间有可能会在飞行中弥补回来。
例3 查看具有最长平均延误时间的航班
# 通过尾号(tailnum)对航班进行分组识别
%>%
not_cancelled group_by(tailnum) %>%
summarize(delay = mean(arr_delay)) %>%
ggplot(mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
有些航班的平均延误时间长达300分钟,这是怎么回事呢?
# 作航班架次数量和平均延误时间的散点图
%>%
not_cancelled group_by(tailnum) %>%
summarize(delay = mean(arr_delay), n = n()) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 0.1)
由图可见,当航班数量非常少时,平均延误时间的变动特别大。
注:这张图的形状非常能够说明问题:当绘制均值(或其他摘要统计量)和分组规模的关系时,总能看到随着样本量的增加,变动在不断减小。因此,查看此类图形时,通常应该筛选掉那些观测数量非常少的分组,这样就可以避免受到特别小的分组中的极端变动的影响,进而更好地发现数据模式。
# 作去掉小样本后,航班架次数量和平均延误时间的散点图
%>%
not_cancelled group_by(tailnum) %>%
summarize(delay = mean(arr_delay), n = n()) %>%
filter(n >= 30) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "lm", se = FALSE)
这时我们发现,平均延误时间大多在20分钟以下;飞行架次在30以上的航班,平均延误时间最长的也只有60分钟。
例4 棒球击球手的安打率与击球次数之间的关系(能力与出场机会的关系)
# 棒球比赛数据,转换成tibble使输出更美观
<- as_tibble(Lahman::Batting)
batting
# 根据球员ID分组统计,总击打数ab和安打率ba(安打数/击打数)
# 分组统计后包含playerID, ab, ba变量的新数据框为batters
<- batting %>%
batters group_by(playerID) %>%
summarize(
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
ab = sum(AB, na.rm = TRUE)
)
# 按安打率排名,发现位次最靠前的是一些总击打数几乎为零、
# 靠好运气获得极高安打率的球员——而这并不表明能力强。
%>% arrange(desc(ba)) batters
#> # A tibble: 19,898 × 3
#> playerID ba ab
#> <chr> <dbl> <int>
#> 1 abramge01 1 1
#> 2 alanirj01 1 1
#> 3 alberan01 1 1
#> 4 banisje01 1 1
#> 5 bartocl01 1 1
#> 6 bassdo01 1 1
#> 7 birasst01 1 2
#> 8 bruneju01 1 1
#> 9 burnscb01 1 1
#> 10 cammaer01 1 1
#> # … with 19,888 more rows
%>% ggplot(mapping = aes(x = ab, y = ba)) +
batters geom_point(alpha = 0.1)
# 忽略小样本,绘制安打率~击打数散点图,发现安打率基本小于35%
# 正相关关系,说明安打率较高的球员击打数较多(出场机会多)
# 由图可见,拥有20%以上的安打率,才有机会成为球队主力
%>%
batters filter(ab > 100) %>%
ggplot(mapping = aes(x = ab, y = ba)) +
geom_point(alpha = 0.1) +
geom_smooth(se = FALSE)
# 从N.Y.到其他机场飞行距离的标准差
%>%
not_cancelled group_by(dest) %>%
summarize(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
#> # A tibble: 104 × 2
#> dest distance_sd
#> <chr> <dbl>
#> 1 EGE 10.5
#> 2 SAN 10.4
#> 3 SFO 10.2
#> 4 HNL 10.0
#> 5 SEA 9.98
#> 6 LAS 9.91
#> 7 PDX 9.87
#> 8 PHX 9.86
#> 9 LAX 9.66
#> 10 IND 9.46
#> # … with 94 more rows
# 每天最早和最晚航班的出发时间
%>%
not_cancelled group_by(year, month, day) %>%
summarize(first = min(dep_time), last = max(dep_time))
#> # A tibble: 365 × 5
#> # Groups: year, month [12]
#> year month day first last
#> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 2356
#> 2 2013 1 2 42 2354
#> 3 2013 1 3 32 2349
#> 4 2013 1 4 25 2358
#> 5 2013 1 5 14 2357
#> 6 2013 1 6 16 2355
#> 7 2013 1 7 49 2359
#> 8 2013 1 8 454 2351
#> 9 2013 1 9 2 2252
#> 10 2013 1 10 3 2320
#> # … with 355 more rows
# 找出每天最早和最晚出发的航班
# min_rank(x)可以返回一个序列各元素的升序位置向量
# min_rank(desc(x))返回降序位置向量
# 对出发时间排序后,用range()取出最大值和最小值
# 再用filter()取出每天出发最早和最晚的航班
%>%
not_cancelled group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
#> # A tibble: 770 × 20
#> # Groups: year, month, day [365]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 2356 2359 -3 425 437
#> 3 2013 1 2 42 2359 43 518 442
#> 4 2013 1 2 2354 2359 -5 413 437
#> 5 2013 1 3 32 2359 33 504 442
#> 6 2013 1 3 2349 2359 -10 434 445
#> 7 2013 1 4 25 2359 26 505 442
#> 8 2013 1 4 2358 2359 -1 429 437
#> 9 2013 1 4 2358 2359 -1 436 445
#> 10 2013 1 5 14 2359 15 503 445
#> # … with 760 more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, r <int>
# 纽约至哪个目的地的航线有最多的航空公司运营
# n_distinct()函数统计唯一值的数量
%>%
not_cancelled group_by(dest) %>%
summarize(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
#> # A tibble: 104 × 2
#> dest carriers
#> <chr> <int>
#> 1 ATL 7
#> 2 BOS 7
#> 3 CLT 7
#> 4 ORD 7
#> 5 TPA 7
#> 6 AUS 6
#> 7 DCA 6
#> 8 DTW 6
#> 9 IAD 6
#> 10 MSP 6
#> # … with 94 more rows
# 纽约出发的每条航线有多少个航班
%>% count(dest) not_cancelled
#> # A tibble: 104 × 2
#> dest n
#> <chr> <int>
#> 1 ABQ 254
#> 2 ACK 264
#> 3 ALB 418
#> 4 ANC 8
#> 5 ATL 16837
#> 6 AUS 2411
#> 7 AVL 261
#> 8 BDL 412
#> 9 BGR 358
#> 10 BHM 269
#> # … with 94 more rows
# 加权记数,计算每个航班的总飞行距离
%>% count(tailnum, wt = distance) not_cancelled
#> # A tibble: 4,037 × 2
#> tailnum n
#> <chr> <dbl>
#> 1 D942DN 3418
#> 2 N0EGMQ 239143
#> 3 N10156 109664
#> 4 N102UW 25722
#> 5 N103US 24619
#> 6 N104UW 24616
#> 7 N10575 139903
#> 8 N105UW 23618
#> 9 N107US 21677
#> 10 N108UW 32070
#> # … with 4,027 more rows
# 若x为逻辑向量,sum(x)可以找出x中TRUE的数量,mean(x)可以找出x中TRUE的比例。
# 每天有多少架航班是早上5点前出发的
%>%
not_cancelled group_by(year, month, day) %>%
summarize(n_early = sum(dep_time < 500))
#> # A tibble: 365 × 4
#> # Groups: year, month [12]
#> year month day n_early
#> <int> <int> <int> <int>
#> 1 2013 1 1 0
#> 2 2013 1 2 3
#> 3 2013 1 3 4
#> 4 2013 1 4 3
#> 5 2013 1 5 3
#> 6 2013 1 6 2
#> 7 2013 1 7 2
#> 8 2013 1 8 1
#> 9 2013 1 9 3
#> 10 2013 1 10 3
#> # … with 355 more rows
# 延误超过1小时的航班比例是多少
%>%
not_cancelled group_by(year, month, day) %>%
summarize(hour_perc = mean(arr_delay > 60))
#> # A tibble: 365 × 4
#> # Groups: year, month [12]
#> year month day hour_perc
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 0.0722
#> 2 2013 1 2 0.0851
#> 3 2013 1 3 0.0567
#> 4 2013 1 4 0.0396
#> 5 2013 1 5 0.0349
#> 6 2013 1 6 0.0470
#> 7 2013 1 7 0.0333
#> 8 2013 1 8 0.0213
#> 9 2013 1 9 0.0202
#> 10 2013 1 10 0.0183
#> # … with 355 more rows