tidyfst包实例分析

我的R语言小伙伴最近分享了自己使用R来做工业级数据清洗的经验,最近我自己在不断测试我的新包tidyfst,因此就拿这个data.table的案例来尝试一下。

测试数据构造

本次测试,将不会加载data.table包,但是其实tidyfst里面无处不是data.table的元素,而且也导出了很多内置的data.table函数,比如as.data.table和data.table。所以这些代码在tidyfst中就可以自如地使用。

library(tidyfst)
diamonds <- ggplot2::diamonds
n = 1e5  #如果想做工业级测试,可以继续增加数量
set.seed(2020)
dtranges <- seq.Date(from = as.Date("2011-01-01"),
                     to = as.Date("2020-01-01"),
                     by = 1)
n1 <- sample(nrow(diamonds), n, replace = TRUE)
dat1 <- as.data.table(diamonds[n1, ])
dat1[, "dt"] <- sample(dtranges, n, replace = TRUE)  # 增加dt列
n2 <- sample(nrow(dat1), nrow(dat1)/1000)
dat1[n2, "price"] <- NA # price列构造千分之一缺失值
dat2 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
                   price1 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))

dat3 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
                   price2 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))

print(dat1)
#>         carat       cut color clarity depth table price     x     y     z
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num> <num>
#>      1:  2.23   Premium     J     VS2  61.0    58 14867  8.39  8.36  5.11
#>      2:  0.46     Ideal     F     SI2  61.5    54   758  4.98  5.01  3.07
#>      3:  0.70     Ideal     H    VVS1  60.9    57  3611  5.70  5.78  3.50
#>      4:  1.50   Premium     F     VS2  61.5    58 14719  7.34  7.32  4.51
#>      5:  0.40 Very Good     G      IF  61.0    59  1154  4.75  4.79  2.91
#>     ---                                                                  
#>  99996:  1.14     Ideal     F     VS2  61.4    56  8017  6.73  6.78  4.14
#>  99997:  0.39     Ideal     F     SI1  61.8    55   886  4.70  4.72  2.91
#>  99998:  2.00      Good     F     SI1  63.0    64 17869  7.67  7.76  4.86
#>  99999:  1.00   Premium     H     VS2  59.7    59  5139  6.57  6.52  3.91
#> 100000:  0.30     Ideal     G    VVS2  60.6    57   878  4.35  4.33  2.63
#>                 dt
#>             <Date>
#>      1: 2016-09-29
#>      2: 2011-10-22
#>      3: 2013-03-06
#>      4: 2014-06-11
#>      5: 2015-02-23
#>     ---           
#>  99996: 2013-11-22
#>  99997: 2014-11-10
#>  99998: 2019-02-21
#>  99999: 2017-01-15
#> 100000: 2019-05-19

基础

小技巧

后面的分析,经常要根据日期进行计算。所以,先对日期进行排序,就能够提高运行速度。在tidyfst中,可以使用arrange_dt函数来对数据进行原位的各种操作,其中就包括排序。

dat1 = arrange_dt(dat1,dt)
dat1
#>         carat       cut color clarity depth table price     x     y     z
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num> <num>
#>      1:  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44  2.76
#>      2:  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23  2.70
#>      3:  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66  4.10
#>      4:  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44  3.83
#>      5:  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03  3.28
#>     ---                                                                  
#>  99996:  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42  2.73
#>  99997:  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39  2.75
#>  99998:  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19  2.58
#>  99999:  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30  2.66
#> 100000:  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40  2.74
#>                 dt
#>             <Date>
#>      1: 2011-01-01
#>      2: 2011-01-01
#>      3: 2011-01-01
#>      4: 2011-01-01
#>      5: 2011-01-01
#>     ---           
#>  99996: 2020-01-01
#>  99997: 2020-01-01
#>  99998: 2020-01-01
#>  99999: 2020-01-01
#> 100000: 2020-01-01

那么,现在dat1的数据就按照日期排好序了。

聚合

1.求每种切割类型、每种颜色钻石的平均价格、中位数价格与最高价格

在tidyfst中,我设置了一个sys_time_print函数,可以方便地输出system.time()函数返回的结果。

sys_time_print({
  r1_1 <- dat1 %>% 
    summarise_dt(
      by = .(cut,color),
      mean_price = mean(price, na.rm = TRUE),
      median_price = median(price, na.rm = TRUE),
      max_price = max(price, na.rm = TRUE)
    )
})
#> [1] "Finished in 0.000s elapsed (0.000s cpu)"
r1_1
#>           cut color mean_price median_price max_price
#>         <ord> <ord>      <num>        <num>     <int>
#>  1: Very Good     H   4390.470       3394.0     18803
#>  2:      Good     G   3999.886       3303.0     18788
#>  3:     Ideal     F   3309.061       1751.0     18780
#>  4:   Premium     E   3481.628       1881.0     18426
#>  5:      Fair     D   4237.899       3205.0     16386
#>  6:   Premium     J   6416.235       5110.5     18706
#>  7:   Premium     F   4347.158       2874.0     18791
#>  8:     Ideal     D   2608.030       1550.5     18693
#>  9: Very Good     G   3900.558       2403.0     18818
#> 10: Very Good     E   3301.873       2051.0     18731
#> 11:   Premium     G   4543.181       2780.0     18741
#> 12:     Ideal     G   3751.035       1881.0     18806
#> 13:     Ideal     E   2558.501       1429.0     18729
#> 14:   Premium     H   5215.019       4451.0     18795
#> 15:   Premium     I   5921.498       4545.5     18823
#> 16: Very Good     F   3792.863       2473.0     18777
#> 17:      Good     D   3369.352       2661.0     18468
#> 18:      Fair     G   4186.403       2797.0     18574
#> 19:      Good     E   3293.289       2407.0     18236
#> 20:   Premium     D   3796.609       2348.0     18286
#> 21:     Ideal     H   3924.682       2320.0     18659
#> 22: Very Good     I   5475.587       4013.0     18500
#> 23:     Ideal     I   4434.030       2631.0     18779
#> 24:      Fair     J   5029.502       3422.0     18531
#> 25: Very Good     D   3437.133       2242.0     18526
#> 26:      Good     F   3518.548       2755.0     18686
#> 27:      Good     I   4966.745       3484.0     18707
#> 28:      Good     H   4098.280       3323.0     18640
#> 29:      Fair     H   5454.118       4134.5     18308
#> 30:      Fair     F   3790.703       2961.0     17995
#> 31:     Ideal     J   4972.801       4135.0     18508
#> 32: Very Good     J   5107.638       4126.5     18430
#> 33:      Fair     I   4278.831       3022.0     18242
#> 34:      Good     J   4578.490       3690.0     18325
#> 35:      Fair     E   3834.358       2996.5     15584
#>           cut color mean_price median_price max_price

tidyfst是永远不可能比data.table快的,但是如果你觉得上面的代码更容易掌握、更容易读懂,而在日常工作中多花零点几秒的运行时间没有太大问题(实际上节省了大家的交流时间,甚至就是节省将来自己再次读懂自己代码的时间),tidyfst就值得拥有。

2.求每天最高出售价格对应的那笔订单

sys_time_print({
  r1_2 <- dat1 %>% 
    arrange_dt(dt,-price) %>% 
    drop_na_dt(price) %>% 
    group_dt(
      by = dt,
      head(1)
    )
})
#> [1] "Finished in 0.050s elapsed (0.070s cpu)"
r1_2
#>               dt carat       cut color clarity depth table price     x     y
#>           <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>    1: 2011-01-01  2.10   Premium     J     VS2  58.3    54 12401  8.49  8.45
#>    2: 2011-01-02  2.02   Premium     I     VS2  59.0    59 17893  8.34  8.27
#>    3: 2011-01-03  1.61     Ideal     F     VS1  62.0    55 17414  7.55  7.52
#>    4: 2011-01-04  2.19   Premium     E     SI2  62.4    61 18232  8.31  8.23
#>    5: 2011-01-05  2.05   Premium     H     VS2  60.7    61 16235  8.25  8.13
#>   ---                                                                       
#> 3284: 2019-12-28  2.22     Ideal     J     SI2  62.8    57 13703  8.33  8.29
#> 3285: 2019-12-29  2.10 Very Good     J     VS2  61.3    59 14229  8.21  8.27
#> 3286: 2019-12-30  1.55 Very Good     H     VS2  63.3    56 10546  7.38  7.32
#> 3287: 2019-12-31  1.51 Very Good     F    VVS2  63.1    56 17317  7.32  7.26
#> 3288: 2020-01-01  2.04   Premium     J     VS2  63.0    58 13734  8.02  7.99
#>           z
#>       <num>
#>    1:  4.94
#>    2:  4.90
#>    3:  4.67
#>    4:  5.16
#>    5:  4.97
#>   ---      
#> 3284:  5.22
#> 3285:  5.05
#> 3286:  4.65
#> 3287:  4.60
#> 3288:  5.04

join

1.dat1与dat2以dt列左连接

实质上,merge函数已经优化得很好。tidyfst设计*_join系列函数的时候,只是为了一种不一样的语法结构来帮助实现不同的连接,因为它确实更加直观一些。但是实质上它还是merge.data.table函数的包装版本。

sys_time_print({
  r2_1 <- dat1 %>% 
    left_join_dt(dat2,by = "dt")
})
#> [1] "Finished in 0.030s elapsed (0.020s cpu)"
r2_1
#> Key: <dt>
#>                 dt carat       cut color clarity depth table price     x     y
#>             <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1: 2011-01-01  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44
#>      2: 2011-01-01  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23
#>      3: 2011-01-01  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66
#>      4: 2011-01-01  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44
#>      5: 2011-01-01  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03
#>     ---                                                                       
#>  99996: 2020-01-01  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42
#>  99997: 2020-01-01  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39
#>  99998: 2020-01-01  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19
#>  99999: 2020-01-01  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30
#> 100000: 2020-01-01  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40
#>             z price1
#>         <num>  <int>
#>      1:  2.76     NA
#>      2:  2.70     NA
#>      3:  4.10     NA
#>      4:  3.83     NA
#>      5:  3.28     NA
#>     ---             
#>  99996:  2.73     NA
#>  99997:  2.75     NA
#>  99998:  2.58     NA
#>  99999:  2.66     NA
#> 100000:  2.74     NA

2.多重join

sys_time_print({
  mymerge <- function(x, y) left_join_dt(x, y, by = "dt")
  r2_2 <- Reduce(mymerge, list(dat1, dat2, dat3))
})
#> [1] "Finished in 0.030s elapsed (0.030s cpu)"
r2_2
#> Key: <dt>
#>                 dt carat       cut color clarity depth table price     x     y
#>             <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1: 2011-01-01  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44
#>      2: 2011-01-01  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23
#>      3: 2011-01-01  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66
#>      4: 2011-01-01  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44
#>      5: 2011-01-01  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03
#>     ---                                                                       
#>  99996: 2020-01-01  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42
#>  99997: 2020-01-01  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39
#>  99998: 2020-01-01  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19
#>  99999: 2020-01-01  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30
#> 100000: 2020-01-01  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40
#>             z price1 price2
#>         <num>  <int>  <int>
#>      1:  2.76     NA     NA
#>      2:  2.70     NA     NA
#>      3:  4.10     NA     NA
#>      4:  3.83     NA     NA
#>      5:  3.28     NA     NA
#>     ---                    
#>  99996:  2.73     NA     NA
#>  99997:  2.75     NA     NA
#>  99998:  2.58     NA     NA
#>  99999:  2.66     NA     NA
#> 100000:  2.74     NA     NA

长宽表转换

1.长表转宽表

sys_time_print({
  mean1 <- function(x) mean(x, na.rm = TRUE)
  max1 <- function(x) max(x, na.rm = TRUE)
  r3_1 <-dat1 %>% 
    wider_dt(cut,
             value = c("depth", "price"),
             name = "color",
             fun = list(mean1,max1))
})
#> [1] "Finished in 0.040s elapsed (0.030s cpu)"
r3_1
#> Key: <cut>
#>          cut depth_mean1_D depth_mean1_E depth_mean1_F depth_mean1_G
#>        <ord>         <num>         <num>         <num>         <num>
#> 1:      Fair      64.15688      63.09882      63.41129      64.33663
#> 2:      Good      62.32176      62.16065      62.18220      62.60212
#> 3: Very Good      61.70454      61.73751      61.73138      61.83043
#> 4:   Premium      61.18875      61.16850      61.25098      61.27894
#> 5:     Ideal      61.67241      61.68693      61.65616      61.70585
#>    depth_mean1_H depth_mean1_I depth_mean1_J price_mean1_D price_mean1_E
#>            <num>         <num>         <num>         <num>         <num>
#> 1:      64.44778      64.24571      64.75682      4237.899      3834.358
#> 2:      62.50826      62.56427      62.44288      3369.352      3293.289
#> 3:      61.97838      61.90052      61.94553      3437.133      3301.873
#> 4:      61.31882      61.34123      61.42092      3796.609      3481.628
#> 5:      61.72374      61.79243      61.80705      2608.030      2558.501
#>    price_mean1_F price_mean1_G price_mean1_H price_mean1_I price_mean1_J
#>            <num>         <num>         <num>         <num>         <num>
#> 1:      3790.703      4186.403      5454.118      4278.831      5029.502
#> 2:      3518.548      3999.886      4098.280      4966.745      4578.490
#> 3:      3792.863      3900.558      4390.470      5475.587      5107.638
#> 4:      4347.158      4543.181      5215.019      5921.498      6416.235
#> 5:      3309.061      3751.035      3924.682      4434.030      4972.801
#>    depth_max1_D depth_max1_E depth_max1_F depth_max1_G depth_max1_H
#>           <num>        <num>        <num>        <num>        <num>
#> 1:         71.6         79.0         70.8         72.9         71.8
#> 2:         67.0         65.9         66.5         65.8         66.1
#> 3:         64.7         64.4         64.2         64.9         64.7
#> 4:         63.0         63.0         63.0         63.0         63.0
#> 5:         64.5         65.5         65.3         64.1         65.1
#>    depth_max1_I depth_max1_J price_max1_D price_max1_E price_max1_F
#>           <num>        <num>        <int>        <int>        <int>
#> 1:         71.3         73.6        16386        15584        17995
#> 2:         65.9         66.0        18468        18236        18686
#> 3:         64.4         64.1        18526        18731        18777
#> 4:         63.0         63.0        18286        18426        18791
#> 5:         66.7         65.4        18693        18729        18780
#>    price_max1_G price_max1_H price_max1_I price_max1_J
#>           <int>        <int>        <int>        <int>
#> 1:        18574        18308        18242        18531
#> 2:        18788        18640        18707        18325
#> 3:        18818        18803        18500        18430
#> 4:        18741        18795        18823        18706
#> 5:        18806        18659        18779        18508

2.宽表转长表

sys_time_print({
  r3_2 <-dat1 %>% 
    select_dt(cut,color,x,y,z) %>% 
    longer_dt(cut,color,
              name = "xyz",
              value = "xyzvalue")
})
#> [1] "Finished in 0.110s elapsed (0.100s cpu)"

r3_2 
#>               cut color    xyz xyzvalue
#>             <ord> <ord> <fctr>    <num>
#>      1: Very Good     H      x     4.41
#>      2:      Good     G      x     4.26
#>      3:     Ideal     F      x     6.69
#>      4:   Premium     E      x     6.52
#>      5:      Fair     D      x     4.92
#>     ---                                
#> 299996:     Ideal     J      z     2.73
#> 299997:     Ideal     G      z     2.75
#> 299998: Very Good     I      z     2.58
#> 299999:     Ideal     G      z     2.66
#> 300000:     Ideal     F      z     2.74

高阶

向上/下填充空值

对于填充空值来说,可以这样操作:

sys_time_print({
  dat1 %>% fill_na_dt(price) -> dat1
})
#> [1] "Finished in 0.000s elapsed (0.000s cpu)"
dat1
#>         carat       cut color clarity depth table price     x     y     z
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num> <num>
#>      1:  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44  2.76
#>      2:  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23  2.70
#>      3:  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66  4.10
#>      4:  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44  3.83
#>      5:  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03  3.28
#>     ---                                                                  
#>  99996:  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42  2.73
#>  99997:  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39  2.75
#>  99998:  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19  2.58
#>  99999:  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30  2.66
#> 100000:  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40  2.74
#>                 dt
#>             <Date>
#>      1: 2011-01-01
#>      2: 2011-01-01
#>      3: 2011-01-01
#>      4: 2011-01-01
#>      5: 2011-01-01
#>     ---           
#>  99996: 2020-01-01
#>  99997: 2020-01-01
#>  99998: 2020-01-01
#>  99999: 2020-01-01
#> 100000: 2020-01-01

添加子维度聚合结果为新列

1.以dat1为例,添加两列,一列为以cut、color聚合求price的均值,另一列是求标准差


sys_time_print({
  mutate_dt(dat1,
           mean_price = mean(price, na.rm = TRUE),
           sd_price = sd(price, na.rm = TRUE),
           by = .(cut, color))
})
#> [1] "Finished in 0.030s elapsed (0.030s cpu)"

dat1
#>         carat       cut color clarity depth table price     x     y     z
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num> <num>
#>      1:  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44  2.76
#>      2:  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23  2.70
#>      3:  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66  4.10
#>      4:  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44  3.83
#>      5:  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03  3.28
#>     ---                                                                  
#>  99996:  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42  2.73
#>  99997:  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39  2.75
#>  99998:  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19  2.58
#>  99999:  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30  2.66
#> 100000:  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40  2.74
#>                 dt
#>             <Date>
#>      1: 2011-01-01
#>      2: 2011-01-01
#>      3: 2011-01-01
#>      4: 2011-01-01
#>      5: 2011-01-01
#>     ---           
#>  99996: 2020-01-01
#>  99997: 2020-01-01
#>  99998: 2020-01-01
#>  99999: 2020-01-01
#> 100000: 2020-01-01

2.以dat1为例,以dt分组添加一列序号id


sys_time_print({
  dat1 %>% 
  group_dt(
    by = dt,
    mutate_dt(id = seq(.N))
  ) -> dat1
})
#> [1] "Finished in 3.970s elapsed (3.640s cpu)"
dat1
#>                 dt carat       cut color clarity depth table price     x     y
#>             <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1: 2011-01-01  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44
#>      2: 2011-01-01  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23
#>      3: 2011-01-01  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66
#>      4: 2011-01-01  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44
#>      5: 2011-01-01  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03
#>     ---                                                                       
#>  99996: 2020-01-01  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42
#>  99997: 2020-01-01  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39
#>  99998: 2020-01-01  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19
#>  99999: 2020-01-01  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30
#> 100000: 2020-01-01  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40
#>             z    id
#>         <num> <int>
#>      1:  2.76     1
#>      2:  2.70     2
#>      3:  4.10     3
#>      4:  3.83     4
#>      5:  3.28     5
#>     ---            
#>  99996:  2.73    27
#>  99997:  2.75    28
#>  99998:  2.58    29
#>  99999:  2.66    30
#> 100000:  2.74    31

移动函数


sys_time_print({
  dat1 %>% 
    group_dt(
      by = color,
      mutate_dt(
        MA10_price = frollmean(price, 10),
        MSD10_price = frollapply(price, 10, FUN = sd)
      )
    ) -> dat1
})
#> [1] "Finished in 1.280s elapsed (1.250s cpu)"

dat1
#>         color         dt carat       cut clarity depth table price     x     y
#>         <ord>     <Date> <num>     <ord>   <ord> <num> <num> <int> <num> <num>
#>      1:     H 2011-01-01  0.34 Very Good     VS2  62.4    60   537  4.41  4.44
#>      2:     H 2011-01-01  1.20   Premium     SI2  62.1    57  3965  6.84  6.69
#>      3:     H 2011-01-01  1.51   Premium     SI1  58.0    59 10063  7.57  7.54
#>      4:     H 2011-01-02  1.00   Premium     VS2  60.7    60  4930  6.55  6.49
#>      5:     H 2011-01-02  1.01     Ideal     SI2  61.1    56  4693  6.44  6.48
#>     ---                                                                       
#>  99996:     I 2019-12-30  0.54     Ideal     SI1  62.1    54  1057  5.24  5.26
#>  99997:     I 2019-12-31  0.90 Very Good     VS2  59.3    59  3992  6.27  6.30
#>  99998:     I 2019-12-31  0.30     Ideal     SI1  61.3    56   506  4.33  4.32
#>  99999:     I 2020-01-01  1.22   Premium     VS2  62.3    59  5832  6.79  6.83
#> 100000:     I 2020-01-01  0.28 Very Good     VS2  61.9    57   379  4.16  4.19
#>             z    id MA10_price MSD10_price
#>         <num> <int>      <num>       <num>
#>      1:  2.76     1         NA          NA
#>      2:  4.21    16         NA          NA
#>      3:  4.38    23         NA          NA
#>      4:  3.96     3         NA          NA
#>      5:  3.95    13         NA          NA
#>     ---                                   
#>  99996:  3.26    31     4239.9    3421.390
#>  99997:  3.73    29     4379.3    3375.145
#>  99998:  2.65    30     4044.3    3592.134
#>  99999:  4.24     5     3976.6    3546.612
#> 100000:  2.58    29     3949.0    3576.283

系统参数

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                              
#> [2] LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] bench_1.1.1       dplyr_1.0.1       data.table_1.13.0 tidyfst_0.9.9    
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.5       pillar_1.4.6     compiler_4.0.2   tools_4.0.2     
#>  [5] digest_0.6.25    evaluate_0.14    lifecycle_0.2.0  tibble_3.0.3    
#>  [9] gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.7      cli_2.0.2       
#> [13] yaml_2.2.1       parallel_4.0.2   xfun_0.16        stringr_1.4.0   
#> [17] knitr_1.29       generics_0.0.2   vctrs_0.3.2      grid_4.0.2      
#> [21] tidyselect_1.1.0 glue_1.4.1       R6_2.4.1         fansi_0.4.1     
#> [25] profmem_0.5.0    rmarkdown_2.3    purrr_0.3.4      ggplot2_3.3.2   
#> [29] magrittr_1.5     scales_1.1.1     ellipsis_0.3.1   htmltools_0.5.0 
#> [33] fst_0.9.2        assertthat_0.2.1 colorspace_1.4-1 utf8_1.1.4      
#> [37] stringi_1.4.6    munsell_0.5.0    crayon_1.3.4