Suppose I have a dataframe like so:
contracts
Dates Last.Price Last.Price.1 id carry
1 1998-11-30 94.50 98.50 QS -0.040609137
2 1998-11-30 31.32 32.13 HO -0.025210084
3 1998-12-31 95.50 98.00 QS -0.025510204
4 1998-12-31 34.00 34.28 HO -0.008168028
5 1999-01-29 100.00 100.50 QS -0.004975124
6 1999-01-29 33.16 33.42 HO -0.007779773
7 1999-02-26 100.25 100.25 QS 0.000000000
8 1999-02-26 32.29 32.37 HO -0.002471424
9 1999-02-26 10.88 11.00 CO -0.010909091
10 1999-03-31 131.50 130.75 QS 0.005736138
11 1999-03-31 44.68 44.00 HO 0.015454545
12 1999-03-31 15.24 15.16 CO 0.005277045
I want to calculate the weight of each id in each month. I have a function that does this (shown at the end of the question) and apply it with dplyr:
library(dplyr)
library(lubridate)
contracts <- contracts %>%
  mutate(Dates = ymd(Dates)) %>%
  group_by(Dates) %>%
  mutate(weights = weight(carry))
which gives:
contracts
Dates Last.Price Last.Price.1 id carry weights
1 1998-11-30 94.50 98.50 QS -0.040609137 0.616979910
2 1998-11-30 31.32 32.13 HO -0.025210084 0.383020090
3 1998-12-31 95.50 98.00 QS -0.025510204 0.757468623
4 1998-12-31 34.00 34.28 HO -0.008168028 0.242531377
5 1999-01-29 100.00 100.50 QS -0.004975124 0.390056023
6 1999-01-29 33.16 33.42 HO -0.007779773 0.609943977
7 1999-02-26 100.25 100.25 QS 0.000000000 NA
8 1999-02-26 32.29 32.37 HO -0.002471424 0.184703218
9 1999-02-26 10.88 11.00 CO -0.010909091 0.815296782
10 1999-03-31 131.50 130.75 QS 0.005736138 0.005736138
11 1999-03-31 44.68 44.00 HO 0.015454545 0.015454545
12 1999-03-31 15.24 15.16 CO 0.005277045 0.005277045
Now I want to lag the weights, so that the weights calculated in November are applied in December. In other words, I want to shift the weights column by group, the group being the dates, so that the values in November become the values in December and so on.
I also want the shift to match by id, so that if a new id is included, the group where that id first appears has an NA in the lagged column.
The desired output is given below:
desired
Dates Last.Price Last.Price.1 id carry weights w
1 1998-11-30 94.50 98.50 QS -0.040609137 0.616979910 NA
2 1998-11-30 31.32 32.13 HO -0.025210084 0.383020090 NA
3 1998-12-31 95.50 98.00 QS -0.025510204 0.757468623 0.61697991
4 1998-12-31 34.00 34.28 HO -0.008168028 0.242531377 0.38302009
5 1999-01-29 100.00 100.50 QS -0.004975124 0.390056023 0.75746862
6 1999-01-29 33.16 33.42 HO -0.007779773 0.609943977 0.24253138
7 1999-02-26 100.25 100.25 QS 0.000000000 NA 0.39005602
8 1999-02-26 32.29 32.37 HO -0.002471424 0.184703218 0.60994398
9 1999-02-26 10.88 11.00 CO -0.010909091 0.815296782 NA
10 1999-03-31 131.50 130.75 QS 0.005736138 0.005736138 NA
11 1999-03-31 44.68 44.00 HO 0.015454545 0.015454545 0.18470322
12 1999-03-31 15.24 15.16 CO 0.005277045 0.005277045 0.81529678
Take note of February 1999: CO has an NA because it first appears in February.
Now look at March 1999: CO has the value from February, and QS has an NA only because its February value was NA (due to division by 0).
Can this be done?
Data:
contracts <- read.table(text = "Dates, Last.Price, Last.Price.1, id,carry
1998-11-30, 94.500, 98.500, QS, -0.0406091371
1998-11-30, 31.320, 32.130, HO, -0.0252100840
1998-12-31, 95.500, 98.000, QS, -0.0255102041
1998-12-31, 34.000, 34.280, HO, -0.0081680280
1999-01-29, 100.000, 100.500, QS, -0.0049751244
1999-01-29, 33.160, 33.420, HO, -0.0077797726
1999-02-26, 100.250, 100.250, QS, 0.0000000000
1999-02-26, 32.290, 32.370, HO, -0.0024714242
1999-02-26, 10.880, 11.000, CO, -0.0109090909
1999-03-31, 131.500, 130.750, QS, 0.0057361377
1999-03-31, 44.680, 44.000, HO, 0.0154545455
1999-03-31, 15.240, 15.160, CO, 0.0052770449", sep = ",", header = T)
desired <- read.table(text = "Dates,Last.Price,Last.Price.1,id,carry,weights,w
1998-11-30,94.5,98.5, QS,-0.0406091371,0.616979909839741,NA
1998-11-30,31.32,32.13, HO,-0.025210084,0.383020090160259,NA
1998-12-31,95.5,98, QS,-0.0255102041,0.757468623182272,0.616979909839741
1998-12-31,34,34.28, HO,-0.008168028,0.242531376817728,0.383020090160259
1999-01-29,100,100.5, QS,-0.0049751244,0.390056023188584,0.757468623182272
1999-01-29,33.16,33.42, HO,-0.0077797726,0.609943976811416,0.242531376817728
1999-02-26,100.25,100.25, QS,0,NA,0.390056023188584
1999-02-26,32.29,32.37, HO,-0.0024714242,0.184703218189261,0.609943976811416
1999-02-26,10.88,11, CO,-0.0109090909,0.815296781810739,NA
1999-03-31,131.5,130.75, QS,0.0057361377,0.0057361377,NA
1999-03-31,44.68,44, HO,0.0154545455,0.0154545455,0.184703218189261
1999-03-31,15.24,15.16, CO,0.0052770449,0.0052770449,0.815296782", sep = ",", header = TRUE)
weights function:
weight <- function(vec) {
  neg <- which(vec<0)
  w <- vec
  w[neg] <- vec[vec<0] / sum(vec[vec<0])
  w[-neg] <- vec[vec>=0] / sum(vec[vec>=0])
  w
}
contracts %>%
  group_by(Dates) %>%
  mutate(weights = weight(carry)) %>%
  arrange(Dates) %>%
  group_by(id) %>%
  mutate(w = dplyr::lag(weights)) %>%
  ungroup()
# # A tibble: 12 x 7
# Dates Last.Price Last.Price.1 id carry weights w
# <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1998-11-30 94.5 98.5 " QS" -0.0406 0.617 NA
# 2 1998-11-30 31.3 32.1 " HO" -0.0252 0.383 NA
# 3 1998-12-31 95.5 98 " QS" -0.0255 0.757 0.617
# 4 1998-12-31 34 34.3 " HO" -0.00817 0.243 0.383
# 5 1999-01-29 100 100. " QS" -0.00498 0.390 0.757
# 6 1999-01-29 33.2 33.4 " HO" -0.00778 0.610 0.243
# 7 1999-02-26 100. 100. " QS" 0 NaN 0.390
# 8 1999-02-26 32.3 32.4 " HO" -0.00247 0.185 0.610
# 9 1999-02-26 10.9 11 " CO" -0.0109 0.815 NA
# 10 1999-03-31 132. 131. " QS" 0.00574 0.00574 NaN
# 11 1999-03-31 44.7 44 " HO" 0.0155 0.0155 0.185
# 12 1999-03-31 15.2 15.2 " CO" 0.00528 0.00528 0.815
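A side observation on rows 10 to 12 above: weights equals carry there. March has no negative carry, so neg is integer(0), and w[-neg] then selects nothing rather than everything, which leaves w untouched. If the intent is for the non-negative values to be normalised among themselves (an assumption on my part), logical masks avoid the empty-index case. A minimal sketch, where weight_masked is a hypothetical name and vec is assumed to contain no NA:

```r
# Sketch: same idea as weight(), but with logical masks instead of which().
# Assumes vec contains no NA values.
weight_masked <- function(vec) {
  neg <- vec < 0                         # logical mask, safe when nothing is negative
  w <- vec
  w[neg]  <- vec[neg]  / sum(vec[neg])   # negatives normalised among themselves
  w[!neg] <- vec[!neg] / sum(vec[!neg])  # non-negatives normalised among themselves
  w
}

march <- c(0.0057361377, 0.0154545455, 0.0052770449)  # March 1999 carry values
weight_masked(march)   # sums to 1 instead of echoing the input

feb <- c(0, -0.0024714242, -0.0109090909)             # February 1999 carry values
weight_masked(feb)     # first element is still NaN (0 / 0)
```

The February result matches the output above; only months without any negative carry change.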
Notes:
I used dplyr::lag instead of plain lag to avoid any confusion with stats::lag, which behaves significantly differently from dplyr::lag. Plain lag will work just fine most of the time, but it works until it doesn't ... and it usually doesn't warn you :-)
This lags by row within each id, regardless of the actual month. I'm assuming that your Dates are always perfectly regular. If you think there's a possibility of a gap (where lagging by row is not correct), then you'll need to break out the year/month into a new field and join the data on itself instead of using lag.
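The self-join mentioned in the last note could look roughly like this. It is a sketch on toy data, not the asker's: the values, the month_index helper and the ym column are hypothetical, and it assumes at most one row per id per month.

```r
library(dplyr)

# Toy data with a gap: HO has no row in 1999-01 (hypothetical values).
df <- tibble(
  Dates   = as.Date(c("1998-11-30", "1998-11-30", "1998-12-31", "1998-12-31",
                      "1999-01-29", "1999-02-26", "1999-02-26")),
  id      = c("QS", "HO", "QS", "HO", "QS", "QS", "HO"),
  weights = c(0.6, 0.4, 0.7, 0.3, 1.0, 0.5, 0.5)
)

# Month counter: consecutive calendar months differ by exactly 1.
month_index <- function(d) {
  lt <- as.POSIXlt(d)
  (lt$year + 1900L) * 12L + lt$mon
}

df <- df %>% mutate(ym = month_index(Dates))

# Shift each row's weight one month forward, then join it back by id and month.
prev <- df %>% transmute(id, ym = ym + 1L, w = weights)

lagged <- df %>%
  left_join(prev, by = c("id", "ym")) %>%
  select(-ym)
lagged
```

Here HO in February gets NA because it has no January row, whereas a row-based lag would have carried the December value forward.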
When I try to use the cajolst function from the urca package, I get a strange error.
Could you please tell me how to fix the problem?
result<-urca::cajolst(data ,trend = FALSE, K = 2, season = NULL)
Error in embed(diff(x), K) : wrong embedding dimension.
dates A G
2016-11-30 0 0
2016-12-01 -3.53 3.198
2016-12-02 -2.832 8.703
2016-12-04 -2.666 7.799
2016-12-05 -0.54 7.701
2016-12-06 -1.296 4.685
2016-12-07 -1.785 -4.587
2016-12-08 -6.834 -3.696
2016-12-09 -9.624 -5.461
2016-12-11 -11.374 -0.423
2016-12-12 -6.037 -1.614
2016-12-13 -5.934 -3.231
2016-12-14 -7.279 1.072
2016-12-15 -7.859 -4.823
2016-12-16 -15.132 10.838
2016-12-19 -15.345 11.5
2016-12-20 -15.673 6.639
2016-12-21 -15.391 11.162
2016-12-22 -14.357 7.032
2016-12-23 -14.99 12.355
2016-12-26 -15.626 10.944
2016-12-27 -12.297 10.215
2016-12-28 -13.967 5.957
2016-12-29 -12.946 3.446
2016-12-30 -19.681 10.274
2017-01-02 -18.24 8.781
2017-01-03 -16.83 1.116
2017-01-04 -18.189 -0.036
2017-01-05 -15.897 -1.441
2017-01-06 -20.196 -8.534
2017-01-09 -14.57 -28.768
2017-01-10 -13.27 -29.821
2017-01-11 -8.85 -38.881
2017-01-12 -6.375 -50.885
2017-01-13 -8.056 -51.321
2017-01-16 -5.217 -63.619
2017-01-17 -4.75 -39.163
2017-01-18 3.505 -46.309
2017-01-19 10.939 -45.825
2017-01-20 9.248 -42.973
2017-01-23 9.532 -33.396
2017-01-24 4.235 -31.38
2017-01-25 -1.885 -19.21
2017-01-26 -5.027 -15.74
2017-01-27 0.015 -23.029
2017-01-30 -0.685 -30.773
2017-01-31 -2.692 -25.544
2017-02-01 -2.654 -17.912
2017-02-02 4.002 -43.309
2017-02-03 4.813 -52.627
2017-02-06 7.049 -49.965
2017-02-07 10.003 -40.568
2017-02-08 8.996 -39.828
2017-02-09 7.047 -41.19
2017-02-10 7.656 -50.853
2017-02-13 4.986 -41.318
2017-02-14 8.493 -51.946
2017-02-15 12.547 -59.538
2017-02-16 10.327 -54.496
2017-02-17 7.09 -57.571
2017-02-20 11.633 -54.91
2017-02-21 12.664 -51.597
2017-02-22 16.103 -57.819
2017-02-23 14.25 -51.336
2017-02-24 7.794 -54.898
2017-02-27 15.27 -55.754
2017-02-28 19.984 -58.37
2017-03-01 23.899 -70.73
2017-03-02 16.63 -56.29
2017-03-03 16.443 -55.858
2017-03-06 17.901 -59.377
2017-03-07 19.067 -64.383
2017-03-08 17.219 -57.829
2017-03-09 15.694 -55.022
2017-03-10 17.351 -60.431
2017-03-13 18.945 -59.79
2017-03-14 20.001 -64.848
2017-03-15 23.852 -73.806
2017-03-16 22.697 -64.191
2017-03-17 26.892 -65.328
2017-03-20 29.221 -72.764
2017-03-21 25.165 -53.427
2017-03-22 22.998 -51.676
2017-03-23 20.072 -40.57
2017-03-24 20.758 -43.654
2017-03-27 20.062 -33.672
2017-03-28 22.066 -47.184
2017-03-29 22.363 -54.57
2017-03-30 20.684 -48.199
2017-03-31 17.056 -40.887
2017-04-03 19.12 -39.618
2017-04-04 16.359 -37.1
2017-04-05 18.643 -32.734
2017-04-06 14.708 -30.455
2017-04-07 8.403 -33.553
2017-04-10 6.072 -29.048
2017-04-11 5.186 -20.696
2017-04-12 4.248 -20.924
2017-04-13 12.803 -31.075
2017-04-14 12.566 -29.768
2017-04-17 14.065 -28.906
2017-04-18 14.5 4.121
2017-04-19 13.865 8.835
2017-04-20 16.126 6.191
2017-04-21 17.591 3.77
2017-04-24 22.3 -2.497
2017-04-25 22.731 7.408
2017-04-26 19.146 18.45
2017-04-27 19.052 25.541
2017-04-28 21.889 26.878
2017-05-01 27.323 14.362
2017-05-02 29.93 17.525
2017-05-03 19.835 29.856
2017-05-04 19.683 36.72
2017-05-05 13.545 41.055
2017-05-08 14.165 43.544
2017-05-09 11.325 49.978
2017-05-10 10.143 47.072
2017-05-11 13.718 38.901
2017-05-12 14.216 36.017
2017-05-15 13.701 33.797
2017-05-16 13.505 33.867
2017-05-17 13.456 38.004
2017-05-18 12.613 37.758
2017-05-19 11.166 40.367
2017-05-22 12.221 34.022
2017-05-23 13.682 29.793
2017-05-24 10.05 26.701
2017-05-25 10.122 31.394
2017-05-26 7.592 20.073
2017-05-29 6.796 23.809
2017-05-30 9.638 16.1
2017-05-31 7.983 29.043
2017-06-01 3.594 39.557
2017-06-02 8.763 27.863
2017-06-05 12.157 22.397
2017-06-06 13.383 19.053
2017-06-07 20.52 17.449
2017-06-08 19.534 -1.615
2017-06-09 16.011 -1.989
2017-06-12 9.153 -9.294
2017-06-13 4.295 -0.897
2017-06-14 9.743 -9.818
2017-06-15 10.386 -8.255
2017-06-16 11.983 -12.522
2017-06-19 9.513 -12.931
2017-06-20 10.298 -21.024
2017-06-21 11.087 -11.801
2017-06-22 4.472 -9.048
2017-06-23 9.416 -9.592
2017-06-26 9.686 -12.006
2017-06-27 6.424 -2.632
2017-06-28 3.062 -1.016
2017-06-29 5.593 -0.825
2017-06-30 3.531 0.914
2017-07-03 3.208 -2.596
2017-07-04 -6.373 4.289
2017-07-05 -5.149 5.917
2017-07-06 -6.104 12.75
2017-07-07 -9.565 1.615
2017-07-10 -8.961 -0.053
2017-07-11 -4.065 -8.541
2017-07-12 -10.133 -11.286
2017-07-13 -6.223 -15.181
2017-07-14 -1.524 -14.396
2017-07-17 -1.613 -14.61
2017-07-18 5.781 -35.473
2017-07-19 8.243 -44.186
2017-07-20 7.665 -49.857
2017-07-21 0.485 -41.286
2017-07-24 -0.638 -39.127
2017-07-25 0.767 -40.952
2017-07-26 3.566 -44.388
2017-07-27 6.834 -42.543
2017-07-28 1.306 -37.657
2017-07-31 5.839 -34.048
2017-08-01 5.838 -28.939
2017-08-02 7.298 -26.566
2017-08-03 6.804 -32.876
2017-08-04 8.989 -38.618
2017-08-07 8.862 -36.676
2017-08-08 8.234 -40.893
2017-08-09 7.39 -35.16
2017-08-10 8.593 -35.555
2017-08-11 7.253 -35.175
2017-08-14 5.593 -33.644
2017-08-15 4.528 -37.82
2017-08-16 6.752 -53.217
2017-08-17 6.284 -49.252
2017-08-18 4.765 -55.602
2017-08-21 3.905 -54.32
2017-08-22 1.76 -57.853
2017-08-23 0.406 -58.925
2017-08-24 -2.438 -58.098
2017-08-25 -0.791 -56.682
2017-08-28 2.173 -51.278
2017-08-29 2.523 -54.353
2017-08-30 4.482 -46.325
2017-08-31 0.246 -52.567
2017-09-01 -4.214 -53.636
2017-09-04 -4.548 -52.735
2017-09-05 -1.781 -50.421
2017-09-06 -10.463 -51.122
2017-09-07 -13.119 -52.433
2017-09-08 -11.716 -43.493
2017-09-11 -16.15 -43.142
2017-09-12 -12.478 -29.335
2017-09-13 -16.457 -31.697
2017-09-14 -14.615 -15.13
2017-09-15 -13.911 3.023
One issue is that the 'dates' column is also included in the input. Second, season is not needed here: it can be FALSE, or you can specify an integer value.
library(urca)
out <- cajolst(data[-1], trend = FALSE, K = 2, season = FALSE)
If there is a seasonal effect and it is quarterly, the value would be 4
out1 <- cajolst(data[-1], trend = FALSE, K = 2, season = 4)
out1
#####################################################
# Johansen-Procedure Unit Root / Cointegration Test #
#####################################################
#The value of the test statistic is: 3.6212 13.2233
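To see why the date column matters, assuming cajolst treats its input as a numeric matrix, here is a small sketch on a toy frame shaped like data (the toy and num_only names are hypothetical). It also selects the numeric columns by type, which is a bit more robust than dropping column 1 by position:

```r
# Toy frame shaped like `data`: one character date column plus numeric series.
toy <- data.frame(
  dates = c("2016-11-30", "2016-12-01", "2016-12-02"),
  A = c(0, -3.53, -2.832),
  G = c(0, 3.198, 8.703),
  stringsAsFactors = FALSE
)

# With the date column included, every cell is coerced to character.
typeof(as.matrix(toy))

# Keep only numeric columns by type, rather than by position.
num_only <- toy[vapply(toy, is.numeric, logical(1))]
typeof(as.matrix(num_only))
names(num_only)
```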
data
data <- structure(list(dates = c("2016-11-30", "2016-12-01", "2016-12-02",
"2016-12-04", "2016-12-05", "2016-12-06", "2016-12-07", "2016-12-08",
"2016-12-09", "2016-12-11", "2016-12-12", "2016-12-13", "2016-12-14",
"2016-12-15", "2016-12-16", "2016-12-19", "2016-12-20", "2016-12-21",
"2016-12-22", "2016-12-23", "2016-12-26", "2016-12-27", "2016-12-28",
"2016-12-29", "2016-12-30", "2017-01-02", "2017-01-03", "2017-01-04",
"2017-01-05", "2017-01-06", "2017-01-09", "2017-01-10", "2017-01-11",
"2017-01-12", "2017-01-13", "2017-01-16", "2017-01-17", "2017-01-18",
"2017-01-19", "2017-01-20", "2017-01-23", "2017-01-24", "2017-01-25",
"2017-01-26", "2017-01-27", "2017-01-30", "2017-01-31", "2017-02-01",
"2017-02-02", "2017-02-03", "2017-02-06", "2017-02-07", "2017-02-08",
"2017-02-09", "2017-02-10", "2017-02-13", "2017-02-14", "2017-02-15",
"2017-02-16", "2017-02-17", "2017-02-20", "2017-02-21", "2017-02-22",
"2017-02-23", "2017-02-24", "2017-02-27", "2017-02-28", "2017-03-01",
"2017-03-02", "2017-03-03", "2017-03-06", "2017-03-07", "2017-03-08",
"2017-03-09", "2017-03-10", "2017-03-13", "2017-03-14", "2017-03-15",
"2017-03-16", "2017-03-17", "2017-03-20", "2017-03-21", "2017-03-22",
"2017-03-23", "2017-03-24", "2017-03-27", "2017-03-28", "2017-03-29",
"2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04", "2017-04-05",
"2017-04-06", "2017-04-07", "2017-04-10", "2017-04-11", "2017-04-12",
"2017-04-13", "2017-04-14", "2017-04-17", "2017-04-18", "2017-04-19",
"2017-04-20", "2017-04-21", "2017-04-24", "2017-04-25", "2017-04-26",
"2017-04-27", "2017-04-28", "2017-05-01", "2017-05-02", "2017-05-03",
"2017-05-04", "2017-05-05", "2017-05-08", "2017-05-09", "2017-05-10",
"2017-05-11", "2017-05-12", "2017-05-15", "2017-05-16", "2017-05-17",
"2017-05-18", "2017-05-19", "2017-05-22", "2017-05-23", "2017-05-24",
"2017-05-25", "2017-05-26", "2017-05-29", "2017-05-30", "2017-05-31",
"2017-06-01", "2017-06-02", "2017-06-05", "2017-06-06", "2017-06-07",
"2017-06-08", "2017-06-09", "2017-06-12", "2017-06-13", "2017-06-14",
"2017-06-15", "2017-06-16", "2017-06-19", "2017-06-20", "2017-06-21",
"2017-06-22", "2017-06-23", "2017-06-26", "2017-06-27", "2017-06-28",
"2017-06-29", "2017-06-30", "2017-07-03", "2017-07-04", "2017-07-05",
"2017-07-06", "2017-07-07", "2017-07-10", "2017-07-11", "2017-07-12",
"2017-07-13", "2017-07-14", "2017-07-17", "2017-07-18", "2017-07-19",
"2017-07-20", "2017-07-21", "2017-07-24", "2017-07-25", "2017-07-26",
"2017-07-27", "2017-07-28", "2017-07-31", "2017-08-01", "2017-08-02",
"2017-08-03", "2017-08-04", "2017-08-07", "2017-08-08", "2017-08-09",
"2017-08-10", "2017-08-11", "2017-08-14", "2017-08-15", "2017-08-16",
"2017-08-17", "2017-08-18", "2017-08-21", "2017-08-22", "2017-08-23",
"2017-08-24", "2017-08-25", "2017-08-28", "2017-08-29", "2017-08-30",
"2017-08-31", "2017-09-01", "2017-09-04", "2017-09-05", "2017-09-06",
"2017-09-07", "2017-09-08", "2017-09-11", "2017-09-12", "2017-09-13",
"2017-09-14", "2017-09-15"), A = c(0, -3.53, -2.832, -2.666,
-0.54, -1.296, -1.785, -6.834, -9.624, -11.374, -6.037, -5.934,
-7.279, -7.859, -15.132, -15.345, -15.673, -15.391, -14.357,
-14.99, -15.626, -12.297, -13.967, -12.946, -19.681, -18.24,
-16.83, -18.189, -15.897, -20.196, -14.57, -13.27, -8.85, -6.375,
-8.056, -5.217, -4.75, 3.505, 10.939, 9.248, 9.532, 4.235, -1.885,
-5.027, 0.015, -0.685, -2.692, -2.654, 4.002, 4.813, 7.049, 10.003,
8.996, 7.047, 7.656, 4.986, 8.493, 12.547, 10.327, 7.09, 11.633,
12.664, 16.103, 14.25, 7.794, 15.27, 19.984, 23.899, 16.63, 16.443,
17.901, 19.067, 17.219, 15.694, 17.351, 18.945, 20.001, 23.852,
22.697, 26.892, 29.221, 25.165, 22.998, 20.072, 20.758, 20.062,
22.066, 22.363, 20.684, 17.056, 19.12, 16.359, 18.643, 14.708,
8.403, 6.072, 5.186, 4.248, 12.803, 12.566, 14.065, 14.5, 13.865,
16.126, 17.591, 22.3, 22.731, 19.146, 19.052, 21.889, 27.323,
29.93, 19.835, 19.683, 13.545, 14.165, 11.325, 10.143, 13.718,
14.216, 13.701, 13.505, 13.456, 12.613, 11.166, 12.221, 13.682,
10.05, 10.122, 7.592, 6.796, 9.638, 7.983, 3.594, 8.763, 12.157,
13.383, 20.52, 19.534, 16.011, 9.153, 4.295, 9.743, 10.386, 11.983,
9.513, 10.298, 11.087, 4.472, 9.416, 9.686, 6.424, 3.062, 5.593,
3.531, 3.208, -6.373, -5.149, -6.104, -9.565, -8.961, -4.065,
-10.133, -6.223, -1.524, -1.613, 5.781, 8.243, 7.665, 0.485,
-0.638, 0.767, 3.566, 6.834, 1.306, 5.839, 5.838, 7.298, 6.804,
8.989, 8.862, 8.234, 7.39, 8.593, 7.253, 5.593, 4.528, 6.752,
6.284, 4.765, 3.905, 1.76, 0.406, -2.438, -0.791, 2.173, 2.523,
4.482, 0.246, -4.214, -4.548, -1.781, -10.463, -13.119, -11.716,
-16.15, -12.478, -16.457, -14.615, -13.911), G = c(0, 3.198,
8.703, 7.799, 7.701, 4.685, -4.587, -3.696, -5.461, -0.423, -1.614,
-3.231, 1.072, -4.823, 10.838, 11.5, 6.639, 11.162, 7.032, 12.355,
10.944, 10.215, 5.957, 3.446, 10.274, 8.781, 1.116, -0.036, -1.441,
-8.534, -28.768, -29.821, -38.881, -50.885, -51.321, -63.619,
-39.163, -46.309, -45.825, -42.973, -33.396, -31.38, -19.21,
-15.74, -23.029, -30.773, -25.544, -17.912, -43.309, -52.627,
-49.965, -40.568, -39.828, -41.19, -50.853, -41.318, -51.946,
-59.538, -54.496, -57.571, -54.91, -51.597, -57.819, -51.336,
-54.898, -55.754, -58.37, -70.73, -56.29, -55.858, -59.377, -64.383,
-57.829, -55.022, -60.431, -59.79, -64.848, -73.806, -64.191,
-65.328, -72.764, -53.427, -51.676, -40.57, -43.654, -33.672,
-47.184, -54.57, -48.199, -40.887, -39.618, -37.1, -32.734, -30.455,
-33.553, -29.048, -20.696, -20.924, -31.075, -29.768, -28.906,
4.121, 8.835, 6.191, 3.77, -2.497, 7.408, 18.45, 25.541, 26.878,
14.362, 17.525, 29.856, 36.72, 41.055, 43.544, 49.978, 47.072,
38.901, 36.017, 33.797, 33.867, 38.004, 37.758, 40.367, 34.022,
29.793, 26.701, 31.394, 20.073, 23.809, 16.1, 29.043, 39.557,
27.863, 22.397, 19.053, 17.449, -1.615, -1.989, -9.294, -0.897,
-9.818, -8.255, -12.522, -12.931, -21.024, -11.801, -9.048, -9.592,
-12.006, -2.632, -1.016, -0.825, 0.914, -2.596, 4.289, 5.917,
12.75, 1.615, -0.053, -8.541, -11.286, -15.181, -14.396, -14.61,
-35.473, -44.186, -49.857, -41.286, -39.127, -40.952, -44.388,
-42.543, -37.657, -34.048, -28.939, -26.566, -32.876, -38.618,
-36.676, -40.893, -35.16, -35.555, -35.175, -33.644, -37.82,
-53.217, -49.252, -55.602, -54.32, -57.853, -58.925, -58.098,
-56.682, -51.278, -54.353, -46.325, -52.567, -53.636, -52.735,
-50.421, -51.122, -52.433, -43.493, -43.142, -29.335, -31.697,
-15.13, 3.023)), class = "data.frame", row.names = c(NA, -210L
))
I have a dataset from a source that uses a special compression algorithm. Simply put, new measurements are recorded only when the change in slope (rate of change) is greater than a certain percentage (say 5%).
However, for the analysis I'm currently carrying out, I need values at regular intervals. I am able to carry out a piecewise interpolation using approx, approxfun or spline for different variables vs time (tme in the data below), but I'd like to do it for all variables (columns of the data.table) in a single shot.
library(data.table)
q = setDT(
structure(list(tme = structure(c(1463172120, 1463173320, 1463175720,
1463180520, 1463182920, 1463187720, 1463188920, 1463190120, 1463191320,
1463192520, 1463202180, 1463203380, 1463204580, 1463205780, 1463206980,
1463208180, 1463218980, 1463233440, 1463244240, 1463245440, 1463246640,
1463247840, 1463249040, 1463250240, 1463251440, 1463252640, 1463253840,
1463255040, 1463256240, 1463316360, 1463317560, 1463318760, 1463319960,
1463321160, 1463322360, 1463323560, 1463324760, 1463325960, 1463327160,
1463328360, 1463329560, 1463330760, 1463331960), class = c("POSIXct",
"POSIXt"), tzone = "America/Montreal"), rh = c(50.36, 47.31,
46.39, 46.99, 47.89, 50.37, 51.29, 51.92, 54.97, 67.64, 69.38,
68.96, 69.89, 56.66, 51.23, 55.38, 64.36, 50.72, 31.33, 31.38,
32.65, 33.15, 33.05, 31.87, 32.58, 32.65, 31.06, 29.82, 28.72,
67.95, 66.68, 64.66, 62.12, 59.86, 58.11, 57.41, 56.5, 56.16,
55.69, 54.57, 53.89, 53.81, 52.01), degc = c(30.0055555555556,
30.3611111111111, 30.6611111111111, 30.5833333333333, 30.2666666666667,
28.6888888888889, 28.2555555555556, 28.0722222222222, 27.4944444444444,
25.0722222222222, 24.8111111111111, 24.7166666666667, 24.1666666666667,
25.4111111111111, 25.5222222222222, 24.3555555555556, 22.7722222222222,
25.5222222222222, 27.8111111111111, 27.9888888888889, 28.0277777777778,
28.1333333333333, 28.5333333333333, 28.7, 28.85, 29.1555555555556,
28.8388888888889, 29.5111111111111, 29.6722222222222, 22.3888888888889,
22.5722222222222, 22.9444444444444, 23.3722222222222, 23.6777777777778,
23.8777777777778, 24.2055555555556, 24.6888888888889, 24.9777777777778,
25.3888888888889, 25.8, 26.1, 26.1555555555556, 26.7388888888889
)), .Names = c("tme", "rh", "degc"), row.names = c(NA, -43L), class = c("data.table",
"data.frame")))
q is my queried dataset. Here's what works for individual variables (degc in this example):
interpolate_degc <- approxfun(x = q$tme, y = q$degc, method = "linear")
# To get the uniform samples:
width <- "10 mins"
new_times <- seq.POSIXt(from = q$tme[1], to = q$tme[nrow(q)], by = width)
new_degc <- interpolate_degc(new_times)
I'd like to do this for all variables in a single shot, preferably using data.table.
This seems to work:
cols = c("rh", "degc")
DT = q[.(seq(min(tme), max(tme), by="10 mins")), on=.(tme)]
DT[, (cols) := lapply(cols, function(z) with(q,
  approxfun(x = tme, y = get(z), method = "linear")
)(tme))]
tme rh degc
1: 2016-05-13 16:42:00 50.360 30.00556
2: 2016-05-13 16:52:00 48.835 30.18333
3: 2016-05-13 17:02:00 47.310 30.36111
4: 2016-05-13 17:12:00 47.080 30.43611
5: 2016-05-13 17:22:00 46.850 30.51111
---
263: 2016-05-15 12:22:00 54.026 26.04000
264: 2016-05-15 12:32:00 53.866 26.11667
265: 2016-05-15 12:42:00 53.826 26.14444
266: 2016-05-15 12:52:00 53.270 26.33056
267: 2016-05-15 13:02:00 52.370 26.62222
Generally when you want to iterate over columns, lapply or Map will work.
How it works: Inside the with(q, ...), tme and get(z) refer to columns of q, but outside of it, we're looking at columns of DT (in this case just tme).
Another way of doing the same thing:
q[, {
  tt = seq(min(tme), max(tme), by="10 mins")
  c(
    .(tme = tt),
    lapply(.SD, function(z) approxfun(x = tme, y = z, method="linear")(tt))
  )
}, .SDcols=cols]
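The same lapply-over-columns idea, stripped of data.table so the mechanics are easier to see. This is a base R sketch on a toy series; raw, grid and regular are hypothetical names and the values are made up:

```r
# Toy irregular series (hypothetical values), times in seconds.
raw <- data.frame(
  t    = c(0, 600, 1800, 2400),
  rh   = c(50, 47, 46, 48),
  degc = c(30.0, 30.4, 30.6, 30.2)
)

cols <- c("rh", "degc")
grid <- seq(min(raw$t), max(raw$t), by = 600)   # regular 10-minute grid

# One approx() call per column; lapply returns a list in the same column order.
interp <- lapply(raw[cols], function(y) approx(x = raw$t, y = y, xout = grid)$y)
regular <- data.frame(t = grid, interp)
regular
```

The only row not present in raw is t = 1200, which is linearly interpolated between the readings at 600 and 1800.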
For time series I like to use specialized packages like xts and zoo:
library(xts)
ts <- merge(xts(x = q[,-1], order.by = q$tme), new_times)
head(ts)
#> rh degc
#> 2016-05-13 16:42:00 50.36 30.00556
#> 2016-05-13 16:52:00 NA NA
#> 2016-05-13 17:02:00 47.31 30.36111
#> 2016-05-13 17:12:00 NA NA
#> 2016-05-13 17:22:00 NA NA
#> 2016-05-13 17:32:00 NA NA
head(na.approx(ts))
#> rh degc
#> 2016-05-13 16:42:00 50.360 30.00556
#> 2016-05-13 16:52:00 48.835 30.18333
#> 2016-05-13 17:02:00 47.310 30.36111
#> 2016-05-13 17:12:00 47.080 30.43611
#> 2016-05-13 17:22:00 46.850 30.51111
#> 2016-05-13 17:32:00 46.620 30.58611
head(na.spline(ts))
#> rh degc
#> 2016-05-13 16:42:00 50.36000 30.00556
#> 2016-05-13 16:52:00 48.52407 30.20524
#> 2016-05-13 17:02:00 47.31000 30.36111
#> 2016-05-13 17:12:00 46.62601 30.47791
#> 2016-05-13 17:22:00 46.33972 30.56219
#> 2016-05-13 17:32:00 46.30857 30.62093