R caret's timeslices - window and horizon unclear

Using the time slicing in caret and its parameters,
how do I split up data with xyz rows so that each slice has a length of 12?
Ideally, also considering a 60-20-20 train-test-validate ratio.
Should I set it like so:
initialWindow = 12, horizon = 12, fixedWindow = TRUE?
I've read the documentation but this is still unclear to me.

You can try out what happens using an example vector like 1:100.
If you set initialWindow = 12 and fixedWindow = TRUE, the training sets will always have 12 rows. horizon specifies how many subsequent observations are included in each test set. If it is set to 12 and you do not want any rows to be predicted multiple times, skip has to be set to horizon - 1.
A 60-20-20 partitioning can be achieved, for example, by setting initialWindow to the size of the first 60%, fitting your model on the first half of the resulting slices and using the test sets of the second half of the slices as the last 20% (a sketch of this follows the example output below).
I don't know whether you are already trying to use time slicing inside caret's train() function. In any case, you can experiment with the different settings using the createTimeSlices() function:
library(caret)
dat <- 1:100
slices <- createTimeSlices(dat, initialWindow = 12, horizon = 1,
                           skip = 0, fixedWindow = TRUE)
slices # 88 test and train sets
# [...]
slices <- createTimeSlices(y = dat, initialWindow = 12, horizon = 12,
                           skip = 11, fixedWindow = TRUE)
slices
# 7 test and train sets, observations 97 - 100 not in any test set
$train
$train$Training01
[1] 1 2 3 4 5 6 7 8 9 10 11 12
$train$Training13
[1] 13 14 15 16 17 18 19 20 21 22 23 24
$train$Training25
[1] 25 26 27 28 29 30 31 32 33 34 35 36
$train$Training37
[1] 37 38 39 40 41 42 43 44 45 46 47 48
$train$Training49
[1] 49 50 51 52 53 54 55 56 57 58 59 60
$train$Training61
[1] 61 62 63 64 65 66 67 68 69 70 71 72
$train$Training73
[1] 73 74 75 76 77 78 79 80 81 82 83 84
$test
$test$Testing01
[1] 13 14 15 16 17 18 19 20 21 22 23 24
$test$Testing13
[1] 25 26 27 28 29 30 31 32 33 34 35 36
$test$Testing25
[1] 37 38 39 40 41 42 43 44 45 46 47 48
$test$Testing37
[1] 49 50 51 52 53 54 55 56 57 58 59 60
$test$Testing49
[1] 61 62 63 64 65 66 67 68 69 70 71 72
$test$Testing61
[1] 73 74 75 76 77 78 79 80 81 82 83 84
$test$Testing73
[1] 85 86 87 88 89 90 91 92 93 94 95 96
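To make the 60-20-20 idea from above concrete, here is a rough sketch on a toy vector of 100 rows (the values 60, 20 and skip = 19 are illustrative choices for this toy case, not something prescribed by caret). The test set of the second slice is the final 20%, which can be held back as the validation block:
slices <- createTimeSlices(1:100, initialWindow = 60, horizon = 20,
                           skip = 19, fixedWindow = TRUE)
lapply(slices$train, range) # training rows 1-60 and 21-80
lapply(slices$test, range)  # test rows 61-80 and 81-100 (the last 20%)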

Setting skip = initialWindow + horizon - 1 will ensure that your training folds and testing folds do not overlap, i.e. no row appears in more than one slice.
timeSlices <- createTimeSlices(1:nrow(DF), initialWindow = 36, horizon = 6, skip = 41, fixedWindow = TRUE)
trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]
testSlices <- do.call(rbind.data.frame, testSlices)
trainSlices <- do.call(rbind.data.frame, trainSlices)
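As a quick sanity check of that claim (an illustrative toy example; since DF is not shown here, 1:100 stands in for its row indices):
ts <- createTimeSlices(1:100, initialWindow = 36, horizon = 6,
                       skip = 41, fixedWindow = TRUE)
lapply(ts$train, range) # training rows 1-36 and 43-78
lapply(ts$test, range)  # test rows 37-42 and 79-84 -- no row is reused across slices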

Related

how to delete every nth range of rows in R

Suppose I have a dataframe df, which contains 450 rows. I want to delete rows from 10 through 18 (that is 10, 11, 12, 13, 14, 15, 16, 17, 18). Then similarly rows from 28 through 36, then 46:54. And so on, up to deleting rows from 442 through 450.
Any suggestions, guys?
Create a sequence and remove those rows. The first argument, nvec, is the length of each sequence (9, repeated for each sequence); the second, from, is the starting point of each sequence (10, 28, ...).
n = 450
len = n %/% 18
s <- sequence(nvec = rep(9, len), from = seq(10, n, 18))
# [1] 10 11 12 13 14 15 16 17 18 28 29 30 31 32 33 34
# [17] 35 36 46 47 48 49 50 51 52 53 54 64 65 66 67 68
# [33] 69 70 71 72 82 83 84 85 86 87 88 89 90 100 101 102
# ...
your_df[-s, ]
You can also create the sequence like this:
rep(10:18, len) + rep(18*(0:(len - 1)), each = 9)
# [1] 10 11 12 13 14 15 16 17 18 28 29 30 31 32 33 34
# [17] 35 36 46 47 48 49 50 51 52 53 54 64 65 66 67 68
# [33] 69 70 71 72 82 83 84 85 86 87 88 89 90 100 101 102
# ...
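A quick sanity check on a toy data frame (the data frame below is made up purely for illustration):
df <- data.frame(x = seq_len(450))
nrow(df[-s, , drop = FALSE]) # 225 rows left: 450 - 25 blocks * 9 rows
head(seq_len(450)[-s], 12)   # 1 2 3 4 5 6 7 8 9 19 20 21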

roughly equal spaced integers on a log scale

For the integers between 10 and 100, I want to keep the numbers that are roughly equally spaced on a log scale.
Generally speaking, if the spacing between two integers is much less than 0.5 (scaled by log(10) - log(9), as shown in the code below), then one of the integers should be dropped. But the remaining integers should also be rounded to a multiple of 2, 5 or 10 if possible.
So this would end up as 10, 11, ..., 19, 20, 22, ..., 30, 32, 35, 40, ..., 85, 90, 100.
R> data.frame(delta = diff((log(9:100) - log(10)) / (log(10) - log(9))), n = 10:100)
delta n
1 1.00000000 10
2 0.90461004 11
3 0.82584426 12
4 0.75970307 13
5 0.70337518 14
6 0.65482663 15
7 0.61254940 16
8 0.57540172 17
9 0.54250317 18
10 0.51316398 19
11 0.48683602 20
12 0.46307826 21
13 0.44153178 22
14 0.42190153 23
15 0.40394273 24
16 0.38745060 25
17 0.37225248 26
18 0.35820182 27
19 0.34517337 28
20 0.33305949 29
21 0.32176714 30
22 0.31121547 31
23 0.30133393 32
24 0.29206063 33
25 0.28334109 34
26 0.27512714 35
27 0.26737604 36
28 0.26004974 37
29 0.25311424 38
30 0.24653910 39
31 0.24029693 40
32 0.23436306 41
33 0.22871520 42
34 0.22333316 43
35 0.21819861 44
36 0.21329485 45
37 0.20860667 46
38 0.20412016 47
39 0.19982257 48
40 0.19570222 49
41 0.19174837 50
42 0.18795112 51
43 0.18430136 52
44 0.18079064 53
45 0.17741118 54
46 0.17415574 55
47 0.17101763 56
48 0.16799061 57
49 0.16506888 58
50 0.16224705 59
51 0.15952008 60
52 0.15688327 61
53 0.15433221 62
54 0.15186279 63
55 0.14947115 64
56 0.14715367 65
57 0.14490696 66
58 0.14272783 67
59 0.14061326 68
60 0.13856044 69
61 0.13656670 70
62 0.13462951 71
63 0.13274652 72
64 0.13091548 73
65 0.12913426 74
66 0.12740086 75
67 0.12571338 76
68 0.12407002 77
69 0.12246907 78
70 0.12090892 79
71 0.11938801 80
72 0.11790489 81
73 0.11645817 82
74 0.11504652 83
75 0.11366868 84
76 0.11232346 85
77 0.11100971 86
78 0.10972633 87
79 0.10847228 88
80 0.10724658 89
81 0.10604827 90
82 0.10487644 91
83 0.10373023 92
84 0.10260880 93
85 0.10151136 94
86 0.10043714 95
87 0.09938543 96
88 0.09835551 97
89 0.09734672 98
90 0.09635841 99
91 0.09538996 100
When I plot these manually selected integers, they fall roughly on a straight line. Is there a more intelligent algorithm that can do this job automatically (so that it can easily extend to larger ranges; note that in that case the rounding should be to 20, 25, 50, ... something that divides a power of 10) without having to manually select the data?
R> plot(log(c(10:20, seq(from=22, to=32, by=2), seq(from=35, to=90, by=5), 100)), log='y')
Something like this may be helpful:
int_log <- function(min, max, by = 1, round_to = 1) {
  round_to * round(exp(seq(log(min), log(max), by = log(min + by) - log(min))) / round_to)
}
int_log(10, 100)
#> [1] 10 11 12 13 15 16 18 19 21 24 26 29 31 35 38 42 46 51 56 61 67 74 81 90 98
plot(int_log(10, 100), log = 'y')
plot(int_log(10, 1000, by = 10, round_to = 10), log = 'y')
Created on 2022-04-01 by the reprex package (v2.0.1)
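For intuition, here is a small added check (not part of the answer above): the grid built inside int_log() is exactly equally spaced on the log scale before rounding, so any unevenness in the printed output comes only from the rounding to whole numbers (or to multiples of round_to):
step <- log(11) - log(10)                      # the log-step used by int_log(10, 100)
raw  <- exp(seq(log(10), log(100), by = step)) # log-uniform grid, before rounding
round(diff(log(raw)), 10)                      # every gap is ~0.0953101798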

In igraph, which network specifications allow groups of nodes to have the same distribution?

I am currently trying to generate a network where the degree distribution has a large variance, but with a sufficient number of nodes at each degree. For example, in igraph, if we use the Barabasi-Albert network, we can do:
g <- sample_pa(n=100,power = 1,m = 10)
g_adj <- as.matrix(as_adj(g))
rowSums(g_adj)
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[29] 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
[57] 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
[85] 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
The above shows the degree on each of the 100 nodes. The problem for me is that I would like to only have 10-15 unique degree values, so that instead of having 93 94 95 96 97 98 99 at the end, we have instead, for example, 93 for each of the last 7 nodes. In other words, when I call
unique(rowSums(g_adj))
I'd like at most 10-15 values. Is there a way to "cluster" the nodes instead of having so many different unique degree values? thanks.
You may use sample_degseq(), which generates random graphs with a given degree sequence. For instance,
degrees <- seq(1, 61, length = 10) # Ten different degrees
times <- rep(10, 10) # Giving each of the degrees to ten vertices
g <- sample_degseq(rep(degrees, times = times), method = "vl")
table(degree(g))
# 1 7 14 21 27 34 41 47 54 61
# 10 10 10 10 10 10 10 10 10 10
Note that you may need to play with degrees and times, as ultimately rep(degrees, times = times) needs to be a graphical sequence.
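If you want to test a candidate sequence up front, recent igraph versions also provide is_graphical(); a small illustrative sketch, using the integer degrees from the table above:
library(igraph)
degrees <- c(1, 7, 14, 21, 27, 34, 41, 47, 54, 61) # ten target degrees
deg_seq <- rep(degrees, times = 10)                # each degree assigned to ten vertices
is_graphical(deg_seq)                              # should be TRUE, so sample_degseq() can realise it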

loop over a sequence and rounding problem in R

I want to assign some values to a vector like:
a <- rep(0, 101)
for (i in seq(0, 1, 0.01)) {
  u <- 100 * i + 1
  a[u] <- u
}
a
plot(a)
The output is
> a
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 30 0 31 32 33 34
[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 59 0 60 61 62 63 64 65 66 67 68
[69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
There are problems with the 29th and the 59th elements. They should be 29 and 59, but they turn out to be 0, the default value. And the previous values, the 28th and 58th, are also incorrect. Why is this happening? Thank you!
There is a problem with your indexing: the values produced by seq(0, 1, 0.01) are not all exactly representable in floating point, so 100 * i + 1 can come out slightly below the integer you expect, and R truncates non-integer indices. Here is a modification to your code that avoids computing the index from the floating-point sequence:
a <- rep(0, 101)
s <- seq(0, 1, 0.01)
for (i in 1:101) {
  a[i] <- 100 * s[i] + 1
}
a
plot(a)
In general it is best to avoid juggling multiple indices in the same loop, as it can be confusing and make problems difficult to diagnose.
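A small added illustration of the floating-point issue described above (not part of the original answer):
x <- seq(0, 1, 0.01)[30]      # nominally 0.29, but stored with a tiny error
sprintf("%.17f", 100 * x + 1) # prints a value just below 30
trunc(100 * x + 1)            # 29 -- R truncates fractional indices
round(100 * x) + 1            # 30 -- rounding first gives the intended index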

Select values within/outside of a set of intervals (ranges) R

I've got some sort of index, like:
index <- 1:100
I've also got a list of "exclusion intervals" / ranges
exclude <- data.frame(start = c(5,50, 90), end = c(10,55, 95))
start end
1 5 10
2 50 55
3 90 95
I'm looking for an efficient way (in R) to remove all the indexes that belong in the ranges in the exclude data frame
so the desired output would be:
1,2,3,4, 11,12,...,48,49, 56,57,...,88,89, 96,97,98,99,100
I could do this iteratively: go over every exclusion interval (using ddply) and iteratively remove indexes that fall in each interval. But is there a more efficient way (or function) that does this?
I'm using library(intervals) to calculate my intervals, but I could not find a built-in function that does this.
Another approach that looks valid could be:
starts = findInterval(index, exclude[["start"]])
ends   = findInterval(index, exclude[["end"]]) ## use exclude[["end"]] + 1L here if the upper
                                               ## bounds should be removed from 'index' too
index[starts != (ends + 1L)] ## a value above a lower bound and
                             ## below an upper bound is inside that interval
The main advantage here is that no vector containing all of the intervals' elements needs to be created, and it also handles arbitrary (e.g. non-integer) values that fall inside an interval; e.g.:
set.seed(101); x = round(runif(15, 1, 100), 3)
x
# [1] 37.848 5.339 71.259 66.111 25.736 30.705 58.902 34.013 62.579 55.037 88.100 70.981 73.465 93.232 46.057
x[findInterval(x, exclude[["start"]]) != (findInterval(x, exclude[["end"]]) + 1L)]
# [1] 37.848 71.259 66.111 25.736 30.705 58.902 34.013 62.579 55.037 88.100 70.981 73.465 46.057
We can use Map to get the sequences for the corresponding elements of the 'start' and 'end' columns, unlist to create a single vector, and setdiff to get the values of 'index' that are not in that vector.
setdiff(index,unlist(with(exclude, Map(`:`, start, end))))
#[1] 1 2 3 4 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#[20] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#[39] 45 46 47 48 49 56 57 58 59 60 61 62 63 64 65 66 67 68 69
#[58] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
#[77] 89 96 97 98 99 100
Or we can use rep and then use setdiff.
i1 <- with(exclude, end - start) + 1L
setdiff(index, with(exclude, rep(start, i1) + sequence(i1) - 1))
NOTE: Both methods build the positions that need to be excluded. In the above case the original vector ('index') is itself the sequence 1:100, so I used setdiff on the values. If it contains arbitrary elements, use the position vector for subsetting instead, i.e.
index[-unlist(with(exclude, Map(`:`, start, end)))]
or
index[setdiff(seq_along(index), unlist(with(exclude, Map(`:`, start, end))))]
Another approach
> index[-do.call(c, lapply(1:nrow(exclude), function(x) exclude$start[x]:exclude$end[x]))]
[1] 1 2 3 4 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[25] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 56 57 58 59 60
[49] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
[73] 85 86 87 88 89 96 97 98 99 100
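As an added consistency check (not part of any of the answers above), the findInterval() filter with the "+ 1L" adjustment and the Map()/setdiff() approach give the same 82 kept values for this example:
keep1 <- index[findInterval(index, exclude$start) !=
                 findInterval(index, exclude$end + 1L) + 1L]
keep2 <- setdiff(index, unlist(with(exclude, Map(`:`, start, end))))
identical(keep1, keep2) # TRUE
length(keep1)           # 82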
