Suppose I have a dataframe df, which contains 450 rows. I want to delete rows from 10 through 18 (that is 10, 11, 12, 13, 14, 15, 16, 17, 18). Then similarly rows from 28 through 36, then 46:54. And so on, up to deleting rows from 442 through 450.
Any suggestions, guys?
Create a sequence and remove those rows. The first argument, nvec, is the length of each sequence (9, repeated for each sequence); the second, from, is the starting point for each sequence (10, 28, ...).
n = 450
len = n %/% 18
s <- sequence(nvec = rep(9, len), from = seq(10, n, 18))
# [1] 10 11 12 13 14 15 16 17 18 28 29 30 31 32 33 34
# [17] 35 36 46 47 48 49 50 51 52 53 54 64 65 66 67 68
# [33] 69 70 71 72 82 83 84 85 86 87 88 89 90 100 101 102
# ...
your_df[-s, ]
You can also create the sequence like this:
rep(10:18, len) + rep(18*(0:(len - 1)), each = 9)
# [1] 10 11 12 13 14 15 16 17 18 28 29 30 31 32 33 34
# [17] 35 36 46 47 48 49 50 51 52 53 54 64 65 66 67 68
# [33] 69 70 71 72 82 83 84 85 86 87 88 89 90 100 101 102
# ...
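The same rows can also be dropped without building the index vector, by looking at each row's position inside its block of 18. This is only a sketch: the data frame below is a stand-in for illustration, and the names pos and kept are made up here. It also avoids the from argument of sequence(), which I believe needs a reasonably recent version of R.

```r
n <- 450
df <- data.frame(id = seq_len(n))   # stand-in data frame, for illustration only

# Position of each row inside its block of 18: 0, 1, ..., 17
pos <- (seq_len(nrow(df)) - 1) %% 18

# Keep the first 9 rows of every block, drop rows 10 to 18 of the block
kept <- df[pos < 9, , drop = FALSE]
nrow(kept)
```

drop = FALSE is there so that a one-column data frame stays a data frame after subsetting.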
For the integers between 10 and 100, I want to keep the numbers that are roughly equally spaced in a log-scale.
Generally speaking, if the spacing between two integers is much less than .5 (scaled by log(10)-log(9), as shown in the code below), then one integer should be dropped. The remaining integers should also be rounded to a multiple of 2, 5 or 10 if possible.
So this would end up 10, 11, ..., 19, 20, 22, ..., 30, 32, 35, 40, ..., 85, 90, 100.
R> data.frame(delta=diff((log(9:100)-log(10))/(log(10)-log(9))), n=10:100)
delta n
1 1.00000000 10
2 0.90461004 11
3 0.82584426 12
4 0.75970307 13
5 0.70337518 14
6 0.65482663 15
7 0.61254940 16
8 0.57540172 17
9 0.54250317 18
10 0.51316398 19
11 0.48683602 20
12 0.46307826 21
13 0.44153178 22
14 0.42190153 23
15 0.40394273 24
16 0.38745060 25
17 0.37225248 26
18 0.35820182 27
19 0.34517337 28
20 0.33305949 29
21 0.32176714 30
22 0.31121547 31
23 0.30133393 32
24 0.29206063 33
25 0.28334109 34
26 0.27512714 35
27 0.26737604 36
28 0.26004974 37
29 0.25311424 38
30 0.24653910 39
31 0.24029693 40
32 0.23436306 41
33 0.22871520 42
34 0.22333316 43
35 0.21819861 44
36 0.21329485 45
37 0.20860667 46
38 0.20412016 47
39 0.19982257 48
40 0.19570222 49
41 0.19174837 50
42 0.18795112 51
43 0.18430136 52
44 0.18079064 53
45 0.17741118 54
46 0.17415574 55
47 0.17101763 56
48 0.16799061 57
49 0.16506888 58
50 0.16224705 59
51 0.15952008 60
52 0.15688327 61
53 0.15433221 62
54 0.15186279 63
55 0.14947115 64
56 0.14715367 65
57 0.14490696 66
58 0.14272783 67
59 0.14061326 68
60 0.13856044 69
61 0.13656670 70
62 0.13462951 71
63 0.13274652 72
64 0.13091548 73
65 0.12913426 74
66 0.12740086 75
67 0.12571338 76
68 0.12407002 77
69 0.12246907 78
70 0.12090892 79
71 0.11938801 80
72 0.11790489 81
73 0.11645817 82
74 0.11504652 83
75 0.11366868 84
76 0.11232346 85
77 0.11100971 86
78 0.10972633 87
79 0.10847228 88
80 0.10724658 89
81 0.10604827 90
82 0.10487644 91
83 0.10373023 92
84 0.10260880 93
85 0.10151136 94
86 0.10043714 95
87 0.09938543 96
88 0.09835551 97
89 0.09734672 98
90 0.09635841 99
91 0.09538996 100
When I plot these manually selected integers, they lie roughly on a straight line. Is there a more intelligent algorithm that can do this job automatically, without having to select the data manually? It should extend easily to larger ranges; note that in that case the rounding should be to 20, 25, 50, ..., that is, to something that divides a power of 10.
R> plot(log(c(10:20, seq(from=22, to=32, by=2), seq(from=35, to=90, by=5), 100)), log='y')
Something like this may be helpful:
int_log <- function(min, max, by = 1, round_to = 1) {
round_to * round(exp(seq(log(min), log(max), log(min + by) - log(min)))/round_to)
}
int_log(10, 100)
#> [1] 10 11 12 13 15 16 18 19 21 24 26 29 31 35 38 42 46 51 56 61 67 74 81 90 98
plot(int_log(10, 100), log = 'y')
plot(int_log(10, 1000, by = 10, round_to = 10), log = 'y')
Created on 2022-04-01 by the reprex package (v2.0.1)
How can I cut the values 1 to 100 at a regular interval of 25 and place them into 4 groups, as below?
sdr <- c(1:100)
Group1: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Group2: 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Group3: 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
Group4: 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
Any suggestion, please.
You could use split
sdr <- 1:100
split(sdr, rep(1:4, each = 25))
#$`1`
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#
#$`2`
# [1] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#
#$`3`
# [1] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
#
#$`4`
# [1] 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
#[20] 95 96 97 98 99 100
This returns a list with 4 vector elements.
Also note that the c() around 1:100 is not necessary.
Or we can define the number of groups
ngroup <- 4
split(sdr, rep(1:ngroup, each = length(sdr) %/% ngroup))
giving the same result.
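If the chunk size is known rather than the number of groups, a common idiom is to build the grouping factor from the element positions. This is a sketch under that assumption; chunk_size and groups are names chosen here for illustration. Unlike the rep(..., each = ) form, it also copes with a vector whose length is not a multiple of the chunk size: the last group is simply shorter.

```r
sdr <- 1:100
chunk_size <- 25

# Elements 1..25 get label 1, 26..50 get label 2, and so on
groups <- split(sdr, ceiling(seq_along(sdr) / chunk_size))
lengths(groups)

# Length not divisible by the chunk size: the last chunk holds the remainder
x <- 1:10
split(x, ceiling(seq_along(x) / 4))
```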
You can make a dataframe for your groups and then transpose using t:
df <- t(data.frame(Group1 = c(1:25), Group2 = c(26:50), Group3 = c(51:75), Group4 = c(76:100)))
I want to assign some values to a vector like this:
a = rep(0, 101)
for(i in seq(0, 1, 0.01)){
u <- 100 * i + 1
a[u] <- u
}
a
plot(a)
The output is
> a
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 30 0 31 32 33 34
[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 59 0 60 61 62 63 64 65 66 67 68
[69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
There are problems on the 29th and the 59th elements. They should be 29 and 59, but it turns out to be 0, the default value. And the previous values, the 28th and 58th, are also incorrect. Why is this happening? Thank you!
There is a problem with your indexing. I don't know how to explain why it doesn't work as written, but here is a modification to your code that works:
a = rep(0, 101)
s<-seq(0, 1, 0.01)
for(i in 1:101){
a[i] <- 100 * s[i] + 1
}
a
plot(a)
In general it is best to avoid multiple indexes in the same loop as it can be confusing and difficult to diagnose problems.
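A likely explanation, consistent with the output shown in the question, is floating-point rounding. 100 * i + 1 is computed in double precision, so for some i the result is very slightly below a whole number. When a non-integer value is used as an index, R truncates it toward zero, so a value just under 30 addresses element 29 and element 30 is never written. The sketch below keeps the original loop and only rounds the index first.

```r
s <- seq(0, 1, 0.01)
u_all <- 100 * s + 1         # prints as 1, 2, ..., 101

# Positions where the computed value is not exactly a whole number
which(u_all != round(u_all))

# Same loop as in the question, but with the index rounded before use
a <- rep(0, 101)
for (i in s) {
  u <- round(100 * i + 1)
  a[u] <- u
}
a
```

Rounding is safe here because the intended values are known to be whole numbers; the rounding only removes the tiny representation error.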
I am experiencing some strange behavior in R when trying to index a matrix with another matrix. I run into a subscript out of bounds error when indexing with a two-column matrix, but not with a four-column matrix. See the following reproducible code. Any insight would be appreciated!
This
data <- matrix(rbinom(100, 1, .5), nrow = 10)
idx <- cbind(1:50, 51:100)
data[idx]
results in:
Error in data[idx] : subscript out of bounds
However
data[cbind(idx,idx)]
works.
My session info:
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin15.5.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
The key insight as to why this isn't working is given in ?'[':
When indexing arrays by [ a single argument i can be a matrix with as many columns as there are dimensions of x; the result is then a vector with elements corresponding to the sets of indices in each row of i.
so it is clear why the subscript out of bounds error arises: data doesn't have 50 rows and 100 columns.
What's happening in the second example is that the indexing matrix is just being treated as a vector, because it has more columns than the matrix being indexed has dimensions, and it is extracting elements c(1:100, 1:100) from data.
This is more easily seen with
m <- matrix(1:100, ncol = 10, byrow = TRUE)
and indexing with cbind(idx, idx) gives
> m[cbind(idx,idx)]
[1] 1 11 21 31 41 51 61 71 81 91 2 12 22 32 42 52 62 72
[19] 82 92 3 13 23 33 43 53 63 73 83 93 4 14 24 34 44 54
[37] 64 74 84 94 5 15 25 35 45 55 65 75 85 95 6 16 26 36
[55] 46 56 66 76 86 96 7 17 27 37 47 57 67 77 87 97 8 18
[73] 28 38 48 58 68 78 88 98 9 19 29 39 49 59 69 79 89 99
[91] 10 20 30 40 50 60 70 80 90 100 1 11 21 31 41 51 61 71
[109] 81 91 2 12 22 32 42 52 62 72 82 92 3 13 23 33 43 53
[127] 63 73 83 93 4 14 24 34 44 54 64 74 84 94 5 15 25 35
[145] 45 55 65 75 85 95 6 16 26 36 46 56 66 76 86 96 7 17
[163] 27 37 47 57 67 77 87 97 8 18 28 38 48 58 68 78 88 98
[181] 9 19 29 39 49 59 69 79 89 99 10 20 30 40 50 60 70 80
[199] 90 100
which is the same as
m[c(idx[,1], idx[,2], idx[,1], idx[,2])]
or specifically,
m[c(1:50, 51:100, 1:50, 51:100)]
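For contrast, here is a small sketch of two-column matrix indexing that stays within bounds, where each row of the index matrix is a (row, column) pair. The names idx2, d, pairs and v are made up for this example. arrayInd() is used to turn linear positions into such pairs.

```r
m <- matrix(1:100, ncol = 10, byrow = TRUE)

# Each row of idx2 is a (row, column) pair; this picks the diagonal of m
idx2 <- cbind(1:10, 1:10)
d <- m[idx2]

# Convert linear positions 1..5 into (row, column) pairs and index with them
pairs <- arrayInd(1:5, dim(m))
v <- m[pairs]
```

Indexing with pairs selects the same elements as m[1:5], because the pairs are just those linear positions expressed as row and column numbers.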
Using the timeslicing in caret and its parameters,
how do I split up data with xyz rows into slices that each have a length of 12?
Ideally, also considering the 60-20-20 train-test-validate ratio.
Should I set it like so:
initialWindow=12, horizon=12, fixedWindow=TRUE?
I've read the documentation but this is still unclear to me.
You can try out what happens using an example vector like 1:100.
If you set initialWindow = 12 and fixedWindow = T, the training sets will always have 12 rows. horizon specifies the number of subsequent observations that will be included in each test set. If it is set to 12 and you do not want any rows to be predicted multiple times, skip has to be set to (horizon - 1).
A 60-20-20 partitioning can be achieved, for example, by setting initialWindow to the size of the first 60% of the data, first running your model on the first half of the slices, and using the second half of the slices as the last 20%.
I don't know if you are trying to use timeslicing inside of caret's train function already. In any case, you can experiment with the different settings using the createTimeSlices() function:
library(caret)
dat <- 1:100
slices <- createTimeSlices(dat, initialWindow = 12, horizon = 1,
skip = 0, fixedWindow = T)
slices # 88 test and train sets
# [...]
slices <- createTimeSlices(y = dat, initialWindow = 12, horizon = 12,
skip = 11, fixedWindow = T)
slices
# 7 test and train sets, observations 97 - 100 not in any test set
$train
$train$Training01
[1] 1 2 3 4 5 6 7 8 9 10 11 12
$train$Training13
[1] 13 14 15 16 17 18 19 20 21 22 23 24
$train$Training25
[1] 25 26 27 28 29 30 31 32 33 34 35 36
$train$Training37
[1] 37 38 39 40 41 42 43 44 45 46 47 48
$train$Training49
[1] 49 50 51 52 53 54 55 56 57 58 59 60
$train$Training61
[1] 61 62 63 64 65 66 67 68 69 70 71 72
$train$Training73
[1] 73 74 75 76 77 78 79 80 81 82 83 84
$test
$test$Testing01
[1] 13 14 15 16 17 18 19 20 21 22 23 24
$test$Testing13
[1] 25 26 27 28 29 30 31 32 33 34 35 36
$test$Testing25
[1] 37 38 39 40 41 42 43 44 45 46 47 48
$test$Testing37
[1] 49 50 51 52 53 54 55 56 57 58 59 60
$test$Testing49
[1] 61 62 63 64 65 66 67 68 69 70 71 72
$test$Testing61
[1] 73 74 75 76 77 78 79 80 81 82 83 84
$test$Testing73
[1] 85 86 87 88 89 90 91 92 93 94 95 96
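To see where the window boundaries in the output above come from, the slicing can be mimicked in a few lines of base R. This is an illustrative sketch only, not caret's implementation: make_slices is a hypothetical helper written for this example, and it covers only the fixed-window case.

```r
# Hypothetical helper: fixed-window slices over the positions 1..n
make_slices <- function(n, window, horizon, skip = 0) {
  # Last training row of each slice; skip thins out the candidate slices
  stops <- seq(window, n - horizon, by = skip + 1)
  list(
    train = lapply(stops, function(s) (s - window + 1):s),
    test  = lapply(stops, function(s) (s + 1):(s + horizon))
  )
}

sl <- make_slices(100, window = 12, horizon = 12, skip = 11)
length(sl$train)
sl$test[[length(sl$test)]]
```

With these settings the last training window has to end early enough to leave 12 rows for testing, which is why the final rows of the data never appear in a test set.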
Setting skip to initialWindow + horizon - 1 ensures that successive slices (each training fold together with its testing fold) do not overlap.
timeSlices <- createTimeSlices(1:nrow(DF), initialWindow = 36, horizon = 6, skip = 41, fixedWindow = TRUE)
trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]
testSlices <- do.call(rbind.data.frame, testSlices)
trainSlices <- do.call(rbind.data.frame, trainSlices)