Sequence with different intervals in R: matching sensor data

I need a vector that repeats numbers in a sequence at varying intervals. I basically need this
c(rep(1:42, each=6), rep(43:64, each = 7),
rep(65:106, each=6), rep(107:128, each = 7),
.... but I need this to keep going, until almost 2 million.
So I want a vector that looks like
[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 ...
.....
[252] 43 43 43 43 43 43 43 44 44 44 44 44 44 44
....
[400] 64 64 64 64 64 64 65 65 65 65 65 65...
and so on. It's not just alternating between 6 and 7 repetitions, but mostly 6s and fewer 7s, until the whole vector is something like 1.7 million rows. So, is there a loop I can use? Or apply, or replicate? I need the 400th entry in the vector to be 64, the 800th entry to be 128, and so on, in somewhat evenly spaced integers.
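For what it's worth, the 6s-and-7s bookkeeping can be sidestepped entirely by mapping each row index straight to its integer. A minimal sketch, assuming the only hard requirement is that entry 400k equals 64k:
n <- 1.7e6
x <- ceiling(seq_len(n) * 64 / 400)  # blocks of 6 or 7 repeats fall out automatically
x[c(1, 6, 7, 400, 800)]
## [1]   1   1   2  64 128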
UPDATE
Thank you all for the quick clever tricks there. It worked, at least well enough for the deadline I was dealing with. I realize repeating 6 xs and 7 xs is a really dumb way to try to solve this, but it was quick at least. Now that I have some time, though, I would like to get everyone's opinions/ideas on my real underlying issue here.
I have two datasets to merge. They are both sensor datasets, both with stopwatch time as primary keys. But one records every 1/400 of a second, and the other every 1/256 of a second. I have trimmed the top of each so that they start at the exact same moment. But... now what? I have 400 records for each second in one set and 256 records for each second in the other. Is there a way to merge these without losing data? Interpolating or just repeating observations is a-ok (necessary, I think), but I'd rather not throw any data out.
I read this post here, which dealt with using xts and zoo for a very similar problem to mine. But they had nice epoch date/times for each observation. I just have these awful fractions of seconds!
sample data (A):
time dist a_lat
1 139.4300 22 0
2 139.4325 22 0
3 139.4350 22 0
4 139.4375 22 0
5 139.4400 22 0
6 139.4425 22 0
7 139.4450 22 0
8 139.4475 22 0
9 139.4500 22 0
10 139.4525 22 0
sample data (B):
timestamp hex_acc_x hex_acc_y hex_acc_z
1 367065215501 -0.5546875 -0.7539062 0.1406250
2 367065215505 -0.5468750 -0.7070312 0.2109375
3 367065215509 -0.4218750 -0.6835938 0.1796875
4 367065215513 -0.5937500 -0.7421875 0.1562500
5 367065215517 -0.6757812 -0.7773438 0.2031250
6 367065215521 -0.5937500 -0.8554688 0.2460938
7 367065215525 -0.6132812 -0.8476562 0.2109375
8 367065215529 -0.3945312 -0.8906250 0.2031250
9 367065215533 -0.3203125 -0.8906250 0.2226562
10 367065215537 -0.3867188 -0.9531250 0.2578125
(Oh yeah, and btw, the B dataset timestamps are epoch format * 256, because life is hard. I haven't converted it for this because dataset A has nothing like that, just 0.0025-second intervals. Also, the B sensor was left on for hours after the A sensor turned off, so that doesn't help.)

Or if you like, you can try this using apply:
# using this sample data
df <- data.frame(from=c(1,4,7,11), to = c(3,6,10,13),rep=c(6,7,6,7));
> df
# from to rep
#1 1 3 6
#2 4 6 7
#3 7 10 6
#4 11 13 7
unlist(apply(df, 1, function(x) rep(x['from']:x['to'], each=x['rep'])))
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4
#[26] 5 5 5 5 5 5 5 6 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8
#[51] 8 9 9 9 9 9 9 10 10 10 10 10 10 11 11 11 11 11 11 11 12 12 12 12 12
#[76] 12 12 13 13 13 13 13 13 13
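The same result can be had with Map, which avoids apply()'s internal conversion of df to a matrix and stays safe if df ever gains non-numeric columns:
unlist(Map(function(f, t, r) rep(f:t, each = r), df$from, df$to, df$rep))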

Now that you put it that way ... I have absolutely no idea how you are planning on using all of the 6s and 7s. :-)
Regardless, I recommend standardizing the time, adding a "sample" column, and merging on them. Having the "sample" column may facilitate your processing later on.
Your data:
df400 <- structure(list(time = c(139.43, 139.4325, 139.435, 139.4375, 139.44, 139.4425,
139.445, 139.4475, 139.45, 139.4525),
dist = c(22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L),
a_lat = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
.Names = c("time", "dist", "a_lat"),
class = "data.frame", row.names = c(NA, -10L))
df256 <- structure(list(timestamp = c(367065215501, 367065215505, 367065215509, 367065215513,
367065215517, 367065215521, 367065215525, 367065215529,
367065215533, 367065215537),
hex_acc_x = c(-0.5546875, -0.546875, -0.421875, -0.59375, -0.6757812,
-0.59375, -0.6132812, -0.3945312, -0.3203125, -0.3867188),
hex_acc_y = c(-0.7539062, -0.7070312, -0.6835938, -0.7421875,
-0.7773438, -0.8554688, -0.8476562, -0.890625,
-0.890625, -0.953125),
hex_acc_z = c(0.140625, 0.2109375, 0.1796875, 0.15625, 0.203125,
0.2460938, 0.2109375, 0.203125, 0.2226562, 0.2578125)),
.Names = c("timestamp", "hex_acc_x", "hex_acc_y", "hex_acc_z"),
class = "data.frame", row.names = c(NA, -10L))
Standardize your time frames:
colnames(df256)[1] <- 'time'
df400$time <- df400$time - df400$time[1]
df256$time <- (df256$time - df256$time[1]) / 256
Assign a label for easy reference (not that the NAs won't be clear enough):
df400 <- cbind(sample='A', df400, stringsAsFactors=FALSE)
df256 <- cbind(sample='B', df256, stringsAsFactors=FALSE)
And now for the merge and sorting:
dat <- merge(df400, df256, by=c('sample', 'time'), all.x=TRUE, all.y=TRUE)
dat <- dat[order(dat$time),]
dat
## sample time dist a_lat hex_acc_x hex_acc_y hex_acc_z
## 1 A 0.000000 22 0 NA NA NA
## 11 B 0.000000 NA NA -0.5546875 -0.7539062 0.1406250
## 2 A 0.002500 22 0 NA NA NA
## 3 A 0.005000 22 0 NA NA NA
## 4 A 0.007500 22 0 NA NA NA
## 5 A 0.010000 22 0 NA NA NA
## 6 A 0.012500 22 0 NA NA NA
## 7 A 0.015000 22 0 NA NA NA
## 12 B 0.015625 NA NA -0.5468750 -0.7070312 0.2109375
## 8 A 0.017500 22 0 NA NA NA
## 9 A 0.020000 22 0 NA NA NA
## 10 A 0.022500 22 0 NA NA NA
## 13 B 0.031250 NA NA -0.4218750 -0.6835938 0.1796875
## 14 B 0.046875 NA NA -0.5937500 -0.7421875 0.1562500
## 15 B 0.062500 NA NA -0.6757812 -0.7773438 0.2031250
## 16 B 0.078125 NA NA -0.5937500 -0.8554688 0.2460938
## 17 B 0.093750 NA NA -0.6132812 -0.8476562 0.2109375
## 18 B 0.109375 NA NA -0.3945312 -0.8906250 0.2031250
## 19 B 0.125000 NA NA -0.3203125 -0.8906250 0.2226562
## 20 B 0.140625 NA NA -0.3867188 -0.9531250 0.2578125
I'm guessing your data was just a small representation. If I've guessed poorly (that A's times are in seconds and B's timestamps are in 256ths of a second) then just scale differently. Either way, by resetting the first value to zero and then merging/sorting, they are easy to merge and sort.
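If you then want to fill in B's NAs at A's timestamps (interpolation was deemed acceptable above), one sketch uses zoo's na.approx, which interpolates linearly over the time index:
library(zoo)
## linearly interpolate each accelerometer column at every time point;
## na.rm = FALSE keeps leading/trailing NAs where interpolation is impossible,
## and duplicate timestamps (an A row and a B row at the same instant) are
## collapsed by the underlying approx() call
for (col in c('hex_acc_x', 'hex_acc_y', 'hex_acc_z')) {
    dat[[col]] <- na.approx(dat[[col]], x = dat$time, na.rm = FALSE)
}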

alt <- data.frame(len=c(42,22),rep=c(6,7));
alt;
## len rep
## 1 42 6
## 2 22 7
altrep <- function(alt, cyc, len) {
    ## number of elements in one full cycle of the pattern
    cyclen <- sum(alt$len*alt$rep);
    if (missing(cyc)) {
        if (missing(len)) {
            cyc <- 1;
            len <- cyc*cyclen;
        } else {
            cyc <- ceiling(len/cyclen);
        };
    } else if (missing(len)) {
        len <- cyc*cyclen;
    };
    if (isTRUE(all.equal(len,0))) return(integer());
    ## repeat each integer according to the 6s-and-7s pattern, cyc times over
    result <- rep(1:(cyc*sum(alt$len)), rep(rep(alt$rep,alt$len),cyc));
    ## truncate (or NA-pad) to the requested length
    length(result) <- len;
    result;
};
altrep(alt,2);
## [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9
## [52] 9 9 9 10 10 10 10 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17
## [103] 18 18 18 18 18 18 19 19 19 19 19 19 20 20 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25 26 26 26
## [154] 26 26 26 27 27 27 27 27 27 28 28 28 28 28 28 29 29 29 29 29 29 30 30 30 30 30 30 31 31 31 31 31 31 32 32 32 32 32 32 33 33 33 33 33 33 34 34 34 34 34 34
## [205] 35 35 35 35 35 35 36 36 36 36 36 36 37 37 37 37 37 37 38 38 38 38 38 38 39 39 39 39 39 39 40 40 40 40 40 40 41 41 41 41 41 41 42 42 42 42 42 42 43 43 43
## [256] 43 43 43 43 44 44 44 44 44 44 44 45 45 45 45 45 45 45 46 46 46 46 46 46 46 47 47 47 47 47 47 47 48 48 48 48 48 48 48 49 49 49 49 49 49 49 50 50 50 50 50
## [307] 50 50 51 51 51 51 51 51 51 52 52 52 52 52 52 52 53 53 53 53 53 53 53 54 54 54 54 54 54 54 55 55 55 55 55 55 55 56 56 56 56 56 56 56 57 57 57 57 57 57 57
## [358] 58 58 58 58 58 58 58 59 59 59 59 59 59 59 60 60 60 60 60 60 60 61 61 61 61 61 61 61 62 62 62 62 62 62 62 63 63 63 63 63 63 63 64 64 64 64 64 64 64 65 65
## [409] 65 65 65 65 66 66 66 66 66 66 67 67 67 67 67 67 68 68 68 68 68 68 69 69 69 69 69 69 70 70 70 70 70 70 71 71 71 71 71 71 72 72 72 72 72 72 73 73 73 73 73
## [460] 73 74 74 74 74 74 74 75 75 75 75 75 75 76 76 76 76 76 76 77 77 77 77 77 77 78 78 78 78 78 78 79 79 79 79 79 79 80 80 80 80 80 80 81 81 81 81 81 81 82 82
## [511] 82 82 82 82 83 83 83 83 83 83 84 84 84 84 84 84 85 85 85 85 85 85 86 86 86 86 86 86 87 87 87 87 87 87 88 88 88 88 88 88 89 89 89 89 89 89 90 90 90 90 90
## [562] 90 91 91 91 91 91 91 92 92 92 92 92 92 93 93 93 93 93 93 94 94 94 94 94 94 95 95 95 95 95 95 96 96 96 96 96 96 97 97 97 97 97 97 98 98 98 98 98 98 99 99
## [613] 99 99 99 99 100 100 100 100 100 100 101 101 101 101 101 101 102 102 102 102 102 102 103 103 103 103 103 103 104 104 104 104 104 104 105 105 105 105 105 105 106 106 106 106 106 106 107 107 107 107 107
## [664] 107 107 108 108 108 108 108 108 108 109 109 109 109 109 109 109 110 110 110 110 110 110 110 111 111 111 111 111 111 111 112 112 112 112 112 112 112 113 113 113 113 113 113 113 114 114 114 114 114 114 114
## [715] 115 115 115 115 115 115 115 116 116 116 116 116 116 116 117 117 117 117 117 117 117 118 118 118 118 118 118 118 119 119 119 119 119 119 119 120 120 120 120 120 120 120 121 121 121 121 121 121 121 122 122
## [766] 122 122 122 122 122 123 123 123 123 123 123 123 124 124 124 124 124 124 124 125 125 125 125 125 125 125 126 126 126 126 126 126 126 127 127 127 127 127 127 127 128 128 128 128 128 128 128
altrep(alt,len=1000);
## [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9
## [52] 9 9 9 10 10 10 10 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17
## [103] 18 18 18 18 18 18 19 19 19 19 19 19 20 20 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25 26 26 26
## [154] 26 26 26 27 27 27 27 27 27 28 28 28 28 28 28 29 29 29 29 29 29 30 30 30 30 30 30 31 31 31 31 31 31 32 32 32 32 32 32 33 33 33 33 33 33 34 34 34 34 34 34
## [205] 35 35 35 35 35 35 36 36 36 36 36 36 37 37 37 37 37 37 38 38 38 38 38 38 39 39 39 39 39 39 40 40 40 40 40 40 41 41 41 41 41 41 42 42 42 42 42 42 43 43 43
## [256] 43 43 43 43 44 44 44 44 44 44 44 45 45 45 45 45 45 45 46 46 46 46 46 46 46 47 47 47 47 47 47 47 48 48 48 48 48 48 48 49 49 49 49 49 49 49 50 50 50 50 50
## [307] 50 50 51 51 51 51 51 51 51 52 52 52 52 52 52 52 53 53 53 53 53 53 53 54 54 54 54 54 54 54 55 55 55 55 55 55 55 56 56 56 56 56 56 56 57 57 57 57 57 57 57
## [358] 58 58 58 58 58 58 58 59 59 59 59 59 59 59 60 60 60 60 60 60 60 61 61 61 61 61 61 61 62 62 62 62 62 62 62 63 63 63 63 63 63 63 64 64 64 64 64 64 64 65 65
## [409] 65 65 65 65 66 66 66 66 66 66 67 67 67 67 67 67 68 68 68 68 68 68 69 69 69 69 69 69 70 70 70 70 70 70 71 71 71 71 71 71 72 72 72 72 72 72 73 73 73 73 73
## [460] 73 74 74 74 74 74 74 75 75 75 75 75 75 76 76 76 76 76 76 77 77 77 77 77 77 78 78 78 78 78 78 79 79 79 79 79 79 80 80 80 80 80 80 81 81 81 81 81 81 82 82
## [511] 82 82 82 82 83 83 83 83 83 83 84 84 84 84 84 84 85 85 85 85 85 85 86 86 86 86 86 86 87 87 87 87 87 87 88 88 88 88 88 88 89 89 89 89 89 89 90 90 90 90 90
## [562] 90 91 91 91 91 91 91 92 92 92 92 92 92 93 93 93 93 93 93 94 94 94 94 94 94 95 95 95 95 95 95 96 96 96 96 96 96 97 97 97 97 97 97 98 98 98 98 98 98 99 99
## [613] 99 99 99 99 100 100 100 100 100 100 101 101 101 101 101 101 102 102 102 102 102 102 103 103 103 103 103 103 104 104 104 104 104 104 105 105 105 105 105 105 106 106 106 106 106 106 107 107 107 107 107
## [664] 107 107 108 108 108 108 108 108 108 109 109 109 109 109 109 109 110 110 110 110 110 110 110 111 111 111 111 111 111 111 112 112 112 112 112 112 112 113 113 113 113 113 113 113 114 114 114 114 114 114 114
## [715] 115 115 115 115 115 115 115 116 116 116 116 116 116 116 117 117 117 117 117 117 117 118 118 118 118 118 118 118 119 119 119 119 119 119 119 120 120 120 120 120 120 120 121 121 121 121 121 121 121 122 122
## [766] 122 122 122 122 122 123 123 123 123 123 123 123 124 124 124 124 124 124 124 125 125 125 125 125 125 125 126 126 126 126 126 126 126 127 127 127 127 127 127 127 128 128 128 128 128 128 128 129 129 129 129
## [817] 129 129 130 130 130 130 130 130 131 131 131 131 131 131 132 132 132 132 132 132 133 133 133 133 133 133 134 134 134 134 134 134 135 135 135 135 135 135 136 136 136 136 136 136 137 137 137 137 137 137 138
## [868] 138 138 138 138 138 139 139 139 139 139 139 140 140 140 140 140 140 141 141 141 141 141 141 142 142 142 142 142 142 143 143 143 143 143 143 144 144 144 144 144 144 145 145 145 145 145 145 146 146 146 146
## [919] 146 146 147 147 147 147 147 147 148 148 148 148 148 148 149 149 149 149 149 149 150 150 150 150 150 150 151 151 151 151 151 151 152 152 152 152 152 152 153 153 153 153 153 153 154 154 154 154 154 154 155
## [970] 155 155 155 155 155 156 156 156 156 156 156 157 157 157 157 157 157 158 158 158 158 158 158 159 159 159 159 159 159 160 160
You can specify len=1.7e6 (and omit the cyc argument) to get exactly 1.7 million elements, or you can get a whole number of cycles using cyc.
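For instance (one cycle is 42*6 + 22*7 = 406 elements):
length(altrep(alt, len = 1.7e6))  # exactly 1700000 elements
length(altrep(alt, cyc = 3))      # 3 whole cycles: 3 * 406 = 1218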

How about:
len <- 2e6
step <- 400
## each block of 400 entries spans 64 consecutive integers: 400 = 64*6 + 16,
## so 48 of the values repeat 6 times and 16 of them repeat 7 times
x <- rep(64 * seq(0, ceiling(len / step) - 1), each = step) +
    sort(rep(1:64, length.out = step))
x <- x[seq(len)] # drop the extra elements from the final partial block
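A quick check that the spacing requirement holds:
x[c(400, 800)]
## [1]  64 128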

Related

K-nearest neighbor for spatial weights in R

I was wondering if you could help me with this problem. I have a dataset of US counties on which I am trying to do k-nearest neighbor analysis for spatial weighting, following the method proposed here (section 4.5), but the results aren't making sense, or perhaps I'm not understanding them.
library(spdep)
library(tigris)
library(sf)
counties <- counties("Georgia", cb = TRUE)
coords <- st_centroid(st_geometry(counties), of_largest_polygon=TRUE)
col.knn <- knearneigh(coords)
gck4.nb <- knn2nb(knearneigh(coords, k=4, longlat=TRUE))
summary(gck4.nb, coords, longlat=TRUE, scale=0.5)
However, the distances in the output seem rather small, on the order of less than 1 km:
Neighbour list object:
Number of regions: 159
Number of nonzero links: 636
Percentage nonzero weights: 2.515723
Average number of links: 4
Non-symmetric neighbours list
Link number distribution:
4
159
159 least connected regions:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 with 4 links
159 most connected regions:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 with 4 links
Summary of link distances:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1355 0.2650 0.3085 0.3112 0.3482 0.6224
The decimal point is 1 digit(s) to the left of the |
1 | 44
1 | 7799999999999999
2 | 00000000000011111111112222222222222233333333333333333333333333444444
2 | 55555555555555555555555555556666666666666666666666666666666666667777+92
3 | 00000000000000000000000000000001111111111111111111111111111111111111+121
3 | 55555555555555555555555555555556666666666667777777777777777777777777+19
4 | 00000000000111111111112222222222223333333444
4 | 555667777999
5 | 0000014
5 | 7888
6 | 2
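Those link distances look like decimal degrees rather than kilometres (0.31 degrees is roughly 34 km at Georgia's latitude), which suggests the distances are being computed on unprojected lon/lat coordinates. A sketch of one thing to try, assuming EPSG:26917 (UTM zone 17N) is an acceptable planar projection for Georgia:
## project the centroids so link distances come out in metres
coords_proj <- st_transform(coords, 26917)
gck4.proj <- knn2nb(knearneigh(coords_proj, k = 4))
summary(gck4.proj, coords_proj)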

Subset a data frame with a specific condition in R

Hello, I have this df:
res1 res4 aa1234
1 1 4 IVGG
2 10 13 RQFP
3 102 105 TSSV
4 112 115 LQNA
5 118 121 EAGT
6 12 15 FPFL
7 132 135 RSGG
8 138 141 SRFP
9 150 153 PEDQ
10 151 154 EDQC
11 155 158 RPNN
12 165 168 TRRG
13 171 174 CNGD
14 172 175 NGDG
15 174 177 DGGT
16 181 184 CEGL
17 195 198 PCGR
18 20 23 NQGR
19 205 208 RVAL
20 32 35 HARF
21 39 42 AASC
22 40 43 ASCF
23 48 51 PGVS
24 57 60 AYDL
25 59 62 DLRR
26 64 67 ERQS
27 65 68 RQSR
28 78 81 ENGY
29 8 11 RPRQ
30 82 85 DPQQ
31 83 86 PQQN
32 86 89 NLND
33 95 98 LDRE
I want to subset it, keeping only the rows whose res1 value is within 4 of another row's res1 (i.e., rows i and j where res1[j] is between res1[i] and res1[i] + 4), like:
res1 res4 aa1234
29 8 11 RPRQ
6 12 15 FPFL
21 39 42 AASC
22 40 43 ASCF
24 57 60 AYDL
25 59 62 DLRR
26 64 67 ERQS
27 65 68 RQSR
28 78 81 ENGY
30 82 85 DPQQ
31 83 86 PQQN
32 86 89 NLND
9 150 153 PEDQ
10 151 154 EDQC
11 155 158 RPNN
13 171 174 CNGD
14 172 175 NGDG
15 174 177 DGGT
I tried something with the functions "filter" and "subset" but I didn't get the expected result.
So in general, I need the overlap between two rows to fall in the range i to i+4, including i+4.
For example, in these 3 lines there is overlap between rows [9] and [10] (150-153 overlaps with 151-154), but row [11] also corresponds to res1[10] + 4 (151 + 4 = 155). So maybe an idea would be to take res1[i] and check whether res1[i+1] <= res1[i] + 4.
9 150 153 PEDQ
10 151 154 EDQC
11 155 158 RPNN
Why not simply do this?
df[df$res1 %in% c(df$res1 -4,df$res1 -3, df$res1-2, df$res1 -1, df$res1+1,df$res1 +2, df$res1 +3, df$res1 +4),]
res1 res4 aa1234
2 10 13 RQFP
6 12 15 FPFL
9 150 153 PEDQ
10 151 154 EDQC
11 155 158 RPNN
13 171 174 CNGD
14 172 175 NGDG
15 174 177 DGGT
21 39 42 AASC
22 40 43 ASCF
24 57 60 AYDL
25 59 62 DLRR
26 64 67 ERQS
27 65 68 RQSR
28 78 81 ENGY
29 8 11 RPRQ
30 82 85 DPQQ
31 83 86 PQQN
32 86 89 NLND
Edited scenario: just order the df first, and the rest stays the same. See:
df <- df[order(df$res1),]
df[sort(unique(c(which(rev(diff(rev(df$res1))) >= -3 & rev(diff(rev(df$res1))) <= 0), which(diff(df$res1) <= 4 & diff(df$res1) >= 0)+1))),]
res1 res4 aa1234
29 8 11 RPRQ
2 10 13 RQFP
6 12 15 FPFL
21 39 42 AASC
22 40 43 ASCF
24 57 60 AYDL
25 59 62 DLRR
26 64 67 ERQS
27 65 68 RQSR
30 82 85 DPQQ
31 83 86 PQQN
32 86 89 NLND
9 150 153 PEDQ
10 151 154 EDQC
11 155 158 RPNN
13 171 174 CNGD
14 172 175 NGDG
15 174 177 DGGT
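For readability, the one-liner can be unpacked into the same logic (my restatement, not the answer's original code):
d <- diff(df$res1)                       # gaps between consecutive ordered res1 values
keep_prev <- which(d >= 0 & d <= 3)      # rows whose next res1 is within 3
keep_next <- which(d >= 0 & d <= 4) + 1  # rows whose previous res1 is within 4
df[sort(unique(c(keep_prev, keep_next))), ]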
Old answer: use this
df[sort(unique(c(which(rev(diff(rev(df$res1))) >= -3 & rev(diff(rev(df$res1))) <= 0), which(diff(df$res1) <= 4 & diff(df$res1) >= 0)+1))),]
res1 res4 aa1234
9 150 153 PEDQ
10 151 154 EDQC
11 155 158 RPNN
13 171 174 CNGD
14 172 175 NGDG
15 174 177 DGGT
21 39 42 AASC
22 40 43 ASCF
24 57 60 AYDL
25 59 62 DLRR
26 64 67 ERQS
27 65 68 RQSR
30 82 85 DPQQ
31 83 86 PQQN
32 86 89 NLND
Data used
df <- read.table(text = "res1 res4 aa1234
1 1 4 IVGG
2 10 13 RQFP
3 102 105 TSSV
4 112 115 LQNA
5 118 121 EAGT
6 12 15 FPFL
7 132 135 RSGG
8 138 141 SRFP
9 150 153 PEDQ
10 151 154 EDQC
11 155 158 RPNN
12 165 168 TRRG
13 171 174 CNGD
14 172 175 NGDG
15 174 177 DGGT
16 181 184 CEGL
17 195 198 PCGR
18 20 23 NQGR
19 205 208 RVAL
20 32 35 HARF
21 39 42 AASC
22 40 43 ASCF
23 48 51 PGVS
24 57 60 AYDL
25 59 62 DLRR
26 64 67 ERQS
27 65 68 RQSR
28 78 81 ENGY
29 8 11 RPRQ
30 82 85 DPQQ
31 83 86 PQQN
32 86 89 NLND
33 95 98 LDRE", header = T)

Randomizing equal values when ordering in R

I need to get the order of one vector to sort another vector. The point is I don't want my sort to be stable; in fact, I'd like to randomize the order of equal values. Any idea how to do this in R in finite time? :D
Thanks for any help.
You can do this in base R using order. order will take multiple variables to sort on. If you make the second one a random variable, it will randomize the ties. Here is an example using the built-in iris data. The variable Sepal.Length has several ties for second lowest value. Here are some:
iris$Sepal.Length[c(9,39,43)]
[1] 4.4 4.4 4.4
Now let's sort just that variable (stable sort) and then sort with a random secondary sort.
order(iris$Sepal.Length)
[1] 14 9 39 43 42 4 7 23 48 3 30 12 13 25 31 46 2 10 35
[20] 38 58 107 5 8 26 27 36 41 44 50 61 94 1 18 20 22 24 40
[39] 45 47 99 28 29 33 60 49 6 11 17 21 32 85 34 37 54 81 82
[58] 90 91 65 67 70 89 95 122 16 19 56 80 96 97 100 114 15 68 83
[77] 93 102 115 143 62 71 150 63 79 84 86 120 139 64 72 74 92 128 135
[96] 69 98 127 149 57 73 88 101 104 124 134 137 147 52 75 112 116 129 133
[115] 138 55 105 111 117 148 59 76 66 78 87 109 125 141 145 146 77 113 144
[134] 53 121 140 142 51 103 110 126 130 108 131 106 118 119 123 136 132
order(iris$Sepal.Length, sample(150,150))
[1] 14 43 39 9 42 48 7 4 23 3 30 25 31 46 13 12 35 38 107
[20] 10 58 2 8 41 27 61 94 5 36 44 50 26 18 22 99 40 20 47
[39] 24 45 1 33 60 29 28 49 85 11 6 32 21 17 90 81 91 54 34
[58] 37 82 67 122 95 65 70 89 100 96 56 114 80 16 19 97 93 15 68
[77] 143 102 83 115 150 62 71 120 79 84 63 139 86 72 135 74 64 92 128
[96] 149 69 98 127 88 134 101 57 137 73 104 147 124 138 112 129 116 75 52
[115] 133 148 55 111 105 117 59 76 87 66 78 146 141 109 125 145 144 113 77
[134] 140 53 121 142 51 103 126 130 110 108 131 106 136 119 118 123 132
Without the random secondary sort, positions 2,3,and 4 are in order (stable). With the random secondary sort, they are jumbled.
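If the jumbled result needs to be reproducible, seed the RNG before ordering (the seed value here is arbitrary):
set.seed(42)
order(iris$Sepal.Length, sample(150, 150))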
Try fct_reorder in the forcats package to order one factor by another. If you want to introduce randomness as well, try fct_reorder2 with .y = runif(length(your_vector))
(I'm apparently thinking in strange directions today - fct_reorder will reorder the levels of a factor. If that's what you are after, this may help. Otherwise, order is the better approach.)

The train function in the R caret package

Suppose I have a data set and I want to do a 4-fold cross validation using logistic regression. So there will be 4 different models. In R, I did the following:
ctrl <- trainControl(method = "repeatedcv", number = 4, savePredictions = TRUE)
mod_fit <- train(outcome ~., data=data1, method = "glm", family="binomial", trControl = ctrl)
I would assume that mod_fit should contain 4 separate sets of coefficients? When I type mod_fit$finalModel I just get a single set of coefficients.
I've created a reproducible example based on your code snippet. The first thing to notice about your code is that it specifies repeatedcv as the method but doesn't give any repeats, so the number = 4 parameter is just telling it to resample 4 times (this is not an answer to your question, but it's important to understand).
mod_fit$finalModel gives you only 1 set of coefficients because it is the single final model that caret refits on the full training set after resampling; the 4 folds are only used to estimate performance.
You can see the fold-level performance in the resample object:
library(caret)
library(mlbench)
data(iris)
iris$binary <- ifelse(iris$Species=="setosa",1,0)
iris$Species <- NULL
ctrl <- trainControl(method = "repeatedcv",
                     number = 4,
                     savePredictions = TRUE,
                     verboseIter = TRUE,
                     returnResamp = "all")
mod_fit <- train(binary ~ .,
                 data = iris,
                 method = "glm",
                 family = "binomial",
                 trControl = ctrl)
# Fold-level Performance
mod_fit$resample
RMSE Rsquared parameter Resample
1 2.630866e-03 0.9999658 none Fold1.Rep1
2 3.863821e-08 1.0000000 none Fold2.Rep1
3 8.162472e-12 1.0000000 none Fold3.Rep1
4 2.559189e-13 1.0000000 none Fold4.Rep1
To your earlier point, the package is not going to save and display the coefficients of each fold. In addition to the performance information above, it does however save the index (list of in-sample rows), indexOut (held-out rows), and random seeds for each fold, so if you were so inclined it would be easy to reconstruct the intermediate models.
mod_fit$control$seeds
[[1]]
[1] 169815
[[2]]
[1] 445763
[[3]]
[1] 871613
[[4]]
[1] 706905
[[5]]
[1] 89408
mod_fit$control$index
$Fold1
[1] 1 2 3 4 5 6 7 8 9 10 11 12 15 18 19 21 22 24 28 30 31 32 33 34 35 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 59 60 61 63
[45] 64 65 66 68 69 70 71 72 73 75 76 77 79 80 81 82 84 85 86 87 89 90 91 92 93 94 95 96 98 99 100 103 104
106 107 108 110 111 113 114 116 118 119 120
[89] 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 140 141 142 143 145 147 149 150
$Fold2
[1] 1 6 7 8 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 42
44 46 48 50 51 53 54 55 56 57 58
[45] 59 61 62 64 66 67 69 70 71 72 73 74 75 76 78 79 80 81 82 83 84 85 87 88 89 90 91 92 95 96 97 98 99
101 102 104 105 106 108 109 111 112 113 115
[89] 116 117 119 120 121 122 123 127 130 131 132 134 135 137 138 139 140 141 142 143 144 145 146 147 148
$Fold3
[1] 2 3 4 5 6 7 8 9 10 11 13 14 16 17 20 23 24 25 26 27 28 29 30 33 35 36 37 38 39 40 41 43 45
46 47 49 50 51 52 54 55 56 57 58
[45] 60 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 82 83 84 85 86 88 89 93 94 97 98 99 100 101 102
103 105 106 107 108 109 110 111 112 114 115
[89] 117 118 119 121 124 125 126 128 129 131 132 133 134 135 136 137 138 139 144 145 146 147 148 149 150
$Fold4
[1] 1 2 3 4 5 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 29 31 32 34 36 37 38 39 41
42 43 44 45 47 48 49 52 53 55 56
[45] 57 58 59 60 61 62 63 65 67 68 74 77 78 79 80 81 83 86 87 88 90 91 92 93 94 95 96 97 100 101 102 103 104
105 107 109 110 112 113 114 115 116 117 118
[89] 120 122 123 124 125 126 127 128 129 130 133 136 137 138 139 140 141 142 143 144 146 148 149 150
mod_fit$control$indexOut
$Resample1
[1] 13 14 16 17 20 23 25 26 27 29 36 37 38 39 55 56 57 58 62 67 74 78 83 88 97 101 102 105 109 112 115 117 137
138 139 144 146 148
$Resample2
[1] 2 3 4 5 9 10 11 24 41 43 45 47 49 52 60 63 65 68 77 86 93 94 100 103 107 110 114 118 124 125 126 128 129
133 136 149 150
$Resample3
[1] 1 12 15 18 19 21 22 31 32 34 42 44 48 53 59 61 79 80 81 87 90 91 92 95 96 104 113 116 120 122 123 127 130
140 141 142 143
$Resample4
[1] 6 7 8 28 30 33 35 40 46 50 51 54 64 66 69 70 71 72 73 75 76 82 84 85 89 98 99 106 108 111 119 121 131
132 134 135 145 147
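As a sketch (my own reconstruction, not something caret exposes directly), one fold's model can be refit from those saved in-sample rows:
## refit fold 1's model; 'iris' is the modified copy with the 'binary' column
## from above (expect a perfect-separation warning on this toy data)
fold1_glm <- glm(binary ~ ., data = iris[mod_fit$control$index$Fold1, ],
                 family = binomial)
coef(fold1_glm)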
@Damien, your mod_fit will not contain 4 separate sets of coefficients. You are asking for cross-validation with 4 folds. This does not mean you will have 4 different models. According to the documentation here, the train function works as follows:
At the end of the resampling loop (in your case, 4 iterations for 4 folds) you will have one set of average forecast accuracy measures (e.g., RMSE, R-squared) for a given set of model parameters.
Since you did not use the tuneGrid or tuneLength argument in the train function, by default train will tune over three values of each tunable parameter.
This means you will have at most three models (not 4 models as you were expecting) and therefore three sets of average model performance measures.
The optimum model is the one with the lowest RMSE in the case of regression. That model's coefficients are available in mod_fit$finalModel.

Subset a data frame multiple times

I want to split a data frame that consists of 20 variables (continuous and categorical) into two parts representing 70% and 30%, 100 times. The iris data set works as an example.
data(iris)
test.rows <- sample(1:nrow(iris), 105)
iris.70 <- iris[test.rows, ]
iris.30 <- iris[-test.rows, ]
This gives the data frames I want. But how can I do this 100 times and store the results somewhere to employ them later?
I have been trying
output <- list()
for (i in 1:surveyed100) {
    output[[i]] <- test.rows <- sample(1:nrow(surveyed100), 246)
}
But it tells me: "numerical expression has 5 elements: only the first used".
I would appreciate your help.
First create 100 samples:
samples <- list()
for (i in 1:100) {
    samples[[i]] <- sample(1:nrow(surveyed100), 246)
}
Then use lapply() to store all 100 subsets in a list:
output <- lapply(samples, function(x) list(surveyed100[x,], surveyed100[-x,]))
Example using iris:
samples <- list()
for (i in 1:100) {
    samples[[i]] <- sample(1:nrow(iris), 105)
}
head(samples)
[[1]]
[1] 66 106 39 50 33 123 68 62 65 125 30 25 60 70 49 98 140 44 141 94 18 59 117 32 63 133 16 139 97 145 105 78 112 95
[35] 128 36 37 64 10 124 40 111 17 29 51 89 99 4 135 103 101 19 115 74 73 91 11 67 84 88 1 114 138 21 77 24 69 13
[69] 53 58 110 150 9 31 144 54 129 34 35 52 142 14 113 127 27 20 87 134 118 15 72 92 75 8 104 96 136 143 2 41 109 90
[103] 146 26 6
[[2]]
[1] 78 84 89 75 63 81 119 51 127 20 66 106 140 65 116 72 147 141 61 113 130 136 109 49 57 149 90 56 8 46 82 55 38 4
[35] 70 94 100 117 95 29 45 13 128 11 83 80 35 41 121 73 39 67 19 98 108 103 42 2 44 132 114 137 118 12 125 24 77 53
[69] 28 150 92 5 43 112 60 122 15 30 104 102 120 76 47 85 40 79 33 143 48 139 148 124 36 16 138 101 115 107 134 126 74 6
[103] 52 50 10
[[3]]
[1] 23 67 54 131 84 146 25 7 41 101 138 49 28 95 15 5 57 69 126 60 12 92 35 89 50 1 13 77 140 116 136 17 144 64
[35] 32 139 76 102 61 130 2 44 75 100 81 31 34 46 72 33 18 79 24 133 124 62 9 88 8 66 74 125 51 127 123 52 90 39
[69] 120 42 16 83 40 137 47 58 82 135 96 20 119 91 36 48 132 55 93 106 107 109 113 53 19 141 105 128 78 143 29 4 45 37
[103] 73 94 87
[[4]]
[1] 125 41 37 80 136 50 91 89 44 117 132 82 78 128 146 49 61 105 145 83 111 126 100 94 7 102 112 17 120 60 36 104 123 65
[35] 48 34 45 73 25 46 110 74 66 137 107 101 106 24 97 18 119 72 33 134 87 35 121 14 88 9 39 8 64 142 10 148 54 99
[69] 103 95 63 11 133 141 32 96 51 81 140 76 138 127 52 75 55 26 115 19 90 16 21 86 56 22 79 53 31 23 68 13 77 30
[103] 71 116 67
[[5]]
[1] 83 4 85 133 111 55 145 65 81 50 136 64 13 27 5 117 33 69 40 127 80 61 53 125 77 36 124 140 138 86 7 6 79 29
[35] 21 115 23 74 93 10 132 51 2 41 49 123 94 142 120 48 19 89 28 91 14 118 43 103 87 58 149 20 56 113 82 62 104 44
[69] 72 47 119 35 143 116 128 26 75 88 9 60 16 130 114 31 1 147 78 73 3 32 70 146 131 102 15 54 141 129 42 101 17 59
[103] 46 134 110
[[6]]
[1] 18 20 53 106 142 125 120 109 119 129 84 146 99 51 43 91 141 89 131 124 95 135 81 42 73 112 128 133 108 27 28 47 32 76
[35] 130 138 70 36 10 90 16 11 137 17 87 5 35 25 123 97 12 115 127 94 34 103 4 54 134 78 68 71 101 126 61 37 33 2
[69] 88 80 144 82 150 3 21 114 58 110 136 22 105 117 79 64 102 49 98 59 132 39 8 149 121 40 29 104 55 77 147 74 50 56
[103] 48 75 23
Subsets:
output <- lapply(samples, function(x) list(iris[x,], iris[-x,]))
Output:
head(output[[1]][[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
66 6.7 3.1 4.4 1.4 versicolor
106 7.6 3.0 6.6 2.1 virginica
39 4.4 3.0 1.3 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa
33 5.2 4.1 1.5 0.1 setosa
123 7.7 2.8 6.7 2.0 virginica
head(output[[1]][[2]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3 4.7 3.2 1.3 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
7 4.6 3.4 1.4 0.3 setosa
12 4.8 3.4 1.6 0.2 setosa
22 5.1 3.7 1.5 0.4 setosa
23 4.6 3.6 1.0 0.2 setosa
> nrow(output[[1]][[1]])
[1] 105
> nrow(output[[1]][[2]])
[1] 45
You can build a little function to do this for you, for example:
foo <- function(dat, train_percent = 0.7) {
    n <- seq_len(nrow(dat))
    ## sample train_percent of the row indices for the training set
    train <- sample(n, floor(train_percent * max(n)))
    ## the remaining rows, shuffled, form the test set
    test <- sample(setdiff(n, train))
    list(train = dat[train,], test = dat[test,])
}
Then you can easily apply this function multiple times using replicate:
replicate(100, foo(iris), simplify = FALSE)
The result list has 100 elements and each element is itself a list of two elements, where the first is the "train" and the second the "test" dataset.
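For example, pulling out the pieces afterwards (the train/test names come from the function above):
splits <- replicate(100, foo(iris), simplify = FALSE)
nrow(splits[[3]]$train)  # 105 (70% of 150)
nrow(splits[[3]]$test)   # 45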
