I'm working with a large matrix (187,682,789 x 5).
Say it's built like this:
Day1 <- rep(1, 10)
Lat=sample(30:33, 10, replace=T)
Lon=sample(-30:-33, 10, replace=T)
Var=runif(10,1,100)
Mat1<-cbind(Day1,Lat,Lon,Var)
Day2 <- rep(2, 10)
Lat=sample(30:33, 10, replace=T)
Lon=sample(-30:-33, 10, replace=T)
Var=runif(10,1,100)
Mat2<-cbind(Day2,Lat,Lon,Var)
#... And so on, but let's stick to 2 days for the example
Mat = rbind(Mat1,Mat2)
Of course, here there is redundancy in the unique Lat/Lon combinations.
position=cbind(Mat[,2],Mat[,3]) # Lat Lon
nrow(unique(position)) < nrow(position) #True
I would like to obtain a matrix that shows all the unique Lat/Lon combinations, followed by the corresponding variable for each day.
For example:
> Mat
Day Lat Lon Var
[1,] 1 36 -36 51.086210
[2,] 1 37 -37 48.486008
[3,] 1 38 -38 39.482635
[4,] 1 39 -39 97.848232
[5,] 1 40 -40 71.076543
[6,] 2 31 -31 5.641855
[7,] 2 32 -32 62.124584
[8,] 2 33 -33 39.524119
[9,] 2 34 -34 7.214646
[10,] 2 35 -35 94.254170
[11,] 2 36 -36 40.615783
[12,] 2 37 -37 71.319719
[13,] 2 38 -38 81.775119
[14,] 2 39 -39 49.224411
[15,] 2 40 -40 80.813237
Would become:
>Resulting.Mat.Var
Unique.Lat Unique.Lon Day1 Day2
[1,] 36 -36 51.08621 40.615783
[2,] 37 -37 48.48601 71.319719
[3,] 38 -38 39.48264 81.775119
[4,] 39 -39 97.84823 49.224411
[5,] 40 -40 71.07654 80.813237
[6,] 31 -31 NA 5.641855
[7,] 32 -32 NA 62.124584
[8,] 33 -33 NA 39.524119
[9,] 34 -34 NA 7.214646
[10,] 35 -35 NA 94.254170
I tried to create a matrix of NAs and fill it with two for loops, but it really takes too long!
Many thanks!
Edit:
This is somewhat different from what I found on SO, since it really needs efficiency, all values are numeric, and there are two columns that together form the position...
J
This is a typical "long-to-wide" conversion problem. One possibility to obtain the desired form is to use dcast() from the reshape2 package:
library(reshape2)
as.matrix(dcast(as.data.frame(Mat), Lat + Lon ~ Day, value.var = "Var"))
# Lat Lon 1 2
# [1,] 31 -31 NA 5.641855
# [2,] 32 -32 NA 62.124584
# [3,] 33 -33 NA 39.524119
# [4,] 34 -34 NA 7.214646
# [5,] 35 -35 NA 94.254170
# [6,] 36 -36 51.08621 40.615783
# [7,] 37 -37 48.48601 71.319719
# [8,] 38 -38 39.48264 81.775119
# [9,] 39 -39 97.84823 49.224411
#[10,] 40 -40 71.07654 80.813237
Quite a few similar questions have been answered before on SO, so this is probably a duplicate. However, most questions refer to data.frame structures, and not to matrices.
data:
Mat <- structure(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 36,
37, 38, 39, 40, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, -36,
-37, -38, -39, -40, -31, -32, -33, -34, -35, -36, -37, -38, -39,
-40, 51.08621, 48.486008, 39.482635, 97.848232, 71.076543, 5.641855,
62.124584, 39.524119, 7.214646, 94.25417, 40.615783, 71.319719,
81.775119, 49.224411, 80.813237), .Dim = c(15L, 4L),
.Dimnames = list(NULL, c("Day", "Lat", "Lon", "Var")))
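Given the edit's emphasis on efficiency (the real matrix has roughly 188 million rows), the same reshape can also be done with data.table, which tends to be faster and more memory-friendly at that scale. A minimal sketch, assuming the real data has the same Day/Lat/Lon/Var layout as Mat above:
library(data.table)
DT <- as.data.table(Mat)                               # columns: Day, Lat, Lon, Var
wide <- dcast(DT, Lat + Lon ~ Day, value.var = "Var")  # one Var column per Day
as.matrix(wide)                                        # coerce back to a matrix if required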
Another method using dplyr is:
library(dplyr)
Resulting.Mat.Var <- as.matrix(
  as.data.frame(Mat) %>% group_by(Unique.Lat = Lat, Unique.Lon = Lon) %>%
    summarise(Day1 = if (any(Day == 1)) Var[Day == 1] else NA_real_,
              Day2 = if (any(Day == 2)) Var[Day == 2] else NA_real_))
print(Resulting.Mat.Var)
## Unique.Lat Unique.Lon Day1 Day2
## [1,] 31 -31 NA 5.641855
## [2,] 32 -32 NA 62.124584
## [3,] 33 -33 NA 39.524119
## [4,] 34 -34 NA 7.214646
## [5,] 35 -35 NA 94.254170
## [6,] 36 -36 51.08621 40.615783
## [7,] 37 -37 48.48601 71.319719
## [8,] 38 -38 39.48264 81.775119
## [9,] 39 -39 97.84823 49.224411
##[10,] 40 -40 71.07654 80.813237
Looks like a merge to me:
> merge( Mat[Mat[,'Day']==1 , -1], Mat[ Mat[,'Day']==2, -1], by=c(1,2) , all=TRUE)
Lat Lon Var.x Var.y
1 31 -31 NA 5.641855
2 32 -32 NA 62.124584
3 33 -33 NA 39.524119
4 34 -34 NA 7.214646
5 35 -35 NA 94.254170
6 36 -36 51.08621 40.615783
7 37 -37 48.48601 71.319719
8 38 -38 39.48264 81.775119
9 39 -39 97.84823 49.224411
10 40 -40 71.07654 80.813237
You can coerce it to a matrix if needed, since that result is a data.frame.
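For instance, a minimal sketch (assuming the merge result is first stored in a variable, here called Res):
Res <- merge(Mat[Mat[,'Day']==1, -1], Mat[Mat[,'Day']==2, -1], by = c(1, 2), all = TRUE)
data.matrix(Res)  # coerce the data.frame back to a numeric matrix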
Related
My code is this:
X1 <- list(33, 41, 7, 49, 5, 31)
X2 <- list(42, 33, 75, 28, 91, 55)
X <- matrix(cbind(1:1,X1,X2),nrow=6,ncol=3)
Y <-t(X)
XtX <- X %*% Y
And I don't know what is going wrong. Please help.
Try to create the matrix with vectors instead of lists:
X1 <- c(33, 41, 7, 49, 5, 31)
X2 <- c(42, 33, 75, 28, 91, 55)
X <- cbind(1, X1, X2)  # cbind() on numeric vectors already gives a 6 x 3 matrix
XtX <- X %*% t(X)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 2854 2740 3382 2794 3988 3334
[2,] 2740 2771 2763 2934 3209 3087
[3,] 3382 2763 5675 2444 6861 4343
[4,] 2794 2934 2444 3186 2794 3060
[5,] 3988 3209 6861 2794 8307 5161
[6,] 3334 3087 4343 3060 5161 3987
Not sure matrix(cbind(1:1,X1,X2),nrow=6,ncol=3) is exactly what you expect:
X1 X2
[1,] 1 33 42
[2,] 1 41 33
[3,] 1 7 75
[4,] 1 49 28
[5,] 1 5 91
[6,] 1 31 55
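As a side note (not part of the original answer): once X is a numeric matrix, base R's tcrossprod() computes X %*% t(X) in a single call, so the explicit transpose can be skipped:
X1 <- c(33, 41, 7, 49, 5, 31)
X2 <- c(42, 33, 75, 28, 91, 55)
X <- cbind(1, X1, X2)   # 6 x 3 numeric matrix
XtX <- tcrossprod(X)    # same result as X %*% t(X)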
I am having a problem with summing the rows of my matrices. I have a list formed by 30 matrices
Matrix<-matrix(1:45, ncol=9)
List<-list(lapply(seq_len(30), function(X) Matrix))
The idea is to create 30 matrices (each 5 x 9). First, I need to sum the columns in groups (1:3, 4:6, 7:9), so that each matrix is reduced to 5 x 3, like this:
[,1] [,2] [,3]
[1,] 18 63 108
[2,] 21 66 111
[3,] 24 69 114
[4,] 27 72 117
[5,] 30 75 120
I am trying to get this matrix using this code:
Y<-lapply(List, function(x) rowSums(x[, 1:3]))
But it only sums the first 3 columns.
After this, I need to sum across the list and obtain a single 5 x 3 matrix. I think that something like final <- Reduce(`+`, Y) could help:
540 1890 3240
630 1980 3330
720 2070 3420
810 2160 3510
900 2250 3600
Thank you for your help
You need to find some way to group your columns into threes, for example:
grp = (1:ncol(Matrix) -1) %/% 3
or if you know the dimensions:
grp = rep(0:2,each=3)
To do rowSums in columns of threes, we can do this with a function:
SumCols = function(M,col_grp){
sapply(unique(col_grp),function(i)rowSums(M[,col_grp==i]))
}
SumCols(Matrix,grp)
[,1] [,2] [,3]
[1,] 18 63 108
[2,] 21 66 111
[3,] 24 69 114
[4,] 27 72 117
[5,] 30 75 120
Then apply this across your list of matrices and reduce:
Reduce("+",lapply(List[[1]],SumCols,grp))
[,1] [,2] [,3]
[1,] 540 1890 3240
[2,] 630 1980 3330
[3,] 720 2070 3420
[4,] 810 2160 3510
[5,] 900 2250 3600
Here is another base R solution
out <- Reduce(`+`, Map(function(x)
  do.call(cbind, Map(rowSums, split.default(data.frame(x), ceiling(seq(ncol(x)) / 3)))),
  List[[1]]))
such that
> out
0 1 2
[1,] 540 1890 3240
[2,] 630 1980 3330
[3,] 720 2070 3420
[4,] 810 2160 3510
[5,] 900 2250 3600
I have a vector:
vec <- c(44,0,13,18,32,13,25,42,13,24)
I want to calculate fT as follows:
fT <- ifelse(vec >= 10 & vec <= 20, min(vec) - max(vec),
ifelse(vec > 20 & vec <= 50, max(vec) - min(vec),0))
I want to extend this calculation to each row of a dataframe, i.e.
I have a dataframe and I want to calculate fT for each row.
Sample data:
dat <- data.frame(replicate(10,sample(0:50,1000,rep=TRUE)))
That means I will end up with another dataframe containing the fT value for each value in dat.
To calculate fT for each row, I thought of using dplyr,
dat%>%
rowwise()%>%
mutate(fT = ifelse(dat[,1:10] >= 10 & dat[,1:10] <= 30, min(dat[,1:10]) - max(dat[,1:10]),
ifelse(dat[,1:10] > 30 & dat[,1:10] <= 50, max(dat[,1:10]) - min(dat[,1:10]),0)))
I am stuck at this stage. I do not know how to index by column so that for each row of dat I get an fT value.
If you want the sums of fT, you can do this with apply:
dat$fT = apply(dat, 1, function(x) sum(ifelse(x >= 10 & x <= 20, min(x) - max(x),
ifelse(x > 20 & x <= 50, max(x) - min(x),0))))
Result:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 fT
1 14 13 8 10 15 12 22 47 29 40 -39
2 40 30 7 48 42 50 20 30 24 44 301
3 20 8 7 19 30 36 18 4 37 12 -33
4 45 43 26 31 41 33 26 43 11 28 272
5 47 43 25 9 14 12 3 1 38 46 138
6 2 24 31 33 7 4 36 41 42 0 252
Note:
1 in apply specifies the row margin. This loops through the rows of the input, dat, and outputs a single sum of fT for each row.
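As a quick illustration of the margin argument (a toy matrix, not the question's data):
m <- matrix(1:6, nrow = 2)  # 2 x 3 matrix
apply(m, 1, sum)            # margin 1: one value per row    -> 9 12
apply(m, 2, sum)            # margin 2: one value per column -> 3 7 11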
Edit:
If you actually want the value of fT (not the sum), you can still use apply, but wrap the output with matrix and specify ncol=10 and byrow=TRUE. This means that you want an output matrix with 10 columns (just like dat) and fill the matrix rowwise with the output of apply:
new_dat = matrix(apply(dat, 1,
function(x) ifelse(x >= 10 & x <= 20, min(x) - max(x),
ifelse(x > 20 & x <= 50, max(x) - min(x),0))),
ncol = 10, byrow = TRUE)
Result:
> head(new_dat)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] -39 -39 0 -39 -39 -39 39 39 39 39
[2,] 43 43 0 43 43 43 -43 43 43 43
[3,] -33 0 0 -33 33 33 -33 0 33 -33
[4,] 34 34 34 34 34 34 34 34 -34 34
[5,] 46 46 46 0 -46 -46 0 0 46 46
[6,] 0 42 42 42 0 0 42 42 42 0
If you prefer sticking to dplyr, you can first transpose your dat and map it on the "columns", then transpose back:
library(dplyr)
library(purrr)
new_dat2 <- dat %>%
transpose() %>%
map_dfr(~ ifelse(. >= 10 & . <= 20, min(.) - max(.),
ifelse(. > 20 & . <= 50, max(.) - min(.),0))) %>%
transpose()
Result:
> head(new_dat2)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 -39 -39 0 -39 -39 -39 39 39 39 39
2 43 43 0 43 43 43 -43 43 43 43
3 -33 0 0 -33 33 33 -33 0 33 -33
4 34 34 34 34 34 34 34 34 -34 34
5 46 46 46 0 -46 -46 0 0 46 46
6 0 42 42 42 0 0 42 42 42 0
Note:
The advantage of using transpose instead of t in Base R is that you get a data.frame after transposing instead of a matrix.
Data:
set.seed(123)
dat <- data.frame(replicate(10,sample(0:50,1000,rep=TRUE)))
Here is one option with pmax/pmin which would be efficient
m1 <- (do.call(pmax, dat) - do.call(pmin, dat))[row(dat)]
out <- (-1*m1 *(dat >=10 & dat <=20)) + (m1*(dat > 20 & dat <=50))
all.equal(new_dat, out, check.attributes = FALSE)
#[1] TRUE
Benchmarks
set.seed(24)
dat <- data.frame(replicate(500,sample(0:50,15000,rep=TRUE)))
system.time({
new_dat = matrix(apply(dat, 1,
function(x) ifelse(x >= 10 & x <= 20, min(x) - max(x),
ifelse(x > 20 & x <= 50, max(x) - min(x),0))),
ncol = ncol(dat), byrow = TRUE)
})
#user system elapsed
# 2.67 0.10 2.77
system.time({
m1 <- (do.call(pmax, dat) - do.call(pmin, dat))[row(dat)]
out <- (-1*m1 *(dat >=10 & dat <=20)) + (m1*(dat > 20 & dat <=50))
})
# user system elapsed
# 0.48 0.11 0.60
#all.equal(new_dat, out, check.attributes = FALSE)
#[1] TRUE
I have a dataframe (cenMca) with 1020 rows and 800 columns.
Every 4 columns form a set of data I call "cen". So, columns 1 to 4 are cen1, columns 5 to 8 are cen2, and so on.
I wanted to split cenMca into 200 smaller dataframes of 1020 rows by 4 columns and sum the values per row. For this I'd apply a sum function to each row; however, I searched for ways to split my dataframe the way I wanted but failed to do so. Also, I have no idea how I would iterate through these smaller dataframes to save each with a different name.
So I thought that instead of breaking cenMca into smaller dataframes, I'd sum the values from cenMca and assign them to a single dataframe I called sumvec. So, for every 4 columns in cenMca, I'd have one corresponding column in sumvec. This gives sumvec dimensions equal to 1020 rows and 200 columns.
To accomplish this, I tried:
sumvec = matrix(NA,1020,200)
for (i in 1:1020 ){
for (j in seq(1,800,4)){
sumvec[i,(j+3)/4] = cenMca[i,j]+cenMca[i,j+1]+cenMca[i,j+2]+cenMca[i,j+3]
}
}
The first for runs through rows, and the second for runs through the columns. My increment is 4 for the second for because then I'd get all four values I wanted in a cycle.
I know this is far from efficient, but I thought it'd work.
After I ran the script, all I got in sumvec was NA. I tried warnings(), but nothing came up.
How could I fix this?
Other techniques on how to get this done will be appreciated. Thank you.
This seems like a good application for rowSums. You could use lapply to run it over the grouped columns like you're trying to do.
I'll use a similar data setup as @R.Schifini:
set.seed(1)
z <- matrix( rnorm( 1020*800 ), ncol = 800 )
Make it a data frame, like your data.
z <- as.data.frame(z)
Now group the data frame into groups of 4 columns, running rowSums on each group.
x <- lapply( seq.int( 1, ncol(z), 4 ),
function(i) {
rowSums( z[ , i:(i+3) ] )
} )
Bind it together as a single data frame, with the column names you need.
x <- as.data.frame( x, col.names = paste0( "cen", seq_along( x ) ) )
Here's a small sample of the output.
> head( x[1:6] )
cen1 cen2 cen3 cen4 cen5 cen6
1 -0.8027240 -0.7437158 -1.5305678 -0.7055544 2.0122082 0.7851487
2 0.0854064 0.2422316 -2.5071390 1.7854231 -3.5219698 -0.7699433
3 1.2738387 1.7360087 1.4317099 -3.3501584 -1.8412381 -2.1396324
4 -0.5864149 -0.5648199 -0.3099392 -1.9144969 0.7874474 -2.4840934
5 -0.3887289 -1.0745042 -1.9729363 1.8971846 -4.3374676 2.5744197
6 0.9104741 -0.7546090 4.2516971 1.0335885 2.6814576 -0.2548666
Is this what you are trying to achieve?
I'll create a sample matrix (also works if it is a data frame)
z = matrix(floor(runif(120, 0, 100)), ncol = 12)
> z
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 37 50 37 0 71 84 29 65 0 34 33 65
[2,] 53 60 17 44 39 94 16 66 72 12 27 32
[3,] 10 26 5 26 11 58 39 47 71 38 11 19
[4,] 80 42 65 93 24 50 45 96 18 92 4 11
[5,] 73 36 57 71 86 18 43 40 64 80 37 99
[6,] 5 94 98 16 43 0 51 84 54 75 33 37
[7,] 48 12 60 47 49 87 84 75 33 95 17 56
[8,] 92 7 6 69 69 13 5 53 63 99 62 73
[9,] 4 96 16 46 76 2 55 87 82 60 39 87
[10,] 29 44 47 95 15 93 68 46 70 2 95 57
Then add columns in groups of four:
result = z[,seq(1,12,by = 4)]+z[,seq(2,12,by = 4)]+z[,seq(3,12,by = 4)]+z[,seq(4,12,by = 4)]
> result
[,1] [,2] [,3]
[1,] 124 249 132
[2,] 174 215 143
[3,] 67 155 139
[4,] 280 215 125
[5,] 237 187 280
[6,] 213 178 199
[7,] 167 295 201
[8,] 174 140 297
[9,] 162 220 268
[10,] 215 222 224
First of all, you don't need to loop over rows. R works well with vectors.
Secondly, the NAs in sumvec are probably the result of NAs in cenMca. If you have NAs in cenMca, use sum() with na.rm = TRUE instead of +:
for (j in seq(1,800,4)) sumvec[,(j+3)/4] <- apply(cenMca[,j:(j+3)],1,sum, na.rm=T)
Hope this helps.
I have a long vector and I need to divide it into segments according to a threshold. A segment is consecutive values over the threshold. When values drop below the threshold, the segment ends and the next segment begins where the values once again cross above the threshold. I need to record the start and end indices of each segment.
Below is an inefficient implementation. What's the fastest and most appropriate way to write this? It is pretty ugly, and I have to assume there's a cleaner implementation.
set.seed(10)
test.vec <- rnorm(100, 8, 10)
threshold <- 0
segments <- list()
in.segment <- FALSE
for(i in 1:length(test.vec)){
# If we're in a segment
if(in.segment){
if(test.vec[i] > threshold){
next
}else{
end.ind <- i - 1
in.segment <- FALSE
segments[[length(segments) + 1]] <- c(start.ind, end.ind)
}
}
# if not in segment
else{
if(test.vec[i] > threshold){
start.ind <- i
in.segment <- TRUE
}
}
}
EDIT: Runtime of all solutions
Thanks for all the replies, this has been helpful and very instructive. A small test of all five solutions is below (the four provided plus the original example). As you can see, all four are a huge improvement over the original solution, but Khashaa's solution is by far the fastest.
set.seed(1)
test.vec <- rnorm(1e6, 8, 10);threshold <- 0
originalFunction <- function(x, threshold){
segments <- list()
in.segment <- FALSE
for(i in seq_along(x)){
# If we're in a segment
if(in.segment){
if(x[i] > threshold){
next
}else{
end.ind <- i - 1
in.segment <- FALSE
segments[[length(segments) + 1]] <- c(start.ind, end.ind)
}
}
# if not in segment
else{
if(x[i] > threshold){
start.ind <- i
in.segment <- TRUE
}
}
}
segments
}
SimonG <- function(x, threshold){
hit <- which(x > threshold)
n <- length(hit)
ind <- which(hit[-1] - hit[-n] > 1)
starts <- c(hit[1], hit[ ind+1 ])
ends <- c(hit[ ind ], hit[n])
cbind(starts,ends)
}
Rcpp::cppFunction('DataFrame Khashaa(NumericVector x, double threshold) {
x.push_back(-1);
int n = x.size(), startind, endind;
std::vector<int> startinds, endinds;
bool insegment = false;
for(int i=0; i<n; i++){
if(!insegment){
if(x[i] > threshold){
startind = i + 1;
insegment = true; }
}else{
if(x[i] < threshold){
endind = i;
insegment = false;
startinds.push_back(startind);
endinds.push_back(endind);
}
}
}
return DataFrame::create(_["start"]= startinds, _["end"]= endinds);
}')
bgoldst <- function(x, threshold){
with(rle(x>threshold),
t(matrix(c(0L,rep(cumsum(lengths),2L)[-length(lengths)]),2L,byrow=T)+1:0)[values,])
}
ClausWilke <- function(x, threshold){
suppressMessages(require(dplyr, quietly = TRUE))
in.segment <- (x > threshold)
start <- which(c(FALSE, in.segment) == TRUE & lag(c(FALSE, in.segment) == FALSE)) - 1
end <- which(c(in.segment, FALSE) == TRUE & lead(c(in.segment, FALSE) == FALSE))
data.frame(start, end)
}
system.time({ originalFunction(test.vec, threshold); })
## user system elapsed
## 66.539 1.232 67.770
system.time({ SimonG(test.vec, threshold); })
## user system elapsed
## 0.028 0.008 0.036
system.time({ Khashaa(test.vec, threshold); })
## user system elapsed
## 0.008 0.000 0.008
system.time({ bgoldst(test.vec, threshold); })
## user system elapsed
## 0.065 0.000 0.065
system.time({ ClausWilke(test.vec, threshold); })
## user system elapsed
## 0.274 0.012 0.285
Here's another option, mostly using which. The start and end points are determined by finding the non-consecutive elements of the hit sequence.
test.vec <- rnorm(100, 8, 10)
threshold <- 0
findSegments <- function(x, threshold){
hit <- which(x > threshold)
n <- length(hit)
ind <- which(hit[-1] - hit[-n] > 1)
starts <- c(hit[1], hit[ ind+1 ])
ends <- c(hit[ ind ], hit[n])
cbind(starts,ends)
}
findSegments(test.vec, threshold=0)
This gives something like:
> findSegments(test.vec, threshold=0)
starts ends
[1,] 1 3
[2,] 5 7
[3,] 9 11
[4,] 13 28
[5,] 30 30
[6,] 32 32
[7,] 34 36
[8,] 38 39
[9,] 41 41
[10,] 43 43
[11,] 46 51
[12,] 54 54
[13,] 56 61
[14,] 63 67
[15,] 69 72
[16,] 76 77
[17,] 80 81
[18,] 83 84
[19,] 86 88
[20,] 90 92
[21,] 94 95
[22,] 97 97
[23,] 100 100
Compare that to the original sequence:
> round(test.vec,1)
[1] 20.7 15.7 4.3 -15.1 24.6 9.4 23.2 -4.5 16.9 20.9 13.2 -1.2
[13] 22.6 7.7 6.0 6.6 4.1 21.3 5.3 16.7 11.4 16.7 19.6 16.7
[25] 11.6 7.3 3.7 8.4 -4.5 11.7 -7.1 8.4 -18.5 12.8 22.5 11.0
[37] -3.3 11.1 6.9 -7.9 22.9 -3.7 3.5 -7.1 -5.9 3.5 13.2 20.0
[49] 13.2 23.4 15.9 -5.0 -6.3 10.0 -6.2 4.7 2.1 26.4 5.9 27.3
[61] 14.3 -12.4 28.4 30.9 18.2 11.4 5.7 -4.5 6.2 12.0 10.9 11.1
[73] -2.0 -9.0 -1.4 15.4 19.1 -1.6 -5.4 5.4 7.8 -5.6 15.2 13.8
[85] -18.8 7.1 17.1 9.3 -3.9 22.6 1.7 28.9 -21.3 21.2 8.2 -15.4
[97] 3.2 -10.2 -6.2 14.1
I like for loops here, because the translation to Rcpp is straightforward.
Rcpp::cppFunction('DataFrame findSegment(NumericVector x, double threshold) {
x.push_back(-1);
int n = x.size(), startind, endind;
std::vector<int> startinds, endinds;
bool insegment = false;
for(int i=0; i<n; i++){
if(!insegment){
if(x[i] > threshold){
startind = i + 1;
insegment = true; }
}else{
if(x[i] < threshold){
endind = i;
insegment = false;
startinds.push_back(startind);
endinds.push_back(endind);
}
}
}
return DataFrame::create(_["start"]= startinds, _["end"]= endinds);
}')
set.seed(1); test.vec <- rnorm(1e7,8,10); threshold <- 0;
system.time(findSegment(test.vec, threshold))
# user system elapsed
# 0.045 0.000 0.045
# #SimonG's solution
system.time(findSegments(test.vec, threshold))
# user system elapsed
# 0.533 0.012 0.548
with(rle(test.vec>threshold),t(matrix(c(0L,rep(cumsum(lengths),2L)[-length(lengths)]),2L,byrow=T)+1:0)[values,]);
## [,1] [,2]
## [1,] 1 8
## [2,] 10 13
## [3,] 16 17
## [4,] 20 26
## [5,] 28 28
## [6,] 30 34
## [7,] 36 38
## [8,] 41 46
## [9,] 48 49
## [10,] 51 53
## [11,] 55 81
## [12,] 84 90
## [13,] 92 100
Explanation
test.vec>threshold
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Compute which elements in the input vector are above the threshold using vectorized comparison.
rle(...)
## Run Length Encoding
## lengths: int [1:25] 8 1 4 2 2 2 7 1 1 1 ...
## values : logi [1:25] TRUE FALSE TRUE FALSE TRUE FALSE ...
Compute the run-length encoding of the logical vector. This returns a list classed as 'rle' with two named components: lengths, the length of each run, and values, the value that produced that run. Here the values are TRUE or FALSE, with TRUE marking a segment of interest and FALSE marking a non-segment run.
with(...,...)
The first argument is the run-length encoding described above. with() evaluates the second argument in an environment constructed from the 'rle'-classed list, making the lengths and values components accessible as variables.
Below I dive into the contents of the second argument.
cumsum(lengths)
## [1] 8 9 13 15 17 19 26 27 28 29 34 35 38 40 46 47 49 50 53 54 81 83 90 91 100
Compute the cumulative sum of the lengths. This will form the basis for computing both the start indexes and end indexes of each run-length. Critical point: Each element of the cumsum represents the end index of that run-length.
rep(...,2L)
## [1] 8 9 13 15 17 19 26 27 28 29 34 35 38 40 46 47 49 50 53 54 81 83 90 91 100 8 9 13 15 17 19 26 27 28 29 34 35 38 40 46 47 49 50 53 54 81 83 90 91 100
Duplicate the cumulative sum. The first repetition will serve as the basis for the start indexes, the second the end. I will henceforth refer to these repetitions as the "start-index repetition" and the "end-index repetition".
c(0L,...[-length(lengths)])
## [1] 0 8 9 13 15 17 19 26 27 28 29 34 35 38 40 46 47 49 50 53 54 81 83 90 91 8 9 13 15 17 19 26 27 28 29 34 35 38 40 46 47 49 50 53 54 81 83 90 91 100
This removes the last element at the end of the start-index repetition, and prepends a zero to the beginning of it. This effectively lags the start-index repetition by one element. This is necessary because we need to compute each start index by adding one to the previous run-length's end index, taking zero as the end index of the non-existent run-length prior to the first.
matrix(...,2L,byrow=T)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## [1,] 0 8 9 13 15 17 19 26 27 28 29 34 35 38 40 46 47 49 50 53 54 81 83 90 91
## [2,] 8 9 13 15 17 19 26 27 28 29 34 35 38 40 46 47 49 50 53 54 81 83 90 91 100
This builds a two-row matrix out of the previous result. The lagged start-index repetition is the top row, the end-index repetition is the bottom row.
...+1:0
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## [1,] 1 9 10 14 16 18 20 27 28 29 30 35 36 39 41 47 48 50 51 54 55 82 84 91 92
## [2,] 8 9 13 15 17 19 26 27 28 29 34 35 38 40 46 47 49 50 53 54 81 83 90 91 100
R recycles this two-element addend down the rows first, then across the columns, so it adds one to every element of the top row and zero to the bottom row. This completes the computation of the start indexes.
t(...)
## [,1] [,2]
## [1,] 1 8
## [2,] 9 9
## [3,] 10 13
## [4,] 14 15
## [5,] 16 17
## [6,] 18 19
## [7,] 20 26
## [8,] 27 27
## [9,] 28 28
## [10,] 29 29
## [11,] 30 34
## [12,] 35 35
## [13,] 36 38
## [14,] 39 40
## [15,] 41 46
## [16,] 47 47
## [17,] 48 49
## [18,] 50 50
## [19,] 51 53
## [20,] 54 54
## [21,] 55 81
## [22,] 82 83
## [23,] 84 90
## [24,] 91 91
## [25,] 92 100
Transpose to a two-column matrix. This is not strictly necessary if you're OK with getting the result as a two-row matrix.
...[values,]
## [,1] [,2]
## [1,] 1 8
## [2,] 10 13
## [3,] 16 17
## [4,] 20 26
## [5,] 28 28
## [6,] 30 34
## [7,] 36 38
## [8,] 41 46
## [9,] 48 49
## [10,] 51 53
## [11,] 55 81
## [12,] 84 90
## [13,] 92 100
Subset just the segments of interest. Since values is a logical vector representing which run-lengths surpassed the threshold, we can use it directly as a row index vector.
Performance
I guess I'm screwing myself here, but SimonG's solution performs about twice as well as mine:
bgoldst <- function() with(rle(test.vec>threshold),t(matrix(c(0L,rep(cumsum(lengths),2L)[-length(lengths)]),2L,byrow=T)+1:0)[values,]);
simong <- function() findSegments(test.vec,threshold);
set.seed(1); test.vec <- rnorm(1e7,8,10); threshold <- 0;
identical(bgoldst(),unname(simong()));
## [1] TRUE
system.time({ bgoldst(); })
## user system elapsed
## 1.344 0.204 1.551
system.time({ simong(); })
## user system elapsed
## 0.656 0.109 0.762
+1 from me...
Here is another solution that I think is simpler. Note that you have to use set.seed(10), not set.seed <- 10, to set the seed of the random number generator.
require(dplyr) # for lead() and lag()
set.seed(10)
test.vec <- rnorm(100, 8, 10)
threshold <- 0
in.segment <- (test.vec > threshold)
start <- which(c(FALSE, in.segment) == TRUE & lag(c(FALSE, in.segment) == FALSE)) - 1
end <- which(c(in.segment, FALSE) == TRUE & lead(c(in.segment, FALSE) == FALSE))
segments <- data.frame(start, end)
head(segments)
## start end
## 1 1 2
## 2 4 6
## 3 8 8
## 4 10 16
## 5 18 21
## 6 23 23
In general, in R, if you find yourself writing complicated loops and if statements you're probably doing it wrong. Most problems can be solved in a vectorized form.