I want to find all subsequences within a sequence with a (minimum) length of n. Let's assume I have this sequence
sequence <- c(1,2,3,2,5,3,2,6,7,9)
and I want to find the increasing subsequences with a minimum length of 3. The output should be a data frame with the start and end position of each subsequence found.
df =data.frame(c(1,7),c(3,10))
colnames(df) <- c("start", "end")
Can somebody give a hint how to solve my problem?
Thanks in advance!
One way using only base R
n <- 3
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
This splits the indices of sequence into groups of consecutive increasing values; if the length of a group is greater than or equal to n, it returns the start and end index of that group.
To understand it, let's break this down step by step.
Using diff we can find difference between consecutive elements
diff(sequence)
#[1] 1 1 -1 3 -2 -1 4 1 2
We check which of these differences are not increases (i.e. are less than 1)
diff(sequence) < 1
#[1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
and take cumulative sum over them to create groups
cumsum(c(0, diff(sequence)) < 1)
#[1] 1 1 1 2 2 3 4 4 4 4
Based on these groups, we split the index from 1:length(sequence)
split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1))
#$`1`
#[1] 1 2 3
#$`2`
#[1] 4 5
#$`3`
#[1] 6
#$`4`
#[1] 7 8 9 10
Using sapply we loop over this list and return the start and end index of the list if the length of the list is >= n (3 in this case)
sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)]))
#$`1`
#start end
# 1 3
#$`2`
# NULL
#$`3`
#NULL
#$`4`
#start end
# 7 10
Finally, rbind all of them together using do.call. NULL elements are automatically ignored.
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
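Since the question asked for a data frame, the same steps can be wrapped in a small helper. This is only a sketch under the same assumptions as the code above ("increasing" means each difference is at least 1); the name find_increasing_runs is made up for illustration.

```r
# Hypothetical helper: same split/cumsum idea as above, returning a data frame
find_increasing_runs <- function(x, n) {
  # a new group starts wherever the value does not increase by at least 1
  groups <- cumsum(c(0, diff(x)) < 1)
  runs <- split(seq_along(x), groups)
  # keep only the groups that are long enough
  runs <- runs[lengths(runs) >= n]
  data.frame(
    start = vapply(runs, function(i) i[1], integer(1), USE.NAMES = FALSE),
    end   = vapply(runs, function(i) i[length(i)], integer(1), USE.NAMES = FALSE)
  )
}

sequence <- c(1,2,3,2,5,3,2,6,7,9)
res <- find_increasing_runs(sequence, 3)
res
```

For the example sequence this should give the two rows (1, 3) and (7, 10), matching the matrix returned above.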
Here is another solution using base R. I tried to comment it well but it may still be hard to follow. It seems like you wanted direction / to learn, more than an outright answer so definitely follow up with questions if anything is unclear (or doesn't work for your actual application).
Also, for your data, I added a 12 on the end to make sure it was returning the correct position for repeated increases greater than n (3 in this case):
# Data (I added 12 on the end)
sequence <- c(1,2,3,2,5,3,2,6,7,9, 12)
# Create indices for whether or not the numbers in the sequence increased
indices <- c(1, diff(sequence) >= 1)
indices
[1] 1 1 1 0 1 0 0 1 1 1 1
Now that we have the indices, we need to get the start and end positions for repeats >= 3
# Finding increasing sequences of n length using rle
n <- 3
n <- n - 1
# Examples
rle(indices)$lengths
[1] 3 1 1 2 4
rle(indices)$values
[1] 1 0 1 0 1
# Finding repeated TRUE (1) in our indices vector
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
reps
[1] TRUE FALSE FALSE FALSE TRUE
# Creating a vector of positions for the end of a sequence
# Because our indices are true false, we can use cumsum along
# with rle to create the positions of the end of the sequences
rle_positions <- cumsum(rle(indices)$lengths)
rle_positions
[1] 3 4 5 7 11
# Creating start sequence vector and subsetting start / end using reps
start <- c(1, head(rle_positions, -1))[reps]
end <- rle_positions[reps]
data.frame(start, end)
start end
1 1 3
2 7 11
Or, concisely:
n <- 3
n <- n-1
indices <- c(1, diff(sequence) >= 1)
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
rle_positions <- cumsum(rle(indices)$lengths)
data.frame(start = c(1, head(rle_positions, -1))[reps],
end = rle_positions[reps])
start end
1 1 3
2 7 11
EDIT: @Ronak's update made me realize I should be using diff instead of sapply with an anonymous function for my first step. I updated the answer because it was not catching an increase at the end of the vector (e.g., sequence <- c(1,2,3,2,5,3,2,6,7,9,12, 11, 11, 20, 100)); I also needed to add one more line under n <- 3. This should work as intended now.
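One edge case may be worth checking: because indices starts with an extra 1, the first run is one element longer than later runs of the same length, so with n <- n - 1 a leading increasing run of only two elements (for example c(1, 2, 1, 1) with n = 3) appears to pass the filter. A possible variant, sketched here with the made-up name increasing_runs, runs rle directly on the differences so that every run is measured the same way.

```r
# Hypothetical variant: rle on the differences, no leading placeholder
increasing_runs <- function(x, n) {
  r <- rle(diff(x) >= 1)             # TRUE runs are stretches of increases
  ends <- cumsum(r$lengths) + 1      # position in x where each run ends
  starts <- ends - r$lengths         # position in x where each run starts
  # k consecutive increases span k + 1 elements of x
  keep <- r$values & (r$lengths + 1) >= n
  data.frame(start = starts[keep], end = ends[keep])
}

res_long  <- increasing_runs(c(1,2,3,2,5,3,2,6,7,9,12), 3)
res_short <- increasing_runs(c(1,2,1,1), 3)
res_long
```

On the 11-element example this should again return (1, 3) and (7, 11), while the short vector returns no rows.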
I have a list of two series that start out the same length. After executing the following code, the second series has one fewer elements than the first. Is there a general way of removing the final element of only the series containing n+1 elements, so that all the series in my list have n elements? What about if I have a combination of series in my list containing n, n+1 and n+2 elements? Below is a minimal reproducible example.
#test
library('tseries') # adf.test() used below comes from the tseries package
tseries <- list("t1" = c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1), "t2" = c(1,2,3,4,5,6,7,8,9,10,9,8,7,8,9));
# apply stationarity test to the list of series
adf <- lapply(tseries, function(x) tseries::adf.test(x)$p.value)
adf
# index only series that need differencing
not_stationary <- tseries[which(adf > 0.05)]
stationary <- tseries[which(adf < 0.05)]
not_stationary <- lapply(not_stationary, diff);
# verify
adf <- lapply(not_stationary, function(x) tseries::adf.test(x)$p.value)
adf
now_stationary <- not_stationary
#combine stationary and now_stationary
tseries_diff <- c(stationary, now_stationary)
tseries_diff
#$t1
#[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
#$t2
#[1] 1 1 1 1 1 1 1 1 1 -1 -1 -1 1 1
So to summarise, I would like to remove the final element, 1, from t1, but using code that can be applied to a list of series of lengths n and n+1 (and n+2 would be useful).
Thanks!
You can find the minimum length and simply get the series up to that point, i.e.
new_series_list <- lapply(tseries_diff, function(i)i[seq(min(lengths(tseries_diff)))])
so the lengths are now the same
lengths(new_series_list)
#t1 t2
#14 14
This will work for series of any length: it trims the longer series to match the shortest one.
Edited for list instead of vector -
If you are dealing with a list, you want to make all of the series the length of the shortest one:
(I modify the example to avoid using a library)
#test
mylist <- c(1,1,1,1,1)
mylongerlist <- c(1,1,1,1,1,1,1)
length(mylist)
# [1] 5
length(mylongerlist)
# [1] 7
#combine
tseries_diff <- list("t1" = mylist, "t2" = mylongerlist)
tseries_diff
# $t1
# [1] 1 1 1 1 1
#
# $t2
# [1] 1 1 1 1 1 1 1
# on the fly approach to truncate
lapply(tseries_diff, function(x) { length(x) <- min(lengths(tseries_diff)); x })
# $t1
# [1] 1 1 1 1 1
#
# $t2
# [1] 1 1 1 1 1
And a function
# As a reusable function for clear code
reduceToShortestLength <- function(toCut) {
# takes a list and cuts the tail off of any series longer than the shortest
lapply(toCut, function(x) { length(x) <- min(lengths(toCut)); x })
}
reduceToShortestLength(tseries_diff)
# $t1
# [1] 1 1 1 1 1
#
# $t2
# [1] 1 1 1 1 1
Original below (in case anyone thinks vector like I did at first)
I think you are asking how to truncate a vector to the shortest length. The head function does this well in base R.
the on the fly approach:
> mylist <- c(1,1,1,1,1)
> mylongerlist <- c(1,1,1,1,1,1,1)
> length(mylist)
[1] 5
> length(mylongerlist)
[1] 7
> x <- head(mylongerlist, length(mylist))
> length(x)
[1] 5
A function can be written like so:
> reduceToShorterLength<- function(toshorten, template) { head(toshorten, length(template))}
> x <- reduceToShorterLength(mylongerlist, mylist)
> length(x)
[1] 5
I'm trying to comb through stock data. I usually use Java, but that's not good enough for what I'll eventually build out. In R, how can I check how many times, on average, a condition is true until it is false?
so if x > 10 = true
if x < 10 = false
how many times is x > 10 until it's below?
I just recently started using R and (if I understood your question correctly) I ran into a similar problem.
First I am generating a sample of 1000 elements with random values between 0 and 20 (I chose 20 simply because your condition would be <> 10, hence I took 10 as the middle point)
library(dplyr)
x <- data.frame(n=runif(1000, min = 0, max = 20), group = 0)
> x
# n group
#1 18.01267749 0
#2 8.50561210 0
#3 11.26424876 0
#4 1.22902009 0
#5 17.37173610 0
#6 15.79453081 0
#7 4.84231228 0
#8 1.36992180 0
#9 2.16605579 0
#10 16.51773243 0
...
I am not sure what you mean specifically but I will try to solve two problems:
1- Count how many rows meet a certain condition and how many do not.
2- Check how many consecutive "rows" meet a certain condition until the condition is not met anymore.
Now for the first case, thinking in a "programming" way with for-loops:
res2 <- c(0,0)
for(i in 1:nrow(x)){
if(x[[i,"n"]] > 10)
res2[1] <- res2[1]+1
else
res2[2] <- res2[2]+1
}
#> res2
#[1] 494 506
Of course there is a better and faster way to do it, the "R" way:
res <- x %>%
group_by(group = if_else(x$n > 10, 1, 0)) %>%
summarise(total = n())
# A tibble: 2 x 2
# group total
# <dbl> <int>
#1 0 506
#2 1 494
The idea is simple: put all the rows with a value > 10 in one group (identified by 1) and the rows with n < 10 in another group (identified by 0), then group by the variable group and finally count the total rows in the two groups.
Now for the second case, which is slightly harder if one does not have the right tools. It actually took me a while back then to find the right answer without using for-loops.
The idea is to use rle, or run-length encoding:
x$group <- with(rle(x$n < 10), rep(seq_along(lengths), lengths))
#> x
# n group
#1 18.01267749 1
#2 8.50561210 2
#3 11.26424876 3
#4 1.22902009 4
#5 17.37173610 5
#6 15.79453081 5
#7 4.84231228 6
#8 1.36992180 6
#9 2.16605579 6
#10 16.51773243 7
#11 2.22784827 8
#12 19.44676961 9
#13 1.28190206 10
#14 15.93426880 11
#15 16.70963107 11
#16 5.01572254 12
Combined with rep and seq_along, rle gives us a new group index whenever the result of the condition changes. So if there are 3 values in a row which are all < 10, they will all have the same group index, but as soon as a value does not meet the condition anymore, the group index increases.
You can see in the example that rows 5-6 are both > 10 and their group has index 5, but row 7 is < 10, so the new group index is 6, and so on...
Now, since the group index changes every time the condition result changes, the largest group index is the total number of runs. Runs alternate between meeting and not meeting the condition, so dividing by two gives the number of runs of each kind (exactly when the number of runs is even, otherwise off by one half).
> max(x$group)/2
# [1] 242
You can test that by doing:
runLength <- rle(x$n < 10)$values
res <- length(runLength[runLength == TRUE])
#> res
#[1] 242
I produced data using runif like @Gabryxx7, but this solution is a one-liner (I think this is what you were looking for anyway).
# Data
set.seed(123)
x <- data.frame(n=runif(1000, min = 0, max = 20))
# Solution
mean(rle(x$n > 10)$lengths[rle(x$n > 10)$values == T])
[1] 2.020492
I'll explain what's going on. The x$n > 10 part outputs TRUE if the values in column n in object x are greater than 10. It's really that simple.
Here it is step by step.
# Create column in x for whether value is greater than 10
x$GreaterThanTen <- x$n > 10
# Input rle output into object
ConsecutiveVars <- rle(x$GreaterThanTen)
ConsecutiveVars$lengths # (1 1 1 2 1) Tells us some value occurs consecutively: once, once, once, twice, once, etc.
ConsecutiveVars$values # (F T F T F) Tells us which values occur consecutively: FALSE then TRUE then FALSE then TRUE then FALSE, etc.
# so FALSE occurs once, then TRUE occurs once, then FALSE occurs once, then TRUE occurs twice, then FALSE occurs once, etc.
# We want to know only how many times TRUE occurs consecutively, so we filter ConsecutiveVars$lengths for when it is TRUE
ConsecutiveTRUES <- ConsecutiveVars$lengths[ConsecutiveVars$values == T]
# Then take the average
mean(ConsecutiveTRUES)
2.020492
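To check the logic by eye, here is the same calculation on a small hand-made vector (the values are made up for illustration and are not taken from the simulated data above).

```r
# Small hand-made vector so the result can be verified manually
x_small <- c(12, 15, 3, 11, 2, 20, 30, 40)

runs <- rle(x_small > 10)
runs$lengths   # 2 1 1 1 3
runs$values    # TRUE FALSE TRUE FALSE TRUE

# keep only the lengths of the runs where the condition is TRUE
true_runs <- runs$lengths[runs$values]

# average number of consecutive values above 10: (2 + 1 + 3) / 3
mean_true_run <- mean(true_runs)
mean_true_run
```

The three TRUE runs have lengths 2, 1 and 3, so the average is 2.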
I am trying to write a code where I can find the previous consecutive occurrences of the same binary value.
I have managed to write a for loop to find previous value (in my real problem, the data is subsetted hence a for loop being required).
x<-data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
xLength<-length(x$successRate)
y<-vector(mode="integer",length=xLength)
if (xLength>1){
for (i in 2:xLength){
y[i]<-x$successRate[i-1]
}
}
y[1]<-NA
x[,"previous"]<-y
However I am looking for the desired output as follows:
# desired output
data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1),previousConsecutiveSuccess=c(NA,1,2,-1,1,-1,-2,-3,1,-1,1,2,3,-1,1,-1,-2,-3,-4,1,2,-1))
x <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
x$previous <- NA # no need for extra variable
if (nrow(x)>1) {
# set first consecutive idx manually
x$previous[2] <- -1+2*x$successRate[1] # -1 if successRate == 0; 1 otherwise
# loop only if nrow(x) is large enough
if (nrow(x)>2) {
for (i in 3:nrow(x)){ # start on row 3, as the previous 2 rows are needed
x$previous[i] <- ifelse(x$successRate[i-1] == x$successRate[i-2], # consecutive?
sign(x$previous[i-1])*(abs(x$previous[i-1])+1), # yes: add 1 and keep sign
-1+2*x$successRate[i-1]) # no: 0 -> -1; 1 -> 1
}
}
}
print(x$previous)
[1] NA 1 2 -1 1 -1 -2 -3 1 -1 1 2 3 -1 1 -1 -2 -3 -4 1 2 -1
A couple of simple options:
Option 1: Using base R functions only, including rle for run length encoding:
# Your original data.frame
x <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
# base R method to get lag 1 of a vector
lag_successRate <- c( NA, x$successRate[ - length(x$successRate) ] )
lag_rle <- rle(lag_successRate) # base function for run length encoding
ifelse( lag_rle$values==0, -1, 1 ) * lag_rle$lengths # multiply the rle length by -1 if the rle value == 0
# output: signed run lengths, one value per run (not yet one value per row)
[1] NA 2 -1 1 -3 1 -1 3 -1 1 -4 2 -1
Option 2: Using data.table, similar to the above, with base::rle to get the run length encoding.
If you have very large data sets, the data.table functions are likely to be the fastest and most memory-efficient options.
# your sample data as a dataframe, as you had originally:
DT <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
library(data.table)
setDT(DT) # set DT as a data.table by reference (without any copy!)
lag_rle <- rle( shift(DT$successRate) ) # get rle on the lag 1 of successRate
ifelse( lag_rle$values==0, -1, 1 ) * lag_rle$lengths # multiply the rle length by -1 if the rle value == 0
# output: signed run lengths, one value per run (not yet one value per row)
[1] NA 2 -1 1 -3 1 -1 3 -1 1 -4 2 -1
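The vector above has one entry per run, whereas the desired output in the question has one entry per row. One possible way to expand it, sketched here in base R, is to count up within each run using sequence and repeat each run's sign with rep.

```r
successRate <- c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1)

# lag 1 of the vector, then run length encoding as in the options above
lagged <- c(NA, head(successRate, -1))
runs <- rle(lagged)

# sign of each run: -1 for runs of 0, 1 for runs of 1 (NA stays NA)
run_sign <- ifelse(runs$values == 0, -1, 1)

# sequence() counts 1, 2, ..., length within each run;
# rep() repeats each run's sign once per row of that run
previousConsecutiveSuccess <- sequence(runs$lengths) * rep(run_sign, runs$lengths)
previousConsecutiveSuccess
```

This should reproduce the previousConsecutiveSuccess column from the question, including the leading NA.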
Completely new to R and am trying to count how many numbers in a list are larger than the one right before.
This is what I have so far,
count <- 0
number <- function(value) {
for (i in 1:length(value))
{ if(value[i+1] > value[i])
{count <- count + 1}
}
}
x <- c(1,2,1,1,3,5)
number(x)
The output should be 3 based on the list.
Any help or advice would be greatly appreciated!
A base R alternative would be diff
sum(diff(x) > 0)
#[1] 3
Or we can drop the first value from one copy of the vector and the last value from another, then compare them element by element.
sum(x[-1] > x[-length(x)])
#[1] 3
where
x[-1]
#[1] 2 1 1 3 5
x[-length(x)]
#[1] 1 2 1 1 3
You can lag your vector and count how many times your initial vector is greater than your lagged vector
library(dplyr)
sum(x>lag(x), na.rm = TRUE)
In details, lag(x) does:
> lag(x)
[1] NA 1 2 1 1 3
so x > lag(x) does
> x>lag(x)
[1] NA TRUE FALSE FALSE TRUE TRUE
The sum of the above is 3.
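For comparison with the vectorised answers, the loop from the question can also be repaired. This is a sketch rather than a recommendation: it assumes the two problems were that the loop ran one index past the end (so value[i+1] became NA) and that the function never returned the count. The name count_increases is made up.

```r
# Hypothetical repaired version of the loop from the question
count_increases <- function(value) {
  count <- 0  # keep the counter local to the function
  # stop one element before the end so value[i + 1] always exists
  for (i in seq_len(max(length(value) - 1, 0))) {
    if (value[i + 1] > value[i]) {
      count <- count + 1
    }
  }
  count  # the last expression is the function's return value
}

x <- c(1,2,1,1,3,5)
n_increases <- count_increases(x)
n_increases
```

For x this gives 3, the same as sum(diff(x) > 0).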
Let T = {t | t = 1, 2, 3, ..., T} be the time (sequence order number). For each group, at each t where a sequence value occurs, we need to make sure that the value (call it X) is within the set {K-1, K, K+1}, where K is the previous sequence value at t-1. For example, if the previous value is K = 4, then the next value X meets the requirement if it falls within {3, 4, 5}. If every X in the sequence meets the requirement, the group meets the requirement and is labeled TRUE.
I know a for loop can do the trick, but I have a large number of observations and a loop is very slow. I know that cummax can help find a non-decreasing sequence quickly. Is there a similarly quick solution, like cummax?
seq <- c(1,2,1,2,3,1,2,3,1,2,1,2,2,3,4)
group <- rep(letters[1:3],each=5)
dt <- data.frame(group,seq)
> dt
group seq
1 a 1
2 a 2
3 a 1
4 a 2
5 a 3
6 b 1
7 b 2
8 b 3
9 b 1
10 b 2
11 c 1
12 c 2
13 c 2
14 c 3
15 c 4
The desired output:
y label
a:true
b:false
c:true
You can use the diff function to check if the adjacent sequence satisfies the condition:
library(dplyr)
dt %>% group_by(group) %>% summarize(label = all(abs(diff(seq)) <= 1))
# A tibble: 3 x 2
# group label
# <fctr> <lgl>
#1 a TRUE
#2 b FALSE
#3 c TRUE
Here is the corresponding data.table version:
library(data.table)
setDT(dt)[, .(label = all(abs(diff(seq)) <= 1)), .(group)]
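To see why the check works, it may help to look at a single group by hand. The sketch below uses the values of groups a, b and c from the question; the helper name steps_by_one is made up, and tapply is just one base R way to apply it per group.

```r
seq_vals <- c(1,2,1,2,3,1,2,3,1,2,1,2,2,3,4)
group <- rep(letters[1:3], each = 5)

# Group b: the jump from 3 back to 1 has absolute difference 2
b <- seq_vals[group == "b"]
abs(diff(b))             # 1 1 2 1
all(abs(diff(b)) <= 1)   # FALSE

# Hypothetical helper: TRUE if every step moves by at most 1
steps_by_one <- function(x) all(abs(diff(x)) <= 1)

labels <- tapply(seq_vals, group, steps_by_one)
labels
```

Note that a group with a single value has no differences, so all() returns TRUE for it.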
You can do:
is.sequence <- function(x)
all(apply(head(cbind(x-1, x, x+1), -1) - x[-1] == 0, 1, any))
tapply(dt$seq, dt$group, is.sequence)
# a b c
# TRUE FALSE TRUE
Here is a base R example with aggregate and diff
aggregate(c(1, abs(diff(dt$seq)) * (tail(dt$group, -1) ==
head(dt$group, -1))),
dt["group"], function(i) max(i) < 2)
group x
1 a TRUE
2 b FALSE
3 c TRUE
The first argument to aggregate is a vector that uses diff and turns the result on and off (to zero) based on whether the current adjacent vector elements are in the same group.
We can also use aggregate from base R
aggregate(seq~group,dt, FUN = function(x) all(c(TRUE,
abs((x[-1] - x[-length(x)])) <=1)))
# group seq
#1 a TRUE
#2 b FALSE
#3 c TRUE