I want to find all subsequences within a sequence with a (minimum) length of n. Let's assume I have this sequence:
sequence <- c(1,2,3,2,5,3,2,6,7,9)
and I want to find the increasing subsequences with a minimum length of 3. The output should be a data frame with the start and end position of each subsequence found.
df =data.frame(c(1,7),c(3,10))
colnames(df) <- c("start", "end")
Can somebody give a hint how to solve my problem?
Thanks in advance!
One way using only base R
n <- 3
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
This splits the indices of sequence into groups of consecutive increasing values. If the length of a group is greater than or equal to n, the start and end index of that group are returned.
To understand it, let's break this down step by step.
Using diff we can find the difference between consecutive elements
diff(sequence)
#[1] 1 1 -1 3 -2 -1 4 1 2
We check which of these differences are not increases (less than 1), since they mark the breaks between increasing runs
diff(sequence) < 1
#[1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
and take the cumulative sum over them to create groups (a 0 is prepended so that the first element opens the first group)
cumsum(c(0, diff(sequence)) < 1)
#[1] 1 1 1 2 2 3 4 4 4 4
Based on these groups, we split the index from 1:length(sequence)
split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1))
#$`1`
#[1] 1 2 3
#$`2`
#[1] 4 5
#$`3`
#[1] 6
#$`4`
#[1] 7 8 9 10
Using sapply we loop over this list and return the start and end index of each element if its length is >= n (3 in this case)
sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)]))
#$`1`
#start end
# 1 3
#$`2`
# NULL
#$`3`
#NULL
#$`4`
#start end
# 7 10
Finally, rbind all of them together using do.call. NULL elements are automatically ignored.
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
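The same steps can be wrapped in a small reusable function. This is only a sketch: the name find_increasing_runs is hypothetical (it is not part of the answer above), and it assumes a numeric vector without NA values.

```r
# Hypothetical helper (name not from the original answer).
# Assumes `v` is a numeric vector without NAs.
find_increasing_runs <- function(v, n) {
  # a new group starts wherever the value fails to increase
  grp <- cumsum(c(0, diff(v)) < 1)
  idx <- split(seq_along(v), grp)
  # keep only groups with at least n elements
  keep <- idx[lengths(idx) >= n]
  data.frame(
    start = vapply(keep, function(i) i[1], integer(1), USE.NAMES = FALSE),
    end   = vapply(keep, function(i) i[length(i)], integer(1), USE.NAMES = FALSE)
  )
}

res <- find_increasing_runs(c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9), n = 3)
res
```

Unlike the do.call(rbind, ...) version, this returns a data frame with zero rows, rather than NULL, when no run is long enough.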
Here is another solution using base R. I tried to comment it well but it may still be hard to follow. It seems like you wanted direction / to learn, more than an outright answer so definitely follow up with questions if anything is unclear (or doesn't work for your actual application).
Also, for your data, I added a 12 on the end to make sure it was returning the correct position for repeated increases greater than n (3 in this case):
# Data (I added 12 on the end)
sequence <- c(1,2,3,2,5,3,2,6,7,9, 12)
# Create indices for whether or not the numbers in the sequence increased
indices <- c(1, diff(sequence) >= 1)
indices
[1] 1 1 1 0 1 0 0 1 1 1 1
Now that we have the indices, we need to get the start and end positions for repeats >= 3
# Finding increasing sequences of n length using rle
n <- 3
n <- n - 1 # k consecutive increases span k + 1 elements
# Examples
rle(indices)$lengths
[1] 3 1 1 2 4
rle(indices)$values
[1] 1 0 1 0 1
# Finding repeated TRUE (1) in our indices vector
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
reps
[1] TRUE FALSE FALSE FALSE TRUE
# Creating a vector of positions for the end of a sequence
# Because our indices are true false, we can use cumsum along
# with rle to create the positions of the end of the sequences
rle_positions <- cumsum(rle(indices)$lengths)
rle_positions
[1] 3 4 5 7 11
# Creating start sequence vector and subsetting start / end using reps
start <- c(1, head(rle_positions, -1))[reps]
end <- rle_positions[reps]
data.frame(start, end)
start end
1 1 3
2 7 11
Or, concisely:
n <- 3
n <- n-1
indices <- c(1, diff(sequence) >= 1)
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
rle_positions <- cumsum(rle(indices)$lengths)
data.frame(start = c(1, head(rle_positions, -1))[reps],
end = rle_positions[reps])
start end
1 1 3
2 7 11
EDIT: @Ronak's update made me realize I should be using diff instead of sapply with an anonymous function for my first step. I updated the answer because it was not catching an increase at the end of the vector (e.g., sequence <- c(1,2,3,2,5,3,2,6,7,9,12, 11, 11, 20, 100)); I also needed to add one more line under n <- 3. This should work as intended now.
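A possible variant of the rle idea runs rle directly on diff(v) > 0, which avoids both the leading dummy value and the n <- n - 1 adjustment. This is a sketch under assumptions: the function name increasing_runs_rle is hypothetical, and the input is assumed to be numeric without NA values.

```r
# Hypothetical helper (not from the original answer).
# Assumes `v` is numeric and contains no NAs.
increasing_runs_rle <- function(v, n) {
  r <- rle(diff(v) > 0)            # runs of "increase" / "no increase"
  ends   <- cumsum(r$lengths) + 1  # index in v of the last element of each run
  starts <- ends - r$lengths       # index in v of the first element of each run
  # a run of k increases covers k + 1 elements of v
  keep <- r$values & (r$lengths + 1) >= n
  data.frame(start = starts[keep], end = ends[keep])
}

short <- increasing_runs_rle(c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9, 12), n = 3)
long  <- increasing_runs_rle(c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9, 12, 11, 11, 20, 100), n = 3)
short
long
```

The second call uses the longer vector mentioned in the EDIT, where the final run 11, 20, 100 sits at the very end of the input.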
I am trying to write code that finds the previous consecutive occurrences of the same binary value.
I have managed to write a for loop to find the previous value (in my real problem the data is subsetted, hence the need for a for loop).
x<-data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
xLength<-length(x$successRate)
y<-vector(mode="integer",length=xLength)
if (xLength>1){
for (i in 2:xLength){
y[i]<-x$successRate[i-1]
}
}
y[1]<-NA
x[,"previous"]<-y
However I am looking for the desired output as follows:
# desired output
data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1),previousConsecutiveSuccess=c(NA,1,2,-1,1,-1,-2,-3,1,-1,1,2,3,-1,1,-1,-2,-3,-4,1,2,-1))
x <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
x$previous <- NA # no need for extra variable
if (nrow(x)>1) {
# set first consecutive idx manually
x$previous[2] <- -1+2*x$successRate[1] # -1 if successRate == 0; 1 otherwise
# loop only if nrow(x) is large enough
if (nrow(x)>2) {
for (i in 3:nrow(x)){ # start on row 3, as the previous 2 rows are needed
x$previous[i] <- ifelse(x$successRate[i-1] == x$successRate[i-2], # consecutive?
sign(x$previous[i-1])*(abs(x$previous[i-1])+1), # yes: add 1 and keep sign
-1+2*x$successRate[i-1]) # no: 0 -> -1; 1 -> 1
}
}
}
print(x$previous)
[1] NA 1 2 -1 1 -1 -2 -3 1 -1 1 2 3 -1 1 -1 -2 -3 -4 1 2 -1
A couple of simple options:
Option 1: Using base R functions only, including rle for run length encoding:
# Your original data.frame
x <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
# base R method to get lag 1 of a vector
lag_successRate <- c( NA, x$successRate[ - length(x$successRate) ] )
lag_rle <- rle(lag_successRate) # base function for run length encoding
ifelse( lag_rle$values==0, -1, 1 ) * lag_rle$lengths # multiply the rle length by -1 if the rle value == 0
# output: one signed length per run of the lagged vector (not one value per row)
[1] NA 2 -1 1 -3 1 -1 3 -1 1 -4 2 -1
Option 2: Using data.table, similar to the above, with base::rle to get the run length encoding.
If you have very large data sets, the data.table functions are likely to be the fastest and most memory-efficient option.
# your sample data as a dataframe, as you had originally:
DT <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
library(data.table)
setDT(DT) # set DT as a data.table by reference (without any copy!)
lag_rle <- rle( shift(DT$successRate) ) # get rle on the lag 1 of successRate
ifelse( lag_rle$values==0, -1, 1 ) * lag_rle$lengths # multiply the rle length by -1 if the rle value == 0
# output: one signed length per run of the lagged vector (not one value per row)
[1] NA 2 -1 1 -3 1 -1 3 -1 1 -4 2 -1
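The rle result holds one value per run, whereas the desired output in the question has one value per row. One way to expand it, sketched here, is to combine sequence() (a running counter within each run) with rep() (the sign of each run). It assumes successRate contains only 0 and 1 and no missing values; the column name previousConsecutiveSuccess is taken from the question.

```r
x <- data.frame(successRate = c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))

# lag 1 of the vector, then run length encoding of the lagged values
lagged <- c(NA, head(x$successRate, -1))
runs   <- rle(lagged)

# counter 1, 2, 3, ... inside each run, multiplied by the sign of that run
counter <- sequence(runs$lengths)
run_sign <- rep(ifelse(runs$values == 0, -1, 1), runs$lengths)
x$previousConsecutiveSuccess <- counter * run_sign

x$previousConsecutiveSuccess
```

The leading NA of the lagged vector forms its own run, so the first row stays NA.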
I have two vectors x1 and p:
x1 <- c(1,2,3,1,2,3)
p <- c(0.1,0.9,0.9,0.1,0.5,0.7)
Both vectors form pairs of values, see df1:
df1 <- data.frame(x1,p)
> df1
x1 p
1 1 0.1
2 2 0.9
3 3 0.9
4 1 0.1
5 2 0.5
6 3 0.7
The following function is used to derive a vector df1$x2 from df1$x1, depending on a random experiment and a probability p:
rexp <- function(x,p){
if(runif(1) <= p) return(x + 1)
return(x)
}
Using lapply, the function "rexp" is applied to every df1$x1 value. Depending on the random experiment, the value for x2 either remains equal to x1 or increases by 1.
In the following example, p equals 0.5:
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,rexp,0.5))
> df1
x1 p x2
1 1 0.1 2
2 2 0.9 2
3 3 0.9 4
4 1 0.1 1
5 2 0.5 2
6 3 0.7 4
Now to my problem: I want the argument "p" in "rexp" to refer to the vector df1$p.
For example, p for df1$x1[1] should be 0.1 (as can be seen in df1$p[1]): unlist(lapply(df1$x1[1],rexp,df1$p[1])).
p for df1$x1[5] should be df1$p[5], which is 0.5: unlist(lapply(df1$x1[5],rexp,df1$p[5]))
Desired output should be something like:
> unlist(lapply(df1$x1,rexp,df1$p))
[1] 1 3 4 1 2 4
#where 1 refers to rexp(df1$x1[1],df1$p[1]),
#3 refers to rexp(df1$x1[2],df1$p[2]),
#4 refers to rexp(df1$x1[3],df1$p[3]) and so on...
Doing that "manually" leads to:
set.seed(123)
> unlist(lapply(df1$x1[1],rexp,df1$p[1]))
[1] 1
> unlist(lapply(df1$x1[2],rexp,df1$p[2]))
[1] 3
> unlist(lapply(df1$x1[3],rexp,df1$p[3]))
[1] 4
> unlist(lapply(df1$x1[4],rexp,df1$p[4]))
[1] 1
> unlist(lapply(df1$x1[5],rexp,df1$p[5]))
[1] 2
> unlist(lapply(df1$x1[6],rexp,df1$p[6]))
[1] 4
How can "rexp" be adjusted so that the function uses the specific df1$p-value for each df1$x1-value?
Note: At this point, using "lapply" is important, because for every df1$x1-value in the function "rexp" a new random number should be drawn.
I am happy about any help!
Using your defined function, you may do
df1$x2 <- mapply(rexp, df1$x1, df1$p)
However, you may also exploit vectorization and use simply
df1$x2 <- df1$x1 + (runif(nrow(df1)) <= df1$p)
In this manner we element-wise sum the vector df1$x1 with a logical vector runif(nrow(df1)) <= df1$p that is being coerced to a binary vector (TRUE becomes 1 and FALSE becomes 0). The comparison <= is done element-wise as well, and we draw just as many different values from the uniform distribution as there are rows.
Regarding your approach, note that when p is fixed there is no need for unlist(lapply(...)): lapply returns a list, whereas sapply simplifies the result to a vector directly:
df1$x2 <- sapply(df1$x1, rexp, 0.5)
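As a quick check, the mapply call and the vectorized expression can be compared under the same seed. This sketch reuses the function and data from the question; note that defining a function called rexp masks stats::rexp for as long as it exists.

```r
# Function from the question (masks stats::rexp while defined)
rexp <- function(x, p) {
  if (runif(1) <= p) return(x + 1)
  x
}
df1 <- data.frame(x1 = c(1, 2, 3, 1, 2, 3),
                  p  = c(0.1, 0.9, 0.9, 0.1, 0.5, 0.7))

# one runif(1) draw per element, each compared with its own p
set.seed(123)
via_mapply <- mapply(rexp, df1$x1, df1$p)

# a single runif() call drawing one value per row
set.seed(123)
via_vector <- df1$x1 + (runif(nrow(df1)) <= df1$p)

all.equal(via_mapply, via_vector)
```

Both versions consume the random number stream in the same order, which is why they agree when seeded identically.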
Assuming we have a vector of values with missing values like the following:
test <- c(3,6,NA,7,8,NA,NA,5,8,6,NA,4,3,NA,NA,NA)
The objective is to identify the series of NA that have a length of 2 or less, in order to apply a linear interpolation to the series that have non-NA values at their extremities. I was able to detect the index of the end of such series with this code:
which.na <- which(is.na(test))
diff.which.na <- diff(which.na)
which.diff.which.na <- which(diff.which.na>1)
end.index <- which.na[which.diff.which.na]
result:
> end.index
[1] 3 7 11
The last NA series could be treated with a conditional statement. However, I'm not able to find the index of the beginning of an NA series, because I can't do the following:
diff.which.na <- diff(which.na,lag=-1)
So the expected output is:
beg.index= c(3,6,11)
and
end.index=c(3,7,11)
Any ideas?
Thank you
You can try with rle:
seq_na <- rle(is.na(test))
seq_na
#Run Length Encoding
# lengths: int [1:8] 2 1 2 2 3 1 2 3
# values : logi [1:8] FALSE TRUE FALSE TRUE FALSE TRUE ...
And look for the runs of TRUE with a length of at most 2:
seq_na$lengths[seq_na$values]
# [1] 1 2 1 3 # three of them have length <= 2
To find the indices, you can use cumsum (thanks to @Frank for the improvement!):
end.index <- with(seq_na, cumsum(lengths)[lengths <= 2 & values])
#[1] 3 7 11
beg.index <- end.index - with(seq_na, +(lengths==2 & values)[lengths <= 2 & values])
#[1] 3 6 11
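The begin index can also be computed for any maximum run length as end - length + 1. The following is a sketch of that generalization; max_len is a hypothetical parameter name, and the extra condition on the first and last run is an assumption based on the question's requirement that a series has non-NA values at both extremities.

```r
test <- c(3, 6, NA, 7, 8, NA, NA, 5, 8, 6, NA, 4, 3, NA, NA, NA)
max_len <- 2  # hypothetical parameter: longest NA run to keep

r      <- rle(is.na(test))
ends   <- cumsum(r$lengths)      # last index of every run
starts <- ends - r$lengths + 1   # first index of every run

# NA runs of at most max_len with a non-NA neighbour on both sides
keep <- r$values & r$lengths <= max_len & starts > 1 & ends < length(test)

beg.index <- starts[keep]
end.index <- ends[keep]
beg.index
end.index
```

For the example vector this gives the same begin and end indices as above, and the trailing run of three NAs is excluded both by its length and by its position at the end.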
I have a table that has two columns: whether you were sick (H01) and the number of days sick (H03). However, the number of days sick is NA if H01 == false, and I would like to set it to 0. When I do this:
test <- pe94.person[pe94.person$H01 == 12,]
test$H03 <- 0
It works fine. However, I'd like to replace the values in the original dataframe. This, however, fails:
pe94.person[pe94.person$H01 == 12,]$H03 <- 0
It returns:
> pe94.person[pe94.person$H01 == 12,]$H03 <- 0
Error in `[<-.data.frame`(`*tmp*`, pe94.person$H01 == 12, , value = list( :
missing values are not allowed in subscripted assignments of data frames
Any idea why this is? For what it's worth, here's a frequency table:
> table(pe94.person[pe94.person$H01 == 12,]$H03)
2 3 5 28
3 1 1 1
It is due to missing values in the H01 variable.
> x <- data.frame(a=c(NA,2:5), b=c(1:5))
> x
a b
1 NA 1
2 2 2
3 3 3
4 4 4
5 5 5
> x[x$a==2,]$b <- 99
Error in `[<-.data.frame`(`*tmp*`, x$a == 2, , value = list(a = NA_integer_, :
missing values are not allowed in subscripted assignments of data frames
The assignment won't work because x$a has a missing value.
Subsetting first works:
> z <- x[x$a==2,]
> z$b <- 99
> z
a b
NA NA 99
2 2 99
But that's because the [<- function apparently can't handle missing values in its extraction indices, even though [ can:
> `[<-`(x,x$a==2,,99)
Error in `[<-.data.frame`(x, x$a == 2, , 99) :
missing values are not allowed in subscripted assignments of data frames
So instead, try specifying the !is.na(x$a) part when you're doing the assignment:
> `[<-`(x,!is.na(x$a) & x$a==2,'b',99)
a b
1 NA 1
2 2 99
3 3 3
4 4 4
5 5 5
Or, more commonly:
> x[!is.na(x$a) & x$a==2,]$b <- 99
> x
a b
1 NA 1
2 2 99
3 3 3
4 4 4
5 5 5
Note that this behavior is described in the documentation:
The replacement methods can be used to add whole column(s) by specifying non-existent column(s), in which case the column(s) are added at the right-hand edge of the data frame and numerical indices must be contiguous to existing indices. On the other hand, rows can be added at any row after the current last row, and the columns will be in-filled with missing values. Missing values in the indices are not allowed for replacement.
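Another way to get an index without missing values, shown here as a sketch on the same toy data frame, is to use %in% for the comparison. Unlike ==, %in% returns FALSE rather than NA when the left-hand value is NA.

```r
x <- data.frame(a = c(NA, 2:5), b = 1:5)

# == propagates the NA, %in% does not
eq_index <- x$a == 2
in_index <- x$a %in% 2

# the index has no NA, so the replacement is allowed
x[in_index, "b"] <- 99
x
```

This only fits equality tests against one or more values; for other conditions the explicit !is.na(...) & ... form is still needed.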
You can use ifelse, like so:
pe94.person$foo <- ifelse(!is.na(pe94.person$H01) & pe94.person$H01 == 12, 0, pe94.person$H03)
Check that foo meets your criteria and then go ahead and assign it to pe94.person$H03 directly. I find it safer to assign the result to a new variable, and I usually use that variable in subsequent analysis.
There might be an NA somewhere in the column that is causing the error. Run the index on a specific column instead of the entire data frame.
movies[movies$Actors == "N/A",] = NA #ERROR
movies$Actors[movies$Actors == "N/A"] = NA #Works
I realise the question is very old, but I think the most elegant solution is to use the which() function:
pe94.person[which(pe94.person$H01 == 12),]$H03 <- 0
This should do what the original poster asked for, because which() drops the NAs and keeps only the positions of the TRUE results.
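To illustrate what which() does with missing values, here is a minimal sketch on a toy data frame (the pe94.person data is not available, so x below is a stand-in).

```r
x <- data.frame(a = c(NA, 2:5), b = 1:5)

cond <- x$a == 2       # logical vector that starts with NA
pos  <- which(cond)    # integer positions of TRUE only, NA is dropped

# integer positions contain no NA, so the replacement is allowed
x[pos, "b"] <- 99
x
```

One caveat when reusing this pattern: negating the result, as in x[-which(cond), ], selects no rows at all when nothing matches, because -integer(0) is still an empty index.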
Simply use the subset() function to exclude all NA from the selection.
It works like x[subset & !is.na(subset)]. Look at this data:
> x <- data.frame(a = c(T,F,T,F,NA,F,T, F, NA,NA,T,T,F),
+ b = c(F,T,T,F,T, T,NA,NA,F, T, T,F,F))
Subsetting with [ operator returns this:
> x[x$b == T & x$a == F, ]
a b
2 FALSE TRUE
NA NA NA
6 FALSE TRUE
NA.1 NA NA
NA.2 NA NA
And subset() does what we want:
> subset(x, b == T & a == F)
a b
2 FALSE TRUE
6 FALSE TRUE
To change the values of subsetted variables:
> ss <- subset(x, b == T & a == F)
> x[rownames(ss), 'a'] <- T
> x[c(2,6), ]
a b
2 TRUE TRUE
6 TRUE TRUE
The following works. Note that there is no comma in the subsetting, because the column vector is indexed rather than the data frame:
x <- data.frame(a=c(NA,2:5), b=c(1:5))
x$a[x$a==2] <- 99