Previous consecutive occurrences in R data frame - r

I am trying to write code that finds the number of previous consecutive occurrences of the same binary value.
I have managed to write a for loop that finds the previous value (in my real problem the data is subsetted, which is why a for loop is required).
x<-data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
xLength<-length(x$successRate)
y<-vector(mode="integer",length=xLength)
if (xLength>1){
for (i in 2:xLength){
y[i]<-x$successRate[i-1]
}
}
y[1]<-NA
x[,"previous"]<-y
However, the output I am looking for is as follows:
# desired output
data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1),previousConsecutiveSuccess=c(NA,1,2,-1,1,-1,-2,-3,1,-1,1,2,3,-1,1,-1,-2,-3,-4,1,2,-1))

x <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
x$previous <- NA # no need for extra variable
if (nrow(x)>1) {
# set first consecutive idx manually
x$previous[2] <- -1+2*x$successRate[1] # -1 if successRate == 0; 1 otherwise
# loop only if nrow(x) is large enough
if (nrow(x)>2) {
for (i in 3:nrow(x)){ # start on row 3, as the previous 2 rows are needed
x$previous[i] <- ifelse(x$successRate[i-1] == x$successRate[i-2], # consecutive?
sign(x$previous[i-1])*(abs(x$previous[i-1])+1), # yes: add 1 and keep sign
-1+2*x$successRate[i-1]) # no: 0 -> -1; 1 -> 1
}
}
}
print(x$previous)
[1] NA 1 2 -1 1 -1 -2 -3 1 -1 1 2 3 -1 1 -1 -2 -3 -4 1 2 -1

A couple of simple options:
Option 1: Using base R functions only, including rle for run-length encoding:
# Your original data.frame
x <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
# base R method to get lag 1 of a vector
lag_successRate <- c( NA, x$successRate[ - length(x$successRate) ] )
lag_rle <- rle(lag_successRate) # base function for run length encoding
ifelse( lag_rle$values==0, -1, 1 ) * lag_rle$lengths # multiply the rle length by -1 if the rle value == 0
# output: one signed run length per run (negative for runs of 0), not one value per row
[1] NA 2 -1 1 -3 1 -1 3 -1 1 -4 2 -1
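The rle result above has one entry per run. To get one value per row, as in the desired output of the question, the run lengths can be expanded back out. A minimal sketch, assuming the same lag-1 vector as in Option 1 (the names run_sign and previousConsecutiveSuccess are only illustrative):

```r
x <- data.frame(successRate = c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
lag_successRate <- c(NA, x$successRate[-length(x$successRate)])
lag_rle <- rle(lag_successRate)
# sign of each run: -1 for runs of 0, 1 for runs of 1, NA for the leading NA
run_sign <- ifelse(lag_rle$values == 0, -1, 1)
# sequence() counts 1..k inside each run; rep() stretches the sign to row level
x$previousConsecutiveSuccess <- sequence(lag_rle$lengths) * rep(run_sign, lag_rle$lengths)
x$previousConsecutiveSuccess
# [1] NA  1  2 -1  1 -1 -2 -3  1 -1  1  2  3 -1  1 -1 -2 -3 -4  1  2 -1
```

This avoids the explicit loop; whether it fits the subsetted data from the question depends on how the subsets are formed.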
Option 2: Using data.table, similar to the above, with base::rle providing the run-length encoding.
If you have very large data sets, the data.table functions are likely to be the fastest and most memory-efficient option.
# your sample data as a dataframe, as you had originally:
DT <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
library(data.table)
setDT(DT) # set DT as a data.table by reference (without any copy!)
lag_rle <- rle( shift(DT$successRate) ) # get rle on the lag 1 of successRate
ifelse( lag_rle$values==0, -1, 1 ) * lag_rle$lengths # multiply the rle length by -1 if the rle value == 0
# output: one signed run length per run (negative for runs of 0), not one value per row
[1] NA 2 -1 1 -3 1 -1 3 -1 1 -4 2 -1

Related

Create a list of vectors from a vector where n consecutive values are not 0 in R

So I have this vector:
a = sample(0:3, size=30, replace = T)
[1] 0 1 3 3 0 1 1 1 3 3 2 1 1 3 0 2 1 1 2 0 1 1 3 2 2 3 0 1 3 2
What I want is a list of vectors containing the elements that are separated by n 0s. In this case, with n = 0 (there can't be any 0 between the consecutive values), this would give:
res = c([1,3,3], [1,1,1,3,3,2,1,1,3], [2,1,1,2]....)
However, I would like to control the n parameter flexibly, so that if I set it to 2, for example, something like this:
b = c(1,2,0,3,0,0,4)
would still result in a result like this
res = c([1,2,3],[4])
I tried many approaches with while loops inside for loops, trying to count the number of 0s, but I could not get it to work.
Update
I tried to post the question in a more real-world setting here:
Flexibly calculate column based on consecutive counts in another column in R
Thank you all for the help. I just can't seem to put your help into practice with my limited knowledge.
Here is a base R option using rle + split for general cases, i.e., values in b are not limited to 0 to 3.
with(
rle(with(rle(b == 0), rep(values & lengths == n, lengths))),
Map(
function(x) x[x != 0],
unname(split(b, cut(seq_along(b), c(0, cumsum(lengths))))[!values])
)
)
which gives (assuming n=2)
[[1]]
[1] 1 2 3
[[2]]
[1] 4
If your values are within the range 0 to 9, you can try the code below
lapply(
unlist(strsplit(paste0(b, collapse = ""), strrep(0, n))),
function(x) {
as.numeric(
unlist(strsplit(gsub("0", "", x), ""))
)
}
)
which also gives
[[1]]
[1] 1 2 3
[[2]]
[1] 4
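The rle idea above can also be wrapped in a small reusable function. This is only a sketch: the name split_by_zero_runs is hypothetical, and it assumes n >= 1 and treats runs of at least n zeros as separators (the rle answer above uses exactly n; swap >= for == to match it).

```r
# Hypothetical helper: split b at runs of at least n zeros and drop all zeros.
split_by_zero_runs <- function(b, n) {
  r <- rle(b == 0)
  # TRUE for every element that belongs to a zero run of length >= n
  sep <- rep(r$values & r$lengths >= n, r$lengths)
  # TRUE only at the first element of each separator run
  sep_start <- sep & !c(FALSE, head(sep, -1))
  # group id = number of separator runs seen so far
  grp <- cumsum(sep_start)
  keep <- b != 0
  unname(split(b[keep], grp[keep]))
}

b <- c(1, 2, 0, 3, 0, 0, 4)
split_by_zero_runs(b, 2)
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 4
```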
I also wanted to post a hopefully useful solution using the function SplitAt from DescTools:
SplitAt(a, which(a==0)) %>% lapply(., function(x) x[which(x != 0)])
where a is your initial vector. It gives you a list where every entry contains the numbers between zeros.
If you then add another SplitAt, you can create sublist after sublist and split the result into as many sublists as you want, e.g.:
n <- 4
SplitAt(a, which(a==0)) %>% lapply(., function(x) x[which(x != 0)]) %>% SplitAt(., n)
gives you:
set.seed(1)
a <- sample(0:3, size=30, replace = T)
a
[1] 0 3 2 0 1 0 2 2 1 1 2 2 0 0 0 1 1 1 1 2 0 2 0 0 0 0 1 0 0 1
a2 <- paste(a, collapse = "") # Collapses into a single string, making it easier to handle patterns.
a3 <- unlist(strsplit(a2, "0")) # Change to whatever pattern you want, like "00".
a3 <- a3[a3 != ""] # Remove empty elements
a3 <- as.numeric(a3) # Turn back to numeric
a3
[1] 32 1 221122 11112 2 1 1

Modifying a list based on the length of its shortest string

I have a list of two series that start out the same length. After executing the following code, the second series has one fewer elements than the first. Is there a general way of removing the final element of only the series containing n+1 elements, so that all the series in my list have n elements? What about if I have a combination of series in my list containing n, n+1 and n+2 elements? Below is a minimal reproducible example.
#test
library('urca')
tseries <- list("t1" = c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1), "t2" = c(1,2,3,4,5,6,7,8,9,10,9,8,7,8,9));
# apply stationarity test to the list of series
adf <- lapply(tseries, function(x) tseries::adf.test(x)$p.value)
adf
# index only series that need differencing
not_stationary <- tseries[which(adf > 0.05)]
stationary <- tseries[which(adf < 0.05)]
not_stationary <- lapply(not_stationary, diff);
# verify
adf <- lapply(not_stationary, function(x) tseries::adf.test(x)$p.value)
adf
now_stationary <- not_stationary
#combine stationary and now_stationary
tseries_diff <- c(stationary, now_stationary)
tseries_diff
#$t1
#[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
#$t2
#[1] 1 1 1 1 1 1 1 1 1 -1 -1 -1 1 1
So to summarise, I would like to remove the final element, 1, from t1, but using code that can be applied to a list of series of lengths n and n+1 (and n+2 would be useful).
Thanks!
You can find the minimum length and simply get the series up to that point, i.e.
new_series_list <- lapply(tseries_diff, function(i)i[seq_len(min(lengths(tseries_diff)))])
so the lengths are now the same
lengths(new_series_list)
#t1 t2
#14 14
This will work for series of any size. It trims the longer series to match the shortest one.
Edited for a list instead of a vector:
If you are dealing with a list, you want to make all of the series the length of the shortest:
(I modified the example to avoid using a library)
#test
mylist <- c(1,1,1,1,1)
mylongerlist <- c(1,1,1,1,1,1,1)
length(mylist)
# [1] 5
length(mylongerlist)
# [1] 7
#combine
tseries_diff <- list("t1" = mylist, "t2" = mylongerlist)
tseries_diff
# $t1
# [1] 1 1 1 1 1
#
# $t2
# [1] 1 1 1 1 1 1 1
# on the fly approach to truncate
lapply(tseries_diff, function(x) { length(x) <- min(lengths(tseries_diff)); x })
# $t1
# [1] 1 1 1 1 1
#
# $t2
# [1] 1 1 1 1 1
And a function
# As a reusable function for clear code
reduceToShortestLength <- function(toCut) {
# takes a list and cuts the tail off of any series longer than the shortest
shortest <- min(lengths(toCut)) # use the argument, not the global tseries_diff
lapply(toCut, function(x) { length(x) <- shortest; x })
}
reduceToShortestLength(tseries_diff)
# $t1
# [1] 1 1 1 1 1
#
# $t2
# [1] 1 1 1 1 1
Original below (in case anyone thinks vector like I did at first)
I think you are asking how to truncate a vector to the shortest length. The head function does this well in base R.
The on-the-fly approach:
> mylist <- c(1,1,1,1,1)
> mylongerlist <- c(1,1,1,1,1,1,1)
> length(mylist)
[1] 5
> length(mylongerlist)
[1] 7
> x <- head(mylongerlist, length(mylist))
> length(x)
[1] 5
A function can be written like so:
> reduceToShorterLength<- function(toshorten, template) { head(toshorten, length(template))}
> x <- reduceToShorterLength(mylongerlist, mylist)
> length(x)
[1] 5

Find all subsequences with specific length in sequence of numbers in R

I want to find all subsequences within a sequence with a (minimum) length of n. Let's assume I have this sequence
sequence <- c(1,2,3,2,5,3,2,6,7,9)
and I want to find the increasing subsequences with a minimum length of 3. The output should be a data frame with the start and end position of each subsequence found.
df =data.frame(c(1,7),c(3,10))
colnames(df) <- c("start", "end")
Can somebody give a hint how to solve my problem?
Thanks in advance!
One way using only base R
n <- 3
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
Split the indices of sequence into groups of continuously increasing values; if the length of a group is greater than or equal to n, return the start and end index of that group.
To understand this, let's break it down step by step.
Using diff we can find the difference between consecutive elements
diff(sequence)
#[1] 1 1 -1 3 -2 -1 4 1 2
We flag the positions where the sequence does not increase
diff(sequence) < 1
#[1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
and take the cumulative sum over them (with a 0 prepended so that the first element starts a group) to create groups
cumsum(c(0, diff(sequence)) < 1)
#[1] 1 1 1 2 2 3 4 4 4 4
Based on these groups, we split the indices 1:length(sequence)
split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1))
#$`1`
#[1] 1 2 3
#$`2`
#[1] 4 5
#$`3`
#[1] 6
#$`4`
#[1] 7 8 9 10
Using sapply we loop over this list and return the start and end index of the list if the length of the list is >= n (3 in this case)
sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)]))
#$`1`
#start end
# 1 3
#$`2`
# NULL
#$`3`
#NULL
#$`4`
#start end
# 7 10
Finally, rbind all of them together using do.call. NULL elements are automatically ignored.
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
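The same grouping idea can be written as a function that returns a data frame directly. This is a sketch, not the answer's own code: the name find_increasing_runs is hypothetical, and it assumes "increasing" means strictly increasing (diff > 0), which gives the same result as diff >= 1 for integer data.

```r
# Hypothetical helper: start/end positions of strictly increasing runs of length >= n.
find_increasing_runs <- function(x, n) {
  # a new group starts at position 1 and wherever the value fails to increase
  grp <- cumsum(c(TRUE, diff(x) <= 0))
  r <- rle(grp)
  end <- cumsum(r$lengths)        # last index of each group
  start <- end - r$lengths + 1    # first index of each group
  keep <- r$lengths >= n
  data.frame(start = start[keep], end = end[keep])
}

sequence <- c(1,2,3,2,5,3,2,6,7,9)
find_increasing_runs(sequence, 3)
#   start end
# 1     1   3
# 2     7  10
```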
Here is another solution using base R. I tried to comment it well but it may still be hard to follow. It seems like you wanted direction / to learn, more than an outright answer so definitely follow up with questions if anything is unclear (or doesn't work for your actual application).
Also, for your data, I added a 12 on the end to make sure it was returning the correct position for repeated increases greater than n (3 in this case):
# Data (I added 12 on the end)
sequence <- c(1,2,3,2,5,3,2,6,7,9, 12)
# Create indices for whether or not the numbers in the sequence increased
indices <- c(1, diff(sequence) >= 1)
indices
[1] 1 1 1 0 1 0 0 1 1 1 1
Now that we have the indices, we need to get the start and end positions for repeats >= 3
# Finding increasing sequences of n length using rle
n <- 3
n <- n - 1
# Examples
rle(indices)$lengths
[1] 3 1 1 2 4
rle(indices)$values
[1] 1 0 1 0 1
# Finding repeated TRUE (1) in our indices vector
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
reps
[1] TRUE FALSE FALSE FALSE TRUE
# Creating a vector of positions for the end of a sequence
# Because our indices are true false, we can use cumsum along
# with rle to create the positions of the end of the sequences
rle_positions <- cumsum(rle(indices)$lengths)
rle_positions
[1] 3 4 5 7 11
# Creating start sequence vector and subsetting start / end using reps
start <- c(1, head(rle_positions, -1))[reps]
end <- rle_positions[reps]
data.frame(start, end)
start end
1 1 3
2 7 11
Or, concisely:
n <- 3
n <- n-1
indices <- c(1, diff(sequence) >= 1)
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
rle_positions <- cumsum(rle(indices)$lengths)
data.frame(start = c(1, head(rle_positions, -1))[reps],
end = rle_positions[reps])
start end
1 1 3
2 7 11
EDIT: @Ronak's update made me realize I should be using diff instead of sapply with an anonymous function for my first step. I updated the answer because it was not catching an increase at the end of the vector (e.g., sequence <- c(1,2,3,2,5,3,2,6,7,9,12, 11, 11, 20, 100)); I also needed to add one more line under n <- 3. This should work as intended now.

Distinguishing between infinity and negative infinity during value replacement in R

There are some good examples of how to replace infinite values in R with NA in this thread.
For instance,
DT <- data.table(dat)
invisible(lapply(names(DT),function(.name) set(DT,
which(is.infinite(DT[[.name]])), j = .name,value =NA)))
However, this doesn't distinguish between positive (Inf) and negative infinity (-Inf).
I need to make this distinction because instead of just replacing the values with NA and throwing them out or imputing them, I'd like to try using the max non-infinite value for positive infinity and min non-infinity value for negative infinity (and things like that).
Is this possible?
Example input data
a <- c(-1,2,3,4,100/0,-100/0)
[1] -1 2 3 4 Inf -Inf
Example output data
[1] -1 2 3 4 4 -1
Why not just combine is.infinite with a standard > or < comparison?
a <- c(-1,2,3,4,100/0,-100/0)
a[is.infinite(a) & a < 0] <- min(a[!is.infinite(a)])
a[is.infinite(a) & a > 0] <- max(a[!is.infinite(a)])
a
[1] -1 2 3 4 4 -1
You may extract/replace any -Inf or Inf values in your vector in an even simpler fashion:
a <- c(-1,2,3,4,100/0,-100/0)
a[a <= -Inf] <- min(a[is.finite(a)])
a[a >= Inf] <- max(a[is.finite(a)])
a
[1] -1 2 3 4 4 -1
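Either answer can be wrapped in a small function so it can be applied to several vectors or columns. A minimal sketch; the name replace_inf is hypothetical, and it assumes the vector contains at least one finite value (otherwise min and max have nothing to work with).

```r
# Hypothetical helper: replace Inf with the largest finite value
# and -Inf with the smallest finite value of the same vector.
replace_inf <- function(x) {
  finite_vals <- x[is.finite(x)]            # drops Inf, -Inf, NA and NaN
  x[which(x == Inf)]  <- max(finite_vals)   # which() skips NA comparisons
  x[which(x == -Inf)] <- min(finite_vals)
  x
}

a <- c(-1, 2, 3, 4, 100/0, -100/0)
replace_inf(a)
# [1] -1  2  3  4  4 -1
```

For a data frame or data.table, something like lapply over the numeric columns would apply it column by column.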

Select dataframe if both values exists

Here is an example:
df1 <- data.frame(x=1:2, account=c(-1,-1))
df2 <- data.frame(x=1:3, account=c(1,-1,1))
df3 <- data.frame(x=1, account=c(-1))
ls <- list(df1,df2,df3)
Failed attempt:
for(i in 1:length(ls)){
d <- ls[[i]]; if(d$account %in% c(-1,1)) { dout <- d} else {next}
}
I also tried: (not sure why this doesn't work)
grepl(paste(c(-1,1), collapse="|"), as.character(df1$account))
gives: (which is correct, since | means or, so one of the values is matched)
[1] TRUE TRUE
however, I have tried this:
df1 <- data.frame(x=1:2, account=c(-1,1))
grepl(paste(c(-1,1), collapse="&"), as.character(df1$account))
gives:
[1] FALSE FALSE
I would like to keep only the data frames that contain both -1 and 1 in the account column, and discard the rest.
Desired result:
d
x account
1 1 1
2 2 -1
3 3 1
Or, you could stop using a list of data.frames:
library(data.table)
DT <- rbindlist(ls, idcol="id")
# id x account
# 1: 1 1 -1
# 2: 1 2 -1
# 3: 2 1 1
# 4: 2 2 -1
# 5: 2 3 1
# 6: 3 1 -1
And filter the single table:
DT[, if (uniqueN(account) > 1) .SD, by=id]
# id x account
# 1: 2 1 1
# 2: 2 2 -1
# 3: 2 3 1
(This follows @akrun's answer; uniqueN(x) is a fast shortcut to length(unique(x)).)
We could loop through the list and check whether the length of unique elements in 'account' is greater than 1 (assuming that there are only -1 and 1 as possible elements). Use this logical index to filter the list.
ls[sapply(ls, function(x) length(unique(x$account))>1)]
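If the account column could contain values other than -1 and 1, the check can be made explicit instead of counting unique values. A minimal sketch, assuming the data frames from the question; the names df_list, has_both and kept are illustrative (df_list is used instead of ls to avoid masking the base function ls).

```r
df1 <- data.frame(x = 1:2, account = c(-1, -1))
df2 <- data.frame(x = 1:3, account = c(1, -1, 1))
df3 <- data.frame(x = 1, account = -1)
df_list <- list(df1, df2, df3)

# TRUE only when both -1 and 1 occur somewhere in the account column
has_both <- function(d) all(c(-1, 1) %in% d$account)

# Filter() keeps the list elements for which the predicate is TRUE
kept <- Filter(has_both, df_list)
length(kept)
# [1] 1
```

The loop in the question fails because d$account %in% c(-1,1) yields one logical per row, so it cannot serve as a single if condition; wrapping the reversed test in all() reduces it to one TRUE or FALSE.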
