I'm having a problem finding an efficient way to store the counts of a vector that changes over time. In my problem I start with an empty vector that will grow to length n, and at each iteration I add a number to it, but I also want some kind of object that acts as a counter: if the number I add is already in the vector, the object should add 1 to that number's count, and if it's not, it should add the value as a "name" and set its count to 1.
What I want is something analogous to a Python dictionary, in which the numbers are the keys and the counts are the values, so that I can access each separately, as with dict.keys() and dict.values().
For example, if I get the values 1, 2, 1, 4 then I would like the object to update as:
> value count
1 1
> value count
1 1
2 1
> value count
1 2
2 1
> value count
1 2
2 1
4 1
and to access both the values and the counts efficiently and separately. I thought of using something like plyr::count on the vector, but I don't think it's efficient to recount at every iteration, especially if n is really large.
Edit: In my problem it's necessary (well, maybe not) to update the counts at every iteration.
What I'm doing is simulating data from a Dirichlet Process using the Polya urn representation. For example, suppose that I have the vector (1.1, 0.2, 0.3, 1.1, 0.2), then to get a new data point one samples from a base distribution (for example a normal distribution) and adds that value with a certain probability, or adds a previous value with a probability proportional to the frequency of the value. With numbers:
Add the sampled value with probability 1/6, or
Add 1.1 with probability 2/6, or 0.2 with probability 2/6, or 0.3 with probability 1/6 (i.e. the probabilities are proportional to the frequencies). A sketch of this update step is shown after this list.
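For illustration, here is a minimal sketch of one such urn step, assuming a concentration parameter alpha = 1 and a standard normal base distribution (both are assumptions for this example, not stated in the question); the counts are kept in a named vector so they can be updated incrementally:
set.seed(42)
alpha  <- 1                                    # assumed concentration parameter
counts <- c("1.1" = 2, "0.2" = 2, "0.3" = 1)   # counts for the vector (1.1, 0.2, 0.3, 1.1, 0.2)
n_obs  <- sum(counts)                          # 5 observations so far
if (runif(1) < alpha / (n_obs + alpha)) {
  # with probability 1/6: draw a fresh value from the (assumed) base distribution
  new_val <- rnorm(1)
  counts[as.character(new_val)] <- 1
} else {
  # otherwise: pick an existing value with probability proportional to its count
  key <- sample(names(counts), 1, prob = counts)
  counts[key] <- counts[key] + 1
}
counts
With alpha = 1 and five observations this reproduces the 1/6 and 2/6 probabilities listed above.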
The structure you are describing is produced by as.data.frame(table(vec)). There is no need to update the counts as you go along, since calling this line will give you the updated counts:
vec <- c(1, 2, 4, 1)
as.data.frame(table(vec))
#> vec Freq
#> 1 1 2
#> 2 2 1
#> 3 4 1
Suppose I now update vec
vec <- append(vec, c(1, 2, 4, 5))
We get the new counts the same way
as.data.frame(table(vec))
#> vec Freq
#> 1 1 3
#> 2 2 2
#> 3 4 2
#> 4 5 1
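If you want the two pieces separately (analogous to dict.keys() and dict.values() in the question), you can also keep the table itself and pull them out; a small usage sketch:
counts <- table(vec)
as.numeric(names(counts))  # the distinct values ("keys")
#> [1] 1 2 4 5
as.vector(counts)          # the counts ("values")
#> [1] 3 2 2 1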
Maybe you can use assign and get0 with an environment to update the counts, like:
x <- c(1, 2, 1, 4)
y <- new.env()
lapply(x, function(z) {
  assign(as.character(z), get0(as.character(z), y, ifnotfound = 0) + 1, y)
  setNames(stack(mget(ls(y), y))[2:1], c("value", "count"))
})
#[[1]]
# value count
#1 1 1
#
#[[2]]
# value count
#1 1 1
#2 2 1
#
#[[3]]
# value count
#1 1 2
#2 2 1
#
#[[4]]
# value count
#1 1 2
#2 2 1
#3 4 1
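If you only need the final counts (rather than a snapshot at every iteration), they can be read out of the environment once at the end; a small usage sketch:
counts <- unlist(mget(ls(y), envir = y))
counts
#1 2 4
#2 1 1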
So I have this vector:
a = sample(0:3, size=30, replace = T)
[1] 0 1 3 3 0 1 1 1 3 3 2 1 1 3 0 2 1 1 2 0 1 1 3 2 2 3 0 1 3 2
What I want to have is a list of vectors with all the elements that are separated by n 0s. So in this case, with n = 0 (there can't be any 0 between the consecutive values), this would give:
res = c([1,3,3], [1,1,1,3,3,2,1,1,3], [2,1,1,2]....)
However, I would like to control this flexibly via the parameter n, so that if I set it to, for example, 2, then something like this:
b = c(1,2,0,3,0,0,4)
would still give a result like this:
res = c([1,2,3],[4])
I tried a lot of approaches with while loops inside for loops, trying to count the number of 0s, but I just could not achieve it.
Update
I tried to post the question in a more real-world setting here:
Flexibly calculate column based on consecutive counts in another column in R
Thank you all for the help. I just don't seem to manage to put your help into practice with my limited knowledge.
Here is a base R option using rle + split that works in the general case, i.e., the values in b are not limited to 0 to 3.
with(
  rle(with(rle(b == 0), rep(values & lengths == n, lengths))),
  Map(
    function(x) x[x != 0],
    unname(split(b, cut(seq_along(b), c(0, cumsum(lengths))))[!values])
  )
)
which gives (assuming n=2)
[[1]]
[1] 1 2 3
[[2]]
[1] 4
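To see how the nested rle works, here is the same logic unrolled step by step for b <- c(1, 2, 0, 3, 0, 0, 4) and n <- 2 (the intermediate names are introduced here only for illustration):
b <- c(1, 2, 0, 3, 0, 0, 4)
n <- 2
# 1. mark runs of zeros whose length is exactly n
zero_runs <- rle(b == 0)
is_sep    <- with(zero_runs, rep(values & lengths == n, lengths))
is_sep
#[1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
# 2. turn those separator runs into group boundaries and drop the separator groups
grp    <- rle(is_sep)
groups <- unname(split(b, cut(seq_along(b), c(0, cumsum(grp$lengths)))))[!grp$values]
# 3. remove any remaining zeros inside each group
lapply(groups, function(x) x[x != 0])
#[[1]]
#[1] 1 2 3
#[[2]]
#[1] 4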
If the values are within the range 0 to 9, you can try the code below:
lapply(
  unlist(strsplit(paste0(b, collapse = ""), strrep(0, n))),
  function(x) {
    as.numeric(
      unlist(strsplit(gsub("0", "", x), ""))
    )
  }
)
which also gives
[[1]]
[1] 1 2 3
[[2]]
[1] 4
I also wanted to post a somewhat useful solution using the function SplitAt from DescTools:
library(DescTools)  # for SplitAt
library(magrittr)   # for %>%
SplitAt(a, which(a == 0)) %>% lapply(., function(x) x[which(x != 0)])
where a is your initial vector. It gives you a list where every entry contains the numbers between the zeros.
If you then add another SplitAt, you can create sublist after sublist and split the result into as many sublists as you want, e.g.:
n <- 4
SplitAt(a, which(a==0)) %>% lapply(., function(x) x[which(x != 0)]) %>% SplitAt(., n)
gives you:
set.seed(1)
a <- sample(0:3, size=30, replace = T)
a
[1] 0 3 2 0 1 0 2 2 1 1 2 2 0 0 0 1 1 1 1 2 0 2 0 0 0 0 1 0 0 1
a2 <- paste(a, collapse = "") # Turns into a character vector, making it easier to handle patterns.
a3 <- unlist(strsplit(a2, "0")) # Change to whatever pattern you want, like "00".
a3 <- a3[a3 != ""] # Remove empty elements
a3 <- as.numeric(a3) # Turn back to numeric
a3
[1] 32 1 221122 11112 2 1 1
I am trying to recreate a Stata code snippet in R and I have hit a snag.
In Stata, using the lagged value A[_n-1] gives this result. Starting from:
A B
1 2
1 2
1 2
1 2
replace A=B if A==A[_n-1]
A B
1 2
2 2
1 2
2 2
If I try to replicate in R I get the following:
temp <- data.frame("A" = rep(1,4), "B" = rep(2,4))
temp
A B
1 2
1 2
1 2
1 2
library(dplyr)
temp <- temp %>% mutate(A = ifelse(A == lag(A, 1), B, A))
temp
A B
2 2
2 2
2 2
2 2
I need it to be the same as in Stata.
lag would not be used here because it uses the original values in A whereas at each iteration the question needs the most recently updated values.
Define an Update function and apply it using accumulate2 in the purrr package. It returns a list so unlist it.
library(purrr)
Update <- function(prev, A, B) if (A == prev) B else A
transform(temp, A = unlist(accumulate2(A, B[-1], Update)))
giving:
A B
1 1 2
2 2 2
3 1 2
4 2 2
Another way to write this uses fn$ in gsubfn which causes formula arguments to be interpreted as functions. The function that it builds uses the free variables in the formula as the arguments in the order encountered.
library(gsubfn)
library(purrr)
transform(temp, A = unlist(fn$accumulate2(A, B[-1], ~ if (prev == A) B else A)))
Also note the comments below this answer for another variation.
It looks like we need to update A after each iteration:
for(i in 2:nrow(temp)) {
  temp$A[i] <- if (temp$A[i] == temp$A[i-1]) temp$B[i] else temp$A[i]
}
temp
# A B
#1 1 2
#2 2 2
#3 1 2
#4 2 2
Or, as @G.Grothendieck mentioned in the comments, it can be made more compact with:
for(i in 2:nrow(temp)) if (temp$A[i] == temp$A[i-1]) temp$A[i] <- temp$B[i]
Here's a function that will do it:
lagger <- function(x, y) {
  current <- x[1]
  out <- x
  for (i in 2:length(x)) {
    if (x[i] == current) {
      out[i] <- y[i]
    }
    current <- out[i]
  }
  out
}
lagger(temp$A, temp$B)
[1] 1 2 1 2
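To put the result back into the data frame from the question:
temp$A <- lagger(temp$A, temp$B)
temp
#  A B
#1 1 2
#2 2 2
#3 1 2
#4 2 2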
I want to find all subsequences within a sequence that have a (minimum) length of n. Let's assume I have this sequence:
sequence <- c(1,2,3,2,5,3,2,6,7,9)
and I want to find the increasing subsequences with a minimum length of 3. The output should be a data frame with the start and end position of each subsequence found:
df =data.frame(c(1,7),c(3,10))
colnames(df) <- c("start", "end")
Can somebody give a hint on how to solve my problem?
Thanks in advance!
One way using only base R
n <- 3
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
We split the indices of sequence into groups based on the contiguous increasing subsequences; if the length of a group is greater than or equal to n, we return the start and end index of that group.
To understand it, let's break this down step by step.
Using diff we can find the difference between consecutive elements:
diff(sequence)
#[1]  1  1 -1  3 -2 -1  4  1  2
We check which of these differences are not increases (i.e. are less than 1):
diff(sequence) < 1
#[1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
and take the cumulative sum over them (prepending a 0 so the result lines up with the original positions) to create groups:
cumsum(c(0, diff(sequence)) < 1)
#[1] 1 1 1 2 2 3 4 4 4 4
Based on these groups, we split the indices 1:length(sequence):
split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1))
#$`1`
#[1] 1 2 3
#$`2`
#[1] 4 5
#$`3`
#[1] 6
#$`4`
#[1] 7 8 9 10
Using sapply we loop over this list and return the start and end index of each group whose length is >= n (3 in this case):
sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)]))
#$`1`
#start end
# 1 3
#$`2`
# NULL
#$`3`
#NULL
#$`4`
#start end
# 7 10
Finally, rbind all of them together using do.call. NULL elements are automatically ignored.
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
Here is another solution using base R. I tried to comment it well but it may still be hard to follow. It seems like you wanted direction / to learn, more than an outright answer so definitely follow up with questions if anything is unclear (or doesn't work for your actual application).
Also, for your data, I added a 12 on the end to make sure it was returning the correct position for repeated increases greater than n (3 in this case):
# Data (I added 12 on the end)
sequence <- c(1,2,3,2,5,3,2,6,7,9, 12)
# Create indices for whether or not the numbers in the sequence increased
indices <- c(1, diff(sequence) >= 1)
indices
[1] 1 1 1 0 1 0 0 1 1 1 1
Now that we have the indices, we need to get the start and end positions of the runs that correspond to at least 3 increasing values.
# Finding increasing sequences of n length using rle
n <- 3
n <- n - 1
# Examples
rle(indices)$lengths
[1] 3 1 1 2 4
rle(indices)$values
[1] 1 0 1 0 1
# Finding repeated TRUE (1) in our indices vector
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
reps
[1] TRUE FALSE FALSE FALSE TRUE
# Creating a vector of positions for the end of a sequence
# Because our indices are true false, we can use cumsum along
# with rle to create the positions of the end of the sequences
rle_positions <- cumsum(rle(indices)$lengths)
rle_positions
[1] 3 4 5 7 11
# Creating start sequence vector and subsetting start / end using reps
start <- c(1, head(rle_positions, -1))[reps]
end <- rle_positions[reps]
data.frame(start, end)
start end
1 1 3
2 7 11
Or, concisely:
n <- 3
n <- n-1
indices <- c(1, diff(sequence) >= 1)
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
rle_positions <- cumsum(rle(indices)$lengths)
data.frame(start = c(1, head(rle_positions, -1))[reps],
end = rle_positions[reps])
start end
1 1 3
2 7 11
EDIT: @Ronak's update made me realize I should be using diff instead of sapply with an anonymous function for my first step. I updated the answer because it was not catching an increase at the end of the vector (e.g., sequence <- c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9, 12, 11, 11, 20, 100)); I also needed to add one more line under n <- 3. This should work as intended now.
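For reference, here is the concise version above run on that longer sequence (same code, different input); the expected output shown in the comments is hand-traced, so treat it as illustrative:
sequence <- c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9, 12, 11, 11, 20, 100)
n <- 3
n <- n - 1
indices <- c(1, diff(sequence) >= 1)
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
rle_positions <- cumsum(rle(indices)$lengths)
data.frame(start = c(1, head(rle_positions, -1))[reps],
           end = rle_positions[reps])
#  start end
#1     1   3
#2     7  11
#3    13  15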
I am trying to write code to find the number of previous consecutive occurrences of the same binary value.
I have managed to write a for loop to find the previous value (in my real problem the data is subsetted, hence a for loop is required).
x <- data.frame(successRate = c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
xLength <- length(x$successRate)
y <- vector(mode = "integer", length = xLength)
if (xLength > 1) {
  for (i in 2:xLength) {
    y[i] <- x$successRate[i-1]
  }
}
y[1] <- NA
x[, "previous"] <- y
However, I am looking for the following desired output:
# desired output
data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1),previousConsecutiveSuccess=c(NA,1,2,-1,1,-1,-2,-3,1,-1,1,2,3,-1,1,-1,-2,-3,-4,1,2,-1))
x <- data.frame(successRate = c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
x$previous <- NA   # no need for an extra variable
if (nrow(x) > 1) {
  # set the first consecutive index manually
  x$previous[2] <- -1 + 2*x$successRate[1]   # -1 if successRate == 0; 1 otherwise
  # loop only if nrow(x) is large enough
  if (nrow(x) > 2) {
    for (i in 3:nrow(x)) {   # start on row 3, as the previous 2 rows are needed
      x$previous[i] <- ifelse(x$successRate[i-1] == x$successRate[i-2],        # consecutive?
                              sign(x$previous[i-1])*(abs(x$previous[i-1])+1),  # yes: add 1 and keep the sign
                              -1 + 2*x$successRate[i-1])                       # no: 0 -> -1; 1 -> 1
    }
  }
}
print(x$previous)
[1] NA 1 2 -1 1 -1 -2 -3 1 -1 1 2 3 -1 1 -1 -2 -3 -4 1 2 -1
A couple of simple options:
Option 1: Using base R functions only, including rle for run-length encoding:
# Your original data.frame
x <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
# base R method to get lag 1 of a vector
lag_successRate <- c( NA, x$successRate[ - length(x$successRate) ] )
lag_rle <- rle(lag_successRate) # base function for run length encoding
ifelse( lag_rle$values==0, -1, 1 ) * lag_rle$lengths # multiply the rle length by -1 if the rle value == 0
# output as requested
[1] NA 2 -1 1 -3 1 -1 3 -1 1 -4 2 -1
Option 2: Using data.table, similar to the above, using base::rle to get the run-length encoding.
If you have very large data sets, the data.table functions are likely to be the fastest and most memory-efficient option.
# your sample data as a dataframe, as you had originally:
DT <- data.frame(successRate=c(1,1,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,1,1,0,1))
library(data.table)
setDT(DT) # set DT as a data.table by reference (without any copy!)
lag_rle <- rle( shift(DT$successRate) ) # get rle on the lag 1 of successRate
ifelse( lag_rle$values==0, -1, 1 ) * lag_rle$lengths # multiply the rle length by -1 if the rle value == 0
# output as requested
[1] NA 2 -1 1 -3 1 -1 3 -1 1 -4 2 -1