How to reset cumsum at end of consecutive string [duplicate] - r

If I have the following vector:
x = c(1,1,1,0,0,0,0,1,1,0,0,1,1,1,0,0,1,1,1,1,0,0,0,0,1,1,1)
how can I calculate the cumulative sum for all of the consecutive 1's, resetting each time I hit a 0?
So, the desired output would look like this:
> y
[1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3

This works:
unlist(lapply(rle(x)$lengths, FUN = function(z) 1:z)) * x
# [1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3
This relies heavily on the special case of a vector containing only 1s and 0s, but for that case it works well. Even better, with @nicola's suggested improvement:
sequence(rle(x)$lengths) * x
# [1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3
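If the vector can hold values other than 0 and 1, one possible generalization is to open a new group at every 0 and take the cumulative sum within each group. This is only a sketch: the helper name reset_cumsum and the second test vector are made up for illustration and are not part of the answer above.

```r
x <- c(1,1,1,0,0,0,0,1,1,0,0,1,1,1,0,0,1,1,1,1,0,0,0,0,1,1,1)

# cumsum(v == 0) increases by one at each 0, so every 0 starts a new group;
# ave() then applies cumsum separately inside each group.
# reset_cumsum is a hypothetical helper name.
reset_cumsum <- function(v) ave(v, cumsum(v == 0), FUN = cumsum)

y <- reset_cumsum(x)

# Also works when the non-zero entries are not all 1 (illustrative input).
v_out <- reset_cumsum(c(2, 3, 0, 1, 4, 0, 0, 5))
```

For the 0/1 vector from the question this gives the same result as the sequence(rle(x)$lengths) * x approach.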

Following a post about how to split a vector, this uses splitAt2 by @Calimo to split x at every 0 and then takes the cumulative sum of each piece:
splitAt2 <- function(x, pos) {
  out <- list()
  pos2 <- c(1, pos, length(x)+1)
  for (i in seq_along(pos2[-1])) {
    out[[i]] <- x[pos2[i]:(pos2[i+1]-1)]
  }
  return(out)
}
x = c(1,1,1,0,0,0,0,1,1,0,0,1,1,1,0,0,1,1,1,1,0,0,0,0,1,1,1)
where_split = which(x == 0)
x_split = splitAt2(x, where_split)
unlist(sapply(x_split, cumsum))
# [1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3

Here is another option, using rleid from data.table to label the runs:
library(data.table)
ave(x, rleid(x), FUN=seq_along)*x
#[1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3
Or without any packages:
ave(x, cumsum(c(TRUE, x[-1]!= x[-length(x)])), FUN=seq_along)*x
#[1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3
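To see what the grouping expression in the package-free version does, here is a small sketch on a short made-up vector (x_small and the variable names are illustrative only). Comparing each element with its predecessor marks the start of every run, and cumsum turns those marks into run ids.

```r
x_small <- c(1, 1, 0, 0, 0, 1)

# TRUE wherever a new run starts (the first element always starts a run).
run_start <- c(TRUE, x_small[-1] != x_small[-length(x_small)])

# Running count of run starts = id of the run each element belongs to.
run_id <- cumsum(run_start)

# The same ids, rebuilt from rle() for comparison.
r <- rle(x_small)
run_id_rle <- rep(seq_along(r$lengths), r$lengths)
```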

Related

How to convert a binary data frame to a vector?

Suppose I have a data frame such like
dat<-data.frame('0'=c(1,1,0,0,0,0,0,0),
'1'=c(0,0,1,0,1,0,0,0),
'2'=c(0,0,0,1,0,0,1,1),
'3'=c(0,0,0,0,0,1,0,0))
dat
X0 X1 X2 X3
1 1 0 0 0
2 1 0 0 0
3 0 1 0 0
4 0 0 1 0
5 0 1 0 0
6 0 0 0 1
7 0 0 1 0
8 0 0 1 0
I want to convert it to a vector like 1,1,2,3,2,4,3,3, where each number is the index of the column that holds the 1 in that row. For example, the 4 means that in row 6 the 1 is in the 4th column.
Use
max.col(dat)
# [1] 1 1 2 3 2 4 3 3
Alternatively, in base R, we can use apply with which:
apply(dat == 1, 1, which)
#[1] 1 1 2 3 2 4 3 3
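One caveat worth keeping in mind: max.col breaks ties at random by default, so it is only safe here because every row contains exactly one 1. A minimal sketch that checks this assumption first and uses ties.method = "first" to make the result deterministic (the names m and idx are illustrative):

```r
dat <- data.frame(X0 = c(1,1,0,0,0,0,0,0),
                  X1 = c(0,0,1,0,1,0,0,0),
                  X2 = c(0,0,0,1,0,0,1,1),
                  X3 = c(0,0,0,0,0,1,0,0))
m <- as.matrix(dat)

# Guard: the conversion is only well defined with exactly one 1 per row.
stopifnot(all(rowSums(m == 1) == 1))

# Column index of the row maximum; ties (none here) would go to the first column.
idx <- max.col(m, ties.method = "first")
```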

summing all possible left to right diagonals along specified columns in a data frame by group?

Suppose I have something like this:
df<-data.frame(group=c(1, 1,2, 2, 2, 4,4,4,4,6,6,6),
binary1=c(1,0,1,0,0,0,0,0,0,0,0,0),
binary2=c(0,1,0,1,0,1,0,0,0,0,1,1),
binary3=c(0,0,0,0,1,0,1,0,0,0,0,0),
binary4=c(0,0,0,0,0,0,0,1,0,0,0,0))
I want to sum along all possible left-to-right diagonals within each group (i.e. groups 1, 2, 4 and 6) and return the maximum sum. The data is in a data frame, so I would like to restrict the sums to the columns binary1 to binary4. Does anyone know if this is possible?
Here's my desired output:
group binary1 binary2 binary3 binary4 want
1 1 1 0 0 0 2
2 1 0 1 0 0 2
3 2 1 0 0 0 3
4 2 0 1 0 0 3
5 2 0 0 1 0 3
6 4 0 1 0 0 3
7 4 0 0 1 0 3
8 4 0 0 0 1 3
9 4 0 0 0 0 3
10 6 0 0 0 0 1
11 6 0 1 0 0 1
12 6 0 1 0 0 1
I have circled the "diagonals" I would like summed for group 4 in this image as an example:
Here is another solution where we use the row and col indices to get all possible diagonals. Use by to split by group and merge the result with the original data frame.
max_diag <- function(x) max(sapply(split(as.matrix(x), row(x) - col(x)), sum))
merge(df, stack(by(df[-1], df$group, max_diag)), by.x = "group", by.y = "ind")
# group binary1 binary2 binary3 binary4 values
#1 1 1 0 0 0 2
#2 1 0 1 0 0 2
#3 2 1 0 0 0 3
#4 2 0 1 0 0 3
#5 2 0 0 1 0 3
#6 4 0 1 0 0 3
#7 4 0 0 1 0 3
#8 4 0 0 0 1 3
#9 4 0 0 0 0 3
#10 6 0 0 0 0 1
#11 6 0 1 0 0 1
#12 6 0 1 0 0 1
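The key step in max_diag is that row(x) - col(x) is constant along each left-to-right diagonal, so it can serve as a grouping key. A small sketch on the group 4 rows from the question (typed in by hand here as the matrix g4, an illustrative name):

```r
# binary1..binary4 for group 4, one row per observation.
g4 <- matrix(c(0, 1, 0, 0,
               0, 0, 1, 0,
               0, 0, 0, 1,
               0, 0, 0, 0), nrow = 4, byrow = TRUE)

# Every cell on the same left-to-right diagonal has the same row - col value.
diag_key <- row(g4) - col(g4)

# Sum the cells of each diagonal; names are the row - col keys.
diag_sums <- sapply(split(g4, diag_key), sum)

best <- max(diag_sums)
```

Here the three 1s sit on the diagonal with key -1 (one step above the main diagonal), which is why group 4 gets 3.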
You can split the data.frame by group and sum the main diagonal of each piece with diag(). Once you have the diagonal sum per group, you put it back into the data.frame by indexing with the group.
With this definition, shouldn't group 4 be zero? Or am I missing something:
DIAG = by(df[,-1],df$group,function(i)sum(diag(as.matrix(i))))
df$want = DIAG[as.character(df$group)]
If I understand your definition correctly, we define a function that sums each diagonal starting in the first row:
main_diag = function(m){
sapply(1:(ncol(m)-1),function(i)sum(diag(m[,i:ncol(m)])))
}
Thanks to @IceCreamToucan for correcting this. Then we consider the max of all main diagonals, and their transpose:
DIAG = by(df[,-1],df$group,function(i){
i = as.matrix(i)
max(main_diag(i),main_diag(t(i)))
})
df$want = DIAG[as.character(df$group)]
group binary1 binary2 binary3 binary4 want
1 1 1 0 0 0 2
2 1 0 1 0 0 2
3 2 1 0 0 0 3
4 2 0 1 0 0 3
5 2 0 0 1 0 3
6 4 0 1 0 0 3
7 4 0 0 1 0 3
8 4 0 0 0 1 3
9 4 0 0 0 0 3
10 6 0 0 0 0 1
11 6 0 1 0 0 1
12 6 0 1 0 0 1

Sequence of two numbers with decreasing occurrence of one of them

I would like to create a sequence from two numbers, such that the occurrence of one of the numbers decreases (from n_1 to 1) while for the other number the occurrences are fixed at n_2.
I've been looking around and tried using seq and rep to do it, but I can't seem to figure it out.
Here is an example for c(0,1) and n_1=5, n_2=3:
0,0,0,0,0,1,1,1,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,1,1,1,0,1,1,1
And here for c(0,1) and n_1=2, n_2=1:
0,0,1,0,1
Maybe something like this?
rep(rep(c(0, 1), n_1), times = rbind(n_1:1, n_2))
## [1] 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 1 1 1 0 1 1 1
Here it is as a function (without any sanity checks):
myfun <- function(vec, n1, n2) rep(rep(vec, n1), times = rbind(n1:1, n2))
myfun(c(0, 1), 2, 1)
## [1] 0 0 1 0 1
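The trick above is that rbind(n_1:1, n_2) interleaves the decreasing counts with the fixed count once it is read column by column. A short sketch that spells out the intermediate vectors (the names values, times and out are illustrative; as.vector is added only to make the flattening explicit):

```r
n_1 <- 5
n_2 <- 3

# Alternating values: 0 1 0 1 ... (n_1 pairs).
values <- rep(c(0, 1), n_1)

# 2 x n_1 matrix: first row n_1:1, second row n_2 recycled.
# Read column-wise this is 5 3 4 3 3 3 2 3 1 3.
times <- as.vector(rbind(n_1:1, n_2))

# Repeat each value by its matching count.
out <- rep(values, times = times)
```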
inverse.rle
Another alternative is to use inverse.rle:
y <- list(lengths = rbind(n_1:1, n_2),
values = rep(c(0, 1), n_1))
inverse.rle(y)
## [1] 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 1 1 1 0 1 1 1
An alternative (albeit slower) method using a similar concept:
unlist(mapply(rep,c(0,1),times=rbind(n_1:1,n_2)))
## [1] 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 1 1 1 0 1 1 1
Here is another approach using upper-triangle of a matrix:
f_rep <- function(num1, n_1, num2, n_2){
  m <- matrix(rep(c(num1, num2), times=c(n_1+1, n_2)), n_1+n_2+1, n_1+n_2+1, byrow = TRUE)
  t(m)[lower.tri(m,diag=FALSE)][1:sum((n_1:1)+n_2)]
}
f_rep(0, 5, 1, 3)
#[1] 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 1 1 1 0 1 1 1
f_rep(2, 4, 3, 3)
#[1] 2 2 2 2 3 3 3 2 2 2 3 3 3 2 2 3 3 3 2 3 3 3
One more variant, building the times vector with lapply:
myf = function(x, n){
  rep(rep(x,n[1]), unlist(lapply(0:(n[1]-1), function(i) n - c(i,0))))
}
myf(c(0,1), c(5,3))
#[1] 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 1 1 1 0 1 1 1

Expand a single column to a wide/model matrix format

Suppose I have a column in a matrix or data.frame as follows:
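# stringsAsFactors = TRUE is needed in R >= 4.0.0 (default is now FALSE); levels() below requires a factor
df <- data.frame(col1=sample(letters[1:3], 10, TRUE), stringsAsFactors = TRUE)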
I want to expand this out to multiple columns, one for each level in the column, with 0/1 entries indicating presence or absence of level for each row
newdf <- data.frame(a=rep(0, 10), b=rep(0,10), c=rep(0,10))
for (i in seq_along(levels(df$col1))) {
  curLetter <- levels(df$col1)[i]
  newdf[which(df$col1 == curLetter), curLetter] <- 1
}
newdf
I know there's a simple clever solution to this, but I can't figure out what it is.
I've tried expand.grid on df, which returns itself as is. Similarly melt in the reshape2 package on df returned df as is. I've also tried reshape but it complains about incorrect dimensions or undefined columns.
model.matrix is the most direct candidate here, but I'll present three alternatives: table, lapply, and dcast (the last one since this question is tagged reshape2).
table
table(sequence(nrow(df)), df$col1)
#
# a b c
# 1 1 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 1
# 5 1 0 0
# 6 0 0 1
# 7 0 0 1
# 8 0 1 0
# 9 0 1 0
# 10 1 0 0
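Note that table returns an object of class table rather than a data frame. If a data frame is needed, one possible conversion is as.data.frame.matrix. The sketch below uses a fixed vector col1 as a stand-in for the random df$col1, so the names col1, tab and wide are illustrative only.

```r
# Fixed stand-in for df$col1 so the result is reproducible.
col1 <- c("a", "b", "b", "c", "a")

# One row per observation, one column per level, 0/1 counts.
tab <- table(seq_along(col1), col1)

# Keep the wide layout when converting (plain as.data.frame would give long format).
wide <- as.data.frame.matrix(tab)
```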
lapply
newdf <- data.frame(a=rep(0, 10), b=rep(0,10), c=rep(0,10))
newdf[] <- lapply(names(newdf), function(x)
{ newdf[[x]][df[,1] == x] <- 1; newdf[[x]] })
newdf
# a b c
# 1 1 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 1
# 5 1 0 0
# 6 0 0 1
# 7 0 0 1
# 8 0 1 0
# 9 0 1 0
# 10 1 0 0
dcast
library(reshape2)
dcast(df, sequence(nrow(df)) ~ df$col1, fun.aggregate=length, value.var = "col1")
# sequence(nrow(df)) a b c
# 1 1 1 0 0
# 2 2 0 1 0
# 3 3 0 1 0
# 4 4 0 0 1
# 5 5 1 0 0
# 6 6 0 0 1
# 7 7 0 0 1
# 8 8 0 1 0
# 9 9 0 1 0
# 10 10 1 0 0
It's very easy with model.matrix
model.matrix(~ df$col1 + 0)
The term + 0 means that the intercept is not included. Hence, you receive a dummy variable for each factor level.
The result:
df$col1a df$col1b df$col1c
1 0 0 1
2 0 1 0
3 0 0 1
4 1 0 0
5 0 1 0
6 1 0 0
7 1 0 0
8 0 1 0
9 1 0 0
10 0 1 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`df$col1`
[1] "contr.treatment"
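The df$col1 prefix in the column names comes from referring to the column as df$col1 inside the formula. A hedged sketch of one way to get shorter names is to pass the data frame through the data argument instead; the small fixed data frame here is made up so the output is reproducible.

```r
# Fixed example data; the column is an explicit factor.
df <- data.frame(col1 = factor(c("a", "b", "b", "c", "a")))

# "+ 0" drops the intercept, so each factor level gets its own 0/1 column.
mm <- model.matrix(~ col1 + 0, data = df)

colnames(mm)
```

The column names are then built from the variable name and the level (col1a, col1b, col1c), and each row has a single 1.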

Identify and replace duplicates elements from a vector

I have the following vector:
a<- c(1,1,1,2,3,2,2,2,2,1,0,0,0,0,2,3,4,4,1,1)
Here we can see that many elements are repeated consecutively.
I want code that replaces every element of a consecutive run of duplicates with 0, except for the first element of the run. The result I require is
a<- c(1,0,0,2,3,2,0,0,0,1,0,0,0,0,2,3,4,0,1,0)
I've tried
unique(a)
#which gives
[1] 1 2 3 0 4
You can create a lagged series and compare:
> a
[1] 1 1 1 2 3 2 2 2 2 1 0 0 0 0 2 3 4 4 1 1
> ifelse(a == c(a[1]-1, a[-length(a)]), 0, a)
[1] 1 0 0 2 3 2 0 0 0 1 0 0 0 0 2 3 4 0 1 0
Another option, using diff to detect where the value stays the same:
replace(a, duplicated(c(0, cumsum(abs(diff(a))))), 0)
# [1] 1 0 0 2 3 2 0 0 0 1 0 0 0 0 2 3 4 0 1 0
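Both answers rest on the same idea: keep an element only if it differs from the one before it. A minimal sketch that makes this explicit with a logical "run start" mask (the names is_run_start and result are illustrative):

```r
a <- c(1,1,1,2,3,2,2,2,2,1,0,0,0,0,2,3,4,4,1,1)

# TRUE for the first element and wherever the value differs from its predecessor.
is_run_start <- c(TRUE, a[-1] != a[-length(a)])

# Multiplying by the mask zeroes out every repeated element of a run.
result <- a * is_run_start
```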
