Assigning a unique identifier to consecutive sequences of binary values in R

I have a data frame with a column consisting of runs of 0s and 1s. The 0s are not of interest, but each run of 1s marks an event in a time series, and the goal is to assign a unique value to each event. Simple integer values suffice. In the code below, 'x' is what I have and 'goal' is what I am after.
This seems so simple yet I don't quite know how to phrase the question on a help search...
What I have as a dataframe:
x <- c(rep(0,4),rep(1,5),rep(0,2),rep(1,4),rep(0,10),rep(1,3))
x <- data.frame(x)
What I want in the dataframe:
x$goal <- c(rep(0,4),rep(1,5),rep(0,2),rep(2,4),rep(0,10),rep(3,3))

This is effectively run-length encoding with a slight twist: the runs of 0s are set back to 0.
data.table::rleid does this well. If you are not already using that package, a base R equivalent is
my_rleid <- function(x) { yy <- rle(x); rep(seq_along(yy$lengths), yy$lengths) }
Applying it and then zeroing the rows where x is 0 gives
x$out <- my_rleid(x$x)
x$out <- ifelse(x$x == 0, 0L, x$out)
x
# x goal out
# 1 0 0 0
# 2 0 0 0
# 3 0 0 0
# 4 0 0 0
# 5 1 1 2
# 6 1 1 2
# 7 1 1 2
# 8 1 1 2
# 9 1 1 2
# 10 0 0 0
# 11 0 0 0
# 12 1 2 4
# 13 1 2 4
# 14 1 2 4
# 15 1 2 4
# 16 0 0 0
# 17 0 0 0
# 18 0 0 0
# 19 0 0 0
# 20 0 0 0
# 21 0 0 0
# 22 0 0 0
# 23 0 0 0
# 24 0 0 0
# 25 0 0 0
# 26 1 3 6
# 27 1 3 6
# 28 1 3 6
which is pretty close. If you need consecutive numbers (no gaps like above), then
x$out <- match(x$out, sort(unique(x$out))) - (0 %in% x$out)
x
# x goal out
# 1 0 0 0
# 2 0 0 0
# 3 0 0 0
# 4 0 0 0
# 5 1 1 1
# 6 1 1 1
# 7 1 1 1
# 8 1 1 1
# 9 1 1 1
# 10 0 0 0
# 11 0 0 0
# 12 1 2 2
# 13 1 2 2
# 14 1 2 2
# 15 1 2 2
# 16 0 0 0
# 17 0 0 0
# 18 0 0 0
# 19 0 0 0
# 20 0 0 0
# 21 0 0 0
# 22 0 0 0
# 23 0 0 0
# 24 0 0 0
# 25 0 0 0
# 26 1 3 3
# 27 1 3 3
# 28 1 3 3
I use - (0 %in% x$out) instead of a hard-coded 1 to guard against data that contain no 0s. (0 %in% x$out) evaluates to FALSE or TRUE, which is coerced to 0L or 1L when subtracted from integers. This matters because, when $out contains a 0, sort(unique(x$out)) starts at 0 (here it is c(0, 2, 4, 6)), so match() returns 1 for the rows where x == 0. We want those rows to be 0L, so we have to subtract one. When x$x contains no zeroes, the lookup vector starts at the first run id instead, match() already returns 1 for the first event, and nothing should be subtracted.
If you are certain that your data always contain at least one 0, and you don't like the - (.) I appended to match(.), then you can change that to match(.) - 1L.
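To see the adjustment in both situations, here is a minimal sketch that wraps the steps above in one function. The name label_events is hypothetical (it is not part of the original answer), and my_rleid is repeated from above so the block runs on its own.

```r
# Base R run-id helper, as defined in the answer above.
my_rleid <- function(x) { yy <- rle(x); rep(seq_along(yy$lengths), yy$lengths) }

# Hypothetical wrapper: run ids, zero out the 0s, then close the gaps.
label_events <- function(x) {
  out <- ifelse(x == 0, 0L, my_rleid(x))
  match(out, sort(unique(out))) - (0 %in% out)
}

with_zeros <- label_events(c(0, 0, 1, 1, 0, 1))
no_zeros   <- label_events(c(1, 1, 1))

print(with_zeros)  # 0 0 1 1 0 2
print(no_zeros)    # 1 1 1
```

In the second call there is no 0 in the intermediate vector, so (0 %in% out) is FALSE and the ids are left as they are.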

Related

Create an index variable for blocks of values

I have a dataframe "Data" with a grouping variable "grp" and a binary classification variable "classif". Within each level of grp, I want to create a "result" variable that indexes the separate blocks of 0s in classif. So far I don't know how to reset the count for each level of the grouping variable, and I can't find a way to index only the blocks of 0s (ignoring the 1s).
Example data:
grp <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
classif <- c(0,1,0,0,1,0,0,1,1,0,0,0,0,1,0,1,1,1,0,0,1,1,0,0,0,1,0,1,0)
result <- c(1,0,2,2,0,3,3,0,0,1,1,1,1,0,2,0,0,0,3,3,0,0,1,1,1,0,2,0,3)
wrong_result <- c(1,2,3,3,4,5,5,1,1,2,2,2,2,3,4,5,5,5,6,6,1,1,2,2,2,3,4,5,6)
Data <- data.frame(grp,classif,result, wrong_result)
I have tried using rleid from data.table, but the following commands produce "wrong_result", which is not what I'm after.
library(data.table)
DT <- as.data.table(Data)
DT[, wrong_result := rleid(classif)]
DT[, wrong_result := rleid(classif), by = grp]
With dplyr, use cumsum() and lag() to number the blocks of zeroes within each group. (The .by argument requires dplyr 1.1.0 or later; with older versions, call group_by(grp) before mutate() instead.)
library(dplyr)
Data %>%
  mutate(
    result2 = ifelse(
      classif == 0,
      cumsum(classif == 0 & lag(classif, default = 1) == 1),
      0
    ),
    .by = grp
  )
grp classif result result2
1 1 0 1 1
2 1 1 0 0
3 1 0 2 2
4 1 0 2 2
5 1 1 0 0
6 1 0 3 3
7 1 0 3 3
8 2 1 0 0
9 2 1 0 0
10 2 0 1 1
11 2 0 1 1
12 2 0 1 1
13 2 0 1 1
14 2 1 0 0
15 2 0 2 2
16 2 1 0 0
17 2 1 0 0
18 2 1 0 0
19 2 0 3 3
20 2 0 3 3
21 3 1 0 0
22 3 1 0 0
23 3 0 1 1
24 3 0 1 1
25 3 0 1 1
26 3 1 0 0
27 3 0 2 2
28 3 1 0 0
29 3 0 3 3
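For readers without dplyr, the same cumsum()-and-lag idea can be sketched in base R. This is an illustration of the technique rather than the answerer's code: block_index is a hypothetical helper name, and c(1, head(x, -1)) stands in for lag(x, default = 1).

```r
grp     <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
classif <- c(0,1,0,0,1,0,0,1,1,0,0,0,0,1,0,1,1,1,0,0,1,1,0,0,0,1,0,1,0)
result  <- c(1,0,2,2,0,3,3,0,0,1,1,1,1,0,2,0,0,0,3,3,0,0,1,1,1,0,2,0,3)

# Hypothetical helper: number the blocks of 0s within one group.
block_index <- function(x) {
  prev   <- c(1, head(x, -1))     # previous value; 1 before the first row
  starts <- x == 0 & prev == 1    # TRUE where a block of 0s begins
  cumsum(starts) * (x == 0)       # running block count, zeroed on the 1s
}

# ave() applies the helper separately within each level of grp.
result2 <- ave(classif, grp, FUN = block_index)
all(result2 == result)  # TRUE
```

Using a default of 1 for the "previous" value means a group that opens with a 0 starts its first block at 1.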
Use rle to number the runs of 0s sequentially, convert back with inverse.rle, and zero out the runs of 1s. No packages are needed.
seq0 <- function(x) {
  r <- rle(x)
  is0 <- r$values == 0
  r$values[is0] <- seq_len(sum(is0))
  inverse.rle(r) * !x
}
transform(Data, result2 = ave(classif, grp, FUN = seq0))
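To make the steps inside seq0 concrete, the sketch below traces them on the classif values of the first group only (copied from the example data). The intermediate variable names filled and out are introduced here for illustration.

```r
x <- c(0, 1, 0, 0, 1, 0, 0)   # classif for grp == 1

r <- rle(x)
r$values                      # 0 1 0 1 0
r$lengths                     # 1 1 2 1 2

# Replace the value of every run of 0s with its sequence number.
is0 <- r$values == 0
r$values[is0] <- seq_len(sum(is0))
r$values                      # 1 1 2 1 3

# Expand back to full length, then multiply by !x to zero out the 1s.
filled <- inverse.rle(r)      # 1 1 2 2 1 3 3
out <- filled * !x            # 1 0 2 2 0 3 3
out
```

The multiplication by !x works because !x is TRUE (1) exactly where x is 0 and FALSE (0) where x is 1.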

change value of row and subsequent rows depending on row number

I have a data frame in which some rows have the value 0. I want code that sets the next few rows to 0 as well.
> head(df$n,n=20)
df$n
1 0
2 9009
3 0
4 0
5 0
6 0
7 0
8 5410
9 0
10 0
11 0
12 0
13 0
14 0
15 32
16 0
17 0
18 1054
19 0
20 0
I want code that, for every row with the value 0, also sets the next five rows to 0.
In other words, a row with 0 stays 0 and the five rows after it become 0 too.
I tried
for (j in 1:nrow(indx)) {
  for (i in 1:4) {
    df$n[j + i] <- 0
  }
}
where indx is a data frame containing the row numbers of all rows with the value 0.
This runs but gives incorrect results.
How do I get my desired output?
> head(df$n,n=20)
df$n
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 5410
9 0
10 0
11 0
12 0
13 0
14 0
15 32
16 0
17 0
18 0
19 0
20 0
Edit: Sorry for the unclear language. My aim is to set the 5 values after each 0 to 0, since they are incorrect data.
Edit 2: I think this code worked for me. It's a little bit primitive.
for (i in 1:nrow(indx)) {
  u <- indx[i, ]
  df[u, ] <- 0
  df[u + 1, ] <- 0
  df[u + 2, ] <- 0
  df[u + 3, ] <- 0
  df[u + 4, ] <- 0
  df[u + 5, ] <- 0
}
It introduces extra rows at the end, but it works.
If I understand correctly, you want to make sure any run of zeros is at least five rows long, unless it's at the end of the data. Here's a dplyr-based solution:
library(dplyr)
df %>%
  group_by(zero_run = cumsum(n == 0 & lag(n, default = 1) != 0)) %>%
  mutate(
    zeros_consecutive = row_number(),
    n_new = ifelse(zero_run == 0 | zeros_consecutive > 5, n, 0)
  ) %>%
  ungroup()
# # A tibble: 20 × 4
# n zero_run zeros_consecutive n_new
# <dbl> <int> <int> <dbl>
# 1 0 1 1 0
# 2 9009 1 2 0
# 3 0 2 1 0
# 4 0 2 2 0
# 5 0 2 3 0
# 6 0 2 4 0
# 7 0 2 5 0
# 8 5410 2 6 5410
# 9 0 3 1 0
# 10 0 3 2 0
# 11 0 3 3 0
# 12 0 3 4 0
# 13 0 3 5 0
# 14 0 3 6 0
# 15 32 3 7 32
# 16 0 4 1 0
# 17 0 4 2 0
# 18 1054 4 3 0
# 19 0 5 1 0
# 20 0 5 2 0
I left in the helper columns to better demonstrate the approach, but you could remove these by using n = ifelse(...) instead of n_new = ifelse(...) and adding select(!zero_run:zeros_consecutive).
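The same grouping idea can also be sketched without packages. This is a base R translation under the same reading of the question, not the answerer's code; run_id and pos_in_run are illustrative names, and the vector n copies the 20 values shown in the question.

```r
n <- c(0, 9009, 0, 0, 0, 0, 0, 5410, 0, 0, 0, 0, 0, 0, 32, 0, 0, 1054, 0, 0)

# A new group starts at every 0 that follows a non-zero value
# (the first row is compared against a dummy previous value of 1).
run_id <- cumsum(n == 0 & c(1, head(n, -1)) != 0)

# Position of each row within its group.
pos_in_run <- ave(seq_along(n), run_id, FUN = seq_along)

# Force the first five rows of every group to 0; keep everything else.
n_new <- ifelse(run_id == 0 | pos_in_run > 5, n, 0)
n_new
# 0 0 0 0 0 0 0 5410 0 0 0 0 0 0 32 0 0 0 0 0
```

The run_id == 0 condition protects any leading non-zero rows that come before the first 0, since they belong to no zero run.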

Create new column when values repeat 3 or more times

Problem
I'm trying to create a new column (b) based on values from a previous column (a). Column a is binary, consisting of either 0's or 1's. If there are three or more 1's in a row in column a, then keep them in column b. I'm close to the desired output, but when there are two 1's in a row, the ifelse grabs the second value because it's meeting the first condition.
Desired Output–Column b
df <- data.frame(a = c(1,1,1,0,0,1,0,1,1,0,1,1,1,0,1,1,0,1,1,1,1),
b = c(1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,1,1,1,1))
df
a b
1 1 1
2 1 1
3 1 1
4 0 0
5 0 0
6 1 0
7 0 0
8 1 0 #
9 1 0 #
10 0 0
11 1 1
12 1 1
13 1 1
14 0 0
15 1 0 #
16 1 0 #
17 0 0
18 1 1
19 1 1
20 1 1
21 1 1
Failed Attempt...s
require(dplyr)
df_fail <- df %>% mutate(b=ifelse((lag(df$a) + df$a) > 1 |(df$a + lead(df$a) + lead(df$a,2)) >= 3, df$a,NA))
df_fail
a b
1 1 1
2 1 1
3 1 1
4 0 0
5 0 0
6 1 0
7 0 0
8 1 0
9 1 1 # should be 0
10 0 0
11 1 1
12 1 1
13 1 1
14 0 0
15 1 0
16 1 1 # should be 0
17 0 0
18 1 1
19 1 1
20 1 1
21 1 1
We can use rle from base R to set every run of fewer than 3 consecutive 1s to 0
inverse.rle(within.list(rle(df$a), values[values == 1 & lengths <3] <- 0))
#[1] 1 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1
Or use rleid from data.table
library(data.table)
library(dplyr)
df %>%
  group_by(grp = rleid(a)) %>%
  mutate(b1 = if (n() < 3 & all(a == 1)) 0 else a) %>%
  ungroup() %>%
  select(-grp)
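The rle one-liner above can be unpacked into separate steps, which also makes the minimum run length a parameter. This is a sketch only: keep_long_runs and its min_len argument are hypothetical names that do not appear in the original answer.

```r
# Hypothetical helper: keep runs of 1s only if they are at least min_len long.
keep_long_runs <- function(a, min_len = 3) {
  r <- rle(a)
  short_ones <- r$values == 1 & r$lengths < min_len  # runs of 1s that are too short
  r$values[short_ones] <- 0                          # turn those runs into 0s
  inverse.rle(r)                                     # expand back to full length
}

a <- c(1,1,1,0,0,1,0,1,1,0,1,1,1,0,1,1,0,1,1,1,1)
b <- keep_long_runs(a)
b
# 1 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1
```

Because the decision is made once per run rather than once per row, the second element of a two-element run can no longer slip through, which was the problem with the lag/lead attempt.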

R how to generate a descending sequence by subject measuring the distance from the next uninterrupted series of a given value

I have spent a lot of time trying to figure out how to create a descending sequence which is subject specific and measures the distance from the next uninterrupted series of a given value in another column. Do you have any suggestions?
Here is an example of the problem:
Given the following data, where the "id" column is the subject unique identifier and the column "dummy" is an attribute
mydata<-data.frame(id=rep(seq(1,3),each=5), dummy=c(0,0,0,1,1,0,0,1,0,1,0,0,0,0,0))
id dummy
1 1 0
2 1 0
3 1 0
4 1 1
5 1 1
6 2 0
7 2 0
8 2 1
9 2 0
10 2 1
11 3 0
12 3 0
13 3 0
14 3 0
15 3 0
Generate a new column measuring the distance from the next uninterrupted series of the value 1 in the "dummy" column (notice: I am considering an individual occurrence of the value 1 as an uninterrupted series). Here is an example of the output:
id dummy output
1 1 0 3
2 1 0 2
3 1 0 1
4 1 1 0
5 1 1 0
6 2 0 2
7 2 0 1
8 2 1 0
9 2 0 1
10 2 1 0
11 3 0 0
12 3 0 0
13 3 0 0
14 3 0 0
15 3 0 0
Thanks,
H
Here's an attempt using the data.table package in two steps.
The first step shifts the dummy column by one row (a lead), so that we can later check whether each run of zeros is followed by a one.
The second step computes the countdown, restricted to rows that are zero and whose run is followed by a one.
I'm using the shift function from data.table (v1.9.6+) for this task, but you could just use indx := c(dummy[-1L], 0L) instead
library(data.table) # V1.9.6+
setDT(mydata)[, indx := shift(dummy, type = "lead", fill = 0L)]
mydata[, output := .N:1L*(dummy == 0L)*(indx[.N] == 1L), by = .(id, cumsum(dummy == 1L))]
# id dummy indx output
# 1: 1 0 0 3
# 2: 1 0 0 2
# 3: 1 0 1 1
# 4: 1 1 1 0
# 5: 1 1 0 0
# 6: 2 0 0 2
# 7: 2 0 1 1
# 8: 2 1 0 0
# 9: 2 0 1 1
# 10: 2 1 0 0
# 11: 3 0 0 0
# 12: 3 0 0 0
# 13: 3 0 0 0
# 14: 3 0 0 0
# 15: 3 0 0 0
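For comparison, the same two steps can be sketched in base R. This is a translation for illustration, not the answerer's code: the grouping factor g and the use of ave() over row indices are choices made here, and the lead is built with c(dummy[-1L], 0) as the answer suggests.

```r
id    <- rep(1:3, each = 5)
dummy <- c(0,0,0,1,1,0,0,1,0,1,0,0,0,0,0)

# Step 1: lead of dummy (next row's value, 0 after the last row).
indx <- c(dummy[-1L], 0)

# Step 2: group by id and by the running count of 1s, then count down
# within each group, keeping only zero rows whose group is followed by a 1.
g <- interaction(id, cumsum(dummy == 1), drop = TRUE)
output <- ave(seq_along(dummy), g, FUN = function(i) {
  rev(seq_along(i)) * (dummy[i] == 0) * (indx[i[length(i)]] == 1)
})
output
# 3 2 1 0 0 2 1 0 1 0 0 0 0 0 0
```

Here i holds the row numbers of one group, so indx[i[length(i)]] plays the role of indx[.N] in the data.table version.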
Here is an option with base R. First we label the number of consecutive identical entries (with rle) in the dummy column in reverse order:
mydata$output<- unlist(sapply(rle(mydata$dummy)$lengths,function(x) rev(seq(x))))
Then we set the values of the output column to zero for all rows in which dummy is not equal to zero:
mydata$output[mydata$dummy!=0] <- 0
In the last step, we identify the ids whose dummy values are all zero and set their entries of the output column to zero, too:
zero_ids <- with(aggregate(dummy ~ id, mydata, sum), id[dummy == 0])
mydata$output[mydata$id %in% zero_ids] <- 0
#> mydata
# id dummy output
#1 1 0 3
#2 1 0 2
#3 1 0 1
#4 1 1 0
#5 1 1 0
#6 2 0 2
#7 2 0 1
#8 2 1 0
#9 2 0 1
#10 2 1 0
#11 3 0 0
#12 3 0 0
#13 3 0 0
#14 3 0 0
#15 3 0 0
This solution assumes that there are no negative values in the dummy column.

Expand a single column to a wide/model matrix format

Suppose I have a column in a matrix or data.frame as follows:
df <- data.frame(col1=sample(letters[1:3], 10, TRUE), stringsAsFactors=TRUE) # a factor is needed for levels() below in R >= 4.0
I want to expand this out to multiple columns, one for each level in the column, with 0/1 entries indicating presence or absence of level for each row
newdf <- data.frame(a=rep(0, 10), b=rep(0,10), c=rep(0,10))
for (i in 1:length(levels(df$col1))) {
  curLetter <- levels(df$col1)[i]
  newdf[which(df$col1 == curLetter), curLetter] <- 1
}
newdf
I know there's a simple clever solution to this, but I can't figure out what it is.
I've tried expand.grid on df, which returns itself as is. Similarly melt in the reshape2 package on df returned df as is. I've also tried reshape but it complains about incorrect dimensions or undefined columns.
Obviously, model.matrix is the most direct candidate here, but I'll present three alternatives: table, lapply, and dcast (the last one since this question is tagged reshape2).
table
table(sequence(nrow(df)), df$col1)
#
# a b c
# 1 1 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 1
# 5 1 0 0
# 6 0 0 1
# 7 0 0 1
# 8 0 1 0
# 9 0 1 0
# 10 1 0 0
lapply
newdf <- data.frame(a=rep(0, 10), b=rep(0,10), c=rep(0,10))
newdf[] <- lapply(names(newdf), function(x)
{ newdf[[x]][df[,1] == x] <- 1; newdf[[x]] })
newdf
# a b c
# 1 1 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 1
# 5 1 0 0
# 6 0 0 1
# 7 0 0 1
# 8 0 1 0
# 9 0 1 0
# 10 1 0 0
dcast
library(reshape2)
dcast(df, sequence(nrow(df)) ~ df$col1, fun.aggregate=length, value.var = "col1")
# sequence(nrow(df)) a b c
# 1 1 1 0 0
# 2 2 0 1 0
# 3 3 0 1 0
# 4 4 0 0 1
# 5 5 1 0 0
# 6 6 0 0 1
# 7 7 0 0 1
# 8 8 0 1 0
# 9 9 0 1 0
# 10 10 1 0 0
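The table() result is a table object rather than a data frame. If a data frame is needed, one possible follow-up is as.data.frame.matrix(), sketched below. The five fixed values are an assumption made so the output is reproducible; the question uses a random sample.

```r
# Fixed example values (assumed; the question draws them at random).
col1 <- factor(c("a", "b", "b", "c", "a"))

# One row per observation, one column per level, 0/1 counts.
tab <- table(seq_along(col1), col1)

# Convert the two-way table into a data frame with columns a, b, c.
newdf <- as.data.frame.matrix(tab)
newdf
```

Calling plain as.data.frame() on a table would instead return the long format (one row per cell with a Freq column), which is why the matrix method is used here.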
It's very easy with model.matrix
model.matrix(~ df$col1 + 0)
The term + 0 means that the intercept is not included. Hence, you receive a dummy variable for each factor level.
The result:
df$col1a df$col1b df$col1c
1 0 0 1
2 0 1 0
3 0 0 1
4 1 0 0
5 0 1 0
6 1 0 0
7 1 0 0
8 0 1 0
9 1 0 0
10 0 1 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`df$col1`
[1] "contr.treatment"
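If the df$col1 prefix in the column names is unwanted, a small variation is to pass the data frame through the data argument and rename the columns afterwards. This is a sketch with fixed example values (an assumption, since the question samples at random); mm and newdf are illustrative names.

```r
# Fixed example values (assumed; the question uses a random sample).
df <- data.frame(col1 = c("a", "b", "b", "c", "a"), stringsAsFactors = TRUE)

# Passing data = df lets the formula refer to col1 directly.
mm <- model.matrix(~ col1 + 0, data = df)
colnames(mm)                      # "col1a" "col1b" "col1c"

# Rename the columns to the bare level names and convert to a data frame.
colnames(mm) <- levels(df$col1)
newdf <- as.data.frame(mm)
newdf
```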
