Restart counting under conditions in R [duplicate]

This question already has answers here:
Create counter within consecutive runs of certain values
(6 answers)
Closed 3 years ago.
I have a flag column that contains continuous streams of 1s and 0s. I want to keep a running count within each stream of 1s. When a 0 is encountered, the count should stop and the result should be 0. For the next stream of 1s, the count should start afresh.
I have tried cumsum(negread_flag == 1), but this continues to sum after the 0s. The desired result is:
negread_flag result
1 1
1 2
1 3
1 4
0 0
0 0
0 0
1 1
1 2
1 3
0 0

We can use rleid (run-length id, which assigns a new id whenever adjacent elements differ) as a grouping variable. Initialise 'result' to 0, then, for the rows where 'negread_flag' is 1, assign the within-group sequence to 'result'. Finally, remove the temporary 'grp' column by assigning NULL to it.
library(data.table)
setDT(df1)[, grp := rleid(negread_flag)
  ][, result := 0
  ][negread_flag == 1, result := seq_len(.N), by = grp
  ][, grp := NULL][]
# negread_flag result
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 1 4
# 5: 0 0
# 6: 0 0
# 7: 0 0
# 8: 1 1
# 9: 1 2
#10: 1 3
#11: 0 0
Or a similar idea with the tidyverse, still using rleid from data.table: create 'result' by multiplying row_number() by 'negread_flag', so that rows where 'negread_flag' is 0 get 0.
library(tidyverse)
library(data.table) # needed for rleid()
df1 %>%
group_by(grp = rleid(negread_flag)) %>%
mutate(result = row_number() * negread_flag) %>%
ungroup %>%
select(-grp)
# A tibble: 11 x 2
# negread_flag result
# <int> <int>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 0 0
# 6 0 0
# 7 0 0
# 8 1 1
# 9 1 2
#10 1 3
#11 0 0
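Or using base R. Initialise 'result' to 0 first so that the rows with 0 are filled in and the column has the full length:
df1$result <- 0L
i1 <- df1$negread_flag != 0
df1$result[i1] <- with(rle(df1$negread_flag), sequence(lengths * values))
Or, as @markus commented, in a single assignment over the whole column:
df1$result <- sequence(rle(df1$negread_flag)$lengths) * df1$negread_flag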
data
df1 <- structure(list(negread_flag = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L,
1L, 1L, 0L)), row.names = c(NA, -11L), class = "data.frame")
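
The attempt in the question, cumsum(negread_flag == 1), keeps growing because nothing subtracts the total that had accumulated before each 0. As a minimal sketch of a purely vectorised alternative (assuming the flag holds only 0s and 1s; the variable names below are illustrative and not from the answers above), the running total recorded at the most recent 0 can be subtracted with cummax:

```r
flag <- c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L)

# Plain running total: never resets
total <- cumsum(flag)

# Running total as it stood at the most recent 0 (0 before any 0 is seen)
total_at_last_zero <- cummax((flag == 0) * total)

# Subtracting restarts the count after every run of 0s
counter <- total - total_at_last_zero
counter
# 1 2 3 4 0 0 0 1 2 3 0
```

This relies on the running total being non-decreasing, so it would not carry over to flags with negative values.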

Related

Restarting a Counter By Groups Under Conditions [duplicate]

This question already has answers here:
Create counter within consecutive runs of certain values
(6 answers)
Closed 29 days ago.
I have the following dataset:
id = c("A","A","A","A","A","B", "B", "B", "B")
result = c(1,1,0,1,1,0,1,0,1)
my_data = data.frame(id, result)
For each unique id, I want to create a "counter variable" that:
if the first result value is 1, then counter = 1, else 0
increases by 1 each time result = 1
becomes 0 when result = 0
remains 0 until the next result = 1 is encountered
restarts at 1 and increases by 1 each time result = 1
when the next unique id is encountered, the counter is initialized back to 1 if result = 1, else 0
I think the final result should look something like this:
id result counter
1 A 1 1
2 A 1 2
3 A 0 0
4 A 1 1
5 A 1 2
6 B 0 0
7 B 1 1
8 B 0 0
9 B 1 1
I have these two snippets that I am trying to use:
# creates counter by treating entire dataset as a single ID
my_data$counter = unlist(lapply(split(my_data$result, c(0, cumsum(abs(diff(!my_data$result == 1))))), function(x) (x[1] == 1) * seq(length(x))))
# creates counter by taking into consideration ID's
my_data$counter = ave(my_data$result, my_data$id, FUN = function(x){ tmp<-cumsum(x);tmp-cummax((!x)*tmp)})
But I am not sure how to interpret these correctly. For example, I am interested in learning about how to write a general function to accomplish this task with general conditions - e.g. if result = AAA then counter restarts to 0, if result = BBB then counter + 1, if result = CCC then counter + 2, if result = DDD then counter - 1.
Can someone please show me how to do this?
Thanks!
We may create a grouping column with rleid and then group by both 'id' and the rleid of 'result'. Within each run, row_number() * result gives 1, 2, 3, ... for runs of 1s and 0 for runs of 0s.
library(dplyr)
library(data.table)
my_data %>%
group_by(id) %>%
mutate(grp = rleid(result)) %>%
group_by(grp, .add = TRUE) %>%
mutate(counter = row_number() * result)%>%
ungroup %>%
select(-grp)
-output
# A tibble: 9 × 3
id result counter
<chr> <dbl> <dbl>
1 A 1 1
2 A 1 2
3 A 0 0
4 A 1 1
5 A 1 2
6 B 0 0
7 B 1 1
8 B 0 0
9 B 1 1
Or using data.table
library(data.table)
setDT(my_data)[, counter := seq_len(.N) * result, .(id, rleid(result))]
-output
> my_data
id result counter
1: A 1 1
2: A 1 2
3: A 0 0
4: A 1 1
5: A 1 2
6: B 0 0
7: B 1 1
8: B 0 0
9: B 1 1
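
The question also asks about a general rule set (AAA resets to 0, BBB adds 1, CCC adds 2, DDD subtracts 1), which the rleid approach does not cover directly. A rough sketch of one way to do it, assuming the counter depends only on its previous value and the current label, is to carry the counter along with Reduce(). The function name rule_counter, its arguments and the example data are hypothetical, not part of the answers above:

```r
# Hypothetical helper: walk through x once, carrying the counter along.
# steps: named increments per label; reset: labels that set the counter to 0.
rule_counter <- function(x, steps = c(BBB = 1, CCC = 2, DDD = -1), reset = "AAA") {
  x <- as.character(x)
  out <- Reduce(
    function(counter, value) {
      if (value %in% reset) 0 else counter + steps[[value]]
    },
    x, init = 0, accumulate = TRUE
  )
  out[-1]  # drop the initial 0 that Reduce() puts in front
}

# Illustrative data (not from the question)
ex <- data.frame(id = c("A", "A", "A", "B", "B"),
                 result = c("BBB", "CCC", "AAA", "BBB", "DDD"))

# Apply the rule separately within each id, then restore the original row order
ex$counter <- unsplit(lapply(split(ex$result, ex$id), rule_counter), ex$id)
ex$counter
# 1 3 0 1 0
```

A label that appears in neither steps nor reset makes steps[[value]] fail with an error, which may be preferable to silently skipping it.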

Rearrange and Sort

I have the following data
ID v1 v2 v3 v4 v5
1 1 3 6 4
2 4 2
3 3 1 8 5
4 2 5 3 1
Can I rearrange the data so that R automatically creates new columns and assigns a binary value (1 or 0) according to the values present in each row (v1 to v5)?
E.g. in the first row, I have the values 1, 3, 4 and 6. Can R automatically create 6 dummy variables and assign the value to the respective column, as below:
ID dummy1 dummy2 dummy3 dummy4 dummy5 dummy6
1 1 0 1 1 0 1
To have something like this:
ID c1 c2 c3 c4 c5 c6 c7 c8
1 1 0 1 1 0 1 0 0
2 0 1 0 1 0 0 0 0
3 1 0 1 0 1 0 0 1
4 1 1 1 0 1 0 0 0
Thanks.
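We can use base R to do this. Loop through the rows of the dataset (excluding the first column), build the sequence from 1 to the maximum value in the row, check which of those values occur in the row (%in%), and convert the logical result to integer with as.integer. Because the rows have different maxima, the list elements have different lengths, so pad them with NAs to a common length, rbind them, cbind the result with the first column, and finally replace the NAs with 0.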
lst <- apply(df[-1], 1, function(x) as.integer(seq_len(max(x, na.rm = TRUE)) %in% x))
res <- cbind(df[1], do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
res[is.na(res)] <- 0
colnames(res)[-1] <- paste0('c', 1:8)
res
# ID c1 c2 c3 c4 c5 c6 c7 c8
#1 1 1 0 1 1 0 1 0 0
#2 2 0 1 0 1 0 0 0 0
#3 3 1 0 1 0 1 0 0 1
#4 4 1 1 1 0 1 0 0 0
In base R, you can use:
table(transform(cbind(mydf[1], stack(mydf[-1]))[1:2], values = factor(values, 1:8)))
## values
## ID 1 2 3 4 5 6 7 8
## 1 1 0 1 1 0 1 0 0
## 2 0 1 0 1 0 0 0 0
## 3 1 0 1 0 1 0 0 1
## 4 1 1 1 0 1 0 0 0
Note that you need to convert the stacked values to factor if you want the "7" to be included in the output. This applies to the "data.table" and "tidyverse" approaches as well.
Alternatively, you can try the following with "data.table":
library(data.table)
melt(as.data.table(mydf), "ID", na.rm = TRUE)[
, dcast(.SD, ID ~ factor(value, 1:8), fun = length, drop = FALSE)]
Or the following with the "tidyverse":
library(tidyverse)
mydf %>%
gather(var, val, -ID, na.rm = TRUE) %>%
select(-var) %>%
mutate(var = 1, val = factor(val, 1:8)) %>%
spread(val, var, fill = 0, drop = FALSE)
Sample data:
mydf <- structure(list(ID = 1:4, v1 = c(1L, 4L, 3L, 2L), v2 = c(3L, 2L,
1L, 5L), v3 = c(6L, NA, 8L, 3L), v4 = c(4L, NA, 5L, 1L), v5 = c(NA,
NA, NA, NA)), .Names = c("ID", "v1", "v2", "v3", "v4", "v5"), row.names = c(NA,
4L), class = "data.frame")
If automation is important, you can also use syntax like factor(value, sequence(max(value))) in the "data.table" approach or val = factor(val, sequence(max(val))) in the "tidyverse" approach.
Another base R answer with some similarities to akrun's is
# create matrix of values
myMat <- as.matrix(dat[-1])
# create result matrix of desired shape, filled with 0s
res <- matrix(0L, nrow(dat), ncol=max(myMat, na.rm=TRUE))
# use matrix indexing to fill in 1s
res[cbind(dat$ID, as.vector(myMat))] <- 1L
# convert to data.frame, add ID column, and provide variable names
setNames(data.frame(cbind(dat$ID, res)), c("ID", paste0("c", 1:8)))
which returns
ID c1 c2 c3 c4 c5 c6 c7 c8
1 1 1 0 1 1 0 1 0 0
2 2 0 1 0 1 0 0 0 0
3 3 1 0 1 0 1 0 0 1
4 4 1 1 1 0 1 0 0 0
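
The approaches above hard-code 8 as the number of dummy columns. As a sketch of a variant that derives it from the data (assuming all values are positive integers; the names vals, n and dummies are illustrative), computing the overall maximum once also removes the need to pad rows with NA:

```r
mydf <- data.frame(ID = 1:4,
                   v1 = c(1L, 4L, 3L, 2L), v2 = c(3L, 2L, 1L, 5L),
                   v3 = c(6L, NA, 8L, 3L), v4 = c(4L, NA, 5L, 1L),
                   v5 = NA)

vals <- as.matrix(mydf[-1])       # values only, without the ID column
n <- max(vals, na.rm = TRUE)      # number of dummy columns, taken from the data

# For each row, flag which of 1..n occur in it; apply() returns one column per
# input row, so transpose to get one row per ID
dummies <- t(apply(vals, 1, function(x) as.integer(seq_len(n) %in% x)))
colnames(dummies) <- paste0("c", seq_len(n))

res <- cbind(mydf["ID"], dummies)
res
```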

R programming, subsetting data, and plotting graphs

I have a data frame in R with a column of 0s and 1s. I have found the row indices at which the value toggles; now I want to split the data into separate data frames at those rows.
This is the data:
row id mode
1 0
2 0
3 1
4 1
5 0
6 0
7 0
8 1
9 1
10 1
After splitting dataframe there should be 4 new dataframes:
y[1] :
row id mode
1 0
2 0
y[2]
row id mode
3 1
4 1
y[3]
row id mode
5 0
6 0
7 0
And so on.
We can create a grouping variable based on the difference of adjacent elements in 'mode' and split the dataset based on that
split(df1, cumsum(c(TRUE, diff(df1$mode)!=0)))
#$`1`
# row id mode
#1 1 0
#2 2 0
#$`2`
# row id mode
#3 3 1
#4 4 1
#$`3`
# row id mode
#5 5 0
#6 6 0
#7 7 0
#$`4`
# row id mode
#8 8 1
#9 9 1
#10 10 1
Or another option is to use rleid from data.table
library(data.table)
split(df1, rleid(df1$mode))
Or using rle from base R
split(df1, with(rle(df1$mode), rep(seq_along(values), lengths)))
data
df1 <- structure(list(`row id` = 1:10, mode = c(0L, 0L, 1L, 1L, 0L,
0L, 0L, 1L, 1L, 1L)), .Names = c("row id", "mode"),
class = "data.frame", row.names = c(NA, -10L))
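
To make the first option easier to follow, here is the grouping vector built step by step. This is only an illustrative breakdown of cumsum(c(TRUE, diff(mode) != 0)); the intermediate names changed and run_id, and the column name row_id, are assumptions:

```r
mode <- c(0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L)

# TRUE wherever a new run starts; the first row always starts a run
changed <- c(TRUE, diff(mode) != 0)

# Counting the run starts gives every row the index of the run it belongs to
run_id <- cumsum(changed)
run_id
# 1 1 2 2 3 3 3 4 4 4

# split() then returns one data frame per run
pieces <- split(data.frame(row_id = seq_along(mode), mode = mode), run_id)
length(pieces)
# 4
```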

How to find columns that fit a specific range (per individual) and add 1, else 0, using R

I have a data frame with three initial columns: ID, start and end positions. The rest of the columns are numeric chromosomal positions, and it looks like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4
ind2 1 3
ind3 5 7
What I want is to fill in the empty columns (1:n) based on the range for every individual (start:end). For example, for the first individual (ind1) the range goes from position 2 to 4, so the positions inside the range are filled with one (1) and the positions outside the range with zero (0). To simplify, the desired output should look like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4 0 1 1 1 0 0 0 ... 0
ind2 1 3 1 1 1 0 0 0 0 ... 0
ind3 5 7 0 0 0 0 1 1 1 ... 0
I will appreciate any comment.
Supposing you know the number of columns you could use the between function from the data.table package:
cols <- paste0('c',1:7)
library(data.table)
setDT(DF)[, (cols) := lapply(1:7, function(x) +(between(x, start, end)))][]
which gives:
ID start end c1 c2 c3 c4 c5 c6 c7
1: ind1 2 4 0 1 1 1 0 0 0
2: ind2 1 3 1 1 1 0 0 0 0
3: ind3 5 7 0 0 0 0 1 1 1
Notes:
It is better not to name your columns with just numbers. Therefore I added a c at the start of the column names.
Using + in +(between(x, start, end)) is a kind of trick to convert the logical result to integer. The more idiomatic way is as.integer(between(x, start, end)).
Used data:
DF <- read.table(text="ID start end
ind1 2 4
ind2 1 3
ind3 5 7", header=TRUE)
If you were to begin with the data frame df, without the columns already added,
ID start end
1 ind1 2 4
2 ind2 1 3
3 ind3 5 7
you could do
mx <- max(df[-1])
M <- Map(function(x, y) replace(integer(mx), x:y, 1L), df$start, df$end)
cbind(df, do.call(rbind, M))
# ID start end 1 2 3 4 5 6 7
# 1 ind1 2 4 0 1 1 1 0 0 0
# 2 ind2 1 3 1 1 1 0 0 0 0
# 3 ind3 5 7 0 0 0 0 1 1 1
The number of new columns will equal the maximum of the start and end columns.
Data:
df <- structure(list(ID = structure(1:3, .Label = c("ind1", "ind2",
"ind3"), class = "factor"), start = c(2L, 1L, 5L), end = c(4L,
3L, 7L)), .Names = c("ID", "start", "end"), class = "data.frame", row.names = c(NA,
-3L))
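
Both answers loop over either positions or rows. As a hedged sketch of a loop-free base R variant, outer() can compare every start and end against every position at once. It assumes the number of positions n is known in advance; the names pos and inside are illustrative:

```r
DF <- data.frame(ID = c("ind1", "ind2", "ind3"),
                 start = c(2L, 1L, 5L),
                 end = c(4L, 3L, 7L))

n <- 7L            # assumed number of position columns
pos <- seq_len(n)

# inside[i, j] is TRUE when start[i] <= pos[j] <= end[i]
inside <- outer(DF$start, pos, "<=") & outer(DF$end, pos, ">=")
storage.mode(inside) <- "integer"   # TRUE/FALSE -> 1/0
colnames(inside) <- paste0("c", pos)

res <- cbind(DF, inside)
res
```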

Expanding a data.frame by replacing missing values with set of all possible values in R

I want to expand my dataset by replacing each incomplete row with the set of all possible rows. Does anyone have any suggestions for an efficient way to do this?
For example, suppose X and Z can each take values 0 or 1.
Input:
id y x z
1 1 0 0 NA
2 2 1 NA 0
3 3 0 1 1
4 4 1 NA NA
Output:
id y x z
1 1 0 0 0
2 1 0 0 1
3 2 1 0 0
4 2 1 1 0
5 3 0 1 1
6 4 1 0 0
7 4 1 0 1
8 4 1 1 0
9 4 1 1 1
At the moment I just work through the original dataset row by row:
for(i in 1:N){
  if(is.na(temp.dat$x[i]) & !is.na(temp.dat$z[i])){
    # only x is missing: two candidate rows
    augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,3] <- c(0,1)
  }else if(!is.na(temp.dat$x[i]) & is.na(temp.dat$z[i])){
    # only z is missing: two candidate rows
    augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,4] <- c(0,1)
  }else if(is.na(temp.dat$x[i]) & is.na(temp.dat$z[i])){
    # both missing: four candidate rows
    augment <- matrix(rep(temp.dat[i,],4),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,3] <- c(0,0,1,1)
    augment[,4] <- c(0,1,0,1)
  }
}
You could try the following:
Create an "indx" holding the count of NAs in each row (rowSums(is.na(...)))
Use "indx" to expand the rows of the original dataset (df[rep(1:nrow(df), ...), ]), repeating each row 2^indx times
Loop over "indx" (sapply), use each value as the "times" argument in rep, and call expand.grid on the values 0, 1 to create "lst"
split the expanded dataset, "df1", by "id" to create "lst2"
Use Map to replace the NA values in each element of "lst2" with the corresponding values in "lst"
rbind the list elements
indx <- rowSums(is.na(df[-1]))
df1 <- df[rep(1:nrow(df), 2^indx),]
lst <- sapply(indx, function(x) expand.grid(rep(list(0:1), x)))
lst2 <- split(df1, df1$id)
res <- do.call(rbind,Map(function(x,y) {x[is.na(x)] <- as.matrix(y);x},
lst2, lst))
row.names(res) <- NULL
res
# id y x z
#1 1 0 0 0
#2 1 0 0 1
#3 2 1 0 0
#4 2 1 1 0
#5 3 0 1 1
#6 4 1 0 0
#7 4 1 1 0
#8 4 1 0 1
#9 4 1 1 1
data
df <- structure(list(id = 1:4, y = c(0L, 1L, 0L, 1L), x = c(0L, NA,
1L, NA), z = c(NA, 0L, 1L, NA)), .Names = c("id", "y", "x", "z"
), class = "data.frame", row.names = c("1", "2", "3", "4"))
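
As an alternative sketch of the same idea, each row can be expanded on its own with expand.grid: a column contributes its observed value when present and both 0 and 1 when missing. This assumes NAs occur only in the binary columns (here x and z); the helper name expand_row is hypothetical:

```r
df <- data.frame(id = 1:4,
                 y = c(0L, 1L, 0L, 1L),
                 x = c(0L, NA, 1L, NA),
                 z = c(NA, 0L, 1L, NA))

# Hypothetical helper: all completions of a single one-row data frame
expand_row <- function(row) {
  # candidate values per column: 0:1 where missing, else the observed value
  choices <- lapply(row, function(v) if (is.na(v)) 0:1 else v)
  expand.grid(choices)
}

res <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) expand_row(df[i, ])))
row.names(res) <- NULL
res
```

As with the answer above, expand.grid varies the first missing column fastest, so the rows for id 4 come out in the order (0,0), (1,0), (0,1), (1,1) rather than the order shown in the question.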
