Subset data.frame based on lag between two columns

Suppose you want to subset a data.frame where the rule for keeping rows is based
on a lag between columns 'a' and 'b':
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))
# output
a b
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
Essentially, if 'a' = 1 you want to keep that row as well as the subsequent run of rows in
'b' that have a value of 1. The capture continues until the next row where a = 0 & b = 0.
I've tried using nested ifelse() statements, but I am stuck on how to incorporate logical tests based on the lag.
Suggestions?

This is how I would do it. There are probably options out there that need one or two fewer lines.
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))

library(dplyr)
df %>%
  mutate(grp = cumsum(a == 1 | a + b == 0)) %>%
  group_by(grp) %>%
  filter(any(a == 1)) %>%
  ungroup() %>%
  select(a, b)
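To see why this works, here is the intermediate grp column on the example input: a new group starts at every row where a == 1 or where both columns are 0, and filter(any(a == 1)) then keeps only groups 1 and 3.
df %>% mutate(grp = cumsum(a == 1 | a + b == 0))
#    a b grp
# 1  1 0   1
# 2  0 1   1
# 3  0 1   1
# 4  0 0   2
# 5  1 0   3
# 6  0 1   3
# 7  0 1   3
# 8  0 0   4
# 9  0 0   5
# 10 0 0   6
# 11 0 1   6
# 12 0 1   6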

A solution without dplyr, working with a flag:
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))

# create a new empty df
new_df <- read.table(text = "", col.names = c("a", "b"))

a_okay <- FALSE # initialize the flag
for (row_number in seq_len(nrow(df))) { # loop over each row of the original df
  # if a is 1, add the row to the new df and set the flag to TRUE
  if (df[row_number, "a"] == 1) {
    a_okay <- TRUE
    new_df[nrow(new_df) + 1, ] <- c(df[row_number, "a"], df[row_number, "b"])
  } else {
    # if b is 1 and we are still following an a == 1: add the row
    if (df[row_number, "b"] == 1 && a_okay) {
      new_df[nrow(new_df) + 1, ] <- c(df[row_number, "a"], df[row_number, "b"])
    } else {
      # otherwise (b is 0, or b is 1 outside a run) reset the flag
      a_okay <- FALSE
    }
  }
}
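Running the loop on the example input should reproduce the expected output:
new_df
#   a b
# 1 1 0
# 2 0 1
# 3 0 1
# 4 1 0
# 5 0 1
# 6 0 1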

Another base solution inspired by this post, @Wietse de Vries's answer and @Ben's comment.
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))

# identify groups
df$grp <- cumsum(df$a == 1 | df$b == 0)

# subset df by the groups whose first element has a == 1
df <- do.call(rbind, split(df, df$grp)[by(df, df$grp, function(x) x$a[1] == 1)])
# remove grp
df$grp <- NULL
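The result should again match the expected rows. Note that do.call(rbind, split(...)) keeps group-prefixed row names (e.g. "1.1"), so you may want to reset them:
rownames(df) <- NULL
df
#   a b
# 1 1 0
# 2 0 1
# 3 0 1
# 4 1 0
# 5 0 1
# 6 0 1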

Related

Creating group ids by comparing values of two variables across rows: in R

I have a dataframe with two variables (start, end). I would like to create an identifier variable which grows in ascending order of start and, most importantly, is kept constant if the value of start coincides with the end of any other row in the dataframe.
Below is a simple example of the data
toy_data <- data.frame(start = c(1,5,6,10,16),
                       end = c(10,9,11,15,17))
The output I would be looking for is the following:
output_data <- data.frame(start = c(1,10,5,6,16),
                          end = c(10,15,9,11,17),
                          NEW_VAR = c(1,1,2,3,4))
You could try adapting this answer to group by ranges that are adjacent to each other. Credit goes entirely to @r2evans.
In this case, you would use expand.grid to get combinations of start and end. Instead of labels you would have row numbers rn to reference.
In the end, you can number the groups based on which rows appear together in the list. The last few lines starting with enframe use tibble/tidyverse. To match the group numbers, I re-sorted the results as well.
I hope this might be helpful.
library(tidyverse)

toy_data <- data.frame(start = c(1,5,6,10,16),
                       end = c(10,9,11,15,17))

toy_data$rn = 1:nrow(toy_data)

eg <- expand.grid(a = seq_len(nrow(toy_data)), b = seq_len(nrow(toy_data)))
eg <- eg[eg$a < eg$b, ]

together <- cbind(
  setNames(toy_data[eg$a, ], paste0(names(toy_data), "1")),
  setNames(toy_data[eg$b, ], paste0(names(toy_data), "2"))
)
together <- subset(together, end1 == start2)

groups <- split(together$rn2, together$rn1)
for (i in toy_data$rn) {
  ind <- (i == names(groups)) | sapply(groups, `%in%`, x = i)
  vals <- groups[ind]
  groups <- c(
    setNames(list(unique(c(i, names(vals), unlist(vals)))), i),
    groups[!ind]
  )
}

min_row <- as.numeric(sapply(groups, min))
ctr <- seq_along(groups)

lapply(ctr[order(match(min_row, ctr))], \(x) toy_data[toy_data$rn %in% groups[[x]], ]) %>%
  enframe() %>%
  unnest(cols = value) %>%
  select(-rn)
Output
name start end
<int> <dbl> <dbl>
1 1 1 10
2 1 10 15
3 2 5 9
4 3 6 11
5 4 16 17
The following function should give you the desired identifier variable NEW_VAR.
identifier <- \(df) {
  x <- array(0L, dim = nrow(df))
  count <- 0L
  my_seq <- seq_len(nrow(df))
  for (i in my_seq) {
    if (!df[i, ]$start %in% df$end) {
      x[i] <- my_seq[i] + count
    } else {
      x[i] <- my_seq[i] - 1L + count
      count <- count - 1L
    }
  }
  x
}
Examples
# your example
toy_data <- data.frame(start = c(1,10,5,6,16),
                       end = c(10,15,9,11,17))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 1 1 2 3 4
# other example
toy_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
                       end = c(16, 2, 21, 2, 2, 2, 3, 1))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 0 0 0 1 1 1 2 2

Nested if statements inside case_when

I'm attempting to mutate a large data set, and am wondering if there is a better way to run this code:
library(dplyr)
library(purrr)

df <- data.frame(id = c('person 1', 'person 2'),
                 a1 = c(0, 1), a2 = c(1, 0), a3 = c(0, 0),
                 b1 = c(0, 1), b3 = c(1, 0))
new_function <- function(name){
  df %>% mutate(
    !!name := case_when(
      if (any(names(df) == paste0(name, '1'))) { !!sym(paste0(name, '1')) == 1 ~ 1 },
      if (any(names(df) == paste0(name, '2'))) { !!sym(paste0(name, '2')) == 1 ~ 2 },
      if (any(names(df) == paste0(name, '3'))) { !!sym(paste0(name, '3')) == 1 ~ 3 },
      TRUE ~ 0)
  )
}
col_names <- c('a', 'b')
col_names %>%
  map(new_function)
To add two new columns:
a = c(2,1)
b = c(3,1)
The problem is that the vector col_names has over a hundred entries, and I'm not sure which columns are missing. It seems very inefficient to scan the entire data frame for each column name in each statement of case_when.
I tried to use possibly() or safely() (without the if statements) to ignore the error and keep running, but it gave the same error as before.
*Note: each person can only have one '1' for each letter, so only one statement of case_when can evaluate to TRUE.
Since we know that each person can have only one 1 per col_names prefix, we can use that fact to simplify the logic. We select the columns which start with the prefix, get the column which has the 1 in it using max.col(), and return the value from that column's name by removing all the non-numeric characters from it.
library(dplyr)
new_function <- function(x) {
  temp <- df %>% select(starts_with(x))
  as.integer(gsub("\\D", "", names(temp)[max.col(temp)]))
}
bind_cols(df, purrr::map_dfc(col_names, new_function))
# id a1 a2 a3 b1 b3 V1 V2
#1 person 1 0 1 0 0 1 2 3
#2 person 2 1 0 0 1 0 1 1
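If you also want the new columns named after col_names instead of V1/V2, one possible tweak (an untested sketch using purrr::set_names, so map_dfc receives a named vector and bind_cols picks up the names) is:
bind_cols(df, purrr::map_dfc(purrr::set_names(col_names), new_function))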

Create new column to identify the first value >0 across different columns

I have the following data frame
df <- data.frame(col1 = c(0,0,1,1), col2 = c(1,0,0,3), col3 = c(1,0,0,3))
How can I identify the first value in each row that is greater than 0?
The expected output is like this
df <- data.frame(col1 = c(0,0,1,1),col2 = c(1,0,0,3),col3 = c(1,0,0,3),col4 = c(1,0,1,1))
And I have tried the following:
for (i in 1:3){
  df$col4 <- apply(df[, c(0:i)], 1, sum)
  if (df$col4 > 0)
    break
}
We can use max.col() for this.
df[cbind(1:nrow(df), max.col(df > 0, "first"))]
# [1] 1 0 1 1
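To store this as the new column, assign it before col4 exists so that df > 0 only looks at the original columns; a minimal sketch:
df$col4 <- df[cbind(1:nrow(df), max.col(df > 0, "first"))]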
Another option is a row-wise apply(), taking the first positive value and then replacing the NA rows (no positive value) with 0:
df$col4 <- apply(df, 1, function(x) x[which(x > 0)[1]])
df[is.na(df$col4), 'col4'] <- 0
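Either way, on the example input col4 should come out as:
df$col4
# [1] 1 0 1 1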
Here is another idea using mapply():
unlist(mapply(`[`, split(df, 1:nrow(df)), max.col(df>0, ties.method = 'first')))
#1.col2 2.col1 3.col1 4.col1
# 1 0 1 1
Depending on what you mean by 'count backward', you can either change ties.method to 'last', i.e.
unlist(mapply(`[`, split(df, 1:nrow(df)), max.col(df>0, ties.method = 'last')))
#1.col2.1 2.col2.1 3.col1 4.col2.1
# 1 0 1 3
Or reverse the data frame and leave ties.method to 'first', i.e.
unlist(mapply(`[`, split(rev(df), 1:nrow(df)), max.col(df>0, ties.method = 'first')))
# 1.col2 2.col2.1 3.col2.1 4.col2.1
# 1 0 0 3

For loop for updating data.frame

I am trying to write a "for loop" to update my R data frame by iterating.
Here is my code:
datalist <- list()
for (i in 1:5) {
  dat <- data.frame(ID = LETTERS[seq(from = 1, to = 20)],
                    nutrition = rnorm(20, mean = 50, sd = 10),
                    Stage = c(rep("A1", 5), rep("B1", 15)))
  dat$ADG <- dat$nutrition * 0.05
  dat$M_weight <- dat$nutrition * 0.5 + dat$ADG * 100
  dat$Age <- dat$M_weight * 1.1 + dat$ADG * 0.6
  dat$Stage <- as.character(dat$Stage)
  dat$Stage[dat$ADG >= 3] <- "C1"
  dat$i <- i           # keep track of which iteration produced it
  datalist[[i]] <- dat # add it to the list
}
big_data <- do.call(rbind, datalist)
From iteration 2 onward, I would like to have "Stage" updated to "C1" if ADG is equal to or greater than 3, but this should not apply to iteration 1.
Thank you so much! I appreciate any replies!
I think you want a recursive function instead of an iterative one.
Your data (with stringsAsFactors = FALSE):
dat <- data.frame(ID = LETTERS[seq(from = 1, to = 20)],
                  nutrition = rnorm(20, mean = 50, sd = 10),
                  Stage = c(rep("A1", 5), rep("B1", 15)),
                  stringsAsFactors = FALSE)
Use tidyverse for the dplyr and purrr verbs:
library(tidyverse)
special <- function(dat, counter, end) {
  dat1 <- dat %>%
    mutate(ADG = nutrition * 0.05) %>%
    mutate(M_weight = nutrition * 0.5 + ADG * 100) %>%
    mutate(Age = M_weight * 1.1 + ADG * 0.6) %>%
    mutate(Stage = ifelse(ADG >= 3, "C1", Stage)) %>%
    mutate(i = counter)
  if (counter < end) {
    special(dat1, counter + 1, end)
  } else {
    return(dat1)
  }
}

desired <- map_df(2:5, ~special(dat, 1, .x))
head(desired)
ID nutrition Stage ADG M_weight Age i
1 A 47.17826 A1 2.358913 259.4804 286.8438 2
2 B 64.55988 C1 3.227994 355.0794 392.5241 2
3 C 52.29020 A1 2.614510 287.5961 317.9244 2
4 D 59.96544 A1 2.998272 329.8099 364.5899 2
Let me know if this is not the output you were expecting.

Speed up fill function R

I have a dataframe with missing values that I've written a function to fill, using R 3.3.2:
pkgs <- c("dplyr", "ggplot2", "tidyr", 'data.table', 'lazyeval')
lapply(pkgs, require, character.only = TRUE)
UID <- c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C')
Col1 <- c(1, 0, 0, 0, 1, 0, 0, 0)
df <- data.frame(UID, Col1)
Function to fill in Col1:
AggregatedColumns <- function(DF, columnToUse, NewCol1) {
  # Setting up column names to use
  columnToUse <- deparse(substitute(columnToUse))
  NewCol1 <- deparse(substitute(NewCol1))
  # Creating new columns
  DF[[NewCol1]] <- ifelse(DF[[columnToUse]] == 1, 1, NA)
  DF <- DF %>% group_by_("UID") %>% sort(DF[[columnToUse]], decreasing = TRUE) %>% fill_(NewCol1)
  DF <- DF %>% group_by_("UID") %>% sort(DF$columnToUse, decreasing = TRUE) %>% fill_(NewCol1, .direction = 'up')
  DF[[NewCol1]] <- ifelse(is.na(DF[[NewCol1]]), 0, DF[[NewCol1]])
  DF
}
I've pulled out this part of the function since it is the piece that slows everything down. I'm very new to writing functions, and any advice on how/if this can be sped up would be appreciated. I've isolated the speed issue to the fill_ part of the function.
What I am trying to do is pass a dummy variable from Col1 to New_Column and then forward-fill it to the other rows with the same ID. For example:
UID        Col1
John Smith 1
John Smith 0
Should become
UID        Col1 New_Column
John Smith 1    1
John Smith 0    1
EDITED FUNCTION
I edited the function following @HubertL's suggestion. The function is still fairly slow, but hopefully with these edits the example is reproducible.
AggregatedColumns <- function(DF, columnToUse, NewCol1) {
  # Setting up column names to use
  columnToUse <- deparse(substitute(columnToUse))
  NewCol1 <- deparse(substitute(NewCol1))
  # Creating new columns
  DF[[NewCol1]] <- ifelse(DF[[columnToUse]] == 1, 1, NA)
  DF <- DF %>% group_by_("UID") %>% fill_(NewCol1) %>% fill_(NewCol1, .direction = 'up')
  DF[[NewCol1]] <- ifelse(is.na(DF[[NewCol1]]), 0, DF[[NewCol1]])
  DF
}
Desired output:
UID Col1 New
A   1    1
A   0    1
A   0    1
B   0    1
B   1    1
B   0    1
C   0    0
C   0    0
First of all, here are a few points:
You are needlessly calling ifelse() (twice), and that function is quite inefficient.
You are needlessly using an inefficient by-group operation from an external package (also twice) when you could simply vectorize the process with base R.
Here's a simple one-liner without any external packages that improves performance by a factor of about 72 on a 5e7-row data set (and probably much more on bigger ones):
AggregatedColumns2 <- function(DF, columnToUse, NewCol1) {
  # Setting up column names to use
  columnToUse <- deparse(substitute(columnToUse))
  NewCol1 <- deparse(substitute(NewCol1))
  # Creating the new column (one simple line)
  DF[[NewCol1]] <- as.integer(DF$UID %in% DF$UID[DF[[columnToUse]] == 1])
  # returning the new data set
  DF
}
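Applied to the small example data frame from the question, this marks every row whose UID has at least one 1 in Col1:
res <- AggregatedColumns2(df, Col1, New)
res
#   UID Col1 New
# 1   A    1   1
# 2   A    0   1
# 3   A    0   1
# 4   B    0   1
# 5   B    1   1
# 6   B    0   1
# 7   C    0   0
# 8   C    0   0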
Benchmark
set.seed(123)
library(stringi)
N <- 5e7
UID <- stri_rand_strings(N, 2)
Col1 <- sample(0:1, N, replace = TRUE)
df <- data.frame(UID, Col1)
system.time(res <- AggregatedColumns(df, Col1, NewCol1))
# user system elapsed
# 198.67 3.94 203.07
system.time(res2 <- AggregatedColumns2(df, Col1, NewCol1))
# user system elapsed
# 2.82 0.00 2.82
Now, in order to compare the two, I will reorder both and convert them to matrices, because the Hadleyverse packages add tons of unnecessary attributes (compare the mess created in str(res) vs the simple structure in str(res2)):
identical(arrange(res, UID) %>% as.matrix, arrange(res2, UID) %>% as.matrix)
## [1] TRUE
If speed is a concern, you may try this with data.table and na.locf() from the zoo package. LOCF means last observation carried forward.
library(data.table)
setDT(df)[Col1 != 0, New := Col1][, New := zoo::na.locf(New, na.rm = FALSE), UID][is.na(New), New := 0][]
# UID Col1 New
#1: A 1 1
#2: A 0 1
#3: A 0 1
#4: B 0 1
#5: B 1 1
#6: B 0 1
#7: C 0 0
#8: C 0 0
This is just to give an idea; it still needs to be wrapped in a function call.
It assumes that the value 0 in Col1 is to be treated as missing. Note the na.rm = FALSE: it keeps all-NA groups (like UID C) at full length so the final step can zero-fill them.
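A minimal sketch of such a wrapper, assuming the column names are passed in as strings (the name fill_by_uid is made up for illustration):
library(data.table)

# hypothetical wrapper around the one-liner above; col/new are column names as strings
fill_by_uid <- function(DF, col, new) {
  setDT(DF)[get(col) != 0, (new) := get(col)
            ][, (new) := zoo::na.locf(get(new), na.rm = FALSE), by = UID
            ][is.na(get(new)), (new) := 0][]
}

fill_by_uid(df, "Col1", "New")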
