Apply a conditional on two groups of columns within a dataframe - R

I have a df:
a<-c(5,1,5,3,5,3,5,1)
b<-c(1,5,1,5,1,5,3,5)
df<-as.data.frame(rbind(a,b))
names(df)<-c('pre1','post1','pre2','post2','pre3','post3','pre4','post4')
And I have two groups of samples within the columns, e.g. 'pre' and 'post':
pre<-seq(1,8,by=2)
post<-seq(2,8,by=2)
I would like to keep a row if it passes in 100% of the 'pre' columns and at least 50% of the 'post' columns, OR in at least 50% of the 'pre' columns and 100% of the 'post' columns, i.e.
if 100% of 'pre' are 3 or over AND at least 50% of 'post' are 3 or over, keep the row
OR
if at least 50% of 'pre' are 3 or over AND 100% of 'post' are 3 or over, keep the row
so in the example df only row 'a' would stay.
I have:
test<- ((df[apply(df[pre],1,function(x) sum(x>=3)/length(x)),] &
df[apply(df[post],1,function(x) sum(x>3)/length(x))>=0.5,]) |
(df[apply(df[pre],1,function(x) sum(x>3)/length(x))>=0.5,] &
df[apply(df[post],1,function(x) sum(x>3)/length(x)),]))
But I get a vector of 'TRUEs' which isn't what I want.

We can create a logical vector to compare using rowSums:
df[(rowSums(df[pre] >= 3)/length(pre) == 1) &
     (rowSums(df[post] >= 3)/length(post) >= 0.5) |
     (rowSums(df[post] >= 3)/length(post) == 1) &
     (rowSums(df[pre] >= 3)/length(pre) >= 0.5), ]
# pre1 post1 pre2 post2 pre3 post3 pre4 post4
#a 5 1 5 3 5 3 5 1
Using apply, we can do:
df[apply(df[pre] >= 3, 1, all) & apply(df[post] >= 3, 1, sum)/length(post) >= 0.5 |
     apply(df[post] >= 3, 1, all) & apply(df[pre] >= 3, 1, sum)/length(pre) >= 0.5, ]
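Equivalently (a small variant, not from the answers above), rowMeans on the logical comparison gives the pass fraction per row directly, using the same df, pre and post objects:
# rowMeans of a logical matrix = proportion of columns passing in each row
keep <- (rowMeans(df[pre] >= 3) == 1 & rowMeans(df[post] >= 3) >= 0.5) |
  (rowMeans(df[post] >= 3) == 1 & rowMeans(df[pre] >= 3) >= 0.5)
df[keep, ]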

Here's a much less concise tidyverse solution that could probably be shortened substantially.
library(tidyverse)
pass_val = 3
df %>%
  rownames_to_column() %>%
  gather(col, val, -rowname) %>%
  separate("col", c("type", "num"), sep = -1) %>%
  count(rowname, type, pass = val >= pass_val) %>%
  spread(pass, n, fill = 0) %>%
  transmute(rowname, type, pass_pct = `TRUE`/(`TRUE` + `FALSE`)) %>%
  spread(type, pass_pct) %>%
  filter(post == 1 & pre >= 0.5 | post >= 0.5 & pre == 1)

Here is one option with tidyverse
library(tidyverse)
library(rap)
crossing(val = c(0.5, 1), cols = c("pre", "post")) %>%
  rap(x = ~ df %>%
        select(matches(cols)) %>%
        {rowMeans(. >= 3) >= val}) %>%
  group_by(val) %>%
  transmute(ind = reduce(x, `&`)) %>%
  filter(any(ind)) %>%
  pull(ind) %>%
  filter(df, .)
# pre1 post1 pre2 post2 pre3 post3 pre4 post4
#1 5 1 5 3 5 3 5 1

Here's a base R solution that splits by row name, checks the conditions using sapply, and uses the output as a logical index on df:
df[sapply(split(df, rownames(df)), function(x) {
  (sum(x[pre] > 2)/ncol(x[pre]) >= .5) & (sum(x[post] > 2)/ncol(x[post]) == 1) ||
    (sum(x[pre] > 2)/ncol(x[pre]) == 1) & (sum(x[post] > 2)/ncol(x[post]) >= .5)
}), ]
#### OUTPUT ####
pre1 post1 pre2 post2 pre3 post3 pre4 post4
a 5 1 5 3 5 3 5 1

Related

dplyr filter multiple variables (columns) with multiple conditions

I plan to filter data for multiple columns with multiple conditions in one line, to reduce the time needed to run the code. This is sample data that I used to test my code. Basically, I want to remove any rows containing 0, 1, 2, or NA.
test <- data.frame(A = c(1,0,2,3,4,0,5,6,0,7,0,8,0,9,NA),
                   B = c(0,1,0,2,3,4,0,5,0,7,8,0,NA,9,0),
                   C = c(1,2,3,0,0,4,5,6,0,7,0,8,NA,0,9))
I used the following code to clean my data. Although it does the job, the code is very tedious and takes me quite a while when I run it with a large database.
test %>% filter(!is.na(A)) %>%
filter(!is.na(B)) %>%
filter(!is.na(C)) %>%
filter(A != 0) %>%
filter(A != 1) %>%
filter(A != 2) %>%
filter(B != 0) %>%
filter(B != 1) %>%
filter(B != 2) %>%
filter(C != 0) %>%
filter(C != 1) %>%
filter(C != 2)
A B C
1 6 5 6
2 7 7 7
I tried to shorten the code using filter, filter_at, and any_vars, but it did not work. Below are my attempts to deal with this problem; none of them worked, because they could not delete the rows containing 0 (or 1, 2, and NA).
df_total <- test %>%
  filter_at(vars(A, B, C), any_vars(!is.na(.))) %>%
  filter_at(vars(A, B, C), any_vars(. != 2)) %>%
  filter_at(vars(A, B, C), any_vars(. != 1)) %>%
  filter_at(vars(A, B, C), any_vars(. != 0))

df_total <- test %>%
  filter_at(vars(A, B, C), any_vars(!is.na(.) | . != 2 | . != 1 | . != 0))

df_total <- test %>%
  filter(!is.na(A) | A != 2 | A != 1 | A != 0) %>%
  filter(!is.na(B) | B != 2 | B != 1 | B != 0) %>%
  filter(!is.na(C) | C != 2 | C != 1 | C != 0)
I cannot figure out what I did incorrectly here. I went back and forth between the documentation and R to solve this problem, but my efforts were useless. Could you please suggest what I did wrong in my code? How can I write code for multiple columns with multiple conditions in just one line? The point of one line is to speed up the running time in R. Any advice/suggestions/resources to find the answer would be appreciated! Thank you.
Another possible solution:
library(dplyr)
test %>%
filter(complete.cases(.) & if_all(everything(), ~ !(.x %in% 0:2)))
#> A B C
#> 1 6 5 6
#> 2 7 7 7
test %>%
filter(across(c(A, B, C), function(x) !is.na(x) & !x %in% c(0, 1, 2)))
# A B C
# 6 5 6
# 7 7 7
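As a note on why the filter_at attempts in the question fail: any_vars() keeps a row if at least one selected column satisfies the predicate, and a condition like . != 2 | . != 1 | . != 0 is true for every non-missing value, so nothing is dropped. The row-wise "every column must pass" logic corresponds to all_vars() (or if_all(), as shown above). A minimal sketch using the superseded filter_at()/all_vars() interface:
library(dplyr)
# every selected column must be non-missing and outside 0:2
test %>%
  filter_at(vars(A, B, C), all_vars(!is.na(.) & !. %in% 0:2))
# A B C
# 6 5 6
# 7 7 7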

Is there a way to speed up a for loop?

I am working with a for loop. The goal of the for loop is to simply test a condition and code the data appropriately. The loop has to iterate over 503,288 unique values and includes three if statements. Is there a way to speed up the for loop?
The code is as follows:
count <- 0
for(i in unique(Data$ID)){ #503288
  # Subset relevant data
  Loop_Before <- subset(Primary_Before, ID == i)
  Loop_After <- subset(Primary_After, ID == i)
  if(nrow(Loop_Before) >= 1 & nrow(Loop_After) >= 1){
    Data$Status[Data$ID == i] <- "Both Elections"
  }
  if(nrow(Loop_Before) >= 1 & nrow(Loop_After) == 0){
    Data$Status[Data$ID == i] <- "Only Primary Election"
  }
  if(nrow(Loop_Before) >= 0 & nrow(Loop_After) == 1){
    Data$Status[Data$ID == i] <- "Only General Election"
  }
  message(count <- count + 1)
}
table(Data$Status)
Thank you for your help!
Avoid the for-loop entirely. I don't know your dataset but the following should be 10 or even 100 times faster:
library(tidyverse) # load some packages that will help

# let's create some sample data
Data <- data.frame(ID = c(1,1,1,1,2,2,2,3,3))
Primary_before <- data.frame(ID = c(0,1,2,2,3,3,3))
Primary_after <- data.frame(ID = c(1,3))

# now for every ID we count the nr of rows in both dataframes
summarised_before <- Primary_before %>%
  group_by(ID) %>%
  summarise(nrRows = n())
#      ID nrRows
#   <dbl>  <int>
# 1     0      1
# 2     1      1
# 3     2      2
# 4     3      3

summarised_after <- Primary_after %>%
  group_by(ID) %>%
  summarise(nrRows = n())
#      ID nrRows
#   <dbl>  <int>
# 1     1      1
# 2     3      1

# now we join them together
summarised_both <- summarised_after %>%
  full_join(summarised_before, by = "ID", suffix = c("_after", "_before"))

# and now we do the final calculation
summarised_both %>%
  mutate(nrRows_after = replace_na(nrRows_after, 0)) %>%
  mutate(Status = case_when(nrRows_before >= 1 & nrRows_after >= 1 ~ "Both elections"
                            , nrRows_before >= 1 & nrRows_after == 0 ~ "Only primary election"
                            , nrRows_before >= 0 & nrRows_after == 1 ~ "Only general election")) %>%
  filter(ID %in% Data$ID)
I saved the intermediate results, but you could also do it in one go, like this:
Primary_before %>%
  group_by(ID) %>%
  summarise(nrRows = n()) %>%
  full_join(Primary_after %>%
              group_by(ID) %>%
              summarise(nrRows = n())
            , by = "ID"
            , suffix = c("_before", "_after")) %>%
  mutate(nrRows_after = replace_na(nrRows_after, 0)) %>%
  mutate(Status = case_when(nrRows_before >= 1 & nrRows_after >= 1 ~ "Both elections"
                            , nrRows_before >= 1 & nrRows_after == 0 ~ "Only primary election"
                            , nrRows_before >= 0 & nrRows_after == 1 ~ "Only general election")) %>%
  filter(ID %in% Data$ID)
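If you prefer to stay in base R, the same per-ID check can be vectorised with %in%. This is a rough sketch rather than the approach above, assuming the Data, Primary_Before and Primary_After objects from the question and collapsing the three cases into both / primary-only / general-only:
# logical vectors: does each ID appear in the before/after tables?
in_before <- Data$ID %in% Primary_Before$ID
in_after <- Data$ID %in% Primary_After$ID

Data$Status[in_before & in_after] <- "Both Elections"
Data$Status[in_before & !in_after] <- "Only Primary Election"
Data$Status[!in_before & in_after] <- "Only General Election"

table(Data$Status)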

Implementing "at least" condition in a filter using R (dplyr)

This question is related to my previous post:
Consecutive exceedance above a threshold and additional conditions in R
Here's the data:
dat <- structure(list(V1 = c(-3.85326, -2.88262, -4.1405, -3.95193,
-6.68925, -2.04202, -2.47597, -4.91161, -2.5946, -2.82873, 2.68839,
-4.1287, -4.50296, -0.143476, -1.12174, -0.756168, -1.67556,
-1.92704, -1.89279, -2.37569, -5.71746, -2.7247, -4.12986, -2.29769,
-1.52835, -2.63623, -2.31461, 2.32796, 4.14354, 4.47055, -0.557311,
-0.425266, -2.37455, -5.97684, -5.22391, 0.374004, -0.986549,
2.36419, 0.218283, 2.66014, -3.44225, 3.46593, 1.3309, 0.679601,
5.42195, 10.6555, 8.34144, 1.64939, -1.64558, -0.754001, -4.77503,
-6.66197, -4.07188, -1.72996, -1.15338, -8.05588, -6.58208, 1.32375,
-3.69241, -5.23582, -4.33509, -7.43028, -3.57103, -10.4991, -8.68752,
-8.98304, -8.96825, -7.99087, -8.25109, -6.48483, -6.09004, -7.05249,
-4.78267)), class = "data.frame", row.names = c(NA, -73L))
What I want
I want to get the FIRST timestep satisfying the following modified conditions:
[1] V1 > 0 at the time step
[2] In the succeeding FOUR time steps (including the timestep in [1]), V1 > 0 in AT LEAST THREE timesteps
[3] Accumulated value of the next FOUR timesteps (including the timestep in [1]) should be greater than 1.
Here's the script so far:
library(dplyr)
newx <- dat %>% as_tibble() %>%
  mutate(time = 1:n()) %>%
  filter(V1 > 0, dplyr::lead(V1, 1) > 0, dplyr::lead(V1, 2) > 0,
         (dplyr::lead(V1, 1) + dplyr::lead(V1, 2) + dplyr::lead(V1, 3) +
            dplyr::lead(V1, 4)) > 1)
Output
> newx
# A tibble: 7 x 2
V1 time
<dbl> <int>
1 2.33 28
2 2.36 38
3 3.47 42
4 1.33 43
5 0.680 44
6 5.42 45
7 10.7 46
Problem
I don't know how to implement the second condition correctly. It should check whether at least three out of four timesteps are > 0. It doesn't matter whether they are consecutive or not.
Expected Output
The correct answer should be 28.
I'll appreciate any help.
If I've understood correctly and you want the first row that meets your conditions, you can use zoo::rollsum:
library(zoo)
library(dplyr)
library(tibble) # for rownames_to_column()

dat %>%
  rownames_to_column() %>%
  filter(V1 > 0 &
           rollsum(V1 > 0, 4, fill = NA, align = "left") >= 3 &
           rollsum(V1, 4, fill = NA, align = "left") > 1) %>%
  slice(1)
rowname V1
1 28 2.32796
Using stats::filter to do rolling sums:
which(
  (dat$V1 > 0) &
    (rev(stats::filter(rev(dat$V1 > 0), rep(1, 4), sides = 1)) >= 3) &
    (rev(stats::filter(rev(dat$V1), rep(1, 4), sides = 1)) > 1)
)[1]
#[1] 28
#[1] 28
Or if you have to incorporate into dplyr:
dat %>%
  slice(
    which(
      (rev(stats::filter(rev(V1 > 0), rep(1, 4), sides = 1)) >= 3) &
        (V1 > 0) &
        (rev(stats::filter(rev(V1), rep(1, 4), sides = 1)) > 1)
    )[1]
  )
## A tibble: 1 x 1
# V1
# <dbl>
#1 2.33
Wordier:
library(dplyr)
dat2 <- dat %>%
  tibble::rowid_to_column() %>%
  mutate(gtz = (V1 > 0) * 1,
         gtz_cuml = cumsum(gtz),
         gtz_next_three = lead(gtz_cuml, 3) - lag(gtz_cuml),
         cuml_V1 = cumsum(V1),
         V1_next_three = lead(cuml_V1, 3) - lag(cuml_V1)) %>%
  filter(gtz > 0,
         gtz_next_three >= 3,
         V1_next_three > 1) %>%
  slice(1)
#> dat2
# rowid V1 gtz gtz_cuml gtz_next_three cuml_V1 V1_next_three
#1 28 2.32796 1 2 3 -71.22716 9.959473

Creating functions with logical comparatives as input R

I've got several sequential comparative evaluations to conduct with two variables in R in order to check for concordance.
In this example, say I have a boolean ANES_6 and a numeric ANES. The boolean is 1 if the patient had anesthesia for more than 6 hours and 0 otherwise. The numeric value is the time, in hours, that the patient was under anesthesia.
I'm looking to write a function which can replace multiple copy-pastes of the following:
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 1 & ANES < 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 0 & ANES >= 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 1 & ANES >= 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 0 & ANES < 6) %>%
  tally()
I could create the following function (non-exhaustive of all cases shown above):
my_func <- function(x, y) {
  if (x == "gt" & y == 1) {
    data %>% select(ANES_6, ANES) %>%
      filter(ANES >= 6 & ANES_6 == 1) %>%
      tally()
  } else if (x == "lt" & y == 0) {
    data %>% select(ANES_6, ANES) %>%
      filter(ANES < 6 & ANES_6 != 1) %>%
      tally()
  }
}
which takes x and y as input, with values for x being c('lt', 'gt') and y being c(0, 1), in order to evaluate all possible conditions. However, this would entail writing more code, not less.
Is there a way to input logical comparisons in the function such that the following works:
my_func <- function(x, y) {
  data %>% select(ANES_6, ANES) %>%
    filter(ANES x 6 & ANES_6 == y)
}
with x replaced by >=, <, etc., in the input of the function. Currently, this does not work; are there any workarounds?
Try grouping. The question should normally include reproducible test data but I have provided it this time.
library(dplyr)
data <- data.frame(ANES_6 = c(0, 0, 1, 1), ANES = 5:6) # test data
data %>%
group_by(ANES_6, ANES >= 6) %>%
tally %>%
ungroup
giving:
# A tibble: 4 x 3
ANES_6 `ANES >= 6` n
<dbl> <lgl> <int>
1 0. FALSE 1
2 0. TRUE 1
3 1. FALSE 1
4 1. TRUE 1
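As a side note on the question as literally asked: comparison operators in R are themselves functions, so they can be passed as arguments. A minimal sketch (with dplyr attached, and assuming the data frame data with columns ANES and ANES_6 used above):
my_func <- function(op, y) {
  data %>%
    filter(op(ANES, 6) & ANES_6 == y) %>%  # op is applied like any other function
    tally()
}
my_func(`>=`, 1) # count of rows with ANES >= 6 and ANES_6 == 1
my_func(`<`, 0)  # count of rows with ANES < 6 and ANES_6 == 0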

Grouped operation on all groups relative to "baseline" group, with multiple observations

Starting with data containing multiple observations for each group, like this:
set.seed(1)
my.df <- data.frame(
  timepoint = rep(c(0, 1, 2), each = 3),
  counts = round(rnorm(9, 50, 10), 0)
)
> my.df
timepoint counts
1 0 44
2 0 52
3 0 42
4 1 66
5 1 53
6 1 42
7 2 55
8 2 57
9 2 56
To perform a summary calculation at each timepoint relative to timepoint == 0, for each group I need to pass a vector of counts for timepoint == 0 and a vector of counts for the group (e.g. timepoint == 1) to an arbitrary function, e.g.
NonsenseFunction <- function(x, y){
  (mean(x) - mean(y)) / (1 - mean(y))
}
I can get the required output from this table, either with dplyr:
library(dplyr)
my.df %>%
  group_by(timepoint) %>%
  mutate(rep = paste0("r", 1:n())) %>%
  left_join(x = ., y = filter(., timepoint == 0), by = "rep") %>%
  group_by(timepoint.x) %>%
  summarise(result = NonsenseFunction(counts.x, counts.y))
or data.table:
library(data.table)
my.dt <- data.table(my.df)
my.dt[, rep := paste0("r", 1:length(counts)), by = timepoint]
merge(my.dt, my.dt[timepoint == 0], by = "rep", all = TRUE)[
, NonsenseFunction(counts.x, counts.y), by = timepoint.x]
This only works if the number of observations between groups is the same. Anyway, the observations aren't matched, so using the temporary rep variable seems hacky.
For a more general case, where I need to pass vectors of the baseline values and the group's values to an arbitrary (more complicated) function, is there an idiomatic data.table or dplyr way of doing so with a grouped operation for all groups?
Here's the straightforward data.table approach:
my.dt[, f(counts, my.dt[timepoint==0, counts]), by=timepoint]
This probably grabs my.dt[timepoint==0, counts] again and again, for each group. You could instead save that value ahead of time:
v = my.dt[timepoint==0, counts]
my.dt[, f(counts, v), by=timepoint]
... or if you don't want to add v to the environment, maybe
with(list(v = my.dt[timepoint==0, counts]),
my.dt[, f(counts, v), by=timepoint]
)
You could supply the vector from your baseline group (timepoint == 0) as a constant in the second argument.
my.df %>%
  group_by(timepoint) %>%
  mutate(response = NonsenseFunction(counts, my.df$counts[my.df$timepoint == 0]))
Or if you want to make it beforehand:
constant <- my.df$counts[my.df$timepoint == 0]
my.df %>%
  group_by(timepoint) %>%
  mutate(response = NonsenseFunction(counts, constant))
You can try,
library(dplyr)
my.df %>%
  mutate(new = mean(counts[timepoint == 0])) %>%
  group_by(timepoint) %>%
  summarise(result = NonsenseFunction(counts, new))
# A tibble: 3 × 2
# timepoint result
# <dbl> <dbl>
#1 0 0.0000000
#2 1 0.1398601
#3 2 0.2097902
