Conditional operations in each group - r

I have some groups of data and in each group there is one number that is a multiple of 7.
For each group, I want to subtract the first value from that multiple.
Reproducible example:
temp.df <- data.frame("temp" = c(48:55, 70:72, 93:99))
temp.df$group <- cumsum(c(TRUE, diff(temp.df$temp) > 1))
Expected result:
group 1: 49-48 = 1
group 2: 70-70 = 0
group 3: 98-93 = 5
Can you suggest a way to do this without using a loop?

You can pick the number divisible by 7 in each group and subtract the group's first value from it.
In base R this can be done with aggregate:
aggregate(temp~group, temp.df, function(x) x[x %% 7 == 0] - x[1])
# group temp
#1 1 1
#2 2 0
#3 3 5
You can also do this with dplyr:
library(dplyr)
temp.df %>%
  group_by(group) %>%
  summarise(temp = temp[temp %% 7 == 0] - first(temp))
and with data.table:
library(data.table)
setDT(temp.df)[, .(temp = temp[temp %% 7 == 0] - first(temp)), group]

We can also use which.max to locate the first multiple of 7 in each group:
library(dplyr)
temp.df %>%
  group_by(group) %>%
  summarise(temp = temp[which.max(!temp %% 7)] - first(temp))
# A tibble: 3 x 2
# group temp
# <int> <int>
#1 1 1
#2 2 0
#3 3 5
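To see why the which.max version works, here is a minimal base R sketch on the first group alone. The names x, is_mult and idx are illustrative and do not come from the answers.

```r
# Values of group 1 from the question
x <- 48:55

# %% binds tighter than !, so !x %% 7 means !(x %% 7):
# TRUE exactly where the remainder is 0
is_mult <- !(x %% 7)

# which.max() gives the position of the first TRUE
idx <- which.max(is_mult)

# Multiple of 7 minus the first value of the group
result <- x[idx] - x[1]
result
```

One difference worth knowing: if a group had no multiple of 7, which.max would return 1 and the difference would silently be 0, whereas temp[temp %% 7 == 0] would be empty. The question guarantees one multiple per group, so both forms agree here.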

Related

Use apply functions within %>%

Below I create a function that deletes a specific column if there is only one unique value in it. Can I somehow use lapply within %>% to avoid calling the function three times? Or even call the function for all columns?
library(dplyr) # provides tibble() and %>%
df <- tibble(col1 = sample(1:6), col2 = sample(1:6), col3 = 3, col4 = 4)
condDelCol <- function(mycolumn, mydataframe) {
  # drop the column when it holds a single unique value
  if (length(unique(mydataframe[[mycolumn]])) == 1) { mydataframe[[mycolumn]] <- NULL }
  mydataframe
}
df %>%
  condDelCol("col2", .) %>%
  condDelCol("col3", .) %>%
  condDelCol("col4", .)
With dplyr, an option is select_if:
library(dplyr)
df %>%
  select_if(~ n_distinct(.) > 1)
# A tibble: 6 x 2
# col1 col2
# <int> <int>
#1 1 6
#2 6 1
#3 5 5
#4 3 4
#5 4 2
#6 2 3
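In dplyr 1.0.0 and later, select_if is superseded and the same selection can be written with select() and where(). A minimal sketch, assuming such a dplyr version and using fixed values instead of sample() so the result is reproducible (kept is an illustrative name):

```r
library(dplyr)

# Same shape as the question's data, but deterministic
df <- tibble(col1 = 1:6, col2 = 6:1, col3 = 3, col4 = 4)

# Keep only the columns with more than one distinct value
kept <- df %>%
  select(where(~ n_distinct(.x) > 1))

names(kept)
```

This checks every column in one call, so the helper function does not need to be invoked per column.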
Another way, in base R, is to loop over the columns with sapply to count the unique values in each, extract the names of the columns that have a single unique value, and assign NULL to them:
i1 <- sapply(df, function(x) length(unique(x)))
df[names(which(i1 == 1))] <- NULL
Or with Filter, which keeps the columns whose variance is nonzero (numeric columns only):
Filter(var, df)
You could use this one as well. It drops the columns whose standard deviation is 0.
df[, sapply(df, sd) != 0]
# A tibble: 6 x 2
# col1 col2
# <int> <int>
#1 1 3
#2 5 6
#3 6 1
#4 2 2
#5 3 4
#6 4 5
Or, if you want to use the pipe operator:
df %>%
  select(which(sapply(df, sd) != 0))

Filtering rows based on two conditions at the ID level

I have long data where a given subject has 4 observations. I want to only include a given id that meets the following conditions:
has at least one 3
has at least one of 1,2 OR NA
My data structure:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3), a=c(NA,1,2,3, NA,3,2,0, NA,NA,1,1))
My unsuccessful attempt (I get an empty data frame):
df %>% dplyr::group_by(id) %>% filter(a==3 & a %in% c(1,2,NA))
One option is to group by 'id' and build a condition that returns a single TRUE/FALSE per group. The question asks for groups whose column 'a' contains both a 3 and at least one of 1, 2 or NA. 3 %in% a returns a logical vector of length 1. For the second condition, check each element of 'a' against 1 and 2 (%in%) or for NA (is.na) and wrap the result in any. Then combine the two logical results with &.
library(dplyr)
df %>%
  group_by(id) %>%
  filter((3 %in% a) & any(a %in% c(1, 2) | is.na(a)))
# A tibble: 8 x 2
# Groups: id [2]
# id a
# <dbl> <dbl>
#1 1 NA
#2 1 1
#3 1 2
#4 1 3
#5 2 NA
#6 2 3
#7 2 2
#8 2 0
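The original attempt returns nothing because a == 3 & a %in% c(1, 2, NA) is evaluated row by row, and no single value is both 3 and one of 1, 2 or NA. The per-group logic can be checked in base R. A minimal sketch, where keep, kept_ids and res are illustrative names:

```r
df <- data.frame(id = rep(1:3, each = 4),
                 a = c(NA, 1, 2, 3, NA, 3, 2, 0, NA, NA, 1, 1))

# One TRUE/FALSE per id: has a 3, and has at least one of 1, 2 or NA
keep <- tapply(df$a, df$id, function(a) {
  3 %in% a && any(a %in% c(1, 2) | is.na(a))
})
keep

# Keep every row belonging to an id that passed the check
kept_ids <- as.numeric(names(keep)[keep])
res <- df[df$id %in% kept_ids, ]
res
```

Printing keep makes it easy to see which ids pass before any rows are dropped.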
I have done this the long way to show how the idea works; you can consolidate it.
df %>%
  group_by(id) %>%
  mutate(has_3 = sum(a == 3, na.rm = TRUE) > 0,
         keep_me = has_3 & (sum(is.na(a)) > 0 | sum(a %in% c(1, 2)) > 0)) %>%
  filter(keep_me) %>%
  select(id, a)
# id a
# <dbl> <dbl>
#1 1 NA
#2 1 1
#3 1 2
#4 1 3
#5 2 NA
#6 2 3
#7 2 2
#8 2 0
As I read it, the filter should keep ids 1 and 2, so I would use a combination of all/any:
df %>%
  group_by(id) %>%
  filter(all(3 %in% a) & any(c(1, 2, NA) %in% a))

Filter group only when both levels are present

This feels like it should be more straightforward and I'm just missing something. The goal is to filter the data into a new df where both var values 1 & 2 are represented in the group
Here's some toy data:
grp <- c(rep("A", 3), rep("B", 2), rep("C", 2), rep("D", 1), rep("E",2))
var <- c(1,1,2,1,1,2,1,2,2,2)
id <- c(1:10)
df <- as.data.frame(cbind(id, grp, var))
Only groups A and C should be present in the new data, because they are the only ones where both var values 1 and 2 occur.
I tried dplyr, but '&' obviously won't work since the condition is evaluated per row, and '|' just returns the same df:
df.new <- df %>% group_by(grp) %>% filter(var==1 & var==2) #returns no rows
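Here is another dplyr method. It also works for more than two levels in var. It relies on var being a factor, which as.data.frame() no longer creates by default since R 4.0.0, so convert it first.
library(dplyr)
df2 <- df %>%
  mutate(var = factor(var)) %>% # levels() returns NULL for a character column
  group_by(grp) %>%
  filter(all(levels(var) %in% var)) %>%
  ungroup()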
df2
# # A tibble: 5 x 3
# id grp var
# <fct> <fct> <fct>
# 1 1 A 1
# 2 2 A 1
# 3 3 A 2
# 4 6 C 2
# 5 7 C 1
We can condition on there being at least one instance of var == 1 and at least one instance of var == 2 by doing the following:
library(tidyverse)
df1 <- tibble(grp, var, id) # avoids coercion to character/factor
df1 %>%
  group_by(grp) %>%
  filter(sum(var == 1) > 0 & sum(var == 2) > 0)
# grp var id
# <chr> <dbl> <int>
#1 A 1 1
#2 A 1 2
#3 A 2 3
#4 C 2 6
#5 C 1 7
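When the required values are known in advance, the condition can also be written directly with %in%. A minimal sketch on the toy data, built with tibble() so that var stays numeric (df1 and res are illustrative names):

```r
library(dplyr)

grp <- c(rep("A", 3), rep("B", 2), rep("C", 2), "D", rep("E", 2))
var <- c(1, 1, 2, 1, 1, 2, 1, 2, 2, 2)
df1 <- tibble(id = 1:10, grp, var)

# Keep a group only if both 1 and 2 occur somewhere in its var values
res <- df1 %>%
  group_by(grp) %>%
  filter(all(c(1, 2) %in% var)) %>%
  ungroup()

unique(res$grp)
```

Adding further required values only means extending the vector c(1, 2).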

Filter rows based on multiple conditions using dplyr

df <- data.frame(loc.id = rep(1:2,each = 10), threshold = rep(1:10,times = 2))
For each loc.id, I want to extract the first row where threshold >= 2 and the first row where threshold >= 4. I did this:
df %>% group_by(loc.id) %>% dplyr::filter(row_number() == which.max(threshold >= 2),row_number() == which.max(threshold >= 4))
I expected a dataframe like this:
loc.id threshold
1 2
1 4
2 2
2 4
But it returns me an empty dataframe
Based on the conditions, we can slice the rows by concatenating the two which.max indices and taking the unique values (if a group only has values of 4 or more, both conditions give the same index):
df %>%
  group_by(loc.id) %>%
  filter(any(threshold >= 2)) %>% # additional check
  #slice(unique(c(which.max(threshold > 2), which.max(threshold > 4))))
  # based on the expected output
  slice(unique(c(which.max(threshold >= 2), which.max(threshold >= 4))))
# A tibble: 4 x 2
# Groups: loc.id [2]
# loc.id threshold
# <int> <int>
#1 1 2
#2 1 4
#3 2 2
#4 2 4
Note that some groups may have no threshold value greater than or equal to 2; the filter(any(threshold >= 2)) step drops those groups before slicing.
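The reason such a check matters is that which.max() returns 1 when no element is TRUE, so a group without a match would still contribute its first row. A minimal sketch, where no_match is a hypothetical group of threshold values:

```r
no_match <- c(1, 1, 1)      # hypothetical group with no value >= 2

which.max(no_match >= 2)    # position 1, although nothing matches
any(no_match >= 2)          # FALSE, so the any() check flags this group
which(no_match >= 2)[1]     # NA: an alternative that makes the miss explicit
```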
If this isn't what you want, assign the df below a name and use it to filter your dataset.
df %>%
  distinct() %>%
  filter(threshold == 2 | threshold == 4)
#> loc.id threshold
#> 1 1 2
#> 2 1 4
#> 3 2 2
#> 4 2 4

dplyr summarise logical condition

I have the following data frame
df <- data.frame(Gender = c(rep(c("M","F"),each=4)),
DiffA=c(1,1,-1,-1,1,1,1,-1),
DiffB=c(1,-1,1,-1,1,1,1,-1))
I would like to create 2 new variables which summarize, for each gender, i) the number of rows for which DiffA and DiffB are both positive and ii) the number of rows for which DiffA and DiffB are both negative, in order to obtain:
df2 <- data.frame(Gender = c("M","F"),
Diff_Pos=c(1,3),
Diff_Neg=c(1,1))
I have failed to combine the summary function from dplyr n() which returns the count of rows and the required logical statement. Thanks in advance
I would consider doing
library(dplyr)
library(tidyr)
df %>% filter(DiffA == DiffB) %>% count(Gender, DiffA) %>% spread(DiffA, n)
# Gender -1 1
# (fctr) (int) (int)
# 1 F 1 3
# 2 M 1 1
The analogous data.table code is
library(data.table)
setDT(df) # convert df to a data.table by reference
dcast(df[DiffA == DiffB, .N, by=.(Gender, DiffA)], Gender ~ DiffA, value.var = "N")
# Gender -1 1
# 1: F 1 3
# 2: M 1 1
If your real data goes beyond -1 and 1, wrap the relevant columns in sign().
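A minimal base R sketch of the sign() idea, using made-up values that go beyond -1 and 1 (the data and the names same_sign and tab are hypothetical, chosen only for illustration):

```r
# Hypothetical data: same signs as the question's example, different magnitudes
df <- data.frame(Gender = rep(c("M", "F"), each = 4),
                 DiffA = c(2.5, 1, -3, -1, 4, 1, 1, -7),
                 DiffB = c(1, -1, 1, -1, 1, 1, 1, -1))

# Keep rows where both differences point in the same direction
same_sign <- subset(df, sign(DiffA) == sign(DiffB))

# Count rows per gender and direction (-1 = both negative, 1 = both positive)
tab <- table(same_sign$Gender, sign(same_sign$DiffA))
tab
```

Rows where both differences are exactly 0 would also pass the comparison and appear in a separate 0 column, so filter them out first if that is not wanted.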
Here is a base R option
with(subset(df, DiffA==DiffB), table(Gender, DiffA))
# DiffA
#Gender -1 1
# F 1 3
# M 1 1
This should work:
df %>%
  dplyr::mutate(
    Diff_Pos = DiffA > 0 & DiffB > 0,
    Diff_Neg = DiffA < 0 & DiffB < 0) %>%
  dplyr::group_by(Gender) %>%
  dplyr::summarise(
    Diff_Pos = sum(Diff_Pos),
    Diff_Neg = sum(Diff_Neg))

Resources