dplyr: use both rowwise and df-wise values in a mutate

How do you perform a rowwise operation which uses values from other rows (in dplyr/tidy style)? Let's say I have this df:
library(dplyr)
df <- tibble(value = c(5,6,7,3,4),
group = c(1,2,2,3,3),
group.to.use = c(2,3,3,1,1))
I want to create a new variable, new.value, equal to each row's own value plus the maximum of value over the rows whose group equals this row's group.to.use. So for the first row:
new.value = 5 + max(value[group == 2]) = 5 + 7 = 12
desired output:
# A tibble: 5 x 4
value group group.to.use new.value
<dbl> <dbl> <dbl> <dbl>
1 5. 1. 2. 12.
2 6. 2. 3. 10.
3 7. 2. 3. 11.
4 3. 3. 1. 8.
5 4. 3. 1. 9.
pseudo code:
df %<>%
mutate(new.value = value + max(value[group == <group.to.use for this row>]))

In a rowwise operation piped with %>%, you can refer to the whole data frame as . and to a whole column with the usual syntax, .$colname or .[['col.name']], while a bare column name refers to the current row's value:
df %>%
rowwise() %>%
mutate(new.value = value + max(.$value[.$group == group.to.use])) %>%
ungroup()
# # A tibble: 5 x 4
# value group group.to.use new.value
# <dbl> <dbl> <dbl> <dbl>
# 1 5 1 2 12
# 2 6 2 3 10
# 3 7 2 3 11
# 4 3 3 1 8
# 5 4 3 1 9
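The . pronoun used above is supplied by the magrittr pipe %>%; the native |> pipe does not provide it. A minimal sketch of the same idea that avoids . by capturing the full columns first. The names all_values and all_groups are illustrative only, and |> assumes R 4.1 or later:

```r
library(dplyr)

df <- tibble(value = c(5, 6, 7, 3, 4),
             group = c(1, 2, 2, 3, 3),
             group.to.use = c(2, 3, 3, 1, 1))

# Hypothetical helper vectors: the whole columns, captured outside the pipeline
all_values <- df$value
all_groups <- df$group

res <- df |>
  rowwise() |>
  # inside rowwise(), group.to.use is the current row's single value,
  # while all_values / all_groups still span every row
  mutate(new.value = value + max(all_values[all_groups == group.to.use])) |>
  ungroup()

res$new.value
```

Because the helper vectors are not columns of df, mutate() looks them up in the calling environment, so they keep their full length even in rowwise mode.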
Alternatively, you can precompute the max for each group and then do a left-join:
df.max <- df %>% group_by(group) %>% summarise(max.value = max(value))
df %>%
left_join(df.max, by = c('group.to.use' = 'group')) %>%
mutate(new.value = value + max.value) %>%
select(-max.value)
# # A tibble: 5 x 4
# value group group.to.use new.value
# <dbl> <dbl> <dbl> <dbl>
# 1 5 1 2 12
# 2 6 2 3 10
# 3 7 2 3 11
# 4 3 3 1 8
# 5 4 3 1 9

With base R, we can use ave to compute the max for each group, then use match to pick, for each row, the max of the group named in group.to.use and add it to value.
df$new.value <- with(df, value +
ave(value, group, FUN = max)[match(group.to.use, group)])
df
# A tibble: 5 x 4
# value group group.to.use new.value
# <dbl> <dbl> <dbl> <dbl>
#1 5.00 1.00 2.00 12.0
#2 6.00 2.00 3.00 10.0
#3 7.00 2.00 3.00 11.0
#4 3.00 3.00 1.00 8.00
#5 4.00 3.00 1.00 9.00
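To see why the ave/match combination works, it may help to look at the intermediate vectors. This is only a walk-through of the answer above on the question's data; the names group_max and idx are introduced here for illustration:

```r
value        <- c(5, 6, 7, 3, 4)
group        <- c(1, 2, 2, 3, 3)
group.to.use <- c(2, 3, 3, 1, 1)

# Max of each row's own group, repeated for every row of that group
group_max <- ave(value, group, FUN = max)
group_max
# 5 7 7 4 4

# For each row, the position of the first row belonging to group.to.use
idx <- match(group.to.use, group)
idx
# 2 4 4 1 1

# Looking up group_max at those positions gives the max of the target group
group_max[idx]
# 7 4 4 5 5

new.value <- value + group_max[idx]
new.value
# 12 10 11 8 9
```

match returns only the first matching position, which is enough here because every row of a group carries the same group_max.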

Here is another base R option, which loops over group.to.use with vapply:
df$new.value <- with(df, value + vapply(group.to.use, function(x)
max(value[group == x]), numeric(1)))
df$new.value
#[1] 12 10 11 8 9

Related

Filter by value counts within groups

I want to filter my grouped dataframe based on the number of occurrences of a specific value within a group.
Some exemplary data:
data <- data.frame(ID = sample(c("A","B","C","D"),100,replace = T),
rt = runif(100,0.2,1),
lapse = sample(1:2,100,replace = T))
The “lapse” column is my filter variable in this case.
I want to exclude every “ID” group that has more than 15 rows with “lapse” == 2.
data %>% group_by(ID) %>% count(lapse == 2)
So if, for example, group “A” contains 17 rows with “lapse” == 2, it should be removed entirely from the data frame.
First I made the data reproducible with set.seed and checked the number of values per group. In this case only group D has more than 15 rows with lapse == 2. You can use filter with the per-group sum of lapse == 2 like this:
set.seed(7)
data <- data.frame(ID = sample(c("A","B","C","D"),100,replace = TRUE),
rt = runif(100,0.2,1),
lapse = sample(1:2,100,replace = TRUE))
library(dplyr)
# Check n values per group
data %>%
group_by(ID, lapse) %>%
summarise(n = n())
#> # A tibble: 8 × 3
#> # Groups: ID [4]
#> ID lapse n
#> <chr> <int> <int>
#> 1 A 1 8
#> 2 A 2 7
#> 3 B 1 13
#> 4 B 2 15
#> 5 C 1 18
#> 6 C 2 6
#> 7 D 1 17
#> 8 D 2 16
data %>%
group_by(ID) %>%
filter(sum(lapse == 2) <= 15)
#> # A tibble: 67 × 3
#> # Groups: ID [3]
#> ID rt lapse
#> <chr> <dbl> <int>
#> 1 B 0.517 2
#> 2 C 0.589 1
#> 3 C 0.598 2
#> 4 C 0.715 1
#> 5 B 0.475 2
#> 6 C 0.965 1
#> 7 B 0.234 1
#> 8 B 0.812 2
#> 9 C 0.517 1
#> 10 B 0.700 1
#> # … with 57 more rows
Created on 2023-01-08 with reprex v2.0.2
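Since the random data above can differ between R versions, here is a small deterministic sketch of the same grouped filter. The toy data and the threshold of 2 (instead of 15) are assumptions chosen only to keep the example short:

```r
library(dplyr)

# Toy data: group A has three rows with lapse == 2, group B has one
toy <- tibble(ID    = rep(c("A", "B"), each = 4),
              lapse = c(2, 2, 2, 1,
                        2, 1, 1, 1))

kept <- toy %>%
  group_by(ID) %>%
  # sum(lapse == 2) is evaluated once per group, so the whole group
  # is either kept or dropped
  filter(sum(lapse == 2) <= 2) %>%
  ungroup()

unique(kept$ID)
# "B"
```

The condition inside filter() is a single TRUE or FALSE per group, which dplyr recycles to every row of that group.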

R Regex capture to remove/keep columns with repeats in their column names

This is an example dataframe
means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
I have column names in my dataframe in R in this format
\S+_T\d+|\S+_T\d+
The syntax is something like (Name)_ (T)(Number) | (Name)_ (T)(Number)
Step 1) I want to select columns which contain the same (T)(Number) on both sides of the "|"
I did this with some manual labor :
means_t0 <- means2 %>% select(matches("\\S+_T0\\|\\S+_T0")) %>% rownames_to_column("id_cp_interaction")
means_t1 <- means2 %>% select(matches("\\S+_T1\\|\\S+_T1")) %>% rownames_to_column("id_cp_interaction")
means_t5 <- means2 %>% select(matches("\\S+_T5\\|\\S+_T5")) %>% rownames_to_column("id_cp_interaction")
means3 <- full_join(means_t0, means_t1) %>% full_join(means_t5)
This gives me what I want and it was easy to do because I only had 3 types - T0, T1 and T5. What do I do if I had a huge number?
Step 2) From the output of Step1, I want to do a negation of the last question i.e. select only those columns with Names which are not the same
For example B_T0|B_T0 should be removed but B_T0|Fibro_T0 should be retained
Is there a way to regex-capture the part in front of the pipe (|) and match it against the part after the pipe?
If you have that much information in your column names, I like to transform the data into the long format and then separate the info from the column name into several columns. Then it's easy to filter by these columns:
means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
means2 <- cbind(data.frame(id_cp_interaction = 1:5), means2)
library(tidyr)
library(dplyr)
library(stringr)
res <- means2 %>%
pivot_longer(
cols = -id_cp_interaction,
names_to = "names",
values_to = "values"
) %>%
mutate(
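celltype_1 = str_extract(names, "^[^_]*"),
# digits directly before the literal pipe (the pipe must be escaped)
timepoint_1 = str_extract(names, "[0-9]+(?=\\|)"),
celltype_2 = str_extract(names, "(?<=\\|)(.*?)(?=_)"),
# trailing digits, so multi-digit timepoints such as T10 are captured whole
timepoint_2 = str_extract(names, "[0-9]+$")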
)
head(res, n = 7)
#> # A tibble: 7 × 7
#> id_cp_interaction names values celltype_1 timepoint_1 celltype_2 timepoint_2
#> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 B_T0|B… 1.68 B 0 B 0
#> 2 1 B_T0|B… 19.3 B 0 B 1
#> 3 1 B_T0|F… 10.6 B 0 Fibro 0
#> 4 1 B_T5|E… 12.5 B 5 Endo 5
#> 5 1 Macro_… 2.84 Macro 1 Fibro 1
#> 6 2 B_T0|B… 2.17 B 0 B 0
#> 7 2 B_T0|B… 10.1 B 0 B 1
# only keep interactions of different cell types
res %>%
filter(celltype_1 != celltype_2) %>%
head()
#> # A tibble: 6 × 7
#> id_cp_interaction names values celltype_1 timepoint_1 celltype_2 timepoint_2
#> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 B_T0|F… 10.6 B 0 Fibro 0
#> 2 1 B_T5|E… 12.5 B 5 Endo 5
#> 3 1 Macro_… 2.84 Macro 1 Fibro 1
#> 4 2 B_T0|F… 1.47 B 0 Fibro 0
#> 5 2 B_T5|E… 11.3 B 5 Endo 5
#> 6 2 Macro_… 13.0 Macro 1 Fibro 1
Created on 2022-09-19 by the reprex package (v1.0.0)
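The question also asks whether a regex can capture one side of the pipe and match it against the other. A minimal sketch using backreferences with base R's grepl and perl = TRUE, as an alternative to the long-format approach above. The patterns assume that neither cell-type names nor timepoints contain "_" or "|", and the names same_time, same_name and keep are illustrative:

```r
means2 <- as.data.frame(matrix(runif(n = 25, min = 1, max = 20), nrow = 5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0",
                   "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
nms <- names(means2)

# Step 1: the same timepoint on both sides; \\1 repeats the captured (T<digits>)
same_time <- grepl("^[^_|]+_(T\\d+)\\|[^_|]+_\\1$", nms, perl = TRUE)

# Step 2: the same name on both sides; \\1 repeats the captured name
same_name <- grepl("^([^_|]+)_T\\d+\\|\\1_T\\d+$", nms, perl = TRUE)

# Keep columns with matching timepoints but different names
keep <- nms[same_time & !same_name]
keep
# "B_T0|Fibro_T0" "B_T5|Endo_T5" "Macro_T1|Fibro_T1"

means3 <- means2[, keep, drop = FALSE]
```

This scales to any number of timepoints because the timepoint is never written out in the pattern.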

Average across rows, but leaving out own group [duplicate]

Using dplyr (preferably), I am trying to calculate the group mean for each observation while excluding that observation from the group.
It seems that this should be doable with a combination of rowwise() and group_by(), but both functions cannot be used simultaneously.
Given this data frame:
library(dplyr)
df <- tibble(grouping = rep(LETTERS[1:5], 3),
value = 1:15) %>%
arrange(grouping)
df
#> Source: local data frame [15 x 2]
#>
#> grouping value
#> (chr) (int)
#> 1 A 1
#> 2 A 6
#> 3 A 11
#> 4 B 2
#> 5 B 7
#> 6 B 12
#> 7 C 3
#> 8 C 8
#> 9 C 13
#> 10 D 4
#> 11 D 9
#> 12 D 14
#> 13 E 5
#> 14 E 10
#> 15 E 15
I'd like to get the group mean for each observation with that observation excluded from the group, resulting in:
#> grouping value special_mean
#> (chr) (int)
#> 1 A 1 8.5 # i.e. (6 + 11) / 2
#> 2 A 6 6 # i.e. (1 + 11) / 2
#> 3 A 11 3.5 # i.e. (1 + 6) / 2
#> 4 B 2 9.5
#> 5 B 7 7
#> 6 B 12 4.5
#> 7 C 3 ...
I've attempted nesting rowwise() inside a function called by do(), but haven't gotten it to work, along these lines:
special_avg <- function(chunk) {
chunk %>%
rowwise() #%>%
# filter or something...?
}
df %>%
group_by(grouping) %>%
do(special_avg(.))
There is no need to define a custom function: sum all elements of the group, subtract the current value, and divide by the number of elements in the group minus 1.
df %>% group_by(grouping) %>%
mutate(special_mean = (sum(value) - value)/(n()-1))
# grouping value special_mean
# (chr) (int) (dbl)
#1 A 1 8.5
#2 A 6 6.0
#3 A 11 3.5
#4 B 2 9.5
#5 B 7 7.0
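One edge case of this shortcut: a group with a single row divides 0 by 0 and yields NaN. A hedged sketch that returns NA for such groups instead; the toy data and the choice of NA are assumptions, not part of the original answer:

```r
library(dplyr)

# Toy data: group B has only one observation
toy <- tibble(grouping = c("A", "A", "A", "B"),
              value    = c(1, 6, 11, 2))

res <- toy %>%
  group_by(grouping) %>%
  # n() is constant within a group, so a plain if/else is evaluated once per group
  mutate(special_mean = if (n() > 1) (sum(value) - value) / (n() - 1) else NA_real_) %>%
  ungroup()

res$special_mean
# 8.5 6.0 3.5 NA
```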
I came across this old question by chance and wondered whether there is a general solution that also works for aggregation functions other than mean(), e.g., max() as requested by jlesuffleur, or median().
The idea is to leave the current row out of the aggregate by looping over the rows within the current group:
library(dplyr)
df %>%
group_by(grouping) %>%
mutate(special_mean = sapply(1:n(), function(i) mean(value[-i])))
grouping value special_mean
<chr> <int> <dbl>
1 A 1 8.5
2 A 6 6
3 A 11 3.5
4 B 2 9.5
5 B 7 7
...
This works for max() as well (the \(i) lambda shorthand requires R 4.1 or later):
df %>%
group_by(grouping) %>%
mutate(special_max = sapply(1:n(), \(i) max(value[-i])))
grouping value special_max
<chr> <int> <int>
1 A 1 11
2 A 6 11
3 A 11 6
4 B 2 12
5 B 7 12
6 B 12 7
...
For the sake of completeness, here is also a data.table solution:
library(data.table)
setDT(df)[, special_mean := sapply(1:.N, function(i) mean(value[-i])), by = grouping][]
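The same looping idea can be wrapped in a small helper so that any aggregation function can be plugged in, for example median(). This is a sketch only; leave_one_out is a hypothetical name introduced here, not a dplyr function:

```r
library(dplyr)

# Hypothetical helper: apply f to x with the i-th element left out, for every i
leave_one_out <- function(x, f) {
  vapply(seq_along(x), function(i) f(x[-i]), numeric(1))
}

df <- tibble(grouping = rep(LETTERS[1:5], 3),
             value = 1:15) %>%
  arrange(grouping)

res <- df %>%
  group_by(grouping) %>%
  mutate(special_mean   = leave_one_out(value, mean),
         special_median = leave_one_out(value, median)) %>%
  ungroup()

head(res, 3)
```

With three rows per group, the leave-one-out mean and median coincide, since each is computed over the two remaining values. For groups with a single row, x[-i] is empty, so mean() returns NaN and median() returns NA.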

Resources