Group when values in two columns are identical and calculate the mean - r

I have a dataset like this:
df1 <- data.frame(
var1 = c(1, 1, 1, 2),
var2 = c(1, 2, 2, 1),
value = c(1, 2, 3, 4))
I want to group the rows by var1 and var2 and calculate the mean of value. Rows should be grouped together only when they share the same values in both var1 and var2 (so it is not simply grouping by the unique values of var1 and of var2 separately).
The expected output is:
df2 <- data.frame(
var1 = c(1, 1, 2),
var2 = c(1, 2, 1),
value = c(1, 2.5, 4))
How can I do this?

Using aggregate().
aggregate(value ~ var1 + var2, df1, mean)
# var1 var2 value
# 1 1 1 1.0
# 2 2 1 4.0
# 3 1 2 2.5

You may try
library(dplyr)
df1 %>%
group_by(var1, var2) %>%
summarise(value = mean(value))
var1 var2 value
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 2.5
3 2 1 4
group_by(var1, var2) groups the rows by each unique combination of the two variables.
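As a small variation on the dplyr answer above, here is a sketch that also passes .groups = "drop" to summarise() so the result comes back ungrouped. The name res is only for illustration, and this assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df1 <- data.frame(
  var1  = c(1, 1, 1, 2),
  var2  = c(1, 2, 2, 1),
  value = c(1, 2, 3, 4)
)

# One row per (var1, var2) combination, with the mean of value.
# .groups = "drop" removes the remaining grouping from the result.
res <- df1 %>%
  group_by(var1, var2) %>%
  summarise(value = mean(value), .groups = "drop")

res
```

Without .groups = "drop", the result stays grouped by var1, which can affect later mutate() or summarise() calls.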

Related

Why does dplyr's coalesce(.) and fill(.) not work and still leave missing values?

I have a simple test dataset that has many repeating rows for participants. I want one row per participant that doesn't have NAs, unless the participant has NAs for the entire column. I tried grouping by participant name and then using coalesce(.) and fill(.), but it still leaves missing values. Here's my test dataset:
library(dplyr)
library(tibble)
test_dataset <- tibble(name = rep(c("Justin", "Corey", "Sibley"), 4),
var1 = c(rep(c(NA), 10), 2, 3),
var2 = c(rep(c(NA), 9), 2, 4, 6),
var3 = c(10, 15, 7, rep(c(NA), 9)),
outcome = c(3, 9, 23, rep(c(NA), 9)),
tenure = rep(c(10, 15, 20), 4))
And here's what I get when I use coalesce(.) or fill(., .direction = "downup"), which both produce the same result.
library(dplyr)
library(tibble)
test_dataset_coalesced <- test_dataset %>%
group_by(name) %>%
coalesce(.) %>%
slice_head(n=1) %>%
ungroup()
test_dataset_filled <- test_dataset %>%
group_by(name) %>%
fill(., .direction="downup") %>%
slice_head(n=1) %>%
ungroup()
And here's what I want--note, there is one NA because that participant only has NA for that column:
library(tibble)
correct <- tibble(name = c("Justin", "Corey", "Sibley"),
var1 = c(NA, 2, 3),
var2 = c(2, 4, 6),
var3 = c(10, 15, 7),
outcome = c(3, 9, 23),
tenure = c(10, 15, 20))
You can group_by the name column, then fill the NA (you need to fill every column using everything()) with the non-NA values within the group, then only keep the distinct rows.
library(tidyverse)
test_dataset %>%
group_by(name) %>%
fill(everything(), .direction = "downup") %>%
distinct()
# A tibble: 3 × 6
# Groups: name [3]
name var1 var2 var3 outcome tenure
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Justin NA 2 10 3 10
2 Corey 2 4 15 9 15
3 Sibley 3 6 7 23 20
Try this
cleaned<- test_dataset |>
dplyr::group_by(name) |>
tidyr::fill(everything(),.direction = "downup") |>
unique()
# Drop rows where every column other than name is NA
cleaned[rowSums(is.na(cleaned[,-1]))<ncol(cleaned[,-1]),]
name var1 var2 var3 outcome tenure
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Justin NA 2 10 3 10
2 Corey 2 4 15 9 15
3 Sibley 3 6 7 23 20
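A possible alternative to fill() followed by distinct() is to collapse each group directly by taking the first non-missing value of every column. This is only a sketch, and res is a name chosen for illustration. Indexing an empty vector with [1] returns NA, so a column that is entirely missing for a participant stays NA.

```r
library(dplyr)

test_dataset <- tibble(
  name    = rep(c("Justin", "Corey", "Sibley"), 4),
  var1    = c(rep(NA, 10), 2, 3),
  var2    = c(rep(NA, 9), 2, 4, 6),
  var3    = c(10, 15, 7, rep(NA, 9)),
  outcome = c(3, 9, 23, rep(NA, 9)),
  tenure  = rep(c(10, 15, 20), 4)
)

# For each participant, keep the first non-NA value in every column.
# If a column has no non-NA value in the group, the result is NA.
res <- test_dataset %>%
  group_by(name) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))

res
```

Note that summarise() returns the groups sorted by name, so the row order differs from the fill() approach. It also always yields exactly one row per participant, even when a column holds several different non-NA values within a group.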

R dplyr, distinct, unique combination of variables, with maximum value of third

I'm close but don't have the syntax correct. I'm trying to select all columns of a data table based on selection of unique combinations of two variables (columns) based on the maximum value of a third. MWE of progress thus far. Thx. J
library(dplyr)
dt1 <- tibble (var1 = c("num1", "num2", "num3", "num4", "num5"),
var2 = rep("A", 5),
var3 = c(rep("B", 2), rep("C", 3)),
var4 = c(5, 10, 3, 7, 19))
dt1 %>% distinct(var2, var3, max(var4), .keep_all = TRUE)
# A tibble: 2 x 5
var1 var2 var3 var4 `max(var4)`
<chr> <chr> <chr> <dbl> <dbl>
1 num1 A B 5 19
2 num3 A C 3 19
which is close, but I want the row where the value of var4 is the max value, within the unique combination of var2 and var3. I'm attempting to get:
# A tibble: 2 x 5
var1 var2 var3 var4 `max(var4)`
<chr> <chr> <chr> <dbl> <dbl>
1 num2 A B 5 10
2 num5 A C 3 19
Do I need a formula for the third argument of the distinct function?
We can add an arrange statement before the distinct
library(dplyr)
dt1 %>%
arrange(var2, var3, desc(var4)) %>%
distinct(var2, var3, .keep_all = TRUE)
-output
# A tibble: 2 x 4
var1 var2 var3 var4
<chr> <chr> <chr> <dbl>
1 num2 A B 10
2 num5 A C 19
Or another option is slice_max
dt1 %>%
group_by(var2, var3) %>%
mutate(var4new = first(var4)) %>%
slice_max(order_by= var4, n = 1) %>%
ungroup()
-output
# A tibble: 2 x 5
var1 var2 var3 var4 var4new
<chr> <chr> <chr> <dbl> <dbl>
1 num2 A B 10 5
2 num5 A C 19 3
slice() will do what you want, although it drops the original var4 values of 5 and 3 (I'm not sure whether that matters).
tibble (var1 = c("num1", "num2", "num3", "num4", "num5"),
var2 = rep("A", 5),
var3 = c(rep("B", 2), rep("C", 3)),
var4 = c(5, 10, 3, 7, 19)) %>%
group_by(var2, var3) %>%
slice(which.max(var4)) %>%
ungroup()
# A tibble: 2 x 4
var1 var2 var3 var4
<chr> <chr> <chr> <dbl>
1 num2 A B 10
2 num5 A C 19
Does this work:
library(dplyr)
dt1 %>% group_by(var2, var3) %>% filter(dense_rank(desc(var4)) == 1)
# A tibble: 2 x 4
# Groups: var2, var3 [2]
var1 var2 var3 var4
<chr> <chr> <chr> <dbl>
1 num2 A B 10
2 num5 A C 19
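One difference between these answers shows up when a group has tied maximum values: filter(dense_rank(desc(var4)) == 1) and slice_max() with its defaults keep every tied row, while slice(which.max(var4)) keeps only the first. A minimal sketch, assuming exactly one row per group is wanted, passes with_ties = FALSE to slice_max() (res is an illustrative name):

```r
library(dplyr)

dt1 <- tibble(
  var1 = c("num1", "num2", "num3", "num4", "num5"),
  var2 = rep("A", 5),
  var3 = c(rep("B", 2), rep("C", 3)),
  var4 = c(5, 10, 3, 7, 19)
)

# Keep the single row with the largest var4 in each (var2, var3) group.
# with_ties = FALSE guarantees one row per group even if the maximum repeats.
res <- dt1 %>%
  group_by(var2, var3) %>%
  slice_max(var4, n = 1, with_ties = FALSE) %>%
  ungroup()

res
```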

Merge data frames and divide rows by group

I would like to divide the values from df1 by the values from df2. In this reproducible example, I am able to sum these values. What about the division? Thanks in advance!
df1 <- data.frame(country = c("a", "b", "c"), year1 = c(1, 2, 3), year2 = c(1, 2, 3))
df2 <- data.frame(country = c("a", "b", "d"), year1 = c(1, 2, NA), year2 = c(1, 2, 3))
df3 <- bind_rows(df1, df2) %>%
mutate_if(is.numeric, tidyr::replace_na, 0) %>%
group_by(country) %>%
summarise_all(., sum, na.rm = TRUE) %>%
na_if(., 0)
Expected result is:
# A tibble: 4 x 3
country year1 year2
<chr> <dbl> <dbl>
1 a 1 1
2 b 1 1
3 c NA NA
4 d NA NA
Because some groups have two rows and others only one, use an if/else condition within summarise/across: divide the first element by the last when the group has two elements, and return NA otherwise.
library(dplyr) # version 1.0.4
library(tidyr)
bind_rows(df1, df2) %>%
mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
group_by(country) %>%
summarise(across(everything(), ~ if(n() == 2) first(.)/last(.)
else NA_real_))
-output
# A tibble: 4 x 3
# country year1 year2
#* <chr> <dbl> <dbl>
#1 a 1 1
#2 b 1 1
#3 c NA NA
#4 d NA NA
Here is a base R option using merge + split.default
df <- merge(df1, df2, by = "country", all = TRUE)
cbind(
df[1],
list2DF(lapply(
split.default(df[-1], gsub("\\.(x|y)", "", names(df)[-1])),
function(v) do.call("/", v)
))
)
which gives
country year1 year2
1 a 1 1
2 b 1 1
3 c NA NA
4 d NA NA
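Another way to think about the division is as a join on country followed by an explicit column-wise division. The following is a sketch under that assumption; the suffixes ".1" and ".2" and the name res are chosen here for illustration. Countries missing from either data frame get NA because dividing by or into NA yields NA.

```r
library(dplyr)

df1 <- data.frame(country = c("a", "b", "c"), year1 = c(1, 2, 3), year2 = c(1, 2, 3))
df2 <- data.frame(country = c("a", "b", "d"), year1 = c(1, 2, NA), year2 = c(1, 2, 3))

# Keep every country from both data frames, then divide df1 values by df2 values.
res <- full_join(df1, df2, by = "country", suffix = c(".1", ".2")) %>%
  transmute(
    country,
    year1 = year1.1 / year1.2,
    year2 = year2.1 / year2.2
  ) %>%
  arrange(country)

res
```

This spells out each year column by hand, so it suits a small, fixed set of columns; the across() and split.default() answers scale better when there are many.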

conditionally summarize several variables by group

I want to conditionally summarize several variables by group. The following code does that, but I'm not sure how to do this without specifying each variable and the conditions in the summarize step.
library(tidyverse)
dat <- data.frame(group = c("A", "A", "A", "B", "B", "B"),
indicator = c(1, 2, 3, 1, 2, 3),
var1 = c(1, 0, 1, 2, 1, 2),
var2 = c(1, 0, 1, 1, 2, 1))
# dat
# group indicator var1 var2
#1 A 1 1 1
#2 A 2 0 0
#3 A 3 1 1
#4 B 1 2 1
#5 B 2 1 2
#6 B 3 2 1
dat %>%
group_by(group) %>%
summarise(var1 = sum(var1[indicator==1 | indicator==2]),
var2 = sum(var2[indicator==1 | indicator==2]))
# A tibble: 2 x 3
# group var1 var2
#* <chr> <dbl> <dbl>
#1 A 1 1
#2 B 3 3
Use across():
library(dplyr)
dat %>%
group_by(group) %>%
summarise(across(starts_with('var'), ~sum(.[indicator %in% 1:2])))
# group var1 var2
#* <chr> <dbl> <dbl>
#1 A 1 1
#2 B 3 3
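When the same condition applies to every variable, a possible alternative is to filter the rows first and then summarise without any subsetting. This is a sketch, with res as an illustrative name. One caveat: a group with no rows matching the condition disappears from the result, whereas the across() version above would return 0 for it.

```r
library(dplyr)

dat <- data.frame(
  group     = c("A", "A", "A", "B", "B", "B"),
  indicator = c(1, 2, 3, 1, 2, 3),
  var1      = c(1, 0, 1, 2, 1, 2),
  var2      = c(1, 0, 1, 1, 2, 1)
)

# Keep only the rows that satisfy the condition, then sum every var* column per group.
res <- dat %>%
  filter(indicator %in% 1:2) %>%
  group_by(group) %>%
  summarise(across(starts_with("var"), sum))

res
```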

Calculating the delta between multiple variables grouped by user ids

How might I calculate the delta between multiple variables grouped by user ids in a "long" data frame?
Data format:
d1 <- data.frame(
id = rep(c(1, 2, 3, 4, 5), each = 2),
purchased = c(rep(c(T, F), 3), F, T, T, F),
product = rep(c("A", "B"), 5),
grade = c(1, 2, 1, 2, 2, 3, 7, 5, 1, 2),
rate = c(10, 12, 10, 12, 12, 14, 22, 18, 10, 12),
fee = rep(c(1, 2), 5))
This is my roundabout solution:
dA <- d1 %>%
filter(product == "A")
dB <- d1 %>%
filter(product == "B")
d2 <- inner_join(dA, dB, by = "id", suffix = c(".A", ".B"))
d3 <- d2 %>%
mutate(
purchased = if_else(purchased.A == T, "A", "B"),
dGrade = grade.B - grade.A,
dRate = rate.B - rate.A,
dFee = fee.B - fee.A) %>%
select(id, purchased:dFee)
All of this just seems terribly inefficient and complex. Is tidyr::spread or another dplyr/tidyr function appropriate here? (I couldn't get anything else to work)...
We can do this with gather/spread. First, reshape the measure columns (grade, rate, fee) from 'wide' to 'long' using gather. Then, grouped by 'id' and 'Var', take the 'product' for which the logical column 'purchased' is TRUE and compute the difference in 'Val' between product 'B' and product 'A'. Finally, spread the result from 'long' back to 'wide' format.
library(dplyr)
library(tidyr)
gather(d1, Var, Val, grade:fee) %>%
group_by(id, Var) %>%
summarise(purchased = product[purchased],
Val = Val[product == 'B'] - Val[product == 'A'])%>%
spread(Var, Val)
# id purchased fee grade rate
# <dbl> <fctr> <dbl> <dbl> <dbl>
#1 1 A 1 1 2
#2 2 A 1 1 2
#3 3 A 1 1 2
#4 4 B 1 -2 -4
#5 5 A 1 1 2
The OP's output ('d3') is
d3
# id purchased dGrade dRate dFee
#1 1 A 1 2 1
#2 2 A 1 2 1
#3 3 A 1 2 1
#4 4 B -2 -4 1
#5 5 A 1 2 1
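With a newer dplyr (1.0.0 or later, which provides across()), the same deltas can be sketched without any reshaping by subsetting each column by product inside summarise(). The d_ prefix and the name res are assumptions made for this example, and it relies on each id having exactly one "A" row, one "B" row, and one purchased product.

```r
library(dplyr)

d1 <- data.frame(
  id        = rep(c(1, 2, 3, 4, 5), each = 2),
  purchased = c(rep(c(TRUE, FALSE), 3), FALSE, TRUE, TRUE, FALSE),
  product   = rep(c("A", "B"), 5),
  grade     = c(1, 2, 1, 2, 2, 3, 7, 5, 1, 2),
  rate      = c(10, 12, 10, 12, 12, 14, 22, 18, 10, 12),
  fee       = rep(c(1, 2), 5)
)

# Per id: record which product was purchased, then compute B minus A for each measure.
res <- d1 %>%
  group_by(id) %>%
  summarise(
    purchased = product[purchased],
    across(c(grade, rate, fee),
           ~ .x[product == "B"] - .x[product == "A"],
           .names = "d_{.col}")
  )

res
```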
