Using dplyr::select semantics within a dplyr::mutate function [duplicate] - r

This question already has answers here:
Performing dplyr mutate on subset of columns
(5 answers)
Closed 5 years ago.
What I'm trying to do here is bring in dplyr::select() semantics into a function supplied to dplyr::mutate(). Below is a minimal example.
dat <- tibble(class = rep(c("A", "B"), each = 10),
x = sample(100, 20),
y = sample(100, 20),
z = sample(100, 20))
.reorder_rows <- function(...) {
x <- list(...)
y <- as.matrix(do.call("cbind", x))
h <- hclust(dist(y))
return(h$order)
}
dat %>%
group_by(class) %>%
mutate(h_order = .reorder_rows(x, y, z))
## class x y z h_order
## <chr> <int> <int> <int> <int>
## 1 A 85 17 5 1
## 2 A 67 24 35 5
## ...
## 18 B 76 7 94 9
## 19 B 65 39 85 8
## 20 B 49 11 100 10
##
## Note: function applied across each group, A and B
What I would like to do is something along the lines of:
dat %>%
group_by(class) %>%
mutate(h_order = .reorder_rows(-class))
The reason this is important is that when dat has many more variables, I need to be able to exclude the grouping/specific variables from the function's calculation.
I'm not sure how this would be implemented, but somehow using select semantics within the .reorder_rows function might be one way to tackle this problem.

For this particular approach, you should probably nest and unnest (using tidyr) by class rather than grouping by it:
library(tidyr)
library(purrr)
dat %>%
nest(data = -class) %>%
mutate(h_order = map(data, .reorder_rows)) %>%
unnest(cols = c(data, h_order))
Incidentally, notice that while this works with your function you could also write a shorter version that takes the data frame directly:
.reorder_rows <- function(x) {
h <- hclust(dist(as.matrix(x)))
return(h$order)
}
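If the goal is specifically to keep the grouped mutate() form, a rough sketch using pick() might look like the following. This assumes dplyr 1.1.0 or later (where pick() was introduced); the helper name reorder_rows_df and the seed are made up for illustration. In a grouped data frame pick() never returns the grouping columns, so class does not need to be excluded explicitly.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
dat <- tibble(class = rep(c("A", "B"), each = 10),
              x = sample(100, 20),
              y = sample(100, 20),
              z = sample(100, 20))

# Hypothetical helper: takes a data frame of numeric columns
# and returns the row order from hierarchical clustering.
reorder_rows_df <- function(df) {
  hclust(dist(as.matrix(df)))$order
}

res <- dat %>%
  group_by(class) %>%
  # pick() uses select semantics; grouping columns are left out automatically
  mutate(h_order = reorder_rows_df(pick(everything()))) %>%
  ungroup()
```

Other columns can be dropped with the usual select helpers, for example pick(-z) or pick(starts_with("x")).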


Dplyr to calculate mean, SD, and graph multiple variables

I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate mean/SE per Time for each var1, var2...var n , and I want to do this programmatically for all variables, rather than 1 at a time which would involve a lot of copy-pasting.
Section 8.2.3 here https://tidyeval.tidyverse.org/dplyr.html is close to what I want but my below code:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
print(.data)
summary_vars <- enquos(...)
print(summary_vars)
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
print(summary_vars)
.data %>%
group_by(time)
summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
And the original cause is "Must group by variables found in .data." and it finds a column that isn't in the dummy "x" that I generated for the purposes of testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
Try this :
library(dplyr)
grouped_mean3 <- function(.data, ...) {
vars <- c(...)
.data %>%
group_by(time) %>%
summarise(across(all_of(vars), mean, .names = "{.col}mean"))
}
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
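The tidy-eval approach from the question can probably be made to work as well. The "invalid argument type" error appears to come from the missing %>% after group_by(time), which leaves summarise() without a data argument, so !!! is evaluated as ordinary negation. The sketch below is one possible repair, not the only one: the function name grouped_mean_tidy is made up, columns are passed bare (var1 rather than var("var1")), and the output names are built with rlang::as_label().

```r
library(dplyr)

# Test data from the question
x <- data.frame(time = c(1, 4), var1 = c(2, 5), var2 = c(3, 6))

# Hypothetical name; columns to summarise are passed unquoted via ...
grouped_mean_tidy <- function(df, ...) {
  summary_vars <- rlang::enquos(...)
  # Build one mean() call per captured column
  summary_exprs <- lapply(summary_vars, function(v) {
    rlang::expr(mean(!!v, na.rm = TRUE))
  })
  # Name the outputs var1mean, var2mean, ...
  names(summary_exprs) <- paste0(
    unname(vapply(summary_vars, rlang::as_label, character(1))),
    "mean"
  )
  df %>%
    group_by(time) %>%          # note the pipe after group_by()
    summarise(!!!summary_exprs) # unquote-splice the named list
}

res <- grouped_mean_tidy(x, var1, var2)
```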
Perhaps this is what you are looking for?
x %>%
group_by(time) %>%
summarise_at(vars(starts_with('var')), ~mean(., na.rm = TRUE)) %>%
rename_at(vars(starts_with('var')), ~paste0(., "mean")) %>%
merge(x)
With your data (from your question) following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6

How to mean columns by values of another column? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
I would like to get the mean of a variable according to the group it belongs to.
Here is a reproducible example.
gender <- c("M","F","M","F")
vec1 <- c(1:4)
vec2 <- c(10:13)
df <- data.frame(vec1,vec2,gender)
variables <- names(df)
variables <- variables[-3]
#Wished result
mean1 <- c(mean(c(1,3)),mean(c(2,4)))
mean2 <- c(mean(c(10,12)),mean(c(11,13)))
gender <- c("M","F")
result <- data.frame(gender,mean1,mean2)
How can I achieve such a result? I would like to use the vector variables, which contains the names of the variables to be summarized, instead of writing out each variable, as my dataset is quite big.
A dplyr solution
library(dplyr)
df %>% group_by(gender) %>% summarise(across(all_of(variables), list(mean = mean), .names = "{.fn}_{.col}"))
Output
# A tibble: 2 x 3
gender mean_vec1 mean_vec2
<chr> <dbl> <dbl>
1 F 3 12
2 M 2 11
Use library dplyr
library(dplyr)
gender <- c("M","F","M","F")
df <- data.frame(1:4,gender)
df %>%
group_by(gender) %>%
summarise(mean = X1.4 %>% mean())
Using aggregate.
## formula notation
aggregate(cbind(vec1, vec2) ~ gender, df, FUN=mean)
# gender vec1 vec2
# 1 F 3 12
# 2 M 2 11
## list notation
with(df, aggregate(list(mean=cbind(vec1, vec2)), list(gender=gender), mean))
# gender mean.vec1 mean.vec2
# 1 F 3 12
# 2 M 2 11
If you get an error in the formula notation, one possible cause is that you have created another object named mean that masks the function. In that case, rm(mean) removes it.
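Since the question asks to drive the summary from the variables vector rather than typing each column, a base R sketch along those lines could be the following. It rebuilds the example data from the question; the mean_ prefix on the output names is just one naming choice.

```r
# Example data from the question
gender <- c("M", "F", "M", "F")
df <- data.frame(vec1 = 1:4, vec2 = 10:13, gender)

# Names of the columns to summarise (everything except the grouping column)
variables <- setdiff(names(df), "gender")

# Subset the columns by name and aggregate them by gender
result <- aggregate(df[variables], by = list(gender = df$gender), FUN = mean)

# Optional: rename the summarised columns
names(result)[-1] <- paste0("mean_", variables)
```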

melting to long format

df <- data.frame('Dev' = 1:12,
'GWP' = seq(10,120,10),
'2012' = 1:12,
'Inc' = seq(10,120,10),
'GWP2' = c(seq(10,100,10),NA,NA),
'2013'= 1:12,
'Inc2' = c(seq(10,100,10),NA,NA),
'GWP3' = c(seq(10,80,10),NA,NA,NA,NA),
'2014'= 1:12,
'Inc3' = c(seq(10,80,10),NA,NA,NA,NA))
head(df)
result_df <- data.frame('Dev' = rep(1:12,3),
'GWP' = c(seq(10,120,10),
c(seq(10,100,10),NA,NA),
c(seq(10,80,10),NA,NA,NA,NA)),
'YEAR' = c(rep(2012,12),
rep(2013,12),
rep(2014,12)),
'Inc' = c(seq(10,120,10),
c(seq(10,100,10),NA,NA),
c(seq(10,80,10),NA,NA,NA,NA)))
head(result_df)
The above is my data structure.
I'm trying to make df look like result_df. I assume the reshape2 library could do the trick, but I'm having trouble getting the output to come out as expected:
x <- melt(df,id=c("Dev"))
x$value <- ifelse(x$variable == 'X2012',2012,
ifelse(x$variable == 'X2013',2013,
ifelse(x$variable == 'X2014',2014,x$value)))
x$variable <- ifelse(x$variable %in% c('GWP','GWP2','GWP3'),'GWP',
ifelse(x$variable %in% c('Inc','Inc2','Inc3'), 'Inc',
ifelse(x$variable %in% c('X2012','X2013','X2014'),"Year",
x$variable)))
The problem is that the "year" column in my actual data can go for 20-30 years and I want to avoid using multiple ifelse statements to map them up. Is there a way to do this?
The data needs some pre-processing before getting the expected output. Using tidyverse one possible way is
library(tidyverse)
df %>%
gather(key, value, -Dev) %>%
mutate(col = case_when(str_detect(key, "^GWP") ~ "GWP",
str_detect(key, "^X") ~ "Year",
str_detect(key, "^Inc") ~ "Inc"),
value = ifelse(col == "Year", sub("^X", "", key), value)) %>%
select(-key) %>%
group_by(col) %>%
mutate(Dev1 = row_number()) %>%
spread(col, value) %>%
select(-Dev1)
# A tibble: 36 x 4
# Dev GWP Inc Year
# <int> <chr> <chr> <chr>
# 1 1 10 10 2012
# 2 1 10 10 2013
# 3 1 10 10 2014
# 4 2 20 20 2012
# 5 2 20 20 2013
# 6 2 20 20 2014
# 7 3 30 30 2012
# 8 3 30 30 2013
# 9 3 30 30 2014
#10 4 40 40 2012
# … with 26 more rows
I found that this works for the first part:
apply(matrix(c(2012:2014)), 1, function(y) x$value[x$variable == paste("X", y, sep = "")] <<- y )
This creates a one-column matrix to iterate over with apply, together with a function that replaces the values selected by the logical mask.
Note the use of <<-: it assigns the values to the x defined one level above the function passed to apply.
Note also that apply still returns the values used in the replacement; the change to x happens as a side effect.
For the second part:
x$variable[x$variable %in% c('GWP', 'GWP2', 'GWP3')] <- "GWP"
x$variable[x$variable %in% c('Inc', 'Inc2', 'Inc3')] <- "Inc"
Since the variable column is type factor and Year is not a level:
x <- transform(x, variable = as.character(variable))
x$variable[x$variable %in% c('X2012', 'X2013', 'X2014')] <- "Year"
x <- transform(x, variable = as.factor(variable))
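A different way to avoid the chain of ifelse calls is to skip melting altogether and stack the columns block by block. The sketch below is base R and relies on one assumption taken from the example layout: every year column (X2012, X2013, ...) sits directly between its GWP column and its Inc column. If the real data is ordered differently, the position arithmetic would need to change.

```r
# Example data from the question (numeric names become X2012, X2013, X2014)
df <- data.frame('Dev' = 1:12,
                 'GWP' = seq(10, 120, 10),
                 '2012' = 1:12,
                 'Inc' = seq(10, 120, 10),
                 'GWP2' = c(seq(10, 100, 10), NA, NA),
                 '2013' = 1:12,
                 'Inc2' = c(seq(10, 100, 10), NA, NA),
                 'GWP3' = c(seq(10, 80, 10), NA, NA, NA, NA),
                 '2014' = 1:12,
                 'Inc3' = c(seq(10, 80, 10), NA, NA, NA, NA))

# Find every year column, however many there are
year_cols <- grep("^X[0-9]{4}$", names(df), value = TRUE)

# For each year, take the column before it (GWP) and after it (Inc)
blocks <- lapply(year_cols, function(yc) {
  pos <- match(yc, names(df))
  data.frame(Dev  = df$Dev,
             GWP  = df[[pos - 1]],
             YEAR = as.numeric(sub("^X", "", yc)),
             Inc  = df[[pos + 1]])
})

# Stack the per-year blocks into one long data frame
long_df <- do.call(rbind, blocks)
```

Because the year columns are found by pattern, the same code covers 20 to 30 years without any per-year mapping.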

How to quickly create multiple summary tables with group_by() / summarise()?

I have a data frame with N vars, M categorical and 2 numeric. I would like to create M data frames, one for each categorical variable.
Eg.,
data %>%
group_by(var1) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
data %>%
group_by(varM) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
etc...
Is there a way to iterate through the categorical variables and generate each of the summary tables? That is, without needing to repeat the above chunks M times.
Alternatively, these summary tables don't have to be individual objects, as long as I can easily reference / pull the summaries for each of the M variables.
Here is a solution (I hope). Creates a list of data frames with the formula you have:
library(tidyverse)
# Create sample data frame
data <- data.frame(var1 = sample(1:2, 5, replace = T),
var2 = sample(1:2, 5, replace = T),
var3 = sample(1:2, 5, replace = T),
varM = sample(1:2, 5, replace = T),
var5 = rnorm(5, 3, 6),
var6 = rnorm(5, 3, 6))
# Vars to be grouped (var1 until varM in this example)
vars_to_be_used <- names(select(data, var1:varM))
# Function to be used
# x is the name (a string) of the column to group by
group_fun <- function(x, .df = data) {
.df %>%
group_by(across(all_of(x))) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
}
# Loop over vars
results <- map(vars_to_be_used, group_fun)
# Nice list names
names(results) <- vars_to_be_used
print(results)
You didn't supply a sample data.set so I created a small example to show how it works.
data <- tibble::tibble(var1 = rep(letters[1:5], 2),
var2 = rep(LETTERS[11:15], 2),
var3 = 1:10,
var4 = 11:20)
A combination of tidyverse packages can get you where you need to be.
Steps used: First we gather all the columns we want to group by into a cols column and keep the numeric vars separate. Next we split the data.frame into a list of data.frames, so that every grouping column has its own table with the 2 numeric vars. Now that everything is in a list, we use the map function from the purrr package. Using map, we spread each data.frame again so the column names are as we expect them to be. Finally, again using map, we use group_by_if to group by the character column and summarise the rest. All the outcomes are stored in a list where you can access what you need.
Run the code in pieces to see what every step does.
library(dplyr)
library(purrr)
library(tidyr)
outcomes <- data %>%
gather(cols, value, -c(var3, var4)) %>%
split(.$cols) %>%
map(~ spread(.x, cols, value)) %>%
map(~ group_by_if(.x, is.character) %>%
summarise(sumvar3 = sum(var3),
meanvar4 = mean(var4)))
outcomes
$`var1`
# A tibble: 5 x 3
var1 sumvar3 meanvar4
<chr> <int> <dbl>
1 a 7 13.5
2 b 9 14.5
3 c 11 15.5
4 d 13 16.5
5 e 15 17.5
$var2
# A tibble: 5 x 3
var2 sumvar3 meanvar4
<chr> <int> <dbl>
1 K 7 13.5
2 L 9 14.5
3 M 11 15.5
4 N 13 16.5
5 O 15 17.5
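Since the question allows the summaries to live in a single object, another option is one long table keyed by the grouping variable's name and level. This is only a sketch on the small example data above; the column names variable and level are made up, and it assumes tidyr 1.0 or later for pivot_longer() and that the grouping columns share a type (here, character).

```r
library(dplyr)
library(tidyr)

data <- tibble(var1 = rep(letters[1:5], 2),
               var2 = rep(LETTERS[11:15], 2),
               var3 = 1:10,
               var4 = 11:20)

long_summary <- data %>%
  # Stack the categorical columns: one row per (original row, grouping variable)
  pivot_longer(c(var1, var2), names_to = "variable", values_to = "level") %>%
  group_by(variable, level) %>%
  summarise(sumvar3 = sum(var3),
            meanvar4 = mean(var4),
            .groups = "drop")

# Pull the summary for a single grouping variable
var1_summary <- filter(long_summary, variable == "var1")
```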

Compute variable according to factor levels

I am kind of new to R and programming in general. I am currently struggling with a piece of code for data transformation and hope someone can take a little bit of time to help me.
Below is a reproducible example:
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
Goal: express all values (a, b) relative to a reference value. The calculation should be a/a_ref, where a_ref is the value of a at f2 = 0 within the same family (f1 can be X, Y or Z).
I tried to solve this by using this code :
test <- filter(dt, f2!=0) %>% group_by(f1) %>%
mutate("a/a_ref"=a/(filter(dt, f2==0) %>% group_by(f1) %>% distinct(a) %>% pull))
I get:
[image: test results]
As you can see, a is divided by a_ref, but my script seems to recycle the reference values (a_ref) regardless of the family f1.
Do you have any suggestion so that a is computed with respect to the family (f1)?
Thank you for reading !
EDIT
I found a way to do it manually:
filter(dt, f1=="X") %>% mutate("a/a_ref"=a/(filter(dt, f1=="X" & f2==0) %>% distinct(a) %>% pull()))
f1 f2 a b a/a_ref
1 X 0 21.77605 24.53115 1.0000000
2 X 1 20.17327 24.02512 0.9263973
3 X 50 19.81482 25.58103 0.9099366
4 X 100 19.90205 24.66322 0.9139422
The problem is that I'd have to update the code for each variable and each family, so this is not a clean way to do it.
# use this to reproduce the same dataset and results
set.seed(5)
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
dt %>%
group_by(f1) %>% # for each f1 value
mutate(a_ref = a[f2 == 0], # get the a_ref and add it in each row
"a/a_ref" = a/a_ref) %>% # divide a and a_ref
ungroup() %>% # forget the grouping
filter(f2 != 0) # remove rows where f2 == 0
# # A tibble: 9 x 6
# f1 f2 a b a_ref `a/a_ref`
# <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 X 1 21.38436 24.84247 19.15914 1.1161437
# 2 X 50 18.74451 23.92824 19.15914 0.9783583
# 3 X 100 20.07014 24.86101 19.15914 1.0475490
# 4 Y 1 19.39709 22.81603 21.71144 0.8934042
# 5 Y 50 19.52783 25.24082 21.71144 0.8994260
# 6 Y 100 19.36463 24.74064 21.71144 0.8919090
# 7 Z 1 20.13811 25.94187 19.71423 1.0215013
# 8 Z 50 21.22763 26.46796 19.71423 1.0767671
# 9 Z 100 19.19822 25.70676 19.71423 0.9738257
You can do this for more than one variable using:
dt %>%
group_by(f1) %>%
mutate_at(vars(a:b), ~ ./.[f2 == 0]) %>%
ungroup()
Or generally use vars(a:z) to use all variables between a and z as long as they are one after the other in your dataset.
Another solution could be using mutate_if like:
dt %>%
group_by(f1) %>%
mutate_if(is.numeric, ~ ./.[f2 == 0]) %>%
ungroup()
Here the function is applied to every numeric variable. The variables f1 and f2 are factors, so they are left out automatically.
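On dplyr 1.0 or later, the same idea can be written with across(). This is a sketch under that version assumption; the _ratio suffix is an arbitrary choice so that the original columns are kept next to the scaled ones.

```r
library(dplyr)

set.seed(5)
dt <- data.frame(f1 = factor(rep(c("X", "Y", "Z"), each = 4)),   # family
                 f2 = factor(rep(c(0, 1, 50, 100), 3)),          # reference and test levels
                 a  = rnorm(12, 20),
                 b  = rnorm(12, 25))

scaled <- dt %>%
  group_by(f1) %>%
  # Divide each numeric column by its value at f2 == 0 within the family
  mutate(across(where(is.numeric), ~ .x / .x[f2 == 0],
                .names = "{.col}_ratio")) %>%
  ungroup()
```

Rows with f2 == 0 end up with a ratio of 1 and can be dropped afterwards with filter(f2 != 0) if only the test levels are needed.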
