I am trying to learn purrr to simulate data using rnorm with a different mean, sd, and n in each iteration.
This code generates my data frame of parameters:
library(tidyverse)

parameter = crossing(n = c(60, 80, 100),
                     agegroup = c("a", "b", "c"),
                     effectsize = c(0.2, 0.5, 0.8),
                     sd = 2) %>%
  # create a simulation id number
  group_by(agegroup) %>%
  mutate(sim = row_number()) %>%
  ungroup() %>%
  # change effect size so that one group has an effect, the others d = 0
  mutate(effectsize = if_else(agegroup == "a", effectsize, 0),
         # calculate the mean for the distribution from the effect size
         mean = effectsize * sd)
Now I want to iterate over the different simulations and, for each row, generate data according to mean, sd, and n using rnorm:
# create a nested dataframe to iterate over each simulation and agegroup
nested_df = parameter %>%
  group_by(sim, agegroup, effectsize) %>%
  nest() %>%
  arrange(sim)
The nested data frame then has one row per sim and agegroup, with n, sd, and mean tucked away in the "data" list-column.
Now I want to create normally-distributed data with the mean, sd, and n given in the "data" column:
nested_df = nested_df %>%
  mutate(data_points = pmap(data, rnorm))
However, the code above gives an error that I haven't been able to find a solution to:
Error in mutate_impl(.data, dots) :
Evaluation error: unused arguments
I read the Iteration chapter in R for Data Science and googled a bunch, but I can't figure out how to combine pmap and nest. The reason I would like to use those functions is that it would make it easier to keep the parameters, simulated data, and output all in one dataframe.
You don't necessarily need to nest the parameters. For example:
parameter %>%
  # Use `pmap` because we explicitly specify three arguments
  mutate(data_points = pmap(list(n, mean, sd), rnorm))
# A tibble: 27 x 7
# n agegroup effectsize sd sim mean data_points
# <dbl> <chr> <dbl> <dbl> <int> <dbl> <list>
# 1 60 a 0.2 2 1 0.4 <dbl [60]>
# 2 60 a 0.5 2 2 1 <dbl [60]>
# 3 60 a 0.8 2 3 1.6 <dbl [60]>
With the nested data frame, you can use map rather than pmap:
nested_df %>%
  # Use `map` because there is really one argument, `data`,
  # but then refer to three different columns of `data`.
  mutate(data_points = map(data, ~ rnorm(.$n, .$mean, .$sd)))
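Either way, once data_points exists you can unnest it to get one long data frame with one row per simulated value and the parameters repeated alongside, which matches your goal of keeping everything together. A minimal sketch using tidyr::unnest():
parameter %>%
  mutate(data_points = pmap(list(n, mean, sd), rnorm)) %>%
  tidyr::unnest(data_points)  # one row per simulated value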
First, it is fine to use pmap like this:
x <- tibble(n = 100, mean = 5, sd = 0.1)
pmap(x, rnorm)
which is very similar to using do.call:
do.call(rnorm, x)
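The difference shows up once the tibble has more than one row: pmap calls the function once per row and returns a list, whereas do.call makes a single call with whole columns as arguments. A small illustration (x3 is just a throwaway example tibble):
x3 <- tibble(n = c(2, 3), mean = c(0, 10), sd = c(1, 1))
pmap(x3, rnorm)     # list of 2: rnorm(2, 0, 1) and rnorm(3, 10, 1)
do.call(rnorm, x3)  # a single call with vector arguments -- not row-wise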
However, if you want to use pmap inside mutate, you must bring the inputs for the function .f into the right shape.
Writing
nested_df %>%
  mutate(y = pmap(x, f))
means that f expects input x.
In your case, rnorm expects three inputs, but only gets one.
So if you insist on nesting the inputs, you can do this:
nested_df %>%
  mutate(data_points = pmap(list(data), function(z) pmap(z, rnorm)[[1]]))
or
nested_df %>%
  mutate(data_points = pmap(list(data), function(z) do.call(rnorm, z)))
However, I would recommend doing it a little differently:
parameter %>%
  mutate(data_points = pmap(list(n, mean, sd), rnorm))
Hope this helps a little.
I have a dataframe in R, let's call it df, which I would like to analyse column-wise in terms of mean, median, standard deviation, IQR, etc. I have prepared succinct functions (where it's not just mean or sd) which take a vector as input and output, say, the IQR or coefficient of variation. Now, if I want to apply any one of these across the attributes (columns), I could use IQRs <- apply(df, 2, IQR), for example.
My question is: how can I apply several of these functions at once (really, I want to chain them all together) so as to fill in a table with one column for the attributes and one column per function (i.e. means in one column, IQRs in another), where the attributes of the data frame (the columns of df) become the rows of this table, listed in its first column?
Suppose your data looked like this:
set.seed(69)
df <- data.frame(A = rnorm(5), B = rnorm(5), C = rnorm(5))
And your function names were like this:
funcs <- c("mean", "median", "sd", "var", "min", "max")
Then you can use an apply inside an lapply like this:
as.data.frame(setNames(lapply(funcs, function(f) apply(df, 2, as.name(f))), funcs))
#> mean median sd var min max
#> A -0.3546864 -0.3348139 0.5948611 0.3538597 -0.949889 0.3743156
#> B -0.2016318 -0.9039467 1.4092795 1.9860687 -1.571073 1.4440935
#> C -0.3537707 -0.1691765 0.7955558 0.6329090 -1.311374 0.4149940
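For what it's worth, the same table can be written a bit more compactly with sapply(), since apply() also accepts a function name as a string; this is just a sketch of the same base-R idea:
# each sapply iteration produces one column of summary statistics
as.data.frame(sapply(funcs, function(f) apply(df, 2, f)))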
You can use tidyr::gather and dplyr::summarize:
# Toy data
df <- data.frame(x = 1:10, y = 11:20)
# Libs
library(tidyverse)
# Code
df %>%
  gather(var, val) %>%
  group_by(var) %>%
  summarize(med = median(val), mean = mean(val), iqr = IQR(val))
Output:
# A tibble: 2 x 4
var med mean iqr
<chr> <dbl> <dbl> <dbl>
1 x 5.5 5.5 4.5
2 y 15.5 15.5 4.5
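Adding the coefficient of variation mentioned in the question is just one more term in summarize; here it is defined as sd/mean, which is one common convention:
df %>%
  gather(var, val) %>%
  group_by(var) %>%
  summarize(mean = mean(val), sd = sd(val), iqr = IQR(val),
            cv = sd(val) / mean(val))  # coefficient of variation, taken as sd/mean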
For example, suppose I have this data frame:
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatment for all the variables I have and put the results together, so I define a function to do this for each variable first, planning to rbind the pieces later on.
SummarizeFn <- function(x, y, z) {
  df1 <- x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(y), SD = sd(y))
  df1$Var <- z # add a column to show which variable those statistics belong to
  df1
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, the results are:
Treatment n Mean SD Var
<fct> <int> <dbl> <dbl> <chr>
1 A 2 2.5 1.29 Price
2 B 2 2.5 1.29 Price
They are the mean and sd of all the observations, not of the observations grouped by Treatment. What is the problem here?
If I take the code out of the function environment, it works totally fine. Please help, thanks.
If you have a better way to achieve my purpose, that would be great! Thanks!
When you pass columns with $ into dplyr pipes, they do not respect grouping and behave as if applied to the entire data frame. Instead, you can use {{}} (curly-curly) to pass unquoted column names into the function.
library(dplyr)
SummarizeFn <- function(x, y, z) {
  x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean({{y}}), SD = sd({{y}}), Var = z)
}
SummarizeFn(df, Price, "Price")
# Treatment n Mean SD Var
# <fct> <int> <dbl> <dbl> <chr>
#1 A 2 1.5 0.707 Price
#2 B 2 3.5 0.707 Price
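To get the "rbind them later on" part of the question, you can then map over the variable names. {{}} expects unquoted columns, so the sketch below uses a string-friendly variant with the .data pronoun; SummarizeFn2 and the choice of c("Price", "Cost") are just illustrative:
library(purrr)

# variant that takes the variable name as a string, via the .data pronoun
SummarizeFn2 <- function(x, var) {
  x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(.data[[var]]), SD = sd(.data[[var]]), Var = var)
}

map_dfr(c("Price", "Cost"), ~ SummarizeFn2(df, .x))  # one row-bound table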
This is related to the question of standard evaluation. It is quite hard to pass string column names with dplyr; if you need to do that, use rlang::sym (or rlang::syms) and !! (or !!!).
Regarding your problem, I think data.table offers you a concise solution:
library(data.table)

dt <- as.data.table(mtcars)
output <- dt[, lapply(.SD, function(d) list(.N, mean(d), sd(d))),
             .SDcols = c("mpg", "qsec")]
output[, stat := c("observations", "mean", "sd")]
output
# output
# mpg qsec stat
# 1: 32 32 observations
# 2: 20.09062 17.84875 mean
# 3: 6.026948 1.786943 sd
I proposed an anonymous function with lapply, but you could use a more sophisticated function defined before the summary step. Change .SDcols to include more variables if needed.
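For instance, a minimal sketch of such a pre-defined function, with one extra statistic and one extra column; note that a function defined outside the data.table call cannot see the special symbol .N, so length(d) stands in for it:
# named summary function; length(d) replaces .N, which is only
# visible inside the data.table j-expression
summary_fn <- function(d) list(length(d), mean(d), sd(d), median(d))
output2 <- dt[, lapply(.SD, summary_fn), .SDcols = c("mpg", "qsec", "hp")]
output2[, stat := c("observations", "mean", "sd", "median")]
output2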
In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where, if x = TRUE, I want to add a variable x_v to the summary.
I found several related Stack Overflow questions, but none that covers this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val = c(1, 2, 2))

if (summarize_num && summarize_num_distinct) {
  summ <- data %>% summarize(n = n(), n_unique = n_distinct(val))
} else if (summarize_num) {
  summ <- data %>% summarize(n = n())
} else if (summarize_num_distinct) {
  summ <- data %>% summarize(n_unique = n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if (summarize_num) { summ <- mutate(summ, n = n()) }
if (summarize_num_distinct) { summ <- mutate(summ, n_unique = n_distinct(val)) }
summ <- slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
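Since the question mentions dbplyr: slice() is generally not translated to SQL, so on a database backend the last step needs a window-function formulation instead. A sketch, assuming the backend supports window functions such as ROW_NUMBER:
# database-friendly replacement for the slice(summ, 1) step
summ <- summ %>% filter(row_number() == 1) %>% ungroup() %>% select(-val)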
The summarise_at() function takes a list of functions as a parameter, so we can write:
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
I want to append (mutate) multiple columns to a data frame, where those columns are stored in a matrix. Is there a way to do this using functions from the tidyverse? (Note that it is possible by resorting to base:: functions, though.) Equivalently: what is the most natural (or idiomatic) way to do this using functions from the tidyverse?
For example, suppose we estimate a quantile regression:
library(dplyr)
tibble(x = runif(100)) %>%
mutate(y = rnorm(n())) ->
EstimationData
library(quantreg)
taus <- (1:9)/10
rq_fit <- rq(y ~ x, tau = taus, data = EstimationData)
and we would like to predict the model on the following values of x:
PredictionData <- tibble(x = seq(0, 1, len = 10))
This can be done via:
predict(rq_fit, newdata = PredictionData)
which returns a matrix (with one column corresponding to each tau). A natural thing would be to package the predictions along with their corresponding xs. One might hope to be able to mutate() the above matrix onto PredictionData, but as far as I can see that is not possible. One possibility is to do:
PredictionData %>%
  data.frame(predict(rq_fit, newdata = .), check.names = FALSE) # (*)
which works well (particularly since the matrix columns have names), although it relies on base::data.frame(). Note that tibble() and as_tibble() do not work.
One way to try to write more idiomatic tidyverse code is to turn the matrix into a list of vectors, as follows:
row_split <- function(X) split(X, row(X, as.factor = TRUE))
PredictionData %>%
  mutate(y = row_split(predict(rq_fit, newdata = .))) %>%
  unnest(.id = 'tau_ix') %>%
  mutate(tau = taus[as.integer(tau_ix)]) %>%
  select(-tau_ix)
But I'm not convinced it's any better.
Is method (*) the best way?
I think the function you want is dplyr::bind_cols(). Note that this doesn't work with a matrix, so you also have to use dplyr::as_tibble().
If your goal is to keep things as a tibble, use functions from dplyr, etc., I think this is the easiest way:
PredictionData %>% bind_cols(as_tibble(predict(rq_fit, newdata = .)))
However, one might think this is a little too "from the inside out" rather than "left to right" to be really idiomatic for a dplyr approach. So, maybe you want something more like
rq_fit %>%
  predict(newdata = PredictionData) %>%
  as_tibble() %>%
  bind_cols(PredictionData) %>%
  select(x, everything())
Both approaches give the following output:
# A tibble: 10 x 10
x `tau= 0.1` `tau= 0.2` `tau= 0.3` `tau= 0.4` `tau= 0.5` `tau= 0.6` `tau= 0.7` `tau= 0.8` `tau= 0.9`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0000000 -1.5755585 -0.8082654 -0.3133431 -0.1952309 0.058074887 0.44450275 0.6679990 0.8802325 1.650510
2 0.1111111 -1.4767907 -0.7915847 -0.3517192 -0.1909820 0.041473996 0.39935461 0.6132367 0.8618259 1.618999
3 0.2222222 -1.3780228 -0.7749040 -0.3900952 -0.1867331 0.024873104 0.35420647 0.5584744 0.8434194 1.587488
4 0.3333333 -1.2792549 -0.7582233 -0.4284712 -0.1824842 0.008272213 0.30905833 0.5037121 0.8250128 1.555976
5 0.4444444 -1.1804871 -0.7415425 -0.4668472 -0.1782353 -0.008328679 0.26391019 0.4489498 0.8066063 1.524465
6 0.5555556 -1.0817192 -0.7248618 -0.5052233 -0.1739865 -0.024929570 0.21876205 0.3941875 0.7881997 1.492954
7 0.6666667 -0.9829513 -0.7081811 -0.5435993 -0.1697376 -0.041530462 0.17361391 0.3394252 0.7697932 1.461442
8 0.7777778 -0.8841835 -0.6915004 -0.5819753 -0.1654887 -0.058131353 0.12846577 0.2846630 0.7513866 1.429931
9 0.8888889 -0.7854156 -0.6748196 -0.6203513 -0.1612398 -0.074732245 0.08331763 0.2299007 0.7329801 1.398419
10 1.0000000 -0.6866477 -0.6581389 -0.6587274 -0.1569909 -0.091333136 0.03816949 0.1751384 0.7145735 1.366908
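If you also want tau as a real column, as the question's row_split() approach was aiming for, one more reshaping step does it. A sketch with tidyr's gather(), using readr::parse_number() just to strip the "tau=" prefix from the column names:
library(tidyr)

rq_fit %>%
  predict(newdata = PredictionData) %>%
  as_tibble() %>%
  bind_cols(PredictionData) %>%
  gather(tau, y_pred, -x) %>%              # long form: one row per (x, tau)
  mutate(tau = readr::parse_number(tau))   # "tau= 0.1" -> 0.1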
Data
For reproducibility, I created the data using your code, but setting the seed first:
set.seed(1234)
library(dplyr)
tibble(x = runif(100)) %>%
mutate(y = rnorm(n())) ->
EstimationData
library(quantreg)
taus <- (1:9)/10
rq_fit <- rq(y ~ x, tau = taus, data = EstimationData)
PredictionData <- tibble(x = seq(0, 1, len = 10))
I want to collapse the following data frame by groups, using summation for some variables and weighted averages for others.
I have the following data frame:
group_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 = rnorm(10, .5, .4)
var_percent_2 = rnorm(10, .5, .4)
weighting = sample.int(50, 10)
df_to_collapse = data.frame(group_id, var_1, var_2, var_percent_1, var_percent_2, weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted = df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[, 2:3]
to_be_weighted_2 = colnames(to_be_weighted)
to_be_summed_2 = colnames(to_be_summed)
And my goal is to simultaneously collapse my data using either sum or weighted average, according to the type of variable (i.e. if it is in percentage terms, I use the weighted average).
Here is my best attempt:
df_to_collapse %>%
  group_by(group_id) %>%
  summarise_at(.vars = c(to_be_summed_2, to_be_weighted_2), .funs = c(sum, mean))
But, as you can see, it is not a weighted average.
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>%
  group_by(group_id) %>%
  summarise_at(.vars = c(to_be_weighted_2, to_be_summed_2),
               .funs = c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data: add a dummy variable called type recording whether each column is a percentage (optional, but handy), apply the appropriate function inside summarise based on that type, then spread back to wide shape. If you can change the column names, you could come up with a more elegant way of building the type column, but that's really just for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values when you really just need one.
library(tidyverse)
set.seed(1234)
group_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 = rnorm(10, .5, .4)
var_percent_2 = rnorm(10, .5, .4)
weighting = sample.int(50, 10)
df_to_collapse <- data.frame(group_id, var_1, var_2, var_percent_1, var_percent_2, weighting)
df_to_collapse %>%
  gather(key = var, value = value, -group_id, -weighting) %>%
  mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
  group_by(group_id, var) %>%
  summarise(sum_or_avg = ifelse(type[1] == "percent",
                                weighted.mean(value, weighting),
                                sum(value))) %>%
  ungroup() %>%
  spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
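An alternative that stays closer to the original summarise_at() attempt is to aggregate the two variable lists separately and then join the results. A sketch using the name vectors from the question (funs() matches the dplyr idiom of that era; newer versions would use list(~ ...) instead):
summed <- df_to_collapse %>%
  group_by(group_id) %>%
  summarise_at(.vars = to_be_summed_2, .funs = sum)

weighted <- df_to_collapse %>%
  group_by(group_id) %>%
  summarise_at(.vars = to_be_weighted_2,
               .funs = funs(weighted.mean(., weighting)))

inner_join(summed, weighted, by = "group_id")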