This problem has me stumped.
I have the following data frame:
library(dplyr)
# approximation of data frame
x <- data.frame(doy = sample(seq(200, 300), 20, replace = TRUE),
                year = sample(c("2000", "2005"), 20, replace = TRUE),
                phase = sample(c("pre", "post"), 20, replace = TRUE))
and a simple 'summarize' function that takes in the column name as a variable, and works nicely:
getStats <- function(df, col) {
  col <- as.name(col)
  df %>%
    group_by(year, phase) %>%
    summarize(n = sum(!is.na(col)),
              mean = mean(col, na.rm = TRUE),
              sd = sd(col, na.rm = TRUE),
              se = sd / sqrt(n))
}
> getStats(x, "doy")
Source: local data frame [4 x 6]
Groups: year [?]
    year  phase     n    mean       sd       se
  <fctr> <fctr> <int>   <dbl>    <dbl>    <dbl>
1   2000   post     8 248.625 30.42526 10.75695
2   2000    pre     2 290.000 14.14214 10.00000
3   2005   post     5 231.400 32.86031 14.69558
4   2005    pre     5 274.200 29.79429 13.32441
However, if I modify the function to get the median, it returns an error:
getStats <- function(df, col) {
  col <- as.name(col)
  df %>%
    group_by(year, phase) %>%
    summarize(n = sum(!is.na(col)),
              mean = mean(col, na.rm = TRUE),
              med = median(col, na.rm = TRUE),  # new line
              sd = sd(col, na.rm = TRUE),
              se = sd / sqrt(n))
}
> getStats(x, "doy")
Error in median(doy, na.rm = TRUE) : object 'doy' not found
I've tried a host of name and position changes, but all yield the same result: 'median' doesn't accept the column name as a passed variable. I assume I'm missing something so basic I'll do a face palm when someone points it out to me, but in the interim I feel like I'm losing my sanity. I appreciate any insights!
Your proximal problem may be that median doesn't have a ... argument, while mean does (I'm not sure why sd is working ... maybe an interaction between methods and ...?)
In any case, IMO the right way to handle this sort of problem is to use standard evaluation (i.e., not non-standard evaluation, i.e. use summarise_ rather than summarise, as illustrated in vignette("nse",package="dplyr")):
Illustrating how this works in the global environment rather than inside a function, but I think that shouldn't matter ...
col <- "doy"
funs <- c("n", "mean", "stats::median", "sd", "se")
## put together the function calls as strings
dots <- c(sprintf("sum(!is.na(%s))", col),
          sprintf("%s(%s,na.rm=TRUE)", funs[2:4], col),
          "sd/sqrt(n)")
names(dots) <- gsub("^.*::", "", funs)  ## strip the namespace prefix ... ugh
dots
##                               n                            mean 
##              "sum(!is.na(doy))"          "mean(doy,na.rm=TRUE)" 
##                          median                              sd 
## "stats::median(doy,na.rm=TRUE)"            "sd(doy,na.rm=TRUE)" 
##                              se 
##                    "sd/sqrt(n)" 
x %>%
  group_by(year, phase) %>%
  summarise_(.dots = dots)
The only annoying thing here is that for some reason dplyr can't find median unless I call it as stats::median, which means we have to work a little harder to get nice column names. The standard-evaluation method is a little uglier, but that's the price you pay for this kind of flexibility.
Embedding this in a function, I would probably break off getStats in a different place, i.e.
getStats <- function(data, col) {
  ## if you want to pass a string argument instead, remove the next line
  col <- deparse(substitute(col))
  funs <- c("n", "mean", "stats::median", "sd", "se")
  dots <- c(sprintf("sum(!is.na(%s))", col),
            sprintf("%s(%s,na.rm=TRUE)", funs[2:4], col),
            "sd/sqrt(n)")
  names(dots) <- gsub("^.*::", "", funs)  ## strip the namespace prefix ... ugh
  summarise_(data, .dots = dots)
}
x %>% group_by(year, phase) %>% getStats(doy)
This gives you more flexibility to do different groupings ...
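A note for later readers: summarise_() and the other underscore verbs have since been deprecated in favour of tidy evaluation. As a minimal sketch of the modern equivalent (assuming dplyr >= 0.7, where the .data pronoun is available), the string column name can be looked up directly, which also sidesteps the median() scoping issue:
# modern sketch: .data[[col]] looks the string up in the current data frame
getStats <- function(df, col) {
  df %>%
    group_by(year, phase) %>%
    summarize(n = sum(!is.na(.data[[col]])),
              mean = mean(.data[[col]], na.rm = TRUE),
              med = median(.data[[col]], na.rm = TRUE),
              sd = sd(.data[[col]], na.rm = TRUE),
              se = sd / sqrt(n))
}
getStats(x, "doy")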
I am trying to split a dataset from tidymodels in R.
library(tidymodels)
data(Sacramento, package = "modeldata")
data_split <- initial_split(Sacramento, prop = 0.75, strata = price)
Sac_train <- training(data_split)
I want to describe the distribution of the training dataset, but the following error occurs.
Sac_train %>%
  select(price) %>%
  summarize(min_sell_price = min(),
            max_sell_price = max(),
            mean_sell_price = mean(),
            sd_sell_price = sd())
# Error: In min() : no non-missing arguments to min; returning Inf
However, the following code works.
Sac_train %>%
  summarize(min_sell_price = min(price),
            max_sell_price = max(price),
            mean_sell_price = mean(price),
            sd_sell_price = sd(price))
My question is: why isn't select(price) working in the first example? Thanks.
Even though only one column is selected, the result is still a data frame, so you still need to tell R/dplyr which column you want to summarize.
In other words, dplyr doesn't treat a single-column data frame as a vector that you can pass straight into a function - i.e.:
Sac_train.vec <- 1:25
mean(Sac_train.vec)
# [1] 13
will calculate the mean, whereas
Sac_train.df <- data.frame(price = 1:25)
mean(Sac_train.df)
warns and returns NA instead of computing the mean.
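If the goal is a bare vector rather than a one-column data frame, a small sketch using dplyr's pull() (not part of the original question) does the extraction directly:
# pull() returns the column as a plain vector, so base functions apply to it
Sac_train %>% pull(price) %>% mean()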
In the special case of only one column, this may be more parsimonious code:
# Example Data
Sac_train <- data.frame(price = 1:25, col2 = LETTERS[1:25])
Sac_train %>%
  select(price) %>%
  summarize(across(everything(),
                   list(min = min, max = max, mean = mean, sd = sd)))
Output:
#   price_min price_max price_mean price_sd
# 1         1        25         13 7.359801
I have a data frame like the following:
df <- data.frame(Treatment = c(rep("A", 2), rep("B", 2)),
                 Price = 1:4,
                 Cost = 2:5)
I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x, y, z) {
  df1 <- x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(y), SD = sd(y))
  df1$Var <- z  # add a column to show which variable the statistics belong to
  df1
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, the results are:
  Treatment     n  Mean    SD Var
  <fct>     <int> <dbl> <dbl> <chr>
1 A             2   2.5  1.29 Price
2 B             2   2.5  1.29 Price
These are the mean and SD of all the observations, not of the observations grouped by Treatment. What is the problem here?
If I take the code out of the function environment, it works totally fine. Please help, thanks.
If you have a better way to achieve my purpose, that would be great! Thanks!
When you refer to variables with $ inside dplyr pipes, they do not respect the grouping and behave as if applied to the entire data frame. Instead, you can use {{ }} to pass bare column names into the function.
library(dplyr)
SummarizeFn <- function(x, y, z) {
  x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean({{ y }}), SD = sd({{ y }}), Var = z)
}
SummarizeFn(df, Price, "Price")
#  Treatment     n  Mean    SD Var
#  <fct>     <int> <dbl> <dbl> <chr>
#1 A             2   1.5 0.707 Price
#2 B             2   3.5 0.707 Price
This is related to the question of standard evaluation. (That's funny, I just wrote an article on the subject.) It is quite hard to pass string column names with dplyr. If you need to do that, use rlang::sym (or rlang::syms) and !! (or !!!).
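For instance, here is a minimal sketch of a string-based version of the earlier function (the name SummarizeFnStr is mine, not from the question; assumes dplyr >= 0.7 and rlang):
library(dplyr)
library(rlang)

# hypothetical variant that accepts the column name as a string
SummarizeFnStr <- function(x, y, z) {
  y <- sym(y)  # turn the string into a symbol
  x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(!!y), SD = sd(!!y), Var = z)
}

SummarizeFnStr(df, "Price", "Price")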
Regarding your problem, I think data.table offers you a concise solution:
library(data.table)

dt <- as.data.table(mtcars)
output <- dt[, lapply(.SD, function(d) return(list(.N, mean(d), sd(d)))),
             .SDcols = c("mpg", "qsec")]
output[, 'stat' := c("observations", "mean", "sd")]
output
# output
#         mpg     qsec         stat
# 1:       32       32 observations
# 2: 20.09062 17.84875         mean
# 3: 6.026948 1.786943           sd
I propose an anonymous function with lapply, but you could use a more sophisticated function defined before the summary step. Change .SDcols to include more variables if needed.
I'm trying to calculate the mean of some grouped data, but I'm running into an issue where the value generated using base::mean() differs from the one I get with base::rowMeans() or when I replicate the calculation in Excel.
Here's the code with a simplified data frame looking at just a small piece of the data:
df <- data.frame(ID = 1101372,
                 Q1 = 5.996667,
                 Q2 = 6.005556,
                 Q3 = 5.763333)
avg1 <- df %>%
  summarise(new_avg = mean(Q1,
                           Q2,
                           Q3))  # Returns a value of 5.99667
avg2 <- rowMeans(df[,2:4]) # Returns a value of 5.921852
The value in avg2 is what I get when I use AVERAGE in Excel, but I can't figure out why mean() is not generating the same number.
Any thoughts?
Here, mean() is taking only the first argument, i.e. Q1, as x, because the usage from ?mean is
mean(x, trim = 0, na.rm = FALSE, ...)
i.e. the second and third arguments are different parameters altogether. In the OP's code, x is taken as Q1, trim as Q2, and so on. The ... at the end also means the user can supply any number of further arguments without an error, which leads to confusion like this (if we don't check the usage).
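To make this concrete (a small added illustration, not from the original answer): passing the three values positionally sends the second to trim and the third to na.rm, and because mean() falls back to the median of x whenever trim >= 0.5, only the first value is ever used:
mean(5.996667, 6.005556, 5.763333)
# [1] 5.996667   <- same as mean(5.996667); Q2 and Q3 never enter the average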
We can specify the data as ., subset the columns of interest and use that in rowMeans
df %>%
  summarise(new_avg = rowMeans(.[-1]))
This would be more efficient. But if we want to use mean as such, then do it rowwise:
df %>%
  rowwise() %>%
  summarise(new_avg = mean(c(Q1, Q2, Q3)))
# A tibble: 1 x 1
#   new_avg
#     <dbl>
#1     5.92
Or convert to 'long' format, group by 'ID', and get the mean:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -ID) %>%
  group_by(ID) %>%  # can skip this step if there is only a single row
  summarise(new_avg = mean(value))
# A tibble: 1 x 2
#        ID new_avg
#     <dbl>   <dbl>
#1 1101372    5.92
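As a later-dplyr aside (not part of the original answer), on dplyr >= 1.0.0 c_across() makes the rowwise version above a little more compact; a sketch:
df %>%
  rowwise() %>%
  mutate(new_avg = mean(c_across(Q1:Q3))) %>%  # mean over the Q1..Q3 columns, row by row
  ungroup()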
I am trying to learn purrr to simulate data using rnorm with different means, sd, and n in each iteration.
This code generates my data frame:
library(dplyr)
library(tidyr)
library(purrr)

parameter <- crossing(n = c(60, 80, 100),
                      agegroup = c("a", "b", "c"),
                      effectsize = c(0.2, 0.5, 0.8),
                      sd = 2) %>%
  # create a simulation id number
  group_by(agegroup) %>%
  mutate(sim = row_number()) %>%
  ungroup() %>%
  mutate(# change effect size so that one group has an effect, the others d = 0
         effectsize = if_else(agegroup == "a", effectsize, 0),
         # calculate the mean for the distribution from the effect size
         mean = effectsize * sd)
Now I want to iterate over the different simulations and, for each row, generate data according to mean, sd, and n using rnorm:
# create a nested data frame to iterate over each simulation and agegroup
nested_df <- parameter %>%
  group_by(sim, agegroup, effectsize) %>%
  nest() %>%
  arrange(sim)
After nesting, each row of nested_df pairs sim, agegroup, and effectsize with a data list-column holding a one-row tibble of the remaining columns: n, sd, and mean.
Now I want to create normally distributed data with the mean, sd, and n given in the data column:
nested_df <- nested_df %>%
  mutate(data_points = pmap(data, rnorm))
However, the code above gives an error that I haven't been able to find a solution to:
Error in mutate_impl(.data, dots) :
  Evaluation error: unused arguments
I read the Iteration chapter in R for Data Science and googled a bunch, but I can't figure out how to combine pmap and nest. The reason I would like to use those functions is that it would make it easier to keep the parameters, simulated data, and output all in one data frame.
You don't necessarily need to nest the parameters. For example:
parameter %>%
  # Use `pmap` because we explicitly specify three arguments
  mutate(data_points = pmap(list(n, mean, sd), rnorm))
# A tibble: 27 x 7
#       n agegroup effectsize    sd   sim  mean data_points
#   <dbl> <chr>         <dbl> <dbl> <int> <dbl> <list>
# 1    60 a               0.2     2     1   0.4 <dbl [60]>
# 2    60 a               0.5     2     2   1   <dbl [60]>
# 3    60 a               0.8     2     3   1.6 <dbl [60]>
With the nested data frame, you can use map rather than pmap:
nested_df %>%
  # Use `map` because there is really one argument, `data`,
  # but then refer to three different columns of `data`.
  mutate(data_points = map(data, ~ rnorm(.$n, .$mean, .$sd)))
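If you then want the simulated draws back as ordinary rows rather than a list-column, a small follow-on sketch (assuming tidyr >= 1.0 for unnest()):
nested_df %>%
  mutate(data_points = map(data, ~ rnorm(.$n, .$mean, .$sd))) %>%
  unnest(data_points)  # one row per simulated value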
First, it is okay to use pmap like this:
x <- tibble(n = 100, mean = 5, sd = 0.1)
pmap(x, rnorm)
which is very similar to use do.call:
do.call(rnorm, x)
However, if you want to use pmap inside mutate, you need to bring the inputs for the function .f into the right shape.
Writing
nested_df %>%
  mutate(y = pmap(x, f))
means that f expects input x.
In your case, rnorm expects three inputs, but only gets one.
So if you insist on nesting the inputs you can do this (note the [[1]] sits inside the anonymous function: each z is a one-row tibble, so the inner pmap returns a length-one list):
nested_df %>%
  mutate(data_points = pmap(list(data), function(z) pmap(z, rnorm)[[1]]))
or
nested_df %>%
  mutate(data_points = pmap(list(data), function(z) do.call(rnorm, z)))
However, I would recommend doing it a little differently:
parameter %>%
  mutate(data_points = pmap(list(n, mean, sd), rnorm))
Hope this helps a little.
First of all, my data comes from Temperature.xls, which can be downloaded from this link: RBook
My code is this:
library(plyr)

temp = read.table("Temperature.txt", header = TRUE)
length(unique(temp$Year))  # number of unique values in the Year vector
res = ddply(temp, c("Year", "Month"), summarise,
            Mean = mean(Temperature, na.rm = TRUE))
res1 = ddply(temp, .(Year, Month), summarise,
             SD = sd(Temperature, na.rm = TRUE),
             N = sum(!is.na(Temperature)))
# order res1 by Year and SD
res1 = res1[order(res1$Year, res1$SD), ]
# find the maximum SD per Year and display those rows in a separate data frame
res1_maxsd = ddply(res1, .(Year), summarise, MaxSD = max(SD, na.rm = TRUE))  # find the max SD in each Year
res1_max = merge(res1_maxsd, res1, all = FALSE)  # merge with the original to see the other variables in the max rows
res1_m = res1_max[res1_max$MaxSD == res1_max$SD, ]  # keep the rows corresponding to the max value
res1_mm = res1_m[complete.cases(res1_m), ]  # trim all others (which are NAs)
I know that I can cut the last 4 lines down to fewer lines. Can I somehow execute the last 2 lines in one command? I have stumbled across:
res1_m = res1_max[complete.cases(res1_max$MaxSD==res1_max$SD),]
But this does not give me what I want, which is a smaller data frame containing only the rows (with all the variables) that hold the max SD per year.
Rather than fixing the last 2 lines, why not start from res1? Sorting SD in decreasing order and taking the first row per year gives you an equivalent final data set...
res1 <- res1[order(res1$Year, -res1$SD), ]
res_final <- res1[!duplicated(res1$Year), ]
The last four lines can be cut down if you use the dplyr package. Since you want to keep some information from the original data set, you probably don't want summarise, because it only returns the summarized information and you would have to merge it back with the original data set; mutate and filter are a better choice:
library(dplyr)
res1_mm1 <- res1 %>% group_by(Year) %>% filter(SD == max(SD, na.rm = TRUE))
You can also use mutate to create the new column MaxSD (which, for the retained rows, equals SD) and then filter on it:
res1_mm1 <- res1 %>%
  group_by(Year) %>%
  mutate(MaxSD = max(SD, na.rm = TRUE)) %>%
  filter(SD == MaxSD)
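As a footnote not in the original answer: on dplyr >= 1.0.0, slice_max() states the same intent directly; a minimal sketch:
res1 %>%
  group_by(Year) %>%
  slice_max(SD, n = 1)  # keep the row(s) with the largest SD per Year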