Let's pretend I am measuring the distance grasshoppers can jump pre- and post-treatment. This is just for fun; the real measurement could be anything, and the bigger picture is to understand the group_by() command.
For the statistical test I would like to run, each observation needs to have its own column, but I'm given a dataset that is not in this format. I would like to use the dplyr package and its group_by() command to shape the data for my needs, because if this were to happen again, I could write more general code that works on other datasets :)
I am able to do this using commands such as filter(), followed by cbind() at a later step (see example below), but it also requires renaming a column. Additionally, if I want to add a column, let's say "difference", to calculate the observed difference between observation 1 and observation 2, I can do this, but then I need to add another line of code (again, see example below).
It would be great to do this with fewer lines of code.
Please see what I have tried, and let me know how I could modify the group_by() code to work properly.
example_df <- data.frame( "observation" = character(0), "distance" = integer(0))
Assign names for our "observations", remember, in this example, it's done twice
variable_names <- c( "obs_1", "obs_2")
Assign fictitious values to w, x, y and z
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
Combine everything for this pretend exercise
df <- data.frame( "observation" = variable_names, "distance" = c(w,x,y,z))
attach(df)
Here's how I achieved the desired results for this example
library(dplyr)
dat = filter(df,observation=="obs_1")
dat2 = filter(df,observation=="obs_2")
names(dat2)
colnames(dat2)[2] <- "distance_2"
final <- cbind(dat,dat2)
attach(final)
final$difference <- distance-distance_2
I tried using the group_by() command, but I just get an error message
final <- df %>% group_by(observation,distance) %>% summarise(
Observation_1 = first(observation), distance_1 = first(distance),
Observation_2 = last(observation), distance_2 = last(distance,difference=distance-distance_2)))
It would be great to get the above code to work
To make things even more "fun" :), what if more than one variable was measured? Could I write general code to achieve the desired results, again without having to go through the whole filter() and cbind() process?
Here's an example (expanded on the above one)
example_df <- data.frame( "observation" = character(0), "distance" = integer(0),"weight" = integer(0),"speed" = integer(0))
variable_names <- c( "obs_1", "obs_2")
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
a<-rnorm(200, mean=5, sd=2)
b<-rnorm(200, mean=5, sd=2)
df <- data.frame( "observation" = variable_names, "distance" = c(w,x),"weight" = c(y,z),"speed" = c(a,b))
attach(df)
library(dplyr)
dat = filter(df,observation=="obs_1")
dat2 = filter(df,observation=="obs_2")
names(dat2)
colnames(dat2)[2] <- "distance_2"
colnames(dat2)[3] <- "weight_2"
colnames(dat2)[4] <- "speed_2"
final <- cbind(dat,dat2)
attach(final)
final$difference <- distance-distance_2
final$difference_weight <- weight-weight_2
final$difference_speed <- speed-speed_2
Thanks everyone!
This would be simple with pivot_wider(), though I presume your data also has an id column to link the two observations somehow, so I have added one here:
library(tidyverse)
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
a<-rnorm(200, mean=5, sd=2)
b<-rnorm(200, mean=5, sd=2)
variable_names <- c( "obs_1", "obs_2")
df <-
data.frame(
"id" = rep(1:200, each = 2),
"observation" = variable_names,
"distance" = c(w, x),
"weight" = c(y, z),
"speed" = c(a, b)
)
df %>%
pivot_wider(
id_cols = id,
names_from = observation,
values_from = distance:speed
)
#> # A tibble: 200 x 7
#> id distance_obs_1 distance_obs_2 weight_obs_1 weight_obs_2 speed_obs_1
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.63 2.80 2.98 -0.795 3.58
#> 2 2 4.96 6.84 4.11 9.92 8.21
#> 3 3 4.84 7.51 6.32 3.28 9.02
#> 4 4 3.79 6.82 5.42 6.86 7.96
#> 5 5 5.48 2.84 9.56 3.27 3.55
#> 6 6 8.78 2.06 3.81 4.35 5.93
#> 7 7 8.42 4.21 3.92 4.40 9.37
#> 8 8 8.26 9.67 4.05 6.19 3.17
#> 9 9 3.80 4.47 6.58 5.38 6.09
#> 10 10 4.67 2.86 6.27 6.88 3.72
#> # ... with 190 more rows, and 1 more variable: speed_obs_2 <dbl>
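Once the data is wide, the "difference" columns from the question are one mutate() away. The following is only a minimal sketch on a small made-up data frame (five ids, two measured variables); the `_diff` column names are my own choice, not something from the question.

```r
library(dplyr)
library(tidyr)

# Small illustrative data: 5 subjects, each measured twice
set.seed(1)
df <- data.frame(
  id = rep(1:5, each = 2),
  observation = rep(c("obs_1", "obs_2"), times = 5),
  distance = rnorm(10, mean = 5, sd = 2),
  weight = rnorm(10, mean = 5, sd = 2)
)

wide <- df %>%
  pivot_wider(
    id_cols = id,
    names_from = observation,
    values_from = c(distance, weight)
  ) %>%
  # One difference column per measured variable (names are hypothetical)
  mutate(
    distance_diff = distance_obs_1 - distance_obs_2,
    weight_diff = weight_obs_1 - weight_obs_2
  )

wide
```

This keeps the paired values and the differences side by side in one row per id.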
Follow-up
You can also tell pivot_wider to apply a function when combining values. In this example I've passed names_from = NULL so that the values in every column are paired up by id, and used the diff function to calculate the difference (diff returns the second value minus the first, so here obs_2 minus obs_1):
df %>%
pivot_wider(
id_cols = id,
names_from = NULL,
values_from = distance:speed,
values_fn = diff,
names_sep = ""
)
#> # A tibble: 200 x 4
#> id distance weight speed
#> <int> <dbl> <dbl> <dbl>
#> 1 1 -0.828 -3.77 4.45
#> 2 2 1.88 5.82 -1.07
#> 3 3 2.66 -3.04 -4.31
#> 4 4 3.03 1.45 -0.969
#> 5 5 -2.64 -6.29 5.06
#> 6 6 -6.72 0.541 -2.24
#> 7 7 -4.20 0.481 -5.82
#> 8 8 1.41 2.14 3.71
#> 9 9 0.669 -1.19 -1.14
#> 10 10 -1.81 0.607 -2.62
#> # ... with 190 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
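Since the question asked specifically about group_by(), here is a hedged sketch of the same differences computed with group_by() and summarise() instead of pivoting. It assumes an id column linking the paired rows (as added in the answer above) and uses a small made-up data frame; the `_diff` names are illustrative.

```r
library(dplyr)

# Small illustrative data: 5 subjects, each measured twice
set.seed(1)
df <- data.frame(
  id = rep(1:5, each = 2),
  observation = rep(c("obs_1", "obs_2"), times = 5),
  distance = rnorm(10, mean = 5, sd = 2),
  weight = rnorm(10, mean = 5, sd = 2)
)

diffs <- df %>%
  group_by(id) %>%
  summarise(
    # For every measured column: value at obs_1 minus value at obs_2
    across(
      c(distance, weight),
      ~ .x[observation == "obs_1"] - .x[observation == "obs_2"],
      .names = "{.col}_diff"
    ),
    .groups = "drop"
  )

diffs
```

Note the sign: this computes obs_1 minus obs_2, matching the question's own code, whereas diff() gives obs_2 minus obs_1.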
I'd like to produce "wide" summary tables of data in this sort of format:
---- Centiles ----
Param Group Mean SD 25% 50% 75%
Height 1 x.xx x.xxx x.xx x.xx x.xx
2 x.xx x.xxx x.xx x.xx x.xx
3 x.xx x.xxx x.xx x.xx x.xx
Weight 1 x.xx x.xxx x.xx x.xx x.xx
2 x.xx x.xxx x.xx x.xx x.xx
3 x.xx x.xxx x.xx x.xx x.xx
I can do that in dplyr 0.8.x. I can do it generically, with a function that can handle arbitrary grouping variables with arbitrary numbers of levels and arbitrary statistics summarising arbitrary numbers of variables with arbitrary names. I get that level of flexibility by making my data tidy. That's not what this question is about.
First, some toy data:
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
Now, a simple summary function, and a helper:
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
tibble(Value := quantile(x, q), "Quantile" := q)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
So I can say things like
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>% head()
Giving
# A tibble: 6 x 5
Parameter Group Q$Value $Quantile Mean SD
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Height 1 1.45 0.25 1.54 0.141
2 Height 1 1.49 0.5 1.54 0.141
3 Height 1 1.59 0.75 1.54 0.141
4 Height 2 1.64 0.25 1.66 0.0649
5 Height 2 1.68 0.5 1.66 0.0649
6 Height 2 1.68 0.75 1.66 0.0649
So that's the summary I need, but it's in long format. And Q is a df-col. It's a tibble:
is_tibble(summary$Q)
[1] TRUE
So pivot_wider doesn't seem to work. I can use nest_by() to get to a one-row-per-group format:
toySummary <- summary %>% nest_by(Group, Mean, SD)
toySummary
# Rowwise: Group, Mean, SD
Group Mean SD data
<int> <dbl> <dbl> <list<tbl_df[,2]>>
1 1 1.54 0.141 [3 × 2]
2 1 78.8 10.2 [3 × 2]
3 2 1.66 0.0649 [3 × 2]
4 2 82.9 9.09 [3 × 2]
5 3 1.63 0.100 [3 × 2]
6 3 71.0 10.8 [3 × 2]
But now the format of the centiles is even more complicated:
> toySummary$data[1]
<list_of<
tbl_df<
Parameter: character
Q :
tbl_df<
Value : double
Quantile: double
>
>
>[1]>
[[1]]
# A tibble: 3 x 2
Parameter Q$Value $Quantile
<chr> <dbl> <dbl>
1 Height 1.45 0.25
2 Height 1.49 0.5
3 Height 1.59 0.75
It looks like a list, so I guess some form of lapply would probably work, but is there a neater, tidy solution that I've not spotted yet? I've discovered several new verbs that I didn't know about whilst researching this question (chop, pack, rowwise(), nest_by and such) but none seem to give me what I want: ideally, a tibble with 6 rows (defined by unique Group and Parameter combinations) and columns for Mean, SD, Q25, Q50 and Q75.
To clarify in response to the first two proposed answers: getting the exact numbers that my toy example generates is less important than finding a generic technique for moving from the df-col(s) that summarise returns in dplyr v1.0.0 to a wide data summary of the general form that my example illustrates.
revised answer
Here is my revised answer. This time, I rewrote your quibble2 function with enframe and pivot_wider so that it returns a tibble with a single row and one column per quantile.
This will again lead to a df-col in your summary tibble, and now we can use unpack directly, without using pivot_wider to get the expected outcome.
This should generalize on centiles etc. as well.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
pivot_wider(enframe(quantile(x, q)),
names_from = name,
values_from = value)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>%
unpack(Q)
#> # A tibble: 6 x 7
#> Parameter Group `25%` `50%` `75%` Mean SD
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1 1.62 1.66 1.73 1.70 0.108
#> 2 Height 2 1.73 1.77 1.78 1.76 0.105
#> 3 Height 3 1.55 1.64 1.76 1.65 0.109
#> 4 Weight 1 75.6 80.6 84.3 80.0 9.05
#> 5 Weight 2 75.4 76.9 79.6 77.4 7.27
#> 6 Weight 3 70.7 75.2 82.0 76.3 6.94
Created on 2020-06-13 by the reprex package (v0.3.0)
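To see why unpack() is enough here, it may help to look at what the rewritten quibble2 returns on its own. A small sketch, calling it on the made-up vector 1:9 rather than on the toy data:

```r
library(tibble)
library(tidyr)

# Same definition as in the answer above
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
  pivot_wider(enframe(quantile(x, q)),
              names_from = name,
              values_from = value)
}

# One row, one column per requested quantile
res <- quibble2(1:9)
res
dim(res)
```

Because each group contributes exactly one row, summarise() keeps one row per group and stores the quantiles as a df-col that unpack() can flatten.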
Second approach
Without changing quibble2, we would need to first call unpack and then pivot_wider. This should scale as well.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
tibble(Value := quantile(x, q), "Quantile" := q)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>%
unpack(Q) %>%
pivot_wider(names_from = Quantile, values_from = Value)
#> # A tibble: 6 x 7
#> Parameter Group Mean SD `0.25` `0.5` `0.75`
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1 1.70 0.108 1.62 1.66 1.73
#> 2 Height 2 1.76 0.105 1.73 1.77 1.78
#> 3 Height 3 1.65 0.109 1.55 1.64 1.76
#> 4 Weight 1 80.0 9.05 75.6 80.6 84.3
#> 5 Weight 2 77.4 7.27 75.4 76.9 79.6
#> 6 Weight 3 76.3 6.94 70.7 75.2 82.0
Created on 2020-06-13 by the reprex package (v0.3.0)
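The unpack-then-pivot step can be seen in isolation on a hand-built df-col. This is only an illustrative sketch with invented values; the column names Q, Value and Quantile mirror the ones in the question.

```r
library(tibble)
library(tidyr)

# A tibble whose column Q is itself a tibble (a "df-col")
tb <- tibble(
  g = c(1, 1),
  Q = tibble(Value = c(1.5, 2.5), Quantile = c(0.25, 0.75))
)

wide <- tb %>%
  unpack(Q) %>%                                   # Q$Value, Q$Quantile become ordinary columns
  pivot_wider(names_from = Quantile, values_from = Value)

wide
```

After unpack() the data is an ordinary long table, which is why pivot_wider() then works as usual.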
generalized approach
I tried to figure out a more general approach by rewriting the mySummary function. It now automatically converts expressions that return a vector or a named vector into df-cols, and it automatically wraps other expressions in list() where necessary.
Then I defined a function, widen, which widens the df as much as possible while preserving rows, including calling broom::tidy on supported list-columns.
The approach is not perfect, and could be extended by including unnest_wider in the widen function.
Note that I changed the grouping in the example to be able to use t.test as another example output.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
# modified summary function
mySummary <- function(data, ...) {
fns <- rlang::enquos(...)
fns <- map(fns, function(x) {
res <- rlang::eval_tidy(x, data = data)
if ( ((is.vector(res) || is.factor(res)) && length(res) == 1) ||
("list" %in% class(res) && is.list(res)) ||
rlang::call_name(rlang::quo_get_expr(x)) == "list") {
x
}
else if ((is.vector(res) || is.factor(res)) && length(res) > 1) {
x_expr <- as.character(list(rlang::quo_get_expr(x)))
x_expr <- paste0(
"pivot_wider(enframe(",
x_expr,
"), names_from = name, values_from = value)"
)
x <- rlang::quo_set_expr(x, str2lang(x_expr))
x
} else {
x_expr <- as.character(list(rlang::quo_get_expr(x)))
x_expr <- paste0("list(", x_expr,")")
x <- rlang::quo_set_expr(x, str2lang(x_expr))
x
}
})
data %>%
group_by(Parameter) %>%
summarise(!!! fns, .groups="drop")
}
# A function to automatically widen the df as much as possible while preserving rows
widen <- function(df) {
df_cols <- names(df)[map_lgl(df, is.data.frame)]
df <- unpack(df, all_of(df_cols), names_sep = "_")
try_tidy <- function(x) {
tryCatch({
broom::tidy(x)
}, error = function(e) {
x
})
}
df <- df %>% rowwise() %>% mutate(across(where(is.list), try_tidy))
ungroup(df)
}
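# if you want to specify function arguments for convenience use purrr::partial
# note: quantile() takes `probs`, not `q`; the `q` given here is silently ignored,
# which is why the output below shows the default quantiles (0% to 100%)
quantile3 <- partial(quantile, x = , q = c(.25, .5, .75))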
summary <- mySummary(toy,
Q = quantile3(Value),
R = range(Value),
T_test = t.test(Value),
Mean = mean(Value, na.rm=TRUE),
SD = sd(Value, na.rm=TRUE)
)
summary
#> # A tibble: 2 x 6
#> Parameter Q$`0%` $`25%` $`50%` $`75%` $`100%` R$`1` $`2` T_test Mean SD
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <dbl> <dbl>
#> 1 Height 1.54 1.62 1.73 1.77 1.90 1.54 1.90 <htest> 1.70 0.109
#> 2 Weight 67.5 72.9 76.9 83.2 91.7 67.5 91.7 <htest> 77.9 7.40
widen(summary)
#> # A tibble: 2 x 11
#> Parameter `Q_0%` `Q_25%` `Q_50%` `Q_75%` `Q_100%` R_1 R_2 T_test$estimate
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1.54 1.62 1.73 1.77 1.90 1.54 1.90 1.70
#> 2 Weight 67.5 72.9 76.9 83.2 91.7 67.5 91.7 77.9
#> # … with 9 more variables: $statistic <dbl>, $p.value <dbl>, $parameter <dbl>,
#> # $conf.low <dbl>, $conf.high <dbl>, $method <chr>, $alternative <chr>,
#> # Mean <dbl>, SD <dbl>
Created on 2020-06-14 by the reprex package (v0.3.0)
What if you change quibble2 to return a list, and then use unnest_wider?
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
list(quantile(x, q))
}
mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE)) %>%
unnest_wider(Q)
# A tibble: 6 x 7
Parameter Group `25%` `50%` `75%` Mean SD
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Height 1 1.62 1.66 1.73 1.70 0.108
2 Height 2 1.73 1.77 1.78 1.76 0.105
3 Height 3 1.55 1.64 1.76 1.65 0.109
4 Weight 1 75.6 80.6 84.3 80.0 9.05
5 Weight 2 75.4 76.9 79.6 77.4 7.27
6 Weight 3 70.7 75.2 82.0 76.3 6.94
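For anyone unsure what unnest_wider() does with such a list-column, here is a minimal sketch on a hand-built tibble (the data is invented; only the column name Q follows the question):

```r
library(tibble)
library(tidyr)

# Each element of Q is a named numeric vector, as returned by quantile()
tb <- tibble(
  g = 1:2,
  Q = list(
    quantile(1:9, c(0.25, 0.75)),
    quantile(11:19, c(0.25, 0.75))
  )
)

# The vector names become column names
wide <- unnest_wider(tb, Q)
wide
```

The names of each vector ("25%", "75%") are what end up as column headers, so all elements should carry the same names.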
I'd like to store functions, or at least their names, in a column of a data.frame for use in a call to mutate. A simplified broken example:
library(dplyr)
expand.grid(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd')) %>%
group_by(mu, sd, stat) %>%
mutate(sample = get(stat)(rnorm(100, mu, sd))) %>%
ungroup()
If this worked how I thought it would, the value of sample would be generated by the function in .GlobalEnv corresponding to either 'mean' or 'sd', depending on the row.
The error I get is:
Error in mutate_impl(.data, dots) :
Evaluation error: invalid first argument.
Surely this has to do with non-standard evaluation ... grrr.
A few issues here. First expand.grid will convert character values to factors. And get() doesn't like working with factors (ie get(factor("mean")) will give an error). The tidyverse-friendly version is tidyr::crossing(). (You could also pass stringsAsFactors=FALSE to expand.grid.)
Secondly, mutate() assumes that all functions you call are vectorized, but functions like get() are not vectorized; they need to be called one at a time. Rather than relying on group_by here, a safer way to guarantee one-at-a-time evaluation is to use rowwise().
And finally, your real problem is that you are trying to call get("sd") but when you do, sd also happens to be a column in your data.frame that is part of the mutate. Thus get() will find this sd first, and this sd is just a number, not a function. You'll need to tell get() to pull from the global environment explicitly. Try
tidyr::crossing(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd')) %>%
rowwise() %>%
mutate(sample = get(stat, envir = globalenv())(rnorm(100, mu, sd)))
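The two lookup problems described above can be reproduced in base R without any data frame. This is just a sketch; `lookup_demo` is a hypothetical helper that stands in for the situation inside mutate(), where a column called sd shadows the function.

```r
# 1) get() does not accept a factor as the object name
stat <- factor("mean")
factor_fails <- inherits(try(get(stat), silent = TRUE), "try-error")
factor_fails

# 2) A variable called `sd` hides the function sd() from a plain get()
lookup_demo <- function() {
  sd <- 10  # plays the role of the `sd` column
  list(
    plain = get("sd"),                        # finds the number 10
    as_function = get("sd", mode = "function") # skips non-functions, finds stats::sd
  )
}

res <- lookup_demo()
res$plain
is.function(res$as_function)
```

Passing mode = "function" is an alternative to fixing the environment explicitly; match.fun() uses a similar function-only lookup internally.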
Three problems (that I see): (1) expand.grid is giving you factors; (2) get finds variables, so using "sd" as a stat is being confused with the column names "sd" (that was hard to find!); and (3) this really is a row-wise operation, grouping isn't helping enough. The first is easily fixed with an option, the second can be fixed by using match.fun instead of get, and the third can be mitigated with dplyr::rowwise, purrr::pmap, or base R's mapply.
This helper function was useful during debugging and can be used to "clean up" the code within mutate, but it isn't required (for other than this demonstration). Inline "anonymous" functions will work as well.
func <- function(f,m,s) get(f)(rnorm(100,mean=m,sd=s))
Several implementation methods:
set.seed(0)
expand.grid(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd'),
stringsAsFactors=FALSE) %>%
group_by(mu, sd, stat) %>% # can also be `rowwise() %>%`
mutate(
sample0 = match.fun(stat)(rnorm(100, mu, sd)),
sample1 = purrr::pmap_dbl(list(stat, mu, sd), ~ match.fun(..1)(rnorm(100, ..2, ..3))),
sample2 = purrr::pmap_dbl(list(stat, mu, sd), func),
sample3 = mapply(function(f,m,s) match.fun(f)(rnorm(100,m,s)), stat, mu, sd),
sample4 = mapply(func, stat, mu, sd)
) %>%
ungroup()
# # A tibble: 20 x 8
# mu sd stat sample0 sample1 sample2 sample3 sample4
# <int> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 mean 1.02 1.03 0.896 1.08 0.855
# 2 2 1 mean 1.95 2.07 2.05 1.90 1.92
# 3 3 1 mean 2.93 3.07 3.03 2.89 3.01
# 4 4 1 mean 4.01 3.94 4.23 4.05 3.96
# 5 5 1 mean 5.04 5.11 5.05 5.17 5.19
# 6 1 10 mean 1.67 1.21 1.30 2.08 -0.641
# 7 2 10 mean 1.82 2.82 2.35 3.65 1.78
# 8 3 10 mean 1.45 3.10 3.15 4.28 2.58
# 9 4 10 mean 3.49 6.33 5.11 2.84 3.41
# 10 5 10 mean 5.33 4.85 4.07 5.58 6.66
# 11 1 1 sd 0.965 1.04 0.993 0.942 1.08
# 12 2 1 sd 0.974 0.967 0.981 0.984 1.15
# 13 3 1 sd 1.12 0.902 1.06 0.977 1.02
# 14 4 1 sd 0.946 0.928 0.960 1.01 0.992
# 15 5 1 sd 1.06 1.01 0.911 1.11 1.00
# 16 1 10 sd 9.46 8.95 10.0 8.91 9.60
# 17 2 10 sd 9.51 9.11 11.5 9.85 10.6
# 18 3 10 sd 9.77 9.96 11.0 9.09 10.7
# 19 4 10 sd 10.5 9.84 10.1 10.6 8.89
# 20 5 10 sd 11.2 8.82 10.4 9.06 9.64
sample0 happens to work because you have grouped it to be row-wise. If at some point any one grouping has two or more values, this will fail.
For sample1 through sample4, you can remove the group_by and it works equally well (sample0 would fail without the grouping, so remove it too). You won't get identical results to those above with the grouping removed, because the random numbers are then drawn in a different order.
Having problems with quoting and unquoting variable names within an R function that uses dplyr. Have been through this site as well as Hadley's Programming with dplyr site and it's still getting the best of me.
The function code that doesn't work is:
gcreatedata <- function(dataframe,depvar,iv1,iv2){
depvar <- enquo(depvar)
iv1 <- enquo(iv1)
iv2 <- enquo(iv2)
newdata <- dataframe %>%
mutate(!!iv1 := factor(!!iv1)) %>%
group_by(!!iv1, !!iv2) %>%
summarise(TheMean = mean(!!depvar,na.rm=TRUE),
TheSD = sd(!!depvar,na.rm=TRUE),
TheSEM = sd(!!depvar,na.rm=TRUE)/sqrt(length(!!depvar)),
CI95Muliplier = qt(.95/2 + .5, length(!!depvar)-1))
return(as_tibble(newdata))
}
Calling it with mtcars, it would be
sss <- gcreatedata(mtcars,mpg,am,cyl)
I'm simply trying to convert the variable am to a factor for use downstream in a ggplot. Yes I know I could do it before I enter the function but I want it generic. The function works minus the factor step just fine which you can see if you run this version.
gcreatedata <- function(dataframe,depvar,iv1,iv2){
depvar <- enquo(depvar)
iv1 <- enquo(iv1)
iv2 <- enquo(iv2)
newdata <- dataframe %>%
mutate(foo := factor(!!iv1)) %>%
group_by(foo, !!iv2) %>%
summarise(TheMean = mean(!!depvar,na.rm=TRUE),
TheSD = sd(!!depvar,na.rm=TRUE),
TheSEM = sd(!!depvar,na.rm=TRUE)/sqrt(length(!!depvar)),
CI95Muliplier = qt(.95/2 + .5, length(!!depvar)-1))
return(as_tibble(newdata))
}
sss <- gcreatedata(mtcars,mpg,am,cyl)
It returns what I want, except that am has become foo. How do I get the name right in the line mutate(!!iv1 := factor(!!iv1)) %>% ? Right now I'm getting an Error: LHS must be a name or string message, and despite all manner of combinations I could think of, no dice.
Thanks in advance.
Your situation is described in the tutorial part here: http://dplyr.tidyverse.org/articles/programming.html#different-input-and-output-variable
The following code works for me:
> library(dplyr)
>
> gcreatedata <- function(dataframe,depvar,iv1,iv2){
+ depvar <- enquo(depvar)
+ iv1_q <- enquo(iv1)
+ iv2 <- enquo(iv2)
+
+ iv1_name <- paste0("mean_", quo_name(iv1_q))
+
+ newdata <- dataframe %>%
+ mutate(!!iv1_name := factor(!!iv1_q)) %>%
+ group_by(!!iv1_q, !!iv2) %>%
+ summarise(TheMean = mean(!!depvar,na.rm=TRUE),
+ TheSD = sd(!!depvar,na.rm=TRUE),
+ TheSEM = sd(!!depvar,na.rm=TRUE)/sqrt(length(!!depvar)),
+ CI95Muliplier = qt(.95/2 + .5, length(!!depvar)-1))
+ return(as_tibble(newdata))
+ }
> sss <- gcreatedata(mtcars,mpg,am,cyl)
> sss
# A tibble: 6 x 6
# Groups: am [?]
am cyl TheMean TheSD TheSEM CI95Muliplier
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 4.00 22.9 1.45 0.839 4.30
2 0 6.00 19.1 1.63 0.816 3.18
3 0 8.00 15.0 2.77 0.801 2.20
4 1.00 4.00 28.1 4.48 1.59 2.36
5 1.00 6.00 20.6 0.751 0.433 4.30
6 1.00 8.00 15.4 0.566 0.400 12.7
Hope that helps!
Well 24 hours brought me a clearer head. Here's the answer should anyone else need it in the future...
gcreatedata <- function(dataframe,depvar,iv1,iv2){
depvar <- enquo(depvar)
iv1 <- enquo(iv1)
iv2 <- enquo(iv2)
newdata <- dataframe %>%
mutate(!!quo_name(iv1) := factor(!!iv1), !!quo_name(iv2) := factor(!!iv2)) %>%
group_by(!!iv1, !!iv2) %>%
summarise(TheMean = mean(!!depvar,na.rm=TRUE),
TheSD = sd(!!depvar,na.rm=TRUE),
TheSEM = sd(!!depvar,na.rm=TRUE)/sqrt(length(!!depvar)),
CI95Muliplier = qt(.95/2 + .5, length(!!depvar)-1))
return(as_tibble(newdata))
}
to test it on common data...
gcreatedata(mtcars,mpg,am,vs)
# A tibble: 4 x 6
# Groups: am [?]
am vs TheMean TheSD TheSEM CI95Muliplier
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 0 0 15.0 2.77 0.801 2.20
2 0 1 20.7 2.47 0.934 2.45
3 1 0 19.8 4.01 1.64 2.57
4 1 1 28.4 4.76 1.80 2.45
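For readers on newer versions, a hedged alternative sketch: with dplyr 1.0 or later and rlang 0.4 or later (an assumption about your installed versions), the embrace operator `{{ }}` together with across() avoids enquo(), `!!` and `:=` altogether. `summarise_by` is a hypothetical name, and the sketch computes only part of the original summary.

```r
library(dplyr)

summarise_by <- function(dataframe, depvar, iv1, iv2) {
  dataframe %>%
    # Convert both grouping variables to factors, keeping their names
    mutate(across(c({{ iv1 }}, {{ iv2 }}), factor)) %>%
    group_by({{ iv1 }}, {{ iv2 }}) %>%
    summarise(
      TheMean = mean({{ depvar }}, na.rm = TRUE),
      TheSD = sd({{ depvar }}, na.rm = TRUE),
      N = n(),
      .groups = "drop"
    )
}

res <- summarise_by(mtcars, mpg, am, vs)
res
```

Because across() modifies the columns in place, there is no need to build the left-hand-side name by hand.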
I have a large dataset I need to divide into multiple balanced sets.
The set looks something like the following:
> data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
> colnames(data)<-c("A","B","C","D","E","F","G","H")
The sets, each containing for example 20 rows, will need to be balanced across multiple variables so that each subset ends up having a similar mean of B, C, D that's included in their subgroup compared to all the other subsets.
Is there a way to do that with R? Any advice would be much appreciated. Thank you in advance!
library(tidyverse)
# Reproducible data
set.seed(2)
data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
colnames(data)<-c("A","B","C","D","E","F","G","H")
data=as.data.frame(data)
Updated Answer
It's probably not possible to get similar means across sets within each column if you want to keep observations from a given row together. With 8 columns (as in your sample data), you'd need 25 20-row sets where each column A set has the same mean, each column B set has the same mean, etc. That's a lot of constraints. Probably there are, however, algorithms that could find the set membership assignment schedule that minimizes the difference in set means.
However, if you can separately take 20 observations from each column without regard to which row it came from, then here's one option:
# Group into sets with same means
same_means = data %>%
gather(key, value) %>%
arrange(value) %>%
group_by(key) %>%
mutate(set = c(rep(1:25, 10), rep(25:1, 10)))
# Check means by set for each column
same_means %>%
group_by(key, set) %>%
summarise(mean=mean(value)) %>%
spread(key, mean) %>% as.data.frame
set A B C D E F G H
1 1 4.940018 5.018584 5.117592 4.931069 5.016401 5.171896 4.886093 5.047926
2 2 4.946496 5.018578 5.124084 4.936461 5.017041 5.172817 4.887383 5.048850
3 3 4.947443 5.021511 5.125649 4.929010 5.015181 5.173983 4.880492 5.044192
4 4 4.948340 5.014958 5.126480 4.922940 5.007478 5.175898 4.878876 5.042789
5 5 4.943010 5.018506 5.123188 4.924283 5.019847 5.174981 4.869466 5.046532
6 6 4.942808 5.019945 5.123633 4.924036 5.019279 5.186053 4.870271 5.044757
7 7 4.945312 5.022991 5.120904 4.919835 5.019173 5.187910 4.869666 5.041317
8 8 4.947457 5.024992 5.125821 4.915033 5.016782 5.187996 4.867533 5.043262
9 9 4.936680 5.020040 5.128815 4.917770 5.022527 5.180950 4.864416 5.043587
10 10 4.943435 5.022840 5.122607 4.921102 5.018274 5.183719 4.872688 5.036263
11 11 4.942015 5.024077 5.121594 4.921965 5.015766 5.185075 4.880304 5.045362
12 12 4.944416 5.024906 5.119663 4.925396 5.023136 5.183449 4.887840 5.044733
13 13 4.946751 5.020960 5.127302 4.923513 5.014100 5.186527 4.889140 5.048425
14 14 4.949517 5.011549 5.127794 4.925720 5.006624 5.188227 4.882128 5.055608
15 15 4.943008 5.013135 5.130486 4.930377 5.002825 5.194421 4.884593 5.051968
16 16 4.939554 5.021875 5.129392 4.930384 5.005527 5.197746 4.883358 5.052474
17 17 4.935909 5.019139 5.131258 4.922536 5.003273 5.204442 4.884018 5.059162
18 18 4.935830 5.022633 5.129389 4.927106 5.008391 5.210277 4.877859 5.054829
19 19 4.936171 5.025452 5.127276 4.927904 5.007995 5.206972 4.873620 5.054192
20 20 4.942925 5.018719 5.127394 4.929643 5.005699 5.202787 4.869454 5.055665
21 21 4.941351 5.014454 5.125727 4.932884 5.008633 5.205170 4.870352 5.047728
22 22 4.933846 5.019311 5.130156 4.923804 5.012874 5.213346 4.874263 5.056290
23 23 4.928815 5.021575 5.139077 4.923665 5.017180 5.211699 4.876333 5.056836
24 24 4.928739 5.024419 5.140386 4.925559 5.012995 5.214019 4.880025 5.055182
25 25 4.929357 5.025198 5.134391 4.930061 5.008571 5.217005 4.885442 5.062630
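Why the set means above come out so close: after sorting, the first half of the values is dealt out to sets 1 to 25 in order and the second half in reverse order, so every set receives the same total of ranks. A base R sketch of that idea on a single made-up column:

```r
set.seed(2)
value <- sort(runif(500, min = 0, max = 10))

# Deal sorted values forwards for the first 250, backwards for the last 250
set <- c(rep(1:25, 10), rep(25:1, 10))

# Every set gets 20 values and the same sum of ranks
table(set)
rank_sums <- tapply(seq_along(value), set, sum)
unique(rank_sums)

# Resulting per-set means are nearly identical
set_means <- tapply(value, set, mean)
range(set_means)
```

Equal rank sums do not guarantee equal means for arbitrary data, but for roughly evenly spread values like these the means end up very close.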
Original Answer
# Randomly group data into 20-row groups
set.seed(104)
data = data %>%
mutate(set = sample(rep(1:(500/20), each=20)))
head(data)
A B C D E F G H set
1 1.848823 6.920055 3.2283369 6.633721 6.794640 2.0288792 1.984295 2.09812642 10
2 7.023740 5.599569 0.4468325 5.198884 6.572196 0.9269249 9.700118 4.58840437 20
3 5.733263 3.426912 7.3168797 3.317611 8.301268 1.4466065 5.280740 0.09172101 19
4 1.680519 2.344975 4.9242313 6.163171 4.651894 2.2253335 1.175535 2.51299726 25
5 9.438393 4.296028 2.3563249 5.814513 1.717668 0.8130327 9.430833 0.68269106 19
6 9.434750 7.367007 1.2603451 5.952936 3.337172 5.2892300 5.139007 6.52763327 5
# Mean by set for each column
data %>% group_by(set) %>%
summarise_all(mean)
set A B C D E F G H
1 1 5.240236 6.143941 4.638874 5.367626 4.982008 4.200123 5.521844 5.083868
2 2 5.520983 5.257147 5.209941 4.504766 4.231175 3.642897 5.578811 6.439491
3 3 5.943011 3.556500 5.366094 4.583440 4.932206 4.725007 5.579103 5.420547
4 4 4.729387 4.755320 5.582982 4.763171 5.217154 5.224971 4.972047 3.892672
5 5 4.824812 4.527623 5.055745 4.556010 4.816255 4.426381 3.520427 6.398151
6 6 4.957994 7.517130 6.727288 4.757732 4.575019 6.220071 5.219651 5.130648
7 7 5.344701 4.650095 5.736826 5.161822 5.208502 5.645190 4.266679 4.243660
8 8 4.003065 4.578335 5.797876 4.968013 5.130712 6.192811 4.282839 5.669198
9 9 4.766465 4.395451 5.485031 4.577186 5.366829 5.653012 4.550389 4.367806
10 10 4.695404 5.295599 5.123817 5.358232 5.439788 5.643931 5.127332 5.089670
# ... with 15 more rows
If the total number of rows in the data frame is not divisible by the number of rows you want in each set, then you can do the following when you create the sets:
data = data %>%
mutate(set = sample(rep(1:ceiling(500/20), each=20))[1:n()])
In this case, the set sizes will vary a bit when the number of data rows is not divisible by the desired number of rows in each set.
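As a quick check of that behaviour, here is a base R sketch with an assumed 503 rows (a number picked only because it is not divisible by 20):

```r
n_rows <- 503    # hypothetical row count, not divisible by set_size
set_size <- 20

set.seed(104)
n_sets <- ceiling(n_rows / set_size)

# Build enough labels for full sets, shuffle, then keep only as many as there are rows
set <- sample(rep(seq_len(n_sets), each = set_size))[seq_len(n_rows)]

sizes <- table(set)
sizes
```

No set can exceed 20 rows; the shortfall is spread over whichever labels happened to be cut off after shuffling.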
The following approach could be worth trying for someone in a similar position.
It is based on the numerical balancing in groupdata2's fold() function, which allows creating groups with balanced means for a single column. By standardizing each of the columns and numerically balancing their rowwise sum, we might increase the chance of getting balanced means in the individual columns.
I compared this approach to creating groups randomly a few times and selecting the split with the least variance in means. It seems to be a bit better, but I'm not too convinced that this will hold in all contexts.
# Attach dplyr and groupdata2
library(dplyr)
library(groupdata2)
set.seed(1)
# Create the dataset
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- dplyr::as_tibble(data)
# Standardize all columns and calculate row sums
data_std <- data %>%
dplyr::mutate_all(.funs = function(x){(x-mean(x))/sd(x)}) %>%
dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create groups (new column called ".folds")
# We numerically balance the "total" column
data_std <- data_std %>%
groupdata2::fold(k = 25, num_col = "total") # k = 500/20=25
# Transfer the groups to the original (non-standardized) data frame
data$group <- data_std$.folds
# Check the means
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean)
> # A tibble: 25 x 9
> group A B C D E F G H
> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 1 4.48 5.05 4.80 5.65 5.04 4.60 5.12 4.85
> 2 2 5.57 5.17 3.21 5.46 4.46 5.89 5.06 4.79
> 3 3 4.33 6.02 4.57 6.18 4.76 3.79 5.94 3.71
> 4 4 4.51 4.62 4.62 5.27 4.65 5.41 5.26 5.23
> 5 5 4.55 5.10 4.19 5.41 5.28 5.39 5.57 4.23
> 6 6 4.82 4.74 6.10 4.34 4.82 5.08 4.89 4.81
> 7 7 5.88 4.49 4.13 3.91 5.62 4.75 5.46 5.26
> 8 8 4.11 5.50 5.61 4.23 5.30 4.60 4.96 5.35
> 9 9 4.30 3.74 6.45 5.60 3.56 4.92 5.57 5.32
> 10 10 5.26 5.50 4.35 5.29 4.53 4.75 4.49 5.45
> # … with 15 more rows
# Check the standard deviations of the means
# Could be used to compare methods
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd))
> # A tibble: 1 x 8
> A B C D E F G H
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.496 0.546 0.764 0.669 0.591 0.611 0.690 0.475
It might be best to compare the means and mean variances (or standard deviations as above) of different methods on the standardized data though. In that case, one could calculate the sum of the variances and minimize it.
data_std %>%
dplyr::select(-total) %>%
dplyr::group_by(.folds) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd)) %>%
sum()
> 1.643989
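The random baseline mentioned earlier (create random groups a few times and keep the split with the lowest score) can be sketched in base R with the same score, i.e. the sum over columns of the standard deviation of the group means on standardized data. This is my own illustrative version, not groupdata2 code; `score_split` is a hypothetical helper.

```r
set.seed(1)
data <- as.data.frame(matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8))
names(data) <- LETTERS[1:8]

# Standardize every column (mean 0, sd 1)
data_std <- scale(data)

# Score of a split: sum over columns of the sd of the group means
score_split <- function(groups) {
  group_means <- apply(data_std, 2, function(col) tapply(col, groups, mean))
  sum(apply(group_means, 2, sd))
}

# 20 random splits into 25 groups of 20 rows each
candidates <- replicate(20, sample(rep(1:25, each = 20)), simplify = FALSE)
scores <- vapply(candidates, score_split, numeric(1))

# Keep the split with the lowest score
best <- candidates[[which.min(scores)]]
min(scores)
```

Lower scores mean the group means are more alike across all columns, so the same function can be used to compare any two grouping methods.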
Comparing multiple balanced splits
The fold() function allows creating multiple unique grouping factors (splits) at once. So here, I will perform the numerically balanced split 20 times and find the grouping with the lowest sum of the standard deviations of the means. I'll further convert it to a function.
create_multi_balanced_groups <- function(data, cols, k, num_tries){
# Extract the variables of interest
# We assume these are numeric but we could add a check
data_to_balance <- data[, cols]
# Standardize all columns
# And calculate rowwise sums
data_std <- data_to_balance %>%
dplyr::mutate_all(.funs = function(x){(x-mean(x))/sd(x)}) %>%
dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create `num_tries` unique numerically balanced splits
data_std <- data_std %>%
groupdata2::fold(
k = k,
num_fold_cols = num_tries,
num_col = "total"
)
# The new fold column names ".folds_1", ".folds_2", etc.
fold_col_names <- paste0(".folds_", seq_len(num_tries))
# Remove total column
data_std <- data_std %>%
dplyr::select(-total)
# Calculate score for each split
# This could probably be done more efficiently without a for loop
variance_scores <- c()
for (fcol in fold_col_names){
score <- data_std %>%
dplyr::group_by(!!as.name(fcol)) %>%
dplyr::summarise(across(where(is.numeric), mean)) %>%
dplyr::summarise(across(where(is.numeric), sd)) %>%
sum()
variance_scores <- append(variance_scores, score)
}
# Get the fold column with the lowest score
lowest_fcol_index <- which.min(variance_scores)
best_fcol <- fold_col_names[[lowest_fcol_index]]
# Add the best fold column / grouping factor to the original data
data[["group"]] <- data_std[[best_fcol]]
# Return the original data and the score of the best fold column
list(data, min(variance_scores))
}
# Run with 20 splits
set.seed(1)
data_grouped_and_score <- create_multi_balanced_groups(
data = data,
cols = c("A", "B", "C", "D", "E", "F", "G", "H"),
k = 25,
num_tries = 20
)
# Check data
data_grouped_and_score[[1]]
> # A tibble: 500 x 9
> A B C D E F G H group
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
> 1 5.86 6.54 0.500 2.88 5.70 9.67 2.29 3.01 2
> 2 0.0895 4.69 5.71 0.343 8.95 7.73 5.76 9.58 1
> 3 2.94 1.78 2.06 6.66 9.54 0.600 4.26 0.771 16
> 4 2.77 1.52 0.723 8.11 8.95 1.37 6.32 6.24 7
> 5 8.14 2.49 0.467 8.51 0.889 6.28 4.47 8.63 13
> 6 2.60 8.23 9.17 5.14 2.85 8.54 8.94 0.619 23
> 7 7.24 0.260 6.64 8.35 8.59 0.0862 1.73 8.10 5
> 8 9.06 1.11 6.01 5.35 2.01 9.37 7.47 1.01 1
> 9 9.49 5.48 3.64 1.94 3.24 2.49 3.63 5.52 7
> 10 0.731 0.230 5.29 8.43 5.40 8.50 3.46 1.23 10
> # … with 490 more rows
# Check score
data_grouped_and_score[[2]]
> 1.552656
By commenting out the num_col = "total" line, we can run this without the numerical balancing. For me, this gave a score of 1.615257.
Disclaimer: I am the author of the groupdata2 package. The fold() function can also balance a categorical column (cat_col) and keep all data points with the same ID in the same fold (id_col) (e.g. to avoid leakage in cross-validation). There's a very similar partition() function as well.