Summarise multiple columns using a weighted t-test in R

I have the following data and want to calculate weighted p-values. I reviewed the question "dplyr summarise multiple columns using t.test", but my version should use weights. I can use Code 2 to do this for one column, but there are over 30 columns. How do I calculate weighted p-values efficiently?
Code 1
# A tibble: 877 x 5
   cat     population farms farmland weight
   <chr>        <dbl> <dbl>    <dbl>  <dbl>
 1 Treated       9.89  8.00     12.3  1
 2 Control      10.3   7.81     12.1  0.714
 3 Control      10.2   8.04     12.4  0.156
 4 Control      10.3   7.97     12.1  0.340
 5 Control      10.9   8.87     12.7  2.85
 6 Control      10.4   8.35     12.5  0.934
 7 Control      10.5   8.58     12.9  0.193
 8 Control      10.6   8.57     12.6  0.276
 9 Control      10.2   8.54     12.5  0.344
10 Control      10.5   8.76     12.6  0.625
# … with 867 more rows
Code 2
library(weights)
wtd.t.test(
  x = df$population[df$cat == "Treated"],
  y = df$population[df$cat == "Control"],
  weight  = df$weight[df$cat == "Treated"],
  weighty = df$weight[df$cat == "Control"])$coefficients[3]
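The df object itself isn't shared, so here is a hypothetical stand-in with the same shape, purely so the snippets in this question and the answers below are runnable (all distributions are made up to roughly mimic the printed tibble):
library(tibble)
set.seed(1)
n <- 877
df <- tibble(
  cat        = sample(c("Treated", "Control"), n, replace = TRUE, prob = c(0.2, 0.8)),
  population = rnorm(n, mean = 10.3, sd = 0.3),
  farms      = rnorm(n, mean = 8.3,  sd = 0.4),
  farmland   = rnorm(n, mean = 12.5, sd = 0.3),
  weight     = rexp(n, rate = 1.5)
)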

We can use summarise() with across():
library(dplyr)
df %>%
  summarise(across(c(population:farmland),
                   ~ weights::wtd.t.test(x = .[cat == 'Treated'],
                                         y = .[cat == 'Control'],
                                         weight  = weight[cat == 'Treated'],
                                         weighty = weight[cat == 'Control'])$coefficients[3]))
This returns a one-row tibble with one weighted p-value per column. Or, using lapply()/sapply() over the numeric columns:
sapply(df[2:4], function(v)
  weights::wtd.t.test(x = v[df$cat == "Treated"],
                      y = v[df$cat == "Control"],
                      weight  = df$weight[df$cat == "Treated"],
                      weighty = df$weight[df$cat == "Control"])$coefficients[3])
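If a long table of variable/p-value pairs is more convenient than a one-row tibble, the across() result can be pivoted; a minimal sketch (the pivot_longer() step is my addition, not part of the original answers):
library(dplyr)
library(tidyr)
df %>%
  summarise(across(population:farmland,
                   ~ weights::wtd.t.test(
                       x = .x[cat == "Treated"],
                       y = .x[cat == "Control"],
                       weight  = weight[cat == "Treated"],
                       weighty = weight[cat == "Control"])$coefficients[3])) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "p_value")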

Related

Run linear regression model on nested dataset after grouping (multiply imputed dataset)

I want to group a nested (multiply imputed) dataset and then apply linear regression to each dataset. I have tried a number of approaches, including the map option (2) and the for loop (3), with no luck at all. I want the model results to look like the output of mod1 below. Does anyone know what I could be doing wrong?
# get dependencies
library(mice)
library(tidyverse)
library(broom)
# impute the boys dataset from the mice package
boys_imp <- mice(boys)
# 1) I want to run a model like this on my multiply imputed dataset
mod1 <- boys %>%
  group_by(reg) %>%
  do(tidy(
    lm(data = .,
       formula = wgt ~ bmi),
    conf.int = TRUE))
mod1
# A tibble: 12 × 8
# Groups:   reg [6]
   reg   term        estimate std.error statistic  p.value conf.low conf.high
   <fct> <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
 1 north (Intercept)   -81.9      9.84      -8.32 2.48e-12  -101.      -62.3
 2 north bmi             6.84     0.500     13.7  2.53e-22     5.85      7.84
 3 east  (Intercept)   -75.3      7.62      -9.89 3.21e-18   -90.4     -60.3
 4 east  bmi             6.29     0.420     15.0  4.53e-32     5.46      7.12
 5 west  (Intercept)   -91.9      6.31     -14.6  2.48e-34  -104.      -79.4
 6 west  bmi             7.17     0.347     20.7  3.49e-54     6.49      7.86
 7 south (Intercept)   -79.8      6.73     -11.9  1.83e-24   -93.1     -66.5
 8 south bmi             6.47     0.373     17.3  1.63e-40     5.73      7.20
 9 city  (Intercept)   -92.0     13.9       -6.61 6.75e- 9  -120.      -64.2
10 city  bmi             6.95     0.757      9.18 1.39e-13     5.44      8.46
11 NA    (Intercept)   -88.6     43.8       -2.02 2.92e- 1  -645.      468.
12 NA    bmi             6.46     2.89       2.24 2.68e- 1   -30.2      43.1
# 2) the map way --------------------------------------------------------
mod_imp <- boys_imp %>%
  mice::complete("all") %>%
  map(group_by, reg) %>%
  map(lm, formula = wgt ~ bmi) %>%
  pool()
summary(mod_imp)
         term   estimate std.error statistic       df p.value
1 (Intercept) -85.473428 3.5511961 -24.06891 715.1703       0
2         bmi   6.793622 0.1945322  34.92287 693.7835       0
# 3) the for loop way ----------------------------------------------------
# complete the mids object into a list of data frames
boys_imp2 <- boys_imp %>%
  mice::complete("all")
dat1 <- replicate(length(boys_imp2), NULL) # preallocate a list of the same size
# run the for loop
for (i in seq_along(boys_imp2)) {
  dat1[[i]] <- boys_imp2[[i]] %>%
    group_by(reg) %>%
    do(lm(wgt ~ bmi, data = boys_imp2[[i]]))
}
Error in `do()`:
! Results 1, 2, 3, 4, 5, ... must be data frames, not lm.
Run `rlang::last_error()` to see where the error occurred.
I have found a solution to the problem. It involves grouping the data by the imputation number (.imp) and the variable of interest, mapping lm() onto each subset, pooling per group, and finishing by unnesting the results. (For reference: approach 2 fails to respect the grouping because lm() ignores dplyr groups, so pool() pools one overall model, and approach 3 fails because do() expects each result to be a data frame, not an lm object.)
boys_imp %>%
  mice::complete("long", include = FALSE) %>%
  group_by(.imp, reg) %>%
  nest() %>%
  mutate(lm_model = map(data, ~ lm(bmi ~ phb, data = .))) %>%
  group_by(reg) %>%
  summarise(model = list(tidy(pool(lm_model), conf.int = TRUE))) %>%
  unnest_wider(model) %>%
  unnest(cols = c(term, estimate, std.error,
                  statistic, p.value, conf.low, conf.high))
# A tibble: 30 × 16
   reg   term        estimate std.error statistic  p.value conf.low conf.high
   <fct> <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
 1 north (Intercept)   19.3       0.332    57.9   0           18.6     19.9
 2 north phb.L          5.10      0.678     7.53  1.81e-10     3.75     6.46
 3 north phb.Q          1.25      0.800     1.56  1.24e- 1    -0.357    2.86
 4 north phb.C         -0.430     0.882    -0.487 6.30e- 1    -2.25     1.39
 5 north phb^4         -1.10      0.948    -1.16  2.57e- 1    -3.07     0.862
 6 north phb^5         -0.156     1.08     -0.144 8.87e- 1    -2.41     2.10
 7 east  (Intercept)   18.7       0.244    76.8   0           18.3     19.2
 8 east  phb.L          4.83      0.509     9.48  4.44e-15     3.82     5.84
 9 east  phb.Q          1.10      0.692     1.60  1.27e- 1    -0.343    2.55
10 east  phb.C         -0.518     0.671    -0.772 4.49e- 1    -1.91     0.878
# … with 20 more rows, and 8 more variables: b, df, dfcom, fmi, lambda, m, riv, ubar
# ℹ Use `print(n = ...)` to see more rows
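As an aside, for a single ungrouped model, mice's own with()/pool() workflow is much shorter. This is a sketch of that standard pattern; note it does not solve the per-region grouping, which is what the nest()/map() approach above adds:
library(mice)
fit <- with(boys_imp, lm(wgt ~ bmi))  # fit the model on each imputed dataset
summary(pool(fit), conf.int = TRUE)   # pool the fits with Rubin's rules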

Unlisting multiple columns in R data frame

I have a dataframe with values for multiple macro variables. When I compute the log of the values and then the log differences, it changes the variables into lists, causing problems with my script later on.
Example code:
#Compute log of relevant macrovariables
macro[,c("hp", "unem", "m1", "inc")] <- log(macro[,c("hp", "unem", "m1", "inc")])
colnames(macro)[2:5] <- paste(colnames(macro)[2:5], "log", sep = "_")
#Computing log differences
macro$ldiff_hp <- c(-diff(macro$hp_log), na.omit)
I'm trying to unlist the columns and convert them to numeric with either of the following:
#Alternative 1
macro[,15:19]<- unlist(as.numeric(macro[,15:19]))
#Alternative 2
macro[,15:19] <- sapply(macro[,15:19],as.numeric)
It gives me the following error output:
> macro[,15:19] <- unlist(as.numeric(macro[,15:19]))
Error in unlist(as.numeric(macro[, 15:19])) :
  (list) object cannot be coerced to type 'double'
(The error occurs because macro[, 15:19] is itself a data frame, i.e. a list, which as.numeric() cannot coerce; and the column became a list in the first place because c(-diff(macro$hp_log), na.omit) appends the function na.omit to the vector.) Using the economics dataset from ggplot2 as example data and making use of dplyr's lag() function, the log-differenced variables can be computed like so:
library(ggplot2)
library(dplyr)
macro <- ggplot2::economics
vars <- c("uempmed", "psavert")
vars_log <- paste(vars, "log", sep = "_")
vars_ldiff <- paste(vars, "ldiff", sep = "_")
#Compute log of relevant macrovariables
macro[, vars_log] <- sapply(macro[, vars], log)
# Lag values
macro[, vars_ldiff] <- sapply(macro[, vars_log], dplyr::lag)
# First Difference of logs
macro[, vars_ldiff] <- macro[, vars_log] - macro[, vars_ldiff]
macro
#> # A tibble: 574 x 10
#>    date         pce    pop psavert uempmed unemploy uempmed_log psavert_log
#>    <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>       <dbl>       <dbl>
#>  1 1967-07-01  507. 198712    12.6     4.5     2944        1.50        2.53
#>  2 1967-08-01  510. 198911    12.6     4.7     2945        1.55        2.53
#>  3 1967-09-01  516. 199113    11.9     4.6     2958        1.53        2.48
#>  4 1967-10-01  512. 199311    12.9     4.9     3143        1.59        2.56
#>  5 1967-11-01  517. 199498    12.8     4.7     3066        1.55        2.55
#>  6 1967-12-01  525. 199657    11.8     4.8     3018        1.57        2.47
#>  7 1968-01-01  531. 199808    11.7     5.1     2878        1.63        2.46
#>  8 1968-02-01  534. 199920    12.3     4.5     3001        1.50        2.51
#>  9 1968-03-01  544. 200056    11.7     4.1     2877        1.41        2.46
#> 10 1968-04-01  544  200208    12.3     4.6     2709        1.53        2.51
#> # ... with 564 more rows, and 2 more variables: uempmed_ldiff <dbl>,
#> #   psavert_ldiff <dbl>
Created on 2020-03-23 by the reprex package (v0.3.0)
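For what it's worth, on dplyr >= 1.0 the same computation can be written more compactly with mutate() and across(); this is a sketch of an equivalent alternative, not part of the original answer:
library(dplyr)
macro <- ggplot2::economics
vars  <- c("uempmed", "psavert")
macro <- macro %>%
  mutate(
    across(all_of(vars), log, .names = "{.col}_log"),                        # logs
    across(all_of(vars), ~ log(.x) - lag(log(.x)), .names = "{.col}_ldiff")  # log differences
  )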

Export results from split and lapply to a csv or Excel file

I'm using the split() and lapply() functions to run Mann-Kendall trend tests in bulk. In the code below, split() separates the results (ConcLow) by Analyte (water quality parameter). Then lapply() runs MannKendall() and summary() for each. The output goes to the console (example shown below the code), but I'd like it to go into an Excel or CSV file so I can work with it. Ideally the Excel document would have the analyte (TOC, for example) in the first column, the tau value in the second column, and the p-value in the third. The next tab (or the following columns) would display the results from the summary() function. Any assistance you can provide is greatly appreciated! I'm quite new to R.
mk.analyte <- split(BarkTop$ConcLow, BarkTop$Analyte)
lapply(mk.analyte, MannKendall)
lapply(mk.analyte, summary)
Output for each analyte looks like this (abbreviated here, but it's a long list):
$TOC
tau = 0.0108, 2-sided pvalue =0.8081
$TOC
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.378   2.054   2.255   2.434   2.600   4.530
Data look like this:
Date       Location    Analyte  ConcLow Units
5/8/2000   Barker Res. Hardness    3.34 mg/L (as CaCO3)
11/24/2000 Barker Res. Hardness    9.47 mg/L (as CaCO3)
6/12/2001  Barker Res. Hardness    1.4  mg/L (as CaCO3)
12/29/2001 Barker Res. Hardness   21.9  mg/L (as CaCO3)
7/17/2002  Barker Res. Fe (diss   81    ug/L
2/2/2003   Barker Res. Fe (diss   90    ug/L
8/21/2003  Barker Res. Fe (diss    0.08 ug/L
3/8/2004   Barker Res. Fe (diss   15.748 ug/L
9/24/2004  Barker Res. TSS         6.2  mg/L
4/12/2005  Barker Res. TSS         8    mg/L
10/29/2005 Barker Res. TSS        10    mg/L
In my own opinion, I would use the tidyverse, as it is easier to read.
Short way:
library(tidyverse)
# Sample data
set.seed(42)
df <- data.frame(
  Location = replicate(1000, sample(letters[1:15], 1)),
  Analyte  = replicate(1000, sample(c("Hardness", "TSS", "Fe"), 1)),
  ConcLow  = runif(1000, 1, 30))
# Solution
df %>%
  nest(-Location, -Analyte) %>%
  mutate(
    mannKendall = purrr::map(data, function(x) {
      broom::tidy(Kendall::MannKendall(x$ConcLow))}),
    sumData = purrr::map(data, function(x) {
      broom::tidy(summary(x$ConcLow))})) %>%
  select(-data) %>%
  unnest(mannKendall, sumData) %>%
  write_excel_csv(path = "mydata.xls")
# What the table looks like:
# A tibble: 45 x 13
   Location Analyte statistic  p.value kendall_score denominator var_kendall_sco~ minimum    q1 median
   <fct>    <fct>       <dbl>    <dbl>         <dbl>       <dbl>            <dbl>   <dbl> <dbl>  <dbl>
 1 n        Fe        0.264    0.0907             61        231.            1258.    1.38 14.4    20.6
 2 o        Hardne~   0.0870   0.568              24        276.            1625.    2.02  9.52   18.3
 3 e        Fe       -0.108    0.499             -25        231.            1258.    1.14  9.24   15.9
 4 m        TSS      -0.00654  1                  -1        153              697     2.19  5.89   10.4
 5 j        TSS      -0.158    0.363             -27        171.             817     1.20  6.44   12.8
 6 h        Hardne~   0.0909   0.466              48        528             4165.    4.28 11.1    19.4
 7 l        TSS      -0.0526   0.780              -9        171.             817     5.39 12.5    21.1
 8 c        Fe       -0.0736   0.652             -17        231.            1258.    1.63  5.87   10.6
 9 j        Hardne~   0.415    0.0143             71        171.             817     4.50 11.7    15.4
10 k        Fe       -0.146    0.342             -37        253.            1434.    2.68 12.3    15.4
# ... with 35 more rows, and 3 more variables: mean <dbl>, q3 <dbl>, maximum <dbl>
Long way
It's a bit backwards, but you can do something like the below.
Please note that I used a subset of the mtcars dataset for my solution.
require(tidyverse)
df <- mtcars %>%
  select(cyl, disp)
wilx <- df %>%
  split(.$cyl) %>%
  map(function(x) {broom::tidy(wilcox.test(x$disp, paired = FALSE,
                                           exact = FALSE))})
sumData <- df %>%
  split(.$cyl) %>%
  map(function(x) {summary(x$disp)})
for (i in seq_along(wilx)) {
  write_excel_csv(as.data.frame(wilx[i]), path = paste0(getwd(), "/wilx", i, ".xls"))
  write_excel_csv(as.data.frame(unlist(sumData[i])), path = paste0(getwd(), "/sumData", i, ".xls"))
}
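Note that readr::write_excel_csv() writes a CSV (with a byte-order mark so Excel reads UTF-8 correctly), not a true Excel workbook. If you want a single .xlsx file with one tab per result, the writexl package accepts a named list of data frames, one per sheet; a minimal sketch reusing the wilx and sumData objects from above (writexl is my suggestion, not part of the original answer):
library(dplyr)
library(writexl)
write_xlsx(
  list(
    wilcoxon  = bind_rows(wilx, .id = "cyl"),                      # one row per cyl group
    summaries = bind_rows(lapply(sumData, broom::tidy), .id = "cyl")
  ),
  path = "results.xlsx"
)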

Lookup function for mutate in data

I'd like to store functions, or at least their names, in a column of a data.frame for use in a call to mutate. A simplified broken example:
library(dplyr)
expand.grid(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd')) %>%
  group_by(mu, sd, stat) %>%
  mutate(sample = get(stat)(rnorm(100, mu, sd))) %>%
  ungroup()
If this worked how I thought it would, the value of sample would be generated by the function in .GlobalEnv corresponding to either 'mean' or 'sd', depending on the row.
The error I get is:
Error in mutate_impl(.data, dots) :
  Evaluation error: invalid first argument.
Surely this has to do with non-standard evaluation ... grrr.
A few issues here. First, expand.grid() will convert character values to factors, and get() doesn't like working with factors (i.e. get(factor("mean")) will give an error). The tidyverse-friendly alternative is tidyr::crossing(). (You could also pass stringsAsFactors = FALSE to expand.grid().)
Secondly, mutate() assumes that all functions you call are vectorized, but functions like get() are not; they need to be called one at a time. A safer way to guarantee one-at-a-time evaluation than the group_by() used here is rowwise().
And finally, your real problem is that you are trying to call get("sd"), but sd also happens to be a column in your data frame that is part of the mutate(). Thus get() will find this sd first, and this sd is just a number, not a function. You'll need to tell get() to pull from the global environment explicitly. Try
tidyr::crossing(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd')) %>%
  rowwise() %>%
  mutate(sample = get(stat, envir = globalenv())(rnorm(100, mu, sd)))
Three problems (that I see): (1) expand.grid() is giving you factors; (2) get() finds variables, so using "sd" as a stat is being confused with the column name "sd" (that was hard to find!); and (3) this really is a row-wise operation, and grouping isn't helping enough. The first is easily fixed with an option, the second can be fixed by using match.fun() instead of get(), and the third can be mitigated with dplyr::rowwise(), purrr::pmap(), or base R's mapply().
This helper function was useful during debugging and can be used to "clean up" the code within mutate(), but it isn't required (for anything other than this demonstration). Inline "anonymous" functions will work as well.
func <- function(f,m,s) get(f)(rnorm(100,mean=m,sd=s))
Several implementation methods:
set.seed(0)
expand.grid(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd'),
            stringsAsFactors = FALSE) %>%
  group_by(mu, sd, stat) %>% # can also be `rowwise() %>%`
  mutate(
    sample0 = match.fun(stat)(rnorm(100, mu, sd)),
    sample1 = purrr::pmap_dbl(list(stat, mu, sd), ~ match.fun(..1)(rnorm(100, ..2, ..3))),
    sample2 = purrr::pmap_dbl(list(stat, mu, sd), func),
    sample3 = mapply(function(f, m, s) match.fun(f)(rnorm(100, m, s)), stat, mu, sd),
    sample4 = mapply(func, stat, mu, sd)
  ) %>%
  ungroup()
# # A tibble: 20 x 8
#       mu    sd stat  sample0 sample1 sample2 sample3 sample4
#    <int> <dbl> <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#  1     1     1 mean    1.02    1.03    0.896   1.08    0.855
#  2     2     1 mean    1.95    2.07    2.05    1.90    1.92
#  3     3     1 mean    2.93    3.07    3.03    2.89    3.01
#  4     4     1 mean    4.01    3.94    4.23    4.05    3.96
#  5     5     1 mean    5.04    5.11    5.05    5.17    5.19
#  6     1    10 mean    1.67    1.21    1.30    2.08   -0.641
#  7     2    10 mean    1.82    2.82    2.35    3.65    1.78
#  8     3    10 mean    1.45    3.10    3.15    4.28    2.58
#  9     4    10 mean    3.49    6.33    5.11    2.84    3.41
# 10     5    10 mean    5.33    4.85    4.07    5.58    6.66
# 11     1     1 sd      0.965   1.04    0.993   0.942   1.08
# 12     2     1 sd      0.974   0.967   0.981   0.984   1.15
# 13     3     1 sd      1.12    0.902   1.06    0.977   1.02
# 14     4     1 sd      0.946   0.928   0.960   1.01    0.992
# 15     5     1 sd      1.06    1.01    0.911   1.11    1.00
# 16     1    10 sd      9.46    8.95   10.0     8.91    9.60
# 17     2    10 sd      9.51    9.11   11.5     9.85   10.6
# 18     3    10 sd      9.77    9.96   11.0     9.09   10.7
# 19     4    10 sd     10.5     9.84   10.1    10.6     8.89
# 20     5    10 sd     11.2     8.82   10.4     9.06    9.64
sample0 happens to work because the grouping makes each group effectively row-wise. If at some point any one grouping has two or more rows, this will fail.
For sample1 through sample4, you can remove the group_by() and they work equally well (though sample0 demonstrates its failure then, so remove it too). You won't get results identical to those above with the grouping removed, because the random-number entropy is consumed differently.
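An alternative that sidesteps get() and its scoping pitfalls entirely is to keep the candidate functions in a named list and index into it, so a column named sd can never shadow the function. A minimal sketch (fun_table is a name introduced here for illustration, not from the answers above):
library(dplyr)
library(purrr)
library(tidyr)

fun_table <- list(mean = mean, sd = sd)  # explicit lookup table of functions

crossing(mu = 1:5, sd = c(1, 10), stat = c("mean", "sd")) %>%
  mutate(sample = pmap_dbl(
    list(stat, mu, sd),
    function(f, m, s) fun_table[[f]](rnorm(100, m, s))
  ))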

How to divide dataset into balanced sets based on multiple variables

I have a large dataset I need to divide into multiple balanced sets.
The set looks something like the following:
> data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
> colnames(data)<-c("A","B","C","D","E","F","G","H")
The sets, each containing for example 20 rows, need to be balanced across multiple variables, so that each subset ends up with means of B, C, and D similar to those of all the other subsets.
Is there a way to do that with R? Any advice would be much appreciated. Thank you in advance!
library(tidyverse)
# Reproducible data
set.seed(2)
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- as.data.frame(data)
Updated Answer
It's probably not possible to get similar means across sets within each column if you want to keep the observations in a given row together. With 8 columns (as in your sample data), you'd need 25 twenty-row sets where each column-A set has the same mean, each column-B set has the same mean, etc. That's a lot of constraints. There probably are, however, algorithms that could find the set-membership assignment that minimizes the differences in set means.
However, if you can take 20 observations from each column separately, without regard to which rows they came from, then here's one option:
# Group into sets with same means
same_means = data %>%
  gather(key, value) %>%
  arrange(value) %>%
  group_by(key) %>%
  mutate(set = c(rep(1:25, 10), rep(25:1, 10))) # "serpentine" assignment over the sorted values
# Check means by set for each column
same_means %>%
  group_by(key, set) %>%
  summarise(mean = mean(value)) %>%
  spread(key, mean) %>%
  as.data.frame
   set        A        B        C        D        E        F        G        H
1    1 4.940018 5.018584 5.117592 4.931069 5.016401 5.171896 4.886093 5.047926
2    2 4.946496 5.018578 5.124084 4.936461 5.017041 5.172817 4.887383 5.048850
3    3 4.947443 5.021511 5.125649 4.929010 5.015181 5.173983 4.880492 5.044192
4    4 4.948340 5.014958 5.126480 4.922940 5.007478 5.175898 4.878876 5.042789
5    5 4.943010 5.018506 5.123188 4.924283 5.019847 5.174981 4.869466 5.046532
6    6 4.942808 5.019945 5.123633 4.924036 5.019279 5.186053 4.870271 5.044757
7    7 4.945312 5.022991 5.120904 4.919835 5.019173 5.187910 4.869666 5.041317
8    8 4.947457 5.024992 5.125821 4.915033 5.016782 5.187996 4.867533 5.043262
9    9 4.936680 5.020040 5.128815 4.917770 5.022527 5.180950 4.864416 5.043587
10  10 4.943435 5.022840 5.122607 4.921102 5.018274 5.183719 4.872688 5.036263
11  11 4.942015 5.024077 5.121594 4.921965 5.015766 5.185075 4.880304 5.045362
12  12 4.944416 5.024906 5.119663 4.925396 5.023136 5.183449 4.887840 5.044733
13  13 4.946751 5.020960 5.127302 4.923513 5.014100 5.186527 4.889140 5.048425
14  14 4.949517 5.011549 5.127794 4.925720 5.006624 5.188227 4.882128 5.055608
15  15 4.943008 5.013135 5.130486 4.930377 5.002825 5.194421 4.884593 5.051968
16  16 4.939554 5.021875 5.129392 4.930384 5.005527 5.197746 4.883358 5.052474
17  17 4.935909 5.019139 5.131258 4.922536 5.003273 5.204442 4.884018 5.059162
18  18 4.935830 5.022633 5.129389 4.927106 5.008391 5.210277 4.877859 5.054829
19  19 4.936171 5.025452 5.127276 4.927904 5.007995 5.206972 4.873620 5.054192
20  20 4.942925 5.018719 5.127394 4.929643 5.005699 5.202787 4.869454 5.055665
21  21 4.941351 5.014454 5.125727 4.932884 5.008633 5.205170 4.870352 5.047728
22  22 4.933846 5.019311 5.130156 4.923804 5.012874 5.213346 4.874263 5.056290
23  23 4.928815 5.021575 5.139077 4.923665 5.017180 5.211699 4.876333 5.056836
24  24 4.928739 5.024419 5.140386 4.925559 5.012995 5.214019 4.880025 5.055182
25  25 4.929357 5.025198 5.134391 4.930061 5.008571 5.217005 4.885442 5.062630
Original Answer
# Randomly group data into 20-row groups
set.seed(104)
data = data %>%
  mutate(set = sample(rep(1:(500/20), each = 20)))
head(data)
         A        B         C        D        E         F        G          H set
1 1.848823 6.920055 3.2283369 6.633721 6.794640 2.0288792 1.984295 2.09812642  10
2 7.023740 5.599569 0.4468325 5.198884 6.572196 0.9269249 9.700118 4.58840437  20
3 5.733263 3.426912 7.3168797 3.317611 8.301268 1.4466065 5.280740 0.09172101  19
4 1.680519 2.344975 4.9242313 6.163171 4.651894 2.2253335 1.175535 2.51299726  25
5 9.438393 4.296028 2.3563249 5.814513 1.717668 0.8130327 9.430833 0.68269106  19
6 9.434750 7.367007 1.2603451 5.952936 3.337172 5.2892300 5.139007 6.52763327   5
# Mean by set for each column
data %>%
  group_by(set) %>%
  summarise_all(mean)
   set        A        B        C        D        E        F        G        H
1    1 5.240236 6.143941 4.638874 5.367626 4.982008 4.200123 5.521844 5.083868
2    2 5.520983 5.257147 5.209941 4.504766 4.231175 3.642897 5.578811 6.439491
3    3 5.943011 3.556500 5.366094 4.583440 4.932206 4.725007 5.579103 5.420547
4    4 4.729387 4.755320 5.582982 4.763171 5.217154 5.224971 4.972047 3.892672
5    5 4.824812 4.527623 5.055745 4.556010 4.816255 4.426381 3.520427 6.398151
6    6 4.957994 7.517130 6.727288 4.757732 4.575019 6.220071 5.219651 5.130648
7    7 5.344701 4.650095 5.736826 5.161822 5.208502 5.645190 4.266679 4.243660
8    8 4.003065 4.578335 5.797876 4.968013 5.130712 6.192811 4.282839 5.669198
9    9 4.766465 4.395451 5.485031 4.577186 5.366829 5.653012 4.550389 4.367806
10  10 4.695404 5.295599 5.123817 5.358232 5.439788 5.643931 5.127332 5.089670
# ... with 15 more rows
If the total number of rows in the data frame is not divisible by the number of rows you want in each set, then you can do the following when you create the sets:
data = data %>%
  mutate(set = sample(rep(1:ceiling(500/20), each = 20))[1:n()])
In this case, the set sizes will vary a bit when the number of data rows is not divisible by the desired number of rows in each set.
The following approach could be worth trying for someone in a similar position.
It is based on the numerical balancing in groupdata2's fold() function, which allows creating groups with balanced means for a single column. By standardizing each of the columns and numerically balancing their rowwise sum, we might increase the chance of getting balanced means in the individual columns.
I compared this approach to creating groups randomly a few times and selecting the split with the least variance in means. It seems to be a bit better, but I'm not too convinced that this will hold in all contexts.
# Attach dplyr and groupdata2
library(dplyr)
library(groupdata2)
set.seed(1)
# Create the dataset
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- dplyr::as_tibble(data)
# Standardize all columns and calculate row sums
data_std <- data %>%
  dplyr::mutate_all(.funs = function(x){(x - mean(x)) / sd(x)}) %>%
  dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create groups (new column called ".folds")
# We numerically balance the "total" column
data_std <- data_std %>%
  groupdata2::fold(k = 25, num_col = "total") # k = 500/20 = 25
# Transfer the groups to the original (non-standardized) data frame
data$group <- data_std$.folds
# Check the means
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise_all(.funs = mean)
> # A tibble: 25 x 9
>    group     A     B     C     D     E     F     G     H
>    <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
>  1 1      4.48  5.05  4.80  5.65  5.04  4.60  5.12  4.85
>  2 2      5.57  5.17  3.21  5.46  4.46  5.89  5.06  4.79
>  3 3      4.33  6.02  4.57  6.18  4.76  3.79  5.94  3.71
>  4 4      4.51  4.62  4.62  5.27  4.65  5.41  5.26  5.23
>  5 5      4.55  5.10  4.19  5.41  5.28  5.39  5.57  4.23
>  6 6      4.82  4.74  6.10  4.34  4.82  5.08  4.89  4.81
>  7 7      5.88  4.49  4.13  3.91  5.62  4.75  5.46  5.26
>  8 8      4.11  5.50  5.61  4.23  5.30  4.60  4.96  5.35
>  9 9      4.30  3.74  6.45  5.60  3.56  4.92  5.57  5.32
> 10 10     5.26  5.50  4.35  5.29  4.53  4.75  4.49  5.45
> # … with 15 more rows
# Check the standard deviations of the means
# Could be used to compare methods
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise_all(.funs = mean) %>%
  dplyr::summarise(across(where(is.numeric), sd))
> # A tibble: 1 x 8
>       A     B     C     D     E     F     G     H
>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.496 0.546 0.764 0.669 0.591 0.611 0.690 0.475
It might be best, though, to compare the means and mean variances (or standard deviations, as above) of the different methods on the standardized data. In that case, one could calculate the sum of the variances and minimize it.
data_std %>%
  dplyr::select(-total) %>%
  dplyr::group_by(.folds) %>%
  dplyr::summarise_all(.funs = mean) %>%
  dplyr::summarise(across(where(is.numeric), sd)) %>%
  sum()
> 1.643989
Comparing multiple balanced splits
The fold() function allows creating multiple unique grouping factors (splits) at once. So here, I will perform the numerically balanced split 20 times and find the grouping with the lowest sum of the standard deviations of the means. I'll further convert it to a function.
create_multi_balanced_groups <- function(data, cols, k, num_tries){
  # Extract the variables of interest
  # We assume these are numeric but we could add a check
  data_to_balance <- data[, cols]
  # Standardize all columns
  # and calculate rowwise sums
  data_std <- data_to_balance %>%
    dplyr::mutate_all(.funs = function(x){(x - mean(x)) / sd(x)}) %>%
    dplyr::mutate(total = rowSums(across(where(is.numeric))))
  # Create `num_tries` unique numerically balanced splits
  data_std <- data_std %>%
    groupdata2::fold(
      k = k,
      num_fold_cols = num_tries,
      num_col = "total"
    )
  # The new fold columns are named ".folds_1", ".folds_2", etc.
  fold_col_names <- paste0(".folds_", seq_len(num_tries))
  # Remove the total column
  data_std <- data_std %>%
    dplyr::select(-total)
  # Calculate the score for each split
  # This could probably be done more efficiently without a for loop
  variance_scores <- c()
  for (fcol in fold_col_names){
    score <- data_std %>%
      dplyr::group_by(!!as.name(fcol)) %>%
      dplyr::summarise(across(where(is.numeric), mean)) %>%
      dplyr::summarise(across(where(is.numeric), sd)) %>%
      sum()
    variance_scores <- append(variance_scores, score)
  }
  # Get the fold column with the lowest score
  lowest_fcol_index <- which.min(variance_scores)
  best_fcol <- fold_col_names[[lowest_fcol_index]]
  # Add the best fold column / grouping factor to the original data
  data[["group"]] <- data_std[[best_fcol]]
  # Return the original data and the score of the best fold column
  list(data, min(variance_scores))
}
# Run with 20 splits
set.seed(1)
data_grouped_and_score <- create_multi_balanced_groups(
  data = data,
  cols = c("A", "B", "C", "D", "E", "F", "G", "H"),
  k = 25,
  num_tries = 20
)
# Check the data
data_grouped_and_score[[1]]
> # A tibble: 500 x 9
>         A     B      C     D     E      F     G     H group
>     <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <fct>
>  1 5.86   6.54  0.500  2.88  5.70  9.67    2.29 3.01  2
>  2 0.0895 4.69  5.71   0.343 8.95  7.73    5.76 9.58  1
>  3 2.94   1.78  2.06   6.66  9.54  0.600   4.26 0.771 16
>  4 2.77   1.52  0.723  8.11  8.95  1.37    6.32 6.24  7
>  5 8.14   2.49  0.467  8.51  0.889 6.28    4.47 8.63  13
>  6 2.60   8.23  9.17   5.14  2.85  8.54    8.94 0.619 23
>  7 7.24   0.260 6.64   8.35  8.59  0.0862  1.73 8.10  5
>  8 9.06   1.11  6.01   5.35  2.01  9.37    7.47 1.01  1
>  9 9.49   5.48  3.64   1.94  3.24  2.49    3.63 5.52  7
> 10 0.731  0.230 5.29   8.43  5.40  8.50    3.46 1.23  10
> # … with 490 more rows
# Check score
data_grouped_and_score[[2]]
> 1.552656
By commenting out the num_col = "total" line, we can run this without the numerical balancing. For me, this gave a score of 1.615257.
Disclaimer: I am the author of the groupdata2 package. The fold() function can also balance a categorical column (cat_col) and keep all data points with the same ID in the same fold (id_col) (e.g. to avoid leakage in cross-validation). There's a very similar partition() function as well.
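A minimal sketch of those two options together (the data frame and column names here are hypothetical, for illustration only):
library(groupdata2)
folded <- fold(
  patient_data,            # hypothetical data frame
  k = 5,
  cat_col = "diagnosis",   # keep the class distribution similar across folds
  id_col  = "participant"  # keep all rows for one participant in the same fold
)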
