Looping over multiple values and storing results in a data frame - r

I am trying to calculate some macronutrient values obtained from a reference table. These are my inputs:
DF1. The values indicate the number of portions consumed per day, on a monthly basis.
id Cow Milk Soy Milk Yoghurt (...)
001 0.07 0 0 ...
002 0 0.4 0 ...
003 0.07 0.07 0.13 ...
004 2.5 0 0 ...
... ... ... ... ...
My reference table looks like this:
DF2. Reference values
Food Kcal Proteins Trans Fat Satured Fat (5 more columns)
Cow Milk 91.50 4.95 4.95 3.12 ...
Soy Milk 49.50 4.20 2.85 1.80 ...
Yoghurt 122.00 7.00 6.60 0.18 ...
...... ... ... ... ... ...
What I need to do is:
Multiply the portion value of each food by the corresponding value of that food in the reference table, for each variable (i.e., kcal, protein, fat...).
Sum all the values obtained for each food in the same variable (sum all kcal, sum all the protein...) for that id.
Consolidate in one data.frame.
So, for example, the kcal and protein values only for id 001 so far should be:
id 001:
  kcal:    (0.07*91.5) + (0*49.5) + (0*122) = 6.405
  protein: (0.07*4.95) + (0*4.2) + (0*7)    = 0.3465
  ...
And I need to calculate that for all the foods and all the other variables of the reference table, for that same id and for dozens of other ids.
My final table should look like this:
id    Total Kcal    Total Proteins    ...
001   6.405         0.3465            ...
...   ...           ...               ...
I was thinking of implementing a loop:
results <- data.frame()
for (i in 1:ncol(df1)) {
  kcal <- df1[,i] * df2[i,]
  results$kcal <- rbind(results$kcal, kcal)
}
But I don't even know how to make it iterate through each variable while maintaining the df1[,i] position, nor how to make it sum the values once it has finished... I have never done such a complex thing before. Any help is appreciated.

Here is a tidyverse option
library(tidyverse)
DF1 %>%
  pivot_longer(-id, names_to = "Food", values_to = "portion") %>%
  left_join(DF2 %>% pivot_longer(-Food), by = "Food") %>%
  group_by(id, name) %>%
  summarise(value = sum(value * portion), .groups = "drop") %>%
  pivot_wider(names_prefix = "Total ")
## A tibble: 4 × 5
# id `Total Kcal` `Total Proteins` `Total Satured Fat` `Total Trans Fat`
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 6.40 0.347 0.218 0.347
#2 2 19.8 1.68 0.72 1.14
#3 3 25.7 1.55 0.368 1.40
#4 4 229. 12.4 7.8 12.4
Please note that there was an error in your example calculation of Total Proteins for id001; the correct value is:
(0.07 * 4.95) + (0 * 4.2) + (0 * 7) = 0.3465 (not 0.198)
Explanation: We reshape both DF1 and DF2 from wide to long, then do a left-join of long DF1 with long DF2 by "Food". We can then group_by(id, name) (where name gives the name of the quantity from DF2: Kcal, Proteins, Trans Fat, etc.) and calculate the desired quantities as the sum(value * portion), respectively. Finally, we reshape again from long to wide, and add the prefix "Total " to the new wide columns.
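For illustration, this is what the joined long table contains for id 001 just before the summarise step (a quick check, using the sample data below):
DF1 %>%
  pivot_longer(-id, names_to = "Food", values_to = "portion") %>%
  left_join(DF2 %>% pivot_longer(-Food), by = "Food") %>%
  filter(id == 1)
# 12 rows: each row pairs the portion of one food (e.g. Cow Milk, 0.07) with one
# reference quantity from DF2 (name = Kcal/Proteins/..., value = 91.5/4.95/...)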
Sample data
DF1 <- read.table(text = "id 'Cow Milk' 'Soy Milk' Yoghurt
001 0.07 0 0
002 0 0.4 0
003 0.07 0.07 0.13
004 2.5 0 0", header = T, check.names = F)
DF2 <- read.table(text = "Food Kcal Proteins 'Trans Fat' 'Satured Fat'
'Cow Milk' 91.50 4.95 4.95 3.12
'Soy Milk' 49.50 4.20 2.85 1.80
Yoghurt 122.00 7.00 6.60 0.18", header = T, check.names = F)

Here is a way to achieve this using a for loop:
library(dplyr)  # for %>% and filter()
results = data.frame()
for (i in 1:nrow(DF1)) {
  df_composition_for_id_i = DF2 %>% filter(Food %in% names(DF1[i,])[DF1[i,]>0])
  quantity_food = t(DF1[i,-1])[t(DF1[i,-1])>0]
  df_transform = sweep(df_composition_for_id_i[,-1], 1, quantity_food, `*`)
  Total = c(i, colSums(df_transform))
  names(Total)[1] = "id"
  results = rbind(results, Total)
}
names(results) = names(DF2)
names(results)[1] = "id"
> results
id Kcal Proteins Trans Fat Satured Fat
1 1 6.405 0.3465 0.3465 0.2184
2 2 19.800 1.6800 1.1400 0.7200
3 3 25.730 1.5505 1.4040 0.3678
4 4 228.750 12.3750 12.3750 7.8000
With this for loop you can also feed in a DF2 with more columns (e.g. carbohydrates, vitamins, ...), which will be computed in the loop without further intervention.
Explanation:
In the for loop, df_composition_for_id_i is first built as a data frame containing only the foods actually consumed (portion > 0) by the current id; for example, when i=3:
i=3
df_composition_for_id_i = DF2 %>% filter(Food %in% names(DF1[i,])[DF1[i,]>0])
df_composition_for_id_i
Food Kcal Proteins Trans Fat Satured Fat
1 Cow Milk 91.5 4.95 4.95 3.12
2 Soy Milk 49.5 4.20 2.85 1.80
3 Yoghurt 122.0 7.00 6.60 0.18
quantity_food holds the non-zero portions of those foods, which will be used in the row-wise multiplication:
quantity_food
[1] 0.07 0.07 0.13
df_transform takes the first object created in this loop (df_composition_for_id_i) and multiplies it row-wise by the second (quantity_food), excluding the Food column, using the sweep function:
df_transform = sweep(df_composition_for_id_i[,-1], 1, quantity_food, `*`)
df_transform
Kcal Proteins Trans Fat Satured Fat
1 6.405 0.3465 0.3465 0.2184
2 3.465 0.2940 0.1995 0.1260
3 15.860 0.9100 0.8580 0.0234
Lastly, the column sums are calculated, the id is added, the names are tidied up, and the row is bound onto the results data frame:
Total = c(i, colSums(df_transform))
names(Total)[1]= "id"
results = rbind(results, Total)
id Kcal Proteins Trans Fat Satured Fat
1 3 25.73 1.5505 1.404 0.3678
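For reference, since each total is just the dot product of an id's portion row with a column of the reference table, the whole result can also be computed in base R with a single matrix multiplication (a sketch, assuming the sample DF1/DF2 from the first answer):
ref <- as.matrix(DF2[, -1])
rownames(ref) <- DF2$Food
portions <- as.matrix(DF1[, -1])
# align the reference rows with the food columns of DF1, then multiply
totals <- portions %*% ref[colnames(portions), , drop = FALSE]
results <- data.frame(id = DF1$id, totals, check.names = FALSE)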

Related

How to apply function with multiple outputs on each group in R and store results in different columns?

Suppose I am using panel data: for each individual and time, there is an observation of a numerical variable. I want to apply a function to this numerical variable but this function outputs a vector of numbers. I'd like to apply this function over the observations of each individual and store the resulting vector as columns of a new dataframe.
Example:
TICKER OFTIC CNAME ANNDATS_ACT ACTUAL
<chr> <chr> <chr> <date> <dbl>
1 0001 EPE EP ENGR CORP 2019-05-08 -0.15
2 0004 ACSF AMERICAN CAPITAL 2014-08-04 0.29
3 000R CRCM CARECOM 2018-02-27 0.32
4 000V EIGR EIGER 2018-05-11 -0.84
5 000Y RARE ULTRAGENYX 2016-02-25 -1.42
6 000Z BIOC BIOCEPT 2018-03-28 -54
7 0018 EGLT EGALET 2016-03-08 -0.28
8 001A SESN SESEN BIO 2021-03-15 -0.11
9 001C ARGS ARGOS 2017-03-16 -7
10 001J KN KNOWLES 2021-02-04 0.38
For each TICKER, I will consider the time-series implied by ACTUAL and compute the autocorrelation function. I defined the following wrapper to perform the operation:
my_acf <- function(x, lag = NULL){
  acf_vec <- acf(x, lag.max = lag, plot = FALSE, na.action = na.contiguous)$acf
  acf_vec <- as.vector(acf_vec)[-1]
  return(acf_vec)
}
If the desired maximum lag is, say, 3, I'd like to create another dataset in which I have 4 columns: TICKER and the corresponding first 3 autocorrelations of the associated series of ACTUAL observations.
My solution was:
max_lag = 3
autocorrs <- final_sample %>%
  group_by(TICKER) %>%
  filter(!all(is.na(ACTUAL))) %>%
  summarise(rho = my_acf(ACTUAL, lag = max_lag)) %>%
  mutate(order = row_number()) %>%
  pivot_wider(id_cols = TICKER, values_from = rho, names_from = order, names_prefix = "rho_")
This indeed provides the desired output:
TICKER rho_1 rho_2 rho_3
<chr> <dbl> <dbl> <dbl>
1 0001 0.836 0.676 0.493
2 0004 0.469 -0.224 -0.366
3 000R 0.561 0.579 0.327
4 000V 0.634 0.626 0.604
5 000Y 0.370 0.396 0.117
6 000Z 0.476 0.454 0.382
7 0018 0.382 -0.0170 -0.278
8 001A 0.330 0.316 0.0944
9 001C 0.727 0.590 0.400
10 001J 0.281 -0.308 -0.0343
My question is: how can one perform this operation without pivot_wider and the manual creation of the order column? The summarise verb creates a single column that stores the autocorrelations sequentially for each TICKER. Is there a way to force summarise to create different columns for the different outputs a given function may provide when applied to, say, the ACTUAL series?
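One possible approach (a hedged sketch, assuming dplyr >= 1.0.0, where an unnamed data-frame result inside summarise() is unpacked into one column per element, and tibble for as_tibble_row()):
library(dplyr)
library(tibble)
max_lag <- 3
autocorrs <- final_sample %>%
  group_by(TICKER) %>%
  filter(!all(is.na(ACTUAL))) %>%
  summarise(
    as_tibble_row(setNames(my_acf(ACTUAL, lag = max_lag),
                           paste0("rho_", seq_len(max_lag)))),
    .groups = "drop"
  )
# Each named element of the vector returned by my_acf() becomes its own column
# (rho_1, rho_2, rho_3), so no order column or pivot_wider is needed.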

R: ggplot to visualize all variables in each cluster after cluster analysis

Sorry in advance if the post isn't clear.
So I have my dataframe, 74 observations and 43 columns. I performed cluster analysis on them.
I then got 5 clusters, and assigned the cluster number to each respective row.
Now my df has 74 rows (obs) and 44 variables, and I would like to plot and see, for all variables, which variables are enriched and which are not in each cluster.
I want to achieve this with ggplot.
My imagined output panel has 5 boxplots per row and 42 rows of plots, with each row describing one variable measured in the dataset.
Example of the dataset (sorry, it's very big so I made an example; actual values are different)
df
ID EGF FGF_2 Eotaxin TGF G_CSF Flt3L GMSF Frac IFNa2 .... Cluster
4300 4.21 139.32 3.10 0 1.81 3.48 1.86 9.51 9.41 .... 1
2345 7.19 233.10 0 1.81 3.48 1.86 9.41 0 11.4 .... 1
4300 4.21 139.32 4.59 0 1.81 3.48 1.86 9.51 9.41 .... 1
....
3457 0.19 233.10 0 1.99 3.48 1.86 9.41 0 20.4 .... 3
5420 4.21 139.32 3.10 0.56 1.81 3.48 1.86 9.51 29.8 .... 1
2334 7.19 233.10 2.68 2.22 3.48 1.86 9.41 0 28.8 .... 5
str(df)
$ ID : Factor w/ 45 levels "4300"..... : 44 8 24 ....
$ EGF : num ....
$ FGF_2 : num ....
$ Eotaxin : num ....
....
$ Cluster : Factor w/ 5 levels "1" , "2"...: 1 1 1.....3 1 5
#now plotting
#thought I would pivot the dataframe first
new_df <- pivot_longer(df[,2:44],df$cluster, names_to = "Cytokine measured", values_to = "count")
#ggplot
ggplot(new_df,aes(x = new_df$cluster, y = new_df$count))+
geom_boxplot(width=0.2,alpha=0.1)+
geom_jitter(width=0.15)+
facet_grid(new_df$`Cytokine measured`~new_df$cluster, scales = 'free')
So the code did generate a small panel of the graphs that fit my imaginary output. But I can see only
5 rows instead of 42.
So going back to new_df, the last 3 columns draw my attention:
Cluster Cytokine measured count
1 EGF 2.66
1 FGF_2 390.1
1 Eotaxin 6.75
1 TGF 0
1 G_CSF 520
3 EGF 45
5 FGF_2 4
4 Eotaxin 0
1 TGF 0
1 G_CSF 43
....
So it seems the cluster number and count columns are correct, whereas the cytokine measured column just keeps repeating the 5 variable names instead of the total 42 variables I want to see.
I think the table conversion step is wrong, but I don't quite know what went wrong and how to fix it.
Please enlighten me.
We can try this; first I simulate something that looks like your data frame:
df = data.frame(
  ID = 1:74,
  matrix(rnorm(74*43), ncol = 43)
)
colnames(df)[-1] = paste0("Measurement", 1:43)
df$cluster = cutree(hclust(dist(scale(df[,-1]))), 5)
df$cluster = factor(df$cluster)
Then melt:
library(ggplot2)
library(tidyr)
library(dplyr)
melted_df = df %>% pivot_longer(-c(cluster, ID), values_to = "count")
g = ggplot(melted_df, aes(x = cluster, y = count, col = cluster)) +
  geom_boxplot() +
  facet_wrap(~name, ncol = 5, scales = "free_y")
You can save it as a bigger plot to look at:
ggsave(g,file="plot.pdf",width=15,height=15)
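A hedged sketch of the same reshaping applied to the original df from the question (assuming the cytokine measurements are every column except ID and Cluster, as in the str() output above). Note that pivot_longer() expects a selection of the columns to stack as its cols argument, not a vector of cluster values:
new_df <- df %>%
  pivot_longer(cols = -c(ID, Cluster),
               names_to = "Cytokine measured", values_to = "count")
ggplot(new_df, aes(x = Cluster, y = count)) +
  geom_boxplot(width = 0.2, alpha = 0.1) +
  geom_jitter(width = 0.15) +
  facet_wrap(~ `Cytokine measured`, ncol = 5, scales = "free_y")  # all 42 cytokines as facets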

Using a loop to calculate correlation based on subset data in R

I have a large dataset with several products in one column and information on each product including unit retail and quantity by week for the previous several years. I am trying to write a for loop that subsets the data by product name and calculates the correlation between unit retail and quantity for the number of rows for each product.
I have been able to subset the data based on product and calculate the correlation, but there are many products and it would be more beneficial to implement a loop to go through each unique product.
Example of dataset:
`Category Label` `Fiscal Year` `Fiscal Week` `Net Sales` `Extended Quantity` `Unit Retail` `Log QTY` `Log Retail`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 LOOSE CITRUS FY2018 FY2018-P01-W1 170833. 204901. 0.834 12.2 -0.182
2 LOOSE CITRUS FY2018 FY2018-P01-W2 158609. 187650. 0.845 12.1 -0.168
3 LOOSE CITRUS FY2018 FY2018-P01-W3 163580. 196313. 0.833 12.2 -0.182
4 LOOSE CITRUS FY2018 FY2018-P01-W4 146240. 185984. 0.786 12.1 -0.240
5 LOOSE CITRUS FY2018 FY2018-P02-W1 147494. 171036. 0.862 12.0 -0.148
6 LOOSE ONIONS FY2018 FY2018-P01-W1 88802. 78446. 1.13 11.3 0.124
7 LOOSE ONIONS FY2018 FY2018-P01-W2 77365. 66898. 1.16 11.1 0.145
8 LOOSE ONIONS FY2018 FY2018-P01-W3 88026. 75055. 1.17 11.2 0.159
9 LOOSE ONIONS FY2018 FY2018-P01-W4 114720. 97051. 1.18 11.5 0.167
10 LOOSE ONIONS FY2018 FY2018-P02-W1 95746. 82128. 1.17 11.3 0.153
#subset data into own df based on category
allProduce_split <- split(allProduce, allProduce$`Category Label`)
#correlation
cor_produce <- cor(allProduce_split$`LOOSE CITRUS`$`Unit Retail`,
                   allProduce_split$`LOOSE CITRUS`$`Extended Quantity`)
Rather than just returning the correlation for the "LOOSE CITRUS" product in the example, I am hoping to have a table that contains a single row for each product name with the correlation between unit retail and quantity for all 5 fiscal weeks. For example:
'Category Label' 'Cor'
LOOSE CITRUS .5363807
LOOSE ONIONS .6415218
product C .6498723
Product D -.451258
Product E .0012548
Consider by(), which is similar to split() but also allows any function to be applied to the subsets via a third argument. In your case, your function can build a data frame of the product label and the correlation result:
df_list <- by(allProduce, allProduce$`Category Label`, function(sub)
  data.frame(product = sub$`Category Label`[1],
             cor_produce = cor(sub$`Unit Retail`,
                               sub$`Extended Quantity`)
  )
)
final_df <- do.call(rbind, unname(df_list))
Alternatively, you can still use the split but then run an lapply:
allProduce_split <- split(allProduce, allProduce$`Category Label`)
df_list <- lapply(allProduce_split, function(sub)
  data.frame(product = sub$`Category Label`[1],
             cor_produce = cor(sub$`Unit Retail`,
                               sub$`Extended Quantity`)
  )
)
final_df <- do.call(rbind, unname(df_list))
Try:
library(dplyr)
df <- allProduce %>%
  group_by(`Category Label`) %>%
  mutate(correlation = cor(`Unit Retail`, `Extended Quantity`))
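Note that mutate() keeps every original row and repeats the correlation within each group; for a single row per product, as in the desired output, a summarise() version (sketch) would be:
cor_by_product <- allProduce %>%
  group_by(`Category Label`) %>%
  summarise(Cor = cor(`Unit Retail`, `Extended Quantity`), .groups = "drop")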

R turn each dataframe element into multiple n elements

I have a dataframe (df) with yearly value for atmospheric deposition and a vector of 12 elements (mul).
str(df)
'data.frame': 220 obs. of 11 variables:
 $ year : num 1900 1902 1903 1904 1906 ...
 $ BOX1 : num 0.72 0.72 0.72 0.72 0.72 ...
mul <- c(0.7,0.7,1.3,1.7,0.7,1.0,0.7, 1.7,1.7,1.7,1.0,0.4)
I want to multiply each element of a given column 'BOX' by the 12 values of 'mul', so that a 220-element column becomes a 2640-element column.
In the following example I've done it for the first element of column BOX1 but I need to apply the procedure to all the elements at once.
df$BOX1[1] * mul
[1] 0.503811 0.503811 0.935649 1.223541 0.503811 0.719730 0.503811 1.223541 1.223541 1.151568 0.719730 0.287892
Thank you for your help
# example data
df = data.frame(year = c(1900, 1902, 1903),
                BOX1 = c(0.72, 0.75, 0.80))
mul <- c(0.7, 0.7, 1.3, 1.7, 0.7, 1.0, 0.7, 1.7, 1.7, 1.7, 1.0, 0.4)
library(tidyverse)
df %>%
  rowwise() %>%                      # for each row
  mutate(x = list(BOX1 * mul)) %>%   # multiply the value in BOX1 with mul and save the result as a list
  unnest()                           # unnest data
# # A tibble: 36 x 3
# year BOX1 x
# <dbl> <dbl> <dbl>
# 1 1900 0.72 0.504
# 2 1900 0.72 0.504
# 3 1900 0.72 0.936
# 4 1900 0.72 1.22
# 5 1900 0.72 0.504
# 6 1900 0.72 0.72
# 7 1900 0.72 0.504
# 8 1900 0.72 1.22
# 9 1900 0.72 1.22
# 10 1900 0.72 1.22
# # ... with 26 more rows
You can remove column BOX1 if you want.
You can also try to use a vectorized function instead of rowwise, which might be faster:
# vectorised function: for each value of x, return the full vector x * y
f = function(x, y) x * y
f = Vectorize(f, vectorize.args = "x", SIMPLIFY = FALSE)
df %>%
  mutate(x = f(BOX1, mul)) %>%   # list-column: one 12-element vector per row
  unnest(x)
Another solution with the tidyverse:
library(tidyr)
library(dplyr)
library(tibble)
as.data.frame(mul) %>%    # keep the column named "mul"
  rowid_to_column() %>%   # to keep duplicates in 'mul' when calling 'crossing'
  crossing(df) %>%
  mutate(v = BOX1 * mul)
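The same expansion can also be done in base R (a sketch, assuming the example df above): repeat each row once per element of mul and let recycling handle the multiplication:
out <- df[rep(seq_len(nrow(df)), each = length(mul)), ]
out$x <- out$BOX1 * mul   # mul (length 12) recycles within each block of repeated rows
rownames(out) <- NULL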

How to divide dataset into balanced sets based on multiple variables

I have a large dataset I need to divide into multiple balanced sets.
The set looks something like the following:
> data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
> colnames(data)<-c("A","B","C","D","E","F","G","H")
The sets, each containing for example 20 rows, need to be balanced across multiple variables, so that each subset ends up with means of B, C, and D similar to those of all the other subsets.
Is there a way to do that with R? Any advice would be much appreciated. Thank you in advance!
library(tidyverse)
# Reproducible data
set.seed(2)
data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
colnames(data)<-c("A","B","C","D","E","F","G","H")
data=as.data.frame(data)
Updated Answer
It's probably not possible to get similar means across sets within each column if you want to keep observations from a given row together. With 8 columns (as in your sample data), you'd need 25 20-row sets where each column A set has the same mean, each column B set has the same mean, etc. That's a lot of constraints. Probably there are, however, algorithms that could find the set membership assignment schedule that minimizes the difference in set means.
However, if you can separately take 20 observations from each column without regard to which row it came from, then here's one option:
# Group into sets with same means
same_means = data %>%
  gather(key, value) %>%
  arrange(value) %>%
  group_by(key) %>%
  # deal the sorted values out to sets 1..25 round-robin, reversing direction
  # for the upper half so each set gets a mix of low and high values
  mutate(set = c(rep(1:25, 10), rep(25:1, 10)))
# Check means by set for each column
same_means %>%
group_by(key, set) %>%
summarise(mean=mean(value)) %>%
spread(key, mean) %>% as.data.frame
set A B C D E F G H
1 1 4.940018 5.018584 5.117592 4.931069 5.016401 5.171896 4.886093 5.047926
2 2 4.946496 5.018578 5.124084 4.936461 5.017041 5.172817 4.887383 5.048850
3 3 4.947443 5.021511 5.125649 4.929010 5.015181 5.173983 4.880492 5.044192
4 4 4.948340 5.014958 5.126480 4.922940 5.007478 5.175898 4.878876 5.042789
5 5 4.943010 5.018506 5.123188 4.924283 5.019847 5.174981 4.869466 5.046532
6 6 4.942808 5.019945 5.123633 4.924036 5.019279 5.186053 4.870271 5.044757
7 7 4.945312 5.022991 5.120904 4.919835 5.019173 5.187910 4.869666 5.041317
8 8 4.947457 5.024992 5.125821 4.915033 5.016782 5.187996 4.867533 5.043262
9 9 4.936680 5.020040 5.128815 4.917770 5.022527 5.180950 4.864416 5.043587
10 10 4.943435 5.022840 5.122607 4.921102 5.018274 5.183719 4.872688 5.036263
11 11 4.942015 5.024077 5.121594 4.921965 5.015766 5.185075 4.880304 5.045362
12 12 4.944416 5.024906 5.119663 4.925396 5.023136 5.183449 4.887840 5.044733
13 13 4.946751 5.020960 5.127302 4.923513 5.014100 5.186527 4.889140 5.048425
14 14 4.949517 5.011549 5.127794 4.925720 5.006624 5.188227 4.882128 5.055608
15 15 4.943008 5.013135 5.130486 4.930377 5.002825 5.194421 4.884593 5.051968
16 16 4.939554 5.021875 5.129392 4.930384 5.005527 5.197746 4.883358 5.052474
17 17 4.935909 5.019139 5.131258 4.922536 5.003273 5.204442 4.884018 5.059162
18 18 4.935830 5.022633 5.129389 4.927106 5.008391 5.210277 4.877859 5.054829
19 19 4.936171 5.025452 5.127276 4.927904 5.007995 5.206972 4.873620 5.054192
20 20 4.942925 5.018719 5.127394 4.929643 5.005699 5.202787 4.869454 5.055665
21 21 4.941351 5.014454 5.125727 4.932884 5.008633 5.205170 4.870352 5.047728
22 22 4.933846 5.019311 5.130156 4.923804 5.012874 5.213346 4.874263 5.056290
23 23 4.928815 5.021575 5.139077 4.923665 5.017180 5.211699 4.876333 5.056836
24 24 4.928739 5.024419 5.140386 4.925559 5.012995 5.214019 4.880025 5.055182
25 25 4.929357 5.025198 5.134391 4.930061 5.008571 5.217005 4.885442 5.062630
Original Answer
# Randomly group data into 20-row groups
set.seed(104)
data = data %>%
mutate(set = sample(rep(1:(500/20), each=20)))
head(data)
A B C D E F G H set
1 1.848823 6.920055 3.2283369 6.633721 6.794640 2.0288792 1.984295 2.09812642 10
2 7.023740 5.599569 0.4468325 5.198884 6.572196 0.9269249 9.700118 4.58840437 20
3 5.733263 3.426912 7.3168797 3.317611 8.301268 1.4466065 5.280740 0.09172101 19
4 1.680519 2.344975 4.9242313 6.163171 4.651894 2.2253335 1.175535 2.51299726 25
5 9.438393 4.296028 2.3563249 5.814513 1.717668 0.8130327 9.430833 0.68269106 19
6 9.434750 7.367007 1.2603451 5.952936 3.337172 5.2892300 5.139007 6.52763327 5
# Mean by set for each column
data %>% group_by(set) %>%
summarise_all(mean)
set A B C D E F G H
1 1 5.240236 6.143941 4.638874 5.367626 4.982008 4.200123 5.521844 5.083868
2 2 5.520983 5.257147 5.209941 4.504766 4.231175 3.642897 5.578811 6.439491
3 3 5.943011 3.556500 5.366094 4.583440 4.932206 4.725007 5.579103 5.420547
4 4 4.729387 4.755320 5.582982 4.763171 5.217154 5.224971 4.972047 3.892672
5 5 4.824812 4.527623 5.055745 4.556010 4.816255 4.426381 3.520427 6.398151
6 6 4.957994 7.517130 6.727288 4.757732 4.575019 6.220071 5.219651 5.130648
7 7 5.344701 4.650095 5.736826 5.161822 5.208502 5.645190 4.266679 4.243660
8 8 4.003065 4.578335 5.797876 4.968013 5.130712 6.192811 4.282839 5.669198
9 9 4.766465 4.395451 5.485031 4.577186 5.366829 5.653012 4.550389 4.367806
10 10 4.695404 5.295599 5.123817 5.358232 5.439788 5.643931 5.127332 5.089670
# ... with 15 more rows
If the total number of rows in the data frame is not divisible by the number of rows you want in each set, then you can do the following when you create the sets:
data = data %>%
mutate(set = sample(rep(1:ceiling(500/20), each=20))[1:n()])
In this case, the set sizes will vary a bit when the number of data rows is not divisible by the desired number of rows in each set.
The following approach could be worth trying for someone in a similar position.
It is based on the numerical balancing in groupdata2's fold() function, which allows creating groups with balanced means for a single column. By standardizing each of the columns and numerically balancing their rowwise sum, we might increase the chance of getting balanced means in the individual columns.
I compared this approach to creating groups randomly a few times and selecting the split with the least variance in means. It seems to be a bit better, but I'm not too convinced that this will hold in all contexts.
# Attach dplyr and groupdata2
library(dplyr)
library(groupdata2)
set.seed(1)
# Create the dataset
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- dplyr::as_tibble(data)
# Standardize all columns and calculate row sums
data_std <- data %>%
  dplyr::mutate_all(.funs = function(x){(x - mean(x)) / sd(x)}) %>%
  dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create groups (new column called ".folds")
# We numerically balance the "total" column
data_std <- data_std %>%
  groupdata2::fold(k = 25, num_col = "total") # k = 500/20 = 25
# Transfer the groups to the original (non-standardized) data frame
data$group <- data_std$.folds
# Check the means
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise_all(.funs = mean)
> # A tibble: 25 x 9
> group A B C D E F G H
> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 1 4.48 5.05 4.80 5.65 5.04 4.60 5.12 4.85
> 2 2 5.57 5.17 3.21 5.46 4.46 5.89 5.06 4.79
> 3 3 4.33 6.02 4.57 6.18 4.76 3.79 5.94 3.71
> 4 4 4.51 4.62 4.62 5.27 4.65 5.41 5.26 5.23
> 5 5 4.55 5.10 4.19 5.41 5.28 5.39 5.57 4.23
> 6 6 4.82 4.74 6.10 4.34 4.82 5.08 4.89 4.81
> 7 7 5.88 4.49 4.13 3.91 5.62 4.75 5.46 5.26
> 8 8 4.11 5.50 5.61 4.23 5.30 4.60 4.96 5.35
> 9 9 4.30 3.74 6.45 5.60 3.56 4.92 5.57 5.32
> 10 10 5.26 5.50 4.35 5.29 4.53 4.75 4.49 5.45
> # … with 15 more rows
# Check the standard deviations of the means
# Could be used to compare methods
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd))
> # A tibble: 1 x 8
> A B C D E F G H
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.496 0.546 0.764 0.669 0.591 0.611 0.690 0.475
It might be best to compare the means and mean variances (or standard deviations as above) of different methods on the standardized data though. In that case, one could calculate the sum of the variances and minimize it.
data_std %>%
  dplyr::select(-total) %>%
  dplyr::group_by(.folds) %>%
  dplyr::summarise_all(.funs = mean) %>%
  dplyr::summarise(across(where(is.numeric), sd)) %>%
  sum()
> 1.643989
Comparing multiple balanced splits
The fold() function allows creating multiple unique grouping factors (splits) at once. So here, I will perform the numerically balanced split 20 times and find the grouping with the lowest sum of the standard deviations of the means. I'll further convert it to a function.
create_multi_balanced_groups <- function(data, cols, k, num_tries){

  # Extract the variables of interest
  # We assume these are numeric but we could add a check
  data_to_balance <- data[, cols]

  # Standardize all columns
  # And calculate rowwise sums
  data_std <- data_to_balance %>%
    dplyr::mutate_all(.funs = function(x){(x - mean(x)) / sd(x)}) %>%
    dplyr::mutate(total = rowSums(across(where(is.numeric))))

  # Create `num_tries` unique numerically balanced splits
  data_std <- data_std %>%
    groupdata2::fold(
      k = k,
      num_fold_cols = num_tries,
      num_col = "total"
    )

  # The new fold column names ".folds_1", ".folds_2", etc.
  fold_col_names <- paste0(".folds_", seq_len(num_tries))

  # Remove total column
  data_std <- data_std %>%
    dplyr::select(-total)

  # Calculate score for each split
  # This could probably be done more efficiently without a for loop
  variance_scores <- c()
  for (fcol in fold_col_names){
    score <- data_std %>%
      dplyr::group_by(!!as.name(fcol)) %>%
      dplyr::summarise(across(where(is.numeric), mean)) %>%
      dplyr::summarise(across(where(is.numeric), sd)) %>%
      sum()
    variance_scores <- append(variance_scores, score)
  }

  # Get the fold column with the lowest score
  lowest_fcol_index <- which.min(variance_scores)
  best_fcol <- fold_col_names[[lowest_fcol_index]]

  # Add the best fold column / grouping factor to the original data
  data[["group"]] <- data_std[[best_fcol]]

  # Return the original data and the score of the best fold column
  list(data, min(variance_scores))
}
# Run with 20 splits
set.seed(1)
data_grouped_and_score <- create_multi_balanced_groups(
  data = data,
  cols = c("A", "B", "C", "D", "E", "F", "G", "H"),
  k = 25,
  num_tries = 20
)
# Check data
data_grouped_and_score[[1]]
> # A tibble: 500 x 9
> A B C D E F G H group
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
> 1 5.86 6.54 0.500 2.88 5.70 9.67 2.29 3.01 2
> 2 0.0895 4.69 5.71 0.343 8.95 7.73 5.76 9.58 1
> 3 2.94 1.78 2.06 6.66 9.54 0.600 4.26 0.771 16
> 4 2.77 1.52 0.723 8.11 8.95 1.37 6.32 6.24 7
> 5 8.14 2.49 0.467 8.51 0.889 6.28 4.47 8.63 13
> 6 2.60 8.23 9.17 5.14 2.85 8.54 8.94 0.619 23
> 7 7.24 0.260 6.64 8.35 8.59 0.0862 1.73 8.10 5
> 8 9.06 1.11 6.01 5.35 2.01 9.37 7.47 1.01 1
> 9 9.49 5.48 3.64 1.94 3.24 2.49 3.63 5.52 7
> 10 0.731 0.230 5.29 8.43 5.40 8.50 3.46 1.23 10
> # … with 490 more rows
# Check score
data_grouped_and_score[[2]]
> 1.552656
By commenting out the num_col = "total" line, we can run this without the numerical balancing. For me, this gave a score of 1.615257.
Disclaimer: I am the author of the groupdata2 package. The fold() function can also balance a categorical column (cat_col) and keep all data points with the same ID in the same fold (id_col) (e.g. to avoid leakage in cross-validation). There's a very similar partition() function as well.
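For illustration, a call combining those arguments might look like this (a sketch; the data frame my_data and the column names diagnosis and participant are hypothetical):
folded <- groupdata2::fold(my_data, k = 5,
                           cat_col = "diagnosis",   # balance a categorical column across folds
                           id_col = "participant")  # keep rows with the same ID in the same fold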
