Let's pretend I am measuring the distance grasshoppers can jump pre- and post-treatment. This is just for fun; the real measurement could be anything, and the bigger picture is to understand the group_by() command.
For the statistical test I would like to run, each observation needs to have its own column, but I'm given a dataset that is not in this format. I would like to use the dplyr package and its group_by() command to shape the data for my needs, so that if this happens again I have more general code that works on other datasets :)
I am able to do this using commands such as filter(), followed by cbind() at a later step (see example below). But it also requires renaming a column. Additionally, if I want to add a column, let's say "difference", to calculate the observed difference between observation 1 and observation 2, I can do this, but then I need to add another line of code (again, see example below).
It would be great to do this with fewer lines of code.
Please see what I have tried, and let me know how I could modify the group_by() code to work properly.
example_df <- data.frame( "observation" = character(0), "distance" = integer(0))
Assign names for our "observations", remember, in this example, it's done twice
variable_names <- c( "obs_1", "obs_2")
Assign fictitious values to y
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
Combine everything for this pretend exercise
df <- data.frame( "observation" = variable_names, "distance" = c(w,x,y,z))
attach(df)
Here's how I achieved the desired results for this example
library(dplyr)
dat = filter(df,observation=="obs_1")
dat2 = filter(df,observation=="obs_2")
names(dat2)
colnames(dat2)[2] <- "distance_2"
final <- cbind(dat,dat2)
attach(final)
final$difference <- distance-distance_2
I tried using the group_by() command, but I just get an error message
final <- df %>% group_by(observation,distance) %>% summarise(
Observation_1 = first(observation), distance_1 = first(distance),
Observation_2 = last(observation), distance_2 = last(distance,difference=distance-distance_2)))
It would be great to get the above code to work
To make things even more "fun" :), what if more than one variable was measured? Could I write general code to achieve the desired results, again without having to go through the whole filter() process with cbind() etc.?
Here's an example (expanded on the above one)
example_df <- data.frame( "observation" = character(0), "distance" = integer(0),"weight" = integer(0),"speed" = integer(0))
variable_names <- c( "obs_1", "obs_2")
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
a<-rnorm(200, mean=5, sd=2)
b<-rnorm(200, mean=5, sd=2)
df <- data.frame( "observation" = variable_names, "distance" = c(w,x),"weight" = c(y,z),"speed" = c(a,b))
attach(df)
library(dplyr)
dat = filter(df,observation=="obs_1")
dat2 = filter(df,observation=="obs_2")
names(dat2)
colnames(dat2)[2] <- "distance_2"
colnames(dat2)[3] <- "weight_2"
colnames(dat2)[4] <- "speed_2"
final <- cbind(dat,dat2)
attach(final)
final$difference <- distance-distance_2
final$difference_weight <- weight-weight_2
final$difference_speed <- speed-speed_2
Thanks everyone!
Would be simple with pivot_wider, though I presume your data also has an id column to link observations somehow, so have added one here:
library(tidyverse)
w<-rnorm(200, mean=5, sd=2)
x<-rnorm(200, mean=5, sd=2)
y<-rnorm(200, mean=5, sd=2)
z<-rnorm(200, mean=5, sd=2)
a<-rnorm(200, mean=5, sd=2)
b<-rnorm(200, mean=5, sd=2)
variable_names <- c( "obs_1", "obs_2")
df <-
data.frame(
"id" = rep(1:200, each = 2),
"observation" = variable_names,
"distance" = c(w, x),
"weight" = c(y, z),
"speed" = c(a, b)
)
df %>%
pivot_wider(
id_cols = id,
names_from = observation,
values_from = distance:speed
)
#> # A tibble: 200 x 7
#> id distance_obs_1 distance_obs_2 weight_obs_1 weight_obs_2 speed_obs_1
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.63 2.80 2.98 -0.795 3.58
#> 2 2 4.96 6.84 4.11 9.92 8.21
#> 3 3 4.84 7.51 6.32 3.28 9.02
#> 4 4 3.79 6.82 5.42 6.86 7.96
#> 5 5 5.48 2.84 9.56 3.27 3.55
#> 6 6 8.78 2.06 3.81 4.35 5.93
#> 7 7 8.42 4.21 3.92 4.40 9.37
#> 8 8 8.26 9.67 4.05 6.19 3.17
#> 9 9 3.80 4.47 6.58 5.38 6.09
#> 10 10 4.67 2.86 6.27 6.88 3.72
#> # ... with 190 more rows, and 1 more variable: speed_obs_2 <dbl>
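Once the data is wide, the difference columns from the question can be added with a single mutate(). The following is a minimal sketch on a small stand-in data frame (the name toy and its 5 ids are made up for illustration), using the obs_1 minus obs_2 sign convention from the question:

```r
library(dplyr)
library(tidyr)

set.seed(1)
# Hypothetical stand-in for the data above: 5 ids, each measured twice
toy <- data.frame(
  id = rep(1:5, each = 2),
  observation = rep(c("obs_1", "obs_2"), times = 5),
  distance = rnorm(10, mean = 5, sd = 2),
  weight = rnorm(10, mean = 5, sd = 2)
)

toy_wide <- toy %>%
  pivot_wider(
    id_cols = id,
    names_from = observation,
    values_from = c(distance, weight)
  ) %>%
  # One difference column per measured variable (obs_1 minus obs_2)
  mutate(
    distance_diff = distance_obs_1 - distance_obs_2,
    weight_diff = weight_obs_1 - weight_obs_2
  )

toy_wide
```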
Follow-up
You can also tell pivot_wider to use a function in combining values. Here in this example I've passed names_from = NULL so that every column is paired up by id, and using the diff function to calculate the difference:
df %>%
pivot_wider(
id_cols = id,
names_from = NULL,
values_from = distance:speed,
values_fn = diff,
names_sep = ""
)
#> # A tibble: 200 x 4
#> id distance weight speed
#> <int> <dbl> <dbl> <dbl>
#> 1 1 -0.828 -3.77 4.45
#> 2 2 1.88 5.82 -1.07
#> 3 3 2.66 -3.04 -4.31
#> 4 4 3.03 1.45 -0.969
#> 5 5 -2.64 -6.29 5.06
#> 6 6 -6.72 0.541 -2.24
#> 7 7 -4.20 0.481 -5.82
#> 8 8 1.41 2.14 3.71
#> 9 9 0.669 -1.19 -1.14
#> 10 10 -1.81 0.607 -2.62
#> # ... with 190 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
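For completeness, here is a sketch of how the group_by() attempt from the question might be made to work. It assumes, as above, that an id column pairs the two observations; the grouping is done by id rather than by observation and distance, and the data frame name paired is made up for this example:

```r
library(dplyr)

set.seed(42)
# Hypothetical paired data: each id has one obs_1 row and one obs_2 row
paired <- data.frame(
  id = rep(1:5, each = 2),
  observation = rep(c("obs_1", "obs_2"), times = 5),
  distance = rnorm(10, mean = 5, sd = 2)
)

final <- paired %>%
  group_by(id) %>%
  summarise(
    distance_1 = distance[observation == "obs_1"],
    distance_2 = distance[observation == "obs_2"],
    # Columns created earlier in the same summarise() call can be reused
    difference = distance_1 - distance_2
  )

final
```

Grouping by distance, as in the original attempt, puts nearly every row in its own group, so there is nothing left to pair up within a group.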
I am very new in R and I need some advice about very basic issues.
I want to create a new column that is the sum of existent columns in my data frame Data4
The extended code is this:
Data4$E<-(Data4$E1+Data4$E2+Data4$E3+Data4$E4+Data4$E5)
I would like to simplify the code and find a way to not write the sequence of the column's name every time.
I tried this, but it is indeed wrong
Data4$E<-(Data4$E[1:5])
Do you know a way to do it?
Thank you!
Among your options are:
set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)
# base R
Data4$E <- rowSums(Data4) # if there are just columns E1 to E5
Data4$E_option2 <- rowSums(subset(Data4, select = paste0("E", 1:5))) # if there are other columns ..
# "tidy"
library(tidyverse)
Data4 <- Data4 %>%
mutate(E_option3 = pmap_dbl(Data4 %>%
select(E1:E5),
sum))
# E1 E2 E3 E4 E5 E E_option2 E_option3
#1 8.519432 9.727704 9.222280 9.296536 10.223641 46.98959 46.98959 46.98959
#2 11.577169 9.684651 8.706118 11.188879 12.007201 53.16402 53.16402 53.16402
#3 9.043256 9.371745 9.220433 10.340512 11.011979 48.98793 48.98793 48.98793
#4 9.079995 9.893536 10.011952 10.506968 9.697541 49.18999 49.18999 49.18999
#5 8.002358 10.428015 9.847584 9.706695 8.974755 46.95941 46.95941 46.95941
Use functions like sum or rowSums. It seems you want row sums. These functions are better than + because they have an na.rm argument that controls whether or not NAs are ignored.
Data4$E <- rowSums(Data4[, c("E1", "E2", "E3", "E4", "E5")], na.rm = TRUE)
An easy way to generate column names is to paste a prefix together with numbers. Storing the names in a variable lets us reuse them for other such operations:
E_col_names <- sprintf("E%d", 1:5)
Data4$E <- rowSums(Data4[, E_col_names], na.rm = TRUE)
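To see what na.rm changes in practice, here is a small sketch on a made-up two-row data frame (the name na_demo and its values are invented for illustration):

```r
# Hypothetical data with one missing value in the first row
na_demo <- data.frame(E1 = c(1, 2), E2 = c(NA, 3), E3 = c(4, 5))

# Adding with + propagates the NA to the whole row total
plus_sum <- na_demo$E1 + na_demo$E2 + na_demo$E3

# rowSums() with na.rm = TRUE skips the NA and sums what is left
row_sum <- rowSums(na_demo[, c("E1", "E2", "E3")], na.rm = TRUE)

plus_sum  # NA 10
row_sum   # 5 10
```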
One more way to do it in dplyr, demonstrated on the toy data created in one of the answers above. Just use E1:E5 inside c_across. Of course, you may also use select helper functions such as starts_with here.
#toy_data
set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)
library(dplyr)
Data4 %>% rowwise() %>%
mutate(E = sum(c_across(E1:E5)))
#> # A tibble: 5 x 6
#> # Rowwise:
#> E1 E2 E3 E4 E5 E
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 8.52 9.73 9.22 9.30 10.2 47.0
#> 2 11.6 9.68 8.71 11.2 12.0 53.2
#> 3 9.04 9.37 9.22 10.3 11.0 49.0
#> 4 9.08 9.89 10.0 10.5 9.70 49.2
#> 5 8.00 10.4 9.85 9.71 8.97 47.0
Created on 2021-05-25 by the reprex package (v2.0.0)
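As a sketch of the starts_with variant mentioned above (assuming dplyr 1.0 or later, where c_across and across are available), the column range can be replaced by a prefix match. The second version avoids rowwise() by handing the selected columns to rowSums, which tends to be faster on large data:

```r
library(dplyr)

set.seed(12)
Data4 <- data.frame(replicate(5, rnorm(5, 10, 1)))
colnames(Data4) <- paste0("E", 1:5)

# Row-by-row sum over every column whose name starts with "E"
res_rowwise <- Data4 %>%
  rowwise() %>%
  mutate(E = sum(c_across(starts_with("E")))) %>%
  ungroup()

# Same result without rowwise(): rowSums over the selected columns
res_vectorised <- Data4 %>%
  mutate(E = rowSums(across(starts_with("E"))))
```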
In my real data, I have multiple outliers for multiple variables. My data looks something like the example below but the numbers here are completely random. I would like to pull in all data points that are greater than or less than 2 SD using a 95% winsorization.
df <- read.csv(header=TRUE, text="
id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
I am aware of the 'winsorize' function in the 'robustHD' package but am not sure how to make the winsorization account for the 2 different groups, or how to include multiple variables in it.
I have tried this code to fix the problem but the code is not complete:
library(robustHD)
library(dplyr)
new.df.wins = df %>%
group_by(group) %>%
mutate(measure_winsorized = winsorize(c(test1,test2)))
An error is returned indicating
Error: Column `measure_winsorized` must be length 45 (the group size) or one, not 90
I am open to other ideas too. Thanks!
Consider creating two new fields for each numeric field to be winsorized:
new.df.wins <- df %>%
group_by(group) %>%
mutate(measure_winsorized_test1 = winsorize(test1),
measure_winsorized_test2 = winsorize(test2))
Alternatively with base R's ave:
new.df.wins <- within(df, {
measure_winsorized_test2 <- ave(test2, group, FUN=winsorize)
measure_winsorized_test1 <- ave(test1, group, FUN=winsorize)
})
Should you want to winsorize both simultaneously, assign to two new columns at once:
# TIDYVERSE (dplyr)
new.df.wins <- df %>%
group_by(group) %>%
mutate_at(.funs = list(wins = winsorize), .vars = vars(test1:test2))
# TINYVERSE (I.E. BASE R)
df[c("test1_wins", "test2_wins")] <- with(df, ave(cbind(test1, test2),
group, FUN=winsorize))
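In dplyr 1.0 and later, mutate_at is superseded by across(), so the grouped version might also be written as below. This is only a sketch: to keep it runnable without robustHD it uses a hypothetical helper, clamp_quantiles, that clamps values to the 2.5% and 97.5% quantiles of each group. That is an assumption made for illustration and is not the rule robustHD::winsorize applies; swap in winsorize to get the package's behaviour.

```r
library(dplyr)

# Hypothetical stand-in for robustHD::winsorize(): clamps to quantile limits,
# which is NOT the same rule the robustHD function uses
clamp_quantiles <- function(x, probs = c(0.025, 0.975)) {
  limits <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, limits[[1]]), limits[[2]])
}

# Same values as the example data in the question
scores <- data.frame(
  id = 1:10,
  group = rep(0:1, each = 5),
  test1 = c(57, 77, 67, 15, 58, 18, 44, 18, 20, 14),
  test2 = c(82, 80, 90, 70, 72, 44, 44, 46, 44, 38)
)

scores_wins <- scores %>%
  group_by(group) %>%
  # Apply the helper to each listed column within each group;
  # .names keeps the originals and adds test1_wins and test2_wins
  mutate(across(c(test1, test2), clamp_quantiles, .names = "{.col}_wins")) %>%
  ungroup()
```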
You can make a version of winsorize() that works on data frames, and use that with by()
# Example data
set.seed(1)
df2 <- round(matrix(rt(100, 4), 20), 3)
df2 <- data.frame(id=seq_len(nrow(df2)),
group=sort(rep(1:2, length=nrow(df2))),
test=df2)
df2[c(1:3, 11:13),]
# id group test.1 test.2 test.3 test.4 test.5
# 1 1 1 -0.673 -1.227 0.015 -0.831 0.024
# 2 2 1 -0.584 1.059 1.492 0.833 -0.377
# 3 3 1 0.572 0.613 -1.924 -0.672 1.184
# 11 11 2 0.054 0.020 2.241 -0.103 -0.047
# 12 12 2 1.746 -0.788 -0.268 -1.921 4.577
# 13 13 2 -0.472 -1.294 -0.258 0.795 -1.110
# data frame version of winsorize
winsorizedf <- function(x, ...) {
do.call(cbind, lapply(x, winsorize, ...))
}
# winsorize every column, except the two first ones, grouped by df2$group
w <- do.call(rbind,
by(df2[, -(1:2)], df2$group, winsorizedf))
# combine the winsorized columns with the original id and group columns
dfw <- data.frame(df2[, 1:2], round(w, 2))
dfw[c(1:3, 11:13),]
# id group test.1 test.2 test.3 test.4 test.5
# 1 1 1 -0.63 -1.23 0.02 -0.83 0.02
# 2 2 1 -0.58 1.06 1.49 0.26 -0.38
# 3 3 1 0.57 0.61 -1.60 -0.67 1.18
# 11 11 2 0.05 0.02 1.23 -0.10 -0.05
# 12 12 2 1.70 -0.79 -0.27 -1.92 4.58
# 13 13 2 -0.47 -1.07 -0.26 0.80 -1.11
I want to apply two different formulas on four columns of my dataframe df. I have done this manually, but since my original data frame has several columns, I want to be able to use loops or case when to do this faster.
Here's what the sample data frame df looks like:
A B C D
20 100 4 1200
40 150 6 2300
34 200 3 1230
32 225 9 1100
12 220 10 1000
Formula 1:
(x-min(x))/(max(x)-min(x))
Formula 2:
(min(x)-x)/(max(x)-min(x))
I'd like to apply formula 1 on columns B and D and formula 2 on columns A and C.
After applying the formula, I want to store the values in a different dataframe but with the same column names.
Here's what I did:
formula_1 <-function(x) {
(((x - min(x)))/(max(x) - min(x)))
}
formula_2 <-function(x){(min(x)-x)/(max(x)-min(x))
}
# Create an empty data frame BI_score with one row per row of df
BI_score <- data.frame(row.names = seq_len(nrow(df)))
BI_score$B <- formula_1(df$B)
BI_score$D <- formula_1 (df$D)
BI_score$A <- formula_2 (df$A)
BI_score$C <- formula_2 (df$C)
EDIT
As there are some NA and Inf values, and we want to exclude them from the calculation, we can update the functions as below and then apply them to the columns in the same way.
formula_1 <-function(x) {
temp <- x[is.finite(x)]
replace(x, is.finite(x), (((temp - min(temp)))/(max(temp) - min(temp))))
}
formula_2 <-function(x) {
temp <- x[is.finite(x)]
replace(x, is.finite(x), (min(temp)-temp)/(max(temp)-min(temp)))
}
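A quick check of the NA-safe version on a made-up vector (the values are invented for illustration) shows that non-finite entries are left where they are while the finite ones are rescaled among themselves:

```r
# NA/Inf-safe min-max scaling, same definition as formula_1 above
formula_1 <- function(x) {
  temp <- x[is.finite(x)]
  replace(x, is.finite(x), (temp - min(temp)) / (max(temp) - min(temp)))
}

x <- c(1, NA, 3, Inf, 5)

# A plain min() or max() is derailed by the NA
min(x)  # NA

# Only the finite values 1, 3, 5 are used for the range
scaled <- formula_1(x)
scaled  # 0 NA 0.5 Inf 1
```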
The most straightforward approach would be to use lapply to apply the function separately on selected columns.
BI_score <- df
fm1_cols <- c("B", "D")
fm2_cols <- c("A", "C")
BI_score[fm1_cols] <- lapply(df[fm1_cols], formula_1)
BI_score[fm2_cols] <- lapply(df[fm2_cols], formula_2)
BI_score
# A B C D
#1 -0.29 0.00 -0.14 0.154
#2 -1.00 0.40 -0.43 1.000
#3 -0.79 0.80 0.00 0.177
#4 -0.71 1.00 -0.86 0.077
#5 0.00 0.96 -1.00 0.000
As mentioned by @Sotos, if you want to apply the functions on alternate columns (formula_2 on A and C, formula_1 on B and D) you could do
BI_score[c(TRUE, FALSE)] <- lapply(df[c(TRUE, FALSE)], formula_2)
BI_score[c(FALSE, TRUE)] <- lapply(df[c(FALSE, TRUE)], formula_1)
Just for fun, approach using dplyr
library(dplyr)
bind_cols(df %>% select(all_of(fm1_cols)) %>% mutate_all(formula_1),
df %>% select(fm2_cols) %>% mutate_all(formula_2))
If your goal is to apply the two functions on alternating columns, then you can do it via logical indexing
cbind.data.frame(sapply(df[c(TRUE, FALSE)], formula_2),
sapply(df[c(FALSE, TRUE)], formula_1))
# A C B D
#1 -0.2857143 -0.1428571 0.00 0.15384615
#2 -1.0000000 -0.4285714 0.40 1.00000000
#3 -0.7857143 0.0000000 0.80 0.17692308
#4 -0.7142857 -0.8571429 1.00 0.07692308
#5 0.0000000 -1.0000000 0.96 0.00000000
We can use mutate_at from dplyr:
library(dplyr)
df %>%
mutate_at(vars(B, D), formula_1) %>%
mutate_at(vars(A, C), formula_2)
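Since mutate_at is superseded in dplyr 1.0 and later, the same idea might be sketched with across(). The data frame and the two functions are copied from the question; BI_score is the result name used there:

```r
library(dplyr)

df <- data.frame(
  A = c(20, 40, 34, 32, 12),
  B = c(100, 150, 200, 225, 220),
  C = c(4, 6, 3, 9, 10),
  D = c(1200, 2300, 1230, 1100, 1000)
)

formula_1 <- function(x) (x - min(x)) / (max(x) - min(x))
formula_2 <- function(x) (min(x) - x) / (max(x) - min(x))

# One across() call per formula; the column order A, B, C, D is preserved
BI_score <- df %>%
  mutate(
    across(c(B, D), formula_1),
    across(c(A, C), formula_2)
  )

BI_score
```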
I have a large dataset I need to divide into multiple balanced sets.
The set looks something like the following:
> data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
> colnames(data)<-c("A","B","C","D","E","F","G","H")
The sets, each containing for example 20 rows, will need to be balanced across multiple variables so that each subset ends up having a similar mean of B, C, D that's included in their subgroup compared to all the other subsets.
Is there a way to do that with R? Any advice would be much appreciated. Thank you in advance!
library(tidyverse)
# Reproducible data
set.seed(2)
data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
colnames(data)<-c("A","B","C","D","E","F","G","H")
data=as.data.frame(data)
Updated Answer
It's probably not possible to get similar means across sets within each column if you want to keep observations from a given row together. With 8 columns (as in your sample data), you'd need 25 20-row sets where each column A set has the same mean, each column B set has the same mean, etc. That's a lot of constraints. Probably there are, however, algorithms that could find the set membership assignment schedule that minimizes the difference in set means.
However, if you can separately take 20 observations from each column without regard to which row it came from, then here's one option:
# Group into sets with same means
same_means = data %>%
gather(key, value) %>%
arrange(value) %>%
group_by(key) %>%
mutate(set = c(rep(1:25, 10), rep(25:1, 10)))
# Check means by set for each column
same_means %>%
group_by(key, set) %>%
summarise(mean=mean(value)) %>%
spread(key, mean) %>% as.data.frame
set A B C D E F G H
1 1 4.940018 5.018584 5.117592 4.931069 5.016401 5.171896 4.886093 5.047926
2 2 4.946496 5.018578 5.124084 4.936461 5.017041 5.172817 4.887383 5.048850
3 3 4.947443 5.021511 5.125649 4.929010 5.015181 5.173983 4.880492 5.044192
4 4 4.948340 5.014958 5.126480 4.922940 5.007478 5.175898 4.878876 5.042789
5 5 4.943010 5.018506 5.123188 4.924283 5.019847 5.174981 4.869466 5.046532
6 6 4.942808 5.019945 5.123633 4.924036 5.019279 5.186053 4.870271 5.044757
7 7 4.945312 5.022991 5.120904 4.919835 5.019173 5.187910 4.869666 5.041317
8 8 4.947457 5.024992 5.125821 4.915033 5.016782 5.187996 4.867533 5.043262
9 9 4.936680 5.020040 5.128815 4.917770 5.022527 5.180950 4.864416 5.043587
10 10 4.943435 5.022840 5.122607 4.921102 5.018274 5.183719 4.872688 5.036263
11 11 4.942015 5.024077 5.121594 4.921965 5.015766 5.185075 4.880304 5.045362
12 12 4.944416 5.024906 5.119663 4.925396 5.023136 5.183449 4.887840 5.044733
13 13 4.946751 5.020960 5.127302 4.923513 5.014100 5.186527 4.889140 5.048425
14 14 4.949517 5.011549 5.127794 4.925720 5.006624 5.188227 4.882128 5.055608
15 15 4.943008 5.013135 5.130486 4.930377 5.002825 5.194421 4.884593 5.051968
16 16 4.939554 5.021875 5.129392 4.930384 5.005527 5.197746 4.883358 5.052474
17 17 4.935909 5.019139 5.131258 4.922536 5.003273 5.204442 4.884018 5.059162
18 18 4.935830 5.022633 5.129389 4.927106 5.008391 5.210277 4.877859 5.054829
19 19 4.936171 5.025452 5.127276 4.927904 5.007995 5.206972 4.873620 5.054192
20 20 4.942925 5.018719 5.127394 4.929643 5.005699 5.202787 4.869454 5.055665
21 21 4.941351 5.014454 5.125727 4.932884 5.008633 5.205170 4.870352 5.047728
22 22 4.933846 5.019311 5.130156 4.923804 5.012874 5.213346 4.874263 5.056290
23 23 4.928815 5.021575 5.139077 4.923665 5.017180 5.211699 4.876333 5.056836
24 24 4.928739 5.024419 5.140386 4.925559 5.012995 5.214019 4.880025 5.055182
25 25 4.929357 5.025198 5.134391 4.930061 5.008571 5.217005 4.885442 5.062630
Original Answer
# Randomly group data into 20-row groups
set.seed(104)
data = data %>%
mutate(set = sample(rep(1:(500/20), each=20)))
head(data)
A B C D E F G H set
1 1.848823 6.920055 3.2283369 6.633721 6.794640 2.0288792 1.984295 2.09812642 10
2 7.023740 5.599569 0.4468325 5.198884 6.572196 0.9269249 9.700118 4.58840437 20
3 5.733263 3.426912 7.3168797 3.317611 8.301268 1.4466065 5.280740 0.09172101 19
4 1.680519 2.344975 4.9242313 6.163171 4.651894 2.2253335 1.175535 2.51299726 25
5 9.438393 4.296028 2.3563249 5.814513 1.717668 0.8130327 9.430833 0.68269106 19
6 9.434750 7.367007 1.2603451 5.952936 3.337172 5.2892300 5.139007 6.52763327 5
# Mean by set for each column
data %>% group_by(set) %>%
summarise_all(mean)
set A B C D E F G H
1 1 5.240236 6.143941 4.638874 5.367626 4.982008 4.200123 5.521844 5.083868
2 2 5.520983 5.257147 5.209941 4.504766 4.231175 3.642897 5.578811 6.439491
3 3 5.943011 3.556500 5.366094 4.583440 4.932206 4.725007 5.579103 5.420547
4 4 4.729387 4.755320 5.582982 4.763171 5.217154 5.224971 4.972047 3.892672
5 5 4.824812 4.527623 5.055745 4.556010 4.816255 4.426381 3.520427 6.398151
6 6 4.957994 7.517130 6.727288 4.757732 4.575019 6.220071 5.219651 5.130648
7 7 5.344701 4.650095 5.736826 5.161822 5.208502 5.645190 4.266679 4.243660
8 8 4.003065 4.578335 5.797876 4.968013 5.130712 6.192811 4.282839 5.669198
9 9 4.766465 4.395451 5.485031 4.577186 5.366829 5.653012 4.550389 4.367806
10 10 4.695404 5.295599 5.123817 5.358232 5.439788 5.643931 5.127332 5.089670
# ... with 15 more rows
If the total number of rows in the data frame is not divisible by the number of rows you want in each set, then you can do the following when you create the sets:
data = data %>%
mutate(set = sample(rep(1:ceiling(500/20), each=20))[1:n()])
In this case, the set sizes will vary a bit, because the number of data rows is not divisible by the desired number of rows in each set.
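To make the uneven case concrete, here is a small base R sketch with a made-up row count of 510 (not from the question). It shows the truncation approach from above next to an alternative that uses rep with length.out, which keeps the set sizes within one row of each other:

```r
set.seed(104)
n_rows <- 510    # hypothetical row count, not divisible by 20
set_size <- 20
n_sets <- ceiling(n_rows / set_size)  # 26 sets

# Approach from above: build 26 * 20 labels, shuffle, keep the first 510.
# The 10 dropped labels are random, so some sets end up smaller than others.
set_truncated <- sample(rep(seq_len(n_sets), each = set_size))[seq_len(n_rows)]
table(set_truncated)

# Alternative: recycle the labels to exactly 510, then shuffle.
# Set sizes then differ by at most one row.
set_even <- sample(rep(seq_len(n_sets), length.out = n_rows))
table(set_even)
```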
The following approach could be worth trying for someone in a similar position.
It is based on the numerical balancing in groupdata2's fold() function, which allows creating groups with balanced means for a single column. By standardizing each of the columns and numerically balancing their rowwise sum, we might increase the chance of getting balanced means in the individual columns.
I compared this approach to creating groups randomly a few times and selecting the split with the least variance in means. It seems to be a bit better, but I'm not too convinced that this will hold in all contexts.
# Attach dplyr and groupdata2
library(dplyr)
library(groupdata2)
set.seed(1)
# Create the dataset
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- dplyr::as_tibble(data)
# Standardize all columns and calculate row sums
data_std <- data %>%
dplyr::mutate_all(.funs = function(x){(x-mean(x))/sd(x)}) %>%
dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create groups (new column called ".folds")
# We numerically balance the "total" column
data_std <- data_std %>%
groupdata2::fold(k = 25, num_col = "total") # k = 500/20=25
# Transfer the groups to the original (non-standardized) data frame
data$group <- data_std$.folds
# Check the means
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean)
> # A tibble: 25 x 9
> group A B C D E F G H
> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 1 4.48 5.05 4.80 5.65 5.04 4.60 5.12 4.85
> 2 2 5.57 5.17 3.21 5.46 4.46 5.89 5.06 4.79
> 3 3 4.33 6.02 4.57 6.18 4.76 3.79 5.94 3.71
> 4 4 4.51 4.62 4.62 5.27 4.65 5.41 5.26 5.23
> 5 5 4.55 5.10 4.19 5.41 5.28 5.39 5.57 4.23
> 6 6 4.82 4.74 6.10 4.34 4.82 5.08 4.89 4.81
> 7 7 5.88 4.49 4.13 3.91 5.62 4.75 5.46 5.26
> 8 8 4.11 5.50 5.61 4.23 5.30 4.60 4.96 5.35
> 9 9 4.30 3.74 6.45 5.60 3.56 4.92 5.57 5.32
> 10 10 5.26 5.50 4.35 5.29 4.53 4.75 4.49 5.45
> # … with 15 more rows
# Check the standard deviations of the means
# Could be used to compare methods
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd))
> # A tibble: 1 x 8
> A B C D E F G H
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.496 0.546 0.764 0.669 0.591 0.611 0.690 0.475
It might be best to compare the means and mean variances (or standard deviations as above) of different methods on the standardized data though. In that case, one could calculate the sum of the variances and minimize it.
data_std %>%
dplyr::select(-total) %>%
dplyr::group_by(.folds) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd)) %>%
sum()
> 1.643989
Comparing multiple balanced splits
The fold() function allows creating multiple unique grouping factors (splits) at once. So here, I will perform the numerically balanced split 20 times and find the grouping with the lowest sum of the standard deviations of the means. I'll further convert it to a function.
create_multi_balanced_groups <- function(data, cols, k, num_tries){
# Extract the variables of interest
# We assume these are numeric but we could add a check
data_to_balance <- data[, cols]
# Standardize all columns
# And calculate rowwise sums
data_std <- data_to_balance %>%
dplyr::mutate_all(.funs = function(x){(x-mean(x))/sd(x)}) %>%
dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create `num_tries` unique numerically balanced splits
data_std <- data_std %>%
groupdata2::fold(
k = k,
num_fold_cols = num_tries,
num_col = "total"
)
# The new fold column names ".folds_1", ".folds_2", etc.
fold_col_names <- paste0(".folds_", seq_len(num_tries))
# Remove total column
data_std <- data_std %>%
dplyr::select(-total)
# Calculate score for each split
# This could probably be done more efficiently without a for loop
variance_scores <- c()
for (fcol in fold_col_names){
score <- data_std %>%
dplyr::group_by(!!as.name(fcol)) %>%
dplyr::summarise(across(where(is.numeric), mean)) %>%
dplyr::summarise(across(where(is.numeric), sd)) %>%
sum()
variance_scores <- append(variance_scores, score)
}
# Get the fold column with the lowest score
lowest_fcol_index <- which.min(variance_scores)
best_fcol <- fold_col_names[[lowest_fcol_index]]
# Add the best fold column / grouping factor to the original data
data[["group"]] <- data_std[[best_fcol]]
# Return the original data and the score of the best fold column
list(data, min(variance_scores))
}
# Run with 20 splits
set.seed(1)
data_grouped_and_score <- create_multi_balanced_groups(
data = data,
cols = c("A", "B", "C", "D", "E", "F", "G", "H"),
k = 25,
num_tries = 20
)
# Check data
data_grouped_and_score[[1]]
> # A tibble: 500 x 9
> A B C D E F G H group
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
> 1 5.86 6.54 0.500 2.88 5.70 9.67 2.29 3.01 2
> 2 0.0895 4.69 5.71 0.343 8.95 7.73 5.76 9.58 1
> 3 2.94 1.78 2.06 6.66 9.54 0.600 4.26 0.771 16
> 4 2.77 1.52 0.723 8.11 8.95 1.37 6.32 6.24 7
> 5 8.14 2.49 0.467 8.51 0.889 6.28 4.47 8.63 13
> 6 2.60 8.23 9.17 5.14 2.85 8.54 8.94 0.619 23
> 7 7.24 0.260 6.64 8.35 8.59 0.0862 1.73 8.10 5
> 8 9.06 1.11 6.01 5.35 2.01 9.37 7.47 1.01 1
> 9 9.49 5.48 3.64 1.94 3.24 2.49 3.63 5.52 7
> 10 0.731 0.230 5.29 8.43 5.40 8.50 3.46 1.23 10
> # … with 490 more rows
# Check score
data_grouped_and_score[[2]]
> 1.552656
By commenting out the num_col = "total" line, we can run this without the numerical balancing. For me, this gave a score of 1.615257.
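For readers without groupdata2, the purely random baseline can be sketched in base R as follows. This is an illustration under assumptions, not the package's method: the helper score_split is a made-up name, and the score is the same sum of standard deviations of group means on standardized columns that is used above.

```r
set.seed(1)
data <- as.data.frame(matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8))
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")

# Standardize every column so the columns contribute equally to the score
data_std <- as.data.frame(scale(data))

# Hypothetical helper: sum over columns of the sd of the group means
score_split <- function(std_data, groups) {
  group_means <- aggregate(std_data, by = list(group = groups), FUN = mean)
  sum(sapply(group_means[-1], sd))
}

k <- 25          # number of groups (500 / 20)
num_tries <- 20  # number of random splits to compare

# Each candidate is a random assignment of the 500 rows to 25 equal groups
candidates <- replicate(
  num_tries,
  sample(rep(seq_len(k), length.out = nrow(data))),
  simplify = FALSE
)
scores <- vapply(candidates, function(g) score_split(data_std, g), numeric(1))

# Keep the random split with the lowest score
data$group <- candidates[[which.min(scores)]]
min(scores)
```

Comparing min(scores) from this sketch with the score of a numerically balanced split gives a rough sense of how much the balancing step contributes beyond simply picking the best of several random splits.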
Disclaimer: I am the author of the groupdata2 package. The fold() function can also balance a categorical column (cat_col) and keep all data points with the same ID in the same fold (id_col) (e.g. to avoid leakage in cross-validation). There's a very similar partition() function as well.