Simulate data and randomly add missing values to dataframe - r

How can I randomly add missing values to some or each column (say random ~5% missing in each) in a simulated dataframe, plus, is there a more efficient way of simulating a dataframe with both continuous and factor columns?
#Simulate some data
N <- 2000
data <- data.frame(id = 1:2000,age = rnorm(N,18:90),bmi = rnorm(N,15:40),
chol = rnorm(N,50:350), insulin = rnorm(N,2:40),sbp = rnorm(N, 50:200),
dbp = rnorm(N, 30:150), sex = c(rep(1, 1000), rep(2, 1000)),
smoke = rep(c(1, 2), 1000), educ = sample(LETTERS[1:4]))
#Manually add some missing values
data <- data %>%
mutate(age = "is.na<-"(age, age <19 | age >88),
bmi = "is.na<-"(bmi, bmi >38 | bmi <16),
insulin = "is.na<-"(insulin, insulin >38),
educ = "is.na<-"(educ, bmi >35))

Best solution in my opinion would be using the mice package for this. This is a R package dedicated to imputation. It also has a function called amputate for introducing missing data into a data.frame.
ampute - Generate Missing Data For Simulation Purposes
This function generates multivariate missing data in a MCAR, MAR or MNAR manner.
The advantage of this solution is you can set multiple parameters for the simulation of your missing data.
ampute(data, prop = 0.5, patterns = NULL, freq = NULL, mech = "MAR",
weights = NULL, cont = TRUE, type = NULL, odds = NULL,
bycases = TRUE, run = TRUE)
As you can see you can set the percentage of missing values, the missing data mechanism (MCAR would be your choice for missing completely at random) and several other parameters. This solution would also be quite clean since it is only 1 line of code.

Here's a tidyverse approach that will remove roughly 20% of your data for each column you specify:
set.seed(1)
# example data
N <- 20
data <- data.frame(id = 1:N,
age = rnorm(N,18:90),
bmi = rnorm(N,15:40),
chol = rnorm(N,50:350))
library(tidyverse)
# specify which variables should have missing data and prc of missing data
c_names = c("age","bmi")
prc_missing = 0.20
data %>%
gather(var, value, -id) %>% # reshape data
mutate(r = runif(nrow(.)), # simulate a random number from 0 to 1 for each row
value = ifelse(var %in% c_names & r <= prc_missing, NA, value)) %>% # if it's one of the variables you specified and the random number is less than your threshold update to NA
select(-r) %>% # remove random number
spread(var, value) # reshape back to original format
# id age bmi chol
# 1 1 17.37355 15.91898 49.83548
# 2 2 19.18364 16.78214 50.74664
# 3 3 19.16437 17.07456 52.69696
# 4 4 NA 16.01065 53.55666
# 5 5 22.32951 19.61983 53.31124
# 6 6 22.17953 19.94387 54.29250
# 7 7 24.48743 NA 56.36458
# 8 8 25.73832 20.52925 57.76853
# 9 9 26.57578 NA 57.88765
# 10 10 26.69461 24.41794 59.88111
# 11 11 29.51178 26.35868 60.39811
# 12 12 NA 25.89721 60.38797
# 13 13 NA 27.38767 62.34112
# 14 14 28.78530 27.94619 61.87064
# 15 15 33.12493 27.62294 65.43302
# 16 16 32.95507 NA 66.98040
# 17 17 33.98381 30.60571 65.63278
# 18 18 35.94384 NA 65.95587
# 19 19 36.82122 34.10003 68.56972
# 20 20 37.59390 34.76318 68.86495
And this is an alternative that will remove exactly 20% of data for the columns you specify:
set.seed(1)
# example data
N <- 20
data <- data.frame(id = 1:N,
age = rnorm(N,18:90),
bmi = rnorm(N,15:40),
chol = rnorm(N,50:350))
library(tidyverse)
# specify which variables should have missing data and prc of missing data
c_names = c("age","bmi")
prc_missing = 0.20
n_remove = prc_missing * nrow(data)
data %>%
gather(var, value, -id) %>% # reshape data
sample_frac(1) %>% # shuffle rows
group_by(var) %>% # for each variables
mutate(value = ifelse(var %in% c_names & row_number() <= n_remove, NA, value)) %>% # update to NA top x number of rows if it's one of the variables you specified
spread(var, value) # reshape to original format
# # A tibble: 20 x 4
# id age bmi chol
# <int> <dbl> <dbl> <dbl>
# 1 1 17.4 15.9 49.8
# 2 2 19.2 16.8 50.7
# 3 3 19.2 17.1 52.7
# 4 4 NA 16.0 53.6
# 5 5 22.3 NA 53.3
# 6 6 22.2 19.9 54.3
# 7 7 24.5 20.8 56.4
# 8 8 25.7 NA 57.8
# 9 9 26.6 NA 57.9
# 10 10 NA NA 59.9
# 11 11 NA 26.4 60.4
# 12 12 NA 25.9 60.4
# 13 13 29.4 27.4 62.3
# 14 14 28.8 27.9 61.9
# 15 15 33.1 27.6 65.4
# 16 16 33.0 29.6 67.0
# 17 17 34.0 30.6 65.6
# 18 18 35.9 31.9 66.0
# 19 19 36.8 34.1 68.6
# 20 20 37.6 34.8 68.9

Would this work?
n_rows <- nrow(data)
perc_missing <- 5 # percentage missing data
row_missing <- sample(1:n_rows, sample(1:n_rows, round(perc_missing/100 * n_rows,0))) # sample randomly x% of rows
col_missing <- 1 # define column
data[row_missing, col_missing] <- NA # assign missing values

Related

Make a loop function by including gender into the algorithm

I have the following data set:
Age<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
R<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
sex<-c(1,0,1,1,1,1,1,0,0,0,NA, 0,1)
df1<-data.frame(Age,R,sex)
# Second dataset:
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.F<-data.frame(Age2, Mspline)
# Third data
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.M<-data.frame(Age2, Mspline)
I was wondering how I can include gender into the calculation and combine these two algorithm to make a loop function. What I need is:
If sex=1 then use the following function to calculate Time
last = dim(df2.F)[1]
fM.F<-approxfun(df2.F$Age2, df2.F$Mspline, yleft = df2.F$Mspline[1] , yright = df2.F$Mspline[last])
df1$Time<-fM.F(df1$Age)
and If sex=0 then use this function to calculate Time
last = dim(df2.M)[1]
fM.M<-approxfun(df2.M$Age2, df2.M$Mspline, yleft = df2.M$Mspline[1] , yright = df2.M$Mspline[last])
df1$Time<-fM.M(df1$Age)
I mean: Read the first record in df1 if it is Female (with age=4.1) the time=fM.F(its age=4.1) but if the gender is Male then to calculate Time apply fM.M on its age so time=fM.M(4.1)
You can create a function that takes the Age vector, the sex value, and the male and female specific dataframes, and selects the frame to use based on the sex value.
f <- function(age, s, m,f) {
if(is.na(s)) return(NA)
if(s==0) df = m
else df = f
last = dim(df)[1]
fM<-approxfun(df$Age2, df$Mspline, yleft = df$Mspline[1] , yright = df$Mspline[last])
fM(age)
}
Now, just apply the function by group, using pull(cur_group(),sex) to get the sex value for the current group.
library(dplyr)
df1 %>%
group_by(sex) %>%
mutate(time = f(Age, pull(cur_group(),sex), df2.M, df2.F))
Output:
Age R sex time
<dbl> <dbl> <dbl> <dbl>
1 2 2 1 -0.186
2 2.1 2.1 0 1.02
3 2.2 2.2 1 -1.55
4 3.4 3.4 1 -0.461
5 3.5 3.5 1 0.342
6 4.2 4.2 1 -0.560
7 4.7 4.7 1 -0.114
8 4.8 4.8 0 0.247
9 5 5 0 -0.510
10 5.6 5.6 0 -0.982
11 NA NA NA NA
12 5.9 5.9 0 -0.231
13 NA NA 1 NA

Problems with appending t.test results in a for loop

Let me take simulated datasets to explain:
I have dataset dt and dt1
# dataset 1 `dt`
set.seed(12)
dt <- rnorm(5000,mean=10,sd=1)
dt <- data.frame(dt)
dt$group <- c("case","control")
colnames(dt) <- c("severity","group")
head(dt)
severity group
1 8.519432 case
2 11.577169 control
3 9.043256 case
4 9.079995 control
5 8.002358 case
6 9.727704 control
# dataset 2 `dt2`
set.seed(12)
dt2 <- rnorm(200,mean=12,sd=1)
dt2 <- data.frame(dt2)
dt2$group <- c("case2","control2")
colnames(dt2) <- c("severity","group")
head(dt2)
severity group
1 10.51943 case2
2 13.57717 control2
3 11.04326 case2
4 11.07999 control2
5 10.00236 case2
6 11.72770 control2
I am building one 1000 iterations for loop to do the following steps:
randomly take 500 rows from the dt and save as dt_sub
rbind dt_sub with dt2 and save as bd
select only rows with group as either case2 or control from the bd dataset (only cares the difference between these two groups)
t.tests on the variable severity between the case2 and control group
output t.tests results to t
use a for loop to repeat 1000 times
iteratively appends all t.test results to a dataframe results
Following is the code that I built in r
library(broom)
library(dplyr)
iter <- 1000
t <- data.frame()
for (i in 1:iter) {
dt_sub <- dt[sample(nrow(dt),500),]
bd <- rbind(dt_sub,dt2)
compare <- filter(bd, group %in% c("case2", "control"))
compare %>% group_by(group) %>% do(tidy(t.test(severity ~ group,data = compare))) -> t
t$iter <- i
}
results <- do.call(rbind,t)
My question is, this code works well when iter=1, but how should I set the compare %>% group_by(group) %>% do(tidy(t.test(severity ~ group,data = compare))) -> t line to ensure each run's t.test results will not be overwritten when iter ≥ 1? I tried t[i] but failed, anyone could advise please?
Thanks.
Create a function which runs the process once.
library(broom)
library(dplyr)
t_test_function <- function() {
dt_sub <- dt[sample(nrow(dt),500),]
bd <- rbind(dt_sub,dt2)
compare <- filter(bd, group %in% c("case2", "control"))
compare %>%
group_by(group) %>%
do(tidy(t.test(severity ~ group,data = compare))) %>%
ungroup
}
t_test_function()
# group estimate estimate1 estimate2 statistic p.value parameter conf.low
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 case2 1.94 11.9 9.99 17.4 9.40e-42 199. #1.72
#2 cont… 1.94 11.9 9.99 17.4 9.40e-42 199. 1.72
# … with 3 more variables: conf.high <dbl>, method <chr>,
# alternative <chr>
Now you can call this iter times using replicate and combine the dataset.
iter <- 5
results <- bind_rows(replicate(iter, t_test_function(), simplify = FALSE), .id = 'iter')
# A tibble: 10 x 12
# iter group estimate estimate1 estimate2 statistic p.value parameter
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 case2 1.88 11.9 10.1 17.3 1.05e-40 189.
# 2 1 cont… 1.88 11.9 10.1 17.3 1.05e-40 189.
# 3 2 case2 1.96 11.9 9.97 17.8 9.88e-43 194.
# 4 2 cont… 1.96 11.9 9.97 17.8 9.88e-43 194.
# 5 3 case2 1.94 11.9 9.99 17.9 3.76e-42 184.
# 6 3 cont… 1.94 11.9 9.99 17.9 3.76e-42 184.
# 7 4 case2 2.03 11.9 9.90 18.6 1.82e-44 189.
# 8 4 cont… 2.03 11.9 9.90 18.6 1.82e-44 189.
# 9 5 case2 1.96 11.9 9.97 18.1 7.05e-43 187.
#10 5 cont… 1.96 11.9 9.97 18.1 7.05e-43 187.
# … with 4 more variables: conf.low <dbl>, conf.high <dbl>, method <chr>,
# alternative <chr>

Sum total distance by groups

I have a df tracking movement of points each hour. I want to find the total distance traveled by that group/trial by adding the distance between the hourly coordinates, but I'm confusing myself with apply functions.
I want to say "in each group/trial, sum [distance(hour1-hou2), distance(hour2=hour3), distance(hour3-hour4)....] until current hour so on each line, I have a cumulative distance travelled value.
I've created a fake df below.
paths <- data.frame(matrix(nrow=80,ncol=5))
colnames(paths) <- c("trt","trial","hour","X","Y")
paths$trt <- rep(c("A","B","C","D"),each=20)
paths$trial <- rep(c(rep(1,times=10),rep(2,times=10)),times=4)
paths$hour <- rep(1:10,times=8)
paths[,4:5] <- runif(160,0,50)
#this shows the paths that I want to measure.
ggplot(data=paths,aes(x=X,y=Y,group=interaction(trt,trial),color=trt))+
geom_path()
I probably want to add a column paths$dist.traveled to keep track each hour.
I think I could use apply or maybe even aggregate but I've been using PointDistance to find the distances, so I'm a bit confused. I also would rather not do a loop inside a loop, because the real dataset is large.
Here's an answer that uses {dplyr}:
library(dplyr)
paths %>%
arrange(trt, trial, hour) %>%
group_by(trt, trial) %>%
mutate(dist_travelled = sqrt((X - lag(X))^2 + (Y - lag(Y))^2)) %>%
mutate(total_dist = sum(dist_travelled, na.rm = TRUE)) %>%
ungroup()
If you wanted the total distance but grouped only by trt and not trial you would just remove that from the call to group_by().
Is this what you are trying to achieve?:
paths %>%
mutate(dist.traveled = sqrt((X-lag(X))^2 + (Y-lag(Y))^2))
trt trial hour X Y dist.traveled
<chr> <dbl> <int> <dbl> <dbl> <dbl>
1 A 1 1 11.2 26.9 NA
2 A 1 2 20.1 1.48 27.0
3 A 1 3 30.4 0.601 10.4
4 A 1 4 31.1 26.6 26.0
5 A 1 5 38.1 30.4 7.88
6 A 1 6 27.9 47.9 20.2
7 A 1 7 16.5 35.3 16.9
8 A 1 8 0.328 13.0 27.6
9 A 1 9 14.0 41.7 31.8
10 A 1 10 29.7 7.27 37.8
# ... with 70 more rows
paths$dist.travelled[which(paths$hour==1)] <- NA
paths %>%
group_by(trt)%>%
summarise(total_distance = sum(dist.traveled, na.rm = TRUE))
trt total_distance
<chr> <dbl>
1 A 492.
2 B 508.
3 C 479.
4 D 462.
I am adding the new column to calculate distances for each group, and them sum them up.

For loop warning: "number of items to replace is not a multiple of replacement length" with two dataframes

I am trying to create a new vector by applying a transformation to a variable in one of my dataframe based on data from another dataframe.
I have two dataframes df1 and df2. df1 and df2 have different dimension, I have over 20,000 rows in df1 and 76 rows in df2.
df1 is my original dataset. I created df2 for Ag_ppm as follow:
df2 <- df1%>%
filter(!is.na(Ag_ppm)) %>%
group_by(Year,Zone, SubZone) %>%
summarise(
n = sum(!is.na(Ag_ppm)),
min = min(Ag_ppm),
max = max(Ag_ppm),
mean = mean(Ag_ppm),
sd = sd(Ag_ppm),
iqr = IQR(Ag_ppm),
Q1 = quantile(Ag_ppm, 0.25),
median = median(Ag_ppm),
Q3 = quantile(Ag_ppm, 0.75),
LW = min(Ag_ppm > (quantile(Ag_ppm, .25)-1.5*IQR(Ag_ppm))),
UF = quantile(Ag_ppm, .75) + 1.5*IQR(Ag_ppm))
Here is what the first rows of each data frames look like:
head(df1, n=5)
# A tibble: 5 x 12
Year Zone SubZone Au_ppm Ag_ppm Cu_ppm Pb_ppm Zn_ppm As_ppm Sb_ppm Bi_ppm Mo_ppm
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1990 BugLake BugLake 0.007 3.7 17 27 23 1 1 NA 1
2 1983 Johnny Mountain Johnny Mountain 0.01 1.6 71 63 550 4 NA NA NA
3 1983 Khyber Pass Khyber Pass 0.12 11.5 275 204 8230 178 7 60 NA
4 1987 Chebry Ridge Line Grid 0.05 2.2 35 21 105 16 6 NA NA
5 1987 Chebry Handel Grid 0.004 1.3 29 27 663 45 2 NA NA
head(df2, n=5)
# A tibble: 5 x 14
# Groups: Year, Zone [3]
Year Zone SubZone n min max mean sd iqr Q1 median Q3 LW UF
<chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1981 Chebry Handel 52 0.6 5.1 1.83 0.947 0.925 1.2 1.6 2.12 1 3.51
2 1981 Imperial Metals Handel 24 0.9 6.9 2.81 1.43 1.35 1.95 2.65 3.3 1 5.33
3 1983 Chebry Chebry 5 0.7 3.7 1.78 1.19 0.9 1.2 1.2 2.1 1 3.45
4 1983 Chebry Handel 17 0.1 0.7 0.318 0.163 0.2 0.2 0.3 0.4 1 0.7
5 1983 Chebry Handel Grid 225 0.1 16 0.892 1.33 0.7 0.3 0.6 1 1 2.05
I want to apply the following equation to my column Ag_ppm in df1 using the median and IQR calculated for each subgroup in df2:
Z = (X - median)/IQR
For that purpose, I wrote:
# Initialize Ag_std vector with NA values
Ag_std <- rep(NA, times = nrow(df1))
# Populate Ag_std vector with standardized Ag values
Ag_std <-
for (i in 1:nrow(df1)) {
if (!is.na(df1$Ag_ppm[i])) {
filter(df2, Zone == df1$Zone[i], Year == df1$Year[i],
SubZone == df1$SubZone[i])
Ag_std[i] <- (df1$Ag_ppm[i] - df2$median)/df2$iqr
}
}
But the loop does not work (it returns a NULL vector) and I have this warning:
1: In Ag_std[i] <- (df1$Ag_ppm[i] - df2$median)/df2$iqr :
number of items to replace is not a multiple of replacement length
I've looked similar questions, and I did not find an answer that would work for me. Any help would be much appreciated!
If there are better ways of achieving the same without a loop (I'm sure there are, e.g. apply()), I would appreciate such comments as well. Unfortunately I'm not familiar enough with the alternatives to be able to implement them quickly.
This can be done relatively easily in data.table
library(data.table)
DT <- data.table(df1)
#function to apply
fun <- function(x) (x - median(x)) / diff (quantile( x, c(.25, .75)))
# create a new column with desired result
DT[, Ag_std := fun(Ag_ppm), by = list(Year, Zone, SubZone)]
Also, I think your loop can be fixed by assigning the result of 'filter' to a temporary object
for (i in 1:nrow(df1)) {
if (!is.na(df1$Ag_ppm[i])) {
temp.var <- filter(df2, Zone == df1$Zone[i], Year == df1$Year[i],
SubZone == df1$SubZone[i])
Ag_std[i] <- (df1$Ag_ppm[i] - temp.var$median)/temp.var$iqr
}
}
Since you have df2 as a seperate dataframe, you can join and mutate:
df1 %>%
left_join(df2, by = c("Year", "Zone", "SubZone")) %>%
mutate(Z = (Ag_ppm - median) / iqr)
In fact, you could have generated info in df2 in df1 itself using summarise

How to divide dataset into balanced sets based on multiple variables

I have a large dataset I need to divide into multiple balanced sets.
The set looks something like the following:
> data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
> colnames(data)<-c("A","B","C","D","E","F","G","H")
The sets, each containing for example 20 rows, will need to be balanced across multiple variables so that each subset ends up having a similar mean of B, C, D that's included in their subgroup compared to all the other subsets.
Is there a way to do that with R? Any advice would be much appreciated. Thank you in advance!
library(tidyverse)
# Reproducible data
set.seed(2)
data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
colnames(data)<-c("A","B","C","D","E","F","G","H")
data=as.data.frame(data)
Updated Answer
It's probably not possible to get similar means across sets within each column if you want to keep observations from a given row together. With 8 columns (as in your sample data), you'd need 25 20-row sets where each column A set has the same mean, each column B set has the same mean, etc. That's a lot of constraints. Probably there are, however, algorithms that could find the set membership assignment schedule that minimizes the difference in set means.
However, if you can separately take 20 observations from each column without regard to which row it came from, then here's one option:
# Group into sets with same means
same_means = data %>%
gather(key, value) %>%
arrange(value) %>%
group_by(key) %>%
mutate(set = c(rep(1:25, 10), rep(25:1, 10)))
# Check means by set for each column
same_means %>%
group_by(key, set) %>%
summarise(mean=mean(value)) %>%
spread(key, mean) %>% as.data.frame
set A B C D E F G H
1 1 4.940018 5.018584 5.117592 4.931069 5.016401 5.171896 4.886093 5.047926
2 2 4.946496 5.018578 5.124084 4.936461 5.017041 5.172817 4.887383 5.048850
3 3 4.947443 5.021511 5.125649 4.929010 5.015181 5.173983 4.880492 5.044192
4 4 4.948340 5.014958 5.126480 4.922940 5.007478 5.175898 4.878876 5.042789
5 5 4.943010 5.018506 5.123188 4.924283 5.019847 5.174981 4.869466 5.046532
6 6 4.942808 5.019945 5.123633 4.924036 5.019279 5.186053 4.870271 5.044757
7 7 4.945312 5.022991 5.120904 4.919835 5.019173 5.187910 4.869666 5.041317
8 8 4.947457 5.024992 5.125821 4.915033 5.016782 5.187996 4.867533 5.043262
9 9 4.936680 5.020040 5.128815 4.917770 5.022527 5.180950 4.864416 5.043587
10 10 4.943435 5.022840 5.122607 4.921102 5.018274 5.183719 4.872688 5.036263
11 11 4.942015 5.024077 5.121594 4.921965 5.015766 5.185075 4.880304 5.045362
12 12 4.944416 5.024906 5.119663 4.925396 5.023136 5.183449 4.887840 5.044733
13 13 4.946751 5.020960 5.127302 4.923513 5.014100 5.186527 4.889140 5.048425
14 14 4.949517 5.011549 5.127794 4.925720 5.006624 5.188227 4.882128 5.055608
15 15 4.943008 5.013135 5.130486 4.930377 5.002825 5.194421 4.884593 5.051968
16 16 4.939554 5.021875 5.129392 4.930384 5.005527 5.197746 4.883358 5.052474
17 17 4.935909 5.019139 5.131258 4.922536 5.003273 5.204442 4.884018 5.059162
18 18 4.935830 5.022633 5.129389 4.927106 5.008391 5.210277 4.877859 5.054829
19 19 4.936171 5.025452 5.127276 4.927904 5.007995 5.206972 4.873620 5.054192
20 20 4.942925 5.018719 5.127394 4.929643 5.005699 5.202787 4.869454 5.055665
21 21 4.941351 5.014454 5.125727 4.932884 5.008633 5.205170 4.870352 5.047728
22 22 4.933846 5.019311 5.130156 4.923804 5.012874 5.213346 4.874263 5.056290
23 23 4.928815 5.021575 5.139077 4.923665 5.017180 5.211699 4.876333 5.056836
24 24 4.928739 5.024419 5.140386 4.925559 5.012995 5.214019 4.880025 5.055182
25 25 4.929357 5.025198 5.134391 4.930061 5.008571 5.217005 4.885442 5.062630
Original Answer
# Randomly group data into 20-row groups
set.seed(104)
data = data %>%
mutate(set = sample(rep(1:(500/20), each=20)))
head(data)
A B C D E F G H set
1 1.848823 6.920055 3.2283369 6.633721 6.794640 2.0288792 1.984295 2.09812642 10
2 7.023740 5.599569 0.4468325 5.198884 6.572196 0.9269249 9.700118 4.58840437 20
3 5.733263 3.426912 7.3168797 3.317611 8.301268 1.4466065 5.280740 0.09172101 19
4 1.680519 2.344975 4.9242313 6.163171 4.651894 2.2253335 1.175535 2.51299726 25
5 9.438393 4.296028 2.3563249 5.814513 1.717668 0.8130327 9.430833 0.68269106 19
6 9.434750 7.367007 1.2603451 5.952936 3.337172 5.2892300 5.139007 6.52763327 5
# Mean by set for each column
data %>% group_by(set) %>%
summarise_all(mean)
set A B C D E F G H
1 1 5.240236 6.143941 4.638874 5.367626 4.982008 4.200123 5.521844 5.083868
2 2 5.520983 5.257147 5.209941 4.504766 4.231175 3.642897 5.578811 6.439491
3 3 5.943011 3.556500 5.366094 4.583440 4.932206 4.725007 5.579103 5.420547
4 4 4.729387 4.755320 5.582982 4.763171 5.217154 5.224971 4.972047 3.892672
5 5 4.824812 4.527623 5.055745 4.556010 4.816255 4.426381 3.520427 6.398151
6 6 4.957994 7.517130 6.727288 4.757732 4.575019 6.220071 5.219651 5.130648
7 7 5.344701 4.650095 5.736826 5.161822 5.208502 5.645190 4.266679 4.243660
8 8 4.003065 4.578335 5.797876 4.968013 5.130712 6.192811 4.282839 5.669198
9 9 4.766465 4.395451 5.485031 4.577186 5.366829 5.653012 4.550389 4.367806
10 10 4.695404 5.295599 5.123817 5.358232 5.439788 5.643931 5.127332 5.089670
# ... with 15 more rows
If the total number of rows in the data frame is not divisible by the number of rows you want in each set, then you can do the following when you create the sets:
data = data %>%
mutate(set = sample(rep(1:ceiling(500/20), each=20))[1:n()])
In this case, the set sizes will vary a bit with the number of data rows is not divisible by the desired number of rows in each set.
The following approach could be worth trying for someone in a similar position.
It is based on the numerical balancing in groupdata2's fold() function, which allows creating groups with balanced means for a single column. By standardizing each of the columns and numerically balancing their rowwise sum, we might increase the chance of getting balanced means in the individual columns.
I compared this approach to creating groups randomly a few times and selecting the split with the least variance in means. It seems to be a bit better, but I'm not too convinced that this will hold in all contexts.
# Attach dplyr and groupdata2
library(dplyr)
library(groupdata2)
set.seed(1)
# Create the dataset
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- dplyr::as_tibble(data)
# Standardize all columns and calculate row sums
data_std <- data %>%
dplyr::mutate_all(.funs = function(x){(x-mean(x))/sd(x)}) %>%
dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create groups (new column called ".folds")
# We numerically balance the "total" column
data_std <- data_std %>%
groupdata2::fold(k = 25, num_col = "total") # k = 500/20=25
# Transfer the groups to the original (non-standardized) data frame
data$group <- data_std$.folds
# Check the means
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean)
> # A tibble: 25 x 9
> group A B C D E F G H
> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 1 4.48 5.05 4.80 5.65 5.04 4.60 5.12 4.85
> 2 2 5.57 5.17 3.21 5.46 4.46 5.89 5.06 4.79
> 3 3 4.33 6.02 4.57 6.18 4.76 3.79 5.94 3.71
> 4 4 4.51 4.62 4.62 5.27 4.65 5.41 5.26 5.23
> 5 5 4.55 5.10 4.19 5.41 5.28 5.39 5.57 4.23
> 6 6 4.82 4.74 6.10 4.34 4.82 5.08 4.89 4.81
> 7 7 5.88 4.49 4.13 3.91 5.62 4.75 5.46 5.26
> 8 8 4.11 5.50 5.61 4.23 5.30 4.60 4.96 5.35
> 9 9 4.30 3.74 6.45 5.60 3.56 4.92 5.57 5.32
> 10 10 5.26 5.50 4.35 5.29 4.53 4.75 4.49 5.45
> # … with 15 more rows
# Check the standard deviations of the means
# Could be used to compare methods
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd))
> # A tibble: 1 x 8
> A B C D E F G H
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.496 0.546 0.764 0.669 0.591 0.611 0.690 0.475
It might be best to compare the means and mean variances (or standard deviations as above) of different methods on the standardized data though. In that case, one could calculate the sum of the variances and minimize it.
data_std %>%
dplyr::select(-total) %>%
dplyr::group_by(.folds) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd)) %>%
sum()
> 1.643989
Comparing multiple balanced splits
The fold() function allows creating multiple unique grouping factors (splits) at once. So here, I will perform the numerically balanced split 20 times and find the grouping with the lowest sum of the standard deviations of the means. I'll further convert it to a function.
create_multi_balanced_groups <- function(data, cols, k, num_tries){
# Extract the variables of interest
# We assume these are numeric but we could add a check
data_to_balance <- data[, cols]
# Standardize all columns
# And calculate rowwise sums
data_std <- data_to_balance %>%
dplyr::mutate_all(.funs = function(x){(x-mean(x))/sd(x)}) %>%
dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create `num_tries` unique numerically balanced splits
data_std <- data_std %>%
groupdata2::fold(
k = k,
num_fold_cols = num_tries,
num_col = "total"
)
# The new fold column names ".folds_1", ".folds_2", etc.
fold_col_names <- paste0(".folds_", seq_len(num_tries))
# Remove total column
data_std <- data_std %>%
dplyr::select(-total)
# Calculate score for each split
# This could probably be done more efficiently without a for loop
variance_scores <- c()
for (fcol in fold_col_names){
score <- data_std %>%
dplyr::group_by(!!as.name(fcol)) %>%
dplyr::summarise(across(where(is.numeric), mean)) %>%
dplyr::summarise(across(where(is.numeric), sd)) %>%
sum()
variance_scores <- append(variance_scores, score)
}
# Get the fold column with the lowest score
lowest_fcol_index <- which.min(variance_scores)
best_fcol <- fold_col_names[[lowest_fcol_index]]
# Add the best fold column / grouping factor to the original data
data[["group"]] <- data_std[[best_fcol]]
# Return the original data and the score of the best fold column
list(data, min(variance_scores))
}
# Run with 20 splits
set.seed(1)
data_grouped_and_score <- create_multi_balanced_groups(
data = data,
cols = c("A", "B", "C", "D", "E", "F", "G", "H"),
k = 25,
num_tries = 20
)
# Check data
data_grouped_and_score[[1]]
> # A tibble: 500 x 9
> A B C D E F G H group
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
> 1 5.86 6.54 0.500 2.88 5.70 9.67 2.29 3.01 2
> 2 0.0895 4.69 5.71 0.343 8.95 7.73 5.76 9.58 1
> 3 2.94 1.78 2.06 6.66 9.54 0.600 4.26 0.771 16
> 4 2.77 1.52 0.723 8.11 8.95 1.37 6.32 6.24 7
> 5 8.14 2.49 0.467 8.51 0.889 6.28 4.47 8.63 13
> 6 2.60 8.23 9.17 5.14 2.85 8.54 8.94 0.619 23
> 7 7.24 0.260 6.64 8.35 8.59 0.0862 1.73 8.10 5
> 8 9.06 1.11 6.01 5.35 2.01 9.37 7.47 1.01 1
> 9 9.49 5.48 3.64 1.94 3.24 2.49 3.63 5.52 7
> 10 0.731 0.230 5.29 8.43 5.40 8.50 3.46 1.23 10
> # … with 490 more rows
# Check score
data_grouped_and_score[[2]]
> 1.552656
By commenting out the num_col = "total" line, we can run this without the numerical balancing. For me, this gave a score of 1.615257.
Disclaimer: I am the author of the groupdata2 package. The fold() function can also balance a categorical column (cat_col) and keep all data points with the same ID in the same fold (id_col) (e.g. to avoid leakage in cross-validation). There's a very similar partition() function as well.

Resources