How to divide dataset into balanced sets based on multiple variables - r

I have a large dataset I need to divide into multiple balanced sets.
The set looks something like the following:
> data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
> colnames(data)<-c("A","B","C","D","E","F","G","H")
The sets, each containing for example 20 rows, need to be balanced across multiple variables, so that each subset ends up with a mean of B, C, and D similar to that of all the other subsets.
Is there a way to do that with R? Any advice would be much appreciated. Thank you in advance!

library(tidyverse)
# Reproducible data
set.seed(2)
data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
colnames(data)<-c("A","B","C","D","E","F","G","H")
data=as.data.frame(data)
Updated Answer
It's probably not possible to get similar means across sets within each column if you want to keep observations from a given row together. With 8 columns (as in your sample data), you'd need 25 20-row sets where each column A set has the same mean, each column B set has the same mean, etc. That's a lot of constraints. Probably there are, however, algorithms that could find the set membership assignment schedule that minimizes the difference in set means.
However, if you can separately take 20 observations from each column without regard to which row it came from, then here's one option:
# Group into sets with same means
same_means = data %>%
gather(key, value) %>%
arrange(value) %>%
group_by(key) %>%
mutate(set = c(rep(1:25, 10), rep(25:1, 10)))
# Check means by set for each column
same_means %>%
group_by(key, set) %>%
summarise(mean=mean(value)) %>%
spread(key, mean) %>% as.data.frame
set A B C D E F G H
1 1 4.940018 5.018584 5.117592 4.931069 5.016401 5.171896 4.886093 5.047926
2 2 4.946496 5.018578 5.124084 4.936461 5.017041 5.172817 4.887383 5.048850
3 3 4.947443 5.021511 5.125649 4.929010 5.015181 5.173983 4.880492 5.044192
4 4 4.948340 5.014958 5.126480 4.922940 5.007478 5.175898 4.878876 5.042789
5 5 4.943010 5.018506 5.123188 4.924283 5.019847 5.174981 4.869466 5.046532
6 6 4.942808 5.019945 5.123633 4.924036 5.019279 5.186053 4.870271 5.044757
7 7 4.945312 5.022991 5.120904 4.919835 5.019173 5.187910 4.869666 5.041317
8 8 4.947457 5.024992 5.125821 4.915033 5.016782 5.187996 4.867533 5.043262
9 9 4.936680 5.020040 5.128815 4.917770 5.022527 5.180950 4.864416 5.043587
10 10 4.943435 5.022840 5.122607 4.921102 5.018274 5.183719 4.872688 5.036263
11 11 4.942015 5.024077 5.121594 4.921965 5.015766 5.185075 4.880304 5.045362
12 12 4.944416 5.024906 5.119663 4.925396 5.023136 5.183449 4.887840 5.044733
13 13 4.946751 5.020960 5.127302 4.923513 5.014100 5.186527 4.889140 5.048425
14 14 4.949517 5.011549 5.127794 4.925720 5.006624 5.188227 4.882128 5.055608
15 15 4.943008 5.013135 5.130486 4.930377 5.002825 5.194421 4.884593 5.051968
16 16 4.939554 5.021875 5.129392 4.930384 5.005527 5.197746 4.883358 5.052474
17 17 4.935909 5.019139 5.131258 4.922536 5.003273 5.204442 4.884018 5.059162
18 18 4.935830 5.022633 5.129389 4.927106 5.008391 5.210277 4.877859 5.054829
19 19 4.936171 5.025452 5.127276 4.927904 5.007995 5.206972 4.873620 5.054192
20 20 4.942925 5.018719 5.127394 4.929643 5.005699 5.202787 4.869454 5.055665
21 21 4.941351 5.014454 5.125727 4.932884 5.008633 5.205170 4.870352 5.047728
22 22 4.933846 5.019311 5.130156 4.923804 5.012874 5.213346 4.874263 5.056290
23 23 4.928815 5.021575 5.139077 4.923665 5.017180 5.211699 4.876333 5.056836
24 24 4.928739 5.024419 5.140386 4.925559 5.012995 5.214019 4.880025 5.055182
25 25 4.929357 5.025198 5.134391 4.930061 5.008571 5.217005 4.885442 5.062630
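To see why the mirrored label pattern evens out the means, here is a small single-column sketch in base R. The vector x, the helper spread_of_means, and the comparison against a plain cyclic assignment are illustrative assumptions, not part of the original answer.

```r
set.seed(2)
x <- sort(runif(500, min = 0, max = 10))  # one column, sorted ascending

# Plain cyclic labels: set 1 always receives the smallest value of each block of 25
cyclic <- rep(1:25, 20)

# Mirrored labels as in the answer: ascending in the lower half,
# descending in the upper half of the sorted values
mirrored <- c(rep(1:25, 10), rep(25:1, 10))

# Hypothetical helper: range of the per-set means (smaller = better balanced)
spread_of_means <- function(values, set) {
  diff(range(tapply(values, set, mean)))
}

spread_cyclic <- spread_of_means(x, cyclic)
spread_mirrored <- spread_of_means(x, mirrored)
c(cyclic = spread_cyclic, mirrored = spread_mirrored)
```

With cyclic labels each set's mean drifts upward with its label. Mirroring the upper half offsets that drift, which is why the set means in the table above agree so closely.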
Original Answer
# Randomly group data into 20-row groups
set.seed(104)
data = data %>%
mutate(set = sample(rep(1:(500/20), each=20)))
head(data)
A B C D E F G H set
1 1.848823 6.920055 3.2283369 6.633721 6.794640 2.0288792 1.984295 2.09812642 10
2 7.023740 5.599569 0.4468325 5.198884 6.572196 0.9269249 9.700118 4.58840437 20
3 5.733263 3.426912 7.3168797 3.317611 8.301268 1.4466065 5.280740 0.09172101 19
4 1.680519 2.344975 4.9242313 6.163171 4.651894 2.2253335 1.175535 2.51299726 25
5 9.438393 4.296028 2.3563249 5.814513 1.717668 0.8130327 9.430833 0.68269106 19
6 9.434750 7.367007 1.2603451 5.952936 3.337172 5.2892300 5.139007 6.52763327 5
# Mean by set for each column
data %>% group_by(set) %>%
summarise_all(mean)
set A B C D E F G H
1 1 5.240236 6.143941 4.638874 5.367626 4.982008 4.200123 5.521844 5.083868
2 2 5.520983 5.257147 5.209941 4.504766 4.231175 3.642897 5.578811 6.439491
3 3 5.943011 3.556500 5.366094 4.583440 4.932206 4.725007 5.579103 5.420547
4 4 4.729387 4.755320 5.582982 4.763171 5.217154 5.224971 4.972047 3.892672
5 5 4.824812 4.527623 5.055745 4.556010 4.816255 4.426381 3.520427 6.398151
6 6 4.957994 7.517130 6.727288 4.757732 4.575019 6.220071 5.219651 5.130648
7 7 5.344701 4.650095 5.736826 5.161822 5.208502 5.645190 4.266679 4.243660
8 8 4.003065 4.578335 5.797876 4.968013 5.130712 6.192811 4.282839 5.669198
9 9 4.766465 4.395451 5.485031 4.577186 5.366829 5.653012 4.550389 4.367806
10 10 4.695404 5.295599 5.123817 5.358232 5.439788 5.643931 5.127332 5.089670
# ... with 15 more rows
If the total number of rows in the data frame is not divisible by the number of rows you want in each set, then you can do the following when you create the sets:
data = data %>%
mutate(set = sample(rep(1:ceiling(500/20), each=20))[1:n()])
In this case, the set sizes will vary a bit when the number of data rows is not divisible by the desired number of rows in each set.
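A quick base R check of how the set sizes come out when the row count is not a multiple of the set size. The row count of 510 and the length.out variant are assumptions added for illustration; the first assignment follows the pattern from the answer.

```r
set.seed(104)
n_rows <- 510     # assumed row count, not divisible by 20
set_size <- 20
k <- ceiling(n_rows / set_size)   # 26 sets

# Pattern from the answer: shuffle k * set_size labels, keep the first n_rows
set_a <- sample(rep(1:k, each = set_size))[1:n_rows]

# Alternative sketch: recycle the labels to exactly n_rows, then shuffle
set_b <- sample(rep(seq_len(k), length.out = n_rows))

table(set_a)   # sizes vary, because the surplus labels are dropped at random
table(set_b)   # sizes differ by at most one
```

In the first version the surplus labels are discarded at random, so some sets can lose several rows while others lose none. Recycling the labels to the exact row count keeps all set sizes within one row of each other.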

The following approach could be worth trying for someone in a similar position.
It is based on the numerical balancing in groupdata2's fold() function, which allows creating groups with balanced means for a single column. By standardizing each of the columns and numerically balancing their rowwise sum, we might increase the chance of getting balanced means in the individual columns.
I compared this approach to creating groups randomly a few times and selecting the split with the least variance in means. It seems to be a bit better, but I'm not too convinced that this will hold in all contexts.
# Attach dplyr and groupdata2
library(dplyr)
library(groupdata2)
set.seed(1)
# Create the dataset
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- dplyr::as_tibble(data)
# Standardize all columns and calculate row sums
data_std <- data %>%
dplyr::mutate_all(.funs = function(x){(x-mean(x))/sd(x)}) %>%
dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create groups (new column called ".folds")
# We numerically balance the "total" column
data_std <- data_std %>%
groupdata2::fold(k = 25, num_col = "total") # k = 500/20=25
# Transfer the groups to the original (non-standardized) data frame
data$group <- data_std$.folds
# Check the means
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean)
> # A tibble: 25 x 9
> group A B C D E F G H
> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 1 4.48 5.05 4.80 5.65 5.04 4.60 5.12 4.85
> 2 2 5.57 5.17 3.21 5.46 4.46 5.89 5.06 4.79
> 3 3 4.33 6.02 4.57 6.18 4.76 3.79 5.94 3.71
> 4 4 4.51 4.62 4.62 5.27 4.65 5.41 5.26 5.23
> 5 5 4.55 5.10 4.19 5.41 5.28 5.39 5.57 4.23
> 6 6 4.82 4.74 6.10 4.34 4.82 5.08 4.89 4.81
> 7 7 5.88 4.49 4.13 3.91 5.62 4.75 5.46 5.26
> 8 8 4.11 5.50 5.61 4.23 5.30 4.60 4.96 5.35
> 9 9 4.30 3.74 6.45 5.60 3.56 4.92 5.57 5.32
> 10 10 5.26 5.50 4.35 5.29 4.53 4.75 4.49 5.45
> # … with 15 more rows
# Check the standard deviations of the means
# Could be used to compare methods
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd))
> # A tibble: 1 x 8
> A B C D E F G H
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.496 0.546 0.764 0.669 0.591 0.611 0.690 0.475
It might be best to compare the means and mean variances (or standard deviations as above) of different methods on the standardized data though. In that case, one could calculate the sum of the variances and minimize it.
data_std %>%
dplyr::select(-total) %>%
dplyr::group_by(.folds) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd)) %>%
sum()
> 1.643989
Comparing multiple balanced splits
The fold() function allows creating multiple unique grouping factors (splits) at once. So here, I will perform the numerically balanced split 20 times and find the grouping with the lowest sum of the standard deviations of the means. I'll further convert it to a function.
create_multi_balanced_groups <- function(data, cols, k, num_tries){
# Extract the variables of interest
# We assume these are numeric but we could add a check
data_to_balance <- data[, cols]
# Standardize all columns
# And calculate rowwise sums
data_std <- data_to_balance %>%
dplyr::mutate_all(.funs = function(x){(x-mean(x))/sd(x)}) %>%
dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create `num_tries` unique numerically balanced splits
data_std <- data_std %>%
groupdata2::fold(
k = k,
num_fold_cols = num_tries,
num_col = "total"
)
# The new fold column names ".folds_1", ".folds_2", etc.
fold_col_names <- paste0(".folds_", seq_len(num_tries))
# Remove total column
data_std <- data_std %>%
dplyr::select(-total)
# Calculate score for each split
# This could probably be done more efficiently without a for loop
variance_scores <- c()
for (fcol in fold_col_names){
score <- data_std %>%
dplyr::group_by(!!as.name(fcol)) %>%
dplyr::summarise(across(where(is.numeric), mean)) %>%
dplyr::summarise(across(where(is.numeric), sd)) %>%
sum()
variance_scores <- append(variance_scores, score)
}
# Get the fold column with the lowest score
lowest_fcol_index <- which.min(variance_scores)
best_fcol <- fold_col_names[[lowest_fcol_index]]
# Add the best fold column / grouping factor to the original data
data[["group"]] <- data_std[[best_fcol]]
# Return the original data and the score of the best fold column
list(data, min(variance_scores))
}
# Run with 20 splits
set.seed(1)
data_grouped_and_score <- create_multi_balanced_groups(
data = data,
cols = c("A", "B", "C", "D", "E", "F", "G", "H"),
k = 25,
num_tries = 20
)
# Check data
data_grouped_and_score[[1]]
> # A tibble: 500 x 9
> A B C D E F G H group
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
> 1 5.86 6.54 0.500 2.88 5.70 9.67 2.29 3.01 2
> 2 0.0895 4.69 5.71 0.343 8.95 7.73 5.76 9.58 1
> 3 2.94 1.78 2.06 6.66 9.54 0.600 4.26 0.771 16
> 4 2.77 1.52 0.723 8.11 8.95 1.37 6.32 6.24 7
> 5 8.14 2.49 0.467 8.51 0.889 6.28 4.47 8.63 13
> 6 2.60 8.23 9.17 5.14 2.85 8.54 8.94 0.619 23
> 7 7.24 0.260 6.64 8.35 8.59 0.0862 1.73 8.10 5
> 8 9.06 1.11 6.01 5.35 2.01 9.37 7.47 1.01 1
> 9 9.49 5.48 3.64 1.94 3.24 2.49 3.63 5.52 7
> 10 0.731 0.230 5.29 8.43 5.40 8.50 3.46 1.23 10
> # … with 490 more rows
# Check score
data_grouped_and_score[[2]]
> 1.552656
By commenting out the num_col = "total" line, we can run this without the numerical balancing. For me, this gave a score of 1.615257.
Disclaimer: I am the author of the groupdata2 package. The fold() function can also balance a categorical column (cat_col) and keep all data points with the same ID in the same fold (id_col) (e.g. to avoid leakage in cross-validation). There's a very similar partition() function as well.
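For comparison, the random baseline mentioned in this answer (create groups randomly a few times and keep the split with the least spread in means) can be sketched in base R. The function names balance_score and random_balanced_groups are hypothetical, and the score mirrors the one used above: the sum over columns of the standard deviation of the group means, computed on standardized columns.

```r
set.seed(1)
data <- as.data.frame(matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8))
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")

# Hypothetical helper: lower score = group means are closer to each other
balance_score <- function(df, cols, group) {
  std <- scale(as.matrix(df[, cols]))               # standardize each column
  group_means <- rowsum(std, group) / as.vector(table(group))
  sum(apply(group_means, 2, sd))                    # SD of the means, summed over columns
}

# Hypothetical helper: try num_tries random splits into k groups, keep the best
random_balanced_groups <- function(df, cols, k, num_tries) {
  best_score <- Inf
  best_group <- NULL
  for (i in seq_len(num_tries)) {
    group <- sample(rep(seq_len(k), length.out = nrow(df)))
    score <- balance_score(df, cols, group)
    if (score < best_score) {
      best_score <- score
      best_group <- group
    }
  }
  list(group = best_group, score = best_score)
}

# Balance only the columns the question asks about
res <- random_balanced_groups(data, cols = c("B", "C", "D"), k = 25, num_tries = 20)
table(res$group)
res$score
```

Rows stay intact here, so this is only a search over random splits rather than an optimizer; increasing num_tries trades run time for a better chance of a well-balanced split.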

Related

Is there another way to calculate within-subject Hedges' g (and error)?

I'm carrying out a meta-analysis of within-subject (crossover) studies. I've read some papers that used the esc package (the esc_mean_sd function, more precisely) to calculate Hedges' g for this purpose. However, its output doubles the "n" of each study.
Note that "n" in the data is 12 for all three studies, while the output reports n = 24.
ID mean_exp mean_con sd_exp sd_con n
1 A 150 130 15 22 12
2 B 166 145 10 8 12
3 C 179 165 11 14 12
# What I did:
e1 <- esc_mean_sd(data[1,2],data[1,4],data[1,6],
data[1,3],data[1,5],data[1,6],
r = .9,es.type = "g")
e2 <- esc_mean_sd(data[2,2],data[2,4],data[2,6],
data[2,3],data[2,5],data[2,6],
r = .9,es.type = "g")
e3 <- esc_mean_sd(data[3,2],data[3,4],data[3,6],
data[3,3],data[3,5],data[3,6],
r = .9,es.type = "g")
data2 <- combine_esc(e1, e2, e3)
colnames(data2) <- c("study","es","weight","n","se","var","lCI","uCI","measure")
head(data2, 3)
# study es weight n se var lCI uCI measure
# 1 1.80 4.18 24 0.489 0.239 0.842 2.76 g
# 2 4.53 1.60 24 0.791 0.626 2.983 6.08 g
# 3 2.14 3.71 24 0.519 0.269 1.126 3.16 g

standardize a variable values differently based on another categorical variable in R (Using R Base)

I have a large dataset with a continuous variable "Cholesterol" measured at two visits for each participant (each participant has two rows: first visit = Before and second visit = After). I'd like to standardise cholesterol, but the Before and After visits are merged in one column, so standardising the whole column would not be accurate, because the calculation uses the mean and the SD.
Using base R, how can I create a new cholesterol variable standardised by Visit in the same data set? Standardisation should be done twice, once for Before and once for After, but the output (the standardised values) should end up in a single variable that follows the same structure as this DF.
DF$Cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
DF$Visit< -c(Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before, After,Before,After,Before,After)
# the standardisation function I want to apply
standardise <- function(x) {return((x-min(x,na.rm = T))/sd(x,na.rm = T))}
thank you in advance
Let's make your data, fix the df$visit assignment, change the standardise function to use the mean rather than the min, and assume that each new occurrence of "before" marks the next person. Then pivot to wide format and mutate the before and after standardised variables:
# mutate() and the pipe come from dplyr, pivot_wider() from tidyr
library(dplyr)
library(tidyr)
df <- data.frame(x = rep(1, 30))
df$cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
df$visit <- rep(c("before", "after"), 15)
standardise <- function(x) {return((x-mean(x,na.rm = T))/sd(x,na.rm = T))}
df <- df %>%
mutate(person = cumsum(visit == "before"))%>%
pivot_wider(names_from = visit, id_cols = person, values_from = cholesterol)%>%
mutate(before_std = standardise(before),
after_std = standardise(after))
gives:
person before after before_std after_std
<int> <dbl> <dbl> <dbl> <dbl>
1 1 0.986 2.92 -1.16 1.33
2 2 3.93 2.95 1.63 1.36
3 3 4.22 2.48 1.91 0.842
4 4 0.997 0.388 -1.15 -1.49
5 5 1.18 1.41 -0.979 -0.356
6 6 1.05 1.28 -1.10 -0.503
7 7 2.85 0.437 0.609 -1.44
8 8 2.24 0.757 0.0300 -1.08
9 9 3.04 1.73 0.788 0.00940
10 10 1.92 2.45 -0.271 0.814
11 11 2.85 0.843 0.611 -0.985
12 12 1.70 2.40 -0.483 0.749
13 13 2.65 1.22 0.420 -0.567
14 14 2.34 1.90 0.126 0.199
15 15 1.17 2.73 -0.986 1.12
If you actually want min in your standardise function rather than mean, editing it should be simple enough.
Edited to add a base R solution, with the caveat that there's probably a much neater one:
df <- data.frame(id = rep(c(seq(1, 15, 1)), each = 2))
df$cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
df$visit <- rep(c("before", "after"), 15)
df <- reshape(df, direction = "wide", idvar = "id", timevar = "visit")
standardise <- function(x) {return((x-mean(x,na.rm = T))/sd(x,na.rm = T))}
df$before_std <- round(standardise(df$cholesterol.before), 2)
df$after_std <- round(standardise(df$cholesterol.after), 2)
gives:
id cholesterol.before cholesterol.after before_std after_std
1 1 0.9861551 2.9154158 -1.16 1.33
3 2 3.9302373 2.9453085 1.63 1.36
5 3 4.2248018 2.4789901 1.91 0.84
7 4 0.9972635 0.3879830 -1.15 -1.49
9 5 1.1782336 1.4065341 -0.98 -0.36
11 6 1.0495609 1.2750138 -1.10 -0.50
13 7 2.8515144 0.4369885 0.61 -1.44
15 8 2.2410429 0.7566147 0.03 -1.08
17 9 3.0395565 1.7335131 0.79 0.01
19 10 1.9242212 2.4539439 -0.27 0.81
21 11 2.8528908 0.8432039 0.61 -0.99
23 12 1.7002653 2.3952744 -0.48 0.75
25 13 2.6522959 1.2178764 0.42 -0.57
27 14 2.3426695 1.9030782 0.13 0.20
29 15 1.1708246 2.7267124 -0.99 1.12
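A possibly neater base R route that keeps the long format the question asked for is ave(), which applies a function within each level of a grouping variable and returns the results in the original row order. This is a sketch assuming the Visit values are the strings "Before" and "After" and that the mean-based standardise function is wanted; the column name Cholesterol_std is made up for the example.

```r
DF <- data.frame(
  Cholesterol = c(0.9861551, 2.9154158, 3.9302373, 2.9453085, 4.2248018,
                  2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341,
                  1.0495609, 1.2750138, 2.8515144, 0.4369885, 2.2410429,
                  0.7566147, 3.0395565, 1.7335131, 1.9242212, 2.4539439,
                  2.8528908, 0.8432039, 1.7002653, 2.3952744, 2.6522959,
                  1.2178764, 2.3426695, 1.9030782, 1.1708246, 2.7267124),
  Visit = rep(c("Before", "After"), 15)
)

# Mean-based standardisation, as in the answer above
standardise <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Standardise within each visit; the result stays aligned with the rows of DF
DF$Cholesterol_std <- ave(DF$Cholesterol, DF$Visit, FUN = standardise)

head(DF)
# Sanity check: each visit should now have mean 0 and SD 1
tapply(DF$Cholesterol_std, DF$Visit, mean)
tapply(DF$Cholesterol_std, DF$Visit, sd)
```

Because the data never leaves long format, no reshape step or participant id is needed.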

Swapping data frame values randomly between different deciles of the data frame

It's a bit complicated to explain, so I hope it is clear enough, but if not I'll try and expand more.
So I have a data-frame like this:
df <- data.frame(index=sort(runif(300, -10,10)), v1=runif(300, -2,-1), v2=runif(300, 1,2))
It gives us a 3-column 300-row df. The first column ("index") contains sorted values from -10 to 10, and the next two columns ("v1"/"v2") contain random numeric values that are not important for this issue.
Now I classify my df rows into deciles according to the index column, (e.g. decile 1: places 1-30, decile 2: places 31-60) and I want to swap randomly between the rows such that all the 1st decile values are swapped randomly with the 6th decile, all 2nd decile values are swapped randomly with the 7th decile, and so on. When I say swapped I mean that the index value remains in its place but the v1 and v2 values are swapped (still coupled) with the v1 and v2 of a random row in the appropriate decile.
For example, the v1 and v2 of the first row in the df (and thus from the 1st decile), will be swapped with the v1 and v2 of the 160th row in the df (6th decile), the v1 and v2 of the second row in the df (1st decile) will be swapped with the v1 and v2 of the 175th row in the df (also 6th decile), the v1 and v2 of the 31st row in the df (2nd decile) will be swapped with the v1 and v2 of the 186th row in the df (7th decile) and so on so all of the v1+v2 values have changed places randomly to their appropriate new decile.
Hope it's clear. I've been trying to solve it for hours and couldn't figure it out.
Thanks
Using order() to sort by two indices, one being the rearranged deciles, the other one random.
set.seed(123)
dtf <- data.frame(round(cbind(index=sort(runif(20, -10, 10)),
v1=runif(20, 0, 5),
v2=runif(20, 5, 10)), 2))
ea <- nrow(dtf)/10
# Deciles shifted by 5
d <- rep(((1:10 + 4) %% 10) + 1, each=ea)
# Random index within decile
r <- c(replicate(10, sample(ea)))
cbind(dtf, z=dtf[order(d, r), -1])
# index v1 v2 z.v1 z.v2
# 12 -9.16 4.45 5.71 4.51 7.21
# 11 -9.09 3.46 7.07 4.82 5.23
# 14 -7.94 3.20 7.07 3.98 5.61
# 13 -5.08 4.97 6.84 3.45 8.99
# 15 -4.25 3.28 5.76 0.12 7.80
# 16 -3.44 3.54 5.69 2.39 6.03
# 17 -1.82 2.72 6.17 3.79 5.64
# 18 -0.93 2.97 7.33 1.08 8.77
# 19 -0.87 1.45 6.33 1.59 9.48
# 20 0.56 0.74 9.29 1.16 6.87
# 2 1.03 4.82 5.23 3.46 7.07
# 1 1.45 4.51 7.21 4.45 5.71
# 3 3.55 3.45 8.99 3.20 7.07
# 4 5.77 3.98 5.61 4.97 6.84
# 6 7.66 0.12 7.80 3.54 5.69
# 5 7.85 2.39 6.03 3.28 5.76
# 8 8.00 3.79 5.64 2.97 7.33
# 7 8.81 1.08 8.77 2.72 6.17
# 10 9.09 1.59 9.48 0.74 9.29
# 9 9.14 1.16 6.87 1.45 6.33
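The same order() idea can be checked on the 300-row data frame from the question. The sketch below is an illustration with made-up variable names (decile, src, swapped); it confirms that every row receives its v1/v2 pair from the decile five positions away and that each source row is used exactly once.

```r
set.seed(123)
df <- data.frame(index = sort(runif(300, -10, 10)),
                 v1 = runif(300, -2, -1),
                 v2 = runif(300, 1, 2))

ea <- nrow(df) / 10                           # rows per decile
decile <- rep(1:10, each = ea)                # decile of each row position
d <- rep(((1:10 + 4) %% 10) + 1, each = ea)   # deciles shifted by 5
r <- c(replicate(10, sample(ea)))             # random order within each decile

src <- order(d, r)                            # source row for each destination row

# index stays in place, v1 and v2 travel together from the source row
swapped <- data.frame(index = df$index, df[src, c("v1", "v2")], row.names = NULL)

# Cross-tabulate destination decile against source decile
table(destination = decile, source = decile[src])
```

Note that this assigns sources by a random permutation within the partner decile; it does not guarantee that two rows exchange values with each other pairwise.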
I think that this is what you need.
swapByBlocks <- function(df, blockSize = 30, nblocks = 10){
if((nrow(df) != blockSize*nblocks) || nblocks %%2) stop("Undefined behaviour")
swappedDF <- df[c((nrow(df)/2 +1):nrow(df), 1:(nrow(df)/2)),]
ndxMat <- sapply(1:(nblocks/2),function(dummy) sample(1:blockSize))
for(i in 1:ncol(ndxMat)) {
swappedDF[(i-1)*blockSize + 1:blockSize, ] <- swappedDF[((i-1)*blockSize + 1:blockSize)[ndxMat[,i]], ]
swappedDF[(i+nblocks/2-1)*blockSize + 1:blockSize, ] <- swappedDF[((i+nblocks/2-1)*blockSize + 1:blockSize)[order(ndxMat[,i])], ]
}
return(swappedDF)
}
A small case where you can check how it works:
res <- swapByBlocks(df[1:18,], blockSize = 3, nblocks = 6)
> df[1:18,]
index v1 v2
1 -9.859624 -1.657779 1.954094
2 -9.774898 -1.015825 1.006341
3 -9.624402 -1.713754 1.527065
4 -9.441129 -1.891834 1.803793
5 -9.424195 -1.125674 1.581199
6 -8.890537 -1.142044 1.219111
7 -8.838012 -1.173445 1.013408
8 -8.296938 -1.780396 1.570550
9 -8.172076 -1.789056 1.178596
10 -7.671897 -1.988539 1.690468
11 -7.655868 -1.095662 1.876414
12 -7.450011 -1.337443 1.632104
13 -7.204528 -1.880350 1.408944
14 -7.085862 -1.232293 1.593247
15 -7.030691 -1.087031 1.924306
16 -6.989892 -1.639967 1.495058
17 -6.978945 -1.395340 1.872944
18 -6.930379 -1.841031 1.061046
> res
index v1 v2
10 -7.450011 -1.337443 1.632104
11 -7.655868 -1.095662 1.876414
12 -7.671897 -1.988539 1.690468
13 -7.030691 -1.087031 1.924306
14 -7.085862 -1.232293 1.593247
15 -7.204528 -1.880350 1.408944
16 -6.989892 -1.639967 1.495058
17 -6.930379 -1.841031 1.061046
18 -6.978945 -1.395340 1.872944
1 -9.624402 -1.713754 1.527065
2 -9.774898 -1.015825 1.006341
3 -9.859624 -1.657779 1.954094
4 -8.890537 -1.142044 1.219111
5 -9.424195 -1.125674 1.581199
6 -9.441129 -1.891834 1.803793
7 -8.838012 -1.173445 1.013408
8 -8.172076 -1.789056 1.178596
9 -8.296938 -1.780396 1.570550
>
Here there are 18 rows with six blocks of three rows each. Rows 1 to 3 get swapped with rows 10 to 12, rows 4 to 6 with rows 13 to 15, and rows 7 to 9 with rows 16 to 18.
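The function applies ndxMat[,i] to one block and order(ndxMat[,i]) to its partner block. A plausible reading is that order() of a permutation gives the inverse permutation, which is what makes the exchange pairwise. The toy vectors below are made up to illustrate that property.

```r
set.seed(42)
block_a <- c("a1", "a2", "a3", "a4")
block_b <- c("b1", "b2", "b3", "b4")

perm <- sample(4)     # random permutation of positions
inv <- order(perm)    # inverse permutation: inv[perm[p]] == p

new_a <- block_b[perm]   # position p of block A receives element perm[p] of block B
new_b <- block_a[inv]    # position q of block B receives element inv[q] of block A

# Pairwise check: if position p in A received B's element q,
# then position q in B received A's element p
data.frame(position_in_a = 1:4,
           received = new_a,
           partner_received = new_b[perm])
```

The last column lists a1 to a4 in order, showing that each element of block A ended up exactly where its partner from block B came from.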

Lookup function for mutate in data

I'd like to store functions, or at least their names, in a column of a data.frame for use in a call to mutate. A simplified broken example:
library(dplyr)
expand.grid(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd')) %>%
group_by(mu, sd, stat) %>%
mutate(sample = get(stat)(rnorm(100, mu, sd))) %>%
ungroup()
If this worked how I thought it would, the value of sample would be generated by the function in .GlobalEnv corresponding to either 'mean' or 'sd', depending on the row.
The error I get is:
Error in mutate_impl(.data, dots) :
Evaluation error: invalid first argument.
Surely this has to do with non-standard evaluation ... grrr.
A few issues here. First expand.grid will convert character values to factors. And get() doesn't like working with factors (ie get(factor("mean")) will give an error). The tidyverse-friendly version is tidyr::crossing(). (You could also pass stringsAsFactors=FALSE to expand.grid.)
Secondly, mutate() assumes that all functions you call are vectorized, but functions like get() are not vectorized, they need to be called one-at-a-time. A safer way rather than doing the group_by here to guarantee one-at-a-time evaluation is to use rowwise().
And finally, your real problem is that you are trying to call get("sd") but when you do, sd also happens to be a column in your data.frame that is part of the mutate. Thus get() will find this sd first, and this sd is just a number, not a function. You'll need to tell get() to pull from the global environment explicitly. Try
tidyr::crossing(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd')) %>%
rowwise() %>%
mutate(sample = get(stat, envir = globalenv())(rnorm(100, mu, sd)))
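The masking problem can be reproduced without dplyr. In this sketch (the function name demo is made up), a local variable called sd plays the role of the sd column, and three lookups are compared: plain get(), get() restricted to functions, and match.fun().

```r
demo <- function() {
  sd <- 10  # local variable that shadows the name of stats::sd
  list(
    via_get       = get("sd"),                     # finds the number 10
    via_get_mode  = get("sd", mode = "function"),  # skips non-functions
    via_match_fun = match.fun("sd")                # also looks for a function only
  )
}

res <- demo()
is.function(res$via_get)        # FALSE: this is the masking variable
is.function(res$via_get_mode)   # TRUE
is.function(res$via_match_fun)  # TRUE
res$via_match_fun(c(1, 2, 3))   # 1
```

Restricting the lookup to functions avoids depending on which environment happens to hold the stats function, which may be more robust than hard-coding globalenv().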
Three problems (that I see): (1) expand.grid is giving you factors; (2) get finds variables, so using "sd" as a stat is being confused with the column names "sd" (that was hard to find!); and (3) this really is a row-wise operation, grouping isn't helping enough. The first is easily fixed with an option, the second can be fixed by using match.fun instead of get, and the third can be mitigated with dplyr::rowwise, purrr::pmap, or base R's mapply.
This helper function was useful during debugging and can be used to "clean up" the code within mutate, but it isn't required (for other than this demonstration). Inline "anonymous" functions will work as well.
func <- function(f,m,s) get(f)(rnorm(100,mean=m,sd=s))
Several implementation methods:
set.seed(0)
expand.grid(mu = 1:5, sd = c(1, 10), stat = c('mean', 'sd'),
stringsAsFactors=FALSE) %>%
group_by(mu, sd, stat) %>% # can also be `rowwise() %>%`
mutate(
sample0 = match.fun(stat)(rnorm(100, mu, sd)),
sample1 = purrr::pmap_dbl(list(stat, mu, sd), ~ match.fun(..1)(rnorm(100, ..2, ..3))),
sample2 = purrr::pmap_dbl(list(stat, mu, sd), func),
sample3 = mapply(function(f,m,s) match.fun(f)(rnorm(100,m,s)), stat, mu, sd),
sample4 = mapply(func, stat, mu, sd)
) %>%
ungroup()
# # A tibble: 20 x 8
# mu sd stat sample0 sample1 sample2 sample3 sample4
# <int> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 mean 1.02 1.03 0.896 1.08 0.855
# 2 2 1 mean 1.95 2.07 2.05 1.90 1.92
# 3 3 1 mean 2.93 3.07 3.03 2.89 3.01
# 4 4 1 mean 4.01 3.94 4.23 4.05 3.96
# 5 5 1 mean 5.04 5.11 5.05 5.17 5.19
# 6 1 10 mean 1.67 1.21 1.30 2.08 -0.641
# 7 2 10 mean 1.82 2.82 2.35 3.65 1.78
# 8 3 10 mean 1.45 3.10 3.15 4.28 2.58
# 9 4 10 mean 3.49 6.33 5.11 2.84 3.41
# 10 5 10 mean 5.33 4.85 4.07 5.58 6.66
# 11 1 1 sd 0.965 1.04 0.993 0.942 1.08
# 12 2 1 sd 0.974 0.967 0.981 0.984 1.15
# 13 3 1 sd 1.12 0.902 1.06 0.977 1.02
# 14 4 1 sd 0.946 0.928 0.960 1.01 0.992
# 15 5 1 sd 1.06 1.01 0.911 1.11 1.00
# 16 1 10 sd 9.46 8.95 10.0 8.91 9.60
# 17 2 10 sd 9.51 9.11 11.5 9.85 10.6
# 18 3 10 sd 9.77 9.96 11.0 9.09 10.7
# 19 4 10 sd 10.5 9.84 10.1 10.6 8.89
# 20 5 10 sd 11.2 8.82 10.4 9.06 9.64
sample0 happens to work because you have grouped it to be row-wise. If at some point any one grouping has two or more values, this will fail.
For sample1 through sample4, you can remove the group_by and it works equally well (sample0 fails without the grouping, so remove it too). You won't get identical results as above with grouping removed, because the random number stream is consumed in a different order.

calculating qchisq on a sparklyr tbl

I need to use the qchisq function on a column of a sparklyr data frame.
The problem is that the qchisq function does not seem to be implemented in Spark. If I am reading the error message below correctly, sparklyr tried to execute a function called "QCHISQ", but this exists neither in Hive SQL nor in Spark.
In general, is there a way to run arbitrary functions that are not implemented in Hive or Spark, with sparklyr? I know about spark_apply, but haven't figured out how to configure it yet.
> mydf = data.frame(beta=runif(100, -5, 5), pval = runif(100, 0.001, 0.1))
> mydf_tbl = copy_to(con, mydf)
> mydf_tbl
# Source: table<mydf> [?? x 2]
# Database: spark_connection
beta pval
<dbl> <dbl>
1 3.42 0.0913
2 -1.72 0.0629
3 0.515 0.0335
4 -3.12 0.0717
5 -2.12 0.0253
6 1.36 0.00640
7 -3.33 0.0896
8 1.36 0.0235
9 0.619 0.0414
10 4.73 0.0416
> mydf_tbl %>% mutate(se = sqrt(beta^2/qchisq(pval)))
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'QCHISQ'.
This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 49
As you noted you can use spark_apply:
mydf_tbl %>%
spark_apply(function(df)
dplyr::mutate(df, se = sqrt(beta^2/qchisq(pval, df = 12))))
# # Source: table<sparklyr_tmp_14bd5feacf5> [?? x 3]
# # Database: spark_connection
# beta pval X3
# <dbl> <dbl> <dbl>
# 1 1.66 0.0763 0.686
# 2 0.153 0.0872 0.0623
# 3 2.96 0.0485 1.30
# 4 4.86 0.0349 2.22
# 5 -1.82 0.0712 0.760
# 6 2.34 0.0295 1.10
# 7 3.54 0.0297 1.65
# 8 4.57 0.0784 1.88
# 9 4.94 0.0394 2.23
# 10 -0.610 0.0906 0.246
# # ... with more rows
but fair warning - it is embarrassingly slow. Unfortunately, you don't have an alternative here, short of writing your own Scala / Java extensions.
In the end I used a horrible hack, which works fine for this case.
Another solution would have been to write a User Defined Function (UDF), but sparklyr doesn't support it yet: https://github.com/rstudio/sparklyr/issues/1052
This is the hack I've used. In short, I precompute a qchisq table, upload it as a sparklyr object, then join. If I compare this with results calculated on a local data frame, I get a correlation of r=0.99999990902236146617.
#' @param n number of significant digits to use
> check_precomputed_strategy = function(n) {
chisq = data.frame(pval=seq(0, 1, 1/(10**(n)))) %>%
mutate(qval=qchisq(pval, df=1, lower.tail = FALSE)) %>%
mutate(pval_s = as.character(round(as.integer(pval*10**n),0)))
chisq %>% head %>% print
chisq_tbl = copy_to(con, chisq, overwrite=T)
mydf = data.frame(beta=runif(100, -5, 5), pval = runif(100, 0.001, 0.1)) %>%
mutate(se1 = sqrt(beta^2/qchisq(pval, df=1, lower.tail = FALSE)))
mydf_tbl = copy_to(con, mydf)
mydf_tbl.up = mydf_tbl %>%
mutate(pval_s=as.character(round(as.integer(pval*10**n),0))) %>%
left_join(chisq_tbl, by="pval_s") %>%
mutate(se=sqrt(beta^2 / qval)) %>%
collect %>%
filter(!duplicated(beta))
mydf_tbl.up %>% head %>% print
mydf_tbl.up %>% filter(complete.cases(.)) %>% nrow %>% print
mydf_tbl.up %>% filter(complete.cases(.)) %>% select(se, se1) %>% cor
}
> check_precomputed_strategy(4)
pval qval pval_s
1 0.00000000000000000000000 Inf 0
2 0.00010000000000000000479 15.136705226623396570 1
3 0.00020000000000000000958 13.831083619091122827 2
4 0.00030000000000000002793 13.070394140069462097 3
5 0.00040000000000000001917 12.532193305401813532 4
6 0.00050000000000000001041 12.115665146397173402 5
# A tibble: 6 x 8
beta pval.x se1 myvar pval_s pval.y qval se
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 3.42 0.0913 2.03 1. 912 0.0912 2.85 2.03
2 -1.72 0.0629 0.927 1. 628 0.0628 3.46 0.927
3 0.515 0.0335 0.242 1. 335 0.0335 4.52 0.242
4 -3.12 0.0717 1.73 1. 716 0.0716 3.25 1.73
5 -2.12 0.0253 0.947 1. 253 0.0253 5.00 0.946
6 1.36 0.00640 0.498 1. 63 0.00630 7.46 0.497
[1] 100
se se1
se 1.00000000000000000000 0.99999990902236146617
se1 0.99999990902236146617 1.00000000000000000000
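The lookup-table idea can also be tried without a Spark connection. The sketch below is a local base R analogue, with merge() standing in for the Spark join and an integer key instead of a character one; both substitutions are assumptions made for illustration.

```r
set.seed(1)
n <- 4  # number of decimal digits kept in the lookup key

# Precompute qchisq on a grid of p-values: key / 10^n
lookup <- data.frame(key = as.integer(0:(10^n)))
lookup$qval <- qchisq(lookup$key / 10^n, df = 1, lower.tail = FALSE)

mydf <- data.frame(beta = runif(100, -5, 5), pval = runif(100, 0.001, 0.1))
mydf$key <- as.integer(mydf$pval * 10^n)   # truncates, as in the hack above

# Join on the key, then compute se from the looked-up quantile
joined <- merge(mydf, lookup, by = "key", all.x = TRUE)
joined$se_lookup <- sqrt(joined$beta^2 / joined$qval)
joined$se_exact <- sqrt(joined$beta^2 /
                          qchisq(joined$pval, df = 1, lower.tail = FALSE))

cor(joined$se_lookup, joined$se_exact)
max(abs(joined$se_lookup - joined$se_exact) / joined$se_exact)  # worst relative error
```

The relative error grows as p-values get smaller, since the quantile function is steepest near zero, so a larger n may be needed for very small p-values.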
