I'm having trouble writing a function that is too complex for my R knowledge, and I'd appreciate any help.
I have a data set (DRC_epi) consisting of ~800,000 columns of epigenetic data. I'd like to randomly draw 1000 samples consisting of 500 column names each:
set.seed(42)
y <- replicate(1000, {
  names(DRC_epi[, sample(ncol(DRC_epi), 500, replace = TRUE)])
})
I want to use each sample to select the matching columns of a different data frame (DRC_epi_pheno) and then compute their correlations with my outcome variable of interest (phenotype_aas). For the first subsample it would look like this:
library(tidyverse)
library(correlation)
DRC_cor_sign_1 <- DRC_epi_pheno %>%
  select(phenotype_aas, any_of(y[, 1])) %>%
  correlation(method = "spearman", p_adjust = "fdr") %>%
  filter(Parameter1 %in% "phenotype_aas") %>%
  filter(p <= 0.05) %>%
  select(Parameter1, Parameter2, p)
From this result, I want to store the percentage of significant results in an object:
# nrow() counts the significant correlations; length() would count the three columns
percentage <- nrow(DRC_cor_sign_1) / 500 * 100
My question is: how can I put it all together and automate it, so that I don't have to run the analysis 1000 times manually?
To give an idea of my data, here is a toy data set that is similar to my real one:
set.seed(42)
DRC_epi <- data.frame("cg1" = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg2" = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg3" = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg4" = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg5" = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg6" = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg7" = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg8" = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg9" = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg10" = rnorm(n = 10, mean = 1, sd = 1.5))
DRC_epi_pheno <- cbind(DRC_epi, phenotype_aas = sample(x = 0:40, size = 10, replace = TRUE))
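One possible way to automate this is to wrap the per-sample steps in a function and call it once per draw with replicate(). The following is only a sketch in base R on toy data, not a drop-in replacement for the pipeline above. The helper name pct_significant and the reduced sizes n_draws and n_cols are made up for illustration. It also makes two assumptions that differ from the code in the question: column names are drawn without replacement (drawing with replacement through data-frame indexing produces suffixed names such as cg1.1), and the FDR adjustment is applied only to the outcome-versus-column tests, whereas correlation() adjusts over every pair in the selected columns.

```r
set.seed(42)

# Toy data in the spirit of the question: 10 CpG columns, 10 observations
DRC_epi <- as.data.frame(replicate(10, rnorm(n = 10, mean = 1, sd = 1.5)))
names(DRC_epi) <- paste0("cg", 1:10)
DRC_epi_pheno <- cbind(DRC_epi,
                       phenotype_aas = sample(x = 0:40, size = 10, replace = TRUE))

# Hypothetical helper: percentage of the given columns whose Spearman
# correlation with the outcome has an FDR-adjusted p-value <= 0.05
pct_significant <- function(cols, data, outcome = "phenotype_aas") {
  p <- vapply(cols, function(cl) {
    # suppressWarnings() hides the warning cor.test() raises when ties are present
    suppressWarnings(
      cor.test(data[[cl]], data[[outcome]], method = "spearman")$p.value
    )
  }, numeric(1))
  mean(p.adjust(p, method = "fdr") <= 0.05) * 100
}

n_draws <- 20  # 1000 in the real analysis
n_cols  <- 5   # 500 in the real analysis

# One percentage per random draw of column names (drawn without replacement)
percentage <- replicate(
  n_draws,
  pct_significant(sample(names(DRC_epi), n_cols), DRC_epi_pheno)
)
summary(percentage)
```

percentage is then a plain numeric vector with one entry per draw, so it can be summarised or plotted directly.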
I have the following data.frame:
df_1 <- data.frame(
  x = replicate(
    n = 6, expr = runif(n = 30, min = 20, max = 100), simplify = TRUE
  )
)
I want to generate 50 data.frames with this function:
f_1 <- function(x) {
  data.frame(x = replicate(n = 5, runif(n = 30, min = 20, max = 100)))
}
lt_1 <- replicate(n = 50, expr = f_1(), simplify = FALSE)
The result is fine. But when I define the function inline (on the fly) instead of calling f_1, it doesn't work:
lt_2 <- replicate(
  n = 50, expr = function(x) {
    data.frame(x = replicate(n = 5, runif(n = 30, min = 20, max = 100)))
  }, simplify = FALSE
)
What's the problem?
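replicate() re-evaluates expr on every iteration, and evaluating a bare function definition only returns the function itself, so lt_2 ends up as a list of 50 functions rather than 50 data.frames. Wrap the definition in () and add a trailing () to actually call it:
lt_2 <- replicate(
  n = 50, expr = (function(x) {
    data.frame(x = replicate(n = 5, runif(n = 30, min = 20, max = 100)))
  })(), simplify = FALSE
)
In the OP's lt_1, the function is called, with f_1().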
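To see the difference, a small check along these lines may help. The names make_df, not_called and called are made up for the illustration, and n is reduced to 3 to keep it quick.

```r
# Hypothetical helper that builds one data.frame, as in the question
make_df <- function() {
  data.frame(x = replicate(n = 5, runif(n = 30, min = 20, max = 100)))
}

# expr is a function definition: each element is the function itself
not_called <- replicate(n = 3, expr = function() make_df(), simplify = FALSE)

# expr defines the function and immediately calls it: each element is a data.frame
called <- replicate(n = 3, expr = (function() make_df())(), simplify = FALSE)

class(not_called[[1]])
class(called[[1]])
```

The first class() call reports a function and the second a data.frame, which is the whole difference between the two versions of lt_2.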
My list (lt):
df_1 <- data.frame(
  x = replicate(
    n = 2,
    expr = runif(n = 30, min = 20, max = 100)
  ),
  y = sample(
    x = 1:3, size = 30, replace = TRUE
  )
)
lt <- split(
  x = df_1,
  f = df_1[['y']]
)
vars <- names(df_1)[1:2]
I tried:
for (i in vars) {
  for (i in i) {
    print(pairwise.t.test(x = lt[, i], g = lt[['y']], p.adj = 'bonferroni'))
  }
}
But I get this error message:
Error in lt[, i] : incorrect number of dimensions
What's the problem?
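We don't need to split the data. lt is a list, so it cannot be indexed with two dimensions as in lt[, i], and that is what triggers the error. Instead, stack the two value columns and repeat the grouping variable to match:
pairwise.t.test(unlist(df_1[1:2]), g = rep(df_1$y, 2), p.adj = 'bonferroni')
# Pairwise comparisons using t tests with pooled SD
# data:  unlist(df_1[1:2]) and rep(df_1$y, 2)
#   1    2
# 2 1.00 -
# 3 0.91 1.00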
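If the goal was instead one test per variable, which is what the loop over vars suggests, a sketch along the following lines might work. It indexes the unsplit df_1 by column name. The seed and the results list are additions for the illustration, not part of the question.

```r
set.seed(1)  # seed added only to make the illustration reproducible
df_1 <- data.frame(
  x = replicate(n = 2, expr = runif(n = 30, min = 20, max = 100)),
  y = sample(x = 1:3, size = 30, replace = TRUE)
)
vars <- names(df_1)[1:2]  # "x.1" and "x.2"

# One pairwise t test per variable, each grouped by y
results <- lapply(vars, function(v) {
  pairwise.t.test(x = df_1[[v]], g = df_1[["y"]], p.adjust.method = "bonferroni")
})
names(results) <- vars

results[["x.1"]]
```

Each element of results is a separate test object, so the p-value table for a given variable is available as results[["x.1"]]$p.value.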
I have a data set like the following:
library(data.table)
library(caret)
set.seed(503)
foo <- data.table(group = rep(LETTERS[1:6], 150),
                  y = rnorm(n = 6 * 150, mean = 5, sd = 2),
                  x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
                  x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
                  x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
                  x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
                  x5 = sample(c(1, 0), size = 6 * 150, replace = TRUE))
foo[, period := 1:.N, by = group]
Problem: I want to forecast y one step ahead, for each group, using the variables x1, ..., x5.
I want to run a few models in caret to decide which one to use.
At the moment, I run them in a loop using timeslice:
window.length <- 115
timecontrol <- trainControl(method = 'timeslice',
                            initialWindow = window.length,
                            horizon = 1,
                            selectionFunction = "best",
                            fixedWindow = TRUE,
                            savePredictions = 'final')
model_list <- list()
for (g in unique(foo$group)) {
  for (model in c("xgbTree", "earth", "cubist")) {
    dat <- foo[group == g][, c('group', 'period') := NULL]
    model_list[[g]][[model]] <- train(y ~ . - 1,
                                      data = dat,
                                      method = model,
                                      trControl = timecontrol)
  }
}
However, I would like to run all groups at the same time, using dummy variables to identify each one, like:
dat <- cbind(foo, model.matrix(~ group - 1, foo))
          y         x1       x2       x3            x4 x5 period groupA groupB groupC groupD groupE groupF
1: 5.710250 11.9615460 22.62916 31.04790 -4.821331e-04  1      1      1      0      0      0      0      0
2: 3.442213  8.6558983 32.41881 45.70801  3.255423e-01  1      1      0      1      0      0      0      0
3: 3.485286  7.7295448 21.99022 56.42133  8.668391e+00  1      1      0      0      1      0      0      0
4: 9.659601  0.9166456 30.34609 55.72661 -7.666063e+00  1      1      0      0      0      1      0      0
5: 5.567950  3.0306864 22.07813 52.21099  5.377153e-01  1      1      0      0      0      0      1      0
But I still want the time series evaluated in the correct time order using timeslice.
Is there a way to declare the time variable in trainControl, so that each round of my one-step-ahead forecast adds six more observations (one per group) and drops the first six?
I can do it by ordering the data and adjusting the horizon argument (given n groups, order by the time variable and set horizon = n), but this has to change whenever the number of groups changes. And initialWindow has to be time * n_groups:
timecontrol <- trainControl(method = 'timeslice',
                            initialWindow = window.length * length(unique(foo$group)),
                            horizon = length(unique(foo$group)),
                            selectionFunction = "best",
                            fixedWindow = TRUE,
                            savePredictions = 'final')
Is there any other way?
I think the answer you are looking for is actually quite simple. You can use the skip argument to trainControl() to control how far the window advances between consecutive train/test sets. That way you predict each group-period only once, the same period is never split between the training set and the test set, and there is no information leakage.
Using the example you provided, with the rows ordered by period, set horizon = 6 (the number of groups), initialWindow = 115 * 6 and skip = 5. caret's createTimeSlices() moves the window forward by skip + 1 rows, so skip has to be the number of groups minus one for the window to advance by exactly one period. The first test set then includes all groups for period 116, the next test set includes all groups for period 117, and so on.
library(caret)
library(tidyverse)
set.seed(503)
foo <- tibble(group = rep(LETTERS[1:6], 150),
              y = rnorm(n = 6 * 150, mean = 5, sd = 2),
              x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
              x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
              x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
              x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
              x5 = sample(c(1, 0), size = 6 * 150, replace = TRUE)) %>%
  group_by(group) %>%
  mutate(period = row_number()) %>%
  ungroup()
dat <- cbind(foo, model.matrix(~ group - 1, foo)) %>%
  select(-group)
window.length <- 115
timecontrol <- trainControl(
  method = 'timeslice',
  initialWindow = window.length * length(unique(foo$group)),
  horizon = length(unique(foo$group)),
  # the window advances by skip + 1 rows, i.e. exactly one period
  skip = length(unique(foo$group)) - 1,
  selectionFunction = "best",
  fixedWindow = TRUE,
  savePredictions = 'final'
)
model_names <- c("xgbTree", "earth", "cubist")
fits <- map(model_names,
            ~ train(
              y ~ . - 1,
              data = dat,
              method = .x,
              trControl = timecontrol
            )) %>%
  set_names(model_names)
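One way to check that each test slice covers exactly one period is to reproduce the slice arithmetic by hand. The sketch below does not call caret. It assumes that createTimeSlices() advances the end of the training window by skip + 1 rows per resample, which is why skip is set to the number of groups minus one here. All variable names are made up for the illustration.

```r
n_groups <- 6
n_periods <- 150
window_length <- 115

# Rows ordered by period, with one row per group inside each period
period <- rep(seq_len(n_periods), each = n_groups)

initial_window <- window_length * n_groups
horizon <- n_groups
skip <- n_groups - 1

# Assumption: the last training row moves forward by skip + 1 rows per resample
stops <- seq(initial_window, length(period) - horizon, by = skip + 1)

# Periods covered by each test slice (the horizon rows after each stop)
test_periods <- lapply(stops, function(s) unique(period[(s + 1):(s + horizon)]))

length(stops)
range(unlist(test_periods))
```

Under that assumption, every test slice contains a single period, starting at period 116 and ending at period 150. Repeating the calculation with skip equal to the number of groups shifts the window by seven rows at a time, so the slices would straddle two periods.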
I would use tidyr::nest() to nest groups and then iterate over the data with purrr::map(). This approach is much more flexible because it can accommodate different group sizes, different numbers of groups, and variable models or other arguments passed to caret::train(). Also, you can easily run everything in parallel using furrr.
Load packages and create data
I use tibble instead of data.table. I also reduce the size of the data.
library(caret)
library(tidyverse)
set.seed(503)
foo <- tibble(
  group = rep(LETTERS[1:6], 10),
  y = rnorm(n = 6 * 10, mean = 5, sd = 2),
  x1 = rnorm(n = 6 * 10, mean = 5, sd = 10),
  x2 = rnorm(n = 6 * 10, mean = 25, sd = 10),
  x3 = rnorm(n = 6 * 10, mean = 50, sd = 10),
  x4 = rnorm(n = 6 * 10, mean = 0.5, sd = 10),
  x5 = sample(c(1, 0), size = 6 * 10, replace = TRUE)
) %>%
  group_by(group) %>%
  mutate(period = row_number()) %>%
  ungroup()
Reduce initialWindow size
window.length <- 9
timecontrol <- trainControl(
  method = 'timeslice',
  initialWindow = window.length,
  horizon = 1,
  selectionFunction = "best",
  fixedWindow = TRUE,
  savePredictions = 'final'
)
Create a function that will return a list of fit model objects
# Fit each model in model_list to data and return the model fits as a named list.
fit_models <- function(data, model_list, timecontrol) {
  map(model_list,
      ~ train(
        y ~ . - 1,
        data = data,
        method = .x,
        trControl = timecontrol
      )) %>%
    set_names(model_list)
}
Fit models
model_list <- c("xgbTree", "earth", "cubist")
# nest(-group) is deprecated since tidyr 1.0.0; name the list-column explicitly
mods <- foo %>%
  nest(data = -group)
mods <- mods %>%
  mutate(fits = map(
    data,
    ~ fit_models(
      data = .x,
      model_list = model_list,
      timecontrol = timecontrol
    )
  ))
If you want to view the results for a particular group / model you can do:
mods[which(mods$group == "A"), ]$fits[[1]]$xgbTree
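For readers who want to see the shape of the result without installing caret or the tidyverse, the same split-then-fit pattern can be sketched in base R. This is a stand-in, not the approach above: split() and lapply() replace nest() and map(), lm() replaces train(), and fit_formulas and formulas are hypothetical names.

```r
set.seed(503)
foo <- data.frame(
  group = rep(LETTERS[1:6], 10),
  y  = rnorm(n = 6 * 10, mean = 5, sd = 2),
  x1 = rnorm(n = 6 * 10, mean = 5, sd = 10),
  x2 = rnorm(n = 6 * 10, mean = 25, sd = 10)
)

# Hypothetical stand-in for fit_models(): one lm() per formula
fit_formulas <- function(data, formulas) {
  lapply(formulas, function(f) lm(f, data = data))
}

formulas <- list(small = y ~ x1, full = y ~ x1 + x2)

# One data.frame per group (base R analogue of nest()), without the group column
by_group <- split(foo[, setdiff(names(foo), "group")], foo$group)

# One named list of fits per group (base R analogue of map())
fits <- lapply(by_group, fit_formulas, formulas = formulas)

fits[["A"]][["full"]]
```

The result is a list indexed first by group and then by model, which mirrors how mods stores one list of fits per group.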
Use furrr for parallel processing
Just initialize workers with plan(multisession) and change map to future_map. (Older versions of the future package used plan(multiprocess), which has since been deprecated.) Note that you might want to change the number of workers to something less than 6 if your computer has fewer than 6 processing cores.
library(furrr)
plan(multisession, workers = 6)
mods <- foo %>%
  nest(data = -group)
mods <- mods %>%
  mutate(fits = future_map(
    data,
    ~ fit_models(
      data = .x,
      model_list = model_list,
      timecontrol = timecontrol
    )
  ))
Suppose I have a data frame with 8 schools and their characteristics, and another with 48 teachers and their characteristics. I can generate some fake data with the following code:
library(dplyr)
library(geosphere)
set.seed(6232015)
n.schools <- 8
n.teachers <- 48
makeRandomString <- function(pre, n = 1, length = 12) {
  randomString <- c(1:n) # initialize vector
  for (i in 1:n) {
    randomString[i] <- paste0(pre, '.', paste(sample(c(0:9, letters, LETTERS),
                                                     length, replace = TRUE),
                                              collapse = ""))
  }
  return(randomString)
}
gen.teachers <- function(n.teachers) {
  Teacher.ID <- makeRandomString(pre = 'T', n = n.teachers, length = 20)
  Teacher.exp <- runif(n = n.teachers, min = 1, max = 30)
  Teacher.Other <- sample(c(0, 1), replace = TRUE, prob = c(0.5, 0.5), size = n.teachers)
  Teachers <- data.frame(Teacher.ID, Teacher.exp, Teacher.Other)
  return(Teachers)
}
gen.schools <- function(n.schools) {
  School.ID <- makeRandomString(pre = 'S', n = n.schools, length = 20)
  School.lat <- runif(n = n.schools, min = -2, max = 2)
  School.long <- runif(n = n.schools, min = -2, max = 2)
  Schools <- data.frame(School.ID, School.lat, School.long) %>%
    rowwise() %>%
    mutate(School.distance = distHaversine(p1 = c(School.long, School.lat),
                                           p2 = c(0, 0), r = 3961))
  return(Schools)
}
Teachers <- gen.teachers(n.teachers = n.teachers)
Schools <- gen.schools(n.schools = n.schools)
To each school, I want to assign 6 teachers (every teacher gets one and only one school). I could use:
Teachers %>% sample_n(6)
to get a list of 6 teachers, assign those to a school, remove them from my original pool, and keep going in a loop. My guess/hope is that there is a much easier way of doing this.
Thanks for the help!
In the context of your code,
sample(rep(Schools$School.ID, each = 6))
gives a random sequence of schools in which each School.ID appears 6 times. Set Teachers$AssignedSchool to this sample and each teacher has an assigned school.
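A minimal sketch of that assignment, using short made-up IDs in place of the random strings from the question (school_ids and the simplified Teachers frame are hypothetical):

```r
set.seed(6232015)

# Hypothetical stand-ins for Schools$School.ID and the Teachers data frame
school_ids <- paste0("S.", 1:8)
Teachers <- data.frame(Teacher.ID = paste0("T.", 1:48))

# Repeat each school 6 times, shuffle, and attach one school to every teacher
Teachers$AssignedSchool <- sample(rep(school_ids, each = 6))

# Each school should appear exactly 6 times
table(Teachers$AssignedSchool)
```

This works because the repeated vector has exactly as many entries as there are teachers (8 * 6 = 48), so shuffling it gives every teacher one school and every school six teachers.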