Generating multiple datasets, applying a function, and outputting multiple datasets - r

Here is my problem; it's just hard for me...
I want to generate multiple datasets, then apply a function to these datasets and collect the corresponding output in a single dataset or in multiple datasets (whatever is possible)...
My example follows, although in reality I need to generate a large number of variables and datasets:
seed <- round(runif(10)*1000000)
datagen <- function(x){
set.seed(x)
var <- rep(1:3, c(rep(3, 3)))
yvar <- rnorm(length(var), 50, 10)
matrix <- matrix(sample(1:10, c(10*length(var)), replace = TRUE), ncol = 10)
mydata <- data.frame(var, yvar, matrix)
}
gdt <- lapply (seed, datagen)
# the resulting list (I believe that's the correct term) has 10 data frames:
# gdt[1] .......to gdt[10]
# my function: this will perform an anova on every component data frame and
# output probability coefficients...
anovp <- function(x){
ind <- 3:ncol(x)
out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]])
pval <- out$coefficients[,4][2]
pval <- do.call(rbind,pval)
}
plist <- lapply (gdt, anovp)
Error in gdt[x] : invalid subscript type 'list'
This is not working, and I tried different options but could not figure it out... finally I decided to bother the experts, sorry for that...
My questions are:
(1) Is it possible to handle such a situation in this way, or are there other alternatives for handling multiple datasets created like this?
(2) If this is the right way, how can I do it?
Thank you for attention and I will appreciate your help...

You have the basic idea right, in that you should create a list of data frames and then use lapply to apply the function to each element of the list. Unfortunately, there are several oddities in your code.
There is no point in randomly generating a seed, then setting it. You only need to use set.seed in order to make random numbers reproducible. Cut the lines
seed <- round(runif(10)*1000000)
and maybe
set.seed(x)
rep(1:3, c(rep(3, 3))) is the same as rep(1:3, each = 3).
Don't call your variables var or matrix: those names mask the base functions of the same name, which is confusing.
3:ncol(x) is dangerous. If x has fewer than 3 columns it doesn't do what you think it does: with ncol(x) equal to 2, for example, 3:2 counts down and gives c(3, 2).
... and now, the problem you actually wanted solving.
The problem is in the line out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]]).
lapply passes data frames into anovp, not indices, so inside the function x is a data frame. gdt[x] therefore tries to index a list with a data frame, which throws the error.
One more thing. While you are rewriting that line, note that lm takes a data argument, so you don't need to write things like x$some_column; pass data = x and reference some_column directly in the formula.
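For instance, a minimal sketch of the rewritten function (assuming what you want is to drop the grouping column var and regress yvar on all the X columns; note that with 9 rows and 10 predictors, as in your example, the fit will be rank-deficient):
anovp <- function(x) {
  # x is the data frame that lapply hands in, not an index
  out <- lm(yvar ~ ., data = x[, -1])  # drop var; regress yvar on X1..X10
  summary(out)$coefficients[, 4]       # p-values come from summary(), not out$coefficients
}
plist <- lapply(gdt, anovp)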
EDIT: Further advice.
You appear to always use the formula yvar ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10. Since it's the same each time, create it before your call to lapply.
independent_vars <- paste(colnames(gdt[[1]])[-1:-2], collapse = " + ")
model_formula <- formula(paste("yvar", independent_vars, sep = " ~ "))
I probably wouldn't bother with the anovp function. Just do
models <- lapply(gdt, function(data) lm(model_formula, data = data))
Then include a further call to lapply to play with the coefficients if necessary. The next line replicates your anovp code, but it won't work because model$coefficients is a vector (so the dimensions aren't right). Adjust it to retrieve the bit you actually want.
coeffs <- lapply(models, function(model) do.call(rbind, model$coefficients[,4][2]))
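For example, if the bit you want is the p-value of the first predictor (what the original [,4][2] was reaching for), summary() supplies a matrix with the right dimensions:
pvals <- sapply(models, function(model) summary(model)$coefficients[2, 4])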

Related

Combining for loops and ifelse in R

I am trying to write a for loop that will generate a correlation for a fixed column (LPS0) vs. all other columns in the data set. I don't want to use a correlation matrix because I only care about the correlation of LPS0 vs all other columns, not the correlations of the other columns with themselves. I then want to include an if statement to print only the significant correlations (p.value <= 0.05). I ran into some issues where some of the p.values are returned as NA, so I switched to if_else. However, I am now getting an error. My code is as follows:
for(i in 3:ncol(microbiota_lps_0_morm)) {
morm_0 <- cor.test(microbiota_lps_0_morm$LPS0, microbiota_lps_0_morm[[colnames(microbiota_lps_0_morm)[i]]], method = "spearman")
if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
}
The first value is returned, and then the loop stops with the following error:
Error in if_else():
! true must be length 1 (length of condition), not 8.
Backtrace: 1. dplyr::if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
How can I make the loop print morm_0 only when p.value <= 0.05?
Here's a long piece of code which automates the whole thing. It might be overkill, but you can just take the matrix and use whatever you need. It makes use of the tidyverse.
library(tidyverse)
df <- select_if(mtcars, is.numeric)
glimpse(df)
# keeping real names
dict <- cbind(original=names(df),new=paste0("v",1:ncol(df)))
# but changing names for better data viz
colnames(df) <- paste0("v",1:ncol(df))
# correlating between variables + p values
pvals <- list()
corss <- list()
for (coln in colnames(df)) {
pvals[[coln]] <- map(df, ~ cor.test(df[,coln], .)$p.value)
corss[[coln]] <- map(df, ~ cor(df[,coln], .))
}
# Keeping both matrices in a list
matrices <- list(
pvalues = matrix(data=unlist(pvals),
ncol=length(names(pvals)),
nrow=length(names(pvals))),
correlations = matrix(data=unlist(corss),
ncol=length(names(corss)),
nrow=length(names(corss)))
)
rownames(matrices[[1]]) <- colnames(df)
rownames(matrices[[2]]) <- colnames(df)
# Creating a combined data frame
long_cors <- expand.grid(Var1=names(df),Var2=names(df)) %>%
mutate(cor=unlist(matrices["correlations"]),
pval=unlist(matrices["pvalues"]),
same=Var1==Var2,
significant=pval<0.05,
dpcate=duplicated(cor)) %>%
# Leaving no duplicants, non-significant or self-correlation results
filter(same ==F,significant==T,dpcate==F) %>%
select(-c(same,dpcate,significant))
# Plotting correlations
long_cors %>%mutate(negative=cor<0) %>%
ggplot(aes(x=Var1,y=Var2,
color=negative,size=abs(cor),fill=Var2,
label=round(cor,2)))+
geom_label(show.legend = F,alpha=0.2)+
scale_color_manual(values = c("black","darkred"))+
# Sizing each correlation by its magnitude
scale_size_area(seq(1,100,length=length(unique(long_cors$Var1))))+ theme_light()+
theme(axis.text = element_text(face = "bold",size=12))+
labs(title="Correlation between variables",
caption = "p < 0.05")+xlab("")+ylab("")
If you want to correlate a column of a matrix with the remaining columns, you can do so with one function call:
mtx <- matrix(rnorm(800), ncol=8)
cor(mtx[,1], mtx[,-1])
However, you will not get p-values. For getting p-values, I would recommend this approach:
library(tidyverse)
significant <- map_dbl(2:ncol(mtx),
~ cor.test(mtx[,1], mtx[,.], use="p", method="s")$p.value)
Whenever you feel like you need a for loop in R, chances are you should be using another approach. for is a very un-R construct, and R gives you many better ways of handling the same issues. The map_* family of functions from the tidyverse is but one of them. Another approach, in base R, would be to use apply:
significant <- apply(mtx[,-1], 2,
\(x) cor.test(x, mtx[,1], method="s", use="p")$p.value)
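To answer the question as asked: if_else is vectorized and insists that its true and false arguments match the length of the condition, which a whole htest object cannot do. A plain if, plus a guard against NA p-values, is all the loop needs. A minimal sketch, reusing the names from the question:
for (i in 3:ncol(microbiota_lps_0_morm)) {
  morm_0 <- cor.test(microbiota_lps_0_morm$LPS0,
                     microbiota_lps_0_morm[[i]],
                     method = "spearman")
  # print the full test only when the p-value exists and is significant
  if (!is.na(morm_0$p.value) && morm_0$p.value <= 0.05) {
    print(morm_0)
  } else {
    print("Not Sig")
  }
}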

How to write a function in R that will implement the "best subsets" approach to model selection?

So I need to write a function that takes a data-frame as input. The columns are my explanatory variables (except for the last column/right most column which is the response variable). I'm trying to fit a linear model and track each model's adjusted r-square as the criterion used to pick the best model.
The model will use all the columns as the explanatory variables (except for the right-most column which will be the response variable).
The function is supposed to create a tibble with a single column for the model number (I have no idea what this is supposed to mean), the subset of explanatory variables along with the response variable, the model formula, the outcome of fitting the linear model, and others as needed.
The function is supposed to output: the model number, the explanatory variables in the model, the value of adjusted r-square, and a graph (I can figure the graph out on my own). I have an image of a table here to help with visualizing what the result should look like.
I figured out that this code will get me the explanatory and response variables:
cols <- colnames(data)
# Get the response variable.
y <- tail(cols, 1)
# Get a list of the explanatory variables.
xs <- head(cols, length(cols) - 1)
I know that I can get a model with something like this (ignore variable names for now):
model <- final_data %>%
group_by(debt) %>%
lm(debt ~ distance, data = .) %>%
glance()
I also know that I'm going to have to somehow map that model to each of the rows in the tibble that I'm trying to create.
What I'm stuck on is figuring out how to put all this together and create the complete function. I wish I could provide more details but I am completely stuck. I've spent about 10 hours working on this today... I asked my professor for help and he just told me to post here.
For reference here is a very early (not working at all) attempt I made:
best_subsets <- function(data) {
cols <- colnames(data)
# Get the response variable.
y <- tail(cols, 1)
# Get a list of the explanatory variables.
xs <- head(cols, length(cols) - 1)
# Create the formula as a string and then later in the lm function
# have it turned into a real formula.
form <- paste(y, "~", xs, sep = " ")
data %>%
lm(as.formula(form), data = .) %>%
glance()
}
I don't fully understand your description but I think I understand your goal. Maybe this can help in some way?:
library(tidyverse)
library(broom)
library(data.table)
lm_func <- function(df){
fit1 <- lm(df[, 1] ~ df[, 2], data = df)
fit2 <- lm(df[, 1] ~ df[, 3], data = df)
fit3 <- lm(df[, 1] ~ df[, 2] + df[, 3], data = df)
results <- list(fit1, fit2, fit3)
names(results) <- paste0("explanatory_variables_", 1:3)
r_sq <- lapply(results, function(x){
glance(x)
})
r_sq_df <- rbindlist(r_sq, idcol = "df_name")
r_sq_df
}
lm_func(iris)
This gives you a dataframe of all the important outputs, from which you can select adj.r.squared. It would also be possible to automate. As a side note, selecting a model based on R squared seems very strange (think of the dangers of overfitting); a higher R squared does not necessarily mean a better model. Consider looking into AIC as well.
Let me know if this helps at all or if I can refine the answer a little more towards your goal.
UPDATE:
lm_func <- function(df) {
lst <- c()
for (i in 2:ncol(df)) {
ind <- i
form_df <- df[, 1:ind]
form <- DF2formula(form_df)
fit <- lm(form, data = df)
lst[[i - 1]] <- glance(fit)
}
names(lst) <- paste0("explanatory_variables_", 1:length(lst))
lst <- rbindlist(lst, idcol = "df_name")
lst
}
lm_func(iris)
This assumes your first column is y and you want a model for every additional column.
OK one more UPDATE:
I think this does everything possible but is probably overkill:
library(combinat)
library(data.table)
library(tidyverse)
library(broom)
#First function takes a dataframe containing only the dependent and independent variables. Specify them by variable name or column position.
#The function then returns a list of dataframes of every possible order of independent variables (y ~ x1 + x2...) (y ~ x2 + x1...).
#So you can run your model on every possible sequence of explanatory variables
formula_func <- function(df, dependent = df["Sepal.Length"], independents = df[c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]) {
independents_df_list <- permn(independents) #length of output should be the factorial of the number of independent variables
df_list <- lapply(independents_df_list, function(x){ #this just pastes your dependent variable as the first column of each df
cbind(dependent, x)
})
df_list
}
permd_df_list <- formula_func(iris) # voila
# This function takes the output from the previous function and runs the lm building in one variable each time (y ~ x1), (y ~ x1 + x2) and so on
# So the result is many lms, building in one independent variable at a time in every possible order
# If that is as confusing to you as it is to me, check the final output. You will see which model formula is used per row and in what order each explanatory variable was added
lm_func <- function(form_df_list, df) {
mega_lst <- lapply(form_df_list, function(x) {
lst <- vector(mode = "list", length = length(2:ncol(x)))
for (i in 2:ncol(x)) {
ind <- i
form_df <- x[, 1:ind]
form <- DF2formula(form_df)
fit <- lm(form, data = x)
lst[[i - 1]] <- glance(fit)
names(lst)[[i-1]] <- deparse(form)
}
lst <- rbindlist(lst, idcol = "Model_formula")
return(lst)
})
return(mega_lst)
}
everything_list <- lm_func(permd_df_list, iris) # VOILA!!!
#Remove duplicates and return single df
everything_list_distinct <- everything_list %>%
rbindlist() %>%
distinct()
## You can now subset and select whichever column you want from the final output
I posted this as a coding exercise, so let me know if anyone spots any errors. Just one caveat: this code does NOT represent a statistically sound approach, just a coding experiment, so be sure to understand the stats first!
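For comparison, since "best subsets" only depends on which variables enter the model (not the order they are added in), a more compact sketch built on combn is possible. It assumes, as the question states, that the response is the right-most column:
library(broom)
best_subsets <- function(data) {
  y <- names(data)[ncol(data)]
  xs <- names(data)[-ncol(data)]
  # every non-empty subset of the explanatory variables
  subsets <- unlist(lapply(seq_along(xs), function(k) combn(xs, k, simplify = FALSE)),
                    recursive = FALSE)
  rows <- lapply(seq_along(subsets), function(i) {
    form <- reformulate(subsets[[i]], response = y)
    cbind(model = i,
          predictors = paste(subsets[[i]], collapse = " + "),
          glance(lm(form, data = data)))
  })
  do.call(rbind, rows)
}
res <- best_subsets(iris[1:4])  # numeric columns only; Petal.Width is the response
res[which.max(res$adj.r.squared), ]  # the "best" model by adjusted r-square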

How to run a for loop to run regressions by dummy variables

I have the following code:
reg <- lm(Y ~ x1 + x1_sq + x2 + x2_sq + x1x2 + d2 + d3 + d4, df)
Where all the x_i are continuous variables and the d_i are mutually exclusive dummy variables (d1 is present but excluded to avoid perfect multicollinearity). Rather than including the dummy variables, I want to run a separate regression for each dummy variable == 1. I wish to achieve this through a loop of the following form:
dummylist <- list("d1", "d2", "d3", "d4")
for(i in dummylist){
if(i==1){
ireg <- lm(Y ~ x1 + x1_sq + x2 + x2_sq + x1x2, df)
} else {
Unsure what to put here
}
}
My three(?) questions are:
in the first section of the -if- function, do I just include "i" before "reg" for my code to generate results "d1reg, d2reg, etc."? and,
included in the code above, what would I put after the -else- statement?
This all begs the question, is putting an if-else statement within the -for- loop the wrong approach/is there a more appropriate loop?
Sorry if this is too much, please let me know if it is and I can cut it down or separate into multiple questions. I could not find a similar question, probably as I am rather new to running loops in R and don't know what to look for.
in the first section of the -if- function, do I just include "i" before "reg" for my code to generate results "d1reg, d2reg, etc."?
Short: No
In R there are many data types. One of the more versatile ones is the list object, which can store any type of object. Alternatively one could create an environment to store the results in, but that is a bit overkill.
If you know roughly how many elements should be in your list, the easiest is to initialize it prior to your loop as
n <- 3
regList <- vector(mode = "list", length = n)
# Optional naming:
#names(regList) <- c("d1 reg", "d2 reg", "d3 reg")
In your loop you then fill in your list iteratively:
for(i in seq_along(regList)){
regList[[i]] <- lm(...)
}
what would I put after the -else- statement?
It is not entirely clear what you want here. Perhaps you want to 'only' include the separate dummy variables, one per regression. For this the simplest approach is likely to save your formula and update it for each model.
form <- Y ~ x1 + x1_sq + x2 + x2_sq + x1x2
for(i in seq_along(regList)){
#paste0 combines strings; ". ~ . + d1" means take the formula and add the term d1
form_i <- update(form, as.formula(paste0(". ~ . + d", i)))
regList[[i]] <- lm(form_i, data = df)
}
Or maybe you are actually trying to run separate regressions on the subsets where d[i] == 1. This can be done with lm itself:
form <- Y ~ x1 + x1_sq + x2 + x2_sq + x1x2
d <- list(d1, d2, d3)
for(i in seq_along(regList)){
#Using the subset argument
regList[[i]] <- lm(form, data = df, subset = which(d[[i]] == 1))
#Alternatively:
#regList[[i]] <- lm(form, data = subset(df, d[[i]] == 1))
}
Disclaimer: it is not entirely clear whether d1, d2, d3 are part of df. If they are, the example below would work:
regList[[i]] <- with(df, lm(form, subset = which(d[[i]] == 1)))
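A compact alternative, assuming d1 through d4 are columns of df, is to loop over the dummy names themselves:
dummies <- paste0("d", 1:4)
# one regression per subsample where that dummy equals 1; reuses form from above
regList <- lapply(dummies, function(d) lm(form, data = df[df[[d]] == 1, ]))
names(regList) <- paste0(dummies, "reg")  # gives the "d1reg", "d2reg", ... naming from the question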
is putting an if-else statement within the -for- loop the wrong approach/is there a more appropriate loop?
In this case it is not clearly the correct approach, but an if-else statement within a for loop isn't wrong in all circumstances; here it just doesn't serve a clear purpose. Also note that for(i in dummylist) iterates over the strings "d1", "d2", "d3", "d4" (the variables were quoted when placed in the list), so a test like i == 1 would never be TRUE.
However, another thing to address is whether you transformed the variables yourself before performing your linear regression. R's formula interface lets you do such transformations directly in the formula, and doing so helps R avoid dummy mistakes, such as testing main effects for variables that are involved in an interaction (unless that is very much what you wanted to do). For example, I assume x1_sq = x1^2. Maybe d1, d2, d3 are all levels of a single variable d? In these cases you should use the original variables, as shown below:
lm(formula = Y ~ poly(x1, 2, raw = TRUE) + poly(x2, 2, raw = TRUE) + x1:x2, data = df ) #+d if d1, d2, d3 is part of the formula
Here poly fits a second-order polynomial, and raw = TRUE returns the parameters as x1 + I(x1^2) rather than in the orthogonal representation.
If one does this, the output of drop1, anova etc. will take into account that it should not test the first-order variables against the second-order interactions.

multiple imputation and multigroup SEM in R

I want to perform multigroup SEM on imputed data using the R packages mice and semTools, specifically the runMI function that calls lavaan.
I am able to do so when imputing the entire dataset at once, but while trawling through stackoverflow/stackexchange I came across the recommendation to impute data separately for each level of a grouping variable (e.g. men, women), so that the features of each group are preserved (e.g. https://stats.stackexchange.com/questions/149053/questions-on-multiple-imputation-with-mice-for-a-multigroup-sem-analysis-inclu). However, I've not been able to find any references to support this course of action.
My question is both conceptual and practical -
1) Is splitting the dataset by group prior to imputing the correct course? Could anyone point me towards references advising this?
2) If so, how can I combine the datasets imputed by group using mice, whilst still retaining the multiple imputed datasets as a list of data frames of the mids class? I have attempted to do so, but end up with an integer:
library(lavaan)  # provides the HolzingerSwineford1939 data
library(mice)
set.seed(12345)
HSMiss <- HolzingerSwineford1939[ , paste("x", 1:9, sep = "")]
HSMiss$x5 <- ifelse(HSMiss$x1 <= quantile(HSMiss$x1, .3), NA, HSMiss$x5)
HSMiss$x9 <- ifelse(is.na(HSMiss$x5), NA, HSMiss$x9)
HSMiss$school <- HolzingerSwineford1939$school
HS.model <- '
visual =~ x1 + a*x2 + b*x3
textual =~ x4 + x5 + x6
x7 ~ textual + visual + x9
'
group1 <- subset(HSMiss, school =='Pasteur')
group2 <- subset(HSMiss, school =='Grant-White')
imputed.group1 <- mice(group1, m = 3, seed = 12345)
imputed.group2 <- mice(group2, m = 3, seed = 12345)
#attempted merging:
imputed.both <- nrow(complete(rbind(imputed.group1, imputed.group2)))
I would be incredibly grateful if anyone can offer me some help. As you can tell, I am very much still learning about R and imputation, so apologies if this is a stupid question - however, I couldn't find anything regarding this specific query elsewhere.
You are getting just an integer when merging because you are calling nrow(). Remove that call and you'll get a merged data frame.
imputed.both <- complete(rbind(imputed.group1, imputed.group2))
In case you find yourself with datasets that have multiple groups, you can use something like the following to simplify this task:
imputed.groups <- lapply(split(HSMiss, HSMiss$school), function(x) {
complete(mice(x, m = 3, seed = 12345))
})
imputed.both <- do.call(args = imputed.groups, what = rbind)
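Note that complete() returns plain data frames, so the merged result above is no longer of class mids. If runMI needs the multiply imputed datasets themselves, one option (worth verifying on your version of mice) is to rbind the mids objects directly, since mice ships an rbind method for them:
imputed.both <- rbind(imputed.group1, imputed.group2)  # rbind.mids keeps the mids class
# complete(imputed.both, action = 1) would then extract the first completed dataset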
As for how appropriate this approach to imputation is, that's probably a question better suited for Cross Validated.

Save iterations of for loop in R

I'm working on a project where I need to collect the intercept, slope, and R squared of several linear regressions. Since I need at least 200 samples of different sample sizes, I set up the code below, but it only saves the last iteration of the loop. Any suggestions on how I can record each loop so that I can have all of the coefficients and r-squares that I require?
for (i in 1:5) {
x <- as.data.frame(mydf[sample(1:1000,25,replace=FALSE),])
mylm <- lm(spd66305~spd66561, data=x)
coefs <- rbind(lman(mylm))
total.coefs <- rbind(coefs)
}
total.coefs
The function used in the loop is below if that is needed.
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
tbl <- c(intercept,slope,r2,r)
}
Thanks for the help.
Before starting your loop, write total.coefs <- data.frame() to initialise an empty data.frame. Then inside your loop update total.coefs as follows: total.coefs <- rbind(total.coefs, coefs). Finally, replace the last line of lman with:
tbl <- data.frame(intercept = intercept, slope = slope, r2 = r2, r = r)
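Putting those pieces together, the corrected loop would look roughly like this (a sketch, assuming the column names from the question):
total.coefs <- data.frame()
for (i in 1:5) {
  x <- mydf[sample(1:1000, 25, replace = FALSE), ]
  mylm <- lm(spd66305 ~ spd66561, data = x)
  total.coefs <- rbind(total.coefs, lman(mylm))  # accumulate instead of overwriting
}
total.coefs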
Here's how I'd do it, for example on the mtcars data. Note: It's not advisable to use rbind inside the loop if you're building a data structure. You can call rbind after the looping has been done and things are much less stressful. I prefer to do this type of operation with a list.
Here I wrapped my lapply loop in do.call(rbind, ...), which binds the list elements together row-wise. Another thing to note is that I take the samples prior to entering the loop. This makes debugging easier and can be more efficient overall.
reps <- replicate(3, sample(nrow(mtcars), 5), simplify = FALSE)
do.call(rbind, lapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}))
# (Intercept) hp R
# [1,] 33.29360 -0.08467169 0.5246208
# [2,] 29.97636 -0.06043852 0.4770310
# [3,] 28.33462 -0.05113847 0.8514720
The following transposed vapply loop produces the same result, and is often faster when you know the type of result you expect
t(vapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}, numeric(3)))
Another way to record each loop would be to make the work reproducible and keep your datasets around, in case you have extreme values, missing values, new questions about the datasets, or other surprises that need investigating.
This is a similar case using the iris dataset.
# create sample data
data(iris)
iris <- iris[ ,c('Sepal.Length','Petal.Length')]
# your function with data.frame fix on last line
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
data.frame(intercept,slope,r2,r)
}
# set seed to make reproducible
set.seed(3)
# create all datasets
alldatasets <- lapply(1:200,function(x,df){
df[sample(1:nrow(df),size = 50,replace = F), ]
},df = iris)
# create all models based on alldatasets
allmodels <- lapply(alldatasets,lm,formula = Sepal.Length ~ Petal.Length)
# run custom function on all models
lmanresult <- lapply(allmodels,lman)
# format results
result <- do.call('rbind',lmanresult)
row.names(result) <- NULL
# inspect the 129th sample, model, and result
alldatasets[[129]]
summary(allmodels[[129]])
result[129, ]
