I want to model my SNP array. I can do this one SNP at a time using the following code.
Data$DX=as.factor(Data$DX)
univariate=glm(relevel(DX, "CON") ~ relevel(rs6693065_D,"AA"), family = binomial, data = Data)
summary(univariate)
exp(cbind(OR = coef(univariate), confint(univariate)))
How can I do this for all the other SNPs using a loop or apply? The SNPs are rs6693065_D, rs6693065_A and hundreds more. In the code above, only "rs6693065_D" needs to be replaced by each of the other SNPs.
Best Regards
Zillur
Consider developing a generalized method to handle any snps. Then call it iteratively passing every snps column using lapply or sapply:
# GENERALIZED METHOD
proc_glm <- function(snps) {
univariate <- glm(relevel(Data$DX, "CON") ~ relevel(snps, "AA"), family = binomial)
return(exp(cbind(OR = coef(univariate), confint(univariate))))
}
# BUILD LIST OF FUNCTION OUTPUT
glm_list <- lapply(Data[3:426], proc_glm)
Wrap the call in tryCatch in case a column raises an error, for example when relevel cannot find the "AA" level:
# BUILD LIST OF FUNCTION OUTPUT
glm_list <- lapply(Data[3:426], function(col)
tryCatch(proc_glm(col), error = function(e) e))
To build a data frame instead, adjust the method to take a column name, then follow the lapply call with do.call + rbind:
proc_glm <- function(col){
# BUILD FORMULA BY STRING
univariate <- glm(as.formula(paste0("relevel(DX, \"CON\") ~ relevel(", col, ", \"AA\")")), family = binomial, data = Data)
# RETURN DATA FRAME OF COLUMN AND ESTIMATES
cbind.data.frame(COL = col,
exp(cbind(OR = coef(univariate), confint(univariate)))
)
}
# BUILD LIST OF DFs, PASSING COLUMN NAMES
glm_list <- lapply(names(Data)[3:426], function(col)
tryCatch(proc_glm(col), error = function(e) NA))
# APPEND ALL DFs FOR SINGLE MASTER DF
final_df <- do.call(rbind, glm_list)
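A self-contained sketch of the whole pattern, run on simulated data rather than the asker's file. The column names snp_1 and snp_2, the genotype labels, and the helper fit_one are made up for illustration. confint.default (Wald intervals) is swapped in for confint only to keep the example fast and quiet. Treat it as one possible arrangement, not the definitive one.

```r
# Simulated stand-in for the real data (all names here are hypothetical)
set.seed(1)
n <- 200
geno <- c("AA", "AB", "BB")
Data <- data.frame(
  DX    = factor(sample(c("CON", "CASE"), n, replace = TRUE)),
  snp_1 = factor(sample(geno, n, replace = TRUE)),
  snp_2 = factor(sample(geno, n, replace = TRUE))
)

# Set the reference level of the outcome once, outside the loop
Data$DX <- relevel(Data$DX, ref = "CON")

# Select SNP columns by name pattern instead of by position
snp_cols <- grep("^snp_", names(Data), value = TRUE)

# Fit one univariate logistic model and return odds ratios as a data frame
fit_one <- function(col) {
  x   <- relevel(Data[[col]], ref = "AA")
  fit <- glm(Data$DX ~ x, family = binomial)
  est <- exp(cbind(OR = coef(fit), confint.default(fit)))  # Wald intervals
  data.frame(SNP = col, term = rownames(est), est,
             row.names = NULL, check.names = FALSE)
}

# Return NULL on failure so that rbind simply skips the failed columns
results  <- lapply(snp_cols, function(col)
  tryCatch(fit_one(col), error = function(e) NULL))
final_df <- do.call(rbind, results)
final_df
```

Returning NULL rather than NA from the error handler means failed SNPs drop out of final_df instead of adding rows of missing values.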
anova_test <- function(dataSet, dataOne, dataTwo){
for (j in 1:8){
for (i in 1:4){
for (k in i:4){
if(i!=k){
res <- manova(cbind(colnames(dataOne)[i], colnames(dataOne)[k]) ~ colnames(dataTwo)[j], data = dataSet)
summary(res.man)
# Look to see which differ
summary.aov(res.man)
}
}
}
}
}
D <- apply_impute(data)
dataOne <- select(D, age, child, balance, previous)
dataTwo <- select(D, job, marital, education, default, housing, loan,
contact, month)
anova_test(D, dataOne, dataTwo)
Here is my code. D is a dataset. In dataOne I put the quantitative variables of D and in dataTwo I put the categorical variables of D. I want to iterate through D and run manova on every pair of quantitative variables against every categorical variable.
But when I run it, I get the following error :
Error in `[[<-.data.frame`(`*tmp*`, i, value = 1:2) :
replacement has 2 rows, data has 1
In addition: Warning message:
In storage.mode(v) <- "double" :
Error in `[[<-.data.frame`(`*tmp*`, i, value = 1:2) :
replacement has 2 rows, data has 1
Could you please help me to find what's wrong in my code?
Consider capturing all possible combinations of both sets of column names with expand.grid then call one elementwise loop with Map (wrapper to mapply) instead of three-level, nested for loops that do not save results to any object.
# BUILD DATA FRAME OF ALL POSSIBLE COMBINATIONS
# (cat1 AND cat2 HOLD THE TWO QUANTITATIVE RESPONSES, quant HOLDS THE CATEGORICAL PREDICTOR)
params_df <- expand.grid(cat1 = c("age", "child", "balance", "previous"),
cat2 = c("age", "child", "balance", "previous"),
quant = c("job", "marital", "education", "default",
"housing", "loan", "contact", "month"),
stringsAsFactors = FALSE)
# REMOVE ROWS WHERE CATEGORIES ARE THE SAME
params_df <- subset(params_df, cat1 != cat2)
# USER-DEFINED METHOD TO CALL manova WITH DYNAMIC FORMULA AND RESULTS
anova_test <- function(dataSet, cat1, cat2, quant) {
frml <- as.formula(paste0("cbind(", cat1, ",", cat2, ") ~ ", quant))
res.man <- manova(frml, data = dataSet)
res.list <- list(estimates = summary(res.man),
aov = summary.aov(res.man))
return(res.list)
}
# RETRIEVE DATA
D <- apply_impute(data)
# BUILD LIST OF MANOVA RESULTS
manova_list <- Map(anova_test,
cat1 = params_df$cat1,
cat2 = params_df$cat2,
quant = params_df$quant,
MoreArgs = list(dataSet = D))
Output
# DISPLAY SELECT RESULTS BY INDEX AND NAMES
manova_list[[1]]$estimates
manova_list[[1]]$aov
manova_list[[2]]$estimates
manova_list[[2]]$aov
# ...
# DISPLAY ALL RESULTS
lapply(manova_list, `[[`, "estimates")
lapply(manova_list, `[[`, "aov")
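A variation on the same idea, shown on the built-in mtcars data because the asker's dataset is not available. It uses combn so that each pair of responses is fitted once rather than in both orders. The choice of columns and the helper name fit_manova are assumptions made for the sketch.

```r
# Built-in data; cyl and gear are converted to factors to act as groupings
dat <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

# Unordered pairs of response columns: (mpg, disp), (mpg, hp), (disp, hp)
resp_pairs <- combn(c("mpg", "disp", "hp"), 2, simplify = FALSE)

# Cross every response pair with every grouping column
grid <- expand.grid(pair = seq_along(resp_pairs),
                    group = c("cyl", "gear"),
                    stringsAsFactors = FALSE)

# Build the formula from strings, fit, and keep both summaries
fit_manova <- function(pair, group) {
  y    <- resp_pairs[[pair]]
  frml <- as.formula(sprintf("cbind(%s, %s) ~ %s", y[1], y[2], group))
  fit  <- manova(frml, data = dat)
  list(formula = frml, estimates = summary(fit), aov = summary.aov(fit))
}

manova_list <- Map(fit_manova, grid$pair, grid$group)

# Name each result after its formula so it can be looked up directly
names(manova_list) <- vapply(manova_list,
                             function(r) paste(deparse(r$formula), collapse = ""),
                             character(1))
names(manova_list)
```

With three responses and two groupings this gives six fits instead of the twelve that an ordered expand.grid with cat1 != cat2 would produce.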
First, you don't need to pass the whole data into your anova_test function because you are passing it in two blocks already.
Then in your modelling line you need to supply the actual data not just the column names, and you don't need to specify the dataset because you are already supplying the data.
Eg:
res <- manova(as.matrix(dataOne[, c(i, k)]) ~ dataTwo[[j]])
You could do this using the column names and the full dataset, but it's needlessly more difficult. The difference from your code is the use of get to turn the name as a string into the object it refers to.
res <- manova(cbind(get(colnames(dataOne)[i]), get(colnames(dataOne)[k])) ~ get(colnames(dataTwo)[j]), data = dataSet)
Finally, I'm not sure why you want so many pairwise MANOVAs like this. There may be a better way to do what you want to do, statistically speaking.
I want to perform a certain number of statistical models based on selection criteria specified in a dataframe. So using a basic example, say I had 2 responses variables and 2 explanatory variables:
#######################Data Input############################
Responses <- as.data.frame(matrix(sample(0:10, 1*100, replace=TRUE), ncol=2))
colnames(Responses) <- c("A","B")
Explanatories <- as.data.frame(matrix(sample(20:30, 1*100, replace=TRUE), ncol=2))
colnames(Explanatories) <- c("x","y")
I then define which statistical models I would like to run, which can include different combinations of Response / Explanatory variables and different statistical functions:
###################Model selection#########################
Function <- c("LIN","LOG","EXP") ##Linear, Logarithmic (base 10) and exponential - see the formula for these below
Respo <- c("A","B","B")
Explan <- c("x","x","y")
Model_selection <- data.frame(Function,Respo,Explan)
How do I then perform a list of models based on these selection criteria? Here is an example of the models I would like to create based on the inputs from the Model_selection data frame.
####################Model creation#########################
Models <- list(
lm(Responses$A ~ Explanatories$x),
lm(Responses$B ~ log10(Explanatories$x)),
lm(Responses$B ~ exp(Explanatories$y))
)
I would guess that some kind of loop function would be required and after looking around perhaps paste too? Thanks in advance for any help with this
This isn't the prettiest solution, but it seems to work for your example:
Models <- list()
idx <- 1L
for (row in 1:nrow(Model_selection)){
if (Model_selection$Function[row]=='LOG'){
expl <- paste0('LOG', Model_selection$Explan[row])
Explanatories[[expl]] <- log10(Explanatories[[Model_selection$Explan[row]]])
Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
}
if (Model_selection$Function[row]=='EXP'){
expl <- paste0('EXP', Model_selection$Explan[row])
Explanatories[[expl]] <- exp(Explanatories[[Model_selection$Explan[row]]])
Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
}
if (Model_selection$Function[row]=='LIN'){
expl <- paste0('LIN', Model_selection$Explan[row])
Explanatories[[expl]] <- Explanatories[[Model_selection$Explan[row]]]
Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
}
names(Models)[idx] <- paste(Model_selection$Respo[row], '~', expl)
idx <- idx+1L
}
Models
This is a perfect use-case for the tidyverse
library(tidyverse)
## cbind both data sets into one
my_data <- cbind(Responses, Explanatories)
## use 'mutate' to change function names to the existing function names
## mutate_all to transform implicit factors to characters
## NB this step could be omitted if Function already used the proper names
model_params <- Model_selection %>%
mutate(Function = case_when(Function == "LIN" ~ "identity",
Function == "LOG" ~ "log10",
Function == "EXP" ~ "exp")) %>%
mutate_all(as.character)
## create a function which estimates the model given the parameters
## NB: function params must be named exactly like columns
## in the model_selection df
make_model <- function(Function, Respo, Explan) {
my_formula <- formula(paste0(Respo, "~", Function, "(", Explan, ")"))
my_mod <- lm(my_formula, data = my_data)
## syntactic sugar: such that we see the value of the formula in the print
my_mod$call$formula <- my_formula
my_mod
}
## use purrr::pmap to loop over the model params
## creates a list with all the models
pmap(model_params, make_model)
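For comparison, a base R sketch of the same table-driven approach without the tidyverse. The lookup vector transforms and the combined data frame my_data are names introduced here for illustration. The data are simulated in the same spirit as the question.

```r
# Simulated data in one data frame (hypothetical stand-in for the question's data)
set.seed(42)
my_data <- data.frame(A = sample(0:10, 50, replace = TRUE),
                      B = sample(0:10, 50, replace = TRUE),
                      x = sample(20:30, 50, replace = TRUE),
                      y = sample(20:30, 50, replace = TRUE))

Model_selection <- data.frame(Function = c("LIN", "LOG", "EXP"),
                              Respo    = c("A", "B", "B"),
                              Explan   = c("x", "x", "y"),
                              stringsAsFactors = FALSE)

# Map each code to a template for the right-hand side of the formula
transforms <- c(LIN = "%s", LOG = "log10(%s)", EXP = "exp(%s)")

# sprintf is vectorised over both the template and the variable name
rhs      <- unname(sprintf(transforms[Model_selection$Function],
                           Model_selection$Explan))
formulas <- paste(Model_selection$Respo, "~", rhs)

# One lm per row of the selection table, named by its formula
Models <- lapply(formulas, function(f) lm(as.formula(f), data = my_data))
names(Models) <- formulas
names(Models)
```

Adding a new transformation then only requires one more entry in transforms, rather than another if block.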
In my dataset I have 6 variables (x1, x2, x3, x4, x5, x6). I wish to create a function that lets me input one variable and fits the formula using the rest of the variables in the dataset.
For instance,
fitRegression <- function(data, dependentVariable) {
fit = lm(formula = x1 ~., data = data1)
return(fit)
}
fitRegression(x2)
However, this function only returns the results for x1. My desired result is to input any variable and have the function automatically fit the formula with the rest of the variables.
For Example:
fitRegression(x2)
should subtract x2 from the variable list therefore we only compare x2 with x1,x3,x4,x5,x6.
and if:
fitRegression(x3)
should subtract x3 from the comparable list, therefore we compare x3 with x1,x2,x4,x5,x6.
Is there any way to express this in my function, or is there a better function?
You can do it like this:
# sample data
sampleData <- data.frame(matrix(rnorm(500),100,5))
colnames(sampleData) <- c("A","B","C","D","E")
# function
fitRegression <- function(mydata, dependentVariable) {
# select your independent and dependent variables
dependentVariableIndex<-which(colnames(mydata)==dependentVariable)
independentVariableIndices<-which(colnames(mydata)!=dependentVariable)
fit = lm(formula = as.formula(paste(colnames(mydata)[dependentVariableIndex], "~", paste(colnames(mydata)[independentVariableIndices], collapse = "+"), sep = "" )), data = mydata)
return(fit)
}
# ground truth
lm(formula = A~B+C+D+E, data = sampleData)
# reconcile results
fitRegression(sampleData, "A")
You want to select the Y variable in your argument. The main difficulty is to pass this argument without any quotes in your function (it is apparently the expected result in your code). Therefore you can use this method, using the combination deparse(substitute(...)):
fitRegression <- function(data, dependentVariable) {
formula <- as.formula(paste0(deparse(substitute(dependentVariable)), "~."))
return(lm(formula, data) )
}
fitRegression(mtcars, disp)
That will return the model.
The code below uses "purrr" and "caret" to produce a list of models, one per column:
df <-mtcars
library(purrr);library(caret)
#create training set
vect <- createDataPartition(1:nrow(df), p=0.8, list = FALSE)
#build model list
ModList <- 1:length(df) %>%
map(function(col) train(y= df[vect,col], x= df[vect,-col], method="lm"))
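A base R sketch that combines the ideas above: it takes the dependent variable as a string, lets "." stand for every other column, and loops over all columns. It uses mtcars as example data, and the list name fits is an assumption of the sketch.

```r
# Fit each column of the data against all remaining columns
fit_regression <- function(data, dependent_variable) {
  # reformulate(".", response = "disp") builds the formula disp ~ .
  lm(reformulate(".", response = dependent_variable), data = data)
}

# Single model, as in the question
fit_disp <- fit_regression(mtcars, "disp")

# One model per column, stored in a named list
fits <- lapply(names(mtcars), function(v) fit_regression(mtcars, v))
names(fits) <- names(mtcars)

# Coefficients of the model with disp as the response
coef(fits[["disp"]])
```

Passing the name as a string makes the function easy to call from lapply. The deparse(substitute(...)) variant is more convenient only for interactive use.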
I'm working with ten training datasets, train1 through train10, and would like to repeat the following statements for 1 through 10 with a single block of code:
train_y_1 <- c(train1$y)
train1$y <-NULL
train_x_1 <- data.matrix(train1)
olsfit_1 <- cv.glmnet(y=train_y_1, x=train_x_1, alpha=1, family="gaussian")
I've read in the forums that lapply() is preferable to for loops. My code:
# Create empty data frames and list (to be populated with values in main program)
list2env(setNames(lapply(1:10, function(i) data.frame()), paste0('train_y_', 1:10)), envir=.GlobalEnv)
list2env(setNames(lapply(1:10, function(i) data.frame()), paste0('train_x_', 1:10)), envir=.GlobalEnv)
list2env(setNames(lapply(1:10, function(i) list()), paste0('lasso_', 1:10)), envir=.GlobalEnv)
# Create y and x input matrices and run ten lasso regressions
list2env(lapply(mget(paste0('train', 1:10)), mget(paste0('train_y_', 1:10)), mget(paste0('train_x_', 1:10)), mget(paste0('lasso_', 1:10)),
function(a,b,c,d)
{
b <- c(a$y);
a$y <- NULL;
c <- data.matrix(a);
d <- cv.glmnet(y=b, x=c, alpha=1, family="gaussian");
}), envir=.GlobalEnv)
which produces the error message:
Error in match.fun(FUN) :
'mget(paste0("train_y_", 1:10))' is not a function, character or symbol
So it looks like R is confused by the four mget() functions which I intended to be reading in values for the a,b,c,d arguments, but I'm not sure how to proceed next.
Any suggestions?
You want to keep all your data in lists whenever possible, avoiding polluting the global environment with a bunch of variables. This isn't tested, and train is missing, but should be a similar list of your train data. Then, you could do something like,
trainy <- setNames(lapply(1:10, function(i) data.frame()), paste0('train_y_', 1:10))
trainx <- setNames(lapply(1:10, function(i) data.frame()), paste0('train_x_', 1:10))
lasso <- setNames(lapply(1:10, function(i) list()), paste0('lasso_', 1:10))
f <- function(a,b,c,d) {
b <- c(a$y);
a$y <- NULL;
c <- data.matrix(a);
d <- cv.glmnet(y=b, x=c, alpha=1, family="gaussian");
}
mapply(f, train, trainy, trainx, lasso, SIMPLIFY=F)
Although, since your lists are just initializing variables, you probably just want to loop (apply) over a list of your training data,
lapply(train, function(x) {
... # the statements you want to repeat
list(...) # return a list of the three data.frames
})
We can achieve this with the following code.
# Load libraries
library(dplyr);library(glmnet)
# Gather all the variables in global into a list
fit = mget(paste0("train", 1:10), envir = .GlobalEnv) %>%
# Pipe each element of the list into `cv.glmnet` function
lapply(function(dat) {cv.glmnet(y = dat$y,
x = data.matrix(dat %>% mutate(y = NULL)),
alpha = 1,
family = "gaussian")})
Your output will be neatly stored in fit, which is a list with 10 elements. You can call each element with fit[[i]]. For example coef(fit[[1]]) pulls out the coefs for train1 and lapply(fit, coef) pulls the coef for all 10 models and stores them in a list.
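A package-free sketch of the list-based pattern, for readers who want to try it without glmnet installed. The training sets are simulated, the helper prep_and_fit is hypothetical, and lm.fit is used purely as a stand-in for cv.glmnet. In real use the marked line would be replaced with the cv.glmnet call from the question.

```r
# Simulated training sets kept in one named list (stand-in for train1..train10)
set.seed(7)
train <- lapply(1:3, function(i)
  data.frame(y = rnorm(30), x1 = rnorm(30), x2 = rnorm(30)))
names(train) <- paste0("train", seq_along(train))

# Split one data frame into y and x, fit, and return everything together
prep_and_fit <- function(dat) {
  y <- dat$y
  x <- data.matrix(dat[setdiff(names(dat), "y")])
  # Stand-in fit; with glmnet this line would be
  # cv.glmnet(y = y, x = x, alpha = 1, family = "gaussian")
  fit <- lm.fit(x = cbind(Intercept = 1, x), y = y)
  list(y = y, x = x, fit = fit)
}

out <- lapply(train, prep_and_fit)

# Coefficients of every fit, one column per training set
sapply(out, function(o) o$fit$coefficients)
```

Because each list element carries its own y, x and fit, there is no need to pre-create empty objects or write into the global environment.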
I'm new to loops and I have a problem with calling variable from i'th data frame.
I'm able to call each data frame correctly, but when I should call a specified variable inside each data frame problems come:
Example:
for (i in 1:15) {
assign(
paste("model", i, sep = ""),
(lm(response ~ variable, data = eval(parse(text = paste("data", i, sep = "")))))
)
plot(data[i]$response, predict.lm(eval(parse(text = paste("model", i, sep = ""))))) #plot obs vs preds
}
Here I'm doing a simple one variable linear model 15 times, which works just fine. Problems come when I try to plot the results. How should I call data[i] response?
Let's say there are multiple dataframes with names data1 ... data15, and that there are no other data objects whose names begin with the letters d, a, t, a. Let's also assume that each of those dataframes has columns named 'response' and 'variable'. Then this would gather the dataframes into a list and draw separate plots for the linear regression lines.
dlist <- lapply(ls(pattern = '^data'), get)
lapply(dlist, function(df) {
  plot(NA, xlim = range(df$variable), ylim = range(df$response))
  abline(coef(lm(response ~ variable, data = df)))
})
If you wanted to name the dataframes in that list, reuse the same ls() call. ls() returns names in alphabetical order (data1, data10, ..., data2), so names pasted together from 1:15 would not line up:
names(dlist) <- ls(pattern = '^data')
There are many other assignments you could make in the context of this loop, but you would need to describe the desired results better than with failed efforts.
Here's modified code that should work. It does one variable lm-model and calculates correlation of predicted and observed values and stores it into an empty matrix. It also plots these values.
Thanks Thomas for help.
par(mfrow=c(4,5))
results.matrix <- matrix(NA, nrow = 20, ncol = 2)
colnames(results.matrix) <- c("Subset","Correlation")
for (i in 1:length(datalist)) {
model <- lm(response ~ variable, data = datalist[[i]])
pred <- predict.lm(model)
cor <- (cor.test(pred, datalist[[i]]$response))
plot(pred, datalist[[i]]$response, xlab="pred", ylab="obs")
results.matrix[i, 1] <- i
results.matrix[i, 2] <- cor$estimate
}
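The same bookkeeping can be sketched without a pre-sized matrix by keeping models and correlations in lists. The data here are simulated and the names datalist, models and results are chosen for the example. Plotting is left out so the snippet runs anywhere.

```r
# Simulated list of data frames with the columns used in the thread
set.seed(3)
datalist <- lapply(1:4, function(i) {
  variable <- rnorm(25)
  data.frame(variable = variable, response = 2 * variable + rnorm(25))
})
names(datalist) <- paste0("data", seq_along(datalist))

# One model per data frame; list names carry over automatically
models <- lapply(datalist, function(df) lm(response ~ variable, data = df))

# Correlation between fitted and observed values for each subset
correlations <- vapply(names(models), function(nm)
  cor(fitted(models[[nm]]), datalist[[nm]]$response), numeric(1))

results <- data.frame(Subset = names(correlations),
                      Correlation = unname(correlations))
results
```

Since the result grows with length(datalist), nothing has to be changed when the number of subsets changes.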