anova_test <- function(dataSet, dataOne, dataTwo){
for (j in 1:8){
for (i in 1:4){
for (k in i:4){
if(i!=k){
res <- manova(cbind(colnames(dataOne)[i], colnames(dataOne)[k]) ~ colnames(dataTwo)[j], data = dataSet)
summary(res.man)
# Look to see which differ
summary.aov(res.man)
}
}
}
}
}
D <- apply_impute(data)
dataOne <- select(D, age, child, balance, previous)
dataTwo <- select(D, job, marital, education, default, housing, loan,
contact, month)
anova_test(D, dataOne, dataTwo)
Here is my code. D is a data set. dataOne holds the quantitative variables of D and dataTwo holds the categorical variables of D. I want to run manova on every pair of quantitative variables against every categorical variable.
But when I run it, I get the following error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = 1:2) :
replacement has 2 rows, data has 1
In addition: Warning message:
In storage.mode(v) <- "double" :
Error in `[[<-.data.frame`(`*tmp*`, i, value = 1:2) :
replacement has 2 rows, data has 1
Could you please help me to find what's wrong in my code?
Consider capturing all possible combinations of both sets of column names with expand.grid then call one elementwise loop with Map (wrapper to mapply) instead of three-level, nested for loops that do not save results to any object.
# BUILD DATA FRAME OF ALL POSSIBLE COMBINATIONS
# (cat1/cat2 HOLD THE QUANTITATIVE RESPONSES, quant THE CATEGORICAL PREDICTOR)
params_df <- expand.grid(cat1 = c("age", "child", "balance", "previous"),
cat2 = c("age", "child", "balance", "previous"),
quant = c("job", "marital", "education", "default",
"housing", "loan", "contact", "month"),
stringsAsFactors = FALSE)
# KEEP EACH UNORDERED PAIR ONCE (DROPS cat1 == cat2 AND MIRRORED DUPLICATES)
params_df <- subset(params_df, cat1 < cat2)
# USER-DEFINED METHOD TO CALL manova WITH DYNAMIC FORMULA AND RESULTS
anova_test <- function(dataSet, cat1, cat2, quant) {
frml <- as.formula(paste0("cbind(", cat1, ",", cat2, ") ~ ", quant))
res.man <- manova(frml, data = dataSet)
res.list <- list(estimates = summary(res.man),
aov = summary.aov(res.man))
return(res.list)
}
# RETRIEVE DATA
D <- apply_impute(data)
# BUILD LIST OF MANOVA RESULTS
manova_list <- Map(anova_test,
cat1 = params_df$cat1,
cat2 = params_df$cat2,
quant = params_df$quant,
MoreArgs = list(dataSet = D))
Output
# DISPLAY SELECT RESULTS BY INDEX AND NAMES
manova_list[[1]]$estimates
manova_list[[1]]$aov
manova_list[[2]]$estimates
manova_list[[2]]$aov
# ...
# DISPLAY ALL RESULTS
lapply(manova_list, `[[`, "estimates")
lapply(manova_list, `[[`, "aov")
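To compare the fits at a glance, the p-value of each multivariate test can be stored next to its combination. The following is only a sketch under assumptions: the built-in mtcars data stands in for D (which is not available here), and the column choices and the name grid are illustrative.

```r
# Assumption: mtcars stands in for the asker's data set D
cars <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

# One row per (response pair, grouping variable) combination
grid <- expand.grid(y1 = c("mpg", "hp", "wt"),
                    y2 = c("mpg", "hp", "wt"),
                    group = c("cyl", "gear"),
                    stringsAsFactors = FALSE)
grid <- subset(grid, y1 < y2)  # keep each unordered pair once

# Fit one MANOVA per row and keep the p-value of the default (Pillai) test
grid$p_value <- unname(mapply(function(y1, y2, group) {
  frml <- as.formula(paste0("cbind(", y1, ", ", y2, ") ~ ", group))
  # the first row of the stats matrix belongs to the grouping term
  summary(manova(frml, data = cars))$stats[1, "Pr(>F)"]
}, grid$y1, grid$y2, grid$group))

grid
```

Keeping the p-values in the same data frame as the combinations makes it easy to sort or filter them afterwards.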
First, you don't need to pass the whole data set into your anova_test function, because you are already passing it in as two blocks.
Then, in your modelling line, you need to supply the actual data, not just the column names. Once you do that, you no longer need the data argument, because the formula refers to the data directly. The response has to be a matrix rather than a data frame, hence as.matrix, and [[j]] extracts the predictor as a plain vector.
E.g.:
res <- manova(as.matrix(dataOne[, c(i, k)]) ~ dataTwo[[j]])
You could do this using the column names and the full dataset, but it's needlessly more difficult. The difference from your code is the use of get to turn a name stored as a string into the object it refers to.
res <- manova(cbind(get(colnames(dataOne)[i]), get(colnames(dataOne)[k])) ~ get(colnames(dataTwo)[j]), data = dataSet)
Finally, I'm not sure why you want so many pairwise MANOVAs like this; there may be a better way (statistically speaking) to do what you want.
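To see why passing column names alone fails, it can help to look at what cbind() receives in the original call. This is a small sketch using the built-in iris data as a stand-in (an assumption, since D is not available); nums and name_matrix are illustrative names.

```r
# Assumption: iris stands in for the asker's data set D
nums <- c("Sepal.Length", "Sepal.Width")

# What the original formula handed to manova: a 1 x 2 character matrix of names,
# not two numeric columns with one value per observation
name_matrix <- cbind(nums[1], nums[2])
dim(name_matrix)
is.character(name_matrix)

# Pasting the names into the formula text lets manova look the columns up in data
frml <- as.formula(paste0("cbind(", nums[1], ", ", nums[2], ") ~ Species"))
fit <- manova(frml, data = iris)
summary(fit)
```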
Related
I've recently been interested in trying to develop a for-loop that would be able to run multiple generalized additive models and then produce results in a table that ranks them based on AIC, p-value of each smooth in the model, deviance explained of the overall model, etc.
I found a related question on Stack Overflow that does basically what I want, and I was able to adapt it to gam() instead of gamm(). However, I want to expand it to include multiple independent variables in the model, not just one.
Ideally, the models would run all possible combinations of independent variables against the dependent variable, and it would test combinations anywhere from 1 independent variable in the model, up to all of the possible covariates in "d_pred" in the model.
I have attempted to do this so far by starting out small and finding all possible combinations of 2 independent variables (df_combinations2), which results in a list of data frames. Then I adjusted the rest of the code to run the for loop such that each iteration will run a different combination of the two variables:
library(mgcv)
## Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
d_resp <- d[ c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]
df_combinations2 <- lapply(1:(ncol(combn(1:ncol(d_pred), m = 2))),
function(y) d_pred[, combn(1:ncol(d_pred), m = 2)[,y]])
## create a "matrix" list of dimensions i x j
results_m2 <-lapply(1:length(df_combinations2), matrix, data= NA, nrow=ncol(d_resp), ncol=2)
## for-loop
for(k in 1:length(df_combinations2)){
for(i in 1:ncol(d_resp)){
for(j in 1:ncol(df_combinations2[[k]])){
results_m2[i, j][[1]] <- gam(d_resp[, i] ~ s(df_combinations2[[k]][,1])+s(df_combinations2[[k]][,2]))
}
}}
However, after running the for-loop I get the error "Error in all.vars1(gp$fake.formula[-2]) : can't handle [[ in formula".
Does anyone know why I am getting this error and how to fix it?
Any insight is much appreciated. Thanks!
Personally, I would create a data.table() containing all combinations of target variables and combinations of predictors and loop through all rows. See below.
library(mgcv)
library(data.table)
library(dplyr)
# Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
#select names of targets and predictors
targets <- c("y", "y1")
predictors <- colnames(d)[!colnames(d) %in% targets]
#create all combinations of predictors
predictor_combinations <- lapply(seq_along(predictors), FUN = function(x){
  #build every combination of x predictors, wrap each predictor in s() for gam
  #and collapse each combination into one string such as "s(x0)+s(x1)"
  combn(predictors, m = x, FUN = function(p) paste0("s(", p, ")", collapse = "+"))
})
#merge combinations of predictors as vector
predictor_combinations <- unlist(predictor_combinations)
#create folder to save results to
if(!dir.exists("dev")){
dir.create("dev")
}
if(!dir.exists("dev/models")){
dir.create("dev/models")
}
#create and save hypergrid (all combinations of targets and predictors combinations)
if(!file.exists("dev/hypergrid.csv")){
#create hypergrid and save to dev
hypergrid <- expand.grid(target = targets, predictors = predictor_combinations) |> as.data.table()
#add identifier
hypergrid[, model := paste0("model", 1:nrow(hypergrid))]
#save to dev
fwrite(hypergrid, file = "dev/hypergrid.csv")
} else{
#if file exists read
hypergrid <- fread("dev/hypergrid.csv")
}
#loop through hypergrid, create GAM models
#progressbar
pb <- txtProgressBar(min = 1, max = nrow(hypergrid), style = 3)
for(i in 1:nrow(hypergrid)){
#update progressbar
setTxtProgressBar(pb, i)
#select target
target <- hypergrid[i,]$target
#select predictors
predictors <- hypergrid[i,]$predictors
#create formula
gam.formula <- as.formula(paste0(target, "~", predictors))
#run gam
gam.model <- gam(gam.formula, data = d)
#save gam model to dev/models
saveRDS(gam.model, file = paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
}
#example where you extract model performances
for(i in 1:nrow(hypergrid)){
#read the right model
rel.model <- readRDS(paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
#extract model performance, add to hypergrid
hypergrid[i, R2 := summary(rel.model)[["r.sq"]]]
}
#arrange hypergrid on target and r2
hypergrid <- dplyr::arrange(hypergrid, target, desc(R2))
Which would give something like the following (the model numbers simply reflect row order in the hypergrid):
head(hypergrid)
target predictors model R2
1: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5) model319 0.6957242
2: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5) model423 0.6953753
3: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x7) model437 0.6942054
4: y s(x0)+s(x1)+s(x2)+s(x5) model175 0.6941025
5: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x6) model435 0.6940569
6: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5)+s(x7) model481 0.6939756
All models are saved to a folder with an identifier, in case you want to use a model again or extract more information from it.
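The question also asked for AIC, deviance explained and the p-values of the smooths. As a rough sketch, these can be read from a fitted model and its summary; the two-predictor model and the perf data frame below are illustrative assumptions, not part of the original answer.

```r
library(mgcv)

# Small example fit on simulated data (same generator as above)
set.seed(0)
dat <- gamSim(1, n = 200, scale = 2, verbose = FALSE)
fit <- gam(y ~ s(x0) + s(x1), data = dat)

sm <- summary(fit)

# Collect a few performance measures in one row
perf <- data.frame(
  AIC = AIC(fit),                 # Akaike information criterion
  dev_expl = sm$dev.expl,         # proportion of deviance explained
  r_sq = sm$r.sq,                 # adjusted R-squared
  max_smooth_p = max(sm$s.pv)     # largest approximate p-value among the smooths
)
perf
```

The same extraction could be run inside the loop over saved models, adding one column per measure to the hypergrid.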
Note that this approach invites p-hacking, so I would be careful about conducting your analysis like this.
I am trying to generate a function to calculate the correlation between variables age and views in data frame data for each category of gender.
My data frame is called tv_viewing and has 5 columns: adhd (numeric), sex (factor, boy/girl), famsize (factor with 4 levels: 1 child, 2 child, 3 child, 4+ child), age (numeric) and views (numeric, amount of television watched).
I have gotten this far:
partcorr <- function(tv_viewing, age, views, sex) {
corrs <- list()
for(i(tv_viewing[,sex])) {
corrs[i] <- round(sex(tv_viewing[tv_viewing[,sex] == i, age], tv_viewing[tv_viewing[,sex] == i, views], method = "pearson"), digits = 2)
}
return()
}
Or, more generally,
partcorr <- function(data, x, y, cat) {
corrs <- list()
for(i in levels(data[,cat])) {
corrs[i] <- round(cor(data[data[,cat] == i, x], data[data[,cat] == i, y], method = "pearson"), digits = 2)
}
return()
}
But this is not working. What am I doing wrong?
You can use the function by() and do not need a for loop. With your data and columns, it should work like this:
by(tv_viewing[, c("age", "views")], tv_viewing$sex, cor)
Or, as I do not have access to your data, a reproducible example using the iris data set is as follows:
by(iris[, 1:2], iris$Species, cor)
Maybe your code did not work because you did not quote your column names (tv_viewing[, sex] will not read your column sex, but tv_viewing[, "sex"] does).
EDIT: If you need the correlations in one vector, you can use this code:
c(by(iris[, 1:2], iris$Species, function(x) cor(x)[2, 1]))
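If you prefer to keep the loop, note that return() without an argument returns NULL, so the correlations are never handed back. A minimal repaired sketch, assuming the column names are passed as character strings and using iris as stand-in data (partcorr_loop is a hypothetical name):

```r
# Loop version: x, y and cat are column names given as strings
partcorr_loop <- function(data, x, y, cat) {
  groups <- unique(as.character(data[[cat]]))
  corrs <- setNames(numeric(length(groups)), groups)
  for (g in groups) {
    rows <- data[[cat]] == g
    corrs[g] <- round(cor(data[rows, x], data[rows, y], method = "pearson"),
                      digits = 2)
  }
  corrs  # return the filled vector instead of calling return() with no value
}

res <- partcorr_loop(iris, "Sepal.Length", "Sepal.Width", "Species")
res
```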
partcorr <- function(df,
col1,
col2,
cat,
digits = 2,
method = "pearson",
... ) {
dfs <- split(df[, c(col1, col2)], df[, cat])
sapply(dfs, function(sdf) round(cor(sdf[, col1],
sdf[, col2],
method = method,
...),
digits = digits))
}
This creates, for each category, a sub-data-frame of the original data
containing the columns given for col1 and col2, and stores them in a list of data frames (dfs).
It then loops over this list and calculates the correlation between the col1 and col2 columns
of each sub-data-frame with the given method. The ... lets you pass any additional arguments
for cor to partcorr, which forwards them to cor.
The number of digits can be changed as well.
I want to model my SNP array. I can do this one SNP at a time using the following code.
Data$DX=as.factor(Data$DX)
univariate=glm(relevel(DX, "CON") ~ relevel(rs6693065_D,"AA"), family = binomial, data = Data)
summary(univariate)
exp(cbind(OR = coef(univariate), confint(univariate)))
How can I do this for all the other SNPs using a loop or apply? The SNPs are rs6693065_D, rs6693065_A and hundreds more. In the above code, only "rs6693065_D" needs to be replaced by each of the other SNPs.
Best Regards
Zillur
Consider developing a generalized method to handle any snps. Then call it iteratively passing every snps column using lapply or sapply:
# GENERALIZED METHOD
proc_glm <- function(snps) {
univariate <- glm(relevel(Data$DX, "CON") ~ relevel(snps, "AA"), family = binomial)
return(exp(cbind(OR = coef(univariate), confint(univariate))))
}
# BUILD LIST OF FUNCTION OUTPUT
glm_list <- lapply(Data[3:426], proc_glm)
Use tryCatch in case of errors, such as relevel failing when a column has no "AA" level:
# BUILD LIST OF FUNCTION OUTPUT
glm_list <- lapply(Data[3:426], function(col)
tryCatch(proc_glm(col), error = function(e) e))
To build a data frame, adjust the method and the lapply call, then follow with do.call + rbind:
proc_glm <- function(col){
  # BUILD FORMULA BY STRING, KEEPING THE REFERENCE LEVELS FROM THE QUESTION
  frml <- as.formula(paste0("relevel(DX, 'CON') ~ relevel(", col, ", 'AA')"))
  univariate <- glm(frml, family = binomial, data = Data)
  # RETURN DATA FRAME OF COLUMN AND ESTIMATES
  cbind.data.frame(COL = col,
                   exp(cbind(OR = coef(univariate), confint(univariate)))
  )
}
# BUILD LIST OF DFs, PASSING COLUMN NAMES
glm_list <- lapply(names(Data)[3:426], function(col)
  tryCatch(proc_glm(col), error = function(e) NA))
# APPEND ALL DFs FOR SINGLE MASTER DF
final_df <- do.call(rbind, glm_list)
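For a version that can be run without the real data, here is a small sketch on simulated genotypes. Everything in it is an assumption for illustration: the simulated Data, the SNP column names rs1_D and rs2_D, and the helper fit_one. It also swaps in Wald intervals from confint.default in place of the profile-likelihood intervals from confint, purely to keep the sketch quick.

```r
set.seed(42)
n <- 200
# Hypothetical stand-in for the asker's Data
Data <- data.frame(
  DX = factor(sample(c("CON", "CASE"), n, replace = TRUE)),
  rs1_D = factor(sample(c("AA", "AG", "GG"), n, replace = TRUE)),
  rs2_D = factor(sample(c("AA", "AG", "GG"), n, replace = TRUE))
)
snp_cols <- setdiff(names(Data), "DX")

# Fit one logistic model per SNP and return odds ratios with Wald intervals
fit_one <- function(col) {
  frml <- as.formula(paste0("relevel(DX, 'CON') ~ relevel(", col, ", 'AA')"))
  fit <- glm(frml, family = binomial, data = Data)
  est <- exp(cbind(OR = coef(fit), confint.default(fit)))
  data.frame(COL = col, term = rownames(est), est,
             row.names = NULL, check.names = FALSE)
}

# Failed fits become NULL and are dropped before binding
results <- lapply(snp_cols, function(col)
  tryCatch(fit_one(col), error = function(e) NULL))
final_df <- do.call(rbind, Filter(Negate(is.null), results))
final_df
```

Returning NULL on error and filtering it out keeps placeholder rows out of the combined data frame.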
I have a customized function (psup2) that gets data from a data frame and returns a result. The problem is that it takes a while since I am using a "for" loop that runs for every row and column.
Input:
I have a table that contains the ages (table_customers), an n*m matrix of different terms, and two different mortality tables (for males and females).
The mortality tables I'm using contain one column for ages and another one for the corresponding survival probabilities.
Output:
I want to create a separate data frame with the same size as the terms table. The code will pick the mortality table that matches the gender and then apply the function psup2, taking the ages from table_customers and the terms from the terms matrix.
So far I have only managed a very inefficient way to do this, but hopefully one of the functions from the apply family could make it faster.
The following code shows the idea of what I am trying to do:
#Function
psup2 <- function(x, age, term) {
P1 = 1
for (i in 1:term) {
P <- x[age + i, 2]
P1 <- P1*P
}
return(P1)
}
#Inputs
terms <- data.frame(V1 = c(1,2,3), V2 = c(1,3,4), V3 = c(2,3,4))
male<- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9981,0.9979,0.9978,.994,.992,.99))
female <- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9983,0.998,0.9979,.9970,.9964,.9950))
table_customers <- data.frame(id = c(1,2,3), age = c(0,0,0), gender = c(1,2,1))
#Loop
output <- data.frame(matrix(NA, nrow = 3, ncol = 0))
for (i in 1:3) {
for (j in 1:3) {
prob <- ifelse(table_customers[j, 3] == 1,
psup2(male, as.numeric(table_customers[j, 2]), as.numeric(terms[j,i])),
psup2(female, as.numeric(table_customers[j, 2]), as.numeric(terms[j,i])))
output[j, i] <- prob
}
}
Your psup2 function can be simplified to:
psup2 <- function(x, age, term) { prod(x$probability[age+(1:term)]) }
So actually, we won't use it, we'll use the formula directly.
We'll put your male and female df next to each other, so we can use the value of the gender column to choose one or another.
mf <- merge(male,female,by="age") # assuming you have the same ages on both sides
input_df <- cbind(table_customers,terms)
output <- t(apply(input_df,1,function(x){sapply(1:3,function(i){prod(mf[x["age"]+(1:x[3+i]),x["gender"]+1])})}))
And that's it :)
The sapply function is used to loop on the columns of terms.
x["age"]+(1:x[3+i]) are the indices of the rows you want to multiply
x["gender"]+1 is the relevant column of the mf data.frame
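If the nested one-liner is hard to read, the same calculation can be spelled out with mapply. This is only a sketch under the question's conventions (gender code 1 = male, 2 = female, and one row of terms per customer); the list called tables is an illustrative helper.

```r
male <- data.frame(age = 0:5,
                   probability = c(0.9981, 0.9979, 0.9978, 0.994, 0.992, 0.99))
female <- data.frame(age = 0:5,
                     probability = c(0.9983, 0.998, 0.9979, 0.9970, 0.9964, 0.9950))
terms <- data.frame(V1 = c(1, 2, 3), V2 = c(1, 3, 4), V3 = c(2, 3, 4))
table_customers <- data.frame(id = c(1, 2, 3), age = c(0, 0, 0), gender = c(1, 2, 1))

# Product of the survival probabilities for rows age + 1 ... age + term
psup2 <- function(x, age, term) prod(x$probability[age + seq_len(term)])

# Mortality tables indexed by gender code: 1 = male, 2 = female
tables <- list(male, female)

# For each column of terms, walk over the customers in parallel
output <- sapply(terms, function(term_col) {
  mapply(function(g, a, t) psup2(tables[[g]], a, t),
         table_customers$gender, table_customers$age, term_col)
})
output
```

The result is a matrix with one row per customer and one column per term column, like the output of the original double loop.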
I'm new to loops and I have a problem with calling a variable from the i-th data frame.
I'm able to call each data frame correctly, but the problems start when I need to call a specific variable inside each data frame:
Example:
for (i in 1:15) {
assign(
paste("model", i, sep = ""),
(lm(response ~ variable, data = eval(parse(text = paste("data", i, sep = "")))))
)
plot(data[i]$response, predict.lm(eval(parse(text = paste("model", i, sep = ""))))) #plot obs vs preds
}
Here I'm doing a simple one variable linear model 15 times, which works just fine. Problems come when I try to plot the results. How should I call data[i] response?
Let's say there are multiple dataframes with names data1 ... data15 and that there are no other data objects whose names begin with the letters d, a, t, a. Let's also assume that each of those dataframes has columns named 'response' and 'variable'. Then this would gather the dataframes into a list and draw separate plots for the linear regression lines.
dlist <- lapply(ls(pattern = '^data'), get)
lapply(dlist, function(df) {
  # empty plot with the right axis ranges, then add the fitted line
  plot(NA, xlim = range(df$variable), ylim = range(df$response))
  abline(coef(lm(response ~ variable, data = df)))
})
If you wanted to name the dataframes in that list, you could reuse the names returned by ls(), which come in the same (alphabetical) order as the list elements:
names(dlist) <- ls(pattern = '^data')
There are many other assignments you could make in the context of this loop, but you would need to describe the desired results better than with failed efforts.
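As one possible shape for such a loop, the data frames can be collected by name with mget(), which returns a named list, and the models and correlations computed from that list. The three simulated data frames below are assumptions that stand in for data1 ... data15.

```r
# Hypothetical stand-ins for data1 ... data3
set.seed(1)
for (i in 1:3) {
  x <- rnorm(30)
  assign(paste0("data", i),
         data.frame(variable = x, response = 2 * x + rnorm(30)))
}

# Collect the data frames by name; the list elements keep those names
data_names <- paste0("data", 1:3)
datalist <- mget(data_names)

# One linear model per data frame
models <- lapply(datalist, function(df) lm(response ~ variable, data = df))

# Correlation between fitted and observed values, per data frame
fit_cor <- sapply(names(models), function(nm)
  cor(predict(models[[nm]]), datalist[[nm]]$response))
fit_cor
```

Building the names with paste0() and passing them to mget() keeps the list in numeric order (data1, data2, ...), which ls() alone would not guarantee.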
Here's modified code that should work. It does one variable lm-model and calculates correlation of predicted and observed values and stores it into an empty matrix. It also plots these values.
Thanks Thomas for help.
par(mfrow=c(4,5))
results.matrix <- matrix(NA, nrow = length(datalist), ncol = 2)
colnames(results.matrix) <- c("Subset","Correlation")
for (i in seq_along(datalist)) {
model <- lm(response ~ variable, data = datalist[[i]])
pred <- predict.lm(model)
# named cor.res so that it does not mask the cor() function
cor.res <- cor.test(pred, datalist[[i]]$response)
plot(pred, datalist[[i]]$response, xlab="pred", ylab="obs")
results.matrix[i, 1] <- i
results.matrix[i, 2] <- cor.res$estimate
}