I have run 17 meta-analyses (each identified by a specific name) and I need to extract the models' outputs into a single table, as well as add a column with the name of each analysis. I have done it manually, but I was wondering if I could build a loop to do so.
I'm attaching the first three of the 17 analyses, the "names" being "cent", "dist", and "sqrs"
#meta-analyses
res_cent <- rma.mv(yi, vi, mods = ~ factor(drug) - 1,
                   random = list(~ 1 | publication_id, ~ 1 | strain_def),
                   data = SR_meta, subset = (SR_meta$measure == "cent"))
res_dist <- rma.mv(yi, vi, mods = ~ factor(drug) - 1,
                   random = list(~ 1 | publication_id, ~ 1 | strain_def),
                   data = SR_meta, subset = (SR_meta$measure == "dist"))
res_sqrs <- rma.mv(yi, vi, mods = ~ factor(drug) - 1,
                   random = list(~ 1 | publication_id, ~ 1 | strain_def),
                   data = SR_meta, subset = (SR_meta$measure == "sqrs"))
#Creating list for model output - cent
list_cent <- coef(summary(res_cent))
# turn the coefficient table into a data frame with the row names as a column
list_cent <- setNames(cbind(rownames(list_cent), list_cent, row.names = NULL),
                      c("Drug", "Estimate", "se", "zval", "p-value", "CI_l", "CI_u"))
df_cent <- list_cent[-c(3, 4)]  # drop se and zval
# clean up the coefficient names, e.g. "factor(drug)X" -> "X"
df_cent$Drug <- gsub("factor*", "", df_cent$Drug)
df_cent$Drug <- gsub("drug*", "", df_cent$Drug)
df_cent$Drug <- gsub("[[:punct:]]", "", df_cent$Drug)
# number of effect sizes per drug for this measure
n_cent <- plyr::count(cent_sum2, vars = "drug")
names(n_cent)[names(n_cent) == "freq"] <- "n_cent"
df_cent <- cbind(df_cent, n_cent[2])
##same thing can be repeated for the other two measures "dist", and "sqrs".
The output is a data frame that contains the names of the drugs used as factors in the meta-analyses, their estimated effect sizes, p-values, confidence intervals, and how many measures we have per factor (n). I want to compile all of these outputs into one table (called "matrix_ps" at the end of the code) and add a column with the name of each measure.
I have done all the steps manually (below), but it looks extremely inefficient.
Is there a way to create a loop to do this, in which all the names of the measures are changed and the outcome is appended?
Something like
measures<-c("cent","dist","sqrs")
for(i in measures) - not sure how to continue?
matrix_cent<-data.frame(df_cent$Drug,list_cent$`p-value`,df_cent$n_cent,df_cent$Estimate,df_cent$CI_l,df_cent$CI_u)
matrix_dist<-data.frame(df_dist$Drug,list_dist$`p-value`,df_dist$n_dist,df_dist$Estimate,df_dist$CI_l,df_dist$CI_u)
matrix_sqrs<-data.frame(df_sqrs$Drug,list_sqrs$`p-value`,df_sqrs$n_sqrs,df_sqrs$Estimate,df_sqrs$CI_l,df_sqrs$CI_u)
matrix_cent$measure<-"cent"
matrix_dist$measure<-"dist"
matrix_sqrs$measure<-"sqrs"
matrix_cent<-matrix_cent%>% rename(drug=df_cent.Drug,measure=measure,p=list_cent..p.value.,n=df_cent.n_cent,estimate=df_cent.Estimate,ci_low=df_cent.CI_l,ci_up=df_cent.CI_u)
matrix_dist<-matrix_dist%>% rename(drug=df_dist.Drug,measure=measure,p=list_dist..p.value.,n=df_dist.n_dist,estimate=df_dist.Estimate,ci_low=df_dist.CI_l,ci_up=df_dist.CI_u)
matrix_sqrs<-matrix_sqrs%>% rename(drug=df_sqrs.Drug,measure=measure,p=list_sqrs..p.value.,n=df_sqrs.n_sqrs,estimate=df_sqrs.Estimate,ci_low=df_sqrs.CI_l,ci_up=df_sqrs.CI_u)
matrix_ps<-rbind(matrix_cent,matrix_dist,matrix_rear,matrix_sqrs,matrix_toa,matrix_eca,matrix_eoa,matrix_trans,matrix_dark,matrix_light,matrix_stps,matrix_rrs,matrix_time,matrix_toc,matrix_cross,matrix_hd,matrix_lat)
We don't have your data, but you can put all your code in a function:
get_result <- function(x, y) {
  # x: a fitted rma.mv model; y: its name (supplied by imap() below)
  list_cent <- coef(summary(x))
  list_cent <- setNames(cbind(rownames(list_cent), list_cent, row.names = NULL),
                        c("Drug", "Estimate", "se", "zval", "p-value", "CI_l", "CI_u"))
  df_cent <- list_cent[-c(3, 4)]
  df_cent$Drug <- gsub("factor*", "", df_cent$Drug)
  df_cent$Drug <- gsub("drug*", "", df_cent$Drug)
  df_cent$Drug <- gsub("[[:punct:]]", "", df_cent$Drug)
  n_cent <- plyr::count(cent_sum2, vars = "drug")
  names(n_cent)[names(n_cent) == "freq"] <- y  # count column named after the model
  df_cent <- cbind(df_cent, n_cent[2])
  return(df_cent)
}
Now, assuming all your analyses follow the naming pattern 'res_', you can do:
library(purrr)
library(dplyr)  # for %>% and inner_join
list_models <- mget(ls(pattern = 'res_'))
result <- imap(list_models, get_result) %>% reduce(inner_join)
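Alternatively, if the end goal is the stacked matrix_ps with a measure column, here is a minimal sketch that loops over the measure names directly. It assumes SR_meta has the measure and drug columns shown in the question and that the models follow the res_<measure> naming; extend measures to all 17 names and treat it as a starting point rather than tested code:
library(purrr)
library(dplyr)
measures <- c("cent", "dist", "sqrs")  # extend to all 17 measure names
matrix_ps <- map_dfr(measures, function(m) {
  mod <- get(paste0("res_", m))  # look up res_cent, res_dist, ...
  tab <- coef(summary(mod))      # columns: estimate, se, zval, pval, ci.lb, ci.ub
  out <- data.frame(
    drug     = gsub("factor\\(drug\\)", "", rownames(tab)),
    measure  = m,
    p        = tab$pval,
    estimate = tab$estimate,
    ci_low   = tab$ci.lb,
    ci_up    = tab$ci.ub
  )
  # per-drug effect-size counts for this measure, taken from the matching subset
  counts <- SR_meta %>% filter(measure == m) %>% count(drug, name = "n")
  left_join(out, counts, by = "drug")
})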
Related
How to use the names in a string vector in models? For example, why can't I do this:
loop_variables = c("Age", "BMI", "Height")
for (i in 1:length(loop_variables)) {
  basic_logistic_model = glm(outcome ~ loop_variables[i], data = DB, family = "binomial")
  summary(basic_logistic_model)
}
I see a lot of R users building vectors with the names of study variables and then looping over them.
What am I doing wrong?
It is the formula that needs to be updated. We may use paste or reformulate. In addition, it is better to have an object to store the output of summary; a list would suit well.
summary_lst <- vector('list', length(loop_variables))
names(summary_lst) <- loop_variables
for (i in 1:length(loop_variables)) {
  # convert the column to a factor column
  DB[[loop_variables[i]]] <- factor(DB[[loop_variables[i]]])
  # create the formula
  fmla <- reformulate(loop_variables[i], response = 'outcome')
  basic_logistic_model <- glm(fmla, data = DB, family = "binomial")
  # assign the summary output to the list element
  summary_lst[[i]] <- summary(basic_logistic_model)
}
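For reference, here is a quick sketch of the paste() route mentioned above, reusing the same summary_lst, loop_variables and DB objects:
# build the formula as a string, then convert it with as.formula()
for (i in seq_along(loop_variables)) {
  fmla <- as.formula(paste("outcome ~", loop_variables[i]))
  summary_lst[[i]] <- summary(glm(fmla, data = DB, family = "binomial"))
}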
In my view, the data transformation part (e.g. converting variables into factors) and the modelling part (e.g. using glm()) ought to be separated rather than mixed in the loop, for code readability and efficiency.
Here, I will show how to run the looping iterations using purrr::map(), after the data to be analysed has been transformed beforehand using dplyr::mutate().
Package loading
library(purrr) # for `map`, `set_names`
library(dplyr) # for `mutate`
Data transformation
Add new variables converted into factors using dummy coding
fct_ToothGrowth <- ToothGrowth |>
  mutate(
    fct_dose = dose |>
      as.factor(),
    fct_len = len |>
      ## The numeric variable `len` is converted
      ## into a three-level factor
      cut(3) |>
      as.factor()
  )
contrasts(fct_ToothGrowth$fct_dose)
contrasts(fct_ToothGrowth$fct_len)
Add new variables converted into factors using non-dummy coding
Sum contrast and forward difference coding are used here as examples.
fct_ToothGrowth <- ToothGrowth |>
mutate(
fct_dose = `contrasts<-`(
factor(
dose,
levels = c("0.5", "1", "2")
), ,
## sum contrast coding (as known as deviation coding)
contr.sum(3)
),
fct_len = `contrasts<-`(
factor(
cut(len, 3)
), ,
## Forward difference coding
MASS::contr.sdif(3)
)
)
contrasts(fct_ToothGrowth$fct_dose)
contrasts(fct_ToothGrowth$fct_len)
Looping glm()
explanatory_variables <- c("fct_len", "fct_dose", "len", "dose")
summaries <- map(
.x = explanatory_variables,
## "fct_len", "fct_dose", "len", and "dose" are replaced
## by the arguments specified in `.x`.
~ paste0("supp ~ ", .x) |>
## `supp ~ fct_len`, ..., `supp ~ dose` are inputted
## into the first argument of `glm()`, namely `formula` argument
glm(family = binomial, data = fct_ToothGrowth)
) |>
## set names to the returned sublists
set_names(nm = explanatory_variables)
summaries$fct_len
summaries$fct_dose
summaries$len
summaries$dose
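If you then want the four fits collapsed into one coefficient table instead of a list (note that the elements of summaries are the fitted glm objects, not summary() outputs), here is a small sketch using broom::tidy(), assuming broom and dplyr are available:
library(dplyr)
library(broom)
# one row per coefficient, with a column recording which predictor was used
coef_table <- imap_dfr(summaries, ~ tidy(.x) |> mutate(predictor = .y))
coef_table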
I have a df with 2,946 obs and 600 variables.
I want to produce a table of univariate regression models for 599 variables from the dataset. To do this, I am using the tbl_uvregression() function from the 'gtsummary' package.
Here's my code:
RAPOA_labelled[,-1] %>% #remove ID column
tbl_uvregression(
method = glm,
y = GIR.2cat, #dependent variable
method.args = list(family = binomial),
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3)
) %>%
add_nevent() %>% # add number of events of the outcome
bold_p() %>% # bold p-values under a given threshold (default 0.05)
bold_labels()
Every time it is run, I get the following error:
Error: C stack usage 7971168 is too close to the limit.
My Cstack_info() is:
> Cstack_info()
size current direction eval_depth
7969177 12800 1 2
EDIT
As the final output, I needed a table with the estimate, std. error, p-value, odds ratio and confidence interval for each variable in the data frame. tbl_regression did not work well for me, so in the end I did it with a loop.
I'll leave the code here in case it helps anyone.
name <- colnames(datos_rapoa_gir[, -c(1:2)]) # remove ID and outcome columns
term <- c()
B <- c()
SE <- c()
pvalue <- c()
OR <- c()
lowIC <- c()
highIC <- c()
for (i in seq_along(name)) {
  mod_formula <- as.formula(sprintf("GIR.2cat ~ %s", name[i]))
  mod <- glm(formula = mod_formula, family = "binomial", data = datos_rapoa_gir, na.action = na.omit)
  tidy_mod <- broom::tidy(mod) # tidy once per model instead of four times
  ci <- exp(confint(mod))      # profile-likelihood CIs on the odds-ratio scale
  term <- c(term, tidy_mod$term)
  B <- c(B, tidy_mod$estimate)
  SE <- c(SE, tidy_mod$std.error)
  pvalue <- c(pvalue, tidy_mod$p.value)
  OR <- c(OR, exp(mod$coefficients))
  lowIC <- c(lowIC, ci[, 1])
  highIC <- c(highIC, ci[, 2])
}
univars <- data.frame(variable = term, B = B, SE = SE, pvalue = pvalue, OR = OR, LowIC = lowIC, HighIC = highIC) %>%
remove_rownames()
A tbl_uvregression() object can become large (containing the full data, the model object, etc.), and it looks like your machine's memory can't handle it.
What do you want the output to look like? It sounds like you're going to end up with a table with 600 rows (likely more if you have categorical covariates).
Here are some steps you can take to reduce the size:
tbl_uvregression() is a helper that iterates over the columns in a data frame, builds a model for each column, calls tbl_regression() on each model, and combines all the tables with tbl_stack(). Rather than using the helper, follow these steps yourself to reduce the size. After you call tbl_regression(), you can further reduce the size of each tbl_regression() object with tbl_butcher().
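A rough sketch of those steps, assuming the outcome column is GIR.2cat (from your question) and that the first column is an ID named "ID" (a placeholder; use your real column name). This is only the shape of the loop, not a drop-in replacement:
library(gtsummary)
library(purrr)
predictors <- setdiff(names(RAPOA_labelled), c("ID", "GIR.2cat"))
tbls <- map(predictors, function(var) {
  mod <- glm(reformulate(var, response = "GIR.2cat"),
             data = RAPOA_labelled, family = binomial)
  tbl_regression(mod, exponentiate = TRUE) |>
    tbl_butcher()  # drop the heavy internals once the table is built
})
tbl_stack(tbls)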
Happy Programming!
I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".
So making the data
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
age = runif(500, 12, 18),
var1 = sample(1:2, 500, replace = T),
var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Impute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But it throws up an error when I try to reshape the data into mlogit format:
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2,
data = data.idx,
reflevel = "1",
nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
It loops over all imputed datasets.
It uses complete to get one dataset.
It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), the formula is evaluated within a given environment. If the variables are not in the current environment (but rather in, e.g., a data frame), you can point to them by setting the data argument.
The problems
This is where both your problems derive from:
In your multinom call, you set the data argument to df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
# work with a list, where each top-element is a different imputation run (imp_n)
map(~as.list(.x)) %>%
transpose %>%
# for each run, impute and return the full (imputed) data set
map(function(imp_n.x) {
df_out <- df4$data
df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
df_out$age[is.na(df_out$age)] <- imp_n.x$age
df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
return(df_out)
}) %>%
# No errors with dfidx() now
map(function(imp_n.x) {
dfidx(data = imp_n.x, choice = "vax", shape = "wide")
})
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As @slamballais mentioned in their answer, the issue is with the dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? That is probably why you are seeing the same coefficients: you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), in that you apply a function to the elements of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
map(function(imp_n.x) {
# form as specified in the comments
mlogit(vax ~ 1 | age + var1 + var2,
data = imp_n.x,
reflevel = "1",
nests = list(type1 = c('1', '2'),
type2 = c('3','4'),
type3 = c('5','6')))
})
I'm trying to write a loop that performs an ANOVA and TukeyHSD on my data across 3 samples for each "Label". Label in this case is a metabolic pathway; the data that go into it are the genes expressed in that metabolic pathway.
For the test data, I created a small df that reproduces my error. In my actual data, I'm hoping to perform this across 2 factors (not just one), and I have thousands more rows.
library(reshape2)
df <- melt(data.frame(sample1 = c(0, 0, 3, 4, 5, 1),
                      sample2 = c(1, 0, 0, 4, 5, 0),
                      sample3 = c(0, 0, 0, 8, 0, 0),
                      Label = c("TCA cycle", "TCA cycle", "TCA cycle",
                                "Glycolysis", "Glycolysis", "Glycolysis"),
                      Gene = c("k1", "k2", "k3", "k4", "k5", "k6")))
My approach (annotated the best way I can!):
fxn<-unique(df$Label) #create list
for (i in 1:length(fxn)){
if (!exists("data")){ #if the "data" dataframe does not exist, start here!
depth<-aov(df$value[df$Label==fxn[i]]~df$variable[df$Label==fxn[i]]) #perform anova on my "df", gene values as a factor of samples (for each "fxn")
hsd<-TukeyHSD(depth) #calculate tukeyHSD
data<-as.data.frame(hsd$`df$variable[df$Label == fxn[i]]`) #grab dataframe of tukey HSD output
data$Label<-fxn[i] #add in the Label name as a column (so it looks like my original df, but with TukeyHSD output for each pairwise comparison
data<-as.data.frame(data)
}
if (exists("data")){ #if "data" exists, do this:
tmpdepth<-aov(df$value[df$Label==fxn[i]]~df$variable[df$Label==fxn[i]])
tmphsd<-TukeyHSD(tmpdepth)
tmpdata<-as.data.frame(tmphsd$`df$variable[df$Label == fxn[i]]`)
tmpdata$Label<-fxn[i]
tmpdata<-as.data.frame(tmpdata)
data<-rbind(data,tmpdata) #combine with original data
data<-as.data.frame
rm(tmpdata)
}
}
I'd like my output to look like this:
diff lwr upr p adj Label
sample2-sample1 -0.3333333 -8.600189 7.933522 0.9916089 Glycolysis
sample3-sample1 -0.6666667 -8.933522 7.600189 0.9669963 Glycolysis
sample3-sample2 -0.3333333 -8.600189 7.933522 0.9916089 Glycolysis
but with the Label column containing all the factors that went into "fxn".
Errors:
Error in rep(xi, length.out = nvar) :
attempt to replicate an object of type 'closure'
You forgot the second data (the argument to as.data.frame()) in the last line before rm(tmpdata). It should be:
data<-as.data.frame(data)
In my implementation I changed your code as follows:
datav <- data.frame(diff = double(),
                    lwr = double(),
                    upr = double(),
                    'p adj' = double(),
                    Label = character(),
                    check.names = FALSE) # keep "p adj" as-is so rbind() matches TukeyHSD's column names
for (fxn in unique(df$Label)) {
  depth <- aov(df$value[df$Label == fxn] ~ df$variable[df$Label == fxn])
  hsd <- TukeyHSD(depth)
  tmp <- as.data.frame(hsd$`df$variable[df$Label == fxn]`)
  tmp$Label <- fxn
  datav <- rbind(datav, tmp)
}
By initializing the data.frame beforehand you do not need the if statement. Also, data is a function in R, so I renamed the variable data to datav.
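As a quick illustration of why data is a risky name (and why the !exists("data") branch in the original loop can misbehave):
exists("data")                  # TRUE even before any assignment: it finds the function utils::data
exists("data", mode = "list")   # FALSE until a list/data.frame called data actually exists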
I have 2 different data frames for which I would like to perform a linear regression.
I have written the following code for it:
mydir<- "/media/dev/Daten/Task1/subject1/t1"
#multiple subject paths should be given here
# read full paths
myfiles<- list.files(mydir,pattern = "regional_vol*",full.names=T)
# initialise the dataframe from first file
df<- read.table( myfiles[1], header = F,row.names = NULL, skip = 3, nrows = 1,sep = "\t")
# [-c(1:3),]
df
#read all the other files and update dataframe
#we read 4 lines to read the header correctly, then remove 3
ans<- lapply(myfiles[-1], function(x){ read.table( x, header = F, skip = 3, nrows = 1,sep = "\t") })
ans
#update dataframe
#[-c(1:3),]
lapply(ans, function(x){df<<-rbind(df,x)} )
#this should be the required dataframe
uncorrect<- array(df)
# Linear regression of ICV extracted from global size FSL
# Location where your icv is located
ICVdir <- "/media/dev/Daten/Task1/T1_Images"
#loding csv file from ICV
mycsv <- list.files(ICVdir,pattern = "*.csv",full.names = T )
af<- read.csv(file = mycsv,header = TRUE)
ICV<- as.data.frame(af[,2],drop=FALSE)
#af[1,]
#we take into consideration second column of csv
#finalcsv <-lapply(mycsv[-1],fudnction(x){read.csv(file="global_size_FSL")})
subj1<- as.data.frame(rep(0.824,each=304))
plot(df ~ subj1, data = df,
xlab = "ICV value of each subject",
ylab = "Original uncorrected volume",
main="intercept calculation"
)
fit <- lm(subj1 ~ df )
The data frame df has 304 values in the following format:
6433 6433
1430 1430
1941 1941
3059 3059
3932 3932
6851 6851
and another data frame subj1 has 304 values in the following format:
0.824
0.824
0.824
0.824
0.824
When I run my code I get the following error:
Error in model.frame.default(formula = subj1 ~ df, drop.unused.levels = TRUE) :
invalid type (list) for variable 'subj1'
Any suggestions as to why the data.frame values from the variable subj1 are invalid?
As mentioned, you are trying to give a data.frame as an independent variable. Try:
fit <- lm(subj1 ~ ., data=df )
This will use all variables in the data frame, as long as subj1 is the dependent variable's name, and not a data frame by itself.
If df has two columns which are the predictors and subj1 is the predicted (dependent) variable, combine the two, give them proper column names, and create the model in the format above.
Something like:
data <- cbind(df, subj1)
names(data) <- c("var1", "var2", "subj1")
fit <- lm(subj1 ~ var1 + var2, data = data)
Edit: some pointers:
make sure you use a single data frame that holds all of your independent variables, and your dependent variable.
The number of rows should be equal.
If an independent variable is a constant, it has no variance across different values of the dependent variable, and so will have no meaning. If the dependent variable is a constant, there is no point in regressing: we can predict its value with 100% accuracy.
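To illustrate the last point with a constant like the repeated 0.824 in the question (a toy example):
# the constant column is collinear with the intercept, so lm() cannot
# estimate a slope for it and returns NA
d <- data.frame(y = rnorm(10), x_const = rep(0.824, 10))
coef(lm(y ~ x_const, data = d))  # the x_const coefficient comes back NA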