Multiple imputation and multigroup SEM in R

I want to perform multigroup SEM on imputed data using the R packages mice and semTools, specifically the runMI function, which calls lavaan.
I am able to do so when imputing the entire dataset at once, but whilst trawling through Stack Overflow/Stack Exchange I came across the recommendation to impute data separately for each level of the grouping variable (e.g. men, women), so that the features of each group are preserved (e.g. https://stats.stackexchange.com/questions/149053/questions-on-multiple-imputation-with-mice-for-a-multigroup-sem-analysis-inclu). However, I have not been able to find any references to support this approach.
My question is both conceptual and practical:
1) Is splitting the dataset by group prior to imputing the correct approach? Could anyone point me towards references advising this?
2) If so, how can I combine the datasets that mice imputed by group, whilst still retaining the multiple imputed datasets as a list of data frames or as an object of the mids class? I have attempted to do so, but end up with an integer:
set.seed(12345)
HSMiss <- HolzingerSwineford1939[ , paste("x", 1:9, sep = "")]
HSMiss$x5 <- ifelse(HSMiss$x1 <= quantile(HSMiss$x1, .3), NA, HSMiss$x5)
HSMiss$x9 <- ifelse(is.na(HSMiss$x5), NA, HSMiss$x9)
HSMiss$school <- HolzingerSwineford1939$school
HS.model <- '
visual =~ x1 + a*x2 + b*x3
textual =~ x4 + x5 + x6
x7 ~ textual + visual + x9
'
group1 <- subset(HSMiss, school =='Pasteur')
group2 <- subset(HSMiss, school =='Grant-White')
imputed.group1 <- mice(group1, m = 3, seed = 12345)
imputed.group2 <- mice(group2, m = 3, seed = 12345)
#attempted merging:
imputed.both <- nrow(complete(rbind(imputed.group1, imputed.group2)))
I would be incredibly grateful if anyone can offer me some help. As you can tell, I am very much still learning about R and imputation, so apologies if this is a stupid question - however, I couldn't find anything regarding this specific query elsewhere.

You are getting just an integer when merging because you are calling nrow(). Remove that call and you'll get a merged data frame (by default, complete() returns the first imputed dataset).
imputed.both <- complete(rbind(imputed.group1, imputed.group2))
In case you find yourself with datasets that have multiple groups, you can do something like the following to simplify this task.
imputed.groups <- lapply(split(HSMiss, HSMiss$school), function(x) {
complete(mice(x, m = 3, seed = 12345))
})
imputed.both <- do.call(args = imputed.groups, what = rbind)
As for how appropriate this approach is for imputation, that's probably a question better suited for Cross Validated.
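Since the question asks for all of the imputed datasets and not just the first, here is a rough sketch of one way to get a list of m combined data frames. It relies on complete(imp, i) returning the i-th completed dataset. The toy data frame, the column names and the imputed_list name are made up for illustration and stand in for HSMiss. I believe runMI can take such a list of data frames as its data argument, but please check the documentation for your semTools version.

```r
library(mice)

# Hypothetical toy data standing in for HSMiss (two groups, one incomplete variable)
set.seed(1)
toy <- data.frame(
  school = rep(c("A", "B"), each = 40),
  x1 = rnorm(80),
  x2 = rnorm(80)
)
toy$x3 <- toy$x1 + rnorm(80)
toy$x3[sample(80, 15)] <- NA

m <- 3

# Impute each group separately; the constant grouping column is left out
# of the imputation model and added back afterwards
imps <- lapply(split(toy, toy$school), function(d) {
  mice(d[, c("x1", "x2", "x3")], m = m, seed = 12345, printFlag = FALSE)
})

# For each imputation i, stack the i-th completed dataset of every group
imputed_list <- lapply(seq_len(m), function(i) {
  do.call(rbind, lapply(names(imps), function(g) {
    cbind(school = g, mice::complete(imps[[g]], i))
  }))
})

length(imputed_list)  # one data frame per imputation
```

Note that rows come out ordered by group rather than in their original order, which should not matter for fitting the model.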

Related

DEA analysis: variables are excluded in analysis?

I’m working on a DEA (Data Envelopment Analysis) to compare the relative efficiency of different banks.
The packages I’m using are rDEA and kableExtra.
What this analysis is doing is measuring the relative effect of the input and output variables that I use to examine the efficiency of each individual bank.
The problem is that my code only includes two out of four output variables and I can’t find anywhere in the code where I ask it to do so.
Can some of you identify the problem?
Thank you in advance!
I have tried formatting the data in several different ways, including converting the created "inp_var" and "out_var" to matrices.
#install.packages('rDEA')
#install.packages('dplyr')
#install.packages('kableExtra')
library(kableExtra)
library(rDEA)
library(dplyr)
dea <- tbl_df(PANELDATA)
head(dea)
inp_var <- select(dea, 'IE', 'NIE')
out_var <- select(dea, 'L', 'D', 'II','NII')
inp_var <- as.matrix(inp_var)
out_var <- as.matrix(out_var)
model <- dea(XREF= inp_var, YREF = out_var, X = inp_var, Y = out_var, model= "output", RTS = "constant")
model
I want a number between 0 and 1 for every observation, where the most efficient one receives a 1. What I get now is the same result whether or not I include the two extra output variables L and II.
L stands for loans to the public and II for interest income, and it would be weird if these variables had NO effect on the efficiency of banks.
I think you could type this:
result <- cbind(round(model$thetaOpt, 3), round(model$lambda, 3))
rownames(result)<-dea[[1]]
colnames(result)<-c("Efficiency", rownames(result))
kable(result[,])

How to create a loop for Regression

I just started using R for statistical purposes and I appreciate any kind of help.
My task is to make calculations on one index and 20 stocks from the index. The data contains 22 columns (DATE, INDEX, S1 .... S20) and about 4000 rows (one row per day).
First I imported the .csv file, called it "dataset", and calculated log returns as follows, repeating this for all stocks (S1 to S20) plus the INDEX:
n <- nrow(dataset)
S1 <- dataset$S1
S1_logret <- log(S1[2:n])-log(S1[1:(n-1)])
Secondly, I stored the data in a data.frame:
logret_data <- data.frame(INDEX_logret, S1_logret, S2_logret, S3_logret, S4_logret, S5_logret, S6_logret, S7_logret, S8_logret, S9_logret, S10_logret, S11_logret, S12_logret, S13_logret, S14_logret, S15_logret, S16_logret, S17_logret, S18_logret, S19_logret, S20_logret)
Then I ran the regression (S1 to S20) using the log returns:
S1_Reg1 <- lm(S1_logret~INDEX_logret)
I couldn't figure out how to write the code in a more efficient way and use some function for the repetition.
In a further step I have to run a cross-sectional regression for each day in a selected interval. It is impossible to do this manually, and R should provide some quick solution. I am quite unsure about how to do this part, but I would also like to use some kind of loop for the previous calculations.
I still lack the necessary R coding knowledge. Any kind of help to the point, or advice on literature or tutorials, is highly appreciated! Thank you!
You could provide all the separate dependent variables in a matrix to run your regressions. Something like this:
#example data
Y1 <- rnorm(100)
Y2 <- rnorm(100)
X <- rnorm(100)
df <- data.frame(Y1, Y2, X)
#run all models at once
lm(as.matrix(df[c('Y1', 'Y2')]) ~ df$X)
Out:
Call:
lm(formula = as.matrix(df[c("Y1", "Y2")]) ~ df$X)

Coefficients:
             Y1        Y2
(Intercept)  -0.15490  -0.08384
df$X         -0.15026  -0.02471
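To connect this to the data layout in the question, here is a minimal sketch that computes log returns for every price column in one step and then regresses all stocks on the index with a single lm call. The simulated dataset (only three stocks, 50 days) is a made-up stand-in for the real file, and the names prices, logret and fits are just illustrative.

```r
# Hypothetical stand-in for the asker's "dataset": DATE, INDEX and a few stock price columns
set.seed(1)
n <- 50
dataset <- data.frame(
  DATE  = seq(as.Date("2020-01-01"), by = "day", length.out = n),
  INDEX = 100 * exp(cumsum(rnorm(n, 0, 0.01))),
  S1    = 50 * exp(cumsum(rnorm(n, 0, 0.02))),
  S2    = 20 * exp(cumsum(rnorm(n, 0, 0.02))),
  S3    = 75 * exp(cumsum(rnorm(n, 0, 0.02)))
)

# Log returns for every price column at once: diff(log(p)) equals log(p[2:n]) - log(p[1:(n-1)])
prices <- dataset[, setdiff(names(dataset), "DATE")]
logret <- as.data.frame(lapply(prices, function(p) diff(log(p))))

# One multi-response regression: each stock's log return on the index log return
stock_cols <- setdiff(names(logret), "INDEX")
fits <- lm(as.matrix(logret[stock_cols]) ~ INDEX, data = logret)

# One column of coefficients (intercept and slope) per stock
coef(fits)
```

With the real data the same code should work unchanged, as long as every column other than DATE holds prices.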

Missing values in lmFit [limma R package]

[This question is specific to bioinformatics. There are posts elsewhere but I couldn't find a satisfactory answer.]
If I have a gene/protein expression data with missing values (NA), how does lmFit of the limma package handle these values? Note that the missing values are not in the design matrix, but rather, in the data matrix only.
Here is a simulated, working example that illustrates my question:
library(limma)
my_genes <- matrix(rnorm(9000, -10, 10), ncol=4)
my_genes <- as.data.frame(my_genes)
rownames(my_genes) <- paste("Gene", 1:nrow(my_genes))
## Randomly introduce NAs (about 5% of values); assign the result back so the NAs are kept
my_genes[] <- lapply(my_genes, function(x) {x[sample(c(TRUE, NA), prob = c(0.95, 0.05), size = length(x), replace = TRUE)]})
tx <- 1:2 #suppose treatment is columns 1 & 2
ctrls <- 3:4 #suppose controls is columns 3 & 4
my_design <- model.matrix( ~factor(c(1,1,0,0)))
my_design
fit <- lmFit(my_genes, my_design)
fit <- eBayes(fit)
## log fold changes are stored in fit$coefficients; column 2 is the treatment effect
plot(fit$coefficients[, 2], -log10(fit$p.value[, 2]))
If you find any websites / posts that can help, feel free to share with a post or comment.
This post on Cross Validated answers my own question in detail. In short, lmFit deals with missing values in much the same way as lm does: for each gene, the observations with missing values are dropped (na.exclude, or "case-wise deletion") and the model is fitted to the remaining samples.
Alternatively: though it's not an ideal solution, it might be appropriate to just impute the missing gene-expression values, for example using the impute.knn function in the impute Bioconductor package.

Dynamic linear regression loop for different order summation

I've been trying hard to recreate this model in R:
[Model equation image omitted] (Farhani 2012)
I've tried many things, such as building the formula with cumsum and paste; however, that did not work because I could not turn the strings into the right variables, as R kept treating L as a function.
I tried to do it manually, since I'm only looking at p, q = 1, 2, 3, 4, 5. However, after starting I realized how inefficient this is.
This is essentially what I am trying to do
model5 <- vector("list",20)
#p=1-5, q=0
model5[[1]] <- dynlm(DLUSGDP~L(DLUSGDP,1))
model5[[2]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2))
model5[[3]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3))
model5[[4]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4))
model5[[5]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4)+L(DLUSGDP,5))
I'm also trying to do this for regressing DLUSGDP on DLWTI (my oil variable's name) for p=0, q=1-5 and also for p=1-5, q=1-5.
cumsum would not work, as it would sum the variables rather than treating them as independent regressors.
My goal is to run these models and then use IC to determine which should be analyzed further.
I hope you understand my problem and any help would be greatly appreciated.
I think this is what you are looking for:
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
where n is some order you want to try. For example,
n <- 3
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
# DLUSGDP ~ L(DLUSGDP, 1) + L(DLUSGDP, 2) + L(DLUSGDP, 3)
Then you can construct your model fitting by
model5 <- vector("list",20)
for (i in 1:20) {
  form <- reformulate(paste0("L(DLUSGDP,", 1:i,")"), "DLUSGDP")
  model5[[i]] <- dynlm(form)
}
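The same idea could be extended to the p and q combinations mentioned in the question. The sketch below only builds the formulas, using base R, so it does not need dynlm to run. The helper name make_formula and the choice to use lags 1 to q of DLWTI (with no contemporaneous term) are my assumptions, not something taken from the paper.

```r
# Hypothetical helper: formula with p lags of the response and q lags of the regressor
make_formula <- function(p, q, y = "DLUSGDP", x = "DLWTI") {
  terms <- c(
    if (p > 0) paste0("L(", y, ", ", seq_len(p), ")"),
    if (q > 0) paste0("L(", x, ", ", seq_len(q), ")")
  )
  reformulate(terms, response = y)
}

# All combinations of p and q from 0 to 5, excluding the empty model p = q = 0
grid <- expand.grid(p = 0:5, q = 0:5)
grid <- grid[grid$p + grid$q > 0, ]
forms <- Map(make_formula, grid$p, grid$q)

make_formula(2, 1)
# DLUSGDP ~ L(DLUSGDP, 1) + L(DLUSGDP, 2) + L(DLWTI, 1)
```

Assuming DLUSGDP and DLWTI are time series in the workspace, something like models <- lapply(forms, dynlm) followed by sapply(models, AIC) should then give one information criterion per row of grid. Keep in mind that models with different maximum lags are fitted on different numbers of observations, which can affect such comparisons.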

Generating multiple datasets and applying function and output multiple dataset

Here is my problem, just hard for me...
I want to generate multiple datasets, then apply a function to these datasets and output corresponding output in single or multiple dataset (whatever possible)...
My example, although I need to generate a large number of variables and datasets
seed <- round(runif(10)*1000000)
datagen <- function(x){
set.seed(x)
var <- rep(1:3, c(rep(3, 3)))
yvar <- rnorm(length(var), 50, 10)
matrix <- matrix(sample(1:10, c(10*length(var)), replace = TRUE), ncol = 10)
mydata <- data.frame(var, yvar, matrix)
}
gdt <- lapply (seed, datagen)
# resulting list (I believe is correct term) has 10 dataframes:
# gdt[1] .......to gdt[10]
# my function, this will perform anova in every component data frames and
#output probability coefficients...
anovp <- function(x){
ind <- 3:ncol(x)
out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]])
pval <- out$coefficients[,4][2]
pval <- do.call(rbind,pval)
}
plist <- lapply (gdt, anovp)
Error in gdt[x] : invalid subscript type 'list'
This is not working, I tried different options. But could not figure out...finally decided to bother experts, sorry for that...
My questions are:
(1) Is this possible to handle such situation in this way or there are other alternatives to handle such multiple datasets created?
(2) If this is right way, how can I do it?
Thank you for attention and I will appreciate your help...
You have the basic idea right, in that you should create a list of data frames and then use lapply to apply the function to each element of the list. Unfortunately, there are several oddities in your code.
There is no point in randomly generating a seed, then setting it. You only need to use set.seed in order to make random numbers reproducible. Cut the lines
seed <- round(runif(10)*1000000)
and maybe
set.seed(x)
rep(1:3, c(rep(3, 3))) is the same as rep(1:3, each = 3).
Don't call your variables var or matrix, since they will mask the names of those functions, which is confusing.
3:ncol(x) is dangerous. If x has fewer than 3 columns, it doesn't do what you think it does (for example, 3:2 counts down and gives c(3, 2)).
... and now, the problem you actually wanted solving.
The problem is in the line out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]]).
lapply passes data frames into anovp, not indices, so x in gdt[x] is a data frame, which throws an error.
One more thing. While you are rewriting that line, note that lm takes a data argument, so you don't need to do things like gdt$some_column; you can just reference some_column directly.
EDIT: Further advice.
You appear to always use the formula yvar ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10. Since it's the same each time, create it before your call to lapply.
independent_vars <- paste(colnames(gdt[[1]])[-1:-2], collapse = " + ")
model_formula <- formula(paste("yvar", independent_vars, sep = " ~ "))
I probably wouldn't bother with the anovp function. Just do
models <- lapply(gdt, function(data) lm(model_formula, data))
Then include a further call to lapply to play with the coefficients if necessary. The next line replicates your anovp code, but won't work because model$coefficients is a vector (so the dimensions aren't right). Adjust to retrieve the bit you actually want.
coeffs <- lapply(models, function(model) do.call(rbind, model$coefficients[,4][2]))
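If the p-values are what you are after, one possible adjustment is to take them from summary(model)$coefficients, which is a matrix with a "Pr(>|t|)" column. The sketch below is self-contained and uses simulated data shaped like yours, but with only three predictors, since ten predictors on nine rows leave no residual degrees of freedom. The names make_data, group and pvals are made up for the example.

```r
# Hypothetical generator: a small data frame shaped like the ones in the question
make_data <- function(seed) {
  set.seed(seed)
  group <- rep(1:3, each = 3)
  yvar <- rnorm(length(group), 50, 10)
  x <- matrix(sample(1:10, 3 * length(group), replace = TRUE), ncol = 3)
  data.frame(group, yvar, x)  # matrix columns are named X1, X2, X3
}

gdt <- lapply(1:4, make_data)

# Build the formula once from the predictor column names
predictors <- setdiff(names(gdt[[1]]), c("group", "yvar"))
model_formula <- reformulate(predictors, response = "yvar")

# Fit one model per data frame
models <- lapply(gdt, function(d) lm(model_formula, data = d))

# The p-values live in the coefficient table of summary(), not in model$coefficients
pvals <- t(sapply(models, function(m) summary(m)$coefficients[, "Pr(>|t|)"]))
pvals  # one row per data set, one column per coefficient
```

If a predictor happens to be collinear in one of the data sets, its row is dropped from the coefficient table and sapply will return a list instead of a matrix, so real code may need a check for that.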

Resources