What is the limit of missing values for multiple imputation in the mice package? - r

I have two questions about the mice package.
The first is about the mincor argument of quickpred. The CRAN documentation says it is the minimum threshold against which the absolute correlation is compared. Does this mean that if I set mincor to zero, even very weak correlations will be accepted? If I understand correctly, for a good result I should use values close to 1. Sorry if I'm being too much of a layman or ignorant on the subject, but I had to learn about multiple imputation from scratch.
My other question is about the amount of missing data. I think my data has a lot of missing values, but I'm not sure whether I can still impute it.
An example of how I called the function for the multiple imputation:
m.out <- mice(result.wide, m = 10,
              pred = quickpred(result.wide, mincor = 0,
                               include = c("category", "region"),
                               exclude = c("NAME_AP")))
These are the amounts of missing values.

Related

Why are polychoric correlation coefficients in matrices calculated by different R packages slightly different for the same data?

I calculated polychoric correlation matrices for the same data frame (20 ordinal variables, 190 missing values) in R, using three different packages, and the coefficients for the same variables are slightly different from each other.
I used the lavCor function from "lavaan" (I did list the ordinal variables when calling the function), the polychoric function from "psych" (1.9.1) (took the rhos), and the cor_auto function from "qgraph" (which is supposed to automatically calculate polychoric correlations for ordinal data). I am confused because I thought they were supposed to give exactly the same results. I read the package documentation but could not find anything that helped me understand why. Could anyone let me know why this happens? I am sure I am missing some tiny difference between those, but I cannot figure it out.
PS: I guess this could have happened because the psych package adjusts missing values (I have 190) using the correction for continuity, but I still do not understand why qgraph yields different results than lavaan, as qgraph says it uses lavaan's lavCor function to calculate polychoric correlations.
Thanks!!
depanx <- data[1:20]
cor.depanx <- cor_auto(depanx)
polychor <- polychoric(depanx)
polymat <- polychor$rho
lav <- lavCor(depanx, ordered = c("unh","enj","trd","rst","noG","cry","cnc","htd","bdp","lnl","lov",
                                  "cmp","wrg","pst","sch","dss","hlt","bad","ftr","oth"))
# as a result, matrices "cor.depanx", "polymat", and "lav" are different from each other.
Nice question! I do not know what the "data" dataset in your example is, but I recreate the two scenarios that most probably caused the discrepancy between the cor_auto and lavCor results. In summary, first you must set the "ordinalLevelMax" argument in cor_auto based on your data, and second you need to synchronize the "missing" argument in the two functions. Detailed explanation in the code snippet below:
depanx <- data.frame(lapply(1:5, function(x) sample(1:6, 100, replace = TRUE)),
                     stringsAsFactors = FALSE)
colnames(depanx) <- LETTERS[1:5]
lav <- lavaan::lavCor(depanx, ordered = colnames(depanx))
cor.depanx <- cor_auto(depanx)
all(lav == cor.depanx)  # TRUE
# The first cor_auto argument you need to pay attention to is "ordinalLevelMax".
# It is set to 7 by default, so any variable with more than 7 levels is passed to
# lavCor as plain numeric rather than ordinal.
# Now we create the same dataset with 8-level variables. lavCor detects all of them
# as ordinal, since we labelled them as such via its "ordered" argument, so it uses
# polychoric correlations. Because "ordinalLevelMax" in cor_auto is 7 by default and
# you have not changed it, cor_auto detects none of them as ordinal and does not pass
# them to lavCor as ordinal variables, so lavCor computes Pearson correlations
# between them all.
depanx2 <- data.frame(lapply(1:5, function(x) sample(1:8, 100, replace = TRUE)),
                      stringsAsFactors = FALSE)
colnames(depanx2) <- LETTERS[1:5]
lav2 <- lavaan::lavCor(depanx2, ordered = colnames(depanx2))
cor.depanx2 <- cor_auto(depanx2)
all(lav2 == cor.depanx2)  # FALSE
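# A possible fix for this scenario (just a sketch): raise ordinalLevelMax so that
# cor_auto also treats the 8-level variables as ordinal before handing them to lavCor.
cor.depanx2b <- cor_auto(depanx2, ordinalLevelMax = 8)
all(lav2 == cor.depanx2b)  # should now be TRUE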
# The next argument you must synchronise between lavCor and cor_auto is "missing",
# which defaults to "pairwise" in cor_auto and to "listwise" in lavCor.
# Here we set rows 10:20 of the fifth variable to NA, without synchronizing that
# argument:
depanx3 <- data.frame(lapply(1:5, function(x) sample(1:6, 100, replace = TRUE)),
                      stringsAsFactors = FALSE)
colnames(depanx3) <- LETTERS[1:5]
depanx3[10:20, 5] <- NA
lav3 <- lavaan::lavCor(depanx3, ordered = colnames(depanx3))
cor.depanx3 <- cor_auto(depanx3)
all(lav3 == cor.depanx3)  # FALSE
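# A possible fix here (again only a sketch): synchronize the "missing" argument, for
# example by making cor_auto use listwise deletion like lavCor's default.
cor.depanx3b <- cor_auto(depanx3, missing = "listwise")
all(lav3 == cor.depanx3b)  # should now be TRUE (or at least much closer)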

Can I use quickpred in Mice to impute a subset of variables from a larger set of variables in a nested longitudinal (and long) dataframe?

I've tried to create a test data.frame to demonstrate my question, but my R skills aren't quite strong enough to even do that. I am not in a position to share my true database. I hope my question can stand on its own.
I am working with a nested longitudinal dataset that is saved as a long file (1000 subjects nested in 8 sites, 4 potential time points/subject, 68 potential predictor variables). I want to impute missing values on 4 static predictors (e.g., maternal education, family income) prior to conducting lme on the longitudinal outcomes in order to have a consistent number of cases for all models.
I am working with the package mice in r. From all that I have read, it is recommended that I use all the variables in my models and any other variables that may predict the missing values in my imputation. Given the number of variables in my models, I need something like quickpred to simplify this. But I'm getting an error that I do not understand.
I tried the following initial code for my database N2NPL, indicating c(14, 16, 18, 19) as the variables that I want to impute.
iniN2NPL <- mice(N2NPL[, c(14, 16, 18, 19)],
                 pred = quickpred(N2NPL, minpuc = 0.25,
                                  exclude = c('ID', 'TypeConvNon', 'TypeCtPr',
                                              'TypeName', 'CHR_converter')),
                 maxit = 0)
"Error in check.predictorMatrix(setup) :
The predictorMatrix has 73 rows and 73 columns. Both should be 4'
I know that mice::quickpred needs to produce a square matrix, but is there any way of not imputing all of the variables? Is it sufficient to include site as a predictor given the nesting of subjects within sites?
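For what it's worth, here is my rough understanding of how the pieces are supposed to line up, sketched on made-up data (the column names are invented, and this may be exactly where I go wrong): the data passed to mice() and the predictor matrix apparently have to cover the same columns, with imputation switched off via method = "" for the variables I do not want imputed.
library(mice)
dat <- data.frame(ID = 1:100,
                  site = sample(1:8, 100, replace = TRUE),
                  mat_educ = sample(c(1:4, NA), 100, replace = TRUE),
                  income = sample(c(1:10, NA), 100, replace = TRUE),
                  outcome = rnorm(100))
pred <- quickpred(dat, minpuc = 0.25, exclude = "ID")  # square, ncol(dat) x ncol(dat)
meth <- rep("", ncol(dat)); names(meth) <- names(dat)  # "" means: do not impute
meth[c("mat_educ", "income")] <- "pmm"                 # impute only these two
ini <- mice(dat, pred = pred, method = meth, maxit = 0)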
Thank you for any help directing me to the proper code or instructions on this. The examples I see all seem much simpler than mine, and thus little help with the issues I'm having.

In R, what is the difference between ICCbare and ICCbareF in the ICC package?

I am not sure if this is the right place to ask a question like this, but I'm not sure where else to ask it.
I am currently doing some research on data and have been asked to find the intraclass correlation of the observations within patients. In the data, some patients have 2 observations, some only have 1 and I have an ID variable to assign each observation to the corresponding patient.
I have come across the ICC package in R, which calculates the intraclass correlation coefficient, but there are 2 commands available: ICCbare and ICCbareF.
I do not understand the difference between them, as they give completely different ICC values for the same variables. For example, on the same variable, x:
ICCbare(ID,x) gave me a value of -0.01035216
ICCbareF(ID,x) gave me a value of 0.475403
The second one using ICCbareF is almost the same as the estimated correlation I get when using random effects models.
So I am just confused and would like to understand the algorithm behind them so I could explain them in my research. I know one is to be used when the data is balanced and there are no NA values.
In the description it says that it is calculated either by hand or using ANOVA. What exactly are those two approaches?
From the documentation at https://www.rdocumentation.org/packages/ICC/versions/2.3.0/topics/ICCbare:
ICCbare can be used on balanced or unbalanced datasets with NAs. ICCbareF is similar, however ICCbareF should not be used with unbalanced datasets.
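If I understand that correctly, the ANOVA route is the classic one-way random-effects calculation. Sketched on made-up data (so this is only my reading of the documentation, not the package's exact code), it would be something like:
set.seed(1)
d <- data.frame(ID = factor(rep(1:30, each = 2)), x = rnorm(60))
k <- 2                                   # observations per patient (balanced case)
fit <- aov(x ~ ID, data = d)
ms <- summary(fit)[[1]][["Mean Sq"]]
MSB <- ms[1]                             # between-patient mean square
MSW <- ms[2]                             # within-patient mean square
(MSB - MSW) / (MSB + (k - 1) * MSW)      # one-way ICC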

How to find differentially methylated regions (for example with probe lasso in ChAMP) based on a regression of continuous variable ~ beta (with CpGassoc)

I performed 450K Illumina methylation chips on human samples, and want to search for the association between a continuous variable and beta, adjusted for other covariates. For this, I used the CpGassoc package in R. I would also like to search for differentially methylated regions based on the significant CpG sites. However, the probe lasso function in the ChAMP package, and also other packages for 450K DMR analyses, always assume 2 groups between which DMRs need to be found. I do not have 2 groups, but this continuous variable. Is there a way to load my output from CpGassoc into the probe lasso function from ChAMP, or into another bump hunter package? I'm an MD, not a bioinformatician, so comb-p etc. would not be possible for me.
Thank you very much for your help.
Kind regards,
Line
I have not worked with methylation data before, so take what I say with a grain of salt. Also, don't use acronyms without describing them; I'm guessing most people on this site don't know what a DMR (differentially methylated region) is.
You could use the glmnet package to run a lasso on your data. If your continuous variable is age, you could do something like the following, where meth.dt is your methylation data.table with columns giving the amount of methylation at each site and rows corresponding to subjects. I'm not sure whether methylation data is considered Poisson (I know RNA-seq data is), and I can't get too specific, but the following code should work after adjusting it to your number of columns.
# load libraries
library(data.table)
library(glmnet)

# read in data
meth.dt <- fread("/data")

# lasso, with lambda also chosen by cross-validation
AgeLasso <- glmnet(as.matrix(meth.dt[, 1:70999, with = FALSE]), meth.dt$Age, family = "poisson")
cv.AgeLasso <- cv.glmnet(as.matrix(meth.dt[, 1:70999, with = FALSE]), meth.dt$Age, family = "poisson")

# keep the coefficients that are non-zero at lambda.1se
coefTranscripts <- coef(cv.AgeLasso, s = "lambda.1se")[, 1][coef(cv.AgeLasso, s = "lambda.1se")[, 1] != 0]
This will give you the methylation sites that are the best predictors of your continuous variable using a parsimonious model. For additional info about glmnet see http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
You might also want to ask the people over at Cross Validated; they may have better answers: http://stats.stackexchange.com
What is your continuous variable just out of curiosity?
Let me know how you ended up solving it if you don't use this method.

randomForest's importance only contains MeanDecreaseGini

I have two scripts which both generate random forests in R, which as far as I can work out have the same inputs, although my problem suggests this isn't the case. One of them returns an importance table containing
row.names importance.blue importance.red importance.MeanDecreaseAccuracy importance.MeanDecreaseGini
the other importance table just contains
row.names MeanDecreaseGini
What's the difference between these two forests, and more importantly, what's causing the difference given what I thought were identical inputs?
(The scripts are too large to paste here, but both are trying to predict a factor on the basis of a bunch of continuous variables)
The help page of randomForest tells us that importance (when used for classification) is a matrix with nclass + 2 columns. The first nclass columns are the class-specific measures computed as mean decrease in accuracy. The (nclass + 1)st column is the mean decrease in accuracy over all classes. The last column is the mean decrease in Gini index.
If importance=FALSE, the last measure is still returned as a vector.
So, it seems to me, that you called randomForest once with importance=TRUE and once with importance=FALSE.
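You can check this directly on a toy example (iris here, purely to illustrate; your own data will behave the same way):
library(randomForest)
set.seed(42)
rf1 <- randomForest(Species ~ ., data = iris, importance = TRUE)
rf2 <- randomForest(Species ~ ., data = iris, importance = FALSE)
importance(rf1)  # class-specific columns plus MeanDecreaseAccuracy and MeanDecreaseGini
importance(rf2)  # only MeanDecreaseGini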
