Analysis of complex survey design with multiple plausible values in R

I am working with several large databases (e.g. PISA and NAEP) that use a complex survey design with replicate weights and multiple plausible values. I can address the former using the survey package. However, does there exist an R package/function to analyze the latter?
For reference, I have found this article to provide a good overview of the issue: http://www.ierinstitute.org/fileadmin/Documents/IERI_Monograph/IERI_Monograph_Volume_02_Chapter_01.pdf

I'm not sure how the general idea of 'plausible values' differs from using multiple imputation to generate several sets of imputed values (as the Amelia package does). But Thomas Lumley's mitools package can be used to combine the various sets of imputed values, and it may also be able to combine your sets of plausible values to obtain the 'correct' standard errors of the estimates.
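For instance, here is a minimal sketch of that idea, assuming five plausible values named PV1MATH to PV5MATH and an existing survey design object des (both names are illustrative): fit the same model once per plausible value, then pool the fits.
library(survey)
library(mitools)
# fit the same model once per plausible value (variable names illustrative)
pvs <- paste0("PV", 1:5, "MATH")
fits <- lapply(pvs, function(pv) {
  svyglm(as.formula(paste(pv, "~ ST04Q01")), design = des)
})
# pool the five fits with Rubin's combining rules
summary(MIcombine(fits))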

Daniel Caro developed an R package for large-scale assessments. You can find it here: http://cran.r-project.org/web/packages/intsvy/index.html
Here is a code example using the regression command over the plausible values in Mathematics:
# Table I.2.3a, p. 305, International Report 2012
pisa.reg.pv(pvlabel="MATH", x="ST04Q01", by = "IDCNTRYL", data=pisa)
I'm not sure, though, whether this package can be used to analyze NAEP data.
I hope this fulfills your purposes, at least partially.

As of survey version 3.36 there's withPV:
library(survey)
library(mitools)

data(pisamaths, package="mitools")
des <- svydesign(id=~SCHOOLID+STIDSTD, strata=~STRATUM, nest=TRUE,
                 weights=~W_FSCHWT+condwt, data=pisamaths)
options(survey.lonely.psu="remove")

# run the analysis once per plausible value, then pool with Rubin's rules
results <- withPV(list(maths~PV1MATH+PV2MATH+PV3MATH+PV4MATH+PV5MATH),
                  data=des,
                  action=quote(svyglm(maths~ST04Q01*(PCGIRLS+SMRATIO)+MATHEFF+OPENPS,
                                      design=des)))
summary(MIcombine(results))

Related

Specification of a mixed model using glmmLasso package

I have a dataset containing repeated measures and quite a lot of variables per observation, so I need a way to select explanatory variables sensibly. Regularized regression methods sound like a good fit for this problem.
While looking for a solution, I recently found out about the glmmLasso package. However, I have difficulties defining a model. I found a demo file online, but since I'm a beginner with R, I had a hard time understanding it.
(demo: https://rdrr.io/cran/glmmLasso/src/demo/glmmLasso-soccer.r)
Since I cannot share the original data, I would suggest you use the soccer dataset (the same dataset used in the glmmLasso demo file). The variable team is repeated across observations and should be treated as a random effect.
# sample data
library(glmmLasso)
data("soccer")
I would appreciate it if you could explain the parameters lambda and family, and how to tune them.
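For concreteness, here is a minimal sketch of what such a call might look like on the soccer data. The fixed-effects formula is illustrative; lambda is the lasso penalty on the fixed effects (typically tuned over a grid by BIC, as in the demo), and family describes the response distribution (the default gaussian suits a continuous outcome like points).
library(glmmLasso)
data("soccer")
# illustrative fixed-effects formula; team enters as a random intercept
fix <- points ~ transfer.spendings + ave.unfair.score + ball.possession +
  tackles + ave.attend + sold.out
# tune lambda over a grid, keeping the fit with the lowest BIC
lambdas <- seq(500, 0, by = -25)
bics <- sapply(lambdas, function(l) {
  glmmLasso(fix, rnd = list(team = ~1), data = soccer,
            lambda = l, family = gaussian(link = "identity"))$bic
})
best <- glmmLasso(fix, rnd = list(team = ~1), data = soccer,
                  lambda = lambdas[which.min(bics)])
summary(best)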

How to impute missing "build_year" column in Sberbank Russian Housing Market dataset on Kaggle?

I am working on an academic project that involves predicting house prices from the Sberbank Russian Housing Market dataset. However, I am stuck in the data cleaning process on a particular column that indicates the year the property was built. I can't just impute the missing values by replacing them with the mean or median. I am looking for ways to impute this kind of data that are meaningful and not just random numbers. Also, the scope of the project restricts me to linear regression models in R, so I cannot rely on models like XGBoost to take care of imputation automatically.
Your question is very broad. There are actually multiple R packages that can help you here:
missForest
imputeR
mice
VIM
simputation
There are even more; there is a whole official CRAN Task View (MissingData) dedicated to listing packages for imputation in R. Look mostly for single-imputation packages, because these will be a good fit for your task.
I can't tell you which method performs best for your specific task; that depends on your data and on the linear regression model you use afterwards.
So overall you have to test which combination of feature engineering/preprocessing, imputation algorithm, and regression model achieves the best result.
Be careful of leakage in your testing (accidentally sharing information between the test and training datasets). Usually you can combine the train and test data and perform the imputation on the complete dataset, but it is important that the target variable is removed before imputing (because you wouldn't have it for the real test data).
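A rough sketch of that workflow, with hypothetical names (train, test, and the target column price_doc):
# combine the feature columns of train and test, with the target removed
target   <- train$price_doc
features <- rbind(train[setdiff(names(train), "price_doc")], test)
# impute on the combined data (any single-imputation function works here)
library(missForest)
features_imp <- missForest(features)$ximp
# split back into imputed train and test sets
n_train   <- nrow(train)
train_imp <- cbind(features_imp[seq_len(n_train), ], price_doc = target)
test_imp  <- features_imp[-seq_len(n_train), ]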
Most of the mentioned packages are quite easy to use; here is an example for missForest:
library("missForest")
# create an example dataset with 10% missing values
missing_data_iris <- prodNA(iris, noNA = 0.1)
# impute the dataset; the completed data is in the $ximp element
imputed_iris <- missForest(missing_data_iris)$ximp
The other packages are equally easy to use. Usually, for all these single-imputation packages, it is just one function call: you pass in your incomplete dataset and get the data back without NAs.
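Since your project is restricted to linear regression, simputation deserves a special mention: its impute_lm() fills in a column from a fitted linear model. A minimal sketch, assuming housing is your data frame and that full_sq, floor, and max_floor are reasonable predictors of construction year (swap in whatever correlates in your data):
library(simputation)
# impute build_year from a linear model on a few assumed predictor columns
housing_imp <- impute_lm(housing, build_year ~ full_sq + floor + max_floor)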

Multivariate Classification Trees in R

I'm looking for advice on creating classification trees where each split is based on multiple variables. A bit of background: I'm helping design a vegetation classification system, and we're hoping to use a classification and regression tree algorithm both to classify new veg data and to create (or at least help to create) visual keys which can be used in publications. The data I'm using is laid out as community data, with tree species as columns and observations as rows; the first column is a factor containing the classes. I'll also add that I'm very new to this type of analysis, and while I've tried to read about it as much as possible, it's quite likely that I've missed some simple but important aspects. My apologies.
Now the problem: R has excellent packages and great documentation for classification with univariate splits (e.g. rpart, partykit, C5.0). However, I would ideally like to create classification trees where each split is based on multiple criteria: instead of each split having one decision (e.g. "Percent cover of Species A > 6.67"), it would have several (Percent cover of Species A > 6.67 AND Percent cover of Species B < 4.2). I've had a lot of trouble finding packages capable of making multivariate splits and building trees. This answer: https://stats.stackexchange.com/questions/4356/does-rpart-use-multivariate-splits-by-default has been very useful, and I've tried all the packages suggested there for multivariate splitting. The prim package does perform multivariate splits but doesn't seem to build trees; partDSA seems closest to what I'm looking for, but it also only creates trees with one criterion per split; and optpart doesn't seem to be able to make classification trees at all. If anyone has advice on how I could build a classification tree based on a multivariate partitioning method, that would be greatly appreciated.
Also, this is my first question, and I am very open to suggestions about how to ask questions. I didn't feel that providing an example would be helpful in this case, but if necessary I easily can.
Many Thanks!

Determine number of factors in EFA (R) using Comparison Data

I am looking for ways to determine the optimal number of factors for R's factanal function. The most common method (run a PCA and use the scree plot to choose the number of factors) is already known to me. I have found the method described here to be easier for non-technical folks like me. Unfortunately, the R script in which the method was implemented is no longer accessible. I was wondering whether there is an R package that does the same?
The method was originally proposed in this study: Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure.
According to the author, the R code has now moved here.
EFA.dimensions is also a nice and easy-to-use package for this.
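If I remember correctly, the EFAtools package also implements the comparison-data procedure directly, via its CD() function on raw data. A minimal sketch (the result element name follows my reading of the EFAtools docs):
library(EFAtools)
# comparison-data approach on a raw-data matrix/data frame 'dat'
res <- CD(dat, n_factors_max = 8)
res$n_factors  # suggested number of factors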

Imputation in large data

I need to impute missing values. My data set has about 800,000 rows and 92 variables. I tried kNNImpute in the imputation package in R, but it looks like the data set is too big. Are there any other packages/methods in R? I would prefer not to replace the missing values with the mean.
Thank you!
1) You might try
library(sos)
findFn("impute")
This shows 400 matches in 113 packages; you can narrow them down according to your requirements for the imputation function.
2) Did you see/try Hmisc?
Description: The Hmisc library contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, importing datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of S objects to LaTeX code, and recoding variables.
3) Possibly mice
Multiple imputation using Fully Conditional Specification (FCS) implemented by the MICE algorithm. Each variable has its own imputation model. Built-in imputation models are provided for continuous data (predictive mean matching, normal), binary data (logistic regression), unordered categorical data (polytomous logistic regression) and ordered categorical data (proportional odds). MICE can also impute continuous two-level data (normal model, pan, second-level variables). Passive imputation can be used to maintain consistency between variables. Various diagnostic plots are available to inspect the quality of the imputations.
MICE is a great package, with strong diagnostic tools, and may be capable of doing the job on such a large dataset.
One thing you should be aware of: MICE is S-L-O-W. If you intend to use MICE on such a big dataset, I would strongly recommend using a computing cloud; otherwise, plan ahead, because with an 800k x ~100 matrix it may take a few days to get the job done, depending on how you specify your model.
MICE offers a number of different imputation methods to be used according to the type of variable to be imputed. The fastest one is predictive mean matching (PMM). PMM was initially intended for imputing continuous data, but it seems flexible enough to accommodate other types of variables. Take a look at this post by Paul Allison and Stef van Buuren's response: http://statisticalhorizons.com/predictive-mean-matching
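For example, here is a minimal PMM run on mice's built-in nhanes demo data (on your data, start with small m and maxit to gauge runtime before scaling up):
library(mice)
# multiple imputation with predictive mean matching
imp <- mice(nhanes, method = "pmm", m = 5, maxit = 10, seed = 1, printFlag = FALSE)
completed <- complete(imp, 1)  # extract the first completed dataset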
(I see this is a three-year-old post, but I have been using MICE and have been amazed by how powerful -- and oftentimes slow -- it can be!)
