I'm working with Federal Reserve Survey of Consumer Finances (SCF) data, which expands the ~6500 actual observed responses into ~29,000 entries through multiple imputation. I'm able to generate summary statistics (counts, means, quantiles, etc.) using scf_MIcombine in the lodown package, but I'm having a lot of trouble representing it visually. The functions that account for multiple imputation tend to spit out svyimputationlist objects, which are challenging to cast into objects that ggplot can understand.
For example:
scf_design <-
svrepdesign(
weights = ~wgt ,
repweights = scf_rw[ , -1 ] ,
data = imputationList( scf_imp ) ,
scale = 1 ,
rscales = rep( 1 / 998 , 999 ) ,
mse = FALSE ,
type = "other" ,
combined.weights = TRUE
)
scf_design_work <- subset(scf_design, age>24 & age<65)
tab_knolLIT <- scf_MIcombine(with(svytable(~finlit+knowlcat, design = subset(scf_design_work, finlit!=0))))
#Error in UseMethod("svytable", design) : no applicable method for 'svytable' applied to an object of class "svyimputationList"
Any suggestions?
To graph multiply imputed survey data, you can use the mice or mi packages in R together with ggplot2. Here is a step-by-step process:
Extract the completed datasets with complete(), or combine per-imputation model estimates with pool(), both from the mice package (mi offers comparable functionality).
Use ggplot2 to create visualizations. The ggplot2 library is powerful and flexible, allowing you to create a wide range of visualizations.
Create box plots or histograms for each variable, for example with geom_boxplot() or geom_histogram() on the completed data. For its own mids objects, mice also ships lattice-based bwplot() and densityplot() methods that show observed and imputed values side by side.
Create a quantile-quantile plot, for example with stat_qq(), to assess whether the distribution of the data is approximately normal.
Create a scatterplot matrix, for example with pairs(), to visualize the relationships between multiple variables.
Note that it is important to account for the uncertainty introduced by multiple imputation when interpreting results, as the estimates and standard errors may be different than if a single imputed dataset was used.
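One way to get from per-imputation results to something ggplot can understand is to pool the estimates yourself and keep them in an ordinary data frame. The following is a minimal base R sketch using the standard form of Rubin's rules. The helper pool_rubin(), the column names and the simulated numbers are all assumptions made for illustration, not SCF results.

```r
# Rubin's rules for one scalar estimate across m imputations.
# est: one point estimate per imputation; se: its standard error per imputation.
pool_rubin <- function(est, se) {
  m <- length(est)
  within_var  <- mean(se^2)                    # average sampling variance
  between_var <- var(est)                      # variance across imputations
  total_var   <- within_var + (1 + 1 / m) * between_var
  c(estimate = mean(est), se = sqrt(total_var))
}

# Simulated stand-in for per-implicate estimates of two groups (hypothetical).
set.seed(1)
per_implicate <- data.frame(
  group     = rep(c("high", "low"), each = 5),
  implicate = rep(1:5, times = 2),
  est       = c(rnorm(5, 14, 0.2), rnorm(5, 10, 0.2)),
  se        = runif(10, 0.4, 0.6)
)

# Pool within each group and collect the results in a plain data frame.
pooled <- do.call(rbind, lapply(split(per_implicate, per_implicate$group), function(d) {
  p <- pool_rubin(d$est, d$se)
  data.frame(group = d$group[1], estimate = p[["estimate"]], se = p[["se"]])
}))
pooled$lower <- pooled$estimate - qnorm(0.975) * pooled$se
pooled$upper <- pooled$estimate + qnorm(0.975) * pooled$se
print(pooled)
```

Because pooled is a regular data frame, a call along the lines of ggplot(pooled, aes(group, estimate)) + geom_col() + geom_errorbar(aes(ymin = lower, ymax = upper)) should accept it directly. scf_MIcombine may weight the variance components differently from the textbook formula used here, so treat these standard errors as illustrative only.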
Related
I am trying to do an exploratory factor analysis using a data set that contains items from questionnaires. Some items have 4 response levels and some have only two. However, I am not sure how to implement this in R when using both types of variables.
I have seen that psych is a good toolbox and have used the ordinal and dichotomous items separately without problems, but I am not sure how to combine both. The tetrachoric or polychoric function does not seem to work, for example, when I am using both data types together.
My code as it stands is this:
poly_cor = polychoric(data)
rho = poly_cor$rho
cor.plot(rho, numbers=T, upper=FALSE, main = "Polychoric Correlation", show.legend = FALSE)
fa.parallel(data, fm="pa", fa="fa", main = "Scree Plot")
poly_model = fa(data, nfactor=4, cor="poly", fm="mle", rotate = "oblimin")
However, the polychoric function says that there is not an equal number of response alternatives.
I have a panel dataset (countries and years) with a lot of missing data so I've decided to use multiple imputation. The goal is to see the relationship between the proportion of women in management (managerial_value) and total fatal workplace injuries (total_fatal)
From what I've read online, Amelia is the best option for panel data so I used that like so:
amelia_data <- amelia(spdata, ts = "year", cs = "country", polytime = 1,
intercs = FALSE)
where spdata is my original dataset.
This imputation process worked, but I'm unsure of how to proceed with forming decision trees using the imputed data (an object of class 'amelia').
I originally tried creating a function (amelia2df) to turn each of the 5 imputed datasets into a data frame:
amelia2df <- function(amelia_data, which_imp = 1) {
stopifnot(inherits(amelia_data, "amelia"), is.numeric(which_imp))
imps <- amelia_data$imputations[[which_imp]]
as.data.frame(imps)
}
one_amelia <- amelia2df(amelia_data, which_imp = 1)
two_amelia <- amelia2df(amelia_data, which_imp = 2)
three_amelia <- amelia2df(amelia_data, which_imp = 3)
four_amelia <- amelia2df(amelia_data, which_imp = 4)
five_amelia <- amelia2df(amelia_data, which_imp = 5)
where one_amelia is the data frame for the first imputed dataset, two_amelia is the second, and so on.
I then combined them using rbind():
total_amelia <- rbind(one_amelia, two_amelia, three_amelia, four_amelia, five_amelia)
And used the new combined dataset total_amelia to construct a decision tree:
set.seed(300)
tree_data <- total_amelia
I_index <- sample(1:nrow(tree_data), size = 0.75*nrow(tree_data), replace=FALSE)
I_train <- tree_data[I_index,]
I_test <- tree_data[-I_index,]
fatal_tree <- rpart(total_fatal ~ managerial_value, I_train)
rpart.plot(fatal_tree)
fatal_tree
This "works" as in it doesn't produce an error, but I'm not sure that it is appropriately using the imputed data.
I found a couple resources explaining how to apply least squares, logit, etc., but nothing about decision trees. I'm under the impression I'd need the 5 imputed datasets to be combined into one data frame, but I have not been able to find a way to do that.
I've also looked into Zelig and bind_rows but haven't found anything that returns one data frame that I can then use to form a decision tree.
Any help would be appreciated!
As already indicated by @Noah, you would set up the multiple imputation workflow differently than you currently do.
Multiple imputation is not really a tool to improve your results or to make them more correct.
It is a method that lets you quantify the uncertainty that the missing data adds to your analysis.
All the different datasets created by multiple imputation are plausible imputations; because of the uncertainty, you don't know which one is correct.
You would therefore use multiple imputation the following way:
Create your m imputed datasets
Build your trees on each imputed dataset separately
Do your analysis on each tree separately
In your final paper, you can now state how much uncertainty is caused through the missing values/imputation
This means you get e.g. 5 different analysis results for m = 5 imputed datasets. At first this looks confusing, but it enables you to give bounds between which the correct result probably lies. Or, if you get completely different results for each imputed dataset, you know that there is too much uncertainty caused by the missing values to give reliable results.
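The workflow above can be sketched as follows, assuming the imputed datasets are available as a list of data frames (for an amelia object that list would be amelia_data$imputations). The simulated data, the helper make_imp() and the example values in new_obs are hypothetical stand-ins, and unlike real imputations the simulated frames do not share any observed values.

```r
library(rpart)

# Hypothetical stand-in for amelia_data$imputations: a list of m = 5 data frames.
set.seed(300)
make_imp <- function() {
  managerial_value <- runif(200, 10, 50)
  total_fatal <- 100 - 1.5 * managerial_value + rnorm(200, sd = 10)
  data.frame(managerial_value, total_fatal)
}
imps <- replicate(5, make_imp(), simplify = FALSE)

# Fit one tree per imputed dataset instead of stacking them with rbind().
trees <- lapply(imps, function(d) rpart(total_fatal ~ managerial_value, data = d))

# Compare the predictions of the five trees at a few illustrative values.
new_obs <- data.frame(managerial_value = c(15, 30, 45))
preds <- sapply(trees, predict, newdata = new_obs)  # rows: new_obs, columns: imputations

# The spread across imputations indicates how much the missing data matters.
pred_summary <- data.frame(
  new_obs,
  lowest  = apply(preds, 1, min),
  average = rowMeans(preds),
  highest = apply(preds, 1, max)
)
print(pred_summary)
```

A wide gap between lowest and highest for the same input would suggest that conclusions drawn from any single tree depend heavily on the imputation.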
I have a file containing survey data. For example, the file looks like this:
IDNUMBER AGE SEX NumPrescr OnPrescr SURV_WGT BSW1 BSW2....BSW500
123456 22 1 6 1 ... ... ... ...
Here, OnPrescr is a binary variable indicating whether or not the subject is on prescription meds, BSW1 - BSW500 are the bootstrap weights, and SURV_WGT is the survey weight per subject. There are roughly 20000 entries.
I am tasked with creating tables of various statistics within certain age-gender group breakdowns. For example, how many males from 17 to 24 are on prescription medications. And I need a count N and 95% CI for each of these types of calculations. I'm not familiar at all with survey methods.
From what I understand, I can't simply add the number of people in each category to get the final count N for each question/category (i.e., I cannot just add up all the males 17 to 24 who are using prescription meds). Instead, I have to take into account the survey weights and bootstrap weights when constructing my final count N and confidence intervals.
I was then told in STATA this is a one line command:
svyset [pw=SURV_WGT], brr(bsw1-bsw500)
I am working in R however. What is the equivalent command in R and what exactly is the above command doing?
PS: My sample of roughly 20000 individuals is a sample of a population of roughly 35 million.
You will want to use the survey package in R. This will be your best friend for weighted/complex survey analysis in R.
install.packages("survey")
The survey package has two main steps to your analysis. The first is creating the svydesign object, which stores information about your survey design including weights, replicate weights, data, etc. Then use any number of analysis functions to run analysis/descriptives on those design objects (e.g., svymean, svyby - for subgroup analysis, svyglm, and many more).
Based on your question, you have survey weights and replicate weights (bootstrapped). While the more common svydesign function is used for surveys with a single set of weights, you want to use svrepdesign, which will allow you to specify survey weights and replicate weights. Check out the documentation, but here is what you can do:
mydesign <- svrepdesign(data = mydata,
weights = ~SURV_WGT,
repweights = "BSW[0-9]+",
type = "bootstrap",
combined.weights = TRUE)
You should read the documentation, but briefly: data will be your data frame, weights takes your single survey weight vector, usually as a formula, repweights is great in that it accepts a regex string that identifies all the replicate weight columns in your data by column name, type tells the design what your replicate weights are (how they were derived), combined.weights is logical for whether the replicate weights contain sampling weights - I assume this is true but it may not be.
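To make concrete what such a design does with the weights, here is a rough base R sketch of a weighted count and its replicate-based confidence interval, computed by hand. The data are simulated, the replicate weights are a crude stand-in for real bootstrap weights, and the variance convention (mean squared deviation of the replicate estimates around the full-sample estimate) is one common choice that I am assuming here. The survey package's defaults for a given type may scale or center slightly differently.

```r
set.seed(1)
n <- 50   # respondents (tiny, for illustration)
B <- 20   # number of replicate weights (the question has 500)

# Simulated data with the column names from the question.
mydata <- data.frame(
  OnPrescr = rbinom(n, 1, 0.4),
  SURV_WGT = runif(n, 500, 3000)
)

# Crude hypothetical replicate weights: randomly perturbed copies of SURV_WGT.
bsw <- sapply(seq_len(B), function(b) mydata$SURV_WGT * rexp(n))  # n x B matrix

# Point estimate: each respondent counts for SURV_WGT people in the population.
n_hat <- sum(mydata$SURV_WGT * mydata$OnPrescr)

# Recompute the same estimate once per set of replicate weights.
rep_est <- colSums(bsw * mydata$OnPrescr)

# Variance from the spread of the replicate estimates (assumed convention).
se <- sqrt(mean((rep_est - n_hat)^2))
ci <- n_hat + c(-1, 1) * qnorm(0.975) * se

print(c(estimate = n_hat, lower = ci[1], upper = ci[2]))
```

The point estimate only ever uses SURV_WGT. The replicate weights exist solely to measure how much that estimate would vary from sample to sample.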
From this design object, you can then run analysis. E.g., let's calculate the average number of prescriptions by sex:
myresult <- svyby(~NumPrescr, # variable to pass to function
by = ~SEX, # grouping
design = mydesign, # design object
vartype = "ci", # report variation as confidence interval
FUN = svymean # specify function from survey package, mean here
)
Hope this helps!
EDIT: if you want to look at something by age groups, as you suggest, you need to create a character or factor variable that is coded for each age group and use that new variable in your svyby call.
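A minimal sketch of building such an age-group variable with cut() in base R follows. Only the 17 to 24 group comes from the question. The other break points, the labels and the name AGE_GRP are assumptions for illustration.

```r
# Small made-up data frame using the column names from the question.
mydata <- data.frame(
  AGE = c(17, 22, 24, 25, 40, 64, 65, 80),
  SEX = c(1, 2, 1, 2, 1, 2, 1, 2)
)

# right = FALSE makes the intervals [17,25), [25,45), [45,65), [65,Inf),
# so a 25-year-old falls into "25-44" rather than "17-24".
mydata$AGE_GRP <- cut(
  mydata$AGE,
  breaks = c(17, 25, 45, 65, Inf),
  right  = FALSE,
  labels = c("17-24", "25-44", "45-64", "65+")
)

# Unweighted cross-tabulation, just to check the coding.
print(table(mydata$AGE_GRP, mydata$SEX))
```

The new column needs to exist in the data before svrepdesign() is called, or it can be added to an existing design with update(). A call such as svyby(~OnPrescr, by = ~AGE_GRP + SEX, design = mydesign, FUN = svytotal, vartype = "ci") should then give weighted counts with confidence intervals for each age-sex group.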
I've used the function betadisper() in the vegan package to generate multivariate dispersions and plot those data in a PCoA. In this example I'll be looking at the difference between the sexes in a singular species.
Load the original data. For our purposes this can legit be anything here. The data I'm using isn't special. Its feature measurements are from a bioacoustic dataset. I am walking through my process:
my_original_data = read.csv("mydata.csv", as.is = T, check.names = F)
#Just extract the numeric/quantitative data.
myData=my_original_data[, 13:107]
Based on previous research, we used an unsupervised randomForest to determine similarity within our original feature measurements:
require(randomForest)
full_urf = randomForest(myData, proximity=T, scale=TRUE, ntree=4999,importance = TRUE)
An index was then generated using the proximity matrix:
urf_dist_full = as.dist(1-full_urf$proximity)
A permutational MANOVA was run on the resulting index using the vegan package. The use of the pMANOVA was well researched and is the correct test for my purposes:
mod=adonis(formula = urf_dist_full ~ Sex * Age * Variant, data = my_original_data, permutations = 999, method = "euclidean")
my_original_data had qualitative factors, Sex, Age and Variant. I could have extracted them, but it seemed cleaner to keep them within the original dataset.
After running a few homogeneity tests, I want to plot the multivariate dispersions. To do this I have been using the betadisper function:
Sex=betadisper(urf_dist_full,my_original_data$Sex)
plot(Sex, main="Sex Multivariate Dispersions")
That plots this beauty:
How can I label the centroids as Male and Female? I also want to run this plot for the Variant category, but that has five factors rather than two, which really warrants labeling.
I've seen the boxplot() variant of this, but I like how the PCoA also shows clustering.
You can add labels to centroids like this:
ordilabel(scores(Sex, "centroids"))
where Sex is your betadisper result. If you do not want to use the original names of your centroids, you can change the names with:
ordilabel(scores(Sex, "centroids"), labels=c("A","B"))
You can use the identify function:
A <- plot(Sex)
identify(A, "centroids")
Or look at the scores (this doesn't add labels to the plot, but shows you the centroid positions):
scores(Sex, 1:2, display = "centroids")
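If you prefer to place the labels yourself, the same idea can be sketched in base R without vegan: compute the PCoA, find one centroid per group, and write the group names with text(). The data below are simulated. Centroids are taken as plain group means, which is an assumption, since betadisper() uses spatial medians by default and its centroids may therefore sit in slightly different places.

```r
set.seed(1)
grp <- factor(rep(c("Female", "Male"), each = 20))

# Simulated feature matrix; rows in the second group are shifted by one unit.
x <- matrix(rnorm(40 * 5), ncol = 5) + as.integer(grp)

# Classical PCoA on Euclidean distances, keeping the first two axes.
pcoa <- as.data.frame(cmdscale(dist(x), k = 2))
names(pcoa) <- c("PCoA1", "PCoA2")

# One centroid (mean position) per group.
centroids <- aggregate(pcoa, by = list(group = grp), FUN = mean)

# Plot the points, then label each centroid with its group name.
plot(pcoa$PCoA1, pcoa$PCoA2, col = as.integer(grp), pch = 19,
     xlab = "PCoA 1", ylab = "PCoA 2", main = "Group centroids")
text(centroids$PCoA1, centroids$PCoA2, labels = centroids$group, font = 2)
```

The same pattern extends to a factor with five levels such as Variant, since aggregate() returns one row per level.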
I am trying to generate a random set of numbers that exactly mirror a data set that I have (to test it). The dataset consists of 5 variables that are all correlated with different means and standard deviations as well as ranges (they are likert scales added together to form 1 variable). I have been able to get mvrnorm from the MASS package to create a dataset that replicated the correlation matrix with the observed number of observations (after 500,000+ iterations), and I can easily reassign means and std. dev. through z-score transformation, but I still have specific values within each variable vector that are far above or below the possible range of the scale whose score I wish to replicate.
Any suggestions how to fix the range appropriately?
Thank you for sharing your knowledge!
To generate a sample that does "exactly mirror" the original dataset, you need to make sure that the marginal distributions and the dependence structure of the sample match those of the original dataset.
A simple way to achieve this is with resampling:
my.data <- matrix(runif(1000, -1, 2), nrow = 200, ncol = 5) # Some dummy data
my.ind <- sample(1:nrow(my.data), nrow(my.data), replace = TRUE)
my.sample <- my.data[my.ind, ]
This will ensure that the margins and the dependence structure of the sample (closely) match those of the original data.
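A quick check of this claim, as a sketch on the same kind of dummy data: because every resampled row is an actual row of the original data, no value can fall outside the observed range, and the correlation matrix should stay close to the original. The names in_range and max_cor_diff are introduced here only for illustration.

```r
set.seed(42)
my.data <- matrix(runif(1000, -1, 2), nrow = 200, ncol = 5)  # dummy data, as above

# Resample whole rows with replacement so the dependence between columns is kept.
my.ind <- sample(nrow(my.data), nrow(my.data), replace = TRUE)
my.sample <- my.data[my.ind, ]

# Every resampled value already exists in the original, so the range cannot be exceeded.
in_range <- all(apply(my.sample, 2, min) >= apply(my.data, 2, min)) &&
  all(apply(my.sample, 2, max) <= apply(my.data, 2, max))

# Largest absolute difference between the two correlation matrices.
max_cor_diff <- max(abs(cor(my.sample) - cor(my.data)))

print(in_range)
print(max_cor_diff)
```

For Likert-based sum scores this also means the sample can only contain values that are actually attainable on the scale.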
An alternative is to use a parametric model for the margins and/or the dependence structure (copula). But as stated by @dickoa, this will require serious modeling effort.
Note that by using a multivariate normal distribution, you are (implicitly) assuming that the dependence structure of the original data is the Gaussian copula. This is a strong assumption, and it would need to be validated beforehand.