I have a weighting variable that I'd like to apply to my dataset so that I have weighted totals. In SPSS, this is straightforward enough. However, in R, I've been multiplying the variable by the weight variable to create a new variable as shown in the following example:
https://stats.stackexchange.com/questions/210697/weighting-variable-based-on-another-variable
Is there a more sophisticated way of applying weights in R?
Thanks.
If you need to work with a weighted dataset and define a complex survey sample, you can use the survey package: https://cran.r-project.org/web/packages/survey/survey.pdf.
You can therefore use all sorts of summary statistics once you have defined the weights to be applied.
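For example, a minimal sketch, assuming a data frame df with a weight column weight, a numeric variable income and a factor region (these names are placeholders, not from your data):
library(survey)
dsgn <- svydesign(ids = ~1, weights = ~weight, data = df)
svytotal(~income, dsgn)   # weighted total of income
svymean(~income, dsgn)    # weighted mean of income
svytable(~region, dsgn)   # weighted frequency table of region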
However, I would only advise this route for complex weighted analyses.
Otherwise, there are several other packages that deal with weights, such as questionr.
It all depends on whether you have to do a simple weighted sum or go on to other types of analysis that require more sophisticated methods.
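For the simple case, base R alone is often enough; a rough sketch with placeholder names df, x, group and weight:
sum(df$x * df$weight)              # simple weighted total
weighted.mean(df$x, df$weight)     # weighted mean
xtabs(weight ~ group, data = df)   # weighted frequency table by group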
In a paper under review, I have a very large dataset with a relatively small number of imputations. The reviewer asked me to report how many nodes were in the tree I generated using the CART method within MICE. I don't know why this is important, but after hunting around for a while, my own interest is piqued.
Below is a simple example using this method to impute a single value. How many nodes are in the tree that the missing value is being chosen from? And how many members are in each node?
data(whiteside, package = "MASS")
data <- whiteside
data[1, 2] <- NA                  # make the first value of Temp missing
library(mice)
impute <- mice(data, m = 100, method = "cart")
impute2 <- complete(impute, "long")
I guess whiteside is only used as an example here, so your actual data looks different.
I can't easily get the number of nodes for the tree generated in mice. The first problem is that it isn't just one tree ... as the package name says, mice stands for Multivariate Imputation by Chained Equations. That means you are sequentially creating multiple CART trees, and each incomplete variable is imputed by a separate model.
From the mice documentation:
The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model.
If you really want to get the number of nodes for each model used, you would probably have to adjust the mice package itself and add some logging there.
Here is how you might approach this:
Calling impute <- mice(data, m = 100, method = "cart") you get an S3 object of class mids that contains information about the imputation (but not the number of nodes for each tree).
But you can call impute$formulas, impute$method and impute$nmis to get more information about which formulas were used and which variables actually had missing values.
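For the whiteside example above, that inspection might look like this:
impute$method     # imputation method per column ("cart" for the imputed ones)
impute$nmis       # number of missing values per column
impute$formulas   # the formula used for each imputed variable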
From the mice.impute.cart documentation you can see that mice uses rpart internally for creating the classification and regression trees.
Since the mids object does not contain information about the fitted trees, I'd suggest you use rpart manually with the formula from impute$formulas.
Like this:
library("rpart")
rpart(Temp ~ 0 + Insul + Gas, data = data)
This will print the fitted tree and its nodes. It wouldn't really be the tree used in mice; as I said, mice means multiple chained equations / multiple models after each other, meaning multiple, possibly different, trees in sequence (take a look at the algorithm description here https://stefvanbuuren.name/fimd/sec-cart.html for the univariate missingness case with CART). But this could at least be an indicator of whether applying rpart to your specific data provides a useful model and thus leads to good imputation results.
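If you assign that fit to a variable, the rpart object itself exposes the node information you were asked about; a sketch (for this manual refit, not the exact trees mice built internally):
library(rpart)
fit <- rpart(Temp ~ 0 + Insul + Gas, data = data)
nrow(fit$frame)                   # total number of nodes in the tree
sum(fit$frame$var == "<leaf>")    # number of terminal nodes (leaves)
table(fit$where)                  # how many observations end up in each leaf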
I am researching how to use multiple imputation results. The following is my understanding, and please let me know if there are mistakes.
Suppose you have a data set with missing values, and you want to conduct a regression analysis. You may perform multiple imputation m = 5 times, and for each imputed data set (5 imputed data sets now) you run a regression analysis, then "pool" the coefficient estimates from these m = 5 models via Rubin's rules (e.g. with the pool() function in mice).
My question is: mice has a function complete(), and the manual says you can extract a completed data set by using complete(object).
But if I use mice with m = 5, does it still make sense to use complete()? Which imputation's results will complete() give me?
Also, does it make sense if I only use mice with m = 1? Thank you.
You probably overlooked that mice::complete() uses action = 1 as its default, which "returns the first imputed data set" (see ?mice::complete) and on its own is fairly worthless.
You should definitely use action = "long" to account for the "multiplicity" of the multiple imputation!
No, it makes no sense at all to use m = 1 (apart from debugging), because every imputation is based on a random process and you have to pool the results (using any method whatsoever) to account for the variation. Often m > 20 is recommended.
Basically, multiple imputation works as follows:
1. Create m imputation processes with a random component, to obtain m slightly different imputed data sets.
2. Analyze each imputed data set to get slightly different parameter estimates.
3. Combine the results, calculating the variation in parameter estimates.
(Also see multiple-imputation-in-a-nutshell for a brief overview.)
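In code, those three steps map onto mice(), with() and pool(); a sketch using nhanes, a small example data set shipped with mice:
library(mice)
imp <- mice(nhanes, m = 20, seed = 1, printFlag = FALSE)  # 1. create 20 imputed data sets
fit <- with(imp, lm(chl ~ age + bmi))                     # 2. analyze each imputed data set
summary(pool(fit))                                        # 3. pool the estimates (Rubin's rules)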
When you use mice, you get an object that is not the imputed data set. You cannot perform operations on it directly without using the special functions in mice. If you want to extract the actual imputed data sets, you use complete, the output of which is a data.frame with one row per individual per imputation (if using the "long" format). If you are doing any analysis with your imputed data that cannot be performed within mice, you need to create this data set first.
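A quick sketch of what that looks like, assuming imp is a mids object returned by mice():
long <- complete(imp, action = "long")
head(long)         # extra .imp (imputation number) and .id (original row) columns
table(long$.imp)   # one full copy of the data per imputation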
I am working on a data set where most of the variables are categorical. Some variables have as many as 5 categories. Is it possible to implement the KNN algorithm in a situation like this? If so, how can I proceed with these categorical variables? Do I have to normalize them? I am using R, and it would be a help if someone could direct me to a source.
Your first step would be to decide on a distance/dissimilarity function between your observations.
One option is to transform your categorical variables into dummy binary variables and then calculate the Jaccard distance between each row pair. Here is a simple tutorial for these steps.
Once you have a distance defined you can proceed with the KNN algorithm as usual. I'm not sure if there are any packages implementing this in R already or if you should program this yourself. It shouldn't be that complicated.
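A rough sketch of that pipeline in base R, assuming a data frame df whose predictors are all factors and whose outcome is df$class (dist() with method = "binary" gives the Jaccard dissimilarity on the dummy columns):
pred_vars <- df[, setdiff(names(df), "class")]
X <- model.matrix(~ . - 1, data = pred_vars,
                  contrasts.arg = lapply(pred_vars, contrasts, contrasts = FALSE))  # full dummy coding
D <- as.matrix(dist(X, method = "binary"))   # pairwise Jaccard dissimilarities
diag(D) <- Inf                               # never count a row as its own neighbour
k <- 5
pred <- apply(D, 1, function(d) {
  nn <- order(d)[seq_len(k)]                 # the k nearest rows
  names(which.max(table(df$class[nn])))      # majority vote among the neighbours
})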
The sample for a survey I am analysing was not selected randomly and so I need to apply a vector of weights to make the findings representative of the population. I have used wtd.table() (from gmodels) successfully to create frequency tables but now want to create a contingency table to compare two categorical variables and conduct a chi-squared test. I'm struggling to find the right function. The svytable() function in the survey package sounds promising but I don't see where I should input the weight vector. I'm new to R. Could anyone explain how to use svytable() or suggest an alternative?
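From the survey package's documentation, the weights go into svydesign() rather than svytable() itself; a sketch with placeholder names df, var1, var2 and weight:
library(survey)
dsgn <- svydesign(ids = ~1, weights = ~weight, data = df)
svytable(~var1 + var2, design = dsgn)   # weighted contingency table
svychisq(~var1 + var2, design = dsgn)   # design-based chi-squared test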
So I'm currently using the Naive Bayes classifier from the e1071 package to classify data, and I was wondering if there was any way to interact with and edit that data.
For example, using the iris dataset, and the methods described here to extract a classifier from it, I want to be able to select the individual tables in the classifier.
I would like to be able to select a specific data table (such as the Sepal.Length table) and compare the values against each other in order to get more information.
Does anyone have any methods for doing this?
Just figured it out
Essentially, the classifier is a set of four components: the a priori probabilities (apriori), the tables holding the mean and standard deviation of each predictor for each class (tables), the different classes (levels), and the original call (call).
Each of those components can be indexed like a list, and if you keep drilling down you can get at the individual items, including the per-predictor tables, and work from there. For a numeric predictor, the first column of its table is the mean and the second is the standard deviation (one row per class). From there you can pull whatever data you want and edit to your heart's content.
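For the iris example, poking around the fitted object looks roughly like this:
library(e1071)
data(iris)
nb <- naiveBayes(Species ~ ., data = iris)
nb$apriori                            # class counts used for the prior
nb$tables$Sepal.Length                # per class: column 1 = mean, column 2 = sd
nb$tables$Sepal.Length["setosa", 1]   # e.g. the mean Sepal.Length for setosa
nb$levels                             # the class labels
nb$call                               # the original call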