I would like to use the svyglm function from the survey package to run stratified regression models, i.e. regression models on subsets of my population.
Suppose x is my predictor, y is my outcome, and z is a third (factor) variable. I would like to see individual relationships between x and y for different levels of z.
The documentation for this package says that "The correct standard error estimate for a subpopulation that isn’t a stratum is not just obtained by pretending that the sub population was a designed survey of its own. However, the subset function and [ method for survey design objects handle all these details automagically, so you can ignore this problem."
There is a subset argument in the svyglm function. My question is - do you specify the subpopulation in the subset argument of the design function, in the svyglm function, or both?
Either one, but not both.
The code inside svyglm looks like
subset <- substitute(subset)
subset <- eval(subset, model.frame(design), parent.frame())
if (!is.null(subset))
design <- design[subset, ]
The first two lines are handling where to look up the subset, and then it just gets used to [ the design.
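As a minimal sketch of the two equivalent routes, the following uses the api example data shipped with the survey package. The variables api00, ell and sch.wide stand in for y, x and z; they are illustrative choices, not part of the original question.

```r
library(survey)

# Example data shipped with the survey package (stand-in for the real survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Route 1: subset the design object, then fit
fit_a <- svyglm(api00 ~ ell, design = subset(dstrat, sch.wide == "Yes"))

# Route 2: leave the design alone and use svyglm's subset argument
fit_b <- svyglm(api00 ~ ell, design = dstrat, subset = sch.wide == "Yes")

# Both routes should give the same estimates and standard errors
all.equal(coef(fit_a), coef(fit_b))
all.equal(SE(fit_a), SE(fit_b))

# One model per level of the factor (here sch.wide plays the role of z)
fits <- lapply(levels(apistrat$sch.wide), function(lv) {
  svyglm(api00 ~ ell, design = subset(dstrat, sch.wide == lv))
})
```

In both routes the subsetting happens on the design object, so the standard errors still account for the full design.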
I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have developed so far, using a made-up example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assign missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets are generated by setting the m argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputed datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and comparing the distributions of the original and imputed data is skipped for the sake of the example. The test variable is the binary dependent variable.
#run a logistic regression on each of the 20 imputed datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimations from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function can generate predicted values with a variable fixed at a specific level (for factors) or value (for continuous variables). In this example, I could choose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where every observation has this variable set to 3. Normally, I would just pass my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model with mice, the object is of class mipo rather than glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would appreciate any advice that could help me get closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
  mod <- glm(test~pregnant+glucose+diastolic+
               triceps+age+bmi,
             data = dat, family = binomial)
  out <- predictions(mod, newdata = datagrid(pregnant=3))
  return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
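If only a pooled point estimate is needed, one simple option is to compute the standardized probability in each completed dataset and average across imputations, which is how Rubin's rules pool point estimates. The sketch below uses simulated data with hypothetical coefficients rather than pima, and it does not produce a pooled standard error.

```r
library(mice)

# Simulated stand-in data (not pima); the coefficients are arbitrary
set.seed(42)
n <- 300
dat <- data.frame(pregnant = rpois(n, 3), glucose = rnorm(n, 120, 30))
dat$test <- rbinom(n, 1, plogis(-4 + 0.15 * dat$pregnant + 0.025 * dat$glucose))
dat$glucose[sample(n, 40)] <- NA  # introduce missing values

imp <- mice(dat, m = 5, printFlag = FALSE, seed = 1)

# For each completed dataset: fit the model, set pregnant to 3 for everyone,
# and average the predicted probabilities (marginal standardization)
p_by_imp <- sapply(seq_len(imp$m), function(i) {
  d <- complete(imp, i)
  fit <- glm(test ~ pregnant + glucose, data = d, family = binomial)
  d$pregnant <- 3
  mean(predict(fit, newdata = d, type = "response"))
})

# Pooled point estimate: the mean over imputations
p_pooled <- mean(p_by_imp)
p_pooled
```

The spread of p_by_imp gives a rough sense of the between-imputation variability, but a proper interval would also need the within-imputation variance.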
I'm using the function svyVGAM::svy_vglm to run a multinomial model with survey weights:
mmodel <- svy_vglm(y~x1+x2+x3+x4..., family=multinomial, design=w_data)
where the x's are categorical variables, some with three or more levels. From the model summary I can get the p-value for each coefficient, but I don't know how to get an overall p-value for each variable as a whole.
In other contexts, anova(), waldtest(), lrtest(), ... could be used, but none of them seem to work with svy_vglm objects. tbl_regression does not work either: Error: No tidy method for objects of class svy_vglm.
Any help?
Thanks
You can do this using the coef and vcov methods. There's probably a package for it, but it's not hard to program yourself.
Suppose that model is your model object, design is your survey design object and index is a vector with the positions of the coefficients you want to test. If you had ten coefficients and wanted to test all except the first two, you would have index<-3:10, for example.
beta<-coef(model)[index]
V<-vcov(model)[index,index]
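teststat <- crossprod(beta, solve(V,beta))
# divide by the number of tested coefficients to put the Wald statistic on the F scale
pf(teststat/length(beta), df1=length(beta), df2=degf(design), lower.tail=FALSE)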
This doesn't give you a likelihood ratio test; you'd probably need to write to the package author and suggest that as a new feature.
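The same calculation can be wrapped in a small function. This is a sketch only: wald_f_test is a hypothetical helper name, and an ordinary lm fit on a built-in dataset stands in for the survey model so the example runs without survey data. The quadratic form is divided by the number of tested coefficients before being referred to the F distribution.

```r
# Wald F-test for a block of coefficients, using only coef() and vcov().
# wald_f_test is a hypothetical helper, not part of any package.
wald_f_test <- function(model, index, df2) {
  beta <- coef(model)[index]
  V <- vcov(model)[index, index, drop = FALSE]
  q <- length(beta)
  # Wald quadratic form, scaled by the number of tested coefficients
  f_stat <- drop(crossprod(beta, solve(V, beta))) / q
  c(F = f_stat, df1 = q, df2 = df2,
    p = pf(f_stat, df1 = q, df2 = df2, lower.tail = FALSE))
}

# Stand-in model: jointly test the hp and qsec coefficients (positions 3 and 4)
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
res <- wald_f_test(full, index = 3:4, df2 = df.residual(full))
res
```

For a survey model, df2 would be degf(design) and index would hold the positions of all coefficients belonging to the variable being tested.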
I have a data.frame in R whose variables represent locations and whose observations are measurements of a certain variable at those locations. I want to measure how dependence decays with distance for certain locations, so the variogram is particularly useful for my study.
I am trying to use gstat library but I am a bit confused about certain parameters. As far as I understand the (empirical) variogram should only need as basic data:
The locations of the variables
Observations for these variables
And then other parameters like maximum distance, directions, ...
Now, gstat::variogram() function requires as first input an object of class gstat. Checking the documentation of function gstat() I see that it outputs an object of this class, but this function requires a formula argument, which is described as:
formula that defines the dependent variable as a linear model of independent variables; suppose the dependent variable has name z, for ordinary and simple kriging use the formula z~1; for simple kriging also define beta (see below); for universal kriging, suppose z is linearly dependent on x and y, use the formula z~x+y
Could someone explain to me what this formula is for?
Try
methods(variogram)
and you'll see that gstat has several methods for variogram, one requiring a gstat object as first argument.
Given a data.frame, the easiest is to use the formula method:
variogram(z~1, ~x+y, data)
which specifies that in data, z is the observed variable of interest, ~1 specifies a constant mean model, ~x+y specify that the coordinates are found in columns x and y of data.
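A minimal sketch of the formula method on simulated data follows. The data frame dat and its columns x, y and z are made up for illustration; cutoff and width are the variogram arguments for the maximum distance and the bin width.

```r
library(gstat)

# Simulated stand-in data: coordinates x, y and an observed variable z
set.seed(1)
dat <- data.frame(x = runif(200), y = runif(200))
dat$z <- sin(4 * dat$x) + cos(4 * dat$y) + rnorm(200, sd = 0.2)

# Empirical variogram with a constant mean (z ~ 1); coordinates given as ~x+y
v <- variogram(z ~ 1, locations = ~ x + y, data = dat,
               cutoff = 0.5, width = 0.05)

# One row per distance bin: number of pairs (np), mean distance (dist),
# and semivariance (gamma)
head(v)
```

Replacing z ~ 1 with, say, z ~ x + y would compute the variogram of the residuals from a linear trend in the coordinates, which is the role the formula plays in universal kriging.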
I have a huge dataset (4M x 17) with missing values. Two columns are categorical; the rest are numerical. I want to use the mice package for missing value imputation. This is what I tried:
> testMice <- mice(myData[1:100000,]) # runs fine
> testTot <- predict(testMice, myData)
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "mids"
Running the imputation on whole dataset was computationally expensive, so I ran it on only the first 100K observations. Then I am trying to use the output to impute the whole data.
Is there anything wrong with my approach? If yes, what should I do to make it correct? If no, then why am I getting this error?
Neither mice nor Hmisc provides the parameter estimates from the imputation process. Both Amelia and imputeMulti do. In both cases, you can extract the parameter estimates and use them for imputing your other observations.
Amelia assumes your data are distributed as multivariate normal (e.g. X \sim N(\mu, \Sigma)).
imputeMulti assumes that your data are distributed as multivariate multinomial. That is, the complete cell counts are distributed as X \sim M(n,\theta), where n is the number of observations.
Fitting can be done as follows, via example data. Examining parameter estimates is shown further below.
library(Amelia)
library(imputeMulti)
data(tract2221, package= "imputeMulti")
test_dat2 <- tract2221[, c("gender", "marital_status","edu_attain", "emp_status")]
# fitting
IM_EM <- multinomial_impute(test_dat2, "EM",conj_prior = "non.informative", verbose= TRUE)
amelia_EM <- amelia(test_dat2, m= 1, noms= c("gender", "marital_status","edu_attain", "emp_status"))
The parameter estimates from the amelia function are found in amelia_EM$mu and amelia_EM$theta.
The parameter estimates in imputeMulti are found in IM_EM@mle_x_y and can be accessed via the get_parameters method.
imputeMulti has noticeably higher imputation accuracy for categorical data relative to any of the other three packages, though it only accepts multinomial (e.g. factor) data.
All of this information is in the currently unpublished vignette for imputeMulti. The paper has been submitted to JSS and I am awaiting a response before adding the vignette to the package.
You don't use predict() with mice. It's not a model you're fitting per se. Your imputed results are already there for the 100,000 rows.
If you want data for all rows then you have to put all rows in mice. I wouldn't recommend it though, unless you set it up on a large cluster with dozens of CPU cores.
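To get the imputed rows out of a mids object, use complete() rather than predict(). A minimal sketch with the small nhanes dataset that ships with mice, standing in for the 100,000-row subset:

```r
library(mice)

# nhanes is a small example dataset with missing values, included in mice
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)

# A single completed dataset (here the first of the five imputations)
first_completed <- complete(imp, 1)

# All imputations stacked, with .imp and .id columns identifying each row
all_completed <- complete(imp, "long")

anyNA(first_completed)
```

Only the rows that were passed to mice() are imputed; rows outside that subset are not covered by the result.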
I am having a hard time understanding how to build a ROC curve, and I have come to the conclusion that maybe I am not creating the model correctly. I am running a randomForest model on a dataset where the class attribute "y_n" is 0 or 1. I have split the data into bank_training and bank_testing for prediction purposes.
Here are the steps I follow:
bankrf <- randomForest(y_n~., data=bank_training, mtry=4, ntree=2,
                       keep.forest=TRUE, importance=TRUE)
bankrf.pred <- predict(bankrf, bank_testing, type='response',
                       predict.all=TRUE, norm.votes=TRUE)
Is what I have done so far correct? The bankrf.pred object that is created is a list with two components named aggregate and individual. I don't understand where these two names came from. Moreover, when I run:
summary(bankrf.pred)
           Length Class  Mode
aggregate  22606  factor numeric
individual 45212  -none- character
What does this summary mean? The training and testing datasets have 22605 and 22606 rows respectively. If someone could explain to me what is happening I would be very grateful. I think there is something wrong in all this.
When I try to design the ROC curve with ROCR I use the following code:
library(ROCR)
pred <- prediction(bank_testing$y_n, bankrf.pred$c(0,1))
Error in is.data.frame(labels) : attempt to apply non-function
Is it just a mistake in the way I try to create the ROC curve, or is the problem with randomForest from the beginning?
The documentation for the function you are attempting to use includes this description of its two main arguments:
predictions A vector, matrix, list, or data frame containing the
predictions.
labels A vector, matrix, list, or data frame containing the true
class labels. Must have the same dimensions as 'predictions'.
You are currently passing the variable y_n to the predictions argument, and what looks to me like nonsense to the labels argument.
The predictions will be stored in the output of the random forest model. As documented at ?predict.randomForest, it will be a list with two components. aggregate will contain the predicted values for the entire forest, while individual will contain the predicted values for each individual tree.
So you probably want to do something like this:
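# ROCR needs numeric predictions, so convert the factor of predicted classes
prediction(as.numeric(as.character(bankrf.pred$aggregate)), bank_testing$y_n)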
See how that works? The predicted values are passed to the predictions argument, while the "labels" or true values, are passed to the labels argument.
You should remove the predict.all=TRUE argument from predict if you simply want to get the predicted classes. By using predict.all=TRUE you are telling the function to keep the predictions of all trees in addition to the prediction from the forest.
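For a ROC curve, class probabilities are usually more informative than hard 0/1 predictions, because they give ROCR a range of thresholds to sweep. The sketch below uses simulated data in place of the bank dataset; the column names x1 and x2 and the sample sizes are made up for illustration.

```r
library(randomForest)
library(ROCR)

# Simulated stand-in for the bank data: two predictors and a 0/1 factor outcome
set.seed(1)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y_n <- factor(rbinom(n, 1, plogis(2 * dat$x1 - dat$x2)))
train <- dat[1:200, ]
test <- dat[201:400, ]

rf <- randomForest(y_n ~ ., data = train, ntree = 200)

# Predicted probability of class "1" for each test row
prob1 <- predict(rf, test, type = "prob")[, "1"]

# ROCR: predictions first, true labels second
pred <- prediction(prob1, test$y_n)
perf <- performance(pred, "tpr", "fpr")
auc <- performance(pred, "auc")@y.values[[1]]

plot(perf)
auc
```

Note that ntree=2, as in the question, yields very coarse probabilities; a larger forest gives a smoother curve.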