Problems in calculating svymean for factor variables in R

I'm trying to calculate simple weighted means for my master's thesis using the survey package in RStudio. Most of my variables are factor variables. When I try to calculate the svymean, I always get "NA" as the result.
I did the following step for all of my variables:
data_subset$fTox <- factor(data_subset$PosNeg)
data_subset$fTox <- plyr::mapvalues(data_subset$fTox,from =c("1","2"), to = c("Negativ", "Positiv"))
Then created the survey design:
QSLab <- svydesign(ids = ~ppoint, data = data_subset1, weights = ~wQSLAB)
When I use this code:
svymean(design=QSLab, x=~fTox)
confint(svymean(design = QSLab, x = ~fTox))
I always get NA as the result, no matter which variable I use.
Hope someone can help me.
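A possible cause, offered as an assumption because the posted output is not available: svymean() returns NA for a variable that contains missing values unless na.rm = TRUE is passed. It is also worth confirming that the data frame handed to svydesign() (data_subset1 in the code above) is the one that actually contains fTox (which was created in data_subset). A minimal sketch on made-up data follows; the toy data frame and its values are hypothetical, and factor() with labels is used in place of plyr::mapvalues():

```r
library(survey)

# Hypothetical toy data: 10 sampling points, 5 respondents each,
# with one missing PosNeg value per sampling point
toy <- data.frame(
  ppoint = rep(1:10, each = 5),
  wQSLAB = rep(c(0.8, 1.0, 1.2, 1.5, 0.9), times = 10),
  PosNeg = rep(c(1, 2, 2, 1, NA), times = 10)
)

# Create the factor BEFORE building the design, in the same data frame
toy$fTox <- factor(toy$PosNeg, levels = c(1, 2),
                   labels = c("Negativ", "Positiv"))

des <- svydesign(ids = ~ppoint, weights = ~wQSLAB, data = toy)

# Without na.rm the missing values typically propagate into the estimate
m_na <- svymean(~fTox, design = des)

# With na.rm = TRUE incomplete cases are dropped before estimation
m_ok <- svymean(~fTox, design = des, na.rm = TRUE)
print(m_ok)
confint(m_ok)
```

If the estimates are still NA with na.rm = TRUE, checking table(data_subset1$fTox, useNA = "always") and the weight column for missing values would be a reasonable next step.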

Related

Error in looped regression using prodest package. R

I have an unbalanced panel, and I am trying to run this regression to estimate a production function.
I know this regression has a bootstrap/iterative component.
This is the package and a data similar to mine:
library(prodest)
data(chilean)
This data is very similar to mine. The only difference is the ID variable: chilean$idvar is numeric, whereas my ID (base2$ruc) is a string, and my data has many more observations.
This is the code for the regression:
On Chilean data:
LP_model1 <- prodestLP(chilean$Y,
fX = cbind(chilean$fX1, chilean$fX2),
sX = chilean$sX,
pX = chilean$pX,
idvar = chilean$idvar,
timevar = chilean$timevar,
R = 30)
And it works.
This is the call for my data. All the other variables are numeric with no missing values; the only differences are the idvar and some extra columns that I don't use in this process.
levpet <- prodestLP(base2$c_y,
fX=base2$c_l,
sX=base2$c_k,
pX=base2$c_m,
idvar=base2$ruc,
timevar = base2$year,
R=30 )
and I get the following error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = c(42719L, 82109L, 82678L, :
replacement has 469326 rows, data has 78221
My data has 78221 obs, while chilean dataset has 2544.
I don't understand what the error is saying or how to solve it.
Can anybody please help me with this one?
Thanks in advance.
This is an example using the chilean model:https://rpubs.com/hacamvan/319728
This is the package documentation: https://cran.r-project.org/web/packages/prodest/prodest.pdf
And here's what my variables mean:
c_y is log of value added
c_l is log of wages
c_k is log of capital
c_m is log of materials
ruc is the firm identifier (a string)
year is a numeric.
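Two checks that may help narrow this down; neither is confirmed as the cause. First, the replacement length in the error is exactly six times the number of rows, which suggests that some object with six values per row is being assigned into a single column. Second, since the string ID is the stated difference from the chilean data, it can be mapped to integer codes before the call. A minimal base R sketch, where the small base2 data frame and the id_num column are hypothetical stand-ins:

```r
# The reported sizes: replacement rows divided by data rows
ratio <- 469326 / 78221
print(ratio)

# Hypothetical stand-in for base2, using the column names from the question
base2 <- data.frame(
  ruc  = c("A01", "A01", "B77", "B77", "C30"),
  year = c(2001, 2002, 2001, 2002, 2001),
  stringsAsFactors = FALSE
)

# Map each distinct string ID to an integer code
base2$id_num <- as.integer(factor(base2$ruc))

# Confirm every input is a plain vector with one value per row
inputs <- list(id = base2$id_num, time = base2$year)
ok <- vapply(inputs,
             function(v) is.null(dim(v)) && length(v) == nrow(base2),
             logical(1))
print(ok)
```

The same check can be run on c_y, c_l, c_k and c_m in the real data, and idvar = base2$id_num can then be tried in the prodestLP() call.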

How to apply weights associated with the NIS (National Inpatient Sample) in R

I am trying to apply weights given with NIS data using the R package "survey", but I have been unsuccessful. I am fairly new to R and survey commands.
This is what I have tried:
# Create the unweighted dataset
d <- read.dta13(path)
sum(d$DISCWT) # This produces the correct weighted number of cases I need.
library(survey)
# Create survey object
dsvy <- svydesign(id = ~ d$HOSP_NIS, strata = ~ d$NIS_STRATUM, weights = ~ d$DISCWT, nest = TRUE, data = d)
d$count <- 1
svytotal(~d$count, dsvy)
However I get the following error after running the survey total:
Error in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1], :
Stratum (1131) has only one PSU at stage 1
Any help would be greatly appreciated, thank you!
The error indicates that you have specified a design where one of the strata has just a single primary sampling unit. It's not possible to get an unbiased estimate of variance for a design like that: the contribution of stratum 1131 will end up as 0/0.
As you see, R's default response is to give an error, because a reasonably likely explanation is that the data or the svydesign statement is wrong. Sometimes, as here, that's not what you want, and the global option 'survey.lonely.psu' describes other ways to respond. You want to set
options(survey.lonely.psu = "adjust")
This and other options are documented at help(surveyoptions)
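To see the effect of the option, here is a minimal sketch on made-up data (the toy data frame and its values are hypothetical, not NIS data). Stratum 2 contains a single PSU. With "adjust", that stratum's data are centered at the sample grand mean rather than at the stratum mean. The sketch also writes the design formulas with bare column names and creates the count column before calling svydesign(), so that the variable is part of the design object:

```r
library(survey)

# Hypothetical toy data: stratum 1 has two PSUs, stratum 2 has only one
toy <- data.frame(
  stratum = c(1, 1, 1, 1, 2, 2),
  psu     = c(1, 1, 2, 2, 1, 1),
  w       = c(10, 10, 12, 12, 20, 20),
  count   = 1
)

# Change how single-PSU strata are handled, remembering the old setting
old <- options(survey.lonely.psu = "adjust")

des <- svydesign(id = ~psu, strata = ~stratum, weights = ~w,
                 nest = TRUE, data = toy)

# Weighted number of cases and its standard error
tot <- svytotal(~count, des)
print(tot)

# Restore the previous setting
options(old)
```

The point estimate is the sum of the weights, so it should match sum(d$DISCWT) in the real data; only the standard error depends on the lonely-PSU setting.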

Issue with h2o Package in R using subsetted dataframes leading to near perfect prediction accuracy

I have been stumped on this problem for a very long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information of the parent but I also feel it's causing issues when training h2o.deeplearning models on what I think is just my training set (though this may not be true). See below for sample code. I included comments to clarify what I'm doing but it's fairly short code:
dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors
X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset
train = sample(length(y),length(y)/1.66) # Create training indices -- A boolean
test = (-train) # Create testing indices
h2o.init(nthreads=2) # Initiate h2o
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions
mean(predictions!=y[test])
The problem is that if I evaluate this against my test subset I get almost 0% error:
[1] 0.0007531255
Has anyone encountered this issue? Have an idea of how to alleviate this problem?
It will be more efficient to use the H2O functions to load the data and split it.
library(h2o)
h2o.init(nthreads = 2) # Start (or connect to) a local H2O instance
data = h2o.importFile("dataset.csv")
y = 2 #Response is 2nd column, first is an index
x = 3:(ncol(data)) #Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8) #Split 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train,activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
h2o.performance(mlModel, test)
It is hard to say what the problem with your original code is, without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not being split, and it is actually being trained on the test data.
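One more possibility worth checking in the original code, independent of the split (an assumption, since dataset.csv is not available): in model.matrix(y ~ ., data = dataset), the dot expands to every column of dataset. Because y is a separate vector and not a column name of dataset, the original response column stays among the regressors, and the model can simply read off the answer. A base R sketch with a hypothetical toy data frame:

```r
# Hypothetical toy data: first column is the response, as in the question
dataset <- data.frame(
  resp = c(0, 1, 0, 1, 1, 0),
  a    = c(2.1, 3.5, 1.2, 4.8, 3.3, 0.7),
  b    = c(5, 3, 6, 2, 1, 4)
)

y <- dataset[, 1]

# As in the question: "." covers all columns of dataset, including resp
X <- model.matrix(y ~ ., data = dataset)
print(colnames(X))

# Building the design matrix from the predictor columns only avoids the leak
X_clean <- model.matrix(~ ., data = dataset[, -1, drop = FALSE])
print(colnames(X_clean))
```

If the response shows up in colnames(X) for the real data, dropping it before training would be the first thing to try.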

Correct use of R naive_bayes() and predict()

I am trying to run a simple naive Bayes model (trying to redo what I have seen in the DataCamp course).
I am using the R naivebayes package.
The training dataset is where9am which looks like this:
My first problem is the following... when I have several predictions in a dataframe thursday9am...
... and I use the following code:
locmodel <- naive_bayes(location ~ daytype, data = where9am)
my_pred <- predict(locmodel, thursday9am)
I get a series of <NA>, whereas I get the correct prediction if the thursday9am dataframe contains only a single observation.
The second problem is the following: when I use the following code to get the associated probabilities...
locmodel <- naive_bayes(location ~ daytype, data = where9am, type = c("class", "prob"))
predict(locmodel, thursday9am , type = "prob")
... even if I have only one observation in thursday9am, I get a series of <NaN>.
I am not sure what I am doing wrong.

How are BRR weights used in the survey package for R?

Does anyone know how to use BRR weights in Lumley's survey package for estimating variance if your dataset already has BRR weights in it?
I am working with PISA data, and they already include 80 BRR replicates in their dataset. How can I get as.svrepdesign to use these, instead of trying to create its own? I tried the following and got the subsequent error:
dstrat <- svydesign(id=~uniqueID,strata=~strataVar, weights=~studentWeight,
data=data, nest=TRUE)
dstrat <- as.svrepdesign(dstrat, type="BRR")
Error in brrweights(design$strata[, 1], design$cluster[, 1], ...,
fay.rho = fay.rho, : Can't split with odd numbers of PSUs in a stratum
Any help would be greatly appreciated, thanks.
no need to use as.svrepdesign() if you have a data frame with the replicate weights already :) you can create the replicate weighted design directly from your data frame.
say you have data with a main weight column called mainwgt and 80 replicate weight columns called repwgt1 through repwgt80 you could use this --
yoursurvey <-
svrepdesign(
weights = ~mainwgt ,
repweights = "repwgt[0-9]+" ,
type = "BRR",
data = yourdata ,
combined.weights = TRUE
)
-- this way, you don't have to identify the exact column numbers. then you can run normal survey commands like --
svymean( ~variable , design = yoursurvey )
if you'd like another example, here's some example code and an explanatory blog post using the current population survey.
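for a version that runs without any external data, here is a minimal sketch. the data frame, the score column and the four made-up replicate weight columns are all hypothetical (real BRR replicates come from the data producer), but the svrepdesign() call has the same shape as the one above:

```r
library(survey)

set.seed(42)

# Hypothetical toy data with a main weight column
yourdata <- data.frame(
  score   = c(480, 510, 530, 455, 600, 495, 520, 470),
  mainwgt = c(1.2, 0.8, 1.0, 1.5, 0.9, 1.1, 1.3, 0.7)
)

# Made-up replicate weights: each replicate doubles the weight of a
# random half of the rows and zeroes out the other half
for (r in 1:4) {
  half <- sample(rep(c(0, 2), 4))
  yourdata[[paste0("repwgt", r)]] <- yourdata$mainwgt * half
}

# The regular expression picks up repwgt1 ... repwgt4 by name
yoursurvey <- svrepdesign(
  weights = ~mainwgt,
  repweights = "repwgt[0-9]+",
  type = "BRR",
  data = yourdata,
  combined.weights = TRUE
)

est <- svymean(~score, design = yoursurvey)
print(est)
```

the point estimate only depends on the main weight; the replicate columns are what drive the standard error.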
I haven't used the PISA data, but last year I used the svrepdesign method with the Public Use Microdata Sample (PUMS) from the American Community Survey (US Census Bureau), which also ships with 80 replicate weights. They state that the Fay method should be used for that specific survey, so here is how one can construct the replicate design object from that data:
library(survey)
pums_p.rep <- svrepdesign(variables = pums_p[, 2:7],
  repweights = pums_p[, 8:87],
  weights = pums_p[, 1], combined.weights = TRUE,
  type = "Fay", rho = (1 - 1/sqrt(4)), scale = 1, rscales = 1)
# CROSS-TABS
# unweighted, computed on the raw data frame instead of attach()
xtabs(~ is5to17youth + withinAMILimit, data = pums_p)
table(pums_p$is5to17youth, pums_p$withinAMILimit)
# weighted, mean income by sex by race for select age groups
svyby(~PINCP, ~RAC1P + SEX,
  subset(pums_p.rep, AGEP > 25 & AGEP < 35),
  svymean, na.rm = TRUE, vartype = c("se", "cv"))
In getting this to work, I found the article from A. Damico helpful: Damico, A. (2009). Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data. The R Journal, 1(2), 37–44.
