I'm new here and new to R. I'm wondering whether I've used the R survey package correctly to postStratify my data. Below you can see the data structure of my dataframe (df).
  utype gender age regzeit finanz sfeld sindex
1   pri female  23      ja      s   ARG      5
2   sta   male  23    nein      f   ARG     -7
3   sta female  21      ja   <NA>   ARG     11
4   pri   male  28      ja      t   ARG      1
I've oversampled females for the "gender" variable and students for the "utype" variable and now want to adjust for the distribution in the population. My intended n = 383 was oversampled to n = 477.
The intended distributions of my n = 383 sample are:
utype  male female Sum
pri      54     68 122
sta     128    133 261
Sum     187    196 383
design <- svydesign(id = ~utype+gender, data = df)
Warning message:
In svydesign.default(id = ~utype + gender, data = df) :
No weights or probabilities supplied, assuming equal probability
pop.types <- data.frame(utype=c("sta","pri"), Freq=c(261,122))
designp <- postStratify(design, ~utype, pop.types)
> svymean(~sindex, design)
         mean     SE
sindex 0.48008 0.0192
> svymean(~sindex, designp)
         mean SE
sindex 0.47692  0
My question now is whether the above code is correct and how I can post-stratify for both variables, utype and gender, in my code, or whether I have to run the postStratify command twice. I'm especially concerned that the standard error is zero in my weighted sample, and about the warning message. Also, are the Freq values correct for what I'm trying to do here?
The last thing I've been trying to figure out is how to apply the svymean, svyhist, or svyboxplot functions to "sindex", but only for the observations with utype == "pri" (so by group, basically). This should all use the weighted sindex values.
I hope I'm conforming to all the rules. Many thanks in advance!
You don't want to post-stratify twice (that would give you raking). You want to post-stratify once, using a post-stratum variable that is a combination of your two variables, gender and utype, as in your 2x2 table. That is,
design <- update(design, combined_var = interaction(gender, utype))
You now specify a pop.types= argument that has the desired frequencies for each of the four levels of this new variable.
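For example, a minimal sketch with the frequencies taken from the 2x2 table in the question (note that interaction() builds level names of the form gender.utype, so the exact labels depend on your factor levels):

# Sketch: post-stratify once on the combined gender x utype variable.
# Freq values come from the intended n = 383 distribution above.
pop.types <- data.frame(
  combined_var = c("male.pri", "female.pri", "male.sta", "female.sta"),
  Freq         = c(54, 68, 128, 133)
)
designp <- postStratify(design, ~combined_var, pop.types)
svymean(~sindex, designp)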
With only the four observations you list, you will end up with zero standard errors because there isn't any variation within any of the post-strata.
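For the by-group part of the question, one option (a sketch, assuming the post-stratified design designp from above) is to subset the design object itself rather than the data, so the weights are carried along:

# Estimates for utype == "pri" only, using the post-stratified weights:
svymean(~sindex, subset(designp, utype == "pri"))
svyhist(~sindex, subset(designp, utype == "pri"))
svyboxplot(sindex ~ 1, subset(designp, utype == "pri"))
# Or all groups at once:
svyby(~sindex, ~utype, designp, svymean)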
I'm trying to interpret the results I'm getting when trying to cross validate my data for a k-nearest neighbors model. My data set is set up like
variable1(int) | variable2(int) | variable3(int) | variable4 (int) | Response (factor)
I split my data 80% into cvdata and 20% for testing once I choose my model.
A single iteration for my code is below:
cv <- cv.kknn(formula = Response~., cvdata, kcv = 10, k = 7, kernel = 'optimal', scale = TRUE)
cv
When I run cv it just returns a list containing some seemingly random numbers as the row names, the observed outcome variable (y), and the predicted outcome variable (yhat). I'm trying to calculate some sort of accuracy against the test set. Should I be comparing y to yhat to validate?
EDIT: output added below
[[1]]
y yhat
492 1 0.724282776
654 0 0.250394372
427 0 0.125159894
283 0 0.098561768
218 1 0.409990851
[[2]]
[1] 0.2267058 0.1060212
The first element in [[2]] is the mean absolute error; the second is the mean squared error.
Suppose df is that y/yhat output converted to a data frame; then those values can easily be checked with mean(abs(df$y - df$yhat)) and mean((df$y - df$yhat)^2).
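A minimal sketch, assuming cv holds the cv.kknn result shown above (the 0.5 cutoff in the last line is an assumption for a 0/1-coded response):

pred <- as.data.frame(cv[[1]])           # columns: y (observed), yhat (predicted)
mae  <- mean(abs(pred$y - pred$yhat))    # should reproduce cv[[2]][1]
mse  <- mean((pred$y - pred$yhat)^2)     # should reproduce cv[[2]][2]
# A rough classification accuracy by thresholding the predicted values:
acc  <- mean((pred$yhat > 0.5) == pred$y)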
I am really new to R.
My data looks like this: every city belongs to a state and has treatment and control values for the variable Num1. I need to perform two-way t-tests or an ANOVA (either is fine) to check significance between the Num1 values of a city's treatment group and the Num1 values of the control group of the state to which the city belongs (this should be done for each city). The result should be a p-value for each city-treatment vs. state-control pair.
City Measure_group State Day Num1
A Treatment X 01.01.1900 279
A Control X 02.01.1900 256
B Treatment Y 03.01.1900 71
B Control Y 04.01.1900 372
C Treatment Z 05.01.1900 127
C Control Z 06.01.1900 303
D Treatment W 07.01.1900 248
I have already tried a basic ANOVA to do this, but it does not work, as it compares all cities to each other without respecting this relationship.
model <- aov( Num1 ~City+State, data= data1)
comparison <- pairs(lsmeans(model, ~City|State))
test(comparison, by = NULL, adjust = "fdr")
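One possible direction, offered as a sketch rather than a definitive answer: run one Welch t-test per city, comparing that city's treatment values against its state's control values. This assumes data1 contains multiple Num1 observations per city/group (a t-test needs more than one value per side), with column names as in the sample table above:

# One p-value per city: city Treatment vs. the corresponding state's Control.
pvals <- sapply(split(data1, data1$City), function(d) {
  treat   <- d$Num1[d$Measure_group == "Treatment"]
  state   <- unique(d$State)
  control <- data1$Num1[data1$State == state & data1$Measure_group == "Control"]
  t.test(treat, control)$p.value
})
pvals                             # named vector: one p-value per city
p.adjust(pvals, method = "fdr")   # multiplicity adjustment, as in the code above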
I am sampling from a dataset I created myself. It is a two-stage cluster sample. However, I cannot seem to specify my design (the way I would want to) without errors.
I have created a database based on information I have from census EA data from Zanzibar.
The data contains 2 districts. District 1 has 32 subunits (called Shehias) and District 2 has 29. In turn each of the 61 shehias has between 2 and 19 Enumerations Areas (EAs). EAs themselves contain between 51 and 129 households.
The data selection process is the following: all (2) districts and all (61) shehias are included. In each shehia, 2 EAs are selected at random. In each selected EA, 22 or 26 households (depending on the district) are selected. All household members should be selected.
Hence this is a two-stage clustering process. The Primary Sampling Unit (PSU) is the EA; the Secondary Sampling Units (SSUs) are the households. Both selections are at random.
These are the first six rows of the selected data called strategy_2:
District_C Shehia_Code EA_Code HH_Number District_Numb District_Shehias Shehia_EAs HH_in_EA Prev_U3R3
1 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
2 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
3 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
4 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
5 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
6 2 2_11 510201107001_1 510201107001_1_1173 1 29 19 115 1
If I spell out the whole process (including things as clusters that actually are not), then my design ought to be:
strategy_2_Design <- svydesign(id = ~ District_C + Shehia_Code + EA_Code + HH_Number,
fpc = ~ District_Numb + District_Shehias + Shehia_EAs + HH_in_EA,
data = strategy_2)
Here I define the district and the number of districts in the survey as well as the same for Shehias. In both cases sample pop = population pop so the weight contribution is 1 at each stage. The third and fourth element are the actual sampling units.
This design will give me a correct estimate (weights are correct) but the model only has one degree of freedom (2 districts – 1). Hence when I try to calculate values for subunits of Shehias through svyby it can calculate means but if I use svyciprop as FUN the confidence interval is NA because the degrees of freedom of the subset are 0.
Trying to reduce the model down to the two stages I truly am using does not work. Namely
strategy_2_Alt_1 <- svydesign(id = ~ EA_Code + HH_Number,
fpc = ~ Shehia_EAs + HH_in_EA,
data = strategy_2)
yields:
record 1 stage 1 : popsize= 19 sampsize= 122
Error in as.fpc(fpc, strata, ids, pps = pps) :
FPC implies >100% sampling in some strata
Note that 19 is the number of subunits (EAs) in that (first) PSU, and 122 is the number of EAs in the whole sample (2 for each of the 61 shehias, thus 122).
One way around could be to claim that EAs were stratified by Shehia. This would be:
strategy_2_Alt_2 <- svydesign(id = ~ EA_Code + HH_Number,
fpc = ~ Shehia_EAs + HH_in_EA,
strata = ~ Shehias_Cat + NULL,
data = strategy_2)
Shehias_Cat simply contains the name of the shehia each EA is in. This gives a stratified 2-level cluster sampling design with (122, 2916) clusters.
The weights here are the same as in the first design (strategy_2_Design):
> identical(weights(strategy_2_Design),weights(strategy_2_Alt_2))
[1] TRUE
Hence if I calculate the mean using the weights by hand I get the same result. However, if I try to use svymean to do this calculation, I get an error:
> svymean(~Prev_U3R3, strategy_2_Alt_2)
Error in v.sub[[i]] : subscript out of bounds
In addition: Warning message:
In by.default(1:n, list(as.numeric(clusters[, 1])), function(index) { :
NAs introduced by coercion
So my questions are: 1) where do these errors come from, and 2) how do I define my design correctly? I have been trying to think about this in many ways but do not seem to get it right.
The data and my code to reproduce this issue are available at https://www.dropbox.com/sh/u1ajzxaxgue57r8/AAAkCfPC2YrwhEq6gbLsQmGQa?dl=0.
I think you want
strategy_2_SHORT_Design <- svydesign(id = ~ factor(EA_Code) + HH_Number,
fpc = ~ Shehia_EAs + HH_in_EA,
strata = ~ Shehias_Cat,
data = strategy_2)
The design has households sampled within EA, within strata defined by shehias, and the population size in EAs is given by Shehia_EAs and then the size in households is given by HH_in_EA. In your data, EA_Code was a character variable, but it has to be numeric or factor.
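As a quick check (a sketch, assuming the corrected design above), the calls that previously failed should now run:

svymean(~Prev_U3R3, strategy_2_SHORT_Design)
svyciprop(~Prev_U3R3, strategy_2_SHORT_Design)   # design df are now usable for CIs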
The documentation for svydesign should make this clear, but doesn't, presumably because of the default conversion of strings to factors back in primitive times when the function was written.
I'm trying to fit a Poisson generalized linear model using counts of categorical data labeled as s and v. Since the data were collected within sessions of different durations (see session_dur_s), I want to account for this by putting an offset in the glm model.
Here is my table:
label session counts session_dur_s
s 1 587 6843
s 2 203 2095
s 3 187 1834
s 4 122 1340
s 5 40 1108
s 6 64 476
s 7 60 593
v 1 147 6721
v 2 57 2095
v 3 58 1834
v 4 22 986
v 5 8 1108
v 6 12 476
v 7 11 593
My data:
label <- c("s","s","s","s","s","s","s","v","v","v","v","v","v","v")
session <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,7)
counts <- c(587,203,187,122,40,64,60,147,54,58,22,8,12,11)
session_dur_s <-c(6843,2095,1834,1340,1108,476,593,6721,2095,1834,986,1108,476,593)
sv_dur <- data.frame(label,session,counts,session_dur_s)
That's my code:
sv_dur_mod <- glm(counts ~ label * session, data=sv_dur, family = "poisson",offset =session_dur_s)
summary(sv_dur_mod)
plot(allEffects(sv_dur_mod),type="response")
I can't execute the glm function because I receive the beautiful error:
Error: no valid set of coefficients has been found: please supply starting values
I'm not sure how to go about it. I would be really happy if someone could point out what I can do to make it work.
If there is a better model that I can use to predict the counts over time for both the s and v labels, I'm more than open to it.
Many thanks for comments and suggestions!
P.S. I'm running it in an R Markdown script using the following packages: tidyverse, effects, and dplyr.
A Poisson GLM uses a log link by default. That is, it can be executed as:
sv_dur_mod <- glm(counts ~ label * session,
data = sv_dur,
family = poisson("log"))
Accordingly, a log offset is generally appropriate:
sv_dur_mod <- glm(counts ~ label * session,
data = sv_dur,
offset = log(session_dur_s),
family = poisson("log"))
This executes as expected. The original call failed because the raw durations were supplied as the offset: with a log link the offset enters on the log scale, so durations in the thousands make exp() of the linear predictor overflow, and no valid starting coefficients can be found. See the answer here for more information on using a log offset: https://stats.stackexchange.com/a/237980/70372
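As a small sanity check (a sketch using the fitted model above): with the log offset, fitted counts scale with session duration, so observed and fitted per-second rates can be compared side by side.

sv_dur$obs_rate <- sv_dur$counts / sv_dur$session_dur_s
sv_dur$fit_rate <- predict(sv_dur_mod, type = "response") / sv_dur$session_dur_s
sv_dur[, c("label", "session", "obs_rate", "fit_rate")]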
I have a multinomial logistic regression and the outcome variable has 6 levels: 10, 20, 60, 70, 80, 90.
test<-multinom(y ~ x1 + x2 + as.factor(x3) ,data=data1)
I want to predict the probabilities associated with each level of y for each set of given input values. So I run this:
dfin <- data.frame( ses = c(10,20,60,70,80,90), x1=2.1, x2=4, x3=40)
predict(test, todaydata = dfin, type = "probs")
But instead of getting 6 probabilities (one for each level of the outcome), I get many rows of probabilities. Each row has 6 probabilities (which sum to 1), but I don't know why I get so many rows or which row I should trust.
5541 7.226948e-01 1.498199e-01 8.086624e-02 1.253289e-02 8.799416e-03 2.528670e-02
5546 6.034188e-01 7.386553e-02 1.908132e-01 1.229962e-01 4.716406e-04 8.434623e-03
5548 7.266859e-01 1.278779e-01 1.001634e-01 2.032530e-02 7.156766e-03 1.779076e-02
5562 7.120179e-01 1.471181e-01 9.146071e-02 1.265592e-02 8.189511e-03 2.855781e-02
5666 6.645056e-01 3.034978e-02 1.687687e-01 1.219601e-01 3.972833e-03 1.044308e-02
5668 4.875966e-01 3.126855e-02 2.090006e-01 2.430828e-01 3.721631e-03 2.532970e-02
5670 3.900772e-01 1.305786e-02 1.803779e-01 4.137106e-01 1.314298e-03 1.462155e-03
5671 4.272971e-01 1.194599e-02 1.748494e-01 3.833422e-01 8.863019e-04 1.678975e-03
5674 5.477521e-01 2.587478e-02 1.650817e-01 2.487404e-01 3.368726e-03 9.182195e-03
5677 4.300207e-01 9.532836e-03 1.608679e-01 3.946310e-01 2.626104e-03 2.321351e-03
5678 4.542981e-01 1.220728e-02 1.410984e-01 3.885146e-01 2.670689e-03 1.210891e-03
5705 5.642322e-01 1.830575e-01 5.134181e-02 8.952808e-04 8.796467e-03 1.916767e-01
5706 6.161694e-01 1.094046e-01 1.979044e-01 1.095385e-02 7.254592e-03 5.831323e-02
....
Am I missing anything in the code, or do I need to set some parameter?
It is returning the probability for each observation to be in each of the classes. That is how multinomial logistic regressions are implemented. You can imagine a series of binomial logistic regressions (one for each class) and then choosing the class that has the highest probability. This is called the one-vs-all approach.
In your example, observation 5541 is predicted to be class 1 because the first column has the highest value (probability). Observation 5670 is class 4 because that's the column with the highest probability. The matrix will have dimensions # of observations x # of classes.
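One additional suggestion, based on the call shown in the question: predict() for multinom has no todaydata argument, so dfin is silently ignored and the predictions you see are for the training data (one row per training observation). Passing the new data via newdata should give exactly 6 rows, and the highest-probability class can then be picked per row. A sketch:

# 'newdata' (not 'todaydata') gives one row of probabilities per row of dfin:
p <- predict(test, newdata = dfin, type = "probs")
# Picking the class with the highest probability per row, as described above:
colnames(p)[apply(p, 1, which.max)]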