Poisson GLM with categorical data in R

I'm trying to fit a Poisson generalized linear model (GLM) using counts of categorical data labeled as s and v. Since the data were collected within sessions of different durations (see session_dur_s), I want to include this information as a predictor by passing an offset to the glm model.
Here is my table:
label  session  counts  session_dur_s
s      1        587     6843
s      2        203     2095
s      3        187     1834
s      4        122     1340
s      5        40      1108
s      6        64      476
s      7        60      593
v      1        147     6721
v      2        57      2095
v      3        58      1834
v      4        22      986
v      5        8       1108
v      6        12      476
v      7        11      593
My data:
label <- c("s","s","s","s","s","s","s","v","v","v","v","v","v","v")
session <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,7)
counts <- c(587,203,187,122,40,64,60,147,57,58,22,8,12,11)
session_dur_s <- c(6843,2095,1834,1340,1108,476,593,6721,2095,1834,986,1108,476,593)
sv_dur <- data.frame(label,session,counts,session_dur_s)
That's my code:
sv_dur_mod <- glm(counts ~ label * session,
                  data = sv_dur,
                  family = "poisson",
                  offset = session_dur_s)
summary(sv_dur_mod)
plot(allEffects(sv_dur_mod), type = "response")
I can't execute the glm function because I receive the beautiful error:
Error: no valid set of coefficients has been found: please supply starting values
I'm not sure how to go about it. I would be really happy if a smart head could point me to what I can do to work it out.
If there is a better model that I can use to predict the counts over time for both the s and v labels, I'm more than open to it.
Many thanks for comments and suggestions!
P.S. I'm running this in an R Markdown script using the following packages: tidyverse, effects, and dplyr.

A Poisson GLM uses a log link by default. That is, it can be executed as:
sv_dur_mod <- glm(counts ~ label * session,
                  data = sv_dur,
                  family = poisson("log"))
Accordingly, a log offset is generally appropriate:
sv_dur_mod <- glm(counts ~ label * session,
                  data = sv_dur,
                  offset = log(session_dur_s),
                  family = poisson("log"))
This executes as expected. See the answer here for more information on using a log offset: https://stats.stackexchange.com/a/237980/70372
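Equivalently, the offset can be written directly inside the model formula; a minimal sketch using the data above:

# Same model, with the offset specified as a formula term
sv_dur_mod <- glm(counts ~ label * session + offset(log(session_dur_s)),
                  data = sv_dur,
                  family = poisson("log"))
summary(sv_dur_mod)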

Related

R survey poststratification: struggling with the survey function

I'm new here and new to R. I'm wondering whether I've used the R survey package correctly to postStratify my data. Below you can see the data structure of my dataframe (df).
utype  gender  age  regzeit  finanz  sfeld  sindex
pri    female  23   ja       s       ARG    5
sta    male    23   nein     f       ARG    -7
sta    female  21   ja               ARG    11
pri    male    28   ja       t       ARG    1
I've oversampled females for the "gender" variable and students for the "utype" variable, and now want to adjust for the distribution in the population. My n = 383 sample was oversampled to n = 477.
The intended distributions of my n = 383 sample are:
utype  male  female  Sum
pri    54    68      122
sta    128   133     261
Sum    187   196     383
design <- svydesign(id = ~utype+gender, data = df)
Warning message:
In svydesign.default(id = ~utype + gender, data = df) :
No weights or probabilities supplied, assuming equal probability
pop.types <- data.frame(utype=c("sta","pri"), Freq=c(261,122))
designp <- postStratify(design, ~utype, pop.types)
postStratify(design, ~utype, pop.types)
svymean(~sindex, design)
          mean     SE
sindex 0.48008 0.0192

svymean(~sindex, designp)
          mean SE
sindex 0.47692  0
My question now is whether the above code is correct, and how I can postStratify for both variables utype and gender, or whether I have to run the postStratify command twice. I'm especially concerned that the standard error is zero in my weighted sample, and about the warning message. Also, are the Freq values correct for what I'm trying to do here?
The last thing I've been trying to figure out is how to get the svymean, svyhist, or svyboxplot functions for sindex, but only for the observations with utype == "pri", so by group basically. This should all be applied to the weighted sindex values.
I hope I'm conforming to all the rules. Many thanks in advance!
You don't want to post-stratify twice (that would give you raking). You want to post-stratify once, using a post-stratum variable that is the combination of your two variables gender and utype, as in your 2x2 table. That is,
design <- update(design, combined_var = interaction(gender, utype))
You now specify a pop.types= argument that has the desired frequencies for each of the four levels of this new variable.
With only the four observations you list, you will end up with zero standard errors because there isn't any variation within any of the post-strata.
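A minimal sketch of that single post-stratification, using the cell counts from the 2x2 table above (the interaction() level names such as "male.pri" are assumptions; check levels(df$combined_var) on your data):

design <- update(design, combined_var = interaction(gender, utype))
pop.types2 <- data.frame(combined_var = c("male.pri", "female.pri",
                                          "male.sta", "female.sta"),
                         Freq = c(54, 68, 128, 133))
designp2 <- postStratify(design, ~combined_var, pop.types2)
svymean(~sindex, designp2)
# Group-wise estimates of the weighted sindex (e.g., by utype):
svyby(~sindex, ~utype, designp2, svymean)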

Error when specifying correct model with svydesign, R survey package

I am sampling from a dataset I created myself. It is a two-stage cluster sample. However, I cannot seem to specify my design (the way I would want to) without an error.
I have created a database based on information I have from census EA data from Zanzibar.
The data contain 2 districts. District 1 has 32 subunits (called Shehias) and District 2 has 29. In turn, each of the 61 Shehias has between 2 and 19 Enumeration Areas (EAs). EAs themselves contain between 51 and 129 households.
The selection process is the following: all (2) districts and all (61) Shehias are included. In each Shehia, 2 EAs are selected at random. In each selected EA, 22 or 26 households (depending on the district) are selected. All household members are included.
Hence this is a two-stage clustering process. The Primary Sampling Units (PSUs) are the EAs; the Secondary Sampling Units (SSUs) are the households. Both selections are made at random.
These are the first six rows of the selected data called strategy_2:
District_C Shehia_Code EA_Code HH_Number District_Numb District_Shehias Shehia_EAs HH_in_EA Prev_U3R3
1 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
2 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
3 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
4 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
5 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
6 2 2_11 510201107001_1 510201107001_1_1173 1 29 19 115 1
If I spell out the whole process (including, as clusters, stages that actually are not), then my design ought to be:
strategy_2_Design <- svydesign(id = ~ District_C + Shehia_Code + EA_Code + HH_Number,
                               fpc = ~ District_Numb + District_Shehias + Shehia_EAs + HH_in_EA,
                               data = strategy_2)
Here I define the district and the number of districts in the survey, and do the same for Shehias. In both cases the sample size equals the population size, so the weight contribution is 1 at each stage. The third and fourth elements are the actual sampling units.
This design gives me a correct estimate (the weights are correct), but the model has only one degree of freedom (2 districts - 1). Hence, when I try to calculate values for subunits of Shehias through svyby, it can calculate means, but if I use svyciprop as FUN, the confidence interval is NA because the degrees of freedom of the subset are 0.
Reducing the model to the two stages I am truly using does not work. Namely,
strategy_2_Alt_1 <- svydesign(id = ~ EA_Code + HH_Number,
                              fpc = ~ Shehia_EAs + HH_in_EA,
                              data = strategy_2)
yields:
record 1 stage 1 : popsize= 19 sampsize= 122
Error in as.fpc(fpc, strata, ids, pps = pps) :
FPC implies >100% sampling in some strata
Note that 19 is the number of subunits (EAs) in that (first) PSU; 122 is the number of EAs in the whole sample (2 for each of the 61 Shehias, thus 122).
One way around this could be to claim that EAs were stratified by Shehia. This would be:
strategy_2_Alt_2 <- svydesign(id = ~ EA_Code + HH_Number,
                              fpc = ~ Shehia_EAs + HH_in_EA,
                              strata = ~ Shehias_Cat + NULL,
                              data = strategy_2)
Shehias_Cat simply contains the name of the Shehia each EA is in. This gives a stratified two-level cluster sampling design with (122, 2916) clusters.
The weights here are the same as in the first design (strategy_2_Design):
> identical(weights(strategy_2_Design),weights(strategy_2_Alt_2))
[1] TRUE
Hence if I calculate the mean using the weights by hand I get the same result. However, if I try to use svymean to do this calculation, I get an error:
> svymean(~Prev_U3R3, strategy_2_Alt_2)
Error in v.sub[[i]] : subscript out of bounds
In addition: Warning message:
In by.default(1:n, list(as.numeric(clusters[, 1])), function(index) { :
NAs introduced by coercion
So my questions are: 1) where do these errors come from, and 2) how do I define my model correctly? I have been trying to think about this in many ways but do not seem to get it right.
The data and my code to reproduce this issue are available at https://www.dropbox.com/sh/u1ajzxaxgue57r8/AAAkCfPC2YrwhEq6gbLsQmGQa?dl=0.
I think you want
strategy_2_SHORT_Design <- svydesign(id = ~ factor(EA_Code) + HH_Number,
                                     fpc = ~ Shehia_EAs + HH_in_EA,
                                     strata = ~ Shehias_Cat,
                                     data = strategy_2)
The design has households sampled within EAs, within strata defined by Shehias; the population size in EAs is given by Shehia_EAs, and the size in households by HH_in_EA. In your data, EA_Code is a character variable, but it has to be numeric or a factor.
The documentation for svydesign should make this clear, but doesn't, presumably because of the default conversion of strings to factors back in primitive times when the function was written.
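As a usage sketch (assuming the variable names from the question), you can also convert the ID column once instead of wrapping it in factor() in the formula, and then compute the per-Shehia estimates the question asks about with svyby:

strategy_2$EA_Code <- factor(strategy_2$EA_Code)  # character IDs must become factors
strategy_2_SHORT_Design <- svydesign(id = ~ EA_Code + HH_Number,
                                     fpc = ~ Shehia_EAs + HH_in_EA,
                                     strata = ~ Shehias_Cat,
                                     data = strategy_2)
# Prevalence estimates with confidence intervals, by Shehia
svyby(~Prev_U3R3, ~Shehias_Cat, strategy_2_SHORT_Design, svyciprop, vartype = "ci")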

Extract variable labels from rpart decision tree

I've used rpart to build a decision tree on a dataset that has categorical variables with hundreds of levels. The tree splits these variables on selected values of each variable. I would like to examine the labels on which the splits are made.
If I just print the decision tree result, the display listing the splits in the console gets truncated, and either way it is not in an easily interpretable format (comma-separated). Is there a way to access this as an R object? I'm open to using another package to build the tree.
One issue here is that some of the functions in the rpart package are not exported. It appears you're looking to capture the output of the function rpart:::print.rpart. So, beginning with a reproducible example:
library(rpart)
set.seed(1)
df1 <- data.frame(y = rbinom(n=100, size=1, prob=0.5),
                  x1 = rbinom(n=100, size=1, prob=0.25),
                  x2 = rbinom(n=100, size=1, prob=0.75))
(r1 <- rpart(y ~ ., data=df1))
giving
n= 100
node), split, n, deviance, yval
* denotes terminal node
1) root 100 24.960000 0.4800000
2) x1< 0.5 78 19.179490 0.4358974
4) x2>=0.5 66 15.954550 0.4090909 *
5) x2< 0.5 12 2.916667 0.5833333 *
3) x1>=0.5 22 5.090909 0.6363636
6) x2< 0.5 7 1.714286 0.4285714 *
7) x2>=0.5 15 2.933333 0.7333333 *
Now, looking at rpart:::print.rpart, we see a call to rpart:::labels.rpart, which gives us the splits (the names of the 'rows' in the output above). The values of n, deviance, yval, and more are stored in r1$frame, which can be seen by inspecting the output of unclass(r1).
Thus we could extract the above with
(df2 <- data.frame(split=rpart:::labels.rpart(r1), n=r1$frame$n))
giving
split n
1 root 100
2 x1< 0.5 78
3 x2>=0.5 66
4 x2< 0.5 12
5 x1>=0.5 22
6 x2< 0.5 7
7 x2>=0.5 15
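The same idea extends to the other columns stored in r1$frame (the column names below follow the documentation in ?rpart.object); a small sketch:

# Splits together with node size, deviance, and fitted value
(df3 <- data.frame(split = rpart:::labels.rpart(r1),
                   r1$frame[, c("n", "dev", "yval")]))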

Predict future values by using the opera package in R

I have been trying to understand opera ("Online Prediction by Expert Aggregation") by Pierre Gaillard and Yannig Goude. I read two posts, by Pierre Gaillard (http://pierre.gaillard.me/opera.html) and Rob Hyndman (https://robjhyndman.com/hyndsight/forecast-combinations/). However, I do not understand how to predict future values. In Pierre's example, newY = Y represents the test data set (Y <- data_test$Load), which consists of weekly observations of the French electric load. As shown below, the data end in December 2009. Now, how can I forecast, say, the 2010 values? What would newY be here?
> tail(electric_load,5)
Time Day Month Year   NumWeek     Load    Load1      Temp     Temp1  IPI IPI_CVS
727 727 30 11 2009 0.9056604 63568.79 58254.42 7.220536 10.163839 91.3 88.4
728 728 7 12 2009 0.9245283 63977.13 63568.79 6.808929 7.220536 90.1 87.7
729 729 14 12 2009 0.9433962 78046.85 63977.13 -1.671280 6.808929 90.1 87.7
730 730 21 12 2009 0.9622642 66654.69 78046.85 4.034524 -1.671280 90.1 87.7
731 731 28 12 2009 0.9811321 60839.71 66654.69 7.434115 4.034524 90.1 87.7
I noticed that by multiplying the weights of MLpol0 by X, we get outputs similar to the online prediction values.
> weights <- predict(MLpol0, X, Y, type='weights')
> w<-weights[,1]*X[,1]+weights[,2]*X[,2]+weights[,3]*X[,3]
> predValues <- predict(MLpol0, newexpert = X, newY = Y, type='response')
Test_Data predValues w
620 65564.29 65017.11 65017.11
621 62936.07 62096.12 62096.12
622 64953.83 64542.44 64542.44
623 61580.44 60447.63 60447.63
624 71075.52 67622.97 67622.97
625 75399.88 72388.64 72388.64
626 65410.13 67445.63 67445.63
627 65815.15 62623.64 62623.64
628 65251.90 64271.97 64271.97
629 63966.91 61803.77 61803.77
630 64893.42 65793.14 65793.14
631 69226.32 67153.80 67153.80
But I am still not sure how to generate predictions without newY. Maybe we can use the final coefficients that are the output of MLpol to predict future values?
(c<-summary(MLpol <- mixture(Y = Y, experts = X, model = "MLpol", loss.type = "square"))$coefficients)
[1] 0.585902 0.414098 0.000000
I am sorry, I may be way off on this and my question may not make sense at all, but I really appreciate any help or insight.
The idea of the opera package is a bit different from classical batch machine learning methods with a training set and a testing set. The goal is to make sequential predictions. At each round t = 1, ..., n:
1) the algorithm receives the experts' predictions for round t,
2) it makes a prediction for this time step by combining the experts,
3) it updates the weights used for the combination using the newly observed output.
If you have out-of-sample forecasts (i.e., forecasts of experts for future values without the outputs), the best you can do is to use the last coefficients and use them to make a prediction by using:
newexperts %*% model$coefficients
In practice, you may also want to use the averaged coefficients. You can also obtain the same result by using:
predict(object,            # for example, mixture(model = 'FS', loss.type = 'square')
        newexperts = ,     # matrix of out-of-sample expert predictions
        online = FALSE,
        type = 'response')
With the parameter online = FALSE, the model does not need any newY and will not be updated. When you do provide newY, the algorithm does not cheat: it does not use the value at round t to make the prediction at round t. The values of newY are only used to update the coefficients step by step, as if the predictions were made sequentially.
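Putting this together, a minimal sketch (X_future is a hypothetical matrix of the experts' out-of-sample forecasts, one column per expert; it is not part of the original posts):

library(opera)
mix <- mixture(Y = Y, experts = X, model = "MLpol", loss.type = "square")
# X_future: hypothetical matrix of future expert forecasts (no newY exists yet)
# Option 1: combine with the final coefficients by hand
pred_last <- X_future %*% mix$coefficients
# Option 2: let predict() do it, without updating the weights
pred_fixed <- predict(mix, newexperts = X_future, online = FALSE, type = "response")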
I hope this helps.

Fitting two parameter observations into copulas

I have one set of observations containing two variables.
How do I fit a copula to it (i.e., estimate the parameter of the copula and the margin functions)?
Let's say the margin distributions are log-normal and the copula is a Gumbel copula.
The data is as below:
1 974.0304 1010
2 6094.2672 1150
3 3103.2720 1490
4 1746.1872 1210
5 6683.7744 3060
6 6299.6832 3330
7 4784.0112 1550
8 1472.4288 607
9 3758.5728 1970
10 4381.2144 1350
library(copula)
gumbel.cop <- gumbelCopula(dim = 2)
myMvd <- mvdc(gumbel.cop, c("lnorm", "lnorm"),
              list(list(meanlog = 7.1445391, sdlog = 0.4568783),
                   list(meanlog = 7.957392,  sdlog = 0.559831)))
x <- rmvdc(myMvd, 1000)
fit <- fitMvdc(x, myMvd, c(7.1445391, 0.4568783, 7.957392, 0.559831))
The meanlog and sdlog values are derived from the data set. Error message:
Error in if (alpha - 1 < .Machine$double.eps^(1/3)) return(rCopula(n, :
  missing value where TRUE/FALSE needed
How do I choose the copula parameter given the data, with the margin distributions derived from the data set?
To close the question, as discussed in the comments: the error occurs because the copula parameter is left unspecified (NA), so the if() condition cannot evaluate to TRUE or FALSE. Supplying an initial parameter value solves the problem, as does first computing the pseudo-observations and then fitting the copula.
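A minimal sketch of both fixes (assuming the two columns of the data above are in a matrix dat; the starting value 2 for the Gumbel parameter is an arbitrary assumption):

# dat: two-column matrix of the observations above (assumed already loaded)
gumbel.cop <- gumbelCopula(param = 2, dim = 2)   # supply a start value, not NA
myMvd <- mvdc(gumbel.cop, c("lnorm", "lnorm"),
              list(list(meanlog = 7.1445391, sdlog = 0.4568783),
                   list(meanlog = 7.957392,  sdlog = 0.559831)))
# start = margin parameters followed by the copula parameter
fit <- fitMvdc(dat, myMvd, start = c(7.1445391, 0.4568783, 7.957392, 0.559831, 2))
# Alternative: fit the copula alone on pseudo-observations
u <- pobs(dat)
fit.cop <- fitCopula(gumbelCopula(dim = 2), u, method = "mpl")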