GLM: Continuous variable with few states as factor or numeric? - r

I have a basic question. I am running binomial GLMs with numeric predictors. Some of these predictors have very few unique values: some have 2, some 3 and some 4. All these predictors are on a clear and interpretable continuous scale; I just sampled many times from very few places on that scale (I know, not ideal for regression, but it cannot be changed). Take for example the following table. Imagine this table repeated 10'000 more times, with just the response values varying:
response pred1 pred2 pred3
       0    20   100   100
       1    50   900   200
       1    20  4000   800
       0    50   100   900
       1    20   900   100
       0    50  4000   100
       1    20   100   800
       0    50   900   900
My question is: (when) does it make sense to translate these predictors into factors?
If a numeric variable only contains 2 unique values, does it even make a difference if it's a factor or numeric? Can I trust estimates based on just 3 or 4 unique values? Would it be better to make it a factor and thereby "acknowledge" that we cannot infer a linear regression line from the few values we have sampled?
I assume, since they can all be placed on a continuous scale, it makes sense to keep them numeric, but I just wanted to make sure I'm doing the right thing.
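For concreteness, here is a minimal sketch of the two specifications I am weighing, using simulated data roughly like the table above (the values are illustrative only, since I cannot share my own data):
# simulated data, roughly like the table above
set.seed(1)
n <- 8000
pred1 <- sample(c(20, 50), n, replace = TRUE)          # 2 unique values
pred2 <- sample(c(100, 900, 4000), n, replace = TRUE)  # 3 unique values
response <- rbinom(n, 1, plogis(-1 + 0.02 * pred1))
# same response, fitted once with numeric predictors and once with factors
m_num <- glm(response ~ pred1 + pred2, family = binomial)
m_fac <- glm(response ~ factor(pred1) + factor(pred2), family = binomial)
summary(m_num)
summary(m_fac)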

Related

Output "randomForest" with changing MeanDecreaseAccuracyValues

I have a question relating to the “randomForest” package in R. I am trying to build a model with ecological variables that best explain my species occupancy data for 41 sites in the field (gathered from camera traps). My ultimate goal is to do species occupancy modeling with the “unmarked” package, but before I get to that stage I need to select the variables that best explain my occupancy, since I have many. To gain some understanding of the randomForest package, I generated a fake occupancy dataset and a fake variable dataset (with variables A and D being good predictors of my occupancy and B and C being bad predictors). When I run randomForest, my output looks like this:
0 1 MeanDecreaseAccuracy MeanDecreaseGini
A 25.3537667 27.75533 26.9634018 20.6505920
B 0.9567857 0.00000 0.9665287 0.0728273
C 0.4261638 0.00000 0.4242409 0.1411643
D 32.1889374 35.52439 34.0485837 27.0691574
OOB estimate of error rate: 29.02%
Confusion matrix:
0 1 class.error
0 250 119 0.3224932
1 0 41 0.0000000
I did not make separate training and test sets; I put extra weight on the model to correctly predict the “1”s, and the variables are scaled.
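For reference, here is a minimal sketch of the kind of call I am running. The data here is simulated, and using classwt is just one way of putting extra weight on the "1"s, so treat the details as assumptions rather than my exact setup:
library(randomForest)
set.seed(1)
n <- 410
A <- rnorm(n)
D <- -A                                          # D is perfectly correlated with A (its inverse)
B <- rnorm(n)
C <- rnorm(n)
occ <- factor(rbinom(n, 1, plogis(2 * A - 2)))   # fake occupancy, mostly 0s
vars <- scale(data.frame(A, B, C, D))            # scaled predictors
rf <- randomForest(x = vars, y = occ,
                   importance = TRUE,            # needed for MeanDecreaseAccuracy
                   classwt = c(1, 5))            # extra weight on correctly predicting the 1s
importance(rf)   # permutation importance is re-estimated on each run, so the values shift
rf$confusion     # OOB confusion matrix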
I understand that this output tells me that A and D are important variables because they have high MeanDecreaseAccuracy values. However, D is the inverse of A (they are perfectly correlated) so why does D have a higher MeanDecreaseAccuracy value?
Moreover, when I run the randomForest with only A and D as variables, these values change while the confusion matrix stays the same:
0 1 MeanDecreaseAccuracy MeanDecreaseGini
A 28.79540 29.77911 29.00879 23.58469
D 29.75068 30.79498 29.97520 24.53415
OOB estimate of error rate: 29.02%
Confusion matrix:
0 1 class.error
0 250 119 0.3224932
1 0 41 0.0000000
When I run the model with only 1 good predictor (A or D) or with a good and bad predictor (AB or CD) the confusion matrix stays the same but the MeanDecreaseAccuracy values of my predictors change.
Why do these values change and how should I approach the selection of my variables? (I am a beginner in occupancy modeling).
Thanks a lot!

Random sampling with normal distribution from existing data in R

I have a large dataset of individuals who have rated some items (x1:x10). For each individual, the ratings have been combined into a total score (ranging from 0 to 5). Now, I would like to draw two subsamples with the same sample size, in which the total score has a specific mean (1.5 and 3) and follows a normal distribution. Individuals might be part of both subsamples.
My guess was that sampling from a vector (the total score) with the outlined specifications would work. Unfortunately, I have only found different ways to draw random samples from a vector, but not a way to sample around a specific mean.
EDIT:
As pointed out, a normal distribution would not be possible. Instead, I am looking for a way to sample following a binomial distribution (directly from the data, without the workaround of creating a similar distribution and matching).
You can't have normally distributed data on a discrete scale with hard limits. A sample drawn from a normal distribution with a mean between 0 and 5 would be symmetrical around the mean, would take continuous rather than discrete values, and would have a non-zero probability of containing values of less than zero and more than 5.
You want your sample to contain discrete values between zero and five and to have a central tendency around the mean. To emulate scores with a particular mean you need to sample from the binomial distribution using rbinom.
get_n_samples_averaging_m <- function(n, m) {
  # n draws from Binomial(5, m/5): discrete scores between 0 and 5 with expected value m
  rbinom(n, 5, m / 5)
}
Now you can do
samp <- get_n_samples_averaging_m(40, 1.5)
print(samp)
# [1] 1 3 2 1 3 2 2 1 1 1 1 2 0 3 0 0 2 2 2 3 1 1 1 1 1 2 1 2 0 1 4 2 0 2 1 3 2 0 2 1
mean(samp)
# [1] 1.5
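And the same helper gives the second subsample, with a target mean of 3 (the exact mean will vary from draw to draw):
samp2 <- get_n_samples_averaging_m(40, 3)
mean(samp2)
# close to 3 on average; the exact value varies from draw to draw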

two phase sample weights

I am using the survey package to analyze data from a two-phase survey. The first phase included ~5000 cases, and the second ~2700. Weights were calculated beforehand to adjust for several variables (ethnicity, sex, etc.) AND (as far as I understand) for the decrease in sample size when performing phase 2.
I'm interested in proportions of binary variables, e.g. suicides in sick vs. healthy individuals.
An example of a simple output I receive in the overall sample:
table(df$schiz, df$suicide)
0 1
0 4857 8
1 24 0
An example of a simple output I receive in the second phase sample only:
table(df2$schiz, df2$suicide)
0 1
0 2685 5
1 24 0
And within the phase two sample with the weights included:
dfw <- svydesign(ids = ~1, data = df2, weights = df2$weights)
svytable(~schiz + suicide, design = dfw)
suicide
schiz 0 1
0 2701.51 2.67
1 18.93 0.00
My question is: Shouldn't the weights correct for the decrease in N when moving from phase 1 to phase 2? i.e. shouldn't the total N of the second table after correction be ~5000 cases, and not ~2700 as it is now?
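As a sanity check (just a sketch using the objects above, and assuming no missing values in schiz or suicide), I can look at what the weights actually sum to, since the weighted table total is simply the sum of the supplied weights:
sum(df2$weights)                               # what the supplied weights add up to
sum(svytable(~schiz + suicide, design = dfw))  # total of the weighted table, essentially the same number
# if this is ~2700 the weights were scaled to the phase-2 sample size; if ~5000, to the phase-1 total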

Making zero-inflated or hurdle model with R

I need to build a model that estimates the probability that a registered user will buy some plan or no plan (i.e., will just use the free plan or won't do anything) and, if they do buy, after what time.
I have data with around 13 000 rows; around 12 000 of them are free users (never paid, value 0) and the other 1 000 paid after some time (from 1 to 690 days). I also have some count and categorical data: country, number of the user's clients, how many times they used the plan, and plan (premium, free, premium plus).
The mean of the time variable is around 6.37 and the variance is 1801.17; without the zeros they are 100 and 19012, which suggests to me that I should use a negative binomial model.
But I'm not sure which model fits best; I'm thinking about a zero-inflated negative binomial or hurdle model.
Here is a histogram of diff.time, with and without the zero values:
I tried these models with the pscl package:
summary(m1 <- zeroinfl(diff.time3 ~
    factor(Registration.country) + factor(Plan) + Campaigns.sent + Number.of.subscribers |
    factor(Registration.country) + factor(Plan) + Campaigns.sent + Number.of.subscribers,
  data = df, link = "logit", dist = "negbin"))
or the same with hurdle()
but they gave me an error:
Error in quantile.default(x$residuals): missing values and NaN's not allowed if 'na.rm' is FALSE
In addition: Warning message:
glm.fit: algorithm did not converge
with hurdle():
Error in solve.default(as.matrix(fit_count$hessian)) : Lapack routine dgesv: system is exactly singular: U[3,3] = 0
I have never tried these models before so I'm not sure how to fix these errors or if I chose the right models.
Unfortunately, I have no opportunity to share some part of my data, but I'll try to describe it:
1st column "plan" - most of the data are "free" (around 12 000), also "Earning more", "Premium" or "Premium trial", where "free" and "premium trial" are not paid.
2nd column "Plan used" - around 8 000 rows are 0, 1 000 are 1, 3 000 are from 1 to 10, and another 1 000 from 10 to 510.
3rd column "Clients" describes how many clients a user has - around 2 000 have 0, 4 000 have 1-10, 3 000 have 10-200, 2 000 have 200-1000, and 2 000 have 1000-340 000.
4th column "registration country" - 36 different countries; over half of the data is United States, the others have from 5 to a few hundred rows.
5th column is diff.time, which should be my dependent variable; as I said before, most of the data are 0 (12 000) and the others vary from 1 day to 690 days.
If your actual data is similarly structured to the data you posted then you will have problems estimating a model like the one you specified. Let's first have a look at the data you posted on the Google drive:
load("duom.Rdata")
table(a$diff.time3 > 0)
## FALSE TRUE
## 950 50
Thus there is some variation in the response, but not a lot: you have only 5% non-zeros, 50 observations overall. From this information alone it might seem more reasonable to estimate a bias-reduced binary model (brglm) for the hurdle part (zero vs. non-zero).
For the zero-truncated count part you can possibly fit a model but you need to be careful which effects you want to include because there are only 50 degrees of freedom. You can estimate the zero-truncated part of the hurdle model using the zerotrunc function in package countreg, available from R-Forge.
Also you should clean up your factors. By re-applying the factor function within the formula, levels with zero occurrences are excluded. But there are also levels with only one occurrence for which you will not get meaningful results.
table(factor(a$Plan))
## Earning much more Free Mailing Premium
## 1 950 1 24
## Premium trial
## 24
table(factor(a$Registration.country))
## australia Australia Austria Bangladesh Belgium brasil Brasil
## 1 567 7 5 56 1 53
## Bulgaria Canada
## 10 300
Also, you need to clean up the country levels with all lower-case letters.
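For example, something along these lines (just a sketch; adjust to however you want the levels spelled):
a$Registration.country <- factor(tolower(trimws(a$Registration.country)))  # merges "Australia"/"australia", "Brasil"/"brasil", etc.
a$Plan <- factor(a$Plan)                                                    # re-applying factor() drops empty levels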
After that I would start out by building a binary GLM for zero vs. non-zero - and based on those results continue with the zero-truncated count part.
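A sketch of that two-step approach (the covariates here are only placeholders; with just 50 non-zero observations the count part has to stay small):
a$paid <- as.integer(a$diff.time3 > 0)
# hurdle part: zero vs. non-zero, with bias reduction
library(brglm)
m_bin <- brglm(paid ~ Plan + Number.of.subscribers, family = binomial, data = a)
# zero-truncated count part, fitted to the non-zero observations only
library(countreg)   # available from R-Forge
m_pos <- zerotrunc(diff.time3 ~ Plan + Number.of.subscribers,
                   dist = "negbin", data = subset(a, diff.time3 > 0))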

No zeros predicted from zeroinfl object in R?

I created a zero-inflated negative binomial model and want to investigate how many of the zeros were partitioned out to sampling or structural zeros. How do I implement this in R? The example code on the zeroinfl help page is not clear to me.
data("bioChemists", package = "pscl")
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
table(round(predict(fm_zinb2, type="zero")))
> 0 1
> 891 24
table(round(bioChemists$art))
> 0 1 2 3 4 5 6 7 8 9 10 11 12 16 19
> 275 246 178 84 67 27 17 12 1 2 1 1 2 1 1
What is this telling me?
When I do the same for my data, I get a readout that just has the sample size listed under the 1. Why is that? Thanks
The details are in the paper by Zeileis (2008) available at https://www.jstatsoft.org/article/view/v027i08/v27i08.pdf
It's a little bit of work (a couple of years on, your question was still unanswered) to gather together all the explanations of what the predict function does for each model in the pscl library, and it's buried (pp. 19, 23) in the mathematical expression of the likelihood function (equations 7 and 8). I have interpreted your question to mean that you want/need to know how to use the different types of predict:
What is the expected count? (type="response")
What is the (conditional) expected probability of an excess zero? (type="zero")
What is the (marginal) expected probability of any count? (type="prob")
And finally how many predicted zeros are excess (eg sampling) rather than regression based (ie structural)?
To read in the data that comes with the pscl package:
data("bioChemists", package = "pscl")
Then fit a zero-inflated negative binomial model:
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
If you wish to predict the expected values, then you use
predict(fm_zinb2, type="response")[29:31]
29 30 31
0.5213736 1.7774268 0.5136430
So under this model, the expected number of articles published in the last 3 years of a PhD is one half for biochemists 29 and 31 and nearly 2 for biochemist 30.
But I believe that you are after the probability of an excess zero (in the point mass at zero). This command does that and prints off the values for items in row 29 to 31 (yes I went fishing!):
predict(fm_zinb2, type="zero")[29:31]
It produces this output:
29 30 31
0.58120120 0.01182628 0.58761308
So the probability that the 29th item is an excess zero (which you refer to as a sampling zero, i.e. a non-structural zero and hence not explained by the covariates) is 58%, for the 30th it is 1.2%, and for the 31st it is 59%. So that's two biochemists who are predicted to have zero publications, and this is in excess of those that can be explained by the negative binomial regression on the various covariates.
And you have tabulated these predicted probabilities across the whole dataset
table(round(predict(fm_zinb2, type="zero")))
0 1
891 24
So your output tells you that only 24 biochemists were likely to be an excess zero, ie with a predicted probability of an excess zero that was over one-half (due to rounding).
It would perhaps be easier to interpret if you tabulated into bins of 10 points on the percentage scale
table(cut(predict(fm_zinb2, type="zero"), breaks=seq(from=0,to=1,by=0.1)))
to give
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
751 73 34 23 10 22
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
2 0 0 0
So you can see that 751 biochemists were unlikely to be an excess zero, but 22 biochemists have a chance of between 50-60% of being an excess zero, and only 2 have a higher chance (60-70%). No one was extremely likely to be an excess zero.
Graphically, this can be shown in a histogram
hist(predict(fm_zinb2, type="zero"), col="slateblue", breaks=seq(0,0.7,by=.02))
You tabulated the actual number of counts per biochemist (no rounding necessary, as these are counts):
table(bioChemists$art)
0 1 2 3 4 5 6 7 8 9 10 11 12 16 19
275 246 178 84 67 27 17 12 1 2 1 1 2 1 1
Who is the special biochemist with 19 publications?
most_pubs <- max(bioChemists$art)
most_pubs
extreme_biochemist <- bioChemists$art==most_pubs
which(extreme_biochemist)
You can obtain the estimated probability that each biochemist has any number of pubs, exactly 0 and up to the maximum, here an unbelievable 19!
preds <- predict(fm_zinb2, type="prob")
preds[extreme_biochemist,]
and you can look at this for our one special biochemist, who had 19 publications (using base R plotting here, but ggplot is more beautiful)
expected <- predict(fm_zinb2, type="response")[extreme_biochemist]
# barplot returns the midpoints for counts 0 up to 19
midpoints <- barplot(preds[extreme_biochemist, ],
                     xlab = "Predicted #pubs",
                     ylab = "Relative chance among biochemists")
# add 1 because the first count is 0
abline(v = midpoints[19 + 1], col = "red", lwd = 3)
abline(v = midpoints[round(expected) + 1], col = "yellow", lwd = 3)
and this shows that although we expect 4.73 publications for biochemist 915, under this model more likelihood is given to 2-3 pubs, nowhere near the actual 19 pubs (red line).
Getting back to the question, for biochemist 29,
the probability of an excess zero is
pzero <- predict(fm_zinb2, type="zero")
pzero[29]
29
0.5812012
The probability of a zero, overall (marginally) is
preds[29,1]
[1] 0.7320871
So the proportion of predicted probability of a zero that is excess versus structural (ie explained by the regression) is:
pzero[29]/preds[29,1]
29
0.7938962
Or the additional probability of a zero, beyond the chance of an excess zero is:
preds[29,1] - pzero[29]
29
0.1508859
The actual number of publications for biochemist 29 is
bioChemists$art[29]
[1] 0
So the predicted chance that this biochemist has zero publications is only partly explained by the regression (about 20%) and mostly not (i.e. excess, about 80%).
And overall, we see that for most biochemists, this is not the case. Our biochemist 29 is unusual, since their chance of zero pubs is mostly excess, ie inexplicable by the regression. We can see this via:
hist(pzero/preds[,1], col="blue", xlab="Proportion of predicted probability of zero that is excess")
which gives you a histogram of that ratio across all biochemists.
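To summarise the same decomposition over the whole sample, here is a short sketch using the objects created above:
p_zero_marg <- preds[, 1]            # marginal P(zero) for each biochemist
p_zero_excess <- pzero               # probability of an excess (point-mass) zero
sum(p_zero_marg)                     # expected number of zeros under the model (compare with the 275 observed)
sum(p_zero_excess)                   # expected zeros coming from the excess part
sum(p_zero_marg - p_zero_excess)     # expected zeros coming through the count regression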
