I have a dataset where weather = 0 if the temperature is below 65 degrees Fahrenheit, weather = 1 if the temperature is between 65 and 68 degrees, and weather = 2 if the temperature is above 68 degrees. I need to estimate the probability that the temperature satisfies 65 <= temp < 68 degrees Fahrenheit, given days = 20. Here are the formula and output:
multinom(formula = weather ~ days, data = USWeather13)
Which gives the coefficient table:
Coefficients:
  (Intercept)   days
1       5.142 -0.252
2      25.120  0.343
Std. Errors:
  (Intercept)  days
1       1.742 0.007
2       1.819 0.004
Does anyone know how I can interpret this or figure out this problem?
In your example, weather = 0 is the reference level, and the coefficients give the log-odds of weather = 1 or weather = 2 relative to weather = 0 as a function of your predictor days.
You haven't given complete information, but reading your coefficients: for every one-unit increase in days, the log-odds of weather = 1 vs weather = 0 decreases by 0.252, and the log-odds of weather = 2 vs weather = 0 increases by 0.343.
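In other words, using the estimates above, the fitted model is
$$\log\frac{P(\text{weather}=1)}{P(\text{weather}=0)} = 5.142 - 0.252\,\text{days}, \qquad \log\frac{P(\text{weather}=2)}{P(\text{weather}=0)} = 25.120 + 0.343\,\text{days}.$$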
If you need to figure out the respective probabilities at days = 20, you can do:
fit = multinom(formula = weather ~ days, data = USWeather13)
predict(fit,newdata=data.frame(days=20),type="prob")
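Equivalently, here is a sketch of how these probabilities can be computed by hand from the coefficient table above; p1 is the probability that the temperature is in the 65-68 range (weather = 1):
eta1 <- 5.142 - 0.252 * 20   # log-odds of weather = 1 vs weather = 0 at days = 20
eta2 <- 25.120 + 0.343 * 20  # log-odds of weather = 2 vs weather = 0 at days = 20
denom <- 1 + exp(eta1) + exp(eta2)
c(p0 = 1/denom, p1 = exp(eta1)/denom, p2 = exp(eta2)/denom)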
I think this website might provide a good guide on how to interpret the coefficients.
I am interested in estimating a mixed effects model with two random components (sorry for the somewhat imprecise notation; I am fairly new to these kinds of models). I also want the standard errors of the variances of the random components, which is why I am somewhat bound to the nlme package: I found a description of how to calculate those standard errors and, also interesting, the standard error for a function of these variances (link).
I believe I know how to use lmer from the lme4 package. I am ultimately interested in model2. For model1, both commands yield the same estimates, but model2 with lme yields different results than model2 with lmer. Could you help me figure out how to set up the random components for lme? This would be much appreciated. Thanks. Please find my MWE below.
Best
Daniel
#### load all packages #####
loadpackage <- function(x) {
  for (i in x) {
    # require() returns TRUE invisibly if it was able to load the package
    if (!require(i, character.only = TRUE)) {
      # if the package could not be loaded, (re-)install it
      install.packages(i, dependencies = TRUE)
    }
    # load the package (after installing)
    library(i, character.only = TRUE)
  }
}
# Then try/install packages...
loadpackage(c("nlme", "msm", "lmeInfo", "lme4"))
alcohol1 <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", header=T, sep=",")
attach(alcohol1)
id <- as.factor(id)
age <- as.factor(age)
model1.lmer <-lmer(alcuse ~ 1 + peer + (1|id))
summary(model1.lmer)
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age))
summary(model2.lmer)
model1.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id, method ="REML")
summary(model1.lme)
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id + 1|age, method ="REML")
Edit (15/09/2021):
Estimating the model as follows and then returning the estimates via nlme::VarCorr gives me different results. While the estimates seem to be in the ballpark, it is as if they are switched across components.
model2a.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2a.lme)
nlme::VarCorr(model2a.lme)
Variance StdDev
id = pdLogChol(1)
(Intercept) 0.38390274 0.6195989
age = pdLogChol(1)
(Intercept) 0.47892113 0.6920413
Residual 0.08282585 0.2877948
EDIT (16/09/2021):
Since Bob pushed me to think more about my model, I want to give some additional information. Please note that the data I use in the MWE do not match my true data; I only use them for illustrative purposes, since I cannot upload my true data. I have a household panel with income, demographic information and parent indicators.
I am interested in intergenerational mobility. Sibling correlations of permanent income are one industry standard. At the very least, contemporaneous observations are very poor proxies of permanent income: due to transitory shocks, i.e., classical measurement error, those estimates are almost certainly attenuated. For this reason, we exploit the longitudinal dimension of our data.
For sibling correlations, this amounts to hypothesising that the income process is as follows:
$$Y_{ijt} = \beta X_{ijt} + \epsilon_{ijt}.$$
with Y being income of individual i from family j in year t. X comprises age and survey-year indicators to account for life-cycle effects and macroeconomic conditions in survey years. Epsilon is a compound term comprising a random individual component, a random family component, and a transitory component (measurement error and short-lived shocks). It looks as follows:
$$\epsilon_{ijt} = \alpha_i + \gamma_j + \eta_{ijt}.$$
The variance of income is then:
$$\sigma^2_\epsilon = \sigma^2_\alpha + \sigma^2_\gamma + \sigma^2_\eta.$$
The quantity we are interested in is
$$\rho = \frac{\sigma^2_\gamma}{\sigma^2_\alpha + \sigma^2_\gamma},$$
which reflects the share of the variation in permanent income that is due to family background (and other characteristics) shared among siblings.
By the way: the struggle is simply because I want standard errors for all estimates, including one for $\rho$.
This is an example of crossed vs nested random effects. (Note that the example you refer to is fitting a different kind of model, a random-slopes model rather than a model with two different grouping variables ...)
If you try with(alcohol1, table(age,id)) you can see that every id is associated with every possible age (14, 15, 16). Or subset(alcohol1, id==1) for example:
id age coa male age_14 alcuse peer cpeer ccoa
1 1 14 1 0 0 1.732051 1.264911 0.2469111 0.549
2 1 15 1 0 1 2.000000 1.264911 0.2469111 0.549
3 1 16 1 0 2 2.000000 1.264911 0.2469111 0.549
There are three possible models you could fit with random effects of age (indexed by i) and id (indexed by j); see the lmer sketch further below:
crossed ((1|age) + (1|id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_j +epsr_{ij}; alcohol use varies among individuals and, independently, across ages (this model won't work very well because there are only three distinct ages in the data set, more levels are usually needed)
id nested within age ((1|age/id) = (1|age) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_{ij} + epsr_{ij}; alcohol use varies across ages, and varies across individuals within ages (see note above about number of levels).
age nested within id ((1|id/age) = (1|id) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_j + eps2_{ij} + epsr_{ij}; alcohol use varies across individuals, and varies across ages within individuals
Here the eps1, eps2, and epsr terms are independent normal deviates; epsr is the residual error term.
The latter two models actually don't make sense in this case; because there is only one observation per age/id combination, the nested variance (eps2) is completely confounded with the residual variance (epsr). lme doesn't notice this; if you try to fit one of the nested models in lmer it will give an error that
number of levels of each grouping factor must be < number of observations (problems: id:age)
(Although if you try to compute confidence intervals based on model1.lme you'll get an error "cannot get confidence intervals on var-cov components: Non-positive definite approximate variance-covariance", which is a hint that something is wrong.)
You could restate this problem as saying that the residual variation, and the variation among ages within individuals, are jointly unidentifiable (can't be separated from each other, statistically).
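For concreteness, here is a sketch of the three specifications in lmer syntax using the alcohol1 data from the question (the two nested fits will stop with the error quoted above, since id:age has one observation per level):
library(lme4)
# crossed: independent random intercepts for id and age
m_crossed <- lmer(alcuse ~ 1 + peer + (1 | id) + (1 | age), data = alcohol1)
# id nested within age: errors, nested variance confounded with the residual
m_id_in_age <- lmer(alcuse ~ 1 + peer + (1 | age/id), data = alcohol1)
# age nested within id: same problem
m_age_in_id <- lmer(alcuse ~ 1 + peer + (1 | id/age), data = alcohol1)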
The updated answer here shows how to get the standard errors of the variance components from an lmer model, so you shouldn't be stuck with lme (but you should think carefully about which model you're really trying to fit ...)
The GLMM FAQ might also be useful.
More generally, the standard error of
rho = (V_gamma)/(V_alpha + V_gamma)
will be hard to compute accurately, because this is a nonlinear function of the model parameters. You can apply the delta method, but the most reliable approach would be to use parametric bootstrapping: if you have a fitted model m, then something like this should work:
var_ratio <- function(m) {
  ## named vector of variance components, one per grouping factor
  ## (assumes intercept-only random effects)
  v <- sapply(VarCorr(m), as.numeric)
  ## "family" and "id" are the names of the two grouping factors in the fitted model
  unname(v["family"] / (v["family"] + v["id"]))
}
confint(m, method = "boot", FUN = var_ratio)
You should specify the random effects in lme using / rather than +.
With lmer:
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age), data = alcohol1)
summary(model2.lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: alcuse ~ 1 + peer + (1 | id) + (1 | age)
Data: alcohol1
REML criterion at convergence: 651.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.0228 -0.5310 -0.1329 0.5854 3.1545
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.08078 0.2842
age (Intercept) 0.30313 0.5506
Residual 0.56175 0.7495
Number of obs: 246, groups: id, 82; age, 82
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.3039 0.1438 2.113
peer 0.6074 0.1151 5.276
Correlation of Fixed Effects:
(Intr)
peer -0.814
With lme:
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2.lme)
Linear mixed-effects model fit by REML
Data: alcohol1
AIC BIC logLik
661.3109 678.7967 -325.6554
Random effects:
Formula: ~1 | id
(Intercept)
StdDev: 0.4381206
Formula: ~1 | age %in% id
(Intercept) Residual
StdDev: 0.4381203 0.7494988
Fixed effects: alcuse ~ 1 + peer
Value Std.Error DF t-value p-value
(Intercept) 0.3038946 0.1438333 164 2.112825 0.0361
peer 0.6073948 0.1151228 80 5.276060 0.0000
Correlation:
(Intr)
peer -0.814
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.0227793 -0.5309669 -0.1329302 0.5853768 3.1544873
Number of Observations: 246
Number of Groups:
id age %in% id
82 82
Okay, finally. Just to sketch my confidential data: I have a panel of individuals. The data include siblings, identified via mnr. income is earnings, wavey are survey-year factors, age are age factors, female is a factor for gender, and pid is the factor identifying the individual.
m1 <- lmer(income ~ age + wavey + female + (1|pid) + (1 | mnr),
data = panel)
vv <- vcov(m1, full = TRUE)
covvar <- vv[58:60, 58:60]
covvar
3 x 3 Matrix of class "dgeMatrix"
cov_pid.(Intercept) cov_mnr.(Intercept) residual
[1,] 2.6528679 -1.4624588 -0.4077576
[2,] -1.4624588 3.1015001 -0.0597926
[3,] -0.4077576 -0.0597926 1.1634680
mean <- as.data.frame(VarCorr(m1))$vcov
mean
[1] 17.92341 16.86084 56.77185
deltamethod(~ x2/(x1+x2), mean, covvar, ses =TRUE)
[1] 0.04242089
The last scalar is the standard error of the quantity I interpret as the share of the variation in permanent income due to the background shared among siblings.
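The point estimate of $\rho$ itself can be read off the same VarCorr output (a sketch, assuming the component order pid, mnr, Residual shown above; 'mean' is the vector of variance components defined above):
rho_hat <- mean[2] / (mean[1] + mean[2])  # sigma^2_mnr / (sigma^2_pid + sigma^2_mnr)
rho_hat                                   # roughly 0.48 with the variances printed above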
Thanks to @Ben Bolker, who pointed me in this direction.
Using the randomForest package:
#install.packages("randomForest")
library(randomForest)
I used code from an online tutorial to run a random forest on my system. I got a model with a confusion matrix, accuracy, etc.
My data are split into training and validation sets. I got the data from here:
https://archive.ics.uci.edu/ml/machine-learning-databases/car/
I divided it in a 70%/30% ratio (training and validation, respectively).
Then I ran a model on it.
The model output showed that around 30 training observations were misclassified across the classes of the response variable the random forest was predicting.
Below is a sample of the data:
BuyingPrice Maintenance NumDoors NumPersons Bootspace Safety Condition
vhigh low 4 4 med low unacc
vhigh med 2 4 med high acc
vhigh med 2 more small high unacc
vhigh high 3 4 big high unacc
vhigh med 4 more small med unacc
low low 2 more med med acc
The random forest was run to predict the last variable, "Condition".
Below is the model summary:
Call:
randomForest(formula = Condition ~ ., data = TrainSet, ntree = 500,
mtry = 6, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 6
OOB estimate of error rate: 2.48%
Confusion matrix:
acc good unacc vgood class.error
acc 244 4 6 2 0.04687500
good 3 44 1 0 0.08333333
unacc 11 1 843 0 0.01403509
vgood 2 0 0 47 0.04081633
If we take the first row of the table above, we see that the class "acc" had 244 correct predictions (95%) and 12 wrong ones.
Similarly, "good" had 44 correct predictions (91%) and 4 wrong ones, and so on for the other two classes.
The total number of wrong predictions is 30 (12 + 4 + 12 + 2).
So, technically, the predicted values from this model should differ from the actual values in 30 places.
Now I tried getting the predicted values by two methods:
1. First method: model2$predicted
2. Second method: predTrain <- predict(model2, TrainSet, type = "class")
The first method gives me a predicted value set that differs from the actual values in 30 places, while the second method gives me a set that exactly matches the actual values.
I think the first method is correct, but the author of the post linked below used the second one.
https://www.r-bloggers.com/how-to-implement-random-forests-in-r/
I'm not sure where my understanding is going wrong.
Please help.
PS: I know a similar question has been asked, but I feel that both that question and its answers were not sufficiently elaborate or easy for me to follow. That's why I asked a new question.
SAMPLE CODE
set.seed(100)
train <- sample(nrow(data1),0.7*nrow(data1),replace=FALSE)
TrainSet <- data1[train,]
ValidSet <- data1[-train,]
model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry=6,
importance = TRUE)
predTrain <- predict(model2, TrainSet, type = "class")
new1 <- data.frame(actual = TrainSet$Condition, predicted = predTrain)
new2 <- data.frame(actual = TrainSet$Condition, predicted =
model2$predicted)
new1$third <- 0
for(i in 1:nrow(new1))
{
if(new1[i,1] == new1[i,2])
{
new1[i,3] = 1
}else{
new1[i,3] = 0
}
}
new2$third <- 0
for(i in 1:nrow(new2))
{
if(new2[i,1] == new2[i,2])
{
new2[i,3] = 1
}else{
new2[i,3] = 0
}
}
Thanks,
Abhay
According to the documentation of the randomForest function:
predicted: the predicted values of the input data based on out-of-bag samples.
So the predicted value of an observation is obtained with a model that does not use this observation.
The predict function applies the fitted model to new data and does not know that these observations were used for training, so each observation ends up being used both for fitting and for prediction.
You should use the predicted component, since each of its values is computed without the corresponding observation having been used for training.
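A quick way to see the difference, as a sketch reusing model2 and TrainSet from the question:
# OOB predictions: each training row is predicted only by trees that did not see it
table(TrainSet$Condition, model2$predicted)           # reproduces the OOB confusion matrix (~30 errors)
# resubstitution predictions: the forest has effectively memorised the training rows
table(TrainSet$Condition, predict(model2, TrainSet))  # nearly perfect, hence the exact match you observed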
I'm doing logistic regression on the Boston data with a column high.medv (yes/no) which indicates whether the median house price given by the column medv is more than 25 or not.
Below is my code for logistic regression.
library(MASS) # Boston data set (assuming the MASS version)
high.medv <- ifelse(Boston$medv > 25, "Y", "N") # apply the condition to medv and store the result in a new variable, high.medv
ourBoston <- data.frame (Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of data <- Train
train2<- subset(ourBoston,sample==TRUE)
# 30% will be Test
test2<- subset(ourBoston, sample==FALSE)
glm.fit <- glm (high.medv ~ lstat,data = train2, family = binomial)
summary(glm.fit)
The output is as follows:
Deviance Residuals:
[1] 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.57 48196.14 0 1
lstat NA NA NA NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.0000e+00 on 0 degrees of freedom
Residual deviance: 3.1675e-10 on 0 degrees of freedom
AIC: 2
Number of Fisher Scoring iterations: 21
Also, I need the following: I'm required to use the misclassification rate as the measure of error for two cases:
using lstat as the predictor, and
using all predictors except high.medv and medv.
but I am stuck at the regression itself.
With every classification algorithm, the art lies in choosing the threshold at which you decide whether the result is positive or negative.
When you predict outcomes on the test data set, you estimate probabilities of the response variable being 1 or 0. You therefore need to tell the model where to cut, i.e., the threshold at which a prediction becomes 1 or 0.
A high threshold is more conservative about labeling a case as positive, which makes it less likely to produce false positives and more likely to produce false negatives. The opposite happens for low thresholds.
The usual procedure is to plot the rates that interest you, e.g., the true positive rate against the false positive rate, and then choose the threshold that gives the best trade-off for your problem.
set.seed(666)
# simulation of logistic data
x1 = rnorm(1000) # some continuous variables
z = 1 + 2*x1 # linear combination with a bias
pr = 1/(1 + exp(-z)) # pass through an inv-logit function
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1)
df$train = 0
df$train[sample(1:(2*nrow(df)/3))] = 1 # flags the first two thirds of rows as training (the data are simulated, so this split is effectively random)
df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities
dat = df[df$train==0,] # test data
To use the misclassification error to evaluate your model, you first need to set a threshold. For that, you can use the roc function from the pROC package, which calculates the rates and provides the corresponding thresholds:
library(pROC)
rates =roc(dat$y, dat$new_y)
plot(rates) # visualize the trade-off
rates$specificity # shows the ratio of true negative over overall negatives
rates$thresholds # shows you the corresponding thresholds
dat$jj = as.numeric(dat$new_y>0.7) # using 0.7 as a threshold to indicate that we predict y = 1
table(dat$y, dat$jj) # provides the miss classifications given 0.7 threshold
0 1
0 86 20
1 64 164
The accuracy of your model can then be computed as the ratio of the number of observations you classified correctly to the size of your sample; the misclassification rate is its complement.
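For example, continuing from dat above, the misclassification rate and accuracy at the 0.7 threshold are:
misclass <- mean(dat$jj != dat$y)  # from the table: (20 + 64) / (86 + 20 + 64 + 164), about 0.25
accuracy <- 1 - misclass           # equivalently mean(dat$jj == dat$y)
c(misclassification = misclass, accuracy = accuracy)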
I'm currently trying to use the predict.averaging function in MuMIn to create some graphs from model averaging I've done on some GLMMs. I'm interested in whether the number of insects caught per daylight hour in some traps changes when the traps are left out for different lengths of time, so I included offset(log(Daylight)) in my GLMMs to account for this. But when I use the predict function it doesn't seem to take the offset into account: I get the same graph I would get if I hadn't included the offset in the first place. I know the offset is having an effect, though, based on the output of my model-averaged GLMMs, and it's the kind of effect I would expect from my data.
Does anyone know why this problem might be and how I might make predict.averaging take the offset into account? I've included the code that I'm using below:
# global model for total insect abundance
glmm11 <- glmmadmb(Total_polls ~ Max_temp+Wind+Precipitation+Veg_height+Season+Year+log(Mean.nectar+1)+I(log(Nectar+1)-log(Mean.nectar+1))+Pan_colour*Assoc_col+Treatment*Area*Depth+(1|Transect)+offset(log(Daylight)), data = ab, zeroInflation = FALSE, family = "nbinom")
# make predictions based on model averaging output (subset delta < 2)
preds<-predict(ave21, full = F, type = "response", backtransform = FALSE) # on the response scale
Here ave21 is a model-averaging object generated using pdredge and model.avg, constrained to keep the offset in every model: model11 <- pdredge(glmm11, cluster = clust, fixed = ~offset(log(Daylight))+(1|Transect)). The object itself looks like this:
Call:
model.avg(object = get.models(object = model11, subset = delta <
2))
Component model call:
glmmadmb(formula = Total_polls ~ <3 unique rhs>, data = ab, family = nbinom,
zeroInflation = FALSE)
Component models:
df logLik AICc delta weight
1/2/3/4/5/6/7/8/9/10/11/12 20 -864.14 1769.22 0.00 0.47
1/2/3/4/5/6/7/8/9/10/11/12/13 23 -861.39 1770.03 0.81 0.31
1/3/4/5/6/7/8/9/10/11/12 19 -865.97 1770.79 1.57 0.21
Term codes:
Area Assoc_col
1 2
Depth I(log(Nectar + 1) - log(Mean.nectar + 1))
3 4
Max_temp Pan_colour
5 6
Season Treatment
7 8
Year log(Mean.nectar + 1)
9 10
offset(log(Daylight)) Area:Depth
11 12
Assoc_col:Pan_colour
13
Which I then used to get predictions:
pred_results<-cbind(glmm21$frame, preds) # append original dataframe to predictions
plot(pred_results$preds~pred_results$Treatment) # Treatment = trap duration (hours)
This code may go around the houses a little, as I borrowed it from a fellow PhD student. The graph I get when I plot my predictions looks like this: [Model predictions vs. trap duration (hours)][1], which is very different from the picture given by the summary of my model averaging:
(conditional average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) -5.896725 0.948102 0.949386 6.211 < 2e-16 ***
Treatment24 -0.714283 0.130226 0.130403 5.478 < 2e-16 ***
Treatment48 -0.983881 0.122416 0.122582 8.026 < 2e-16 ***
Any help would be great, as I can't find any specific instances where this has been addressed on the site to date. Thank you in advance and please let me know if you need me to add anything to make this question better.
Tom
[1]:
https://i.stack.imgur.com/Pn4dK.jpg
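For context on how offsets normally enter predictions, here is a generic glm sketch with made-up variable names (not the glmmadmb/MuMIn pipeline above): when the offset is written inside the formula, predict() evaluates it from newdata, so the offset variable has to be present in newdata for it to affect the predictions.
# hypothetical Poisson count model with an exposure offset
set.seed(1)
d <- data.frame(hours = runif(100, 5, 15), x = rnorm(100))
d$count <- rpois(100, lambda = d$hours * exp(0.2 + 0.5 * d$x))
fit <- glm(count ~ x + offset(log(hours)), data = d, family = poisson)
# the predicted count scales with 'hours' even though x is held fixed
predict(fit, newdata = data.frame(x = 0, hours = c(6, 12)), type = "response")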
I implemented logistic regression on the following data frame and got reasonable results (the same as STATA). But the Pearson chi-square and degrees of freedom I get from R are very different from STATA's, which in turn gives me a very small p-value. I also cannot get the area under the ROC curve. Could anyone help me figure out why residuals() does not work as expected on a glm() with prior weights, and how to deal with the area under the ROC curve?
Following is my code and output.
1. Data
Here is my data frame test_data; y is the outcome, x1 and x2 are covariates:
y x1 x2 freq
0 No 0 268
0 No 1 14
0 Yes 0 109
0 Yes 1 1
1 No 0 31
1 No 1 6
1 Yes 0 45
1 Yes 1 6
I generated this data frame from the original data by counting the occurrences of each covariate pattern and storing the count in the new variable freq.
2. GLM Model
Then I did the logistic regression as:
logit=glm(y~x1+x2, data=test_data, family=binomial, weights=freq)
Output shows:
Deviance Residuals:
1 2 3 4 5 6 7 8
-7.501 -3.536 -8.818 -1.521 11.957 3.501 10.409 2.129
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2010 0.1892 -11.632 < 2e-16 ***
x1 1.3538 0.2516 5.381 7.39e-08 ***
x2 1.6261 0.4313 3.770 0.000163 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 457.35 on 7 degrees of freedom
Residual deviance: 416.96 on 5 degrees of freedom
AIC: 422.96
Number of Fisher Scoring iterations: 5
Coefficients are the same as STATA.
3. Chi Square Statistics
When I tried to calculate the Pearson chi-square:
pearson_chisq = sum(residuals(logit, type = "pearson", weights=test_data$freq)^2)
I got 488, instead of the 1.3 given by STATA. Also, the degrees of freedom from R, chisq_dof = df.residual(logit) = 5, instead of 1. So I got an extremely small p-value, ~e^-100.
4. Discrimination
Then I calculated the area under ROC curve as:
library(verification)
logit_mf = model.frame(logit)
roc.area(logit_mf$y, fitted(logit))$A
The output is:
[1] 0.5
Warning message:
In wilcox.test.default(pred[obs == 1], pred[obs == 0], alternative = "great") :
cannot compute exact p-value with ties
Thanks!
I eventually figured out how to solve this problem. The data set above should be summarised into covariate patterns, and then the Pearson chi-square computed from its definition. The R code follows:
# extract covariate patterns
library(dplyr)
temp=test_data %>% group_by(x1, x2) %>% summarise(m=sum(freq),y=sum(freq*y))
temp$pred=fitted(logit)[1:4] # fitted probability for each covariate pattern
# calculate Pearson chi square
temp=mutate(temp, pearson=(y-m*pred)/sqrt(m*pred*(1-pred)))
pearson_chi2 = with(temp, sum(pearson^2))
temp_dof = 4-(2+1) #dof=J-(p+1)
# calculate p-value
pchisq(pearson_chi2, temp_dof, lower.tail=F)
The resulting p-value is 0.241941, the same as STATA.
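Equivalently (a sketch, not part of the original workflow; the object names grouped and logit_grouped are mine), the grouped-data Pearson statistic can be obtained directly by refitting the model on aggregated counts, since a binomial GLM on events/trials per covariate pattern gives the same coefficients but residuals on the grouped-data scale:
# one row per covariate pattern, with events and trials
grouped <- test_data %>% group_by(x1, x2) %>% summarise(events = sum(freq*y), trials = sum(freq))
logit_grouped <- glm(cbind(events, trials - events) ~ x1 + x2, data = grouped, family = binomial)
# grouped-data Pearson chi-square on J - (p + 1) = 4 - 3 = 1 degree of freedom
pearson_grouped <- sum(residuals(logit_grouped, type = "pearson")^2)
pchisq(pearson_grouped, df = 1, lower.tail = FALSE)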
To calculate the AUC, we should first expand the covariate patterns back to the "original" data, then use the expanded data to get the AUC. Note that we have 392 zeros and 88 ones in the frequency table. My code follows:
# expand observations
y_expand=c(rep(0,392),rep(1,88))
logit_mf = model.frame(logit)
logit_mf$pred = fitted(logit)
logit_mf$freq=test_data$freq
# expand predictions to one row per observation
yhat_expand=with(logit_mf, rep(pred, freq))
library(verification)
roc.area(y_expand, yhat_expand)$A
AUC = 0.6760, the same as that of STATA.