How to plot glm model results with a lot of parameters - r

I really need help with this. I want to build a prediction plot for my quasipoisson GLM, but I have a problem because I specified the model wrongly for my dataset.
I meant to build predictions from the quasipoisson GLM using all my parameters together, but I ended up predicting for each parameter separately, and the result differs from the quasipoisson GLM summary.
Here is my dataset. It all comes from a single CSV file; I don't know how to attach the CSV to this post, pardon me for that.
Richness = as.matrix(dat1[,14])
Richness
8
3
3
4
3
5
4
3
7
8
Parameter = as.matrix(dat1[,15:22])
Parameter
JE Temp Hmdt Sond HE WE L MH
1 31.3 93 63.3 3.89 4.32 80 7.82
2 26.9 92 63.5 9.48 8.85 60 8.32
1 27.3 93 67.4 1.23 2.37 60 10.10
3 31.6 99 108.0 1.90 3.32 80 4.60
1 29.3 99 86.8 2.42 7.83 460 12.20
2 29.4 85 86.1 4.71 15.04 200 10.10
1 29.4 87 93.5 3.65 14.70 200 12.20
1 29.5 97 87.5 1.42 3.17 80 4.07
1 25.9 95 62.3 5.23 16.89 140 10.03
1 29.5 95 63.5 1.85 6.50 120 6.97
Rich = glm(Richness ~ Parameter, family=quasipoisson, data = dat1)
summary(Rich)
Call:
glm(formula = Richness ~ Parameter, family = quasipoisson, data = dat1)
Deviance Residuals:
1 2 3 4 5
-0.017139 0.016769 -0.008652 0.002194 -0.003153
6 7 8 9 10
-0.016828 0.022914 -0.013823 -0.012597 0.030219
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.4197959 0.5061733 -14.659 0.0434 *
ParameterJE 0.1833651 0.0224198 8.179 0.0775 .
ParameterTemp 0.2441301 0.0073380 33.269 0.0191 *
ParameterHmdt 0.0393258 0.0032176 12.222 0.0520 .
ParameterSond -0.0319313 0.0009662 -33.050 0.0193 *
ParameterHE -0.0982213 0.0060587 -16.212 0.0392 *
ParameterWE 0.1001758 0.0027575 36.329 0.0175 *
ParameterL -0.0014170 0.0001554 -9.117 0.0695 .
ParameterMH 0.0137196 0.0073704 1.861 0.3138
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 0.002739787)
Null deviance: 7.8395271 on 9 degrees of freedom
Residual deviance: 0.0027358 on 1 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 3
This is the plot I tried to make with ggplot:
ggplot(dat1, aes(Temp, Richness)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = quasipoisson),
              fill = "grey", color = "black", linetype = 2)
and this is the result.
I made one plot like this for each parameter, but I now see this is wrong: geom_smooth fits a new single-predictor quasipoisson model for each plot, whereas what I want are predictions from the full quasipoisson model shown in the summary above.
I tried to use the code from "plot the results glm with multiple explanatories with 95% CIs", but I am confused about how to arrange my data like the example there, even though the result in that example is close to what I want.
Can anyone help me with this? How can I plot the GLM predictions for all parameters in one frame with ggplot?
Hope anyone can help me fix this. Thank you so much!

Have you tried the plot_model function from the sjPlot package?
I'm writing from my phone, but the code is something like this.
library(sjPlot)
plot_model(glm_model)
More info:
http://www.strengejacke.de/sjPlot/reference/plot_model.html
code:
data("mtcars")
glm_model<-glm(am~.,data = mtcars)
glm_model
library(sjPlot)
plot_model(glm_model, vline.color = "red")
plot_model(glm_model, show.values = TRUE, value.offset = .3)
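If you would rather build the plot manually with ggplot2, here is a minimal sketch. It assumes dat1 has the numeric columns JE, Temp, Hmdt, Sond, HE, WE, L and MH printed above, and it refits the model with named columns rather than a matrix so that predict() can take a data-frame newdata. Each panel varies one parameter over its observed range while holding the others at their means:
library(ggplot2)

vars <- c("JE", "Temp", "Hmdt", "Sond", "HE", "WE", "L", "MH")
# refit with named columns instead of a matrix term
fit <- glm(reformulate(vars, response = "Richness"),
           family = quasipoisson, data = dat1)

# predictions over a grid of one parameter, the others held at their means
pred_one <- function(v) {
  grid <- as.data.frame(lapply(dat1[vars], function(x) rep(mean(x), 100)))
  grid[[v]] <- seq(min(dat1[[v]]), max(dat1[[v]]), length.out = 100)
  p <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
  data.frame(parameter = v, x = grid[[v]],
             fit = exp(p$fit),                      # back-transform the log link
             lwr = exp(p$fit - 1.96 * p$se.fit),    # approximate 95% band
             upr = exp(p$fit + 1.96 * p$se.fit))
}
preds <- do.call(rbind, lapply(vars, pred_one))

ggplot(preds, aes(x, fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey") +
  geom_line(linetype = 2) +
  facet_wrap(~ parameter, scales = "free_x") +
  labs(x = NULL, y = "Predicted Richness")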

Related

Save survdiff output

I have run a log-rank test with survdiff as below:
survdiff(formula = Surv(YearsToEvent, Event) ~ Cat, data = RegressionData)
I get the following output:
N Observed Expected (O-E)^2/E (O-E)^2/V
0 30913 487 437.9 5.50 11.9
1 3755 56 23.2 46.19 48.0
2 3322 36 45.2 1.89 2.0
3 15796 260 332.6 15.85 27.3
Chisq= 71.9 on 3 degrees of freedom, p= 0.000000000000002
How can I save this (especially the p-value) to a .txt file? I am looping over a bunch of regressions like this and want to save them all to one .txt file.
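One approach (a sketch; RegressionData and the variables are as in the question, and the file names are only illustrative): survdiff() stores the chi-square statistic in its $chisq component, so the p-value can be recomputed with pchisq() and written out with cat() or capture.output().
library(survival)

sd_out <- survdiff(Surv(YearsToEvent, Event) ~ Cat, data = RegressionData)
# df = number of groups - 1
p_val <- pchisq(sd_out$chisq, df = length(sd_out$n) - 1, lower.tail = FALSE)

# append one line per model to a text file (illustrative file name)
cat("Cat", p_val, "\n", file = "logrank_results.txt", append = TRUE)

# or save the full printed output
capture.output(sd_out, file = "logrank_full.txt", append = TRUE)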

How to evaluate a string variable as factor in the emmeans() command in R?

I would like to pass the factor for which emmeans are calculated to the emmeans() statement as a string variable from an ANOVA model. Here I use the oranges dataset from the emmeans package to make the code reproducible. This is my model and how I would usually calculate the emmeans of the factor store:
library(emmeans)
oranges$store<-as.factor(oranges$store)
model <- lm (sales1 ~ 1 + price1 + store ,data=oranges)
means<-emmeans(model, pairwise ~ store, adjust="tukey")
Now I would like to assign a variable (lsmeanfact) defining the factor for which the lsmeans are calculated.
lsmeanfact<-"store"
However, when I try to evaluate this variable inside the emmeans() call it returns an error; it basically does not find the variable lsmeanfact, so the string is never evaluated.
means<-emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust="tukey")
Error in emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust = "tukey") :
No variable named lsmeanfact in the reference grid
How should I change my code so that the variable lsmeanfact is evaluated and the emmeans for the factor it names (here store) are correctly calculated?
You can make use of the reformulate function.
library(emmeans)
lsmeanfact<-"store"
means <- emmeans(model, reformulate(lsmeanfact, 'pairwise'), adjust="tukey")
Or construct a formula with formula/as.formula.
means <- emmeans(model, formula(paste('pairwise', lsmeanfact, sep = '~')), adjust="tukey")
Here both reformulate(lsmeanfact, 'pairwise') and formula(paste('pairwise', lsmeanfact, sep = '~')) return pairwise ~ store.
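For instance, you can verify this directly in the console:
reformulate(lsmeanfact, 'pairwise')
## pairwise ~ store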
You do not need to do anything special at all. The specs argument to emmeans() can be a character value. You can get the pairwise comparisons in a separate call, which is actually a better way to go anyway.
library(emmeans)
model <- lm(sales1 ~ price1 + store, data = oranges)
lsmeanfact <- "store"
( EMM <- emmeans(model, lsmeanfact) )
## store emmean SE df lower.CL upper.CL
## 1 8.01 2.61 29 2.67 13.3
## 2 9.60 2.30 29 4.89 14.3
## 3 7.84 2.30 29 3.13 12.6
## 4 10.44 2.35 29 5.63 15.2
## 5 10.19 2.28 29 5.53 14.9
## 6 15.22 2.28 29 10.56 19.9
##
## Confidence level used: 0.95
pairs(EMM)
## contrast estimate SE df t.ratio p.value
## 1 - 2 -1.595 3.60 29 -0.443 0.9976
## 1 - 3 0.165 3.60 29 0.046 1.0000
## 1 - 4 -2.428 3.72 29 -0.653 0.9856
## 1 - 5 -2.185 3.50 29 -0.625 0.9882
## 1 - 6 -7.209 3.45 29 -2.089 0.3206
## 2 - 3 1.761 3.22 29 0.546 0.9936
## 2 - 4 -0.833 3.23 29 -0.258 0.9998
## 2 - 5 -0.590 3.23 29 -0.182 1.0000
## 2 - 6 -5.614 3.24 29 -1.730 0.5239
## 3 - 4 -2.593 3.23 29 -0.802 0.9648
## 3 - 5 -2.350 3.23 29 -0.727 0.9769
## 3 - 6 -7.375 3.24 29 -2.273 0.2373
## 4 - 5 0.243 3.26 29 0.075 1.0000
## 4 - 6 -4.781 3.28 29 -1.457 0.6930
## 5 - 6 -5.024 3.23 29 -1.558 0.6314
##
## P value adjustment: tukey method for comparing a family of 6 estimates
Created on 2021-06-29 by the reprex package (v2.0.0)
Moreover, what is needed in specs is the name(s) of the factor(s) involved, not the factors themselves. Note also that it was unnecessary to convert store to a factor before fitting the model.

Why can't I use cv.glm on the output of bestglm?

I am trying to do best-subset selection on the wine dataset, and then I want to get the test error rate using 10-fold CV. The code I used is:
cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
bestglm(Xy = winedata,
family = binomial, # binomial family for logistic
IC = "AIC", # Information criteria
method = "exhaustive")
res.best.logistic$BestModels
best.cv.err<- cv.glm(winedata,res.best.logistic$BestModel,cost1, K=10)
However, this gives the error:
Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"
I thought that $BestModel is the lm object that represents the best fit, and that's what the manual says too. If that's the case, why can't I get the test error for it with 10-fold CV via cv.glm?
The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality, and the packages used are boot (for cv.glm) and bestglm.
The data was processed as follows:
winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good" #rename 'quality' to 'good'
bestglm rearranges your data and names the response variable y, so if you pass its result back into cv.glm, winedata has no column named y and everything crashes after that.
It's always good to check the class:
class(res.best.logistic$BestModel)
[1] "glm" "lm"
But if you look at the call of res.best.logistic$BestModel:
res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
head(res.best.logistic$BestModel$model)
y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0 7.0 0.27 0.36 20.7 0.045
2 0 6.3 0.30 0.34 1.6 0.049
3 0 8.1 0.28 0.40 6.9 0.050
4 0 7.2 0.23 0.32 8.5 0.058
5 0 7.2 0.23 0.32 8.5 0.058
6 0 8.1 0.28 0.40 6.9 0.050
free.sulfur.dioxide density pH sulphates
1 45 1.0010 3.00 0.45
2 14 0.9940 3.30 0.49
3 30 0.9951 3.26 0.44
4 47 0.9956 3.19 0.40
5 47 0.9956 3.19 0.40
6 30 0.9951 3.26 0.44
You could substitute things into the call etc., but it's too much of a mess. Fitting is not costly, so refit the model on winedata and pass that to cv.glm:
# logical columns of BestModels mark which variables enter each model
best_var <- apply(res.best.logistic$BestModels[, -ncol(winedata)], 1, which)
# take the variable names for the best model
best_var <- names(best_var[[1]])
new_form <- as.formula(paste("good ~", paste(best_var, collapse = "+")))
fit <- glm(new_form, data = winedata, family = "binomial")
best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)
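Alternatively (a sketch under the same setup), you can pull the selected variable names straight from the terms of BestModel, which avoids indexing into the BestModels table:
# term.labels holds the predictors that bestglm actually selected
best_terms <- attr(terms(res.best.logistic$BestModel), "term.labels")
new_form <- reformulate(best_terms, response = "good")
fit <- glm(new_form, data = winedata, family = binomial)
best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)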

R - two types of prediction in cross validation

When I use cross-validation with my data, it gives me two types of predictions: cvpred and Predicted. What is the difference between the two? I guess cvpred is the cross-validation prediction, but what is the other?
Here is some of my code:
crossvalpredict <- cv.lm(data = total, form.lm = formula(verim ~ X4 + X4.1), m = 5)
And this is the result:
fold 1
Observations in test set: 5
3 11 15 22 23
Predicted 28.02 32.21 26.53 25.1 21.28
cvpred 20.23 40.69 26.57 34.1 26.06
verim 30.00 31.00 28.00 24.0 20.00
CV residual 9.77 -9.69 1.43 -10.1 -6.06
Sum of squares = 330 Mean square = 66 n = 5
fold 2
Observations in test set: 5
2 7 21 24 25
Predicted 28.4 32.0 26.2 19.95 25.9
cvpred 52.0 81.8 36.3 14.28 90.1
verim 30.0 33.0 24.0 21.00 24.0
CV residual -22.0 -48.8 -12.3 6.72 -66.1
Sum of squares = 7428 Mean square = 1486 n = 5
fold 3
Observations in test set: 5
6 14 18 19 20
Predicted 34.48 36.93 19.0 27.79 25.13
cvpred 37.66 44.54 16.7 21.15 7.91
verim 33.00 35.00 18.0 31.00 26.00
CV residual -4.66 -9.54 1.3 9.85 18.09
Sum of squares = 539 Mean square = 108 n = 5
fold 4
Observations in test set: 5
1 4 5 9 13
Predicted 31.91 29.07 32.5 32.7685 28.9
cvpred 30.05 28.44 54.9 32.0465 11.4
verim 32.00 27.00 31.0 32.0000 30.0
CV residual 1.95 -1.44 -23.9 -0.0465 18.6
Sum of squares = 924 Mean square = 185 n = 5
fold 5
Observations in test set: 5
8 10 12 16 17
Predicted 27.8 30.28 26.0 27.856 35.14
cvpred 50.3 33.92 45.8 31.347 29.43
verim 28.0 30.00 24.0 31.000 38.00
CV residual -22.3 -3.92 -21.8 -0.347 8.57
Sum of squares = 1065 Mean square = 213 n = 5
Overall (Sum over all 5 folds)
ms
411
You can check that by reading the help page of the function you are using, cv.lm. There you will find this paragraph:
The input data frame is returned, with additional columns
‘Predicted’ (Predicted values using all observations) and ‘cvpred’
(cross-validation predictions). The cross-validation residual sum
of squares (‘ss’) and degrees of freedom (‘df’) are returned as
attributes of the data frame.
This says that Predicted is a vector of predicted values made using all the observations; in other words, predictions made on your "training" data, or "in sample".
To check whether this is so, you can fit the same model using lm:
fit <- lm(verim~X4+X4.1, data=total)
And see if the predicted values from this model:
predict(fit)
are the same as those returned by cv.lm
When I tried it on the iris dataset in R, the Predicted values returned by cv.lm() matched predict(lm), so in that case they are in-sample predictions, where the model is fitted and evaluated on the same observations.
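For example (a sketch, assuming the asker's total data frame and cv.lm() from the DAAG package):
library(DAAG)

fit <- lm(verim ~ X4 + X4.1, data = total)
cv_out <- cv.lm(data = total, form.lm = formula(verim ~ X4 + X4.1), m = 5)
# the Predicted column should match the in-sample lm() predictions
all.equal(as.numeric(predict(fit)), cv_out$Predicted, check.attributes = FALSE)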
lm() does not give "better results"; I am not sure in what sense predict() and cv.lm() could be "the same". predict() returns the expected value of Y for each sample, estimated from the fitted model (the covariates X and their corresponding estimated beta values). Those beta values, and the model error E, were estimated from that same original data, so by using predict() you get an overly optimistic estimate of model performance. That is why it seems better. You get a more realistic estimate of model performance using an iterated sample-holdout technique, like cross-validation (CV). The least biased estimate comes from leave-one-out CV, and the estimate with the least uncertainty (prediction error) comes from 2-fold (K = 2) CV.

Adding confidence intervals to a qq plot?

Is there a way to add confidence intervals to a qqplot?
I have a dataset of gene expression values, which I've visualized using PCA:
pca1 = prcomp(data, scale. = TRUE)
I'm now looking for outliers by checking the distribution of the data against the normal distribution through:
qqnorm(pca1$x,pch = 20, col = c(rep("red", 73), rep("blue", 33)))
qqline(pca1$x)
This is my data:
data <- c(2.48, 104, 4.25, 219, 0.682, 0.302, 1.09, 0.586, 90.7, 344,
          13.8, 1.17, 305, 2.8, 79.7, 3.18, 109, 0.932, 562, 0.958,
          1.87, 0.59, 114, 391, 13.5, 1.41, 208, 2.37, 166, 3.42)
I would now like to plot 95% confidence intervals to check which data points lie outside. Any tips on how to do this?
The car package provides the function qqPlot(...), which adds a pointwise confidence envelope to the normal QQ plot by default:
library(car)
qqPlot(pca1$x)
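As a usage note (a sketch; pca1$x is a matrix, so a single column is passed here, and id is a standard qqPlot argument in current car versions), qqPlot() can also label the most extreme points and return their indices, which is handy for the outlier check:
library(car)
# label the 3 most extreme points and return their row indices
outliers <- qqPlot(pca1$x[, 1], pch = 20, id = list(n = 3))
outliers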
