How to create a confidence interval plot in R

# Part (b)
DF1 = read.csv("Nerlove.3.csv", header = TRUE)
head(DF1, n = 5)
split = round(nrow(DF1) * 0.60)
train = DF1[1:split, ]
test = DF1[(split + 1):nrow(DF1), ]
model = lm(output ~ ., train)
summary(model)
plot(train$cost, train$output, ylab = "Output", xlab = "Cost", main = "....")
abline(model, col = 2)
# Part (c)
plot(test$cost, test$output, ylab = "Output", xlab = "Cost", main = "....")
model1 = lm(output ~ ., test)
abline(model, col = 2)
prediction = predict(model, test)
plot(prediction, main = "....")
abline(model1, col = 2)
summary(model1)
# Part (d)
library(stats)
X_0 = data.frame(cost = test$cost)
FI_mean = predict(model, newdata = X_0, interval = "confidence", level = 0.95)
FI_ind = predict(model, newdata = X_0, interval = "prediction")
plot(test$cost, test$output, ylab = "Output", xlab = "Cost", main = "....")
abline(model, col = 2)
min = test$cost
max = test$cost
newx = seq(min, max)
matlines(newx, FI_mean[, 2:3], col = "blue", lty = 2)
I need to plot the confidence interval I computed around the regression line, but I'm getting an error. Can anybody please help me fix this? Thanks.
This is the link to my data. I have edited it and am only using the cost and output columns in my data frame.
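For context, the error itself comes from the seq(min, max) call: min and max have been assigned whole vectors, but seq() expects single numbers. A minimal sketch of what part (d) seems to intend, assuming a single-predictor model of output on cost, would be:
model_cost = lm(output ~ cost, data = train)  # cost-only fit (assumption; the question's model uses all variables)
newx = seq(min(test$cost), max(test$cost), length.out = 100)
FI_mean = predict(model_cost, newdata = data.frame(cost = newx),
                  interval = "confidence", level = 0.95)
plot(test$cost, test$output, ylab = "Output", xlab = "Cost")
abline(model_cost, col = 2)
matlines(newx, FI_mean[, 2:3], col = "blue", lty = 2)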

You are doing it wrong because you are using all the variables when fitting the linear model with the command model = lm(output ~ ., train), but when plotting you plot cost vs. output (as in parts b and c), and in part d you try to predict using only one variable, i.e. cost. The regression plot should be made between observed output and predicted output. For that, you can use the following code:
library(lattice)
library(mosaic)
#> (startup and masking messages from dplyr, ggformula, ggplot2, ggstance, mosaicData, Matrix, and mosaic omitted)
DF1 = read.csv("Nerlove.csv",header=TRUE)
head(DF1, n=5)
#> cost output pl sl pk sk pf sf
#> 1 0.082 2 2.09 0.3164 183 0.4521 17.9 0.2315
#> 2 0.661 3 2.05 0.2073 174 0.6676 35.1 0.1251
#> 3 0.990 4 2.05 0.2349 171 0.5799 35.1 0.1852
#> 4 0.315 4 1.83 0.1152 166 0.7857 32.2 0.0990
#> 5 0.197 5 2.12 0.2300 233 0.3841 28.6 0.3859
split = round(nrow(DF1) * 0.60)
train = (DF1[1:split, ])
test = (DF1[(split + 1):nrow(DF1), ])
model = lm(output ~ ., train)
summary(model)
#>
#> Call:
#> lm(formula = output ~ ., data = train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -409.65 -112.20 -4.61 94.20 430.76
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1182.296 1861.876 0.635 0.527
#> cost 146.231 6.127 23.866 < 2e-16 ***
#> pl 14.897 80.142 0.186 0.853
#> sl -2471.312 1930.034 -1.280 0.204
#> pk 1.477 1.011 1.460 0.148
#> sk -869.468 1874.585 -0.464 0.644
#> pf -13.701 2.365 -5.794 1.08e-07 ***
#> sf -947.958 1861.490 -0.509 0.612
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 169.3 on 87 degrees of freedom
#> Multiple R-squared: 0.9111, Adjusted R-squared: 0.904
#> F-statistic: 127.4 on 7 and 87 DF, p-value: < 2.2e-16
# Calibration plot
pred_cal <- predict(model, newdata = train)
df_cal <- data.frame(Observed = train$output, Predicted = pred_cal)
xyplot(Predicted ~ Observed, data = df_cal, pch = 19, panel = panel.lmbands,
       band.lty = c(conf = 2, pred = 1))
# Validation plot
pred_val <- predict(model, newdata = test)
df_val <- data.frame(Observed = test$output, Predicted = pred_val)
xyplot(Predicted ~ Observed, data = df_val, pch = 19, panel = panel.lmbands,
       band.lty = c(conf = 2, pred = 1))
Created on 2020-01-07 by the reprex package (v0.3.0)
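If you would rather stay in base graphics than pull in lattice and mosaic, the same observed-vs-predicted bands can be drawn by hand. A sketch (it reuses df_cal from above and refits a simple lm of Predicted on Observed, which is roughly what panel.lmbands visualises):
cal_lm <- lm(Predicted ~ Observed, data = df_cal)
ox <- seq(min(df_cal$Observed), max(df_cal$Observed), length.out = 100)
conf_band <- predict(cal_lm, data.frame(Observed = ox), interval = "confidence")
pred_band <- predict(cal_lm, data.frame(Observed = ox), interval = "prediction")
plot(Predicted ~ Observed, data = df_cal, pch = 19)
abline(cal_lm, col = 2)
matlines(ox, conf_band[, 2:3], lty = 2, col = "blue")  # confidence band
matlines(ox, pred_band[, 2:3], lty = 1, col = "blue")  # prediction band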

Related

How to fit beta-binomial model on proportional data (not counts) in gamlss

I want to fit a beta-binomial regression. I don't have counts, only proportions, which I want to fit. Here's an example:
library(dplyr)
library(gamlss)

df <- tibble(
  cluster = LETTERS[1:20]
) |>
  mutate(
    p = rbeta(n(), 1, 1),
    n = as.integer(3 * runif(n()))
  )

fit <- gamlss(
  p ~ log(n),
  weights = n,
  data = df,
  family = BB(mu.link = 'identity')
)
I get this error:
Error in while (abs(olddv - dv) > cc && itn < cyc) { :
missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Warnings look like:
In dbinom(x, size = bd, prob = mu, log = log) : non-integer x = 0.834502
Note that I DON'T want to round the number of successes, as in mutate(y = round(p * n)).
The help file for the BB() family says that the dependent variable is expected to be a two-column matrix of the numbers of successes and failures. If you've got p (the probability of success) and n (the number of trials), then you can compute both the number of successes k = floor(p*n) and the number of failures notk = n - k. Then you can do as I did below.
library(dplyr)
library(gamlss)

df <- tibble(
  cluster = LETTERS[1:20]
) |>
  mutate(
    p = rbeta(n(), 1, 1),
    n = as.integer(100 * runif(n()))
  )

df <- df %>%
  mutate(k = floor(p * n),
         notk = n - k)

fit <- gamlss(
  cbind(k, notk) ~ cluster,
  data = df,
  family = BB(mu.link = 'logit')
)
#> ******************************************************************
#> Family: c("BB", "Beta Binomial")
#>
#> Call:
#> gamlss(formula = cbind(k, notk) ~ cluster, family = BB(mu.link = "logit"),
#> data = df)
#>
#> Fitting method: RS()
#>
#> ------------------------------------------------------------------
#> Mu link function: logit
#> Mu Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 8.130e-01 3.836e-01 2.119 0.03406 *
#> clusterB -8.130e-01 2.036e+00 -0.399 0.68973
#> clusterC -3.686e+01 1.000e+05 0.000 0.99971
#> clusterD -2.970e+00 6.922e-01 -4.291 1.78e-05 ***
#> clusterE 3.618e-01 4.843e-01 0.747 0.45508
#> clusterF -3.381e-01 5.317e-01 -0.636 0.52479
#> clusterG -3.569e+00 6.506e-01 -5.485 4.13e-08 ***
#> clusterH -1.118e+00 4.356e-01 -2.566 0.01030 *
#> clusterI -1.712e+00 4.453e-01 -3.845 0.00012 ***
#> clusterJ 1.825e+00 6.315e-01 2.889 0.00386 **
#> clusterK -3.686e+01 1.000e+05 0.000 0.99971
#> clusterL -5.247e-01 4.602e-01 -1.140 0.25419
#> clusterM 1.439e+00 7.167e-01 2.008 0.04464 *
#> clusterN 9.161e-02 4.721e-01 0.194 0.84613
#> clusterO -2.405e+00 1.000e+05 0.000 0.99998
#> clusterP 3.034e-01 5.583e-01 0.543 0.58686
#> clusterQ -1.523e+00 5.389e-01 -2.826 0.00471 **
#> clusterR -2.498e+00 6.208e-01 -4.024 5.73e-05 ***
#> clusterS 1.006e+00 5.268e-01 1.910 0.05619 .
#> clusterT -6.228e-02 4.688e-01 -0.133 0.89433
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ------------------------------------------------------------------
#> Sigma link function: log
#> Sigma Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -36.034363 0.005137 -7014 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ------------------------------------------------------------------
#> No. of observations in the fit: 20
#> Degrees of Freedom for the fit: 21
#> Residual Deg. of Freedom: -1
#> at cycle: 9
#>
#> Global Deviance: 69.49748
#> AIC: 111.4975
#> SBC: 132.4079
#> ******************************************************************
Created on 2023-01-20 by the reprex package (v2.0.1)
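As a quick sanity check, you can pull the fitted success probabilities back out and compare them with the input proportions. A sketch (fitted() on a gamlss object returns the fitted mu parameter by default; mu_hat is a column name I've made up):
# Compare the fitted mean success probability with the observed proportion:
df$mu_hat <- fitted(fit)
head(df[, c("cluster", "p", "mu_hat")])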

How to automatically insert the default index in Prais-Winsten estimation?

I got an error message saying
Error in prais_winsten(agvprc.lm1, data = data) :
argument "index" is missing, with no default
How can I avoid inputting all the features?
agvprc.lm1 = lm(log(avgprc) ~ mon + tues + wed + thurs + t + wave2 + wave3)
summary(agvprc.lm1)
agvprc.lm.pw = prais_winsten(agvprc.lm1, data = data)
summary(agvprc.lm.pw)
I'm not sure if I've understood the question correctly, but to avoid the "Error in prais_winsten(agvprc.lm1, data = data) : argument "index" is missing, with no default" error, you need to provide an index to the function (a character vector naming the "ID" and "time" variables). Using the built-in mtcars dataset as an example, with "cyl" as "time":
library(tidyverse)
#install.packages("prais")
library(prais)
#> Loading required package: sandwich
#> Loading required package: pcse
#>
#> Attaching package: 'pcse'
#> The following object is masked from 'package:sandwich':
#>
#> vcovPC
ggplot(mtcars, aes(x = cyl, y = mpg, group = cyl)) +
  geom_boxplot() +
  geom_jitter(aes(color = hp), width = 0.2)
agvprc.lm1 <- lm(log(mpg) ~ cyl + hp, data = mtcars)
summary(agvprc.lm1)
#>
#> Call:
#> lm(formula = log(mpg) ~ cyl + hp, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.35699 -0.09882 0.01111 0.11948 0.24118
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.7829495 0.1062183 35.615 < 2e-16 ***
#> cyl -0.1072513 0.0279213 -3.841 0.000615 ***
#> hp -0.0011031 0.0007273 -1.517 0.140147
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1538 on 29 degrees of freedom
#> Multiple R-squared: 0.7503, Adjusted R-squared: 0.7331
#> F-statistic: 43.57 on 2 and 29 DF, p-value: 1.83e-09
agvprc.lm.pw <- prais_winsten(formula = agvprc.lm1,
                              data = mtcars,
                              index = c("hp", "cyl"))
#> Iteration 0: rho = 0
#> Iteration 1: rho = 0.6985
#> Iteration 2: rho = 0.7309
#> Iteration 3: rho = 0.7285
#> Iteration 4: rho = 0.7287
#> Iteration 5: rho = 0.7287
#> Iteration 6: rho = 0.7287
#> Iteration 7: rho = 0.7287
summary(agvprc.lm.pw)
#>
#> Call:
#> prais_winsten(formula = agvprc.lm1, data = mtcars, index = c("hp",
#> "cyl"))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.33844 -0.08166 0.03109 0.13612 0.25811
#>
#> AR(1) coefficient rho after 7 iterations: 0.7287
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.7643405 0.1116876 33.704 <2e-16 ***
#> cyl -0.1061198 0.0298161 -3.559 0.0013 **
#> hp -0.0011470 0.0007706 -1.489 0.1474
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1077 on 29 degrees of freedom
#> Multiple R-squared: 0.973, Adjusted R-squared: 0.9712
#> F-statistic: 523.3 on 2 and 29 DF, p-value: < 2.2e-16
#>
#> Durbin-Watson statistic (original): 0.1278
#> Durbin-Watson statistic (transformed): 0.4019
Created on 2022-02-28 by the reprex package (v2.0.1)
# To provide the index without having to write out all of the variables,
# perhaps you could use:
agvprc.lm.pw <- prais_winsten(formula = agvprc.lm1,
                              data = mtcars,
                              index = names(agvprc.lm1$coefficients)[3:2])
summary(agvprc.lm.pw)
#>
#> Call:
#> prais_winsten(formula = agvprc.lm1, data = mtcars, index = names(agvprc.lm1$coefficients)[3:2])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.33844 -0.08166 0.03109 0.13612 0.25811
#>
#> AR(1) coefficient rho after 7 iterations: 0.7287
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.7643405 0.1116876 33.704 <2e-16 ***
#> cyl -0.1061198 0.0298161 -3.559 0.0013 **
#> hp -0.0011470 0.0007706 -1.489 0.1474
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1077 on 29 degrees of freedom
#> Multiple R-squared: 0.973, Adjusted R-squared: 0.9712
#> F-statistic: 523.3 on 2 and 29 DF, p-value: < 2.2e-16
#>
#> Durbin-Watson statistic (original): 0.1278
#> Durbin-Watson statistic (transformed): 0.4019
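Another option along the same lines (a sketch of mine, not from the original answer; all.vars() reads the variable names straight from the model formula, so nothing is hard-coded by coefficient position):
# all.vars() returns the response first, then the predictors, so drop
# the response and reverse the rest to get c("hp", "cyl"):
vars <- all.vars(formula(agvprc.lm1))[-1]
agvprc.lm.pw <- prais_winsten(formula = agvprc.lm1,
                              data = mtcars,
                              index = rev(vars))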
NB: there are a number of assumptions here that may not apply to your actual data; with more information, such as a minimal reproducible example, you will likely get a better and more informed answer on Stack Overflow.

Incorrect number of dimensions in forecasting regression model

I am using a regression model and forecasting from it.
I have the following code:
y <- M3[[1909]]$x
data_ts <- window(y, start=1987, end = 1991-.1)
fit <- tslm(data_ts ~ trend + season)
summary(fit)
It works up to this point. But when forecasting,
plot(forecast(fit, h=18, level=c(80,90,95,99)))
It gives the following error:
Error in `[.default`(X, , piv, drop = FALSE) :
incorrect number of dimensions
Appreciate your help.
This works for me using the current CRAN version (8.15) of the forecast package:
library(forecast)
library(Mcomp)
y <- M3[[1909]]$x
data_ts <- window(y, start=1987, end = 1991-.1)
fit <- tslm(data_ts ~ trend + season)
summary(fit)
#>
#> Call:
#> tslm(formula = data_ts ~ trend + season)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -204.81 -73.66 -11.44 69.99 368.96
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4438.403 65.006 68.277 < 2e-16 ***
#> trend 2.402 1.323 1.815 0.07828 .
#> season2 43.298 84.788 0.511 0.61289
#> season3 598.145 84.819 7.052 3.84e-08 ***
#> season4 499.993 84.870 5.891 1.19e-06 ***
#> season5 673.940 84.942 7.934 3.05e-09 ***
#> season6 604.988 85.035 7.115 3.20e-08 ***
#> season7 571.785 85.148 6.715 1.03e-07 ***
#> season8 695.533 85.282 8.156 1.64e-09 ***
#> season9 176.930 85.436 2.071 0.04603 *
#> season10 656.028 85.610 7.663 6.58e-09 ***
#> season11 -260.875 85.804 -3.040 0.00453 **
#> season12 -887.062 91.809 -9.662 2.79e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 119.9 on 34 degrees of freedom
#> Multiple R-squared: 0.949, Adjusted R-squared: 0.931
#> F-statistic: 52.74 on 12 and 34 DF, p-value: < 2.2e-16
plot(forecast(fit, h=18, level=c(80,90,95,99)))
Created on 2022-01-02 by the reprex package (v2.0.1)
Perhaps you're loading some other package that is masking forecast().
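One way to check (a sketch; conflicts() is base R, and the :: prefix pins the namespace):
conflicts()  # list names masked by attached packages
# Call the generic with an explicit namespace to sidestep any masking:
plot(forecast::forecast(fit, h = 18, level = c(80, 90, 95, 99)))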

Why does a predict.glm() not create predicted values in the expected manner?

I am trying to get my head around what the predict.glm() function does for a project at work which uses it.
To do this, I first looked at the example code found in the documentation for ?predict.glm. This has given me the sense that it can take a glm and predict response values for a given input vector. However, I found it very difficult to customise that "budworm" example, so I created an exceptionally simple model of my own to try and see how it works. Spoiler: I'm still failing to get it to work.
a <- c(1, 2, 3, 4, 5)
b <- c(2, 3, 4, 5, 6)
result <- glm(b ~ a, family = gaussian)
summary(result)
plot(c(0, 10), c(0, 10), type = "n", xlab = "dose",
     ylab = "response")
xvals <- seq(0, 10, 0.1)
data.frame(xinputs = xvals)
predict.glm(object = result, newdata = data.frame(xinputs = xvals), type = 'terms')
#lines(xvals, predict.glm(object = result, newdata = xvals, type = "response"))
When I run predict.glm(object = result, newdata = data.frame(xinputs = xvals), type = 'terms') I get the warning:
Warning message:
'newdata' had 101 rows but variables found have 5 rows
From what I understand, it shouldn't matter that the input GLM only used 5 rows; it should use the statistics of that GLM to predict response values for each of the 101 entries in the new data?
Column names in the newdata data frame must match column names from the data you used to fit the model. Thus,
predict.glm(object = result, newdata = data.frame(a = xvals), type = 'terms')
will resolve your issue.
a <- c(1, 2, 3, 4, 5)
b <- c(2, 3, 4, 5, 6)
result <- glm(b ~ a, family = gaussian)
summary(result)
#>
#> Call:
#> glm(formula = b ~ a, family = gaussian)
#>
#> Deviance Residuals:
#> 1 2 3 4 5
#> -1.776e-15 -8.882e-16 -8.882e-16 0.000e+00 0.000e+00
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.000e+00 1.317e-15 7.591e+14 <2e-16 ***
#> a 1.000e+00 3.972e-16 2.518e+15 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1.577722e-30)
#>
#> Null deviance: 1.0000e+01 on 4 degrees of freedom
#> Residual deviance: 4.7332e-30 on 3 degrees of freedom
#> AIC: -325.47
#>
#> Number of Fisher Scoring iterations: 1
plot(c(0, 10),
     c(0, 10),
     type = "n",
     xlab = "dose",
     ylab = "response")
xvals <- seq(0, 10, 0.1)
head(data.frame(xinputs = xvals))
#> xinputs
#> 1 0.0
#> 2 0.1
#> 3 0.2
#> 4 0.3
#> 5 0.4
#> 6 0.5
head(predict.glm(object = result,
                 newdata = data.frame(a = xvals),
                 type = 'terms'))
#> a
#> 1 -3.0
#> 2 -2.9
#> 3 -2.8
#> 4 -2.7
#> 5 -2.6
#> 6 -2.5
Created on 2020-09-15 by the reprex package (v0.3.0)
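As a follow-up, the commented-out lines() call from the question also works once the same column-name fix is applied; type = "response" keeps the predictions on the scale of the response:
lines(xvals, predict.glm(object = result,
                         newdata = data.frame(a = xvals),
                         type = "response"))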

Why is flexsurvreg failing in this painfully simple case?

Behold the painfully simple case and the error(s). Comments inline.
library(flexsurv)
#> Loading required package: survival
library(tidyverse)
library(magrittr)
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
#> The following object is masked from 'package:tidyr':
#>
#> extract
set.seed(2019)
train_data <- tribble(
  ~wait_time, ~called_yet, ~time_queued,
  131.282999992371, 0, 1570733365.28,
  358.296000003815, 1, 1570733421.187,
  1352.13999986649, 1, 1570733540.923,
  1761.61400008202, 0, 1570733941.343,
  1208.25300002098, 0, 1570734327.11,
  522.296999931335, 1, 1570734376.953,
  241.75, 0, 1570734659.44,
  143.156999826431, 0, 1570734809.673,
  1202.79999995232, 1, 1570734942.907,
  614.640000104904, 1, 1570735526.567
)
# Base survival works fine!
survival_model <- survreg(Surv(wait_time, called_yet) ~ time_queued,
                          data = train_data,
                          dist = "weibull")
survival_model
#> Call:
#> survreg(formula = Surv(wait_time, called_yet) ~ time_queued,
#> data = train_data, dist = "weibull")
#>
#> Coefficients:
#> (Intercept) time_queued
#> 4.533765e+05 -2.886352e-04
#>
#> Scale= 0.518221
#>
#> Loglik(model)= -40.2 Loglik(intercept only)= -40.5
#> Chisq= 0.5 on 1 degrees of freedom, p= 0.48
#> n= 10
# flexsurvreg can't even get a valid initializer for time_queued, even though
# the doc says it takes the mean of the data
flexsurv_model <- flexsurvreg(Surv(wait_time, called_yet) ~ time_queued,
                              data = train_data,
                              dist = "weibull")
#> Error in flexsurvreg(Surv(wait_time, called_yet) ~ time_queued, data = train_data, : Initial value for parameter 2 out of range
# Maybe the low variance of the predictor here is the problem? So let's up the
# variance just to see
train_data %<>% mutate_at("time_queued", subtract, 1.57073e9)
train_data
#> # A tibble: 10 x 3
#> wait_time called_yet time_queued
#> <dbl> <dbl> <dbl>
#> 1 131. 0 3365.
#> 2 358. 1 3421.
#> 3 1352. 1 3541.
#> 4 1762. 0 3941.
#> 5 1208. 0 4327.
#> 6 522. 1 4377.
#> 7 242. 0 4659.
#> 8 143. 0 4810.
#> 9 1203. 1 4943.
#> 10 615. 1 5527.
# Now it initializes, so that's different... but now it won't converge!
flexsurv_model <- flexsurvreg(Surv(wait_time, called_yet) ~ time_queued,
                              data = train_data,
                              dist = "weibull")
#> Warning in flexsurvreg(Surv(wait_time, called_yet) ~ time_queued, data
#> = train_data, : Optimisation has probably not converged to the maximum
#> likelihood - Hessian is not positive definite.
Created on 2019-10-19 by the reprex package (v0.3.0)
I mainly wanted to use flexsurv for its better plotting options and more standard shape and scale definitions (the ancillary parameters are very attractive too), but now I'm mainly wondering whether I'm doing something really wrong, and whether flexsurv is trying to tell me not to trust my base survival model either.
Marco Sandri pointed out that recentering fixes it; however, recentering without rescaling only guarantees initialization, and still results in non-convergence if the variance is very large. I'm considering this a bug, since survival has no problem with the exact same model on the exact same values. I've created an issue here.
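For completeness, a sketch of recentering and rescaling in one step (my suggestion, not from the thread; scale() centres to mean 0 and unit variance, and as.numeric() keeps the column a plain vector):
# Standardise the predictor, then refit on the transformed column:
train_data <- train_data %>%
  mutate(time_queued_z = as.numeric(scale(time_queued)))
flexsurv_model <- flexsurvreg(Surv(wait_time, called_yet) ~ time_queued_z,
                              data = train_data,
                              dist = "weibull")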
