offset() term in glm() in SparkR 2.1.0 ignored?

I have been trying to fit a GLM (Poisson with log link, to be specific) to a dataset in SparkR. The dataset is pretty large, so collecting it and using R's own glm() isn't likely to work. The model includes an exposure term that needs to enter as an offset (a regressor with known coefficient, 1 in my case). Unfortunately, neither adding an offset term to the formula nor passing the column name (or the column itself, or a numeric vector formed by collecting the column after selecting it) works: in the first case the formula isn't parsed, and in the other cases the offset argument is silently ignored, with no error message at all. Here's an example of what I've been trying to do (outputs in comments):
library(datasets)
#set up Spark session
#Sys.setenv(SPARK_HOME = "/usr/share/spark_2.1.0")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
options(scipen = 15, digits = 5)
sparkR.session(spark.executor.instances = "20", spark.executor.memory = "6g")
# # Setting default log level to "WARN".
# # To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
# # 17/06/19 06:33:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# # 17/06/19 06:33:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
# # 17/06/19 06:34:22 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
message(sparkR.conf()$spark.app.id)
# # application_*************_****
#Test glm() in sparkR
data("iris")
iris_df = createDataFrame(iris)
# # Warning messages:
# # 1: In FUN(X[[i]], ...) :
# # Use Sepal_Length instead of Sepal.Length as column name
# # 2: In FUN(X[[i]], ...) :
# # Use Sepal_Width instead of Sepal.Width as column name
# # 3: In FUN(X[[i]], ...) :
# # Use Petal_Length instead of Petal.Length as column name
# # 4: In FUN(X[[i]], ...) :
# # Use Petal_Width instead of Petal.Width as column name
model = glm(Sepal_Length ~ offset(Sepal_Width) + Petal_Length, data = iris_df)
# # 17/06/19 08:46:47 ERROR RBackendHandler: fit on org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed
# # java.lang.reflect.InvocationTargetException
# # ......
# # Caused by: java.lang.IllegalArgumentException: Could not parse formula: Sepal_Length ~ offset(Sepal_Width) + Petal_Length
# # at org.apache.spark.ml.feature.RFormulaParser$.parse(RFormulaParser.scala:200)
# # ......
model = glm(Sepal_Length ~ Petal_Length + offset(Sepal_Width), data = iris_df)
# # (Same error as above)
# The one below runs.
model = glm(Sepal_Length ~ Petal_Length, offset = Sepal_Width, data = iris_df, family = gaussian())
# # 17/06/19 08:51:21 WARN WeightedLeastSquares: regParam is zero, which might cause numerical instability and overfitting.
# # 17/06/19 08:51:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
# # 17/06/19 08:51:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
# # 17/06/19 08:51:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
# # 17/06/19 08:51:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
summary(model)
# # Deviance Residuals:
# # (Note: These are approximate quantiles with relative error <= 0.01)
# # Min 1Q Median 3Q Max
# # -1.24675 -0.30140 -0.01999 0.26700 1.00269
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 4.3066 0.078389 54.939 0
# # Petal_Length 0.40892 0.018891 21.646 0
# #
# # (Dispersion parameter for gaussian family taken to be 0.1657097)
# #
# # Null deviance: 102.168 on 149 degrees of freedom
# # Residual deviance: 24.525 on 148 degrees of freedom
# # AIC: 160
# #
# # Number of Fisher Scoring iterations: 1
# (RESULTS ARE SAME AS GLM WITHOUT OFFSET)
# Results in R:
model = glm(Sepal.Length ~ Petal.Length, offset = Sepal.Width, data = iris, family = gaussian())
summary(model)
# # Call:
# # glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(),
# # data = iris, offset = Sepal.Width)
# #
# # Deviance Residuals:
# # Min 1Q Median 3Q Max
# # -0.93997 -0.27232 -0.02085 0.28576 0.88944
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 0.85173 0.07098 12.00 <2e-16 ***
# # Petal.Length 0.51471 0.01711 30.09 <2e-16 ***
# # ---
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# #
# # (Dispersion parameter for gaussian family taken to be 0.1358764)
# #
# # Null deviance: 143.12 on 149 degrees of freedom
# # Residual deviance: 20.11 on 148 degrees of freedom
# # AIC: 130.27
# #
# # Number of Fisher Scoring iterations: 2
#Results in R without offset. Matches SparkR output with and w/o offset.
model = glm(Sepal.Length ~ Petal.Length, data = iris, family = gaussian())
summary(model)
# # Call:
# # glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(),
# # data = iris)
# #
# # Deviance Residuals:
# # Min 1Q Median 3Q Max
# # -1.24675 -0.29657 -0.01515 0.27676 1.00269
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 4.30660 0.07839 54.94 <2e-16 ***
# # Petal.Length 0.40892 0.01889 21.65 <2e-16 ***
# # ---
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# #
# # (Dispersion parameter for gaussian family taken to be 0.1657097)
# #
# # Null deviance: 102.168 on 149 degrees of freedom
# # Residual deviance: 24.525 on 148 degrees of freedom
# # AIC: 160.04
# #
# # Number of Fisher Scoring iterations: 2
Note: The Spark version is 2.1.0 (as in the code). From what I checked, the implementation is supposed to be there. Also, the warning messages after glm() don't always appear, but that does not seem to have an effect on what's going on.
Am I doing something wrong, or is the offset term not used in Spark's glm implementation? If it is the latter, is there any workaround to get the same results as having an offset term?

A Poisson GLM with response Y and offset log(K) is the same as a GLM with response Y/K and weights K.
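To see why: with offset log(K) and a log link, the model says E[Y] = K * exp(x'b), i.e. log E[Y] = log(K) + x'b. For a Poisson GLM with log link, the score equations reduce to sum_i w_i * (y_i - mu_i) * x_i = 0. Taking the response to be Y_i/K_i with prior weights w_i = K_i gives sum_i K_i * (Y_i/K_i - exp(x_i'b)) * x_i = sum_i (Y_i - K_i * exp(x_i'b)) * x_i = 0, which is exactly the score of the offset model, so the two fits yield identical coefficient estimates.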
Example using the Insurance dataset from MASS:
> glm(Claims ~ District + Group + Age, data=Insurance, family=poisson, offset=log(Holders))
Call: glm(formula = Claims ~ District + Group + Age, family = poisson,
data = Insurance, offset = log(Holders))
Coefficients:
(Intercept) District2 District3 District4 Group.L Group.Q Group.C Age.L Age.Q Age.C
-1.810508 0.025868 0.038524 0.234205 0.429708 0.004632 -0.029294 -0.394432 -0.000355 -0.016737
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual
Null Deviance: 236.3
Residual Deviance: 51.42 AIC: 388.7
> glm(Claims/Holders ~ District + Group + Age, data=Insurance, family=quasipoisson, weights=Holders)
Call: glm(formula = Claims/Holders ~ District + Group + Age, family = quasipoisson,
data = Insurance, weights = Holders)
Coefficients:
(Intercept) District2 District3 District4 Group.L Group.Q Group.C Age.L Age.Q Age.C
-1.810508 0.025868 0.038524 0.234205 0.429708 0.004632 -0.029294 -0.394432 -0.000355 -0.016737
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual
Null Deviance: 236.3
Residual Deviance: 51.42 AIC: NA
(The quasipoisson family is to shut R up about the non-integral values detected for the response.)
This technique should be usable with Spark's GLM implementation as well.
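For example, here is a minimal sketch of the weighted-rate workaround in SparkR, assuming a Spark session like the one in the question. Whether spark.glm() accepts a weightCol argument in your Spark version, and how createDataFrame() handles the factor columns, are assumptions here and should be verified.
insurance_df <- createDataFrame(MASS::Insurance)
# Response is the rate Claims/Holders; Holders serves as the weight column
# (assumes spark.glm() supports weightCol in this Spark version)
insurance_df$rate <- insurance_df$Claims / insurance_df$Holders
model <- spark.glm(insurance_df, rate ~ District + Group + Age,
                   family = "poisson", weightCol = "Holders")
summary(model)
One caveat: Spark's RFormula dummy-codes the (former) factor columns, so the Group and Age coefficients will be parameterized differently from R's polynomial contrasts for ordered factors, even though the fitted rates should agree. Also, Spark is not expected to enforce integer responses the way base R's poisson family does, so the quasipoisson trick should be unnecessary, though that too is worth checking.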
See also a similar question on stats.SE.

Related

Create a function(x) including glm(... ~ x, ...) when x = parameter1 * parameter2. Summary of glm() just shows intercept and x (not the parameters)

Below you can see my code and the output R gives. My question is: how can I get R to print the arguments of the function as separate terms in the summary of glm()? That is, the intercept, gender_m0, age_centered and gender_m0 * age_centered, instead of just the intercept and one combined term. I hope someone can help me with my little problem. Thank you.
test_reg <- function(parameters) {
  glm_model2 <- glm(healing ~ parameters, family = "binomial", data = psa_data)
  summary(glm_model2)
}
test_reg(psa_data$gender_m0 * age_centered)
Call:
glm(formula = healing ~ parameters, family = "binomial", data = psa_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2323 0.4486 0.4486 0.4486 0.6800
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.24590 0.13844 16.223 <2e-16 ***
parameters -0.02505 0.01369 -1.829 0.0674 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 426.99 on 649 degrees of freedom
Residual deviance: 423.79 on 648 degrees of freedom
(78 observations deleted due to missingness)
AIC: 427.79
Number of Fisher Scoring iterations: 5
The terms inside formulas are never substituted but taken literally, so glm is looking for a column called "parameters" in your data frame, which of course doesn't exist. You will need to capture the parameters from your call, deparse them and construct the formula if you want to call your function this way:
test_reg <- function(parameters) {
  f <- as.formula(paste0("healing ~ ", deparse(match.call()$parameters)))
  mod <- glm(f, family = binomial, data = psa_data)
  mod$call$formula <- f
  summary(mod)
}
Obviously, I don't have your data, but if I create a little sample data frame with the same names, we can see this works as expected:
set.seed(1)
psa_data <- data.frame(healing = rbinom(20, 1, 0.5),
                       age_centred = sample(21:40),
                       gender_m0 = rbinom(20, 1, 0.5))
test_reg(age_centred * gender_m0)
#>
#> Call:
#> glm(formula = healing ~ age_centred * gender_m0, family = binomial,
#> data = psa_data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.416 -1.281 0.963 1.046 1.379
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.05873 2.99206 0.354 0.723
#> age_centred -0.02443 0.09901 -0.247 0.805
#> gender_m0 -3.51341 5.49542 -0.639 0.523
#> age_centred:gender_m0 0.10107 0.17303 0.584 0.559
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 27.526 on 19 degrees of freedom
#> Residual deviance: 27.027 on 16 degrees of freedom
#> AIC: 35.027
#>
#> Number of Fisher Scoring iterations: 4
Created on 2022-06-29 by the reprex package (v2.0.1)

How to extract p value from ca.po function in R?

I want to get the p-value of both ca.po models. Can someone show me how?
library(data.table)
library(urca)
dt_xy = as.data.table(timeSeries::LPP2005REC[, 2:3])
res = urca::ca.po(dt_xy, type = "Pz", demean = "trend", lag = "short")
summary(res)
Here are the results; I have marked the p-values I need.
Model 1 p-value = 0.9841
Model 2 p-value = 0.1363
########################################
# Phillips and Ouliaris Unit Root Test #
########################################
Test of type Pz
detrending of series with constant and linear trend
Response SPI :
Call:
lm(formula = SPI ~ zr + trd)
Residuals:
Min 1Q Median 3Q Max
-0.036601 -0.003494 0.000243 0.004139 0.024975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.702e-04 7.954e-04 1.220 0.223
zrSPI -1.185e-02 5.227e-02 -0.227 0.821
zrSII -3.037e-02 1.374e-01 -0.221 0.825
trd -6.961e-07 3.657e-06 -0.190 0.849
Residual standard error: 0.007675 on 372 degrees of freedom
Multiple R-squared: 0.0004236, Adjusted R-squared: -0.007637
F-statistic: 0.05255 on 3 and 372 DF, p-value: 0.9841 <--- I need this p-value
Response SII :
Call:
lm(formula = SII ~ zr + trd)
Residuals:
Min 1Q Median 3Q Max
-0.0096931 -0.0018105 -0.0002734 0.0017166 0.0115427
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.598e-05 3.012e-04 -0.252 0.8010
zrSPI -1.068e-02 1.979e-02 -0.540 0.5897
zrSII -9.574e-02 5.201e-02 -1.841 0.0664 .
trd 1.891e-06 1.385e-06 1.365 0.1730
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.002906 on 372 degrees of freedom
Multiple R-squared: 0.01476, Adjusted R-squared: 0.006813
F-statistic: 1.858 on 3 and 372 DF, p-value: 0.1363 <--- I need this p-value
Value of test-statistic is: 857.4274
Critical values of Pz are:
10pct 5pct 1pct
critical values 71.9586 81.3812 102.0167
You have to dig into the res object to see its attributes and what's available there.
attributes(res)
...
#>
#> $testreg
#> Response SPI :
#>
#> Call:
#> lm(formula = SPI ~ zr + trd)
#>
...
A long list of objects is returned, but under testreg we can see what looks like the summary of an lm fit; testreg is one of the attributes of res. We can access attributes of res using attr(res, "name"), so let's look at the components of testreg.
names(attributes(res))
#> [1] "z" "type" "model" "lag" "cval" "res"
#> [7] "teststat" "testreg" "test.name" "class"
names(attr(res, "testreg"))
#> [1] "Response SPI" "Response SII"
As you noted above, you're looking for 2 separate p-values, so it makes sense that we have two separate models. Let's retrieve these and look at what they are.
spi <- attr(res, "testreg")[["Response SPI"]]
sii <- attr(res, "testreg")[["Response SII"]]
class(spi)
#> [1] "summary.lm"
So, each of them is a summary.lm object. There's lots of documentation on how to extract p-values from lm or summary.lm objects, so let's use the method described here.
get_pval <- function(summary_lm) {
  pf(
    summary_lm$fstatistic[1L],
    summary_lm$fstatistic[2L],
    summary_lm$fstatistic[3L],
    lower.tail = FALSE
  )
}
get_pval(spi)
#> value
#> 0.9840898
get_pval(sii)
#> value
#> 0.1363474
And there you go, those are the two p-values you were interested in!

designing custom model objects in R

I've coded up an estimator in R and tried to follow standard R conventions. It goes something like this:
model <- myEstimator(y ~ x1 + x2, data = df)
model has the usual stuff: coefficients, standard errors, p-values, etc.
Now I want model to play nicely with the R ecosystem for summarizing models, like summary() or sjPlot::plot_model() or stargazer::stargazer(). The way you might do summary(lm_model) where lm_model is an lm object.
How do I achieve this? Is there a standard protocol? Define a custom S3 class? Or just coerce model to an existing class like lm?
Create an S3 class and implement summary and other methods.
myEstimator <- function(formula, data) {
  result <- list(
    coefficients = 1:3,
    residuals = 1:3
  )
  class(result) <- "myEstimator"
  result
}
model <- myEstimator(y ~ x1 + x2, data = df)
Functions like summary will just call summary.default.
summary(model)
#> Length Class Mode
#> coefficients 3 -none- numeric
#> residuals 3 -none- numeric
If you wish to have your own summary function, implement summary.myEstimator.
summary.myEstimator <- function(object, ...) {
  value <- paste0(
    "Coefficients: ", paste0(object$coefficients, collapse = ", "),
    "; Residuals: ", paste0(object$residuals, collapse = ", ")
  )
  value
}
summary(model)
#> [1] "Coefficients: 1, 2, 3; Residuals: 1, 2, 3"
If your estimator is very similar to lm (your model is-a lm), then just add your class to the lm class.
myEstimatorLm <- function(formula, data) {
  result <- lm(formula, data)
  # Some customisation
  result$coefficients <- pmax(result$coefficients, 1)
  class(result) <- c("myEstimatorLm", class(result))
  result
}
model_lm <- myEstimatorLm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
class(model_lm)
#> [1] "myEstimator" "lm"
Now, summary.lm will be used.
summary(model_lm)
#> Call:
#> lm(formula = formula, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.25582 -0.46922 -0.05741 0.45530 1.75599
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.00000 0.56344 1.775 0.078 .
#> Sepal.Length 1.77559 0.06441 27.569 < 2e-16 ***
#> Sepal.Width 1.00000 0.12236 8.173 1.28e-13 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.6465 on 147 degrees of freedom
#> Multiple R-squared: 0.8677, Adjusted R-squared: 0.8659
#> F-statistic: 482 on 2 and 147 DF, p-value: < 2.2e-16
You can still implement summary.myEstimatorLm:
summary.myEstimatorLm <- summary.myEstimator
summary(model_lm)
#> [1] "Coefficients: 1, 1.77559254648113, 1; Residuals: ...

SLR of transformed data in R

For Y = % of population with income below poverty level and X = per capita income of population, I have constructed a Box-Cox plot and found that lambda = 0.02020:
bc <- boxcox(lm(Percent_below_poverty_level ~ Per_capita_income, data=tidy.CDI), plotit=T)
bc$x[which.max(bc$y)] # gives lambda
Now I want to fit a simple linear regression using the transformed data, so I've entered this code
transform <- lm((Percent_below_poverty_level**0.02020) ~ (Per_capita_income**0.02020))
transform
But all I get is the error message
'Error in terms.formula(formula, data = data) : invalid power in formula'. What is my mistake?
You could use bcPower() from the car package.
## make sure you do install.packages("car") if you haven't already
library(car)
data(Prestige)
p <- powerTransform(prestige ~ income + education + type,
                    data = Prestige,
                    family = "bcPower")
summary(p)
# bcPower Transformation to Normality
# Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
# Y1 1.3052 1 0.9408 1.6696
#
# Likelihood ratio test that transformation parameter is equal to 0
# (log transformation)
# LRT df pval
# LR test, lambda = (0) 41.67724 1 1.0765e-10
#
# Likelihood ratio test that no transformation is needed
# LRT df pval
# LR test, lambda = (1) 2.623915 1 0.10526
mod <- lm(bcPower(prestige, 1.3052) ~ income + education + type, data=Prestige)
summary(mod)
#
# Call:
# lm(formula = bcPower(prestige, 1.3052) ~ income + education +
# type, data = Prestige)
#
# Residuals:
# Min 1Q Median 3Q Max
# -44.843 -13.102 0.287 15.073 62.889
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -3.736e+01 1.639e+01 -2.279 0.0250 *
# income 3.363e-03 6.928e-04 4.854 4.87e-06 ***
# education 1.205e+01 2.009e+00 5.999 3.78e-08 ***
# typeprof 2.027e+01 1.213e+01 1.672 0.0979 .
# typewc -1.078e+01 7.884e+00 -1.368 0.1746
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 22.25 on 93 degrees of freedom
# (4 observations deleted due to missingness)
# Multiple R-squared: 0.8492, Adjusted R-squared: 0.8427
# F-statistic: 131 on 4 and 93 DF, p-value: < 2.2e-16
Powers (more often represented by ^ than ** in R, FWIW) have a special meaning inside formulas [they represent interactions among variables rather than mathematical operations]. So if you did want to power-transform both sides of your equation you would use the I() or "as-is" operator:
I(Percent_below_poverty_level^0.02020) ~ I(Per_capita_income^0.02020)
However, I think you should do what @DaveArmstrong suggested anyway:
it's only the response variable that gets transformed, and
the Box-Cox transformation is actually (y^lambda - 1)/lambda (although the shift and scale might not matter for your results); see the sketch below.
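Putting those together, a minimal sketch of the corrected fit (assuming the tidy.CDI data frame and the lambda value from the question):
lambda <- 0.02020
# Full Box-Cox transform of the response: (y^lambda - 1) / lambda, wrapped in I()
transform <- lm(I((Percent_below_poverty_level^lambda - 1) / lambda) ~ Per_capita_income,
                data = tidy.CDI)
summary(transform)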

Log-odds formula

I am interested in calculating the log-odds of the relationship between a continuous predictor and dichotomous outcome for purposes of graphically evaluating the linearity assumption for a logistic regression model. Does anyone know a formula for this? My key issue is I am unsure how to calculate an event rate for each level of the continuous predictor (i.e. number with outcome/total observations at that level).
Thank you!
Let's simulate some data to show how this can be done.
Imagine we are testing a new electrical product, and we test at a variety of temperatures to see whether temperature affects failure rate.
set.seed(69)
df <- data.frame(temperature = seq(0, 100, length.out = 1000),
                 failed = rbinom(1000, 1, seq(0.1, 0.9, length.out = 1000)))
So we have two columns: the temperature, and a dichotomous column of 1 (failed) and 0 (passed).
We can get a rough measure of the relationship between temperature and failure rate just by cutting our data frame into 5 degree bins:
df$temp_range <- cut(df$temperature, seq(0, 100, 5), include.lowest = TRUE)
We can now plot the proportion of devices that failed within each 5 degree temperature band:
library(ggplot2)
ggplot(df, aes(x = temp_range, y = failed)) + stat_summary()
#> No summary function supplied, defaulting to `mean_se()`
We can see that the probability of failure appears to go up linearly with temperature.
Now, if we get the proportions of failures in each bin, we take these as the estimate of probability of failure. This allows us to calculate the log odds of failure within each bin:
counts <- table(df$temp_range, df$failed)
probs <- counts[,2]/rowSums(counts)
logodds <- log(probs/(1 - probs))
temp_range <- seq(2.5, 97.5, 5)
logit_df <- data.frame(temp_range, probs, logodds)
So now, we can plot the log odds. Here, we will make our x axis continuous by taking the mid point of each bin as the x co-ordinate. We can then draw a linear regression through our points:
p <- ggplot(logit_df, aes(temp_range, logodds)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "black", linetype = 2, se = FALSE)
p
#> `geom_smooth()` using formula 'y ~ x'
and in fact carry out a linear regression:
summary(lm(logodds ~ temp_range))
#>
#> Call:
#> lm(formula = logodds ~ temp_range)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.70596 -0.20764 -0.06761 0.18100 1.31147
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.160639 0.207276 -10.42 4.70e-09 ***
#> temp_range 0.046025 0.003591 12.82 1.74e-10 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.463 on 18 degrees of freedom
#> Multiple R-squared: 0.9012, Adjusted R-squared: 0.8957
#> F-statistic: 164.2 on 1 and 18 DF, p-value: 1.738e-10
We can see that the linear assumption is reasonable here.
What we have just done is like a crude form of logistic regression. Let's now do it properly:
model <- glm(failed ~ temperature, data = df, family = binomial())
summary(model)
#>
#> Call:
#> glm(formula = failed ~ temperature, family = binomial(), data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.1854 -0.8514 0.4672 0.8518 2.0430
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.006197 0.159997 -12.54 <2e-16 ***
#> temperature 0.043064 0.002938 14.66 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1383.4 on 999 degrees of freedom
#> Residual deviance: 1096.0 on 998 degrees of freedom
#> AIC: 1100
#>
#> Number of Fisher Scoring iterations: 3
Notice how close the coefficients are to our hand-crafted model.
Now that we have this model, we can plot its predictions over our crude linear estimate:
mod_df <- data.frame(temp_range = 1:100,
                     logodds = predict(model, newdata = list(temperature = 1:100)))
p + geom_line(data = mod_df, colour = "red", linetype = 3, size = 2)
#> `geom_smooth()` using formula 'y ~ x'
Pretty close.
Created on 2020-06-19 by the reprex package (v0.3.0)
