Designing custom model objects in R

I've coded up an estimator in R and tried to follow R syntax. It goes something like this:
model <- myEstimator(y ~ x1 + x2, data = df)
model has the usual stuff: coefficients, standard errors, p-values, etc.
Now I want model to play nicely with the R ecosystem for summarizing models, like summary() or sjPlot::plot_model() or stargazer::stargazer(). The way you might do summary(lm_model) where lm_model is an lm object.
How do I achieve this? Is there a standard protocol? Define a custom S3 class? Or just coerce model to an existing class like lm?

Create an S3 class and implement summary etc methods.
myEstimator <- function(formula, data) {
result <- list(
coefficients = 1:3,
residuals = 1:3
)
class(result) <- "myEstimator"
result
}
model <- myEstimator(y ~ x1 + x2, data = df)
Functions like summary will just call summary.default.
summary(model)
#> Length Class Mode
#> coefficients 3 -none- numeric
#> residuals 3 -none- numeric
If you wish to have your own summary function, implement summary.myEstimator.
summary.myEstimator <- function(object, ...) {
value <- paste0(
"Coefficients: ", paste0(object$coefficients, collapse = ", "),
"; Residuals: ", paste0(object$residuals, collapse = ", ")
)
value
}
summary(model)
#> [1] "Coefficients: 1, 2, 3; Residuals: 1, 2, 3"
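One common convention, which lm itself follows, is to have summary() return a classed object and to give that class its own print method, rather than returning a string. A minimal sketch of that pattern follows; the class name toyEstimator, the std_error and df_residual fields and all the numbers are made up for illustration and are not part of the estimator above.

```r
# Toy fit object; every value here is invented for illustration
fit <- structure(
  list(
    coefficients = c(a = 1, b = 2),
    std_error    = c(a = 0.5, b = 0.4),
    df_residual  = 10
  ),
  class = "toyEstimator"
)

# summary() returns a classed list holding a coefficient table
summary.toyEstimator <- function(object, ...) {
  est  <- object$coefficients
  se   <- object$std_error
  tval <- est / se
  pval <- 2 * pt(abs(tval), df = object$df_residual, lower.tail = FALSE)
  tab  <- cbind(Estimate = est, "Std. Error" = se,
                "t value" = tval, "Pr(>|t|)" = pval)
  structure(list(coefficients = tab), class = "summary.toyEstimator")
}

# print method for the summary object, reusing stats::printCoefmat()
print.summary.toyEstimator <- function(x, ...) {
  cat("Coefficients:\n")
  printCoefmat(x$coefficients)
  invisible(x)
}

s <- summary(fit)
print(s)
```

Because the summary is a plain list underneath, other code can pull numbers out of s$coefficients instead of parsing printed text.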
If your estimator is very similar to lm (your model is-a lm), then prepend your class to the lm class vector so that the lm methods are inherited.
myEstimatorLm <- function(formula, data) {
result <- lm(formula, data)
# Some customisation
result$coefficients <- pmax(result$coefficients, 1)
class(result) <- c("myEstimatorLm", class(result))
result
}
model_lm <- myEstimatorLm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
class(model_lm)
#> [1] "myEstimatorLm" "lm"
Now, summary.lm will be used.
summary(model_lm)
#> Call:
#> lm(formula = formula, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.25582 -0.46922 -0.05741 0.45530 1.75599
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.00000 0.56344 1.775 0.078 .
#> Sepal.Length 1.77559 0.06441 27.569 < 2e-16 ***
#> Sepal.Width 1.00000 0.12236 8.173 1.28e-13 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.6465 on 147 degrees of freedom
#> Multiple R-squared: 0.8677, Adjusted R-squared: 0.8659
#> F-statistic: 482 on 2 and 147 DF, p-value: < 2.2e-16
You can still implement summary.myEstimatorLm, which then takes precedence over summary.lm:
summary.myEstimatorLm <- summary.myEstimator
summary(model_lm)
#> [1] "Coefficients: 1, 1.77559254648113, 1; Residuals: ...
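If you only want to add to the lm behaviour rather than replace it, a method on the subclass can do its own work and then hand over to the parent method with NextMethod(). A small sketch, assuming a hypothetical subclass called toyLm (not the myEstimatorLm defined above):

```r
# Hypothetical constructor: an ordinary lm fit with an extra class in front
toy_lm <- function(formula, data) {
  fit <- lm(formula, data = data)
  class(fit) <- c("toyLm", class(fit))
  fit
}

# Print a custom header, then fall through to print.lm
print.toyLm <- function(x, ...) {
  cat("<toyLm fit>\n")
  NextMethod()
}

fit <- toy_lm(Petal.Length ~ Sepal.Length, data = iris)
inherits(fit, "lm")  # TRUE, so code that checks for lm objects still accepts it
print(fit)
```

Generics without a toyLm method, such as coef() or predict(), keep dispatching to the lm methods.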

Related

Perform multiple linear regression analysis including interaction terms, interpret results using summary() and diagnostic plots using lm()

I tried to perform a multiple linear regression analysis with code like this one but with no success. I tried to do it with lm() function. I think there is a problem with the 'x1*x2'.
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))
model <- lm(y ~ x1 + x2 + x1*x2)
summary(model)
plot(model)
It shows me an error.
What should I do?
The error is not caused by your interaction term; that part worked perfectly when I tested it. You forgot to specify the data: x1, x2 and y are columns of your data frame, not variables in your workspace, so lm() needs data = data to find them. In the code below I also shortened the formula, because x1*x2 already expands to x1 + x2 + x1:x2 (both main effects plus the interaction), so you don't have to repeat the variable names.
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))
model <- lm(y ~ x1*x2,
data= data)
summary(model)
#>
#> Call:
#> lm(formula = y ~ x1 * x2, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.21772 -0.77564 0.06347 0.56901 2.15324
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.05853 0.09914 -0.590 0.5564
#> x1 0.17384 0.09466 1.836 0.0694 .
#> x2 -0.02830 0.08646 -0.327 0.7442
#> x1:x2 -0.00836 0.07846 -0.107 0.9154
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9792 on 96 degrees of freedom
#> Multiple R-squared: 0.03423, Adjusted R-squared: 0.004055
#> F-statistic: 1.134 on 3 and 96 DF, p-value: 0.3392
Created on 2023-01-14 with reprex v2.0.2
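To check the formula expansion yourself, you can compare the term labels R derives from each spelling. This is a small sketch on simulated data; the data frame d and the seed are arbitrary choices for the example.

```r
set.seed(42)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), y = rnorm(50))

# Term labels that R derives from each formula
labels_star <- attr(terms(y ~ x1 * x2), "term.labels")
labels_long <- attr(terms(y ~ x1 + x2 + x1:x2), "term.labels")
identical(labels_star, labels_long)

# The short and the redundant spelling fit the same model
m_short <- lm(y ~ x1 * x2, data = d)
m_long  <- lm(y ~ x1 + x2 + x1 * x2, data = d)
all.equal(coef(m_short), coef(m_long))
```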

Create a function(x) including glm(... ~ x, ...) when x = parameter1 * parameter2. Summary of glm() just shows intercept and x (not the parameters)

Below you can see my code and the output R gives. My question is: how can I get R to print the arguments of the function as separate terms in the summary of glm(), i.e. the intercept, gender_m0, age_centered and gender_m0 * age_centered, instead of just the intercept and a single parameters term? I hope someone can help me with my little problem. Thank you.
test_reg <- function(parameters){
glm_model2 <- glm(healing ~ parameters, family = "binomial", data = psa_data)
summary(glm_model2)}
test_reg(psa_data$gender_m0 * age_centered)
Call:
glm(formula = healing ~ parameters, family = "binomial", data = psa_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2323 0.4486 0.4486 0.4486 0.6800
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.24590 0.13844 16.223 <2e-16 ***
parameters -0.02505 0.01369 -1.829 0.0674 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 426.99 on 649 degrees of freedom
Residual deviance: 423.79 on 648 degrees of freedom
(78 observations deleted due to missingness)
AIC: 427.79
Number of Fisher Scoring iterations: 5
Names inside a formula are never substituted but taken literally, so glm looks for a variable called "parameters". There is no such column in your data frame, so it falls back on the function argument, which by then is just a single numeric vector (the product), and you get one coefficient labelled parameters. You will need to capture the parameters from your call, deparse them and construct the formula if you want to call your function this way:
test_reg <- function(parameters) {
f <- as.formula(paste0("healing ~ ", deparse(match.call()$parameters)))
mod <- glm(f, family = binomial, data = psa_data)
mod$call$formula <- f
summary(mod)
}
Obviously, I don't have your data, but if I create a little sample data frame with the same names, we can see this works as expected:
set.seed(1)
psa_data <- data.frame(healing = rbinom(20, 1, 0.5),
age_centred = sample(21:40),
gender_m0 = rbinom(20, 1, 0.5))
test_reg(age_centred * gender_m0)
#>
#> Call:
#> glm(formula = healing ~ age_centred * gender_m0, family = binomial,
#> data = psa_data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.416 -1.281 0.963 1.046 1.379
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.05873 2.99206 0.354 0.723
#> age_centred -0.02443 0.09901 -0.247 0.805
#> gender_m0 -3.51341 5.49542 -0.639 0.523
#> age_centred:gender_m0 0.10107 0.17303 0.584 0.559
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 27.526 on 19 degrees of freedom
#> Residual deviance: 27.027 on 16 degrees of freedom
#> AIC: 35.027
#>
#> Number of Fisher Scoring iterations: 4
Created on 2022-06-29 by the reprex package (v2.0.1)
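An alternative that avoids capturing the call, offered as a sketch rather than a drop-in replacement: pass the right-hand side as a string and build the formula with reformulate(). The function name fit_rhs, the data frame toy and its columns are invented for this example.

```r
set.seed(1)
# Simulated stand-in data; the column names are made up
toy <- data.frame(
  healing = rbinom(40, 1, 0.5),
  age     = rnorm(40),
  gender  = rbinom(40, 1, 0.5)
)

# Build "healing ~ <rhs>" from a string, then fit a logistic regression
fit_rhs <- function(rhs, data) {
  f <- reformulate(rhs, response = "healing")
  fit <- glm(f, family = binomial, data = data)
  fit$call$formula <- f  # so the printed call shows the real formula
  fit
}

fit <- fit_rhs("age * gender", toy)
names(coef(fit))
```

The trade-off is that callers write the terms in quotes, but the function no longer depends on how it was called.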

How to extract p value from ca.po function in R?

I want to get the p-value of both ca.po models. Can someone show me how?
library(data.table)
library(urca)
dt_xy = as.data.table(timeSeries::LPP2005REC[, 2:3])
res = urca::ca.po(dt_xy, type = "Pz", demean = "trend", lag = "short")
summary(res)
And the results. I marked the p-values I need in the result.
Model 1 p-value = 0.9841
Model 2 p-value = 0.1363
########################################
# Phillips and Ouliaris Unit Root Test #
########################################
Test of type Pz
detrending of series with constant and linear trend
Response SPI :
Call:
lm(formula = SPI ~ zr + trd)
Residuals:
Min 1Q Median 3Q Max
-0.036601 -0.003494 0.000243 0.004139 0.024975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.702e-04 7.954e-04 1.220 0.223
zrSPI -1.185e-02 5.227e-02 -0.227 0.821
zrSII -3.037e-02 1.374e-01 -0.221 0.825
trd -6.961e-07 3.657e-06 -0.190 0.849
Residual standard error: 0.007675 on 372 degrees of freedom
Multiple R-squared: 0.0004236, Adjusted R-squared: -0.007637
F-statistic: 0.05255 on 3 and 372 DF, p-value: 0.9841 **<--- I need this p.value**
Response SII :
Call:
lm(formula = SII ~ zr + trd)
Residuals:
Min 1Q Median 3Q Max
-0.0096931 -0.0018105 -0.0002734 0.0017166 0.0115427
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.598e-05 3.012e-04 -0.252 0.8010
zrSPI -1.068e-02 1.979e-02 -0.540 0.5897
zrSII -9.574e-02 5.201e-02 -1.841 0.0664 .
trd 1.891e-06 1.385e-06 1.365 0.1730
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.002906 on 372 degrees of freedom
Multiple R-squared: 0.01476, Adjusted R-squared: 0.006813
F-statistic: 1.858 on 3 and 372 DF, p-value: 0.1363 **<--- I need this p.value**
Value of test-statistic is: 857.4274
Critical values of Pz are:
10pct 5pct 1pct
critical values 71.9586 81.3812 102.0167
You have to dig into the res object to see its attributes and what's available there.
attributes(res)
...
#>
#> $testreg
#> Response SPI :
#>
#> Call:
#> lm(formula = SPI ~ zr + trd)
#>
...
A long list of objects is returned, but under testreg we can see what looks like the output of summary() on an lm fit. testreg is one of the attributes of res, and attributes can be accessed with attr(res, "name"), so let's look at the components of testreg.
names(attributes(res))
#> [1] "z" "type" "model" "lag" "cval" "res"
#> [7] "teststat" "testreg" "test.name" "class"
names(attr(res, "testreg"))
#> [1] "Response SPI" "Response SII"
As you noted above, you're looking for 2 separate p-values, so it makes sense that we have two separate models. Let's retrieve these and look at what they are.
spi <- attr(res, "testreg")[["Response SPI"]]
sii <- attr(res, "testreg")[["Response SII"]]
class(spi)
#> [1] "summary.lm"
So, each of them is a summary.lm object. There's lots of documentation on how to extract p-values from lm or summary.lm objects; the usual approach is to compute the upper tail of the F distribution from the stored fstatistic.
get_pval <- function(summary_lm) {
pf(
summary_lm$fstatistic[1L],
summary_lm$fstatistic[2L],
summary_lm$fstatistic[3L],
lower.tail = FALSE
)
}
get_pval(spi)
#> value
#> 0.9840898
get_pval(sii)
#> value
#> 0.1363474
And there you go, those are the two p-values you were interested in!
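If you need this for many models at once, the same helper can be mapped over a list of summary.lm objects. The sketch below uses two ordinary lm fits on mtcars as stand-ins, since those are easy to reproduce; with a ca.po result you would pass attr(res, "testreg") instead. The helper is a variant of get_pval that indexes fstatistic by name.

```r
# Overall F-test p-value from a summary.lm object
get_pval <- function(summary_lm) {
  f <- summary_lm$fstatistic
  unname(pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE))
}

# Stand-in list of summary.lm objects (not ca.po output)
fits <- list(
  wt = summary(lm(mpg ~ wt, data = mtcars)),
  hp = summary(lm(mpg ~ hp, data = mtcars))
)

pvals <- vapply(fits, get_pval, numeric(1))
pvals
```

With a single predictor the overall F-test and the slope's t-test coincide, which gives a quick sanity check against the coefficient table.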

Avoid losing formulas when applying the lm function over a list of formulas in R

I'm trying to take all pairs of variables in the mtcars data set and make a linear model using the lm function. But my approach is causing me to lose the formulas when I go to summarize or plot the models. Here is the code that I am using.
library(tidyverse)
my_vars <- names(mtcars)
pairs <- t(combn(my_vars, 2)) # Get all possible pairs of variables
# Create formulas for the lm model
fmls <-
as.tibble(pairs) %>%
mutate(fml = paste(V1, V2, sep = "~")) %>%
select(fml) %>%
.[[1]] %>%
sapply(as.formula)
# Create a linear model for each pair of variables
mods <- lapply(fmls, function(v) lm(data = mtcars, formula = v))
# print the summary of all variables
for (i in 1:length(mods)) {
print(summary(mods[[i]]))
}
(I snagged the idea of using strings to make formulas from the question "Pass a vector of variables into lm() formula".) Here is the output of the summary for the first model (summary(mods[[1]])):
Call:
lm(formula = v, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
I'm searching for a (perhaps metaprogramming) technique so that the call line looks something like lm(formula = var1 ~ var2, data = mtcars) as opposed to formula = v.
I made pairs into a data frame, to make life easier:
library(tidyverse)
my_vars <- names(mtcars)
pairs <- t(combn(my_vars, 2)) %>%
  as.data.frame() # Get all possible pairs of variables
You can do this by building each lm() call as text and then running it with parse() and eval(), so that the formula appears literally in the call.
listOfRegs <- apply(pairs, 1, function(pair) {
  V1 <- as.character(pair[[1]])
  V2 <- as.character(pair[[2]])
  # Build the call as text so the variable names end up in the stored call
  eval(parse(text = paste0("lm(", V1, " ~ ", V2, ", data = mtcars)")))
})
lapply(listOfRegs, summary)
Then:
> lapply(listOfRegs, summary)
[[1]]
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
... etc
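If you would rather not paste code together as text, a similar effect can be had by splicing a formula object into the call with bquote() before evaluating it. This is a sketch of that alternative, not the approach used above, and it uses only three mtcars columns to keep the output short.

```r
vars  <- c("mpg", "cyl", "disp")
pairs <- combn(vars, 2, simplify = FALSE)

fits <- lapply(pairs, function(p) {
  f <- reformulate(p[2], response = p[1])  # e.g. mpg ~ cyl
  # .(f) is replaced by the formula itself before lm() is called,
  # so the stored call contains the formula rather than a variable name
  eval(bquote(lm(.(f), data = mtcars)))
})

fits[[1]]$call
```

Here the formula is built from column names with reformulate(), so no string of R code is ever parsed.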

Need help modeling data with a log() function

I have some data that Excel will fit pretty nicely with a logarithmic trend. I want to pass the same data into R and have it tell me the coefficients and intercept. What form should the data be in, and what function should I call to have it figure out the coefficients? Ultimately, I want to do this thousands of times so that I can project into the future.
Passing Excel these values produces this trendline function: y = -0.099ln(x) + 0.7521
Data:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813,
0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
For context, the data points represent % of our user base that are retained on a given day.
The question omitted the value of x but working backwards it seems you were using 1, 2, 3, ... so try the following:
x <- 1:11
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647,
0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076,
0.514708368)
fm <- lm(y ~ log(x))
giving:
> coef(fm)
(Intercept) log(x)
0.7521 -0.0990
and
plot(y ~ x, log = "x")
lines(fitted(fm) ~ x, col = "red")
You can get the same results by:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
t <- seq_along(y)
> summary(lm(y~log(t)))
Call:
lm(formula = y ~ log(t))
Residuals:
Min 1Q Median 3Q Max
-3.894e-10 -2.288e-10 -2.891e-11 1.620e-10 4.609e-10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.521e-01 2.198e-10 3421942411 <2e-16 ***
log(t) -9.900e-02 1.261e-10 -784892428 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.972e-10 on 9 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.161e+17 on 1 and 9 DF, p-value: < 2.2e-16
For large projects I recommend encapsulating the data in a data frame, like:
df <- data.frame(y, t)
lm(formula = y ~ log(t), data=df)
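Keeping the data in a data frame also makes projecting forward straightforward with predict(). A minimal sketch, assuming the days are numbered 1, 2, 3, ... as above; the y values here are generated from the Excel trendline rather than taken from the real data, and the name future is just for illustration.

```r
# Synthetic data generated from the trendline y = -0.099 * ln(x) + 0.7521
x <- 1:11
y <- 0.7521 - 0.099 * log(x)
d <- data.frame(x, y)

fm <- lm(y ~ log(x), data = d)

# Project retention for days 12 to 14; the column name must match the formula
future <- data.frame(x = 12:14)
projected <- predict(fm, newdata = future)
projected
```

To repeat this thousands of times, the fit and predict steps can be wrapped in a function and applied per data set.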

Resources