Get access to regression train model - r

I have an exercise in which I need to train a linear regression model and get some information about the model:
linear relationship between my chosen variable and the other variables
which variables are important for the model
significance
It's easy to create a model with the lm function, so that I can interpret it with
summary(mod).
mod <- lm(height ~ ., data = cars)
The summary() method returns everything: R-squared, coefficients, p-values, significance ...
But when I train my model like this:
library(mlr)
lrn = makeLearner("regr.ksvm")
mod = train(learner = lrn, task = task)
pred = predict(object = mod, newdata = test)
performance(pred = pred, measures = list(mse, arsq))
I'm just getting the MSE and the adjusted R-squared. How do I get the other information, like significance, important variables ...
Is there a chance to get access to this model?
Thanks for help

library(mlr)
#> Loading required package: ParamHelpers
#> 'mlr' is in maintenance mode since July 2019. Future development
#> efforts will go into its successor 'mlr3' (<https://mlr3.mlr-org.com>).
lrn = makeLearner("regr.lm")
mod = train(learner = lrn, task = bh.task)
getLearnerModel(mod)
#>
#> Call:
#> stats::lm(formula = f, data = d)
#>
#> Coefficients:
#> (Intercept) crim zn indus chas1 nox
#> 3.646e+01 -1.080e-01 4.642e-02 2.056e-02 2.687e+00 -1.777e+01
#> rm age dis rad tax ptratio
#> 3.810e+00 6.922e-04 -1.476e+00 3.060e-01 -1.233e-02 -9.527e-01
#> b lstat
#> 9.312e-03 -5.248e-01
Created on 2020-01-15 by the reprex package (v0.3.0.9001)
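Since getLearnerModel() returns the underlying stats::lm object, the usual summary() tools should apply to it. The following is a minimal sketch, assuming the mlr package is installed and reusing the bh.task example task from above. It only helps for learners whose fitted model has such a summary: a regr.ksvm model has no coefficients or p-values to report.

```r
library(mlr)

# Train the wrapped linear model on the Boston housing example task shipped with mlr
lrn <- makeLearner("regr.lm")
mod <- train(learner = lrn, task = bh.task)

# Pull the plain lm object out of the mlr WrappedModel
fit <- getLearnerModel(mod)

# Coefficient table: estimates, standard errors, t values and p-values
coefs <- coef(summary(fit))
print(coefs)

# R-squared as reported by summary.lm
r2 <- summary(fit)$r.squared
print(r2)
```

Anything else that works on an lm object (confint(), anova(), plot()) can be applied to fit in the same way.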

Related

Predict function in R not giving an interval

I'm trying to use predict() in R to compute a prediction interval for a linear model. When I tried this on a simpler model with only one covariate, it gave the expected output of a point estimate with a confidence interval. When I added a categorical predictor to the model, the predict() output gives what seems like a single-point estimate with no interval. I've Googled to no avail. Can anyone tell me what I've done wrong here?
medcost <- data.frame(
ID = c(1:100),
charges = sample(0:100000, 100, replace = T),
bmi = sample(18:40, 100, replace = T),
smoker = factor(sample(c("smoker", "nonsmoker"), 100, replace = TRUE))
)
mod2 <- glm(charges ~ bmi + smoker, data = medcost)
predict(mod2, interval="predict",
newdata = data.frame(bmi=c(29, 31.5), smoker=c("smoker", "smoker")))
If you want to have the standard error, you could use se.fit = TRUE like this:
mod2 <- glm(charges ~ bmi + smoker, data = medcost)
predict(mod2, interval="predict",
newdata = data.frame(bmi=c(29, 31.5), smoker=c("smoker", "smoker")),
se.fit = TRUE)
#> $fit
#> 1 2
#> 47638.66 47106.14
#>
#> $se.fit
#> 1 2
#> 4304.220 4475.473
#>
#> $residual.scale
#> [1] 28850.85
Created on 2023-01-17 with reprex v2.0.2
I would recommend having a look at this post: R: glm(...,family=poisson) plot confidence and prediction intervals
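The reason no interval appears is that predict() for glm objects has no interval argument, so interval = "predict" is silently ignored. For a Gaussian model like this one, two ways around it are sketched below, under the assumption that a normal-theory interval is what is wanted: refit with lm(), or build the interval by hand from se.fit and residual.scale. The seed and the object names (new_obs, pi_lm, pi_glm) are illustrative choices, not part of the original question.

```r
set.seed(42)  # arbitrary seed so the simulated data are reproducible
medcost <- data.frame(
  charges = sample(0:100000, 100, replace = TRUE),
  bmi     = sample(18:40, 100, replace = TRUE),
  smoker  = factor(sample(c("smoker", "nonsmoker"), 100, replace = TRUE))
)
new_obs <- data.frame(bmi = c(29, 31.5), smoker = c("smoker", "smoker"))

# Option 1: fit with lm(), whose predict method understands `interval`
mod_lm <- lm(charges ~ bmi + smoker, data = medcost)
pi_lm  <- predict(mod_lm, newdata = new_obs, interval = "prediction", level = 0.95)

# Option 2: keep the Gaussian glm() and build the interval from se.fit
mod_glm <- glm(charges ~ bmi + smoker, data = medcost)
p       <- predict(mod_glm, newdata = new_obs, se.fit = TRUE)
t_crit  <- qt(0.975, df = df.residual(mod_glm))
# Prediction variance = variance of the fitted mean + residual variance
half    <- t_crit * sqrt(p$se.fit^2 + p$residual.scale^2)
pi_glm  <- cbind(fit = p$fit, lwr = p$fit - half, upr = p$fit + half)

print(pi_lm)
print(pi_glm)
```

The manual construction in option 2 is only appropriate for the Gaussian family with identity link; for other families (Poisson, binomial) prediction intervals need a different approach, as discussed in the linked post.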

Can the out of bag error for a random forests model in R's TidyModel's framework be obtained?

If you use the ranger function directly, you can obtain the out-of-bag error from the resulting ranger class object.
If you instead set up a recipe, a model specification/engine, tuning parameters, etc., how can you extract that same error? The tidymodels approach doesn't seem to hold on to that data.
If you want to access the ranger object inside of the parsnip object, it is there as $fit:
library(tidymodels)
data("ad_data", package = "modeldata")
rf_spec <-
rand_forest() %>%
set_engine("ranger", oob.error = TRUE) %>%
set_mode("classification")
rf_fit <- rf_spec %>%
fit(Class ~ ., data = ad_data)
rf_fit
#> parsnip model object
#>
#> Fit time: 158ms
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, oob.error = ~TRUE, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 500
#> Sample size: 333
#> Number of independent variables: 130
#> Mtry: 11
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.1340793
class(rf_fit)
#> [1] "_ranger" "model_fit"
class(rf_fit$fit)
#> [1] "ranger"
rf_fit$fit$prediction.error
#> [1] 0.1340793
Created on 2021-03-11 by the reprex package (v1.0.0)
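When the model is fitted through a workflow with a recipe, as the question describes, the ranger object sits one level deeper. A minimal sketch, assuming a version of tidymodels recent enough to provide extract_fit_engine(); the seed and object names are illustrative:

```r
library(tidymodels)
data("ad_data", package = "modeldata")

set.seed(123)  # arbitrary seed for reproducibility

rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Bundle a (trivial) recipe and the model specification into a workflow
rf_wf <- workflow() %>%
  add_recipe(recipe(Class ~ ., data = ad_data)) %>%
  add_model(rf_spec)

rf_wf_fit <- fit(rf_wf, data = ad_data)

# extract_fit_engine() returns the underlying ranger object
ranger_obj <- extract_fit_engine(rf_wf_fit)
oob <- ranger_obj$prediction.error
print(oob)
```

For a probability forest like this one, prediction.error is the out-of-bag Brier score, matching the "OOB prediction error (Brier s.)" line in the printed output above.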

Line not getting in scatterplot R

I have an issue getting a line in R. I have the following code:
#7.4
NFull <- tp$ntest;
Ni <- 0.7*log(tp$ntest);
#install.packages(mgcv)
library(mgcv)
plot(tp$pos ~ tp$dateno, main="Deltaudglatning")
xval <- with(tp, seq(min(tp$dateno), max(tp$dateno), length.out = 224))
fitgam <- gam(tp$pos ~ s(tp$dateno, k=4)+offset(Ni), tp, family=quasipoisson, method="REML")
summary(fitgam)
lines(xval, predict(fitgam, data.frame(xval),type="response"), col="green")
I would like to get it in this scatterplot:
[![enter image description here][1]][1]
Can anyone help here?
Screenshot of my data, which has 224 lines in total:
[![enter image description here][2]][2]
Link to data:
[1]: https://i.stack.imgur.com/DoOKR.png
[2]: https://i.stack.imgur.com/KFycU.png
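Your issue is that the data frame you pass to predict() does not name its column after the predictor used in the model (dateno), so predict() cannot match the new values to the smooth term. It also helps to write the formula with plain column names (pos, dateno) rather than tp$pos and tp$dateno.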
Here is an example with (obviously quite different) simulated data that should work:
tp <- data.frame(date = as.Date('2020-04-01') + 0:223,
dateno = 1:224,
ntest = sample(1000:7000, 224),
pos = sample(140:500,224,T))
NFull <- tp$ntest;
Ni <- 0.7*log(tp$ntest);
#install.packages(mgcv)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
plot(tp$pos ~ tp$dateno, main="Deltaudglatning")
xval <- with(tp, seq(min(tp$dateno), max(tp$dateno), length.out = 224))
fitgam <- gam(pos ~ s(dateno, k=4)+offset(Ni), tp, family=quasipoisson, method="REML")
summary(fitgam)
#>
#> Family: quasipoisson
#> Link function: log
#>
#> Formula:
#> pos ~ s(dateno, k = 4) + offset(Ni)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.008858 0.033060 -0.268 0.789
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(dateno) 1 1 1.422 0.234
#>
#> R-sq.(adj) = -1.03 Deviance explained = 0.629%
#> -REML = 607.37 Scale est. = 79.403 n = 224
lines(xval, predict(fitgam, data.frame(dateno = xval),type="response"), col="green")
#> Warning in predict.gam(fitgam, data.frame(dateno = xval), type = "response"): not all required variables have been supplied in newdata!
Created on 2020-12-17 by the reprex package (v0.3.0)
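The remaining warning in the output above is most likely raised because the offset variable Ni lives in the global environment rather than in newdata. One way to avoid it, sketched here with the same kind of simulated data, is to store the offset as a column of tp and supply it in newdata as well. The seed and the name newd are illustrative assumptions.

```r
library(mgcv)

set.seed(1)  # arbitrary seed for reproducibility
tp <- data.frame(dateno = 1:224,
                 ntest  = sample(1000:7000, 224),
                 pos    = sample(140:500, 224, TRUE))

# Keep the offset inside the data frame so predict() can find it in newdata
tp$Ni <- 0.7 * log(tp$ntest)

fitgam <- gam(pos ~ s(dateno, k = 4) + offset(Ni), data = tp,
              family = quasipoisson, method = "REML")

# newdata carries both the predictor and the offset, under the model's names
newd <- data.frame(dateno = tp$dateno, Ni = tp$Ni)
fitted_line <- predict(fitgam, newdata = newd, type = "response")

plot(pos ~ dateno, data = tp, main = "Deltaudglatning")
lines(newd$dateno, fitted_line, col = "green")
```

Because the offset varies from day to day, the plotted line follows the number of tests as well as the smooth trend; to show the trend alone, a constant value of Ni could be supplied in newd instead.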

Why is lm_robust() HC3 standard error smaller than coeftest() HC0 standard error?

I am using lm_robust from the package 'estimatr' for a fixed-effects model with HC3 robust standard errors. I had to switch from vcovHC() because my data sample was just too large for it to handle.
I use the following line for the regression:
lm_robust(log(SPREAD) ~ PERIOD, data = dat, fixed_effects = ~ STOCKS + TIME, se_type = "HC3")
The code runs fine, and the coefficients are the same as those from a fixed-effects model in package plm. Since I cannot use coeftest to estimate HC3 standard errors with the plm output because the data sample is too large, I compared the HC3 estimator of lm_robust with the HC1 estimator of coeftest(model, vcov = vcovHC(model, type = "HC1"))
As a result, the HC3 standard error of lm_robust is much smaller than the HC1 standard error from coeftest.
Does somebody have an explanation, since HC3 should be more conservative than HC1? I appreciate any recommendations and solutions.
EDIT model used for coeftest:
plm(log(SPREAD) ~ PERIOD, data = dat, index = c("STOCKS", "TIME"), effect = "twoways", model = "within")
It appears that the vcovHC() method for plm automatically estimates cluster-robust standard errors, while lm_robust() does not. Therefore, the HC1 estimate of the standard error for plm will appear inflated compared to lm_robust (or lm, for that matter).
Using some toy data:
library(sandwich)
library(tidyverse)
library(plm)
library(estimatr)
library(lmtest)
set.seed(1981)
x <- sin(1:1000)
y <- 1 + x + rnorm(1000)
f <- as.character(sort(rep(sample(1:100), 10)))
t <- as.character(rep(sort(sample(1:10)), 100))
dat <- tibble(y = y, x = x, f = f, t = t)
lm_fit <- lm(y ~ x + f + t, data = dat)
plm_fit <- plm(y ~ x, index = c("f", "t"), model = "within", effect = "twoways", data = dat)
rb_fit <- lm_robust(y ~ x, fixed_effects = ~ f + t, data = dat, se_type = "HC1", return_vcov = TRUE)
sqrt(vcovHC(lm_fit, type = "HC1")[2, 2])
#> [1] 0.04752337
sqrt(vcovHC(plm_fit, type = "HC1"))
#> x
#> x 0.05036414
#> attr(,"cluster")
#> [1] "group"
sqrt(rb_fit$vcov)
#> x
#> x 0.04752337
rb_fit <- lm_robust(y ~ x, fixed_effects = ~ f + t, data = dat, se_type = "HC3", return_vcov = TRUE)
sqrt(vcovHC(lm_fit, type = "HC3")[2, 2])
#> [1] 0.05041177
sqrt(vcovHC(plm_fit, type = "HC3"))
#> x
#> x 0.05042142
#> attr(,"cluster")
#> [1] "group"
sqrt(rb_fit$vcov)
#> x
#> x 0.05041177
There do not appear to be equivalent cluster-robust standard error types in the two packages. However, the SEs get closer when specifying cluster-robust SEs in lm_robust():
rb_fit <- lm_robust(y ~ x, fixed_effects = ~ f + t, clusters = f, data = dat, se_type = "CR0")
summary(rb_fit)
#>
#> Call:
#> lm_robust(formula = y ~ x, data = dat, clusters = f, fixed_effects = ~f +
#> t, se_type = "CR0")
#>
#> Standard error type: CR0
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
#> x 0.925 0.05034 18.38 1.133e-33 0.8251 1.025 99
#>
#> Multiple R-squared: 0.3664 , Adjusted R-squared: 0.2888
#> Multiple R-squared (proj. model): 0.3101 , Adjusted R-squared (proj. model): 0.2256
#> F-statistic (proj. model): 337.7 on 1 and 99 DF, p-value: < 2.2e-16
coeftest(plm_fit, vcov. = vcovHC(plm_fit, type = "HC1"))
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> x 0.925009 0.050364 18.366 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Created on 2020-04-16 by the reprex package (v0.3.0)
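Going the other way, the plm method of vcovHC() has a method argument whose default, "arellano", is the clustered estimator; method = "white1" requests a heteroskedasticity-only estimator without clustering. A minimal sketch on the same kind of toy data follows. Whether the result matches lm_robust() exactly depends on the degrees-of-freedom corrections each package applies, so treat the comparison as indicative only.

```r
library(plm)

set.seed(1981)
dat <- data.frame(
  x = sin(1:1000),
  f = as.character(sort(rep(sample(1:100), 10))),  # 100 groups
  t = as.character(rep(sort(sample(1:10)), 100))   # 10 periods per group
)
dat$y <- 1 + dat$x + rnorm(1000)

plm_fit <- plm(y ~ x, data = dat, index = c("f", "t"),
               model = "within", effect = "twoways")

# Default behaviour: Arellano estimator, clustered by group
se_cluster <- sqrt(diag(vcovHC(plm_fit, method = "arellano", type = "HC1")))

# Heteroskedasticity-robust only, no clustering
se_white <- sqrt(diag(vcovHC(plm_fit, method = "white1", type = "HC1")))

print(c(clustered = unname(se_cluster), white1 = unname(se_white)))
```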

How can I omit the regression intercept from my results table in stargazer

I run a regression of the type
model <- lm(y~x1+x2+x3, weights = wei, data=data1)
and then create my table
t <- stargazer(model, omit="x2", omit.labels="x1")
but I haven't found a way to omit the intercept results from the table. I need it in the regression, yet I don't want to show it in the table.
Is there a way to do it through stargazer?
I don't have your dataset, but typing omit = c("Constant", "x2") should work.
As a reproducible example (stargazer 5.2)
stargazer::stargazer(
lm(Fertility ~ . ,
data = swiss),
type = "text",
omit = c("Constant", "Agriculture"))
Edit: Add in omit.labels
mdls <- list(
m1 = lm(Days ~ -1 + Reaction, data = lme4::sleepstudy),
m2 = lm(Days ~ Reaction, data = lme4::sleepstudy),
m3 = lm(Days ~ Reaction + Subject, data = lme4::sleepstudy)
)
stargazer::stargazer(
mdls, type = "text", column.labels = c("Omit none", "Omit int.", "Omit int/subj"),
omit = c("Constant", "Subject"),
omit.labels = c("Intercept", "Subj."),
keep.stat = "n")
#>
#> ==============================================
#> Dependent variable:
#> ---------------------------------
#> Days
#> Omit none Omit int. Omit int/subj
#> (1) (2) (3)
#> ----------------------------------------------
#> Reaction 0.015*** 0.027*** 0.049***
#> (0.001) (0.003) (0.004)
#>
#> ----------------------------------------------
#> Intercept No No No
#> Subj. No No No
#> ----------------------------------------------
#> Observations 180 180 180
#> ==============================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Created on 2020-05-08 by the reprex package (v0.3.0)
Note that the omit.labels rows of the table should read as follows; the output above appears to be a bug (stargazer 5.2.2).
#> Intercept No Yes Yes
#> Subj. No No Yes
I found a way of doing it. It is not the cleverest way, but it works.
I just changed the omit argument to a keep argument. In my example above:
library(stargazer)
model <- lm(y~x1+x2+x3, weights = wei, data=data1)
t <- stargazer(model, keep=c("x1","x3"), omit.labels="x1")
However, this is not an efficient approach when there are many variables you want to keep in the regression table.
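For the weighted model from the question, the omit approach can be sketched as follows. The data are simulated stand-ins for data1 (the real data were not posted), and the anchored pattern "^x2$" relies on omit being interpreted as a regular expression, which keeps it from also dropping variables such as x21.

```r
library(stargazer)

set.seed(1)  # arbitrary seed; data1 here is simulated, not the asker's data
data1 <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50),
                    wei = runif(50, 0.5, 2))
data1$y <- 1 + 2 * data1$x1 - data1$x2 + 0.5 * data1$x3 + rnorm(50)

# The intercept stays in the regression itself
model <- lm(y ~ x1 + x2 + x3, weights = wei, data = data1)

# Drop the intercept ("Constant") and x2 from the printed table only
tab <- capture.output(
  stargazer(model, type = "text", omit = c("Constant", "^x2$"))
)
cat(tab, sep = "\n")
```

Unlike the keep approach, this only requires listing the terms to hide, so it scales better when most variables should remain in the table.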
