Using the code below, I summarize a linear model:
x = c(1,2,3)
y = c(1,2,3)
m = lm(y ~ x)
summary(m)
This prints:
Call:
lm(formula = y ~ x)
Residuals:
1 2 3
0 0 0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0 0 NA NA
x 1 0 Inf <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0 on 1 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: Inf on 1 and 1 DF, p-value: < 2.2e-16
Warning message:
In summary.lm(m) : essentially perfect fit: summary may be unreliable
How can I create a new summary function that returns just the 'Coefficients' table, like this:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
waiting 0.075628 0.002219 34.09 <2e-16 ***
Here is my code:
tosum2 <- summary(m)
summary.myclass <- function(x)
{
return(x$Coefficients)
}
class(tosum2) <- c('myclass', 'summary')
summary(tosum2)
But NULL is returned.
Update:
How can I check which methods are available for a summary object (coef is an example of a function that works on a summary)? methods(class="summary") returns NULL.
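A likely cause, sketched here with the built-in faithful data set (an assumption; the example table above appears to come from a model with a waiting predictor): the object returned by summary() on an lm fit stores its table under the lowercase name coefficients, and $ is case-sensitive, so x$Coefficients is NULL. The class of that object is "summary.lm" rather than "summary", which is also the class to pass to methods(). The name summary.myclass is kept from the question; s and s2 are illustrative names.

```r
# Fit a model that is not a perfect fit, so summary() raises no warning
m <- lm(eruptions ~ waiting, data = faithful)
s <- summary(m)

class(s)   # "summary.lm", not "summary"
names(s)   # the table is stored as "coefficients" (lowercase)

# S3 method that returns only the coefficient table
summary.myclass <- function(object, ...) {
  object$coefficients
}

s2 <- s
class(s2) <- c("myclass", class(s))  # keep the original class as a fallback
res <- summary(s2)
print(res)

# The same table without any custom class
coef(s)

# Methods registered for the real class of the summary object
methods(class = "summary.lm")
```

If only the table is needed, coef(summary(m)) is probably enough, and the custom class can be dropped.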
I am a beginner in R and in statistics in general. I am trying to build a regression model with 4 variables (one of them is nominal, with 3 subcategories). I think I managed to build a model with my raw data. But when I standardized my data set, lm gave the message "essentially perfect fit: summary may be unreliable".
This is the summary of the model with raw data:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.0713 5.9131 -0.350 0.727
Vdownload 8.6046 0.5286 16.279 < 2e-16 ***
DownloadDegisim 2.8854 0.6822 4.229 4.25e-05 ***
Vupload -4.2877 0.5418 -7.914 7.32e-13 ***
Saglayici2 -8.2084 0.6043 -13.583 < 2e-16 ***
Saglayici3 -9.8869 0.5944 -16.634 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.885 on 138 degrees of freedom
Multiple R-squared: 0.8993, Adjusted R-squared: 0.8956
F-statistic: 246.5 on 5 and 138 DF, p-value: < 2.2e-16
I wrote this code to standardize my data:
memnuniyet_scaled <-scale(Vdownload, center = TRUE, scale = TRUE)
Vdownload_scaled <-scale(Vdownload, center = TRUE, scale = TRUE)
Vupload_scaled <-scale(Vupload, center = TRUE, scale = TRUE)
DownloadD_scaled <- scale(DownloadDegisim, center = TRUE, scale = TRUE)
result<-lm(memnuniyet_scaled~Vdownload_scaled+DownloadD_scaled+Vupload_scaled+Saglayıcı)
summary(result)
And this is the summary of the model with standardized data:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.079e-17 5.493e-17 9.250e-01 0.357
Vdownload_scaled 1.000e+00 6.667e-17 1.500e+16 <2e-16 ***
DownloadD_scaled -4.591e-17 8.189e-17 -5.610e-01 0.576
Vupload_scaled 9.476e-18 6.337e-17 1.500e-01 0.881
Saglayici2 -6.523e-17 7.854e-17 -8.300e-01 0.408
Saglayici3 -8.669e-17 7.725e-17 -1.122e+00 0.264
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.75e-16 on 138 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.034e+32 on 5 and 138 DF, p-value: < 2.2e-16
I know that the R-squared value should not have changed with standardization, and I have no idea what I did wrong.
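One thing that stands out in the code above is that memnuniyet_scaled is built from Vdownload, so the response is an exact copy of the predictor Vdownload_scaled. That alone would explain a coefficient of 1 and an R-squared of 1. The sketch below reproduces the effect on simulated data; the data frame d, the helper z, and the assumption that the raw response column is called memnuniyet are all illustrative, not taken from the original data.

```r
set.seed(1)
n <- 144
d <- data.frame(Vdownload = rnorm(n), Vupload = rnorm(n))
d$memnuniyet <- 2 + 3 * d$Vdownload - d$Vupload + rnorm(n)  # simulated response

# Standardize and drop the matrix attributes that scale() adds
z <- function(v) as.numeric(scale(v))

raw <- lm(memnuniyet ~ Vdownload + Vupload, data = d)

# Mistake: the response is the scaled predictor itself
wrong <- lm(z(d$Vdownload) ~ z(d$Vdownload) + z(d$Vupload))
coef(wrong)  # approximately 0, 1, 0: the response is explained perfectly

# Intended model: scale the actual response
right <- lm(z(d$memnuniyet) ~ z(d$Vdownload) + z(d$Vupload))

# Standardizing response and predictors leaves R-squared unchanged
r2_raw   <- summary(raw)$r.squared
r2_right <- summary(right)$r.squared
c(raw = r2_raw, standardized = r2_right)
```

Under that reading, replacing the first scale(Vdownload, ...) with a call on the actual response column should bring back the R-squared of the raw model.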
I have a data.table data_dt on which I want to run a linear regression so that the user can choose the number of columns in groups G1 and G2 using the variable n_col. The following code works perfectly, but it is slow due to the extra time spent on creating matrices. To improve the performance of the code below, is there a way to remove Steps 1, 2, and 3 altogether by tweaking the formula of the lm function and still get the same results?
library(timeSeries)
library(data.table)
data_dt = as.data.table(LPP2005REC[, -1])
n_col = 3 # Choose a number from 1 to 3
######### Step 1 ######### Create the response (dependent) variable
xx <- as.matrix(data_dt[, "SPI"])
######### Step 2 ######### Create Group 1 of predictor (independent) variables
G1 <- as.matrix(data_dt[, .SD, .SDcols=c(1:n_col + 2)])
######### Step 3 ######### Create Group 2 of predictor (independent) variables
G2 <- as.matrix(data_dt[, .SD, .SDcols=c(1:n_col + 2 + n_col)])
lm(xx ~ G1 + G2)
Results:
summary(lm(xx ~ G1 + G2))
Call:
lm(formula = xx ~ G1 + G2)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
G1LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
G1MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
G1ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
G2LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
G2LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
G2LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
This may be easier if you just create the formula with reformulate:
out <- lm(reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)],
response = 'SPI'), data = data_dt)
Checking:
> summary(out)
Call:
lm(formula = reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col +
2 + n_col)], response = "SPI"), data = data_dt)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
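For readers without the timeSeries package, the same idea can be sketched on the built-in mtcars data. The column choice is an assumption made purely for illustration: mpg stands in for SPI, and the two groups are picked by position with the same index arithmetic as above.

```r
n_col <- 3          # number of columns per group, as in the question
dt <- mtcars        # stand-in for data_dt

g1 <- names(dt)[1:n_col + 2]          # columns 3..5: disp, hp, drat
g2 <- names(dt)[1:n_col + 2 + n_col]  # columns 6..8: wt, qsec, vs

# Build mpg ~ disp + hp + drat + wt + qsec + vs from the column names
f <- reformulate(c(g1, g2), response = "mpg")
fit_formula <- lm(f, data = dt)

# Matrix-based version, for comparison only
y  <- dt[["mpg"]]
G1 <- as.matrix(dt[, g1])
G2 <- as.matrix(dt[, g2])
fit_matrix <- lm(y ~ G1 + G2)

# Same estimates; only the coefficient names differ (disp vs G1disp)
cbind(formula = coef(fit_formula), matrix = unname(coef(fit_matrix)))
```

Because the formula version refers to columns by name, no intermediate matrices have to be created, and the coefficient names come out without the G1/G2 prefixes.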
I ran a categorical logistic regression.
Int is an intelligence ranking (1st, 2nd, 3rd and 4th place).
My question: I noticed that the significance levels vary depending on how I define the reference levels of Sex and Pos (body posture), that is, whether M or F comes first for Sex and whether Open or Closed comes first for Pos.
This is very strange to me, because I thought that changing the level order would only flip the sign of the coefficient. What did I do wrong? Is the strong Pos*Sex interaction the key to the solution?
Many thanks for any hints.
Here you can see the output of every combination:
> Pos = relevel(Pos,ref="Open")
> mopen<- clm(Int ~ Pos*Sex, data = x)
> summary(mopen)
formula: Int ~ Pos * Sex
data: x
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 668 -904.76 1821.51 4(0) 1.30e-12 6.7e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
PosClosed 1.128633 0.204955 5.507 3.66e-08 ***
SexF 0.008686 0.195416 0.044 0.964548
PosClosed:SexF -0.991075 0.281194 -3.525 0.000424 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -0.8356 0.1489 -5.614
2|3 0.2956 0.1451 2.037
3|4 1.4497 0.1557 9.310
>
> Sex = relevel(Sex,ref="F")
> Pos = relevel(Pos,ref="Open")
> fopen<- clm(Int ~ Pos*Sex, data = x)
> summary(fopen)
formula: Int ~ Pos * Sex
data: x
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 668 -904.76 1821.51 4(0) 1.27e-12 6.4e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
PosClosed 0.137559 0.193101 0.712 0.476238
SexM -0.008686 0.195416 -0.044 0.964548
PosClosed:SexM 0.991075 0.281194 3.525 0.000424 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -0.8443 0.1458 -5.791
2|3 0.2869 0.1406 2.041
3|4 1.4410 0.1519 9.489
>
> Sex = relevel(Sex,ref="M")
> Pos = relevel(Pos,ref="Closed")
> mclosed<- clm(Int ~ Pos*Sex, data = x)
> summary(mclosed)
formula: Int ~ Pos * Sex
data: x
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 668 -904.76 1821.51 4(0) 1.30e-12 7.2e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
PosOpen -1.1286 0.2050 -5.507 3.66e-08 ***
SexF -0.9824 0.2021 -4.861 1.17e-06 ***
PosOpen:SexF 0.9911 0.2812 3.525 0.000424 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -1.9642 0.1656 -11.859
2|3 -0.8331 0.1536 -5.422
3|4 0.3211 0.1506 2.132
>
> Sex = relevel(Sex,ref="F")
> Pos = relevel(Pos,ref="Closed")
> fclosed<- clm(Int ~ Pos*Sex, data = x)
> summary(fclosed)
formula: Int ~ Pos * Sex
data: x
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 668 -904.76 1821.51 4(0) 1.32e-12 6.5e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
PosOpen -0.1376 0.1931 -0.712 0.476238
SexM 0.9824 0.2021 4.861 1.17e-06 ***
PosOpen:SexM -0.9911 0.2812 -3.525 0.000424 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -0.9819 0.1477 -6.649
2|3 0.1493 0.1413 1.057
3|4 1.3035 0.1512 8.623
My best answer is that it was a mistake to use dummy coding for F/M and Closed/Open.
I tried contrast coding and got better results.
I used the following code to create the contrast:
contrasts(Sex) <- contr.sum(2)
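It may help to see why the outputs above differ. With dummy (treatment) coding and an interaction in the model, the Pos row is the effect of posture at the reference level of Sex only, so changing the reference changes which simple effect is tested. The posted numbers are consistent with this: 1.1286 + (-0.9911) is about 0.1376, the PosClosed estimate when F is the reference, and the log-likelihood is -904.76 in all four fits. The sketch below shows the same arithmetic with lm on simulated data, since clm needs the ordinal package; the data frame d and the response y are made up for illustration.

```r
set.seed(42)
n <- 400
d <- data.frame(
  Pos = factor(sample(c("Open", "Closed"), n, replace = TRUE)),
  Sex = factor(sample(c("M", "F"), n, replace = TRUE))
)
# Simulated response: posture matters for M only
d$y <- 1.1 * (d$Pos == "Closed" & d$Sex == "M") + rnorm(n)

# Treatment coding with two different reference levels for Sex
d_M <- transform(d, Pos = relevel(Pos, ref = "Open"), Sex = relevel(Sex, ref = "M"))
d_F <- transform(d, Pos = relevel(Pos, ref = "Open"), Sex = relevel(Sex, ref = "F"))
fit_M <- lm(y ~ Pos * Sex, data = d_M)
fit_F <- lm(y ~ Pos * Sex, data = d_F)

a_M <- coef(fit_M)[["PosClosed"]]  # effect of Closed among M
a_F <- coef(fit_F)[["PosClosed"]]  # effect of Closed among F

# Same model fit, different parameterization
c(logLik(fit_M), logLik(fit_F))
# Simple effect among F = simple effect among M + interaction
c(a_F, a_M + coef(fit_M)[["PosClosed:SexF"]])

# Sum-to-zero coding: Pos1 no longer depends on a reference level of Sex
d_sum <- d
contrasts(d_sum$Pos) <- contr.sum(2)
contrasts(d_sum$Sex) <- contr.sum(2)
fit_sum <- lm(y ~ Pos * Sex, data = d_sum)
c(coef(fit_sum)[["Pos1"]], (a_M + a_F) / 4)
```

So the dummy-coded fits are not wrong; each one answers a question about one reference group, while the sum-coded fit reports an effect averaged over both sexes.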
In the following example, gender is encoded as dummy variables corresponding to the categories.
fit <- lm(mass ~ height + gender, data=dplyr::starwars)
summary(fit)
# Call:
# lm(formula = mass ~ height + gender, data = dplyr::starwars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -41.908 -6.536 -1.585 1.302 55.481
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -46.69901 12.67896 -3.683 0.000557 ***
# height 0.59177 0.06784 8.723 1.1e-11 ***
# genderhermaphrodite 1301.13951 17.37871 74.870 < 2e-16 ***
# gendermale 22.39565 5.82763 3.843 0.000338 ***
# gendernone 68.34530 17.49287 3.907 0.000276 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 16.57 on 51 degrees of freedom
# (31 observations deleted due to missingness)
# Multiple R-squared: 0.9915, Adjusted R-squared: 0.9909
# F-statistic: 1496 on 4 and 51 DF, p-value: < 2.2e-16
Is there a way to extract the root of the dummy variable name? For example, for gendernone, gendermale and genderhermaphrodite, the root would be gender, corresponding to the original column name in the dplyr::starwars data.
Get the variable names from the formula and check which one matches the input:
input <- c("gendermale", "height")
v <- all.vars(formula(fit))
v[sapply(input, function(x) which(pmatch(v, x) == 1))]
## [1] "gender" "height"
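An alternative that does not rely on matching name prefixes is sketched below: the model matrix carries an "assign" attribute that records which term each column came from. The example uses the built-in iris data as a stand-in for dplyr::starwars (an assumption, to keep it free of extra packages); roots and term_labels are illustrative names.

```r
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

mm <- model.matrix(fit)
term_labels <- attr(terms(fit), "term.labels")  # "Sepal.Width" "Species"

# attr(mm, "assign") gives, for each column, the index of its term (0 = intercept)
roots <- c("(Intercept)", term_labels)[attr(mm, "assign") + 1]
names(roots) <- colnames(mm)
roots

# Look up the root of specific coefficient names
roots[c("Speciesvirginica", "Sepal.Width")]
```

This maps coefficients to terms rather than to raw columns, so for interaction or transformed terms the result is a term label such as a:b, which may or may not be what is wanted.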
Using this R linear modelling tutorial, I'm finding that the format of the model output is annoyingly different from that provided in the text, and I can't for the life of me work out why. For example, here is the code:
pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
my.df = data.frame(sex,pitch)
xmdl = lm(pitch ~ sex, my.df)
summary(xmdl)
Here is the output I get:
Call:
lm(formula = pitch ~ sex, data = my.df)
Residuals:
1 2 3 4 5 6
6.667 -22.333 15.667 2.000 -16.000 14.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.167 7.201 24.601 1.62e-05 ***
sex1 49.167 7.201 6.827 0.00241 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.64 on 4 degrees of freedom
Multiple R-squared: 0.921, Adjusted R-squared: 0.9012
F-statistic: 46.61 on 1 and 4 DF, p-value: 0.002407
In the tutorial the line for Coefficients has "sexmale" instead of "sex1". What setting do I need to activate to achieve this?
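A plausible explanation is that the session's contrasts option has been set to sum-to-zero coding. The output above fits that reading: the intercept 177.167 is the mean of the two group means, and sex1 is the female mean minus that value. The sketch below fits the same model under both settings; fit_sum, fit_trt and old are illustrative names, and whether the option was changed in the original session is an assumption.

```r
pitch <- c(233, 204, 242, 130, 112, 142)
sex <- c(rep("female", 3), rep("male", 3))
my.df <- data.frame(sex, pitch)

# Sum-to-zero contrasts reproduce the "sex1" row
old <- options(contrasts = c("contr.sum", "contr.poly"))
fit_sum <- lm(pitch ~ sex, my.df)

# R's default treatment contrasts give the "sexmale" row of the tutorial
options(contrasts = c("contr.treatment", "contr.poly"))
fit_trt <- lm(pitch ~ sex, my.df)

options(old)  # restore whatever was set before

coef(fit_sum)
coef(fit_trt)
```

Running getOption("contrasts") shows the current setting; if it reports contr.sum, something in the session or a startup file changed it from the default.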