I have read about contrasts in previous posts, and I think I am doing the right thing, but it is not giving me what I would expect.
x <- c(11.80856, 11.89269, 11.42944, 12.03155, 10.40744,
12.48229, 12.1188, 11.76914, 0, 0,
13.65773, 13.83269, 13.2401, 14.54421, 13.40312)
type <- factor(c(rep("g",5),rep("i",5),rep("t",5)))
type
[1] g g g g g i i i i i t t t t t
Levels: g i t
When I run this:
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.2740 -0.4140 0.0971 0.6631 5.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.514 1.729 6.659 2.33e-05 ***
typei -4.240 2.445 -1.734 0.109
typet 2.222 2.445 0.909 0.381
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.866 on 12 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.2712
F-statistic: 3.605 on 2 and 12 DF, p-value: 0.05943
Here my reference level is type "g", so typei is the difference between type "i" and type "g", and typet is the difference between type "t" and type "g".
I wanted to see two other contrasts here: the difference between the average of types "g" and "i" and type "t", and the difference between type "i" and type "t".
So I set the contrasts:
> contrasts(type) <- cbind( c(-1,-1,2),c(0,-1,1))
> summary.lm(aov(x~type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.2740 -0.4140 0.0971 0.6631 5.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.8412 0.9983 10.860 1.46e-07 ***
type1 -0.6728 1.4118 -0.477 0.642
type2 4.2399 2.4453 1.734 0.109
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.866 on 12 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.2712
F-statistic: 3.605 on 2 and 12 DF, p-value: 0.05943
When I try to build the second contrast by changing my reference I get different results, and I am not understanding what is wrong with my contrast matrix.
Reference: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
mat <- cbind(rep(1/3, 3), "g+i vs t"=c(-1/2, -1/2, 1),"i vs t"=c(0, -1, 1))
mymat <- solve(t(mat))
my.contrast <- mymat[,2:3]
contrasts(type) <- my.contrast
summary.lm(aov(x ~ type))
my.contrast
     g+i vs t i vs t
[1,]  -1.3333      1
[2,]   0.6667     -1
[3,]   0.6667      0
> contrasts(type) <- my.contrast
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.274 -0.414 0.097 0.663 5.208
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.841 0.998 10.86 1.5e-07 ***
typeg+i vs t 4.342 2.118 2.05 0.063 .
typei vs t 6.462 2.445 2.64 0.021 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.87 on 12 degrees of freedom
Multiple R-squared: 0.375, Adjusted R-squared: 0.271
F-statistic: 3.6 on 2 and 12 DF, p-value: 0.0594
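As a sanity check on the generalized-inverse approach, the two reported estimates should equal plain differences of group means, which you can compute directly from the x and type defined above:

```r
# Recompute the two contrasts from the group means; they should match
# the "g+i vs t" and "i vs t" estimates in the summary above.
x <- c(11.80856, 11.89269, 11.42944, 12.03155, 10.40744,
       12.48229, 12.1188, 11.76914, 0, 0,
       13.65773, 13.83269, 13.2401, 14.54421, 13.40312)
type <- factor(c(rep("g", 5), rep("i", 5), rep("t", 5)))
m <- tapply(x, type, mean)                 # group means for g, i, t
unname(m["t"] - (m["g"] + m["i"]) / 2)     # "g+i vs t": ~4.342
unname(m["t"] - m["i"])                    # "i vs t":   ~6.462
```

This is why the solve(t(mat)) step is needed: assigning the human-readable weights directly (as in the first attempt) does not give coefficients that are these mean differences.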
I want to look at the coefficient estimates for all of my different sales teams. I have 20 teams listed in a "Teams" column, with about 24 observations for each. However, when I run my regression, I only see 15 of the 20 teams in my model summary. I want to see all of them; any thoughts?
Here is my code and output:
(log_teams <- lm(Worked ~ Team+Activity+Presented+Confirmed+Jobs_Filled+Converted, data = df))%>%
summary
Output:
Call:
lm(formula = Worked ~ Team + Activity + Presented + Confirmed +
Jobs_Filled + Converted, data = WBY)
Residuals:
Min 1Q Median 3Q Max
-4.4035 -1.0048 0.0000 0.8774 5.1677
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.609486 1.869903 1.396 0.1699
TeamCRW 0.110828 1.908735 0.058 0.9540
TeamEMW -1.068797 2.767863 -0.386 0.7013
TeamGSW -0.424508 2.795353 -0.152 0.8800
TeamNS2 -1.234508 2.388392 -0.517 0.6078
TeamNUW -1.458735 2.083549 -0.700 0.4875
TeamOBW 3.224057 2.103054 1.533 0.1324
TeamORT -0.432185 1.884824 -0.229 0.8197
TeamPC1 4.338479 2.115219 2.051 0.0462 *
TeamPC2 -1.002268 2.227166 -0.450 0.6549
TeamPDW 2.560784 2.791501 0.917 0.3640
TeamPLW 1.381216 2.151150 0.642 0.5242
TeamPYW -1.074374 2.799772 -0.384 0.7030
TeamSB2 -0.646769 2.288132 -0.283 0.7788
TeamSYW 2.252061 1.833820 1.228 0.2259
TeamWMO 0.857452 2.302522 0.372 0.7114
Activity -0.000627 0.002906 -0.216 0.8302
Presented 0.162181 0.331876 0.489 0.6275
Confirmed -0.242462 0.317139 -0.765 0.4486
Jobs_Filled -0.025657 0.016099 -1.594 0.1182
Converted 0.006213 0.002610 2.381 0.0217 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.247 on 44 degrees of freedom
(427 observations deleted due to missingness)
Multiple R-squared: 0.5217, Adjusted R-squared: 0.3043
F-statistic: 2.4 on 20 and 44 DF, p-value: 0.007786
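One likely explanation (an assumption, since your data isn't shown, but the "(427 observations deleted due to missingness)" note points this way): lm() drops rows with missing values, and any team whose rows are all dropped ends up with an all-zero dummy column, so its coefficient is NA and is not printed. A minimal sketch with a toy data frame (the names df, Team, x, y are hypothetical):

```r
# Toy data: every row for team "C" is missing x, so lm() drops those rows
# and the TeamC coefficient cannot be estimated (it becomes NA and is
# omitted from the printed coefficient table).
set.seed(1)
df <- data.frame(
  Team = factor(rep(c("A", "B", "C"), each = 4)),
  x    = c(rnorm(8), rep(NA, 4)),
  y    = rnorm(12)
)
fit <- lm(y ~ Team + x, data = df)
rownames(coef(summary(fit)))  # "TeamC" is absent
# To see which teams survive the missing-data filter in your own data:
# table(droplevels(df$Team[complete.cases(df[, c("y", "Team", "x")])]))
```

If that is the cause, the fix is in the data (imputing or dropping the incomplete predictors), not in the model call.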
I have 2 data frames:
#dummy df for examples:
set.seed(1)
df1 <- data.frame(t = (1:16),
A = sample(20, 16),
B = sample(30, 16),
C = sample(30, 16))
df2 <- data.frame(t = (1:16),
A = sample(20, 16),
B = sample(30, 16),
C = sample(30, 16))
I want to do this for every column in both dataframes (except the t column):
model <- lm(df2$A ~ df1$A, data = NULL)
I have tried something like this:
model <- function(yvar, xvar){
lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
}
lapply(names(data), model)
but it obviously doesn't work. What am I doing wrong?
In the end, what I really want is to get the coefficients and other statistics from the models. What is stopping me is how to run a linear model, with variables from different data frames, multiple times.
I guess the output I want should look something like this:
# [[1]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
#
# [[2]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
#
# [[3]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
Since df1 and df2 have the same column names, you can do this:
model <- function(var){
lm(df1[[var]] ~ df2[[var]])
}
result <- lapply(names(df1)[-1], model)
result
#[[1]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 15.1504 -0.4763
#[[2]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 3.0227 0.6374
#[[3]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 15.4240 0.2411
To get summary statistics from the models you can use broom::tidy:
purrr::map_df(result, broom::tidy, .id = 'model_num')
# model_num term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 (Intercept) 15.2 3.03 5.00 0.000194
#2 1 df2[[var]] -0.476 0.248 -1.92 0.0754
#3 2 (Intercept) 3.02 4.09 0.739 0.472
#4 2 df2[[var]] 0.637 0.227 2.81 0.0139
#5 3 (Intercept) 15.4 4.40 3.50 0.00351
#6 3 df2[[var]] 0.241 0.272 0.888 0.390
Working through this R linear modelling tutorial, I'm finding the format of the model output is annoyingly different from that provided in the text, and I can't for the life of me work out why. For example, here is the code:
pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
my.df = data.frame(sex,pitch)
xmdl = lm(pitch ~ sex, my.df)
summary(xmdl)
Here is the output I get:
Call:
lm(formula = pitch ~ sex, data = my.df)
Residuals:
1 2 3 4 5 6
6.667 -22.333 15.667 2.000 -16.000 14.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.167 7.201 24.601 1.62e-05 ***
sex1 49.167 7.201 6.827 0.00241 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.64 on 4 degrees of freedom
Multiple R-squared: 0.921, Adjusted R-squared: 0.9012
F-statistic: 46.61 on 1 and 4 DF, p-value: 0.002407
In the tutorial the line for Coefficients has "sexmale" instead of "sex1". What setting do I need to activate to achieve this?
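It is likely not a display setting but the contrast coding (an inference from your numbers: the intercept 177.167 is the grand mean and sex1 = 49.167 is the female deviation from it, which is sum-to-zero coding, whereas the tutorial's "sexmale" comes from the default treatment coding). A sketch of both codings and the column names they produce:

```r
pitch <- c(233, 204, 242, 130, 112, 142)
sex <- factor(c(rep("female", 3), rep("male", 3)))

# Sum-to-zero coding labels the dummy column "sex1":
options(contrasts = c("contr.sum", "contr.poly"))
colnames(model.matrix(~ sex))   # "(Intercept)" "sex1"

# Restore the default treatment coding to get "sexmale":
options(contrasts = c("contr.treatment", "contr.poly"))
colnames(model.matrix(~ sex))   # "(Intercept)" "sexmale"
```

So if your session (or a package you loaded) changed options("contrasts"), resetting it to contr.treatment should reproduce the tutorial's labels.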
For more involved algorithms, determining the time complexity (i.e. Big-O) analytically is a pain. My solution has been to time the execution of the algorithm with parameters n and k, and come up with a function of n and k that fits the timings.
My data looks something like the below:
n k executionTime
500 1 0.02
500 2 0.03
500 3 0.05
500 ... ...
500 10 0.18
1000 1 0.08
... ... ...
10000 1 9.8
... ... ...
10000 10 74.57
I've been using the lm() function in the stats R package. I don't know how to interpret the output of the multiple regression, to determine a final Big-O. This is my main question: how do you translate the output of a multiple variable regression, to a final ruling on the best Big-O time complexity rating?
Here's the output of the lm():
Residuals:
Min 1Q Median 3Q Max
-14.943 -5.325 -1.916 3.681 31.475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.130e+01 1.591e+00 -13.39 <2e-16 ***
n 4.080e-03 1.953e-04 20.89 <2e-16 ***
k 2.361e+00 1.960e-01 12.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.962 on 197 degrees of freedom
Multiple R-squared: 0.747, Adjusted R-squared: 0.7444
F-statistic: 290.8 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of log(y) ~ log(n) + log(k):
Residuals:
Min 1Q Median 3Q Max
-0.4445 -0.1136 -0.0253 0.1370 0.5007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.80405 0.13749 -122.22 <2e-16 ***
log(n) 2.02321 0.01609 125.72 <2e-16 ***
log(k) 1.01216 0.01833 55.22 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1803 on 197 degrees of freedom
Multiple R-squared: 0.9897, Adjusted R-squared: 0.9896
F-statistic: 9428 on 2 and 197 DF, p-value: < 2.2e-16
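The log-log fit is the one to read: log(y) = a + b1*log(n) + b2*log(k) implies y ≈ C * n^b1 * k^b2, so your estimated exponents of roughly 2.02 and 1.01 suggest O(n² k). As a sketch (with synthetic timings, not your data), fitting the same model to data that truly scales as n² k recovers those exponents:

```r
# Synthetic timings that scale as n^2 * k, with multiplicative noise.
set.seed(42)
d <- expand.grid(n = c(500, 1000, 2000, 5000, 10000), k = 1:10)
d$time <- 1e-9 * d$n^2 * d$k * exp(rnorm(nrow(d), sd = 0.1))
fit <- lm(log(time) ~ log(n) + log(k), data = d)
round(coef(fit)[c("log(n)", "log(k)")], 2)  # close to 2 and 1
```

One caveat: a power law cannot distinguish, say, O(n²) from O(n² log n) over a modest range of n, so treat the fitted exponents as evidence, not proof.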
Here's the output of the principal components analysis, showing that both n and k contribute to the spread of the multivariate model:
PC1(This is n) PC2 (this is k) PC3 (noise?)
Standard deviation 1.3654 1.0000 0.36840
Proportion of Variance 0.6214 0.3333 0.04524
Cumulative Proportion 0.6214 0.9548 1.00000
I'm performing a multiple regression to find the best model to predict prices. See the following output in the R console.
I'd like to store the first column (Estimate) in a vector/matrix or data frame for future use, such as deploying on the web with R Shiny.
Price = 698.8 + 0.116*Voltage - 70.72*VendorCHICONY - 36.6*VendorDELTA - 66.8*VendorLITEON - 14.86*H
Can somebody kindly advise? Thanks in advance.
Call:
lm(formula = Price ~ Voltage + Vendor + H, data = PSU2)
Residuals:
Min 1Q Median 3Q Max
-10.9950 -0.6251 0.0000 3.0134 11.0360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.821309 276.240098 2.530 0.0280 *
Voltage 0.116958 0.005126 22.818 1.29e-10 ***
VendorCHICONY -70.721088 9.308563 -7.597 1.06e-05 ***
VendorDELTA -36.639685 5.866688 -6.245 6.30e-05 ***
VendorLITEON -66.796531 6.120925 -10.913 3.07e-07 ***
H -14.869478 6.897259 -2.156 0.0541 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.307 on 11 degrees of freedom
Multiple R-squared: 0.9861, Adjusted R-squared: 0.9799
F-statistic: 156.6 on 5 and 11 DF, p-value: 7.766e-10
Use coef on your lm output.
e.g.
m <- lm(Sepal.Length ~ Sepal.Width + Species, iris)
summary(m)
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
# Residuals:
# Min 1Q Median 3Q Max
# -1.30711 -0.25713 -0.05325 0.19542 1.41253
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.2514 0.3698 6.089 9.57e-09 ***
# Sepal.Width 0.8036 0.1063 7.557 4.19e-12 ***
# Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 ***
# Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.438 on 146 degrees of freedom
# Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
# F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16
coef(m)
# (Intercept) Sepal.Width Speciesversicolor Speciesvirginica
# 2.2513932 0.8035609 1.4587431 1.9468166
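If you want the estimates in a data frame (e.g. to hand to a Shiny app), one simple way is to build it from coef(); a sketch using the same iris model:

```r
# Collect the coefficient names and estimates into a data frame.
m <- lm(Sepal.Length ~ Sepal.Width + Species, iris)
est <- data.frame(term = names(coef(m)), estimate = unname(coef(m)))
est
#                term  estimate
# 1       (Intercept) 2.2513932
# 2       Sepal.Width 0.8035609
# 3 Speciesversicolor 1.4587431
# 4  Speciesvirginica 1.9468166
```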
See also names(m) which shows you some things you can extract, e.g. m$residuals (or equivalently, resid(m)).
And also methods(class='lm') will show you some other functions that work on a lm.
> methods(class = 'lm')
 [1] add1           alias          anova          case.names     coerce
 [6] confint        cooks.distance deviance       dfbeta         dfbetas
[11] drop1          dummy.coef     effects        extractAIC     family
[16] formula        hatvalues      influence      initialize     kappa
[21] labels         logLik         model.frame    model.matrix   nobs
[26] plot           predict        print          proj           qr
[31] residuals      rstandard      rstudent       show           simulate
[36] slotsFromS3    summary        variable.names vcov
(oddly, 'coef' is not in there? ah well)
Besides, I'd like to know if there is a command to show the "residual percentage",
i.e. (actual value - fitted value) / actual value; currently the residuals() command
only shows the info below, but I need the percentage instead.
residuals(fit3ab)
1 2 3 4 5 6
-5.625491e-01 -5.625491e-01 7.676578e-15 -8.293815e+00 -5.646900e+00 3.443652e+00
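There is no built-in "residual percentage", but since the observed response equals fitted + residuals, it is a one-liner. A sketch on a stand-in model (assuming your fit3ab is an ordinary lm fit; cars is a built-in dataset used here only for illustration):

```r
fit <- lm(dist ~ speed, data = cars)         # stand-in for fit3ab
actual <- fitted(fit) + residuals(fit)       # the observed response values
pct_resid <- 100 * residuals(fit) / actual   # (actual - fitted)/actual, in percent
head(round(pct_resid, 2))
```

Watch out for observations whose actual value is 0: the percentage is undefined (division by zero) for those rows.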