Changing significance notation in R

R uses a set of significance codes to flag statistical significance. In the sample output below, for example, a dot (.) indicates significance at the 10% level.
Dots can be very hard to see, especially when I copy-paste to Excel and display it in Times New Roman.
I'd like to change it such that:
* = significant at 10%
** = significant at 5%
*** = significant at 1%
Is there a way I can do this?
> y = c(1,2,3,4,5,6,7,8)
> x = c(1,3,2,4,5,6,8,7)
> summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.0714 -0.3333 0.0000 0.2738 1.1191
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2143 0.6286 0.341 0.74480
x 0.9524 0.1245 7.651 0.00026 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8067 on 6 degrees of freedom
Multiple R-squared: 0.907, Adjusted R-squared: 0.8915
F-statistic: 58.54 on 1 and 6 DF, p-value: 0.0002604

You can create your own formatting function with
mystarformat <- function(x) symnum(x, corr = FALSE, na = FALSE,
                                   cutpoints = c(0, 0.01, 0.05, 0.1, 1),
                                   symbols = c("***", "**", "*", " "))
And you can write your own coefficient formatter
show_coef <- function(mm) {
  mycoef <- data.frame(coef(summary(mm)), check.names = FALSE)
  mycoef$signif <- mystarformat(mycoef$`Pr(>|t|)`)
  mycoef$`Pr(>|t|)` <- format.pval(mycoef$`Pr(>|t|)`)
  mycoef
}
And then with your model, you can run it with
mm <- lm(y~x)
show_coef(mm)
# Estimate Std. Error t value Pr(>|t|) signif
# (Intercept) 0.2142857 0.6285895 0.3408993 0.7447995
# x 0.9523810 0.1244793 7.6509206 0.0002604 ***
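As a quick sanity check, the symnum()-based formatter from above can be applied to raw p-values directly, without going through a model; a minimal sketch:

```r
# Same formatter as defined above; intervals are (0, .01], (.01, .05],
# (.05, .1], (.1, 1], mapped to ***, **, *, and blank respectively.
mystarformat <- function(x) symnum(x, corr = FALSE, na = FALSE,
                                   cutpoints = c(0, 0.01, 0.05, 0.1, 1),
                                   symbols = c("***", "**", "*", " "))
as.character(mystarformat(c(0.004, 0.03, 0.07, 0.5)))
# "***" "**"  "*"   " "
```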

One should be aware that the stargazer package reports significance levels on a different scale than other statistical software such as Stata.
In R (stargazer) you get (* p<0.1; ** p<0.05; *** p<0.01), whereas in Stata you get (* p<0.05, ** p<0.01, *** p<0.001).
This means that a result flagged with one * in R may appear not to be significant to a Stata user.
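If you want stargazer to use Stata-style thresholds instead, its star.cutoffs argument accepts the three p-value cutoffs for *, **, and ***. A minimal sketch using the regression from the question:

```r
library(stargazer)

# Regression from the question
y <- c(1, 2, 3, 4, 5, 6, 7, 8)
x <- c(1, 3, 2, 4, 5, 6, 8, 7)
fit <- lm(y ~ x)

# star.cutoffs gives the p-value thresholds for *, **, *** in order;
# these values reproduce the Stata convention.
stargazer(fit, type = "text",
          star.cutoffs = c(0.05, 0.01, 0.001))
```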

Sorry for the late response, but I found a great solution to this.
Just do the following:
install.packages("stargazer")
library(stargazer)
stargazer(your_regression, type = "text")
This displays everything in a beautiful way with your desired format.
Note: If you leave type = "text" out, then you'll get the LaTeX code.


R equivalent of Stata's for-loop over macros

I have a variable x that is between 0 and 1, or (0,1].
I want to generate 10 dummy variables for 10 deciles of variable x. For example x_0_10 takes value 1 if x is between 0 and 0.1, x_10_20 takes value 1 if x is between 0.1 and 0.2, ...
The Stata code to do above is something like this:
forval p=0(10)90 {
local Next=`p'+10
gen x_`p'_`Next'=0
replace x_`p'_`Next'=1 if x<=`Next'/100 & x>`p'/100
}
Now, I am new to R and wonder how I can do the above in R?
cut is your friend here; its output is a factor, which R will auto-expand into the 10 dummy variables when it is used in a model.
set.seed(2932)
x = runif(1e4)
y = 3 + 4 * x + rnorm(1e4)
x_cut = cut(x, 0:10/10, include.lowest = TRUE)
summary(lm(y ~ x_cut))
# Call:
# lm(formula = y ~ x_cut)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.7394 -0.6888 0.0028 0.6864 3.6742
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.16385 0.03243 97.564 <2e-16 ***
# x_cut(0.1,0.2] 0.43932 0.04551 9.654 <2e-16 ***
# x_cut(0.2,0.3] 0.85555 0.04519 18.933 <2e-16 ***
# x_cut(0.3,0.4] 1.26441 0.04588 27.556 <2e-16 ***
# x_cut(0.4,0.5] 1.66181 0.04495 36.970 <2e-16 ***
# x_cut(0.5,0.6] 2.04538 0.04574 44.714 <2e-16 ***
# x_cut(0.6,0.7] 2.44771 0.04533 53.999 <2e-16 ***
# x_cut(0.7,0.8] 2.80875 0.04591 61.182 <2e-16 ***
# x_cut(0.8,0.9] 3.22323 0.04545 70.919 <2e-16 ***
# x_cut(0.9,1] 3.60092 0.04564 78.897 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.011 on 9990 degrees of freedom
# Multiple R-squared: 0.5589, Adjusted R-squared: 0.5585
# F-statistic: 1407 on 9 and 9990 DF, p-value: < 2.2e-16
See ?cut for more customization options.
You can also pass cut directly in the RHS of the formula, which would make using predict a bit easier:
reg = lm(y ~ cut(x, 0:10/10, include.lowest = TRUE))
idx = sample(length(x), 500)
plot(x[idx], y[idx])
x_grid = seq(0, 1, length.out = 500L)
lines(x_grid, predict(reg, data.frame(x = x_grid)),
col = 'red', lwd = 3L, type = 's')
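If you actually need the ten 0/1 columns in your data, as the Stata loop creates, model.matrix() can expand the cut() factor into explicit dummies; a sketch with illustrative column names matching the question's x_0_10, x_10_20, ... convention:

```r
set.seed(2932)
x <- runif(100)
x_cut <- cut(x, 0:10/10, include.lowest = TRUE)

# ~ x_cut - 1 drops the intercept, so we get one column per decile
dummies <- model.matrix(~ x_cut - 1)
colnames(dummies) <- paste0("x_", 0:9 * 10, "_", 1:10 * 10)  # illustrative names
head(dummies)

# Each observation falls in exactly one decile, so every row sums to 1
rowSums(dummies)
```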
This won't fit well into a comment, but for the record, the Stata code can be simplified down to
forval p = 0/9 {
gen x_`p' = x > `p'/10 & x <= (`p' + 1)/10
}
Note that, contrary to the OP's claim, values of x exactly equal to zero are mapped to zero by all of these variables, both in their code and in mine (which is intended as a simplification of their code, not a correct way to do it, modulo differences of taste in variable names). That follows from the fact that 0 is not greater than 0. Similarly, values exactly equal to 0.1, 0.2, 0.3, ... will in principle go in the lower bin, not the higher one, but that is complicated by the fact that most multiples of 0.1 have no exact binary representation (0.5 is a clear exception).
Indeed, depending on details about their set-up that the OP doesn't tell us, indicator variables (dummy variables, in their terminology) may well be available in Stata without a loop or made quite unnecessary by factor variable notation. In that respect Stata is closer to R than may at first appear.
While not answering the question directly, the signal here to Stata and R users alike is that Stata need not be so awkward as might be inferred from the code in the question.

Why do I get a negative Estimate Std. when the data I am using can never be negative?

I am running a script to find differences between the songs of birds (comparing lengths, frequencies, and other measures). I am using linear mixed effects models with the lme4 package. I get a negative Estimate Std. as an outcome, and since (for instance) the length of a song cannot be negative, I wonder if anybody could tell me what I am doing wrong. Details below.
I have been looking for errors in my data and trying different ways of arranging the data, always getting the same results.
This is how I have the data organized:
Bird site length freq
1 FH 2.69 4354 -58.9
1 FH 2.546 4298 -57.3
1 FH 2.043 5303 -53.7
2 FH 4.437 6084 -63.1
11 ML 3.371 4689 -37.1
12 ML 3.706 5470 -39.7
13 ML 4.331 5358 -48.7
13 ML 4.124 4744 -39.8
14 ML 3.802 5805 -42.5
This is the full code
#1 song length ####
library("lmerTest")
model1<-lmer(length~site
+(1|Bird),
data=dframe1)
summary(model1)
anova(model1, test="F")
pdat <- expand.grid (site=c("ML", "SI","FH", "SH"))
detach(package:lmerTest) #
model1<-lmer(length~site
+(1|Bird),
data=dframe1)
library(AICcmodavg) # predictSE() comes from the AICcmodavg package
pred <- predictSE(model1, newdata = pdat, re.form = NA,
se.fit = T, na.action = na.exclude,
type= "response")
pred
predframe <- data.frame (pdat, pred) ; predframe
predframe
plot(
NULL
, xlim = c(0.75,4.25) #
, ylim = c(3,6)
, axes = F #
, ylab = ""
, xlab = ""
)
at.x <- c(1,2,3,4)
at.lab <- c(1,2,3,4)
for (i in 1:nrow(predframe))
{arrows(
x0 = at.x[i]
, y0 = (predframe$fit[i] + predframe$se.fit[i])
, x1 = at.x[i]
, y1 = (predframe$fit[i] - predframe$se.fit[i])
, code = 3
, angle = 90
, length = 0.12
, col = "gray25")
points(
x = at.x[i]
, y = predframe$fit[i]
, pch = 21
,bg="black"
, col = "black"
, cex = 1.25) # point size
}
axis(1, labels = c("Mainland","Sully", "Flat Holm","Skokholm"), at = at.lab)
axis(2, at = c(3,4,5,6), labels = c(3,4,5,6), las = 1, cex.axis = 1)
box()
title(xlab = "Location", line = 2.5, cex = 0.8)
title(ylab = expression(paste("song length (secs)")), line = 2.75)
Below is the first part of the results; I'm not sure why site FH (siteFH -0.9480) comes up as negative. This happens with other variables as well, so I guess something must be wrong with the model. I am a beginner, so please be considerate with me; I've already looked and haven't found a similar question.
Thank you in advance.
Results
Scaled residuals:
Min 1Q Median 3Q Max
-3.1852 -0.4119 -0.0071 0.5304 2.2659
Random effects:
Groups Name Variance Std.Dev.
Bird (Intercept) 0.51798 0.7197
Residual 0.07313 0.2704
Number of obs: 112, groups: Bird, 42
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 4.2429 0.1787 37.6710 23.745 < 2e-16 ***
siteFH -0.9480 0.2965 36.3879 -3.197 0.002871 **
siteSH 1.2641 0.3173 35.4150 3.983 0.000323 ***
siteSI -0.4258 0.3515 35.2203 -1.212 0.233769
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) siteFH siteSH
siteFH -0.603
siteSH -0.563 0.339
siteSI -0.508 0.306 0.286
> anova(model1, test="F")
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
site 3.0075 1.0025 3 35.336 13.709 4.337e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The columns in the output are right-aligned, so the first column is named Estimate and the next one Std. Error; there is no "Estimate Std." column.
The estimate describes the association between your independent and dependent variables. It does not describe any values in your dataset.
A negative estimate just means "the larger your independent variable, the lower your dependent variable (length)", or, for a factor like site, that this level sits below the reference level. Within this relationship, both variables can still be entirely positive.
In detail, an estimate of -0.948 in your case means that the length for siteFH is about 0.948 lower than the length for siteML (the reference category, not shown in the output). However, it does not mean that siteFH is negative.
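A small simulated sketch (hypothetical numbers, two sites only) makes the point concrete: the data are strictly positive, yet the sign of the coefficient flips with the choice of reference level.

```r
# Simulated song lengths (seconds): ML averages ~4.2, FH averages ~3.3.
# All values are positive.
set.seed(1)
site <- rep(c("ML", "FH"), each = 20)
length_sec <- ifelse(site == "ML", 4.2, 3.3) + rnorm(40, sd = 0.3)

# Default reference level is "FH" (alphabetically first):
# the siteML coefficient is the positive difference ML - FH.
fit <- lm(length_sec ~ site)
coef(fit)

# Relevel so "ML" is the reference: the FH coefficient is now
# negative (FH is ~0.9 s shorter), even though no data point is.
fit2 <- lm(length_sec ~ relevel(factor(site), ref = "ML"))
coef(fit2)[2]
```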

variable lengths differ in R using lm

I get a nasty surprise when running lm in R:
variable lengths differ (found for 'returnsandp')
I run the following model:
# regress apple price return on s&p price return
attach(NewSetSlide.ex)
resultr = lm(returnapple ~ returnsandp)
summary(resultr)
It cannot get any simpler than that, but for some reason I get the error above.
I checked that the length of returnapple & returnsandp is exactly the same. So what on earth is going on, please?
The data.frame in question:
NewSetSlide.ex <- structure(list(returnapple = c(0.1412251, 0.07665801, 0.02560235,
0.09638143, 0.06384145, 0.05163189, -0.1076969, 0.05121892, 0.06428114,
0.09939652, 0.07271771, 0.06923432, 0.02873109, 0.0721757, -0.0121841,
0.07196034, 0.1012038, -0.06786657, 0.06142434, 0.09644931, -0.02754909,
0.005786519, 0.04099078, -0.03320592, -0.03292676, -0.06908485,
-0.01878077, 0.08340874, -0.01004186, -0.1064195, -0.07524236,
-0.006677446, 0.133327, -0.139921, 0.06528701, -0.036831, 0.09006266,
0.01813659, 0.07127628, 0.004334296, -0.02659846, 0.05333548,
0.04774654, 0.1288835, 0.05323629, -0.00006978558, 0.0634182,
-0.0533224, 0.03270362, 0.1026693, -0.05655361, 0.09680779, 0.01662336,
-0.01170586, -0.01063646, 0.0638476, -0.0542103, -0.01501973,
0.1307637, -0.005598485, 0.02798327, 0.1962269, 0.006725292,
0), returnsandp = c(0.1159772758, 0.007614392, 0.1104467964,
0.0359706698, 0.0152313579, 0.0331342721, 0.0189951476, 0.0330947526,
0.0749868297, -0.0124064592, 0.0323295771, -0.0303030364, 0.0113188732,
0.0101582303, -0.0151743475, 0.0174258083, -0.0088341409, -0.0092159647,
-0.0388593467, 0.0134979946, 0.0054655738, -0.05935645, 0.0174692125,
-0.0164511628, 0.1063320628, -0.0034796438, -0.0000602649, -0.0151122528,
0.0223743915, 0.0740851449, 0.0086287811, -0.0028700134, -0.0045942764,
0.0540510532, 0.0121340172, -0.0048475787, -0.0119945162, -0.034724078,
0.0425088143, 0.0650615875, 0.0450610926, 0.0023665278, 0.0714892769,
0.052793919, -0.0141481377, 0.0502292875, 0.0141095206, -0.0586828306,
0.071192607, -0.0854386059, 0.05472933, 0.0214771911, -0.0282882713,
0.1317668962, 0.0369236189, 0.0263898652, -0.0114502121, 0.0060341972,
0.0479144906, 0.0482236974, 0.0349588397, -0.0241661652, -0.2176304161,
-0.0853488645)), class = "data.frame", row.names = c(NA, -64L))
Based on @Dave2e's comment.
It is better to use the data = NewSetSlide.ex argument inside the lm() call to avoid naming conflicts; moreover, you should avoid using attach(). Please see below (the NewSetSlide.ex data frame is taken from the question above):
resultr <- lm(returnapple ~ returnsandp, data = NewSetSlide.ex)
summary(resultr)
Output:
Call:
lm(formula = returnapple ~ returnsandp, data = NewSetSlide.ex)
Residuals:
Min 1Q Median 3Q Max
-0.166599181 -0.041838291 0.003778841 0.044034591 0.166774667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.028595156 0.008677931 3.29516 0.0016294 **
returnsandp -0.035466006 0.160976847 -0.22032 0.8263478
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0672294 on 62 degrees of freedom
Multiple R-squared: 0.0007822871, Adjusted R-squared: -0.01533413
F-statistic: 0.04853977 on 1 and 62 DF, p-value: 0.8263478
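For the record, here is one way the error can arise, sketched with simulated data: a stray object in the workspace with the same name as a column, but a different length, shadows the attached column.

```r
# Stray global vector of length 3, left over from earlier work
returnsandp <- c(0.01, 0.02, 0.03)

# The real data: both columns of length 10
dat <- data.frame(returnapple = rnorm(10),
                  returnsandp = rnorm(10))

attach(dat)
# lm() looks up returnsandp in the global environment first, finding
# the length-3 vector, while returnapple comes from the attached data
# frame -> "variable lengths differ (found for 'returnsandp')"
res <- try(lm(returnapple ~ returnsandp), silent = TRUE)
detach(dat)

inherits(res, "try-error")  # TRUE

# Passing data= resolves names inside the data frame first, so the
# stray global object is harmless:
fit <- lm(returnapple ~ returnsandp, data = dat)
```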

General Linear Model interpretation of parameter estimates in R

I have a data set that looks like
"","OBSERV","DIOX","logDIOX","OXYGEN","LOAD","PRSEK","PLANT","TIME","LAB"
"1",1011,984.06650389,6.89169348002254,"L","H","L","RENO_N","1","KK"
"2",1022,1790.7973641,7.49041625445373,"H","H","L","RENO_N","1","USA"
"3",1031,661.95870145,6.4952031694744,"L","H","H","RENO_N","1","USA"
"4",1042,978.06853583,6.88557974511529,"H","H","H","RENO_N","1","KK"
"5",1051,270.92290942,5.60183431332639,"N","N","N","RENO_N","1","USA"
"6",1062,402.98269729,5.99889362626069,"N","N","N","RENO_N","1","USA"
"7",1071,321.71945701,5.77367991426247,"H","L","L","RENO_N","1","KK"
"8",1082,223.15260359,5.40785585845064,"L","L","L","RENO_N","1","USA"
"9",1091,246.65350151,5.507984523849,"H","L","H","RENO_N","1","USA"
"10",1102,188.48323034,5.23900903921703,"L","L","H","RENO_N","1","KK"
"11",1141,267.34994025,5.58855843790491,"N","N","N","RENO_N","1","KK"
"12",1152,452.10355987,6.11391126834609,"N","N","N","RENO_N","1","KK"
"13",2011,2569.6672555,7.85153169693888,"N","N","N","KARA","1","USA"
"14",2021,604.79620572,6.40489155123453,"N","N","N","KARA","1","KK"
"15",2031,2610.4804449,7.86728956188212,"L","H",NA,"KARA","1","KK"
"16",2032,3789.7097503,8.24004471210954,"L","H",NA,"KARA","1","USA"
"17",2052,338.97054188,5.82591320649553,"L","L","L","KARA","1","KK"
"18",2061,391.09027375,5.96893841249289,"H","L","H","KARA","1","USA"
"19",2092,410.04420258,6.01626496505788,"N","N","N","KARA","1","USA"
"20",2102,313.51882368,5.74785940190679,"N","N","N","KARA","1","KK"
"21",2112,1242.5931417,7.12495571830002,"H","H","H","KARA","1","KK"
"22",2122,1751.4827969,7.46821802066524,"H","H","L","KARA","1","USA"
"23",3011,60.48026048,4.10231703874031,"N","N","N","RENO_S","1","KK"
"24",3012,257.27729731,5.55015448107691,"N","N","N","RENO_S","1","USA"
"25",3021,46.74282552,3.84466077914493,"N","N","N","RENO_S","1","KK"
"26",3022,73.605375516,4.29871805996994,"N","N","N","RENO_S","1","KK"
"27",3031,108.25433812,4.68448344109116,"H","H","L","RENO_S","1","KK"
"28",3032,124.40704234,4.82355878915293,"H","H","L","RENO_S","1","USA"
"29",3042,123.66859296,4.81760535031397,"L","H","L","RENO_S","1","KK"
"30",3051,170.05332632,5.13611207209694,"N","N","N","RENO_S","1","USA"
"31",3052,95.868704018,4.56297958887925,"N","N","N","RENO_S","1","KK"
"32",3061,202.69261215,5.31169060558111,"N","N","N","RENO_S","1","USA"
"33",3062,70.686307069,4.25825187761015,"N","N","N","RENO_S","1","USA"
"34",3071,52.034715526,3.95191110210073,"L","H","H","RENO_S","1","KK"
"35",3072,93.33525462,4.53619789950355,"L","H","H","RENO_S","1","USA"
"36",3081,121.47464906,4.79970559129829,"H","H","H","RENO_S","1","USA"
"37",3082,94.833869239,4.55212661590867,"H","H","H","RENO_S","1","KK"
"38",3091,68.624596439,4.22865101914209,"H","L","L","RENO_S","1","USA"
"39",3092,64.837097371,4.17187792984139,"H","L","L","RENO_S","1","KK"
"40",3101,32.351569811,3.47666254561192,"L","L","L","RENO_S","1","KK"
"41",3102,29.285124102,3.37707967726539,"L","L","L","RENO_S","1","USA"
"42",3111,31.36974463,3.44584388158928,"L","L","H","RENO_S","1","USA"
"43",3112,28.127853881,3.33676032670116,"L","L","H","RENO_S","1","KK"
"44",3121,91.825330102,4.51988818660262,"H","L","H","RENO_S","1","KK"
"45",3122,136.4559307,4.91600171048243,"H","L","H","RENO_S","1","USA"
"46",4011,126.11889968,4.83722511024933,"H","L","H","RENO_N","2","KK"
"47",4022,76.520259821,4.33755554003153,"L","L","L","RENO_N","2","KK"
"48",4032,93.551979795,4.53851721545715,"L","L","H","RENO_N","2","USA"
"49",4041,207.09703422,5.33318744777751,"H","L","L","RENO_N","2","USA"
"50",4052,383.44185307,5.94918798759058,"N","N","N","RENO_N","2","USA"
"51",4061,156.79345897,5.05492939129363,"N","N","N","RENO_N","2","USA"
"52",4071,322.72413197,5.77679787769979,"L","H","L","RENO_N","2","USA"
"53",4082,554.05710342,6.31726775620079,"H","H","H","RENO_N","2","USA"
"54",4091,122.55552697,4.80856420867156,"N","N","N","RENO_N","2","KK"
"55",4102,112.70050456,4.72473389805434,"N","N","N","RENO_N","2","KK"
"56",4111,94.245481423,4.54590288271731,"L","H","H","RENO_N","2","KK"
"57",4122,323.16498582,5.77816298482521,"H","H","L","RENO_N","2","KK"
I define a linear model in R using lm as
lm1 <- lm(logDIOX ~ 1 + OXYGEN + LOAD + PLANT + TIME + LAB, data=data)
and I want to interpret the estimated coefficients. However, when I extract the coefficients I get multiple 'NAs' (I'm assuming it's due to linear dependencies among the variables). How can I then interpret the coefficients? I only have one intercept that somehow represents one of the levels of each of the included factors in the model. Is it possible to get an estimate for each factor level?
> summary(lm1)
Call:
lm(formula = logDIOX ~ OXYGEN + LOAD + PLANT + TIME + LAB, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.90821 -0.32102 -0.08993 0.27311 0.97758
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.2983 0.2110 34.596 < 2e-16 ***
OXYGENL -0.4086 0.1669 -2.449 0.017953 *
OXYGENN -0.7567 0.1802 -4.199 0.000113 ***
LOADL -1.0645 0.1675 -6.357 6.58e-08 ***
LOADN NA NA NA NA
PLANTRENO_N -0.6636 0.2174 -3.052 0.003664 **
PLANTRENO_S -2.3452 0.1929 -12.158 < 2e-16 ***
TIME2 -0.9160 0.2065 -4.436 5.18e-05 ***
LABUSA 0.3829 0.1344 2.849 0.006392 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5058 on 49 degrees of freedom
Multiple R-squared: 0.8391, Adjusted R-squared: 0.8161
F-statistic: 36.5 on 7 and 49 DF, p-value: < 2.2e-16
For the NA part of your question you can have a look at "linear regression 'NA' estimate just for last coefficient": it means one of your variables can be described as a linear combination of the rest.
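A minimal sketch of how such an NA arises, with simulated data echoing the OXYGEN/LOAD pattern in the question, where LOAD is "N" exactly when OXYGEN is "N":

```r
# LOAD equals "N" exactly when OXYGEN equals "N", so the LOADN dummy
# column duplicates the OXYGENN column in the design matrix.
set.seed(7)
OXYGEN <- c("H", "L", "N", "H", "L", "N", "H", "L", "N")
LOAD   <- c("H", "L", "N", "L", "H", "N", "H", "H", "N")
y <- rnorm(9)

fit <- lm(y ~ OXYGEN + LOAD)
coef(fit)
# LOADN comes back NA: lm() detects the singularity and drops the
# aliased column rather than fail.
```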
As for the factors and their levels: R folds the first factor level into the intercept and reports the remaining levels as differences from that intercept. I think it becomes clearer with a single-factor regression:
lm1 <- lm(logDIOX ~ 1 + OXYGEN , data=df)
> summary(lm1)
Call:
lm(formula = logDIOX ~ 1 + OXYGEN, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.7803 -0.7833 -0.2027 0.6597 3.1229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.5359 0.2726 20.305 <2e-16 ***
OXYGENL -0.4188 0.3909 -1.071 0.289
OXYGENN -0.1896 0.3807 -0.498 0.621
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.188 on 54 degrees of freedom
Multiple R-squared: 0.02085, Adjusted R-squared: -0.01542
F-statistic: 0.5749 on 2 and 54 DF, p-value: 0.5662
What this result says is that for OXYGEN="H" the intercept is 5.5359; for OXYGEN="L" it is 5.5359-0.4188=5.1171; and for OXYGEN="N" it is 5.5359-0.1896=5.3463.
Hope this helps
UPDATE:
Following your comment, I'll generalize to your full model.
when
OXYGEN = "H"
LOAD = "H"
PLANT= "KARA"
TIME=1
LAB="KK"
then:
logDIOX =7.2983
when
OXYGEN = "L"
LOAD = "H"
PLANT= "KARA"
TIME=1
LAB="KK"
then:
logDIOX =7.2983-0.4086 =6.8897
when
OXYGEN = "L"
LOAD = "L"
PLANT= "KARA"
TIME=1
LAB="KK"
then:
logDIOX =7.2983-0.4086-1.0645 =5.8252
etc.

Extract data from Partial least square regression on R

I want to use the partial least squares regression to find the most representative variables to predict my data.
Here is my code:
library(pls)
potion<-read.table("potion-insomnie.txt",header=T)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
summary(lm(potion1)) gives me this answer:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine and Bave are representative, so I refit the model with just these two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I can see the regression line on the plot, but I cannot get its equation and R². Is that possible?
Also, is the first part the same as a multiple linear regression (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficients
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
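To answer the predicted-vs-measured part of the question: one way to recover the equation and R² of the line on that plot is to regress the PLS fitted values on the measured response. A sketch using the mtcars stand-in from above:

```r
library(pls)

# mtcars stand-in for the potion data, as in the answer above
potionTrain <- mtcars[1:28, ]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain,
                validation = "LOO")

# Fitted values using 2 components (predict() returns a 3-d array;
# drop() flattens it to a plain vector)
pred <- drop(predict(potion1, ncomp = 2))

# Line through the predicted-vs-measured cloud: measured on the x-axis
line_fit <- lm(pred ~ potionTrain$mpg)
coef(line_fit)                  # intercept and slope of the line
summary(line_fit)$r.squared     # R^2 of predicted vs measured
```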
