For Y = percentage of the population with income below the poverty level and X = per capita income, I have constructed a Box-Cox plot and found that lambda = 0.02020:
bc <- boxcox(lm(Percent_below_poverty_level ~ Per_capita_income, data=tidy.CDI), plotit=T)
bc$x[which.max(bc$y)] # gives lambda
Now I want to fit a simple linear regression using the transformed data, so I've entered this code:
transform <- lm((Percent_below_poverty_level**0.02020) ~ (Per_capita_income**0.02020))
transform
But all I get is the error message
'Error in terms.formula(formula, data = data) : invalid power in formula'. What is my mistake?
You could use bcPower() from the car package.
## make sure you do install.packages("car") if you haven't already
library(car)
data(Prestige)
p <- powerTransform(prestige ~ income + education + type,
                    data = Prestige,
                    family = "bcPower")
summary(p)
# bcPower Transformation to Normality
# Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
# Y1 1.3052 1 0.9408 1.6696
#
# Likelihood ratio test that transformation parameter is equal to 0
# (log transformation)
# LRT df pval
# LR test, lambda = (0) 41.67724 1 1.0765e-10
#
# Likelihood ratio test that no transformation is needed
# LRT df pval
# LR test, lambda = (1) 2.623915 1 0.10526
mod <- lm(bcPower(prestige, 1.3052) ~ income + education + type, data=Prestige)
summary(mod)
#
# Call:
# lm(formula = bcPower(prestige, 1.3052) ~ income + education +
# type, data = Prestige)
#
# Residuals:
# Min 1Q Median 3Q Max
# -44.843 -13.102 0.287 15.073 62.889
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -3.736e+01 1.639e+01 -2.279 0.0250 *
# income 3.363e-03 6.928e-04 4.854 4.87e-06 ***
# education 1.205e+01 2.009e+00 5.999 3.78e-08 ***
# typeprof 2.027e+01 1.213e+01 1.672 0.0979 .
# typewc -1.078e+01 7.884e+00 -1.368 0.1746
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 22.25 on 93 degrees of freedom
# (4 observations deleted due to missingness)
# Multiple R-squared: 0.8492, Adjusted R-squared: 0.8427
# F-statistic: 131 on 4 and 93 DF, p-value: < 2.2e-16
Powers (more often written as ^ than ** in R, FWIW) have a special meaning inside formulas: they represent crossing of terms rather than mathematical exponentiation, which is why a non-integer power is reported as "invalid". So if you did want to power-transform both sides of your equation you would use the I() or "as-is" operator:
I(Percent_below_poverty_level^0.02020) ~ I(Per_capita_income^0.02020)
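Applied to your model, a minimal sketch of the full call (note that your original lm() call also appears to be missing the data argument, which you will need unless the variables live in your global environment):
transform <- lm(I(Percent_below_poverty_level^0.02020) ~ I(Per_capita_income^0.02020),
                data = tidy.CDI)
summary(transform)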
However, I think you should do what @DaveArmstrong suggested anyway:
- it's only the response variable that gets transformed
- the Box-Cox transformation is actually (y^lambda - 1)/lambda (although the shift and scale might not matter for your results); see the sketch below
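If you do want the exact Box-Cox form, a minimal sketch applying it to the response only, assuming your tidy.CDI data and the lambda estimated above:
lambda <- 0.02020
tidy.CDI$pov_bc <- (tidy.CDI$Percent_below_poverty_level^lambda - 1) / lambda  # Box-Cox transform of y
transform <- lm(pov_bc ~ Per_capita_income, data = tidy.CDI)
summary(transform)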
I am analysing whether the effects of x_t on y_t differ during and after a specific time period.
I am trying to regress the following model in R using lm():
y_t = b_0 + [b_1(1-D_t) + b_2 D_t]x_t
where D_t is a dummy variable with the value 1 over the time period and 0 otherwise.
Is it possible to use lm() for this formula?
observationNumber <- 1:80
obsFactor <- cut(observationNumber, breaks = c(0, 55, 81), right = FALSE)
fit <- lm(y ~ x * obsFactor)
For example:
y = runif(80)
x = rnorm(80) + c(rep(0,54), rep(1, 26))
fit <- lm(y ~ x * obsFactor)
summary(fit)
Call:
lm(formula = y ~ x * obsFactor)
Residuals:
Min 1Q Median 3Q Max
-0.48375 -0.29655 0.05957 0.22797 0.49617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.50959 0.04253 11.983 <2e-16 ***
x -0.02492 0.04194 -0.594 0.554
obsFactor[55,81) -0.06357 0.09593 -0.663 0.510
x:obsFactor[55,81) 0.07120 0.07371 0.966 0.337
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3116 on 76 degrees of freedom
Multiple R-squared: 0.01303, Adjusted R-squared: -0.02593
F-statistic: 0.3345 on 3 and 76 DF, p-value: 0.8004
obsFactor[55,81) is zero if observationNumber < 55 and one if it is greater than or equal to 55; its coefficient lets the intercept shift between the two periods (a term not present in your stated model). x:obsFactor[55,81) is the product of the dummy and the variable $x_t$; its coefficient is the difference in slopes, $b_2 - b_1$. The coefficient for $x_t$ is your $b_1$, the intercept is your $b_0$, and $b_2$ is recovered as the sum of the x and x:obsFactor[55,81) coefficients.
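To make the mapping concrete, a quick sketch of how you could recover $b_2$ from the fit above, using the coefficient names exactly as they appear in the summary:
b1 <- coef(fit)["x"]
b2 <- b1 + coef(fit)["x:obsFactor[55,81)"]  # slope during the [55,81) period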
Does anyone know if it is possible to use lmFit or lm in R to calculate a linear model with categorical variables while including all possible comparisons between the categories? For example in the test data created here:
set.seed(25)
f <- gl(n = 3, k = 20, labels = c("control", "low", "high"))
mat <- model.matrix(~f, data = data.frame(f = f))
beta <- c(12, 3, 6) # these are the simulated regression coefficients
y <- rnorm(n = 60, mean = mat %*% beta, sd = 2)
m <- lm(y ~ f)
I get the summary:
summary(m)
Call:
lm(formula = y ~ f)
Residuals:
Min 1Q Median 3Q Max
-4.3505 -1.6114 0.1608 1.1615 5.2010
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.4976 0.4629 24.840 < 2e-16 ***
flow 3.0370 0.6546 4.639 2.09e-05 ***
fhigh 6.1630 0.6546 9.415 3.27e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.07 on 57 degrees of freedom
Multiple R-squared: 0.6086, Adjusted R-squared: 0.5949
F-statistic: 44.32 on 2 and 57 DF, p-value: 2.446e-12
which is because the default treatment contrasts ("contr.treatment") compare "high" to "control" and "low" to "control".
Is it possible to get also the comparison between "high" and "low"?
If you use aov instead of lm, you can use the TukeyHSD function from the stats package:
fit <- aov(y ~ f)
TukeyHSD(fit)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
# Fit: aov(formula = y ~ f)
# $f
# diff lwr upr p adj
# low-control 3.036957 1.461707 4.612207 6.15e-05
# high-control 6.163009 4.587759 7.738259 0.00e+00
# high-low 3.126052 1.550802 4.701302 3.81e-05
If you want to use an lm object, you can use the TukeyHSD function from the mosaic package:
library(mosaic)
TukeyHSD(m)
Or, as @ben-bolker suggests,
library(emmeans)
e1 <- emmeans(m, specs = "f")
pairs(e1)
# contrast estimate SE df t.ratio p.value
# control - low -3.036957 0.6546036 57 -4.639 0.0001
# control - high -6.163009 0.6546036 57 -9.415 <.0001
# low - high -3.126052 0.6546036 57 -4.775 <.0001
# P value adjustment: tukey method for comparing a family of 3 estimates
With lmFit:
library(limma)
design <- model.matrix(~0 + f)
colnames(design) <- levels(f)
fit <- lmFit(y, design)
contrast.matrix <- makeContrasts(control-low, control-high, low-high,
levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
round(t(rbind(fit2$coefficients, fit2$t, fit2$p.value)), 5)
# [,1] [,2] [,3]
# control - low -3.03696 -4.63938 2e-05
# control - high -6.16301 -9.41487 0e+00
# low - high -3.12605 -4.77549 1e-05
Also see Multiple t-test comparisons for more information.
I have been trying to fit a GLM (Poisson with log link, to be specific) on a dataset in SparkR. The dataset is pretty large, so collecting it and using R's own glm() isn't likely to work. It includes an exposure term which needs to enter as an offset (a regressor with known coefficient - 1 in my case). Unfortunately, neither adding an offset term in the formula, nor passing the column name (or the column itself, or a numeric vector formed by collecting the column after selecting it) works - in the first case the formula isn't parsed, and in the other cases the offset term is silently ignored - with no error messages at all. Here's an example of what I've been trying to do (outputs in comments):
library(datasets)
#set up Spark session
#Sys.setenv(SPARK_HOME = "/usr/share/spark_2.1.0")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
options(scipen = 15, digits = 5)
sparkR.session(spark.executor.instances = "20", spark.executor.memory = "6g")
# # Setting default log level to "WARN".
# # To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
# # 17/06/19 06:33:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# # 17/06/19 06:33:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
# # 17/06/19 06:34:22 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
message(sparkR.conf()$spark.app.id)
# # application_*************_****
#Test glm() in sparkR
data("iris")
iris_df = createDataFrame(iris)
# # Warning messages:
# # 1: In FUN(X[[i]], ...) :
# # Use Sepal_Length instead of Sepal.Length as column name
# # 2: In FUN(X[[i]], ...) :
# # Use Sepal_Width instead of Sepal.Width as column name
# # 3: In FUN(X[[i]], ...) :
# # Use Petal_Length instead of Petal.Length as column name
# # 4: In FUN(X[[i]], ...) :
# # Use Petal_Width instead of Petal.Width as column name
model = glm(Sepal_Length ~ offset(Sepal_Width) + Petal_Length, data = iris_df)
# # 17/06/19 08:46:47 ERROR RBackendHandler: fit on org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed
# # java.lang.reflect.InvocationTargetException
# # ......
# # Caused by: java.lang.IllegalArgumentException: Could not parse formula: Sepal_Length ~ offset(Sepal_Width) + Petal_Length
# # at org.apache.spark.ml.feature.RFormulaParser$.parse(RFormulaParser.scala:200)
# # ......
model = glm(Sepal_Length ~ Petal_Length + offset(Sepal_Width), data = iris_df)
# # (Same error as above)
# The one below runs.
model = glm(Sepal_Length ~ Petal_Length, offset = Sepal_Width, data = iris_df, family = gaussian())
# # 17/06/19 08:51:21 WARN WeightedLeastSquares: regParam is zero, which might cause numerical instability and overfitting.
# # 17/06/19 08:51:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
# # 17/06/19 08:51:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
# # 17/06/19 08:51:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
# # 17/06/19 08:51:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
summary(model)
# # Deviance Residuals:
# # (Note: These are approximate quantiles with relative error <= 0.01)
# # Min 1Q Median 3Q Max
# # -1.24675 -0.30140 -0.01999 0.26700 1.00269
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 4.3066 0.078389 54.939 0
# # Petal_Length 0.40892 0.018891 21.646 0
# #
# # (Dispersion parameter for gaussian family taken to be 0.1657097)
# #
# # Null deviance: 102.168 on 149 degrees of freedom
# # Residual deviance: 24.525 on 148 degrees of freedom
# # AIC: 160
# #
# # Number of Fisher Scoring iterations: 1
# (RESULTS ARE SAME AS GLM WITHOUT OFFSET)
# Results in R:
model = glm(Sepal.Length ~ Petal.Length, offset = Sepal.Width, data = iris, family = gaussian())
summary(model)
# # Call:
# # glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(),
# # data = iris, offset = Sepal.Width)
# #
# # Deviance Residuals:
# # Min 1Q Median 3Q Max
# # -0.93997 -0.27232 -0.02085 0.28576 0.88944
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 0.85173 0.07098 12.00 <2e-16 ***
# # Petal.Length 0.51471 0.01711 30.09 <2e-16 ***
# # ---
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# #
# # (Dispersion parameter for gaussian family taken to be 0.1358764)
# #
# # Null deviance: 143.12 on 149 degrees of freedom
# # Residual deviance: 20.11 on 148 degrees of freedom
# # AIC: 130.27
# #
# # Number of Fisher Scoring iterations: 2
#Results in R without offset. Matches SparkR output with and w/o offset.
model = glm(Sepal.Length ~ Petal.Length, data = iris, family = gaussian())
summary(model)
# # Call:
# # glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(),
# # data = iris)
# #
# # Deviance Residuals:
# # Min 1Q Median 3Q Max
# # -1.24675 -0.29657 -0.01515 0.27676 1.00269
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 4.30660 0.07839 54.94 <2e-16 ***
# # Petal.Length 0.40892 0.01889 21.65 <2e-16 ***
# # ---
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# #
# # (Dispersion parameter for gaussian family taken to be 0.1657097)
# #
# # Null deviance: 102.168 on 149 degrees of freedom
# # Residual deviance: 24.525 on 148 degrees of freedom
# # AIC: 160.04
# #
# # Number of Fisher Scoring iterations: 2
Note: the Spark version is 2.1.0 (as in the code). From what I checked, the implementation is supposed to be there. Also, the warning messages after glm don't always appear, but that does not seem to affect what's going on.
Am I doing something wrong, or is the offset term just not used in Spark's glm implementation? If it is the latter, is there any workaround to get the same results as having an offset term?
A Poisson GLM with response Y and offset log(K) is the same as a GLM with response Y/K and weights K.
Example using the Insurance dataset from MASS:
> glm(Claims ~ District + Group + Age, data=Insurance, family=poisson, offset=log(Holders))
Call: glm(formula = Claims ~ District + Group + Age, family = poisson,
data = Insurance, offset = log(Holders))
Coefficients:
(Intercept) District2 District3 District4 Group.L Group.Q Group.C Age.L Age.Q Age.C
-1.810508 0.025868 0.038524 0.234205 0.429708 0.004632 -0.029294 -0.394432 -0.000355 -0.016737
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual
Null Deviance: 236.3
Residual Deviance: 51.42 AIC: 388.7
> glm(Claims/Holders ~ District + Group + Age, data=Insurance, family=quasipoisson, weights=Holders)
Call: glm(formula = Claims/Holders ~ District + Group + Age, family = quasipoisson,
data = Insurance, weights = Holders)
Coefficients:
(Intercept) District2 District3 District4 Group.L Group.Q Group.C Age.L Age.Q Age.C
-1.810508 0.025868 0.038524 0.234205 0.429708 0.004632 -0.029294 -0.394432 -0.000355 -0.016737
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual
Null Deviance: 236.3
Residual Deviance: 51.42 AIC: NA
(The quasipoisson family is used to keep R from complaining about the non-integer values detected in the response.)
This technique should be usable with Spark's GLM implementation as well.
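For instance, a hypothetical sketch of the workaround in SparkR, reusing the Insurance example (weightCol is an argument of spark.glm; the exact coefficient parameterization may differ because Spark encodes categorical variables with its own RFormula machinery):
library(MASS)
ins_df <- createDataFrame(Insurance)
ins_df <- withColumn(ins_df, "rate", ins_df$Claims / ins_df$Holders)  # response Y/K
model <- spark.glm(ins_df, rate ~ District + Group + Age,
                   family = "poisson", weightCol = "Holders")         # weights K
summary(model)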
See also a similar question on stats.SE.
I want to use partial least squares regression to find the most representative variables for predicting my data.
Here is my code:
library(pls)
potion<-read.table("potion-insomnie.txt",header=T)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
Calling summary(lm(potion1)) gives me this output:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine and Bave are representative, so I redid the model with just these two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I can see the fitted line on the plot, but I cannot get its equation or R². Is that possible?
Also, is the first part the same as a multiple linear regression (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficients
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
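As for the line drawn by plot(..., line = TRUE), I believe that is the target line y = x rather than a fitted regression. If you want the least-squares line through the predicted-vs-measured points along with its equation and R², a sketch using the fit above:
preds <- drop(predict(potion1, ncomp = 2))  # fitted values with 2 components
meas <- potionTrain$mpg
line_fit <- lm(preds ~ meas)
coef(line_fit)                  # intercept and slope of the line
summary(line_fit)$r.squared     # its R^2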
By default, the lm summary tests whether each slope coefficient equals zero. My question is very basic: I want to know how to test whether a slope coefficient equals some non-zero value. One approach could be to use confint, but this does not provide a p-value. I also wonder how to do a one-sided test with lm.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2,10,20, labels=c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
summary(lm.D9)
Call:
lm(formula = weight ~ group)
Residuals:
Min 1Q Median 3Q Max
-1.0710 -0.4938 0.0685 0.2462 1.3690
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0320 0.2202 22.850 9.55e-15 ***
groupTrt -0.3710 0.3114 -1.191 0.249
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared: 0.07308, Adjusted R-squared: 0.02158
F-statistic: 1.419 on 1 and 18 DF, p-value: 0.249
confint(lm.D9)
2.5 % 97.5 %
(Intercept) 4.56934 5.4946602
groupTrt -1.02530 0.2833003
Thanks for your time and effort.
As @power says, you can do it by hand. Here is an example:
> est <- summary.lm(lm.D9)$coef[2, 1]
> se <- summary.lm(lm.D9)$coef[2, 2]
> df <- summary.lm(lm.D9)$df[2]
>
> m <- 0
> 2 * pt(-abs((est-m)/se), df)
[1] 0.2490232
>
> m <- 0.2
> 2 * pt(-abs((est-m)/se), df)
[1] 0.08332659
and you can do a one-sided test by dropping the 2 * and choosing the appropriate tail (see the update below).
UPDATES
Here is an example of the two-sided and one-sided probabilities:
> m <- 0.2
>
> # two-sided probability
> 2 * pt(-abs((est-m)/se), df)
[1] 0.08332659
>
> # one-sided, upper (i.e., greater than 0.2)
> pt((est-m)/se, df, lower.tail = FALSE)
[1] 0.9583367
>
> # one-sided, lower (i.e., less than 0.2)
> pt((est-m)/se, df, lower.tail = TRUE)
[1] 0.0416633
Note that the upper and lower one-sided probabilities sum to exactly 1.
Use the linearHypothesis function from the car package. For instance, you can check whether the coefficient of groupTrt equals -1 using:
linearHypothesis(lm.D9, "groupTrt = -1")
Linear hypothesis test
Hypothesis:
groupTrt = - 1
Model 1: restricted model
Model 2: weight ~ group
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 10.7075
2 18 8.7292 1 1.9782 4.0791 0.05856 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
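linearHypothesis reports a two-sided F test. For a one-sided alternative on a single coefficient you can compute the t-based tail probability yourself; a sketch for H1: groupTrt > -1 (note that t^2 equals the F statistic above):
est <- coef(lm.D9)["groupTrt"]
se <- sqrt(diag(vcov(lm.D9)))["groupTrt"]
pt((est - (-1)) / se, df.residual(lm.D9), lower.tail = FALSE)  # one-sided p, half of 0.05856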
The smatr package has a slope.test() function with which you can test a slope against a specified value using OLS.
In addition to all the other good answers, you could use an offset. It's a little trickier with categorical predictors, because you need to know the coding.
lm(weight~group+offset(1*(group=="Trt")))
The 1* here is unnecessary but is put in to emphasize that you are testing against the hypothesis that the difference is 1 (if you want to test against a hypothesis of a difference of d, then use d*(group == "Trt")).
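A quick sketch of what this looks like with the data already defined, testing H0: difference = 1:
fit_off <- lm(weight ~ group + offset(1*(group == "Trt")))
summary(fit_off)$coefficients["groupTrt", ]  # the reported p-value now tests difference = 1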
You can use t.test to do this for your data. The mu parameter sets the hypothesized difference in group means, and the alternative parameter lets you choose between one- and two-sided tests.
t.test(weight~group,var.equal=TRUE)
Two Sample t-test
data: weight by group
t = 1.1913, df = 18, p-value = 0.249
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2833003 1.0253003
sample estimates:
mean in group Ctl mean in group Trt
5.032 4.661
t.test(weight~group,var.equal=TRUE,mu=-1)
Two Sample t-test
data: weight by group
t = 4.4022, df = 18, p-value = 0.0003438
alternative hypothesis: true difference in means is not equal to -1
95 percent confidence interval:
-0.2833003 1.0253003
sample estimates:
mean in group Ctl mean in group Trt
5.032 4.661
Code up your own test. You know the estimated coefficient and you know the standard error, so you can construct your own test statistic, as in the sketch below.
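For example, a minimal sketch of a hand-rolled Wald t test of H0: groupTrt = -1, which reproduces the linearHypothesis p-value above:
b0 <- -1                                              # hypothesized value
est <- coef(summary(lm.D9))["groupTrt", "Estimate"]
se <- coef(summary(lm.D9))["groupTrt", "Std. Error"]
tstat <- (est - b0) / se
2 * pt(-abs(tstat), df.residual(lm.D9))               # two-sided p-value (~0.0586)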