Coding Custom Contrasts - r

I'm looking to create a user-defined contrast on my data. In brief, the data is organized in a dataframe, with each row having 1 of 4 possible conditions, a proportion of correct answers on a test, and 2 variables called "Schedule" and "Cluster." The head of my data looks like this:
Subjects Condition PC Schedule Cluster
1 1 1 0.5555556 Interleaved Similar
2 2 1 0.3425926 Interleaved Similar
3 3 1 0.7129630 Interleaved Similar
4 4 1 0.5000000 Interleaved Similar
5 5 1 0.6296296 Interleaved Similar
6 6 1 0.6851852 Interleaved Similar
There are two main contrasts I want to run. The first compares condition 1 to the mean of conditions 2, 3, and 4. The second compares condition 4 to the mean of conditions 2 and 3. I coded my two contrasts like this:
contrast1 = c(1, -1/3, -1/3, -1/3)
contrast2 = c(0, -1/2, -1/2, 1)
I then put them into a matrix:
cond.contrasts = matrix(c(contrast1, contrast2), ncol = 2)
Per advice I saw elsewhere, I took the generalized inverse of this matrix with ginv() from the MASS package:
cond.contrasts = t(ginv(cond.contrasts))
show(cond.contrasts)
[,1] [,2]
[1,] 0.75 0.0000000
[2,] -0.25 -0.3333333
[3,] -0.25 -0.3333333
[4,] -0.25 0.6666667
Note there are only two contrasts here. However, my output looks like this:
lm.experiment = lm(PC ~ Condition, PC)
summary(lm.experiment)
Call:
lm(formula = PC ~ Condition, data = PC)
Residuals:
Min 1Q Median 3Q Max
-0.22099 -0.12069 -0.00926 0.11443 0.35117
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5438470 0.0136786 39.759 <2e-16 ***
Condition1 0.0263110 0.0312175 0.843 0.401
Condition2 0.0279084 0.0335882 0.831 0.408
Condition3 -0.0007032 0.0276090 -0.025 0.980
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1472 on 112 degrees of freedom
Multiple R-squared: 0.01234, Adjusted R-squared: -0.01412
F-statistic: 0.4663 on 3 and 112 DF, p-value: 0.7064
If I'm understanding this right, my contrasts should be represented by the "Condition1" and "Condition2" coefficients. However, I have no idea what "Condition3" refers to. If I ask R to show me the contrasts directly, it gives me this:
> show(contrasts(PC$Condition))
[,1] [,2] [,3]
1 0.75 0.0000000 8.326673e-17
2 -0.25 -0.3333333 -7.071068e-01
3 -0.25 -0.3333333 7.071068e-01
4 -0.25 0.6666667 -2.498002e-16
Where does the third column come from? Have I done something wrong?

If you assign the contrasts to the factor outside the lm function, R automatically fills in the maximum number of contrasts. In your example, one contrast is added because 4 factor levels allow for 3 orthogonal contrasts.
However, you can use the contrasts parameter of lm to override this default behavior. In that case the specified contrast matrix is used as is, and no additional contrasts are added.
The command is:
lm(PC ~ Condition, PC, contrasts = list(Condition = cond.contrasts))
This tells lm to use the contrast matrix cond.contrasts for the factor Condition.
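As a minimal sketch of the two routes: the data frame d below is simulated purely for illustration (it is not the asker's data), and the generalized inverse is computed in base R rather than with MASS::ginv(), which gives the same matrix here because the two contrast columns are linearly independent. The number of contrasts can be capped either when assigning them to the factor, via the how.many argument of contrasts<-, or through the contrasts argument of lm():

```r
# Simulated stand-in data: 4 conditions, 10 made-up scores each
set.seed(1)
d <- data.frame(Condition = factor(rep(1:4, each = 10)),
                PC = runif(40))

# The two contrasts of interest, one per column
C <- cbind(c(1, -1/3, -1/3, -1/3),
           c(0, -1/2, -1/2, 1))

# Transposed generalized inverse; for full-column-rank C this equals t(MASS::ginv(C))
cond.contrasts <- C %*% solve(t(C) %*% C)

# Route 1: assign to the factor, but tell R to keep only 2 columns
d1 <- d
contrasts(d1$Condition, how.many = 2) <- cond.contrasts
fit1 <- lm(PC ~ Condition, data = d1)

# Route 2: leave the factor alone and pass the matrix to lm()
fit2 <- lm(PC ~ Condition, data = d,
           contrasts = list(Condition = cond.contrasts))

coef(fit1)  # intercept plus Condition1 and Condition2 only
coef(fit2)  # the same three coefficients
```

Either way the model then has three coefficients instead of four, so it no longer fits a separate mean for every condition.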

Related

Plotting mean and standard error of mean from linear regression

I've run a multiple linear regression where pred_acc is the dependent continuous variable and emotion_pred and emotion_target are two dummy coded independent variables with 0 and 1. Furthermore I am interested in the interaction between the two independent variables.
model <- lm(pred_acc ~ emotion_pred * emotion_target, data = data_almost_final)
summary(model)
Residuals:
Min 1Q Median 3Q Max
-0.66049 -0.19522 0.01235 0.19213 0.67284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.97222 0.06737 14.432 < 2e-16 ***
emotion_pred 0.45988 0.09527 4.827 8.19e-06 ***
emotion_target 0.24383 0.09527 2.559 0.012719 *
emotion_pred:emotion_target -0.47840 0.13474 -3.551 0.000703 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2858 on 68 degrees of freedom
(1224 observations deleted due to missingness)
Multiple R-squared: 0.2555, Adjusted R-squared: 0.2227
F-statistic: 7.781 on 3 and 68 DF, p-value: 0.0001536
In case some context is needed: I ran a survey in which couples had to predict their partner's preferences. The predictor individual was in emotion state 0 or 1 (emotion_pred), and the target individual was in emotion state 0 or 1 (emotion_target). Accordingly, there are four combinations.
Now I want to plot the regression with the mean of each combination of the independent variables (0,1; 1,0; 1,1; 0,0) and add error bars showing the standard error of the means. I have no idea how to do this. Can anyone help me with this?
Here's an extraction from my data:
pred_acc emotion_pred emotion_target
1 1.0000000 1 0
2 1.2222222 0 1
3 0.7777778 0 0
4 1.1111111 1 1
5 1.3888889 1 1
Sketch of how I want it to look like
Use emmip() from the emmeans package:
library(emmeans)
model <- lm(pred_acc ~ emotion_pred * emotion_target, data = d2)
emmip(model, emotion_pred ~ emotion_target, CIs = TRUE, style = "factor")
If you want more control over the image or just to get the values you can use the emmeans function directly:
> emmeans(model , ~ emotion_pred * emotion_target )
emotion_pred emotion_target emmean SE df lower.CL upper.CL
0 0 0.778 0.196 1 -1.718 3.27
1 0 1.000 0.196 1 -1.496 3.50
0 1 1.222 0.196 1 -1.274 3.72
1 1 1.250 0.139 1 -0.515 3.01
Convert the result with as.data.frame() and you can then use ggplot2 on it to make whatever graph you like.
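If the goal is the plain standard error of the mean per cell (sd / sqrt(n)) rather than the model-based standard errors that emmeans reports, the values can also be computed and drawn with base R alone. This is only a sketch: the data frame d is simulated for illustration, and cells is a name chosen for this example.

```r
# Simulated stand-in for the survey data (all values are made up)
set.seed(42)
d <- expand.grid(emotion_pred = 0:1, emotion_target = 0:1, rep = 1:18)
d$pred_acc <- 1 + 0.4 * d$emotion_pred + 0.2 * d$emotion_target -
  0.4 * d$emotion_pred * d$emotion_target + rnorm(nrow(d), sd = 0.3)

# Mean and standard error of the mean (sd / sqrt(n)) for each of the four cells
cells <- aggregate(pred_acc ~ emotion_pred + emotion_target, data = d,
                   FUN = function(v) c(mean = mean(v), se = sd(v) / sqrt(length(v))))
cells <- do.call(data.frame, cells)  # flatten into pred_acc.mean / pred_acc.se

# One line per emotion_pred level, with +/- 1 SE error bars
plot(NA, xlim = c(-0.2, 1.2),
     ylim = range(cells$pred_acc.mean - cells$pred_acc.se,
                  cells$pred_acc.mean + cells$pred_acc.se),
     xaxt = "n", xlab = "emotion_target", ylab = "mean pred_acc")
axis(1, at = 0:1)
for (p in 0:1) {
  s <- cells[cells$emotion_pred == p, ]
  lines(s$emotion_target, s$pred_acc.mean, type = "b", col = p + 1)
  arrows(s$emotion_target, s$pred_acc.mean - s$pred_acc.se,
         s$emotion_target, s$pred_acc.mean + s$pred_acc.se,
         angle = 90, code = 3, length = 0.05, col = p + 1)
}
legend("topleft", legend = c("emotion_pred = 0", "emotion_pred = 1"),
       col = 1:2, lty = 1, pch = 1)
```

The cells data frame has one row per combination, so it can equally be handed to another plotting package.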

linear regression with a multi stage process

I have a multistage process where I start with a certain number of widgets, run a process and am left with a certain number of widgets, which are used as inputs to the next stage. I do this 4 times until I am left with my final output, the result of the whole multistage process.
START,STAGE1,STAGE2,STAGE3,STAGE4,END
6.026519962,5.006499328,11.34166661,19.76708718,33.15886266,224969
39.10407297,39.33554868,16.72339655,20.06091416,48.27435337,211219
59.01058053,-0.132117703,65.86320651,28.83414845,35.80126588,171002
4.223160769,8.922875348,7.284576901,12.23231723,22.69628442,167601
3.346709925,11.02595913,5.939584679,10.24429047,21.25120225,141647
5.805562629,-0.132117703,9.573058934,14.79379707,22.94771267,141525
6.790051562,-0.132117703,10.75312925,16.11541117,21.75703831,137048
37.32127895,-0.132117703,32.3638353,29.51485539,28.38585138,134767
20.29966555,13.37384397,15.12515734,11.32934817,21.38963677,126394
3.146289383,-0.132117703,5.829709365,11.36942823,18.68626736,122419
4.934995656,-0.132117703,9.390127066,10.30669951,22.11733477,122357
27.00639885,44.34669272,16.24179336,23.87692773,26.40697122,120518
16.43867312,20.86299235,6.724579532,9.023950915,21.5152363,94467
7.141229746,-0.132117703,10.64018571,9.727173688,15.29874722,92407
3.730343996,11.5274705,4.422081678,7.277245326,13.49520217,86933
7.721150514,-0.132117703,13.43075,8.817664761,15.1243289,84975
6.295702334,-0.132117703,11.01875809,14.25575271,17.55220446,82344
10.54578702,-0.132117703,12.21433296,18.3202813,17.61523342,81626
4.339816554,-0.132117703,5.75616262,19.05937993,16.39865988,79357
9.797258349,-0.132117703,14.05058693,16.41091983,17.48161202,78624
Call:
lm(formula = END ~ START + STAGE1 + STAGE2 + STAGE3 + STAGE4,
data = demo)
Residuals:
Min 1Q Median 3Q Max
-23434.9 -8973.9 136.3 7581.5 26091.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12052.7 15778.3 -0.764 0.45762
START -3683.4 1075.7 -3.424 0.00411 **
STAGE1 522.1 561.2 0.930 0.36798
STAGE2 2695.1 1069.0 2.521 0.02445 *
STAGE3 -601.1 935.6 -0.642 0.53097
STAGE4 6834.6 737.9 9.262 2.39e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16530 on 14 degrees of freedom
Multiple R-squared: 0.8913, Adjusted R-squared: 0.8524
F-statistic: 22.95 on 5 and 14 DF, p-value: 2.755e-06
I see a high r squared and a low p value, so the model looks good.
However,
library(corpcor)
cor2pcor(cov(X))
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 0.7597090 0.91450928 0.13842775 0.2761325
[2,] 0.7597090 1.0000000 -0.78568174 -0.11211720 0.1549637
[3,] 0.9145093 -0.7856817 1.00000000 0.09986542 -0.1385342
[4,] 0.1384277 -0.1121172 0.09986542 1.00000000 0.2810463
[5,] 0.2761325 0.1549637 -0.13853420 0.28104629 1.0000000
A number of the variables are highly correlated.
How do I remove the correlation?
In the end, I would like a model that can determine whether a particular run of a stage is efficient. However, that is hard to do because, to some degree, each stage depends on the process before it. Is there a better approach?
Thanks

How do I use the glm() function?

I'm trying to fit a generalized linear model (GLM) to my data using R. I have a continuous variable Y and two categorical factors, A and B. Each factor is coded as 0 or 1, for absence or presence.
Even though I can see a clear interaction between A and B just by looking at the data, the GLM reports a p-value far above 0.05. Am I doing something wrong?
First of all I create the data frame for the GLM, which consists of the dependent variable Y and two factors, A and B. These are two-level factors (0 and 1). There are 3 replicates per combination.
A<-c(0,0,0,1,1,1,0,0,0,1,1,1)
B<-c(0,0,0,0,0,0,1,1,1,1,1,1)
Y<-c(0.90,0.87,0.93,0.85,0.98,0.96,0.56,0.58,0.59,0.02,0.03,0.04)
my_data<-data.frame(A,B,Y)
Let’s see what it looks like:
my_data
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just by looking at the data, there is a clear interaction between factor A and factor B, as the value of Y decreases dramatically when both A and B are present (that is, A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as the p-value is far above 0.05:
attach(my_data)
## The following objects are masked _by_ .GlobalEnv:
##
## A, B, Y
my_glm<-glm(Y~A+B+A*B,data=my_data,family=binomial)
## Warning: non-integer #successes in a binomial glm!
summary(my_glm)
##
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
##
## Number of Fisher Scoring iterations: 6
While you state that Y is continuous, the data show that Y is a fraction, which is probably why you tried a GLM in the first place.
Fractions (i.e. continuous values bounded by 0 and 1) can be modeled with logistic regression if certain assumptions are fulfilled. See the following Cross Validated post for details: https://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional. However, from the data description it is not clear that those assumptions are fulfilled.
Alternatives for modeling fractions are beta regression and fractional response models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
install.packages("betareg")
library("betareg")
result.betareg <-betareg(Y~A+B+A*B,data=my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
install.packages("frm")
library("frm")
frm(Y,cbind(A, B, AB=A*B),linkfrac="logit")
# *** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
#
# Note: robust standard errors
#
# Number of observations: 12
# R-squared: 0.992
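Another option that stays within base R, sketched here under the assumption that a logit link for the mean is acceptable, is to keep the logistic mean model but let glm() estimate the dispersion with family = quasibinomial instead of fixing it at 1. The coefficient estimates are the same as in the binomial fit from the question; only the standard errors are rescaled. The name fit_quasi is made up for this example.

```r
A <- c(0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1)
B <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
Y <- c(0.90, 0.87, 0.93, 0.85, 0.98, 0.96, 0.56, 0.58, 0.59, 0.02, 0.03, 0.04)
my_data <- data.frame(A, B, Y)

# Same logit mean model, but the dispersion is estimated from the data
fit_quasi <- glm(Y ~ A * B, data = my_data, family = quasibinomial)

summary(fit_quasi)$dispersion  # estimated dispersion (well below 1 for these data)
coef(summary(fit_quasi))       # t tests based on the rescaled standard errors
```

With so little scatter within each A/B combination, the interaction term should come out clearly significant here, in line with the beta regression and fractional logit results.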
Using family=binomial implies logit (logistic) regression, which is designed for a binary outcome.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.
The data show an interaction. Try fitting a different model; logistic regression is not appropriate here.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for both factors and their interaction.
fit <- aov(Y~(A*B),data=my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Polynomial regression line through origin with equation in calibration curve

I would like a second-order (is that the right term?) regression line plotted through zero, and crucially I need the equation for this relationship.
Here's my data:
ecoli_ug_ml A420 rpt
1 0 0.000 1
2 10 0.129 1
3 20 0.257 1
4 30 0.379 1
5 40 0.479 1
6 50 0.579 1
7 60 0.673 1
8 70 0.758 1
9 80 0.838 1
10 90 0.912 1
11 100 0.976 1
12 0 0.000 2
13 10 0.126 2
14 20 0.257 2
15 30 0.382 2
16 40 0.490 2
17 50 0.592 2
18 60 0.684 2
19 70 0.772 2
20 80 0.847 2
21 90 0.917 2
22 100 0.977 2
23 0 0.000 3
24 10 0.125 3
25 20 0.258 3
26 30 0.376 3
27 40 0.488 3
28 50 0.582 3
29 60 0.681 3
30 70 0.768 3
31 80 0.846 3
32 90 0.915 3
33 100 0.977 3
My plot looks like this: (sci2 is just some axis and text formatting, can include if necessary)
ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x - 1,2)) +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u."))) +
sci2
When I view this, the fit of this line to the points is spectacularly good.
When I check coef, I think there is a non-zero y-intercept (which is unacceptable for my purposes), but to be honest I don't really understand these lines:
coef(lm(A420 ~ poly(ecoli_ug_ml, 2, raw=TRUE), data = calib))
(Intercept) poly(ecoli_ug_ml, 2, raw = TRUE)1
-1.979021e-03 1.374789e-02
poly(ecoli_ug_ml, 2, raw = TRUE)2
-3.964258e-05
Therefore, I assume the plot is actually not quite right either.
So, what I need is to generate a regression line forced through zero and get the equation for it, and, understand what it's saying when it gives me said equation. If I could annotate the plot area with the equation directly I would be incredibly stoked.
I have spent approximately 8 hours trying to work this out now. I checked Excel and got a formula in 8 seconds, but I would really like to get into using R for this. Thanks!
To clarify: the primary purpose of this plot is not to demonstrate the distribution of these data but rather to provide a visual confirmation that the equation I generate from these points fits the readings well
summary(lm(A420~poly(ecoli_ug_ml,2,raw=T),data=calib))
# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T), data = calib)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.979e-03 1.926e-03 -1.028 0.312
# poly(ecoli_ug_ml, 2, raw = T)1 1.375e-02 8.961e-05 153.419 <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.964e-05 8.631e-07 -45.932 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.004379 on 30 degrees of freedom
# Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
# F-statistic: 8.343e+04 on 2 and 30 DF, p-value: < 2.2e-16
So the intercept is not exactly 0 but it is small compared to the Std. Error. In other words, the intercept is not significantly different from 0.
You can force a fit without the intercept this way (note the -1 in the formula):
summary(lm(A420~poly(ecoli_ug_ml,2,raw=T)-1,data=calib))
# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T) - 1, data = calib)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# poly(ecoli_ug_ml, 2, raw = T)1 1.367e-02 5.188e-05 263.54 <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.905e-05 6.396e-07 -61.05 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.004383 on 31 degrees of freedom
# Multiple R-squared: 1, Adjusted R-squared: 1
# F-statistic: 3.4e+05 on 2 and 31 DF, p-value: < 2.2e-16
Note that the coefficients do not change appreciably.
EDIT (Response to OP's comment)
The formula specified in stat_smooth(...) is just passed directly to the lm(...) function, so you can specify in stat_smooth(...) any formula that works in lm(...). The point of the results above is that, even without forcing the intercept to 0, it is extremely small (-2e-3) compared to the range in y (0-1), so plotting curves with and without will give nearly indistinguishable results. You can see this for yourself by running this code:
ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x,2,raw=T),colour="red") +
stat_smooth(method="lm", formula=y~-1+poly(x,2,raw=T),colour="blue") +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))
The blue and red curves are nearly, but not quite on top of each other (you may have to open up your plot window to see it). And no, you do not have to do this "outside of ggplot."
The problem you reported relates to using the default raw=F. This causes poly(...) to use orthogonal polynomials, which by definition have constant terms. So using y~-1+poly(x,2) doesn't really make sense, whereas using y~-1+poly(x,2,raw=T) does make sense.
Finally, if all this business of using poly(...) with or without raw=T is causing confusion, you can achieve the exact same result using formula = y~ -1 + x + I(x^2). This fits a second order polynomial (a*x +b*x^2) and suppresses the constant term.
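To get at the equation itself, the coefficients of the through-origin fit can be pulled out with coef() and pasted into a string. A minimal sketch, using only the first replicate from the question (rpt 1) as a stand-in for calib; the names calib1, fit0 and eq are made up for this example:

```r
# First replicate (rpt 1) of the calibration data from the question
calib1 <- data.frame(
  ecoli_ug_ml = seq(0, 100, by = 10),
  A420 = c(0.000, 0.129, 0.257, 0.379, 0.479, 0.579,
           0.673, 0.758, 0.838, 0.912, 0.976)
)

# Second-order polynomial with the constant term suppressed (0 + ...)
fit0 <- lm(A420 ~ 0 + ecoli_ug_ml + I(ecoli_ug_ml^2), data = calib1)

# b[1] multiplies x, b[2] multiplies x^2; there is no intercept term
b <- unname(coef(fit0))
eq <- sprintf("A420 = %.3e * x %+.3e * x^2", b[1], b[2])
eq

# With ggplot2 loaded, the string could be added to an existing plot with
# annotate("text", x = 25, y = 0.9, label = eq)
```

Because the constant term is suppressed, the fitted curve passes through the origin by construction, so the equation has only the linear and quadratic terms.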
I think you are misinterpreting that Intercept and also how stat_smooth works. Polynomial fits done by statisticians typically do not use the raw=TRUE parameter. The default is FALSE and the polynomials are constructed to be orthogonal to allow proper statistical assessment of the fit improvement when looking at the standard errors. It is instructive to look at what happens if you attempt to eliminate the Intercept by using -1 or 0+ in the formula. Try with your data and code to get rid of the intercept:
....+
stat_smooth(method="lm", formula=y~0+poly(x - 1,2)) + ...
You will see the fitted line intercepting the y axis at -0.5 and change. Now look at the non-raw value of the intercept.
coef(lm(A420~poly(ecoli_ug_ml,2),data=ecoli))
(Intercept) poly(ecoli_ug_ml, 2)1 poly(ecoli_ug_ml, 2)2
0.5466667 1.7772858 -0.2011251
So the intercept is shifting the whole curve upward to let the polynomial fit have the best fitting curvature. If you want to draw a line with ggplot2 that meets some different specification you should calculate it outside of ggplot2 and then plot it without the error bands because it really won't have the proper statistical properties.
Nonetheless, this is the way to apply what in this case is a trivial amount of adjustment, and I am offering it only as an illustration of how to add an externally derived set of values. I think ad hoc adjustments like this are dangerous in practice:
mod <- lm(A420~poly(ecoli_ug_ml,2), data=ecoli)
ecoli$shifted_pred <- predict(mod) - predict( mod, newdata=list(ecoli_ug_ml=0))
ggplot(ecoli, aes(ecoli_ug_ml, A420)) +
geom_point(shape=ecoli$rpt) +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))+
geom_line(data=ecoli, aes(x= ecoli_ug_ml, y=shifted_pred ) )

constrained multiple linear regression in R

Suppose I have to estimate coefficients a,b in regression:
y=a*x+b*z+c
I know in advance that y always lies in the range 0 <= y <= x, but the regression model sometimes produces y outside this range.
Sample data:
mydata<-data.frame(y=c(0,1,3,4,9,11),x=c(1,3,4,7,10,11),z=c(1,1,1,9,6,7))
round(predict(lm(y~x+z,data=mydata)),2)
1 2 3 4 5 6
-0.87 1.79 3.12 4.30 9.34 10.32
The first predicted value is <0.
I tried a model without an intercept: all predictions are >0, but the third prediction of y is >x (4.03>4):
round(predict(lm(y~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.76 2.94 4.03 4.67 8.92 9.68
I also considered to model proportion y/x instead of y:
mydata$y2x<-mydata$y/mydata$x
round(predict(lm(y2x~x+z,data=mydata)),2)
1 2 3 4 5 6
0.15 0.39 0.50 0.49 0.97 1.04
round(predict(lm(y2x~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.08 0.33 0.46 0.47 0.99 1.07
The sixth prediction is now >1, although a proportion should lie in the range [0,1].
I also tried the approach where glm is used with the offset option, described in "Regression for a Rate variable in R" and at
http://en.wikipedia.org/wiki/Poisson_regression#.22Exposure.22_and_offset
but this was not successful.
Please note that in my data the dependent variable, the proportion y/x, is both zero-inflated and one-inflated.
Any idea what a suitable approach would be to build such a model in R ('glm', 'lm')?
You're on the right track: if 0 ≤ y ≤ x then 0 ≤ (y/x) ≤ 1. This suggests fitting y/x to a logistic model in glm(...). Details are below, but considering that you've only got 6 points, this is a pretty good fit.
The major concern is that the model is not valid unless the error in (y/x) is Normal with constant variance (or, equivalently, the error in y increases with x). If this is true then we should get a (more or less) linear Q-Q plot, which we do.
One nuance: the interface to the glm logistic model wants two columns for y: "number of successes (S)" and "number of failures (F)". It then calculates the probability as S/(S+F). So we have to provide two columns which mimic this: y and x-y. Then glm(...) will calculate y/(y+(x-y)) = y/x.
Finally, the fit summary suggests that x is important and z may or may not be. You might want to try a model that excludes z and see if that improves AIC.
fit = glm(cbind(y,x-y)~x+z, data=mydata, family=binomial(logit))
summary(fit)
# Call:
# glm(formula = cbind(y, x - y) ~ x + z, family = binomial(logit),
# data = mydata)
# Deviance Residuals:
# 1 2 3 4 5 6
# -0.59942 -0.35394 0.62705 0.08405 -0.75590 0.81160
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.0264 1.2177 -1.664 0.0961 .
# x 0.6786 0.2695 2.518 0.0118 *
# z -0.2778 0.1933 -1.437 0.1507
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# (Dispersion parameter for binomial family taken to be 1)
# Null deviance: 13.7587 on 5 degrees of freedom
# Residual deviance: 2.1149 on 3 degrees of freedom
# AIC: 15.809
par(mfrow=c(2,2))
plot(fit) # residuals, Q-Q, Scale-Location, and Leverage Plots
mydata$pred <- predict(fit, type="response")
par(mfrow=c(1,1))
plot(mydata$y/mydata$x,mydata$pred,xlim=c(0,1),ylim=c(0,1), xlab="Actual", ylab="Predicted")
abline(0,1, lty=2, col="blue")
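As a rough sketch of the two follow-up checks (whether dropping z improves AIC, and whether the predictions respect 0 ≤ y ≤ x), using the same mydata as in the question; fit_xz, fit_x, p_hat and y_hat are names chosen here for illustration:

```r
mydata <- data.frame(y = c(0, 1, 3, 4, 9, 11),
                     x = c(1, 3, 4, 7, 10, 11),
                     z = c(1, 1, 1, 9, 6, 7))

# Model with and without z; the response is (successes, failures) = (y, x - y)
fit_xz <- glm(cbind(y, x - y) ~ x + z, data = mydata, family = binomial(logit))
fit_x  <- glm(cbind(y, x - y) ~ x,     data = mydata, family = binomial(logit))

# Lower AIC is better
AIC(fit_xz, fit_x)

# Predicted proportions come from the inverse logit, so they lie between 0 and 1
p_hat <- predict(fit_xz, type = "response")

# Back on the original scale: predicted y = predicted proportion * x
y_hat <- p_hat * mydata$x
cbind(mydata[, c("x", "y")], y_hat = round(y_hat, 2))
```

Since every predicted proportion is between 0 and 1, multiplying by x keeps each predicted y inside [0, x], which is the constraint the question asks for.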
