Linear model in R - Multiplication Expression

Linear model in R - Multiplication Expression - r

I have 3 numerical variables A, B and C. I am trying to create a linear model capable of predicting A. The expression that I am using is the product of B*C in order to predict A; however, when looking at the output I am not able to get my equation because I get and extra variable that I don't know what is it.
Here is my code
MyData<-read.csv("...", header = T)
head(MyData,6)
str(MyData)
#Linear Model
#Expersion A= B*C
Model1<-lm(MyData$A~MyData$B*MyData$C)
summary(Model1)
Output of str(MyData)
> str(MyData)
'data.frame': 6 obs. of 3 variables:
$ A: num 2.5 3.4 2.7 3.6 2.5 2.1
$ B: num 0.01 0.02 0.015 0.017 0.018 0.01
$ C: num 0.1 0.2 0.27 0.19 0.17 0.16
Output of summary(Model1)
Call:
lm(formula = MyData$A ~ MyData$B * MyData$C)
Residuals:
1 2 3 4 5 6
-0.03945 -0.08386 -0.13925 0.67703 -0.40055 -0.01393
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.473 5.774 0.948 0.443
MyData$B -222.431 454.508 -0.489 0.673
MyData$C -26.482 36.222 -0.731 0.541
MyData$B:MyData$C 1938.961 2679.207 0.724 0.544
Residual standard error: 0.5688 on 2 degrees of freedom
Multiple R-squared: 0.6149, Adjusted R-squared: 0.03723
F-statistic: 1.064 on 3 and 2 DF, p-value: 0.5178
lm uses the Wilkinson-Rogers notation so "*" is an iteration, based on the output, right? is this true, how do I create my model using the product of my two variables?

If you just want a single term that is the literal product of the two variables, not an interaction, you can use I():
Model1 <- lm(MyData$A ~ I(MyData$B * MyData$C))
I think in practice, with 2 numeric variables, this ends up the same as Dan's suggestion to use x1:x2 to get just the interaction without the terms for each individual predictor, but it might differ in other cases.

Related

Linear mixed model confidence intervals question

Hoping that you can clear some confusion in my head.
Linear mixed model is constructed with lmerTest:
MODEL <- lmer(Ca content ~ SYSTEM +(1 | YEAR/replicate) +
(1 | YEAR:SYSTEM), data = IOSDV1)
Fun starts happening when I'm trying to get the confidence intervals for the specific levels of the main effect.
Commands emmeans and lsmeans produce the same intervals (example; SYSTEM A3: 23.9-128.9, mean 76.4, SE:8.96).
However, the command as.data.frame(effect("SYSTEM", MODEL)) produces different, narrower confidence intervals (example; SYSTEM A3: 58.0-94.9, mean 76.4, SE:8.96).
What am I missing and what number should I report?
To summarize, for the content of Ca, i have 6 total measurements per treatment (three per year, each from different replication). I will leave the names in the code in my language, as used. Idea is to test if certain production practices affect the content of specific minerals in the grains. Random effects without residual variance were left in the model for this example.
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: CA ~ SISTEM + (1 | LETO/ponovitev) + (1 | LETO:SISTEM)
Data: IOSDV1
REML criterion at convergence: 202.1
Scaled residuals:
Min 1Q Median 3Q Max
-1.60767 -0.74339 0.04665 0.73152 1.50519
Random effects:
Groups Name Variance Std.Dev.
LETO:SISTEM (Intercept) 0.0 0.0
ponovitev:LETO (Intercept) 0.0 0.0
LETO (Intercept) 120.9 11.0
Residual 118.7 10.9
Number of obs: 30, groups: LETO:SISTEM, 10; ponovitev:LETO, 8; LETO, 2
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 76.417 8.959 1.548 8.530 0.0276 *
SISTEM[T.C0] -5.183 6.291 24.000 -0.824 0.4181
SISTEM[T.C110] -13.433 6.291 24.000 -2.135 0.0431 *
SISTEM[T.C165] -7.617 6.291 24.000 -1.211 0.2378
SISTEM[T.C55] -10.883 6.291 24.000 -1.730 0.0965 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) SISTEM[T.C0 SISTEM[T.C11 SISTEM[T.C16
SISTEM[T.C0 -0.351
SISTEM[T.C11 -0.351 0.500
SISTEM[T.C16 -0.351 0.500 0.500
SISTEM[T.C5 -0.351 0.500 0.500 0.500
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
> ls_means(MODEL, ddf="Kenward-Roger")
Least Squares Means table:
Estimate Std. Error df t value lower upper Pr(>|t|)
SISTEMA3 76.4167 8.9586 1.5 8.5299 23.9091 128.9243 0.02853 *
SISTEMC0 71.2333 8.9586 1.5 7.9514 18.7257 123.7409 0.03171 *
SISTEMC110 62.9833 8.9586 1.5 7.0305 10.4757 115.4909 0.03813 *
SISTEMC165 68.8000 8.9586 1.5 7.6797 16.2924 121.3076 0.03341 *
SISTEMC55 65.5333 8.9586 1.5 7.3151 13.0257 118.0409 0.03594 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confidence level: 95%
Degrees of freedom method: Kenward-Roger
> emmeans(MODEL, spec = c("SISTEM"))
SISTEM emmean SE df lower.CL upper.CL
A3 76.4 8.96 1.53 23.9 129
C0 71.2 8.96 1.53 18.7 124
C110 63.0 8.96 1.53 10.5 115
C165 68.8 8.96 1.53 16.3 121
C55 65.5 8.96 1.53 13.0 118
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
> as.data.frame(effect("SISTEM", MODEL))
SISTEM fit se lower upper
1 A3 76.41667 8.958643 57.96600 94.86734
2 C0 71.23333 8.958643 52.78266 89.68400
3 C110 62.98333 8.958643 44.53266 81.43400
4 C165 68.80000 8.958643 50.34933 87.25067
5 C55 65.53333 8.958643 47.08266 83.98400
Many thanks.

I'm pretty sure this has to do with the dreaded "denominator degrees of freedom" question, i.e. what kind (if any) of finite-sample correction is being employed. tl;dr emmeans is using a Kenward-Roger correction, which is more or less the most accurate available option — the only reason not to use K-R is if you have a large data set for which it becomes unbearably slow.
load packages, simulate data, fit model
library(lmerTest)
library(emmeans)
library(effects)
dd <- expand.grid(f=factor(letters[1:3]),g=factor(1:20),rep=1:10)
set.seed(101)
dd$y <- simulate(~f+(1|g), newdata=dd, newparams=list(beta=rep(1,3),theta=1,sigma=1))[[1]]
m <- lmer(y~f+(1|g), data=dd)
compare default emmeans with effects
emmeans(m, ~f)
## f emmean SE df lower.CL upper.CL
## a 0.848 0.212 21.9 0.409 1.29
## b 1.853 0.212 21.9 1.414 2.29
## c 1.863 0.212 21.9 1.424 2.30
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
as.data.frame(effect("f",m))
## f fit se lower upper
## 1 a 0.8480161 0.2117093 0.4322306 1.263802
## 2 b 1.8531805 0.2117093 1.4373950 2.268966
## 3 c 1.8632228 0.2117093 1.4474373 2.279008
effects doesn't explicitly tell us what/whether it's using a finite-sample correction: we could dig around in the documentation or the code to try to find out. Alternatively, we can tell emmeans not to use finite-sample correction:
emmeans(m, ~f, lmer.df="asymptotic")
## f emmean SE df asymp.LCL asymp.UCL
## a 0.848 0.212 Inf 0.433 1.26
## b 1.853 0.212 Inf 1.438 2.27
## c 1.863 0.212 Inf 1.448 2.28
## Degrees-of-freedom method: asymptotic
## Confidence level used: 0.95
Testing shows that these are equivalent to about a tolerance of 0.001 (probably close enough). In principle we should be able to specify KR=TRUE to get effects to use Kenward-Roger correction, but I haven't been able to get that to work yet.
However, I will also say that there's something a little bit funky about your example. If we compute the distance between the mean and the lower CI in units of standard error, for emmeans we get (76.4-23.9)/8.96 = 5.86, which implies a very small effect degrees of freedom (e.g. about 1.55). That seems questionable to me unless your data set is extremely small ...
From your updated post, it appears that Kenward-Roger is indeed estimating only 1.5 denominator df.
In general it is dicey/not recommended to try fitting random effects where the grouping variable has a small number of levels (although see here for a counterargument). I would try treating LETO (which has only two levels) as a fixed effect, i.e.
CA ~ SISTEM + LETO + (1 | LETO:ponovitev) + (1 | LETO:SISTEM)
and see if that helps. (I would expect you would then get on the order of 7 df, which would make your CIs ± 2.4 SE instead of ± 6 SE ...)

How do I include p-value and R-square for the estimates in semPaths?

I am using semPaths (semPlot package) to draw my structural equation models. After some trial and error, I have a pretty good script to show what I want. Except, I haven’t been able to figure out how to include the p-value/significance levels of the estimates/regression coefficients in the figure.
Can/how can I include significance levels either as e.g. p-value in the edge labels below the estimate or as a broken line for insignificance or …?
I am also interested in including the R-square, but not as critically as the significance level.
This is the script I am using so far:
semPaths(fitmod.bac.class2,
what = "std",
whatLabels = "std",
style="ram",
edge.label.cex = 1.3,
layout = 'tree',
intercepts=FALSE,
residuals=FALSE,
nodeLabels = c("Negati-\nvicutes","cand_class\n_MB_A2_108", "CO2", "Bacilli","Ignavi-\nbacteria","C/N", "pH","Water\ncontent"),
sizeMan=7 )
Example of one of the SemPath outputs
In this example the following are not significant:
Ignavibacteria -> First_C_CO2_ugC_gC_day, p = 0.096
pH -> Ignavibacteria, p = 0.151
cand_class_MB_A2_108 <-> Bacilli correlation, p = 0.054
I am a R-user and not really a coder, so I might just be missing a crucial point in the arguments.
I am testing a lot of different models at the moment, and would really like not to have to draw them all up by hand.
update:
Using semPlotModel: Am I right in understanding that semPlotModel doesn’t include the significance levels from the sem function (see my script and output below)? I am specifically looking to include the P(>|z|) for regressions and covariance.
Is it just me that is missing that, or is it not included? If it is not included, my solution is simply just to custom the edge labels.
{model.NA.UP.bac.class2 <- '
#LATANT VARIABLES
#REGRESSIONS
#soil organic carbon quality
c_Negativicutes ~ CN
#microorganisms
First_C_CO2_ugC_gC_day ~ c_Bacilli
First_C_CO2_ugC_gC_day ~ c_Ignavibacteria
First_C_CO2_ugC_gC_day ~ c_cand_class_MB_A2_108
First_C_CO2_ugC_gC_day ~ c_Negativicutes
#pH
c_Bacilli ~pH
c_Ignavibacteria ~pH
c_cand_class_MB_A2_108~pH
c_Negativicutes ~pH
#COVARIANCE
initial_water ~~ CN
c_cand_class_MB_A2_108 ~~ c_Bacilli
'
fitmod.bac.class2 <- sem(model.NA.UP.bac.class2, data=datapNA.UP.log, missing="ml", meanstructure=TRUE, fixed.x=FALSE, std.lv=FALSE, std.ov=FALSE)
summary(fitmod.bac.class2, standardized=TRUE, fit.measures=TRUE, rsq=TRUE)
out <- capture.output(summary(fitmod.bac.class2, standardized=TRUE, fit.measures=TRUE, rsq=TRUE))
}
Output:
lavaan 0.6-5 ended normally after 188 iterations
Estimator ML
Optimization method NLMINB
Number of free parameters 28
Number of observations 30
Number of missing patterns 1
Model Test User Model:
Test statistic 17.816
Degrees of freedom 16
P-value (Chi-square) 0.335
Model Test Baseline Model:
Test statistic 101.570
Degrees of freedom 28
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.975
Tucker-Lewis Index (TLI) 0.957
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) 472.465
Loglikelihood unrestricted model (H1) 481.373
Akaike (AIC) -888.930
Bayesian (BIC) -849.697
Sample-size adjusted Bayesian (BIC) -936.875
Root Mean Square Error of Approximation:
RMSEA 0.062
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.185
P-value RMSEA <= 0.05 0.414
Standardized Root Mean Square Residual:
SRMR 0.107
Parameter Estimates:
Information Observed
Observed information based on Hessian
Standard errors Standard
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
c_Negativicutes ~
CN 0.419 0.143 2.939 0.003 0.419 0.416
c_cand_class_MB_A2_108 ~
CN -0.433 0.160 -2.707 0.007 -0.433 -0.394
First_C_CO2_ugC_gC_day ~
c_Bacilli 0.525 0.128 4.092 0.000 0.525 0.496
c_Ignavibacter 0.207 0.124 1.667 0.096 0.207 0.195
c_c__MB_A2_108 0.310 0.125 2.475 0.013 0.310 0.301
c_Negativicuts 0.304 0.137 2.220 0.026 0.304 0.271
c_Bacilli ~
pH 0.624 0.135 4.604 0.000 0.624 0.643
c_Ignavibacteria ~
pH 0.245 0.171 1.436 0.151 0.245 0.254
c_cand_class_MB_A2_108 ~
pH 0.393 0.151 2.597 0.009 0.393 0.394
c_Negativicutes ~
pH 0.435 0.129 3.361 0.001 0.435 0.476
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
CN ~~
initial_water 0.001 0.000 2.679 0.007 0.001 0.561
.c_cand_class_MB_A2_108 ~~
.c_Bacilli -0.000 0.000 -1.923 0.054 -0.000 -0.388
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.c_Negativicuts 0.145 0.198 0.734 0.463 0.145 3.826
.c_c__MB_A2_108 1.038 0.226 4.594 0.000 1.038 25.076
.Frs_C_CO2_C_C_ -0.346 0.233 -1.485 0.137 -0.346 -8.115
.c_Bacilli 0.376 0.135 2.778 0.005 0.376 9.340
.c_Ignavibacter 0.754 0.170 4.424 0.000 0.754 18.796
CN 0.998 0.007 145.158 0.000 0.998 26.502
pH 0.998 0.008 131.642 0.000 0.998 24.034
initial_water 0.998 0.008 125.994 0.000 0.998 23.003
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.c_Negativicuts 0.001 0.000 3.873 0.000 0.001 0.600
.c_c__MB_A2_108 0.001 0.000 3.833 0.000 0.001 0.689
.Frs_C_CO2_C_C_ 0.001 0.000 3.873 0.000 0.001 0.408
.c_Bacilli 0.001 0.000 3.873 0.000 0.001 0.586
.c_Ignavibacter 0.002 0.000 3.873 0.000 0.002 0.936
CN 0.001 0.000 3.873 0.000 0.001 1.000
initial_water 0.002 0.000 3.873 0.000 0.002 1.000
pH 0.002 0.000 3.873 0.000 0.002 1.000
R-Square:
Estimate
c_Negativicuts 0.400
c_c__MB_A2_108 0.311
Frs_C_CO2_C_C_ 0.592
c_Bacilli 0.414
c_Ignavibacter 0.064
Warning message:
In lav_model_hessian(lavmodel = lavmodel, lavsamplestats = lavsamplestats, :
lavaan WARNING: Hessian is not fully symmetric. Max diff = 5.15131396241486e-05

This example is taken from ?semPaths since we don't have your object.
library('semPlot')
modFile <- tempfile(fileext = '.OUT')
download.file('http://sachaepskamp.com/files/mi1.OUT', modFile)
Use semPlotModel to get the object without plotting. There you can inspect what is to be plotted. I just dug around without reading the docs until I found what it seems to be using.
After you run semPlotModel, the object has an element x#Pars which contains the edges, nodes, and the std which is being used for the edge labels in your case. semPaths also has an argument that allows you to make custom edge labels, so you can take the data you need from x#Pars and add your p-values:
x <- semPlotModel(modFile)
x#Pars
# label lhs edge rhs est std group fixed par
# 1 lambda[11]^{(y)} perfIQ -> pc 1.000 0.6219648 Group 1 TRUE 0
# 2 lambda[21]^{(y)} perfIQ -> pa 0.923 0.5664888 Group 1 FALSE 1
# 3 lambda[31]^{(y)} perfIQ -> oa 1.098 0.6550159 Group 1 FALSE 2
# 4 lambda[41]^{(y)} perfIQ -> ma 0.784 0.4609990 Group 1 FALSE 3
# 5 theta[11]^{(epsilon)} pc <-> pc 5.088 0.6131598 Group 1 FALSE 5
# 10 theta[22]^{(epsilon)} pa <-> pa 5.787 0.6790905 Group 1 FALSE 6
# 15 theta[33]^{(epsilon)} oa <-> oa 5.150 0.5709541 Group 1 FALSE 7
# 20 theta[44]^{(epsilon)} ma <-> ma 7.311 0.7874800 Group 1 FALSE 8
# 21 psi[11] perfIQ <-> perfIQ 3.210 1.0000000 Group 1 FALSE 4
# 22 tau[1]^{(y)} int pc 10.500 NA Group 1 FALSE 9
# 23 tau[2]^{(y)} int pa 10.374 NA Group 1 FALSE 10
# 24 tau[3]^{(y)} int oa 10.663 NA Group 1 FALSE 11
# 25 tau[4]^{(y)} int ma 10.371 NA Group 1 FALSE 12
# 11 lambda[11]^{(y)} perfIQ -> pc 1.000 0.6515609 Group 2 TRUE 0
# 27 lambda[21]^{(y)} perfIQ -> pa 0.923 0.5876948 Group 2 FALSE 1
# 31 lambda[31]^{(y)} perfIQ -> oa 1.098 0.6981974 Group 2 FALSE 2
# 41 lambda[41]^{(y)} perfIQ -> ma 0.784 0.4621919 Group 2 FALSE 3
# 51 theta[11]^{(epsilon)} pc <-> pc 5.006 0.5754684 Group 2 FALSE 14
# 101 theta[22]^{(epsilon)} pa <-> pa 5.963 0.6546148 Group 2 FALSE 15
# 151 theta[33]^{(epsilon)} oa <-> oa 4.681 0.5125204 Group 2 FALSE 16
# 201 theta[44]^{(epsilon)} ma <-> ma 8.356 0.7863786 Group 2 FALSE 17
# 211 psi[11] perfIQ <-> perfIQ 3.693 1.0000000 Group 2 FALSE 13
# 221 tau[1]^{(y)} int pc 10.500 NA Group 2 FALSE 9
# 231 tau[2]^{(y)} int pa 10.374 NA Group 2 FALSE 10
# 241 tau[3]^{(y)} int oa 10.663 NA Group 2 FALSE 11
# 251 tau[4]^{(y)} int ma 10.371 NA Group 2 FALSE 12
# 26 alpha[1] int perfIQ -2.469 NA Group 2 FALSE 18
As you can see there are more edge labels than ones that are plotted, and I have no idea how it chooses which to use, so I am just taking the first four from each group (since there are four edges shown and the stds match those. Maybe there is an option to plot all of them or select which ones you need--I haven't read the docs.
## take first four stds from each group, generate some p-values
l <- sapply(split(x#Pars$std, x#Pars$group), function(x) head(x, 4))
set.seed(1)
l <- sprintf('%.3f, p=%s', l, format.pval(runif(length(l)), digits = 2))
l
# [1] "0.622, p=0.27" "0.566, p=0.37" "0.655, p=0.57" "0.461, p=0.91" "0.652, p=0.20" "0.588, p=0.90" "0.698, p=0.94" "0.462, p=0.66"
Then you can plot the object with your new labels, edgeLabels = l
layout(1:2)
semPaths(
x,
edgeLabels = l,
ask = FALSE, title = FALSE,
what = 'std',
whatLabels = 'std',
style = 'ram',
edge.label.cex = 1.3,
layout = 'tree',
intercepts = FALSE,
residuals = FALSE,
sizeMan = 7
)

With the help from #rawr, I have worked it out. If anybody else needs to include estimates and p-value from Lavaan in their semPaths, here is how it can be done.
#extracting the parameters from the sem model and selecting the interactions relevant for the semPaths (here, I need 12 estimates and p-values)
table2<-parameterEstimates(fitmod.bac.class2,standardized=TRUE) %>% head(12)
#turning the chosen parameters into text
b<-gettextf('%.3f \n p=%.3f', table2$std.all, digits=table2$pvalue)
I can honestly say that I do not understand how the last bit of script works. This is copied from rawr's answer before a lot of trial and error until it worked. There might (quite possibly) be a nicer way to write it, but it works :)
#putting that list into edgeLabels in sempaths
semPaths(fitmod.bac.class2,
what = "std",
edgeLabels = b,
style="ram",
edge.label.cex = 1,
layout = 'tree',
intercepts=FALSE,
residuals=FALSE,
nodeLabels = c("Negati-\nvicutes","cand_class\n_MB_A2_108", "CO2", "Bacilli","Ignavi-\nbacteria","C/N", "pH","Water\ncontent"),
sizeMan=7
)

Just a small, but relevant detail for an improvement for the above answer.
The above code requires an inspection of the parameter table to count how many lines to maintain to specify as in %>%head(4).
We can exclude from the extracted parameter table those lines which lhs and rhs are not equal.
#extracting the parameters from the sem model and selecting the interactions relevant for the semPaths
table2<-parameterEstimates(fitmod.bac.class2,standardized=TRUE)%>%as.dataframe()
table2<-table2[!table2$lhs==table2$rhs,]
If the formula comprised also extra lines as those with ':=' those also will comprise the parameter table, and should be removed.
The remaining keeps the same...
#turning the chosen parameters into text
b<-gettextf('%.3f \n p=%.3f', table2$std.all, digits=table2$pvalue)
#putting that list into edgeLabels in sempaths
semPaths(fitmod.bac.class2,
what = "std",
edgeLabels = b,
style="ram",
edge.label.cex = 1,
layout = 'tree',
intercepts=FALSE,
residuals=FALSE,
nodeLabels = c("Negati-\nvicutes","cand_class\n_MB_A2_108", "CO2", "Bacilli","Ignavi-\nbacteria","C/N", "pH","Water\ncontent"),
sizeMan=7
)

How do I use the glm() function?

I'm trying to fit a general linear model (GLM) on my data using R. I have a Y continuous variable and two categorical factors, A and B. Each factor is coded as 0 or 1, for presence or absence.
Even if just looking at the data I see a clear interaction between A and B, the GLM says that p-value>>>0.05. Am I doing something wrong?
First of all I create the data frame including my data for the GLM, which consists on a Y dependent variable and two factors, A and B. These are two level factors (0 and 1). There are 3 replicates per combination.
A<-c(0,0,0,1,1,1,0,0,0,1,1,1)
B<-c(0,0,0,0,0,0,1,1,1,1,1,1)
Y<-c(0.90,0.87,0.93,0.85,0.98,0.96,0.56,0.58,0.59,0.02,0.03,0.04)
my_data<-data.frame(A,B,Y)
Let’s see how it looks like:
my_data
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just looking on the data, there is a clear interaction between factor A and factor B, as the value of Y dramatically decreases when A and B are present (that is A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as p-value>>>0.05
attach(my_data)
## The following objects are masked _by_ .GlobalEnv:
##
## A, B, Y
my_glm<-glm(Y~A+B+A*B,data=my_data,family=binomial)
## Warning: non-integer #successes in a binomial glm!
summary(my_glm)
##
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
##
## Number of Fisher Scoring iterations: 6

While you state Y is continuous, the data shows that Y is rather a fraction. Hence, probably the reason you tried to apply GLM in the first place.
To model fractions (i.e. continuous values bounded by 0 and 1) can be done with logistic regression if certain assumptions are fullfilled. See the following cross-validated post for details: https://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional. However, from the data description it is not clear that those assumptions are fullfilled.
An alternative to model fractions are beta regression or fractional repsonse models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
install.packages("betareg")
library("betareg")
result.betareg <-betareg(Y~A+B+A*B,data=my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
install.packages("frm")
library("frm")
frm(Y,cbind(A, B, AB=A*B),linkfrac="logit")
*** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
#
# Note: robust standard errors
#
# Number of observations: 12
# R-squared: 0.992

The family=binomial implies Logit (Logistic) Regression, which is itself produces a binary result.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.

The data shows an interaction. Try to fit a different model, logistic is not appropriate.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for all factors and interaction.
fit <- aov(Y~(A*B),data=my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R lavaan sem categorical variable no standard error

I want to compute a structural equation model with the sem() function in R with the package lavaan.
There are two categorial variables, one latent exogenous and one latent endogenous, I want to include in the final version of the model.
When I include one of the categorial variables in the model, however, R produces the following warning:
1: In estimateVCOV(lavaanModel, samplestats = lavaanSampleStats,
options = lavaanOptions, : lavaan WARNING: could not compute
standard errors!
2: In computeTestStatistic(lavaanModel, partable = lavaanParTable, :
lavaan WARNING: could not compute scaled test statistic
Code used:
model1 <- '
Wertschaetzung_Essen =~ abwechslungsreiche_M + schnell_zubereitbar + koche_sehr_gerne + koche_sehr_haeufig
Fleischverzicht =~ Ern_Index1
Fleischverzicht ~ Wertschaetzung_Essen
'
fit_model1 <- sem(model1, data=survey2_subset, ordered = c("Ern_Index1"))
Note: This is only a small version of the final model and in which I only introduce one categorial variable. The warning, however, is the same for more complex versions of the model.
Output
str(survey2_subset):
'data.frame': 3676 obs. of 116 variables:
$ abwechslungsreiche_M : num 4 2 3 4 3 3 4 3 3 3 ...
$ schnell_zubereitbar : num 0 3 2 0 0 1 3 2 1 1 ...
$ koche_sehr_gerne : num 1 3 3 1 3 1 4 4 4 3 ...
$ koche_sehr_haeufig : num 2 2 3 NA 3 2 2 4 3 3 ...
$ Ern_Index1 : num 1 1 1 1 0 0 1 0 1 0 ...
summary(fit_model1, fit.measures = TRUE, standardized=TRUE)
lavaan (0.5-15) converged normally after 31 iterations
Used Total
Number of observations 3469 3676
Estimator DWLS Robust
Minimum Function Test Statistic 13.716 NA
Degrees of freedom 4 4
P-value (Chi-square) 0.008 NA
Scaling correction factor NA
Shift parameter
for simple second-order correction (Mplus variant)
Model test baseline model:
Minimum Function Test Statistic 2176.159 1582.139
Degrees of freedom 10 10
P-value 0.000 0.000
User model versus baseline model:
Comparative Fit Index (CFI) 0.996 NA
Tucker-Lewis Index (TLI) 0.989 NA
Root Mean Square Error of Approximation:
RMSEA 0.026 NA
90 Percent Confidence Interval 0.012 0.042 NA NA
P-value RMSEA <= 0.05 0.994 NA
Parameter estimates:
Information Expected
Standard Errors Robust.sem
Estimate Std.err Z-value P(>|z|) Std.lv Std.all
Latent variables:
Wertschaetzung_Essen =~
abwchslngsr_M 1.000 0.363 0.436
schnll_zbrtbr 1.179 0.428 0.438
koche_shr_grn 2.549 0.925 0.846
koche_shr_hfg 2.530 0.918 0.775
Fleischverzicht =~
Ern_Index1 1.000 0.249 0.249
Regressions:
Fleischverzicht ~
Wrtschtzng_Es 0.302 0.440 0.440
Intercepts:
abwchslngsr_M 3.133 3.133 3.760
schnll_zbrtbr 1.701 1.701 1.741
koche_shr_grn 2.978 2.978 2.725
koche_shr_hfg 2.543 2.543 2.148
Wrtschtzng_Es 0.000 0.000 0.000
Fleischvrzcht 0.000 0.000 0.000
Thresholds:
Ern_Index1|t1 0.197 0.197 0.197
Variances:
abwchslngsr_M 0.562 0.562 0.810
schnll_zbrtbr 0.771 0.771 0.808
koche_shr_grn 0.339 0.339 0.284
koche_shr_hfg 0.559 0.559 0.399
Ern_Index1 0.938 0.938 0.938
Wrtschtzng_Es 0.132 1.000 1.000
Fleischvrzcht 0.050 0.806 0.806
Is the model not identified? There should be enough degrees of freedom and the loadings of the first manifest items are set to one.
How can I resolve this issue?

My first thought was:
You can´t have missing values in the dataframe, because with categorial variables WLSMV is used and FIML (missing="ML") is only usable with ML estimates. Perhaps that´s a problem.
Also: Does lavaan automatically fix the residual-variance of "Fleischverzicht" to 0 (or some other value)? A single-item latent variable would not be identified without that, I think.

constrained multiple linear regression in R

Suppose I have to estimate coefficients a,b in regression:
y=a*x+b*z+c
I know in advance that y is always in range y>=0 and y<=x, but regression model produces sometimes y outside of this range.
Sample data:
mydata<-data.frame(y=c(0,1,3,4,9,11),x=c(1,3,4,7,10,11),z=c(1,1,1,9,6,7))
round(predict(lm(y~x+z,data=mydata)),2)
1 2 3 4 5 6
-0.87 1.79 3.12 4.30 9.34 10.32
First predicted value is <0.
I tried model without intercept: all predictions are >0, but third prediction of y is >x (4.03>3)
round(predict(lm(y~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.76 2.94 4.03 4.67 8.92 9.68
I also considered to model proportion y/x instead of y:
mydata$y2x<-mydata$y/mydata$x
round(predict(lm(y2x~x+z,data=mydata)),2)
1 2 3 4 5 6
0.15 0.39 0.50 0.49 0.97 1.04
round(predict(lm(y2x~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.08 0.33 0.46 0.47 0.99 1.07
But now sixth prediction is >1, but proportion should be in range [0,1].
I also tried to apply method where glm is used with offset option: Regression for a Rate variable in R
and
http://en.wikipedia.org/wiki/Poisson_regression#.22Exposure.22_and_offset
but this was not successfull.
Please note, in my data dependent variable: proportion y/x is both zero-inflated and one-inflated.
Any idea, what is suitable approach to build a model in R ('glm','lm')?

You're on the right track: if 0 ≤ y ≤ x then 0 ≤ (y/x) ≤ 1. This suggests fitting y/x to a logistic model in glm(...). Details are below, but considering that you've only got 6 points, this is a pretty good fit.
The major concern is that the model is not valid unless the error in (y/x) is Normal with constant variance (or, equivalently, the error in y increases with x). If this is true then we should get a (more or less) linear Q-Q plot, which we do.
One nuance: the interface to the glm logistic model wants two columns for y: "number of successes (S)" and "number of failures (F)". It then calculates the probability as S/(S+F). So we have to provide two columns which mimic this: y and x-y. Then glm(...) will calculate y/(y+(x-y)) = y/x.
Finally, the fit summary suggests that x is important and z may or may not be. You might want to try a model that excludes z and see if that improves AIC.
fit = glm(cbind(y,x-y)~x+z, data=mydata, family=binomial(logit))
summary(fit)
# Call:
# glm(formula = cbind(y, x - y) ~ x + z, family = binomial(logit),
# data = mydata)
# Deviance Residuals:
# 1 2 3 4 5 6
# -0.59942 -0.35394 0.62705 0.08405 -0.75590 0.81160
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.0264 1.2177 -1.664 0.0961 .
# x 0.6786 0.2695 2.518 0.0118 *
# z -0.2778 0.1933 -1.437 0.1507
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# (Dispersion parameter for binomial family taken to be 1)
# Null deviance: 13.7587 on 5 degrees of freedom
# Residual deviance: 2.1149 on 3 degrees of freedom
# AIC: 15.809
par(mfrow=c(2,2))
plot(fit) # residuals, Q-Q, Scale-Location, and Leverage Plots
mydata$pred <- predict(fit, type="response")
par(mfrow=c(1,1))
plot(mydata$y/mydata$x,mydata$pred,xlim=c(0,1),ylim=c(0,1), xlab="Actual", ylab="Predicted")
abline(0,1, lty=2, col="blue")

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Linear model in R - Multiplication Expression - r

Related

Linear mixed model confidence intervals question

How do I include p-value and R-square for the estimates in semPaths?

How do I use the glm() function?

R lavaan sem categorical variable no standard error

constrained multiple linear regression in R

Categories

Resources