multivariate regression - r

I have two dependent variables that both depend on two variables AND on each other. Can this be modelled in R? It must be possible, but I can't figure out how. Does anyone have a hint?
In clear terms:
I want to model my data with the following model:
Y1=X1*coef1+X2*coef2
Y2=X1*coef2+X2*coef3
Note: coef2 appears in both lines
Xi and Yi are the input and output data, respectively.
I got this far:
lm(Y1~X1+X2,mydata)
Now, how do I add the second line of the model, including the cross dependency?
Your help is greatly appreciated!
Cheers, Bastiaan

Try this:
# sample data - true coefs are 2, 3, 4
set.seed(123)
n <- 35
DF <- data.frame(X1 = 1, X2 = 1:n, X3 = (1:n)^2)
DF <- transform(DF, Y1 = X1 * 2 + X2 * 3 + rnorm(n),
                    Y2 = X1 * 3 + X2 * 4 + rnorm(n))
# construct data frame for required model
DF2 <- with(DF, data.frame(y = c(Y1, Y2),
                           x1 = c(X1, 0*X1),
                           x2 = c(X2, X1),
                           x3 = c(0*X2, X2)))
lm(y ~ . - 1, DF2)
We see it does, indeed, recover the true coefs of 2, 3, 4:
> lm(y ~ . - 1, DF2)
Call:
lm(formula = y ~ . - 1, data = DF2)
Coefficients:
   x1     x2     x3
2.084  2.997  4.007
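If you want to map the stacked fit back to the two original equations, the coefficients can be read off directly: coef1 is the x1 estimate, coef2 is the x2 estimate (the one shared by both lines), and coef3 is the x3 estimate. A minimal sketch continuing from the example above (the object name fit is my own):
fit <- lm(y ~ . - 1, DF2)
b <- coef(fit)   # b["x1"] = coef1, b["x2"] = coef2 (shared), b["x3"] = coef3
# fitted values for each of the original equations
Y1.hat <- with(DF, X1 * b["x1"] + X2 * b["x2"])
Y2.hat <- with(DF, X1 * b["x2"] + X2 * b["x3"])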

Related

How to run the same regression but replacing the dataframe used in R?

I have 3 dataframes (df1, df2, df3) with the same variable names, and I would like to perform essentially the same regressions on all 3 dataframes. My regressions currently look like this:
m1 <- lm(y ~ x1 + x2, df1)
m2 <- lm(y ~ x1 + x2, df2)
m3 <- lm(y ~ x1 + x2, df3)
Is there a way I can use a for-loop to perform these regressions by just swapping out the dataframe used?
Thank you
Or add the dataframes to a list and map the lm function over the list:
library(tidyverse)
df1 <- tibble(x = 1:20, y = 3*x + rnorm(20, sd = 5))
df2 <- tibble(x = 1:20, y = 3*x + rnorm(20, sd = 5))
df3 <- tibble(x = 1:20, y = 3*x + rnorm(20, sd = 5))
df_list <- list(df1, df2, df3)
m <- map(df_list, ~lm(y ~ x, data = .))
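If you specifically want the for-loop the question asks about, the equivalent is just to loop over the same list; a small sketch under the same setup (the object name fits is my own):
fits <- vector("list", length(df_list))
for (i in seq_along(df_list)) {
  fits[[i]] <- lm(y ~ x, data = df_list[[i]])
}
fits[[2]]   # the fit for df2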
Using update.
(fit <- lm(Y ~ X1 + X2 + X3, df1))
# Call:
# lm(formula = Y ~ X1 + X2 + X3, data = df1)
#
# Coefficients:
# (Intercept)      X1      X2      X3
#      0.9416 -0.2400  0.6481  0.9357
update(fit, data=df2)
# Call:
# lm(formula = Y ~ X1 + X2 + X3, data = df2)
#
# Coefficients:
# (Intercept)      X1      X2      X3
#      0.6948  0.3199  0.6255  0.9588
Or lapply
lapply(mget(ls(pattern='^df\\d$')), lm, formula=Y ~ X1 + X2 + X3)
# $df1
#
# Call:
# FUN(formula = ..1, data = X[[i]])
#
# Coefficients:
# (Intercept)      X1      X2      X3
#      0.9416 -0.2400  0.6481  0.9357
#
#
# $df2
#
# Call:
# FUN(formula = ..1, data = X[[i]])
#
# Coefficients:
# (Intercept)      X1      X2      X3
#      0.6948  0.3199  0.6255  0.9588
#
#
# $df3
#
# Call:
# FUN(formula = ..1, data = X[[i]])
#
# Coefficients:
# (Intercept)      X1      X2      X3
#      0.5720  0.6106 -0.1576  1.1391
Data:
set.seed(42)
f <- \() transform(data.frame(X1=rnorm(10), X2=rnorm(10), X3=rnorm(10)),
                   Y=1 + .2*X1 + .4*X2 + .8*X3 + rnorm(10))
set.seed(42); df1 <- f(); df2 <- f(); df3 <- f()
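Whichever variant you use, the result is a (named) list of lm objects, so pulling the pieces out afterwards is uniform; for example, a sketch saving the lapply result above as fits (the name is my own):
fits <- lapply(mget(ls(pattern='^df\\d$')), lm, formula = Y ~ X1 + X2 + X3)
t(sapply(fits, coef))   # one row of coefficients per data frame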

Possible to add second predictor that positively correlates with y but kills the effect of another predictor?

I simulated data that includes two x-variables, both positively correlated with y. How should I simply edit 'x2' so that it positively correlates with 'y' but removes the association between 'x1' and 'y'?
library(tidyverse)
library(jtools)
set.seed(123)
n = 50
d = tibble(y = rnorm(n, 50, 10))
d = d %>%
  arrange(y) %>%
  mutate(x1 = y + rnorm(nrow(d), 0, 3),
         x2 = y + rnorm(nrow(d), 0, 10))
m1 = lm(y ~ x1, data = d)
effect_plot(m1, "x1", interval = T)
m2 = lm(y ~ x1 + x2, data = d)
effect_plot(m2, "x1", interval = T)
You could construct x2 so that it correlates closely with x1:
library(tidyverse)
library(jtools)
set.seed(123)
n = 50
d = tibble(y = rnorm(n, 50, 10))
d = d %>%
  arrange(y) %>%
  mutate(x1 = y + rnorm(nrow(d), 0, 3),
         x2 = x1 + rnorm(nrow(d), 0, 0.5))
This way x1 alone has a positive effect:
m1 = lm(y ~ x1, data = d)
effect_plot(m1, "x1", interval = T)
and x2 alone has a positive effect
m2 = lm(y ~ x2, data = d)
effect_plot(m2, "x2", interval = T)
But the effect of x1 when combined with x2 is effectively "killed"
m3 = lm(y ~ x1 + x2, data = d)
effect_plot(m3, "x1", interval = T)
Created on 2022-03-14 by the reprex package (v2.0.1)
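What drives this is collinearity: once x2 is almost an exact copy of x1, the model has no way to attribute the shared variance to x1 specifically, so x1's partial coefficient (and its apparent effect) collapses. A quick check, assuming the d and m3 objects from the answer above:
cor(d$x1, d$x2)           # very close to 1: x1 and x2 carry nearly the same information
summary(m3)$coefficients  # x1's estimate shrinks and its standard error inflates once x2 is in the model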

Find confidence interval of a regression line at its center per group

I have the following simulated data to fit a regression model, where y, x1 are continuous variables and x2 is a categorical variable.
y <- rnorm(100, 2, 3)
x1 <- rnorm(100, 2.5, 2.8)
x2 <- factor(c(rep(1,45), rep(0,55)))
I need to find the 95% confidence intervals for y when x2 = 0 and x1 equals the mean of x1 within the x2 = 0 group.
I did
mod <- lm(y ~ x1 * x2)
tapply(x1, x2, mean)
#        0        1
# 3.107850 2.294103
pred.dat <- data.frame(x1 = 3.107850, x2 = "0")
predict(mod, pred.dat, interval = "confidence", level = 0.95)
#        fit      lwr      upr
# 1 2.413393 1.626784 3.200003
predict(mod, pred.dat, interval = "prediction", level = 0.95)
#        fit       lwr      upr
# 1 2.413393 -3.473052 8.299839
I want to know whether I did this correctly or not. Also I want to know whether there is any easier way than this.
Setup
set.seed(0)
y <- rnorm(100, 2, 3)
x1 <- rnorm(100, 2.5, 2.8)
x2 <- factor(c(rep(1,45), rep(0,55)))
mod <- lm(y ~ x1 * x2)
95% confidence intervals for y when x2 = 0 and x1 equals to the mean within x2 = 0.
I want to know whether I did this correctly or not.
Your use of predict is correct.
I want to know whether there is any easier way than this.
The tapply can be skipped if you do
pred.data <- data.frame(x1 = mean(x1[x2 == "0"]), x2 = "0")
#         x1 x2
# 1 2.649924  0
Or you can do
pred.data <- setNames(stack(tapply(x1, x2, mean)), c("x1", "x2"))
#         x1 x2
# 1 2.649924  0
# 2 2.033328  1
so that you can get the result for both factor levels in one go.
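To actually get both intervals in one call, pass that two-row data frame to predict; a short sketch continuing the setup above:
pred.data <- setNames(stack(tapply(x1, x2, mean)), c("x1", "x2"))
predict(mod, pred.data, interval = "confidence", level = 0.95)
# row 1 is x2 = 0 at its within-group mean of x1; row 2 is x2 = 1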

Using svyglm within plyr call

This is clearly something idiosyncratic to R's survey package. I'm trying to use llply from the plyr package to make a list of svyglm models. Here's an example:
library(survey)
library(plyr)
foo <- data.frame(y1 = rbinom(50, size = 1, prob = .25),
                  y2 = rbinom(50, size = 1, prob = .5),
                  y3 = rbinom(50, size = 1, prob = .75),
                  x1 = rnorm(50, 0, 2),
                  x2 = rnorm(50, 0, 2),
                  x3 = rnorm(50, 0, 2),
                  weights = runif(50, .5, 1.5))
My list of dependent variables' column numbers
dvnum <- 1:3
Indicating no clusters or strata in this sample
wd <- svydesign(ids= ~0, strata= NULL, weights= ~weights, data = foo)
A single svyglm call works
svyglm(y1 ~ x1 + x2 + x3, design= wd)
And llply will make a list of base R glm models
llply(dvnum, function(i) glm(foo[,i] ~ x1 + x2 + x3, data = foo))
But llply throws the following error when I try to adapt this method to svyglm
llply(dvnum, function(i) svyglm(foo[,i] ~ x1 + x2 + x3, design= wd))
Error in svyglm.survey.design(foo[, i] ~ x1 + x2 + x3, design = wd) :
all variables must be in design= argument
So my question is: how do I use llply and svyglm?
DWin was on to something with his comment about using the correct formula.
reformulate will do this:
dvnum <- names(foo)[1:3]
llply(dvnum, function(i) {
  svyglm(reformulate(c('x1', 'x2', 'x3'), response = i), design = wd)
})
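It can also help to name the resulting list after the responses, so individual models are easy to retrieve later; a small sketch assuming the wd design and the dvnum vector of response names from above (the name models is my own):
models <- llply(dvnum, function(i) {
  svyglm(reformulate(c('x1', 'x2', 'x3'), response = i), design = wd)
})
names(models) <- dvnum
summary(models$y1)   # inspect the fit for the y1 outcome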

R script - NLS not working

I have 5 (x,y) data points and I'm trying to find a best fit solution consisting of two lines which intersect at a point (x0,y0), and which follow these equations:
y1 = (m1)(x1 - x0) + y0
y2 = (m2)(x2 - x0) + y0
Specifically, I require that the intersection must occur between x=2 and x=3. Have a look at the code:
#Initialize x1, y1, x2, y2
x1 <- c(1,2)
y1 <- c(10,10)
x2 <- c(3,4,5)
y2 <- c(20,30,40)
g <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
q <- nls(c(y1, y2) ~ ifelse(g == TRUE, m1 * (x1 - x0) + y0, m2 * (x2 - x0) + y0),
         start = c(m1 = -1, m2 = 1, y0 = 0, x0 = 2),
         algorithm = "port",
         lower = c(m1 = -Inf, m2 = -Inf, y0 = -Inf, x0 = 2),
         upper = c(m1 = Inf, m2 = Inf, y0 = Inf, x0 = 3))
coef <- coef(q)
m1 <- coef[1]
m2 <- coef[2]
y0 <- coef[3]
x0 <- coef[4]
#Plot the original x1, y1, and x2, y2
plot(x1,y1,xlim=c(1,5),ylim=c(0,50))
points(x2,y2)
#Plot the fits
x1 <- c(1,2,3,4,5)
fit1 <- m1 * (x1 - x0) + y0
lines(x1, fit1, col="red")
x2 <- c(1,2,3,4,5)
fit2 <- m2 * (x2 - x0) + y0
lines(x2, fit2, col="blue")
So, you can see the data points listed there. Then, I run it through my nls, get my parameters m1, m2, x0, y0 (the slopes, and the intersection point).
But, take a look at the solution:
Clearly, the red line (which is supposed to be based only on the first 2 points) is not the best-fit line for those points. The same goes for the blue line (the 2nd fit), which is supposed to depend only on the last 3 points. What is wrong here?
This is segmented regression:
# input data
x1 <- c(1,2); y1 <- c(10,10); x2 <- c(3,4,5); y2 <- c(20,30,40)
x <- c(x1, x2); y <- c(y1, y2)
# segmented regression
library(segmented)
fm <- segmented.lm(lm(y ~ x), ~ x, NA, seg.control(stop.if.error = FALSE, K = 2))
summary(fm)
# plot
plot(fm)
points(y ~ x)
See ?lm, ?segmented.lm and ?seg.control for more info.
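If I recall the segmented API correctly, the estimated breakpoint and the per-segment slopes can then be read off the fitted object; a sketch assuming fm from above:
fm$psi        # estimated breakpoint(s) with standard error
slope(fm)     # estimated slope of each segment
confint(fm)   # confidence interval for the breakpoint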
I'm not exactly sure what's wrong, but I can get it to work by rearranging things a bit. Please note the warning in ?nls: "Do not use nls on artificial 'zero-residual' data." I added a bit of noise.
## Initialize x1, y1, x2, y2
x1 <- c(1,2)
y1 <- c(10,10)
x2 <- c(3,4,5)
y2 <- c(20,30,40)
## make single x, y vector
x <- c(x1,x2)
set.seed(1001)
## (add a bit of noise to avoid zero-residual artificiality)
y <- c(y1,y2)+rnorm(5,sd=0.01)
g <- c(TRUE,TRUE,FALSE,FALSE,FALSE) ## specify identities of points
## particular changes:
## * you have lower=upper=2 for x0. Did you want 2<x0<3?
## * specified data argument explicitly (allows use of predict() etc.)
## * changed name from 'q' to 'fit1' (avoid R built-in function)
fit1 <- nls(y ~ ifelse(g, m1, m1 + delta_m) * (x - x0) + y0,
            start = c(m1 = -1, delta_m = 2, y0 = 0, x0 = 2),
            algorithm = "port",
            lower = c(m1 = -Inf, delta_m = 0, y0 = -Inf, x0 = 2),
            upper = c(m1 = Inf, delta_m = Inf, y0 = Inf, x0 = 3),
            data = data.frame(x, y))
#Plot the original 'data'
plot(x, y, col = rep(c("red", "blue"), c(2, 3)),
     xlim = c(1, 5), ylim = c(0, 50))
## add predicted values
xvec <- seq(1,5,length.out=101)
lines(xvec,predict(fit1,newdata=data.frame(x=xvec)))
Edit: based the ifelse clause on point identity, not x position.
Edit: changed to require the second slope to be greater than the first slope.
On a second look, I think the issue above is probably due to the use of separate vectors for x1 and x2, rather than a single x vector: I suspect these got replicated by R to match up with the g vector, which would have messed things up pretty badly. This stripped-down example:
g <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
ifelse(g,x1,x2)
## [1] 1 2 5 3 4
shows that x2 gets extended to (3 4 5 3 4) before being used in the ifelse clause. The scariest part is that normally one gets a warning such as this:
> x2 + 1:5
[1] 4 6 8 7 9
Warning message:
In x2 + 1:5 :
longer object length is not a multiple of shorter object length
but in this case there is no warning ...
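Given that diagnosis, the minimal change to the original call is probably just to build a single x vector of length 5 (and, as noted above, add a little noise to avoid the zero-residual issue) before calling nls; an untested sketch along those lines (the name q2 is mine):
x <- c(x1, x2)
set.seed(1001)
y <- c(y1, y2) + rnorm(5, sd = 0.01)   # small noise, as in the fit above
q2 <- nls(y ~ ifelse(g, m1, m2) * (x - x0) + y0,
          data = data.frame(x, y, g),
          start = c(m1 = -1, m2 = 1, y0 = 0, x0 = 2),
          algorithm = "port",
          lower = c(m1 = -Inf, m2 = -Inf, y0 = -Inf, x0 = 2),
          upper = c(m1 = Inf, m2 = Inf, y0 = Inf, x0 = 3))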
