I am trying to estimate a regression model on a data set with one continuous dependent variable (y) and three categorical independent variables (x1, x2, x3). For example, imagine y is the price you pay for a smartphone and x1, x2 and x3 are three features (say color, size and storage space).
My assumption is that each feature represents a multiplicative factor relative to an (unknown) baseline price. So if the baseline price for your phone is 100 a red color would increase this by 25%, a large size decrease it by 50% and high storage space increase by 75%. This means the final price of the phone would be 100 x (1+0.25) x (1-0.50) x (1+0.75) = 109.375.
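Written out, this assumption takes the following form. The symbols $b_i$ (unknown baseline price of phone $i$) and $a_{1}, a_{2}, a_{3}$ (relative effect of the observed level of each feature) are notation introduced here for illustration only:

```latex
y_i = b_i \,(1 + a_{1,x_{1i}})\,(1 + a_{2,x_{2i}})\,(1 + a_{3,x_{3i}})
\quad\Longrightarrow\quad
\log y_i = \log b_i + \log(1 + a_{1,x_{1i}}) + \log(1 + a_{2,x_{2i}}) + \log(1 + a_{3,x_{3i}})
```

On the log scale the model is additive, so the factors can only be identified relative to a chosen reference level of each feature. With the numbers above, $100 \times 1.25 \times 0.50 \times 1.75 = 109.375$.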
The problem is that I only know the final price (not the baseline price) and the individual features. How can I estimate the multiplicative factors that go along with these features? I have written a brief simulation in R below to illustrate this problem.
Thanks for your help with this,
Michael
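# Draw a three-level factor (levels "1", "2", "3") by cutting N uniform draws at their terciles.
# Relies on N being defined before the function is called.
x_fun <- function() {
  tmp1 <- runif(N)
  # include.lowest = TRUE keeps the minimum in level 1 instead of turning it into NA
  tmp2 <- cut(tmp1, quantile(tmp1, probs = c(0, 1/3, 2/3, 1)), include.lowest = TRUE)
  levels(tmp2) <- seq_along(levels(tmp2))
  tmp2
}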
N <- 1000
x1 <- x_fun()
x2 <- x_fun()
x3 <- x_fun()
f1 <- 1+0.25*(as.numeric(x1)-2)
f2 <- 1+0.50*(as.numeric(x2)-2)
f3 <- 1+0.75*(as.numeric(x3)-2)
y_Base <- runif(min=0, max=1000, N)
y <- y_Base*f1*f2*f3
output <- data.frame(y, x1, x2, x3)
rm(y_Base, f1, f2, f3, N, y, x_fun, x1, x2, x3)
I think you can do it like this if you know the base levels of your factors:
N <- 1000
set.seed(42)
x1 <- x_fun()
x2 <- x_fun()
x3 <- x_fun()
f1 <- 1+0.25*(as.numeric(x1)-2)
f2 <- 1+0.50*(as.numeric(x2)-2)
f3 <- 1+0.75*(as.numeric(x3)-2)
y_Base <- runif(min=0, max=1000, N)
y <- y_Base*f1*f2*f3
str(x1)
output <- data.frame(y, x1, x2, x3)
#rm(y_Base, f1, f2, f3, N, y, x_fun, x1, x2, x3)
output[, c("x1", "x2", "x3")] <- lapply(output[, c("x1", "x2", "x3")], relevel, ref = "2")
fit <- glm(y ~ x1 + x2 + x3, data = output, family = gaussian(link = "log"))
summary(fit)
predbase <- exp(log(output$y) - predict(fit, type = "link") + coef(fit)["(Intercept)"])
library(ggplot2)
ggplot(data.frame(x = y_Base, y = predbase, output[, c("x1", "x2", "x3")]),
aes(x = x, y = y)) +
geom_point() +
facet_wrap( ~ x1 + x2 + x3) +
geom_abline(slope = 1, color = "dark red")
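To read the multiplicative factors off such a fit, one option is to exponentiate the coefficients. The following is a minimal, self-contained sketch rather than part of the answer above: the helper `make_factor()` and the names `sim`, `fit_sketch` and `factors` are hypothetical, and the factor levels are drawn with `sample()` instead of `x_fun()`.

```r
# Sketch: simulate data like the question's, fit the log-link GLM,
# and exponentiate the coefficients to get multiplicative factors.
set.seed(42)
N <- 1000

# Hypothetical helper: a three-level factor with levels "1", "2", "3"
make_factor <- function(n) factor(sample(1:3, n, replace = TRUE))
x1 <- make_factor(N)
x2 <- make_factor(N)
x3 <- make_factor(N)

# Same multiplicative structure as in the question; level "2" has factor 1
f1 <- 1 + 0.25 * (as.numeric(x1) - 2)
f2 <- 1 + 0.50 * (as.numeric(x2) - 2)
f3 <- 1 + 0.75 * (as.numeric(x3) - 2)
y <- runif(N, min = 0, max = 1000) * f1 * f2 * f3

sim <- data.frame(y, x1, x2, x3)
# Make level "2" the reference so every factor is expressed relative to it
sim[c("x1", "x2", "x3")] <- lapply(sim[c("x1", "x2", "x3")], relevel, ref = "2")

fit_sketch <- glm(y ~ x1 + x2 + x3, data = sim, family = gaussian(link = "log"))

# exp(intercept) estimates the mean baseline price at the reference levels;
# exp(other coefficients) estimate the factors relative to level "2"
factors <- exp(coef(fit_sketch))
print(round(factors, 2))
```

Under these assumptions, `factors[["x11"]]` estimates the factor for level 1 of `x1` relative to level 2 (0.75 in the simulation), and `exp` of the intercept estimates the average baseline price, not the baseline of any individual phone.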
The data consist of four variables: id, two continuous variables x1 and x2 that are correlated with y, and y itself, a binary variable. The values 0 and 1 of the binary variable represent different states. Is it possible to use Markov chain models to calculate and plot the state transition probability along the gradient of covariate values for each id and subsequently for the pooled data?
set.seed(1)
id =rep(1, 100)
x1 = rnorm(100)
x2 = rnorm(100)
z = 1 + 2*x1 + 3*x2
pr = 1/(1+exp(-z))
y = rbinom(100,1,pr)
a<-data.frame(id,x1,x2, y)
set.seed(2)
id =rep(2, 100)
x1 = rnorm(100)
x2 = rnorm(100)
z = 1 + 2*x1 + 3*x2
pr = 1/(1+exp(-z))
y = rbinom(100,1,pr)
b<-data.frame(id,x1,x2, y)
set.seed(3)
id =rep(3, 100)
x1 = rnorm(100)
x2 = rnorm(100)
z = 1 + 2*x1 + 3*x2
pr = 1/(1+exp(-z))
y = rbinom(100,1,pr)
c<-data.frame(id,x1,x2, y)
d<-rbind(a,b,c)
I am trying to calculate the incremental variance explained by each variable in a multivariate multiple linear regression model, but I don't get sums of squares the way I do in an ordinary multiple linear regression. I'd like something like:
library(car)
#Create variables and adjusted the model
set.seed(123)
N <- 100
X1 <- rnorm(N, 175, 7)
X2 <- rnorm(N, 30, 8)
X3 <- abs(rnorm(N, 60, 30))
Y1 <- 0.2*X1 - 0.3*X2 - 0.4*X3 + 10 + rnorm(N, 0, 10)
Y2 <- -0.3*X2 + 0.2*X3 + rnorm(N, 10)
Y <- cbind(Y1, Y2)
dfRegr <- data.frame(X1, X2, X3, Y1, Y2)
(fit <- lm(cbind(Y1, Y2) ~ X1 + X2 + X3, data=dfRegr))
#How do we get the proportion now?
af <- Anova(fit)
afss <- af$"test stat"
print(cbind(af,PctExp=afss/sum(afss)*100))
#
Obviously this doesn't work. Is there some kind of approach for this?
This is my dataframe:
x1 <- c(1,2,3,4)
x2 <- c(3,4,5,6)
x3 <- c(5,6,7,8)
x4 <- c(7,8,9,10)
x5 <- c(8,7,6,5)
df <- c(x1,x2,x3,x4,x5)
I choose 3 variables from my dataframe to plot 3 separate scatterplots each against x1 and store these in a character vector:
varlist <- c("x2","x4","x5")
So I want to create a function to make 3 independent scatterplots of x1 with x2, x1 with x4 and x1 with x5, using ggplot, where xx and yy will be the different pairs of variables to plot:
ggplot(data = df) +
geom_point(mapping = aes(x = xx, y = yy)) +
geom_smooth(mapping = aes(x = xx, y = yy))
You could do:
mapply(function(y) print(ggplot(data = df) +
geom_point(aes_string(x = "x1", y = y)) +
geom_smooth(aes_string(x = "x1", y = y))), y=c("x2","x4","x5"))
Note: I used df <- data.frame(x1,x2,x3,x4,x5) instead of df <- c(x1,x2,x3,x4,x5), because ggplot() needs a data frame rather than a plain vector.
Here x is fixed to x1, and mapply loops over y, which holds the names of the variables we want plotted against x1.
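As a variation on the same idea, here is a sketch that uses lapply() and the .data pronoun instead of aes_string(), which is soft-deprecated in recent ggplot2 releases. The name `plots` and the choice of method = "lm" for the smoother (made because the example has only four points) are assumptions added here, not part of the answer above.

```r
library(ggplot2)

df <- data.frame(x1 = c(1, 2, 3, 4),
                 x2 = c(3, 4, 5, 6),
                 x3 = c(5, 6, 7, 8),
                 x4 = c(7, 8, 9, 10),
                 x5 = c(8, 7, 6, 5))
varlist <- c("x2", "x4", "x5")

# Build one plot per variable name; .data[[v]] looks the column up by its name
plots <- lapply(varlist, function(v) {
  ggplot(df, aes(x = x1, y = .data[[v]])) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) +
    labs(y = v)
})
names(plots) <- varlist

# Print them one after another, e.g. in an interactive session:
# for (p in plots) print(p)
```

Keeping the plots in a named list makes it possible to print, save or arrange them later instead of only drawing them as a side effect.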
I'm running a set of models with the same independent variables but different dependent variables, and I would like to create a set of coefficient plots in one figure in which each model gets its own panel. The following code gives the general idea, but it integrates all of the models into a single plot rather than producing 3 separate panels side by side in one figure:
require("coefplot")
set.seed(123)
dat <- data.frame(x = rnorm(100), z = rnorm(100), y1 = rnorm(100), y2 = rnorm(100), y3 = rnorm(100))
mod1 <- lm(y1 ~ x + z, data = dat)
mod2 <- lm(y2 ~ x + z, data = dat)
mod3 <- lm(y3 ~ x + z, data = dat)
multiplot(mod1,mod2, mod3)
Which generates this plot:
Any thoughts on how to get them to panel next to each other in one figure? Thanks!
I haven't used the coefplot package before, but you can create a coefficient plot directly in ggplot2.
library(ggplot2)
set.seed(123)
dat <- data.frame(x = rnorm(100), z = rnorm(100), y1 = rnorm(100), y2 = rnorm(100), y3 = rnorm(100))
mod1 <- lm(y1 ~ x + z, data = dat)
mod2 <- lm(y2 ~ x + z, data = dat)
mod3 <- lm(y3 ~ x + z, data = dat)
## Create data frame of model coefficients and standard errors
# Function to extract what we need
ce = function(model.obj) {
extract = summary(get(model.obj))$coefficients[ ,1:2]
return(data.frame(extract, vars=row.names(extract), model=model.obj))
}
# Run function on the three models and bind into single data frame
coefs = do.call(rbind, sapply(paste0("mod",1:3), ce, simplify=FALSE))
names(coefs)[2] = "se"
# Faceted coefficient plot
ggplot(coefs, aes(vars, Estimate)) +
geom_hline(yintercept=0, lty=2, lwd=1, colour="grey50") +
geom_errorbar(aes(ymin=Estimate - se, ymax=Estimate + se, colour=vars),
lwd=1, width=0) +
geom_point(size=3, aes(colour=vars)) +
facet_grid(. ~ model) +
coord_flip() +
guides(colour="none") +
labs(x="Coefficient", y="Value") +
theme_grey(base_size=15)
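The coefficient table can also be built from a named list of models, which avoids looking models up by name with get(). This is only a sketch of one alternative; `models`, `coef_table()` and `coefs_alt` are hypothetical names introduced here.

```r
set.seed(123)
dat <- data.frame(x = rnorm(100), z = rnorm(100),
                  y1 = rnorm(100), y2 = rnorm(100), y3 = rnorm(100))

# Keep the fitted models in a named list; the names become the panel labels
models <- list(mod1 = lm(y1 ~ x + z, data = dat),
               mod2 = lm(y2 ~ x + z, data = dat),
               mod3 = lm(y3 ~ x + z, data = dat))

# Extract estimates and standard errors for one model
coef_table <- function(model, label) {
  est <- summary(model)$coefficients[, 1:2, drop = FALSE]
  data.frame(vars = rownames(est),
             Estimate = est[, 1],
             se = est[, 2],
             model = label,
             row.names = NULL)
}

# One row per coefficient per model, stacked into a single data frame
coefs_alt <- do.call(rbind, Map(coef_table, models, names(models)))
print(coefs_alt)
```

The resulting data frame has the same columns the faceted plot above relies on (vars, Estimate, se, model), so it should be usable in place of coefs.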
Using ggplot(), I am trying to plot the results of an ANCOVA in which the slopes of the two linear components are equal, i.e., lm(y ~ x + A). The default behavior of geom_smooth(method = "lm") is to fit a separate slope and intercept for each level of each factor. For example, with two levels of A:
library(ggplot2)
set.seed(1234)
n <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y1 <- 2 * x1 + rnorm(n)
y2 <- 3 * x2 + (2 + rnorm(n))
A <- as.factor(rep(c(1, 2), each = n))
df <- data.frame(x = c(x1, x2), y = c(y1, y2), A = A)
p <- ggplot(df, aes(x = x, y = y, color = A))
p + geom_point() + geom_smooth(method = "lm")
I can fit the ANCOVA separately with lm() and then use geom_abline() to add the lines manually. This approach has a couple of drawbacks: the lines extend beyond the range of the data, and the colors have to be specified manually.
fm <- lm(y ~ x + A, data = df)
summary(fm)
a1 <- coef(fm)[1]
b <- coef(fm)[2]
a2 <- a1 + coef(fm)[3]
p + geom_point() +
geom_abline(intercept = a1, slope = b) +
geom_abline(intercept = a2, slope = b)
I know ancova() in the HH package automates the plotting, but I don't really care for lattice graphics. So I am looking for a ggplot()-centric solution.
library(HH)
ancova(y ~ x + A, data = df)
Is there a method to accomplish this using ggplot()? For this example, A has two levels, but I have situations with 3, 4, or more levels. The formula argument to geom_smooth() doesn't seem to have the answer (as far as I can tell).
For completeness, this works:
library(ggplot2)
set.seed(1234)
n <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y1 <- 2 * x1 + rnorm(n)
y2 <- 3 * x2 + (2 + rnorm(n))
A <- as.factor(rep(c(1, 2), each = n))
df <- data.frame(x = c(x1, x2), y = c(y1, y2), A = A)
fm <- lm(y ~ x + A, data = df)
p <- ggplot(data = cbind(df, pred = predict(fm)),
aes(x = x, y = y, color = A))
p + geom_point() + geom_line(aes(y = pred))
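The same idea should carry over to factors with more than two levels, since predict() returns one fitted value per row whatever the number of levels. A minimal sketch with three levels follows; the simulated data and the names `df3`, `offsets`, `fm3` and `p3` are made up for illustration.

```r
library(ggplot2)
set.seed(1234)
n <- 20

# Three groups sharing a common slope of 2 but with different intercepts
df3 <- data.frame(x = rnorm(3 * n),
                  A = factor(rep(c("a", "b", "c"), each = n)))
offsets <- c(a = 0, b = 2, c = 4)
df3$y <- 2 * df3$x + unname(offsets[as.character(df3$A)]) + rnorm(3 * n)

# Common-slope (ANCOVA) fit; the fitted values lie on parallel lines
fm3 <- lm(y ~ x + A, data = df3)
df3$pred <- predict(fm3)

# geom_line() connects the fitted values within each colour group,
# so each line spans only the x range of its own group
p3 <- ggplot(df3, aes(x = x, y = y, color = A)) +
  geom_point() +
  geom_line(aes(y = pred))
print(p3)
```

Because the lines are drawn from fitted values at the observed x, they stop at the range of the data for each group and pick up the group colors automatically.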