Covariate-dependent Markov models? Plot state transition probability along a gradient of covariate values in R

The data consist of four variables: id; x1 and x2, two continuous variables that are correlated with y; and y, a binary variable. The values 0 and 1 of the binary variable represent different states. Is it possible to use Markov chain models to calculate and plot the state transition probabilities along the gradient of covariate values for each id, and subsequently for the pooled data?
set.seed(1)
id =rep(1, 100)
x1 = rnorm(100)
x2 = rnorm(100)
z = 1 + 2*x1 + 3*x2
pr = 1/(1+exp(-z))
y = rbinom(100,1,pr)
a<-data.frame(id,x1,x2, y)
set.seed(2)
id =rep(2, 100)
x1 = rnorm(100)
x2 = rnorm(100)
z = 1 + 2*x1 + 3*x2
pr = 1/(1+exp(-z))
y = rbinom(100,1,pr)
b<-data.frame(id,x1,x2, y)
set.seed(3)
id =rep(3, 100)
x1 = rnorm(100)
x2 = rnorm(100)
z = 1 + 2*x1 + 3*x2
pr = 1/(1+exp(-z))
y = rbinom(100,1,pr)
c<-data.frame(id,x1,x2, y)
d<-rbind(a,b,c)
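One possible way to approach this, offered as a sketch rather than a definitive answer, is to treat y as a first-order Markov chain whose transition probabilities depend on the covariates through a logistic regression on the previous state. This is a modelling assumption, not something stated in the question. It also assumes the rows within each id are in time order. The helper make_id and the column y_prev are names introduced here for illustration.

```r
# Rebuild the question's data (same seeds and draw order as the original code)
make_id <- function(i, n = 100) {
  set.seed(i)
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  pr <- 1 / (1 + exp(-(1 + 2 * x1 + 3 * x2)))
  data.frame(id = i, x1 = x1, x2 = x2, y = rbinom(n, 1, pr))
}
d <- do.call(rbind, lapply(1:3, make_id))

# Previous state within each id (assumes rows are in time order)
d$y_prev <- ave(d$y, d$id, FUN = function(v) c(NA, head(v, -1)))

# Pooled model: P(y_t = 1 | y_{t-1}, x1, x2); rows with NA y_prev are dropped
fit <- glm(y ~ y_prev + x1 + x2, data = d, family = binomial)

# Predict along the x1 gradient, holding x2 at its mean, for both previous states
grid <- expand.grid(x1 = seq(min(d$x1), max(d$x1), length.out = 50),
                    x2 = mean(d$x2),
                    y_prev = 0:1)
grid$p_to_1 <- predict(fit, newdata = grid, type = "response")

# Solid: P(0 -> 1); dashed: P(1 -> 1). Complements give P(0 -> 0) and P(1 -> 0).
plot(p_to_1 ~ x1, data = grid[grid$y_prev == 0, ], type = "l", ylim = c(0, 1),
     ylab = "P(next state = 1)")
lines(p_to_1 ~ x1, data = grid[grid$y_prev == 1, ], lty = 2)
legend("topleft", legend = c("from state 0", "from state 1"), lty = 1:2)
```

For a per-id version, the same glm() call could be fitted with subset = id == 1 (and so on), or id could enter the pooled model as a factor. Because y in the simulated data does not actually depend on its previous value, the two curves should lie close together here.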

Related

How do I create a regression line with various variables in R

I have already written the regression code, but I can't figure out how to get the regression line and a predicted line onto a plot.
m1 <- lm(variable1 ~ 2 + 3 + 4 + 5 + 6 + 7 + 8, data = prog)
summary(m1)
I then want to create the plot on the basis of hyp.data, but I am still a bit lost.
Consider two (not 7!) predictor variables; one is numeric, the other categorical (i.e. a factor).
# Simulate data
set.seed(2017)
x1 <- 1:10
x2 <- as.factor(sample(c("treated", "not_treated"), 10, replace = TRUE))
df <- cbind.data.frame(
  y = 2 * x1 + as.numeric(x2) - 1 + rnorm(10),
  x1 = x1,
  x2 = x2)
In that case you can do the following:
# Fit the linear model
m1 <- lm(y ~ x1 + x2, data = df)
# Get predictions
df$pred <- predict(m1)
# Plot data
library(ggplot2)
ggplot(df, aes(x = x1, y = y)) +
  geom_point() +
  facet_wrap(~ x2, scales = "free") +
  geom_line(aes(x = x1, y = pred), col = "red")
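If the predicted line should be drawn over hypothetical predictor values rather than the observed rows, predict() can be given a newdata grid. The sketch below assumes a grid called new_df (a hypothetical name standing in for what the question calls hyp.data) and assigns x2 deterministically instead of sampling it, so that both levels are guaranteed to be present.

```r
# Simulated data in the spirit of the answer; x2 is alternated, not sampled
set.seed(2017)
x1 <- 1:10
x2 <- factor(rep(c("treated", "not_treated"), 5))
df <- data.frame(y = 2 * x1 + as.numeric(x2) - 1 + rnorm(10), x1 = x1, x2 = x2)

m1 <- lm(y ~ x1 + x2, data = df)

# Hypothetical grid: a finer x1 sequence crossed with every level of x2
new_df <- expand.grid(x1 = seq(1, 10, by = 0.5), x2 = levels(df$x2))
new_df$pred <- predict(m1, newdata = new_df)

# Observed points plus one fitted line per level (base graphics)
plot(y ~ x1, data = df, col = as.integer(df$x2), pch = 19)
for (lv in levels(df$x2)) {
  sub <- new_df[new_df$x2 == lv, ]
  lines(sub$x1, sub$pred, col = match(lv, levels(df$x2)))
}
```

The same new_df could be passed as the data argument of geom_line() in the ggplot2 code above.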

Multiplicative regression

I am trying to estimate a regression model on a data set with one continuous dependent variable (y) and three categorical independent variables (x1,x2,x3). For example imagine y is the price you pay for a smartphone and x are three features (say color, size and storage space).
My assumption is that each feature represents a multiplicative factor relative to an (unknown) baseline price. So if the baseline price for your phone is 100 a red color would increase this by 25%, a large size decrease it by 50% and high storage space increase by 75%. This means the final price of the phone would be 100 x (1+0.25) x (1-0.50) x (1+0.75) = 109.375.
The problem is that I only know the final price (not the baseline price) and the individual features. How can I estimate the multiplicative factors that go along with these features? I have written a brief simulation in R below to illustrate this problem.
Thanks for your help with this,
Michael
x_fun <- function() {
  tmp1 <- runif(N)
  tmp2 <- cut(tmp1, quantile(tmp1, probs=c(0, 1/3, 2/3, 3/3)))
  levels(tmp2) <- seq_along(levels(tmp2))
  tmp2[is.na(tmp2)] <- 1
  as.factor(tmp2)
}
N <- 1000
x1 <- x_fun()
x2 <- x_fun()
x3 <- x_fun()
f1 <- 1+0.25*(as.numeric(x1)-2)
f2 <- 1+0.50*(as.numeric(x2)-2)
f3 <- 1+0.75*(as.numeric(x3)-2)
y_Base <- runif(min=0, max=1000, N)
y <- y_Base*f1*f2*f3
output <- data.frame(y, x1, x2, x3)
rm(y_Base, f1, f2, f3, N, y, x_fun, x1, x2, x3)
I think you can do it like this if you know the base levels of your factors:
# Uses x_fun() as defined in the question
N <- 1000
set.seed(42)
x1 <- x_fun()
x2 <- x_fun()
x3 <- x_fun()
f1 <- 1+0.25*(as.numeric(x1)-2)
f2 <- 1+0.50*(as.numeric(x2)-2)
f3 <- 1+0.75*(as.numeric(x3)-2)
y_Base <- runif(min=0, max=1000, N)
y <- y_Base*f1*f2*f3
str(x1)
output <- data.frame(y, x1, x2, x3)
#rm(y_Base, f1, f2, f3, N, y, x_fun, x1, x2, x3)
output[, c("x1", "x2", "x3")] <- lapply(output[, c("x1", "x2", "x3")], relevel, ref = "2")
fit <- glm(y ~ x1 + x2 + x3, data = output, family = gaussian(link = "log"))
summary(fit)
predbase <- exp(log(output$y) - predict(fit, type = "link") + coef(fit)["(Intercept)"])
library(ggplot2)
ggplot(data.frame(x = y_Base, y = predbase, output[, c("x1", "x2", "x3")]),
       aes(x = x, y = y)) +
  geom_point() +
  facet_wrap( ~ x1 + x2 + x3) +
  geom_abline(slope = 1, color = "dark red")
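A related sketch, using a different technique from the log-link GLM above: since price = baseline x f1 x f2 x f3, taking logs gives log(price) = log(baseline) + log(f1) + log(f2) + log(f3), so an ordinary linear model on log(y) estimates the log multipliers relative to the reference level. The example below simulates its own factors with sample() rather than x_fun() and assumes level "2" is the baseline; the names dat, fit_log and mult are illustrative.

```r
set.seed(42)
N <- 1000

# Three-level factors; level "2" is the baseline with multiplier 1
x1 <- factor(sample(1:3, N, TRUE))
x2 <- factor(sample(1:3, N, TRUE))
x3 <- factor(sample(1:3, N, TRUE))
f1 <- 1 + 0.25 * (as.numeric(x1) - 2)   # 0.75, 1, 1.25
f2 <- 1 + 0.50 * (as.numeric(x2) - 2)   # 0.50, 1, 1.50
f3 <- 1 + 0.75 * (as.numeric(x3) - 2)   # 0.25, 1, 1.75
y  <- runif(N, min = 0, max = 1000) * f1 * f2 * f3

dat <- data.frame(y,
                  x1 = relevel(x1, ref = "2"),
                  x2 = relevel(x2, ref = "2"),
                  x3 = relevel(x3, ref = "2"))

# Additive model on the log scale; the intercept absorbs the mean log baseline
fit_log <- lm(log(y) ~ x1 + x2 + x3, data = dat)

# Back-transform the coefficients to multiplicative factors
mult <- exp(coef(fit_log)[-1])
round(mult, 2)
```

The two approaches target slightly different quantities (the GLM models the mean of y, the log-linear model the mean of log(y)), but both recover the multipliers when the baseline is independent of the features.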

R: GAM with fit on subset of data

I fit a Generalized Additive Model using gam from the mgcv package. I have a data table containing my dependent variable Y, an independent variable X, other independent variables Oth and a two-level factor Fac. I would like to fit the following model
Y ~ s(X) + Oth
BUT with the additional constraint that the s(X) term is fit only on one of the two levels of the factor, say Fac==1. The other terms Oth should be fit with the whole data.
I tried exploring s(X,by=Fac) but this biases the fit for Oth. In other words, I would like to express the belief that X relates to Y only if Fac==1, otherwise it does not make sense to model X.
Cheap trick: use an auxiliary variable that is X if Fac == 1 and 0 elsewhere.
library("mgcv")
library("ggplot2")
# simulate data
N <- 1e3
dat <- data.frame(covariate = runif(N),
                  predictor = runif(N),
                  group = factor(sample(0:1, N, TRUE)))
dat$outcome <- rnorm(N,
                     1 * dat$covariate +
                       ifelse(dat$group == 1,
                              .5 * dat$predictor +
                                1.5 * sin(dat$predictor * pi),
                              0), .1)
# some plots
ggplot(dat, aes(x = predictor, y = outcome,
                col = group, group = group)) +
  geom_point()
ggplot(dat, aes(x = covariate, y = outcome,
                col = group, group = group)) +
  geom_point()
# create auxiliary variable
dat$aux <- ifelse(dat$group == 1,
                  dat$predictor,
                  0)
# fit the data
fit1 <- gam(outcome ~ covariate + s(predictor, by = group),
            data = dat)
fit2 <- gam(outcome ~ covariate + s(aux, by = group),
            data = dat)
# compare fits
summary(fit1)
summary(fit2)
If I understand it right, you're thinking about some model with interaction like this:
Y ~ Oth + (Fac==1)*s(X)
If you want to "express the belief that X relates to Y only if Fac==1", don't treat Fac as a factor but as a numeric variable. In this case you get a numeric interaction and only one set of coefficients (when it is a factor there were two). This type of model is a varying coefficient model.
# some data
library(mgcv)
data <- data.frame(th = runif(100),
                   X = runif(100),
                   Y = runif(100),
                   Fac = sample(0:1, 100, TRUE))
data$Fac <- as.numeric(as.character(data$Fac)) # change to numeric
# then run model
gam(Y ~ s(X, by = Fac) + th, data = data)
See the description of the by argument in ?s.
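To see what the numeric by variable does, the following sketch simulates data in which X affects Y only when Fac is 1 and checks that the fitted model's predictions for Fac == 0 do not change with X. The simulated effect sizes and the names dat, nd0 and nd1 are assumptions made for illustration.

```r
library(mgcv)

set.seed(1)
N   <- 500
Fac <- sample(0:1, N, TRUE)            # numeric 0/1 indicator, not a factor
X   <- runif(N)
Oth <- runif(N)
Y   <- 1 * Oth + Fac * sin(pi * X) + rnorm(N, sd = 0.1)
dat <- data.frame(Y, X, Oth, Fac)

# With a numeric by variable the smooth is multiplied by Fac,
# so it contributes nothing to the linear predictor when Fac == 0
fit <- gam(Y ~ Oth + s(X, by = Fac), data = dat)

# Predictions across X at a fixed Oth, for each value of Fac
nd0 <- data.frame(X = seq(0, 1, length.out = 5), Oth = 0.5, Fac = 0)
nd1 <- transform(nd0, Fac = 1)
p0  <- predict(fit, newdata = nd0)     # constant in X
p1  <- predict(fit, newdata = nd1)     # follows the smooth of X
round(cbind(X = nd0$X, Fac0 = p0, Fac1 = p1), 2)
```

The Oth coefficient is still estimated from all rows, which is the behaviour the question asks for.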

R neuralnet does not converge within stepmax for time series

I'm writing a neural network for prediction of elements in a time series x + sin(x^2) in R, using the neuralnet package. This is how training data is being generated, assuming a window of 4 elements, and that the last one is the one that has to be predicted:
nntr0 <- ((1:25) + sin((1:25)^2))
nntr1 <- ((2:26) + sin((2:26)^2))
nntr2 <- ((3:27) + sin((3:27)^2))
nntr3 <- ((4:28) + sin((4:28)^2))
nntr4 <- ((5:29) + sin((5:29)^2))
Then, I turn these into a data.frame:
nntr <- data.frame(nntr0, nntr1, nntr2, nntr3, nntr4)
Then, I proceed to train the NN:
net.sinp <- neuralnet(nntr4 ~ nntr0 + nntr1 + nntr2 + nntr3, data=nntr, hidden=10, threshold=0.04, act.fct="tanh", linear.output=TRUE, stepmax=100000)
Which, after a while, gives me the message
Warning message:
algorithm did not converge in 1 of 1 repetition(s) within the stepmax
Call: neuralnet(formula = nntr4 ~ nntr0 + nntr1 + nntr2 + nntr3, data = nntr, hidden = 10, threshold = 0.04, stepmax = 100000, act.fct = "tanh", linear.output = TRUE)
Can anyone help me figure out why it is not converging? Many thanks
With tanh as an activation function (it is bounded),
it is very difficult to reproduce the linear trend in your signal.
You can use linear activation functions instead,
or try to detrend the signal.
# Data
dx <- 1
n <- 25
x <- seq(0,by=dx,length=n+4)
y <- x + sin(x^2)
y0 <- y[1:n]
y1 <- y[1 + 1:n]
y2 <- y[2 + 1:n]
y3 <- y[3 + 1:n]
y4 <- y[4 + 1:n]
d <- data.frame(y0, y1, y2, y3, y4)
library(neuralnet)
# Linear activation functions
r <- neuralnet(y4 ~ y0 + y1 + y2 + y3, data=d, hidden=10)
plot(y4, compute(r, d[,-5])$net.result)
# No trend
d2 <- data.frame(
  y0 = y0 - x[1:n],
  y1 = y1 - x[1 + 1:n],
  y2 = y2 - x[2 + 1:n],
  y3 = y3 - x[3 + 1:n],
  y4 = y4 - x[4 + 1:n]
)
r <- neuralnet(y4 ~ y0 + y1 + y2 + y3, data=d2, hidden=10, act.fct="tanh" )
plot(d2$y4, compute(r, d2[,-5])$net.result)
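The sliding-window data frames above can also be built with base R's embed(), which may be easier to adapt to other window sizes. This is only a sketch: it reuses the question's column names nntr0 to nntr4 and the range 1:29, and the object names win and nntr_detr are introduced here for illustration.

```r
x <- 1:29
y <- x + sin(x^2)

# embed() puts the newest value first, so reverse the columns to get oldest first
win  <- embed(y, 5)[, 5:1]
nntr <- as.data.frame(win)
names(nntr) <- paste0("nntr", 0:4)

# Detrended version: subtracting x leaves sin(x^2), which stays within [-1, 1],
# the same range that a tanh unit can output
nntr_detr <- as.data.frame(embed(y - x, 5)[, 5:1])
names(nntr_detr) <- paste0("nntr", 0:4)
range(unlist(nntr_detr))
```

Either data frame can then be passed to neuralnet() with the same formula as in the question.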
The warning message
algorithm did not converge in 1 of 1 repetition(s) within the stepmax
means that training reached the maximum number of steps before it converged. If you type ?neuralnet and look at the definition of stepmax, it says:
the maximum steps for the training of the neural network. Reaching this maximum leads to a stop of the neural network's training process.
For your problem, I recommend increasing stepmax to 1e7 and seeing what happens.
The code would be:
net.sinp <- neuralnet(nntr4 ~ nntr0 + nntr1 + nntr2 + nntr3, data=nntr, hidden=10, threshold=0.04, act.fct="tanh", linear.output=TRUE, stepmax=1e7)

Constraining slope in stat_smooth with ggplot (plotting ANCOVA)

Using ggplot(), I am trying to plot the results of an ANCOVA in which slopes of the two linear components are equal: i.e., lm(y ~ x + A). The default behavior for geom_smooth(method = "lm") is to plot separate slopes and intercepts for each level of each factor. For example, with two levels of A
library(ggplot2)
set.seed(1234)
n <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y1 <- 2 * x1 + rnorm(n)
y2 <- 3 * x2 + (2 + rnorm(n))
A <- as.factor(rep(c(1, 2), each = n))
df <- data.frame(x = c(x1, x2), y = c(y1, y2), A = A)
p <- ggplot(df, aes(x = x, y = y, color = A))
p + geom_point() + geom_smooth(method = "lm")
I can fit the ANCOVA separately with lm() and then use geom_abline() to add the lines manually. This approach has a couple of drawbacks: the lines extend beyond the range of the data, and the colors have to be specified manually.
fm <- lm(y ~ x + A, data = df)
summary(fm)
a1 <- coef(fm)[1]
b <- coef(fm)[2]
a2 <- a1 + coef(fm)[3]
p + geom_point() +
  geom_abline(intercept = a1, slope = b) +
  geom_abline(intercept = a2, slope = b)
I know ancova() in the HH package automates the plotting, but I don't really care for lattice graphics. So I am looking for a ggplot()-centric solution.
library(HH)
ancova(y ~ x + A, data = df)
Is there a method to accomplish this using ggplot()? For this example, A has two levels, but I have situations with 3, 4, or more levels. The formula argument to geom_smooth() doesn't seem to have the answer (as far as I can tell).
For completeness, this works:
library(ggplot2)
set.seed(1234)
n <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y1 <- 2 * x1 + rnorm(n)
y2 <- 3 * x2 + (2 + rnorm(n))
A <- as.factor(rep(c(1, 2), each = n))
df <- data.frame(x = c(x1, x2), y = c(y1, y2), A = A)
fm <- lm(y ~ x + A, data = df)
p <- ggplot(data = cbind(df, pred = predict(fm)),
            aes(x = x, y = y, color = A))
p + geom_point() + geom_line(aes(y = pred))
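The same predict() approach should carry over unchanged to factors with more than two levels, which the question asks about. The sketch below assumes three levels with made-up intercept shifts of 0, 2 and 4; the names df3, fm3 and p3 are illustrative.

```r
library(ggplot2)

set.seed(1234)
n   <- 20
df3 <- data.frame(
  x = rnorm(3 * n),
  A = factor(rep(c("a", "b", "c"), each = n))
)
# Common slope of 2, with a different intercept for each level of A
shift <- c(a = 0, b = 2, c = 4)
df3$y <- 2 * df3$x + unname(shift[as.character(df3$A)]) + rnorm(3 * n)

# ANCOVA with a common slope; fitted values give one parallel line per level
fm3 <- lm(y ~ x + A, data = df3)
df3$pred <- predict(fm3)

p3 <- ggplot(df3, aes(x = x, y = y, color = A)) +
  geom_point() +
  geom_line(aes(y = pred))
p3
```

Because the lines are drawn from fitted values at the observed x, each one spans only the range of its own group and inherits the group's color automatically.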
