Linear regression in R (normal and logarithmic data)

I want to carry out a linear regression in R for data in a normal and in a double-logarithmic plot.
For normal data the dataset might be the following:
lin <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
plot(lin$x, lin$y)
There I want to calculate and draw a line for the linear regression of only the data points 2, 3 and 4.
For double logarithmic data the dataset might be the following:
data <- data.frame(
  x = c(1:15),
  y = c(
    1.000, 0.742, 0.623, 0.550, 0.500, 0.462, 0.433,
    0.051, 0.043, 0.037, 0.032, 0.028, 0.025, 0.022, 0.020
  )
)
plot(data$x, data$y, log = "xy")
Here I want to draw the regression lines for the data points 1:7 and for 8:15.
How can I calculate the slope and the y-intercept as well as fit statistics (R², p-value)?
How is it done for normal and for logarithmic data?
Thanks for your help,
Sven

In R, linear least squares models are fitted via the lm() function. Using the formula interface we can use the subset argument to select the data points used to fit the actual model, for example:
lin <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
linm <- lm(y ~ x, data = lin, subset = 2:4)
giving:
R> linm
Call:
lm(formula = y ~ x, data = lin, subset = 2:4)
Coefficients:
(Intercept)            x
     -1.633        1.500
R> fitted(linm)
         2          3          4
-0.1333333  1.3666667  2.8666667
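The slope and intercept are the coefficients shown above; for R² and p-values, summary() on the fitted model has everything. A minimal sketch using the model just fitted:
coef(linm)                  # intercept and slope
summary(linm)               # full fit summary
summary(linm)$r.squared     # R^2
summary(linm)$coefficients  # estimates, standard errors, t- and p-values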
As for the double-log case, you have two choices: i) estimate two separate models as we did above, or ii) estimate a single model via ANCOVA. In either case the log transformation is done in the formula using log(). (Note that I rename your data frame to dat below, as data() is also a base R function.)
Via two separate models:
dat <- data
logm1 <- lm(log(y) ~ log(x), data = dat, subset = 1:7)
logm2 <- lm(log(y) ~ log(x), data = dat, subset = 8:15)
Or via ANCOVA, where we need an indicator variable:
dat <- transform(dat, ind = factor(1:15 <= 7))
logm3 <- lm(log(y) ~ log(x) * ind, data = dat)
You might ask whether these two approaches are equivalent. They are, and we can show this via the model coefficients.
R> coef(logm1)
  (Intercept)        log(x)
-0.0001487042 -0.4305802355
R> coef(logm2)
(Intercept)      log(x)
  0.1428293  -1.4966954
So the two slopes are -0.4306 and -1.4967 for the separate models. The coefficients for the ANCOVA model are:
R> coef(logm3)
   (Intercept)         log(x)        indTRUE log(x):indTRUE
     0.1428293     -1.4966954     -0.1429780      1.0661152
How do we reconcile the two? Well, the way I set up ind, logm3 is parametrised to give the logm2 values directly: the intercepts of logm2 and logm3 are the same, as are the coefficients for log(x). To get the values equivalent to the coefficients of logm1, we need a little manipulation. First store the coefficients, then reconstruct the intercept:
R> coefs <- coef(logm3)
R> coefs[1] + coefs[3]
(Intercept)
-0.0001487042
where the coefficient for indTRUE is the difference between the intercept of group 1 and that of group 2. And for the slope:
R> coefs[2] + coefs[4]
log(x)
-0.4305802
which is the same as we got for logm1 and is based on the slope for group 2 (coefs[2]) modified by the difference in slope for group 1 (coefs[4]).
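As a quick numerical check (a minimal sketch using coefs from above), we can confirm that the reconstructed values match logm1:
all.equal(unname(coef(logm1)),
          unname(c(coefs[1] + coefs[3], coefs[2] + coefs[4])))
## should return TRUE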
As for plotting, an easy way is via abline() for simple models. E.g. for the normal data example:
plot(y ~ x, data = lin)
abline(linm)
For the log data we might need to be a bit more creative, and the general solution here is to predict over the range of data and plot the predictions:
pdat <- with(dat, data.frame(x = seq(from = head(x, 1), to = tail(x, 1),
                                     by = 0.1)))
pdat <- transform(pdat, yhat = c(predict(logm1, pdat[1:70, , drop = FALSE]),
                                 predict(logm2, pdat[71:141, , drop = FALSE])))
We can plot this on the original scale by exponentiating yhat:
plot(y ~ x, data = dat)
lines(exp(yhat) ~ x, data = pdat, subset = 1:70, col = "red")
lines(exp(yhat) ~ x, data = pdat, subset = 71:141, col = "blue")
or on the log scale:
plot(log(y) ~ log(x), data = dat)
lines(yhat ~ log(x), data = pdat, subset = 1:70, col = "red")
lines(yhat ~ log(x), data = pdat, subset = 71:141, col = "blue")
This general solution works well for the more complex ANCOVA model too. Here I create a new pdat as before and add in an indicator:
pdat <- with(dat, data.frame(x = seq(from = head(x, 1), to = tail(x, 1),
                                     by = 0.1)[1:140],
                             ind = factor(rep(c(TRUE, FALSE), each = 70))))
pdat <- transform(pdat, yhat = predict(logm3, pdat))
Notice how we get all the predictions we want from the single call to predict() because of the use of ANCOVA to fit logm3. We can now plot as before:
plot(y ~ x, data = dat)
lines(exp(yhat) ~ x, data = pdat, subset = 1:70, col = "red")
lines(exp(yhat) ~ x, data = pdat, subset = 71:140, col = "blue")

#Split the data into two groups
data1 <- data[1:7, ]
data2 <- data[8:15, ]
#Perform the regression
model1 <- lm(log(y) ~ log(x), data1)
model2 <- lm(log(y) ~ log(x), data2)
summary(model1)
summary(model2)
#Plot it
with(data, plot(x, y, log="xy"))
lines(1:7, exp(predict(model1, data.frame(x = 1:7))))
lines(8:15, exp(predict(model2, data.frame(x = 8:15))))
In general, splitting the data into groups and running separate models on the subsets is unusual, and probably bad form. You may want to consider adding a grouping variable
data$group <- factor(rep(letters[1:2], times = 7:8))
and running some sort of model on the whole dataset, e.g.,
model_all <- lm(log(y) ~ log(x) * group, data)
summary(model_all)
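The per-group intercepts and slopes can then be read off the coefficients of the combined model; a minimal sketch (assuming the group variable defined above, with level "a" as the reference):
cf <- coef(model_all)
# group "a" (the reference level): intercept and slope directly
c(cf["(Intercept)"], cf["log(x)"])
# group "b": add the offset terms for groupb and the interaction
c(cf["(Intercept)"] + cf["groupb"], cf["log(x)"] + cf["log(x):groupb"])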

Related

How to perform a nonlinear regression of a complex function that has a summation using R?

I have the following function:
y = A * (1 - (6/π²) * Σ_{n=1}^{∞} (1/n²) * exp(-B·n²·π²·x / R²))
Here the parameter R is a constant with a value of 22.5. I want to estimate the parameters A and B using nonlinear regression (the nls() function). I made a few attempts, but all were unsuccessful. I'm not very familiar with this type of operation in R, so I would like your help.
Additionally, if possible, I would also like to plot this function using ggplot2.
# Initial data
x <- c(0, 60, 90, 120, 180, 240)
y <- c(0, 0.967676, 1.290101, 1.327099, 1.272404, 1.354246)
R <- 22.5
df <- data.frame(x, y)
f <- function(x) (1/(n^2))*exp((-B*(n^2)*(pi^2)*x)/(R^2))
# First try
nls(formula = y ~ A*(1-(6/(pi^2))*sum(f, seq(1, Inf, 1))),
    data = df,
    start = list(A = 1, B = 0.7))
Error in seq.default(1, Inf, 1) : 'to' must be a finite number
# Second try
nls(formula = y ~ A*(1-(6/(pi^2))*integrate(f, 1, Inf)),
    data = df,
    start = list(A = 1, B = 0.7))
Error in f(x, ...) : object 'n' not found
You can use a finite sum approximation. Using 25 terms:
f <- function(x, B, n = 1:25) sum((1/(n^2))*exp((-B*(n^2)*(pi^2)*x)/(R^2)))
fm <- nls(formula = y ~ cbind(A = (1-(6/pi^2)) * Vectorize(f)(x, B)),
          data = df,
          start = list(B = 0.7),
          alg = "plinear")
fm
giving:
Nonlinear regression model
  model: y ~ cbind(A = (1 - (6/pi^2)) * Vectorize(f)(x, B))
   data: df
       B   .lin.A
-0.00169  1.39214
residual sum-of-squares: 1.054
Number of iterations to convergence: 12
Achieved convergence tolerance: 9.314e-06
The model does not seem to fit the data very well (solid line in graph below); however, a logistic model seems to work well (dashed line).
fm2 <- nls(y ~ SSlogis(x, Asym, xmid, scal), df)
plot(y ~ x, df)
lines(fitted(fm) ~ x, df)
lines(fitted(fm2) ~ x, df, lty = 2)
legend("bottomright", c("fm", "fm2"), lty = 1:2)

R - Predicted variables not included in linear regression graph

Here's the relevant code snippet. How do I get the predicted variables to display in the plot?
df <- data.frame(X = 2010:2022,
                 Y = c(11539282, 11543332, 11546969, 11567845, 11593741,
                       11606027, 11622554, 11658609, rep(NA, 5)))
model.1 <- lm(formula = Y ~ X, data = df)
predict(object = model.1, newdata = df)
plot(X, Y, ylim=c(11500000,11750000))
lines(sort(X), fitted(model.1)[order(X)])
Make these changes:
when creating the model use na.action = na.exclude
use the formula methods for plot and lines
use fitted(model.2) as the predicted values
no sorting is needed as X is already sorted
giving this code:
model.2 <- lm(Y ~ X, df, na.action = na.exclude)
plot(Y ~ X, df)
lines(fitted(model.2) ~ X, df)
or use abline in which case this shorter code can be used:
model.3 <- lm(Y ~ X, df)
plot(Y ~ X, df)
abline(model.3)
In either case we get the same plot of the points with the fitted line.
Added
Based on clarification in the comments we could do this (or if you want an even wider range try ylim = extendrange(pred, f = .10) to extend the range by 10%, say, on either side).
pred <- predict(model.3, df)
plot(Y ~ X, df, ylim = range(pred))
lines(pred ~ X, df)
giving a plot whose y-axis now covers the full range of the predictions.
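And the extendrange() variant mentioned above would be:
plot(Y ~ X, df, ylim = extendrange(pred, f = 0.10))
lines(pred ~ X, df)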

R: GAM with fit on subset of data

I fit a Generalized Additive Model using gam from the mgcv package. I have a data table containing my dependent variable Y, an independent variable X, other independent variables Oth and a two-level factor Fac. I would like to fit the following model
Y ~ s(X) + Oth
BUT with the additional constraint that the s(X) term is fit only on one of the two levels of the factor, say Fac==1. The other terms Oth should be fit with the whole data.
I tried exploring s(X,by=Fac) but this biases the fit for Oth. In other words, I would like to express the belief that X relates to Y only if Fac==1, otherwise it does not make sense to model X.
Cheap trick: use an auxiliary variable that is X if Fac == 1 and 0 elsewhere.
library("mgcv")
library("ggplot2")
# simulate data
N <- 1e3
dat <- data.frame(covariate = runif(N),
                  predictor = runif(N),
                  group = factor(sample(0:1, N, TRUE)))
dat$outcome <- rnorm(N,
                     1 * dat$covariate +
                       ifelse(dat$group == 1,
                              .5 * dat$predictor +
                                1.5 * sin(dat$predictor * pi),
                              0),
                     .1)
# some plots
ggplot(dat, aes(x = predictor, y = outcome,
                col = group, group = group)) +
  geom_point()
ggplot(dat, aes(x = covariate, y = outcome,
                col = group, group = group)) +
  geom_point()
# create auxiliary variable
dat$aux <- ifelse(dat$group == 1, dat$predictor, 0)
# fit the data
fit1 <- gam(outcome ~ covariate + s(predictor, by = group),
            data = dat)
fit2 <- gam(outcome ~ covariate + s(aux, by = group),
            data = dat)
# compare fits
summary(fit1)
summary(fit2)
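As one way to quantify the comparison (a minimal sketch), the two fits can also be compared by AIC:
AIC(fit1, fit2)  # lower AIC indicates the better-supported fit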
If I understand it right, you're thinking about a model with an interaction like this:
Y ~ Oth + (Fac == 1) * s(X)
If you want to "express the belief that X relates to Y only if Fac==1", don't treat Fac as a factor but as a numeric variable. In this case you will get a numeric interaction and only one set of coefficients (when it's a factor there were two). This type of model is a varying-coefficient model.
library("mgcv")
# some data
data <- data.frame(th = runif(100),
                   X = runif(100),
                   Y = runif(100),
                   Fac = sample(0:1, 100, TRUE))
data$Fac <- as.numeric(as.character(data$Fac))  # change to numeric
# then run model
gam(Y ~ s(X, by = Fac) + th, data = data)
See the documentation for the by argument in ?s.

Plot the observed and fitted values from a linear regression using xyplot() from the lattice package

I can create simple graphs. I would like to have observed and predicted values (from a linear regression) on the same graph. I am plotting, say, Yvariable vs Xvariable. There is only one predictor and only one response. How could I also add the linear regression line to the same graph?
So to conclude, I need help with:
plotting actuals and predicted both
plotting regression line
Here is one option for showing the observed and predicted values in a single plot as points. It is easier to get the regression line on just the observed points, which I illustrate second.
First some dummy data
set.seed(1)
x <- runif(50)
y <- 2.5 + (3 * x) + rnorm(50, mean = 2.5, sd = 2)
dat <- data.frame(x = x, y = y)
Fit our model
mod <- lm(y ~ x, data = dat)
Combine the model output and the observed x into a single object for plotting
res <- stack(data.frame(Observed = dat$y, Predicted = fitted(mod)))
res <- cbind(res, x = rep(dat$x, 2))
head(res)
Load lattice and plot
require("lattice")
xyplot(values ~ x, data = res, group = ind, auto.key = TRUE)
The resulting plot should look similar to this
To get just the regression line on the observed data, when the regression model is a simple straight-line model like the one shown here, you can circumvent most of this and plot directly using
xyplot(y ~ x, data = dat, type = c("p","r"), col.line = "red")
(i.e. you don't even need to fit the model or make new data for plotting)
The resulting plot should look like this
An alternative to the first example which can be used with anything that will give coefficients for the regression line is to write your own panel functions - not as scary as it seems
xyplot(y ~ x, data = dat, col.line = "red",
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)
         panel.abline(coef = coef(mod), ...)  ## using mod from earlier
       })
That gives the plot from Figure 2 above, but drawn by hand.
If you have done this with caret instead, then
library("caret")
mod <- train(y ~ x, data = dat, method = "lm",
             trControl = trainControl(method = "cv"))
xyplot(y ~ x, data = dat, col.line = "red",
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)
         panel.abline(coef = coef(mod$finalModel), ...)  ## using mod from caret
       })
will produce the same plot as Figure 2 above.
Another option is to use panel.lmlineq from latticeExtra.
library(latticeExtra)
set.seed(0)
xsim <- rnorm(50, mean = 3)
ysim <- (0 + 2 * xsim) * (1 + rnorm(50, sd = 0.3))
## basic use as a panel function
xyplot(ysim ~ xsim, panel = function(x, y, ...) {
  panel.xyplot(x, y, ...)
  panel.lmlineq(x, y, adj = c(1, 0), lty = 1, col.text = 'red',
                col.line = "blue", digits = 1, r.squared = TRUE)
})

R: predict.glm equivalent for MCMCpack::MCMClogit

I am running a Bayesian logit with MCMCpack::MCMClogit. The syntax is easy and follows lm() or glm(), but I can't find an equivalent of the predict.glm function. Is there a way of predicting the probabilities of the outcomes in MCMClogit for each unit of observation in the dataframe? predict() is especially useful for validating a model built on training data against new data, which is what I ultimately have to do.
df = read.csv("http://dl.dropbox.com/u/1791181/MCMC.csv")#Read in data
model.glm = glm(SECONDARY.LEVEL ~ AGE + SEX, data=df, family=binomial(link=logit))
glm.predict = predict(model.glm, type="response")
For MCMClogit():
model.mcmc = MCMClogit(SECONDARY.LEVEL ~ AGE + SEX, data=df, mcmc=1000)
You could use the posterior distribution of model parameters produced by MCMC to get a distribution of predictions, using the logistic function.
For instance, if your model formula is y ~ x1 + x2 + x3, and your MCMC output is stored in the variable posterior.mcmc, then you could use
function(x1, x2, x3) 1 / (1 + exp(-posterior.mcmc %*% rbind(1, x1, x2, x3)))
to give the distribution analogous to predict.glm(., 'response')
More detailed example for the case of a single input variable:
library(extraDistr)
library(MCMCpack)
# Take x uniformly distributed between -100 and 100
x <- runif(2000, min=-100, max=100)
# Generate a response which is logistic with some noise
beta <- 1/8
eps <- rnorm(length(x), 0, 1)
p <- function(x, eps) 1 / (1 + exp(-beta*x + eps))
p.x <- p(x, eps)
y <- sapply(p.x, function(p) rbern(1, p))
df1 <- data.frame(x, y)
# Fit by logistic regression
glm.logistic <- glm(y ~ x, df1, family=binomial)
# MCMC gives a distribution of values for the model parameters
posterior.mcmc <- MCMClogit(y ~ x, df1, verbose=2000)
densplot(posterior.mcmc)
# Thus, we have a distribution of model predictions for each x
predict.p.mcmc <- function(x) 1 / (1 + exp(-posterior.mcmc %*% rbind(1,x)))
interval.p.mcmc <- function(x, low, high)
  apply(predict.p.mcmc(x), 2, function(x) quantile(x, c(low, high)))
predict.y.mcmc <- function(x) posterior.mcmc %*% rbind(1, x)
interval.y.mcmc <- function(x, low, high)
  apply(predict.y.mcmc(x), 2, function(x) quantile(x, c(low, high)))
## Plot the data and fits ##
plot(x, p.x, ylab = 'Pr(y=1)', pch = 20, cex = 0.5, main = 'Probability vs x')
# x-values for prediction
x_test <- seq(-100, 100, 0.01)
# Blue line is the logistic function we used to generate the data, with noise removed
p_of_x_test <- p(x_test, 0)
lines(x_test, p_of_x_test, col = 'blue')
# Green line is the prediction from logistic regression
lines(x_test, predict(glm.logistic, data.frame(x = x_test), 'response'), col = 'green')
# Red lines indicates the range of model predictions from MCMC
# (for each x, 95% of the distribution of model predictions lies between these bounds)
interval.p.mcmc_95 <- interval.p.mcmc(x_test, 0.025, 0.975)
lines(x_test, interval.p.mcmc_95[1,], col = 'red')
lines(x_test, interval.p.mcmc_95[2,], col = 'red')
# Similarly for the log-odds
plot(x, log(p.x/(1 - p.x)), ylab = 'log[Pr(y=1) / (1 - Pr(y=1))]',
pch = 20, cex = 0.5, main = 'Log-Odds vs x')
lines(x_test, log(p_of_x_test/(1 - p_of_x_test)), col = 'blue')
lines(x_test, predict(glm.logistic, data.frame(x = x_test)), col = 'green')
interval.y.mcmc_95 <- interval.y.mcmc(x_test, 0.025, 0.975)
lines(x_test, interval.y.mcmc_95[1,], col = 'red')
lines(x_test, interval.y.mcmc_95[2,], col = 'red')
The description of the function says:
This function generates a sample from the posterior distribution of a logistic regression model using a random walk Metropolis algorithm.
I therefore think that your model.mcmc already contains the posterior draws that MCMClogit() has simulated.
You can use str() to see what it contains, and use the summary() and plot() methods on it, as in the examples in the package manual: http://cran.r-project.org/web/packages/MCMCpack/MCMCpack.pdf
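For instance (a minimal sketch, assuming model.mcmc and df from above), you can inspect the draws and turn them into posterior-mean predicted probabilities:
str(model.mcmc)      # an mcmc object: a matrix of posterior draws
summary(model.mcmc)  # posterior means, SDs and quantiles per coefficient
plot(model.mcmc)     # trace and density plots
# posterior-mean predicted probabilities, analogous to predict.glm(type = "response")
X <- model.matrix(SECONDARY.LEVEL ~ AGE + SEX, data = df)
p.draws <- plogis(X %*% t(as.matrix(model.mcmc)))  # one column per posterior draw
mcmc.predict <- rowMeans(p.draws)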
