I have a few datapoints (x and y) that seem to have a logarithmic relationship.
> mydata
    x   y
1   0 123
2   2 116
3   4 113
4  15 100
5  48  87
6  75  84
7 122  77
> qplot(x, y, data=mydata, geom="line")
Now I would like to find an underlying function that fits the graph and lets me infer y at other x values (e.g. x = 3 or x = 82). I read about lm and nls, but I'm not really getting anywhere.
At first, I created a function that I thought resembled the plot most closely:
f <- function(x, a, b) {
  a * exp(-b * x)
}
x <- seq(0, 100)  # note: seq(0:100) would actually give 1:101, not 0:100
y <- f(x, 1, 1)
qplot(x,y, geom="line")
Afterwards, I tried to generate a fitting model using nls:
> fit <- nls(y ~ f(x, a, b), data=mydata, start=list(a=1, b=1))
Error in numericDeriv(form[[3]], names(ind), env) :
Missing value or an Infinity produced when evaluating the model
Can someone point me in the right direction on what to do from here?
Follow-up
After reading your comments and googling around a bit further, I adjusted the starting parameters for a, b and c, and suddenly the model converged (note that f was redefined to take a third parameter c; the new definition isn't shown).
fit <- nls(y~f(x,a,b,c), data=data.frame(mydata), start=list(a=1, b=30, c=-0.3))
x <- seq(0,120)
fitted.data <- data.frame(x=x, y=predict(fit, list(x=x)))
ggplot(mydata, aes(x, y)) + geom_point(color="red", alpha=.5) + geom_line(alpha=.5) + geom_line(data=fitted.data)
Maybe using a cubic specification for your model and estimating via lm would give you a good fit.
# Importing your data
dataset <- read.table(text='
x y
1 0 123
2 2 116
3 4 113
4 15 100
5 48 87
6 75 84
7 122 77', header=T)
# I think one possible specification would be a cubic linear model
y.hat <- predict(lm(y~x+I(x^2)+I(x^3), data=dataset)) # estimating the model and obtaining the fitted values from the model
qplot(x, y, data=dataset, geom="line") # your plot (black line)
last_plot() + geom_line(aes(x=x, y=y.hat), col=2) # the fitted values (red line)
# It fits well.
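To infer values at points not in the data (the question mentions x = 3 and x = 82), predict() on the fitted lm object works; a minimal sketch (keep in mind that polynomial fits can behave badly outside the observed x range):
cubic.fit <- lm(y ~ x + I(x^2) + I(x^3), data=dataset)  # same model as above, saved for reuse
predict(cubic.fit, newdata=data.frame(x=c(3, 82)))       # predicted y at x = 3 and x = 82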
Try taking the log of your response variable and then using lm to fit a linear model:
fit <- lm(log(y) ~ x, data=mydata)
The adjusted R-squared is 0.8486, which at face value isn't bad. You can look at the fit using plot, for example:
plot(fit, which=2)
But perhaps it's not such a good fit after all:
last_plot() + geom_line(aes(x=x, y=exp(fit$fitted.values)))
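If you do go this route despite the questionable fit, predictions at new x values (e.g. the x = 3 or x = 82 mentioned in the question) can be obtained by back-transforming with exp(); a minimal sketch, ignoring retransformation bias:
exp(predict(fit, newdata=data.frame(x=c(3, 82))))  # predicted y on the original scale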
Check this document out: http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
In brief, first you need to decide on the model to fit onto your data (e.g., exponential) and then estimate its parameters.
Here are some widely used distributions:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
Related
I have a tibble with numerical and logical variables, e.g. like this:
       x     f      y
   <dbl> <int>  <dbl>
 1 -2        1 -0.801
 2 -1.96     0 -2.27
 3 -1.92     0 -1.75
 4 -1.88     0 -2.44
 5 -1.84     1 -0.123
...
For reproducibility, it can be generated using:
library(tidyverse)
set.seed(0)
tb1 = tibble(
  x = (-50:50)/25,
  p = plogis(x),
  f = rbinom(p, 1, p),
  y = x + f + rnorm(x, 0, .5)
) %>% select(-p)
I'd like to plot the points and draw regression lines, once taking x as the predictor and f as the outcome (logistic regression), and once taking x and f as predictors and y as the outcome (linear regression). This works well for the logistic regression.
ggplot(tb1, aes(x, f)) +
  geom_point() +
  geom_smooth(method="glm", method.args=list(family="binomial"))
produces:
but:
ggplot(tb1, aes(x, y, colour=f)) +
  geom_point() +
  geom_smooth(method="lm")
produces:
which is wrong. I want f treated as a factor, producing two regression lines and a discrete legend instead of the continuous-coloured one. I can force f manually to a logical value:
tb2 = tb1 %>% mutate(f = f>0)
and obtain the correct linear regression graph:
but now I cannot plot the logistic regression. I get the
Warning message:
Computation failed in stat_smooth():
y values must be 0 <= y <= 1
For some reason, both lm() and glm() have no problems:
summary(glm(f ~ x, binomial, tb1))
summary(lm(y ~ x + f, tb1))
summary(glm(f ~ x, binomial, tb2))
summary(lm(y ~ x + f, tb2))
all produce reasonable results, and the results are identical for tb1 and tb2, as they should be. So is there a way of convincing geom_smooth to accept logical variables, or must I keep two redundant variables with identical values but different types, e.g. f.int and f.lgl?
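Not an answer from the original thread, but one commonly suggested workaround is to coerce f only inside the layers that need a discrete variable, so the underlying column itself can stay numeric; a sketch:
# linear regression: treat f as discrete just for grouping/colouring
ggplot(tb1, aes(x, y, colour=factor(f))) +
  geom_point() +
  geom_smooth(method="lm")
# logistic regression: use the numeric 0/1 column directly
ggplot(tb1, aes(x, f)) +
  geom_point() +
  geom_smooth(method="glm", method.args=list(family="binomial"))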
I am trying to fit data to a non-linear model, but I am getting a "singular gradient" message when I build the model.
Here is the data:
> astrodata
temperature intensity
1 277.15 121
2 282.15 131
3 287.15 153
4 292.15 202
5 297.15 311
The function:
y = a * exp(-b * temperature) + c
What I did so far:
> temperature <- astrodata$temperature
temperature
[1] 277.15 282.15 287.15 292.15 297.15
> intensity <- astrodata$intensity
> c.0 <- min(temperature)*0.5
> c.0 <- min(intensity)*0.5
> model.0 <- lm(log(intensity - c.0) ~ temperature, data=astrodata)
> start <- list(a=exp(coef(model.0)[1]), b=coef(model.0)[2], c=c.0)
>
> model <- nls(intensity ~ a * exp(-b * temperature) + c, data = astrodata, start = start)
Error in nls(intensity ~ a * exp(b * temperature) + c, data = astrodata, :
singular gradient
Does anybody have an idea how to solve this?
The model is linear in a and c and only nonlinear in b. That suggests we try the "plinear" algorithm. It has the advantage that only the non-linear parameters require starting values.
Note that the formula specification for that algorithm is different: its right-hand side is a matrix with one column per linear parameter.
model <- nls(intensity ~ cbind(exp(-b * temperature), 1), data = astrodata,
             start = start["b"], algorithm = "plinear")
giving:
> model
Nonlinear regression model
model: intensity ~ cbind(exp(-b * temperature), 1)
data: astrodata
b .lin1 .lin2
-1.598e-01 4.728e-19 1.129e+02
residual sum-of-squares: 0.003853
Number of iterations to convergence: 5
Achieved convergence tolerance: 2.594e-07
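If you want the coefficients under the original names, they map onto the plinear output via the column order of the cbind() above (.lin1 multiplies exp(-b * temperature), .lin2 is the additive constant); a small sketch of that mapping:
co <- coef(model)
a <- co[[".lin1"]]  # coefficient on exp(-b * temperature)
b <- co[["b"]]
c <- co[[".lin2"]]  # additive constant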
Also:
plot(intensity ~ temperature, astrodata)
lines(fitted(model) ~ temperature, astrodata)
Note: based on the comment below, you don't really need an nls model; it may be good enough to just use geom_line
p <- ggplot(astrodata, aes(temperature, intensity)) + geom_point()
p + geom_line()
or splines:
p + geom_line(data = data.frame(spline(temperature, intensity)), aes(x, y))
Your data isn't varied enough.
nls uses least squares to work. This is a measure of the distance between the model and the data points. If there is no distance, nls doesn't work. Your model fits the data exactly; this is called "zero-residual" data. Hence
singular gradient matrix at initial parameter estimates.
It's an overly complicated error message that simply means "There is no error to measure."
You only have 5 (x,y) combos, so this error is almost guaranteed using non-linear analysis with so little data. Use different data or more data.
One possibility is to double each data point, adding very tiny variations to the doubled data like so:
temperature intensity
1 277.15 121
2 282.15 131
3 287.15 153
4 292.15 202
5 297.15 311
11 277.15000001 121.000001
12 282.15000001 131.000001
13 287.15000001 153.000001
14 292.15000001 202.000001
15 297.15000001 311.000001
In the original data set each point effectively has a weight of 1.0, and in the "doubled" data set each point effectively has a weight of 2.0, so you get the same fitted parameter values but no error.
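A minimal sketch of constructing such a doubled data set (the perturbation sizes below are only illustrative):
# duplicate each row and nudge the copies so the fit is no longer zero-residual
astrodata2 <- rbind(
  astrodata,
  transform(astrodata,
            temperature = temperature + 1e-8,
            intensity   = intensity + 1e-6)
)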
I'm working on a model to predict the probability that college baseball players will make the major leagues. My dataset has 633 observations and 13 predictors with a binary response. The code below generates smaller reproducible examples of training and testing datasets:
set.seed(1)
OBP <- rnorm(50, mean=1, sd=.2)
HR.PCT <- rnorm(50, mean=1, sd=.2)
AGE <- rnorm(50, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=50, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
df.train <- data.frame(OBP, HR.PCT, AGE, CONF)
df.train <- df.train[order(-OBP),]
df.train$MADE.MAJORS <- 0
df.train$MADE.MAJORS[1:10] <- 1
OBP <- rnorm(10, mean=1, sd=.2)
HR.PCT <- rnorm(10, mean=1, sd=.2)
AGE <- rnorm(10, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=10, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
MADE.MAJORS <- sample(0:1, size=10, replace=TRUE, prob=c(0.8,0.2))
df.test <- data.frame(OBP, HR.PCT, AGE, CONF, MADE.MAJORS)
I then used glmnet to perform the lasso with logistic regression and generate predictions. I want the predictions to be in the form of probabilities (that is, between 0 and 1).
library(glmnet)
train.mtx <- with(df.train, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
glmmod <- glmnet(x=train.mtx, y=as.factor(df.train$MADE.MAJORS), alpha=1, family="binomial")
cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1)
test.mtx <- with(df.test, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
preds <- predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="response")
cv.preds <- predict.cv.glmnet(object=cv.glmmod, newx=test.mtx, s="lambda.min")
Here are the predictions:
> preds
1
1 -3.2589440
2 -0.4435265
3 3.9646670
4 0.3772816
5 0.9952887
6 -7.3555661
7 0.2283675
8 -2.3871317
9 -8.1632749
10 -1.3563051
> cv.preds
1
1 0.1568839
2 0.3630938
3 0.7435941
4 0.4808428
5 0.5261076
6 -0.1431655
7 0.4123054
8 0.2207381
9 -0.1446941
10 0.2962391
I have a few questions about these results. Feel free to answer any or all (or none) of them. I'm most interested in an answer for the first question.
Why are the predictions from predict.glmnet (the preds vector) not in the form of probabilities? I put the preds values through the inverse logit function and got reasonable probabilities. Was that correct?
The predictions from predict.cv.glmnet (the cv.preds vector) mostly look like probabilities, but some of them are negative. Why is this?
When I use the glmnet function to create the glmmod object, I include the family="binomial" argument to indicate that I'm using logistic regression. However, when I use the cv.glmnet function to find the best value for lambda, I'm not able to specify logistic regression. Am I actually getting the best value for lambda if the cross-validation doesn't use logistic regression?
Similarly, when I use the predict.cv.glmnet function, I'm not able to specify logistic regression. Does this function produce the predictions that I want?
I am not 100% sure about the following, because the package does seem to operate counter to its documentation (as you've noticed), but it may at least give some indication of whether your thinking is on the right path.
Question 1
Yes, you're right. Note that,
> predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="link")
1
1 -3.2589440
2 -0.4435265
3 3.9646670
4 0.3772816
5 0.9952887
6 -7.3555661
7 0.2283675
8 -2.3871317
9 -8.1632749
10 -1.3563051
which is the same output as with type="response". Thus, putting it through the inverse logit function is the right way to get the probabilities. As to why this is happening, I have no clue; perhaps a bug.
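A minimal sketch of that back-transformation (plogis() is base R's inverse logit):
probs <- plogis(preds)  # equivalently 1 / (1 + exp(-preds))
probs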
Questions 2-4
For cv.preds, you're getting something along the lines of probabilities because you're fitting a Gaussian model (the cv.glmnet default). In order to fit with a logit link, you should specify the family argument. Namely:
cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1, family="binomial")
> cv.preds
1
1 -10.873290
2 1.299113
3 15.812671
4 3.622259
5 5.621857
6 -24.826551
7 1.734000
8 -5.420878
9 -26.160403
10 -4.496020
In this case, cv.preds will be on the real line (the link scale), and you can put those values through the inverse logit to get the probabilities.
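Alternatively (this is my reading of the glmnet interface, not something shown in the original answer), once the binomial family is used you should be able to ask predict() for the response scale and get probabilities directly:
cv.preds.prob <- predict(cv.glmmod, newx=test.mtx, s="lambda.min", type="response")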
I have the following set of data, which when plotted shows a curvilinear relationship:
Fish.species.richness Habitat.Complexity log.habitat
17 0.6376 -0.1954858
13 0.2335 -0.6317131
30 0.2866 -0.5427238
20 0.3231 -0.4906630
22 0.1073 -0.9694003
25 0.2818 -0.5500590
2 0.2182 -0.6612448
4 0.0189 -1.7246886
19 0.2960 -0.5287083
25 0.5507 -0.2590849
29 0.2689 -0.5704900
21 0.6286 -0.2016602
18 0.1557 -0.8078509
24 0.6851 -0.1642460
30 0.5059 -0.2959353
32 0.4434 -0.3532043
29 0.3585 -0.4455108
32 0.5920 -0.2276783
When I log the x axis and do a linear regression to find the intercept and slope, I am able to add a line that fits the data:
summary(lm(Fish.species.richness~log.habitat,data=three))
plot(three$log.habitat,
     three$Fish.species.richness,
     xlab='Log Habitat Complexity',
     ylab='Fish Species Richness')
abline(29.178,13.843)
However, when I then do a curvilinear regression and try to plot the curve, it doesn't fit the data. Where am I going wrong?
mod.log<-lm(Fish.species.richness~log(Habitat.Complexity),data=three)
plot(three$Habitat.Complexity,
     three$Fish.species.richness)
abline(mod.log)
Using ggplot2:
ggplot(three, aes(Habitat.Complexity, Fish.species.richness)) +
  geom_point(shape = 1) +
  stat_smooth(method = "lm", formula = y ~ log(x))
abline can only draw straight lines, of the form y = a + bx. Other curves can be added using the curve function.
plot(Fish.species.richness ~ Habitat.Complexity, three)
curve(coef(mod.log)[1] + coef(mod.log)[2]*log(x), add=TRUE)
For clarity, and for flexibility across other model types, you may want to use the predict function to calculate the predicted values along the range of your predictor variable:
mod.log<-lm(Fish.species.richness~log(Habitat.Complexity), data=three)
# predict along predictor variable range
newdat <- data.frame(Habitat.Complexity=seq(min(three$Habitat.Complexity),
                                            max(three$Habitat.Complexity),
                                            length.out=100))
newdat$Fish.species.richness <- predict(mod.log, newdat, type="response")
# plot
plot(Fish.species.richness ~ Habitat.Complexity, data=three)
lines(Fish.species.richness ~ Habitat.Complexity, data=newdat)
I have a data.frame containing a continuous predictor and a dichotomous response variable.
> head(df)
position response
1 0 1
2 3 1
3 -4 0
4 -1 0
5 -2 1
6 0 0
I can easily compute a logistic regression by means of the glm()-function, no problems up to this point.
Next, I want to create a plot with ggplot that contains both the empirical probabilities for each of the 11 predictor values and the fitted regression line.
I went ahead and computed the probabilities with cast() and saved them in another data.frame
> probs
position prob
1 -5 0.0500
2 -4 0.0000
3 -3 0.0000
4 -2 0.2000
5 -1 0.1500
6 0 0.3684
7 1 0.4500
8 2 0.6500
9 3 0.7500
10 4 0.8500
11 5 1.0000
I plotted the probabilities:
p <- ggplot(probs, aes(x=position, y=prob)) + geom_point()
But when I try to add the fitted regression line
p <- p + stat_smooth(method="glm", family="binomial", se=F)
it returns a warning: non-integer #successes in a binomial glm!.
I know that in order to plot the stat_smooth "correctly", I'd have to call it on the original df data with the dichotomous variable. However, if I use the df data in ggplot(), I see no way to plot the probabilities.
How can I combine the probabilities and the regression line in one plot, in the way it's meant to be in ggplot2, i.e. without getting any warning or error messages?
There are basically three solutions:
Merging the data.frames
The easiest option, once you have your data in two separate data.frames, would be to merge them by position:
mydf <- merge( mydf, probs, by="position")
Then you can call ggplot on this data.frame without warnings:
ggplot(mydf, aes(x=position, y=prob)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)
Avoiding the creation of two data.frames
In future you could directly avoid the creation of two separate data.frames which you have to merge later. Personally, I like to use the plyr package for that:
library(plyr)
mydf <- ddply( mydf, "position", mutate, prob = mean(response) )
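A dplyr equivalent, if you prefer to avoid plyr (not part of the original answer):
library(dplyr)
mydf <- mydf %>%
  group_by(position) %>%
  mutate(prob = mean(response)) %>%
  ungroup()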
Edit: Use different data for each layer
I forgot to mention that you can use a different data.frame for each layer, which is a strong advantage of ggplot2:
ggplot(probs, aes(x=position, y=prob)) +
  geom_point() +
  geom_smooth(data = mydf, aes(x = position, y = response),
              method = "glm", method.args = list(family = "binomial"),
              se = FALSE)
As an additional hint: avoid using the variable name df, since assigning to it masks the built-in function stats::df.