How to reduce cubic equation's maximum peak (fitting) - r

I collected ventilation volume data as a function of efficiency, took several samples, and fitted them with cubic equations.
The fitting was done in Excel, which produced a third-order regression equation.
However, as you can see from the picture, the predicted ventilation volume at 90-95% efficiency is higher than the value at 100%. The data should never exceed the 100% value, but the fitted cubic has a convex peak, so the curve overshoots it.
Is there a way to lower that maximum peak while still using the measured data as they are, so that the fit never exceeds the 100% value?
Solutions in R or other statistical software are also welcome.
A somewhat lower R-squared value is acceptable.
Thank you.

I extracted data from the scatterplot and found a good fit to a Gompertz-type sigmoidal equation "a * exp(-1.0 * exp((x - b)/c)) + Offset". The extracted data gave parameters a = -4.7537951574153149E+03, b = 5.4531406419707224E+01, c = 2.1494180901343391E+01, and Offset = 4.4056239791186508E+03, yielding RMSE = 57.17 and R-squared = 0.9988; see below. If this seems like it might be useful to you, I suggest re-fitting the actual data using these values as the initial parameter estimates.
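A minimal sketch of that re-fit in R with nls(), assuming your measurements are in vectors x (efficiency) and y (ventilation volume) and using rounded versions of the estimates above as starting values:
# Sketch: re-fit the Gompertz-type model to the actual data
# (x = efficiency, y = ventilation volume are assumed vector names)
gompertz_fit <- nls(y ~ a * exp(-1.0 * exp((x - b) / c)) + Offset,
                    start = list(a = -4753.8, b = 54.53, c = 21.49, Offset = 4405.6))
summary(gompertz_fit)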

Here are a few ideas in R:
First, I'm making some example data that are similar to yours and fitting a linear model with x^3, x^2, and x as predictors:
# make example data
xx = rep(c(30, 50, 70, 100), each = 10)
yy = 1/(1+exp(-(xx-50)/15)) * 4798.20 + rnorm(length(xx), sd = 20)
xx = c(0, xx)
yy = c(0, yy)
# fit third-order linear model
m0 = lm(yy ~ I(xx^3) + I(xx^2) + xx)
x_to_predict = data.frame(xx = seq(0, 100, length.out = length(xx)))
lm_preds = predict(m0, newdata = x_to_predict)
Idea 1: You could fit a model that uses a sigmoid (or other monotonic) curve.
# fit quasibinomial model for proportion
# first scale response variable between 0 and 1
m1 = glm(I(yy/max(yy)) ~ xx, family = quasibinomial())
# predict
preds_glm = predict(m1,
                    newdata = x_to_predict,
                    type = "response")
Idea 2: Fit a generalized additive model that will make a smooth curve.
# fit Generalized Additive Model
library(mgcv)
# you have to tune "k" somewhat -- larger means more "wiggliness"
m2 = gam(yy ~ s(xx, k = 4))
gam_preds = predict(m2,
                    newdata = x_to_predict,
                    type = "response")
Here's what the plots for each model look like:
# plot data and predictions
plot(xx, yy, ylab = "result", xlab = "efficiency")
lines(x_to_predict$xx, preds_glm*max(yy),
      type = "l", col = 'red', lwd = 2)
lines(x_to_predict$xx, gam_preds,
      type = "l", col = 'blue', lwd = 2)
lines(x_to_predict$xx, lm_preds,
      type = "l", col = 'black', lwd = 2, lty = 2)
legend("bottomright",
       lty = c(0, 1, 1, 2),
       legend = c("data", "GLM prediction", "GAM prediction", "third-order lm"),
       pch = c(1, NA_integer_, NA_integer_, NA_integer_),
       col = c("black", "red", "blue", "black"))

Related

How do you draw a partition plane from a classification algorithm in a 3D plot in R

I'm trying to draw a partition border from a classification algorithm in a 3D plot in R (using plot3D). It's a relatively simple task if we only have two predictors, requiring only two axes to draw (e.g. using the partimat function). I haven't yet found a satisfactory way to draw a three predictor-based classification partition in 3D space.
To visualise the problem, let's start by building a partition for just two axes using a Linear Discriminant Analysis (LDA) classification algorithm on the iris dataset:
# Load packages and subset the iris dataset:
library(klaR)
data = droplevels(iris[iris$Species != 'virginica', ])
partimat(Species ~ Sepal.Length + Sepal.Width, data,
         method = 'lda')
We get a 2D plot with a clearly defined partition between the two species:
However, partimat can only handle two predictors at a time (see ?partimat). Let's now look at the 3D problem:
library(plot3D)
# Plot the raw data:
points3D(data$Sepal.Length, data$Sepal.Width, data$Petal.Length,
         colkey = F,
         pch = 16, cex = 2,
         theta = 30, phi = 30,
         ticktype = 'detailed',
         col = data$Species)
I want to draw a plane separating the two data classes based on a classification algorithm like LDA. Drawing inspiration from Roman Luštrik's example, here's my poor attempt at defining the partition between three predictors. Essentially, I've built a LDA model with three predictors, then predicted the species (setosa or versicolor) onto multiple points between the max. and min. values of all three predictors. When plotted on a 3D plot, this generates a point cloud, coloured differently to represent the 3D space where either iris species should appear based on the three predictors:
# Build a classification model with three predictors:
m = lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length, data)
# Predict 'Species' for the full range of each plant metric:
np = 50
nx = seq(from = min(data[, 1]), to = max(data[, 1]), length.out = np)
ny = seq(from = min(data[, 2]), to = max(data[, 2]), length.out = np)
nz = seq(from = min(data[, 3]), to = max(data[, 3]), length.out = np)
nd = expand.grid(Sepal.Length = nx, Sepal.Width = ny, Petal.Length = nz)
p = as.numeric(predict(m, newdata = nd)$class)
part = cbind(nd, Partition = p)
# Plot the partition and add the data points:
scatter3D(part$Sepal.Length, part$Sepal.Width, part$Petal.Length,
          colvar = part$Partition,
          colkey = F,
          alpha = 0.5,
          pch = 16, cex = 0.3,
          theta = 30, phi = 30,
          ticktype = 'detailed',
          plot = F)
points3D(data$Sepal.Length, data$Sepal.Width, data$Petal.Length,
         colkey = F,
         pch = 16, cex = 2,
         theta = 30, phi = 30,
         ticktype = 'detailed',
         col = data$Species,
         add = T)
I've also added the data points. You can make out the partition as the fuzzy intersection between blue and red in the point cloud:
This isn't an ideal solution, as it's difficult to see the data points hidden amongst the point cloud. The point cloud is also a little bit distracting. Maybe some clever plotting of the points with transparency would improve things, but I suspect a much nicer solution would be to draw a plane (similar to a regression plane) at the intersect between species classes (i.e. where the blue and red dots meet). Note, I ultimately wish to use different classifiers (e.g. Random Forest) just in case there's a solution out there limited only to LDA or similar.
Many thanks in advance for any solutions or advice.
You can use the coefficients from the lda model to generate a plane separating the discriminant volumes. Effectively, the plane is the set of points in the 3D space where the sum of the (x, y, z) co-ordinates multiplied by their respective coefficients from the model is equal to the model's threshold (i.e. the plane where the model can't discriminate one group from the other).
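In equation form, with c1, c2, c3 being the three entries of coef(m): c1 * Sepal.Length + c2 * Sepal.Width + c3 * Petal.Length = threshold, which we can solve for Petal.Length as (threshold - c1 * Sepal.Length - c2 * Sepal.Width) / c3. That is exactly what the outer() call below computes.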
We can do this by creating a 10 x 10 grid of equally spaced values along the x and y axes and calculating the z value that gives us the threshold value based on the model:
threshold <- sum(coef(m) * data[1, 1:3]) - predict(m)$x[1]
Sepal_Lengths <- seq(min(data$Sepal.Length), max(data$Sepal.Length), length.out = 10)
Sepal_Widths <- seq(min(data$Sepal.Width), max(data$Sepal.Width), length.out = 10)
Petal_Lengths <- outer(Sepal_Lengths, Sepal_Widths, function(x, y) {
  (threshold - x * coef(m)[1] - y * coef(m)[2]) / coef(m)[3]})
So now when we draw our points:
points3D(data$Sepal.Length, data$Sepal.Width, data$Petal.Length,
         colkey = F,
         pch = 16, cex = 2,
         theta = 30, phi = 30,
         ticktype = 'detailed',
         col = data$Species)
Adding the plane is as easy as:
persp3D(x = Sepal_Lengths,
        y = Sepal_Widths,
        z = Petal_Lengths,
        col = "gold", add = TRUE, alpha = 0.5)

Fitting a sigmoid curve using a logistic function in R

I have data that follow a sigmoid curve, and I would like to fit a logistic function to extract the three (or two) parameters for each participant. I have found some methods online, but I'm not sure which is the correct option.
This tutorial explains that you should use the nls() function like this:
fitmodel <- nls(y~a/(1 + exp(-b * (x-c))), start=list(a=1,b=.5,c=25))
## get the coefficients using the coef function
params=coef(fitmodel)
... where you clearly need the starting values to find the best-fitting values (?).
And then this post explains that to get the starting values, you can use a self-starting model, which "can estimate good starting values for you, so you don't have to specify them":
fit <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = data.frame(x, y))
However, somewhere else I also read that you should use the SSlogis function for fitting a logistic function. Could someone please confirm whether these two steps are the best way to go about it? Or should I use values that I have extracted from previous similar data as the starting values?
Additionally, what should I do if I don't want the logistic function to be defined by the asymptote at all?
Thank you!
There isn't a best way, but SSlogis does eliminate having to set starting values, whereas if you specify the formula yourself you have more control over the parameterization.
If the question is really how to fix a at a predetermined level (here the value 1) without rewriting the formula, then set a before running nls and omit it from the starting values.
a <- 1
fo <- y ~ a / (1 + exp(-b * (x-c)))
nls(fo, start = list(b = 0.5, c = 25))
Alternatively, the following substitutes a = 1 into formula fo, giving fo2, without you having to rewrite the formula yourself.
fo2 <- do.call("substitute", list(fo, list(a = 1)))
nls(fo2, start = list(b = 0.5, c = 25))
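To see both approaches side by side, here is a minimal self-contained sketch with simulated data (the data, seed, and starting values are illustrative, not from the question):
# Simulated sigmoid data for illustration only
set.seed(1)
x <- seq(0, 50, by = 1)
y <- 1 / (1 + exp(-0.3 * (x - 25))) + rnorm(length(x), sd = 0.02)
a <- 1
fo <- y ~ a / (1 + exp(-b * (x - c)))
fo2 <- do.call("substitute", list(fo, list(a = 1)))
fit_fixed <- nls(fo2, start = list(b = 0.5, c = 25))   # asymptote fixed at a = 1
fit_self  <- nls(y ~ SSlogis(x, Asym, xmid, scal))     # self-starting logistic
coef(fit_fixed)
coef(fit_self)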
As #G. Grothendieck writes, there is no general "best way"; it always depends on your particular aims. Using SSlogis is a good idea, as you don't need to specify start values, but defining your own function is more flexible. See the following example, where we use heuristics to derive start values ourselves instead of specifying them manually. We then fit a logistic model and, as a small bonus, the Baranyi growth model with an explicit lag phase.
# time (t)
x <- c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
# Algae cell counts (Mio cells per ml)
y <- c(0.88, 1.02, 1.43, 2.79, 4.61, 7.12,
       6.47, 8.16, 7.28, 5.67, 6.91)
## we now plot the data linearly and logarithmically
## the layout function is another way to subdivide the plotting area
nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow = TRUE), respect = TRUE)
layout.show(nf) # this shows how the plotting area is subdivided
plot(x, y)
plot(x, log(y))
## we see that the first points show the steepest increase,
## so we can estimate a start value of the growth rate
r <- (log(y[5]) - log(y[1])) / (x[5] - x[1])
abline(a=log(y[1]), b=r)
## this way, we have a heuristics for all start parameters:
## r: steepest increase of y in log scale
## K: maximum value
## N0: first value
## we can check this by plotting the function with the start values
f <- function(x, r, K, N0) {K /(1 + (K/N0 - 1) * exp(-r *x))}
plot(x, y, pch=16, xlab="time (days)", ylab="algae (Mio cells)")
lines(x, f(x, r=r, K=max(y), N0=y[1]), col="blue")
pstart <- c(r=r, K=max(y), N0=y[1])
aFit <- nls(y ~ f(x, r, K,N0), start = pstart, trace=TRUE)
x1 <- seq(0, 25, length = 100)
lines(x1, predict(aFit, data.frame(x = x1)), col = "red")
legend("topleft",
legend = c("data", "start parameters", "fitted parameters"),
col = c("black", "blue", "red"),
lty = c(0, 1, 1),
pch = c(16, NA, NA))
summary(aFit)
(Rsquared <- 1 - var(residuals(aFit))/var(y))
## =============================================================================
## Approach with Baranyi-Roberts model
## =============================================================================
## sometimes, a logistic is not good enough. In this case, use another growth
## model
baranyi <- function(x, r, K, N0, h0) {
  A <- x + 1/r * log(exp(-r * x) + exp(-h0) - exp(-r * x - h0))
  y <- exp(log(N0) + r * A - log(1 + (exp(r * A) - 1)/exp(log(K) - log(N0))))
  y
}
pstart <- c(r=0.5, K=7, N0=1, h0=2)
fit2 <- nls(y ~ baranyi(x, r, K, N0, h0), start = pstart, trace=TRUE)
lines(x1, predict(fit2, data.frame(x = x1)), col = "forestgreen", lwd=2)
legend("topleft",
legend = c("data", "logistic model", "Baranyi-Roberts model"),
col = c("black", "red", "forestgreen"),
lty = c(0, 1, 1),
pch = c(16, NA, NA))

Simulation Poisson Process using R and ggplot2

Using a simulation of a Poisson process with rate lambda = 0.7, show a sample run with N(t) on the vertical axis and time t on the horizontal axis. The simulation runs over the range t = [0, 100]. Generate a first graph with 10 trajectories and a second graph with 100 trajectories.
I have tried the following code but I cannot generate both graphs.
library(plyr)
library(ggplot2)
Process_poisson <- function(t, lambda){
  distr_poisson <- rpois(1, t*lambda)
  s_poisson <- sort(runif(distr_poisson, 0, t))
  data.frame(x = c(0, 0, s_poisson), y = c(0, 0:distr_poisson))
}
N_simulations <- function(n, t, lambda){
  s_poisson <- lapply(1:n, function(n) data.frame(Process_poisson(t, lambda), simulation = n))
  s_poisson <- ldply(s_poisson, data.frame)
  s_poisson$simulation <- factor(s_poisson$simulation)
}
t<- 0:100
lambda<- 0.7
N_simulations(10, t, lambda)
N_simulations(100, t, lambda)
par(mfrow = c(1,2))
matplot(x, y, type = "l", lty = 0:5, lwd = 1, lend = par("lend"),
pch = NULL, col = simulation, cex = 0.5, bg = NA, main =sprintf("Nº simulations of trajectories of Poisson Process",10,lambda), xlab = "Time", ylab = "N(t)",
xlim = c(0,100), ylim = c(-10,0))
matplot(Proceso_poisson(t, lambda), n, y, type = "l", lty = 0:5, lwd = 1, lend = par("lend"),
pch = NULL, col = simulacion, cex = 0.5, bg = NA, main =sprintf("Nº simulations of trajectories of Poisson Process",10,lambda), xlab = "Time", ylab = "N(t)",
xlim = c(0,100), ylim = c(-10,0))
How could I do it?
Thanks so much!
I think you could make this simpler. Here's a ggplot solution.
First, create a function that will simulate a Poisson process by taking samples drawn from an exponential distribution with the appropriate lambda. In this example I have used a while loop that starts with a vector x whose first element is 0. The function grows this vector by adding random samples until its sum reaches the target duration tmax. This is not the most efficient way to do it, but should make the example clearer.
When the target is reached, the function returns the cumulative sum of the vector, which represents the arrival times of a Poisson process of the appropriate lambda. Note that to make plotting easier, it actually returns a data frame with the cumulative times, the cumulative count, and a grouping variable run that will allow us to plot several runs easily on a single plot.
make_sample_df <- function(run, tmax, lambda)
{
  x <- 0
  while(sum(x) < tmax) x <- c(x, rexp(1, lambda))
  data.frame(t = cumsum(x), N = seq_along(x), run = rep(run, length(x)))
}
We can now use this function inside our actual plotting function:
plot_poisson <- function(runs, tmax, lambda)
{
  # Creates one data frame for each run; this sticks them all together:
  df <- do.call("rbind", lapply(seq(runs), make_sample_df, tmax, lambda))
  ggplot2::ggplot(df, aes(t, N, group = run)) +
    geom_step(alpha = 0.25) +
    labs(title = paste(runs, "runs of Poisson process with lambda", lambda)) +
    theme(legend.position = "none") +
    coord_cartesian(xlim = c(0, tmax))
}
So you can do:
plot_poisson(runs = 10, tmax = 100, lambda = 0.7)
plot_poisson(runs = 100, tmax = 100, lambda = 0.7)
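Since the answer notes that the while loop is not the most efficient construction, here is an alternative sketch, closer to what the question's own code was attempting: given the total count N(tmax) ~ Poisson(lambda * tmax), the arrival times are distributed like sorted uniform draws on [0, tmax]. The function name make_sample_df_unif is made up here; it could be swapped into plot_poisson in place of make_sample_df.
# Alternative generator: Poisson count plus sorted uniform arrival times
make_sample_df_unif <- function(run, tmax, lambda)
{
  n <- rpois(1, lambda * tmax)         # total number of events in [0, tmax]
  arrivals <- sort(runif(n, 0, tmax))  # given n, arrivals are sorted uniforms
  # note: this version starts at N(0) = 0, a minor difference from make_sample_df
  data.frame(t = c(0, arrivals), N = 0:n, run = rep(run, n + 1))
}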

Why does my GAM fit not seem to have a correct intercept? [R]

My GAM curves are being shifted downwards. Is there something wrong with the intercept? I'm using the same code as in Introduction to Statistical Learning... Any help is appreciated.
Here's the code. I simulated some data (a straight line with noise) and fit a GAM multiple times using the bootstrap.
(It took me a while to figure out how to plot multiple GAM fits in one graph. Thanks to Sam's answer in this post, and to this post.)
library(gam)
N = 1e2
set.seed(123)
dat = data.frame(x = 1:N,
                 y = seq(0, 5, length = N) + rnorm(N, mean = 0, sd = 2))
plot(dat$x, dat$y, xlim = c(1,100), ylim = c(-5,10))
gamFit = vector('list', 5)
for (ii in 1:5){
  ind = sample(1:N, N, replace = T)  # bootstrap
  gamFit[[ii]] = gam(y ~ s(x, 10), data = dat, subset = ind)
  par(new = T)
  plot(gamFit[[ii]], col = 'blue',
       xlim = c(1,100), ylim = c(-5,10),
       axes = F, xlab = '', ylab = '')
}
The issue is with plot.gam. If you take a look at the help page (?plot.gam), there is a parameter called scale, which states:
a lower limit for the number of units covered by the limits on the ‘y’ for each plot. The default is scale=0, in which case each plot uses the range of the functions being plotted to create their ylim. By setting scale to be the maximum value of diff(ylim) for all the plots, then all subsequent plots will produced in the same vertical units. This is essential for comparing the importance of fitted terms in additive models.
This is an issue here, since the ylim you are passing does not match the range of the function being plotted (the plotted smooths do not span -5 to 10). So what you need to do is change
plot(gamFit[[ii]], col = 'blue',
     xlim = c(1,100), ylim = c(-5,10),
     axes = F, xlab = '', ylab = '')
to
plot(gamFit[[ii]], col = 'blue',
     scale = 15,
     axes = F, xlab = '', ylab = '')
And you get:
Or you can just remove the xlim and ylim parameters from both calls to plot, and the automatic setting of plot to use the full range of the data will make everything work.

R calculated gamma distribution density wrong?

I am trying to numerically calculate a marginal likelihood (marginalizing over a positive parameter). I am using a Gamma distribution as the prior for that parameter. Here I looked at the behavior of the Gamma distribution for two specific parameter settings:
s = 28.4; r = 17000
plot(x, dgamma(x, shape=s, rate = r), type = 'l', ylab = 'density')
abline(v = s/r, col = 'red')
I got the following results:
Then I tried the following to get a tighter Gamma distribution:
lines(x, dgamma(x, shape=s*1000, rate = r*1000), col = 'blue')
and the result:
I am confused. As the distribution gets tighter, its peak should grow taller, otherwise the area won't integrate to 1. Did I miss anything? Or are there any numerical problems? Thanks!
Your x variable needs to have more samples to capture the narrow peak in the second density function:
x = seq(0, .01, .000001)
s = 28.4; r = 17000
plot(x, dgamma(x, shape=s, rate = r), type = 'l', ylab = 'density')
abline(v = s/r, col = 'red')
lines(x, dgamma(x, shape=s*1000, rate = r*1000), col = 'blue')
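As a quick sanity check (not part of the original answer): for shape > 1 the mode of a Gamma(shape, rate) density is (shape - 1)/rate, so you can confirm that the tighter prior really is far taller at its peak once x is sampled finely enough.
# Peak heights of the two densities at their modes
mode1 <- (s - 1) / r
mode2 <- (s * 1000 - 1) / (r * 1000)
dgamma(mode1, shape = s, rate = r)                 # wider density: lower peak
dgamma(mode2, shape = s * 1000, rate = r * 1000)   # tighter density: much taller peak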
