Fitting a sigmoidal curve to points with ggplot - r

I have a simple dataframe for the response measurements from a drug treatment at various doses:
drug <- c("drug_1", "drug_1", "drug_1", "drug_1", "drug_1",
"drug_1", "drug_1", "drug_1", "drug_2", "drug_2", "drug_2",
"drug_2", "drug_2", "drug_2", "drug_2", "drug_2")
conc <- c(100.00, 33.33, 11.11, 3.70, 1.23, 0.41, 0.14,
0.05, 100.00, 33.33, 11.11, 3.70, 1.23, 0.41, 0.14, 0.05)
mean_response <- c(1156, 1833, 1744, 1256, 1244, 1088, 678, 489,
2322, 1867, 1333, 944, 567, 356, 200, 177)
std_dev <- c(117, 317, 440, 200, 134, 38, 183, 153, 719,
218, 185, 117, 166, 167, 88, 50)
df <- data.frame(drug, conc, mean_response, std_dev)
I can plot these point using the following code and get the basic foundation of the visualization that I would like:
p <- ggplot(data=df, aes(y=mean_response, x= conc, color = drug)) +
geom_pointrange(aes(ymax = (mean_response + std_dev), ymin = (mean_response - std_dev))) +
scale_x_log10()
p
The next thing I would like to do with these data is add a sigmoidal curve to the plot, that fits the plotted points for each drug. Following that, I would like to calculate the EC50 for this curve.
I realize I may not have the entire range of the sigmoidal curve in my data, but I am hoping to get the best estimate I can with what I have. Also, the final point for drug_1 does not follow the expected trend of a sigmoidal curve, but this is actually not unexpected as the solutions that the drug is in can inhibit responses at high concentrations (each drug is in a different solution). I would like to exclude this point from the data.
I am getting stuck at the step of fitting a sigmoidal curve to my data. I have looked over some other solutions to fitting sigmoidal curves to data but none seem to work.
One post that is very close to my problem is this:
(sigmoid) curve fitting glm in r
Based on it, I tried:
p + geom_smooth(method = "glm", family = binomial, se = FALSE)
This gives the following error, and seems to default to plotting straight lines:
`geom_smooth()` using formula 'y ~ x'
Warning message:
Ignoring unknown parameters: family
I have also tried the solution from this link:
Fitting a sigmoidal curve to this oxy-Hb data
In this case, I get the following error:
Computation failed in `stat_smooth()`:
Convergence failure: singular convergence (7)
and no lines are added to the plot.
I have tried looking up both of these errors but cannot seem to find a reason that makes sense with my data.
Any help would be much appreciated!

As I said in a comment, I would only use geom_smooth() for a very easy problem; as soon as I run into trouble I use nls instead.
My answer is very similar to #Duck's, with the following differences:
I show both unweighted and (inverse-variance) weighted fits.
In order to get the weighted fits to work, I had to use the nls2 package, which provides a slightly more robust algorithm
I use SSlogis() to get automatic (self-starting) initial parameter selection
I do all of the prediction outside of ggplot2, then feed it into geom_line()
p1 <- nls(mean_response~SSlogis(conc,Asym,xmid,scal),data=df,
subset=(drug=="drug_1" & conc<100)
## , weights=1/std_dev^2 ## error in qr.default: NA/NaN/Inf ...
)
library(nls2)
p1B <- nls2(mean_response~SSlogis(conc,Asym,xmid,scal),data=df,
subset=(drug=="drug_1" & conc<100),
weights=1/std_dev^2)
p2 <- update(p1,subset=(drug=="drug_2"))
p2B <- update(p1B,subset=(drug=="drug_2"))
pframe0 <- data.frame(conc=10^seq(log10(min(df$conc)),log10(max(df$conc)), length.out=100))
pp <- rbind(
data.frame(pframe0,mean_response=predict(p1,pframe0),
drug="drug_1",wts=FALSE),
data.frame(pframe0,mean_response=predict(p2,pframe0),
drug="drug_2",wts=FALSE),
data.frame(pframe0,mean_response=predict(p1B,pframe0),
drug="drug_1",wts=TRUE),
data.frame(pframe0,mean_response=predict(p2B,pframe0),
drug="drug_2",wts=TRUE)
)
library(ggplot2); theme_set(theme_bw())
(ggplot(df,aes(conc,mean_response,colour=drug)) +
geom_pointrange(aes(ymin=mean_response-std_dev,
ymax=mean_response+std_dev)) +
scale_x_log10() +
geom_line(data=pp,aes(linetype=wts),size=2)
)
I believe the EC50 is equivalent to the xmid parameter ... note the large differences between weighted and unweighted estimates ...

I would suggest next approach which is close to what you want. I also tried with a setting for your data using binomial family but there are some issues about values between 0 and 1. In that case you would need an additional variable in order to determine the respective proportions. The code in the following lines use a non linear approximation in order to sketch your output.
Initially, the data:
library(ggplot2)
#Data
df <- structure(list(drug = c("drug_1", "drug_1", "drug_1", "drug_1",
"drug_1", "drug_1", "drug_1", "drug_1", "drug_2", "drug_2", "drug_2",
"drug_2", "drug_2", "drug_2", "drug_2", "drug_2"), conc = c(100,
33.33, 11.11, 3.7, 1.23, 0.41, 0.14, 0.05, 100, 33.33, 11.11,
3.7, 1.23, 0.41, 0.14, 0.05), mean_response = c(1156, 1833, 1744,
1256, 1244, 1088, 678, 489, 2322, 1867, 1333, 944, 567, 356,
200, 177), std_dev = c(117, 317, 440, 200, 134, 38, 183, 153,
719, 218, 185, 117, 166, 167, 88, 50)), class = "data.frame", row.names = c(NA,
-16L))
In a non linear least squares, you need to define initial values for the search of ideal parameters. We use next code with base function nls() to obtain those initial values:
#Drug 1
fm1 <- nls(log(mean_response) ~ log(a/(1+exp(-b*(conc-c)))), df[df$drug=='drug_1',], start = c(a = 1, b = 1, c = 1))
#Drug 2
fm2 <- nls(log(mean_response) ~ log(a/(1+exp(-b*(conc-c)))), df[df$drug=='drug_2',], start = c(a = 1, b = 1, c = 1))
With this initial approach of parameters, we sketch the plot using geom_smooth(). We again use nls() to find the right parameters:
#Plot
ggplot(data=df, aes(y=mean_response, x= conc, color = drug)) +
geom_pointrange(aes(ymax = (mean_response + std_dev), ymin = (mean_response - std_dev))) +
geom_smooth(data = df[df$drug=='drug_1',],method = "nls", se = FALSE,
formula = y ~ a/(1+exp(-b*(x-c))),
method.args = list(start = coef(fm1),
algorithm='port'),
color = "tomato")+
geom_smooth(data = df[df$drug=='drug_2',],method = "nls", se = FALSE,
formula = y ~ a/(1+exp(-b*(x-c))),
method.args = list(start = coef(fm0),
algorithm='port'),
color = "cyan3")
The output:

Related

simple curve fitting in R

I am trying to find a fit for my data. But so far had no luck.
Tried the logarithmic, different ones from the drc package .. but I am sure there must be a better one I just don't know the type.
On a different note - I would be grateful for advice on how to go about curve hunting in general.
library(drc)
df<-structure(list(x = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55), y = c(0.1066, -0.6204, -0.2028, 0.2621, 0.4083,
0.4497, 0.6343, 0.7762, 0.8809, 1.0029, 0.8089, 0.7845, 0.8009,
0.9319, 0.9414, 0.9505, 0.9323, 1.0321, 0.9381, 0.8975, 1.0929,
1.0236, 0.9589, 1.0644, 1.0411, 1.0763, 0.9679, 1.003, 1.142,
1.1049, 1.2868, 1.1569, 1.1952, 1.0802, 1.2125, 1.3765, 1.263,
1.2507, 1.2125, 1.2207, 1.2836, 1.3352, 1.1311, 1.2321, 1.4277,
1.1645), w = c(898, 20566, 3011, 1364, 1520, 2376, 1923, 1934,
1366, 1010, 380, 421, 283, 262, 227, 173, 118, 113, 95, 69, 123,
70, 80, 82, 68, 83, 76, 94, 101, 97, 115, 79, 98, 84, 92, 121,
97, 102, 93, 92, 101, 74, 124, 64, 52, 63)), row.names = c(NA,
-46L), class = c("tbl_df", "tbl", "data.frame"), na.action = structure(c(`47` = 47L), class = "omit"))
fit <- drm(data = df,y ~ x,fct=LL.4(), weights = w)
plot(fit)
1) If we ignore the weights then y = a + b * x + c/x^2 seems to fit and is linear in the coefficients so is easy to fit. This seems upward sloping so we started with a line but then we needed to dampen that so we added a reciprocal term. A reciprocal quadratic worked slightly better than a plain reciprocal based on the residual sum of squares so we switched to that.
fm <- lm(y ~ x + I(1 / x^2), df)
coef(summary(fm))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.053856e+00 0.116960752 9.010341 1.849238e-11
## x 4.863077e-03 0.002718613 1.788808 8.069195e-02
## I(1/x^2) -1.460443e+02 16.518887452 -8.841049 3.160306e-11
The coefficient of the x term is not significant at the 5% level -- the p value is 8% in the table above -- so we can remove it and it will fit nearly as well giving a model with only two parameters. In the plot below the fm fit with 3 parameters is solid and the fm2 fit with 2 parameters is dashed.
fm2 <- lm(y ~ I(1 / x^2), df)
plot(y ~ x, df)
lines(fitted(fm) ~ x, df)
lines(fitted(fm2) ~ x, df, lty = 2)
2) Another approach is to use two straight lines. This is still continuous but has one non-differentiable point at the point of transition. The model has 4 parameters, the intercepts and slopes of each line. Below we do use the weights. It has the advantage of an obvious motivation based on the appearance of the data. The break point at the intersection of the two lines may have significance as the transition point between the higher sloping initial growth and the lower sloping subsequent growth.
# starting values use lines fitted to 1st ten and last 10 points
fm_1 <- lm(y ~ x, df, subset = 1:10)
fm_2 <- lm(y ~ x, df, subset = seq(to = nrow(df), length = 10))
st <- list(a = coef(fm_1)[[1]], b = coef(fm_1)[[2]],
c = coef(fm_2)[[1]], d = coef(fm_2)[[2]])
fm3 <- nls(y ~ pmin(a + b * x, c + d * x), df, start = st, weights = w)
# point of transition
X <- with(as.list(coef(fm3)), (a - c) / (d - b)); X
## [1] 16.38465
Y <- with(as.list(coef(fm3)), a + b * X); Y
## [1] 0.8262229
plot(y ~ x, df)
lines(fitted(fm3) ~ x, df)
The basic idea is to understand how the selected function performs. Take a function you know (e.g. the logistic) and modify it. Or (even better) go to the literature and see which functions people use in your specific domain. Then create a user-defined model, play with it to understand the parameters, define good start values and then fit it.
Her a quick & dirty example of a user-defined function (with package growthrates). It can surely be made similarly with drc.
library("growthrates")
grow_userdefined <- function (time, parms) {
with(as.list(parms), {
y <- (K * y0)/(y0 + (K - y0) * exp(-mumax * time)) + shift
return(as.matrix(data.frame(time = time, y = y)))
})
}
fit <- fit_growthmodel(FUN=grow_userdefined,
p = c(y0 = -1, K = 1, mumax = 0.1, shift = 1),
time = df$x, y = df$y)
plot(fit)
summary(fit)
It can of course be made better. As we have no exponential start at the onset, one can for example start with a simple saturation function instead of a logistic, e.g. something Monod-like. As said, the preferred way is to use a function related to the application domain.

ggeffects giving different prediction results from lm extracted as parsnip model, despite same coefficients

I have a question about predictions using ggeffects, which is giving me completely different results if I use a traditional lm fit or an extracted parsnip model fit (despite having the same coefficients).
Here is an example...
library(tidyverse)
library(tidymodels)
library(ggeffects)
test_df <- structure(list(weight = c(-1.7, 0, 0.6, 0.6, -0.7, -0.3, -0.6,
-1, -1, 2, 0.1, -0.6, -1.5, 2, -0.7, -0.2, -0.9, -0.6, 1.1, -2,
1.4, -1, -1.1, 0.5, 1.3, 0, -0.5, -3, 1.1, -0.6), steps = c(19217,
15758, 14124, 14407, 5565, 20860, 17536, 17156, 17219, 652, 1361,
8524, 1169, 3117, 3135, 1917, 4267, 7067, 8927, 2436, 3014, 5281,
8104, 6836, 8939, 4923, 6885, 10581, 10370, 11024), calories = c(1943,
1581, 1963, 1551, 1699, 1789, 1550, 2036, 1707, 1522, 1672, 1994,
1588, 1506, 1678, 1673, 1662, 1906, 1814, 1609, 1799, 1825, 1654,
2291, 1788, 2019, 1911, 1589, 2177, 2137)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -30L)) %>%
as_tibble(.)
#lm fit
lmmod_simp <- lm(weight ~ steps * calories, data = test_df)
#tidymodels
linear_reg_lm_spec <-
linear_reg() %>%
set_engine('lm')
basic_rec <- recipe(weight ~ steps + calories, test_df) %>%
step_interact(terms = ~ steps:calories)
lm_wflw <- workflow() %>%
add_recipe(basic_rec) %>%
add_model(linear_reg_lm_spec)
lm_fit <- fit(lm_wflw, data = test_df)
lm_fit_extracted <- lm_fit %>% extract_fit_parsnip()
When I look at the output, both have the same coefficients
lmmod_simp
lm_fit_extracted
But when I go to predict, the predictions are completely different
ggemmeans(lmmod_simp, terms = c("steps", "calories[1500,2000,2500]")) %>%
as.data.frame() %>%
ggplot(aes(x,predicted, color=group, linetype = group))+
geom_line()
modlm 1
ggemmeans(lm_fit_extracted, terms = c("steps", "calories[1500,2000,2500]")) %>%
as.data.frame() %>%
ggplot(aes(x,predicted, color=group, linetype = group))+
geom_line()
mod lm2
Perhaps I can't/shouldn't use the parsnip fit object in this way, but it seems odd since they are showing the same coefficients.
I appreciate any help!
You are getting different results because lmmod_simp and lm_fit_extracted are different models. While lm_fit has an interaction effect on steps, lm_fit_extracted has no idea about this interaction as it gets the data after the interaction calculation has been performed.
It is generally not recommended to pull out models from a workflow object if you plan on using it for other things than diagnostics.

How to fix 'non-conformable arrays' error in SVM?

Currently practicing SVM with a sample dataset and I'm trying to add the abline to the plot but says that w[1,2] is not the correct dimensions, leading me back to the created variable 'w' in which 'coefs' and 'SV' are not wanting to work in the code
I've tried lowercasing SV, which works but results in w | numeric (empty) instead of num [1, 1:2] -0.035 -0.0188.
structure(list(Height = c(44, 52.1, 57.1, 33, 27.8, 27.2, 32,
45.1, 56.7, 56.9, 122.1, 123.9, 122.9, 101.1, 128.9, 137.1, 127,
103, 141.6, 102.4), Weight = c(126.3, 136.9, 109.2, 148.3, 110.4,
107.8, 128.4, 120.2, 140.2, 139.2, 154.1, 170.8, 183.1, 164,
193.6, 181.7, 164.8, 174.6, 185.8, 176.9)), class = "data.frame", row.names = c(NA,
-20L))
#create an is horse indicator variable
ishorse <- c(rep(-1,10),rep(+1,10))
library(e1071)
#create data frame for performing svm
data <- data.frame(Height= animal_data['Height'],
Weight= animal_data['Weight'],
animal=as.factor(ishorse))
#plot data
plot(data[,-3],col=(3)/2, pch=19); abline(h=0,v=0,lty=3)
#perform svm
svm.model <- svm(animal ~ .,
data=data,
type='C-classification',
kernel='linear',
scale=FALSE)
#show support vectors
points(data[svm.model$index, c(1,2)], col="orange", cex = 2)
#get parameters of hyperplane
w <- t(svm.model$coefs) %% svm.model$SV
b <- -svm.model$rho
#in this 2D case the hyperplane is the line w[1,1]*x1 + w[1,2]*2 +b = 0
abline(a=-b/w[2,2], b=-w[,]/w[,], col = "blue", lty = 3)
The correct syntax for matrix multiplication is %*%. Try this:
svm.model <- svm(animal ~ .,
data=data,
type='C-classification',
kernel='linear',
scale=FALSE)
#show support vectors
points(data[svm.model$index, c(1,2)], col="orange", cex = 2)
#get parameters of hyperplane
w <- t(svm.model$coefs) %*% svm.model$SV
b <- -1*svm.model$rho
#in this 2D case the hyperplane is the line w[1,1]*x1 + w[1,2]*2 +b = 0
abline(a=-b/w[1,2], b=-w[1]/w[2], col = "blue", lty = 3)

Error when running a gam with poisson family and offset

I'm trying to run a gam in R and I'm getting a strange error message.
Generally, I have some number of counts, per volume of water sampled, and I want to correct by that number of counts. I'm trying to generate a smooth function that fits the counts as a function of depth, accounting for differences in volume sampled.
test <- structure(list(depth = c(2.5, 7.5, 12.5, 17.5, 22.5, 27.5, 32.5,
37.5, 42.5, 47.5, 52.5, 57.5, 62.5, 67.5, 72.5, 77.5, 82.5, 87.5,
92.5, 97.5), count = c(53323, 665, 1090, 491, 540, 514, 612,
775, 601, 497, 295, 348, 357, 294, 292, 968, 455, 148, 155, 101
), vol = c(2119.92, 111.76, 156.64, 71.28, 77.44, 73.92, 62.48,
78.32, 74.8, 81.84, 53.68, 80.96, 80.08, 79.2, 79.2, 77.44, 77.44,
84.48, 73.04, 59.84)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), .Names = c("depth", "count", "vol"
))
gam(count ~ s(depth) + offset(vol), data = test, family = "poisson")
Error in if (pdev - old.pdev > div.thresh) { : missing value where TRUE/FALSE needed
Any idea why this is not working? If I get rid of the offset, or if I set family = "gaussian" the function runs as one would expect.
Edit: I find that
gam(count ~ s(depth) + offset(log(vol)), data = test, family = "poisson")
does run, and I think I saw something that said that one wants to log transform the offset variable for these, so maybe this is actually working ok.
You definitely need to put vol on the log scale (for this model).
More generally, an offset enters the model on the scale of the link function. Hence if your model use family = poisson(link = 'sqrt'), then you'd want to include
the offset as offset(sqrt(vol)).
I suspect the error is coming from some overflow or bad value in the likelihood/deviance arising from assuming that the vol values were on the log scale whilst the initial model was fitting.

glm fitted values mirrored/won't match

I've got a strange problem with plotting the fitted values of a glm.
My code is:
Data <- data.frame("Sp" = c(111.4, 185, 231, 272.5, 309, 342, 371, 399,
424, 447, 469, 489, 508, 527, 543, 560, 575, 589, 603, 616, 630, 642, 653,
664, 675, 685, 695, 705, 714, 725, 731, 740), "nrC" = 1:32)
modell <- glm(Sp ~ nrC, data = Data, family = Gamma)
pred <- predict(modell, newdata = data.frame("nrC" = 1:32), type = "response")
plot(Data$nrC, Data$Sp, xlim = c(0, 40), ylim = c(50, 1000))
lines(Data$nrC, pred, col = "blue")
The blue line representing the fitted values seems to be ok, apart from being horizontally mirrored.
I'm relatively new to this, so maybe I'm missing something obvious here, but I can't figure out what's wrong.
Doing the same with the data presented here works perfectly fine.
I'be grateful for any hints!
The gamma distribution isn't quite right for this data set. The data shown in the plot as you have it formulated shows a square root-ish looking function. Try specifying the model like this:
modell <- glm(Sp ~ sqrt(nrC), data = Data, family = gaussian)
pred <- predict(modell, newdata = data.frame("nrC" = 1:32), type = "response")
plot(Data$nrC, Data$Sp, xlim = c(0, 40), ylim = c(50, 1000))
lines(Data$nrC, pred, col = "blue")

Resources