Plotting an Exponential Best Fit Curve to ggplot2 using Stat_smooth [duplicate] - r

I am trying to fit data to an exponential decay function (an RC-like system) with the equation:
fold = 1 + Vmax * (1 - exp(-t / tau))
My data are in the following data frame:
dataset <- data.frame(Exp = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6), t = c(0, 0.33, 0.67, 1, 1.33, 1.67, 2, 4, 6, 8, 10, 0, 0.33, 0.67, 1, 1.33, 1.67, 2, 4, 6, 8, 10, 0, 0.33, 0.67, 1, 1.33, 1.67, 2, 4, 6, 8, 10), fold = c(1, 0.957066345654286, 1.24139015724819, 1.62889151698633, 1.72008539595879, 1.82725412314402, 1.93164365299958, 1.9722929538061, 2.15842019312484, 1.9200507796933, 1.95804730344453, 1, 0.836176542548747, 1.07077717914707, 1.45471712491441, 1.61069357875771, 1.75576377806756, 1.89280913889538, 2.00219054189937, 1.87795513639311, 1.85242493827193, 1.7409346372629, 1, 0.840498729335292, 0.904130905000499, 1.23116185602517, 1.41897551928886, 1.60167656534099, 1.72389226836308, 1.80635095956481, 1.76640786872057, 1.74327897001172, 1.63581509884482))
I have data from 3 experiments (Exp: 4, 5 and 6) and I want to fit each experiment to the given equation.
I have managed to do it for one of the experiments by subsetting my data and using the parameters calculated by nls:
test <- subset(dataset, Exp == 4)
fit1 <- nls(fold ~ 1 + (Vmax * (1 - exp(-t / tau))),
            data = test,
            start = c(tau = 0.2, Vmax = 2))
ggplot(test, aes(t, fold)) +
  stat_function(fun = function(t) {
    1 + coef(fit1)[[2]] * (1 - exp(-t / coef(fit1)[[1]]))
  }) +
  geom_point()
But if I try to use the geom_smooth function directly on the full dataset with this code
d <- ggplot(test, aes(t, fold)) +
  geom_point() +
  geom_smooth(method = "nls",
              formula = 'fold~1+Vmax*(1-exp(-t/tau))',
              start = c(tau = 0.2, Vmax = 2))
print(d)
I get the following error:
Error in model.frame.default(formula = ~fold, data = data, weights = weight) :
variable lengths differ (found for '(weights)')
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
Is there anything wrong with my syntax? I would like to get this working so that I can use the same call on the full dataset, using group to get one fit per Exp level.

There are several problems:
formula is a parameter of nls and you need to pass a formula object to it and not a character.
ggplot2 passes y and x to nls and not fold and t.
By default, stat_smooth tries to get the confidence interval. That isn't implemented in predict.nls.
In summary:
d <- ggplot(test, aes(x = t, y = fold)) +
  # to make it obvious I use argument names instead of positional matching
  geom_point() +
  geom_smooth(method = "nls",
              formula = y ~ 1 + Vmax * (1 - exp(-x / tau)), # this is an nls argument,
              # but stat_smooth passes the parameter along
              start = c(tau = 0.2, Vmax = 2), # this too
              se = FALSE) # this is an argument to stat_smooth and
              # switches off drawing confidence intervals
Edit:
After the major ggplot2 update to version 2, you need:
geom_smooth(method="nls",
formula=y~1+Vmax*(1-exp(-x/tau)), # this is an nls argument
method.args = list(start=c(tau=0.2,Vmax=2)), # this too
se=FALSE)
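To get one fit per Exp level, as the question asks, the same smoother can be applied per group by mapping a grouping aesthetic to Exp. A minimal sketch on the full dataset (assuming the same starting values are reasonable for all three experiments):
library(ggplot2)
# colour = factor(Exp) both colours the points and makes geom_smooth
# fit the nls model separately for each experiment
ggplot(dataset, aes(x = t, y = fold, colour = factor(Exp))) +
  geom_point() +
  geom_smooth(method = "nls",
              formula = y ~ 1 + Vmax * (1 - exp(-x / tau)),
              method.args = list(start = c(tau = 0.2, Vmax = 2)),
              se = FALSE)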

Related

How might you determine how well distributed a set of data is?

I have two datasets that distribute the same 90 data points into 2 and 4 groups/rows respectively, and I would like to determine which of the two distributes the data better, then plot the result to see this visually. By "better distributed" I mean that each group ends up with a similar/same amount of data. For example, in the "Grouped into 2" result, the second group contains larger values in each column than the first, so one of the 2 groups holds the larger values, which means the data is not well distributed between the 2 groups.
I am quite new to R, so I am unsure how to go about this. I would appreciate any insight into what approach could be used.
Grouped into 4
Values <- matrix(c(1, 6, 3, 6, 6, 8,
                   3, 3, 5, 3, 3, 3,
                   6, 7, 6, 7, 5, 4,
                   9, 4, 4, 5, 5, 3), nrow = 4, ncol = 6, byrow = TRUE)
Grouped into 2
Values <- matrix(c(3, 6, 4, 3, 4, 6,
                   12, 9, 12, 12, 11, 9), nrow = 2, ncol = 6, byrow = TRUE)
You can do this with some basic statistics, using hypothesis testing, i.e. testing whether the two groups are statistically different or not. The stats package in R has a lot of tests you can try, each with its own assumptions. Here is one:
Making the matrix
values <- matrix(c(3, 6, 4, 3, 4, 6,
                   12, 9, 12, 12, 11, 9), nrow = 2, ncol = 6, byrow = TRUE)
Conducting t-test
t.test(values[1, ], values[2, ], paired = FALSE)
Will give you this:
Welch Two Sample t-test
data: values[1, ] and values[2, ]
t = -7.9279, df = 9.945, p-value = 1.318e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.328203 -4.671797
sample estimates:
mean of x mean of y
4.333333 10.833333
The mean of values[1, ] is smaller than that of values[2, ], with a p-value of 1.3e-05, i.e. the difference between the two groups is statistically significant.
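Since the question also asks for a plot, here is a minimal sketch of a visual comparison with base graphics (using only the values matrix defined above):
# each column of t(values) is one group, so boxplot() draws one box per group
boxplot(t(values), names = c("group 1", "group 2"), ylab = "value")
If the boxes barely overlap, as here, that is the visual counterpart of the significant t-test.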

Error in fitdist with gamma distribution

Below is my code:
library(fitdistrplus)
s <- c(11, 4, 2, 9, 3, 1, 2, 2, 3, 2, 2, 5, 8, 3, 15, 3, 9, 22, 0, 4, 10, 1, 9, 10, 11,
       2, 8, 2, 6, 0, 15, 0, 2, 11, 0, 6, 3, 5, 0, 7, 6, 0, 7, 1, 0, 6, 4, 1, 3, 5,
       2, 6, 0, 10, 6, 4, 1, 17, 0, 1, 0, 6, 6, 1, 5, 4, 8, 0, 1, 1, 5, 15, 14, 8, 1,
       3, 2, 9, 4, 4, 1, 2, 18, 0, 0, 10, 5, 0, 5, 0, 1, 2, 0, 5, 1, 1, 2, 3, 7)
o <- fitdist(s, "gamma", method = "mle")
summary(o)
plot(o)
and the error says:
Error in fitdist(s, "gamma", method = "mle") : the function mle
failed to estimate the parameters,
with the error code 100
The Gamma distribution doesn't allow zero values (the likelihood will evaluate to zero, and the log-likelihood will be infinite, for a response of 0) unless the shape parameter is exactly 1.0 (i.e., an exponential distribution - see below). That's a statistical/mathematical problem, not a programming problem. You're going to have to find something sensible to do about the zero values. Depending on what makes sense for your application, you could (for example)
choose a different distribution to test (e.g. pick a censoring point and fit a censored Gamma, or fit a zero-inflated Gamma distribution, or ...)
exclude the zero values (fitdist(s[s>0], ...))
set the zero values to some sensible non-zero value (fitdist(replace(s, which(s==0), 0.1), ...))
which (if any) of these is best depends on your application.
@Sandipan Dey's first answer (leaving the zeros in the data set) appears to make sense, but in fact it gets stuck at a shape parameter equal to 1.
o <- fitdist(s, "exp", method = "mle")
gives the same answer as @Sandipan's code (except that it estimates rate = 0.2161572, the inverse of the scale parameter 4.626262 that's estimated for the Gamma distribution - this is just a change of parameterization). If you choose to fit an exponential instead of a Gamma, that's fine - but you should do it on purpose, not by accident ...
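As a quick check of that parameterization claim (a sketch using the s vector from the question):
o2 <- fitdist(s, "exp", method = "mle")
o2$estimate    # rate ~ 0.216
1 / 4.626262   # inverse of the fitted Gamma scale, also ~ 0.216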
To illustrate that the zeros-included fit may not be working as expected, I'll construct my own negative log-likelihood function and display the likelihood surface for each case.
mfun <- function(sh, sc, dd = s) {
  -sum(dgamma(dd, shape = sh, scale = sc, log = TRUE))
}
library(emdbook) ## for curve3d() helper function
Zeros-included surface:
cc1 <- curve3d(mfun(x, y),
               ## set up "shape" limits so we evaluate
               ## exactly shape = 1.000 ...
               xlim = c(0.55, 3.55),
               n = c(41, 41),
               ylim = c(2, 5),
               sys3d = "none")
png("gammazero1.png")
with(cc1,image(x,y,z))
dev.off()
In this case the surface is only defined at shape=1 (i.e. an exponential distribution); the white regions represent infinite log-likelihoods. It's not that shape=1 is the best fit, it's that it's the only fit ...
Zeros-excluded surface:
cc2 <- curve3d(mfun(x, y, dd = s[s > 0]),
               ## set up "shape" limits so we evaluate
               ## exactly shape = 1.000 ...
               xlim = c(0.55, 3.55),
               n = c(41, 41),
               ylim = c(2, 5),
               sys3d = "none")
png("gammazero2.png")
with(cc2, image(x, y, z))
with(cc2, contour(x, y, z, add = TRUE))
abline(v = 1.0, lwd = 2, lty = 2)
dev.off()
Here the surface is finite everywhere, and the contours show an optimum away from the dashed line at shape = 1, consistent with the zeros-excluded fit below (shape ≈ 1.62).
Just provide initial values for the gamma distribution parameters (scale, shape) to be estimated by mle via optim, along with lower bounds for the parameters, and it should work:
o <- fitdist(s, "gamma", lower=c(0,0), start=list(scale=1,shape=1))
summary(o)
#Fitting of the distribution ' gamma ' by maximum likelihood
#Parameters :
# estimate Std. Error
#scale 4.626262 NA
#shape 1.000000 NA
#Loglikelihood: -250.6432 AIC: 505.2864 BIC: 510.4766
As per the comments by @Ben Bolker, we may want to exclude the zero points first:
o <- fitdist(s[s!=0], "gamma", method = "mle", lower=c(0,0), start=list(scale=1,shape=1))
summary(o)
#Fitting of the distribution ' gamma ' by maximum likelihood
#Parameters :
# estimate Std. Error
#scale 3.401208 NA
#shape 1.622378 NA
#Loglikelihood: -219.6761 AIC: 443.3523 BIC: 448.19

how to use method="nlsLM" (in package minpack.lm) in geom_smooth

test <- data.frame(Exp = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6), t = c(0, 0.33, 0.67,
1, 1.33, 1.67, 2, 4, 6, 8, 10, 0, 0.33, 0.67, 1, 1.33, 1.67, 2, 4, 6, 8, 10,
0, 0.33, 0.67, 1, 1.33, 1.67, 2, 4, 6, 8, 10), fold = c(1,
0.957066345654286, 1.24139015724819, 1.62889151698633, 1.72008539595879,
1.82725412314402, 1.93164365299958, 1.9722929538061, 2.15842019312484,
1.9200507796933, 1.95804730344453, 1, 0.836176542548747, 1.07077717914707,
1.45471712491441, 1.61069357875771, 1.75576377806756, 1.89280913889538,
2.00219054189937, 1.87795513639311, 1.85242493827193, 1.7409346372629, 1,
0.840498729335292, 0.904130905000499, 1.23116185602517, 1.41897551928886,
1.60167656534099, 1.72389226836308, 1.80635095956481, 1.76640786872057,
1.74327897001172, 1.63581509884482))
d <- ggplot(test,aes(x=t, y=fold))+
#to make it obvious I use argument names instead of positional matching
geom_point()+
geom_smooth(method="nls",
formula=y~1+Vmax*(1-exp(-x/tau)), # this is an nls argument
method.args = list(start=c(tau=0.2,Vmax=2)), # this too
se=FALSE)
I found this code here on this site, but I wonder how to change method = "nls" to method = "nlsLM" in geom_smooth, as the original nls is a real problem for me when setting the start values.
Is there any way to use packages from CRAN as the method of geom_smooth in ggplot2?
Thanks
You don't seem to have tried anything. You can simply do the obvious:
library(ggplot2)
library(minpack.lm)
d <- ggplot(test, aes(x = t, y = fold)) +
  geom_point() +
  geom_smooth(method = "nlsLM",
              formula = y ~ 1 + Vmax * (1 - exp(-x / tau)),
              method.args = list(start = c(tau = 0.2, Vmax = 2)),
              se = FALSE)
print(d)
# works
Note that convergence problems do not have an easy one-size-fits-all solution. Sometimes minpack can help, but often it will simply give you a bad fit where nls helpfully throws an error.
It's probably best to keep your nls results in a separate data frame, and plot the two items separately:
ggplot() +
geom_point(aes(x=t, y=fold), data = test) +
geom_line(aes(...), data = my.nls.results)
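A sketch of that approach, reusing fit1 and test from the first question above (my.nls.results is the answer's placeholder name; here it holds predictions on a fine time grid):
# predict from the fitted nls model over a grid of t values
grid <- data.frame(t = seq(0, 10, length.out = 100))
my.nls.results <- data.frame(t = grid$t, fold = predict(fit1, newdata = grid))
ggplot() +
  geom_point(aes(x = t, y = fold), data = test) +
  geom_line(aes(x = t, y = fold), data = my.nls.results)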
Use geom_line() instead.
For example, let's say you're working with mtcars and your formula is mpg ~ k / wt + b:
nls_model <- nls(mpg ~ k / wt + b, data = mtcars,
                 start = list(k = 1, b = 0)) # start values here are only illustrative
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_line(stat = "smooth",
            method = "nls",
            formula = y ~ k / x + b,
            method.args = list(start = as.list(coef(nls_model))),
            se = FALSE)
This worked for me with nlsLM, too. The idea behind coef(nls_model) is to use the coefficients of your already successful model as the starting values in geom_line, so you get the same model. Just make sure you use y and x in the formula inside geom_line.

R error using DBSCAN on Data frame

Error in data - x : non-numeric argument to binary operator
My code is as follows:
x <- as.factor(c(2, 2, 8, 5, 7, 6, 1, 4))
y <- as.factor(c(10, 5, 4, 8, 5, 4, 2, 9))
coordinates <- data.frame(x, y)
colnames(coordinates) <- c("x_coordinate", "y_coordinate")
print(coordinates)
point_clusters <- dbscan(coordinates, 2, MinPts = 2, scale = FALSE,
                         method = c("hybrid", "raw", "dist"), seeds = TRUE,
                         showplot = 1, countmode = NULL)
point_clusters
But I'm getting following error while executing the above code:
> point_clusters <- dbscan(coordinates, 2, MinPts = 2, scale = FALSE, method = c("hybrid", "r ..." ... [TRUNCATED]
Error in data - x : non-numeric argument to binary operator
I don't know what the problem with the above code is.
I solved the problem for my needs. I read somewhere that the data needs to be a numeric matrix, although I'm not sure about that. So here is what I did:
x <- c(2, 2, 8, 5, 7, 6, 1, 4)
y <- c(10, 5, 4, 8, 5, 4, 2, 9)
coordinates <- matrix(c(x, y), nrow = 8, byrow = FALSE)
The remaining code is the same as above. Now it works fine for me.
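The underlying cause is the as.factor() calls: a data frame of factors is not numeric, so the distance arithmetic inside dbscan (data - x) fails. A sketch of the same fix while keeping a data frame (assuming fpc::dbscan, which matches the argument names in the question and coerces its input to a numeric matrix):
library(fpc)
x <- c(2, 2, 8, 5, 7, 6, 1, 4)   # plain numeric vectors, no as.factor()
y <- c(10, 5, 4, 8, 5, 4, 2, 9)
coordinates <- data.frame(x_coordinate = x, y_coordinate = y)
point_clusters <- dbscan(coordinates, 2, MinPts = 2, scale = FALSE,
                         method = "raw",   # distances computed from the raw data
                         seeds = TRUE, showplot = 1, countmode = NULL)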

R: displaying scientific notation

chocolate <- data.frame(
Sabor =
c(5, 7, 3,
4, 2, 6,
5, 3, 6,
5, 6, 0,
7, 4, 0,
7, 7, 0,
6, 6, 0,
4, 6, 1,
6, 4, 0,
7, 7, 0,
2, 4, 0,
5, 7, 4,
7, 5, 0,
4, 5, 0,
6, 6, 3
),
Tipo = factor(rep(c("A", "B", "C"), 15)),
Provador = factor(rep(1:15, rep(3, 15))))
tapply(chocolate$Sabor, chocolate$Tipo, mean)
ajuste <- lm(chocolate$Sabor ~ chocolate$Tipo + chocolate$Provador)
summary(ajuste)
anova(ajuste)
a1 <- aov(chocolate$Sabor ~ chocolate$Tipo + chocolate$Provador)
posthoc <- TukeyHSD(x=a1, 'chocolate$Tipo', conf.level=0.95)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = chocolate$Sabor ~ chocolate$Tipo + chocolate$Provador)
$`chocolate$Tipo`
diff lwr upr p adj
B-A -0.06666667 -1.803101 1.669768 0.9950379
C-A -3.80000000 -5.536435 -2.063565 0.0000260
C-B -3.73333333 -5.469768 -1.996899 0.0000337
Here is some sample code using TukeyHSD. The output is a matrix, and I want the values to be displayed in scientific notation. I've tried scipen and options(digits = 20), but some of the values from my actual data are so small that the p adj values are still displayed as 0.00000000000000000000.
How can I get the values displayed in scientific notation?
You could do this:
format(posthoc, scientific = TRUE)
If you want to change the number of digits, for instance using 3, you could do this:
format(posthoc, scientific = TRUE, digits = 3)
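If you only need the comparison matrix, a sketch of extracting it and formatting the adjusted p-values directly (names taken from the TukeyHSD output above):
tipo <- posthoc$`chocolate$Tipo`   # matrix of pairwise comparisons
format(tipo, scientific = TRUE, digits = 3)
formatC(tipo[, "p adj"], format = "e", digits = 2)   # p-values only, e.g. "2.60e-05"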
