add exponential function given mean and intercept to cdf plot - r

Considering the following random data:
set.seed(123456)
# generate random normal data
x <- rnorm(100, mean = 20, sd = 5)
weights <- 1:100
df1 <- data.frame(x, weights)
#
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
We can create a general cumulative distribution plot.
But, I want to compare my curve to that from data used 20 years ago. From the paper, I only know that the data is "best modeled by a shifted exponential distribution with an x intercept of 1.1 and a mean of 18"
How can I add such a function to my plot? So far I have tried:
+ stat_function(fun=dexp, geom = "line", size=2, col="red", args = (mean=18.1))
but I am not sure how to deal with the shift (the x intercept).

I think scenarios like this are best handled by defining your function first, outside of the ggplot call.
dexp doesn't take a mean parameter; it uses rate instead, which is the same as lambda (1/mean). That means you want rate = 1/18.1 based on properties of exponential distributions. Also, I don't think dexp makes much sense here, since it shows the density; I think you really want the cumulative probability, which is pexp.
Your code could look something like this:
library(ggplot2)
test <- function(x) {pexp(x, rate = 1/18.1)}
ggplot(df1, aes(x)) + stat_ecdf() +
stat_function(fun=test, size=2, col="red")
You could shift your pexp distribution like this:
test <- function(x) {pexp(x-10, rate = 1/18.1)}
ggplot(df1, aes(x)) + stat_ecdf() +
stat_function(fun=test, size=2, col="red") +
xlim(10,45)
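For the distribution described in the paper, a minimal sketch might look like the following. It assumes that "a mean of 18" refers to the mean of the shifted distribution, so the exponential part has mean 18 - 1.1 = 16.9; if the paper means the mean before shifting, the rate would be 1/18 instead. The names shift, dist_mean and shifted_pexp are made up for illustration.

```r
# Assumption: the x intercept (1.1) is the shift, and 18 is the mean
# of the shifted distribution, i.e. mean = shift + 1/rate.
shift <- 1.1
dist_mean <- 18
rate <- 1 / (dist_mean - shift)

# CDF of a shifted exponential: pexp() returns 0 for negative input,
# so the curve is 0 up to the shift and rises from there.
shifted_pexp <- function(x) pexp(x - shift, rate = rate)

shifted_pexp(c(0, 1.1, 18, 50))
```

Adding stat_function(fun = shifted_pexp, col = "red") to the ggplot call above should then overlay this curve on the empirical CDF.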
Just for fun, this is what using dexp produces:

I am not entirely sure I understand the concept of a mean for an exponential function. However, in general, when you pass a function as an argument (fun=dexp in your case), you can pass your own modified function instead, in the form fun = function(x) dexp(x)+1.1, for example.
Maybe experimenting with this feature will get you to the solution.

Related

Plotting dr4pl dose response curves, and how to integrate them with ggplot2?

I am trying to set up a high-throughput way of plotting out dose response curves from a large screening experiment. Prism obviously has the easiest way of doing dose-response curves well, but I can't copy and paste this much data.
Since CRAN removed drc, the package dr4pl seems the way to go, but there is very little instruction available yet.
library(dr4pl)
#make data frame (dose and response must be numeric, not character strings)
dose <- c(0.078125, 0.156250, 0.312500, 0.625000, 1.250000, 2.500000, 5.000000, 10.000000, 20.000000)
POC <- c(1.05637425, 0.87380081, 0.79171200, 0.83166848, 0.77361290, 0.35199288, 0.19404609, 0.09079221, 0.09850658)
data <- data.frame(dose, POC)
#use the dr4pl package to calculate curve and IC50 etc
model <- dr4pl(POC~dose, data = data)
summary.model <- summary(model)
summary.model$coefficients
#plot this
plot(dr4pl(POC~dose, data=data))
The above will generate the type of curve I need using dr4pl, and get me the IC50. But how would I plot several datasets/curves on one plot?
Ideally I'd rather plot the data using ggplot2: plot+geom_point(), and add the dose response line by using the dr4pl summary as a +stat_smooth() model, if that makes sense? But I don't know how to do this.
Any help would be appreciated.
I can get most of the way but not all the way. The main step is to write a predict() method for dr4pl objects:
predict.dr4pl <- function (object, newdata=NULL, se.fit=FALSE, level, interval) {
    xseq <- if (is.null(newdata)) object$data$Dose else newdata$x
    pred <- MeanResponse(xseq, object$parameters)
    if (!se.fit) {
        return(pred)
    }
    qq <- qnorm((1+level)/2)
    se <- sapply(xseq,
                 function(x) car::deltaMethod(object,
                     "UpperLimit + (LowerLimit - UpperLimit)/(1 + (x/IC50)^Slope)")[["Estimate"]])
    return(list(fit=data.frame(fit=pred,lwr=pred-qq*se,
                               upr=pred+qq*se), se.fit=se))
}
I included a slightly hacky way to compute the confidence intervals via the delta method - this might not be too reliable (bootstrapping would be better ...)
It works OK (sort of) for your data (changed the name to dd because it's sometimes dicey to name the data data (fortunes::fortune("dog"))).
dd <- data.frame(dose = c(0.078125,0.156250,0.312500,0.625000,1.25,
2.50,5.0,10.0,20.0),
POC = c(1.05637425, 0.87380081, 0.79171200,
0.83166848, 0.77361290, 0.35199288,
0.19404609, 0.09079221, 0.09850658))
library(dr4pl)
ggplot(dd, aes(dose,POC)) + geom_point() +
geom_smooth(method="dr4pl",se=TRUE) + coord_trans(x="log10")
The confidence intervals are terrible; turn them off with se=FALSE.
dr4pl puts the x-axis on a log10-scale by default, but the standard scale_x_log10() screws this up because it is applied before the fitting and prediction, so I use coord_trans(x="log10") instead.
However, coord_trans() doesn't play so nicely if the axes are on a very broad logarithmic scale - I tried the example above with the sample_data_1 data from the package and it didn't work.
But I'm afraid I've spent enough time on this for now.
It would be more robust to use the predict method above to generate the values you want, over the range you want, separately, and then use geom_line() + geom_ribbon() to add the information to the plot ....
If you're willing to fit the model first (outside geom_smooth) you can do this (this is using sample_data_1 from dr4pl package - it's from the first example in ?dr4pl)
model2 <- dr4pl(dose = sample_data_1$Dose,
response = sample_data_1$Response)
ggplot(sample_data_1, aes(Dose,Response)) + geom_point() +
stat_function(fun=function(x) predict(model2,newdata=data.frame(x=x))) +
scale_x_log10()
which is less sensitive to the order of scaling/unscaling the x axis.
Improved but slow bootstrap CIs:
predictdf.dr4pl <- function (model, xseq, se, level, nboot=200) {
    pred <- MeanResponse(xseq, model$parameters)
    if (!se) {
        return(base::data.frame(x=xseq, y=pred))
    }
    ## bootstrap residuals
    pred0 <- MeanResponse(model$data$Dose, model$parameters)
    res <- pred0-model$data$Response
    bootres <- matrix(nrow=length(xseq), ncol=nboot)
    pb <- txtProgressBar(max=nboot,style=3)
    for (i in seq(nboot)) {
        setTxtProgressBar(pb,i)
        mboot <- dr4pl(model$data$Dose,
                       pred0 + sample(res, size=length(pred0),
                                      replace=TRUE))
        bootres[,i] <- MeanResponse(xseq, mboot$parameters)
    }
    fit <- data.frame(x = xseq,
                      y=pred,
                      ymin=apply(bootres,1,quantile,(1-level)/2),
                      ymax=apply(bootres,1,quantile,(1+level)/2))
    return(fit)
}
print(ggplot(dd, aes(dose,POC))
+ geom_point()
+ geom_smooth(method="dr4pl",se=TRUE) + coord_trans(x="log10")
)

Generate random numbers of a chi squared distribution in R

I want to generate a chi squared distribution with 100,000 random numbers with degrees of freedom 3.
This is what I have tried.
df3=data.frame(X=dchisq(1:100000, df=3))
But the output is not what I expected. I used the code below to visualize it.
ggplot(df3,aes(x=X,y=..density..)) + geom_density(fill='blue')
The resulting pdf looks abnormal. Please help.
Use rchisq to sample from the distribution:
df3=data.frame(X=rchisq(100000, df=3))
ggplot(df3,aes(x=X,y=..density..)) + geom_density(fill='blue')
If your goal is to plot a density function, do this:
ggplot(data.frame(x = seq(0, 25, by = 0.01)), aes(x = x)) +
stat_function(fun = dchisq, args = list(df = 3), fill = "blue", geom = "density")
The latter has the advantage of the plot being fully deterministic.
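To see why the original attempt looked wrong, it may help to compare the two functions directly. The sketch below uses base R only; the seed is arbitrary and the check against the theoretical mean and variance is just a sanity test, not part of the original answer.

```r
# d* functions evaluate the density at the points you pass in;
# far out in the tail the chi-squared density is essentially zero,
# which is why dchisq(1:100000, df = 3) is almost all zeros.
dchisq(c(1, 10, 100), df = 3)

# r* functions draw random values; the first argument is how many to draw.
set.seed(42)                      # arbitrary seed, only for reproducibility
draws <- rchisq(100000, df = 3)

# A chi-squared distribution with k degrees of freedom has mean k
# and variance 2k, so these should be close to 3 and 6.
c(mean = mean(draws), var = var(draws))
```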
Use rchisq() to create a distribution of 100,000 observations randomly drawn from a chi square distribution with 3 degrees of freedom.
df3=data.frame(X=rchisq(100000, df=3))
hist(df3$X)
...and the output:
The ggplot version looks like this:
library(ggplot2)
ggplot(data = df3, aes(X)) + geom_histogram()
You can use rchisq to make random draws from a χ² distribution, as shown in the other answers.
dchisq is the density function, which you might still find useful, since you want to plot:
curve(dchisq(x, 3), xlim=0:1*15)

R ggplot: Weighted CDF

I'd like to plot a weighted CDF using ggplot. Some old non-SO discussions (e.g. this from 2012) suggest this is not possible, but thought I'd reraise.
For example, consider this data:
df <- data.frame(x=sort(runif(100)), w=1:100)
I can show an unweighted CDF with
ggplot(df, aes(x)) + stat_ecdf()
How would I weight this by w? For this example, I'd expect an x^2-looking function, since the larger numbers have higher weight.
There is a mistake in your answer.
This is the right code to compute the weighted ECDF:
df <- df[order(df$x), ] # Won't change anything since it was created sorted
df$cum.pct <- with(df, cumsum(w) / sum(w))
ggplot(df, aes(x, cum.pct)) + geom_line()
The ECDF is a function F(a) equal to the sum of the weights (probabilities) of the observations with x ≤ a, divided by the total sum of weights.
But here is a more satisfying option that simply modifies the original code of the ggplot2 stat_ecdf:
https://github.com/NicolasWoloszko/stat_ecdf_weighted
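If you would rather have the weighted ECDF as a function you can evaluate anywhere, the definition above can be sketched in base R as follows. The name weighted_ecdf is made up for illustration, and the sketch assumes the x values are distinct.

```r
# Hypothetical helper: returns F(a) = (sum of w where x <= a) / sum(w)
# as a right-continuous step function. Assumes distinct x values.
weighted_ecdf <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum <- cumsum(w) / sum(w)
  # stepfun() needs one more height than knots: 0 to the left of min(x)
  stepfun(x, c(0, cum))
}

set.seed(1)
df <- data.frame(x = sort(runif(100)), w = 1:100)
Fw <- weighted_ecdf(df$x, df$w)
Fw(c(0, 0.5, 1))
```

Evaluating Fw on a grid and plotting the result with geom_step() should give the same curve as the cum.pct approach.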

Log Log Probability Chart in R

I'm sure this is easy, but I've been tearing my hair out trying to find out how to do this in R.
I have some data that I am trying to fit to a power law distribution. To do this, you need to plot the data on a log-log cumulative probability chart. The y-axis is the LOG of the frequency of the data (or log-probability, if you like), and the x-axis is the log of the values. If it's a straight line, then it fits a power law distribution, and the gradient determines the power law parameter.
If I want the frequency of the data, I can just use the ecdf() function:
My data set is called Profits.negative, and it's just a long list of trading profits that were less than zero (and I've notionally converted them all to positive numbers to avoid logging problems later on).
So I can type
plot(ecdf(Profits.negative))
And I get a handy empirical CDF function plotted. All I need to do is to convert both axes to log scales. I can do the x-axis:
Profits.negative.logs <- log(Profits.negative)
plot(ecdf(Profits.negative.logs))
Almost there! I just need to work out how to log the y-axis! But I can't seem to do it, and I can't work out how to extract the figures from the ecdf object. Can anyone help?
I know there is a power.law.fit function, but that just estimates the parameters - I want to plot the data and see if it lines up.
You can fit and plot power-laws using the poweRlaw package. Here's an example. First we generate some data from a heavy tailed distribution:
set.seed(1)
x = round(rlnorm(100, 3, 2)+1)
Next we load the package and create a data object and a displ object:
library(poweRlaw)
m = displ$new(x)
We can estimate xmin and the scaling parameter:
est = estimate_xmin(m)
and set the parameters
m$setXmin(est[[2]])
m$setPars(est[[3]])
Then plot the data and add the fitted line:
plot(m)
lines(m, col=2)
To get:
Data generation first (your part, actually ;)):
set.seed(1)
Profits.negative <- runif(1e3, 50, 100) + rnorm(1e2, 5, 5)
Logging and ecdf:
Profits.negative.logs <- log(Profits.negative)
fn <- ecdf(Profits.negative.logs)
ecdf returns a function, and if you want to extract something from it, it's a good idea to look into the function's closure:
ls(environment(fn))
# [1] "f" "method" "n" "nobs" "x" "y" "yleft" "yright"
Well, now we can access x and y:
x <- environment(fn)$x
y <- environment(fn)$y
That's probably what you need. Indeed, plot(fn) and plot(x,y,type="l") show virtually the same results. To log the y-axis you just need:
plot(x,log(y),type="l")
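For checking a power law, a common variant is to plot the empirical survival function P(X >= x) instead of the CDF, because that is the quantity that becomes a straight line on log-log axes. This is a swapped-in technique rather than what the question literally asked for. A minimal base R sketch, using made-up heavy-tailed data in place of Profits.negative and assuming no tied values:

```r
set.seed(1)
values <- rlnorm(1000, 3, 2)      # stand-in for Profits.negative (positive values)

sorted <- sort(values)
n <- length(sorted)

# Empirical P(X >= x) at each sorted observation: n/n, (n-1)/n, ..., 1/n.
# Unlike the ECDF this never reaches 0, so it is safe to log.
surv <- (n:1) / n

# log = "xy" puts both axes on a log scale without transforming the data
plot(sorted, surv, log = "xy", type = "l",
     xlab = "value", ylab = "P(X >= value)")
```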
Here is an approach using ggplot2:
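library(ggplot2)
library(scales) # for trans_breaks(), trans_format() and math_format()
# data
set.seed(1)
x = round(rlnorm(100, 3, 2)+1)
# organize data into a df (use = rather than <- so the columns get proper names)
df <- data.frame(x = sort(x, decreasing = TRUE),
pk = ecdf(x)(x),
k = seq_along(x))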
# plot
ggplot(df, aes(x=k, y= pk)) + geom_point(alpha=0.5) +
coord_trans(x = 'log10', y = 'log10') +
scale_x_continuous(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) +
scale_y_continuous(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))

Plot probability with ggplot2 (not density)

I'd like to plot data such that the y-axis shows probability (in range [0,1]) and the x-axis shows the data values. The data is continuous (also in range [0,1]), so I'd like to use some kernel density estimation function and normalize it such that the y-value at some point x would mean the probability of seeing value x in the input data.
So, I'd like to ask:
a) Is it reasonable at all? I understand that I cannot have probability of seeing values I do not have in the data, but I just would like to interpolate between points I have using a kernel density estimation function and normalize it afterwards.
b) Are there any built-in options in ggplot I could use, that would override default behavior of geom_density() for example for doing this?
Thanks in advance,
Timo
EDIT:
When I said "normalize" before, I actually meant "scale". But I got the answer, so thanks guys for clearing up my mind about this.
Just making up a quick merge of @JD Long's and @yesterday's answers:
ggplot(df, aes(x=x)) +
geom_histogram(aes(y = ..density..), binwidth=density(df$x)$bw) +
geom_density(fill="red", alpha = 0.2) +
theme_bw() +
xlab('') +
ylab('')
This way the binwidth for ggplot2 is calculated by the density function, and the density is drawn on top of the histogram with a nice transparency. But you should definitely look into stat_density, as @yesterday suggested, for further customization.
This isn't a ggplot answer, but if you want to bring together the ideas of kernel smoothing and histograms you could do a bootstrapping + smoothing approach. You'll get beat about the head and shoulders by stats folks for doing ugly things like this, so use at your own risk ;)
Start with some synthetic data:
set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3) )
hist(randomData, freq=FALSE)
lines(density(randomData), col="red")
The density function has a reasonably smart bandwidth calculator which you can borrow from:
bw <- density(randomData)$bw
resample <- sample( randomData, 10000, replace=TRUE)
Then use the bandwidth calc as the SD to make some random noise:
noise <- rnorm(10000, 0, bw)
hist(resample + noise, freq=FALSE)
lines(density(randomData), col="red")
Hey look! A kernel smoothed histogram!
I know this long response is not really an answer to your question, but maybe it will provide some creative ideas on how to abuse your data.
You can control the behaviour of density / kernel estimation in ggplot by calling stat_density() rather than geom_density().
See the on-line user manual: http://had.co.nz/ggplot2/stat_density.html
You can specify any of the kernel estimation functions that are supported by stats::density():
library(ggplot2)
df <- data.frame(x = rnorm(1000))
ggplot(df, aes(x=x)) + stat_density(kernel="biweight")
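On part (a) of the question, it may help to see that the height of a density curve is not itself a probability: it can exceed 1, and probabilities only come from the area under the curve over an interval. The sketch below uses base R and made-up Beta(2, 5) data on [0, 1]; the interval [0.2, 0.4] and all variable names are arbitrary choices for illustration.

```r
set.seed(1)
x <- rbeta(1000, 2, 5)             # made-up data on [0, 1]

# density() evaluates the kernel estimate on an evenly spaced grid
d <- density(x, from = 0, to = 1)
step <- d$x[2] - d$x[1]

# The peak of the density is a height, not a probability
max(d$y)

# Approximate P(0.2 <= X <= 0.4) as the area under the curve (Riemann sum)
in_range <- d$x >= 0.2 & d$x <= 0.4
p_interval <- sum(d$y[in_range]) * step
p_interval

# Compare with the raw proportion of observations in that interval
mean(x >= 0.2 & x <= 0.4)
```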

Resources