How to add an equation to my plot using the text function in R? [duplicate]

This question already has answers here:
Add regression line equation and R^2 on graph
(10 answers)
Closed 2 years ago.
So I am trying to add a simple equation to my graph using the text function in R, but for some reason it's not working.
This is how I generated my random simulated data:
a <- 5
b <- 7
sigma <- 3
x <- runif(100, min = 0, max = 50)
n <- length(x)
y <- a + b * x + rnorm(n, 0, sigma)
fakeData <- data.frame(x, y)
Then I plugged in the data frame and added a regression line like so.
ggplot(data = fakeData, aes(y, x)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(x = "random uniform distribution from range 0 - 50.",
       y = "Linear model points based on the Uniform dist",
       title = "Fake data simulation")
Which results in a scatterplot of the points with the fitted regression line.
Now all I want to do is add the equation of the line to the graph using the text function, but for some reason R just seems to be throwing errors at me. I tried googling the issue but wasn't able to resolve it. I also tried this method, but the lack of documentation is appalling; I have no idea what the author was doing in the function.

Try this with annotate(), but first you have to obtain the model coefficients using lm():
library(ggplot2)
#Code
a <- 5
b <- 7
sigma <- 3
x <- runif(100, min = 0, max = 50)
n <- length(x)
y <- a + b * x + rnorm(n, 0, sigma)
fakeData <- data.frame(x, y)
#lm: fit the model and build the equation label from its coefficients
mod <- lm(y ~ x, data = fakeData)
lab <- paste0(round(coef(mod)[1], 3), '+', round(coef(mod)[2], 3), 'x')
#plot
ggplot(data = fakeData, aes(y, x)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(x = "random uniform distribution from range 0 - 50.",
       y = "Linear model points based on the Uniform dist",
       title = "Fake data simulation") +
  annotate(geom = 'text', x = 50, y = 30, label = lab, fontface = 'bold')
Output: the plot above with the equation drawn as bold text at (x = 50, y = 30).
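If you would rather have the equation typeset as math, a small variant of the same answer (my sketch, not part of the original) builds a plotmath label and passes parse = TRUE to annotate(); the coordinates are the same arbitrary ones as above:
# Sketch: the same annotation rendered with plotmath, so '==' displays as '='
lab2 <- paste0('y == ', round(coef(mod)[1], 3), ' + ', round(coef(mod)[2], 3), '*x')
ggplot(fakeData, aes(y, x)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  annotate(geom = 'text', x = 50, y = 30, label = lab2, parse = TRUE, fontface = 'bold')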

Related

Scaling stat_function in R

I am taking a random sample of 30 data points from the standard normal distribution and plotting the resulting histogram in R. I would like to show an overlapping normal distribution that illustrates how the sample distribution is close to the population distribution. However, I can't figure out how to scale the normal curve. Here is what I have so far in R:
library(ggplot2)
n <- 30
set.seed(42)
X <- rnorm(n, mean = 0, sd = 1)
X <- as.data.frame(X)
ggplot(X, aes(x = X)) +
  geom_histogram(bins = 6) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1))
How do I vertically stretch the PDF of the normal distribution to account for n = 30?
A) Using frequency as the y-axis in the histogram
I have one solution using the function rcompanion::plotNormalHistogram:
n <- 30
set.seed(42)
X <- rnorm(n, mean = 0, sd = 1)
library(rcompanion)
plotNormalHistogram(X)
I think you are looking for the scenario with the default prob = FALSE. There, I extract some information about the counts and density from the hist() function and use that factor to stretch the normal curve vertically.
I don't know how to do the equivalent in ggplot2, but I would suspect that there is a way.
You can run library(rcompanion); plotNormalHistogram (the bare function name, without parentheses) to print the code.
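For completeness, here is a minimal ggplot2 sketch of the same idea (the binwidth is my assumption, chosen to mimic 6 bins): with counts on the y-axis, the density curve has to be stretched by n * binwidth so that its area matches the total count.
library(ggplot2)
n <- 30
set.seed(42)
X <- rnorm(n, mean = 0, sd = 1)
bw <- diff(range(X)) / 6 # assumed binwidth, mimicking bins = 6
ggplot(data.frame(X), aes(x = X)) +
  geom_histogram(binwidth = bw) +
  stat_function(fun = function(z) n * bw * dnorm(z, mean = 0, sd = 1))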
B) Using density as the y-axis in the histogram
library(ggplot2)
n <- 30
set.seed(42)
X <- rnorm(n, mean = 0, sd = 1)
X <- as.data.frame(X)
ggplot(X, aes(x = X)) +
  geom_histogram(aes(y = ..density..), bins = 6) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1))
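Note that the ..density.. pronoun is deprecated in newer ggplot2 releases; assuming ggplot2 >= 3.4, the equivalent spelling uses after_stat():
# Same plot with the non-deprecated after_stat() form
ggplot(X, aes(x = X)) +
  geom_histogram(aes(y = after_stat(density)), bins = 6) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1))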

How do I get a smooth curve from a few data points, in R?

I am trying to plot the rate 1/t as it changes with mue. The code is given below and I have highlighted the relevant lines with input and output.
library("deSolve")
library("reshape")
library("tidyverse")
Fd <- data.frame()
MUES <- c(100, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 100010, 100020, 100050, 100060, 100080, 100090, 100100, 100500) # <------ THIS IS THE INPUT
for (i in 1:length(MUES)){
  parameters <- c(tau = 0.005, tau_r = 0.0025, mui = 0, Ve = 0.06, Vi = -0.01, s = 0.015, mue = MUES[i])
  state <- c(X = 0.015, Y = 0)
  Derivatives <- function(t, state, parameters) {
    #cc <- signal(t)
    with(as.list(c(state, parameters)), {
      # rate of change
      dX <- -(1/tau + mue - mui)*X + (Y-X)/tau_r + mue*Ve - mui*Vi
      dY <- -Y/tau + (X-Y)/tau_r
      # return the rate of change
      list(c(dX, dY))
    }) # end with(as.list ...)
  }
  times <- seq(0, 0.1, by = 0.0001)
  out <- ode(y = state, times = times, func = Derivatives, parms = parameters)
  out.1 <- out %>%
    as.data.frame() %>%
    summarise(d = min(times[Y >= 0.015]))
  Time <- out.1$d
  localdf <- data.frame(t = Time, rate = 1/Time, input = MUES[i])
  Fd <- rbind.data.frame(Fd, localdf) # <----- THIS IS THE DATAFRAME WITH OUTPUT AND INPUT
}
spline_int <- as.data.frame(spline(Fd$input, Fd$rate))
ggplot(Fd) +
  geom_point(aes(x = input, y = rate), size = 3) +
  geom_line(data = spline_int, aes(x = x, y = y))
The rate 1/t has a limiting value at 1276, and that's why I have taken quite a few values of mue at the end, to highlight this. I get a graph like this:
What I want is something like below, so I can highlight the fact that the rate 1/t doesn't grow to infinity and in fact has a limiting value. The figure below is from the Python question.
How do I accomplish this in R? I have tried loess, splines, and geom_smooth (though only varying the span); perhaps I am missing something obvious.
Splines are piecewise polynomials with multiple inflection points. It sounds like you instead want to fit a logarithmic curve:
# fit a logarithmic curve to your data
logEstimate <- lm(rate ~ log(input), data = Fd)
# create a series of x values for which to predict y
# (starting from the smallest input, since log(0) is undefined)
xvec <- seq(min(Fd$input), max(Fd$input), length = 1000)
# predict y based on the log curve fitted to your data
logpred <- predict(logEstimate, newdata = data.frame(input = xvec))
# save the result in a data frame; these values will be used to plot the log curve
pred <- data.frame(x = xvec, y = logpred)
ggplot() +
  geom_point(data = Fd, size = 3, aes(x = input, y = rate)) +
  geom_line(data = pred, aes(x = x, y = y))
Result:
I borrowed some of the code from this answer.
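Since you know the rate levels off, another option worth sketching (my addition, not from the answer above) is an asymptotic regression, which has a plateau built in; the self-starting stats::SSasymp model picks its own starting values, though convergence is not guaranteed on every data set:
# Sketch: rate ~ Asym + (R0 - Asym) * exp(-exp(lrc) * input), reusing xvec from above
asymEstimate <- nls(rate ~ SSasymp(input, Asym, R0, lrc), data = Fd)
asympred <- predict(asymEstimate, newdata = data.frame(input = xvec))
ggplot() +
  geom_point(data = Fd, size = 3, aes(x = input, y = rate)) +
  geom_line(data = data.frame(x = xvec, y = asympred), aes(x = x, y = y))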

How to automatically fit data with several normal cumulative distribution functions in R

I have several data sets (hundreds of them, actually) that I know can be fitted with a sum of several normal cumulative distribution functions (see here).
Here is one example of such a data set, in this case with two cumulative distribution functions:
library(pracma)
library(minpack.lm)
library(ggplot2)
x <- seq(1, 1000, length.out = 50)
k1 <- 0.5
mu1 <- 500
sigma1 <- 100
y1 <- k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1)))
k2 <- 0.5
mu2 <- 300
sigma2 <- 50
y2 <- k2 * (1 + erf((x - mu2) / (sqrt(2) * sigma2)))
my.df <- data.frame(x, y = y1 + y2, type = "data")
ggplot(my.df, aes(x, y)) + geom_line()
Now I want to fit those curves, so I use nlsLM from minpack.lm to do so:
model <- nlsLM(y ~ k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1)))
                 + k2 * (1 + erf((x - mu2) / (sqrt(2) * sigma2))),
               start = c(mu1 = 500, sigma1 = 50, k1 = 0.5,
                         mu2 = 300, sigma2 = 50, k2 = 0.5),
               data = my.df,
               control = nls.lm.control(maxiter = 500))
tmp <- data.frame(x, y = predict(model), type = "fit")
combined <- rbind(my.df, tmp)
ggplot(combined, aes(x, y, colour = type, shape = type)) + geom_line() + geom_point()
Here is what I get:
The fit is great. However, I helped nls a lot:
I gave it a perfectly fitting curve as input, not raw data
I told it my curve was the sum of two functions (not one or three)
And I almost gave the solution by providing very close parameter values
To fix the first point, I fit three models, with one, two, and three component functions, and choose the one with the minimum deviance, as sketched below.
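As a sketch of that selection step (the one-component starting values below are illustrative assumptions):
# Fit a one-component model and keep whichever model has the lower deviance
m1 <- nlsLM(y ~ k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1))),
            start = c(mu1 = 400, sigma1 = 100, k1 = 1),
            data = my.df, control = nls.lm.control(maxiter = 500))
best <- if (deviance(m1) < deviance(model)) m1 else model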
For the second point, with my hundreds of data sets the parameters unfortunately vary quite a bit, and I get disappointing results when I give the same starting parameters to every set.
Is there a better way to select those starting values?
I have heard of the mixtools package, but I'm not sure it works for CDFs (cumulative distribution functions).
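One idea for data-driven starting values, as a sketch (it assumes the observed curve is a reasonably clean sum of scaled normal CDFs): the numerical derivative of such a curve is a mixture of normal densities, so local maxima of the derivative approximate the component means.
# Sketch: estimate candidate mu starting values from peaks of the derivative
dy <- diff(my.df$y) / diff(my.df$x) # approximates the density mixture
xm <- my.df$x[-1]
peaks <- which(diff(sign(diff(dy))) == -2) + 1 # crude local-maximum detection
mu.start <- xm[peaks] # candidate starting values for mu1, mu2, ...
mu.start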

Changing Color in ggplot2 Scatterplots

I'm attempting to modify some existing code that originally came from the question found here (https://stats.stackexchange.com/questions/76999/simulating-longitudinal-lognormal-data-in-r) and was used to demonstrate scatterplots in R on the following website: https://hopstat.wordpress.com/2014/10/30/my-commonly-done-ggplot2-graphs/
It's a simple and stupid question, but I've been struggling with it all morning. The following code gives a nice black-and-white plot, and I want to modify it so the lines are a very light gray.
library(MASS)
library(nlme)
library(plyr)
library(ggplot2)
### set number of individuals
n <- 200
### average intercept and slope
beta0 <- 1.0
beta1 <- 6.0
### true autocorrelation
ar.val <- .4
### true error SD, intercept SD, slope SD, and intercept-slope cor
sigma <- 1.5
tau0 <- 2.5
tau1 <- 2.0
tau01 <- 0.3
### maximum number of possible observations
m <- 10
### simulate number of observations for each individual
p <- round(runif(n,4,m))
### simulate observation moments (assume everybody has 1st obs)
obs <- unlist(sapply(p, function(x) c(1, sort(sample(2:m, x-1, replace=FALSE)))))
### set up data frame
dat <- data.frame(id=rep(1:n, times=p), obs=obs)
### simulate (correlated) random effects for intercepts and slopes
mu <- c(0,0)
S <- matrix(c(1, tau01, tau01, 1), nrow=2)
tau <- c(tau0, tau1)
S <- diag(tau) %*% S %*% diag(tau)
U <- mvrnorm(n, mu=mu, Sigma=S)
### simulate AR(1) errors and then the actual outcomes
dat$eij <- unlist(sapply(p, function(x) arima.sim(model=list(ar=ar.val), n=x) * sqrt(1-ar.val^2) * sigma))
dat$yij <- (beta0 + rep(U[,1], times=p)) + (beta1 + rep(U[,2], times=p)) * log(dat$obs) + dat$eij
dat <- ddply(dat, .(id), function(x){
  x$alpha <- ifelse(runif(n = 1) > 0.9, 1, 0.1)
  x$grouper <- factor(rbinom(n = 1, size = 3, prob = 0.5), levels = 0:3)
  x
})
tspag <- ggplot(dat, aes(x = obs, y = yij)) +
  geom_line() + guides(colour = FALSE) +
  xlab("Observation Time Point") + ylab("Y")
spag <- tspag + aes(colour = factor(id))
spag
bwspag <- tspag + aes(group = factor(id))
bwspag
I've tried scale_colour_manual, and I've tried defining the color within the aes statement on the bwspag line, with no luck. I'm relatively inexperienced with R, so I appreciate any assistance!
Do you want to make the lines grayscale? If so, setting the colour in the geom_line() call should be enough. For example:
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_line(colour = "gray40")
You can choose other shades of gray, from gray0 to gray100. More info here.
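Applied to the spaghetti plot from the question, a minimal sketch (keeping the per-id grouping that bwspag uses):
# Rebuild the black-and-white spaghetti plot with light gray lines
ggplot(dat, aes(x = obs, y = yij, group = factor(id))) +
  geom_line(colour = "gray80") +
  xlab("Observation Time Point") + ylab("Y")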

R: Determine the threshold that maximally separates two groups based on a continuous variable?

Say I have 200 subjects, 100 in group A and 100 in group B, and for each I measure some continuous parameter.
require(ggplot2)
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))
data <- data.frame(value, group)
ggplot(data = data, aes(x = value)) +
  geom_bar(aes(color = group))
I would like to determine the value (Threshold? Breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R?
I've tried searching along the lines of "r breakpoint maximal separation between groups" and "r threshold minimize misclassification," but my google-fu seems to be off today.
EDIT:
Responding to @Thomas's comment, I have tried to fit the data using logistic regression and then solve for the threshold, but I haven't gotten very far.
lr <- glm(group~value)
coef(lr)
# (Intercept) value
# 1.1857435 -0.0911762
So B0 = 1.1857435 and B1 = -0.0911762.
From Wikipedia, F(x) = 1 / (1 + e^-(B0 + B1*x)), and solving for x:
x = (ln(F(x) / (1 - F(x))) - B0) / B1
But trying this in R, I get an obviously incorrect answer:
(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497
A simple approach is to write a function that calculates the accuracy given a threshold:
accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))
Then find the maximum using optimize:
optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
# $maximum
# [1] 8.050888
#
# $objective
# [1] 0.86
I've gotten the answer I need thanks to help from @Thomas and @BenBolker.
Summary
The problem with my attempt at solving it through logistic regression was that I hadn't specified family = binomial.
The dose.p() function in MASS will do the work for me, given a glm fit.
Code
# Include libraries
require(ggplot2)
require(MASS)
# Set seed
set.seed(100)
# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)
# Plot the distribution -- visually
# The answer appears to be b/t 7 and 8
ggplot(data = data, aes(x = value)) +
  geom_bar(aes(color = group))
# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group~value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]
# See what the probability function looks like
lr <- function(x, b0, b1) {
  prob <- 1 / (1 + exp(-1 * (b0 + b1 * x)))
  return(prob)
}
# The line appears to cross 0.5 just above 7.5
x <- 0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
  geom_line()
# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
  x <- (log(p / (1 - p)) - b0) / b1
  return(x)
}
# With the betas from this function, we get 7.686814
inv.lr(0.5, b0, b1)
# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)
Thanks, everyone, for your help!
