Calculate area under a curve below a certain threshold in R

I'm trying to calculate the area below a certain point, and unsure how to do that. I've seen this question, but it's not exactly answering what I'm looking for.
Here is some example data...
library(tidyverse)

test_df <- structure(list(
  time = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
           19, 20, 21, 22, 23),
  balance = c(27, -45, -118, -190, -263, -343, -424, -1024, -434, -533, -613,
              -694, -775, -355, -436, -516, -597, -77, -158, -239, -319, -400,
              -472, -545)),
  row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame")) %>%
  as_tibble()
ggplot(test_df, aes(time, balance)) +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = -400)
I'd like to calculate the AUC for the trend line, but only for when it is below a certain threshold (-400, for example).
So I can extract the values for the smoothed line...
test_plot <- ggplot(test_df, aes(time, balance)) +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = -400)

ggp_data <- ggplot_build(test_plot)$data[[1]]
and use something like this to get an AUC value
MESS::auc(ggp_data$x, ggp_data$y)
My questions are:
How to only calculate below -400?
How to interpret the value?
What units would it be in?
If my x axis is in hours, is there a way to turn the value into an hour value?
Thanks!

To calculate the area relative to a threshold rather than zero, shift your y-values by subtracting the threshold: for a negative threshold like -400 that means adding 400, for a positive threshold it means subtracting it. For your case that would look like this:
MESS::auc(ggp_data$x, ggp_data$y+400)
However, this calculates the AUC from 0 to 23 and therefore also includes the parts that lie above -400. To get the AUC only for the part below your threshold, you have to find the x-values where the smoothed line intersects the horizontal line at -400. Inspecting the plot by eye, you could use the following approximations of these x-values:
x1 <- 4.45
x2 <- 15.45
x3 <- 21.35
Now we have to calculate the AUC between x1 and x2, and between x3 and max(x), and add these values together:
AUC1 <- MESS::auc(ggp_data$x, ggp_data$y+400, from = x1, to = x2)
AUC2 <- MESS::auc(ggp_data$x, ggp_data$y+400, from = x3, to = max(ggp_data$x))
AUC.total <- AUC1 + AUC2
> AUC.total
[1] -1747.352
Note that the value is negative because the curve lies below 0. There are no "negative areas", so you can take the absolute value, AUC.total = 1747.352, to proceed. The units are the product of your axis units, so with hours on the x-axis the result is in (balance units) × hours; beyond that, without knowing what your y-axis measures one cannot interpret the value much further.
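If you would rather not read the crossing points off the plot by hand, a rough alternative (a sketch of my own, not part of the answer above) is to clip the shifted curve at the threshold and integrate only what remains below it with a simple trapezoidal rule:
# Sketch: everything above -400 is clipped to zero, so only the sections
# below the threshold contribute to the integral
y_clip <- pmin(ggp_data$y + 400, 0)
auc_below <- sum(diff(ggp_data$x) * (head(y_clip, -1) + tail(y_clip, -1)) / 2)
abs(auc_below)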

Related

Multiply probability distributions in R [closed]

I'm trying to multiply some probability density functions to update a probability given certain factors. I've tried several things using the pdqr and bayesmeta packages, but none of them work out the way I intend. What am I missing?
Below is a reproducible example with two different distributions, a and b, which I want to multiply. As you can see, b has no measurements at the low values, so the probability there is 0; this should be reflected in the updated distribution.
library(tidyverse)
library(pdqr)
library(bayesmeta)
#measurements
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)
#create probability distribution functions
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
#try to combine distributions
summarized <- distr_a + distr_b
multiplied <- distr_a * distr_b
mixture <- form_mix(list(distr_a, distr_b))
convolution <- convolve(distr_a, distr_b)
The resulting PDFs are plotted like this:
bayesmeta::convolve() does the same as summing two pdqr PDFs, and both seem to oddly shift the distribution to the right and make it not as high as it is supposed to be.
Ordinarily multiplying the pdqr PDFs leaves a very low probability overall.
Using pdqr::form_mix() seems to even the PDFs out in between, but leaves probabilities above 0 for the lower x-values.
So, I tried to gain some insight in what I wanted to do, by using the PDF's for a and b to generate probabilities for each x value and multiply that:
#multiply distributions manually
x <- c(1:10)
manual <- data.frame(x) %>%
  mutate(a = distr_a(x),
         b = distr_b(x),
         multiplied = a * b)
This indeed gives the shape I am after; however, it (logically) has probabilities that are too low:
I would like to multiply (multiple) PDFs. What am I doing wrong? Are my statistics wrong, or am I missing a useful function?
UPDATE:
It seems I am a stats noob on this subject, but I would like to achieve something like the distribution below. Given that both situations a and b are true, I would expect the distribution to be something like the dotted line. Is that possible?
multiplied is the correct one. One can check with log-normal distributions: the product of two independent log-normal random variables is log-normal with μ = μ_a + μ_b and σ² = σ²_a + σ²_b.
a <- rlnorm(25000, meanlog = 0, sdlog = 1)
b <- rlnorm(25000, meanlog = 1, sdlog = 1)
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
distr_ab <- form_trans(
  list(distr_a, distr_b), trans = function(x, y) x * y
)
# or: distr_ab <- distr_a * distr_b
plot(distr_ab, xlim = c(0, 40))
curve(dlnorm(x, meanlog = 1, sdlog = sqrt(2)), add = TRUE, col = "red")
As demonstrated here:
https://www.r-bloggers.com/2019/05/bayesian-models-in-r-2/
# Example distributions
probs <- seq(0,1,length.out= 100)
prior <- dbinom(x = 8, prob = probs, size = 10)
lik <- dnorm(x = probs, mean = .5, sd = .1)
# Multiply distributions
unstdPost <- lik * prior
# If you wanted to get an actual posterior, it must be a probability
# distribution (integrate to 1), so we can divide by the sum:
stdPost <- unstdPost / sum(unstdPost)
# Plot
plot(probs, prior, col = "black",
     type = "l", xlab = "P(Black)", ylab = "Density")
lines(probs, lik / 15, col = "red")   # likelihood rescaled to fit the same axes
lines(probs, unstdPost, col = "green")
lines(probs, stdPost, col = "blue")
legend("topleft", legend = c("Prior", "Lik", "Unstd Post", "Post"),
       text.col = 1:4, bty = "n")
Created on 2022-08-06 by the reprex package (v2.0.1)
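Applying the same multiply-and-renormalize idea to the distr_a and distr_b objects from the question could look like the sketch below. It assumes new_d() accepts a data frame of x/density pairs for "continuous" distributions; the grid limits are an arbitrary choice that covers both samples.
# Sketch: evaluate both pdqr density functions on a common grid, multiply
# pointwise, renormalize so the result integrates to 1, and rebuild a
# pdqr distribution from the x/density pairs
grid <- seq(0, 12, length.out = 512)
dens <- distr_a(grid) * distr_b(grid)
dens <- dens / sum(dens * (grid[2] - grid[1]))   # numeric renormalization
distr_post <- new_d(data.frame(x = grid, y = dens), type = "continuous")
plot(distr_post)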

R-hat against iterations RStan

I am trying to generate a similar plot as below to show the change in R-hat over iterations:
I have tried the following options:
summary(fit1)$summary: gives R-hat with all chains merged
summary(fit1)$c_summary: gives R-hat for each chain individually
Can you please help me to get R-hat for each iteration for a given parameter?
rstan provides the Rhat() function, which takes a matrix of iterations x chains and returns R-hat. We can extract this matrix from the fitted model and apply Rhat() cumulatively over it. The code below uses the 8 schools model as an example (copied from the getting started guide).
library(tidyverse)
library(purrr)
library(rstan)
theme_set(theme_bw())
# Fit the 8 schools model.
schools_dat <- list(J = 8,
                    y = c(28, 8, -3, 7, -1, 1, 18, 12),
                    sigma = c(15, 10, 16, 11, 9, 11, 10, 18))
fit <- stan(file = 'schools.stan', data = schools_dat)

# Extract draws for mu as a matrix; columns are chains and rows are iterations.
mu_draws = as.array(fit)[, , "mu"]

# Get the cumulative R-hat as of each iteration.
mu_rhat = map_dfr(
  1:nrow(mu_draws),
  function(i) {
    return(data.frame(iteration = i,
                      rhat = Rhat(mu_draws[1:i, ])))
  }
)

# Plot iteration against R-hat.
mu_rhat %>%
  ggplot(aes(x = iteration, y = rhat)) +
  geom_line() +
  labs(x = "Iteration", y = expression(hat(R)))

Set Acceptable Region for My Skewness Test in R

I am writing the function below to let me conduct a test of skewness for a vector of sample sizes (10, 20, 50, 100) with 1000 replicates each.
library(moments)
out <- t(sapply(c(10, 20, 50, 100), function(x)
table(replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5))) > 2)))
row.names(out) <- c(10, 20, 50, 100)
out
My conditions
My condition for rejecting the null hypothesis is that the statistic must meet either of two conditions:
less than -2
or greater than +2.
What I have
But in my R function I can only describe the second condition.
What I want
How do I include both the first and the second condition in my function?
Perhaps adding abs() would be the easiest approach to meeting both conditions:
out <- t(sapply(c(10, 20, 50, 100), function(x)
table(abs(unlist(replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5))))) > 2)))
row.names(out) <- c(10, 20, 50, 100)
out
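If you would rather state both tail conditions explicitly instead of using abs(), a sketch of the same simulation could look like this (the factor() call is my addition so the table always has both FALSE and TRUE columns, even when one tail never occurs):
library(moments)

out2 <- t(sapply(c(10, 20, 50, 100), function(x) {
  s <- replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5)))
  table(factor(s < -2 | s > 2, levels = c(FALSE, TRUE)))
}))
row.names(out2) <- c(10, 20, 50, 100)
out2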

R: apply the pclm function

I am having trouble applying the Penalized Composite Link Model (PCLM) function, which only works with vectors. I use the pclm function to generate single years of age (syoa) population data from 5-year age group population data.
pclm() can be installed by following the instructions given by the author on https://github.com/mpascariu/ungroup.
Usage of the function:
pclm(x, y, nlast, control = list())
- x: vector of the cumulative sum points of the sequence in y.
- y: vector of values to be ungrouped.
- nlast: length of the last interval.
- control: list with additional parameters.
Here's my training dataset:
data<-data.frame(
GEOID= c(1,2),
name= c("A","B"),
"Under 5 years"= c(17,20),
"5-9 years"= c(82,90),
"10-14 years"= c(18, 22),
"15-19 years"= c(90,88),
"20-24 years"= c(98, 100),
check.names=FALSE)
#generating a data.frame storing the fitted values from the pclm for the first row: GEOID=1.
#using the values directly
syoa <- data.frame(fitted(pclm(x=c(0, 5, 10, 15, 20), y=c(17,82,18,90,98), nlast=5, control = list(lambda = .1, deg = 3, kr = 1))))
#or referring to the vector by its rows and columns
syoa <- data.frame(fitted(pclm(x=c(0, 5, 10, 15, 20), y=c(data[1,3:7]), nlast=5, control = list(lambda = .1, deg = 3, kr = 1))))
As my data have many observations, I'd like to apply the pclm() function across all the rows for columns 3-7: data[,3:7].
apply(data[3:7], 1, pclm(x=c(0, 5, 10, 15, 20), y=c(data[,3:7]), nlast=5, control = list(lambda = .1, deg = 3, kr = 1)))
but it's not working and gives the following error message:
Error in eval(substitute(expr), data, enclos = parent.frame()) :
(list) object cannot be coerced to type 'double'
I don't know whether the issue is related to apply() or to the pclm() function. Can anyone help? Thanks.
It's easier than I thought.
pclm <- data.frame(apply(data[3:7], 1, function(x){
  pclm <- pclm(x = c(0, 5, 10, 15, 20), y = c(x), nlast = 5,
               control = list(lambda = NA, deg = 3, kr = 1))
  round(fitted(pclm))
}))
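A small readability tweak of my own, not part of the answer above: avoid reusing the name pclm for the result, and label each output column with its row's GEOID, for example:
# Sketch: same apply() idea, but without reusing the name `pclm`, and with
# the output columns named after each row's GEOID
syoa_all <- data.frame(apply(data[3:7], 1, function(y) {
  fit <- pclm(x = c(0, 5, 10, 15, 20), y = as.numeric(y), nlast = 5,
              control = list(lambda = NA, deg = 3, kr = 1))
  round(fitted(fit))
}))
names(syoa_all) <- data$GEOID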

Extracting slopes from earth model

I have some data for which I've used the earth model. I'm interested in the slopes of the different line segments, but looking at the model summary I don't get the values I expect.
library(earth)
library(dplyr)
library(ggplot2)
d = structure(list(
  x = c(9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
        26, 27, 28, 29, 30),
  y = c(0.151534750704409, 0.0348452707597105, -0.0913494247372798,
        -0.214465577974757, -0.365251164825619, -0.528214103496014,
        -0.614970081844732, -0.922572314358796, -1.15911158401926,
        -1.36432638285029, -1.51587576144429, -1.63708705686248,
        -1.7530889072188, -1.86142968143915, -1.98159646754281,
        -2.0994478459505, -2.23037530743309, -2.3421669680425,
        -2.40621060828366, -2.55432043723978, -2.73246980567199,
        -2.92496136528975)),
  .Names = c("x", "y"), row.names = c(NA, -22L),
  class = c("tbl_df", "tbl", "data.frame"))
mod = earth(y ~ x, data = d)
d$pred = predict(mod, newdata = d)
summary(mod, style = 'pmax')
this gives me this summary:
Call: earth(formula=y~x, data=d)
y =
-1.314958
- 0.06811314 * pmax(0, x - 16)
+ 0.1518165 * pmax(0, 19 - x)
- 0.05124021 * pmax(0, x - 19)
Selected 4 of 4 terms, and 1 of 1 predictors
Termination condition: RSq changed by less than 0.001 at 4 terms
Importance: x
Number of terms at each degree of interaction: 1 3 (additive model)
GCV 0.004496406 RSS 0.04598597 GRSq 0.9953947 RSq 0.9976504
However, when I plot my model, the three different slopes all look negative:
ggplot(d, aes(x, y)) +
geom_point() +
geom_line(aes(x, pred)) +
theme(aspect.ratio = 1)
How do I get the values for those 3 negative slopes?
mod$coefficients gives the coefficients. If a coefficient is on a -x term (such as pmax(0, 19 - x)), the slope with respect to x is the negative of that coefficient. You can do mod$coefficients %>% {ifelse(grepl('-x', rownames(.)), -., .)} to get the slopes (or just mentally reverse the signs for the terms containing -x).
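As a sanity check of my own, not from the answer above: the slope of the fitted curve on any segment is the sum of all hinge terms active there, so you can also read the slopes off numerically from predictions taken inside each segment:
# Sketch: estimate per-segment slopes from predictions; 12, 17.5 and 25 each
# sit inside one of the three segments implied by the knots at 16 and 19
# reported in summary(mod)
slope_at <- function(m) {
  p <- predict(mod, newdata = data.frame(x = c(m, m + 0.5)))
  (p[2] - p[1]) / 0.5
}
sapply(c(12, 17.5, 25), slope_at)
# roughly -0.152, -0.220 and -0.119, i.e. the hinge coefficients summed per segment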
