I have a dataframe (gdata) with x (as "r") and y (as "km") coordinates of a function.
When I plot it like this:
plot(x = gdata$r, y = gdata$km, type = "l")
I get the graph of the function:
Now I want to calculate the area under the curve from x = 0 to x = 0.6. When I look for appropriate packages, I only find ones for calculating the AUC of a ROC curve. Is there a way to calculate the AUC of an ordinary function?
The area under the curve (AUC) of a given set of data points can be obtained using numeric integration:
Let data be your data frame containing x and y values. You can get the area under the curve from the lower bound x0 = 0 to the upper bound x1 = 0.6 by integrating a function that linearly interpolates your data.
This is a numeric approximation and not exact, because we do not have an infinite number of data points: for y = sqrt(x) we get 0.3033 instead of the true value 0.3098. With 200 rows in data the result already improves to auc = 0.3096.
library(tidyverse)
data <-
  tibble(
    x = seq(0, 2, length.out = 20)
  ) %>%
  mutate(y = sqrt(x))
data
#> # A tibble: 20 × 2
#>        x     y
#>    <dbl> <dbl>
#>  1 0     0
#>  2 0.105 0.324
#>  3 0.211 0.459
#>  4 0.316 0.562
#>  5 0.421 0.649
#>  6 0.526 0.725
#>  7 0.632 0.795
#>  8 0.737 0.858
#>  9 0.842 0.918
#> 10 0.947 0.973
#> 11 1.05  1.03
#> 12 1.16  1.08
#> 13 1.26  1.12
#> 14 1.37  1.17
#> 15 1.47  1.21
#> 16 1.58  1.26
#> 17 1.68  1.30
#> 18 1.79  1.34
#> 19 1.89  1.38
#> 20 2     1.41
qplot(x, y, data = data)
integrate(approxfun(data$x, data$y), 0, 0.6)
#> 0.3033307 with absolute error < 8.8e-05
Created on 2021-10-03 by the reprex package (v2.0.1)
The absolute error returned by integrate is correct only if the true function is perfectly linear between every two data points, as the interpolation assumes.
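If you prefer not to go through integrate(), the same linearly interpolated area can also be computed directly with the trapezoidal rule. A minimal base-R sketch (trap_auc is a made-up helper name for illustration):
trap_auc <- function(x, y, from, to) {
  # restrict to the integration window and add interpolated endpoints
  xs <- sort(unique(c(x[x >= from & x <= to], from, to)))
  ys <- approx(x, y, xout = xs)$y
  # each trapezoid contributes its width times the mean of its two heights
  sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)
}
trap_auc(data$x, data$y, 0, 0.6)
# should agree with the integrate() result above (~0.3033)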
I used the package MESS to solve the problem:
# Toy example
library(MESS)
x <- seq(0, 3, by = 0.1)
y <- x^2
# area under y = x^2 between x = 0.1 and x = 2, via spline interpolation
auc(x, y, from = 0.1, to = 2, type = "spline")
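For comparison with the linear result quoted below, the same call with type = "linear" uses trapezoidal interpolation instead:
auc(x, y, from = 0.1, to = 2, type = "linear")
# overestimates slightly here, since chords lie above a convex curve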
The analytical result is 7999/3000, which is approximately 2.666333.
The R script above gives 2.66632 using the spline approximation and 2.6695 using the linear approximation.
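As a sanity check, the analytical value can be reproduced with base R's integrate() applied to the true function, no extra package needed:
# integrating the exact function rather than interpolated data points
integrate(function(x) x^2, lower = 0.1, upper = 2)$value
# [1] 2.666333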
I am trying to take the square root of negative numbers. I take the absolute value of the data and, for positive numbers, use the square root of the absolute value directly; otherwise I add a negative sign to the result. However, all the numbers I get are negative.
My code and its results were shown as screenshots.
I expect both negative and positive results, but I only get negative numbers.
Library and Data
Not sure exactly what you are doing, because your original data frame isn't included in the question. However, I have simulated a dataset that should emulate what you want. First, I load the tidyverse package for data wrangling (creating and manipulating variables), then set a random seed so you can reproduce the simulated data.
#### Load Library ####
library(tidyverse)
#### Set Random Seed ####
set.seed(123)
Now I create normally distributed x values, which can be both positive and negative.
#### Create Randomly Distributed X w/Neg Values ####
tib <- tibble(
  x = rnorm(n = 100)
)
Creating Variables
Now we can make absolute values, followed by square roots, which are made negative if the original raw value was negative.
#### Create Absolute and Sqrt Values ####
new.tib <- tib %>%
  mutate(
    abs.x = abs(x),          # absolute value
    sq.x = sqrt(abs.x),      # square root of the absolute value
    final.x = ifelse(x < 0,  # restore the original sign
      sq.x * -1,
      sq.x
    )
  )
new.tib
If you print new.tib, the end result will look like this:
# A tibble: 100 × 4
         x  abs.x  sq.x final.x
     <dbl>  <dbl> <dbl>   <dbl>
 1  2.20   2.20   1.48    1.48
 2  1.31   1.31   1.15    1.15
 3 -0.265  0.265  0.515  -0.515
 4  0.543  0.543  0.737   0.737
 5 -0.414  0.414  0.644  -0.644
 6 -0.476  0.476  0.690  -0.690
 7 -0.789  0.789  0.888  -0.888
 8 -0.595  0.595  0.771  -0.771
 9  1.65   1.65   1.28    1.28
10 -0.0540 0.0540 0.232  -0.232
If you just want to select the final x values, you can simply select them, like so:
new.tib %>%
  select(final.x)
Giving you just this vector:
# A tibble: 100 × 1
   final.x
     <dbl>
 1   1.48
 2   1.15
 3  -0.515
 4   0.737
 5  -0.644
 6  -0.690
 7  -0.888
 8  -0.771
 9   1.28
10  -0.232
# … with 90 more rows
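If a plain numeric vector is preferred over a one-column tibble, dplyr::pull() extracts it; a small optional extra:
final.vec <- new.tib %>% pull(final.x)
head(final.vec) # same values as the final.x column printed above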
Using the first example in ?ifelse:
x <- c(6:-4)
[1] 6 5 4 3 2 1 0 -1 -2 -3 -4
sqrt(ifelse(x >= 0, x, -x))
[1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000
[7] 0.000000 1.000000 1.414214 1.732051 2.000000
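If the goal is a signed square root as in the question (negative inputs giving negative outputs), base R's sign() offers a compact alternative to the ifelse() construction:
x <- c(6:-4)
sign(x) * sqrt(abs(x))
[1]  2.449490  2.236068  2.000000  1.732051  1.414214  1.000000
[7]  0.000000 -1.000000 -1.414214 -1.732051 -2.000000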
I have three columns, one per group, with numeric values. I want to analyze them using an ANOVA, but the examples I found have the different groups in one column and the respective values in a second column. I wonder whether it is necessary to reorder the data like that, or whether there is a method I can use with the columns I currently have. I attached a screenshot:
Thanks!
You can convert a wide table having many columns into another table having only two columns for key (group) and value (response) by pivoting the data:
library(tidyverse)
# create example data
set.seed(1337)
data <- tibble(
  VIH = runif(100),
  VIH2 = runif(100),
  VIH3 = runif(100)
)
data
#> # A tibble: 100 × 3
#>       VIH   VIH2   VIH3
#>     <dbl>  <dbl>  <dbl>
#>  1 0.576  0.485  0.583
#>  2 0.565  0.495  0.108
#>  3 0.0740 0.868  0.350
#>  4 0.454  0.833  0.324
#>  5 0.373  0.242  0.915
#>  6 0.331  0.0694 0.0790
#>  7 0.948  0.130  0.563
#>  8 0.281  0.122  0.287
#>  9 0.245  0.270  0.419
#> 10 0.146  0.488  0.838
#> # … with 90 more rows
data %>%
  pivot_longer(everything()) %>%
  aov(value ~ name, data = .)
#> Call:
#>    aov(formula = value ~ name, data = .)
#>
#> Terms:
#>                      name Residuals
#> Sum of Squares   0.124558 25.171730
#> Deg. of Freedom         2       297
#>
#> Residual standard error: 0.2911242
#> Estimated effects may be unbalanced
Created on 2022-05-10 by the reprex package (v2.0.0)
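The printout above only shows the sums of squares; to see the F statistic and p-value, wrap the same fit in summary() (a small follow-up to the pipeline above):
data %>%
  pivot_longer(everything()) %>%
  aov(value ~ name, data = .) %>%
  summary()
# prints the ANOVA table with Df, Sum Sq, Mean Sq, F value and Pr(>F)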
I'm trying to use ggeffects::ggpredict to make some effects plots for my model. I find that the standard errors and confidence limits are missing for many of the results. I can reproduce the problem with some simulated data. It seems to happen specifically for observations where the standard error puts the predicted probability close to 0 or 1.
I tried to get predictions on the link scale to diagnose whether it's a problem with the translation from link to response, but I don't believe this is supported by the package.
Any ideas how to address this? Many thanks.
library(tidyverse)
library(lme4)
library(ggeffects)
# number of simulated observations
n <- 1000
# simulated data with a numerical predictor x, factor predictor f, response y
# the simulated effects of x and f are somewhat weak compared to the noise, so expect high standard errors
df <- tibble(
  x = seq(-0.1, 0.1, length.out = n),
  g = floor(runif(n) * 3),
  f = letters[1 + g] %>% as.factor(),
  y = pracma::sigmoid(x + (runif(n) - 0.5) + 0.1 * (g - mean(g))),
  z = if_else(y > 0.5, "high", "low") %>% as.factor()
)
# glmer model
model <- glmer(z ~ x + (1 | f), data = df, family = binomial)
print(summary(model))
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#>   Approximation) [glmerMod]
#>  Family: binomial  ( logit )
#> Formula: z ~ x + (1 | f)
#>    Data: df
#>
#>      AIC      BIC   logLik deviance df.resid
#>   1373.0   1387.8   -683.5   1367.0      997
#>
#> Scaled residuals:
#>     Min      1Q  Median      3Q     Max
#> -1.3858 -0.9928  0.7317  0.9534  1.3600
#>
#> Random effects:
#>  Groups Name        Variance Std.Dev.
#>  f      (Intercept) 0.0337   0.1836
#> Number of obs: 1000, groups:  f, 3
#>
#> Fixed effects:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  0.02737    0.12380   0.221    0.825
#> x           -4.48012    1.12066  -3.998 6.39e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#>   (Intr)
#> x -0.001
# missing standard errors
ggpredict(model, c("x", "f")) %>% print()
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # Predicted probabilities of z
#>
#> # f = a
#>
#>     x | Predicted |       95% CI
#> --------------------------------
#> -0.10 |      0.62 | [0.54, 0.69]
#>  0.00 |      0.51 |
#>  0.10 |      0.40 |
#>
#> # f = b
#>
#>     x | Predicted |       95% CI
#> --------------------------------
#> -0.10 |      0.62 | [0.56, 0.67]
#>  0.00 |      0.51 |
#>  0.10 |      0.40 |
#>
#> # f = c
#>
#>     x | Predicted |       95% CI
#> --------------------------------
#> -0.10 |      0.62 | [0.54, 0.69]
#>  0.00 |      0.51 |
#>  0.10 |      0.40 |
ggpredict(model, c("x", "f")) %>% as_tibble() %>% print(n = 20)
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # A tibble: 9 x 6
#>       x predicted std.error conf.low conf.high group
#>   <dbl>     <dbl>     <dbl>    <dbl>     <dbl> <fct>
#> 1  -0.1     0.617     0.167    0.537     0.691 a
#> 2  -0.1     0.617     0.124    0.558     0.672 b
#> 3  -0.1     0.617     0.167    0.537     0.691 c
#> 4   0       0.507    NA       NA        NA     a
#> 5   0       0.507    NA       NA        NA     b
#> 6   0       0.507    NA       NA        NA     c
#> 7   0.1     0.396    NA       NA        NA     a
#> 8   0.1     0.396    NA       NA        NA     b
#> 9   0.1     0.396    NA       NA        NA     c
Created on 2022-04-12 by the reprex package (v2.0.1)
I think this may be due to the singular model fit.
I dug down into the guts of the code as far as here, where there appears to be a mismatch between the dimensions of the covariance matrix of the predictions (3x3) and the number of predicted values (15).
I further suspect that the problem may happen here:
rows_to_keep <- as.numeric(rownames(unique(model_matrix_data[
  intersect(colnames(model_matrix_data), terms)
])))
Perhaps the function is getting confused because the conditional modes/BLUPs for every group are the same (which will only be true, generically, when the random effects variance is zero) ... ?
This seems worth opening an issue on the ggeffects issues list ?
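If you want to check the singular-fit hypothesis on your own model, lme4 has direct diagnostics; a quick sketch, assuming the model object from the question:
library(lme4)
isSingular(model) # TRUE when a variance component is estimated at (or near) zero
VarCorr(model)    # inspect the estimated random-effect standard deviation directly
ranef(model)      # identical conditional modes across groups point the same way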
BLUF: I am struggling to figure out how to use batching in the R targets package to improve performance in a pipeline with static and dynamic branching, processed in parallel using tar_make_future(). I presume that I need to batch within each dynamic branch, but I am unsure how to go about doing that.
Here's a reprex that uses dynamic branching nested inside static branching, similar to what my actual pipeline is doing. It first branches statically for each value in all_types, and then dynamically branches within each category. This code produces 1,000 branches and 1,010 targets total. In the actual workflow I obviously don't use replicate, and the dynamic branches vary in number depending on the type value.
# _targets.R
library(targets)
library(tarchetypes)
library(future)
library(future.callr)
plan(callr)

all_types <- data.frame(type = LETTERS[1:10])

tar_map(
  values = all_types, names = "type",
  tar_target(
    make_data,
    replicate(
      100,
      data.frame(
        x = seq(1000) + rnorm(1000, 0, 5),
        y = seq(1000) + rnorm(1000, 20, 20)
      ),
      simplify = FALSE
    ),
    iteration = "list"
  ),
  tar_target(
    fit_model,
    lm(y ~ x, data = make_data), # one model per dataset in the list
    pattern = map(make_data),
    iteration = "list"
  )
)
And here's a timing comparison of tar_make() vs tar_make_future() with eight workers:
# tar_destroy()
t1 <- system.time(tar_make())
# tar_destroy()
t2 <- system.time(tar_make_future(workers = 8))
rbind(serial = t1, parallel = t2)
##           user.self sys.self elapsed user.child sys.child
## serial         2.12     0.11   25.59         NA        NA
## parallel       2.07     0.24  184.68         NA        NA
I don't think the user or system fields are useful here, since the job gets dispatched to separate R processes, but the parallel job's elapsed time is about 7 times longer than the serial job's.
I presume this slowdown is caused by the large number of targets. Will batching improve performance in this case, and if so how can I implement batching within the dynamic branch?
You are on the right track with batching. In your case, that is a matter of breaking up your list of 100 datasets into groups of, say, 10 or so. You could do this with a nested list of datasets, but that's a lot of work. Luckily, there is an easier way.
Your question is actually really well-timed. I just wrote some new target factories in tarchetypes that could help. To access them, you will need the development version of tarchetypes from GitHub:
remotes::install_github("ropensci/tarchetypes")
Then, with tar_map2_count(), it will be much easier to batch your list of 100 datasets for each scenario.
library(targets)
tar_script({
  library(broom)
  library(targets)
  library(tarchetypes)
  library(tibble)
  make_data <- function(n) {
    datasets_per_batch <- replicate(
      100,
      tibble(
        x = seq(n) + rnorm(n, 0, 5),
        y = seq(n) + rnorm(n, 20, 20)
      ),
      simplify = FALSE
    )
    tibble(dataset = datasets_per_batch, rep = seq_along(datasets_per_batch))
  }
  tar_map2_count(
    name = model,
    command1 = make_data(n = rows),
    command2 = tidy(lm(y ~ x, data = dataset)), # Need dataset[[1]] in tarchetypes 0.4.0
    values = data_frame(
      scenario = LETTERS[seq_len(10)],
      rows = seq(10, 100, length.out = 10)
    ),
    columns2 = NULL,
    batches = 10
  )
})
tar_make(reporter = "silent")
#> Warning message:
#> `data_frame()` was deprecated in tibble 1.1.0.
#> Please use `tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
tar_read(model)
#> # A tibble: 2,000 × 8
#>    term        estimate std.error statistic   p.value scenario  rows tar_group
#>    <chr>          <dbl>     <dbl>     <dbl>     <dbl> <chr>    <dbl>     <int>
#>  1 (Intercept)   17.1      12.8       1.34  0.218     A           10        10
#>  2 x              1.39      1.35      1.03  0.333     A           10        10
#>  3 (Intercept)    6.42     14.0       0.459 0.658     A           10        10
#>  4 x              1.75      1.28      1.37  0.209     A           10        10
#>  5 (Intercept)   32.8       7.14      4.60  0.00176   A           10        10
#>  6 x             -0.300     1.14     -0.263 0.799     A           10        10
#>  7 (Intercept)   29.7       3.24      9.18  0.0000160 A           10        10
#>  8 x              0.314     0.414     0.758 0.470     A           10        10
#>  9 (Intercept)   20.0      13.6       1.47  0.179     A           10        10
#> 10 x              1.23      1.77      0.698 0.505     A           10        10
#> # … with 1,990 more rows
Created on 2021-12-10 by the reprex package (v2.0.1)
There is also tar_map_rep(), which may be easier if all your datasets are randomly generated, but I am not sure if I am overfitting your use case.
library(targets)
tar_script({
  library(broom)
  library(targets)
  library(tarchetypes)
  library(tibble)
  make_one_dataset <- function(n) {
    tibble(
      x = seq(n) + rnorm(n, 0, 5),
      y = seq(n) + rnorm(n, 20, 20)
    )
  }
  tar_map_rep(
    name = model,
    command = tidy(lm(y ~ x, data = make_one_dataset(n = rows))),
    values = data_frame(
      scenario = LETTERS[seq_len(10)],
      rows = seq(10, 100, length.out = 10)
    ),
    batches = 10,
    reps = 10
  )
})
tar_make(reporter = "silent")
#> Warning message:
#> `data_frame()` was deprecated in tibble 1.1.0.
#> Please use `tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
tar_read(model)
#> # A tibble: 2,000 × 10
#>    term    estimate std.error statistic p.value scenario  rows tar_batch tar_rep
#>    <chr>      <dbl>     <dbl>     <dbl>   <dbl> <chr>    <dbl>     <int>   <int>
#>  1 (Inter…   37.5       7.50     5.00   0.00105 A           10         1       1
#>  2 x         -0.701     1.17    -0.601  0.564   A           10         1       1
#>  3 (Inter…   21.5       9.64     2.23   0.0567  A           10         1       2
#>  4 x         -0.213     1.55    -0.138  0.894   A           10         1       2
#>  5 (Inter…   20.6       9.51     2.17   0.0620  A           10         1       3
#>  6 x          1.40      1.79     0.783  0.456   A           10         1       3
#>  7 (Inter…   11.6      11.2      1.04   0.329   A           10         1       4
#>  8 x          2.34      1.39     1.68   0.131   A           10         1       4
#>  9 (Inter…   26.8       9.16     2.93   0.0191  A           10         1       5
#> 10 x          0.288     1.10     0.262  0.800   A           10         1       5
#> # … with 1,990 more rows, and 1 more variable: tar_group <int>
Created on 2021-12-10 by the reprex package (v2.0.1)
Unfortunately, futures do come with overhead. Maybe it will be faster in your case if you try tar_make_clustermq()?
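For reference, a minimal sketch of switching to clustermq (assumes the clustermq package is installed; the scheduler and worker count are illustrative):
# in _targets.R: pick a scheduler; "multiprocess" runs local worker processes
options(clustermq.scheduler = "multiprocess")
# then, instead of tar_make_future():
tar_make_clustermq(workers = 8)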
I am using pROC to provide the ROC analysis of blood tests. I have calculated the ROC curve, AUC and am using the ci.coords function to provide the spec, sens, PPV and NPV at a provided specificity (with 95% CI).
I would like to be able to say at what value of the blood test this occurs; for instance, at 1.2 the sens is x, spec is y, NPV is c, PPV is d. Ideally I would have the data for a table like:
Lab value | Sens | Spec | NPV | PPV
I don't seem to be able to get this from the methodology I am currently using.
Does anyone have any suggestions?
Many thanks
Currently:
spred1 <- predict(smodel1)
sroc1 <- roc(EditedDF1$any_abnormality, spred1)
ci.coords(sroc1, x = 0.95, input = "sensitivity", transpose = FALSE,
  ret = c("sensitivity", "specificity", "ppv", "npv"))
As you gave no reproducible example, let's use the one that comes with the package:
library(pROC)
data(aSAH)
roc1 <- roc(aSAH$outcome, aSAH$s100b)
The package comes with the function coords, which lists specificity and sensitivity at different thresholds:
> coords(roc1)
   threshold specificity sensitivity
1       -Inf  0.00000000  1.00000000
2      0.035  0.00000000  0.97560976
3      0.045  0.06944444  0.97560976
4      0.055  0.11111111  0.97560976
5      0.065  0.13888889  0.97560976
6      0.075  0.22222222  0.90243902
7      0.085  0.30555556  0.87804878
8      0.095  0.38888889  0.82926829
9      0.105  0.48611111  0.78048780
10     0.115  0.54166667  0.75609756
...
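Note that coords() can also return PPV and NPV at chosen thresholds directly, which already gives the point estimates for your table:
coords(roc1, x = c(0.1, 0.2), input = "threshold",
  ret = c("threshold", "sensitivity", "specificity", "ppv", "npv"))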
From there, you can use the ci.coords function you are already using to complete the table with whatever data you desire.
library(tidyverse)
library(pROC)
#> Type 'citation("pROC")' for a citation.
#>
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#>
#> cov, smooth, var
data(aSAH)
roc <- roc(aSAH$outcome, aSAH$s100b,
  levels = c("Good", "Poor")
)
#> Setting direction: controls < cases
tibble(threshold = seq(0, 1, by = 0.1)) %>%
  mutate(
    data = threshold %>% map(~ {
      res <- roc %>%
        ci.coords(x = .x, ret = c("sensitivity", "specificity", "ppv", "npv"))
      # take the upper (97.5%) bound of each confidence interval
      list(
        sens = res$sensitivity[[3]],
        spec = res$specificity[[3]],
        ppv = res$ppv[[3]],
        npv = res$npv[[3]]
      )
    })
  ) %>%
  unnest_wider(data)
#> # A tibble: 11 x 5
#>    threshold   sens  spec   ppv    npv
#>        <dbl>  <dbl> <dbl> <dbl>  <dbl>
#>  1       0   1      0     0.363 NA
#>  2       0.1 0.927  0.5   0.5    0.917
#>  3       0.2 0.780  0.903 0.784  0.867
#>  4       0.3 0.634  0.917 0.769  0.8
#>  5       0.4 0.561  0.958 0.85   0.782
#>  6       0.5 0.439  1     1      0.755
#>  7       0.6 0.366  1     1      0.735
#>  8       0.7 0.317  1     1      0.72
#>  9       0.8 0.195  1     1      0.686
#> 10       0.9 0.122  1     1      0.667
#> 11       1   0.0732 1     1      0.655
Created on 2021-09-10 by the reprex package (v2.0.1)
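Note that the sketch above takes the upper (97.5%) bound of each interval via [[3]]. If you want the point estimate instead, index the middle value, assuming pROC's usual 2.5% / 50% / 97.5% layout:
res <- ci.coords(roc, x = 0.5, ret = c("sensitivity", "specificity"))
res$sensitivity[[2]] # 50% point estimate (median)
res$sensitivity[[3]] # 97.5% upper bound, as used in the table above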