I have a fitted binomial logit model and want to calculate the cumulative probability of experiencing an event at values of a covariate <= some threshold.
For example, if I have a fitted model that predicts an outcome based on a continuous distance range (0-8.5 km), I might want to find the cumulative probability for distance <= 4.5 km.
I have vectors of estimated probabilities and the associated distances as below
dat <- structure(list(km = c(0, 0.447368421052632, 0.894736842105263,
1.34210526315789, 1.78947368421053, 2.23684210526316, 2.68421052631579,
3.13157894736842, 3.57894736842105, 4.02631578947368, 4.47368421052632,
4.92105263157895, 5.36842105263158, 5.81578947368421, 6.26315789473684,
6.71052631578947, 7.15789473684211, 7.60526315789474, 8.05263157894737,
8.5), prob = c(0.99010519543441, 0.985413663823809, 0.97854588563623,
0.968547716962174, 0.954108659036907, 0.933496091194704, 0.904551377544634,
0.864833064332603, 0.81202174997839, 0.744668375529677, 0.663191827576796,
0.570704402277059, 0.47300143764816, 0.377323442817887, 0.290336664745317,
0.216433162546689, 0.157174982015906, 0.111825887625402, 0.0783449309507567,
0.054275681518511)), .Names = c("km", "prob"), row.names = c(NA,
-20L), class = "data.frame")
What I ultimately want to be able to say is "y% of observations within distance x are predicted to experience an event". Is this the right way to go about that?
Also, is there an easy way to calculate the distance (from 0 upwards) that encompasses 50% of the cumulative probability?
Thanks, Tim
There is probably some way to extract this from your model, but if you were doing it from scratch I would try to fit your data to a distribution, then extract your relevant data points.
First define an error function:
rmse <- function(x,y) sqrt(sum((x-y)^2)/length(x)) # or some other error fxn
Now let's say your data sort of looks like a gamma distribution, so try:
gdf <- function(x, d=dat$km) pgamma(d,shape=x[1], scale=x[2])
So your function to optimize will be the error function of your data and the fit distribution:
error_fxn <- function(x) rmse(rev(dat$prob),gdf(x)) # rev data to make ascending
Now optimize this function to get your parameters for the distribution of interest:
rr <- optim(c(1,1),error_fxn)
And let's see how good the fit is (just OK...):
rr
# $par
# [1] 3.108392 1.112584
# $value
# [1] 0.0333369
# $counts
# function gradient
#      119       NA
# $convergence
# [1] 0
# $message
# NULL
Or graphically:
with(dat,plot(km,prob,xlim=c(10,0)))
with(dat,lines(rev(km),pgamma(km,shape=rr$par[1], scale=rr$par[2]),col='red'))
Take a look at the values for the CDF:
kms <- seq(0,8.5,0.5)
data.frame(dist = kms, cdf = pgamma(kms,shape=rr$par[1], scale=rr$par[2]))
# dist cdf
# 1 0.0 0.000000000
# 2 0.5 0.008634055
# 3 1.0 0.053615340
# 4 1.5 0.137291689
# 5 2.0 0.245961242
# 6 2.5 0.363956061
# 7 3.0 0.479070721
# 8 3.5 0.583659363
# 9 4.0 0.673982194
# 10 4.5 0.749075757
# 11 5.0 0.809691054
# 12 5.5 0.857478086
# 13 6.0 0.894431622
# 14 6.5 0.922551998
# 15 7.0 0.943661710
# 16 7.5 0.959325076
# 17 8.0 0.970830577
# 18 8.5 0.979207658
And to answer your final question, get the distance at 50% of the CDF:
qgamma(0.5,shape=rr$par[1], scale=rr$par[2])
# [1] 3.095395
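For the distance in the original example (4.5 km), you can plug it straight into the fitted CDF; a minimal sketch reusing the parameters estimated above:
pgamma(4.5, shape = rr$par[1], scale = rr$par[2])
# ~0.749, i.e. about 75%, consistent with the table above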
Related
I have 150 columns of scores against 1 column of label (1/0).
My goal is to create 150 AUC scores.
Here is a manual example:
auc(roc(df$label, df$col1)),
auc(roc(df$label, df$col2)),
...
I can use Map/sapply/lapply here, but is there any other method or function?
This is a bit of an XY question. What you actually want to achieve is to speed up your calculation. gfgm's answer addresses that with parallelization, but that's only one way to go.
If, as I assume, you are using library(pROC)'s roc/auc functions, you can gain even more speed by selecting the appropriate algorithm for your dataset.
pROC comes with essentially two algorithms that scale very differently depending on the characteristics of your data set. You can benchmark which one is the fastest by passing algorithm=0 to roc:
# generate some toy data
label <- rbinom(600000, 1, 0.5)
score <- rpois(600000, 10)
library(pROC)
roc(label, score, algorithm=0)
Starting benchmark of algorithms 2 and 3, 10 iterations...
expr min lq mean median uq max neval
2 2 4805.58762 5827.75410 5910.40251 6036.52975 6085.8416 6620.733 10
3 3 98.46237 99.05378 99.52434 99.12077 100.0773 101.363 10
Selecting algorithm 3.
Here we select algorithm 3, which shines when the number of thresholds remains low. But if 600000 data points take 5 minutes to compute, I strongly suspect that your data is very continuous (no measurements with identical values) and that you have about as many thresholds as data points (600000). In this case you can skip directly to algorithm 2, which scales much better as the number of thresholds in the ROC curve increases.
You can then run:
auc(roc(df$label, df$col1, algorithm=2)),
auc(roc(df$label, df$col2, algorithm=2)),
On my machine each call to roc now takes about 5 seconds, pretty much independently of the number of thresholds. That way you should be done in under 15 minutes total. Unless you have 50 cores or more, this is going to be faster than just parallelizing. But of course you can do both...
If you want to parallelize the computations you could do it like this:
# generate some toy data
label <- rbinom(1000, 1, .5)
scores <- matrix(runif(1000*150), ncol = 150)
df <- data.frame(label, scores)
library(pROC)
library(parallel)
auc(roc(df$label, df$X1))
#> Area under the curve: 0.5103
auc_res <- mclapply(df[, 2:ncol(df)], function(col) auc(roc(df$label, col)))  # iterates over the score columns
head(auc_res)
#> $X1
#> Area under the curve: 0.5103
#>
#> $X2
#> Area under the curve: 0.5235
#>
#> $X3
#> Area under the curve: 0.5181
#>
#> $X4
#> Area under the curve: 0.5119
#>
#> $X5
#> Area under the curve: 0.5083
#>
#> $X6
#> Area under the curve: 0.5159
Since most of the computational time seems to be the call to auc(roc(...)) this should speed things up if you have a multi-core machine.
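As the previous answer notes, you can of course combine both ideas; a minimal sketch under the same setup (mclapply relies on forking, so on Windows mc.cores must stay at 1):
# leave one core free and pass algorithm = 2 through to roc()
n_cores <- max(1L, parallel::detectCores() - 1L)
auc_res <- mclapply(df[, 2:ncol(df)],
                    function(col) auc(roc(df$label, col, algorithm = 2)),
                    mc.cores = n_cores)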
There's a function for doing that in the cutpointr package. It also calculates cutpoints and other metrics, but you can discard those. By default it will try all columns except the response column as predictors. Additionally, the direction of the ROC curve (whether larger values imply the positive class or the other way around) is determined automatically if you leave out direction, or you can set it manually.
dat <- iris[1:100, ]
library(tidyverse)
library(cutpointr)
mc <- multi_cutpointr(data = dat, class = "Species", pos_class = "versicolor",
silent = FALSE)
mc %>% select(variable, direction, AUC)
# A tibble: 4 x 3
variable direction AUC
<chr> <chr> <dbl>
1 Sepal.Length >= 0.933
2 Sepal.Width <= 0.925
3 Petal.Length >= 1.00
4 Petal.Width >= 1.00
By the way, the runtime shouldn't be a problem here, because calculating the ROC curve (even including a cutpoint) takes less than a second for one variable and one million observations using cutpointr or ROCR, so your task should run in about one or two minutes.
If memory is the limiting factor, parallelization will probably make that problem worse. If the above solution takes up too much memory because it returns ROC curves for all variables before dropping those columns, you can try selecting the columns of interest right away in a call to map:
# 600,000 observations for 150 variables and a binary outcome
predictors <- matrix(data = rnorm(150 * 6e5), ncol = 150)
dat <- as.data.frame(cbind(y = sample(0:1, size = 6e5, replace = T), predictors))
library(cutpointr)
library(tidyverse)
vars <- colnames(dat)[colnames(dat) != "y"]
result <- map_df(vars, function(coln) {
cutpointr_(dat, x = coln, class = "y", silent = TRUE, pos_class = 1) %>%
select(direction, AUC) %>%
mutate(variable = coln)
})
result
# A tibble: 150 x 3
direction AUC variable
<chr> <dbl> <chr>
1 >= 0.500 V2
2 <= 0.501 V3
3 >= 0.501 V4
4 >= 0.501 V5
5 <= 0.501 V6
6 <= 0.500 V7
7 <= 0.500 V8
8 >= 0.502 V9
9 >= 0.501 V10
10 <= 0.500 V11
# ... with 140 more rows
I am trying to get a new column, say duration_probablity, which gives the probability of a value falling between 6 and 12 hours: P(6 < Origin_Duration ≤ 12).
dput(df)
structure(list(CRD_NUM = c(1000120005478330, 1000130009109199,
1000140001635234, 1000140002374747, 1000140003618308, 1000140007236959,
1000140015078086, 1000140026268650, 1000140027281272, 1000148000012215
), Origin_Duration = c("10:48:38", "07:41:34", "11:16:41", "09:19:35",
"17:09:19", "08:59:05", "11:27:28", "12:17:41", "10:45:42", "12:19:05"
)), .Names = c("CRD_NUM", "Origin_Duration"), class = c("data.table",
"data.frame"), row.names = c(NA, -10L))
CRD_NUM Origin_Duration
1: 1000120005478330 10:48:38
2: 1000130009109199 07:41:34
3: 1000140001635234 11:16:41
4: 1000140002374747 09:19:35
5: 1000140003618308 17:09:19
6: 1000140007236959 08:59:05
7: 1000140015078086 11:27:28
8: 1000140026268650 12:17:41
9: 1000140027281272 10:45:42
10: 1000148000012215 12:19:05
I am not sure how to do that in R. I want to use the cumulative distribution function of the standard normal distribution: the probability that a commuter's stay-duration at a certain station falls between 6 and 12 hours.
The output would be, say, 0.96 for the duration 11:16:41.
My CDF would be something like P(6 < X ≤ 12) = Φ((12 − μ)/σ) − Φ((6 − μ)/σ).
From your question it is unclear whether you already know the mean and variance or not. I will discuss both cases. Also, I will assume you have reason to believe that the durations are in fact normally distributed.
Known parameters: Suppose you have a pre-specified mean and standard deviation, say mu = 11 and sigma = 3. Then you can use P(6 < X ≤ 12) = P(X ≤ 12) - P(X ≤ 6). The base R function pnorm() calculates this:
mu <- 11
sigma <- 3
pnorm(12, mu, sigma) - pnorm(6, mu, sigma)
# 0.5827683
Unknown parameters, P(6 < X ≤ 12): If you do not yet know the mean and variance, you can estimate them from your data and use the Student t-distribution instead of the normal distribution (the story of why it is called the 'Student' distribution is a nice one; see Wikipedia). To estimate the mean and variance, it makes sense to first convert df$Origin_Duration from character to a time type:
df$Origin_Duration <- as.POSIXct(df$Origin_Duration, format = "%H:%M:%S")
mu <- mean(df$Origin_Duration) # "2017-09-04 11:12:28 CEST"
df$demeaned <- df$Origin_Duration - mu
sigma <- var(df$demeaned)^0.5 # 153.68
Note that I subtracted the mean before calculating the variance, so that the differences are plain durations in minutes; the standard deviation is therefore to be read as 153.68 minutes.
We will use the pt function to calculate the probability P(X ≤ 12) - P(X ≤ 6). In order to do so, we need standardised/scaled versions of 12 and 6; that is, we subtract the mean and divide by the standard deviation:
# "2017-09-04" is the date that as.POSIXct attached above (the day the code was run);
# it has to match, otherwise the differences are off by whole days
x6 <- as.numeric(difftime("2017-09-04 06:00:00", mu), unit = "mins")/sigma
x12 <- as.numeric(difftime("2017-09-04 12:00:00", mu), unit = "mins")/sigma
deg_fr <- length(df$demeaned)-1
p_x_smaller_than12 <- pt( x12, df = deg_fr ) # 0.6178973
p_x_smaller_than6 <- pt( x6, df = deg_fr ) # 0.03627651
p_x_smaller_than12 - p_x_smaller_than6
# [1] 0.5816208
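If the goal is simply to attach this probability as the duration_probablity column from the question, note that it is a property of the fitted distribution rather than of any individual row, so the column is constant; a minimal sketch:
# same value for every row: P(6 < X <= 12) does not depend on the row
df$duration_probablity <- p_x_smaller_than12 - p_x_smaller_than6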
Added in response to comment: Unknown parameters, all entries:
# scale gives the distance from the mean in terms of standard deviations:
df$scaled <- scale(df$Origin_Duration)
pt(df$scaled, df = deg_fr)
# [1,] 0.4400575
# [2,] 0.1015886
# [3,] 0.5106114
# [4,] 0.2406431
# [5,] 0.9773264
# [6,] 0.2039751
# [7,] 0.5377728
# [8,] 0.6593331
# [9,] 0.4327620
# [10,] 0.6625280
I have this grid-matrix:
cutoff <- c(pi/48, 2*pi/48, 3*pi/48, pi/12)
lambda <- c(5:10)
eta <- seq(1, 1.5, by=0.1)
grid <- expand.grid(cutoff = cutoff, lambda = lambda, eta = eta)
And this is the output I get after the function (which calculates the sharpe vector):
best_grid <- grid[max(sharpe),]
cutoff lambda eta
[17] 0.1963495 5 1.5
But I would like to get this:
cutoff lambda eta
[17] 3*pi/48 5 1.5
Do you have any ideas?
You could create a separate string vector of cutoff labels that corresponds to the values in cutoff and then match against it. I just selected four random rows for illustration.
cutoff.string = c("pi/48", "2*pi/48", "3*pi/48", "pi/12")
best_grid = grid[c(1,20,50,120),]
best_grid$cutoff = cutoff.string[match(best_grid$cutoff, cutoff)]
best_grid
cutoff lambda eta
1 pi/48 5 1.0
2 pi/12 9 1.0
3 2*pi/48 5 1.2
4 pi/12 10 1.4
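An alternative sketch for the same relabelling uses a factor whose labels are the strings; this assumes best_grid still holds the numeric cutoffs and that they were produced by exactly the same expressions as cutoff, so they match exactly:
best_grid <- grid[c(1, 20, 50, 120), ]   # numeric cutoff values again
best_grid$cutoff <- as.character(factor(best_grid$cutoff,
                                        levels = cutoff,
                                        labels = cutoff.string))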
Or, with #HaddE.Nuff's suggestion:
cutoff = quote(c(pi/48, 2*pi/48, 3*pi/48, pi/12))
grid <- expand.grid(cutoff=eval(cutoff), lambda=lambda, eta=eta)
best_grid = grid[c(1,20,50,120),]
best_grid$cutoff = gsub(" ","", as.character(cutoff[-1]))[match(best_grid$cutoff, eval(cutoff))]
I created a decision tree with the party package in R.
I'm trying to get the route/branch with the maximum value.
That can be the mean value that comes from a box plot (regression tree),
or it can be the probability value that comes from a binary classification tree.
(source: rdatamining.com)
This can be done pretty easily, actually, though while your definition of maximum value is clear for a regression tree, it is less clear for a classification tree, as within each node a different level can have its own maximum.
Either way, here's a pretty simple helper function that will return the predictions for each type of tree:
GetPredicts <- function(ct){
  # prediction stored in a terminal node of the party tree
  f <- function(ct, i) nodes(ct, i)[[1]]$prediction
  # IDs of the terminal nodes that observations fall into
  Terminals <- unique(where(ct))
  Predictions <- sapply(Terminals, f, ct = ct)
  if(is.matrix(Predictions)){                 # classification: one column per node
    colnames(Predictions) <- Terminals
    return(Predictions)
  } else {                                    # regression: one value per node
    return(setNames(Predictions, Terminals))
  }
}
Now, luckily you've taken your trees from the examples in ?ctree, so we can test them (next time, please provide the code you used yourself).
Regression tree (your first tree)
## load the package and create the tree
library(party)
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq,
controls = ctree_control(maxsurrogate = 3))
plot(airct)
Now, test the function
res <- GetPredicts(airct)
res
# 5 3 6 9 8
# 18.47917 55.60000 31.14286 48.71429 81.63333
So we've got the prediction for each terminal node. You can easily proceed with which.max(res) from here (I'll leave the rest to you).
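For instance, a quick sketch using the values above:
res[which.max(res)]   # terminal node with the largest predicted mean Ozone
#        8
# 81.63333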
Classification tree (your second tree)
irisct <- ctree(Species ~ .,data = iris)
plot(irisct, type = "simple")
Run the function
res <- GetPredicts(irisct)
res
# 2 5 6 7
# [1,] 1 0.00000000 0.0 0.00000000
# [2,] 0 0.97826087 0.5 0.02173913
# [3,] 0 0.02173913 0.5 0.97826087
Now, the output is a bit harder to read because each class has its own probabilities. You can make this a bit more readable using:
row.names(res) <- levels(iris$Species)
res
# 2 5 6 7
# setosa 1 0.00000000 0.0 0.00000000
# versicolor 0 0.97826087 0.5 0.02173913
# virginica 0 0.02173913 0.5 0.97826087
Then, you could do something like the following to get the overall maximum value:
which(res == max(res), arr.ind = TRUE)
# row col
# setosa 1 1
For column/row maxes, you could do
matrixStats::colMaxs(res)
# [1] 1.0000000 0.9782609 0.5000000 0.9782609
matrixStats::rowMaxs(res)
# [1] 1.0000000 0.9782609 0.9782609
But, again, I'll leave it to you to decide how to proceed from here.
The question has 2 parts.
Which data structure in R allows storing the paired data:
0:0
0.5:10
1:20
(like a Python dictionary: {0: 0, 0.5: 10, 1: 20})
and how can I initialise it with a one-liner, i.e. couple seq(0,1,by=0.5)
with seq(0,10,by=5) in this data structure?
Assume I add 0.25 to the list; then I want the weighted average of the neighbouring nodes to appear (automatically) in the data set, i.e. the element 0.25:5, and the paired set would be
0:0
0.25:5
0.5:10
1:20
If I add the element 0.3, then it must be paired with 5+(10-5)*(0.3-0.25)/(0.5-0.25)=6, and the element 0.3:6 would be added.
How can I create a class, with S4 or the Reference Class model, where I could put this functionality?
Not really sure what you are getting at, but the hash package may have what you want:
library(hash)
h<-hash(keys=seq(0,1,by=0.5),values=seq(0,10,by=5))
h[['0.25']]<-2.5
That probably deals with the first part of your question; http://cran.r-project.org/web/packages/hash/hash.pdf may help with the second.
A similar construct with a named vector:
lst <- seq(0, 10, 5)          # the "values"
names(lst) <- seq(0, 1, 0.5)  # the "keys", stored as names
> lst['0.5']
0.5
5
lst['0.25']<-2.5
For your second part, you could construct a simple function to update your hash/list with a new value.
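A minimal sketch of such a function for the named-vector version, interpolating the new value linearly from the existing keys with approx (the helper name add_key is made up, and it assumes the new key lies inside the current range):
add_key <- function(lst, key) {
  keys <- as.numeric(names(lst))
  vals <- as.numeric(lst)
  new_val <- approx(keys, vals, xout = key)$y   # linear interpolation
  lst[as.character(key)] <- new_val
  lst[order(as.numeric(names(lst)))]            # keep the keys sorted
}
lst <- add_key(lst, 0.3)   # with the example list above this pairs 0.3 with 3
                           # (on the question's 0/10/20 scale it would give 6)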
A two-column data.frame seems appropriate:
xy <- data.frame(x = seq(0, 1, by = 0.5), y = seq(0, 20, by = 10))
xy
# x y
# 1 0.0 0
# 2 0.5 10
# 3 1.0 20
Then, what you are trying to do is a linear interpolation, which you can achieve using the approx function. For example:
approx(xy$x, xy$y, xout = 0.3)
# $x
# [1] 0.3
#
# $y
# [1] 6
If you want to add that result to the data.frame, you can do something like:
xy <- as.data.frame(approx(xy$x, xy$y, xout = sort(c(xy$x, 0.3))))
xy
# x y
# 1 0.0 0
# 2 0.3 6
# 3 0.5 10
# 4 1.0 20
which is a bit expensive, especially if you plan to add points one at a time. You could instead add all your points at once since the result is independent of the order in which you add them:
add.points <- c(0.25, 0.3)
xy <- as.data.frame(approx(xy$x, xy$y, xout = sort(c(xy$x, add.points))))
xy
# x y
# 1 0.00 0
# 2 0.25 5
# 3 0.30 6
# 4 0.50 10
# 5 1.00 20
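And if you want the Reference Class from the second part of the question, one possible sketch wraps this data.frame-plus-approx logic; the class and method names here are made up for illustration:
InterpMap <- setRefClass("InterpMap",
  fields = list(xy = "data.frame"),
  methods = list(
    initialize = function(x = numeric(), y = numeric(), ...) {
      xy <<- data.frame(x = x, y = y)
      callSuper(...)
    },
    add = function(new_x) {
      # interpolate the new point(s) and store the enlarged, sorted table
      xy <<- as.data.frame(approx(xy$x, xy$y, xout = sort(unique(c(xy$x, new_x)))))
      invisible(.self)
    },
    get = function(x0) xy$y[match(x0, xy$x)]
  )
)
m <- InterpMap$new(x = seq(0, 1, by = 0.5), y = seq(0, 20, by = 10))
m$add(0.3)
m$get(0.3)
# [1] 6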