I am trying to create a ROC curve in ggplot.
I wrote a function myself; however, when I compare my results to those from the community-maintained roc_curve function (which I trust more), I get different results.
Where is the mistake in the function below?
library(ggplot2)
library(dplyr)
library(yardstick)
n <- 300 # sample size
data <-
data.frame(
real = sample(c(0,1), replace=TRUE, size=n),
pred = sample(runif(n), replace=TRUE, size=n)
)
simple_roc <- function(labels, scores){
labels <- labels[order(scores, decreasing=TRUE)]
data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), labels)
}
simple_roc(data$real, data$pred) %>%
ggplot(aes(TPR, FPR)) +
geom_line()
yardstick::roc_curve(data, factor(real), pred) %>%
ggplot(aes(1 - specificity, sensitivity)) +
geom_line()
First, you need to anchor your ROC curve at the points (0, 0) and (1, 1).
simple_roc <- function(labels, scores){
labels <- labels[order(scores, decreasing=TRUE)]
data.frame(
TPR = c(0, cumsum(labels)/sum(labels), 1),
FPR = c(0, cumsum(!labels)/sum(!labels), 1)
)
}
Then the order in which your data is presented matters in ggplot2. Reversing the line direction should get you a bit closer:
yardstick::roc_curve(data, factor(real), pred) %>%
ggplot(aes(rev(1 - specificity), rev(sensitivity))) +
geom_line()
I would recommend against using your own function for any serious work. There are many other things that can go wrong and that well-maintained packages will handle properly such as missing values, infinite values, absence of some labels, and others that I can't even think about right now.
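As a sanity check on the anchored function above, here is a minimal base R sketch. It assumes distinct scores (no ties) and compares the trapezoidal area under the curve with the pairwise (Mann-Whitney) estimate. The helper name trapezoid_auc and the seed are made up for illustration and do not come from any package. It also plots in the conventional orientation, with FPR on the x-axis and TPR on the y-axis.

```r
# Anchored ROC points, as in the function above
simple_roc <- function(labels, scores) {
  labels <- labels[order(scores, decreasing = TRUE)]
  data.frame(
    TPR = c(0, cumsum(labels) / sum(labels), 1),
    FPR = c(0, cumsum(!labels) / sum(!labels), 1)
  )
}

set.seed(42)  # arbitrary seed, only for reproducibility
labels <- sample(c(0, 1), 300, replace = TRUE)
scores <- runif(300)

roc <- simple_roc(labels, scores)

# Hypothetical helper: trapezoidal area under the (FPR, TPR) curve
trapezoid_auc <- function(fpr, tpr) {
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
auc <- trapezoid_auc(roc$FPR, roc$TPR)

# Pairwise estimate: share of (positive, negative) pairs ranked correctly
auc_pairs <- mean(outer(scores[labels == 1], scores[labels == 0], ">"))

# Conventional orientation: FPR on x, TPR on y
plot(roc$FPR, roc$TPR, type = "l", xlab = "FPR", ylab = "TPR")
```

If the two area estimates disagree, the curve construction is the likely culprit; with tied scores they can legitimately differ, so this check only applies to the no-ties case.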
I am trying to export the underlying data from a LOESS plot (blue line)
I found this post on the subject and was able to get it to export like the post says:
Can I export the result from a loess regression out of R?
However, as the last comment from the poster in that post says, I am not getting the results for my LOESS line. Does anyone have any insights on how to get it to export properly?
Thanks!
Code for my export is here:
#loess object
CL111_loess <- loess(dur_cleaned~TS_LightOn, data = CL111)
#get SE
CL111_predict <- predict(CL111_loess, se=T)
CL111_ouput <- data.frame("fitted" = CL111_predict$fit, "SE"=CL111_predict$se.fit)
write.csv(CL111_ouput, "CL111_output.csv")
Data for the original plot is here:
Code for my original plot is here:
#individual plot
ggplot(data = CL111) + geom_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "lm", se = FALSE, colour = "Green") +
labs(x = "TS Light On (Seconds)", y = "TS Response Time (Seconds)", title = "Layout 1, Condition AO, INS High") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "loess", se = TRUE) + xlim(0, 400) + ylim (0, 1.0)
#find coefficients for best fit line
lm(CL111_LM$dur_cleaned ~ CL111_LM$TS_LightOn)
You can get this information via ggplot_build().
If your plot is saved as gg1, run ggplot_build(gg1); then examine the data element (a list with one data frame per layer) and work out which layer you need (in this case, I looked for the layer whose colour column matched the smooth line).
bb <- ggplot_build(gg1)
## extract the right component, just the x/y coordinates
out <- bb$data[[2]][,c("x","y")]
## check
plot(y~x, data = out)
You can do whatever you want with this output now (write.csv(), save(), saveRDS() ...)
I agree that there is something weird/that I don't understand about the way that ggplot2 is setting up the loess fit. You do have to do predict() with the right newdata (e.g. a data frame with a single column TS_LightOn that ranges from 0 to 400), otherwise you get predictions at the points in your data set, which may not be properly spaced or in the right order. But that doesn't resolve the difference for me.
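A minimal sketch of predicting on an explicit grid, using the built-in cars data as a stand-in for CL111 (which I don't have). The choice of 80 grid points mirrors the default n of stat_smooth; the variable names are made up for illustration. One possible contributor to the remaining difference, offered as a guess rather than a diagnosis: xlim() and ylim() set scale limits, and observations outside them are dropped before the smoother is fitted, so the plot may be fitting fewer points than loess() on the full data.

```r
# Fit loess on a built-in dataset (stand-in for the poster's CL111 data)
fit <- loess(dist ~ speed, data = cars)

# Equally spaced grid over the observed x range; 80 points is an assumption
# chosen to match the default 'n' of ggplot2's stat_smooth()
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 80))

# Predict on the grid, with standard errors
pred <- predict(fit, newdata = grid, se = TRUE)

out <- data.frame(speed = grid$speed, fitted = pred$fit, SE = pred$se.fit)
head(out)

# 'out' can then be exported, e.g. write.csv(out, "loess_grid.csv", row.names = FALSE)
```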
To complement @ben-bolker's answer, I have written a small function that may be useful for retrieving the internal dataset created by ggplot for a geom_smooth call. It takes the resulting ggplot as input and returns the smoothed data. The problem it solves is that, as Ben observed, ggplot internally evaluates the smoothed fit on its own grid of x values (by default 80 points spanning the x range), which differs from the x values of the input data. This function gives you back the ggplot fit on equally spaced integer values of x. It does so by running a loess fit on the already smoothed data with a small span (0.1), which is adjusted upward on the fly to cope with small numbers of values.
This is useful if you used geom_smooth with a method that is not 'loess' or using 'NULL' and you cannot easily build a model that replicates what geom_smooth is doing internally.
The function separates different series on the same plot as well as series located on different facets. It also returns the 'ymin' and 'ymax' values.
Note that this function uses an interval based on integer values of x. You can modify this if you need an interval based on equally-spaced values of x, but not integral. In that case, pass your x interval of choice in the xInterval parameter, or tweak the line:
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x))))
get_geom_smooth_dataFromPlot <- function (a_ggplot, xInterval=NULL) {
#internal ggplot values read in ggTable
ggTable <- ggplot_build(a_ggplot)$data[[1]]
#facet panels
panels <- as.numeric(names(table(ggTable$PANEL)))
nPanel <- length(panels)
onePanel <- (nPanel==1)
#number of series in each plot
groups <- as.numeric(names(table(ggTable$group)))
nGroup <- length(groups)
oneGroup <- (nGroup==1)
out <- data.frame()
#are there 'ymin' and 'ymax' values?
SE_data <- "ymin" %in% colnames(ggTable)
for (pan in (1:nPanel)) {
for (grp in (1:nGroup)) {
sub <- subset(ggTable, (PANEL==panels[pan])&(group==groups[grp]))
#no group series for this facet panel?
if (dim(sub)[1] == 0) next
if (is.null(xInterval)) {
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x))))
} else {
outOne <- data.frame(x=xInterval)
}
nObs <- dim(outOne)[1]
#hack to avoid problems with a small range for the x interval
# when there are more than 90 x values
# we use a span of 0.1, but
# we adjust on-the-fly up to a span of 0.5
# for 10 values of the x interval
cSpan <- max (0.1, 0.5 * 10 / (nObs-(nObs-10)/2))
if (!onePanel) outOne$panel <- pan
if (!oneGroup) outOne$group <- grp
mod <- loess(y~x, data=sub, span=cSpan)
outOne$y <- predict(mod, outOne$x, se=FALSE)
if (SE_data) {
mod <- loess(ymin~x, data=sub, span=cSpan)
outOne$ymin <- predict(mod, outOne$x, se=FALSE)
mod <- loess(ymax~x, data=sub, span=cSpan)
outOne$ymax <- predict(mod, outOne$x, se=FALSE)
}
out <- rbind(out, outOne)
}
}
return (out)
}
I need to plot a couple of curves on one graph. I've got trajectories of Brownian simulation which I got from the function:
brownian <- function(T,N){
alpha=0
sigma=1
delta_t=T/N
t=seq(0,T,by=delta_t)
#x=c(0,alpha*delta_t+sigma*sqrt(delta_t)*rnorm(N,mean=0,sd=1))
x=c(0,alpha*delta_t+sqrt(delta_t)*rnorm(N,mean=0,sd=1))
Xt=cumsum(x)
plot(t,Xt,type='l',col = rep(1:3, each = 10),xlab="t=[0,T]",ylab = "B(t,ω)")
}
For example for brownian(1,1000) I get:
And for brownian(10,1000) I get:
As you can see I get black graphs. I have to plot these trajectories on one graph (every trajectory should have different color). When it takes several trajectories, it should look like:
Do you have any advice on how I can plot these curves on one graph so that each curve has a different color?
Thanks in advance
You could do this pretty easily by modifying the function and using ggplot() to make the graph. The function below takes ntimes as an argument which specifies the number of times you want to do the simulation. It then uses ggplot() to make the graph. You could adjust the internals of the function to have it produce a different looking plot if you like.
brownian <- function(T,N, ntimes){
if((length(N) != length(T)) & length(N) != 1){
stop("N has to be either length of T or 1\n")
}
alpha=0
sigma=1
if(length(N) == 1 & length(T) > 1)N <- rep(N, length(T))
dat <- NULL
for(i in 1:ntimes){
delta_t=T/N
t=seq(0,T,by=delta_t)
#x=c(0,alpha*delta_t+sigma*sqrt(delta_t)*rnorm(N,mean=0,sd=1))
x=c(0,alpha*delta_t+sqrt(delta_t)*rnorm(N,mean=0,sd=1))
Xt=cumsum(x)
dat <- rbind(dat, data.frame(xt=Xt, t=t, n=i))
}
require(ggplot2)
ggplot(dat, aes(x=t, y=xt, colour=as.factor(n))) +
geom_line(show.legend=FALSE) +
labs(x="t=[0,T]",y = "B(t,ω)", colour="T") +
theme_classic()
}
brownian(10,1000, 5)
Here is a base R solution with matplot. It is ideal for this type of plot, since it computes the x and y axis ranges and plots all lines in one call only. It uses DaveArmstrong's idea of adding an extra argument ntimes. This argument is also used for the color scheme.
brownian <- function(T, N, ntimes){
alpha <- 0
sigma <- 1
delta_t <- T/N
t <- seq(0, T, by = delta_t)
#x=c(0,alpha*delta_t+sigma*sqrt(delta_t)*rnorm(N,mean=0,sd=1))
Xt <- replicate(ntimes,
cumsum(c(0, alpha*delta_t+sqrt(delta_t)*rnorm(N, mean = 0, sd = 1)))
)
matplot(t, Xt,
type = "l", lty = 1,
col = seq_len(ntimes),
xlab = "t=[0,T]", ylab = "B(t,ω)")
}
set.seed(2020)
brownian(1, 1000, 5)
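If you also want the simulated values themselves (for example, to check them or to plot them with another tool), one option is to separate the simulation from the plotting. The following is a sketch under the same assumptions as the functions above (alpha = 0, sigma = 1); the name simulate_brownian is made up for illustration.

```r
# Hypothetical helper: returns an (N + 1) x ntimes matrix, one trajectory per column
simulate_brownian <- function(T, N, ntimes) {
  delta_t <- T / N
  # Independent N(0, delta_t) increments, one column per trajectory
  increments <- matrix(rnorm(N * ntimes, mean = 0, sd = sqrt(delta_t)), nrow = N)
  # Cumulative sums down each column, with a leading row of zeros for B(0) = 0
  rbind(0, apply(increments, 2, cumsum))
}

set.seed(2020)
paths <- simulate_brownian(1, 1000, 5)
t <- seq(0, 1, length.out = nrow(paths))

# One line per trajectory, each in its own colour
matplot(t, paths, type = "l", lty = 1, col = seq_len(ncol(paths)),
        xlab = "t=[0,T]", ylab = "B(t,ω)")
```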
I have made code that computes the two lines I am asking for in the question, as shown in the image below (desired lines are in red).
EDIT : This is the expected graph using my snippet to generate the ROC curves (at least I'm pretty sure this is right):
The problem is that said code is very very ugly (too long to even post here) and the process I came up with seems extremely tedious to me. Yet I can't seem to come up with anything better.
Here is a quick snippet to produce an input list of ROC curves
library(MASS)
library(dplyr)
simple_roc <- function(labels, scores){
labels <- labels[order(scores, decreasing=TRUE)]
return(rbind(c(0,0,0),data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), labels)))
}
diab_data=rbind(data.frame(Pima.tr),data.frame(Pima.te))
roc_curves_list_logisitic=list()
for (k in 1:100) {
#Set a fixed seed for reproducibility
set.seed(k)
# sampled_rows <- createDataPartition(diab_data$type, p = .7, list = FALSE)
sampled_rows <- sample(1:nrow(diab_data), size=floor(0.7*nrow(diab_data)))
diab_data_train=diab_data[sampled_rows,]
diab_data_test=diab_data[-sampled_rows,]
diab_data_train[,1:7]=scale(diab_data_train[,1:7])
diab_data_test[,1:7]=scale(diab_data_test[,1:7])
diab_data_train[,"type"]=as.numeric(as.character(recode_factor(diab_data_train[,"type"],`Yes` = "1", `No` = "0")))
diab_data_test[,"type"]=as.numeric(as.character(recode_factor(diab_data_test[,"type"],`Yes` = "1", `No` = "0")))
logistic_model_simple=glm(data=diab_data_train,as.formula(paste(colnames(diab_data_train)[8], "~",
paste(colnames(diab_data_train)[-8], collapse = "+"),
sep = "")),family=binomial(link = "logit"))
roc_curves_list_logisitic[[k]]=simple_roc(diab_data_test[,"type"],
ifelse(predict(logistic_model_simple,diab_data_test,type='response')>0.5,1,0))
}
I am now asking for help, in case anyone has a "beautiful" solution to produce the two red lines in this graph (in ggplot2) using the list of ROC curves I provided as input.
Preferably I would like to end up with two dataframes lower_bound_roc_curves and upper_bound_roc_curves containing the necessary values to plot the two lines separately if I need them.
Thanks in advance,
EDIT 2 : @denis Here are some parts I think your code gets wrong :
I have a solution with data.table and zoo. The first step is to put all your curves on a common set of FPR values, so that the maximum and the minimum across all curves can be computed at each FPR. To do so:
library(data.table)
library(zoo)
FPRlist <- unique(rbindlist(lapply(roc_curves_list_logisitic,function(ROC){
rccurve <- as.data.table(ROC)
rccurve[,.(FPR = FPR)]
})))
I create a table FPRlist containing every FPR value that occurs in any of your curves. I then merge each curve with this table and use na.locf to fill in the missing values.
I use rbindlist to make one table, with an ID for each ROC curve:
results <- rbindlist(lapply(seq(roc_curves_list_logisitic),function(idx){
rccurve <- as.data.table(roc_curves_list_logisitic[[idx]])
rccurve <- merge(FPRlist,rccurve,all = T)
rccurve[,TPR := na.locf(TPR,na.rm = F)] # I complete the values
rccurve[,ID := idx] # I create an ID
rccurve
}))
I then calculate the max and min across all IDs (all ROC curves) for each FPR step:
resultmax <- results[,.(TPR = max(TPR)),by = FPR]
resultmin <- results[,.(TPR = min(TPR)),by = FPR]
And plot it the same way you plot it
library(ggplot2)
ggplot()+
geom_line(data = results,aes(FPR,TPR,color = as.factor(ID)))+
theme_light() %+replace% theme(legend.position = "none")+
geom_line(data = resultmax,aes(FPR,TPR),color = "red",size = 1)+
geom_line(data = resultmin,aes(FPR,TPR),color = "red",size = 1)
I leave the dplyr translation to dplyr users, since I am not familiar with it.
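For readers who want neither data.table nor zoo, the same idea can be sketched in base R. This is not a translation of the code above but a simplified variant: instead of merging and carrying values forward with na.locf, it takes, for each curve, the highest TPR reached at or below each FPR on the common grid. It assumes every curve includes a point at FPR = 0 (as the curves in the question do). The toy curves and the helper name roc_on_grid are made up for illustration.

```r
# Two toy ROC curves with different FPR grids (made-up values)
curves <- list(
  data.frame(FPR = c(0, 0, 0.5, 1),     TPR = c(0, 0.5, 0.5, 1)),
  data.frame(FPR = c(0, 0.25, 0.25, 1), TPR = c(0, 0, 1, 1))
)

# Common grid: every FPR value that occurs in any curve
grid <- sort(unique(unlist(lapply(curves, function(roc) roc$FPR))))

# Hypothetical helper: step-function value of one curve at each grid point,
# i.e. the highest TPR reached at or below that FPR
roc_on_grid <- function(roc, grid) {
  vapply(grid, function(g) max(roc$TPR[roc$FPR <= g]), numeric(1))
}

# One column per curve, one row per grid point
tpr <- sapply(curves, roc_on_grid, grid = grid)

upper_bound_roc_curves <- data.frame(FPR = grid, TPR = apply(tpr, 1, max))
lower_bound_roc_curves <- data.frame(FPR = grid, TPR = apply(tpr, 1, min))
```

With the real input, curves would be roc_curves_list_logisitic, and the two resulting data frames can be passed to geom_line() like resultmax and resultmin above.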
Edit
I modified my plot to compare it with a plot of all the raw ROC curves, without any merge or na.locf. One can see that the red lines I propose do follow the max and the min of all curves. The second plot is obtained as follows:
results2 <- rbindlist(lapply(seq(roc_curves_list_logisitic),function(idx){
rccurve <- as.data.table(roc_curves_list_logisitic[[idx]])
rccurve[,ID := idx] # I create an ID
rccurve
}))
p2 <- ggplot()+
geom_line(data = results2,aes(FPR,TPR,color = as.factor(ID)))+
theme_light() %+replace% theme(legend.position = "none")
It just plots all the ROC curves contained in the list provided in the OP's question. The two-column plot is obtained with the multiplot function (see here).
Here are my code and plot results. Due to some outliers, the x-axis is very long. Is there a simple way to filter df$foo to only the 0-90% or 0-95% percentile range in R, so that I can plot only the normal values? Thanks.
df <- read.csv('~/Downloads/foo.tsv', sep='\t', header=F, stringsAsFactors=FALSE)
names(df) <- c('a', 'foo', 'goo')
df$foo <- as.numeric(df$foo)
goodValue <- df$foo
summary(goodValue)
hist(goodValue,main="Distribution",xlab="foo",breaks=20)
Maybe this is what you're looking for?
a = c(rnorm(99), 50) #create some data
quant <- as.numeric(quantile(a, c(0, 0.9))) #get 0 and 0.9 quantile
hist(a[a > quant[1] & a < quant[2]]) #histogram only data within these bounds
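The same idea applied to a data frame column, as a sketch: it keeps everything up to the 95th percentile and leaves the lower end untouched. The data here is simulated (99 normal values plus one outlier), and the column name foo only mirrors the question.

```r
set.seed(1)
df <- data.frame(foo = c(rnorm(99), 50))  # simulated data with one outlier

# Upper cut-off at the 95th percentile; na.rm guards against missing values
upper <- quantile(df$foo, 0.95, na.rm = TRUE)

# Keep non-missing rows at or below the cut-off
trimmed <- df[!is.na(df$foo) & df$foo <= upper, , drop = FALSE]

hist(trimmed$foo, main = "Distribution", xlab = "foo", breaks = 20)
```

Changing 0.95 to 0.9 gives the 0-90% version.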
Suppose you wanted to examine the diamonds. (I don't have your data)
library(ggplot2)
library(dplyr)
diamonds %>% ggplot() + geom_histogram(aes(x = price))
You might decide to examine the deciles of your data and, since the tail is not of interest to you, throw away the uppermost decile. You could do that as follows, with a free scale so that you can see what is happening within each decile.
diamonds %>% mutate(ntile = ntile(price, 10)) %>%
filter(ntile < 10) %>%
ggplot() + geom_histogram(aes(x = price)) +
facet_wrap(~ntile, scales = "free_x")
But be cautious: although seeing your data at a much finer granularity has its benefits, notice that you can now barely tell that your data is roughly exponentially distributed (with a heavy tail, as commodity price data often is).
I'm sure this is easy, but I've been tearing my hair out trying to find out how to do this in R.
I have some data that I am trying to fit to a power law distribution. To do this, you need to plot the data on a log-log cumulative probability chart. The y-axis is the LOG of the frequency of the data (or log-probability, if you like), and the x-axis is the log of the values. If it's a straight line, then it fits a power law distribution, and the gradient determines the power law parameter.
If I want the frequency of the data, I can just use the ecdf() function:
My data set is called Profits.negative, and it's just a long list of trading profits that were less than zero (and I've notionally converted them all to positive numbers to avoid logging problems later on).
So I can type
plot(ecdf(Profits.negative))
And I get a handy empirical CDF function plotted. All I need to do is to convert both axes to log scales. I can do the x-axis:
Profits.negative.logs <- log(Profits.negative)
plot(ecdf(Profits.negative.logs))
Almost there! I just need to work out how to log the y-axis! But I can't seem to do it, and I can't work out how to extract the figures from the ecdf object. Can anyone help?
I know there is a power.law.fit function, but that just estimates the parameters - I want to plot the data and see if it lines up.
You can fit and plot power-laws using the poweRlaw package. Here's an example. First we generate some data from a heavy tailed distribution:
set.seed(1)
x = round(rlnorm(100, 3, 2)+1)
Next we load the package and create a data object and a displ object:
library(poweRlaw)
m = displ$new(x)
We can estimate xmin and the scaling parameter:
est = estimate_xmin(m)
and set the parameters
m$setXmin(est[[2]])
m$setPars(est[[3]])
Then plot the data and add the fitted line:
plot(m)
lines(m, col=2)
To get:
Data generation first (your part, actually ;)):
set.seed(1)
Profits.negative <- runif(1e3, 50, 100) + rnorm(1e2, 5, 5)
Logging and ecdf:
Profits.negative.logs <- log(Profits.negative)
fn <- ecdf(Profits.negative.logs)
ecdf returns a function, and if you want to extract something from it, it's a good idea to look into the function's closure:
ls(environment(fn))
# [1] "f" "method" "n" "nobs" "x" "y" "yleft" "yright"
Well, now we can access x and y:
x <- environment(fn)$x
y <- environment(fn)$y
That is probably what you need. Indeed, plot(fn) and plot(x,y,type="l") show virtually the same results. To log the y-axis you just need:
plot(x,log(y),type="l")
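An alternative sketch that avoids reaching into the closure: knots() returns the jump points of an ecdf, and calling the ecdf at those points gives the cumulative probabilities. Passing log = "xy" to plot() then puts both axes on a log scale without transforming the data first. The data below is simulated and merely stands in for Profits.negative. For a power-law check, people often plot the complementary CDF (1 minus the ECDF) instead, so that variant is included as well.

```r
set.seed(1)
profits <- rlnorm(1000, meanlog = 3, sdlog = 2)  # simulated stand-in data, all positive

fn <- ecdf(profits)
x <- knots(fn)   # sorted unique data values (the jump points)
p <- fn(x)       # cumulative probability at each jump point

# ECDF with both axes on a log scale
plot(x, p, log = "xy", type = "s", xlab = "value", ylab = "P(X <= x)")

# Complementary CDF; the last point is exactly 0 and cannot be drawn on a log axis
ccdf <- 1 - p
keep <- ccdf > 0
plot(x[keep], ccdf[keep], log = "xy", type = "s", xlab = "value", ylab = "P(X > x)")
```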
Here is an approach using ggplot2:
library(ggplot2)
library(scales) # provides trans_breaks(), trans_format() and math_format()
# data
set.seed(1)
x = round(rlnorm(100, 3, 2)+1)
# organize data into a df
df <- data.frame(x = sort(x, decreasing = TRUE),
pk = ecdf(x)(x),
k = seq_along(x))
# plot
ggplot(df, aes(x=k, y= pk)) + geom_point(alpha=0.5) +
coord_trans(x = 'log10', y = 'log10') +
scale_x_continuous(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) +
scale_y_continuous(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))