stat_ecdf without reaching 1 (right censored data) - r

I'm using ggplot2's stat_ecdf to plot the cumulative distribution of times to event. However, some events are right-censored (they have not yet occurred), so I want the cumulative distribution not to reach 1. As described in the documentation, the na.rm option is basically useless here. Do I have to code my own ecdf...?
MWE
The problem is that stat_ecdf has no option to treat NA as censored, so the distribution always reaches 1, even when the x-limits are adjusted (which seems plainly incorrect).
library('ggplot2')
set.seed(0)
x = rbinom(8,20,.5) # true all data
# x[x>10] = NA # "censoring" as NA - same problem
g = ggplot(data.frame(x=x),aes(x=x)) +
  stat_ecdf(na.rm=TRUE) + # na.rm does nothing
  lims(x=c(0,10)) # "censoring" as xlim - the cdf is re-scaled to the visible data!?
  # lims(x=c(0,20)) # no "censoring" - correct
print(g)

Here is one solution using after_stat, though it has several limitations. It works by re-scaling the y values by the ratio of data points present in the layer to the number expected. It solves both the NA and the lims problems.
MWE
library('ggplot2')
set.seed(0)
N = 8
x = rbinom(N,20,.5) + runif(N,-1e-9,+1e-9) # add noise - need a better solution...
x[x>10] = NA # "censoring" as NA
g = ggplot(data.frame(x=x),aes(x=x)) +
  stat_ecdf(aes(y=after_stat(
    unlist(lapply(split(y,list(group,PANEL)),function(y){ # split by groups, facets
      y * (length(y)-2) / N # core solution: re-scale by observed / expected count
    }))
  ))) +
  lims(x=c(0,10)) # "censoring" as xlim
  # lims(x=c(0,20)) # no "censoring"
# print(layer_data(last_plot())) # DEBUG helper
print(g)
Limitations
Internally, y already has the NAs removed and only includes one value per unique x. As a result...
We need to know N from outside the scope in which after_stat is evaluated. This becomes a pain if N differs per group / facet.
Duplicate x values also reduce the length of y, even though they are not NA. My workaround for now is to add tiny noise to the x data (runif) before plotting, but this is obviously ugly (the sketch below avoids it).
The solution assumes pad = TRUE (which adds -Inf and +Inf to the data), which is why we use length(y)-2 rather than length(y); adjust for your case.
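Given these limitations, here is a sketch of an alternative that sidesteps all three: compute the censored ECDF outside of ggplot and draw it with geom_step. It assumes censored observations are coded as NA and that N is the full sample size; per-group or per-facet versions would build the same data frame for each subset before plotting.
library('ggplot2')
set.seed(0)
N = 8
x = rbinom(N,20,.5)
x[x>10] = NA # "censoring" as NA
xs = sort(x) # sort() drops NA; no noise hack needed for duplicates
# each observed value contributes 1/N, so the curve never reaches 1
d = data.frame(x = c(0, xs), y = c(0, seq_along(xs)/N))
g = ggplot(d, aes(x=x, y=y)) +
  geom_step() +
  lims(x=c(0,10), y=c(0,1))
print(g)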
Thanks to this answer for mentioning layer_data(last_plot()), which made this solution much easier to develop.

The empirical cumulative distribution function applies to the data that you provide. I don't think it can create a distribution for unknown data, or for data that do not exist yet.
Right now you have 8 data points. How many more data points would there be until the event?
If you knew that there would be 20 data points in total, and assuming that the new values are all lower or higher than what you already have, then you could estimate what your ecdf would look like. In that case you can re-scale your problem: instead of reaching 1, the curve would reach 8/20 = 0.4. There are different workarounds to re-scale either your y-axis or your fitted data.
But if you do not know how many more data points there will be until the event, you cannot just decide where you currently are on the cumulative distribution function. You know it shouldn't be 1, but where would it be?
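For illustration, a minimal base-R sketch of that re-scaling, assuming the 8 observed values are the smallest of a known total of N = 20:
set.seed(0)
x <- rbinom(8, 20, .5) # the 8 observed event times
N <- 20                # assumed known total number of events
xs <- sort(x)
# each observed point contributes 1/N, so the curve tops out at 8/20 = 0.4
plot(xs, seq_along(xs)/N, type = "s", ylim = c(0, 1),
     xlab = "time to event", ylab = "scaled ECDF")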

Related

R- how to plot cumulative lines

My problem is the following:
I have to plot a curve showing the number of breakdowns (y) against service life (x), but cumulatively - and that's the point where I struggle!!
The desired result is shown in the second picture, my code in the first (I think only the plot type needs to change).
[image: my code]
[image: solution]
Thanks so much for any help!!
I can't replicate your data, so this is more of a comment than a complete solution.
# assuming h is a histogram object of the service-life data, e.g.
# h <- hist(service_life, plot = FALSE)  # 'service_life' is a placeholder name
n <- sum(h$counts) # this should sum to the number of observations
y <- cumsum(h$counts) / n # your cumulative y values
x <- h$mids # I assume these are your x-axis values, but this might need an edit
plot(x = x, y = y, type = "l")
Finally, you can add the vertical and horizontal lines via the abline() function at the respective points.
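For example (hypothetical cut-off; pick the levels that matter for your analysis):
abline(h = 0.5, lty = 2)                   # horizontal line at 50% of breakdowns
abline(v = x[which(y >= 0.5)[1]], lty = 2) # service life at which 50% is reached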

Histogram and density in R

Who can explain this to me?
If I run the following
repet <- 10000
size <- 100
p <- .5
data <- (rbinom(repet, size, p) - size * p) / sqrt(size * p * (1-p))
hist(data, freq = FALSE)
x = seq(min(data) - 1, max(data) + 1, .01)
lines(x, dnorm(x), col='green', lwd = 4)
then I get reasonable agreement of the histogram and the theoretical density (due to the Central Limit Theorem).
If I use
hist(data, breaks = 100, freq = FALSE)
the histogram is significantly different from the theoretical density.
This change in behavior happens when I increase the number of breaks from 51 to 52. Why does it happen?
It has to do with the fact that the data you are generating from rbinom isn't continuous; it's discrete. There are only ~35 distinct values in there (with set.seed(15); check length(unique(data))). When you force the histogram to have 100 breaks, many of those bins end up being empty:
sum(hist(data, breaks = 100, freq = FALSE)$counts==0)
# [1] 36
So if you'll notice, the second histogram has a bar, then a space (for a bar with height 0), repeating. The total area under the curve has to be the same for both histograms, but because the bars in the second plot are half as wide, they need to be twice as tall.
The point of all of this is to be careful when using histograms with discrete data; they are intended for continuous data. Also, the number of bins you choose can make a big difference in interpretation. If you change the defaults, you should have a very good reason to do so.
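Since the data are discrete, one sketch of a display that avoids binning entirely is to plot the relative frequency of each distinct value (plot() on a table object draws spike-style bars):
# relative frequency of each distinct value, drawn as vertical spikes
plot(table(data)/length(data), ylab = "relative frequency")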
Look at the values in data -- the precision is limited to tenths of a unit. Therefore, if you have too many bins, some of the bins will fall between the data points and will have a zero hit count. The others will have a correspondingly higher density.
In your experiments, there is a discontinuous effect because breaks "is a suggestion only; the breakpoints will be set to pretty values".
You can override the arbitrary behavior of breaks by precisely specifying the breaks with a vector. I demonstrate that below, along with a more direct (integer-based) histogram of the binomial results:
probability=0.5 ## probability of success per trial
trials=14 ## number of trials per result
reps=1e6 ## number of results to generate (data size)
## generate histogram of random binomial results
data <- rbinom(reps,trials,probability)
offset = 0.5 ## center histogram bins around integer data values
window <- seq(0-offset,trials+offset) ## create the vector of 'breaks'
hist(data,breaks=window)
## demonstrate the central limit theorem with a predictive curve over the histogram
population_variance = probability*(1-probability) ## from model of Bernoulli trials
prediction_variance <- population_variance / trials
y <- dnorm(seq(0,1,0.01),probability,sqrt(prediction_variance))
lines(seq(0,1,0.01)*trials,y*reps/trials, col='green', lwd=4)
Regarding the first chart shown in the question: using repet <- 10000, the histogram should be very close to normal (by the law of large numbers the shape converges), and running the same experiment repeatedly (or increasing repet further) doesn't change the shape much, despite the explicit randomness. The apparent raggedness in the first chart is also an artifact of the breaks behavior discussed here. To put it more plainly: both charts shown in the question are very wrong (because of breaks).

Set ylim() automatically

Here is some data to work with.
df <- data.frame(x1=c(234,543,342,634,123,453,456,542,765,141,636,3000),x2=c(645,123,246,864,134,975,341,573,145,468,413,636))
If I plot these data, it will produce a simple scatter plot with an obvious outlier:
plot(df$x2,df$x1)
Then I can always write the code below to remove the y-axis outlier(s).
plot(df$x2,df$x1,ylim=c(0,800))
So my question is: is there a way to exclude obvious outliers in scatterplots automatically, like outline=F would do if I were plotting, say, boxplots? To my knowledge, outline=F doesn't work with scatterplots.
This is relevant because I have hundreds of scatterplots and I want to exclude all obvious outlying data points without setting ylim(...) for each individual scatterplot.
You could write a function that returns the index of what you define as an obvious outlier. Then use that function to subset your data before plotting.
Here all observations with "a" exceeding 5 * median of "a" are excluded.
df <- data.frame(a = c(1,3,4,2,100), b=c(1,3,2,4,2))
f <- function(x){
  which(x$a > 5*median(x$a)) # row indices of "obvious outliers"
}
# note: if f(df) is empty, df[-integer(0),] drops every row,
# so guard against that case in real use
with(df[-f(df),], plot(b, a))
There is no easy yes/no option to do what you are looking for (the question of defining what is an "obvious outlier" for a generic scatterplot is potentially quite problematic).
That said, it should not be too difficult to write a reasonable function to give y-axis limits from a set of data points. If we take "obvious outlier" to mean a point with y value significantly above or below the bulk of the sample (which could be justified assuming a sufficient distribution of x values), then you could use something like:
ybounds <- function(y){ # y is the response variable in the dataframe
  bounds = quantile(y, probs=c(0.05, 0.95), type=3, names=FALSE)
  return(bounds + c(-1,1) * 0.1 * (bounds[2]-bounds[1]))
}
Then plot each dataframe with plot(df$x, df$y, ylim=ybounds(df$y))
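With the example dataframe from the top of the question, that looks like:
# the outlier at 3000 no longer stretches the y-axis
plot(df$x2, df$x1, ylim = ybounds(df$x1))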

Find local minimum in bimodal distribution with r

My data are pre-processed image data and I want to separate two classes. In theory (and hopefully in practice) the best threshold is the local minimum between the two peaks of the bimodally distributed data.
My testdata is: http://www.file-upload.net/download-9365389/data.txt.html
I tried to follow this thread:
I plotted the histogram and calculated the kernel density function:
datafile <- read.table("....txt")
data <- datafile$V1
hist(data)
d <- density(data) # returns the density estimate with defaults
hist(data, prob=TRUE)
lines(d) # plots the result
But how to continue?
I would calculate the first and second derivatives of the density function to find the local extrema, specifically the local minimum. However, I have no idea how to do this in R, and density(data) does not seem to be an ordinary function. So please help me: how can I calculate the derivatives and find the local minimum of the pit between the two peaks of density(data)?
There are a few ways to do this.
First, using d for the density as in your question, d$x and d$y contain the x and y values for the density plot. The minimum occurs when the derivative dy/dx = 0. Since the x-values are equally spaced, we can estimate dy using diff(d$y), and seek d$x where abs(diff(d$y)) is minimized:
d$x[which.min(abs(diff(d$y)))]
# [1] 2.415785
The problem is that dy/dx = 0 at the maxima of the density curve as well. In this case the minimum is shallow and the maxima are peaked, so it happens to work, but you can't count on that.
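If you would rather not rely on that, a small sketch that keeps only true local minima, i.e. the points where the estimated derivative switches from negative to positive:
dy <- diff(d$y)
minima <- which(diff(sign(dy)) == 2) + 1 # -1 -> +1 transitions of dy/dx
d$x[minima]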
So a second way uses optimize(...) which seeks a local minimum in a given interval. optimize(...) needs a function as argument, so we use approxfun(d$x,d$y) to create an interpolation function.
optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
# [1] 2.415791
Finally, we show that this is indeed the minimum:
hist(data,prob=TRUE)
lines(d, col="red", lty=2)
v <- optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
abline(v=v, col="blue")
Another approach, which is actually preferable, uses k-means clustering.
df <- read.csv("data.txt", header=FALSE)
colnames(df) = "X"
# bimodal
km <- kmeans(df,centers=2)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x=X)) +
  geom_histogram(aes(fill=clust, y=..count../sum(..count..)),
                 binwidth=0.5, color="grey50") +
  stat_density(geom="line", color="red")
The data actually looks more trimodal than bimodal.
# trimodal
km <- kmeans(df,centers=3)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x=X)) +
  geom_histogram(aes(fill=clust, y=..count../sum(..count..)),
                 binwidth=0.5, color="grey50") +
  stat_density(geom="line", color="red")
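If the end goal is a single threshold between two classes, one sketch is to cut halfway between adjacent clusters (this assumes the two-cluster fit from above):
km <- kmeans(df$X, centers=2)
lower <- which.min(km$centers)
# in 1-D, k-means clusters are contiguous intervals, so cut halfway between
# the largest value of the lower cluster and the smallest of the upper one
threshold <- (max(df$X[km$cluster == lower]) +
              min(df$X[km$cluster != lower])) / 2
threshold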

Uniform plot points in R -- Research / HW

This is for research I am doing for my Masters program in Public Health.
I am graphing data against each other, a standard x,y type deal, and over top of that I am plotting a predicted line. I get what I think is the funkiest-looking point/boxplot hybrid ever, with an x-axis that is only half filled out, and I don't understand why, since I never call a boxplot function. My understanding was that the plot function would only plot points.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of the TOTAL.LACE is from 0 to 19 and DAYS.TO.FAILURE is 0 - 30
My code is as follows, maybe it is something before the plot but I don't think it is:
# To control the type of symbol we use we will use psymbol; it takes
# values 1 and 2
psymbol <- unique(FAILURE + 1)
# Build a test frame that will predict values of the lace score due to
# a patient being in a state of failure
library(survival) # needed for survreg() and Surv()
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
                dist = "logistic")
pred <- predict(test, type = "response") # produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch = unique(psymbol)) # produces the goofy graph
lines(tl_ord, pred_ord) # this produces the line, not the boxplots
Here is the resulting picture
Not too sure how to proceed from here; this is an offshoot of another problem I had with the same data set, at this link here. I don't understand why boxplots are being drawn, since I never explicitly called the boxplot() command. When I issue plot(DAYS.TO.FAILURE, TOTAL.LACE) I only get points on the resulting plot, as I expected, but when I swap which variable is plotted on x and y, the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem, as pointed out by @DWin et al.: Some Sample Data
Thank you,
Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0) # ~99% of values become 0
And here is the plot:
First, the problem with some of the categories not being printed on the x-axis is that they don't fit. When you have so many categories, to make them all fit you have to display them in a smaller font. The way to do this is to set cex.axis to a value < 1 (you can read more about this here):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
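As an aside, you can inspect the exact numbers behind any one of those boxplots with boxplot.stats() (the group picked here is arbitrary):
# $stats holds the lower whisker, lower hinge, median, upper hinge, and
# upper whisker; $out lists the points drawn individually beyond the whiskers
boxplot.stats(DAYS.TO.FAILURE[TOTAL.LACE == 10])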
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
Evidently, you have converted TOTAL.LACE to a factor without meaning to. Presumably this was done in the pch=unique(psymbol) argument via the code psymbol <- unique(FAILURE + 1) above. Although I haven't had time to try this, I suspect eliminating that line of code and using pch=(FAILURE + 1) will accomplish your goals.
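If so, a sketch of the fix (this assumes TOTAL.LACE is currently a factor):
# convert factor -> character -> numeric; as.numeric() alone would return
# the internal integer codes rather than the original values
plot(as.numeric(as.character(TOTAL.LACE)), DAYS.TO.FAILURE,
     pch = FAILURE + 1)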
