Here is some data to work with.
df <- data.frame(x1=c(234,543,342,634,123,453,456,542,765,141,636,3000),x2=c(645,123,246,864,134,975,341,573,145,468,413,636))
If I plot these data, it will produce a simple scatter plot with an obvious outlier:
plot(df$x2,df$x1)
Then I can always write the code below to remove the y-axis outlier(s).
plot(df$x2,df$x1,ylim=c(0,800))
So my question is: Is there a way to exclude obvious outliers in scatterplots automatically, the way outline=F would if I were plotting, say, boxplots? To my knowledge, outline=F doesn't work with scatterplots.
This is relevant because I have hundreds of scatterplots and I want to exclude all obvious outlying data points without setting ylim(...) for each individual scatterplot.
You could write a function that returns the index of what you define as an obvious outlier. Then use that function to subset your data before plotting.
Here all observations with "a" exceeding 5 * median of "a" are excluded.
df <- data.frame(a = c(1,3,4,2,100), b=c(1,3,2,4,2))
f <- function(x){
  which(x$a > 5*median(x$a))
}
with(df[-f(df),], plot(b, a))
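Since you have hundreds of scatterplots, the same index function can be applied in a loop. A minimal self-contained sketch, assuming your dataframes are collected in a list and each has an `a` column (note the guard: subsetting with an empty `-which(...)` index would drop every row):

```r
# hypothetical list of dataframes, each with columns "a" and "b"
dfs <- list(
  data.frame(a = c(1, 3, 4, 2, 100), b = c(1, 3, 2, 4, 2)),
  data.frame(a = c(5, 6, 7, 200),    b = c(2, 3, 1, 5))
)

# index of rows considered obvious outliers
f <- function(x) which(x$a > 5 * median(x$a))

for (d in dfs) {
  out <- f(d)
  # only subset when something was flagged: d[-integer(0), ] would drop all rows
  if (length(out) > 0) d <- d[-out, ]
  plot(d$b, d$a)
}
```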
There is no easy yes/no option to do what you are looking for (the question of defining what is an "obvious outlier" for a generic scatterplot is potentially quite problematic).
That said, it should not be too difficult to write a reasonable function to give y-axis limits from a set of data points. If we take "obvious outlier" to mean a point with y value significantly above or below the bulk of the sample (which could be justified assuming a sufficient distribution of x values), then you could use something like:
ybounds <- function(y){  # y is the response variable in the dataframe
  bounds <- quantile(y, probs=c(0.05, 0.95), type=3, names=FALSE)
  return(bounds + c(-1,1) * 0.1 * (bounds[2]-bounds[1]))
}
Then plot each dataframe with plot(df$x, df$y, ylim=ybounds(df$y))
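Put together, a loop over many dataframes could look like the sketch below (self-contained, with `ybounds` written to work on whatever vector it is given; the dataframes and their `x`/`y` columns are hypothetical):

```r
# clip the y-axis to the 5th-95th percentiles, padded by 10% of that range
ybounds <- function(y) {
  bounds <- quantile(y, probs = c(0.05, 0.95), type = 3, names = FALSE)
  bounds + c(-1, 1) * 0.1 * (bounds[2] - bounds[1])
}

# hypothetical list of dataframes, each with columns "x" and "y"
dfs <- list(
  data.frame(x = 1:12,
             y = c(234, 543, 342, 634, 123, 453, 456, 542, 765, 141, 636, 3000))
)

# the y-limits are computed per dataframe, so the 3000 outlier falls off the plot
for (d in dfs) plot(d$x, d$y, ylim = ybounds(d$y))
```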
I'm using ggplot2's stat_ecdf to plot the cumulative distribution of times to event. However, some events are right-censored (not yet occurred), so I want the cumulative distribution to not reach 1. As described in the documentation, the na.rm option is basically useless. Do I have to code my own ecdf...?
MWE
The problem is that stat_ecdf includes no option to treat NA as censored, so the distribution always reaches 1. Even when the x-limits are adjusted, the ECDF is simply re-scaled to the visible data, which seems plainly incorrect.
library('ggplot2')
set.seed(0)
x = rbinom(8,20,.5) # true all data
# x[x>10] = NA # "censoring" as NA - same problem
g = ggplot(data.frame(x=x),aes(x=x)) +
  stat_ecdf(na.rm=TRUE) + # na.rm does nothing
  lims(x=c(0,10)) # "censoring" as xlim - cdf is re-scaled to the visible data!?
# lims(x=c(0,20)) # no "censoring" - correct
print(g)
Here is one solution using after_stat, but it has several limitations. It works by re-scaling y values by the number of data points within the layer vs the number expected. It solves both the NA and lims problems.
MWE
library('ggplot2')
set.seed(0)
N = 8
x = rbinom(N,20,.5) + runif(N,-1e-9,+1e-9) # add noise - need a better solution...
x[x>10] = NA # "censoring" as NA
g = ggplot(data.frame(x=x),aes(x=x)) +
  stat_ecdf(aes(y=after_stat(
    unlist(lapply(split(y,list(group,PANEL)),function(y){ # for groups, facets
      y * (length(y)-2) / N # core solution
    })) # for groups, facets
  ))) +
  lims(x=c(0,10)) # "censoring" as xlim
# lims(x=c(0,20)) # no "censoring"
# print(layer_data(last_plot())) # DEBUG helper
print(g)
Limitations
Internally, y already has NAs removed, and only includes data for unique values of x. As a result...
We need to know N from outside the scope of where after_stat is evaluated. This becomes a pain if N is different per group / facet.
Duplicate x values also reduce the length of y, even when no NA is involved. My workaround for now is to add a tiny amount of noise to the x data (runif) before plotting, but this is obviously ugly.
Solution assumes that pad = TRUE (adds -Inf,+Inf to the data), which is why we use length(y)-2 not length(y), but you can adjust for your case.
Thanks
To this answer for mentioning layer_data(last_plot()), which made this solution much easier to develop.
The empirical cumulative distribution function applies to the data that you provide. I don't think it can create a distribution for unknown data or for data that do not exist yet.
Right now you have 8 data points. How many more data points would there be until the event?
If you knew that there would be 20 data points in total, and assuming that the new values are all lower or higher than what you already have, then you could estimate what your ecdf would look like. In that case, you can re-scale your problem so that instead of reaching 1, it reaches 8/20 = 0.4. There are different workarounds to re-scale either your y-axis or the fitted data.
But if you do not know how many more data points there are until the event, you can't just decide where you currently are in the cumulative distribution function. You know it shouldn't be 1, but where would it be then?
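If the total number of observations really is known, the re-scaling can be done directly by computing the ECDF by hand. A minimal base-R sketch, assuming 20 planned observations of which the 8 simulated values have been seen (N_total is an assumption, not something the data can tell you):

```r
set.seed(0)
N_total <- 20                       # assumed total number of eventual observations
x <- rbinom(8, 20, .5)              # the 8 observed (uncensored) times

# manual ECDF, dividing by N_total instead of length(x),
# so the curve tops out at 8/20 = 0.4 rather than 1
xs <- sort(unique(x))
ys <- sapply(xs, function(v) sum(x <= v)) / N_total

plot(xs, ys, type = "s", ylim = c(0, 1), xlab = "x", ylab = "F(x)")
```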
So, I've spent the last four hours trying to find an efficient way of plotting the curve(s) of a function with two variables - to no avail. The only answer that I could actually put to practice wasn't producing a multiple-line graph as I expected.
I created a function with two variables, x and y, and it returns a continuous numeric value. I wanted to plot in a single screen the result of this function with certain values of x and all possible values of y within a given range (y is also a continuous variable).
Something like this:
These two questions did help a little, but I still can't get there:
Plotting a function curve in R with 2 or more variables
How to plot function of multiple variables in R by initializing all variables but one
I also used the mosaic package and plotFun function, but the results were rather unappealing and not very readable: https://www.youtube.com/watch?v=Y-s7EEsOg1E.
Maybe the problem is my lack of proficiency with R - though I've been using it for months so I'm not such a noob. Please enlighten me.
Say we have a simple function with two arguments:
fun <- function(x, y) 0.5*x - 0.01*x^2 + sqrt(abs(y)/2)
And we want to evaluate it on the following x and y values:
xs <- seq(-100, 100, by=1)
ys <- c(0, 100, 300)
This line below might be a bit hard to understand but it does all of the work:
res <- mapply(fun, list(xs), ys)
mapply allows us to run a function across multiple sets of argument values. Here we provide only one value for the "x" argument (note that xs is a long vector, but since it is wrapped in a list it counts as a single instance, which mapply recycles). We also provide three values for the "y" argument. So the function runs 3 times, each time with the same vector of x values and a different value of y.
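To see what mapply is doing here, the call is equivalent to evaluating fun once per element of ys with xs held fixed (repeating the definitions so the check is self-contained):

```r
fun <- function(x, y) 0.5*x - 0.01*x^2 + sqrt(abs(y)/2)
xs <- seq(-100, 100, by = 1)
ys <- c(0, 100, 300)

res <- mapply(fun, list(xs), ys)

# the same result, written as an explicit loop over ys
res2 <- sapply(ys, function(y) fun(xs, y))

dim(res)  # 201 rows (one per x value), 3 columns (one per y value)
```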
Results are arranged column-wise so in the end we have 3 columns. Now we only have to plot:
cols <- c("black", "cornflowerblue", "orange")
matplot(xs, res, col=cols, type="l", lty=1, lwd=2, xlab="x", ylab="result")
legend("bottomright", legend=ys, title="value of y", lwd=2, col=cols)
Here the matplot function does all the work - it plots a line for every column in the provided matrix. Everything else is decoration.
Here is the result:
I need to make a histogram of my variable, 'travel time'. Inside each histogram I need to plot the regression (correlation) data, i.e. my observed data vs. predicted, together with a y = x line. And I need to repeat this for different times of day and days of the week (in simple words, arrange a matrix of such figures using the par function). For now I can draw histograms and arrange them in matrix form, but I am stuck on the inside plot: plotting the x and y data together with the y = x line, and positioning each inset within its corresponding histogram panel in the matrix. How can I do that, as in the figure below? Any help would be appreciated. Thanks!
One way to do this is to loop over your data and create the desired plot on every iteration. Here is one not very polished example, but it shows the logic of how a small plot can be drawn over a larger plot. You will have to tweak the code to get it to work the way you need, but it shouldn't be that difficult.
# create some sample dataset (your x values)
a <- c(rnorm(100,0,1))
b <- c(rnorm(100,2,1))
# create their "y" values counterparts
x <- a + 3
y <- b + 4
# bind the data into two dataframes (explanatory variables in one, explained in the other)
data1 <- cbind(a,b)
data2 <- cbind(x,y)
# set dimensions of the plot matrix
par(mfrow = c(2,1))
# for each of the explanatory - explained pair
for (i in 1:ncol(data2)) {
  # set positioning of the histogram
  par("plt" = c(0.1, 0.95, 0.15, 0.9))
  # plot the histogram
  hist(data1[, i])
  # set positioning of the small plot
  par("plt" = c(0.7, 0.95, 0.7, 0.95))
  # plot the small plot over the histogram
  par(new = TRUE)
  plot(data1[, i], data2[, i])
  # add a y = x reference line to the small plot
  lines(data1[, i], data1[, i])
}
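One practical note: the par settings changed inside the loop persist after it finishes, so any later plots in the same session inherit them. A small sketch of the usual save-and-restore idiom:

```r
op <- par(no.readonly = TRUE)  # snapshot of the current graphical parameters

par(mfrow = c(2, 1))
# ... plotting code that also changes par("plt"), par(new = TRUE), etc. ...
hist(rnorm(100))

par(op)                        # restore everything when done
```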
I don't know how I can plot this in a better way.
I have
df1 <- data.frame(x=c(1,3,5), y=c(2,4,6))
df2 <- data.frame(x=c(2,6,10,12), y=c(1,4,7,15))
Those data frames have x as time and y as their own value, and they contain different numbers of elements.
I want to combine this data by x (time), but I need one of two methods to show them on one plot: a) show df1.y on the x axis of the plot to see the distribution of df2 by df1, so the two data frames are connected by time (x) but each is shown on one of the two axes, or b) show three axes, with the y axis for df1.y on the right side of the plot.
For a better terminology, I will rename your example variables according to your sample plots.
df1 <- data.frame(time=c(1,3,5), memory=c(2,4,6))
df2 <- data.frame(time=c(2,6,10,12), threads=c(1,4,7,15))
Your first plot:
From your description, I assume that you want to do the following: for each available time value, get the value of df1$memory and df2$threads. However, that value may not always be available. One suitable approach is to fill up missing values by linear interpolation. This may be done using the approx function:
merged.time <- sort(unique(c(df1$time, df2$time)))
merged.data <- data.frame(time = merged.time,
                          memory = approx(df1$time, df1$memory, xout=merged.time)$y,
                          threads = approx(df2$time, df2$threads, xout=merged.time)$y)
Note that approx(...)$y just extracts the interpolated data.
Plotting may now be done using standard plotting commands (or, as your tags suggest, using ggplot2):
ggplot(data=merged.data, aes(x=memory, y=threads)) + geom_line()
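With the sample data above, the interpolation behaves as follows. Note that approx (with the default rule = 1) returns NA outside the range of each input series, and geom_line simply drops those rows:

```r
df1 <- data.frame(time = c(1, 3, 5),      memory  = c(2, 4, 6))
df2 <- data.frame(time = c(2, 6, 10, 12), threads = c(1, 4, 7, 15))

merged.time <- sort(unique(c(df1$time, df2$time)))   # 1 2 3 5 6 10 12
mem <- approx(df1$time, df1$memory,  xout = merged.time)$y
thr <- approx(df2$time, df2$threads, xout = merged.time)$y

mem  # 2 3 4 6 NA NA NA        - NA beyond df1's last time point (5)
thr  # NA 1 1.75 3.25 4 7 15   - NA before df2's first time point (2)
```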
Your second plot
... is not possible with ggplot2. That is for numerous reasons, for example see here.
I'm trying to plot some data with 2d density contours using ggplot2 in R.
I'm getting one slightly odd result.
First I set up my ggplot object:
p <- ggplot(data, aes(x=Distance,y=Rate, colour = Company))
I then plot this with geom_points and geom_density2d. I want geom_density2d to be weighted based on the organisation's size (OrgSize variable). However when I add OrgSize as a weighting variable nothing changes in the plot:
This:
p+geom_point()+geom_density2d()
Gives an identical plot to this:
p+geom_point()+geom_density2d(aes(weight = OrgSize))
However, if I do the same with a loess line using geom_smooth, the weighting does make a clear difference.
This:
p+geom_point()+geom_smooth()
Gives a different plot to this:
p+geom_point()+geom_smooth(aes(weight=OrgSize))
I was wondering if I'm using density2d inappropriately, should I instead be using contour and supplying OrgSize as the 'height'? If so then why does geom_density2d accept a weighting factor?
Code below:
require(ggplot2)
Company <- c("One","One","One","One","One","Two","Two","Two","Two","Two")
Store <- c(1,2,3,4,5,6,7,8,9,10)
Distance <- c(1.5,1.6,1.8,5.8,4.2,4.3,6.5,4.9,7.4,7.2)
Rate <- c(0.1,0.3,0.2,0.4,0.4,0.5,0.6,0.7,0.8,0.9)
OrgSize <- c(500,1000,200,300,1500,800,50,1000,75,800)
data <- data.frame(Company,Store,Distance,Rate,OrgSize)
p <- ggplot(data, aes(x=Distance,y=Rate))
# Difference is apparent between these two
p+geom_point()+geom_smooth()
p+geom_point()+geom_smooth(aes(weight = OrgSize))
# Difference is not apparent between these two
p+geom_point()+geom_density2d()
p+geom_point()+geom_density2d(aes(weight = OrgSize))
geom_density2d is "accepting" the weight parameter, but then not passing it on to MASS::kde2d, since that function has no weights argument. As a consequence, you will need to use a different 2d-density method.
(I realize my answer is not addressing why the help page says that geom_density2d "understands" the weight argument, but when I have tried to calculate weighted 2D-KDEs, I have needed to use other packages besides MASS. Maybe this is a TODO that #hadley put in the help page that then got overlooked?)
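One crude workaround, since MASS::kde2d ignores weights, is to approximate the weighting by replicating each row in proportion to its weight before computing the density. A rough sketch using the question's data (the choice of replication factor and the rounding are arbitrary, and the result is only an approximation of a true weighted KDE):

```r
# the question's data
Distance <- c(1.5, 1.6, 1.8, 5.8, 4.2, 4.3, 6.5, 4.9, 7.4, 7.2)
Rate     <- c(0.1, 0.3, 0.2, 0.4, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
OrgSize  <- c(500, 1000, 200, 300, 1500, 800, 50, 1000, 75, 800)
data <- data.frame(Distance, Rate, OrgSize)

# crude integer weights: replicate each row in proportion to OrgSize
w <- round(data$OrgSize / min(data$OrgSize))
data.w <- data[rep(seq_len(nrow(data)), times = w), ]

# data.w can then be passed to ggplot(...) + geom_density2d() in place of data,
# so the density estimate sees heavier organisations more often
```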