R: scale the y-axis to fit the loess curves

I want to reduce the scale of the y-axis according to the loess function from ggplot2, even if it means that some raw points will not be displayed.
The problem is that I do not know beforehand what the maximum value of the loess function will be. All I have are the various raw data sets, and some of them occasionally have high peaks that squish all my loess curves to the bottom of the graph. I care more about displaying the loess curves than the raw data, but the raw data must still be displayed, at least the points near the loess curves.
In this example, the blue loess line never goes higher than 2, while there are many points between 2 and 3; so typically, I want the top of the y-axis to be 2 and ditch those extreme raw points.
library(ggplot2)
set.seed(34)
n <- 200
X <- runif(n) * 8
Y <- sin(3 * X) + cos(X^2) + rnorm(n, 0, 0.5)
myData <- data.frame(X, Y)
fit <- loess(Y ~ X, data = myData)
myData$pred <- predict(fit)
ggplot(myData, aes(X, Y)) +
  geom_point() +
  stat_smooth(method = "loess", se = FALSE, size = 3) +
  geom_line(aes(X, pred), colour = "yellow")

You were almost there; just add a coord_cartesian() call:
ggplot(myData, aes(X, Y)) +
  geom_point() +
  stat_smooth(method = "loess", se = FALSE, size = 3) +
  geom_line(aes(X, pred), colour = "yellow") +
  coord_cartesian(ylim = c(min(myData$pred) - 0.1, max(myData$pred) + 0.1))
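Note that coord_cartesian() only zooms the view, so the loess smooth is still fitted on all of the raw data; putting the limits on the scale instead would drop the extreme points before the smooth is computed. If you would rather not fit loess yourself, here is a rough sketch (my addition, not part of the original answer) that pulls the smoothed values ggplot2 computed via ggplot_build(); the [[2]] index assumes stat_smooth() is the second layer:
p <- ggplot(myData, aes(X, Y)) +
  geom_point() +
  stat_smooth(method = "loess", se = FALSE, size = 3)
smooth_y <- ggplot_build(p)$data[[2]]$y  # fitted values of the smooth layer
p + coord_cartesian(ylim = range(smooth_y) + c(-0.1, 0.1))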

Related

Plot bin averaged values with error bars in R

I have a dataframe with three columns: "DateTime", "T_ET", and "LAI". I want to plot T_ET (on the y-axis) against LAI (on the x-axis), along with 0.1-LAI-bin averages of T_ET on the same plot, something like the figure below (Wei et al., 2017):
In the figure above, the y-axis is T_ET, i.e. T/(E+T), the x-axis is LAI, the red open diamonds with error bars are the 0.1-LAI-bin averages of the black points with their standard deviations, the solid line is a regression of the individual data points (estimated from the bin averages), n is the number of available data points, and the dashed lines are the 95% confidence bounds.
How can I obtain a plot similar to the one above? Please find the sample data using the following link: file
Or use the following sample data:
df <- structure(list(DateTime = structure(c(1478088000, 1478347200, 1478692800, 1478779200, 1478865600, 1478952000, 1479124800, 1479211200, 1479297600, 1479470400), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
T_ET = c(0.996408350852751, 0.904748351479432, 0.28771236118773, 0.364402232484906, 0.452348409759872, 0.415408041501318, 0.629291202120187, 0.812083112145703, 0.992414777441755, 0.818032913071265),
LAI = c(1.3434, 1.4669, 1.6316, 1.6727, 1.8476, 2.0225, 2.3723, 2.5472, 2.7221, 3.0719)),
row.names = c(NA, 10L),
class = "data.frame")
You can do this directly while plotting via stat_summary_bin(). By default, the associated geom is pointrange and the summary function is mean_se(). bins= controls the number of bins, but you can also supply binwidth=. Note that with the pointrange geom, fatten controls the size of the central point:
ggplot(df, aes(LAI, T_ET)) + geom_point() + theme_classic() +
  stat_summary_bin(bins = 3, color = 'red', shape = 5, fatten = 5)
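Since your target figure uses 0.1-LAI bins and standard-deviation bars rather than standard errors, here is a hedged variation on the same idea; binwidth = 0.1 and the mean_sd() helper are my own assumptions for illustration, not part of the original answer:
# Sketch: 0.1-wide LAI bins with mean +/- 1 SD bars instead of the default mean_se()
mean_sd <- function(x) {
  m <- mean(x)
  s <- sd(x)
  data.frame(y = m, ymin = m - s, ymax = m + s)
}
ggplot(df, aes(LAI, T_ET)) + geom_point() + theme_classic() +
  stat_summary_bin(fun.data = mean_sd, binwidth = 0.1,
                   color = 'red', shape = 5, fatten = 5)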
Your sample data is a little light, so here's another example via the diamonds dataset. Here, I'm constructing the same look as the example plot you show by combining the errorbar and point geoms. Please note that setting the width of the errorbar apparently doesn't work correctly with stat_summary_bin().
ggplot(diamonds, aes(carat, price)) + geom_point(size = 0.3) +
  stat_summary_bin(geom = 'errorbar', color = 'red', bins = 12, width = 0.001) +
  stat_summary_bin(geom = 'point', size = 3, shape = 5, color = 'red', bins = 12) +
  theme_classic()
EDIT: Showing Regression for Binned Data
As indicated in the comments, drawing a regression line based on the binned data rather than the original data is possible, but not through the stat_summary_bin() function unless you are okay with using loess. If you're looking for linear regression, you'll need to bin the data outside of ggplot, then plot the regression on the binned data.
The reason for this is probably by design: it's inherently not a good idea to draw a regression line (a way of summarizing data) that is itself based on summarized data. Regardless, here's one way to do it via the diamonds dataset. We can use the cut() function to split the data into separate bins, then summarize on those binned values. Because of the way cut() labels its output, we have to create our own labels. Since we're cutting into 12 equal pieces in this example, I'm creating 12 evenly-spaced positions on the x-axis for the data values to sit in; this may differ in your case, so take care to label according to what the data represent and what makes the most statistical sense.
df <- diamonds
# setting interval labeling
bin_width <- diff(range(df$carat)) / 12
bin_labels <- range(df$carat)[1] + bin_width / 2 + (0:11) * bin_width
# cutting the data
df$bins <- cut(df$carat, breaks = 12, labels = bin_labels)
df$bins <- as.numeric(levels(df$bins)[df$bins])  # convert factor to numeric
ggplot(diamonds, aes(carat, price)) + geom_point(size = 0.3) +
  stat_summary_bin(geom = 'errorbar', color = 'red', bins = 12, width = 0.001) +
  stat_summary_bin(geom = 'point', size = 3, shape = 5, color = 'red', bins = 12) +
  geom_smooth(data = df, aes(x = bins), method = 'lm', color = 'blue') +
  theme_classic()
Note that the regression line above weights all binned values equally. This is generally not a good idea unless your data are spread evenly across the range. If you're going to draw a regression line, I'd still recommend linking it to the original data, which is much more representative of what is actually in your data. That would look like this:
ggplot(diamonds, aes(carat, price)) + geom_point(size = 0.3) +
  stat_summary_bin(geom = 'errorbar', color = 'red', bins = 12, width = 0.001) +
  stat_summary_bin(geom = 'point', size = 3, shape = 5, color = 'red', bins = 12) +
  geom_smooth(method = 'lm', color = 'green') +
  theme_classic()
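If you do decide to regress on the bin summaries anyway, one way to soften the equal-weighting issue mentioned above is to weight each bin mean by the number of observations it contains. This is my own sketch, not part of the original answer; the binned helper data frame and its column names are made up for illustration:
# Sketch: summarize price per bin, keep the bin counts, and pass them to lm
# through the weight aesthetic so well-populated bins pull the line harder.
binned <- do.call(rbind, lapply(split(df, df$bins), function(d) {
  data.frame(bins = d$bins[1], price = mean(d$price), n = nrow(d))
}))
ggplot(diamonds, aes(carat, price)) + geom_point(size = 0.3) +
  geom_smooth(data = binned, aes(x = bins, y = price, weight = n),
              method = 'lm', color = 'blue') +
  theme_classic()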
When it comes down to it, drawing a regression line for binned data is summarizing the summarized data rather than summarizing your original data. It's statistical heresy, so use it at your own risk. But if you simply must for whatever strange reason... I can't stop you. ;)

Displaying smoothed (convolved) densities with ggplot2

I'm trying to display some frequencies convolved with a Gaussian kernel in ggplot2. I tried smoothing the lines with:
+ stat_smooth(se = F,method = "lm", formula = y ~ poly(x, 24))
Without success.
I read an article suggesting the frequencies should be convolved with a Gaussian kernel, which ggplot2's stat_density function (http://docs.ggplot2.org/current/stat_density.html) seems able to produce.
However, I can't seem to replace my geometry with stat_density. Is there anything wrong with my code?
require(reshape2)
library(ggplot2)
library(RColorBrewer)
fileName = "/1.csv" # downloadable there: https://www.dropbox.com/s/l5j7ckmm5s9lo8j/1.csv?dl=0
mydata = read.csv(fileName,sep=",", header=TRUE)
dataM = melt(mydata,c("bins"))
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
ggplot(data = dataM, aes(x = bins, y = value, colour = variable)) +
  geom_line() +
  scale_x_continuous(limits = c(0, 2))
This code produces the following plot:
I'm looking at smoothing the lines a little bit, so they look more like this:
(from http://journal.frontiersin.org/Journal/10.3389/fncom.2013.00189/full)
Since my comments solved your problem, I'll convert them to an answer:
The density function takes individual measurements and calculates a kernel density distribution by convolution (Gaussian is the default kernel). For example, plot(density(rnorm(1000))). You can control the smoothness with the bw (bandwidth) parameter. For example, plot(density(rnorm(1000), bw=0.01)).
But your data frame is already a density distribution (analogous to the output of the density function). To generate a smoother density estimate, you need to start with the underlying data and run density on it, adjusting bw to get the smoothness where you want it.
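For example, if you do have the underlying samples, here is a minimal sketch of the ggplot2 version (with simulated data standing in for yours):
# Sketch: geom_density() convolves the raw values with a Gaussian kernel;
# 'adjust' (or 'bw') controls the bandwidth and therefore the smoothness.
set.seed(1)
raw <- data.frame(value = c(rnorm(500, 0.5, 0.2), rnorm(500, 1.3, 0.3)),
                  variable = rep(c("A", "B"), each = 500))
ggplot(raw, aes(value, colour = variable)) +
  geom_density(adjust = 0.5) +
  scale_x_continuous(limits = c(0, 2))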
If you don't have access to the underlying data, you can smooth out your existing density distributions as follows:
ggplot(data = dataM, aes(x = bins, y = value, colour = variable)) +
  geom_smooth(se = FALSE, span = 0.3) +
  scale_x_continuous(limits = c(0, 2))
Play around with the span parameter to get the smoothness you want.

How to reverse axis order and use a predefined scale in ggplot?

I've read a past post asking about using scale_reverse and scale_log10 at the same time. I have a similar issue, except the scale I'm trying to "reverse" is a predefined scale from the "scales" package. Here is my code:
## Defining y-breaks for the probability scale
ybreaks <- c(1, 2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 98, 99) / 100
# Random numbers and their corresponding Weibull probability values (which I'm trying to plot)
x <- c(.3637, .1145, .8387, .9521, .330, .375, .139, .662, .824, .899)
p <- c(.647, .941, .255, .059, .745, .549, .853, .451, .352, .157)
df <- data.frame(x, p)
require(scales)
require(ggplot2)
ggplot(df) +
  geom_point(aes(x = x, y = p, size = 2)) +
  stat_smooth(method = "lm", se = FALSE, linetype = "dashed", aes(x = x, y = p)) +
  scale_x_continuous(trans = 'probit',
                     breaks = ybreaks,
                     minor_breaks = qnorm(ybreaks)) +
  scale_y_log10()
Resulting plot:
For more information, the scale I'm trying to achieve is the probability plotting scale, which has finer resolution on either end of the scale (at 0 and 1) to show extreme events, with ever-decreasing resolution toward the median value (0.5).
I want to be able to use scale_x_reverse concurrently with my scale_x_continuous probability scale, but I don't know how to build that in any sort of custom scale. Any guidance on this?
Arguments in scale_(x|y)_reverse() are passed to scale_(x|y)_continuous() so you should simply do:
scale_x_reverse(trans='probit', breaks = ybreaks, minor_breaks=qnorm(ybreaks))
Rather than try to combine two transformations, why not transform your existing data and then plot it?
The following looks like it should be right.
#http://r.789695.n4.nabble.com/Inverse-Error-Function-td802691.html
erf.inv <- function(x) qnorm((x + 1)/2)/sqrt(2)
#http://en.wikipedia.org/wiki/Probit#Computation
probit <- function(x) sqrt(2)*erf.inv((2*x)-1)
# probit(0.3637)
df$z <- probit(df$x)
ggplot(df) +
  geom_point(aes(x = z, y = p), size = 2) +
  stat_smooth(method = "lm", se = FALSE, linetype = "dashed", aes(x = z, y = p)) +
  scale_x_reverse(breaks = ybreaks, minor_breaks = qnorm(ybreaks)) +
  scale_y_log10()
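As a side note (my observation, not part of the original answer): the hand-rolled probit() above simplifies algebraically to qnorm(), so the transformation step can be written more directly:
# probit(x) = sqrt(2) * erf.inv(2*x - 1) = qnorm(x)
all.equal(probit(0.3637), qnorm(0.3637))  # TRUE
df$z <- qnorm(df$x)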

Scatterplot with ugly margins when using log scale

I have a somewhat "weird" two-dimensional distribution (not normal, with some uniform values; it kinda looks like this... this is just a minimal reproducible example), and I want to log-transform the values and plot them.
library("ggplot2")
library("scales")
df <- data.frame(x = c(rep(0,200),rnorm(800, 4.8)), y = c(rnorm(800, 3.2),rep(0,200)))
Without the log transformation, the scatterplot (incl. rug plot which I need) works (quite) well, apart from a marginally narrower rug plot on the x axis:
p <- ggplot(df, aes(x, y)) + geom_point() + geom_rug(alpha = I(0.5)) + theme_minimal()
p
When plotting the same data with a log10 transform, though, the points at the margin (at x = 0 and y = 0, respectively) are plotted outside the rug plot or right on the axis (with other data, only one half of a point is visible).
p + scale_x_log10() + scale_y_log10()
How can I "rescale" the axes so that all the points are contained fully within the grid and the rug plots are unaffected, as in the first example?
Maybe you want
p + scale_x_log10(oob=squish_infinite) + scale_y_log10(oob=squish_infinite)
I don't really know what you expect to happen for values that can be negative or infinite, but one general piece of advice when transformations don't do what you want is to perform them outside of ggplot2. Something like this might be useful:
library(plyr)
df2 <- colwise(log10)(df) # log transform columns
df2 <- colwise(squish_infinite)(df2) # do something with infinites
p %+% df2 # plot the transformed data
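If you would rather avoid the plyr dependency, here is a base-R sketch of the same idea:
# Sketch: log-transform every column, then squish the -Inf values produced by
# log10(0) so those points stay visible instead of being dropped.
df2 <- as.data.frame(lapply(df, function(v) squish_infinite(log10(v))))
p %+% df2  # plot the transformed data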

How to control ylim for a faceted plot with different scales in ggplot2?

In the following example, how do I set separate ylims for each of my facets?
qplot(x, value, data=df, geom=c("smooth")) + facet_grid(variable ~ ., scale="free_y")
In each of the facets the y-axis takes a different range of values, and I would like to set different ylims for each facet.
The default ylims are too wide for the trend that I want to see.
This was brought up on the ggplot2 mailing list a short while ago. What you are asking for is currently not possible but I think it is in progress.
As far as I know this has not been implemented in ggplot2 yet. However, a workaround that gives you ylims exceeding what ggplot provides automatically is to add "artificial data". To reduce the ylims, simply remove the data you don't want to plot (see the end for an example).
Here is an example:
Let's just set up some dummy data that you want to plot
df <- data.frame(x=rep(seq(1,2,.1),4),f1=factor(rep(c("a","b"),each=22)),f2=factor(rep(c("x","y"),22)))
df <- within(df,y <- x^2)
Which we could plot using line graphs
p <- ggplot(df,aes(x,y))+geom_line()+facet_grid(f1~f2,scales="free_y")
print(p)
Assume we want y to start at -10 in the first row and at 0 in the second row, so we add a point at (0, -10) to a panel in the first row and a point at (0, 0) to a panel in the second row:
ylim <- data.frame(x=rep(0,2),y=c(-10,0),f1=factor(c("a","b")),f2=factor(c("x","y")))
dfy <- rbind(df,ylim)
Now by limiting the x-scale between 1 and 2 those added points are not plotted (a warning is given):
p <- ggplot(dfy,aes(x,y))+geom_line()+facet_grid(f1~f2,scales="free_y")+xlim(c(1,2))
print(p)
The same would work for extending the margin upward, by adding points with higher y values at x values that lie outside the range of xlim:
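A minimal sketch of that upward extension, reusing df from above (the value 10 is arbitrary):
# Extend the upper margin of the first row to at least y = 10 by adding an
# off-range point at x = 0, which xlim(c(1, 2)) later removes from the display.
ymax <- data.frame(x = 0, y = 10, f1 = factor("a"), f2 = factor("x"))
p <- ggplot(rbind(df, ymax), aes(x, y)) + geom_line() +
  facet_grid(f1 ~ f2, scales = "free_y") + xlim(c(1, 2))
print(p)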
This will not work if you want to reduce the ylim, in which case subsetting your data is the solution. For example, to limit the upper row to between -10 and 1.5 you could use:
p <- ggplot(dfy, aes(x, y)) + geom_line(data = subset(dfy, y < 1.5 | f1 != "a")) + facet_grid(f1 ~ f2, scales = "free_y") + xlim(c(1, 2))
print(p)
There are actually two packages that solve this problem now:
https://github.com/zeehio/facetscales and https://cran.r-project.org/package=ggh4x.
I would recommend ggh4x because it has very useful tools, such as facet grids with multiple layers (two variables defining the rows or columns), setting the x- and y-scales as you wish in each facet, and multiple fill and colour scales.
For your problem, the solution would look like this:
library(ggh4x)
scales <- list(
  # Here you have to specify all the scales, one for each facet row in your case
  scale_y_continuous(limits = c(2, 10)),
  scale_y_continuous(breaks = c(3, 4))
)
qplot(x, value, data = df, geom = c("smooth")) +
  facet_grid(variable ~ ., scale = "free_y") +
  facetted_pos_scales(y = scales)
Here is an example using facet_wrap():
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(vars(class), scales = "free", nrow = 2, ncol = 4)
The code above produces a plot in which each facet gets its own freely-scaled axes.
