Displaying smoothed (convolved) densities with ggplot2 - r

I'm trying to display some frequencies convolved with a Gaussian kernel in ggplot2. I tried smoothing the lines with:
+ stat_smooth(se = F,method = "lm", formula = y ~ poly(x, 24))
Without success.
I read an article suggesting the frequencies should be convolved with a Gaussian kernel, which ggplot2's stat_density function (http://docs.ggplot2.org/current/stat_density.html) seems able to do.
However, I can't seem to replace my geom with stat_density. Is there anything wrong with my code?
require(reshape2)
library(ggplot2)
library(RColorBrewer)
fileName = "/1.csv" # downloadable there: https://www.dropbox.com/s/l5j7ckmm5s9lo8j/1.csv?dl=0
mydata = read.csv(fileName,sep=",", header=TRUE)
dataM = melt(mydata,c("bins"))
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
ggplot(data=dataM,
aes(x=bins, y=value, colour=variable)) +
geom_line() + scale_x_continuous(limits = c(0, 2))
This code produces the following plot:
I'm looking at smoothing the lines a little bit, so they look more like this:
(from http://journal.frontiersin.org/Journal/10.3389/fncom.2013.00189/full)

Since my comments solved your problem, I'll convert them to an answer:
The density function takes individual measurements and calculates a kernel density distribution by convolution (gaussian is the default kernel). For example, plot(density(rnorm(1000))). You can control the smoothness with the bw (bandwidth) parameter. For example, plot(density(rnorm(1000), bw=0.01)).
But your data frame is already a density distribution (analogous to the output of the density function). To generate a smoother density estimate, you need to start with the underlying data and run density on it, adjusting bw to get the smoothness where you want it.
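As a rough illustration of the point above, here is a minimal sketch that starts from raw measurements and varies the bandwidth. The data are simulated (the vector `raw` and the seed are assumptions, not the asker's data), and the `adjust` multiplier in the ggplot2 call is just one of several ways to control smoothness.

```r
library(ggplot2)

set.seed(1)                       # assumed seed, only for reproducibility
raw <- rnorm(1000)                # hypothetical raw measurements

# Base R: Gaussian kernel by default; bw sets the kernel bandwidth
d_default <- density(raw)
d_narrow  <- density(raw, bw = 0.01)   # very wiggly
d_wide    <- density(raw, bw = 0.5)    # very smooth

# ggplot2 equivalent: stat_density() runs the kernel estimate on the raw values;
# adjust multiplies the default bandwidth (values below 1 give a rougher curve)
p <- ggplot(data.frame(raw = raw), aes(x = raw)) +
  stat_density(geom = "line", adjust = 0.5)
p
```

Note that this only works on the individual observations; feeding already-binned frequencies to `density()` or `stat_density()` would estimate the density of the bin values themselves, which is not what you want.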
If you don't have access to the underlying data, you can smooth out your existing density distributions as follows:
ggplot(data=dataM, aes(x=bins, y=value, colour=variable)) +
geom_smooth(se=FALSE, span=0.3) +
scale_x_continuous(limits = c(0, 2))
Play around with the span parameter to get the smoothness you want.
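To get a feel for what span does, here is a small sketch using loess() directly, which is what geom_smooth() uses by default for groups with fewer than 1000 observations. The noisy curve below is simulated (an assumption standing in for one column of the binned data), and the helper `rss` is a hypothetical name.

```r
set.seed(1)                                  # assumed seed
x <- seq(0, 2, length.out = 100)             # stand-in for the bins column
y <- dnorm(x, mean = 1, sd = 0.3) + rnorm(100, sd = 0.1)  # noisy "density"

# span is the fraction of points used for each local fit
fit_wiggly <- loess(y ~ x, span = 0.2)   # follows the data closely
fit_smooth <- loess(y ~ x, span = 0.9)   # much smoother, flattens the peak

# Residual sum of squares: a smaller span tracks the observed values more tightly
rss <- function(fit) sum(residuals(fit)^2)
c(wiggly = rss(fit_wiggly), smooth = rss(fit_smooth))
```

A small span keeps the peak shape but lets noise through; a large span removes noise but can flatten real features, so it is worth checking the smoothed curve against the raw lines.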

Related

Plot bin averaged values with error bars in R

I have a dataframe with three columns "DateTime", "T_ET", and "LAI". I want to plot T_ET (on y-axis) against LAI (on x-axis) along with 0.1-bin LAI averaged values of T_ET on the same plot something like below (Wei et al., 2017):
In the above figure, the y-axis is T_ET or T/(E+T), the x-axis is LAI, the red open diamonds with error bars are the 0.1-bin LAI averages of the black points and their standard deviation, the solid line is a regression of the individual data points (estimated from the bin averages), and n is the number of available data points. The dashed lines are 95% confidence bounds.
How can I obtain the plot similar to above plot? Please find the sample data using the following link: file
or use following sample data:
df <- structure(list(DateTime = structure(c(1478088000, 1478347200, 1478692800, 1478779200, 1478865600, 1478952000, 1479124800, 1479211200, 1479297600, 1479470400), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
T_ET = c(0.996408350852751, 0.904748351479432, 0.28771236118773, 0.364402232484906, 0.452348409759872, 0.415408041501318, 0.629291202120187, 0.812083112145703, 0.992414777441755, 0.818032913071265),
LAI = c(1.3434, 1.4669, 1.6316, 1.6727, 1.8476, 2.0225, 2.3723, 2.5472, 2.7221, 3.0719)),
row.names = c(NA, 10L),
class = "data.frame")
You can do this directly while plotting via stat_summary_bin(). By default, the geom associated with this would be the pointrange geom and uses mean_se(). bins= controls the number of bins, but you can also supply binwidth=. Note that with the pointrange geom, fatten controls the size of the central point:
ggplot(df, aes(LAI, T_ET)) + geom_point() + theme_classic() +
stat_summary_bin(bins=3, color='red', shape=5, fatten=5)
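If you want the error bars to show the standard deviation within fixed-width bins (as in the figure from the question) rather than the default mean_se(), one option is to pass your own summary function through fun.data. This is a sketch under assumptions: `mean_sd` is a hypothetical helper, and it is shown on the diamonds dataset because the sample data is so small.

```r
library(ggplot2)

# Hypothetical helper: mean +/- one standard deviation, in the
# y / ymin / ymax format that fun.data expects
mean_sd <- function(y) {
  data.frame(y = mean(y), ymin = mean(y) - sd(y), ymax = mean(y) + sd(y))
}

p <- ggplot(diamonds, aes(carat, price)) +
  geom_point(size = 0.3) +
  stat_summary_bin(fun.data = mean_sd, binwidth = 0.1,   # 0.1-wide bins on x
                   geom = "pointrange", color = "red", shape = 5) +
  theme_classic()
p
```

Bins that contain a single observation have no defined standard deviation, so ggplot2 may warn about and drop their ranges.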
Your sample data is a little light, so here's another example via the diamonds dataset. Here, I'm constructing the same look as the example plot you show by combining the errorbar and point geoms. Please note that setting the width of the errorbar apparently doesn't work correctly with stat_summary_bin().
ggplot(diamonds, aes(carat, price)) + geom_point(size=0.3) +
stat_summary_bin(geom='errorbar', color='red', bins=12, width=0.001) +
stat_summary_bin(geom='point', size=3, shape=5, color='red', bins=12) +
theme_classic()
EDIT: Showing Regression for Binned Data
As indicated in the comments, drawing a regression line based on the binned data and not the original data is possible, but not through the stat_summary_bin() function unless you are okay to use loess. If you're looking for linear regression, you'll need to bin the data outside of ggplot, then plot the regression on the binned data.
The reason for this is probably by design. It's inherently not a good idea to draw a regression line (a way of summarizing data) that is based on summarized data. Regardless, here's one way to do this via the diamonds dataset. We can use the cut() function to cut into separate bins, then summarize the data on those binned values. Due to the way the cut() function labels the output, we have to create our own labels. Since we're cutting into 12 equal pieces in this example, I'm creating 12 evenly-spaced positions on the x axis for our data values to sit into - this may be different in your case, just take care you label according to what the data represents and what makes the most statistical sense.
df <- diamonds
# setting interval labeling
bin_width <- diff(range(df$carat)/12)
bin_labels <- c((range(df$carat)[1] + (bin_width/2))+(0:11*bin_width))
# cutting the data
df$bins <- cut(df$carat, breaks=12, labels=bin_labels)
df$bins <- as.numeric(levels(df$bins)[df$bins]) # convert factor to numeric
ggplot(diamonds, aes(carat, price)) + geom_point(size=0.3) +
stat_summary_bin(geom='errorbar', color='red', bins=12, width=0.001) +
stat_summary_bin(geom='point', size=3, shape=5, color='red', bins=12) +
geom_smooth(data=df, aes(x=bins), method='lm', color='blue') +
theme_classic()
Note that the regression line above is weighting all binned values equally. This is generally not a good idea unless your data is spaced evenly among the dataset. I'd still recommend if you're going to draw a regression line, have it linked to the original data, which is much more representative of the reality within your data. That would look like this:
ggplot(diamonds, aes(carat, price)) + geom_point(size=0.3) +
stat_summary_bin(geom='errorbar', color='red', bins=12, width=0.001) +
stat_summary_bin(geom='point', size=3, shape=5, color='red', bins=12) +
geom_smooth(method='lm', color='green') +
theme_classic()
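If you do go the binned route, a middle ground is to regress on the bin averages while weighting each bin by how many points it contains. The sketch below is one possible way to do that outside ggplot, not necessarily how the plots above compute their bins: the break positions, the `binned` data frame, and the fit names are all assumptions made for illustration.

```r
library(ggplot2)  # only needed here for the diamonds dataset

# 12 equal-width bins over the range of carat (13 break points)
breaks <- seq(min(diamonds$carat), max(diamonds$carat), length.out = 13)
bin <- cut(diamonds$carat, breaks = breaks, include.lowest = TRUE)

# One row per bin: mean carat, mean price, and number of observations
binned <- data.frame(
  carat = as.vector(tapply(diamonds$carat, bin, mean)),
  price = as.vector(tapply(diamonds$price, bin, mean)),
  n     = as.vector(table(bin))
)
binned <- binned[binned$n > 0, ]   # drop empty bins, if any

fit_unweighted <- lm(price ~ carat, data = binned)               # every bin counts equally
fit_weighted   <- lm(price ~ carat, data = binned, weights = n)  # bins weighted by size
fit_raw        <- lm(price ~ carat, data = diamonds)             # original data

# Compare intercepts and slopes side by side
rbind(unweighted = coef(fit_unweighted),
      weighted   = coef(fit_weighted),
      raw        = coef(fit_raw))
```

Comparing the three sets of coefficients shows how much the sparsely populated bins at high carat values pull the unweighted line.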
When it comes down to it, drawing a regression line for binned data is summarizing the summarized data rather than summarizing your original data. It's statistical heresy, so use at your own risk. But if you simply must for whatever strange reason... I can't stop you. ;)

Differentiating each Line with different type in `ggsurv` plots (or in `plot`)

I am using Rstudio. I am using ggsurv function from GGally package for drawing Kaplan-Meier curves for my data (for survival analysis), from tutorial here. I am using it instead of plot because ggsurv takes care of legends by itself.
As shown on the link, multiple curves are differentiated by color. I want to differentiate based on linetype. The tutorial does not seem to have any option for that. Following is my command:
surv1 <- survfit(Surv(DaysOfTreatment,Survived)~AgeOnFirstContactGroup)
print(ggsurv(surv1, lty.est = 3)+ ylim(0, 1))
lty.est=3 (or 2) gives the same dashed line for all the curves. I want a differently dashed line for each curve. Using lty=type gives the error: object 'type' not found. lty=type would work in ggplot, but ggplot does not directly deal with survfit plots.
Please show me how to differentiate curves by linetype in either ggsurv or simple plot (although I would prefer ggsurv because it takes care of legends)
From the documentation for ggsurv
lty.est: linetype of the survival curve(s). Vector length should be
either 1 or equal to the number of strata.
So, to get a different line type for each stratum, set lty.est equal to a vector of the same length as the number of lines you are plotting, with each value corresponding to a different line type.
For example, using the lung data from the survival package
library(GGally)
library(survival)
data(lung)
surv1 <- survfit(Surv(time,status) ~ sex, data = lung)
ggsurv(surv1, lty.est=c(1,2), surv.col = 1)
Gives the following plot
You can add ggplot themes or other ggplot elements to the plot too. For example, we can improve the appearance using the cowplot theme as follows
library(ggplot2)
library(cowplot)
ggsurv(surv1, lty.est=c(1,2), surv.col = 1) + theme_cowplot()
If you need to change the legend labels after differentiating by linetype, then you can do it this way
ggsurv(surv1, lty.est=c(1,2), surv.col = 1) +
guides(colour = "none") +  # "none" replaces the deprecated FALSE in recent ggplot2
scale_linetype_discrete(name = 'Sex', breaks = c(1,2), labels = c('Male', 'Female'))

Change colors of select lines in ggplot2 coefficient plot in R

I would like to change the color of coefficient lines based on whether the point estimate is negative or positive in a ggplot2 coefficient plot in R. For example:
require(coefplot)
set.seed(123)
dat <- data.frame(y1 = rnorm(100), x = rnorm(100), z = rnorm(100))  # y1 added so the model below can be fit
mod1 <- lm(y1 ~ x + z, data = dat)
coefplot.lm(mod1)
Which produces the following plot:
In this plot, I would like to change the "x" variable to red when plotted. Any ideas? Thanks.
I don't think you can do this with a plot produced by coefplot.lm. The coefplot package uses ggplot2 as its plotting system, which is good in itself, but it does not let you play with colors as easily as you would like. To get the desired colors, you need a variable in your dataset that color-codes the values, and you need to specify color = color-code in the aes() function within the layer that draws the dots with CE. Apparently, this is impossible to do with the output of the coefplot.lm function. You might be able to change the colors using ggplot2's ggplot_build() function, but I would say it's easier to write your own function for this task.
I've done this once to plot odds. If you want, you may use my code. Feel free to change it. The idea is the same as in coefplot. First, we extract coefficients from a model object and prepare the data set for plotting; second, actually plot.
The code for extracting coefficients and data set preparation
df_plot_odds <- function(x){
tmp<-data.frame(cbind(exp(coef(x)), exp(confint.default(x))))
odds<-tmp[-1,]
names(odds)<-c('OR', 'lower', 'upper')
odds$vars<-row.names(odds)
odds$col<-odds$OR>1
odds$col[odds$col==TRUE] <-'blue'
odds$col[odds$col==FALSE] <-'red'
odds$pvalue <- summary(x)$coef[-1, "Pr(>|t|)"]
return(odds)
}
Plot the output of the extract function
plot_odds <- function(df_plot_odds, xlab="Odds Ratio", ylab="", asp=1){
require(ggplot2)
p <- ggplot(df_plot_odds, aes(x=vars, y=OR, ymin=lower, ymax=upper),asp=asp) +
geom_errorbar(aes(color=col),width=0.1) +
geom_point(aes(color=col),size=3)+
geom_hline(yintercept = 1, linetype=2) +
scale_color_manual('Effect', labels=c('Positive','Negative'),
values=c('blue','red'))+
coord_flip() +
theme_bw() +
theme(legend.position="none",aspect.ratio = asp)+
ylab(xlab) +
xlab(ylab) #switch because of the coord_flip() above
return(p)
}
Plotting your example
set.seed(123)
dat <- data.frame(x = rnorm(100),y = rnorm(100), z = rnorm(100))
mod1 <- lm(y ~ x + z, data = dat)
df <- df_plot_odds(mod1)
plot <- plot_odds(df)
plot
Which yields
Note that I chose theme_bw() as the default. The output is a ggplot2 object, so you can change it quite a lot.

R to scale the y-axis to fit the loess curves

I want to reduce the scale of the y-axis according to the loess function from ggplot2, even if it means that some raw points will not be displayed.
The problem is that I do not know beforehand what the maximum value of the loess function will be. All I have are the various raw data sets, and some of them occasionally have high peaks that squish all my loess curves to the bottom of the graphs. I care more about displaying the loess curves than the raw data (though the raw data must still be displayed, at least the points near the loess curves).
With this example, the blue loess line never goes higher than 2, while there are many points between 2 and 3; so typically, I want the top of the y-axis to be 2 and to ditch those extreme raw points.
library(ggplot2)
set.seed(34)
n <- 200
X <- runif(n)*8
Y <- sin(3*X) + cos(X^2) + rnorm(n, 0, 0.5)
myData <- data.frame(X,Y)
fit <- loess(Y~X,data=myData)
myData$pred <- predict(fit)
ggplot(myData, aes(X,Y))+
geom_point()+
stat_smooth(method="loess", se=F, size=3)+
geom_line(aes(X,pred),colour="yellow")
You were almost there, just add a coord_cartesian call
ggplot(myData, aes(X,Y))+
geom_point()+
stat_smooth(method="loess", se=F, size=3)+
geom_line(aes(X,pred),colour="yellow") +
coord_cartesian(ylim=c(min(myData$pred)-.1, max(myData$pred)+.1))
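The reason coord_cartesian() is the right tool here is that it only zooms the view, whereas setting limits on the scale (for example with ylim()) removes out-of-range points before the smoother is fit. A quick way to check this, sketched below with the same simulated data, is to pull the smoothed values out of each plot; the limits c(-1, 1) and the helper `smooth_y` are assumptions chosen for illustration.

```r
library(ggplot2)

set.seed(34)
n <- 200
X <- runif(n) * 8
Y <- sin(3 * X) + cos(X^2) + rnorm(n, 0, 0.5)
myData <- data.frame(X, Y)

base <- ggplot(myData, aes(X, Y)) +
  geom_point() +
  stat_smooth(method = "loess", se = FALSE)

zoomed  <- base + coord_cartesian(ylim = c(-1, 1))  # zooms; all points still feed the fit
clipped <- base + ylim(-1, 1)                       # drops points outside the limits first

# Hypothetical helper: smoothed y values from the second layer (the stat_smooth layer)
smooth_y <- function(p) layer_data(p, 2)$y

same_after_zoom <- isTRUE(all.equal(smooth_y(base), smooth_y(zoomed)))
same_after_clip <- isTRUE(all.equal(smooth_y(base), smooth_y(clipped)))
c(zoom = same_after_zoom, clip = same_after_clip)
```

The zoomed plot should report the same curve as the unrestricted one, while the clipped plot fits a different curve because it never sees the extreme points.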

How to get the ggplot2 stat_smooth blue line as a function?

The following command:
ggplot(s, aes(x = I5, y = Success))+geom_point(size=3, alpha=0.4)+
stat_smooth(method="loess", colour="blue", size=1.5)+
xlab("I5")+
ylab("Probability of Success")+
theme_bw()
gives me the following plot:
I would like to get what corresponds to the blue line as a function so that I can apply it to any value.
Is there a way to do that?
If you need the actual loess fit, it's probably better to run it yourself. Let's create some sample data (it would have been nice if you had included some in your original question):
dd <- data.frame(
x=1:50,
y = cumsum(rnorm(50))
)
And now we can run the loess function ourselves:
sm <- loess(y~x, dd)
Now we can compare the line that ggplot draws to our loess curve
ggplot(dd, aes(x,y)) +
stat_smooth(method="loess") +
geom_point(data=data.frame(x=sm$x, y=predict(sm)), col="red")
We can see that these line up perfectly. Thus we can just use the predict() function with our loess object to get a value for any point. For example:
predict(sm, 5)
# [1] -2.922876
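If you want something you can call like an ordinary function, one way is to wrap predict() in a closure. This is a minimal sketch: the name `blue_line` and the seed are assumptions, and the data are random, so the values will not match the output shown above.

```r
set.seed(42)                                   # assumed seed, for reproducibility only
dd <- data.frame(x = 1:50, y = cumsum(rnorm(50)))
sm <- loess(y ~ x, data = dd)

# Hypothetical wrapper: evaluates the fitted loess curve at arbitrary x values
blue_line <- function(x) predict(sm, newdata = data.frame(x = x))

blue_line(c(5, 10.5, 49.9))   # works for values between the observed x's
blue_line(100)                # NA: loess does not extrapolate by default
```

Keep in mind that the wrapper is only defined over the range of the original x values; outside that range it returns NA unless the model is refit with settings that allow extrapolation.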
