I am trying to use the lowess function in R to compute a locally weighted average of a data set that is not uniformly distributed along the x axis. For example, the first 5 data points look like this, where the first column is x and the second is y.
375.0 2040.0
472.0 5538.0
510.0 4488.0
573.0 2668.0
586.0 7664.0
I used the following command in R:
x<-read.table(add,header=FALSE,sep="\t")
y<-lowess(x[,1],x[,2],f=0.01)
write.table(y, file = results , sep = "\t", col.names =FALSE, row.names =FALSE)
The output looks like this:
The green line shows the average computed by the smooth function in matlab (tri-cubic kernel), and the red line is the average line computed by lowess method in R. The blue dots are the data points.
I can't find why the method in R does not work. Do you have any idea?
Here is a link to part of the data.
Thanks a lot for your help.
The smooth function in MATLAB works like a moving-average filter:
yy = smooth(y)
yy(1) = y(1)
yy(2) = (y(1) + y(2) + y(3))/3
yy(3) = (y(1) + y(2) + y(3) + y(4) + y(5))/5 ## convolution of size 5
yy(4) = (y(2) + y(3) + y(4) + y(5) + y(6))/5
I think it is better to do a simple smooth here.
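As a rough illustration of that filter in R, here is a minimal sketch of a centered moving average whose window shrinks near the ends, following the yy(1), yy(2), yy(3) pattern listed above. The function name moving_average and its span argument are my own (hypothetical), not part of base R or MATLAB, and the sketch works on the index only, so it ignores the uneven spacing in x.

```r
# Centered moving average whose window shrinks near the ends.
# `moving_average` and `span` are illustrative names, not an existing API.
moving_average <- function(y, span = 5) {
  stopifnot(span %% 2 == 1)            # the window size must be odd
  n <- length(y)
  half <- (span - 1) / 2
  vapply(seq_len(n), function(i) {
    # largest symmetric half-width that still fits inside the data
    h <- min(half, i - 1, n - i)
    mean(y[(i - h):(i + h)])
  }, numeric(1))
}

y <- c(2040, 5538, 4488, 2668, 7664)   # first five y values from the question
moving_average(y)
```

The first element is returned unchanged, the second is the mean of the first three values, and the third is the mean of all five, which matches the description of the filter.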
Here are some attempts using loess, lowess with f = 0.2 (1/5), and smooth.spline.
I am using ggplot2 to plot (so that I can use geom_jitter with some alpha).
library(ggplot2)
# `data` is assumed to be the table read from the question's file (columns V1, V2)
dat <- subset(data, V2 < 5000)
#dat <- data
xy <- lowess(dat$V1,dat$V2,f = 0.8)
xy <- as.data.frame(do.call(cbind,xy))
p1<- ggplot(data = dat, aes(x= V1, y = V2))+
geom_jitter(position = position_jitter(width = .2), alpha= 0.1)+
geom_smooth()
xy <- lowess(dat$V1,dat$V2,f = 0.2)
xy <- as.data.frame(do.call(cbind,xy))
xy.smooth <- smooth.spline(dat$V1,dat$V2)
xy.smooth <- data.frame(x= xy.smooth$x,y = xy.smooth$y)
p2 <- ggplot(data = dat, aes(x= V1, y = V2))+
geom_jitter(position = position_jitter(width = .2), alpha= 0.1)+
geom_line(data = xy, aes(x=x, y = y, group = 1 ), color = 'red')+
geom_line(data = xy.smooth, aes(x=x, y = y, group = 1 ), color = 'blue')
library(gridExtra)
grid.arrange(p1,p2)
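The f argument of lowess is the smoother span, i.e. the proportion of points that influence the fit at each x value, so f = 0.01 leaves very few neighbours per local fit. The sketch below uses made-up, unevenly spaced data (not the data from the question) to compare a narrow and a wide span; the roughness helper is just an ad hoc measure I defined for the comparison.

```r
set.seed(42)
n <- 500
x <- cumsum(rexp(n, rate = 0.05))      # hypothetical, unevenly spaced x values
y <- 3000 + 1000 * sin(x / 2000) + rnorm(n, sd = 1500)

fit_narrow <- lowess(x, y, f = 0.01)   # about 5 neighbours per local fit
fit_wide   <- lowess(x, y, f = 0.2)    # about 100 neighbours per local fit

# Ad hoc measure of wiggliness: spread of successive differences of the fit
roughness <- function(fit) sd(diff(fit$y))
c(narrow = roughness(fit_narrow), wide = roughness(fit_wide))
```

With the narrow span the fitted curve follows the noise in the individual points, while the wider span gives a visibly smoother line.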
I combined two plots of predicted mixed-effects models and am trying to change the legend title "sR" so that it is easier to read, but I couldn't get it to work using p + scale_fill_discrete(name = "New Legend Title"). I believe it should be a simple fix, but I am still scratching my head over why it doesn't work, even after reading other posts on Stack Overflow. Can someone help me please? Thanks.
Below is my R code for you to reproduce the problem:
library(nlme)
library(ggplot2)
library(ggeffects)
#For lowW dataset:
SurfaceCoverage <- c(0.04,0.08,0.1,0.12,0.15,0.2,0.04,0.08,0.1,0.12,0.15,0.2)
TotalSurfaceEnergy <- c(139.31449,105.17776,105.38411,99.27608,92.29064,91.55114,84.44251,78.40453,74.66656,73.33242,72.42429,77.08666)
sample <- c(1,1,1,1,1,1,2,2,2,2,2,2)
lowW <- data.frame(sample,SurfaceCoverage,TotalSurfaceEnergy)
lowW$sample <- sub("^", "Wettable", lowW$sample)
lowW$RelativeHumidity <- "Low relative humidity"; lowW$group <- "Wettable"
lowW$sR <- paste(lowW$sample,lowW$RelativeHumidity)
dflowW <- data.frame(
"y"=c(lowW$TotalSurfaceEnergy),
"x"=c(lowW$SurfaceCoverage),
"b"=c(lowW$sample),
"sR"=c(lowW$sR)
)
mixed.lme <- lme(y~log(x),random=~1|b,data=dflowW)
pred.mmlowW <- ggpredict(mixed.lme, terms = c("x"))
#For highW dataset:
SurfaceCoverage <- c(0.02,0.04,0.06,0.08,0.1,0.12,0.02,0.04,0.06,0.08,0.1,0.12)
TotalSurfaceEnergy <- c(66.79554,61.46907,57.56855,54.00953,54.28361,55.15855,50.72314,48.55892,47.41811,43.70885,42.13757,40.55924)
sample <- c(1,1,1,1,1,1,2,2,2,2,2,2)
highW <- data.frame(sample,SurfaceCoverage,TotalSurfaceEnergy)
highW$sample <- sub("^", "Wettable", highW$sample)
highW$RelativeHumidity <- "High relative humidity"; highW$group <- "Wettable"
highW$sR <- paste(highW$sample,highW$RelativeHumidity)
dfhighW <- data.frame(
"y"=c(highW$TotalSurfaceEnergy),
"x"=c(highW$SurfaceCoverage),
"b"=c(highW$sample),
"sR"=c(highW$sR)
)
mixed.lme <- lme(y~log(x),random=~1|b,data=dfhighW)
pred.mmhighW <- ggpredict(mixed.lme, terms = c("x"))
# Combine two predicted mixed effect model into a single ggplot:
p <- ggplot() +
#lowW plot
geom_line(data=pred.mmlowW, aes(x = x, y = predicted)) + # slope
geom_ribbon(data=pred.mmlowW, aes(x = x, ymin = predicted - std.error, ymax = predicted + std.error),
fill = "lightgrey", alpha = 0.5) + # error band
geom_point(data = dflowW, # adding the raw data (scaled values)
aes(x = x, y = y, shape = sR)) +
#highW plot
geom_line(data=pred.mmhighW, aes(x = x, y = predicted)) + # slope
geom_ribbon(data=pred.mmhighW, aes(x = x, ymin = predicted - std.error, ymax = predicted + std.error),
fill = "lightgrey", alpha = 0.5) + # error band
geom_point(data = dfhighW, # adding the raw data (scaled values)
aes(x = x, y = y, shape = sR)) +
xlim(0.01,0.2) +
ylim(30,150) +
labs(title = "") +
ylab(bquote('Total Surface Energy ' (mJ/m^2))) +
xlab(bquote('Surface Coverage ' (n/n[m]) )) +
theme_minimal()
print(p)
p1 <- p + scale_fill_discrete(name = "New Legend Title")
print(p1)
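One likely reason the call above has no visible effect: the points are mapped to shape, not fill, so the legend belongs to the shape scale and a fill scale never touches it. A minimal sketch with made-up toy data (the data frame toy and its values are hypothetical stand-ins for dflowW and dfhighW) that renames the legend through the shape aesthetic:

```r
library(ggplot2)

# Hypothetical toy data standing in for dflowW / dfhighW
toy <- data.frame(
  x  = c(0.04, 0.08, 0.12, 0.04, 0.08, 0.12),
  y  = c(139, 105, 99, 66, 54, 55),
  sR = rep(c("Wettable1 Low relative humidity",
             "Wettable1 High relative humidity"), each = 3)
)

p_toy <- ggplot(toy, aes(x = x, y = y, shape = sR)) +
  geom_point()

# The legend is generated by the shape aesthetic, so rename it there
p_named <- p_toy + labs(shape = "New Legend Title")
# Equivalent: p_toy + scale_shape_discrete(name = "New Legend Title")
```

The same idea should carry over to the combined plot, e.g. p + labs(shape = "New Legend Title"), since both geom_point layers map shape = sR.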
I want to create a density plot with multiple groups and add a sloped line through the means. The plot looks like the following:
library(tidyverse)
library(ggridges)
data1 <- data.frame(x1 = c(rep(1,50), rep(2,50), rep(3,50), rep(4,50), rep(5,50)),
y1 = c(rnorm(50,10,1), rnorm(50,15,2), rnorm(50,20,3), rnorm(50,25,3), rnorm(50,30,4)))
data1$x1 <- as.factor(data1$x1)
ggplot(data1, aes(x = y1, y = x1, fill = 0.5 - abs(0.5 - stat(ecdf)))) +
stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE) +
scale_fill_viridis_c(name = "Tail probability", direction = -1)
There are two ways to construct the red line. You can either (1) use geom_line through points representing the group means, or (2) fit a regression through the data.
(1) will be truncated to fit the data, (2) can be extended beyond the data, but will only look right if there is an overall linear relationship between your x and y.
Code for (1)
means <- aggregate(y1 ~ x1, data=data1, FUN=mean)
ggplot(data1, aes(x = y1, y = x1, fill = 0.5 - abs(0.5 - stat(ecdf)))) +
stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE) +
scale_fill_viridis_c(name = "Tail probability", direction = -1) +
geom_line(aes(x=y1, y=as.numeric(x1), fill=1), data=means, colour="red")
# NB: need to override the fill aesthetic or you get an error
Code for (2)
regressionLine <- coef(lm(as.numeric(x1) ~ y1 , data=data1))
ggplot(data1, aes(x = y1, y = x1, fill = 0.5 - abs(0.5 - stat(ecdf)))) +
stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE) +
scale_fill_viridis_c(name = "Tail probability", direction = -1) +
geom_abline(intercept=regressionLine[1], slope=regressionLine[2], colour="red")
When executing the following piece of code, the output plot shows a blue line of f(x) = 0, instead of the Gamma pdf (see the blue line in this picture).
analyzeGamma <- function(csvPath, alpha, beta) {
dfSamples <- read.csv(file = csvPath,
header = TRUE,
sep = ",")
base <- ggplot(dfSamples, aes(x = value, y = quantity))
base +
geom_col(color = "red") +
geom_vline(xintercept = qgamma(seq(0.1, 0.9, by = 0.1), alpha, beta)) +
stat_function(
fun = dgamma,
args = list(shape = alpha, rate = beta),
colour = "blue"
)
}
path = "/tmp/data.csv"
alpha = 1.2
beta = 0.01
analyzeGamma(path, alpha, beta)
When I comment out the line:
geom_col(color = "red") +
The Gamma pdf is drawn correctly, as can be seen here.
Any idea why it happens and how to resolve?
Thanks.
It's because your geom_col() goes up to 25 and probability density functions have an integral of 1. If I'm correct in assuming your columns resemble a histogram with count data as quantities, you would have to scale your density to match the columns as follows:
density * number of samples * width of columns
If you've precomputed the columns, 'number of samples' would be the sum of all your y-values.
An example with some toy data, notice the function in the stat:
alpha = 1.2
beta = 0.01
df <- data.frame(x = rgamma(1000, shape = alpha, rate = beta))
binwidth <- 5
ggplot(df, aes(x)) +
geom_histogram(binwidth = binwidth) +
stat_function(
fun = function(z, shape, rate)(dgamma(z, shape, rate) * length(df$x) * binwidth),
args = list(shape = alpha, rate = beta),
colour = "blue"
)
The following example with geom_col() gives the same picture:
x <- table(cut_width(df$x, binwidth, boundary = 0))
newdf <- data.frame(x = seq(0.5*binwidth, max(df$x), by = binwidth),
y = as.numeric(x))
ggplot(newdf, aes(x, y)) +
geom_col(width = binwidth) +
stat_function(
fun = function(z, shape, rate)(dgamma(z, shape, rate) * sum(newdf$y) * binwidth),
args = list(shape = alpha, rate = beta),
colour = "blue"
)
ggplot scales the y-axis to show all data. The blue curve appears as a straight line due to the scale. If you compare the scale of the y-axis in both charts you'll see: when you draw the geom_col, the y-axis maximum is somewhere around 25 (and the stat_function curve seems to be a straight line). Without the geom_col, the y-axis maximum is somewhere around 0.006.
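To put a number on that scale difference, here is a small sketch that computes the peak height of the unscaled Gamma pdf for the alpha and beta used in the question. It relies on the standard fact that a Gamma density with shape > 1 has its mode at (shape - 1) / rate; the variable names mode_x and peak are my own.

```r
alpha <- 1.2
beta  <- 0.01

# Mode of a Gamma(shape, rate) density when shape > 1
mode_x <- (alpha - 1) / beta
peak   <- dgamma(mode_x, shape = alpha, rate = beta)

peak        # height of the unscaled pdf at its highest point
peak / 25   # fraction of a y axis that runs up to about 25
```

The peak is a tiny fraction of the column heights, which is why the curve hugs the x axis until it is rescaled.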
I would like to plot a logistic regression directly from the parameter estimates using ggplot2, but not quite sure how to do it.
For example, if I had 1500 draws of alpha and beta parameter estimates, I could plot each of the lines thus:
alpha_post = rnorm(n=1500,mean=1.1,sd = .15)
beta_post = rnorm(n=1500,mean=1.8,sd = .19)
X_lim = seq(from = -3,to = 2,by=.01)
for (i in 1:length(alpha_post)){
print(i)
y = exp(alpha_post[i] + beta_post[i]*X_lim)/(1+ exp(alpha_post[i] + beta_post[i]*X_lim) )
if (i==1){plot(X_lim,y,type="l")}
else {lines(X_lim,y)}
}
How would I do this in ggplot2? I know how to use geom_smooth(), but this is a little different.
As always in ggplot, you want to make a data.frame with all data that needs to be plotted:
d <- data.frame(
alpha_post = alpha_post,
beta_post = beta_post,
X_lim = rep(seq(from = -3,to = 2,by=.01), each = length(alpha_post))
)
d$y <- with(d, exp(alpha_post + beta_post * X_lim) / (1 + exp(alpha_post + beta_post * X_lim)))
Then the plotting itself becomes quite easy:
ggplot(d, aes(X_lim, y, group = alpha_post)) + geom_line()
If you want to be more fancy, add a summary line with e.g. the mean:
ggplot(d, aes(X_lim, y)) +
geom_line(aes(group = alpha_post), alpha = 0.3) +
geom_line(size = 1, color = 'firebrick', stat = 'summary', fun = mean)
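The inverse-logit expression used above is also available in base R as plogis(). Here is a small sketch, using a handful of hypothetical draws rather than the 1500 from the question, that builds the long data frame with expand.grid and checks that both forms agree. The draw index column is my own addition; grouping by it avoids relying on every alpha value being unique.

```r
set.seed(1)
alpha_draws <- rnorm(5, mean = 1.1, sd = 0.15)   # hypothetical draws
beta_draws  <- rnorm(5, mean = 1.8, sd = 0.19)
x_grid <- seq(-3, 2, by = 0.5)

# One row per (draw, x) combination
grid <- expand.grid(draw = seq_along(alpha_draws), x = x_grid)
eta  <- alpha_draws[grid$draw] + beta_draws[grid$draw] * grid$x

grid$y_manual <- exp(eta) / (1 + exp(eta))   # formula as written in the question
grid$y_plogis <- plogis(eta)                 # base R logistic function

all.equal(grid$y_manual, grid$y_plogis)
```

With a data frame like this, aes(x, y_plogis, group = draw) would draw one line per posterior draw.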
I'm trying to plot some nonparametric regression curves with ggplot2. I achieved it with the base plot() function:
library(KernSmooth)
set.seed(1995)
X <- runif(100, -1, 1)
G <- X[which (X > 0)]
L <- X[which (X < 0)]
u <- rnorm(100, 0 , 0.02)
Y <- -exp(-20*L^2)-exp(-20*G^2)/(X+1)+u
m <- lm(Y~X)
plot(Y~X)
abline(m, col="red")
m2 <- locpoly(X, Y, bandwidth = 0.05, degree = 0)
lines(m2$x, m2$y, col = "red")
m3 <- locpoly(X, Y, bandwidth = 0.15, degree = 0)
lines(m3$x, m3$y, col = "black")
m4 <- locpoly(X, Y, bandwidth = 0.3, degree = 0)
lines(m4$x, m4$y, col = "green")
legend("bottomright", legend = c("NW(bw=0.05)", "NW(bw=0.15)", "NW(bw=0.3)"),
lty = 1, col = c("red", "black", "green"), cex = 0.5)
With ggplot2 I have managed to plot the linear regression with this code:
ggplot(m, aes(x = X, y = Y)) +
geom_point(shape = 1) +
geom_smooth(method = lm, se = FALSE) +
theme(axis.line = element_line(colour = "black", size = 0.25))
But I don't know how to add the other lines to this plot, as in the base R plot. Any suggestions? Thanks in advance.
Solution
The shortest solution (though not the most beautiful one) is to add the lines using the data= argument of the geom_line function:
ggplot(m, aes(x = X, y = Y)) +
geom_point(shape = 1) +
geom_smooth(method = lm, se = FALSE) +
theme(axis.line = element_line(colour = "black", size = 0.25)) +
geom_line(data = as.data.frame(m2), mapping = aes(x=x,y=y))
Beautiful solution
To get beautiful colors and legend, use
# Need to convert lists to data.frames, ggplot2 needs data.frames
m2 <- as.data.frame(m2)
m3 <- as.data.frame(m3)
m4 <- as.data.frame(m4)
# Colnames are used as names in the ggplot legend. There's nothing wrong with using
# column names which contain symbols or whitespace, you just have to use
# backticks, e.g. m2$`NW(bw=0.05)` if you want to work with them
colnames(m2) <- c("x","NW(bw=0.05)")
colnames(m3) <- c("x","NW(bw=0.15)")
colnames(m4) <- c("x","NW(bw=0.3)")
# To give the different kernel regression estimates different colors, they must all be in one data frame.
# For merging to work, all x columns of m2-m4 must be the same!
# The merge function will automatically detect columns of the same name
# (that is, x) in m2-m4 and use them to identify y values which belong
# together (to the same x value)
mm <- Reduce(x=list(m2,m3,m4), f=function(a,b) merge(a,b))
# The above line is the same as:
# mm <- merge(m2,m3)
# mm <- merge(mm,m4)
# ggplot needs data in long (tidy) format
mm <- tidyr::gather(mm, kernel, y, -x)
ggplot(m, aes(x = X, y = Y)) +
geom_point(shape = 1) +
geom_smooth(method = lm, se = FALSE) +
theme(axis.line = element_line(colour = "black", size = 0.25)) +
geom_line(data = mm, mapping = aes(x=x,y=y,color=kernel))
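As a possible alternative to the merge-then-gather steps, the long data frame can be built directly by stacking one fit per bandwidth. This is only a sketch under my own assumptions: the data are made up (a noisy sine curve, not the question's Y), and the names bandwidths and mm_long are hypothetical.

```r
library(KernSmooth)

set.seed(1995)
X <- runif(100, -1, 1)
Y <- sin(3 * X) + rnorm(100, 0, 0.1)   # hypothetical data, not the question's

bandwidths <- c(0.05, 0.15, 0.3)

# One locpoly() fit per bandwidth, stacked directly in long format
mm_long <- do.call(rbind, lapply(bandwidths, function(bw) {
  fit <- locpoly(X, Y, bandwidth = bw, degree = 0)
  data.frame(x = fit$x, y = fit$y, kernel = sprintf("NW(bw=%s)", bw))
}))

head(mm_long)
```

The result has the same x, y and kernel columns that the final geom_line call expects, so it could be passed as data = mm_long with aes(x = x, y = y, color = kernel).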
Solution which will settle this for everyone and for eternity
The most beautiful and reproducible way, though, would be to create a custom stat in ggplot2 (see the stats included in ggplot2).
There is a vignette by the ggplot2 team on this topic: Extending ggplot2. I have never undertaken such a heroic endeavour though.