Underscore plot in R

Introduction and Current Work Done
[Note: For those interested, I have provided code at the end for reproducing my example.]
I have some data and I have conducted an ANOVA analysis and obtained Tukey's pairwise comparisons:
model1 = aov(trt ~ grp, data = df)
anova(model1)
> TukeyHSD(model1)
diff lwr upr p adj
B-A 0.03481504 -0.40533118 0.4749613 0.9968007
C-A 0.36140489 -0.07874134 0.8015511 0.1448379
D-A 1.53825179 1.09810556 1.9783980 0.0000000
C-B 0.32658985 -0.11355638 0.7667361 0.2166301
D-B 1.50343674 1.06329052 1.9435830 0.0000000
D-C 1.17684690 0.73670067 1.6169931 0.0000000
I can also plot Tukey's pairwise comparisons
> plot(TukeyHSD(model1))
We can see from Tukey's confidence intervals and the plot that A-B, B-C and A-C are not significantly different.
Problem
I have been asked to create something called an "underscore plot" which is described as follows:
We plot the group means on the real line and we draw a line segment between group means to indicate that there is no significant difference between those two particular groups.
Obtaining the means is not difficult:
> aggregate(df$trt ~ df$grp, FUN = mean)
df$grp df$trt
1 A 2.032086
2 B 2.066901
3 C 2.393491
4 D 3.570338
Desired Output
Using the data in this example, the desired plot should appear like the one below:
There is a line segment between the groups that are not significantly different (i.e. a line segment between A-B, B-C and A-C as indicated by Tukey's).
Note: Please note that the plot above is not to scale and it was created in keynote for illustrative purposes only.
Is there a way to get the "underscore plot" described above using R (using either base R or a library such as ggplot2)?
Edit
Here is the code that I used to create the example above:
library(data.table)
set.seed(3)
A = runif(20, 1,3)
A = data.frame(A, rep("A", length(A)))
B = runif(20, 1.25,3.25)
B = data.frame(B, rep("B", length(B)))
C = runif(20, 1.5,3.5)
C = data.frame(C, rep("C", length(C)))
D = runif(20, 2.75,4.25)
D = data.frame(D, rep("D", length(D)))
df = list(A, B, C, D)
df = rbindlist(df)
colnames(df) = c("trt", "grp")

Here's a ggplot version of the underscore plot. We'll load the tidyverse package, which loads ggplot2, dplyr and a few other packages from the tidyverse. We create a data frame of coefficients to plot the group names, coefficient values, and vertical segments and a data frame of non-significant pairs for generating the horizontal underscores.
library(tidyverse)
model1 = aov(trt ~ grp, data=df)
# Get coefficients and label coefficients with names of levels
coefs = coef(model1)
coefs[2:4] = coefs[2:4] + coefs[1]
names(coefs) = levels(model1$model$grp)
# Get non-significant pairs
pairs = TukeyHSD(model1)$grp %>%
as.data.frame() %>%
rownames_to_column(var="pair") %>%
# Keep only non-significant pairs
filter(`p adj` > 0.05) %>%
# Add coefficients to TukeyHSD results
separate(pair, c("pair1","pair2"), sep="-", remove=FALSE) %>%
mutate(start = coefs[match(pair1, names(coefs))],
end = coefs[match(pair2, names(coefs))]) %>%
# Stagger vertical positions of segments
mutate(ypos = seq(-0.03, -0.04, length.out = n()))  # one row per non-significant pair
# Turn coefs into a data frame
coefs = enframe(coefs, name="grp", value="coef")
ggplot(coefs, aes(x=coef)) +
geom_hline(yintercept=0) +
geom_segment(aes(x=coef, xend=coef), y=0.008, yend=-0.008, colour="blue") +
geom_text(aes(label=grp, y=0.011), size=4, vjust=0) +
geom_text(aes(label=sprintf("%1.2f", coef)), y=-0.01, size=3, angle=-90, hjust=0) +
geom_segment(data=pairs, aes(group=pair, x=start, xend=end, y=ypos, yend=ypos),
colour="red", size=1) +
scale_y_continuous(limits=c(-0.05,0.04)) +
theme_void()

Base R
d1 = data.frame(TukeyHSD(model1)[[1]])
inds = which(sign(d1$lwr) * (d1$upr) <= 0)
non_sig = lapply(strsplit(row.names(d1)[inds], "-"), sort)
d2 = aggregate(df$trt ~ df$grp, FUN=mean)
graphics.off()
dev.new(width = 8, height = 4)  # portable stand-in for windows(); size is in inches
par("mai" = c(0.2, 0.2, 0.2, 0.2))
plot(d2$`df$trt`, rep(1, NROW(d2)),
xlim = c(min(d2$`df$trt`) - 0.1, max(d2$`df$trt`) + 0.1), lwd = 2,
type = "l",
ann = FALSE, axes = FALSE)
segments(x0 = d2$`df$trt`,
y0 = rep(0.9, NROW(d2)),
x1 = d2$`df$trt`,
y1 = rep(1.1, NROW(d2)),
lwd = 2)
text(x = d2$`df$trt`, y = rep(0.8, NROW(d2)), labels = round(d2$`df$trt`, 2), srt = 90)
text(x = d2$`df$trt`, y = rep(0.75, NROW(d2)), labels = d2$`df$grp`)
lapply(seq_along(non_sig), function(i){
lines(cbind(d2$`df$trt`[match(non_sig[[i]], d2$`df$grp`)], rep(0.9 - 0.01 * i, 2)))
})
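Both answers above follow the same recipe: find the pairs that Tukey's test does not separate, look up the two group means, and draw a segment between them. A minimal base R sketch of that recipe is below. It filters on the adjusted p-value rather than on the sign of the interval bounds, and it assumes group labels contain no "-" character. The names `dat`, `keep` and `underscores` are illustrative, not taken from either answer.

```r
# Illustrative data built the same way as in the question
set.seed(3)
dat <- data.frame(
  trt = c(runif(20, 1, 3), runif(20, 1.25, 3.25),
          runif(20, 1.5, 3.5), runif(20, 2.75, 4.25)),
  grp = factor(rep(c("A", "B", "C", "D"), each = 20))
)

fit   <- aov(trt ~ grp, data = dat)
tuk   <- as.data.frame(TukeyHSD(fit)$grp)
means <- sapply(split(dat$trt, dat$grp), mean)   # named vector of group means

# Keep only the pairs whose adjusted p-value exceeds 0.05
keep   <- tuk[tuk[["p adj"]] > 0.05, , drop = FALSE]
parts  <- strsplit(rownames(keep), "-", fixed = TRUE)
first  <- vapply(parts, `[`, character(1), 1)
second <- vapply(parts, `[`, character(1), 2)

# One underscore per pair, each on its own vertical level
underscores <- data.frame(
  from = as.numeric(means[second]),
  to   = as.numeric(means[first]),
  y    = -seq_len(nrow(keep))
)

# Draw the real line, the group labels, and the underscores
plot(range(means), c(-nrow(keep) - 1, 1), type = "n", axes = FALSE, ann = FALSE)
abline(h = 0)
text(means, 0.5, names(means))
segments(underscores$from, underscores$y, underscores$to, underscores$y, lwd = 2)
```

Stacking each segment on its own level keeps overlapping pairs (such as A-B and A-C) distinguishable.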

Related

How to produce overlapping QQ plots in R?

I would like to make an overlapping QQ-plot from the GWAS results similar to the attached figure. I have run two GWAS analyses and want to generate a figure where the QQ-plot from both GWAS are overlaid on one another. I am using the R-package "qqman" for that.
Can someone please tell me how to do that in R?
Thank you.
Sample figure
You can use the R package CMplot for producing overlapping QQ plots in R.
library(CMplot)
data(pig60K)
pig60K$trait1[sample(1:nrow(pig60K), round(nrow(pig60K)*0.80))] <- NA
pig60K$trait2[sample(1:nrow(pig60K), round(nrow(pig60K)*0.25))] <- NA
CMplot(pig60K,plot.type="q",col=c("dodgerblue1", "olivedrab3", "darkgoldenrod1"),threshold=1e-6,
ylab.pos=2,signal.pch=c(19,6,4),signal.cex=1.2,signal.col="red",conf.int=TRUE,box=FALSE,multracks=
TRUE,cex.axis=2,file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE,ylim=c(0,8),width=5,height=5)
This works
(abd = data.frame(a = runif(100, 0, 3),
b = runif(100, 346, 455),
d = runif(100, 3952, 4903)) %>%
ggplot() +
geom_qq(aes(sample = a, color = "a")) +
geom_qq(aes(sample = b, color = "b")) +
geom_qq(aes(sample = d, color = "d")) +
theme_minimal())
I'm not sure what your objective is exactly, but more than likely the data will overlap and hide important information. Perhaps what you really need is a bi-variate or multi-variate version of a qq plot? A chi-square qq plot using the Mahalanobis distance might be a better bet, if that's the case.
MVN::mvn() draws this plot when called with the following arguments:
abd = data.frame(a = runif(100, 0, 3),
b = runif(100, 346, 455),
d = runif(100, 3952, 4903))
MVN::mvn(abd, multivariatePlot = "qq", multivariateOutlierMethod = "quan")
You can manually create the chi-square qq plot this way:
(cvAbd = cov(abd)) # covariance matrix
(dif = scale(abd, scale = FALSE)) # centering only: column means subtracted, no rescaling
# solve() used in next call calculates the inverse matrix
(d = diag(dif %*%
solve(cvAbd) %*% # matrix multiplication
t(dif)))
(r = rank(d)) # ranking
n = dim(abd)[1] # number of observations
p = dim(abd)[2] # number of variables
(ch <- qchisq((r - 0.5)/n, p)) # determine chi-square quantiles
ggplot(data = data.frame(d = d, ch = ch),
aes(d, ch)) +
geom_point() +
ggtitle("Chi-Square Q-Q Plot") +
xlab("Squared Mahalanobis Distance") +
ylab("Chi-Square Quantile") +
theme_minimal()
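The squared Mahalanobis distances computed by hand above are also available from the `mahalanobis()` function in the stats package. The sketch below, which uses its own randomly generated `abd` as an assumption, checks that the two routes agree.

```r
set.seed(1)
abd <- data.frame(a = runif(100, 0, 3),
                  b = runif(100, 346, 455),
                  d = runif(100, 3952, 4903))

# Manual route: centre the data, then take the diagonal of the quadratic form
dif      <- scale(abd, scale = FALSE)
d_manual <- diag(dif %*% solve(cov(abd)) %*% t(dif))

# Built-in route: squared Mahalanobis distance of every row from the column means
d_builtin <- mahalanobis(abd, center = colMeans(abd), cov = cov(abd))

# Chi-square quantiles to plot against, as in the manual steps
ch <- qchisq((rank(d_builtin) - 0.5) / nrow(abd), df = ncol(abd))

all.equal(unname(d_manual), unname(d_builtin))
```

The built-in version avoids forming the full n-by-n matrix, which matters once the number of observations grows.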
If you were looking for something different, try this. You can replace runif(10000, .5, 1) with your own data:
- o and e are the observed and expected values for the first vector (the same data feeds both)
- o2 and e2 are the observed and expected values for a second vector or second model (again, the same data feeds both)
This uses plotly:
# first model
o = -log10(sort(runif(10000,.5,1), decreasing = FALSE))
e = -log10(ppoints(length(runif(10000,.5,1))))
# second model
o2 = -log10(sort(runif(10000,0,1), decreasing = FALSE))
e2 = -log10(ppoints(length(runif(10000,0,1))))
plotly::plot_ly(x = ~e, y = ~o, name = "p1",
type = "scatter", mode = "markers") %>%
add_trace(x = ~e2, y = ~o2, name = "p2") %>%
add_trace(x = c(0, max(e,e2)), y = c(0, max(e,e2)),
mode="lines", name = "Log normal") %>%
layout(xaxis = list(title = 'Expected -log[10](<i>p</i>)'),
yaxis = list(title = 'Observed -log[10](<i>p</i>)'))
You requested a ggplot2 version -
ggplot(data.frame(o = o, e = e, o2 = o2, e2 = e2),
aes(e, o, color = "First Model")) +
geom_point() +
geom_point(aes(e2, o2, color = "Second Model")) +
geom_abline(intercept = 0, slope = 1, color = "darkred") +
scale_color_discrete("") +
xlab(expression(paste('Expected -log[10](', italic('p'),')'))) +
ylab(expression(paste('Observed -log[10](', italic('p'),')'))) +
theme_minimal()
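The same overlay can be sketched in base R without extra packages. The p-values below are simulated placeholders for two sets of GWAS results, and `qq_points()` is a hypothetical helper name, not part of qqman or CMplot.

```r
# Simulated stand-ins for the p-values of two GWAS runs
set.seed(42)
p1 <- runif(5000)
p2 <- rbeta(5000, 0.8, 1)   # mildly inflated, for illustration only

# Pair sorted observed p-values with their expected uniform quantiles
qq_points <- function(p) {
  p <- sort(p[!is.na(p)])
  data.frame(expected = -log10(ppoints(length(p))),
             observed = -log10(p))
}

q1 <- qq_points(p1)
q2 <- qq_points(p2)

# Shared axis limits so both point clouds are fully visible
lim <- c(0, max(q1$expected, q1$observed, q2$observed))
plot(q1$expected, q1$observed, xlim = lim, ylim = lim,
     pch = 20, col = "dodgerblue",
     xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
points(q2$expected, q2$observed, pch = 20, col = "darkorange")
abline(0, 1, col = "darkred")
legend("topleft", legend = c("GWAS 1", "GWAS 2"),
       col = c("dodgerblue", "darkorange"), pch = 20, bty = "n")
```

Each result set keeps its own expected quantiles, so the two analyses do not need the same number of markers.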

ggplot plot 2d probability density function on top of points on ggplot

I have the following example:
require(mvtnorm)
require(ggplot2)
set.seed(1234)
xx <- data.frame(rmvt(100, df = c(13, 13)))
ggplot(data = xx, aes(x = X1, y= X2)) + geom_point() + geom_density2d()
Here is what I get:
However, I would like to get the density contours from the multivariate t density given by the dmvt function. How do I tweak geom_density2d to do that?
This is not an easy question to answer, because the contours need to be calculated and the ellipses drawn using the ellipse package.
The example below uses elliptical t-densities to illustrate the plotting better.
nu <- 5 ## this is the degrees of freedom of the multivariate t.
library(mvtnorm)
library(ggplot2)
sig <- matrix(c(1, 0.5, 0.5, 1), ncol = 2) ## this is the sigma parameter for the multivariate t
xx <- data.frame( rmvt(n = 100, df = c(nu, nu), sigma = sig)) ## generating the original sample
rtsq <- rowSums(x = matrix(rt(n = 2e6, df = nu)^2, ncol = 2)) ## generating the sample for the ellipse-quantiles. Note that this is a cumbersome calculation because it is the sum of two independent t-squared random variables with the same degrees of freedom so I am using simulation to get the quantiles. This is the sample from which I will create the quantiles.
g <- ggplot( data = xx
, aes( x = X1
, y = X2
)
) + geom_point(colour = "red", size = 2) ## initial setup
library(ellipse)
for (i in seq(from = 0.01, to = 0.99, length.out = 20)) {
el.df <- data.frame(ellipse(x = sig, t = sqrt(quantile(rtsq, probs = i)))) ## create the data for the given quantile of the ellipse.
names(el.df) <- c("x", "y")
g <- g + geom_polygon(data=el.df, aes(x=x, y=y), fill = NA, linetype=1, colour = "blue") ## plot the ellipse
}
g + theme_bw()
This yields:
I still have a question: how does one reduce the size of the plotted ellipse lines?
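On the quantile step in the answer above: for a p-dimensional multivariate t distribution with nu degrees of freedom and scale matrix sigma, the quadratic form x' sigma^-1 x divided by p follows an F(p, nu) distribution, so the contour radii can be computed directly instead of by simulation. This is a swapped-in technique, not what the answer does: the simulation there sums two independent squared t variates, and the components of a multivariate t are not independent, so the two approaches give different radii. A minimal sketch, assuming the same `nu` and probability grid as above:

```r
nu    <- 5                                    # degrees of freedom
p     <- 2                                    # dimension
probs <- seq(from = 0.01, to = 0.99, length.out = 20)

# Radius of the contour that encloses each probability mass
radii <- sqrt(p * qf(probs, df1 = p, df2 = nu))

# For p = 2 the F quantile has a closed form, used here as a cross-check
closed <- sqrt(nu * ((1 - probs)^(-2 / nu) - 1))
all.equal(radii, closed)
```

Under these assumptions each value in `radii` could be passed as the `t` argument of `ellipse()` in place of `sqrt(quantile(rtsq, probs = i))`.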

Plot regression coefficient with confidence intervals

Suppose I have 2 data frames, one for 2015 and one for 2016. I want to run a regression for each data frame and plot one of the coefficient for each regression with their respective confidence interval. For example:
set.seed(1020022316)
library(dplyr)
library(stargazer)
df16 <- data.frame(
x1 = rnorm(1000, 0, 2),
t = sample(c(0, 1), 1000, T),
e = rnorm(1000, 0, 10)
) %>% mutate(y = 0.5 * x1 + 2 * t + e) %>%
select(-e)
df15 <- data.frame(
x1 = rnorm(1000, 0, 2),
t = sample(c(0, 1), 1000, T),
e = rnorm(1000, 0, 10)
) %>% mutate(y = 0.75 * x1 + 2.5 * t + e) %>%
select(-e)
lm16 <- lm(y ~ x1 + t, data = df16)
lm15 <- lm(y ~ x1 + t, data = df15)
stargazer(lm15, lm16, type="text", style = "aer", ci = TRUE, ci.level = 0.95)
I want to plot t=1.558, x=2015, and t=2.797, x=2016 with their respective .95 CI. What is the best way of doing this?
I could do it 'by hand', but I hope there is a better way.
library(ggplot2)
df.plot <-
data.frame(
y = c(lm15$coefficients[['t']], lm16$coefficients[['t']]),
x = c(2015, 2016),
lb = c(
confint(lm15, 't', level = 0.95)[1],
confint(lm16, 't', level = 0.95)[1]
),
ub = c(
confint(lm15, 't', level = 0.95)[2],
confint(lm16, 't', level = 0.95)[2]
)
)
df.plot %>% ggplot(aes(x, y)) + geom_point() +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) +
geom_hline(aes(yintercept=0), linetype="dashed")
By "best" I mean: figure quality (it looks nice), code elegance, and ease of extension to more than two regressions.
This is a bit too long for a comment, so I post it as a partial answer.
It is unclear from your post whether your main problem is getting the data into the right shape or the plotting itself. But just to follow up on one of the comments, let me show you how to run several models using dplyr and broom in a way that makes plotting easy. Consider the mtcars dataset:
library(dplyr)
library(broom)
models <- mtcars %>% group_by(cyl) %>%
do(data.frame(tidy(lm(mpg ~ disp, data = .),conf.int=T )))
head(models) # I have abbreviated the following output a bit
cyl term estimate std.error statistic p.value conf.low conf.high
(dbl) (chr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
4 (Intercept) 40.8720 3.5896 11.39 0.0000012 32.752 48.99221
4 disp -0.1351 0.0332 -4.07 0.0027828 -0.210 -0.06010
6 (Intercept) 19.0820 2.9140 6.55 0.0012440 11.591 26.57264
6 disp 0.0036 0.0156 0.23 0.8259297 -0.036 0.04360
You see that this gives you all coefficients and confidence intervals in one nice data frame, which makes plotting with ggplot easier. For instance, if your datasets have identical columns, you could add a year identifier to each (e.g. df1$year <- 2000; df2$year <- 2001 etc.) and bind them together afterwards (e.g. using bind_rows, or you can use bind_rows's .id option). Then you can use the year identifier instead of cyl in the above example.
The plotting then is simple. To use the mtcars data again, let's plot the coefficients for disp only (though you could also use faceting, grouping, etc):
ggplot(filter(models, term=="disp"), aes(x=cyl, y=estimate)) +
geom_point() + geom_errorbar(aes(ymin=conf.low, ymax=conf.high))
To use your data:
df <- bind_rows(df16, df15, .id = "years")
models <- df %>% group_by(years) %>%
do(data.frame(tidy(lm(y ~ x1+t, data = .),conf.int=T ))) %>%
filter(term == "t") %>%
ggplot(aes(x=years, y=estimate)) + geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high))
Note that you can easily add more models just by binding more data to the main data frame. You can also use faceting, grouping or position dodging to adjust the look of the plot if you want to show more than one coefficient.
This is the solution I have right now:
gen_df_plot <- function(reg, coef_name){
df <- data.frame(y = reg$coefficients[[coef_name]],
lb = confint(reg, coef_name, level = 0.95)[1],
ub = confint(reg, coef_name, level = 0.95)[2])
return(df)
}
df.plot <- lapply(list(lm15,lm16), gen_df_plot, coef_name = 't')
df.plot <- data.table::rbindlist(df.plot)
df.plot$x <- as.factor(c(2015, 2016))
df.plot %>% ggplot(aes(x, y)) + geom_point(size=4) +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1, linetype="dotted") +
geom_hline(aes(yintercept=0), linetype="dashed") + theme_bw()
I don't love it, but it works.
Here is what might be generalized code. I have made a change to how "x" is defined so that you don't have to worry about alphabetic reordering of the factor.
#
# Paul Gronke and Paul Manson
# Early Voting Information Center at Reed College
#
# August 27, 2019
#
#
# Code to plot a single coefficient from multiple models, provided
# as an easier alternative to "coefplot" and "dotwhisker". Some users
# may find those packages more capable
#
# Code adapted from https://stackoverflow.com/questions/35582052/plot-regression-coefficient-with-confidence-intervals
# gen_df_plot function will create a tidy data frame for your plot
# Currently set up to display 95% confidence intervals
gen_df_plot <- function(reg, coef_name){
df <- data.frame(y = reg$coefficients[[coef_name]],
lb = confint(reg, coef_name, level = 0.95)[1],
ub = confint(reg, coef_name, level = 0.95)[2])
return(df)
}
# Populate the data frame with a list of your model results.
df.plot <- lapply(list(model1, # List your models here
model2),
gen_df_plot,
coef_name = 'x1') # Coefficient name
# Convert the list to a tidy data frame
df.plot <- data.table::rbindlist(df.plot)
# Provide the coefficient or regression labels below, in the
# order that you want them to appear. The "levels=unique(.)" parameter
# overrides R's desire to order the factor alphabetically
df.plot$x <- c("Group 1",
"Group 2") %>%
factor(., levels = unique(.),
ordered = TRUE)
# Create your plot
df.plot %>% ggplot(aes(x, y)) +
geom_point(size=4) +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1, linetype="dotted") +
geom_hline(aes(yintercept=0), linetype="dashed") +
theme_bw() +
ggtitle("Comparing Coefficients") +
ylab("Coefficient Value")
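The template above leaves `model1` and `model2` as placeholders. As a concrete usage sketch, the version below fits one model per cylinder group of the built-in mtcars data and collects the `disp` coefficient with its interval using base R only. `coef_ci()` is a hypothetical helper in the spirit of `gen_df_plot()`, not a function from any package.

```r
# One regression per cylinder group
fits <- lapply(split(mtcars, mtcars$cyl),
               function(d) lm(mpg ~ disp, data = d))

# Estimate and confidence bounds for a single coefficient
coef_ci <- function(fit, coef_name, level = 0.95) {
  ci <- confint(fit, coef_name, level = level)
  data.frame(y  = unname(coef(fit)[coef_name]),
             lb = ci[1, 1],
             ub = ci[1, 2])
}

df_plot   <- do.call(rbind, lapply(fits, coef_ci, coef_name = "disp"))
df_plot$x <- factor(names(fits), levels = names(fits))  # keep list order

# Point estimates with error bars, drawn with base graphics
xs <- seq_len(nrow(df_plot))
plot(xs, df_plot$y, ylim = range(df_plot$lb, df_plot$ub), xaxt = "n",
     pch = 19, xlab = "cyl", ylab = "Coefficient on disp")
axis(1, at = xs, labels = as.character(df_plot$x))
arrows(xs, df_plot$lb, xs, df_plot$ub, angle = 90, code = 3, length = 0.05)
abline(h = 0, lty = 2)
```

Because the models live in a named list, adding another regression only requires adding another list element.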

Partially fill density plot for area of interest [duplicate]

I frequently use kernel density plots to illustrate distributions. These are easy and fast to create in R like so:
set.seed(1)
draws <- rnorm(100)^2
dens <- density(draws)
plot(dens)
#or in one line like this: plot(density(rnorm(100)^2))
Which gives me this nice little PDF:
I'd like to shade the area under the PDF from the 75th to 95th percentiles. It's easy to calculate the points using the quantile function:
q75 <- quantile(draws, .75)
q95 <- quantile(draws, .95)
But how do I shade the area between q75 and q95?
With the polygon() function, see its help page and I believe we had similar questions here too.
You need to find the index of the quantile values to get the actual (x,y) pairs.
Edit: Here you go:
x1 <- min(which(dens$x >= q75))
x2 <- max(which(dens$x < q95))
with(dens, polygon(x=c(x[c(x1,x1:x2,x2)]), y= c(0, y[x1:x2], 0), col="gray"))
Output (added by JDL)
Another solution:
dd <- with(dens,data.frame(x,y))
library(ggplot2)
qplot(x,y,data=dd,geom="line")+
geom_ribbon(data=subset(dd,x>q75 & x<q95),aes(ymax=y),ymin=0,
fill="red",colour=NA,alpha=0.5)
Result:
An expanded solution:
If you wanted to shade both tails (copy & paste of Dirk's code) and use known x values:
set.seed(1)
draws <- rnorm(100)^2
dens <- density(draws)
plot(dens)
q2 <- 2
q65 <- 6.5
qn08 <- -0.8
qn02 <- -0.2
x1 <- min(which(dens$x >= q2))
x2 <- max(which(dens$x < q65))
x3 <- min(which(dens$x >= qn08))
x4 <- max(which(dens$x < qn02))
with(dens, polygon(x=c(x[c(x1,x1:x2,x2)]), y= c(0, y[x1:x2], 0), col="gray"))
with(dens, polygon(x=c(x[c(x3,x3:x4,x4)]), y= c(0, y[x3:x4], 0), col="gray"))
Result:
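The index arithmetic repeated in the polygon() snippets above can be wrapped in a small helper. This is only a sketch: `shade_between()` is a hypothetical name, and it selects grid points with a logical mask instead of `min(which())` and `max(which())`, so both bounds are treated as inclusive.

```r
# Shade the area under a density estimate between two x values.
# Extra arguments (col, border, ...) are passed on to polygon().
shade_between <- function(dens, from, to, ...) {
  keep <- dens$x >= from & dens$x <= to
  if (!any(keep)) return(invisible(NULL))     # nothing to shade
  xs <- dens$x[keep]
  ys <- dens$y[keep]
  polygon(c(xs[1], xs, xs[length(xs)]), c(0, ys, 0), ...)
  invisible(data.frame(x = xs, y = ys))       # the shaded grid points
}

set.seed(1)
draws <- rnorm(100)^2
dens  <- density(draws)
plot(dens)
shaded <- shade_between(dens, quantile(draws, 0.75), quantile(draws, 0.95),
                        col = "gray")
```

Calling the helper once per interval reproduces the two-tail example without duplicating the index code.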
This question needs a lattice answer. Here's a very basic one, simply adapting the method employed by Dirk and others:
#Set up the data
set.seed(1)
draws <- rnorm(100)^2
dens <- density(draws)
#Put in a simple data frame
d <- data.frame(x = dens$x, y = dens$y)
#Define a custom panel function;
# Options like color don't need to be hard coded
shadePanel <- function(x,y,shadeLims){
panel.lines(x,y)
m1 <- min(which(x >= shadeLims[1]))
m2 <- max(which(x <= shadeLims[2]))
tmp <- data.frame(x1 = x[c(m1,m1:m2,m2)], y1 = c(0,y[m1:m2],0))
panel.polygon(tmp$x1,tmp$y1,col = "blue")
}
#Plot (xyplot and the panel.* functions come from lattice)
library(lattice)
xyplot(y~x,data = d, panel = shadePanel, shadeLims = c(1,3))
Here's another ggplot2 variant based on a function that approximates the kernel density at the original data values:
approxdens <- function(x) {
dens <- density(x)
f <- with(dens, approxfun(x, y))
f(x)
}
Using the original data (rather than producing a new data frame with the density estimate's x and y values) has the benefit of also working in faceted plots where the quantile values depend on the variable by which the data is being grouped:
Code used
library(tidyverse)
library(RColorBrewer)
# dummy data
set.seed(1)
n <- 1e2
dt <- tibble(value = rnorm(n)^2)
# function that approximates the density at the provided values
approxdens <- function(x) {
dens <- density(x)
f <- with(dens, approxfun(x, y))
f(x)
}
probs <- c(0.75, 0.95)
dt <- dt %>%
mutate(dy = approxdens(value), # calculate density
p = percent_rank(value), # percentile rank
pcat = as.factor(cut(p, breaks = probs, # percentile category based on probs
include.lowest = TRUE)))
ggplot(dt, aes(value, dy)) +
geom_ribbon(aes(ymin = 0, ymax = dy, fill = pcat)) +
geom_line() +
scale_fill_brewer(guide = "none") +
theme_bw()
# dummy data with 2 groups
dt2 <- tibble(category = c(rep("A", n), rep("B", n)),
value = c(rnorm(n)^2, rnorm(n, mean = 2)))
dt2 <- dt2 %>%
group_by(category) %>%
mutate(dy = approxdens(value),
p = percent_rank(value),
pcat = as.factor(cut(p, breaks = probs,
include.lowest = TRUE)))
# faceted plot
ggplot(dt2, aes(value, dy)) +
geom_ribbon(aes(ymin = 0, ymax = dy, fill = pcat)) +
geom_line() +
facet_wrap(~ category, nrow = 2, scales = "fixed") +
scale_fill_brewer(guide = "none") +
theme_bw()
Created on 2018-07-13 by the reprex package (v0.2.0).

R Language - Sorting data into ranges; averaging; ignore outliers

I am analyzing data from a wind turbine. Normally this is the sort of thing I would do in Excel, but the quantity of data requires something heavy-duty. I have never used R before, so I am just looking for some pointers.
The data consists of 2 columns WindSpeed and Power, so far I have arrived at importing the data from a CSV file and scatter-plotted the two against each other.
What I would like to do next is to sort the data into ranges; for example all data where WindSpeed is between x and y and then find the average of power generated for each range and graph the curve formed.
From this average I want to recalculate the average based on data that falls within one or two standard deviations of the average (basically ignoring outliers).
Any pointers are appreciated.
For those who are interested, I am trying to create a graph similar to this. It's a pretty standard type of graph, but like I said, the sheer quantity of data requires something heavier than Excel.
Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: locally smooth regression, as implemented by loess.
Using a slight modification of csgillespie's sample data:
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
plot(w_sp, power)
x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
Throw this version, similar in motivation to @hadley's, into the mix. It uses an additive model with an adaptive smoother from the mgcv package:
Dummy data first, as used by @hadley
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)
Fit the additive model using gam(), using an adaptive smoother and smoothness selection via REML
require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
Predict from our model and get standard errors of fit, use latter to generate an approximate 95% confidence interval
x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)
Plot everything and the Loess fit for comparison
plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
lwd = 3)
First we will create some example data to make the problem concrete:
w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))
Suppose we want to bin the power values between [0,5), [5,10), etc. Then
bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))
We have now created the mean values between the ranges of interest. Note, if you wanted the median values, just change mean to median. All that's left to do, is to plot them:
plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)
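The binning step can also be written with `cut()` and `tapply()`, which labels every observation with its range and then averages within each label. A minimal sketch using the same simulated data is below. One difference from the loop above: with `include.lowest = TRUE` the last bin is closed on the right, so a wind speed of exactly 100 is kept rather than dropped.

```r
set.seed(1)
w_sp  <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(rnorm(1000, mean = w_sp, sd = 5) - 40) / 5))

bin_incr <- 5
breaks   <- seq(0, 100, bin_incr)

# Assign each wind speed to a half-open range [0,5), [5,10), ..., [95,100]
bin <- cut(w_sp, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Mean power per range; swap mean for median to get medians instead
y_mean_cut <- tapply(power, bin, mean)

# Plot the binned means at the midpoint of each range
plot(w_sp, power)
points(breaks[-length(breaks)] + bin_incr / 2, y_mean_cut, col = 3, pch = 16)
```

The result is a named vector whose names are the range labels, which makes the bins easy to inspect.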
To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:
noOutliers = function(x, power, w_sp, bin_incr) {
d = power[w_sp >= x & w_sp < (x + bin_incr)]
m_d = mean(d)
d_trim = d[(d > m_d - 2*sd(d)) & (d < m_d + 2*sd(d))]  # values within two SDs of the bin mean
return(mean(d_trim))
}
y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
Here are some examples of fitted curves (Weibull analysis) for commercial turbines:
http://www.inl.gov/wind/software/
http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf
http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf
I'd recommend also playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/ .
# If you haven't already installed ggplot2:
install.packages("ggplot2", dependencies = TRUE)
# Load the ggplot2 package
require(ggplot2)
# csgillespie's example data
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
# Bind the two variables into a data frame, which ggplot prefers
wind <- data.frame(w_sp = w_sp, power = power)
# Take a look at how the first few rows look, just for fun
head(wind)
# Create a simple plot
ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()
# Create a slightly more complicated plot as an example of how to fine tune
# plots in ggplot
p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = 16)  # 16 = solid dot
p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
p3 + scale_x_continuous(name = "mph") +
scale_y_continuous(name = "power") +
ggtitle("Wind speed and power")  # opts() is defunct in current ggplot2
