Get data associated to ggplot + stat_ecdf() - r

I like the stat_ecdf() feature of the ggplot2 package, which I find quite useful for exploring a data series. However, it is only visual, and I wonder whether it is possible, and if so how, to get the associated table.
Please have a look at the following reproducible example:
library(ggplot2)
p <- ggplot(iris, aes_string(x = "Sepal.Length")) + stat_ecdf() # build the cumulative chart
p
attributes(p) # chart attributes
p$data # data is the iris dataset, not the series used for displaying the chart

As @krfurlong showed me in this question, the layer_data() function in ggplot2 can get you exactly what you're looking for without the need to recreate the data.
p <- ggplot(iris, aes_string(x = "Sepal.Length")) + stat_ecdf()
p.data <- layer_data(p)
The "y" column in p.data contains the ecdf values, and the "x" column holds the Sepal.Length values on the x-axis of your plot.
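A minimal sketch of turning that layer data into a plain table, assuming a reasonably recent ggplot2. Because stat_ecdf() pads the series by default (pad = TRUE), the table also has rows at x = -Inf and x = Inf; the names ecdf_tbl, ecdf_finite and reference are made up for this illustration.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length)) + stat_ecdf()

# Keep only the two columns of interest from the computed layer data
ecdf_tbl <- layer_data(p)[, c("x", "y")]

# Drop the padding rows at -Inf and Inf added by pad = TRUE
ecdf_finite <- ecdf_tbl[is.finite(ecdf_tbl$x), ]

# Cross-check against base R's ecdf() evaluated at the same x values
reference <- ecdf(iris$Sepal.Length)(ecdf_finite$x)

head(ecdf_finite)
```

If the padding rows are not wanted in the plot either, stat_ecdf(pad = FALSE) leaves them out at the source.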

We can recreate the data:
#Recreate ecdf data at the sorted unique x values
x_vals <- sort(unique(iris$Sepal.Length))
dat_ecdf <-
data.frame(x=x_vals,
y=ecdf(iris$Sepal.Length)(x_vals))
#ecdf() already returns values in the 0,1 range, so no rescaling is needed
The two plots below should look the same:
#plot using new data
ggplot(dat_ecdf,aes(x,y)) +
geom_step() +
xlim(4,8)
#plot with built-in stat_ecdf
ggplot(iris, aes_string(x = "Sepal.Length")) +
stat_ecdf() +
xlim(4,8)

Related

Restricting the x being counted in a histogram

library(alr4)
par(mfrow = c(2,2))
ggplot(walleye, aes(x= age)) + geom_histogram() + facet_grid(~age)
I would like to create 4 histograms from the data set walleye. I would like the histograms to be for the length of the walleye. The four histograms should each have their own age for counting. I would like to restrict the ages to 1 through 4. How can I do that with ggplot?
If I understand what you are trying to do correctly, this should help:
library(alr4)
library(ggplot2)
ggplot(subset(walleye, age<5), aes(x=length)) + geom_histogram() + facet_grid(~age)
This way you are only plotting the subset of the data where age is 1-4, and you are actually plotting histograms of length.
You could try this too (adding another line of code on top of your code):
library(alr4)
library(ggplot2)
p <- ggplot(walleye, aes(x= age)) + geom_histogram() + facet_grid(~age)
p %+% subset(walleye, age %in% 1:4)

modifying ggplot objects after creation

Is there a preferred way to modify ggplot objects after creation?
For example, I recommend that my students save the R object together with the pdf file for later changes...
library(ggplot2)
graph <-
ggplot(mtcars, aes(x=mpg, y=qsec, fill=cyl)) +
geom_point() +
geom_text(aes(label=rownames(mtcars))) +
xlab('miles per gallon') +
ggtitle('my title')
ggsave('test.pdf', graph)
save(graph, file='graph.RData')
So now, in case they have to change the title, the labels or sometimes other things, they can easily load the object and change simple things.
load('graph.RData')
print(graph)
graph +
ggtitle('better title') +
ylab('seconds per quarter mile')
What do I have to do, for example, to change the colour to a discrete scale? In the original plot I would wrap the y in as.factor. But is there a way to do it afterwards?
Or is there a better way of modifying the objects when the data is gone? I would love to get some advice.
You could use ggplot_build() to alter the plot without the code or data:
Example plot:
data("iris")
p <- ggplot(iris) +
aes(x = Sepal.Length, y = Sepal.Width, colour = Species) +
geom_point()
The colours are mapped to Species.
Disassemble the plot using ggplot_build():
q <- ggplot_build(p)
Take a look at the object q to see what is happening here.
To change the colour of the point, you can alter the respective table in q:
q$data[[1]]$colour <- "black"
Reassemble the plot using ggplot_gtable():
q <- ggplot_gtable(q)
And plot it:
plot(q)
Now, the points are black.
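For the specific case of switching to a discrete colour scale, another option may be to add a new aes() to the loaded plot. This is a rough sketch rather than a definitive recipe: it relies on the plot object carrying its own copy of the data, and the plot below is a simplified stand-in (using colour instead of fill) for one that was saved and loaded again.

```r
library(ggplot2)

# Stand-in for a plot that was saved earlier and loaded again
graph <- ggplot(mtcars, aes(x = mpg, y = qsec, colour = cyl)) +
  geom_point()

# The plot object keeps a copy of the data it was built from
head(graph$data)

# Adding aes() to an existing plot overrides the matching default mapping,
# here turning the continuous colour into a discrete one
graph_discrete <- graph + aes(colour = factor(cyl))
graph_discrete
```

The legend title then reads factor(cyl); adding labs(colour = "cyl") restores a cleaner label. Unlike the ggplot_gtable() route, the result is still a ggplot object, so further layers and scales can be added.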

Plot point on ggplot2 smoothing regression on vline intersection

I want to create a (time-series) plot out of 40 million data points in order to show two regression lines, each with a specific event marked on it (the first occurrence of an optimum in the time series).
Currently, I draw the regression lines and add a geom_vline to it to indicate the event.
As I want to be independent from colours in the plot, it would be beneficial if I could just plot the marker geom_vline as a point on the regression line.
Do you have any idea how to solve this using ggplot2?
My current approach is the following (data points replaced with test data):
library(ggplot2)
# Generate data
m1 <- "method 1"
m2 <- "method 2"
data1 <- data.frame(Time=seq(100), Value=sample(1000, size=100), Type=rep(as.factor(m1), 100))
data2 <- data.frame(Time=seq(100), Value=sample(1000, size=100), Type=rep(as.factor(m2), 100))
df <- rbind(data1, data2)
rm(data1, data2)
# Calculate first minima for each Type
m1_intercept <- df[which(df$Type == m1), ][which.min(df[which(df$Type == m1), ]$Value),]
m2_intercept <- df[which(df$Type == m2), ][which.min(df[which(df$Type == m2), ]$Value),]
# Plot regression and vertical lines
p1 <- ggplot(df, aes(x=Time, y=Value, group=Type, colour=Type), linetype=Type) +
geom_smooth(se=F) +
geom_vline(aes(xintercept=m1_intercept$Time, linetype=m1_intercept$Type)) +
geom_vline(aes(xintercept=m2_intercept$Time, linetype=m2_intercept$Type)) +
scale_linetype_manual(name="", values=c("dotted", "dashed")) +
guides(colour=guide_legend(title="Regression"), linetype=guide_legend(title="First occurrence of optimum")) +
theme(legend.position="bottom")
ggsave("regression.png", plot=p1, height=5, width=7)
which generates this plot:
My desired plot would be something like this:
So my questions are
Does it make sense to indicate a minimum value on a regression line? The value's y-axis position would in fact be wrong, but it would indicate the timepoint.
If yes, how can I achieve such a behaviour?
If no, what would you think could be better?
Thank you very much in advance!
Robin
If you first run your ggplot() call with only geom_smooth(), you can access plotted values through ggplot_build(), which we then can use to plot points on the two fitted lines. Example:
# Create initial plot
p1<-ggplot(df, aes(x=Time, y=Value, colour=Type)) +
geom_smooth(se=F)
# Now we can access the fitted values
smooths <- ggplot_build(p1)$data[[1]]
smooths_1 <- smooths[smooths$group==1,] # First group (method 1)
smooths_2 <- smooths[smooths$group==2,] # Second group (method 2)
# Then we find the closest plotted values to the minima
smooth_1_x <- smooths_1$x[which.min(abs(smooths_1$x - m1_intercept$Time))]
smooth_2_x <- smooths_2$x[which.min(abs(smooths_2$x - m2_intercept$Time))]
# Subset the previously defined datasets for respective closest values
point_data1 <- smooths_1[smooths_1$x==smooth_1_x,]
point_data2 <- smooths_2[smooths_2$x==smooth_2_x,]
Now we use point_data1 and point_data2 to place the points on your plot:
ggplot(df, aes(x=Time, y=Value, colour=Type)) +
geom_smooth(se=F) +
geom_point(data=point_data1, aes(x=x, y=y), colour = "red",size = 5) +
geom_point(data=point_data2, aes(x=x, y=y), colour = "red", size = 5)
To reproduce this plot, you can use set.seed(42) for your data generation step.
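As a hedged alternative to picking the nearest plotted value, the smoother can be fitted directly and evaluated at the exact time of each minimum. The sketch below assumes the smoother is loess with its default settings (what geom_smooth() picks for small data sets such as the test data; with millions of points a different method would be used). The name marker_points and the regenerated test data are illustrative only.

```r
library(ggplot2)

set.seed(42)
df <- rbind(
  data.frame(Time = seq(100), Value = sample(1000, size = 100), Type = "method 1"),
  data.frame(Time = seq(100), Value = sample(1000, size = 100), Type = "method 2")
)

# For each Type: fit loess, find the time of the first minimum,
# and predict the fitted value at exactly that time
marker_points <- do.call(rbind, lapply(split(df, df$Type), function(d) {
  fit <- loess(Value ~ Time, data = d)
  t_min <- d$Time[which.min(d$Value)]
  data.frame(Type = d$Type[1],
             Time = t_min,
             Value = predict(fit, newdata = data.frame(Time = t_min)))
}))

ggplot(df, aes(x = Time, y = Value, colour = Type)) +
  geom_smooth(method = "loess", se = FALSE) +
  geom_point(data = marker_points, size = 5)
```

Because marker_points has the same column names as df, the points inherit the plot's aesthetics and take the colour of their own line.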

Removing Empty Facet Categories

I am having trouble making my faceted plot display only facets that contain data, as opposed to also displaying facets with no data.
The following code:
p<- ggplot(spad.data, aes(x=Day, y=Mean.Spad, color=Inoc))+
geom_point()
p + facet_grid(N ~ X.CO2.)
Gives the following graphic:
I have played around with it for a while but can't seem to figure out a solution.
Dataframe viewable here: https://docs.google.com/spreadsheets/d/11ZiDVRAp6qDcOsCkHM9zdKCsiaztApttJIg1TOyIypo/edit?usp=sharing
Reproducible Example viewable here: https://docs.google.com/document/d/1eTp0HCgZ4KX0Qavgd2mTGETeQAForETFWdIzechTphY/edit?usp=sharing
Your issue lies in the missing observations for your x and y variables. Those don't influence the creation of facets, which depends only on the levels of the faceting variables present in the data. Here is an illustration using sample data:
#generate some data
nobs=100
set.seed(123)
dat <- data.frame(G1=sample(LETTERS[1:3],nobs, T),
G2 = sample(LETTERS[1:3], nobs, T),
x=rnorm(nobs),
y=rnorm(nobs))
#introduce some missings in one group
dat$x[dat$G1=="C"] <- NA
#attempt to plot
p1 <- ggplot(dat, aes(x=x,y=y)) + facet_grid(G1~G2) + geom_point()
p1 #facets are generated according to the present levels of the grouping factors
#possible solution: remove the missing data before plotting
p2 <- ggplot(dat[complete.cases(dat),], aes(x=x, y=y)) + facet_grid(G1 ~G2) + geom_point()
p2
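To see the effect without looking at the graphics, the number of panels can be counted from the built plot. This is only a sketch: it reads the internal layout table of ggplot_build(), which is not a documented interface and may change between ggplot2 versions, and count_panels is a helper name invented for this example. The data is a small deterministic stand-in for the sample data above.

```r
library(ggplot2)

# Hypothetical data: every G1 x G2 combination is present,
# but x is missing for all rows where G1 is "C"
dat <- expand.grid(G1 = c("A", "B", "C"), G2 = c("A", "B", "C"),
                   rep = 1:5, stringsAsFactors = FALSE)
dat$x <- seq_len(nrow(dat))
dat$y <- rev(dat$x)
dat$x[dat$G1 == "C"] <- NA

# Helper (illustrative): number of facet panels in a built plot
count_panels <- function(p) nrow(ggplot_build(p)$layout$layout)

p_all <- ggplot(dat, aes(x = x, y = y)) +
  geom_point() + facet_grid(G1 ~ G2)
p_complete <- ggplot(dat[complete.cases(dat), ], aes(x = x, y = y)) +
  geom_point() + facet_grid(G1 ~ G2)

n_all <- count_panels(p_all)
n_complete <- count_panels(p_complete)
```

Removing the incomplete rows also removes the "C" level from the faceting variable, which is why the second plot has one row of panels fewer.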

How to use for function to construct a panel of ggplot2 plots in R

I am a beginner trying to build multiple plots in ggplot2, using the mtcars dataset in R:
library(datasets)
data (mtcars)
library (ggplot2)
## convert to factor some variables to avoid problems
factors<-c(2,9,10,11)
mtcars[,factors]<-lapply(mtcars[,factors],factor)
I want to plot mpg against all the other variables except am, which is mapped to colour in each plot. Each plot looks like this:
g1<- ggplot(mtcars, aes(x=mpg, y=cyl, color=am)) + geom_point(shape=1)
g2<- ggplot(mtcars, aes(x=mpg, y=disp, color=am)) + geom_point(shape=1)
g3...
Only the y axis changes from one plot to the other. I have done the plots from g1 to g9, with the y axis being any of the following:
variables<- c ("cyl","disp","hp","drat","wt","qsec","vs","gear","carb")
I am sure there must be a more elegant way to generate all 9 plots, but I cannot figure out how.
Any help?
If you want the plots in g1...g_n:
g <- lapply(variables, function(var) {
ggplot(mtcars, aes_string(x="mpg", y=var, color="am")) + geom_point(shape=1)
})
names(g) <- paste0("g", seq(g))
list2env(g, .GlobalEnv)
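Since aes_string() is deprecated in recent ggplot2 releases, the same loop can be sketched with the .data pronoun instead. This assumes ggplot2 3.0.0 or later; mtcars2 is a copy made here only so that the built-in dataset is left untouched.

```r
library(ggplot2)

# Work on a copy and make am a factor so it maps to a discrete colour
mtcars2 <- mtcars
mtcars2$am <- factor(mtcars2$am)

variables <- c("cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "gear", "carb")

# One plot per variable; .data[[var]] looks the column up by its name
g <- lapply(variables, function(var) {
  ggplot(mtcars2, aes(x = mpg, y = .data[[var]], color = am)) +
    geom_point(shape = 1) +
    ylab(var)
})
names(g) <- paste0("g", seq_along(g))

g$g2  # the plot of disp against mpg
```

Keeping the plots in a named list rather than exporting them with list2env() makes it easier to pass all of them to a layout function later.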
