Plot a sample of a time series - r

I have a dataset that contains observations for every second of four consecutive days (roughly 340'000 data points). This is too much to display in a scatter plot. I would like to plot only a uniform sample of, say, 2000 time points.
Is it possible to achieve this with ggplot2's "grammar of graphics" approach? I haven't found any built-in "sampling" modifier, but perhaps it's easy enough to write one?
library(ggplot2)
x <- 1:100000
d <- data.frame(x=x, y=rnorm(length(x)))
ggplot(d[sample(x, 2000), ], aes(x=x, y=y)) + geom_point()
This is how it can be "hacked" by modifying the data passed to ggplot. But I don't want to modify the data, just filter it to include only a sample.
ggplot(d, aes(x=x, y=y)) + ??? + geom_point()
EDIT: I'm specifically looking for sampling, not smoothing or binning. The data I have shows the time it takes to simulate one second of a specific process. The simulation has been parallelized, and for each simulated second I have the run times for each of the cores involved (8 in total). I want to show sub-optimal load balancing by plotting just the raw data points. The only reason for sampling is that 300'000 data points are far too many for a scatter plot: plotting takes too long and the visualization is unreadable.

You can subset within the geom_point call using the data argument:
... + geom_point(data=d[sample(x,2000),])
This way, you are free to add other geoms that use all the data, e.g., with the example data:
ggplot(d, aes(x=x, y=y)) + geom_hex() + geom_point(data=d[sample(x,2000),])
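A layer's data argument also accepts a function, which ggplot2 calls with the plot's default data. That comes fairly close to the "sampling modifier" asked for in the question. A minimal sketch, where sample_rows is a hypothetical helper name of my own and not part of ggplot2:

```r
library(ggplot2)

# Hypothetical helper (not a ggplot2 function): returns a function that
# draws up to n random rows from whatever data frame the plot supplies.
sample_rows <- function(n) {
  function(df) df[sample(nrow(df), min(n, nrow(df))), , drop = FALSE]
}

set.seed(42)
x <- 1:100000
d <- data.frame(x = x, y = rnorm(length(x)))

# The full data stays attached to the plot; only the point layer is thinned.
p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(data = sample_rows(2000))
p
```

Other layers added to the same plot still see all rows of d, so the sample only affects the layer that asks for it.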

If you want to create a scatter plot for big data, here are a couple of ggplot2 options.
They come from this course by Hadley.
library(knitr)
library(ggplot2)
# upload all images to imgur.com
opts_chunk$set(fig.width = 5, fig.height = 5, dev = "png")
render_markdown(strict = TRUE)
# some autocorrelated data
set.seed(1)
x <- 1:1e+05
d <- data.frame(x = x)
d$y <- arima.sim(list(order = c(1, 1, 0), ar = 0.9), n = 1e+05 - 1)
# the basic plot
base_plot <- ggplot(d, aes(x = x, y = y))
geom_bin2d
you can set the binwidth for the x and y variables
base_plot + geom_bin2d(binwidth = c(200, 5))
geom_hex
you can set the number of bins
base_plot + geom_hex(bins = 200)
small points
Reduces overplotting by drawing each point as a single pixel
base_plot + geom_point(shape = ".")
use a smoother
This relies on having a smoothing method that will get you the detail you want without crashing or taking too long. In this case the number of knots was chosen by trial and error (and perhaps you will want more detail)
library(mgcv)
base_plot + stat_smooth(method = "gam", formula = y ~ s(x, k = 50))

Related

How to clip an interpolated layer in R so it does not extend past data boundaries

I am trying to display a cross-section of conductivity in a lagoon environment using isolines. I have applied interp() and stat_contour() to my data, but I would like to clip the interpolated output so that it doesn't extend past my data points. This way the bathymetry of the lagoon in the cross-section is clear. Here is the code I have used so far:
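library(readr)    # read_csv()
library(akima)    # interp(); assumed to be the akima version
library(ggplot2)
library(ggthemes) # theme_tufte()
cond_df <- read_csv("salinity_profile.csv")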
di <- interp(cond_df$stop, cond_df$depth, cond_df$conductivity,
xo = seq(min(cond_df$stop), max(cond_df$stop), length = 200),
yo = seq(min(cond_df$depth), max(cond_df$depth), length = 200))
dat_interp <- data.frame(expand.grid(x=di$x, y=di$y), z=c(di$z))
ggplot(dat_interp) +
aes(x=x, y=y, z=z, fill=z)+
scale_y_reverse() +
geom_tile()+
stat_contour(colour="white", size=0.25) +
scale_fill_viridis_c() +
theme_tufte(base_family="Helvetica")
Here is the output:
interpolated plot
To help clarify, here is the data just as a geom_point() graph, and I do not want the interpolated layer going past the lower points of the graph:
cond_df%>%
ggplot(mapping=aes(x=stop, y=depth, z=conductivity, fill=conductivity)) +
geom_point(aes(colour = conductivity), size = 3) +
scale_y_reverse()
point plot
You can mask the unwanted region of the plot by using geom_ribbon.
You will need to generate a data.frame with values for the max depth at each stop. Here's one somewhat inelegant way to do that:
# Create the empty data frame for all stops
bathymetry <- data.frame(depth = as.numeric(NA),
stop = unique(cond_df$stop))
# Find the max depth for each stop
for(thisStop in bathymetry$stop){
bathymetry[bathymetry$stop==thisStop, "depth"] <- max(cond_df[cond_df$stop==thisStop, "depth"])
}
Then, you can add the geom_ribbon as the last geom of your plot, like so
geom_ribbon(data=bathymetry, aes(x=stop, ymin=depth, ymax=max(cond_df$depth)), inherit.aes = FALSE)
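The loop above can also be replaced by a single aggregate() call. A small sketch on made-up data; the toy cond_df below only stands in for the real file and assumes the same column names (stop, depth, conductivity):

```r
# Toy stand-in for cond_df (assumed columns: stop, depth, conductivity)
set.seed(1)
cond_df <- data.frame(
  stop = rep(1:3, times = c(2, 4, 3)),
  depth = c(1, 2, 1, 2, 3, 4, 1, 2, 3),
  conductivity = runif(9)
)

# Deepest observation at each stop, without an explicit loop
bathymetry <- aggregate(depth ~ stop, data = cond_df, FUN = max)
bathymetry
```

The resulting data frame has one row per stop and can be passed to geom_ribbon exactly as before.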

Create multiple histograms in a plot starting from bins and frequencies instead of samples?

I have a data frame of size 10^6x3, i.e., 1 million samples for each of three variables. I would like to create three overlaid histograms in the same plot (alpha blending?) using R. Handling that many samples on my PC is possible (they fit in memory and R doesn't hang forever), but it is not lightning fast. The code that generated the samples also returns the lower and upper bin boundaries and the corresponding frequencies. Of course, this is much less data: I can choose the number of bins, but let's say 30 bins per variable, so 30x2x3=180 doubles. Is there a way in R to create overlaid histograms starting from the bin and frequency data? I would like to use ggplot2, but I'm open to solutions with base R or other packages. Also, what would you do in my situation? Would you use the original samples and not worry about the longer computation time and memory use? Or would you go for bins/frequencies? I'd like to use the raw data, but I'm worried that R could get too slow or hog too much memory, and that this could cause problems in later computations. So a solution that uses the raw data but is optimized for speed/memory would be great; otherwise it's OK to use bins/frequencies (if that is possible at all!).
Yes, of course you can! Using the bins and frequencies you can make a bar graph.
dat <- data.frame(group = rep(c('a', 'b'), each = 10),
bin = rep(1:10, 2),
frequency = rnorm(20, 5))
library(ggplot2)
Using alpha blending as you suggested:
ggplot(dat, aes(x = bin, y = frequency, fill = group)) +
geom_bar(stat = 'identity', position = position_identity(), alpha = 0.4)
Or we dodge the bars:
ggplot(dat, aes(x = bin, y = frequency, fill = group)) +
geom_bar(stat = 'identity', position = 'dodge')
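If you also have the lower and upper boundary of each bin, as the question describes, geom_rect can draw the bars at their true widths. A rough sketch with made-up bin tables; the column names variable, lower, upper and frequency are my own assumptions about how such a table might look:

```r
library(ggplot2)

# Made-up bin table: 3 variables x 30 bins, each with lower/upper edges.
edges <- seq(-4, 10, length.out = 31)
means <- c(A = 0, B = 3, C = 6)
bins <- do.call(rbind, lapply(names(means), function(v) {
  data.frame(
    variable  = v,
    lower     = head(edges, -1),
    upper     = tail(edges, -1),
    # expected number of 1e6 normal draws falling into each bin
    frequency = 1e6 * diff(pnorm(edges, mean = means[[v]]))
  )
}))

# One rectangle per bin; alpha blending shows where the histograms overlap.
p <- ggplot(bins, aes(xmin = lower, xmax = upper,
                      ymin = 0, ymax = frequency, fill = variable)) +
  geom_rect(alpha = 0.4)
p
```

This only ever touches 90 rows, so it renders immediately regardless of how many samples went into the bins.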
I was curious about "not lightning fast". The dataset below (1e6 cases X 3 variables) renders in ~6 sec on my machine (Core i7, Win7 x64). Is that too slow?
set.seed(1) # for reproducible example
df <- data.frame(matrix(rnorm(3e6, mean=rep(c(0,3,6), each=1e6)), ncol=3))
names(df) <- c("A","B","C")
library(ggplot2)
library(reshape2)
gg.df <- melt(df, variable.name="category")
system.time({
ggp <- ggplot(gg.df, aes(x=value, fill=category)) +
stat_bin(geom="bar", position="identity", alpha=0.7)
plot(ggp)
})
# user system elapsed
# 5.68 0.53 6.24

Symmetrical histograms

I want to make a number of symmetrical histograms to show butterfly abundance through time. Here's a site that shows the form of the graphs I am trying to create: http://thebirdguide.com/pelagics/bar_chart.htm
For ease, I will use the iris dataset.
library(ggplot2)
g <- ggplot(iris, aes(Sepal.Width)) + geom_histogram(binwidth=.5)
g + coord_fixed(ratio = .003)
Essentially, I would like to mirror this histogram below the x-axis. Another way of thinking about the problem is to create a horizontal violin diagram with distinct bins. I've looked at the plotrix package and the ggplot2 documentation but don't find a solution in either place. I prefer to use ggplot2 but other solutions in base R, lattice or other packages will be fine.
Without your exact data, I can only provide an approximate coding solution, but it is a start for you (if you add more details, I'll be happy to help you tweak the plot). Here's the code:
library(ggplot2)
noSpp <- 3
nTime <- 10
d <- data.frame(
JulianDate = rep(1:nTime , times = noSpp),
sppAbundance = c(c(1:5, 5:1),
c(3:5, 5:1, 1:2),
c(5:1, 1:5)),
yDummy = 1,
sppName = rep(letters[1:noSpp], each = nTime))
ggplot(data = d, aes(x = JulianDate, y = yDummy, size = sppAbundance)) +
geom_line() + facet_grid( sppName ~ . ) + ylab("Species") +
xlab("Julian Date")
And here's the figure.
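For the literal "mirror the histogram below the x-axis" reading of the question, another option is to bin the values yourself and draw each bin as a rectangle centred on zero. A minimal sketch on iris; the bin edges are chosen by me for illustration:

```r
library(ggplot2)

# Bin the values manually so the counts can be mirrored around y = 0.
h <- hist(iris$Sepal.Width, breaks = seq(2, 4.5, by = 0.5), plot = FALSE)

mirrored <- data.frame(
  xmin = head(h$breaks, -1),
  xmax = tail(h$breaks, -1),
  half = h$counts / 2   # half the count above the axis, half below
)

p <- ggplot(mirrored, aes(xmin = xmin, xmax = xmax, ymin = -half, ymax = half)) +
  geom_rect() +
  scale_y_continuous(breaks = NULL) +
  labs(x = "Sepal.Width", y = NULL)
p
```

With one such data frame per species (plus a species column), facet_grid would stack the mirrored histograms much like the linked bar charts.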

How to shade part of a density curve in ggplot (with no y axis data)

I'm trying to create a density curve in R using a set of random numbers between 1000, and shade the part that is less than or equal to a certain value. There are a lot of solutions out there involving geom_area or geom_ribbon, but they all require a yval, which I don't have (it's just a vector of 1000 numbers). Any ideas on how I could do this?
Two other related questions:
Is it possible to do the same thing for a cumulative density function (I'm currently using stat_ecdf to generate one), or shade it at all?
Is there any way to edit geom_vline so it will only go up to the height of the density curve, rather than the whole y axis?
Code: (the geom_area is a failed attempt to edit some code I found. If I set ymax manually, I just get a column taking up the whole plot, instead of just the area under the curve)
set.seed(100)
amount_spent <- rnorm(1000,500,150)
amount_spent1<- data.frame(amount_spent)
rand1 <- runif(1,0,1000)
amount_spent1$pdf <- dnorm(amount_spent1$amount_spent)
mean1 <- mean(amount_spent1$amount_spent)
#density/bell curve
ggplot(amount_spent1,aes(amount_spent)) +
geom_density( size=1.05, color="gray64", alpha=.5, fill="gray77") +
geom_vline(xintercept=mean1, alpha=.7, linetype="dashed", size=1.1, color="cadetblue4")+
geom_vline(xintercept=rand1, alpha=.7, linetype="dashed",size=1.1, color="red3")+
geom_area(mapping=aes(ifelse(amount_spent1$amount_spent > rand1,amount_spent1$amount_spent,0)), ymin=0, ymax=.03,fill="red",alpha=.3)+
ylab("")+
xlab("Amount spent on lobbying (in Millions USD)")+
scale_x_continuous(breaks=seq(0,1000,100))
There are a couple of questions that show this ... here and here, but they calculate the density prior to plotting.
This is another way, more complicated than required, I'm sure, that lets ggplot do some of the calculations for you.
# Your data
set.seed(100)
amount_spent1 <- data.frame(amount_spent=rnorm(1000, 500, 150))
mean1 <- mean(amount_spent1$amount_spent)
rand1 <- runif(1,0,1000)
Basic density plot
p <- ggplot(amount_spent1, aes(amount_spent)) +
geom_density(fill="grey") +
geom_vline(xintercept=mean1)
You can extract the x and y positions for the area to shade from the plot object using ggplot_build. Linear interpolation was used to get the y value at x=rand1
# subset region and plot
d <- ggplot_build(p)$data[[1]]
p <- p + geom_area(data = subset(d, x > rand1), aes(x=x, y=y), fill="red") +
geom_segment(x=rand1, xend=rand1,
y=0, yend=approx(x = d$x, y = d$y, xout = rand1)$y,
colour="blue", size=3)
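For comparison, here is a sketch of the "calculate the density first" route, this time shading the part at or below a cut-off as the question asks. The threshold value of 400 is an arbitrary assumption for illustration:

```r
library(ggplot2)

set.seed(100)
amount_spent <- rnorm(1000, 500, 150)
threshold <- 400  # assumed cut-off, purely for illustration

# Estimate the density up front so x and y are ordinary columns.
dens <- density(amount_spent)
dens_df <- data.frame(x = dens$x, y = dens$y)

p <- ggplot(dens_df, aes(x, y)) +
  geom_line() +
  # shade only the region at or below the threshold
  geom_area(data = subset(dens_df, x <= threshold), fill = "red", alpha = 0.3) +
  # vertical marker that stops at the curve instead of spanning the panel
  annotate("segment", x = threshold, xend = threshold, y = 0,
           yend = approx(dens$x, dens$y, xout = threshold)$y,
           linetype = "dashed")
p
```

The same pattern should carry over to a cumulative curve by replacing the density estimate with ecdf(amount_spent) evaluated on a grid of x values.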

R + ggplot : Time series with events

I'm an R/ggplot newbie. I would like to create a geom_line plot of a continuous variable time series and then add a layer composed of events. The continuous variable and its timestamps is stored in one data.frame, the events and their timestamps are stored in another data.frame.
What I would really like to do is something like the charts on finance.google.com. In those, the time series is stock-price and there are "flags" to indicate news-events. I'm not actually plotting finance stuff, but the type of graph is similar. I am trying to plot visualizations of log file data. Here's an example of what I mean...
If advisable (?), I would like to use separate data.frames for each layer (one for continuous variable observations, another for events).
After some trial and error this is about as close as I can get. Here, I am using example data from data sets that come with ggplot. "economics" contains some time-series data that I'd like to plot and "presidential" contains a few events (presidential elections).
library(ggplot2)
data(presidential)
data(economics)
presidential <- presidential[-(1:3),]
yrng <- range(economics$unemploy)
ymin <- yrng[1]
ymax <- yrng[1] + 0.1*(yrng[2]-yrng[1])
p2 <- ggplot()
p2 <- p2 + geom_line(mapping=aes(x=date, y=unemploy), data=economics , size=3, alpha=0.5)
p2 <- p2 + scale_x_date("time") + scale_y_continuous(name="unemployed [1000's]")
p2 <- p2 + geom_segment(mapping=aes(x=start,y=ymin, xend=start, yend=ymax, colour=name), data=presidential, size=2, alpha=0.5)
p2 <- p2 + geom_point(mapping=aes(x=start,y=ymax, colour=name ), data=presidential, size=3)
p2 <- p2 + geom_text(mapping=aes(x=start, y=ymax, label=name, angle=20, hjust=-0.1, vjust=0.1),size=6, data=presidential)
p2
Questions:
This is OK for very sparse events, but if there's a cluster of them (as often happens in a log file), it gets messy. Is there some technique I can use to neatly display a bunch of events occurring in a short time interval? I was thinking of position_jitter, but it was really hard for me to get this far. google charts stacks these event "flags" on top of each other if there's a lot of them.
I actually don't like sticking the event data in the same scale as the continuous measurement display. I would prefer to put it in a facet_grid. The problem is that the facets all must be sourced from the same data.frame (not sure if that's true). If so, that also seems not ideal (or maybe I'm just trying to avoid using reshape?)
Now I like ggplot as much as the next guy, but if you want to make the Google Finance type charts, why not just do it with the Google graphics API?!? You're going to love this:
install.packages("googleVis")
library(googleVis)
dates <- seq(as.Date("2011/1/1"), as.Date("2011/12/31"), "days")
happiness <- rnorm(365)^ 2
happiness[333:365] <- happiness[333:365] * 3 + 20
Title <- NA
Annotation <- NA
df <- data.frame(dates, happiness, Title, Annotation)
df$Title[333] <- "Discovers Google Viz"
df$Annotation[333] <- "Google Viz API interface by Markus Gesmann causes acute increases in happiness."
### Everything above here is just for making up data ###
## from here down is the actual graphics bits ###
AnnoTimeLine <- gvisAnnotatedTimeLine(df, datevar="dates",
numvar="happiness",
titlevar="Title", annotationvar="Annotation",
options=list(displayAnnotations=TRUE,
legendPosition='newRow',
width=600, height=300)
)
# Display chart
plot(AnnoTimeLine)
# Create Google Gadget
cat(createGoogleGadget(AnnoTimeLine), file="annotimeline.xml")
and it produces this fantastic chart:
As much as I like @JD Long's answer, I'll put one that is just in R/ggplot2.
The approach is to create a second data set of events and to use that to determine positions. Starting with what @Angelo had:
library(ggplot2)
data(presidential)
data(economics)
Pull out the event (presidential) data, and transform it. Compute baseline and offset as fractions of the economic data it will be plotted with. Set the bottom (ymin) to the baseline. This is where the tricky part comes. We need to be able to stagger labels if they are too close together. So determine the spacing between adjacent labels (assumes that the events are sorted). If it is less than some amount (I picked about 4 years for this scale of data), then note that that label needs to be higher. But it has to be higher than the one after it, so use rle to get the length of TRUE's (that is, must be higher) and compute an offset vector using that (each string of TRUE must count down from its length to 2, the FALSEs are just at an offset of 1). Use this to determine the top of the bars (ymax).
events <- presidential[-(1:3),]
baseline = min(economics$unemploy)
delta = 0.05 * diff(range(economics$unemploy))
events$ymin = baseline
events$timelapse = c(diff(events$start),Inf)
events$bump = events$timelapse < 4*370 # ~4 years
offsets <- rle(events$bump)
events$offset <- unlist(mapply(function(l,v) {if(v){(l:1)+1}else{rep(1,l)}}, l=offsets$lengths, v=offsets$values, USE.NAMES=FALSE))
events$ymax <- events$ymin + events$offset * delta
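To see what the rle step does, here is the same offset logic on a small hand-made bump vector (the values are made up for illustration):

```r
# TRUE means "the next label is too close, so this one must sit higher".
bump <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
runs <- rle(bump)

# Each run of TRUEs counts down from its length + 1 to 2; FALSEs get 1.
offset <- unlist(mapply(
  function(l, v) if (v) (l:1) + 1 else rep(1, l),
  l = runs$lengths, v = runs$values, USE.NAMES = FALSE
))
offset
# 3 2 1 2 1
```

The first two labels are crowded, so they are lifted to levels 3 and 2 above the third one, which stays at the base level of 1.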
Putting this together into a plot:
ggplot() +
geom_line(mapping=aes(x=date, y=unemploy), data=economics , size=3, alpha=0.5) +
geom_segment(data = events, mapping=aes(x=start, y=ymin, xend=start, yend=ymax)) +
geom_point(data = events, mapping=aes(x=start,y=ymax), size=3) +
geom_text(data = events, mapping=aes(x=start, y=ymax, label=name), hjust=-0.1, vjust=0.1, size=6) +
scale_x_date("time") +
scale_y_continuous(name="unemployed [1000's]")
You could facet, but it is tricky with different scales. Another approach is composing two graphs. There is some extra fiddling that has to be done to make sure the plots have the same x-range, to make the labels all fit in the lower plot, and to eliminate the x axis in the upper plot.
xrange = range(c(economics$date, events$start))
p1 <- ggplot(data=economics, mapping=aes(x=date, y=unemploy)) +
geom_line(size=3, alpha=0.5) +
scale_x_date("", limits=xrange) +
scale_y_continuous(name="unemployed [1000's]") +
theme(axis.text.x = element_blank(), axis.title.x = element_blank())
ylims <- c(0, (max(events$offset)+1)*delta) + baseline
p2 <- ggplot(data = events, mapping=aes(x=start)) +
geom_segment(mapping=aes(y=ymin, xend=start, yend=ymax)) +
geom_point(mapping=aes(y=ymax), size=3) +
geom_text(mapping=aes(y=ymax, label=name), hjust=-0.1, vjust=0.1, size=6) +
scale_x_date("time", limits=xrange) +
scale_y_continuous("", breaks=NULL, limits=ylims)
#install.packages("ggExtra", repos="http://R-Forge.R-project.org")
library(ggExtra)
align.plots(p1, p2, heights=c(3,1))
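If align.plots() is not available in your installation, the same two-panel layout can be sketched with the patchwork package. This is a swapped-in alternative, not what the answer above used, and it assumes patchwork is installed. The panels are simplified versions of p1 and p2 (no label staggering):

```r
library(ggplot2)
library(patchwork)  # assumption: installed; not part of the original answer

events <- presidential[-(1:3), ]
xrange <- range(c(economics$date, events$start))

# Upper panel: the time series, with the x axis text hidden.
top <- ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  scale_x_date(NULL, limits = xrange) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

# Lower panel: one marker and label per event on a dummy y value.
bottom <- ggplot(events, aes(start, 1)) +
  geom_point(size = 3) +
  geom_text(aes(label = name), hjust = -0.1, vjust = -0.5) +
  scale_x_date("time", limits = xrange) +
  scale_y_continuous(NULL, breaks = NULL)

# Stack the panels with a 3:1 height ratio; patchwork aligns the plot areas.
combined <- top / bottom + plot_layout(heights = c(3, 1))
combined
```

Because both panels share the same x limits, the events line up with the dates in the series above them.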
Plotly is an easy way to make ggplots interactive. To display events, coerce them into factors which can be displayed as an aesthetic, like color.
The end result is a plot that you can drag the cursor over. The plots display data of interest:
Here is the code for making the ggplot:
# load data
data(presidential)
data(economics)
# events of interest
events <- presidential[-(1:3),]
# extract the year from the economics dates
economics$year = as.numeric(format(economics$date, format = "%Y"))
# use dplyr to summarise data by year
#install.packages("dplyr")
library(dplyr)
economics_mean <- economics %>%
group_by(year) %>%
summarise(mean_unemployment = mean(unemploy))
# add president terms to summarized data frame as a factor
president <- c(rep(NA,14), rep("Reagan", 8), rep("Bush", 4), rep("Clinton", 8), rep("Bush", 8), rep("Obama", 7))
economics_mean$president <- president
# create ggplot
p <- ggplot(data = economics_mean, aes(x = year, y = mean_unemployment)) +
geom_point(aes(color = president)) +
geom_line(alpha = 1/3)
It only takes one line of code to make the ggplot into a plotly object.
# make it interactive!
#install.packages("plotly")
library(plotly)
ggplotly(p)
Since you are plotting a time series together with qualitative information: most economics books shade regions of the plot area to mark a structural change or event in the data, so I recommend something like this:
library(ggplot2)
data(presidential)
data(economics)
ggplot() +
geom_rect(aes(xmin = start,
xmax = end,
ymin = 0, ymax = Inf,
fill = name),
data = presidential,
show.legend = F) +
geom_text(aes(x = start+500,
y = 2000,
label = name,
angle = 90),
data = presidential) +
geom_line(aes(x = date, y = unemploy),
data= economics) +
scale_fill_brewer(palette = "Blues") +
labs(x = "time", y = "unemploy")

Resources