I'm an R/ggplot newbie. I would like to create a geom_line plot of a continuous-variable time series and then add a layer composed of events. The continuous variable and its timestamps are stored in one data.frame; the events and their timestamps are stored in another data.frame.
What I would really like to do is something like the charts on finance.google.com. In those, the time series is stock-price and there are "flags" to indicate news-events. I'm not actually plotting finance stuff, but the type of graph is similar. I am trying to plot visualizations of log file data. Here's an example of what I mean...
If advisable (?), I would like to use separate data.frames for each layer (one for continuous variable observations, another for events).
After some trial and error this is about as close as I can get. Here, I am using example data from data sets that come with ggplot. "economics" contains some time-series data that I'd like to plot and "presidential" contains a few events (presidential elections).
library(ggplot2)
data(presidential)
data(economics)
presidential <- presidential[-(1:3),]
yrng <- range(economics$unemploy)
ymin <- yrng[1]
ymax <- yrng[1] + 0.1*(yrng[2]-yrng[1])
p2 <- ggplot()
p2 <- p2 + geom_line(mapping=aes(x=date, y=unemploy), data=economics , size=3, alpha=0.5)
p2 <- p2 + scale_x_date("time") + scale_y_continuous(name="unemployed [1000's]")
p2 <- p2 + geom_segment(mapping=aes(x=start,y=ymin, xend=start, yend=ymax, colour=name), data=presidential, size=2, alpha=0.5)
p2 <- p2 + geom_point(mapping=aes(x=start,y=ymax, colour=name ), data=presidential, size=3)
p2 <- p2 + geom_text(mapping=aes(x=start, y=ymax, label=name, angle=20, hjust=-0.1, vjust=0.1),size=6, data=presidential)
p2
Questions:
This is OK for very sparse events, but if there's a cluster of them (as often happens in a log file), it gets messy. Is there some technique I can use to neatly display a bunch of events occurring in a short time interval? I was thinking of position_jitter, but it was really hard for me to get even this far. Google's charts stack these event "flags" on top of each other when there are a lot of them.
I actually don't like sticking the event data in the same scale as the continuous measurement display. I would prefer to put it in a facet_grid. The problem is that the facets all must be sourced from the same data.frame (not sure if that's true). If so, that also seems not ideal (or maybe I'm just trying to avoid using reshape?)
Now I like ggplot as much as the next guy, but if you want to make the Google Finance type charts, why not just do it with the Google graphics API?!? You're going to love this:
install.packages("googleVis")
library(googleVis)
dates <- seq(as.Date("2011/1/1"), as.Date("2011/12/31"), "days")
happiness <- rnorm(365)^ 2
happiness[333:365] <- happiness[333:365] * 3 + 20
Title <- NA
Annotation <- NA
df <- data.frame(dates, happiness, Title, Annotation)
df$Title[333] <- "Discovers Google Viz"
df$Annotation[333] <- "Google Viz API interface by Markus Gesmann causes acute increases in happiness."
### Everything above here is just for making up data ###
## from here down is the actual graphics bits ###
AnnoTimeLine <- gvisAnnotatedTimeLine(df, datevar="dates",
numvar="happiness",
titlevar="Title", annotationvar="Annotation",
options=list(displayAnnotations=TRUE,
legendPosition='newRow',
width=600, height=300)
)
# Display chart
plot(AnnoTimeLine)
# Create Google Gadget
cat(createGoogleGadget(AnnoTimeLine), file="annotimeline.xml")
and it produces this fantastic chart:
As much as I like @JD Long's answer, I'll put one that is just in R/ggplot2.
The approach is to create a second data set of events and to use that to determine positions. Starting with what @Angelo had:
library(ggplot2)
data(presidential)
data(economics)
Pull out the event (presidential) data and transform it. Compute a baseline and an offset as fractions of the range of the economic data it will be plotted with, and set the bottom of each bar (ymin) to the baseline. Now comes the tricky part: labels need to be staggered if they are too close together. So compute the spacing between adjacent labels (this assumes that the events are sorted by date). If the gap to the next event is less than some threshold (I picked about 4 years for this scale of data), flag that label as needing to be higher. But it has to be higher than the one after it, so use rle to get the lengths of the runs of TRUEs (that is, labels that must be higher) and compute an offset vector from them: each run of TRUEs counts down from its length plus 1 to 2, and the FALSEs all get an offset of 1. Use the offsets to determine the top of the bars (ymax).
events <- presidential[-(1:3),]
baseline = min(economics$unemploy)
delta = 0.05 * diff(range(economics$unemploy))
events$ymin = baseline
events$timelapse = c(diff(events$start),Inf)
events$bump = events$timelapse < 4*370 # ~4 years
offsets <- rle(events$bump)
events$offset <- unlist(mapply(function(l,v) {if(v){(l:1)+1}else{rep(1,l)}}, l=offsets$lengths, v=offsets$values, USE.NAMES=FALSE))
events$ymax <- events$ymin + events$offset * delta
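To see what the rle step does on its own, here is a toy sketch with a made-up `bump` vector (not the presidential data). It uses the same counting rule as the code above; `SIMPLIFY = FALSE` is my addition so that mapply always returns a list before unlist.

```r
# TRUE means "the next label is close, so this one must sit higher"
bump <- c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)

runs <- rle(bump)
# runs$lengths: 1 2 1 1 1
# runs$values:  FALSE TRUE FALSE TRUE FALSE

offset <- unlist(mapply(
  function(l, v) if (v) (l:1) + 1 else rep(1, l),
  l = runs$lengths, v = runs$values,
  SIMPLIFY = FALSE, USE.NAMES = FALSE
))
offset
# 1 3 2 1 2 1
```

Each run of TRUEs steps down toward the FALSE that follows it, so no label in a cluster ends up at the same height as its right-hand neighbour.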
Putting this together into a plot:
ggplot() +
geom_line(mapping=aes(x=date, y=unemploy), data=economics , size=3, alpha=0.5) +
geom_segment(data = events, mapping=aes(x=start, y=ymin, xend=start, yend=ymax)) +
geom_point(data = events, mapping=aes(x=start,y=ymax), size=3) +
geom_text(data = events, mapping=aes(x=start, y=ymax, label=name), hjust=-0.1, vjust=0.1, size=6) +
scale_x_date("time") +
scale_y_continuous(name="unemployed [1000's]")
You could facet, but it is tricky with different scales. Another approach is composing two graphs. There is some extra fiddling that has to be done to make sure the plots have the same x-range, to make the labels all fit in the lower plot, and to eliminate the x axis in the upper plot.
xrange = range(c(economics$date, events$start))
p1 <- ggplot(data=economics, mapping=aes(x=date, y=unemploy)) +
geom_line(size=3, alpha=0.5) +
scale_x_date("", limits=xrange) +
scale_y_continuous(name="unemployed [1000's]") +
theme(axis.text.x = element_blank(), axis.title.x = element_blank())
ylims <- c(0, (max(events$offset)+1)*delta) + baseline
p2 <- ggplot(data = events, mapping=aes(x=start)) +
geom_segment(mapping=aes(y=ymin, xend=start, yend=ymax)) +
geom_point(mapping=aes(y=ymax), size=3) +
geom_text(mapping=aes(y=ymax, label=name), hjust=-0.1, vjust=0.1, size=6) +
scale_x_date("time", limits=xrange) +
scale_y_continuous("", breaks=NULL, limits=ylims)
#install.packages("ggExtra", repos="http://R-Forge.R-project.org")
library(ggExtra)
align.plots(p1, p2, heights=c(3,1))
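For comparison, here is a rough sketch of the faceting route using two separate data frames. It is only one way to do it, and it assumes you are happy with a made-up `panel` column added to each data frame and with free y scales; the events panel uses an arbitrary 0 to 1 range rather than the staggered offsets.

```r
library(ggplot2)

# `panel` is a made-up column used only to assign rows to facets
panels <- c("unemployed [1000's]", "events")

econ <- as.data.frame(economics)
econ$panel <- factor(panels[1], levels = panels)

ev <- as.data.frame(presidential[-(1:3), ])
ev$panel <- factor(panels[2], levels = panels)

p <- ggplot() +
  geom_line(data = econ, aes(x = date, y = unemploy)) +
  geom_segment(data = ev, aes(x = start, xend = start, y = 0, yend = 1)) +
  geom_text(data = ev, aes(x = start, y = 1, label = name),
            hjust = -0.1, vjust = 0) +
  facet_grid(panel ~ ., scales = "free_y") +
  labs(x = "time", y = NULL)
p
```

Because both layers carry the same faceting column with the same levels, each layer is drawn only in its own panel while the x axis stays shared.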
Plotly is an easy way to make ggplots interactive. To display events, coerce them into factors which can be displayed as an aesthetic, like color.
The end result is a plot that you can drag the cursor over. The plots display data of interest:
Here is the code for making the ggplot:
# load ggplot2 and its example data
library(ggplot2)
data(presidential)
data(economics)
# events of interest
events <- presidential[-(1:3),]
# extract the year from the dates in the economics data frame
economics$year = as.numeric(format(economics$date, format = "%Y"))
# use dplyr to summarise data by year
#install.packages("dplyr")
library(dplyr)
economics_mean <- economics %>%
group_by(year) %>%
summarise(mean_unemployment = mean(unemploy))
# add president terms to summarized data frame as a factor
president <- c(rep(NA,14), rep("Reagan", 8), rep("Bush", 4), rep("Clinton", 8), rep("Bush", 8), rep("Obama", 7))
economics_mean$president <- president
# create ggplot
p <- ggplot(data = economics_mean, aes(x = year, y = mean_unemployment)) +
geom_point(aes(color = president)) +
geom_line(alpha = 1/3)
It only takes one line of code to make the ggplot into a plotly object.
# make it interactive!
#install.packages("plotly")
library(plotly)
ggplotly(p)
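The hand-typed president vector above depends on the exact years in the data. As a hedged alternative, the labels could be derived from the term start dates with findInterval. The sketch below uses a made-up `terms` table with placeholder names rather than the real presidential data, and assumes a term applies from the calendar year in which it starts.

```r
# Made-up terms table; in practice this would come from `presidential`
terms <- data.frame(
  name  = c("A", "B", "C"),
  start = as.Date(c("1981-01-20", "1989-01-20", "1993-01-20")),
  stringsAsFactors = FALSE
)
years <- 1979:1994

# Index of the latest term starting in or before each year (0 = none yet)
start_year <- as.numeric(format(terms$start, "%Y"))
idx <- findInterval(years, start_year)

label <- factor(ifelse(idx == 0, NA, terms$name[pmax(idx, 1)]),
                levels = terms$name)
by_year <- data.frame(year = years, president = label)
by_year
```

Years before the first term come out as NA, matching the NA padding used in the answer.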
Since you are plotting a time series together with qualitative information, note that most economics books use shaded regions of the plot to indicate a structural change or event in the data, so I recommend something like this:
library(ggplot2)
data(presidential)
data(economics)
ggplot() +
geom_rect(aes(xmin = start,
xmax = end,
ymin = 0, ymax = Inf,
fill = name),
data = presidential,
show.legend = FALSE) +
geom_text(aes(x = start+500,
y = 2000,
label = name,
angle = 90),
data = presidential) +
geom_line(aes(x = date, y = unemploy),
data= economics) +
scale_fill_brewer(palette = "Blues") +
labs(x = "time", y = "unemploy")
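One thing to watch with shaded regions is that events starting before the series begins will stretch the x axis. A minimal sketch of clipping the rectangles to the time range of the series is below. The `terms` table and `xlim` values are made up for illustration; with the real data you would use presidential and range(economics$date).

```r
# Made-up terms; the first one lies entirely before the plotted range
terms <- data.frame(
  name  = c("Z", "A", "B", "C"),
  start = as.Date(c("1953-01-20", "1961-01-20", "1969-01-20", "1977-01-20")),
  end   = as.Date(c("1961-01-20", "1969-01-20", "1977-01-20", "1981-01-20")),
  stringsAsFactors = FALSE
)
xlim <- as.Date(c("1967-07-01", "1979-12-01"))  # stand-in for range(economics$date)

# Keep only terms that overlap the range, then trim them to it
keep <- terms$end > xlim[1] & terms$start < xlim[2]
clipped <- terms[keep, ]
clipped$start <- pmax(clipped$start, xlim[1])
clipped$end   <- pmin(clipped$end, xlim[2])
clipped
```

Passing `clipped` instead of the full table as the data for geom_rect should keep the shading inside the extent of the line.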
Related
I am trying to display a cross-section of conductivity in a lagoon environment using isolines. I have applied interp() and stat_contour() to my data, but I would like to clip the interpolated output so that it doesn't extend past my data points. This way the bathymetry of the lagoon in the cross-section is clear. Here is the code I have used so far:
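library(readr)    # read_csv
library(akima)    # interp (assumed; the interp package has a compatible function)
library(ggplot2)
library(ggthemes) # theme_tufte
cond_df <- read_csv("salinity_profile.csv")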
di <- interp(cond_df$stop, cond_df$depth, cond_df$conductivity,
xo = seq(min(cond_df$stop), max(cond_df$stop), length = 200),
yo = seq(min(cond_df$depth), max(cond_df$depth), length = 200))
dat_interp <- data.frame(expand.grid(x=di$x, y=di$y), z=c(di$z))
ggplot(dat_interp) +
aes(x=x, y=y, z=z, fill=z)+
scale_y_reverse() +
geom_tile()+
stat_contour(colour="white", size=0.25) +
scale_fill_viridis_c() +
theme_tufte(base_family="Helvetica")
Here is the output:
interpolated plot
To help clarify, here is the data just as a geom_point() graph, and I do not want the interpolated layer going past the lower points of the graph:
cond_df%>%
ggplot(mapping=aes(x=stop, y=depth, z=conductivity, fill=conductivity)) +
geom_point(aes(colour = conductivity), size = 3) +
scale_y_reverse()
point plot
You can mask the unwanted region of the plot by using geom_ribbon.
You will need to generate a data.frame with values for the max depth at each stop. Here's one somewhat inelegant way to do that:
# Create the empty data frame for all stops
bathymetry <- data.frame(depth = as.numeric(NA),
stop = unique(cond_df$stop))
# Find the max depth for each stop
for(thisStop in bathymetry$stop){
bathymetry[bathymetry$stop==thisStop, "depth"] <- max(cond_df[cond_df$stop==thisStop, "depth"])
}
Then, you can add the geom_ribbon as the last geom of your plot, like so
geom_ribbon(data=bathymetry, aes(x=stop, ymin=depth, ymax=max(cond_df$depth)), inherit.aes = FALSE)
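As a sketch of a more compact route, the per-stop maximum can be computed with aggregate, and the interpolated grid itself can be blanked below the lagoon floor instead of being covered by a ribbon. The data here is made up to keep the example self-contained, and linear interpolation of the floor between stops (via approx) is my assumption about what the bathymetry should look like.

```r
# Made-up profile data with the same column names as the question
cond_df <- data.frame(
  stop  = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  depth = c(0.5, 1, 1.5, 0.5, 1, 0.5, 1, 1.5, 2),
  conductivity = c(30, 31, 33, 29, 30, 28, 30, 32, 35)
)

# Max depth per stop, without the explicit loop
bathymetry <- aggregate(depth ~ stop, data = cond_df, FUN = max)

# Stand-in for the interpolated grid (dat_interp in the question)
grid <- expand.grid(x = seq(1, 3, by = 0.5), y = seq(0.5, 2, by = 0.5))
grid$z <- 1

# Depth of the floor at every grid x, then drop cells below it
floor_at_x <- approx(bathymetry$stop, bathymetry$depth, xout = grid$x)$y
grid$z[grid$y > floor_at_x] <- NA
```

Cells with NA in z can then be filtered out before plotting, so neither the tiles nor the contours extend past the data.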
Say I'm measuring 10 personality traits and I know the population baseline. I would like to create a chart for individual test-takers to show them their individual percentile ranking on each trait. Thus, the numbers go from 1 (percentile) to 99 (percentile). Given that a 50 is perfectly average, I'd like the graph to show bars going to the left or right from 50 as the origin line. In bar graphs in ggplot, it seems that the origin line defaults to 0. Is there a way to change the origin line to be at 50?
Here's some fake data and default graphing:
df <- data.frame(
names = LETTERS[1:10],
factor = round(rnorm(10, mean = 50, sd = 20), 1)
)
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor)) +
geom_bar(stat="identity") +
coord_flip()
Picking up on @nongkrong's comment, here's some code that will do what I think you want while relabeling the ticks to match the original range and relabeling the axis to avoid showing the math:
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(breaks=seq(-50,50,10), labels=seq(0,100,10)) + ylab("Percentile") +
coord_flip()
This post was really helpful for me - thanks @ulfelder and @nongkrong. However, I wanted to re-use the code on different data without having to manually adjust the tick labels to fit the new data. To do this in a way that retained ggplot's tick placement, I defined a tiny function and passed it to the labels argument:
fix.labels <- function(x){
x + 50
}
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = fix.labels) + ylab("Percentile") +
coord_flip()
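If the origin is not always 50, the same idea can be wrapped in a small function factory. This is just a sketch; `shift_labels` is a made-up helper name, not part of ggplot2.

```r
# Returns a labelling function that adds `origin` back onto the tick values
shift_labels <- function(origin) {
  function(x) x + origin
}

to_percentile <- shift_labels(50)
to_percentile(c(-50, -25, 0, 25, 50))
# 0 25 50 75 100
```

It would be used as scale_y_continuous(labels = shift_labels(50)) together with y = factor - 50, so the subtraction and the relabeling stay in sync through a single number.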
From meteorological data (hourly values of temperature, wind and humidity) I managed to plot time series of wind speed and direction. Now I would like to add wind vectors on the time series plot. Here you can see the output I would like (arrows drawn over the actual plot).
I am using ggplot2 and have been looking through Stack Overflow and the ggplot2 package docs (and will continue), but have found no solution. Any idea or pointer on where to look would be a good starting point.
Thanks in advance
EDIT QUESTION
As suggested in a comment by @slowlearner, I am adding code and data here to make a reproducible example. It looks like geom_segment could do the trick. I managed to calculate yend for geom_segment but can't figure out how to compute xend, since the x axis is a time axis. I have wind speed and direction data, so I can calculate the x and y wind components for geom_segment, but the x component needs to be converted to a time offset.
Here is the code used for the plot (temperature) and data
for (i in 1:2 ) {
rams=subset(data,data$stat_id %in% i)
tore=subset(torre,torre$stat_id %in% i)
# Plot with ggplot
# Time-series plot for the stations in the area
gtitol=places$nom[places$stat_id == i]
myplot=ggplot(data=rams,aes(x=datetime, y=tempc, colour="RAMS")) +
geom_line() + ylab("Temperatura (ºC)") + xlab(" ") +
ggtitle(gtitol) + theme(legend.title=element_blank()) + theme(legend.position="bottom") +
scale_y_continuous(limits = c(0,40)) +
geom_line(data=tore,aes(x=datetime,y=temp,colour = "Torre"))
#scale_y_continuous(breaks = round(seq(min(data$tempc), max(data$tempc), by = 2),0))
ggsave(myplot,filename=paste("RAMS-",i,".png",sep=""),width=7.78,height=5.79)
}
Following geom_segment docs I tried
p=ggplot(tore, aes(x =datetime, y = 5))
p + geom_segment(aes(xend = datetime, yend = 5 + v), arrow = arrow(length = unit(0.1,"cm")))
This produces the following kind of figure. You can see the x coordinate is datetime; how should I convert the wind component to get xend? For now xend equals datetime, but I would like something like "datetime + xwindcomp".
Thanks for your help
Just as a preamble, please make sure you include all code and relevant data in future questions. If you look at your question above, you will see that some objects such as torre are not defined. That means we can't copy and paste into our R setups. Also the data to which you linked could not be used with the code in the question as it was a limited subset. My advice: (a) create fake data that looks like the data you are using (b) keep your code to the absolute minimum (c) test and double-check code and data in a new R session before you post.
As far as I can tell you want something like the below. Of course you will have to adapt it for your own purposes, but it should give you some ideas on how to tackle your problem. Notice most of the cosmetic properties such as line colours, thicknesses, legends and titles have been omitted from the plot: they are not important for the purposes of this question. EDIT Another approach might be to use the same data frame for the wind data and then use a faceting variable to show the speed in a different but linked plot.
require(ggplot2)
require(scales)
require(gridExtra)
require(lubridate)
set.seed(1234)
# create fake data for temperature
mydf <- data.frame(datetime = ISOdatetime(2013,08,04,0,0,0) +
(0:50)*10*60,
temp = runif(51, 15, 25))
# take a subset of temperature data,
# basically sampling every 60 minutes
wind <- mydf[minute(mydf$datetime) == 0, ]
# then create fake wind velocity data
wind$velocity <- runif(nrow(wind), -5, 20)
# define an end point for geom_segment
wind$x.end <- wind$datetime + minutes(60)
ggplot(data = mydf, aes(x = datetime, y = temp, group = 1)) +
geom_line() +
geom_segment(data = wind,
size = 3,
aes(x = datetime,
xend = x.end,
y = 10,
yend = velocity),
arrow = arrow(length = unit(0.5, "cm"))) +
theme()
This generates the following plot:
Compute the direction of the wind using decimal degrees. Assuming that you want 0 degrees to be North (up), use the following:
ggplot(data = wind, aes(x=datetime, y=temp)) +
geom_text(aes(angle=-wind_dir_degrees+90), label="→")
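If you would rather draw real segments than rotated glyphs, the speed and direction can be turned into x and y components, with the x component scaled into seconds so it can be added to a POSIXct value. The sketch below uses made-up data and column names. It assumes the direction is the way the arrow should point, measured clockwise from north (meteorological "from" directions would need 180 degrees added), and `secs_per_unit` is an arbitrary scale chosen by eye.

```r
wind <- data.frame(
  datetime = as.POSIXct("2013-08-04 00:00:00", tz = "UTC") + 3600 * 0:3,
  speed    = c(2, 4, 3, 5),        # made-up speeds
  dir_deg  = c(0, 90, 180, 270)    # arrow heading, clockwise from north
)

theta <- wind$dir_deg * pi / 180
u <- wind$speed * sin(theta)   # eastward (x) component
v <- wind$speed * cos(theta)   # northward (y) component

# Hypothetical scale: one unit of speed spans 10 minutes on the time axis
secs_per_unit <- 600
wind$xend <- wind$datetime + u * secs_per_unit
wind$yend <- 5 + v             # arrows start at y = 5, as in the question
```

These columns can then go straight into geom_segment(aes(x = datetime, xend = xend, y = 5, yend = yend)) with an arrow() specification.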
I have a dataset that contains observations for every second of four consecutive days (roughly 340'000 data points). This is too much to display in a scatter plot. I would like to plot only a uniform sample of, say, 2000 time points.
Is it possible to achieve this with ggplot2's "grammar of graphics" approach? I haven't found any built-in "sampling" modifier, but perhaps it's easy enough to write one?
library(ggplot2)
x <- 1:100000
d <- data.frame(x=x, y=rnorm(length(x)))
ggplot(d[sample(x, 2000), ], aes(x=x, y=y)) + geom_point()
This is how it can be "hacked" by modifying the data passed to ggplot. But I don't want to modify the data, just filter it to include only a sample.
ggplot(d, aes(x=x, y=y)) + ??? + geom_point()
EDIT: I'm specifically looking for sampling, not smoothing or binning. The data I have shows the time it takes to simulate one second of a specific process. The simulation has been parallelized, and for each simulated seconds I have the run times for each of the cores involved (8 in total). I want to show sub-optimal load balancing by plotting just the raw data points. The reason for the sampling is just that 300'000 data points are way too much for a scatter plot: Plotting takes too long and the visualization is no good.
You can subset within the geom_point call using the data argument:
... + geom_point(data=d[sample(x,2000),])
This way, you are free to add other geoms using all the data, eg, using the example data:
ggplot(d, aes(x=x, y=y)) + geom_hex() + geom_point(data=d[sample(x,2000),])
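To get closer to the "??? + geom_point()" form in the question, note that a layer's data argument can also be a function, which ggplot2 calls with the plot's data. A minimal sketch, where `sample_rows` is a made-up helper rather than anything built into ggplot2:

```r
library(ggplot2)

# Returns a function that draws at most n random rows from a data frame
sample_rows <- function(n) {
  function(df) df[sample(nrow(df), min(n, nrow(df))), , drop = FALSE]
}

set.seed(42)
d <- data.frame(x = 1:100000, y = rnorm(100000))

p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(data = sample_rows(2000))
p
```

The full data stays attached to the plot, so other layers (such as geom_hex) still see every row, while the point layer only draws the sample.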
If you want to create a scatter plot for big data, here are a couple of ggplot2 options.
They come from this course by Hadley.
# upload all images to imgur.com
opts_chunk$set(fig.width = 5, fig.height = 5, dev = "png")
render_markdown(strict = T)
# some autocorrelated data
set.seed(1)
x <- 1:1e+05
d <- data.frame(x = x)
d$y <- arima.sim(list(order = c(1, 1, 0), ar = 0.9), n = 1e+05 - 1)
# the basic plot
base_plot <- ggplot(d, aes(x = x, y = y))
geom_bin2d
you can set the binwidth for the x and y variables
base_plot + geom_bin2d(binwidth = c(200, 5))
geom_hex
you can set the number of bins
base_plot + geom_hex(bins = 200)
small points
Stops overplotting
base_plot + geom_point(shape = ".")
use a smoother
This relies on having a smoothing method that will get you the detail you want without crashing or taking too long. In this case the number of knots was chosen by trial and error (and perhaps you will want more detail)
library(mgcv)
base_plot + stat_smooth(method = "gam", formula = y ~ s(x, k = 50))
I am using ggplot to generate a chart that summarises a race made up of several laps. There are 24 participants in the race, numbered 1-12 and 14-25; I am plotting a summary measure for each participant using ggplot, but ggplot assumes I want the number range 1-25, rather than the categories 1-12, 14-25.
What's the fix for this? Here's the code I am using (the data is sourced from a Google spreadsheet).
sskey='0AmbQbL4Lrd61dHlibmxYa2JyT05Na2pGVUxLWVJYRWc'
library("ggplot2")
require(RCurl)
gsqAPI = function(key,query,gid){ return( read.csv( paste( sep="", 'http://spreadsheets.google.com/tq?', 'tqx=out:csv', '&tq=', curlEscape(query), '&key=', key, '&gid=', curlEscape(gid) ) ) ) }
sin2011racestatsX=gsqAPI(sskey,'select A,B,G',gid='13')
sin2011proximity=gsqAPI(sskey,'select A,B,C',gid='12')
h=sin2011proximity
k=sin2011racestatsX
l=subset(h,lap==1)
ggplot() +
geom_step(aes(x=h$car, y=h$pos, group=h$car)) +
scale_x_discrete(limits =c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL','','SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))+
xlab(NULL) + opts(title="F1 2011 Korea \nRace Summary Chart", axis.text.x=theme_text(angle=-90, hjust=0)) +
geom_point(aes(x=l$car, y=l$pos, pch=3, size=2)) +
geom_point(aes(x=k$driverNum, y=k$classification,size=2), label='Final') +
geom_point(aes(x=k$driverNum, y=k$grid, col='red')) +
ylab("Position")+
scale_y_discrete(breaks=1:24,limits=1:24)+ opts(legend.position = "none")
Expanding on my cryptic comment, try this:
#Convert these to factors with the appropriate labels
# Note that I removed the ''
h$car <- factor(h$car,labels = c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL',
'SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))
k$driverNum <- factor(k$driverNum,labels = c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','SEN','PET','BAR','MAL',
'SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB'))
l=subset(h,lap==1)
ggplot() +
geom_step(aes(x=h$car, y=h$pos, group=h$car)) +
geom_point(aes(x=l$car, y=l$pos, pch=3, size=2)) +
geom_point(aes(x=k$driverNum, y=k$classification,size=2), label='Final') +
geom_point(aes(x=k$driverNum, y=k$grid, col='red')) +
ylab("Position") +
scale_y_discrete(breaks=1:24,limits=1:24) + opts(legend.position = "none") +
opts(title="F1 2011 Korea \nRace Summary Chart", axis.text.x=theme_text(angle=-90, hjust=0)) + xlab(NULL)
Calling scale_x_discrete is no longer necessary. And stylistically, I prefer putting opts and xlab stuff at the end.
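One caveat worth illustrating: factor(x, labels = ...) pairs the labels with the sorted unique values of x, so it only lines up if every car number actually appears in the data. A hedged sketch that spells out the levels explicitly, using a shortened made-up vector of car numbers:

```r
# Made-up subset of car numbers; note there is no car 13
car <- c(1, 2, 12, 14, 25, 14)

car_f <- factor(car,
                levels = c(1, 2, 12, 14, 25),
                labels = c("VET", "WEB", "MAL", "SUT", "AMB"))
as.character(car_f)
# "VET" "WEB" "MAL" "SUT" "AMB" "SUT"
```

With explicit levels the mapping from number to driver code does not depend on which cars happen to be present, and ggplot treats the axis as discrete with no gap for the missing number.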
Edit
A few notes in response to your comment. Many of your difficulties can be eased by a more streamlined use of ggplot. Your data is in an awkward format:
library(plyr)     # ddply
library(reshape2) # melt
#Summarise so we can use geom_linerange rather than geom_step
d1 <- ddply(h,.(car),summarise,ymin = min(pos),ymax = max(pos))
#R has a special value for missing data; use it!
k$classification[k$classification == 'null'] <- NA
k$classification <- as.integer(k$classification)
#The other two data sets should be merged and converted to long format
d2 <- merge(l,k,by.x = "car",by.y = "driverNum")
colnames(d2)[3:5] <- c('End of Lap 1','Final Position','Grid Position')
d2 <- melt(d2,id.vars = 1:2)
#Now the plotting call is much shorter
ggplot() +
geom_linerange(data = d1,aes(x= car, ymin = ymin,ymax = ymax)) +
geom_point(data = d2,aes(x= car, y= value,shape = variable),size = 2) +
opts(title="F1 2011 Korea \nRace Summary Chart", axis.text.x=theme_text(angle=-90, hjust=0)) +
labs(x = NULL, y = "Position", shape = "")
A few notes. You were setting aesthetics to fixed values (size = 2) which should be done outside of aes(). aes() is for mapping variables (i.e. columns) to aesthetics (color, shape, size, etc.). This allows ggplot to intelligently create the legend for you.
Merging the second two data sets and then melting it creates a grouping variable for ggplot to use in the legend. I used the shape aesthetic since a few values overlap; using color may make that hard to spot. In general, ggplot will resist mixing aesthetics into a single legend. If you want to use shape, color and size you'll get three legends.
I prefer setting labels using labs, since you can do them all in one spot. Note that setting the aesthetic label to "" removes the legend title.
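To make the wide-to-long step concrete, here is a small sketch of the same reshaping on toy data using base R's reshape instead of melt. The column names `lap1`, `final` and `grid` and the values are made up; only the three legend labels come from the answer above.

```r
# Toy wide data: one row per car, one column per kind of position
wide <- data.frame(
  car   = c("VET", "WEB"),
  lap1  = c(1, 3),
  final = c(1, 2),
  grid  = c(2, 1),
  stringsAsFactors = FALSE
)

# One row per car and kind of position; `variable` becomes the legend key
long <- reshape(wide, direction = "long",
                varying = c("lap1", "final", "grid"),
                v.names = "value",
                timevar = "variable",
                times   = c("End of Lap 1", "Final Position", "Grid Position"),
                idvar   = "car")
long
```

In the long form a single geom_point(aes(x = car, y = value, shape = variable)) layer draws all three sets of points and builds the legend, which is what lets the plotting call shrink.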