How do I set the series labels in a multiline ggplot2 series? - r

I'm currently working on automating some basic experiential analysis using R. At the moment, my script is set up as follows, which generates the plot shown below.
data <- list()
for (experiment in experiments) {
path = paste('../out/', experiment, '/', plot, '.csv', sep="")
data[[experiment]] <- read.csv(path, header=F)
}
df <- data.frame(Year=1:40,
'current'=colMeans(data[['current']]),
'vip'=colMeans(data[['vip']]),
'vipbonus'=colMeans(data[['vipbonus']]))
df <- melt(df, id.vars = 'Year', variable.name = 'Series')
plotted <- ggplot(df, aes(Year, value)) +
geom_line(aes(colour = Series)) +
labs(y = ylabel, title = title)
file = paste(plot, '.png', sep="")
ggsave(filename = file, plot = plotted)
While this is close to what we want the final product to look like, the series labels need to be updated. Ideally we want them to be something like "VIP, no bonus", "VIP, with bonus" and so forth, but obviously using labels like that as column names in the data frame is not valid R (and invalid characters are automatically replaced with . even with backticks). Since these experiments are a work in progress, we also know that we are going to need more series labels in the future, so we don't want to lose the ability of ggplot to automatically set the colors for us.
How can I set the series labels to be appropriate for humans?

The OP explained that he is currently working on automating some basic experiential analysis, part of which is the relabeling of the series. The OP also showed some code which is used to prepare the data to be plotted.
Based on the additional information supplied in comments, I believe the overall processing could be streamlined, which will address the series labeling issue as well.
Some preparations
# used for creating file paths
experiments <- c("current", "vip", "vipbonus")
# used for labeling the series
exp_labels <- c("Current", "VIP, no bonus", "VIP, with bonus")
plot <- "dataset1" # e.g.
paths <- paste0(file.path("../out", experiments, plot), ".csv")
paths
#[1] "../out/current/dataset1.csv" "../out/vip/dataset1.csv" "../out/vipbonus/dataset1.csv"
Read data
library(data.table) #version 1.10.4 used here
# read all files into one large data.table
# add running count in column "Series" to identify the source of each row
DT <- rbindlist(lapply(paths, fread, header = FALSE), idcol = "Series")
# rename file chunks = Series, use predefined labels
DT[, Series := factor(Series, labels = exp_labels)]
Reshape and aggregate by groups
# reshape from wide to long
molten <- melt(DT, id.vars = "Series")
# compute means by Series and Year = variable
aggregated <- molten[, .(value = mean(value)), by = .(Series, variable)]
# take factor level number of "variable" as Year
aggregated[, Year := as.integer(variable)]
Note that aggregation is done in long format (after melt()) to save typing the same command for each column.
Create chart & save to disk
library(ggplot2)
ggplot(aggregated, aes(Year, value)) +
geom_line(aes(colour = Series)) +
labs(y = "ylabel", title = "title")
file = paste(plot, '.png', sep="")
ggsave(filename = file) # by default, the last plot is saved
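If you would rather keep the short codes in the data and only change what the legend shows, one option is to pass a named vector to the labels argument of the colour scale. This is a minimal sketch, not the OP's pipeline: the data frame toy and its values are made up, and legend_labels is a name I chose for illustration.

```r
library(ggplot2)

# Hypothetical long-format data; the values are made up for illustration
toy <- data.frame(
  Year   = rep(1:3, times = 3),
  Series = rep(c("current", "vip", "vipbonus"), each = 3),
  value  = c(1, 2, 3, 2, 3, 4, 3, 4, 5)
)

# Names are the codes stored in the data, values are the legend text
legend_labels <- c(
  current  = "Current",
  vip      = "VIP, no bonus",
  vipbonus = "VIP, with bonus"
)

p <- ggplot(toy, aes(Year, value, colour = Series)) +
  geom_line() +
  # colours are still assigned automatically; only the legend text changes
  scale_colour_discrete(labels = legend_labels)
p
```

Because the labels are matched by name, a series that is added later without an entry in legend_labels should simply show its raw code in the legend.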

While this may not be an ideal approach, what worked for us was to update the relevant series labels after the melt command was performed:
df$Series <- as.character(df$Series)
df$Series[df$Series == "current"] <- "Current"
df$Series[df$Series == "vip"] <- "VIP, no bonus"
df$Series[df$Series == "vipbonus"] <- "VIP, with bonus"
Which results in plots like the following:

You can try this
library(tidyverse)
df <- df %>% dplyr::mutate(Series = as.character(Series),
Series = fct_recode(Series,
"Current" = "current",
"VIP, no bonus" = "vip",
"VIP, with bonus" = "vipbonus"))
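
The same recoding can also be done in base R with a lookup vector and factor(), which keeps all label definitions in one place. A small sketch, assuming a made-up data frame in place of the melted one (df and series_labels here are illustrative):

```r
# Hypothetical long-format data standing in for the melted data frame
df <- data.frame(
  Year   = rep(1:3, times = 3),
  Series = rep(c("current", "vip", "vipbonus"), each = 3),
  value  = c(1, 2, 3, 2, 3, 4, 3, 4, 5)
)

# Lookup vector: names are the codes in the data, values are display labels
series_labels <- c(
  current  = "Current",
  vip      = "VIP, no bonus",
  vipbonus = "VIP, with bonus"
)

# Recode in one step; the legend order follows the order of the lookup vector
df$Series <- factor(df$Series,
                    levels = names(series_labels),
                    labels = unname(series_labels))

levels(df$Series)
```

Adding a new experiment then only means adding one entry to series_labels. Note that any code missing from the lookup vector becomes NA, so it is worth checking anyNA(df$Series) afterwards.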

Related

Renaming data in R

I am fairly new to R and am trying to make some figures, but having trouble with renaming data. Basically, I had a super large data set from SPSS that I imported into R and created a smaller data table with one variable I am trying to look at. I was successful in getting my data into the long format, but my Time column is not represented the way I'd like.
When I got my data into the long format, I made a data Time column and the data in that column says TIME1COMPOSITE, TIME2COMPOSITE, TIME3COMPOSITE - which are the original column names from the SPSS file. I would prefer for it to instead read Time1, Time2, or Time3 (so that it can look better on the axis label for the graph I am making). Is there a simple way to do this? Either to rename the data points or to just rename the labels on the graph?
Here is an example of what my code looks like:
dt<- data.table(dt)
#Putting into long format
dt <- melt(dt, measure.vars = c("TIME1COMPOSITE", "TIME2COMPOSITE", "TIME3COMPOSITE"), variable.name = "Time", value.name = "CompositeScore")
#Computing means
dt[, meanCompositeScore:= mean(CompositeScore), by=c("Condition", "Time")]
#Plotting
plot <- ggplot(dt, aes(x=Time, y=meanCompositeScore, color=Condition)) + geom_point()
plot
The easiest method with the code you have would be to change the column names at the beginning (before melting) using the colnames() function.
colnames(dt) <- c("colname1","colname2", ...)
Another method using the tidy format would be to use the rename() function (from dplyr).
dt %>%
rename(Time1 = TIME1COMPOSITE, Time2 = TIME2COMPOSITE, Time3 = TIME3COMPOSITE)
To change the names once the calculations have occurred, you could convert Time to a factor and relabel its levels. We can use as.factor() for the conversion and revalue() from plyr for the relabeling; revalue() expects a named vector of the form c(old = new), and its result has to be assigned back.
dt$Time <- as.factor(dt$Time)
dt$Time <- plyr::revalue(dt$Time, c("TIME1COMPOSITE" = "Time1", "TIME2COMPOSITE" = "Time2", "TIME3COMPOSITE" = "Time3"))
To set the labels in the plotting call itself, we can convert Time with factor(), which (unlike as.factor()) accepts levels and labels arguments.
time_levels <- c("TIME1COMPOSITE", "TIME2COMPOSITE", "TIME3COMPOSITE")
time_labels <- c("Time1", "Time2", "Time3")
plot <- ggplot(dt, aes(x=factor(Time, levels = time_levels, labels = time_labels), y=meanCompositeScore, color=Condition)) + geom_point()
A final method would be to relabel the graph labels rather than the values using the scale_x_discrete() ggplot function.
plot <- ggplot(dt, aes(x=Time, y=meanCompositeScore, color=Condition)) +
geom_point() +
scale_x_discrete(labels = c('Time1','Time2','Time3'))
Let me know if any method doesn't work for you and I will attempt to clarify the method or rectify the mistake.
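If there are many time points, typing every label by hand gets tedious. The labels can instead be derived from the original names with a regular expression. A sketch, assuming all names follow the TIME<number>COMPOSITE pattern from the question (time_codes, time_labels and label_map are names I made up):

```r
# Original level names as they come out of melt()
time_codes <- c("TIME1COMPOSITE", "TIME2COMPOSITE", "TIME3COMPOSITE")

# Keep the digits, replace the rest of the name
time_labels <- sub("^TIME([0-9]+)COMPOSITE$", "Time\\1", time_codes)
time_labels

# Named vector (old name -> new label); this form can be passed to
# scale_x_discrete(labels = label_map) or used to recode the column
label_map <- setNames(time_labels, time_codes)
label_map
```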

Multiple time series plots for several stocks in R returns an error

I have time series data for multiple stocks. I would like to plot them in one plot in R.
I tried an existing answer in this website but I got an error. Here is my code:
library(quantmod)
library(TSclust)
library(ggplot2)
# download financial data
symbols = c('ASX', 'AZN', 'BP', 'AAPL')
start = as.Date("2014-01-01")
until = as.Date("2014-12-31")
stocks = lapply(symbols, function(symbol) {
adjust = getSymbols(symbol,src='yahoo', from = start, to = until, auto.assign = FALSE)[, 6]
names(adjust) = symbol
adjust
})
I tried the following from an existing answer (from here)
qplot(symbols, value, data = as.data.frame(stocks), geom = "line", group = variable) +
facet_grid(variable ~ ., scale = "free_y")
I got the following error:
Error: At least one layer must contain all faceting variables: variable.
Plot is missing variable
Layer 1 is missing variable
I would like to have something similar to the following plot:
While Len Greski's answer has a great explanation and solution, I thought I'd provide an answer with a more 'standard' approach. Maybe some users will find it simpler.
library(quantmod)
library(ggplot2)
symbols <- c("ASX", "AZN", "BP", "AAPL")
start <- as.Date("2014-01-01")
until <- as.Date("2014-12-31")
# import data into an environment
e <- new.env()
getSymbols(symbols, src = "yahoo", from = start, to = until, env = e)
# extract the adjusted close and merge into one xts object
stocks <- do.call(merge, lapply(e, Ad))
# Remove the ".Adjusted" suffix from each symbol column name
colnames(stocks) <- gsub(".Adjusted", "", colnames(stocks), fixed = TRUE)
# convert the xts object to a long data frame
stocks_df <- fortify(stocks, melt = TRUE)
# plot the data
qplot(Index, Value, data = stocks_df, geom = "line", group = Series) +
facet_grid(Series ~ ., scale = "free_y")
The error messages in the original code are caused by the fact that there is no column called variable in the data that is passed to qplot(). Additionally, in order to produce the desired chart, we need to extract the dates from the xts objects generated by quantmod so we can use them as the x axis variable in the chart.
With some adjustments to place the appropriate variables from the stock data into the qplot() specification we can produce the required chart.
We modify the code to read the list of stocks as follows:
Convert the xts objects to objects of type data.frame
Rename columns to eliminate ticker symbols so we can rbind() into a single data frame in a subsequent step
Extract the rownames() into a data frame column
Having made these changes, the stocks object contains a list of data frames, one per stock ticker.
symbols = c('ASX', 'AZN', 'BP', 'AAPL')
start = as.Date("2014-01-01")
until = as.Date("2014-12-31")
stocks = lapply(symbols, function(symbol) {
aStock = as.data.frame(getSymbols(symbol,src='yahoo', from = start, to = until,
auto.assign = FALSE))
colnames(aStock) <- c("Open","High","Low","Close","Volume","Adjusted")
aStock$Symbol <- symbol
aStock$Date <- rownames(aStock)
aStock
})
Next, we use do.call() with rbind() to combine the data into a single data frame that we'll use with qplot().
stocksDf <- do.call(rbind,stocks)
Finally, we use qplot() with Date and Close as the x and y variables, and facet_grid() with Symbol to generate the facets.
qplot(Date, Close, data = stocksDf, geom = "line", group = Symbol) +
facet_grid(Symbol ~ ., scale = "free_y")
...and the initial output:
Having generated the chart, we'll make some adjustments to clean up the x axis labels. On the default chart they are unintelligible because there are 251 different character values, and we need to rescale the axis to print fewer labels.
First, we convert the character-based dates with as.Date(). Second, we use the ggeasy package to adjust the content on the x axis.
stocks = lapply(symbols, function(symbol) {
aStock = as.data.frame(getSymbols(symbol,src='yahoo', from = start, to = until,
auto.assign = FALSE))
colnames(aStock) <- c("Open","High","Low","Close","Volume","Adjusted")
aStock$Symbol <- symbol
aStock$Date <- as.Date(rownames(aStock),"%Y-%m-%d")
aStock
})
stocksDf <- do.call(rbind,stocks)
library(ggeasy)
qplot(Date, Close, data = stocksDf, geom = "line", group = Symbol) +
facet_grid(Symbol ~ ., scale = "free_y") +
scale_x_date(date_breaks = "14 days") +
easy_rotate_x_labels(angle = 45, side = "right")
...and the revised output:
NOTE: to chart the Adjusted Closing price, simply change the y variable in the qplot() function to Adjusted.
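The list-to-long-data-frame step can be tried without a network connection. The sketch below uses made-up tickers and prices (AAA, BBB and all values are hypothetical) and plots with ggplot() rather than qplot(), since qplot() has been deprecated as of ggplot2 3.4.0.

```r
library(ggplot2)

# Made-up prices for two hypothetical tickers
dates  <- as.Date("2014-01-01") + 0:2
prices <- list(AAA = c(10, 11, 12), BBB = c(20, 19, 21))

# One data frame per ticker, each with the same column names
stocks <- lapply(names(prices), function(symbol) {
  data.frame(Date = dates, Close = prices[[symbol]], Symbol = symbol)
})

# Stack the per-ticker data frames into one long data frame
stocks_df <- do.call(rbind, stocks)

# One panel per ticker, each with its own y axis
p <- ggplot(stocks_df, aes(Date, Close, group = Symbol)) +
  geom_line() +
  facet_grid(Symbol ~ ., scales = "free_y")
p
```

The faceting variable (Symbol here) has to exist as a column in the data passed to the plot, which is exactly what was missing in the original call.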

Dynamically Set X limits on time plot

I am wondering how to dynamically set the x axis limits of a time series plot containing two time series with different dates. I have developed the following code to provide a reproducible example of my problem.
#Dummy Data
Data1 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1997","9/13/1998"), Area_2D = c(20,11,5,25,50))
Data2 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
Data3 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1996","9/13/1998"), Area_2D = c(20,25,28,30,35))
Data4 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
#Convert date column as date
Data1$Date <- as.Date(Data1$Date,"%m/%d/%Y")
Data2$Date <- as.Date(Data2$Date,"%m/%d/%Y")
Data3$Date <- as.Date(Data3$Date,"%m/%d/%Y")
Data4$Date <- as.Date(Data4$Date,"%m/%d/%Y")
#PLOT THE DATA
max_y1 <- max(Data1$Area_2D)
# Define colors to be used for cars, trucks, suvs
plot_colors <- c("blue","red")
plot(Data1$Date,Data1$Area_2D, col=plot_colors[1],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
par(new=T)
plot(Data2$Date,Data2$Area_2D, col=plot_colors[2],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
The main problem I see with the code above is there are two different x axis on the plot, one for Data1 and another for Data2. I want to have a single x axis spanning the date range determined by the dates in Data1 and Data2.
My question is:
How do I dynamically create an x axis for both series? (i.e. select the minimum and maximum date from the data frames 'Data1' and 'Data2')
The solution is to combine the data into one data.frame, and base the x-axis on that. This approach works very well with the ggplot2 plotting package. First we merge the data and add an ID column, which specifies to which dataset it belongs. I use letters here:
Data1$ID = 'A'
Data2$ID = 'B'
merged_data = rbind(Data1, Data2)
And then create the plot using ggplot2, where the color denotes which dataset it belongs to (can easily be changed to different colors):
library(ggplot2)
ggplot(merged_data, aes(x = Date, y = Area_2D, color = ID)) +
geom_point() + geom_line()
Note that you get one uniform x-axis here. In this case this is fine, but if the timeseries do not overlap, this might be problematic. In that case we can use multiple sub-plots, known as facets in ggplot2:
ggplot(merged_data, aes(x = Date, y = Area_2D)) +
geom_point() + geom_line() + facet_wrap(~ ID, scales = 'free_x')
Now each facet has its own x-axis, i.e. one for each sub-dataset. Which approach is most appropriate depends on the specific situation.
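If you want to stay with base graphics, the shared limits can be computed with range() over the combined dates and passed to xlim. A sketch using Data1 and Data2 from the question (x_limits and y_limits are names I introduced; the second series is added with lines() instead of par(new = TRUE), so only one set of axes is drawn):

```r
Data1 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1997","9/13/1998"),
                    Area_2D = c(20,11,5,25,50))
Data2 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"),
                    Area_2D = c(13,15,18,25,19))
Data1$Date <- as.Date(Data1$Date, "%m/%d/%Y")
Data2$Date <- as.Date(Data2$Date, "%m/%d/%Y")

# Limits computed from both data frames, so they adapt to the data
x_limits <- range(c(Data1$Date, Data2$Date))
y_limits <- c(0, max(Data1$Area_2D, Data2$Area_2D))

# First series sets up the axes, second series is drawn on top of them
plot(Data1$Date, Data1$Area_2D, col = "blue", pch = 16, type = "o",
     xlim = x_limits, ylim = y_limits, xlab = "Date", ylab = "Area")
lines(Data2$Date, Data2$Area_2D, col = "red", pch = 16, type = "o")
```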

Drawing a multiline graph with ggplot2 from a zoo object

Hi all.
I read several previous messages on Stack Overflow and went through the documentation of zoo and ggplot2, but didn't find a suitable answer.
Say I have a zoo object called 'data'. The original data in the flat file are as follows:
Date,Quote1,Quote2,Quote3,Quote4,Quote5
18/07/2008,42.36,44.53,28.4302,44.3,42
21/07/2008,43.14,44.87,28.6186,44.83,43.27
22/07/2008,43.26,44.85,28.6056,44.86,42.84
23/07/2008,44.74,45.61,29.7558,45.69,#N/A
24/07/2008,43.99,45.14,29.2944,45.19,#N/A
25/07/2008,43.18,45.33,29.4569,45.46,43.65
28/07/2008,43.45,44.72,28.5016,44.89,43.31
29/07/2008,43.49,44.8,28.1247,44.88,42.85
30/07/2008,44.55,45.54,28.0727,45.58,43.67
31/07/2008,43.36,45.5,27.9818,45.63,43.91
01/08/2008,43.34,44.75,28.0792,44.69,43.04
Now, I want to plot the time series of these five financial products on a single line graph so as to compare their evolution.
I wish to use ggplot2.
Would anyone be kind enough to give me some hints?
If data is your zoo object then try this (and see ?autoplot.zoo for more info):
p <- autoplot(data, facet = NULL)
p
or perhaps this since I don't think the automatic varying of linetype looks so good with this many series in the same panel:
p + aes(linetype = NULL)
Here is one way to do it:
df <- read.csv(text = "Date,Quote1,Quote2,Quote3,Quote4,Quote5
18/07/2008,42.36,44.53,28.4302,44.3,42
21/07/2008,43.14,44.87,28.6186,44.83,43.27
22/07/2008,43.26,44.85,28.6056,44.86,42.84
23/07/2008,44.74,45.61,29.7558,45.69,#N/A
24/07/2008,43.99,45.14,29.2944,45.19,#N/A
25/07/2008,43.18,45.33,29.4569,45.46,43.65
28/07/2008,43.45,44.72,28.5016,44.89,43.31
29/07/2008,43.49,44.8,28.1247,44.88,42.85
30/07/2008,44.55,45.54,28.0727,45.58,43.67
31/07/2008,43.36,45.5,27.9818,45.63,43.91
01/08/2008,43.34,44.75,28.0792,44.69,43.04", na.strings = "#N/A")
# as.Date() gives a plain Date column, which works well as a zoo index and on a ggplot2 axis
df$Date <- as.Date(df$Date, format = "%d/%m/%Y")
Create a zoo object:
library(zoo)
dat <- zoo(df[-1], df$Date)
Transform the object to a data frame for ggplot2:
df_new <- data.frame(value = as.vector(dat),
time = time(dat),
quote = rep(names(dat), each = nrow(dat)))
Plot:
library(ggplot2)
ggplot(df_new, aes(y = value, x = time, colour = quote)) + geom_line()
Here's another slightly different method, using melt from reshape
# Read your data and format date (as proposed by Sven)
df <- read.csv(text = "Date,Quote1,Quote2,Quote3,Quote4,Quote5
18/07/2008,42.36,44.53,28.4302,44.3,42
21/07/2008,43.14,44.87,28.6186,44.83,43.27
22/07/2008,43.26,44.85,28.6056,44.86,42.84
23/07/2008,44.74,45.61,29.7558,45.69,#N/A
24/07/2008,43.99,45.14,29.2944,45.19,#N/A
25/07/2008,43.18,45.33,29.4569,45.46,43.65
28/07/2008,43.45,44.72,28.5016,44.89,43.31
29/07/2008,43.49,44.8,28.1247,44.88,42.85
30/07/2008,44.55,45.54,28.0727,45.58,43.67
31/07/2008,43.36,45.5,27.9818,45.63,43.91
01/08/2008,43.34,44.75,28.0792,44.69,43.04", na.strings = "#N/A")
df$Date <- as.Date(df$Date, format = "%d/%m/%Y")
library(reshape)
library(ggplot2)
# reshape your data with melt, keeping Date as the identifier column
melted <- melt(df, id.vars = "Date")
# plot with ggplot; each quote becomes one coloured line
ggplot(melted, aes(y = value, x = Date, color = variable)) + geom_line()
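For completeness, the wide-to-long step can also be done with base R only, using stack(). This is a sketch on a three-row, two-column subset of the data above; the names wide and long are mine.

```r
# Subset of the original data: three dates, two quotes, one missing value
wide <- read.csv(text = "Date,Quote1,Quote5
22/07/2008,43.26,42.84
23/07/2008,44.74,#N/A
25/07/2008,43.18,43.65", na.strings = "#N/A")
wide$Date <- as.Date(wide$Date, format = "%d/%m/%Y")

# stack() puts all quote columns under each other (columns: values, ind)
stacked <- stack(wide[-1])

# Repeat the dates once per quote column so they line up with the stacked values
long <- data.frame(
  Date  = rep(wide$Date, times = ncol(wide) - 1),
  quote = stacked$ind,
  value = stacked$values
)
long
```

The resulting long data frame has one row per date and quote, which is the shape ggplot2 expects when colour is mapped to the quote column.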

Setting up a CSV file for R to display histograms

Greetings,
Basically, I have two vectors of data (let's call it experimental and baseline). I want to use the lattice library and histogram functions of R to plot the two histograms side-by-side, just as seen at the end of this page.
I have my data in a CSV file like this:
Label1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label2,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label3,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label4,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label5,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label6,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Each row should be a new pair of histograms. Columns 1-9 represents the data for the experiment (left-side histogram). Columns 10-18 represents the baseline data (right-side histogram).
Can anyone help me on this? Thanks.
Your data is poorly formatted for faceting with lattice. You can restructure it using reshape.
read.csv(textConnection("Label1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label2,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label3,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label4,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label5,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label6,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18"), header = F)->data
colnames(data)[1] <- "ID"
colnames(data)[2:10] <- paste("exp",1:9, sep = "_")
colnames(data)[11:19] <- paste("base", 1:9, sep = "_")
library(reshape)
data.m <- melt(data, id = "ID")
data.m <- cbind(data.m, colsplit(data.m$variable, "_", names = c("Source","Measure")))
data.m is now in the format you really want your data to be in for almost everything. I don't know if each of the 9 measurements from the experiment and the baseline are meaningful or can be meaningfully compared so I kept them distinct.
Now, you can use lattice properly.
histogram(~value | Source + ID, data = data.m)
If the measurements are meaningfully compared (that is, data[,2] and data[,11] are somehow the "same"), you could recast the data to directly compare experiment to baseline
data.comp <- cast(data.m, ID + Measure ~ Source)
## I know ggplot2 better
library(ggplot2)
qplot(base, exp, data = data.comp)+
geom_abline()+
expand_limits(x = 0, y = 0)
Something like this should work:
library(lattice)
data <- matrix(1:18, ncol=18, nrow=3, byrow=T)
for (i in 1:nrow(data))
{
tmp <- cbind(data[i,], rep(1:2, each=9))
print(histogram(~tmp[,1]|tmp[,2]), split=c(1,i,1,nrow(data)), more=T)
}
Note: this will work only for a few rows of data. For larger datasets you may want to think of a slightly different layout (change the split argument passed to print).
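If you prefer not to depend on the reshape package, the long format can be built with base R alone. A sketch, assuming the layout from the question (a label followed by 9 experiment values and 9 baseline values); wide and long are names I chose, and only two rows are used to keep it short.

```r
# Two rows in the layout from the question: a label followed by 18 values
wide <- read.csv(text = "Label1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label2,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18", header = FALSE)

n_rows <- nrow(wide)

# unlist() walks the data column by column, so the IDs repeat once per column
# and the first 9 columns (experiment) come before the last 9 (baseline)
long <- data.frame(
  ID     = rep(wide[[1]], times = 18),
  Source = rep(c("exp", "base"), each = 9 * n_rows),
  value  = unlist(wide[-1], use.names = FALSE)
)

# Each ID should have 9 experiment and 9 baseline values
table(long$Source, long$ID)
```

A data frame in this shape can be passed to histogram(~ value | Source + ID, data = long) in the same way as data.m above.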

Resources