I'm having trouble figuring out how to calculate an average over "x" days. If I try to plot this CSV file over one year, there is too much data to display cleanly on a line plot (screenshot attached). I'm looking to average the data over every few days (maybe 2, a week, etc.) so the line graph is not so hard to read. Any advice on how I would solve this in R?
results.csv
POSTS,PROVIDER,TYPE,DATE
29337,FTP,BLOG,2010-01-01
26725,FTP,BLOG,2010-01-02
27480,FTP,BLOG,2010-01-03
31187,FTP,BLOG,2010-01-04
31488,FTP,BLOG,2010-01-05
32461,FTP,BLOG,2010-01-06
33675,FTP,BLOG,2010-01-07
38897,FTP,BLOG,2010-01-08
37122,FTP,BLOG,2010-01-09
41365,FTP,BLOG,2010-01-10
51760,FTP,BLOG,2010-01-11
50859,FTP,BLOG,2010-01-12
53765,FTP,BLOG,2010-01-13
56836,FTP,BLOG,2010-01-14
59698,FTP,BLOG,2010-01-15
52095,FTP,BLOG,2010-01-16
57154,FTP,BLOG,2010-01-17
80755,FTP,BLOG,2010-01-18
227464,FTP,BLOG,2010-01-19
394510,FTP,BLOG,2010-01-20
371303,FTP,BLOG,2010-01-21
370450,FTP,BLOG,2010-01-22
268703,FTP,BLOG,2010-01-23
267252,FTP,BLOG,2010-01-24
375712,FTP,BLOG,2010-01-25
381041,FTP,BLOG,2010-01-26
380948,FTP,BLOG,2010-01-27
373140,FTP,BLOG,2010-01-28
361874,FTP,BLOG,2010-01-29
265178,FTP,BLOG,2010-01-30
269929,FTP,BLOG,2010-01-31
R Script
library(ggplot2);
data <- read.csv("results.csv", header=T);
dts <- as.POSIXct(data$DATE, format="%Y-%m-%d");
attach(data);
a <- ggplot(dataframe, aes(dts,POSTS/1000, fill = TYPE)) + opts(title = "Report") + labs(x = NULL, y = "Posts (k)", fill = NULL);
b <- a + geom_bar(stat = "identity", position = "stack");
plot_theme <- theme_update(axis.text.x = theme_text(angle=90, hjust=1), panel.grid.major = theme_line(colour = "grey90"), panel.grid.minor = theme_blank(), panel.background = theme_blank(), axis.ticks = theme_blank(), legend.position = "none");
c <- b + facet_grid(TYPE ~ ., scale = "free_y");
d <- c + scale_x_datetime(major = "1 months", format = "%Y %b");
ggsave(filename="/root/results.png",height=14,width=14,dpi=600);
Graph Image
Try this:
Average <- function(Data, n){
  # Make an index to be used for aggregating
  ID <- as.numeric(as.factor(Data$DATE)) - 1
  ID <- ID %/% n
  # aggregate over ID and TYPE for all numeric data
  out <- aggregate(Data[sapply(Data, is.numeric)],
                   by = list(ID, Data$TYPE),
                   FUN = mean)
  # format output
  names(out)[1:2] <- c("dts", "TYPE")
  # add the correct dates as the beginning of every period
  out$dts <- as.POSIXct(Data$DATE[(out$dts * n) + 1])
  out
}
dataframe <- Average(data, 3)
This works with the plot script you gave.
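For the sample data above, dataframe <- Average(data, 3) returns 3-day means dated at the start of each period (the PROVIDER column is dropped because it is not numeric). The first two rows should look roughly like this:

         dts TYPE    POSTS
1 2010-01-01 BLOG 27847.33
2 2010-01-04 BLOG 31712.00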
Some remarks:
- never name a variable after an existing function (data, c, ...)
- avoid the use of attach(). If you do, add detach() afterwards, or you'll get into trouble at some point. It is better to use the functions with() and within(), as sketched below.
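A minimal sketch of the with()/within() idiom, using the data frame from the question:

# with() evaluates an expression in the context of the data frame, no attach() needed
with(data, mean(POSTS))
# within() does the same but returns the data frame, handy for derived columns
data <- within(data, dts <- as.POSIXct(DATE, format = "%Y-%m-%d"))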
The TTR package also has several moving average functions that will do this with a single statement:
library(TTR)
mavg.3day <- SMA(data$POSTS, n=3) # Simple moving average
Substitute a different value of 'n' for your desired moving average length.
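To illustrate (a minimal sketch, reusing the results.csv from the question): compute the moving average as a new column and overlay it on the raw series. Note that the first n-1 values of an SMA are NA.

library(TTR)
library(ggplot2)
data <- read.csv("results.csv", header = TRUE)
data$dts  <- as.Date(data$DATE)      # dates are already in "%Y-%m-%d" form
data$mavg <- SMA(data$POSTS, n = 3)  # simple moving average; first n-1 values are NA
ggplot(data, aes(x = dts)) +
  geom_line(aes(y = POSTS), colour = "grey70") +
  geom_line(aes(y = mavg), colour = "steelblue", na.rm = TRUE)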
TL;DR: I would like to generate a markdown report with plots of normalized counts for a list of genes. As this list is quite long (> 100 genes), I would like to generate a 4x4 grid of graphs on one figure page for the first 16 genes, then the same for genes 17 to 32, etc., until the end of the list. Currently, my code is only displaying 16 genes, even though I run the grid.arrange command INSIDE the loop (it worked when I used the plot function inside the loop, but that displays only one graph per page, of course).
I'm currently doing some RNAseq analysis and I'm looking for differentially expressed genes (DEGs) between two populations. To have a more visual representation of the DEGs, I would like to plot the normalized counts for each population for some genes of interest (GoI).
However, the list of these GoI can be quite long (for example, if I'm focusing on DEGs that code for membrane proteins, I have some 159 candidates). I'm able to plot them using a for loop, with the following code (from a first analysis):
# top gene contains all of Gene of Interest
# group_origin contains the factor used to discriminate cell population to compare
# dds_g is a dds object from DESeq2 and contained counts number I want to plot
for (i in unique(1:length(top_gene))){
gene <- top_gene[i]
d <- plotCounts(dds_g, gene = gene, intgroup = "group_origin", returnData = TRUE)
b <- ggplot(d, aes(x = group_origin, y = count)) +
stat_boxplot(geom = 'errorbar', aes(colour = factor(group_origin)), width = 0.2) +
geom_boxplot(aes(colour = factor(group_origin)), width = 0.2) +
stat_summary(fun.y=mean, geom="point", shape=17, size=1.5, aes(color=factor(group_origin), fill=factor(group_origin))) +
labs (title = paste0(resg_cb_db$symbol[gene],' (',gene,')'), x = element_blank()) +
theme_bw() +
scale_color_manual(values = mycolors) +
theme(text= element_text(size=10),
axis.text.x = element_text(size = 7.5, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
legend.position = "none")
plot(b)
}
Using this approach, I'm able to generate (separately) the plots for all the gene IDs contained in my "top_gene" vector.
However, the markdown is creating one plot per page, which is annoying when you have a high number of genes. I would like to group them, e.g., genes 1 to 16 in the list plotted together on one page, then 17 to 32, etc.
I've tried (without success) par(mfrow = ...). I also tried gridExtra with the following code (from another analysis example):
for (i in unique(1:length(top_gene))){
gene <- top_gene[i]
d <- plotCounts(dds, gene = gene, intgroup = "cell_type", returnData = TRUE)
b <- ggplot(d, aes(x = cell_type, y = count)) +
stat_boxplot(geom = 'errorbar', aes(colour = factor(cell_type)), width = 0.2) +
geom_boxplot(aes(colour = factor(cell_type)), width = 0.2) +
stat_summary(fun.y=mean, geom="point", shape=17, size=1.5, aes(color=factor(cell_type), fill=factor(cell_type))) +
labs (title = paste0(resg$symbol[gene],' (',gene,')'), x = element_blank()) +
theme_bw() +
scale_color_manual(values = mycolors) +
theme(text= element_text(size=7),
axis.text.x = element_text(size = 7, angle = 45, hjust = 1),
axis.text.y = element_text(size = 6),
legend.position = "none")
plot_list[[gene]] <- b
}
t <- length(plot_list)
x <- seq(1, t, by = 15)
for (i in x){
z <- i+1
if (!is.na(x[z])) {
test_list <- plot_list[c(x[i]:x[z])]
do.call(grid.arrange,test_list)
}
else {
z <- length(plot_list)
test_list <- plot_list[c(x[i]:z)]
do.call(grid.arrange,test_list)
}
}
This gives me the error "Error in x[i]:z : argument NA / NaN".
The idea here was to plot a 4x4 page of graphs for every 16 genes. Of course, when it reaches the last group, there are fewer than 16 genes remaining (unless the total number of genes is divisible by 16), which is why the if statement is there to prevent the error (but it's not working).
Also, I've tried to remove this last part and just generate the first "9 x 16 genes" of my list, discarding the last ones. It "works" in that I can clearly see the first 16 genes plotted 4 x 4, but nothing for the rest.
Why does plotting inside a for loop with "plot(b)" work, but not grid.arrange on a list() that was also created inside a for loop?
Very sorry for my code, I know it's not perfect (I'm still learning all of this), but I hope it's clear enough for you to understand my question...
Thanks!
!EDIT!: solved the first error by adding (i in length(x)). Feel stupid :D. Anyway, it's still only "printing" 16 plots instead of 159...
I think you have some confusion in your for loop. When you write for (i in x), each iteration gives you a value of x (1, 16, 31, 46, ...), not an index, so when you set z <- i + 1 you get the values (2, 17, 32, ...). This makes your x[i] and x[z] values NA for most values of i and z.
The for loop below should fix the indexing and prevent you from going outside the length of plot_list in your edge case.
for (i in 1:length(x)){   # replaced this line
  z <- i + 1
  if (!is.na(x[z])) {
    # changed this to make sure x[z] doesn't show up on two plots
    test_list <- plot_list[x[i]:(x[z]-1)]
    do.call(grid.arrange, test_list)
  } else {
    z <- length(plot_list)
    test_list <- plot_list[x[i]:z]
    do.call(grid.arrange, test_list)
  }
}
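As a side note: seq(1, t, by = 15) actually makes groups of 15 plots, not 16. If you want true 4x4 pages, a more compact alternative (a sketch, assuming plot_list holds one ggplot per gene) is to chunk the list with split() and pass the layout through to grid.arrange():

library(gridExtra)
# split plot_list into consecutive groups of 16 and draw each group on a 4x4 page
chunks <- split(plot_list, ceiling(seq_along(plot_list) / 16))
for (chunk in chunks) {
  do.call(grid.arrange, c(chunk, list(ncol = 4, nrow = 4)))
}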
I am trying to visualize time series data. I have a set of 5 loggers which record snow movement distance, and an environmental variable that possibly has an effect on the snow movement. That is why it is meaningful to graph them together, to see whether the snow movement (detected by the loggers) is influenced by these environmental factors. I have one file containing the weather station data and 5 files, one for each logger. The weather data is measured every 5 minutes, while the loggers measure whenever there is movement!

I have managed to visualize them together so far. However, my professor wants me to visualize the loggers as a group by presenting a gray area (instead of 5 lines) that always spans the minimum and maximum value of the loggers at a given time point. I am using ggplot2. I've tried to make such an area with geom_ribbon, but it is not so straightforward with my dataset: the lines cross, and the loggers holding the min and max values often switch. I don't know if joining them into a single dataset would help, but this is also not possible because they don't have the same length. Furthermore, it's not as if all 5 loggers have measurements at the same times; they log only when there is movement.

Here is my code and the graph that it creates. Unfortunately, I am not sure how to reproduce the data; I am more than glad to share it with you somehow.
#install.packages("patchwork")
library(ggplot2)
library(scales)
library(patchwork)
Sys.setlocale(category = "LC_ALL", locale = "english")
startTime <- as.Date("2017-10-01")
endTime <- as.Date("2018-06-30")
start_end <- c(startTime,endTime)
################################################## FALL LINE 1 #########################################################
logger1 <- read.csv("F1_17_18_167.csv",header=TRUE, sep=";")
logger1$date <- as.Date(logger1$Date, "%d.%m.%Y")
logger2 <- read.csv("F1_17_18_186.csv",header=TRUE, sep=";")
logger2$date <- as.Date(logger2$Date, "%d.%m.%Y")
logger3 <- read.csv("F1_17_18_031.csv",header=TRUE, sep=";")
logger3$date <- as.Date(logger3$Date, "%d.%m.%Y")
logger4 <- read.csv("F1_17_18_091.csv",header=TRUE, sep=";")
logger4$date <- as.Date(logger4$Date, "%d.%m.%Y")
logger5 <- read.csv("F1_17_18_294.csv",header=TRUE, sep=";")
logger5$date <- as.Date(logger5$Date, "%d.%m.%Y")
station <- read.csv("aggregates.csv",header=TRUE, sep=",")
station$date <- as.Date(station$Group.1, "%Y-%m-%d")
ggplot()+
geom_line(data = station, aes(x = date, y = Mean_snowheight ,color = "Mean Snowheight"),na.rm = TRUE, size = 1)+
scale_x_date(limits=start_end,breaks=date_breaks("1 month"),labels=date_format("%b %y"))+
scale_y_continuous(limits= c (0,115))
ggplot()+
geom_line(data = logger1, aes(x = date, y = AccuDist, color = "167 (mid-bottom)"),na.rm= TRUE, size = 1)+
geom_line(data = logger2, aes(x = date, y = AccuDist, color = "186 (top-middle)"),na.rm= TRUE, size = 1)+
geom_line(data = logger3, aes(x = date, y = AccuDist, color = "31 (top)"),na.rm= TRUE, size = 1)+
geom_line(data = logger4, aes(x = date, y = AccuDist, color = "91 (bottom)"),na.rm= TRUE, size = 1)+
geom_line(data = logger5, aes(x = date, y = AccuDist, color = "294 (middle)"),na.rm= TRUE, size = 1)+
geom_line(data = station, aes(x = date, y = Mean_snowheight*11.49 ,color = "Mean snowheight"),na.rm = TRUE, size = 1) +
ggtitle("Fall line 1") +
labs(color = "")+
xlab("Season 17/18")+
ylab("Accumulated Distance [mm]")+
scale_x_date(limits=start_end,breaks=date_breaks("1 month"),labels=date_format("%b %y"))+
scale_y_continuous(sec.axis = sec_axis(~./11.49,name = "Mean snowheight [cm]"),limits = c(0,1500))+
scale_color_manual("", guide = "legend",
values = c("167 (mid-bottom)"= "darkorange2",
"186 (top-middle)" = "darkgreen",
"31 (top)" = "red",
"91 (bottom)" = "blue",
"294 (middle)" = "purple",
"Mean snowheight" = "black"))+
theme(legend.position="bottom",
#legend.title = element_blank(),
axis.text.x = element_text(angle = 50, size = 10 , vjust = 0.5),
axis.text.y = element_text(size = 10, vjust = 0.5),
panel.background = element_rect(fill = "gray100"),
plot.background = element_rect(fill = "gray100"),
panel.grid.major = element_line(colour = "lightblue"),
plot.margin = unit(c(1, 1, 1, 1), "cm"),
plot.title = element_text(hjust = 0.5, size = 22))
You can see the graph this code produces:
If you ignore the environmental factor for a second (the black line), you are left with the accumulated snow movement distance over the winter period for each logger (the colored lines). My aim is to fill the area that is always between the lowest and the highest line.
Let me know if I need to upload the data somewhere. This is what the logger data looks like: data table.
Thanks in advance.
Regards,
Zorin
This is actually tougher than it seems at first. Your goal, as I understand it, is to fill in the area of your line plot between the "lowest" and the "highest" lines. This is made more difficult by the fact that the lowest and highest lines may change places throughout the plot, so you cannot simply fill between one logger and another. It's also made difficult by the fact that your x-axis value is a date, and not all loggers collect data on the same dates and times.
First of all, I'll be ignoring some of the personal aesthetics you added and also removing the line you included for mean snow height (from the dataframe station), for ease of showing you the solution I have.
Data Preparation
To begin, I noticed that you have a separate geom_line() call for each individual logging station dataset (logger1 through logger5). While that method certainly works (and gives you the plot you want), it's much better practice to combine all logs into one dataset, and this is necessary for the solution I'm proposing anyway. Luckily, it's pretty simple: just use rbind() to combine the datasets. Critically, you'll need to create a new column for each (called id here) that records the logging station of origin. You can then use that new id column as your color= aesthetic and draw all 5 lines with one geom_line() call.
One small problem I ran into is that your datasets had slightly different column names (some were capitalized, some were not...). They were all in the same order, so it wasn't too difficult to make them all the same before combining; it just added another step. Finally, I converted the date column to Date format.
# create the id column
logger1$id <- 'logger1'
logger2$id <- 'logger2'
logger3$id <- 'logger3'
logger4$id <- 'logger4'
logger5$id <- 'logger5'
# fixing inconsistency in column names
my_column_names <- names(logger1)
names(logger2) <- my_column_names
names(logger3) <- my_column_names
names(logger4) <- my_column_names
names(logger5) <- my_column_names
# make one big df
loggers <- rbind(logger1, logger2, logger3, logger4, logger5)
loggers$date <- as.Date(loggers$date)
You can now recreate the plot more simply:
ggplot(loggers, aes(x=date, y=AccuDist)) + theme_bw() +
geom_line(aes(color=id), size=1)
Finding the Running Minimum and Maximum
In order to create the fill, I'm using geom_ribbon(), which needs the aesthetics ymin and ymax. You have to compute those first, though, and they need to be the "running minimum" and "running maximum", meaning they change as you progress through the data. For this I'm using the two functions shown below, min_vect() and max_vect().
# find the "running maximum"
max_vect <- function(ac) {
  curr_max <- 0
  return_vector <- vector(mode = 'numeric', length = length(ac))
  for (i in 1:length(ac)) {
    if (ac[i] > curr_max) {
      curr_max <- ac[i]
    }
    return_vector[i] <- curr_max
  }
  return(return_vector)
}
# find the "running minimum"
min_vect <- function(ac) {
  curr_min <- max(ac)
  return_vector <- vector(mode = 'numeric', length = length(ac))
  for (i in length(ac):1) {
    if (ac[i] < curr_min) {
      curr_min <- ac[i]
    }
    return_vector[i] <- curr_min
  }
  return(return_vector)
}
The idea is that for the maximum, you step through an (ordered) vector, and whenever a number is higher than the current maximum, it becomes the new maximum. The same strategy is used for the running minimum, except that we step through the ordered vector in reverse.
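Incidentally, since AccuDist is non-negative here, base R's cumulative functions express the same idea in one line apiece (I'll keep using the explicit loop versions below):

# running maximum: same result as max_vect() for non-negative input
max_vect2 <- function(ac) cummax(ac)
# running minimum from the end: same result as min_vect()
min_vect2 <- function(ac) rev(cummin(rev(ac)))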
In order to apply the functions and create the new columns, the dataset needs to be ordered properly first:
# must arrange by date and time first! (arrange() and %>% come from dplyr)
library(dplyr)
loggers <- loggers %>% arrange(date, TIME)
# add your new columns
loggers$min_Accu <- min_vect(loggers$AccuDist)
loggers$max_Accu <- max_vect(loggers$AccuDist)
The Finale
And now, the plot. Basically it's the same, and I'm using geom_ribbon() as described above. For a bonus, I'm also using scale_color_discrete() to set the legend title and labels, just to show that you can code that in afterwards (and it will still be easier than having separate geom_line() calls).
logger_list <- c('Log 1', 'Log 2', 'Log 3', 'Log 4', 'Log 5')
ggplot(loggers, aes(x=date, y=AccuDist)) +
theme_bw() +
geom_ribbon(aes(ymin=min_Accu, ymax=max_Accu), alpha=0.2) +
geom_line(aes(color=id), size=1) +
scale_color_discrete(name='Log ID Num', labels=logger_list)
The code plots data with computed weekly regression lines.
I would like to combine the legend with weekly doubling times, computed from the weekly slopes.
Nice-to-solve question: I could get the weekly regression lines with geom_smooth.
However, I could not extract the slope coefficients (needed to compute the doubling times) from the geom_smooth, so I had to run equivalent regressions outside the ggplot portion.
Any suggestions to do this more elegantly?
Main question: How can I combine the legend with the column of computed doubling times?
With a lot of fiddling I can place the legend sort of next to these computed doubling times.
It does not look nice, and when I include another data point I have to start fiddling all over again. Suggestions will be appreciated. Thank you.
library(ggplot2)
library(gridExtra)
library(tibble)   # for tibble() and add_row()
# Input data: Daily number of cases starting at day0
cases <- c(1,1,2,3,7,10,13,16,24,38,51,62,85,116,150,202,240,274,402,554,709, 927)
day0 <- as.Date("2020-03-04")
# actual dates by counting from day0
dates <- day0 + 1:length(cases)
# week number as factor to obtain regression line for each week
week <- as.factor(1 + (1:length(cases) ) %/% 7)
# tibble with daily data, also with week number
datatib <- tibble( dates, cases, week)
# tibble with computed doubling time per week
resulttib <- tibble(Week=unique(week), Doubling_Time=NA)
# linear regression on log of dependent variable
for (wk in unique(week)) {
  resulttib[wk, 'Doubling_Time'] <-
    round(log(2) / lm(log(cases) ~ dates, data = datatib[week == wk, ])$coef['dates'], 2)
}
# insert row at top for second line of column heading
resulttib <- add_row(resulttib, Week = '', Doubling_Time = '(days)', .before = 1)
doublingtime = tableGrob(resulttib[,'Doubling_Time'], rows=NULL)
gp <-
ggplot(datatib, aes(dates, cases, color = week ) ) +
geom_point() +
geom_smooth( method = "lm", se = FALSE) +
scale_x_date() +
scale_y_continuous(trans="log10") +
labs(x = "", y = "Number of Cases") +
ggtitle("Number of Cases with Weekly Doubling Times") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position=c(0.75,0),
legend.justification=c(1.2, -0.1), legend.text=element_text(size=14) ) +
annotation_custom( doublingtime,
xmin=dates[length(cases)]-2, xmax=dates[length(cases)], ymin=-2.65 )
As an answer to your main question, try this: I simply joined the doubling times to your main df and created a new variable combining the week number and the doubling time. Color is then mapped to this new variable.
Concerning your second question: there are ways to get at the values computed by geom_smooth/stat_smooth. However, in my opinion your approach of computing the slopes yourself is the easier way to solve this kind of problem.
library(ggplot2)
library(dplyr)
library(gridExtra)
# Input data: Daily number of cases starting at day0
cases <- c(1,1,2,3,7,10,13,16,24,38,51,62,85,116,150,202,240,274,402,554,709, 927)
day0 <- as.Date("2020-03-04")
# actual dates by counting from day0
dates <- day0 + 1:length(cases)
# week number as factor to obtain regression line for each week
week <- as.factor(1 + (1:length(cases) ) %/% 7)
# tibble with daily data, also with week number
datatib <- tibble( dates, cases, week)
# tibble with computed doubling time per week
resulttib <- tibble(Week=unique(week), Doubling_Time=NA)
# linear regression on log of dependent variable
for (wk in unique(week)) {
  resulttib[wk, 'Doubling_Time'] <-
    round(log(2) / lm(log(cases) ~ dates, data = datatib[week == wk, ])$coef['dates'], 2)
}
# insert row at top for second line of column heading
#resulttib <- add_row(resulttib, Week = '', Doubling_Time = '(days)', .before = 1)
#doublingtime = tableGrob(resulttib[,'Doubling_Time'], rows=NULL)
datatib1 <- datatib %>%
left_join(resulttib, by = c("week" = "Week")) %>%
mutate(week1 = paste0(week, " (", Doubling_Time, ")"))
gp <-
ggplot(datatib1, aes(dates, cases, color = week1 ) ) +
geom_point() +
geom_smooth( method = "lm", se = FALSE) +
scale_x_date() +
scale_y_continuous(trans="log10") +
labs(x = "", y = "Number of Cases") +
ggtitle("Number of Cases with Weekly Doubling Times") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(
legend.position = c(.95, .05),
legend.justification = c("right", "bottom"),
legend.box.just = "right",
legend.margin = margin(6, 6, 6, 6)
) +
labs(color = "Week (Doubling time in days)")
gp
Created on 2020-03-27 by the reprex package (v0.3.0)
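For completeness, a minimal sketch of the extraction route mentioned above: ggplot_build() exposes the fitted points that geom_smooth drew, and the slopes can be recovered from them. This assumes the smooth is the second layer of gp; note that with trans = "log10" the extracted y values are on the log10 scale, so the doubling time is log10(2)/slope:

built <- ggplot_build(gp)
smooth_df <- built$data[[2]]   # fitted points of the smooth layer: x, y, group, ...
# slope of week 1's fitted line (y is log10(cases), x is numeric days)
slope1 <- coef(lm(y ~ x, data = subset(smooth_df, group == 1)))[["x"]]
log10(2) / slope1              # doubling time in days for week 1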
I found out how to estimate the historical variance decomposition for VAR models in R at the link below:
Historical Variance Error Decomposition Daniel Ryback
Daniel Ryback presents the result in an Excel plot, but I wanted to prepare it with ggplot, so I wrote some lines to produce it. Nevertheless, the plot I got in ggplot is very different from the one shown by Daniel in Excel. I replicated his calculation in Excel and got the same result as Daniel, so it seems there is an error in the way I am preparing the ggplot. Does anyone have a suggestion for arriving at the Excel result?
See my code below:
library(vars)
library(ggplot2)
library(reshape2)
This code is run after running the code developed by Daniel Ryback in the link above, which defines the VARhd function.
data(Canada)
ab<-VAR(Canada, p = 2, type = "both")
HD <- VARhd(Estimation=ab)
HD[,,1]
ex <- HD[,,1]
ex1 <- as.data.frame(ex) # transforming the HD matrix as data frame #
ex2 <- ex1[3:84,1:4] # taking our the first 2 rows as they are N/As #
colnames(ex2) <- c("Employment", "Productivity", "Real Wages", "Unemployment") # renaming columns #
ex2$Period <- 1:nrow(ex2) # creating an id column #
col_id <- grep("Period", names(ex2)) # setting the new variable as id #
ex3 <- ex2[, c(col_id, (1:ncol(ex2))[-col_id])] # moving id variable to the first column #
molten.ex <- melt(ex3, id = "Period") # melting the data frame #
ggplot(molten.ex, aes(x = Period, y = value, fill = variable)) +
geom_bar(stat = "identity") +
guides(fill = guide_legend(reverse = TRUE))
ggplot version
Excel version
The difference is that ggplot2 orders the factor variable and plots it in a different order than Excel. If you reorder the factor levels before plotting, it will put 'Unemployment' at the bottom and 'Employment' at the top, as in Excel:
molten.ex$variable <- factor(molten.ex$variable,
                             levels = c("Unemployment", "Real Wages",
                                        "Productivity", "Employment"))
ggplot(molten.ex, aes(x = Period, y = value, fill = variable)) +
  geom_bar(stat = "identity", width = 0.6) +
  guides(fill = guide_legend(reverse = TRUE)) +
  # Making the R plot look more like excel for comparison...
  scale_y_continuous(limits = c(-6, 8), breaks = seq(-6, 8, by = 2)) +
  scale_fill_manual(name = NULL,
                    values = c(Unemployment = "#FFc000",  # yellow
                               `Real Wages` = "#A4A4A4",  # grey
                               Productivity = "#EC7C30",  # orange
                               Employment = "#5E99CE")) + # blue
  theme(rect = element_blank(),
        panel.grid.major.y = element_line(colour = "#DADADA"),
        legend.position = "bottom",
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.key.size = unit(3, "mm"))
Giving:
To roughly match the Excel graph in Daniel Ryback's post:
I would like to visualize the time frames of my five projects, given below. Currently I am using the OpenOffice Draw application and manually producing the graph shown below, but I am not satisfied. Could you help me solve the following? Thank you.
1. How can I produce somewhat similar graphs using R (or Excel), with better precision in terms of days?
2. Is there a way to visualize the data better? If so, please let me know how to produce that using R or Excel.
Project Time
------- ------
A Feb 15 – March 1
B March 15 – June 15
C Feb 1 – March 15
D April 10 – May 15
E March 1 – June 30
ggplot2 provides a (reasonably) straightforward way to construct a plot.
First you need to get your data into R. You want your starting and ending dates to be in some kind of date class in R (I have used Date).
library(ggplot2)
library(scales) # for date formatting with ggplot2
DT <- data.frame(Project = LETTERS[1:5],
                 start = as.Date(ISOdate(2012, c(2,3,2,4,3), c(15,15,1,10,1))),
                 end   = as.Date(ISOdate(2012, c(3,6,3,5,6), c(1,15,15,15,30))))
# it is useful to have a numeric version of the Project column
DT$ProjectN <- as.numeric(DT$Project)
You will also want to calculate where to put the text; I will use ddply() from the plyr package.
library(plyr)
# find the midpoint date for each project
DTa <- ddply(DT, .(ProjectN, Project), summarize, mid = mean(c(start,end)))
You want to create:
- rectangles for each project, hence you can use geom_rect
- text labels for each midpoint
Here is an example of how to build the plot:
ggplot(DT) +
  geom_rect(aes(colour = Project, ymin = ProjectN - 0.45,
                ymax = ProjectN + 0.45, xmin = start, xmax = end), fill = NA) +
  scale_colour_hue(guide = 'none') +   # this removes the legend
  geom_text(data = DTa, aes(label = Project, y = ProjectN, x = mid,
                            colour = Project), inherit.aes = FALSE) +
  # now some prettying up to remove text / axis ticks
  theme(panel.background = element_blank(),
        axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
  # and add date labels
  scale_x_date(labels = date_format('%b %d'),
               breaks = sort(unique(c(DT$start, DT$end)))) +
  # remove axis labels
  labs(y = NULL, x = NULL)
You could also check the gantt.chart function in the plotrix package.
library(plotrix)
?gantt.chart
Here is one implementation:
dmY.format <- "%d/%m/%Y"
gantt.info <- list(
  labels = c("A","B","C","D","E"),
  starts = as.Date(c("15/02/2012", "15/03/2012", "01/02/2012", "10/04/2012", "01/03/2012"),
                   format = dmY.format),
  ends   = as.Date(c("01/03/2012", "15/06/2012", "15/03/2012", "15/05/2012", "30/06/2012"),
                   format = dmY.format)
)
vgridpos <- as.Date(c("01/01/2012","01/02/2012","01/03/2012","01/04/2012",
                      "01/05/2012","01/06/2012","01/07/2012","01/08/2012"),
                    format = dmY.format)
vgridlab <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug")
gantt.chart(gantt.info,
            xlim = c(as.Date("01/01/2012", format = dmY.format),
                     as.Date("01/08/2012", format = dmY.format)),
            main = "Projects duration", taskcolors = FALSE, border.col = "black",
            vgridpos = vgridpos, vgridlab = vgridlab, hgrid = TRUE)
I also tried ggplot2, but mnel was faster than me. Here is my code:
data1 <- as.data.frame(gantt.info)
data1$order <- 1:nrow(data1)
library(ggplot2)
ggplot(data1, aes(xmin = starts, xmax = ends, ymin = order, ymax = order + 0.5)) +
  geom_rect(color = "black", fill = NA) +
  theme_bw() +
  geom_text(aes(x = starts + (ends - starts)/2, y = order + 0.25, label = labels)) +
  ylab("Projects") + xlab("Date")