I would like to visualize the time frame data of my five projects given below. Currently I am using OpenOffice draw application and manually producing the graph shown below. But I am not satisfied. Could you help me to solve the following. Thank you.
1. How can I produce somewhat similar graphs using R (or excel) with better precision in terms of days?
2. Is there a way for better visualization of the data? If so, please let me know how to produce that using R or Excel.
Project Time
------- ------
A Feb 15 – March 1
B March 15 – June 15
C Feb 1 – March 15
D April 10 – May 15
E March 1 – June 30
ggplot2 provides a (reasonably) straightforward way to construct a plot.
First you need to get your data into R. Your start and end dates should be stored as a date class in R (I have used Date).
library(ggplot2)
library(scales) # for date formatting with ggplot2
DT <- data.frame(Project = LETTERS[1:5],
start = as.Date(ISOdate(2012, c(2,3,2,4,3), c(15,15,1,10,1))),
end = as.Date(ISOdate(2012, c(3,6,3,5,6), c(1,15,15,15,30))))
# it is useful to have a numeric version of the Project column
# (factor() is needed because data.frame() keeps strings as character since R 4.0)
DT$ProjectN <- as.numeric(factor(DT$Project))
You will also want to calculate where to put the text. I will use `ddply` from the plyr package:
library(plyr)
# find the midpoint date for each project
DTa <- ddply(DT, .(ProjectN, Project), summarize, mid = mean(c(start,end)))
You want to create
rectangles for each project, hence you can use geom_rect
text labels for each midpoint
Here is an example how to build the plot
ggplot(DT) +
geom_rect(aes(colour = Project, ymin = ProjectN - 0.45,
ymax = ProjectN + 0.45, xmin = start, xmax = end), fill = NA) +
scale_colour_hue(guide = 'none') + # this removes the legend
geom_text(data = DTa, aes(label = Project, y = ProjectN, x = mid,colour = Project), inherit.aes= FALSE) + # now some prettying up to remove text / axis ticks
theme(panel.background = element_blank(),
axis.ticks.y = element_blank(), axis.text.y = element_blank()) + # and add date labels
scale_x_date(labels = date_format('%b %d'),
breaks = sort(unique(c(DT$start,DT$end))))+ # remove axis labels
labs(y = NULL, x = NULL)
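Since the question asks for precision in days: once the dates are stored as Date objects, the exact length of each project falls out of plain subtraction. Here is a minimal base-R sketch. The data frame name `projects` and the `days` column are my own and not part of the answer above; the year 2012 is assumed, and the dates come from the table in the question.

```r
# Hypothetical data frame mirroring the question's table (year 2012 assumed).
projects <- data.frame(
  Project = LETTERS[1:5],
  start = as.Date(c("2012-02-15", "2012-03-15", "2012-02-01", "2012-04-10", "2012-03-01")),
  end   = as.Date(c("2012-03-01", "2012-06-15", "2012-03-15", "2012-05-15", "2012-06-30"))
)
# Subtracting two Date vectors gives a difftime; convert it to a plain number of days.
projects$days <- as.numeric(difftime(projects$end, projects$start, units = "days"))
print(projects)
```

The `days` column can then be printed next to the chart or used as a text label on each bar.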
You could also check gantt.chart function in plotrix package.
library(plotrix)
?gantt.chart
Here is one implementation
dmY.format<-"%d/%m/%Y"
gantt.info<-list(
labels= c("A","B","C","D","E"),
starts= as.Date(c("15/02/2012", "15/03/2012", "01/02/2012", "10/04/2012","01/03/2012"),
format=dmY.format),
ends= as.Date(c("01/03/2012", "15/06/2012", "15/03/2012", "15/05/2012","30/06/2012"),
format=dmY.format)
)
vgridpos<-as.Date(c("01/01/2012","01/02/2012","01/03/2012","01/04/2012","01/05/2012","01/06/2012","01/07/2012","01/08/2012"),format=dmY.format)
vgridlab<-
c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug")
gantt.chart(gantt.info, xlim= c(as.Date("01/01/2012",format=dmY.format), as.Date("01/08/2012",format=dmY.format)) , main="Projects duration",taskcolors=FALSE, border.col="black",
vgridpos=vgridpos,vgridlab=vgridlab,hgrid=TRUE)
I also tried ggplot2, but mnel was faster than me. Here is my code:
data1 <- as.data.frame(gantt.info)
data1$order <- 1:nrow(data1)
library(ggplot2)
ggplot(data1, aes(xmin = starts, xmax = ends, ymin = order, ymax = order + 0.5)) +
  geom_rect(color = "black", fill = NA) + # fill = NA draws unfilled rectangles
  theme_bw() +
  geom_text(aes(x = starts + (ends - starts)/2, y = order + 0.25, label = labels)) +
  ylab("Projects") + xlab("Date")
I followed this manual (https://afit-r.github.io/cleveland-dot-plots) to create a Cleveland Dot Plot, which I was able to reproduce, but I faced the following challenges:
How do I sort my Y-Axis in historical order? The varieties on my y-axis have different release years, and although those are not shown in my plot, I would like to order them in historical order. Now they are in some weird alphabetical order starting from the back, and I don't even know how to change that.
I couldn't manage to show the differences between the points in percentages (like in the manual). Could anyone explain that to me in more detail?
Do you see any possibility of including the same data for another year?
See below for my code and picture:
require(ggplot2)
require(reshape2)
require(dplyr)
require(plotrix)
cleanup = theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(color = "black"))
data19 = read.csv("Harvest_2019_V2.csv", sep = ";")
data19$Experiment_Year <- as.factor(data19$Experiment_Year)
data19$Release_year <- as.factor(data19$Release_year)
Subset2019 = subset(data19, Experiment_Year == 2019)
agHarvest.Weight <- aggregate(Subset2019[, 9], list(Subset2019$Variety,Subset2019$Release_year,Subset2019$Treatment), mean)
agHarvest.Weight$Variety <- agHarvest.Weight$Group.1
agHarvest.Weight$Release_Year <- agHarvest.Weight$Group.2
agHarvest.Weight$Treatment <- agHarvest.Weight$Group.3
agHarvest.Weight$Yield <- agHarvest.Weight$x
right_label <- agHarvest.Weight %>%
group_by(Variety) %>%
arrange(desc(Yield)) %>%
top_n(1)
left_label <- agHarvest.Weight %>%
group_by(Variety) %>%
arrange(desc(Yield)) %>%
slice(2)
ggplot(agHarvest.Weight, aes(Yield, Variety)) +
geom_line(aes(group = Variety)) +
geom_point(aes(color = Treatment), size = 1.5) +
geom_text(data = right_label, aes(color = Treatment, label = round(Yield, 0)),
size = 3, hjust = -.5) +
geom_text(data = left_label, aes(color = Treatment, label = round(Yield, 0)),
size = 3, hjust = 1.5) +
scale_x_continuous(limits = c(2500, 4500)) + cleanup + xlab("Yield, g") +
scale_color_manual(values=c("blue","darkgreen"))
OP. Understandably, you cannot always share data for various reasons. This is why it is always recommended to either use an existing publicly-available dataset or craft your own in order to produce a minimum reproducible example. Fortunately, you're in luck, as I don't mind doing this for you. :)
TL;DR - there are many ways, but simplest method is to use reorder(your_variable, variable_to_sort_by). Note that y axis direction goes "bottom-up" rather than "top-to-bottom" on the plot.
Example Data
df <- data.frame(
Variety=rep(LETTERS[1:5], each=2),
Yield=c(265, 285, 458, 964, 152, 202, 428, 499, 800, 900),
Treatment=rep(c('first','second'), 5),
Year=rep(c(2000, 2001, 2010, 1999, 1998), each=2)
)
> df
Variety Yield Treatment Year
1 A 265 first 2000
2 A 285 second 2000
3 B 458 first 2001
4 B 964 second 2001
5 C 152 first 2010
6 C 202 second 2010
7 D 428 first 1999
8 D 499 second 1999
9 E 800 first 1998
10 E 900 second 1998
Basic Cleveland Dot Plot
p <- ggplot(df, aes(x=Yield, y=Variety)) +
geom_line(aes(group=Variety)) +
geom_point(size=3) +
geom_text(aes(label=Yield), nudge_y=0.2, size=2) +
theme_bw()
p
Sort Variety (Y axis) by Year Column
You should first notice how ggplot2 arranges your axes. The key is to understand that the origin of the plot starts at the bottom left corner. This means that the lowest value for x and y axes will be at the left and bottom, respectively. This is the reason why df$Variety is alphabetical, but "goes up" (from bottom to top). To reverse the y axis, you can just add scale_y_reverse() to your plot code, but that only works for continuous axes. For discrete axes, you can use scale_y_discrete(limits=rev(df$Variety)). You'll see in the following approach we can avoid that.
To sort the y axis by another column, you can use reorder() right with the aes() call. The reorder() function is basically setup as follows:
reorder(columnA, column_to_use_to_sort_columnA)
In this case, you'll want to sort df$Variety by df$Year, so this should become:
reorder(Variety, Year)
...but remember how the y axis "goes up"? If you want the Y axis to be sorted by df$Year and "go down", you can either reverse the axis via scale_y_discrete(limits=rev(df$Variety)), or conveniently just sort by df$Year in reverse using the syntax:
reorder(Variety, -Year)
Putting this together you get this:
p1 <- ggplot(df, aes(x=Yield, y=reorder(Variety, -Year))) +
geom_line(aes(group=Variety)) +
geom_point(size=2) +
geom_text(aes(label=Yield), nudge_y=0.2, size=2) +
theme_bw()
p1
You'll see we have our proper order now, where df$Variety is sorted by ascending df$Year, starting from the top (1998) and going down to the bottom (2010).
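If you want to check the order without drawing anything, you can look at the factor levels that reorder() produces. This is only a small sketch on the same toy data; `lv_asc` and `lv_desc` are names I made up for illustration.

```r
# Same toy data as in the example above (only the columns needed here).
df <- data.frame(
  Variety = rep(LETTERS[1:5], each = 2),
  Year = rep(c(2000, 2001, 2010, 1999, 1998), each = 2)
)
# reorder() returns a factor whose levels are sorted by the (mean) sort key.
# ggplot2 draws the first level at the bottom of a discrete y axis.
lv_asc  <- levels(reorder(df$Variety, df$Year))   # oldest first -> oldest at the bottom
lv_desc <- levels(reorder(df$Variety, -df$Year))  # newest first -> oldest at the top
print(lv_asc)   # "E" "D" "A" "B" "C"
print(lv_desc)  # "C" "B" "A" "D" "E"
```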
Other ways?
There are other ways to do your sorting, but I found this the most straightforward. The other fundamentally different approach would be to sort your data frame first, then plot. However, if you do this, be aware that ggplot2 will convert any column with discrete values into a factor first, and the default factor levels are created by sorting the names in alphabetical order. This means that if you sort your data frame first, then plot, you'll still be stuck with alphabetical order. You would need to sort, then explicitly convert df$Variety into a factor (and specify the levels), then plot. Something like this works just the same:
df <- dplyr::arrange(df, -Year) # arrange by descending Year
df$Variety <- factor(df$Variety, levels=unique(df$Variety)) # factor and indicate levels
ggplot(df, aes(x=Yield, y=Variety)) +
geom_line(aes(group=Variety)) +
geom_point(size=2) +
geom_text(aes(label=Yield), nudge_y=0.2, size=2) +
theme_bw() # the factor levels already set the order; reversing the limits here would flip the axis
Above code gives you the same plot as the method using reorder(Variety, -Year).
I am trying to visualize time series data. I have a set of 5 loggers that record snow movement distance, plus an environmental variable that possibly has an effect on the snow movement. That is why it is meaningful to graph them together, to see whether snow movement (detected by the loggers) is influenced by these environmental factors. I have one file containing the weather station data and 5 files, one for each logger. The weather data is measured every 5 minutes, while the loggers measure whenever there is movement! I have managed to visualize them together so far; however, my professor wants me to visualize the loggers as a group by presenting a gray area (instead of 5 lines) that always shows the minimum and maximum value of the loggers at a given point in time. I am using ggplot2. I've tried to make such an area using geom_ribbon, but it is not so straightforward with my dataset. The lines cross, and the loggers that have the min and max values often switch. I don't know if joining them in a single dataset would help, but this is also not possible because they don't have the same length. Furthermore, it's not as if all 5 loggers have measurements at the same time. They log only when there is movement. Here is my code and the graph that it creates. Unfortunately, I am not sure how to reproduce the data. I am more than glad to share it with you somehow.
#install.packages("patchwork")
library(ggplot2)
library(scales)
library(patchwork)
Sys.setlocale(category = "LC_ALL", locale = "english")
startTime <- as.Date("2017-10-01")
endTime <- as.Date("2018-06-30")
start_end <- c(startTime,endTime)
################################################## FALL LINE 1 #########################################################
logger1 <- read.csv("F1_17_18_167.csv",header=TRUE, sep=";")
logger1$date <- as.Date(logger1$Date, "%d.%m.%Y")
logger2 <- read.csv("F1_17_18_186.csv",header=TRUE, sep=";")
logger2$date <- as.Date(logger2$Date, "%d.%m.%Y")
logger3 <- read.csv("F1_17_18_031.csv",header=TRUE, sep=";")
logger3$date <- as.Date(logger3$Date, "%d.%m.%Y")
logger4 <- read.csv("F1_17_18_091.csv",header=TRUE, sep=";")
logger4$date <- as.Date(logger4$Date, "%d.%m.%Y")
logger5 <- read.csv("F1_17_18_294.csv",header=TRUE, sep=";")
logger5$date <- as.Date(logger5$Date, "%d.%m.%Y")
station <- read.csv("aggregates.csv",header=TRUE, sep=",")
station$date <- as.Date(station$Group.1, "%Y-%m-%d")
ggplot()+
geom_line(data = station, aes(x = date, y = Mean_snowheight ,color = "Mean Snowheight"),na.rm = TRUE, size = 1)+
scale_x_date(limits=start_end,breaks=date_breaks("1 month"),labels=date_format("%b %y"))+
scale_y_continuous(limits= c (0,115))
ggplot()+
geom_line(data = logger1, aes(x = date, y = AccuDist, color = "167 (mid-bottom)"),na.rm= TRUE, size = 1)+
geom_line(data = logger2, aes(x = date, y = AccuDist, color = "186 (top-middle)"),na.rm= TRUE, size = 1)+
geom_line(data = logger3, aes(x = date, y = AccuDist, color = "31 (top)"),na.rm= TRUE, size = 1)+
geom_line(data = logger4, aes(x = date, y = AccuDist, color = "91 (bottom)"),na.rm= TRUE, size = 1)+
geom_line(data = logger5, aes(x = date, y = AccuDist, color = "294 (middle)"),na.rm= TRUE, size = 1)+
geom_line(data = station, aes(x = date, y = Mean_snowheight*11.49 ,color = "Mean snowheight"),na.rm = TRUE, size = 1) +
ggtitle("Fall line 1") +
labs(color = "")+
xlab("Season 17/18")+
ylab("Accumulated Distance [mm]")+
scale_x_date(limits=start_end,breaks=date_breaks("1 month"),labels=date_format("%b %y"))+
scale_y_continuous(sec.axis = sec_axis(~./11.49,name = "Mean snowheight [cm]"),limits = c(0,1500))+
scale_color_manual("", guide = "legend",
values = c("167 (mid-bottom)"= "darkorange2",
"186 (top-middle)" = "darkgreen",
"31 (top)" = "red",
"91 (bottom)" = "blue",
"294 (middle)" = "purple",
"Mean snowheight" = "black"))+
theme(legend.position="bottom",
#legend.title = element_blank(),
axis.text.x = element_text(angle = 50, size = 10 , vjust = 0.5),
axis.text.y = element_text(size = 10, vjust = 0.5),
panel.background = element_rect(fill = "gray100"),
plot.background = element_rect(fill = "gray100"),
panel.grid.major = element_line(colour = "lightblue"),
plot.margin = unit(c(1, 1, 1, 1), "cm"),
plot.title = element_text(hjust = 0.5, size = 22))
You can see what graph this code produces:
If you ignore the environmental factor for a second (black line), you are left with the accumulated snow movement distance over the winter period for each logger (the colored lines). My aim is to fill the area that is always between the lowest and highest line.
Let me know if I need to upload the data somewhere. This is what the logger data looks like: data table.
Thanks in advance.
Regards,
Zorin
This is actually tougher than it seems at first. Your goal, as I understand it, is to fill in the area in your line plot between the "lowest" and the "highest" lines. This is made more difficult by the fact that the lowest and highest lines may change places throughout the plot, so you cannot simply fill between one of the logs and another. It's also made difficult by the fact that your x axis value is a date, and not all logs collect data on the same date and time.
First of all, I'll be ignoring a bit of your personal aesthetics you added and also removing the line you included for Mean snow height (from the dataframe station) for ease of showing you the solution I have.
Data Preparation
To begin, I noticed that you have included a geom_line() call for each individual logging station dataset (logger1 through logger5). While the method certainly works (and you do it in a way that gives you the solution you desire), it's much better practice to combine all logs into one dataset and this is going to be necessary in order for the solution I'm proposing to work anyway. Luckily, it's pretty simple to do this: just use rbind() to combine the datasets. Critically - you'll need to create a new column for each (called id here) that maintains the identity of the logging station of origin. You can then use that new id column as your color= aesthetic and draw all 5 lines using one geom_line() call.
One small problem I ran into is that your datasets had slightly different column names (some were caps, some were not...). They were all in the same order, so it wasn't too difficult to make them all the same before combining... it just added another step. Finally, I converted the date column to date format.
# create the id column
logger1$id <- 'logger1'
logger2$id <- 'logger2'
logger3$id <- 'logger3'
logger4$id <- 'logger4'
logger5$id <- 'logger5'
# fixing inconsistency in column names
my_column_names <- names(logger1)
names(logger2) <- my_column_names
names(logger3) <- my_column_names
names(logger4) <- my_column_names
names(logger5) <- my_column_names
# make one big df
loggers <- rbind(logger1, logger2, logger3, logger4, logger5)
loggers$date <- as.Date(loggers$date)
You can now recreate the plot more simply:
ggplot(loggers, aes(x=date, y=AccuDist)) + theme_bw() +
geom_line(aes(color=id), size=1)
Finding the Running Minimum and Maximum
In order to create the fill, I'm using geom_ribbon(), which needs aesthetics ymin and ymax. You have to set those first though, and they need to be "running minimum" and the "running maximum", which means they will change as you progress through the data. For this, I'm using two functions shown below min_vect() and max_vect().
# find the "running maximum"
max_vect <- function(ac) {
curr_max <- 0
return_vector <- vector(mode = 'numeric', length=length(ac))
for(i in 1:length(ac)) {
if(ac[i] > curr_max) {
curr_max <- ac[i]
}
return_vector[i] <- curr_max
}
return(return_vector)
}
# find the "running minimum"
min_vect <- function(ac) {
curr_min <- max(ac)
return_vector <- vector(mode = 'numeric', length=length(ac))
for(i in length(ac):1) {
if(ac[i] < curr_min) {
curr_min <- ac[i]
}
return_vector[i] <- curr_min
}
return(return_vector)
}
The idea is that for the maximum, you step through an (ordered) vector and if the number is higher than the previous maximum number, it becomes the new maximum. The same strategy is used for the running minimum, albeit we have to step through the ordered vector in reverse.
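For what it's worth, base R's cumulative functions should give the same result as the two loops without writing them by hand. This is a sketch on a made-up vector `ac` (not the logger data), and it assumes non-negative values, because max_vect() starts its comparison at 0.

```r
# Hypothetical, already-ordered vector of accumulated distances.
ac <- c(3, 1, 4, 1, 5, 9, 2, 6)
# Running maximum, left to right (matches max_vect() for non-negative input).
run_max <- cummax(ac)
# Running minimum taken from the right-hand end, as min_vect() does:
# reverse, take the cumulative minimum, reverse back.
run_min <- rev(cummin(rev(ac)))
print(run_max)  # 3 3 4 4 5 9 9 9
print(run_min)  # 1 1 1 1 2 2 2 6
```

With that, the two new columns could be filled with cummax(loggers$AccuDist) and rev(cummin(rev(loggers$AccuDist))) once the data is ordered.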
In order to apply the functions to create new columns, the dataset needs to be ordered first in order for it to work properly:
library(dplyr) # provides %>% and arrange()
# must arrange by date and time first!
loggers <- loggers %>% arrange(date, TIME)
# add your new columns
loggers$min_Accu <- min_vect(loggers$AccuDist)
loggers$max_Accu <- max_vect(loggers$AccuDist)
The Finale
And now, the plot. Basically it's the same, and I'm using geom_ribbon() as described above. As a bonus, I'm also using scale_color_discrete() to set the legend title and labels, just to show you that you can code that in afterwards (and it will still be easier than having separate geom_line() calls).
logger_list <- c('Log 1', 'Log 2', 'Log 3', 'Log 4', 'Log 5')
ggplot(loggers, aes(x=date, y=AccuDist)) +
theme_bw() +
geom_ribbon(aes(ymin=min_Accu, ymax=max_Accu), alpha=0.2) +
geom_line(aes(color=id), size=1) +
scale_color_discrete(name='Log ID Num', labels=logger_list)
I have been searching for 2 or 3 days now trying to find a resolution for my problem without success. I apologize if this is really easy or is already out there, but I can't find it. I have 3 data frames that can vary in length from 1 to 300. It is only possible to display about 60 values along the ggplot x-axis without it becoming unreadable and I don't want to omit values, so I am trying to find a way to calculate how long each data frame is and split it into "x" plots of no more than 60 each.
So far I have tried: facet_grid, facet_wrap, transform and split. "Split" works ok for splitting the data, but it adds an "X1." "X2." ... "Xn." to the front of the variable names (where n is the number of partitions it broke the data into). So when I call ggplot2, it can't find my original variable names ("Cost" and "Month") because they look like X1.Cost X1.Month, X2.Cost etc...How do I fix this?
I'm open to any suggestions, especially if I can fix both issues (not hard coding into 60 rows at a time AND breaking into graphs with smaller x-axis ranges). Thanks in advance for your patience and help.
Stephanie (desperate grad student)
Here is some stub code:
```{r setup, include=FALSE}
xsz <- 60 # would like not to have to hardcode this
ix1 <- seq(1:102) # would like to break into 2 or 3 approx equal graphs #
fcost <- sample(0:200, 102)
f.df <- data.frame("Cost" = fcost, "Month" = ix1)
fn <- nrow(f.df)
fr <- rep(1:ceiling(fn/xsz),each=xsz)[1:fn]
fd <- split(f.df,fr)
fc <- numeric(length(fd))
for (i in 1:length(fd)){
print(ggplot(as.data.frame(fd[i]), aes(Month, Cost)) +
geom_line(colour = "darkred", size = .5) +
geom_point(colour = "red", size = 1) +
labs(x = "Projected Future Costs (monthly)", y = "Dollars") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 60, vjust = .6)))
}
```
When I run it, I get:
Error in eval(expr, envir, enclos) : object 'Month' not found
When I do:
names(as.data.frame(fd[1]))
I get:
[1] "X1.Cost" "X1.Month"
Use [[]] for lists.
print(ggplot(as.data.frame(fd[[i]]), aes(Month, Cost)) +
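To see why the brackets matter, here is a tiny sketch with a made-up four-row data frame (the values are invented; only the column names follow the question). `single_names` and `double_names` are names I picked for illustration.

```r
f.df <- data.frame(Cost = c(10, 20, 30, 40), Month = 1:4)
fd <- split(f.df, c(1, 1, 2, 2))
# Single brackets return a list of length one; as.data.frame() then
# prefixes the column names with the name of that list element.
single_names <- names(as.data.frame(fd[1]))
# Double brackets return the data frame itself, with its names untouched.
double_names <- names(fd[[1]])
print(single_names)  # "X1.Cost" "X1.Month"
print(double_names)  # "Cost" "Month"
```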
To answer your other question, you have to create a new variable with a plot number. Here I'm using rep.
f.df$plot_number <- rep(1:ceiling(nrow(f.df)/60), each=60, length.out=nrow(f.df)) # ceiling(), not round(), so e.g. 80 rows still get 2 plots
Then, you create a list of plots in a loop
plots <- list() # new empty list
for (i in unique(f.df$plot_number)) {
p = ggplot(f.df[f.df$plot_number==i,], aes(Month, Cost)) +
geom_line(colour = "darkred", size = .5) +
geom_point(colour = "red", size = 1) +
labs(x = "Projected Future Costs (monthly)", y = "Dollars") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 60, vjust = .6))
plots[[paste0("p",i)]] <- p # add each plot into plot list
}
With package gridExtra, you can then arrange your plots in a single one.
library(gridExtra)
do.call("grid.arrange", c(plots, ncol=1))
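The question also asked for roughly equal graphs rather than one full plot of 60 followed by a short remainder. One possible way to compute such a plot number is sketched below; `max_per_plot`, `n_plots` and `plot_number` are illustrative names, and the 102 rows are taken from the stub code.

```r
max_per_plot <- 60   # readability limit from the question
n <- 102             # number of rows, e.g. nrow(f.df)
# Smallest number of plots that keeps every plot within the limit.
n_plots <- ceiling(n / max_per_plot)
# Spread the rows evenly over those plots instead of filling the first one to 60.
plot_number <- ceiling(seq_len(n) * n_plots / n)
print(table(plot_number))  # 51 rows in each of the 2 plots
```

Assigning this vector to f.df$plot_number should let the loop above run unchanged.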
I am currently generating heatmaps in R using the ggplot function. In the code below, I first read the data into a dataframe, remove any duplicate rows, factorise the timestamp field, melt the dataframe (according to 'timestamp'), scale all variables between 0 and 1, then plot the heatmap.
In the resulting heatmap, time is plotted on the x axis and each iostat-sda variable (see sample data below) is plotted along the y axis. Note: If you want to try out the R code – you can paste the sample data below into a file called iostat-sda.csv.
However, I really need to be able to cluster the rows within this heatmap. Does anyone know how this can be achieved using the ggplot function?
Any help would be very much appreciated!!
############################## The code
library(ggplot2)
library(reshape2) # melt()
library(plyr) # ddply()
library(scales) # rescale()
fileToAnalyse <- read.csv(file="iostat-sda.csv", header=TRUE, sep=",")
fileToAnalyse <- subset(fileToAnalyse, !duplicated(timestamp))
fileToAnalyse[,1] <- factor(fileToAnalyse[,1])
fileToAnalyse.m <- melt(fileToAnalyse, id=1)
fileToAnalyse.s <- ddply(fileToAnalyse.m, .(variable), transform, rescale = rescale(value) ) #scales each variable between 0 and 1
base_size <- 9
ggplot(fileToAnalyse.s, aes(timestamp, variable)) +
  geom_tile(aes(fill = rescale), colour = "black") +
  scale_fill_gradient(low = "black", high = "white") +
  theme_grey(base_size = base_size) + labs(title = "Heatmap", x = "Time", y = "") +
  theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks = element_blank()) + # opts()/theme_blank() no longer exist in current ggplot2
  scale_y_discrete(expand = c(0, 0)) + scale_x_discrete(expand = c(0, 0))
########################## Sample data from iostat-sda.csv
timestamp,DSKRRQM,DSKWRQM,DSKR,DSKW,DSKRMB,DSKWMB,DSKARQS,DSKAQUS,DSKAWAIT,DSKSVCTM,DSKUtil
1319204905,0.33,0.98,10.35,2.37,0.72,0.02,120.00,0.01,0.40,0.31,0.39
1319204906,1.00,4841.00,682.00,489.00,60.09,40.68,176.23,2.91,2.42,0.50,59.00
1319204907,0.00,1600.00,293.00,192.00,32.64,13.89,196.45,5.48,10.76,2.04,99.00
1319204908,0.00,3309.00,1807.00,304.00,217.39,26.82,236.93,4.84,2.41,0.45,96.00
1319204909,0.00,5110.00,93.00,427.00,0.72,43.31,173.43,4.43,8.67,1.90,99.00
1319204910,0.00,6345.00,115.00,496.00,0.96,52.25,178.34,4.00,6.32,1.62,99.00
1319204911,0.00,6793.00,129.00,666.00,1.33,57.22,150.83,4.74,6.16,1.26,100.00
1319204912,0.00,6444.00,115.00,500.00,0.93,53.06,179.77,4.20,6.83,1.58,97.00
1319204913,0.00,1923.00,835.00,215.00,78.45,16.68,185.55,4.81,4.58,0.91,96.00
1319204914,0.00,0.00,788.00,0.00,83.51,0.00,217.04,0.45,0.57,0.25,20.00
1319204915,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1319204916,0.00,4.00,2.00,4.00,0.01,0.04,17.67,0.00,0.00,0.00,0.00
1319204917,0.00,8.00,4.00,8.00,0.02,0.09,17.83,0.00,0.00,0.00,0.00
1319204918,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1319204919,0.00,2.00,113.00,4.00,11.96,0.03,209.93,0.06,0.51,0.43,5.00
1319204920,0.00,59.00,147.00,54.00,11.15,0.63,120.02,0.04,0.20,0.15,3.00
1319204921,1.00,19.00,57.00,18.00,4.68,0.20,133.47,0.07,0.93,0.67,5.00
There is a nice package called NeatMap which simplifies generating heatmaps in ggplot2. Some of the row clustering methods include Multidimensional Scaling, PCA, or hierarchical clustering. Things to watch out for are:
Data to make.heatmap1 has to be in wide format
Data has to be a matrix, not a dataframe
Assign rownames to the wide-format matrix before generating the plot
I've changed your code slightly to avoid naming variables the same as existing functions (e.g. rescale)
library(NeatMap)
fileToAnalyse.s <- ddply(fileToAnalyse.m, .(variable), transform, rescale.x = rescale(value) ) #scales each variable between 0 and 1
fileToAnalyse.w <- dcast(fileToAnalyse.s, timestamp ~ variable, value.var="rescale.x")
rownames(fileToAnalyse.w) <- as.character(fileToAnalyse.w[, 1])
ggheatmap <- make.heatmap1(as.matrix(fileToAnalyse.w[, -1]), row.method = "complete.linkage", row.metric="euclidean", column.cluster.method ="none", row.labels = rownames(fileToAnalyse.w)) +
scale_fill_gradient(low = "black", high = "white") + labs(title = "Heatmap", x = "Time", y = "")
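If NeatMap is not available, a commonly used alternative is to cluster with base R and hand the resulting order to ggplot2 as factor levels. Note that this is a swapped-in technique and not what the code above does. The sketch uses a made-up matrix `m` (rows are the variables on the y axis, columns are time points); `m`, `hc` and `row_order` are illustrative names and the numbers are invented.

```r
# Made-up scaled data: 4 variables (rows) observed at 5 time points (columns).
m <- rbind(
  DSKR   = c(0.0, 0.4, 0.2, 1.0, 0.1),
  DSKW   = c(0.0, 0.7, 0.3, 0.5, 0.9),
  DSKRMB = c(0.0, 0.3, 0.2, 1.0, 0.0),
  DSKWMB = c(0.0, 0.8, 0.3, 0.5, 1.0)
)
# Hierarchical clustering of the rows (complete linkage, Euclidean distance).
hc <- hclust(dist(m, method = "euclidean"), method = "complete")
# Row names in dendrogram order; similar rows end up next to each other.
row_order <- rownames(m)[hc$order]
print(row_order)
# In the long-format data used for geom_tile(), the order could be applied with e.g.
# fileToAnalyse.s$variable <- factor(fileToAnalyse.s$variable, levels = row_order)
```

Because geom_tile() follows the factor levels on a discrete axis, the rows of the heatmap would then appear in cluster order, though without a dendrogram drawn next to them.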
I'm having an issue finding out how to calculate an average over "x" days. If I try to plot this csv file over 1 year, it's too much data to display correctly on a plot line (screenshot attached). I'm looking to average the data over every few days (maybe 2, a week, etc..) so the line graph is not so hard to read. Any advice on how I would solve this issue with R?
results.csv
POSTS,PROVIDER,TYPE,DATE
29337,FTP,BLOG,2010-01-01
26725,FTP,BLOG,2010-01-02
27480,FTP,BLOG,2010-01-03
31187,FTP,BLOG,2010-01-04
31488,FTP,BLOG,2010-01-05
32461,FTP,BLOG,2010-01-06
33675,FTP,BLOG,2010-01-07
38897,FTP,BLOG,2010-01-08
37122,FTP,BLOG,2010-01-09
41365,FTP,BLOG,2010-01-10
51760,FTP,BLOG,2010-01-11
50859,FTP,BLOG,2010-01-12
53765,FTP,BLOG,2010-01-13
56836,FTP,BLOG,2010-01-14
59698,FTP,BLOG,2010-01-15
52095,FTP,BLOG,2010-01-16
57154,FTP,BLOG,2010-01-17
80755,FTP,BLOG,2010-01-18
227464,FTP,BLOG,2010-01-19
394510,FTP,BLOG,2010-01-20
371303,FTP,BLOG,2010-01-21
370450,FTP,BLOG,2010-01-22
268703,FTP,BLOG,2010-01-23
267252,FTP,BLOG,2010-01-24
375712,FTP,BLOG,2010-01-25
381041,FTP,BLOG,2010-01-26
380948,FTP,BLOG,2010-01-27
373140,FTP,BLOG,2010-01-28
361874,FTP,BLOG,2010-01-29
265178,FTP,BLOG,2010-01-30
269929,FTP,BLOG,2010-01-31
R Script
library(ggplot2);
data <- read.csv("results.csv", header=T);
dts <- as.POSIXct(data$DATE, format="%Y-%m-%d");
attach(data);
a <- ggplot(dataframe, aes(dts,POSTS/1000, fill = TYPE)) + opts(title = "Report") + labs(x = NULL, y = "Posts (k)", fill = NULL);
b <- a + geom_bar(stat = "identity", position = "stack");
plot_theme <- theme_update(axis.text.x = theme_text(angle=90, hjust=1), panel.grid.major = theme_line(colour = "grey90"), panel.grid.minor = theme_blank(), panel.background = theme_blank(), axis.ticks = theme_blank(), legend.position = "none");
c <- b + facet_grid(TYPE ~ ., scale = "free_y");
d <- c + scale_x_datetime(major = "1 months", format = "%Y %b");
ggsave(filename="/root/results.png",height=14,width=14,dpi=600);
Graph Image
Try this :
Average <- function(Data,n){
# Make an index to be used for aggregating
ID <- as.numeric(as.factor(Data$DATE))-1
ID <- ID %/% n
# aggregate over ID and TYPE for all numeric data.
out <- aggregate(Data[sapply(Data,is.numeric)],
by=list(ID,Data$TYPE),
FUN=mean)
# format output
names(out)[1:2] <-c("dts","TYPE")
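# add the correct dates as the beginning of every period
# (index into the sorted unique dates, because ID was built from the factor levels of DATE)
out$dts <- as.POSIXct(levels(as.factor(Data$DATE))[(out$dts*n)+1])
out
}
Data <- read.csv("results.csv", header=TRUE) # the question's data, under a name that does not mask data()
dataframe <- Average(Data,3)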
This works with the plot script you gave.
Some remarks :
never name a variable after a function (data, c, ...)
avoid attach(). If you do use it, call detach() afterwards, or you'll get into trouble at some point. It is better to use the functions with() and within().
The TTR package also has several moving average functions that will do this with a single statement:
library(TTR)
mavg.3day <- SMA(data$POSTS, n=3) # Simple moving average
Substitute a different value of 'n' for your desired moving average length.
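If you would rather not install a package, base R's stats::filter() can compute a trailing moving average as well, which, as far as I know, is what SMA() returns too. A minimal sketch on toy numbers follows; `posts` and `mavg3` are made-up names standing in for data$POSTS and the result.

```r
posts <- c(1, 2, 3, 4, 5)   # toy values standing in for data$POSTS
# Trailing 3-point moving average: each value is the mean of itself and the
# two preceding observations; the first two positions have no full window.
mavg3 <- as.numeric(stats::filter(posts, rep(1/3, 3), sides = 1))
print(mavg3)  # NA NA 2 3 4
```

Unlike the Average() function above, this keeps one value per day and only smooths the line. It does not reduce the number of points.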