TL;DR: I would like to generate a markdown document with plots of the normalized counts for a list of genes. As this list is quite long (> 100 genes), I would like to draw a 4x4 grid of graphs on one figure page for the first 16 genes, then the same for genes 17 to 32, and so on until the end of the list. Currently, my code only displays 16 genes, even though I run the grid.arrange command INSIDE the loop (it worked when I used the "plot" function inside the loop, but that displays only one graph per page, of course).
I'm currently doing some RNAseq analysis and I'm looking for differentially expressed genes (DEGs) between two populations. To get a more visual representation of the DEGs, I would like to plot the normalized counts for each population for some genes of interest (GOIs).
However, the list of these GOIs can be quite long (for example, if I focus on DEGs that code for membrane proteins, I have some 159 candidates). I'm able to plot them using a for loop, with the following code (from a first analysis):
# top gene contains all of Gene of Interest
# group_origin contains the factor used to discriminate cell population to compare
# dds_g is a dds object from DESeq2 and contained counts number I want to plot
for (i in unique(1:length(top_gene))){
gene <- top_gene[i]
d <- plotCounts(dds_g, gene = gene, intgroup = "group_origin", returnData = TRUE)
b <- ggplot(d, aes(x = group_origin, y = count)) +
stat_boxplot(geom = 'errorbar', aes(colour = factor(group_origin)), width = 0.2) +
geom_boxplot(aes(colour = factor(group_origin)), width = 0.2) +
stat_summary(fun.y=mean, geom="point", shape=17, size=1.5, aes(color=factor(group_origin), fill=factor(group_origin))) +
labs (title = paste0(resg_cb_db$symbol[gene],' (',gene,')'), x = element_blank()) +
theme_bw() +
scale_color_manual(values = mycolors) +
theme(text= element_text(size=10),
axis.text.x = element_text(size = 7.5, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
legend.position = "none")
plot(b)
}
With this approach, I'm able to generate (separately) the plots for all the gene IDs contained in my "top_gene" vector.
However, the markdown creates one plot per page, which is annoying when you have a high number of genes. I would like to group them, e.g., genes 1 to 16 in the list are plotted together on one page, then genes 17 to 32, etc.
I've tried par(mfrow) without success. I also tried gridExtra with the following code (from another analysis example):
for (i in unique(1:length(top_gene))){
gene <- top_gene[i]
d <- plotCounts(dds, gene = gene, intgroup = "cell_type", returnData = TRUE)
b <- ggplot(d, aes(x = cell_type, y = count)) +
stat_boxplot(geom = 'errorbar', aes(colour = factor(cell_type)), width = 0.2) +
geom_boxplot(aes(colour = factor(cell_type)), width = 0.2) +
stat_summary(fun.y=mean, geom="point", shape=17, size=1.5, aes(color=factor(cell_type), fill=factor(cell_type))) +
labs (title = paste0(resg$symbol[gene],' (',gene,')'), x = element_blank()) +
theme_bw() +
scale_color_manual(values = mycolors) +
theme(text= element_text(size=7),
axis.text.x = element_text(size = 7, angle = 45, hjust = 1),
axis.text.y = element_text(size = 6),
legend.position = "none")
plot_list[[gene]] <- b
}
t <- length(plot_list)
x <- seq(1, t, by = 15)
for (i in x){
z <- i+1
if (!is.na(x[z])) {
test_list <- plot_list[c(x[i]:x[z])]
do.call(grid.arrange,test_list)
}
else {
z <- length(plot_list)
test_list <- plot_list[c(x[i]:z)]
do.call(grid.arrange,test_list)
}
}
which gives me the error "Error in x[i]:z : argument NA / NaN".
The idea here was to plot a 4x4 page of graphs for every 16 genes. Of course, when it reaches the last group, fewer than 16 genes remain (unless the total number of genes is divisible by 16). That is why the if/else is there, to prevent the error (but it's not working).
I've also tried removing this last part and generating only the first "9 x 16 genes" of my list, discarding the last ones. It "works" in that I can clearly see the first 16 genes plotted in a 4 x 4 grid, but nothing for the rest.
Why does plotting inside a for loop with "plot(b)" work, but not grid.arrange on a list() that was also created inside a for loop?
Very sorry for my code, I know it's not perfect (I'm still learning all of this) but I hope it's clear enough for you to understand my question...
Thanks!
!EDIT! : solved the first error by adding (i in length(x)). Feel stupid :D. Anyway, it's still only "printing" 16 plots instead of 159...
I think there is some confusion in your for loop. When you write for (i in x), each iteration gives you the values in x (1, 16, 31, 46, ...), not the index numbers, so when you set z <- i + 1 you get the values (2, 17, 32, ...). This makes x[i] and x[z] NA for most values of i and z.
The for loop below should fix the indexing and prevent you from going beyond the length of plot_list in your edge case.
for (i in 1:length(x)){ #replaced this line
z <- i+1
if (!is.na(x[z])) {
# changed this to make sure x[z] doesn't show up on two plots
test_list <- plot_list[x[i]:(x[z]-1)]
do.call(grid.arrange,test_list)
}
else {
z <- length(plot_list)
test_list <- plot_list[x[i]:z]
do.call(grid.arrange,test_list)
}
}
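As a possible alternative to tracking x[i] and x[z] by hand, the list can be cut into pages up front. The sketch below is only an illustration: plot_list is filled with placeholder strings instead of ggplot objects, and page_size, page_id and pages are names made up for this example. Note also that seq(1, t, by = 15) puts 15 plots on a page; a full 4x4 grid needs a step of 16.

```r
# Placeholder stand-in for the real plot_list (159 ggplot objects in the question)
plot_list <- as.list(paste0("plot_", 1:159))

page_size <- 16                                        # 4 x 4 grid
page_id   <- ceiling(seq_along(plot_list) / page_size) # 1,1,...,1,2,2,...
pages     <- split(plot_list, page_id)                 # list of pages

length(pages)   # number of pages
lengths(pages)  # plots per page; the last page may hold fewer than 16

# With real ggplot objects, each page could then be drawn along these lines:
# for (pg in pages) gridExtra::grid.arrange(grobs = pg, nrow = 4, ncol = 4)
```

Because split() simply groups whatever is there, the last, shorter page needs no special case.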
I have a dataframe containing some comparisons, where each value represents the similarity between two objects. I have a real object compared to some random ones, which leads to very small similarity values. I also compared random objects against random objects, which leads to higher similarity values. At this point I want to put everything together and plot it as a heatmap. The problem is that the very small similarity values, which I want to highlight, get the same colour as the not-so-small values from the random-random comparison. Of course this is a problem of scale, but I don't know how to manage the colour scale. The following code generates a heatmap that shows the issue. Here, the first column has a yellowish colour, which is fine, but this is the same colour as other tiles which have higher, non-comparable values. How can I colour the tiles according to the actual scale?
The code:
set.seed(131)
#number of comparisons in the original data: 1 value versus n=10
n <- 10
#generate real data (very small values)
fakeRealData <- runif(n, min=0.00000000000001, max=0.00000000000002)
#and create the data structure
realD <- cbind.data.frame(rowS=rep("fakeRealData", n), colS=paste("rnd", seq(1, n, by=1), sep=" "), Similarity=fakeRealData, stringsAsFactors=F)
#the same for random data, n=10 random comparisons make for a n by n matrix
rndN <- n*n
randomData <- data.frame(matrix(runif(rndN), nrow=n, ncol=n))
rowS <- vector()
#for each column of randomData
for (r in seq(1, n, by=1)) {
#create a vector of the first rowname, then the second, the third, etc etc which is long as the number of columns
rowS <- append(rowS, rep(paste("rnd", r, sep=" "), n))
}
#and create the random data structure
randomPVs <- cbind.data.frame(rowS=rowS, colS=rep(paste("rnd", seq(1, n, by=1), sep=" "), n), Similarity=unlist(randomData), stringsAsFactors=F)
#eventually put everything together
everything <- rbind.data.frame(randomPVs, realD)
#and finally plot the heatmap
heaT <- ggplot(everything, aes(rowS, colS, fill=Similarity)) +
geom_tile() +
scale_fill_distiller(palette = "YlGn", direction=2) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
xlab("")+
ylab("")
plot(heaT)
Here are three approaches:
Add geom_text to your plot to show the values when color differences are small.
heaT <- ggplot(everything, aes(rowS, colS)) +
geom_tile(aes(fill=Similarity)) +
scale_fill_distiller(palette = "YlGn", direction=2) +
geom_text(aes(label = round(Similarity, 2))) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
ylab("")
Use the values argument to set a nonlinear scale in scale_fill_distiller. I added an extra break point at 0.01 to the otherwise linear scale to accentuate the difference between 0 and small nonzero numbers. I left the rest of the scale linear.
heaT <- ggplot(everything, aes(rowS, colS)) +
geom_tile(aes(fill=Similarity)) +
scale_fill_distiller(palette = "YlGn", direction=2,
values = c(0, 0.01, seq(0.05, 1, 0.05))) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
ylab("")
Transform your scale as Richard mentioned in the comments. Note that this will mess with the values in the legend, so either rename it or hide it.
heaT <- ggplot(everything, aes(rowS, colS)) +
geom_tile(aes(fill=Similarity)) +
scale_fill_distiller(palette = "YlGn", direction=2, trans = "log10",
name = "log10(Similarity)") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
xlab("")+
ylab("")
Try combinations of these approaches and see what you like.
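To see why the log transform helps, it can be useful to look at where a few values land once they are rescaled to the 0-1 range from which a continuous fill scale picks its colours. This is a rough sketch in base R, not ggplot2's actual internals; vals and rescale01 are made up for illustration.

```r
# A few illustrative similarities: two tiny "real" values and three "random" ones
vals <- c(1e-14, 2e-14, 0.01, 0.5, 1)

# Map values onto [0, 1] (roughly what a continuous colour scale does first)
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

linear_pos <- rescale01(vals)         # tiny values and 0.01 all sit close to 0
log_pos    <- rescale01(log10(vals))  # tiny values are pulled far away from 0.01

round(linear_pos, 3)
round(log_pos, 3)
```

On the linear scale the tiny values and 0.01 are almost at the same position, so they get almost the same colour; on the log10 scale they are far apart, at the cost of squeezing the larger values together.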
I am trying to visualize time series data. I have a set of 5 loggers that record snow movement distance, and an environmental variable that possibly has an effect on the snow movement. That is why it is meaningful to graph them together, to see whether snow movement (detected by the loggers) is influenced by these environmental factors. I have one file containing the weather station data and one file for each of the 5 loggers. The weather data is measured every 5 minutes, while the loggers measure whenever there is movement! I have managed to visualize them together so far; however, my professor wants me to show the loggers as a group by presenting a gray area (instead of 5 lines) that always spans the minimum and maximum value of the loggers at a given time point. I am using ggplot2. I've tried to make such an area with geom_ribbon, but it is not so straightforward with my dataset. The lines cross, and the loggers that have the min and max value often switch. I don't know if joining them in a single dataset would help, but this does not seem possible either because they don't have the same length. Furthermore, it's not as if all 5 loggers have measurements at the same time. They log only when there is movement. Here is my code and the graph that it creates. Unfortunately, I am not sure how to reproduce the data. I am more than glad to share it with you somehow.
#install.packages("patchwork")
library(ggplot2)
library(scales)
library(patchwork)
Sys.setlocale(category = "LC_ALL", locale = "english")
startTime <- as.Date("2017-10-01")
endTime <- as.Date("2018-06-30")
start_end <- c(startTime,endTime)
################################################## FALL LINE 1 #########################################################
logger1 <- read.csv("F1_17_18_167.csv",header=TRUE, sep=";")
logger1$date <- as.Date(logger1$Date, "%d.%m.%Y")
logger2 <- read.csv("F1_17_18_186.csv",header=TRUE, sep=";")
logger2$date <- as.Date(logger2$Date, "%d.%m.%Y")
logger3 <- read.csv("F1_17_18_031.csv",header=TRUE, sep=";")
logger3$date <- as.Date(logger3$Date, "%d.%m.%Y")
logger4 <- read.csv("F1_17_18_091.csv",header=TRUE, sep=";")
logger4$date <- as.Date(logger4$Date, "%d.%m.%Y")
logger5 <- read.csv("F1_17_18_294.csv",header=TRUE, sep=";")
logger5$date <- as.Date(logger5$Date, "%d.%m.%Y")
station <- read.csv("aggregates.csv",header=TRUE, sep=",")
station$date <- as.Date(station$Group.1, "%Y-%m-%d")
ggplot()+
geom_line(data = station, aes(x = date, y = Mean_snowheight ,color = "Mean Snowheight"),na.rm = TRUE, size = 1)+
scale_x_date(limits=start_end,breaks=date_breaks("1 month"),labels=date_format("%b %y"))+
scale_y_continuous(limits= c (0,115))
ggplot()+
geom_line(data = logger1, aes(x = date, y = AccuDist, color = "167 (mid-bottom)"),na.rm= TRUE, size = 1)+
geom_line(data = logger2, aes(x = date, y = AccuDist, color = "186 (top-middle)"),na.rm= TRUE, size = 1)+
geom_line(data = logger3, aes(x = date, y = AccuDist, color = "31 (top)"),na.rm= TRUE, size = 1)+
geom_line(data = logger4, aes(x = date, y = AccuDist, color = "91 (bottom)"),na.rm= TRUE, size = 1)+
geom_line(data = logger5, aes(x = date, y = AccuDist, color = "294 (middle)"),na.rm= TRUE, size = 1)+
geom_line(data = station, aes(x = date, y = Mean_snowheight*11.49 ,color = "Mean snowheight"),na.rm = TRUE, size = 1) +
ggtitle("Fall line 1") +
labs(color = "")+
xlab("Season 17/18")+
ylab("Accumulated Distance [mm]")+
scale_x_date(limits=start_end,breaks=date_breaks("1 month"),labels=date_format("%b %y"))+
scale_y_continuous(sec.axis = sec_axis(~./11.49,name = "Mean snowheight [cm]"),limits = c(0,1500))+
scale_color_manual("", guide = "legend",
values = c("167 (mid-bottom)"= "darkorange2",
"186 (top-middle)" = "darkgreen",
"31 (top)" = "red",
"91 (bottom)" = "blue",
"294 (middle)" = "purple",
"Mean snowheight" = "black"))+
theme(legend.position="bottom",
#legend.title = element_blank(),
axis.text.x = element_text(angle = 50, size = 10 , vjust = 0.5),
axis.text.y = element_text(size = 10, vjust = 0.5),
panel.background = element_rect(fill = "gray100"),
plot.background = element_rect(fill = "gray100"),
panel.grid.major = element_line(colour = "lightblue"),
plot.margin = unit(c(1, 1, 1, 1), "cm"),
plot.title = element_text(hjust = 0.5, size = 22))
You can see what graph this code produces:
If you ignore the environmental factor for a second (black line), you are left with the accumulated snow movement distance over the winter period for each logger (the colored lines). My aim is to fill the area that is always between the lowest and highest line.
Let me know if I need to upload the data somewhere. This is what the logger data looks like: data table.
Thanks in advance.
Regards,
Zorin
This is actually tougher than it seems at first. Your goal, as I understand it, is to fill in the area in your line plot between the "lowest" and the "highest" lines. This is made more difficult by the fact that the lowest and highest lines may change places throughout the plot, so you cannot simply choose to plot between one of the logs and another log. It's also made difficult by the fact that your x axis value is a date, and not all logs collect data on the same date and time.
First of all, I'll be ignoring a bit of your personal aesthetics you added and also removing the line you included for Mean snow height (from the dataframe station) for ease of showing you the solution I have.
Data Preparation
To begin, I noticed that you have included a geom_line() call for each individual logging station dataset (logger1 through logger5). While the method certainly works (and you do it in a way that gives you the solution you desire), it's much better practice to combine all logs into one dataset and this is going to be necessary in order for the solution I'm proposing to work anyway. Luckily, it's pretty simple to do this: just use rbind() to combine the datasets. Critically - you'll need to create a new column for each (called id here) that maintains the identity of the logging station of origin. You can then use that new id column as your color= aesthetic and draw all 5 lines using one geom_line() call.
One small problem I ran into is that your datasets had slightly different column names (some were caps, some were not...). They were all in the same order, so it wasn't too difficult to make them all the same before combining... it just added another step. Finally, I converted the date column to date format.
# create the id column
logger1$id <- 'logger1'
logger2$id <- 'logger2'
logger3$id <- 'logger3'
logger4$id <- 'logger4'
logger5$id <- 'logger5'
# fixing inconsistency in column names
my_column_names <- names(logger1)
names(logger2) <- my_column_names
names(logger3) <- my_column_names
names(logger4) <- my_column_names
names(logger5) <- my_column_names
# make one big df
loggers <- rbind(logger1, logger2, logger3, logger4, logger5)
loggers$date <- as.Date(loggers$date)
You can now recreate the plot in a more simple way:
ggplot(loggers, aes(x=date, y=AccuDist)) + theme_bw() +
geom_line(aes(color=id), size=1)
Finding the Running Minimum and Maximum
In order to create the fill, I'm using geom_ribbon(), which needs aesthetics ymin and ymax. You have to set those first though, and they need to be "running minimum" and the "running maximum", which means they will change as you progress through the data. For this, I'm using two functions shown below min_vect() and max_vect().
# find the "running maximum"
max_vect <- function(ac) {
curr_max <- 0
return_vector <- vector(mode = 'numeric', length=length(ac))
for(i in 1:length(ac)) {
if(ac[i] > curr_max) {
curr_max <- ac[i]
}
return_vector[i] <- curr_max
}
return(return_vector)
}
# find the "running minimum"
min_vect <- function(ac) {
curr_min <- max(ac)
return_vector <- vector(mode = 'numeric', length=length(ac))
for(i in length(ac):1) {
if(ac[i] < curr_min) {
curr_min <- ac[i]
}
return_vector[i] <- curr_min
}
return(return_vector)
}
The idea is that for the maximum, you step through an (ordered) vector and if the number is higher than the previous maximum number, it becomes the new maximum. The same strategy is used for the running minimum, albeit we have to step through the ordered vector in reverse.
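For what it's worth, base R's cumulative functions appear to give the same results more compactly. The sketch below uses a made-up vector ac, not the logger data, and assumes non-negative values (max_vect() above starts its running maximum at 0).

```r
# Made-up accumulated distances, already ordered in time
ac <- c(0, 3, 2, 5, 4, 9, 7, 12)

# Running maximum: largest value seen so far, walking forwards
running_max <- cummax(ac)

# Running minimum as in min_vect(): smallest value from here to the end,
# i.e. a cumulative minimum taken on the reversed vector, then reversed back
running_min <- rev(cummin(rev(ac)))

running_max
running_min
```

Both vectors are non-decreasing and running_min never exceeds running_max, which is what geom_ribbon() needs for ymin and ymax.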
In order to apply the functions to create new columns, the dataset needs to be ordered first in order for it to work properly:
# must arrange by date and time first! (%>% and arrange() come from dplyr)
library(dplyr)
loggers <- loggers %>% arrange(date, TIME)
# add your new columns
loggers$min_Accu <- min_vect(loggers$AccuDist)
loggers$max_Accu <- max_vect(loggers$AccuDist)
The Finale
And now, the plot. Basically it's the same as before, with geom_ribbon() added as described above. As a bonus, I'm also using scale_color_discrete() to set the legend title and labels, just to show you that you can code that in afterwards (and it will still be easier than having separate geom_line() calls).
logger_list <- c('Log 1', 'Log 2', 'Log 3', 'Log 4', 'Log 5')
ggplot(loggers, aes(x=date, y=AccuDist)) +
theme_bw() +
geom_ribbon(aes(ymin=min_Accu, ymax=max_Accu), alpha=0.2) +
geom_line(aes(color=id), size=1) +
scale_color_discrete(name='Log ID Num', labels=logger_list)
using ggplot2 inside for loop doesn't show the plots names
I want to print the plots on one page and save them.
I have 60 csv files; each consists of two columns (the first is the date, the second is SSH) and each has a different number of rows. I list them in a variable called files and then plot them. The for loop produces 60 plots. The problem is how to know the names of those plots so that I can call them when I want, e.g., to print 4 plots on one page, so that at the end I have 15 pages each containing 4 plots.
When I used
library(gridExtra)
grid.arrange(p1,p2,p3,p4, nrow=2,ncol=2)
grid.arrange(p5,p6,p7,p8, nrow=2,ncol=2)
and so on until p60,
it showed no result. The messages said there is no object called p1, p2, p3, p4, ..., p60.
The code is as follows:
files<- list.files("F:/R Practice/time series")
for (i in seq_along(files)){
mydf <- read.csv(files[i], stringsAsFactors=FALSE)
a<- data.frame(as.Date(mydf$date, "%d-%m-%y"),mydf[,-1])
names(a)[1]<- "Date"
names(a)[2]<- "SSH"
b <- zoo(a[,-1],order.by=as.Date(a[,1]))
p<- ggplot(a, aes(x=Date,y=SSH, color=SSH)) +geom_line(colour="darkblue")# +labs(title = "gridcell of (31.25N, 33.25E) ",x="Date", y="SSH")
p<-p+ggtitle(readline(prompt = "enter cell coordinates: "))+xlab("Year")+ylab("weighted average SSH of center cell")
# adding atrribute to the plot p[i]
p<-p+theme(axis.title.x=element_text(color = "black",size = 12),
axis.title.y=element_text(color = "black",size = 8),
axis.text.x=element_text(color="black",size=8),
axis.text.y = element_text(color = "black",size = 8),
panel.background = element_rect(colour = "black", size=0.5 ,fill = NA),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
axis.line = element_line(colour = "black",size=1),
legend.position="none" ,
plot.title=element_text(hjust = 0.5,vjust= 0.5,lineheight = .3,face = "bold"))
print(p)
}
Your code is overwriting p each time to the last plot. Here is simpler code using the examples from the ggplot2 help example but with a loop like yours.
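library(ggplot2)
library(dplyr)      # sample_frac()
library(gridExtra)  # grid.arrange()

df <- data.frame(
  gp = factor(rep(letters[1:3], each = 10)),
  y = rnorm(30)
)
p.list <- list()  # start with an empty list
for (i in 1:4)
{
  # draw a fresh half-sample each time instead of shrinking df itself
  sub <- sample_frac(df, 0.5)
  # per-group means of the sample (the "ds" of the ggplot2 help example)
  ds <- aggregate(y ~ gp, data = sub, FUN = mean)
  names(ds)[2] <- "mean"
  p <- ggplot(sub, aes(gp, y)) +
    geom_point() +
    geom_point(data = ds, aes(y = mean), colour = 'red', size = 3)
  p.list[[i]] <- p  # store this iteration's plot
}
grid.arrange(p.list[[1]], p.list[[2]], p.list[[3]], p.list[[4]], nrow = 2)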
Note two things:
p.list begins as an empty list.
At the end of each loop you add a plot to a numbered element of a growing list i.e. p.list[[i]] <- p
p.list grows to become a list with each numbered element (1:4) holding a single plot, each plot also being a named list.
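If the plots should also be retrievable by name, one option is to name the list elements after the files. This is a sketch under assumptions: the file names are invented, and placeholder strings stand in for the ggplot objects built in the loop.

```r
# Hypothetical file names standing in for list.files("F:/R Practice/time series")
files <- sprintf("cell_%02d.csv", 1:60)

# Pre-allocate the list and name each slot after its file (extension dropped)
p.list <- vector("list", length(files))
names(p.list) <- tools::file_path_sans_ext(files)

for (i in seq_along(files)) {
  # placeholder for the ggplot object built from files[i]
  p.list[[i]] <- paste("plot for", files[i])
}

p.list[["cell_07"]]  # any plot can now be fetched by name

# Group the names into pages of 4 (15 pages for 60 plots)
pages <- split(names(p.list), ceiling(seq_along(p.list) / 4))

# With real plots, page 1 could then be drawn along these lines:
# gridExtra::grid.arrange(grobs = p.list[pages[[1]]], nrow = 2, ncol = 2)
```

This avoids having to create 60 separate objects p1 to p60.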
I have been searching for 2 or 3 days now trying to find a resolution for my problem without success. I apologize if this is really easy or is already out there, but I can't find it. I have 3 data frames that can vary in length from 1 to 300. It is only possible to display about 60 values along the ggplot x-axis without it becoming unreadable and I don't want to omit values, so I am trying to find a way to calculate how long each data frame is and split it into "x" plots of no more than 60 each.
So far I have tried: facet_grid, facet_wrap, transform and split. "Split" works ok for splitting the data, but it adds an "X1." "X2." ... "Xn." to the front of the variable names (where n is the number of partitions it broke the data into). So when I call ggplot2, it can't find my original variable names ("Cost" and "Month") because they look like X1.Cost X1.Month, X2.Cost etc...How do I fix this?
I'm open to any suggestions, especially if I can fix both issues (not hard coding into 60 rows at a time AND breaking into graphs with smaller x-axis ranges). Thanks in advance for your patience and help.
Stephanie (desperate grad student)
Here is some stub code:
```{r setup, include=FALSE}
xsz <- 60 # would like not to have to hardcode this
ix1 <- seq(1:102) # would like to break into 2 or 3 approx equal graphs #
fcost <- sample(0:200, 102)
f.df <- data.frame("Cost" = fcost, "Month" = ix1)
fn <- nrow(f.df)
fr <- rep(1:ceiling(fn/xsz),each=xsz)[1:fn]
fd <- split(f.df,fr)
fc <- numeric(length(fd))
for (i in 1:length(fd)){
print(ggplot(as.data.frame(fd[i]), aes(Month, Cost)) +
geom_line(colour = "darkred", size = .5) +
geom_point(colour = "red", size = 1) +
labs(x = "Projected Future Costs (monthly)", y = "Dollars") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 60, vjust = .6)))
}
```
When I run it, I get:
Error in eval(expr, envir, enclos) : object 'Month' not found
When I do:
names(as.data.frame(fd[1]))
I get:
[1] "X1.Cost" "X1.Month"
Use [[]] for lists.
print(ggplot(as.data.frame(fd[[i]]), aes(Month, Cost)) +
To answer your other question, you have to create a new variable with a plot number. Here I'm using rep.
f.df$plot_number <- rep(1:ceiling(nrow(f.df)/60), each=60, length.out=nrow(f.df)) # ceiling, not round, so a short last chunk still gets its own plot number
Then, you create a list of plots in a loop
plots <- list() # new empty list
for (i in unique(f.df$plot_number)) {
p = ggplot(f.df[f.df$plot_number==i,], aes(Month, Cost)) +
geom_line(colour = "darkred", size = .5) +
geom_point(colour = "red", size = 1) +
labs(x = "Projected Future Costs (monthly)", y = "Dollars") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 60, vjust = .6))
plots[[paste0("p",i)]] <- p # add each plot into plot list
}
With package gridExtra, you can then arrange your plots in a single one.
library(gridExtra)
do.call("grid.arrange", c(plots, ncol=1))
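The question also asked for roughly equal chunks rather than a hard-coded 60 rows followed by a short remainder. One way to get that, sketched here with made-up names (n, max_per_plot, n_plots), is to first work out how many plots are needed and then spread the rows evenly over them.

```r
n            <- 102  # number of rows, e.g. nrow(f.df)
max_per_plot <- 60   # upper limit of readable x-axis values per plot

# Smallest number of plots that keeps every plot within the limit
n_plots <- ceiling(n / max_per_plot)

# Spread the row indices evenly over those plots
plot_number <- ceiling(seq_len(n) * n_plots / n)

table(plot_number)  # rows per plot
```

The resulting vector could be assigned to f.df$plot_number in place of the rep() call, and the rest of the loop would stay the same.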
I'm having an issue finding out how to calculate an average over "x" days. If I try to plot this csv file over 1 year, it's too much data to display correctly on a plot line (screenshot attached). I'm looking to average the data over every few days (maybe 2, a week, etc..) so the line graph is not so hard to read. Any advice on how I would solve this issue with R?
results.csv
POSTS,PROVIDER,TYPE,DATE
29337,FTP,BLOG,2010-01-01
26725,FTP,BLOG,2010-01-02
27480,FTP,BLOG,2010-01-03
31187,FTP,BLOG,2010-01-04
31488,FTP,BLOG,2010-01-05
32461,FTP,BLOG,2010-01-06
33675,FTP,BLOG,2010-01-07
38897,FTP,BLOG,2010-01-08
37122,FTP,BLOG,2010-01-09
41365,FTP,BLOG,2010-01-10
51760,FTP,BLOG,2010-01-11
50859,FTP,BLOG,2010-01-12
53765,FTP,BLOG,2010-01-13
56836,FTP,BLOG,2010-01-14
59698,FTP,BLOG,2010-01-15
52095,FTP,BLOG,2010-01-16
57154,FTP,BLOG,2010-01-17
80755,FTP,BLOG,2010-01-18
227464,FTP,BLOG,2010-01-19
394510,FTP,BLOG,2010-01-20
371303,FTP,BLOG,2010-01-21
370450,FTP,BLOG,2010-01-22
268703,FTP,BLOG,2010-01-23
267252,FTP,BLOG,2010-01-24
375712,FTP,BLOG,2010-01-25
381041,FTP,BLOG,2010-01-26
380948,FTP,BLOG,2010-01-27
373140,FTP,BLOG,2010-01-28
361874,FTP,BLOG,2010-01-29
265178,FTP,BLOG,2010-01-30
269929,FTP,BLOG,2010-01-31
R Script
library(ggplot2);
data <- read.csv("results.csv", header=T);
dts <- as.POSIXct(data$DATE, format="%Y-%m-%d");
attach(data);
a <- ggplot(dataframe, aes(dts,POSTS/1000, fill = TYPE)) + opts(title = "Report") + labs(x = NULL, y = "Posts (k)", fill = NULL);
b <- a + geom_bar(stat = "identity", position = "stack");
plot_theme <- theme_update(axis.text.x = theme_text(angle=90, hjust=1), panel.grid.major = theme_line(colour = "grey90"), panel.grid.minor = theme_blank(), panel.background = theme_blank(), axis.ticks = theme_blank(), legend.position = "none");
c <- b + facet_grid(TYPE ~ ., scale = "free_y");
d <- c + scale_x_datetime(major = "1 months", format = "%Y %b");
ggsave(filename="/root/results.png",height=14,width=14,dpi=600);
Graph Image
Try this :
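Average <- function(Data,n){
# Make an index to be used for aggregating: 0-based day number %/% n
day <- as.factor(Data$DATE)
ID <- (as.numeric(day)-1) %/% n
# aggregate over ID and TYPE for all numeric data.
out <- aggregate(Data[sapply(Data,is.numeric)],
by=list(ID,Data$TYPE),
FUN=mean)
# format output
names(out)[1:2] <-c("dts","TYPE")
# add the correct dates as the beginning of every period
# (looked up among the sorted unique dates, so several TYPEs per day are fine)
out$dts <- as.POSIXct(levels(day)[(out$dts*n)+1])
out
}
# read the file into "Data" rather than "data", which is also a function name
Data <- read.csv("results.csv", header=TRUE)
dataframe <- Average(Data,3)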
This works with the plot script you gave.
Some remarks :
never ever call some variable after a function name (data, c, ...)
avoid the use of attach(). If you do, add detach() afterwards, or you'll get into trouble at some point. Better is to use the functions with() and within()
The TTR package also has several moving average functions that will do this with a single statement:
library(TTR)
mavg.3day <- SMA(data$POSTS, n=3) # Simple moving average
Substitute a different value of 'n' for your desired moving average length.
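Note that a moving average and a per-period average are different things: the first keeps one value per day (only smoothed), the second reduces the number of points. A small base-R sketch of both, using the first seven POSTS values from the question (ma3, block and block_means are names made up here):

```r
# First seven daily POSTS values from results.csv
posts <- c(29337, 26725, 27480, 31187, 31488, 32461, 33675)

# Trailing 3-day moving average: same length as the input, first two are NA
ma3 <- stats::filter(posts, rep(1 / 3, 3), sides = 1)

# Block averages: one mean per consecutive group of 3 days (fewer points)
block       <- (seq_along(posts) - 1) %/% 3
block_means <- tapply(posts, block, mean)

ma3
block_means
```

For a crowded one-year plot, the block averages are probably closer to what the question asks for, since they thin out the x axis as well as smoothing the line.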