Plotting depth data - r

Working with a large chemistry dataset of samples collected at different depths. The data is in long format as:
<Date> <Depth> <Temp>
2015-06-11 4 m 15
2015-07-11 4 m 16
2015-08-11 4 m 17
2015-06-11 3 m 19
2015-07-11 3 m 20
2015-08-11 3 m 21
2015-06-11 2 m 25
2015-07-11 2 m 26
2015-08-11 2 m 27
Trying to graph it as such that I have temperature on my x-axis and depth on my y-axis and then color them by their dates. Currently when I add a geom_line to the function it just connects all the dots.
ggplot(aes(x = Temp, y = Depth, color = Date)) +
geom_point() +
geom_line()

Connection for geoms is established typically just by applying an aesthetic as you did (color=). What's actually happening there is that ggplot is applying the actual aesthetic (drawing the color), but will also apply a somewhat hidden aesthetic, group= to the same value. Normally, this works fine unless the column you assign to color= is continuous (like a date), rather than a factor (which is ordinal, but discrete). If df$Date is actually formatted as a "Date" class, then it's continuous and would exhibit behavior consistent with what you mentioned. The fix is to either explicitly define the group= aesthetic in addition to color=, or to convert df$Date to a factor (discrete).
The example below using your dataset should help explain. For exemplary purposes, I'm adding a column called df$Other, which is formatted as a factor.
df <- data.frame(
Date=rep(c('2015-06-11','2016-07-11','2015-08-11'),3),
Other=rep(c('Jun','July','Aug'),3),
Depth=c(4,4,4,3,3,3,2,2,2),
Temp=c(15,16,17,19,20,21,25,26,27)
)
df$Date <- as.Date(df$Date, format='%Y-%m-%d')
First, here's what your code posted gives you:
ggplot(df, aes(x=Temp, y=Depth, color=Date)) + geom_point() + geom_line()
Look familiar? We know that df$Date is continuous, because ggplot2 draws a legend which is continuous by default, and also because we know it is formatted as a Date class. Consider what happens if you swap out df$Other in place of df$Date:
ggplot(df, aes(x=Temp, y=Depth, color=Other)) + geom_point() + geom_line()
Now the issue should be very clear, but how can you solve it? Well, like I mentioned there are two approaches. One is to maintain df$Date as a continuous variable, but clarify to ggplot2 that you want to use this as a grouping variable. In order to do so, ggplot2 will basically convert it to a factor for purposes of connecting the lines, but keep it continuous to make the color scale:
ggplot(df, aes(x=Temp, y=Depth, color=Date)) +
geom_point() + geom_line(aes(group=Date))
One of the best options might be to set df$Date as a factor with ordered levels, since you're not actually using the "Date" class's continuous nature anyway. You can actually just use color=factor(Date) to fix it right in-line, but you'll notice that the levels are not going to be correct (in terms of the months in the correct order). In this case, I'd recommend changing the column prior to plotting using factor() and setting the levels there. Here's my solution:
# convert to character vector first
df$Date <- as.character(df$Date)
# it's already in the correct order, so just use the order of the df
df$Date <- factor(df$Date, levels=unique(df$Date))
ggplot(df, aes(x=Temp, y=Depth, color=Date)) + geom_point() + geom_line()

Related

Can someone explain why my first ggplot2 box plot was just one big box and how the solution worked?

So my first ggplot2 box plot was just one big stretched out box plot, the second one was correct but I don't understand what changed and why the second one worked. I'm new to R and ggplot2, let me know if you can, thanks.
#----------------------------------------------------------
# This is the original ggplot that didn't work:
#----------------------------------------------------------
zSepalFrame <- data.frame(zSepalLength, zSepalWdth)
zPetalFrame <- data.frame(zPetalLength, zPetalWdth)
p1 <- ggplot(data = zSepalFrame, mapping = aes(x=zSepalWdth, y=zSepalLength, group = 4)) + #fill = zSepalLength
geom_boxplot(notch=TRUE) +
stat_boxplot(geom = 'errorbar', width = 0.2) +
theme_classic() +
labs(title = "Iris Data Box Plot") +
labs(subtitle ="Z Values of Sepals From Iris.R")
p1
#----------------------------------------------------------
# This is the new ggplot box plot line that worked:
#----------------------------------------------------------
bp = ggplot(zSepalFrame, aes(x=factor(zSepalWdth), y=zSepalLength, color = zSepalWdth)) + geom_boxplot() + theme(legend.position = "none")
bp
This is what the ggplot box plot looked like
I don't have your precise dataset, OP, but it seems to stem from assigning a continuous variable to your x axis, when boxplots require a discrete variable.
A continuous variable is something like a numeric column in a dataframe. So something like this:
x <- c(4,4,4,8,8,8,8)
Even though the variable x only contains 4's and 8's, R assigns this as a numeric type of variable, which is continuous. It means that if you plot this on the x axis, ggplot will have no issue with something falling anywhere in-between 4 or 8, and will be positioned accordingly.
The other type of variable is called discrete, which would be something like this:
y <- c("Green", "Green", "Flags", "Flags", "Cars")
The variable y contains only characters. It must be discrete, since there is no such thing as something between "Green" and "Cars". If plotted on an x axis, ggplot will group things as either being "Green", "Flags", or "Cars".
The cool thing is that you can change a continuous variable into a discrete one. One way to do that is to factorize or force R to consider a variable as a factor. If you typed factor(x), you get this:
[1] 4 4 4 8 8 8 8
Levels: 4 8
The values in x are the same, but now there is no such thing as a number between 4 and 8 when x is a factor - it would just add another level.
That is in short why your box plot changes. Let's demonstrate with the iris dataset. First, an example like yours. Notice that I'm assigning x=Sepal.Length. In the iris dataset, Sepal.Length is numeric, so continuous.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
geom_boxplot()
This is similar to yours. The reason is that the boxplot is drawn by grouping according to x and then calculating statistics on those groups. If a variable is continuous, there are no "groups", even if data is replicated (like as in x above). One way to make groups is to force the data to be discrete, as in factor(Sepal.Length). Here's what it looks like when you do that:
ggplot(iris, aes(x=factor(Sepal.Length), y=Sepal.Width)) +
geom_boxplot()
The other way to have this same effect would be to use the group= aesthetic, which does what you might think: it groups according to that column in the dataset.
ggplot(iris, aes(x=Sepal.Length), y=Sepal.Width, group=Sepal.Length)) +
geom_boxplot()

R - ggplot2 - difference between ggplot(data, aes(x=variable...)) and ggplot(data, aes(x=data$variable...)) [duplicate]

This question already has an answer here:
Issue when passing variable with dollar sign notation ($) to aes() in combination with facet_grid() or facet_wrap()
(1 answer)
Closed 4 years ago.
I have currently encountered a phenomenon in ggplot2, and I would be grateful if someone could provide me with an explanation.
I needed to plot a continuous variable on a histogram, and I needed to represent two categorical variables on the plot. The following dataframe is a good example.
library(ggplot2)
species <- rep(c('cat', 'dog'), 30)
numb <- rep(c(1,2,3,7,8,10), 10)
groups <- rep(c('A', 'A', 'B', 'B'), 15)
data <- data.frame(species=species, numb=numb, groups=groups)
Let the following code represent the categorisation of a continuous variable.
data$factnumb <- as.factor(data$numb)
If I would like to plot this dataset the following two codes are completely interchangable:
Note the difference after the fill= statement.
p <- ggplot(data, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(p):
q <- ggplot(data, aes(x=factnumb, fill=data$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(q):
However, when working with real-life continuous variables not all categories will contain observations, and I still need to represent the empty categories on the x-axis in order to get the approximation of the sample distribution. To demostrate this, I used the following code:
data_miss <- data[which(data$numb!= 3),]
This results in a disparity between the levels of the categorial variable and the observations in the dataset:
> unique(data_miss$factnumb)
[1] 1 2 7 8 10
Levels: 1 2 3 7 8 10
And plotted the data_miss dataset, still including all of the levels of the factnumb variable.
pm <- ggplot(data_miss, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_fill_discrete(drop=FALSE) +
scale_x_discrete(drop=FALSE)+
scale_y_continuous(labels = scales::percent)
plot(pm):
qm <- ggplot(data_miss, aes(x=factnumb, fill=data_miss$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_x_discrete(drop=FALSE)+
scale_fill_discrete(drop=FALSE) +
scale_y_continuous(labels = scales::percent)
plot(qm):
In this case, when using fill=data_miss$species the filling of the plot changes (and for the worse).
I would be really happy if someone could clear this one up for me.
Is it just "luck", that in case of plot 1 and 2 the filling is identical, or I have stumbled upon some delicate mistake in the fine machinery of ggplot2?
Thanks in advance!
Kind regards,
Bernadette
Using aes(data$variable) inside is never good, never recommended, and should never be used. Sometimes it still works, but aes(variable) always works, so you should always use aes(variable).
More explanation:
ggplot uses nonstandard evaluation. A standard evaluating R function can only see objects in the global environment. If I have data named mydata with a column name col1, and I do mean(col1), I get an error:
mydata = data.frame(col1 = 1:3)
mean(col1)
# Error in mean(col1) : object 'col1' not found
This error happens because col1 isn't in the global environment. It's just a column name of the mydata data frame.
The aes function does extra work behind the scenes, and knows to look at the columns of the layer's data, in addition to checking the global environment.
ggplot(mydata, aes(x = col1)) + geom_bar()
# no error
You don't have to use just a column inside aes though. To give flexibility, you can do a function of a column, or even some other vector that you happen to define on the spot (if it has the right length):
# these work fine too
ggplot(mydata, aes(x = log(col1))) + geom_bar()
ggplot(mydata, aes(x = c(1, 8, 11)) + geom_bar()
So what's the difference between col1 and mydata$col1? Well, col1 is a name of a column, and mydata$col1 is the actual values. ggplot will look for columns in your data named col1, and use that. mydata$col1 is just a vector, it's the full column. The difference matters because ggplot often does data manipulation. Whenever there are facets or aggregate functions, ggplot is splitting your data up into pieces and doing stuff. To do this effectively, it needs to know identify the data and column names. When you give it mydata$col1, you're not giving it a column name, you're just giving it a vector of values - whatever happens to be in that column, and things don't work.
So, just use unquoted column names in aes() without data$ and everything will work as expected.

how to display factor id on plot axis and factor id and value in legend?

I have a plot where the request is a factor with long values so they don't display on the char axis.
plot( time_taken ~ request )
The data in this case looks like:
time_taken request
1 7 /servlet1/endpoint2/
2 2 /session/
3 10 /servlet1/endpoint3/
4 2 /servlet1/endpoint2/
5 8 /servlet4/endpoint2/
6 5 /session/
...
Question: Is there a way to plot something like the factor level id on the x axis, and the factor level id + factor full string in the legend?
The code in your question generates a boxplot, so I assume that's what you want. Here are four ways to go about it.
This will generate a boxplot with the x-axis numbered, and the full names in the legend.
library(ggplot2)
ggplot(df) +
geom_boxplot(aes(x=as.integer(request),y=time_taken, color=request))+
labs(x="request")
As you can see below, though, with ggplot the labels are discernible (at least in the example).
ggp <- ggplot(df) + geom_boxplot(aes(x=request,y=time_taken))
ggp
In a situation like this I'd be inclined to rotate the plot.
ggp + coord_flip()
Finally, here's a way in base R, although IMO it's the least appealing option.
plot(time_taken~factor(as.integer(request)),df, xlab="request")
labs <- with(df,paste(as.integer(sort(unique(request))),sort(unique(request)),sep=" - "))
legend("topright",legend=labs)
A possible solution uses ggplot2. Following an example with some sample data.
df <- data.frame(factor = c("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
"ccccccccccccccccccccccccccccccccccccc"),
time = c(5, 7, 9))
library(ggplot2)
qplot(data = df, factor, time) + scale_x_discrete(labels = abbreviate)
You can also use directly the function abbreviate on your factor levels in your data frame, so that you can work with abbreviated labels, also avoiding ggplot2, if you're not familiar with it.
Look at ?abbreviate

Ordering bar plots with ggplot2 according to their size, i.e. numerical value

This question asks about ordering a bar graph according to an unsummarized table. I have a slightly different situation. Here's part of my original data:
experiment,pvs_id,src,hrc,mqs,mcs,dmqs,imcs
dna-wm,0,7,9,4.454545454545454,1.4545454545454546,1.4545454545454541,4.3939393939393945
dna-wm,1,7,4,2.909090909090909,1.8181818181818181,0.09090909090909083,3.9090909090909087
dna-wm,2,7,1,4.818181818181818,1.4545454545454546,1.8181818181818183,4.3939393939393945
dna-wm,3,7,8,3.4545454545454546,1.5454545454545454,0.4545454545454546,4.272727272727273
dna-wm,4,7,10,3.8181818181818183,1.9090909090909092,0.8181818181818183,3.7878787878787876
dna-wm,5,7,7,3.909090909090909,1.9090909090909092,0.9090909090909092,3.7878787878787876
dna-wm,6,7,0,4.909090909090909,1.3636363636363635,1.9090909090909092,4.515151515151516
dna-wm,7,7,3,3.909090909090909,1.7272727272727273,0.9090909090909092,4.030303030303029
dna-wm,8,7,11,3.6363636363636362,1.5454545454545454,0.6363636363636362,4.272727272727273
I only need a few variables from this, namely mqs and imcs, grouped by their pvs_id, so I create a new table:
m = melt(t, id.var="pvs_id", measure.var=c("mqs","imcs"))
I can plot this as a bar graph where one can see the correlation between MQS and IMCS.
ggplot(m, aes(x=pvs_id, y=value))
+ geom_bar(aes(fill=variable), position="dodge", stat="identity")
However, I'd like the resulting bars to be ordered by the MQS value, from left to right, in decreasing order. The IMCS values should be ordered with those, of course.
How can I accomplish that? Generally, given any molten dataframe — which seems useful for graphing in ggplot2 and today's the first time I've stumbled over it — how do I specify the order for one variable?
It's all in making
pvs_id a factor and supplying the appropriate levels to it:
dat$pvs_id <- factor(dat$pvs_id, levels = dat[order(-dat$mqs), 2])
m = melt(dat, id.var="pvs_id", measure.var=c("mqs","imcs"))
ggplot(m, aes(x=pvs_id, y=value))+
geom_bar(aes(fill=variable), position="dodge", stat="identity")
This produces the following plot:
EDIT:
Well since pvs_id was numeric it is treated in an ordered fashion. Where as if you have a factor no order is assumed. So even though you have numeric labels pvs_id is actually a factor (nominal). And as far as dat[order(-dat$mqs), 2] is concerned the order function with a negative sign orders the data frame from largest to smallest along the variable mqs. But you're interested in that order for the pvs_id variable so you index that column which is the second column. If you tear that apart you'll see it gives you:
> dat[order(-dat$mqs), 2]
[1] 6 2 0 5 7 4 8 3 1
Now you supply that to the levels argument of factor and this orders the factor as you want it.
With newer tidyverse functions, this becomes much more straightforward (or at least, easy to read for me):
library(tidyverse)
d %>%
mutate_at("pvs_id", as.factor) %>%
mutate(pvs_id = fct_reorder(pvs_id, mqs)) %>%
gather(variable, value, c(mqs, imcs)) %>%
ggplot(aes(x = pvs_id, y = value)) +
geom_col(aes(fill = variable), position = position_dodge())
What it does is:
create a factor if not already
reorder it according to mqs (you may use desc(mqs) for reverse-sorting)
gather into individual rows (same as melt)
plot as geom_col (same as geom_bar with stat="identity")

Understanding dates and plotting a histogram with ggplot2 in R

Main Question
I'm having issues with understanding why the handling of dates, labels and breaks is not working as I would have expected in R when trying to make a histogram with ggplot2.
I'm looking for:
A histogram of the frequency of my dates
Tick marks centered under the matching bars
Date labels in %Y-b format
Appropriate limits; minimized empty space between edge of grid space and outermost bars
I've uploaded my data to pastebin to make this reproducible. I've created several columns as I wasn't sure the best way to do this:
> dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
> head(dates)
YM Date Year Month
1 2008-Apr 2008-04-01 2008 4
2 2009-Apr 2009-04-01 2009 4
3 2009-Apr 2009-04-01 2009 4
4 2009-Apr 2009-04-01 2009 4
5 2009-Apr 2009-04-01 2009 4
6 2009-Apr 2009-04-01 2009 4
Here's what I tried:
library(ggplot2)
library(scales)
dates$converted <- as.Date(dates$Date, format="%Y-%m-%d")
ggplot(dates, aes(x=converted)) + geom_histogram()
+ opts(axis.text.x = theme_text(angle=90))
Which yields this graph. I wanted %Y-%b formatting, though, so I hunted around and tried the following, based on this SO:
ggplot(dates, aes(x=converted)) + geom_histogram()
+ scale_x_date(labels=date_format("%Y-%b"),
+ breaks = "1 month")
+ opts(axis.text.x = theme_text(angle=90))
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
That gives me this graph
Correct x axis label format
The frequency distribution has changed shape (binwidth issue?)
Tick marks don't appear centered under bars
The xlims have changed as well
I worked through the example in the ggplot2 documentation at the scale_x_date section and geom_line() appears to break, label, and center ticks correctly when I use it with my same x-axis data. I don't understand why the histogram is different.
Updates based on answers from edgester and gauden
I initially thought gauden's answer helped me solve my problem, but am now puzzled after looking more closely. Note the differences between the two answers' resulting graphs after the code.
Assume for both:
library(ggplot2)
library(scales)
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
Based on #edgester's answer below, I was able to do the following:
freqs <- aggregate(dates$Date, by=list(dates$Date), FUN=length)
freqs$names <- as.Date(freqs$Group.1, format="%Y-%m-%d")
ggplot(freqs, aes(x=names, y=x)) + geom_bar(stat="identity") +
scale_x_date(breaks="1 month", labels=date_format("%Y-%b"),
limits=c(as.Date("2008-04-30"),as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + opts(axis.text.x = theme_text(angle=90))
Here is my attempt based on gauden's answer:
dates$Date <- as.Date(dates$Date)
ggplot(dates, aes(x=Date)) + geom_histogram(binwidth=30, colour="white") +
scale_x_date(labels = date_format("%Y-%b"),
breaks = seq(min(dates$Date)-5, max(dates$Date)+5, 30),
limits = c(as.Date("2008-05-01"), as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + opts(axis.text.x = theme_text(angle=90))
Plot based on edgester's approach:
Plot based on gauden's approach:
Note the following:
gaps in gauden's plot for 2009-Dec and 2010-Mar; table(dates$Date) reveals that there are 19 instances of 2009-12-01 and 26 instances of 2010-03-01 in the data
edgester's plot starts at 2008-Apr and ends at 2012-May. This is correct based on a minimum value in the data of 2008-04-01 and a max date of 2012-05-01. For some reason gauden's plot starts in 2008-Mar and still somehow manages to end at 2012-May. After counting bins and reading along the month labels, for the life of me I can't figure out which plot has an extra or is missing a bin of the histogram!
Any thoughts on the differences here? edgester's method of creating a separate count
Related References
As an aside, here are other locations that have information about dates and ggplot2 for passers-by looking for help:
Started here at learnr.wordpress, a popular R blog. It stated that I needed to get my data into POSIXct format, which I now think is false and wasted my time.
Another learnr post recreates a time series in ggplot2, but wasn't really applicable to my situation.
r-bloggers has a post on this, but it appears outdated. The simple format= option did not work for me.
This SO question is playing with breaks and labels. I tried treating my Date vector as continuous and don't think it worked so well. It looked like it was overlaying the same label text over and over so the letters looked kind of odd. The distribution is sort of correct but there are odd breaks. My attempt based on the accepted answer was like so (result here).
UPDATE
Version 2: Using Date class
I update the example to demonstrate aligning the labels and setting limits on the plot. I also demonstrate that as.Date does indeed work when used consistently (actually it is probably a better fit for your data than my earlier example).
The Target Plot v2
The Code v2
And here is (somewhat excessively) commented code:
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.Date(dates$Date)
# convert the Date to its numeric equivalent
# Note that Dates are stored as number of days internally,
# hence it is easy to convert back and forth mentally
dates$num <- as.numeric(dates$Date)
bin <- 60 # used for aggregating the data and aligning the labels
p <- ggplot(dates, aes(num, ..count..))
p <- p + geom_histogram(binwidth = bin, colour="white")
# The numeric data is treated as a date,
# breaks are set to an interval equal to the binwidth,
# and a set of labels is generated and adjusted in order to align with bars
p <- p + scale_x_date(breaks = seq(min(dates$num)-20, # change -20 term to taste
max(dates$num),
bin),
labels = date_format("%Y-%b"),
limits = c(as.Date("2009-01-01"),
as.Date("2011-12-01")))
# from here, format at ease
p <- p + theme_bw() + xlab(NULL) + opts(axis.text.x = theme_text(angle=45,
hjust = 1,
vjust = 1))
p
Version 1: Using POSIXct
I try a solution that does everything in ggplot2, drawing without the aggregation, and setting the limits on the x-axis between the beginning of 2009 and the end of 2011.
The Target Plot v1
The Code v1
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.POSIXct(dates$Date)
p <- ggplot(dates, aes(Date, ..count..)) +
geom_histogram() +
theme_bw() + xlab(NULL) +
scale_x_datetime(breaks = date_breaks("3 months"),
labels = date_format("%Y-%b"),
limits = c(as.POSIXct("2009-01-01"),
as.POSIXct("2011-12-01")) )
p
Of course, it could do with playing with the label options on the axis, but this is to round off the plotting with a clean short routine in the plotting package.
I know this is an old question, but for anybody coming to this in 2021 (or later), this can be done much easier using the breaks= argument for geom_histogram() and creating a little shortcut function to make the required sequence.
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- lubridate::ymd(dates$Date)
by_month <- function(x,n=1){
seq(min(x,na.rm=T),max(x,na.rm=T),by=paste0(n," months"))
}
ggplot(dates,aes(Date)) +
geom_histogram(breaks = by_month(dates$Date)) +
scale_x_date(labels = scales::date_format("%Y-%b"),
breaks = by_month(dates$Date,2)) +
theme(axis.text.x = element_text(angle=90))
I think the key thing is that you need to do the frequency calculation outside of ggplot. Use aggregate() with geom_bar(stat="identity") to get a histogram without the reordered factors. Here is some example code:
require(ggplot2)
# scales goes with ggplot and adds the needed scale* functions
require(scales)
# need the month() function for the extra plot
require(lubridate)
# original data
#df<-read.csv("http://pastebin.com/download.php?i=sDzXKFxJ", header=TRUE)
# simulated data
years=sample(seq(2008,2012),681,replace=TRUE,prob=c(0.0176211453744493,0.302496328928047,0.323054331864905,0.237885462555066,0.118942731277533))
months=sample(seq(1,12),681,replace=TRUE)
my.dates=as.Date(paste(years,months,01,sep="-"))
df=data.frame(YM=strftime(my.dates, format="%Y-%b"),Date=my.dates,Year=years,Month=months)
# end simulated data creation
# sort the list just to make it pretty. It makes no difference in the final results
df=df[do.call(order, df[c("Date")]), ]
# add a dummy column for clarity in processing
df$Count=1
# compute the frequencies ourselves
freqs=aggregate(Count ~ Year + Month, data=df, FUN=length)
# rebuild the Date column so that ggplot works
freqs$Date=as.Date(paste(freqs$Year,freqs$Month,"01",sep="-"))
# I set the breaks for 2 months to reduce clutter
g<-ggplot(data=freqs,aes(x=Date,y=Count))+ geom_bar(stat="identity") + scale_x_date(labels=date_format("%Y-%b"),breaks="2 months") + theme_bw() + opts(axis.text.x = theme_text(angle=90))
print(g)
# don't overwrite the previous graph
dev.new()
# just for grins, here is a faceted view by year
# Add the Month.name factor to have things work. month() keeps the factor levels in order
freqs$Month.name=month(freqs$Date,label=TRUE, abbr=TRUE)
g2<-ggplot(data=freqs,aes(x=Month.name,y=Count))+ geom_bar(stat="identity") + facet_grid(Year~.) + theme_bw()
print(g2)
The error graph this under the title "Plot based on Gauden's approach" is due to the binwidth parameter:
... + Geom_histogram (binwidth = 30, color = "white") + ...
If we change the value of 30 to a value less than 20, such as 10, you will get all frequencies.
In statistics the values are more important than the presentation is more important a bland graphic to a very pretty picture but with errors.

Resources