Struggled with this question for a while and not sure what's going on. I know this question will definitely demonstrate my deficiency with ggplot.
I have a script like this that functions nicely:
beta.bray= c(0.681963714,0.73301985,0.6797153,0.79358052,0.85055556,0.76297686,0.60653007)
beta.bray.gradient=c(0.182243513, 0.565267411,0.427449441,0.655012391,0.357146893,0.286457524,0.338706138)
Date=c("07/18/14","07/26/14","08/19/14","08/25/14", "07/25/15","08/22/15", "07/26/16")
dat=data.frame(Date, beta.bray, beta.bray.gradient)
test<-ggplot(dat, aes(x=reorder(Date, x=fct_inorder(Date)), y=beta.bray, group=1))+geom_line(linetype="dashed")+geom_point()+
labs(x="Date", y="β, multiple-site dissimilarity", title="SNARL riffle site/site β through time, 2014-2016") +coord_cartesian(xlim=c(1,7),ylim=c(.58,.85))
test
But when I want to add another line for beta.bray.gradient, I can't get anything to work. I think it has something to do with the way I used aes() in the above code, but I didn't know how else to do it, in order to use reorder() and fct_inorder() to make sure the dates are plotted in the right way. Here's an example of a way I tried adding the second line:
test<-ggplot(dat, aes(x=reorder(Date, x=fct_inorder(Date)), y=beta.bray, group=1))+geom_line(linetype="dashed")+geom_point()+
geom_line(dat, aes(y=beta.bray.gradient, linetype="c"))+
labs(x="Date", y="β, multiple-site dissimilarity", title="SNARL riffle site/site β through time, 2014-2016") +coord_cartesian(xlim=c(1,7),ylim=c(.58,.85))
In these situations we see a multitude of errors, in this case Error: ggplot2 doesn't know how to deal with data of class uneval
I would think it best to use actual date objects for the x axis and reshape your data into a long format:
library(dplyr)
library(tidyr)
library(ggplot2)
beta.bray <- c(0.681963714,0.73301985,0.6797153,0.79358052,0.85055556,0.76297686,0.60653007)
beta.bray.gradient <- c(0.182243513, 0.565267411,0.427449441,0.655012391,0.357146893,0.286457524,0.338706138)
Date <- as.Date(c("07/18/14","07/26/14","08/19/14","08/25/14", "07/25/15","08/22/15", "07/26/16"),"%m/%d/%y")
dat <- data.frame(Date, beta.bray, beta.bray.gradient) %>%
gather(key = "grp",value = "val",beta.bray,beta.bray.gradient)
ggplot(dat, aes(x = Date, y = val, group = grp,color = grp)) +
geom_line(linetype="dashed") +
geom_point() +
labs(x="Date", y="β, multiple-site dissimilarity",
title="SNARL riffle site/site β through time, 2014-2016") +
coord_cartesian(xlim=Date[c(1,7)])
Related
I'd much appreciate anyone's help to resolve this question please. It seems like it should be so simple, but after many hours experimenting, I've had to stop in and ask for help. Thank you very much in advance!
Summary of question:
How can one ensure in ggplot2 the y-axis of a histogram is labelled using only integers (frequency count values) and not decimals?
The functions, arguments and datatype changes tried so far include:
geom_histogram(), geom_bar() and geom(col) - in each case, including, or not, the argument stat = "identity" where relevant.
adding + scale_y_discrete(), with or without + scale_x_discrete()
converting the underlying count data to a factor and/or the bin data to a factor
Ideally, the solution would be using baseR or ggplot2, instead of additional external dependencies e.g. by using the function pretty_breaks() func in the scales package, or similar.
Sample data:
sample <- data.frame(binMidPts = c(4500,5500,6500,7500), counts = c(8,0,9,3))
The x-axis consists of bins of a continuous variable, and the y-axis is intended to show the count of observations in those bins. For example, Bin 1 covers the x-axis range [4000 <= x < 5000], has a mid-point 4500, with 8 data points observed in that bin / range.
Code that almost works:
The following code generates a graph similar to the one I'm seeking, however the y-axis is labelled with decimal values on the breaks (which aren't valid as the data are integer count values).
ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col()
Graph produced by this code is:
I realise I could hard-code the breaks / labels onto a scale_y_continuous() axis but (a) I'd prefer a flexible solution to apply to many differently sized datasets where the scale isn't know in advance, and (b) I expect there must be a simpler way to generate a basic histogram.
References
I've consulted many Stack Overflow questions, the ggplot2 manual (https://ggplot2.tidyverse.org/reference/scale_discrete.html), the sthda.com examples and various blogs. These tend to address related problems, e.g. using scale_y_continuous, or where count data is not available in the underlying dataset and thus rely on stat_bin() for a transformation.
Any help would be much appreciated! Thank you.
// Update 1 - Extending scale to zero
Future readers of this thread may find it helpful to know that the range of break values formed by base::pretty() does not necessarily extend to zero. Thus, the axis scale may omit values between zero and the lower range of the breaks, as shown here:
To resolve this, I included '0' in the range() parameter, i.e.:
ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col() +
scale_y_continuous(breaks=round(pretty(range(0,sample$counts))))
which gives the desired full scale on the y-axis, thus:
How about:
ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col() +
scale_y_continuous( breaks=round(pretty( range(sample$counts) )) )
This answer suggests pretty_breaks from the scales package. The manual page of pretty_breaks mentions pretty from base. And from there you just have to round it to the nearest integer.
The default y-axis breaks is calculated with scales::extended_breaks(). This function factory has a ... argument that passes on arguments to labeling::extended, which has a Q argument for what it considers 'nice numbers'. If you omit the 2.5 from the default, you should get integer breaks when the range is 3 or larger.
library(ggplot2)
library(scales)
sample <- data.frame(binMidPts = c(4500,5500,6500,7500), counts = c(8,0,9,3))
ggplot(data = sample, aes (x = binMidPts, y = counts)) +
geom_col() +
scale_y_continuous(
breaks = extended_breaks(Q = c(1, 5, 2, 4, 3))
)
Created on 2021-04-28 by the reprex package (v1.0.0)
Or you can calculate the breaks with some rules customized to the dataset you are working like this
library(ggplot2)
breaks_min <- 0
breaks_max <- max(sample[["counts"]])
# Assume 5 breaks is perferable
breaks_bin <- round((breaks_max - breaks_min) / 5)
custom_breaks <- seq(breaks_min, breaks_max, breaks_bin)
ggplot(data = sample, aes (x = binMidPts, y = counts)) +
geom_col() +
scale_y_continuous(breaks = custom_breaks, expand = c(0, 0))
Created on 2021-04-28 by the reprex package (v2.0.0)
How can I make this in order of month, x axis is not in date class its in character? I tried using reorder and sort it doesn't work for my case.
Two approaches.
Fake data:
set.seed(42) # R-4.0.2
dat <- data.frame(
when = sample(c("Apr20", "Feb20", "Mar20"), size = 500, replace = TRUE),
charge = 10000 * rexp(500)
)
ggplot(dat, aes(charge, when)) +
geom_boxplot() +
coord_flip()
Date class
This is what I'll call "The Right Way (tm)", for two reasons: if the data is date-like, them let's use Date; and allow R to handle the ordering naturally.
dat$when2 <- as.Date(paste0("01", dat$when), "%d%b%y")
ggplot(dat, aes(charge, when2, group = when)) +
geom_boxplot() +
coord_flip() +
scale_y_date(labels = function(z) format(z, format = "%b%y"))
(I should note that I need both when2 and group=when: since when2 is a continuous variable, ggplot2 is not going to auto-group things based on it, so we need group=.)
factor
I think this is the wrong approach, for two reasons: (1) not using dates as the numeric data they are; and (2) the more months you have, the more you have to manually control the levels within the factors.
However, having said that:
dat$when3 <- factor(dat$when, levels = c("Feb20", "Mar20", "Apr20"))
ggplot(dat, aes(charge, when3)) +
geom_boxplot() +
coord_flip()
(You could easily overwrite dat$when instead of creating a new variable dat$when3, but I kept it separate because I went back and forth during code-testing here. Frankly, if you prefer to not go the Date route, then doing this allows other things to be ordered correctly, too.)
One pattern I do a lot is to facet plots on cuts of numeric values. facet_wrap in ggplot2 doesn't allow you to call a function from within, so you have to create a temporary factor variable. This is okay using mutate from dplyr. The advantage of this is that you can play around doing EDA and varying the number of quantiles, or changing to set cut points etc. and view the changes in one line. The downside is that the facets are only labelled by the factor level; you have to know, for example, that it's a temperature. This isn't too bad for yourself, but even I get confused if I'm doing a facet_grid on two such variables and have to remember which is which. So, it's really nice to be able to relabel the facets by including a meaningful name.
The key points of this problem is that the levels will change as you change the number of quantiles etc.; you don't know what they are in advance. You could use the base levels() function, but that means augmenting the data frame with the cut variable, then calling levels(), then passing this augmented data frame to ggplot().
So, using plyr::mapvalues, we can wrap all this into a dplyr::mutate, but the required arguments for mapvalues() makes it quite clunky. Having to retype "Temp.f" many times is not very "dplyr"!
Is there a neater way of renaming such factor levels "on the fly"? I hope this description is clear enough and the code example below helps.
library(ggplot2)
library(plyr)
library(dplyr)
library(Hmisc)
df <- data.frame(Temp = seq(-100, 100, length.out = 1000), y = rnorm(1000))
# facet_wrap doesn't allow functions so have to create new, temporary factor
# variable Temp.f
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f)
# fine, but facet headers aren't very clear,
# we want to highlight that they are temperature
ggplot(df %>% mutate(Temp.f = paste0("Temp: ", cut2(Temp, g = 4)))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f)
# use of paste0 is undesirable because it creates a character vector and
# facet_wrap then recodes the levels in the wrong numerical order
# This has the desired effect, but is very long!
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4), Temp.f = mapvalues(Temp.f, levels(Temp.f), paste0("Temp: ", levels(Temp.f))))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f)
I think you can do this from within facet_wrap using a custom labeller function, like so:
myLabeller <- function(x){
lapply(x,function(y){
paste("Temp:", y)
})
}
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) +
geom_histogram(aes(x = y)) +
facet_wrap(~Temp.f
, labeller = myLabeller)
That labeller is clunky, but at least an example. You could write one for each variable that you are going to use (e.g. tempLabeller, yLabeller, etc).
A slight tweak makes this even better: it automatically uses the name of the thing you are facetting on:
betterLabeller <- function(x){
lapply(names(x),function(y){
paste0(y,": ", x[[y]])
})
}
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) +
geom_histogram(aes(x = y)) +
facet_wrap(~Temp.f
, labeller = betterLabeller)
Okay, with thanks to Mark Peterson for pointing me towards the labeller argument/function, the exact answer I'm happy with is:
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f, labeller = labeller(Temp.f = label_both))
I'm a fan of lazy and "label_both" means I can simply create a meaningful temporary (or overwrite the original) variable column and both the name and the value are given. Rolling your own labeller function is more powerful, but using label_both is a good, easy option.
This question follows from these two topics:
How to use stat_bin2d() to compute counts labels in ggplot2?
How to show the numeric cell values in heat map cells in r
In the first topic, a user wants to use stat_bin2d to generate a heatmap, and then wants the count of each bin written on top of the heat map. The method the user initially wants to use doesn't work, the best answer stating that stat_bin2d is designed to work with geom = "rect" rather than "text". No satisfactory response is given.
The second question is almost identical to the first, with one crucial difference, that the variables in the second question question are text, not numeric. The answer produces the desired result, placing the count value for a bin over the bin in a stat_2d heat map.
To compare the two methods i've prepared the following code:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
ggplot(data, aes(x = x, y = y))
geom_bin2d() +
stat_bin2d(geom="text", aes(label=..count..))
We know this first gives you the error:
"Error: geom_text requires the following missing aesthetics: x, y".
Same issue as in the first question. Interestingly, changing from stat_bin2d to stat_binhex works fine:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
ggplot(data, aes(x = x, y = y))
geom_binhex() +
stat_binhex(geom="text", aes(label=..count..))
Which is great and all, but generally, I don't think hex binning is very clear, and for my purposes wont work for the data i'm trying to desribe. I really want to use stat_2d.
To get this to work, i've prepared the following work around based on the second answer:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
x_t<-as.character(round(data$x,.1))
y_t<-as.character(round(data$y,.1))
x_x<-as.character(seq(-3,3),1)
y_y<-as.character(seq(-3,3),1)
data<-cbind(data,x_t,y_t)
ggplot(data, aes(x = x_t, y = y_t)) +
geom_bin2d() +
stat_bin2d(geom="text", aes(label=..count..))+
scale_x_discrete(limits =x_x) +
scale_y_discrete(limits=y_y)
This works around allows one to bin numerical data, but to do so, you have to determine bin width (I did it via rounding) before bringing it into ggplot. I actually figured it out while writing this question, so I may as well finish.
This is the result: (turns out I can't post images)
So my real question here, is does any one have a better way to do this? I'm happy I at least got it to work, but so far I haven't seen an answer for putting labels on stat_2d bins when using a numerical variable.
Does any one have a method for passing on x and y arguments to geom_text from stat_2dbin without having to use a work around? Can any one explain why it works with text variables but not with numbers?
Another work around (but perhaps less work). Similar to the ..count.. method you can extract the counts from the plot object in two steps.
library(ggplot2)
set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))
# plot
p <- ggplot(dat, aes(x = x, y = y)) + geom_bin2d()
# Get data - this includes counts and x,y coordinates
newdat <- ggplot_build(p)$data[[1]]
# add in text labels
p + geom_text(data=newdat, aes((xmin + xmax)/2, (ymin + ymax)/2,
label=count), col="white")
I have CSV data of a log for 24 hours that looks like this:
svr01,07:17:14,'u1#user.de','8.3.1.35'
svr03,07:17:21,'u2#sr.de','82.15.1.35'
svr02,07:17:30,'u3#fr.de','2.15.1.35'
svr04,07:17:40,'u2#for.de','2.1.1.35'
I read the data with tbl <- read.csv("logs.csv")
How can I plot this data in a histogram to see the number of hits per hour?
Ideally, I would get 4 bars representing hits per hour per srv01, srv02, srv03, srv04.
Thank you for helping me here!
I don't know if I understood you right, so I will split my answer in two parts. The first part is how to convert your time into a vector you can use for plotting.
a) Converting your data into hours:
#df being the dataframe
df$timestamp <- strptime(df$timestamp, format="%H:%M:%S")
df$hours <- as.numeric(format(df$timestamp, format="%H"))
hist(df$hours)
This gives you a histogram of hits over all servers. If you want to split the histograms this is one way but of course there are numerous others:
b) Making a histogram with ggplot2
#install.packages("ggplot2")
require(ggplot2)
ggplot(data=df) + geom_histogram(aes(x=hours), bin=1) + facet_wrap(~ server)
# or use a color instead
ggplot(data=df) + geom_histogram(aes(x=hours, fill=server), bin=1)
c) You could also use another package:
require(plotrix)
l <- split(df$hours, f=df$server)
multhist(l)
The examples are given below. The third makes comparison easier but ggplot2 simply looks better I think.
EDIT
Here is how thes solutions would look like
first solution:
second solution:
third solution:
An example dataset:
dat = data.frame(server = paste("svr", round(runif(1000, 1, 10)), sep = ""),
time = Sys.time() + sort(round(runif(1000, 1, 36000))))
The trick I use is to create a new variable which only specifies in which hour the hit was recorded:
dat$hr = strftime(dat$time, "%H")
Now we can use some plyr magick:
hits_hour = count(dat, vars = c("server","hr"))
And create the plot:
ggplot(data = hits_hour) + geom_bar(aes(x = hr, y = freq, fill = server), stat="identity", position = "dodge")
Which looks like:
I don't really like this plot, I'd be more in favor of:
ggplot(data = hits_hour) + geom_line(aes(x = as.numeric(hr), y = freq)) + facet_wrap(~ server, nrow = 1)
Which looks like:
Putting all the facets in one row allows easy comparison of the number of hits between the servers. This will look even better when using real data instead of my random data.