geom_col is not using stat_identity when values are rounded to whole numbers - r

I'm trying to use geom_col to chart columns for values in time series (annual and quarterly).
When I use the zoo package's yearqtr data type for the x-axis values and round the y-axis values to whole numbers, geom_col appears not to use the default stat = 'identity' to determine the bar heights from the y-value of each observation. Instead it appears to switch to stat = 'count', treating the rounded y-values as factors and counting the occurrences of each value (e.g., 3 occurrences have a rounded y-value = 11).
If I switch to geom_line, the graph is fine with quarterly x-axis values and rounded y-axis values.
library(zoo)
library(ggplot2)
Annual.Periods <- seq(to = 2020, by = 1, length.out = 8) # 8 years
Quarter.Periods <- as.yearqtr(seq(to = 2020, by = 0.25, length.out = 8)) # 8 Quarters
Values <- seq(to = 11, by = 0.25, length.out = 8)
Data.Annual.Real <- data.frame(X = Annual.Periods, Y = round(Values, 1))
Data.Annual.Whole <- data.frame(X = Annual.Periods, Y = round(Values, 0))
Data.Quarter.Real <- data.frame(X = Quarter.Periods, Y = round(Values, 1))
Data.Quarter.Whole <- data.frame(X = Quarter.Periods, Y = round(Values, 0))
ggplot(data = Data.Annual.Real, aes(X, Y)) + geom_col()
ggplot(data = Data.Annual.Whole, aes(X, Y)) + geom_col()
ggplot(data = Data.Quarter.Real, aes(X, Y)) + geom_col()
ggplot(data = Data.Quarter.Whole, aes(X, Y)) + geom_col() # appears to treat y-values as factors and use stat = 'count' to count occurrences (e.g., 3 occurrences have a rounded Value = 11)
ggplot(data = Data.Quarter.Whole, aes(X, Y)) + geom_line()
rstudioapi::versionInfo()
# $mode
# [1] "desktop"
#
# $version
# [1] ‘1.3.959’
#
# $release_name
# [1] "Middlemist Red"
sessionInfo()
# R version 4.0.0 (2020-04-24)
# Platform: x86_64-apple-darwin17.0 (64-bit)
# Running under: macOS Mojave 10.14.6
#
# Matrix products: default
# BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] ggplot2_3.3.1 zoo_1.8-8

ggplot2 tries to guess the orientation of geom_col(), i.e. which variable serves as the base of the bars and which holds the values to represent. Apparently, when your Y variable contains no decimal values, ggplot2 chooses it as the base (it stays numeric, though; there is no conversion to factor) and sums up your quarters.
For cases like this, you can tell geom_col() which variable to use as the base of the bars via the orientation argument:
ggplot(data = Data.Quarter.Whole, aes(X, Y)) + geom_col(orientation = "x")
EDIT: I have just seen that Roman answered it in the comments.
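To check that the orientation argument is honoured, the built layer data can be inspected. This is a minimal sketch, assuming ggplot2 3.3.0 or later (where the orientation argument and the flipped_aes column exist). The data frame quarters is a hypothetical stand-in that uses plain numeric quarter values rather than zoo's yearqtr, so it runs without zoo:

```r
library(ggplot2)

# Hypothetical stand-in for Data.Quarter.Whole: numeric quarters, whole-number Y
quarters <- data.frame(
  X = seq(to = 2020, by = 0.25, length.out = 8),
  Y = round(seq(to = 11, by = 0.25, length.out = 8), 0)
)

# Force X to be the base of the bars instead of letting ggplot2 guess
p <- ggplot(quarters, aes(X, Y)) + geom_col(orientation = "x")

# layer_data() returns the data ggplot2 computed for the layer
built <- layer_data(p)
built[, c("x", "ymin", "ymax", "flipped_aes")]
```

If flipped_aes is FALSE in every row and ymax matches Y, the bars are drawn upward from the x-axis as intended.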

Related

Manually sort labels in plot_ly [duplicate]

Is it possible to order the legend entries in R?
If I e.g. specify a pie chart like this:
plot_ly(df, labels = Product, values = Patients, type = "pie",
marker = list(colors = Color), textfont=list(color = "white")) %>%
layout(legend = list(x = 1, y = 0.5))
The legend gets sorted by which Product has the highest number of Patients. I would like the legend to be sorted in alphabetical order by Product.
Is this possible?
Yes, it's possible. Chart options are here:
https://plot.ly/r/reference/#pie.
An example:
library(plotly)
library(dplyr)
# Dummy data
df <- data.frame(Product = c('Kramer', 'George', 'Jerry', 'Elaine', 'Newman'),
Patients = c(3, 6, 4, 2, 7))
# Make alphabetical
df <- df %>%
arrange(Product)
# Sorts legend largest to smallest
plot_ly(df,
labels = ~Product,
values = ~Patients,
type = "pie",
textfont = list(color = "white")) %>%
layout(legend = list(x = 1, y = 0.5))
# Set sort argument to FALSE and now orders like the data frame
plot_ly(df,
labels = ~Product,
values = ~Patients,
type = "pie",
sort = FALSE,
textfont = list(color = "white")) %>%
layout(legend = list(x = 1, y = 0.5))
# I prefer clockwise
plot_ly(df,
labels = ~Product,
values = ~Patients,
type = "pie",
sort = FALSE,
direction = "clockwise",
textfont = list(color = "white")) %>%
layout(legend = list(x = 1, y = 0.5))
Session info:
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 dplyr_0.7.5 plotly_4.7.1 ggplot2_2.2.1
EDIT:
Modified to work with plotly 4.x.x (i.e. added ~)

ggplot generates jagged/broken lines in specific cases

I recently encountered several cases where ggplot produced jagged lines. In the following example, I generate dense time-course data with the package fda and draw two line plots. The first plot shows a single black line; the second shows the same line, except that different colors denote the sign of the values. Finally, I export the plots as eps files and open them in Adobe Illustrator.
# install.packages("fda")
# dir.create("tmp")
library(dplyr)
library(tidyr)
library(ggplot2)
library(fda)
times_sparse <- seq(0, 10, 0.5)
times <- seq(0, 10, 0.02)
basis <- create.bspline.basis(
rangeval = c(0, 10), norder = 4,
breaks = times_sparse
)
nbasis <- basis$nbasis
set.seed(2501)
coeff <- rnorm(nbasis, sd = 0.1)
y <- eval.fd(times, fd(coeff, basis)) |> as.numeric()
dat <- data.frame(t = times, y = y) |>
mutate(pos = factor((y > 0) * 1, levels = c(1, 0)))
### first plot: 1 color, smooth lines
ggplot(dat) +
geom_line(aes(x = t, y = y)) +
theme_bw() +
theme(panel.grid = element_blank())
# ggsave("tmp/line1a.eps", device = "eps",
# width = 6, height = 6)
### second plot: 2 colors, jagged lines
ggplot(dat) +
geom_line(aes(x = t, y = y, color = pos, group = 1)) +
theme_bw() +
theme(panel.grid = element_blank())
# ggsave("tmp/line1b.eps", device = "eps",
# width = 6, height = 6)
In the screenshots which display the zoomed-in line, we observe that the line in the first plot is smooth, while the line in the second plot is jagged. How can I fix the problem?
Here's my system info:
# R version 4.1.1 (2021-08-10)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 22000)
Note: My goal is to generate an eps/pdf file of something like the second plot from R. Other methods that achieve the same goal are appreciated.
You should add lineend = "round" to your geom_line
ggplot(dat) +
geom_line(aes(x = t, y = y, color = pos, group = 1), lineend = "round") +
theme_bw() +
theme(panel.grid = element_blank())
It will look nice via export -> save as PDF and windows() -> save as... too.
An example (2400%):
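One plausible explanation for why lineend matters here, offered as my understanding of ggplot2's behaviour rather than something stated in the answer: when the colour is constant along a path, geom_line draws it as a single polyline, but when the colour varies within a group it falls back to drawing separate segments, and butt-ended segments can leave small notches at the joints. The sketch below inspects the grobs with layer_grob(); the sine data is a made-up substitute for the fda example.

```r
library(ggplot2)

# Made-up smooth curve that changes sign, standing in for the fda data
dat <- data.frame(t = seq(0, 10, 0.02))
dat$y <- sin(dat$t)
dat$pos <- factor((dat$y > 0) * 1, levels = c(1, 0))

p_single <- ggplot(dat) + geom_line(aes(t, y))
p_split  <- ggplot(dat) + geom_line(aes(t, y, colour = pos, group = 1))
p_round  <- ggplot(dat) +
  geom_line(aes(t, y, colour = pos, group = 1), lineend = "round")

# layer_grob() returns the grid grobs that a layer would draw
class_single <- class(layer_grob(p_single)[[1]])
class_split  <- class(layer_grob(p_split)[[1]])
print(class_single)
print(class_split)
```

If the two classes differ (a polyline grob versus a segments grob), that would be consistent with the jagged look appearing only in the two-colour plot, and with rounded line ends hiding the joints.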

stat_sum and stat_identity give weird results

I have the following code, including randomly generated demo data:
n <- 10
group <- rep(1:4, n)
mass.means <- c(10, 20, 15, 30)
mass.sigma <- 4
score.means <- c(5, 5, 7, 4)
score.sigma <- 3
mass <- as.vector(model.matrix(~0+factor(group)) %*% mass.means) +
rnorm(n*4, 0, mass.sigma)
score <- as.vector(model.matrix(~0+factor(group)) %*% score.means) +
rnorm(n*4, 0, score.sigma)
data <- data.frame(id = 1:(n*4), group, mass, score)
head(data)
Which gives:
id group mass score
1 1 1 12.643603 5.015746
2 2 2 21.458750 5.590619
3 3 3 15.757938 8.777318
4 4 4 32.658551 6.365853
5 5 1 6.636169 5.885747
6 6 2 13.467437 6.390785
And then I want to plot the sum of "score", grouped by "group", in a bar chart:
plot <- ggplot(data = data, aes(x = group, y = score)) +
geom_bar(stat="sum")
plot
This gives me:
Weirdly, using stat_identity seems to give the result I am looking for:
plot <- ggplot(data = data, aes(x = group, y = score)) +
geom_bar(stat="identity")
plot
Is this a bug? Using ggplot2 1.0.0 on R
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 1.2
year 2014
month 10
day 31
svn rev 66913
language R
version.string R version 3.1.2 (2014-10-31)
nickname Pumpkin Helmet
Or what am I doing wrong?
plot <- ggplot(data = data, aes(x = group, y = score)) +
stat_summary(fun.y = "sum", geom = "bar", position = "identity")
plot
aggregate(score ~ group, data=data, FUN=sum)
# group score
#1 1 51.71279
#2 2 58.94611
#3 3 67.52100
#4 4 39.24484
Edit:
stat_sum does not work, because it doesn't just return the sum. It returns the "number of observations at position" and "percent of points in that panel at that position". It was designed for a different purpose. The docs say " Useful for overplotting on scatterplots."
stat_identity (kind of) works because geom_bar by default stacks the bars. You have many bars on top of each other in contrast to my solution that gives you just one bar per group. Look at this:
plot <- ggplot(data = data, aes(x = group, y = score)) +
geom_bar(stat="identity", color = "red")
plot
Also consider the warning:
Warning message:
Stacking not well defined when ymin != 0
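On current ggplot2 releases, one bar per group can be drawn either by aggregating first and using geom_col(), or with stat_summary() and its fun argument (fun.y was deprecated in ggplot2 3.3.0). A minimal sketch, assuming ggplot2 3.3.0 or later and simplified made-up data in place of the demo data above:

```r
library(ggplot2)

set.seed(42)
# Made-up scores: 4 groups with 10 observations each
scores <- data.frame(group = rep(1:4, 10),
                     score = rnorm(40, mean = 5, sd = 3))

# Option 1: aggregate explicitly, then draw the sums as they are
sums <- aggregate(score ~ group, data = scores, FUN = sum)
p_col <- ggplot(sums, aes(x = group, y = score)) + geom_col()

# Option 2: let ggplot2 compute the per-group sum
p_summary <- ggplot(scores, aes(x = group, y = score)) +
  stat_summary(fun = sum, geom = "bar")

# Both layers should contain exactly one row (one bar) per group
built_col <- layer_data(p_col)
built_summary <- layer_data(p_summary)
```

Either way there is a single bar per group, rather than forty stacked bars as with stat = "identity" on the raw data.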

Why will geom_tile plot a subset of my data, but not more?

I am trying to plot a map, but I can not figure out why the following will not work:
Here is a minimal example
testdf <- structure(list(x = c(48.97, 44.22, 44.99, 48.87, 43.82, 43.16, 38.96, 38.49, 44.98, 43.9), y = c(-119.7, -113.7, -109.3, -120.6, -109.6, -121.2, -114.2, -118.9, -109.7, -114.1), z = c(0.001216, 0.001631, 0.001801, 0.002081, 0.002158, 0.002265, 0.002298, 0.002334, 0.002349, 0.00249)), .Names = c("x", "y", "z"), row.names = c(NA, 10L), class = "data.frame")
This works for 1-8 rows:
ggplot(data = testdf[1,], aes(x,y,fill = z)) + geom_tile()
ggplot(data = testdf[1:8,], aes(x,y,fill = z)) + geom_tile()
But not for 9 rows:
ggplot(data = testdf[1:9,], aes(x,y,fill = z)) + geom_tile()
Ultimately, I am seeking a way to plot data on a non-regular grid. It is not essential that I use geom_tile, but any space-filling interpolation over the points will do.
The full dataset is available as a gist
testdf above was a small subset of the full dataset, a high-resolution raster of the US (>7500 rows)
require(RCurl) # requires libcurl; sudo apt-get install libcurl4-openssl-dev
tmp <- getURL("https://gist.github.com/raw/4635980/f657dcdfab7b951c7b8b921b3a109c7df1697eb8/test.csv")
testdf <- read.csv(textConnection(tmp))
What I have tried:
using geom_point works, but does not have the desired effect:
ggplot(data = testdf, aes(x,y,color=z)) + geom_point()
if I convert either x or y to a vector 1:10, the plot works as expected:
newdf <- transform(testdf, y =1:10)
ggplot(data = newdf[1:9,], aes(x,y,fill = z)) + geom_tile()
newdf <- transform(testdf, x =1:10)
ggplot(data = newdf[1:9,], aes(x,y,fill = z)) + geom_tile()
sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reshape2_1.2.2 maps_2.3-0 betymaps_1.0 ggmap_2.2 ggplot2_0.9.3
loaded via a namespace (and not attached):
[1] colorspace_1.2-0 dichromat_1.2-4 digest_0.6.1 grid_2.15.2 gtable_0.1.2 labeling_0.1
[7] MASS_7.3-23 munsell_0.4 plyr_1.8 png_0.1-4 proto_0.3-10 RColorBrewer_1.0-5
[13] RgoogleMaps_1.2.0.2 rjson_0.2.12 scales_0.2.3 stringr_0.6.2 tools_2.15.2
The reason you can't use geom_tile() (or the more appropriate geom_raster()) is that these two geoms rely on your tiles being evenly spaced, which they are not. You will need to coerce your data to points and resample these to an evenly spaced raster, which you can then plot with geom_raster(). You will have to accept that your original data must be resampled slightly in order to plot it as you wish.
You should also read up on raster:::projection and rgdal:::spTransform for more information on map projections.
require( RCurl )
require( raster )
require( sp )
require( ggplot2 )
tmp <- getURL("https://gist.github.com/geophtwombly/4635980/raw/f657dcdfab7b951c7b8b921b3a109c7df1697eb8/test.csv")
testdf <- read.csv(textConnection(tmp))
spdf <- SpatialPointsDataFrame( data.frame( x = testdf$y , y = testdf$x ) , data = data.frame( z = testdf$z ) )
# Plotting the points reveals the unevenly spaced nature of the points
spplot(spdf)
# You can see the uneven nature of the data even better here via the moire pattern
plot(spdf)
# Make an evenly spaced raster, the same extent as original data
e <- extent( spdf )
# Determine ratio between x and y dimensions
ratio <- ( e@xmax - e@xmin ) / ( e@ymax - e@ymin )
# Create template raster to sample to
r <- raster( nrows = 56 , ncols = floor( 56 * ratio ) , ext = extent(spdf) )
rf <- rasterize( spdf , r , field = "z" , fun = mean )
# Attributes of our new raster (# cells quite close to original data)
rf
class : RasterLayer
dimensions : 56, 135, 7560 (nrow, ncol, ncell)
resolution : 0.424932, 0.4248191 (x, y)
extent : -124.5008, -67.13498, 25.21298, 49.00285 (xmin, xmax, ymin, ymax)
# We can then plot this using `geom_tile()` or `geom_raster()`
rdf <- data.frame( rasterToPoints( rf ) )
ggplot( NULL ) + geom_raster( data = rdf , aes( x , y , fill = layer ) )
# And as the OP asked for geom_tile, this would be...
ggplot( NULL ) + geom_tile( data = rdf , aes( x , y , fill = layer ) , colour = "white" )
Of course I should add that this data is quite meaningless. What you really must do is take the SpatialPointsDataFrame, assign the correct projection information to it, transform it to lat-long coordinates via spTransform, and then rasterize the transformed points. Really, you need more information about your raster data. What you have here is a close approximation, but ultimately it is not a true reflection of the data.
This is not an answer to the geom_tile() problem, but another way to plot the data.
As you have the x and y coordinates of a 30 km grid (I assume the middle of each grid cell), you can use geom_point() to plot the data. You should select an appropriate shape= value; shape 15 plots filled squares.
Another problem is the x and y values: when plotting, they should be swapped (x = y and y = x) so that they correspond to latitude and longitude.
coord_equal() will ensure a correct aspect ratio (I found this solution, with the ratio used as an example, on the net).
ggplot(data = testdf, aes(y,x,colour=z)) + geom_point(shape=15)+
coord_equal(ratio=1/cos(mean(testdf$x)*pi/180))
Answer: the data is plotted, but the tiles are just very small.
From here:
"Tile plot as densely as possible, assuming that every tile is the same size."
Consider this plot
ggplot(data = testdf[1:2,], aes(x,y,fill = z)) + geom_tile()
There are two tiles in the plot above. geom_tile tries to make the plot as dense as possible, given that every tile is the same size. Here we can make the two tiles this big without overlapping, which leaves enough space for 4 tiles.
Have a go at the following plots and see what the resulting plots tell you:
df1 <- data.frame(x=c(1:3),y=(1:3))
# df1
# x y
#1 1 1
#2 2 2
#3 3 3
ggplot(data = df1[1,], aes(x,y)) + geom_tile()
ggplot(data = df1[1:2,], aes(x,y)) + geom_tile()
ggplot(data = df1[1:3,], aes(x,y)) + geom_tile()
compare to this example:
df2 <- data.frame(x=c(1:3),y=c(1,20,300))
df2
# x y
#1 1 1
#2 2 20
#3 3 300
ggplot(data = df2[1,], aes(x,y)) + geom_tile()
ggplot(data = df2[1:2,], aes(x,y)) + geom_tile()
ggplot(data = df2[1:3,], aes(x,y)) + geom_tile()
Note that the first two plots are the same for df1 and df2, but the third plot for df2 is different. This is because the biggest we can make the tiles is the gap between (x[1],y[1]) and (x[2],y[2]). Any bigger and they would overlap, which leaves lots of space between these two tiles and the third tile at y=300.
There is also a width parameter in geom_tile, although I am not sure how sensible it is here. Are you sure you don't fancy another option with such sparse data?
(Your full data is still plotted: see ggplot(data = testdf, aes(x,y)) + geom_tile(width=1000).)
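The shrinking tiles can be made concrete with ggplot2's resolution() helper, which returns the smallest gap between distinct values. To my understanding, geom_tile() uses this value as the default tile width and height when none is supplied, so the sketch below (using the x column of the question's testdf) suggests why adding the ninth row makes the tiles roughly ten times narrower:

```r
library(ggplot2)

# x values of the first nine rows of testdf from the question
x <- c(48.97, 44.22, 44.99, 48.87, 43.82, 43.16, 38.96, 38.49, 44.98)

# Smallest gap between distinct x values, without adding zero to the set
res_8 <- resolution(x[1:8], zero = FALSE)  # closest pair: 48.87 and 48.97
res_9 <- resolution(x[1:9], zero = FALSE)  # closest pair: 44.98 and 44.99

c(rows_1_to_8 = res_8, rows_1_to_9 = res_9)
```

With eight rows the smallest gap is about 0.1; the ninth row brings it down to about 0.01, while the x range still spans roughly 10 units, so each tile becomes too thin to see.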
If you want to use geom_tile I think you will need to aggregate first:
# NOTE: tmp.csv downloaded from https://gist.github.com/geophtwombly/4635980/raw/f657dcdfab7b951c7b8b921b3a109c7df1697eb8/test.csv
testdf <- read.csv("~/Desktop/tmp.csv")
# combine x,y coordinates by rounding
testdf$x2 <- round(testdf$x, digits=0)
testdf$y2 <- round(testdf$y, digits=0)
# aggregate on combined coordinates
library(plyr)
testdf <- ddply(testdf, c("x2", "y2"), summarize,
z = mean(z))
# plot aggregated data using geom_tile
ggplot(data = testdf, aes(y2,x2,fill=z)) +
geom_tile() +
coord_equal(ratio=1/cos(mean(testdf$x2)*pi/180)) # copied from @Didzis Elferts' answer--nice!
Once we have done all this we will probably conclude that geom_point() is better, as suggested by @Didzis Elferts.

R: round() can find object, sprintf() cannot, why?

I have a function that takes a dataframe and plots a number of columns from that data frame using ggplot2. The aes() function in ggplot2 takes a label argument and I want to use sprintf to format that argument - and this is something I have done many times before in other code. When I pass the format string to sprintf (in this case "%1.1f") it says "object not found". If I use the round() function and pass an argument to that function it can find it without problems. Same goes for format(). Apparently only sprintf() is unable to see the object.
At first I thought this was a lazy evaluation issue caused by calling the function rather than using the code inline, but using force() on the format string I pass to sprintf does not resolve the issue. I can work around this, but I would like to know why it happens. Of course, it may be something trivial that I have overlooked.
Q. Why does sprintf() not find the string object?
Code follows (edited and pruned for more minimal example)
require(gdata)
require(ggplot2)
require(scales)
require(gridExtra)
require(lubridate)
require(plyr)
require(reshape)
set.seed(12345)
# Create dummy time series data with year and month
monthsback <- 64
startdate <- as.Date(paste(year(now()),month(now()),"1",sep = "-")) - months(monthsback)
mydf <- data.frame(mydate = seq(as.Date(startdate), by = "month", length.out = monthsback), myvalue5 = runif(monthsback, min = 200, max = 300))
mydf$year <- as.numeric(format(as.Date(mydf$mydate), format="%Y"))
mydf$month <- as.numeric(format(as.Date(mydf$mydate), format="%m"))
getchart_highlight_value <- function(
plotdf,
digits_used = 1
)
{
force(digits_used)
#p <- ggplot(data = plotdf, aes(x = month(mydate, label = TRUE), y = year(mydate), fill = myvalue5, label = round(myvalue5, digits_used))) +
# note that the line below using sprintf() does not work, whereas the line above using round() is fine
p <- ggplot(data = plotdf, aes(x = month(mydate, label = TRUE), y = year(mydate), fill = myvalue5, label = sprintf(paste("%1.",digits_used,"f", sep = ""), myvalue5))) +
scale_x_date(labels = date_format("%Y"), breaks = date_breaks("years")) +
scale_y_reverse(breaks = 2007:2012, labels = 2007:2012, expand = c(0,0)) +
geom_tile() + geom_text(size = 4, colour = "black") +
scale_fill_gradient2(low = "blue", high = "red", limits = c(min(plotdf$myvalue5), max(plotdf$myvalue5)), midpoint = median(plotdf$myvalue5)) +
scale_x_discrete(expand = c(0,0)) +
opts(panel.grid.major = theme_blank()) +
opts(panel.background = theme_rect(fill = "transparent", colour = NA))
png(filename = "c:/sprintf_test.png", width = 700, height = 300, units = "px", res = NA)
print(p)
dev.off()
}
getchart_highlight_value (plotdf <- mydf,
digits_used <- 1)
Using Martin's minimal example (that is a minimal example, see also this question), you can make the code work by specifying the environment ggplot() should use. To do so, set the environment argument of ggplot(), e.g. like this:
require(ggplot2)
getchart_highlight_value <- function(df)
{
fmt <- "%1.1f"
ggplot(df, aes(x, x, label=sprintf(fmt, lbl)),
environment = environment()) +
geom_tile(bg="white") +
geom_text(size = 4, colour = "black")
}
df <- data.frame(x = 1:5, lbl = runif(5))
getchart_highlight_value (df)
The function environment() returns the current (local) environment, which is the environment created by the function getchart_highlight_value(). If you don't specify this, ggplot() will look in the global environment, and there the variable fmt is not defined.
Nothing to do with lazy evaluation, everything to do with selecting the right environment.
The code above produces the following plot:
Here's a minimal-er example
require(ggplot2)
getchart_highlight_value <- function(df)
{
fmt <- "%1.1f"
ggplot(df, aes(x, x, label=sprintf(fmt, lbl))) + geom_tile()
}
df <- data.frame(x = 1:5, lbl = runif(5))
getchart_highlight_value (df)
It fails with
> getchart_highlight_value (df)
Error in sprintf(fmt, lbl) : object 'fmt' not found
If I create fmt in the global environment then everything is fine; maybe this explains the 'sometimes it works' / 'it works for me' comments above.
> sessionInfo()
R version 2.15.0 Patched (2012-05-01 r59304)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_0.9.1
loaded via a namespace (and not attached):
[1] colorspace_1.1-1 dichromat_1.2-4 digest_0.5.2 grid_2.15.0
[5] labeling_0.1 MASS_7.3-18 memoise_0.1 munsell_0.3
[9] plyr_1.7.1 proto_0.3-9.2 RColorBrewer_1.0-5 reshape2_1.2.1
[13] scales_0.2.1 stringr_0.6
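For readers on newer versions: since ggplot2 3.0.0, aes() captures its expressions together with the environment in which they were written, so local variables such as fmt are found without any extra argument, and the environment argument of ggplot() is documented as deprecated. A minimal sketch, assuming ggplot2 3.0.0 or later; make_label_plot is a hypothetical name and the label values are made up:

```r
library(ggplot2)

make_label_plot <- function(df) {
  fmt <- "%1.1f"  # local to the function, not defined globally
  ggplot(df, aes(x, x, label = sprintf(fmt, lbl))) +
    geom_tile(fill = "white", colour = "grey50") +
    geom_text(size = 4, colour = "black")
}

df <- data.frame(x = 1:5, lbl = c(0.12, 0.34, 0.56, 0.78, 0.91))
p <- make_label_plot(df)

# Building the second layer (geom_text) evaluates the label expression
labels <- layer_data(p, 2)$label
print(labels)
```

If the labels come back formatted to one decimal place, the local fmt was found, which is the behaviour the environment = environment() workaround had to force on ggplot2 0.9.x.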

Resources