I cannot figure out how to add multiple barcharts (or, even better, piecharts) to one plot.
The simplest case would be to add two barcharts at different x,y locations onto a plane.
An application example would be to illustrate both the number of people living in a certain area, and the number of migrants (for lack of better example) living there as well.
By packaging this population information with spatial information, I hope to convey the corresponding information efficiently.
Solutions involving ggmap are fine, but I do not require them (displaying the data without a map layer in the background is acceptable).
To be more precise, here is some code that does not work as I would like. In particular, the bar charts are replaced by rectangles that are not stacked but overlap each other, so the information is displayed incorrectly.
Furthermore, at each location, the total height of each bar in the bar chart (or size of the pie, for that matter) should correspond to the sum of both parts.
require(ggplot2)
x <- c(1,2,3)
y <- c(3,2,4)
pop <- c(1,7,8)
mig <- c(1,5,2)
df <- rbind(x,y,pop,mig)
df <- t(df)
df <- data.frame(df)
# bring data in long format
require(reshape2)
tmp <- melt(df, id.vars = c("x","y"))
p <- ggplot(tmp, aes(x=x, y=y, fill = variable))
p <- p + geom_rect(aes(xmin = x, xmax = x + 0.1,
                       ymin = y, ymax = y + value))
print(p)
Eventually, this should serve as an input into a larger animation, that visualizes temporal development of the variables.
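A minimal sketch of the stacking described above, not a definitive answer: instead of starting every rectangle at `y`, each segment starts where the previous one at the same location ended, so the total bar height equals `pop + mig`. The names `long`, `ymin` and `ymax` are illustrative, and the reshaping is done in base R here rather than with `melt`.

```r
library(ggplot2)

df <- data.frame(x = c(1, 2, 3), y = c(3, 2, 4),
                 pop = c(1, 7, 8), mig = c(1, 5, 2))

# Long format in base R: one row per (location, variable)
long <- data.frame(
  x        = rep(df$x, times = 2),
  y        = rep(df$y, times = 2),
  variable = rep(c("pop", "mig"), each = nrow(df)),
  value    = c(df$pop, df$mig)
)

# Stack within each location: running total of the values on top of the base y
long$ymax <- long$y + ave(long$value, long$x, long$y, FUN = cumsum)
long$ymin <- long$ymax - long$value

p <- ggplot(long) +
  geom_rect(aes(xmin = x, xmax = x + 0.1,
                ymin = ymin, ymax = ymax, fill = variable))
print(p)
```

With this layout the `mig` segment sits directly on top of the `pop` segment at each location, so the two no longer overlap.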
I am attempting to place individual points on a plot using ggplot2, however as there are many points, it is difficult to gauge how densely packed the points are. Here, there are two factors being compared against a continuous variable, and I want to change the color of the points to reflect how closely packed they are with their neighbors. I am using the geom_point function in ggplot2 to plot the points, but I don't know how to feed it the right information on color.
Here is the code I am using:
s1 = rnorm(1000, 1, 10)
s2 = rnorm(1000, 1, 10)
data = data.frame(task_number = as.factor(c(replicate(100, 1),
                                            replicate(100, 2))),
                  S = c(s1, s2))
ggplot(data, aes(x = task_number, y = S)) + geom_point()
Which generates this plot:
However, I want it to look more like this image, but with one dimension rather than two (which I borrowed from this website: https://slowkow.com/notes/ggplot2-color-by-density/):
How do I change the colors of the first plot so it resembles that of the second plot?
I think the tricky thing about this is you want to show the original values, and evaluate the density at those values. I borrowed ideas from here to achieve that.
library(dplyr)
data = data %>%
  group_by(task_number) %>%
  # Use approxfun to interpolate the density back to
  # the original points
  mutate(dens = approxfun(density(S))(S))
ggplot(data, aes(x = task_number, y = S, colour = dens)) +
  geom_point() +
  scale_colour_viridis_c()
Result:
One could, of course, come up with a measure of proximity to neighbouring values for each value. However, wouldn't adjusting the transparency achieve basically the same goal (gauging how densely packed the points are)?
geom_point(alpha=0.03)
I am trying to plot a large heatmap, generated with ggplot, in R. Ultimately, I would like to 'polish' this heat map using Illustrator.
Sample code:
# Load packages (tidyverse)
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,100000), y = seq(1,100000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z))
Although I save the plot as a vectorized image (.pdf; that is not that large), the pdf is loading very slowly when opening. I expect that every individual point in the data frame is rendered when opening the file.
I have read other posts (e.g. Data exploration in R: display heatmap of large matrix, quickly?) that use image() to visualize matrices, however I would like to use ggplot to modify the image.
Question: How do I speed up the rendering of this plot? Is there a way (besides lowering the resolution of the plot), while keeping the image vectorized, to speed this process up? Is it possible to downsample a vectorized ggplot?
The first thing I tried was stat_summary_2d to get average binning, but it seemed slow and also created some artifacts on the right and top edges:
library(tidyverse)
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
set.seed(123)
df$z <- rnorm(nrow(df))
print(object.size(df), units = "Mb")
#15.4 Mb
ggplot(data = df, aes(x = x, y = y, z = z)) +
  stat_summary_2d(bins = c(100,100)) + # 10x downsample, in this case
  scale_x_continuous(breaks = 100*0:10) +
  labs(title = "stat_summary_2d, 1000x1000 downsampled to 100x100")
Even though this is much smaller than your suggested data, it still took about 3 seconds to plot on my machine. It also had artifacts on the top and right edges, I presume because those edge bins are smaller, leaving more variation.
It got slower from there when I tried a larger grid like you are requesting.
(As an aside, it may be worth clarifying that a vector graphic file like a PDF, unlike a raster graphic, can be resized without loss of resolution. However, in this use case, the output is a 10,000-megapixel raster image, far beyond the limits of human perception, that is being exported into a vector format, where each "pixel" becomes a very tiny rectangle in the PDF. That use of a vector format could be useful for certain unusual cases, such as blowing up your heatmap without loss of resolution onto a gigantic surface, like a football field. But it sounds like in this case it might be the wrong tool for the job, since you're putting heaps of data into the vector file that won't be perceptible.)
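To make the scale in that aside concrete, here is a rough back-of-the-envelope sketch. The per-row byte count is an assumption (two 4-byte integer columns from `expand.grid` plus one 8-byte double for `z`), and the variable names are illustrative.

```r
# Cell count for the grid in the question (100,000 x 100,000)
side <- 100000
cells <- side^2            # one tiny rectangle per cell in a vector file
megapixels <- cells / 1e6  # the "10,000 megapixel" figure

# Rough in-memory size of the data frame, under the assumption stated above
bytes_per_row <- 4 + 4 + 8
approx_gb <- cells * bytes_per_row / 1024^3

print(c(cells = cells, megapixels = megapixels, approx_gb = approx_gb))
```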
What worked more efficiently was to do the averaging with dplyr before ggplot. With that, I could take a 10k x 10k array and downsample it 100x before sending to ggplot. This necessarily reduces the resolution, but I don't understand the value in this use case of preserving resolution beyond human abilities to perceive it.
Here's some code to do the bucketing ourselves and then plot the downsampled version:
# Using 10k x 10k array, 1527.1 Mb when initialized
downsample <- 100
df2 <- df %>%
  group_by(x = downsample * round(x / downsample),
           y = downsample * round(y / downsample)) %>%
  summarise(z = mean(z))
ggplot(df2, aes(x = x, y = y)) +
  geom_raster(aes(fill = z)) +
  scale_x_continuous(breaks = 1000*0:10) +
  labs(title = "10,000x10,000 downsampled to 100x100")
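One detail of the bucketing above is worth checking: because `round()` centres each bin on a multiple of `downsample`, the first and last bins cover only about half as many values as the interior ones, and there is one more bin per axis than the title suggests. The sketch below counts the bins on a single axis; the `ceiling()` variant is my own assumption for getting equal-sized bins and is not part of the original answer.

```r
x <- 1:10000
downsample <- 100

# Bucketing as in the answer: bins centred on 0, 100, ..., 10000
rounded <- downsample * round(x / downsample)
n_rounded <- length(unique(rounded))   # 101 bins, with smaller edge bins
edge_counts <- table(rounded)[c("0", "10000")]

# Assumed alternative: bins ending at 100, 200, ..., 10000
ceil <- downsample * ceiling(x / downsample)
n_ceil <- length(unique(ceil))         # 100 bins of 100 values each

print(n_rounded)
print(edge_counts)
print(n_ceil)
```

Smaller edge bins average over fewer values, which may be related to the extra variation seen at the edges of a binned plot.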
Your reproducible example just shows noise so it's hard to know what kind of output you would like.
One way would be to follow @dww's suggestion and use geom_hex to show aggregated data.
Another way, as you ask "Is it possible to downsample a vectorized ggplot?", is to use dplyr::sample_frac or dplyr::sample_n in the data argument of your geom_raster. I have to take a smaller sample than in your example though or I can't build the df.
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z), . %>% sample_frac(0.1))
If you want to start from your high-resolution ggplot object, you can do the following for the same effect:
gg <- ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z))
gg$data <- sample_frac(gg$data,0.1)
gg
I'm plotting a dense scatter plot in ggplot2 where each point might be labeled by a different color:
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
df$label <- c("a")
df$label[50] <- "point"
df$size <- 2
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size))
When I do this, the scatter point labeled "point" (green) is plotted on top of the red points which have the label "a". What controls this z ordering in ggplot, i.e. what controls which point is on top of which?
For example, what if I wanted all the "a" points to be on top of all the points labeled "point" (meaning they would sometimes partially or fully hide that point)? Does this depend on alphanumerical ordering of labels?
I'd like to find a solution that can be translated easily to rpy2.
2016 Update:
The order aesthetic has been deprecated, so at this point the easiest approach is to sort the data.frame so that the green point is at the bottom, and is plotted last. If you don't want to alter the original data.frame, you can sort it during the ggplot call - here's an example that uses %>% and arrange from the dplyr package to do the on-the-fly sorting:
library(dplyr)
ggplot(df %>%
         arrange(label),
       aes(x = x, y = y, color = label, size = size)) +
  geom_point()
Original 2015 answer for ggplot2 versions < 2.0.0
In ggplot2, you can use the order aesthetic to specify the order in which points are plotted. The last ones plotted will appear on top. To apply this, you can create a variable holding the order in which you'd like points to be drawn.
To put the green dot on top by plotting it after the others:
df$order <- ifelse(df$label=="a", 1, 2)
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, order=order))
Or to plot the green dot first and bury it, plot the points in the opposite order:
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, order=-order))
For this simple example, you can skip creating a new sorting variable and just coerce the label variable to a factor and then a numeric:
ggplot(df) +
  geom_point(aes(x=x, y=y, color=label, size=size, order=as.numeric(factor(df$label))))
ggplot2 will create plots layer-by-layer and within each layer, the plotting order is defined by the geom type. The default is to plot in the order that they appear in the data.
Where this is different, it is noted. For example
geom_line
Connect observations, ordered by x value.
and
geom_path
Connect observations in data order
There are also known issues regarding the ordering of factors, and it is interesting to note the response of the package author Hadley
The display of a plot should be invariant to the order of the data frame - anything else is a bug.
With this quote in mind, a layer is drawn in the specified order, so overplotting can be an issue, especially when creating dense scatter plots. So if you want a consistent plot (and not one that relies on the order in the data frame), you need to think a bit more.
Create a second layer
If you want certain values to appear above other values, you can use the subset argument to create a second layer to definitely be drawn afterwards. You will need to explicitly load the plyr package so .() will work.
set.seed(1234)
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
df$label <- c("a")
df$label[50] <- "point"
df$size <- 2
library(plyr)
ggplot(df) + geom_point(aes(x = x, y = y, color = label, size = size)) +
  geom_point(aes(x = x, y = y, color = label, size = size),
             subset = .(label == 'point'))
Update
In ggplot2_2.0.0, the subset argument is deprecated. Use e.g. base::subset to select relevant data specified in the data argument. And no need to load plyr:
ggplot(df) +
  geom_point(aes(x = x, y = y, color = label, size = size)) +
  geom_point(data = subset(df, label == 'point'),
             aes(x = x, y = y, color = label, size = size))
Or use alpha
Another approach to avoid the problem of overplotting would be to set the alpha (transparency) of the points. This will not be as effective as the explicit second layer approach above; however, with judicious use of scale_alpha_manual you should be able to get something to work.
e.g.
# set alpha = 1 (no transparency) for your point(s) of interest
# and a low value otherwise
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, alpha = label)) +
  scale_alpha_manual(guide='none', values = c(a = 0.2, point = 1))
The fundamental question here can be rephrased like this:
How do I control the layers of my plot?
In the 'ggplot2' package, you can do this quickly by splitting each different layer into a different command. Thinking in terms of layers takes a little bit of practice, but it essentially comes down to what you want plotted on top of other things. You build from the background upwards.
Prep: Prepare the sample data. This step is only necessary for this example, because we don't have real data to work with.
# Establish random seed to make data reproducible.
set.seed(1)
# Generate sample data.
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
# Initialize 'label' and 'size' default values.
df$label <- "a"
df$size <- 2
# Label and size our "special" point.
df$label[50] <- "point"
df$size[50] <- 4
You may notice that I've added a different size to the example just to make the layer difference clearer.
Step 1: Separate your data into layers. Always do this BEFORE you use the 'ggplot' function. Too many people get stuck by trying to do data manipulation from within the 'ggplot' functions. Here, we want to create two layers: one with the "a" labels and one with the "point" labels.
df_layer_1 <- df[df$label=="a",]
df_layer_2 <- df[df$label=="point",]
You could do this with other functions, but I'm just quickly using the data frame matching logic to pull the data.
Step 2: Plot the data as layers. We want to plot all of the "a" data first and then plot all the "point" data.
ggplot() +
  geom_point(
    data=df_layer_1,
    aes(x=x, y=y),
    colour="orange",
    size=df_layer_1$size) +
  geom_point(
    data=df_layer_2,
    aes(x=x, y=y),
    colour="blue",
    size=df_layer_2$size)
Notice that the base plot layer ggplot() has no data assigned. This is important, because we are going to override the data for each layer. Then, we have two separate point geometry layers geom_point(...) that use their own specifications. The x and y axis will be shared, but we will use different data, colors, and sizes.
It is important to move the colour and size specifications outside of the aes(...) function, so we can specify these values literally. Otherwise, the 'ggplot' function will usually assign colors and sizes according to the levels found in the data. For instance, if you have size values of 2 and 5 in the data, it will assign a default size to any occurrences of the value 2 and will assign some larger size to any occurrences of the value 5. An 'aes' function specification will not use the values 2 and 5 for the sizes. The same goes for colors. I have exact sizes and colors that I want to use, so I move those arguments into the 'geom_point' function itself. Also, any specifications in the 'aes' function will be put into the legend, which is often not useful.
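The difference between mapping a value inside aes(...) and setting it literally can be inspected with `layer_data()`, which returns the values a layer will actually draw. This is only a small sketch with made-up data; `demo`, `p_mapped` and `p_literal` are illustrative names.

```r
library(ggplot2)

# Made-up data with the size values 2 and 5 from the explanation above
demo <- data.frame(x = 1:3, y = 1:3, size = c(2, 2, 5))

# Size mapped inside aes(): the values are passed through a size scale
p_mapped <- ggplot(demo, aes(x, y)) + geom_point(aes(size = size))

# Size set outside aes(): the values are used literally
p_literal <- ggplot(demo, aes(x, y)) + geom_point(size = demo$size)

mapped_sizes  <- layer_data(p_mapped)$size
literal_sizes <- layer_data(p_literal)$size

print(mapped_sizes)   # rescaled by the size scale, not 2, 2, 5
print(literal_sizes)  # the literal values 2, 2, 5
```

If mapped values should be used as-is while still going through aes(...), `scale_size_identity()` is another option.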
Final note: In this example, you could achieve the wanted result in many ways, but it is important to understand how 'ggplot2' layers work in order to get the most out of your 'ggplot' charts. As long as you separate your data into different layers before you call the 'ggplot' functions, you have a lot of control over how things will be graphed on the screen.
It's plotted in order of the rows in the data.frame. Try this:
df2 <- rbind(df[-50,],df[50,])
ggplot(df2) + geom_point(aes(x=x, y=y, color=label, size=size))
As you see the green point is drawn last, since it represents the last row of the data.frame.
Here is a way to order the data.frame to have the green point drawn first:
df2 <- df[order(-as.numeric(factor(df$label))),]
I am trying to build a parallel coordinate diagram in R for showing the difference in ranking in different age groups. And I want to have a fixed scale on the Y axis for showing the values.
Here is a PC plot :
The goal is to see the slopes of the lines really well. So if I have value 1 that is bound with the value 1000, I want to see the line going aaall the way down steeply.
In R so far, if I have values that are too big, my plot is all squished so everything fits and it's hard to visualize anything.
My code for drawing the parallel coordinate plot is the following so far:
pc_18_34 <- read.table("parCoordData_18_24_25_34.csv", header=FALSE, sep="\t")
#name columns of data frame
colnames(pc_18_34) = c("18-25","25-34")
#build the parallel coordinate plot
# doc : http://docs.ggplot2.org/current/geom_path.html
group <- rep(c("Top 10", "Top 10-29", "Top 30-49"), each = 18)
df <- data.frame(id = seq_along(group), group, pc_18_34[,1], pc_18_34[,2])
colnames(df)[3] = "18-25"
colnames(df)[4] = "25-34"
library(reshape2) # for melt
dfm <- melt(df, id.var = c("id", "group"))
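colnames(dfm)[3] = "AgeGroup"
colnames(dfm)[4] = "ArtistRank"
# sort only after renaming (ArtistRank does not exist before this point) and keep the result
dfm <- dfm[order(dfm$group,dfm$ArtistRank,decreasing=TRUE),]
# ggplot() has no 'main' argument, so the title is set with ggtitle()
ggplot(dfm, aes(x=AgeGroup, y=ArtistRank, group = id, colour = group)) +
  geom_path(alpha = 0.5, size=1) + geom_path(aes(color=group)) +
  ggtitle("Tops across age groups")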
I have looked into how to change the scales in ggplot, using libraries like scales, but when I add a scale layer, the diagram doesn't even show up anymore.
Any thoughts on how to use a fixed scale (say, a difference of 1 in rank shown as 5px in the plot), even if it means that the plot is very tall?
Thaanks !! :)
You can set the panel height to an absolute size based on the number of axis breaks. Note that the device won't scale automatically, so you'll have to adjust it manually for your plot to fit well.
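library(ggplot2)
library(gtable)
library(grid)  # unit(), grid.newpage() and grid.draw() come from grid
p <- ggplot(Loblolly, aes(height, factor(age))) +
  geom_point()
gb <- ggplot_build(p)
gt <- ggplot_gtable(gb)
# number of y-axis breaks; this internal path matches the ggplot2 version in use
# when this was written and may differ in other versions
n <- length(gb$panel$ranges[[1]]$y.major_source)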
# locate the panel in the gtable layout
panel <- gt$layout$t[grepl("panel", gt$layout$name)]
# assign new height to the panels, based on the number of breaks
gt$heights[panel] <- list(unit(n*25,"pt"))
grid.newpage()
grid.draw(gt)
I refer to the answer for this question and have an additional question.
I have modified the code as below:
library(ggplot2)
ids <- letters[1:2]
# IDs and values to use for fill colour
values <- data.frame(
  id = ids,
  value = c(4,5)
)
# Polygon position
positions <- data.frame(
  id = c(rep(ids, each = 10),rep("b",5)),
  #     shape       hole        shape          hole
  x = c(1,4,4,1,1,  2,2,3,3,2,  5,10,10,5,5,   6,6,7,7,6,  8,8,9,9,8),
  y = c(1,1,4,4,1,  2,3,3,2,2,  5,5,10,10,5,   6,7,7,6,6,  8,9,9,8,8)
)
# Merge positions and values
datapoly <- merge(values, positions, by=c("id"))
chart <- ggplot(datapoly, aes(x=x, y=y)) +
  geom_polygon(aes(group=id, fill=factor(value)),colour="grey") +
  scale_fill_discrete("Key")
And gives the following output:
There is a line passing through the two colored boxes, which I don't like. How can I remove it? Thanks.
The solution I came up with years ago for drawing holes is to make sure that after each hole your x,y coordinates return to the same place. This stops the line buzzing all around and crossing other polygons and leaving open areas that the winding number algorithm doesn't fill (or does fill when it shouldn't).
So, if you have a data set where the first 27 points are your outer, and then you've got three holes of 5, 6, and 7 points, construct a new dataset which is:
newdata = data[c(1:27,28:32,27,33:38,27,39:45,27),] # untested
Note how it jumps back to point 27 after each hole. Make sure your holes go in the clockwise direction (I think).
Then draw using newdata but only filling, not drawing outlines. If you want outlines, add them later (using the original data grouped by ring id)
You can sometimes get very very thin artifacts where the outgoing line to the hole isn't quite drawn the same as the incoming line, but they are hardly noticeable. Blame Bresenham.
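The index construction described above can be sketched as a small helper. `bridge_index` is a hypothetical name, not an existing function, and it assumes the rows are ordered as the outer ring first, followed by each hole in turn.

```r
# Hypothetical helper: row indices that return to the last outer point
# after each hole, as in the 27 / 5 / 6 / 7 example above
bridge_index <- function(n_outer, hole_sizes) {
  idx <- seq_len(n_outer)
  start <- n_outer + 1
  for (h in hole_sizes) {
    # the rows of this hole, then a jump back to the last outer point
    idx <- c(idx, seq(start, length.out = h), n_outer)
    start <- start + h
  }
  idx
}

idx <- bridge_index(27, c(5, 6, 7))
print(idx)
```

The result could then be used as `newdata = data[idx, ]`, in the same way as the hand-written index vector.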
Try this one
ggplot(datapoly, aes(x=x, y=y)) +
  geom_polygon(aes(group=id, fill=factor(value))) +
  scale_fill_discrete("Key")