I am trying to plot a large heatmap, generated with ggplot, in R. Ultimately, I would like to 'polish' this heat map using Illustrator.
Sample code:
# Load packages (tidyverse)
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,100000), y = seq(1,100000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
geom_raster(aes(fill = z))
Although I save the plot as a vectorized image (.pdf, which is not that large), the PDF loads very slowly when opened. I suspect that every individual point in the data frame is rendered when the file is opened.
I have read other posts (e.g. Data exploration in R: display heatmap of large matrix, quickly?) that use image() to visualize matrices; however, I would like to use ggplot so that I can modify the image.
Question: How do I speed up the rendering of this plot? Is there a way, besides lowering the resolution of the plot, to speed this process up while keeping the image vectorized? Is it possible to downsample a vectorized ggplot?
The first thing I tried was stat_summary_2d to get average binning, but it seemed slow and also created some artifacts on the right and top edges:
library(tidyverse)
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
set.seed(123)
df$z <- rnorm(nrow(df))
print(object.size(df), units = "Mb")
#15.4 Mb
ggplot(data = df, aes(x = x, y = y, z = z)) +
stat_summary_2d(bins = c(100,100)) + #10x downsample, in this case
scale_x_continuous(breaks = 100*0:10) +
labs(title = "stat_summary_2d, 1000x1000 downsampled to 100x100")
Even though this is much smaller than your suggested data, it still took about 3 seconds to plot on my machine, and had artifacts on the top and right edges, presumably because the bins at those edges are smaller and therefore show more variation.
It got slower from there when I tried a larger grid like you are requesting.
(As an aside, it may be worth clarifying that a vector graphic file like a PDF, unlike a raster graphic, can be resized without loss of resolution. In this use case, however, the output is effectively a 10,000-megapixel raster image, far beyond the limits of human perception, exported into a vector format where each "pixel" becomes a very tiny rectangle in the PDF. That use of a vector format could be useful in certain unusual cases, for instance if you ever need to blow up your heatmap without loss of resolution onto a gigantic surface like a football field. But in this case it sounds like the wrong tool for the job, since you're putting heaps of data into the vector file that won't be perceptible.)
What worked more efficiently was to do the averaging with dplyr before ggplot. With that, I could take a 10k x 10k array and downsample it 100x before sending to ggplot. This necessarily reduces the resolution, but I don't understand the value in this use case of preserving resolution beyond human abilities to perceive it.
Here's some code to do the bucketing ourselves and then plot the downsampled version:
# Using 10k x 10k array, 1527.1 Mb when initialized
downsample <- 100
df2 <- df %>%
group_by(x = downsample * round(x / downsample),
y = downsample * round(y / downsample)) %>%
summarise(z = mean(z))
ggplot(df2, aes(x = x, y = y)) +
geom_raster(aes(fill = z)) +
scale_x_continuous(breaks = 1000*0:10) +
labs(title = "10,000x10,000 downsampled to 100x100")
Your reproducible example just shows noise so it's hard to know what kind of output you would like.
One way would be to follow #dww's suggestion and use geom_hex to show aggregated data.
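For instance, a minimal sketch (not #dww's exact code) using stat_summary_hex, the companion to geom_hex that averages a z value per hexagon instead of counting points; the bins argument controls how coarse the hexagons are, and the hexbin package needs to be installed:
library(tidyverse)
df <- expand.grid(x = seq(1, 1000), y = seq(1, 1000))
set.seed(123)
df$z <- rnorm(nrow(df))
# one hexagon per bin, filled with the mean of z inside it (requires the hexbin package)
ggplot(df, aes(x = x, y = y, z = z)) +
  stat_summary_hex(bins = 50)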
Another way, as you ask "Is it possible to downsample a vectorized ggplot?", is to use dplyr::sample_frac or dplyr::sample_n in the data argument of your geom_raster. I have to take a smaller sample than in your example, though, or I can't build the df.
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
geom_raster(aes(fill = z), . %>% sample_frac(0.1))
If you want to start from your high-resolution ggplot object, you can do the following for the same effect:
gg <- ggplot(data = df, aes(x = x, y = y)) +
geom_raster(aes(fill = z))
gg$data <- sample_frac(gg$data,0.1)
gg
I started learning R for data analysis and, most importantly, for data visualisation.
Since I am still in the switching process, I am trying to reproduce in R the activities I was doing with GraphPad Prism or Origin Pro. In most cases everything was smooth, but I could not find a smart solution for plotting multiple y columns in a single graph.
What I usually get from the software I use for data visualisation looks like this:
Each single black trace is a measurement, and I would like to obtain the same plot in R. In Prism or Origin, this takes a single copy-paste into an XY graph.
I exported the matrix of data (one X, which indicates the time, and multiple Y values, which are the traces you see in the image).
I imported my data in R with the following commands:
library(ggplot2) #loaded ggplot2
Data <- read.csv("Directory/File.txt", header=F, sep="") #imported data
DF <- data.frame(Data) #transformed data into data frame
If I plot my data now, I obtain a series of columns, where the first one (called V1) is the X axis and all the others (V2 to V140) are the traces I want to put on the same graph.
To plot the data, I tried different solutions:
ggplot(data=DF, aes(x=DF$V1, y=DF[V2:V140]))+geom_line()+theme_bw() #did not work
plot(DF, xy.coords(x=DF$V1, y=DF$V2:V140)) #gives me an error
plot(DF, xy.coords(x=V1, y=c(V2:V10))) #gives me an error
I tried the matplot, without success, following the EZH guide:
The code I used is the following: matplot(x=DF$V1, type="l", lty = 2:100)
The only solution I found would be to write a separate plotting command for each single column, but that is a crazy solution. The number of columns varies among my data, and manually entering commands for 140 columns is insane.
What would you suggest?
Thank you in advance.
Here are also some data attached. Data: single X, multiple Y
I tried using matplot(). I used very simple sample data with no trend at all, so the output from my code will look terrible, but my main focus is on the code. Since you have already tried matplot(), just recheck against the solution below to see whether you had done it right!
set.seed(100)
# 500 x 100 matrix of random values: first column is x, the rest are y1..y99
df = matrix(sample(1:685765, 50000, replace = T), ncol = 100)
colnames(df) = c("x", paste0("y", 1:99))
dt = as.data.frame(df)
# plot all y columns against x as lines in a single call
matplot(dt[["x"]], y = dt[, paste0("y", 1:99)], type = "l")
If you want to plot in base R, you have to make a plot and add lines one at a time; however, that isn't hard to do.
We start by making some sample data. Since the data in the link seemed to all be on the same scale, I will assume your data frame only has y values and the x values are stored separately.
plotData <- as.data.frame(matrix(sort(rnorm(500)),ncol = 5))
xval <- sort(sample(200, 100))
Now we can initialize a plot with the first column.
plot(xval, plotData[[1]], type = "l",
ylim = c(min(plotData), max(plotData)))
type = "l" makes a line plot instead of a scatter plot
ylim = c(min(plotData), max(plotData)) makes sure the y-axis will fit all the data.
Now we can add the rest of the values.
apply(plotData[-1], 2, lines, x = xval)
plotData[-1] removes the column we already plotted,
apply function with 2 as the second parameter means we want to execute a function on every column,
lines defines the function we are applying to the columns. lines adds a new line to the current plot.
x = xval passes an extra parameter (x) to the lines function.
If you want to plot the data using ggplot2, the data should be transformed to long format:
library(ggplot2)
library(reshape2)
dat <- read.delim('AP.txt', header = F)
# plotting only first 9 traces
# my RStudio will crash if I plot the full data
df <- melt(dat[1:10], id.vars = 'V1')
ggplot(df, aes(x = V1, y = value, color = variable)) + geom_line()
# if you want all traces to be in same colour, you can use
ggplot(df, aes(x = V1, y = value, group = variable)) + geom_line()
I am building on a question similar to mine (from which I borrowed the self-contained example and title inspiration). I am trying to apply transparency individually to each line of a ggparcoord plot, or somehow add two ggparcoord layers on top of each other. The detailed description of the problem, and the format of data I need the solution to work with, are provided below.
I have a dataset with thousands of lines, let's call it x.
library(GGally)
x = data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
After clustering this data I also get a set of 5 lines, let's call this dataset y.
y = data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
In order to see the centroids y overlaying x, I use the following code. First I add y to x such that the 5 rows are at the bottom of the final dataframe. This ensures ggparcoord will draw them last, so they stay on top of all the data:
df <- rbind(x,y)
Next I create a new column for df, following the advice in the question I referred to, so that I can color the centroids differently and tell them apart from the data:
df$cluster = "data"
df$cluster[(nrow(df)-4):(nrow(df))] <- "centroids"
Finally I plot it:
p <- ggparcoord(df, columns=1:4, groupColumn=5, scale="globalminmax", alphaLines = 0.99) + xlab("Sample") + ylab("log(Count)")
p + scale_colour_manual(values = c("data" = "grey","centroids" = "#94003C"))
The problem I am stuck with starts from this stage onwards. On my original data, plotting solely x doesn't lead to much insight, since it is a heavy load of lines (on this data, this is equivalent to using the ggparcoord call above on x instead of df):
By reducing alphaLines considerably (to 0.05), I can naturally see some clusters due to the overlapping of the lines (this is again running ggparcoord on x, with alphaLines reduced):
It makes more sense to observe the centroids added to df on top of the second plot, not the first.
However, since everything is in a single dataframe, applying such a high transparency (low alphaLines) makes the centroid lines disappear as well. My only option is then to use ggparcoord (as provided above) on df without decreasing alphaLines:
My goal is to have the red lines (centroid lines) on top of the second figure (the one with very low alpha). There are two ways I have thought of so far, but I couldn't get either working:
(1) Is there any way to create a column on the dataframe, similar to what is done for the color, such that I can specify the alpha value for each line?
(2) I originally attempted to create two different ggparcoord plots and "sum them up", hoping to overlay them, but an error was raised.
The question may contain too much detail, but I thought this could better motivate the applicability of the answer and serve the interest of other readers.
The answer I am looking for would use the provided data variables in their current format and generate the plot I am after. Better ways to reconstruct the data are also welcome, but using the current structure is preferred.
In this case I think it is easier to just use ggplot and build the graph yourself. We make slight adjustments to how the data is represented (we put it in long format), and then we make the parallel coordinates plot. We can now map any aesthetic you like to cluster.
library(dplyr)
library(tidyr)
library(ggplot2)
# I start the same as you
x <- data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
y <- data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
# I find this an easier way to combine the two data.frames, and have an id column
df <- bind_rows(data = x, centroids = y, .id = 'cluster')
# We need to add id's, so we know which points to connect with a line
df$id <- 1:nrow(df)
# Put the data into long format
df2 <- gather(df, 'column', 'value', a:d)
# And plot:
ggplot(df2, aes(column, value, alpha = cluster, color = cluster, group = id)) +
geom_line() +
scale_colour_manual(values = c("data" = "grey", "centroids" = "#94003C")) +
scale_alpha_manual(values = c("data" = 0.2, "centroids" = 1)) +
theme_minimal()
I am relatively new to ggplot, so please forgive me if some of my problems are really simple or not solvable at all.
What I am trying to do is generate a "Heat Map" of a country where the filling of the shape is continous. Furthermore I have the shape of the country as .RData. I used hadley wickham's script to transform my SpatialPolygon data into a data frame. The long and lat data of my data frame now looks like this
head(my_df)
long lat group
6.527187 51.87055 0.1
6.531768 51.87206 0.1
6.541202 51.87656 0.1
6.553331 51.88271 0.1
This long/lat data draws the outline of Germany. The rest of the data frame is omitted here since I think it is not needed. I also have a second data frame of values for certain long/lat points, which looks like this:
my_fixed_points
long lat value
12.817 48.917 0.04
8.533 52.017 0.034
8.683 50.117 0.02
7.217 49.483 0.0542
What I would like to do now is colour each point of the map according to an average value over all the fixed points that lie within a certain distance of that point. That way I would get an (almost) continuous colouring of the whole map of the country.
What I have so far is the map of the country plotted with ggplot2
ggplot(my_df,aes(long,lat)) + geom_polygon(aes(group=group), fill="white") +
geom_path(color="white",aes(group=group)) + coord_equal()
My first idea was to generate points that lie within the map that has been drawn and then calculate the value for every generated point my_generated_point like so:
value_vector <- subset(my_fixed_points,
spDistsN1(cbind(my_fixed_points$long, my_fixed_points$lat),
c(my_generated_point$long, my_generated_point$lat), longlat=TRUE) < 50,
select = value)
point_value <- mean(value_vector)
I haven't found a way to generate these points, though. And as with the whole problem, I don't even know if it is possible to solve it this way. My question now is whether there exists a way to generate these points and/or whether there is another way to arrive at a solution.
Solution
Thanks to Paul I almost got what I wanted. Here is an example with sample data for the Netherlands.
library(ggplot2)
library(sp)
library(automap)
library(rgdal)
library(scales)
#get the spatial data for the Netherlands
con <- url("http://gadm.org/data/rda/NLD_adm0.RData")
print(load(con))
close(con)
#transform them into the right format for autoKrige
gadm_t <- spTransform(gadm, CRS=CRS("+proj=merc +ellps=WGS84"))
#generate some random values that serve as fixed points
value_points <- spsample(gadm_t, type="stratified", n = 200)
values <- data.frame(value = rnorm(dim(coordinates(value_points))[1], 0 ,1))
value_df <- SpatialPointsDataFrame(value_points, values)
#generate a grid that can be estimated from the fixed points
grd = spsample(gadm_t, type = "regular", n = 4000)
kr <- autoKrige(value~1, value_df, grd)
dat = as.data.frame(kr$krige_output)
#draw the generated grid with the underlying map
ggplot(gadm_t, aes(long, lat)) + geom_polygon(aes(group = group), fill = "white") +
  geom_path(color = "white", aes(group = group)) + coord_equal() +
  geom_tile(aes(x = x1, y = x2, fill = var1.pred), data = dat) +
  scale_fill_continuous(low = "white", high = muted("orange"), name = "value")
I think what you want is something along these lines. I predict that this homebrew is going to be terribly inefficient for large datasets, but it works on a small example dataset. I would look into kernel densities and maybe the raster package. But maybe this suits you well...
The following snippet of code calculates the mean value of cadmium concentration of a grid of points overlaying the original point dataset. Only points closer than 1000 m are considered.
library(sp)
library(ggplot2)
library(scales) # for muted(), used in the fill scales below
loadMeuse()     # load the meuse example data shipped with sp
# Generate a grid to sample on
bb = bbox(meuse)
grd = spsample(meuse, type = "regular", n = 4000)
# Come up with mean cadmium value
# of all points < 1000m.
mn_value = sapply(1:length(grd), function(pt) {
d = spDistsN1(meuse, grd[pt,])
return(mean(meuse[d < 1000,]$cadmium))
})
# Make a new object
dat = data.frame(coordinates(grd), mn_value)
ggplot(aes(x = x1, y = x2, fill = mn_value), data = dat) +
geom_tile() +
scale_fill_continuous(low = "white", high = muted("blue")) +
coord_equal()
which leads to the following image:
An alternative approach is to use an interpolation algorithm. One example is kriging. This is quite easy using the automap package (spot the self promotion :), I wrote the package):
library(automap)
kr = autoKrige(cadmium~1, meuse, meuse.grid)
dat = as.data.frame(kr$krige_output)
ggplot(aes(x = x, y = y, fill = var1.pred), data = dat) +
geom_tile() +
scale_fill_continuous(low = "white", high = muted("blue")) +
coord_equal()
which leads to the following image:
However, without knowledge as to what your goal is with this map, it is hard for me to see what you want exactly.
This slideshow offers another approach--see page 18 for a description of the approach and page 21 for a view of what the results looked like for the slide-maker.
Note however that the slide-maker used the sp package and the spplot function rather than ggplot2 and its plotting functions.
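For readers who have not used spplot, a minimal, generic example on the meuse data that ships with sp (this is just an illustration, not the slide-maker's code) looks roughly like this:
library(sp)
data(meuse)
coordinates(meuse) <- ~x + y  # promote the data frame to a SpatialPointsDataFrame
# lattice-based plot of one attribute column, coloured by value
spplot(meuse, "zinc")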
The options for 2D plots of (x,y,z) in R are a bit numerous. However, grappling with the options is a bit of a challenge, especially in the case that all three are continuous.
To clarify the problem (and possibly assist in explaining why I might be getting tripped up with contour or image), here is a possible classification scheme:
Case 1: The value of z is not provided but is a conditional density based on values in (x,y). (Note: this is essentially relegating the calculation of z to a separate function - a density estimation. Something still has to use the output of that calculation, so allowing for arbitrary calculations would be nice.)
Case 2: (x,y) pairs are unique and regularly spaced. This implies that only one value of z is provided per (x,y) value.
Case 3: (x,y) pairs are unique, but are continuous. Coloring or shading is still determined by only 1 unique z value.
Case 4: (x,y) pairs are not unique, but are regularly spaced. Coloring or shading is determined by an aggregation function on the z values.
Case 5: (x,y) pairs are not unique, are continuous. Coloring / shading must be determined by an aggregation function on the z values.
If I am missing some cases, please let me know. The case that interests me is #5. Some notes on relationships:
Case #1 seems to be well supported already.
Case #2 is easily supported by heatmap, image, and functions in ggplot.
Case #3 is supported by base plot, though use of a color gradient is left to the user.
Case #4 can become case #2 by use of split & apply functionality (a tiny sketch of what I mean follows this list). I have done that before.
Case #5 can be converted to #4 (and then #2) by using cut, but this is inelegant and boxy. Hex binning may be better, though that does not seem to be easily conditioned on whether there is a steep gradient in the value of z. I'd settle for hex binning, but alternative aggregation functions are quite welcome, especially if they can utilize the z values.
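For concreteness, here is a minimal sketch of that split & apply step on made-up data, collapsing repeated (x,y) pairs to a single aggregated z (my real data and aggregation function may differ):
library(ggplot2)
# made-up case #4 data: each (x, y) pair on a regular grid appears 4 times
d <- data.frame(x = rep(1:10, each = 40), y = rep(1:10, 40), z = rnorm(400))
# split by (x, y) and apply mean(): now one z per cell, i.e. case #2
d2 <- aggregate(z ~ x + y, data = d, FUN = mean)
ggplot(d2, aes(x, y, fill = z)) + geom_tile()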
How can I do #5? Here is code to produce a saddle, though the value of spread changes the spread of the z value, which should create differences in plotting gradients.
N = 1000
spread = 0.6 # Vals: 0.6, 3.0
set.seed(0)
rot = matrix(rnorm(4), ncol = 2)
mat0 = matrix(rnorm(2 * N), ncol = 2)
mat1 = mat0 %*% rot
zMean = mat0[,2]^2 - mat0[,1]^2
z = rnorm(N, mean = zMean, sd = spread * median(abs(zMean)))
I'd like to do something like hexbin, but I've banged on this with ggplot and haven't made much progress. If I can apply an arbitrary aggregation function to the z values in a region, that would be even better. (The form of such a function might be like plot(mat1, colorGradient = f(z), aggregation = "bin", bins = 50).)
How can I do this in ggplot or another package? I am happy to make this question a community wiki question (or other users can, by editing it enough times). If so, one answer per post, please, so that we can focus on, say, ggplot, levelplot, lattice, contourplot (or image), and other options, if they exist.
Update 1: The volcano example is a good example of case #3: the data is regularly spaced (it could be lat/long), with one z value per observation. A topographic map has (latitude, longitude, altitude), and thus one value per location. Suppose one is obtaining weather (e.g. rainfall, windspeed, sunlight) over many days for many randomly placed sensors: that is more akin to #5 than to #3 - we may have lat & long, but the z values can range quite a bit, even for the same or nearby (x,y) values.
Update 2: The answers so far, by DWin, Kohske, and John Colby, are all excellent. My actual data set is a small sample of a larger set, but at 200K points it produces interesting results. On the (x,y) plane, it has very high density in some regions (thus, overplotting would occur in those areas) and much lower density or complete absence in other regions. With John's suggestion via fields, I needed to subsample the data for Tps to work out (I'll investigate if I can do it without subsampling), but the results are quite interesting. Trying rms/Hmisc (DWin's suggestion), the full 200K points seem to work out well. Kohske's suggestion is quite good, and, as the data is transformed into a grid before plotting, there's no issue with the number of input data points. It also gives me greater flexibility to determine how to aggregate the z values in the region. I am not yet sure if I will use mean, median, or some other aggregation.
I also intend to try out Kohske's nice example of mutate + ddply with the other methods - it is a good example of how to get different statistics calculated over a given region.
Update 3: The different methods are distinct and several are remarkable, though there isn't a clear winner. I selected John Colby's answer as the first. I think I will use that or DWin's method in further work.
I've had great luck with the fields package for this type of problem. Here is an example using Tps for thin plate splines:
EDIT: combined plots and added standard error
require(fields)
dev.new(width=6, height=6)
set.panel(2,2)
# Plot x,y
plot(mat1)
# Model z = f(x,y) with splines
fit = Tps(mat1, z)
pred = predict.surface(fit)
# Plot fit
image(pred)
surface(pred)
# Plot standard error of fit
xg = make.surface.grid(list(pred$x, pred$y))
pred.se = predict.se(fit, xg)
surface(as.surface(xg, pred.se))
I generally use the rms/Hmisc package combination. This is a linear regression analysis (function ols) using crossed cubic spline terms whose plotted output closely resembles the fields example offered:
dfrm <- data.frame(z=z, xcor = mat1[,1], ycor=mat1[,2])
require(rms) # will automatically load Hmisc which needs to have been installed
lininterp <- ols(z ~ rcs(xcor,3)*rcs(ycor,3), data=dfrm)
ddI <- datadist(dfrm)
options(datadist="ddI")
bplot(Predict(lininterp, xcor, ycor)) # Plot not shown
perim <- with(dfrm, perimeter(xcor, ycor))
bplot(Predict(lininterp, xcor, ycor), perim=perim)
# Plot attached after converting to .png
You can also see a method that does not rely on regression estimates of the 3-D surface in second part of my answer to this question: Using color as the 3rd dimension
The plotting paradigm is lattice, and you can also get contour plots as well as this pretty levelplot. If you want the predicted values at an interior point, you can get them with the Predict function applied to the fit object.
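For example (a sketch with arbitrary coordinates):
# predicted value, with confidence limits, at an interior (xcor, ycor) location
Predict(lininterp, xcor = 0, ycor = 0)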
There is a panel.2dsmoother function in the latticeExtra package:
library(lattice)
library(latticeExtra)
df <- data.frame(mat1, z)
names(df)[c(1,2)] <- c('x', 'y')
levelplot(z ~ x * y, data = df, panel = panel.2dsmoother, contour=TRUE)
According to its help page "the smoothing model is constructed (approximately) as method(form, data = list(x=x, y=y, z=z), {args}) [...] This should work with any model function that takes a formula argument, and has a predict method argument".
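So, for instance, a sketch that swaps the default loess for an ordinary polynomial regression surface (this assumes, per that help page, that the form and method arguments are forwarded to the panel function through levelplot's ...):
levelplot(z ~ x * y, data = df, panel = panel.2dsmoother,
          method = "lm", form = z ~ poly(x, 2) + poly(y, 2), contour = TRUE)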
Probably the question can be divided into two parts. The first is aggregating the data, and the second is visualizing it.
The fields package, as #John shows, can do both of these things at once.
In ggplot2, if aggregation is simply count of the data, stat_bin2d is available.
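For example, a quick count-based view of the question's mat1 (a sketch; the number of bins is arbitrary):
library(ggplot2)
# counts of points per rectangular 2d bin; no z aesthetic needed
ggplot(data.frame(x = mat1[,1], y = mat1[,2]), aes(x, y)) +
  stat_bin2d(bins = 30)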
Anyway, if you want to use your own aggregation function, maybe something like this will help:
library(plyr)    # for ddply() and mutate()
library(ggplot2)
df <- data.frame(x = mat1[,1], y = mat1[,2], z = z)
Nx <- 10 # number of bins for x
Ny <- 4  # number of bins for y
# create a binned summary data frame
df2 <- mutate(ddply(df, .(x = cut(x, Nx), y = cut(y, Ny)), summarise,
Mean = mean(z),
Var = var(z)),
xmin = as.numeric( sub("\\((.+),.*", "\\1", x)),
xmax = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", x)),
ymin = as.numeric( sub("\\((.+),.*", "\\1", y)),
ymax = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", y)),
xint = as.numeric(x),
yint = as.numeric(y))
# then, visualize
ggplot(df2, aes(xint, yint, xmin = xmin, ymin = ymin, xmax = xmax, ymax = ymax, fill = Mean)) +
geom_tile(stat = "identity")
ggplot(df2, aes(xint, yint, xmin = xmin, ymin = ymin, xmax = xmax, ymax = ymax, fill = Var)) +
geom_tile(stat = "identity")