I have a matrix with x rows (i.e. the number of draws) and y columns (the number of observations). They represent a distribution of y forecasts.
Now I would like to make a sort of 'heat map' of the draws. That is, I want to plot a 'confidence interval' (not really a confidence interval, but just all the values with shading in between), but as a 'heat map' (an example of a heat map). That means that if, for instance, a lot of draws for observation y=y* were around 1 but there was also a draw of 5 for that same observation, then the area of the confidence interval around 1 is darker (but the whole area between 1 and 5 is still shaded).
To be totally clear: I like, for instance, the plot in the answer here, but I would want the grey confidence interval to instead be colored with intensities (i.e. some areas are darker).
Could someone please tell me how I could achieve that?
Thanks in advance.
Edit: As per request: example data.
Example of the first 20 values of the first column (i.e. y[1:20,1]):
 [1]  0.032067416 -0.064797792  0.035022338  0.016347263  0.034373065  0.024793101 -0.002514447  0.091411355 -0.064263536 -0.026808208
[11]  0.125831185 -0.039428744  0.017156454 -0.061574540 -0.074207109 -0.029171227  0.018906181  0.092816957  0.028899699 -0.004535961
So, the hard part of this is transforming your data into the right shape, which is why it's nice to share something that really looks like your data, not just a single column.
Let's say your data is a matrix with 10,000 rows and 10 columns. I'll just use a uniform distribution, so it will be a boring plot at the end:
n = 10000
k = 10
mat = matrix(runif(n * k), nrow = n)
Next, we'll calculate quantiles for each column using apply, transpose the result, and make it a data frame:
dat = as.data.frame(t(apply(mat, MARGIN = 2, FUN = quantile, probs = seq(.1, 0.9, 0.1))))
Add an x variable (since we transposed, each x value corresponds to a column in the original data)
dat$x = 1:nrow(dat)
We now need to get it into a "long" form, pairing the lower and upper quantile values for each deviation group around the median, and of course get rid of the pesky percent signs introduced by quantile:
library(dplyr)
library(tidyr)
dat_long = gather(dat, "quantile", value = "y", -x) %>%
mutate(quantile = as.numeric(gsub("%", "", quantile)),
group = abs(50 - quantile))
dat_ribbon = dat_long %>% filter(quantile < 50) %>%
mutate(ymin = y) %>%
select(x, ymin, group) %>%
left_join(
    dat_long %>% filter(quantile > 50) %>%
    mutate(ymax = y) %>%
    select(x, ymax, group),
    # being explicit about the join keys avoids the "Joining, by = ..." message
    by = c("x", "group")
)
dat_median = filter(dat_long, quantile == 50)
And finally we can plot. We'll plot a transparent ribbon for each "group", that is the 10%-90% interval, the 20%-80% interval, ..., the 40%-60% interval, and then a single line at the median (50%). Using transparency, the middle will be darker as it has more ribbons overlapping on top of it. This doesn't go from the minimum to the maximum, but it will if you set the probs in the quantile call to go from 0 to 1 instead of .1 to .9.
library(ggplot2)
ggplot(dat_ribbon, aes(x = x)) +
geom_ribbon(aes(ymin = ymin, ymax = ymax, group = group), alpha = 0.2) +
geom_line(aes(y = y), data = dat_median, color = "white")
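For example, if you want the shading to span the full range of the draws, the quantile call above becomes:
# quantiles from 0% to 100%, so the outermost ribbon spans min to max
dat = as.data.frame(t(apply(mat, MARGIN = 2, FUN = quantile,
                            probs = seq(0, 1, 0.1))))
The rest of the pipeline works unchanged, since the 0% and 100% quantiles just form one more ribbon group.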
Worth noting that this is not a conventional heatmap. A heatmap usually implies that you have 3 variables, x, y, and z (color), where there is a z-value for every x-y pair. Here you have two variables, x and y, with y depending on x.
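For comparison, here is a minimal sketch of a conventional heatmap, with made-up z values purely to illustrate the distinction:
library(ggplot2)
# a z value for every x-y pair, mapped to fill
grid_dat = expand.grid(x = 1:10, y = 1:10)
grid_dat$z = grid_dat$x * grid_dat$y
ggplot(grid_dat, aes(x, y, fill = z)) + geom_tile()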
That is not a lot to go on, but I would probably start with the hexbin package and its hexbinplot function. Several alternatives are presented in this SO post:
Formatting and manipulating a plot from the R package "hexbin"
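As a rough sketch of that idea (my assumption of what your data looks like: the same 10,000 x 10 draws matrix as in the other answer), you could stack the matrix into long form and let the bin counts drive the color intensity:
library(ggplot2)  # geom_hex also requires the hexbin package to be installed
n = 10000
k = 10
mat = matrix(runif(n * k), nrow = n)
# one row per draw: x is the observation (column) index, y is the draw value
df = data.frame(x = rep(1:k, each = n), y = as.vector(mat))
ggplot(df, aes(x, y)) + geom_hex(bins = 30)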
Frequently I want to plot raster data with a lot of missing values, including entire missing rows or columns. Consider the following as a toy example:
library(ggplot2)
set.seed(50)
d = expand.grid(x = 1:100, y = 1:100)
d$v = rnorm(nrow(d))
d[d$x %in% sample(d$x, 5), "v"] = NA_real_
ggplot() + geom_raster(aes(x, y, fill = v), data = d)
This works so far, but what if I want to omit plotting the missing values at all, instead of plotting gray squares for them? If I change data = d to data = d[!is.na(d$v),], then I get the warning "Raster pixels are placed at uneven horizontal intervals and will be shifted. Consider using geom_tile() instead." I don't see a shift in this example, but I worry that if ggplot2 shifts the data, that could lead to squares being plotted at the wrong coordinates for real data. How do I avoid this shifting?
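For what it's worth, one possible workaround (an assumption on my part, not something the warning text suggests) is to keep the complete, evenly spaced grid and make the NA cells invisible instead of dropping their rows:
# keep all rows so the raster stays on a regular grid,
# but render the missing values as transparent
ggplot() +
  geom_raster(aes(x, y, fill = v), data = d) +
  scale_fill_continuous(na.value = "transparent")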
I'm sure this can be done by separately collecting all the data and then just using ggplot for the plotting, but I'd really prefer a simpler solution within ggplot, particularly stat_ecdf(), because of easier access to grouping variables, facets, etc.
My dataframe contains, amongst others, two columns of corresponding data x and y. I'd like to plot the ecdf of y on an axis of the corresponding x values. In other words, I'd like to plot what cumulative portion of the y variable is reached at its corresponding x value. While x and y are correlated (both descending), they are not analytically connected, so I cannot simply scale values of y to x. My attempts to do this with separate calculations of the ecdf functions of each subset have gotten extremely messy and complicated, while the stat_ecdf function seems to be very close to getting me what I need.
If I set the x variable in the ggplot aes to x and then set the variable within stat_ecdf to y, I am able to get the ecdf of y with axis labels of x; however, the actual values on the axis correspond to x. This is done with something like:
ggplot(df, aes(x, color=group_var)) + stat_ecdf(aes(y))
EDIT:
To visualize this:
This sample plot shows the ecdf of x for multiple groups. Each x value has a corresponding y value in a sorted dataframe (approximate relationship; ignore the decreasing regions at the end). I would like to have a similar plot where the horizontal axis is in the corresponding y values. Basically, I need to map the horizontal axis of the first ecdf plot from x -> y as simply as possible. I could do this manually by adding ecdf values as a column in the dataframe, but I am looking to do it within ggplot for simplicity, if possible.
Instead of trying to bend stat_ecdf to do something it was not designed for, it's better to be explicit about your intention in the code.
It's quite straightforward. The weirdest piece of code, ecdf(y)(y), means 'calculate the empirical CDF for y, and then evaluate it for the actual values of y in my data'. The cummax deals with the decreasing y, to get an ever-increasing eCDF along x.
d_sample %>%
group_by(group) %>%
arrange(group, x) %>%
mutate(
fraction = ecdf(y)(y),
maxf = pmax(fraction, cummax(fraction))) %>%
ggplot(aes(x, maxf)) +
geom_point() +
facet_wrap(~group)
I'm still not really sure if that's what you need.
Sample data
To be honest it took me most of the time to 'fake' your dataset:
library(tidyverse)
tibble(x = seq_len(300) + 100) %>%
mutate(
one = - 1e-3 * (x * x) + 50 + 0.7 * x,
two = - 1e-3 * (x * x) + 55 + 0.68 * x,
three = - 1e-3 * (x * x) + 110 + 0.5 * x,
four = - 1e-3 * (x * x) + 10 + 0.8 * x) %>%
pivot_longer(-x, names_to = "group", values_to = "y") %>%
filter(
group == "one"
| group == "two"
| (group == "three" & x < 200)
| (group == "four" & x > 250)) ->
d_sample
d_sample %>%
ggplot(aes(x, y, colour = group)) +
geom_point()
I've got a dataset of different energies (eV) and related counts. I changed the detection wavelength throughout the measurement, which resulted in a first column with all wavelengths and then further columns. In those columns, different rows are filled with NAs because no data was measured at that specific wavelength.
I would like to plot the spectra in R, but it doesn't work because the lengths of the x and y values differ for each column.
It would be great, if someone could help me.
Thank you very much.
It would be better if we could work with (simulated) data you provided. Here's my attempt at trying to visualize your problem the way I see it.
library(ggplot2)
library(tidyr)
# create and fudge the data
xy <- data.frame(measurement = 1:20, red = rnorm(20), green = rnorm(20, mean = 10), uv = NA)
xy[16:20, "green"] <- NA
xy[16:20, "uv"] <- rnorm(5, mean = -3)
# flow it into "long" format
xy <- gather(xy, key = color, value = value, - measurement)
# plot
ggplot(xy, aes(x = measurement, y = value, group = color)) +
theme_bw() +
geom_line()
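If the warnings about removed rows bother you, a small variant (my addition, not essential to the answer) is to drop the NA rows after reshaping; each color group still gets its own line:
# na.omit() removes the rows where value is NA before plotting
ggplot(na.omit(xy), aes(x = measurement, y = value, group = color)) +
  theme_bw() +
  geom_line()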
I have data of the form
cvar1  cvar2  numvar
a      x      0.1
a      y      0.2
b      x      0.15
b      y      0.25
That is, two categorical variables, and one numerical variable.
Using ggplot2, I can get a nice scatter plot that plots the data for each combination of cvar1 and cvar2 by doing qplot(y=numvar, x=interaction(cvar1, cvar2)). This gives me several columns of points like this:
To each of these columns I would like to add a small horizontal line representing the mean of the data points in that column. And a similar small horizontal line for the mean + sd and the mean - sd. (Kind of a bastardized box plot, but with all points visible and using mean and sd rather than median and IQR.) Thanks in advance!
You can create a table that contains the mean and mean +/- sd for each group of points. Then you can plot lines using geom_segment().
First, I create some sample data:
set.seed(1245)
data <- data.frame(cvar1 = rep(letters[1:2], each = 12),
cvar2 = rep(letters[25:26], times = 12),
numvar = runif(2*12))
This creates the table with the values that you need using dplyr and tidyr:
library(dplyr)
library(tidyr)
summ <- group_by(data, cvar1, cvar2) %>%
summarise(mean = mean(numvar),
low = mean - sd(numvar),
high = mean + sd(numvar)) %>%
gather(variable, value, mean:high)
The three lines do the following: first, the data is split into the groups; then, for each group, the three required values are calculated. Finally, the data is converted to long format, which is needed for ggplot(). (Maybe you are more familiar with melt(), which does basically the same thing as gather().)
And finally, this creates the plot:
ggplot(data) + geom_point(aes(x = interaction(cvar1, cvar2), y = numvar)) +
geom_segment(data = summ,
aes(x = as.numeric(interaction(cvar1, cvar2)) - .5,
xend = as.numeric(interaction(cvar1, cvar2)) + .5,
y = value, yend = value, colour = variable))
You probably won't want the colours. I just added them to make the example more clear.
geom_segment() needs the start and end coordinates of each line to be specified. Because interaction(cvar1, cvar2) is a factor, it needs to be converted to numeric before it is possible to do arithmetic with it. I added 0.5 to and subtracted 0.5 from interaction(cvar1, cvar2), which makes the lines quite wide. Choosing a smaller value will make the lines shorter.
I'm working with a data frame of size 2 x 400. I need to graph this (let's call it data set A) on the same graph as the main data set for my project.
All I need is the general shape of data set A's graph, i.e. I only need to see the trend.
The scale that data set A takes place on happens to be much smaller than that of the main graph. So dataset A just looks like a horizontal line.
I decided to scale data set A by multiplying it by a factor of... I tried various values to get the optimum vertical scaling, which leads me to the problem I'm having.
When trying to find the ideal multiplicative factor by trial and error, I expected data set A's graph to retain its general shape and only vary in its relative vertical points, i.e. the horizontal coordinates of all maxes and mins shouldn't move, and only the vertical positions should change. But this wasn't happening, and I'd like to know why.
Here's data set A (yellow) when multiplied by a factor of 3:
And by a factor of 5:
The yellow dots are the geom_point and the yellow curve is the corresponding geom_smooth.
EDIT:
Here is my original code:
I haven't had much formal training with code. I apologize for any messiness!
library("ggplot2")
library("dplyr")
# READ IN DATA
temp_data <-read.table(col.names = "y",
"C:/Users/Ben/Documents/Visual Studio 2013/Projects/Home/Home/steamdata2.txt")
boilpoint <- which(temp_data$y == "boil") # JUST A MARKER..
temp_data <- filter(temp_data, y != "boil") # GETTING RID OF THE MARKER ENTRY
# DON'T KNOW WHY BUT I HAD TO DO THIS INTERMEDIATE STEP
# BEFORE I COULD CONVERT FROM FACTOR -> NUMERIC
temp_data$y <- as.character(temp_data$y)
# CONVERTING TO NUMERIC
temp_data$y <- as.numeric(temp_data$y)
# GETTING RID OF BASICALLY THE LAST ENTRY WHICH HAS THE LARGEST VALUE
temp_data <- filter(temp_data, y<max(temp_data$y))
# ADD ANOTHER COLUMN WITH THE ROW NUMBER,
# BECAUSE I DON'T KNOW HOW TO ACCESS THIS FOR GGPLOT
temp_data <- transform(temp_data, x = 1:nrow(temp_data))
n <- nrow(temp_data) # Num of readings
period <- temp_data[n,1] # (sec)
RpS <- n / period # Avg Readings per Second
MIN <- min(temp_data$y)
MAX <- max(temp_data$y)
# DERIVATIVE OF ORIGINAL
deriv <- data.frame(matrix(ncol=2, nrow=n))
# ADD ANOTHER COLUMN TO ACCESS ROW NUMBERS FOR GGPLOT LATER
colnames(deriv) <- c("y","x")
deriv <- transform(deriv, x = c(1:n))
# FILL DERIVATIVE DATAFRAME
deriv[1, 1] <- 0
for(i in 2:n){
deriv[i - 1, 1] <- temp_data[i, 1] - temp_data[i - 1, 1]
}
deriv <- filter(deriv, y != 0)
# DID THE SAME FOR SECOND DERIVATIVE
dderiv <- data.frame(matrix(ncol = 2, nrow = nrow(deriv)))
colnames(dderiv) <- c("y", "x")
dderiv <- transform(dderiv, x=rep(0, nrow(deriv)))
dderiv[1, 1] <- 0
for(i in 2:nrow(deriv)) {
dderiv$y[i - 1] <- (deriv$y[i] - deriv$y[i - 1]) /
(deriv$x[i] - deriv$x[i - 1])
dderiv$x[i - 1] <- deriv$x[i] + (deriv$x[i] - deriv$x[i - 1]) / 2
}
dderiv <- filter(dderiv, y!=0)
# HERE'S WHERE I FACTOR BY VARIOUS MULTIPLES
deriv <- MIN + deriv * 3
dderiv <- MIN + dderiv * 3
graph <- ggplot(temp_data, aes(x, y)) + geom_smooth()
graph <- graph + geom_point(data = deriv, color = "yellow")
graph <- graph + geom_smooth(data = deriv, color = "yellow")
graph <- graph + geom_point(data = dderiv, color = "green")
graph <- graph + geom_smooth(data = dderiv, color = "green")
graph <- graph + geom_vline(xintercept = boilpoint, color = "red")
graph <- graph + xlab("Readings (n)") +
ylab(expression(paste("Temperature (",degree,"C)")))
graph <- graph + xlim(c(0,n)) + ylim(c(MIN, MAX))
It's hard to check without your raw data, but I'm 99% sure that your main problem is that you're hard-coding the y limits with ylim(c(MIN, MAX)). This is exacerbated by accidentally scaling both variables in your deriv and dderiv data frame, not just y.
I was able to debug the problem when I noticed that your top "scale by 3" graph has a lot more yellow points than your bottom "scale by 5" graph.
The quick fix is don't scale the row numbers, only scale the y values, which is to say, replace this
# scales entire data frame: bad!
deriv <- MIN + deriv * 3
dderiv <- MIN + dderiv * 3
with this:
# only scale y
deriv$y <- MIN + deriv$y * 3
dderiv$y <- MIN + dderiv$y * 3
I think there is another problem too: even with my correction above, negative values of your derivatives will be excluded. If deriv$y or dderiv$y is ever negative, then MIN + deriv$y * 3 will be less than MIN, and since your y axis begins at MIN it won't be plotted.
So I think the whole fix would be to instead do something like
# keep the original y values around so we can experiment with scaling
# without running *all* the code again
deriv$y_orig <- deriv$y
# multiplicative scale
# fill in the value of `prop` to be the proportion of the vertical plot area
# that you want taken up by the derivative
deriv$y <- deriv$y_orig * diff(c(MIN, MAX)) / diff(range(deriv$y_orig)) * prop
# shift into plot range
# fill in the value of `intercept` to be the y value of the
# lowest point of this line
deriv$y <- deriv$y + MIN - min(deriv$y) + 1
I normally don't answer questions that aren't reproducible with data because I hate lack of clarity and I hate the inability to test. However, your question was very clear and I'm pretty sure this will work even without testing. Fingers crossed!
A few other, more general comments:
It's good you know that to convert factor to numeric you need to go via character. It's an annoyance, but if you want to understand more here's the r-faq on it.
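A quick illustration of why the detour through character matters:
f <- factor(c("0.5", "1.5"))
as.numeric(f)                 # 1 2  -- the underlying level codes, not the values
as.numeric(as.character(f))   # 0.5 1.5 -- the actual values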
I'm not sure why you bother with (deriv$x[i] - deriv$x[i - 1]) in your for loop. Since you define x to be 1, 2, 3, ... the difference is always 1. I'm more confused by why you divide by 2 in the second derivative.
Your for loop can probably be replaced by the diff() function. (See below.)
You seem to have just gotten your foot in the dplyr door, so I used base functions in my recommendation. Keep working with dplyr, I think you'll like it. The big dplyr function you're not using is mutate. It works like base::transform for adding new columns.
I dislike that you've created all these different data frames; it clutters things up. I think your code could be simplified to something like this:
all_data = filter(temp_data, y != "boil") %>%
mutate(y = as.numeric(as.character(y))) %>%
filter(y < max(y)) %>%
mutate(
x = 1:n(),
deriv = c(NA, diff(y)) / c(NA, diff(x)),
dderiv = c(NA, diff(deriv)) / 2
)
Rather than having separate data frames for the original data, first derivative and second derivative, this puts them all in the same data frame.
The big benefit of having things in one data frame is that you could then "gather" it into a nice, long (rather than wide) tidy format and simplify your plotting call:
library(tidyr)
# "function" is a reserved word in R, so it is quoted here and backquoted in aes() below
long_data = gather(all_data, key = "function", value = "y", y, deriv, dderiv)
Then your ggplot call would look more like this:
graph <- ggplot(long_data, aes(x, y, color = `function`)) +
geom_smooth() +
geom_point() +
geom_vline(xintercept = boilpoint, color = "red") +
scale_color_manual(values = c("green", "yellow", "blue")) +
xlab("Readings (n)") +
ylab(expression(paste("Temperature (",degree,"C)"))) +
xlim(c(0,n)) + ylim(c(MIN, MAX))
With the data in long format, you'd have a column of your data (I've named it "function") that maps to color, so you don't have to add all the layers one at a time, and you get a nicely generated legend!