Display maximum frequency point of each bin in ggplot2 stat_binhex

I have a data set in which a coordinate can be repeated several times.
I want to make a hexbinplot displaying the maximum number of times a coordinate is repeated within that bin. I am using R and I would prefer to make it with ggplot so the graph is consistent with other graphs in the same report.
Minimal working example (the bins display the count, not the max):
library(ggplot2)
library(data.table)

set.seed(41)
dat <- data.table(x = sample(seq(-10, 10, 1), 1000, replace = TRUE),
                  y = sample(seq(-10, 10, 1), 1000, replace = TRUE))
dat[, .N, by = c("x", "y")][, max(N)]
# No bin should be over 9
p1 <- ggplot(dat, aes(x = x, y = y)) + stat_binhex(bins = 10)
p1
I believe the approach should be related to this question:
calculating percentages for bins in ggplot2 stat_binhex, but I am not sure how to adapt it to my case.
Also, I am concerned about this issue, ggplot2: ..count.. not working with stat_bin_hex anymore, as it may make my objective harder than I initially thought.
Is it possible to make the bins display the maximum number of times a point is repeated?

After playing with the data a bit more, I think I now understand. Each bin in the plot represents multiple points; e.g., (9,9), (9,10), (10,9), and (10,10) all fall in a single bin. I must caution that this is the expected behavior, and it is unclear to me why you would not want it. Instead, you seem to want to display the value of just one of those points (e.g., (9,9)).
I don't think you will be able to do this directly in a call to geom_hex or stat_binhex, as those functions are trying to faithfully represent all of the data. In fact, they are not necessarily expecting discrete coordinates like yours at all; they work equally well on continuous data.
For your purpose, if you want finer control, you may want to use geom_tile instead and count the values yourself, e.g. (using magrittr's %$% and %>% pipes):
library(magrittr)

countedData <- dat %$%
  table(x, y) %>%
  as.data.frame()

ggplot(countedData, aes(x = x, y = y, fill = Freq)) +
  geom_tile()
and you might play with the representation a bit from there, but it would at least display each of the separate coordinates more faithfully.
Alternatively, you could filter your raw data to only include the points that are the maximum within a bin. That would require you to match the binning, but could at least be an option.
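For illustration, a rough sketch of that filtering idea, using square bins built with cut() rather than true hexagons, so it only approximates stat_binhex's binning (the number of breaks is an assumption):
library(data.table)

# Count repeats per coordinate, assign each coordinate to a coarse
# square bin, then keep only the most-repeated coordinate per bin.
counts <- dat[, .N, by = .(x, y)]
counts[, `:=`(xbin = cut(x, breaks = 10), ybin = cut(y, breaks = 10))]
maxima <- counts[counts[, .I[which.max(N)], by = .(xbin, ybin)]$V1]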
For completeness, here is how to adapt the stat_summary_hex solution that @Jon Nagra (the OP) linked to. Note that there are a few additional steps, so I don't think this is quite a duplicate. Specifically, the table step above is required to generate something that can be used as a z value for the summaries, and then you need to convert x and y back from factors to the original scale.
ggplot(countedData,
       aes(x = as.numeric(as.character(x)),
           y = as.numeric(as.character(y)),
           z = Freq)) +
  stat_summary_hex(fun = max, bins = 10, col = "white")
Of note, I still think that geom_tile may be more useful, even if it is not quite as flashy.

Related

Best way to visually compare an individual's value to multiple subgroup means in R

For an individual feedback sheet generated by a Shiny app in R, I would like to visually compare an individual's value on variable X to the mean of the whole group, the mean of people of the same age, and the mean of people playing the same sports. I was considering a barplot with four bars, one for each value, and since I keep reading that ggplot2 is neat for making plots, I tried to figure out how to do it in ggplot2.
However, when trying to implement this idea, the factor on the x axis would conceptually be the subsets of the dataset, and since the subsets are built from different variables and one individual can be in more than one subset, I can't seem to wrap my head around how to feed that into any barplot syntax I found. I wondered if you could just make a vector along the lines of c(your_value, mean(group), mean(age_subset), mean(sports_subset)), but I didn't find out whether that is possible; also, first making a list or even a second data frame seems kind of messy to me. Isn't there an easier and more elegant way to do something like that?
Below I start with arbitrary numbers (equivalent to the vector you considered starting with). The code might give you an idea of how to build the kind of general function you're seeking.
library(ggplot2)
library(dplyr)

own_result <- 5.4
mean_age <- 5.6
mean_sport <- 4.5

data.frame(group = c("age", "sport"),
           means = c(mean_age, mean_sport)) %>%
  ggplot(aes(x = group, y = means)) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = own_result, lty = 2, col = "red")
Created on 2021-07-20 by the reprex package (v2.0.0)
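To generalise, a minimal sketch of the kind of function you might build, assuming a data frame people with columns x (the measured variable), age_group, and sport, plus a one-row data frame me for the focal individual; all of these names are hypothetical:
library(ggplot2)

plot_vs_subgroups <- function(people, me) {
  # One bar per comparison group, plus a reference line for the individual
  comparison <- data.frame(
    group = c("everyone", "same age", "same sport"),
    means = c(mean(people$x),
              mean(people$x[people$age_group == me$age_group]),
              mean(people$x[people$sport == me$sport]))
  )
  ggplot(comparison, aes(x = group, y = means)) +
    geom_bar(stat = "identity") +
    geom_hline(yintercept = me$x, lty = 2, col = "red")
}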

Time series difference in ggplot

I'm trying to plot the (first) difference of a time series with ggplot.
As the difference (by definition) contains one less element than the data, I (predictably) get the error message: "Error: Aesthetics must be either length 1 or the same as the data".
I solved this by defining my y aesthetic as c(NA, diff(data)) instead of just diff(data), which works.
However, this feels like a clumsy workaround, and it only works up to a point: it gets me into trouble, for instance, when I try to facet several plots. (Also, you need to keep adding NAs if a higher order of difference, or more lag, is needed.)
Anyone knows of a more robust solution?
The ultimate problem is this:
[Image: what I want, made using the patchwork package]
[Image: what I get using faceting; NB: if I put the NA at the end, it is the third chart that becomes correct]
While adding NA to the differenced vector is not clumsy, doing it inside the ggplot aesthetic is. Compare the following two snippets:
ggplot(data = data.long, aes(x = date, y = c(NA, count %>% diff()))) +
  geom_point()

data.long %>%
  mutate(diff_count = c(NA, diff(count))) %>%
  ggplot(aes(x = date, y = diff_count)) +
  geom_point()
Both give the same graph, but the second is the preferred method, since the data used for plotting (the differenced count) is calculated before being sent to ggplot, and it is easier to read and modify. In other words, do the data management first, then visualise the data. As you say, the first method can get you into trouble later, for example when doing more complicated graphing such as facetting.
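For higher orders of difference or longer lags, a small helper can do the NA padding for you; pad_diff() below is a hypothetical function, not part of any package, and data.long's columns are assumed from the question:
library(dplyr)
library(ggplot2)

# diff() drops lag * differences elements, so pad with that many NAs
pad_diff <- function(x, lag = 1, differences = 1) {
  c(rep(NA, lag * differences), diff(x, lag = lag, differences = differences))
}

data.long %>%
  mutate(diff2_count = pad_diff(count, differences = 2)) %>%
  ggplot(aes(x = date, y = diff2_count)) +
  geom_point()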

R - Bar Plot with transparency based on values?

I have a dataset myData which contains x and y values for various Samples. I can create a line plot for a dataset which contains a few Samples with the following pseudocode, and it is a good way to represent this data:
myData <- data.frame(x = 290:450, X52241 = c(..., ..., ...), X75123 = c(..., ..., ...))
myData <- myData %>% gather(Sample, y, -x)  # gather() is from tidyr
ggplot(myData, aes(x, y)) + geom_line(aes(color = Sample))
Which generates:
This turns into a Spaghetti Plot when I have a lot more Samples added, which makes the information hard to understand, so I want to represent the "hills" of each sample in another way. Preferably, I would like to represent the data as a series of stacked bars, one for each myData$Sample, with transparency inversely related to what is in myData$y. I've tried to represent that data in photoshop (badly) here:
Is there a way to do this? Creating faceted plots using facet_wrap() or facet_grid() doesn't give me what I want (far too many Samples). I would also be open to stacked ridgeline plots using ggridges, but I do not understand how I would convert absolute values to the stat(density) values needed to plot those.
Any suggestions?
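A hedged note on the ggridges idea: geom_ridgeline() accepts raw heights via its height aesthetic, with no density conversion needed, so a sketch along these lines might already work (the scale value is an arbitrary assumption):
library(ggplot2)
library(ggridges)

# height maps the absolute y values directly; no stat(density) required
ggplot(myData, aes(x = x, y = Sample, height = y, group = Sample)) +
  geom_ridgeline(scale = 0.01)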
Thanks to u/Joris for the helpful suggestion! Since I did not find this question answered elsewhere, I'll go ahead and post the pretty simple solution here for others to find.
Basically, I needed to apply the alpha aesthetic via aes(alpha = y, ...). In theory, I could apply this to any geom. I tried geom_col(), which worked, but the best solution was to use geom_segment(), since all my "bars" were going to be the same length. Also note that I had to "slice" up the segments to avoid the overplotting problems discussed here, here, and here.
ggplot(myData, aes(x, Sample)) +
  geom_segment(aes(x = x, xend = x - 1, y = Sample, yend = Sample, alpha = y),
               color = 'blue3', size = 14)
That gives us the nice gradient:
Since the max y values are not the same for both lines, I normalized the data (myDataNorm) when I wanted to "match" the intensities, and could then make the same plot. In my particular case, I actually preferred bars without a gradient, showing a hard edge at the maximum values of y. Here was one solution:
ggplot(myDataNorm, aes(x, Sample)) +
  geom_segment(aes(x = x, xend = x - 1, y = Sample, yend = Sample,
                   alpha = ifelse(y > 0.9, 1, 0))) +
  theme(legend.position = 'none')
Better, but I did not like the faint-colored areas that were left. The final code gave me something that perfectly captured what I was looking for: I simply moved the ifelse() statement to the x aesthetics, so that only the parts of each segment with high enough y values are drawn. Note my data "starts" at x = 290 here. There are probably more elegant ways to combine those x and xend terms, but this works:
ggplot(myDataNorm, aes(x, Sample)) +
  geom_segment(aes(x = ifelse(y > 0.9, x, 290),
                   xend = ifelse(y > 0.9, x - 1, 290),
                   y = Sample, yend = Sample),
               color = 'blue3', size = 14) +
  xlim(290, 400)  # needed to show the entire scale
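For reference, the post doesn't show how myDataNorm was built; a minimal sketch, assuming each Sample is scaled by its own maximum:
library(dplyr)

# Scale y to [0, 1] within each Sample so intensities are comparable
myDataNorm <- myData %>%
  group_by(Sample) %>%
  mutate(y = y / max(y)) %>%
  ungroup()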

How can I wrap a ggplot column n times after hitting threshold number

I have a barplot where one entry is so much larger than my other entries that it makes it difficult to do interesting analysis on the other, smaller-valued data points.
plt <- ggplot(dffd[dffd$Month == i & dffd$UniqueCarrier != "AA", ],
              aes(x = UniqueCarrier, y = 1, fill = DepDelay)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradientn(breaks = late_breaks, labels = late_breaks,
                       limits = c(0, 150),
                       colours = c('black', 'yellow', 'orange', 'red', 'darkred'))
When I remove it, I get back to an interesting degree of interpretation, but now I'm tossing out upwards of half the data, and arguably the most important entry to explore.
I was wondering if there is a way to set an interval on my bar plot, say 500 in this case, after which the bar wraps into another column for the same entry right under it and keeps building up. In this example, WN would split into 3 bars of lengths 500, 500, and ~400, stacked one below the other, all under the one WN label (ideally with a single tick for all three). Since I have a couple of other disproportionately large representatives, doing this as a layer during plotting is of great interest to me.
Typically, when you have such disproportionate values in your data set, you should either put your values on a log scale (or use some other transformation) or zoom in on the plot using coord_cartesian. I think you probably could hack your way around and create the desired plot, but it's going to be quite misleading in terms of visualisation and analysis.
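To make those two standard alternatives concrete, a rough sketch (the data is not shown in the question, so dffd and its columns are assumptions):
# Assumes dffd with a UniqueCarrier column, as in the question
p <- ggplot(dffd, aes(x = UniqueCarrier)) + geom_bar()

# Zoom in on the smaller bars without dropping any data:
p + coord_cartesian(ylim = c(0, 500))

# Or transform the scale (sqrt avoids log-scale issues with bars at 0):
p + scale_y_sqrt()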
EDIT:
Based on your comments, I have a rather hacky solution. The data you pasted was not directly usable (part of the dput was missing, and there is no DepDelay column, so I improvised).
The idea is to create an extra tag column based on the UniqueCarrier column and the max amount you want.
df2 <- df %>%
  filter(UniqueCarrier != "AA" & Month == i) %>%
  group_by(UniqueCarrier) %>%
  mutate(tag = paste(UniqueCarrier,
                     rep(seq(1, n() %/% 500 + 1), each = 500),
                     sep = "_")[1:n()])
This adds a tag column that basically says how many columns you'll have in each category.
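A toy illustration of the tag construction may help (hypothetical data, with a threshold of 3 instead of 500):
library(dplyr)

toy <- data.frame(UniqueCarrier = rep("WN", 7))
toy %>%
  group_by(UniqueCarrier) %>%
  mutate(tag = paste(UniqueCarrier,
                     rep(seq(1, n() %/% 3 + 1), each = 3),
                     sep = "_")[1:n()])
# tag comes out as WN_1, WN_1, WN_1, WN_2, WN_2, WN_2, WN_3,
# i.e. three stacked bars under the single WN label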
library(stringr)  # for str_replace()

plt <- ggplot(df2, aes(x = tag, y = 1, fill = DepDelay)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradientn(breaks = late_breaks, labels = late_breaks,
                       limits = c(0, 150),
                       colours = c('black', 'yellow', 'orange', 'red', 'darkred')) +
  scale_x_discrete(labels = str_replace(sort(unique(df2$tag)), "_[:digit:]", ""))
plt
In the image above, I've used CarrierDelay with a break interval of 100. You can see that the WN label then repeats; there are ways to remove the extra labels (e.g., some more creative replacements in the scale_x_discrete labels).
If you want the columns to be ordered differently, just replace seq(1, n()%/%500+1) with seq(n()%/%500+1, 1).

Controlling alpha in ggparcoord (from GGally package)

I am trying to build on a question similar to mine (from which I borrowed the self-contained example and the title inspiration). I am trying to apply transparency individually to each line of a ggparcoord plot, or to somehow overlay two ggparcoord plots on top of each other. A detailed description of the problem, and of the data format the solution needs to work with, is provided below.
I have a dataset with thousands of lines; let's call it x.
library(GGally)
x <- data.frame(a = runif(100, 0, 1), b = runif(100, 0, 1),
                c = runif(100, 0, 1), d = runif(100, 0, 1))
After clustering this data I also get a set of 5 lines; let's call this dataset y.
y <- data.frame(a = runif(5, 0, 1), b = runif(5, 0, 1),
                c = runif(5, 0, 1), d = runif(5, 0, 1))
In order to see the centroids y overlaying x, I use the following code. First I append y to x so that the 5 centroid rows are at the bottom of the final data frame. This ensures ggparcoord will draw them last, so they stay on top of all the data:
df <- rbind(x,y)
Next I create a new column for df, following the advice in the question I referred to, so that I can color the centroids differently and tell them apart from the data:
df$cluster = "data"
df$cluster[(nrow(df)-4):(nrow(df))] <- "centroids"
Finally I plot it:
p <- ggparcoord(df, columns = 1:4, groupColumn = 5, scale = "globalminmax",
                alphaLines = 0.99) +
  xlab("Sample") + ylab("log(Count)")
p + scale_colour_manual(values = c("data" = "grey", "centroids" = "#94003C"))
The problem I am stuck with is from this stage onwards. On my original data, plotting solely x doesn't lead to much insight, since it is a heavy load of lines (on this data, this is equivalent to using ggparcoord above on x instead of df):
By reducing alphaLines considerably (to 0.05), I can naturally see some clusters thanks to the overlapping of the lines (this is again ggparcoord on x, with alphaLines reduced):
It makes more sense to observe the centroids added to df on top of the second plot, not the first.
However, since everything is in a single data frame, applying such a low alpha value makes the centroid lines disappear as well. My only option is then to use ggparcoord (as provided above) on df without decreasing the alpha value:
My goal is to have the red lines (centroid lines) on top of the second figure, with the data lines at very low alpha. There are two approaches I have thought of so far, but I couldn't get either working:
(1) Is there any way to create a column in the data frame, similar to what is done for the color, such that I can specify the alpha value for each line?
(2) I originally attempted to create two different ggparcoord plots and "sum them up", hoping they would overlay, but an error was raised.
The question may contain too much detail, but I thought this could better motivate the applicability of the answer for other readers.
The answer I am looking for would use the provided data variables in their current format and generate the plot I described. Better ways to reconstruct the data are also welcome, but using the current structure is preferred.
In this case I think it is easier to just use ggplot and build the graph yourself. We make a slight adjustment to how the data is represented (we put it in long format), and then we make the parallel coordinates plot ourselves. We can now map any attribute to cluster that we like.
library(dplyr)
library(tidyr)
library(ggplot2)

# I start the same as you
x <- data.frame(a = runif(100, 0, 1), b = runif(100, 0, 1),
                c = runif(100, 0, 1), d = runif(100, 0, 1))
y <- data.frame(a = runif(5, 0, 1), b = runif(5, 0, 1),
                c = runif(5, 0, 1), d = runif(5, 0, 1))

# I find this an easier way to combine the two data.frames,
# and it gives an id column for free
df <- bind_rows(data = x, centroids = y, .id = 'cluster')

# We need row ids, so we know which points to connect with a line
df$id <- 1:nrow(df)

# Put the data into long format
df2 <- gather(df, 'column', 'value', a:d)

# And plot:
ggplot(df2, aes(column, value, alpha = cluster, color = cluster, group = id)) +
  geom_line() +
  scale_colour_manual(values = c("data" = "grey", "centroids" = "#94003C")) +
  scale_alpha_manual(values = c("data" = 0.2, "centroids" = 1)) +
  theme_minimal()
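Since this is now plain ggplot, any other attribute can be mapped to cluster in the same way; for instance, a sketch that also makes the centroids thicker (the linewidth aesthetic needs ggplot2 3.4 or later):
ggplot(df2, aes(column, value, alpha = cluster, color = cluster,
                linewidth = cluster, group = id)) +
  geom_line() +
  scale_colour_manual(values = c("data" = "grey", "centroids" = "#94003C")) +
  scale_alpha_manual(values = c("data" = 0.2, "centroids" = 1)) +
  scale_linewidth_manual(values = c("data" = 0.3, "centroids" = 1)) +
  theme_minimal()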
