Is there a way to create a geom_path heatmap in ggplot? - r

For example, this is a heatmap from a website using GPS data:
I have had some success with adding a weight parameter to each vertex and counting the number of events with nearby vertices, but that takes a long time, especially with a large amount of data. It also looks a bit spotty when the spacing between vertices is uneven, which causes random splotches of different colors throughout the heatmap. It looks kind of cool, but it makes the data harder to read.
When you zoom out, it looks a bit more continuous due to the paths overlapping more.
In R, the closest I can get to this involves using an alpha channel, but that only gives me a monochromatic heatmap, which is not always desirable, especially when you want lesser-traveled paths to remain visible. In theory I could draw two lines to solve the visibility part (first opaque, second semi-transparent), but I would like to be able to use different hue values.
Ideally I would like this to work with ggplot, but if it cannot, I would accept other methods, provided they are reasonably quick computationally.
Edit: The data format is a data frame with sequential (latitude, longitude) coordinate pairs, along with some associated data that can be used for filtering and grouping (such as activity type and event ID).
Here is a sample of the data for the region displayed in the above images (~1.5 MB):
https://www.dropbox.com/s/13p2jtz4760m26d/sample_coordinate_data.csv?dl=0
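For reference, here is a rough sketch of the kind of thing I mean: a binned variant of the weighting idea, where vertices are snapped to a coarse grid, the distinct events in each cell are counted, and that count is mapped to the path color. The column names longitude, latitude and event_id are assumptions based on the description above and may not match the file exactly.
# rough sketch, not a final solution; column names are assumptions
library(dplyr)
library(ggplot2)
coords <- read.csv("sample_coordinate_data.csv")
cell <- 0.0005  # grid size in degrees; tune to the zoom level
heat <- coords %>%
  mutate(lon_bin = round(longitude / cell) * cell,
         lat_bin = round(latitude / cell) * cell) %>%
  group_by(lon_bin, lat_bin) %>%
  mutate(n_events = n_distinct(event_id)) %>%  # how many events pass through this cell
  ungroup()
ggplot(heat, aes(longitude, latitude, group = event_id, color = n_events)) +
  geom_path(alpha = 0.7) +
  scale_color_viridis_c(trans = "log10") +
  coord_quickmap()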

I would try something like
ggplot() + geom_count(data = data, aes(longitude, latitude, alpha = ..prop..))
but you need to show some data to check how it works.
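With the sample file from the edit (again assuming columns named longitude and latitude), that would be roughly:
coords <- read.csv("sample_coordinate_data.csv")
ggplot() + geom_count(data = coords, aes(longitude, latitude, alpha = ..prop..))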

Related

Are dual-axes time series plots more acceptable if the data for each axis is from the exact same time/location?

I know dual-axes time series plots can be misleading. They're difficult to make in ggplot because Hadley Wickham believes they are fundamentally flawed. Others have concluded that they are ok sometimes, when axes are chosen so that the lines look as though they had been converted to indices first (even if they are given in their actual units). I'm wondering if this example is one in which dual-axes are justifiable.
This online tool is an example similar to what I want to create: https://carve.ornl.gov/visualize/
Measurements taken at the same point in time, from the same flight, are plotted over time. The user can select any two measurements to overlay, and the time matches up with a map showing flight coordinates. I think this is an elegant way for users to interact with the data, and I can't really imagine an alternative that would convey the same information.
That being said, I am interested to hear other opinions. Will this type of plot draw vitriol from other data scientists?! Do you have other ideas? And, if you have recommendations for what R tools I should turn to (since ggplot might be off the table...), I would love to hear them (I will be using Shiny). Thanks!
The debate on multiple axes on the same Cartesian plane is indeed a hot one. It reminds me of the endless debates around social science's approaches.
If you follow the orthodoxy of the Grammar of Graphics gospel, then the graph you linked is flawed. To come back into the herd, you could simply map either the CO2 or the altitude to a different plotted symbology, such as the size or color of the dots. Or simply plot two different panels, aligned by the x scale.
Now, the Grammar of Graphics people have far fewer problems with multiple scales on the plotted symbologies than with multiple scales on the Cartesian axes.
Yet, I think that methodological opportunism is preferable to methodological orthodoxy. Do whatever is easier for you to communicate the idea to the public.
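For what it's worth, here is a minimal sketch of the two-panel alternative in ggplot, assuming a data frame flight with hypothetical columns time, co2 and altitude:
library(ggplot2)
library(tidyr)
# reshape to long format so each measurement gets its own facet on a shared time axis
flight_long <- pivot_longer(flight, c(co2, altitude),
                            names_to = "measurement", values_to = "value")
ggplot(flight_long, aes(time, value)) +
  geom_line() +
  facet_grid(measurement ~ ., scales = "free_y")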

Visualising changing rank-ordering when missing data is present

I wish to visualise changes in relative rankings between categories through time, much like this so-called 'subway-style' plot. However, not all categories are present in all time steps.
I have made a preliminary plot (see attached) that is sufficient to interpret the data (one simply needs to look at the crossing lines). However, because not every category is represented in every time slice, lines may traverse the y-axis without any change in rank, which is visually confusing:
Do algorithms exist for minimising the kinks in static ranks when missing data is present? To put it another way, my goal is to maintain straight lines wherever possible (when there is no change in relative rank).

Tableau map shapes overlapped

I am trying to render some geographic data onto a map in Tableau. However, some data points are located at the same coordinates, so their shape images overlap. By clicking on a shape, you can only get the top one.
How can we distinguish the overlapped data points in Tableau? I know that we can manually exclude the top data point to see the one beneath, but is there another way, for example a drop-down list in the right-click menu to select among the overlapped data points?
Thank you!
There are a couple of ways to deal with this issue.
Some choices you can try are:
Add some transparency to the marks by editing the color shelf properties. That way you at least get a visual indication when multiple marks are stacked on top of each other. This approach can be considered a poor man's heat map if you have many points in different areas, as the denser/darker sections will have more marks. (But that only affects the appearance and doesn't help you select and view details for marks that are covered by others.)
Add some small pseudo-random jitter to each coordinate using calculated fields. This will be easier once Tableau supports a rand() function, but in the meantime you can get creative with other fields and the math functions to add a little jitter. The goal is to shift locations just enough that they don't stack exactly, but not enough to matter for precision; how much depends on the scale.
Make a grid-style heat map where the color indicates the number of data points in each grid cell. To do this, you'll need to create calculated fields that bin together nearby latitudes and longitudes, say by rounding each coordinate to a certain number of decimal places, or by using the hex bin functions in Tableau. Those calculated fields will need to have a geographic role and be treated as continuous dimensions.
Define your visualization to display one mark for each unique location, and then use color or size to indicate the number of data points at that location, as opposed to a mark for each individual data point

Need a fast dataset 2D-viewer/plotter for large datasets

I'm searching for a data viewer/plotter for some data I've generated.
Facts
First some facts about the data I've generated:
There are several datasets with about 3 million data points each.
Each dataset is currently stored in ASCII format.
Every line represents a point and consists of multiple columns.
The first two columns determine the position of the point (i.e. the x and y values): the first column is a timestamp and the second is a normalized float between 0 and 1.
The other columns contain additional data which may be used to colorize the plot or filter the data.
An example data point:
2012-08-08T01:02:03.040 0.0165719281 foobar SUCCESS XX:1
Current Approach
Currently I am generating multiple png files (with gnuplot) for each dataset, with different selection criteria like the following:
Display all points in grey.
Display all points in grey, but SUCCESS in red.
Display all points in grey, but SUCCESS in red and XX:-1 in green; if both SUCCESS and XX:-1 match, use blue.
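Purely as an illustration (not my actual gnuplot scripts), the same selection and coloring rules could be written in R roughly as below, with hypothetical names for the five whitespace-separated fields shown in the example line:
library(data.table)
library(ggplot2)
pts <- fread("dataset.txt",
             col.names = c("time", "value", "label", "status", "flag"))
pts[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OS")]
pts[, color := "grey"]                                        # default: grey
pts[status == "SUCCESS", color := "red"]                      # SUCCESS in red
pts[flag == "XX:-1", color := "green"]                        # XX:-1 in green
pts[status == "SUCCESS" & flag == "XX:-1", color := "blue"]   # both: blue
ggplot(pts, aes(time, value, color = color)) +
  geom_point(size = 0.3) +
  scale_color_identity()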
Drawbacks
With the current approach there are some drawbacks I'd like to have addressed:
I can't easily switch filters or colorings on and off because I have to generate a new png file every time.
I need to use a limited resolution in my image files because the higher the resolution, the slower the viewer. So I can only zoom in to a limited level of detail.
I don't have the raw data available for each point in the png viewer. Ideally I'd like the data to be visible when a point is selected.
Already tested
I've already tested some other approaches:
Gnuplot itself has a viewer, but it can't handle that many points efficiently - it is too slow and consumes too much memory.
I've had a quick look at KST, but I couldn't find a way to display 2D data and I don't think it will meet my wishes.
Wishes
I'd like a viewer which operates on the raw data, displays the points quickly when zoomed out, can also zoom in quickly, and resolves the aforementioned drawbacks.
Question
So finally, does anybody know of such a viewer or has another suggestion?
If there isn't such a viewer, recommendations for programming it myself are welcome, too.
Thanks in advance
Stefan

How to avoid overplotting (for points) using base-graph?

I am on my way to finishing the graphs for a paper and decided (after a discussion on stats.stackexchange), in order to convey as much information as possible, to create the following graph that presents both the means in the foreground and the raw data in the background:
However, one problem remains, and that is overplotting. For example, the marked point looks like it reflects one data point, but in fact 5 data points exist with the same value at that place.
Therefore, I would like to know if there is a way to deal with overplotting in base graphics using points() as the plotting function.
It would be ideal if, e.g., the respective points got darker, or thicker, or...
Manually doing it is not an option (too many graphs and points like this). Furthermore, ggplot2 is also not something I want to learn just to deal with this single problem (one reason is that I tend to like dual axes, which are not supported in ggplot2).
Update: I wrote a function which automatically creates the above graphs and avoids overplotting by adding vertical or horizontal jitter (or both): check it out!
This function is now available as raw.means.plot and raw.means.plot2 in the plotrix package (on CRAN).
The standard approach is to add some noise to the data before plotting. R has a function jitter() which does exactly that. You could use it to add the necessary noise to the coordinates in your plot, e.g.:
# simulated example data: a numeric response X and a 10-level factor Z
X <- rep(1:10, 10)
Z <- as.factor(sample(letters[1:10], 100, replace = TRUE))
# jitter the factor codes so tied points no longer overlap exactly
plot(jitter(as.numeric(Z), factor = 0.2), X, xaxt = "n")
# restore readable factor labels on the x axis
axis(1, at = 1:10, labels = levels(Z))
Besides jittering, another good approach is alpha blending, which you can obtain (on the graphics devices supporting it) as the fourth color parameter. I provided an example for 'overplotting' of two histograms in this SO question.
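A minimal sketch of that idea on the example data above: adjustcolor() (or rgb() with a fourth value) makes the points semi-transparent, so stacked points render darker. It only works on devices that support semi-transparency.
# semi-transparent points: overlapping observations render darker
plot(jitter(as.numeric(Z), factor = 0.2), X,
     pch = 16, col = adjustcolor("black", alpha.f = 0.3), xaxt = "n")
axis(1, at = 1:10, labels = levels(Z))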
One additional idea for the general problem of showing the number of points is a rug plot (the rug() function), which places small tick marks along the margin to show how many points contribute (still use jittering or alpha blending for ties). This lets the actual points show their true rather than jittered values, while the rug indicates which parts of the plot have more values.
For the example plot, direct jittering or alpha blending is probably best, but in some other cases the rug plot can be useful.
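A quick sketch of the rug idea on the same example data: the points keep their true positions while a jittered rug along the x axis hints at how many tied values sit at each level.
plot(as.numeric(Z), X, xaxt = "n")      # true (un-jittered) positions
rug(jitter(as.numeric(Z)), side = 1)    # jittered tick marks reveal ties
axis(1, at = 1:10, labels = levels(Z))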
You could also use sunflowerplot(), though it would be hard to apply here. I would use alpha blending, as Dirk suggested.
