R plotting strangeness with large dataset

I have a data frame with several million points in it - each having two values.
When I plot this like this:
plot(myData)
All the points are plotted, but the plot is quite busy, so I thought I'd plot it as a line:
plot(myData, type="l")
But while the x axis doesn't change (i.e. goes from 0 to 7e+07), the actual plotting stops at about 3e+07 and I don't actually get a proper line plot either.
Is there a limitation on line plotting?
Update
If I use
plot(myData, type="h")
I get correct and usable output, but I still wonder why the type="l" option fails so badly.
Further update
I am plotting a time series - here is one output using type="h":
That's perfectly usable, but having a line would allow me to compare several outputs.

The graphical representation of large, high-dimensional data is a growing issue in data analysis. The problem is not actually creating the graph; the problem is making the graph communicate information that we can turn into useful knowledge. Let me illustrate the point with a dataset of a million observations, which is not even that big.
x <- rnorm(10^6, 0, 1)
y <- rnorm(10^6, 0, 1)
Let's plot it. R can certainly handle such a problem, but can we? Probably not.
After all, what kind of information can we extract from what is essentially a solid blot of ink? Probably no more than a tasseographer trying to divine the future from patterns of tea leaves, coffee grounds, or wine sediments.
plot(x, y)
A different approach is offered by the smoothScatter function, which produces a smoothed density plot of bivariate data. Here are two examples.
First, with defaults.
smoothScatter(x, y)
Second, the bandwidth is specified to be a little larger than the default, and the five points in the lowest-density regions (nrpoints = 5) are shown using a different symbol, pch = 3.
smoothScatter(x, y, bandwidth=c(5,1)/(1/3), nrpoints=5, pch=3)
As you can see, the problem is not solved entirely. Nevertheless, we get a much better grasp of the distribution of our data. This kind of approach is still evolving, and several aspects of it are under active discussion. If this seems a more suitable way to represent your big dataset, I suggest you visit this blog, which discusses the issue thoroughly.

For what it's worth, all the evidence I have is that the computer - even though it was a lump of big iron - ran out of memory.
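If memory is indeed the bottleneck, one pragmatic workaround (just a sketch, assuming myData is a two-column data frame already sorted by its x values) is to thin the series before asking for a line:
n   <- nrow(myData)
idx <- unique(round(seq(1, n, length.out = min(n, 1e5))))   # keep at most ~100,000 points
plot(myData[idx, ], type = "l")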


Are dual-axes time series plots more acceptable if the data for each axis is from the exact same time/location?

I know dual-axes time series plots can be misleading. They're difficult to make in ggplot because Hadley Wickham believes they are fundamentally flawed. Others have concluded that they are ok sometimes, when axes are chosen so that the lines look as though they had been converted to indices first (even if they are given in their actual units). I'm wondering if this example is one in which dual-axes are justifiable.
This online tool is an example similar to what I want to create: https://carve.ornl.gov/visualize/
Measurements taken at the same point in time, from the same flight, are plotted over time. The user can select any two measurements to overlay, and the time matches up with a map showing flight coordinates. I think this is an elegant way for users to interact with the data, and I can't really imagine an alternative that would convey the same information.
That being said, I am interested to hear other opinions. Will this type of plot draw vitriol from other data scientists?! Do you have other ideas? And, if you have recommendations for what R tools I should turn to (since ggplot might be off the table...), I would love to hear them (I will be using Shiny). Thanks!
The debate on multiple axes on the same Cartesian plane is indeed a hot one. It reminds me of the endless debates around approaches in the social sciences.
If you follow the orthodoxy of the Grammar of Graphics gospel, then the graph you linked is flawed. To come back into the herd, you could simply map either the CO2 or the altitude to a different plotted symbology, such as the size or color of the dots. Or simply plot two separate panels, aligned on the x scale.
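A small base-R sketch of the two-panel idea with made-up data (co2 and altitude here are placeholders for whichever two measurements get selected):
time     <- seq_len(200)
co2      <- 400 + cumsum(rnorm(200))               # fake measurement 1
altitude <- 5000 + 50 * sin(time / 10)             # fake measurement 2

op <- par(mfrow = c(2, 1), mar = c(4, 4, 1, 1))
plot(time, co2,      type = "l", xlab = "",     ylab = "CO2 (ppm)")
plot(time, altitude, type = "l", xlab = "Time", ylab = "Altitude (m)")
par(op)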
Now, the Grammar of Graphics people have much less of a problem with multiple scales for the plotted attributes (color, size, and so on) than with multiple scales on the Cartesian axes.
Yet, I think that methodological opportunism is preferable to methodological orthodoxy: do whatever makes it easiest for you to communicate the idea to the public.

Polygon/contour around subset of vertices on graph (more precise than mark.groups in igraph)

Problem definition
I need to produce a number of specific graphs, and on these graphs, highlight subsets of vertices (nodes) by drawing a contour/polygon/range around or over them (see image below).
A graph may have multiple of these contours/ranges, and they may overlap, iff one or more vertices belong to multiple subsets.
Given a graph of N vertices, any subset may be of size 1..N.
However, vertices not belonging to a subset must not be inside the contour (as that would be misleading, so that's priority no. 1). This is the gist of my problem.
All these graphs happen to have the property that the ranges are continuous, as the data they represent covers only directly connected subsets of vertices.
All graphs will be undirected and connected (no unconnected vertices will ever be plotted).
Reproducible attempts
I am using R and the igraph package. I have already tried some solutions, but none of them work well enough.
First attempt, mark.groups in plot.igraph:
library(igraph)
g = make_graph("Frucht")
l = layout.reingold.tilford(g,1)
plot(g, layout=l, mark.groups = c(1,3,6,12,5), mark.shape=1)
# bad, vertex 11 should not be inside the contour
plot(g, layout=l, mark.groups = c(1,6,12,5,11), mark.shape=1)
# 3 should not be in; image below
# just choosing another layout here is not a generalizable solution
plot.igraph calls igraph.polygon, which calls convex_hull (also from igraph), which calls xspline. The result is, from what I understand, a convex hull (which otherwise looks very nice!), but for my purposes it is not precise enough, covering vertices that should not be covered.
Second attempt with contour. So I tried implementing my own version, based on the solution suggested here:
library(MASS)
xx <- runif(5, 0, 1)
yy <- abs(xx) + rnorm(5, 0, 0.2)
plot(xx, yy, xlim = c(min(xx) - sd(xx), max(xx) + sd(xx)),
     ylim = c(min(yy) - sd(yy), max(yy) + sd(yy)))
dens2 <- kde2d(xx, yy,
               lims = c(min(xx) - sd(xx), max(xx) + sd(xx),
                        min(yy) - sd(yy), max(yy) + sd(yy)),
               h = c(bandwidth.nrd(xx) / 1.5, bandwidth.nrd(xx) / 1.5),
               n = 50)
contour(dens2, levels = 0.001, col = "red", add = TRUE, drawlabels = FALSE)
The contour plot looks in principle like something I could use, given enough tweaking of the bandwidth and level values (to make the contour snug enough that it doesn't cover any points outside the group). However, this solution has the drawback that when the level value is too small, the contour breaks (it doesn't produce a continuous area) - so if I went that way, I would need to automatically control for continuity (and determine good bandwidth/level values on the fly). Another problem is that I cannot quite see how I could plot the contour over the plots produced by igraph: the layout.* commands produce what looks like a coordinate matrix, but the coordinates do not match the axis coordinates on the plot:
# compare:
layout.reingold.tilford(g,1)
plot(g, layout=l, axes=T)
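A note on the coordinate mismatch: by default plot.igraph rescales the layout to the [-1, 1] range before drawing, so coordinates taken straight from layout.* won't line up with the axes. One way around this (a sketch relying on igraph's rescale argument and the norm_coords helper) is to do the rescaling yourself and then plot without it:
ln <- norm_coords(l)                # same [-1, 1] rescaling plot.igraph applies
plot(g, layout = ln, rescale = FALSE, xlim = c(-1, 1), ylim = c(-1, 1), axes = TRUE)
points(ln[, 1], ln[, 2], pch = 3, col = "red")   # lands on the vertex centres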
The question:
What would be a better way to achieve the plotting of such ranges on graphs (ideally igraphs) in R, meeting the criteria outlined above - ranges that include only the vertices belonging to their subset and exclude all others - while remaining continuous?
The solution I am looking for should be scalable to graphs of different sizes and layouts that I may need to create (so hand-tweaking each graph using e.g. tkplot is not a good solution). I am aware that for some graphs and some vertex groups, meeting both criteria will be impossible in practice, but intuitively it should be possible to implement something that still works most of the time for smallish (10..20 vertices) and not-too-complex graphs (ideally it would be possible to detect and give a warning if a perfectly fitting range could not be plotted). Either an improvement of the mark.groups approach (not necessarily within the package, but using the hull idea mentioned above), or something with contour or a similar suitable function, or something else entirely would be welcome, as long as it works (most of the time).
Update stemming from the discussion: a solution that only utilizes functions of core R or CRAN packages (not external software) is desirable, since I will eventually want to incorporate this functionality in a package.
Edit: specified the last paragraph as per the comments.
The comment area is not long enough to fit my answer there, so I'm putting this here, although I'd rather post it as a comment as it is not a full solution.
Quite a long throw, but the first thing that popped into my mind is support vector machines. The idea would be that you construct a support vector machine classifier that classifies your points into two groups (in or out) based on the coordinates of the vertices, using some non-linear kernel function (I would try the radial basis function). Then you plot the separating hyperplane of the trained support vector machine. One drawback is that the area that you obtain this way might be unbounded (i.e. go to infinity in some directions), so this idea definitely requires some further thinking, but at least that's one possible direction to go.
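To make that direction a bit more concrete, here is a rough, untuned sketch using the e1071 package (the kernel parameters are guesses, and nothing guarantees the resulting region is bounded or excludes every out-of-group vertex):
library(igraph)
library(e1071)                                    # assumed available, provides svm()

g <- make_graph("Frucht")
l <- norm_coords(layout.reingold.tilford(g, 1))   # coordinates rescaled to [-1, 1]

group  <- c(1, 6, 12, 5, 11)                      # the subset to enclose
inside <- factor(seq_len(vcount(g)) %in% group)

d   <- data.frame(x = l[, 1], y = l[, 2])
fit <- svm(d, inside, kernel = "radial", gamma = 5, cost = 100)   # tuning is guesswork

# evaluate the decision function on a grid and draw its zero contour
gr  <- expand.grid(x = seq(-1.2, 1.2, length.out = 200),
                   y = seq(-1.2, 1.2, length.out = 200))
dec <- attr(predict(fit, gr, decision.values = TRUE), "decision.values")

plot(g, layout = l, rescale = FALSE, xlim = c(-1, 1), ylim = c(-1, 1))
contour(unique(gr$x), unique(gr$y), matrix(dec, 200, 200),
        levels = 0, add = TRUE, drawlabels = FALSE, col = "red")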

How to easily visualize a matrix?

When doing matrix operations, I would like to be able to see what the results of my calculations are, at least to get a rough idea of the nature of the matrices going in and coming out of the operation.
How can I plot a matrix of real numbers, so that the x axis represents columns, the y represents rows, and the color or size of a point represents the cell value?
Ultimately, I would like to display multiple plots, e.g. the right and left hand sides of an equation.
Here is some example code:
a <- matrix(rnorm(100), ncol = 10)
b <- diag(1,10)
c <- a*b
par(mfrow = c(1,3))
plot.matrix.fn <- function(m) {
#enter answer to this question here
}
lapply(list(a,b,c), plot.matrix.fn)
update: since posting this question, I found that there are some great examples here: What techniques exists in R to visualize a "distance matrix"?
You could try something like (adjusting the parameters to your particular needs)
image(t(m[nrow(m):1, ]), axes = FALSE, zlim = c(-4, 4), col = rainbow(21))
producing something like
See ?image for a single plot (note that row 1 will be at the bottom) and ?rasterImage for adding 1 or more representations to an existing plot. You may want to do some scaling or other transformation on the matrix first.
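Wrapping that into the placeholder from the question might look something like the sketch below (the rainbow palette and the per-matrix zlim default are arbitrary choices):
plot.matrix.fn <- function(m, zlim = range(m, na.rm = TRUE), ...) {
  # columns on the x axis, rows on the y axis, row 1 at the top;
  # colour encodes the cell value
  image(t(m[nrow(m):1, , drop = FALSE]), axes = FALSE,
        zlim = zlim, col = rainbow(21), ...)
  box()
}

a <- matrix(rnorm(100), ncol = 10)
b <- diag(1, 10)
par(mfrow = c(1, 3))
invisible(lapply(list(a, b, a * b), plot.matrix.fn))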
Not an answer but a longer comment.
I've been working on a package to plot matrices using grid.raster, but it's not quite ready for release yet. Your example would read,
library(gridplot)
row_layout(a, b, c)
I found that writing custom functions was probably easier than tweaking 10s of parameters in lattice or base graphics, and ggplot2 lacks some control over the axes.
However, writing graphics functions from scratch also means reinventing non-trivial things like layout and positioning; hopefully Hadley's scales and guides packages can make this easier. I'll add the functions to gridExtra when the overall design seems sound and more stable.
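In the meantime, the grid.raster building block itself is easy to use directly; a bare-bones sketch (plain grid, not the gridplot API mentioned above):
library(grid)

m    <- matrix(rnorm(100), ncol = 10)
cols <- grey((m - min(m)) / diff(range(m)))        # map cell values to greys in [0, 1]
grid.newpage()
grid.raster(matrix(cols, nrow = nrow(m)), interpolate = FALSE)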

How to avoid overplotting (for points) using base-graph?

I am on my way to finishing the graphs for a paper and decided (after a discussion on stats.stackoverflow), in order to convey as much information as possible, to create the following graph, which presents both the means (in the foreground) and the raw data (in the background):
However, one problem remains, and that is overplotting. For example, the marked point looks like it reflects one data point, but in fact five data points exist with the same value at that place.
Therefore, I would like to know if there is a way to deal with overplotting in base graphics using points as the function.
It would be ideal if, e.g., the respective points got darker, or thicker, or ...
Manually doing it is not an option (too many graphs and points like this). Furthermore, ggplot2 is also not what I want to learn to deal with this single problem (one reason is that I tend to like dual axes, which are not supported in ggplot2).
Update: I wrote a function which automatically creates the above graphs and avoids overplotting by adding vertical or horizontal jitter (or both): check it out!
This function is now available as raw.means.plot and raw.means.plot2 in the plotrix package (on CRAN).
Standard approach is to add some noise to the data before plotting. R has a function jitter() which does exactly that. You could use it to add the necessary noise to the coordinates in your plot. eg:
X <- rep(1:10,10)
Z <- as.factor(sample(letters[1:10],100,replace=T))
plot(jitter(as.numeric(Z),factor=0.2),X,xaxt="n")
axis(1,at=1:10,labels=levels(Z))
Besides jittering, another good approach is alpha blending, which you can obtain (on graphics devices supporting it) as the fourth colour parameter. I provided an example for 'overplotting' of two histograms in this SO question.
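For instance, something along these lines in base graphics (made-up data; the alpha value of 0.2 is just a starting point):
x <- rep(1:10, each = 20)
y <- round(rnorm(200, mean = x))                   # plenty of exact ties
plot(x, y, pch = 16, cex = 2,
     col = rgb(0, 0, 0, alpha = 0.2))              # overlapping points show up darker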
One additional idea for the general problem of showing the number of points is using a rug plot (rug function), this places small tick marks along the margin that can show how many points contribute (still use jittering or alpha blending for ties). This allows the actual points to show their true rather than jittered values, but the rug can then indicate which parts of the plot have more values.
For the example plot direct jittering or alpha blending is probably best, but in some other cases the rug plot can be useful.
You may also use sunflowerplot, though it would be hard to apply here. I would use alpha blending, as Dirk suggested.
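For reference, sunflowerplot marks each duplicated observation with an extra 'petal'; a toy example with made-up data:
x <- rep(1:5, times = 1:5)                         # the value i occurs i times
sunflowerplot(x, x, xlab = "x", ylab = "y")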

R: update plot [xy]lims with new points() or lines() additions?

Background:
I'm running a Monte Carlo simulation to show that a particular process (a cumulative mean) does not converge over time, and often diverges wildly in simulation (the expectation of the random variable = infinity). I want to plot about 10 of these simulations on a line chart, where the x axis has the iteration number, and the y axis has the cumulative mean up to that point.
Here's my problem:
I'll run the first simulation (each sim. having 10,000 iterations) and build the main plot based on its current range. But often one of the later simulations will have a range a few orders of magnitude larger than the first one, so the plot flies outside of the original range. So, is there any way to dynamically update the ylim or xlim of a plot upon adding a new set of points or lines?
I can think of two workarounds for this:
1. Store each simulation, then pick the one with the largest range and build the base graph off of that (not elegant, and I'd have to store a lot of data in memory, but it would probably be laptop-friendly). [EDIT: as Marek points out, this is not a memory-intense example, but if you know of a nice solution that'd support far more iterations such that it becomes an issue (think high-dimensional walks that require much, much larger MC samples for convergence), then jump right in.]
2. Find a seed that appears to build a nice-looking version of it, and set the ylim manually, which would make the demonstration reproducible.
Naturally I'm holding out for something more elegant than my workarounds. Hoping this isn't too pedestrian a problem, since I imagine it's not uncommon with simulations in R. Any ideas?
I'm not sure if this is possible using base graphics, if someone has a solution I'd love to see it. However graphics systems based on grid (lattice and ggplot2) allow the graphics object to be saved and updated. It's insanely easy in ggplot2.
require(ggplot2)
make some data:
foo <- as.data.frame(cbind(data=rnorm(100), numb=seq_len(100)))
make an initial ggplot object and plot it:
p <- ggplot(as.data.frame(foo), aes(numb, data)) + geom_line()
p
make some more data and add it to the plot
foo <- as.data.frame(cbind(data=rnorm(200), numb=seq_len(200)))
p <- p + geom_line(aes(numb, data), data = as.data.frame(foo), colour = "red")
plot the new object
p
I think (1) is the best option. I actually don't think it's inelegant; it would be more computationally intensive to redraw every time you hit a point outside the current xlim or ylim.
Also, I saw in Peter Hoff's book about Bayesian statistics a cool use of ts() instead of lines() for cumulative sums/means. It looks pretty spiffy.
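For completeness, a minimal base-graphics sketch of option (1), using Cauchy draws as a stand-in for a process whose cumulative mean does not settle down (the simulation sizes are arbitrary):
set.seed(1)
n.sim  <- 10
n.iter <- 10000

# one column per simulation: the running mean after each iteration
cm <- sapply(seq_len(n.sim),
             function(i) cumsum(rcauchy(n.iter)) / seq_len(n.iter))

# matplot() picks xlim/ylim from all columns at once, so nothing gets clipped
matplot(cm, type = "l", lty = 1, xlab = "Iteration", ylab = "Cumulative mean")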

Resources