How to easily visualize a matrix?

When doing matrix operations, I would like to be able to see what the results of my calculations are, at least to get a rough idea of the nature of the matrices going in and coming out of the operation.
How can I plot a matrix of real numbers, so that the x axis represents columns, the y represents rows, and the color or size of a point represents the cell value?
Ultimately, I would like to display multiple plots, e.g. the right and left hand sides of an equation.
Here is some example code:
a <- matrix(rnorm(100), ncol = 10)  # random 10 x 10 matrix
b <- diag(1, 10)                    # 10 x 10 identity matrix
c <- a * b                          # element-wise product (use %*% for the matrix product)
par(mfrow = c(1,3))
plot.matrix.fn <- function(m) {
#enter answer to this question here
}
lapply(list(a,b,c), plot.matrix.fn)
Update: since posting this question, I found that there are some great examples here: What techniques exists in R to visualize a "distance matrix"?

You could try something like (adjusting the parameters to your particular needs)
image(t(m[nrow(m):1, ]), axes = FALSE, zlim = c(-4, 4), col = rainbow(21))
producing a colour-coded grid of the matrix (image omitted).
See ?image for a single plot (note that row 1 will be at the bottom) and ?rasterImage for adding 1 or more representations to an existing plot. You may want to do some scaling or other transformation on the matrix first.
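For completeness, here is one way the plot.matrix.fn stub from the question could be filled in along these lines (a minimal sketch; the row flipping, scaling, and palette choices are arbitrary):
plot.matrix.fn <- function(m) {
  # flip the rows so row 1 is drawn at the top, then transpose for image()
  image(t(m[nrow(m):1, ]), axes = FALSE, zlim = range(m), col = heat.colors(21))
}
par(mfrow = c(1, 3))
for (m in list(a, b, c)) plot.matrix.fn(m)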

Not an answer but a longer comment.
I've been working on a package to plot matrices using grid.raster, but it's not quite ready for release yet. Your example would read,
library(gridplot)
row_layout(a, b, c)
I found that writing custom functions was probably easier than tweaking 10s of parameters in lattice or base graphics, and ggplot2 lacks some control over the axes.
However, writing graphics functions from scratch also means reinventing non-trivial things like layout and positioning; hopefully Hadley's scales and guides packages can make this easier. I'll add the functions to gridExtra when the overall design seems sound and more stable.

Related

How can you create Marginal Histogram Scatterplot using lattice package (not ggplot2)?

Long story short, I am working on an assignment for a data visualization course and the assignment specifies that we have to use the lattice package and that we have to create a marginal histogram scatterplot. (I know that asking about homework questions is frowned upon, but I'm not asking you to write my assignment for me - only asking for guidance or at least a direction to start in).
Our lecture and book don't mention anything about marginal histogram scatterplots and while the lecture shows how to create them using the standard plot function in R as well as how to do it using ggplot2, we are told not to use either. I've never used lattice before, and when I ask for help, I get general responses that aren't helpful at all.
Note: I'm not posting the question or what type of data I have to use as I'm not looking for an answer to the homework here. Just some help on where to begin. You can literally use any data if you want to show an example.
This is definitely a tricky question in lattice as well. There are quite a few compelling reasons why ggplot2 has become one of the more popular packages, while lattice remains extremely powerful. As this is part of a visualization course, I'd assume you are meant to come up with something similar to ggMarginal. For this you'll have to spend some time adjusting the margins of your lattice plot.
As a guideline for how I'd solve this question, I found an answer by doing the following:
Search Google for "lattice marginal histogram"; the second link is an answer on a mailing list, which gives an example for a similar problem.
Open R and, following the link, build a small example, e.g.:
data(mtcars)
library(lattice)
scatter <- xyplot(hp ~ mpg, mtcars)
hist <- histogram(~ mpg, mtcars)
plot(scatter, more = TRUE, split = c(1, 2, 1, 2))
plot(hist, more = FALSE, split = c(1, 1, 1, 2))
After getting this far, it comes down to figuring out what is actually happening. The link above suggests looking at ?plot.trellis, and the key point here is how we can move the plots around, which is controlled by split. Looking at the documentation (?plot.trellis) we get some help for understanding this argument:
a vector of 4 integers, c(x, y, nx, ny), that says to position the current plot at the x, y position in a regular array of nx by ny plots. (Note: this has origin at top left)
From here we have everything we need to create the marginal plot. If we make this a 2x2 layout, we'd place one histogram at c(1, 1, 2, 2), the scatter plot at c(2, 1, 2, 2), and the other histogram at c(2, 2, 2, 2), as sketched below. Of course this is not going to be the best-looking marginal plot; for that you'd have to work with the margins or go under the hood and manually set up the plot using the grid package. I'd say that is definitely a bit on the "next level" side of things.
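A minimal sketch of that 2x2 arrangement (the marginal histograms are not rotated, and the variable choices are mine):
library(lattice)
data(mtcars)
scatter <- xyplot(hp ~ mpg, mtcars)
hist_y <- histogram(~ hp, mtcars)   # marginal for the y variable
hist_x <- histogram(~ mpg, mtcars)  # marginal for the x variable
plot(hist_y, split = c(1, 1, 2, 2), more = TRUE)   # top left
plot(scatter, split = c(2, 1, 2, 2), more = TRUE)  # top right
plot(hist_x, split = c(2, 2, 2, 2), more = FALSE)  # bottom right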
Note:
In the above example I didn't cover how to rotate a histogram or create a sideways histogram, in case you are seeking to replicate ggMarginal more closely.
In addition, since you said you had some problems finding information on this: another option would've been reading the ?histogram documentation page. Several examples on that page (and many others) show how to manipulate the position of lattice plots.

R plotting strangeness with large dataset

I have a data frame with several million points in it - each having two values.
When I plot this like this:
plot(myData)
All the points are plotted, but the plot is quite busy, so I thought I'd plot it as a line:
plot(myData, type="l")
But while the x axis doesn't change (i.e. goes from 0 to 7e+07), the actual plotting stops at about 3e+07 and I don't actually get a proper line plot either.
Is there a limitation on line plotting?
Update
If I use
plot(myData, type="h")
I get correct and useable output, but I still wonder why the type="l" option fails so badly.
Further update
I am plotting a time series - here is one output using type="h" (image omitted):
That's perfectly usable, but having a line would allow me to compare several outputs.
Graphical representation of high-dimensional data is a growing issue in data analysis. The problem, actually, is not creating the graph; the problem is making the graph communicate information that we can turn into useful knowledge. Allow me to illustrate the point with a dataset of a million observations, that is, not that big.
x <- rnorm(10^6, 0, 1)
y <- rnorm(10^6, 0, 1)
Let's plot it. R can easily manage such a problem, but can we? Probably not.
After all, what kind of information can we deduce from a hard stain of ink? Probably no more than a tasseographer trying to divine the future from patterns of tea leaves, coffee grounds, or wine sediments.
plot(x, y)
A different approach is the smoothScatter function, which creates a density plot of bivariate data. Here we create two examples.
First, with defaults.
smoothScatter(x, y)
Second, with the bandwidth specified a little larger than the default, and five points shown using a different symbol (pch = 3).
smoothScatter(x, y, bandwidth = c(5, 1) / (1/3), nrpoints = 5, pch = 3)
As you can see, the problem is not solved. Nevertheless, we get a better grasp of the distribution of our data. This kind of approach is still in development, and several related matters are still being discussed and refined. If this approach suits your big dataset, I suggest you visit this blog, which discusses the issue thoroughly.
For what it's worth, all the evidence I have is that the computer - even though it was a lump of big iron - ran out of memory.
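If memory is indeed the limit, one workaround (my own suggestion, not part of the answers above) is to thin the series before drawing the line:
# plot every k-th point so type = "l" stays within memory limits;
# assumes myData is the two-column data frame from the question
k <- 100
idx <- seq(1, nrow(myData), by = k)
plot(myData[idx, ], type = "l")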

How to reproduce this graphical explanation (a scatter plot) of how covariance works?

I found this intuitive graphical explanation of covariance:
32 binormal points drawn from distributions with the given covariances, ordered from most negative (bluest) to most positive (reddest)
The whole material can be found at:
https://stats.stackexchange.com/questions/18058/how-would-you-explain-covariance-to-someone-who-understands-only-the-mean
I would like to recreate this sort of graphical illustration in R, but I'm not sufficiently familiar with R's plotting tools. I don't even know where to start in order to get those colored rectangles between each pair of data points, let alone make them semi-transparent.
I think this could make a very effective teaching tool.
The cor.rect.plot function in the TeachingDemos package makes plots similar to what is shown. You can modify the code for the function to make the plot even more similar if you desire.
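A quick usage sketch (the data here is made up for illustration):
# install.packages("TeachingDemos")  # if not already installed
library(TeachingDemos)
set.seed(1)
x <- rnorm(32)
y <- 0.8 * x + rnorm(32, sd = 0.6)  # a positively covarying pair
cor.rect.plot(x, y)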

Polygon/contour around subset of vertices on graph (more precise than mark.groups in igraph)

Problem definition
I need to produce a number of specific graphs, and on these graphs, highlight subsets of vertices (nodes) by drawing a contour/polygon/range around or over them (see image below).
A graph may have multiple of these contours/ranges, and they may overlap, iff one or more vertices belong to multiple subsets.
Given a graph of N vertices, any subset may be of size 1..N.
However, vertices not belonging to a subset must not be inside the contour (as that would be misleading, so that's priority no. 1). This is the gist of my problem.
All these graphs happen to have the property that the ranges are continuous, as the data they represent covers only directly connected subsets of vertices.
All graphs will be undirected and connected (no unconnected vertices will ever be plotted).
Reproducible attempts
I am using R and the igraph package. I have already tried some solutions, but none of them work well enough.
First attempt, mark.groups in plot.igraph:
library(igraph)
g <- make_graph("Frucht")
l <- layout.reingold.tilford(g, 1)
plot(g, layout = l, mark.groups = c(1, 3, 6, 12, 5), mark.shape = 1)
# bad: vertex 11 should not be inside the contour
plot(g, layout = l, mark.groups = c(1, 6, 12, 5, 11), mark.shape = 1)
# bad: vertex 3 should not be in; image below
# just choosing another layout here is not a generalizable solution
plot.igraph calls igraph.polygon, which calls convex_hull (also igraph), which calls xspline. The result is, from what I understand, a convex hull (which otherwise looks very nice!), but for my purposes it is not precise enough, covering vertices that should not be covered.
Second attempt, contour: I tried implementing my own version, based on the solution suggested here:
library(MASS)
xx <- runif(5, 0, 1)
yy <- abs(xx) + rnorm(5, 0, 0.2)
plot(xx, yy, xlim = c(min(xx) - sd(xx), max(xx) + sd(xx)), ylim = c(min(yy) - sd(yy), max(yy) + sd(yy)))
dens2 <- kde2d(xx, yy,
               lims = c(min(xx) - sd(xx), max(xx) + sd(xx), min(yy) - sd(yy), max(yy) + sd(yy)),
               h = c(bandwidth.nrd(xx), bandwidth.nrd(xx)) / 1.5, n = 50)
contour(dens2, level=0.001, col="red", add=TRUE, drawlabels=F)
The contour plot looks in principle like something I could use, given enough tweaking of the bandwidth and level values (to make the contour snug enough that it doesn't cover any points outside the group). However, this solution has the drawback that when the level value is too small, the contour breaks (it no longer encloses a continuous area), so if I went this way, checking for continuity (and determining good bandwidth/level values on the fly) would have to be automated. Another problem is that I cannot quite see how I could plot the contour over the plots produced by igraph: the layout.* commands produce what looks like a coordinate matrix, but the coordinates do not match the axis coordinates on the plot:
# compare:
layout.reingold.tilford(g, 1)
plot(g, layout = l, axes = TRUE)
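(One way around the coordinate mismatch, as a hedged sketch: keep the raw layout coordinates with rescale = FALSE, so overlays such as contour draw in the same coordinate system. The bandwidth h = 2 here is an arbitrary guess.)
library(igraph)
library(MASS)
g <- make_graph("Frucht")
l <- layout.reingold.tilford(g, 1)
grp <- c(1, 3, 6, 12, 5)
dens <- kde2d(l[grp, 1], l[grp, 2], h = 2, n = 100,
              lims = c(range(l[, 1]) + c(-1, 1), range(l[, 2]) + c(-1, 1)))
# rescale = FALSE keeps the layout coordinates as plot coordinates
plot(g, layout = l, rescale = FALSE, xlim = range(l[, 1]), ylim = range(l[, 2]))
contour(dens, levels = 0.01, col = "red", add = TRUE, drawlabels = FALSE)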
The question:
What would be a better way to plot such ranges on graphs (ideally igraphs) in R, meeting the criteria outlined above - ranges that include only the vertices belonging to their subset and exclude all others - while remaining continuous?
The solution I am looking for should scale to graphs of different sizes and layouts that I may need to create (so hand-tweaking each graph using e.g. tkplot is not a good solution). I am aware that for some graphs with some vertex groups, meeting both criteria will be impossible in practice, but intuitively it should be possible to implement something that still works most of the time with smallish (10..20 vertices) and not-too-complex graphs (ideally it would detect and warn when a perfectly fitting range cannot be plotted). Either an improvement of the mark.groups approach (not necessarily within the package, but using the hull idea mentioned above), or something with contour or a similar suitable function, or something else entirely would be welcome, as long as it works (most of the time).
Update stemming from the discussion: a solution that only utilizes functions of core R or CRAN packages (not external software) is desirable, since I will eventually want to incorporate this functionality in a package.
Edit: specified the last paragraph as per the comments.
The comment area is not long enough to fit my answer there, so I'm putting this here, although I'd rather post it as a comment as it is not a full solution.
Quite a long throw, but the first thing that popped into my mind is support vector machines. The idea would be that you construct a support vector machine classifier that classifies your points into two groups (in or out) based on the coordinates of the vertices, using some non-linear kernel function (I would try the radial basis function). Then you plot the separating hyperplane of the trained support vector machine. One drawback is that the area that you obtain this way might be unbounded (i.e. go to infinity in some directions), so this idea definitely requires some further thinking, but at least that's one possible direction to go.
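A rough sketch of that idea using the e1071 package (the group, kernel, and tuning values here are ad hoc guesses, not a worked-out solution):
library(igraph)
library(e1071)
g <- make_graph("Frucht")
l <- layout.reingold.tilford(g, 1)
colnames(l) <- c("x", "y")
grp <- c(1, 6, 12, 5, 11)
inside <- factor(seq_len(vcount(g)) %in% grp)
fit <- svm(x = l, y = inside, kernel = "radial", gamma = 2, cost = 100)
# classify a fine grid of layout coordinates and trace the class boundary
xs <- seq(min(l[, 1]) - 1, max(l[, 1]) + 1, length.out = 200)
ys <- seq(min(l[, 2]) - 1, max(l[, 2]) + 1, length.out = 200)
grid <- as.matrix(expand.grid(x = xs, y = ys))
z <- matrix(as.integer(predict(fit, grid)), length(xs), length(ys))
plot(g, layout = l, rescale = FALSE, xlim = range(xs), ylim = range(ys))
contour(xs, ys, z, levels = 1.5, add = TRUE, drawlabels = FALSE, col = "red")
As noted above, the enclosed region may turn out unbounded for some configurations, so the drawn boundary would need sanity checks.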

R: update plot [xy]lims with new points() or lines() additions?

Background:
I'm running a Monte Carlo simulation to show that a particular process (a cumulative mean) does not converge over time, and often diverges wildly in simulation (the expectation of the random variable = infinity). I want to plot about 10 of these simulations on a line chart, where the x axis has the iteration number, and the y axis has the cumulative mean up to that point.
Here's my problem:
I'll run the first simulation (each sim. having 10,000 iterations) and build the main plot based on its current range. But often one of the later simulations will have a range a few orders of magnitude larger than the first one, so the plot flies outside of the original range. So, is there any way to dynamically update the ylim or xlim of a plot upon adding a new set of points or lines?
I can think of two workarounds:
1. Store each simulation, then pick the one with the largest range and build the base graph off of that (not elegant, and I'd have to store a lot of data in memory, but probably laptop-friendly). [EDIT: as Marek points out, this is not a memory-intense example, but if you know of a nice solution that would support far more iterations, such that memory becomes an issue (think high-dimensional walks that require much, much larger MC samples for convergence), then jump right in.]
2. Find a seed that appears to build a nice-looking version of it and set the ylim manually, which would make the demonstration reproducible.
Naturally I'm holding out for something more elegant than my workarounds. Hoping this isn't too pedestrian a problem, since I imagine it's not uncommon with simulations in R. Any ideas?
I'm not sure if this is possible using base graphics; if someone has a solution I'd love to see it. However, graphics systems based on grid (lattice and ggplot2) allow the graphics object to be saved and updated. It's insanely easy in ggplot2.
library(ggplot2)
make some data:
foo <- data.frame(data = rnorm(100), numb = seq_len(100))
make an initial ggplot object and plot it:
p <- ggplot(foo, aes(numb, data)) + geom_line()  # layer(geom = 'line') is defunct in current ggplot2
p
make some more data and add it to the plot
foo <- data.frame(data = rnorm(200), numb = seq_len(200))
p <- p + geom_line(data = foo, aes(numb, data), colour = "red")  # a constant colour belongs outside aes()
plot the new object
p
I think (1) is the best option. I actually don't think it's inelegant; redrawing the plot every time you hit a point beyond xlim or ylim would be more computationally intensive.
Also, I saw in Peter Hoff's book on Bayesian statistics a cool use of ts() instead of lines() for cumulative sums/means. It looks pretty spiffy:
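(The original screenshot is missing; below is a guess at the idiom, not Hoff's actual code. rcauchy stands in for a heavy-tailed process whose cumulative mean does not converge, and computing all paths before plotting lets plot() pick limits that cover every series, which is workaround (1) from the question.)
set.seed(42)
n <- 10000
sims <- replicate(10, cumsum(rcauchy(n)) / seq_len(n))  # 10 cumulative-mean paths
plot(ts(sims), plot.type = "single", col = rainbow(10),
     xlab = "iteration", ylab = "cumulative mean")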
