Visualizing big-data xy regression plots in R (maybe contour histograms?)

I have 1 million x-y data points: 100,000 of them are from foo and 900,000 are from bar, perhaps with a few unusual mass points. I want to help my audience visualize the data themselves, not merely the regression or loess lines. I would draw the bars in red and the foos in blue, with my two loess lines on top of them. Think of something like:
K <- 1000 ; M <- K*K ; HT <- 100*K
x <- rnorm(M); y <- x+rnorm(M); y[1:HT] <- y[1:HT]+1 ; x[HT:(HT*2)] <- y[HT:(HT*2)] <- 0
pdf(file="try.pdf")
plot( x, y, col="blue", pch=".")
points( x[1:HT], y[1:HT], col="red", pch="." )
## scatter.smooth( x[1:HT], y[1:HT] ), but this seems to take forever
dev.off()
This is not a great visual (for example, the high-elevation mass point at zero is lost), and it also creates a 7.5MB(!) PDF file that my previewer almost chokes on. (Hint: JPEG compression works pretty well for this problem; that is, instead of pdf(), just use jpeg() and a different file extension. Drawback: the axes become fuzzily compressed, too.)
So, I need some better ideas. I am thinking of a two-dimensional filled.contour plot of the full data set (in a gray scale that does not reach too far towards black), with a plain contour overlay of the 1:HT points, and then two loess overlays. Alas, even to do this, I need to start by smoothing the number of data points that appear at each x-y location, and presumably binning first is not the best way to do this: it would throw away information that the contour plot could use.
Alternatively, I could stay with the standard xy plot and simply cull random points until the file is small enough and the visuals are good enough. Perhaps this, too, could be done better via binning.
Better ideas?
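One way to act on the binning idea from the question, sketched in base R only: count the points per grid cell, draw the counts as a gray-scale image on a log scale (so the mass point at zero stays visible without saturating everything else), overlay contours of the 1:HT subset, and add the two smoothers. The helper name bin2d, the 100x100 grid, the output file name, and the use of lowess() instead of loess() (for speed) are assumptions of this sketch, not part of the original post.

```r
# Sketch: 2-D binning instead of plotting one million individual points.
set.seed(1)
K <- 1000; M <- K * K; HT <- 100 * K
x <- rnorm(M); y <- x + rnorm(M)
y[1:HT] <- y[1:HT] + 1
x[HT:(HT * 2)] <- y[HT:(HT * 2)] <- 0            # mass point at (0, 0)

# bin2d (hypothetical helper): counts per cell; rows index x bins, columns y bins.
bin2d <- function(x, y, xbreaks, ybreaks) {
  nx <- length(xbreaks) - 1
  ny <- length(ybreaks) - 1
  ix <- findInterval(x, xbreaks, rightmost.closed = TRUE)
  iy <- findInterval(y, ybreaks, rightmost.closed = TRUE)
  ok <- ix >= 1 & ix <= nx & iy >= 1 & iy <= ny   # drop points outside the grid
  matrix(tabulate((iy[ok] - 1) * nx + ix[ok], nbins = nx * ny), nrow = nx)
}

xb <- seq(min(x), max(x), length.out = 101)       # cell boundaries
yb <- seq(min(y), max(y), length.out = 101)
xm <- (xb[-1] + xb[-length(xb)]) / 2              # cell midpoints for contour()
ym <- (yb[-1] + yb[-length(yb)]) / 2

all_counts <- bin2d(x, y, xb, yb)
foo_counts <- bin2d(x[1:HT], y[1:HT], xb, yb)

out_file <- file.path(tempdir(), "try-binned.pdf")
pdf(file = out_file)
# Light-to-dark gray scale that stops well short of black.
image(xb, yb, log1p(all_counts), col = gray(seq(1, 0.3, length.out = 64)),
      xlab = "x", ylab = "y")
contour(xm, ym, foo_counts, add = TRUE, col = "red", nlevels = 5)
lines(lowess(x[1:HT], y[1:HT]), col = "red", lwd = 2)
lines(lowess(x[-(1:HT)], y[-(1:HT)]), col = "blue", lwd = 2)
dev.off()
```

Because only 100x100 cells are drawn, the file size no longer depends on the number of data points. The trade-off the question worries about remains: detail finer than one cell is lost, so the grid resolution is a choice to tune.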

Related

Scanning and storing a simple image in a complex matrix

I have been playing with linear algebra transformations in R, moving around a bunch of points plotted in the complex plane. I have posted the results here - the code is linked on the first sentence.
I would like to do the same operations on a real image. Evidently I don't want to get into Fourier transforming the image, or dealing with color or grayscale. I would like to get any old jpeg, turn it into a summarized plot of black and white dots, locate each dot in terms of its position in the complex plane, and then apply the linear algebra operations as I did to my drawing of a house.
The questions are: 1. What is the name for the type of stripped-down, basic black and white image that I am describing? 2. How can I turn a regular jpeg (or other file) into that type of image, and how can I then store each of the thousands of dots the image will contain in a matrix of complex numbers?
Is there software to do this? Is there code in R or python to do it?
It's not clear what you're trying to do with those complex vectors that couldn't be done more easily with standard x,y coordinates, but here is a possible starting point:
library(jpeg)
im <- readJPEG(system.file("img", "Rlogo.jpg", package="jpeg"))
gr <- apply(im, 1:2, mean)
bw <- which(gr < 0.5, arr.ind = TRUE)
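conjure_matrix_of_darkness <- function(bw, xlim=c(-2, 2), ylim=c(-2,2)){
  # bw comes from which(..., arr.ind = TRUE): column 1 is the pixel row (top to bottom),
  # column 2 is the pixel column (left to right). Map columns to x and flipped rows to y
  # so the image is not transposed or upside down.
  x <- (bw[,2] - min(bw[,2]))/diff(range(bw[,2])) * diff(xlim) + min(xlim)
  y <- (max(bw[,1]) - bw[,1])/diff(range(bw[,1])) * diff(ylim) + min(ylim)
  x+1i*y
}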
test <- conjure_matrix_of_darkness(bw)
par(mfrow=c(2,1), mar=c(0,0,0,0))
plot(test, pch=19, xaxt="n", yaxt="n")
plot(test*exp(1i*pi), pch=19, xaxt="n", yaxt="n")
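Since the question is about applying linear-algebra operations to the points, here is a minimal sketch of one way to push a complex vector such as test through an arbitrary real 2x2 matrix. The helper name apply_linear_map and the example matrices are illustrative assumptions, not part of the original answer.

```r
# apply_linear_map (hypothetical helper): treat each complex number as the
# column vector (Re, Im), multiply by the 2x2 matrix A, and convert back.
apply_linear_map <- function(z, A) {
  xy <- rbind(Re(z), Im(z))       # 2 x N matrix of coordinates
  out <- A %*% xy
  complex(real = out[1, ], imaginary = out[2, ])
}

z <- c(1+0i, 0+1i, 1+1i)

# Rotation by 90 degrees counter-clockwise: (x, y) -> (-y, x).
rot90 <- matrix(c(0, 1, -1, 0), nrow = 2)
# Horizontal shear: (x, y) -> (x + 0.5 * y, y).
shear <- matrix(c(1, 0, 0.5, 1), nrow = 2)

rotated <- apply_linear_map(z, rot90)
sheared <- apply_linear_map(z, shear)
```

Rotations and uniform scalings can be written as a single complex multiplication, as in test*exp(1i*pi) above, but a shear cannot, which is where the matrix form earns its keep.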

R - locate intersection of two curves

There are a number of questions in this forum on locating intersections between a fitted model and some raw data. However, in my case, I am in an early stage project where I am still evaluating data.
To begin with, I have created a data frame that contains a ratio value whose ideal value should be 1.0. I have plotted the data frame and also used abline() function to plot a horizontal line at y=1.0. This horizontal line and the plot of ratios intersect at some point.
plot(a$TIME.STAMP, a$PROCESS.RATIO,
xlab='Time (5s)',
ylab='Process ratio',
col='darkolivegreen',
type='l')
abline(h=1.0,col='red')
My aim is to locate the intersection point, say x, and draw two vertical lines at x±k with abline(v=x-k) and abline(v=x+k), where k is a tolerance band.
Applying a grid on the plot is not really an option because this plot will be a part of a multi-panel plot. And, because ratio data is very tightly laid out, the plot will not be too readable. Finally, the x±k will be quite valuable in my discussions with the domain experts.
Can you please guide me how to achieve this?
Here are two solutions. The first one uses locator() and will be useful if you do not have too many charts to produce:
x <- 1:5
y <- log(1:5)
df1 <-data.frame(x= 1:5,y=log(1:5))
k <-0.5
plot(df1,type="o",lwd=2)
abline(h=1, col="red")
locator()
By clicking on the intersection (and stopping the locator top left of the chart), you will get the intersection:
> locator()
$x
[1] 2.765327
$y
[1] 1.002495
You would then add abline(v=2.765327).
If you need a more programmable way of finding the intersection, we will have to estimate the function of your data. Unfortunately, you haven’t provided us with PROCESS.RATIO, so we can only guess what your data looks like. Hopefully, the data is smooth. Here’s a solution that should work with nonlinear data. As you can see in the previous chart, all R does is draw a line between the dots, so we have to fit a curve through them. Here I’m fitting a polynomial of order 2 (the second argument of poly()). If your data is less linear, try increasing the order; if your data is linear, use a simple lm.
fit <-lm(y~poly(x,2))
newx <-data.frame(x=seq(0,5,0.01))
fitline = predict(fit, newdata=newx)
est <-data.frame(newx,fitline)
plot(df1,type="o",lwd=2)
abline(h=1, col="red")
lines(est, col="blue",lwd=2)
Using this fitted curve, we can then find the closest point to y=1. Once we have that point, we can draw vertical lines at the intersection and at +/-k.
cross <-est[which.min(abs(1-est$fitline)),] #find the grid point whose fitted value is closest to 1
plot(df1,type="o",lwd=2)
abline(h=1)
abline(v=cross$x, col="green")
abline(v=cross$x-k, col="purple")
abline(v=cross$x+k, col="purple")
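If the raw series is not smooth enough for a polynomial fit, an alternative sketch is to look for sign changes of y - 1 between consecutive points and interpolate linearly within each such segment, which also returns every crossing rather than only the closest grid point. The helper name find_crossings is hypothetical, and the toy data are the ones from the answer above rather than the asker's PROCESS.RATIO.

```r
# find_crossings (hypothetical helper): x positions where the piecewise-linear
# curve through (x, y) meets the horizontal line y = level.
find_crossings <- function(x, y, level = 1) {
  d <- y - level
  n <- length(d)
  i <- which(d[-n] * d[-1] < 0)                  # segments with a sign change
  interp <- x[i] + (x[i + 1] - x[i]) * (0 - d[i]) / (d[i + 1] - d[i])
  sort(unique(c(interp, x[d == 0])))             # include exact hits on the line
}

df1 <- data.frame(x = 1:5, y = log(1:5))
k <- 0.5
cross <- find_crossings(df1$x, df1$y, level = 1)

plot(df1, type = "o", lwd = 2)
abline(h = 1, col = "red")
abline(v = cross, col = "green")
abline(v = c(cross - k, cross + k), col = "purple")
```

For the asker's data this would presumably be called as find_crossings(a$TIME.STAMP, a$PROCESS.RATIO), assuming TIME.STAMP is numeric and sorted. With noisy ratios it may return many crossings, in which case smoothing first (as in the polynomial fit above) is the safer route.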

Trouble saving complex plots in Octave

I'm running Octave 3.6.4 on a Windows 7 system.
I'm having a hard time saving plots as .png files. Plots that are somewhat complex (several subplots, lots of legends, 1.6 million datapoints) can only be saved using a very large image size, and even that works only sometimes.
A specific example (with titles and axis labels left out):
figure (11)
clf ;
subplot (1,2,1) ;
plot (ratingRepeated, (predictions - Ymean)(:), ".k", "markersize", 1) ;
axis([0.5 5.5 -4 8]) ;
grid("on") ;
subplot (1,2,2) ;
plot (ratingRepeated, predictions(:), ".k", "markersize", 1) ;
axis([0.5 5.5 -4 8]) ;
grid("on") ;
Generates a nice plot with millions of datapoints. But using:
print -dpng figure11
creates an image containing only a small portion of the plot. Sometimes it helps using a very large image like this:
print -dpng "-S3400,2400" figure11
But mostly Octave just stalls, then crashes after CTRL+C.
I have without success tried:
Using Octave 3.8.x
Using gnuplot rather than "fltk": graphics_toolkit ("gnuplot")
Closing all but one plot before saving
Maximizing the plot window
Searched high and low for a solution
Some minor but possibly related issues are: Danish characters won't show up; extremely slow performance, with plots taking several minutes to display and/or save.
Any advice would be greatly appreciated.
I would argue that you are trying to fix the problem in the wrong way. You have too many data points, and a normal scatter plot like the one you are attempting will not give you a good display of the distribution of the data. Instead, use some sort of density plot. Just compare these two:
x = vertcat (randn (2000000, 1)*3, randn (1000000, 1) +5);
y = vertcat (randn (2000000, 1)*3, randn (1000000, 1) +5);
plot (x, y, ".")
pkg load statistics;
data = hist3 ([x y], [100 100]);
imagesc (data)
axis xy
colormap (hot (124)(end:-1:1,:)) # invert colormap since hot ends in white
You can use one of the existing colormaps (see colormap list) or create your own. The most common is jet (not because it is good, but because it's the default):
colormap (jet (124))

Make a 3D rendered plot of time-series

I have a set of 3D coordinates (below - just for a single point, in 3D space):
x <- c(-521.531433, -521.511658, -521.515259, -521.518127, -521.563416, -521.558044, -521.571228, -521.607178, -521.631165, -521.659973)
y <- c(154.499557, 154.479568, 154.438705, 154.398682, 154.580688, 154.365189, 154.3564, 154.559189, 154.341309, 154.344223)
z <- c(864.379272, 864.354675, 864.365479, 864.363831, 864.495667, 864.35498, 864.358582, 864.50415, 864.35553, 864.359863)
xyz <- data.frame(x,y,z)
I need to make a time-series plot of this point with 3D rendering (so I can rotate the plot, etc.). The plot will visualize the trajectory of the point above over time (for example as a solid line). I used the 'rgl' package with the plot3d method, but I can't get it to plot the time series (the code below just plots a single point from the first frame):
require(rgl)
plot3d(xyz[1,1],xyz[1,2],xyz[1,3],axes=F,xlab="",ylab="",zlab="")
I found this post, but it doesn't really deal with a real-time rendered 3D plots. I would appreciate any suggestions. Thank you.
If you read help(plot3d) you can see how to draw lines:
require(rgl)
plot3d(xyz$x,xyz$y,xyz$z,type="l")
Is that what you want?
How about this? It uses rgl.pop() to remove a point and a line and draw them as a trail - change the sleep argument to control the speed:
ts <- function(xyz,sleep=0.3){
  plot3d(xyz,type="n")
  n = nrow(xyz)
  p = points3d(xyz[1,])
  l = lines3d(xyz[1,])
  for(i in 2:n){
    Sys.sleep(sleep)
    rgl.pop("shapes",p)
    rgl.pop("shapes",l)
    p=points3d(xyz[i,])
    l=lines3d(xyz[1:i,])
  }
}
The solution was simpler than I thought and the problem was that I didn't use as.matrix on my data. I was getting error (list) object cannot be coerced to type 'double' when I was simply trying to plot my entire dataset using plot3d (found a solution for this here). So, if you need to plot time-series of set of coordinates (in my case motion capture data of two actors) here is my complete solution (only works with the data set below!):
download example data set
read the above data into a table:
data <- read.table("Bob12.txt",sep="\t")
extract XYZ coordinates into separate matrices:
x <- as.matrix(subset(data,select=seq(1,88,3)))
y <- as.matrix(subset(data,select=seq(2,89,3)))
z <- as.matrix(subset(data,select=seq(3,90,3)))
plot the coordinates on a nice, 3D rendered plot using 'rgl' package:
require(rgl)
plot3d(x[1:nrow(x),],y[1:nrow(y),],z[1:nrow(z),],axes=F,xlab="",ylab="",zlab="")
You should get something like the image below (which you can also rotate, etc.). I hope you can recognise the joint centers of the two people. I still need to tweak it to make it visually better: the first frame as points (to clearly see the actors' joints), then a visible break, and then the rest of the frames as lines.

How to make topographic map from sparse sampling data?

I need to make a topographic map of a terrain for which I have only fairly sparse samples of (x, y, altitude) data. Obviously I can't make a completely accurate map, but I would like one that is in some sense "smooth". I need to quantify "smoothness" (probably the reciprocal of the average of the square of the surface curvature), and I want to minimize an objective function that is the sum of two quantities:
The roughness of the surface
The mean square distance between the altitude of the surface at the sample point and the actual measured altitude at that point
Since what I actually want is a topographic map, I am really looking for a way to construct contour lines of constant altitude, and there may be some clever geometric way to do that without ever having to talk about surfaces. Of course I want contour lines also to be smooth.
Any and all suggestions welcome. I'm hoping this is a well-known numerical problem. I am quite comfortable in C and have a working knowledge of FORTRAN. About Matlab and R I'm fairly clueless.
Regarding where our samples are located: we're planning on roughly even spacing, but we'll take more samples where the topography is more interesting. So for example we'll sample mountainous regions more densely than a plain. But we definitely have some choices about sampling, and could take even samples if that simplifies matters. The only issues are
We don't know how much terrain we'll need to map in order to find features that we are looking for.
Taking a sample is moderately expensive, on the order of 10 minutes. So sampling a 100x100 grid could take a long time.
Kriging interpolation may be of some use for smoothly interpolating your sparse samples.
R has many different relevant tools. In particular, have a look at the spatial view. A similar question was asked in R-Help before, so you may want to look at that.
Look at the contour functions. Here's some data:
x <- seq(-3,3)
y <- seq(-3,3)
z <- outer(x,y, function(x,y,...) x^2 + y^2 )
An initial plot is somewhat rough:
contour(x,y,z, lty=1)
Bill Dunlap suggested an improvement: "It often works better to fit a smooth surface to the data, evaluate that surface on a finer grid, and pass the result to contour. This ensures that contour lines don't cross one another and tends to avoid the spurious loops that you might get from smoothing the contour lines themselves. Thin plate splines (Tps from library("fields")) and loess (among others) can fit the surface."
library("fields")
contour(predictSurface(Tps(as.matrix(expand.grid(x=x,y=y)),as.vector(z))))
This results in a very smooth plot, because it uses Tps() to fit the data first and then calls contour on the fitted surface (you can also use filled.contour if you want it to be shaded).
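The same fit-then-contour idea can be sketched in base R with loess, which the quote above mentions as an alternative to Tps, and here it is applied to scattered rather than gridded samples. The sample locations, the synthetic altitude function, and the span value are assumptions made for illustration; span acts as the knob that trades smoothness against fidelity to the samples.

```r
# Sketch: smooth surface from scattered (x, y, altitude) samples, then contour it.
set.seed(42)
n <- 200
samples <- data.frame(x = runif(n, -3, 3), y = runif(n, -3, 3))
samples$alt <- samples$x^2 + samples$y^2 + rnorm(n, sd = 0.2)  # synthetic terrain

# Larger span = smoother surface; smaller span = closer to the samples.
fit <- loess(alt ~ x + y, data = samples, span = 0.3, degree = 2)

# Evaluate on a fine grid kept inside the sampled area
# (loess does not extrapolate by default).
gx <- seq(-2.5, 2.5, length.out = 60)
gy <- seq(-2.5, 2.5, length.out = 60)
grid <- expand.grid(x = gx, y = gy)
surface <- matrix(predict(fit, newdata = grid), nrow = length(gx))

contour(gx, gy, surface)
points(samples$x, samples$y, pch = 20, cex = 0.4)   # sample locations
```

Contouring the fitted surface, rather than smoothing individual contour lines, is what keeps the lines from crossing.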
For the plot, you can use either lattice (as in the above example) or the ggplot2 package. Use the geom_contour() function in that case. An example can be found here (ht Thierry):
ds <- matrix(rnorm(100), nrow = 10)
library(reshape)
molten <- melt(data = ds)
library(ggplot2)
ggplot(molten, aes(x = X1, y = X2, z = value)) + geom_contour()
Excellent review of contouring algorithm, you might need to mesh the surface first to interpolate onto a grid.
Maybe you can use the R packages GEOMap, geomapdata and gtm, together with Matrix, SparseM and slam.
