Trouble saving complex plots in Octave - plot

I'm running Octave 3.6.4 on a Windows 7 system.
I'm having a hard time saving plots as .png files. Plots that are somewhat complex (several subplots, lots of legends, 1.6 million data points) can only be saved using a very large image size, and even that works only sometimes.
A specific example (with titles and axis labels left out):
figure (11)
clf ;
subplot (1,2,1) ;
plot (ratingRepeated, (predictions - Ymean)(:), ".k", "markersize", 1) ;
axis([0.5 5.5 -4 8]) ;
grid("on") ;
subplot (1,2,2) ;
plot (ratingRepeated, predictions(:), ".k", "markersize", 1) ;
axis([0.5 5.5 -4 8]) ;
grid("on") ;
Generates a nice plot with millions of datapoints. But using:
print -dpng figure11
creates an image containing only a small portion of the plot. Sometimes it helps to use a very large image size, like this:
print -dpng "-S3400,2400" figure11
But mostly Octave just stalls, then crashes after CTRL+C.
I have without success tried:
Using Octave 3.8.x
Using gnuplot rather than "fltk": graphics_toolkit ("gnuplot")
Closing all but one plot before saving
Maximizing the plot window
Searched high and low for a solution
Some minor but possibly related issues: Danish characters won't show up, and performance is extremely slow, with plots taking several minutes to display and/or save.
Any advice would be greatly appreciated.

I would argue that you are trying to fix the problem in the wrong way. You have too many data points, and a normal scatter plot like the one you are trying to make will not give you a good display of how the data are distributed. Instead, use some sort of density plot. Just compare these two:
x = vertcat (randn (2000000, 1)*3, randn (1000000, 1) +5);
y = vertcat (randn (2000000, 1)*3, randn (1000000, 1) +5);
plot (x, y, ".")
pkg load statistics;
data = hist3 ([x y], [100 100]);
imagesc (data)
axis xy
colormap (hot (124)(end:-1:1,:)) # invert colormap since hot ends in white
You can use one of the existing colormaps (see colormap list) or create your own. The most common is jet (not because it is good, but because it's the default):
colormap (jet (124))
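For readers who want to see what the binning step amounts to, here is a minimal language-neutral sketch in plain Python (not Octave). It only illustrates the idea behind a 2-D histogram; the function name `bin_counts` and the fixed grid range are assumptions made for this example, and hist3 chooses its bins differently.

```python
import random

def bin_counts(xs, ys, nbins, lo, hi):
    """Count points per cell on an nbins x nbins grid covering [lo, hi) on both axes."""
    width = (hi - lo) / nbins
    counts = [[0] * nbins for _ in range(nbins)]
    for x, y in zip(xs, ys):
        if lo <= x < hi and lo <= y < hi:
            # min() guards against a rounding error pushing the index to nbins
            col = min(int((x - lo) / width), nbins - 1)
            row = min(int((y - lo) / width), nbins - 1)
            counts[row][col] += 1
    return counts

random.seed(0)
xs = [random.gauss(0, 3) for _ in range(20000)]
ys = [random.gauss(0, 3) for _ in range(20000)]
counts = bin_counts(xs, ys, nbins=100, lo=-15.0, hi=15.0)
total = sum(map(sum, counts))
print(total, "of", len(xs), "points fall inside the grid")
```

However many points go in, the result is always a 100 x 100 matrix of counts, which is why drawing and saving it as an image stays cheap.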

Related

Reducing number of datapoints when plotting in loglog scale in Gnuplot

I have a large dataset which I need to plot in loglog scale in Gnuplot, like this:
set log xy
plot 'A_1D_l0.25_L1024_r0.dat' u 1:($2-512)
[Figure: log-log plot of my data points]
[Link: text file with the data points]
Datapoints on the x axis are equally spaced, but because of the logscale they get very dense on the right part of the graph, and as a result the output file (I finally export it in .tex) gets very large.
In linear scale, I would simply use the option every to reduce the number of points which get plotted. Is there a similar option for loglogscale, such that the plotted points appear equally spaced?
I am aware of a similar question which was raised a few years ago, but in my opinion the solution is unsatisfactory: plotted points are not equally spaced along the x-axis. I think this is a really unsophisticated problem which deserves a clearer solution.
As I understand it, you don't want to plot the actual data points; you just want to plot a line through them. But you want to keep the appearance of points rather than a line. Is that right?
set log xy
plot 'A_1D_l0.25_L1024_r0.dat' u 1:($2-512) with lines dashtype '.' lw 2
Amended answer
If it is important to present outliers/errors in the data set then you must not use every or any other technique that simply discards or skips most of the data points. In that case I would prefer the plot with points that you show in the original question, perhaps modified to represent each point as a dot rather than a cross. I will simulate this by modifying a single point in your 500000 point data set (first figure below). But I would also suggest that the presence of outliers is even more apparent if you plot with lines (second figure below).
Showing error bounds is another alternative for noisy data, but the options depend on what you have to work with in your data set. If you want to pursue that, please ask a separate question.
If you really want to reduce the number of data to be plotted, you might consider the following script.
s = 0.1 ### sampling interval in log scale (try 0.05 for more detail)
c = log10(0.01) ### threshold used by sampler(x); initialize it to a value
### smaller than log10 of any x in the data
sampler(x) = (x>0 && log10(x)>=c) ? (c=ceil(log10(x)/s+0.5)*s, x) : NaN
set log xy
set grid xtics
plot 'A_1D_l0.25_L1024_r0.dat' using (sampler($1)):($2-512) with points pt 7 lt 1 notitle , \
'A_1D_l0.25_L1024_r0.dat' using 1:($2-512) with lines lt 1 notitle
This script samples the data in increments of roughly 0.1 on x-axis in log scale. It makes use of the property that points whose x value is evaluated as NaN in using are not drawn.
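To make the sampling rule easier to follow outside gnuplot, here is the same logic as a small Python sketch. The function name `log_sample` and the test range 1..100000 are assumptions for illustration; the real data file is not used.

```python
import math

def log_sample(xs, step=0.1, start=math.log10(0.01)):
    """Keep roughly one x per `step` decades; xs must be positive and ascending."""
    threshold = start
    kept = []
    for x in xs:
        if x > 0 and math.log10(x) >= threshold:
            kept.append(x)
            # Move the threshold past this point, as the gnuplot sampler does.
            threshold = math.ceil(math.log10(x) / step + 0.5) * step
    return kept

xs = range(1, 100001)
kept = log_sample(xs)
print(len(kept), "of", len(xs), "points kept")
```

Because the threshold always lands at least half a step beyond the last kept point, the kept points cannot crowd together on the right-hand side of a log axis.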

Visualizing big-data xy regression plots in R (maybe contour histograms?)

I have 1 million x-y data points. 100,000 of them are from foo; 900,000 of them are from bar. And perhaps a few unusual mass points. Let me help my audience visualize them, and not merely the regression or loess lines but the data. Let me draw bars in red, and foos in blue, and then my two loess lines on top of them. think something like
K <- 1000 ; M <- K*K ; HT <- 100*K
x <- rnorm(M); y <- x+rnorm(M); y[1:HT] <- y[1:HT]+1 ; x[HT:(HT*2)] <- y[HT:(HT*2)] <- 0
pdf(file="try.pdf")
plot( x, y, col="blue", pch=".")
points( x[1:HT], y[1:HT], col="red", pch="." )
## scatter.smooth( x[1:HT], y[1:HT] ), but this seems to take forever
dev.off()
this is not only not a great visual (for example, the high-elevation zero point is lost), but also creates a 7.5MB(!) pdf file. my previewer almost chokes on it, too. (hint: jpeg compression is pretty good for the problem. that is, instead of the pdf(), just use jpeg and a different file extension. drawback: the axes become fuzzily compressed, too.)
so, I need some better ideas. I am thinking two-dimensional filled.contourplot on the full data set (in a gray-scale reaching not too far towards black), with a plain contour overlay of the 1:HT points, and then two loess overlays. alas, even to do this, I need to start off smoothing the number of data points that appear at an x-y location, and presumably binning-first is not the best way to do this---it would throw away information, which the contour plot could use.
alternatively, I could stay with the standard xy plot, and simply cull random points until the file is small enough and the visuals good enough. this could be done perhaps better via binning, too.
better ideas?
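The "cull random points" idea could be sketched roughly as follows, in Python rather than R. The helper name `thin`, the cap of 5,000 points per group and the scaled-down sample sizes are all assumptions for illustration. Note that uniform culling can drop rare mass points, so it does not solve that part of the problem.

```python
import random

def thin(points, keep):
    """Return a uniform random subset of `points` with at most `keep` items."""
    if len(points) <= keep:
        return list(points)
    return random.sample(points, keep)

random.seed(1)
M, HT = 100_000, 10_000  # scaled-down stand-ins for the 1,000,000 / 100,000 in the question
x = [random.gauss(0, 1) for _ in range(M)]
y = [xi + random.gauss(0, 1) for xi in x]
small = list(zip(x[:HT], y[:HT]))  # the smaller group
large = list(zip(x[HT:], y[HT:]))  # the larger group
small_plot = thin(small, 5_000)
large_plot = thin(large, 5_000)
print(len(small_plot), len(large_plot))
```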

Why is my plot3d white in SciLab?

t = 0:%pi/50:10*%pi;
plot3d(sin(t),cos(t),t)
When I execute this code the plot is done but the line is not visible, only the box. Any ideas which property I have to change?
Thanks
The third argument should, in this case, be a matrix of the size (length arg1) x (length arg2).
You'd expect plot3d to behave like an extension of plot and plot2d but it isn't quite the case.
The 2d plot takes a vector of x and a vector of y and plots points at (x1,y1), (x2,y2) etc., joined with lines or not as per style settings. That fits the conceptual model we usually use for 2d plots: charting the relationship of one thing as a function of another, in most cases (y = f(x)). There are other ways to use a 2d plot: scatter graphs are common, but it's easy enough to produce one using the two-rows-of-data concept.
This doesn't extend smoothly to 3d though as there are many other ways you could use a 3d plot to represent data. If you gave it three vectors of coordinates and asked it to draw a line between them all what might we want to use that for? Is that the most useful way of using a 3d plot?
Most packages give you different visualisation types for the different kinds of data. Mathematica has a lot of 3d visualisation types and Python/Scipy/Mayavi2 has even more. Matlab has a number too but Scilab, while normally mirroring Matlab, in this case prefers to handle it all with the plot3d function.
I think of it like a contour plot: you give it a vector of x and a vector of y and it uses those to create a grid of (x,y) points. The third argument is then a matrix whose dimensions match those of the (x,y) grid holding the z-coordinates of each point. The first example in the docs does what I think you're after:
t=[0:0.3:2*%pi]';
z=sin(t)*cos(t');
plot3d(t,t,z);
The first line creates a column vector of length 21
-->size(t)
ans =
21. 1.
The second line computes a 21 x 21 matrix holding every pairwise product sin(t(i))*cos(t(j)). Note the transpose in the cos(t') term, which turns the column vector into a row vector.
-->size(z)
ans =
21. 21.
Then when it plots them it draws (x1,y1,z11), (x1,y2,z12), (x2,y2,z22) and so on. It draws lines between adjacent points in a mesh, or no lines, or just the surface.
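The shape relationship described above can be checked with a few lines of plain Python (a sketch for illustration, not Scilab code): a vector of 21 values gives a 21 x 21 matrix z whose entry (i, j) is sin(t[i])*cos(t[j]).

```python
import math

# t = 0, 0.3, ..., 6.0: the values of Scilab's 0:0.3:2*%pi
n = int(2 * math.pi / 0.3) + 1
t = [0.3 * i for i in range(n)]

# z[i][j] pairs every x grid value t[i] with every y grid value t[j]
z = [[math.sin(ti) * math.cos(tj) for tj in t] for ti in t]
print(len(t), len(z), len(z[0]))
```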

R Avoid x axis being automatically scaled in powers of ten

R automatically uses powers of ten for the x axis (values are from zero to 500000), but I want just the plain figures in steps of 50000 or so (NOT written as powers of ten).
I tried to set the axis with axis(1,c(0,100000,....)) but the labels are plotted as powers of ten again.
I tried to scale down the font with cex.axis, but it still uses powers of ten for the x axis. I think R tries to secure enough space between the values on the x axis, but I want to force the full values to be plotted.
Axis looks at the moment like this:
-4e+05 -2e+05 0e+00 2e+05 4e+04 and so on ...
This link seems to answer your question: http://tolstoy.newcastle.edu.au/R/help/05/09/12499.html
e.g. options(scipen=6) should help: scipen is a penalty against scientific notation, so I believe fixed notation is then used unless it would be more than 6 characters wider than the scientific form.
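The width comparison behind scipen can be illustrated with a simplified model in Python. This is only a sketch for single positive integers, and the function names are made up for this example. R's real formatting also depends on the digits option and on the other values in the vector.

```python
def fixed_repr(v):
    """Plain digits, e.g. 400000 -> '400000'."""
    return str(v)

def sci_repr(v):
    """Shortest exponent form of a positive integer, e.g. 400000 -> '4e+05'."""
    digits = str(v)
    mantissa = digits.rstrip("0")
    if len(mantissa) > 1:
        mantissa = mantissa[0] + "." + mantissa[1:]
    return f"{mantissa}e+{len(digits) - 1:02d}"

def uses_scientific(v, scipen=0):
    # Fixed notation is kept unless it is more than `scipen` characters wider.
    return len(fixed_repr(v)) > len(sci_repr(v)) + scipen

for v in (50000, 400000, 123456):
    print(v, sci_repr(v), uses_scientific(v, 0), uses_scientific(v, 6))
```

In this model 400000 switches to 4e+05 at the default penalty of 0 because the fixed form is one character wider, and stays in fixed notation once the penalty is raised to 6.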

How to generate medoid plots

Hi, I am using the partitioning around medoids algorithm for clustering, via the pam function in the cluster package. I have 4 attributes in the dataset that I clustered, and they seem to give me around 6 clusters. I want to generate a plot of these clusters across those 4 attributes, like this [1]: http://www.flickr.com/photos/52099123#N06/7036003411/in/photostream/lightbox/ "Centroid plot"
But the only way I can draw the clustering result is either using a dendrogram or using
plot (data, col = result$clustering) command which seems to generate a plot similar to this
[2] : http://www.flickr.com/photos/52099123#N06/7036003777/in/photostream "pam results".
Although the first image is a centroid plot, I am wondering if there are any tools available in R to do the same with a medoid plot. Note that it also prints the size of each cluster in the plot. It would be great to know if there are any packages/solutions in R that make this easy, or if not, what would be a good starting point for producing plots similar to the one in image 1.
Thanks
Hi all, I tried to work out the problem the way Joran suggested, but I think I did not understand it correctly and have not done it the way it is supposed to be done. Anyway, this is what I have done so far. This is what the file I tried to cluster looks like:
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155
following is the Pam clustering output
GRMZM2G181227 1
GRMZM2G146885 2
GRMZM2G139463 2
GRMZM2G015295 2
GRMZM2G111909 2
GRMZM2G078097 3
GRMZM2G450498 3
GRMZM2G413652 2
GRMZM2G090087 2
AC217811.3_FG003 2
Using the above two files I generated a third file that somewhat looks like this and has cluster information in the form of cluster type K1,K2,etc
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip Cluster_type
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722 K1
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648 K2
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344 K2
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755 K2
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883 K2
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945 K3
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794 K3
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949 K2
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155 K2
I certainly don't think that this is the file that joran would have wanted me to create but I could not think of anything else thus I ran lattice on the above file using the following code.
clusres<- read.table("clusinput.txt",header=TRUE,sep="\t");
jpeg(filename = "clusplot.jpeg", width = 800, height = 1078,
pointsize = 12, quality = 100, bg = "white",res=100);
parallel(~clusres[2:5]|Cluster_type,clusres,horizontal.axis=FALSE);
dev.off();
and I get a picture like this
Since I want a single line representing the whole cluster at the four different points, this output is wrong. Moreover, I tried playing with lattice, but I cannot figure out how to make it accept the RPKM values as the X coordinate. It always seems to plot many lines against a maximum or minimum value on the Y axis, and I don't understand what that is.
It will be great if anybody can help me out. Sorry If my question still seems absurd to you.
I do not know of any pre-built functions that generate the plot you indicate, which looks to me like a sort of parallel coordinates plot.
But generating such a plot would be a fairly trivial exercise.
Add a column of cluster labels (K1,K2, etc.) to your original data set, based on your clustering algorithm's output.
Use one of the many, many tools in R for aggregating data (plyr, aggregate, etc.) to calculate the relevant summary statistics by cluster on each of the four variables. (You haven't said what the first graph is actually plotting. Mean and sd? Median and MAD?)
Since you want the plots split into six separate panels, or facets, you will probably want to plot the data using either ggplot or lattice, both of which provide excellent support for creating the same plot, split across a single grouping vector (i.e. the clusters in your case).
But that's about as specific as anyone can get, given that you've provided so little information (i.e. no minimal runnable example, as recommended here).
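The label-then-aggregate steps described above can be sketched as follows, using the nine labelled rows posted in the question. This is plain Python rather than R, and the choice of the median as the summary statistic is an assumption, since the question does not say what the target plot shows.

```python
from statistics import median

columns = ["RPKM-base", "RPKM-1cm", "RPKM+4cm", "RPKMtip"]
rows = [  # (cluster label, four RPKM values), copied from the question
    ("K1", 3.412444267, 3.16437442, 1.287909035, 0.037320722),
    ("K2", 14.17287135, 11.3577013, 2.778514642, 2.226818648),
    ("K2", 6.866752401, 5.373925806, 1.388843962, 1.062745344),
    ("K2", 1349.446347, 447.4635291, 29.43627879, 29.2643755),
    ("K2", 47.95903081, 27.5256729, 1.656555758, 0.949824883),
    ("K3", 4.433627458, 0.928492841, 0.063329249, 0.034255945),
    ("K3", 36.15941083, 9.45235616, 0.700105077, 0.194759794),
    ("K2", 25.06985426, 15.91342458, 5.372151214, 3.618914949),
    ("K2", 21.00891969, 18.02318412, 17.49531186, 10.74302155),
]

# Group the rows by cluster label.
by_cluster = {}
for label, *values in rows:
    by_cluster.setdefault(label, []).append(values)

# One summary profile per cluster: the median of each of the four columns.
profiles = {
    label: [median(col) for col in zip(*members)]
    for label, members in by_cluster.items()
}
for label in sorted(profiles):
    print(label, "n =", len(by_cluster[label]), dict(zip(columns, profiles[label])))
```

Each profile is a single line across the four conditions, and the group size is available for annotating each panel.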
How about using clusplot from package cluster with partitioning around medoids? Here is a simple example (from the example section):
require(cluster)
#generate 25 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
cbind(rnorm(15,5,0.5), rnorm(15,5,0.5)))
clusplot(pam(x, 2)) # `pam` does your partitioning

Resources