Plotting huge data files in R?

I have an input file with about 20 million lines; the file is about 1.2 GB. Is there any way I can plot the data in R? Some of the columns are categorical, but most are numeric.
I have tried my plotting script with a small subset of the input file (about 800K lines), but even though I have about 8 GB of RAM, I don't seem to be able to plot all the data. Is there any simple way to do this?

Without a clearer description of the kind of plot you want, it is hard to give concrete suggestions. However, in general there is no need to put 20 million points on a plot. For example, a time series could be represented by a spline fit, or by some kind of average, e.g. aggregating hourly data to daily averages. Alternatively, you could draw some subset of the data, e.g. only one point per day in the time-series example. So I think your challenge is not so much getting 20M points, or even 800K, onto a plot, but how to aggregate your data effectively so that it conveys the message you want to tell.
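For instance, a minimal sketch of the hourly-to-daily aggregation idea (the timestamp and value columns here are hypothetical stand-ins for your data):
## Hypothetical hourly series: one year of hourly observations
dat <- data.frame(
  timestamp = seq(as.POSIXct("2012-01-01", tz = "UTC"), by = "hour",
                  length.out = 24 * 365),
  value     = rnorm(24 * 365)
)
## Aggregate the hourly values to daily means: 365 rows instead of 8760
daily <- aggregate(dat$value, by = list(day = as.Date(dat$timestamp)), FUN = mean)
names(daily)[2] <- "mean_value"
plot(daily$day, daily$mean_value, type = "l", xlab = "day", ylab = "daily mean")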

The hexbin package, which plots hexagonal bins instead of scatterplots for pairs of variables (as suggested by Ben Bolker in "Speed up plot() function for large dataset"), worked fairly well for me for 2 million records with 4 GB of RAM. But it failed for 200 million records/rows of the same set of variables. I tried reducing the bin size to trade computation time against RAM usage, but it did not help.
For 20 million records, you can try hexbins with xbins = 20, 30, or 40 to start with.
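For instance, a minimal hexbin sketch (x and y are hypothetical stand-ins for two of your numeric columns; xbins controls the resolution versus memory trade-off):
library(hexbin)
x <- rnorm(2e6)
y <- x + rnorm(2e6)
bin <- hexbin(x, y, xbins = 30)   # try xbins = 20, 30, 40 as suggested above
plot(bin, main = "2 million points, hexagonally binned")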

Plotting directly into a raster file device (calling png(), for instance) is a lot faster. I tried plotting rnorm(100000): on my laptop an X11 cairo plot took 2.723 seconds, while the png device finished in 2.001 seconds. With 1 million points, the numbers are 27.095 and 19.954 seconds respectively.
I use Fedora Linux; here is the code:
# f() plots to a PNG (raster) file device; g() plots to the interactive screen device
f <- function(n) {
  x <- rnorm(n)
  y <- rnorm(n)
  png("test.png")
  plot(x, y)
  dev.off()
}
g <- function(n) {
  x <- rnorm(n)
  y <- rnorm(n)
  plot(x, y)
}
system.time(f(100000))
system.time(g(100000))

Increasing the memory limit with memory.limit() helped me ... this was for plotting nearly 36K records with ggplot.

Does expanding the available memory with memory.limit(size=2000) (or something bigger) help?

Related

Visualizing big-data xy regression plots in R (maybe contour histograms?)

I have 1 million x-y data points. 100,000 of them are from foo; 900,000 of them are from bar; and there are perhaps a few unusual mass points. I want to help my audience visualize them, and not merely the regression or loess lines but the data itself. I want to draw the bars in blue and the foos in red, and then my two loess lines on top of them. Think something like:
K <- 1000; M <- K*K; HT <- 100*K
x <- rnorm(M)
y <- x + rnorm(M)
y[1:HT] <- y[1:HT] + 1               # shift the foo points up by 1
x[HT:(HT*2)] <- y[HT:(HT*2)] <- 0    # put a mass of points at the origin
pdf(file = "try.pdf")
plot(x, y, col = "blue", pch = ".")
points(x[1:HT], y[1:HT], col = "red", pch = ".")
## scatter.smooth( x[1:HT], y[1:HT] ), but this seems to take forever
dev.off()
This is not only not a great visual (for example, the high-density mass point at zero is lost), but it also creates a 7.5 MB(!) PDF file, and my previewer almost chokes on it. (Hint: JPEG compression is pretty good for this problem; that is, instead of pdf(), just use jpeg() and a different file extension. Drawback: the axes become fuzzily compressed, too.)
So, I need some better ideas. I am thinking of a two-dimensional filled.contour plot of the full data set (in a gray scale not reaching too far towards black), with a plain contour overlay of the 1:HT points, and then the two loess overlays. Alas, even to do this, I need to start by smoothing the number of data points that appear at each x-y location, and presumably binning first is not the best way to do it, since that would throw away information the contour plot could use.
Alternatively, I could stay with the standard x-y plot and simply cull random points until the file is small enough and the visuals good enough. Perhaps this could also be done better via binning.
Better ideas?
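One hedged sketch of the "smooth the density first, then overlay the trend lines" idea, using smoothScatter() from base graphics (a binned 2-D density display) and lowess() as a fast stand-in for the loess overlays; the variable names follow the question's code, and the output file name try2.pdf is made up:
K <- 1000; M <- K*K; HT <- 100*K
x <- rnorm(M); y <- x + rnorm(M)
y[1:HT] <- y[1:HT] + 1
x[HT:(HT*2)] <- y[HT:(HT*2)] <- 0
pdf(file = "try2.pdf")
smoothScatter(x, y, nrpoints = 0,
              colramp = colorRampPalette(c("white", "grey30")))  # gray-scale density
lines(lowess(x, y), col = "blue")              # trend for the full data
lines(lowess(x[1:HT], y[1:HT]), col = "red")   # trend for the foo subset
dev.off()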

How to calculate a massive dissimilarity matrix in R

I am currently working on clustering some big data, about 30k rows; the dissimilarity matrix is just too big for R to handle. I think this is not purely a memory size problem. Maybe there is some smart way to do this?
If your data is so large that base R can't easily cope, then you have several options:
Work on a machine with more RAM.
Use a commercial product, e.g. Revolution Analytics, that supports working with larger data in R.
Here is an example using RevoScaleR, the commercial package by Revolution. I use the diamonds dataset, part of ggplot2, since it contains 53K rows, i.e. a bit larger than your data. The example doesn't make much analytic sense, since I naively convert factors into numerics, but it illustrates the computation on a laptop:
library(ggplot2)
library(RevoScaleR)
artificial <- as.data.frame(sapply(diamonds, as.numeric))
clusters <- rxKmeans(~carat + cut + color + clarity + price,
data=artificial, numClusters=6)
clusters$centers
This results in:
carat cut color clarity price
1 0.3873094 4.073170 3.294146 4.553910 932.6134
2 1.9338503 3.873151 4.285970 3.623935 16171.7006
3 1.0529018 3.655348 3.866056 3.135403 4897.1073
4 0.7298475 3.794888 3.486457 3.899821 2653.7674
5 1.2653675 3.879387 4.025984 4.065154 7777.0613
6 1.5808225 3.904489 4.066285 4.066285 11562.5788
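For scale, a quick back-of-the-envelope check (an editorial aside, not part of the answer above) of why the full dissimilarity matrix for 30k rows is hard to hold in memory:
n <- 30000
choose(n, 2) * 8 / 2^30   # lower triangle stored by dist(): ~3.4 GiB of doubles
n^2 * 8 / 2^30            # expanded to a full square matrix: ~6.7 GiB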

How to generate medoid plots

Hi, I am using the partitioning around medoids algorithm for clustering, via the pam function in the cluster package. I have 4 attributes in the dataset that I clustered, and they seem to give me around 6 clusters. I want to generate a plot of these clusters across those 4 attributes, like this [1]: http://www.flickr.com/photos/52099123#N06/7036003411/in/photostream/lightbox/ "Centroid plot"
But the only way I can draw the clustering result is either using a dendrogram or using the
plot(data, col = result$clustering) command, which seems to generate a plot similar to this
[2]: http://www.flickr.com/photos/52099123#N06/7036003777/in/photostream "pam results".
Although the first image is a centroid plot, I am wondering if there are any tools available in R to do the same with a medoid plot. Note that the first plot also prints the size of each cluster. It would be great to know if there are any packages/solutions available in R that facilitate this, or if not, what would be a good starting point in order to achieve plots similar to the one in Image 1.
Thanks
Hi all, I was trying to work out the problem the way joran suggested, but I think I did not understand it correctly and have not done it the way it is supposed to be done. Anyway, this is what I have done so far. The following is what the file I tried to cluster looks like:
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155
The following is the pam clustering output:
geneID             cluster
GRMZM2G181227      1
GRMZM2G146885      2
GRMZM2G139463      2
GRMZM2G015295      2
GRMZM2G111909      2
GRMZM2G078097      3
GRMZM2G450498      3
GRMZM2G413652      2
GRMZM2G090087      2
AC217811.3_FG003   2
Using the above two files, I generated a third file that looks somewhat like this and carries the cluster information in the form of a cluster type (K1, K2, etc.):
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip Cluster_type
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722 K1
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648 K2
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344 K2
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755 K2
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883 K2
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945 K3
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794 K3
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949 K2
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155 K2
I certainly don't think this is the file that joran wanted me to create, but I could not think of anything else, so I ran lattice on the above file using the following code:
clusres <- read.table("clusinput.txt", header = TRUE, sep = "\t")
jpeg(filename = "clusplot.jpeg", width = 800, height = 1078,
     pointsize = 12, quality = 100, bg = "white", res = 100)
parallel(~clusres[2:5] | Cluster_type, clusres, horizontal.axis = FALSE)
dev.off()
and I get a picture like this.
Since I want one single line as the representative of the whole cluster across the four attributes, this output is wrong. Moreover, I tried playing with lattice, but I cannot figure out how to make it accept the RPKM values as the x coordinate; it always seems to plot many lines against an axis that just runs from Min to Max, which I don't understand.
It would be great if anybody could help me out. Sorry if my question still seems absurd to you.
I do not know of any pre-built functions that generate the plot you indicate, which looks to me like a sort of parallel coordinates plot.
But generating such a plot would be a fairly trivial exercise.
Add a column of cluster labels (K1,K2, etc.) to your original data set, based on your clustering algorithm's output.
Use one of the many, many tools in R for aggregating data (plyr, aggregate, etc.) to calculate the relevant summary statistics by cluster on each of the four variables. (You haven't said what the first graph is actually plotting. Mean and sd? Median and MAD?)
Since you want the plots split into six separate panels, or facets, you will probably want to plot the data using either ggplot or lattice, both of which provide excellent support for creating the same plot, split across a single grouping vector (i.e. the clusters in your case).
But that's about as specific as anyone can get, given that you've provided so little information (i.e. no minimal runnable example, as recommended here).
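A hedged sketch of those steps (not joran's actual code), assuming the cluster means are the summary statistic of interest and reusing the asker's clusinput.txt, which already contains the Cluster_type column from step 1:
library(lattice)
clusres <- read.table("clusinput.txt", header = TRUE, sep = "\t")
## mean of each of the four RPKM columns within each cluster
centers <- aggregate(clusres[, 2:5],
                     by = list(Cluster_type = clusres$Cluster_type),
                     FUN = mean)
## reshape to long format by hand: one row per cluster/stage combination
long <- data.frame(
  Cluster_type = rep(centers$Cluster_type, times = 4),
  stage        = factor(rep(names(centers)[2:5], each = nrow(centers)),
                        levels = names(centers)[2:5]),
  mean_RPKM    = unlist(centers[, 2:5], use.names = FALSE)
)
## one panel per cluster, one line per cluster across the four attributes
xyplot(mean_RPKM ~ stage | Cluster_type, data = long, type = "b",
       scales = list(x = list(rot = 45)))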
How about using clusplot from package cluster with partitioning around medoids? Here is a simple example (from the example section):
require(cluster)
# generate 25 objects, divided into 2 clusters
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))
clusplot(pam(x, 2))  # `pam` does your partitioning

Difference between two density plots

Is there a simple way to plot the difference between two probability density functions?
I can plot the pdfs of my data sets (both are one-dimensional vectors with roughly 11,000 values) on the same plot to get an idea of the overlap/difference, but it would be more useful to me if I could see a plot of the difference itself.
Something along the lines of the following (though this obviously doesn't work):
> plot(density(data1)-density(data2))
I'm relatively new to R and have been unable to find what I'm looking for on any of the forums.
Thanks in advance
This should work:
rng <- range(c(data1, data2))
d1 <- density(data1, from = rng[1], to = rng[2])
d2 <- density(data2, from = rng[1], to = rng[2])
plot(x = d1$x, y = d1$y - d2$y)
The trick is to make sure the densities are estimated over the same limits; then you can take their difference at the same locations. My understanding of the need for identical limits comes from having made the error of not taking that step when answering a similar question on R-help several years ago. Too bad I couldn't remember the right arguments.
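For example, with simulated data standing in for data1 and data2 (the means here are made up), drawing the difference as a line and marking zero:
set.seed(1)
data1 <- rnorm(11000, mean = 0)
data2 <- rnorm(11000, mean = 0.5)
rng <- range(c(data1, data2))
d1 <- density(data1, from = rng[1], to = rng[2])
d2 <- density(data2, from = rng[1], to = rng[2])
plot(d1$x, d1$y - d2$y, type = "l",
     xlab = "x", ylab = "difference in density")
abline(h = 0, lty = 2)   # the difference crosses zero where the densities are equal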
It looks like you need to spend a little time learning how to use R (or any other language, for that matter). Help files are your friend.
From the output of ?density:
Value [i.e. the data returned by the function]
If give.Rkern is true, the number R(K), otherwise an object with class
"density" whose underlying structure is a list containing the
following components.
x: the n coordinates of the points where the density is estimated.
y: the estimated density values. These will be non-negative, but can
be zero. [remainder of "Value" deleted for brevity]
So, do:
foo <- density(data1)
bar <- density(data2)
plot(foo$y - bar$y)

R + ggplot2 - Cannot allocate vector of size 128.0 Mb

I have a 4.5 MB file (9,223,136 lines) with the following information:
0 0
0.0147938 3.67598e-07
0.0226194 7.35196e-07
0.0283794 1.10279e-06
0.033576 1.47039e-06
0.0383903 1.83799e-06
0.0424806 2.20559e-06
0.0465545 2.57319e-06
0.0499759 2.94079e-06
Each column contains a value from 0 to 100, representing a percentage. My goal is to draw a graph in ggplot2 to check the percentages against each other (e.g. at 20% in column 1, what percentage is reached in column 2). Here is my R script:
library(ggplot2)
dataset=read.table("~/R/datasets/cumul.txt.gz")
p <- ggplot(dataset,aes(V2,V1))
p <- p + geom_line()
p <- p + scale_x_continuous(formatter="percent") + scale_y_continuous(formatter="percent")
p <- p + theme_bw()
ggsave("~/R/grafs/cumul.png")
I'm having a problem: every time I run this, R runs out of memory, giving the error "Cannot allocate vector of size 128.0 Mb". I'm running 32-bit R on a Linux machine and I have about 4 GB of free memory.
I thought of a workaround that consists of reducing the precision of these values (by rounding them) and eliminating duplicate lines, so that I have fewer lines in the dataset. Could you give me some advice on how to do this?
Are you sure you have 9 million lines in a 4.5 MB file (edit: perhaps your file is 4.5 GB??)? It must be heavily compressed; when I create a file that is one tenth the size, it's 115 MB ...
n <- 9e5
set.seed(1001)
z <- rnorm(9e5)
z <- cumsum(z)/sum(z)
d <- data.frame(V1=seq(0,1,length=n),V2=z)
ff <- gzfile("lgfile2.gz", "w")
write.table(d,row.names=FALSE,col.names=FALSE,file=ff)
close(ff)
file.info("lgfile2.gz")["size"]
It's hard to tell from the information you've given what kind of "duplicate lines" you have in your data set ... unique(dataset) will extract just the unique rows, but that may not be useful. I would probably start by simply thinning the data set by a factor of 100 or 1000:
smdata <- dataset[seq(1,nrow(dataset),by=1000),]
and see how it goes from there. (edit: forgot a comma!)
Graphical representations of large data sets are often a challenge. In general you will be better off doing one or more of the following (a small sketch combining these ideas follows the list):
summarizing the data somehow before plotting it
using a specialized graphical type (density plots, contours, hexagonal binning) that reduces the data
using base graphics, which uses a "draw and forget" model (unless graphics recording is turned on, e.g. in Windows), rather than lattice/ggplot/grid graphics, which save a complete graphical object and then render it
using raster or bitmap graphics (PNG etc.), which only record the state of each pixel in the image, rather than vector graphics, which save all objects whether they overlap or not
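A small sketch combining these suggestions for the asker's data: thin the data first, then draw with base graphics straight into a PNG (raster) device. The V1/V2 column names and the file paths follow the asker's script and are assumed to exist.
dataset <- read.table("~/R/datasets/cumul.txt.gz")
smdata <- dataset[seq(1, nrow(dataset), by = 1000), ]   # keep every 1000th row
png("~/R/grafs/cumul.png", width = 800, height = 600)
plot(smdata$V2, smdata$V1, type = "l",
     xlab = "column 2 (%)", ylab = "column 1 (%)")
dev.off()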
