I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance, as I have no formal training in Computer Science and am entirely self-taught.
Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:
Read and process the data into a data frame
Basic descriptive analysis, including text mining (frequent terms, etc.)
Plotting
Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R?
Simply put, any tips or pointers you can provide will be greatly appreciated. And I won't take offense if you describe solutions at a third-grade level.
Thanks in advance.
If you need to operate on the entire 10 GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based, computer.
(The Twitter streaming API returns a pretty rich object: a single 140-character tweet can weigh a couple of kilobytes. You might reduce memory overhead by preprocessing the data outside of R to extract only the content you need, such as author name and tweet text.)
On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc -- you could consider using Hadoop to drive R.
Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
A couple of pointers:
An example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.
You can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
read.table function remains the main data import function in R. This
function is memory inefficient and, according to some estimates, it
requires three times as much memory as the size of a dataset in order
to read it into R.
The reason for such inefficiency is that R stores data.frames in memory as columns (a data.frame is no more than a list of equal-length vectors), whereas text files consist of rows of records. Therefore, R's read.table needs to read whole lines, process them individually by breaking them into tokens, and then transpose these tokens into column-oriented data structures.
The ColByCol approach is memory efficient. Using Java code, it reads the input text file and splits it into several text files, each holding an individual column of the original dataset. These files are then read individually into R, avoiding R's memory bottleneck.
The approach works best for big files divided into many columns, especially when these columns can be transformed into memory-efficient types and data structures: R's representation of numbers (in some cases), and character vectors with repeated levels stored as factors, occupy much less space than their plain character representation.
Package ColByCol has been successfully used to read multi-GB datasets
on a 2GB laptop.
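Setting the package aside, here is a minimal base-R sketch of the same column-at-a-time idea, using read.table's colClasses argument ("NULL" makes a column be skipped entirely); the file name, delimiter and column positions below are made up for illustration:

## Read a wide delimited file one column at a time, keeping only what you need.
file   <- "tweets.txt"   # hypothetical tab-delimited export
n_cols <- 50             # total number of columns (assumed known)
wanted <- c(2, 7)        # positions of the columns to keep
cols <- lapply(wanted, function(j) {
  classes <- rep("NULL", n_cols)   # "NULL" = skip column, never stored in memory
  classes[j] <- NA                 # NA = let R guess this column's type
  read.table(file, sep = "\t", colClasses = classes,
             stringsAsFactors = FALSE)[[1]]
})
names(cols) <- paste0("V", wanted)
dat <- as.data.frame(cols)

Each pass re-parses the file but only stores one column, so peak memory stays low; that is essentially the trade-off ColByCol makes with its Java preprocessing step.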
10 GB of JSON is rather inefficient for storage and analytical purposes. You can use RJSONIO to read it in efficiently. Then, I'd create a memory-mapped file. You can use bigmemory (my favorite) to create different types of matrices (character, numeric, etc.), or store everything in one location, e.g. using HDF5 or SQL-esque options (e.g. see RSQLite).
What will be more interesting is the number of rows of data and the number of columns.
As for other infrastructure, e.g. EC2, that's useful, but preparing a 10 GB memory-mapped file doesn't really require much infrastructure. I suspect you're working with just a few tens of millions of rows and a few columns (beyond the actual text of the tweets). This is easily handled on a laptop with efficient use of memory-mapped files. Doing complex statistics will require more hardware, cleverer use of familiar packages, and/or experimenting with some unfamiliar packages. I'd recommend following up with a more specific question when you reach that stage. The first stage of such work is simply data normalization, storage and retrieval. My answer for that is simple: memory-mapped files.
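As a rough sketch of the memory-mapped route with bigmemory (the file names and dimensions below are invented for illustration):

library(bigmemory)

## Create a file-backed (memory-mapped) numeric matrix on disk; only the
## parts you index are pulled into RAM.
x <- filebacked.big.matrix(nrow = 3e7, ncol = 4, type = "double",
                           backingfile = "tweets.bin",
                           descriptorfile = "tweets.desc")

## Later sessions (or other R processes) can attach to the same backing file
## without re-reading anything.
y <- attach.big.matrix("tweets.desc")
y[1:5, ]   # indexed like an ordinary matrix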
To read chunks of the JSON file in, you can use the scan() function. Take a look at the skip and nlines arguments. I'm not sure how much performance you'll get versus using a database.
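For example, a rough sketch of chunked reading with scan() (the chunk size and file name are arbitrary):

chunk_size <- 10000
i <- 0
repeat {
  lines <- scan("tweets.json", what = character(), sep = "\n", quote = "",
                skip = i * chunk_size, nlines = chunk_size, quiet = TRUE)
  if (length(lines) == 0) break
  ## ... parse/summarise this chunk, e.g. with RJSONIO::fromJSON() per line ...
  i <- i + 1
}

Note that each call re-skips from the start of the file, so for very large files a readLines() loop over an open connection will scale better.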
Related
Can I read a binary file written by C++ in R?
I have been using Rcpp in my R package, and the simulations typically generate a large amount of data. I am planning to write the output to binary files in C++ and then read those back into R. This works if I write text files, but I haven't found a solution for binary files. The program sometimes crashes abruptly if I pass data using many NumericVectors (I have yet to fully understand memory management in Rcpp).
Can this approach let me share larger datasets between C++ and R than is possible by passing vectors? In C++, the maximum vector size is limited by RAM and the address space (maybe?), but I think R is able to load larger vectors using swap. Am I correct, or am I misunderstanding the concepts?
Yes you can. But it's "complicated".
You are embarking on a topic called binary serialization. There is a lot of work out there. In essence you are somewhere on the continuum between
minimal: open a file, write out N binary items; then on the other side read N binary items back. We did something similar at work years ago, writing a small metadata header of <rows,cols,version> and then a binary blob of rows * cols doubles to attach to a matrix (see the R-side sketch at the end of this answer)
maximal: use a fully descriptive meta-language like Protocol Buffers or MessagePack to describe the binary content, write it in C++ (using the appropriate library) and read it back in R (using the corresponding packages -- I am involved with one for each: RProtoBuf and RcppMsgPack).
And there is a lot in between. If you really only need to communicate between C(++) and R, you could try the RData / rds format. There is one library, librdata, which I have experimented with (and filed some bug reports and pull requests against). I might start there.
So in short: do some research, figure out what to do and then do it :)
PS If you call C++ via Rcpp from R then you may not need files at all. We can pass large objects back and forth -- the limit may be your RAM.
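To make the "minimal" end of that continuum concrete on the R side, here is a hedged sketch using base R's writeBin()/readBin(); the <rows, cols> header layout mirrors the description above, a C++ writer would simply emit the same byte layout, and the file and helper names are mine:

## Write: two 32-bit ints (rows, cols) followed by the matrix as doubles,
## column-major, which matches R's internal storage order.
write_minimal <- function(m, path) {
  con <- file(path, "wb")
  on.exit(close(con))
  writeBin(c(nrow(m), ncol(m)), con, size = 4L)
  writeBin(as.vector(m), con, size = 8L)
}

## Read the header, then the payload, and restore the dimensions.
read_minimal <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  dims <- readBin(con, what = "integer", n = 2L, size = 4L)
  vals <- readBin(con, what = "double", n = prod(dims), size = 8L)
  matrix(vals, nrow = dims[1], ncol = dims[2])
}

m <- matrix(rnorm(6), nrow = 2)
write_minimal(m, "blob.bin")
all.equal(m, read_minimal("blob.bin"))   # TRUE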
I am working with raw imaging mass spectrometry data. This kind of data is very similar to a traditional image file, except that rather than 3 colour channels, we have channels corresponding to the number of ions we are measuring (in my case, 300). The data is originally stored in a proprietary format, but can be exported to a .txt file as a table with the format:
x, y, z, i (intensity), m (mass)
As you can imagine, the files can be huge. A typical image might be 256 x 256 x 20, giving 1310720 pixels. If each has 300 mass channels, this gives a table with 393216000 rows and 5 columns. This is huge! And consequently won't fit into memory. Even if I select smaller subsets of the data (such as a single mass), the files are very slow to work with. By comparison, the proprietary software is able to load up and work with these files extremely quickly, for example just taking a second or two to open up a file into memory.
I hope I have made myself clear. Can anyone explain this? How can it be that two files containing essentially the exact same data can have such different sizes and speeds? How can I work with a matrix of image data much faster?
Can anyone explain this?
Yep
How can it be that two files containing essentially the exact same data can have such different sizes and speeds?
R uses doubles as its default numeric type. Thus, just storing your data frame takes about 16 GB. The proprietary software is most likely using float as the underlying type, cutting the memory requirement to 8 GB.
How can I work with a matrix of image data much faster?
Buy a computer with 32 GB of RAM. Even with a 32 GB computer, think about using data.table in R with operations done by reference (sketched below), because R likes to copy data frames.
Or you might want to move to Python/pandas for processing, with explicit use of dtype=float32.
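A small sketch of what "operations done by reference" buys you in data.table (toy columns standing in for the x, y, z, intensity, mass table):

library(data.table)

dt <- data.table(x = rep(1:256, each = 4), y = 1L, z = 1L,
                 i = runif(1024), m = runif(1024, 100, 400))

## `:=` adds or modifies a column in place -- no copy of the whole table,
## which is what matters when the table is tens of gigabytes.
dt[, i_norm := i / max(i)]

## Grouped summaries also avoid copying the full table.
dt[, .(total_intensity = sum(i)), by = .(x, y)]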
UPDATE
If you want to stay with R, take a look at the bigmemory package, though I would say dealing with it is not for the faint of heart.
The answer to this question turned out to be a little esoteric and pretty specific to my data-set, but may be of interest to others. My data is very sparse - i.e. most of the values in my matrix are zero. Therefore, I was able to significantly reduce the size of my data using the Matrix package (capitalisation important), which is designed to more efficiently handle sparse matrices. To implement the package, I just inserted the line:
data <- Matrix(data)
The amount of space saved will vary depending on the sparsity of the dataset, but in my case I reduced 1.8 GB to 156 MB. A Matrix behaves just like a matrix, so there was no need to change my other code, and there was no noticeable change in speed. Sparsity is obviously something that the proprietary format could take advantage of.
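For a feel of the saving, here is a toy comparison (the sparsity level is arbitrary):

library(Matrix)

dense <- matrix(0, nrow = 1000, ncol = 1000)
dense[sample(length(dense), 500)] <- runif(500)   # only 0.05% non-zero

sparse <- Matrix(dense)   # chooses a sparse representation automatically

object.size(dense)    # ~8 MB: every zero stored as a double
object.size(sparse)   # a few KB: only the non-zero entries are stored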
I have survey data in SPSS and Stata which is ~730 MB in size. Each of these programs also occupies approximately the amount of space you would expect (~800 MB) in memory when I'm working with that data.
I've been trying to pick up R, and so attempted to load this data into R. No matter what method I try (read.dta from the Stata file, fread from a CSV file, read.spss from the SPSS file), the R object (measured using object.size()) is between 2.6 and 3.1 GB in size. If I save the object to an R file, that file is less than 100 MB, but on loading it the object is the same size as before.
Any attempts to analyse the data using the survey package, particularly if I try to subset the data, take significantly longer than the equivalent command in Stata.
e.g. I have a household-size variable 'hhpers' in my data 'hh', weighted by the variable 'hhwt' and subset by 'htype'.
R code :
require(survey)
sv.design <- svydesign(ids = ~0, data = hh, weights = hh$hhwt)
rm(hh)
system.time(svymean(~hhpers, sv.design[which(sv.design$variables$htype == "rural"), ]))
pushes the memory used by R up to 6 GB and takes a very long time:
user system elapsed
3.70 1.75 144.11
The equivalent operation in stata
svy: mean hhpers if htype == 1
completes almost instantaneously, giving me the same result.
Why is there such a massive difference in both memory usage (by the object as well as by the function) and time taken between R and Stata?
Is there anything I can do to optimise the data and how R is working with it?
ETA: My machine is running 64-bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata.
After some digging, I expect the reason for this is R's limited number of data types. All my data is stored as int, which takes 4 bytes per element. In survey data, each response is categorically coded and typically requires only one byte to store; Stata stores this using its 'byte' data type, whereas R stores it using the 4-byte 'int' type, leading to significant inefficiency in large surveys.
Regarding the difference in memory usage: you're on the right track, and (mostly) it's because of object types. Storing everything as 4-byte integers does take up a lot of memory, so setting variable types properly will improve R's memory usage. as.factor() would help; see ?as.factor for details on converting columns after reading the data. To fix this while reading the data from the file, refer to the colClasses parameter of read.table() (and the similar parameters of the functions specific to Stata and SPSS formats). This will help R store the data more efficiently (its on-the-fly guessing of types is not top-notch).
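A hedged sketch of both approaches, reusing the variable names from the question (the file name is a placeholder, and I'm assuming read.table's support for a named colClasses vector):

## Declare types up front so read.table neither guesses nor over-allocates;
## columns not named here are still type-guessed as usual.
hh <- read.table("hh.csv", header = TRUE, sep = ",",
                 colClasses = c(htype = "factor", hhpers = "integer",
                                hhwt  = "numeric"))

## Or convert after the fact and check the effect:
hh$htype <- as.factor(hh$htype)
object.size(hh)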
Regarding the second part, calculation speed: parsing large datasets is not base R's strong point, and that's where the data.table package comes in handy. It's fast and quite similar to the original data.frame behaviour, and summary calculations are really quick. You would use it via hh <- as.data.table(read.table(...)), and you can calculate something similar to your example with:
hh <- as.data.table(hh)
hh[htype == "rural", weighted.mean(hhpers, hhwt)]
## or
hh[, weighted.mean(hhpers, hhwt), by = htype]  # note 'empty' first argument
Sorry, I'm not familiar with survey data studies, so I can't be more specific.
Another detail on memory usage by the function: most likely, R made a copy of your entire dataset to calculate the summaries you were looking for. Again, data.table would help here by preventing R from making excessive copies, improving memory usage.
The memisc package may also be of interest; for me it resulted in much smaller objects than read.spss produced (I was, however, working at a smaller scale than you).
From the memisc vignette
... Thus this package provides facilities to load such subsets of variables, without the need to load a complete data set. Further, the loading of data from SPSS files is organized in such a way that all information about variable labels, value labels, and user-defined missing values is retained. This is made possible by the definition of importer objects, for which a subset method exists. importer objects contain only the information about the variables in the external data set but not the data. The data itself is loaded into memory when the functions subset or as.data.set are used.
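A hedged sketch of that workflow, reusing the variable names from the question (the file name is a placeholder, and I'm assuming the spss.system.file() importer and subset() method described in the vignette):

library(memisc)

## The importer reads only the metadata; no survey data is loaded yet.
imp <- spss.system.file("survey.sav")

## Pull just the variables you need into memory
## (as.data.set(imp) would load the complete data set instead).
ds <- subset(imp, select = c(hhpers, hhwt, htype))
hh <- as.data.frame(ds)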
I need to use R to open an Excel file, which can have 1000-10000 rows and 5000-20000 columns. I would like to know whether there is any restriction on the size of this kind of Excel file in R.
Generally speaking, your limitation in using R will be how well the data set fits in memory, rather than specific limits on the size or dimension of a data set. The closer you are to filling up your available RAM (including everything else you're doing on your computer) the more likely you are to run into problems.
But keep in mind that having enough RAM to simply load the data set into memory is often a very different thing than having enough RAM to manipulate the data set, which by the very nature of R will often involve a lot of copying of objects. This in turn has led to a whole collection of specialized R packages that allow for the manipulation of data in R with minimal (or zero) copying.
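You can watch that copying happen with base R's tracemem(), which reports whenever an object is duplicated (enabled in the standard CRAN builds):

x <- data.frame(a = rnorm(1e6), b = rnorm(1e6))
tracemem(x)      # prints a message each time x is copied

x$a <- x$a * 2   # even this simple replacement triggers a copy of x
untracemem(x)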
The most I can say about your specific situation, given the very limited amount of information you've provided, is that it seems likely your data will not exceed your physical RAM constraints, but it will be large enough that you will need to take some care to write smart code, as many naive approaches may end up being quite slow.
I do not see any barrier to this on the R side. Looks like a fairly modestly sized dataset. It could possibly depend on "how" you do this, but you have not described any code, so that remains an unknown.
The above answers correctly discuss the memory issue. I have been recently importing some large excel files too. I highly recommend trying out the XLConnect package to read in (and write) files.
options(java.parameters = "-Xmx1024m") # Increase the available memory for JVM to 1GB or more.
# This option should be always set before loading the XLConnect package.
library(XLConnect)
wb.read <- loadWorkbook("path.to.file")
data <- readWorksheet(wb.read, sheet = "sheet.name")
So I've been trying to read this particular .mat file into R. I don't know much about MATLAB, but I know enough to realise that the R.matlab package can only read uncompressed data into R, and that to save the data uncompressed I need to do so in MATLAB using
save new.mat -v6.
Okay, so I did that, but when I used readMat("new.mat") in R, it just got stuck loading that forever. I also tried using package hdf5 via:
> hdf5load("new.mat", load=FALSE)->g
Error in hdf5load("new.mat", load = FALSE) :
can't handle hdf type 201331051
I'm not sure what this problem could be, but if anyone wants to try to figure this out the file is located at http://dibernardo.tigem.it/MANTRA/MANTRA_online/Matlab_Code%26Data.html and is called inventory.mat (the first file).
Thanks for your help!
This particular file has one object, inventory, which is a struct object with a lot of different things inside of it. Some are cell arrays, others are vectors of doubles or logicals, and a couple are matrices of doubles. It looks like R.matlab does not like cell arrays within structs, but I'm not sure exactly what is causing R to fail to load this. For reasons like this, I'd generally recommend avoiding mapping structs in MATLAB to objects in R. A struct is similar to a list, and this one can be transformed into a list, but it's not always a good idea.
I recommend creating a new file for each object, e.g. ids = inventory.instance_ids, and saving each object either to a separate .mat file, or saving all of them, except for the inventory object, into one file. Even better is to go to text, e.g. via csvwrite, so that you can see what's being created.
I realize that's working around the use of a MATLAB-to-R reader, but having things in a common, universal format is much more useful for reproducibility than acquiring a bunch of different readers for a proprietary format.
Alternatively, you can pass objects in memory via R.matlab, or this set of functions + the R/DCOM interface (on Windows).
Although this doesn't address how to use R.matlab, I've done a lot of transferring of data between R and MATLAB, in both directions, and I find that it's best to avoid .mat files (and, similarly, .rdat files). I like to pass objects either in memory, so that I can inspect them on each side, or via standard text files. Dealing with application-specific file formats, especially those that change quite a bit and are inefficient (I'm looking at you, MathWorks), is not a good use of time. I appreciate the folks who work on readers, but having a lot more control over the data structures used in the target language is very much worth the space overhead of a simple output file format. In-memory data transfer is very nice because you can interface the programs directly, but that may be a distraction if your only goal is to move data.
Have you run the examples in http://cran.r-project.org/web/packages/R.matlab/R.matlab.pdf on pages 22 to 24? That will test your ability to read from versions 4 and 5. I'm not sure that R cannot read compressed files. There is an Rcompression package in Omegahat.