I have data in a normalised, tidy "long" data structure that I want to upload to H2O and, if possible, analyse on a single machine (or reach a definitive finding that I need more hardware and software than is currently available). The data is large but not enormous; perhaps 70 million rows of 3 columns in its efficient normalised form, and 300k by 80k when it has been cast into a sparse matrix (the large majority of cells being zeroes).
The analytical tools in H2O need it to be in the latter, wide, format. Part of the overall motivation is seeing where the limits of various hardware setups are when analysing such data, but at the moment I'm struggling just to get the data into an H2O cluster (on a machine where R can hold it all in RAM), so I can't make any judgments about size limits for analysis.
The trial data are like the below, where the three columns are "documentID", "wordID" and "count":
1 61 2
1 76 1
1 89 1
1 211 1
1 296 1
1 335 1
1 404 1
Not that it matters (this isn't even a real-life dataset for me, just a test set), but this test data is from https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz (caution, large download).
To analyse, I need it in a matrix with a row for each documentID, a column for each wordID, and the cells as the counts (number of that word in that document). In R (for example), this can be done with tidyr::spread, or, since in this case the dense data frame created by spread would be too large, tidytext::cast_sparse, which works fine with data of this size so long as I am happy for the data to stay in R.
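For concreteness, here is a minimal sketch of that cast step in R, assuming the long data sits in a data frame called docword with the three column names above (the data frame name is illustrative):
library(tidytext)

# 'docword' is the long data frame with columns documentID, wordID, count
m <- cast_sparse(docword, documentID, wordID, count)
dim(m)      # roughly 300,000 x 80,000
class(m)    # a sparse "dgCMatrix" from the Matrix package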
Now, the very latest version of H2O (available from h2o.ai but not yet on CRAN) has a version of the R function as.h2o that understands sparse matrices. This works well with smaller but still non-trivial data (e.g. in test cases of 3,500 rows x 7,000 columns it imports a sparse matrix in 3 seconds, where the dense version takes 22 seconds), but when it gets my 300,000 x 80,000 sparse matrix it crashes with this error message:
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
As far as I can tell there are two ways forward:
1. upload a long, tidy, efficient form of the data into H2O and do the reshaping "spread" operation in H2O.
2. do the data reshaping in R (or any other language), save the resulting sparse matrix to disk in a sparse format, and upload from there into H2O.
As far as I can tell, H2O doesn't have the functionality to do #1, i.e. the equivalent of tidytext::cast_sparse or tidyr::spread in R. Its data munging capabilities look to be very limited. But maybe I've missed something? So my first (not very optimistic) question is: can (and how can) H2O "cast" or "spread" data from long to wide format?
Option #2 becomes the same as this older question, for which the accepted answer was to save the data in SVMlight format. However, it's not clear to me how to do this efficiently, and it's not clear that SVMlight format makes sense for data that is not intended to be modelled with a support vector machine (for example, the data might be just for an unsupervised learning problem). It would be much more convenient if I could save my sparse matrix in MatrixMarket format, which is supported by the Matrix package in R but, as far as I can tell, isn't supported by H2O. The MatrixMarket format looks very similar to my original long data: it's basically a space-delimited file of rowno colno cellvalue triplets (with a two-line header).
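(For what it's worth, writing that format from R is a one-liner with the Matrix package; here m is the sparse matrix from cast_sparse above.)
library(Matrix)
writeMM(m, "docword.mtx")   # MatrixMarket coordinate format, which H2O can't currently parse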
I think #2 is your best bet right now, since we don't currently have a function to do that in H2O. I think this would be a useful utility, so I have created a JIRA ticket for it here. I don't know when it will get worked on, so I'd still suggest coding up #2 for the time being.
The SVMLight/LIBSVM format was originally developed for a particular SVM implementation (as the name suggests), but it's generic and not at all specific to SVM. If you don't have labeled data, then you can fill in a dummy value where it expects a label.
To export an R data.frame in this format, you can use this package and there is more info here. You might be able to find better packages for this by searching "svmlight" or "libsvm" on http://rdocumentation.org.
You can then read in the sparse file directly into H2O using the h2o.importFile() function with parse_type = "SVMLight".
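Not an official utility, but as a sketch of what #2 could look like end to end: write_svmlight below is a hand-rolled helper (not part of any package), and the file path is illustrative.
library(Matrix)
library(h2o)

# hand-rolled SVMLight writer for a sparse matrix: one line per row, in the form
# "label col:value col:value ...", with a dummy label of 0 since there is no response
write_svmlight <- function(m, path, labels = rep(0, nrow(m))) {
  m <- as(m, "TsparseMatrix")                      # triplet form; @i and @j are 0-based
  pairs <- split(paste0(m@j + 1, ":", m@x),        # 1-based column:value pairs
                 factor(m@i + 1, levels = seq_len(nrow(m))))
  lines <- vapply(seq_along(pairs),
                  function(r) paste(c(labels[r], pairs[[r]]), collapse = " "),
                  character(1))
  writeLines(lines, path)
}

write_svmlight(m, "docword.svmlight")   # m is the sparse matrix from cast_sparse above
h2o.init()
hf <- h2o.importFile("docword.svmlight", parse_type = "SVMLight")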
Related
I have a very large dataset in SAS (> 6 million rows). I'm trying to read it into R. For this purpose, I'm using "read_sas" from the "haven" library in R.
However, due to its extremely large size, I'd like to split the data into subsets (e.g., 12 subsets each having 500,000 rows) and then read each subset into R. I was wondering if there is any possible way to address this issue. Any input is highly appreciated!
Is there any way you can split the data with SAS beforehand ... ?
read_sas has skip and n_max arguments, so if your increment size is N=5e5 you should be able to set an index i to read in the ith chunk of data using read_sas(..., skip=(i-1)*N, n_max=N). (There will presumably be some performance penalty to skipping rows, but I don't know how bad it will be.)
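A sketch of that loop, assuming a file called bigdata.sas7bdat and roughly 6 million rows (the file name and chunk count are illustrative):
library(haven)

N <- 5e5                                  # rows per chunk
chunks <- lapply(1:12, function(i) {
  read_sas("bigdata.sas7bdat", skip = (i - 1) * N, n_max = N)
})
# process each chunk separately, or combine if memory allows:
# full <- do.call(rbind, chunks)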
I have a data set with 10 million rows and 1,000 variables, and I want to best fit those variables so I can estimate a new row's value. I am using Jama's QR decomposition to do it (better suggestions welcome, but I think this question applies to any implementation). Unfortunately that takes too long.
It appears I have two choices. Either I can split the data into, say, 1,000 chunks of 10,000 rows each and then average the results, or I can add up every, say, 100 rows and feed those combined rows into the QR decomposition.
One or both ways may be mathematical disasters, and I'm hoping someone can point me in the right direction.
For such big datasets I'd have to say you need to use HDF5 (Hierarchical Data Format v5). It has a C/C++ implementation API and bindings for many other languages, and it uses B-trees to index datasets.
HDF5 is supported by Java, MATLAB, Scilab, Octave, Mathematica, IDL, Python, R, and Julia.
Unfortunately I don't know much more than this about it. However, I'd suggest you begin your research with a simple exploratory internet search!
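If you end up doing the chunking in R, a minimal sketch with the Bioconductor rhdf5 package (one of several HDF5 bindings; the file and dataset names are illustrative) looks like this:
library(rhdf5)

h5createFile("chunks.h5")
h5write(as.matrix(chunk1), "chunks.h5", "chunk1")   # write one chunk of the data to disk
chunk1_back <- h5read("chunks.h5", "chunk1")        # read just that chunk back later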
I am working with raw imaging mass spectrometry data. This kind of data is very similar to a traditional image file, except that rather than 3 colour channels, we have one channel for each ion we are measuring (in my case, 300). The data is originally stored in a proprietary format, but can be exported to a .txt file as a table with the format:
x, y, z, i (intensity), m (mass)
As you can imagine, the files can be huge. A typical image might be 256 x 256 x 20, giving 1,310,720 pixels. If each has 300 mass channels, this gives a table with 393,216,000 rows and 5 columns. This is huge, and consequently won't fit into memory. Even if I select smaller subsets of the data (such as a single mass), the files are very slow to work with. By comparison, the proprietary software is able to load up and work with these files extremely quickly, for example taking just a second or two to open up a file into memory.
I hope I have made myself clear. Can anyone explain this? How can it be that two files containing essentially the exact same data can have such different sizes and speeds? How can I work with a matrix of image data much faster?
Can anyone explain this?
Yep
How can it be that two files containing essentially the exact same data can have such different sizes and speeds?
R uses doubles as its default numeric type, so just storing your data frame takes about 16 GB (393,216,000 rows x 5 columns x 8 bytes). The proprietary software most likely uses float as the underlying type, cutting the memory requirement to about 8 GB.
How can I work with a matrix of image data much faster?
Buy a computer with 32 GB of RAM. Even with a 32 GB machine, think about using data.table in R, with operations done by reference, because R likes to copy data frames.
Or you might want to move to Python/pandas for processing, with explicit use of dtype=float32.
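A sketch of the data.table route, assuming the exported .txt table described in the question (the file name and the selected mass value are illustrative):
library(data.table)

dt <- fread("export.txt", col.names = c("x", "y", "z", "i", "m"))  # fast, memory-frugal read
one_mass <- dt[m == 300]   # subset a single mass channel
setkey(dt, m)              # or index by mass first for fast repeated subsetting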
UPDATE
If you want to stay with R, take a look at the bigmemory package (link), though I would say dealing with it is not for the faint of heart.
The answer to this question turned out to be a little esoteric and pretty specific to my dataset, but may be of interest to others. My data is very sparse, i.e. most of the values in my matrix are zero. Therefore, I was able to significantly reduce the size of my data using the Matrix package (capitalisation important), which is designed to handle sparse matrices more efficiently. To implement the package, I just inserted:
library(Matrix)
data <- Matrix(data)
The amount of space saved will vary depending on the sparsity of the dataset, but in my case I reduced 1.8 GB to 156 MB. A Matrix object behaves just like a base matrix, so there was no need to change my other code, and there was no noticeable change in speed. Sparsity is obviously something that the proprietary format takes advantage of.
I have survey data in SPSS and Stata which is ~730 MB in size. Each of these programs also occupies approximately the amount of memory you would expect (~800 MB) when I'm working with that data.
I've been trying to pick up R, and so attempted to load this data into R. No matter what method I try (read.dta on the Stata file, fread on a CSV file, read.spss on the SPSS file), the R object (measured using object.size()) is between 2.6 and 3.1 GB in size. If I save the object to an R file, that file is less than 100 MB, but on loading the object is the same size as before.
Any attempt to analyse the data using the survey package, particularly if I try to subset the data, takes significantly longer than the equivalent command in Stata.
e.g. I have a household size variable 'hhpers' in my data 'hh', weighted by variable 'hhwt', subset by 'htype'.
R code:
require(survey)
sv.design <- svydesign(ids = ~0, data = hh, weights = hh$hhwt)
rm(hh)
system.time(
  svymean(~hhpers, sv.design[which(sv.design$variables$htype == "rural"), ])
)
pushes the memory used by R up to 6 GB and takes a very long time:
user system elapsed
3.70 1.75 144.11
The equivalent operation in Stata
svy: mean hhpers if htype == 1
completes almost instantaneously, giving me the same result.
Why is there such a massive difference, in both memory usage (by the object as well as the function) and time taken, between R and Stata?
Is there anything I can do to optimise the data and how R is working with it?
ETA: My machine is running 64 bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata.
After some digging, I expect the reason for this is R's limited set of data types. All my data is stored as int, which takes 4 bytes per element. In survey data, each response is categorically coded and typically requires only one byte to store; Stata stores it using its 'byte' data type, while R stores it as a 4-byte int, leading to significant inefficiency in large surveys.
Regarding the difference in memory usage: you're on the right track, and (mostly) it's because of object types. Indeed, storing everything as integers will take up a lot of memory, so proper setting of variable types would improve R's memory usage. as.factor() would help (see ?as.factor for details) for converting columns after reading the data. To fix this while reading the data from file, refer to the colClasses parameter of read.table() (and of the similar functions specific to the Stata and SPSS formats). This will help R store the data more efficiently (its on-the-fly guessing of types is not top-notch).
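For example, a sketch with read.table, giving one class per column in column order (the file name, column order, and types here are illustrative, not taken from the actual survey):
hh <- read.table("hh.csv", header = TRUE, sep = ",",
                 colClasses = c("integer", "numeric", "factor"))  # hhpers, hhwt, htype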
Regarding the second part (calculation speed): handling large datasets is not base R's strong point, and that's where the data.table package comes in handy. It's fast, behaves much like the original data.frame, and summary calculations are really quick. You would use it via hh <- as.data.table(read.table(...)), and you can calculate something similar to your example with:
library(data.table)
hh <- as.data.table(hh)
hh[htype == "rural", mean(hhpers * hhwt)]
## or
hh[, mean(hhpers * hhwt), by = htype]  # note the 'empty' first argument
Sorry, I'm not familiar with survey data studies, so I can't be more specific.
Another detail on memory usage by the function: most likely R made a copy of your entire dataset to calculate the summaries you were looking for. Again, data.table would help here by preventing R from making excessive copies, and so improve memory usage.
The memisc package may also be of interest; for me, it resulted in much smaller eventual files than read.spss (I was, however, working at a smaller scale than you).
From the memisc vignette
... Thus this package provides facilities to load such subsets of variables, without the need to load a complete data set. Further, the loading of data from SPSS files is organized in such a way that all informations about variable labels, value labels, and user-defined missing values are retained. This is made possible by the definition of importer objects, for which a subset method exists. importer objects contain only the information about the variables in the external data set but not the data. The data itself is loaded into memory when the functions subset or as.data.set are used.
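A sketch of that importer workflow (the file and variable names are illustrative, and the exact memisc calls may differ slightly):
library(memisc)

imp <- spss.system.file("survey.sav")                 # importer object: metadata only, no data yet
ds  <- subset(imp, select = c(hhpers, hhwt, htype))   # the data for these variables is loaded here
hh  <- as.data.frame(ds)                              # convert the data.set to a plain data frame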
One of the new features of R 3.0.0 was the introduction of long vectors. However, .C() and .Fortran() do not accept long vector inputs. On R-bloggers I find:
This is a precaution as it is very unlikely that existing code will have been written to handle long vectors (and the R wrappers often assume that length(x) is an integer)
I work with the R package randomForest, and this package evidently uses .Fortran(), since it crashes with the error message
Error in randomForest.default: long vectors (argument 20) are not supported in .Fortran
How to overcome this problem? I use randomForest 4.6-7 (built under R 3.0.2) on a Windows 7 64bit computer.
The only way to guarantee that your input data frame will be accepted by randomForest is to ensure that the vectors inside the data frame do not have a length exceeding 2^31 - 1 (i.e. are not long). If you must start off with a data frame containing long vectors, then you would have to subset the data frame to bring the vectors down to an acceptable length. Here is one way you could subset a data frame to make it suitable for randomForest:
# given a data frame 'df' whose columns are long vectors
maxDim <- 2^31 - 1
dfShort <- df[seq_len(maxDim), ]   # keep only the first 2^31 - 1 rows
However, there is a major problem with doing this: you would be throwing away all observations (i.e. rows) at position 2^31 or higher. In practice, you probably do not need that many observations to run a random forest calculation. The easier workaround is to simply take a statistically valid subsample of the original dataset with a size that does not exceed 2^31 - 1. Store the data in R vectors that are not long, and your randomForest calculation should run without any issues, as in the sketch below.
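A sketch of that subsampling (the sample size and the response column y are illustrative):
library(randomForest)

set.seed(42)
n_keep <- 1e6                                  # well below 2^31 - 1
idx <- sample(nrow(df), n_keep)                # statistically valid random subsample
fit <- randomForest(y ~ ., data = df[idx, ])   # 'y' is the response column in df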