Is there a difference in performance between a plain Array and JuliaDB or DataFrames for doing calculations on huge data sets (large, but still fitting in memory)?
I can use plain arrays and algorithms to do sorting, grouping, reducing, etc. So why do I need JuliaDB or DataFrames?
I sort of understand why Python needs Pandas: it translates slow Python into fast C. But why does Julia need JuliaDB or DataFrames when Julia is already fast?
This is a possibly broad topic. Let me highlight the features that are key in my opinion.
What are the benefits of DataFrames.jl or JuliaDB.jl over standard arrays
They allow you to store columns of data having different types. You can do the same in arrays, but then they generally have to be arrays of Any, which is slower and uses more memory than columns with concrete types.
You can access columns using names. However, this is a secondary feature - e.g. NamedArrays.jl provides an array-like type with named dimensions.
The additional benefit is that there is an ecosystem built on the fact that columns have names (e.g. joining two DataFrames or building GLM model using GLM.jl).
This type of storage (heterogeneous columns with names) is a representation of a table in a relational database.
What is the difference between DataFrames.jl and JuliaDB.jl
JuliaDB.jl supports distributed parallelism; normal use of DataFrames.jl assumes that data fits into memory (you can work around this using SharedArray but this is not a part of the design) and if you want to parallelise computations you have to do it manually;
JuliaDB.jl supports indexing while DataFrames.jl currently does not;
Column types of JuliaDB.jl are stable, while for DataFrames.jl they currently are not. The consequences are:
when using JuliaDB.jl, each time a new type of data structure is created, all functions applied over this type have to be recompiled (for large data sets this can be ignored, but when working with many small, heterogeneous data sets it can have a visible performance impact);
when using DataFrames.jl you have to use special techniques ensuring type inference to achieve high performance in some situations (most notably barrier functions, as discussed here).
Related
Is there an efficient implementation of the set data structure in R?
In C++ I would use an std::set (which is implemented using red-black trees), in Python a set (which is implemented using hash tables), but I am not sure what I should use in R.
I have found this link, which describes some set operations, like union() and intersect(), that you can perform on vectors. So I guess that since vectors are involved, the complexities would not be logarithmic, as they could be with the data structures mentioned above.
Fun fact: in this case the name of the language does not help; searching "r set" turns up many results about $\mathbb{R}$, not the programming language :D
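(A small base-R sketch of the trade-off, entirely my own illustration: an environment is hash-backed, so it can act as a set with fast membership tests, while union() and intersect() rescan whole vectors on every call.)

    # An environment behaves like a hash table, so it can serve as a set
    # with roughly constant-time insert, lookup and delete.
    s <- new.env(hash = TRUE)

    set_add      <- function(set, x) assign(as.character(x), TRUE, envir = set)
    set_contains <- function(set, x) exists(as.character(x), envir = set, inherits = FALSE)
    set_remove   <- function(set, x) {
      if (set_contains(set, x)) rm(list = as.character(x), envir = set)
    }

    set_add(s, 42); set_add(s, "apple")
    set_contains(s, 42)             # TRUE
    set_remove(s, 42)
    set_contains(s, 42)             # FALSE

    # The vector-based helpers, by contrast, operate on whole vectors each call:
    union(c(1, 2, 3), c(3, 4))      # 1 2 3 4
    intersect(c(1, 2, 3), c(3, 4))  # 3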
I'm trying to generalize a neural network function to arbitrarily many layers, and so I need multiple matrices to hold the weights for each neuron in each layer. I was originally explicitly declaring matrix objects in R to hold my weights for each layer. Instead of having one matrix per layer, I thought of a way (not saying it's original) to store all of my weights in a single array, and defined an "indexing function" to map a weight to its appropriate index in the array.
I defined the function as follows:
where $w_{ijk}$ is the k-th weight of the j-th neuron in the i-th layer and $L(r)$ is the number of neurons in layer r. After writing these definitions, I realized that Stack Overflow doesn't allow LaTeX like MathOverflow does, which is unfortunate.
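(For concreteness, here is one possible R version of such an indexing function, assuming fully connected layers with no bias terms; the exact mapping may differ, but the idea is the same.)

    # L: vector of layer sizes, L[r] = number of neurons in layer r
    # w_index(i, j, k, L): position in the flat weight vector of the k-th
    # weight of the j-th neuron in the i-th layer (fully connected, no biases)
    w_index <- function(i, j, k, L) {
      offset <- 0
      if (i > 2) {
        offset <- sum(L[2:(i - 1)] * L[1:(i - 2)])  # weights of earlier layers
      }
      offset + (j - 1) * L[i - 1] + k
    }

    L <- c(3, 5, 2)                           # a 3-5-2 network
    w <- numeric(sum(L[-1] * L[-length(L)]))  # flat weight vector, length 25
    w[w_index(3, 2, 1, L)] <- 0.5             # first weight of neuron 2, layer 3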
Now the question is: is it more efficient to compute the index of my weights in this way, or is it actually less efficient?
After looking up how indices are computed for arrays in general, this is essentially what is done at compile time anyway if I just keep a matrix in each layer holding the weights, so it seems I may just be making my code overly complicated and harder to understand if there's no difference in time efficiency.
TL;DR: use the matrices; it's easier to understand and takes advantage of optimized CPU instructions.
In computer science parlance, the efficiency (scalability) of algorithms is reasoned about using Big O cost. A score can be given to both the time and space complexity.
Using Big O notation, let's compare the two approaches:
Array Approach
time complexity:
Array index access is O(1) time: no matter how large an array becomes, it is just as computationally easy to access an element given its index.
As you've created a function to compute the index of the k-th weight, this adds some small complexity but would probably run in constant O(1) time as it is a mathematical expression, so negligible.
space complexity:
O(N) where N is the number of weights across all layers.
Matrices Approach
time complexity:
A matrix is essentially a 2d array with O(1) access.
space complexity:
O(N + M), where N is number of neurons and M is number of weights.
Conceptually, we can see that the two approaches have an equivalent time and space complexity score.
However, there are other trade-offs involved (and as a good SO-er I must inform you of those).
When it comes to working with the data, the array approach is less efficient than the matrices approach because it forgoes the opportunity for vectorised (SIMD) operations. As @liborm alluded to, these vectorised operations are handled by lower-level system libraries like LAPACK/BLAS, which "batch" CPU instructions for some matrix operations (less overhead to transfer and compute data at the CPU compared to sending a new instruction every time).
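As a rough sketch of the effect (the sizes and names below are made up), the matrix product dispatches to BLAS in a single call, while the flat-array version does the same arithmetic one neuron at a time in interpreted R:

    set.seed(1)
    W <- matrix(rnorm(200 * 300), nrow = 200, ncol = 300)  # one layer's weights
    x <- rnorm(300)                                        # input activations
    w_flat <- as.vector(t(W))                              # row-major flat copy

    by_blas <- function() W %*% x          # one vectorised BLAS call
    by_loop <- function() {                # one neuron at a time
      out <- numeric(200)
      for (j in 1:200) {
        idx <- ((j - 1) * 300 + 1):(j * 300)
        out[j] <- sum(w_flat[idx] * x)
      }
      out
    }

    all.equal(as.vector(by_blas()), by_loop())  # same result
    system.time(for (r in 1:100) by_blas())     # typically much faster
    system.time(for (r in 1:100) by_loop())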
Instead of having one matrix per layer, I thought of a way ... to store all of my weights in a single array
It's hard to see why you would opt for the latter, as it requires you to create a bespoke indexing function. Maybe it's nicer to think about all your weights being in one long array? However, I would argue that the mental load required to maintain the array mapping is higher than having one matrix dedicated to each layer.
A hash-table-like structure of matrices would be much easier to reason about:
layers <- list(layer1 = matrix(...), layer2 = matrix(...), layerN = matrix(...))
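A forward pass through one layer then stays a single readable call, e.g. (hypothetical names, assuming each weight matrix is neurons x inputs):

    activations <- layers[["layer2"]] %*% input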
Further reading
http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
There are many factors to take into consideration in each of the approaches. I'm not familiar with R, but I'm assuming matrices' buffers are represented as one-dimensional arrays in memory. (Even if they are written as two-dimensional arrays in the underlying C implementation, the compiler stores them as a one-dimensional array in memory.)
The overall outline of memory operations is:
Case: several matrices, one per layer
Allocation of matrices: one buffer allocated per layer.
Accessing of indices: direct O(1) element access within each layer's matrix.
Case: one matrix for all layers + index calculation
Allocation of matrix cost: a single buffer allocated for all weights.
Accessing each of the indices cost: O(1) element access into the one buffer.
Function cost: the indexing function is evaluated on every access.
We can clearly see that the second case scales better, even though there's the additional cost of the function call.
Having said that, in general, having a statically allocated array with all the weights for all the layers should be faster.
In most cases a computer's bottleneck is memory bandwidth, and the best way to counteract this is to minimize the number of memory accesses.
With this in mind, there's another, more primitive reason why the second approach will probably be faster: caches.
Here's a good explanation by Good Ol' Bob Martin of the performance difference when accessing a two-dimensional array in a loop.
TL;DR: caches take advantage of the principle of locality, and therefore having memory accesses spatially close to each other (as you would in one single array, accessed in a cache-friendly way as explained in Bob Martin's answer) yields better performance than having them spatially separated (in several distinct arrays).
PS: I also recommend benchmarking both approaches and comparing, since these cache nuances are machine-dependent. It might be the case that the dataset/NN is small enough to fit completely in RAM, or even in cache on a very powerful server.
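A quick, machine-dependent way to see the locality effect in R itself (the size below is arbitrary): R stores matrices column-major, so extracting a column touches contiguous memory while extracting a row strides across it.

    n <- 5000L
    m <- matrix(rnorm(n * n), nrow = n, ncol = n)

    system.time(for (j in 1:n) sum(m[, j]))  # contiguous: cache-friendly
    system.time(for (i in 1:n) sum(m[i, ]))  # strided: cache-unfriendly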
I'm sure you want to use some kind of native array objects, so you get the speedups provided by BLAS/LAPACK implementations (see eg Intel MKL discussion here if you're on Windows). Most of the time in NN evaluation will be spent in matrix multiplications (like SGEMM), and this is where BLAS implementations like Intel MKL can be an order of magnitude faster.
That is, even if the hand-coded indices for your single-array multi-layer network were super fast, you wouldn't be able to use it with the optimised multiplication routines, which would make your whole network significantly slower. Use the native array objects and create a multi-layer abstraction on top of them.
But actually if you want speed and usability (and to really build some NN models), you should consider using something like R interface to TensorFlow. As a bonus you'll get things like running on the GPU for free.
Nice puzzle. If you are asking about computing the index, that calculation would happen at runtime, so I don't see how you would get the compiler to do it for you. If you need to work with this information at any later point, I would suggest using a hashmap-like mechanism. I have done this for a similar need.
I would like to define a slightly more general version of a complex number in R. This should be a vector that has more than one component, accessible in a similar manner to using Re() and Im() for complex numbers. Is there a way to do this using S3/S4 classes?
I have read through the OO field guide among other resources, but most solutions seem focused around the use of lists as fundamental building objects. However, I need vectors for use in data.frames and matrices. I was hoping to use complex numbers as a template, but they seem to be implemented largely in C. At this point, I don't even know where to start.
I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance, as I have no formal training in computer science and am entirely self-taught.
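A minimal base-R sketch of one possible direction (my own illustration, with made-up class and accessor names): store each component as its own numeric vector inside a classed list and provide accessors in the spirit of Re()/Im(). Making it behave fully inside data.frames and matrices needs more methods (length, [, format, as.data.frame, ...); the vctrs package's record type (vctrs::new_rcrd) exists for exactly this kind of thing if base S3 becomes tedious.

    # A three-component "triplex" vector built on a classed list of
    # equal-length numeric vectors (hypothetical names).
    triplex <- function(a, b, c) {
      stopifnot(length(a) == length(b), length(b) == length(c))
      structure(list(a = a, b = b, c = c), class = "triplex")
    }
    First  <- function(x) x$a   # analogues of Re() / Im()
    Second <- function(x) x$b
    Third  <- function(x) x$c

    length.triplex <- function(x) length(x$a)
    format.triplex <- function(x, ...) sprintf("(%g, %g, %g)", x$a, x$b, x$c)
    print.triplex  <- function(x, ...) { print(format(x)); invisible(x) }
    "[.triplex"    <- function(x, i) triplex(x$a[i], x$b[i], x$c[i])

    z <- triplex(1:3, 4:6, 7:9)
    z[2:3]      # "(2, 5, 8)" "(3, 6, 9)"
    First(z)    # 1 2 3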
Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:
Read and process the data into a data frame
Basic descriptive analysis, including text mining (frequent terms, etc.)
Plotting
Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R?
Simply put, any tips or pointers you can provide will be greatly appreciated. Again, I won't take offense if you describe solutions at a 3rd-grade level either.
Thanks in advance.
If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.
(The Twitter streaming API returns a pretty rich object: a single 140-character tweet could weigh a couple kb of data. You might reduce memory overhead if you preprocess the data outside of R to extract only the content you need, such as author name and tweet text.)
On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc -- you could consider using Hadoop to drive R.
Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
A couple of pointers:
an example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.
you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
The read.table function remains the main data import function in R. This function is memory inefficient and, according to some estimates, it requires three times as much memory as the size of a dataset in order to read it into R.
The reason for such inefficiency is that R stores data.frames in memory as columns (a data.frame is no more than a list of equal-length vectors), whereas text files consist of rows of records. Therefore, R's read.table needs to read whole lines, process them individually by breaking them into tokens, and transpose these tokens into column-oriented data structures.
The ColByCol approach is memory efficient. Using Java code, it reads the input text file and outputs it into several text files, each holding an individual column of the original dataset. Then these files are read individually into R, thus avoiding R's memory bottleneck.
The approach works best for big files divided into many columns, especially when these columns can be transformed into memory-efficient types and data structures: R's representation of numbers (in some cases), and character vectors with repeated levels via factors, occupy much less space than their character representation.
Package ColByCol has been successfully used to read multi-GB datasets on a 2GB laptop.
10GB of JSON is rather inefficient for storage and analytical purposes. You can use RJSONIO to read it in efficiently. Then I'd create a memory-mapped file. You can use bigmemory (my favorite) to create different types of matrices (character, numeric, etc.), or store everything in one location, e.g. using HDF5 or SQL-esque versions (e.g. see RSQLite).
What will be more interesting is the number of rows of data and the number of columns.
As for other infrastructure, e.g. EC2, that's useful, but preparing a 10GB memory-mapped file doesn't really require much infrastructure. I suspect you're working with just a few tens of millions of rows and a few columns (beyond the actual text of the tweets). This is easily handled on a laptop with efficient use of memory-mapped files. Doing complex statistics will require more hardware, cleverer use of familiar packages, and/or experimenting with some unfamiliar packages. I'd recommend following up with a more specific question when you reach that stage. The first stage of such work is simply data normalization, storage and retrieval. My answer for that is simple: memory-mapped files.
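A rough sketch of that first stage, with the field names, sizes and choice of columns entirely assumed for illustration: parse the stream line by line with RJSONIO and push a few numeric summaries into a file-backed big.matrix.

    library(RJSONIO)
    library(bigmemory)

    n_max <- 50e6   # generous upper bound on the number of tweets (assumed)
    m <- filebacked.big.matrix(nrow = n_max, ncol = 3, type = "double",
                               backingfile = "tweets.bin",
                               descriptorfile = "tweets.desc")

    con <- file("tweets.json", open = "r")
    i <- 0
    while (length(lines <- readLines(con, n = 10000)) > 0) {
      for (line in lines) {
        tw <- tryCatch(fromJSON(line), error = function(e) NULL)
        if (is.null(tw) || is.null(tw$id_str)) next   # skip deletes/garbage
        i <- i + 1
        m[i, 1] <- as.numeric(tw$id_str)              # tweet id
        m[i, 2] <- as.numeric(as.POSIXct(tw$created_at,
                                         format = "%a %b %d %H:%M:%S %z %Y"))
        m[i, 3] <- nchar(tw$text)                     # tweet length
      }
    }
    close(con)

The tweet text itself would go into a separate store (a flat file or an RSQLite table, say), keyed by the same row index.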
To read chunks of the JSON file in, you can use the scan() function. Take a look at the skip and nlines arguments. I'm not sure how much performance you'll get versus using a database.
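For example (the file name and chunk size here are placeholders): read 10,000 raw lines at a time and parse each chunk before moving the skip offset forward.

    chunk <- scan("tweets.json", what = character(), sep = "\n",
                  skip = 0, nlines = 10000, quiet = TRUE)
    tweets <- lapply(chunk, RJSONIO::fromJSON)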
I am trying to build a data processing program. Currently I use a double matrix to represent the data table: each row is an instance and each column represents a feature. I also have an extra vector holding the target value for each instance; it is of type double for regression and of type integer for classification.
I want to make it more general. I am wondering what kind of structure R uses to store a dataset, i.e. the internal implementation in R.
Maybe if you inspect the rpy2 package, you can learn something about how data structures are represented (and can be accessed).
The internal data structure is the data.frame; a detailed introduction to data frames can be found here:
http://cran.r-project.org/doc/manuals/R-intro.html#Data-frames
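To see that structure directly (a small example of my own), note that a data.frame is just a list of equal-length column vectors with a few attributes on top:

    df <- data.frame(x = c(1.5, 2.5), y = c(1L, 2L), label = c("a", "b"))
    typeof(df)       # "list"
    is.list(df)      # TRUE
    unclass(df)      # the underlying list of columns
    attributes(df)   # names, class = "data.frame", row.names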