I do a lot of computational intelligence research. I have used Matlab almost exclusively as my programming medium for a decade or so. I am now trying to move to OSS. I have settled on R as my new environment.
After a long search for neural net software, the only Matlab-comparable OSS packages are the Stuttgart Neural Network Simulator (SNNS) and FANN (this can be debated another time =). The former doesn't appear to be maintained, so I'd like to go with the latter. So my question is:
Does anyone have experience using R and FANN?
FANN has C++ bindings, and R seems to have a couple of packages for a C++ interface, but since I'm an R newbie I need an idea of where exactly to start. Any guidance or recommendations would be appreciated.
Cheers.
I do not know anything about FANN, but I can assure you that R has an actively maintained interface to the Stuttgart Neural Network Simulator (SNNS) library via the RSNNS package. RSNNS happens to employ the Rcpp package, which I am involved with, for interfacing R and C++.
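To give a concrete starting point, here is a minimal sketch along the lines of the RSNNS examples, training a small multi-layer perceptron on the built-in iris data; the hidden layer size and iteration count are arbitrary illustrative choices:

```r
# Minimal sketch: train a small MLP with RSNNS on the iris data.
# install.packages("RSNNS") if needed.
library(RSNNS)

data(iris)
iris <- iris[sample(nrow(iris)), ]            # shuffle rows
inputs  <- normalizeData(iris[, 1:4])         # scale the four predictors
targets <- decodeClassLabels(iris[, 5])       # one-hot encode the species

split <- splitForTrainingAndTest(inputs, targets, ratio = 0.2)

# size = 5 hidden units, maxit = 100 iterations: illustrative choices
model <- mlp(split$inputsTrain, split$targetsTrain,
             size = 5, maxit = 100,
             inputsTest = split$inputsTest,
             targetsTest = split$targetsTest)

predictions <- predict(model, split$inputsTest)
confusionMatrix(split$targetsTest, predictions)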
I currently work with Kedro (from QuantumBlack, https://kedro.readthedocs.io/en/stable/01_introduction/01_introduction.html) as a deployment-oriented framework for coding collaboratively. It is a great framework for developing machine learning in a team.
I am looking for an R equivalent.
My main issue is that I have teams of data scientists who develop in R, but each team is developing in a different format.
I want to make them follow a common framework to develop deployment-ready R code that is easy to work on in two- or three-person teams.
Any suggestions are welcome.
Member of the Kedro team here. We've heard good things about the targets library doing similar things in the R world (see the sketch below).
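For a flavour of what that looks like, here is a minimal, hypothetical sketch of a targets pipeline; the file name data.csv and the lm() formula are made up for illustration:

```r
# _targets.R: minimal, hypothetical sketch of a targets pipeline.
library(targets)

# Packages the pipeline steps need
tar_option_set(packages = "readr")

list(
  tar_target(raw_data, readr::read_csv("data.csv")),  # hypothetical input file
  tar_target(model, lm(y ~ x, data = raw_data)),      # hypothetical model fit
  tar_target(model_summary, summary(model))           # downstream reporting step
)
```

Calling targets::tar_make() then builds the pipeline as a dependency graph and re-runs only the targets whose upstream inputs changed, which is probably the closest R analogue to Kedro's pipeline-of-nodes idea.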
It would be remiss of me not to try and convert you and your team to the dark side too :)
Before Kedro our teams internally were writing a mix of Python, SQL, Scala and R. Part of the drive to write the framework was to get our teams internally speaking the same language. Python felt like the best compromise available at the time and I'd argue this still holds. We also had trouble productionising R projects and felt Python is more manageable in that respect.
Whilst not officially documented, I've also seen some people on the Kedro Discord play with rpy2 so that they can use specific R functionality within their Python pipelines.
Not on as prominent a scale as Kedro, but I can think of the below:
Local project of an R expert: https://github.com/Jeniffen/projectr
pipeliner on the tidyverse: https://cran.r-project.org/web/packages/pipeliner/index.html
I recently looked into using GPUs for computation in Julia, where the choice of package seemed confusing.
For example, CuArrays and ArrayFire seemed to be doing the same thing, and ArrayFire appeared to be the "official" package featured on the NVIDIA developer blog (https://devblogs.nvidia.com/gpu-computing-julia-programming-language).
Also, there were the CUDAdrv and CUDAnative packages, which seemed confusing, as their functionality was not as straightforward as that of the others.
What do these packages do? Is there any difference between CuArrays and ArrayFire?
As explained in the blog post you shared, the breakdown is quite simple:
The Julia package ecosystem already contains quite a few GPU-related packages, targeting different levels of abstraction as Figure 1 shows. At the highest abstraction level, domain-specific packages like MXNet.jl and TensorFlow.jl can transparently use the GPUs in your system. More generic development is possible with ArrayFire.jl, and if you need a specialized CUDA implementation of a linear algebra or deep neural network algorithm you can use vendor-specific packages like cuBLAS.jl or cuDNN.jl. All these packages are essentially wrappers around native libraries, making use of Julia's foreign function interfaces (FFI) to call into the library's API with minimal overhead.
The CUDAdrv and CUDAnative packages are meant for directly using the CUDA driver API and writing kernels from Julia itself. I believe that is where CuArrays comes in handy: wrapping native Julia arrays in a CUDA-accessible format, roughly speaking.
ArrayFire, on the other hand, is a generic library that wraps all of CUDA's domain-specific libraries (cuBLAS, cuSparse, cuSolver, cuFFT) into a nice interface (functions). Apart from the interface to CUDA's domain-specific libraries, ArrayFire itself provides a lot of other functions in the areas of statistics, image processing, computer vision, etc. It also has a nice JIT feature whereby the user's code is compiled to a runtime kernel, simply put. ArrayFire.jl is a language binding with some extra Julia-specific improvements at the wrapper level.
That's the general difference. From a developer's perspective, using a library (like ArrayFire) basically removes the burden of keeping up with the CUDA API and maintaining/tweaking kernels for optimum performance, which I think takes a lot of time.
PS: I am a member of the ArrayFire development team.
There are some options to access R libraries in Spark:
directly using SparkR
using language bindings like rpy2 or rscala
using a standalone service like OpenCPU
It looks like SparkR is quite limited, OpenCPU requires keeping an additional service running, and bindings can have stability issues. Is there something specific to the Spark architecture that makes using any solution difficult?
Do you have any experience with integrating R and Spark you can share?
The main language for the project seems like an important factor.
If pyspark is a good way to use Spark for you (meaning that you are accessing Spark from Python), accessing R through rpy2 should not be much different from using any other Python library with a C extension.
There are reports of users doing so (although with occasional questions, such as "How can I partition pyspark RDDs holding R functions" or "Can I connect an external (R) process to each pyspark worker during setup").
If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations would be the way to go.
If your main language is Scala, rscala should be your first try.
While the combo pyspark + rpy2 would seem the most "established" (as in "uses the oldest and probably most-tried codebase"), this does not necessarily mean that it is the best solution (and young packages can evolve quickly). I'd first assess the preferred language for the project and try options from there.
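For reference, if SparkR turns out to be enough for your use case, getting started is compact. Here is a minimal sketch, assuming a local Spark 2.x installation with the bundled SparkR package on the library path:

```r
# Minimal SparkR sketch: start a session, push a local data frame
# to Spark, and run a simple aggregation on the Spark side.
library(SparkR)

sparkR.session(master = "local[*]", appName = "sparkr-sketch")

# faithful is a built-in R data set; as.DataFrame copies it to Spark
df <- as.DataFrame(faithful)

count(df)  # row count, computed by Spark
head(summarize(groupBy(df, df$waiting),
               mean_eruptions = mean(df$eruptions)))

sparkR.session.stop()
```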
I just learned about feature hashing (also known as the hashing trick) and that some see it as an important technique for efficiently doing machine learning on large data sets.
However, I haven't seen anything like this being used for machine learning with R.
A Google search revealed that there is indeed a package hash on CRAN.
Could someone provide an example where this is used in R to speed up a machine learning task (or just to reduce RAM usage)?
I submitted a package named FeatureHashing recently. Please check the GitHub page for a demo: https://github.com/wush978/FeatureHashing and let me know if you have any issues using it.
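To make this concrete, here is a minimal sketch of the package's core function, hashed.model.matrix(); the toy data and the hash.size value are made up for illustration. The point of the trick is that the design matrix has a fixed number of columns regardless of how many distinct factor levels appear, which bounds RAM usage:

```r
# Minimal sketch: hash a high-cardinality factor into a fixed-width
# sparse design matrix with the FeatureHashing package.
library(FeatureHashing)

df <- data.frame(
  y    = rnorm(6),
  city = c("tokyo", "paris", "lima", "oslo", "cairo", "tokyo")
)

# 2^10 hashed columns regardless of how many distinct cities appear
m <- hashed.model.matrix(~ city, data = df, hash.size = 2^10)
dim(m)    # 6 rows, 1024 columns
class(m)  # a sparse matrix, usable with e.g. glmnet or xgboost
```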
The makeCluster function in the snow package supports the cluster types "SOCK", "PVM", "MPI", and "NWS", but I'm not very clear on the differences among them, and more specifically which would be best for my program.
Currently I have a queue of tasks of different lengths going into a load-balancing cluster with clusterApplyLB, and I am using a 64-bit, 32-core Windows machine.
I am looking for a brief description of the differences among the four cluster types, which would be best for my use and why.
Welcome to parallel programming. You may want to peruse the vignette of the excellent parallel package that comes with R, as it gives a general introduction. It also gives you an idea of what you can or cannot do on Windows. In short, PVM and MPI are standard parallel programming approaches supported by libraries of the same name. These exist on Windows, but they are less frequently used and often not as mature as their Unix counterparts.
If you want to stick with snow, your options are essentially limited to SOCK-type clusters. Again, the package documentation will have pointers.
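For what it's worth, a SOCK cluster needs no extra software on Windows, which is a large part of its appeal. A minimal load-balanced sketch, where the worker count and the dummy task are purely illustrative:

```r
# Minimal sketch: load-balanced task queue on a SOCK cluster.
library(snow)

cl <- makeCluster(8, type = "SOCK")   # 8 local workers; illustrative

# Tasks of uneven length: clusterApplyLB hands the next task to
# whichever worker finishes first, instead of pre-assigning chunks.
tasks <- 1:100
results <- clusterApplyLB(cl, tasks, function(i) {
  Sys.sleep(runif(1, 0, 0.1))  # stand-in for real, variable-length work
  i^2
})

stopCluster(cl)
```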