Using feature hashing/hashing tricks for machine learning in R

I just learned about feature hashing (also known as the hashing trick) and that some consider it an important technique for doing machine learning efficiently on large data sets.
However, I haven't seen anything like this being used for machine learning with R.
A Google search revealed that there is indeed a package called hash on CRAN.
Could someone provide an example where this is used in R to speed up a machine learning task (or just to reduce RAM usage)?

I submitted a package named FeatureHashing recently. Please check the GitHub page for a demo: https://github.com/wush978/FeatureHashing and let me know if you have any issues using it.
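For a concrete illustration, here is a minimal sketch of the hashing trick with the FeatureHashing package; the toy data frame, column names, and hash size are invented for the example, and the call follows hashed.model.matrix() as I understand it from the package documentation (check ?hashed.model.matrix for the exact signature in your version).

library(FeatureHashing)

# Toy data: two high-cardinality categorical predictors and a binary outcome
df <- data.frame(
  user    = c("alice", "bob", "carol", "alice"),
  domain  = c("a.com", "b.org", "c.net", "b.org"),
  clicked = c(1, 0, 0, 1)
)

# Hash the predictors into a fixed-width sparse matrix: memory use is bounded
# by hash.size rather than by the number of distinct factor levels.
X <- hashed.model.matrix(~ user + domain, data = df, hash.size = 2^10)
dim(X)   # 4 rows x 1024 hashed columns, stored as a sparse dgCMatrix

The resulting sparse matrix can be handed directly to learners that accept a dgCMatrix, such as glmnet or xgboost, which is where the RAM savings pay off.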

Related

Advice needed for R package security in production

I am working as a Data Scientist for a small start-up and we are using R as part of our platform for analysis, dashboards, etc. Therefore, I need to ensure that we maintain security with each package we use and load.
I have looked around, done extensive searching, and come across the following links:
This is the official RStudio blog security update page.
This blog post shows how you can implement rJava to help with those packages that require it, though it does state that '...the integrity & safety of the R package ecosystem is still in the “trust me, everything’s 👍!!”'
This post gives some good advice for package security, but basically boils down to: if you get it from CRAN or another trusted source then it should be ok.
The CVE site lists vulnerabilities, though the last one was back in 2017.
However, all the above links essentially say the same thing, which is "if it's from CRAN (or similar), then it is probably fine". Now this might indeed be the case, but I was hoping for something a bit more rigorous. Has anyone else come across this issue with production R deployments?
If possible, if someone could direct to where I might be able to find out more information on checking for security updates, breaches and changes for R packages, or how to go about testing the security myself, I would be very grateful.
Thanks!

Using R in Apache Spark

There are some options to access R libraries in Spark:
directly using sparkr
using language bindings like rpy2 or rscala
using standalone service like opencpu
It looks like SparkR is quite limited, OpenCPU requires running an additional service, and bindings can have stability issues. Is there something specific to the Spark architecture that makes it hard to use any of these solutions?
Do you have any experience with integrating R and Spark you can share?
The main language for the project seems like an important factor.
If pyspark is a good way to use Spark for you (meaning that you are accessing Spark from Python), accessing R through rpy2 should not be much different from using any other Python library with a C extension.
There are reports of users doing so (although with occasional questions such as "How can I partition pyspark RDDs holding R functions" or "Can I connect an external (R) process to each pyspark worker during setup").
If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations would be the way to go.
If your main language is Scala, rscala should be your first try.
While the combo pyspark + rpy2 would seem the most "established" (as in "uses the oldest and probably most-tried codebase"), this does not necessarily mean that it is the best solution (and young packages can evolve quickly). I'd first assess what the preferred language for the project is and try options from there.
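If R ends up being the main language, a minimal SparkR sketch of the kind of thing that is possible looks like the following; it assumes Spark 2.x with SparkR on the library path, and the data, schema, and computation are made up for illustration.

library(SparkR)

sparkR.session(appName = "r-on-spark-demo")

# Ship a local data.frame to the cluster as a Spark DataFrame
sdf <- as.DataFrame(faithful)

# Apply arbitrary R code to each partition with dapply(); the output schema
# has to be declared up front.
schema <- structType(structField("eruptions", "double"),
                     structField("waiting",   "double"),
                     structField("ratio",     "double"))

out <- dapply(sdf, function(part) {
  part$ratio <- part$eruptions / part$waiting
  part
}, schema)

head(collect(out))
sparkR.session.stop()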

Calling one R Installation from Another

I seem to be unable to compile RPostgreSQL for Windows x64, and after extensive searching, I've not been able to find a precompiled binary. To get on with my work, I've installed a 32-bit version of Postgres and have been using 32-bit R for all database ops.
I need to do much of my work in 64-bit R, so switching back and forth has become a bit painful, especially since this requires a save() and load() operation each time I need to run a query.
I'm wondering whether it is possible to call one R installation directly from another? For example, could I simply pass queries to my 32 bit R installation and retrieve the result? I think there are other times when the ability to call another R installation would be useful as well.
All I've come up with is using a system() call, either directly to pgsql or to 32-bit R, but this doesn't allow for very efficient transfer of data.
I'd very sincerely appreciate any advice or assistance!
P.S. I'd rather ask how to compile RPostgreSQL for x64, but as I understand the rules here, such a question would be inappropriate since it's not a general question (e.g. I'd need step-by-step instructions since I don't have the requisite skills).
http://wiki.postgresql.org/wiki/64bit_Windows_port
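One way to make the system() route workable is to pass data through files rather than standard output, so results arrive in the 64-bit session as native R objects. The sketch below is only illustrative: the Rscript path, the helper script name query32.R, and the SQL are placeholders, and it assumes RPostgreSQL loads fine in the 32-bit installation.

# query32.R -- run by 32-bit Rscript; expects the SQL as the first argument
# and an output file path as the second:
#   library(RPostgreSQL)
#   con  <- dbConnect(PostgreSQL(), dbname = "mydb")
#   args <- commandArgs(trailingOnly = TRUE)
#   saveRDS(dbGetQuery(con, args[1]), args[2])
#   dbDisconnect(con)

# From the 64-bit session:
rscript32 <- "C:/Program Files/R/R-2.15.0/bin/i386/Rscript.exe"  # placeholder path
out <- tempfile(fileext = ".rds")
system2(rscript32, args = c("query32.R",
                            shQuote("SELECT * FROM mytable LIMIT 100"),
                            shQuote(out)))
result <- readRDS(out)   # query result as a data.frame, no manual save()/load()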

Consume a web service in R

Here's the scenario:
I have JBoss serving a web service with JBossWS providing me with a WSDL. I have connected to and used it from both .NET and Java so far (and it has been quite easy once I figured it out). I am now trying to do the same with R.
Is there anything out there considered to be "the right way" for doing this? I am not that familiar with R, and my searches have not turned up much, so I figured I'd ask and maybe spare my head and the wall a bit of damage.
I have had good luck using rJava to recreate in R something that works in Java. I use this method to connect to Amazon's AWS Java SDK from R. This allows me, for example, to transfer files to/from S3 from R without having to recreate the whole connection/handshake/boogieWoogie in R.
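For context, the rJava pattern is roughly the following; the jar path and class names are placeholders for whatever SOAP client stubs you generated from the WSDL, not actual AWS or JBoss classes.

library(rJava)

.jinit()                                          # start the JVM
.jaddClassPath("/path/to/generated-client.jar")   # placeholder jar with the client stubs

# Instantiate a (hypothetical) generated service stub and call one operation
svc    <- .jnew("com/example/MyServicePortClient")
result <- .jcall(svc, "Ljava/lang/String;", "myOperation", "some argument")
result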
If you want to go more "pure R", I think you'll have to use some combination of RCurl and the XML package to grab and parse the WSDL.
There are a number of ways:
You could retain your Java approach and use the rJava package around it
You could use RCurl which is used to power a few higher-level packages (accessing Google APIs, say)
I believe there is an older SSOAP package on Omegahat which may help too.
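As a rough sketch of the "pure R" route mentioned above, fetching and inspecting the WSDL with RCurl and XML could look like this; the URLs, namespace, and operation name are placeholders, and the SOAP envelope is hand-built rather than generated.

library(RCurl)
library(XML)

wsdl_url <- "http://example.com/MyService?wsdl"   # placeholder endpoint
wsdl <- xmlParse(getURL(wsdl_url))

# List the operations the service advertises
ops <- xpathSApply(wsdl, "//*[local-name() = 'operation']", xmlGetAttr, "name")
unique(ops)

# POST a hand-written SOAP envelope for one of them
envelope <- '<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <myOperation xmlns="http://example.com/ns"/>
  </soapenv:Body>
</soapenv:Envelope>'

reply <- getURL("http://example.com/MyService",
                postfields = envelope,
                httpheader = c("Content-Type" = "text/xml; charset=utf-8",
                               SOAPAction   = ""))
xmlParse(reply)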

Document/Scripts management for R code

I am looking for a solution that allows me to keep a track of a multitude of R scripts that I create for various projects and purposes. Some scripts are easily tracked to specific projects, whereas others are "convenience" functions created to serve a set of tasks.
Is there a way I can create a central DB and query it to find which scripts match most appropriately?
I could create a system using a DBMS manually, but is anyone aware of anything, either general or specific to R, that comes in the form of a software tool (maybe FOSS)?
EDIT: Thank you for the responses. My current system is just a set of scripts with comments that allow me to identify their intended task. Though I use StatET with SVN, I would like a search utility along the lines of the "sos" package.
The question
I am looking for a solution that allows me to keep a track of a multitude of R scripts
that I create for various projects and purposes. Some scripts are easily tracked to specific
projects, whereas others are "convenience" functions created to serve a set of tasks.
fails to address the obvious follow-up of why the existing mechanisms are not suitable:
1. Create a local package for each project
2. Create one or more local packages for local utility functions
3. Use R's already existing mechanisms for searching, indexing, testing, cross-referencing
And use any revision control system of your liking, local or on the web, to host the code for 1. to 3. above.
Reinventing an RDBMS schema for 1. to 3. is just wrong in my book. But if you must, go ahead and replicate what you can already (mostly) get for free in tested and widely used code.
R comes with several mechanisms for searching for help, most of which naturally use CRAN. Some examples: the sos package, cranberries, crantastic, and rseek. In many cases, these could be adapted to use a local repository (you can find out how to create a local repository in the R manual, which is very easy to do). Otherwise, if you package your scripts and submit them to CRAN, you will naturally have these available to you. I would also highly recommend this presentation on the subject: Creating R Packages, Using CRAN, R-Forge, And Local R Archive Networks And Subversion (SVN) Repositories from Spencer Graves and Sundar Dorai-Raj.
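As a rough illustration of the local-repository idea (the details are in the "R Installation and Administration" manual; the paths and package name below are made up), the whole mechanism is little more than a directory plus an index file:

# Create the CRAN-like layout and copy your built source packages (*.tar.gz)
# into src/contrib, then index them:
dir.create("~/localrepo/src/contrib", recursive = TRUE)
tools::write_PACKAGES("~/localrepo/src/contrib", type = "source")

# Install from the local repository just as you would from CRAN
# (file:// URL shown Unix-style; adjust for Windows):
install.packages("myutils",
                 repos = paste0("file://", normalizePath("~/localrepo")),
                 type  = "source")

Once installed this way, the standard tooling (library(), help.search(), and so on) sees your own functions exactly as it sees CRAN packages.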
These would require you to put your code in packages and create documentation, all of which is worth doing anyway. The package documentation turns out to be very useful both for documenting what things do and for helping you find them in the future. You can use roxygen to create this documentation in-line with your code. Also read this related question: Organizing R Source Code.
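To give a flavour of the in-line documentation, here is a small, made-up convenience function documented with roxygen comments; running roxygen (these days roxygen2::roxygenise()) turns the comments into the package's help pages, which help.search() can then find.

#' Trimmed column means
#'
#' Convenience wrapper returning the trimmed mean of every numeric column.
#'
#' @param df a data.frame
#' @param trim fraction to trim from each end (passed to \code{mean})
#' @return a named numeric vector
#' @examples
#' col_means_trimmed(mtcars, trim = 0.1)
#' @export
col_means_trimmed <- function(df, trim = 0) {
  num <- vapply(df, is.numeric, logical(1))
  vapply(df[num], mean, numeric(1), trim = trim)
}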
Alternatively, the help.search() function can be very useful for searching local packages, regardless of whether you have a repository set up.
You'd probably be best off working with a version control system. Many can be indexed and made searchable. At my work, a stack of R, Eclipse, StatET, Subversion, and Subclipse works very well for us.
