I'm hoping to take advantage of Amazon spot instances which come at a lower cost but can terminate anytime. I want to set it up such that I can send myself data mid-way through a script so I can pick up from there in the future.
How would I email myself a .rdata file?
difficulty: The ideal solution will not involve RCurl since I am unable to install that package on my machine instance.
The same way you would on the command-line -- I like the mpack binary for that which you find in Debian and Ubuntu.
So save data to a file /tmp/foo.RData (or generate a temporary name) and then
system("mpack -s Data /tmp/foo.RData you#some.where.com")
in R. That assumes the EC2 instance has mail setup, of course.
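Putting the pieces together, here is a minimal sketch, assuming mpack is installed and mail is configured on the instance; the object names and recipient address are placeholders:
checkpoint_file <- tempfile(fileext = ".RData")      # or a fixed path like /tmp/foo.RData
save(model_fit, iteration, file = checkpoint_file)   # placeholder object names

# -s sets the subject line; mpack MIME-encodes the attachment for mailing
system(paste("mpack -s 'R checkpoint'", checkpoint_file, "you@example.com"))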
Edit: Per request for a Windows alternative: blat has been recommended by others for this task.
There is a good article on this in R News from 2007. Amongst other things, the author describes some tactics for catching errors as they occur, and automatically sending email alerts when this happens -- helpful for long simulations.
Off topic: the article also gives tips about how the linux/unix tools screen and make can be very useful for remote monitoring and automatic error reporting. These may also be relevant in cases when you are willing to let R email you.
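As a rough illustration of that idea (not the article's own code): wrap the long-running step in tryCatch() and reuse the mpack approach above when something fails; run_long_simulation() is a placeholder for your actual job:
result <- tryCatch(
  run_long_simulation(),                       # placeholder for the real work
  error = function(e) {
    msg_file <- tempfile(fileext = ".txt")
    writeLines(conditionMessage(e), msg_file)  # capture the error message
    system(paste("mpack -s 'R job failed'", msg_file, "you@example.com"))
    NULL                                       # return something so the script can continue
  }
)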
What you're asking is probably best solved not by email but by using an EBS volume. The volume will persist regardless of the instance (note though that I'm referring to an EBS volume as opposed to an EBS-backed instance).
In another question, I mention a bunch of options for checkpointing and related tools, if you would like to use a separate function for storing your data during the processing.
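For the spot-instance use case, a checkpointing loop against an EBS-mounted path might look like the sketch below; the mount point /data and the per-iteration work are assumptions:
checkpoint <- "/data/checkpoint.RData"          # path on the attached EBS volume (assumed mount point)

results <- list()
start   <- 1
if (file.exists(checkpoint)) load(checkpoint)   # resume after a spot termination

for (i in start:100) {
  results[[i]] <- sqrt(i)                       # placeholder for the real per-iteration work
  start <- i + 1
  save(results, start, file = checkpoint)       # persists even if the instance disappears
}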
I want to use the Forge Viewer as a preview tool in my web app for generated data.
The problem I have is that the model derivative API is sometimes slow sometimes fast.
I read that this happens because the files are placed in a queue and processed sequentially.
In my opinion, this can be solved by:
Having the extraction.update webhook also tell me where I am in the queue, so I can give my users better progress information, or stop the process when the queue is too long.
Being able to have a private queue. I have no problem paying more credits if necessary.
Being able to generate SVF2 files on my own server.
But I don't know if any of these options are possible. Or if there is another workaround.
Yes, that could be useful. I logged that request in our system: DERI-7940
It might be considered later on, but there are no plans currently.
I'm not aware of any plans for that
We're always working on making the translation service better, but unfortunately, I cannot tell when it will meet your requirements - including the implementation of the webhook feature you mentioned.
SVF2 is specifically for very large models - is that what you are working with? If not, then I'm quite certain that translating to SVF would be faster.
I have a problem with an FTP server that slows dramatically after returning a few files.
I am trying to access data from a government server at the National Snow and Ice Data Center, using an R script and the RCurl library, which is a wrapper for libcurl. The line of code I am using is this (as an example for a directory listing):
getURL(url="ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/")
or this example, to download a particular file:
getBinaryURL(url="ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/2013.07.28/MOD10A2.A2013209.h26v04.005.2013218193414.hdf
I have to make the getURL() and getBinaryURL() requests frequently because I am picking through directories looking for particular files and processing them as I go.
In each case, the server very quickly returns the first 5 or 6 files (which are ~1 Mb each), but then my script often has to wait for 10 minutes or more until the next files are available; in the meantime the server doesn't respond. If I restart the script or try curl from the OSX Terminal, I again get a very quick response for the first few files, then a massive slowdown.
I am quite sure that the server's behavior has something to do with preventing DOS attacks or limiting bandwidth used by bots or ignorant users. However, I am new to this stuff and I don't understand how to circumvent the slowdown. I've asked the people who maintain the server but I don't have a definitive answer yet.
Questions:
Assuming for a moment that this problem is not unique to the particular server, would my goal generally be to keep the same session open, or to start new sessions with each FTP request? Would the server be using a cookie to identify my session? If so, would I want to erase or modify the cookie? I don't understand the role of handles, either.
I apologize for the vagueness but I'm wandering in the wilderness here. I would appreciate any guidance, even if it's just to existing resources.
Thanks!
The solution was to release the curl handle after making each FTP request. However, that didn't work at first because R was hanging onto the handle even though it had been removed. The solution (provided by Bill Dunlap on the R help list) was to call garbage collection. In summary, the successful code looked like this:
library(RCurl)

for (file in filelist) {
  curl <- getCurlHandle()               # create a new curl handle
  getURL(url = file, curl = curl, ...)  # download the file
  rm(curl)                              # remove the curl handle
  gc()                                  # the magic call to garbage collection, without which the above does not work
}
I still suspect that there may be a more elegant way to accomplish the same thing using the RCurl library, but at least this works.
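For the binary .hdf downloads in the question, the same handle-release pattern can be used with getBinaryURL(); this is a sketch only, with the local file name derived from the URL:
library(RCurl)

for (file in filelist) {
  curl <- getCurlHandle()                            # fresh handle per request
  raw_data <- getBinaryURL(url = file, curl = curl)  # returns a raw vector
  writeBin(raw_data, con = basename(file))           # write it out under the same file name
  rm(curl)
  gc()                                               # force R to actually release the handle
}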
I want to take a shot at the Kaggle Dunnhumby challenge by building a model for each customer. I want to split the data into ten groups and use Amazon web-services (AWS) to build models using R on the ten groups in parallel. Some relevant links I have come across are:
The segue package;
A presentation on parallel web-services using Amazon.
What I don't understand is:
How do I get the data into the ten nodes?
How do I send and execute the R functions on the nodes?
I would be very grateful if you could share suggestions and hints to point me in the right direction.
PS I am using the free usage account on AWS but it was very difficult to install R from source on the Amazon Linux AMIs (lots of errors due to missing headers, libraries and other dependencies).
You can build everything up manually at AWS. You have to build your own Amazon compute cluster with several instances. There is a nice tutorial video available on the Amazon website: http://www.youtube.com/watch?v=YfCgK1bmCjw
But it will take you several hours to get everything running:
starting 11 EC2 instances (one instance for each group plus one head instance)
installing R and MPI on all machines (check for preinstalled images)
configuring MPI correctly (probably adding a security layer)
ideally, a file server that is mounted on all nodes (to share data)
with this infrastructure, the best solution is to use the snow or foreach package (with Rmpi), as sketched below
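To make the last point concrete, here is a minimal sketch of how the ten groups could be farmed out with snow over Rmpi; the data frame, column names, and model are placeholders:
library(snow)    # Rmpi must be installed and configured on every node

cl <- makeCluster(10, type = "MPI")   # one worker per data group

# split the customers into ten roughly equal groups, one list element each
groups <- split(customers, rep(1:10, length.out = nrow(customers)))

# fit one (placeholder) model per group on the workers
results <- clusterApply(cl, groups, function(chunk) {
  lm(spend ~ visits, data = chunk)    # substitute your real model here
})

stopCluster(cl)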
The segue package is nice but you will definitely get data communication problems!
The simplest solution is cloudnumbers.com (http://www.cloudnumbers.com). This platform provides easy access to computer clusters in the cloud. You can test it for 5 hours for free with a small computer cluster! Check the slides from the useR conference: http://cloudnumbers.com/hpc-news-from-the-user2011-conference
I'm not sure I can answer the question about which method to use, but I can explain how I would think about the question. I'm the author of Segue so keep that bias in mind :)
A few questions I would answer BEFORE I started trying to figure out how to get AWS (or any other system) running:
How many customers are in the training data?
How big is the training data (what you will send to AWS)?
What's the expected average run time to fit a model to one customer... all runs?
When you fit your model to one customer, how much data is generated (what you will return from AWS)?
Just glancing at the training data, it doesn't look that big (~280 MB). So this isn't really a "big data" problem. If your models take a long time to create, it might be a "big CPU" problem, which Segue may, or may not, be a good tool to help you solve.
In answer to your specific question about how to get the data onto AWS, Segue does this by serializing the list object you provide to the emrlapply() command, uploading the serialized object to S3, then using the Elastic Map Reduce service to stream the object through Hadoop. But as a user of Segue you don't need to know that. You just need to call emrlapply() and pass it your list data (probably a list where each element is a matrix or data frame of a single shopper's data) and a function (one you write to fit the model you choose), and Segue takes care of the rest. Keep in mind, though, that the very first thing Segue does when you call emrlapply() is to serialize (sometimes slowly) and upload your data to S3. So depending on the size of the data and your internet connection's upload speed, this can be slow. I take issue with Markus' assertion that you will "definitely get data communication problems". That's clearly FUD. I use Segue on stochastic simulations that send/receive 300MB/1GB with some regularity. But I tend to run these simulations from an AWS instance, so I am sending and receiving from one AWS rack to another, which makes everything much faster.
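For reference, a Segue run is typically structured roughly like the sketch below; the credential setup, data splitting, and model are placeholders, and the exact argument names should be checked against the Segue documentation:
library(segue)

setCredentials("YOUR_AWS_KEY", "YOUR_AWS_SECRET")   # assumed helper for AWS credentials

myCluster <- createCluster(numInstances = 10)       # spins up an EMR cluster

# one list element per shopper (or per group of shoppers)
shopperList <- split(trainingData, trainingData$customer_id)

fitOne <- function(df) {
  glm(bought ~ price + promo, data = df, family = binomial)  # placeholder model
}

results <- emrlapply(myCluster, shopperList, fitOne)

stopCluster(myCluster)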
If you want to do some analysis on AWS and get your feet wet with R in the cloud, I recommend Drew Conway's AMI for Scientific Computing. Using his AMI will save you from having to install/build much. To upload data to your running machine, once you have set up your SSH keys, you can use scp to copy files to your instance.
I like running RStudio on my Amazon instances. This will require setting up password access to your instance. There are a lot of resources around for helping with this.
I am currently working on a system that generates product recommendations like those on Amazon: "People who bought this also bought this..."
Current Scenario:
Extract the client's Google Analytics data and insert it into a database.
On the client's website, when a product page loads, an API call is made to get recommendations for the product being viewed.
When the API receives the product ID in the request, it looks in the database, retrieves (using association rules) the recommended product IDs, and sends them as the response.
The list of these product IDs is processed on the client end to get the product details (image, price, ...) and displayed on the website.
Currently I am using PHP and MySQL with the gapi package and a REST API, with storage on Amazon EC2.
My Question is:
Now, if I have to choose amongst the following, which will be the best choice to implement the above mentioned concept.
PHP with SimpleDB or BigQuery.
The R language with BigQuery.
RHIPE (R and Hadoop) with SimpleDB.
Apache Mahout.
Please help!
This isn't so easy to answer, because the constraints are fairly specialized.
The following considerations can be made, though:
BigQuery is not yet public. Thus, with a small user base, even if you are in the preview population, it will be harder to get advice on improvement.
Each of your options pairs a modeling system with a storage system. Apache Mahout is not a storage mechanism, so it won't necessarily work on its own. I used to believe that its machine learning implementations were a pastiche of a few Google Summer of Code projects, but I've updated that view on the suggestion of a commenter. It still looks like it has rather uneven and spotty coverage of different algorithms, and it's not particularly clear how the components are supported or maintained. I encourage an evangelist for Mahout to address this.
As a result, this eliminates the 1st, 2nd, and 4th options.
What I don't quite get is the need for a real-time server to utilize Hadoop and RHIPE. That should be done in your batch processing for developing the recommendation models, not in real-time. I suppose you could use RHIPE as a simple one-stop front end for firing off queries.
I'd recommend using RApache instead of RHIPE, because you can get your packages and models pre-loaded. I see no advantage to using Hadoop in the front end, but it would be a very natural back end system for the model fitting.
(Update 1) Other interface options include RServe (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see comments below), but I suspect it would be better to access R through HTTP or TCP/IP.
(Update 2) Addressing the whole process, the basic idea I see is that you could query the data from PHP and pass to R or, if you wish to query from within R, look at the link in the comments (to the OmegaHat tools) or post a new question about R & SimpleDB - I'm sure someone else on SO would be able to give better insight on this particular connection. RApache would let you instantiate many R processes already prepared with packages loaded and data in RAM; thus you would only need to pass whatever data needs to be used for prediction. If your new data is a small vector then RApache should be fine, and it seems this is correct for the data being processed in real-time.
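To illustrate the "pre-loaded R process" point: whatever sits behind RApache or Rserve only needs to load the rules once and can then answer each request with a small lookup. A minimal sketch, with the model file and rule structure as placeholders:
# done once, when the worker process starts (RApache/Rserve keep it alive)
rules <- readRDS("association_rules.rds")   # placeholder: a named list of product-ID vectors

recommend <- function(product_id, n = 5) {
  # return the top-n product IDs associated with the product being viewed
  head(rules[[as.character(product_id)]], n)
}

# the per-request work is then just a cheap lookup, e.g.
recommend("B00123", n = 5)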
If you want a real-time API for recommendations based on data in a database, Apache Mahout does this directly. You want to use ReloadFromJDBCDataModel, put on top a GenericItemBasedRecommender, and use the servlet-based wrapper in the examples module. It's probably a day or two of work to get familiar with the code and customize it to your needs, but it's pretty simple.
When you get past about 100M data points, you would need to look at distributing the computation with Hadoop. That's a fair bit more complex. Mahout has a distributed recommender too, which you can customize.
Let's presuppose that you have R running with root/admin privileges. What R calls do you consider harmful, apart from system() and file.*()?
This is a platform-specific question; I'm running Linux, so I'm interested in Linux-specific security holes. I will understand if you block discussion of this, since this post could easily turn into "How to mess up the system with R?"
Do not run R with root privs. There is no effective way to secure R in this way, since the language includes eval and reflection, which means I can construct invocations to system even if you don't want me to.
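To make that concrete, here is a small demonstration of why removing or masking system() is not enough; every line uses ordinary base-R features:
# suppose system() has been masked or deleted from the workspace --
# it can still be recovered from the base environment:
f <- get("system", envir = baseenv())
f("id")                       # runs a shell command anyway

# or the call can be constructed and evaluated indirectly:
eval(call("system", "id"))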
Far better is to run R in a way that cannot affect the system or user data, no matter what it tries to do.
Anything that calls external code could also be making system changes, so you would need to block certain packages and things like .Call(), .C(), .jcall(), etc.
Suffice it to say that it will end up being a virtually impossible task, and you are better off running it in a virtualized environment, etc. if you need root access.
You can't. You should just change the question: "How do I run user-supplied R code so as not to harm the user or other users of the system?" That's actually a very interesting question and one that can be solved with a little bit of cloud computing, apparmor, chroot magic, etc.
There are tons of commands you could use to harm the system. A handful of examples: Sys.chmod, Sys.umask, unlink, any command that allows you to read/write to a connection (there are many), .Internal, .External, etc.
And if you blocked users from those commands, there's nothing stopping them from implementing something in a package that you wouldn't know to block.
As noted by just about every response to this thread, removing the "potentially harmful" calls in the R language would:
Be potentially impossible to do completely.
Be difficult to do without spending significant time writing complicated (i.e. ugly) hacks.
Kneecap the language by removing a ton of functionality that makes R so flexible.
A safer solution that doesn't require modifying/rewriting large parts of the R language would be to run R inside a jail using something like BSD Jails, Jailkit or Solaris Zones.
Many of these solutions allow the jailed process to exercise root-like privileges but restrict the areas of the computer that the process can operate on.
A disposable virtual machine is another option. If a privileged user thrashes the virtual environment, just delete it and boot another copy.
One of my all time favorites. You don't even have to be r00t.
library(multicore)

forkbomb <- function() {
  repeat {
    parallel(forkbomb())   # each call forks a new child, which immediately forks again...
  }
}

forkbomb()
To adapt a cliche from gun rights people, "system() isn't harmful - people who call system() are harmful".
No function calls are intrinsically harmful, but if you allow people to use them freely then those people may cause harm.
Also, the definition of harm will depend on what you consider harmful.
In general, R is so complex that you can assume there is a way to trick it into executing data via seemingly harmless functions, for instance through a buffer overflow.