Sending CSV via web to R and sending results back - r

I would like to set up an API in which a POST request containing a CSV file is sent to a server/webserver/domain name, used as the input to an R function, and the resulting value is sent back to the sender of the POST request.
One of the issues I have is that most of the solutions I have seen, such as rApache (http://rapache.net/), invoke R to run a script and return the output. The problem is that my R script also loads some very large data files from disk, which are used as further inputs in order to create the final output.
When R is already running in the console with the large data files and all the relevant libraries loaded, the final step of reading the user's input CSV, running the function and creating the output is reasonably quick. So for each POST request, re-invoking R, loading all the relevant files and then closing it after creating the output seems highly inefficient. Keeping R constantly running with the relevant files and libraries loaded, and only reading in the given CSV file to run the final calculations, seems much more efficient. Is there a way to do this?
Shiny (http://shiny.rstudio.com/) looks like a close solution, since it keeps R running in the background and may be able to take in POST requests, but it also has a lot of unnecessary overhead which probably makes it too inefficient for my purposes.
Also, will this method be able to handle many POST messages coming in simultaneously?
As always, any help is much appreciated. Thanks in advance.

FastRWeb can accept POST requests and may be what you are looking for.
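To illustrate the keep-R-resident pattern in code, here is a minimal sketch using the plumber package rather than FastRWeb (plumber is not mentioned above; the data file, endpoint path and score_with_big_data() function are placeholders):

    # plumber.R -- the large data files are loaded once, when the API process
    # starts, not on every request
    library(plumber)

    big_data <- readRDS("large_reference_data.rds")    # placeholder file name

    #* Accept a CSV in the POST body and return the computed value
    #* @post /score
    function(req) {
      uploaded <- read.csv(text = req$postBody, stringsAsFactors = FALSE)
      score_with_big_data(uploaded, big_data)          # placeholder for your R function
    }

The API would be started from a long-running process with something like plumber::plumb("plumber.R")$run(port = 8000), so the large data stay in memory between requests.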

Related

Is there any way for me to check or find out the code that produced the dataset?

While practicing, I made the mistake of writing and running the code in the console instead of in a script. However, I managed to save all my work and datasets from the environment to work with in new sessions.
Right now I'm trying to rehash what I did in the last session. But since I didn't save a script from the last session, I don't know the code that produced certain data sets.
Is there any way for me to check or find out the code?
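One hedged possibility, assuming the commands were typed in an interactive session and history was kept: R's command history can sometimes be recovered.

    # In the (still open) interactive session:
    history(max.show = Inf)                 # show the full command history
    savehistory(file = "last_session.R")    # write it to a file for reference

    # In a later session, a previously saved history file (often ".Rhistory"
    # in the working directory) can be reloaded with:
    loadhistory(file = ".Rhistory")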

Is there a way of rendering rmarkdown to an object directly without saving to disk?

We are working on a project where, following an analysis of data using R, we use rmarkdown to render an HTML report which will be returned to users uploading the original dataset. This will be part of a complex online system involving multiple steps. One of the requirements is that the rmarkdown HTML be serialized and saved in a SQL database for the system to return to users.
My question is: is there a way to render the markdown directly to an object in R to allow for direct serialisation? We want to avoid saving to disk unless absolutely needed, as there will be multiple parallel processes doing similar tasks and resources might be limited. From my research so far it doesn't seem possible, but I would appreciate any insight.
You are correct, it's not possible due to the architecture of rmarkdown.
If you have this level of control over your server, you could create a RAM disk, using part of your memory to simulate a hard drive. The actual hard disk won't be used.
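A rough sketch of that idea, assuming a Linux host where /dev/shm is a tmpfs (RAM-backed) mount; the file names are placeholders:

    library(rmarkdown)

    # Render to a temporary file on tmpfs so the physical disk is not touched
    out_file <- render("report.Rmd",
                       output_format = "html_document",
                       output_file   = tempfile(fileext = ".html", tmpdir = "/dev/shm"))

    # Read the HTML back as raw bytes, ready to be serialized into the database
    html_raw <- readBin(out_file, what = "raw", n = file.size(out_file))
    unlink(out_file)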

Is there a way to make R code only able to be run and not edited? Essentially read-only?

I am in the midst of writing some scripts to perform data analysis on large Excel sheets faster than by hand. However, my company has a strict quality review system where the program used needs to be validated and secure (i.e. no one can edit it, there is proof of what code was run, etc.). So essentially I would like my coworkers to be able to run my code without being able to edit the script. I was also interested in inserting prompts that they can fill in (e.g. "Which column would you like to analyze?").
Is all of this possible? I have read a few things online about file permissions, but I know that these can easily be changed by the user. I have also read about obfuscators but am entirely unfamiliar with their use.
One thought I have is to use R Markdown as a way of displaying which lines were run for which results. However, I believe that document could be edited as well? This would also leave the issue of the script itself being editable.
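On the prompts alone, a minimal sketch (my_data and the column handling are purely illustrative):

    # Ask the user which column to analyze when the script is run interactively
    col_name <- readline(prompt = "Which column would you like to analyze? ")
    if (!col_name %in% names(my_data)) {      # my_data is a placeholder data frame
      stop("Column '", col_name, "' not found.")
    }
    summary(my_data[[col_name]])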

R unable to process heavy tasks for many hours

I have a list of ~90 zipped files. I have written a loop to unzip them (roughly 1 GB per file once unzipped), do some computations, save the output for each file, and delete the unzipped file. One iteration of this process takes about 30-60 minutes per file [not all files are exactly the same size].
I am not too worried about the time, as I can leave it working over the weekend. However, R doesn’t manage to get all the way through: I left it on Friday night and it only ran for 12 hours, so it processed just 30 of the 90 files.
I don’t deal with this type of heavy process often, but the same has happened in the past with analogous processes. Is there any command I need to insert in my loops to keep the computer from freezing during such intensive processing? I tried gc() at the end of the loop, to no avail.
Is there a list of “good practice” recommendations for this type of procedure?
If your session is freezing, you are likely running into a problem you need to isolate: it may be a single file, or you may be hitting memory limits and making heavy use of swap.
Regardless, here are some tips and ideas you could implement:
Write your code so that it processes a single file as one self-contained case, e.g. a function like process_gz_folder(). Then loop over the file paths and invoke that function each time; this keeps the global environment clean (a sketch combining several of these points follows the list).
As you have already tried, sometimes gc() can help, but it depends on the situation and on whether memory is actually being freed (after running rm(), for example). It could be used after invoking the function from the first point.
Are you keeping the results of each file in memory? Does this set of results get larger with each iteration? If so, it may be taking up memory you need; storing the results to disk in a suitable format will let you accumulate the results after each file has been processed.
To add to the prior point, if the files produce outputs, make sure their names are informative, and consider adding a timestamp (e.g. inputfile_results_YYYYMMDD).
The code could check whether a file has already been processed and skip to the next one; this helps you resume rather than restart from scratch, especially if your method for checking whether a file has been processed is the existence of its output (with timestamp!).
Use try() to make sure failures do not stop future iterations; however, this should produce warnings/output to notify you of a failure so that you can come back to it at a later point.
A more general approach could be to create a single script that processes a single file. It could consist of just the function from the first point, preceded by setTimeLimit() with a time after which the run will stop if the file has not been processed. Iterate over this script with a shell script that invokes it via Rscript, which can be passed arguments (file paths, for example). This approach may help avoid freezes, but it depends on you knowing and setting an acceptable time limit (see the second sketch after this list).
Determine whether the files are too large for memory; if so, the processing code may need to be adjusted to be more memory efficient, or changed to process the data incrementally so as not to run out of memory.
Reduce other tasks on the computer that take resources and may cause a freeze.
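A minimal sketch combining several of the points above (compute_something(), the directory names and the .zip handling are placeholders; adjust for .gz files if that is what you have):

    # Per-file worker: unzip, compute, write a timestamped result, clean up
    process_gz_folder <- function(zip_path, out_dir = "results") {
      if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)
      base <- tools::file_path_sans_ext(basename(zip_path))

      # Skip files that already have an output (the existence check mentioned above)
      if (any(startsWith(list.files(out_dir), paste0(base, "_results_"))))
        return(invisible(NULL))

      tmp_dir <- tempfile("unzipped_")
      on.exit(unlink(tmp_dir, recursive = TRUE), add = TRUE)  # always delete the unzipped data
      files <- unzip(zip_path, exdir = tmp_dir)

      result <- compute_something(files)                      # placeholder for your computations
      saveRDS(result, file.path(out_dir, paste0(base, "_results_",
                                                format(Sys.Date(), "%Y%m%d"), ".rds")))
      invisible(NULL)
    }

    zip_paths <- list.files("data", pattern = "\\.zip$", full.names = TRUE)
    for (zp in zip_paths) {
      res <- try(process_gz_folder(zp))   # a failure does not stop the remaining files
      if (inherits(res, "try-error")) warning("Failed on: ", zp)
      rm(res); gc()                       # free memory between iterations
    }

And a second sketch of the per-file script approach with a time limit (the 90-minute limit is an arbitrary example):

    # process_one.R -- invoked per file, e.g.:  Rscript process_one.R path/to/file.zip
    args <- commandArgs(trailingOnly = TRUE)

    # Abort this file if it takes longer than 90 minutes of elapsed time
    setTimeLimit(elapsed = 90 * 60)

    process_gz_folder(args[1])   # the function sketched above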
These are just some ideas that spring to mind and could be worth considering in your case (given the information provided). It would help to see some code and understand what kind of processing you are doing on each file.
Given how little information you have provided, it is hard to tell what the problem really is.
If possible, I would first unzip and concatenate the files, then preprocess the data and strip out all fields that are not required for the analysis. The resulting file would then be used as input for R.
Also be aware that parsing input strings, e.g. as timestamps, may be quite time consuming.
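A rough sketch of that preprocessing step, assuming .zip archives and the data.table package (the column names are placeholders):

    library(data.table)

    zip_paths <- list.files("data", pattern = "\\.zip$", full.names = TRUE)

    # Unzip each archive, read only the columns required for the analysis,
    # and concatenate everything into one table
    slimmed <- rbindlist(lapply(zip_paths, function(zp) {
      tmp_dir <- tempfile("unzipped_")
      on.exit(unlink(tmp_dir, recursive = TRUE), add = TRUE)
      csv_file <- unzip(zp, exdir = tmp_dir)[1]
      fread(csv_file, select = c("id", "timestamp", "value"))   # placeholder columns
    }))

    fwrite(slimmed, "analysis_input.csv")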

project organization with R [duplicate]

Possible Duplicate:
Workflow for statistical analysis and report writing
I have not been programming with R for very long, but I am running into a project-organization question that I was hoping somebody could give me some tips on. I am finding that a lot of the analysis I do is ad hoc: that is, I run something, think about the results, tweak it and run some more. This is conceptually different from a language like C++, where you think about the entire thing you want to run before coding; it is a huge benefit of interpreted languages. However, the issue that comes up is that I end up having a lot of .RData files that I save so I don't have to source my script every time. Does anyone have any good ideas about how to organize my project so I can return to it a month later and have a good idea of what each file is associated with?
This is sort of a documentation question, I guess. Should I document my entire project at each stage and be vigorous about cleaning up files that are no longer necessary but were a byproduct of the research? This is my current system, but it is a bit cumbersome. Does anyone else have any other suggestions?
Per the comment below: One of the key things that I am trying to avoid is the proliferation of .R analysis files and .RData sets that go along with them.
Some thoughts on research project organisation here:
http://software-carpentry.org/4_0/data/mgmt/
the take-home message being:
Use Version Control for your programs
Use sensible directory names
Use Version Control for your metadata
Really, Version Control is a good thing.
My analysis is a knitr document, with some external .R files which are called from it.
All the data is in a database, but during my analysis the processed data are saved as .RData files. Only when I delete the .RData files are they recreated from the database the next time I run the analysis. It works like a cache, saving database access and data-processing time when I rerun (parts of) my analysis.
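A minimal sketch of that cache pattern (the file name and load_from_db() are placeholders):

    cache_file <- "processed_data.RData"

    if (file.exists(cache_file)) {
      load(cache_file)              # restores `processed` from the cache
    } else {
      processed <- load_from_db()   # placeholder: query the database and preprocess
      save(processed, file = cache_file)
    }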
Using a knitr (Sweave, etc.) document for the analysis enables you to easily write a documented workflow with the results included. And knitr caches the results of the analysis, so small changes usually do not result in a full rerun of all the R code, but only of a small section. That saves quite some running time for a bigger analysis.
(Ah, and as said before: use version control. Another tip: working with knitr and version control is very easy with RStudio.)
