I have a list of ~90 zipped files. I have written a loop to unzip each one (roughly 1 GB per file once unzipped), do some computations, save the output for that file, and delete the unzipped file. One iteration of this process takes about 30-60 minutes per file (not all files are exactly the same size).
I am not too worried about the time, as I can leave it working over the weekend. However, R doesn't manage to get all the way through. I left it on Friday night, but it only ran for about 12 hours, so it only processed 30 of the 90 files.
I don't deal with this kind of heavy processing often, but the same thing has happened in the past with analogous processes. Is there any command I need to insert in my loops to keep the computer from freezing during such intensive processing? I tried gc() at the end of the loop, to no avail.
Is there a list of "good practice" recommendations for this type of procedure?
If your session is freezing, you are likely running into a problem you need to isolate: it may be a single file, or you may be hitting a memory limit and making heavy use of swap.
Regardless, here are some tips or ideas you could implement:
Write your code to process one file as a single case, e.g. as a function like process_gz_folder(). Then loop over the file paths and invoke that function each time; this keeps the global environment clean (a sketch follows after this list).
As you have already tried, gc() can sometimes help, but it depends on the situation and on whether memory can actually be freed (after running rm(), for example). It could be called after invoking the function from the first point.
Are you keeping the results of each file in memory? Does this set of results get larger with each iteration? If so, it may be taking up memory you need; storing the results to disk in a suitable format will let you accumulate them after each file has been processed.
To add to the prior point, if the files produce outputs, make sure their names are appropriate, and consider adding a timestamp (e.g. inputfile_results_YYYYMMDD).
The code could check whether a file has already been processed and skip to the next one. This helps when you have to restart, so you are not starting from scratch, especially if the check for whether a file has been processed is the existence of an output (with timestamp!).
Use try() to make sure failures do not stop future iterations; however, it should produce a warning or some output to notify you of a failure so that you can come back to it at a later point.
A more involved approach is to create a single script that processes a single file; it could just contain the function from the first point, preceded by setTimeLimit() with a time limit after which the code stops if the file has not been processed. Iterate over this script with a bash script that invokes the R script with Rscript, which can be passed arguments (file paths, for example). This approach may help avoid freezes, but it depends on you knowing and setting an acceptable time limit.
Determine whether the files are too large for memory during processing; the code may need to be adjusted to be more memory efficient, or changed to process the data incrementally so as not to run out of memory.
Reduce other tasks on the computer that take up resources and may contribute to a freeze.
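Pulling several of these points together, here is a minimal sketch (the function name, directories, output format and process_data() are all placeholders for whatever you actually do):

library(tools)  # for file_path_sans_ext()

# Process a single zipped file: unzip, compute, save output, clean up.
process_gz_folder <- function(zip_path, out_dir = "results") {
  if (!dir.exists(out_dir)) dir.create(out_dir)
  out_file <- file.path(out_dir,
                        paste0(file_path_sans_ext(basename(zip_path)),
                               "_results_", format(Sys.Date(), "%Y%m%d"), ".rds"))

  # Skip files that already have an output, so a restart resumes where it left off.
  if (file.exists(out_file)) return(invisible(NULL))

  tmp_dir <- tempfile("unzipped_")
  unzip(zip_path, exdir = tmp_dir)                              # ~1 GB once unzipped

  dat <- read.csv(list.files(tmp_dir, full.names = TRUE)[1])    # assumption: one CSV per zip
  res <- process_data(dat)                                      # placeholder for your computation
  saveRDS(res, out_file)

  # Free disk and memory before the next iteration.
  unlink(tmp_dir, recursive = TRUE)
  rm(dat, res)
  gc()
  invisible(NULL)
}

# Driver loop: a failure on one file is reported but does not stop the run.
zips <- list.files("data", pattern = "\\.zip$", full.names = TRUE)
for (z in zips) {
  ok <- try(process_gz_folder(z), silent = TRUE)
  if (inherits(ok, "try-error")) message("FAILED: ", z)
}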
These are just some ideas that spring to mind and could be worth considering in your case (given the information provided). It would help to see some code and to understand what kind of processing you are doing on each file.
Given how little information you have provided, it is hard to tell what the problem really is.
If possible, I would first unzip and concatenate the files. Then preprocess the data and strip off all fields that are not required for analysis. The resulting file would then be used as input for R.
Also be aware that parsing input strings, e.g. as timestamps, may be quite time consuming.
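For example, assuming the data are CSV and only a few columns are needed (the file and column names below are made up), data.table::fread() can read just those columns and the timestamps can be parsed once, up front:

library(data.table)

# Read only the columns required for the analysis.
dt <- fread("concatenated.csv", select = c("id", "timestamp", "value"))

# Parse the timestamps once here rather than repeatedly later on.
dt[, timestamp := as.POSIXct(timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")]

# The slimmed-down file is then the input for the actual analysis in R.
fwrite(dt, "preprocessed.csv")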
I am very new to R and I am using RStudio. I am building a new "user-defined" function, which entails a huge amount of trial and error. Every time I make the slightest change to the function, I need to select the entire function and press Ctrl+Enter in order to "commit" it to the workspace.
I am hoping there is a better way of doing it, perhaps in a separate window that automatically "commits" when I save.
I am coming from Matlab and am used to just saving the function after which it is already "committed".
Ctrl+Shift+P re-runs the previously executed region, so you won't have to highlight your function again. This will work unless you have executed something else in the interim.
If you want to run some part of your code in RStudio, you simply have to use Ctrl+Enter. If the code were run every time you saved it, it could have very bad effects. Imagine that you have a huge script that runs for a long time and uses a lot of computer resources; you would end up killing R to stop the script every time you saved it!
What you could do is save the script in an external file and then call it from your main script using source("some_directory/myscript.R").
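For example (the file and function names here are just placeholders): keep the function in its own file and re-source it after every save.

# some_directory/myscript.R
my_fun <- function(x) {
  x^2 + 1
}

# main script
source("some_directory/myscript.R")  # re-run this single line after saving a change
my_fun(3)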
I would like to have an API in which a POST message containing a CSV file is sent to a server/web server/domain name. The file is used as input for an R function, which then outputs a value that is sent back to the sender of the POST message.
One of the issues I have is that most of the solutions I have seen, such as rApache (http://rapache.net/), invoke R to run a script and take back the output. The problem is that my R script also loads some very large data files from disk, which are used as further inputs in order to create the final output.
When running R from the console with the large data files and all the relevant libraries already loaded, the final part (loading the user's input CSV, running the function and creating an output) is reasonably quick. In other words, it seems highly inefficient to keep re-invoking R for each POST request, loading all the relevant files and then closing it after creating the output. Having R constantly running with all the relevant files and libraries loaded, and only reading in the given CSV file to run the final calculations, seems much more efficient. Is there a way to do this?
Shiny (http://shiny.rstudio.com/) looks like a close solution since it always has R running in the background and may be able to take in POST requests, but it also has a lot of unnecessary overhead which probably makes it too inefficient for my purposes.
Also, will this method be able to handle many POST messages coming in simultaneously?
As always, any help is much appreciated. Thanks in advance.
FastRWeb can accept POST requests and may be what you are looking for.
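A rough, heavily hedged sketch of what a handler could look like (the layout, the csv argument, big_model and score() are all assumptions on my part; check the FastRWeb documentation for how POST bodies and uploads are actually exposed to run()):

# Assumed FastRWeb handler script. big_model is assumed to have been loaded
# once at Rserve/FastRWeb startup, so it stays in memory between requests
# instead of being re-read for every POST.
run <- function(csv, ...) {
  # Assumption: 'csv' holds the uploaded CSV content passed in by FastRWeb.
  dat <- read.csv(textConnection(csv))
  result <- score(big_model, dat)   # score() stands in for your own function
  out(as.character(result))         # out()/done() are FastRWeb output helpers
  done()
}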
Let me first say that I assiduously avoid hand-cleaning data in favor of regular expressions and the like. However, occasionally it is inevitable.
I normally use something like the Load-Clean-Func-Do workflow, so this obviously fits into the cleaning phase. However, any hand-editing breaks the ability to re-run everything that comes before the hand-cleaning step if it needs updating.
I can think of at least three ways to handle this:
Put the by-hand changes as early in the workflow as possible, so that everything after that remains runnable.
Write out regexes or assignment operations for every single change.
Use a tool that generates (2) for you after you close the spreadsheet where you've made the changes.
The problem with (2) is that it can be extremely unwieldy. The problem with (3) is that I'm unaware of any such tool existing for R. Stata has an extremely good implementation of this.
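For instance, option (2) amounts to hard-coding one statement per fix (the data frame and values below are invented), which is why it scales so poorly:

# one line per hand-made correction, on a hypothetical data frame 'dat'
dat$income[dat$case_id == "A1037"] <- 52000           # typo in the original entry
dat$city[dat$case_id == "B2210"]   <- "Springfield"   # misspelled city name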
So the questions are:
Which results in the most replicable code with the least-frustrating code writing?
Does a tool as in (3) exist?
I agree that hand-cleaning is generally a rather bad idea. However, sometimes it is unavoidable. I'd suggest one of the following two approaches, or both:
Keep a separate "data fixing" file containing three variables: "case_id", "variable_name" and "value". Use it to store information about which values in the original data need to be replaced. You may add further variables with extra information about the cleaning (e.g. why the value of variable "variable_name" needs to be replaced with "value" for case "case_id", etc.). Then have a short piece of R code that loads your original data and cleans it using the information in the "fixing" file (see the sketch after these two points).
Perhaps you should start using a version control system such as git or subversion (there are other programs as well). Every hand-made change to the data can be recorded in the system as a separate commit. At the end of the day, you will be able to check the log to see what changes you made to the data and when. Moreover, you will be able to generate patch files that transform the original data files into the cleaned ones. It is also beneficial to have your R code files under version control.
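A minimal sketch of the first approach (the file names are placeholders; values are applied as read, so numeric columns may need an explicit conversion):

# fixes.csv holds one row per correction: case_id, variable_name, value
dat   <- read.csv("original_data.csv", stringsAsFactors = FALSE)
fixes <- read.csv("fixes.csv", stringsAsFactors = FALSE)

# Apply each recorded correction to the original data.
for (i in seq_len(nrow(fixes))) {
  rows <- dat$case_id == fixes$case_id[i]
  dat[rows, fixes$variable_name[i]] <- fixes$value[i]
}

write.csv(dat, "cleaned_data.csv", row.names = FALSE)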
I have an issue regarding mrjob.
I'm using a Hadoop cluster with three datanodes, one namenode and one jobtracker.
Starting from a nifty sample application, I wrote something like the following:
first_script.py:
for i in range(1, 2000000):
    print "My Line " + str(i)
This obviously writes a bunch of lines to stdout.
The second script contains the mrjob mapper and reducer.
Calling it from a Unix (GNU) shell, I tried:
python first_script.py | python second_script.py -r hadoop
This gets the job done, but it first uploads the input to HDFS completely. Only when everything has been uploaded does it start the second job.
So my question is:
Is it possible to force a stream? (Like sending EOF?)
Or did I get the whole thing wrong?
Obviously you have long since forgotten about this, but I'll reply anyway: no, it's not possible to force a stream. The whole Hadoop programming model is about taking files as input and outputting files (and possibly creating side effects, e.g. uploading the same data to a database).
It might help if you clarified what you want to achieve a little more.
However, it sounds like you might want the contents of a pipe to be processed periodically, rather than waiting until the stream is finished. The stream can't be forced.
The reader of the pipe (your second_script.py) needs to break its stdin into chunks, either using
a fixed number of lines like this question and answer, or
non-blocking reads and a preset idle period, or
a predetermined break sequence emitted from first_script.py, such as a 'blank' line consisting of only \0.