I have some algorithms that load quite large CSV files and was wondering if it is possible to debug some of the logic without having to reload the data every time?
In Spyder you can debug a single cell. Is there something similar in R?
I have a list of about 90 zipped files. I have written a loop that unzips each one (approx. 1 GB per file once unzipped), does some computations, saves the output for that file, and deletes the unzipped file. One iteration of this process takes about 30-60 min per file [the files are not all exactly the same size].
I am not too worried about the time, as I can leave it working over the weekend. However, R doesn’t manage to get all the way through. I left it on Friday night and it ran for only 12 hours, so it processed only 30 of the 90 files.
I don’t deal with this type of heavy process often, but the same has happened in the past with analogous processes. Is there any command I need to insert in my loops to keep the computer from freezing during such intensive processes? I tried gc() at the end of the loop, to no avail.
Is there a list of “good practice” recommendations for this type of procedure?
If your session is freezing, you likely need to isolate the problem: it may be a single file, or you may be running out of memory or swapping heavily.
Regardless, here are some tips or ideas you could implement:
Write your code to process a single file as one case, e.g. a function like process_gz_folder(). Then loop over the file paths and call that function each time; this keeps the global environment clean.
As you already tried, gc() can sometimes help, but it depends on the situation and on whether memory is actually being released (after running rm(), for example). It could be called after invoking the function from the first point.
Are you keeping the results of each folder in memory? Does this set of results get larger with each iteration? If so, it may be taking up memory you need. Storing the results to disk in a suitable format lets you gather them after each file has been processed.
To add to the prior point: if files produce outputs, make sure their names are appropriate, and consider adding a timestamp (e.g. inputfile_results_YYYYMMDD).
The code could check whether a file has already been processed and skip to the next one. This helps avoid restarting from scratch, especially if the check is based on the existence of an output (with timestamp!).
Use try() so that failures do not stop later iterations. However, this should produce warnings/output to notify you of a failure so that you can come back to it later.
A more abstract approach is to create a single script that processes a single file. It could just contain the function from the first point, preceded by setTimeLimit() with a time after which the code stops running if the file has not been processed. Iterate over the files with a bash script that invokes the R script via Rscript, which can be passed arguments (file paths, for example). This approach may help avoid freezes, but it depends on you knowing and setting an acceptable time.
Determine whether the files are too large for memory. If so, the code may need to be adjusted to be more memory efficient, or changed to process the data incrementally so that it does not run out of memory.
Reduce other tasks on the computer that take resources and may cause a freeze.
These are just some ideas that spring to mind that could be things to consider in your example (given the info provided). It would help to see some code and understand what kind of processing you are doing on each file.
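Several of the tips above (one function per file, results written to disk, skipping files that already have an output, and catching failures) can be combined into a rough sketch like the one below. The names process_one_file(), process_all() and the compute argument are hypothetical, and the sketch assumes plain .zip archives that utils::unzip() can read. It leaves the timestamp out of the output name so that the "already processed" check stays a simple file.exists() call.

```r
# Hypothetical helper: processes ONE zipped file and writes its result to disk.
# `compute` is a placeholder for your own function; it receives the paths of
# the extracted files and returns the object to save.
process_one_file <- function(zip_path, out_dir, compute) {
  out_name <- paste0(tools::file_path_sans_ext(basename(zip_path)), "_results.rds")
  out_path <- file.path(out_dir, out_name)
  if (file.exists(out_path)) return("skipped")  # already processed earlier

  if (!file.exists(zip_path)) stop("input file not found: ", zip_path)

  # Extract into a temporary folder that is removed when the function exits,
  # even if `compute` fails.
  tmp_dir <- tempfile("unzipped_")
  dir.create(tmp_dir)
  on.exit(unlink(tmp_dir, recursive = TRUE), add = TRUE)

  extracted <- utils::unzip(zip_path, exdir = tmp_dir)
  result <- compute(extracted)
  saveRDS(result, out_path)  # keep results on disk, not in memory
  "done"
}

# Loop over all files; a failure is reported and recorded, then the loop moves on.
process_all <- function(zip_paths, out_dir, compute) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  status <- character(length(zip_paths))
  names(status) <- basename(zip_paths)
  for (i in seq_along(zip_paths)) {
    status[i] <- tryCatch(
      process_one_file(zip_paths[i], out_dir, compute),
      error = function(e) {
        message("Failed on ", zip_paths[i], ": ", conditionMessage(e))
        "failed"
      }
    )
    gc()  # may or may not help, as noted above
  }
  status
}
```

Because finished files are skipped, rerunning process_all() after a freeze or crash picks up where the previous run stopped, and the returned status vector shows which files failed.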
Given as little information as you have provided, it is hard to tell what the problem really is.
If possible, I would first unzip and concatenate the files. Then preprocess the data and strip off all fields that are not required for analysis. The resulting file would then be used as input for R.
Also be aware that parsing the input strings as e.g. timestamps may be quite time consuming.
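If the preprocessing is done in R itself, dropping unneeded fields at read time could look like the sketch below. It uses base R's read.csv(), where a colClasses entry of "NULL" skips a column. The function name read_needed_columns() and the column names in the usage note are made up for illustration.

```r
# Hypothetical helper: read only the columns named in `keep`.
read_needed_columns <- function(path, keep, ...) {
  # Read a single row just to learn the column names.
  header <- names(utils::read.csv(path, nrows = 1, check.names = FALSE))
  # "NULL" tells read.csv to skip a column; NA lets it guess the type.
  col_classes <- ifelse(header %in% keep, NA, "NULL")
  utils::read.csv(path, colClasses = col_classes, check.names = FALSE,
                  stringsAsFactors = FALSE, ...)
}
```

For example, read_needed_columns("big.csv", keep = c("when", "value")) would return a data frame with just those two columns. Timestamps arrive as plain character strings here, so the cost of parsing them is only paid if and when you convert them, e.g. with as.POSIXct(d$when, format = "%Y-%m-%d %H:%M:%S", tz = "UTC").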
I'm currently working on a shiny application to report on a business-wide scale, which is working fine. However, I've been asked if I can implement a means of writing information to a central document, which will then be read back into the application on the next run.
Essentially, what I need to make is a shiny app that I can input data into, with that data retrievable at a later date. Is this achievable with an Excel document? Organising a database within the company file structure wouldn't be allowed, so this is all I can think of.
Would this be as straightforward as using a package to write to Excel and then having an update script run at the beginning of each update or would there be more to consider? Most importantly, if two users are updating at the same time, would R wait for one update to finish before running the next one?
Thanks a lot in advance!
I'm trying to find a way to automate a series of processes that use several different programmes (on either Ubuntu or Windows).
For each programme, I've boiled the process down to either a macro or a script in each programme, so I feel confident that the entire process can be almost entirely automated. I just can't figure out what I can do to create a unifying tool.
The process is the following:
I have a simple text file with data, and I use a jEdit macro to tidy the data. This then goes to an OpenCalc template to create a graph. The data is then imported into a programme called TXM, which (after many clicks) generates a column of data. This is exported to a csv file, and that csv file is imported into an R session, whereupon a script is executed.
I have to repeat this process (and a few more similar ones) dozens of times a day, and it's driving me nuts.
My research into how to automate the import-treatment-export process has shown a few glimmers of hope, but I haven't been able to make any real progress.
I tried Autoexpect, but couldn't make it work on Ubuntu. TCL, I think, only works for internet applications, and I haven't been able to make Fabric work either.
I'm willing to spend a lot of time learning and developing a tool to achieve this, but at the moment I'm not even sure what terms to search for.
I've figured it out for Windows: I created a .bat file in a text editor which, when clicked, prompts the user for names, etc. and rewrites another text file. It then executes that .txt file as a script with R, using the command R.exe CMD BATCH c:\user\desktop\mymacro.txt
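As an alternative to rewriting the script text on every run, the values could be passed to R as command-line arguments, which works the same way on Ubuntu and Windows. The sketch below is only an illustration: the file name mymacro.R, the helpers parse_job_args() and run_job(), and the placeholder summary are all assumptions, not part of the setup described above.

```r
# mymacro.R (hypothetical file name)

# Hypothetical helper: turns command-line arguments into named job settings.
parse_job_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) != 2) {
    stop("usage: Rscript mymacro.R <input.csv> <output.csv>")
  }
  list(input = args[1], output = args[2])
}

# Hypothetical job: read the exported CSV, summarise it, write the result.
run_job <- function(job) {
  data <- utils::read.csv(job$input, stringsAsFactors = FALSE)
  # Placeholder for the real analysis script.
  result <- data.frame(rows = nrow(data), columns = ncol(data))
  utils::write.csv(result, job$output, row.names = FALSE)
  invisible(result)
}

# Only run automatically when the script is called with two arguments.
if (length(commandArgs(trailingOnly = TRUE)) == 2) {
  run_job(parse_job_args())
}
```

It would be called as Rscript mymacro.R tidy.csv summary.csv, either from a .bat file or from a shell script, so the same R script serves both systems.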
I would like to have an API system in which a POST message that contains a csv file is sent to a server/webserver/domain name. The file is used as an input for an R function, which then outputs a value that is sent back to the sender of the POST message.
One of the issues I have is that most of the solutions I have seen, such as rApache (http://rapache.net/), invoke R to run a script and take back the output. The problem is that my R script also loads some very large data files from disk, which are used as further inputs in order to create the final output.
When running R from the console, with the large data files and all the relevant libraries already loaded, the final part (loading the user's input csv, running the function, and creating an output) is reasonably quick. For each POST request, it seems highly inefficient to re-invoke R, load all the relevant files, and then close it after creating the output. Keeping R constantly running with all the relevant files and libraries loaded, and only loading the given CSV file to run the final calculations, seems much more efficient. Is there a way to do this?
Shiny (http://shiny.rstudio.com/) looks like a close solution since it always has R running in the background and may be able to take in POST requests, but it also has a lot of unnecessary overhead which probably makes it too inefficient for my purposes.
Also, will this method be able to handle many POST messages coming in simultaneously?
As always, any help is much appreciated. Thanks in advance.
FastRWeb can accept POST requests and may be what you are looking for.
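Whichever server is used, the "load once, reuse for every request" idea from the question can be sketched in plain R with a closure. This is not FastRWeb's API; make_handler(), the key/x/weight columns, and the calculation are invented purely for illustration.

```r
# Hypothetical factory: loads the large reference data ONCE and returns a
# function that handles a single request using the already-loaded data.
make_handler <- function(load_reference) {
  reference <- load_reference()  # expensive step, runs only here

  function(csv_text) {
    # Parse the CSV body of one request.
    input <- utils::read.csv(text = csv_text, stringsAsFactors = FALSE)
    # Illustrative calculation: join on a made-up `key` column and
    # return a weighted sum.
    merged <- merge(input, reference, by = "key")
    sum(merged$x * merged$weight)
  }
}
```

A long-running R process would call make_handler() once at startup and then pass the body of each incoming POST to the returned function, so only the small user file is read per request.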
Possible Duplicate:
Workflow for statistical analysis and report writing
I have not been programming with R for long, but I am running into a project organization question that I was hoping somebody could give me some tips on. I am finding that a lot of the analysis I do is ad hoc: that is, I run something, think about the results, tweak it and run some more. This is conceptually different from a language like C++, where you think about the entire thing you want to run before coding. It is a huge benefit of interpreted languages. However, the issue that comes up is that I end up with a lot of .RData files that I save so I don't have to source my script every time. Does anyone have any good ideas about how to organize my project so I can return to it a month later and have a good idea of what each file is associated with?
This is sort of a documentation question, I guess. Should I document my entire project at each leg and be rigorous about cleaning up files that are no longer necessary but were a byproduct of the research? This is my current system, but it is a bit cumbersome. Does anyone have any other suggestions?
Per the comment below: One of the key things that I am trying to avoid is the proliferation of .R analysis files and .RData sets that go along with them.
Some thoughts on research project organisation here:
http://software-carpentry.org/4_0/data/mgmt/
the take-home message being:
Use Version Control for your programs
Use sensible directory names
Use Version Control for your metadata
Really, Version Control is a good thing.
My analysis is a knitr document, with some external .R files which are called from it.
All data is in a database, but during my analysis the processed data are saved as .RData files. Only when I delete an .RData file is it recreated from the database the next time I run the analysis. It works kind of like a cache, saving database access and data processing time when I rerun (parts of) my analysis.
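A minimal sketch of this delete-to-refresh cache is shown below. The helper name cached() is hypothetical, and it uses saveRDS()/readRDS() (one object per file) rather than the .RData files described above.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise compute it, store it, and return it.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()     # e.g. query the database and process the data
  saveRDS(result, path)
  result
}
```

A call such as cached("processed.rds", function() slow_database_step()), where slow_database_step() stands in for your own code, only does the slow work when processed.rds is missing, so deleting that file forces a refresh on the next run.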
Using a knitr (Sweave, etc.) document for the analysis enables you to easily write a documented workflow with the results included. And knitr caches the results of the analysis, so small changes usually do not result in a full rerun of all R code, but only of a small section. This saves quite some running time for a bigger analysis.
(Ah, and as said before: use version control. Another tip: working with knitr and version control is very easy with RStudio.)