I have a large data set with more than 1 million entries. When I run scripts, it sometimes takes a while until I get an output. Sometimes there seems to be no output whatsoever, even if I let it run for hours. Is there a way to track the progress of the computation (or at least to see whether it is stuck)?
1. Start small
Write your analysis script, then test it on trivially small amounts of data. Gradually scale up and see how the runtime increases. The microbenchmark package is great for this. In the example below, I compare how long the same function takes to run on three different-sized chunks of data.
library(microbenchmark)

# stand-in for a slow analysis: sleep 0.01 s per row of x
long_running_function <- function(x) {
  for (i in 1:nrow(x)) {
    Sys.sleep(0.01)
  }
}

# time the same function on three different-sized chunks of data
microbenchmark(long_running_function(mtcars[1:5, ]),
               long_running_function(mtcars[1:10, ]),
               long_running_function(mtcars[1:15, ]))
2. Look for functions that provide progress bars
I'm not sure what kind of analysis you're performing, but some packages already have this functionality. For example, ranger gives you more progress updates than the equivalent randomForest functions.
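For instance, a minimal sketch (assuming ranger's verbose argument, which reports progress while the forest grows; the model and data here are just placeholders):

library(ranger)

# verbose = TRUE makes ranger print progress as it grows the trees
fit <- ranger(Species ~ ., data = iris, num.trees = 500, verbose = TRUE)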
3. Write your own progress updates
I regularly add print() or cat() statements to large code blocks to tell me when R has finished running a particular part of my analysis. Functions like txtProgressBar() let you add your own progress bars to functions as well.
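For example, a minimal base-R sketch of a console progress bar (the loop body is a stand-in for real work):

# create a progress bar, update it each iteration, close it when done
pb <- txtProgressBar(min = 0, max = 100, style = 3)
for (i in 1:100) {
  Sys.sleep(0.05)          # stand-in for one unit of work
  setTxtProgressBar(pb, i)
}
close(pb)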
Let's say you've just finished writing a series of custom functions in an R Markdown book to analyze your dataset, covering reading, tidying, analysis, visualization, and export. You now want to deploy these functions on a folder full of CSV datasets sequentially. The functions can't be used as standalones, because each one requires variables/objects that are outputs from the previous function; essentially, they need to be run in a linear order.
Which of the two methods below is the more efficient way of combining these functions?
I imagine there are two approaches:
Should you create individual R script files for each function and source them into another R script file, running each function as a standalone line of code, one after another, e.g.,
x <- read_csv(data_sets)
clean_output <- func1(x)
results_output <- func2(clean_output)
table_plots_output <- func3(results_output)
export_csv <- func4(table_plots_output)
OR
Should you write a sort of master function that wraps all the functions you've created previously, running all your processes (cleaning, analysis, visualization, and export of results) in a single line of code?
master_funct <- function(x) {
  clean_output <- func1(x)
  results_output <- func2(clean_output)
  table_plots_output <- func3(results_output)
  func4(table_plots_output)
}

x <- read_csv(data_sets)
export_csv <- master_funct(x)
I try to follow tidyverse approaches, so if there is a tidyverse approach to this task, that would be great.
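For what it's worth, a minimal sketch of a pipeline in that style (assuming func1 through func4 as above, a hypothetical data/ folder of CSVs, and R >= 4.1 for the native pipe):

library(readr)
library(purrr)

# run every CSV in the folder through the pipeline in order
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
walk(csv_files, function(path) {
  read_csv(path) |>
    func1() |>
    func2() |>
    func3() |>
    func4()
})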
I'm trying to learn Forth directly on an embedded system, using Starting Forth by Leo Brodie as a text. The Forth version I'm using is 328eForth (a port of eForth to the ATmega328), which I've flashed onto an Arduino Uno.
It appears that the DO LOOP words are not implemented in 328eForth, which puts a kink in my learning with Brodie. But looking at the dictionary using WORDS shows that a series of looping words does exist, e.g. BEGIN UNTIL WHILE FOR NEXT AFT EXIT AGAIN REPEAT, amongst others.
My questions are as follows:
Q1 Why was DO LOOP omitted from 328eForth?
Q2 Can DO LOOP be implemented in terms of other existing words? If so, how, and if not, why not? (I guess there must be a very good reason for the omission of DO LOOP...)
Q3 Can you give some commented examples of the 328eForth looping words?
Q1: A choice was made for a different loop construct.
Q2: The words FOR and NEXT perform a similar function; they simply count down to 0, with the loop index taking every value from n down to and including 0 (so the body runs n + 1 times).
The ( n2 n1 -- ) DO ... LOOP always runs at least once, which requires additional (mental) bookkeeping. People have been complaining about that for as long as I can remember.
Q3: The 328eForth documentation, ForthArduino_1.pdf, contains some examples.
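As a rough illustration (not taken from the 328eForth docs; this assumes standard eForth FOR ... NEXT and BEGIN ... UNTIL semantics, and that the usual core words are present):

\ FOR ... NEXT counts down on the return stack; R@ reads the index.
\ 5 COUNTDOWN prints 5 4 3 2 1 0 (the body runs n+1 times).
: COUNTDOWN ( n -- )
  FOR R@ . NEXT ;

\ BEGIN ... UNTIL loops until the flag on top of the stack is true.
\ 3 COUNTUP prints 0 1 2, counting up the way a DO ... LOOP would.
: COUNTUP ( n -- )
  0 BEGIN DUP . 1+ 2DUP = UNTIL 2DROP ;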
Edit: added some exposition to Q2.
I have read all the transactions from my dataset and then ran apriori on them. But it "ate" my whole RAM.
Is it possible to avoid this? Is it possible to run apriori without loading everything into RAM, or to somehow merge the results?
Generally, one can increase the memory available to R processes using command line parameters. See Increasing (or decreasing) the memory available to R processes
However, apriori has some optimization options itself. Add a list of control parameters to your call to apriori using control = list(memopt = TRUE) to minimize memory usage and control = list(load = FALSE) to disable loading transactions into memory.
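A minimal sketch of such a call (assuming the arules package; trans stands in for your transactions object, and the thresholds are placeholders to tune for your data):

library(arules)

# memopt = TRUE minimizes memory usage; load = FALSE avoids loading
# the transactions into memory
rules <- apriori(trans,
                 parameter = list(supp = 0.01, conf = 0.8),
                 control = list(memopt = TRUE, load = FALSE))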
Right now I have a script that can analyze the daily dump file from Hadoop. What I want is for my R script to run daily at 4 AM, after the new data becomes available. Is there any way, from the R side or the OS side, to make this happen automatically?
What I can think of is to leave another R script idling, continually checking the system time to decide when to call my script, but isn't that too much? I would prefer to have R closed unless necessary.
OK, I see the answer. Does anyone have experience to share on the relative stability of R and Python for running large-scale data processing tasks?
http://www.thegeekstuff.com/2009/06/15-practical-crontab-examples/
or, better yet:
http://tgmstat.wordpress.com/2013/09/11/schedule-rscript-with-cron/
Those websites should be all you need to get it going, assuming you are using Linux.
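For reference, a crontab entry along these lines (edit with crontab -e; the script and log paths are hypothetical):

# run the analysis every day at 4:00 AM
0 4 * * * Rscript /path/to/analysis.R >> /path/to/analysis.log 2>&1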
You can use code like this:
# run your analysis once a day at 4 AM for a year
start <- as.POSIXct("2014-11-14 04:00:00", tz = "GMT")
for (period in 1:365) {
  # your code here

  # sleep until the next 4 AM, converting the time difference to seconds
  newdate <- start + 24 * 60 * 60 * period
  Sys.sleep(as.numeric(difftime(newdate, Sys.time(), units = "secs")))
}
I have a species abundance dataset with quite a few zeros in it, and even when I set trymax = 1000 for metaMDS(), the program is unable to find a stable solution for the stress. I have already tried combining data (collapsing multiple years together to reduce the number of zeros) and I can't do any more. I was just wondering if anyone knows: is it scientifically valid to pick what R gives me at the end (the lowest-stress of the 1000 solutions), or should I not be using NMDS because it cannot find a stable spot? There seems to be very little information about this on the internet.
One explanation for this is that you are trying to use too few dimensions for the mapping. I presume you are using the default k = 2? If so, try k = 3 and compare the stress of the best k = 3 solution with the best solution from your 1000 tries at k = 2.
I would be a little concerned about taking one solution out of 1000 just because it had the best/lowest stress.
You could also try 1000 more random starts and see if the analysis converges when you run more iterations. If you saved the output from metaMDS(), you can supply that object to another call to metaMDS() via the previous.best argument. It will then do trymax further random starts, compare any lower-stress solutions with the previous best, and converge if it finds one similar to it, rather than having to find two similar low-stress solutions within the 1000 starts.
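A minimal sketch of both suggestions (assuming the vegan package, with spp standing in for your species abundance matrix):

library(vegan)

# compare the stress of the best k = 2 and k = 3 solutions
sol2 <- metaMDS(spp, k = 2, trymax = 1000)
sol3 <- metaMDS(spp, k = 3, trymax = 1000)
sol2$stress
sol3$stress

# continue from the previous best solution with more random starts
sol2b <- metaMDS(spp, k = 2, trymax = 1000, previous.best = sol2)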