Slow output in RStudio - r

Output of 1,000 numbers by the command
c(1:1000)
takes 6 seconds in an R Markdown document. One can watch the numbers being displayed gradually, row by row. Similar observations with rep(1, 1000) or rep('1', 1000).
This happens on a new laptop with a freshly installed RStudio.
No other programs are running in the background, and every other program on the laptop performs as quickly as expected.
Is this a built-in delay meant to give an impression of the amount of displayed data, and if so, how can it be switched off?
technical details:
OS: Win10.19042
RStudio 1.3.1093
CPU: i7-9750H (6 cores)
RAM: 32GB
RStudio shows as busy while printing: after 3 seconds about 500 numbers were displayed, and after a further 3 seconds the whole output was complete.
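For reference, a minimal workaround sketch, assuming the goal is to inspect the data rather than render every value inline: the delay comes from rendering the output, so printing less (or nothing) avoids it.

options(max.print = 100)   # only the first 100 values get rendered
x <- 1:1000
head(x, 20)                # inspect a slice instead of the full vector
invisible(x)               # return the vector without printing anything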

Related

Parallelization of an R "repeat" loop

This is my first post on this forum, so if I'm not doing things correctly, please tell me and I'll fix it!
So here's my problem: at the end of one of my R programs, I run a simple loop using "repeat". This loop performs a series of calculations for each month of the year, 137 times, making a total of 1,644 repetitions. This takes time, about 50 minutes.
numero_profil_automatisation <- read.csv2("./import_csv/numero_profil_automatisation.csv",
                                          header = TRUE, sep = ";")
n <- 1
repeat {
  annee         <- "2020"
  annee_profil  <- "2020"
  mois_de_debut <- "1"
  mois_de_fin   <- "12"
  operateur     <- numero_profil_automatisation[n, 1]
  profil_choisi <- numero_profil_automatisation[n, 3]
  couplage      <- numero_profil_automatisation[n, 4]
  subvention    <- numero_profil_automatisation[n, 5]
  type          <- numero_profil_automatisation[n, 6]
  seuil  <- 0.05
  graine <- 4356
  resultat <- calcul_depense(annee, annee_profil, mois_de_debut, mois_de_fin, operateur,
                             profil_choisi, couplage, subvention, type, graine, seuil)
  nom <- paste("./resultat_csv/", operateur, annee, type, couplage, subvention,
               profil_choisi, ".csv", sep = "_")
  write.csv2(resultat, nom)
  n <- n + 1
  if (n == 138) break
}
I would like to optimize the code, so I talked to a friend of mine, a computational developer (who doesn't "know" R), who advised me, among other things, to parallelize the calculations. My new work computer has 4 cores (R detects 8 logical cores), so I could save a lot of time.
Being an economist-statistician and not a developer, I'm completely uncertain on this subject. I looked at some forums and articles and found a piece of code that worked on my previous computer, which had only 2 cores (R detected 4 logical cores): it divided the computing time by almost 2, from 2 h to 1 h. On my new computer, this piece of code doesn't change the computing time at all; with or without it, the run takes around 50 minutes (better processor, more RAM).
Below are the two lines of code I added just above the code shown earlier, along with, of course, the packages loaded at the beginning of the script, which I can share.
no_cores <- availableCores() - 1
plan(multicore, workers = no_cores)
Do you have any idea why it seemed to work on my previous computer but not on the new one, when nothing has changed except the computer? Or what corrective action I could take?
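For what it is worth, a future plan() on its own does not parallelise a plain repeat loop: unless calcul_depense itself uses futures internally, the loop body also has to be dispatched through futures. In addition, plan(multicore) silently falls back to sequential execution on Windows and inside RStudio, which could explain the missing speed-up on the new machine. A hedged sketch of one way to rewrite the loop with future.apply, assuming calcul_depense and numero_profil_automatisation are defined as in the question:

library(future.apply)

plan(multisession, workers = availableCores() - 1)   # multisession also works on Windows

noms <- future_lapply(1:137, function(n) {
  operateur     <- numero_profil_automatisation[n, 1]
  profil_choisi <- numero_profil_automatisation[n, 3]
  couplage      <- numero_profil_automatisation[n, 4]
  subvention    <- numero_profil_automatisation[n, 5]
  type          <- numero_profil_automatisation[n, 6]
  resultat <- calcul_depense("2020", "2020", "1", "12", operateur,
                             profil_choisi, couplage, subvention, type, 4356, 0.05)
  nom <- paste("./resultat_csv/", operateur, "2020", type, couplage, subvention,
               profil_choisi, ".csv", sep = "_")
  write.csv2(resultat, nom)
  nom
}, future.seed = TRUE)   # reproducible parallel RNG, in case calcul_depense draws random numbers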

R stuck after trying to register clusters with registerDoParallel

I have a piece of R code which never freezes on my laptop, but sometimes (~once per 100 runs) it freezes immediately after "registerDoParallel(cl)" when running on the computational cluster. Ages may pass and it will not move any further. There is nothing complicated about the computational cluster: just a Linux machine with 32 cores and plenty of RAM. It can freeze even when I try to register a cluster with 1 core.
I have tried FORK and PSOCK clusters; it does not matter, it still hangs once every several runs. It is pretty difficult to get anything meaningful from the log files; it gets stuck immediately after this command:
library(foreach)
library(doParallel)
numberOfThreads <- 4  # may be any number from 1 to e.g. 5
no_cores <- min(detectCores() - 1, numberOfThreads)
cl <- makeCluster(no_cores)  # , type = "FORK")
registerDoParallel(cl)
print("This message will never be printed if it freezes")
Does anybody have an idea why it behaves like this?
UPD: I forgot to add that there are plenty of free cores and plenty of RAM when this happens. An obvious workaround is https://www.rdocumentation.org/packages/R.utils/versions/2.8.0/topics/withTimeout combined with repeated attempts to register the cluster again and again, but that is so ugly.
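For completeness, a rough sketch of that retry workaround, assuming the hang can actually be interrupted by R.utils::withTimeout (it cannot always, e.g. when R is blocked in a non-interruptible system call), and accepting that a half-created cluster may be leaked on timeout:

library(R.utils)
library(doParallel)

register_with_retry <- function(no_cores, timeout = 60, max_tries = 5) {
  for (i in seq_len(max_tries)) {
    cl <- tryCatch(
      withTimeout({
        cl_try <- makeCluster(no_cores)
        registerDoParallel(cl_try)
        cl_try                      # returned on success
      }, timeout = timeout, onTimeout = "error"),
      TimeoutException = function(ex) NULL,
      error = function(e) NULL
    )
    if (!is.null(cl)) return(cl)
    message("Cluster setup timed out, retrying (", i, "/", max_tries, ")")
  }
  stop("could not set up a cluster after ", max_tries, " attempts")
}

cl <- register_with_retry(min(detectCores() - 1, 4))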

beginCluster function takes a very long time

I'm trying to go parallel with a moving-window operation on a large RasterStack, using a function that calls overlay(). The issue is that setting up a cluster with raster::beginCluster takes a very long time (more than 18 hours so far) and has not succeeded yet. Any suggestions? I work on a remote instance with 125 GB of RAM and 32 cores. I have tried to set up the cluster with 31, 28, or only 4 cores, and none of these worked. Nor did beginCluster(31, type = "SOCK"). Setting up clusters with other packages, e.g. doParallel or snow, does work, though.
cheers,
Ben
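One hedged idea, given that makeCluster() from parallel/snow does work on the instance: create the PSOCK cluster yourself and pass it to clusterR(), assuming the installed raster version exposes the cl argument (check ?clusterR). The names s and f below are placeholders for the RasterStack and the overlay function from the question.

library(raster)
library(parallel)

cl <- makeCluster(4)                                        # known to work on this machine
out <- clusterR(s, overlay, args = list(fun = f), cl = cl)  # s, f: placeholders
stopCluster(cl)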

R: How do I permanently set the amount of memory R will use to the maximum for my machine?

I know that some version of this question has been addressed multiple times in the past, but I think this iteration of this widely shared problem is sufficiently distinct to justify its own response. I would like to permanently set the maximum memory available to R to the largest value that my machine can handle, i.e., not just for a single session. I am running 64-bit R on a Windows 7 machine with 6 GB of RAM.
Currently I am trying to convert a 10 GB Stata file into a .rds object. On similar smaller objects the compression in the .dta to .rds conversion has been by a factor of four or better, and I (rather surprisingly) have not had any trouble doing dplyr manipulation on objects of 2 to 3 GB (after compression), even when two of them plus intermediate work products are all in memory at once. This seems to conflict with my previous belief that the amount of physical RAM is the absolute upper limit on what R can handle, as I am fairly certain that between loaded .rds objects and various intermediate work products I have had more than 6 GB of undeleted objects lying around my workspace at one time.
I find conflicting statements about whether the maximum memory size is my actual RAM less OS demands, or my actual RAM, or my actual RAM plus an unknown (to me) amount of virtual RAM (subject to a potentially serious slowdown when you reach into virtual RAM). These file conversions are one-time (per file) jobs and I do not care if they are slow.
Looking at the base R help page on “Memory limits” and the help-pages for memory.size(), it seems that there are multiple distinct limits under Windows, relating to total memory used in a session, available to a single process, allocatable by malloc or contained in a single vector. The individual vectors in my file are only around eight million rows long.
memory.size and memory.limit both report current settings in the neighborhood of 6 GB. I got multiple warning messages saying that I was pressed up against that limit, but the actual error message was something like “cannot allocate vector of length 120 MB”.
So I think there are three distinct questions:
How do I determine the maximum possible value for each 64-bit R memory setting;
How many distinct memory settings do I need to make; and
How do I make them permanent, as opposed to set for a single session?
Following the advice of @Konrad below, I had this rather puzzling exchange with R/RStudio:
> memory.size()
[1] 424.85
> memory.size(max=TRUE)
[1] 454.94
> memory.size()
[1] 436.89
> memory.size(5000)
[1] 6046
Warning message:
In memory.size(5000) : cannot decrease memory limit: ignored
> memory.size()
[1] 446.27
The first three interactions seem to suggest that there is a hard memory limit on my machine of about 455 MB. The second-to-last one, on the other hand, appears to say that the memory limit is set at my RAM level, without any allowance for the OS and without using virtual memory. Then the last one goes back to claiming a limit of around 450 MB.
I just tried the recommendation here:
Increasing (or decreasing) the memory available to R processes
but with 6000 MB rather than 500; I'll report back on the result.
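For reference, a sketch of the knobs involved on Windows builds of R where memory.limit() still applies (from R 4.2 it no longer has any effect); the 16000 MB figure is just a placeholder:

memory.limit()               # current limit in MB (Windows only)
memory.limit(size = 16000)   # raise the limit; it cannot be lowered within a session

# To make the change effectively permanent, run it at every startup from ~/.Rprofile:
#   invisible(utils::memory.limit(size = 16000))
# Alternatively (Windows-specific startup options), launch R with
# --max-mem-size=16000M or set the environment variable R_MAX_MEM_SIZE.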

What affects the time to create a cluster using the parallel package?

I'm experiencing slowness when creating clusters using the parallel package.
Here is a function that just creates and then stops a PSOCK cluster, with n nodes.
library(parallel)
library(microbenchmark)
f <- function(n)
{
  cl <- makeCluster(n)
  on.exit(stopCluster(cl))
}
microbenchmark(f(2), f(4), times = 10)
## Unit: seconds
##  expr      min       lq   median       uq      max neval
##  f(2) 4.095315 4.103224 4.206586 5.080307 5.991463    10
##  f(4) 8.150088 8.179489 8.391088 8.822470 9.226745    10
My machine (a reasonably modern 4-core workstation running Win 7 Pro) takes about 4 seconds to create a two-node cluster and 8 seconds to create a four-node cluster. This struck me as too slow, so I tried the same profiling on a colleague's identically specced machine, and it took one and two seconds for the two tests, respectively.
This suggested I may have some odd configuration set up on my machine, or that there is some other problem. I read the ?makeCluster and socketConnection help pages, but did not see anything related to improving performance.
I had a look in the Windows Task Manager while the code was running: there was no obvious interference with anti-virus or other software, just an Rscript process running at ~17% (less than one core).
I don't know where to look to find the source of the problem. Are there any known causes of slowness with PSOCK cluster creation under Windows?
Is 8 seconds to create a 4-node cluster actually slow (by 2014 standards), or are my expectations too high?
To monitor what was happening, I installed and opened Process Monitor (HT @qethanm). I also exited most of the things in my system tray, like Dropbox, in order to generate less noise. (Though in the end, this didn't make a difference.)
I then re-ran a simplified version of the R code in the question, directly from R GUI (instead of an IDE).
microbenchmark(f(4), times = 5)
After some digging, I noticed that R GUI spawns an Rscript process for each node of the cluster it creates (see picture).
After many dead ends and wild goose chases, it occurred to me that perhaps these Rscript instances weren't vanilla R. I renamed my Rprofile.site file to hide it and repeated the benchmark.
This time, a 4 node cluster was created, on average, in just under a second.
For a four node cluster, the Rprofile.site file (and presumably the personal startup file, ~/.Rprofile, if it exists) gets read four times, which can slow things down considerably. Pass rscript_args = c("--no-init-file", "--no-site-file", "--no-environ") to makeCluster to avoid this behaviour.
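A short sketch of that fix, so the worker Rscript processes skip the site and user startup files:

library(parallel)

cl <- makeCluster(4, rscript_args = c("--no-init-file", "--no-site-file", "--no-environ"))
# ... parallel work ...
stopCluster(cl)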
