Jupyter notebook on Ubuntu doesn't free memory

I have a piece of code (Python 3.7) in a Jupyter notebook cell that generates a pandas data frame (df) containing numpy arrays.
I am checking the memory consumption of the df simply by looking at the System Monitor app preinstalled on Ubuntu.
The problem is that if I run the cell a second time, the memory consumption doubles, even though the df is assigned to the same variable.
If I run the same cell multiple times, the system runs out of memory and the kernel dies on its own.
Using del df or gc.collect() doesn't free the memory either.
Restarting the notebook kernel is the only way to free the memory.
In practice, I would expect the memory to stay roughly the same because I am just reassigning a new df to the same variable over and over again.
Indeed, the memory accumulates only if I run the code in the notebook on a Linux machine. If I run the same code from the terminal with python script.py, or if I run the very same notebook on macOS, the memory pressure does not change: I can run the same cell multiple times and the occupied memory stays stable (as expected).
Can you help me figure out where the problem is coming from and how to solve it?
P.S. Both Python and Jupyter are installed with Anaconda 2018.12 on Ubuntu 18.04.
I have asked the same question on the Ubuntu community since I am not sure this is strictly related to Python itself, but I have gotten no answers so far.

Related

Rstudio potential memory leak / background activity?

I’m having a lot of trouble working with Rstudio on a new PC. I could not find a solution searching the web.
When RStudio is running, it constantly eats up memory until it becomes unworkable. If I work on an existing project, it takes half an hour to an hour before it becomes impossible to work with. If I start a new project without loading any objects or packages, just writing scripts without even running them, it takes longer to reach that point, but it still does.
When I first start the program, the Task Manager already shows memory usage of 950-1000 MB (sometimes more), and as I work it climbs up to 6000 MB, at which point it is impossible to work with, as every activity is delayed and 'stuck'. Just to compare, on my old PC the Task Manager shows 100-150 MB while I work in the program. When I click the "Memory Usage Report" within RStudio, the "used by session" value is very small and the "used by system" value is almost at the maximum, yet RStudio is the only thing taking up the system memory on the PC.
Things I tried: installing older versions of both R and Rstudio, pausing my anti-virus program, changing compatibility mode, zoom on "100%". It feels like Rstudio is continuously running something in the background as the memory usage keeps growing (and quite quickly). But maybe it is something else entirely.
I am currently using the latest versions of R and RStudio (4.1.2 and 2021.09.0-351), on a PC with an Intel i7 processor, 64-bit, 16 GB RAM, Windows 10.
What should I look for at this point?
On Windows, there are several typical memory or CPU issues with RStudio. In my answer, I explain how the RStudio interface itself uses memory and CPU as soon as you open a project (e.g., when RStudio shows you some .Rmd files). The memory / CPU cost associated with the computation itself is not covered (i.e., performance issues while executing a line of code are not covered).
When working on 'long' .Rmd files within RStudio on Windows, the CPU and/or memory usage sometimes gets very high and increases progressively (e.g., because of a process named 'QtWebEngineProcess'). To solve the problem caused by long Rmd files loaded within an RStudio session, you should:
Pay attention to which RStudio processes consume memory while it scans your code (i.e., disable or enable options in RStudio's 'Global Options' menu). For example, try disabling inline display (Tools => Global Options => R Markdown => Show equation and image preview => Never). This post put me on the track of considering that memory / CPU leaks are sometimes due to RStudio itself, not the data or the code.
Set up a bookdown project in order to split your large Rmd file into several smaller Rmd files. See here.
As a bonus step, check whether there is a conflict among the loaded packages with the command tidyverse_conflicts() (a quick sketch follows below), but that is already a 'computing problem' (not covered here).
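A minimal sketch of that bonus check, assuming the tidyverse package is installed (the command is the one named above):
library(tidyverse)       # attaches the core tidyverse packages
tidyverse_conflicts()    # prints which functions mask, or are masked by, other packages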

Notebook instance running R with a GPU

I am new to cloud computing and GCP. I am trying to create a notebook instance running R with a GPU. I got a basic instance with 1 core and 0 GPUs to start, and I was able to execute some code, which was cool. When I try to create an instance with a GPU, I keep getting all sorts of errors about something called live migration, or that there are no resources available, etc. Can someone tell me how to start an R notebook instance with a GPU? It can't be this difficult.
CRAN (the Comprehensive R Archive Network) doesn't support GPUs. However, this link might help you set up a notebook instance running R with a GPU: you need a machine with the Nvidia GPU drivers installed, then install R and JupyterLab, and after that compile the R packages that require it for use with GPUs.
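One step that often trips people up is making the R kernel visible to Jupyter. A minimal sketch, assuming R and JupyterLab are already installed on the GPU machine; IRkernel is my suggestion here, not something the answer names:
install.packages("IRkernel")         # R kernel for Jupyter
IRkernel::installspec(user = FALSE)  # register the kernel system-wide so JupyterLab can find it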

How to replicate the package check time performed on CRAN?

I've been trying to reduce the check time on a package I am submitting to CRAN. On my local machines, check time is somewhere between a minute (i7 CPU) and 2 minutes (i5 CPU). However, CRAN reviewers keep pointing out the check time is over 10 minutes. The only way I could find to reproduce such long check times is by uploading my package to http://win-builder.r-project.org/, where it indeed takes > 600 s to check.
I wish I could reproduce this check time locally so I am not dependent on a remote solution. The only differences I can see between win-builder and my local machine are the OS (Windows vs. Linux) and the fact that win-builder seems to be doing multi-arch checks (i386 and x64).
I am not sure how to reproduce this locally. I have tried R CMD check with seemingly relevant switches like --multiarch and --force-multiarch, but it doesn't seem to do anything differently. I guess I have to install some extra packages like r-cran-i386 or whatever, but I couldn't find anything of the sort in my repositories ("R" can be such a PITA of a search expression), and the instructions in README files like the one at https://cran.r-project.org/bin/linux/ubuntu/ didn't get me far enough.
I am already using --as-cran, and am aware of solutions like this, though I think installing R i386 on a separate VM containing a 32-bit OS defeats the purpose of what I am trying to accomplish.
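Not a local reproduction, but a scriptable way to trigger the same win-builder run (and its >600 s check) without going through the upload form. A sketch, assuming the devtools package is installed and the package directory is the current working directory:
devtools::check_win_release()  # submit the package to win-builder, checked against R release
devtools::check_win_devel()    # same, against R-devel; results are emailed to the maintainer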

Crashing When knitting R Markdown under Linux

Ubuntu 16.04 LTS, R Version: 3.4.3, R Studio: Version 1.1.383
I'm playing around and learning R Markdown this afternoon. I am not doing any intensive data analysis. I am knitting my R Markdown into HTML with the following command rmarkdown::render("document.Rmd").
About every half an hour, my Ubuntu GNOME session almost totally freezes. I can sort of move the mouse cursor around, and every several minutes I'm presented with a brief window of time in which the computer works again before going back into a deep freeze. I'm not running any other programs.
I've kept my System Monitor open and noticed that rsession and rstudio usually use ~200 MiB of memory. When the computer freezes, rsession rises to ~4 GiB, and this happens directly after I issue the rmarkdown::render("document.Rmd") command in RStudio.
I ran sudo apt update and sudo apt upgrade. What else should I do? Do I update the Linux kernel? Upgrade RStudio? Submit a bug report? Is this a memory leak (and what is that)?
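One hedged idea, not a confirmed fix: render in a throwaway R process so that whatever memory the render grabs is returned to the OS when that process exits. A sketch, assuming the callr package is installed; document.Rmd is the file from the question:
callr::r(function() rmarkdown::render("document.Rmd"))  # runs the render in a fresh R session and waits for it to finish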

Loading .dta data into R takes long time

Some confidential data is stored on a server and accessible for researchers via remote access.
Researchers can log in via some (I think Cisco) remote client and share virtual machines on the same host.
There is a 64-bit Windows running on the virtual machine.
The system appears to be optimized for Stata; I'm among the first to use the data with R. There is no RStudio installed on the client, just RGui 3.0.2.
And here's my problem: the data is saved in the Stata format (.dta), and I need to open it in R. At the moment I am doing:
library(foreign)  # read.dta() comes from the foreign package
read.dta(fileName, convert.factors = FALSE)[fields]
Loading a smaller file (around 200 MB) takes 1-2 minutes. However, loading the main file (3-4 GB) takes very long, longer than my patience lasted, and during that time the R GUI stops responding.
I can test my code on my own machine (OS X, RStudio) on a smaller data sample, and it works fine. Is this
because of OS X + RStudio, or only
because of the size of the file?
A colleague is using Stata on a similar file in their environment, and that works fine for him.
What can I do to improve the situation? Possible solutions I came up with were
Load the data into R differently (perhaps there is a way that doesn't require this much memory). I also have access to Stata; if all else fails, I could prepare the data in Stata, for example slice it into smaller pieces and reassemble them in R
Ask them to allocate more memory to my user of the VM (if that indeed is the issue)
Ask them to provide RStudio as a backend (even if that's not faster, perhaps it's less prone to crashes)
Certainly the size of the file is a prime factor, but the machine and configuration might be, too. Hard to tell without more information. You need a 64 bit operating system and a 64 bit version of R.
I don't imagine that RStudio will help or hinder the process.
If the process scales linearly, your big-data case will take (120 seconds) * (4096 MB / 200 MB) ≈ 2458 seconds, or roughly 40 minutes. Is that how long you waited?
The process might not be linear.
Was the processor making progress? If you checked CPU and memory, was the process still running? Was it doing a lot of page swaps?
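On the asker's idea of loading the data differently, a minimal sketch, assuming a current R where the haven package can be installed (fileName and fields are the asker's own objects):
library(haven)                                          # read_dta() is usually faster than foreign::read.dta()
df <- read_dta(fileName, col_select = all_of(fields))   # read only the needed columns from the .dta file
df <- as.data.frame(df)                                 # convert the returned tibble to a plain data frame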
