How to find heavy objects that are not stored in .GlobalEnv?

I am trying to find which objects are taking a lot of memory in my R session, but the problem is that the object might have been invisibly created with an unknown name in an unknown environment.
If the object is stored in .GlobalEnv or a known environment, I can easily use a strategy like ls(envir) + get() + object.size() (see lsos in this post for example) to list all objects and their sizes, allowing me to identify the heavy ones.
However, the object in question might not be stored in .GlobalEnv; it might be in some obscure environment implicitly created by an external package. How can I, in that case, identify which object is using a lot of RAM?
The best case study is ggplot2 creating .last_plot in a dedicated environment. Looking under the hood one can find that it is stored in environment(ggplot2:::.store$get), so one can find it and eventually remove it. But if I didn't know that location or name a priori, would there be a way to find that there is a heavy object called .last_plot somewhere in memory?
pryr::mem_used()
#> 34.7 MB
## example: implicit creation of heavy and hidden object by ggplot
path <- tempfile()
if (!file.exists(path)) {
  saveRDS(as.data.frame(matrix(rep(1, 1e07), ncol = 5)), path)
}
pryr::mem_used()
#> 34.9 MB
p1 <- ggplot2::ggplot(readr::read_rds(path), ggplot2::aes(V1))
rm(p1)
pryr::mem_used()
#> 127 MB
## Hidden object is not in .GlobalEnv
ls(.GlobalEnv, all.names = TRUE)
#> [1] "path"
## Here I know where to find it: environment(ggplot2:::.store$get)
ls(all.names = TRUE, envir = environment(ggplot2:::.store$get))
#> [1] ".last_plot"
pryr::object_size(get(".last_plot", environment(ggplot2:::.store$get))$data)
#> 80 MB
## But how could I have found this otherwise?
Created on 2020-11-03 by the reprex package (v0.3.0)

I don't think there's any existing way to do this. If you combine @AllanCameron's answer with my comment, where you'd also run ls(y) for environments y calculated as
ns <- loadedNamespaces()
for (x in ns) {
  y <- loadNamespace(x)
  # look at the size of everything in y
}
you still won't find all the environments. I think you could do it if you also examined every object that might contain a reference to an environment (e.g. every function, formula, list, and various exotic objects) but it would be tricky not to miss something or count things more than once.
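As a rough sketch of that idea (an illustration only, not exhaustive: formulas, R6 objects and other structures also carry environments, and shared environments may be counted more than once), one could walk the closures in each loaded namespace and size any environment they capture that is not the namespace itself:
# Sketch: size environments captured by closures in loaded namespaces
env_sizes <- list()
for (x in loadedNamespaces()) {
  ns <- loadNamespace(x)
  for (nm in ls(ns, all.names = TRUE)) {
    obj <- get(nm, envir = ns)
    # only closures that capture something other than their own namespace
    if (is.function(obj) && !is.null(environment(obj)) &&
        !identical(environment(obj), ns)) {
      env_sizes[[paste(x, nm, sep = "::")]] <- pryr::object_size(environment(obj))
    }
  }
}
head(sort(unlist(env_sizes), decreasing = TRUE), 10)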
Edited to add: Actually, pryr::object_size is pretty smart at reporting on the environments attached to objects, so we'd get close by searching namespaces. For example, to find the top 20 objects:
pryr::mem_used()
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
#> 35 MB
path <- tempfile()
if (!file.exists(path)) {
  saveRDS(as.data.frame(matrix(rep(1, 1e07), ncol = 5)), path)
}
pryr::mem_used()
#> 35.2 MB
p1 <- ggplot2::ggplot(readr::read_rds(path), ggplot2::aes(V1))
rm(p1)
pryr::mem_used()
#> 127 MB
envs <- c(globalenv = globalenv(),
          sapply(loadedNamespaces(), function(ns) loadNamespace(ns)))
sizes <- lapply(envs, function(e) {
  objs <- ls(e, all = TRUE)
  sapply(objs, function(obj) pryr::object_size(get(obj, envir = e)))
})
head(sort(unlist(sizes), decreasing = TRUE), 20)
#> base..__S3MethodsTable__. utils..__S3MethodsTable__.
#> 96216872 83443704
#> grid..__S3MethodsTable__. ggplot2..__S3MethodsTable__.
#> 80945520 80636768
#> ggplot2..store methods..classTable
#> 80418936 10101152
#> graphics..__S3MethodsTable__. tools..check_packages
#> 9325608 5185880
#> compiler.inlineHandlers methods..genericTable
#> 3444600 2808440
#> Rcpp..__T__show:methods colorspace..__T__show:methods
#> 2474672 2447880
#> Rcpp..RcppClass Rcpp..__C__C++OverloadedMethods
#> 2127584 1990504
#> Rcpp..__C__RcppClass Rcpp..__C__C++Field
#> 1982576 1980176
#> Rcpp..__C__C++Constructor Rcpp..__T__$:base
#> 1979992 1939616
#> tools..install_packages Rcpp..__C__Module
#> 1904032 1899872
Created on 2020-11-03 by the reprex package (v0.3.0)
I don't know why those methods tables come out so large (I suspect it's because ggplot2 adds methods to those tables, so its environment gets captured); but somehow they are finding your object, because they aren't so big if I don't create it.
A hint about the issue is in the 5th object, listed as ggplot2..store (i.e. the object named .store in the ggplot2 namespace). That doesn't tell you to look in the environments of the functions inside .store, but at least it gets you started.
Second edit:
Here are some tweaks to make the output a bit more readable.
# Unlist first, so we can clean up the names
sizes <- unlist(sizes)
# Replace the first dot with :::
names(sizes) <- sub(".", ":::", names(sizes), fixed = TRUE)
# Remove internal R objects
keep <- !grepl(".__", names(sizes), fixed = TRUE)
sizes <- sizes[keep]
With these changes, the output from sort(sizes, decreasing = TRUE) starts out as
#>     ggplot2:::.store base:::.userHooksEnv      base:::.Options        utils:::Rprof 
#>             80418936             47855920             45016888             44958416 

If you do
unlist(lapply(search(), function(y) sapply(ls(y), function(x) object.size(get(x, pos = y)))))
you will get a complete list of all the objects in all the environments on your search path, together with their sizes. You can then sort these and find the offending objects.
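For instance (a sketch along those lines; note that base object.size() does not follow environments captured by closures the way pryr::object_size() does):
sizes <- unlist(lapply(search(), function(y) {
  sapply(ls(y), function(x) object.size(get(x, pos = y)))
}))
head(sort(sizes, decreasing = TRUE), 10)  # the largest objects on the search path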

Related

Send messages from within parallel function (parallel or future framework)

I would like to send messages from within a function to the R console during a parallel process using the parallel package or the future framework and pbapply::pblapply().
Here is a reprex that does not send any messages to the R console:
# libraries
library(pbapply)
library(stringi)
library(parallel)

# make function
fun_func <- function(x) {
  cat(paste0("hello world ", x))
  return(paste0("hello world ", x))
}

# get data
set.seed(23)
d <- stri_rand_strings(100, 2, '[a-z]')
names(d) <- d

# make cluster
cl <- parallel::makeCluster(3)

# export the function to the workers
clusterExport(cl, c("fun_func"))

# run function
res <- pblapply(cl = cl, X = d, FUN = fun_func)

# stop cluster
parallel::stopCluster(cl)

# show result
head(res)
#> $of
#> [1] "hello world of"
#>
#> $is
#> [1] "hello world is"
#>
#> $vl
#> [1] "hello world vl"
#>
#> $zz
#> [1] "hello world zz"
#>
#> $vz
#> [1] "hello world vz"
#>
#> $ws
#> [1] "hello world ws"
Created on 2022-12-07 with reprex v2.0.2
Update:
I just learned that it might be hard to get console messages in Windows/RStudio. However, logging the messages with ParallelLogger might be an option.
Unfortunately, I could not implement it.
So I would be happy to have a solution either for sending the messages to the console or for writing them to a file.
If you use a parallelization method on top of the future framework, e.g. future_lapply, furrr, foreach with doFuture, and soon also pbapply (https://github.com/psolymos/pbapply/issues/54), output from cat(), print(), message(), warning(), and so on in parallel workers is captured and re-emitted ("relayed") in your main R session as the parallel tasks complete. You can read about this in https://future.futureverse.org/articles/future-2-output.html.
If you want to see near-live output, that is, output produced while the parallel tasks are still running, then use the progressr package. It is designed to send near-live progress updates when using the Futureverse, and those updates can also include custom messages. For examples, see https://progressr.futureverse.org/#parallel-processing-and-progress-updates.
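As a minimal sketch of both mechanisms (assuming the future.apply and progressr packages are installed; the worker count and the messages are illustrative), applied to the example from the question:
library(future.apply)
library(progressr)

plan(multisession, workers = 3)

fun_func <- function(x) {
  message("hello world ", x)   # relayed to the main R session when the task completes
  paste0("hello world ", x)
}

# near-live progress messages while the workers are still running
res <- with_progress({
  p <- progressor(along = 1:6)
  future_lapply(letters[1:6], function(x) {
    p(sprintf("processing %s", x))  # custom message pushed back near-live
    fun_func(x)
  })
})

plan(sequential)  # shut down the workers
By default the progress updates are rendered as a progress bar; progressr::handlers() lets you choose how they are displayed.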

Compress output raster and parallelize gdalwarp from R

I would like to include -co options to compress the output raster when using gdalwarp from gdalUtilities in R.
I have tried several options (commented out in the code below), but I have not succeeded in generating a compressed raster.
gdalUtilities::gdalwarp(srcfile = paste0(source_path, "/mask_30.tif"),
                        dstfile = paste0(writing_path, "/mask_30_gdalwarp.tif"),
                        cutline = paste0(source_path, "/amazon.shp"),
                        crop_to_cutline = TRUE,
                        multi = TRUE,
                        wo = "NUM_THREADS = 32",
                        co = "COMPRESS = DEFLATE")
# Other attempts that also failed:
# co = c("COMPRESS = DEFLATE", "ZLEVEL = 9")
# co = "COMPRESS = DEFLATE", co = "ZLEVEL = 9"
Additionally, I would like to use the multithreaded warping implementation. I am including the -multi and -wo "NUM_THREADS = 16" options (my computer has 32 cores), but I have also not been able to decrease the runtime compared with the default -multi behaviour, which uses two cores.
Any suggestions for compression and parallelization?
Many thanks in advance.
1 - COMPRESSION
Please find below the solution to the file compression problem. To be honest, I have been confronted with the same problem as you and, at the time, I racked my brains... before finding the solution, which is quite simple (once you know it!): you must not put any spaces around the equals sign (i.e. "COMPRESS=DEFLATE", not "COMPRESS = DEFLATE").
So, please find below a small reprex.
Reprex
library(gdalUtilities)
library(stars) # Loaded just to have a '.tif' image for the reprex
# Import a '.tif' image from the 'stars' library
tif <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
# Write the image to disk (in your working directory)
write_stars(tif, "image.tif")
# Size of the image on disk (in bytes)
file.size("image.tif")
#> [1] 2950880
# Compress the image
gdalUtilities::gdalwarp(srcfile = "image.tif",
dstfile = "image_gdalwarp.tif",
co = "COMPRESS=DEFLATE")
# Size of the compressed image on disk (in bytes)
file.size("image_gdalwarp.tif")
#> [1] 937920 # The image has been successfully compressed.
As @MarkAdler said, there is not much difference between the default compression level (i.e. 6) and level 9. That said, please find below how to write the code to apply the desired compression level (still without spaces, and in a list):
gdalUtilities::gdalwarp(srcfile = "image.tif",
dstfile = "image_gdalwarp_Z9.tif",
co = list("COMPRESS=DEFLATE", "ZLEVEL=9"))
file.size("image_gdalwarp_Z9.tif")
#> [1] 901542
Created on 2022-02-09 by the reprex package (v2.0.1)
2 - PARALLELIZATION
For parallelization across the processor cores, you should not use multi = TRUE; the argument wo = "NUM_THREADS=4" (again without spaces ;-)) is enough on its own.
Just a clarification: I suspect you are confusing RAM with the number of cores. Computers are usually equipped with a 4- or 8-core processor; the 32 in your code probably refers to the 32 GB of RAM your computer has.
Reprex
library(gdalUtilities)
library(stars)
tif <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
write_stars(tif, "image.tif")
file.size("image.tif")
#> [1] 2950880
gdalUtilities::gdalwarp(srcfile = "image.tif",
dstfile = "image_gdalwarp_Z9_parallel.tif",
co = list("COMPRESS=DEFLATE", "ZLEVEL=9"),
wo = "NUM_THREADS=4") # Replace '4' by '8' if your processor has 8 cores
file.size("image_gdalwarp_Z9_parallel.tif")
#> [1] 901542
Created on 2022-02-09 by the reprex package (v2.0.1)
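Applied back to the original call from the question, the fix would look roughly like this (a sketch only: the paths, the cutline and the compression choices come from the question and answer above; adjust NUM_THREADS to your actual core count):
gdalUtilities::gdalwarp(srcfile = paste0(source_path, "/mask_30.tif"),
                        dstfile = paste0(writing_path, "/mask_30_gdalwarp.tif"),
                        cutline = paste0(source_path, "/amazon.shp"),
                        crop_to_cutline = TRUE,
                        wo = "NUM_THREADS=4",   # no spaces, and no multi = TRUE
                        co = list("COMPRESS=DEFLATE", "ZLEVEL=9"))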

How can I input a single additional parameter to disk.frame's inmapfn at readin?

According to the article https://diskframe.com/articles/ingesting-data.html, a good use case for inmapfn as part of csv_to_disk.frame(...) is date conversion. In my data I only know the name of the date column at runtime, and I would like to feed that name into a function that converts the column at read-in time. The problem is that no additional parameters seem to be passable to the inmapfn function beyond the chunk itself, and I can't hardcode the column name because it isn't known until runtime.
To clarify: inmapfn seems to run in its own environment, presumably to prevent data races and other parallelisation issues, but I know the variable won't be changed, so I am hoping there is some way to override this, as I can make sure it is safe.
I know the function I am calling works when called on an arbitrary dataframe.
I have provided a reproducible example below.
library(tidyverse)
library(disk.frame)
setup_disk.frame()

a <- tribble(~dates, ~val,
             "09feb2021", 2,
             "21feb2012", 2,
             "09mar2013", 3,
             "20apr2021", 4)
write_csv(a, "a.csv")

dates_col <- "dates"
tmp.df <- csv_to_disk.frame(
  "a.csv",
  outdir = file.path(tempdir(), "tmp.df"),
  in_chunk_size = 1L,
  inmapfn = function(chunk) {
    chunk[, sdate := as.Date(do.call(`$`, list(chunk, dates_col)), "%d%b%Y")]
  }
)
#> -----------------------------------------------------
#> Stage 1 of 2: splitting the file a.csv into smallers files:
#> Destination: C:\Users\joelk\AppData\Local\Temp\RtmpcFBBkr\file4a1876e87bf5
#> -----------------------------------------------------
#> Stage 1 of 2 took: 0.020s elapsed (0.000s cpu)
#> -----------------------------------------------------
#> Stage 2 of 2: Converting the smaller files into disk.frame
#> -----------------------------------------------------
#> csv_to_disk.frame: Reading multiple input files.
#> Please use `colClasses = ` to set column types to minimize the chance of a failed read
#> =================================================
#>
#> -----------------------------------------------------
#> -- Converting CSVs to disk.frame -- Stage 1 of 2:
#>
#> Converting 5 CSVs to 6 disk.frames each consisting of 6 chunks
#>
#> Error in do.call(`$`, list(chunk, dates_col)): object 'dates_col' not found
You can experiment with different backend and chunk_reader arguments. For example, if you set the backend to readr, the user-defined inmapfn function will have access to previously defined variables. Furthermore, readr does column type guessing and will automatically read Date columns if it recognizes the string format as a date (it would not recognize the format in your example data, however).
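A minimal sketch of that route (assuming the backend = "readr" argument of csv_to_disk.frame behaves as described; the file and column names are taken from the question's reprex):
dates_col <- "dates"
tmp_readr.df <- csv_to_disk.frame(
  "a.csv",
  outdir = file.path(tempdir(), "tmp_readr.df"),
  backend = "readr",
  inmapfn = function(chunk) {
    # with this backend, dates_col defined in the calling session is visible here
    chunk[[dates_col]] <- as.Date(chunk[[dates_col]], format = "%d%b%Y")
    chunk
  }
)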
If you don't want to use the readr backend for performance reasons, I would ask whether your example correctly represents your actual scenario: I'm not seeing the need to pass in the date column as a variable in the example you provided.
There is a working solution in the Just-in-time transformation section of the link you provided, and I'm not seeing any added complexity between that example and yours.
If you really need to use the default backend and chunk_reader plan AND you really need to send the inmapfn function a previously defined variable, you can wrap the csv_to_disk.frame call in a wrapper function:
library(tibble)   # for tribble()
library(disk.frame)
setup_disk.frame()

df <- tribble(~dates, ~val,
              "09feb2021", 2,
              "21feb2012", 2,
              "09mar2013", 3,
              "20apr2021", 4)
write.csv(df, file.path(tempdir(), "df.csv"), row.names = FALSE)

wrap_csv_to_disk <- function(col) {
  my_date_col <- col
  csv_to_disk.frame(
    file.path(tempdir(), "df.csv"),
    in_chunk_size = 1L,
    inmapfn = function(chunk, dates = my_date_col) {
      chunk[, dates] <- lubridate::dmy(chunk[[dates]])
      chunk
    })
}

date_col <- "dates"
df_disk_frame <- wrap_csv_to_disk(date_col)
str(collect(df_disk_frame)$dates)
#> Date[1:4], format: "2021-02-09" "2012-02-21" "2013-03-09" "2021-04-20"
I see. As a workaround, would it be possible to do something like this?
date_var <- known_at_runtime()
saveRDS(date_var, "some/path/date_var.rds")
a <- csv_to_disk.frame(files, inmapfn = function(chunk) {
  date_var <- readRDS("some/path/date_var.rds")
  # do the rest
})
I think letting inmapfn accept other options is doable; see https://github.com/xiaodaigh/disk.frame/issues/377 for tracking.

Replacing the attribute value of an htmltools::tag

Say I have the following tag:
library(htmltools)
t = div(name = 'oldname')
I can overwrite the 'name' attribute of this tag using t$attribs$name = 'newname', but I would prefer to use htmltools getters/setters. Does the package have a function that facilitates this?
Looking through the package manual, the only function that allows manipulating tag attributes is tagAppendAttributes, which only appends the new attribute value to the original:
t = tagAppendAttributes(t, name = 'newname')
t
#<div name="oldname newname"></div>
Does the absence of a helper function that overwrites the value of an attribute mean that tag attributes are not meant to be overwritten?
You're probably overthinking this. Look at the code for tagAppendAttributes:
tagAppendAttributes
#> function (tag, ...)
#> {
#>     tag$attribs <- c(tag$attribs, list(...))
#>     tag
#> }
All it does is take whatever you pass and write directly to tag$attribs. If you unclass your object you'll see it's just a list really:
unclass(t)
#> $name
#> [1] "div"
#>
#> $attribs
#> $attribs$name
#> [1] "oldname"
#>
#>
#> $children
#> list()
I can see why writing directly to an object's data member rather than using a setter might not feel right if you come from an object-oriented programming background, but this is clearly a "public" data member in an informal S3 class. Setting it directly is no more likely to break things than any other implementation.
If you really want to I suppose you could define a setter:
tagSetAttributes <- function(tag, ...) {tag$attribs <- list(...); tag}
tagSetAttributes(t, name = "new name")
#> <div name="new name"></div>

Give a character string to define a file dependency in drake

I am learning drake to define my analysis workflow, but I am having trouble getting data files recognized as dependencies.
I use the function file_in() inside drake_plan(), but it only works if I give the path to the file directly as a literal string. If I build the path with file.path() or store it in a variable, it doesn't work.
Examples:
# preparation
library(drake)
path.data <- "data"
dir.create(path.data)
write.csv(iris, file.path(path.data, "iris.csv"))
 Working plan:
# working plan
working_plan <-
  drake_plan(iris_data = read.csv(file_in("data/iris.csv")),
             strings_in_dots = "literals")
working_config <- make(working_plan)
vis_drake_graph(working_config)
This plan works fine, and the file data/iris.csv is considered a dependency.
 Not working plan:
# not working
notworking_plan <-
  drake_plan(iris_data = read.csv(file_in(file.path(path.data, "iris.csv"))),
             strings_in_dots = "literals")
notworking_config <- make(notworking_plan)
vis_drake_graph(notworking_config)
Here it is trying to read the file iris.csv instead of data/iris.csv.
Working but problem with dependency:
# working but "data/iris.csv" is not considered as a dependency
file.name <- file.path(path.data, "iris.csv")
notworking_plan <-
drake_plan(iris_data = read.csv(file_in(file.name)),
strings_in_dots = "literals")
notworking_config <- make(notworking_plan)
vis_drake_graph(notworking_config)
This last one works fine but the file is not considered a dependency, so drake doesn't re-run the plan if this file is changed.
So, is there a way to pass file dependencies to drake from variables?
Per tidyeval, if you add !! in front of the file.path(), it will be evaluated and not quoted.
Also, in the new version of drake, the strings_in_dots = "literals" argument is deprecated.
library(drake)
path.data <- "data"
dir.create(path.data)
write.csv(iris, file.path(path.data, "iris.csv"))
# now working
notworking_plan <-
  drake_plan(iris_data = read.csv(file_in(!!file.path(path.data, "iris.csv"))))
notworking_plan
#> # A tibble: 1 x 2
#> target command
#> <chr> <expr>
#> 1 iris_data read.csv(file_in("data/iris.csv"))
Created on 2019-05-08 by the reprex package (v0.2.1)
Following an answer from the developers on GitHub: the code inside file_in() is not evaluated, which is why it is not possible to use file.path() in it.
