mem_change in R should be negative on deleting a vector

Here is a reprex:
> pryr::mem_change(x<- 1:1e7)
Registered S3 method overwritten by 'pryr':
method from
print.bytes Rcpp
11.5 kB
> pryr::mem_change(rm(x))
592 B
>
My question is: when I call mem_change(rm(x)), shouldn't I get a negative number, since the memory used should decrease? Why do I get a positive 592 B?
# Trying to recreate Irina's code on my computer
> library(pryr)
Registered S3 method overwritten by 'pryr':
method from
print.bytes Rcpp
> mem_used()
37.2 MB
> mem_change(x<-1:1e7)
12.8 kB
> mem_used()
37.4 MB
> mem_change(rm(x)) # This should be negative, but it's not
592 B
> mem_used()
37.4 MB
>

Use mem_used() to see how much memory you are using now.
Then, when you run mem_change(x <- 1:1e7), you are extending your memory usage by the vector x.
mem_change(rm(x)) will just remove this vector and take you back to the initial memory size.
It will be helpful to read the documentation of the pryr package by Hadley Wickham.
> mem_used() # how much you use now
253 MB
> pryr::mem_change(x<- 1:1e7) # add 40MB
40 MB
> mem_used() # now you have 253 + 40 = 293 MB
293 MB
> mem_change(rm(x)) # deleting 40MB
-40 MB
> mem_used() # you should see the original memory size.
253 MB
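As a side note and minimal sketch (not from the original answers): in recent R versions 1:1e7 may be stored compactly (ALTREP), which is part of why the changes above are only a few kB; a fully materialized vector shows the expected sign more clearly. The figures in the comments are approximate and will vary by machine:
library(pryr)
mem_change(y <- rnorm(1e7))  # roughly +80 MB: 1e7 doubles are allocated
mem_change(rm(y))            # roughly -80 MB once the vector is released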

Related

Extracting ancillary data from an array in R

I have an object called data (representing a field of wind strengths at geographical locations) obtained by using some code to read in a .grib file:
> data
ecmf : u-component of wind
Time:
2020/07/09 z00:00 0-0 h
Domain summary:
601 x 351 domain
Projection summary:
proj= latlong
NE = ( 50 , 75 )
SW = ( -10 , 40 )
Data summary:
-89.06099 -50.08242 -41.13694 -43.42623 -34.77617 -25.03278
data is a 601 x 351 array of doubles:
> typeof(data)
[1] "double"
> is.array(data)
[1] TRUE
> dim(data)
[1] 601 351
but, as shown above, it also has extra information attached beyond the numerical values of the array elements (Time:, Projection summary, etc.). How do I extract these? Attempts such as data$time do not seem to work.
As suggested in the comments to the question, I was able to access the values I wanted using attributes(). attributes(data) returns a list of all the relevant elements.
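For reference, a minimal sketch of the pattern (the attribute name "time" below is hypothetical; check names(attributes(data)) to see what your .grib reader actually attaches):
names(attributes(data))   # list the names of the attached metadata
str(attributes(data))     # inspect everything at once
attr(data, "time")        # pull out a single attribute by (hypothetical) name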

classify a large collection of image files

I have a large collection of image files for a book, and the publisher wants a list where files are classified by "type" (greyscale graph, b/w halftone image, color image, line drawing, etc.). This is a hard problem in general, but perhaps I can do some of this automatically using image processing tools, e.g., ImageMagick with the R magick package.
I think ImageMagick is the right tool, but I don't really know how to use it for this purpose.
What I have is just a list of fig numbers & file names:
1.1 ch01-intro/fig/alcohol-risk.jpg
1.2 ch01-intro/fig/measels.png
1.3 ch01-intro/fig/numbers.png
1.4 ch01-intro/fig/Lascaux-bull-chamber.jpg
...
Can someone help get me started?
Edit: This was probably an ill-framed or overly broad question as initially stated. I thought that ImageMagick identify or the R magick::image_info() function could help, so the initial question perhaps should have been: "How to extract image information from a list of files [in R]". I can pose this separately, if not already asked.
An initial attempt at this gave me the following for my first images,
library(magick)
# initialize an empty data frame to hold the results of `image_info`
figinfo <- data.frame(
  format = character(),
  width = numeric(),
  height = numeric(),
  colorspace = character(),
  matte = logical(),
  filesize = numeric(),
  density = character(),
  stringsAsFactors = FALSE
)
for (i in seq_along(files)) {
  img  <- image_read(files[i])
  info <- image_info(img)
  figinfo[i, ] <- info
}
I get:
> figinfo
format width height colorspace matte filesize density
1 JPEG 661 733 sRGB FALSE 41884 72x72
2 PNG 838 591 sRGB TRUE 98276 38x38
3 PNG 990 721 sRGB TRUE 427253 38x38
4 JPEG 798 219 sRGB FALSE 99845 300x300
I conclude that this doesn't help much in answering the question I posed, of how to classify these images.
Edit 2: Before closing this question, the advice to look into direct use of ImageMagick identify was helpful: https://imagemagick.org/script/escape.php
In particular, the %[type] escape is closer to what I need. This is not exposed in magick::image_info(), so I may have to write a shell script or call system() in a loop.
For the record, here is how I can extract relevant attributes of these image files using identify directly.
# Get image characteristics via ImageMagick identify
# from: https://imagemagick.org/script/escape.php
#
# -format elements:
# %m image file format
# %f filename
# %[type] image type
# %k number of unique colors
# %h image height in pixels
# %r image class and colorspace
identify -format "%m,%f,%[type],%r,%k,%hx%w" imagefile
>identify -format "%m,%f,%[type],%r,%k,%hx%w" Quipu.png
PNG,Quipu.png,GrayscaleAlpha,DirectClass Gray Matte,16,449x299
The %[type] attribute takes me towards what I want.
To close this question:
In an R context, I was able to do this with system(..., intern = TRUE), as follows, with some manual fixups:
# Use identify directly via system()
# function to run identify for one file
get_info <- function(file) {
  cmd <- 'identify -quiet -format "%f,%m,%[type],%r,%h,%w,%x,%y"'
  info <- system(paste(cmd, file), intern = TRUE)
  unlist(strsplit(info, ","))
}
# This doesn't cause coercion to numeric
figinfo <- data.frame(
  filename = character(),
  format = character(),
  type = character(),
  class = character(),
  height = numeric(),
  width = numeric(),
  xres = numeric(),
  yres = numeric(),
  stringsAsFactors = FALSE
)
for (i in seq_along(files)) {
  info <- get_info(files[i])
  info[4] <- sub("DirectClass ", "", info[4])
  figinfo[i, ] <- info
}
figinfo$height <- as.numeric(figinfo$height)
figinfo$width  <- as.numeric(figinfo$width)
figinfo$xres   <- round(as.numeric(figinfo$xres))
figinfo$yres   <- round(as.numeric(figinfo$yres))
Then I have more or less what I want:
> str(figinfo)
'data.frame': 161 obs. of 8 variables:
$ filename: chr "mileyears4.png" "alcohol-risk.jpg" "measels.png" "numbers.png" ...
$ format : chr "PNG" "JPEG" "PNG" "PNG" ...
$ type : chr "Palette" "TrueColor" "TrueColorAlpha" "TrueColorAlpha" ...
$ class : chr "PseudoClass sRGB " "sRGB " "sRGB Matte" "sRGB Matte" ...
$ height : num 500 733 591 721 219 ...
$ width : num 720 661 838 990 798 ...
$ xres : num 72 72 38 38 300 38 300 38 28 38 ...
$ yres : num 72 72 38 38 300 38 300 38 28 38 ...
>
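As a rough next step (my own sketch, not from the original post), the %[type] values can be mapped onto the publisher's categories with a simple lookup; the category labels and the mapping below are assumptions and would still need manual review for cases such as line drawings:
classify_type <- function(type) {
  if (grepl("^(Grayscale|Bilevel)", type)) "greyscale / b&w"
  else if (grepl("^Palette", type)) "indexed colour"
  else if (grepl("^TrueColor", type)) "colour image"
  else "other"
}
figinfo$category <- sapply(figinfo$type, classify_type)
table(figinfo$category)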

How to list and remove the LazyData of an R package?

I want to write a package with internal data, and my method is described here
My DESCRIPTION file is:
Package: cancerProfile
Title: A collection of data sets of cancer
Version: 0.1
Authors@R: person("NavyCheng", email = "navycheng2020@gmail.com", role = c("aut", "cre"))
Description: This package contain some data sets of cancers, such as RNA-seq data, TF bind data and so on.
Depends: R (>= 3.4.0)
License: What license is it under?
Encoding: UTF-8
LazyData: true
and my project is like this:
cancerProfile.Rproj
NAMESPACE
LICENSE
DESCRIPTION
R/
data/
|-- prad.rna.count.rda
Then I install my package and load it:
> library(pryr)
> library(devtools)
> install_github('hcyvan/cancerProfile')
> library(cancerProfile)
> mem_used()
82.2 MB
> invisible(prad.rna.count)
> mem_used()
356 MB
> ls()
character(0)
> prad.rna.count[1:3,1:3]
TCGA.2A.A8VL.01A TCGA.2A.A8VO.01A TCGA.2A.A8VT.01A
ENSG00000000003.13 2867 1667 3140
ENSG00000000005.5 6 0 0
ENSG00000000419.11 1354 888 1767
> rm(prad.rna.count)
Warning message:
In rm(prad.rna.count) : object 'prad.rna.count' not found
My question is: why can't I ls() and rm() prad.rna.count, and how can I do this?
In your case you couldn't ls() or rm() the dataset because you never put it in your global environment. Consider the following:
# devtools::install_github("hcyvan/cancerProfile")
library(cancerProfile)
library(pryr)
mem_used()
#> 31.8 MB
data(prad.rna.count)
mem_used()
#> 32.2 MB
ls()
#> [1] "prad.rna.count"
prad.rna.count[1:3,1:3]
#> TCGA.2A.A8VL.01A TCGA.2A.A8VO.01A TCGA.2A.A8VT.01A
#> ENSG00000000003.13 2867 1667 3140
#> ENSG00000000005.5 6 0 0
#> ENSG00000000419.11 1354 888 1767
mem_used()
#> 305 MB
rm(prad.rna.count)
ls()
#> character(0)
mem_used()
#> 32.5 MB
Created on 2019-01-15 by the reprex package (v0.2.1)
Since I used data() rather than invisible(), I actually put the data into the global environment, allowing me to see it via ls() and remove it via rm(). The way I loaded the data (data()) didn't increase memory usage because it just returns a promise, but when I evaluated the promise via prad.rna.count[1:3,1:3], the memory usage shot up. Luckily, since I had a name pointing to the object by using data() rather than invisible(), when I used rm(prad.rna.count), R recognized there was no longer a name pointing to that object and released the memory. I'd check out http://adv-r.had.co.nz/memory.html#gc and http://r-pkgs.had.co.nz/data.html#data-data for more details.
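The promise behaviour described above can also be illustrated with base R's delayedAssign(), which is only a rough stand-in for what LazyData does (a sketch of my own, not from the original answer; the figures in the comments are approximate):
library(pryr)
delayedAssign("big", rnorm(1e7))  # creates a promise; the vector is not built yet
mem_used()                        # essentially unchanged
head(big)                         # forcing the promise allocates roughly 80 MB
mem_used()
rm(big); gc()                     # with the name gone, the memory can be reclaimed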

Is there any way to read a .dat file from movielens into RStudio?

I am trying to use Import Dataset in RStudio to read ratings.dat from movielens.
Basically it has this format:
1::1::5::978824268
1::1022::5::978300055
1::1028::5::978301777
1::1029::5::978302205
1::1035::5::978301753
So I need to replace :: with : or ' or whitespace, etc. I use Notepad++; it loads the file quite fast (compared to Notepad) and can view very big files easily. However, when I do the replacement, it shows some strange characters:
"LF"
From some research here, it seems this is \n (line feed or line break). But I do not know why these do not show when the file is loaded, and only appear after I do the replacement. And when I load the file into RStudio, it still detects "LF" as text rather than a line break, which causes an error when reading the data.
What is the solution for that? Thank you!
PS: I know there is Python code for converting this, but I don't want to use it; is there any other way?
Try this:
url <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
## this part is agonizingly slow
tf <- tempfile()
download.file(url,tf, mode="wb") # download archived movielens data
files <- unzip(tf, exdir=tempdir()) # unzips and returns a vector of file names
ratings <- readLines(files[grepl("ratings.dat$",files)]) # read ratings.dat file
ratings <- gsub("::", "\t", ratings)
# this part is much faster
library(data.table)
ratings <- fread(paste(ratings, collapse="\n"), sep="\t")
# Read 10000054 rows and 4 (of 4) columns from 0.219 GB file in 00:00:07
head(ratings)
# V1 V2 V3 V4
# 1: 1 122 5 838985046
# 2: 1 185 5 838983525
# 3: 1 231 5 838983392
# 4: 1 292 5 838983421
# 5: 1 316 5 838983392
# 6: 1 329 5 838983392
Alternatively (use the download code from jlhoward, though he also updated his code to not use built-in functions and switched to data.table while I wrote this; mine's still faster/more efficient :-)
library(data.table)
# i try not to use variable names that stomp on function names in base
URL <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
# this will be "ml-10m.zip"
fil <- basename(URL)
# this will download to getwd() since you prbly want easy access to
# the files after the machinations. the nice thing about this is
# that it won't re-download the file and waste bandwidth
if (!file.exists(fil)) download.file(URL, fil)
# this will create the "ml-10M100K" dir in getwd(). if using
# R 3.2+ you can do a dir.exists() test to avoid re-doing the unzip
# (which is useful for large archives or archives compressed with a
# more CPU-intensive algorithm)
unzip(fil)
# fast read and slicing of the input
# fread will only split on a single delimiter so the initial fread
# will create a few blank columns. the [] expression filters those
# out. the "with=FALSE" is part of the data.table inanity
mov <- fread("ml-10M100K/ratings.dat", sep=":")[, c(1,3,5,7), with=FALSE]
# saner column names, set efficiently via data.table::setnames
setnames(mov, c("user_id", "movie_id", "tag", "timestamp"))
mov
## user_id movie_id tag timestamp
## 1: 1 122 5 838985046
## 2: 1 185 5 838983525
## 3: 1 231 5 838983392
## 4: 1 292 5 838983421
## 5: 1 316 5 838983392
## ---
## 10000050: 71567 2107 1 912580553
## 10000051: 71567 2126 2 912649143
## 10000052: 71567 2294 5 912577968
## 10000053: 71567 2338 2 912578016
## 10000054: 71567 2384 2 912578173
It's quite a bit faster than built-in functions.
Small improvement to @hrbrmstr's answer:
mov <- fread("ml-10M100K/ratings.dat", sep=":", select=c(1,3,5,7))
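Putting the download step and the select= shortcut together, a minimal usage sketch (same assumption as above that the archive has been unzipped to ml-10M100K/ in the working directory):
library(data.table)
mov <- fread("ml-10M100K/ratings.dat", sep = ":", select = c(1, 3, 5, 7))
setnames(mov, c("user_id", "movie_id", "tag", "timestamp"))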

Memory leak in data.table grouped assignment by reference

I'm seeing odd memory usage when using assignment by reference by group in a data.table. Here's a simple example to demonstrate (please excuse the triviality of the example):
library(data.table)
N <- 1e6
dt <- data.table(id = round(rnorm(N)), value = rnorm(N))
gc()
for (i in seq(100)) {
  dt[, value := value + 1, by = "id"]
}
gc()
tables()
which produces the following output:
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 303909 16.3 597831 32.0 407500 21.8
Vcells 2442853 18.7 3260814 24.9 2689450 20.6
> for (i in seq(100)) {
+ dt[, value := value+1, by="id"]
+ }
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 315907 16.9 597831 32.0 407500 21.8
Vcells 59966825 457.6 73320781 559.4 69633650 531.3
> tables()
NAME NROW MB COLS KEY
[1,] dt 1,000,000 16 id,value
Total: 16MB
So about 440MB of used Vcells memory were added after the loop. This memory is not accounted for after removing the data.table from memory:
> rm(dt)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 320888 17.2 597831 32 407500 21.8
Vcells 57977069 442.4 77066820 588 69633650 531.3
> tables()
No objects of class data.table exist in .GlobalEnv
The memory leak seems to disappear when removing the by=... from the assignment:
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 312955 16.8 597831 32.0 467875 25.0
Vcells 2458890 18.8 3279586 25.1 2704448 20.7
> for (i in seq(100)) {
+ dt[, value := value+1]
+ }
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 322698 17.3 597831 32.0 467875 25.0
Vcells 2478772 19.0 5826337 44.5 5139567 39.3
> tables()
NAME NROW MB COLS KEY
[1,] dt 1,000,000 16 id,value
Total: 16MB
To summarize, two questions:
Am I missing something or is there a memory leak?
If there is indeed a memory leak, can anyone suggest a workaround that lets me use assignment by reference by group without the memory leak?
For reference, here's the output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.10
loaded via a namespace (and not attached):
[1] tools_3.0.2
UPDATE from Matt - Now fixed in v1.8.11. From NEWS:
Long outstanding (usually small) memory leak in grouping fixed. When
the last group is smaller than the largest group, the difference in
those sizes was not being released. Most users run a grouping query
once and will never have noticed, but anyone looping calls to grouping
(such as when running in parallel, or benchmarking) may have suffered,
#2648. Test added.
Many thanks to vc273, Y T and others.
From Arun ...
Why was this happening?
I wish I had come across this post before sitting down to work on this issue. Nevertheless, a nice learning experience. Simon Urbanek summarises the issue pretty succinctly: it's not a memory leak, but bad reporting of memory used/freed. I had the feeling this is what was happening.
What's the reason for this happening in data.table? This part is about identifying the portion of code in dogroups.c responsible for the apparent memory increase.
Okay, so after some tedious testing, I think I've managed to at least find the reason why this happens. Hopefully someone can help me get further from this post. My conclusion is that this is not a memory leak.
The short explanation is that this seems to be an effect of the use of the SETLENGTH function (from R's C interface) in data.table's dogroups.c.
In data.table, when you use by=..., for example,
set.seed(45)
DT <- data.table(x=sample(3, 12, TRUE), id=rep(3:1, c(2,4,6)))
DT[, list(y=mean(x)), by=id]
For id=1, the values of "x" (= c(1,2,1,1,2,3)) have to be picked. This means having to allocate memory for .SD (all columns not in by) for each by value.
To avoid this allocation for each group in by, data.table accomplishes this cleverly by allocating .SD once, with the length of the largest group in by (which here corresponds to id=1, length 6). Then, for each value of id, it re-uses the (over-)allocated .SD and, using the function SETLENGTH, just adjusts its length to the length of the current group. Note that, by doing this, no memory is actually being allocated, except for the single allocation made for the biggest group.
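As a quick illustration of the group sizes involved (my own aside, just re-running the toy example above):
library(data.table)
set.seed(45)
DT <- data.table(x = sample(3, 12, TRUE), id = rep(3:1, c(2, 4, 6)))
DT[, .N, by = id]   # group sizes are 2, 4 and 6, so .SD is allocated once with length 6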
But what seems strange is that when all the groups in by have the same number of elements, nothing special seems to happen in the gc() output. However, when they don't, gc() seems to report increasing usage in Vcells. This is in spite of the fact that no extra memory is being allocated in either case.
To illustrate this point, I've written some C code that mimics the SETLENGTH usage in data.table's dogroups.c.
// test.c
#include <R.h>
#define USE_RINTERNALS
#include <Rinternals.h>
#include <Rdefines.h>
int sizes[100];
#define SIZEOF(x) sizes[TYPEOF(x)]
// test function - no checks!
SEXP test(SEXP vec, SEXP SD, SEXP lengths)
{
    R_len_t i, j;
    char before_address[32], after_address[32];
    SEXP tmp, ans;
    PROTECT(tmp = allocVector(INTSXP, 1));
    PROTECT(ans = allocVector(STRSXP, 2));
    snprintf(before_address, 32, "%p", (void *)SD);
    for (i = 0; i < LENGTH(lengths); i++) {
        memcpy((char *)DATAPTR(SD), (char *)DATAPTR(vec), INTEGER(lengths)[i] * SIZEOF(tmp));
        SETLENGTH(SD, INTEGER(lengths)[i]);
        // do some computation here.. ex: mean(SD)
    }
    snprintf(after_address, 32, "%p", (void *)SD);
    SET_STRING_ELT(ans, 0, mkChar(before_address));
    SET_STRING_ELT(ans, 1, mkChar(after_address));
    UNPROTECT(2);
    return(ans);
}
Here vec is equivalent to any data.table dt, SD is equivalent to .SD, and lengths holds the length of each group. This is just a dummy program. Basically, for each value of lengths, say n, the first n elements are copied from vec onto SD. Then one can compute whatever one wants on this SD (which is not done here). For our purposes, the addresses of SD before and after the SETLENGTH operation are returned, to illustrate that no copy is being made by SETLENGTH.
Save this file as test.c and then compile it as follows from the terminal:
R CMD SHLIB -o test.so test.c
Now, open a new R-session, go to the path where test.so exists and then type:
dyn.load("test.so")
require(data.table)
set.seed(45)
max_len <- as.integer(1e6)
lengths <- as.integer(sample(4:(max_len)/10, max_len/10))
gc()
vec <- 1:max_len
for (i in 1:100) {
  SD <- vec[1:max(lengths)]
  bla <- .Call("test", vec, SD, lengths)
  print(gc())
}
Note that for each i here, .SD will be allocated a different memory location and that's being replicated here by assigning SD for each i.
By running this code, you'll find that 1) for each i the two values returned are identical to address(SD), and 2) the Vcells used (Mb) keeps increasing. Now, remove all variables from the workspace with rm(list=ls()) and then run gc(); you'll find that not all memory is being restored/freed.
Initial:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 332708 17.8 597831 32.0 467875 25.0
Vcells 1033531 7.9 2327578 17.8 2313676 17.7
After 100 runs:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 332912 17.8 597831 32.0 467875 25.0
Vcells 2631370 20.1 4202816 32.1 2765872 21.2
After rm(list=ls()) and gc():
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 341275 18.3 597831 32.0 467875 25.0
Vcells 2061531 15.8 4202816 32.1 3121469 23.9
If you remove the line SETLENGTH(SD, ...) from the C code and run it again, you'll find that there's no change in the Vcells.
Now, as to why SETLENGTH has this effect when grouping with non-identical group lengths, I'm still trying to understand - check out the link in the edit above.
