I want to write a package with internal data, and my method is described here.
My DESCRIPTION file is:
Package: cancerProfile
Title: A collection of data sets of cancer
Version: 0.1
Authors@R: person("NavyCheng", email = "navycheng2020@gmail.com", role = c("aut", "cre"))
Description: This package contains some data sets of cancers, such as RNA-seq data, TF binding data and so on.
Depends: R (>= 3.4.0)
License: What license is it under?
Encoding: UTF-8
LazyData: true
and my project layout is like this:
cancerProfile.Rproj
NAMESPACE
LICENSE
DESCRIPTION
R/
data/
|-- prad.rna.count.rda
Then I install my package and load it:
> library(pryr)
> library(devtools)
> install_github('hcyvan/cancerProfile')
> library(cancerProfile)
> mem_used()
82.2 MB
> invisible(prad.rna.count)
> mem_used()
356 MB
> ls()
character(0)
> prad.rna.count[1:3,1:3]
TCGA.2A.A8VL.01A TCGA.2A.A8VO.01A TCGA.2A.A8VT.01A
ENSG00000000003.13 2867 1667 3140
ENSG00000000005.5 6 0 0
ENSG00000000419.11 1354 888 1767
> rm(prad.rna.count)
Warning message:
In rm(prad.rna.count) : object 'prad.rna.count' not found
My question is: why can't I ls() or rm() prad.rna.count, and how can I do this?
In your case you couldn't ls() or rm() the dataset because you never put it in your global environment. Consider the following:
# devtools::install_github("hcyvan/cancerProfile")
library(cancerProfile)
library(pryr)
mem_used()
#> 31.8 MB
data(prad.rna.count)
mem_used()
#> 32.2 MB
ls()
#> [1] "prad.rna.count"
prad.rna.count[1:3,1:3]
#> TCGA.2A.A8VL.01A TCGA.2A.A8VO.01A TCGA.2A.A8VT.01A
#> ENSG00000000003.13 2867 1667 3140
#> ENSG00000000005.5 6 0 0
#> ENSG00000000419.11 1354 888 1767
mem_used()
#> 305 MB
rm(prad.rna.count)
ls()
#> character(0)
mem_used()
#> 32.5 MB
Created on 2019-01-15 by the reprex package (v0.2.1)
Since I used data() rather than invisible(), I actually put the data into the global environment, allowing me to see it via ls() and remove it via rm(). The way I loaded the data (data()) didn't increase memory usage because it just returns a promise, but when I evaluated the promise via prad.rna.count[1:3,1:3], the memory usage shot up. Luckily, since I had a name pointing to the object by using data() rather than invisible(), when I used rm(prad.rna.count), R recognized there was no longer a name pointing to that object and released the memory. I'd check out http://adv-r.had.co.nz/memory.html#gc and http://r-pkgs.had.co.nz/data.html#data-data for more details.
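To see the promise mechanics in isolation, here is a minimal sketch using delayedAssign(), which behaves analogously to a lazily loaded package dataset (the name big is just for illustration):
library(pryr)
delayedAssign("big", rnorm(1e7))  # binds a promise, much like LazyData does
mem_used()                        # 'big' exists, but nothing is allocated yet
head(big)                         # forcing the promise materializes ~80 MB
mem_used()                        # memory jumps here
rm(big)                           # drop the last name pointing to the object
invisible(gc())                   # the collector can now release the memory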
I’m exploring a textual corpus and I would like to be able to separate words according to their grammatical type, for example to consider only verbs and nouns.
I use spaCyr to do lemmatization with the spacy_parse() function, and I have seen in the quanteda reference (https://quanteda.io/reference/as.tokens.html) that there is an as.tokens() function that lets me build a tokens object from the result of spacy_parse():
as.tokens(
x,
concatenator = "/",
include_pos = c("none", "pos", "tag"),
use_lemma = FALSE,
...
)
This way, I can get back something that looks like this (text is in French):
etu1_repres_1 :
[1] "OK/PROPN" ",/PUNCT" "déjà/ADV" ",/PUNCT" "je/PRON" "pense/VERB" "que/SCONJ"
[8] "je/PRON" "être/AUX" "influencer/VERB" "de/ADP" "par/ADP"
Let’s say I would like to separate the tokens and keep only tokens of type PRON and VERB.
Q1: How can I separate them from the other tokens to keep only:
etu1_repres_1 :
[1] "je/PRON" "pense/VERB" "je/PRON" "influencer/VERB"
Q2: How can I remove the "/PRON" or "/VERB" part of each token, so that I can build a document-feature matrix with only the lemmas?
Thanks a lot for your help,
Gabriel
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <-
as.tokens(list(etu1_repres_1 = c("OK/PROPN", ",/PUNCT", "déjà/ADV", ",/PUNCT",
"je/PRON", "pense/VERB", "que/SCONJ", "je/PRON",
"être/AUX", "influencer/VERB", "de/ADP", "par/ADP")))
# part 1
toks2 <- tokens_keep(toks, c("*/PRON", "*/VERB"))
toks2
#> Tokens consisting of 1 document.
#> etu1_repres_1 :
#> [1] "je/PRON" "pense/VERB" "je/PRON" "influencer/VERB"
# part 2
toks3 <- tokens_split(toks2, "/") |>
tokens_remove(c("PRON", "VERB"))
toks3
#> Tokens consisting of 1 document.
#> etu1_repres_1 :
#> [1] "je" "pense" "je" "influencer"
dfm(toks3)
#> Document-feature matrix of: 1 document, 3 features (0.00% sparse) and 0 docvars.
#> features
#> docs je pense influencer
#> etu1_repres_1 2 1 1
Created on 2022-08-19 by the reprex package (v2.0.1)
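As a side note, since you already have the spacy_parse() output, you can also filter on its pos column first and build the tokens object directly from the lemmas. A sketch, assuming parsed is the data frame returned by spacy_parse() with lemma = TRUE (the names parsed and parsed_sub are just for illustration):
library(quanteda)
# keep only pronouns and verbs, then tokenize the lemma column per document
parsed_sub <- subset(parsed, pos %in% c("PRON", "VERB"))
toks <- as.tokens(split(parsed_sub$lemma, parsed_sub$doc_id))
dfm(toks)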
Here is a reprex:
> pryr::mem_change(x<- 1:1e7)
Registered S3 method overwritten by 'pryr':
method from
print.bytes Rcpp
11.5 kB
> pryr::mem_change(rm(x))
592 B
>
My query is: when I do mem_change(rm(x)), I should get a negative number, since the memory used should decrease. Why do I get a positive 592 B?
# Trying to recreate Irina's code on my computer
> library(pryr)
Registered S3 method overwritten by 'pryr':
method from
print.bytes Rcpp
> mem_used()
37.2 MB
> mem_change(x<-1:1e7)
12.8 kB
> mem_used()
37.4 MB
> mem_change(rm(x)) # This should be negative, but it's not
592 B
> mem_used()
37.4 MB
>
You should call mem_used() first to see how much memory you are currently using.
Then, when you run mem_change(x <- 1:1e7), you extend your memory by the size of the vector x.
mem_change(rm(x)) will then just remove this vector and take you back to the initial memory size.
(Note: on R >= 3.5.0, 1:1e7 is stored as a compact ALTREP sequence, so creating it allocates almost nothing; that is why you saw only kilobyte-sized changes in both directions rather than the expected -40 MB.)
It will be helpful to read the pryr package documentation by Hadley Wickham. Below is what you should see when the vector is actually allocated:
> mem_used() # how much you use now
253 MB
> pryr::mem_change(x<- 1:1e7) # add 40MB
40 MB
> mem_used() # now you have 253 + 40 = 293 MB
293 MB
> mem_change(rm(x)) # deleting 40MB
-40 MB
> mem_used() # you should see the original memory size
253 MB
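For completeness, a short sketch of why the original numbers were so small (assuming R >= 3.5.0, where 1:1e7 is a compact ALTREP sequence that allocates almost nothing until it is materialized):
pryr::mem_change(x <- 1:1e7)       # only kilobytes: just the compact representation
pryr::mem_change(y <- rnorm(1e7))  # ~80 MB: a fully allocated double vector
pryr::mem_change(rm(y))            # now clearly negative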
I am having the weirdest bug with map_int from the purrr package.
# Works as expected
purrr::map_int(1:10, function(x) x)
#> [1] 1 2 3 4 5 6 7 8 9 10
# Why on earth is that not working?
purrr::map_int(1:10, function(x) 2*x)
#> Error: Can't coerce element 1 from a double to a integer
# or that?
purrr::map_int(1:10, round)
#> Error: Can't coerce element 1 from a double to a integer
Created on 2019-03-28 by the reprex package (v0.2.1)
I run R 3.5.2 in a rocker container (Debian) with the latest GitHub versions of everything:
sessioninfo::package_info("purrr")
#> package * version date lib source
#> magrittr 1.5.0.9000 2019-03-28 [1] Github (tidyverse/magrittr#4104d6b)
#> purrr 0.3.2.9000 2019-03-28 [1] Github (tidyverse/purrr#25d84f7)
#> rlang 0.3.2.9000 2019-03-28 [1] Github (r-lib/rlang#9376215)
#>
#> [1] /usr/local/lib/R/site-library
#> [2] /usr/local/lib/R/library
2*x is not an integer because 2 is not. Instead, do:
purrr::map_int(1:10, function(x) 2L*x)
The documentation from help(map) says
The output of .f will be automatically typed upwards, e.g. logical -> integer -> double -> character
It appears to be following the larger ordering given in help(c). For example, map_dbl(1:10, ~ complex(real = .x, imaginary = 1)) also produces an error:
NULL < raw < logical < integer < double < complex < character < list < expression
As you can see in that ordering, double-to-integer is a downward conversion. So, the function is designed to not do this kind of conversion.
The solution is to either write a function .f which outputs integer (or lower) classed objects (as in @Stéphane Laurent's answer), or just use as.integer(map(.x, .f)).
This is a kind of type-checking, which can be a useful feature for preventing programming mistakes.
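To summarize, a few equivalent ways around the coercion error, using the same toy input:
library(purrr)
map_int(1:10, function(x) 2L * x)         # make .f return integers
map_dbl(1:10, function(x) 2 * x)          # or keep the natural double type
as.integer(map(1:10, function(x) 2 * x))  # or coerce explicitly afterwards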
rjson::fromJSON() reads a file incorrectly while jsonlite::fromJSON() reads it fine. Here's a minimal example.
The file test.json contains:
{"name": "Sanjay",
"unit_price": 130848,
"amount": 11,
"up_to_data_sales": 45725}
jsonlite's fromJSON() outputs:
jsonlite::fromJSON("test.json")
$name
[1] "Sanjay"
$unit_price
[1] 130848
$amount
[1] 11
$up_to_data_sales
[1] 45725
But the same throws an error with the rjson package:
rjson::fromJSON("test.json")
Error in rjson::fromJSON("test.json") : parseTrue: expected to see 'true' - likely an unquoted string starting with 't'.
Why does this error occur?
And what is the reason the rjson package was launched when jsonlite existed?
Well:
stringdist::stringdist("rjson", "jsonlite")
## [1] 5
That's a modest difference to begin with.
However, your assertion seems to be amiss:
library(magrittr)
rjson::fromJSON('{"name": "Sanjay",
"unit_price": 130848,
"amount": 11,
"up_to_data_sales": 45725}') %>% str()
## List of 4
## $ name : chr "Sanjay"
## $ unit_price : num 130848
## $ amount : num 11
## $ up_to_data_sales: num 45725
jsonlite::fromJSON('{"name": "Sanjay",
"unit_price": 130848,
"amount": 11,
"up_to_data_sales": 45725}') %>% str()
## List of 4
## $ name : chr "Sanjay"
## $ unit_price : int 130848
## $ amount : int 11
## $ up_to_data_sales: int 45725
Apart from jsonlite using a more diminutive data type for the numbers, they both parse the JSON fine.
So there's an issue with your file that you failed to disclose in the question.
A further incorrect assertion
-rw-rw-r-- 1 bob staff 2690 Jul 30 2007 rjson_0.1.0.tar.gz
-rw-rw-r-- 1 bob staff 400196 Dec 3 2013 jsonlite_0.9.0.tar.gz
not to mention:
-rw-rw-r-- 1 bob staff 873843 Oct 4 2010 RJSONIO_0.3-1.tar.gz
rjson came first. (dir listings came from the CRAN mirror sitting next to me).
You can actually read about the rationale and impetus behind jsonlite here: https://arxiv.org/abs/1403.2805 (which I got off the CRAN page for jsonlite).
1) Why does the error occur? The error is due to a mistake in syntax:
rjson does not read from a file unless the file = argument is given, whereas jsonlite does not require it. Without file =, rjson::fromJSON("test.json") treats the string "test.json" itself as the JSON text; since it starts with a 't', the parser expects the literal true, which is exactly the error shown above.
# For example:
y <- rjson::fromJSON(file = "Input.json")
x <- jsonlite::fromJSON("Input.json")
2) What is the reason the rjson package was launched when jsonlite existed?
First, rjson was launched before jsonlite, and second, there is a difference in the way they parse nested structures.
For example, consider the following input:
{
"id": 1,
"prod_info": [
{
"product": "xyz",
"brand": "pqr",
"price": 500
},
{
"product": "abc",
"brand": "klm",
"price": 5000
}
]
}
prod_info in the above input is an array of two objects. jsonlite reads it as a data frame, while rjson reads it as a list of lists.
Outputs:
x
$id
[1] 1
$prod_info
product brand price
1 xyz pqr 500
2 abc klm 5000
y
$id
[1] 1
$prod_info
$prod_info[[1]]
$prod_info[[1]]$product
[1] "xyz"
$prod_info[[1]]$brand
[1] "pqr"
$prod_info[[1]]$price
[1] 500
$prod_info[[2]]
$prod_info[[2]]$product
[1] "abc"
$prod_info[[2]]$brand
[1] "klm"
$prod_info[[2]]$price
[1] 5000
class(x$prod_info)
[1] "data.frame"
class(y$prod_info)
[1] "list"
The question has already been answered, but regarding differences between the two packages, I got bitten by one recently: how empty dictionaries are handled.
With rjson
> rjson::fromJSON("[]")
list()
> rjson::fromJSON("{}")
list()
Whereas, with jsonlite:
> jsonlite::fromJSON("[]")
list()
> jsonlite::fromJSON("{}")
named list()
That is, with rjson, you can't tell the difference between an empty list and an empty dictionary.
The translation to JSON works with both, however; e.g., toJSON(structure(list(), names = character(0))) yields "{}".
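A short sketch of where this bites: the distinction is lost on a round-trip through rjson, but preserved by jsonlite:
rjson::toJSON(rjson::fromJSON("{}"))        # "[]"  (the empty dict came back as a list)
jsonlite::toJSON(jsonlite::fromJSON("{}"))  # {}    (the named empty list survives)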
How can I disable / suppress the "Updating loaded packages" popup window which keeps showing up during R package installation? I am happy to have it always answered "No", but I do not know how to make that work (I investigated the install.packages() arguments and did my googling, but did not find an answer).
Background: my goal is to compare the installation times of a large (2k) collection of packages. I want to run this overnight in a loop where in each iteration (1) I remove all but base-priority packages, and (2) I measure the installation time of a particular package. There must be no popup windows (which halt the process) for this to work.
sessionInfo when I start RStudio:
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1
>
You should consider a benchmarking harness, something akin to this:
#!/bin/bash
# Create file of all installed packages
Rscript -e 'writeLines(unname(installed.packages()[,1]), "installed-pkgs.txt")'
# Iterate over the file, benchmarking package load 3x (consider bumping this up)
while read -r pkg; do
echo -n "Benchmarking package [${pkg}]"
for iter in {1..3}; do
echo -n "."
Rscript --vanilla \
-e 'args <- commandArgs(TRUE)' \
-e 'invisible(suppressPackageStartupMessages(xdf <- as.data.frame(as.list(system.time(library(args[1], character.only=TRUE), FALSE)))))' \
-e 'xdf$pkg <- args[1]' \
-e 'xdf$iter <- args[2]' \
-e 'xdf$loaded_namespaces <- I(list(loadedNamespaces()))' \
-e 'saveRDS(xdf, file.path("data", sprintf("%s-%s.rds", args[1], args[2])))' \
"${pkg}" \
"${iter}"
done
echo ""
done <installed-pkgs.txt
I made a ~/projects/pkgbench directory with a data subdir and put the script above in ~/projects/pkgbench. With it you get:
a clean (vanilla) R session each run
3 iterations for each package (make it higher if you want)
one RDS file per iteration
the number of packages (including names) in the session namespace post-load in the RDS files
When it runs (from a non-RStudio terminal session on your macOS box) you get progress (one dot per iteration):
$ ./pkgbench.sh
Benchmarking package [abind]...
Benchmarking package [acepack]...
Benchmarking package [AER]...
Benchmarking package [akima]...
You can then do something like (I killed the benchmark after just a few pkgs):
library(hrbrthemes) # github/gitlab
library(tidyverse)
map_df(
list.files("~/projects/pkgbench/data", full.names = TRUE),
readRDS
) %>% tbl_df() %>% print() -> bench_df
## # A tibble: 141 x 8
## user.self sys.self elapsed user.child sys.child pkg iter loaded_namespaces
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <list>
## 1 0.00500 0.00100 0.00700 0. 0. abind 1 <chr [9]>
## 2 0.00600 0.00100 0.00700 0. 0. abind 2 <chr [9]>
## 3 0.00600 0.00100 0.00600 0. 0. abind 3 <chr [9]>
## 4 0.00500 0.00100 0.00600 0. 0. acepack 1 <chr [9]>
## 5 0.00600 0.001000 0.00800 0. 0. acepack 2 <chr [9]>
## 6 0.00500 0.00100 0.00600 0. 0. acepack 3 <chr [9]>
## 7 1.11 0.0770 1.19 0. 0. AER 1 <chr [36]>
## 8 1.04 0.0670 1.11 0. 0. AER 2 <chr [36]>
## 9 1.07 0.0720 1.15 0. 0. AER 3 <chr [36]>
## 10 0.136 0.0110 0.147 0. 0. akima 1 <chr [12]>
## # ... with 131 more rows
group_by(bench_df, pkg) %>%
summarise(
med_elapsed = median(elapsed),
ns_ct = length(loaded_namespaces[[1]])
) -> bench_sum
ggplot(bench_sum, aes("elapsed", med_elapsed)) +
geom_violin(fill = ft_cols$gray) +
ggbeeswarm::geom_quasirandom(color = ft_cols$yellow) +
geom_boxplot(color = "white", fill="#00000000", outlier.colour = NA) +
theme_ft_rc(grid="Y")
ggplot(bench_sum, aes(ns_ct, med_elapsed)) +
geom_point(color = ft_cols$yellow) +
geom_smooth(color = ft_cols$peach) + # should probably use something better than loess
theme_ft_rc(grid = "XY")
If you are going to run it overnight, make sure you disable all the "sleepy/idle" things macOS might do (disable any heavyweight screensavers, prevent it from putting disks to sleep, etc.).
Note that I suppressed package startup messages from printing. You may want to capture them with capture.output() instead, or do a comparison with and without suppression.
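For the capture variant, a sketch of the idea (swap something like this into the Rscript -e lines above; ggplot2 is just a stand-in package here):
# capture (rather than discard) startup messages so they can be inspected later
msgs <- capture.output(
  timing <- system.time(library("ggplot2", character.only = TRUE)),
  type = "message"
)
timing  # the load timing, as before
msgs    # whatever the package printed on the message stream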
library() also has all these parameter options:
library(
package,
help,
pos = 2,
lib.loc = NULL,
character.only = FALSE,
logical.return = FALSE,
warn.conflicts = TRUE,
quietly = FALSE,
verbose = getOption("verbose")
)
You may want to tweak those for various benchmarking runs as well.
I also only looked at the median of the elapsed time, i.e. "what the package load felt like to the user". Consider examining all of the system.time values that are in the data frame.
If your Mac is sufficiently beefy CPU-core-wise and you have a fast solid-state disk, you could consider using GNU parallel with this harness to speed up the timings. I'd definitely use more than 3 iterations per package if you do this, and be fairly conservative with the number of concurrent parallel runs.