Package environments are not working as expected in a pipeline

In a package I'm working on, I'm using environments to save and retrieve the labels of a dataframe.
In a magrittr pipeline, I want to save them in an environment variable which I would retrieve later.
However, I'm facing a problem: it seems as if the environment variables were not modified until the end of the pipeline.
Here is an example with the most useful functions:
devtools::install_github("DanChaltiel/crosstable", build_vignettes=TRUE)
library(crosstable) #for functions set_label() and get_label() but you can test
#with other label-management packages (Hmisc, expss...)

labels_env = rlang::new_environment()

save_labels = function(.tbl){
  labels_env$last_save = tibble(
    name=names(.tbl),
    label=get_label(.tbl)[.data$name]
  )
  invisible(.tbl)
}

get_last_save = function(){
  labels_env$last_save
}

import_labels = function(.tbl, name_from="name", label_from="label"){
  data_label = get_last_save()
  for(i in 1:nrow(data_label)){
    name = as.character(data_label[i, name_from])
    label = as.character(data_label[i, label_from])
    .tbl[name] = set_label(.tbl[name], label)
  }
  .tbl
}
This works exactly as intended, as the label for disp would be NULL otherwise:
library(dplyr)
library(crosstable)
save_labels(mtcars2)
mtcars2 %>%
  transmute(disp=as.numeric(disp)+1) %>% #removes the label attribute of disp
  import_labels() %>%
  crosstable(disp)
#> .id label variable value
#> 1 disp Displacement (cu.in.) Min / Max 72.1 / 473.0
#> 2 disp Displacement (cu.in.) Med [IQR] 197.3 [121.8;327.0]
#> 3 disp Displacement (cu.in.) Mean (std) 231.7 (123.9)
#> 4 disp Displacement (cu.in.) N (NA) 32 (0)
Created on 2021-01-26 by the reprex package (v0.3.0)
However, save_labels(mtcars2) returns mtcars2 invisibly so I'd like to be able to pipe the whole sequence. Unfortunately, this throws an error:
library(dplyr)
library(crosstable)
mtcars2 %>%
  save_labels() %>%
  transmute(disp=as.numeric(disp)+1) %>%
  import_labels() %>%
  crosstable(disp)
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index
Created on 2021-01-26 by the reprex package (v0.3.0)
Indeed, when using pipes, the environment variable is not set yet when we get to import_labels(). If I re-run this code, it won't throw any error but that would be misleading as it would refer to the previous value of labels_env$last_save.
My understanding of pipes is not good enough to get this working. Moreover, it seems to be specific to the package environment, as I could not reproduce this behavior in a plain R script.
Is there a way I can use pipes with such an environment variable inside a package?

This was actually caused by a breaking change when the package magrittr (which provides pipes) went from v1.5 to v2.0.
This was explained on the blog and on NEWS.md.
A more specific reproducible example can be found on this GitHub issue.
In the new magrittr version, the evaluation order has changed: pipe expressions are now evaluated lazily, so for side effects to happen in the correct order, you have to force the evaluation:
import_labels = function(.tbl, name_from="name", label_from="label"){
  force(.tbl) #force evaluation, so the upstream save_labels() runs first
  data_label = get_last_save()
  for(i in 1:nrow(data_label)){
    name = as.character(data_label[i, name_from])
    label = as.character(data_label[i, label_from])
    .tbl[name] = set_label(.tbl[name], label)
  }
  .tbl
}
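To see why force() is needed, here is a minimal sketch (not from the original answer; set_flag() and get_flag() are made-up helpers) of how magrittr 2.0's lazy evaluation can reorder side effects:

library(magrittr) # >= 2.0

e <- new.env()
set_flag <- function(x) { e$flag <- "set"; invisible(x) }    # side effect only
get_flag <- function(x) { message("flag is: ", e$flag); x }  # reads e$flag before forcing x

e$flag <- "unset"
1 %>% set_flag() %>% get_flag()
#> flag is: unset   (set_flag() has not run yet when e$flag is read)

get_flag_forced <- function(x) { force(x); message("flag is: ", e$flag); x }
e$flag <- "unset"
1 %>% set_flag() %>% get_flag_forced()
#> flag is: set     (force(x) evaluates the upstream pipeline steps first)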

Related

What is generating the error 'Can't subset `.data` outside of a data mask context' with 'dplyr'?

I have a huge shiny app which uses a huge package. I'm not the author of either of them and I'm a bit lost. A function (fermentationPlot) throws the error Can't subset `.data` outside of a data mask context:
Warning: Error in fermentationPlot: Can't subset `.data` outside of a data mask context.
185: <Anonymous>
173: dplyr::arrange
172: dplyr::mutate
171: as.data.frame
What could be the cause of this error? What does it mean? Below is the code block which generates it. I googled this error message and found that it can supposedly be fixed by downgrading dplyr, but I tried 1.0.10, 1.0.5 and 1.0.0, and the error always occurs.
plotInfo <- dplyr::left_join(
  x = dplyr::select(
    plotDefaults, -c(.data$templateName, .data$minValue, .data$maxValue)
  ),
  y = plotSettings,
  by = .data$dataName
) %>%
  dplyr::arrange(!is.na(.data$order), -.data$order) %>%
  dplyr::mutate(
    color = replace(.data$color, .data$color == "Blue", "Dark blue"),
    minValue = as.numeric(.data$minValue),
    maxValue = as.numeric(.data$maxValue)
  ) %>%
  as.data.frame()
The by argument of left_join() must be a character vector of column names; it is not evaluated inside a data mask, so the .data pronoun cannot be used there. Probably the author wanted to do
by = "dataName"
and not
by = .data$dataName
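For illustration, the corrected call would then be (everything else as in the question):

plotInfo <- dplyr::left_join(
  x = dplyr::select(
    plotDefaults, -c(.data$templateName, .data$minValue, .data$maxValue)
  ),
  y = plotSettings,
  by = "dataName"  # a plain character vector of column names
) # rest of the pipeline unchanged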

how to return column which has violated the rule (R Validate)

I'm new to R and trying to use the validate package. My question is: how do I return the exact column that is violating the rules I've outlined?
Using violating(), I can create a data frame with columns (the unique record ID and the location), but I also want to return the name of the column where the violation occurs. So ideally, a 3-column data frame with unique_ID, location and violating_indicator_name.
Is this possible? And if so, how?
Thanks :)
Edit
Editing because I realized that I could add rule descriptions. Is there a way I can add the description of the rule that was violated as a third column to the 'violated' output data frame?
Edit
library(validate)
#> Warning: package 'validate' was built under R version 4.1.3
remove(list = ls())
data(mtcars)
rules <- validator(cyl < 6, gear < 4)
description(rules)[1] <- "cyl is more than 6 and gear is more than 4."
confronted <- confront(mtcars, rules)
violated <- violating(mtcars[c("cyl", "gear")], confronted)
library(validate)
#> Warning: package 'validate' was built under R version 4.1.3
data(cars)
rules <- validator(speed < 25, dist < 100)
confronted <- confront(cars, rules)
violated <- violating(cars, confronted)
violated
#> speed dist
#> 49 24 120
#> 50 25 85
names(violated)
#> [1] "speed" "dist"
Created on 2022-04-08 by the reprex package (v2.0.1)
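One possible way to get the violated rule names as an extra column (a sketch, not from the original thread, built on values(), which returns the per-record, per-rule logical matrix):

library(validate)
data(cars)
rules <- validator(speed < 25, dist < 100)
confronted <- confront(cars, rules)

v <- values(confronted)  # logical matrix: one row per record, one column per rule (TRUE = satisfied)
bad <- which(rowSums(!v, na.rm = TRUE) > 0)

violated <- cars[bad, ]
violated$violated_rule <- apply(!v[bad, , drop = FALSE], 1, function(r)
  paste(colnames(v)[which(r)], collapse = ", "))
violated

If your rules carry descriptions, description(rules) (in rule order) can be matched against those rule names to build the description column the edit asks about.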

Getting different results from the same function inside and outside a mutate function call

Can someone explain to me why I get a different result when I run the convertToDisplayTime function inside mutate than when I run it on its own? The correct result is the one I obtain when I run it on its own. Also, why do I get these warnings? It feels like I might be passing the whole timeInSeconds column as an argument when I call convertToDisplayTime in the mutate function, but I'm not sure that I really understand the mechanics in play here.
library('tidyverse')
#> Warning: package 'tibble' was built under R version 4.1.2

convertToDisplayTime <- function(timeInSeconds){
  ## Takes a time in seconds and converts it
  ## to a xx:xx:xx string format
  if(timeInSeconds>86400){ #Not handling time over a day
    stop(simpleError("Enter a time below 86400 seconds (1 day)"))
  } else if(timeInSeconds>3600){
    numberOfMinutes = 0
    numberOfHours = timeInSeconds%/%3600
    remainingSeconds = timeInSeconds%%3600
    if(remainingSeconds>60){
      numberOfMinutes = remainingSeconds%/%60
      remainingSeconds = remainingSeconds%%60
    }
    if(numberOfMinutes<10){displayMinutes = paste0("0",numberOfMinutes)}
    else{displayMinutes = numberOfMinutes}
    remainingSeconds = round(remainingSeconds)
    if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
    else{displaySeconds = remainingSeconds}
    return(paste0(numberOfHours,":",displayMinutes,":", displaySeconds))
  } else if(timeInSeconds>60){
    numberOfMinutes = timeInSeconds%/%60
    remainingSeconds = timeInSeconds%%60
    remainingSeconds = round(remainingSeconds)
    if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
    else{displaySeconds = remainingSeconds}
    return(paste0(numberOfMinutes,":", displaySeconds))
  } else{
    return(paste0("0:",timeInSeconds))
  }
}
(df <- tibble(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10)) %>% mutate(displayTime = convertToDisplayTime(timeInSeconds)))
#> Warning in if (timeInSeconds > 86400) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 3600) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 60) {: the condition has length > 1 and only the
#> first element will be used
#> Warning in if (remainingSeconds < 10) {: the condition has length > 1 and only
#> the first element will be used
#> # A tibble: 4 x 2
#> timeInSeconds displayTime
#> <dbl> <chr>
#> 1 2710. 45:10
#> 2 2705. 45:5
#> 3 2692. 44:52
#> 4 2708. 45:8
convertToDisplayTime(2710.46)
#> [1] "45:10"
convertToDisplayTime(2705.04)
#> [1] "45:05"
convertToDisplayTime(2691.66)
#> [1] "44:52"
convertToDisplayTime(2708.10)
#> [1] "45:08"
Created on 2022-01-06 by the reprex package (v2.0.1)
As mentioned in the comments, the problem here is that your function is not vectorized: it works with a single value as input and outputs a single value. However, this does not work when the input is a vector of values, hence the condition has length > 1 warning you get:
1: Problem with `mutate()` column `displayTime`.
ℹ `displayTime = convertToDisplayTime(timeInSeconds)`.
ℹ the condition has length > 1 and only the first element will be used
Here, when you use dplyr::mutate, you're feeding a whole vector to your function, which is not designed to process it.
Several options you may consider:
1. The "fast and ugly" way:
df <- data.frame(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10))
## This one does not work
df %>% mutate(displayTime = convertToDisplayTime(timeInSeconds))
## This one works
df %>%
  rowwise() %>%
  mutate(displayTime = convertToDisplayTime(timeInSeconds)) %>%
  ungroup()
dplyr::rowwise() allows dplyr::mutate() to work on each row independently, rather than on whole columns. I assume this is the behavior you initially expected. dplyr::ungroup() reverts rowwise(), i.e. goes back to the default column-wise behavior.
I may be a little harsh on this one, but this is the kind of trick that I used back when I did not quite understand my way around dataframes and their manipulation...
2. Vectorize directly from your dplyr verbs:
df %>%
  mutate(displayTime = base::mapply(convertToDisplayTime, timeInSeconds))
## or
df %>%
  mutate(displayTime = purrr::map_chr(timeInSeconds, convertToDisplayTime))
Both options are similar.
3. Vectorize your function:
convertToDisplayTime_vec <- base::Vectorize(convertToDisplayTime)
# class(convertToDisplayTime_vec)
df %>% mutate(displayTime = convertToDisplayTime_vec(timeInSeconds))
## or
convertToDisplayTime_vec2 <- function(timeInSeconds_vec) {
  mapply(FUN = convertToDisplayTime, timeInSeconds_vec)
}
# class(convertToDisplayTime_vec2)
df %>%
mutate(displayTime = convertToDisplayTime_vec2(timeInSeconds))
# Still works on single variables!
# convertToDisplayTime_vec2(6475)
This is my favourite option: once it is implemented, you can use it on single values, vectors or dataframes alike, without worrying about it.
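As a complement (a sketch, not in the original answers): the function can also be rewritten to be natively vectorized, replacing the per-element branching with integer arithmetic and sprintf(), assuming inputs below 86400 seconds (sub-minute inputs get rounded here, which the original did not do):

convertToDisplayTime_v3 <- function(timeInSeconds) {
  stopifnot(all(timeInSeconds <= 86400)) # not handling time over a day
  secs <- as.integer(round(timeInSeconds))
  h <- secs %/% 3600
  m <- (secs %% 3600) %/% 60
  s <- secs %% 60
  ifelse(h > 0, sprintf("%d:%02d:%02d", h, m, s),
         ifelse(m > 0, sprintf("%d:%02d", m, s), sprintf("0:%d", s)))
}
convertToDisplayTime_v3(c(2710.46, 2705.04, 2691.66, 2708.10))
#> [1] "45:10" "45:05" "44:52" "45:08"

This matches the expected single-value outputs from the question while accepting whole vectors directly in mutate().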
A little documentation to dig deeper into the subject.
PS: As an aside, a little tip worth remembering: you may want to be careful when manipulating data.frame and tibble objects. Despite their similarity, they have slight differences, and some functions deal differently with one or the other, or actually convert one to the other without your noticing...

How to find heavy objects that are not stored in .GlobalEnv?

I am trying to find which objects are taking a lot of memory in my R session, but the problem is that the object might have been invisibly created with an unknown name in an unknown environment.
If the object is stored in .GlobalEnv or a known environment, I can easily use a strategy like ls(envir) + get() + object.size() (see for example lsos in this post) to list all objects and their size, allowing me to identify the heavy objects.
However, the object in question might not be stored in .GlobalEnv, but might be in some obscure environment implicitly created by an external package. How can in that case identify which object is using a lot of RAM?
The best case study is ggplot2 creating .last_plot in a dedicated environment. Looking under the hood one can find that it is stored in environment(ggplot2:::.store$get), so one can find it and eventually remove it. But if I didn't know that location or name a priori, would there be a way to find that there is a heavy object called .last_plot somewhere in memory?
pryr::mem_used()
#> 34.7 MB

## example: implicit creation of heavy and hidden object by ggplot
path <- tempfile()
if(!file.exists(path)){
  saveRDS(as.data.frame(matrix(rep(1,1e07), ncol=5)), path)
}
pryr::mem_used()
#> 34.9 MB
p1 <- ggplot2::ggplot(readr::read_rds(path), ggplot2::aes(V1))
rm(p1)
pryr::mem_used()
#> 127 MB

## Hidden object is not in .GlobalEnv
ls(.GlobalEnv, all.names = TRUE)
#> [1] "path"

## Here I know where to find it: environment(ggplot2:::.store$get)
ls(all.names = TRUE, envir = environment(ggplot2:::.store$get))
#> [1] ".last_plot"
pryr::object_size(get(".last_plot", environment(ggplot2:::.store$get))$data)
#> 80 MB
## But how could I have found this otherwise?
Created on 2020-11-03 by the reprex package (v0.3.0)
I don't think there's any existing way to do this. If you combine @AllanCameron's answer with my comment, where you'd also run ls(y) for the y environments calculated as
ns <- loadedNamespaces()
for (x in ns) {
  y <- loadNamespace(x)
  # look at the size of everything in y
}
you still won't find all the environments. I think you could do it if you also examined every object that might contain a reference to an environment (e.g. every function, formula, list, and various exotic objects) but it would be tricky not to miss something or count things more than once.
Edited to add: Actually, pryr::object_size is pretty smart at reporting on the environments attached to objects, so we'd get close by searching namespaces. For example, to find the top 20 objects:
pryr::mem_used()
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
#> 35 MB
path <- tempfile()
if(!file.exists(path)){
saveRDS(as.data.frame(matrix(rep(1,1e07), ncol=5)), path)
}
pryr::mem_used()
#> 35.2 MB
p1 <- ggplot2::ggplot(readr::read_rds(path), ggplot2::aes(V1))
rm(p1)
pryr::mem_used()
#> 127 MB
envs <- c(globalenv = globalenv(),
          sapply(loadedNamespaces(), function(ns) loadNamespace(ns)))
sizes <- lapply(envs, function(e) {
  objs <- ls(e, all = TRUE)
  sapply(objs, function(obj) pryr::object_size(get(obj, envir = e)))
})
head(sort(unlist(sizes), decreasing = TRUE), 20)
#> base..__S3MethodsTable__. utils..__S3MethodsTable__.
#> 96216872 83443704
#> grid..__S3MethodsTable__. ggplot2..__S3MethodsTable__.
#> 80945520 80636768
#> ggplot2..store methods..classTable
#> 80418936 10101152
#> graphics..__S3MethodsTable__. tools..check_packages
#> 9325608 5185880
#> compiler.inlineHandlers methods..genericTable
#> 3444600 2808440
#> Rcpp..__T__show:methods colorspace..__T__show:methods
#> 2474672 2447880
#> Rcpp..RcppClass Rcpp..__C__C++OverloadedMethods
#> 2127584 1990504
#> Rcpp..__C__RcppClass Rcpp..__C__C++Field
#> 1982576 1980176
#> Rcpp..__C__C++Constructor Rcpp..__T__$:base
#> 1979992 1939616
#> tools..install_packages Rcpp..__C__Module
#> 1904032 1899872
Created on 2020-11-03 by the reprex package (v0.3.0)
I don't know why those methods tables come out so large (I suspect it's because ggplot2 adds methods to those tables, so its environment gets captured); but somehow they are finding your object, because they aren't so big if I don't create it.
A hint about the issue is in the 5th object, listed as ggplot2..store (i.e. the object named .store in the ggplot2 namespace). It doesn't tell you to look in the environments of the functions in .store, but at least it gets you started.
Second edit:
Here are some tweaks to make the output a bit more readable.
# Unlist first, so we can clean up the names
sizes <- unlist(sizes)
# Replace the first dot with :::
names(sizes) <- sub(".", ":::", names(sizes), fixed = TRUE)
# Remove internal R objects
keep <- !grepl(".__", names(sizes), fixed = TRUE)
sizes <- sizes[keep]
With these changes, the output from sort(sizes, decreasing = TRUE) starts out as
ggplot2:::.store
80418936
base:::.userHooksEnv
47855920
base:::.Options
45016888
utils:::Rprof
44958416
If you do
unlist(lapply(search(), function(y) sapply(ls(y), function(x) object.size(get(x)))))
you will get a complete list of all the objects in all the environments on your search path, including their sizes. You can then sort these and find the offending objects.
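For example (a sketch along those lines), sorting that list to see the largest objects first; passing pos = y makes get() look in the right environment explicitly:

obj_sizes <- unlist(lapply(search(), function(y)
  sapply(ls(y), function(x) object.size(get(x, pos = y)))))
head(sort(obj_sizes, decreasing = TRUE), 10)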

simmer: reading resources from inside trajectory functions

I want to be able to modify the resource capacity inside a trajectory as a function of the queue length.
The following (simplified) code does not work: when I try to call get_mon_resources(simStore) inside the function, the code crashes with the error:
Error in run_(private$sim_obj, until) :
Expecting a single value: [extent=0].
Thank you for your help.
simStore <- simmer()

fUpdateNumberOfCashiers <- function() {
  dtLastRes <- simStore %>% get_mon_resources %>% tail(1)
  nCapacityNow <- dtLastRes$capacity # same result with get_capacity(simStore)
  nQueueNow <- dtLastRes$queue       # same result with get_queue_count(simStore)
  print(dtLastRes) # prints an empty data frame!
  return(5) # crashes here! (eventually 5 will be replaced with a more meaningful formula)
}

trajClient <- trajectory("Client's path") %>%
  log_("Arrived to cashier") %>%
  set_capacity("Cashier", value = fUpdateNumberOfCashiers) %>%
  seize("Cashier") %>%
  timeout(function() {rexp(1, 30)}) %>% # One Cashier processes 30 clients / hour
  release("Cashier") %>%
  log_(function(attr) { sprintf("In total spent %.2f", now(simStore) - attr["start_time"]) })

simStore <- simmer("Store") %>%
  add_resource("Cashier", 1) %>%
  add_generator("Store Clients", trajClient, function() {rexp(1, 120)}) %>% # 120 clients / hour
  run(until = nHoursObserved <- 1)
simStore
See the discussion related to troubleshooting this problem here: https://groups.google.com/forum/#!topic/simmer-devel/NgIikOpHpss
What causes the problem is that another package (lubridate) masks objects from simmer, as seen below:
Attaching package: ‘lubridate’

The following objects are masked from ‘package:simmer’:

    now, rollback
Once I replaced
library(simmer); library(lubridate);
with
library(lubridate); library(simmer);
the problem disappeared!
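As a side note (a sketch, not from the original answer): instead of relying on the order of the library() calls, the masked functions can be qualified explicitly, which makes the code robust to any attach order:

library(lubridate)
library(simmer)
# Whichever package is attached last wins the bare name `now`; explicit
# namespace prefixes side-step the masking entirely:
env <- simmer("Store")
simmer::now(env)   # simmer's simulation clock
lubridate::now()   # lubridate's wall-clock time

In the question's code, this would mean writing simmer::now(simStore) inside the log_() step.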
