Send messages from within parallel function (parallel or future framework) - r

I would like to send messages from within a function to the R console during a parallel process using the parallel package or the future framework together with pbapply::pblapply().
Here is a reprex that does not send any messages to the R console:
# library
library(pbapply)
library(stringi)
library(parallel)
# make fun
fun_func <- function(x){
  cat(paste0("hello world ", x))
  return(paste0("hello world ", x))
}
# get data
set.seed(23)
d <- stri_rand_strings(100, 2, '[a-z]')
names(d) <- d
# make cluster
cl <- parallel::makeCluster(3)
# load func
clusterExport(cl, c("fun_func"))
# run function
pblapply(cl=cl,X=d,FUN=fun_func) -> res
# stop cluster
parallel::stopCluster(cl)
# show res
head(res)
#> $of
#> [1] "hello world of"
#>
#> $is
#> [1] "hello world is"
#>
#> $vl
#> [1] "hello world vl"
#>
#> $zz
#> [1] "hello world zz"
#>
#> $vz
#> [1] "hello world vz"
#>
#> $ws
#> [1] "hello world ws"
Created on 2022-12-07 with reprex v2.0.2
Update:
I just learned that it might be hard to get console messages in Windows/RStudio. However, logging the messages with ParallelLogger might be an option.
Unfortunately, I could not implement it.
So I would be happy to have a solution either for sending the messages to the console or for writing them to a file.
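For what it's worth, here is an untested sketch of the ParallelLogger idea, based on its vignette (the log file name and the numberOfThreads value are illustrative assumptions):
library(ParallelLogger)
# send log events to a file in addition to the console
addDefaultFileLogger("log.txt")  # assumed file name
cluster <- makeCluster(numberOfThreads = 3)
res <- clusterApply(cluster, d, function(x) {
ParallelLogger::logInfo("hello world ", x)  # log events are gathered by the main thread
paste0("hello world ", x)
})
stopCluster(cluster)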

If you use a parallelization method built on top of the future framework, e.g. future_lapply, furrr, foreach with doFuture, and soon also pbapply (https://github.com/psolymos/pbapply/issues/54), then output from cat(), print(), message(), warning(), and so on in parallel workers is captured and truly re-output ("relayed") in your main R session as the parallel tasks complete. You can read about this in https://future.futureverse.org/articles/future-2-output.html.
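For example, a minimal sketch with future.apply, reusing fun_func and d from the question:
library(future.apply)
plan(multisession, workers = 3)
res <- future_lapply(d, fun_func)  # cat() output from the workers is relayed in the main session
plan(sequential)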
If you want to see near-live output, that is, output produced while the parallel tasks are still running, you can use the progressr package. It is designed to send near-live progress updates when using the Futureverse, and those updates can also include custom messages. For examples, see https://progressr.futureverse.org/#parallel-processing-and-progress-updates.
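A minimal sketch of the progressr approach, again reusing fun_func and d from the question (which handler renders the message text varies; handlers("cli") is one assumed choice):
library(future.apply)
library(progressr)
plan(multisession, workers = 3)
handlers("cli")  # a handler that can display the message text
with_progress({
p <- progressor(along = d)
res <- future_lapply(d, function(x) {
p(sprintf("hello world %s", x))  # near-live progress update
fun_func(x)
})
})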

Related

Click button using R + httr

I'm trying to scrape randomly generated names from a website.
library(httr)
library(rvest)
url <- "https://letsmakeagame.net//tools/PlanetNameGenerator/"
mywebsite <- read_html(url) %>%
html_nodes(xpath="//div[contains(@id,'title')]")
However, that does not work. I'm assuming I have to «click» the «generate» button before extracting the content. Is there a simple way (without RSelenium) to achieve that?
Something similar to:
POST(url,
body = list("EntryPoint.generate()" = T),
encode = "form") -> res
res_t <- content(res, as="text")
Thanks!
rvest isn't much help here, as planet names are not requested from a remote service; they are generated locally with JavaScript, which is what the EntryPoint.generate() call does. A relatively simple way is to use chromote, though its session/process closing seems kind of messy at the moment:
library(chromote)
b <- ChromoteSession$new()
{
b$Page$navigate("https://letsmakeagame.net/tools/PlanetNameGenerator")
b$Page$loadEventFired()
}
# call EntryPoint.generate(), read result from <p id="title"></p> element,
# replicate 10x
replicate(10, b$Runtime$evaluate('EntryPoint.generate();document.getElementById("title").innerText')$result$value)
#> [1] "Torade" "Ukiri" "Giconerth" "Dunia" "Brihoria"
#> [6] "Tiulaliv" "Giahiri" "Zuthewei 4A" "Elov" "Brachomia"
b$close()
#> [1] TRUE
b$parent$close()
#> Error in self$send_command(msg, callback = callback_, error = error_, : Chromote object is closed.
b$parent$get_browser()$close()
#> [1] TRUE
Created on 2023-01-25 with reprex v2.0.2

Report extra information from a test_that block when failing

I want to cat() some information to the console in case a test fails (I'm getting confident this won't happen, but I can't prove it won't) so I can investigate the issue.
Now I have code that is approximately like this:
testthat::test_that('Maybe fails', {
seed <- as.integer(Sys.time())
set.seed(seed)
testthat::expect_true(maybe_fails(runif(100L)))
testthat::expect_equal(long_vector(runif(100L)), target, tol = 1e-8)
if (failed()) {
cat('seed: ', seed, '\n')
}
})
Unfortunately, failed() doesn't exist.
The return values of expect_*() don't seem useful; they just return the actual argument.
I'm considering just checking again with all.equal(), but that is a pretty ugly duplication.
Instead of using cat, you could use the info argument that testthat and its reporters manage for all expect_*() functions (an argument kept for compatibility reasons):
library(testthat)
testthat::test_that("Some tests",{
testthat::expect_equal(1,2,info=paste('Test 1 failed at',Sys.time()))
testthat::expect_equal(1,1,info=paste('Test 2 failed at',Sys.time()))
})
#> -- Failure (<text>:5:3): Some tests --------------------------------------------
#> 1 not equal to 2.
#> 1/1 mismatches
#> [1] 1 - 2 == -1
#> Test 1 failed at 2021-03-03 17:25:37
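Applied to the original example, that could look like the following (maybe_fails, long_vector, and target are the asker's own objects):
testthat::test_that('Maybe fails', {
seed <- as.integer(Sys.time())
set.seed(seed)
testthat::expect_true(maybe_fails(runif(100L)), info = paste('seed:', seed))
testthat::expect_equal(long_vector(runif(100L)), target, tolerance = 1e-8, info = paste('seed:', seed))
})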

How to find heavy objects that are not stored in .GlobalEnv?

I am trying to find which objects are taking a lot of memory in my R session, but the problem is that the object might have been invisibly created with an unknown name in an unknown environment.
If the object is stored in .GlobalEnv or another known environment, I can easily use a strategy like ls(envir) + get() + object.size() (see lsos in this post for example) to list all objects and their sizes, allowing me to identify the heavy objects.
However, the object in question might not be stored in .GlobalEnv; it might be in some obscure environment implicitly created by an external package. How can I, in that case, identify which object is using a lot of RAM?
The best case study is ggplot2 creating .last_plot in a dedicated environment. Looking under the hood one can find that it is stored in environment(ggplot2:::.store$get), so one can find it and eventually remove it. But if I didn't know that location or name a priori, would there be a way to find that there is a heavy object called .last_plot somewhere in memory?
pryr::mem_used()
#> 34.7 MB
## example: implicit creation of heavy and hidden object by ggplot
path <- tempfile()
if(!file.exists(path)){
saveRDS(as.data.frame(matrix(rep(1,1e07), ncol=5)), path)
}
pryr::mem_used()
#> 34.9 MB
p1 <- ggplot2::ggplot(readr::read_rds(path), ggplot2::aes(V1))
rm(p1)
pryr::mem_used()
#> 127 MB
## Hidden object is not in .GlobalEnv
ls(.GlobalEnv, all.names = TRUE)
#> [1] "path"
## Here I know where to find it: environment(ggplot2:::.store$get)
ls(all.names = TRUE, envir = environment(ggplot2:::.store$get))
#> [1] ".last_plot"
pryr::object_size(get(".last_plot", environment(ggplot2:::.store$get))$data)
#> 80 MB
## But how could I have found this otherwise?
Created on 2020-11-03 by the reprex package (v0.3.0)
I don't think there's any existing way to do this. If you combine @AllanCameron's answer with my comment, where you'd also run ls(y) for the y environments calculated as
ns <- loadedNamespaces()
for (x in ns) {
y <- loadNamespace(x)
# look at the size of everything in y
}
you still won't find all the environments. I think you could do it if you also examined every object that might contain a reference to an environment (e.g. every function, formula, list, and various exotic objects) but it would be tricky not to miss something or count things more than once.
Edited to add: Actually, pryr::object_size is pretty smart at reporting on the environments attached to objects, so we'd get close by searching namespaces. For example, to find the top 20 objects:
pryr::mem_used()
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
#> 35 MB
path <- tempfile()
if(!file.exists(path)){
saveRDS(as.data.frame(matrix(rep(1,1e07), ncol=5)), path)
}
pryr::mem_used()
#> 35.2 MB
p1 <- ggplot2::ggplot(readr::read_rds(path), ggplot2::aes(V1))
rm(p1)
pryr::mem_used()
#> 127 MB
envs <- c(globalenv = globalenv(),
sapply(loadedNamespaces(), function(ns) loadNamespace(ns)))
sizes <- lapply(envs, function(e) {
objs <- ls(e, all = TRUE)
sapply(objs, function(obj) pryr::object_size(get(obj, envir = e)))
})
head(sort(unlist(sizes), decreasing = TRUE), 20)
#> base..__S3MethodsTable__. utils..__S3MethodsTable__.
#> 96216872 83443704
#> grid..__S3MethodsTable__. ggplot2..__S3MethodsTable__.
#> 80945520 80636768
#> ggplot2..store methods..classTable
#> 80418936 10101152
#> graphics..__S3MethodsTable__. tools..check_packages
#> 9325608 5185880
#> compiler.inlineHandlers methods..genericTable
#> 3444600 2808440
#> Rcpp..__T__show:methods colorspace..__T__show:methods
#> 2474672 2447880
#> Rcpp..RcppClass Rcpp..__C__C++OverloadedMethods
#> 2127584 1990504
#> Rcpp..__C__RcppClass Rcpp..__C__C++Field
#> 1982576 1980176
#> Rcpp..__C__C++Constructor Rcpp..__T__$:base
#> 1979992 1939616
#> tools..install_packages Rcpp..__C__Module
#> 1904032 1899872
Created on 2020-11-03 by the reprex package (v0.3.0)
I don't know why those methods tables come out so large (I suspect it's because ggplot2 adds methods to those tables, so its environment gets captured); but somehow they are finding your object, because they aren't so big if I don't create it.
A hint about the issue is in the 5th object, listed as ggplot2..store (i.e. the object named .store in the ggplot2 namespace). It doesn't tell you to look in the environments of the functions in .store, but at least it gets you started.
Second edit:
Here are some tweaks to make the output a bit more readable.
# Unlist first, so we can clean up the names
sizes <- unlist(sizes)
# Replace the first dot with :::
names(sizes) <- sub(".", ":::", names(sizes), fixed = TRUE)
# Remove internal R objects
keep <- !grepl(".__", names(sizes), fixed = TRUE)
sizes <- sizes[keep]
With these changes, the output from sort(sizes, decreasing = TRUE) starts out as
ggplot2:::.store
80418936
base:::.userHooksEnv
47855920
base:::.Options
45016888
utils:::Rprof
44958416
If you do
unlist(lapply(search(), function(y) sapply(ls(y), function(x) object.size(get(x, pos = y)))))
you will get a complete list of all the objects in all the environments on your search path, including their sizes. You can then sort these and find the offending objects.
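For instance, to show the ten largest:
sizes <- unlist(lapply(search(), function(y) sapply(ls(y), function(x) object.size(get(x, pos = y)))))
head(sort(sizes, decreasing = TRUE), 10)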

Speedup split and merge dataframe rows in R

I have a data frame whose rows I want to split up into sentences.
df <- data.frame(text=c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..","I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."), id=c(1,2), stringsAsFactors = FALSE)
I want to split the sentences in the text column and come up with the following:
df <- data.frame (text = c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..",
"I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment.",
"Lately, I haven't been able to view my Online Payment Card.",
"It's prompting me to have to upgrade my account whereas before it didn't.",
"I have used the Card at various online stores before and have successfully used it.",
"But now it's starting to get very frustrating that I have to said upgrade my account.",
"Do fix this|", "**I noticed some users have the same issue|",
"I've been using this app for almost 2 years without any problems.",
"Until, their system just blocked my virtual paying card without any notice.",
"So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs.",
"This app has been a big disappointment."), id = c(1, 2, 1, 1,
1, 1, 1, 1, 2, 2, 2, 2), tag = c("DONE", "DONE", NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA), stringsAsFactors = FALSE)
I have done it using the code below; however, the for loop is very slow. I need to do this for 73,000 rows, so I need a faster approach.
Attempt 1:
library("qdap")
df$tag <- NA
for (review_num in 1:nrow(df)) {
x = sent_detect(df$text[review_num])
if (length(x) > 1) {
for (sentence_num in 1:length(x)) {
df <- rbind(df, df[review_num,])
df$text[nrow(df)] <- x[sentence_num]
}
df$tag[review_num] <- "DONE"
}
}
Attempt 2: Rows: 73000, Time Spent: 252 minutes or ~4 hours
reviews_df1 <- data.frame(id=character(0), text=character(0))
for (review_num in 1:nrow(df)) {
preprocess_sent <- sent_detect(df$text[review_num])
if (length(preprocess_sent) > 0) {
x <- data.frame(id=df$id[review_num],
text=preprocess_sent)
reviews_df1 <- rbind(reviews_df1, x)
}
}
colnames(reviews_df1) <- c("id", "text")
Attempt 3: Rows: 29000, Time Spent: 170 minutes or ~2.8 hours
library(qdap)
library(dplyr)
library(tidyr)
df <- data.frame(text=c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..","I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."), id=c(1,2), stringsAsFactors = FALSE)
df %>%
group_by(text) %>%
mutate(sentences = list(sent_detect(df$text))) %>%
unnest(cols=sentences) -> out.df
out.df
It seems puzzling to me that it would take that long. You could turn your input into a list and use mclapply (if you are not on Windows) to further speed things up.
Here is an example using data.table and parallel::mclapply on Womens Clothing E-Commerce Reviews.csv (23k lines). It takes about 21 seconds with lapply and 5.5 seconds with mclapply on 4 cores.
Granted, those are not very long reviews and sentences, but it demonstrates the usefulness of running things in parallel.
library(data.table)
library(parallel)
library(qdap)
#> Loading required package: qdapDictionaries
#> Loading required package: qdapRegex
#> Loading required package: qdapTools
#>
#> Attaching package: 'qdapTools'
#> The following object is masked from 'package:data.table':
#>
#> shift
#> Loading required package: RColorBrewer
#> Registered S3 methods overwritten by 'qdap':
#> method from
#> t.DocumentTermMatrix tm
#> t.TermDocumentMatrix tm
#>
#> Attaching package: 'qdap'
#> The following object is masked from 'package:base':
#>
#> Filter
dt <- fread("https://raw.githubusercontent.com/NadimKawwa/WomeneCommerce/master/Womens%20Clothing%20E-Commerce%20Reviews.csv")
system.time({
dfl <- setNames(as.list(dt$`Review Text`), dt$V1)
makeDT <- function(x) data.table(text = sent_detect(x))
out.dt <- rbindlist(mclapply(dfl, makeDT, mc.cores=4L), idcol = "id")
out.dt[, tag := NA_character_]
out.dt <- rbind(data.table(id=dt$V1, text=dt$`Review Text`, tag = "DONE"), out.dt)
})
#> user system elapsed
#> 21.078 0.482 5.467
out.dt
#> id
#> 1: 0
#> 2: 1
#> 3: 2
#> 4: 3
#> 5: 4
#> ---
#> 137388: 23484
#> 137389: 23484
#> 137390: 23484
#> 137391: 23485
#> 137392: 23485
#> text
#> 1: Absolutely wonderful - silky and sexy and comfortable
#> 2: Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. i bought a petite and am 5'8"". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.
#> 3: I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
#> 4: I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!
#> 5: This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!
#> ---
#> 137388: the medium fits my waist perfectly, but was way too long and too big in the bust and shoulders.
#> 137389: if i wanted to spend the money, i could get it tailored, but i just felt like it might not be worth it.
#> 137390: side note - this dress was delivered to me with a nordstrom tag on it and i found it much cheaper there after looking!
#> 137391: This dress in a lovely platinum is feminine and fits perfectly, easy to wear and comfy, too!
#> 137392: highly recommend!
#> tag
#> 1: DONE
#> 2: DONE
#> 3: DONE
#> 4: DONE
#> 5: DONE
#> ---
#> 137388: <NA>
#> 137389: <NA>
#> 137390: <NA>
#> 137391: <NA>
#> 137392: <NA>
On second thought, your code may be the problem - try changing
df %>%
group_by(text) %>%
mutate(sentences = list(sent_detect(df$text))) %>%
unnest(cols=sentences) -> out.df
to
df %>%
group_by(text) %>%
mutate(sentences = list(sent_detect(text))) %>%
unnest(cols=sentences) -> out.df
and see if that was the culprit (I think it was).
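A sketch that combines the corrected pipeline with the tag column from the desired output (dplyr/tidyr as in the question's attempt 3; untested):
library(dplyr)
library(tidyr)
library(qdap)
df %>%
group_by(text) %>%
mutate(sentences = list(sent_detect(text))) %>%
unnest(cols = sentences) %>%
ungroup() -> sentences_df
bind_rows(
mutate(df, tag = "DONE"),                                      # originals, tagged DONE
transmute(sentences_df, text = sentences, id = id, tag = NA_character_)  # one row per sentence
) -> out.df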

Error in running sfLapply in R

My piece of code looks like:
x<- c(1,2,3,4,5)
library(snowfall)
f1<- function(a,list){
f2<-function(b,num){ return(abs(num-b))}
l1<-sfLapply(list, f2, num=a)
l1<-sum(unlist(l1))
return(l1)
}
sfInit(parallel=TRUE,cpus=4)
l2<-(sfLapply(x, f1, list=x))
sfStop()
l2
When I run the last four lines, it gives an error:
l2<-(sfLapply(x, f1, list=x))
Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: could not find function "sfLapply"
When I switch to sequential processing, using lapply, it runs perfectly.
> l2<-(lapply(x, f1, list=x))
> l2
[[1]]
[1] 10
[[2]]
[1] 7
[[3]]
[1] 6
[[4]]
[1] 7
[[5]]
[1] 10
Why is sfLapply throwing an error?
You need to load the snowfall package on the cluster nodes. So insert
sfLibrary(snowfall)
after sfInit().
EDIT: For clarification:
Your function f1 contains the function sfLapply, which is found in the snowfall package. When you initialize the cluster using sfInit as above, the snow package is loaded on each node of the cluster, but not the snowfall package. Without the latter, there is no object (function or otherwise) called sfLapply on the nodes, and you get the error.
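With that fix in place, the script becomes:
x <- c(1,2,3,4,5)
library(snowfall)
f1 <- function(a,list){
f2 <- function(b,num){ return(abs(num-b)) }
l1 <- sfLapply(list, f2, num=a)
l1 <- sum(unlist(l1))
return(l1)
}
sfInit(parallel=TRUE, cpus=4)
sfLibrary(snowfall)  # loads snowfall on the worker nodes so sfLapply exists there
l2 <- sfLapply(x, f1, list=x)
sfStop()
l2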
