Ubuntu R foreach / doMC not using multiple cores

I have built a function in R (running on Ubuntu 12.04 LTS 64-bit, on a 4-core i7 server with hyperthreading and 6 GB RAM). I installed R from the standard packages:
sudo apt-get install r-base r-recommended r-base-dev
sudo apt-get install r-cran-multicore r-cran-iterators r-cran-foreach r-cran-domc
NB: I also installed foreach and doMC from inside R (which didn't help either), the same way I installed the deldir package:
install.packages(c("deldir"), dependencies = TRUE)
My function runs fine, but it does not run in parallel (it just maxes out 1 of the 8 logical cores):
library(deldir)
library(foreach)
library(doMC)
registerDoMC(cores=8)
#getDoParWorkers()
#getDoParName()
#getDoParVersion()
# loop through files
inputfiles <- dir(path="/home/geoadmin/data/objects/", pattern='.txt')
for( inputfilenr in 1:length(inputfiles))
{
# set file variables
curinputfile = paste("/home/geoadmin/data/objects/",inputfiles[[inputfilenr]], sep = "", collapse = NULL)
print (curinputfile)
curoutputfile = paste("/home/geoadmin/data/objects/",substr(inputfiles[[inputfilenr]], start=1, stop=10), '.out', sep = "", collapse = NULL)
# select the point x/y coordinates into a data frame...
points <- read.csv(curinputfile, header = TRUE, sep = ",", dec=".", fill = TRUE)
# set calculation variables, precision on 3 digits only because of the RDW coordinate system
voro = deldir(points$x, points$y, digits=3, list(ndx=2,ndy=2), rw=c(min(points$x)-abs(min(points$x)-max(points$x)), max(points$x)+abs(min(points$x)-max(points$x)), min(points$y)-abs(min(points$y)-max(points$y)), max(points$y)+abs(min(points$y)-max(points$y))))
tiles = tile.list(voro)
poly = array()
# start loop
poly <- foreach (i=1:length(tiles), .combine=cbind) %dopar%
{
# load tile info
tile = tiles[[i]]
# start with EWKB notation
curpoly = "POLYGON(("
# add list of coordinates by looping through the points in tile
for (j in 1:length(tiles[[i]]$x)) { curpoly = sprintf("%s %.6f %.6f,",curpoly,tile$x[[j]],tile$y[[j]]) }
# then again the first point to close the polygon and end the EWKB notation, adding that to the poly array
sprintf("%s %.6f %.6f))",curpoly,tile$x[[1]],tile$y[[1]])
}
write.csv(t(poly), file = curoutputfile, row.names = FALSE)
}
So the results are good, but no parallelism...
doMC did register correctly:
> getDoParWorkers()
[1] 8
> getDoParName()
[1] "doMC"
> getDoParVersion()
[1] "1.2.5"
If I look at the usage (with top):
top - 01:03:19 up 9 min, 3 users, load average: 1.02, 0.86, 0.45
Tasks: 131 total, 2 running, 127 sleeping, 0 stopped, 2 zombie
Cpu(s): 12.5%us, 0.0%sy, 0.0%ni, 87.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 6104932k total, 1240512k used, 4864420k free, 16656k buffers
Swap: 6283260k total, 0k used, 6283260k free, 141996k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1553 zzzzzzzz 20 0 913m 850m 3716 R 100 14.3 8:22.03 R
So just maxing out one core. Does anyone have any idea what could cause foreach/doMC to not use multiple cores?
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] doMC_1.2.5 multicore_0.1-7 iterators_1.0.6 foreach_1.4.0
[5] deldir_0.0-19
loaded via a namespace (and not attached):
[1] codetools_0.2-8

To add the likely answer for the question:
Since foreach/doMC does work on this machine (with the standard example), the problem lies in my specific code: most likely the voro = deldir(...) call takes up the time, not the foreach loop after it. That would mean the deldir package itself needs to be adjusted. Looking at the deldir source, it seems I would need to adjust this snippet:
# Call the master subroutine to do the work:
repeat {
tmp <- .Fortran(
'master',
x=as.double(x),
y=as.double(y),
sort=as.logical(sort),
rw=as.double(rw),
npd=as.integer(npd),
ntot=as.integer(ntot),
nadj=integer(tadj),
madj=as.integer(madj),
ind=integer(npd),
tx=double(npd),
ty=double(npd),
ilist=integer(npd),
eps=as.double(eps),
delsgs=double(tdel),
ndel=as.integer(ndel),
delsum=double(ntdel),
dirsgs=double(tdir),
ndir=as.integer(ndir),
dirsum=double(ntdir),
nerror=integer(1),
PACKAGE='deldir'
)
I'm not sure yet how I could restructure this so that it works with foreach, though...
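If deldir() really is the bottleneck and it is called once per input file, one alternative that avoids touching the Fortran code is to parallelize the outer loop over files rather than the inner loop over tiles. The following is only a minimal sketch under that assumption. It uses mclapply from the base parallel package (fork-based, so Linux/macOS only), and process_one_file as well as the temporary demo files are hypothetical stand-ins for the real per-file work.

```r
library(parallel)

# Hypothetical stand-in for the per-file work (read the points, call deldir(),
# build the polygons, write the .out file). Here it only counts lines.
process_one_file <- function(path) {
  length(readLines(path))
}

# Demo input: a few small temporary files instead of the real data directory
demo_dir <- tempfile("objects_")
dir.create(demo_dir)
for (k in 1:4) {
  writeLines(c("x,y", "1,2", "3,4"), file.path(demo_dir, sprintf("f%02d.txt", k)))
}

inputfiles <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# One forked worker per file, capped by the number of detected cores
n_cores <- max(1L, min(length(inputfiles), detectCores(), na.rm = TRUE))
results <- mclapply(inputfiles, process_one_file, mc.cores = n_cores)
print(unlist(results))
```

Each worker then runs its own deldir() call, so several files are processed at once even though a single call stays single-threaded.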


Basic operations in R giving different results on Windows and Linux

I have been running some code in R and while testing realized the results were different on Windows and Linux. I have tried to understand why this happens, but couldn't find an answer. Let's illustrate it with an example:
These are some hard-coded values for reproducibility, always starting from a clean environment. I have checked that the bit representation of these values is exactly the same in both the Windows and the Linux machines:
data <- structure(list(x = c(0.1, 0.1, 0.1, 5, 5, 5, 10, 10, 10, 20, 20, 20),
y = c(0.013624804, 0.014023006, 0.013169554, 0.70540352,
0.68711807, 0.69233506, 1.4235181, 1.348244, 1.4141854, 2.779813,
2.7567347, 2.7436437)), class = c("data.frame"), row.names = c(NA, 12L))
val <- c(43.3065849160736, 0.00134925463859564, 1.03218302435548, 270.328323775978)
theta <- 1.60812569803848
init <- c(b0 = 2.76836653333333, b1 = 0.0134350095, b2 = 2.15105945932773,
b3 = 6.85922519794374)
Now I define a new variable W which is again exactly the same in bit representation in Windows and Linux:
f <- function(X, b0, b1, b2, b3) {
b0 + (b1 - b0) / (1 + exp(b2*(log(X) - log(b3))))
}
W <- 1 / f(data$x, val[1], val[2], val[3], val[4])^theta
And finally I apply an optim function:
SSw <- function(Y, X, b0, b1, b2, b3, w) {
sum(w * (Y - f(X, b0, b1, b2, b3))^2)
}
SSw.vec <- function(par) SSw(data$y, data$x, par[1], par[2], par[3], par[4], W)
mod <- optim(init, SSw.vec, method = "L-BFGS-B", lower = c(-Inf,-Inf,-Inf,0))
print(mod$par)
# In Windows it returns:
# b0 b1 b2 b3
# 3.097283e+01 1.831543e-03 1.047613e+00 1.842448e+02
# In Linux it returns:
# b0 b1 b2 b3
# 3.459241e+01 1.530134e-03 1.040363e+00 2.101996e+02
As you can see, the differences are quite significant, but even if they weren't: why are there any differences at all?
Any help will be really appreciated!
Edit
Here I add the sessionInfo() on both Windows and Linux.
On Windows:
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 plyr_1.8.6 cellranger_1.1.0 compiler_3.6.3 pillar_1.7.0 nloptr_1.2.2.2 tools_3.6.3
[8] bit_4.0.4 boot_1.3-24 lme4_1.1-29 lifecycle_1.0.0 tibble_3.1.7 nlme_3.1-144 gtable_0.3.0
[15] lattice_0.20-38 pkgconfig_2.0.3 rlang_1.0.2 Matrix_1.2-18 cli_3.4.1 rstudioapi_0.11 dplyr_1.0.6
[22] generics_0.1.0 vctrs_0.3.8 lmerTest_3.1-3 grid_3.6.3 tidyselect_1.1.1 glue_1.4.2 R6_2.4.1
[29] fansi_0.4.1 readxl_1.3.1 minqa_1.2.4 ggplot2_3.3.6 purrr_0.3.5 magrittr_1.5 scales_1.1.1
[36] ellipsis_0.3.2 MASS_7.3-51.5 splines_3.6.3 colorspace_1.4-1 numDeriv_2016.8-1.1 renv_0.13.2 utf8_1.1.4
[43] munsell_0.5.0 crayon_1.3.4
On Linux:
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)
Matrix products: default
BLAS: /opt/r/lib/R/lib/libRblas.so
LAPACK: /opt/r/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3 renv_0.13.2
I ran your code on my local machine with Fedora 37 and R 4.2.2. Like the other commenters, I also got the result you got on Windows. Then I pulled the rocker image for R version 3.6.3:
docker run -ti rocker/r-ver:3.6.3 R
This image is also Debian-based. Here I was able to recreate the result you got on your system.
Then I moved to the rocker versioned release 4.0.0:
docker run -ti rocker/r-ver:4.0.0 R
Here the result was the same as you got on Windows and everyone else got on their machine. It must be noted that with R 4.0.0 the rocker project moved from Debian-based images to Ubuntu LTS.
Fedora comes with the ability to easily switch the BLAS/LAPACK backend via the flexiblas package. Thanks to that, I was able to test your code with the eight different backends available on my system. As you can see below, they do yield different results. In particular, the ATLAS backend comes somewhat close to the result you got. In contrast, OPENBLAS-OPENMP (the default on Fedora), the other OPENBLAS variants, and NETLIB all produce the same result as you received on Windows. A third family, BLIS, produces yet another result.
Is one of the results better than the others? Yes! optim() looks for parameters that minimize the supplied function. In its returned list, it reports not just the minimizing parameters but also the function value at them. I've included that in the table below. So the ATLAS backend wins the prize here. It must be said that optim() does NOT minimize analytically, so it always gives approximate results. That is why the initial values and the method matter for the results we get. And apparently, with your function the backend also matters. If you evaluate the function at the parameters you got on Buster, it comes to 0.002800321. So that is actually a better result than what we all get on our more modern systems, except for the result I got with ATLAS. ATLAS also happens to be much slower than the other backends. So it seems the newer backends might have traded accuracy for speed.
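To make that comparison concrete, the objective can be evaluated directly at the two parameter sets printed in the question. This is only a small sketch: the names par_windows and par_linux are mine, and the parameters are the rounded values as printed in the question, so the resulting numbers will only approximate the exact optim() values.

```r
x <- c(0.1, 0.1, 0.1, 5, 5, 5, 10, 10, 10, 20, 20, 20)
y <- c(0.013624804, 0.014023006, 0.013169554, 0.70540352,
       0.68711807, 0.69233506, 1.4235181, 1.348244, 1.4141854, 2.779813,
       2.7567347, 2.7436437)
val <- c(43.3065849160736, 0.00134925463859564, 1.03218302435548, 270.328323775978)
theta <- 1.60812569803848

f <- function(X, b0, b1, b2, b3) {
  b0 + (b1 - b0) / (1 + exp(b2 * (log(X) - log(b3))))
}
W <- 1 / f(x, val[1], val[2], val[3], val[4])^theta

# Weighted sum of squares, same objective as SSw.vec in the question
SSw.vec <- function(par) sum(W * (y - f(x, par[1], par[2], par[3], par[4]))^2)

# Rounded parameter vectors as printed in the question
par_windows <- c(b0 = 3.097283e+01, b1 = 1.831543e-03, b2 = 1.047613e+00, b3 = 1.842448e+02)
par_linux   <- c(b0 = 3.459241e+01, b1 = 1.530134e-03, b2 = 1.040363e+00, b3 = 2.101996e+02)

# The smaller value is the better fit
obj <- c(windows = SSw.vec(par_windows), linux = SSw.vec(par_linux))
print(obj)
```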
If your aim is consistency across platforms, you can upgrade your system to Debian 11 Bullseye, since that appears to have a backend producing the same results as other modern platforms, as the answer by @jay.sf indicates. You could also investigate whether you can find, for Windows, the same BLAS backend version that is used on Buster.
Furthermore, you can try switching to another BLAS library on your current system. Here is a guide on how to do that. Though it is written for Ubuntu, both systems use apt, so it should work on yours as well. (Edit: I tried that in a VM for Buster. None of the available BLAS backends produced the same result as on the more modern systems.)
Finally, if you feel you must have a newer BLAS library on your older system, then you could try to backport it yourself. I have no experience with this. I don't know how advisable it is or how likely to succeed. I am just mentioning it for completeness.
library(flexiblas)
library(tidyverse)
test_fun <- function(i) {
flexiblas_switch(i)
data <- structure(list(x = c(0.1, 0.1, 0.1, 5, 5, 5, 10, 10, 10, 20, 20, 20),
y = c(0.013624804, 0.014023006, 0.013169554, 0.70540352,
0.68711807, 0.69233506, 1.4235181, 1.348244, 1.4141854, 2.779813,
2.7567347, 2.7436437)), class = c("data.frame"), row.names = c(NA, 12L))
val <- c(43.3065849160736, 0.00134925463859564, 1.03218302435548, 270.328323775978)
theta <- 1.60812569803848
init <- c(b0 = 2.76836653333333, b1 = 0.0134350095, b2 = 2.15105945932773,
b3 = 6.85922519794374)
f <- function(X, b0, b1, b2, b3) {
b0 + (b1 - b0) / (1 + exp(b2*(log(X) - log(b3))))
}
W <- 1 / f(data$x, val[1], val[2], val[3], val[4])^theta
SSw <- function(Y, X, b0, b1, b2, b3, w) {
sum(w * (Y - f(X, b0, b1, b2, b3))^2)
}
SSw.vec <- function(par) SSw(data$y, data$x, par[1], par[2], par[3], par[4], W)
mod <- optim(init, SSw.vec, method = "L-BFGS-B", lower = c(-Inf,-Inf,-Inf,0))
return(c(mod$par, value = mod$value))
}
flexiblas_list() |>
setdiff("__FALLBACK__") |>
tibble(backend = _) |>
mutate(
idx = flexiblas_load_backend(backend),
res = map(idx, test_fun)
) |>
unnest_wider(res)
#> # A tibble: 8 × 7
#> backend idx b0 b1 b2 b3 value
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 NETLIB 2 31.0 0.00183 1.05 184. 0.00282
#> 2 OPENBLAS-OPENMP 1 31.0 0.00183 1.05 184. 0.00282
#> 3 ATLAS 3 34.7 0.00168 1.04 209. 0.00280
#> 4 BLIS-SERIAL 4 27.1 0.00225 1.06 158. 0.00285
#> 5 BLIS-OPENMP 5 27.1 0.00225 1.06 158. 0.00285
#> 6 BLIS-THREADS 6 27.1 0.00225 1.06 158. 0.00285
#> 7 OPENBLAS-SERIAL 7 31.0 0.00183 1.05 184. 0.00282
#> 8 OPENBLAS-THREADS 8 31.0 0.00183 1.05 184. 0.00282
I can now confirm your issue. I installed Debian Buster in a VM, ran apt install r-base, and got R 3.5.2. I ran your code, and it showed the same (probably) "flawed" Linux results as in the question.
However, I then updated to R 4.2.2, and the "flawed" results didn't change! It used libblas 3.8. On my real machine I'm running Ubuntu, R 4.2.2, and libblas 3.10, and I get the (probably) "right" Windows results.
Unfortunately, I was not able to install libblas 3.10 on Debian Buster, and I am not sure whether that's possible at all (see the link: it's not listed under Buster). Note that Debian Bullseye is now the current version, so your system is actually outdated.
The outdated BLAS/LAPACK may, as noted in my comment, still be considered the cause of the erroneous results, since these are the actual algebraic engines. You may be able to install an updated BLAS/LAPACK on Debian Buster, but I tend to recommend that you upgrade to Debian Bullseye.
This is a little bit tangential, but probably the easiest thing you can do to alleviate the between-platform differences is to use
control = list(parscale = abs(init))
in your optim() call.
The reason for this is that unless a gradient function is specified, L-BFGS-B automatically uses finite differences with a fixed stepsize (ndeps, defaulting to 1e-3 for all parameters) to approximate the gradient. This is usually good enough but can cause problems for hard/unstable optimization problems, or when the parameters are on very different scales. parscale as specified above tells optim how to scale the parameters internally, generally improving the results.
It might be even better to pass an analytic (or auto-differentiated) gradient, but that's more work ...
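As a rough sketch of what an analytic gradient could look like here: the code below derives the partial derivatives of the weighted sum of squares by hand and compares them with central finite differences at the starting values. The name SSw.grad and the step size are my own assumptions, not something from the question.

```r
x <- c(0.1, 0.1, 0.1, 5, 5, 5, 10, 10, 10, 20, 20, 20)
y <- c(0.013624804, 0.014023006, 0.013169554, 0.70540352,
       0.68711807, 0.69233506, 1.4235181, 1.348244, 1.4141854, 2.779813,
       2.7567347, 2.7436437)
val <- c(43.3065849160736, 0.00134925463859564, 1.03218302435548, 270.328323775978)
theta <- 1.60812569803848
init <- c(b0 = 2.76836653333333, b1 = 0.0134350095, b2 = 2.15105945932773,
          b3 = 6.85922519794374)

f <- function(X, b0, b1, b2, b3) {
  b0 + (b1 - b0) / (1 + exp(b2 * (log(X) - log(b3))))
}
W <- 1 / f(x, val[1], val[2], val[3], val[4])^theta
SSw.vec <- function(par) sum(W * (y - f(x, par[1], par[2], par[3], par[4]))^2)

# Hand-derived gradient of SSw.vec with respect to (b0, b1, b2, b3)
SSw.grad <- function(par) {
  par <- unname(par)
  b0 <- par[1]; b1 <- par[2]; b2 <- par[3]; b3 <- par[4]
  L <- log(x) - log(b3)
  E <- exp(b2 * L)
  r <- y - f(x, b0, b1, b2, b3)                 # residuals
  df <- cbind(                                  # partial derivatives of f
    E / (1 + E),                                # d f / d b0
    1 / (1 + E),                                # d f / d b1
    -(b1 - b0) * E * L / (1 + E)^2,             # d f / d b2
    (b1 - b0) * E * b2 / (b3 * (1 + E)^2)       # d f / d b3
  )
  -2 * colSums(W * r * df)
}

# Central finite differences for comparison (step size is an arbitrary choice)
num_grad <- sapply(seq_along(init), function(k) {
  h <- 1e-6 * max(1, abs(init[[k]]))
  up <- init; up[k] <- up[k] + h
  dn <- init; dn[k] <- dn[k] - h
  (SSw.vec(up) - SSw.vec(dn)) / (2 * h)
})

print(rbind(analytic = SSw.grad(init), numeric = unname(num_grad)))
```

If the two rows agree, the function could be supplied through the gr argument, e.g. optim(init, SSw.vec, gr = SSw.grad, method = "L-BFGS-B", lower = c(-Inf, -Inf, -Inf, 0)).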
This is also a little bit tangential, but when I run your code on my system, the output of optim has convergence = 1, which indicates that the iteration limit maxit had been reached. 0 indicates successful completion, so maxit should be increased.
mod <- optim(init, SSw.vec, method = "L-BFGS-B", lower = c(-Inf,-Inf,-Inf,0))
mod$convergence
#[1] 1
mod$message
#[1] "NEW_X"
mod <- optim(init, SSw.vec, method = "L-BFGS-B", lower = c(-Inf,-Inf,-Inf,0),
control=list(maxit=1e5))
mod$convergence
#[1] 0
mod$message
#[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
mod$par
# b0 b1 b2 b3
#4.336406e+01 1.154007e-03 1.031650e+00 2.710800e+02
mod$value
#[1] 0.002779518
Maybe this helps to get similar results across different system configurations.

Why is R making a copy-on-modification after using str?

I was wondering why R makes a copy on modification after str has been called.
I create a matrix. I can change its dim, one element, or even all elements, and no copy is made. But when I call str, R makes a copy during the next modification of the matrix. Why is this happening?
m <- matrix(1:12, 3)
tracemem(m)
#[1] "<0x559df861af28>"
dim(m) <- 4:3
m[1,1] <- 0L
m[] <- 12:1
str(m)
# int [1:4, 1:3] 12 11 10 9 8 7 6 5 4 3 ...
dim(m) <- 3:4 #Here after str a copy is made
#tracemem[0x559df861af28 -> 0x559df838e4a8]:
dim(m) <- 3:4
str(m)
# int [1:3, 1:4] 12 11 10 9 8 7 6 5 4 3 ...
dim(m) <- 3:4 #Here again after str a copy
#tracemem[0x559df838e4a8 -> 0x559df82c9d78]:
I was also wondering why a copy is made when a task callback is registered.
TCB <- addTaskCallback(function(...) TRUE)
m <- matrix(1:12, nrow = 3)
tracemem(m)
#[1] "<0x559dfa79def8>"
dim(m) <- 4:3 #Copy on modification
#tracemem[0x559dfa79def8 -> 0x559dfa8998e8]:
removeTaskCallback(TCB)
#[1] TRUE
dim(m) <- 4:3 #No copy
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=de_AT.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_AT.UTF-8 LC_COLLATE=de_AT.UTF-8
[5] LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=de_AT.UTF-8
[7] LC_PAPER=de_AT.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.3
This is a follow up question to Is there a way to prevent copy-on-modify when modifying attributes?.
I start R with R --vanilla to have a clean session.
I have asked this question on R-help as suggested by @sam-mason in the comments.
The answer from Luke Tierney solved the issue with str:
As of R 4.0.0 it is in some cases possible to reduce reference counts
internally and so avoid a copy in cases like this. It would be too
costly to try to detect all cases where a count can be dropped, but it
this case we can do better. It turns out that the internals of
pos.to.env were unnecessarily creating an extra reference to the call
environment (here in a call to exists()). This is fixed in r79528.
Thanks.
And related to Task Callback:
It turns out there were some issues with the way calls to the
callbacks were handled. This has been revised in R-devel in r79541.
This example will no longere need to duplicate in R-devel.
Thanks for the report.

Slow graph rendering with ggplot2 / Rstudio - GPU issue?

I am making a graph with ggplot2 from a table that contains circa 500k rows.
On my Ubuntu 20.04 laptop (CPU i5 8265U) it takes around 15 sec and displays correctly in the Plots tab of RStudio. Now I just got a PC with Windows 10, a better CPU (i5 10400F) and a GPU (GTX 1660 Super). The same graph takes forever to build: even after waiting more than 10 minutes for the code to finish, the graph is still not displayed in the Plots tab.
Unfortunately, I cannot share the data, so I cannot make a reprex. The code for the graph is:
> t1 <- Sys.time()
> gr_dbh_h <- tree16_temp %>%
+ filter(lu_en_simple2 %in% c("Evergreen Forest", "Deciduous Forest")) %>%
+ select(dbh, h, live_dead, lu_en_simple2_f) %>%
+ ggplot() +
+ geom_point(aes(x = dbh, y = h, color = live_dead), alpha = 0.5, shape = 3) +
+ labs(color = "", x = "Diameter at breast height (cm)", y = "Tree total height (m)") +
+ facet_wrap(~lu_en_simple2_f)
> t2 <- Sys.time()
> t2 - t1
Time difference of 0.1159708 secs
> gr_dbh_h
> t3 <- Sys.time()
> t3 - t2
Time difference of 9.901076 mins
Unfortunately, again, the closest reprex doesn't show the same issue; it just renders 2 times faster on the Ubuntu laptop than on the Win 10 PC (instead of 60 times in my main code):
library(tidyverse)
tt <- tibble(
x = rnorm(500000),
y = rnorm(500000),
cat = rep(c("aa", "bb", "cc", "dd", "ee"), 100000),
group = c(rep("A", 200000), rep("B", 300000))
)
t1 <- Sys.time()
ggplot(tt) +
geom_point(aes(x, y, color = group), alpha = 0.5) +
facet_wrap(~cat)
t2 <- Sys.time()
t2 - t1
on Ubuntu:
> t2 - t1
Time difference of 10.26094 secs
on win10:
> t2 - t1
Time difference of 23.292 secs
So the main question is: how can one system make the graph in 10 sec while the other takes more than 10 mins? And why is the first system 2 times faster even with a basic graph?
Is it just Ubuntu vs Windows? Can the GPU mess up the rendering?
RStudio and R are the same version on both systems, and all packages are up to date at the time of writing.
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_Europe.1252 LC_CTYPE=English_Europe.1252 LC_MONETARY=English_Europe.1252
[4] LC_NUMERIC=C LC_TIME=English_Europe.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggthemr_1.1.0 devtools_2.3.1 usethis_1.6.1 webshot_0.5.2 tmap_3.1 bookdown_0.20 knitr_1.29
[8] sf_0.9-5 BIOMASS_2.1.3 scales_1.1.1 ggrepel_0.8.2 ggpubr_0.4.0 lubridate_1.7.9 forcats_0.5.0
[15] stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.2 tibble_3.0.3 ggplot2_3.3.2
[22] tidyverse_1.3.0
Welcome to crappy graphics devices. On OS X it's sometimes faster to save to pdf and then open the pdf in preview than it is to render a graph to the RStudio window. Also, here is an example of text rendering taking ~300 times longer in one graphics device than in another.
I'd recommend installing an RStudio daily build, installing the ragg package, and then setting the backend to AGG in the graphics settings.
That should give you decent performance rendering. It'll also fix some other issues that the default Windows graphics device has, so it's a good choice in any case.
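To check whether the graphics device (rather than ggplot2 itself) is the slow part, the plot can be timed while rendering to a file device instead of the RStudio plot pane. This is only a sketch under some assumptions: a base-graphics scatter plot stands in for the real ggplot object, the point count is reduced, and ragg::agg_png is used only if the ragg package happens to be installed (otherwise the built-in pdf device is used).

```r
# Assumption: a base-graphics scatter plot stands in for the real plot; with
# ggplot2 you would call print(gr_dbh_h) between opening and closing the device.
n <- 50000
x <- rnorm(n)
y <- rnorm(n)

use_ragg <- requireNamespace("ragg", quietly = TRUE)
out_file <- tempfile(fileext = if (use_ragg) ".png" else ".pdf")

elapsed <- system.time({
  if (use_ragg) {
    ragg::agg_png(out_file, width = 1200, height = 800, units = "px")
  } else {
    pdf(out_file, width = 12, height = 8)
  }
  plot(x, y, pch = 3, col = rgb(0, 0, 0, alpha = 0.5))
  dev.off()
})

# Wall-clock seconds spent drawing to the file device
print(elapsed[["elapsed"]])
```

If this is fast while the plot pane is slow, the on-screen device is the likely culprit.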

parLapply and Part of Speech tagging

I am trying to use the parLapply along with the openNLP R package to do part of speech tagging of a corpus of ~600k documents. However, while I was able to successfully part of speech tag a different set of ~90k documents, I get a strange error after ~25 mins of running the same code over the ~600k documents:
Error in checkForRemoteErrors(val) : 10 nodes produced errors; first error: no word token annotations found
The documents are simply digital newspaper articles, where I run the tagger over the body field (after cleaning). This field is nothing but raw text which I save into a list of strings.
Here's my code:
# I set the Java heap size (memory) allocation - I experimented with different sizes
options(java.parameters = "- Xmx3GB")
# Convert the corpus into a list of strings
myCorpus <- lapply(contentCleaned, function(x){x <- as.String(x)})
# tag Corpus Function
tagCorpus <- function(x, ...){
s <- as.String(x) # This is a repeat and may not be required
WTA <- Maxent_Word_Token_Annotator()
a2 <- Annotation(1L, "sentence", 1L, nchar(s))
a2 <- annotate(s, WTA, a2)
a3 <- annotate(s, PTA, a2)
word_subset <- a3[a3$type == "word"]
POStags <- unlist(lapply(word_subset$features, `[[`, "POS"))
POStagged <- paste(sprintf("%s/%s", s[word_subset], POStags), collapse = " ")
list(text = s, POStagged = POStagged, POStags = POStags, words = s[word_subset])
}
# I have 12 cores in my box
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()-2))
# I tried both exporting the word token annotator and not
clusterEvalQ(cl, {
library(openNLP);
library(NLP);
PTA <- Maxent_POS_Tag_Annotator();
WTA <- Maxent_Word_Token_Annotator()
})
# Each cluster node has the following description:
[[1]]
An annotator inheriting from classes
Simple_Word_Token_Annotator Annotator
with description
Computes word token annotations using the Apache OpenNLP Maxent tokenizer employing the default model for language 'en'.
clusterEvalQ(cl, sessionInfo())
# ClusterEvalQ outputs for each worker:
[[1]]
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] NLP_0.1-11 openNLP_0.2-6
loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-4 compiler_3.4.4 parallel_3.4.4 rJava_0.9-10
packageDescription('openNLP') # Version: 0.2-6
packageDescription('parallel') # Version: 3.4.4
startTime <- Sys.time()
print(startTime)
corpus.tagged <- parLapply(cl, myCorpus, tagCorpus)
endTime <- Sys.time()
print(endTime)
endTime - startTime
Kindly note that I have consulted many web forums & the one which stood out is:
parallel parLapply setup
However, this doesn't seem to address my issue. Furthermore, I am confused why the setup works with the ~90k articles but not the ~600k articles (I have a total of 12 cores and 64GB memory). Any advice is much appreciated.
I have managed to get this to work by directly using the qdap package (https://github.com/trinker/qdap) by Tyler Rinker. It took ~20 hours to run. Here's how the function pos from the qdap package does this in a one liner:
corpus.tagged <- qdap::pos(myCorpus, parallel = TRUE, cores = detectCores() - 2)
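For tracking down the original error, one possible (unverified) explanation is that a few of the ~600k documents contain no tokens at all, for example strings that are empty after cleaning. A generic way to find such documents is to catch errors per document, so that one bad input does not abort the whole run. The sketch below uses plain lapply and a hypothetical toy_tag function in place of tagCorpus; safe_tag is also a name of my own.

```r
# Hypothetical stand-in for tagCorpus(): fails on documents without any word
# characters, roughly imitating the "no word token annotations found" error.
toy_tag <- function(doc) {
  if (!grepl("[[:alnum:]]", doc)) stop("no word token annotations found")
  toupper(doc)
}

# Wrapper that turns an error into a value instead of aborting the whole run
safe_tag <- function(doc, tagger) {
  tryCatch(
    list(ok = TRUE, result = tagger(doc), error = NA_character_),
    error = function(e) list(ok = FALSE, result = NULL, error = conditionMessage(e))
  )
}

docs <- list("first article", "", "   ", "fourth article")

# With a cluster this would be: parLapply(cl, myCorpus, safe_tag, tagger = tagCorpus)
tagged <- lapply(docs, safe_tag, tagger = toy_tag)

# Indices of the documents that failed, for later inspection
failed <- which(!vapply(tagged, function(r) r$ok, logical(1)))
print(failed)
```

The failing documents can then be inspected or filtered out before the full parallel run.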

How to check a data.table key works properly and why it would not?

I'm not sure whether this is a bug or my fault: a data.table key is not working for a table that I read from a UTF-8-encoded file (link).
names <- data.table(name = unique(read.table(file = "boys_ru.txt", header = FALSE, sep = "\n", quote = "", stringsAsFactors = F)$V1), sex = 1)
setkey(names, name)
data.table doesn't seem to recognize the key properly: names["сергей"] returns nothing, while names[name == "сергей"] works fine
> names[name == "сергей"]
name sex
1: сергей 1
If I create the table myself, everything works fine too
dt1 <- data.table(name = rep("сергей", 5), sex = rep(1, 5))
setkey(dt1, name)
I don't know what to do, because this prevents me from joining this table with another 10M-row table on the name field. Interestingly, merge.data.frame works as expected with the names table (but is way too slow). sessionInfo:
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C
[7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
I should have been more careful when loading the file and explicitly added the encoding, as in read.table(..., encoding = "UTF-8"). Otherwise the column gets the wrong encoding, and data.table is unable to match strings with different encodings.
Thanks to @Arun and the participants of the RForge discussion above for pointing to the solution.
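A small base-R sketch of the fix, for illustration: it writes a temporary UTF-8 file (a stand-in for boys_ru.txt, with made-up contents), reads it with the encoding declared explicitly, and inspects the declared encoding of the resulting strings. Only base R is used here; the data.table keyed lookup itself is not reproduced.

```r
# Demo file containing UTF-8 encoded names (stand-in for boys_ru.txt)
boys <- c("\u0441\u0435\u0440\u0433\u0435\u0439", "\u0438\u0432\u0430\u043d")
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines(enc2utf8(boys), con, useBytes = TRUE)
close(con)

# Declare the encoding explicitly while reading
df <- read.table(path, header = FALSE, sep = "\n", quote = "",
                 stringsAsFactors = FALSE, encoding = "UTF-8")

print(Encoding(df$V1))   # declared encoding of each string
print(df$V1 == boys)     # comparison against UTF-8 string literals
```

Checking Encoding() on a key column before calling setkey is a quick way to spot strings whose declared encoding differs from that of the lookup values.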
