I am trying to use parLapply along with the openNLP R package to part-of-speech tag a corpus of ~600k documents. I was able to tag a different set of ~90k documents successfully, but when I run the same code over the ~600k documents I get a strange error after ~25 minutes:
Error in checkForRemoteErrors(val) : 10 nodes produced errors; first error: no word token annotations found
The documents are digital newspaper articles; I run the tagger over the body field (after cleaning), which is nothing but raw text that I save into a list of strings.
Here's my code:
# Set the Java heap size (memory allocation) - I experimented with different sizes
options(java.parameters = "-Xmx3g")
# Convert the corpus into a list of strings
myCorpus <- lapply(contentCleaned, function(x) as.String(x))
# corpus-tagging function
tagCorpus <- function(x, ...){
  s <- as.String(x)  # this repeats the conversion and may not be required
  WTA <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, WTA, a2)
  a3 <- annotate(s, PTA, a2)
  word_subset <- a3[a3$type == "word"]
  POStags <- unlist(lapply(word_subset$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[word_subset], POStags), collapse = " ")
  list(text = s, POStagged = POStagged, POStags = POStags, words = s[word_subset])
}
# I have 12 cores in my box
library(parallel)
cl <- makeCluster(mc <- getOption("cl.cores", detectCores() - 2))
# I tried both with and without exporting the word token annotator
clusterEvalQ(cl, {
  library(openNLP)
  library(NLP)
  PTA <- Maxent_POS_Tag_Annotator()
  WTA <- Maxent_Word_Token_Annotator()
})
# Each cluster node has the following description:
[[1]]
An annotator inheriting from classes
Simple_Word_Token_Annotator Annotator
with description
Computes word token annotations using the Apache OpenNLP Maxent tokenizer employing the default model for language 'en'.
clusterEvalQ(cl, sessionInfo())
# clusterEvalQ output for each worker:
[[1]]
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] NLP_0.1-11 openNLP_0.2-6
loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-4 compiler_3.4.4 parallel_3.4.4 rJava_0.9-10
packageDescription('openNLP') # Version: 0.2-6
packageDescription('parallel') # Version: 3.4.4
startTime <- Sys.time()
print(startTime)
corpus.tagged <- parLapply(cl, myCorpus, tagCorpus)
endTime <- Sys.time()
print(endTime)
endTime - startTime
Kindly note that I have consulted many web forums; the one that stood out is:
parallel parLapply setup
However, it doesn't seem to address my issue. Furthermore, I am confused about why the setup works with the ~90k articles but not the ~600k articles (I have 12 cores and 64 GB of memory in total). Any advice is much appreciated.
I have managed to get this to work by directly using the qdap package (https://github.com/trinker/qdap) by Tyler Rinker. It took ~20 hours to run. Here's how the pos function from the qdap package does this in one line:
corpus.tagged <- qdap::pos(myCorpus, parallel = TRUE, cores = detectCores() - 2)
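If you want to stay with parLapply rather than qdap, one thing worth checking is whether a handful of documents end up empty (or whitespace-only) after cleaning, since an empty string is a typical way to get "no word token annotations found". A minimal sketch under that assumption (the wrapper name safeTagCorpus is mine, not from the original code):
# hypothetical wrapper: return the error message instead of killing the whole node
safeTagCorpus <- function(x, ...) {
  tryCatch(tagCorpus(x, ...),
           error = function(e) list(text = as.String(x), error = conditionMessage(e)))
}
clusterExport(cl, c("tagCorpus", "safeTagCorpus"))
corpus.tagged <- parLapply(cl, myCorpus, safeTagCorpus)
# see which documents failed and why
failed <- Filter(function(r) !is.null(r$error), corpus.tagged)
length(failed)
With ~600k articles, even a few empty bodies after cleaning would be enough to bring down all the nodes.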
I was wondering why R makes a copy-on-modification after using str.
I create a matrix. I can change its dim, a single element, or even all elements, and no copy is made. But after I call str, R makes a copy during the next modification of the matrix. Why is this happening?
m <- matrix(1:12, 3)
tracemem(m)
#[1] "<0x559df861af28>"
dim(m) <- 4:3
m[1,1] <- 0L
m[] <- 12:1
str(m)
# int [1:4, 1:3] 12 11 10 9 8 7 6 5 4 3 ...
dim(m) <- 3:4 #Here after str a copy is made
#tracemem[0x559df861af28 -> 0x559df838e4a8]:
dim(m) <- 3:4
str(m)
# int [1:3, 1:4] 12 11 10 9 8 7 6 5 4 3 ...
dim(m) <- 3:4 #Here again after str a copy
#tracemem[0x559df838e4a8 -> 0x559df82c9d78]:
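A quick way to see the reference count that drives these copies is .Internal(inspect()); the output is internal and varies between R versions, but in R >= 4.0 the REF field is what matters (this is just a diagnostic aid, not part of the original example):
m <- matrix(1:12, 3)
.Internal(inspect(m))   # REF 1: in-place modification is still allowed
str(m)
.Internal(inspect(m))   # if REF is now larger than 1, the next dim(m) <- ... must copy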
I was also wondering why a copy is made when a task callback is registered.
TCB <- addTaskCallback(function(...) TRUE)
m <- matrix(1:12, nrow = 3)
tracemem(m)
#[1] "<0x559dfa79def8>"
dim(m) <- 4:3 #Copy on modification
#tracemem[0x559dfa79def8 -> 0x559dfa8998e8]:
removeTaskCallback(TCB)
#[1] TRUE
dim(m) <- 4:3 #No copy
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=de_AT.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_AT.UTF-8 LC_COLLATE=de_AT.UTF-8
[5] LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=de_AT.UTF-8
[7] LC_PAPER=de_AT.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.3
This is a follow-up question to Is there a way to prevent copy-on-modify when modifying attributes?.
I start R with R --vanilla to have a clean session.
I have asked this question on R-help, as suggested by @sam-mason in the comments.
The answer from Luke Tierney solved the issue with str:
As of R 4.0.0 it is in some cases possible to reduce reference counts
internally and so avoid a copy in cases like this. It would be too
costly to try to detect all cases where a count can be dropped, but in
this case we can do better. It turns out that the internals of
pos.to.env were unnecessarily creating an extra reference to the call
environment (here in a call to exists()). This is fixed in r79528.
Thanks.
And related to Task Callback:
It turns out there were some issues with the way calls to the
callbacks were handled. This has been revised in R-devel in r79541.
This example will no longer need to duplicate in R-devel.
Thanks for the report.
I'm sure that many of you have seen this before:
Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code
This time, I get the error when I try to remove a custom stopword list from my corpus.
asdf <- tm_map(asdf, removeWords, mystops)
It works with a small stopword list (I tried up to 100 words or so), but my current stopword list has about 42,000 words.
I have tried this:
asdf <- tm_map(asdf, removeWords, mystops, lazy = TRUE)
This doesn't return an error, but every tm_map command after it gives me the same error as above, and when I try to compute a DTM from the corpus I get:
Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
I am thinking about writing a function that loops the removeWords command over small chunks of my list, but I would also like to understand why the length of the list is a problem.
Here is my sessionInfo():
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6
locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] SnowballC_0.5.1 wordcloud_2.5 RColorBrewer_1.1-2 RTextTools_1.4.2 SparseM_1.74 topicmodels_0.2-4 tm_0.6-2
[8] NLP_0.1-9
loaded via a namespace (and not attached):
[1] Rcpp_0.12.7 splines_3.3.2 MASS_7.3-45 tau_0.0-18 prodlim_1.5.7 lattice_0.20-34 foreach_1.4.3
[8] tools_3.3.2 caTools_1.17.1 nnet_7.3-12 parallel_3.3.2 grid_3.3.2 ipred_0.9-5 glmnet_2.0-5
[15] e1071_1.6-7 iterators_1.0.8 modeltools_0.2-21 class_7.3-14 survival_2.39-5 randomForest_4.6-12 Matrix_1.2-7.1
[22] lava_1.4.5 bitops_1.0-6 codetools_0.2-15 maxent_1.3.3.1 rpart_4.1-10 slam_0.1-38 stats4_3.3.2
[29] tree_1.0-37
EDIT:
20 newsgroups dataset
I use 20news-bydate.tar.gz and only the train folder.
I won't share all the preprocessing I am doing, as it includes a morphological analysis of the whole thing (not with R).
Here is my R code:
library(tm)
library(topicmodels)
library(SnowballC)

asdf <- Corpus(DirSource("/path/to/20news-bydate/train", encoding = "UTF-8"),
               readerControl = list(language = "en"))
asdf <- tm_map(asdf, content_transformer(tolower))
asdf <- tm_map(asdf, removeWords, stopwords(kind = "english"))
asdf <- tm_map(asdf, removePunctuation)
asdf <- tm_map(asdf, removeNumbers)
asdf <- tm_map(asdf, stripWhitespace)
# until here: preprocessing

# build a DocumentTermMatrix with term frequency weighting
dtm <- DocumentTermMatrix(asdf, control = list(weighting = weightTf))

# build a matrix from the DTM, a word vector (all words, sorted by corpus
# frequency) and the lengths of the words in that vector
m <- as.matrix(dtm)
wordvector <- sort(colSums(m), decreasing = TRUE)
wordlengths <- nchar(names(wordvector))

mystops1 <- names(wordvector[wordlengths > 22])  # all words longer than 22 characters
mystops2 <- names(wordvector)[wordvector < 3]    # all words occurring fewer than 3 times
mystops <- c(mystops1, mystops2)                 # the stopword list

# go back to the corpus to remove the chosen words
asdf <- tm_map(asdf, removeWords, mystops)
This is where I get the error.
As I suspected in the comment: removeWords from the tm package uses Perl regular expressions. All the words are concatenated with the | (or) pipe, and in your case the resulting pattern has too many characters:
Error in gsub(regex, "", txt, perl = TRUE) : invalid regular
expression
'(*UCP)\b(zxmkrstudservzdvunituebingende|zxmkrstudservzdvunituebingende|...|unwantingly|
In addition: Warning message: In gsub(regex, "", txt, perl = TRUE) :
PCRE pattern compilation error 'regular expression is too large' at
''
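As a rough sanity check (my own addition, not part of the original answer), you can estimate the size of the pattern removeWords is about to build; older PCRE builds reject compiled patterns above roughly 64 kB:
# all words joined by "|", plus a little extra for the (*UCP)\b(...)\b wrapper
sum(nchar(mystops)) + length(mystops) - 1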
One solution: define your own removeWords function that splits an over-long regular expression at the character limit and then applies each partial regular expression separately, so that it no longer hits the limit:
f <- content_transformer(function(txt, words, n = 30000L) {
  # cumulative pattern length if all words were joined with "|"
  l <- cumsum(nchar(words) + c(0, rep(1, length(words) - 1)))
  # split the words into groups of at most ~n pattern characters each
  groups <- cut(l, breaks = seq(1, ceiling(tail(l, 1) / n) * n + 1, by = n))
  # build one (*UCP)\b(...)\b regex per group
  regexes <- sapply(split(words, groups),
                    function(words) sprintf("(*UCP)\\b(%s)\\b",
                                            paste(sort(words, decreasing = TRUE), collapse = "|")))
  # apply the partial regexes one after the other
  for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE)
  return(txt)
})
asdf <- tm_map(asdf, f, mystops)
Your custom stopword list is too big, so you have to break it into chunks:
group <- 100
n <- length(mystops)
r <- rep(1:ceiling(n / group), each = group)[1:n]
d <- split(mystops, r)
for (i in seq_along(d)) {
  asdf <- tm_map(asdf, removeWords, d[[i]])
}
Say I have a tall dataframe with many rows per group, like so:
df <- data.frame(group = factor(rep(c("a", "b", "c"), each = 5)),
                 v1 = sample(1:100, 15, replace = TRUE),
                 v2 = sample(1:100, 15, replace = TRUE),
                 v3 = sample(1:100, 15, replace = TRUE))
What I want to do is split df into length(levels(df$group)) separate dataframes, e.g.,
df_a <- df[df$group == "a", ]; df_b <- df[df$group == "b", ]; ...
And then print each dataframe to a separate HTML/PDF/DOCX file (probably using R Markdown and knitr).
I want to do this because I have a large dataframe and want to create a personalized report for each group a, b, c, etc. Thanks.
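For what it's worth, here is a minimal sketch of the R Markdown/knitr route mentioned above, using a parameterized template; the file name group_report.Rmd is hypothetical, and its YAML header is assumed to declare a dat parameter whose value the template body prints with knitr::kable(params$dat):
library(rmarkdown)
for (g in levels(df$group)) {
  render("group_report.Rmd",                        # hypothetical template file
         params = list(dat = df[df$group == g, ]),  # hand each group its own subset
         output_file = paste0("report_", g, ".html"))
}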
Update (11/18/14)
Following @daroczig's advice in this thread and another thread, I attempted to make my own template that would simply print a nicely formatted table of all columns and rows per group, to substitute for the "correlations" template call in the original sapply() function. I want to make my own template rather than just printing the nice table (e.g., the answer @Thomas graciously provided) because I'd like to build additional customization into the template once the simple printing works. Anyway, I've certainly butchered it:
<!--head
meta:
title: Sample Report
author: Nicapyke
description: This is a demo
packages: ~
inputs:
- name: eachgroup
class: character
standalone: TRUE
required: TRUE
head-->
### Records received up to present for Group <%= eachgroup %>
<%=
pandoc.table(df[df$group == eachgroup, ])
%>
Then, after saving that as groupreport.rapport in my working directory, I wrote the following R code, modeled after @daroczig's response:
allgroups <- unique(df$group)
library(rapport)
for (eachgroup in allgroups) {
  rapport.docx("FILEPATHHERE", eachgroup = eachgroup)
}
I received the error:
Error in openFileInOS(f.out) : File not found!
I'm not sure what happened. I see from the pander documentation that this means it's looking for a system file, but that doesn't mean much to me. Anyway, this error doesn't get at the root of the problem, which is 1) what should go in the input section of the custom template YAML header, and 2) which R code should go in the rapport template vs. in the R script.
I realize I may be making a number of errors that reveal my lack of experience with rapport and pander. Thanks for your patience!
N.B.:
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.8 dplyr_0.3.0.2 rapport_0.51 yaml_2.1.13 pander_0.5.1
plyr_1.8.1 lattice_0.20-29
loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0 grid_3.1.2
[7] lazyeval_0.1.9 magrittr_1.0.1 parallel_3.1.2 Rcpp_0.11.3 reshape_0.8.5 stringr_0.6.2
[13] tools_3.1.2
A slightly off-topic, but still R/markdown one-liner for separate reports with report templates:
> library(rapport)
> sapply(levels(df$group), function(g) rapport.html('correlations', data = df[df$group == g, ], vars = c('v1', 'v2', 'v3')))
Exported to */tmp/RtmpYyRLjf/rapport-correlations-1-0.[md|html]* under 0.683 seconds.
Exported to */tmp/RtmpYyRLjf/rapport-correlations-2-0.[md|html]* under 0.888 seconds.
Exported to */tmp/RtmpYyRLjf/rapport-correlations-3-0.[md|html]* under 1.063 seconds.
The rapport package can run (predefined or custom) report templates on any (sub)dataset in markdown, then export it to HTML/docx/PDF/other formats. For a quick demo, I've uploaded the resulting documents:
rapport-correlations-1-0.html
rapport-correlations-2-0.html
rapport-correlations-3-0.html
You can do this with by (or split) and xtable (from the xtable package). Here I create xtable objects of each subset, and then loop over them to print them to file:
library('xtable')
s <- by(df, df$group, xtable)
for(i in seq_along(s)) print(s[[i]], file = paste0('df',names(s)[i],'.tex'))
If you use the stargazer package, you can get a nice summary of the dataframe instead of the dataframe itself in just one line:
library('stargazer')
by(df, df$group, stargazer, out = paste0('df',unique(df$group),'.tex'))
You should be able to easily include each of these files in, e.g., a PDF report. You could also use HTML markup using either xtable or stargazer.
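If you prefer explicit control over the output file names, the same stargazer idea can be written as a loop that writes exactly one file per group (my own variation on the answer above, not from the original):
library(stargazer)
for (g in levels(df$group)) {
  # one summary table per group, each written to its own .tex file
  stargazer(df[df$group == g, ], out = paste0('df', g, '.tex'))
}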
I am trying to get the followers of XYZ and fetch them in descending order of their follower count (say we consider the top 400 followers by follower count). The code is:
library(twitteR)
user <- getUser("XYZ")
followers <- user$getFollowers()
b <- twListToDF(followers)
f_count <- as.data.frame(b$followersCount)
u_id <- as.data.frame(b$id)
u_sname <- as.data.frame(b$screenName)
u_name <- as.data.frame(b$name)
final_df <- cbind(u_id,u_name,u_sname,f_count)
sort_fc <- final_df[order(-f_count),]
colnames(sort_fc) <- c('id','name','s_name','fol_count')
The first problem I'm facing is that it doesn't give me the list of all the followers. I surmise this is because of Twitter's 15-minute API rate limit. Is there any way I can get around this?
The second thing I'm trying to do is get the followees (friends) of these followers and again sort them in descending order of their follower count. Let's say we take the top 100 followees by follower count.
The follow-up code, which collects all these followees of followers in a data frame, is below (odd-numbered columns hold the users and even-numbered columns hold the followees' follower counts):
alpha <- as.factor(sort_fc[1:400,]$s_name)
user_followees <- rep(list(list()),10)
fof <- rep(list(list()),10)
gof <- rep(list(list()),10)
m <- data.frame(matrix(NA, nrow=100, ncol=800))
colnames(m) <- sprintf("%d", 1:800)
for (i in 1:400)
{
  user <- getUser(alpha[i])
  Sys.sleep(61)
  user_followees[[i]] <- user$getFriends(n = 100)
  fof[[i]] <- twListToDF(user_followees[[i]])$screenName
  gof[[i]] <- twListToDF(user_followees[[i]])$followersCount
  j <- 2 * i - 1
  k <- 2 * i
  m[, j] <- fof[[i]]
  m[, k] <- gof[[i]]
  c <- as.vector(m[, j])
  d <- as.vector(m[, k])
  n <- cbind(c, d)
  sort <- n[order(-d), ]
  m[, j] <- sort[, 1]
  m[, k] <- sort[, 2]
}
The error I'm getting over here is:
[1] "Unauthorized"
Error in twInterfaceObj$doAPICall(cmd, params, method, ...) :
Error: Unauthorized
I'm using Sys.sleep(61) so that I don't make more than one call per minute (the Twitter API limit is 15 calls per 15 minutes, so I assumed this would be fine).
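One hedged way to make the loop more robust (the wrapper name getFriendsSafely is mine, built only from the twitteR calls already used above) is to retry a failed call after waiting out a rate-limit window, and to record the failure and move on if it keeps failing, since protected accounts also answer "Unauthorized":
getFriendsSafely <- function(screen_name, n = 100, retries = 2, pause = 900) {
  for (attempt in seq_len(retries + 1)) {
    res <- tryCatch(getUser(screen_name)$getFriends(n = n),
                    error = function(e) e)
    if (!inherits(res, "error")) return(res)
    if (attempt <= retries) Sys.sleep(pause)  # wait out a 15-minute window
  }
  NULL  # give up on this user (e.g. a protected account)
}
In the main loop, a NULL result can then be skipped instead of aborting the whole run.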
The session info:
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] twitteR_1.1.6 rjson_0.2.12 ROAuth_0.9.3 digest_0.6.3 RCurl_1.95-4.1
[6] bitops_1.0-5
I am a novice in R and need this manipulation for building an interest graph, so I would be glad if anyone could help me with this. Thanks a lot in advance.
I have built a function in R (running on Ubuntu 12.04 LTS 64-bit, on a 4-core i7 server with hyper-threading and 6 GB RAM), where I've installed R using the standard packages:
sudo apt-get install r-base r-recommended r-base-dev
sudo apt-get install r-cran-multicore r-cran-iterators r-cran-foreach r-cran-domc
NB: I also installed foreach & doMC from within R (which didn't help either), just as I installed the deldir package:
install.packages(c("deldir"), dependencies = TRUE)
My function runs fine, but it does not use parallel cores (just maxes out 1 of the 8):
library(deldir)
library(foreach)
library(doMC)
registerDoMC(cores = 8)
#getDoParWorkers()
#getDoParName()
#getDoParVersion()

# loop through files
inputfiles <- dir(path = "/home/geoadmin/data/objects/", pattern = '.txt')
for (inputfilenr in 1:length(inputfiles))
{
  # set file variables
  curinputfile <- paste("/home/geoadmin/data/objects/", inputfiles[[inputfilenr]], sep = "")
  print(curinputfile)
  curoutputfile <- paste("/home/geoadmin/data/objects/",
                         substr(inputfiles[[inputfilenr]], start = 1, stop = 10), '.out', sep = "")

  # read the point x/y coordinates into a data frame
  points <- read.csv(curinputfile, header = TRUE, sep = ",", dec = ".", fill = TRUE)

  # set calculation variables, precision on 3 digits only because of the RDW coordinate system
  voro <- deldir(points$x, points$y, digits = 3, list(ndx = 2, ndy = 2),
                 rw = c(min(points$x) - abs(min(points$x) - max(points$x)),
                        max(points$x) + abs(min(points$x) - max(points$x)),
                        min(points$y) - abs(min(points$y) - max(points$y)),
                        max(points$y) + abs(min(points$y) - max(points$y))))
  tiles <- tile.list(voro)
  poly <- array()

  # start loop
  poly <- foreach (i = 1:length(tiles), .combine = cbind) %dopar%
  {
    # load tile info
    tile <- tiles[[i]]
    # start with EWKB notation
    curpoly <- "POLYGON(("
    # add the list of coordinates by looping through the points in the tile
    for (j in 1:length(tiles[[i]]$x)) {
      curpoly <- sprintf("%s %.6f %.6f,", curpoly, tile$x[[j]], tile$y[[j]])
    }
    # then repeat the first point to close the polygon and finish the EWKB notation,
    # adding that to the poly array
    sprintf("%s %.6f %.6f))", curpoly, tile$x[[1]], tile$y[[1]])
  }
  write.csv(t(poly), file = curoutputfile, row.names = FALSE)
}
So the results are good, but no parallelism...
doMC did register correctly:
> getDoParWorkers()
[1] 8
> getDoParName()
[1] "doMC"
> getDoParVersion()
[1] "1.2.5"
If I look at the usage (with top):
top - 01:03:19 up 9 min, 3 users, load average: 1.02, 0.86, 0.45
Tasks: 131 total, 2 running, 127 sleeping, 0 stopped, 2 zombie
Cpu(s): 12.5%us, 0.0%sy, 0.0%ni, 87.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 6104932k total, 1240512k used, 4864420k free, 16656k buffers
Swap: 6283260k total, 0k used, 6283260k free, 141996k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1553 zzzzzzzz 20 0 913m 850m 3716 R 100 14.3 8:22.03 R
So just maxing out one core. Does anyone have any idea what could cause foreach/doMC to not use multiple cores?
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] doMC_1.2.5 multicore_0.1-7 iterators_1.0.6 foreach_1.4.0
[5] deldir_0.0-19
loaded via a namespace (and not attached):
[1] codetools_0.2-8
To add the likely answer to the question:
Since foreach/multicore does work on this machine with the standard examples, the problem lies in the specific code: it is most likely the voro = deldir(...) call that takes up the time, not the foreach loop after it. That, however, means the deldir package itself would need to be adjusted. Looking at the deldir source, it seems I would need to adjust this snippet:
# Call the master subroutine to do the work:
repeat {
  tmp <- .Fortran(
    'master',
    x=as.double(x),
    y=as.double(y),
    sort=as.logical(sort),
    rw=as.double(rw),
    npd=as.integer(npd),
    ntot=as.integer(ntot),
    nadj=integer(tadj),
    madj=as.integer(madj),
    ind=integer(npd),
    tx=double(npd),
    ty=double(npd),
    ilist=integer(npd),
    eps=as.double(eps),
    delsgs=double(tdel),
    ndel=as.integer(ndel),
    delsum=double(ntdel),
    dirsgs=double(tdir),
    ndir=as.integer(ndir),
    dirsum=double(ntdir),
    nerror=integer(1),
    PACKAGE='deldir'
  )
Not sure yet how I can turn this into something that would work with foreach, though...
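Since the time is spent inside deldir() itself rather than in the polygon-assembly loop, one alternative is to move the parallelism up one level and give each input file to its own worker. This is only a sketch under the assumption that the files are independent of each other; the rw and dummy-point arguments of the original deldir() call are omitted here for brevity:
library(deldir)
library(foreach)
library(doMC)
registerDoMC(cores = 8)

inputfiles <- dir(path = "/home/geoadmin/data/objects/", pattern = '.txt')

# one file per worker; each worker runs its own (sequential) deldir() call
foreach(inputfile = inputfiles) %dopar% {
  curinputfile  <- paste0("/home/geoadmin/data/objects/", inputfile)
  curoutputfile <- paste0("/home/geoadmin/data/objects/",
                          substr(inputfile, start = 1, stop = 10), ".out")
  points <- read.csv(curinputfile, header = TRUE)
  voro   <- deldir(points$x, points$y, digits = 3)
  tiles  <- tile.list(voro)
  poly   <- sapply(tiles, function(tile) {
    # same EWKB-style string as above, closing the ring with the first point
    coords <- sprintf("%.6f %.6f", c(tile$x, tile$x[1]), c(tile$y, tile$y[1]))
    sprintf("POLYGON((%s))", paste(coords, collapse = ","))
  })
  write.csv(t(poly), file = curoutputfile, row.names = FALSE)
  NULL  # keep the combined result small
}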