I am creating a package that I hope to eventually put on CRAN. I have coded much of the package in C++ with the help of Rcpp and would now like to parallelize this C++ code. I am using the foreach package; however, I am open to switching to snow or a different library if that would work better.
I started by trying to parallelize a simple function:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
arma::vec rNorm_c(int length) {
return arma::vec(length, arma::fill::randn);
}
/*** R
n_workers <- parallel::detectCores(logical = FALSE)
cl <- parallel::makeCluster(n_workers)
doParallel::registerDoParallel(cl)
n <- 10
library(foreach)
foreach(j = rep(n, n),
.noexport = c("rNorm_c"),
.packages = "Rcpp") %dopar% {rNorm_c(j)}
*/
I added the .noexport because without it, I get the error Error in { : task 1 failed - "NULL value passed as symbol address". This led me to this SO post, which suggested doing this.
However, I now receive the error Error in { : task 1 failed - "could not find function "rNorm_c"", presumably because I have not followed the top answer's instructions to load the function separately at each node. I am unsure how to do this.
This SO post demonstrates how to do this by writing the C++ code inline. However, since the C++ code for my package consists of multiple functions, this is likely not the best solution. This SO post advises creating a local package for the workers to load and call. However, since I hope to make this code available in a CRAN package, a local package does not seem feasible unless I attempt to publish two CRAN packages.
Any suggestions for how to approach this or references to resources for parallelization of Rcpp code would be appreciated.
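A minimal sketch of the "load it on each node" idea, assuming the compiled function lives in an installed package. Here the base package tools and its function file_ext() are only stand-ins for that package and for rNorm_c(), and the cluster size of 2 is arbitrary.

```r
library(parallel)

cl <- makeCluster(2)

# Load the package on every worker so its functions are found there.
# 'tools' is a stand-in for the package holding the compiled code.
invisible(clusterEvalQ(cl, library(tools)))

# The workers can now call the package's functions by name.
res <- parSapply(cl, c("a.cpp", "b.R"), function(f) file_ext(f))

stopCluster(cl)
res
```

With foreach, the .packages argument serves the same purpose of loading a package on each worker.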
EDIT:
I used the above function to create a package called rnormParallelization. In this package, I also included a couple of R functions, one of which made use of the snow package to parallelize a for loop using the rNorm_c function:
rNorm_samples_for <- function(num_samples, length){
sample_mat <- matrix(NA, length, num_samples)
for (j in 1:num_samples){
sample_mat[ , j] <- rNorm_c(length)
}
return(sample_mat)
}
rNorm_samples_snow1 <- function(num_samples, length){
clus <- snow::makeCluster(3)
snow::clusterExport(clus, "rNorm_c")
out <- snow::parSapply(clus, rep(length, num_samples), rNorm_c)
snow::stopCluster(clus)
return(out)
}
Both functions work as expected:
> rNorm_samples_for(2, 3)
[,1] [,2]
[1,] -0.82040308 -0.3284849
[2,] -0.05169948 1.7402912
[3,] 0.32073516 0.5439799
> rNorm_samples_snow1(2, 3)
[,1] [,2]
[1,] -0.07483493 1.3028315
[2,] 1.28361663 -0.4360829
[3,] 1.09040771 -0.6469646
However, the parallelized version runs considerably slower:
> microbenchmark::microbenchmark(
+ rnormParallelization::rNorm_samples_for(1e3, 1e4),
+ rnormParallelization::rNorm_samples_snow1(1e3, 1e4)
+ )
Unit: milliseconds
                                                   expr       min        lq      mean    median        uq       max neval
   rnormParallelization::rNorm_samples_for(1000, 10000)  217.0871  249.3977  320.5456  285.9787  325.3447  802.7488   100
 rnormParallelization::rNorm_samples_snow1(1000, 10000) 1242.8315 1397.7643 1527.0406 1482.5867 1563.0916 3411.5774   100
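One plausible contributor to the gap, not verified against this package, is that the snow version starts a new cluster on every call and sends one small task per sample. The sketch below assumes a hypothetical variant that reuses a cluster supplied by the caller and gives each worker one block of columns; stats::rnorm() stands in for the compiled rNorm_c(), and the names make_block and rnorm_samples_chunked are illustrative only.

```r
library(parallel)

# Worker task: build a block of columns in one go.
# stats::rnorm() stands in for the compiled rNorm_c() here.
make_block <- function(idx, len) {
  matrix(stats::rnorm(len * length(idx)), nrow = len)
}

# Hypothetical variant: the caller supplies an existing cluster.
rnorm_samples_chunked <- function(cl, num_samples, len) {
  # Split the column indices into one chunk per worker
  chunks <- splitIndices(num_samples, length(cl))
  blocks <- parLapply(cl, chunks, make_block, len = len)
  do.call(cbind, blocks)
}

cl <- makeCluster(2)
out <- rnorm_samples_chunked(cl, num_samples = 10, len = 5)
stopCluster(cl)
dim(out)  # 5 10
```

Whether this closes the gap depends on how expensive each call is relative to the cost of shipping results back to the master process.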
Here is my session info:
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rnormParallelization_1.0
loaded via a namespace (and not attached):
[1] microbenchmark_1.4-7 compiler_4.1.1 snow_0.4-4
[4] parallel_4.1.1 tools_4.1.1 Rcpp_1.0.7
GitHub repo with both of these scripts
I'm not understanding how to do indirect subscripting in %dopar% or in llply( .parallel = TRUE). My actual use-case is a list of formulas, then generating a list of glmer results in a first foreach %dopar%, then calling PBmodcomp on specific pairs of results in a separate foreach %dopar%. My toy example, using numeric indices rather than names of objects in the lists, works fine for %do% but not %dopar%, and fine for alply without .parallel = TRUE but not with .parallel = TRUE. [My real example with glmer and indexing lists by names rather than by integers works with %do% but not %dopar%.]
library(doParallel)
library(foreach)
library(plyr)
cl <- makePSOCKcluster(2) # tiny for toy example
registerDoParallel(cl)
mB <- c(1,2,1,3,4,10)
MO <- c("Full", "noYS", "noYZ", "noYSZS", "noS", "noZ",
"noY", "justS", "justZ", "noSZ", "noYSZ")
# Works
testouts <- foreach(i = 1:length(mB)) %do% {
# mB[i]
MO[mB[i]]
}
testouts
# all NA
testouts2 <- foreach(i = 1:length(mB)) %dopar% {
# mB[i]
MO[mB[i]]
}
testouts2
# Works
testouts3 <- alply(mB, 1, .fun = function(i) { MO[mB[i]]} )
testouts3
# fails "$ operator is invalid for atomic vectors"
testouts4 <- alply(mB, 1, .fun = function(i) { MO[mB[i]]},
.parallel = TRUE,
.paropts = list(.export=ls(.GlobalEnv)))
testouts4
stopCluster(cl)
I've tried various combinations of double brackets like MO[mB[[i]]], to no avail. mB[i] instead of MO[mB[i]] works in all 4 and returns a list of the numbers. I've tried .export(c("MO", "mB")) but just get the message that those objects are already exported.
I assume that there's something I misunderstand about evaluation of expressions like MO[mB[i]] in different environments, but there may be other things I misunderstand, too.
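As a reference point that does not depend on any parallel backend, the expected values can be checked with plain vectorised indexing. This is only a sanity check of what each loop should return, not a diagnosis of the %dopar% behaviour.

```r
mB <- c(1, 2, 1, 3, 4, 10)
MO <- c("Full", "noYS", "noYZ", "noYSZS", "noS", "noZ",
        "noY", "justS", "justZ", "noSZ", "noYSZ")

# Element-wise lookup, as in the loop bodies
by_loop <- vapply(seq_along(mB), function(i) MO[mB[i]], character(1))

# Vectorised lookup of the same values
by_vector <- MO[mB]

identical(by_loop, by_vector)  # TRUE
by_vector
# "Full" "noYS" "Full" "noYZ" "noYSZS" "noSZ"
```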
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] plyr_1.8.4 doParallel_1.0.13 iterators_1.0.9 foreach_1.5.0
loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1 listenv_0.7.0 Rcpp_0.12.17
[5] codetools_0.2-15 digest_0.6.15 globals_0.12.1 future_1.8.1
[9] fortunes_1.5-5
The problem appears to be with version 1.5.0 of foreach on R-Forge. Version 1.4.4 from CRAN works fine for both foreach %dopar% and llply(.parallel = TRUE). For anyone finding this post when searching for %dopar% with lists, here's the code, where mList is a named list of formulas and tList is a named list of pairs of model names to be compared.
tList <- list(Z1 = c("Full", "noYZ"),
Z2 = c("noYS", "noYSZS"),
S1 = c("Full", "noYS"),
S2 = c("noYZ", "noYSZS"),
A1 = c("noYSZS", "noY"),
A2 = c("noSZ", "noYSZ")
)
cl <- makePSOCKcluster(params$nCores) # value from YAML params:
registerDoParallel(cl)
# first run the models
modouts <- foreach(imod = 1:length(mList),
.packages = "lme4") %dopar% {
glmer(as.formula(mList[[imod]]),
data = dsn,
family = poisson,
control = glmerControl(optimizer = "bobyqa",
optCtrl = list(maxfun = 100000),
check.conv.singular = "warning")
)
}
names(modouts) <- names(mList)
####
# now run the parametric bootstrap tests
nSim <- 500
testouts <- foreach(i = seq_along(tList),
.packages = "pbkrtest") %dopar% {
PBmodcomp(modouts[[tList[[i]][1]]],
modouts[[tList[[i]][2]]],
nsim = nSim)
}
names(testouts) <- names(tList)
stopCluster(cl)
I am trying to run a reproducible example in parallel with the mlr R package, and the solution I have found is to use parallelStartMulticore. The project also runs with packrat.
The code runs properly on workstations and small servers, but on an HPC with the Torque batch system it exhausts the available memory. R threads appear to be spawned ad infinitum, unlike on regular Linux machines. I have tried switching to parallelStartSocket, which works fine, but then I cannot reproduce the results with RNG seeds.
Here is a minimal example:
library(mlr)
library(parallelMap)
M <- data.frame(x = runif(1e2), y = as.factor(rnorm(1e2) > 0))
# Example with random forest
parallelStartMulticore(parallel::detectCores())
plyr::l_ply(
seq(100),
function(x) {
message("Iteration number: ", x)
set.seed(1, "L'Ecuyer")
tsk <- makeClassifTask(data = M, target = "y")
num_ps <- makeParamSet(
makeIntegerParam("ntree", lower = 10, upper = 50),
makeIntegerParam("nodesize", lower = 1, upper = 5)
)
ctrl <- makeTuneControlGrid(resolution = 2L, tune.threshold = TRUE)
# define learner
lrn <- makeLearner("classif.randomForest", predict.type = "prob")
rdesc <- makeResampleDesc("CV", iters = 2L, stratify = TRUE)
# Grid search in parallel
res <- tuneParams(
lrn, task = tsk, resampling = rdesc, par.set = num_ps,
measures = list(auc), control = ctrl)
# Fit optimal params
lrn.optim <- setHyperPars(lrn, par.vals = res$x)
m <- train(lrn.optim, tsk)
# Test set
pred_rf <- predict(m, newdata = M)
pred_rf
}
)
parallelStop()
The hardware of the HPC is an HP Apollo 6000 System ProLiant XL230a Gen9 Server blade (64-bit) with Intel Xeon E5-2683 processors. I do not know whether the issue comes from the Torque batch system, the hardware, or a flaw in the above code. The sessionInfo() of the HPC:
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /cm/shared/apps/intel/parallel_studio_xe/2017/compilers_and_libraries_2017.0.098/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] parallelMap_1.3 mlr_2.11 ParamHelpers_1.10 RLinuxModules_0.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 splines_3.4.0 munsell_0.4.3
[4] colorspace_1.3-2 lattice_0.20-35 rlang_0.1.1
[7] plyr_1.8.4 tools_3.4.0 parallel_3.4.0
[10] grid_3.4.0 packrat_0.4.8-1 checkmate_1.8.2
[13] data.table_1.10.4 gtable_0.2.0 randomForest_4.6-12
[16] survival_2.41-3 lazyeval_0.2.0 tibble_1.3.1
[19] Matrix_1.2-12 ggplot2_2.2.1 stringi_1.1.5
[22] compiler_3.4.0 BBmisc_1.11 scales_0.4.1
[25] backports_1.0.5
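On the reproducibility point with socket workers: in base R's parallel package, clusterSetRNGStream() gives each worker its own L'Ecuyer-CMRG stream derived from one seed. The sketch below uses only the parallel package; whether parallelMap's socket mode exposes or honours this has not been checked here, and run_once and draw_one are hypothetical helper names.

```r
library(parallel)

# Worker task: one uniform draw per index
draw_one <- function(i) stats::runif(1)

run_once <- function(seed) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  # Give every worker its own reproducible L'Ecuyer-CMRG stream
  clusterSetRNGStream(cl, iseed = seed)
  parSapply(cl, 1:4, draw_one)
}

a <- run_once(1)
b <- run_once(1)
identical(a, b)  # TRUE
```

A set.seed() call in the master process alone does not control the streams of socket workers, which are separate R processes.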
The "multicore" parallelMap backend uses parallel::mcmapply which should create a new fork()ed child process for every evaluation inside tuneParams and then quickly kill that process. Depending on what you use to count memory usage / active processes, it is possible that memory gets mis-reported and that child processes that are already dead (and were only alive for the fraction of a second) are shown, or that killing of finished processes for some reason does not happen.
Possible problems:
The batch system does not correctly track memory usage and counts the parent process's memory for every child separately. Does /usr/bin/free actually report that 30GB are gone while the script is running? As an easier test case, consider (running in an empty R session)
xxx <- 1:1e9
parallel::mclapply(1:4, function(x) {
Sys.sleep(60)
}, mc.cores = 4)
which should use about 4 GB of memory. If, during the 60 seconds in the child process, the reported memory usage is about 16 GB, it is this problem.
Memory reporting is accurate, but for some reason the memory space is changed a lot inside the child processes (triggering many COW writes), e.g. because of garbage collection. Does calling gc() before the tuneParams() call help?
Some setting on the machine prevents the "parallel" package from killing child processes. The following:
parallel::mclapply(1:4, function(x) {
xxx <<- 1:1e9 ; NULL
}, mc.cores = 4)
Sys.sleep(60)
should grab about 16 GB of memory, but release it right away. If the memory remains used during the Sys.sleep (and the remaining R session), it might be this problem.
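The "about 4 GB" figure in the test cases follows from four bytes per integer. A small worked check, measuring a vector of one million integers and scaling up instead of allocating the full vector:

```r
# Bytes used by a one-million-element integer vector
small <- integer(1e6)
bytes_small <- as.numeric(object.size(small))

# Scale up to 1e9 elements and convert to GiB
est_gib <- bytes_small / 1e6 * 1e9 / 2^30
round(est_gib, 2)  # 3.73
```

On R 3.5.0 and later, 1:1e9 may be stored as a compact sequence and not occupy this much memory until it is modified; the session in the question uses R 3.4.0, where the vector is allocated in full.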
With runjags, I am trying to monitor a very large number of values. The monitor list is given as a vector of strings. In this case I am asking to monitor just three values: Y[14], Y[15] and Y[3].
run.jags(model="model.MC.txt",data=list(Y=Y.NA.Rep,sizes=sizesB,cumul=cumul),
monitor=c("thetaj", "Y[14]", "Y[15]","Y[3]"))
Suppose I wanted to monitor hundreds of values. I can create this vector of strings, but the call just returns the continuation prompt "+" and fails to run.
Is there some upper limit on the size of strings that can be created and passed in as arguments?
Is there a better way (non string) to pass this list into run.jags?
The only way I have been able to get it to run is to paste the string literals into the function call; a variable containing the strings does not work.
The longer run list looks something like this:
run.jags(model="model.MC.txt",data=list(Y=Y.NA.Rep,sizes=sizesB,cumul=cumul)
,monitor=c('Y[14]', 'Y[15]', 'Y[18]', 'Y[26]', 'Y[41]',
'Y[55]', 'Y[62]', 'Y[72]', 'Y[80]', 'Y[81]', 'Y[128]', 'Y[138]',
'Y[180]', 'Y[188]', 'Y[191]', 'Y[209]', 'Y[224]', 'Y[244]', '
'Y[255]', 'Y[263]', 'Y[282]', 'Y[292]', 'Y[303]', 'Y[324]',
'Y[349]', 'Y[358]', 'Y[359]', 'Y[365]', 'Y[384]',
... many lines deleted
'Y[1882]', 'Y[1895]', 'Y[1899]', 'Y[1903]', 'Y[1918]', 'Y[1922]',
'Y[1929]', 'Y[1942]', 'Y[1953]', 'Y[1990]'))
I'm not sure that this is a problem with runjags - the following code has 1002 monitors and runs just fine:
model <- "model {
for(i in 1 : N){ #data# N
Y[i] ~ dnorm(true.y[i], precision) #data# Y
true.y[i] <- (m * X[i]) + c #data# X
}
m ~ dnorm(0, 10^-3)
c ~ dnorm(0, 10^-3)
precision ~ dgamma(10^-3, 10^-3)
}"
X <- 1:1000
Y <- rnorm(length(X), 2*X + 10, 1)
N <- length(X)
monitors <- c('m','c',paste0('Y[',1:1000,']'))
results <- run.jags(model, n.chains=2, monitor=monitors, sample=100, method='rjags')
results <- run.jags(model, n.chains=2, monitor=monitors, sample=100, method='inter')
I have also tried writing the string directly into the function call by using:
cat('monitor = c("'); cat(monitors, sep='", "'); cat('")\n')
...and copy/pasting the resulting text as the monitor argument - that still works for me in R.app but when pasting into RStudio I get:
> results <- run.jags(model, n.chains=2, monitor = c("m", "c", "Y[1]", "Y[2]", "Y[3]", "Y[4]", "Y[5]", "Y[6]", "Y[7]", "Y[8]", "Y[9]", "Y[10]", "Y[11]", "Y[12]", "Y[13]", "Y[14]", "Y[15]", "Y[16]", "Y[17]", "Y[18]", "Y[19]", "Y[20]", "Y[21]", "Y[22]", "Y[23]", "Y[24]", "Y[25]", "Y[26]", "Y[27]", "Y[28]", "Y[29]", "Y[30]", "Y[31]", "Y[32]", "Y[33]", "Y[34]", "Y[35]", "Y[36]", "Y[37]", "Y[38]", "Y[39]", "Y[40]", "Y[41]", "Y[42]", "Y[43]", "Y[44]", "Y[45]", "Y[46]", "Y[47]", "Y[48]", "Y[49]", "Y[50]", "Y[51]", "Y[52]", "Y[53]", "Y[54]", "Y[55]", "Y[56]", "Y[57]", "Y[58]", "Y[59]", "Y[60]", "Y[61]", "Y[62]", "Y[63]", "Y[64]", "Y[65]", "Y[66]", "Y[67]", "Y[68]", "Y[69]", "Y[70]", "Y[71]", "Y[72]", "Y[73]", "Y[74]", "Y[75]", "Y[76]", "Y[77]", "Y[78]", "Y[79]", "Y[80]", "Y[81]", "Y[82]", "Y[83]", "Y[84]", "Y[85]", "Y[86]", "Y[87]", "Y[88]", "Y[89]", "Y[90]", "Y[91]", "Y[92]", "Y[93]", "Y[94]", "Y[95]", "Y[96]", "Y[97]", "Y[98]", "Y[99]", "Y[100]", "Y[101]", "Y[102]", "Y[103]", "Y[104]", "Y[105]... <truncated>
+
+
Which is somewhat similar to your description. So I'm guessing that you are using RStudio and that the problem is to do with the maximum length of a line of code that can be interpreted by RStudio.
If so, the fix is to simply hard wrap the command so it is broken over multiple lines - I tried this with 72 character width (100+ lines) and it works fine in RStudio. If my assumption is incorrect please modify your question to give more details of how you are running R, and your system using:
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] runjags_2.0.4-2
loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0 parallel_3.4.0 coda_0.19-1 grid_3.4.0 rjags_4-6 lattice_0.20-35
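The hard-wrapping suggested above can be generated instead of done by hand. A minimal sketch in base R, using the same monitors vector as in the example; the width of 72 matches the one tried above.

```r
monitors <- c("m", "c", paste0("Y[", 1:1000, "]"))

# Quote each name and join with ", " so strwrap() can break at the spaces
quoted <- paste0('"', monitors, '"', collapse = ", ")
wrapped <- strwrap(quoted, width = 72)

# Print a pasteable, hard-wrapped monitor argument
cat("monitor = c(", wrapped, ")", sep = "\n")

# Longest generated line
max(nchar(wrapped))
```

Passing the vector itself (monitor = monitors) avoids pasting long literals altogether, as in the run.jags() calls above.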
I've been getting WFA to run on the full set of intraday GBPUSD 30min data, and have come across a couple of things that need addressing. The first is I believe the save function needs changing to remove the time from the string (as shown here as a pull request on the R-Finance/quantstrat repo on github). The walk.forward function throws this error:
Error in gzfile(file, "wb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "wb") :
cannot open compressed file 'wfa.GBPUSD.2002-10-21 00:30:00.2002-10-23 23:30:00.RData', probable reason 'Invalid argument'
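The file name in the warning contains colons, which Windows does not allow in file names. A minimal sketch of one way to build a colon-free name; safe_audit_name is a hypothetical helper, not a quantstrat function, and unlike the pull request mentioned above it keeps the time in a compact form instead of removing it.

```r
# Hypothetical helper: build an audit file name without ':' characters
safe_audit_name <- function(prefix, symbol, start, end) {
  # Compact timestamp such as 20021021T003000
  stamp <- function(x) format(as.POSIXct(x, tz = "UTC"), "%Y%m%dT%H%M%S")
  paste(prefix, symbol, stamp(start), stamp(end), "RData", sep = ".")
}

safe_audit_name("wfa", "GBPUSD",
                "2002-10-21 00:30:00", "2002-10-23 23:30:00")
# "wfa.GBPUSD.20021021T003000.20021023T233000.RData"
```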
The second is a rare scenario in which it ends up calling runSum on a data set with fewer rows than the period (n) being tested. This is the traceback():
8: stop("Invalid 'n'")
7: runSum(x, n)
6: runMean(x, n)
5: (function (x, n = 10, ...)
{
ma <- runMean(x, n)
if (!is.null(dim(ma))) {
colnames(ma) <- "SMA"
}
return(ma)
})(x = Cl(mktdata)[, 1], n = 25)
4: do.call(indFun, .formals)
3: applyIndicators(strategy = strategy, mktdata = mktdata, parameters = parameters,
...)
2: applyStrategy(strategy, portfolios = portfolio.st, mktdata = symbol[testing.timespan]) at custom.walk.forward.R#122
1: walk.forward(strategy.st, paramset.label = "WFA", portfolio.st = portfolio.st,
account.st = account.st, period = "days", k.training = 3,
k.testing = 1, obj.func = my.obj.func, obj.args = list(x = quote(result$apply.paramset)),
audit.prefix = "wfa", anchored = FALSE, verbose = TRUE)
The extended GBPUSD data used in the creation of the Luxor Demo includes an erroneous date (2002/10/27) with only 1 observation which causes this problem. I can also foresee this being an issue when testing longer signal periods on instruments like Crude where they have only a few trading hours on Sunday evenings (UTC).
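A minimal sketch of a guard for the short-window case, written in base R so it runs without TTR. guarded_sma is a hypothetical indicator function, not part of TTR or quantstrat, and returning NA for a too-short window is an assumption about the desired behaviour.

```r
# Hypothetical guard: return NAs instead of failing when the
# window is longer than the data
guarded_sma <- function(x, n = 10) {
  x <- as.numeric(x)
  if (length(x) < n) {
    return(rep(NA_real_, length(x)))
  }
  # Trailing moving average; the first n - 1 values are NA
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

guarded_sma(1:3, n = 25)        # too short: NA NA NA, no error
guarded_sma(1:30, n = 25)[25]   # mean of 1:25, i.e. 13
```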
Given that I have purely been following the Luxor demo with the same (extended) intra-day data set, are these genuine issues or have they been caused by package updates etc?
What is the preferred way for these things to be reported to the authors of QS, and find out if/when fixes are likely to be made?
SessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quantstrat_0.9.1739 foreach_1.4.3 blotter_0.9.1741 PerformanceAnalytics_1.4.4000 FinancialInstrument_1.2.0 quantmod_0.4-5 TTR_0.23-1
[8] xts_0.9.874 zoo_1.7-13
loaded via a namespace (and not attached):
[1] compiler_3.3.0 tools_3.3.0 codetools_0.2-14 grid_3.3.0 iterators_1.0.8 lattice_0.20-33
quantstrat is on github here:
https://github.com/braverock/quantstrat
Issues and patches should be reported via github issues.
I'm running some R scripts that scrape information from several websites. The problem is that even though I clean the session with gc(), memory use keeps growing until my session crashes.
Here is the script:
library(XML)
library(RJDBC)
library(RCurl)
procesarPublicaciones <- function(tabla){
log_file <<- file(log_path, open="a")
drv <<- JDBC("oracle.jdbc.OracleDriver", classPath="C:/jdbc/jre6/ojdbc6.jar"," ")
con <<- dbConnect(drv, "server_path", "user", "password")
query <- paste("SELECT * FROM",tabla,sep=' ')
bool <- tryCatch(
{
## Get a list of URLs from a DB
listUrl <- dbGetQuery(con, query)
dbDisconnect(con)
## The last expression is the value of the block: TRUE if any rows came back
nrow(listUrl) != 0
}, error = function(e) FALSE
)
if( bool ) {
file.create(data_file)
apply(listUrl,c(1),procesarHtml)
}else{
cat("\n",getTime(),"\t[ERROR]\t\t", file=log_file)
}
cat( "\n",getTime(),"\t[INFO]\t\t FINISH", file=log_file)
close(log_file)
}
procesarHtml <- function(pUrl){
headerGatherer <- basicHeaderGatherer()
## curlHandle is assumed to be a curl handle defined elsewhere in the script
html <- getURI(pUrl, headerfunction = headerGatherer$update, curl = curlHandle)
heatherValue <- headerGatherer$value()
if ( heatherValue["status"] == "200" ){
doc <- htmlParse(html)
tryCatch(
{
## Here I get all the info that I need from the web and write it on a file.
## here is a simplification
info1 <- xpathSApply(doc, xPath.info1, xmlValue)
info2 <- xpathSApply(doc, xPath.info2, xmlValue)
data <- data.frame(col1 = info1, col2=info2)
write.table(data, file=data_file , sep=";", row.names=FALSE, col.names=FALSE, append=TRUE)
}, error= function(e)
{
## LOG ERROR
}
)
rm(info1, info2, data, doc)
}else{
## LOG INFO
}
rm(headerGatherer,html,heatherValue)
cat("\n",getTime(),"\t[INFO]\t\t memory used: ", memory.size()," MB", file=log_file)
gc()
cat("\n",getTime(),"\t[INFO]\t\t memory used after gc(): ", memory.size()," MB", file=log_file)
}
Even though I remove all internal variables with rm() and use gc(), memory keeps growing. It seems that all the html that I get from the web is kept in memory.
Here is my Session Info:
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows XP (build 2600) Service Pack 3
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.95-4.6 bitops_1.0-6 RJDBC_0.2-5 rJava_0.9-6 DBI_0.3.1
[6] XML_3.98-1.1
loaded via a namespace (and not attached):
[1] tools_3.2.0
--------------------EDIT 2015-06-08 --------------------
I'm still having the problem, but I found the same issue in another post, where it was apparently resolved:
Serious Memory Leak When Iteratively Parsing XML Files
When using the XML package, you'll want to use free() to release the memory allocated by htmlParse() (or any of the other html parsing functions that allocate memory at the C level). I usually place a call to free(doc) as soon as I don't need the html doc any more.
So in your case, I would try placing free(doc) on its own line prior to rm(info1, info2, data, doc) in your function, like this:
free(doc)
rm(info1, info2, data, doc)
In fact the call to free() may be sufficient enough that you could remove the rm() call completely.
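A minimal sketch of the free() pattern, assuming the XML package is installed. The inline HTML strings stand in for downloaded pages, and extract_text is a hypothetical helper.

```r
library(XML)

pages <- c("<html><body><p>one</p></body></html>",
           "<html><body><p>two</p></body></html>")

extract_text <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  # Release the C-level document when the function exits, even on error
  on.exit(free(doc))
  xpathSApply(doc, "//p", xmlValue)
}

vapply(pages, extract_text, character(1), USE.NAMES = FALSE)
# "one" "two"
```

Using on.exit() keeps the release next to the allocation, so an error in the extraction step does not skip it.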
I had a related issue using htmlParse. It led to Windows crashing (out of memory) before my 10,000 iterations completed.
Answer:
in addition to free/remove - do a garbage collect gc() (as suggested in Serious Memory Leak When Iteratively Parsing XML Files ) every n iterations
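A minimal sketch of collecting every n iterations. The interval gc_every is a hypothetical value to tune, and the loop body is left as a placeholder for the parsing work.

```r
n_iter <- 10000
gc_every <- 500   # hypothetical interval; tune to the workload

gc_calls <- 0
for (i in seq_len(n_iter)) {
  # ... parse one page, extract values, free(doc) ...
  if (i %% gc_every == 0) {
    gc()
    gc_calls <- gc_calls + 1
  }
}
gc_calls  # 20
```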