arrangements package - permutations of order 128 - R

I am trying to run a simple permutation of 'x' and 'y' across 128 positions using the arrangements package in R.
I keep getting the following error message:
Error in permutations(test, k = 128, replace = TRUE) : too many results
The code that I ran was as follows:
library(arrangements)
test <- c('x','y')
permutations(test, k = 128, replace = TRUE)
sessionInfo() is as follows:
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.2.1
Is there a workaround I can use? I am also experimenting with the parallel package. Please advise.

That's way too many results, as @Rui points out in the comments:
npermutations(2, 128, replace=TRUE, bigz = TRUE)
Big Integer ('bigz') :
[1] 340282366920938463463374607431768211456
How about using the skip and nitem parameters? These allow a user to retrieve a manageable batch of results at a time.
## First 100,000 results
system.time(res <- permutations(c('x', 'y'), 128, replace = TRUE, nitem = 1e5))
# user system elapsed
# 0.146 0.000 0.146
## process res
## Next 100,000 results
res <- permutations(c('x', 'y'), 128, replace = TRUE, skip = 1e5, nitem = 1e5)
## process res
## 100,000 results starting at lexicographical index 1e19 + 1
## n.b. need to use strings or bigz type
res <- permutations(c('x', 'y'), 128, replace = TRUE,
                    skip = "10000000000000000000", nitem = 1e5)
## process res
## etc.
This can easily be generalized to parallel processing if needed.
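For example, here is a minimal sketch of that generalization using parallel::mclapply, where each worker pulls its own slice via skip/nitem. The chunk size, chunk count, and core count below are illustrative assumptions:
library(parallel)
library(arrangements)
chunk.size <- 1e5
n.chunks <- 8  # illustrative; use as many chunks as you need
## Each chunk is independent, so chunks can run in parallel.
## mclapply forks on unix-alikes (incl. macOS); use parLapply on Windows.
results <- mclapply(seq_len(n.chunks) - 1, function(i) {
  res <- permutations(c('x', 'y'), 128, replace = TRUE,
                      skip = i * chunk.size, nitem = chunk.size)
  ## process res here and return something small, e.g. a row count
  ## (if skip = 0 is rejected for the first chunk, pass skip = NULL instead)
  nrow(res)
}, mc.cores = 4)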

Related

Why does quanteda's textmodel_wordfish run infinitely when I apply to quanteda.corpora's corpus of UK party manifestos?

I'm attempting to apply wordfish to quanteda.corpora's data_corpus_ukmanifestos, but it never seems to stop running. On the other hand, when I use the example code from quanteda's wordfish tutorial, wordfish completes within seconds. Is this just a problem for me, or does it happen to others as well? How can I work around it?
This is the code I have right now. Like I said, wordfish works in seconds when run on the Irish budget speeches, but never stops running when applied to party manifestos.
## install/load packages
## install.packages(c("quanteda", "devtools"))
## devtools::install_github("quanteda/quanteda.corpora")
library(quanteda)
library(quanteda.corpora)
dfmat_irish <- dfm(data_corpus_irishbudget2010, remove_punct = TRUE)
tmod_wf <- textmodel_wordfish(dfmat_irish, dir = c(6,5))
summary(tmod_wf)
dfmat_uk <- dfm(data_corpus_ukmanifestos, remove_punct = TRUE)
wf_uk <- textmodel_wordfish(dfmat_uk, dir = c(83, 74))
How do I get wordfish to work with this corpus?
Try trimming low-frequency words. The longer the time span of a time-series corpus, the sparser your matrix becomes. There are 101 manifestos in the UK corpus, going back to 1945, so many of the terms will be very rare.
library("quanteda")
## Package version: 1.4.4
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
data(data_corpus_ukmanifestos, package = "quanteda.corpora")
system.time(
  wf_uk2 <- dfm(data_corpus_ukmanifestos, remove_numbers = TRUE, remove_punct = TRUE) %>%
    dfm_trim(min_termfreq = 10, min_docfreq = 20) %>%
    textmodel_wordfish(dir = c(83, 74))
)
## user system elapsed
## 2.274 0.124 2.356
You could also use dfm_wordstem() to reduce the feature set further, but it is best to do this before the trim operation.
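A sketch of that variant on the same corpus (keeping the trim thresholds from above, which remain illustrative):
wf_uk3 <- dfm(data_corpus_ukmanifestos, remove_numbers = TRUE, remove_punct = TRUE) %>%
  dfm_wordstem() %>%
  dfm_trim(min_termfreq = 10, min_docfreq = 20) %>%
  textmodel_wordfish(dir = c(83, 74))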

R package gamlss used inside foreach() fails to find an object

I'm trying to get leave-one-out predicted values. Please help me with this "can't find object" issue. I have searched for similar issues, but haven't managed to figure it out. This is on Windows 10.
Thanks in advance
library('gamlss')
library('foreach')
library('doParallel')
registerDoParallel(cores = 4)
# Generate data
set.seed(314)
sample.size <- 30
input.processed.cut <- data.frame(TP = round(runif(sample.size) * 100),
                                  FP = round(runif(sample.size) * 100),
                                  x = runif(sample.size))
# Fit Beta-binomial
model3 <- gamlss(formula = cbind(TP, FP) ~ x,
                 family = BB,
                 data = input.processed.cut)
# Get the leave-one-out values
loo_predict.mu <- function(model.obj, input.data) {
  yhat <- foreach(i = 1:nrow(input.data), .packages = "gamlss", .combine = rbind) %dopar% {
    updated.model.obj <- update(model.obj, data = input.data[-i, ])
    predict(updated.model.obj, what = "mu", newdata = input.data[i, ], type = "response")
  }
  return(data.frame(result = yhat[, 1], row.names = NULL))
}
par.run <- loo_predict.mu(model3, input.processed.cut)
# Error in { : task 1 failed - "object 'input.data' not found"
> version
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          4.3
year           2017
month          11
day            30
svn rev        73796
language       R
version.string R version 3.4.3 (2017-11-30)
nickname       Kite-Eating Tree
I got a response from the gamlss team and verified that their solution works. The only change needed was to provide "data" along with "newdata" in the call to predict(); apparently predict.gamlss() reconstructs the model frame from the fitting data, which it cannot locate by name inside a %dopar% worker.
loo_predict.mu <- function(model.obj, input.data) {
  yhat <- foreach(i = 1:nrow(input.data), .packages = "gamlss", .combine = rbind) %dopar% {
    updated.model.obj <- update(model.obj, data = input.data[-i, ])
    predict(updated.model.obj, what = "mu", data = input.data[-i, ],
            newdata = input.data[i, ], type = "response")
  }
  return(data.frame(result = yhat[, 1], row.names = NULL))
}
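With that change, the call from the question runs without the error:
par.run <- loo_predict.mu(model3, input.processed.cut)
head(par.run)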

Only one processor being used while running NetLogo models using parApply

I am using the 'RNetLogo' package to run sensitivity analyses on my NetLogo model. My model has 24 parameters I need to vary, so parallelising this process would be ideal! I've been following along with the example in Thiele's "Parallel processing with the RNetLogo package" vignette, which uses the 'parallel' package in conjunction with 'RNetLogo'.
I've managed to get R to initialise the NetLogo model across all 12 of my processors, which I've verified using gui=TRUE. The problem comes when I try to run the simulation code across the 12 processors using parApply. That line runs without error, but it only runs on one of the processors (using around 8% of my total CPU power). Here's a mock-up of my R code file; I've included some commented-out code at the end showing how I run the simulation without trying to parallelise:
### Load packages
library(parallel)
### Set up initialisation function
prepro <- function(dummy, gui, nl.path, model.path) {
  library(RNetLogo)
  NLStart(nl.path, gui=gui)
  NLLoadModel(model.path)
}
### Set up finalisation function
postpro <- function(x) {
  NLQuit()
}
### Set paths
# For NetLogo
nl.path <- "C:/Program Files/NetLogo 6.0/app"
nl.jarname <- "netlogo-6.0.0.jar"
# For the model
model.path <- "E:/Model.nlogo"
# For the function "sim" code
sim.path <- "E:/sim.R"
### Set base values for parameters
base.param <- c('prey-max-velocity' = 25,
                'prey-agility' = 3.5,
                'prey-acceleration' = 20,
                'prey-deceleration' = 25,
                'prey-vision-distance' = 10,
                'prey-vision-angle' = 240,
                'time-to-turn' = 5,
                'time-to-return-to-foraging' = 300,
                'time-spent-circling' = 2,
                'predator-max-velocity' = 35,
                'predator-agility' = 3.5,
                'predator-acceleration' = 20,
                'predator-deceleration' = 25,
                'predator-vision-distance' = 20,
                'predator-vision-angle' = 200,
                'time-to-give-up' = 120,
                'number-of-safe-zones' = 1,
                'number-of-target-patches' = 5,
                'proportion-obstacles' = 0.05,
                'obstacle-radius' = 2.0,
                'obstacle-radius-range' = 0.5,
                'obstacle-sensitivity-for-prey' = 0.95,
                'obstacle-sensitivity-for-predators' = 0.95,
                'safe-zone-attractiveness' = 500)
## Get names of parameters
param.names <- names(base.param)
### Load the code of the simulation function (name: sim)
source(file=sim.path)
### Convert "base.param" to a matrix, as required by parApply
base.param <- matrix(base.param, nrow=1, ncol=24)
### Get the number of simulations we want to run
design.combinations <- length(base.param[[1]])
already.processed <- 0
### Initialise NetLogo
processors <- detectCores()
cl <- makeCluster(processors)
clusterExport(cl, 'sim')
gui <- FALSE
invisible(parLapply(cl, 1:processors, prepro, gui=gui, nl.path=nl.path, model.path=model.path))
### Run the simulation across all processors, using parApply
sim.result.base <- parApply(cl, base.param, 1, sim,
                            param.names,
                            no.repeated.sim = 100,
                            trace.progress = FALSE,
                            iter.length = design.combinations,
                            function.name = "base parameters")
### Run the simulation on a single processor
#sim.result.base <- sim(base.param,
#                       param.names,
#                       no.repeated.sim = 100,
#                       my.nl1,
#                       trace.progress = TRUE,
#                       iter.length = design.combinations,
#                       function.name = "base parameters")
Here's a mock-up of the 'sim' function (adapted from Thiele's paper "Facilitating parameter estimation and sensitivity analyses of agent-based models - a cookbook using NetLogo and R"):
sim <- function(param.set, parameter.names, no.repeated.sim, trace.progress, iter.length, function.name) {
  # Some security checks
  if (length(param.set) != length(parameter.names))
    { stop("Wrong length of param.set!") }
  if (no.repeated.sim <= 0)
    { stop("Number of repetitions must be > 0!") }
  if (length(parameter.names) <= 0)
    { stop("Length of parameter.names must be > 0!") }
  # Create an empty object to collect the simulation results
  eval.values <- NULL
  # Run the repeated simulations (to control stochasticity)
  for (i in 1:no.repeated.sim)
  {
    # Create a random seed for NetLogo from R, based on min/max of NetLogo's random seed
    NLCommand("random-seed", runif(1, -2147483648, 2147483647))
    ## This is the stuff for one simulation
    cal.crit <- NULL
    # Set NetLogo parameters to current parameter values
    lapply(seq_along(parameter.names), function(x) {
      NLCommand("set ", parameter.names[x], param.set[x])
    })
    NLCommand("setup")
    # This should run "go" until prey-win != 5, i.e. when the pursuit ends
    NLDoCommandWhile("prey-win = 5", "go")
    # Report a value
    prey <- NLReport("prey-win")
    # Report another value
    pred <- NLReport("predator-win")
    ## Extract the values we are interested in
    cal.crit <- rbind(cal.crit, c(prey, pred))
    # Append to former results
    eval.values <- rbind(eval.values, cal.crit)
  }
  ## Make sure eval.values has column names (colnames, since rbind returns a matrix)
  colnames(eval.values) <- c("PreySuccess", "PredSuccess")
  # Return the mean of the repeated simulation results
  if (no.repeated.sim > 1) {
    return(colMeans(eval.values))
  }
  else {
    return(eval.values)
  }
}
I think the problem might lie in the "nl.obj" string that RNetLogo uses to identify the NetLogo instance you want to run the code on, but I've tried several ways of fixing this and haven't found one that works. When I initialise NetLogo across all the processors using the code provided in Thiele's example, I don't set an "nl.obj" value for each instance, so I'm guessing RNetLogo uses some kind of default list? However, in Thiele's original code, the "sim" function requires you to specify which NetLogo instance to run on, so R throws an error on the final line (Error in checkForRemoteErrors(val) : one node produced an error: argument "nl.obj" is missing, with no default). I have modified the "sim" function so that it doesn't require this argument and just accepts the default setting for nl.obj, but then my simulation only runs on a single processor. So I think that, by default, "sim" must be running the code on a single instance of NetLogo, and I'm not certain how to fix it.
This is also the first time I've used the 'parallel' package, so I could be missing something obvious to do with 'parApply'. Any insight would be much appreciated!
Thanks in advance!
I am still in the process of applying a similar technique to perform a Morris elementary-effects screening with my NetLogo model, and for me the parallel execution works fine. I compared your script to mine and noticed that in my version the parApply call of the simulation function (simfun) is wrapped in a function statement (see below). Maybe including such a function already solves your issue.
sim.results.morris <- parApply(cl, mo$X, 1, function(x) {
  simfun(param.set = x,
         no.repeated.sim = no.repeated.sim,
         parameter.names = input.names,
         iter.length = iter.length,
         fixed.values = fixed.values,
         model.seed = new.model.seed,
         function.name = "Morris")
})
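Translated to the call in the question, the same pattern would look like this (a sketch using the question's own variable names):
sim.result.base <- parApply(cl, base.param, 1, function(x) {
  sim(param.set = x,
      parameter.names = param.names,
      no.repeated.sim = 100,
      trace.progress = FALSE,
      iter.length = design.combinations,
      function.name = "base parameters")
})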

mcapply: all scheduled cores encountered errors in user code

The following is my code. I am trying to get the list of all the files (~20,000) that end with .idat and read each file using the function illuminaio::readIDAT.
library(illuminaio)
library(parallel)
library(data.table)
# number of cores to use
ncores = 8
# this gets all the files with .idat extension ~20000 files
files <- list.files(path = './',
                    pattern = "\\.idat$",  # match the .idat extension as a regex
                    full.names = TRUE)
# function to read the idat file and create a data.table of filename, and two more columns
# write out as csv using fwrite
get.chiptype <- function(x)
{
  idat <- readIDAT(x)
  res <- data.table(filename = x, nSNPs = nrow(idat$Quants), Chip = idat$ChipType)
  fwrite(res, file = 'output.csv', append = TRUE)  # fwrite's argument is 'file', not 'file.path'
}
# using mclapply call the function get.chiptype on all 20000 files.
# use 8 cores at a time
mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores)
After reading and writing info about 1200 files, I get the following message:
Warning message:
In mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores) :
all scheduled cores encountered errors in user code
How do I resolve it?
Calling mclapply() in some instances requires you to specify a random number generator that supports multiple streams of random numbers.
R (since version 2.14.0) includes an implementation of Pierre L'Ecuyer's multiple-stream pseudo-random number generator.
Try adding the following before the mclapply() call, with a pre-specified value for 'my.seed':
set.seed(my.seed, kind = "L'Ecuyer-CMRG")
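Put together with the code from the question, a minimal sketch (my.seed = 42 is an arbitrary placeholder):
my.seed <- 42
set.seed(my.seed, kind = "L'Ecuyer-CMRG")
mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores)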

Loading ffdf data take a lot of memory

I am facing a strange problem:
I save ffdf data using
save.ffdf()
from the ffbase package, and when I load the data in a new R session with
load.ffdf("data.f")
it occupies in RAM approximately 90% of the memory that the same data takes as a data.frame object in R.
Given this issue, it does not make much sense to use ffdf, does it?
I can't use ffsave because I am working on a server and do not have the zip app on it.
packageVersion("ff")     # 2.2.10
packageVersion("ffbase") # 0.6.3
Any ideas?
[edit] Some example code to help clarify:
data <- read.csv.ffdf(file = fn, header = T, colClasses = classes)
# file fn is a csv database with 5 columns and 2.6 million rows,
# with some factor cols and some integer cols.
data.1 <- data
save.ffdf(data.1, dir = my.dir) # my.dir is a string pointing to the directory, "C:/data/R/test.f" for example.
Closing the R session... opening it again:
load.ffdf(file.name) # file.name is a string pointing to the directory.
# That gives me the object data, with class(data) = ffdf.
Then I have a data object ffdf[5], and its memory size is almost as big as:
data.R <- data[,] # which is a data.frame.
[end of edit]
[second edit: full reproducible code]
As my question has not been answered yet and I still see the problem, here is a reproducible example:
dir1 <- 'P:/Projects/RLargeData';
setwd(dir1);
library(ff)
library(ffbase)
memory.limit(size=4000)
N = 1e7;
df <- data.frame(
  x = c(1:N),
  y = sample(letters, N, replace = T),
  z = sample(as.Date(sample(c(1:2000), N, replace = T), origin = "1970-01-01")),
  w = factor(sample(c(1:N/10), N, replace = T)))
df[1:10,]
dff <- as.ffdf(df)
head(dff)
#str(dff)
save.ffdf(dff, dir = "dframeffdf")
dim(dff)
# on disk, the directory "dframeffdf" is : 205 MB (215.706.264 bytes)
### resetting R :: fresh RStudio Session
dir1 <- 'P:/Projects/RLargeData';
setwd(dir1);
library(ff)
library(ffbase)
memory.size() # 15.63
load.ffdf(dir = "dframeffdf")
memory.size() # 384.42
gc()
memory.size() # 287
So we have 384 MB in memory, and after gc() there are 287 MB, which is around the size of the data on disk (checked also in the "Process Explorer" application for Windows).
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C LC_TIME=Danish_Denmark.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods base
other attached packages:
[1] ffbase_0.7-1 ff_2.2-10 bit_1.1-9
[end of second edit]
In ff, when you have factor columns, the factor levels are always in RAM. ff character columns currently don't exist, so character columns are converted to factors in an ffdf.
Regarding your example: your 'w' column in 'dff' contains more than 6 million levels, and these levels are all in RAM. If you didn't have columns with a lot of levels, you wouldn't see the RAM increase, as shown below using your example with 'w' left numeric instead of converted to a factor.
N = 1e7;
df <- data.frame(
  x = c(1:N),
  y = sample(letters, N, replace = T),
  z = sample(as.Date(sample(c(1:2000), N, replace = T), origin = "1970-01-01")),
  w = sample(c(1:N/10), N, replace = T))
dff <- as.ffdf(df)
save.ffdf(dff, dir = "dframeffdf")
### resetting R :: fresh RStudio Session
library(ff)
library(ffbase)
memory.size() # 14.67
load.ffdf(dir = "dframeffdf")
memory.size() # 14.78
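To confirm where the memory goes in the original example, a quick check (a sketch, assuming the factor version of dff from the question is loaded):
length(levels(dff$w)) # more than 6 million levels, all held in RAM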
The ff package(s) have mechanisms for segregating objects into 'physical' and 'virtual' storage. I suspect you are implicitly constructing items in physical memory, but since you offered no code showing how this workspace was created, there is only so much guessing that is possible.
