I wish to use the buffer.dist() function from the GSIF package developed by Tomislav Hengl et al. (2018). The package has not been updated since 2019 and was taken down from CRAN.
I downloaded the latest version of GSIF (v0.5-5, 2019-01-04) from the CRAN archive and loaded the functions manually into the R workspace. All functions can be found in the folder "R".
> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6
# Manually load the GSIF environment (downloaded manually from the CRAN archive)
source("AAAA.R") # needs to be loaded first
# Manually load function buffer.dist()
source("buffer.dist.R")
# Load library
library(sp)
library(plotKML)
library(raster)
library(gstat)
## Follow the workflow in the tutorial: https://peerj.com/articles/5518/GeoMLA_README_thengl.pdf
# Load example data from gstat package
data(meuse)
data(meuse.grid)
# transform into SpatialPoints objects (input data requirement for buffer.dist() )
meuse.sp <- SpatialPointsDataFrame(meuse[1:2], meuse[3:14], proj4string = CRS('+init=epsg:4326'))
meuse.grid.spdf <- SpatialPixelsDataFrame(meuse.grid[1:2], meuse.grid[6], proj4string = CRS('+init=epsg:4326'))
# derive buffer distances for each individual point, using the buffer function in the raster package, which derives a gridded map for each observation point
grid.dist0 <- buffer.dist(meuse.sp["zinc"],
meuse.grid.spdf[1],
as.factor(1:nrow(meuse.sp)))
This gives me the following error message:
Error in x@coords[i, , drop = FALSE] : subscript out of bounds
Here is the buffer.dist() function (Hengl et al., 2018):
setMethod("buffer.dist", signature(observations = "SpatialPointsDataFrame", predictionDomain = "SpatialPixelsDataFrame"), function(observations, predictionDomain, classes, width, ...){
if(missing(width)){ width <- sqrt(areaSpatialGrid(predictionDomain)) }
if(!length(classes)==length(observations)){ stop("Length of 'observations' and 'classes' does not match.") }
## remove classes without any points:
xg = summary(classes, maxsum=length(levels(classes)))
selg.levs = attr(xg, "names")[xg > 0]
if(length(selg.levs)<length(levels(classes))){
fclasses <- as.factor(classes)
fclasses[which(!fclasses %in% selg.levs)] <- NA
classes <- droplevels(fclasses)
}
## derive buffer distances
s <- list(NULL)
for(i in 1:length(levels(classes))){
s[[i]] <- raster::distance(rasterize(observations[which(classes==levels(classes)[i]),1]#coords, y=raster(predictionDomain)), width=width, ...)
}
s <- s[sapply(s, function(x){!is.null(x)})]
s <- brick(s)
s <- as(s, "SpatialPixelsDataFrame")
s <- s[predictionDomain#grid.index,]
return(s)
})
I went through all steps of the function manually. The error seems to occur on the second-to-last line:
s <- s[predictionDomain@grid.index,]
Error in x@coords[i, , drop = FALSE] : subscript out of bounds
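To narrow down where the subscript goes out of bounds, the relevant sizes can be compared just before the failing line (a diagnostic sketch; s here is the SpatialPixelsDataFrame produced two lines earlier):
length(predictionDomain@grid.index)  # pixels in the prediction grid
nrow(s)                              # rows available in the distance object
max(predictionDomain@grid.index)     # the error follows if this exceeds nrow(s)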
Do you have any suggestions on how to fix the issue?
You do not describe what that method does, but it seems that it does something like this:
bufdist <- function(obs, r, classes, width) {
  s <- list()
  cls <- sort(unique(classes))
  for (i in 1:length(cls)) {
    # rasterize the points of class i onto the template raster
    obsi <- obs[classes == cls[i], ]
    x <- rasterize(obsi, r)
    # one buffer layer per class; cells outside the buffer become NA
    s[[i]] <- buffer(x, width, background=NA)
  }
  names(s) <- cls
  # stack the per-class layers into one multi-layer SpatRaster
  rast(s)
}
library(terra)
f <- system.file("ex/elev.tif", package="terra")
r <- rast(f)
set.seed(1)
v <- spatSample(r, 50, as.points=TRUE)
cls <- sample(LETTERS[1:4], 50, replace=TRUE)
b <- bufdist(v, r, cls, 7500)
plot(b, col="red")
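If you want gridded distances to each class (what GSIF::buffer.dist computed via raster::distance) rather than fixed-width buffers, here is a variant of the same sketch; it reuses v, r and cls from above and relies on terra::distance() measuring, for a SpatRaster, the distance from each NA cell to the nearest non-NA cell:
distdist <- function(obs, r, classes) {
  s <- list()
  cls <- sort(unique(classes))
  for (i in 1:length(cls)) {
    # rasterize the points of this class, then measure distance to them
    x <- rasterize(obs[classes == cls[i], ], r)
    s[[i]] <- distance(x)
  }
  names(s) <- cls
  rast(s)
}
d <- distdist(v, r, cls)
plot(d)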
I am writing an R package that uses rstan for Bayesian sampling. (Here is the specific commit if you want to reproduce the issue.) I only succeed in running a function that calls rstan from the vignette if I use library(rstan) in the vignette; a few workarounds have failed.
The setup
A function in the package calls rstan (edited for clarity):
#' @importFrom Rcpp cpp_object_initializer
#' @export
run_variational_bayes <- function(x, y, output_samples, beta_sd, stan_file) {
n_input <- length(y)
p <- ncol(x)
train_dat <- list(n = n_input, p = p, x = x, y = y, beta_sd = beta_sd)
stan_model <- rstan::stan_model(file = stan_file)
stan_vb <- rstan::vb(object = stan_model, data = train_dat,
output_samples = output_samples)
return(rstan::extract(stan_vb)$beta)
}
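For reference, roxygen2 should turn the tags above into these NAMESPACE directives (my assumption being that roxygen2 generates the NAMESPACE file for this package):
importFrom(Rcpp,cpp_object_initializer)
export(run_variational_bayes)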
I test this function in the package:
context("RStan variational Bayes model")
test_that("Rstan variational Bayes model runs", {
german <- PosteriorBootstrap::get_german_credit_dataset()
n_bootstrap <- 10
prior_variance <- 100
stan_vb_sample <- PosteriorBootstrap::run_variational_bayes(x = german$x,
y = german$y,
output_samples = n_bootstrap,
beta_sd = sqrt(prior_variance),
iter = 10)
expect_true(nrow(stan_vb_sample) == n_bootstrap)
expect_true(ncol(stan_vb_sample) == ncol(german$x))
})
The tests pass locally and on Travis, so the function works from inside the
package.
The problem
The vignette code works if I include library(rstan):
library(rstan)
prior_sd <- 10
n_bootstrap <- 1000
german <- PosteriorBootstrap::get_german_credit_dataset()
stan_vb_sample <- PosteriorBootstrap::run_variational_bayes(x = german$x,
y = german$y,
output_samples = n_bootstrap,
beta_sd = prior_sd)
dim(stan_vb_sample)
#> [1] 1000 25
but I see it as bad practice that the user needs to attach another package to
use my package. If I use requireNamespace(), building the vignette works but
the Stan model does not run:
requireNamespace("PosteriorBootstrap", quietly = TRUE)
# ...
stan_vb_sample <- PosteriorBootstrap::run_variational_bayes(x = german$x,
y = german$y,
output_samples = n_bootstrap,
beta_sd = prior_sd)
#> Error in cpp_object_initializer(.self, .refClassDef, ...) :
#> could not find function "cpp_object_initializer"
#> failed to create the model; variational Bayes not done
#> Stan model 'bayes_logit' does not contain samples.
dim(stan_vb_sample)
#> NULL
Note that I used #' @importFrom Rcpp cpp_object_initializer in the Roxygen comment, which should import the function that rstan says is missing.
Comparison with another package that uses rstan
This package
has similar values in DESCRIPTION, yet I verified that it does not require library(rstan) to
run rstan. It uses @import Rcpp on one function, which I tried in my package in
place of @importFrom Rcpp cpp_object_initializer in front of the function, and
got the same error.
Failed workarounds
The difference between requireNamespace() and library() is that the latter
attaches the package's exports to the search path, while the former only loads
the namespace. But rstan does import(Rcpp), so that object should be available
to rstan's own code.
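The distinction is easy to see with any base package that is not attached by default (an illustrative sketch using tools, nothing rstan-specific):
requireNamespace("tools", quietly = TRUE)
"package:tools" %in% search()  # FALSE: namespace loaded, exports not on the search path
exists("md5sum")               # FALSE: unqualified lookup fails
library(tools)
"package:tools" %in% search()  # TRUE: package attached
exists("md5sum")               # TRUE: unqualified lookup now succeeds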
(1) I tried library("PosteriorBootstrap") in the vignette, since the
package imports that object into its namespace: I got the same error (with
@import Rcpp or with @importFrom Rcpp cpp_object_initializer).
(2) I copied that object to the environment of the function:
requireNamespace("Rcpp", quietly = TRUE)
#' @import Rcpp
#' @export
run_variational_bayes <- function(x, y, output_samples, beta_sd,
stan_file = get_stan_file(),
iter = 10000, seed = 123, verbose = FALSE) {
cpp_object_initializer <- Rcpp:cpp_object_initializer
# ...
}
and I was surprised to get a vignette error:
E creating vignettes (1.8s)
Quitting from lines 151-157 (anpl.Rmd)
Error: processing vignette 'anpl.Rmd' failed with diagnostics:
object 'Rcpp' not found
Execution halted
Temporary solution
As a temporary solution, I moved the code in the function to the vignette
entirely. The vignette fails with requireNamespace():
requireNamespace("rstan")
#> Loading required namespace: rstan
prior_sd <- 10
n_bootstrap <- 1000
german <- PosteriorBootstrap::get_german_credit_dataset()
train_dat <- list(n = length(german$y), p = ncol(german$x), x = german$x, y = german$y, beta_sd = prior_sd)
stan_file <- PosteriorBootstrap::get_stan_file()
stan_model <- rstan::stan_model(file = stan_file)
stan_vb <- rstan::vb(object = stan_model, data = train_dat, seed = seed,
output_samples = n_bootstrap)
#> Error in cpp_object_initializer(.self, .refClassDef, ...) :
#> could not find function "cpp_object_initializer"
#> failed to create the model; variational Bayes not done
stan_vb_sample <- rstan::extract(stan_vb)$beta
#> Stan model 'bayes_logit' does not contain samples.
dim(stan_vb_sample)
#> NULL
and succeeds with library(rstan):
library("rstan")
#> Loading required package: ggplot2
# ...
stan_model <- rstan::stan_model(file = stan_file)
stan_vb <- rstan::vb(object = stan_model, data = train_dat, seed = seed,
output_samples = n_bootstrap)
#> Chain 1: ------------------------------------------------------------
# ...
#> Chain 1: COMPLETED.
stan_vb_sample <- rstan::extract(stan_vb)$beta
dim(stan_vb_sample)
#> [1] 1000 25
In moving the code out of the package, I realised that a test that uses
library("rstan") and calls the rstan package directly, e.g.
context("Adaptive non-parametric learning function")
library("rstan")
# ...
test_that("Adaptive non-parametric learning with posterior samples works", {
german <- get_german_credit_dataset()
n_bootstrap <- 100
# Get posterior samples
seed <- 123
prior_sd <- 10
train_dat <- list(n = length(german$y), p = ncol(german$x), x = german$x,
y = german$y, beta_sd = prior_sd)
stan_model <- rstan::stan_model(file = get_stan_file())
stan_vb <- rstan::vb(object = stan_model, data = train_dat, seed = seed,
output_samples = n_bootstrap)
stan_vb_sample <- rstan::extract(stan_vb)$beta
# ...
})
passes the tests inside the package:
✔ | 24 | Adaptive non-parametric learning function [53.1 s]
══ Results ═════════════════════════════════════════════════════════════════════
Duration: 53.2 s
OK: 24
Failed: 0
Warnings: 0
Skipped: 0
but the same test with requireNamespace("rstan") fails them:
⠋ | 21 | Adaptive non-parametric learning functionError in cpp_object_initializer(.self, .refClassDef, ...) :
could not find function "cpp_object_initializer"
Stan model 'bayes_logit' does not contain samples.
...
══ Results ═════════════════════════════════════════════════════════════════════
Duration: 51.7 s
OK: 22
Failed: 1
Warnings: 0
Skipped: 0
Conclusion
I wonder if rstan code is calling cpp_object_initializer without a
qualifier, and if it's doing that in a new environment that does not inherit the
objects from the calling environment.
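A way to probe this from the vignette, sketched under my assumption that the generated model code resolves the symbol through the search path rather than through rstan's namespace:
requireNamespace("rstan", quietly = TRUE)
exists("cpp_object_initializer")  # FALSE: not visible without attaching Rcpp or rstan
# put the function where an unqualified lookup from the global environment finds it
cpp_object_initializer <- Rcpp::cpp_object_initializer
# if the hypothesis is right, rstan::stan_model() should now succeed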
I acknowledge that I did not use rstantools to start the package
(my employer decided to stick with the MIT license and chose not to restart the
package structure from scratch) and that I am compiling
the model at call time. I suppose that users providing their own model would
face the same errors when using requireNamespace() instead of library().
How can I allow users to run package functions that call rstan without
library(rstan), short of restarting the package from scratch with rstantools?
Progress has been made on getting the parallel processing part working, but saving the vector with the fetch distances is not working properly. The error I get is:
df_Test_Fetch <- data.frame(x_lake_length)
Error in data.frame(x_lake_length) : object 'x_lake_length' not found
write.table(df_Test_Fetch,file="C:/tempTest_Fetch.csv",row.names=TRUE,col.names=TRUE, sep=",")
Error in is.data.frame(x) : object 'df_Test_Fetch' not found
I have tried altering the code below so that the foreach step is output to x_lake_length, but that did not output the vector as I hoped. How can I get the actual results saved to a CSV file? I am running a Windows 8 computer with R x64 3.3.0.
Thank you in advance,
Jen
Here is the full code.
# make sure there is no pre-existing data
rm(x_lake_length)
# Libraries ---------------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(lakemorpho, rgdal, maptools, sp, doParallel, foreach)
# HPC ---------------------------------------------------------------------
cores_2_use <- detectCores() - 2
cl <- makeCluster(cores_2_use, useXDR = F)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl, cores_2_use)
# Data --------------------------------------------------------------------
ogrDrivers()
dsn <- system.file("vectors", package = "rgdal")[1]
# the line below is commented out; when I run the script on my own data
# I use it instead of the one above, making the name changes as needed
# dsn<-setwd("J:\\Elodea\\ByHUC6\\")
ogrListLayers(dsn)
ogrInfo(dsn=dsn, layer="trin_inca_pl03")
owd <- getwd()
setwd(dsn)
ogrInfo(dsn="trin_inca_pl03.shp", layer="trin_inca_pl03")
setwd(owd)
x <- readOGR(dsn=dsn, layer="trin_inca_pl03")
summary(x)
# Analysis ----------------------------------------------------------------
myfun <- function(x, i) {
  tmp <- lakeMorphoClass(x[i,], NULL, NULL, NULL)
  x_lake_length <- vector("numeric", length = nrow(x))
  x_lake_length[i] <- lakeMaxLength(tmp, 200)
  print(i)
  Sys.sleep(0.1)
}
foreach(i = 1:nrow(x),.combine=cbind,.packages=c("lakemorpho","rgdal")) %dopar% (
myfun(x,i)
)
options(digits=10)
df_Test_Fetch <- data.frame(x_lake_length)
write.table(df_Test_Fetch,file="C:/temp/Test_Fetch.csv",row.names=TRUE,col.names=TRUE, sep=",")
print(proc.time())
I think this is what you want, though without understanding the subject matter I can't be 100% sure.
What I did was add a return() to your parallelized function and assign the object returned by foreach() to x_lake_length. But I'm only guessing that that's what you were trying to do, so please correct me if I'm wrong.
# make sure there is no pre-existing data
rm(x_lake_length)
# Libraries ---------------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(lakemorpho, rgdal, maptools, sp, doParallel, foreach)
# HPC ---------------------------------------------------------------------
cores_2_use <- detectCores() - 2
cl <- makeCluster(cores_2_use, useXDR = F)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl, cores_2_use)
# Data --------------------------------------------------------------------
ogrDrivers()
dsn <- system.file("vectors", package = "rgdal")[1]
# the line below is commented out; when I run the script on my own data
# I use it instead of the one above, making the name changes as needed
# dsn<-setwd("J:\\Elodea\\ByHUC6\\")
ogrListLayers(dsn)
ogrInfo(dsn=dsn, layer="trin_inca_pl03")
owd <- getwd()
setwd(dsn)
ogrInfo(dsn="trin_inca_pl03.shp", layer="trin_inca_pl03")
setwd(owd)
x <- readOGR(dsn=dsn, layer="trin_inca_pl03")
summary(x)
# Analysis ----------------------------------------------------------------
myfun <- function(x, i) {
  tmp <- lakeMorphoClass(x[i,], NULL, NULL, NULL)
  x_lake_length <- vector("numeric", length = nrow(x))
  x_lake_length[i] <- lakeMaxLength(tmp, 200)
  print(i)
  Sys.sleep(0.1)
  return(x_lake_length)
}
x_lake_length <- foreach(i = 1:nrow(x),.combine=cbind,.packages=c("lakemorpho","rgdal")) %dopar% (
myfun(x,i)
)
options(digits=10)
df_Test_Fetch <- data.frame(x_lake_length)
write.table(df_Test_Fetch,file="C:/temp/Test_Fetch.csv",row.names=TRUE,col.names=TRUE, sep=",")
print(proc.time())
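A note on the design: each call to myfun() allocates a full-length vector with a single filled entry, and .combine=cbind then stacks nrow(x) of them into a matrix. If lakeMaxLength() returns a single number per lake (my assumption here), returning just that scalar and combining with c() is leaner:
# leaner sketch: one scalar per iteration, combined into a plain numeric vector
x_lake_length <- foreach(i = 1:nrow(x), .combine = c,
                         .packages = c("lakemorpho", "rgdal")) %dopar% {
  tmp <- lakeMorphoClass(x[i, ], NULL, NULL, NULL)
  lakeMaxLength(tmp, 200)
}
df_Test_Fetch <- data.frame(max_length = x_lake_length)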
I am trying the below R script to build a logistic regression model using RHadoop (the rmr2 and rhdfs packages) on an HDFS data file located at "hdfs://:/somnath/merged_train/part-m-00000", and then testing the model with a test HDFS data file at "hdfs://:/somnath/merged_test/part-m-00000".
We are using the CDH4 distribution, with YARN/MR2 running in parallel with MR1 (supported by Hadoop 0.20), and we point at the hadoop-0.20 MapReduce and HDFS installations through the Sys.setenv commands shown below.
However, whenever I run the script I hit the error below, with very little luck bypassing it. I would appreciate it if somebody could point me to the possible cause of this error, which seems to come from an lapply call in R that does not handle NA arguments.
[root@kkws029 logreg_template]# Rscript logreg_test.R
Loading required package: methods
Loading required package: rJava
HADOOP_CMD=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/bin/hadoop
Be sure to run hdfs.init()
14/08/11 11:59:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
NULL
NULL
[1] "Starting to build logistic regression model..."
Error in FUN(X[[2L]], ...) :
Sorry, parameter type `NA' is ambiguous or not supported.
Calls: logistic.regression ... .jrcall -> ._java_valid_objects_list -> lapply -> FUN
Execution halted
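The message itself comes from rJava, which refuses NA arguments in reflective Java calls. A minimal sketch that should reproduce the same complaint outside Hadoop (my assumption: any .jrcall receiving an NA argument fails this way):
library(rJava)
.jinit()
s <- .jnew("java/lang/String", "x")
# passing NA into a reflective call yields:
# "Sorry, parameter type `NA' is ambiguous or not supported."
.jrcall(s, "equals", NA)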
Below is my R script:
#!/usr/bin/env Rscript
Sys.setenv(HADOOP_HOME="/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce")
Sys.setenv(HADOOP_CMD="/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/bin/hadoop")
Sys.setenv(HADOOP_BIN="/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/bin");
Sys.setenv(HADOOP_CONF_DIR="/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/conf");
Sys.setenv(HADOOP_STREAMING="/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar")
Sys.setenv(LD_LIBRARY_PATH="/usr/lib64/R/library/rJava/jri")
library(rmr2)
library(rhdfs)
.jinit()
.jaddClassPath("/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-auth-2.0.0-cdh4.3.0.jar")
.jaddClassPath("/opt/cloudera/parcels/CDH/lib/hadoop-hdfs/hadoop-hdfs-2.0.0-cdh4.3.0.jar")
.jaddClassPath("/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar")
hdfs.init()
rmr.options( backend = "hadoop", hdfs.tempdir = "/tmp" )
logistic.regression =
  function(hdfsFilePath, iterations, dims, alpha) {
    r.file <- hdfs.file(hdfsFilePath, "r")
    #hdfsFilePath <- to.dfs(hdfsFilePath)
    lr.map =
      function(., M) {
        Y = M[, 1]
        X = M[, -1]
        keyval(
          1,
          Y * X *
            g(-Y * as.numeric(X %*% t(plane))))}
    lr.reduce =
      function(k, Z)
        keyval(k, t(as.matrix(apply(Z, 2, sum))))
    plane = t(rep(0, dims))
    g = function(z) 1/(1 + exp(-z))
    for (i in 1:iterations) {
      gradient =
        values(
          from.dfs(
            mapreduce(
              input = as.matrix(hdfs.read.text.file(r.file)),
              #input = from.dfs(hdfsFilePath),
              map = function(., M) {
                Y = M[, 1]
                X = M[, -1]
                keyval(
                  1,
                  Y * X *
                    g(-Y * as.numeric(X %*% t(plane))))},
              reduce = lr.reduce,
              combine = T)))
      plane = plane + alpha * gradient
      #trace(print(plane), quote(browser()))
    }
    return(plane)
  }
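Since the final stopifnot() compares out[['local']] against out[['hadoop']], here is a sketch of a local in-memory counterpart that could populate out[['local']] (assumption: the training file fits in memory as plain comma-separated text with the label in column 1, as in the samples below):
logistic.regression.local =
  function(filePath, iterations, dims, alpha) {
    M <- as.matrix(read.csv(filePath, header = FALSE))
    Y <- M[, 1]
    X <- M[, -1]
    g <- function(z) 1/(1 + exp(-z))
    plane <- t(rep(0, dims))
    for (i in 1:iterations) {
      # same update as the map/reduce pair: sum the per-row gradient contributions
      gradient <- colSums(Y * X * as.vector(g(-Y * (X %*% t(plane)))))
      plane <- plane + alpha * gradient
    }
    return(plane)
  }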
#validate logistic regression
logistic.regression.test =
  function(hdfsFilePath, weight) {
    r.file <- hdfs.file(hdfsFilePath, "r")
    lr.test.map =
      function(., M) {
        keyval(
          1,
          lapply(as.numeric(M[, -1] %*% t(weight)), function(z) 1/(1 + exp(-z))))}
    probabilities =
      values(
        from.dfs(
          mapreduce(
            input = as.matrix(hdfs.read.text.file(r.file)),
            map = function(., M) {
              keyval(
                1,
                lapply(as.numeric(M[, -1] %*% t(weight)), function(z) 1/(1 + exp(-z))))}
          )))
    return(probabilities)
  }
out = list()
prob = list()
rmr.options( backend = "hadoop", hdfs.tempdir = "/tmp" )
print("Starting to build logistic regression model...")
out[['hadoop']] =
  ## @knitr logistic.regression-run
  logistic.regression(
    "hdfs://XX.XX.XX.XX:NNNN/somnath/merged_train/part-m-00000", 5, 5, 0.05)
write.csv(as.vector(out[['hadoop']]), "/root/somnath/logreg_data/weights.csv")
print("Building logistic regression model completed.")
prob[['hadoop']] =
logistic.regression.test(
"hdfs://XX.XX.XX.XX:NNNN/somnath/merged_test/part-m-00000", out[['hadoop']])
write.csv(as.vector(prob[['hadoop']]), "/root/somnath/logreg_data/probabilities.csv")
stopifnot(
  isTRUE(all.equal(out[['local']], out[['hadoop']], tolerance = 1E-7)))
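For completeness, the more common rmr2 pattern (which I have not tried here) passes an HDFS path plus an input format to mapreduce() instead of wrapping hdfs.file()/hdfs.read.text.file() in as.matrix(). A sketch of what the gradient step inside logistic.regression() would look like, assuming the files are plain CSV with the label in column 1:
csv.input.format <- make.input.format("csv", sep = ",")
gradient <- values(
  from.dfs(
    mapreduce(
      input = "/somnath/merged_train/part-m-00000",
      input.format = csv.input.format,
      map = lr.map,
      reduce = lr.reduce,
      combine = TRUE)))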
NOTE: I have set the following Hadoop environment variables in root's ~/.bash_profile:
# Hadoop-specific environment and commands
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce
export HADOOP2_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
#export HADOOP_CMD=${HADOOP_HOME}/bin/hadoop
#export HADOOP_STREAMING=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
#export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export LD_LIBRARY_PATH=${R_HOME}/library/rJava/jri #:${HADOOP_HOME}/../hadoop-0.20-mapreduce/lib/native/Linux-amd64-64
# Add hadoop-common jar to classpath for PlatformName and FsShell classes; Add hadoop-auth and hadoop-hdfs jars
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${HADOOP2_HOME}/client-0.20/* #:${HADOOP_HOME}/*.jar:${HADOOP_HOME}/lib/*.jar:${HADOOP2_HOME}/hadoop-common-2.0.0-cdh4.3.0.jar:${HADOOP_HOME}/../hadoop-hdfs/hadoop-hdfs-2.0.0-cdh4.3.0.jar:${HADOOP_HOME}/hadoop-auth-2.0.0-cdh4.3.0.jar:$HADOOP_STREAMING
PATH=$PATH:$R_HOME/bin:$JAVA_HOME/bin:$LD_LIBRARY_PATH:/opt/cloudera/parcels/CDH/lib/mahout:/opt/cloudera/parcels/CDH/lib/hadoop:/opt/cloudera/parcels/CDH/lib/hadoop-hdfs:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce:/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce:/var/lib/storm-0.9.0-rc2/lib #:$HADOOP_CMD:$HADOOP_STREAMING:$HADOOP_CONF_DIR
export PATH
SAMPLE TRAIN DATASET
0,-4.418,-2.0658,1.2193,-0.68097,0.90894
0,-2.7466,-2.9374,-0.87562,-0.65177,0.53182
0,-0.98846,0.66962,-0.20736,-0.2895,0.002313
0,-2.277,2.492,0.47936,0.4673,-1.5075
0,-5.4391,1.8447,-1.6843,1.465,-0.71099
0,-0.12843,0.066968,0.02678,-0.040851,0.0075902
0,-2.0796,2.4739,0.23472,0.86423,0.45094
0,-3.1796,-0.15429,1.4814,-0.94316,-0.52754
0,-1.9429,1.3111,0.31921,-1.202,0.8552
0,-2.3768,1.9301,0.096005,-0.51971,-0.17544
0,-2.0336,1.991,0.82029,0.018232,-0.33222
0,-3.6388,-3.2903,-2.1076,0.73341,0.75986
0,-2.9146,0.53163,0.49182,-0.38562,-0.76436
0,-3.3816,1.0954,0.25552,-0.11564,-0.01912
0,-1.7374,-0.63031,-0.6122,0.022664,0.23399
0,-1.312,-0.54935,-0.68508,-0.072985,0.036481
0,-3.991,0.55278,0.38666,-0.56128,-0.6748
....
SAMPLE TEST DATASET
0,-0.66666,0.21439,0.041861,-0.12996,-0.36305
0,-1.3412,-1.1629,-0.029398,-0.13513,0.49758
0,-2.6776,-0.40194,-0.97336,-1.3355,0.73202
0,-6.0203,-0.61477,1.5248,1.9967,2.697
0,-4.5663,-1.6632,-1.2893,-1.7972,1.4367
0,-7.2339,2.4589,0.61349,0.39094,2.19
0,-4.5683,-1.3066,1.1006,-2.8084,0.3172
0,-4.1223,-1.5059,1.3063,-0.18935,1.177
0,-3.7135,-0.26283,1.6961,-1.3499,-0.18553
0,-2.7993,1.2308,-0.42244,-0.50713,-0.3522
0,-3.0541,1.8173,0.96789,-0.25138,-0.36246
0,-1.1798,1.0478,-0.29168,-0.26261,-0.21527
0,-2.6459,2.9387,0.14833,0.24159,-2.4811
0,-3.1672,2.479,-1.2103,-0.48726,0.30974
1,-0.90706,1.0157,0.32953,-0.11648,-0.47386
...
I'm using JD Long's segue package (https://code.google.com/p/segue/) to do some parallel computing, and am running into an issue loading CRAN packages on the EC2 instances.
First, I created an EMR cluster like so:
myCluster <- createCluster(numInstances = 5,
                           cranPackages = c("RWeka", "tm"),
                           masterInstanceType = "m1.large",
                           slaveInstanceType = "m1.large",
                           location = "us-east-1c")
Per the documentation, I specified which packages I want to load (in this case, RWeka and tm).
The cluster seems to start properly, with no error messages. I am using RStudio on Linux Mint 17 with R version 3.0.2.
I wrote a function getTerms.jobAd which takes a character string and calls some functions from the packages above, and am using emrlapply() like so:
> jobAdTerms <- emrlapply(myCluster, X = as.list(jobAds[1:2, 3]), FUN = getTerms.jobAd)
RUNNING - 2014-06-24 17:05:19
RUNNING - 2014-06-24 17:05:50
WAITING - 2014-06-24 17:06:20
When I check the jobAdTerms list that is supposed to be returned, I get:
> jobAdTerms
[[1]]
[1] "error caught by Segue: Error in function (txt) : could not find function \"Corpus\"\n"
[[2]]
[1] "error caught by Segue: Error in function (txt) : could not find function \"Corpus\"\n"
Obviously, Corpus is one of the functions from the tm package.
What am I doing wrong? And how can I remedy this situation? Thanks!!
EDIT
Here's the function I am calling:
nGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 4))
getTerms.jobAd <- function(txt) {
tmp <- tolower(txt)
tmp <- gsub('\\s*<.*?>|[:;,#$%^&*()?]|(?<=[a-zA-Z])\\.(?= |$)', '', tmp, perl = TRUE)
txt.Corpus <- Corpus(VectorSource(tmp))
txt.Corpus <- tm_map(txt.Corpus, stripWhitespace)
txt.TFV <- termFreq(txt.Corpus[[1]], control = list(dictionary = jobTags[, 1], wordLengths = c(1, Inf)))
txt.TFV2 <- termFreq(txt.Corpus[[1]], control = list(tokenize = nGramTokenizer, dictionary = jobTags[, 1], wordLengths = c(1, Inf)))
jobTerms <- rowSums(as.matrix(c(txt.TFV, txt.TFV2)))
return(jobTerms)
}
EDIT 2
Here's how you can reproduce the error:
library(tm)  # provides the crude example dataset
data(crude)
jobAdTerms <- emrlapply(myCluster, X = as.list(crude), FUN = getTerms.jobAd)
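For reference, the obvious workaround I will try next is to load the packages inside the function that segue ships to the workers (a sketch, untested on EMR; it assumes the packages are installed on the workers but simply not attached, and that jobTags is also made available to them):
getTerms.jobAd <- function(txt) {
  require(tm)    # attach on the worker so Corpus(), tm_map(), termFreq() are found
  require(RWeka) # attach on the worker so NGramTokenizer(), Weka_control() are found
  nGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 4))
  tmp <- tolower(txt)
  tmp <- gsub('\\s*<.*?>|[:;,#$%^&*()?]|(?<=[a-zA-Z])\\.(?= |$)', '', tmp, perl = TRUE)
  txt.Corpus <- Corpus(VectorSource(tmp))
  txt.Corpus <- tm_map(txt.Corpus, stripWhitespace)
  txt.TFV <- termFreq(txt.Corpus[[1]], control = list(dictionary = jobTags[, 1], wordLengths = c(1, Inf)))
  txt.TFV2 <- termFreq(txt.Corpus[[1]], control = list(tokenize = nGramTokenizer, dictionary = jobTags[, 1], wordLengths = c(1, Inf)))
  rowSums(as.matrix(c(txt.TFV, txt.TFV2)))
}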