mlr error in randomForestSRC after first run

I have installed Java 9.0.4 and all the relevant R libraries on macOS 10.13.4 to run the following script in R 3.5.0 (invoked in RStudio 1.1.423):
options("java.home"="/Library/Java/JavaVirtualMachines/jdk-9.0.4.jdk/Contents/Home/lib")
Sys.setenv(LD_LIBRARY_PATH='$JAVA_HOME/server')
dyn.load('/Library/Java/JavaVirtualMachines/jdk-9.0.4.jdk/Contents/Home/lib/server/libjvm.dylib')
library(mlr)
library(tidyverse) # for ggplot and data wrangling
library(ggvis) # ggplot visualisation in shiny app
library(rJava)
library(FSelector)
data <- read.csv('week07/PhishingWebsites.csv')
# All variables to nominal (PhishingWebsites)
data[c(1:31)] <- lapply(data[c(1:31)] , factor)
# Configure a classification task and specify Result as the target feature.
classif.task <- makeClassifTask(id = "web", data = data, target = "Result")
fv <- generateFilterValuesData(classif.task)
It works fine the first time I run it, but if I run it a second time I get the following error:
Error in randomForestSRC::rfsrc(getTaskFormula(task), data = getTaskData(task), :
An error has occurred in the grow algorithm. Please turn trace on for further analysis.
Any help much appreciated.
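A hedged workaround sketch: the traceback shows the default filter calling randomForestSRC, so naming an FSelector-based filter explicitly should sidestep that code path. The filter id below is an assumption; check listFilterMethods() for the exact name in your mlr version ("information.gain" in older releases, "FSelector_information.gain" in newer ones):
# List the filter ids available in this mlr installation.
listFilterMethods()
# Hypothetical fix: use an explicit FSelector filter instead of the
# randomForestSRC-backed default.
fv <- generateFilterValuesData(classif.task, method = "information.gain")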

Related

NLP textEmbed function

I am trying to run the textEmbed function in R.
Setup needed:
require(quanteda)
require(quanteda.textstats)
require(udpipe)
require(reticulate)
#udpipe_download_model(language = "english")
ud_eng <- udpipe_load_model(here::here('english-ewt-ud-2.5-191206.udpipe'))
virtualenv_list()
reticulate::import('torch')
reticulate::import('numpy')
reticulate::import('transformers')
reticulate::import('nltk')
reticulate::import('tokenizers')
require(text)
The following code runs:
tmp1 <- textEmbed(x = 'sofa help',
model = 'roberta-base',
layers = 11)
tmp1$x
However, the following code does not run:
tmp1 <- textEmbed(x = 'sofa help',
model = 'roberta-base',
layers = 11)
tmp1$x
It gives me the following error:
Error in x[[1]] : subscript out of bounds
In addition: Warning message:
Unknown or uninitialised column: `words`.
Any suggestions would be highly appreciated.
I believe that this error has been fixed in a newer version of the text package (version 0.9.50 and above).
(I cannot see any difference in the two code parts – but I think that this error is related to only submitting one token/word to textEmbed, which now works).
Also, see the updated instructions for installing the text package: http://r-text.org/articles/Extended_Installation_Guide.html
library(text)
library(reticulate)
# Install text required python packages in a conda environment (with defaults).
text::textrpp_install()
# Show available conda environments.
reticulate::conda_list()
# Initialize the installed conda environment.
# save_profile = TRUE saves the settings so that you don't have to run textrpp_initialize() after restarting R.
text::textrpp_initialize(save_profile = TRUE)
# Test that the text package works.
textEmbed("hello")

response.plot3() crashes RStudio

0. Session information
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
1. Summary of my issue
I am having a crash while using a modified version of response.plot2(), named response.plot3(). The problem is not the function itself, as the same crash occurs with response.plot2().
2. Code
library(biomod2)
library(raster)
library(reshape)
library(ggplot2)
setwd("xxx")
# I load the modified version of response.plot2()
source("/response.plot_modified.R", local = TRUE)
sp <- "NAME"
baseline_EU <- readRDS("./data/baseline_EU.rds")
initial.wd <- getwd()
setwd("models")
# Loading of formatted data and models calibrated by biomod
load(paste0(sp, "/run.data"))
load(paste0(sp, "/model.runs"))
# Variables used for calibration
cur.vars <- model.runs@expl.var.names
# Loading model names into R memory
models.to.plot <- BIOMOD_LoadModels(model.runs)
# Calculation of response curves with all models (stored in the object resp which is an array)
resp <- response.plot3(models = models.to.plot,
Data = baseline_EU[[cur.vars]],
fixed.var.metric = "sp.mean",
show.variables = cur.vars,
run.data = run.data)
I have 60 models, and the code plots the first curve before aborting the session, with no further explanation.
3. What I unsuccessfully tried
(1) check that it was not a ram issue
(2) uninstall-reinstall all the packages and their dependencies
(3) update to the latest R version
(4) go back to response.plot2() to see if the issue could come from response.plot3()
4. I found some similar errors which led me to think that it might be a package issue
https://github.com/rstudio/rstudio/issues/9373
Call to library(raster) or require(raster) causes Rstudio to abort session
Now I presume that there is a problem with either the biomod2 or the raster package, or maybe with the R version?
I would greatly appreciate your help if you have any ideas.
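Since an aborted session usually means a crash in compiled code, which tryCatch() cannot intercept, one hedged diagnostic sketch is to run the script in plain R so the real error prints, and to reinstall the suspect packages from source so their binaries match R 4.1.0 (assumes build tools such as Rtools are available):
# In a system terminal, outside RStudio (hypothetical script name):
#   R --vanilla -f response_curves.R
# If the crash implicates binaries built against another R version:
install.packages(c("raster", "biomod2"), type = "source")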

Using R Package NNMAPSlite to get City Environmental vs Mortality Dataset

I have several questions for those who have worked with RStudio. Currently I need to work with the NMMAPSlite package. However, I found that there is an issue in the package itself when I tried to initialise the database connection to the remote DB that stores the NMMAPS city dataset.
In short, I need help to either
resolve the problem with NMMAPSlite old R package or
where to find the NMMAPS dataset in csv format
BACKGROUND
As background, I'm using the NMMAPSlite package with the intent of reproducing a paper by Antonio Gasparrini. Attached at the bottom is the code base I would like to run. It requires:
require(dlnm);
require(NMMAPSlite)
The NMMAPSlite package seems to have been deprecated, so I managed to install it and its dependencies from the archive. I elaborate below on the links required to get the dependencies for NMMAPS and DLNM as well.
PROBLEM
The problem occurs when calling initDB(), which reports that it failed to create a remoteDB instance due to invalid object creation. I suspect, rather, that the error comes from the URL not being supported. Here are the NMMAPS docs that describe the initDB() function. The DB initialisation is necessary to read the city dataset.
The following is the error from the R console when running initDB():
creating directory 'NMMAPS' for local storage
Error in validObject(.Object) :
invalid class “remoteDB” object: object needs a 'url' of type 'http://'
In addition: Warning message:
In grep("^http://", URL, fixed = TRUE, perl = TRUE) :
argument 'perl = TRUE' will be ignored
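The warning hints at the likely cause: with fixed = TRUE, grep() treats "^http://" as a literal string rather than a regular expression, so the anchor never matches and every URL fails validation. A minimal base-R reproduction of what the remoteDB validity check appears to do:
# Literal matching: no real URL contains the characters "^http://".
grepl("^http://", "http://www.ihapss.jhsph.edu/NMMAPS/v0.1", fixed = TRUE)  # FALSE
# Without fixed = TRUE, the same pattern works as intended.
grepl("^http://", "http://www.ihapss.jhsph.edu/NMMAPS/v0.1")                # TRUE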
QUESTIONS
I know these NMMAPS packages are deprecated and perhaps too old, but I really want to reproduce/replicate Antonio Gasparrini's paper, Distributed lag non-linear models, for my undergraduate thesis project.
Hence,
I wonder if there is any way to get the NMMAPS dataset of city environmental data vs mortality rates. I visited the official NMMAPS database, but the link for downloading the data is either broken or the server is already down.
Or you could help me find an equivalent to the NMMAPSlite package in R. I just need to download the city dataset containing trends in humidity, temperature, dewpoint, CO, and ozone (O3), along with deaths/mortality rates over time, for any particular city over two years. The most important variables I need are the mortality rate and the ozone (O3) trend.
Or, as a last resort, would you mind suggesting a similar dataset to the one used in his paper? Something from which I can derive/analyze the time relationship to estimate mortality rates given environmental and air pollution information?
APPENDIX
Definition of initDB
baseurl = "http://www.ihapss.jhsph.edu/NMMAPS/v0.1"
function (basedir = "NMMAPS")
{
    if (!file.exists(basedir))
        message(gettextf("creating directory '%s' for local storage", basedir))
    outcome <- new("remoteDB", url = paste(baseurl, "outcome", sep = "/"),
                   dir = file.path(basedir, "outcome"), name = "outcome")
    exposure <- new("remoteDB", url = paste(baseurl, "exposure", sep = "/"),
                    dir = file.path(basedir, "exposure"), name = "exposure")
    Meta <- new("remoteDB", url = paste(baseurl, "Meta", sep = "/"),
                dir = file.path(basedir, "Meta"), name = "Meta")
    assign("exposure", exposure, .dbEnv)
    assign("outcome", outcome, .dbEnv)
    assign("Meta", Meta, .dbEnv)
}
Code to run:
The error comes from line 3
require(dlnm);require(NMMAPSlite)
##############################
# LOAD AND PREPARE THE DATASET
##############################
initDB()
data <- readCity("ny", collapseAge = TRUE)
data <- data[,c("city", "date", "dow", "death", "tmpd", "dptp", "rhum", "o3tmean", "o3mtrend", "cotmean", "comtrend")]
# TEMPERATURE: CONVERSION TO CELSIUS
data$temp <- (data$tmpd-32)*5/9
# POLLUTION: O3 AND CO AT LAG-01
data$o3 <- data$o3tmean + data$o3mtrend
data$co <- data$cotmean + data$comtrend
data$o301 <- filter(data$o3,c(1,1)/2,side=1)
data$co01 <- filter(data$co,c(1,1)/2, side=1)
# DEW POINT TEMPERATURE AT LAG 0-1
data$dp01 <- filter(data$dptp,c(1,1)/2,side=1)
##############################
# CROSSBASIS SPECIFICATION
##############################
# FIXING THE KNOTS AT EQUALLY SPACED VALUES
range <- range(data$temp, na.rm = TRUE)
ktemp <- range[1] + (range[2] - range[1])/5*1:4
# CROSSBASIS MATRIX
ns.basis <- crossbasis(data$temp, varknots = ktemp, cenvalue = 21, lagdf = 5, maxlag = 30)
##############################
# MODEL FIT AND PREDICTION
##############################
ns <- glm(death ~ ns.basis + ns(dp01, df = 3) + dow + o301 + co01 +
          ns(date, df = 14*7), family = quasipoisson(), data)
ns.pred <- crosspred(ns.basis,ns,at=-16:33)
##############################
# RESULTS AND PLOTS
##############################
# 3-D PLOT (FIGURE 1)
crossplot(ns.pred,label="Temperature")
# SLICES (FIGURE 2, TOP)
percentiles <- round(quantile(data$temp,c(0.001,0.05,0.95,0.999)), 1)
ns.pred <- crosspred(ns.basis,ns,at=c(percentiles,-16:33))
crossplot(ns.pred,"slices",var=percentiles,lag=c(0,5,15,28), label="Temperature")
# OVERALL EFFECT (FIGURE 2, BELOW)
crossplot(ns.pred,"overall",label="Temperature", title="Overall effect of temperature on mortality
New York 1987–2000" )
# RR AT CHOSEN PERCENTILES VERSUS 21C (AND 95%CI)
ns.pred$allRRfit[as.character(percentiles)]
cbind(ns.pred$allRRlow,ns.pred$allRRhigh)[as.character(percentiles),]
##############################
# THE MOVING AVERAGE MODELS UP TO LAG x (DESCRIBED IN SECTION 5.2)
# CAN BE CREATED BY THE CROSSBASIS FUNCTION INCLUDING THE
# ARGUMENTS lagtype="strata", lagdf=1, maxlag=x
Resources for your context
Distributed lag non-linear models link
Rstudio's NMMAPSlite Package docs pdf download
Rstudio's DNLM Package docs pdf
Duplicate questions from another forum: forum
How to install package from tar/archive: link
Meanwhile, I will contact the author of this package and see if I can get the dataset, preferably in CSV format.
It seems that your code is based on R < 3.0.0. You might find it difficult to reproduce the paper, as current R is typically > 4.0.0. You could try to install the Windows version of the NMMAPS database from the link given by 'Lil', but you will need to install an older version of R (2.9.2).
Or you could stay with the latest version of R and do a simple search on GitHub. In case you haven't found the NMMAPS database, you will find how to deal with it here.
You could try this link http://www.biostat.jhsph.edu/IHAPSS/data/NMMAPS/R/ to download the package. There you have the city data compressed, and you can choose New York manually if initDB() does not work.
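A hedged sketch of that manual route, assuming per-city .rda files under the linked directory (the exact file name for New York is a guess; browse the directory listing to confirm):
# Hypothetical file name; check the directory listing first.
url <- "http://www.biostat.jhsph.edu/IHAPSS/data/NMMAPS/R/ny.rda"
download.file(url, destfile = "ny.rda", mode = "wb")
load("ny.rda")  # loads the city data into the workspace
ls()            # inspect what was loaded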

Error in ts(data) : R scoping issue

I am running this script:
#!/usr/bin/env Rscript
require(gdata)
library('ggplot2')
library('forecast')
library('tseries')
df1 <- read.xls('us-gasprices.xls',header = TRUE)
new = df1$V2
data = tail(new,-1)
data = ts(data)
fit = auto.arima(data, seasonal=FALSE)
png(filename="residuals.png")
tsdisplay(residuals(fit), lag.max=40, main='(4,1,1) Model Residuals')
plot(m)
dev.off()
Output
Error in ts(data) : 'ts' object must have one or more observations
I have read this thread
I understand from Rob's explanation that the problem is scoping.
When I run my code line by line in the R terminal, it works fine.
But when I run it with Rscript in the Ubuntu terminal, the above error occurs.
How can I solve this?
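One hedged way to narrow this down is to print the intermediate objects under Rscript; with header = TRUE, read.xls() takes column names from the first row, so df1$V2 may simply not exist in that session:
str(df1)     # check the actual column names under Rscript
length(new)  # 0 here means 'V2' is not a column of df1
head(data)   # confirm the series is non-empty before ts() is called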

An error occurs when calling rpart for a large data set

I have a large data set which has 100k data fields. When I try str() or view the full data, no glitch occurs, but when I run rpart on the training set it takes some time, and after about 3-4 minutes the following error shows up:
Error: Unable to establish connection with R session
My script looks like below:
# Decision tree
library(rpart)
library(rattle)
library(party)
train_set <- read.table('my_sample_trainset.csv', header=TRUE, sep=',', stringsAsFactors=FALSE)
test_set <- read.table('my_sample_testset.csv', header=TRUE, sep=',', stringsAsFactors=FALSE)
my_trained_tree <- rpart(Route ~ Bus_Id + week_days + time_slot, data=train_set, method="class")
# Error occurs on/after this line
my_prediction <- predict(my_trained_tree, test_set, type = "class")
my_solution <- data.frame(Route = my_prediction)
write.csv(my_solution, file = "solution.csv", row.names = FALSE)
Am I missing a library? Or does this happen because of the big data set (6.5 MB)?
Further, I am using RStudio version 0.99.447 on Mac OS X Yosemite.
That message means that R is still calculating the results. If you open Activity Monitor and sort by CPU usage on the CPU tab, you should see that rsession is using 100% of a CPU. So you can just click "ok" on that message and allow R to keep computing.
I wish there were a workaround though, this issue is plaguing me as we speak!
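If waiting is not an option, a hedged sketch: constrain the tree so rpart() returns sooner (illustrative control values, not tuned for this data):
library(rpart)
# Cap depth and raise the complexity threshold to cut computation.
my_trained_tree <- rpart(Route ~ Bus_Id + week_days + time_slot,
                         data = train_set, method = "class",
                         control = rpart.control(maxdepth = 10, cp = 0.05))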
