I have a few questions for those who have worked with RStudio. I currently need to work with the NMMAPSlite package, but I found an issue in the package itself when I tried to initialise the database connection to the remote DB that stores the NMMAPS city dataset.
In short, I need help to either
resolve the problem with the old NMMAPSlite R package, or
find the NMMAPS dataset in CSV format.
BACKGROUND
As background, I'm using the NMMAPSlite package with the intent of reproducing a paper by Antonio Gasparrini. The code I would like to run is attached at the bottom. It requires:
require(dlnm);
require(NMMAPSlite)
The NMMAPSlite package seems to have been deprecated, so I installed it and its dependencies from the archive. I elaborate below on the links required to get the dependencies for NMMAPSlite and dlnm as well.
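For context, installing from the archive looked roughly like this (a sketch; the tarball version below is illustrative, check the CRAN archive index for the actual one):
# install an archived source package directly from the CRAN archive
# (the version number 0.3-2 is an assumption; browse
#  https://cran.r-project.org/src/contrib/Archive/NMMAPSlite/ to confirm;
#  dependencies such as stashR may need the same treatment first)
install.packages(
  "https://cran.r-project.org/src/contrib/Archive/NMMAPSlite/NMMAPSlite_0.3-2.tar.gz",
  repos = NULL, type = "source"
)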
PROBLEM
The problem occurs when calling initDB(): it fails to create a remoteDB instance due to invalid object creation. I suspect, rather, that the error comes from the URL not being supported. Here are the NMMAPS docs that describe the initDB() function. The DB initialisation is necessary to read the city dataset.
The following is the error from the R console when running initDB():
creating directory 'NMMAPS' for local storage
Error in validObject(.Object) :
invalid class “remoteDB” object: object needs a 'url' of type 'http://'
In addition: Warning message:
In grep("^http://", URL, fixed = TRUE, perl = TRUE) :
argument 'perl = TRUE' will be ignored
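The warning above hints at why the validity check fails: grep() is being called with fixed = TRUE, which treats the pattern "^http://" as a literal string rather than a regular expression, so no real URL can ever match it. A quick demonstration (my own reasoning, not from the package docs):
# with fixed = TRUE the caret is matched literally, so the check always fails
grepl("^http://", "http://www.ihapss.jhsph.edu/NMMAPS/v0.1/outcome", fixed = TRUE)  # FALSE
grepl("^http://", "http://www.ihapss.jhsph.edu/NMMAPS/v0.1/outcome")                # TRUE (regex)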
QUESTIONS
I know the NMMAPS packages are deprecated and perhaps too old, but I really want to reproduce Antonio Gasparrini's paper, Distributed lag non-linear models, for my undergraduate thesis project.
Hence,
I wonder if there is any way to get the NMMAPS dataset of city environmental data versus mortality rates. I visited the official NMMAPS database, but the download link is either broken or the server is down.
Alternatively, you could help me find an equivalent to the NMMAPSlite package in R. I just need to download the city dataset that contains humidity, temperature, dew point, CO, and ozone (O3) trends, together with deaths/mortality over time for any particular city, covering at least two years. The variables I need most are the mortality rate and the ozone (O3) trend.
Or, as a last resort, could you suggest a dataset similar to the one used in his paper? Something from which I can analyse the time relationship to estimate mortality given environmental and air-pollution information?
APPENDIX
Definition of initDB
baseurl = "http://www.ihapss.jhsph.edu/NMMAPS/v0.1"
function (basedir = "NMMAPS")
{
    if (!file.exists(basedir))
        message(gettextf("creating directory '%s' for local storage",
                         basedir))
    outcome <- new("remoteDB",
                   url = paste(baseurl, "outcome", sep = "/"),
                   dir = file.path(basedir, "outcome"), name = "outcome")
    exposure <- new("remoteDB",
                    url = paste(baseurl, "exposure", sep = "/"),
                    dir = file.path(basedir, "exposure"), name = "exposure")
    Meta <- new("remoteDB",
                url = paste(baseurl, "Meta", sep = "/"),
                dir = file.path(basedir, "Meta"), name = "Meta")
    assign("exposure", exposure, .dbEnv)
    assign("outcome", outcome, .dbEnv)
    assign("Meta", Meta, .dbEnv)
}
Code to run:
The error comes from the initDB() call below.
require(dlnm); require(NMMAPSlite)
require(splines)  # for ns() used in the model below, in case dlnm does not attach it
##############################
# LOAD AND PREPARE THE DATASET
##############################
initDB()
data <- readCity("ny", collapseAge = TRUE)
data <- data[,c("city", "date", "dow", "death", "tmpd", "dptp", "rhum", "o3tmean", "o3mtrend", "cotmean", "comtrend")]
# TEMPERATURE: CONVERSION TO CELSIUS
data$temp <- (data$tmpd-32)*5/9
# POLLUTION: O3 AND CO AT LAG-01
data$o3 <- data$o3tmean + data$o3mtrend
data$co <- data$cotmean + data$comtrend
data$o301 <- filter(data$o3,c(1,1)/2,sides=1)
data$co01 <- filter(data$co,c(1,1)/2,sides=1)
# DEW POINT TEMPERATURE AT LAG 0-1
data$dp01 <- filter(data$dptp,c(1,1)/2,sides=1)
##############################
# CROSSBASIS SPECIFICATION
##############################
# FIXING THE KNOTS AT EQUALLY SPACED VALUES
range <- range(data$temp,na.rm=TRUE)
ktemp <- range[1] + (range[2]-range[1])/5*1:4
# CROSSBASIS MATRIX
ns.basis <- crossbasis(data$temp,varknots=ktemp,cenvalue=21,lagdf=5,maxlag=30)
##############################
# MODEL FIT AND PREDICTION
##############################
ns <- glm(death ~ ns.basis + ns(dp01,df=3) + dow + o301 + co01 +
  ns(date,df=14*7), family=quasipoisson(), data)
ns.pred <- crosspred(ns.basis,ns,at=-16:33)
##############################
# RESULTS AND PLOTS
##############################
# 3-D PLOT (FIGURE 1)
crossplot(ns.pred,label="Temperature")
# SLICES (FIGURE 2, TOP)
percentiles <- round(quantile(data$temp,c(0.001,0.05,0.95,0.999)), 1)
ns.pred <- crosspred(ns.basis,ns,at=c(percentiles,-16:33))
crossplot(ns.pred,"slices",var=percentiles,lag=c(0,5,15,28), label="Temperature")
# OVERALL EFFECT (FIGURE 2, BELOW)
crossplot(ns.pred,"overall",label="Temperature",
  title="Overall effect of temperature on mortality\nNew York 1987–2000")
# RR AT CHOSEN PERCENTILES VERSUS 21C (AND 95%CI)
ns.pred$allRRfit[as.character(percentiles)]
cbind(ns.pred$allRRlow,ns.pred$allRRhigh)[as.character(percentiles),]
##############################
# THE MOVING AVERAGE MODELS UP TO LAG x (DESCRIBED IN SECTION 5.2)
# CAN BE CREATED BY THE CROSSBASIS FUNCTION INCLUDING THE
# ARGUMENTS lagtype="strata", lagdf=1, maxlag=x
Resources for your context:
Distributed lag non-linear models: link
NMMAPSlite package docs: pdf download
dlnm package docs: pdf
Duplicate question on another forum: forum
How to install a package from a tar/archive: link
Meanwhile, I will contact the author of the package and see if I can get the dataset, preferably in CSV format.
It seems that your code is based on R < 3.0.0. You might find it difficult to reproduce the paper, as current R is typically > 4.0.0. You could try to install the Windows version of the NMMAPS database from the link given by 'Lil', but you will need to install an older version of R (2.9.2).
Alternatively, you could stay with the latest version of R and do a simple search on GitHub. In case you haven't found the NMMAPS database, you will find how to deal with the database here.
You could try this link http://www.biostat.jhsph.edu/IHAPSS/data/NMMAPS/R/ to download the package. There you have the city data in compressed form, so you can fetch New York manually if initDB() does not work.
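If the per-city files are served as .rda archives, a manual download might look roughly like this (a sketch: the file name "ny.rda" and the directory layout are assumptions, so check the index page in a browser first):
# assumed layout: one .rda file per city under .../NMMAPS/R/
base <- "http://www.biostat.jhsph.edu/IHAPSS/data/NMMAPS/R/"
download.file(paste0(base, "ny.rda"), destfile = "ny.rda", mode = "wb")
loaded <- load("ny.rda")  # load() returns the names of the objects it created
str(get(loaded[1]))       # inspect the city dataset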
I want to export an R model in PMML format and use it in other software. That software requires some variables to be integers, but all numeric variables are exported as double instead, even when they are explicitly integer in my dataset.
I tried to bypass this by editing the exported file manually (with a regex), deleting every decimal part. The software accepts the new format, but the predictions are not what I expect (because I just deleted the decimals), so I want to solve this directly inside R.
How can I force my variables to be a certain dataType (particularly "integer") in the exported PMML?
This is a code example that exports a .pmml file:
# Required packages -------------------------------------------------------
library(tidyverse)
library(r2pmml)
library(randomForest)
library(nnet)
# Dataset creation --------------------------------------------------------
set.seed(1)  # seeds the RNG (plain `seed = 1` would only create an unused variable)
data = data.frame(
  var1 = round(runif(10) * 100),
  var2 = round(runif(10) * 100),
  y = round(runif(10) * 100)
)
data = data %>%
  mutate(var1 = as.integer(var1),
         var2 = as.integer(var2))
# Structure check ---------------------------------------------------------
str(data)
# Neural Network and Random Forest models ---------------------------------
# nnet() has no `method` argument (that belongs to caret::train), so it is
# omitted here
nn = nnet(
  y ~ .,
  data = data,
  size = 2,
  linout = 1
)
rf = randomForest(y ~ ., data = data)
# pmml export -------------------------------------------------------------
r2pmml(rf, file = "rf.pmml", dataset = data, verbose = TRUE)
r2pmml(nn, file = "nn.pmml", dataset = data, verbose = TRUE)
I expect the PMML to have variables var1 and var2 as integer, but they end up being double in this section of the output:
<DataDictionary>
<DataField name="y" optype="continuous" dataType="double"/>
<DataField name="var1" optype="continuous" dataType="double"/>
<DataField name="var2" optype="continuous" dataType="double"/>
and I get decimal numbers in
<NeuralLayer activationFunction="logistic">
<Neuron id="hidden/1" bias="-0.4112317232771385">
<Con from="input/1" weight="-6.591508925328581"/>
<Con from="input/2" weight="-31.805468580606753"/>
</Neuron>
but I'm not sure whether those should be integer or double.
Since the r2pmml package and its underlying JPMML-R library are open source, you can always look into the source code (of the version you're using) to see how things are implemented. In the case of the nnet model type, you could look at the org.jpmml.rexp.NNetConverter class.
Essentially, there are two options. First, the R model object (the nnet object saved into an RDS file) may not contain any feature type information at all. Second, this information might be there, but the converter is not using it yet; it defaults to the default data type of the nnet algorithm (all numeric computation is done using the double data type, so that seems like a good choice for storing in the PMML document).
Where exactly is it recorded in your R model object(s) that the features var1 and var2 are integers (instead of doubles)? If you think you've found the answer, consider opening a feature request with the JPMML-R project.
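For what it's worth, a quick exploratory sketch (my own, not part of the converter) to see what type information the fitted nnet object actually retains:
# Per-term classes recorded by R's model-frame machinery. Note that it
# collapses integer columns to "numeric", so the integer-ness of var1/var2
# may already be lost at this point.
attr(nn$terms, "dataClasses")
# compare with the source data, where the columns really are integer
sapply(data, class)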
I have run various models (glm, rpart, earth, etc.) and exported each model object into a folder on my computer. So I now have a folder with ~60 different models stored as separate .rda files.
This was done by creating a model function and then applying it to a list of model types through purrr's map() (using possibly() to avoid errors terminating the run).
I now want to load them back into R and compare them. Unfortunately, when I wrote my initial model script, each model was stored under the same name, i.e. "Model.Object" (I didn't know how to do otherwise), so when I load each one individually into R they just override each other. Each file is saved as glm.rda, rpart.rda, earth.rda, etc., but the model within is labelled Model.Object (for clarification).
So I guess I have a few questions:
1. Is it possible to load multiple .rda files into R into a list that can then be indexed?
2. How can I alter the model function so that the 'Model.Object' name reads as the model type (e.g. glm, rpart, etc.)?
Code:
Model.Function = function(Model.Type){
  set.seed(0)
  Model.Object = train(x = Pred.Vars.RVC.Data, y = RVC, trControl = Tcontrolparam,
                       preProcess = Preprocessing.Options, tuneLength = 1, metric = "RMSE",
                       method = Model.Type)
  save(Model.Object, file = paste("./RVC Models/", Model.Type, ".rda", sep = ""))
  return(Model.Object)
}
Possibly.Model.Function = possibly(Model.Function, otherwise = "something wrong here")
result.possible = map(c("glm","rpart","earth"), Possibly.Model.Function)
For now, a rescue operation on your existing files might look something like this (following @nicola's comment about using the envir argument to load()):
rda2list <- function(file) {
  # load the .rda into a throwaway environment, then capture its contents
  # as a named list
  e <- new.env()
  load(file, envir = e)
  as.list(e)
}
folder <- "./RVC Models"
files <- list.files(folder, pattern = "\\.rda$")
models <- Map(rda2list, file.path(folder, files))
names(models) <- tools::file_path_sans_ext(files)
Going forward, it would be easier to save your models as .rds files with saveRDS() rather than using save(). Then reassignment is easy upon loading the file. See e.g. this question and answer for more details on the matter.
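A sketch of what that could look like with your model function (same train() arguments as in your question; the folder layout is assumed):
Model.Function = function(Model.Type){
  set.seed(0)
  Model.Object = train(x = Pred.Vars.RVC.Data, y = RVC, trControl = Tcontrolparam,
                       preProcess = Preprocessing.Options, tuneLength = 1, metric = "RMSE",
                       method = Model.Type)
  # saveRDS() stores the object itself rather than its name, so readRDS()
  # can assign the result to any variable later
  saveRDS(Model.Object, file = file.path("./RVC Models", paste0(Model.Type, ".rds")))
  Model.Object
}

# later: read every .rds back into a named, indexable list
files <- list.files("./RVC Models", pattern = "\\.rds$", full.names = TRUE)
models <- setNames(lapply(files, readRDS),
                   tools::file_path_sans_ext(basename(files)))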
I have created an evaluation scheme using the recommenderlab package with a binaryRatingMatrix. How can I see which users from the actual data are in the unknown test set?
scheme <- evaluationScheme(data = data1, method = "split", train = 0.9, given = 3)
where data1 is a binaryRatingMatrix. I would like to extract the list of users who are in the unknown set, getData(scheme, "unknown").
This will print out the first column of the rating matrix; its row names are all the user IDs:
getRatingMatrix(getData(scheme, "unknown")[,1])
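If you only need the IDs themselves, the users should be the row names of the rating matrix (a sketch based on how recommenderlab stores user IDs; worth verifying on your data):
# user IDs of the unknown test set (stored as row names of the rating matrix)
unknown_users <- rownames(getRatingMatrix(getData(scheme, "unknown")))
head(unknown_users)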
First of all thanks for implementing XGBoost in h2o!
Unfortunately I am unable to predict from an h2o XGBoost model that has been loaded from disk (which, as I'm sure you can appreciate, is really frustrating).
I am using the latest stable release of h2o, i.e. 3.10.5.2, with the R client.
I have included an example below that should let you reproduce the issue.
Thanks in advance
### Start h2o
require(h2o)
local_h2o = h2o.init()
### Source the base data set
data(mtcars)
h2o_mtcars = as.h2o(x = mtcars,destination_frame = 'h2o_mtcars')
### Fit a model to be saved
mdl_to_save = h2o.xgboost(model_id = 'mdl_to_save',y = 1,x = 2:11,training_frame = h2o_mtcars) ## This class doesn't work
#mdl_to_save = h2o.glm(model_id = 'mdl_to_save',y = 1,x = 2:11,training_frame = h2o_mtcars) ## This class works
### Take some reference predictions
ref_preds = h2o.predict(object = mdl_to_save,newdata = h2o_mtcars)
### Save the model to disk
silent = h2o.saveModel(object = mdl_to_save,path = 'INSERT_PATH',force = TRUE)
### Delete the model to make sure there can't be any strange locking issues
h2o.rm(ids = 'mdl_to_save')
### Load it back up
loaded_mdl = h2o.loadModel(path = 'INSERT_PATH/mdl_to_save')
### Score the model
### The h2o.predict statement below is what causes the error: java.lang.NullPointerException
lod_preds = h2o.predict(object = loaded_mdl,newdata = h2o_mtcars)
all.equal(ref_preds,lod_preds)
At the time of writing (January 2018), this is still a bug for XGBoost models. See this ticket for more information.
In the meantime, you can download the model as a POJO or MOJO file:
h2o.download_pojo(model, path = "/media/somewhere/tmp")
Loading the model back isn't that easy, unfortunately, but you can pass the new data to the saved POJO model with h2o.predict_json(). Note that the new data must be provided in JSON format.
See this question for more details.
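For illustration, a call might look roughly like this (a sketch: the argument names and accepted values vary between h2o versions, so check ?h2o.predict_json before relying on it):
# score the downloaded POJO on a single JSON record
# (paths and column values here are hypothetical)
res <- h2o.predict_json(model = "/media/somewhere/tmp",
                        json = '{"cyl":6, "disp":160, "hp":110}')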