Saving categorical variable treatment plan in R - r

The R package vtreat provides a handy way of creating "one-hot encoders" for the categorical variables (see a relevant post at the Win-Vector blog). Is there any way to save the treatment plan tplan object for further use (e.g., equivalent mechanism of pickle in Python).
tplan <- vtreat::designTreatmentsZ(dTrain, vars)
oneHotEncoded <- as.matrix(vtreat::prepare(tplan, dTrain, varRestriction = vars))
I would like to transform whatever data I will get with this particular treatment plan (which was computed on the dTrain), in a situation where the dTrain is no longer available. That is, I cannot re-use dTrain the next time I will call the script.
P.s. the solution should not necessary be confined to using vtreat

Base R provides the general functions save() and load() for such purposes.
Here is a reproducible example using code snippets from the post you have linked to:
library(titanic)
library(vtreat)
data(titanic_train)
outcome <- 'Survived'
target <- 1
shouldBeCategorical <- c('PassengerId', 'Pclass', 'Parch')
for(v in shouldBeCategorical) {
titanic_train[[v]] <- as.factor(titanic_train[[v]])
}
tooDetailed <- c("Ticket", "Cabin", "Name", "PassengerId")
vars <- setdiff(colnames(titanic_train), c(outcome, tooDetailed))
dTrain <- titanic_train
set.seed(4623762)
tplan <- vtreat::designTreatmentsZ(dTrain, vars,
minFraction= 0,
verbose=FALSE)
save(tplan, file='tplan.RData')
The file tplan.RData will be saved in your current working directory; afterwards, in a new R session, when you ask for
load('tplan.RData')
you will get your tplan variable back.
Alternatively, base R functions saveRDS and loadRDS will also do the job; their usage is exactly similar, and they seem to be preferable.

Related

Is there a way to pack/unpack binary data from memory in R without going through a `rawConnection`?

I'm currently using this handmade thing (which should of course be adjusted for different data types):
read_blob <- function(blob) {
blob_con <- rawConnection(blob)
data <- readBin(blob_con, "numeric", n = length(blob) / 4)
close(blob_con)
return(data)
}
write_blob <- function(data) {
stopifnot(is.numeric(data))
blob_con <- rawConnection(raw(0), "r+")
writeBin(data, blob_con)
blob <- rawConnectionValue(blob_con)
close(blob_con)
return(blob)
}
input_data <- c(123.45,234.56,345.67)
data_to_raw <- write_blob(input_data)
print(data_to_raw)
raw_to_data <- read_blob(data_to_raw)
print(raw_to_data)
stopifnot(all(input_data == raw_to_data))
I consider this somewhat of a workaround, having to open a rawConnection for reading and writing. Is there really no better way of doing this?
I know of
rawToChar but no other data types,
the package pack which should work with structure descriptors in the style of unpack('d*', data). I guess it was never finished because in my case only returns the first value regardless of what I put in the pattern, and last updated in 2008.
the package blob which doesn't seem to be doing this on its own.

Is there a way to run commands over multiple R datasets?

I am very new to R and would like to use some code to run various batch code on all of the data that I have available. It should be clear what I'm trying to do:
# library(PerformanceAnalytics)
# mydata <- mtcars[, c('mpg', 'cyl', 'disp', 'hp', 'carb')];
# chart.Correlation(mydata, histogram=TRUE, pch=19)
library(MASS)
M_names = data(package = "MASS")$result[, "Item"]
for (i in 1:length(M_names)) {
eval(paste("MASS::", M_names[i], sep=""));
}
The commented part is some code I found that I haven't been able to integrate yet. The Correlation is a very cool correlation matrix, which I'm attempting to funnel every single dataset I have access to into so I can quickly review them instead of doing it all manually. I guess I will need to save them all to PNGs to have practical workflow around that, as it's clear there's no way to coax the X windows to appear or stay put when running R code as a script.
The behavior I observe as I execute this on my Mac is:
> library(MASS)
> M_names = data(package = "MASS")$result[, "Item"]
> for (i in 1:length(M_names)) {
+ eval(paste("MASS::", M_names[i], sep=""));
+ }
>
>
I don't know for sure what the silent + indicator means, but I'm pretty sure it just means that code line is inside the for loop scope. But the eval is swallowing the command I assembled. I'm just trying to get it to print out the content of the data at each iteration of the loop for now.
I also noticed this:
> eval("MASS::ships")
[1] "MASS::ships"
It just prints it when I try to eval it.
I also hope there is a way to programmatically print individual datasets. I'm already hacking really hard at this, and there is no way that what I am doing here is a good idea.
If you have the package dataset names in a vector the key to accessing them
by their character names is the get function:
library(MASS)
M_names = data(package = "MASS")$result[, "Item"]
head(get(M_names[1]), 1)
# state sex diag death status T.categ age
# 1 NSW M 10905 11081 D hs 35
You can then loop through the vector of names
for (DATA in M_names) print(summary(get(DATA)))
Another options is to use the envir argument of the data function to load the datasets into a specific environment. It may be worth adding the data to a new environment instead of polluting your workspace.You can do that with
data(list=M_names, package="MASS", envir = list_of_datafames<- new.env())
You can then look through the list_of_datafames as you would with an other list object:
lapply(list_of_datafames, summary)

reading sparse data in h2o using svmlight

I am trying to read a dataset in SVMLight format into h2o. Writing it to a file on disk and reading it back is working OK but reading it directly from R's memory is not. I would like to know if there is a different function or a different way of calling the function I have used below.
Here's an example R 3.3.3, h2o 3.10.3.6:
require(data.table)
require(h2o)
set.seed(1000)
tot_obs <- 100
tot_var <- 500
vars_per_obs <- round(.0*tot_var,0):round(.1*tot_var,0)
#randomly generated data
mat.dt <- do.call('rbind', lapply(1:tot_obs, function(n) {
nvar <- sample(vars_per_obs,1)
if(nvar>0) data.table(obs=n, var=sample(1:tot_var,nvar))[, value:=sample(10:50,.N,replace=TRUE)]
}))
event.dt <- data.table(obs=1:tot_obs)[, is_event:=sample(0:1,.N,prob=c(.9,.1),replace=TRUE)]
#SVMLight format
setorder(mat.dt, obs, var)
mat.agg.dt <- mat.dt[, .(feature=paste(paste0(var,":",value), collapse=" ")), obs]
mat.agg.dt <- merge(event.dt, mat.agg.dt, by="obs", sort=FALSE, all.x=TRUE)
mat.agg.dt[is.na(feature), feature:=""]
mat.agg.dt[, svmlight:=paste(is_event,feature)][, c("obs","is_event","feature"):=NULL]
fwrite(mat.agg.dt, file="svmlight.txt", col.names=FALSE)
#h2o
localH2o <- h2o.init(nthreads=-1, max_mem_size="4g")
h2o.no_progress()
#works
h2o.orig <- h2o.importFile("svmlight.txt", parse=TRUE)
#does NOT work
tmp <- as.h2o(mat.agg.dt)
h2o.orig.1 <- h2o.parseRaw(tmp, parse_type="SVMLight")
The easy answer is that you probably don't have enough R memory to perform this action, so one solution is to increase the amount of memory in R (if that's an option for you). It could also mean that you don't have enough memory in your H2O cluster, so you could increase that as well.
The only way to go directly from R memory to the H2O cluster is the as.h2o() function, so you are definitely using the right command. Under the hood, the as.h2o() function writes the frame from R memory to disk (stored in a temp file) and then reads it directly into the H2O cluster using H2O's native parallel read functionality.
We recently added the ability to use data.table's read/write functionality any place that we use base R, so since you have data.table installed, you should probably be able to get around this bottleneck by adding this to the top of your script: options("h2o.use.data.table"=TRUE). This will force the use of data.table instead of base R to write to disk for the first half of the as.h2o() conversion process. This should work for you since it's doing the exact same thing that your code is doing already where you use fwrite to write to disk and h2o.importFile() to read it back in.
Also you don't need the last line with h2o.parseRaw():
tmp <- as.h2o(mat.agg.dt)
h2o.orig.1 <- h2o.parseRaw(tmp, parse_type="SVMLight")
You can just do:
h2o.orig.1 <- as.h2o(mat.agg.dt)
There is a related post that shows how to use data.table to solve the reverse problem (using as.data.frame() instead of as.h2o()) here.

read, manipulate and export multiple .dta Files using a for Loop in R

I have multiple time series (each in a seperate file), which I need to adjust seasonally using the season package in R and store the adjusted series each in a seperate file again in a different directory.
The Code works for a single county.
So I tried to use a for Loop but R is unable to use the read.dta with a wildcard.
I'm new to R and using usually Stata so the question is maybe quite stupid and my code quite messy.
Sorry and Thanks in advance
Nathan
for(i in 1:402)
{
alo[i] <- read.dta("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County[i]")
alo_ts[i] <-ts(alo[i], freq = 12, start = 2007)
m[i] <- seas(alo_ts[i])
original[i]<-as.data.frame(original(m[i]))
adjusted[i]<-as.data.frame(final(m[i]))
trend[i]<-as.data.frame(trend(m[i]))
irregular[i]<-as.data.frame(irregular(m[i]))
County[i] <- data.frame(cbind(adjusted[i],original[i],trend[i],irregular[i], deparse.level =1))
write.dta(County[i], "/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County[i].dta")
}
This is a good place to use a function and the *apply family. As noted in a comment, your main problem is likely to be that you're using Stata-like character string construction that will not work in R. You need to use paste (or paste0, as here) rather than just passing the indexing variable directly in the string like in Stata. Here's some code:
f <- function(i) {
d <- read.dta(paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County",i,".dta"))
alo_ts <- ts(d, freq = 12, start = 2007)
m <- seas(alo_ts)
original <- as.data.frame(original(m))
adjusted <- as.data.frame(final(m))
trend <- as.data.frame(trend(m))
irregular <- as.data.frame(irregular(m))
County <- cbind(adjusted,original,trend,irregular, deparse.level = 1)
write.dta(County, paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County",i,".dta"))
invisible(County)
}
# return a list of all of the resulting datasets
lapply(1:402, f)
It would probably also be a good idea to take advantage of relative directories by first setting your working directory:
setwd("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/")
Then you can simply the above paths to:
d <- read.dta(paste0("./SINGLE_SERIES/County",i,".dta"))
and
write.dta(County, paste0("./ADJUSTED_SERIES/County",i,".dta"))
which will make your code more readable and reproducible should, for example, someone ever run it on another computer.

Including R scripts in R packages

I want to share some software as a package but some of my scripts do not seem to go very naturally as functions. For example consider the following chunk of code where 'raw.df' is a data frame containing variables of both discrete and continuous kinds. The functions 'count.unique' and 'squash' will be defined in the package. The script splits the data frame into two frames, 'cat.df' to be treated as categorical data and 'cts.df' to be treated as continuous data.
My idea of how this would be used is that the user would read in the data frame 'raw.df', source the script, then interactively edit 'cat.df' and 'cts.df', perhaps combining some categories and transforming some variables.
dcutoff <- 9
tail(raw.df)
(nvals <- apply(raw.df, 2, count.unique))
p <- dim(raw.df)[2]
(catvar <- (1:p)[nvals <= dcutoff])
p.cat <- length(catvar)
(ctsvar <- (1:p)[nvals > dcutoff])
p.cts <- length(ctsvar)
cat.df <- raw.df[ ,catvar]
for (i in 1:p.cat) cat.df[ ,i] <- squash(cat.df[ ,i])
head(cat.df)
for(i in 1:p.cat) {
cat(as.vector(table(cat.df[ ,i])), "\n")
}
cts.df <- raw.df[ ,ctsvar]
for(i in 1:p.cts) {
cat( quantile(cts.df[ ,i], probs = seq(0, 1, 0.1)), "\n")
}
Now this could, of course, be made into a function returning a list containing nvals, p, p.cat, cat.df, etc; however this seems rather ugly to me. However the only provision for including scripts in a package seems to be the 'demo' folder which does not seem to be the right way to go. Advice on how to proceed would be gratefully received.
(But the gratitude would not be formally expressed as it seems that using a comment to express thanks is deprecated.)
It is better to encapsulate your code in a function. It is not ugly to return a list, S3 objects for example are just a list with an attribute class.
object <- list(attribute.name = something, ..)
class(object) <- "cname"
return (object)
You can also use inst folder (as mentioned in Dirk comment) since the contents of the inst subdirectory will be copied recursively to the installation directory.
you create an inst folder:
inst
----scripts
some_scripts.R
You can call it from a function in your package and use system.file mechanism to load it.
load_myscript <- function(){
source(system.file(package='your_pkg_name','scripts/some_scripts.R'))
}
You call it as any other function in your package:
load_myscript()

Resources