I am using the referenceIntervals package in R, to do some data analytics.
In particular I am using the refLimit function which calculates reference and confidence intervals. I want to edit it to remove certain functionality (for instance it runs a shapiro normalitiy test, which stops the entire code if the data larger than 5000, it wont allow you to parametrically test samples less than 120). To do this I have been typing refLimit into the terminal - copying the function definition, then saving it as a separate file (below is the full original definition of the function).
singleRefLimit =
function (data, dname = "default", out.method = "horn", out.rm = FALSE,
RI = "p", CI = "p", refConf = 0.95, limitConf = 0.9)
{
if (out.method == "dixon") {
output = dixon.outliers(data)
}
else if (out.method == "cook") {
output = cook.outliers(data)
}
else if (out.method == "vanderLoo") {
output = vanderLoo.outliers(data)
}
else {
output = horn.outliers(data)
}
if (out.rm == TRUE) {
data = output$subset
}
outliers = output$outliers
n = length(data)
mean = mean(data, na.rm = TRUE)
sd = sd(data, na.rm = TRUE)
norm = NULL
if (RI == "n") {
methodRI = "Reference Interval calculated nonparametrically"
data = sort(data)
holder = nonparRI(data, indices = 1:length(data), refConf)
lowerRefLimit = holder[1]
upperRefLimit = holder[2]
if (CI == "p") {
CI = "n"
}
}
if (RI == "r") {
methodRI = "Reference Interval calculated using Robust algorithm"
holder = robust(data, 1:length(data), refConf)
lowerRefLimit = holder[1]
upperRefLimit = holder[2]
CI = "boot"
}
if (RI == "p") {
methodRI = "Reference Interval calculated parametrically"
methodCI = "Confidence Intervals calculated parametrically"
refZ = qnorm(1 - ((1 - refConf)/2))
limitZ = qnorm(1 - ((1 - limitConf)/2))
lowerRefLimit = mean - refZ * sd
upperRefLimit = mean + refZ * sd
se = sqrt(((sd^2)/n) + (((refZ^2) * (sd^2))/(2 * n)))
lowerRefLowLimit = lowerRefLimit - limitZ * se
lowerRefUpperLimit = lowerRefLimit + limitZ * se
upperRefLowLimit = upperRefLimit - limitZ * se
upperRefUpperLimit = upperRefLimit + limitZ * se
shap_normalcy = shapiro.test(data)
shap_output = paste(c("Shapiro-Wilk: W = ", format(shap_normalcy$statistic,
digits = 6), ", p-value = ", format(shap_normalcy$p.value,
digits = 6)), collapse = "")
ks_normalcy = suppressWarnings(ks.test(data, "pnorm",
m = mean, sd = sd))
ks_output = paste(c("Kolmorgorov-Smirnov: D = ", format(ks_normalcy$statistic,
digits = 6), ", p-value = ", format(ks_normalcy$p.value,
digits = 6)), collapse = "")
if (shap_normalcy$p.value < 0.05 | ks_normalcy$p.value <
0.05) {
norm = list(shap_output, ks_output)
}
else {
norm = list(shap_output, ks_output)
}
}
if (CI == "n") {
if (n < 120) {
cat("\nSample size too small for non-parametric confidence intervals, \n \t\tbootstrapping instead\n")
CI = "boot"
}
else {
methodCI = "Confidence Intervals calculated nonparametrically"
ranks = nonparRanks[which(nonparRanks$SampleSize ==
n), ]
lowerRefLowLimit = data[ranks$Lower]
lowerRefUpperLimit = data[ranks$Upper]
upperRefLowLimit = data[(n + 1) - ranks$Upper]
upperRefUpperLimit = data[(n + 1) - ranks$Lower]
}
}
if (CI == "boot" & (RI == "n" | RI == "r")) {
methodCI = "Confidence Intervals calculated by bootstrapping, R = 5000"
if (RI == "n") {
bootresult = boot::boot(data = data, statistic = nonparRI,
refConf = refConf, R = 5000)
}
if (RI == "r") {
bootresult = boot::boot(data = data, statistic = robust,
refConf = refConf, R = 5000)
}
bootresultlower = boot::boot.ci(bootresult, conf = limitConf,
type = "basic", index = 1)
bootresultupper = boot::boot.ci(bootresult, conf = limitConf,
type = "basic", index = 2)
lowerRefLowLimit = bootresultlower$basic[4]
lowerRefUpperLimit = bootresultlower$basic[5]
upperRefLowLimit = bootresultupper$basic[4]
upperRefUpperLimit = bootresultupper$basic[5]
}
RVAL = list(size = n, dname = dname, out.method = out.method,
out.rm = out.rm, outliers = outliers, methodRI = methodRI,
methodCI = methodCI, norm = norm, refConf = refConf,
limitConf = limitConf, Ref_Int = c(lowerRefLimit = lowerRefLimit,
upperRefLimit = upperRefLimit), Conf_Int = c(lowerRefLowLimit = lowerRefLowLimit,
lowerRefUpperLimit = lowerRefUpperLimit, upperRefLowLimit = upperRefLowLimit,
upperRefUpperLimit = upperRefUpperLimit))
class(RVAL) = "interval"
return(RVAL)
}
However when I then execute this file a large number of terms end up being undefined, for instance when I use the function I get 'object 'nonparRanks' not found'.
How do I edit the function in the package? I have looked at trying to important the package namespace and environment but this has not helped. I have also tried to find the actual function in the package files in my directory, but not been able to.
I am reasonably experienced in R, but I have never had to edit a package before. I am clearly missing something about how functions are defined in packages, but I am not sure what.
In the beginning of the package there is a line
data(sysdata, envir=environment())
See here: https://github.com/cran/referenceIntervals/tree/master/data/sysdata.rda
I suspect that "nonparRanks" is defined there as I don't see it defined anywhere else. So perhaps you could download that file, write your own function, then run that same line before running your function and it may work.
EDIT:
Download the file then run:
load("C:/sysdata.rda")
With your path to the file and then your function will work.
nonparRanks is a function in the referenceIntervals package:
Table that dictate the ranks for the confidence intervals
around thecalculated reference interval
Your method of saving and editing the function is fine, but make sure you load all the necessary underlying functions to run it too.
The easiest thing to do might be to:
save your copied and pasted R function as a different name, e.g. singleRefLimit2, then
call library("referenceIntervals"), which will load all the underlying functions you need and then
load your function source("singelRefLimit2.R"), with whatever edits you choose to make.
Related
I am trying to remove columns where proportion of NA value are greater than na_cutoff threshold using mlr3pipelines.
Here is my try:
library(mlr3)
library(mlr3pipelines)
task = tsk("iris")
dt = task$data()
dt[1:50, Sepal.Width := NA]
task_ = as_task_classif(dt, target = "Species")
graph = po("removeconstants", id = "removeconstants", ratio = 0.01) %>>%
po("select", id = "drop_na_cols")
ps = ParamSet$new(list(ParamDbl$new("na_cutoff", lower = 0, upper = 1, default = 0.2)))
graph$param_set$add(ps)
graph$param_set
graph$param_set$trafo = function(x, param_set) {
na_cutoff = x$na_cutoff
print(na_cutoff)
x$drop_na_cols.selector = function(task) {
fn = task$feature_names
data = task$data(cols = fn)
drop <- which(colMeans(is.na(data)) > na_cutoff)
fn[-drop]
}
x$na_cutoff = NULL
x
}
train_res = graph$train(task_)
train_res$drop_na_cols.output$data()
The problem is that last column is not removed even it should be.
In general, trafos are not meant for parameter sets.
I.e. internally, when the Graph accesses the parameters, the parameter transformation is not applied.
They are intended to create search spaces for black-box optimization, including hyperparameter optimization of ML models.
Also, you modifying the parameter set of an existing Graph is a bad idea.
The way to go I believe is to use the PipeOpSelect with a custom selector: https://mlr3pipelines.mlr-org.com/reference/Selector.html
Following this issue https://github.com/mlr-org/mlr3pipelines/issues/313
I thought the recommended way to do this is through trafo on select pipe.
Nevertheless, I have just created new pipeop that removes columns with many NA values:
library(mlr3pipelines)
library(mlr3verse)
library(mlr3misc)
library(R6)
PipeOpDropNACol = R6::R6Class(
"PipeOpDropNACol",
inherit = mlr3pipelines::PipeOpTaskPreprocSimple,
public = list(
initialize = function(id = "drop.nacol", param_vals = list()) {
ps = ParamSet$new(list(
ParamDbl$new("cutoff", lower = 0, upper = 1, default = 0.05, tags = c("dropnacol_tag"))
))
ps$values = list(cutoff = 0.2)
super$initialize(id, param_set = ps, param_vals = param_vals)
}
),
private = list(
.get_state = function(task) {
pv = self$param_set$get_values(tags = "dropnacol_tag")
print(pv$cutoff)
features_names = task$feature_names
data = task$data(cols = features_names)
print(data)
many_na = sapply(data, function(column) (sum(is.na(column))) / length(column) > pv$cutoff)
print(many_na)
list(cnames = colnames(data)[-many_na])
},
.transform = function(task) {
task$select(self$state$cnames)
}
)
)
# no group variable
task = tsk("iris")
dt = task$data()
dt[1:50, Sepal.Width := NA]
task = as_task_classif(dt, target = "Species")
gr = Graph$new()
gr$add_pipeop(PipeOpDropNACol$new())
result = gr$train(task)
result[[1]]$data()
gr$predict(task)
I am trying to modify an existing function by copy and pasting it to an R script, and assigning it to a new function object in my local environment. However the new function cannot find functions that are called to within the original function. How can I fix this without looking up and finding each function individually? I am guessing that the original function is somehow linked to the package or its dependencies and 'knows where to look' for the missing function, but I cannot figure out how to do this with my new copy-and-pasted function.
library("camtrapR")
Print the function name
activityDensity
The output here is the code for this function. I have omitted it here because it is long (and I have pasted it below), but I copy and paste the output of the function code exactly (see below where I assign this exact code to a new function), except for the last two lines of output, which I think are important:
<bytecode: 0x000000002a2d1e20>
<environment: namespace:camtrapR>
So now I assign the copy and pasted code from the output above to a new function with New <-
New <- function (recordTable, species, allSpecies = FALSE, speciesCol = "Species",
recordDateTimeCol = "DateTimeOriginal", recordDateTimeFormat = "%Y-%m-%d %H:%M:%S",
plotR = TRUE, writePNG = FALSE, plotDirectory, createDir = FALSE,
pngMaxPix = 1000, add.rug = TRUE, ...)
{
wd0 <- getwd()
mar0 <- par()$mar
on.exit(setwd(wd0))
on.exit(par(mar = mar0), add = TRUE)
recordTable <- dataFrameTibbleCheck(df = recordTable)
timeZone <- "UTC"
checkForSpacesInColumnNames(speciesCol = speciesCol, recordDateTimeCol = recordDateTimeCol)
if (!is.data.frame(recordTable))
stop("recordTable must be a data frame", call. = FALSE)
if (!speciesCol %in% colnames(recordTable))
stop(paste("speciesCol = \"", speciesCol, "\" is not a column name in recordTable",
sep = ""), call. = FALSE)
if (!recordDateTimeCol %in% colnames(recordTable))
stop(paste("recordDateTimeCol = \"", recordDateTimeCol,
"\" is not a column name in recordTable", sep = ""),
call. = FALSE)
stopifnot(is.logical(c(allSpecies, writePNG, plotR, createDir)))
if (allSpecies == FALSE) {
stopifnot(species %in% recordTable[, speciesCol])
stopifnot(hasArg(species))
}
recordTable$DateTime2 <- parseDateTimeObject(inputColumn = recordTable[,
recordDateTimeCol], dateTimeFormat = recordDateTimeFormat,
timeZone = timeZone)
recordTable$Time2 <- format(recordTable$DateTime2, format = "%H:%M:%S",
usetz = FALSE)
recordTable$Time.rad <- (as.numeric(as.POSIXct(strptime(recordTable$Time2,
format = "%H:%M:%S", tz = timeZone))) - as.numeric(as.POSIXct(strptime("0",
format = "%S", tz = timeZone))))/3600 * (pi/12)
if (isTRUE(writePNG)) {
if (hasArg(plotDirectory)) {
if (isTRUE(createDir)) {
dir.create(plotDirectory, recursive = TRUE, showWarnings = FALSE)
setwd(plotDirectory)
}
else {
stopifnot(file.exists(plotDirectory))
setwd(plotDirectory)
}
}
else {
stop("writePNG is TRUE. Please set plotDirectory",
call. = FALSE)
}
}
pngWidth <- pngMaxPix
pngHeight <- round(pngMaxPix * 0.8)
if (allSpecies == FALSE) {
subset_species <- subset(recordTable, recordTable[, speciesCol] ==
species)
if (nrow(subset_species) == 1)
stop(paste(species, "had only 1 record. Cannot estimate density."),
call. = FALSE)
try_error_tmp <- try({
if (isTRUE(writePNG))
png(filename = paste("activity_density_",
species, "_", Sys.Date(), ".png",
sep = ""), width = pngWidth, height = pngHeight,
units = "px", res = 96, type = "cairo")
if (isTRUE(writePNG) | isTRUE(plotR)) {
densityPlot(subset_species$Time.rad, main = paste("Activity of",
species), rug = add.rug, ...)
mtext(paste("number of records:", nrow(subset_species)),
side = 3, line = 0)
}
if (isTRUE(writePNG))
dev.off()
}, silent = TRUE)
if (class(try_error_tmp) == "try-error")
warning(paste(toupper(species), ": ", try_error_tmp[1],
" - SKIPPED", sep = ""), call. = FALSE)
}
else {
subset_species_list <- list()
for (i in 1:length(unique(recordTable[, speciesCol]))) {
spec.tmp <- unique(recordTable[, speciesCol])[i]
subset_species <- subset(recordTable, recordTable[,
speciesCol] == spec.tmp)
plot_main_title <- paste("Activity of", spec.tmp)
if (nrow(subset_species) == 1) {
warning(paste(toupper(spec.tmp), ": It had only 1 record. Cannot estimate density. - SKIPPED",
sep = ""), call. = FALSE)
next
}
else {
try_error_tmp <- try({
if (isTRUE(writePNG))
png(filename = paste("activity_density_",
spec.tmp, "_", Sys.Date(), ".png",
sep = ""), width = pngWidth, height = pngHeight,
units = "px", res = 96, type = "cairo")
if (isTRUE(writePNG) | isTRUE(plotR)) {
densityPlot(subset_species$Time.rad, main = plot_main_title,
rug = add.rug, ...)
mtext(paste("number of records:", nrow(subset_species)),
side = 3, line = 0)
}
if (isTRUE(writePNG))
dev.off()
}, silent = TRUE)
if (class(try_error_tmp) == "try-error")
warning(paste(toupper(spec.tmp), ": ",
try_error_tmp[1], " - SKIPPED",
sep = ""), call. = FALSE)
}
subset_species_list[[i]] <- subset_species$Time.rad
names(subset_species_list)[i] <- spec.tmp
}
}
if (allSpecies == FALSE) {
return(invisible(subset_species$Time.rad))
}
else {
return(invisible(subset_species_list))
}
}
Yet, when I try to run this new function (arguments omitted here for clarity), it can't find a function embedded within.
How can I somehow assign this function to look within the original package camtrapR for any dependencies, etc.? and why does the code output from the function not already do this?
New()
Error in dataFrameTibbleCheck(df = recordTable) :
could not find function "dataFrameTibbleCheck"
This answer here: https://stackoverflow.com/a/49277036/9096420 allows one to manually edit and save a function's code for each R session, but it is non-reproducible (not code) that can be shared or re-used.
If New is the new function copied from camtrapR then use
environment(New) <- asNamespace("camtrapR")
to ensure that the function calls in its body are looked up in the correct places.
I would like to build a CPO for the mlr::makeClassificationViaRegression wrapper. The wrapper builds regression models that predict for the positive class whether a particular example belongs to it (1) or not (-1). It also calculates predicted probabilities using a softmax.
After reading the documentation and vignettes for makeCPOTargetOp, my attempt is as follows:
cpoClassifViaRegr = makeCPOTargetOp(
cpo.name = 'ClassifViaRegr',
dataformat = 'task', #Not sure - will this work if input is df with unknown target values?
# properties.data = c('numerics', 'factors', 'ordered', 'missings'), #Is this needed?
properties.adding = 'twoclass', #See https://mlrcpo.mlr-org.com/articles/a_4_custom_CPOs.html#task-type-and-conversion
properties.needed = character(0),
properties.target = c('classif', 'twoclass'),
task.type.out = 'regr',
predict.type.map = c(response = 'response', prob = 'response'),
constant.invert = TRUE,
cpo.train = function(data, target) {
getTaskDesc(data)
},
cpo.retrafo = function(data, target, control) {
cat(class(target))
td = getTaskData(target, target.extra = T)
target.name = paste0(control$positive, ".prob")
data = td$data
data[[target.name]] = ifelse(td$target == pos, 1, -1)
makeRegrTask(id = paste0(getTaskId(target), control$positive, '.'),
data = data,
target = target.name,
weights = target$weights,
blocking = target$blocking)
},
cpo.train.invert = NULL, #Since constant.invert = T
cpo.invert = function(target, control.invert, predict.type) {
if(predict.type == 'response') {
factor(ifelse(target > 0, control.invert$positive, control.invert$positive))
} else {
levs = c(control.invert$positive, control.invert$negative)
propVectorToMatrix(vnapply(target, function(x) exp(x) / sum(exp(x))), levs)
}
})
It seems to work as expected, the demo below shows that the inverted prediction is identical to the prediction obtained using the makeClassificationViaRegr wrapper:
lrn = makeLearner("regr.lm")
# Wrapper -----------------------------------------------------------------
lrn2 = makeClassificationViaRegressionWrapper(lrn)
model = train(lrn2, sonar.task, subset = 1:140)
predictions = predict(model, newdata = getTaskData(sonar.task)[141:208, 1:60])
# CPO ---------------------------------------------------------------------
sonar.train = subsetTask(sonar.task, 1:140)
sonar.test = subsetTask(sonar.task, 141:208)
trafd = sonar.train %>>% cpoClassifViaRegr()
mod = train(lrn, trafd)
retr = sonar.test %>>% retrafo(trafd)
pred = predict(mod, retr)
invpred = invert(inverter(retr), pred)
identical(predictions$data$response, invpred$data$response)
The problem is that the after the CPO has converted the task from twoclass to regr, there is no way for me to specify predict.type = 'prob'. In the case of the wrapper, the properties of the base regr learner are modified to accept predict.type = prob (see here). But the CPO is unable to modify the learner in this way, so how can I tell my model to return predicted probabilities instead of the predicted response?
I was thinking I could specify a include.prob parameter, i.e. cpoClassifViaRegr(include.prob = T). If set to TRUE, the cpo.invert returns the predicted probabilities in addition to the predicted response. Would something like this work?
I'm having a problem with implementing the function bt.matching.find from the SIT toolbox which is hosted on Github. After downloading the toolbox following the steps described here, I tried to replicate the code described in this blog
library(SIT.dates)
library(SIT)
objt <- bt.matching.find(Cl(data), normalize.fn = normalize.mean, dist.fn = 'dist.euclidean', plot=T)
R did not find the function, so I tried using spacing to access the function
objt <- SIT:::bt.matching.find(Cl(data), normalize.fn = normalize.mean, dist.fn = 'dist.euclidean', plot=T)
But this time I got a weird error which has nothing to do with any argument in the function
Error in last(data, n.reference) : could not find function "last"
I did research on the function bt.matching.find using the function getAnywhere and here's what I got
getAnywhere("bt.matching.find")
A single object matching ‘bt.matching.find’ was found
It was found in the following places
namespace:SIT
with value
function (data, n.query = 90, n.reference = 252 * 10, n.match = 10,
normalize.fn = normalize.mean.sd, dist.fn = dist.euclidean,
plot = FALSE, plot.dist = FALSE, layout = NULL, main = NULL)
{
data = last(data, n.reference)
reference = coredata(data)
n = len(reference)
query = reference[(n - n.query + 1):n]
reference = reference[1:(n - n.query)]
main = paste(main, join(format(range(index(data)[(n - n.query +
1):n]), "%d%b%Y"), " - "))
n.query = len(query)
n.reference = len(reference)
dist.fn.name = ""
if (is.character(dist.fn)) {
dist.fn.name = paste("with", dist.fn)
dist.fn = get(dist.fn)
}
dist = rep(NA, n.reference)
query.normalized = match.fun(normalize.fn)(query)
for (i in n.query:n.reference) {
window = reference[(i - n.query + 1):i]
window.normalized = match.fun(normalize.fn)(window)
dist[i] = match.fun(dist.fn)(rbind(query.normalized,
window.normalized))
if (i%%100 == 0)
cat(i, "\n")
}
min.index = c()
temp = dist
temp[temp > mean(dist, na.rm = T)] = NA
for (i in 1:n.match) {
if (any(!is.na(temp))) {
index = which.min(temp)
min.index[i] = index
temp[max(0, index - 2 * n.query):min(n.reference,
(index + n.query))] = NA
}
}
n.match = len(min.index)
if (plot) {
dates = index(data)[1:len(dist)]
if (is.null(layout)) {
if (plot.dist)
layout(1:2)
else layout(1)
}
par(mar = c(2, 4, 2, 2))
if (plot.dist) {
plot(dates, dist, type = "l", col = "gray", main = paste("Top
Historical Matches for",
main, dist.fn.name), ylab = "Distance", xlab = "")
abline(h = mean(dist, na.rm = T), col = "darkgray",
lwd = 2)
points(dates[min.index], dist[min.index], pch = 22,
col = "red", bg = "red")
text(dates[min.index], dist[min.index], 1:n.match,
adj = c(1, 1), col = "black", xpd = TRUE)
}
plota(data, type = "l", col = "gray", LeftMargin = 1,
main = iif(!plot.dist, paste("Top Historical Matches for",
main), NULL))
plota.lines(last(data, 90), col = "blue")
for (i in 1:n.match) {
plota.lines(data[(min.index[i] - n.query + 1):min.index[i]],
col = "red")
}
text(index4xts(data)[min.index - n.query/2], reference[min.index -
n.query/2], 1:n.match, adj = c(1, -1), col = "black",
xpd = TRUE)
plota.legend(paste("Pattern: ", main, ",Match Number"),
"blue,red")
}
return(list(min.index = min.index, dist = dist[min.index],
query = query, reference = reference, dates = index(data),
main = main))
}
<bytecode: 0x000000e7e11c8a00>
<environment: namespace:SIT>
I tried calling the function using backports package
library(backports)
.onLoad <- function(libname, pkgname) {
backports::import(SIT, "bt.matching.find", force = TRUE)
}
But this also didn't work
Why is R not able to access the function? could this be because this package was built under an older version?
Additional information
Environment
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8.1 x64 (build 9600)
The problem was solved with help from the developer of the package, for anyone who is interested in using the code, here are the adjustments that should be done
library(SIT)
library(quantmod)
tickers = 'SPY'
data = getSymbols(tickers, src = 'yahoo', from = '1950-01-01', auto.assign = F)
obj = SIT:::bt.matching.find(Cl(data), normalize.fn = SIT:::normalize.mean, dist.fn = 'dist.euclidean', plot=T)
matches = SIT:::bt.matching.overlay(obj, plot.index=1:90, plot=T)
layout(1:2)
matches = SIT:::bt.matching.overlay(obj, plot=T, layout=T)
SIT:::bt.matching.overlay.table(obj, matches, plot=T, layout=T)
I am trying to bound the following variables in a function
lower = c(Hyp.b = 0.01, Hyp.Di = .0000001),
upper = c(Hyp.b = 1.01, Hyp.Di = .1)
The script is as follows:
Hyp.q.forward.fun = function( time, Hyp.qi, Hyp.b, Hyp.Di ){ # time in days
Hyp.q.theo = Hyp.qi*(1 + Hyp.b*Hyp.Di*time)^(-1/Hyp.b)
return(Hyp.q.theo)
}
residfun = function(x,x.days,y.prod){
Hyp.qi = x[1]
Hyp.b = x[2]
Hyp.Di = x[3]
q.theo = (365.25/12)*Hyp.q.forward.fun(
time=x.days,
Hyp.qi=Hyp.qi,
Hyp.b=Hyp.b,
Hyp.Di=Hyp.Di)
#plot(x.days,y.prod); lines(x.days,q.theo)
residual = sqrt(sum((q.theo-y.prod)^2))
return(residual)
}
I'm not sure if I'm even using the correct method to bound the two variable. Any help would be much appreciated.