Box-Cox transformation in R, apply to a column

I have skewed data that I need to normalize in order to run a t-test, and I am struggling to find an implementation of the Box-Cox transformation that takes a specified lambda. I tried a log transform, but it does not work well for a few data points.
I come from Python where there is this function:
>>> from scipy.special import boxcox
>>> boxcox([1, 4, 10], 2.5)
array([  0.        ,  12.4       , 126.09110641])
where 2.5 is the specified lambda. This function can then be applied to a whole column.
I would like to find its equivalent in R, but so far I have only found the boxcox function in the MASS package, which gives me the best lambda parameter; I cannot seem to find a way to apply any lambda I want.

You can try the boxcox function from the EnvStats package (see here).
There you can specify lambda:
library(EnvStats)
boxcox(1:10, lambda = 2.5, optimize = FALSE)
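(EnvStats also has boxcoxTransform(x, lambda = 2.5), which returns the transformed values directly.) If you just want the transformation applied to a column, it is also easy to write by hand; a minimal sketch matching the scipy definition, assuming positive data:

boxcox_manual <- function(x, lambda) {
  # (x^lambda - 1) / lambda, with log(x) as the lambda = 0 limit
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

boxcox_manual(c(1, 4, 10), 2.5)
# [1]   0.0000  12.4000 126.0911

This can be applied directly to a data frame column, e.g. df$y <- boxcox_manual(df$x, 2.5).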

Related

Custom Weight Regularization in Keras

I am attempting to implement a custom regularization method in Keras for R which will discourage negative weightings during training. I have found supporting documentation for this in Python, just not for R.
In this method, I would like to identify negative weightings, and then apply regularization to those weights specifically. I have my current attempt defined as
l1l2_reg <- function(weight_matrix) {
  neg <- which(weight_matrix < 0, arr.ind = TRUE)
  return(0.0001 * sum(sum(weight_matrix[neg]^2)) + sum(sum(abs(weight_matrix[neg]^2))))
}
I am defining the usage of this within my model as
reconstruct <- bottleneck %>%
  layer_dense(units = input_size, activation = "linear",
              kernel_regularizer = l1l2_reg,
              name = "reconstruct")
When the model is run, I am met with the error message
Error: Discrete value supplied to continuous scale
I believe that this is occurring because the function is not correctly locating the weights, but I am unsure how to go about fixing it. Based on the code above, it should be identifying the indices of the negative weightings and then returning the regularization based off of that, but clearly my implementation is flawed. I primarily use MATLAB, so my implementation may be skewed towards that as well.
What is the correct method of implementing this within R?
For most custom functions passed to Keras (in both Python and R), you generally have to stick to TensorFlow operations. In this case, which() and subsetting with an integer array via [neg] need to be updated to their TensorFlow equivalents: tf$where() and tf$gather_nd(). Or you can take a different approach altogether and isolate the negative entries with tf$minimum(), like in the example below.
(The [ method for tensors today doesn't yet accept a list of arbitrary integer indices, but rather slice specs; in R, see ?`[.tensorflow.tensor` for details.)
(sum(), abs(), ^, and * are R generics that automatically dispatch to the TensorFlow methods tf$reduce_sum(), tf$abs(), tf$pow(), and tf$multiply() when called with a tensor.)
You can update your l1l2_reg like this (note that the actual calculation is slightly different from what you wrote, to match the common meaning of "l1" and "l2"):
library(tensorflow)
library(keras)

neg_l1l2_reg <- function(weight_matrix) {
  # Keep only the negative entries (zeros elsewhere); penalizing these
  # discourages negative weights during training
  x <- tf$minimum(tf$zeros_like(weight_matrix), weight_matrix)
  l1 <- sum(abs(x)) * 0.0001
  l2 <- sum(x ^ 2) * 0.0001
  l1 + l2
}
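If you prefer the indexing route mentioned above, a sketch of the same penalty using tf$where() and tf$gather_nd() might look like this (untested, mirroring the note about the TensorFlow equivalents of which() and [neg]):

neg_l1l2_reg2 <- function(weight_matrix) {
  neg_idx <- tf$where(weight_matrix < 0)     # indices of the negative entries
  x <- tf$gather_nd(weight_matrix, neg_idx)  # gather those weights as a 1-d tensor
  l1 <- sum(abs(x)) * 0.0001
  l2 <- sum(x ^ 2) * 0.0001
  l1 + l2
}

Either function can then be passed as kernel_regularizer in the layer definition shown in the question.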

Transform data using inverse Gaussian cumulative density function in R (from given MATLAB code)

I have been given MATLAB code and need to figure out how to do the same in R.
These are my instructions:
then for each column (variable) all the rank values in Rtr are transformed back to 'actual' values by the built-in norminv() function which is just the inverse of the Gaussian cumulative density function
Basically, the step I cannot figure out is the MATLAB norminv() command; more specifically, it looks like this:
output(:,i) = norminv(data(:,i)/(N+1),0,1)
I have tried this solution given in another thread:
library(actuar)
library(fitdistrplus)
fig <- fitdist(claims, "invgauss", start = list(mean = 5, shape = 1))
But as far as I can tell from the output, it doesn't actually give you transformed data, just a test of how well the data fits the inverse Gaussian distribution.
Does anyone have a good solution to my problem? Or am I missing something in the output I get from that other solution?
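For reference: the "invgauss" fit above concerns the inverse Gaussian distribution, which is a different thing from the inverse of the Gaussian CDF. The latter is what MATLAB's norminv() computes, and its R equivalent is qnorm(). A minimal translation of the MATLAB line, assuming data holds the rank values and N is the number of rows, might be:

# Assumed setup, mirroring the MATLAB snippet: `data` is the matrix of ranks
N <- nrow(data)
output <- apply(data, 2, function(col) qnorm(col / (N + 1), mean = 0, sd = 1))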

Weibull distribution with weighted data

I have some time-to-event data for which I need to generate around 200 shape/scale parameters for subgroups in a simulation model. I have analysed the data, and it best follows a Weibull distribution.
Normally, I would use the fitdistrplus package and fitdist(x, "weibull") to do so; however, this data has been matched using kernel matching, and I have a variable of weighting values called km, so the fit needs to incorporate a weight, which isn't something fitdist can do as far as I can tell.
With my gamma-distributed data, instead of using fitdist I did the calculation manually using the wtd.mean and wtd.var functions from the Hmisc package, which worked well. However, a similar formula for the Weibull is eluding me.
I've been testing a few options and comparing them against the fitdist results:
test_data <- rweibull(100, 0.676, 946)
fitweibull <- fitdist(test_data, "weibull", method = "mle", lower = c(0,0))
fitweibull$estimate
      shape       scale
  0.6981165 935.0907482
I first tested this: The Weibull distribution in R (ExtDist)
library(bbmle)
m1 <- mle2(y ~ dweibull(shape = exp(lshape), scale = exp(lscale)),
           data = data.frame(y = test_data),
           start = list(lshape = 0, lscale = 0))
which gave me lshape = -0.3919991 and lscale = 6.852033 (i.e. shape = exp(lshape) ≈ 0.676 and scale = exp(lscale) ≈ 946).
The other thing I've tried is eweibull from the EnvStats package.
eweibull <- eweibull(test_data)
eweibull$parameters
     shape      scale
  0.698091 935.239277
However, while these are giving results, I still don't think I can fit my data with the weights into any of these.
Edit: I have also tried the similarly named eWeibull from the ExtDist package (which I'm not 100% sure still works, but it does have a Weibull function that takes a weight!). I get a lot of error messages about the inputs being non-computable (NA or infinite). If I do it with map, as map(test_data, test_km, eWeibull), I get [[NULL]] for all 100 values. If I try it with just test_data, I get a long string of errors associated with optimx.
I have also tried fitDistr from propagate, which gives errors that the weights should be a specific length. For example, if both are set to length 100, I get an error that weights should be length 94. If I set it to 94, it tells me it has to be length 132.
I need to be able to pass either a set of pre-weighted mean/var/sd etc data into the calculation, or have a function that can take data and weights and use them both in the calculation.
After much trial and error, I edited the eweibull function from the EnvStats package to use wtd.mean(x, w) and sqrt(wtd.var(x, w)) in place of mean(x) and sd(x). This now runs and outputs weighted values.
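For completeness, here is a standalone sketch of that weighted method-of-moments idea (a hypothetical helper, not the actual EnvStats internals; assumes Hmisc provides wtd.mean and wtd.var):

library(Hmisc)

weighted_weibull_mme <- function(x, w) {
  m  <- wtd.mean(x, w)
  s  <- sqrt(wtd.var(x, w))
  cv <- s / m
  # Solve the Weibull coefficient-of-variation equation for the shape k:
  #   CV^2 = gamma(1 + 2/k) / gamma(1 + 1/k)^2 - 1
  f <- function(k) gamma(1 + 2 / k) / gamma(1 + 1 / k)^2 - 1 - cv^2
  shape <- uniroot(f, interval = c(0.05, 50))$root  # widen if CV falls outside
  scale <- m / gamma(1 + 1 / shape)
  c(shape = shape, scale = scale)
}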

analytical derivative of splinefun()

I'm trying to fit a natural cubic spline to probabilistic data (probabilities that a random variable is smaller than certain values) to obtain a cumulative distribution function, which works well enough using splinefun():
cutoffs <- c(-90,-60,-30,0,30,60,90,120)
probs <- c(0,0,0.05,0.25,0.5,0.75,0.9,1)
CDF.spline <- splinefun(cutoffs,probs, method="natural")
plot(cutoffs,probs)
curve(CDF.spline(x), add=TRUE, col=2, n=1001)
I would then, however, like to use the density function, i.e. the derivative of the spline, to perform various calculations (e.g. to obtain the expected value of the random variable).
Is there any way of obtaining this derivative as a function rather than just evaluated at a discrete number of points via splinefun(x, deriv=1)?
This is pretty close to what I'm looking for, but alas the example doesn't seem to work in R version 2.15.0.
Barring an analytical solution, what's the cleanest numerical way of going about this?
If you change the environment assignment line for g in the code that Berwin Turlach provided on R-help to this:
environment(g) <- environment(f)
... then it succeeds in R 2.15.1.
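Note also that the function returned by splinefun() itself accepts a deriv argument, so a lightweight alternative is simply to wrap it; a small sketch using the CDF fitted above:

# The spline's first derivative, wrapped as a standalone density function
PDF.spline <- function(x) CDF.spline(x, deriv = 1)
curve(PDF.spline(x), from = -90, to = 120, n = 1001)

# e.g. an expected value by numerically integrating x * f(x) over the support
integrate(function(x) x * PDF.spline(x), lower = -90, upper = 120)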

Using outer() with predict()

I am trying to use the outer function with predict in some classification code in R. For ease, we will assume in this post that we have two vectors named alpha and beta each containing ONLY 0 and 1. I am looking for a simple yet efficient way to pass all combinations of alpha and beta to predict.
I have constructed the code below to mimic the lda function from the MASS library, so rather than "lda", I am using "classifier". It is important to note that the prediction method within predict depends on an (alpha, beta) pair.
Of course, I could use a nested for loop to do this, but I am trying to avoid this method.
Here is what I would like to do ideally:
alpha <- seq(0, 1)
beta <- seq(0, 1)
classifier.out <- classifier(training.data, labels)
outer(X=alpha, Y=beta, FUN="predict", classifier.out, validation.data)
This is a problem because alpha and beta are not the first two parameters in predict.
So, in order to get around this, I changed the last line to
outer(X=alpha, Y=beta, FUN="predict", object=classifier.out, data=validation.data)
Note that my validation data has 40 observations, and also that there are 4 possible pairs of alpha and beta. I get an error though saying
dims [product 4] do not match the length of object [40]
I have tried a few other things, some of which work but are far from simple. Any suggestions?
The problem is that outer expects its function to be vectorized (i.e., it will call predict ONCE with the full vectors of arguments it wants evaluated). So when predict is called once, returning its result (which happens to be of length 40, one prediction per validation observation), outer complains because that doesn't equal the expected 4 (one value per alpha/beta pair).
One way to fix this is to use Vectorize. Untested code:
outer(X = alpha, Y = beta,
      FUN = Vectorize(predict, vectorize.args = c("alpha", "beta")),
      object = classifier.out, data = validation.data)
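As a self-contained illustration of the pattern, with a hypothetical function f standing in for predict:

# Toy function of two scalar tuning parameters plus a fixed data argument
f <- function(alpha, beta, data) mean(data) * alpha + beta

outer(X = 0:1, Y = 0:1,
      FUN = Vectorize(f, vectorize.args = c("alpha", "beta")),
      data = c(2, 4, 6))
#      [,1] [,2]
# [1,]    0    1
# [2,]    4    5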
I figured out one decent way to do this. Here it is:
pairs <- expand.grid(alpha, beta)
names(pairs) <- c("alpha", "beta")
mapply(predict, pairs$alpha, pairs$beta,
       MoreArgs = list(object = classifier.out, data = validation.data))
Anyone have something simpler and more efficient? I am very eager to know because I spent a little too long on this problem. :(
