Which methods of normalization exist in R?

I'm working with data that is not normally distributed. I have applied the common transformations, logs and square roots, to transform the data so that I can fit an ARIMA model and make a forecast.
What I have tried is:
set.seed(123)
y <- rexp(200)                    # exponential data, clearly non-normal
yl <- log(y + 1)                  # log transform
shapiro.test(yl)
trans <- (y - mean(y)) / sd(y)    # standardization (z-scores)
shapiro.test(trans)
These methods are failing the normality test. Are there other options in R for transforming data so that it becomes (approximately) normal?

You can try the forecast package, which has the BoxCox.lambda function to handle Box-Cox transformations. The scaling/re-scaling is done automatically. Example:
require(forecast)
y <- ts(rnorm(120,0,3) + 20*sin(2*pi*(1:120)/12), frequency=12) + runif(120)
lambda <- BoxCox.lambda(y) # should check if the transformation is necessary
model <- auto.arima(y, lambda = lambda)
plot(forecast(model))
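If you want a quick check of how close to normal the transformed series is, a minimal sketch (reusing the y and lambda from above; shapiro.test is just one possible check):
y_bc <- BoxCox(y, lambda)          # apply the Box-Cox transform with the chosen lambda
shapiro.test(as.numeric(y_bc))     # test normality of the transformed values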

Related

Back transformation of emmeans in lmer

I had to transform a response variable (e.g. Variable1) to fulfil the assumptions of linear models in lmer, using an approach for heavy-tailed data suggested here https://www.r-bloggers.com/2020/01/a-guide-to-data-transformation/ and demonstrated below:
TransformVariable1 <- sqrt(abs(Variable1 - median(Variable1)))
I then fit the data to the following example model:
fit <- lmer(TransformVariable1 ~ x + y + (1|z), data = dataframe)
Next, I update the reference grid to account for the transformation, as suggested here in "Specifying that model is logit transformed to plot backtransformed trends":
rg <- update(ref_grid(fit), tran = "TransformVariable1")
Nevertheless, the emmeans are not back-transformed to the original scale after using the following command:
fitemm <- as.data.frame(emmeans(rg, ~ x + y, type = "response"))
My question is: How can I back transform the emmeans to the original scale?
Thank you in advance.
There are two major problems here.
The lesser of them is in specifying tran. You need to either specify one of a handful of known transformations, such as "log", or a list with the needed functions to undo the transformation and implement the delta method. See the help for make.link, make.tran, and vignette("transformations", "emmeans").
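For illustration only (this does not fix the monotonicity problem discussed next), the two routes look roughly like this, assuming fit is the lmer model above:
rg <- update(ref_grid(fit), tran = "sqrt")                    # a known transformation name
rg <- update(ref_grid(fit), tran = make.tran("boxcox", 0.5))  # or a transformation object carrying linkfun/linkinv/mu.eta for the delta method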
The much more serious issue is that the transformation used here is not a monotone function, so it is impossible to back-transform the results. Each transformed response value corresponds to two possible values on either side of the median of the original variable. The model we have here does not estimate effects on the given variable, but rather effects on the dispersion of that variable. It's like trying to use the speedometer as a substitute for a navigation system.
I would suggest using a different model, or at least a different response variable.
A possible remedy
Looking again at this, I wonder if what was meant was the symmetric square-root transformation -- what is shown multiplied by sign(Variable1 - median(Variable1)). This transformation is available in emmeans::make.tran(). You will need to re-fit the model.
What I suggest is creating the transformation object first, then using it throughout:
require(lme4)
require(emmeans)
symsqrt <- make.tran("sympower", param = c(0.5, median(Variable1)))
fit <- with(symsqrt,
    lmer(linkfun(Variable1) ~ x + y + (1|z), data = dataframe)
)
emmeans(fit, ~ x + y, type = "response")
symsqrt comprises a list of functions needed to implement the transformation. The transformation itself is symsqrt$linkfun, and the emmeans package knows to look for the other stuff when the response transformation is named linkfun.
BTW, please break the habit of wrapping emmeans() in as.data.frame(). That renders invisible some important annotations, and also disables the possibility of following up with contrasts and comparisons. If you think you want to see more precision than is shown, you can precede the call with emm_options(opt.digits = FALSE); but really, you are kidding yourself if you think those extra digits give you useful information.
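For example, keeping the emmGrid object lets you follow up directly; a short sketch with the model above:
emm <- emmeans(fit, ~ x + y, type = "response")
emm          # prints the annotated summary
pairs(emm)   # pairwise comparisons as a follow-up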

How to use the Amelia package to get the best time series model in R

I'm trying to handle missing data in a data frame using multiple imputation; my professor advised me to use the Amelia package. I can build the time series model, but when I try to use lapply to run the time series model on each imputed dataset, I get an error about the function passed to lapply.
My data frame has three variables: date, pm25, and pm10. I can build an AR model for pm25.
And the imputation code is:
imp <- amelia(Exetertibble, m=50, ts = "date")
So I can get 50 imputations, and the time series model would like this:
model1 <- arima(imp$imputations$imp1$pm25, order = c(1,0,0))
Then I try to use lapply function:
extractcoefs <- lapply(imp$imputations, coef(model1))
There is an error saying that coef(model1) is not a function, character, or symbol.
My aim is to combine the 50 imputations to get the best estimate of the coefficients of the time series model, but I don't know how to write a correct function for this.
I also tried:
extractcoefs <- lapply(imp$imputations, coef(arima(order=c(1,0,0))))
and:
extractcoefs <- lapply(imp$imputations, arima(order=c(1,0,0)$coef))
No idea what you are trying to do.
Look at this example for lapply:
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
# compute the list mean for each list element
lapply(x, mean)
So you give lapply a list and apply a function to each of the list elements. In this case the function is mean().
So for this example you get the mean of a, beta, and logic.
You are using lapply on imp$imputations.
You got imp$imputations from your call to the amelia() function, which gives you an instance of the S3 class "amelia". This instance includes several objects; one of them is a list whose elements are the imputed datasets (in your case 50).
So lapply(imp$imputations, coef(model1)) would apply the second argument to every imputed dataset. The problem is that your second argument isn't really a function. Also, you can't apply coef() to the imputed datasets themselves: coef() must be applied to a model object, because it returns the coefficients from the model.
I guess you want to do the following:
Generate your m=50 imputed datasets
Build an arima model for each dataset
Get the coefficients for each of these models
You could just use a for loop through the m=50 datasets for this.
Take this as an example:
data(africa)
imp <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc", m = 5)
for (i in 1:length(imp$imputations))
{
model <- arima(imp$imputations[[i]]$gdp_pc, order = c(1, 0, 0))
coe <- coef(model)
print(coe)
}
This gives one set of coef results per arima model, i.e. one per imputed dataset (m = 5 in the africa example, m = 50 in your case).
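If you prefer lapply over the loop, the same idea looks like this (a sketch, assuming each imputed dataset contains the pm25 column from your own data):
extractcoefs <- lapply(imp$imputations,
                       function(d) coef(arima(d$pm25, order = c(1, 0, 0))))
extractcoefs   # a named list with one coefficient vector per imputed dataset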

Estimating Lambda for Yeo and Johnson transform

I have a time series of rainfall values in a csv file. I plotted the histogram of the data; the histogram is skewed to the left. I want to transform the values so that they follow a normal distribution, so I used the Yeo-Johnson transform available in R. The transformed values are here.
My question is:
In the above transformation I used a test value of 0.5 for lambda, which works fine. Is there a way to determine the optimal value of lambda based on the time series? I'll appreciate any suggestions.
So far, here's the code:
library(car)
dat <- scan("Zamboanga.csv")
hist(dat)
trans <- yjPower(dat,0.5,jacobian.adjusted=TRUE)
hist(trans)
Here is the csv file.
First find the optimal lambda by using the function boxCox from the car package to estimate λ by maximum likelihood.
You can plot it like this:
boxCox(your_model, family="yjPower", plotit = TRUE)
As Ben Bolker said in a comment, the model here could be something like
your_model <- lm(dat~1)
Then use the optimized lambda in your existing code.
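Putting it together, a minimal sketch (assuming the same dat as in your code; powerTransform() from car does the maximum-likelihood estimation of lambda):
library(car)
your_model <- lm(dat ~ 1)
boxCox(your_model, family = "yjPower", plotit = TRUE)              # likelihood profile for lambda
lambda_opt <- coef(powerTransform(your_model, family = "yjPower")) # ML estimate of lambda
trans <- yjPower(dat, lambda_opt, jacobian.adjusted = TRUE)
hist(trans)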

Reusing the model from R's forecast package

I have been told that, while using R's forecast package, one can reuse a model. That is, after the code x <- c(1,2,3,4); mod <- ets(x); f <- forecast(mod,h=1) one could append a new value with x <- append(x, 5) and predict the next value without re-estimating the model. How does one do that? (As I understand it, with simple exponential smoothing one would only need to know alpha, right?)
Is it something like forecast(x, model=mod)? If so, I should mention that I am using Java and calling the forecast code programmatically (for many time series), so I don't think I could keep the model object in the R environment all the time. Would there be an easy way to keep the model object in Java and load it into the R environment when needed?
You have two questions here:
A) Can the forecast package "grow" its datasets? I can't speak in great detail to this package and you will have to look at its documentation. However, R models in general follow the structure
fit <- someModel(formula, data)
estfit <- predict(fit, newdata=someDataFrame)
i.e. you supply updated data along with an existing fit object.
B) Can I serialize a model back and forth to Java? Yes, you can. Rserve is one option; you can also try a basic serialize() to a (raw) character vector, or even just save(fit, file="someFile.RData").
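A minimal sketch of that last option, using saveRDS/readRDS instead of save (the file name is illustrative; any path your Java process can read and write will do):
saveRDS(mod, "ets_model.rds")       # persist the fitted ets model after one R call
mod2 <- readRDS("ets_model.rds")    # restore it in a later R session
f <- forecast(ets(x, model = mod2), h = 1)   # reuse the stored parameters without re-estimating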
Regarding your first question:
x <- 1:4
mod <- ets(x)
f1 <- forecast(mod, h=1)
x <- append(x, 5)
mod <- ets(x, model=mod) # Reuses old mod without re-estimating parameters.
f2 <- forecast(mod, h=1)

Error returned predicting new data using GAM with periodic smoother

Apologies if this is better suited to Cross Validated.
I am fitting GAM models to binomial data using the mgcv package in R. One of the covariates is periodic, so I am specifying the bs = "cc" cyclic cubic spline. I am doing this in a cross validation framework, but when I go to fit my holdout data using the predict function I get the following error:
Error in pred.mat(x, object$xp, object$BD) :
can't predict outside range of knots with periodic smoother
Here is some code that should replicate the error:
# generate data:
x <- runif(100,min=-pi,max=pi)
linPred <- 2*cos(x) # value of the linear predictor
theta <- 1 / (1 + exp(-linPred))  # inverse logit
y <- rbinom(100,1,theta)
plot(x,theta)
df <- data.frame(x=x,y=y)
# fit gam with periodic smoother:
gamFit <- gam(y ~ s(x,bs="cc",k=5),data=df,family=binomial())
summary(gamFit)
plot(gamFit)
# predict y values for new data:
x.2 <- runif(100,min=-pi,max=pi)
df.2 <- data.frame(x=x.2)
predict(gamFit,newdata=df.2)
Any suggestions on where I'm going wrong would be greatly appreciated. Maybe manually specifying knots to fall on -pi and pi?
I did not get an error on the first run, but I did replicate the error on the second try. Use set.seed(123) (no error) and set.seed(223) (produces the error) to reproduce both cases. I think you are just seeing some variation with a relatively small number of points in your derivation and validation datasets; 100 points is not particularly "generous" for a GAM fit.
Looking at the gamFit object it appears that the range of the knots is encoded in gamFit$smooth[[1]]['xp'], so this should restrict your inputs to the proper range:
x.2 <- runif(100,min=-pi,max=pi);
x.2 <- x.2[findInterval(x.2, range(gamFit$smooth[[1]]['xp']) )== 1]
# Removes the errors in all the situations I tested
# There were three points outside the range in the set.seed(223) case
The problem is that your test set contains values that were not in the range of your training set. Since you used a spline, knots were created at the minimum and maximum value of x, and your fitted function is not defined outside of that range. So, when you test the model, you should exclude those points that are outside the range. Here is how you would exclude the points in the test set:
set.seed(2)
... <Your code>
predict(gamFit,newdata=df.2[df.2$x>=min(df$x) & df.2$x<=max(df$x),,drop=F])
Or, you could specify the "outer" knot points in the model to the min and max of your whole data. I don't know how to do that offhand.
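One way to do that, as a sketch: gam() accepts a knots argument, and for bs = "cc" with k = 5 you can supply 5 knot locations spanning the full range of x, so the boundary knots sit at -pi and pi:
gamFit <- gam(y ~ s(x, bs = "cc", k = 5), data = df, family = binomial(),
              knots = list(x = seq(-pi, pi, length.out = 5)))
predict(gamFit, newdata = df.2)   # new x values in [-pi, pi] now fall inside the knot range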
