Denormalize values after prediction in R - r

I am not sure whether I am denormalizing data correctly. I have one output variable and several input variables. I am normalizing them using the RSNNS package. Suppose x is an input matrix (NxM) where each of the N rows is an object with M features. And y is a vector (N) with corresponding answers.
nx <- normalizeData(x, type='0_1')
After that, some of the data are used to make the model and some for prediction. Suppose pred.ny are predicted values. These values are normalized.
pred.y <- denormalizeData(pred.ny, getNormParameters(nx))
Is this correct? How does it work? It's clear for one input it can use the min and max values previously used for normalization. But how does it work if each input was normalized separately, using its own min and max value?
update
Here is a toy-example, where '0_1' looks better than 'norm'. 'norm' make huge training error and almost constant prediction.
x <- runif(1020, 1, 5000)
y <- sqrt(x)
nx <- normalizeData(x, type='0_1')
ny <- normalizeData(y, type='0_1')
model <- mlp(nx[1:1000], ny[1:1000], size = 1)
plotIterativeError(model)
npy <- predict(model, matrix(nx[1001:1020], ncol=1))
py <- denormalizeData(npy, getNormParameters(ny))
print(cbind(y[1001:1020], py))

There are two things going on here:
Training your model, i.e. setting the internal coefficients in the neural network. You use both the inputs and output for this.
Using the model, i.e. getting predictions with fixed internal coefficients.
For part 1, you've decided to normalize your data. So the neural net works on normalized data. So you have trained the neural net
on the inputs fX(X) rather than X, where fX is the transform you used to the matrix of original inputs to produce the normalized inputs.
on the outputs fy(y) rather than y, where fy is the transform you applied to the vector of outputs to get the normalized outputs.
In terms of your original inputs and outputs your trained machine now looks like this:
Apply the normalization function fX to inputs to get normalized inputs fX(X).
Run neural net with normalized inputs to produce normalized outputs fy(y).
Apply the denormalization function fy-1 to the normalized outputs fy(y) to get y.
Note that fX(X) and fy, and hence fy-1, are defined on the training set.
So in R you might write something like this to get training data and normalize it, the first 100 rows
tx <- x[1:100,]
ntx <- normalizeData(tx, type='0_1')
ty <- y[1:100]
nty <- normalizeData(ty, type='0_1')
and something like this to denormalize the predicted results
pred.y <- denormalizeData(pred.ny, getNormParameters(nty))
# nty (or ny) not nx here
The slight concern I have is that I'd prefer to normalize the features used in prediction using the same transform fX that I used for training, but looking at the RSNNS documentation this facility doesn't appear to be present (it would be easy to write it yourself, however). It would probably be OK to normalize the prediction features using the whole X matrix, i.e. including the training data. (I can also see it might be preferable to use the default, normalization to z-score that RSNNS provides rather than the "0_1" choice you have used.)

Related

How to estimate a parameter in R- sample with replace

I have a txt file with numbers that looks like this(but with 100 numbers) -
[1] 7.1652348 5.6665965 4.4757553 4.8497086 15.2276296 -0.5730937
[7] 4.9798067 2.7396933 5.1468304 10.1221489 9.0165661 65.7118194
[13] 5.5205704 6.3067488 8.6777177 5.2528503 3.5039562 4.2477401
[19] 11.4137624 -48.1722034 -0.3764006 5.7647536 -27.3533138 4.0968204
I need to estimate MLE theta parameter from this distrubution -
[![this is my distrubution ][1]][1]
and I need to estimate theta from a sample of 1000 observations with replace, and save the sample, and do a hist.
How can I estimate theta from my sample? I have no information about normal distrubation.
I wrote something like this -
data<-read.table(file.choose(), header = TRUE, sep= "")
B <- 1000
sample.means <- numeric(data)
sample.sd <- numeric(data)
for (i in 1:B) {
MySample <- sample(data, length(data), replace = TRUE)
sample.means <- c(sample.means,mean(MySample))
sample.sd <- c(sample.sd,sd(MySample))
}
sd(sample.sd)
but it doesn't work..
This question incorporates multiple different ones, so let's tackle each step by step.
First, you will need to draw a random sample from your population (with replacement). Assuming your 100 population-observations sit in a vector named pop.
rs <- sample(pop, 1000, replace = True)
gives you your vector of random samples. If you wanna save it, you can write it to your disk in multiple formats, so I'll just suggest a few related questions (How to Export/Import Vectors in R?).
In a second step, you can use the mle()-function of the stats4-package (https://stat.ethz.ch/R-manual/R-devel/library/stats4/html/mle.html) and specify the objective function explicitly.
However, the second part of your question is more of a statistical/conceptual question than R related, IMO.
Try to understand what MLE actually does. You do not need normally distributed variables. The idea behind MLE is to choose theta in such a way, that under the resulting distribution the random sample is the most probable. Check https://en.wikipedia.org/wiki/Maximum_likelihood_estimation for more details or some youtube videos, if you'd like a more intuitive approach.
I assume, in the description of your task, it is stated that f(x|theta) is the conditional joint density function and that the observations x are iir?
What you wanna do in this case, is to select theta such that the squared difference between the observation x and the parameter theta is minimized.
For your statistical understanding, in such cases, it makes sense to perform log-linearization on the equation, instead of dealing with a non-linear function.
Minimizing the squared difference is equivalent to maximizing the log-transformed function since the sum is negative (<=> the product was in the denominator) and the log, as well as the +1 are solely linear transformations.
This leaves you with the maximization problem:
And the first-order condition:
Obviously, you would also have to check that you are actually dealing with a maximum via the second-order condition but I'll omit that at this stage for simplicity.
The algorithm in R does nothing else than solving this maximization problem.
Hope this helps for your understanding. Maybe some smarter people can give some additional input.

Xgboost - how to make a custom loss function which depends the value of another column, as well the error

I am having issue implementing recency-weighting for xgboost training in R (i.e. passing a weight vector to xgb.dmatrix) - although the weighting affects the learning curve readout for the training set, it does not appear to have any impact at all on the actual model produced - performance in the test set is identical.
I can't seem to get to the bottom of this issue or generate a reproducible example. So instead I would like to pass the Date column of the features to a custom loss function, something like:
custom_loss <- function(preds,dat) {
labels <- getinfo(dat,"label")
dates <- [a vector corresponding to the dates associated with each prediction]
grad = f(dates)*-2*(labels - preds)
hess = f(dates)*2
[where f is an increasing function of the value in dates, so later samples matter more when training]
return(list(grad=grad,hess=hess))
}
But I can't seem to figure out how to do this, any suggestions?

Low-pass fltering of a matrix

I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a google search, came up with a dazzling range of packages. Some apply to 1D signals (time series mostly, e.g. How do I run a high pass or low pass filter on data points in R? ); some apply to images. However I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go this way as they typically involve (i) installation of more or less complex/heavy solutions (imageMagick...), and/or (ii) conversion from matrix to image.
Here is sample data:
r<-seq(0:360)/360*(2*pi)
x<-cos(r)
y<-sin(r)
z<-outer(x,y,"*")
noise<-0.3*matrix(runif(length(x)*length(y)),nrow=length(x))
zz<-z+noise
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience on signal processing !
Thanks.
One option may be using a non-linear prediction method and getting the fitted values from the model.
For example by using a polynomial regression, we can predict the original data as the purple one,
By following the same logic, you can do the same thing to all columns of the zz matrix as,
predictions <- matrix(, nrow = 361, ncol = 0)
for(i in 1:ncol(zz)) {
pred <- as.matrix(fitted(lm(zz[,i]~poly(1:nrow(zz),2,raw=TRUE))))
predictions <- cbind(predictions,pred)
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that, I used a polynomial regression with degree 2, you can change the degree for a better fitting across the columns. Or maybe, you can use some other powerful non-linear prediction methods (maybe SVM, ANN etc.) to get a more accurate model.

convert a list -class numeric- into a distance structure in R

I have a list that looks like this, it is a measure of dispersion for each sample.
1 2 3 4 5
0.11829384 0.24987017 0.08082147 0.13355495 0.12933790
To further analyze this I need it to be a distance structure, the -vegan- package need it as a 'dist' object.
I found some solutions that applies to matrices > dist, but how could I change this current data into a dist object?
I am using the FD package, at the manual I found,
Still, one potential advantage of FDis over Rao’s Q is that in the unweighted case
(i.e. with presence-absence data), it opens possibilities for formal statistical tests for differences in
FD between two or more communities through a distance-based test for homogeneity of multivariate
dispersions (Anderson 2006); see betadisper for more details
I wanted to use vegan betadisper function to test if there are differences among different regions (I provided this using element "region" with column "region" too)
functional <- FD(trait, comun)
mod <- betadisper(functional$FDis, region$region)
using gowdis or fdisp from FD didn't work too.
distancias <- gowdis(rasgo)
mod <- betadisper(distancias, region$region)
dispersion <- fdisp(distancias, presence)
mod <- betadisper(dispersion, region$region)
I tried this but I need a list object. I thought I could pass those results to betadisper.
You cannot do this: FD::fdisp() does not return dissimilarities. It returns a list of three elements: the dispersions FDis for each sampling unit (SU), and the results of the eigen decomposition of input dissimilarities (eig for eigenvalues, vectors for orthonormal eigenvectors). The FDis values are summarized for each original SU, but there is no information on the differences among SUs. The eigen decomposition can be used to reconstruct the original input dissimilarities (your distancias from FD::gowdis()), but you can directly use the input dissimilarities. Function FD::gowdis() returns a regular "dist" structure that you can directly use in vegan::betadisper() if that gives you a meaningful analysis. For this, your grouping variable must be based on the same units as your distancias. In typical application of fdisp, the units are species (taxa), but it seems you want to get analysis for communities/sites/whatever. This will not be possible with these tools.

PLS in R: Extracting PRESS statistic values

I'm relatively new to R and am currently in the process of constructing a PLS model using the pls package. I have two independent datasets of equal size, the first is used here for calibrating the model. The dataset comprises of multiple response variables (y) and 101 explanatory variables (x), for 28 observations. The response variables, however, will each be included seperately in a PLS model. The code current looks as follows:
# load data
data <- read.table("....txt", header=TRUE)
data <- as.data.frame(data)
# define response variables (y)
HEIGHT <- as.numeric(unlist(data[2]))
FBM <- as.numeric(unlist(data[3]))
N <- as.numeric(unlist(data[4]))
C <- as.numeric(unlist(data[5]))
CHL <- as.numeric(unlist(data[6]))
# generate matrix containing the explanatory (x) variables only
spectra <-(data[8:ncol(data)])
# calibrate PLS model using LOO and 20 components
library(pls)
refl.pls <- plsr(N ~ as.matrix(spectra), ncomp=20, validation = "LOO", jackknife = TRUE)
# visualize RMSEP -vs- number of components
plot(RMSEP(refl.pls), legendpos = "topright")
# calculate explained variance for x & y variables
summary(refl.pls)
I have currently arrived at the point at which I need to decide, for each response variable, the optimal number of components to include in my PLS model. The RMSEP values already provide a decent indication. However, I would also like to base my decision on the PRESS (Predicted Residual Sum of Squares) statistic, in accordance various studies comparable to the one I am conducting. So in short, I would like to extract the PRESS statistic for each PLS model with n components.
I have browsed through the pls package documentation and across the web, but unfortunately have been unable to find an answer. If there is anyone out here that could help me get in the right direction that would be greatly appreciated!
You can find the PRESS values in the mvr object.
refl.pls$validation$PRESS
You can see this either by exploring the object directly with str or by perusing the documentation more thoroughly. You will notice if you look at ?mvr you will see the following:
validation if validation was requested, the results of the
cross-validation. See mvrCv for details.
Validation was indeed requested so we follow this to ?mvrCv where you will find:
PRESS a matrix of PRESS values for models with 1, ...,
ncomp components. Each row corresponds to one response variable.

Resources