h2o deep learning weights and normalization - r

I'm exploring h2o via the R interface and I'm getting a weird weight matrix. My task is as simple as they get: given x,y compute x+y.
I have 214 rows with 3 columns. The first column(x) was drawn uniformly from (-1000, 1000) and the second one(y) from (-100,100). I just want to combine them so I have a single hidden layer with a single neuron.
This is my code:
library(h2o)
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
train <- h2o.importFile(path = "/home/martin/projects/R NN Addition/addition.csv")
model <- h2o.deeplearning(1:2,3,train, hidden = c(1), epochs=200, export_weights_and_biases=T, nfolds=5)
print(h2o.weights(model,1))
print(h2o.weights(model,2))
and the result is
> print(h2o.weights(model,1))
x y
1 0.5586579 0.05518193
[1 row x 2 columns]
> print(h2o.weights(model,2))
C1
1 1.802469
For some reason the weight value for y is 0.055 - 10 times lower than for x. So, in the end the neural net would compute x+y/10. However, h2o.predict actually returns the correct values (even on a test set).
I'm guessing there's a preprocessing step that's somehow scaling my data. Is there any way I can reproduce the actual weights produced by the model? I would like to be able to visualize some pretty simple neural networks.

Neural networks perform best if all the input features have mean 0 and standard deviation 1. If the features have very different standard deviations, neural networks perform very poorly. Because of that h20 does this normalization for you. In other words, before even training your net it computes mean and standard deviation of all the features you have, and replaces the original values with (x - mean) / stddev. In your case the stddev for the second feature is 10x smaller than for the first, so after the normalization the values end up being 10x more important in terms of how much they contribute to the sum, and the weights heading to the hidden neuron need to cancel it out. That's why the weight for the second feature is 10x smaller.

Related

R: How to check which model of an ensemble algorithm has been selected to perform regression?

I am using the R package machisplin (it's not on CRAN) to downscale a satellite image. According to the description of the package:
The machisplin.mltps function simultaneously evaluates different combinations of the six algorithms to predict the input data. During model tuning, each algorithm is systematically weighted from 0-1 and the fit of the ensembled model is evaluated. The best performing model is determined through k-fold cross validation (k=10) and the model that has the lowest residual sum of squares of test data is chosen. After determining the best model algorithms and weights, a final model is created using the full training dataset.
My question is how can I check which model out of the 6 has been selected for the downscaling? To put it differently, when I export the downscaled image, I would like to know which algorithm (out of the 6) has been used to perform the downscaling.
Here is the code:
library(MACHISPLIN)
library(raster)
library(gbm)
evi = raster("path/evi.tif") # covariate
ntl = raster("path/ntl_1600.tif") # raster to be downscaled
##convert one of the rasters to a point dataframe to sample. Use any raster input.
ntl.points<-rasterToPoints(ntl,
fun = NULL,
spatial = FALSE)
##subset only the x and y data
ntl.points<- ntl.points[,1:2]
##Extract values to points from rasters
RAST_VAL<-data.frame(extract(ntl, ntl.points))
##merge sampled data to input
InInterp<-cbind(ntl.points, RAST_VAL)
#run an ensemble machine learning thin plate spline
interp.rast<-machisplin.mltps(int.values = InInterp,
covar.ras = evi,
smooth.outputs.only = T,
tps = T,
n.cores = 4)
#set negative values to 0
interp.rast[[1]]$final[interp.rast[[1]]$final <= 0] <- 0
writeRaster(interp.rast[[1]]$final,
filename = "path/ntl_splines.tif")
I vied all the output parameters (please refer to Example 2 in the package description) but I couldn't find anything relevant to my question.
I have posted a question on GitHub as well. From here you can download my images.
I think this is a misunderstanding; mahcisplin, isnt testing 6 and gives one. it's trying many ensembles of 6 and its giving one ensemble... or in other words
that its the best 'combination of 6 algorithms' that I will get, and not one of 6 algo's chosen.
It will get something like "a model which is 20% algo1 , 10% algo2 etc. "and not "algo1 is the best and chosen"

Cross-Validation in R using vtreat Package

Currently learning about cross validation through a course on DataCamp. They start the process by creating an n-fold cross validation plan. This is done with the kWayCrossValidation() function from the vtreat package. They call it as follows:
splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
Then, they suggest running a for loop as follows:
dframe$pred.cv <- 0
# k is the number of folds
# splitPlan is the cross validation plan
for(i in 1:k) {
# Get the ith split
split <- splitPlan[[i]]
# Build a model on the training data
# from this split
# (lm, in this case)
model <- lm(fmla, data = dframe[split$train,])
# make predictions on the
# application data from this split
dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}
This results in a new column in the datafram with the predictions, per the last line of the above chunk of code.
My doubt is thus whether the predicted values on the data frame will be in fact averages of the 3 folds or if they will just be those of the 3rd run of the for loop?
Am I missing a detail here, or is this exactly what this code is doing, which would then defeat the purpose of the 3-fold cross validation or any-fold cross validation for that matter, as it will simply output the results of the last iteration? Shouldn't we be looking to output the average of all the folds, as laid out in the splitPlan?
Thank you.
I see there is confusion about the scope of K-fold cross-validation. The idea is not to average predictions over different folds, rather to average some measure of the prediction error, so to estimate test errors.
First of all, as you are new on SO, notice that you should always provide some data to work with. As in this case your question is not data-contingent, I just simulated some. Still, it is a good practice helping us helping you.
Check the following code, which slightly modifies what you have provided in the post:
library(vtreat)
# Simulating data.
set.seed(1986)
X = matrix(rnorm(2000, 0, 1), nrow = 1000, ncol = 2)
epsilon = matrix(rnorm(1000, 0, 0.01), nrow = 1000)
y = X[, 1] + X[, 2] + epsilon
dta = data.frame(X, y, pred.cv = NA)
# Folds.
nRows = dim(dta)[1]
nSplits = 3
splitPlan = kWayCrossValidation(nRows, nSplits)
# Fitting model on all folds but i-th.
for(i in 1:nSplits)
{
# Get the i-th split.
split = splitPlan[[i]]
# Build a model on the training data from this split.
model = lm(y ~ ., data = dta[split$train, -4])
# Make predictions on the application data from this split.
dta$pred.cv[split$app] = predict(model, newdata = dta[split$app, -4])
}
# Now compute an estimate of the test error using pred.cv.
mean((dta$y - dta$pred.cv)^2)
What the for loop does, is to fit a linear model on all folds but the i-th (i.e., on dta[split$train, -4]), and then it uses the fitted function to make predictions on the i-th fold (i.e., dta[split$app, -4]). At least, I am assuming that split$train and split$app serve such roles, as the documentation is really lacking (which usually is a bad sign). Notice I am revoming the 4-th column (dta$pred.cv) as it just pre-allocates memory in order to store all the predictions (it is not a feature!).
At each iteration, we are not filling the whole dta$pred.cv, but only a subset of that (corresponding to the rows of the i-th fold, stored each time in split$app). Thus, at the end that column just stores predictions from the K iteration.
The real rationale for cross-validation jumps in here. Let me introduce the concepts of training, validation, and test set. In data analysis, the ideal is to have such a huge data set so that we can divide it in three subsamples. The first one could then be used to train the algorithms (fitting models), the second to validate the models (tuning the models), the third to choose the best model in terms on some perfomance measure (usually mean-squared-error for regression, or MSE).
However, we often do not have all these data points (especially if you are an economist). Thus, we seek an estimator for the test MSE, so that the need for splitting data disappears. This is what K-fold cross-validation does: at once, each fold is treated as the test set, and the union of all the others as the training set. Then, we make predictions as in your code (in the loop), and save them. What you miss is the last line in the code I provided: the average of the MSE across folds. That provides us with as estimate of the test MSE, where we choose the model yielding the lowest value.
That being said, I never heard before of the vtreat package. If you are into data analysis, I suggest to have a look at the tidiyverse and the caret packages. As far as I know (and I see here on SO), they are widely used and super-well documented. May be worth learning them.

Find the nearest neighbor using caret

I'm fitting a k-nearest neighbor model using R's caret package.
library(caret)
set.seed(0)
y = rnorm(20, 100, 15)
predictors = matrix(rnorm(80, 10, 5), ncol=4)
data = data.frame(cbind(y, predictors))
colnames(data)=c('Price', 'Distance', 'Cost', 'Tax', 'Transport')
I left one observation as the test data and fit the model using the training data.
id = sample(nrow(data)-1)
train = data[id, ]
test = data[-id,]
knn.model = train(Price~., method='knn', train)
predict(knn.model, test)
When I display knn.model, it tells me it uses k=9. I would love to know which 9 observations are actually the "nearest" to the test observation. Besides manually calculating the distances, is there an easier way to display the nearest neighbors?
Thanks!
When you are using knn you are creating clusters with points that are near based on independent variables. Normally, this is done using train(Price~., method='knn', train), such that the model chooses the best prediction based on some criteria (taking into account also the dependent variable as well). Given the fact I have not checked whether the R object stores the predicted price for each of the trained values, I just used the model trained to predicte the expected price given the model (where the expected price is located in the space).
At the end, the dependent variable is just a representation of all the other variables in a common space, where the price associated is assumed to be similar since you cluster based on proximity.
As a summary of steps, you need to calculate the following:
Get the distance for each of the training data points. This is done through predicting over them.
Calculate the distance between the trained data and your observation of interest (in absolut value, since you do not care about the sign but just about the absolut distances).
Take the indexes of the N smaller ones(e.g.N= 9). you can get the observations and related to this lower distances.
TestPred<-predict(knn.model, newdata = test)
TrainPred<-predict(knn.model, train)
Nearest9neighbors<-order(abs(TestPred-TrainPred))[1:9]
train[Nearest9neighbors,]
Price Distance Cost Tax Transport
15 95.51177 13.633754 9.725613 13.320678 12.981295
7 86.07149 15.428847 2.181090 2.874508 14.984934
19 106.53525 16.191521 -1.119501 5.439658 11.145098
2 95.10650 11.886978 12.803730 9.944773 16.270416
4 119.08644 14.020948 5.839784 9.420873 8.902422
9 99.91349 3.577003 14.160236 11.242063 16.280094
18 86.62118 7.852434 9.136882 9.411232 17.279942
11 111.45390 8.821467 11.330687 10.095782 16.496562
17 103.78335 14.960802 13.091216 10.718857 8.589131

Calculating p-value from pseudo-F in R

I'm working with a very large dataset with 132,019 observations of 18 variables. I've used the clusterSim package to calculate the pseudo-F statistic on clusters created using Kohonen SOMs. I'm trying to assess the various cluster sizes (e.g., 4, 6, 9 clusters) with p-values, but I'm getting weird results and I'm not statistically savvy enough to know what's going on.
I use the following code to get the pseudo-F.
library(clusterSim)
psF6 <- index.G1(yelpInfScale, cl = som.6$unit.classif)
psF6
[1] 48783.4
Then I use the following code to get the p-value. When I do lower.tail = T I get a 1 and when I do lower.tail = F I get a 0.
k6 = 6
pf(q = psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = FALSE)
[1] 0
I guess I was expecting not a round number, so I'm confused about how to interpret the results. I get the exact same results regardless of which cluster size I evaluate. I read something somewhere about reversing df1 and df2 in the calculation, but that seems weird. Also, the reference text I'm using (Larose's "Data Mining and Predictive Analytics") uses this to evaluate k-means clusters, so I'm wondering if the problem is that I'm using Kohonen clusters.
I'd check your data, but its not impossible to get p value as either 0 or 1. In your case, assuming you have got your data right, it indicates that you're data is heavily skewed and the clusters you have created are ideal fit. So when you're doing lower.tail = FALSE, the p-value of zero indicates that you're sample is classified with 100% accuracy and there is no chance of an error. The lower.tail = TRUE gives 1 indicates that you clusters very close to each other. In other words, your observations are clustered well away from each other to have a 0 on two tailed test but the centre points of clusters are close enough to give a p value of 1 in one tailed test. If I were you I'd try 'K-means with splitting' variant with different distance parameter 'w' to see how the data fits. IF for some 'w' it fits with very low p values for clusters, I don't think a model as complex as SOM is really necessary.

Extra weight values in neural network (nnet for R package)

I'm attempting to reverse engineer in Excel how the nnet package works using some simple input data. Here's the steps I've taken
Import dummy data:test <- read.csv('dataScaled.csv',header=TRUE,sep = ",")
Train the network:
anntrain <- nnet(Price ~ Sqft + Bedrooms + Bathrooms,test[1:650,],size=2, maxit=5000,linout=TRUE)
Grab the weights of the ANN:
anntrain$wts
This outputs:
[1] -2.12443010 6.68900321 0.85338018 -0.73329823 -3.95336239 7.91917321
[7] -5.38893137 4.05941771 -0.02062346 0.26584364 0.32881035
Grab fitted values of trained network:
anntrain$fitted.values
This outputs what I believe to be the scaled Price predictions of the trained network for each of the 650 transactions I trained it on above.
Prove out the fitted values by recalculating using the above weights using the sigmoid function.
My confusion is that it outputs 11 weight values. If I only have 3 inputs, 2 hidden nodes and 1 output, shouldn't that equate to only 8 weights? What are the 3 extra weights for?
There is a bias in every layer (Why use a bias/threshold?). A bias is like a node that always gives you the input 1. Thus you have (3+1)*2+(2+1)*1 = 11 weights.

Resources