High error in neuralnet - r

I am trying to train a neural network using neuralnet in R, but am getting very high error terms (in the region of 1850). The input variables are responses to a set of 6 likert scales (all on a 1-7) and the output is whether there was a topbox response to another variable (coded 0,1). The input variables have been scaled to a 0,1 range (I've also tried normalizing to a mean of 0). I've tried a range of hidden nodes (from 1-10) and the network is converging to a threshold of 0.1 fairly reliably in 200000-400000 iterations, but with a consistent error term around 1800-1900. There are 30,000 cases in total, about 22000 in the training set. I appreciate that this type of problem doesn't need a neural network necessarily - this is proof of concept (going excellently...) on a familiar dataset before application to other questions. Any suggestions on how to reduce the error/improve the training net appreciated.
As I said, I have tried both normalising and scaling, and now also using the pca preprocessing provided in caret. Must be something in the data, but at a bit of a loss...
Code:
maxs <-apply(final, 2, max)
mins <-apply(final, 2, min)
scaled <-as.data.frame(scale(final, center=mins, scale=maxs-mins))
index<-sample(1:nrow(scaled), round(0.75*nrow(scaled)))
train_ <-scaled[index,]
test_ <-scaled[-index,]
nn<-neuralnet(Q11~A1+B1+C1+D1+E1+Q12, data=train_, hidden=5, rep=1,threshold=0.01, stepmax=6e+05, linear.output=F, lifesign='full')

Related

VAR Model in R, Error in Solve.default(sigma)

I'm currently trying to fit a VAR model with 6 variables from an XTS time series set. I have over 800 observations as well. The code I'm trying to run is
estim <- VAR(MinuteSeries, p = AIC , type = "both")
summary(estim)
The value AIC is the AIC value retrieved from the lag-select function. When I pass the summary statement I am given the error:
Error in solve.default(Sigma) :
system is computationally singular: reciprocal condition number = 5.61898e-17
I have read online that this can be due to have a larger amount of coefficients in the model than observations in the data, however I have over 800 observations in the data and still getting this issue with just 6 variables. Is the size the issue still for my model or am I missing something more important?
I had the very same issue with seemingly non-problematic data (60 observations with 4 variable TS). So, I read online that one guy advised the following:
"It isn't just the high correlation of your variables, but also their scaling with respect to the response and/or the spatial coefficient. Using a different method= (say "LU") and using a power trace vector trs= may get you there too, but re-scaling the variable will also re-scale its square. The same problem affects the STSLS - re-scale the variable. If these are say in Euro, use thousand, million or whatever Euro instead, for example."
It helped when I transformed GDP from $ to billions $.

How can I achieve hierarchical clustering with p-values for a large dataset?

I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then I ran pvclust on the data selecting Jacquard distances (“binary” in R) as a distance measure (suitable for species pres/abs data) and Ward’s method (“ward.D2”). I used “parallel = TRUE” to reduce computation time. However, using a default of nboots= 1000, my computer was not able to finish the computation in hours and finally I got ann error, so I tried with lower nboots (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
tdata <- t(data)
cluster <- pvclust(tdata, method.hclust="ward.D2", method.dist="binary",
nboot=100, parallel=TRUE)
plot(cluster, labels=FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher ramifications of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions would be
is there anything I got wrong in the pvclust function itself?
may my low nboots (due to “weak” computer) be a reason for the non-significance of my results?
are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
.............
I have tried to run the same code on a subset of 500 records with nboots = 1000. This worked in a reasonable computation time, but the output is still not very satisfying - see dendrogram2 .dendrogram obtained for a SUBSET of 500 records and nboots=1000

Spark ML Logistic Regression with Categorical Features Returns Incorrect Model

I've been doing a head-to-head comparison of Spark 1.6.2 ML's LogisticRegression with R's glmnet package (the closest analog I could find based on other forum posts).
I'm specifically looking at these two fitting packages when using categorical features. When using continuous features, results for the two packages are comparable.
For my first attempt with Spark, I used the ML Pipeline API to transform my single 21-level categorical variable (called FAC for faculty) with StringIndexer followed by OneHotEncoder to get a binary vector representation.
When I fit my models in Spark and R, I get the following sets of results (that aren't even close):
SPARK 1.6.2 ML
lrModel.intercept
-3.1453838659926427
lrModel.weights
[0.37664264958084287,0.697784342445422,0.4269429071484017,0.3521764371898419,0.19233585960734872,0.6708049751689226,0.49342372792676115,0.5471489576300356,0.37650628365008465,1.0447861554914701,0.5371820187662734,0.4556833133252492,0.2873530144304645,0.09916227313130375,0.1378469333986134,0.20412095883272838,0.4494641670133712,0.4499625784826652,0.489912016708041,0.5433020878341336]
R (glmnet)
(Intercept) -2.79255253
facG -0.35292166
facU -0.16058275
facN 0.69187146
facY -0.06555711
facA 0.09655696
facI 0.02374558
facK -0.25373146
facX 0.31791765
facM 0.14054251
facC 0.02362977
facT 0.07407357
facP 0.09709607
facE 0.10282076
facH -0.21501281
facQ 0.19044412
facW 0.18432837
facF 0.34494177
facO 0.13707197
facV -0.14871580
facS 0.19431703
I've manually checked the glmnet results and they're correct (calculating the proportion of training samples with a particular level of the categorical feature and comparing that to the softmax prob. under the estimated model). These results do not change even when the max. no. of iterations in the optimization is set to 1000000 and the convergence tolerance is set to 1E-15. These results also do not change when the Spark LogisticRegression weights are initialized to the glmnet-estimated weights (Spark's optimizing a different cost function?).
I should say that the optimization problem is not different between these two approaches. You should be minimizing logistic loss (a convex surface) and thereby arriving at nearly the exact same answer).
Now, when I manually recode the FAC feature as a binary vector in the data file and read those binary columns as "DoubleType" (using Spark's DataFrame schema), I get very comparable results. (The order of the coefficients for the following results is different from the above results. Also the reference levels are different--"B" in the first case, "A" in the second--so the coefficients for this test should not match those from the above test.)
SPARK 1.6.2 ML
lrModel.intercept
-2.9530485080391378
lrModel.weights
[-0.19233467682265934,0.8524505857034615,0.09501714540028124,0.25712829253044844,0.18430675058702053,0.09317325898819705,0.4784688407322236,0.3010877381053835,0.18417033887042242,0.2346069926274015,0.2576267066227656,0.2633474197307803,0.05448893119304087,0.35096612444193326,0.3448460751810199,0.505448794876487,0.29757609104571175,0.011785058030487976,0.3548130904832268,0.15984047288368383]
R (glmnet)
s0
(Intercept) -2.9419468179
FAC_B -0.2045928975
FAC_C 0.8402716731
FAC_E 0.0828962518
FAC_F 0.2450427806
FAC_G 0.1723424956
FAC_H -0.1051037449
FAC_I 0.4666239456
FAC_K 0.2893153021
FAC_M 0.1724536240
FAC_N 0.2229762780
FAC_O 0.2460295934
FAC_P 0.2517981380
FAC_Q -0.0660069035
FAC_S 0.3394729194
FAC_T 0.3334048723
FAC_U 0.4941379563
FAC_V 0.2863010635
FAC_W 0.0005482422
FAC_X 0.3436361348
FAC_Y 0.1487405173
Standardization is set to FALSE for both and no regularization is performed (you shouldn't perform it here since you're really just learning the incidence rate of each level of the feature and the binary feature columns are completely uncorrelated from one another). Also, I should mention that the 21 levels of the categorical feature range in incidence counts from ~800 to ~3500 (so this is not due to lack of data; large error in estimates).
Anyone experience this? I'm one step away from reporting this to the Spark guys.
Thanks as always for the help.

Preprocess data in R

Im using R to create logistic regression classifier model.
Here is the code sample:
library(ROCR)
DATA_SET <- read.csv('E:/1.csv')
classOneCount= 4000
classZeroCount = 4000
sample.churn <- sample(which(DATA_SET$Class==1),classOneCount)
sample.nochurn <- sample(which(DATA_SET$Class==0),classZeroCount )
train.set <- DATA_SET[c(sample.churn,sample.nochurn),]
test.set <- DATA_SET[c(-sample.churn,-sample.nochurn),]
full.logit <- glm(Class~., data = train.set, family = binomial)
And it works fine, but I would like to preprocess the data to see if it improves classification model.
What I would like to do would be to divide input vector variables which are continuoes into intervals. Lets say that one variable is height in centimeters in float.
Sample values of height:
183.23
173.43
163.53
153.63
193.27
and so on, and I would like to split it into lets say 3 different intervals: small, medium, large.
And do it with all variables from my set - there are 32 variables.
What's more I would like to see at the end correlation between value of the variables (this intervals) and classification result class.
Is this clear?
Thank you very much in advance
The classification model creates some decision boundary and existing algorithms are rather good at estimating it. Let's assume that you have one variable - height - and linear decision boundary. Your algorithm can then decide between what values put decision boundary by estimating error on training set. If you perform quantization and create few intervals your algorithm have fewer places to put boundary(data loss). It will likely perform worse on such cropped dataset than on original one. It could help if your learning algorithm is suffering from high variance (is overfitting data) but then you could also try getting more training examples, use smaller set (subset) of features or use algorithm with regularization and increase regularization parameter
There are also many questions about how to choose number of intervals and how to divide data into them like: should all intervals be equally frequent or of equal width or most similar to each other inside each interval?
If you want just to experiment use some software like f.e. free version of RapidMiner Studio (it can read CSV and Excel files and have some quick quantization options) to convert your data

setting values for ntree and mtry for random forest regression model

I'm using R package randomForest to do a regression on some biological data. My training data size is 38772 X 201.
I just wondered---what would be a good value for the number of trees ntree and the number of variable per level mtry? Is there an approximate formula to find such parameter values?
Each row in my input data is a 200 character representing the amino acid sequence, and I want to build a regression model to use such sequence in order to predict the distances between the proteins.
The default for mtry is quite sensible so there is not really a need to muck with it. There is a function tuneRF for optimizing this parameter. However, be aware that it may cause bias.
There is no optimization for the number of bootstrap replicates. I often start with ntree=501 and then plot the random forest object. This will show you the error convergence based on the OOB error. You want enough trees to stabilize the error but not so many that you over correlate the ensemble, which leads to overfit.
Here is the caveat: variable interactions stabilize at a slower rate than error so, if you have a large number of independent variables you need more replicates. I would keep the ntree an odd number so ties can be broken.
For the dimensions of you problem I would start ntree=1501. I would also recommended looking onto one of the published variable selection approaches to reduce the number of your independent variables.
The short answer is no.
The randomForest function of course has default values for both ntree and mtry. The default for mtry is often (but not always) sensible, while generally people will want to increase ntree from it's default of 500 quite a bit.
The "correct" value for ntree generally isn't much of a concern, as it will be quite apparent with a little tinkering that the predictions from the model won't change much after a certain number of trees.
You can spend (read: waste) a lot of time tinkering with things like mtry (and sampsize and maxnodes and nodesize etc.), probably to some benefit, but in my experience not a lot. However, every data set will be different. Sometimes you may see a big difference, sometimes none at all.
The caret package has a very general function train that allows you to do a simple grid search over parameter values like mtry for a wide variety of models. My only caution would be that doing this with fairly large data sets is likely to get time consuming fairly quickly, so watch out for that.
Also, somehow I forgot that the ranfomForest package itself has a tuneRF function that is specifically for searching for the "optimal" value for mtry.
Could this paper help ?
Limiting the Number of Trees in Random Forests
Abstract. The aim of this paper is to propose a simple procedure that
a priori determines a minimum number of classifiers to combine in order
to obtain a prediction accuracy level similar to the one obtained with the
combination of larger ensembles. The procedure is based on the McNemar
non-parametric test of significance. Knowing a priori the minimum
size of the classifier ensemble giving the best prediction accuracy, constitutes
a gain for time and memory costs especially for huge data bases
and real-time applications. Here we applied this procedure to four multiple
classifier systems with C4.5 decision tree (Breiman’s Bagging, Ho’s
Random subspaces, their combination we labeled ‘Bagfs’, and Breiman’s
Random forests) and five large benchmark data bases. It is worth noticing
that the proposed procedure may easily be extended to other base
learning algorithms than a decision tree as well. The experimental results
showed that it is possible to limit significantly the number of trees. We
also showed that the minimum number of trees required for obtaining
the best prediction accuracy may vary from one classifier combination
method to another
They never use more than 200 trees.
One nice trick that I use is to initially start with first taking square root of the number of predictors and plug that value for "mtry". It is usually around the same value that tunerf funtion in random forest would pick.
I use the code below to check for accuracy as I play around with ntree and mtry (change the parameters):
results_df <- data.frame(matrix(ncol = 8))
colnames(results_df)[1]="No. of trees"
colnames(results_df)[2]="No. of variables"
colnames(results_df)[3]="Dev_AUC"
colnames(results_df)[4]="Dev_Hit_rate"
colnames(results_df)[5]="Dev_Coverage_rate"
colnames(results_df)[6]="Val_AUC"
colnames(results_df)[7]="Val_Hit_rate"
colnames(results_df)[8]="Val_Coverage_rate"
trees = c(50,100,150,250)
variables = c(8,10,15,20)
for(i in 1:length(trees))
{
ntree = trees[i]
for(j in 1:length(variables))
{
mtry = variables[j]
rf<-randomForest(x,y,ntree=ntree,mtry=mtry)
pred<-as.data.frame(predict(rf,type="class"))
class_rf<-cbind(dev$Target,pred)
colnames(class_rf)[1]<-"actual_values"
colnames(class_rf)[2]<-"predicted_values"
dev_hit_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, predicted_values ==1))
dev_coverage_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, actual_values ==1))
pred_prob<-as.data.frame(predict(rf,type="prob"))
prob_rf<-cbind(dev$Target,pred_prob)
colnames(prob_rf)[1]<-"target"
colnames(prob_rf)[2]<-"prob_0"
colnames(prob_rf)[3]<-"prob_1"
pred<-prediction(prob_rf$prob_1,prob_rf$target)
auc <- performance(pred,"auc")
dev_auc<-as.numeric(auc#y.values)
pred<-as.data.frame(predict(rf,val,type="class"))
class_rf<-cbind(val$Target,pred)
colnames(class_rf)[1]<-"actual_values"
colnames(class_rf)[2]<-"predicted_values"
val_hit_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, predicted_values ==1))
val_coverage_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, actual_values ==1))
pred_prob<-as.data.frame(predict(rf,val,type="prob"))
prob_rf<-cbind(val$Target,pred_prob)
colnames(prob_rf)[1]<-"target"
colnames(prob_rf)[2]<-"prob_0"
colnames(prob_rf)[3]<-"prob_1"
pred<-prediction(prob_rf$prob_1,prob_rf$target)
auc <- performance(pred,"auc")
val_auc<-as.numeric(auc#y.values)
results_df = rbind(results_df,c(ntree,mtry,dev_auc,dev_hit_rate,dev_coverage_rate,val_auc,val_hit_rate,val_coverage_rate))
}
}

Resources