Interpreting Independent Components using FastICA in R

I've recently conducted an Independent Component Analysis using fastICA in R. I have obtained a matrix of independent components. How do I figure out what variables these components are made of? For example, component #4 (v4) has a value for each of the n = 400 observations. What is this? Which variables were used to create this v4?
This is the code that was used:
ica_new <- fastICA(final, n.comp = 40, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200, tol = 0.0001, verbose = TRUE)
I found this code for PCA:
## varimax with normalize = TRUE is the default
fa <- factanal( ~., 2, data = swiss)
varimax(loadings(fa), normalize = FALSE)
promax(loadings(fa))
EDIT: Thanks to @Hack-R, I think the code I need would look something like this:
ica_new <- fastICA(final, n.comp = 40, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200, tol = 0.0001, verbose = TRUE, firstEig = 1, lastEig = nrow(final))
Is this accurate?
EDIT: It doesn't run.
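For reading a component back in terms of the original variables, the piece to look at is probably the mixing matrix rather than the component scores. As far as I know, the list returned by fastICA() includes S (the component scores, one column per component) and A (the estimated mixing matrix, one row per component and one column per original variable), and firstEig / lastEig are not arguments of the R function, which may be why the edited call fails. The sketch below uses base R only, with made-up S and A standing in for ica_new$S and ica_new$A; the variable names are hypothetical.

```r
# Toy stand-ins for ica_new$S (n x n.comp) and ica_new$A (n.comp x p).
set.seed(1)
n <- 400; n_comp <- 5; p <- 8
S <- matrix(runif(n * n_comp, -1, 1), nrow = n)   # component scores
A <- matrix(rnorm(n_comp * p), nrow = n_comp)     # mixing matrix
colnames(A) <- paste0("var", seq_len(p))          # hypothetical variable names
rownames(A) <- paste0("v", seq_len(n_comp))       # hypothetical component names

# Each (centred) variable is a weighted sum of the components: X = S %*% A.
X <- S %*% A

# Row 4 of A holds the weight of component v4 in every original variable.
v4_weights <- A["v4", ]

# Variables ranked by how strongly v4 contributes to them.
ranked <- sort(abs(v4_weights), decreasing = TRUE)
print(round(ranked, 2))
top_vars <- names(ranked)[1:3]
```

With real output you would likely label the columns of ica_new$A yourself (for example from colnames(final)) before ranking, since every variable contributes to every component and only the size of the weights differs.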

Related

Error when running multiscale GWR: Error in gw_weight_vec: Not compatible with requested type: [type=NULL; target=double]

I am trying to run multiscale geographically weighted regression (MGWR) using the GWmodel package in R. When running the function gwr.multiscale this error is shown:
Error in gw_weight_vec(vdist, bw, kernel, adaptive): Not compatible
with requested type: [type=NULL; target=double].
An example:
library(GWmodel)
data(LondonHP)
dist <- gw.dist(coordinates(londonhp))
ab_gwr <- gwr.multiscale(PURCHASE ~ FLOORSZ + PROF,
data = londonhp,
criterion = "dCVR",
kernel = "gaussian",
adaptive = FALSE,
var.dMat.indx = 2,
bws0 = c(100,
100,
100),
bw.seled = rep(T, 3),
dMats = list(dist,
dist,
dist),
parallel.method = "omp",
parallel.arg = "omp")
I have tried other parameters as well, like adaptive bandwidth, to include fewer covariates, to change the bws0 parameter etc etc. Other kinds of errors occur depending on what I have tried.
I am following the example from the package's PDF.
The parameter var.dMat.indx specifies which distance matrix in dMats is used for each variable, and I had set it incorrectly in my code. Giving it one index per entry in dMats fixes the error. The solution:
library(GWmodel)
data(LondonHP)
dist <- gw.dist(coordinates(londonhp))
ab_gwr <- gwr.multiscale(PURCHASE ~ FLOORSZ + PROF,
data = londonhp,
criterion = "dCVR",
kernel = "gaussian",
adaptive = FALSE,
var.dMat.indx = 1:3,
bws0 = c(100,
100,
100),
bw.seled = rep(TRUE, 3),
dMats = list(dist,
dist,
dist),
parallel.method = "omp",
parallel.arg = "omp")

Can't pass xgb.DMatrix to caret

I am trying to tune the hyperparameters of xgboost for a classification problem using the caret library. Because my data set contains a lot of factors and xgboost wants numeric data, I created dummy columns using feature hashing. But when I run caret's train, I get an error.
#Using Feature hashing to convert all the factor variables to dummies
objTrain_hashed = hashed.model.matrix(~., data=train1[,-27], hash.size=2^15, transpose=FALSE)
#created a dense matrix which is normally accepted by xgboost method in R
#Hoping I could pass it caret as well
dmodel <- xgb.DMatrix(objTrain_hashed[, ], label = train1$Walc)
xgb_grid_1 = expand.grid(
nrounds = 500,
max_depth = c(5, 10, 15),
eta = c(0.01, 0.001, 0.0001),
gamma = c(1, 2, 3),
colsample_bytree = c(0.4, 0.7, 1.0),
min_child_weight = c(0.5, 1, 1.5)
)
xgb_trcontrol_1 = trainControl(
method = "cv",
number = 3,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = "all", # save losses across all models
classProbs = TRUE, # set to TRUE for AUC to be computed
summaryFunction = twoClassSummary,
allowParallel = TRUE
)
xgb_train1 <- train(Walc ~.,dmodel,method = 'xgbTree',trControl = xgb_trcontrol_1,
metric = 'accuracy',tunegrid = xgb_grid_1)
I am getting the following error:
Error in as.data.frame.default(data) :
cannot coerce class "xgb.DMatrix" to a data.frame
Any suggestions on how I can proceed?
This is because you are passing dmodel to train() in the last part of your code. Try passing objTrain_hashed, which is a matrix, rather than an xgb.DMatrix.
How about using sparse.model.matrix() instead of hashed.model.matrix()? It works on my PC. Don't convert the result to an xgb.DMatrix; pass the sparse.model.matrix() output to train() as it is, like this:
model_data <- sparse.model.matrix(Y~., raw_data)
and
xgb_train1 <- train(Y ~.,model_data, <bla bla> ...)
Hope it works.
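Both answers come down to handing train() an ordinary (or sparse) numeric matrix and letting caret build the xgb.DMatrix internally. Below is a minimal base R sketch of the dummy-coding step, assuming a toy data frame toy_data (hypothetical) with one factor column; model.matrix() is the dense base R counterpart of sparse.model.matrix().

```r
# Hypothetical toy data: a factor predictor, a numeric predictor, a two-class outcome.
toy_data <- data.frame(
  school = factor(c("GP", "MS", "GP", "MS", "GP", "MS")),
  age    = c(15, 16, 17, 15, 18, 16),
  Walc   = factor(c("low", "high", "low", "high", "low", "low"))
)

# Expand factors into 0/1 dummy columns, then drop the intercept column.
x <- model.matrix(Walc ~ ., data = toy_data)[, -1, drop = FALSE]
y <- toy_data$Walc

print(colnames(x))
# x and y are the kind of objects that train(x = x, y = y, method = "xgbTree", ...)
# accepts, in place of an xgb.DMatrix.
```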

xgboost in R: how does xgb.cv pass the optimal parameters into xgb.train

I've been exploring the xgboost package in R and went through several demos as well as tutorials, but this still confuses me: after using xgb.cv to do cross-validation, how do the optimal parameters get passed to xgb.train? Or should I work out the ideal parameters (such as nround, max.depth) myself based on the output of xgb.cv?
param <- list("objective" = "multi:softprob",
"eval_metric" = "mlogloss",
"num_class" = 12)
cv.nround <- 11
cv.nfold <- 5
mdcv <-xgb.cv(data=dtrain,params = param,nthread=6,nfold = cv.nfold,nrounds = cv.nround,verbose = T)
md <-xgb.train(data=dtrain,params = param,nround = 80,watchlist = list(train=dtrain,test=dtest),nthread=6)
It looks like you misunderstood xgb.cv: it is not a parameter-searching function. It performs k-fold cross-validation, nothing more.
In your code, it does not change the value of param.
There are several ways to find the best parameters for XGBoost in R. Here are two:
(1) Use the mlr package, http://mlr-org.github.io/mlr-tutorial/release/html/
There is example XGBoost + mlr code in Kaggle's Prudential challenge,
but that code is for regression, not classification. As far as I know, the mlr package does not yet have an mlogloss metric, so you would have to code the mlogloss measure yourself. Correct me if I'm wrong.
(2) Set the parameters manually, run the cross-validation, and repeat. For example:
param <- list(objective = "multi:softprob",
eval_metric = "mlogloss",
num_class = 12,
max_depth = 8,
eta = 0.05,
gamma = 0.01,
subsample = 0.9,
colsample_bytree = 0.8,
min_child_weight = 4,
max_delta_step = 1
)
cv.nround = 1000
cv.nfold = 5
mdcv <- xgb.cv(data=dtrain, params = param, nthread=6,
nfold=cv.nfold, nrounds=cv.nround,
verbose = T)
Then find the best (minimum) mlogloss:
min_logloss = min(mdcv[, test.mlogloss.mean])
min_logloss_index = which.min(mdcv[, test.mlogloss.mean])
min_logloss is the minimum value of mlogloss, and min_logloss_index is the round at which it occurs.
Repeat the process above several times, changing the parameters manually each time (mlr does the repetition for you), until you arrive at the lowest min_logloss overall.
Note: You can do this in a loop of 100 or 200 iterations, setting the parameter values randomly in each iteration. In that case, save the best [parameters_list, min_logloss, min_logloss_index] in variables or in a file.
Note: It is better to set the random seed with set.seed() for reproducible results, since different seeds yield different results. So save [parameters_list, min_logloss, min_logloss_index, seednumber] in variables or in a file.
Say you end up with 3 results from 3 iterations/repeats:
min_logloss = 2.1457, min_logloss_index = 840
min_logloss = 2.2293, min_logloss_index = 920
min_logloss = 1.9745, min_logloss_index = 780
Then use the third parameter set (it has the lowest min_logloss, 1.9745). Your best index (nrounds) is 780.
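Keeping each repeat as a row in a small table makes the final choice a one-liner. This is a sketch using the three example results above; the data frame and its column names are only for illustration.

```r
# One row per repeat of the cross-validation.
results <- data.frame(
  run               = 1:3,
  min_logloss       = c(2.1457, 2.2293, 1.9745),
  min_logloss_index = c(840, 920, 780)
)

# Pick the row with the lowest cross-validated loss.
best        <- results[which.min(results$min_logloss), ]
best_run    <- best$run
best_nround <- best$min_logloss_index
print(best)
```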
Once you have the best parameters, use them in the training:
# best_param is global best param with minimum min_logloss
# best_min_logloss_index is the global minimum logloss index
nround = 780
md <- xgb.train(data=dtrain, params=best_param, nrounds=nround, nthread=6)
I don't think you need watchlist in the training, because you have already done the cross-validation. But if you still want to use watchlist, that is fine.
Even better, you can use early stopping in xgb.cv:
mdcv <- xgb.cv(data=dtrain, params=param, nthread=6,
nfold=cv.nfold, nrounds=cv.nround,
verbose = T, early.stop.round=8, maximize=FALSE)
With this code, xgb.cv stops when the mlogloss value has not decreased for 8 rounds, which saves time. You must set maximize to FALSE because you want the minimum mlogloss.
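The stopping rule itself is simple enough to sketch in base R. The helper below is hypothetical (it is not part of xgboost) and only mimics the idea: scan a vector of per-round losses and stop once the best value has not improved for a given number of rounds.

```r
# Hypothetical helper: reports where the scan stops and which round was best.
early_stop <- function(losses, patience = 8) {
  best_loss  <- Inf
  best_round <- 0
  for (i in seq_along(losses)) {
    if (losses[i] < best_loss) {
      # New best value: remember it and keep going.
      best_loss  <- losses[i]
      best_round <- i
    } else if (i - best_round >= patience) {
      # No improvement for `patience` rounds: stop here.
      return(list(stopped_at = i, best_round = best_round, best_loss = best_loss))
    }
  }
  # Ran out of rounds without triggering the stop.
  list(stopped_at = length(losses), best_round = best_round, best_loss = best_loss)
}

# Made-up losses: falling for 5 rounds, then drifting upward.
losses <- c(2.5, 2.3, 2.2, 2.15, 2.1, 2.12, 2.13, 2.14, 2.2, 2.2, 2.3, 2.3, 2.4, 2.5)
res <- early_stop(losses, patience = 8)
print(res)
```

The round worth keeping is best_round, not stopped_at, which is why the best iteration rather than the last one is used as nrounds afterwards.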
Here is example code with a 100-iteration loop and randomly chosen parameters:
best_param = list()
best_seednumber = 1234
best_logloss = Inf
best_logloss_index = 0
for (iter in 1:100) {
param <- list(objective = "multi:softprob",
eval_metric = "mlogloss",
num_class = 12,
max_depth = sample(6:10, 1),
eta = runif(1, .01, .3),
gamma = runif(1, 0.0, 0.2),
subsample = runif(1, .6, .9),
colsample_bytree = runif(1, .5, .8),
min_child_weight = sample(1:40, 1),
max_delta_step = sample(1:10, 1)
)
cv.nround = 1000
cv.nfold = 5
seed.number = sample.int(10000, 1)[[1]]
set.seed(seed.number)
mdcv <- xgb.cv(data=dtrain, params = param, nthread=6,
nfold=cv.nfold, nrounds=cv.nround,
verbose = T, early.stop.round=8, maximize=FALSE)
min_logloss = min(mdcv[, test.mlogloss.mean])
min_logloss_index = which.min(mdcv[, test.mlogloss.mean])
if (min_logloss < best_logloss) {
best_logloss = min_logloss
best_logloss_index = min_logloss_index
best_seednumber = seed.number
best_param = param
}
}
nround = best_logloss_index
set.seed(best_seednumber)
md <- xgb.train(data=dtrain, params=best_param, nrounds=nround, nthread=6)
With this code, you run cross-validation 100 times, each time with random parameters. You then get the best parameter set, which is the one from the iteration with the lowest min_logloss.
Increase the value of early.stop.round if you find that it is too small (stopping too early). You also need to adjust the ranges of the random parameter values based on the characteristics of your data.
And for 100 or 200 iterations, I think you will want to change verbose to FALSE.
Side note: This is an example of a random search; you can improve on it, e.g. with Bayesian optimization. If you have the Python version of XGBoost, there is a good hyperparameter script for XGBoost, https://github.com/mpearmain/BayesBoost, which searches for the best parameter set using Bayesian optimization.
Edit: I want to add a third, manual method posted by "Davut Polat", a Kaggle master, in the Kaggle forum.
Edit: If you know Python and sklearn, you can also use GridSearchCV along with xgboost.XGBClassifier or xgboost.XGBRegressor.
This is a good question, and a great reply from silo with lots of detail! I found it very helpful as someone new to xgboost. Thank you. The method of randomizing the parameters within a range is very inspiring. As of 2018 some slight revisions are needed: for example, early.stop.round should be early_stopping_rounds. The output mdcv is also organized slightly differently:
min_rmse_index <- mdcv$best_iteration
min_rmse <- mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
Depending on the application (linear, logistic, etc.), the objective, eval_metric and parameters should be adjusted accordingly.
For the convenience of anyone running a regression, here is a slightly adjusted version of the code (most of it is the same as above).
library(xgboost)
# Matrix for xgb: dtrain and dtest, "label" is the dependent variable
dtrain <- xgb.DMatrix(X_train, label = Y_train)
dtest <- xgb.DMatrix(X_test, label = Y_test)
best_param <- list()
best_seednumber <- 1234
best_rmse <- Inf
best_rmse_index <- 0
set.seed(123)
for (iter in 1:100) {
param <- list(objective = "reg:linear",
eval_metric = "rmse",
max_depth = sample(6:10, 1),
eta = runif(1, .01, .3), # Learning rate, default: 0.3
subsample = runif(1, .6, .9),
colsample_bytree = runif(1, .5, .8),
min_child_weight = sample(1:40, 1),
max_delta_step = sample(1:10, 1)
)
cv.nround <- 1000
cv.nfold <- 5 # 5-fold cross-validation
seed.number <- sample.int(10000, 1) # set seed for the cv
set.seed(seed.number)
mdcv <- xgb.cv(data = dtrain, params = param,
nfold = cv.nfold, nrounds = cv.nround,
verbose = F, early_stopping_rounds = 8, maximize = FALSE)
min_rmse_index <- mdcv$best_iteration
min_rmse <- mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
if (min_rmse < best_rmse) {
best_rmse <- min_rmse
best_rmse_index <- min_rmse_index
best_seednumber <- seed.number
best_param <- param
}
}
# The best index (best_rmse_index) is the best "nrounds" for the model
nround = best_rmse_index
set.seed(best_seednumber)
# Fit the final model on the training data
xg_mod <- xgboost(data = dtrain, params = best_param, nrounds = nround, verbose = F)
# Check error in testing data
yhat_xg <- predict(xg_mod, dtest)
(MSE_xgb <- mean((yhat_xg - Y_test)^2))
I found silo's answer very helpful.
In addition to his random search approach, you may want to use Bayesian optimization to speed up the hyperparameter search, e.g. with the rBayesianOptimization library.
The following is my code using the rBayesianOptimization library.
cv_folds <- KFold(dataFTR$isPreIctalTrain, nfolds = 5, stratified = FALSE, seed = seedNum)
xgb_cv_bayes <- function(nround,max.depth, min_child_weight, subsample,eta,gamma,colsample_bytree,max_delta_step) {
param<-list(booster = "gbtree",
max_depth = max.depth,
min_child_weight = min_child_weight,
eta=eta,gamma=gamma,
subsample = subsample, colsample_bytree = colsample_bytree,
max_delta_step=max_delta_step,
lambda = 1, alpha = 0,
objective = "binary:logistic",
eval_metric = "auc")
cv <- xgb.cv(params = param, data = dtrain, folds = cv_folds,nrounds = 1000,early_stopping_rounds = 10, maximize = TRUE, verbose = verbose)
list(Score = cv$evaluation_log$test_auc_mean[cv$best_iteration],
Pred=cv$best_iteration)
# we don't need cross-validation prediction and we need the number of rounds.
# a workaround is to pass the number of rounds(best_iteration) to the Pred, which is a default parameter in the rbayesianoptimization library.
}
OPT_Res <- BayesianOptimization(xgb_cv_bayes,
bounds = list(max.depth =c(3L, 10L),min_child_weight = c(1L, 40L),
subsample = c(0.6, 0.9),
eta=c(0.01,0.3),gamma = c(0.0, 0.2),
colsample_bytree=c(0.5,0.8),max_delta_step=c(1L,10L)),
init_grid_dt = NULL, init_points = 10, n_iter = 10,
acq = "ucb", kappa = 2.576, eps = 0.0,
verbose = verbose)
best_param <- list(
booster = "gbtree",
eval.metric = "auc",
objective = "binary:logistic",
max_depth = OPT_Res$Best_Par["max.depth"],
eta = OPT_Res$Best_Par["eta"],
gamma = OPT_Res$Best_Par["gamma"],
subsample = OPT_Res$Best_Par["subsample"],
colsample_bytree = OPT_Res$Best_Par["colsample_bytree"],
min_child_weight = OPT_Res$Best_Par["min_child_weight"],
max_delta_step = OPT_Res$Best_Par["max_delta_step"])
# number of rounds should be tuned using CV
#https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/
# However, nrounds cannot be directly derived from the BayesianOptimization function
# Here, OPT_Res$Pred, which was supposed to be used for cross-validation, is used to record the number of rounds
nrounds=OPT_Res$Pred[[which.max(OPT_Res$History$Value)]]
xgb_model <- xgb.train (params = best_param, data = dtrain, nrounds = nrounds)

How to make predictions after every 50 cycles in RSNNS

I am using RSNNS to build a model with the Quickprop algorithm. Here's my neural network:
mydata1 <- read.csv("-1-5_rand1.csv");
mydata <- mydata1[1:151, ]
test_set <- mydata1[152:168, ]
test_set1 <- test_set[c(-7)]
a <- SnnsRObjectFactory()
input <- mydata[c(-7)]
output <- mydata[c(7)]
b <- splitForTrainingAndTest(input, output, ratio = 0.22)
a <- mlp(b$inputsTrain, b$targetsTrain, size = 9, maxit = 650, learnFunc = "Quickprop", learnFuncParams = c(0.01, 2.5, 0.0001, 0, 0), updateFunc = "Topological_Order",
updateFuncParams = c(0.0), hiddenActFunc = "Act_TanH", computeError=TRUE, initFunc = "Randomize_Weights", initFuncParams = c(-1,1),
shufflePatterns = TRUE, linOut = FALSE, inputsTest = b$inputsTest, targetsTest = b$targetsTest)
I am predicting on the test set with:
predictions <- predict(a, test_set1)
Is it possible in RSNNS to predict on the test set after every 50 cycles, instead of only after all 650 cycles?
The answer is that you can't do it with the high-level interface, but you can with the low-level interface. Have a look, e.g., at the mlp_irisSnnsR.R demo that is included in RSNNS.

Object 'w' not found error in factor analysis with package 'psych'

There are a lot of questions about factor analysis on these pages. I have browsed through them but nothing seems similar, so hopefully someone can help.
I am running a factor analysis on some survey questions where I expect some latent constructs to emerge. I get the same problem whether I use principal axes or minres, as detailed below.
My dataset contains many discrete variables and a reasonable number of missing values coded as NA, but the problem persists even after removing all NAs:
minres.out <- factor.minres(r = res, nfactors = 5, residuals=F, rotate = "varimax", n.obs=NA, scores=F, SMC=T, missing=F, min.err=0.001, ,max.iter=50, symmetric=T,warnings=T,fm="minres")
minres.out
minres.out2 <- fa(r = res, nfactors = 5, residuals=F, rotate = "oblimin", n.obs=NA, scores=F, SMC=T, missing=F, impute="median",min.err=0.001, ,max.iter=50, symmetric=T,warnings=T,fm="minres", alpha=0.1, p=0.05,oblique.scores=F, use="pairwise")
minres.out2
The first one uses the deprecated version and gives me a warning, but it works. The second one gives me the following error:
Error in factor.scores(x.matrix, f = Structure, method = scores) :
object 'w' not found
I have no object w in my data, but I do not really understand what this object is meant to be in the first place.
Running traceback() gives me:
3: factor.scores(x.matrix, f = Structure, method = scores)
2: fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
scores = scores, residuals = residuals, SMC = SMC, covar = covar,
missing = FALSE, impute = impute, min.err = min.err, max.iter = max.iter,
symmetric = symmetric, warnings = warnings, fm = fm, alpha = alpha,
oblique.scores = oblique.scores, np.obs = np.obs, use = use,
...)
1: fa(r = res, nfactors = 5, residuals = F, rotate = "oblimin",
n.obs = NA, scores = F, SMC = T, missing = F, impute = "median",
min.err = 0.001, , max.iter = 50, symmetric = T, warnings = T,
fm = "minres", alpha = 0.1, p = 0.05, oblique.scores = F,
use = "pairwise")
Not very enlightening to me.
Any suggestions regarding this w?
I went through the code line by line. It seems that the value of scores cannot be passed through as the method argument of the factor.scores function. It goes through a switch statement and none of the branches match, so w is never assigned, which causes the failure. You could try copying and pasting the following silly fix into your R session and then running your code again:
fa <- function(r, nfactors = 1, n.obs = NA, n.iter = 1, rotate = "oblimin",
scores = "regression", residuals = FALSE, SMC = TRUE, covar = FALSE,
missing = FALSE, impute = "median", min.err = 0.001, max.iter = 50,
symmetric = TRUE, warnings = TRUE, fm = "minres", alpha = 0.1,
p = 0.05, oblique.scores = FALSE, np.obs = NULL, use = "pairwise",
...){
scores <- c("a","b")
psych::fa(r, nfactors = 1, n.obs = NA, n.iter = 1, rotate = "oblimin",
scores = "regression", residuals = FALSE, SMC = TRUE, covar = FALSE,
missing = FALSE, impute = "median", min.err = 0.001, max.iter = 50,
symmetric = TRUE, warnings = TRUE, fm = "minres", alpha = 0.1,
p = 0.05, oblique.scores = FALSE, np.obs = NULL, use = "pairwise",
...)
}
I had this same error. Mine was caused because I tried to pass "Regression" to scores instead of "regression". So make sure that what you're passing to scores is an acceptable parameter option.
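Both answers point at the same mechanism: a switch() on a string silently does nothing when no branch matches, so a variable assigned only inside the branches never gets created. Here is a minimal base R illustration, using a made-up function rather than the actual psych source:

```r
# Made-up stand-in for the pattern; this is not the real psych::factor.scores code.
pick_weights <- function(method) {
  switch(method,
         regression = { w <- "regression weights" },
         Bartlett   = { w <- "Bartlett weights" })
  w  # only exists if one of the branches ran
}

ok  <- pick_weights("regression")
# Wrong capitalisation matches no branch, so looking up w fails.
err <- tryCatch(pick_weights("Regression"), error = function(e) conditionMessage(e))
print(err)

# A defensive variant: match.arg() rejects unknown values up front.
pick_weights_checked <- function(method = c("regression", "Bartlett")) {
  method <- match.arg(method)
  pick_weights(method)
}
err2 <- tryCatch(pick_weights_checked("Regression"), error = function(e) conditionMessage(e))
print(err2)
```

The checked variant fails with a message that lists the accepted values, which is usually easier to act on than a missing-object error raised several calls deep.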
