Bisecting K-Means on a Data Set in R - r

I am doing my first assignment in Data Science (Masters level) and do not come from a programming background. I have completed a K-Means model on my data (which is a a simple test data set). But now I want to implement bisecting k-means in order to show how this can improve the clustering result. I am coding in R, does anyone have any knowledge on how to code bisecting k-means in R, for someone who is fairly new to the field?
The code I am trying to use is:
bkmeansset <- ml_bisecting_kmeans(x, formula = NULL, k = 3, max_iter = 20,
seed = NULL, min_divisible_cluster_size = 1, features_col = "features",
prediction_col = "prediction", uid =
random_string("bisecting_bisecting_kmeans_"))
I am inputting a test set called "testset" and I am not sure where to but this in the argument of the function. The error message that I am getting is:
Error in UseMethod("ml_bisecting_kmeans") :
no applicable method for 'ml_bisecting_kmeans' applied to an object of class
"character"

Related

How to obtain a heatmap or network diagram of variable interaction through a machine learning model

I try to use the "vivid" package in R to draw the correlation of variable interactions, because this package can directly analyze the trained machine learning model, and its visualization is very beautiful, but when I execute the vivi() command , I get the wrong information, I don't know what's wrong, because I can execute another instruction in the package: pdpPairs(), and get the PDPS diagram, although I know it's difficult without more information , but I'm hoping to get an answer.Or you can also tell me if there is any other way to directly analyze the trained machine learning model to get a heatmap or network diagram of variable interaction, which will be very useful to me
viFit <- vivi(
fit = gbm_fit,
data = variable_data,
response = "Y1_learning",
gridSize = 10,
importanceType = NULL,
nmax = 100,
reorder = TRUE,
class = 1,
predictFun = NULL)
Agnostic variable importance method used.
Calculating interactions...
Error in mat[vars2] <- res[["value"]] : invalid subscript type 'list'

Writing a prediction equation from plsr model

Greeting to everyone.
I sucessfully computed pls-r model in R using the code below
pls_modB_Kexch_2 <- plsr(Av.K_exc~., data = trainKexch.sar.veg, scale=TRUE,method= "s",validation='CV')
The regression coeffiecents for ncomps =11 were
(
Intercept)= -4.692966e+05,
Easting = 6.068582e+03, Northings= 7.929767e+02,
sigma_vv = 8.024741e+05, sigma_vh = -6.375260e+05,
gamma_vv = -7.120684e+05, gamma_vh = 4.330279e+05,
beta_vv = -8.949598e+04, beta_vh = 2.045924e+05,
c11_db = 2.305016e+01, c22_db = -4.706773e+01,
c12_real = -1.877267e+00.)
It predicts well new data sets when applied with in R enviroment.
My challenge is presenting this model in form of y=sum(AX)+Bo equation where A are coeffiecents of respective variablesX
Or any other mathmetical form, that can be presented academically.
I tried a direct way by multiplying the coeff.to each variable and suming them up, aquick manual trial for predictions gave me strange results. Am missing something here, please help.

Error in optim: L-BFGS-B needs finite value of fn

I am trying to run impute_errors() function of the imputeTestBench package for a series of values. I am using six user defined methods for selection of best imputation method. Below is my code:
correctedSalesHistoryMatrix[, 1:2],
matrix(unlist(apply(X = as.matrix(correctedSalesHistoryMatrix[, -c(1, 2)]),
MARGIN = 1,
FUN = impute_errors,
smps = "mcar",
methods = c(
"imputationMethod1"
, "imputationMethod2"
, "imputationMethod3"
, "imputationMethod4"
, "imputationMethod5"
, "imputationMethod6"
),
methodPath = "C:\\Documents\\Imputations.R",
errorParameter = "mape",
missPercentFrom = 10,
missPercentTo = 10
)
), nrow = nrow(correctedSalesHistoryMatrix), byrow = T
)
)
When I am using a small dataset, the function executes successfully. When I am using a large dataset I am using the following error:
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, :
L-BFGS-B needs finite values of 'fn'
Called from: optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0,
np + 1L), upper = rep(Inf, np + 1L), control = optim.control)
I don't think this is an easy fix.
Error is probably not caused by imputeTestBench itself, but rather by one of your user defined imputation methods.
Run impute_errors like before and only add na_mean as method instead of your user defined methods (impute_errors(..., methods = 'na_mean') ) to see if this suggestion is true.
The error itself occurs quite often and has to do with stats::optim receiving inputs it can't deal with. Quite likely you are not using stats::optim in your user defined imputation methods (so you can't easily fix the input). More likely is that a package your are using is doing some calculations and then using stats::optim. Or even worse a package you are using is using another package, that is using stats::optim.
In the answers to this question you can see an explanation underlying problem. Overall seems to occur especially for large datasets, when the fn input parameter to stats::optim becomes Inf.
Here a some examples of the problem also occurring for different R packages and functions (which all use stats::optim somewhere internally): 1, 2, 3
Not too much you can do overall, if you don't want to go extremely deep into the underlying packages.
If you are using the imputeTS package for one of your user supplied imputation methods, in this Github Issue a workaround is proposed, which might help, if the error occurs within the na_kalman or na_seadec method.

How to use covariates in rddtools rdd_reg_lm function?

I am trying to run a parametric RD regression using the rddtools R package. However, the package documentation is not very clear to me.
First: the function to define an RD object is:
rdd_data(y, x, covar, cutpoint, z, labels, data)
where covar, in the help file, means only "Exogeneous variables" . But what type? A data frame? A list?
Second: The function rdd_reg_lm again demands informing covariates in this way:
rdd_reg_lm(rdd_object, covariates = NULL, order = 1, bw = NULL,
slope = c("separate", "same"), covar.opt = list(strategy = c("include",
"residual"), slope = c("same", "separate"), bw = NULL),
covar.strat = c("include", "residual"), weights)
Where, according to the help file, the covariates argument means simply "Formula to include covariates". Again, it is not clear to me what is exactly the correct way of applying these covariates.
Moreover, is it possible to include multiple covariates in this function rdd_data() and rdd_reg_lm()?
I appreciate some help here. I have already read the help and vignette files again and again, searched in many blogs and still nothing.
I have already checked this topic below
How to include a linear trend in a regression discontinuity design using rddtools
which showed me the following example:
rd.medic <- rdd_data(y = er ,x = ageyrs, covar = ageyrs, cutpoint=65, data = medicare)
rd.reg <- rdd_reg_lm(rdd_object=rd.medic, covariates = 'ageyrs', slope =
("same"), covar.opt = list("include"))
Even so, the syntax is still not clear to me, as I am trying to add multiple covariates without success
Thanks!
You can create a data frame with your covariates and then include it in rdd_data.
covariates<-data.frame(z1=ageyrs, z2=ageyrs2)
rd.medic <- rdd_data(y = er ,x = ageyrs, covar = covariates, cutpoint=65, data = medicare)
rd.reg <- rdd_reg_lm(rdd_object=rd.medic, covariates =TRUE, slope =("same"))

Grid tuning xgboost with missing data

It seems like the expected method of grid tuning the xgboost model is using the caret package, as clearly displayed here: https://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees
However, I struggle to make sense of the case with missing data. When creating the model without using caret, I set the missing to NA.
dtrain = xgb.DMatrix(data = data.matrix(train$data),label = train$label,missing = NA)
That allows me to create the model like so:
bst = xgboost(data = dtrain,depth = 4,eta =.3,nthread = 2,
nround = 43, print.every.n = 5,
objective = "binary:logistic",eval_metric = "auc",verbose = TRUE
)
This works very nicely, however, caret does not take this kind of object.
This is what I'm trying:
xgbtrain = train(x = train$data,y = as.factor(make.names(train$label)),
trControl = trControl, tuneGrid = my_grid,method = "xgbTree")
But for every iteration it is telling me this:
Error in xgb.DMatrix(as.matrix(x), label = y) : can not open file "NA"
That's the same error message I was getting before in regular xgb.boost when I didn't set my missing to NA. The xgb.DMatrix is not a subsettable object I could take the data from, and it is also not possible to convert it to a data frame. How do I get around this?
EDIT
Figured it out. In the end, it had nothing to do with missing data, but with having factors in the dataset. Instead of using xgboost's function to convert to a sparse matrix, I used regular model.matrix() and was able to successfully plug in the new matrix into caret's train function.

Resources