XGBoost Error in R Studio ("'data' has class 'character' and length...") - r

I am having difficulties fitting my data to an xgboost classifier model. When I run this:
classifier = xgboost(data = as.matrix(training_set[c(4:15, 17:18,20:28)]),
label = training_set$posted_ind, nrounds = 10)
R Studio tells me:
Error in xgb.DMatrix(data, label = label, missing = missing) :
'data' has class 'character' and length 1472000.
'data' accepts either a numeric matrix or a single filename.
The training set data has both continuous and categorical data, but all categorical data has been encoded as such (and the same data fit to random forest and naive bayes models). Is there some additional step I need to complete so that I can use these data in an xgboost model?

Make sure that your "training_set" does not have any columns that are factors. If you encoded your categorical variables as numbers but then cast them as factors, you will get this error, because as.matrix() returns a character matrix as soon as one column is not numeric.
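A quick way to check for this, sketched with a made-up data frame (the column names x and grp are illustrative and do not come from the question):

```r
# Toy data frame: 'grp' is a numerically coded category stored as a factor
df <- data.frame(x = c(1.5, 2.0, 3.2),
                 grp = factor(c("1", "2", "1")))

# Which columns are not numeric?
non_numeric <- names(df)[!vapply(df, is.numeric, logical(1))]
non_numeric

# as.matrix() falls back to character when any column is non-numeric
is.character(as.matrix(df))

# Go through as.character() first; as.numeric() applied directly to a
# factor would return the internal level indices instead of the codes
df$grp <- as.numeric(as.character(df$grp))
m <- as.matrix(df)
is.numeric(m)
```

This only makes sense when the numeric codes are meaningful as numbers; for unordered categories, dummy encoding is usually the safer route.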

I came across the same problem and found a complete solution. You have to use:
library(Matrix)
library(xgboost)
sparse_matrix <- sparse.model.matrix(label_y ~ ., data = df)[,-1]
X_train_dmat = xgb.DMatrix(sparse_matrix, label = df$label_y)
This transforms the categorical data to dummy variables. Several encoding methods exist, e.g., one-hot encoding is a common approach. The above is dummy contrast coding which is popular because it produces “full rank” encoding (also see this blog post by Max Kuhn).
The purpose is to transform each value of each categorical feature into a binary feature {0, 1}.
For example, a column Treatment will be replaced by two columns, TreatmentPlacebo, and TreatmentTreated. Each of them will be binary. Therefore, an observation which has the value Placebo in column Treatment before the transformation will have after the transformation the value 1 in the new column TreatmentPlacebo and the value 0 in the new column TreatmentTreated. The column TreatmentPlacebo will disappear during the contrast encoding, as it would be absorbed into a common constant intercept column.
Source: https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html#conversion-from-categorical-to-numeric-variables
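The difference between contrast coding and one-hot coding can be seen on a toy version of the Treatment example. This is a minimal sketch that uses base R's model.matrix() instead of sparse.model.matrix(); it should apply the same formula coding but returns a dense matrix. The data frame is made up for illustration.

```r
# Toy data mirroring the Treatment example from the vignette
df <- data.frame(Treatment = factor(c("Placebo", "Treated", "Placebo")),
                 Age = c(40, 52, 31))

# Contrast (full-rank) coding: the first level, Placebo, is absorbed
# into the intercept; [, -1] then drops the intercept column itself
contrast <- model.matrix(~ ., data = df)[, -1]
colnames(contrast)

# One-hot coding of a single factor: removing the intercept with "- 1"
# keeps one indicator column per level
one_hot <- model.matrix(~ Treatment - 1, data = df)
colnames(one_hot)
```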

What is working for me while using tidymodels is adding a recipe step for dummy encoding:
step_dummy(all_nominal_predictors(), one_hot = TRUE)

Related

What does a proportional matrix look like for glmnet response variable in R?

I'm trying to use glmnet to fit a GLM that has a proportional response variable (using the family="binomial").
The help file for glmnet says that the response variable:
"For family="binomial" should be either a factor with
two levels, or a two-column matrix of counts or proportions (the second column
is treated as the target class)"
But I don't really understand how I would have a two column matrix. My variable is currently just a single column with values between 0 and 1. Can someone help me figure out how this needs to be formatted so that glmnet will run it properly? Also, can you explain what the target class means?
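The two-column matrix holds the counts of negative and positive labels for each observation; the second column is the target class, i.e. the outcome whose probability is modelled. For example, the code below fits a model for the proportion of Claims among Holders:
library(glmnet)
data = MASS::Insurance
# first column: holders without a claim, second column: claims (target class)
y_counts = cbind(data$Holders - data$Claims,data$Claims)
x = model.matrix(~District+Age+Group,data=data)
fit1 = glmnet(x=x,y=y_counts,family="binomial",lambda=0.001)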
If possible, go back to the step before you calculated the response variable and retrieve these counts. If that is not possible, you can provide a matrix of proportions, with the second column for success, but this assumes the weight (n) is the same for all observations:
y_prop = y_counts / rowSums(y_counts)
fit2 = glmnet(x=x,y=y_prop,family="binomial",lambda=0.001)
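If all you have is a single column of proportions but you still know how many trials each one is based on, the count matrix can be rebuilt. A minimal sketch, where p and n are hypothetical vectors standing in for your proportions and trial counts:

```r
# Hypothetical inputs: observed proportions and the trials behind each one
p <- c(0.10, 0.25, 0.50)
n <- c(20, 40, 10)

# Two-column count matrix in the layout used above:
# first column = failures, second column = successes (the target class)
successes <- round(p * n)
y_counts <- cbind(failure = n - successes, success = successes)
y_counts
```

The resulting matrix can be passed as y in the same way as the count matrix in the Insurance example.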

R Cross Validation lm predict function [duplicate]

I am trying to convert Absorbance (Abs) values to Concentration (ng/mL), based on an established linear model & standard curve. I planned to do this by using the predict() function. I am having trouble getting predict() to return the desired results. Here is a sample of my code:
Standards<-data.frame(ng_mL=c(0,0.4,1,4),
Abs550nm=c(1.7535,1.5896,1.4285,0.9362))
LM.2<-lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Abs<-c(1.7812,1.7309,1.3537,1.6757,1.7409,1.7875,1.7533,1.8169,1.753,1.6721,1.7036,1.6707,
0.3903,0.3362,0.2886,0.281,0.3596,0.4122,0.218,0.2331,1.3292,1.2734)
predict(object=LM.2,
newdata=data.frame(Concentration=Abs[1]))#using Abs[1] as an example, but I eventually want predictions for all values in Abs
Running those last lines gives this output:
> predict(object=LM.2,
+ newdata=data.frame(Concentration=Abs[1]))
1 2 3 4
0.5338437 0.4731341 0.3820697 -0.0732525
Warning message:
'newdata' had 1 row but variables found have 4 rows
This does not seem to be the output I want. I am trying to get a single predicted Concentration value for each Absorbance (Abs) entry. It would be nice to be able to predict all of the entries at once and add them to an existing data frame, but I can't even get it to give me a single value correctly. I've read many threads on here, webpages found on Google, and all of the help files, and for the life of me I cannot understand what is going on with this function. Any help would be appreciated, thanks.
You must have a variable in newdata that has the same name as that used in the model formula used to fit the model initially.
You have two errors:
You don't use a variable in newdata with the same name as the covariate used to fit the model, and
You make the problem much more difficult to resolve because you abuse the formula interface.
Don't fit your model like this:
mod <- lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Fit your model like this:
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
Isn't that so much more readable?
To predict you would need a data frame with a variable ng_mL:
predict(mod, newdata = data.frame(ng_mL = c(0.5, 1.2)))
Now you may have a third error. You appear to be trying to predict with new values of Absorbance, but the way you fitted the model, Absorbance is the response variable. You would need to supply new values for ng_mL.
The behaviour you are seeing is what happens when R can't find a correctly-named variable in newdata; it returns the fitted values from the model or the predictions at the observed data.
This makes me think you have the formula back to front. Did you mean:
mod2 <- lm(ng_mL ~ log(Abs550nm), data = Standards)
?? In which case, you'd need
predict(mod2, newdata = data.frame(Abs550nm = c(1.7812,1.7309)))
say. Note you don't need to include the log() bit in the name. R recognises that as a function and applies it to the variable Abs550nm for you.
If the model really is log(Abs550nm) ~ ng_mL and you want to find values of ng_mL for new values of Abs550nm you'll need to invert the fitted model in some way.
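One simple way to invert the fitted model, sketched here as an assumption rather than the only approach, is to solve the fitted line log(Abs550nm) = b0 + b1 * ng_mL for ng_mL. The helper invert_curve() is a hypothetical name introduced for this example, and it returns point estimates only, with no uncertainty.

```r
Standards <- data.frame(ng_mL = c(0, 0.4, 1, 4),
                        Abs550nm = c(1.7535, 1.5896, 1.4285, 0.9362))
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)

# Solve log(Abs) = b0 + b1 * ng_mL for ng_mL
invert_curve <- function(abs, model) {
  b <- unname(coef(model))  # b[1] = intercept, b[2] = slope
  (log(abs) - b[1]) / b[2]
}

Abs <- c(1.7812, 1.7309, 1.3537)
data.frame(Abs550nm = Abs, ng_mL_est = invert_curve(Abs, mod))
```

Absorbances outside the range covered by the standards are extrapolations, and values above the fitted zero-concentration absorbance come out as negative concentrations.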

Error in bn.fit predict function in bnlearn R

I have learned and fitted a Bayesian network with the bnlearn R package and I wish to predict the value of its "event" node.
fl="data/discrete_kdd_10.txt"
h=TRUE
dtbl1 = read.csv(file=fl, head=h, sep=",")
net=hc(dtbl1)
fitted=bn.fit(net,dtbl1)
I want to predict the value of "event" node based on the evidence stored in another file with the same structure as the file used for learning.
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, head=h, sep=",")
predict(fitted,"event",dtbl2)
However, predict fails with
Error in check.data(data) : variable duration must have at least two levels.
I don't understand why there should be any restriction on number of levels of variables in the evidence data.frame.
The dtbl2 data.frame contains only a few rows, one for each scenario in which I want to predict the "event" value.
I know I can use cpquery, but I also wish to use the predict function for networks with mixed variables (both discrete and continuous). I haven't found out how to use evidence on a continuous variable in cpquery.
Can someone please explain what I'm doing wrong with the predict function and how I should do it right?
Thanks in advance!
The problem was that reading the evidence data.frame in
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, head=h, sep=",")
predict(fitted,"event",dtbl2)
caused the categorical variables to become factors with a different number of levels (a subset of the levels in the original training set).
I used the following code to solve this issue.
for (i in seq_along(dtbl2)) {
  # reuse the training levels so both data frames share the same factor levels
  dtbl2[[i]] = factor(dtbl2[[i]], levels = levels(dtbl1[[i]]))
}
By the way, the bnlearn package does fit models with mixed variables and also provides functions for making predictions with them.
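The level mismatch and the fix can be reproduced without bnlearn. A minimal sketch with toy stand-ins for the training and evidence data (train and evidence are hypothetical names and values), matching columns by name rather than by position:

```r
# Toy stand-ins for the training and evidence data (hypothetical values)
train <- data.frame(duration = factor(c("short", "long", "short")),
                    event    = factor(c("yes", "no", "no")))
evidence <- data.frame(duration = factor("short"),
                       event    = factor("no"))

nlevels(evidence$duration)  # only the levels seen in the evidence data

# Re-level each evidence column by name using the training levels
for (col in names(evidence)) {
  evidence[[col]] <- factor(evidence[[col]], levels = levels(train[[col]]))
}

nlevels(evidence$duration)  # now matches the training data
```

In R 4.0 and later, read.csv() returns character columns by default rather than factors, so the same factor() call also takes care of the conversion.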

Classification column is removed after using dummyVars in caret package - R

I am playing around with the caret package and came upon this question.
I am using dummyVars to split my categorical columns into separate dummy variables. It seems that dummyVars code removes the classification column in the input data set. For example:
library(caret)
library(earth)
data(etitanic)
dummies <- dummyVars(survived ~ ., data = etitanic, levelsOnly = FALSE)
et<-as.data.frame(predict(dummies, newdata = etitanic))
names(et)
[1] "pclass.1st" "pclass.2nd" "pclass.3rd" "sex.female" "sex.male" "age"
[7] "sibsp" "parch"
So when I try to split the data, I get an error.
train = createDataPartition(et$survived, p=.75, list=FALSE)
Error in createDataPartition(et$survived, p = 0.75, list = FALSE) :
y must have at least 2 data points
Could anyone let me know if this is the expected behavior of caret's dummyVars? I can easily add the survived column back into the data set using, say,
et$survived<-etitanic$survived
and then train a model. But I presume that there must be a better way or else the caret package would not remove the classification column. Am I missing something here? Could someone throw more light on this please?
Thanks
As far as I know, there is no way to keep the classification column in, or at least not as a factor, because the output is a matrix and is therefore always numeric. The purpose of the dummyVars function is to create dummy variables for the factor predictor variables. It is designed as an alternative to the base R function model.matrix that offers more choices (model.matrix does not keep the classification column either).
Also, and maybe more importantly, functions that require the classification column to be a factor either offer a way to provide the factor as a separate argument (like the svm function from the e1071 package) or specifically require it as a separate argument (like the knn function from the FNN package). In both cases you do not need to have the factor in your data.frame. You just need to provide it as a separate vector to the function you want to use.
However, there is an alternative for the cases where you do not need the classification column to be a factor, in which case you can simply do:
library(caret)
library(earth)
data(etitanic)
etitanic2 <- etitanic
# convert the classification column to numeric
etitanic2$survived <- as.numeric(etitanic2$survived)
# use a formula without specifying the response variable
dummies <- dummyVars( ~ ., data = etitanic2, levelsOnly = FALSE)
et <- as.data.frame(predict(dummies, newdata = etitanic2))
names(et)
[1] "pclass.1st" "pclass.2nd" "pclass.3rd" "survived" "sex.female" "sex.male" "age"
[8] "sibsp" "parch"
By converting the classification column to numeric and by not specifying a response variable in the formula, the survived column is kept in the output data.frame, but as a numeric column.
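When the response does need to stay a factor, the separate-vector approach described above can be sketched with base R's model.matrix(), which also leaves the response out of its output. The data frame below is made up for illustration.

```r
# Toy data frame (made-up values) with a factor response and a factor predictor
df <- data.frame(survived = factor(c("no", "yes", "yes", "no")),
                 sex = factor(c("male", "female", "female", "male")),
                 age = c(22, 38, 26, 35))

# Keep the response as its own factor vector
y <- df$survived

# Dummy-code the predictors only; the response is not part of the output
# and [, -1] drops the intercept column
X <- model.matrix(survived ~ ., data = df)[, -1]
colnames(X)
```

With caret, a vector like y could then be handed to createDataPartition() directly, without putting it back into the dummy-coded data frame.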

what to do with non-numeric predictor in plsreg1?

I am using the function "plsreg1" in the "plsdepot" package to run a PLS regression.
As the package introduction states, in the argument "plsreg1(predictors, response, comps = 2, crosval = TRUE)", the predictors represents "A numeric matrix or data frame with the predictor variables (which may contain missing data)."
And the response represents "A numeric vector for the reponse variable. No missing data allowed."
My question is, what if some of my predictors are categorical variables?
Can I still use this package? How should I revise my script?
Example script from the package is as below
library(plsdepot)
data(cornell)
# matrix of correlations
round(as.dist(cor(cornell)),3)
# partial least squares regression
mypls1 <- plsreg1(cornell[,1:7], cornell[,8,drop=F],comp=3)
mypls1
Thank you very much.
