R - XGBoost: Error building DMatrix

I am having trouble using XGBoost in R.
I am reading a CSV file with my data:
get_data = function()
{
  # Loading data
  path = "dados_eye.csv"
  data = read.csv(path)
  # Dividing into two groups
  train_porcentage = 0.05
  train_lines = nrow(data) * train_porcentage
  train = data[1:train_lines, ]
  test = data[train_lines:nrow(data), ]
  rownames(train) = c(1:nrow(train))
  rownames(test) = c(1:nrow(test))
  return(list("test" = test, "train" = train))
}
This function is called by main.R:
lista_dados = get_data()
#machine = train_svm(lista_dados$train)
#machine = train_rf(lista_dados$train)
machine = train_xgt(lista_dados$train)
The problem is in train_xgt:
train_xgt = function(train_data)
{
  data_train = data.frame(train_data[, 1:14])
  label_train = data.frame(factor(train_data[, 15]))
  print(is.data.frame(data_train))
  print(is.data.frame(label_train))
  dtrain = xgb.DMatrix(data_train, label = label_train)
  machine = xgboost(dtrain, num_class = 4, max.depth = 2,
                    eta = 1, nround = 2, nthread = 2,
                    objective = "binary:logistic")
  return(machine)
}
This is the error:
becchi@ubuntu:~/Documents/EEG_DATA/Dados_Eye$ Rscript main.R
[1] TRUE
[1] TRUE
Error in xgb.DMatrix(data_train, label = label_train) :
  xgb.DMatrix: does not support to construct from list
Calls: train_xgt -> xgb.DMatrix
Execution halted
becchi@ubuntu:~/Documents/EEG_DATA/Dados_Eye$
As you can see, they are both data frames.
I don't know what I am doing wrong, please help!

Just convert the data frame to a matrix first using as.matrix() and then pass it to xgb.DMatrix().

Check if all columns have numeric data in them. I think this could be because you have some column whose data is stored as factors or characters, which cannot be converted to a numeric matrix. If you have factor variables, you can use one-hot encoding to convert them into dummy variables.
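For illustration, a minimal base-R sketch of one-hot encoding with model.matrix(); the data frame df and its factor column grp are hypothetical, so adapt the names to your data:
# hypothetical data: one numeric column and one factor column
df <- data.frame(x = 1:4, grp = factor(c("a", "b", "a", "c")))
# model.matrix() expands the factor into 0/1 dummy columns; "- 1" drops the intercept column
X <- model.matrix(~ . - 1, data = df)
X  # a purely numeric matrix, ready for xgb.DMatrix()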

Try:
dtrain = xgb.DMatrix(as.matrix(sapply(data_train, as.numeric)), label=label_train)
instead of just:
dtrain = xgb.DMatrix(data_train, label=label_train)
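Putting it together, a minimal sketch of the full conversion, assuming (as in the question's train_xgt()) that column 15 of train_data holds the class label as a factor; the label is converted to a plain numeric vector as well, since xgb.DMatrix() expects numeric inputs for both:
data_train  <- as.matrix(sapply(data.frame(train_data[, 1:14]), as.numeric))
label_train <- as.numeric(factor(train_data[, 15])) - 1  # plain 0-based numeric vector, not a data frame
dtrain      <- xgb.DMatrix(data_train, label = label_train)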

Related

Create a multivariate matrix in tidymodels recipes::recipe()

I am trying to do k-fold cross-validation on a model that predicts the joint distribution of the proportion of tree species basal area from satellite imagery. This requires the DirichletReg::DirichReg() function, which in turn requires that the response variables be prepared as a matrix using the DirichletReg::DR_data() function. I originally tried to accomplish this with the caret package, but I found out that caret does not support multivariate responses.

I have since tried to implement this in the tidymodels suite of packages. Following the documentation on how to register a new model with the parsnip package (I appreciate Max Kuhn's vegetable humor), I created a "DREG" model and a "DR" engine. My registered model works when I simply call it on a single training dataset, but my goal is k-fold cross-validation using vfold_cv(), a workflow(), and fit_resamples(). With the code I currently have I get the warning:
Warning message:
All models failed. See the `.notes` column.
Those notes state: Error in get(resp_char, environment(oformula)): object 'cbind(PSME, TSHE, ALRU2)' not found. This, I believe, is due to the use of DR_data() to preprocess the response variables into the format necessary for DirichletReg::DirichReg() to run properly. I think the solution involves getting this preprocessing to happen either in the recipe() call or in the set_fit() call when I register this model with parsnip. I have tried to use step_mutate() when specifying the recipe, but that applies a function to each column individually rather than applying it with the columns as inputs. This leads to the following error in the "notes" from the output of fit_resamples():
Must subset columns with a valid subscript vector.
Subscript has the wrong type `quosures`.
It must be numeric or character.
Is there a way to get the recipe to transform several columns to the DirichletRegData class using DR_data(), either with a step_*() function or via the pre = argument in set_fit() and set_pred()?
Below is my reproducible example:
##Loading Necessary Packages##
library(tidymodels)
library(DirichletReg)
##Creating Fake Data##
set.seed(88)#For reproducibility
#Response variables#
PSME_BA<-rnorm(100,50, 15)
TSHE_BA<-rnorm(100,40,12)
ALRU2_BA<-rnorm(100,20,0.5)
Total_BA<-PSME_BA+TSHE_BA+ALRU2_BA
#Predictor variables#
B1<-runif(100, 0, 2000)
B2<-runif(100, 0, 1800)
B3<-runif(100, 0, 3000)
#Dataset for modeling#
DF<-data.frame(PSME=PSME_BA/Total_BA, TSHE=TSHE_BA/Total_BA, ALRU2=ALRU2_BA/Total_BA,
B1=B1, B2=B2, B3=B3)
##Modeling the data using Dirichlet regression with repeated k-folds cross validation##
#Registering the model to parsnip::#
set_new_model("DREG")
set_model_mode(model="DREG", mode="regression")
set_model_engine("DREG", mode="regression", eng="DR")
set_dependency("DREG", eng="DR", pkg="DirichletReg")
set_model_arg(
model = "DREG",
eng = "DR",
parsnip = "param",
original = "model",
func = list(pkg = "DirichletReg", fun = "DirichReg"),
has_submodel = FALSE
)
DREG <- function(mode = "regression", param = NULL) {
  # Check for correct mode
  if (mode != "regression") {
    rlang::abort("`mode` should be 'regression'")
  }
  # Capture the arguments in quosures
  args <- list(sub_classes = rlang::enquo(param))
  # Save some empty slots for future parts of the specification
  new_model_spec(
    "DREG",
    args = args,
    eng_args = NULL,
    mode = mode,
    method = NULL,
    engine = NULL
  )
}
set_fit(
model = "DREG",
eng = "DR",
mode = "regression",
value = list(
interface = "formula",
protect = NULL,
func = c(pkg = "DirichletReg", fun = "DirichReg"),
defaults = list()
)
)
set_encoding(
model = "DREG",
eng = "DR",
mode = "regression",
options = list(
predictor_indicators = "none",
compute_intercept = TRUE,
remove_intercept = TRUE,
allow_sparse_x = FALSE
)
)
set_pred(
model = "DREG",
eng = "DR",
mode = "regression",
type = "numeric",
value = list(
pre = NULL,
post = NULL,
func = c(fun = "predict.DirichletRegModel"),
args =
list(
object = expr(object$fit),
newdata = expr(new_data),
type = "response"
)
)
)
##Running the Model##
DF$Y<-DR_data(DF[,c(1:3)]) #Preparing the response variables
dreg_spec<-DREG(param="alternative") %>%
set_engine("DR")
dreg_mod<-dreg_spec %>%
fit(Y~B1+B2+B3, data = DF)#Model works when simply run on single dataset
##Attempting Crossvalidation##
#First attempt - simply call Y as the response variable in the recipe#
kfolds<-vfold_cv(DF, v=10, repeats = 2)
rcp<-recipe(Y~B1+B2+B3, data=DF)
dreg_fit<- workflow() %>%
add_model(dreg_spec) %>%
add_recipe(rcp)
dreg_rsmpl<-dreg_fit %>%
fit_resamples(kfolds)#Throws warning about all models failing
#second attempt - use step_mutate_at()#
rcp<-recipe(~B1+B2+B3, data=DF) %>%
step_mutate_at(fn=DR_data, var=vars(PSME, TSHE, ALRU2))
dreg_fit<- workflow() %>%
add_model(dreg_spec) %>%
add_recipe(rcp)
dreg_rsmpl<-dreg_fit %>%
fit_resamples(kfolds)#Throws warning about all models failing
This works, but I'm not sure if it's what you were expecting.
First: getting the data set up for CV and DR_data().
I don't know of any package that has built what would essentially be a translation layer between CV and DirichletReg, so that part is done manually. You might be surprised to find it's not all that complicated.
Using the data you created and the modeling objects you created for tidymodels (those prefixed with set_), I created the CV structure that you were trying to use.
df1 <- data.frame(PSME = PSME_BA/Total_BA, TSHE = TSHE_BA/Total_BA,
ALRU2=ALRU2_BA/Total_BA, B1, B2, B3)
set.seed(88)
kDf2 <- kDf1 <- vfold_cv(df1, v=10, repeats = 2)
For each of the 20 subset data frames identified in kDf2, I used DR_data to set the data up for the models.
# convert to DR_data (each folds and repeats)
df2 <- map(1:20,
.f = function(x){
in_ids = kDf1$splits[[x]]$in_id
dd <- kDf1$splits[[x]]$data[in_ids, ] # filter rows BEFORE DR_data
dd$Y <- DR_data(dd[, 1:3])
kDf1$splits[[x]]$data <<- dd
})
Because I'm not all that familiar with tidymodels, I next conducted the modeling with DirichReg directly. I then did it again with tidymodels and compared the two. (The output is identical.)
DirichReg Models and summaries of the fits
set.seed(88)
# perform crossfold validation on Dirichlet Model
df2.fit <- map(1:20,
.f = function(x){
Rpt = kDf1$splits[[x]]$id$id
Fld = kDf1$splits[[x]]$id$id2
daf = kDf1$splits[[x]]$data
fit = DirichReg(Y ~ B1 + B2, daf)
list(Rept = Rpt, Fold = Fld, fit = fit)
})
# summary of each fitted model
fit.a <- map(1:20,
.f = function(x){
summary(df2.fit[[x]]$fit)
})
tidymodels and summaries of the fits (the code looks the same, but there are a few differences; the output is the same, though)
# I'm not sure what 'alternative' is supposed to do here?
dreg_spec <- DREG(param="alternative") %>% # this is not model = alternative
set_engine("DR")
set.seed(88)
dfa.fit <- map(1:20,
.f = function(x){
Rpt = kDf1$splits[[x]]$id$id
Fld = kDf1$splits[[x]]$id$id2
daf = kDf1$splits[[x]]$data
fit = dreg_spec %>%
fit(Y ~ B1 + B2, data = daf)
list(Rept = Rpt, Fold = Fld, fit = fit)
})
afit.a <- map(1:20,
.f = function(x){
summary(dfa.fit[[x]]$fit$fit) # extra nest for parsnip
})
If you wanted to see the first model?
fit.a[[1]]
afit.a[[1]]
If you wanted the model with the lowest AIC?
# compare AIC, BIC, and likelihood?
# which do you perceive as the best fit criterion?
fmin = min(unlist(map(1:20, ~fit.a[[.x]]$aic))) # dir
# find min AIC model number
paste0((map(1:20, ~ifelse(fit.a[[.x]]$aic == fmin, .x, ""))), collapse = "")
fit.a[[19]]
afit.a[[19]]
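As a slightly more direct variant of that search, relying on the same $aic element used above (a sketch, not re-run here):
aics <- unlist(map(1:20, ~ fit.a[[.x]]$aic))
which.min(aics)  # index of the fold/repeat with the lowest AIC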

How to make a Loop in R referencing a data set

I'm confused about how to run a complicated loop. I want R to run a function (rpt) on each of the 14 turtles in the data set (starting with R3L12). Here is what the code looks like when running the function for just one turtle.
R3L12repodba <- rpt(odba ~ (1|date.1), grname = "date.1", data= R3L12rep,
datatype = "Gaussian", nboot = 500, npermut = 0)
print(R3L12repodba)
The problem is that the dataset changes each time. For the next turtle, R3L1, the data argument would be R3L1rep.
It might be easier to just copy and paste the above code and change it for the other 13 turtles, but I wanted to see if anyone could help me with a loop.
Thank you!
You could just make a vector containing the names of each dataset.
data_names=c("R3L12rep","R3L1rep")
Then loop over each name:
for(i in seq_along(data_names)){
  foo = rpt(odba ~ (1|date.1),
            grname = "date.1",
            data = get(data_names[i]),  # look the data frame up by its name
            datatype = "Gaussian",
            nboot = 500,
            npermut = 0)
  print(foo)
}
Put your datasets into a list, then iterate over that list:
datasets = list(R3L12rep,R3L1rep, <insert-rest-of-turtles>)
for (data in datasets) {
  R3L12repodba <- rpt(odba ~ (1|date.1), grname = "date.1", data = data,
                      datatype = "Gaussian", nboot = 500, npermut = 0)
  print(R3L12repodba)
}
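If you also want to keep each turtle's result rather than just print it, here is a sketch along the same lines; it assumes rptR's rpt() and the "<turtle>rep" naming pattern from the question:
turtles <- c("R3L12", "R3L1")  # extend with the remaining turtle IDs
results <- lapply(turtles, function(id) {
  dat <- get(paste0(id, "rep"))  # looks up the data frame named e.g. "R3L12rep"
  rpt(odba ~ (1 | date.1), grname = "date.1", data = dat,
      datatype = "Gaussian", nboot = 500, npermut = 0)
})
names(results) <- turtles
results$R3L12  # repeatability estimate for turtle R3L12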

Exporting Seurat Object Data by Cluster

I'm using Seurat to perform a single-cell analysis and am interested in exporting the data for all cells within each of my clusters. I tried to use the code below but have had no success.
My Seurat object is called Patients. I also attached a screenshot of my Seurat object. I am looking to extract all the clusters (e.g. Ductal1, Macrophage1, Macrophage2, etc.).
meta.data.cluster <- unique(x = Patients@meta.data$active.ident)
for(group in meta.data.cluster) {
  group.cells <- WhichCells(object = Patients, subset.name = "active.ident", accept.value = group)
  data_to_write_out <- as.data.frame(x = as.matrix(x = Patients@raw.data[, group.cells]))
  write.csv(x = data_to_write_out, row.names = TRUE, file = paste0(save_dir, "/", group, "_cluster_outfile.csv"))
}
I am new to R and coding so any help is greatly appreciated! :)
It doesn't work because there is no active.ident column in your metadata. For example, if we use an example dataset like yours and set the idents:
library(Seurat)
M = matrix(rnbinom(5000,mu=20,size=1),ncol=50)
colnames(M) = paste0("P",1:50)
rownames(M) = paste0("gene",1:100)
Patients = CreateSeuratObject(M)
Patients$grp = sample(c("Ductal1","Macrophage1","Macrophage2"),50,replace=TRUE)
Idents(Patients) = Patients$grp
You can see this line of code gives you no value:
meta.data.cluster <- unique(x = Patients@meta.data$active.ident)
meta.data.cluster
NULL
You can do:
meta.data.cluster <- unique(Idents(Patients))
for(group in meta.data.cluster) {
  group.cells <- WhichCells(object = Patients, idents = group)
  data_to_write_out <- as.data.frame(GetAssayData(Patients, slot = 'counts')[, group.cells])
  write.csv(data_to_write_out, row.names = TRUE, file = paste0(save_dir, "/", group, "_cluster_outfile.csv"))
}
Note also that you can get the counts out using GetAssayData(). You can subset one group and write it out like this:
wh <- which(Idents(Patients) =="Macrophage1" )
da = as.data.frame(GetAssayData(Patients,slot = 'counts')[,wh])
write.csv(da,...)
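For completeness, an equivalent lapply-based sketch of the same export loop (it assumes the Patients object and save_dir from above):
counts <- GetAssayData(Patients, slot = "counts")
invisible(lapply(levels(Idents(Patients)), function(group) {
  cells <- WhichCells(Patients, idents = group)
  write.csv(as.data.frame(counts[, cells]),
            file = paste0(save_dir, "/", group, "_cluster_outfile.csv"))
}))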

mcmc modeling in r by metropolis hastings

I am trying to do MCMC modeling with the MHadaptive package in R, but an error appears. What should I do?
#importing data from excel
q<-as.matrix(dataset1) #input data from spread price
F1<-as.matrix(F_1_) #input data from F
li_reg<-function(pars,data)  # defining the likelihood function
{
  a01<-pars[1]  # defining parameters
  a11<-pars[2]
  epsilon<-pars[3]
  b11<-pars[4]
  a02<-pars[5]
  a12<-pars[6]
  b12<-pars[7]
  v<-pars[8]
  pred<-((a01+a11*epsilon^2+b11)+F1[,2]*(a02+a12*epsilon^2+b12))  # the parameters here should be optimized according to this formula
  log_likelihood<-sum(dnorm(data[,2],pred,log = TRUE))
  prior<-prior_reg(pars)
  return(log_likelihood+prior)
}
prior_reg<-function(pars)  # prior values
{
  epsilon<-pars[3]
  v<-pars[8]
  prior_epsilon<-pt(0.85,5,lower.tail = TRUE,log.p = FALSE)
}
mcmc_r<-Metro_Hastings(li_func = li_reg, pars = NULL, prop_sigma = NULL,
                       par_names = c('a01','a11','epsilon','b11','a02','a12','b12'),
                       data = q, iterations = 2000, burn_in = 1000,
                       adapt_par = c(100,20,0.5,0.75), quiet = FALSE)
mcmc_r<-mcmc_thin(mcmc_r)
I used the MHadaptive package to compute the optimized parameters, but this error appears:
Error in optim(pars, li_func, control = list(fnscale = -1), hessian = TRUE, :
function cannot be evaluated at initial parameters

Unused arguments in R error

I am new to R. I am trying to run an example given in the "rebmix-help" PDF. It uses the galaxy dataset, and here is the code:
library(rebmix)
devAskNewPage(ask = TRUE)
data("galaxy")
write.table(galaxy, file = "galaxy.txt", sep = "\t",eol = "\n", row.names = FALSE, col.names = FALSE)
REBMIX <- array(list(NULL), c(3, 3, 3))
Table <- NULL
Preprocessing <- c("histogram", "Parzen window", "k-nearest neighbour")
InformationCriterion <- c("AIC", "BIC", "CLC")
pdf <- c("normal", "lognormal", "Weibull")
K <- list(7:20, 7:20, 2:10)
for (i in 1:3) {
  for (j in 1:3) {
    for (k in 1:3) {
      REBMIX[[i, j, k]] <- REBMIX(Dataset = "galaxy.txt",
        Preprocessing = Preprocessing[k], D = 0.0025,
        cmax = 12, InformationCriterion = InformationCriterion[j],
        pdf = pdf[i], K = K[[k]])
      if (is.null(Table))
        Table <- REBMIX[[i, j, k]]$summary
      else Table <- merge(Table, REBMIX[[i, j, k]]$summary, all = TRUE, sort = FALSE)
    }
  }
}
It is giving me this error:
unused argument (InformationCriterion = InformationCriterion[j])
Please help.
I'm running R 3.0.2 (Windows), and the rebmix library defines a REBMIX function in which InformationCriterion is not a named argument; the argument is called Criterion. In brief, invoke REBMIX as:
REBMIX[[i, j, k]] <- REBMIX(Dataset = "galaxy.txt",
Preprocessing = Preprocessing[k], D = 0.0025,
cmax = 12, Criterion = InformationCriterion[j],
pdf = pdf[i], K = K[[k]])
It looks as though there have been substantial changes to the rebmix package since the example mentioned in the OP was created. Among the most noticeable changes is the use of S4 classes.
There's also an updated demo in the rebmix package using the galaxy data (see demo("rebmix.galaxy"))
To get the above example to produce results (Note: I am not familiar with this package or the rebmix algorithm!!!):
Change the argument to Criterion as mentioned by @Giupo
Use the S4 slot access operator @ instead of $
Don't name the results object REBMIX, because that's already the function name
library(rebmix)
data("galaxy")
## Don't re-name the REBMIX object!
myREBMIX <- array(list(NULL), c(3, 3, 3))
Table <- NULL
Preprocessing <- c("histogram", "Parzen window", "k-nearest neighbour")
InformationCriterion <- c("AIC", "BIC", "CLC")
pdf <- c("normal", "lognormal", "Weibull")
K <- list(7:20, 7:20, 2:10)
for (i in 1:3) {
  for (j in 1:3) {
    for (k in 1:3) {
      myREBMIX[[i, j, k]] <- REBMIX(Dataset = list(galaxy),
        Preprocessing = Preprocessing[k], D = 0.0025,
        cmax = 12, Criterion = InformationCriterion[j],
        pdf = pdf[i], K = K[[k]])
      if (is.null(Table)) {
        Table <- myREBMIX[[i, j, k]]@summary
      } else {
        Table <- merge(Table, myREBMIX[[i, j, k]]@summary, all = TRUE, sort = FALSE)
      }
    }
  }
}
I guess this is late, but I encountered a similar problem just a few minutes ago, and I realized the real scenario you may face when you get this kind of error message: it is simply a version conflict.
You may be using a different version of the R package than the tutorial, so the argument names can differ between what you are running and what the example code uses.
So please check the package version before you try to manually edit the file. It can also happen that an old version of the package is still on your library path and overrides the new one; that was exactly my situation, since I had manually installed the old and new versions separately.
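A few base-R checks along those lines (nothing rebmix-specific is assumed beyond the package being installed):
packageVersion("rebmix")  # which rebmix version R will actually load
find.package("rebmix")    # which library it will be loaded from
.libPaths()               # the library paths R searches, in order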
