R package mlr Multilabel Text Classification: how to classify new data

I found this code in a tutorial about multilabel classification with package mlr.
library("mlr")
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)
lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)
mod = train(lrn.br, yeast.task, subset = 1:1500, weights = rep(1/1500, 1500))
pred = predict(mod, task = yeast.task, subset = 1:10)
pred = predict(mod, newdata = yeast[1501:1600,])
I understand the structure of the yeast dataset, but I do not understand how to use this code when I have new data to classify, because then I would not have any TRUE or FALSE values for the labels. I would have training data with the same structure as yeast, but for my new data the label columns 1:14 would be missing.
Am I misunderstanding something? If not: how can I use the code correctly?
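To make the question concrete, here is a minimal sketch of what I mean (assuming predict() only needs the feature columns in newdata, which is exactly what I am unsure about):
# New observations: same features as yeast, but no label columns
new.obs = yeast[1501:1600, setdiff(colnames(yeast), labels)]
pred.new = predict(mod, newdata = new.obs)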
Edit:
Here is a sample of how I would use the code:
library("tm")
train.data = data.frame(
  "id" = c(1, 1, 2, 3, 4, 4),
  "text" = c("Monday is nice weather.", "Monday is nice weather.", "Dogs are cute.",
             "It is very rainy.", "My teacher is angry.", "My teacher is angry."),
  "label" = c("label1", "label2", "label3", "label1", "label4", "label5"))
test.data = data.frame(
  "id" = c(5, 6),
  "text" = c("Next Monday I will meet my teacher.", "Dogs do not like rain."))
train.data$text = as.character(train.data$text)
train.data$id = as.character(train.data$id)
train.data$label = as.character(train.data$label)
test.data$text = as.character(test.data$text)
test.data$id = as.character(test.data$id)
### Bring training data into structure
train.data$label = make.names(train.data$label)
labels = unique(train.data$label)
# DocumentTermMatrix for all texts
texts = unique(c(train.data$text, test.data$text))
docs <- Corpus(VectorSource(texts)) # texts is already unique
terms <- DocumentTermMatrix(docs)
m <- as.data.frame(as.matrix(terms))
# Logical columns for labels
test = data.frame("id" = train.data$id, "topic"=train.data$label)
test2 = as.data.frame(unclass(table(test)))
test2[,c(1:ncol(test2))] = as.logical(unlist(test2[,c(1:ncol(test2))]))
rownames(test2) = unique(test$id)
# Bind columns from dtm
termsDf = cbind(test2, m[1:nrow(test2),])
names(termsDf) = make.names(names(termsDf))
### Create Multilabel Task
classify.task = makeMultilabelTask(id = "multi", data = termsDf, target = labels)
### Now the model
lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)
mod = train(lrn.br, classify.task)
### How can I predict for test.data?
So the problem is that I do not have any labels for test.data, because the labels are exactly what I want to compute.
Edit2:
When I simply use
names(m) = make.names(names(m))
pred = predict(mod, newdata = m[(nrow(termsDf)+1):(nrow(termsDf)+nrow(test.data)),])
the result is the same for both texts, and really not what I would expect.
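A likely explanation (my assumption, not verified): with only four training documents, rpart's default minsplit = 20 cannot make a single split, so every binary-relevance tree is a root-only stump and all new texts get identical probabilities. Relaxing rpart's control parameters for this toy data should at least produce distinct predictions:
# Hypothetical fix: loosen rpart's defaults so trees can split on 4 documents
lrn.br = makeLearner("classif.rpart", predict.type = "prob",
                     minsplit = 2, minbucket = 1, cp = 0.001)
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)
mod = train(lrn.br, classify.task)
pred = predict(mod, newdata = m[(nrow(termsDf) + 1):(nrow(termsDf) + nrow(test.data)), ])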

Related

Is it possible to create a stratified table (tbl_strata) using tbl_svysummary()?

I am pretty new to survey data and the gtsummary package.
I'm trying to create a stratified table from the survey data using the following code, and I get the error "Error: Problem with mutate() input tbl".
# Reading the subset of the data
fileUrl <- "https://raw.github.com/Shadi-Sadie/Paper-1-Cancer-Screening-and-Immigrants/master/Cleaned%20Data/subset.csv"
SData <- read.csv(fileUrl, header = TRUE, sep = ",")
# Setting the weight
options( "survey.replicates.mse" = TRUE)
svy <- svrepdesign(repweights = "PWGTP[0-9]+",
                   weights = ~PWGTP,
                   combined.weights = TRUE,
                   type = "JK1",
                   scale = 4/80, rscales = rep(1, 80),
                   data = SData)
# creating the table
SData %>%
  select(CITG, HICOV, ESRG, EXPANSION) %>%
  tbl_strata(
    strata = CITG,
    .tbl_fun =
      ~ .x %>% tbl_svysummary(
        by = EXPANSION,
        include = c(CITG, HICOV, ESRG, EXPANSION),
        label = list(CITG ~ "Nativity",
                     HICOV ~ "Any health insurance",
                     ESRG ~ "Employment",
                     EXPANSION ~ "Expansion")
      )
  )
If it is possible to use tbl_svysummary() with tbl_strata(), could anyone tell me what I'm doing wrong?
Thanks for updating with a reproducible post. I made the following changes:
You were passing the data frame to tbl_strata() and it needed to be updated to the survey design object.
The stratifying variable should not be listed in the tbl_svysummary(include=) argument.
Happy Programming!
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.6.0'
fileUrl <- "https://raw.github.com/Shadi-Sadie/Paper-1-Cancer-Screening-and-Immigrants/master/Cleaned%20Data/subset.csv"
SData <- read.csv(fileUrl, header = TRUE, sep = ",")
# Setting the weight
options("survey.replicates.mse" = TRUE)
svy <- survey::svrepdesign(
  repweights = "PWGTP[0-9]+",
  weights = ~PWGTP,
  combined.weights = TRUE,
  type = "JK1",
  scale = 4 / 80, rscales = rep(1, 80),
  data = SData
)
# creating the table
tbl <-
  svy %>%
  tbl_strata(
    strata = CITG,
    .tbl_fun =
      ~ .x %>% tbl_svysummary(
        by = EXPANSION,
        include = c(HICOV, ESRG, EXPANSION),
        label = list(
          HICOV = "Any health insurance",
          ESRG = "Employment",
          EXPANSION = "Expansion"
        )
      )
  )
Created on 2022-05-01 by the reprex package (v2.0.1)

importance ranking: error must be an object of class xgb.Booster

I ran an xgboost regression forecast (and also tried to complete the model with xgb.Booster.complete). When trying to get the feature importance with xgb.importance, I get the error message
Error in xgboost::xgb.importance(case_xgbm) : model: must be an
object of class xgb.Booster
However, when I check the class, R says it is an "xgb.Booster" object.
Any idea what is going on?
library(xgboost)
library(caret)
somedata <- MASS::Boston
indexes = createDataPartition(somedata$medv, p = .85, list = F) #medv is the y
train = somedata[indexes, ]
test = somedata[-indexes, ]
train_x = data.matrix(train[, -14]) # medv is column 14 of Boston
train_y = train[, 14]
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgbc = xgboost(data = xgb_train, max.depth = 2, nrounds = 50)
class(xgbc)
xgboost::xgb.importance(xgbc)
xgbc2 = xgb.Booster.complete(xgbc, saveraw = TRUE)
class(xgbc2)
xgboost::xgb.importance(xgbc2)
Try
xgboost::xgb.importance(model = xgbc)
This worked for me.
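A likely explanation (an assumption based on the function's signature, not something the error message states): the first positional argument of xgb.importance() is feature_names, not model, so an unnamed booster is matched to the wrong argument and the model check fails:
# Unnamed call: xgbc is matched to feature_names, model stays NULL,
# hence "model: must be an object of class xgb.Booster"
xgboost::xgb.importance(xgbc)
# Named call: the booster reaches the model argument
imp <- xgboost::xgb.importance(model = xgbc)
head(imp)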

mlr3 - Apply pre-processing to new data

Using the mlr3verse package here. Let's say I applied the following pre-processing to the training set used to train Learner:
preprocess <- po("scale", param_vals = list(center = TRUE, scale = TRUE)) %>>%
po("encode",param_vals = list(method = "one-hot"))
And I would like to predict the label of new observations contained in a dataframe pred (with the original variables) using the command predict(Learner, newdata = pred, predict_type = "prob"). This won't work, since Learner was trained on centered, scaled, and one-hot encoded variables.
How to apply the same pre-processing used on the training set to new data (only features, not response) in order to make predictions?
I am not 100% sure, but it seems you can wrap newdata in a new task and feed that to predict. This page shows an example of combining mlr_pipeops and learner objects.
library(dplyr)
library(mlr3verse)
df_iris <- iris
df_iris$Petal.Width = df_iris$Petal.Width %>% cut(breaks = c(0, 0.5, 1, 1.5, 2, Inf))
task = TaskClassif$new(id = "my_iris",
                       backend = df_iris,
                       target = "Species")
train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)
task_train = TaskClassif$new(id = "my_iris",
                             backend = df_iris[train_set, ], # use train_set
                             target = "Species")
graph = po("scale", param_vals = list(center = TRUE, scale = TRUE)) %>>%
  po("encode", param_vals = list(method = "one-hot")) %>>%
  mlr_pipeops$get("learner",
                  learner = mlr_learners$get("classif.rpart"))
graph$train(task_train)
graph$pipeops$encode$state$outtasklayout # inspect model input types
graph$pipeops$classif.rpart$predict_type = "prob"
task_test = TaskClassif$new(id = "my_iris_test",
                            backend = df_iris[test_set, ], # use test_set
                            target = "Species")
pred = graph$predict(task_test)
pred$classif.rpart.output$prob
# when you don't have a target variable, just make up one
df_test2 <- df_iris[test_set, ]
df_test2$Species = sample(df_iris$Species, length(test_set)) # made-up target
task_test2 = TaskClassif$new(id = "my_iris_test",
                             backend = df_test2, # use test_set
                             target = "Species")
pred2 = graph$predict(task_test2)
pred2$classif.rpart.output$prob
As suggested by @missuse, by building graph <- preprocess %>>% Learner and then graph_learner <- GraphLearner$new(graph), I could predict on a raw data.frame with predict(graph_learner, newdata = pred, predict_type = "prob").
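A minimal sketch of that approach, using classif.rpart and the iris task as stand-ins for the tuned Learner and the real data:
library(mlr3verse)
preprocess <- po("scale", param_vals = list(center = TRUE, scale = TRUE)) %>>%
  po("encode", param_vals = list(method = "one-hot"))
# combine pre-processing and learner into a single GraphLearner
graph <- preprocess %>>% lrn("classif.rpart", predict_type = "prob")
graph_learner <- GraphLearner$new(graph)
graph_learner$train(tsk("iris"))
# new observations as a raw data.frame with only the original features;
# the stored scaling/encoding is applied automatically before predicting
graph_learner$predict_newdata(iris[1:5, 1:4])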

mlrCPO - Task conversion TOCPO

I would like to build a CPO for the mlr::makeClassificationViaRegressionWrapper. The wrapper builds regression models that predict, for the positive class, whether a particular example belongs to it (1) or not (-1). It also calculates predicted probabilities using a softmax.
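For reference, my reading of that probability mapping (a sketch of the idea, not the wrapper's actual source): a single regression score s for the positive class is squashed with a two-class softmax.
# P(positive) = exp(s) / (exp(s) + exp(-s))
score_to_prob <- function(s) exp(s) / (exp(s) + exp(-s))
score_to_prob(c(-2, 0, 2)) # approx. 0.018, 0.500, 0.982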
After reading the documentation and vignettes for makeCPOTargetOp, my attempt is as follows:
cpoClassifViaRegr = makeCPOTargetOp(
  cpo.name = 'ClassifViaRegr',
  dataformat = 'task', # Not sure - will this work if input is a df with unknown target values?
  # properties.data = c('numerics', 'factors', 'ordered', 'missings'), # Is this needed?
  properties.adding = 'twoclass', # See https://mlrcpo.mlr-org.com/articles/a_4_custom_CPOs.html#task-type-and-conversion
  properties.needed = character(0),
  properties.target = c('classif', 'twoclass'),
  task.type.out = 'regr',
  predict.type.map = c(response = 'response', prob = 'response'),
  constant.invert = TRUE,
  cpo.train = function(data, target) {
    getTaskDesc(data)
  },
  cpo.retrafo = function(data, target, control) {
    td = getTaskData(target, target.extra = TRUE)
    target.name = paste0(control$positive, ".prob")
    data = td$data
    data[[target.name]] = ifelse(td$target == control$positive, 1, -1)
    makeRegrTask(id = paste0(getTaskId(target), control$positive, '.'),
                 data = data,
                 target = target.name,
                 weights = target$weights,
                 blocking = target$blocking)
  },
  cpo.train.invert = NULL, # Since constant.invert = TRUE
  cpo.invert = function(target, control.invert, predict.type) {
    if (predict.type == 'response') {
      factor(ifelse(target > 0, control.invert$positive, control.invert$negative))
    } else {
      levs = c(control.invert$positive, control.invert$negative)
      propVectorToMatrix(vnapply(target, function(x) exp(x) / sum(exp(x), exp(-x))), levs)
    }
  })
It seems to work as expected: the demo below shows that the inverted prediction is identical to the prediction obtained using makeClassificationViaRegressionWrapper:
lrn = makeLearner("regr.lm")
# Wrapper -----------------------------------------------------------------
lrn2 = makeClassificationViaRegressionWrapper(lrn)
model = train(lrn2, sonar.task, subset = 1:140)
predictions = predict(model, newdata = getTaskData(sonar.task)[141:208, 1:60])
# CPO ---------------------------------------------------------------------
sonar.train = subsetTask(sonar.task, 1:140)
sonar.test = subsetTask(sonar.task, 141:208)
trafd = sonar.train %>>% cpoClassifViaRegr()
mod = train(lrn, trafd)
retr = sonar.test %>>% retrafo(trafd)
pred = predict(mod, retr)
invpred = invert(inverter(retr), pred)
identical(predictions$data$response, invpred$data$response)
The problem is that after the CPO has converted the task from twoclass to regr, there is no way for me to specify predict.type = 'prob'. In the case of the wrapper, the properties of the base regr learner are modified to accept predict.type = 'prob' (see here). But the CPO is unable to modify the learner in this way, so how can I tell my model to return predicted probabilities instead of the predicted response?
I was thinking I could specify an include.prob parameter, i.e. cpoClassifViaRegr(include.prob = TRUE). If set to TRUE, cpo.invert would return the predicted probabilities in addition to the predicted response. Would something like this work?

Topic label of each document in LDA model using textmineR

I'm using textmineR to fit an LDA model to documents, similar to https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html. Is it possible to get the topic label for each document in the data set?
library(textmineR)
data(nih_sample)
# create a document term matrix
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2),
                 stopword_vec = c(stopwords::stopwords("en"),
                                  stopwords::stopwords(source = "smart")),
                 lower = TRUE, remove_punctuation = TRUE, remove_numbers = TRUE,
                 verbose = FALSE, cpus = 2)
dtm <- dtm[, colSums(dtm) > 2]
set.seed(123)
model <- FitLdaModel(dtm = dtm, k = 20, iterations = 200, burnin = 180,
                     alpha = 0.1, beta = 0.05, optimize_alpha = TRUE,
                     calc_likelihood = TRUE, calc_coherence = TRUE,
                     calc_r2 = TRUE, cpus = 2)
Then adding the labels to the model:
model$labels <- LabelTopics(assignments = model$theta > 0.05, dtm = dtm, M = 1)
Now I want the topic labels for each of the 100 documents in nih_sample$ABSTRACT_TEXT.
Are you looking to label each document by the label of its most prevalent topic? If so, this is how you could do it:
# convert labels to a data frame so we can merge
label_df <- data.frame(topic = rownames(model$labels), label = model$labels, stringsAsFactors = FALSE)
# get the top topic for each document
top_topics <- apply(model$theta, 1, function(x) names(x)[which.max(x)][1])
# convert the top topics for each document so we can merge
top_topics <- data.frame(document = names(top_topics), top_topic = top_topics, stringsAsFactors = FALSE)
# merge together. Now each document has a label from its top topic
top_topics <- merge(top_topics, label_df, by.x = "top_topic", by.y = "topic", all.x = TRUE)
This kind of throws away some information that you'd get from LDA, though. One advantage of LDA is that each document can have more than one topic; another is that we can see how much of each topic is in a document. You can do that here by:
# set the plot margins to see the labels on the bottom
par(mar = c(8.1,4.1,4.1,2.1))
# barplot the first document's topic distribution with labels
barplot(model$theta[1,], names.arg = model$labels, las = 2)
