Classify data using the RTextTools package - r

I can't seem to figure out how to classify new data. So, after a
model has been built, I'd like to use it to classify new data as it
comes in.
Let's say I have a dataset called tweets, where tweets[,1] contains the text of each tweet and tweets[,2] is its polarity (positive or negative):
matrix = create_matrix(tweets[,1], language="english", minDocFreq=2,
                       stripWhitespace=TRUE, removeStopwords=TRUE, removeNumbers=TRUE)
container = create_container(matrix, as.numeric(as.factor(tweets[,2])),
                             trainSize=1:190000, testSize=190001:210000, virgin=FALSE)
models = train_models(container, algorithms=c("MAXENT", "SVM", "RF", "BAGGING", "TREE"))
results = classify_models(container, models)
Now I want to apply the different models I have created to a test dataset called newdf, where newdf[,1] is a vector of comments:
predMatrix <- create_matrix(newdf[,1], originalMatrix=matrix, language="english",
                            minDocFreq=2, stripWhitespace=TRUE, removeStopwords=TRUE,
                            removeNumbers=TRUE, weighting=tm::weightTf, stemWords=TRUE)
predSize <- length(newdf[,1])
predictionContainer <- create_container(predMatrix, labels=rep(0, predSize),
                                        testSize=1:predSize, virgin=TRUE)
results <- classify_models(predictionContainer, models)
Is this the right way to do it?
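One more detail: since the labels were converted with as.numeric(as.factor(tweets[,2])), I keep the factor levels around to translate the numeric predictions back into "negative"/"positive". A minimal sketch (the SVM_LABEL column name is an assumption; check names(results) on the actual output):
# level ordering used when the labels were encoded as numbers
label_levels <- levels(as.factor(tweets[, 2]))
# classify_models() returns one label/probability column pair per algorithm;
# SVM_LABEL is assumed here, so inspect names(results) for the exact names
predicted <- label_levels[as.numeric(as.character(results$SVM_LABEL))]
head(predicted)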

Related

Create a multivariate matrix in tidymodels recipes::recipe()

I am trying to do k-fold cross validation on a model that predicts the joint distribution of the proportion of tree species basal area from satellite imagery. This requires the DirichletReg::DirichReg() function, which in turn requires that the response variables be prepared as a matrix using the DirichletReg::DR_data() function. I originally tried to accomplish this with the caret package, but I found out that caret does not support multivariate responses. I have since tried to implement this in the tidymodels suite of packages.
Following the documentation on how to register a new model with parsnip (I appreciate Max Kuhn's vegetable humor), I created a "DREG" model and a "DR" engine. My registered model works when I simply call it on a single training dataset, but my goal is to do k-fold cross-validation using vfold_cv(), a workflow(), and the fit_resamples() function. With the code I currently have, I get the warning:
Warning message:
All models failed. See the `.notes` column.
Those notes state: Error in get(resp_char, environment(oformula)): object 'cbind(PSME, TSHE, ALRU2)' not found. This, I believe, is due to the use of DR_data() to preprocess the response variables into the format necessary for DirichletReg::DirichReg() to run properly. I think the solution involves getting this pre-processing to happen either in the recipe() call or in the set_fit() call when I register this model with parsnip. I have tried the step_mutate() function when specifying the recipe, but that applies the function to each column separately rather than taking the columns together as inputs. This leads to the following error in the "notes" from the output of fit_resamples():
Must subset columns with a valid subscript vector.
Subscript has the wrong type `quosures`.
It must be numeric or character.
Is there a way to get the recipe to transform several columns into a DirichletRegData object using DR_data(), either with a step_*() function or via the pre= argument in set_fit() and set_pred()?
Below is my reproducible example:
##Loading Necessary Packages##
library(tidymodels)
library(DirichletReg)
##Creating Fake Data##
set.seed(88)#For reproducibility
#Response variables#
PSME_BA<-rnorm(100,50, 15)
TSHE_BA<-rnorm(100,40,12)
ALRU2_BA<-rnorm(100,20,0.5)
Total_BA<-PSME_BA+TSHE_BA+ALRU2_BA
#Predictor variables#
B1<-runif(100, 0, 2000)
B2<-runif(100, 0, 1800)
B3<-runif(100, 0, 3000)
#Dataset for modeling#
DF<-data.frame(PSME=PSME_BA/Total_BA, TSHE=TSHE_BA/Total_BA, ALRU2=ALRU2_BA/Total_BA,
B1=B1, B2=B2, B3=B3)
##Modeling the data using Dirichlet regression with repeated k-folds cross validation##
#Registering the model to parsnip::#
set_new_model("DREG")
set_model_mode(model="DREG", mode="regression")
set_model_engine("DREG", mode="regression", eng="DR")
set_dependency("DREG", eng="DR", pkg="DirichletReg")
set_model_arg(
  model = "DREG",
  eng = "DR",
  parsnip = "param",
  original = "model",
  func = list(pkg = "DirichletReg", fun = "DirichReg"),
  has_submodel = FALSE
)
DREG <-
  function(mode = "regression", param = NULL) {
    # Check for correct mode
    if (mode != "regression") {
      rlang::abort("`mode` should be 'regression'")
    }
    # Capture the arguments in quosures
    args <- list(sub_classes = rlang::enquo(param))
    # Save some empty slots for future parts of the specification
    new_model_spec(
      "DREG",
      args = args,
      eng_args = NULL,
      mode = mode,
      method = NULL,
      engine = NULL
    )
  }
set_fit(
  model = "DREG",
  eng = "DR",
  mode = "regression",
  value = list(
    interface = "formula",
    protect = NULL,
    func = c(pkg = "DirichletReg", fun = "DirichReg"),
    defaults = list()
  )
)
set_encoding(
  model = "DREG",
  eng = "DR",
  mode = "regression",
  options = list(
    predictor_indicators = "none",
    compute_intercept = TRUE,
    remove_intercept = TRUE,
    allow_sparse_x = FALSE
  )
)
set_pred(
  model = "DREG",
  eng = "DR",
  mode = "regression",
  type = "numeric",
  value = list(
    pre = NULL,
    post = NULL,
    func = c(fun = "predict.DirichletRegModel"),
    args =
      list(
        object = expr(object$fit),
        newdata = expr(new_data),
        type = "response"
      )
  )
)
##Running the Model##
DF$Y<-DR_data(DF[,c(1:3)]) #Preparing the response variables
dreg_spec<-DREG(param="alternative") %>%
set_engine("DR")
dreg_mod<-dreg_spec %>%
fit(Y~B1+B2+B3, data = DF)#Model works when simply run on single dataset
##Attempting Crossvalidation##
#First attempt - simply call Y as the response variable in the recipe#
kfolds<-vfold_cv(DF, v=10, repeats = 2)
rcp<-recipe(Y~B1+B2+B3, data=DF)
dreg_fit<- workflow() %>%
add_model(dreg_spec) %>%
add_recipe(rcp)
dreg_rsmpl<-dreg_fit %>%
fit_resamples(kfolds)#Throws warning about all models failing
#second attempt - use step_mutate_at()#
rcp<-recipe(~B1+B2+B3, data=DF) %>%
step_mutate_at(fn=DR_data, var=vars(PSME, TSHE, ALRU2))
dreg_fit<- workflow() %>%
add_model(dreg_spec) %>%
add_recipe(rcp)
dreg_rsmpl<-dreg_fit %>%
fit_resamples(kfolds)#Throws warning about all models failing
This works, but I'm not sure if it's what you were expecting.
First: getting the data set up for CV and DR_data()
I don't know of any package that provides what would essentially be a translation layer between CV and DirichletReg, so that part is done manually. You might be surprised to find it's not all that complicated.
Using the data you created and the modeling objects you created for tidymodels (those prefixed with set_), I created the CV structure that you were trying to use.
df1 <- data.frame(PSME = PSME_BA/Total_BA, TSHE = TSHE_BA/Total_BA,
ALRU2=ALRU2_BA/Total_BA, B1, B2, B3)
set.seed(88)
kDf2 <- kDf1 <- vfold_cv(df1, v=10, repeats = 2)
For each of the 20 subset data frames identified in kDf2, I used DR_data to set the data up for the models.
# convert to DR_data (each folds and repeats)
df2 <- map(1:20,
.f = function(x){
in_ids = kDf1$splits[[x]]$in_id
dd <- kDf1$splits[[x]]$data[in_ids, ] # filter rows BEFORE DR_data
dd$Y <- DR_data(dd[, 1:3])
kDf1$splits[[x]]$data <<- dd
})
Because I'm not all that familiar with tidymodels, I next conducted the modeling directly with DirichReg. I then did it again with tidymodels and compared the two. (The output is identical.)
DirichReg Models and summaries of the fits
set.seed(88)
# perform crossfold validation on Dirichlet Model
df2.fit <- map(1:20,
.f = function(x){
Rpt = kDf1$splits[[x]]$id$id
Fld = kDf1$splits[[x]]$id$id2
daf = kDf1$splits[[x]]$data
fit = DirichReg(Y ~ B1 + B2, daf)
list(Rept = Rpt, Fold = Fld, fit = fit)
})
# summary of each fitted model
fit.a <- map(1:20,
.f = function(x){
summary(df2.fit[[x]]$fit)
})
tidymodels and summaries of the fits (the code looks much the same, but there are a few differences; the output is the same, though)
# I'm not sure what 'alternative' is supposed to do here?
dreg_spec <- DREG(param="alternative") %>% # this is not model = alternative
set_engine("DR")
set.seed(88)
dfa.fit <- map(1:20,
.f = function(x){
Rpt = kDf1$splits[[x]]$id$id
Fld = kDf1$splits[[x]]$id$id2
daf = kDf1$splits[[x]]$data
fit = dreg_spec %>%
fit(Y ~ B1 + B2, data = daf)
list(Rept = Rpt, Fold = Fld, fit = fit)
})
afit.a <- map(1:20,
.f = function(x){
summary(dfa.fit[[x]]$fit$fit) # extra nest for parsnip
})
If you wanted to see the first model?
fit.a[[1]]
afit.a[[1]]
If you wanted the model with the lowest AIC?
# compare AIC, BIC, and likelihood?
# what do you perceive best fit with?
fmin = min(unlist(map(1:20, ~fit.a[[.x]]$aic))) # dir
# find min AIC model number
paste0((map(1:20, ~ifelse(fit.a[[.x]]$aic == fmin, .x, ""))), collapse = "")
fit.a[[19]]
afit.a[[19]]
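Equivalently, which.min() gives that index directly, so the fold number does not have to be hard-coded:
# same selection as above, without pasting the index together by hand
best <- which.min(unlist(map(1:20, ~ fit.a[[.x]]$aic)))
fit.a[[best]]
afit.a[[best]]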

Parallel Processing packages in R with user function and multiple outcomes

I'm working on trying to make my model fitting procedure in R more efficient. Currently, I have all of my data generated with 1500 sims for 15 variables. This data is stored in an array, with each level being one sim, each row being one "person", and each column being one of the 15 variables (e.g., 300 x 15 x 1500). I then pass one layer of the array through mplusObject numerous times, fitting different LPA models (one class, two class, etc.). For each of these models, there are numerous outcomes that get reported and saved.
I've been working for a while now trying to figure out how to speed this up using parallel processing, given that the data is pre-generated and one layer of the array doesn't depend on the others. I'll show what I currently have below, but it isn't working, so I'm wondering if I need a different package. Thanks!
inp <- array(1:(300*15*1500), dim = c(300, 15, 1500)) # Really there's actual data here, not random values, but the data generation process is a whole other thing.
results <- matrix(NA, 1500, 129) # A results table for values to be written to, filled with NAs: 1500 simulations, 129 results.
num_sims=1500
foreach(i=1:num_sims, .packages=c('mclust','MplusAutomation')) %dopar% {
working <- inp[,,i]
sim_num=i
results[sim_num,1] = working[1,17] #number of groups
results[sim_num,2] = working[1,18] #sample size 1
results[sim_num,3] = working[1,19] #sample size 2
results[sim_num,4] = working[1,20] #sample size 3
results[sim_num,5] = working[1,21] #dist2
results[sim_num,6] = working[1,22] #dist3
df <- as.data.frame(working[,1:15])
lpa1_15 <- mplusObject(
TITLE = "1-Class LPA;",
VARIABLE = "USEVARIABLES = x01-x15;
CLASSES=c(1);",
ANALYSIS = "ESTIMATOR = MLR;
TYPE=MIXTURE;",
MODEL = "
%OVERALL%
x01-x15;
[x01-x15];
%c#1%
x01-x15;
[x01-x15];",
usevariables = c("x01", "x02", "x03", "x04", "x05",
"x06", "x07", "x08", "x09", "x10",
"x11", "x12", "x13", "x14", "x15"),
rdata = df)
lpa1_15_fit = mplusModeler(lpa1_15, "df.dat", modelout = "lpa1_15.inp", killOnFail = FALSE, run = 1L)
if (!is.null(lpa1_15_fit$results$summaries$LL)){
results[sim_num,7] = -2 * lpa1_15_fit$results$summaries$LL
results[sim_num,8] = lpa1_15_fit$results$summaries$BIC
results[sim_num,9] = lpa1_15_fit$results$summaries$aBIC
results[sim_num,10] = lpa1_15_fit$results$summaries$AIC
results[sim_num,11] = lpa1_15_fit$results$summaries$AICC}
lpa2_15 <- mplusObject(
TITLE = "2-Class LPA;",
VARIABLE = "USEVARIABLES = x01-x15;
CLASSES=c(2);",
ANALYSIS = "ESTIMATOR = MLR;
TYPE=MIXTURE;",
MODEL = "
%OVERALL%
x01-x15;
[x01-x15];
%c#1%
x01-x15;
[x01-x15];
%c#2%
x01-x15;
[x01-x15];",
OUTPUT = "TECH11;",
usevariables = c("x01", "x02", "x03", "x04", "x05",
"x06", "x07", "x08", "x09", "x10",
"x11", "x12", "x13", "x14", "x15"),
rdata = df)
lpa2_15_fit = mplusModeler(lpa2_15, "df.dat", modelout = "lpa2_15.inp", killOnFail = FALSE, run = 1L)
if (!is.null(lpa2_15_fit$results$summaries$LL)){
results[sim_num,12] = -2 * lpa2_15_fit$results$summaries$LL
results[sim_num,13] = lpa2_15_fit$results$summaries$BIC
results[sim_num,14] = lpa2_15_fit$results$summaries$aBIC
results[sim_num,15] = lpa2_15_fit$results$summaries$AIC
results[sim_num,16] = lpa2_15_fit$results$summaries$AICC
results[sim_num,17] = lpa2_15_fit$results$summaries$Entropy
if (!is.null(lpa2_15_fit$results$summaries$T11_VLMR_2xLLDiff)){
results[sim_num,18] = lpa2_15_fit$results$summaries$T11_VLMR_2xLLDiff
results[sim_num,19] = lpa2_15_fit$results$summaries$T11_VLMR_PValue
results[sim_num,20] = lpa2_15_fit$results$summaries$T11_LMR_Value
results[sim_num,21] = lpa2_15_fit$results$summaries$T11_LMR_PValue}
... and so on...
}
The results I got from running this were:
[[1]]
[1] 0.491
[[2]]
[1] 0.7037
I've tried using parallel, foreach with %dopar%, and parLapply, but just can't get them to work. The closest I got was with foreach, but it returned a single value for each iteration and none of the results were saved to the results table. I can provide the code for how I attempted these, but none of them really worked, so at this point I'm questioning whether it can be done (and if so, which method/approach is best for this setup).
I should also point out that the layers of the array can be run in any order (e.g., [,,1], [,,5], [,,3] is fine), but once a layer is selected, the full sequence of models should be run for it, because several tests compare the current model to the previous one (3 classes vs. 2 classes) for that dataset; so in that sense the models within a layer do have to be run in order.
Thanks for any help or suggestions you might have!
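One pattern worth trying: have each foreach iteration build and return its own results vector rather than assigning into a shared results matrix; assignments inside %dopar% happen in the workers' copies and are discarded. A minimal sketch of that pattern, where fit_one_layer() is a hypothetical wrapper around the mplusObject()/mplusModeler() sequence shown above:
library(foreach)
library(doParallel)

cl <- makeCluster(4)   # adjust the number of workers to your machine
registerDoParallel(cl)

# fit_one_layer() is a hypothetical wrapper around the 1-class/2-class/... sequence
# above: it takes one 300 x 15 slice and returns the 129 result values for that sim.
fit_one_layer <- function(working) {
  out <- rep(NA_real_, 129)
  # ... run the mplusModeler() calls here and fill `out`, exactly as in the loop above ...
  out
}

# Each iteration returns a vector; .combine = rbind stacks them, in order, into a
# 1500 x 129 matrix, so nothing has to be written to a shared `results` object.
results <- foreach(i = 1:num_sims, .combine = rbind,
                   .packages = c("mclust", "MplusAutomation")) %dopar% {
  fit_one_layer(inp[, , i])
}

stopCluster(cl)
One practical caveat: mplusModeler() writes .dat/.inp/.out files to disk, so each worker should use its own working directory (or filenames that include i) to keep parallel runs from overwriting each other's model files.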

Problems formatting data (biomod2)

I keep running into an error while trying to run the BIOMOD_FormatingData() function.
I have checked through all arguments and removed any NA values, the explanatory variables are the same for both the testing and training datasets (independent datasets), and I've generated pseudo-absence data for the evaluation dataset (included in eval.resp.var).
Has anyone run into this error before, and if so, what was the issue related to? This is my first time using biomod2 for ensemble modelling and I've run out of ideas as to what could be causing this error!
Here is my script and the subsequent error:
library(biomod2)
geranium_data <-
BIOMOD_FormatingData(
resp.var = SG.occ.train['Geranium.lucidum'],
resp.xy = SG.occ.train[, c('Longitude', 'Latitude')],
expl.var = SG.variables,
resp.name = "geranium_data",
eval.resp.var = SG.test.data['Geranium.lucidum'],
eval.expl.var = SG.variables,
eval.resp.xy = SG.test.data[, c('Longitude', 'Latitude')],
PA.nb.rep = 10,
PA.nb.absences = 4650,
PA.strategy = 'random',
na.rm = TRUE
)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= geranium_data Data Formating -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Response variable name was converted into geranium.data
> Pseudo Absences Selection checkings...
> random pseudo absences selection
> Pseudo absences are selected in explanatory variables
Error in `names<-`(`*tmp*`, value = c("calibration", "validation")) :
  incorrect number of layer names

R: how to map test data into lsa space created by training data

I am trying to do text analysis using LSA. I've read many other posts regarding LSA on Stack Overflow, but I have not found one similar to mine yet. If you know of one, please kindly redirect me to it! Much appreciated!
Here's my reproducible code, with sample data created below.
creating sample data train & test sets (plus the libraries the rest of the code needs)
library(data.table)
library(tm)
library(lsa)
library(SnowballC)  # for wordStem()
library(caret)

sentiment = c(1,1,0,1,0,1,0,0,1,0)
length(sentiment) #10
text = c('im happy', 'this is good', 'what a bummer X(', 'today is kinda okay day for me', 'i somehow messed up big time',
'guess not being promoted is not too bad :]', 'stayhing home is boring :(', 'kids wont stop crying QQ', 'warriors are legendary!', 'stop reading my tweets!!!')
train_data = data.table(as.factor(sentiment), text)
> train_data
sentiment text
1: 1 im happy
2: 1 this is good
3: 0 what a bummer X(
4: 1 today is kinda okay day for me
5: 0 i somehow messed up big time
6: 1 guess not being promoted is not too bad :]
7: 0 stayhing home is boring :(
8: 0 kids wont stop crying QQ
9: 1 warriors are legendary!
10: 0 stop reading my tweets!!!
sentiment = c(0,1,0,0)
text = c('running out of things to say...', 'if you are still reading, good for you!', 'nothing ended on a good note today', 'seriously sleep deprived!! >__<')
test_data = data.table(as.factor(sentiment), text)
> test_data
sentiment text
1: 0 running out of things to say...
2: 1 if you are still reading, good for you!
3: 0 nothing ended on a good note today
4: 0 seriously sleep deprived!! >__<
preprocessing for training data set
corpus.train = Corpus(VectorSource(train_data$text))
create a term document matrix for training set
tdm.train = TermDocumentMatrix(
corpus.train,
control = list(
removePunctuation = TRUE,
stopwords = stopwords(kind = "en"),
stemming = function(word) wordStem(word, language = "english"),
removeNumbers = TRUE,
tolower = TRUE,
weighting = weightTfIdf)
)
convert into matrix (for later use)
train_matrix = as.matrix(tdm.train)
create an lsa space using train data
lsa.train = lsa(tdm.train, dimcalc_share())
set dimension # (i randomly picked one here b/c the data size is too small to create an elbow shape)
k = 6
project train matrix into the new LSA space
projected.train = fold_in(docvecs = train_matrix, LSAspace = lsa.train)[1:k,]
convert above projected data into a matrix
projected.train.matrix = matrix(projected.train,
nrow = dim(projected.train)[1],
ncol = dim(projected.train)[2])
train the random forest model (somehow this step does not work anymore with this small sample data... but it's okay, it won't be a big problem for this question; however, if you can help me with this error too, that'd be fantastic! I tried googling for it but it's just not fixed...)
trcontrol_rf = trainControl(method = "boot", p = .75, trim = T)
model_train_caret = train(x = t(projected.train.matrix), y = train_data$sentiment, method = "rf", trControl = trcontrol_rf)
preprocessing for test data set
Basically I'm repeating whatever I did to the training data set, except I did not use the test set to create its own LSA space.
corpus.test = Corpus(VectorSource(test_data$text))
create a term document matrix for test set
tdm.test = TermDocumentMatrix(
corpus.test,
control = list(
removePunctuation = TRUE,
stopwords = stopwords(kind = "en"),
stemming = function(word) wordStem(word, language = "english"),
removeNumbers = TRUE,
tolower = TRUE,
weighting = weightTfIdf)
)
convert into matrix (for later use)
test_matrix = as.matrix(tdm.test)
project test matrix into the trained LSA space (here's where the question is)
projected.test = fold_in(docvecs = test_matrix, LSAspace = lsa.train)
But I'd get an error:
Error in crossprod(docvecs, LSAspace$tk) : non-conformable arguments
I am not finding any useful Google search results regarding this error... (there's only one page of search results from Google QQ)
Any help is much appreciated! Thank you!
When you build the LSA model you are using the vocabulary of the training data. But when you build the TermDocumentMatrix for the test data, you are using the vocabulary of the test data. The LSA model only knows how to handle documents tabulated against the vocabulary of the training data.
One way to remedy this is to create your test TDM with dictionary set to the vocabulary of the training data:
tdm.test = TermDocumentMatrix(
  corpus.test,
  control = list(
    removeNumbers = TRUE,
    tolower = TRUE,
    stopwords = stopwords("en"),
    stemming = TRUE,
    removePunctuation = TRUE,
    weighting = weightTfIdf,
    dictionary = rownames(tdm.train)
  )
)
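With the vocabularies aligned this way, the fold-in step from the question should go through, since the test matrix now has one row per training-vocabulary term:
test_matrix = as.matrix(tdm.test)
# project the test documents into the LSA space fitted on the training data,
# keeping the same k dimensions used for the training projection
projected.test = fold_in(docvecs = test_matrix, LSAspace = lsa.train)[1:k, ]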

How to incorporate logLoss in caret

I'm attempting to incorporate logLoss as the performance measure used when tuning randomForest (other classifiers) by way of caret (instead of the default options of Accuracy or Kappa).
The first R script executes without error using defaults. However, I get:
Error in { : task 1 failed - "unused argument (model = method)"
when using the second script.
The function logLoss(predict(rfModel,test[,-c(1,95)],type="prob"),test[,95]) works when applied to a separately trained randomForest model.
The dataframe has 100+ columns and 10,000+ rows. All elements are numeric outside of the 9-level categorical "target" at col=95. A row id is located in col=1.
Unfortunately, I'm not correctly grasping the guidance provided by http://topepo.github.io/caret/training.html, nor am I having much luck via Google searches.
Your help is greatly appreciated.
Working R script:
fitControl = trainControl(method = "repeatedcv",number = 10,repeats = 10)
rfGrid = expand.grid(mtry=c(1,9))
rfFit = train(target ~ ., data = train[,-1],method = "rf",trControl = fitControl,verbose = FALSE,tuneGrid = rfGrid)
Not working R script:
logLoss = function(data, lev = NULL, method = NULL) {
  lLoss = 0
  epp = 10^-15
  for (i in 1:nrow(data)) {
    index = as.numeric(lev[i])
    p = max(min(data[i, index], 1 - epp), epp)
    lLoss = lLoss - log(p)
  }
  lLoss = lLoss / nrow(data)
  names(lLoss) = c("logLoss")
  lLoss
}
fitControl = trainControl(method = "repeatedcv",number = 10,repeats = 10,summaryFunction = logLoss)
rfGrid = expand.grid(mtry=c(1,9))
rfFit = train(target ~ ., data = trainBal[,-1],method = "rf",trControl = fitControl,verbose = FALSE,tuneGrid = rfGrid)
I think you should set summaryFunction=mnLogLoss in trainControl and metric="logLoss" in train (I found it here). Like this:
# load libraries
library(caret)
# load the dataset
data(iris)
# prepare resampling method
control <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(7)
fit <- train(Species~., data=iris, method="rf", metric="logLoss", trControl=control)
# display results
print(fit)
Also, your argument name is not correct (hence "unused argument (model = method)"): the webpage says the last function argument should be called model, not method.
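If you do want to keep a hand-rolled summary function, here is a minimal corrected sketch (same intent as the original, but with the data/lev/model signature caret expects; the logLossSummary name and the use of the class-probability columns are my own, so double-check against your data):
logLossSummary = function(data, lev = NULL, model = NULL) {
  eps = 1e-15
  # caret names the class-probability columns after the factor levels, so pick,
  # for each row, the probability assigned to the class that was actually observed
  p = sapply(seq_len(nrow(data)), function(i) data[i, as.character(data$obs[i])])
  p = pmin(pmax(p, eps), 1 - eps)
  out = -mean(log(p))
  names(out) = "logLoss"
  out
}
fitControl = trainControl(method = "repeatedcv", number = 10, repeats = 10,
                          classProbs = TRUE, summaryFunction = logLossSummary)
rfFit = train(target ~ ., data = trainBal[,-1], method = "rf", trControl = fitControl,
              verbose = FALSE, tuneGrid = rfGrid, metric = "logLoss", maximize = FALSE)
Note that classProbs = TRUE is needed so the per-class probability columns are available to the summary function, and maximize = FALSE tells train() that a lower log loss is better.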
