I am working on a multiclass text classification problem and have built a gradient boosting model for it.
About the dataset:
The dataset has two columns: "Test_Name" and "Description".
There are six labels in the "Test_Name" column, with their corresponding descriptions in the "Description" column.
My approach towards the problem
DATA PREPARATION
Create a word vector from the descriptions.
Build a corpus using the word vector.
Apply pre-processing such as removing numbers, whitespace and stopwords, and converting to lower case.
Build a document-term matrix (DTM).
Remove sparse terms from the DTM.
The above step leaves a count-frequency matrix showing the frequency of each word in each document.
Transform the count-frequency matrix into a binary incidence matrix, which records the occurrence of a word in a document as either 1 (present) or 0 (absent).
Append the label column from the original notes dataset to the transformed DTM. The label column has 6 labels. (A sketch of these steps is shown below.)
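A minimal sketch of those steps with the tm package, assuming the original data frame is called notes (the object names here are illustrative assumptions, not the original code):
library(tm)

corpus <- VCorpus(VectorSource(notes$Description))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)                      # drop very sparse terms

binary_dtm <- as.data.frame(1 * (as.matrix(dtm) > 0))    # 1 = word present, 0 = absent
binary_dtm$Test_Name <- notes$Test_Name                  # append the label column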
Model Building
Using the H2O package, build a GBM model.
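A minimal sketch of what that looks like with h2o.gbm, assuming the binary DTM with its label column is in a data frame called binary_dtm (names are illustrative, not the original code):
library(h2o)
h2o.init()

train <- as.h2o(binary_dtm, destination_frame = "train")
train$Test_Name <- as.factor(train$Test_Name)   # make the target a factor for multinomial GBM

gbm_model <- h2o.gbm(x = setdiff(names(train), "Test_Name"),
                     y = "Test_Name",
                     training_frame = train)

h2o.confusionMatrix(gbm_model, train = TRUE)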
Results obtained
Four of the class labels are classified well, but the remaining two are poorly classified.
Below is the output:
Extract training frame with `h2o.getFrame("train")`
MSE: (Extract with `h2o.mse`) 0.1197392
RMSE: (Extract with `h2o.rmse`) 0.3460335
Logloss: (Extract with `h2o.logloss`) 0.3245868
Mean Per-Class Error: 0.3791268
Confusion Matrix: (Extract with `h2o.confusionMatrix(<model>, train = TRUE)`)
Body Fluid Analysis = 401 / 2,759
Cytology Test = 182 / 1,087
Diagnostic Imaging = 117 / 3,907
Doctors Advice = 32 / 752
Organ Function Test = 461 / 463
Patient Related = 101 / 113
Totals = 1,294 / 9,081
The misclassification errors for Organ Function Test and Patient Related are much higher than for the other classes. How can I fix this?
Just some quick things you can do to improve this:
Look at performance metrics on a validation set, including the confusion matrix
Perhaps try hyperparameter tuning to improve performance on your task, using h2o.grid (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html); see the sketch after this list
Consider using h2o.word2vec for feature generation (demo: https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/rdemo.word2vec.craigslistjobtitles.R)
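A minimal h2o.grid sketch for a GBM like the one above, assuming the train frame and Test_Name target from the question; the hyperparameter values are only placeholders to show the mechanics, not tuned recommendations:
hyper_params <- list(max_depth  = c(3, 5, 7),
                     learn_rate = c(0.01, 0.05, 0.1),
                     ntrees     = c(100, 300))

grid <- h2o.grid("gbm",
                 x = setdiff(names(train), "Test_Name"),
                 y = "Test_Name",
                 training_frame = train,
                 nfolds = 5,
                 hyper_params = hyper_params)

# rank the grid models by the metric you care about
h2o.getGrid(grid@grid_id, sort_by = "mean_per_class_error", decreasing = FALSE)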
If you provide more details and a working example there is more that can be done to help you.
I am using the R package machisplin (it's not on CRAN) to downscale a satellite image. According to the description of the package:
The machisplin.mltps function simultaneously evaluates different combinations of the six algorithms to predict the input data. During model tuning, each algorithm is systematically weighted from 0-1 and the fit of the ensembled model is evaluated. The best performing model is determined through k-fold cross validation (k=10) and the model that has the lowest residual sum of squares of test data is chosen. After determining the best model algorithms and weights, a final model is created using the full training dataset.
My question is how can I check which model out of the 6 has been selected for the downscaling? To put it differently, when I export the downscaled image, I would like to know which algorithm (out of the 6) has been used to perform the downscaling.
Here is the code:
library(MACHISPLIN)
library(raster)
library(gbm)
evi = raster("path/evi.tif") # covariate
ntl = raster("path/ntl_1600.tif") # raster to be downscaled
##convert one of the rasters to a point dataframe to sample. Use any raster input.
ntl.points<-rasterToPoints(ntl,
fun = NULL,
spatial = FALSE)
##subset only the x and y data
ntl.points<- ntl.points[,1:2]
##Extract values to points from rasters
RAST_VAL<-data.frame(extract(ntl, ntl.points))
##merge sampled data to input
InInterp<-cbind(ntl.points, RAST_VAL)
#run an ensemble machine learning thin plate spline
interp.rast<-machisplin.mltps(int.values = InInterp,
covar.ras = evi,
smooth.outputs.only = T,
tps = T,
n.cores = 4)
#set negative values to 0
interp.rast[[1]]$final[interp.rast[[1]]$final <= 0] <- 0
writeRaster(interp.rast[[1]]$final,
filename = "path/ntl_splines.tif")
I viewed all the output parameters (please refer to Example 2 in the package description) but I couldn't find anything relevant to my question.
I have posted a question on GitHub as well. From here you can download my images.
I think this is a misunderstanding: machisplin isn't testing the 6 algorithms and returning one of them; it is trying many ensembles of the 6 and returning one ensemble. In other words, what you get is the best combination of the 6 algorithms, not one of the 6 algorithms chosen on its own.
The result will be something like "a model which is 20% algo1, 10% algo2, etc." and not "algo1 is the best and was chosen".
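If you want to see how the returned object describes that ensemble (e.g. the per-algorithm weights), a generic way is to inspect it with base R; the component names vary by package version, so go by what names() actually reports rather than any names assumed here:
# list the components the result object carries
names(interp.rast[[1]])

# full structure (can be verbose); look for anything describing models or weights
str(interp.rast[[1]], max.level = 2)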
We have a .model file that has an xgboost model. Here's a snippet of our code loading the model:
> xg_model <- xgb.load("../model_outputs/our_saved_model.model")
> xg_model
##### xgb.Booster
raw: 1.6 Mb
xgb.attributes:
niter
niter: 149
I didn't create this model, but I am tasked with passing new data to the model in order to make predictions. Unfortunately, I am hitting this error:
Error in predict.xgb.Booster(xg_model, xgb.DMatrix(as.matrix(our_dataframe_of_data))) :
[01:34:01] amalgamation/../src/learner.cc:1183: Check failed: learner_model_param_.num_feature >= p_fmat->Info().num_col_ (38 vs. 40) : Number of columns does not match number of features in booster.
... so it's clear that our dataframe has 40 columns, but this model is trained to expect a dataframe with 38 columns. What's unclear is exactly which 38 columns our xg_model is expecting. Is there a function to call / plot to graph / etc. that might show what 38 columns the model was trained on? We only have the trained model currently, but not the R code that trained the model...
What's your XGBoost version? It's important to know, because XGBoost "schema specification" has been evolving quite significantly.
Right now, you should explore what attributes are available on your xgb.Booster object. See if it has nfeatures and feature_names attributes defined:
print(xg_model$nfeatures)
print(xg_model$feature_names)
I believe your xgb.Booster object has these attributes available, because how else would it know to demand 38 features?
I had the same issue; I was able to solve it by extracting the model features like this:
ModelVars <- xgb.importance(feature_names = colnames(our_dataframe_of_data), model = xg_model)
After this it was just a matter of subsetting my dataframe to the features listed in ModelVars. I was then able to use the predict function and get the scores, even though, as expected, the number of features was smaller than in the training dataset.
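A hedged sketch of that subsetting step, assuming the ModelVars and our_dataframe_of_data objects from above (Feature is the column name xgb.importance() uses for the feature names):
library(xgboost)

# keep only the columns the booster reports as features
model_features <- ModelVars$Feature
new_matrix <- as.matrix(our_dataframe_of_data[, model_features, drop = FALSE])

preds <- predict(xg_model, xgb.DMatrix(new_matrix))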
Context
Thanks in advance for your help. I have run a dataset through the LDA function in Jonathan Chang's 'lda' package (N.B. this is different from the 'topicmodels' package). Below is a reproducible example, which uses the cora dataset that comes with the 'lda' package.
library(lda)
data(cora.documents) #list of words contained in each of the 2,410 documents
data(cora.vocab) #vocabulary list of words that occur at least once across all documents
Thereafter, I conduct the actual LDA by setting the different parameters and running the code.
#parameters for LDA algorithm
K <- 20 #number of topics to be modelled from the corpus, "K"
G <- 1000 #number of iterations to cover - the higher this number, the more likely the data converges, "G"
alpha <- 0.1 #document-topic distributions, "alpha"
beta <- 0.1 #topic-term distributions, "beta/eta"
#creates an LDA fit based on above parameters
lda_fit <- lda.collapsed.gibbs.sampler(cora.documents, cora.vocab, K = 20,
num.iterations = G, alpha, beta)
Following this, we examine one component of the LDA output called document_sums. This component shows, for each document, how many of its words were allocated to each of the 20 topics (based on the K value I chose). For instance, one document may have 4 words allocated to Topic 3 and 12 words allocated to Topic 19, in which case the document is assigned to Topic 19.
#gives raw figures for no. of times each document (column) had words allocated to each of 20 topics (rows)
document_sums <- as.data.frame(lda_fit$document_sums)
document_sums[1:20, 1:20]
Question
However, what I want to do is essentially use the principle of fuzzy membership. Instead of allocating each document to the topic it contains the most words from, I want to extract the probability that each document belongs to each topic. document_sums is quite close to this, but I still have to do some processing on the raw data.
Jonathan Chang, the creator of the 'lda' package, himself says this in this thread:
n.b. If you want to convert the matrix to probabilities just row normalize and add the smoothing constant from your prior. The function here just returns the raw number of assignments in the last Gibbs sampling sweep.
Separately, another reply on another forum reaffirms this:
The resulting document_sums will give you the (unnormalized) distribution over topics for the test documents. Normalize them, and compute the inner product, weighted by the RTM coefficients to get the predicted link probability (or use predictive.link.probability)
And thus, my question is, how do I normalise my document_sums and 'add the smoothing constant'? These I am unsure of.
As asked: you need to add the prior to the matrix of counts and then divide each row by its total. Note that document_sums has topics in rows and documents in columns, so transpose it first if you want one row per document. For example
theta <- t(document_sums) + alpha
theta <- theta / rowSums(theta)
You'll need to do something similar for the matrix of counts relating words to topics.
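The same recipe applies on the word-topic side; in the 'lda' package those counts are in lda_fit$topics (a K x V matrix), and the prior there is beta/eta:
phi <- lda_fit$topics + beta
phi <- phi / rowSums(phi)   # each row is now a distribution over the vocabulary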
However, if you're using LDA, may I suggest you check out textmineR? It does this normalization (and other useful things) for you. I originally wrote it as a wrapper for the 'lda' package, but have since implemented my own Gibbs sampler to enable other features. Details on using it for topic modeling are in the third vignette.
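For reference, a rough sketch of the textmineR route, assuming a character vector docs of raw documents; check the package documentation for the exact arguments in your installed version:
library(textmineR)

# sparse document-term matrix straight from raw text
dtm <- CreateDtm(doc_vec = docs, doc_names = seq_along(docs))

# fit LDA; the result already contains normalized theta (documents x topics) and phi (topics x words)
model <- FitLdaModel(dtm = dtm, k = 20, iterations = 1000, alpha = 0.1, beta = 0.1)
head(model$theta)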
I have a file containing survey data. For example, the file looks like this:
IDNUMBER AGE SEX NumPrescr OnPrescr SURV_WGT BSW1 BSW2....BSW500
123456 22 1 6 1 ... ... ... ...
Here, OnPrescr is a binary variable indicating whether or not the subject is on prescription meds, BSW1 - BSW500 are the bootstrap weights, and SURV_WGT is the survey weight per subject. There are roughly 20,000 entries.
I am tasked with creating tables of various statistics within certain age-gender group breakdowns. For example, how many males from 17 to 24 are on prescription medications. And I need a count N and 95% CI for each of these types of calculations. I'm not familiar at all with survey methods.
From what I understand, I can't simply add up the number of people in each category to get the final count N for each question/category (i.e., I cannot just add up all the males 17 to 24 who are using prescription meds). Instead, I have to take the survey weights and bootstrap weights into account when constructing my final count N and confidence intervals.
I was then told in STATA this is a one line command:
svyset [pw=SURV_WGT], brr(bsw1-bsw500)
I am working in R however. What is the equivalent command in R and what exactly is the above command doing?
PS: My sample of roughly 20,000 individuals is drawn from a population of roughly 35 million.
You will want to use the survey package in R. This will be your best friend for weighted/complex survey analysis in R.
install.packages("survey")
The survey package has two main steps to your analysis. The first is creating the svydesign object, which stores information about your survey design including weights, replicate weights, data, etc. Then use any number of analysis functions to run analysis/descriptives on those design objects (e.g., svymean, svyby - for subgroup analysis, svyglm, and many more).
Based on your question, you have survey weights and replicate weights (bootstrapped). While the more common svydesign function is used for surveys with a single set of weights, you want to use svrepdesign, which will allow you to specify survey weights and replicate weights. Check out the documentation, but here is what you can do:
mydesign <- svrepdesign(data = mydata,
weights = ~SURV_WGT,
repweights = "BSW[0-9]+",
type = "bootstrap",
combined.weights = TRUE)
You should read the documentation, but briefly: data is your data frame; weights takes your single survey weight vector, usually as a formula; repweights is great in that it accepts a regex string that identifies all the replicate weight columns in your data by column name; type tells the design what your replicate weights are (how they were derived); and combined.weights is a logical for whether the replicate weights contain the sampling weights - I assume this is true here, but it may not be.
From this design object, you can then run analysis. E.g., let's calculate the average number of prescriptions by sex:
myresult <- svyby(~NumPrescr, # variable to pass to function
by = ~SEX, # grouping
design = mydesign, # design object
vartype = "ci", # report variation as confidence interval
FUN = svymean # specify function from survey package, mean here
)
Hope this helps!
EDIT: if you want to look at something by age groups, as you suggest, you need to create a character or factor variable that is coded for each age group and use that new variable in your svyby call.
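For example, a minimal sketch of that step (the age break points and labels are illustrative assumptions):
# add an age-group factor to the data used in the design
mydata$AGEGRP <- cut(mydata$AGE,
                     breaks = c(16, 24, 44, 64, Inf),
                     labels = c("17-24", "25-44", "45-64", "65+"))

# rebuild the design so it carries the new variable
mydesign <- svrepdesign(data = mydata,
                        weights = ~SURV_WGT,
                        repweights = "BSW[0-9]+",
                        type = "bootstrap",
                        combined.weights = TRUE)

# weighted count of people on prescription meds by age group and sex, with 95% CI
svyby(~OnPrescr, by = ~AGEGRP + SEX, design = mydesign, FUN = svytotal, vartype = "ci")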
I have some dummy data that consists of 99 rows; one column is free-text data and one column is the category. Each row has been categorised as either Customer Service or Not Customer Service related.
I passed the 99 rows of data into my R script, created a corpus, cleaned and parsed my data, and converted it to a DocumentTermMatrix. I then converted my DTM to a dataframe to make it easier to view, and bound the category to this new dataframe. I then split it roughly 50/50, so 50 rows went into my training set and 49 into my testing set. I also pulled out the category.
train <- sample(nrow(mat.df), ceiling(nrow(mat.df) * .5))
test <- (1:nrow(mat.df))[- train]
cl <- mat.df[, "category"]
I then created a model dataset with the category column stripped out and passed it to knn():
knn.pred <- knn(modeldata[train, ], modeldata[test, ], cl[train])
conf.mat <- table("Predictions" = knn.pred, Actual = cl[test])
conf.mat
I can then work out the accuracy, generate a cross table or export the predictions to test the accuracy of the model.
The bit I am struggling to get my head around at the moment is how to use the model going forward for new data.
So if I then have 10 new rows of free-text data that haven't been manually classified, how do I run the KNN model I have just created to classify this additional data?
Maybe I am just misunderstanding the next step.
Thanks,
The same way you just found the hold-out test performance:
knn.pred.newdata <- knn(modeldata[train, ], completely_new_data, cl[train])
In a KNN model, your training data is intrinsically part of your model. Since it's just finding the nearest training points, how do you know which those are if you don't have their coordinates?
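One practical point the question implies: the new free-text rows have to go through the same preprocessing and be projected onto the same vocabulary as the training DTM before they can be passed to knn(). A hedged sketch with the tm package, where new_text, dtm, modeldata, train and cl are assumed objects matching the description above:
library(tm)
library(class)

# the new rows must be cleaned exactly like the training data first
new_corpus <- VCorpus(VectorSource(new_text))
# ... apply the same tm_map() cleaning steps used for the training corpus ...

# restrict the new DTM to the training vocabulary so the columns line up with modeldata
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = Terms(dtm)))
completely_new_data <- as.data.frame(as.matrix(new_dtm))

knn.pred.newdata <- knn(modeldata[train, ], completely_new_data, cl[train])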
That said, why do you want to use a KNN model instead of something more modern (SVM, Random forest, Boosted trees, neural networks)? KNN models scale extremely poorly with the number of data points.