I am building an machine learning text classification model in R. I want to classify the sentence into more than one label if it falls into multiple categories.
e.g.: "The phone screen resolution is awesome and the battery life as well" - currently I am able to classify the sentence into either Battery or Phone feature category but I want it to be classified into both.
The output can be like:
It will be great if anyone can help me with ideas or methods to get the above result.
I would suggest training a binary classifier for each label.
With some algorithms - like logistic regression - all you can do is train every binary classifier independently.
There are also so-called multilabel algorithms - they train all binary classifiers at the same time, and extract the same features from data for every classifier. An example is a neural network with a sigmoid last layer. See "support multilabel" section in http://scikit-learn.org/stable/modules/multiclass.html for a list of multilabel algorithms.
Of course, a multilabel algorithm will not necessarily outperform logistic regression, you have to try and see what works best for your problem.
Related
As said above I'm trying to create a model for detecting spam emails based on word occurrences. my information from my dataset is as follows:
about 2800 variables representing each word and the frequency of their occurrences
binary spam variable 1 for spam 0 for legit
I've been using online resources but can only find logistic regression and NN tutorials for much smaller datasets, which seem much simpler in comparison. So far I've totaled up the total words for spam and non spam to analyze, but I'm having trouble creating the model itself
Does anyone have any sources or insight on how to manage this with a much larger dataset?
Apologies for the simple question (if it is so) I appreciate any advice.
A classical approach uses a generalised linear model (GLM) with a penalty for the number of variables. The GLM will be the logistic regression model in this case. The classic approach for the penalty is the LASSO, ridge regression and elastic net techniques. The shrinkage in your parameter values may be such that no parameters are selected to be predictive if your ratio of the number of variables (p) to the number of samples (N) is too high. Some parameters can control the shrinkage for that. It's overall a well studied topic. Your questions haven't asked about the programming language you will use, but you may find helpful packages in Python, R, Julia and other widespread data science programming languages. There will also be a lot of information in the CV community.
I would start analysing each variable individually. I would implement a logistic regression for each one, and remain only with those whose p-value is really significative.
After this first step, then you can run a more complex logistic regression model, where you include the remaining variables in the first step.
In Azure ML, I have a predictive regression model using boosted decision tree regression and it is reasonably accurate.
The input dataset has over 450 columns and the model has done a good job of predicting against test data sets, without over-fitting.
To report on the result i need to know what features/columns the model mainly used to make predictions but i cant find this information easily when looking at the trained model data.
How do i identify this information? Im happy to import the result dataset into R to help find this but I just need pointers on what direction to start working in.
Mostly, in using Microsoft Azure Machine Learning, when looking at the features that is mainly used to make predictions, it is found on the output of the Train Model module.
But on using Decision Trees as your algorithm, the output of your Train Model module would be the constructed 'trees' of the algorithm, and it looks like this:
To know the features that made impact on predictions while using Decision Trees algorithms, you can use the Permutation Feature Importance module. Look at the sample experiment below:
The parameters of Permutation Feature Importance are Random Seed and Metric for Measuring Performance (in this case, Regression - Coefficient of Determination)
The left input of Permutation Feature Importance is your trained model, and the right input is your test data.
The output of Permutation Feature Importance looks like this:
You can add Execute R Script to extract the Features and Scores from Permutation Feature Importance module.
I have a data mining problem and would like to have your suggestion/opinion on the approach part.
It is a multi class problem, I need to build classifier and for a new data point, algorithm should be able to recognize whether the data point belongs to existing classes or it belongs to new class(C+1).
Current approach I am following is, if probability for a particular class is >60% then the record gets classified in to that class and if none of the classes have >60% probability then the record will be classified in to New class(C+1).
But the accuracy for the New class recognition is low(~30 to 40%). I have used C5.0 boosted decision tree algorithm.
95% of the features have binary data.
Could any one please suggest any other alternative approach / algorithm for this.
Sri
There many supervised classification alternatives , for the case of R one excelent option is support vector machine classification using the e1071 package. I would also suggest to try and evaluate softmax neural networks.
I would like to fit an LSTM model using MXNET in R for the purpose of predicting a continuous response (i.e., regression) given several continuous predictors. However, the mx.lstm() function seems to be geared toward NLP as it requires arguments which don't seem applicable to a regression problem (such as those related to embedding).
Is MXNET capable of this sort of modeling and, if not, what is an example of an appropriate tool (preferably in R)? Are there any tutorials relevant to the problem I've described?
LSTM is used for working with temporal data: text, speech, time series. If you want to predict a continuous response, then I assume you want to do something similar to time series analysis.
If my assumption is correct, then, please, take a look here. It gives quite a good example on how to use MxNet with R for time series on CPU. The GPU version is also available here.
I am building a language model in R to predict a next word in the sentence based on the previous words. Currently my model is a simple ngram model with Kneser-Ney smoothing. It predicts next word by finding ngram with maximum probability (frequency) in the training set, where smoothing offers a way to interpolate lower order ngrams, which can be advantageous in the cases where higher order ngrams have low frequency and may not offer a reliable prediction. While this method works reasonably well, it 'fails in the cases where the n-gram cannot not capture the context. For example, "It is warm and sunny outside, let's go to the..." and "It is cold and raining outside, let's go to the..." will suggest the same prediction, because the context of weather is not captured in the last n-gram (assuming n<5).
I am looking into more advanced methods and I found text2vec package, which allows to map words into vector space where words with similar meaning are represented with similar (close) vectors. I have a feeling that this representation can be helpful for the next word prediction, but i cannot figure out how exactly to define the training task. My quesiton is if text2vec is the right tool to use for next word prediction and if yes, what is the suitable prediction algorithm that can be used for this task?
You can try char-rnn or word-rnn (google a little bit).
For character-level model R/mxnet implementation take a look to mxnet examples. Probably it is possible to extend this code to word-level model using text2vec GloVe embeddings.
If you will have any success, let us know (I mean text2vec or/and mxnet developers). I will be very interesting case for R community. I wanted to perform such model/experiment, but still haven't time for that.
There is one implemented solution as an complete example using word embeddings. In fact, the paper from Makarenkov et al. (2017) named Language Models with Pre-Trained (GloVe) Word Embeddings presents a step-by-step implementation of training a Language Model, using Recurrent Neural Network (RNN) and pre-trained GloVe word embeddings.
In the paper the authors provide the instructions to run de code:
1. Download pre-trained GloVe vectors.
2. Obtain a text to train the model on.
3. Open and adjust the LM_RNN_GloVe.py file parameters inside the main
function.
4. Run the following methods:
(a) tokenize_file_to_vectors(glove_vectors_file_name, file_2_tokenize_name,
tokenized_file_name)
(b) run_experiment(tokenized_file_name)
The code in Python is here https://github.com/vicmak/ProofSeer.
I also found that #Dmitriy Selivanov recently published a nice and friendly tutorial using its text2vec package which can be useful to address the problem from the R perspective. (It would be great if he could comment further).
Your intuition is right that word embedding vectors can be used to improve language models by incorporating long distance dependencies. The algorithm you are looking for is called RNNLM (recurrent neural network language model). http://www.rnnlm.org/