Plotting a dataset and an SVM model for text classification

For an educational project I have to create a model for text classification using RapidMiner.
I downloaded a dataset with two attributes, "text" and "sentiment", containing respectively the text of some tweets and their sentiment (positive or negative).
First of all, I would like to know whether there is a way to plot this dataset to see the data distribution, using Python or any other tool, and whether I have to compute extra attributes (for example polarity and subjectivity, or anything else) to see this distribution. My teacher wants me to plot the data first in order to understand whether the distribution is linear or not.
Secondly, I would like to know how to plot the results of an SVM model on this dataset.
I've already built the model using LibSVM in RapidMiner, but this software doesn't provide any way to plot the model results together with the decision boundaries.
I tried to calculate the polarity and subjectivity of the texts using both TextBlob and VADER in Python, and then plotted the dataset in RapidMiner using these new attributes, but the data distribution looks non-linear (which I didn't expect), since both TextBlob and VADER assign a positive polarity to some texts tagged as negative and vice versa.
Below is a screenshot of the plot, so you can tell me if and how I should fix it. Would you call this data distribution linear or non-linear?
[Plot: the dataset plotted by subjectivity and polarity extracted with VADER and TextBlob]
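RapidMiner itself will not draw the boundary, but once polarity and subjectivity are computed as two numeric features, the SVM can be refitted and plotted in Python. A minimal sketch using scikit-learn's `SVC` (a LibSVM wrapper) on synthetic stand-in scores, not the real tweet data:

```python
# Sketch: fit a linear SVM on two derived features (polarity, subjectivity)
# and draw its decision boundary. The feature values below are random
# stand-ins for real VADER/TextBlob scores.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # remove this line for an interactive window
import matplotlib.pyplot as plt
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical 2-D features: column 0 = polarity, column 1 = subjectivity.
X_pos = rng.normal(loc=[0.4, 0.6], scale=0.15, size=(50, 2))
X_neg = rng.normal(loc=[-0.4, 0.4], scale=0.15, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [0] * 50)  # 1 = positive, 0 = negative

clf = SVC(kernel="linear").fit(X, y)

# Evaluate the decision function on a grid; its zero level set is the boundary.
xx, yy = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(0, 1, 200))
zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contour(xx, yy, zz, levels=[0], colors="k")          # decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=15)
plt.xlabel("polarity")
plt.ylabel("subjectivity")
plt.savefig("svm_boundary.png")
```

If the boundary drawn this way separates the two classes cleanly, the data is (approximately) linearly separable in these two features; heavy mixing on both sides, as in the screenshot, suggests it is not.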

Related

Is there an R function for creating an interaction plot of a panelAR model?

In order to strengthen the interpretation of an interaction term I would like to create an interaction plot.
Starting point: I am analyzing a panel data frame to which I fitted a feasible generalized least squares model using the panelAR function. It includes an interaction term of two continuous variables.
What I want to do: create an interaction plot, e.g. in the style of "plot_model" from the sjPlot package (see Three-Way-Interactions: link).
Problem: I could neither find any package which supports the type of my model nor a different way to get a plot.
Question: Is there any workaround which can be used for obtaining an interaction plot or even a package which supports a panelAR model?
Since I am quite new to R, I would appreciate any kind of help. Thank you very much!
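A generic workaround, independent of the fitting package, is to pull the fitted coefficients out of the model (in R, via `coef()` on the panelAR fit) and compute predicted values by hand at a few fixed levels of the moderator. The idea, sketched in Python with made-up coefficients for a model y = b0 + b1·x + b2·m + b3·x·m:

```python
# Sketch of a manual interaction plot: predicted y against x at low /
# mean / high values of the moderator m. The coefficients are
# hypothetical; in practice they come from the fitted model.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # remove for an interactive window
import matplotlib.pyplot as plt

b0, b1, b2, b3 = 1.0, 0.5, -0.2, 0.8   # hypothetical fitted coefficients
x = np.linspace(0, 10, 100)
for m in (-1.0, 0.0, 1.0):              # low / mean / high moderator level
    plt.plot(x, b0 + b1 * x + b2 * m + b3 * x * m, label=f"m = {m}")
plt.xlabel("x")
plt.ylabel("predicted y")
plt.legend()
plt.savefig("interaction_plot.png")
```

The same few lines translate directly to base R (`curve()` or `plot()` in a loop), which avoids depending on any package supporting the panelAR class.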

H2O Driverless AI ROC curve: how to identify the threshold for a multiclass confusion matrix

I have created a training model, and the ROC curve shows 0.9748 on the multiclass confusion matrix. I ran this model on test data using "score on another dataset" and got the predictions. I would like to understand how to get the threshold for these predictions so that we can publish the future values to the users.
DAI returns prediction values, not labels. This means you have to set the threshold yourself. For example, you could download the predictions file, import it into your favorite language (let's use H2O-3's Python API for example), and then run a boolean check to see whether a given column has a value above the threshold for it to be a specific label.
Details on the multiclass experiment graphs and how DAI decides to display different threshold metrics can be found in the documentation here
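The "boolean check" described above can be sketched with pandas. The column names below are hypothetical stand-ins; DAI writes one probability column per class in the predictions file:

```python
# Sketch: assign a label only when the top class probability clears a
# chosen threshold, otherwise mark the row as undecided.
import pandas as pd

# Hypothetical predictions file with one probability column per class.
preds = pd.DataFrame({
    "class_A": [0.80, 0.10, 0.30],
    "class_B": [0.15, 0.70, 0.30],
    "class_C": [0.05, 0.20, 0.40],
})

threshold = 0.6
best = preds.idxmax(axis=1)              # most probable class per row
confident = preds.max(axis=1) >= threshold
preds["label"] = best.where(confident, other="undecided")
print(preds["label"].tolist())  # → ['class_A', 'class_B', 'undecided']
```

In a real pipeline you would replace the inline DataFrame with `pd.read_csv()` on the downloaded predictions file.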

How many labels are acceptable before using regression over classification

I have a problem where I'm trying to use supervised learning in Python. I have a series of x,y coordinates which I know belong to a label in one dataset; in the other I have only the x,y coordinates. I am going to use one set to train on the other. My approach is supervised learning with a classification algorithm (linear discriminant analysis), since the number of labels is discrete. Although they are discrete, they are large in number (n ≈ 80,000). My question: at what number of labels should I consider regression over classification, given that regression is better suited to continuous labels? I'm using scikit-learn as my machine learning package and astronml.org's excellent tutorial as a guide.
It is not about numbers; it is about being continuous or not. It does not matter whether you have 80,000 classes or even more: as long as there is no correlation between neighbouring classes (e.g. class i and i+1), you should use classification, not regression.
Regression only makes sense when the labels are continuous (e.g. real numbers), or at least when there is a correlation between adjacent classes (e.g. when labels represent the count of something, you can do regression and then round the results).
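The count-label case can be sketched in a few lines of scikit-learn, on synthetic data where adjacent labels are correlated by construction:

```python
# Sketch: when labels are counts (adjacent classes are correlated),
# fit a regressor and round its predictions back to integer labels.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
# Count-like labels: an underlying continuous quantity, rounded.
y = np.round(2 * X.ravel() + rng.normal(0, 0.3, size=200)).astype(int)

reg = LinearRegression().fit(X, y)
pred_labels = np.round(reg.predict(X)).astype(int)  # snap back to discrete labels
accuracy = np.mean(pred_labels == y)
```

If the labels were arbitrary class IDs instead of counts, this rounding trick would be meaningless, which is exactly the distinction made above.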

What is the interpretation of the boxes in a Logistic Model Tree (LMT) plot from the RWeka package in R?

I'm working on a user classification with 5 known groups (observations approximately equally divided over the groups). I have information about these users (like age, living area, ...) and try to find the characteristics that identify the users in each group.
For this purpose I use the RWeka package in R (a collection of machine learning algorithms: http://cran.r-project.org/web/packages/RWeka/RWeka.pdf). To find the characteristics that distinguish between my groups, I use Logistic Model Trees (LMT). There is very little information about this function:
I will try to sketch an example of a plotted tree.
The splits are straightforward to interpret, but each terminal node contains a box filled with:
LM_24: 48/96
(20742)
What does this mean? How can I see in which of the five groups the node ends?
With what function can I retrieve the coefficients used in the model, so that the influence of the variables can be studied?
(I did look into other methods for building trees on these data, but the regression and classification tree packages (like rpart and party) only find one terminal node in my data, while the LMT function finds 6 split nodes.)
I hope you can provide an answer or some help with this function. Thanks a lot!

Alternatives to LDA for big datasets

I'm analyzing a big gene expression dataset in R, with 100 samples and 50,000 genes.
I already made some very informative PCA projections of inter-sample patterns. Now I want to make some projections of the data maximizing the differences between the labels I have for the samples.
Normally I would do this with the lda() function from the MASS package. However, this is way too slow and memory intensive.
If the goal is to produce a projection of the samples maximizing the difference between known labels, what are some good alternatives to lda()?
Thanks!
Summary of our discussion in the comments to the question
Linear discriminant analysis does not work on data sets with more features than observations, so you need some form of regularization. If you want to do classification but are mainly interested in the predictive patterns, rather than the predictions themselves, you can use partial least squares discriminant analysis (PLSDA).
However, in your case the components of PLSDA might be hard to interpret, since they will contain one coefficient per gene, and it seems unrealistic to believe that all 50,000 genes are relevant to the phenotype you are studying. An alternative approach I prefer is to use nearest shrunken centroids or the elastic net, which produce sparse models (i.e. they keep only the most informative genes and discard those of little importance).
You could run your LDA model on a sample of the data set.
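The regularization point can be illustrated outside of R as well. A sketch of shrinkage-regularized LDA on synthetic "expression" data with far more features than samples, using scikit-learn (the R equivalent would be a regularized discriminant package rather than `MASS::lda`):

```python
# Sketch: LDA with Ledoit-Wolf shrinkage on p >> n data, projecting the
# samples onto the discriminant axes. Data here is synthetic.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_samples, n_features = 100, 500       # more features than samples
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)
X[y == 1, :10] += 1.5                  # make the first 10 "genes" informative

# The 'eigen' solver with automatic shrinkage keeps the within-class
# covariance invertible even when p >> n, and supports transform().
clf = LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto").fit(X, y)
projection = clf.transform(X)          # (n_samples, n_classes - 1) projection to plot
```

Plotting `projection` (one axis per class minus one) gives exactly the kind of label-maximizing view the question asks for, without the memory cost of inverting a singular 50,000 × 50,000 covariance matrix.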
