Number is recognized as a noun in the spaCy Portuguese model (spaCy 3)

Just out of curiosity, I would like to ask why the number "4950" gets the PoS (part-of-speech) tag NOUN in spaCy v3.1.3, using the large Portuguese model. It is not in the tokenizer exception file on GitHub (https://github.com/explosion/spaCy/blob/master/spacy/lang/pt/tokenizer_exceptions.py).
nlp = spacy.load('pt_core_news_lg')
doc = nlp('4950')
print(doc[0].text, doc[0].pos_)
#4950 NOUN
Is there any way to know what the other particular cases are?

To be clear, this should normally be a NUM.
This looks like it's just an error, and it doesn't affect most numbers, including similar ones like 4951. It's possible that somewhere in the Portuguese training data 4950 is labelled NOUN for some reason.
It's hard to explain individual predictions by the statistical models, and they make errors sometimes. This one is particularly egregious and may indicate an issue with data preparation, but in general errors like this are always possible. See this thread.
Also note this doesn't seem to be an issue in the small model. I'll look into this internally to see if there's a bug somewhere.
Quick update: if you use this in a sentence, like 4950 maçãs, it is properly labelled as NUM. One-word sentences are not something the models see much of during training, so they can produce odd results.
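A quick way to see the difference for yourself (a minimal sketch of the check described above; it assumes the large Portuguese model is installed):
import spacy

nlp = spacy.load("pt_core_news_lg")

# Compare the tag for the bare number with the tag inside a short phrase.
for text in ["4950", "4951", "4950 maçãs"]:
    doc = nlp(text)
    print(text, "->", [(t.text, t.pos_) for t in doc])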

Related

Is repeated ANOVA what I am looking for?

I'm studying the NDVI (normalized difference vegetation index) behaviour of some soils and cultivars. My database has 33 days of acquisition, 17 kinds of soil and 4 different cultivars. I have built it in two different ways, which you can see attached. I am having trouble and errors with both shapes.
The first question is: is repeated ANOVA the correct way of analyzing my data? I want to see if there are any differences between the behaviours of the different cultivars and the different soils. I've run an ANOVA for each day and there are statistically significant differences on each day, but the results are not globally interesting, because I would like to investigate the whole-year behaviour.
The second question is: how can I perform it? I've tried different tutorials but I got unexpected errors or didn't manage to complete the analysis.
Last but not least: I'm coding in RStudio.
Any help is appreciated. I'm still new to statistics but really interested in improving!
horizontal database
vertical database
I believe you can use ANOVA, but as always, you have to know whether that is really what you're looking for. Either way, since this is a platform for programming questions, I'll write code that should work for the vertical version. However, since I don't have your data, I can't know for sure (for future reference, dput(data) creates easily importable code for those trying to answer you).
summary(aov(suolo ~ CV, data = data))
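If you do want the repeated-measures version, with day as the within-subject factor, a hedged sketch along these lines could be a starting point; the column names NDVI, cultivar, soil, day and plot are hypothetical and must be adapted to the vertical data frame:
# Treat each plot as the repeatedly measured unit, with day nested inside it.
data$day  <- factor(data$day)
data$plot <- factor(data$plot)

summary(aov(NDVI ~ cultivar * soil * day + Error(plot/day), data = data))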

Compare vectors of a doc and just a word

So, I have to compare the vector of an article with the vector of a single word, and I don't have any idea how to do it. It looks like BERT and Doc2Vec work well with long texts, while Word2Vec works with single words. But how do I compare a long text with just a word?
Some modes of the "Paragraph Vector" algorithm (aka Doc2Vec in libraries like Python gensim) will train both doc-vectors and word-vectors into a shared coordinate space. (Specifically, any of the PV-DM dm=1 modes, or the PV-DBOW mode dm=0 if you enable the non-default interleaved word-vector training using dbow_words=1.)
In such a case, you can compare Doc2Vec doc-vectors with the co-trained word-vectors, with some utility. You can see some examples in the follow-up paper from the originators of the "Paragraph Vector" algorithm, "Document Embedding with Paragraph Vectors".
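A minimal gensim sketch of that shared-space comparison (the toy corpus and the probe word "education" are purely illustrative; a real model needs far more text):
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = ["the school offers education and vocational training",
         "students attend lectures and exams at the university"]
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

# PV-DBOW with interleaved word training (dm=0, dbow_words=1) puts
# doc-vectors and word-vectors into the same coordinate space.
model = Doc2Vec(corpus, vector_size=50, dm=0, dbow_words=1, min_count=1, epochs=100)

doc_vec = model.dv[0]              # vector for the first document
word_vec = model.wv["education"]   # vector for a single word

cosine = np.dot(doc_vec, word_vec) / (np.linalg.norm(doc_vec) * np.linalg.norm(word_vec))
print(f"cosine(document 0, 'education') = {cosine:.3f}")

# Words closest to the document's vector:
print(model.wv.most_similar(positive=[doc_vec], topn=3))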
However, beware that single words, having been trained in contexts of use, may not have vectors that match what we'd expect of those same words when intended as overarching categories. For example, education as used in many sentences wouldn't necessarily assume all the facets/breadth that you might expect from Education as a category header.
Such single word-vectors might work better than nothing, and perhaps help serve as a bootstrapping tool. But it'd be better if you had expert-labelled examples of documents belonging to the categories of interest. Then you could also use more advanced classification algorithms, sensitive to categories that wouldn't necessarily be summarized by (and sit in a tight sphere around) any single vector point. In real domains of interest, that'd likely do better than using single word-vectors as category anchors.
For any other non-Doc2Vec method of vectorizing a text, you could conceivably get a comparable vector for a single word by supplying a single-word text to the method. (Even in a Doc2Vec mode that doesn't create word-vectors, like pure PV-DBOW, you could use that model's out-of-training-text inference capability to infer a doc-vector for a single-word doc, for known words.)
But again, such simplified/degenerate single-word outputs might not match the more general/textured categories you're seeking very well. The models are more typically used for larger contexts, and narrowing their output to a single word might reflect the peculiarities of that unnatural input case more so than the usual import of the word in real contexts.
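Reusing the model from the sketch above, that inference step for a one-word "document" is a single call (the word must be in the model's vocabulary):
one_word_vec = model.infer_vector(["education"], epochs=100)
print(model.wv.most_similar(positive=[one_word_vec], topn=3))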
You can use BERT as-is for words too. A single word is just a really short sentence, so, in theory, you should be able to use any sentence embedding you like.
But if you don't have any supervised data, BERT is not the best option for you, and there are better options out there!
I think it's best to first try Doc2Vec, and if that doesn't work then switch to something else like Skip-Thoughts or USE.
Sorry that I can't help you much; it's completely task- and data-dependent, and you should test different things.
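To illustrate the "a single word is just a really short sentence" point, here is a minimal sketch with the sentence-transformers package; the model name all-MiniLM-L6-v2 and the example texts are just illustrative choices, not part of the original answer:
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model can encode a single word as a very short "sentence".
model = SentenceTransformer("all-MiniLM-L6-v2")

article = "The school board approved new funding for teacher training programs."
word = "education"

emb_article, emb_word = model.encode([article, word])
print(float(util.cos_sim(emb_article, emb_word)))  # cosine similarity between article and word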
Based on your further comments, which explain your problem a bit more, it sounds like you're actually trying to do topic modelling (categorizing documents by a given word is equivalent to labelling them with that topic). If this is what you're doing, I would recommend looking into LDA and variants of it (e.g. GuidedLDA).
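If topic modelling really is the goal, a minimal scikit-learn LDA sketch (the toy corpus and the number of topics are purely illustrative) might look like this:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "students attend lectures and exams at the university",
    "the team won the championship game last night",
    "the school offers education and vocational training",
    "the striker scored two goals in the second half",
]  # hypothetical toy corpus

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# The top words per topic give a rough label for each topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {top}")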

How to make Decision Tree rules more understandable?

I'd like to extract useful rules from Decision Trees/Random Forests in order to develop a more practical way of handling the rules and predictions. So I need an application which makes the rules more understandable.
Any suggestions (e.g. visualizations, validation methods etc) for my purpose?
As for WHY a particular split was chosen, the answer is always going to be: "Because that split created the best separation of the target variable."
You referenced scikit-learn. Go ahead and briefly scan scikit-learn's documentation on Decision Trees; in the middle of the page there is an example of exactly what you are asking for. It looks like this:
The code to generate this plot is there also:
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
with open("iris.dot", "w") as f:
    tree.export_graphviz(clf, out_file=f)  # writes the fitted tree to a Graphviz .dot file
There are several other graphical representations there as well, with accompanying code.
The SKL documentation is generally awesome and very useful.
Hope this helps!
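Beyond the graphical output, a plain-text dump of the rules can also make them easier to read; this short sketch reuses clf and iris from the snippet above and relies on export_text, which is available in scikit-learn 0.21 and later:
from sklearn.tree import export_text

# Prints the fitted tree as indented if/else rules, one line per split.
print(export_text(clf, feature_names=list(iris.feature_names)))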
While this is certainly possible for Decision Trees and AN6U5 did a great job describing how, Random Forests use bundles of little trees that were trained using random subsets of the data and random subsets of the features. Thus each tree is optimal only in that limited setting of features and data. Since there are typically 100s or even 1000s of them, figuring out the context by examining the randomized data is going to be a thankless task. I don't think anyone does it.
However, there are importance rankings for the features generated by Random Forests, and pretty much all implementations will output them if requested. They turn out to be extremely useful.
Two of the most important ones are MDI (Mean Decrease Impurity) and MDA (Mean Decrease Accuracy). They are described in some detail in chapter 6 of this excellent work: http://arxiv.org/pdf/1407.7502v3.pdf
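As a concrete illustration (a sketch on the iris toy data, not the answerer's own code), scikit-learn exposes both kinds of ranking: impurity-based importances after fitting, and permutation importances via sklearn.inspection:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# MDI-style ranking: mean decrease in impurity, computed during training.
print(dict(zip(iris.feature_names, rf.feature_importances_.round(3))))

# MDA-style ranking: mean decrease in accuracy/score when a feature is permuted,
# measured on held-out data.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(dict(zip(iris.feature_names, perm.importances_mean.round(3))))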

Fastest way to reduce dimensionality for multi-classification in R

What I currently have:
I have a data frame with one column of factors called "Class", which contains 160 different classes. I have 1200 variables, each one an integer, with no individual cell exceeding the value of 1000 (if that helps). About a quarter of the cells are zero. The total dataset contains 60,000 rows. I have already used the nearZeroVar function and the findCorrelation function to get it down to this number of variables. In my particular dataset some individual variables may appear unimportant by themselves, but are likely to be predictive when combined with two other variables.
What I have tried:
First I tried just creating a random forest model, planning to use the varImp property to filter out the useless stuff; I gave up after letting it run for days. Then I tried using fscaret, but that ran overnight on an 8-core machine with 64GB of RAM (same as the previous attempt) and didn't finish. Then I tried:
Feature selection using genetic algorithms. That ran overnight and didn't finish either. I was trying to make principal component analysis work, but for some reason couldn't. I have never been able to successfully do PCA within caret, which could be both my problem and my solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can make it usable for creating a model. Maybe a good place to start would be an example of using PCA with a dataset like mine using Caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I am in right now.
I have done only some toy examples too.
Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset will give some reasonable classifications? It assumes that all attributes are independent of each other, but experience shows (and many scholars say) that Naive Bayes results are often still useful despite that strong assumption.
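A minimal sketch of that idea with the e1071 package, assuming the data frame is called df and the factor column is Class (adjust the names to your data):
library(e1071)

set.seed(1)
idx  <- sample(nrow(df), floor(0.8 * nrow(df)))   # simple 80/20 hold-out split
nb   <- naiveBayes(Class ~ ., data = df[idx, ])   # fit on the training part
pred <- predict(nb, df[-idx, ])
mean(pred == df$Class[-idx])                      # rough hold-out accuracy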
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try to process your dataset with the free GUI-based data-mining tool Weka? There is an "attribute selection" tab where you have several algorithms (or algorithm-combinations) for removing irrelevant attributes at your disposal. That is an art, and the results are not so easy to interpret, though.
Read this pdf as an introduction and see this video for a walk-through and an introduction to the theoretical approach.
The video assumes familiarity with Weka, but it may still help.
There is an RWeka interface but it's a bit laborious to install, so working with the Weka GUI might be easier.
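Since the question specifically asks for a caret PCA starting point, here is a hedged sketch (again assuming a data frame df with the factor column Class; the 95% variance threshold is just an illustrative choice):
library(caret)

# Centre, scale, and project the numeric predictors onto enough principal
# components to retain roughly 95% of the variance.
predictors <- setdiff(names(df), "Class")
pp   <- preProcess(df[, predictors], method = c("center", "scale", "pca"), thresh = 0.95)
dfPC <- predict(pp, df[, predictors])
dfPC$Class <- df$Class   # reattach the target for model training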

How does R calculate the p-values in logistic regression?

What type of p-values does R calculate in a binomial logistic regression, and where is this documented?
When I read the documentation for ?glm() I find no reference to the calculation of the p-values.
The p-values are calculated by the function summary.glm. See ?summary.glm for a (very brief) bit about how those are calculated.
For more information, look at the source code by typing
summary.glm
at the R command prompt. There you will find the lines of code where an object pvalue is created. Follow the code back to see how the components of the p-value calculation are (conditionally) calculated.
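For a binomial glm the dispersion is fixed, so summary.glm reports Wald z statistics with two-sided normal p-values; a small sketch (using the built-in mtcars data purely for illustration) shows the correspondence:
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

coefs <- coef(summary(fit))          # Estimate, Std. Error, z value, Pr(>|z|)
z     <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# The reported p-values are two-sided tail areas of the standard normal:
cbind(reported = coefs[, "Pr(>|z|)"], by_hand = 2 * pnorm(-abs(z)))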
The authors of R wrote the help system with several principles in mind: compactness (don't write more than is needed; it's not a textbook), accuracy, and a curious and well-educated audience. It really was written for other statisticians. The "curious" part of that opening sentence was included to raise the question of why you did not also follow the various links on the ?glm page: to summary.glm, where you would have found one answer to your ambiguous question, or to anova.glm, where you would have found another possible answer. The help authors do expect that you will follow those links, read the whole page and execute the examples. You will notice that even once you get to summary.glm there is no mention of "binary logistic regression", since they pretty much assume that you are well grounded in statistics and have a copy of McCullagh and Nelder handy, or, if not, that you will go read the references.
The other principle: sometimes it is the code itself (given the open-source nature of R) that provides the documentation. Technically glm doesn't print anything, and print.glm doesn't print p-values. It would be print.summary.glm or print.anova.glm that would be doing any printing. Part of learning R is learning that the results printed to the console will have gone through an eval-print loop, and that the output can be tailored with object-class-specific functions.
These assumptions are just part of what many people see as a "steep learning curve for R" (although I would have called it a shallow curve if plotted with time/effort on the x-axis).
