Training SRL using BERT on German with AllenNLP

I am trying to train an SRL model for German text by translating the OntoNotes dataset and propagating the labels from the English sentences to the German sentences. When I train the model with this dataset, as well as with a manually annotated dataset, I seem to be stuck at a maximum F1 score of 0.62. I am using the deepset/gbert-large BERT model for training, with a learning rate of 5e-5. I have updated the Ontonotes.py file to read the CoNLL-formatted files, and I checked the SRL frames to ensure the labels are being picked up correctly. Is there something else I am missing that I need to take care of when training a model in a different language, or is it just the low quality of the data that might be causing the issue?
I have also tried manually annotating German sentences for the SRL task, and even with such high-quality data the model does not perform as well as the equivalent BERT model does for English. Although the quality of the dataset created by translating and transferring labels might be low, does that still explain a difference of 0.24 in F1 score?

Related

Machine learning project: split training/test sets before or after exploratory data analysis?

Is it best to split your data into training and test sets before doing any exploratory data analysis (so that all exploration is based solely on the training data), or is it fine to explore the full dataset before splitting?
I'm working on my first full machine learning project (a recommendation system for a course capstone project) and am looking for clarification on order of operations. My rough outline is to import and clean, do exploratory analysis, train my model, and then evaluate on a test set.
I am doing exploratory data analysis now - nothing special initially, just starting with variable distributions and whatnot. But I am not sure: should I split my data into training and test sets before or after exploratory analysis?
I don't want to potentially contaminate algorithm training by inspecting the test set. However, I also don't want to miss visual trends that might reflect real signal that my poor human eye might not see after filtering, and thus potentially miss investigating an important and relevant direction while designing my algorithm.
I checked other threads, like this, but the ones I found seem to ask more about things like regularization or actual manipulation of the original data. The answers I found were mixed but prioritized splitting first. However, I don't plan to do any actual manipulation of the data before splitting it (beyond inspecting distributions and potentially doing some factor conversions).
What do you do in your own work and why?
Thanks for helping a new programmer!
To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).
Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.
Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.
Having trained our final model, we often want an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance using a new batch of data, the testing data.
This discussion gives the answer to your question: we should not use the testing (or validation) data set for exploratory data analysis. If we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that work well for the testing data. At the same time, we would lose the ability to get an unbiased estimate of our model's performance.
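As a minimal sketch of that split-first workflow in R (the data frame df and the 60/20/20 proportions are placeholders, not from the question):

# Split into training / validation / test before any EDA,
# then explore only the training portion.
set.seed(42)
n   <- nrow(df)
idx <- sample(seq_len(n))                      # shuffled row indices
train <- df[idx[1:floor(0.6 * n)], ]
valid <- df[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test  <- df[idx[(floor(0.8 * n) + 1):n], ]
summary(train)                                 # EDA happens on 'train' only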
I would take the problem the other way round: is it actually bad to use the test set?
The objective of modeling is to end up with a model with low variance (and small bias): that's why the test set keeps a bunch of data aside, to assess how your model behaves on new data (i.e. its variance). If you use the test set during modeling, you are left with nothing to assess that, and you are overfitting your data.
The objective of EDA is to understand the data you're working with: the distributions of features, their relationships, their dynamics, etc. If you leave your test set in the data, is there a risk of "overfitting" your understanding of the data? If that were the case, you would observe in, say, 70% of your data some properties that do not hold for the remaining 30% (the test set). Knowing that the split is random, this is practically impossible unless you have been extremely unlucky.
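To illustrate that point with a hedged sketch (df and the 70/30 proportions are placeholders): a random split should leave the two portions with essentially the same distributions, which you can check directly.

set.seed(1)
in_train <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, prob = c(0.7, 0.3))
summary(df[in_train, ])    # what you would see during EDA / modeling
summary(df[!in_train, ])   # what is held out in the test set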
From my understanding of the machine learning pipeline, exploratory data analysis should be done before splitting the data into train and test sets.
Here are my reasons:
The data may not be clean in the beginning. It might have missing values, mismatched datatypes and outliers.
You need to understand every feature's relationship with the target variable in the dataset. This helps you understand the importance of each feature with respect to the business problem and helps you derive additional features as well.
Data visualization will also help you get insights from the dataset.
Once the above operations are done, we can split the dataset into train and test sets, because the feature distributions must be similar in both.

How to construct dataframe for time series data using ensemble learning methods

I am trying to predict the Bitcoin price at t+5, i.e. 5 minutes ahead, using 11 technical indicators up to time t, which can all be calculated from the open, high, low, close and volume values of the Bitcoin time series (see my full data set here). As far as I know, it is not necessary to manipulate the data frame when using algorithms like regression trees, support vector machines or artificial neural networks. However, when using ensemble methods like random forests (RF) and boosting, I have heard that it is necessary to re-arrange the data frame in some way, because ensemble methods draw repeated RANDOM samples from the training data, which would ruin the sequence of the Bitcoin time series. So, is there a way to re-arrange the data frame such that the time series is still in chronological order every time repeated samples are drawn from the training data?
I was provided with an explanation of how to construct the data frame here and possibly here, too, but unfortunately I didn't really understand these explanations, because I didn't see a visual example of the to-be-constructed data frame and because I wasn't able to identify the relevant line of code. So, if someone could show me how to re-arrange the data frame using an example data frame, I would be very thankful. As an example data frame, you might consider using the built-in airquality data frame in R (I think it contains time series data), the data I provided above, or any other data frame you think is best.
Many thanks!
There is no problem with resampling for ML algorithms. To capture (auto)correlation, just add columns with lagged values of the time series. E.g., in the case of a univariate time series x[t], where t is time in minutes, you add x[t - 1], x[t - 2], ..., x[t - n] columns with lagged values. The more lags you add, the more history is accounted for during model training.
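A rough sketch of this in R (the vector price stands in for the Bitcoin close series; 3 lags and a 5-minute-ahead target are arbitrary choices):

# Build a supervised data frame: each row holds the value at time t,
# its n_lags previous values, and the target 'horizon' steps ahead.
make_lagged <- function(x, n_lags = 3, horizon = 5) {
  n <- length(x)
  rows <- (n_lags + 1):(n - horizon)
  d <- data.frame(target = x[rows + horizon])
  for (k in 0:n_lags) {
    d[[paste0("lag", k)]] <- x[rows - k]
  }
  d
}

lagged <- make_lagged(price, n_lags = 3, horizon = 5)
head(lagged)
# Each row is now self-contained (target plus its recent history), so the
# random resampling done by a random forest no longer breaks the time order.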
You can find a very basic working example here: Prediction using neural networks
More advanced stuff with Keras is here: Time series prediction using RNN
However, just for your information, here is a special message from Mr Chollet and Mr Allaire in the above-mentioned article ;):
NOTE: Markets and machine learning
Some readers are bound to want to take the techniques we’ve introduced
here and try them on the problem of forecasting the future price of
securities on the stock market (or currency exchange rates, and so
on). Markets have very different statistical characteristics than
natural phenomena such as weather patterns. Trying to use machine
learning to beat markets, when you only have access to publicly
available data, is a difficult endeavor, and you’re likely to waste
your time and resources with nothing to show for it.
Always remember that when it comes to markets, past performance is not
a good predictor of future returns – looking in the rear-view mirror
is a bad way to drive. Machine learning, on the other hand, is
applicable to datasets where the past is a good predictor of the
future.

Binary variables (features) used in a classification model - data visualisation

I suppose this is less of a coding question and more of a plea for advice from experienced data scientists:
Suppose I have a classification problem at hand with 15 features (variables), all of them binary (1|0). Naturally, the output is also binary (1|0).
I'm looking for the best / most widely used method of presenting and visualizing the model's output on the test data.
In case the context is helpful: the model is supposed to predict whether loan applicants will default on their payments after some period of time, based on features such as whether they're employed|unemployed, whether they had more than x days of delinquency, etc.
The model is randomForest.
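One common (though by no means the only) way to present such results is a confusion matrix of the test predictions together with the forest's variable importance. The sketch below assumes a fitted randomForest called rf_model, a test frame test_df with a binary target column named default (all placeholder names), and uses caret's confusionMatrix:

library(randomForest)
library(caret)

pred <- predict(rf_model, newdata = test_df)
confusionMatrix(pred, factor(test_df$default))   # accuracy, sensitivity, specificity
varImpPlot(rf_model)                             # which of the 15 binary features matter most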

How to use LSA for dimension reduction in text analytics with R

I am a beginner at data science, and I am working on a text analytics/sentiment analysis project with tweets.
What I have been trying to do is perform some dimension reduction on my training set of tweets, feed the reduced training set into a NaiveBayes learner, and use the learned NaiveBayes model to predict the sentiment of the tweets in the testing set.
I have been following the steps in this article:
http://www.analyticskhoj.com/data-mining/text-analytics-part-iv-cluster-analysis-on-terms-and-documents-using-r/
Their explanation is a bit too brief for a beginner like me.
I have used lsa() to create what is labeled as a "Large LSAspace (3 elements)" in RStudio. Following their example, I've created 3 more data frames:
lsa.train.tk = as.data.frame(lsa.train$tk)
lsa.train.dk = as.data.frame(lsa.train$dk)
lsa.train.sk = as.data.frame(lsa.train$sk)
When I view the lsa.train.tk data, it looks like this (lsa.train.dk looks pretty similar to this matrix):
and my lsa.train.sk looks like the following:
My question is: how do I interpret this information?
How can I utilize this information to create something that I can feed into my NaiveBayes learner? I tried just using lsa.train.sk for the NaiveBayes learner, but I cannot think of any good explanation that would justify what I've tried. Any help would be much appreciated!
EDIT:
What I've done so far:
making everything into a term-document matrix
passing the matrix into the NaiveBayes learner
predicting using the learned algorithm
My problems are:
accuracy is only 50%... and I realized that it labels everything as positive sentiment (so I could have gotten 1% accuracy if my test set had contained only negative-sentiment tweets).
the current code is not scalable. Since it utilizes large matrices, I can only handle up to 3.5k rows of data; more than that and my computer crashes. Thus I wanted to do dimensionality reduction so that I can handle more data (such as 10k or 100k rows of tweets).
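For what it's worth, here is one plausible way to wire the LSA output into a NaiveBayes learner (here e1071::naiveBayes, as an assumption), sketched under assumptions: train.tdm is a term-document matrix with terms in rows and documents in columns, train.labels is a factor of tweet sentiments, lsa.train is the LSAspace from above, and test.tdm holds the test tweets (all placeholder names):

library(lsa)
library(e1071)

# $tk: terms x k  (term vectors in the reduced space)
# $dk: docs  x k  (document vectors in the reduced space)
# $sk: the k singular values
# The document vectors (dk) are the low-dimensional features to learn from.
train.features <- as.data.frame(lsa.train$dk)
nb <- naiveBayes(train.features, train.labels)

# New tweets must be folded into the same space before predicting:
# d_hat = t(d) %*% T %*% S^-1, using tk and sk from the training space.
fold_docs <- function(tdm, space) {
  as.data.frame(t(as.matrix(tdm)) %*% space$tk %*% diag(1 / space$sk))
}
test.features <- fold_docs(test.tdm, lsa.train)
names(test.features) <- names(train.features)    # align column names for predict()
pred <- predict(nb, test.features)

This also addresses the scalability concern, since the classifier only ever sees the k reduced columns instead of the full vocabulary.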

Differences in scoring from PMML model on different platforms

I've built a toy Random Forest model in R (using the German Credit dataset from the caret package), exported it in PMML 4.0 and deployed it onto Hadoop, using the Cascading Pattern library.
I've run into an issue where Cascading Pattern scores the same data differently (in a binary classification problem) than the same model in R. Out of 200 observations, 2 are scored differently.
Why is this? Could it be due to a difference in the implementation of Random Forests?
The German Credit dataset represents a classification-type problem. The winning score of a classification-type RF model is simply the class label that was the most frequent among member decision trees.
Suppose you have an RF model with 100 decision trees, and 50 decision trees predict "good credit" while the other 50 decision trees predict "bad credit". It is possible that R and Cascading Pattern resolve such tie situations differently - one picks the score that is seen first and the other picks the score that is seen last. You could try re-training your RF model with an odd number of member decision trees (i.e. use a value that is not divisible by two, such as 99 or 101).
The PMML specification says to return the score that was seen first. I'm not sure if Cascading Pattern pays any attention to such details. You may want to try out an alternative solution called JPMML-Cascading.
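A minimal sketch of that retraining suggestion, assuming the GermanCredit data from caret (with target Class) and the pmml/XML packages for export:

library(caret)          # GermanCredit data
library(randomForest)
library(pmml)           # pmml() supports randomForest objects
library(XML)            # saveXML()

data(GermanCredit)

# An odd number of trees (here 101) removes the possibility of a
# 50/50 vote tie between the two classes.
set.seed(123)
rf <- randomForest(Class ~ ., data = GermanCredit, ntree = 101)
saveXML(pmml(rf), file = "german_credit_rf.pmml")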
Score matching is a big deal. When a model is moved from the scientist's desktop to the production IT deployment environment, the scores need to match. For a classification task, that also includes the probabilities of all target categories. There is sometimes a problem of precision between different implementations/platforms which can result in minimal differences (really minimal). In any case, they also need to be checked.
Obviously, it could also be the case that the model was not represented correctly in PMML ... unlikely with the R PMML package. The other option is that the model is not deployed correctly, i.e. the scoring engine Cascading is using is not interpreting the PMML file properly.
PMML itself has a model element called ModelVerification that allows a PMML file to contain scored data, which can then be used for score matching. This is useful but not necessary, since you should be able to score an already scored dataset and compare computed with expected results, which you are already doing.
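That check might look something like this in R (rf, scored_df and its expected_class / expected_prob_good columns are placeholders for the model and the dataset already scored on the other platform):

computed_class <- predict(rf, newdata = scored_df, type = "response")
computed_prob  <- predict(rf, newdata = scored_df, type = "prob")[, "Good"]

which(computed_class != scored_df$expected_class)        # e.g. the 2 rows out of 200
max(abs(computed_prob - scored_df$expected_prob_good))   # size of any precision differences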
For more on model verification and score matching as well as error handling in PMML, check:
https://support.zementis.com/entries/21207918-Verifying-your-model-in-ADAPA-did-it-upload-correctly-
