Looking for a LayoutLM model with max_position_embeddings of 1024 instead of 512

I'm currently trying to use a LayoutLM model in a project. However, its maximum position embeddings value is 512. I'm trying to find a pretrained model with 1024 position embeddings. Does anyone know of such a model, or of a way around the limit?
Thanks

Training such a model ourselves would be time-consuming, but we will consider it.

Related

Implement random forest without bootstrap

I want to implement the random forest algorithm of Breiman (2001) using all my training set to grow the trees. In other words, I want to keep the random selection of inputs at each node and remove the bootstrap stage. This is motivated by the fact that I'm working with few observations that exhibit auto-correlation.
I've gone through the documentation of the packages randomForest, ranger and Rborist, but I didn't find an answer. I've also tried to take a look at the source code of the randomForest function using getAnywhere(randomForest.default), but I have to admit my R skills are too limited to get anything out of it.
Thank you in advance.
Edit. Note to future readers: if you want to modify the bootstrap step, make sure to set keep.inbag=T when using randomForest.
The sampsize argument in randomForest controls the number of samples drawn for each tree, and the replace argument controls whether that sampling is done with replacement (i.e. whether you are bootstrapping). So in your case, set sampsize = N (your number of observations) and replace = FALSE, so that every tree is grown on the full training set.
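As a rough sketch of what that call might look like (a hedged example; the data frame train and the response y are placeholders, not from the original question):

Grow the forest on the full training set, no bootstrap
library(randomForest)
fit <- randomForest(
  y ~ ., data = train,
  sampsize   = nrow(train),                    # every tree sees all N observations...
  replace    = FALSE,                          # ...drawn without replacement, i.e. no bootstrap
  mtry       = floor(sqrt(ncol(train) - 1)),   # keep the random selection of inputs at each split
  keep.inbag = TRUE                            # as noted in the edit above
)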

R h2o model sizes on disk

I am using the h2o package to train a GBM for a churn prediction problem.
What I want to know is what influences the size of a fitted model saved on disk (via h2o.saveModel()), but unfortunately I wasn't able to find an answer anywhere.
More specifically, when I tune the GBM to find the optimal hyperparameters (via h2o.grid()) on 3 non-overlapping rolling windows of the same length, I obtain models whose sizes are not comparable (i.e. 11 MB, 19 MB and 67 MB). The hyperparameter grid is the same, and the training set sizes are comparable.
Naturally the resulting optimized hyperparameters differ across the 3 intervals, but I cannot see how this can produce such a difference in model size.
Moreover, when I train the actual models based on those hyperparameter sets, I end up with models of different sizes as well.
Any help is appreciated!
Thank you
PS: I'm sorry, but I cannot share any dataset to make this reproducible (due to privacy restrictions).
It’s the two things you would expect: the number of trees and the depth.
But it also depends on your data: for a GBM, individual trees can end up shallower than the maximum depth when the data do not support further splits.
What I would do is export MOJOs and then visualize them as described in the document below to get more details on what was really produced:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
Note that a model in the 60 MB range does not seem overly large, in general.
If you look at the model info you will find out things about the number of trees, their average depth, and so on. Comparing those between the three best models should give you some insight into what is making the models large.
From R, if m is your model, just printing it gives you most of that information. str(m) gives you all the information that is held.
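For example (a hedged sketch; m stands for one of your fitted GBM handles, e.g. from h2o.getModel()):

Inspect tree counts and depths, then export the MOJO
library(h2o)
print(m)                               # summary, including number of trees and depth statistics
m@model$model_summary                  # number_of_trees, max_depth, mean_depth, model_size_in_bytes, ...
h2o.download_mojo(m, path = getwd())   # export the MOJO to visualize the trees outside R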
I think it is worth investigating. The cause is probably that two of those data windows are relatively clear-cut, so only a few fields are needed to define the trees, whereas the third window is more chaotic (in the mathematical sense), and some deep trees get built as the algorithm tries to split it apart.
Looking into that third window more deeply might suggest some data engineering you could do, that would make it easier to learn. Or, it might be a difference in your data. E.g. one column is all NULL in your 2016 and 2017 data, but not in your 2018 data, because 2018 was the year you started collecting it, and it is that extra column that allows/causes the trees to become deeper.
Finally, maybe the grid hyperparameters are unimportant as regards performance, and this is a difference due to noise. E.g. you have max_depth as a hyperparameter, but its influence on MSE is minor, and noise is a large factor. These random differences could allow your best model to go to depth 5 for two of your data sets (while the 2nd best model was 0.01% worse but went to depth 20), but to depth 30 for your third data set (while the 2nd best model was 0.01% worse but only went to depth 5).
(If I understood your question correctly, you've eliminated this as a possibility, as you then trained all three data sets on the same hyperparameters? But I thought I'd include it, anyway.)

Minbucket and weights in rpart

A couple of questions for the rpart and party experts.
1) I am trying to understand the difference in the control parameter "minbucket" between rpart and party. Is it correct that minbucket in rpart is unweighted (even if weights are provided to fit the tree)?
2) Can anyone briefly describe how the weights are used in the rpart algorithm? I tried to download and review the source code, but, being a newbie, I couldn't make much sense of it. rpart calls a C function (C_rpart), which seems to be the main part of rpart, but I couldn't find more information about it.
Thanks so much in advance.
The weights parameter in rpart (and in most other machine learning algorithms) can be treated as exactly equivalent to duplicating each training item that many times: a weight of 5 is the same as repeating that row 5 times. You can create this explicitly with some simple code, provided your data set is small enough (and the weights are integers):
data[rep(1:nrow(data), times = data$weights), ]
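One way to check question 1 empirically (a hedged sketch; the formula, data frame and integer weights column are placeholders):

Compare a weighted fit to an explicitly replicated fit
library(rpart)
ctrl  <- rpart.control(minbucket = 7)
fit_w <- rpart(y ~ ., data = data, weights = data$weights, control = ctrl)
fit_r <- rpart(y ~ ., data = data[rep(1:nrow(data), times = data$weights), ], control = ctrl)
# If the two trees differ only where minbucket binds, that suggests minbucket is
# counting raw observations rather than weighted counts.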

How can I speed up a topic model in R?

Background
I am trying to fit a topic model with the following data and specification: documents = 140,000, words = 3,000, and topics = 15. I am using the topicmodels package in R (3.1.2) on a Windows 7 machine (24 GB RAM, 8 cores). My problem is that the computation just goes on and on without any “convergence” being produced.
I am using the default options of the LDA() function in topicmodels:
Run model
dtm2.sparse_TM <- LDA(dtm2.sparse, 15)
The model has been running for about 72 hours – and still is as I am writing.
Question
So, my questions are: (a) is this normal behaviour? (b) if not, do you have any suggestions on what to do? (c) if it is, how can I substantially improve the speed of the computation?
Additional information: The original data contain not 3,000 words but about 3.7 million. When I ran that (on the same machine) it did not converge, not even after a couple of weeks. So I ran it with 300 words and only 500 documents (randomly selected), and that worked fine. I used the same number of topics and default values as before for all models.
So for my current model (see my question) I removed sparse terms with the help of the tm package.
Remove sparse terms
dtm2.sparse <- removeSparseTerms(dtm2, 0.9)
Thanks for the input in advance
Adel
You need to use online variational Bayes, which can easily handle training on that many documents. With online variational Bayes you train the model on mini-batches of your training samples, which speeds up convergence dramatically (see the SGD link below).
For R, you can use this package. Here you can read more about it and how to use it. Also look at this paper, since that R package implements the method used in it. If possible, port their Python code (uploaded here) to R. I highly recommend the Python code, since I had a great experience with it on a project I recently worked on. Once the model is learned, you can save the topic distributions for future use, and feed them to onlineldavb.py along with your test samples to integrate over the topic distributions given those unseen documents. With online variational Bayesian methods I trained an LDA on 500,000 documents with a 5,400-word vocabulary in less than 15 hours.
Sources
Variational Bayesian Methods
Stochastic Gradient Descent (SGD)
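If you prefer to stay within topicmodels, a different (non-online) option that is often faster in practice than the default VEM is collapsed Gibbs sampling with an explicit iteration budget; a hedged sketch, with iteration counts that are purely illustrative:

Run model with Gibbs sampling instead of the default VEM
library(topicmodels)
dtm2.sparse_TM <- LDA(dtm2.sparse, k = 15, method = "Gibbs",
                      control = list(seed = 1, burnin = 500, iter = 2000, thin = 100))
terms(dtm2.sparse_TM, 10)   # top 10 terms per topic, as a quick sanity check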

How to know feature contributions in R?

In R I trained a model with good performance. It takes time to train, since it has over 200 predictors (features). Is there a way to find out which features contribute most?
Thanks.
This question is too vague as it stands since there can be any number of ways to determine variable importance based on the problem at hand. In any case, please have a look at what is available in the caret package here:
http://caret.r-forge.r-project.org/varimp.html
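For instance (a hedged sketch that assumes the model was fitted with caret::train() and stored as fit):

Variable importance via caret
library(caret)
imp <- varImp(fit, scale = TRUE)   # model-specific importance where available, filter-based otherwise
print(imp)
plot(imp, top = 20)                # the 20 most important predictors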
