Keras predict repeated columns - r

I have a question about keras model code in R. I have finished training the model and need to predict. Predicting a single row is very fast, but my data has 2,000,000,000 rows and nearly 200 columns, with a structure like the attached image.
[Image: data structure]
I don't know if anyone has suggestions on which method to use so that predict can run quickly and use less memory. To predict, I created matrices according to the table shown; each matrix is 200,000 x 200. Then I use sapply to predict over all the remaining matrices. However, even though predict is fast for each matrix, creating the matrices is slow, so the whole run takes two or three times as long, and that is not counting the sapply step. I wonder if keras has a "smart" way to know that, in each of these matrices, the last N columns are exactly the same? I googled and saw someone mention RepeatVector, but I don't quite understand it and it seems to be used only for training. I already have the model and just need to predict.
Thank you so much everyone!

One of the most performant ways to feed keras models with local data is to create a tf.data.Dataset object. Please take a look at the tfdatasets R package for guides and example usage.
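As a rough sketch (the object names model and big_mat and the batch size of 8192 are assumptions), prediction can then be streamed in batches instead of materialising each 200,000 x 200 matrix in R:
library(keras)
library(tfdatasets)
# Slice the big predictor matrix row-wise, batch it, and prefetch the next
# batch while the current one is being predicted.
ds <- tensor_slices_dataset(big_mat) %>%
  dataset_batch(8192) %>%
  dataset_prefetch(buffer_size = 1)
# predict() accepts a tf.data.Dataset directly, so no intermediate
# matrices have to be built with sapply().
preds <- predict(model, ds)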

Related

Can I use xgboost global model properly, if I skip step_dummy(all_nominal_predictors(), one_hot = TRUE)?

I wanted to try the xgboost global model from: https://business-science.github.io/modeltime/articles/modeling-panel-data.html
On a smaller scale it works fine (like the wmt data: 7 departments, 7 ids), but what if I would like to run it on 200,000 time series (ids)? step_dummy would create another 200k columns, and my PC can't handle it (it can't even handle 14k ids).
I tried removing step_dummy, but then xgboost forecasts the same values for all ids.
My question is: how can I forecast 200k time series with a global xgboost model and get proper forecasts for each of the 200k ids?
Or is step_dummy necessary in order to create proper forecasts for all ids?
PS: the code is the same as in the link; only in my dataset there are 50 monthly observations for each id.
For this model, the data must be given to xgboost in the format of a sparse matrix. That means there should not be any non-numeric columns in the data prior to the conversion (which tidymodels does under the hood at the last minute).
The traditional method for converting a qualitative predictor into a quantitative one is to use dummy variables. There are a lot of other choices, though: you can use an effect encoding, feature hashing, or other approaches too.
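As an illustration, a minimal sketch of an effect (likelihood) encoding with the embed package; the column names id and value and the training frame train_data are assumptions:
library(tidymodels)
library(embed)
# Replace the high-cardinality id factor with a single numeric column
# estimated from the outcome, instead of ~200k dummy columns.
rec <- recipe(value ~ ., data = train_data) %>%
  step_lencode_glm(id, outcome = vars(value))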
I think there is no single proper answer to the question of how to forecast 200k time series. Global models are the way to go here, but you need to experiment to find out which series do not belong inside the global forecast model.
There will be a threshold, determined mostly by the length of the series, for which series you put inside the global model.
Keep in mind to use several global models with different feature recipes.
If you want to avoid the step_dummy step, use lightgbm via the bonsai package, which is considerably faster and more accurate.
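For example, a minimal sketch of the same kind of workflow with the lightgbm engine (the recipe and the names value and train_data are assumptions):
library(tidymodels)
library(bonsai)   # registers the "lightgbm" engine for boost_tree()
# lightgbm handles factor predictors such as the id column natively,
# so the recipe needs no step_dummy() and no 200k extra columns.
rec  <- recipe(value ~ ., data = train_data)
spec <- boost_tree() %>%
  set_engine("lightgbm") %>%
  set_mode("regression")
wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec) %>%
  fit(data = train_data)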

Specification of a mixed model using glmmLasso package

I have a dataset containing repeated measures and quite a lot of variables per observation. Therefore, I need a way to select explanatory variables in a smart way. Regularized regression methods sound like a good way to address this problem.
Upon looking for a solution, I found out about the glmmLasso package quite recently. However, I have difficulties defining a model. I found a demo file online, but since I'm a beginner with R, I had a hard time understanding it.
(demo: https://rdrr.io/cran/glmmLasso/src/demo/glmmLasso-soccer.r)
Since I cannot share the original data, I suggest using the soccer dataset (the same dataset used in the glmmLasso demo file). The variable team is repeated across observations and should be treated as a random effect.
# sample data
library(glmmLasso)
data("soccer")
I would appreciate it if you could explain the parameters lambda and family, and how to tune them.
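For reference, a minimal sketch adapted from the linked soccer demo (the choice of covariates, the lambda value, and the family are illustrative assumptions, not recommendations):
library(glmmLasso)
data("soccer")
# Fixed effects go into the formula; the random intercept for team goes
# into rnd. lambda is the strength of the L1 penalty (larger values shrink
# more fixed-effect coefficients exactly to zero); family is the assumed
# response distribution (gaussian() for a continuous outcome, poisson()
# for counts, etc.).
fit <- glmmLasso(points ~ transfer.spendings + ball.possession + tackles + ave.attend,
                 rnd    = list(team = ~1),
                 data   = soccer,
                 lambda = 10,
                 family = gaussian(link = "identity"))
summary(fit)
# lambda is typically tuned by refitting over a grid of values and
# comparing BIC or cross-validated error, as the demo file does.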

mlr classification training with rpart does not complete

I have a classification task that I managed to train with the mlr package using LDA ("classif.lda") in a few seconds. However, when I trained it using "classif.rpart", the training never ended.
Is there any different setup to be done for the different methods?
My training data is here if needed to replicate the problem. I tried to train it simply with:
pred.bin.task <- makeClassifTask(id="CountyCrime", data=dftrain, target="count.bins")
train("classif.rpart", pred.bin.task)
In general, you don't need to change anything about the setup when switching learners -- one of the main points of mlr is to make this easy! This does not mean that it'll always work though, as different learning methods do different things under the hood.
It looks like in this particular case the model simply takes a long time to train, so you probably didn't wait long enough for it to complete. You have quite a large data frame.
Looking at your data, you seem to have intervals of values in count.bins. These are treated as factors by R (i.e. two intervals are only the same if their strings match completely), which is probably not what you want here. You could encode the start and end as separate (numerical) features, for example:
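A rough sketch (assuming count.bins holds interval labels such as "(100,200]"):
library(dplyr)
library(tidyr)
# Strip the brackets and parentheses, then split each label into numeric
# start and end columns.
dftrain <- dftrain %>%
  mutate(bins = gsub("[][()]", "", count.bins)) %>%
  separate(bins, into = c("bin.start", "bin.end"),
           sep = ",", convert = TRUE)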

Fastest way to reduce dimensionality for multi-classification in R

What I currently have:
I have a data frame with one column of factors called "Class", which contains 160 different classes. I have 1,200 variables, each an integer, with no individual cell exceeding a value of 1000 (if that helps). About 1/4 of the cells are zero. The total dataset contains 60,000 rows. I have already used the nearZeroVar and findCorrelation functions to get it down to this number of variables. In my particular dataset, some individual variables may appear unimportant by themselves but are likely to be predictive when combined with two other variables.
What I have tried:
First I tried just creating a random forest model, planning to use the varImp property to filter out the useless stuff, but gave up after letting it run for days. Then I tried using fscaret, but that ran overnight on an 8-core machine with 64GB of RAM (same as the previous attempt) and didn't finish. Then I tried feature selection using genetic algorithms; that ran overnight and didn't finish either. I was trying to make principal component analysis work, but for some reason couldn't. I have never been able to successfully do PCA within caret, which could be both my problem and my solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can use it to build a model. Maybe a good place to start would be an example of using PCA with a dataset like mine in caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I am in right now.
I have only done some toy examples myself.
Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset would give some reasonable classifications? It assumes all attributes are independent of each other, but experience shows (and many scholars say) that Naive Bayes results are often still useful despite those strong assumptions.
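A minimal sketch using the e1071 package (the package choice and the data frame name df are assumptions; klaR::NaiveBayes works similarly):
library(e1071)
# Naive Bayes is cheap to fit even at 60,000 rows x 1,200 columns.
nb_fit  <- naiveBayes(Class ~ ., data = df)
nb_pred <- predict(nb_fit, newdata = df)
table(predicted = nb_pred, actual = df$Class)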
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try processing your dataset with the free GUI-based data-mining tool Weka? There is an "attribute selection" tab where you have several algorithms (or algorithm combinations) for removing irrelevant attributes at your disposal. That is an art, and the results are not so easy to interpret, though.
Read this pdf as an introduction and see this video for a walk-through and an introduction to the theoretical approach.
The videos assume familiarity with Weka, but maybe it still helps.
There is an RWeka interface but it's a bit laborious to install, so working with the Weka GUI might be easier.
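As for the caret/PCA starting point asked for above, a minimal sketch (the data frame name df and the 95% variance threshold are assumptions):
library(caret)
# Fit the PCA rotation on the predictors only, keeping enough components
# to explain 95% of the variance, then apply it to obtain a reduced set.
predictors <- df[, setdiff(names(df), "Class")]
pp         <- preProcess(predictors, method = c("center", "scale", "pca"),
                         thresh = 0.95)
df_reduced <- predict(pp, predictors)
df_reduced$Class <- df$Class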

Random forest on a big dataset

I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression purposes. Unfortunately, I get an "Error in matrix(0, n, n) : too many elements specified" error when trying to do the whole thing at once, and "cannot allocate enough memory" kinds of errors when running it on a subset of the data -- down to 10,000 or so observations.
Seeing that there is no chance I can add more RAM to my machine, and that random forests are very suitable for the type of process I am trying to model, I'd really like to make this work.
Any suggestions or workaround ideas are much appreciated.
You're likely asking randomForest to create the proximity matrix for the data, which, if you think about it, will be insanely big: 1 million x 1 million. A matrix of this size would be required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this, as the package author states that the only place in the entire source code where "n, n)" is found is in calculating the proximity matrix.
But it's hard to help more, given that you've provided no details about the actual code you're using.
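If you are calling randomForest() directly, a minimal sketch along these lines (the objects x and y and the sampsize value are assumptions) keeps memory bounded:
library(randomForest)
# proximity = FALSE (the default) avoids the 1M x 1M proximity matrix;
# sampsize caps how many rows each tree is grown on.
rf <- randomForest(x = x, y = y,
                   ntree     = 500,
                   sampsize  = 100000,
                   proximity = FALSE,
                   do.trace  = 10)   # report progress every 10 trees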
