How to train and fit an LSTM model on an encrypted dataset

I am a beginner in deep learning, and I am trying to do a project on detecting encrypted malicious traffic using Long Short-Term Memory (LSTM). I have two datasets, one benign and one malicious. What dependencies do I need to import, and how do I train and fit the model on the data? Do I need to combine the two datasets and train the model on the result, since an LSTM learns the features automatically? Any help is deeply appreciated.
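One way to read the "combine the two datasets" part of the question: give every sample a label (0 for benign, 1 for malicious), merge the two sets, shuffle, and hold out a test portion before training anything. Below is a minimal sketch of that step using only the Python standard library. The toy `benign` and `malicious` lists and the helper name `build_labeled_split` are made up for illustration; the LSTM itself (for example from Keras or PyTorch) would then be trained on `train` only.

```python
import random

def build_labeled_split(benign, malicious, test_fraction=0.2, seed=0):
    """Label, merge, shuffle, and split two datasets for binary classification."""
    # Attach the class label to every sample: 0 = benign, 1 = malicious.
    labeled = [(x, 0) for x in benign] + [(x, 1) for x in malicious]
    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(labeled)
    n_test = int(len(labeled) * test_fraction)
    # The first n_test shuffled samples become the test set, the rest the training set.
    return labeled[n_test:], labeled[:n_test]

# Hypothetical toy data: each sample is a short sequence of packet sizes.
benign = [[60, 1500, 1500], [60, 60, 52], [1500, 1500, 40], [52, 60, 1500]]
malicious = [[40, 40, 40], [1200, 40, 40], [40, 1200, 40], [40, 40, 1200]]

train, test = build_labeled_split(benign, malicious, test_fraction=0.25)
print(len(train), len(test))
```

Note that a plain random split like this is not stratified, so with very small or imbalanced data the class ratio in `test` can differ from the ratio in `train`.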

Related

BERT classification on imbalanced or small dataset

I have a large, unlabeled corpus, and I trained my BERT tokenizer on it.
Then I want to build a BertModel to do binary classification on a labeled dataset. However, this dataset is highly imbalanced, 1:99. So my questions are:
Would BertModel perform well on an imbalanced dataset?
Would BertModel perform well on a small dataset? (Fewer than 500 data points; I bet it would not.)
The objective of transfer learning with pre-trained models partially answers your questions. BertModel is pre-trained on a large corpus and, when adapted to a task-specific corpus, usually performs better than models without pre-training (for example, a simple LSTM trained for the classification task).
BERT has been shown to perform well when fine-tuned on a small task-specific corpus (this answers your question 2). However, the level of improvement also depends on the domain and task, and on how closely the pre-training data is related to your target dataset.
In my experience, pre-trained BERT fine-tuned on the target task performs much better than other DNNs such as LSTMs and CNNs when the datasets are highly imbalanced. However, this again depends on the task and data. 1:99 is a huge imbalance, which might require data balancing techniques.
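The answer does not say which balancing technique it has in mind. One common option, shown here only as an illustration, is to weight the loss per class with the inverse-frequency heuristic `n_samples / (n_classes * class_count)`, so that the rare class contributes as much to the loss as the frequent one. The label list below is hypothetical and simply reproduces the 1:99 ratio from the question.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels with a 1:99 ratio: 5 positives, 495 negatives.
labels = [1] * 5 + [0] * 495
weights = balanced_class_weights(labels)
print(weights)
```

With these weights the total weight of each class is the same (250 here), which is the point of the heuristic. Resampling (oversampling the minority class or undersampling the majority class) is another route to a similar effect.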

Function to use a trained neural network to predict tomorrow's temperature

I am in the process of creating a neural network with the aim of predicting tomorrow's temperature in my area. I have loaded the data, normalized it, divided it into a train and a test set, created a NN using the neural net library, and predicted the temperatures in the test set with 78% accuracy.
My system parts are named as follows:
neural network <- nn
neural network output <- nn.results
Data <- data frame with all the inputs (Humidity, air pressure, averages of previous temperatures, etc.)
Predicted data <- results$predicted (for the test set)
What function/code would I use to, instead of predicting data in the test set, actually predict the temperature of tomorrow?
I hope the question is not all too stupid, any help would be much appreciated though. Sorry that there is not all too much code, but it is difficult with neural networks to just give snippets.
Thank you!
Edit: I think the predict function is the way to go. However, it requests “new data”, and I would like to predict solely with old data. Would special formatting of the test set do the trick?
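One possible reading of “new data” here: it does not have to be data from the future. It can be a single row built from the most recent observations already in the data frame, scaled exactly like the training data. The sketch below illustrates that idea in plain Python rather than R. Every name in it (`feature_ranges`, `target_range`, `predict_tomorrow`, `dummy_model`) is hypothetical, the ranges are invented, and `dummy_model` is only a stand-in for the trained network.

```python
def min_max_scale(value, lo, hi):
    """Map a raw value to [0, 1] using the training minimum and maximum."""
    return (value - lo) / (hi - lo)

def min_max_unscale(scaled, lo, hi):
    """Map a scaled value back to the original units."""
    return scaled * (hi - lo) + lo

# Hypothetical (min, max) ranges, taken from the TRAINING data only.
feature_ranges = {
    "humidity": (20.0, 100.0),
    "pressure": (980.0, 1040.0),
    "avg_temp_3d": (-10.0, 35.0),
}
target_range = (-10.0, 35.0)

def predict_tomorrow(model, todays_inputs):
    """Scale today's row like the training data, predict, and undo the scaling."""
    row = [min_max_scale(todays_inputs[name], *feature_ranges[name])
           for name in feature_ranges]
    scaled_prediction = model(row)
    return min_max_unscale(scaled_prediction, *target_range)

# Stand-in for the trained network: any callable from a scaled row to a scaled output.
def dummy_model(row):
    return sum(row) / len(row)

today = {"humidity": 60.0, "pressure": 1010.0, "avg_temp_3d": 12.5}
print(predict_tomorrow(dummy_model, today))
```

The design point is that the newest row of inputs goes through the same normalization as the training set, and the network's output is converted back to degrees afterwards.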

Machine learning project: split training/test sets before or after exploratory data analysis?

Is it best to split your data into training and test sets before doing any exploratory data analysis, or do all exploration based solely on training data?
I'm working on my first full machine learning project (a recommendation system for a course capstone project) and am looking for clarification on order of operations. My rough outline is to import and clean, do exploratory analysis, train my model, and then evaluate on a test set.
I am doing exploratory data analysis now - nothing special initially, just starting with variable distributions and whatnot. But I am not sure: should I split my data into training and test sets before or after exploratory analysis?
I don't want to potentially contaminate algorithm training by inspecting the test set. However, I also don't want to miss visual trends that might reflect real signal that my poor human eye might not see after filtering, and thus potentially miss investigating an important and relevant direction while designing my algorithm.
I checked other threads, like this, but the ones I found seem to ask more about things like regularization or actual manipulation of the original data. The answers I found were mixed but prioritized splitting first. However, I don't plan to do any actual manipulation of the data before splitting it (beyond inspecting distributions and potentially doing some factor conversions).
What do you do in your own work and why?
Thanks for helping a new programmer!
To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).
Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.
Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.
Having trained our final model, we often want to have an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance using a new batch of data, the testing data.
This discussion answers your question: we should not use the testing (or validation) data set for exploratory data analysis. If we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that work well for the testing data. At the same time, we would lose the ability to get an unbiased estimate of our model's performance.
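As a rough illustration of the order of operations described above, the split can be done first and the exploration restricted to the training portion. This is a minimal standard-library sketch; the function name `train_val_test_split`, the 60/20/20 proportions, and the toy `data` are assumptions, not something the answer prescribes.

```python
import random

def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut the data into train, validation, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# Hypothetical dataset of 100 records.
data = list(range(100))
train, val, test = train_val_test_split(data)

# Exploratory analysis and feature engineering would look at `train` only;
# `val` is for choosing between models, `test` for the final estimate.
print(len(train), len(val), len(test))
```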
I would take the problem the other way round: is it bad to use the test set?
The objective of modeling is to end up with a model with low variance (and small bias). That is why the test set keeps a portion of the data aside to assess how your model behaves with new data (i.e. its variance). If you use the test set during modeling, you are left with nothing to do that, and you are overfitting your data.
The objective of EDA is to understand the data you're working with: the distributions of features, their relationships, their dynamics, etc. If you leave your test set in the data, is there a risk of "overfitting" your understanding of the data? If that were the case, you would observe on, say, 70% of your data some properties that are not valid for the remaining 30% (the test set). Knowing that the split is random, this should not happen unless you have been extremely unlucky.
From my understanding of the machine learning pipeline, exploratory data analysis should be done before splitting the data into train and test sets.
Here are my reasons:
The data may not be clean in the beginning. It might have missing values, mismatched data types, and outliers.
We need to understand how every feature relates to the target variable in the dataset. This helps in understanding the importance of each feature with respect to the business problem, and in deriving additional features.
Data visualization also helps to gain insights from the dataset.
Once the above operations are done, we can split the dataset into train and test sets, because the features must be the same in both.

Training model with batches of training data-R

I am new to R and data analysis.
I am hitting the wall as my hardware is not able to process the whole training set for computing a model.
Using the caret package, can I train the model by breaking the training data into batches, i.e. training the model on the first 1000 rows, then on the next 1000 rows, and so on? I would then be able to trim the model at every stage to save memory.
Will the model be “updated” with every batch of training data it is fed?
I believe this method is known as sequential training, but I wasn’t able to find a practical example or case study.
Hope to get some guidance on this. Thanks.
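To make the idea of a model being “updated” batch by batch concrete, here is a minimal sketch in plain Python (not R, and not the caret API): a linear model `y ≈ w*x + b` whose two parameters are adjusted by one gradient step per batch, so only one batch has to be processed at a time. The function names, the synthetic data, the learning rate, and the number of passes are all assumptions chosen for illustration. Whether a given caret model can be updated this way is a separate question that this sketch does not answer.

```python
def update_on_batch(weights, batch, learning_rate=0.1):
    """One gradient-descent step on mean squared error, using only this batch."""
    w, b = weights
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return (w - learning_rate * grad_w, b - learning_rate * grad_b)

def batches(rows, size):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Hypothetical noise-free data following y = 3x + 1.
rows = [(i / 100, 3 * (i / 100) + 1) for i in range(100)]

weights = (0.0, 0.0)
for epoch in range(500):                  # repeated passes over the data
    for batch in batches(rows, size=10):  # the model sees 10 rows at a time
        weights = update_on_batch(weights, batch)

print(weights)
```

The model state here is just the pair `(w, b)`, which is what gets carried from one batch to the next. Algorithms that cannot be expressed as this kind of incremental update generally need the whole training set at once.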

When to use test and training sets in Weka?

I've been working with Weka for a while now, and in my research on it, I find that a lot of code examples use test and training sets. For instance, with Discretization and Bayesian Networks, their examples are almost always shown using test and training sets. I may be missing some fundamental understanding of data processing here, but I don't understand why this seems to always be the case. I am using Discretization and Bayesian Networks in a project, and for both of them I have not used test or training sets and do not see why I would need to. I am performing cross-validation on the BayesNet, so I am testing its accuracy. Am I misunderstanding what test and training sets are used for? Oh, and please use the simplest terminology; I'm still not very experienced with the world of data processing.
The idea behind training and test sets is to test the generalization error. That is, if you used just one data set, you could achieve perfect accuracy by simply learning this set (this is what nearest neighbour classifiers do, IBk in Weka). In general, this is not what you want however -- the machine learning algorithm should learn the general concept behind the example data that you give it. A way of testing whether this happens is to use separate data for training and testing.
If you're using cross-validation, you're using separate training and test sets. This is simply a way of coming up with the partition of your entire data set into training and test. If you do 10-fold cross-validation, for example, your entire data is partitioned into 10 sets of equal size. Nine of these are combined and used for training, the remaining one for testing. Then the process is repeated with a different nine sets combined for training, and so on, until each of the ten partitions has been used for testing.
So training/test sets and cross-validation are conceptually doing the same thing; cross-validation simply takes a more rigorous approach by averaging over the entire data set.
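The partitioning described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Weka implements it. The function name `k_fold_indices` is made up, and real tools typically shuffle (and often stratify) the data before partitioning, which is omitted here.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds so sizes differ by at most one.
        end = start + fold_size + (1 if fold < remainder else 0)
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx
        start = end

folds = list(k_fold_indices(n_samples=50, k=10))
for train_idx, test_idx in folds[:2]:
    print(len(train_idx), len(test_idx))
```

Each sample lands in exactly one test fold, so after all ten rounds every sample has been used for testing once and for training nine times. The reported accuracy is then the average over the ten rounds.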
Training data refers to the data used to "build the model".
For example, if you are using the algorithm J48 (a tree classifier) to classify instances, the training data will be used to generate the tree that represents the "learned concept", which should be a generalization of the concept. This means that the learned rules, generated trees, adjusted neural network, or whatever else, will be able to take new (unseen) instances and classify them correctly (the "learned concept" does not depend on the training data).
The test set is a portion of the data that is used to test whether the model has learned the concept properly (it is independent of the training data).
In WEKA you can run an execution that splits your data set into training data (to build the tree in the case of J48) and test data (to test the model in order to determine that the concept has been learned). For example, you can use 60% of the data for training and 40% for testing (determining how much data is needed for training and testing is one of the key problems of data mining).
But I would recommend that you have a quick look at cross-validation, a robust testing method that is implemented in WEKA. It has been explained quite well here:
https://stackoverflow.com/a/10539247/1565171
If you have more questions just leave a comment.
