A large data set for KNN - bigdata

I want to apply a modified KNN that I have implemented for large data sets. I am trying to find a large data set (more than 20,000 rows) that works well for KNN, so that I can compare classic KNN with my own version. Any examples?

There must be many if you search around on the internet. The MNIST handwritten digit dataset is a good place to start: it has 70,000 labelled examples, and a carefully tuned KNN works quite well on it. It can be downloaded via the scikit-learn library.
>>> from sklearn.datasets import fetch_mldata
>>> # data_home is optional; it defaults to ~/scikit_learn_data
>>> mnist = fetch_mldata('MNIST original')
For more details, please refer to https://scikit-learn.org/0.19/datasets/mldata.html. (Note that fetch_mldata has been removed in recent scikit-learn releases; fetch_openml('mnist_784') is the current way to get the same data.)
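If you want to run the comparison in R instead (several of the related questions below are R-based), here is a hedged baseline sketch with class::knn. It assumes MNIST has already been loaded, by whatever route, into a 70,000 x 784 numeric matrix X and a label vector y; those names are placeholders.
library(class)
## Hedged sketch: classic-KNN baseline to compare a modified KNN against.
## Plain KNN on this many pixels is slow, so shrink the sample for a quick check.
set.seed(1)
train_idx <- sample(nrow(X), 20000)                 # > 20k rows, as requested
test_idx  <- sample(setdiff(seq_len(nrow(X)), train_idx), 5000)
pred <- knn(train = X[train_idx, ],
            test  = X[test_idx, ],
            cl    = factor(y[train_idx]),
            k     = 5)
mean(as.character(pred) == as.character(y[test_idx]))   # baseline accuracy
The same split and accuracy measure can then be reused for the modified KNN, so the two versions are compared on identical data.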

Related

How to run supervised ML models on a large dataset (15GB) in R?

I have a dataset (15 GB): 72 million records and 26 features. I would like to compare 7 supervised ML models on a classification problem: SVM, random forest, decision tree, naive Bayes, ANN, KNN and XGBoost. I created a sample set of 7.2 million records (10% of the entire set). Running models on the sample set (even feature selection) is already an issue: the processing time is very long. I use only RStudio at the moment.
I've been looking for an answer to my questions for days. I tried the following things:
- data.table - still not sufficient to reduce the processing time
- sparklyr - can't copy my dataset, because it's too large
I am looking for a free solution to my problem. Can someone please help me?
If you have access to Spark, you can use sparklyr to read the CSV file directly.
install.packages('sparklyr')
library(sparklyr)
## You'll have to connect to your Spark cluster, this is just a placeholder example
sc <- spark_connect(master = "spark://HOST:PORT")
## Read large CSV into Spark
sdf <- spark_read_csv(sc,
                      name = "my_spark_table",
                      path = "/path/to/my_large_file.csv")
## Take a look
head(sdf)
You can use dplyr functions to manipulate data (docs). To do machine learning, you'll need to use the sparklyr functions for SparkML (docs). You should be able to find almost all of what you want in sparklyr.
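For instance, here is a hedged sketch of a Spark-side train/test split plus one of your seven models (a random forest) via sparklyr. The column names label and x1–x3 are placeholders for your own response and features, and the exact ml_* signatures are worth checking against your sparklyr version.
library(dplyr)
## Split inside Spark so the full 72M rows never have to fit in R's memory
partitions <- sdf_random_split(sdf, training = 0.8, test = 0.2, seed = 42)
## Fit a Spark MLlib random forest through sparklyr
rf_model <- ml_random_forest(partitions$training,
                             label ~ x1 + x2 + x3,
                             type = "classification")
## Score the held-out partition, still inside Spark
scored <- ml_predict(rf_model, partitions$test)
head(scored)   # prediction columns appear alongside the original ones
sparklyr also ships evaluators such as ml_binary_classification_evaluator() if you want AUC rather than eyeballing the predictions.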
Try Google Colab. It can help you run models on your dataset more easily.
You should look into the disk.frame package.
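disk.frame keeps the data in chunks on disk and runs dplyr verbs chunk-wise, so only a slice of the 15 GB is in RAM at any time. A hedged sketch (the file path and column names are placeholders):
library(disk.frame)
library(dplyr)
setup_disk.frame()                      # start parallel workers
options(future.globals.maxSize = Inf)   # allow larger objects to be shipped to workers
## One-time conversion of the big CSV into an on-disk, chunked format
df <- csv_to_disk.frame("/path/to/my_large_file.csv",
                        outdir = "my_large_file.df")
## Chunk-wise dplyr: each chunk is loaded, filtered, and released in turn
df %>%
  filter(x1 > 0) %>%
  select(x1, x2, label) %>%
  collect() %>%        # bring only the (hopefully smaller) result into RAM
  head()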

How to use LSA for dimension reduction in text analytics with R

I am a beginner at data science, and I am working on a text analytics/sentiment analysis project with tweets.
What I have been trying to do is perform some dimension reduction on my training set of tweets, feed the reduced training set into a Naive Bayes learner, and use the learned model to predict the sentiment of the test set of tweets.
I have been following the steps in this article:
http://www.analyticskhoj.com/data-mining/text-analytics-part-iv-cluster-analysis-on-terms-and-documents-using-r/
Their explanation is a bit too brief for a beginner like me.
I have used lsa() to create what RStudio labels a "Large LSAspace (3 elements)". Following their example, I've created 3 more data frames:
lsa.train.tk = as.data.frame(lsa.train$tk)
lsa.train.dk = as.data.frame(lsa.train$dk)
lsa.train.sk = as.data.frame(lsa.train$sk)
When I view the lsa.train.tk data, it is a large numeric matrix (lsa.train.dk looks pretty similar), and lsa.train.sk is a short numeric vector.
My question is: how do I interpret this information? How can I use it to create something that I can feed into my Naive Bayes learner? I tried just using lsa.train.sk for the Naive Bayes learner, but I cannot think of any good explanation that would justify that. Any help would be much appreciated!
EDIT:
What I've done so far:
- turn everything into a term-document matrix
- pass the matrix into the Naive Bayes learner
- predict using the learned model
My problems are:
- accuracy is only 50%... and I realized that it labels everything as positive sentiment (so I would have gotten close to 0% accuracy if my test set contained only negative-sentiment tweets).
- the current code is not scalable. Since it uses large matrices, I can only handle up to 3.5k rows of data; more than that and my computer crashes. That is why I wanted to do dimension reduction, so that I can handle more data (such as 10k or 100k rows of tweets).
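A hedged sketch of how the three LSA components are typically wired into a classifier: the rows of dk are the training documents in the reduced k-dimensional space (sk holds the singular values, tk the term loadings), so dk is what gets fed to the learner, and new documents are folded into the same space before predicting. The objects tdm_train, tdm_test (term-document matrices with terms in rows, sharing the same vocabulary and row order) and train_labels (a factor of sentiments) are placeholders.
library(lsa)
library(e1071)
lsa.train <- lsa(tdm_train, dims = 50)   # keep 50 latent dimensions (tune this)
## Training documents in the reduced space: one row per document
train_feats <- as.data.frame(lsa.train$dk)
## Fold the test documents into the SAME space:
## d_hat = S^{-1} t(T) q for each test-document column q
fold_in_docs <- t(diag(1 / lsa.train$sk) %*% t(lsa.train$tk) %*% tdm_test)
test_feats   <- as.data.frame(fold_in_docs)
## Naive Bayes on the reduced features (Gaussian NB, since they are continuous)
nb   <- naiveBayes(x = train_feats, y = train_labels)
pred <- predict(nb, test_feats)
Because each document is now only k numbers instead of a full vocabulary row, the matrices stay small enough to go well beyond 3.5k tweets.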

Train SVM on a very large dataset stored on hard drive

There is a very large self-collected dataset of size [2000000 x 12672], where the rows are the instances and the columns are the features. The dataset occupies ~60 GB on the local hard disk. I want to train a linear SVM on it. The problem is that I have only 8 GB of RAM, so I cannot load all the data at once. Is there any way to train an SVM on such a large dataset? I generate the dataset myself, and it is currently in HDF5 format.
Thanks
Welcome to machine learning! One of the hard things about working in this space is the compute requirements. There are two main kinds of algorithms, on-line and off-line.
Online: supports feeding in examples one at a time, each one improving the model slightly
Offline: supports feeding in the entire dataset at once, achieving higher accuracy than an On-line model
Many typical algorithms have both on-line and off-line implementations, but an SVM is not one of them. To the best of my knowledge, SVMs are traditionally an off-line-only algorithm. The reason for this lies in the fine details of how the dataset is "shattered". I won't go too far into the math here, but if you read into it, it should become apparent.
It's also worth noting that the complexity of an SVM is somewhere between n^2 and n^3, meaning that even if you could load everything into memory it would take ages to actually train the model. It's very typical to test with a much smaller portion of your dataset before moving to the full dataset.
When moving to the full dataset you would have to run this on a much larger machine than your own, but AWS should have something large enough for you, though at your size of data I highly advise using something other than an SVM. At large data sizes, neural net approaches really shine, and can be trained in a more realistic amount of time.
As alluded to in the comments, there's also the concept of an out-of-core algorithm that can operate directly on objects stored on disk. The only group I know with a good offering of out-of-core algorithms is dato. It's a commercial product, but might be your best solution here.
A stochastic gradient descent approach to SVM could help, as it scales well and avoids the n^2 problem. An implementation available in R is RSofia, which was created by a team at Google and is discussed in Large Scale Learning to Rank. In the paper, they show that compared to a traditional SVM, the SGD approach significantly decreases training time (this is due to (1) the pairwise learning method and (2) only a subset of the observations being used to train the model).
Note that RSofia is a little more bare bones than some of the other SVM packages available in R; for example, you need to do your own centering and scaling of features.
As to your memory problem, it'd be a little surprising if you needed the entire dataset - I would expect that you'd be fine reading in a sample of your data and then training your model on that. To confirm this, you could train multiple models on different samples and then estimate performance on the same holdout set - the performance should be similar across the different models.
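A hedged sketch of how those two pieces might be wired up: read a manageable random subsample straight from the HDF5 file, then fit the SGD SVM on it. The dataset names "data" and "labels", the ±1 label coding, and the RSofia argument names are assumptions to check against the rhdf5 and RSofia documentation.
library(rhdf5)    # Bioconductor package for reading HDF5 from R
library(RSofia)
## Read only a random subset of rows from the 2,000,000 x 12,672 matrix;
## adjust the sample size to what fits in your 8 GB of RAM.
set.seed(1)
rows <- sort(sample(2000000, 20000))
X <- h5read("my_data.h5", "data",   index = list(rows, NULL))
y <- h5read("my_data.h5", "labels", index = list(rows))   # assumed +1/-1 labels
## RSofia does not center/scale features for you
X <- scale(X)
d <- data.frame(y = y, X)
fit  <- sofia(y ~ ., data = d, learner_type = "sgd-svm")
pred <- predict(fit, newdata = d, prediction_type = "linear")
Training several such models on different subsamples and comparing them on a common holdout, as suggested above, would tell you whether the subsample is representative enough.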
You don't say why you want a linear SVM, but if you can consider another model that often gives superior results, then check out the hpelm Python package. It can read an HDF5 file directly. You can find it at https://pypi.python.org/pypi/hpelm. It trains on segmented data, which can even be pre-loaded (asynchronously) to speed up reading from slow hard disks.

How can I speed up a topic model in R?

Background
I am trying to fit a topic model with the following data and specification: documents = 140,000, words = 3,000, and topics = 15. I am using the package topicmodels in R (3.1.2) on a Windows 7 machine (24 GB RAM, 8 cores). My problem is that the computation just goes on and on, without any "convergence" being produced.
I am using the default options in LDA() function in topicmodels:
Run model
dtm2.sparse_TM <- LDA(dtm2.sparse, 15)
The model has been running for about 72 hours – and still is as I am writing.
Question
So, my questions are: (a) is this normal behaviour; (b) if not, do you have any suggestions on what to do; (c) if yes, how can I substantially improve the speed of the computation?
Additional information: the original data contains not 3,000 words but about 3.7 million. When I ran that (on the same machine) it did not converge, not even after a couple of weeks. So I ran it with 300 words and only 500 documents (randomly selected), and then it all worked fine. I used the same number of topics and default values as before for all models.
So for my current model (see my question) I removed sparse terms with the help of the tm package.
Remove sparse terms
dtm2.sparse <- removeSparseTerms(dtm2, 0.9)
Thanks for the input in advance
Adel
You need to use online variational Bayes, which can easily handle training on that number of documents. In online variational Bayes you train the model using mini-batches of your training samples, which increases the convergence speed dramatically (see the SGD link below).
For R, you can use this package. Here you can read more about it and how to use it. Also look at this paper, since that R package implements the method used in it. If possible, port their Python code (uploaded here) to R. I highly recommend the Python code, since I had a great experience with it on a project I recently worked on. When the model is learned, you can save the topic distributions for future use, and feed them to onlineldavb.py along with your test samples to integrate over the topic distributions given those unseen documents. With online variational Bayes I trained an LDA on a dataset of 500,000 documents and a 5,400-word vocabulary in less than 15 hours.
Sources
Variational Bayesian Methods
Stochastic Gradient Descent (SGD)
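If you would rather stay entirely in R, a different fast LDA implementation (WarpLDA, a parallel collapsed-sampling method rather than online variational Bayes) is available in the text2vec package. A hedged sketch, including the conversion from the tm DocumentTermMatrix; the argument names are from memory of the text2vec docs and worth double-checking:
library(text2vec)
library(Matrix)
## Convert the tm DocumentTermMatrix (a simple_triplet_matrix) to the
## sparse dgCMatrix format text2vec expects.
dtm_sparse <- sparseMatrix(i = dtm2.sparse$i,
                           j = dtm2.sparse$j,
                           x = dtm2.sparse$v,
                           dims = dim(dtm2.sparse),
                           dimnames = dimnames(dtm2.sparse))
lda_model <- text2vec::LDA$new(n_topics = 15,
                               doc_topic_prior = 0.1,
                               topic_word_prior = 0.01)
## fit_transform returns the documents x topics distribution
doc_topics <- lda_model$fit_transform(dtm_sparse,
                                      n_iter = 1000,
                                      convergence_tol = 0.001,
                                      progressbar = FALSE)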

R knn large dataset

I'm trying to use knn in R (I have used several packages: knnflex, class) to predict the probability of default based on 8 variables. The dataset is about 100k lines of 8 columns, but my machine seems to be having difficulty with a sample of 10k lines. Any suggestions for doing knn on a dataset larger than 50 lines (i.e. larger than iris)?
EDIT:
To clarify there are a couple issues.
1) The examples in the class and knnflex packages are a bit unclear and I was curious if there was some implementation similar to the randomForest package where you give it the variable you want to predict and the data you want to use to train the model:
RF <- randomForest(x, y, ntree, type,...)
then turn around and use the model to predict data using the test data set:
pred <- predict(RF, testData)
2) I'm not really understanding why knn wants training AND test data for building the model. From what I can tell, the package creates a matrix of roughly nrow(trainingData)^2 entries, which also seems to be an upper limit on the size of the data it can predict on. I created a model using 5000 rows (above that number I got memory allocation errors) and was unable to predict test sets of more than 5000 rows. Thus I would need to either:
a) find a way to use > 5000 lines in a training set
or
b) find a way to use the model on the full 100k lines.
The reason that knn (in class) asks for both the training and test data is that if it didn't, the "model" it would return would simply be the training data itself.
The training data is the model.
To make predictions, knn calculates the distance between a test observation and each training observation (although I suppose there are some fancy versions for insanely large data sets that don't check every distance). So until you have test observations, there isn't really a model to build.
The ipred package provides functions that appear structured as you describe, but if you look at them, you'll see that there is basically nothing happening in the "training" function. All the work is in the "predict" function. And those are really intended as wrappers to be used for error estimation using cross validation.
As far as limitations on the number of cases, that will be dependent on how much physical memory you have. If you're getting memory allocation errors, then you either need to reduce your RAM usage elsewhere (close apps, etc), buy more RAM, buy a new computer, etc.
The knn function in class runs fine for me with training and test data sets of 10k rows or more, although I have 8 GB of RAM. Also, I suspect that knn in class will be faster than in knnflex, but I haven't done extensive testing.
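One practical workaround for point 2 in the question: score the full 100k rows in chunks against a fixed training sample, so that only an nrow(train) x chunk_size distance computation is in flight at any time. A hedged sketch, where train, train_labels (a factor), and full_data are placeholders for your own objects:
library(class)
chunk_size <- 5000
chunks <- split(seq_len(nrow(full_data)),
                ceiling(seq_len(nrow(full_data)) / chunk_size))
## Predict each chunk against the same training sample, then stitch together
pred <- unlist(lapply(chunks, function(ix) {
  as.character(knn(train = train,
                   test  = full_data[ix, , drop = FALSE],
                   cl    = train_labels,
                   k     = 5))
}))
Since the training data is the model, nothing is lost by scoring the test rows in batches; the predictions are identical to what a single huge call would return.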
