Cleaning unnecessary variables in big data by using R

I have a data set which contains 163 columns (variables) and 199,566 rows (observations).
How can I eliminate redundant data? Can I do this by using the normal distribution?

Maybe try dimensionality reduction methods such as PCA. It will help you reduce the number of columns, which, if I understand correctly, is what you want to achieve.
If you haven't used these techniques before, you will probably have to read more about what exactly they do, but the above will get you started.
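For instance, a minimal PCA sketch in base R, assuming your 163 columns are numeric and sit in a data frame df (the cutoff of 20 components is just a placeholder):
pca <- prcomp(df, center = TRUE, scale. = TRUE)  # scale so large-unit columns don't dominate
summary(pca)                   # proportion of variance explained per component
reduced <- pca$x[, 1:20]       # keep, say, the first 20 component scores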

Related

Keras predict repeated columns

I have a question about keras model code in R. I have finished training the model and need to predict. Predicting a single row is very fast, but my data has 2,000,000,000 rows and nearly 200 columns, with a structure like the attached image.
[Image: Datastructure]
I don't know if anyone has any suggestions on which method to use so that predict can run quickly and use less memory. To predict, I created matrices according to the table as shown, each of dimension 200,000 x 200. Then I use sapply to predict over all the remaining matrices. However, even though predict is fast for each matrix, creating the matrices is slow, so it makes the model run two or three times as long, and that is not counting the sapply step. I wonder if keras has a "smart" way to know that in each of these matrices the last N columns are exactly the same? I googled and saw someone mention RepeatVector, but I don't quite understand it, and it seems to be used only for training? I already have the model and just need to predict.
Thank you so much, everyone!
One of the most performant ways to feed keras models locally is by creating a tf.data.Dataset object. Please take a look at the tfdatasets R package for guides and example usage.
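A hedged sketch of that approach, assuming a trained keras model object model and a numeric matrix x; the dataset streams batches so the full input never has to be materialized at once (the batch size is a placeholder):
library(keras)
library(tfdatasets)
ds <- tensor_slices_dataset(x) %>%   # one dataset element per row of x
  dataset_batch(200000L)             # large batches keep predict() fast
preds <- predict(model, ds)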

Which technique is best for finding an optimal split on numeric data to reduce error on a group?

I have a dataset that contains a numeric variable and a binary categorical variable. I want to find the optimal split on the numeric variable that can be used to quickly classify the categories while limiting the amount of error.
I have used a decision tree to do this but am wondering if there are better optimization methods out there.
I would like to be able to do this in R but am having trouble writing the function for it.
Please help me understand this simple optimisation problem. Thanks!
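For reference, the decision-tree route the question mentions reduces to a single stump; a hedged rpart sketch, assuming a data frame df with numeric predictor x and binary factor y (all placeholder names):
library(rpart)
stump <- rpart(y ~ x, data = df,
               control = rpart.control(maxdepth = 1, cp = 0))  # one split only
stump   # the printed split is the impurity-minimizing cut point on x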

How can I efficiently best fit large data with large numbers of variables

I have a data set with 10 million rows and 1,000 variables, and I want to find the best fit of those variables so I can estimate a new row's value. I am using Jama's QR decomposition to do it (better suggestions welcome, but I think this question applies to any implementation). Unfortunately that takes too long.
It appears I have two choices. Either I can split the data into, say, 1,000 chunks of 10,000 rows each and then average the results, or I can add up every, say, 100 rows and feed those combined rows into the QR decomposition.
One or both approaches may be mathematical disasters, and I'm hoping someone can point me in the right direction.
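On the chunking idea specifically: exact incremental QR fits exist, so chunk-wise updating need not be a disaster. A hedged R sketch with the biglm package, where chunk1 and chunk2 are placeholder data frames holding successive blocks of rows:
library(biglm)
fit <- biglm(y ~ ., data = chunk1)   # fit on the first chunk
fit <- update(fit, chunk2)           # fold in the next chunk; repeat as needed
coef(fit)                            # same coefficients as fitting all rows at once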
For such big datasets, I'd say you need to use HDF5 (Hierarchical Data Format v5). It has C/C++ implementation APIs as well as bindings for other languages, and it uses B-trees to index datasets.
HDF5 is supported by Java, MATLAB, Scilab, Octave, Mathematica, IDL, Python, R, and Julia.
Unfortunately I don't know more than this about it, but I'd suggest you begin your research with a simple exploratory internet search.
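In R, a hedged sketch with the hdf5r package (one of several bindings; the file and dataset names are placeholders):
library(hdf5r)
f <- H5File$new("bigdata.h5", mode = "w")
f[["X"]] <- matrix(rnorm(1e4), nrow = 100)   # write a toy dataset
part <- f[["X"]][1:10, ]                     # read back only the rows you need
f$close_all()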

Fastest way to reduce dimensionality for multi-classification in R

What I currently have:
I have a data frame with one column of factors called "Class" which contains 160 different classes. I have 1,200 variables, each one an integer, with no individual cell exceeding the value of 1,000 (if that helps). About a quarter of the cells are zero. The total dataset contains 60,000 rows. I have already used the nearZeroVar and findCorrelation functions to get it down to this number of variables. In my particular dataset, some individual variables may appear unimportant by themselves but are likely to be predictive when combined with two other variables.
What I have tried:
First I tried just creating a random forest model, planning to use the varImp property to filter out the useless stuff; I gave up after letting it run for days. Then I tried fscaret, but that ran overnight on an 8-core machine with 64 GB of RAM (same as the previous attempt) and didn't finish. Then I tried feature selection using genetic algorithms. That ran overnight and didn't finish either. I was trying to make principal component analysis work, but for some reason couldn't. I have never been able to successfully do PCA within caret, which could be both my problem and my solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can make it usable for creating a model. Maybe a good place to start would be an example of using PCA with a dataset like mine using Caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I am in right now.
I have only done some toy examples too. Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset would give some reasonable classifications? It assumes all attributes are independent of each other, but experience shows (and many scholars say) that Naive Bayes results are often still useful despite the strong assumptions.
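A hedged sketch, assuming a data frame df whose factor column Class holds the 160 labels and whose other 1,200 integer columns are the predictors:
library(e1071)
nb   <- naiveBayes(Class ~ ., data = df)   # one class-conditional model per attribute
pred <- predict(nb, newdata = df[1:5, ])   # predicted classes for a few rows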
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try processing your dataset with the free GUI-based data-mining tool Weka? There is an "attribute selection" tab where you have several algorithms (or algorithm combinations) for removing irrelevant attributes at your disposal. That is an art, though, and the results are not so easy to interpret.
Read this pdf as an introduction, and see this video for a walk-through and an introduction to the theoretical approach.
The videos assume familiarity with Weka, but maybe they still help.
There is an RWeka interface, but it is a bit laborious to install, so working with the Weka GUI might be easier.
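Since the question specifically asked for a PCA-in-caret starting point, here is a hedged sketch, assuming the predictors are in a matrix x (60,000 x 1,200) and the 160-level factor is y; the 95% variance threshold and the model choice are placeholders:
library(caret)
pp  <- preProcess(x, method = c("center", "scale", "pca"), thresh = 0.95)
xpc <- predict(pp, x)   # scores on the components that cover 95% of the variance
fit <- train(x = xpc, y = y, method = "rpart")   # any caret model on the reduced data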

Princomp error in R: covariance matrix is not non-negative definite

I have this script which does a simple PCA on a number of variables and at the end attaches two coordinates and two other columns (presence, NZ_Field) to the output file. I have done this many times before, but now it gives me this error:
covariance matrix is not non-negative definite
I understand that it means there are negative eigenvalues. I looked at similar posts which suggest using na.omit, but it didn't work.
I have uploaded the "biodata.Rdata" file here:
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
I am pretty sure it is not because of missing values in the data, because I have used the same data with different "presence" and "NZ_Field" columns.
Any help is highly appreciated.
load("biodata.rdata")
# save data separately
coords   <- biodata[, 1:2]
biovars  <- biodata[, 3:21]
presence <- biodata[, 22]
NZ_Field <- biodata[, 23]
# do PCA
bpc <- princomp(biovars, cor = TRUE)
# re-attach auxiliary data: coordinates, presence, and NZ location
PCresults <- cbind(coords, bpc$scores[, 1:3], presence, NZ_Field)
write.table(PCresults, file = "hlb_pca_all.txt", sep = ",", row.names = FALSE)
This does appear to be an issue with missing data, so there are a few ways to deal with it. One way is to manually do listwise deletion on the data before running the PCA, which in your case would be:
biovars <- biovars[complete.cases(biovars), ]
The other option is to use another package; psych seems to work well here, and you can use principal(biovars). While the output is a bit different, it does work, using pairwise deletion, so it basically comes down to whether you want pairwise or listwise deletion. Thanks!
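A hedged sketch of that psych alternative; principal() builds the correlation matrix from pairwise-complete observations, so rows with scattered NAs are not dropped wholesale (the choice of 3 components mirrors the script above):
library(psych)
pc <- principal(biovars, nfactors = 3, rotate = "none")   # unrotated components
head(pc$scores)   # component scores, analogous to bpc$scores[, 1:3]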
