Fastest way to reduce dimensionality for multi-classification in R

What I currently have:
I have a data frame with one column of factors called "Class" which contains 160 different classes. I have 1200 variables, each one being an integer and no individual cell exceeding the value of 1000 (if that helps). About 1/4 of the cells are the number zero. The total dataset contains 60,000 rows. I have already used the nearZeroVar function, and the findCorrelation function to get it down to this number of variables. In my particular dataset some individual variables may appear unimportant by themselves, but are likely to be predictive when combined with two other variables.
What I have tried:
First I tried simply fitting a random forest model, planning to use variable importance (varImp) to filter out the useless variables, but gave up after letting it run for days. Then I tried fscaret, but it ran overnight on an 8-core machine with 64GB of RAM (the same machine as the previous attempt) and didn't finish. Then I tried:
Feature Selection using Genetic Algorithms. That ran overnight and didn't finish either. I was trying to make principal component analysis work, but for some reason couldn't. I have never been able to do PCA successfully within caret, which could be both my problem and the solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can make it usable for creating a model. Maybe a good place to start would be an example of using PCA with a dataset like mine using Caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I am in right now.

I have done only some toy examples too.
Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset will give some reasonable classifications? It assumes that all attributes are independent of each other given the class, but experience shows (and many scholars say) that Naive Bayes results are often still useful despite that strong assumption.
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try to process your dataset with the free GUI-based data-mining tool Weka? There is an "attribute selection" tab where you have several algorithms (or algorithm-combinations) for removing irrelevant attributes at your disposal. That is an art, and the results are not so easy to interpret, though.
Read this pdf as an introduction and see this video for a walk-through and an introduction to the theoretical approach.
The videos assume familiarity with Weka, but maybe it still helps.
There is an RWeka interface but it's a bit laborious to install, so working with the Weka GUI might be easier.
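On the PCA route the question asks about, a minimal sketch might look like the following. It uses base R's prcomp() on simulated integer data rather than the real 60,000 x 1200 table; the sizes, the 95% variance threshold, and all object names are assumptions for illustration. In caret, preProcess() with method = "pca" and a thresh argument performs a comparable centering, scaling, and projection step.

```r
set.seed(42)
n <- 500   # rows (assumed; the real data has 60,000)
p <- 40    # variables (assumed; the real data has 1200)

# Simulated stand-in: non-negative integer columns driven by 5 latent factors
latent   <- matrix(rnorm(n * 5), n, 5)
loadings <- matrix(rnorm(5 * p), 5, p)
noise    <- matrix(rnorm(n * p, sd = 5), n, p)
x <- matrix(round(pmax(50 + 10 * latent %*% loadings + noise, 0)), n, p)

# Center and scale every column, then rotate onto principal components
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Keep the smallest number of components explaining at least 95% of the variance
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.95)[1]
scores <- pca$x[, 1:k, drop = FALSE]

# New rows must be projected with the same rotation before prediction
new_scores <- predict(pca, newdata = x[1:5, , drop = FALSE])[, 1:k, drop = FALSE]
dim(scores)
```

The reduced matrix `scores` would then replace the original predictors when fitting a classifier. PCA never looks at the Class column, so the retained components are the directions of highest variance, which are not guaranteed to be the most predictive ones.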

Related

Analysing vocal similarity of little owls using warbleR in R

I am struggling a bit with an analysis I need to do. I have collected data consisting of little owl calls that were recorded along transects. I want to analyse these recordings for similarity, in order to see which recorded calls are from the same owls and which are from different owls. In that way I can make an estimate of the size of the population at my study area.
I have done a bit of research and the package warbleR seems to be suitable for this. However, I am far from an R expert and am struggling a bit with how to go about this. Do any of you have experience with these types of analyses and maybe have example scripts? It seems to me that I could use the function cross_correlation and maybe run a PCA. However, the warbleR vignette I looked at only does this for different types of calls and not for the same type of call from different individuals, so I am not sure if it would work.
To be able to run analyses with warbleR you need to input the data using the "selection_table" format. Take a look at the example data "lbh_selec_table" to get a sense of the format:
library(warbleR)
data(lbh_selec_table)
head(lbh_selec_table)
The whole point of these objects is to tell R the time location in your sound files (in seconds) of the signals you want to analyze. Take a look at this link for more details on this object structure and how to import it into R.
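As a rough illustration of that structure, a selection table can be sketched as a plain data frame with one row per call. The column names below follow those in lbh_selec_table; the file names and times are made up, and the object name owl_selections is hypothetical.

```r
# Hypothetical selection table: file names and times are invented for illustration
owl_selections <- data.frame(
  sound.files = c("transect1.wav", "transect1.wav", "transect2.wav"),
  selec       = c(1, 2, 1),            # selection number within each sound file
  start       = c(2.15, 10.40, 0.85),  # start of the call, in seconds
  end         = c(2.90, 11.20, 1.60),  # end of the call, in seconds
  stringsAsFactors = FALSE
)

# Basic sanity checks before handing the table to any analysis function
owl_selections$duration <- owl_selections$end - owl_selections$start
stopifnot(all(owl_selections$duration > 0))
stopifnot(!any(duplicated(owl_selections[, c("sound.files", "selec")])))

owl_selections
```

With the real recordings in place, a table like this is what warbleR's functions would read the call positions from.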

Is repeated anova what i am looking for?

I'm studying the NDVI (normalized difference vegetation index) behaviour of some soils and cultivars. My database has 33 days of acquisition, 17 kinds of soil and 4 different cultivars. I have built it in two different layouts, which you can see attached. I am running into errors with both layouts.
The first question is: is repeated-measures ANOVA the correct way of analyzing my data? I want to see if there are any differences between the behaviours of the different cultivars and the different soils. I've run an ANOVA for each day and there are statistical differences on each day, but those results are not interesting globally because I would like to investigate the behaviour over the whole year.
The second question is: how can I perform it? I've tried different tutorials but I got unexpected errors or didn't manage to complete the analysis.
Last but not least: I'm coding with RStudio.
Any help is appreciated, I'm still new to statistic but really interested in improving!
horizontal database
vertical database
I believe you can use ANOVA, but as always, you have to know whether that really is what you're looking for. Either way, since this is a platform for programming questions, I'll write code that should work for the vertical version. However, since I don't have your data, I can't know for sure (for future reference, dput(data) creates easily importable code for those trying to answer you).
summary(aov(suolo ~ CV, data = data))
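For the repeated-measures question itself, one possible sketch with base R's aov() and an Error() term is shown below. It runs on simulated data in the long ("vertical") layout with the dimensions from the question; the column names ndvi, soil, cultivar, day and plot are assumptions, as is the idea that each soil and cultivar combination is one plot measured on every day.

```r
set.seed(123)

# Simulated long-format data: 17 soils x 4 cultivars x 33 days (all names assumed)
d <- expand.grid(day = factor(1:33), soil = factor(1:17), cultivar = factor(1:4))
d$plot <- interaction(d$soil, d$cultivar)  # the unit measured repeatedly over time

# Fake NDVI: a cultivar effect, a random plot effect, and measurement noise
plot_effect <- rnorm(nlevels(d$plot), sd = 0.03)[as.integer(d$plot)]
d$ndvi <- 0.5 + 0.02 * as.integer(d$cultivar) + plot_effect +
  rnorm(nrow(d), sd = 0.05)

# Soil and cultivar are tested between plots; day and its interactions within plots
fit <- aov(ndvi ~ (soil + cultivar) * day + Error(plot), data = d)
summary(fit)
```

The summary is split into a between-plot stratum (soil, cultivar) and a within-plot stratum (day and its interactions). A mixed-effects model would be a common alternative for this kind of design, especially if some days are missing for some plots.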

VAR model with variable combination and variation

I tried searching for an answer for this question of mine, however I could not find anything.
I want to build a model that predicts barley prices. For that I came up with 11 variables that may have an impact on the prices. What I tried was building a loop that adds one extra variable from my pool each time and tries different combinations of them, fitting a new VAR model for every combination, so in a sense it is a combinatorics exercise. After that, I want to implement in-sample/out-of-sample testing for each of the models to decide which one is the most appropriate. Unfortunately, I am not very familiar with loops and I have been told not to use them in R... As I am a beginner in R, my attempts won't help you much, but if you really require them I am happy to provide them.
Many thanks in advance!
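A minimal sketch of the combinatorial search described above, assuming a VAR(1) fitted by ordinary least squares and a simple train/test split. The data are simulated, only 4 candidate variables are used to keep it short, and every name (barley, wheat, oil, rain, fx, var1_rmse) is hypothetical. In practice a dedicated package such as vars, with its VAR() function, would likely replace the hand-rolled fit.

```r
set.seed(1)
n <- 120

# Simulated stand-in data: a price series plus 4 candidate drivers
dat <- data.frame(barley = 0, wheat = rnorm(n), oil = rnorm(n),
                  rain = rnorm(n), fx = rnorm(n))
for (t in 2:n) {
  dat$barley[t] <- 0.5 * dat$barley[t - 1] + 0.8 * dat$wheat[t - 1] +
    rnorm(1, sd = 0.3)
}

candidates <- setdiff(names(dat), "barley")
train <- 1:100
test  <- 101:n

# Out-of-sample one-step-ahead RMSE for the barley equation of a VAR(1)
var1_rmse <- function(vars) {
  Y    <- as.matrix(dat[, c("barley", vars), drop = FALSE])
  X_tr <- cbind(1, Y[train[-length(train)], , drop = FALSE])  # intercept + lag 1
  Y_tr <- Y[train[-1], , drop = FALSE]
  B    <- solve(crossprod(X_tr), crossprod(X_tr, Y_tr))       # OLS coefficients
  pred <- cbind(1, Y[test - 1, , drop = FALSE]) %*% B
  sqrt(mean((Y[test, 1] - pred[, 1])^2))                      # column 1 is barley
}

# Every non-empty combination of the candidate variables
combos <- unlist(
  lapply(seq_along(candidates), function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

results <- data.frame(
  model = vapply(combos, paste, character(1), collapse = " + "),
  rmse  = vapply(combos, var1_rmse, numeric(1))
)
results <- results[order(results$rmse), ]
head(results)
```

With 11 candidate variables the same code would fit 2^11 - 1 = 2047 models, which is still manageable. The lapply/vapply calls are loops in all but name, so there is no need to avoid an explicit for loop here if it reads more clearly.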

r rpart only working for integers and not factors? getting a tree with no depth

I'm having a few issues running a simple decision tree within R using rpart.
I can't post my actual data for an example because of confidentiality, but here's the structure. I've blanked out a load of bits just because I've got my tin foil hat on today.
I've run the most basic model to predict MIX based on MIX_BEFORE and LIFESTAGE and I don't get a tree out of the end of it. I've tried using rpart.control and specifying the minsplit, it makes no difference.
Even when I add in a few more variables I still don't get a tree:
Yet... the second I remove the factor variables and attempt to build a tree using an integer, it works fine:
Any ideas at all?
Your data has a fairly strong class imbalance: 99% one class, 1% the other. So rpart can get 99% accuracy just by saying that everything is the majority class (which is what it is doing). Most variables will not be able to discriminate better than that, so you get trees with no branches like you did with the factor variables. Your _NBR variable happens to be more predictive for the small number of points with _NBR >= 7. But even your model that uses _NBR predicts almost all points are majority class. You may be able to get some help from This Cross Validated Post on how to deal with class imbalance.
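The effect can be sketched on simulated data. The example below assumes a binary MIX outcome with roughly 1% in the rare class and a single factor predictor LIFESTAGE that is only weakly related to it; the class labels, level names and probabilities are invented. It computes the majority-class baseline and then fits rpart twice, once with defaults and once with equal class priors passed through parms, which is one of several possible ways to counter the imbalance.

```r
library(rpart)
set.seed(7)
n <- 20000

# Simulated imbalanced data (levels and probabilities are made up)
d <- data.frame(LIFESTAGE = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE)))
p_rare <- c(A = 0.002, B = 0.005, C = 0.01, D = 0.03)[as.character(d$LIFESTAGE)]
d$MIX <- factor(ifelse(runif(n) < p_rare, "rare", "common"))

# Accuracy obtained by always predicting the majority class
baseline_acc <- max(prop.table(table(d$MIX)))
baseline_acc

# Default fit: misclassification cost is dominated by the majority class
fit_default <- rpart(MIX ~ LIFESTAGE, data = d, method = "class")

# Same model, but treating both classes as equally likely a priori
fit_prior <- rpart(MIX ~ LIFESTAGE, data = d, method = "class",
                   parms = list(prior = c(0.5, 0.5)))

# Number of terminal nodes in each tree
c(default = sum(fit_default$frame$var == "<leaf>"),
  prior   = sum(fit_prior$frame$var == "<leaf>"))
```

Comparing the two leaf counts shows whether reweighting the classes lets the tree split at all. A tree fitted with altered priors should be judged by per-class measures such as recall rather than by overall accuracy.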

Text Categorization using R

I am relatively new at using R. I have a dataset of around 5000 datapoints.
My goal is to predict a category using the comments entered.
I have a training dataset of 4500 records and a testing data set of 500 records.
I am looking for 2-3 packages which might help me in doing this. I have to evaluate these packages and prepare a report on that. Can anyone suggest some good packages which are easy to use and also efficient?
Again, I have 2 columns
1st one is comments and based on this I have to predict the category.
Right now I have defined around 10 independent categories.
Most of the comments have specific keywords which I have defined as categories
One such example
Comment 1
The website is pretty good --->> category would be WebsiteContent
comment 2 might be like
Excellent article ,very detailed--->> same category as above(WebsiteContent)
But the keywords such as article, website are very limited and can be linked to the category
all of comments are different but the underlying keywords are mostly the same
Thanks,
Ankan
Although all you really need is a long, well-written set of if-else statements, try using a decision tree: rpart() from the rpart package builds it and prp() from the rpart.plot package plots it. I'm saying this only because you're trying to learn, and I'm guessing this is for an assignment which you're supposed to be doing on your own.
library(rpart)
library(rpart.plot)
tree <- rpart(decision ~ comment, data = train, method = "class")
prp(tree)
The rpart() line builds the model and the prp() line plots it. This might be a bit overboard actually, but since you're learning R this is a fun thing to work with and can be used for a wide variety of things. Note, though, that decision trees work better with more predictor variables.
Use predict(tree, test, type = "class") to test the model on your test dataset.
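The if-else keyword approach mentioned above can be sketched in base R as a lookup from keywords to categories. The WebsiteContent keywords come from the question; the Delivery category, its keywords, the "Other" fallback and the function name categorize are invented for illustration.

```r
# Keyword lists per category (only WebsiteContent comes from the question)
keywords <- list(
  WebsiteContent = c("website", "article"),
  Delivery       = c("delivery", "shipping")   # hypothetical second category
)

# Assign each comment to the first category whose keywords it mentions
categorize <- function(text, keywords, default = "Other") {
  hits <- do.call(cbind, lapply(keywords, function(k) {
    grepl(paste(k, collapse = "|"), text, ignore.case = TRUE)
  }))
  first_hit <- apply(hits, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)
  out <- names(keywords)[first_hit]
  out[is.na(out)] <- default
  out
}

comments <- c("The website is pretty good",
              "Excellent article ,very detailed",
              "Delivery was late",
              "No opinion")
predicted <- categorize(comments, keywords)
predicted
```

The same logical keyword columns could also be used as predictor variables for rpart instead of the raw comment text, which gives the tree several predictors to split on.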
