Text Categorization using R - r

I am relatively new to R. I have a dataset of around 5000 data points.
My goal is to predict a category from the comments entered.
I have a training dataset of 4500 records and a testing dataset of 500 records.
I am looking for 2-3 packages that might help me do this. I have to evaluate these packages and prepare a report on them. Can anyone suggest some good packages that are easy to use and also efficient?
Again, I have 2 columns.
The first one is the comment, and based on it I have to predict the category.
Right now I have defined around 10 independent categories.
Most of the comments contain specific keywords, which I have used to define the categories.
One such example:
Comment 1:
The website is pretty good --->> category would be WebsiteContent
Comment 2 might be like:
Excellent article, very detailed --->> same category as above (WebsiteContent)
The keywords, such as article and website, are very limited in number and can be linked to a category.
All of the comments are different, but the underlying keywords are mostly the same.
Thanks,
Ankan

Although all you really need is a long, well-written set of if-else statements, try a decision tree from the rpart package, plotted with the prp function from the rpart.plot package. I'm suggesting this only because you're trying to learn, and I'm guessing this is for an assignment that you're supposed to be doing on your own.
library(rpart)       # decision trees
library(rpart.plot)  # provides prp() for plotting
tree <- rpart(decision ~ comment, data = train, method = "class")
prp(tree)
The rpart call builds the model and prp plots it. This might be a bit overboard, but since you're learning R it is a fun thing to work with and can be used for a wide variety of problems. Note, though, that decision trees work better with more predictor variables.
Use predict(tree, newdata = test, type = "class") to try the model on your test dataset.
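A tree fitted on the raw comment text has only one predictor to work with, so one way to give it more is to turn each keyword into its own 0/1 column first. The following is only a minimal sketch under assumptions: the toy comments, the "Delivery" category, the keyword list, and the helper keyword_flags() are all made up for illustration, and the rpart.control settings are loosened only because the toy data is tiny.

```r
library(rpart)  # recommended package, normally installed with R

# Toy training data (hypothetical; replace with your own 4500 records)
train <- data.frame(
  comment = c("The website is pretty good",
              "Excellent article, very detailed",
              "Website loads slowly but the content is useful",
              "Great article on the homepage",
              "Delivery was late",
              "Package arrived damaged",
              "Shipping took two weeks",
              "Late delivery and a damaged box"),
  category = factor(c(rep("WebsiteContent", 4), rep("Delivery", 4))),
  stringsAsFactors = FALSE
)

# Hypothetical keyword list; in practice, the keywords behind your categories
keywords <- c("website", "article", "delivery", "shipping", "damaged")

# One 0/1 column per keyword: 1 if the comment contains it (case-insensitive)
keyword_flags <- function(comments, keywords) {
  flags <- sapply(keywords, function(k) {
    as.integer(grepl(k, comments, ignore.case = TRUE))
  })
  as.data.frame(matrix(flags, nrow = length(comments),
                       dimnames = list(NULL, keywords)))
}

train_x <- cbind(keyword_flags(train$comment, keywords),
                 category = train$category)

# Loosened controls so that a tree can grow on only 8 rows
tree <- rpart(category ~ ., data = train_x, method = "class",
              control = rpart.control(minsplit = 2, minbucket = 1,
                                      cp = 0, xval = 0))

# Score unseen comments with the same feature construction
test_comments <- c("I liked the article", "My delivery never came")
pred <- predict(tree, newdata = keyword_flags(test_comments, keywords),
                type = "class")
```

With real data you would keep rpart's default control settings and compare pred against the known categories of the 500 test records, for example with table().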

Related

Analysing vocal similarity of little owls using warbleR in R

I am struggling a bit with an analysis I need to do. I have collected data consisting of little owl calls that were recorded along transects. I want to analyse these recordings for similarity, in order to see which recorded calls are from the same owls and which are from different owls. In that way I can make an estimate of the size of the population at my study area.
I have done a bit of research and the package warbleR seems to be suitable for this. However, I am far from an R expert and am struggling with how to go about it. Do any of you have experience with these types of analyses and maybe have example scripts? It seems to me that I could use the function cross_correlation and maybe run a PCA. However, the warbleR vignette I looked at only does this for different types of calls, not for the same type of call from different individuals, so I am not sure it would work.
To run analyses with warbleR you need to input the data in the "selection_table" format. Take a look at the example data "lbh_selec_table" to get a sense of the format:
library(warbleR)
data(lbh_selec_table)
head(lbh_selec_table)
The whole point of these objects is to tell R the time location in your sound files (in seconds) of the signals you want to analyze. Take a look at this link for more details on this object structure and how to import it into R.
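To make the format concrete, here is a rough base-R sketch of a data frame laid out like lbh_selec_table. The file names, times, and frequencies are invented placeholders, not real recordings; the column names follow the example data (its "channel" column is left out here for brevity).

```r
# Hypothetical selections: one row per call to be analysed.
# start/end are in seconds within the sound file; frequencies are in kHz.
my_selections <- data.frame(
  sound.files = c("transect1.wav", "transect1.wav", "transect2.wav"),
  selec       = c(1, 2, 1),            # selection number within each file
  start       = c(3.20, 10.75, 0.90),
  end         = c(3.85, 11.40, 1.55),
  bottom.freq = c(0.8, 0.8, 0.9),
  top.freq    = c(3.5, 3.6, 3.4),
  stringsAsFactors = FALSE
)

# Basic sanity checks before handing the table to warbleR
call_duration <- my_selections$end - my_selections$start
stopifnot(all(call_duration > 0))
stopifnot(!anyDuplicated(my_selections[, c("sound.files", "selec")]))

# With warbleR loaded and the .wav files present in the working directory,
# a data frame like this can usually be passed to warbleR functions directly
# or converted with selection_table(); check the package help for details.
```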

Is repeated anova what i am looking for?

I'm studying the NDVI (normalized difference vegetation index) behaviour of some soils and cultivars. My database has 33 days of acquisition, 17 kinds of soil and 4 different cultivars. I have built it in two different layouts, which you can see attached. I am getting errors with both layouts.
The first question is: is repeated-measures ANOVA the correct way of analyzing my data? I want to see if there are any differences between the behaviours of the different cultivars and the different soils. I've run an ANOVA for each day and there are statistical differences on each day, but those results are not very interesting overall because I would like to investigate the behaviour over the whole year.
The second question is: how can I perform it? I've tried different tutorials, but I got unexpected errors or didn't manage to complete the analysis.
Last but not least: I'm coding with RStudio.
Any help is appreciated. I'm still new to statistics but really interested in improving!
horizontal database
vertical database
I believe you can use ANOVA, but as always, you have to check that it really is what you're looking for. Either way, since this is a platform for programming questions, I'll write code that should work for the vertical version. However, since I don't have your data, I can't know for sure (for future reference, dput(data) creates easily importable code for those trying to answer you).
summary(aov(suolo ~ CV, data = data))
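Because the same plots are measured on many days, a repeated-measures layout normally adds an Error() term to the aov formula so that day-to-day variation is tested within plots. The sketch below is an illustration on simulated numbers only: the column names ndvi, cultivar, day and plot are assumptions (the real column names are not shown in the question), soil is left out to keep it short, and the random values carry no meaning.

```r
set.seed(1)

# Simulated long ("vertical") data: 8 plots, each measured on 5 days
d <- expand.grid(day = factor(1:5), plot = factor(1:8))

# Two plots per cultivar; cultivar is constant within a plot
cultivar_of_plot <- factor(rep(c("A", "B", "C", "D"), each = 2))
d$cultivar <- cultivar_of_plot[as.integer(d$plot)]

# Placeholder response values (random, for illustration only)
d$ndvi <- runif(nrow(d), min = 0.2, max = 0.9)

# cultivar is a between-plot factor, day is the repeated (within-plot) factor
fit <- aov(ndvi ~ cultivar * day + Error(plot / day), data = d)
summary(fit)
```

The summary reports the cultivar effect in the "plot" stratum and the day and cultivar:day effects in the within-plot stratum. A mixed model (for example with the lme4 package) is a common alternative when the design is unbalanced.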

VAR model with variable combination and variation

I tried searching for an answer for this question of mine, however I could not find anything.
I want to build a model that predicts barley prices. For that, I came up with 11 variables that may have an impact on the prices. What I tried to do was build a loop that adds one extra variable from my pool each time and tries the different combinations, producing a new VAR model for every combination; in a sense, it is a combinatorics exercise. After that, I want to run in-sample and out-of-sample tests for each of the resulting models to decide which one is the most appropriate. Unfortunately, I am not very familiar with loops and I have been told not to use them in R. As I am a beginner in R, my attempts won't help you much, but if you really need them I am happy to provide them.
Many thanks in advance!
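The combinatorics part can be sketched in base R with combn() and lapply(), without writing explicit for loops. Everything named here is a placeholder: the variable names x1..x11, the column name barley_price, and the 80/20 split are assumptions, and the actual model fit (for instance VAR() from the vars package) is only indicated in a comment.

```r
# Hypothetical names for the 11 candidate variables
candidates <- paste0("x", 1:11)

# All non-empty subsets, grouped by size 1, 2, ..., 11
combos <- unlist(
  lapply(seq_along(candidates),
         function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)
length(combos)  # 2^11 - 1 = 2047 candidate models

# Each model uses the price series plus one subset of the candidates
model_columns <- lapply(combos, function(v) c("barley_price", v))

# Chronological in-sample / out-of-sample split for n_obs time points
split_sample <- function(n_obs, train_share = 0.8) {
  n_train <- floor(n_obs * train_share)
  list(train = seq_len(n_train), test = (n_train + 1):n_obs)
}
idx <- split_sample(120)

# For each element of model_columns one would then fit a model on
# data[idx$train, cols] (e.g. vars::VAR) and score forecasts on idx$test.
```

Fitting all 2047 combinations may be slow, and the larger VARs need long series to be estimable, so restricting the subset size is probably sensible.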

Fastest way to reduce dimensionality for multi-classification in R

What I currently have:
I have a data frame with one column of factors called "Class" which contains 160 different classes. I have 1200 variables, each one being an integer and no individual cell exceeding the value of 1000 (if that helps). About 1/4 of the cells are the number zero. The total dataset contains 60,000 rows. I have already used the nearZeroVar function, and the findCorrelation function to get it down to this number of variables. In my particular dataset some individual variables may appear unimportant by themselves, but are likely to be predictive when combined with two other variables.
What I have tried:
First I tried just creating a random forest model, planning to use its variable importance (varImp) to filter out the useless variables, but gave up after letting it run for days. Then I tried fscaret, but that ran overnight on an 8-core machine with 64GB of RAM (same machine as the previous attempt) and didn't finish. Then I tried feature selection using genetic algorithms. That ran overnight and didn't finish either. I was also trying to make principal component analysis work, but for some reason couldn't. I have never been able to do PCA successfully within caret, which could be both my problem and my solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can make it usable for creating a model. Maybe a good place to start would be an example of using PCA with a dataset like mine using Caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I am in right now.
I have done only some toy examples too.
Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset will give some reasonable classifications? It assumes that all attributes are independent of each other, but experience shows (and many scholars say) that Naive Bayes results are often still useful despite this strong assumption.
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try to process your dataset with the free GUI-based data-mining tool Weka? There is an "attribute selection" tab where you have several algorithms (or algorithm-combinations) for removing irrelevant attributes at your disposal. That is an art, and the results are not so easy to interpret, though.
Read this pdf as an introduction and see this video for a walk-through and an introduction to the theoretical approach.
The videos assume familiarity with Weka, but maybe it still helps.
There is an RWeka interface but it's a bit laborious to install, so working with the Weka GUI might be easier.
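For the PCA route the question asks about, here is a minimal base-R sketch using prcomp() rather than caret. The data is simulated (Poisson counts standing in for the integer variables), the sizes are shrunk, and the 95% variance threshold is an arbitrary assumption.

```r
set.seed(123)

# Simulated stand-in: 500 rows, 40 integer-valued variables
X <- matrix(rpois(500 * 40, lambda = 5), nrow = 500, ncol = 40)
colnames(X) <- paste0("v", seq_len(ncol(X)))

# Centre and scale, then rotate onto principal components
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Keep the smallest number of components reaching 95% of total variance
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.95)[1]
reduced <- pca$x[, seq_len(k), drop = FALSE]

# New rows must be projected with the SAME fitted rotation
X_new <- matrix(rpois(10 * 40, lambda = 5), nrow = 10, ncol = 40)
colnames(X_new) <- colnames(X)
reduced_new <- predict(pca, newdata = X_new)[, seq_len(k), drop = FALSE]
```

prcomp() on a 60,000 x 1200 matrix should finish far faster than the wrapper methods tried above, although PCA ignores the class labels, so the retained components are not guaranteed to be the most predictive ones. In caret, preProcess() with method = "pca" performs a comparable transformation.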

Example R source code for multiple linear regression with looping through geographies & products?

Pardon the newbie question, as I just started learning R a couple of weeks ago (but intend to use it actively from now on). However, I could use some help if you already have a working example.
In order to determine own-price elasticity coefficients for each of our products (~100) in each of our states, I want to write a multiple regression that regresses Units on a variety of independent variables. That's straightforward. However, I would like R to cycle through EACH product within a particular state, THEN move on to the next state in the data file and start the regression on the first product, repeating the cycle.
I have attached an example of what I'm trying to accomplish. I would also like R to export the regression coefficients (and summaries, p-values, t-stats) into a separate worksheet at the end.
Does anyone have an example similar to this? I'm comfortable enough to read the source code and modify it to fit my needs, but not yet comfortable enough to write one from scratch. And, alas, I am tired of copying/pasting into Minitab/Excel (which is what I've been using up to this point) to run regressions 1,000 times.
Appreciate any help you could offer!
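One common pattern for this is to split the data by state and product, fit one lm() per piece, and stack the coefficient tables into a single data frame that can be written to a file. The sketch below runs on simulated data; the column names State, Product, Units, Price and Promo, the model formula, and the output file name are assumptions to be replaced with your own.

```r
set.seed(42)

# Simulated sales data: 2 states x 3 products x 30 weeks
sales <- expand.grid(State = c("CA", "NY"), Product = c("A", "B", "C"),
                     Week = 1:30, stringsAsFactors = FALSE)
sales$Price <- runif(nrow(sales), 1, 5)
sales$Promo <- rbinom(nrow(sales), 1, 0.3)
sales$Units <- 100 - 8 * sales$Price + 10 * sales$Promo +
  rnorm(nrow(sales), sd = 5)

# One data frame per State/Product combination
groups <- split(sales, list(sales$State, sales$Product), drop = TRUE)

# Fit the regression for one group and return its coefficient table
fit_one <- function(d) {
  fit <- lm(Units ~ Price + Promo, data = d)
  est <- coef(summary(fit))
  data.frame(State = d$State[1], Product = d$Product[1],
             Term = rownames(est),
             Estimate = est[, "Estimate"],
             StdError = est[, "Std. Error"],
             t_value = est[, "t value"],
             p_value = est[, "Pr(>|t|)"],
             row.names = NULL)
}

results <- do.call(rbind, lapply(groups, fit_one))
rownames(results) <- NULL

# Export everything to one CSV file, which opens directly in Excel
out_file <- file.path(tempdir(), "regression_coefficients.csv")
write.csv(results, out_file, row.names = FALSE)
```

For elasticities specifically, a log-log formula such as log(Units) ~ log(Price) is often used so that the price coefficient can be read as an elasticity; that is a modelling choice, not something the loop depends on.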
