Problems with Naive Bayes - r

I'm trying to run Naive Bayes in R to make predictions from textual data (by building a document-term matrix).
I read several posts warning about terms that could be missing from either the training or the testing set, so I decided to work with a single data frame and split it afterwards. The code I'm using is this:
data <- read.csv(file = "path", header = TRUE)
########## NAIVE BAYES
library(e1071)
library(SparseM)
library(tm)
# CREATE DATA FRAME AND TRAINING AND
# TEST INCLUDING 'Text' AND 'InfoType' (columns 8 and 27)
traindata <- as.data.frame(data[13000:13999, c(8, 27)])
testdata <- as.data.frame(data[14000:14999, c(8, 27)])
complete <- as.data.frame(data[13000:14999, c(8, 27)])
# SEPARATE TEXT VECTOR TO CREATE Source(),
# Corpus() CONSTRUCTOR FOR DOCUMENT TERM
# MATRIX TAKES Source()
completevector <- as.vector(complete$Text)
# CREATE SOURCE FOR VECTORS
completesource <- VectorSource(completevector)
# CREATE CORPUS FOR DATA
completecorpus <- Corpus(completesource)
# STEM WORDS, REMOVE STOPWORDS, TRIM WHITESPACE
completecorpus <- tm_map(completecorpus, tolower)
completecorpus <- tm_map(completecorpus, PlainTextDocument)
completecorpus <- tm_map(completecorpus, stemDocument)
completecorpus <- tm_map(completecorpus, removeWords, stopwords("english"))
completecorpus <- tm_map(completecorpus, removePunctuation)
completecorpus <- tm_map(completecorpus, removeNumbers)
completecorpus <- tm_map(completecorpus, stripWhitespace)
# CREATE DOCUMENT TERM MATRIX
completematrix <- DocumentTermMatrix(completecorpus)
trainmatrix <- completematrix[1:1000, ]
testmatrix <- completematrix[1001:2000, ]
# TRAIN NAIVE BAYES MODEL USING trainmatrix DATA AND traindata$InfoType CLASS VECTOR
model <- naiveBayes(as.matrix(trainmatrix), as.factor(traindata$InfoType), laplace = 1)
# PREDICTION
results <- predict(model, as.matrix(testmatrix))
conf.matrix <- table(results, testdata$InfoType, dnn = list('predicted', 'actual'))
conf.matrix
The problem is that I'm getting weird results like this:
         actual
predicted   1   2   3
        1  60 833 107
        2   0   0   0
        3   0   0   0
Any idea why this is happening?
The raw data looks like this:
head(complete)
Text
13000 Milkshakes, milkshakes, whats not to love? Really like the durability and weight of the cup. Something about it sure makes good milkshakes.Works beautifully with the Cuisinart smart stick.
13001 excellent. shipped on time, is excellent for protein shakes with a cuisine art mixer. easy to clean and the mixer fits in perfectly
13002 Great cup. Simple and stainless steel great size cup for use with my cuisinart mixer. I can do milkshakes really easy and fast. Recommended. No problems with the shipping.
13003 Wife Loves This. Stainless steel....attractive and the best part is---it won't break. We are considering purchasing another one because they are really nice.
13004 Great! Stainless steel cup is great for smoothies, milkshakes and even chopping small amounts of vegetables for salads!Wish it had a top but still love it!
13005 Great with my. Stick mixer...the plastic mixing container cracked and became unusable as a result....the only downside is you can't see if the stuff you are mixing is mixed well
InfoType
13000 2
13001 2
13002 2
13003 3
13004 2
13005 2

It seems the problem is that the document-term matrix is far too sparse, and the sparse terms need to be dropped. So I added:
completematrix <- removeSparseTerms(completematrix, 0.95)
And it started working!!
         actual
predicted   1   2   3
        1  60 511   6
        2   0  86   2
        3   0 236  99
Thank you all for your ideas (thank you Chelsey Hill!!)
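As a footnote to the fix: another tweak often recommended for e1071::naiveBayes on document-term counts (a sketch, not something from this thread) is to recode the counts as categorical Yes/No features, because naiveBayes models numeric predictors as Gaussian, which fits sparse counts poorly:
# Sketch: binarize term counts so naiveBayes() treats them as categorical
binarize <- function(m) {
  d <- as.data.frame(ifelse(as.matrix(m) > 0, "Yes", "No"))
  # set both levels explicitly so train and test tables always line up
  d[] <- lapply(d, factor, levels = c("No", "Yes"))
  d
}
model_bin <- naiveBayes(binarize(trainmatrix), as.factor(traindata$InfoType), laplace = 1)
results_bin <- predict(model_bin, binarize(testmatrix))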

Find frequency of terms from Function

I need to find the frequency of terms from a function I created that finds terms with punctuation in them.
library("tm")
my.text.location <- "C:/Users/*/"
newpapers <- VCorpus(DirSource(my.text.location))
I read it in, then make the function:
library("stringr")
punctterms <- function(x){str_extract_all(x, "[[:alnum:]]{1,}[[:punct:]]{1,}?[[:alnum:]]{1,}")}
terms <- lapply(newpapers, punctterms)
Now I'm lost as to how I will find the frequency of each term in each file. Do I turn it into a DTM, or is there a better way without it?
Thank you!
This task is better suited to quanteda than tm. Your function creates a list and throws away the corpus structure; with quanteda you can get everything you want in a few commands.
Since you didn't provide any reproducible data, I will use a data set that comes with quanteda. Comments above the code explain what is going on. The most important function in this code is dfm_select, where you can use a diverse set of selection patterns to find terms in the text.
library(quanteda)
# load corpus
my_corpus <- corpus(data_corpus_inaugural)
# create document features (like document term matrix)
my_dfm <- dfm(my_corpus)
# dfm_select can use regex selections to select terms
my_dfm_punct <- dfm_select(my_dfm,
                           pattern = "[[:alnum:]]{1,}[[:punct:]]{1,}?[[:alnum:]]{1,}",
                           selection = "keep",
                           valuetype = "regex")
# show frequency of selected terms.
head(textstat_frequency(my_dfm_punct))
           feature frequency rank docfreq group
1  fellow-citizens        39    1      19   all
2        america's        35    2      11   all
3  self-government        30    3      16   all
4          world's        24    4      15   all
5         nation's        22    5      13   all
6            god's        15    6      14   all
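And since the question asked for the frequency of each term in each file: the dfm already stores per-document counts, so a small sketch to pull them out (dense conversion is fine for a corpus this size):
# rows are documents, columns are the selected terms
per_doc_counts <- as.matrix(my_dfm_punct)
per_doc_counts[1:3, 1:5]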
So I got it to work without using quanteda:
m <- as.data.frame(table(unlist(terms)))
names(m) <- c("Terms", "Frequency")
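If the per-file breakdown is still wanted, the same table() idea can be applied to each list element before flattening (a sketch reusing the terms list built earlier):
# one Terms/Frequency data frame per input file
m_by_file <- lapply(terms, function(x) as.data.frame(table(unlist(x))))
m_by_file <- lapply(m_by_file, setNames, c("Terms", "Frequency"))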

Extract total frequency of words from vector in R

This is the vector I have:
posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players. they have private message boards where it appears most of their work goes on. i would bet they are posting more there than in jita speakers corner. i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold. its sort of like ccp used to post here on the forums then they stopped. so they got a csm to represent players and use jita park forum to interact. now the csm no longer posts there as they have their internal forums where they hash things out. perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)"
I want a data frame as a result, containing the words and the frequency with which they occur.
So the result should look something like:
word count
a 300
and 260
be 200
... ...
... ...
What I tried to do was use tm:
corpus <- VCorpus(VectorSource(posts))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
m <- DocumentTermMatrix(corpus)
Running findFreqTerms(m, lowfreq = 0, highfreq = Inf) just gives me the words, so I understand it's a sparse matrix. How do I extract the words and their frequencies?
Is there an easier way to do this, maybe without using tm at all?
# reuse the posts vector exactly as defined in the question
posts <- gsub("[[:punct:]]", "", posts)  # remove punctuation
posts <- gsub("[[:digit:]]", "", posts)  # remove numbers
word_counts <- as.data.frame(table(unlist(strsplit(posts, " "))))  # split vector on spaces
word_counts <- with(word_counts, word_counts[Var1 != "", ])  # remove empty strings
head(word_counts)
#       Var1 Freq
# 2        a    8
# 3    about    3
# 4   allows    1
# 5 although    1
# 6       am    1
# 7       an    1
Plain R solution, assuming all words are separated by space:
words <- strsplit(posts, " ", fixed = TRUE)
words <- unlist(words)
counts <- table(words)
names(counts) holds the words, and the values are the counts.
You might want to use gsub to get rid of (),.?: and 's, 't, or 're, as in your example:
posts <- gsub("'t|'s|'re", "", posts)
posts <- gsub("[(),.?:]", " ", posts)
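To get from there to the word/count data frame the question asked for, a short sketch:
# shape the table into a word/count data frame, most frequent first
result <- as.data.frame(counts, stringsAsFactors = FALSE)
names(result) <- c("word", "count")
result <- result[order(result$count, decreasing = TRUE), ]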
You've got two options, depending on whether you want word counts across all documents or per document.
All Documents
library(dplyr)
count <- as.data.frame(t(as.matrix(m)))  # as.matrix(), since inspect() only prints a preview
sel_cols <- colnames(count)
count$word <- rownames(count)
rownames(count) <- seq(length = nrow(count))
count$count <- rowSums(count[, sel_cols])
count <- count %>% select(word, count)
count <- count[order(count$count, decreasing = TRUE), ]
### RESULT of head(count)
#     word count
# 140  the    14
# 144 they    10
# 4    and     9
# 25   csm     7
# 43   for     5
# 55   had     4
This should capture occurrences across all documents (by use of rowSums).
Per Document
I would suggest using the tidytext package if you want word frequency per document.
library(tidytext)
m_td <- tidy(m)
The tidytext package allows fairly intuitive text mining, including tokenization. It is designed to work in a tidyverse pipeline, so it supplies a list of stop words ("a", "the", "to", etc.) to exclude with dplyr::anti_join. Here, you might do
library(dplyr) # or if you want it all, `library(tidyverse)`
library(tidytext)
data_frame(posts) %>%
  unnest_tokens(word, posts) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## # A tibble: 101 × 2
##        word     n
##       <chr> <int>
## 1       csm     7
## 2       0.0     3
## 3       nda     3
## 4       bit     2
## 5       ccp     2
## 6  dominion     2
## 7     forum     2
## 8    forums     2
## 9      hard     2
## 10 internal     2
## # ... with 91 more rows

R predict function not using entire data in the test data set, only using partial data and predicting

I have a training data set with 700 records, and I prepared a model from it using the C5.0 function.
library(C50)
abc_model <- C5.0(abc_train[-5], abc_train$resultval)
I have test data, which has 5000 records.
I am using the predict function to make predictions for these 5000 records.
abc_test <- read.csv("FullData.csv", quote = "")
abc_pred <- predict(abc_model, abc_test)
This is giving me predictions for ONLY 700 records, not all 5000.
How do I make it predict all 5000?
When the training data set is larger than the test data set, the result is fine: I get all the data, and I can combine the test data with the results and write the output to .CSV. But when the training data set is smaller than the test data set, not all records get predicted.
x <- data.frame(abc_test, abc_pred)
Any input on how to overcome this problem? I am not an expert in R, so any suggestions will help me a lot.
Thanks Richard.
Below is my train data, few recs.
Id      Value1 Value2 Country       Result
20835   63     1      United States yes
3911156 60     12     Romania       no
39321   10     3      United States no
29425   80     9      Australia     no
Below is my test data, few recs again.
Id      Value1 Value2 Country
3942587 114    12     United States
3968314 25     13     Sweden
3973205 83     10     Russian Federation
17318   159    9      Russian Federation
I am trying to find the Result value and append it to my test data. But, as I described, I am getting the Result for only 700 records, not all 5000.
You should try this:
str(abc_train)
str(abc_test)
lapply(abc_train[names(abc_train) != "Result"], table)
lapply(abc_test, table)
Then you will probably find that some levels of some of the variables in abc_test were not present in abc_train, so estimates could not be produced for them. I'm guessing you thought the numeric values would be handled as though a regression had been done, but that won't happen in any prediction function if those columns are factors, and perhaps never, depending on the function's behavior. Looking at C50::C5.0.default, it appears there may be no regression option for variables.
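A sketch of one common remedy, assuming the objects from the question (this is not code from the original answer): align each factor column of the test set to the training levels before predicting, so values unseen in training become NA instead of breaking the prediction.
# align test-set factor levels with the training set
factor_cols <- names(abc_train)[sapply(abc_train, is.factor)]
for (col in intersect(factor_cols, names(abc_test))) {
  abc_test[[col]] <- factor(abc_test[[col]], levels = levels(abc_train[[col]]))
}
abc_pred <- predict(abc_model, abc_test)  # re-run the prediction on the aligned test set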

Poisson Table in R

I am trying to generate a Poisson table in R for two events, one with mean 1.5 (lambda1) and the other with mean 1.25 (lambda2). I would like to generate the probabilities in both cases for x = 0 to x = 7+ (7 or more). This is probably quite simple, but I can't seem to figure out how to do it. I've managed to create a data frame for the table, but I don't really know how to input the parameters, as I've never written a function before:
name <- c("0", "1", "2", "3", "4", "5", "6", "7+")
zero <- mat.or.vec(8, 1)
C <- data.frame(row.names = name,
                "0" = zero,
                "1" = zero,
                "2" = zero,
                "3" = zero,
                "4" = zero,
                "5" = zero,
                "6" = zero,
                "7+" = zero)
I am guessing I will need some for loops, and dpois(x, lambda1) will be involved at some point. Can somebody help, please?
I'm assuming these events are independent. Here's one way to generate a table of the joint PMF.
First, here are the names you've defined, along with the lambdas:
name <- c("0","1","2","3","4","5","6","7+")
lambda1 <- 1.5
lambda2 <- 1.25
We can get the marginal probabilities for 0-6 using dpois, and the marginal probability for 7+ using ppois with lower.tail=FALSE. Note that lower.tail=FALSE gives P(X > q), so we pass q = 6 to get P(X >= 7):
p.x <- c(dpois(0:6, lambda1), ppois(6, lambda1, lower.tail=FALSE))
p.y <- c(dpois(0:6, lambda2), ppois(6, lambda2, lower.tail=FALSE))
An even better way might be to create a function that does this given any lambda.
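For instance, a minimal sketch of such a helper (the function name here is made up):
# marginal PMF for 0..kmax plus the "kmax+1 or more" tail
pois_marginal <- function(lambda, kmax = 6) {
  c(dpois(0:kmax, lambda), ppois(kmax, lambda, lower.tail = FALSE))
}
p.x <- pois_marginal(lambda1)
p.y <- pois_marginal(lambda2)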
Then you just take the outer product (really, the same thing you would do by hand, outside of R) and set the names:
p.xy <- outer(p.x, p.y)
rownames(p.xy) <- colnames(p.xy) <- name
Now you're done:
              0            1            2            3            4            5
0  6.392786e-02 7.990983e-02 4.994364e-02 2.080985e-02 6.503078e-03 1.625770e-03
1  9.589179e-02 1.198647e-01 7.491546e-02 3.121478e-02 9.754617e-03 2.438654e-03
2  7.191884e-02 8.989855e-02 5.618660e-02 2.341108e-02 7.315963e-03 1.828991e-03
3  3.595942e-02 4.494928e-02 2.809330e-02 1.170554e-02 3.657982e-03 9.144954e-04
4  1.348478e-02 1.685598e-02 1.053499e-02 4.389578e-03 1.371743e-03 3.429358e-04
5  4.045435e-03 5.056794e-03 3.160496e-03 1.316873e-03 4.115229e-04 1.028807e-04
6  1.011359e-03 1.264198e-03 7.901240e-04 3.292183e-04 1.028807e-04 2.572018e-05
7+ 2.653011e-04 3.316264e-04 2.072665e-04 8.636104e-05 2.698783e-05 6.746957e-06
              6           7+
0  3.387020e-04 7.143030e-05
1  5.080530e-04 1.071455e-04
2  3.810397e-04 8.035909e-05
3  1.905199e-04 4.017955e-05
4  7.144495e-05 1.506733e-05
5  2.143349e-05 4.520199e-06
6  5.358371e-06 1.130050e-06
7+ 1.405616e-06 2.964363e-07
You could have also used a loop, as you originally suspected, but that's a more roundabout way to the same solution.
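For completeness, the loop version the question anticipated would look something like this sketch (outer() is the more idiomatic choice):
p.xy <- matrix(0, nrow = 8, ncol = 8, dimnames = list(name, name))
for (i in 1:8) {
  for (j in 1:8) {
    p.xy[i, j] <- p.x[i] * p.y[j]
  }
}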

How to get frequency of word in a sentence in R?

I have one input file which contains one paragraph. I need to find the frequency of a particular word in that paragraph.
cat file:
Text Index
train is good 1
let the train come 5
train is best 3
i m great 3
what is best 2
Code:
input <- read.table("file", sep = "\t", header = TRUE)
paragraph1 <- input[1][1]
word <- "train"
I need to find the frequency of the word "train" in paragraph1. How can I get it using R?
If you give a little more info I can probably provide more in return. Using qdap you could:
library(qdap)
dat <- readLines(n=5)
train is good 1
let the train come 5
train is best 3
i m great 3
what is best 2
# peel the trailing index off each line rather than splitting on every space
dat <- data.frame(Text = sub(" +[0-9]+$", "", dat),
                  Index = as.numeric(sub(".* ([0-9]+)$", "\\1", dat)))
termco(dat$Text, , " train ")
## > termco(dat$Text, , " train ")
##   all word.count     train
## 1 all         16 3(18.75%)
You could probably do all the paragraphs at once with termco. For more on termco see this link.
A lot of this depends on what's separating paragraphs, how you're reading the text in, how things are indented, etc.
The poster found the following useful:
length(gregexpr("the", "the dog ate the word the", fixed = TRUE)[[1]])
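One caveat worth adding as a sketch: gregexpr returns -1 when there is no match, so a bare length() would report 1 for a word that never occurs, and fixed = TRUE also counts substring hits ("the" matches "them"). A guarded version:
# count occurrences of `word` in `text`, returning 0 when absent
count_word <- function(text, word) {
  hits <- gregexpr(word, text, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}
count_word("the dog ate the word the", "the")  # 3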
