I am stemming my dataset for sentiment analysis and I get this error message:
"Error in structure(if (length(n)) n else NA, names = x) :
'names' attribute [2] must be the same length as the vector [1]"
Please help!
myCorpus<-Corpus(VectorSource(Datasetlow_cost_airline$text))
# Convert to lower case
myCorpus<-tm_map(myCorpus,tolower)
# Remove punctuation
myCorpus<-tm_map(myCorpus,removePunctuation)
# Remove numbers
myCorpus<-tm_map(myCorpus,removeNumbers)
# Remove URLs (see ?regex for regular expressions and ?gsub for pattern matching)
removeURL<-function(x)gsub("http[[:alnum:]]*","",x)
myCorpus<-tm_map(myCorpus,removeURL)
stopwords("english")
# Add two extra stop words: 'available' and 'via'
myStopwords<-c(stopwords("english"),"available","via","can")
# Remove stopwords from corpus
myCorpus<-tm_map(myCorpus,removeWords,myStopwords)
# Keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy<-myCorpus
# Stem words (reduce each word to its root)
myCorpus<-tm_map(myCorpus,stemDocument)
# Inspect documents (tweets) numbered 11 to 15
for(i in 11:15){
  cat(paste("[[",i,"]]",sep=""))
  writeLines(strwrap(myCorpus[[i]],width=73))
}
# Stem completion
myCorpus<-tm_map(myCorpus,stemCompletion,dictionary=myCorpusCopy)
There seems to be something odd about the stemCompletion function in tm version 0.6. There is a nice workaround here that I've used for this answer. In brief, replace your
# Stem completion
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy) # use spaces!
with
# Stem completion
stemCompletion_mod <- function(x, dict) {
  PlainTextDocument(stripWhitespace(
    paste(stemCompletion(unlist(strsplit(as.character(x), " ")),
                         dictionary = dict, type = "shortest"),
          sep = "", collapse = " ")))
}
# apply workaround function
myCorpus <- lapply(myCorpus, stemCompletion_mod, myCorpusCopy)
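Note that lapply returns a plain list rather than a tm corpus, so if you need a corpus again for later steps (a term-document matrix, say), you can rebuild one; a minimal sketch, assuming tm 0.6:
# rebuild a tm corpus from the list of completed documents
myCorpus <- Corpus(VectorSource(sapply(myCorpus, as.character)))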
If that doesn't help then you'll need to give more details and a sample of your actual data.
Related
I've been working on a dataset, but when I run the code I still get common words such as 'in' and 'and'. I am trying to remove these common words. I know I need to use the stopwords function, but I am not sure where to put it or what command to use after it. I want to find the most common words used to describe a listing, other than words like 'in', 'for', and 'what'.
nycab$name <- as.character(nycab$name)
nycab$name <- tolower(nycab$name)
corpus <- Corpus(VectorSource(nycab$name))
nycwords_dfm <- dfm(nycab$name)
head(nycwords_dfm)
topwordcount <- names(topfeatures(nycwords_dfm, 50))
head(topwordcount)
wordcountnyc_dfm <- dfm_select(nycwords_dfm, pattern = topwordcount)
nycword_fcm <-fcm(wordcountnyc_dfm)
head(nycword_fcm)
nycwordcount2_fcm <- fcm_select(nycword_fcm, pattern = topwordcount)
textplot_network(nycwordcount2_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)
Dataset in case anyone needs it: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
It looks like you are using quanteda, so get rid of the tm part of your code (the Corpus() line).
You can use dfm_remove to get rid of the stopwords.
nycwords_dfm <- dfm(nycab$name)
# remove stopwords
nycwords_dfm <- dfm_remove(nycwords_dfm, stopwords("english"))
# rest of your code
...
If you need to remove more things first use tokens:
# remove punctuation and stopwords via tokens
nycwords_toks <- tokens(nycab$name, remove_punct = TRUE)
nycwords_toks <- tokens_remove(nycwords_toks, stopwords("english"))
nycwords_dfm <- dfm(nycwords_toks)
# rest of your code
....
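Putting this together with the rest of your pipeline, a rough sketch (note: in recent quanteda versions textplot_network lives in the quanteda.textplots package):
library(quanteda)
library(quanteda.textplots)   # textplot_network moved here in quanteda >= 3

# tokenize, drop punctuation and stopwords, then build the dfm
nycwords_toks <- tokens(nycab$name, remove_punct = TRUE)
nycwords_toks <- tokens_remove(nycwords_toks, stopwords("english"))
nycwords_dfm  <- dfm(nycwords_toks)

# top 50 terms and their co-occurrence network
topwordcount      <- names(topfeatures(nycwords_dfm, 50))
nycword_fcm       <- fcm(nycwords_toks)
nycwordcount2_fcm <- fcm_select(nycword_fcm, pattern = topwordcount)
textplot_network(nycwordcount2_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)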
I'm having trouble using a RegEx on a corpus.
I read in a couple of text documents that I converted to a corpus.
I want to display it in a TermDocumentMatrix after some pre-processing.
First I want to extract prefixes with the RegEx "(\b([a-z]*)\B)". For example, for "the host" -> "th", "hos".
Then I want to use character n-grams with n = 1:3, so for the previous example ->
"t", "th", "h", "ho", "hos". That is, I want all characters from the beginning of each word, excluding its last character.
My code so far gives me a TermDocumentMatrix with n = 1:3 over the whole corpus. However, all my approaches to add the RegEx so far haven't been working.
I was wondering if there's a way to include it in: typedPrefix <- tokens()...
Here's the code:
# read documents
FILEDIR <- (path)
txts <- readtext(paste0(FILEDIR, "/", "*.txt"))
my_corpus <- corpus(txts)
#start processing
typedPrefix <- my_corpus
typedPrefix <- tokens(gsub("\\s", "_", typedPrefix), "character", ngrams=1:3, conc="", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
dfm2 <- dfm(typedPrefix)
tdm2 <- as.TermDocumentMatrix(t(dfm2), weighting=weightTf)
as.matrix(tdm2)
#write output file
write.csv2(as.matrix(tdm2), file = "typedPrefix.csv")
I have a dataframe with 60,000 rows/phrases which I would like to use as stopwords and remove from my text.
I use the tm package, and after reading the csv file with the list of stopwords I use this line:
corpus <- tm_map(corpus, removeWords, df$mylistofstopwords)
but I receive this error:
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), :
PCRE pattern compilation error
'regular expression is too large'
at ''
Is the problem that the list is too big? Is there anything I can do to fix it?
You could probably resolve your issue by splitting the stopword list into multiple parts, something like the following:
chunk <- 1000
i <- 0
n <- length(df$mylistofstopwords)
while (i != n) {
  i2 <- min(i + chunk, n)
  corpus <- tm_map(corpus, removeWords, df$mylistofstopwords[(i+1):i2])
  i <- i2
}
Or, you could just use a package that can handle long stopword lists. corpus is one such package. quanteda is another. Here's how to get a document-by-term matrix with corpus:
library(corpus)
x <- term_matrix(corpus, drop = df$mylistofstopwords)
Here, the input argument corpus can be a tm corpus.
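And a similar sketch with quanteda, which matches stopwords by lookup rather than one giant regex (mytext is a hypothetical name for the character vector your corpus was built from):
library(quanteda)
# 'mytext' stands in for the character vector behind your corpus
toks <- tokens(mytext)
toks <- tokens_remove(toks, df$mylistofstopwords)   # no single huge regex, so no PCRE size limit
dtm  <- dfm(toks)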
I am working on a project in R and am just starting to get my hands dirty with it.
In the first part I try to clean the data in the vector msg, but later, when I build the TermDocumentMatrix, these characters still appear.
I would like to remove words with fewer than 4 letters and remove punctuation.
gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
gsub("[[:punct:]]", "", pclbyshares$msg)
corpus <- Corpus(VectorSource(pclbyshares$msg))
TermDocumentMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq=120, highfreq=Inf)
You haven't stored the results of your first two lines of code in variables to use later. So, in your third line, where you create your corpus variable, you are using the unmodified msg data. Give this a try:
msg_clean <- gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
msg_clean <- gsub("[[:punct:]]", "", msg_clean)
corpus <- Corpus(VectorSource(msg_clean))
TermDocumentMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq = 120, highfreq = Inf)
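If you would rather keep the cleaning inside tm itself, a sketch of the same steps using content_transformer (assuming tm >= 0.6):
# same cleaning applied via tm_map instead of pre-cleaning the vector
corpus <- Corpus(VectorSource(pclbyshares$msg))
corpus <- tm_map(corpus, content_transformer(function(x) gsub("\\b\\w{1,4}\\b ", " ", x)))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq = 120, highfreq = Inf)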
I'm trying to get a count of the keywords in my corpus using the R "tm" package. This is my code so far:
# get the data strings
f<-as.vector(forum[[1]])
# replace +
f<-gsub("+", " ", f ,fixed=TRUE)
# lower case
f<-tolower(f)
# show all strings that contain mobile
mobile <- f[grep("mobile", f, ignore.case = FALSE, perl = FALSE, value = FALSE,
                 fixed = FALSE, useBytes = FALSE, invert = FALSE)]
text.corp.mobile <- Corpus(VectorSource(mobile))
text.corp.mobile <- tm_map(text.corp.mobile , removePunctuation)
text.corp.mobile <- tm_map(text.corp.mobile , removeWords, c(stopwords("english"),"mobile"))
dtm.mobile <- DocumentTermMatrix(text.corp.mobile)
dtm.mobile
dtm.mat.mobile <- as.matrix(dtm.mobile)
dtm.mat.mobile
This returns a table with binary results of whether a keyword appeared in one of the corpus texts or not.
Instead of getting the final result in binary form, I would like to get a count for each keyword. For example:
'car' appeared 5 times
'button' appeared 9 times
Without seeing your actual data it's a bit hard to tell, but since you just called DocumentTermMatrix I would try something like this:
dtm.mat.mobile <- as.matrix(dtm.mobile)
# terms are the columns of a DocumentTermMatrix, so sum over columns
word.freqs <- sort(colSums(dtm.mat.mobile), decreasing = TRUE)
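If you want the output in the "appeared N times" format you describe, a small illustrative loop over the sorted counts:
# print e.g. "'car' appeared 5 times" for each term
for (w in names(word.freqs)) {
  cat(sprintf("'%s' appeared %d times\n", w, as.integer(word.freqs[[w]])))
}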