I am stemming my dataset for sentiment analysis and I get this error message:
"Error in structure(if (length(n)) n else NA, names = x) :
'names' attribute [2] must be the same length as the vector [1]"
Please help!
myCorpus<-Corpus(VectorSource(Datasetlow_cost_airline$text))
# Convert to lower case
myCorpus<-tm_map(myCorpus,tolower)
# Remove punctuation
myCorpus<-tm_map(myCorpus,removePunctuation)
# Remove numbers
myCorpus<-tm_map(myCorpus,removeNumbers)
# Remove URLs (see ?regex for regular expressions and ?gsub for pattern matching)
removeURL<-function(x)gsub("http[[:alnum:]]*","",x)
myCorpus<-tm_map(myCorpus,removeURL)
stopwords("english")
# Add two extra stop words: 'available' and 'via'
myStopwords<-c(stopwords("english"),"available","via","can")
# Remove stopwords from corpus
myCorpus<-tm_map(myCorpus,removeWords,myStopwords)
# Keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy<-myCorpus
# Stem words (reduce each word to its root)
myCorpus<-tm_map(myCorpus,stemDocument)
# Inspect documents (tweets) numbered 11 to 15
for(i in 11:15){
  cat(paste("[[",i,"]]",sep=""))
  writeLines(strwrap(myCorpus[[i]],width=73))
}
# Stem completion
myCorpus<-tm_map(myCorpus,stemCompletion,dictionary=myCorpusCopy)
There seems to be something odd about the stemCompletion function in tm version 0.6. There is a nice workaround here that I've used for this answer. In brief, replace your
# Stem completion
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy) # use spaces!
with
# Stem completion
stemCompletion_mod <- function(x, dict) {
  PlainTextDocument(stripWhitespace(
    paste(stemCompletion(unlist(strsplit(as.character(x), " ")),
                         dictionary = dict, type = "shortest"),
          sep = "", collapse = " ")))
}
# apply workaround function
myCorpus <- lapply(myCorpus, stemCompletion_mod, myCorpusCopy)
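Note that lapply returns a plain list rather than a tm corpus, so if you need a corpus again for later steps (a term-document matrix, say), you can rebuild one; a minimal sketch, assuming tm 0.6:
# rebuild a tm corpus from the list of completed documents
myCorpus <- Corpus(VectorSource(sapply(myCorpus, as.character)))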
If that doesn't help then you'll need to give more details and a sample of your actual data.
Related
I've been working on a dataset, but when I run the code I still get common words such as 'in' and 'and'. I am trying to remove these common words. I know I need to use the stopwords function, but I am not sure where to put it or what command to use after it. I want to find the most common words used to describe a listing, other than words like 'in', 'for', and 'what'.
nycab$name <- as.character(nycab$name)
nycab$name <- tolower(nycab$name)
corpus <- Corpus(VectorSource(nycab$name))
nycwords_dfm <- dfm(nycab$name)
head(nycwords_dfm)
topwordcount <- names(topfeatures(nycwords_dfm, 50))
head(topwordcount)
wordcountnyc_dfm <- dfm_select(nycwords_dfm, pattern = topwordcount)
nycword_fcm <-fcm(wordcountnyc_dfm)
head(nycword_fcm)
nycwordcount2_fcm <- fcm_select(nycword_fcm, pattern = topwordcount)
textplot_network(nycwordcount2_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)
Dataset in case anyone needs it: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
It looks like you are using quanteda, so get rid of the tm part of your code (the Corpus() line).
You can use dfm_remove to get rid of the stopwords.
nycwords_dfm <- dfm(nycab$name)
# remove stopwords
nycwords_dfm <- dfm_remove(nycwords_dfm, stopwords("english"))
# rest of your code
...
If you need to remove more things first use tokens:
# remove punctuation and stopwords via tokens
nycwords_toks <- tokens(nycab$name, remove_punct = TRUE)
nycwords_toks <- tokens_remove(nycwords_toks, stopwords("english"))
nycwords_dfm <- dfm(nycwords_toks)
# rest of your code
....
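Putting this together with the rest of your pipeline, a rough sketch (note: in recent quanteda versions textplot_network lives in the quanteda.textplots package):
library(quanteda)
library(quanteda.textplots)   # textplot_network moved here in quanteda >= 3

# tokenize, drop punctuation and stopwords, then build the dfm
nycwords_toks <- tokens(nycab$name, remove_punct = TRUE)
nycwords_toks <- tokens_remove(nycwords_toks, stopwords("english"))
nycwords_dfm  <- dfm(nycwords_toks)

# top 50 terms and their co-occurrence network
topwordcount      <- names(topfeatures(nycwords_dfm, 50))
nycword_fcm       <- fcm(nycwords_toks)
nycwordcount2_fcm <- fcm_select(nycword_fcm, pattern = topwordcount)
textplot_network(nycwordcount2_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)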
I'm having trouble using a RegEx on a corpus.
I read in a couple of text documents that I converted to a corpus.
I want to display it in a TermDocumentMatrix after some pre-processing.
First I want to extract prefixes with the RegEx "(\b([a-z]*)\B)". For example, for "the host" -> "th", "hos".
Then I want to use character n-grams with n = 1:3, so for the previous example ->
"t", "th", "h", "ho", "hos". That is, I want all characters from the beginning of each word, excluding its last character.
My code so far gives me a TermDocumentMatrix with n = 1:3 over the whole corpus. However, all my approaches to add the RegEx so far haven't been working.
I was wondering if there's a way to include it in: typedPrefix <- tokens()...
Here's the code:
# read documents
FILEDIR <- (path)
txts <- readtext(paste0(FILEDIR, "/", "*.txt"))
my_corpus <- corpus(txts)
#start processing
typedPrefix <- my_corpus
typedPrefix <- tokens(gsub("\\s", "_", typedPrefix), "character", ngrams=1:3, conc="", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
dfm2 <- dfm(typedPrefix)
tdm2 <- as.TermDocumentMatrix(t(dfm2), weighting=weightTf)
as.matrix(tdm2)
#write output file
write.csv2(as.matrix(tdm2), file = "typedPrefix.csv")
I have a dataframe with 60,000 rows/phrases which I would like to use as stopwords and remove from my text.
I use the tm package, and after reading the csv file with the list of stopwords I use this line:
corpus <- tm_map(corpus, removeWords, df$mylistofstopwords)
but I receive this error:
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), :
PCRE pattern compilation error
'regular expression is too large'
at ''
Is the problem that the list is too big? Is there anything I can do to fix it?
You could probably resolve your issue by splitting the stopword list into multiple parts, something like the following:
chunk <- 1000
i <- 0
n <- length(df$mylistofstopwords)
while (i != n) {
  i2 <- min(i + chunk, n)
  corpus <- tm_map(corpus, removeWords, df$mylistofstopwords[(i+1):i2])
  i <- i2
}
Or, you could just use a package that can handle long stopword lists. corpus is one such package. quanteda is another. Here's how to get a document-by-term matrix with corpus:
library(corpus)
x <- term_matrix(corpus, drop = df$mylistofstopwords)
Here, the input argument corpus can be a tm corpus.
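And a similar sketch with quanteda, which matches stopwords by lookup rather than one giant regex (mytext is a hypothetical name for the character vector your corpus was built from):
library(quanteda)
# 'mytext' stands in for the character vector behind your corpus
toks <- tokens(mytext)
toks <- tokens_remove(toks, df$mylistofstopwords)   # no single huge regex, so no PCRE size limit
dtm  <- dfm(toks)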
I am working on a project in R and am just starting to get my hands dirty with it.
In the first part I try to clean the data in the vector msg, but later, when I build the TermDocumentMatrix, these characters still appear.
I would like to remove words with fewer than 4 letters and remove punctuation.
gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
gsub("[[:punct:]]", "", pclbyshares$msg)
corpus <- Corpus(VectorSource(pclbyshares$msg))
TermDocumentMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq=120, highfreq=Inf)
You haven't stored the results of your first two lines of code in variables to use later. So, in your third line, where you create your corpus variable, you are using the unmodified msg data. Give this a try:
msg_clean <- gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
msg_clean <- gsub("[[:punct:]]", "", msg_clean)
corpus <- Corpus(VectorSource(msg_clean))
TermDocumentMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq = 120, highfreq = Inf)
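If you would rather keep the cleaning inside tm itself, a sketch of the same steps using content_transformer (assuming tm >= 0.6):
# same cleaning applied via tm_map instead of pre-cleaning the vector
corpus <- Corpus(VectorSource(pclbyshares$msg))
corpus <- tm_map(corpus, content_transformer(function(x) gsub("\\b\\w{1,4}\\b ", " ", x)))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq = 120, highfreq = Inf)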
I'm trying to get a count of the keywords in my corpus using the R "tm" package. This is my code so far:
# get the data strings
f<-as.vector(forum[[1]])
# replace +
f<-gsub("+", " ", f ,fixed=TRUE)
# lower case
f<-tolower(f)
# show all strings that contain mobile
mobile <- f[grep("mobile", f, ignore.case = FALSE, perl = FALSE, value = FALSE,
                 fixed = FALSE, useBytes = FALSE, invert = FALSE)]
text.corp.mobile <- Corpus(VectorSource(mobile))
text.corp.mobile <- tm_map(text.corp.mobile , removePunctuation)
text.corp.mobile <- tm_map(text.corp.mobile , removeWords, c(stopwords("english"),"mobile"))
dtm.mobile <- DocumentTermMatrix(text.corp.mobile)
dtm.mobile
dtm.mat.mobile <- as.matrix(dtm.mobile)
dtm.mat.mobile
This returns a table with binary results of whether a keyword appeared in one of the corpus texts or not.
Instead of getting the final result in binary form, I would like to get a count for each keyword. For example:
'car' appeared 5 times
'button' appeared 9 times
Without seeing your actual data it's a bit hard to tell, but since you just called DocumentTermMatrix I would try something like this:
dtm.mat.mobile <- as.matrix(dtm.mobile)
# terms are the columns of a DocumentTermMatrix, so sum over columns
word.freqs <- sort(colSums(dtm.mat.mobile), decreasing = TRUE)
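If you want the output in the "appeared N times" format you describe, a small illustrative loop over the sorted counts:
# print e.g. "'car' appeared 5 times" for each term
for (w in names(word.freqs)) {
  cat(sprintf("'%s' appeared %d times\n", w, as.integer(word.freqs[[w]])))
}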