I am running the following code and receiving this error:
Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer,
: java.lang.NullPointerException
setwd("C:\\Users\\jbarr\\Desktop\\test)
library (tm); library (wordcloud);library (RWeka); library (tau);library(xlsx);
Comment <- read.csv("testfile.csv",stringsAsFactors=FALSE)
str(Comment)
review_source <- VectorSource(Comment)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords,stopwords(kind = "english"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c("member", "advise", "inform", "informed", "caller", "call","provided", "advised"))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
wordfreq <- colSums(dtm2)
wordfreq <- sort(wordfreq, decreasing=TRUE)
head(wordfreq, n=100)
wfreq <- head(wordfreq, 500)
set.seed(142)
words <- names(wfreq)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(words[1:100], wordfreq[1:100], rot.per=0.35, scale=c(2.7, .4), colors=dark2, random.order=FALSE)
write.xlsx(wfreq, "C:\\Users\\jbarr\\Desktop\\test")
The interesting problem is, I have ran this code on multiple files, and only specific ones have the error.
Sanmeet is right - it's a problem with NAs in your data frame.
just prior to your line: review_source <- VectorSource(Comment)
insert the line below:
Comment[which(is.na(Comment))] <- "NULLVALUEENTERED"
This will change all of your NA values to the phrase NULLVALUEENTERED (feel free to change that). No more NAs, and the code should run fine.
You are getting the error in tokenizer due to NA in your string vector Comment
Comment <- read.csv("testfile.csv",stringsAsFactors=FALSE)
str(Comment)
length(Comment)
Comment = Comment[complete.cases(Comment)]
length(Comment)
Or you can also use is.na as below
Comment = Comment[!is.na(Comment)]
Now apply the preprocessing steps, create the corpus etc
Hope this helps.
A Suggestion: I get this error when reading an excel (.xlsx) file using:
df2 <- read.xlsx2("foobar.xlsx", sheetName = "Sheet1", startRow = 1, endRow = 0).
Notice it appears that the value for endRow should be NULL or a valid number. But
df2 <- read.xlsx2("foobar.xlsx", sheetName = "Sheet1")
works fine. So you might want to check your argument values and argument to parameter alignment.
Seems like there are NAs in your data Frame. Run is.na() and remove those rows. Try running the code again. It should work.
Related
I have multiple *.txt files that contain the title and texts that I want to process in R. A program below reads all the *.txt and displays the final file while skipping the first read texts.
My program is as here below. It uses for loop and I want to see all the texts
library(here)
library(glue)
library(tm)
library(SnowballC)
library(tidyverse)
library(tidytext)
all_texts <- list.files(setwd('.KCI/'), (startsWith = 'abstract'))
for(i in seq(1:length(all_texts)))
{
data <- read_tsv(all_texts[i], , show_col_types = FALSE)
corpus <- Corpus(VectorSource(data[i]))
corpus[i] <- tm_map(corpus[i], tolower)
corpus[i] <- tm_map(corpus[i], removePunctuation)
corpus[i] <- tm_map(corpus[i], removeNumbers)
corpus[i] <- tm_map(corpus[i], stripWhitespace)
corpus[i] <- tm_map(corpus[i], removeWords, c(stopwords("english"), mystopwords))
corpus[i] <- tm_map(corpus[i], stemDocument)
dtm <- DocumentTermMatrix(corpus[i])
}
This program just reads the final document but skips the previous ones. Therefore I want even other documents to be displayed before the last one.
<Title> <Year> <Text>
How is it? 1998 I am wondering if it could end like that. Therefore the deal is too good to be true
This would be a lot easier if you had provided some data.
library(tm)
library(SnowballC)
##
# two documents based on your example (t1 & t2 are identical here).
#
t1 <- read.delim(text='
Title\tYear\tText
How is it?\t1998\tI am wondering if it could end like that. Therefore the deal is too good to be true',
header=TRUE)
t2 <- read.delim(text='
Title\tYear\tText
How is it?\t1998\tI am wondering if it could end like that. Therefore the deal is too good to be true',
header=TRUE)
data <- list(t1,t2) # listof documents
dtm.list <- lapply(data, function(x) {
corpus <- Corpus(VectorSource(x))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
DocumentTermMatrix(corpus)
})
lapply(dtm.list, inspect)
Note I left out mystopwords because you did not provide any.
In your case you could put the read_tsv(...) back into the function and use lapply(...) in the list of file names. Something like:
dtm.list <- lapply(all.texts, function(x) {
data <- read_tsv(x)
corpus <- Corpus(VectorSource(data))
...
})
Where ... are the lines of code in my example above.
If your ultimate goal is to analyze word frequency, you might be better off using ?termFreq.
I have a problem with the word stemming completion of my created corpus using the tm package.
Here are the most important lines of my code:
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))
corpus
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
"use", "see", "used", "via", "amp")
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove other languages or more specifically anything with a non "a-z" and "0-9" character
corpus <- tm_map(corpus, content_transformer(function(s){
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = s,
replacement = " ",
ignore.case = TRUE,
perl = TRUE)
}))
# Keep a copy of the generated corpus for stem completion later as dictionary
corpus_copy <- corpus
# Stemming words of corpus
corpus <- tm_map(corpus, stemDocument, language="english")
Now to complete the word stemming I apply stemCompletion of the tm package.
# Completing the stemming with the generated dictionary
corpus <- tm_map(corpus, content_transformer(stemCompletion), dictionary = corpus_copy, type="prevalent")
However, this is where my corpus gets destroyed and messed up and the stemCompletion does not work properly. Peculiarly, R does not indicate an error, the code runs but the result is terrible.
Does anybody know a solution for this? BTW my "comments_final" data frame consist of youtube comments, which I downloaded using the tubeR package.
Thank you so much for your help in advance, I really need help for my master's thesis thank you.
It does seem to work in a bit weird way, so I came up with my own stemCompletion function and applied it to the corpus. In your case try this:
stemCompletion2 <- function(x, dictionary) {
# split each word and store it
x <- unlist(strsplit(as.character(x), " "))
# # Oddly, stemCompletion completes an empty string to
# a word in dictionary. Remove empty string to avoid issue.
x <- x[x != ""]
x <- stemCompletion(x, dictionary=dictionary)
x <- paste(x, sep="", collapse=" ")
PlainTextDocument(stripWhitespace(x))
}
corpus <- lapply(corpus, stemCompletion2, corpus_copy)
corpus <- as.VCorpus(corpus)`
Hope this helps!
I am new in supervised methods. Here is my way to normalize my data:
corpuscleaned1 <- tm_map(AI_corpus, removePunctuation) ## Revome punctuation.
corpuscleaned2 <- tm_map(corpuscleaned1, stripWhitespace) ## Remove Whitespace.
corpuscleaned3 <- tm_map(corpuscleaned2, removeNumbers) ## Remove Numbers.
corpuscleaned4 <- tm_map(corpuscleaned3, stemDocument, language = "english") ## Remove StemW.
corpuscleaned5 <- tm_map(corpuscleaned4, removeWords, stopwords("en")) ## Remove StopW.
head(AI_corpus[[1]]$content) ## Examine original txt.
head(corpuscleaned5[[1]]$content) ## Examine clean txt.
AI_corpus <- my corpus about Amnesty Int. reports 1993-2013.
I need to train a model which would perform multilabel multiclass categorization on text data.
Currently, i'm using mlr package in R. But unluckily I didn't proceed further because of the error I got it before training a model.
More specifically I'm stuck in this place:
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
and, got this error
Error in makeMultilabelTask(id = "classif", data = termsDf, target = target) :
Assertion on 'data' failed: Columns must be named according to R's variable naming conventions and may not contain special characters.
I used this example: -
Multi-label text classification using mlr package in R
Here is a complete code snippet i'm using so far,
tm <- read.csv("translate_text_V02.csv", header = TRUE,
stringsAsFactors = FALSE, na.strings = c("", "NA"))
process <- tm[, c("label", "text")]
process <- na.omit(process)
docs <- Corpus(VectorSource(process$text))
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, mystopwords)
corpus <- tm_map(corpus, removeWords, stopwords("SMART"))
corpus <- tm_map(corpus, removeWords, stopwords("german"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument, language = "english")
return(corpus)
}
clean_corp <- clean_corpus(docs)
terms <-DocumentTermMatrix(clean_corp)
m <- as.matrix(terms)
m <- cbind(m,process$label)
termsDf <- as.data.frame(m)
target <- unique(termsDf[,2628]) %>% as.character() %>% sort()
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
I created the data frame after Document term matrix with the label class. but I'm stuck afterwords how can I proceed further with machine learning part?
Questions for kind answer: -
How can I proceed further with the creation of DocumentTermMatrix?
How to apply the random-forest algorithm on this particular dataset?
I am trying to create a dendrogram in r based off an excel sheet for use in text mining. I have one large column, each cell with a string of text. I want the smallest branch of the dendrogram to represent an individual cell, yet when I run my script I instead get a dendrogram of every word within the entire excel file. How do I fix this?
library(tm)
library(stringi)
library(proxy)
Data <- read.csv(file.choose(),header=TRUE)
docs <- Corpus(VectorSource(Data))
docs[[1]]
docs1 <- tm_map(docs, PlainTextDocument)
docs2 <- tm_map(docs1, stripWhitespace)
docs3 <- tm_map(docs2, removeWords, stopwords("english"))
docs4 <- tm_map(docs3, removePunctuation)
docs5 <- tm_map(docs4, content_transformer(tolower))
docs5[[1]]
TermMatrix <- TermDocumentMatrix(docs5)
docsdissim <- dist(as.matrix(TermMatrix), method = "euclidean")
docsdissim2 <- as.matrix(docsdissim)
docsdissim2
h <- hclust(docsdissim, method = "ward.D2")
I have seen several questions about using the removewords function in the tm_map package of R in order to remove either stopwords() or hard coded words from a corpus. However, I am trying to remove words stored in a file (currently csv, but I don't care which type). Using the code below, I don't get any errors, but my words are still there. Could someone please explain what is wrong?
#install.packages('tm')
library(tm)
setwd("c://Users//towens101317//Desktop")
problem_statements <- read.csv("query_export_results_100.csv", stringsAsFactors = FALSE, header = TRUE)
problem_statements_text <- paste(problem_statements, collapse=" ")
problem_statements_source <- VectorSource(problem_statements_text)
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors=FALSE, header = TRUE)
my_stop_words_text <- paste(my_stop_words, collapse=" ")
corpus <- Corpus(problem_statements_source)
corpus <- tm_map(corpus, removeWords, my_stop_words_text)
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)