I'm trying to get a count of the keywords in my corpus using the R "tm" package. This is my code so far:
# get the data strings
f<-as.vector(forum[[1]])
# replace +
f<-gsub("+", " ", f, fixed=TRUE)
# lower case
f<-tolower(f)
# show all strings that contain mobile
mobile<- f[grep("mobile", f, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)]
text.corp.mobile <- Corpus(VectorSource(mobile))
text.corp.mobile <- tm_map(text.corp.mobile , removePunctuation)
text.corp.mobile <- tm_map(text.corp.mobile , removeWords, c(stopwords("english"),"mobile"))
dtm.mobile <- DocumentTermMatrix(text.corp.mobile)
dtm.mobile
dtm.mat.mobile <- as.matrix(dtm.mobile)
dtm.mat.mobile
This returns a table with binary results of whether a keyword appeared in one of the corpus texts or not.
Instead of getting the final result in a binary form I would like to get a count for each keyword. For example:
'car' appeared 5 times
'button' appeared 9 times
Without seeing your actual data it's a bit hard to tell, but since you just built a DocumentTermMatrix I would try something like this:
dtm.mat.mobile <- as.matrix(dtm.mobile)
# terms are the columns of a DocumentTermMatrix, so colSums gives per-word counts
word.freqs <- sort(colSums(dtm.mat.mobile), decreasing=TRUE)
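If you then want those counts as a readable two-column table, a small follow-up (word.freqs is the named vector built above):
# turn the named frequency vector into a data frame
freq.df <- data.frame(word = names(word.freqs), freq = word.freqs, row.names = NULL)
head(freq.df)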
I am using DocumentTermMatrix to find a list of keywords in a long text. Most of the words in my list are found correctly, but a couple are missing. I would love to post a minimal working example here, but the problem is: one of the words ("insolvency", so not a short word as in the problem here) is missed in a document of 32 pages. The word is actually on page 7 of the text, but if I reduce my text with text <- text[7], then DocumentTermMatrix does find it! So I am not able to reproduce this with a minimal working example...
Do you have any ideas?
Below a sketch of my script:
library(fastpipe)
library(openxlsx)
library(tm)
`%>>%` <- fastpipe::`%>>%`
source("cleanText.R") # Custom function to clean up the text from reports
keywords_xlsx <- read.xlsx(paste0(getwd(),"/Keywords.xlsx"),
sheet = "all",
startRow = 1,
colNames = FALSE,
skipEmptyRows = TRUE,
skipEmptyCols = TRUE)
keywords <- keywords_xlsx[1] %>>%
tolower(as.character(.[,1]))
# Custom function to read pdfs
read <- readPDF(control = list(text = "-layout"))
# Extract text from pdf
report <- "my_report.pdf"
document <- Corpus(URISource(paste0("./Annual reports/", report)), readerControl = list(reader = read))
text <- content(document[[1]])
text <- cleanText(report, text) # This is a custom function to clean up the texts
# text <- text[7] # If I do this, my word is found! Otherwise it is missed
# Create a corpus
text_corpus <- Corpus(VectorSource(text))
matrix <- t(as.matrix(inspect(DocumentTermMatrix(text_corpus,
list(dictionary = keywords,
list(wordLengths=c(1, Inf))
)
))))
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
The problem lies in your use of inspect. Only use inspect to check whether your code is working and whether a dtm has any values. Never use inspect inside functions / transformations, because inspect by default only shows the first 10 rows and 10 columns of a document term matrix.
Also if you want to transpose the outcome of a dtm, use TermDocumentMatrix.
Your last line should be (note that wordLengths belongs directly in the control list, not wrapped in a second list()):
mat <- as.matrix(TermDocumentMatrix(text_corpus,
                                    control = list(dictionary = keywords,
                                                   wordLengths = c(1, Inf))))
Note that turning a dtm / tdm into a matrix will use a lot more memory than having the data inside a sparse matrix.
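If memory does become an issue, one option (a sketch, assuming the slam package that tm is built on) is to sum the counts on the sparse representation directly:
library(slam)
# a tdm is a sparse simple_triplet_matrix underneath, so row_sums
# computes per-term counts without densifying the whole matrix
tdm <- TermDocumentMatrix(text_corpus,
                          control = list(dictionary = keywords,
                                         wordLengths = c(1, Inf)))
words <- sort(row_sums(tdm), decreasing = TRUE)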
I've been working on a dataset, but when I run the code I get all words, such as 'in' and 'and'. I was trying to remove these common words. I know I need to use the stopwords function, but I am not sure where to input it and what command to use after it. I want to find the words most used to describe a listing, other than 'in', 'for', 'what'.
nycab$name <- as.character((nycab$name))
nycab$name <- tolower(nycab$name)
corpus <- Corpus(VectorSource(nycab$name))
nycwords_dfm <- dfm(nycab$name)
head(nycwords_dfm)
topwordcount <- names(topfeatures(nycwords_dfm,50))
wordcountnyc_dfm <- dfm_select(nycwords_dfm, pattern = topwordcount)
head(topwordcount)
nycword_fcm <-fcm(wordcountnyc_dfm)
head(nycword_fcm)
nycwordcount2_fcm <- fcm_select(nycword_fcm, pattern = topwordcount)
textplot_network(nycwordcount2_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)
Dataset in case anyone needs it: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
Looks like you are using quanteda, so get rid of the tm part of your code (the corpus line).
You can use dfm_remove to get rid of the stopwords.
nycwords_dfm <- dfm(nycab$name)
# remove stopwords
nycwords_dfm <- dfm_remove(nycwords_dfm, stopwords("english"))
# rest of your code
...
If you need to remove more things first use tokens:
# remove punctuation and stopwords via tokens
nycwords_toks <- tokens(nycab$name, remove_punct = TRUE)
nycwords_toks <- tokens_remove(nycwords_toks, stopwords("english"))
nycwords_dfm <- dfm(nycwords_toks)
# rest of your code
...
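To directly answer the "words most used to describe a listing" part: once the stopwords are gone, topfeatures gives the counts (20 here is an arbitrary choice):
# most frequent remaining words and their counts
topfeatures(nycwords_dfm, 20)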
I'm having trouble using a RegEx on a corpus.
I read in a couple of text documents that I converted to a corpus.
I want to display it in a TermDocumentMatrix after some pre-processing.
First I want to reduce the words with the RegEx "(\b([a-z]*)\B)". For example, "the host" -> "th", "hos".
Then I want to use character n-grams with n = 1:3, so for the previous example ->
"t", "th", "h", "ho", "hos". Hence I want all characters that form the beginning of a word but do not include its last character.
My code so far gives me a TermDocumentMatrix with n = 1:3 on the whole corpus. However, all my approaches to adding the RegEx haven't been working.
I was wondering if there's a way to include it in typedPrefix <- tokens()...
Here's the code:
# read documents
FILEDIR <- (path)
txts <- readtext(paste0(FILEDIR, "/", "*.txt"))
my_corpus <- corpus(txts)
#start processing
typedPrefix <- my_corpus
typedPrefix <- tokens(gsub("\\s", "_", typedPrefix), "character", ngrams=1:3, conc="", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
dfm2 <- dfm(typedPrefix)
tdm2 <- as.TermDocumentMatrix(t(dfm2), weighting=weightTf)
as.matrix(tdm2)
#write output file
write.csv2(as.matrix(tdm2), file = "typedPrefix.csv")
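One way to get those prefixes without fighting the RegEx inside tokens() is to build them with substring() and convert the result back to tokens. A sketch under that assumption (it sidesteps the RegEx entirely rather than embedding it):
library(quanteda)
# word-level tokens first, cleaned as in your code
word_toks <- tokens(my_corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
# for each word keep its prefixes of length 1 to 3, never the whole word
prefix_list <- lapply(as.list(word_toks), function(words) {
  words <- tolower(words)
  unlist(lapply(words, function(w) {
    n <- min(3, nchar(w) - 1)          # drop the last character
    if (n < 1) return(character(0))
    substring(w, 1, seq_len(n))        # "host" -> "h" "ho" "hos"
  }))
})
dfm_prefix <- dfm(as.tokens(prefix_list))
# then continue with your as.TermDocumentMatrix / write.csv2 steps as before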
I am doing text analysis in R. Thus far, I have a vector that contains the corpus, and metadata in a csv that I would like to merge with it. Here is how I obtain the corpus in vector form:
corpus <- VCorpus(VectorSource(alldocs)) # corpus is a vector
Here is the metadata:
metadata <- read.csv("alldocs.csv", header = TRUE, na.strings = c(""), sep = ",")
How can I combine the two? I want to combine them in order (i.e., first document in corpus corresponds to first row in csv, etc.). In the end, I want a dataframe where each row corresponds to the right document from the corpus.
Update:
I was told to try to make the problem reproducible.
I started with a folder containing all the texts I have, loading them like this:
alldocs <- Corpus(
DirSource("/path/file/wheredocumentsare"),
readerControl = list(reader = readPlain, language = "en", load = FALSE)
)
corpus <- VCorpus(VectorSource(alldocs)) # corpus is a vector
metadata <- read.csv("metadata.csv", header = TRUE, na.strings = c(""), sep = ",")
I would like to combine metadata and corpus. Yet when I input
fulldata <- data.frame(corpus, metadata)
I get the following error message
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class "c("VCorpus", "Corpus")" to a data.frame
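For what it's worth, a data.frame cannot hold a VCorpus directly; the text has to come back out of the corpus first. A sketch, assuming the documents and the csv rows really are in the same order (note too that alldocs is already a corpus, so re-wrapping it in VectorSource may not be what you want):
# pull each document's text out as one string, then bind the metadata columns
texts <- sapply(corpus, function(d) paste(content(d), collapse = " "))
fulldata <- data.frame(text = texts, metadata, stringsAsFactors = FALSE)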
I'm doing stemming on my dataset for sentiment analysis and I got this error message:
"Error in structure(if (length(n)) n else NA, names = x) :
'names' attribute [2] must be the same length as the vector [1]"
Please help!
myCorpus<-Corpus(VectorSource(Datasetlow_cost_airline$text))
# Convert to lower case
myCorpus<-tm_map(myCorpus,tolower)
# Remove puntuation
myCorpus<-tm_map(myCorpus,removePunctuation)
# Remove numbers
myCorpus<-tm_map(myCorpus,removeNumbers)
# Remove URLs (?regex for regular expressions, ?gsub for pattern matching)
removeURL<-function(x)gsub("http[[:alnum:]]*","",x)
myCorpus<-tm_map(myCorpus,removeURL)
stopwords("english")
# Add two extra stop words: 'available' and 'via'
myStopwords<-c(stopwords("english"),"available","via","can")
# Remove stopwords from corpus
myCorpus<-tm_map(myCorpus,removeWords,myStopwords)
# Keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy<-myCorpus
# Stem word (change all the words to its root word)
myCorpus<-tm_map(myCorpus,stemDocument)
# Inspect documents (tweets) numbered 11 to 15
for(i in 11:15){
cat(paste("[[",i,"]]",sep=""))
writeLines(strwrap(myCorpus[[i]],width=73))
}
# Stem completion
myCorpus<-tm_map(myCorpus,stemCompletion,dictionary=myCorpusCopy)
There seems to be something odd about the stemCompletion function in tm version 0.6. There is a nice workaround here that I've used for this answer. In brief, replace your
# Stem completion
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy) # use spaces!
with
# Stem completion
stemCompletion_mod <- function(x,dict) {
PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")), dictionary = dict, type = "shortest"), sep = "", collapse = " ")))
}
# apply workaround function
myCorpus <- lapply(myCorpus, stemCompletion_mod, dict = myCorpusCopy)
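One caveat with the workaround: lapply returns a plain list of documents rather than a corpus, so if you need a corpus again afterwards (e.g. for a DocumentTermMatrix), rebuilding one from the completed texts is a reasonable sketch:
# rebuild a tm corpus from the completed documents
texts <- sapply(myCorpus, function(d) paste(as.character(d), collapse = " "))
myCorpus <- Corpus(VectorSource(texts))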
If that doesn't help then you'll need to give more details and a sample of your actual data.