Network analysis of a column from a dataset and use of stopwords - r

I've been working on a dataset, but when I run the code I get all words, including common ones such as 'in' and 'and'. I am trying to remove these common words. I know I need to use the stopwords function, but I am not sure where to put it or what command to use after it. I want to find the words most used to describe a listing, other than words like 'in', 'for', 'what'.
# convert listing names to lowercase character strings
nycab$name <- as.character(nycab$name)
nycab$name <- tolower(nycab$name)
corpus <- Corpus(VectorSource(nycab$name))
nycwords_dfm <- dfm(nycab$name)
head(nycwords_dfm)
# pick the 50 most frequent words, then keep only those in the dfm
topwordcount <- names(topfeatures(nycwords_dfm, 50))
head(topwordcount)
wordcountnyc_dfm <- dfm_select(nycwords_dfm, pattern = topwordcount)
# co-occurrence network of the top words
nycword_fcm <- fcm(wordcountnyc_dfm)
head(nycword_fcm)
nycwordcount2_fcm <- fcm_select(nycword_fcm, pattern = topwordcount)
textplot_network(nycwordcount2_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)
Dataset in case anyone needs it: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

Looks like you are using quanteda, so get rid of the tm part of your code (the Corpus line).
You can use dfm_remove to get rid of the stopwords.
nycwords_dfm <- dfm(nycab$name)
# remove stopwords
nycwords_dfm <- dfm_remove(nycwords_dfm, stopwords("english"))
# rest of your code
...
If you need to remove more things, first use tokens:
# remove punctuation and stopwords via tokens
nycwords_toks <- tokens(nycab$name, remove_punct = TRUE)
nycwords_toks <- tokens_remove(nycwords_toks, stopwords("english"))
nycwords_dfm <- dfm(nycwords_toks)
# rest of your code
...
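Putting both pieces together, a minimal sketch of the full pipeline (assuming nycab is already loaded; in quanteda 3 and later, textplot_network lives in the quanteda.textplots package):
library(quanteda)
library(quanteda.textplots)  # only needed for quanteda >= 3

# tokenize, dropping punctuation, lowercase, then remove stopwords
nycwords_toks <- tokens(nycab$name, remove_punct = TRUE)
nycwords_toks <- tokens_tolower(nycwords_toks)
nycwords_toks <- tokens_remove(nycwords_toks, stopwords("english"))

# document-feature matrix, restricted to the 50 most frequent terms
nycwords_dfm <- dfm(nycwords_toks)
topwordcount <- names(topfeatures(nycwords_dfm, 50))
wordcountnyc_dfm <- dfm_select(nycwords_dfm, pattern = topwordcount)

# feature co-occurrence matrix and network plot of the top terms
nycword_fcm <- fcm(wordcountnyc_dfm)
textplot_network(nycword_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)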

Related

DocumentTermMatrix misses some words

I am using DocumentTermMatrix to find a list of keywords in a long text. Most of the words in my list are found correctly, but a couple are missing. I would love to post a minimal working example, but the problem is: one of the words ("insolvency", so not a short word as in the problem here) is missed in a 32-page document. The word is actually on page 7 of the text, and if I reduce my text with text <- text[7], DocumentTermMatrix actually finds it! So I am not able to reproduce this with a minimal working example...
Do you have any ideas?
Below a sketch of my script:
library(fastpipe)
library(openxlsx)
library(tm)
`%>>%` <- fastpipe::`%>>%`
source("cleanText.R") # Custom function to clean up the text from reports
keywords_xlsx <- read.xlsx(paste0(getwd(), "/Keywords.xlsx"),
                           sheet = "all",
                           startRow = 1,
                           colNames = FALSE,
                           skipEmptyRows = TRUE,
                           skipEmptyCols = TRUE)
keywords <- keywords_xlsx[1] %>>%
  tolower(as.character(.[, 1]))
# Custom function to read pdfs
read <- readPDF(control = list(text = "-layout"))
# Extract text from pdf
report <- "my_report.pdf"
document <- Corpus(URISource(paste0("./Annual reports/", report)), readerControl = list(reader = read))
text <- content(document[[1]])
text <- cleanText(report, text) # This is a custom function to clean up the texts
# text <- text[7] # If I do this, my word is found! Otherwise it is missed
# Create a corpus
text_corpus <- Corpus(VectorSource(text))
matrix <- t(as.matrix(inspect(DocumentTermMatrix(text_corpus,
                                                 list(dictionary = keywords,
                                                      list(wordLengths = c(1, Inf)))))))
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
The problem lies in your use of inspect. Only use inspect to check whether your code is working and whether a dtm has any values. Never use inspect inside functions / transformations, because inspect by default only shows the first 10 rows and 10 columns of a document term matrix.
Also, if you want the transpose of a dtm, use TermDocumentMatrix instead.
Your last line should be:
mat <- as.matrix(TermDocumentMatrix(text_corpus,
                                    control = list(dictionary = keywords,
                                                   wordLengths = c(1, Inf))))
Note that turning a dtm / tdm into a matrix will use a lot more memory than keeping the data in a sparse matrix.
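If memory becomes an issue, here is a sketch of the same frequency count that stays sparse, using the slam package that tm is built on (row_sums below is slam::row_sums, not base R; keywords is assumed to be a character vector as above):
library(slam)

tdm <- TermDocumentMatrix(text_corpus,
                          control = list(dictionary = keywords,
                                         wordLengths = c(1, Inf)))
# slam::row_sums works directly on the sparse triplet representation
words <- sort(row_sums(tdm), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)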

Within the context of tm::content_transformer() how would I use mgsub?

qdap::mgsub takes the following parameters:
mgsub(pattern, replacement, text.var)
Within library(tm) corpus transformation you can wrap non tm functions within content_transformer(), e.g.
corpus <- tm_map(corpus, content_transformer(tolower))
Here is a data frame with some poorly spelt text:
df <- data.frame(
  id = 1:2,
  sometext = c("[cad] appls", "bannanas")
)
And here is a data frame with a custom lookup for misspelt words:
spldoc <- data.frame(
  incorrects = c("appls", "bannanas"),
  corrects = c("apples", "bananas")
)
Using mgsub outwith the context of corpus and content_transformer() I could just do this:
wrongs <- select(spldoc, incorrects)[,1] %>% paste0("\\b",.,"\\b") # prepend and append \\b to create word boundary regex
rights <- select(spldoc, corrects)[,1]
df$sometext <- mgsub(wrongs, rights, df$sometext, fixed = F)
But I can't see how I could write mgsub inside a function to pass to content_transformer(). What would my argument for text.var be, as in mgsub(pattern, replacement, text.var)?
This is what I did:
# create a separate content transformer to pass into tm_map()
spelling_update <- content_transformer(function(x, lut) {
  mgsub(paste0("\\b", lut[, 1], "\\b"), lut[, 2], x, fixed = FALSE)
})
Then
corpus <- tm_map(corpus, spelling_update, spldoc)
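tm_map passes any extra arguments through to the transformer, so the lookup table rides along as lut. For completeness, a sketch of applying it to the example data frame (the output comment is what I would expect, not verified here):
library(tm)
library(qdap)

# build a corpus from the example text and apply the transformer
corpus <- VCorpus(VectorSource(df$sometext))
corpus <- tm_map(corpus, spelling_update, spldoc)
content(corpus[[1]])  # should now read "[cad] apples"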

Using RemoveWords in tm_map in R on words loaded from a file

I have seen several questions about using the removeWords function from the tm package in R to remove either stopwords() or hard-coded words from a corpus. However, I am trying to remove words stored in a file (currently csv, but I don't care which type). Using the code below, I don't get any errors, but my words are still there. Could someone please explain what is wrong?
#install.packages('tm')
library(tm)
setwd("c://Users//towens101317//Desktop")
problem_statements <- read.csv("query_export_results_100.csv", stringsAsFactors = FALSE, header = TRUE)
problem_statements_text <- paste(problem_statements, collapse=" ")
problem_statements_source <- VectorSource(problem_statements_text)
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors=FALSE, header = TRUE)
my_stop_words_text <- paste(my_stop_words, collapse=" ")
corpus <- Corpus(problem_statements_source)
corpus <- tm_map(corpus, removeWords, my_stop_words_text)
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
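The likely culprit is the paste(my_stop_words, collapse = " ") step: removeWords expects a character vector with one word per element, and collapsing everything into a single string turns the whole list into one long "word" that never matches. A minimal sketch of the fix, assuming the stopwords sit in the first column of the csv:
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors = FALSE, header = TRUE)
# pass a character vector of words, not one collapsed string
corpus <- tm_map(corpus, removeWords, my_stop_words[, 1])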

error message while stemming for sentiment analysis

I am stemming my dataset for sentiment analysis and I get this error message:
"Error in structure(if (length(n)) n else NA, names = x) :
'names' attribute [2] must be the same length as the vector [1]"
Please help!
myCorpus<-Corpus(VectorSource(Datasetlow_cost_airline$text))
# Convert to lower case
myCorpus<-tm_map(myCorpus,tolower)
# Remove punctuation
myCorpus<-tm_map(myCorpus,removePunctuation)
# Remove numbers
myCorpus<-tm_map(myCorpus,removeNumbers)
# Remove URLs (see ?regex for regular expressions and ?gsub for pattern matching)
removeURL<-function(x)gsub("http[[:alnum:]]*","",x)
myCorpus<-tm_map(myCorpus,removeURL)
stopwords("english")
# Add two extra stop words: 'available' and 'via'
myStopwords<-c(stopwords("english"),"available","via","can")
# Remove stopwords from corpus
myCorpus<-tm_map(myCorpus,removeWords,myStopwords)
# Keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy<-myCorpus
# Stem word (change all the words to its root word)
myCorpus<-tm_map(myCorpus,stemDocument)
# Inspect documents (tweets) numbered 11 to 15
for (i in 11:15) {
  cat(paste("[[", i, "]]", sep = ""))
  writeLines(strwrap(myCorpus[[i]], width = 73))
}
# Stem completion
myCorpus<-tm_map(myCorpus,stemCompletion,dictionary=myCorpusCopy)
There seems to be something odd about the stemCompletion function in tm version 0.6. There is a nice workaround here that I've used for this answer. In brief, replace your
# Stem completion
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy) # use spaces!
with
# Stem completion
stemCompletion_mod <- function(x, dict) {
  PlainTextDocument(
    stripWhitespace(
      paste(stemCompletion(unlist(strsplit(as.character(x), " ")),
                           dictionary = dict, type = "shortest"),
            sep = "", collapse = " ")))
}
# apply workaround function
myCorpus <- lapply(myCorpus, stemCompletion_mod, dict = myCorpusCopy)
If that doesn't help then you'll need to give more details and a sample of your actual data.
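Note that lapply returns a plain list rather than a tm corpus. As a sketch of an alternative (assuming tm >= 0.6), the same workaround can be wrapped in content_transformer so the result stays a corpus and works with later tm_map calls:
# keep the stem completion inside tm_map via content_transformer
stemCompletion_ct <- content_transformer(function(x, dict) {
  stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x), " ")),
                                       dictionary = dict, type = "shortest"),
                        collapse = " "))
})
myCorpus <- tm_map(myCorpus, stemCompletion_ct, myCorpusCopy)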

Getting count of keywords using tm package in R

I'm trying to get a count of the keywords in my corpus using the R "tm" package. This is my code so far:
# get the data strings
f <- as.vector(forum[[1]])
# replace +
f <- gsub("+", " ", f, fixed = TRUE)
# lower case
f <- tolower(f)
# keep all strings that contain "mobile"
mobile <- f[grep("mobile", f)]
text.corp.mobile <- Corpus(VectorSource(mobile))
text.corp.mobile <- tm_map(text.corp.mobile, removePunctuation)
text.corp.mobile <- tm_map(text.corp.mobile, removeWords, c(stopwords("english"), "mobile"))
dtm.mobile <- DocumentTermMatrix(text.corp.mobile)
dtm.mobile
dtm.mat.mobile <- as.matrix(dtm.mobile)
dtm.mat.mobile
This returns a table with binary results of whether a keyword appeared in one of the corpus texts or not.
Instead of the final result in binary form, I would like to get a count for each keyword. For example:
'car' appeared 5 times
'button' appeared 9 times
Without seeing your actual data it's a bit hard to tell, but since you just called DocumentTermMatrix I would try something like this:
dtm.mat.mobile <- as.matrix(dtm.mobile)
# terms are the columns of a DocumentTermMatrix, so sum over columns
word.freqs <- sort(colSums(dtm.mat.mobile), decreasing = TRUE)
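To get output in the "word appeared N times" form from the question, a small sketch on top of that:
# print one line per term, e.g. 'car' appeared 5 times
cat(sprintf("'%s' appeared %d times", names(word.freqs), as.integer(word.freqs)), sep = "\n")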
