I am using DocumentTermMatrix to find a list of keywords in a long text. Most of the words in my list are found correctly, but a couple are missing. I would love to post a minimal working example, but the problem is that one of the words ("insolvency", so not a short word as in the problem linked here) is missed in a 32-page document. The word actually appears on page 7 of the text, and if I reduce my text with text <- text[7], then DocumentTermMatrix does find it! So I am not able to reproduce this with a minimal working example...
Do you have any ideas?
Below a sketch of my script:
library(fastpipe)
library(openxlsx)
library(tm)
`%>>%` <- fastpipe::`%>>%`
source("cleanText.R") # Custom function to clean up the text from reports
keywords_xlsx <- read.xlsx(paste0(getwd(), "/Keywords.xlsx"),
                           sheet = "all",
                           startRow = 1,
                           colNames = FALSE,
                           skipEmptyRows = TRUE,
                           skipEmptyCols = TRUE)
keywords <- keywords_xlsx[1] %>>%
  tolower(as.character(.[,1]))
# Custom function to read pdfs
read <- readPDF(control = list(text = "-layout"))
# Extract text from pdf
report <- "my_report.pdf"
document <- Corpus(URISource(paste0("./Annual reports/", report)), readerControl = list(reader = read))
text <- content(document[[1]])
text <- cleanText(report, text) # This is a custom function to clean up the texts
# text <- text[7] # If I do this, my word is found! Otherwise it is missed
# Create a corpus
text_corpus <- Corpus(VectorSource(text))
matrix <- t(as.matrix(inspect(DocumentTermMatrix(text_corpus,
                                                 list(dictionary = keywords,
                                                      list(wordLengths = c(1, Inf)))))))
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
The problem lies in your use of inspect. Only use inspect to check whether your code is working and to see whether a dtm has any values. Never use inspect inside functions / transformations, because inspect by default only shows the first 10 rows and 10 columns of a document term matrix.
Also, if you want the transpose of a dtm, use TermDocumentMatrix instead.
Your last line should be:
mat <- as.matrix(TermDocumentMatrix(text_corpus,
                                    control = list(dictionary = keywords,
                                                   wordLengths = c(1, Inf))))
Note that turning a dtm / tdm into a matrix will use a lot more memory than having the data inside a sparse matrix.
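If all you need are the keyword frequencies, you can avoid the dense conversion entirely and sum the sparse matrix directly. A minimal sketch, reusing the text_corpus and keywords objects from your script (tm stores its matrices as slam simple triplet matrices, so slam::row_sums() works on them):
library(slam)  # installed as a dependency of tm

tdm <- TermDocumentMatrix(text_corpus,
                          control = list(dictionary = keywords,
                                         wordLengths = c(1, Inf)))

# sum term frequencies without ever materialising the dense matrix
words <- sort(slam::row_sums(tdm), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)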
I've been working on a dataset, but when I run the code I get all the common words such as 'in' and 'and'. I am trying to remove these common words. I know I need to use the stopwords function, but I am not sure where to put it and what command to use after it. I want to find the words most used to describe a listing, other than 'in', 'for', 'what'.
nycab$name <- as.character((nycab$name))
nycab$name <- tolower(nycab$name)
corpus <- Corpus(VectorSource(nycab$name))
nycwords_dfm <- dfm(nycab$name)
head(nycwords_dfm)
topwordcount <- names(topfeatures(nycwords_dfm, 50))
head(topwordcount)
wordcountnyc_dfm <- dfm_select(nycwords_dfm, pattern = topwordcount)
nycword_fcm <- fcm(wordcountnyc_dfm)
head(nycword_fcm)
nycwordcount2_fcm <- fcm_select(nycword_fcm, pattern = topwordcount)
textplot_network(nycwordcount2_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)
Dataset in case anyone needs it: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
Looks like you are using quanteda, so get rid of the tm part of your code, i.e. the Corpus() line.
You can use dfm_remove to get rid of the stopwords.
nycwords_dfm <- dfm(nycab$name)
# remove stopwords
nycwords_dfm <- dfm_remove(nycwords_dfm, stopwords("english"))
# rest of your code
...
If you need to remove more things first use tokens:
# remove punctuation and stopwords via tokens
nycwords_toks <- tokens(nycab$name, remove_punct = TRUE)
nycwords_toks <- tokens_remove(nycwords_toks, stopwords("english"))
nycwords_dfm <- dfm(nycwords_toks)
# rest of your code
....
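If you then want to double-check that the filler words are gone, the same topfeatures() call from your script works on the cleaned dfm (a small sketch, reusing the nycwords_dfm object from above):
# most frequent words after stopword removal;
# 'in', 'and', 'for' should no longer show up here
topwordcount <- names(topfeatures(nycwords_dfm, 50))
head(topwordcount)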
As part of my efforts to text-mine research papers, I am interested in looking at TF-IDF values.
So far I have had difficulty using tidytext for TF-IDF due to issues with columns/objects not being detected (a recurring issue on this site). Therefore I used tm's weighting and hoped to view all my results by exporting them to csv.
The limited results that I have are in the right format (paper; term; tf-idf value), but only a few of the papers are present. This is despite the fact that the object states that there are 71 documents. (One document is not readable and therefore shows up with an error that can be ignored.)
Any help is appreciated, cheers
setwd('C:\\Users\\[--myname--]\\Desktop\\Text_Mine_TestSet_1')
files <- list.files(pattern = 'pdf$')
summary(files)
corpus_a1 <- Corpus(URISource(files),
                    readerControl = list(reader = readPDF()))
TDM_a1 <- TermDocumentMatrix(corpus_a1, control = list(removePunctuation = TRUE,
                                                       stopwords = TRUE,
                                                       tolower = TRUE,
                                                       stemming = TRUE,
                                                       removeNumbers = TRUE))
DTM_a1 <- DocumentTermMatrix(corpus_a1, control = list(removePunctuation = TRUE,
                                                       stopwords = TRUE,
                                                       tolower = TRUE,
                                                       stemming = TRUE,
                                                       removeNumbers = TRUE))
# --------------------------
tdm_TfIdf <- weightTfIdf(TDM_a1)
tdm_TfIdf # 71 Documents 32,177 terms (can sparse here)
tdm_TfIdf %>%
View() # Odd table
inspect(tdm_TfIdf) # Shows limited output
print(tdm_TfIdf)
library(devtools)
tdm_inspect <- inspect(tdm_TfIdf)
tdm_DF <- as.data.frame(tdm_inspect, stringsAsFactors = FALSE)
tdm_DF
write.table(tdm_DF)
write.csv(tdm_DF, 'C:\\Users\\Hunter S. Baggen\\Desktop\\tdm_TfIdf.csv',
row.names = TRUE)
# ---------------------
# SAME ISSUE SIMPLY X and Y AXIS FLIPPED
dtm_TfIdf <- weightTfIdf(DTM_a1)
dtm_TfIdf # 71 Documents 32,177 terms (can sparse here)
dtm_TfIdf %>%
View() # Odd table
inspect(dtm_TfIdf) # Shows limited output
print(dtm_TfIdf)
dtm_inspect <- inspect(dtm_TfIdf)
dtm_DF <- as.data.frame(dtm_inspect, stringsAsFactors = FALSE)
dtm_DF
write.table(dtm_DF)
write.csv(dtm_DF, 'C:\\Users\\Hunter S. Baggen\\Desktop\\dtm_TfIdf.csv',
row.names = TRUE)
As stated above, four papers and ten terms appear in the resulting csv file. I am unsure why the results would be limited in this manner.
Ultimately I was able to accomplish this goal (though not another, related one I posted about that concerns my work). Most importantly, I used CERMINE (https://github.com/CeON/CERMINE), whose developers I cannot thank enough and will cite in my work. It allowed me to convert my .pdf files into .txt while keeping the document format.
As for exporting the TF-IDF values to Excel, I also had a great deal of help. This help, however, has no original reference point that I can find; I got it from someone who sourced it from someone else, etc. After making a data frame (DF <- function(x,y)), export each one as a sheet within an Excel workbook with this code:
*NB: please take credit if you wrote this script; it has been immensely useful.
xlsx.writeMultipleData <- function (file, ...)
{
  require(xlsx, quietly = TRUE)
  objects <- list(...)
  fargs <- as.list(match.call(expand.dots = TRUE))
  objnames <- as.character(fargs)[-c(1, 2)]
  nobjects <- length(objects)
  for (i in 1:nobjects) {
    if (i == 1)
      write.xlsx(objects[[i]], file, sheetName = objnames[i])
    else write.xlsx(objects[[i]], file, sheetName = objnames[i],
                    append = TRUE)
  }
}
xlsx.writeMultipleData('filename.xlsx',
                       Dataframe_A, Dataframe_B, etc)
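A side note on why only four papers and ten terms made it into the csv originally: as explained in the inspect() discussion earlier, inspect() only shows a small corner of the matrix, so building a data frame from its output drops everything else. A minimal sketch that exports the full weighted matrix instead, assuming the tdm_TfIdf object from the question (be aware that the dense matrix for 71 documents and ~32,000 terms is large):
# convert the full sparse TF-IDF matrix (not the inspect() preview) to a data frame
tdm_full_DF <- as.data.frame(as.matrix(tdm_TfIdf), stringsAsFactors = FALSE)
write.csv(tdm_full_DF, 'tdm_TfIdf_full.csv', row.names = TRUE)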
I would like to be able to import PDF documents into R and classify them as either:
Relevant (contains a specific string, for example, "tacos", within the first 100 words)
Irrelevant (DOES NOT contain "tacos" within the first 100 words)
To be more specific, I would like to address the following questions:
Does a package(s) exist in R to perform this basic classification?
If so, is it possible to generate a dataset in R that would look something like this, given 2 PDF documents where Paper1 contains at least one instance of the string "tacos" in the first 100 words and Paper2 does NOT contain any instance of "tacos":
Any references to documentation/R packages/sample R code or mock examples related to this type of classification using R would be greatly appreciated! Thanks!
You can use the pdftools library and do something like this:
First, load the library and grab some pdf file names:
library(pdftools)
fns <- list.files("~/Documents", pattern = "\\.pdf$", full = TRUE)
fns <- sample(fns, 5) # sample of 5 pdf filenames...
Then define a function that reads a PDF file in as text and looks at the first n words. (It might be useful to check for errors, like an unknown password or things like that; my example function returns NA for such cases.)
isRelevant <- function(fn, needle, n = 100L, ...) {
  res <- try({
    txt <- pdf_text(fn)
    txt <- scan(text = txt, what = "character", quote = "", quiet = TRUE)
    any(grepl(needle, txt[1:n], ...))
  }, silent = TRUE)
  if (inherits(res, "try-error")) NA else res
}
res <- sapply(fns, isRelevant, needle = "mail", ignore.case=TRUE)
Finally, wrap it up and put it into a data frame:
data.frame(
  Document = basename(fns),
  Classification = dplyr::if_else(res, "relevant", "not relevant", "unknown")
)
# Document Classification
# 1 a.pdf relevant
# 2 b.pdf not relevant
# 3 c.pdf relevant
# 4 d.pdf not relevant
# 5 e.pdf relevant
While @lukeA beat me to it, I wrote another small function that uses pdftools as well. The only real difference is that lukeA looks at the first n characters, while my script looks at the first n words.
This is how my approach looks:
library(pdftools)
library(dplyr) # for data_frames and bind_rows
# to find the files better
setwd("~/Desktop/pdftask/")
# list all files in the folder "pdfs"
pdf_files <- list.files("pdfs/", full.names = T)
# write a small function that takes a vector of paths to pdf-files, a search term,
# and a number of words (i.e., look at the first 100 words)
search_pdf <- function(pdf_files, search_term, n_words = 100) {
  # loop over the files
  res_list <- lapply(pdf_files, function(file) {
    # use pdftools::pdf_text to extract the text from the pdf (one element per page)
    content <- pdf_text(file)
    # do some cleanup: collapse the pages into one string, lower all letters,
    # replace new-lines with spaces and strip punctuation
    content2 <- paste(content, collapse = " ")
    content2 <- tolower(content2)
    content2 <- gsub("\\n", " ", content2)
    content2 <- gsub("[[:punct:]]", "", content2)
    # split up the text on whitespace
    content_vec <- strsplit(content2, "\\s+")[[1]]
    # look if the search term is within the first n_words words
    found <- search_term %in% content_vec[1:n_words]
    # create a data_frame that holds our data
    res <- data_frame(file = file,
                      relevance = ifelse(found,
                                         "Relevant",
                                         "Irrelevant"))
    return(res)
  })
  # bind the results into one "tidy" data_frame
  res_df <- bind_rows(res_list)
  return(res_df)
}
search_pdf(pdf_files, search_term = "taco", n_words = 100)
# # A tibble: 3 × 2
# file relevance
# <chr> <chr>
# 1 pdfs//pdf_empty.pdf Irrelevant
# 2 pdfs//pdf_taco1.pdf Relevant
# 3 pdfs//pdf_taco_above100.pdf Irrelevant
I created a script for the frequency of words in test documents (000_1.txt, 000_2.txt, 000_11.txt) in R.
I want the files processed in numerical order of their suffix (1, 2, 11).
The OS is Windows 7. The directory "E:\testR" contains the files.
This is the code:
library("tm")
pathElaboration <- "E:/testR"
setwd(pathElaboration)
dirSource <- DirSource(pathElaboration, encoding = "ISO-8859-2",pattern="*.txt")
vCorpusFiles <- VCorpus(dirSource, readerControl = list(language = "en"))
for (i in seq(from = 1, to = length(vCorpusFiles), by = 1))
{
  dtm <- DocumentTermMatrix(vCorpusFiles[i])
  vectorFrequencyWord <- as.matrix(dtm)
  print(vectorFrequencyWord)
}
But the result is
Terms
Docs file1
000_1.txt 1
Terms
Docs wordinfile11
000_11.txt 1
Terms
Docs wordinfile2
000_2.txt 1
I would like the documents processed in the sequence 000_1.txt, 000_2.txt, 000_11.txt.
How can I fix this?
The documents come in text (lexicographic) sorting order, so this should work:
dtm <- dtm[order(Docs(dtm)), ]
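If what you want is a single matrix whose rows follow the numeric suffix of the file names rather than plain text order, one way is to build one dtm for the whole corpus and then reorder its rows by the number embedded in each document name. A sketch, assuming the vCorpusFiles object from the question; the gsub() call keeps only the digits of each file name:
# one dtm over the whole corpus instead of one per document
dtm <- DocumentTermMatrix(vCorpusFiles)

# extract the numeric part of the document names (000_1.txt -> 1, 000_11.txt -> 11)
numeric_suffix <- as.numeric(gsub("\\D", "", Docs(dtm)))
dtm <- dtm[order(numeric_suffix), ]

print(as.matrix(dtm))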
I'm attempting to create a term document matrix with a text file that is about 3+ million lines of text. I have created a random sample of the text, which results in about 300,000 lines.
Unfortunately when use the following code I end up with 300,000 documents. I just want 1 document with the frequencies for each bigram:
library(RWeka)
library(tm)
corpus <- readLines("myfile")
numberLinesCorpus <- 3000000
corpus_sample <- corpus[sample(1:numberLinesCorpus, numberLinesCorpus * .1, replace = FALSE)]
myCorpus <- Corpus(VectorSource(corpus_sample))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))
The sample contains approximately 300,000 lines. However, the number of documents in tdm is also 300,000.
Any help would be much appreciated.
You'll need to use the paste() function on your corpus_sample vector.
paste(), with a value supplied for collapse, takes a vector with many text elements and converts it to a vector with one text element, in which the original elements are separated by the string you specify.
text <- c('a', 'b', 'c')
text <- paste(text, collapse = " ")
text
# [1] "a b c"
You can also use the quanteda package, as an alternative to tm. That will do what you want in the following steps, after you've created corpus_sample:
require(quanteda)
myDfm <- dfm(corpus_sample, ngrams = 2)
bigramTotals <- colSums(myDfm)
I also suspect it will be faster.
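To see which n-grams dominate, you can then sort the totals (reusing the bigramTotals vector from above):
# 20 most frequent unigrams/bigrams in the sample
head(sort(bigramTotals, decreasing = TRUE), 20)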