Keep document ID with R corpus

I have searched Stack Overflow and the web and can only find partial solutions, or ones that no longer work due to changes in tm or qdap. My problem is below.
I have a data frame with two columns, ID and Text (a simple document id/name and then some text).
I have two issues:
Part 1: How can I create a tdm or dtm and maintain the document name/id? It only shows "character(0)" on inspect(tdm).
Part 2: I want to keep only a specific list of terms, i.e. the opposite of removing custom stopwords. I want this to happen in the corpus, not the tdm/dtm.
For Part 2, I used a solution I found here: How to implement proximity rules in tm dictionary for counting words?
But that one operates on the tdm, not the corpus. Is there a better solution for Part 2 that uses something like "tm_map(my.corpus, keepOnlyWords, customlist)"?
Any help will be greatly appreciated.
Thanks much!

First, here's a sample data.frame
dd <- data.frame(
  id = 10:13,
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)
Now, in order to read special attributes from a data.frame, we will use the readTabular function to make our own custom data.frame reader. This is all we need to do
library(tm)
myReader <- readTabular(mapping=list(content="text", id="id"))
We just specify the columns to use for the content and the id in the data.frame. Now we read it in with DataframeSource but use our custom reader.
tm <- VCorpus(DataframeSource(dd), readerControl=list(reader=myReader))
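A quick sanity check (my addition, and note that readTabular() only exists in older versions of tm): the id from the data.frame should now be stored in each document's metadata.
meta(tm[[1]], "id")   # expected: 10, the id of the first row of dd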
Now if we want to only keep a certain set of words, we can create our own content_transformer function. One way to do this is
keepOnlyWords <- content_transformer(function(x, words) {
  regmatches(x,
             gregexpr(paste0("\\b(", paste(words, collapse = "|"), "\\b)"), x),
             invert = TRUE) <- " "
  x
})
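As a standalone illustration (my addition, not part of the original answer) of the regmatches()/gregexpr() trick with invert = TRUE that this function relies on:
x <- "no wonder, then, that ever gathering volume"
# assign a space to every NON-matching piece; only the listed words survive
regmatches(x, gregexpr("\\b(wonder|then|that|the\\b)", x), invert = TRUE) <- " "
x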
This will replace everything that's not in the word list with a space. Note that you probably want to run stripWhitespace after this. Thus our transformations would look like
keep<-c("wonder","then","that","the")
tm<-tm_map(tm, content_transformer(tolower))
tm<-tm_map(tm, keepOnlyWords, keep)
tm<-tm_map(tm, stripWhitespace)
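If you want to eyeball the effect of these transformations before building the matrix (my addition), look at the content of one transformed document:
content(tm[[1]])   # should now contain only words from the keep vector, separated by single spaces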
And then we can turn that into a document term matrix
dtm<-DocumentTermMatrix(tm)
inspect(dtm)
# <<DocumentTermMatrix (documents: 4, terms: 4)>>
# Non-/sparse entries: 7/9
# Sparsity           : 56%
# Maximal term length: 6
# Weighting          : term frequency (tf)
#
#     Terms
# Docs that the then wonder
#   10    1   1    1      1
#   11    2   0    0      0
#   12    0   1    0      0
#   13    0   3    0      0
and you can see it has our list of words and the proper document IDs from the data.frame.

In newer versions of tm this is a lot easier with the DataframeSource() function.
"A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a "UTF-8" encoded string representing the document's content. Optional additional columns are used as document level metadata."
So in this case:
dd <- data.frame(
  doc_id = 10:13,
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)
Corpus = VCorpus(DataframeSource(dd))
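A quick check (my addition): with DataframeSource() the doc_id values become the document ids, so they carry straight through to a term matrix built from this corpus.
Docs(DocumentTermMatrix(Corpus))   # expected: "10" "11" "12" "13"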

Related

Extract larger body of character data with stringr?

I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio and used str_subset and str_extract_all to acquire the smaller attributes I need. The main goal of this project is to scrape the case history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized across all the individual documents. See below for a reproduced example.
Is there a way I can use those two unique strings ("CASE HISTORY" and "INVESTIGATOR:") to bound the text I would like to extract? If not, what sort of approach can I take to extract the narrative data I need from each report?
text_data <- list("ES SPRINGFEILD POLICE DE FARRELL #789\n NOTIFIED DATE TIME OFFICER\nMARITAL STATUS: UNKNOWN\nIDENTIFIED BY: H. POIROT AT: SCENE DATE: 01/02/1895\nFINGERPRINTS TAKEN BY DATE\n YES NO OBIWAN KENOBI 01/02/1895\n
SPRINGFEILD\n CASE#: 012-345-678\n ABC NOTIFIED: ABC DATE:\n ABC OFFICER: NATURE:\nCASE HISTORY\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\nINVESTIGATOR: HERCULE POIROT \n")
Here is what the expected output would be.
output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.")
Thanks so much for helping!
One quick approach would be to use gsub and regexes to replace everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything from INVESTIGATOR: onwards ('INVESTIGATOR:.*') with nothing. What remains will be the text between those two matches.
gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\n"
After much deliberation I came to a solution I feel is worth sharing, so here we go:
# libraries used below: %>% (magrittr), read_lines (readr), str_squish (stringr), map2 (purrr)
library(magrittr)
library(readr)
library(stringr)
library(purrr)

# unlist text_data
file_contents_unlist <- paste(unlist(text_data), collapse = " ")

# read lines, squish for good measure.
file_contents_lines <- file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()

# Create indices in the lines of our text data based upon regex grepl
# functions; be sure they match if scraping multiple chunks of data.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))
# function basically states, "give me back whatever's in those indices".
pull_case_num <- function(index_case_num_1, index_case_num_2) {
  file_contents_lines[index_case_num_1:index_case_num_2]
}

# map2() to iterate.
case_nums <- map2(index_case_num_1,
                  index_case_num_2,
                  pull_case_num)

# transform to dataframe
case_nums_df <- as.data.frame.character(case_nums)

# Repeat pattern for other vectors as needed.
index_case_hist_1 <- which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <- which(grepl("Case#: ", file_contents_lines))

pull_case_hist <- function(index_case_hist_1, index_case_hist_2) {
  file_contents_lines[index_case_hist_1:index_case_hist_2]
}

case_hist <- map2(index_case_hist_1,
                  index_case_hist_2,
                  pull_case_hist)

case_hist_df <- as.data.frame.character(case_hist)

# cbind() the vectors, also a good place to debug from.
cases_comp <- cbind(case_nums_df, case_hist_df)
Thanks all for responding. I hope this solution helps someone out there in the future. :)

How to use quanteda to find instances of appearance of certain words before certain others in a sentence

As an R newbie, I am trying to use quanteda to find instances where a certain word appears somewhere before another certain word in a sentence. To be more specific, I am looking for instances where the word "investors" is located somewhere before the word "shall" in a sentence, in a corpus consisting of an international treaty concluded between Morocco and Nigeria (the text can be found here: https://edit.wti.org/app.php/document/show/bde2bcf4-e20b-4d05-a3f1-5b9eb86d3b3b).
The problem is that sometimes there are multiple words between these two words. For instance, sometimes it is written as "investors and investments shall". I tried to apply similar solutions offered on this website. When I tried the solution on (Keyword in context (kwic) for skipgrams?) and ran the following code:
kwic(corpus_mar_nga, phrase("investors * shall"))
I get 0 observations since this counts only instances when there is only one word between "investors" and "shall".
And when I follow another solution offered on (Is it possible to use `kwic` function to find words near to each other?) and ran the following code:
toks <- tokens(corpus_mar_nga)
toks_investors <- tokens_select(toks, "investors", window = 10)
kwic(toks_investors, "shall")
I get instances where "investors" also appears after "shall", and this changes the context fundamentally, since in that case the subject of the sentence is something different.
In the end, in addition to instances of "investors shall", I should also be getting, for example, instances that read "Investors, their investment and host state authorities shall", but I can't do that with the above code.
Could anyone offer me a solution on this issue?
Huge thanks in advance!
Good question. Here are two methods: one relying on regular expressions on the corpus text, and the second (as @Kohei_Watanabe suggests in the comment) using window with tokens_select().
First, create some sample text.
library("quanteda")
## Package version: 2.1.2
# sample text
txt <- c("The investors and their supporters shall do something.
Shall we tell the investors? Investors shall invest.
Shall someone else do something?")
Now reshape this into sentences, since your search occurs within a sentence.
# reshape to sentences
corp <- txt %>%
  corpus() %>%
  corpus_reshape(to = "sentences")
Method 1 uses regular expressions. We add a boundary (\\b) before "investors", and the .+ says one or more of any character in between "investors" and "shall". (This would not catch newlines, but corpus_reshape(x, to = "sentences") will remove them.)
# method 1: regular expressions
corp$flag <- stringi::stri_detect_regex(corp, "\\binvestors.+shall",
                                        case_insensitive = TRUE)
print(corpus_subset(corp, flag == TRUE), -1, -1)
## Corpus consisting of 2 documents and 1 docvar.
## text1.1 :
## "The investors and their supporters shall do something."
##
## text1.2 :
## "Investors shall invest."
A second method applies tokens_select() with an asymmetric window, followed by kwic(). First, in every document (which is a sentence) containing "investors", we keep "investors" itself and everything after it, discarding the tokens before it; a window of 1000 tokens after should be enough. Then we apply kwic() on "shall" with a wide context window; any match must now come after "investors", since the tokens before it were discarded.
# method 2: tokens_select()
toks <- tokens(corp)
tokens_select(toks, "investors", window = c(0, 1000)) %>%
kwic("shall", window = 1000)
##
##  [text1.1, 5] investors and their supporters | shall | do something.
##  [text1.3, 2]                      Investors | shall | invest.
The choice depends on what suits your needs best.

Separating an Arabic sentence into words results in a different number of words with different functions

I am trying to separate one Arabic sentence, Verse 38:1 of the Quran, with the tm and tokenizers packages, but they split the sentence differently, into 3 and 4 words respectively. Can someone explain (1) why this is, and (2) what this difference means from NLP and Arabic-language points of view? Also, is one of them wrong? I am by no means an expert in NLP or Arabic, but I am trying to run the code.
Here are the codes I tried:
library(tm)
library(tokenizers)
# Verse 38:1
verse<- "ص والقرآن ذي الذكر"
# This separates into 3 words with the tm package
a <- colnames(DocumentTermMatrix(Corpus(VectorSource(verse) )))
a
# "الذكر" "ذي" "والقرآن"
# This separates into 4 words with the tokenizers package
b <- tokenizers::tokenize_words(verse)
b
# "ص" "والقرآن" "ذي" "الذكر"
I would expect them to be equal but they are different. Can someone explain what is going on here?
It doesn't have anything to do with NLP or the Arabic language; there are simply some defaults you have to watch out for. DocumentTermMatrix has a number of default parameters that can be changed via control. Run ?termFreq to see them all.
One of those defaults is wordLengths:
An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
So, if we run the following we get 3 words because the dropped word has fewer than 3 characters:
dtm <- DocumentTermMatrix(Corpus(VectorSource(verse)))
inspect(dtm)
#### OUTPUT ####
<<DocumentTermMatrix (documents: 1, terms: 3)>>
Non-/sparse entries: 3/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs الذكر ذي والقرآن
   1     1  1       1
To return all words, regardless of length, we need to change c(3, Inf) to c(1, Inf):
dtm <- DocumentTermMatrix(Corpus(VectorSource(verse)),
                          control = list(wordLengths = c(1, Inf)))
inspect(dtm)
#### OUTPUT ####
<<DocumentTermMatrix (documents: 1, terms: 4)>>
Non-/sparse entries: 4/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs الذكر ذي ص والقرآن
   1     1  1  1       1
The default makes sense because the default language is English, where words with fewer than three characters are articles, prepositions, etc., but it might make less sense for other languages. Definitely take the time to play around with the other parameters related to different tokenizers, language settings, etc. The current results look pretty good, but you may have to tweak some settings as your text becomes more complicated.
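For example (a sketch, not from the original answer), other termFreq controls can go in the same control list, such as combining the relaxed word lengths with punctuation removal:
dtm <- DocumentTermMatrix(Corpus(VectorSource(verse)),
                          control = list(wordLengths = c(1, Inf),
                                         removePunctuation = TRUE))
inspect(dtm)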

How to predict next word in sentence using ngram model in R

I have pre-processed text data into a corpus. I would now like to build a prediction model based on the previous 2 words (so, I think, a 3-gram model?). Based on my understanding of the articles I have read, here is how I am thinking of doing it:
step 1: enter two word phrase we wish to predict the next word for
# phrase our word prediction will be based on
phrase <- "I love"
step 2: calculate 3 gram frequencies
library(RWeka)
threegramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
dtm_threegram <- DocumentTermMatrix(corpus, control=list(tokenize=threegramTokenizer))
threegram_freq <- sort(colSums(as.matrix(dtm_threegram)), decreasing = TRUE)
The next step is where I am getting stuck. Conceptually, I think I should subset my 3-grams to only include three-word combinations that start with "I love", and then keep only the highest-frequency 3-gram. For instance, if "I love you" appeared 12 times in my corpus and "I love beer" appeared 15 times, then the probability of "beer" being the next word is higher than that of "you", so the model should return "beer". Is this the correct approach, and if so, how do I create something like this programmatically? My threegram_freq object appears to be a numeric vector with a character names attribute, which I don't fully understand. Is it possible to use a regular expression to only include elements starting with "I love" and then extract the 3rd word of the 3-gram with the highest frequency?
Thank you!
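A minimal sketch of the subsetting step described above (my assumption: threegram_freq is the sorted, named numeric vector built in step 2, with the trigrams as names and their counts as values, and the phrase contains no regex metacharacters):
# keep only trigrams whose names start with the phrase
candidates <- threegram_freq[grepl(paste0("^", tolower(phrase), " "),
                                   tolower(names(threegram_freq)))]
# threegram_freq was sorted by decreasing frequency, so the first candidate wins
if (length(candidates) > 0) {
  best_trigram <- names(candidates)[1]
  next_word <- strsplit(best_trigram, " ")[[1]][3]
  next_word
}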

How to build a Term-Document-Matrix from a set of texts and a specific set of terms (tags)?

I have two sets of data:
a set of tags (single words like php, html, etc)
a set of texts
I now wish to build a Term-Document-Matrix representing the number of occurrences of each tag in each text.
I have looked into the R library tm and its TermDocumentMatrix function, but I do not see a way to specify the tags as input.
Is there a way to do that?
I am open to any tool (R, Python, other), although using R would be great.
Let's set the data as:
TagSet <- data.frame(c("c","java","php","javascript","android"))
colnames(TagSet)[1] <- "tag"
TextSet <- data.frame(c("How to check if a java file is a javascript script java blah","blah blah php"))
colnames(TextSet)[1] <- "text"
Now I'd like to have the TermDocumentMatrix of TextSet according to TagSet.
I tried this:
myCorpus <- Corpus(VectorSource(TextSet$text))
tdm <- TermDocumentMatrix(myCorpus, control = list(removePunctuation = TRUE, stopwords=TRUE))
inspect(tdm)
A term-document matrix (7 terms, 2 documents)

Non-/sparse entries: 8/6
Sparsity           : 43%
Maximal term length: 10
Weighting          : term frequency (tf)

            Docs
Terms        1 2
  blah       1 2
  check      1 0
  file       1 0
  java       2 0
  javascript 1 0
  php        0 1
  script     1 0
but that's checking the text against the words of the text, whereas I want to check presence of already defined tags.
You can subset the tdm to keep only your specified tags and then proceed with your analysis:
tdm.onlytags <- tdm[rownames(tdm) %in% TagSet$tag, ]
Alternatively, you could pre-define a dictionary from the set of tags and pass it when building the matrix:
DocumentTermMatrix(myCorpus, control = list(dictionary = as.character(TagSet$tag)))
