My ultimate goal is to select sentences from a corpus that match a certain pattern and perform sentiment analysis on these selected cutouts. I am trying to do all of this with a current version of quanteda in R.
I noticed that remove_punct = TRUE does not remove punctuation when tokens() is applied at the sentence level (what = "sentence"). When the selected sentence tokens are decomposed into word tokens for the sentiment analysis, the word tokens still contain punctuation such as "," or ".", so dictionaries are no longer able to match on these tokens. Reproducible example:
mypattern <- c("country", "honor")
#
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.",
blind <- "Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.")
#
toks <- tokens_select(tokens(txt, what = "sentence", remove_punct = TRUE),
pattern = paste0(mypattern, collapse = "|"),
valuetype = "regex",
selection = "keep")
#
toks
For instance, the tokens in toks contain "citizens," or "arrive,". I thought about splitting the tokens back into word tokens with tokens_split(toks, separator = " "), but separator accepts only a single value.
Is there a way to remove the punctuation from the sentences when tokenizing at the sentence-level?
There is a better way to go about your goal, which is to perform sentiment analysis on just the sentences containing your target pattern. You can do this by first reshaping your corpus into sentences, then tokenising them, and then using tokens_select() with the window argument to keep only those (sentence) documents containing the pattern. Here you set a window so large that it will always include the entire sentence.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 67.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c("Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.
When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.
Lorem ipsum dolor sit amet.")
corp <- corpus(txt)
corp_sent <- corpus_reshape(corp, to = "sentences")
corp_sent
#> Corpus consisting of 3 documents.
#> text1.1 :
#> "Fellow citizens, I am again called upon by the voice of my c..."
#>
#> text1.2 :
#> "When the occasion proper for it shall arrive, I shall endeav..."
#>
#> text1.3 :
#> "Lorem ipsum dolor sit amet."
# sentiment on just the documents with the pattern
mypattern <- c("country", "honor")
toks <- tokens(corp_sent) %>%
tokens_select(pattern = mypattern, window = 10000000)
toks
#> Tokens consisting of 3 documents.
#> text1.1 :
#> [1] "Fellow" "citizens" "," "I" "am" "again"
#> [7] "called" "upon" "by" "the" "voice" "of"
#> [ ... and 11 more ]
#>
#> text1.2 :
#> [1] "When" "the" "occasion" "proper" "for" "it"
#> [7] "shall" "arrive" "," "I" "shall" "endeavor"
#> [ ... and 12 more ]
#>
#> text1.3 :
#> character(0)
# now perform sentiment analysis on the selected tokens
tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
dfm()
#> Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 0 docvars.
#> features
#> docs negative positive neg_positive neg_negative
#> text1.1 0 0 0 0
#> text1.2 0 5 0 0
#> text1.3 0 0 0 0
Created on 2022-03-22 by the reprex package (v2.0.1)
Note that if you want to exclude the sentences that were empty, just use dfm_subset(dfmat, ntoken(dfmat) > 0), where dfmat is the dfm you saved from the sentiment analysis.
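For example, a minimal sketch (reusing toks and the lookup from above; dfmat is just the name used here):
dfmat <- tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
  dfm()
dfm_subset(dfmat, ntoken(dfmat) > 0)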
This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain keywords.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task, it would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
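If you then want a single score per document, one rough sketch (the choice of score is up to you; sent and net are just names introduced here) converts the counts to a data frame and takes positive minus negative:
dfmat <- tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm()
sent <- convert(dfmat, to = "data.frame")
sent$net <- sent$positive - sent$negative
sent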
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
pattern = "stackoverflow",
window = 3
)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kind of patterns you can use. With tokens() you can specify which data cleaning you want to apply before using kwic(); in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), in this example a window of length 3. After that you can do some form of sentiment analysis on the pre and post results (or paste them together first).
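A rough sketch of that last suggestion, assuming x is the kwic object from above and borrowing the LSD2015 dictionary used in the other answer (the object names here are just examples):
kwic_df <- as.data.frame(x)
context_txt <- paste(kwic_df$pre, kwic_df$post)
tokens(context_txt) %>%
  tokens_lookup(dictionary = data_dictionary_LSD2015) %>%
  dfm()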
In a dfm, how is it possible to detect non-English words and remove them?
dftest <- data.frame(id = 1:3,
text = c("Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"))
For example, the construction of the dfm is this:
testDfm <- dftest$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_wordstem() %>%
  dfm()
I found the textcat package as an alternative solution, but there are many cases in a real dataset where a whole row that is entirely in English gets recognized as another language because of a single character. Is there any alternative to find non-English rows in a dataframe, or non-English tokens in the dfm, using quanteda?
You can do this using a word list of all English words. One place where such a list exists is the hunspell package, which is meant for spell checking.
library(quanteda)
# find the path in which the right dictionary file is stored
hunspell::dictionary(lang = "en_US")
#> <hunspell dictionary>
#> affix: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.aff
#> dictionary: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic
#> encoding: UTF-8
#> wordchars: ’
#> added: 0 custom words
# read this into a vector
english_words <- readLines("/home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic") %>%
# the vector contains extra information on the words, which is removed
gsub("/.+", "", .)
# let's display a sample of the words
set.seed(1)
sample(english_words, 50)
#> [1] "furnace" "steno" "Hadoop" "alumna"
#> [5] "gonorrheal" "multichannel" "biochemical" "Riverside"
#> [9] "granddad" "glum" "exasperation" "restorative"
#> [13] "appropriate" "submarginal" "Nipponese" "hotting"
#> [17] "solicitation" "pillbox" "mealtime" "thunderbolt"
#> [21] "chaise" "Milan" "occidental" "hoeing"
#> [25] "debit" "enlightenment" "coachload" "entreating"
#> [29] "grownup" "unappreciative" "egret" "barre"
#> [33] "Queen" "Tammany" "Goodyear" "horseflesh"
#> [37] "roar" "fictionalization" "births" "mediator"
#> [41] "resitting" "waiter" "instructive" "Baez"
#> [45] "Muenster" "sleepless" "motorbike" "airsick"
#> [49] "leaf" "belie"
Armed with this vector, which should in theory contain all English words and only English words, we can remove non-English tokens:
testDfm <- dftest$text %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_keep(english_words, valuetype = "fixed") %>%
tokens_wordstem() %>%
dfm()
testDfm
#> Document-feature matrix of: 3 documents, 9 features (66.7% sparse).
#> features
#> docs this a spanish word english onlin here comment va
#> text1 1 1 1 1 0 0 0 0 0
#> text2 0 0 0 0 1 1 1 0 0
#> text3 0 0 0 0 0 0 0 1 1
As you can see, this works pretty well but isn't perfect. The "va" from "ça va" has been picked up as an English word, as has "comment". What you want to do is thus a matter of finding the right word list and/or cleaning it. You can also think about removing texts in which too many words have been removed.
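A rough sketch of that last idea (not from the answer above; the 0.5 threshold is arbitrary) is to compare token counts before and after the English-word filter and keep only rows where enough tokens survive:
toks_all <- tokens(dftest$text, remove_punct = TRUE)
toks_en <- tokens_keep(toks_all, english_words, valuetype = "fixed")
share_kept <- ntoken(toks_en) / ntoken(toks_all)
dftest[share_kept > 0.5, ]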
The question is not entirely clear as to whether you want to remove non-English "rows" first, or remove non-English words later. There are a lot of cognates between European languages (words that are homographs appearing in more than one language) so the tokens_keep() strategy will be imperfect.
You could remove the non-English documents after detecting the language, using the cld3 library:
dftest <- data.frame(
id = 1:3,
text = c(
"Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"
)
)
library("cld3")
subset(dftest, detect_language(dftest$text) == "en")
## id text
## 1 1 Holla this is a spanish word
## 2 2 English online here
And then input that into quanteda::dfm().
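Putting it together, a minimal sketch of the full pipeline (keeping only the documents cld3 detects as English; the object names are just examples) might look like this:
library("quanteda")
dftest_en <- subset(dftest, cld3::detect_language(dftest$text) == "en")
dfmat <- corpus(dftest_en, text_field = "text") %>%
  tokens(remove_punct = TRUE) %>%
  dfm()
dfmat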
What's the Quanteda way of cleaning a corpus like shown in the example below using tm (lowercase, remove punct., remove numbers, stem words)? To be clear, I don't want to create a document-feature matrix with dfm(), I just want a clean corpus that I can use for a specific downstream task.
# This is what I want to do in quanteda
library("tm")
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
PS: I am aware that I could just do quanteda_corpus <- quanteda::corpus(crude) to get what I want, but I would much prefer being able to do everything in quanteda.
I think what you want to do is deliberately impossible in quanteda.
You can, of course, do the cleaning quite easily without losing the order of words using the tokens* set of functions:
library("tm")
data("crude")
library("quanteda")
toks <- corpus(crude) %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
tokens_wordstem()
print(toks, max_ndoc = 3)
#> Tokens consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> [1] "Diamond" "Shamrock" "Corp" "said" "that" "effect"
#> [7] "today" "it" "had" "cut" "it" "contract"
#> [ ... and 78 more ]
#>
#> reut-00002.xml :
#> [1] "OPEC" "may" "be" "forc" "to" "meet" "befor"
#> [8] "a" "schedul" "June" "session" "to"
#> [ ... and 427 more ]
#>
#> reut-00004.xml :
#> [1] "Texaco" "Canada" "said" "it" "lower" "the"
#> [7] "contract" "price" "it" "will" "pay" "for"
#> [ ... and 40 more ]
#>
#> [ reached max_ndoc ... 17 more documents ]
But it is not possible to turn this tokens object back into a corpus. It would, however, be possible to write a new function to do this:
corpus.tokens <- function(x, ...) {
  quanteda:::build_corpus(
    unlist(lapply(x, paste, collapse = " ")),
    docvars = cbind(quanteda:::make_docvars(length(x), docnames(x)), docvars(x))
  )
}
corp <- corpus(toks)
print(corp, max_ndoc = 3)
#> Corpus consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> "Diamond Shamrock Corp said that effect today it had cut it c..."
#>
#> reut-00002.xml :
#> "OPEC may be forc to meet befor a schedul June session to rea..."
#>
#> reut-00004.xml :
#> "Texaco Canada said it lower the contract price it will pay f..."
#>
#> [ reached max_ndoc ... 17 more documents ]
But this object, while technically being a corpus class object, is not what a corpus is supposed to be. From ?corpus [emphasis added]:
Value
A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.
The object above does not meet this description, as the original texts have already been processed, yet the class of the object communicates otherwise. I don't see a reason to break this logic, as all subsequent analysis steps should be possible using either tokens* or dfm* functions.
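For example, the one tm step still missing from the pipeline above (lowercasing) and any downstream matrix can be produced straight from the tokens object, with no corpus in between; a minimal sketch:
dfmat <- toks %>%
  tokens_tolower() %>%
  dfm()
dfmat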
I would like to find phrases using the text column, so I tried the collocations option:
library(quanteda)
dataset1 <- data.frame( anumber = c(1,2,3), text = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum", "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source."))
cols <- textstat_collocations(dataset1$text, size = 2:3, min_count = 30)
After that, to compound them, I tried this:
inputforDfm <- tokens_compound(cols)
Error in tokens_compound.default(cols) :
tokens_compound() only works on tokens objects.
But it needs tokens? How is it possible to do this and insert the result into the dfm:
myDfm <- dataset1 %>%
corpus() %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
dfm()
You need to tokenize the text, since tokens_compound() needs a tokens object as its first argument.
library(quanteda)
## Package version: 2.1.1
Here I changed this to min_count = 2, since otherwise no collocations are returned in this example: none occur 30 times or more in the text!
cols <- textstat_collocations(dataset1$text, size = 2:3, min_count = 2)
After tokenizing and compounding, we can see the compounds among the tokens:
toks <- tokens(dataset1$text) %>%
tokens_compound(cols)
print(toks)
## Tokens consisting of 3 documents.
## text1 :
## [1] "Lorem_Ipsum_is" "simply" "dummy_text" "of_the"
## [5] "printing" "and" "typesetting" "industry"
## [9] "." "Lorem_Ipsum" "has" "been"
## [ ... and 28 more ]
##
## text2 :
## [1] "It_has" "survived" "not" "only" "five" "centuries"
## [7] "," "but" "also" "the" "leap" "into"
## [ ... and 37 more ]
##
## text3 :
## [1] "Contrary" "to" "popular" "belief"
## [5] "," "Lorem_Ipsum_is" "not" "simply"
## [9] "random" "text" "." "It_has"
## [ ... and 63 more ]
Creating a dfm now occurs in the usual way, and we can see the compounds by selecting just those:
dfm(toks) %>%
dfm_select(pattern = "*_*")
## Document-feature matrix of: 3 documents, 5 features (33.3% sparse).
## features
## docs lorem_ipsum_is dummy_text of_the lorem_ipsum it_has
## text1 1 2 1 1 0
## text2 0 0 0 2 1
## text3 1 0 2 1 1
Does anyone know of anything they can recommend to extract just the plain text from an article in .docx format (preferably with R)?
Speed isn't crucial, and we could even use a website that has some API to upload and extract the files, but I've been unsuccessful in locating one. I need to extract the introduction, the methods, the results, and the conclusion; I want to delete the abstract, the references, and especially the graphics and the tables.
thanks
You can try the readtext package:
library(readtext)
x <- readtext("/path/to/file/myfile.docx")
# x$text will contain the plain text in the file
The variable x contains just the text without any formatting, so if you need to extract specific parts you will need to perform string searches. For example, for the document you mentioned in your comment, one approach could be as follows:
library(readtext)
doc.text <- readtext("test.docx")$text
# Split text into parts using new line character:
doc.parts <- strsplit(doc.text, "\n")[[1]]
# First line in the document- the name of the Journal
journal.name <- doc.parts[1]
journal.name
# [1] "International Journal of Science and Research (IJSR)"
# Similarly we can extract some other parts from a header
issn <- doc.parts[2]
issue <- doc.parts[3]
# Search for the Abstract:
abstract.loc <- grep("Abstract:", doc.parts)[1]
# Search for the Keyword
Keywords.loc <- grep("Keywords:", doc.parts)[1]
# The text in between these 2 keywords will be abstract text:
abstract.text <- paste(doc.parts[abstract.loc:(Keywords.loc-1)], collapse=" ")
# Same way we can get Keywords text:
Background.loc <- Keywords.loc + grep("1\\.", doc.parts[-(1:Keywords.loc)])[1]
Keywords.text <- paste(doc.parts[Keywords.loc:(Background.loc-1)], collapse=" ")
Keywords.text
# [1] "Keywords: Nephronophtisis, NPHP1 deletion, NPHP4 mutations, Tunisian patients"
# Assuming that Methods is part 2
Methods.loc <- Background.loc + grep("2\\.", doc.parts[-(1:Background.loc)])[1]
Background.text <- paste(doc.parts[Background.loc:(Methods.loc-1)], collapse=" ")
# Assuming that Results is Part 3
Results.loc <- Methods.loc + grep("3\\.", doc.parts[-(1:Methods.loc)])[1]
Methods.text <- paste(doc.parts[Methods.loc:(Results.loc-1)], collapse=" ")
# Similarly with other parts. For example for Acknowledgements section:
Ack.loc <- grep("Acknowledgements", doc.parts)[1]
Ref.loc <- grep("References", doc.parts)[1]
Ack.text <- paste(doc.parts[Ack.loc:(Ref.loc-1)], collapse=" ")
Ack.text
# [1] "6. Acknowledgements We are especially grateful to the study participants.
# This study was supported by a grant from the Tunisian Ministry of Health and
# Ministry of Higher Education ...
The exact approach depends on the common structure of all the documents you need to search through. For example, if the first section is always named "Background", you can use this word for your search. However, if this could sometimes be "Background" and sometimes "Introduction", then you might want to search for the "1." pattern instead.
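A small sketch of such a more tolerant search (find.section is a hypothetical helper, and the heading patterns are just examples of what your documents might use):
# find the first line matching any of several possible headings
find.section <- function(parts, patterns) {
  grep(paste(patterns, collapse = "|"), parts)[1]
}
section1.loc <- find.section(doc.parts, c("^Background", "^Introduction", "^1\\."))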
You should find that one of these packages will do the trick for you.
https://davidgohel.github.io/officer/
https://cran.r-project.org/web/packages/docxtractr/index.html
At the end of the day, the modern Office file formats (OpenXML) are simply *.zip files containing structured XML content, so if you have well-structured content you may just want to open them that way. I would start here (http://officeopenxml.com/anatomyofOOXML.php), and you should be able to unpick the OpenXML SDK for guidance as well (https://msdn.microsoft.com/en-us/library/office/bb448854.aspx)
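If you do go the raw-zip route, a minimal sketch in R (the file name is just an example; this assumes the xml2 package) is to pull word/document.xml out of the archive and read the paragraph nodes:
library(xml2)
tmp <- tempdir()
unzip("myfile.docx", files = "word/document.xml", exdir = tmp)
doc_xml <- read_xml(file.path(tmp, "word/document.xml"))
# each <w:p> node is a paragraph; xml_text() collapses its text runs
paras <- xml_text(xml_find_all(doc_xml, "//w:p"))
head(paras)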
Pandoc is a fantastic solution for tasks like this. With a document named a.docx you would run at the command line
pandoc -f docx -t markdown -o a.md a.docx
You could then use regex tools in R to extract what you needed from the newly-created a.md, which is text. By default, images are not converted.
Pandoc is part of RStudio, by the way, so you may already have it.
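If you prefer to stay inside R, a minimal sketch (assuming pandoc is on your PATH; the file names are just examples) is to call it with system2() and then grep the resulting markdown:
system2("pandoc", c("-f", "docx", "-t", "markdown", "-o", "a.md", "a.docx"))
md <- readLines("a.md")
# markdown headings start with "#", so section boundaries are easy to locate
grep("^#+ ", md, value = TRUE)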
You can do it with the officer package:
library(officer)
example_docx <- system.file(package = "officer", "doc_examples/example.docx")
doc <- read_docx(example_docx)
summary_paragraphs <- docx_summary(doc)
summary_paragraphs[summary_paragraphs$content_type %in% "paragraph", "text"]
#> [1] "Title 1"
#> [2] "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
#> [3] "Title 2"
#> [4] "Quisque tristique "
#> [5] "Augue nisi, et convallis "
#> [6] "Sapien mollis nec. "
#> [7] "Sub title 1"
#> [8] "Quisque tristique "
#> [9] "Augue nisi, et convallis "
#> [10] "Sapien mollis nec. "
#> [11] ""
#> [12] "Phasellus nec nunc vitae nulla interdum volutpat eu ac massa. "
#> [13] "Sub title 2"
#> [14] "Morbi rhoncus sapien sit amet leo eleifend, vel fermentum nisi mattis. "
#> [15] ""
#> [16] ""
#> [17] ""
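docx_summary() also returns a style_name column, so a rough sketch for slicing out just the sections you want (assuming your document uses Word heading styles) is to locate the headings first:
para <- summary_paragraphs[summary_paragraphs$content_type %in% "paragraph", ]
# rows whose style is a heading mark the section boundaries
heading_rows <- grepl("heading", para$style_name, ignore.case = TRUE)
para[heading_rows, c("style_name", "text")]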