I would like to find phrases in the text column, so I try the collocation option:
library(quanteda)
dataset1 <- data.frame( anumber = c(1,2,3), text = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum", "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source."))
cols <- textstat_collocations(dataset1$text, size = 2:3, min_count = 30)
After that, to compound them, I try this:
inputforDfm <- tokens_compound(cols)
Error in tokens_compound.default(cols) :
tokens_compound() only works on tokens objects.
But it needs tokens? How is it possible to create them and then include the compounds in the dfm:
myDfm <- dataset1 %>%
corpus() %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
dfm()
You need to tokenize the text, since tokens_compound() needs a tokens object as its first argument.
library(quanteda)
## Package version: 2.1.1
Here I changed this to min_count = 2, since otherwise no collocations are returned in this example: none occur 30 times or more in the text!
cols <- textstat_collocations(dataset1$text, size = 2:3, min_count = 2)
After tokenizing and compounding, we can see the compounds among the tokens:
toks <- tokens(dataset1$text) %>%
tokens_compound(cols)
print(toks)
## Tokens consisting of 3 documents.
## text1 :
## [1] "Lorem_Ipsum_is" "simply" "dummy_text" "of_the"
## [5] "printing" "and" "typesetting" "industry"
## [9] "." "Lorem_Ipsum" "has" "been"
## [ ... and 28 more ]
##
## text2 :
## [1] "It_has" "survived" "not" "only" "five" "centuries"
## [7] "," "but" "also" "the" "leap" "into"
## [ ... and 37 more ]
##
## text3 :
## [1] "Contrary" "to" "popular" "belief"
## [5] "," "Lorem_Ipsum_is" "not" "simply"
## [9] "random" "text" "." "It_has"
## [ ... and 63 more ]
Creating a dfm now occurs in the usual way, and we can see the compounds by selecting just those:
dfm(toks) %>%
dfm_select(pattern = "*_*")
## Document-feature matrix of: 3 documents, 5 features (33.3% sparse).
## features
## docs lorem_ipsum_is dummy_text of_the lorem_ipsum it_has
## text1 1 2 1 1 0
## text2 0 0 0 2 1
## text3 1 0 2 1 1
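To answer the original question directly, the compounding step slots into your existing pipeline before dfm() (a sketch, reusing the dataset1 and cols objects from above; note that tokens_compound() is applied to the tokens, not to the collocations object):
myDfm <- dataset1 %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_compound(cols) %>%
  dfm()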
This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain keywords.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task it would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
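If you want a single score per document rather than a dfm of counts, one simple option (a sketch; a net positive-minus-negative score is just one of many possible measures) is to convert the dfm to a data frame:
dfmat_sent <- tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm()
sent <- convert(dfmat_sent, to = "data.frame")
sent$net <- sent$positive - sent$negative
sent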
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
pattern = "stackoverflow",
window = 3
)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kinds of patterns you can use. With tokens() you can specify which data cleaning you want to apply before using kwic(); in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), in this example a window of length 3. After that you can do some form of sentiment analysis on the pre and post results (or paste them together first).
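For example, one way to do that (a sketch, building on the kwic object x from above and on quanteda's data_dictionary_LSD2015) is to paste the pre and post contexts together, re-tokenise them, and apply the dictionary:
kwic_df <- as.data.frame(x)
# paste the contexts before and after each keyword match together
ctx <- paste(kwic_df$pre, kwic_df$post)
names(ctx) <- kwic_df$docname
# note: if a document has several matches, you may want to aggregate by docname first
tokens(ctx) %>%
  tokens_lookup(dictionary = data_dictionary_LSD2015) %>%
  dfm()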
My ultimate goal is to select some sentences from a corpus that match a certain pattern and perform a sentiment analysis on these selected cutouts from the corpus. I am trying to do all of that with a current version of quanteda in R.
I noticed that remove_punct does not remove punctuation when tokens() is applied at the sentence level (what = "sentence"). When decomposing the selected sentence tokens into word tokens for the sentiment analysis, the word tokens still contain punctuation such as "," or ".". Dictionaries are then no longer able to match on these tokens. Reproducible example:
mypattern <- c("country", "honor")
#
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.",
blind <- "Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.")
#
toks <- tokens_select(tokens(txt, what = "sentence", remove_punct = TRUE),
pattern = paste0(mypattern, collapse = "|"),
valuetype = "regex",
selection = "keep")
#
toks
For instance, the tokens in toks contain "citizens," or "arrive,". I thought about splitting the tokens back into word tokens with tokens_split(toks, separator = " "), but separator only accepts a single value.
Is there a way to remove the punctuation from the sentences when tokenizing at the sentence-level?
There are better ways to go about your goal, which is to perform sentiment analysis on just the sentences from documents containing your target pattern. You can do this by first reshaping your corpus into sentences, then tokenising them, and then using tokens_select() with the window argument to select only those documents containing the pattern. In this case you set a window so large that it will always include the entire sentence.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 67.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c("Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.
When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.
Lorem ipsum dolor sit amet.")
corp <- corpus(txt)
corp_sent <- corpus_reshape(corp, to = "sentences")
corp_sent
#> Corpus consisting of 3 documents.
#> text1.1 :
#> "Fellow citizens, I am again called upon by the voice of my c..."
#>
#> text1.2 :
#> "When the occasion proper for it shall arrive, I shall endeav..."
#>
#> text1.3 :
#> "Lorem ipsum dolor sit amet."
# sentiment on just the documents with the pattern
mypattern <- c("country", "honor")
toks <- tokens(corp_sent) %>%
tokens_select(pattern = mypattern, window = 10000000)
toks
#> Tokens consisting of 3 documents.
#> text1.1 :
#> [1] "Fellow" "citizens" "," "I" "am" "again"
#> [7] "called" "upon" "by" "the" "voice" "of"
#> [ ... and 11 more ]
#>
#> text1.2 :
#> [1] "When" "the" "occasion" "proper" "for" "it"
#> [7] "shall" "arrive" "," "I" "shall" "endeavor"
#> [ ... and 12 more ]
#>
#> text1.3 :
#> character(0)
# now perform sentiment analysis on the selected tokens
tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
dfm()
#> Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 0 docvars.
#> features
#> docs negative positive neg_positive neg_negative
#> text1.1 0 0 0 0
#> text1.2 0 5 0 0
#> text1.3 0 0 0 0
Created on 2022-03-22 by the reprex package (v2.0.1)
Note that if you want to exclude the sentences that were empty after the selection, just use dfm_subset(dfmat, ntoken(toks) > 0), where dfmat is your saved sentiment dfm and toks is the selected tokens object from above.
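For example (a sketch, saving the sentiment dfm first):
dfmat <- tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
  dfm()
# keep only the sentences that still contained tokens after tokens_select()
dfm_subset(dfmat, ntoken(toks) > 0)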
I am using tokens_lookup to see whether some texts contain the words in my dictionary. Now I am trying to find a way to discard the matches that occur when the dictionary word is part of a longer fixed sequence of words. To give an example, suppose that Ireland is in the dictionary. I would like to exclude the cases where, for instance, Northern Ireland is mentioned (or, if Britain were in the dictionary, any fixed sequence of words containing it, such as Great Britain). The only indirect solution I have figured out is to build another dictionary with these sequences of words (e.g. Great Britain). However, this solution would not work when both Britain and Great Britain are cited. Thank you.
library("quanteda")
dict <- dictionary(list(IE = "Ireland"))
txt <- c(
doc1 = "Ireland lorem ipsum",
doc2 = "Lorem ipsum Northern Ireland",
doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)
tokens_lookup(toks, dictionary = dict)
You can do this by specifying another dictionary key for "Northern Ireland", with the value also "Northern Ireland". If you use the argument nested_scope = "dictionary" in tokens_lookup(), then this will match the longer phrase first and only once, separating "Ireland" from "Northern Ireland". By using the same key as the value, you replace it like for like (with the side benefit of now having the two tokens "Northern" and "Ireland" combined into a single token).
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"))
txt <- c(
doc1 = "Ireland lorem ipsum",
doc2 = "Lorem ipsum Northern Ireland",
doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)
tokens_lookup(toks,
dictionary = dict, exclusive = FALSE,
nested_scope = "dictionary", capkeys = FALSE
)
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE" "lorem" "ipsum"
##
## doc2 :
## [1] "Lorem" "ipsum" "Northern Ireland"
##
## doc3 :
## [1] "IE" "lorem" "ipsum" "Northern Ireland"
Here I used exclusive = FALSE for illustration purposes, so you could see what got looked up and replaced. You can remove that and the capkeys argument when you run it.
If you want to discard the "Northern Ireland" tokens, just use
tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary") %>%
tokens_remove("Northern Ireland")
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"
##
## doc2 :
## character(0)
##
## doc3 :
## [1] "IE"
In a dfm, how is it possible to detect non-English words and remove them?
dftest <- data.frame(id = 1:3,
text = c("Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"))
For example, the construction of the dfm is this:
testDfm <- dftest$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_wordstem() %>%
  dfm()
I found the textcat package as an alternative solution, but there are many cases in a real dataset where a whole row that is in English is recognized as another language just because of a single character. Is there any alternative way to find non-English rows in a data frame, or non-English tokens in the dfm, using quanteda?
You can do this using a word list of all English words. One place where such a list exists is in the hunspell package, which is meant for spell checking.
library(quanteda)
# find the path in which the right dictionary file is stored
hunspell::dictionary(lang = "en_US")
#> <hunspell dictionary>
#> affix: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.aff
#> dictionary: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic
#> encoding: UTF-8
#> wordchars: ’
#> added: 0 custom words
# read this into a vector
english_words <- readLines("/home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic") %>%
# the vector contains extra information on the words, which is removed
gsub("/.+", "", .)
# let's display a sample of the words
set.seed(1)
sample(english_words, 50)
#> [1] "furnace" "steno" "Hadoop" "alumna"
#> [5] "gonorrheal" "multichannel" "biochemical" "Riverside"
#> [9] "granddad" "glum" "exasperation" "restorative"
#> [13] "appropriate" "submarginal" "Nipponese" "hotting"
#> [17] "solicitation" "pillbox" "mealtime" "thunderbolt"
#> [21] "chaise" "Milan" "occidental" "hoeing"
#> [25] "debit" "enlightenment" "coachload" "entreating"
#> [29] "grownup" "unappreciative" "egret" "barre"
#> [33] "Queen" "Tammany" "Goodyear" "horseflesh"
#> [37] "roar" "fictionalization" "births" "mediator"
#> [41] "resitting" "waiter" "instructive" "Baez"
#> [45] "Muenster" "sleepless" "motorbike" "airsick"
#> [49] "leaf" "belie"
Armed with this vector, which should in theory contain all English words and only English words, we can remove non-English tokens:
testDfm <- dftest$text %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_keep(english_words, valuetype = "fixed") %>%
tokens_wordstem() %>%
dfm()
testDfm
#> Document-feature matrix of: 3 documents, 9 features (66.7% sparse).
#> features
#> docs this a spanish word english onlin here comment va
#> text1 1 1 1 1 0 0 0 0 0
#> text2 0 0 0 0 1 1 1 0 0
#> text3 0 0 0 0 0 0 0 1 1
As you can see, this works pretty well but isn't perfect. The "va" from "ça va" has been picked up as an English word, as has "comment". What you want to do is thus a matter of finding the right word list and/or cleaning it. You can also think about removing texts in which too many words have been removed.
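One rough way to implement that last idea (a sketch; the 0.5 cutoff is arbitrary) is to compare the number of tokens before and after filtering and keep only documents that retain, say, at least half of their tokens:
toks_all <- tokens(dftest$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks_en <- tokens_keep(toks_all, english_words, valuetype = "fixed")
# share of tokens recognised as English, per document
kept <- ntoken(toks_en) / ntoken(toks_all)
tokens_subset(toks_en, kept >= 0.5)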
The question is not entirely clear as to whether you want to remove non-English "rows" first, or remove non-English words later. There are a lot of cognates between European languages (words that are homographs appearing in more than one language) so the tokens_keep() strategy will be imperfect.
You could remove the non-English documents after detecting the language, using the cld3 library:
dftest <- data.frame(
id = 1:3,
text = c(
"Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"
)
)
library("cld3")
subset(dftest, detect_language(dftest$text) == "en")
## id text
## 1 1 Holla this is a spanish word
## 2 2 English online here
And then input that into quanteda::dfm().
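A sketch of the full pipeline, reusing the question's dfm construction on the language-filtered rows:
library("quanteda")
dftest_en <- subset(dftest, detect_language(dftest$text) == "en")
testDfm <- dftest_en$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_wordstem() %>%
  dfm()
testDfm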
Does anyone know of anything they can recommend in order to extract just the plain text from an article in .docx format (preferably with R)?
Speed isn't crucial, and we could even use a website with an API to upload and extract the files, but I've been unable to find one. I need to extract the introduction, the methods, the results, and the conclusion; I want to delete the abstract, the references, and especially the graphics and tables.
thanks
You can try the readtext package:
library(readtext)
x <- readtext("/path/to/file/myfile.docx")
# x$text will contain the plain text in the file
The variable x contains just the text without any formatting, so if you need to extract specific information you will need to perform string searches. For example, for the document you mentioned in your comment, one approach could be as follows:
library(readtext)
doc.text <- readtext("test.docx")$text
# Split text into parts using new line character:
doc.parts <- strsplit(doc.text, "\n")[[1]]
# First line in the document- the name of the Journal
journal.name <- doc.parts[1]
journal.name
# [1] "International Journal of Science and Research (IJSR)"
# Similarly we can extract some other parts from a header
issn <- doc.parts[2]
issue <- doc.parts[3]
# Search for the Abstract:
abstract.loc <- grep("Abstract:", doc.parts)[1]
# Search for the Keyword
Keywords.loc <- grep("Keywords:", doc.parts)[1]
# The text in between these 2 keywords will be abstract text:
abstract.text <- paste(doc.parts[abstract.loc:(Keywords.loc-1)], collapse=" ")
# Same way we can get Keywords text:
Background.loc <- Keywords.loc + grep("1\\.", doc.parts[-(1:Keywords.loc)])[1]
Keywords.text <- paste(doc.parts[Keywords.loc:(Background.loc-1)], collapse=" ")
Keywords.text
# [1] "Keywords: Nephronophtisis, NPHP1 deletion, NPHP4 mutations, Tunisian patients"
# Assuming that Methods is part 2
Methods.loc <- Background.loc + grep("2\\.", doc.parts[-(1:Background.loc)])[1]
Background.text <- paste(doc.parts[Background.loc:(Methods.loc-1)], collapse=" ")
# Assuming that Results is Part 3
Results.loc <- Methods.loc + grep("3\\.", doc.parts[-(1:Methods.loc)])[1]
Methods.text <- paste(doc.parts[Methods.loc:(Results.loc-1)], collapse=" ")
# Similarly with other parts. For example for Acknowledgements section:
Ack.loc <- grep("Acknowledgements", doc.parts)[1]
Ref.loc <- grep("References", doc.parts)[1]
Ack.text <- paste(doc.parts[Ack.loc:(Ref.loc-1)], collapse=" ")
Ack.text
# [1] "6. Acknowledgements We are especially grateful to the study participants.
# This study was supported by a grant from the Tunisian Ministry of Health and
# Ministry of Higher Education ...
The exact approach depends on the common structure of all the documents you need to search through. For example if the first section is always named "Background" you can use this word for your search. However if this could sometimes be "Background" and sometimes "Introduction" then you might want to search for "1." pattern.
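If the heading wording varies across documents, you can make the search a little more robust by matching several candidate patterns at once (a sketch; the heading names here are assumptions):
# match either a numbered heading or a named one, case-insensitively
intro.loc <- grep("^(1\\.|Background|Introduction)", doc.parts, ignore.case = TRUE)[1]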
You should find that one of these packages will do the trick for you.
https://davidgohel.github.io/officer/
https://cran.r-project.org/web/packages/docxtractr/index.html
At the end of the day, the modern Office file formats (OpenXML) are simply *.zip files containing structured XML content, so if you have well-structured content you may just want to open them that way. I would start here (http://officeopenxml.com/anatomyofOOXML.php), and you should be able to unpick the OpenXML SDK for guidance as well (https://msdn.microsoft.com/en-us/library/office/bb448854.aspx).
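For instance, a minimal sketch of the "open the zip yourself" route, assuming the xml2 package (not mentioned above) and a file called myfile.docx:
library(xml2)
tmp <- tempdir()
# a .docx is a zip archive; the main body text lives in word/document.xml
unzip("myfile.docx", files = "word/document.xml", exdir = tmp)
doc_xml <- read_xml(file.path(tmp, "word/document.xml"))
# paragraphs are w:p nodes; xml_text() flattens their runs into plain text
paras <- xml_text(xml_find_all(doc_xml, "//w:p"))
head(paras)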
Pandoc is a fantastic solution for tasks like this. With a document named a.docx you would run at the command line
pandoc -f docx -t markdown -o a.md a.docx
You could then use regex tools in R to extract what you needed from the newly-created a.md, which is text. By default, images are not converted.
Pandoc is part of RStudio, by the way, so you may already have it.
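A sketch of driving this from R, assuming pandoc is on your PATH and that the converted markdown contains headings such as "Introduction" and "Method" (both assumptions):
# run pandoc and read the converted markdown back into R
system2("pandoc", c("-f", "docx", "-t", "markdown", "-o", "a.md", "a.docx"))
md <- readLines("a.md")
# pull out everything between two headings
intro_start <- grep("^#+\\s*Introduction", md)[1]
intro_end <- grep("^#+\\s*Method", md)[1]
intro <- paste(md[(intro_start + 1):(intro_end - 1)], collapse = "\n")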
You can do it with package officer:
library(officer)
example_docx <- system.file(package = "officer", "doc_examples/example.docx")
doc <- read_docx(example_docx)
summary_paragraphs <- docx_summary(doc)
summary_paragraphs[summary_paragraphs$content_type %in% "paragraph", "text"]
#> [1] "Title 1"
#> [2] "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
#> [3] "Title 2"
#> [4] "Quisque tristique "
#> [5] "Augue nisi, et convallis "
#> [6] "Sapien mollis nec. "
#> [7] "Sub title 1"
#> [8] "Quisque tristique "
#> [9] "Augue nisi, et convallis "
#> [10] "Sapien mollis nec. "
#> [11] ""
#> [12] "Phasellus nec nunc vitae nulla interdum volutpat eu ac massa. "
#> [13] "Sub title 2"
#> [14] "Morbi rhoncus sapien sit amet leo eleifend, vel fermentum nisi mattis. "
#> [15] ""
#> [16] ""
#> [17] ""
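As a follow-up, docx_summary() also returns a style_name column, which you can use to find the section headings and slice the paragraphs into sections (a sketch; style names vary from document to document):
# paragraphs whose style looks like a heading
heads <- subset(summary_paragraphs, grepl("heading", style_name, ignore.case = TRUE))
heads[, c("doc_index", "style_name", "text")]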