I'm counting words in a given text using the R libraries tm and qdap. When my vector (words) contains only a few words, everything looks fine:
library(tm)
library(qdap)
text <- "activat affect affected affecting affects aggravat allow attribut based basis
bc because bosses caus change changed changes changing compel compliance"
text <- Corpus(VectorSource(text))
words <- c("activat", "affect", "affected")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)
# Results:
# docs word.count activat affect affected
# 1 doc 1 20 1(5.00%) 4(20.00%) 1(5.00%)
But when my vector (words) has many words, the results get garbled and unreadable:
words <- c("activat", "affect", "affected", "affecting", "affects", "aggravat", "allow",
"attribut", "based", "basis", "bc", "because", "bosses", "caus", "change",
"changed", "changes", "changing", "compel", "compliance")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)
# Results:
# docs word.count activat affect affected affecting affects aggravat allow
# attribut based basis bc because bosses caus change changed
# changes changing compel compliance
# 1 doc 1 20 1(5.00%) 4(20.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%)
# 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%) 2(10.00%) 3(15.00%) 1(5.00%)
# 1(5.00%) 1(5.00%) 1(5.00%) 1(5.00%)
How can I have the results displayed in a data frame/matrix so I can read them more easily?
I tried using termco2mat (from the qdap library), which supposedly "returns a matrix of term counts" (https://trinker.github.io/qdap/termco.html), but I'm getting an error:
apply_as_df(text, termco2mat, match.list=words)
# Results:
# Error in qdapfun(text.var = text, ...) :
# unused arguments (text.var = text, match.list = c("activat", "affect", "affected",
# "affecting", "affects", "aggravat", "allow", "attribut", "based", "basis", "bc",
# "because", "bosses", "caus", "change", "changed", "changes", "changing", "compel",
# "compliance"))
Or:
termco2mat(apply_as_df(text, termco, match.list=words))
# Results:
# Error in `rownames<-`(`*tmp*`, value = "doc 1") :
# attempt to set 'rownames' on an object with no dimensions
Here's a solution without qdap:
library(tm)
text1 <- "activat affect affected affecting affects aggravat allow attribut"
text2 <- "based basis bc because bosses caus change changed changes changing compel compliance"
text <- Corpus(VectorSource(c(text1, text2)))
words <- c("activat", "affect", "affected")
dtm <- DocumentTermMatrix(text)
data.frame(cnt = colSums(as.matrix(dtm[ , words])))
Output
cnt
activat 1
affect 1
affected 1
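If you also want counts for the full 20-term words vector from the question, plus percentages like termco reports, here's a minimal sketch building on the DTM above (an assumption: tm's default tokenisation drops very short terms such as "bc", so the DTM's total term count can differ slightly from termco's word.count):
mat <- as.matrix(dtm)
hits <- intersect(words, colnames(mat))  # keep only terms that survived tokenisation
cnt <- colSums(mat[, hits, drop = FALSE])
data.frame(cnt = cnt, pct = round(100 * cnt / sum(mat), 2))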
I'm not sure exactly what you're trying to do, but scores and counts are how you extract the objects from the list. Maybe you want to use t to transpose the output?
apply_as_df(text, termco, match.list=words) %>%
counts() %>%
t()
## docs "doc 1"
## word.count "20"
## activat "1"
## affect "4"
## affected "1"
## affecting "1"
## affects "1"
## aggravat "1"
## allow "1"
## attribut "1"
## based "1"
## basis "1"
## bc "1"
## because "1"
## bosses "1"
## caus "2"
## change "3"
## changed "1"
## changes "1"
## changing "1"
## compel "1"
## compliance "1"
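If you'd rather have a proper data frame than a transposed character matrix, here's a minimal sketch (assuming counts() returns a one-row data frame of raw counts, as the output above suggests) that reshapes it into two readable columns:
# reshape the counts() result into a long term/count table
cnts <- counts(apply_as_df(text, termco, match.list = words))
data.frame(term  = names(cnts)[-(1:2)],
           count = unlist(cnts[1, -(1:2)]),
           row.names = NULL)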
This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large CSV file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain keywords.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task, it would be greatly appreciated!!!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
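If you then want a single sentiment score per document, a minimal sketch (not part of the original answer) is to convert the dictionary dfm to a data frame and take positive minus negative counts:
# net sentiment per document from the dictionary counts
sent <- convert(dfm(tokens_lookup(toks, data_dictionary_LSD2015)),
                to = "data.frame")
sent$net_sentiment <- sent$positive - sent$negative
sent[, c("doc_id", "net_sentiment")]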
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
pattern = "stackoverflow",
window = 3
)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kinds of patterns you can use. With tokens you can specify which data cleaning you want to apply before using kwic; in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), in this example a window of length 3. After that you can do some form of sentiment analysis on the pre and post results (or paste them together first), as sketched below.
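Here's a minimal sketch of that last step (assumptions: the kwic object x from above and quanteda's built-in data_dictionary_LSD2015): paste the pre and post windows together and score them with the dictionary.
# score the +/- 3 token windows around each keyword hit
windows <- paste(x$pre, x$post)
win_dfm <- tokens(windows) %>%
  tokens_lookup(data_dictionary_LSD2015) %>%
  dfm()
convert(win_dfm, to = "data.frame")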
I am using the Quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations function. According to the documentation and I quote:
"The tokens object . . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."
This makes perfect sense, so here goes:
library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)
# Some sample data and lemmas
df= c("this column has a lot of missing data, 50% almost!",
"I am interested in missing data problems",
"missing data is a headache",
"how do you handle missing data?")
lemmas <- data.frame() %>%
rbind(c("missing", "miss")) %>%
rbind(c("data", "datum")) %>%
`colnames<-`(c("inflected_form", "lemma"))
(1) Generate collocations using the corpus object:
txtCorpus = corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases = textstat_collocations(txtCorpus, tolower = FALSE)
(2) Preprocess the text, identify collocations, and lemmatize for downstream tasks.
# I used a blank space as the concatenator and the phrase function as explained in the documentation, following the multi-word substitution example at
# https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
tokens_replace(pattern=phrase(c(lemmas$inflected_form)), replacement=phrase(c(lemmas$lemma)))
(3) Test the results
# Create dtm
dtm = dfm(txtTokens, remove_padding = TRUE)
# pull features
dfm_feat = as.data.frame(featfreq(dtm)) %>%
rownames_to_column(var="feature") %>%
`colnames<-`(c("feature", "count"))
dfm_feat
feature        count
this               1
column             1
has                1
a                  2
lot                1
of                 1
almost             1
i                  2
am                 1
interested         1
in                 1
problems           1
is                 1
headache           1
how                1
do                 1
you                1
handle             1
missing data       4
"missing data" should be "miss datum".
This only works if each document in df is a single word. I can make the process work if I generate my collocations from a tokens object from the get-go, but that's not what I want.
The problem is that you have already compounded the elements of the collocations into a single "token" containing a space, but by supplying the phrase() wrapper in tokens_replace(), you are telling it to look for two sequential tokens, not the single token with a space.
The way to get what you want is by making the lemmatised replacement match the collocation.
phrase_lemmas <- data.frame(
inflected_form = "missing data",
lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "almost"
##
## text2 :
## [1] "i" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
Alternatives would be to use tokens_lookup() on uncompounded tokens directly, if you have a fixed listing of sequences you want to match to lemmatised sequences. E.g.,
tokens(txtCorpus) %>%
tokens_lookup(dictionary(list("miss datum" = "missing data")),
exclusive = FALSE, capkeys = FALSE
)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "," "50" "%"
## [11] "almost" "!"
##
## text2 :
## [1] "I" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
## [6] "?"
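If you have more lemma pairs, here's a minimal sketch for building that lookup dictionary programmatically from a phrase_lemmas-style data frame (it produces the same dictionary as the hand-written one above):
# build the dictionary from the data frame: keys are lemmas, values are inflected forms
lemma_dict <- dictionary(setNames(as.list(phrase_lemmas$inflected_form),
                                  phrase_lemmas$lemma))
tokens(txtCorpus) %>%
  tokens_lookup(lemma_dict, exclusive = FALSE, capkeys = FALSE)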
I need to capture the 3.93, 4.63999..., and -5.35. I've tried all kinds of variations, but have been unable to grab the correct set of numbers.
Copay: 20.30
3.93
TAB 8.6MG Qty:60
4.6399999999999997
-5.35
2,000UNIT TAB Qty:30
AMOUNT
Qty:180
CAP 4MG
x = c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997", "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG");
grep("^[\\-]?\\d+[\\.]?\\d+$", x);
Output (see ?grep):
[1] 2 4 5
If leading/trailing spaces are allowed, change the regex to
"^\\s*[\\-]?\\d+[\\.]?\\d+\\s*$"
Try this
S <- c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997", "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG")
library(stringr)
ans <- str_extract_all(S, "-?[[:digit:]]*(\\.|,)?[[:digit:]]+", simplify=TRUE)
clean <- ans[ans!=""]
Output
[1] "20.30" "3.93" "8.6"
[4] "4.6399999999999997" "-5.35" "2,000"
[7] "180" "4" "60"
[10] "30"
I would like every result of the loop stored in a different text(somename) object.
Right now the loop overwrites text on each iteration:
library(rvest)
main.page <- read_html(x = "http://www.imdb.com/event/ev0000681/2016")
urls <- main.page %>% # feed `main.page` to the next step
html_nodes(".alt:nth-child(2) strong a") %>% # get the CSS nodes
html_attr("href") # extract the URLs
for (i in urls){
a01 <- paste0("http://www.imdb.com",i)
text <- read_html(a01) %>% # load the page
html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>% # isloate the text
html_text()
}
How could I code it in such a way that the 'i' from the list is added to text in the for statement?
To solidify my comment:
main.page <- read_html(x = "http://www.imdb.com/event/ev0000681/2016")
urls <- main.page %>% # feed `main.page` to the next step
html_nodes(".alt:nth-child(2) strong a") %>% # get the CSS nodes
html_attr("href") # extract the URLs
texts <- sapply(head(urls, n = 3), function(i) {
read_html(paste0("http://www.imdb.com", i)) %>%
html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>%
html_text()
}, simplify = FALSE)
str(texts)
# List of 3
# $ /title/tt5843990/: chr [1:4] "Lav Diaz" "Charo Santos-Concio" "John Lloyd Cruz" "Michael De Mesa"
# $ /title/tt4551318/: chr [1:4] "Andrey Konchalovskiy" "Yuliya Vysotskaya" "Peter Kurth" "Philippe Duquesne"
# $ /title/tt4550098/: chr [1:4] "Tom Ford" "Amy Adams" "Jake Gyllenhaal" "Michael Shannon"
If you use lapply(...), you'll get an unnamed list, which may or may not be a problem for you. Instead, using sapply(..., simplify = FALSE), we get a named list where each name is (in this case) the partial URL retrieved from urls.
Using sapply with its default simplify = TRUE can lead to unexpected output types. As an example:
set.seed(9)
sapply(1:3, function(i) rep(i, sample(3, size=1)))
# [1] 1 2 3
One might think that this will always return a vector. However, if any of the returned elements differs in length from the others, the result becomes a list instead:
set.seed(10)
sapply(1:3, function(i) rep(i, sample(3, size=1)))
# [[1]]
# [1] 1 1
# [[2]]
# [1] 2
# [[3]]
# [1] 3 3
In that case, it's best to have certainty about the return value by forcing a list:
set.seed(9)
sapply(1:3, function(i) rep(i, sample(3, size=1)), simplify = FALSE)
# [[1]]
# [1] 1
# [[2]]
# [1] 2
# [[3]]
# [1] 3
That way, you always know exactly how to reference the sub-returns. (This is one of the tenets and advantages of Hadley's purrr package: each function always returns an object of exactly the type you declare. There are other advantages to the package as well.)
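For completeness, here's a minimal purrr sketch of the same scrape (assuming the urls object from above; map() always returns a list, and set_names() keeps the partial URLs as names):
library(purrr)
library(rvest)
# scrape the first three URLs into a named list of character vectors
texts <- head(urls, 3) %>%
  set_names() %>%
  map(~ read_html(paste0("http://www.imdb.com", .x)) %>%
        html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>%
        html_text())
str(texts)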
Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fish", "Final Countdow", "show me your s", "where is what")
Data<-cbind(Row, Content)
View(Data)
I want to create a function that tells me how many words are misspelled per Row.
An intermediate step would be to have it look like this:
Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fs", "Final Countdow", "show me your s", "where is what")
MisspelledWords<-c(NA, "whre, fs", "Countdow","s",NA)
Data<-cbind(Row, Content,MisspelledWords)
I know that I have to use aspell, but I'm having problems running aspell on individual rows rather than on the whole file. Finally, I want to count how many words are wrong in every Row. For this I would adapt the code from: Count the number of words in a string in R?
Inspired by this article, here's a try with which_misspelled and check_spelling in library(qdap).
library(qdap)
# which_misspelled
n_misspelled <- sapply(Content, function(x){
length(which_misspelled(x, suggest = FALSE))
})
data.frame(Content, n_misspelled, row.names = NULL)
# Content n_misspelled
# 1 I love cheese 0
# 2 whre is the fs 2
# 3 Final Countdow 1
# 4 show me your s 0
# 5 where is what 0
# check_spelling
df <- check_spelling(Content, n.suggest = 0)
n_misspelled <- as.vector(table(factor(df$row, levels = Row)))
data.frame(Content, n_misspelled)
# Content n_misspelled
# 1 I love cheese 0
# 2 whre is the fs 2
# 3 Final Countdow 1
# 4 show me your s 0
# 5 where is what 0
To use aspell you have to use a file. It's pretty straightforward to use a function to dump a column to a file, run aspell and get the counts (but it will not be all that efficient if you have a large matrix/dataframe).
countMispelled <- function(words) {
# do a bit of cleanup: strip punctuation and collapse repeated spaces
words <- gsub(" +", " ", gsub("[[:punct:]]", "", words))
temp_file <- tempfile()
writeLines(words, temp_file);
res <- aspell(temp_file)
unlink(temp_file)
# return the number of misspelled words
length(res$Original)
}
Data <- cbind(Data, Errors=unlist(lapply(Data[,2], countMispelled)))
Data
## Row Content Errors
## [1,] "1" "I love cheese" "0"
## [2,] "2" "whre is thed fish" "2"
## [3,] "3" "Final Countdow" "1"
## [4,] "4" "show me your s" "0"
## [5,] "5" "where is what" "0"
You might be better off using a data frame vs a matrix (I just worked with what you provided) since you can keep Row and Errors numeric that way.
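Here's a minimal sketch of the data-frame version (using the Row and Content vectors and the countMispelled() function from above):
# build a data frame so Row stays numeric, then add the error counts
Data_df <- data.frame(Row = Row, Content = Content, stringsAsFactors = FALSE)
Data_df$Errors <- sapply(Data_df$Content, countMispelled)
Data_df  # Row and Errors remain numeric columns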