R Spanish Term Frequency Matrix with tm and Quanteda Spanish Characters - r

I am trying to learn how to do some text analysis with Twitter data. I am running into an issue when creating a Term Frequency Matrix.
I create the Corpus out of Spanish text (with special characters) with no issues.
However, when I create the Term Frequency Matrix (with either the quanteda or tm library), the Spanish characters do not display as expected (instead of seeing canción, I see canciÃ³n).
Any suggestions on how I can get the Term Frequency Matrix to store the text with the correct characters?
Thank you for any help.
As a note: I prefer using the quanteda library, since ultimately I will be creating a wordcloud, and I think I better understand this library's approach. I am also using a Windows machine.
I have tried Encoding(tw2) <- "UTF-8" with no luck.
library(dplyr)
library(tm)
library(quanteda)
#' Creating a character with special Spanish characters:
tw2 <- "RT #None: Enmascarados, si masduro chingán a tarek. Si quieres ahora, la aguantas canción . https://t."
# Cleaning the tweet: removing special punctuation, numbers, http links, and extra spaces:
clean_tw2 <- tolower(tw2)
clean_tw2 = gsub("&amp", "", clean_tw2)
clean_tw2 = gsub("(rt|via)((?:\\b\\W*#\\w+)+)", "", clean_tw2)
clean_tw2 = gsub("#\\w+", "", clean_tw2)
clean_tw2 = gsub("[[:punct:]]", "", clean_tw2)
clean_tw2 = gsub("http\\w+", "", clean_tw2)
clean_tw2 = gsub("[ \t]{2,}", "", clean_tw2)
clean_tw2 = gsub("^\\s+|\\s+$", "", clean_tw2)
# creates a vector with common stopwords, and other words which I want removed.
myStopwords <- c(stopwords("spanish"),"tarek","vez","ser","ahora")
clean_tw2 <- (removeWords(clean_tw2,myStopwords))
# If we print clean_tw2 we see that all the characters are displayed as expected.
clean_tw2
#' Create Corpus using the quanteda library
corp_quan<-corpus(clean_tw2)
# The corpus created via quanteda, displays the characters as expected.
corp_quan$documents$texts
#' Create Corpus using the tm library
corp_td<-Corpus(VectorSource(clean_tw2))
#' Remove common Spanish words from the Corpus.
#' If we inspect corp_td, we see that the characters and words are displayed correctly.
inspect(corp_td)
# Create the DFM with quanteda library.
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan
# Create the TDM with the tm library
tdm_td<-TermDocumentMatrix(corp_td)
# Here we see that the Spanish characters are displayed incorrectly (e.g. canción = canciÃ³n), and "si" is missing.
tdm_td$dimnames$Terms

It looks like quanteda (and tm) loses the encoding when creating the DFM on the Windows platform. In this tidytext question the same problem happens when unnesting tokens, although tidytext's unnest_tokens and quanteda's tokens() themselves work fine.
If you enforce UTF-8 or latin1 encoding on the @Dimnames$features of the dfm, you get the correct results.
....
previous code
.....
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
features
docs enmascarados si masduro chingÃ¡n quieres aguantas canciÃ³n t
text1 1 2 1 1 1 1 1 1
If you do the following:
Encoding(tdm_quan@Dimnames$features) <- "UTF-8"
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
features
docs enmascarados si masduro chingán quieres aguantas canción t
text1 1 2 1 1 1 1 1 1
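The same re-marking should also work on the tm side of the question (a small sketch following the same idea; it is not part of the original answer): the terms of a TermDocumentMatrix are an ordinary character vector, so their declared encoding can be set directly.
# Assumes tdm_td is the TermDocumentMatrix built earlier with the tm library
Encoding(tdm_td$dimnames$Terms) <- "UTF-8"
tdm_td$dimnames$Terms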

Let me guess...are you using Windows? On macOS it works fine:
clean_tw2
## [1] "enmascarados si masduro chingán si quieres aguantas canción"
Encoding(clean_tw2)
## [1] "UTF-8"
dfm(clean_tw2)
## Document-feature matrix of: 1 document, 7 features (0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
## features
## docs enmascarados si masduro chingán quieres aguantas canción
## text1 1 2 1 1 1 1 1
My system information:
sessionInfo()
# R version 3.4.4 (2018-03-15)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS High Sierra 10.13.4
#
# Matrix products: default
# BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] tm_0.7-3 NLP_0.1-11 dplyr_0.7.4 quanteda_1.1.6

Related

Mixed kana and kanji romanization to romaji in R

I have a large character vector of Japanese words (mixed kanji and kana) which needs to be romanized (converted to romaji).
However, with the available functions (zipangu::str_conv_romanhira() and audubon::strj_romanize()), I am not getting the desired results.
For example, for 北海道 (Hokkaido), zipangu::str_conv_romanhira() converts it to Chinese pinyin and audubon::strj_romanize() converts only the kana characters.
How can I convert such mixed kana and kanji text to romaji?
library(zipangu)
library(stringi)
library(audubon)
str_conv_romanhira("北海道", "roman")
#> [1] "běi hǎi dào"
stri_trans_general("北海道", "Any-Latin")
#> [1] "běi hǎi dào"
strj_romanize("北海道")
#> [1] ""
There aren't any R packages that provide transliteration of Japanese kanji to romaji that I can see (at least none currently on CRAN). It's easy enough, however, to use the Python module pykakasi via R to achieve this:
library(reticulate)
py_install("pykakasi") # Only need to install once
# Make module available in R
pykakasi <- import("pykakasi")
# Alias the convert function for convenience
convert <- pykakasi$kakasi()$convert
convert("北海道")
[[1]]
[[1]]$orig
[1] "北海道"
[[1]]$hira
[1] "ほっかいどう"
[[1]]$kana
[1] "ホッカイドウ"
[[1]]$hepburn
[1] "hokkaidou"
[[1]]$kunrei
[1] "hokkaidou"
[[1]]$passport
[1] "hokkaidou"
# Function to extract romaji and collapse
to_romaji <- function(txt) {
  paste(sapply(convert(txt), `[[`, "hepburn"), collapse = " ")
}
# Test on some longer text
lapply(c("北海道", "石の上にも三年", "豚に真珠"), to_romaji)
[[1]]
[1] "hokkaidou"
[[2]]
[1] "ishi no ueni mo sannen"
[[3]]
[1] "buta ni shinju"

Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)

This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain key words.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task, it would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
  tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
          pattern = "stackoverflow",
          window = 3)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kinds of patterns you can use. With tokens() you can specify which data cleaning you want to apply before using kwic. In my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s); in this example the window has length 3. After that you can do some form of sentiment analysis on the pre and post results (or paste them together first).
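For example, a minimal sketch of that last step (assuming the kwic object x from above and quanteda's built-in data_dictionary_LSD2015; the column names pre and post are those shown by as.data.frame(x)):
# Paste the pre- and post-windows of each match back together
ctx <- as.data.frame(x)
ctx$window <- paste(ctx$pre, ctx$post)
# Score each reconstructed window with the Lexicoder sentiment dictionary
dfm(tokens_lookup(tokens(ctx$window), data_dictionary_LSD2015))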

Create a corpus from df including doc names

I'm reading all my text files into a df with the readtext package.
df <- readtext(directory, "*.txt")
The .txt files get stored in a df with doc_id (name of the document) and text (content).
Before I upgraded to the newest version of quanteda, the doc_id was stored in the corpus object when I created my corpus using:
corpus <- corpus(df)
But now this doesn't work anymore, the 'documents'-df of the corpus object only stores the 'texts'-values, but not the doc_id-values anymore.
How do I get back my doc_id into my corpus object?
That's because of a bug that we fixed prior to v1.2.0. When constructing a corpus from a data.frame, some field is required for a document id, and by default this is the readtext doc_id.
If you want it also as a document variable, you can do it this way. First, I read in some texts from the system files of the readtext package, for a reproducible example.
library("readtext")
library("quanteda")
packageVersion("readtext")
## [1] ‘0.50’
packageVersion("quanteda")
## [1] ‘1.2.0’
# DATA_DIR points to the sample texts shipped with the readtext package
DATA_DIR <- system.file("extdata/", package = "readtext")
df <- readtext(paste0(DATA_DIR, "txt/EU_manifestos/*.txt"), encoding = "LATIN1")
df
## readtext object consisting of 17 documents and 0 docvars.
## # data.frame [17 × 2]
## doc_id text
## <chr> <chr>
## 1 EU_euro_2004_de_PSE.txt "\"PES · PSE \"..."
## 2 EU_euro_2004_de_V.txt "\"Gemeinsame\"..."
## 3 EU_euro_2004_en_PSE.txt "\"PES · PSE \"..."
## 4 EU_euro_2004_en_V.txt "\"Manifesto\n\"..."
## 5 EU_euro_2004_es_PSE.txt "\"PES · PSE \"..."
## 6 EU_euro_2004_es_V.txt "\"Manifesto\n\"..."
When we create the corpus from this, we see no document variables.
crp <- corpus(df)
docvars(crp)
## data frame with 0 columns and 17 rows
But it's trivial to add them:
docvars(crp, "doc_id") <- df$doc_id
head(docvars(crp))
## doc_id
## EU_euro_2004_de_PSE.txt EU_euro_2004_de_PSE.txt
## EU_euro_2004_de_V.txt EU_euro_2004_de_V.txt
## EU_euro_2004_en_PSE.txt EU_euro_2004_en_PSE.txt
## EU_euro_2004_en_V.txt EU_euro_2004_en_V.txt
## EU_euro_2004_es_PSE.txt EU_euro_2004_es_PSE.txt
## EU_euro_2004_es_V.txt EU_euro_2004_es_V.txt
Note that you are strongly discouraged from accessing the internals of the corpus object through its data.frame element df$documents. Using the accessor docvars() and replacement docvars()<- will work in the future, but the internals of the corpus are likely to change.

Difficulty in using tm_combine

I am unable to use tm_combine in R. Here are the version details
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 3.3
year 2017
month 03
day 06
svn rev 72310
language R
version.string R version 3.3.3 (2017-03-06)
nickname Another Canoe
I would like to understand more about this. In case there is an issue with accessing this, my question is: how do I combine two document-term matrices D1 and D2 which have different numbers of columns?
> packageVersion("tm")
[1] ‘0.7.1’
> dim(s.tdm)
[1] 132 536
> dim(f.tdm)
[1] 132 674
>
Thanks.
Here's the code that I was trying
library(tm)
library(SnowballC)
s.dir <- "AuthorIdentify\\Author1"
f.dir <- "AuthorIdentify\\Author2"
s.docs <- Corpus(DirSource(s.dir, encoding="UTF-8"))
f.docs <- Corpus(DirSource(f.dir, encoding="UTF-8"))
cleanCorpus <- function(corpus){
  # apply stemming
  corpus <- tm_map(corpus, stemDocument)
  # remove punctuation
  corpus.tmp <- tm_map(corpus, removePunctuation)
  # remove white spaces
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # remove stop words
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("en"))
  return(corpus.tmp)
}
s.cldocs <- cleanCorpus(s.docs) # preprocessing
# forms document-term matrix
s.tdm <- DocumentTermMatrix(s.cldocs)
# removes infrequent terms
s.tdm <- removeSparseTerms(s.tdm,0.97)
dim(s.tdm) # [ #docs, #numterms ]
f.cldocs <- cleanCorpus(f.docs) # preprocessing
# forms document-term matrix
f.tdm <- DocumentTermMatrix(f.cldocs)
# removes infrequent terms
f.tdm <- removeSparseTerms(f.tdm,0.97)
dim(f.tdm) # [ #docs, #numterms ]
#how do I combine f.tdm and s.tdm
tm_combine???
I need to combine them (and eventually to a matrix or data.frame) so that I can have a column identifier for Author1 or Author2
With the approach referenced in the linked article, the output of the combined DTMs does not match the expected output. I have referenced the relevant details in the comments section.
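One possible way to combine them without tm_combine (a sketch under the assumption that the goal is a single matrix with an author label; none of this comes from the original thread): convert each document-term matrix to a data frame, tag the rows with the author, and row-bind them so that terms present in only one matrix are filled in.
library(dplyr)
# Convert each document-term matrix to a plain data frame of counts
s.df <- as.data.frame(as.matrix(s.tdm))
f.df <- as.data.frame(as.matrix(f.tdm))
# Add the author identifier column
s.df$author <- "Author1"
f.df$author <- "Author2"
# bind_rows() aligns the differing term columns; terms missing from one matrix become NA
combined <- bind_rows(s.df, f.df)
combined[is.na(combined)] <- 0
dim(combined)  # rows = all documents, columns = union of terms plus the author column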

Using VCorpus() function but lose content

I am using the VCorpus() function from the R package tm. Here is the problem I have:
example_text = data.frame(num=c(1,2,3),Author1 = c("Text mining is a great time.","Text analysis provides insights","qdap and tm are used in text mining"),Author2=c("R is a great language","R has many uses","DataCamp is cool!"))
This looks like
num Author1 Author2
1 1 Text mining is a great time. R is a great language
2 2 Text analysis provides insights R has many uses
3 3 qdap and tm are used in text mining DataCamp is cool!
Then I type df_source = DataframeSource(example_text[,2:3]) to only extract the last 2 columns.
df_source looks correct. After that, I did df_corpus = VCorpus(df_source) and df_corpus[[1]] is
<<PlainTextDocument>>
Metadata: 7
Content: chars: 2
And df_corpus[[1]][1] gives me
$content
[1] "3" "3"
But df_corpus[[1]] should return
<<PlainTextDocument>>
Metadata: 7
Content: chars: 49
And df_corpus[[1]][1] should return
$content
[1] "Text mining is a great time." "R is a great language"
I don't know where it goes wrong. Any suggestions will be appreciated.
The texts inside example_text that are supposed to be character have all become factors because the 'factory-fresh' value of stringsAsFactors is TRUE, which is weird and annoying from my point of view.
example_text <- data.frame(num=c(1,2,3),Author1 = c("Text mining is a great time.","Text analysis provides insights","qdap and tm are used in text mining"),Author2=c("R is a great language","R has many uses","DataCamp is cool!"))
lapply(example_text, class)
# $num
# [1] "numeric"
#
# $Author1
# [1] "factor"
#
# $Author2
# [1] "factor"
To ensure that the columns Author1 and Author2 are character columns, you may try one of the following:
Add options(stringsAsFactors = FALSE) at the beginning of your code.
Add stringsAsFactors = FALSE inside your data.frame(...) statement.
Run example_text[, 2:3] <- lapply(example_text[, 2:3], as.character)
Run example_text[, 2:3] <- lapply(example_text[, 2:3], paste)
Then everything should work fine.
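For example, a quick check of option 2 (a minimal sketch, not from the original answer):
example_text <- data.frame(num = c(1, 2, 3),
                           Author1 = c("Text mining is a great time.",
                                       "Text analysis provides insights",
                                       "qdap and tm are used in text mining"),
                           Author2 = c("R is a great language",
                                       "R has many uses",
                                       "DataCamp is cool!"),
                           stringsAsFactors = FALSE)
# Both author columns are now character, so VCorpus(DataframeSource(...))
# keeps the full texts instead of the factor codes
lapply(example_text[, 2:3], class)
# $Author1
# [1] "character"
# $Author2
# [1] "character"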
