I am unable to use tm_combine in R. Here are the version details
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 3.3
year 2017
month 03
day 06
svn rev 72310
language R
version.string R version 3.3.3 (2017-03-06)
nickname Another Canoe
I would like to understand more about this. In case there is an issue with accessing the linked article, my question is: how do I combine two document-term matrices D1 and D2 that have different numbers of columns?
> packageVersion("tm")
[1] ‘0.7.1’
> dim(s.tdm)
[1] 132 536
> dim(f.tdm)
[1] 132 674
>
Thanks.
Here's the code that I was trying
library(tm)
library(SnowballC)
s.dir <- "AuthorIdentify\\Author1"
f.dir <- "AuthorIdentify\\Author2"
s.docs <- Corpus(DirSource(s.dir, encoding="UTF-8"))
f.docs <- Corpus(DirSource(f.dir, encoding="UTF-8"))
cleanCorpus <- function(corpus){
  # apply stemming
  corpus <- tm_map(corpus, stemDocument)
  # remove punctuation
  corpus.tmp <- tm_map(corpus, removePunctuation)
  # remove white spaces
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # remove stop words
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("en"))
  return(corpus.tmp)
}
s.cldocs <- cleanCorpus(s.docs) # preprocessing
# forms document-term matrix
s.tdm <- DocumentTermMatrix(s.cldocs)
# removes infrequent terms
s.tdm <- removeSparseTerms(s.tdm,0.97)
dim(s.tdm) # [ #docs, #numterms ]
f.cldocs <- cleanCorpus(f.docs) # preprocessing
# forms document-term matrix
f.tdm <- DocumentTermMatrix(f.cldocs)
# removes infrequent terms
f.tdm <- removeSparseTerms(f.tdm,0.97)
dim(f.tdm) # [ #docs, #numterms ]
#how do I combine f.tdm and s.tdm
# tm_combine???
I need to combine them (and eventually convert the result to a matrix or data.frame) so that I can have a column identifying each document as Author1 or Author2.
With the approach referenced in the linked article, the output of the combined DTMs does not match the expected output. I have referenced the relevant details in the comments section.
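One possible way to do this (a sketch of mine, not the approach from the linked article): convert each DTM to a plain data.frame, tag it with the author, and row-bind the two, filling terms that appear for only one author with zeros. The helper name dtm_to_df and the use of dplyr::bind_rows are my own choices, not part of the original code.
library(dplyr)
# hypothetical helper: turn a DocumentTermMatrix into a data.frame
# and tag every row (document) with an author label
dtm_to_df <- function(dtm, author) {
  df <- as.data.frame(as.matrix(dtm))
  df$author <- author
  df
}
# bind_rows() keeps the union of the columns; terms missing from one
# author come back as NA ...
combined <- bind_rows(dtm_to_df(s.tdm, "Author1"),
                      dtm_to_df(f.tdm, "Author2"))
# ... which we set to 0, i.e. "term never used by that author"
combined[is.na(combined)] <- 0
dim(combined)  # 264 documents, union of both term sets plus the author column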
This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large CSV file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain keywords.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task it would be greatly appreciated!!!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
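From that dfm, one simple way (my sketch, not part of the original answer) to get a per-speech score is to convert it to a data.frame and take positive minus negative counts:
# continuing from the code above; convert() exports the dfm as a data.frame
sent <- tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm() %>%
  convert(to = "data.frame")
# a very simple net score: positive minus negative counts (my own addition)
sent$net <- sent$positive - sent$negative
sent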
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
pattern = "stackoverflow",
window = 3
)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kinds of patterns you can use. With tokens() you can specify which data cleaning you want to apply before using kwic(); in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), in this example a window of length 3. After that you can do some form of sentiment analysis on the pre and post results (or paste them together first).
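A minimal sketch of that last step (the details are my own assumptions, not part of the original answer): paste pre and post back together and score each snippet with the Lexicoder dictionary that ships with quanteda.
kw <- as.data.frame(x)
# one snippet per keyword match: the left and right context pasted together
snippets <- paste(kw$pre, kw$post)
# tokenize the snippets and count dictionary hits per snippet
dfm(tokens_lookup(tokens(snippets), data_dictionary_LSD2015))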
I am not sure of the extent of this side effect. Why is this happening? What caution does one need to take?
library(data.table)
dt <- data.table(
  sample = 1
)
i <- 1
# add columns x_1 ... x_254 one at a time
while (i <= 254) {
  col <- paste("x", i, sep = "_")
  dt[[col]] <- i
  i <- i + 1
}
> combined_dt <- rbind(dt, dt)
> print(head(names(combined_dt))) # Columns get reordered
[1] "sample" "x_5" "x_6" "x_1" "x_2" "x_3"
>
> combined_dt <- rbindlist(list(dt, dt))
> print(head(names(combined_dt))) # Columns do not get reordered
[1] "sample" "x_1" "x_2" "x_3" "x_4" "x_5"
R details
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.4
year 2018
month 03
day 15
svn rev 74408
language R
version.string R version 3.4.4 (2018-03-15)
nickname Someone to Lean On
I'm reading all my text files into a df with the readtext package.
df <- readtext(directory, "*.txt")
The .txt files get stored in a df with doc_id (name of the document) and text (content).
Before I upgraded to the newest version of quanteda, the doc_id was stored in the corpus object when I created my corpus using:
corpus <- corpus(df)
But now this doesn't work anymore: the 'documents' data frame of the corpus object only stores the 'texts' values, but no longer the doc_id values.
How do I get back my doc_id into my corpus object?
That's because of a bug that we fixed prior to v1.2.0. When constructing a corpus from a data.frame, some field is required for a document id, and by default this is the readtext doc_id.
If you want it also as a document variable, you can do it this way. First, I read in some texts from the system files of the readtext package, for a reproducible example.
library("readtext")
library("quanteda")
packageVersion("readtext")
## [1] ‘0.50’
packageVersion("quanteda")
## [1] ‘1.2.0’
DATA_DIR <- system.file("extdata/", package = "readtext")  # example texts shipped with the readtext package
df <- readtext(paste0(DATA_DIR, "txt/EU_manifestos/*.txt"), encoding = "LATIN1")
df
## readtext object consisting of 17 documents and 0 docvars.
## # data.frame [17 × 2]
## doc_id text
## <chr> <chr>
## 1 EU_euro_2004_de_PSE.txt "\"PES · PSE \"..."
## 2 EU_euro_2004_de_V.txt "\"Gemeinsame\"..."
## 3 EU_euro_2004_en_PSE.txt "\"PES · PSE \"..."
## 4 EU_euro_2004_en_V.txt "\"Manifesto\n\"..."
## 5 EU_euro_2004_es_PSE.txt "\"PES · PSE \"..."
## 6 EU_euro_2004_es_V.txt "\"Manifesto\n\"..."
When we create the corpus from this, we see no document variables.
crp <- corpus(df)
crp
## data frame with 0 columns and 17 rows
But it's trivial to add them:
docvars(crp, "doc_id") <- df$doc_id
head(docvars(crp))
## doc_id
## EU_euro_2004_de_PSE.txt EU_euro_2004_de_PSE.txt
## EU_euro_2004_de_V.txt EU_euro_2004_de_V.txt
## EU_euro_2004_en_PSE.txt EU_euro_2004_en_PSE.txt
## EU_euro_2004_en_V.txt EU_euro_2004_en_V.txt
## EU_euro_2004_es_PSE.txt EU_euro_2004_es_PSE.txt
## EU_euro_2004_es_V.txt EU_euro_2004_es_V.txt
Note that you are strongly discouraged from accessing the internals of the corpus object through its internal data.frame element (e.g. crp$documents). Using the accessor docvars() and the replacement function docvars()<- will keep working in the future, but the internals of the corpus are likely to change.
I am trying to learn how to do some text analysis with twitter data. I am running into an issue when creating a Term Frequency Matrix.
I create the Corpus out of spanish text (with special characters), with no issues.
However, when I create the Term Frequency Matrix (either with the quanteda or tm libraries) the Spanish characters do not display as expected (instead of seeing canción, I see canciÃ³n).
Any suggestions on how I can get the Term Frequency Matrix to store the text with the correct characters?
Thank you for any help.
As a note: I prefer using the quanteda library, since ultimately I will be creating a wordcloud, and I think I better understand this library's approach. I am also using a Windows machine.
I have tried Encoding(tw2) <- "UTF-8" with no luck.
library(dplyr)
library(tm)
library(quanteda)
#' Creating a character with special Spanish characters:
tw2 <- "RT #None: Enmascarados, si masduro chingán a tarek. Si quieres ahora, la aguantas canción . https://t."
# Cleaning the tweet: removing special punctuation, numbers, http links, and extra spaces
clean_tw2 <- tolower(tw2)
clean_tw2 = gsub("&", "", clean_tw2)
clean_tw2 = gsub("(rt|via)((?:\\b\\W*#\\w+)+)", "", clean_tw2)
clean_tw2 = gsub("#\\w+", "", clean_tw2)
clean_tw2 = gsub("[[:punct:]]", "", clean_tw2)
clean_tw2 = gsub("http\\w+", "", clean_tw2)
clean_tw2 = gsub("[ \t]{2,}", "", clean_tw2)
clean_tw2 = gsub("^\\s+|\\s+$", "", clean_tw2)
# creates a vector with common stopwords, and other words which I want removed.
myStopwords <- c(stopwords("spanish"),"tarek","vez","ser","ahora")
clean_tw2 <- (removeWords(clean_tw2,myStopwords))
# If we print clean_tw2 we see that all the characters are displayed as expected.
clean_tw2
#'Create Corpus Using quanteda library
corp_quan<-corpus(clean_tw2)
# The corpus created via quanteda, displays the characters as expected.
corp_quan$documents$texts
#'Create Corpus Using tm library
corp_td<-Corpus(VectorSource(clean_tw2))
#' Remove common words from spanish from the Corpus.
#' If we inspect the corp_td, we see that the characters and words are displayed correctly
inspect(corp_td)
# Create the DFM with quanteda library.
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan
# Create the TDM with tm library
tdm_td<-TermDocumentMatrix(corp_td)
# Here we see that the Spanish characters are displayed incorrectly (e.g. canción = canciÃ), and "si" is missing.
tdm_td$dimnames$Terms
It looks like quanteda (and tm) is losing the encoding when creating the DFM on the Windows platform. In this tidytext question the same problem happened with unnesting tokens (which works fine now), and quanteda's tokens() works fine as well.
If I enforce UTF-8 or latin1 encoding on the @Dimnames$features slot of the dfm, you get the correct results.
....
previous code
.....
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
features
docs enmascarados si masduro chingÃ¡n quieres aguantas canciÃ³n t
text1 1 2 1 1 1 1 1 1
If you do the following:
Encoding(tdm_quan@Dimnames$features) <- "UTF-8"
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
features
docs enmascarados si masduro chingán quieres aguantas canción t
text1 1 2 1 1 1 1 1 1
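The same re-declaration idea should apply to the tm matrix as well (a sketch, assuming tdm_td built as in the question):
# re-declare the encoding of the term labels of the TermDocumentMatrix
Encoding(tdm_td$dimnames$Terms) <- "UTF-8"
tdm_td$dimnames$Terms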
Let me guess...are you using Windows? On macOS it works fine:
clean_tw2
## [1] "enmascarados si masduro chingán si quieres aguantas canción"
Encoding(clean_tw2)
## [1] "UTF-8"
dfm(clean_tw2)
## Document-feature matrix of: 1 document, 7 features (0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
## features
## docs enmascarados si masduro chingán quieres aguantas canción
## text1 1 2 1 1 1 1 1
My system information:
sessionInfo()
# R version 3.4.4 (2018-03-15)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS High Sierra 10.13.4
#
# Matrix products: default
# BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] tm_0.7-3 NLP_0.1-11 dplyr_0.7.4 quanteda_1.1.6
Problem
I cannot understand how to transform a list into transactions for further processing by the apriori algorithm. I have a synthetic example that works, and a real one (well, a subset of the Foodmart database) that does not; they look the same to me at the system level. Please help me transform a list into a transactions object.
System setup
> version
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 0.2
year 2013
month 09
day 25
svn rev 63987
language R
version.string R version 3.0.2 (2013-09-25)
nickname Frisbee Sailing
Code to replicate
Code that works
> library(arules)
> a_list <- list(
c("a","b","c"),
c("a","b"),
c("a","b","d"),
c("c","e"),
c("c","e"),
c("a","b","d","e")
)
> a_trans <- as(a_list,"transactions")
> summary(a_trans)
transactions as itemMatrix in sparse format with
6 rows (elements/itemsets/transactions) and
5 columns (items) and a density of 0.5333333
... and so on ...
2 b
3 c
> a_rules <- apriori(a_trans)
parameter specification:
confidence minval smax arem aval originalSupport support minlen maxlen target ext
... and so on ...
writing ... [17 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
Code that does not work
> b_list <- list(
c("PigTail Frozen Pepperoni Pizza","Bird Call Childrens Cold Remedy","Steady Silky Smooth Hair Conditioner","CDR Regular Coffee"),
c("Horatio Graham Crackers","Excellent Apple Drink","Blue Medal Small Eggs","Cormorant Copper Cleaner","High Quality Copper Cleaner","Fast Apple Fruit Roll"),
c("Toucan Canned Mixed Fruit","Landslide Salt","Gorilla Sour Cream","Hermanos Firm Tofu"),
c("Swell Canned Mixed Fruit","Washington Diet Soda","Super Apple Jam","Plato Strawberry Preserves","Steady Whitening Toothpast","Steady Whitening Toothpast","Better Beef Soup","Hermanos Squash","Carrington Frozen Cheese Pizza","Fort West Fondue Mix","Best Choice Mini Donuts","Cormorant Copper Pot Scrubber","Ebony Cantelope","Denny D-Size Batteries","Akron Eyeglass Screwdriver"),
c("Big Time Ice Cream Sandwich","Musial Mints","Portsmouth Imported Beer","CDR Vegetable Oil","Just Right Rice Soup","Carrington Frozen Peas","High Quality 100 Watt Lightbulb","Fort West Dried Dates"),
c("Consolidated Tartar Control Toothpaste","Plato Tomato Sauce","Quick Seasoned Hamburger")
)
> b_trans <- as(b_list,"transactions")
Error in asMethod(object) :
can not coerce list with transactions with duplicated items
> summary(b_trans)
Error in summary(b_trans) :
error in evaluating the argument 'object' in selecting a method for function 'summary': Error: object 'b_trans' not found
Funny thing
> duplicated(a_list)
[1] FALSE FALSE FALSE FALSE TRUE FALSE
> duplicated(b_list)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
Any ideas why this fabulous (WTF) thing happens?
joran and DWin mentioned:
Elements of character vectors in a_list are unique.
There is a duplication in one of the vectors of b_list.
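A quick way to check that (a sketch of mine, not from the original answers) is to ask which list elements contain duplicated items:
# returns the index of every transaction that mentions an item more than once
which(sapply(b_list, function(items) any(duplicated(items))))
# for b_list this points at the fourth vector; for a_list it returns integer(0)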
Here is how it looks. If I add a second "b" into the first vector of a_list2
> a_list2 <- list(
c("a","b","b","c"),
c("a","b"),
c("a","b","d"),
c("c","e"),
c("c","e"),
c("a","b","d","e")
)
then in the following attempt to transform the data I get the error
> a_trans2 <- as(a_list2,"transactions")
Error in asMethod(object) :
can not coerce list with transactions with duplicated items
It appears that b_list has "Steady Whitening Toothpast" mentioned twice in the fourth vector. Manually removing this duplicate (giving b_list2 below) solved the issue.
> b_trans2 <- as(b_list2,"transactions")
> summary(b_trans2)
transactions as itemMatrix in sparse format with
6 rows (elements/itemsets/transactions) and
... and so on ...
2 Best Choice Mini Donuts
3 Better Beef Soup
As for the solution for the real data, the following code runs without errors.
# split product names by transaction id, then drop duplicated items
# within each transaction before coercing to "transactions"
aggrData <- split(selData$product_name, selData$transaction_id)
listData <- list()
for (i in 1:length(aggrData)) {
  listData[[i]] <- as.character(aggrData[[i]][!duplicated(aggrData[[i]])])
}
trnsData <- as(listData, "transactions")
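A more concise equivalent (my sketch, assuming selData has the same product_name and transaction_id columns used above):
# unique() keeps the first occurrence of each item, as !duplicated() does above
listData <- lapply(split(as.character(selData$product_name), selData$transaction_id), unique)
trnsData <- as(listData, "transactions")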
However, neither the following line nor attempts with other parameters deliver any rules.
> rules <- apriori(trnsData)
parameter specification:
... and so on ...
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
Yet this is a totally different story.