Keep special characters in a word-frequency matrix - r

I analyze brand mentions in text to compute KPIs such as ad recognition. However, brands that contain special characters are destroyed by my code so far.
library(qdap)
library(stringr)
test <- c("H&M", "C&A", "Zalando", "Zalando", "Amazon", "Sportscheck")
wfm(test)
This is the output:
            all
a             1
amazon        1
c             1
h             1
m             1
sportscheck   1
zalando       2
Is there a package or method to achieve that "H&M" becomes the single term h&m, rather than being split into "h" and "m" as if it were two brands?
Edit: the wfm function has a ... argument which should allow me to use the strip function:
wfm(test, ... = strip(test, char.keep = "&"))
Unfortunately, this does not work.

I am not familiar with the qdap package, but maybe substituting the & could solve your problem:
replacement <- "" # set your replacement, e.g. "" (empty string) or "_"
test <- gsub("&", replacement, test, fixed = TRUE)
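Extending that idea, a hedged sketch of the full round trip: protect the ampersand with a placeholder word before wfm() strips punctuation, then restore it in the term names afterwards. The placeholder string is an arbitrary choice assumed not to occur elsewhere in the corpus, and this assumes wfm() returns a matrix with the terms as row names, as in the output above.
library(qdap)

test <- c("H&M", "C&A", "Zalando", "Zalando", "Amazon", "Sportscheck")

# protect "&" so each brand survives as a single alphabetic token
protected <- gsub("&", "ampersandplaceholder", test, fixed = TRUE)
m <- wfm(protected)

# restore the "&" in the (lowercased) term names
rownames(m) <- gsub("ampersandplaceholder", "&", rownames(m), fixed = TRUE)
m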

I would suggest something like this: the udpipe package has a function document_term_frequencies in which you can specify the split, and it turns the data into a data frame with the frequency counts. If there is no id column to specify, it will generate one. The resulting object of document_term_frequencies is a data.table.
library(udpipe)
# data.frame without an ID column
my_data <- data.frame(text = c("H&M, C&A, Zalando, Zalando, Amazon, Sportscheck",
                               "H&M, C&A, Amazon, Sportscheck"),
                      stringsAsFactors = FALSE)
# if you have an ID column add document = my_data$id to the function
# see more examples in ?document_term_frequencies
document_term_frequencies(my_data$text, split = ",")
   doc_id        term freq
1:   doc1         H&M    1
2:   doc1         C&A    1
3:   doc1     Zalando    2
4:   doc1      Amazon    1
5:   doc1 Sportscheck    1
6:   doc2         H&M    1
7:   doc2         C&A    1
8:   doc2      Amazon    1
9:   doc2 Sportscheck    1
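To get back to the original test vector of individual brand mentions, a small sketch (assuming document_term_frequencies accepts a plain character vector, as in the example above): paste the mentions into a single document and split on the comma.
test <- c("H&M", "C&A", "Zalando", "Zalando", "Amazon", "Sportscheck")
document_term_frequencies(paste(test, collapse = ", "), split = ", ")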


How to work with scores and regex in a keywords dictionary to get a rudimentary sentiment analysis, with R?

I would like to optimize the size of a sentiment dictionary by using regular expressions, but I don't know how to match the keywords against the text to be analysed without losing the rating of each keyword.
I work in R, and I'd like to stay close to a "matching words" solution.
This is what I tried:
library(stringr)
library(tidytext) # tidy text analysis + unnest_tokens
library(tidyverse) # visualization + tibble
# text to be scored
Corpus <- c("Radicals in their time, early Impressionists violated the rules of academic painting.",
            "They also painted realistic scenes of modern life, and often painted outdoors.",
            "The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision.",
            "Even if the art critics and art establishment disapproved of the new style.")
# dictionary : words and quotes lists
WordsList <- c("^academ.+$","^disapprov.*$","^friend.*$","^fresh.*$","^hostil.+$","^modern.*$","^new.*$","^original.*$","^outstand.*$","^radical.*$","^uncorrect.+$","^violat.+$")
QuotesList <- c(1,-2,2,2,-2,2,1,2,3,-3,-1,-3)
Lexicon <- data.frame(words=WordsList, quotes=QuotesList)
Lexicon
# words quotes
# 1 ^academ.+$ 1
# 2 ^disapprov.*$ -2
# 3 ^friend.*$ 2
# 4 ^fresh.*$ 2
# 5 ^hostil.+$ -2
# 6 ^modern.*$ 2
# 7 ^new.*$ 1
# 8 ^original.*$ 2
# 9 ^outstand.*$ 3
# 10 ^radical.*$ -3
# 11 ^uncorrect.+$ -1
# 12 ^violat.+$ -3
messag <- tibble(docidx = 1:length(Corpus), text = Corpus)
# split into words: 1 row per word per "document"
txt.by.word <- messag %>%
  unnest_tokens(mots, text)
# order the patterns by size instead of alphabetically
matching <- paste(Lexicon[order(-nchar(Lexicon$words)), ]$words, collapse = '|')
matching
matching
# [1] "^disapprov.*$|^original.*$|^radical.*$|^academ.+$|^hostil.+$|^modern.*$|^violat.+$|^fresh.*$|^new.*$"
# search matchings
test <- str_extract_all(txt.by.word$mots, matching, simplify = TRUE) # case-sensitive
# result
test
tst <- as.data.frame(test)
# except empty
tst[!tst$V1 %in% "",]
# [1] "radicals" "violated" "academic" "modern" "hostile" "fresh" "original" "disapproved"
# [9] "new"
# from here I don't know how to get this expected result: by docidx, matching the words and their associated ratings.
# how to extract both the keyword and the sentiment rating ?
# Expected result
# docidx text quote
# 1 radicals -3
# 1 violated -3
# 1 academic 1
# 2 modern 2
# 3 hostile -2
# 3 fresh 2
# 3 original 2
# 4 disapproved -2
# 4 new 1
Thanks to Maël, who answered another post of mine (see an equivalent of the 'match' function that works with regex), I have found an acceptable solution, very close to my target. Here is the heart of the code, used in place of str_extract_all. (Note: the Lexicon here uses \\b...\\b word-boundary patterns rather than the ^...$ patterns above.)
library(data.table)
dt.unl <- as.data.table(unlist(sapply(Lexicon$words, grep, Corpus, value = TRUE)), keep.rownames = TRUE)
dt.unl
dt.unl[, keywords := lapply(.SD, function(x) gsub("[0-9]$", "", x)), .SDcols = 1, by = "V1"]
dt.unl
dt.scor <- merge(dt.unl[, .(V2, keywords)], Lexicon, by.x = "keywords", by.y = "words")
dt.scor
# keywords V2 quotes
# 1: \\bacadem.+\\b Radicals in their time, early Impressionists violated the rules of academic painting. 1
# 2: \\bdisapprov.*\\b Even if the art critics and art establishment disapproved of the new style. -2
# 3: \\bfresh.*\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. 2
# 4: \\bhostil.+\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. -2
# 5: \\bmodern.*\\b They also painted realistic scenes of modern life, and often painted outdoors. 2
# 6: \\bnew.*\\b Even if the art critics and art establishment disapproved of the new style. 1
# 7: \\boriginal.*\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. 2
# 8: \\bviolat.+\\b Radicals in their time, early Impressionists violated the rules of academic painting. -3
#
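Alternatively, a short tidyverse sketch (assuming txt.by.word and the original ^...$ Lexicon from above) that lands directly on the expected per-word result: test each dictionary regex against each token and carry the rating along.
library(dplyr)
library(stringr)

hits <- lapply(seq_len(nrow(Lexicon)), function(i) {
  pat <- as.character(Lexicon$words[i])
  txt.by.word %>%
    filter(str_detect(mots, pat)) %>%   # does this token match the dictionary regex?
    mutate(quote = Lexicon$quotes[i])   # attach the keyword's rating
})
bind_rows(hits) %>% arrange(docidx)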

tokenizing on a pdf for quantitative analysis

I ran into an issue using the unnest_tokens function on a data_frame. I am working with PDF files that I want to compare.
text_path <- "c:/.../text1.pdf"
text_raw <- pdf_text("c:/.../text1.pdf")
text1df <- data_frame(Zeile = 1:25,
                      text_raw)
So far so good. But here comes my problem:
unnest_tokens(output = token, input = content) -> text1_long
Error: Must extract column with a single valid subscript.
x Subscript var has the wrong type function.
i It must be numeric or character.
I want to tokenize my PDF files so I can analyse the word frequencies and maybe compare multiple PDF files with word clouds.
Here is a piece of simple code; note that unnest_tokens needs the data frame as its first argument, and that the input column here is text_raw, not content. I kept your German words so you can copy-paste everything.
library(pdftools)
library(dplyr)
library(stringr)
library(tidytext)
file_location <- "d:/.../my_doc.pdf"
text_raw <- pdf_text(file_location)
# Zeile 12 because I only have 12 pages
text1df <- data_frame(Zeile = 1:12,
                      text_raw)
text1df_long <- unnest_tokens(text1df, output = wort, input = text_raw) %>%
  filter(str_detect(wort, "[a-z]"))
text1df_long
# A tibble: 4,134 x 2
Zeile wort
<int> <chr>
1 1 training
2 1 and
3 1 development
4 1 policy
5 1 contents
6 1 policy
7 1 statement
8 1 scope
9 1 induction
10 1 training
# ... with 4,124 more rows
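From here, a small sketch of the frequency step (assuming text1df_long from above; removing stop words with tidytext's stop_words dataset is an extra assumption and only covers English):
library(dplyr)
library(tidytext)

word_freq <- text1df_long %>%
  anti_join(stop_words, by = c("wort" = "word")) %>%  # drop common English stop words
  count(wort, sort = TRUE)                            # frequency per word
head(word_freq)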

How do I include stopwords(terms) in text2vec

In the text2vec package, I am using the create_vocabulary function. For example:
my text is "This book is very good", and suppose I am not using stopwords and use ngrams of 1L to 3L, so the vocabulary terms will be
This, book, is, very, good, This book, ..., book is very, very good. I just want to remove the term "book is very" (and a host of other terms, supplied as a vector). Since I just want to remove phrases, I can't use stopwords. I have written the code below:
vocab <- create_vocabulary(it, ngram = c(1L, 3L))
x <- read.csv("Filename")    # these are all stop phrases
stp <- as.vector(x$term)     # stp is the vector of stop phrases
vocab_mod <- subset(vocab, !(term %in% stp))
After this step, the meta-information in the attributes gets lost in vocab_mod, so it can't be used in create_dtm.
It seems that the subset function drops some attributes. You can try:
library(text2vec)
txt = "This book is very good"
it = itoken(txt)
v = create_vocabulary(it, ngram = c(1, 3))
v = v[!(v$term %in% "is_very_good"), ]
v
# Number of docs: 1
# 0 stopwords: ...
# ngram_min = 1; ngram_max = 3
# Vocabulary:
# term term_count doc_count
# 1: good 1 1
# 2: book_is_very 1 1
# 3: This_book 1 1
# 4: This 1 1
# 5: book 1 1
# 6: very_good 1 1
# 7: is_very 1 1
# 8: book_is 1 1
# 9: This_book_is 1 1
# 10: is 1 1
# 11: very 1 1
dtm = create_dtm(it, vocab_vectorizer(v))
@Dmitriy, even this ends up dropping the attributes... So the way out that I found was to add the attributes back manually for now, using the attr function:
attr(vocab_mod, "ngram") <- c(ngram_min = 1L, ngram_max = 3L)
and so on for the other attributes as well. We can get the attribute details from vocab.
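A hedged generalisation of that workaround (a sketch, not an official text2vec API; v_full and v_sub are hypothetical names): copy every non-structural attribute from the original vocabulary onto the subset in one loop.
# assumes it from above
v_full <- create_vocabulary(it, ngram = c(1, 3))
v_sub  <- v_full[!(v_full$term %in% "is_very_good"), ]

keep <- setdiff(names(attributes(v_full)),
                c("names", "row.names", "class"))  # leave data.frame internals intact
for (a in keep) {
  attr(v_sub, a) <- attr(v_full, a)
}
dtm <- create_dtm(it, vocab_vectorizer(v_sub))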

R Data frame column update

I have a dataframe with names, and a second dataframe with a dictionary of names and the sex of those names. I want to check whether each name is in the dictionary; if it is, add the sex from the dictionary table to the names dataframe.
My code looks like this:
# Sets everything to -1; 1 for male, 0 for female
train$sex <- "-1"
train$sex[toupper(train$fname) == nam_dict$Name] <- nam_dict$Sex
I am getting the following error:
Error in train$sex[toupper(train$fname) == nam_dict$Name] <- nam_dict$Sex :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In toupper(train$fname) == nam_dict$Name :
longer object length is not a multiple of shorter object length
I have a workaround, I think, where I can split the dictionary into male and female and simply replace the <- nam_dict$Sex portion of the code with the character 'F' or 'M' depending on the dictionary.
I just thought there would be a better way.
Your problem can be solved by a quick and simple application of match.
First, here's a quick reproducible example:
(train <- data.frame(fname = c("Alex", "Jennifer", "David", "Alice")))
# fname
# 1 Alex
# 2 Jennifer
# 3 David
# 4 Alice
(nam_dict <- data.frame(Name = c("alice", "alex"), Sex = 0:1))
# Name Sex
# 1 alice 0
# 2 alex 1
A possible solution
train$sex <- nam_dict$Sex[match(tolower(train$fname), tolower(nam_dict$Name))]
train
# fname sex
# 1 Alex 1
# 2 Jennifer NA
# 3 David NA
# 4 Alice 0
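Note that names missing from the dictionary come back as NA rather than the -1 default from the question; one extra line (a sketch) restores that:
train$sex[is.na(train$sex)] <- -1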
A bit more advanced solution (if you'll ever work with huge data sets) is to try data.table's binary join, which allows you to update your data by reference and choose which columns to update, all at lightning speed.
First we convert both data sets to the data.table class and make both name columns lower case; then we key them by the column we want to join by; finally we do a left join into train, creating the sex column by reference and pulling the data from i (the Sex column in nam_dict).
library(data.table)
setDT(train)[, fname := tolower(fname)]
setDT(nam_dict)[, Name := tolower(Name)]
setkey(train, fname) ; setkey(nam_dict, Name)
train[nam_dict, sex := i.Sex]
train
# fname sex
# 1: alex 1
# 2: alice 0
# 3: david NA
# 4: jennifer NA

Extract data elements found in a single column

Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string values at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.
This is one solution:
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
                        interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of the solution simply splits each element of the interest_string factor in the data object dat, using \\{ as the split indicator. This indicator has to be escaped, and in R that requires two \. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data:
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
A nice and tidy data.table solution:
library(data.table)
DT <- data.table(read.table(textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header = TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
  list(interest = unlist(strsplit(interest_string, "{", fixed = TRUE)))
}, by = id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT
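For completeness, a hedged tidyr sketch of the same reshaping (assuming dat as above; unlike strsplit, separate_rows keeps the empty string left behind by the trailing {, so it has to be filtered out):
library(dplyr)
library(tidyr)

dat %>%
  separate_rows(interest_string, sep = "\\{") %>%  # one row per code
  filter(interest_string != "")                    # drop the trailing empties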
