Calculating Proportion of Column Total compared to total of all columns - r

I'm attempting to calculate the Proportion of a column total compared to the total of all columns.
The Document Term Matrix I'm working with is quite large, causing any tests I run incorrectly to basically crash Rstudio.
Here is my working code so far:
randomSample = read.csv("randomSample2016.csv", stringsAsFactors = FALSE)
str(randomSample)
randomSample$tweet <- as.character(randomSample$tweet)
randomSample$tweetlength <- nchar(randomSample$tweet)
hist(randomSample$tweetlength)
library("tm")
## Use the tm library to construct a document-term matrix of term
frequencies
randomSample_corpus <- Corpus(VectorSource(randomSample$tweet))
print(randomSample_corpus)
inspect(randomSample_corpus[1:3])
#clean up corpus
#make all letters lowercase
randomSample_corpus_clean <- tm_map(randomSample_corpus, tolower)
#Remove Numbers
randomSample_corpus_clean <- tm_map(randomSample_corpus_clean,
removeNumbers)
#Remove punctuation
randomSample_corpus_clean <- tm_map(randomSample_corpus_clean,
removePunctuation)
#Remove stop words
randomSample_corpus_clean <- tm_map(randomSample_corpus_clean, removeWords,
stopwords())
#remove unneeded whitespace
randomSample_corpus_clean <- tm_map(randomSample_corpus_clean,
stripWhitespace)
#Inspect cleaned corpus
inspect(randomSample_corpus_clean[1:3])
#Create document term matrix
randomSample_dtm <- DocumentTermMatrix(randomSample_corpus_clean)
#convert to R matrix
dtm2 <- as.matrix(randomSample_dtm)
#obtain individual word frequencies
frequency <- colSums(dtm2)
This allows me to get the total frequency of all words in dtm2, however when I try to add a new row for column totals (dtm2$newcolumn <- 0) I end up with an insane amount of ram usage.

That's because your colSums returns a named-array. what you want is
as.matrix(randomSample_dtm) %>% {
rbind(., rbind(colSums(.) %>% as.numeric)) %>% as.matrix()
}

Related

non-sense word associations in Text Mining

Hi, I ran this text analysis for word associations. However, the word associations do not make any sense. For example, I was interested in the association between "women" and other words. But the output provides non-sense word associations, such as "bagthey". Does anyone know where the problem is? I also attached my code below. I tried both running or not running the "Eliminate extra white spaces" codes.
Data could be downloaded here: https://drive.google.com/file/d/1zaCrraYYNTXsrbfx0bG53pjxo9AMZu5M/view?usp=sharing
company <- read.csv("C:/Data.csv")
#### Set up data for analysis ####
# Create Corpus #To use the tm package we first transform the dataset to a corpus
corpus_review=Corpus(VectorSource(company$review))
# Convert all text to lowercase
corpus_review=tm_map(corpus_review, tolower)
# Stem words (i.e., to ensure no duplication of words for example work and working)
corpus_review=tm_map(corpus_review, stemDocument)
# Remove punctuations
corpus_review <- tm_map(corpus_review, removePunctuation, preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = TRUE, ucp = TRUE)
# Convert the text to lower case
corpus_review <- tm_map(corpus_review, content_transformer(tolower))
# Remove numbers
corpus_review <- tm_map(corpus_review, removeNumbers)
# Remove english common stopwords
corpus_review <- tm_map(corpus_review, removeWords, stopwords("english"))
# Remove own words
corpus_review=tm_map(corpus_review, removeWords,c("also", "get","like", "made", "can", "im", "just","a", "I"))
# Eliminate extra white spaces
#corpus_review <- tm_map(corpus_review, stripWhitespace)
# Text stemming - which reduces words to their root form
corpus_review <- tm_map(corpus_review, stemDocument)
# Build a term-document matrix
TextDoc_dtm <- TermDocumentMatrix(corpus_review)
dtm_m <- as.matrix(TextDoc_dtm)
# Sort by deceasing value of frequency
dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)
head(dtm_d, 5)
#### Word Associations####
associations <- findAssocs(TextDoc_dtm, "women", 0.1)
associations_df <- list_vect2df(associations)[, 2:3]
ggplot(associations_df, aes(y = associations_df[, 1])) +
geom_point(aes(x = associations_df[, 2]),
data = associations_df, size = 3) +
ggtitle("Word Associations to 'the key word'") +
theme_gdocs()

R Tm package dictionary matching leads to higher frequency than actual words of text

I have been using the code below to load text as a corpus and using the tm package to clean the text. As a next step I am loading a dictionary and cleaning it as well. Then I am matching the words from the text with the dictionary to calculate a score. However, the matching results in a higher number of matches than actual words in the text (e.g., the competence score is 1500 but the actual number of words in the text is only 1000).
I think it is related to the stemming of the text and the dictionary as the matches are lower when there is no stemming performed.
Do you have any ideas why this is happening?
Thank you very much.
R Code
Step 1 Storing data as corpus
file.path <- file.path(here("Generated Files", "Data Preparation")) corpus <- Corpus(DirSource(file.path))
Step 2 Cleaning data
#Removing special characters
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, toSpace, "#")
corpus <- tm_map(corpus, toSpace, "\\|")
#Convert the text to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
#Remove numbers
corpus <- tm_map(corpus, removeNumbers)
#Remove english common stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#Remove your own stop word
specify your stopwords as a character vector
corpus <- tm_map(corpus, removeWords, c("view", "pdf"))
#Remove punctuations
corpus <- tm_map(corpus, removePunctuation)
#Eliminate extra white spaces
corpus <- tm_map(corpus, stripWhitespace)
#Text stemming
corpus <- tm_map(corpus, stemDocument)
#Unique words
corpus <- tm_map(corpus, unique)
Step 3 DTM
dtm <- DocumentTermMatrix(corpus)
Step 4 Load Dictionaries
dic.competence <- read_excel(here("Raw Data", "6. Dictionaries", "Brand.xlsx"))
dic.competence <- tolower(dic.competence$COMPETENCE)
dic.competence <- stemDocument(dic.competence)
dic.competence <- unique(dic.competence)
Step 5 Count frequencies
corpus.terms = colnames(dtm)
competence = match(corpus.terms, dic.competence, nomatch=0)
Step 6 Calculate scores
competence.score = sum(competence) / rowSums(as.matrix(dtm))
competence.score.df = data.frame(scores = competence.score)
What does competence return when you run that line? I'm not sure how your dictionary is set up, so I can't say for certain what's happening there. I brought in my own random corpus text as the primary text and brought in a separate corpus as the dictionary and your code worked great. The row names of competence.score.df were the names of the different txt files in my corpus and the scores were all in a 0-1 range.
# this is my 'dictionary' of terms:
tdm <- TermDocumentMatrix(Corpus(DirSource("./corpus/corpus3")),
control = list(removeNumbers = TRUE,
stopwords = TRUE,
stemming = TRUE,
removePunctuation = TRUE))
# then I used your programming and it worked as I think you were expecting
# notice what I used here for the dictionary
(competence = match(colnames(dtm),
Terms(tdm)[1:10], # I only used the first 10 in my test of your code
nomatch = 0))
(competence.score = sum(competence)/rowSums(as.matrix(dtm)))
(competence.score.df = data.frame(scores = competence.score))

R: tm package on German text

I want to perform Sentiment classification on German dataset, I am using the following code, which works fine with english text, but raising error in case of German text.
Here is my code for the following:
#loading required libraries
library(tm)
library(readxl)
library(data.table)
library(plyr)
library(dplyr)
library(zoo)
library(ggplot2)
library(ranger)
library(e1071)
df<- data.table(read_excel("data/German2datasets.xlsx", skip = 1))
# An abstract function to preprocess a text column
preprocess <- function(text_column)
{
# Use tm to get a doc matrix
corpus <- Corpus(VectorSource(text_column))
# all lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# remove punctuation
corpus <- tm_map(corpus, content_transformer(removePunctuation))
# remove numbers
corpus <- tm_map(corpus, content_transformer(removeNumbers))
# remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("german"))
# stem document
corpus <- tm_map(corpus, stemDocument)
# strip white spaces (always at the end)
corpus <- tm_map(corpus, stripWhitespace)
# return
corpus
}
# Get preprocess training and test data
corpus <- preprocess(df$TEXT)
# Create a Document Term Matrix for train and test
# Just including bi and tri-grams
Sys.setenv(JAVA_HOME='D://Program Files/Java/jre1.8.0_112') # for 32-bit version
library(rJava)
library(RWeka)
# Bi-Trigram tokenizer function (you can always get longer n-grams)
bitrigramtokeniser <- function(x, n) {
RWeka:::NGramTokenizer(x, RWeka:::Weka_control(min = 2, max = 3))
}
"
Remove remove words <=2
TdIdf weighting
Infrequent (< than 1% of documents) and very frequent (> 80% of documents) terms not included
"
dtm <- DocumentTermMatrix(corpus, control=list(wordLengths=c(2, Inf),
tokenize = bitrigramtokeniser,
weighting = function(x) weightTfIdf(x, normalize = FALSE),
bounds=list(global=c(floor(length(corpus)*0.01), floor(length(corpus)*.8)))))
sent <- df$Sentiment
# Variable selection
# ~~~~~~~~~~~~~~~~~~~~
"
For dimension reduction.
The function calculates chi-square value for each phrase and keeps phrases with highest chi_square values
Ideally you want to put variable selection as part of cross-validation.
chisqTwo function takes:
document term matrix (dtm),
vector of labels (labels), and
number of n-grams you want to keep (n_out)
"
chisqTwo <- function(dtm, labels, n_out=2000){
mat <- as.matrix(dtm)
cat1 <- colSums(mat[labels==T,]) # total number of times phrase used in cat1
cat2 <- colSums(mat[labels==F,]) # total number of times phrase used in cat2
n_cat1 <- sum(mat[labels==T,]) - cat1 # total number of phrases in soft minus cat1
n_cat2 <- sum(mat[labels==F,]) - cat2 # total number of phrases in hard minus cat2
num <- (cat1*n_cat2 - cat2*n_cat1)^2
den <- (cat1 + cat2)*(cat1 + n_cat1)*(cat2 + n_cat2)*(n_cat1 + n_cat2)
chisq <- num/den
chi_order <- chisq[order(chisq)][1:n_out]
mat <- mat[, colnames(mat) %in% names(chi_order)]
}
n <- nrow(dtm)
shuffled <- dtm[sample(n),]
train_dtm <- shuffled[1:round(0.7 * n),]
test_dtm <- shuffled[(round(0.7 * n) + 1):n,]
"
With high dimensional data, test matrix may not have all the phrases training matrix has.
This function fixes that - so that test matrix has the same columns as training.
testmat takes column names of training matrix (train_mat_cols), and
test matrix (test_mat)
and outputs test_matrix with the same columns as training matrix
"
# Test matrix maker
testmat <- function(train_mat_cols, test_mat){
# train_mat_cols <- colnames(train_mat); test_mat <- as.matrix(test_dtm)
test_mat <- test_mat[, colnames(test_mat) %in% train_mat_cols]
miss_names <- train_mat_cols[!(train_mat_cols %in% colnames(test_mat))]
if(length(miss_names)!=0){
colClasses <- rep("numeric", length(miss_names))
df <- read.table(text = '', colClasses = colClasses, col.names = miss_names)
df[1:nrow(test_mat),] <- 0
test_mat <- cbind(test_mat, df)
}
as.matrix(test_mat)
}
# Train and test matrices
train_mat <- chisqTwo(train_dtm, train$Sentiment)
test_mat <- testmat(colnames(train_mat), as.matrix(test_dtm))
dim(train_mat)
dim(test_mat)
n <- nrow(df)
shuffled <- df[sample(n),]
train_data <- shuffled[1:round(0.7 * n),]
test_data <- shuffled[(round(0.7 * n) + 1):n,]
train_mat <- as.data.frame(as.matrix(train_mat))
colnames(train_mat) <- make.names(colnames(train_mat))
train_mat$Sentiment <- train_data$Sentiment
test_mat <- as.data.frame(as.matrix(test_mat))
colnames(test_mat) <- make.names(colnames(test_mat))
test_mat$Sentiment <- test_data$Sentiment
train_mat$Sentiment <- as.factor(train_mat$Sentiment)
test_mat$Sentiment <- as.factor(test_mat$Sentiment)
Then, I will apply caret ML algos on the same for prediction of the Sentiment on the train and test data created.
I am getting the following error at "preprocess" function.
> corpus <- preprocess(df$TEXT)
Show Traceback
Rerun with Debug
Error in FUN(content(x), ...) :
invalid input 'Ich bin seit Jahren zufrieden mit der Basler Versicherubg🌺' in 'utf8towcs'
Data - https://drive.google.com/open?id=1T_LpL2G8upztihAC2SQeVs4YCPH-yfOs
How about trying a different package to get to the pre-Weka etc stages? This is equivalent (and simpler imho):
library("quanteda")
library("readtext")
# reads in the spreadsheet and creates the corpus
germancorp <-
readtext("data/German2datasets.xlsx", text_field = "TEXT")) %>%
corpus()
# does all of the steps of your preprocess() function
dtm <- dfm(germancorp, ngrams = c(2, 3),
tolower = TRUE,
remove_punct = TRUE,
remove_numbers = TRUE,
remove = stopwords("german"),
stem = TRUE)
# remove words with only a single count
dtm <- dfm_trim(dtm, min_count = 2)
# form tf-idf weights - change the base argument from default 10 if you wish
dtm <- dfm_tfidf(dtm)
# if you really want a tm formatted DocumentTermMatrix
convert(dtm, to = "tm")
The quanteda package can do some of what you list as additional steps, although it is not clear exactly what you are doing. (Your question focused on the preprocess() failure so I answered that.)
if you haven´t found the reason yet:
invalid input in 'utf8towcs'
It is the encoding of the file (depending on your [virtual] environment and the current sys-options and of course on the the of saving the file to disk at the time of creation)
A workaround is like:
usableText=str_replace_all(tweets$text,"[^[:graph:]]", " ")
or
your_corpus<- tm_map(your_corpus,toSpace,"[^[:graph:]]")

Find 2 words phrase using tm R

I know this has been asked multiple times. For example
Finding 2 & 3 word Phrases Using R TM Package
However, I don't know why none of these solutions work with my data. The result is always one-gram word no matter how many ngram I chose (2, 3 or 4) for the ngram.
Could anybody know the reason why? I suspect the encoding is the reason.
Edited: a small part of the data.
comments <- c("Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into problem_70918\n",
"Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into tm-247\n",
"Merge branch 'php5.3-upgrade-sprint6-7' of git.internal.net:/git/pn-project/LegacyCodebase into release2012.08\n",
"Merge remote-tracking branch 'dmann1/p71148-s3-callplan_mapping' into lcst-operational-changes\n",
"Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into TASK-360148\n",
"Merge remote-tracking branch 'grockett/rpr-pre' into rpr-lite\n"
)
cleanCorpus <- function(vector){
corpus <- Corpus(VectorSource(vector), readerControl = list(language = "en_US"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, tolower)
#corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
#corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
return(corpus)
}
# this function is provided by a team member (in the link I posted above)
test <- function(keywords_doc){
BigramTokenizer <- function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
# creating of document matrix
keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))
# remove sparse terms
keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.99)
# Frequency of the words appearing
keyword.freq <- rowSums(as.matrix(keywords_naremoval))
subsetkeyword.freq <-subset(keyword.freq, keyword.freq >=20)
frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq)
# Sorting of the words
frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ]
frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ]
# Printing of the words
# wordcloud(frequentKeywordDF$term, freq=frequentKeywordDF$freq, random.order = FALSE, rot.per=0.35, scale=c(5,0.5), min.freq = 30, colors = brewer.pal(8,"Dark2"))
return(frequentKeywordDF)
}
corpus <- cleanCorpus(comments)
t <- test(corpus)
> head(t)
term freq
added added 6
html html 6
tracking tracking 6
common common 4
emails emails 4
template template 4
Thanks,
I haven't found the reason either, but if you are only interested in the counts regardless in which documents the bigrams occured, you could get them alternatively via this pipeline:
library(tm)
lilbrary(dplyr)
library(quanteda)
# ..construct the corpus as in your post ...
corpus %>%
unlist() %>%
tokens() %>%
tokens_ngrams(2:2, concatenator = " ") %>%
unlist() %>%
as.data.frame() %>%
group_by_(".") %>%
summarize(cnt=n()) %>%
arrange(desc(cnt))

Empty term document matrix

I seem to run into a problem whenever I try to inspect my freq. words and associations.
When I make the tdm I get this info:
TermDocumentMatrix
I can see I have plenty of terms to use, in plenty of documents.
However!
When I try to inspect the content of "tdm", I get this info:
Inspecting the TDM
Howcome the tdm all of a sudden is empty?
Hope someone can help
tweets <- userTimeline("RDataMining", n = 1000)
(n.tweet <- length(tweets))
tweets[1:3]
#convert tweets to a data frame
tweets.df <- twListToDF(tweets)
dim(tweets.df)
##Text cleaning
library(tm)
#build a corpus and specify the source to be a character vector
myCorpus <- Corpus(VectorSource(tweets.df$text))
#convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
#remove URLs
removeURL <- function(x) gsub ("http[^[:space:]]*","",x)
myCorpus <- tm_map(myCorpus,content_transformer(removeURL))
#remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*","",x)
myCorpus <- tm_map(myCorpus,content_transformer(removeNumPunct))
#remove stopwords + 2
myStopwords <- c(stopwords('english'),"available","via")
#remove "r" and "big" from stopwords
myStopwords <- setdiff(myStopwords, c("r","big"))
#remove stopwords from corpus
myCorpus <- tm_map(myCorpus,removeWords,myStopwords)
#remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
#keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy <- myCorpus
#stem words
library(SnowballC)
myCorpus <- tm_map(myCorpus,stemDocument)
stemCompletion2 <- function(x,dictionary) {
x <- unlist(strsplit(as.character(x),""))
#because stemCompletion completes an empty string to a word in dict. Remove empty string to avoid this
x <- x[x !=""]
x <- stemCompletion(x, dictionary = dictionary)
x <- paste (x,sep = "",collapse = "")
PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
#count freq of "mining"
miningCases <- lapply(myCorpusCopy,
function(x) {grep(as.character(x),pattern = "\\<mining")})
sum(unlist(miningCases))
#count freq of "miner"
miningCases <- lapply(myCorpusCopy,
function(x) {grep(as.character(x),pattern = "\\<miner")})
sum(unlist(miningCases))
#count freq of "r"
miningCases <- lapply(myCorpusCopy,
function(x) {grep(as.character(x),pattern = "\\<r")})
sum(unlist(miningCases))
#replace "miner" with "mining"
myCorpus <- tm_map(myCorpus,content_transformer(gsub),
pattern = "miner", replacement = "mining")
tdm <- TermDocumentMatrix(myCorpus, control = list(removePunctuation = TRUE,stopwords = TRUE))
tdm
##Freq words and associations
idx <- which(dimnames(tdm)$Terms == "r")
inspect(tdm[idx + (0:5), 101:110])
#inspect frequent words
(freq.terms <- findFreqTerms(tdm, lowfreq = 15))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq,term.freq >= 15)
df <- data.frame(term = names(term.freq), freq = term.freq)
I've been using the following Twitter query to test your code:
tweets = searchTwitter("r data mining", n=10)
and I think the problem is with your function stemCompletion2, which should look something like this:
stemCompletion2 <- function(x,dictionary) {
x <- unlist(strsplit(as.character(x)," "))
print("before:")
print(x)
#because stemCompletion completes an empty string to a word in dict. Remove empty string to avoid this
x <- x[x !=""]
x <- stemCompletion(x, dictionary = dictionary)
print("after:")
print(x)
x <- paste(x, sep = " ")
PlainTextDocument(stripWhitespace(x))
}
The modifications are as follows: before you had
x <- unlist(strsplit(as.character(x),""))
which was creating a list with all the characters of in each of the documents, and I've modified it to
x <- unlist(strsplit(as.character(x)," "))
to create a list of words. Similarly, when recomposing your documents, you where doing
x <- paste (x,sep = "",collapse = "")
which was creating the long strings you mention in your post, and I've modified it to:
x <- paste(x, sep = " ")
to recompose the words.
One example of the completions would be for my data:
[1] "before:"
[1] "rt" "ebookdealalert" "r" "datamin" "project" "learn" "data" "mine"
[9] "realworld" "project" "book" "solv" "predict" "model"
[1] "after:"
rt ebookdealalert r datamin project learn data mine
"rt" "ebookdealalerts" "r" "datamining" "projects" "learn" "data" ""
realworld project book solv predict model
"realworld" "projects" "book" "solve" "predictive" "modeling"
After that step, you may be able to work with TermDocumentMatrix as expected.
Hope it helps.

Resources