How to convert a 2-column matrix to a data frame - r

I want to create a bigram wordcloud in R with the tau package.
I got the bigrams as a named numeric vector, so I converted it to a matrix, but the matrix has no column names. I want it as a data frame so that I can create a bigram wordcloud from it.
Please find my code below and suggest a way out.
library(tau)
library(tm)   # Corpus() and TermDocumentMatrix() come from tm
speech1 = Corpus(VectorSource(speech))
myDTM = TermDocumentMatrix(speech1, control = list(minWordLength = 1))
bigrams = textcnt(speech1, n = 2, method = "string")
bigrams = bigrams[order(bigrams, decreasing = TRUE)]
n = as.matrix(bigrams)
Please suggest how I can create a wordcloud from bigrams; I was unable to do it with the weka package.

If the goal is the wordcloud, then check out this page: http://www.rpubs.com/rgcmme/PLN-09. Here is a small example adapted from it:
library(tm)
library(wordcloud)
# sample data
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
speech <- readLines(filePath)
speech1 = Corpus(VectorSource(speech))
myDTM = TermDocumentMatrix(speech1, control = list(minWordLength = 1))
myDTM_mat <- as.matrix(myDTM)
myDTM_mat_sorted <- sort(rowSums(myDTM_mat),decreasing = TRUE)
myDTM_df <- data.frame(word = names(myDTM_mat_sorted), freq = myDTM_mat_sorted)
wordcloud(myDTM_df$word,
          myDTM_df$freq,
          max.words = 100,
          random.order = FALSE)
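To address the conversion in the question directly: the counts returned by textcnt() are a named numeric vector, so they can go straight into a two-column data frame without the intermediate matrix. A minimal sketch, assuming speech is the same character vector of the speech text read in above:
library(tau)
library(wordcloud)
# count bigrams on the raw text; textcnt() accepts a character vector
bigrams <- textcnt(speech, n = 2, method = "string")
# a named vector converts cleanly into word/freq columns
bigram_df <- data.frame(word = names(bigrams), freq = as.numeric(bigrams))
bigram_df <- bigram_df[order(bigram_df$freq, decreasing = TRUE), ]
wordcloud(bigram_df$word, bigram_df$freq, max.words = 50, random.order = FALSE)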

Related

How to find frequency of n-grams and visualize it in wordcloud using R?

I have a dataframe with a column of text strings that I would like to analyse. I would like to know which words are used most and visualize this in a wordcloud. For single words (unigrams) I've managed to do so, but I can't make my code work for n-grams (e.g. bigrams, trigrams). Here I've included my code for the unigrams. I'm open to adjusting my code to make it work, or to using a completely new piece of code. How would I best approach this?
library(wordcloud)
library(RColorBrewer)
library(wordcloud2)
library(tm)
library(stringr)
#Delete special characters and lower text
df$text <- str_replace_all(df$text, "[^[:alnum:]]", " ")
df$text <- tolower(df$text)
#From df to Corpus
corpus <- Corpus(VectorSource(df))
#Remove english stopwords,
stopwords<-c(stopwords("english"))
corpus <- tm_map(corpus, removeWords,stopwords)
rm(stopwords)
#Make term document matrix
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
#Make list of most frequent words
tdm_freq <- as.matrix(tdm)
words <- sort(rowSums(tdm_freq),decreasing=TRUE)
tdm_freq <- data.frame(word = names(words),freq=words)
rm(words)
#Make a wordcloud
wordcloud2(tdm_freq, size = 0.4, minSize = 10, gridSize = 0,
           fontFamily = 'Segoe UI', fontWeight = 'normal',
           color = 'red', backgroundColor = "white",
           minRotation = -pi/4, maxRotation = pi/4, shuffle = TRUE,
           rotateRatio = 0.4, shape = 'circle', ellipticity = 0.8,
           widgetsize = NULL, figPath = NULL, hoverFunction = NULL)
Change Corpus to VCorpus so tokenising will work.
# Data
df <- data.frame(text = c("I have dataframe with a column I have dataframe with a column",
                          "I would like to know what are the most I would like to know what are the most",
                          "For single words (unigrams) I've managed to do so For single words (unigrams) I've managed to do so",
                          "Here I've included my code for the unigrams Here I've included my code for the unigrams"))
# VCorpus
corpus <- VCorpus(VectorSource(df))
funs <- list(stripWhitespace,
             removePunctuation,
             function(x) removeWords(x, stopwords("english")),
             content_transformer(tolower))
corpus <- tm_map(corpus, FUN = tm_reduce, tmFuns = funs)
# Tokenise data without requiring any particular package
ngram_token <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse=" "), use.names=FALSE)
# Pass into TDM control argument
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_token))
freq <- rowSums(as.matrix(tdm))
tdm_freq <- data.frame(term = names(freq), occurrences = freq)
tdm_freq
                               term occurrences
code unigrams         code unigrams           2
column dataframe   column dataframe           1
column like             column like           1
dataframe column   dataframe column           2
included code         included code           2
...
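To turn those bigram counts back into the wordcloud the question asked for, the resulting data frame can be handed to wordcloud2() much like the unigram version; it expects the word in the first column and the frequency in the second. A small follow-up sketch using the tdm_freq built above:
# rename to the word/freq convention used by the unigram example
names(tdm_freq) <- c("word", "freq")
wordcloud2(tdm_freq, size = 0.4)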

TermDocumentMatrix not responding to Tokenizer

I am very new to R and I am trying to make an n-gram wordcloud. However, my results always show unigrams instead of n-grams. I have searched for days for answers on the web and tried different methods... still the same result. Also, for some reason, I don't have the NGramTokenizer function that I see everyone using. However, I found another tokenizer function that I am using here. I hope someone can help me out. Thanks in advance!
library(dplyr)
library(ggplot2)
library(tidytext)
library(wordcloud)
library(tm)
library(RTextTools)
library(readxl)
library(qdap)
library(RWeka)
library(tau)
library(quanteda)
rm(list = ls())
#setwd("C:\\RStatistics\\Data\\")
#allverbatims <-read_excel("RS_Verbatims2018.xlsx") #reads excel files
#selgroup <- subset(allverbatims, FastNPS=="Detractors")
#selcolumns <- selgroup[ ,3:8]
#sample data
selcolumns <- c("this is a test","my test is not working","sample data here")
Comments <- Corpus(VectorSource(selcolumns))
CommentClean <- tm_map(Comments, removePunctuation)
CommentClean <- tm_map(CommentClean, content_transformer(tolower))
CommentClean <- tm_map(CommentClean,removeNumbers)
CommentClean <- tm_map(CommentClean, stripWhitespace)
CommentClean <- tm_map(CommentClean,removeWords,c(stopwords('english')))
#create manual tokenizer using tau textcnt since NGramTokenizer is not available
tokenize_ngrams <- function(x, n=2) return(rownames(as.data.frame(unclass(textcnt(x,method="string", n=n)))))
#test tokenizer
head(tokenize_ngrams(CommentClean))
td_mat <- TermDocumentMatrix(CommentClean, control = list(tokenize = tokenize_ngrams))
inspect(td_mat) #should be bigrams but the result is 1 gram
matrix <- as.matrix(td_mat)
sorted <- sort(rowSums(matrix),decreasing = TRUE)
data_text <- data.frame(word = names(sorted),freq = sorted)
set.seed(1234)
wordcloud(words = data_text$word, freq = data_text$freq, min.freq = 5, max.words = 100,
          random.order = FALSE, rot.per = 0.1, colors = rainbow(30))
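No answer is included here, but the Corpus-versus-VCorpus quirk discussed in the other threads on this page is the likely cause: a custom tokenize function passed via control is ignored for the corpus returned by Corpus(). A minimal, untested sketch of that fix, reusing the cleaning steps above and the NLP-based tokenizer shown elsewhere on this page:
# use VCorpus so that the custom tokenizer passed via control is honoured
Comments <- VCorpus(VectorSource(selcolumns))
CommentClean <- tm_map(Comments, removePunctuation)
CommentClean <- tm_map(CommentClean, content_transformer(tolower))
CommentClean <- tm_map(CommentClean, removeNumbers)
CommentClean <- tm_map(CommentClean, stripWhitespace)
CommentClean <- tm_map(CommentClean, removeWords, stopwords("english"))
# bigram tokenizer built on NLP::ngrams(), which ships with tm
bigram_token <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
                                   use.names = FALSE)
td_mat <- TermDocumentMatrix(CommentClean, control = list(tokenize = bigram_token))
inspect(td_mat)  # terms should now be bigrams rather than single words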

Purtest Object - How to save Output as tex file?

I was wondering whether there is a way to save a purtest output as a LaTeX file?
As you can see in the example code, I have already tried to produce it via stargazer. However, the stargazer function does not support the purtest class.
library(plm)
library(stargazer)
dat <- data.frame(entity = c(rep("a", 10), rep("b", 10)),
                  year = rep(1970:1979, 2),
                  value = rnorm(20))
pdat <- pdata.frame(dat,index = c("entity","year"))
res <- purtest(object = pdat$value,test = "ips",exo = "intercept",pmax = 1)
stargazer(summary(res),type = "latex")
I know that it is possible to extract the values manually, store them in a data.frame and eventually save the data.frame via print.xtable as a LaTeX file.
But perhaps there is a neater solution to the problem.
The stargazer library has a lot of checks constraining the classes that can be used.
The "purtest" class is not included, but since stargazer supports exporting the "matrix" class, one can work around the restriction. For example:
# the problem
library(plm)
library(stargazer)
dat <- data.frame(entity = c(rep("a", 10), rep("b", 10)),
                  year = rep(1970:1979, 2),
                  value = rnorm(20))
pdat <- pdata.frame(dat,index = c("entity","year"))
res <- purtest(object = pdat$value,test = "ips",exo = "intercept",pmax = 1)
# One solution: extract the parameters and place them in a matrix:
a = unlist(res$idres[[1]])
b = unlist(res$idres[[2]])
all = rbind(a, b)
class(all) <- c("matrix")
stargazer(all, type = "latex", align = TRUE)
# align is needed, otherwise you get strange double dollar signs
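If the aim is a .tex file on disk rather than console output, stargazer can also write the table directly via its out argument (the file name here is just an example):
stargazer(all, type = "latex", align = TRUE, out = "purtest_results.tex")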

Create Document Term Matrix with N-Grams in R

I am using "tm" package to create DocumentTermMatrix in R. It works well for one - gram but i am trying to create a DocumenttermMatrix of N-Grams(N = 3 for now) using tm package and tokenize_ngrams function from "tokenizers" package.
But im not able to create it.
I searched for possible solution but i didnt get much help.
For privacy reasons i can not share the data.
Here is what i have tried,
library(tm)
library(tokenizers)
data is a dataframe with around 4.5k rows and 2 columns, namely "doc_id" and "text":
data_corpus = Corpus(DataframeSource(data))
Custom function for n-gram tokenization:
ngram_tokenizer = function(x){
  temp = tokenize_ngrams(x, n_min = 1, n = 3, stopwords = FALSE, ngram_delim = "_")
  return(temp)
}
Control lists for DTM creation:
1-gram:
control_list_unigram = list(tokenize = "words",
                            removePunctuation = FALSE,
                            removeNumbers = FALSE,
                            stopwords = stopwords("english"),
                            tolower = T,
                            stemming = T,
                            weighting = function(x) weightTf(x))
For n-gram tokenization:
control_list_ngram = list(tokenize = ngram_tokenizer,
                          removePunctuation = FALSE,
                          removeNumbers = FALSE,
                          stopwords = stopwords("english"),
                          tolower = T,
                          stemming = T,
                          weighting = function(x) weightTf(x))
dtm_unigram = DocumentTermMatrix(data_corpus, control_list_unigram)
dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)
dim(dtm_unigram)
dim(dtm_ngram)
The dimensions of both DTMs were the same.
Please correct me!
Unfortunately tm has some quirks that are annoying and not always clear. First of all, tokenizing doesn't seem to work on corpora created with Corpus. You need to use VCorpus for this.
So change the line data_corpus = Corpus(DataframeSource(data)) to data_corpus = VCorpus(DataframeSource(data)).
That is one issue tackled. The corpus will now work for tokenizing, but then you will run into an issue with tokenize_ngrams. You will get the following error:
Input must be a character vector of any length or a list of character
vectors, each of which has a length of 1.
when you run this line: dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)
To solve this, and not have a dependency on the tokenizer package, you can use the following function to tokenize the data.
NLP_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
}
This uses the ngrams function from the NLP package which is loaded when you load the tm package. 1:3 tells it to create ngrams from 1 to 3 words. So your control_list_ngram should look like this:
control_list_ngram = list(tokenize = NLP_tokenizer,
                          removePunctuation = FALSE,
                          removeNumbers = FALSE,
                          stopwords = stopwords("english"),
                          tolower = T,
                          stemming = T,
                          weighting = function(x) weightTf(x))
Personally I would use the quanteda package for all of this work. But for now this should help you.
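For completeness, here is a rough, untested sketch of the quanteda route mentioned above, assuming the same data frame with "doc_id" and "text" columns:
library(quanteda)
# build a corpus directly from the data frame
data_qcorp <- corpus(data, docid_field = "doc_id", text_field = "text")
# tokenize, drop punctuation/numbers/stopwords, then form 1- to 3-grams
toks <- tokens(data_qcorp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_ngrams(toks, n = 1:3, concatenator = "_")
dfm_ngram <- dfm(toks)
dim(dfm_ngram)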

Extract numerical value from PDF chart to a variable in R

I'm trying to pull a numerical value from a chart that's been embedded in a pdf.
I tried the two methods below, but while I was able to convert all of the other information to xlsx, I could not extract the line chart information.
Link to the pdf:
http://blog.mass.gov/publichealth/wp-content/uploads/sites/11/2018/01/Weekly-Flu-Report-01-19-2018.pdf
The value that I need to pull into a variable
1st Method
library(pdftools)
library(stringr)
library(xlsx)
set.seed(100)
tx <- pdf_text("flureport.pdf")
tx2 <- unlist(str_split(tx, "[\\r\\n]+"))
tx3 <- str_split_fixed(str_trim(tx2), "\\s{2,}", 5)
write.xlsx(tx3, file="ds.xlsx")
2nd Method
library('tm')
file <- 'flureport.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
c<-data.frame(corpus.array)
write.xlsx(c, file="x.xlsx")
Neither of the xlsx files I wrote contained any chart information from which I could fetch the value.
This is the solution that worked for me; I am not sure whether it would work in all cases, but it did work in this particular case.
Thanks @user2554330 for mentioning OCR.
library(pdftools)
library(stringr)
library(tesseract)
library(magick)
library(magrittr)
# avoid calling the vector "list", which masks base::list()
urls <- c('http://blog.mass.gov/publichealth/wp-content/uploads/sites/11/2018/01/Weekly-Flu-Report-01-19-2018.pdf')
sapply(urls, function(x)
  pdf_convert(x, format = "png", pages = NULL, filenames = NULL, dpi = 300,
              opw = "", upw = "", verbose = TRUE))
text <- image_read("Weekly-Flu-Report-01-19-2018_1.png") %>%
image_resize("2000") %>%
image_convert(colorspace = 'gray') %>%
image_trim() %>%
image_ocr()
a <- print(text)
massili <- regmatches(a, gregexpr("\\d+(\\.\\d+){0,1} %", a))[[1]]
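As a quick illustration of what that last regmatches() call extracts (the string below is made up for the example, not taken from the report):
x <- "flu-like illness accounted for 2.5 % of visits"
regmatches(x, gregexpr("\\d+(\\.\\d+){0,1} %", x))[[1]]
# [1] "2.5 %"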
