Unable to get n-gram word cloud in R

I am trying to create a word cloud for bigrams (and higher-order n-grams) using the code below:
library(tm)
library(RWeka)
library(wordcloud)
library(RColorBrewer)
# read the text file as character (scan defaults to numeric input)
text_input <- scan("Path/Wordcloud.txt", what = "character")
corpus <- Corpus(VectorSource(text_input))
corpus.ng <- tm_map(corpus, removeWords, c(stopwords(), "s", "ve"))
corpus.ng <- tm_map(corpus.ng, removePunctuation)
corpus.ng <- tm_map(corpus.ng, removeNumbers)
# bigram tokenizer passed to the term-document matrix
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.bigram <- TermDocumentMatrix(corpus.ng, control = list(tokenize = BigramTokenizer))
tdm.bigram
freq <- sort(rowSums(as.matrix(tdm.bigram)), decreasing = TRUE)
freq.df <- data.frame(word = names(freq), freq = freq)
head(freq.df, 20)
pal <- brewer.pal(8, "Blues")
pal <- pal[-(1:3)]
wordcloud(freq.df$word, freq.df$freq, max.words = 100, random.order = FALSE, colors = pal)
I have seen similar code on a few websites used for generating n-grams, but I am getting only single-word combinations in my output.
The output does not respond to changes in min and max being set to different values (2, 3, 4, etc.) in the NGramTokenizer function.
Am I missing something in the code, or is it possible that one of the libraries I am calling (tm, ggplot2, wordcloud, RWeka) or their dependencies (like rJava) is not working properly? I would be really grateful if someone could offer some pointers on this issue or suggest modifications to the above code.
Thanks,
Saibal

You are missing the token delimiter specification. Pass it through Weka_control, and keep the tokenizer as a function:
token_delim <- " \\t\\r\\n.!?,;\"()"
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = token_delim))
This should work.
In case you need a working example, you can check this five-minute video:
https://youtu.be/HellsQ2JF2k
Hope this helps.

Also, some others have had problems using the Corpus function: in recent versions of tm, Corpus() can return a SimpleCorpus, for which TermDocumentMatrix ignores custom tokenizers, which would explain why changing min and max has no effect. Try using the volatile corpus instead:
corpus <- VCorpus(VectorSource(text_input))
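For reference, here is a minimal end-to-end sketch combining both suggestions (a volatile corpus plus an explicit tokenizer function); the file path is the placeholder from the question:
library(tm)
library(RWeka)
library(wordcloud)
library(RColorBrewer)
text_input <- scan("Path/Wordcloud.txt", what = "character")
corpus <- VCorpus(VectorSource(text_input))  # volatile corpus, not SimpleCorpus
corpus <- tm_map(corpus, removeWords, c(stopwords(), "s", "ve"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
freq <- sort(rowSums(as.matrix(tdm.bigram)), decreasing = TRUE)
freq.df <- data.frame(word = names(freq), freq = freq)
wordcloud(freq.df$word, freq.df$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
Raising min and max in Weka_control (e.g. to 3) should now produce trigrams in the cloud.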

I tried the following and it worked:
minfreq_bigram <- 2
bitoken <- NGramTokenizer(corpus, Weka_control(min = 2, max = 2))
two_word <- data.frame(table(bitoken))
sort_two <- two_word[order(two_word$Freq, decreasing = TRUE), ]
wordcloud(sort_two$bitoken, sort_two$Freq, random.order = FALSE,
          scale = c(2, 0.35), min.freq = minfreq_bigram,
          colors = brewer.pal(8, "Dark2"), max.words = 150)


k-fold cross validation in quanteda

I've been using the quanteda SML workflow as described in the quanteda tutorial (https://tutorials.quanteda.io/machine-learning/nb/) and found it extremely helpful for setting up my own classification task. However, instead of the fixed held-out train/test sampling, I would like to use k-fold cross-validation. Could you point me towards the best way to implement it in the workflow? Is there an easy way to apply it in quanteda?
Many thanks
I tried to add cross-validation based on this example:
https://rdrr.io/github/quanteda/quanteda.classifiers/man/crossval.html
require(quanteda)
require(quanteda.textmodels)
require(caret)
corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)
# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_wordstem()
dfmt_movie <- dfm(toks_movies)
# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)
# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)
tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))
actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class
# confusionMatrix() comes from caret, loaded above
confusionMatrix(tab_class, mode = "everything", positive = "pos")
# n-fold cross validation
require(crossval)
dfmat <- dfm(toks_movies)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)
crossval(tmod, k = 5, by_class = FALSE)
crossval(tmod, k = 5, by_class = FALSE, verbose = TRUE)
but it returns "Error in group.samples(Y) : argument "Y" is missing, with no default"
This should probably be a comment, but I cannot post comments yet. I think your problem is caused by using the crossval() function from the wrong package. The link you shared refers to crossval() from the remote quanteda/quanteda.classifiers package, not from the CRAN crossval package. The one you loaded expects a different pipeline because its definition is different: it requires additional X and Y arguments, and their absence is the reason for your error.
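For reference, a sketch of the intended setup, assuming the GitHub-only quanteda/quanteda.classifiers package (the one the linked page documents) is installed; the objects come from the code above:
# remotes::install_github("quanteda/quanteda.classifiers")
require(quanteda.classifiers)   # provides crossval() for quanteda textmodels
dfmat <- dfm(toks_movies)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)  # dispatches on the fitted textmodel, no X/Y arguments needed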

Transaction problem in RStudio for tweet apriori analysis

I want to use the apriori algorithm to mine association rules between words in the tweet database I have, using RStudio. However, the code below throws an error on a million rows of data, while it works on a small amount of data. I need your help, as I couldn't understand what causes the error.
TweetTrans <- read.transactions("../input/tweets/output.csv",
                                rm.duplicates = FALSE,
                                format = "basket",
                                sep = ",",
                                encoding = "UTF-8")
The Error is:
Error in validObject(.Object): invalid class “ngCMatrix” object: row indices are not sorted within columns
Traceback:
1. read.transactions("../input/tweets/output.csv", rm.duplicates = FALSE,
. format = "basket", sep = ",", encoding = "UTF-8")
2. as(data, "transactions")
3. asMethod(object)
4. new("transactions", as(from, "itemMatrix"), itemsetInfo = data.frame(transactionID = names(from),
. stringsAsFactors = FALSE))
5. initialize(value, ...)
6. initialize(value, ...)
7. callNextMethod()
8. .nextMethod(.Object = .Object, ... = ...)
9. callNextMethod()
10. .nextMethod(.Object = .Object, ... = ...)
11. as(from, "itemMatrix")
12. asMethod(object)
13. new("ngCMatrix", p = c(0L, p), i = as.integer(i) - 1L, Dim = c(length(levels(i)),
. length(p)))
14. initialize(value, ...)
15. initialize(value, ...)
16. callNextMethod()
17. .nextMethod(.Object = .Object, ... = ...)
18. validObject(.Object)
19. stop(msg, ": ", errors, domain = NA)
Here are some ideas for how to find a rogue line in the data file. The input to read.transactions should be a text file that looks something like
A, B, C
B, C
C, D, E
D, A, B, F
where A, B, C, etc. are the names of the items (probably longer than one character each!)
So you could read in the file using readLines...
data <- readLines("../input/tweets/output.csv")
Each element of data (one per line of the file) should be a string of the form "A, B, C" etc., as above.
You could then use functions (e.g. from the stringr package) to check if any lines contain unusual characters or have an odd format. Without seeing your file, it is hard to say how to do this, but you might, for example, look for quotes in odd places (str_detect(data, '\\"')) or characters that are not letters, digits, spaces or commas (str_detect(data, "[^\\w\\d\\s,]")), as sketched below.
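For instance, a small sketch of those two checks using stringr (the path is the one from the question):
library(stringr)
data <- readLines("../input/tweets/output.csv")
which(str_detect(data, '\\"'))             # lines containing quotes
which(str_detect(data, "[^\\w\\d\\s,]"))   # lines with characters other than letters, digits, spaces or commas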
Another thing you could try is to write a for loop to take each element of data (or perhaps larger chunks if that is too slow), save it as a file, try reading it with read.transactions, and see where it crashes.
for (i in seq_along(data)) {
  writeLines(data[i], "dummyfile.csv")
  trans <- read.transactions("dummyfile.csv",
                             rm.duplicates = FALSE,
                             format = "basket",
                             sep = ",",
                             encoding = "UTF-8")
}
The value of i when it crashes will give you the problem row number. It might take a long time to run, though!
I ran into a very similar problem: the same error got triggered when trying to cast a list to a transaction object.
I also couldn't easily figure out which lines in the data caused the issue, as it seems to be triggered by a combination of transactions and not necessarily by any individual one, but I managed to track down the source of the problem to this assignment (source):
p <- new("ngCMatrix", p = c(0L, p),
         i = as.integer(i) - 1L,
         Dim = c(length(levels(i)), length(p)))
My R has gotten pretty rusty over time and I couldn't find an immediate way to patch the code, but I came up with an alternative solution for constructing the ngCMatrix object:
1. Assume you have the data in a data.frame following some sort of (user, item) format; in your case it would most likely be (tweet_id, term/word).
2. Create a unique incremental ID for every user and item and add it to your data.frame.
3. Use those IDs to create the sparse matrix and, optionally, enrich it with the labels for item and user to make it more interpretable.
4. Finally, cast the sparse matrix to a transactions object.
Example (I implemented mine with data.table, but a traditional data.frame implementation would be very similar):
library(Matrix)
library(data.table)
library(arules)
DT <- data.table(user = c('A','A','B','B','A','C','D'),
                 item = c('AAB','AAA','AAB','BBB','ABA','BBB','AAB'))
# Create user IDs
unique_users <- unique(DT$user)
users <- data.table(user = unique_users,
                    user_id = c(1:length(unique_users)))
# Repeat for items
unique_items <- unique(DT$item)
items <- data.table(item = unique_items,
                    item_id = c(1:length(unique_items)))
# Add indexes to the original data table (setting keys helps with performance)
DT <- merge.data.table(x = DT, y = users, by = 'user')
DT <- merge.data.table(x = DT, y = items, by = 'item')
# Create the sparse matrix
mat <- sparseMatrix(
  i = DT$item_id,
  j = DT$user_id,
  dims = c(nrow(items), nrow(users)),
  dimnames = list(items$item, users$user)
)
# transform to arules 'transactions' (note: cast the matrix built above, mat)
txn <- as(mat, "transactions")
Please note that this doesn't help with understanding what caused the issue, but rather provides a workaround to solve it. In my data.table implementation the code is pretty performant, taking only a few seconds to process over 30M transactions on a laptop-sized machine (2 CPUs, 16 GB RAM).
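As a quick sanity check on the result, the standard arules accessors can be used (a small usage note, assuming the code above ran through):
summary(txn)       # transaction/item counts and density
inspect(txn[1:3])  # peek at the first few transactions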

R: Collect All Function Definitions from a Library

I am working with R. I found this previous post on Stack Overflow, which shows how to get a list of all functions that belong to a given library:
How to find all functions in an R package?
For example:
#load desired library
library(ParBayesianOptimization)
#find out all functions from this library
getNamespaceExports("ParBayesianOptimization")
[1] "addIterations" "getLocalOptimums" "bayesOpt" "getBestPars" "changeSaveFile" "updateGP"
The above code tells me the name of all functions that are used in the "ParBayesianOptimization" library. From here, I could manually inspect each one of these functions - for example:
# manually inspect any one of these functions
getAnywhere(bayesOpt)
A single object matching ‘bayesOpt’ was found
It was found in the following places
package:ParBayesianOptimization
namespace:ParBayesianOptimization
with value
# function starts here
function (FUN, bounds, saveFile = NULL, initGrid, initPoints = 4,
iters.n = 3, iters.k = 1, otherHalting = list(timeLimit = Inf,
minUtility = 0), acq = "ucb", kappa = 2.576, eps = 0,
parallel = FALSE, gsPoints = pmax(100, length(bounds)^3),
convThresh = 1e+08, acqThresh = 1, errorHandling = "stop",
plotProgress = FALSE, verbose = 1, ...)
{
startT <- Sys.time()
optObj <- list()
etc etc etc ...
saveFile = saveFile, verbose = verbose, ...)
return(optObj)
}
#function ends here
<bytecode: 0x000001cbb4145db0>
<environment: namespace:ParBayesianOptimization>
Goal: Is it possible to take each one of these functions and create a notepad file with their full definitions?
My attempt:
I thought I could first make an "object" in R that contained all the functions found in this library:
library(plyr)
a = getNamespaceExports("ParBayesianOptimization")
my_list = do.call("rbind.fill", lapply(a, as.data.frame))
X[[i]]
1 addIterations
2 getLocalOptimums
3 bayesOpt
4 getBestPars
5 changeSaveFile
6 updateGP
Then, I could manually create an "assignment arrow":
header_text <- rep("<-")
Then, "paste" this to each function name:
combined_list <- as.character(paste(my_list, header_text, sep = ""))
But this is not looking correct:
combined_list
[1] "c(\"addIterations\", \"getLocalOptimums\", \"bayesOpt\", \"getBestPars\", \"changeSaveFile\", \"updateGP\")<- "
The goal is to automate the process of manually copying/pasting:
function_1 = getAnywhere("first function ParBayesianOptimization library")
function_2 = getAnywhere("second function ParBayesianOptimization library")
etc
final_list = c(function_1, function_2 ...)
And removing the generic description from each function:
A single object matching ‘bayesOpt’ was found
It was found in the following places
package:ParBayesianOptimization
namespace:ParBayesianOptimization
with value
In the end, if I were to "call" the final_list object, all the functions from this library should get recreated and reassigned.
Can someone please show me how to do this?
Thanks
You can use the dump function for this:
pkg <- "ParBayesianOptimization"
dump(getNamespaceExports(pkg), file="funs.R", envir = asNamespace(pkg))
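Sourcing that file later recreates and reassigns every exported function in the current session, which is exactly the final_list behaviour asked for:
source("funs.R")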
This code will help you write the function definitions of all the functions in a library to a text file.
fn_list <- getNamespaceExports("ParBayesianOptimization")
for (i in seq_along(fn_list)) {
  header <- paste('\n\n####Function', i, '\n\n\n')
  cat(paste0(header, paste0(getAnywhere(fn_list[i]), collapse = '\n'), '\n\n'),
      file = 'function.txt', append = TRUE)
}

Loop in R through variable names with values as endings and create new variables from the result

I have 24 variables called empl_1 to empl_24 (e.g. empl_2, empl_3, ...).
I would like to write a loop in R that takes the values 1-24 and puts them in the respective places, so that the corresponding variables are either called or created with i = 1-24. The sample below shows what I would like to have within the loop (e.g. ye1_ipw to ye24_ipw; ipw_atet_1 to ipw_atet_24, and so on).
ye1_ipw <- empl$empl_1[insample==1]
ipw_atet_1 <- treatweight(y=ye1_ipw, d=treat_ipw, x=x1_ipw, ATET =TRUE, trim=0.05, boot = 2)
ipw_atet_1
ipw_atet_1$se
ye2_ipw <- empl$empl_2[insample==1]
ipw_atet_2 <- treatweight(y=ye2_ipw, d=treat_ipw, x=x1_ipw, ATET =TRUE, trim=0.05, boot = 2)
ipw_atet_2
ipw_atet_2$se
ye3_ipw <- empl$empl_3[insample==1]
ipw_atet_3 <- treatweight(y=ye3_ipw, d=treat_ipw, x=x1_ipw, ATET =TRUE, trim=0.05, boot = 2)
ipw_atet_3
ipw_atet_3$se
Coming from a Stata environment, I tried
for (i in seq_along(empl_list)){
  ye[i]_ipw <- empl$empl_[i][insample==1]
  ipw_atet_[i] <- treatweight(y=ye[i]_ipw, d=treat_ipw, x=x1_ipw, ATET=TRUE, trim=0.05, boot=2)
}
However, this does not work at all. Do you have any idea how to approach this problem by writing a nice loop? Thank you so much for your help =)
You can try lapply:
result <- lapply(empl[paste0('empl_', 1:24)], function(x)
  treatweight(y = x[insample==1], d = treat_ipw,
              x = x1_ipw, ATET = TRUE, trim = 0.05, boot = 2))
result would be a list storing the output for all 24 variables in the same object, which is easier to manage and process than having separate vectors.
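The list keeps the input column names, so individual results (and their standard errors, assuming treatweight returns a $se component as in the question) can be pulled out directly:
names(result)          # "empl_1" ... "empl_24"
result[["empl_1"]]$se  # standard error for the first outcome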

Get function's title from documentation

I would like to get the title of a base function (e.g. rnorm) in one of my scripts. That is included in the documentation, but I have no idea how to "grab" it.
I mean the line given in the Rd files as \title{}, or the top line in the documentation.
Is there any simple way to do this without calling the Rd_db function from tools and parsing all Rd files, which has a very big overhead for this simple task? Another thing: I tried parse_Rd too, but:
I do not know which Rd file holds my function, and
I have no Rd files on my system (just rdb, rdx and rds files).
So a function to parse the (offline) documentation would be the best :)
POC demo:
> get.title("rnorm")
[1] "The Normal Distribution"
If you look at the code for help, you see that the function index.search seems to be what is pulling in the location of the help files, and that the default for the associated find.package() function is NULL. It turns out that there is neither a help page for that function nor is it exported, so I tested the usual suspects for which package it lives in (base, tools, utils) and ended up with "utils":
utils:::index.search("+", find.package())
#[1] "/Library/Frameworks/R.framework/Resources/library/base/help/Arithmetic"
So:
ghelp <- utils:::index.search("+", find.package())
gsub("^.+/", "", ghelp)
#[1] "Arithmetic"
ghelp <- utils:::index.search("rnorm", find.package())
gsub("^.+/", "", ghelp)
#[1] "Normal"
What you are asking for is \title{Title}, but here I have shown you how to find the specific Rd file to parse, and it sounds as though you already know how to do that.
EDIT: @Hadley has provided a method for getting all of the help text, once you know the package name, so applying that to the index.search() value above:
target <- gsub("^.+/library/(.+)/help.+$", "\\1",
               utils:::index.search("rnorm", find.package()))
doc.txt <- pkg_topic(target, "rnorm") # assuming both of Hadley's functions are here
print(doc.txt[[1]][[1]][1])
#[1] "The Normal Distribution"
It's not completely obvious what you want, but the code below will get the Rd data structure corresponding to the topic you're interested in; you can then manipulate that to extract whatever you want.
There may be simpler ways, but unfortunately very little of the needed code is exported and documented. I really wish there was a base help package.
pkg_topic <- function(package, topic, file = NULL) {
  # Find "file" name given topic name/alias
  if (is.null(file)) {
    topics <- pkg_topics_index(package)
    topic_page <- subset(topics, alias == topic, select = file)$file
    if (length(topic_page) < 1)
      topic_page <- subset(topics, file == topic, select = file)$file
    stopifnot(length(topic_page) >= 1)
    file <- topic_page[1]
  }
  rdb_path <- file.path(system.file("help", package = package), package)
  tools:::fetchRdDB(rdb_path, file)
}
pkg_topics_index <- function(package) {
  help_path <- system.file("help", package = package)
  file_path <- file.path(help_path, "AnIndex")
  if (length(readLines(file_path, n = 1)) < 1) {
    return(NULL)
  }
  topics <- read.table(file_path, sep = "\t",
                       stringsAsFactors = FALSE, comment.char = "",
                       quote = "", header = FALSE)
  names(topics) <- c("alias", "file")
  topics[complete.cases(topics), ]
}
