How to do topic modelling? - r

I was trying my hand at topic modeling for the first time. I tried running the vignette code but am getting the following errors:
library(pacman)
p_load("tm", "SnowballCC", "RColorBrewer", "ggplot2", "wordcloud", "biclust",
"cluster", "igraph", "fpc", "Rcampdf")
p_load("topicmodels", "devtools", "ldatuning")
data("AssociatedPress", package="topicmodels")
dtm <- AssociatedPress[1:10, ]
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
knitr::kable(result)
Error in subset.default(values, select = 2:ncol(values)) :
argument "subset" is missing, with no default
FindTopicsNumber_plot(result)
Error in seq_len(m) : argument must be coercible to non-negative integer
In addition: Warning messages:
1: In rep(digits, length.out = m) :
first element used of 'length.out' argument
2: In seq_len(m) : first element used of 'length.out' argument
I need help resolving these errors. Thank you.

Related

Create a dtm using tokens

I am trying to run this command:
dtm <- CreateDtm(tokens$text,
doc_names = tokens$ID,
ngram_window = c(1, 2))
However, I receive this error:
Error in seq.default(1, length(tokens), 5000) :
wrong sign in 'by' argument
In addition: Warning message:
In CreateDtm(tokens$text, doc_names = tokens$ID, ngram_window = c(1, :
No document names detected. Assigning 1:length(doc_vec) as names.
Any idea what I have to change in order to run it properly?
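One possible reading of the error, offered as a guess rather than a confirmed diagnosis: the call seq.default(1, length(tokens), 5000) shown in the message fails in exactly this way when the length is 0, and the warning about missing document names points in the same direction, so tokens$text and tokens$ID may be NULL or empty. The base R sketch below reproduces the message with an empty vector and then shows a sanity check to run before calling CreateDtm. The `tokens` data frame in it is a hypothetical stand-in, not the asker's data.

```r
# Reproduce the message: seq(1, 0, 5000) is what the failing call reduces to
# when the document vector has length 0.
empty_docs <- character(0)
msg <- tryCatch(
  seq(1, length(empty_docs), 5000),
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical stand-in for the asker's object (column names assumed).
tokens <- data.frame(
  ID = c("doc1", "doc2"),
  text = c("topic models in r", "document term matrix"),
  stringsAsFactors = FALSE
)

# Checks worth running before building the DTM: both columns must exist,
# the text must be a non-empty character vector, and lengths must match.
inputs_ok <- !is.null(tokens$text) &&
  !is.null(tokens$ID) &&
  is.character(tokens$text) &&
  length(tokens$text) > 0 &&
  length(tokens$ID) == length(tokens$text)
print(inputs_ok)
```

If `inputs_ok` is FALSE for the real object, `names(tokens)` and `str(tokens)` should show which column is missing or misnamed.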

R: M3C library - Duplicate row.names error message

I am trying to run consensus clustering using the M3C library in R. My dataset contains 451 samples and ~2500 genes. The row names are the ENTREZ IDs (numeric values) of the genes. I have cross-checked the dataset using the "any(duplicated(colnames(MyData)))" command to make sure that there are no duplicate entries in the row names. I ran the following command to perform the consensus clustering using the M3C library:
res <- M3C(MyData, cores=8, seed = 123, des = annotation, removeplots = TRUE, analysistype = 'chi', doanalysis = TRUE, variable = 'class')
I am getting the following error:
Warning message:
"non-unique values when setting 'row.names': "
Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
Traceback:
1. M3C(MyData, cores = 8, seed = 123, des = meta, removeplots = TRUE,
. analysistype = "chi", doanalysis = TRUE, variable = "class")
2. M3Creal(as.matrix(mydata), maxK = maxK, reps = repsreal, pItem = 0.8,
. pFeature = 1, clusterAlg = clusteralg, distance = distance,
. title = "/home/christopher/Desktop/", printres = printres,
. showheatmaps = showheatmaps, printheatmaps = printheatmaps,
. des = des, x1 = pacx1, x2 = pacx2, seed = seed, removeplots = removeplots,
. silent = silent, doanalysis = doanalysis, analysistype = analysistype,
. variable = variable, fsize = fsize, method = method)
3. `row.names<-`(`*tmp*`, value = newerdes$ID)
4. `row.names<-.data.frame`(`*tmp*`, value = newerdes$ID)
5. `.rowNamesDF<-`(x, value = value)
6. stop("duplicate 'row.names' are not allowed")
Can anyone please help me to resolve the issue?
Thanks
I ran the equivalent of the following using M3C:
df_wide_matrix # my expression matrix
any(duplicated(colnames(df_wide_matrix))) # result = FALSE
M3C::M3C(df_wide_matrix, iters=2, repsref=2, repsreal=2, clusteralg="hc", objective="PAC")
I ran into the exact same error message with M3C, in addition to:
In addition: Warning message:
non-unique values when setting 'row.names': ‘ABCDEF’, ‘ABCDGH’
I assumed the issue was caused by the fact that the first four characters of these two feature names are identical. I therefore temporarily changed their names before running M3C:
dup_ids <- which(colnames(df_wide_matrix) %in% c("ABCDEF", "ABCDGH"))
colnames(df_wide_matrix)[dup_ids] <- c("A", "B")
M3C::M3C(df_wide_matrix, iters=2, repsref=2, repsreal=2, clusteralg="hc", objective="PAC")
M3C then runs correctly. Not an ideal solution, but it worked for me. I've posted it as an issue: https://github.com/crj32/M3C/issues/6.
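A small base R sketch for locating the offending values, under the assumption (suggested by the traceback, where the failing step is row.names<- with value = newerdes$ID) that the duplicates sit in the ID column of the annotation data frame passed as des rather than in the expression matrix. The `des` data frame here is made up for illustration.

```r
# Hypothetical annotation table with one repeated sample ID.
des <- data.frame(
  ID = c("s1", "s2", "s2", "s3"),
  class = c("a", "b", "b", "a"),
  stringsAsFactors = FALSE
)

# List each ID that occurs more than once.
dup_ids <- unique(des$ID[duplicated(des$ID)])
print(dup_ids)

# Assigning these IDs as row names reproduces the error from the question.
msg <- tryCatch(
  suppressWarnings({
    row.names(des) <- des$ID
    "row names set without error"
  }),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Running the same `duplicated()` check on `rownames()` and `colnames()` of the expression matrix as well covers both orientations, since the check in the question only looked at column names.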

Eventstudies package r phys2eventtime null

I am using the eventstudies package in R and I am struggling to replicate the eventstudy function call that is used as an example in the package, shown below:
# Event study without adjustment
es <- eventstudy(firm.returns = StockPriceReturns,
event.list = SplitDates,
event.window = 7,
type = "None",
to.remap = TRUE,
remap = "cumsum",
inference = TRUE,
inference.strategy = "bootstrap")
I use my own data for StockPriceReturns and SplitDates, and even though I have tried to give them the same format and types as the example data, I think this is where it is failing. The message I am getting is:
Error in phys2eventtime(z = returns$firm.returns, events = event.list[i, :
events$name should a character class.
In addition: Warning message:
In is.na(events$name) : is.na() applied to non-(list or vector) of type 'NULL'
Any help would be much appreciated.
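The warning says events$name is NULL, which suggests the event list has no column called name at all. Below is a minimal base R sketch of reshaping a differently named table into that layout. The input data frame and its column names are hypothetical, and the second column name, when, is what I recall the package's SplitDates example using, so it is worth confirming with str(SplitDates) first.

```r
# Hypothetical user data: firm identifiers stored as a factor, dates as text.
my_events <- data.frame(
  firm = factor(c("FirmA", "FirmB")),
  date = c("2020-01-02", "2020-03-04")
)

# Rebuild the event list with a character `name` column and a Date column
# (`when` is an assumed column name, see the note above).
event_list <- data.frame(
  name = as.character(my_events$firm),
  when = as.Date(as.character(my_events$date)),
  stringsAsFactors = FALSE
)
str(event_list)
```

The values in name would also need to match the column names of the returns object for the events to be found.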

Questions of xgboost with R

I used xgboost to do logistic regression. I followed the steps from, but I ran into two problems. The datasets are found here.
First, when I run the following code:
bst <- xgboost(data = sparse_matrix, label = output_vector,nrounds = 39,param)
Then, I got
[0]train-rmse:0.350006
[1]train-rmse:0.245008
[2]train-rmse:0.171518
[3]train-rmse:0.120065
[4]train-rmse:0.084049
[5]train-rmse:0.058835
[6]train-rmse:0.041185
[7]train-rmse:0.028830
[8]train-rmse:0.020182
[9]train-rmse:0.014128
[10]train-rmse:0.009890
[11]train-rmse:0.006923
[12]train-rmse:0.004846
[13]train-rmse:0.003392
[14]train-rmse:0.002375
[15]train-rmse:0.001662
[16]train-rmse:0.001164
[17]train-rmse:0.000815
[18]train-rmse:0.000570
[19]train-rmse:0.000399
[20]train-rmse:0.000279
[21]train-rmse:0.000196
[22]train-rmse:0.000137
[23]train-rmse:0.000096
[24]train-rmse:0.000067
[25]train-rmse:0.000047
[26]train-rmse:0.000033
[27]train-rmse:0.000023
[28]train-rmse:0.000016
[29]train-rmse:0.000011
[30]train-rmse:0.000008
[31]train-rmse:0.000006
[32]train-rmse:0.000004
[33]train-rmse:0.000003
[34]train-rmse:0.000002
[35]train-rmse:0.000001
[36]train-rmse:0.000001
[37]train-rmse:0.000001
[38]train-rmse:0.000000
train-rmse is finally equal to 0! Is that normal? As far as I know, train-rmse usually can't be equal to 0. So, where is my problem?
Second, when I run
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
Then I got an error:
Error in eval(expr, envir, enclos) : object 'Yes' not found.
I don't know what it means; maybe the first problem leads to the second one.
library(data.table)
train_x<-fread("train_x.csv")
str(train_x)
train_y<-fread("train_y.csv")
str(train_y)
train<-merge(train_y,train_x,by="uid")
train$uid<-NULL
test<-fread("test_x.csv")
require(xgboost)
require(Matrix)
sparse_matrix <- sparse.model.matrix(y~.-1, data = train)
head(sparse_matrix)
output_vector = train[,y] == "Marked"
param <- list(objective = "binary:logistic", booster = "gblinear",
nthread = 2, alpha = 0.0001,max.depth = 4,eta=1,lambda = 1)
bst <- xgboost(data = sparse_matrix, label = output_vector,nrounds = 39,param)
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
I ran into the same problem (Error in eval(expr, envir, enclos) : object 'Yes' not found.) and the reason was the following:
I tried to do
dt = data.table(x = runif(10), y = 1:10, z = 1:10)
label = as.logical(dt$z)
train = dt[, z := NULL]
trainAsMatrix = as.matrix(train)
label = as.matrix(label)
bst <- xgboost(data = trainAsMatrix, label = label, max.depth = 8,
eta = 0.3, nthread = 2, nround = 50, objective = "reg:linear")
bst$featureNames = names(train)
xgb.importance(model = bst)
The problem comes from the line
label = as.logical(dt$z)
I had this line in there because the last time I used xgboost, I wanted to predict a categorical variable. Now, since I want to do regression, it should read:
label = dt$z
Maybe something similar causes the problem in your case?
Perhaps this helps. I often get the same error when the labels have zero variation. I am using the current CRAN version of xgboost, which is already somewhat old (0.4.4). xgb.train happily accepts such labels (showing a .50 AUC), but the error then appears when calling xgb.importance.
Cheers
Otto
[0] train-auc:0.500000 validate-auc:0.500000
[1] train-auc:0.500000 validate-auc:0.500000
[2] train-auc:0.500000 validate-auc:0.500000
[3] train-auc:0.500000 validate-auc:0.500000
[4] train-auc:0.500000 validate-auc:0.500000
[1] "XGB error: Error in eval(expr, envir, enclos): object 'Yes' not found\n"
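A quick way to test the zero-variation explanation before training, sketched in base R. The helper name `has_label_variation` and the example column are invented for illustration; the idea is only that a comparison such as train[,y] == "Marked" yields all FALSE if the outcome is not actually coded as "Marked".

```r
# TRUE when a label vector contains at least two distinct non-missing values.
has_label_variation <- function(label) {
  length(unique(label[!is.na(label)])) >= 2
}

# Hypothetical outcome column coded as "0"/"1" instead of "Marked".
y <- c("0", "1", "0", "1")
output_vector <- y == "Marked"  # every element is FALSE

# Tabulating the label shows a single class.
print(table(output_vector))
print(has_label_variation(output_vector))

# Comparing against a value that really occurs gives a usable label.
print(has_label_variation(y == "1"))
```

If the check fails on the real data, `table(train$y)` shows which values the outcome column actually contains.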

Error in simple_triplet_matrix -- unable to use RWeka to count Phrases

Using TM, I'm comparing a DocumentTermMatrix against a dictionary list to count totals:
totals <- inspect(DocumentTermMatrix(x, list(dictionary = d)))
This works great for single words, but I want to include double words and can't figure out how to do this.
I tried RWeka:
TrigramTokenizer <- function(x) NGramTokenizer(x,
Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(v.corpus,
control = list(tokenize = TrigramTokenizer))
But I get the following error message:
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
'i, j, v' different lengths
In addition: Warning messages:
1: In parallel::mclapply(x, termFreq, control) :
all scheduled cores encountered errors in user code
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
NAs introduced by coercion.
Can you help with the Error message?
Thanks!!
See my answer here.
There seem to be problems using RWeka with the parallel package. I found a workaround solution here:
http://r.789695.n4.nabble.com/RWeka-and-multicore-package-td4678473.html#a4678948
The most important point is not to load the RWeka package, and instead to call its functions through the namespace inside an encapsulated function.
So your tokenizer should look like
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
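If RWeka still misbehaves, a different route (not the workaround from the linked thread) is a tokenizer written in plain base R, which avoids Java entirely. The function below is a rough sketch with a made-up name, `ngram_tokenizer`; it lowercases the text, splits on non-alphanumeric characters, and pastes together each run of n consecutive words. The original question wanted trigrams, so n = 3 would be the matching setting there.

```r
# Base R n-gram tokenizer sketch: returns all runs of `n` consecutive words.
ngram_tokenizer <- function(x, n = 2L) {
  words <- unlist(strsplit(tolower(as.character(x)), "[^[:alnum:]]+"))
  words <- words[nzchar(words)]
  if (length(words) < n) {
    return(character(0))
  }
  starts <- seq_len(length(words) - n + 1L)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1L)], collapse = " "),
    character(1)
  )
}

bigrams <- ngram_tokenizer("The quick brown fox", n = 2L)
trigrams <- ngram_tokenizer("The quick brown fox", n = 3L)
print(bigrams)
print(trigrams)
```

A wrapper such as function(x) ngram_tokenizer(x, 3L) could then be passed as the tokenize entry of the control list, in the same place the RWeka-based tokenizer was used.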

Resources