Normally when topic modeling I use something along the lines of:
matrix <- create_matrix(cbind(as.vector(lda_data)), language="english", removeNumbers=TRUE, weighting=weightTf)
k <- 20 #Hardcoded temp value
lda <- LDA(matrix, k, method = "Gibbs", control = list(iter = 1000, burnin = 1000))
Terms <- terms(lda, 20)
But with a mid-sized data set (3.2M rows) I get the following error when calculating the matrix:
Warning message:
In nr * nc : NAs produced by integer overflow
Error in as.matrix(textColumns) :
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
Is there a different library/approach that avoids this error? (The code works fine on small data sets)
Alternatively, when using a TermDocumentMatrix as the matrix for the LDA, my resulting Terms are entirely numerical; is there a way to get strings (words) instead?
I've used an alternate approach to creating the matrix which works on the large data set:
dtm <- DocumentTermMatrix(donation_message,
control = list(stemming = TRUE, stopwords = TRUE,
removeNumbers = TRUE, removePunctuation = TRUE))
dtm <- removeSparseTerms(dtm, 0.99)
rowTotals <- apply(dtm, 1, sum) # Find the sum of words in each document
dtm <- dtm[rowTotals > 0, ]     # Remove all docs without words
k <- 20 #Hardcoded temp value
lda <- LDA(dtm, k, method = "Gibbs", control = list(iter = 1000, burnin = 1000)) #seed = 1000, thin = 100
Terms <- terms(lda, 20)
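If memory is still tight at this scale, one small tweak (a hedged suggestion, not from the original post) is to compute the row totals with slam, the sparse-matrix package that tm uses internally. apply() typically has to coerce the DocumentTermMatrix to a dense matrix, which is where the size blows up, while slam::row_sums() stays sparse:
library(slam)
# row_sums() operates on the sparse DocumentTermMatrix directly,
# so the matrix is never expanded to a dense form
rowTotals <- row_sums(dtm)
dtm <- dtm[rowTotals > 0, ]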
I am trying to generate a for loop that will repeat a sequence of the following:
sample(x = 1:14, size = 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4)
I want it to repeat 5000 times. So far, I have included the above as the body of the loop and added
for (i in seq_along[1:5000]){
at the beginning but I am getting an error message saying
Error in seq_along[1:10000] : object of type 'builtin' is not subsettable
We need replicate
out <- replicate(5000, sample(x = 1:14, size = 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4)), simplify = FALSE)
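With simplify = FALSE the result is a list of 5000 sampled vectors. Note that, as discussed below, x = 1:14 and a length-4 prob don't agree, so the call only runs once that mismatch is fixed (e.g. x = 1:4). If a 5000 x 10 matrix is more convenient than a list, the pieces can be bound together:
out_mat <- do.call(rbind, out)  # one sampled vector per row
dim(out_mat)
# [1] 5000   10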
There are a few issues here.
@MartinGal noted the syntax issues with seq_along and the missing ). Note that you can use seq(n) or 1:n to define the number of loops.
You are not storing the sampled vectors anywhere, so the for loop will run the code but you won't capture the output.
You have x = 1:14 but you only have 4 prob values, which suggests you intended x = 1:4 (either that or you are 10 prob values short).
Here's one way to address these issues using a for loop.
n <- 5     # number of repetitions (use 5000 for the full run)
s <- 10    # sample size per repetition
xmax <- 4  # sample from 1:xmax
p <- 1/4   # equal probability for each value
out <- matrix(nrow = n, ncol = s, byrow = TRUE)
set.seed(1L)
for (i in seq(n)) {
  out[i, ] <- sample(x = seq(xmax), size = s, replace = TRUE, prob = rep(p, xmax))
}
As andrew reece notes in his comment, it looks like you want x = 1:4. Depending on what you want to do with your result, you could generate all of the realizations at once, since you are sampling with replacement, and then store the result in a matrix with 5000 rows of 10 realizations per row. So:
x <- sample(1:4, size = 5000 * 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4))
result <- matrix(x, nrow = 5000)
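As a quick sanity check (my own addition, not part of the original answer), the empirical frequencies in x should sit close to the specified probabilities:
prop.table(table(x))  # each of 1, 2, 3, 4 should be near 0.25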
I am working on complaints data analysis, where I am adapting a text summarization technique to cut out unnecessary text and keep only the useful parts.
I have used LDA (Latent Dirichlet Allocation) in R for text summarization, but I am not able to use it to its full potential.
library(igraph)
library(iterators)
library(textmineR) # provides CreateTcm, FitLdaModel, CreateDtm, CalcHellingerDist
# create a TCM using skip-grams with a 10-word window
tcm <- CreateTcm(doc_vec = datacopy$Text, skipgram_window = 10,
                 verbose = FALSE, cpus = 2)
# LDA to get embeddings into probability space
embeddings <- FitLdaModel(dtm = tcm, k = 50, iterations = 300,
                          burnin = 180, alpha = 0.1, beta = 0.05,
                          optimize_alpha = TRUE, calc_likelihood = FALSE,
                          calc_coherence = FALSE, calc_r2 = FALSE, cpus = 2)
# Summarizer function
summarizer <- function(doc, gamma) {
  # handle multiple docs at once
  if (length(doc) > 1)
    return(sapply(doc, function(d) try(summarizer(d, gamma))))

  # parse it into sentences
  sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[1]]
  names(sent) <- seq_along(sent) # so we know index and order

  # embed the sentences in the model
  e <- CreateDtm(sent, ngram_window = c(1, 1), verbose = FALSE, cpus = 2)

  # remove any documents with 2 or fewer words
  # e <- e[rowSums(e) > 2, ]

  vocab <- intersect(colnames(e), colnames(gamma))
  e <- e / rowSums(e)
  e <- e[, vocab] %*% t(gamma[, vocab])
  e <- as.matrix(e)

  # get the pairwise distances between each embedded sentence
  e_dist <- CalcHellingerDist(e)

  # turn into a similarity matrix
  g <- (1 - e_dist) * 100

  # we don't need sentences connected to themselves
  diag(g) <- 0

  # turn into a nearest-neighbor graph
  g <- apply(g, 1, function(x) {
    x[x < sort(x, decreasing = TRUE)[3]] <- 0
    x
  })

  # by taking pointwise max, we'll make the matrix symmetric again
  g <- pmax(g, t(g))
  g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)

  # calculate eigenvector centrality
  ev <- evcent(g)

  # format the result
  result <- sent[names(ev$vector)[order(ev$vector, decreasing = TRUE)[1:3]]]
  result <- result[order(as.numeric(names(result)))]
  paste(result, collapse = " ")
}
docs <- datacopy$Text[1:10]
names(docs) <- datacopy$Reference[1:10]
sums <- summarizer(docs, gamma = embeddings$gamma)
sums
Errors:
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
  'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
  argument is of length zero
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
  'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
  argument is of length zero
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
  argument is of length zero
Actual text:
it is the council’s responsibility to deal with the loose manhole cover.
Could you provide an update on the next steps taken by the council.
** Trailing mail text follows - about 50 lines of text **
Summarized text:
it is the council’s responsibility to deal with the loose manhole cover. I have read the email thread, please get in contact with the numbers provided by ABC
I'm trying to estimate a model using speedglm in R. The dataset is large (~69.88 million rows and 38 columns). Multiplying the number of rows and columns gives ~2.7 billion, which is outside the integer limit. I can't provide the data, but the following examples recreate the issue.
library(speedglm)
# large example that works
require(biglm)
n <- 500000
k <- 500
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
working.example <- speedglm(fo, data = da, family = Gamma(log))
# repeat with large enough size to break
k <- 5000 # 10 times larger than above
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
failed.example <- speedglm(fo, data = da, family = Gamma(log))
# attempting to resolve error with chunksize
attempted.fixed.example <- speedglm(fo, data = da, family = Gamma(log), chunksize = 10^6)
This causes an error and integer overflow warning.
Error in if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n, :
missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(X) * ncol(X) : NAs produced by integer overflow
I understand the warning, but I do not understand the error. They seem to be related in this case as they appear together after each attempt.
Removing columns allows the estimation to complete. It does not seem to matter which columns are removed; removing interacted or non-interacted variables will both result in a completed estimation. The chunksize option was added after receiving the error initially, but has not helped.
My questions are: (1) what causes the first error? (2) is there a way to estimate models using data such that the number of rows by the number of columns is larger than the integer limit? (3) is there a better na.action to use in this case?
Thanks,
JP.
Running: R version 3.3.3 (2017-03-06)
Actual code below:
dft_var <- c("cltvV0", "cltvV60", "cltvV120", "VCFLBRQ", "ageV0",
"ageV1", "ageV8", "ageV80", "FICOV300", "FICOV650",
"FICOV900", "SingleHouse", "Apt", "Mobile", "Duplex",
"Row", "Modular", "Rural", "FirstTimeBuyer",
"FirstTimeBuyerMissing", "brwtotinMissing", "IncomeRatio",
"VintageBefore2001", "NFLD", "yoy.fcpwti:province_n")
logit1 <- speedglm(formula = paste("DefaultFlag ~ ",
paste(dft_var, collapse = "+"),
sep = ""),
family = binomial(logit),
na.action = na.exclude,
data = default.data,
chunksize = 1*10^7)
Update:
Based on my investigation below, @James figured out that the problem can be avoided by providing a non-NULL value for the sparse parameter in the call to speedglm, as this prevents the internal call to the is.sparse function.
Using the example above, the following should now work:
speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
My original answer:
Both the warning and the error come from the same line in the function is.sparse in the package speedglm.
The line is:
sample(X,round((nrow(X)*ncol(X)*camp),digits=0),replace=FALSE)
The warning happens because of the use of nrow(X)*ncol(X) for a large matrix. The nrow and ncol functions return integer values, which can overflow. Here is an illustration:
nr = 1000000L
nc = 1000000L
nr*nc
# [1] NA
# Warning message:
# In nr * nc : NAs produced by integer overflow
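For what it's worth (an illustration of my own, not from the original answer), doing the product in double precision shows the numbers themselves are fine; it is the integer arithmetic that overflows:
as.numeric(nr) * nc
# [1] 1e+12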
The error happens because the sample function is confused when X is a large matrix and size = NA. Here is an illustration:
sample(matrix(1,3000,1000000), NA, replace=FALSE)
# Error in if (useHash) .Internal(sample2(n, size)) else .Internal(sample(n, :
# missing value where TRUE/FALSE needed
Thanks to @Andrey's guidance I was able to solve the problem. The issue was the sample call in the is.sparse check. To bypass this I set sparse = FALSE in the speedglm call (this should work for sparse = TRUE as well, though I haven't tried it). This is because speedglm calls is.sparse via speedglm.wfit in the following way:
if (is.null(sparse))
    sparse <- is.sparse(x = x, sparsellim, camp)
So setting sparse avoids the is.sparse function.
Using the example above, the following should now work:
speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
I face the following issue. I extracted tweets and now have a collection of about 500 tweets in one CSV file. With these I would like to generate topics using the LDA model. So far so good. I do receive the topics, but now I would like to know which tweets belong to which topics, and I just don't know how to do that...
I have a CSV file where I numbered each tweet. I thought this way I could get the corresponding tweets which built a topic, but the command topics(lda) doesn't work. Maybe someone out there can help me, please^^
This is the CSV file I use (LDA_start); it has only two columns (number, text).
And this is the code I use, which I found in a tutorial for LDA modelling (I'm an R beginner):
library("SocialMediaLab")
library("topicmodels")
library("slam")
library("Rmpfr")
library("tm")
library("stringr")
myData = read.csv("LDA_start.csv", header = TRUE)
tweetCorpus <- VCorpus(VectorSource(myData))
myStopwords <- c(stopwords('english'))
tweetCorpus <- tm_map(tweetCorpus, removeWords, myStopwords)
dtmTopicModeling <- DocumentTermMatrix(tweetCorpus,
  control = list(stemming = TRUE, tolower = TRUE, removeNumbers = TRUE,
                 removePunctuation = TRUE, wordLengths = c(3, 30)))
harmonicMean <- function(logLikelihoods, precision=2000L) {
llMed <- median(logLikelihoods)
as.double(llMed - log(mean(exp(-mpfr(logLikelihoods,
prec = precision) + llMed))))
}
burnin = 1000
iter = 1000
keep = 50
sequ <- seq(2, 100, 5)
fitted_many <- lapply(sequ, function(k) LDA(dtmTopicModeling, k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) ))
logLiks_many <- lapply(fitted_many, function(L) L@logLiks[-c(1:(burnin/keep))])
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))
k <- sequ[which.max(hm_many)]
seedNum <- 42
lda <- LDA(dtmTopicModeling, k = k, method = "Gibbs", control = list(burnin = burnin, iter = iter, keep = keep, seed=seedNum))
write.csv(terms(lda,50), "TopicModel.csv")
topics(lda)
1
1
Is there maybe an easier way than mine to generate topics and find the tweets which made them? I would be really grateful for answers!
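A hedged note rather than a full answer: VCorpus(VectorSource(myData)) passes the whole data frame, so the corpus may end up with one document per column instead of one per tweet, which would explain why topics(lda) returns only a couple of values. Building the source from the text column gives one document per tweet, and then topics() and posterior() from topicmodels can be joined back to the tweet numbers. A minimal sketch, assuming the text column is called text and that no empty documents get dropped before fitting:
# one document per tweet instead of one per column
tweetCorpus <- VCorpus(VectorSource(as.character(myData$text)))
# ... same preprocessing and LDA fitting as above ...
# most likely topic for each tweet, joined to the tweet numbers
assignments <- data.frame(number = myData$number, topic = topics(lda))
write.csv(assignments, "TweetTopics.csv", row.names = FALSE)
# full per-tweet topic distribution, if soft assignments are preferred
topicProbs <- posterior(lda)$topics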
I am trying to compute prediction intervals for my neural network created with the neuralnet package.
I use R within Tableau Software, by creating .RData files containing my functions and loading them in Tableau.
It's a simple NN with one hidden layer containing 5 nodes. I searched and found this package: nnetpredint.
So I tried to use it, using their examples.
I also tried to change the way I use it (train/test in the same data frame, separate data frames with the same column names, etc.).
And the best result I had was the prediction, but without the lowerBound and upperBound columns.
In fact, I got exactly the same result as when I use compute(myNN, etc.), but I don't have the second and third columns.
Thanks for your help,
EDIT:
My data comes from Tableau; my function takes five parameters, which are:
ValuesToExplain, train1, train2, test1, test2.
Then, I create and train my NN with the first three and try to compute the last two
(test1 = k*train1 and test2 = k2*train2 for now, but it will probably change in the future).
Here is my whole code:
NNetwork <- function(objectiveValues, knownValues1, knownValues2, newData, newData2) {
  numberOfColumn = 3

  ## Create the training dataframe
  training <- data.frame(objectiveValues, knownValues1, knownValues2)
  training[which(is.na(training[, "objectiveValues"])), "objectiveValues"] <- mean(training[, "objectiveValues"], na.rm = TRUE)
  training[which(is.na(training[, "knownValues1"])), "knownValues1"] <- mean(training[, "knownValues1"], na.rm = TRUE)
  training[which(is.na(training[, "knownValues2"])), "knownValues2"] <- mean(training[, "knownValues2"], na.rm = TRUE)

  ## Create the testing dataframe
  testing <- data.frame(objectiveValues, newData, newData2)
  names(testing) <- c("objectiveValues", "knownValues1", "knownValues2")
  testing[which(is.na(testing[, "objectiveValues"])), "objectiveValues"] <- mean(testing[, "objectiveValues"], na.rm = TRUE)
  testing[which(is.na(testing[, "knownValues1"])), "knownValues1"] <- mean(testing[, "knownValues1"], na.rm = TRUE)
  testing[which(is.na(testing[, "knownValues2"])), "knownValues2"] <- mean(testing[, "knownValues2"], na.rm = TRUE)

  ## Scaling
  maxs <- apply(training, 2, max)
  mins <- apply(training, 2, min)
  trainingScaled <- as.data.frame(scale(training, center = mins, scale = maxs - mins))
  testingScaled <- as.data.frame(scale(testing, center = mins, scale = maxs - mins))

  ### NeuralNetwork part
  library(neuralnet)
  n <- names(trainingScaled)
  f <- as.formula(paste("objectiveValues ~", paste(n[!n %in% "objectiveValues"], collapse = " + ")))

  # Training NN
  nn <- neuralnet(f, data = trainingScaled, hidden = 5, linear.output = TRUE)

  # Using NN
  computedTrainingScaled <- compute(nn, trainingScaled[, 2:numberOfColumn])
  computedFromNNScaled <- compute(nn, testingScaled[, 2:numberOfColumn])

  # Unscaling
  computedTraining <- computedTrainingScaled$net.result * (max(training$objectiveValues) - min(training$objectiveValues)) + min(training$objectiveValues)
  computedFromNN <- computedFromNNScaled$net.result * (max(training$objectiveValues) - min(training$objectiveValues)) + min(training$objectiveValues)

  RSquare = (1 - ((sum((training$objectiveValues - computedTraining)^2)) / (sum((training$objectiveValues - mean(training$objectiveValues))^2)))) * 100
  RSE = sum((training$objectiveValues - computedTraining)^2) / nrow(training)
  res <- (1:nrow(training))

  library(nnetpredint) # Getting prediction confidence interval
  x <- trainingScaled[, 2:numberOfColumn]
  y <- trainingScaled[1]
  newData <- testingScaled[, 2:numberOfColumn]

  # S3 generic method: object of nn
  yPredInt <- nnetPredInt(nn, x, y, newData)

  for (i in 1:nrow(training)) {
    res[i] <- paste(computedFromNN[i], RSquare, RSE, sep = "#")
  }

  return(res)
}

save(NNetwork, file = "NNetwork.RData")
Here, I removed the part using the nnetpredint package because it was not working, but it was like this:
library(nnetpredint)
y <- trainingScaled
x <- trainingScaled[,2:3]
newData <- testingScaled[,2:3]
yPredInt <- nnetPredInt(nn, x, y, newData)
My problem is that when I try to access yPredInt$lowerBound or yPredInt$upperBound, they don't exist.