I am a beginner in R and text mining. I have already performed the LDA and now I want to visualise my results with the LDAvis package. I have followed every step from the github example (https://ldavis.cpsievert.me/reviews/reviews.html) starting from the 'visualizing' chapter. However, I either get error notifications or empty pages.
I have tried the following:
RedditResults <- list(phi = phi,
theta = theta,
doc.length = doc.length,
vocab = vocab,
term.frequency = term.frequency)
json <- createJSON(phi = RedditResults$phi,
theta = RedditResults$theta,
doc.length = RedditResults$doc.length,
vocab = RedditResults$vocab,
term.frequency = RedditResults$term.frequency)
serVis(json, out.dir = "vis", open.browser = FALSE)
However, this gives me an error saying:
Error in cat(list(...), file, sep, fill, labels, append) :
argument 1 (type 'closure') cannot be handled by 'cat'
I reasoned this might have happened because the 'json' object is of class 'function' rather than a character string, which I read it needs to be for serVis. Therefore I tried to convert it before calling serVis by means of
RedditResults <- sapply(RedditResults, toJSON)
Resulting in the following error:
Error in run(timeoutMs) :
Evaluation error: argument must be a character vector of length 1.
I feel like I'm making a very obvious mistake somewhere, but after days of trial and error I haven't been able to spot what I should do differently.
The weirdest thing to me is that sometimes it does work, but when I try to open the html file I only see a blank page. I have tried opening it in multiple browsers, as well as allowing those browsers to display local files. I have also tried serving it with the servr package, but this gives me the same result: either an error (character vector length is not equal to 1) or an empty page.
Hope anyone can spot what I'm doing wrong. Thanks!
EDIT: objects/code underlying the code above:
Convenient to know:
I cleaned the data in corpus form (reddit_data_textcleaned) before converting it to my document-term matrix (tdm3).
After converting it to tdm3, I eliminated any 'empty' documents by excluding those with fewer than two words. Thus, 'reddit_data_textcleaned' contains more documents than are relevant, and 'tdm3' contains the data I want to work with.
'fit3' is the fitted model resulting from doing LDA on tdm3
'DTM' is the term-document matrix with exactly the same data as tdm3, but with transposed rows/columns.
I am aware that it makes very little sense to call your term-document matrix 'DTM' whilst naming your document-term matrix 'tdm3', seeing the abbreviations. Sorry about that.
phi <- as.matrix(posterior(fit3)$terms)
theta <- as.matrix(posterior(fit3)$topics)
dp <- dim(phi) # should be K x W
dt <- dim(theta) # should be D x K
D <- length(as.matrix(tdm3[, 1])) # number of documents (2812)
doc.length <- colSums(as.matrix(DTM)) # number of tokens in each document
N <- sum(doc.length) # total number of tokens in the data (54,136)
vocab <- colnames(phi) # all terms in the vocab
W <- length(vocab) # number of terms in the vocab (6470)
temp_frequency <- inspect(tdm3)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
Freq = colSums(temp_frequency))
rm(temp_frequency)
term.frequency <- freq_matrix$Freq
doc.list <- as.list(reddit_data_textcleaned, "[[:space:]]+")
get.terms <- function(x) {
index <- match(x, vocab)
index <- index[!is.na(index)]
rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
documents <- lapply(doc.list, get.terms)
I presume something goes wrong in the creation of the 'get.terms' and 'documents' objects, as I don't exactly know what happens there. I used these methods based on answers to similar questions I read on this platform. Also, the 'doc.list' object still contains the empty documents I removed from the data after converting 'reddit_data_textcleaned' to 'tdm3'. However, the code above doesn't work with a document-term matrix object so that's why I used 'reddit_data_textcleaned' instead of 'tdm3'. I figured I would fix that issue later.
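For what it's worth, a minimal base-R sketch of building these inputs consistently from the same token lists (the two documents here are stand-ins for the cleaned corpus, and note that strsplit, rather than as.list, performs the whitespace split):

```r
# stand-ins for the cleaned documents
docs <- c("cats and dogs", "dogs chase cats")
# strsplit (not as.list) splits each document on whitespace
doc.list <- strsplit(docs, "[[:space:]]+")
# build the vocabulary and term frequencies from the same tokens
term.table <- sort(table(unlist(doc.list)), decreasing = TRUE)
vocab <- names(term.table)
term.frequency <- as.integer(term.table)
# number of tokens per document, from the same token lists
doc.length <- sapply(doc.list, length)
# sanity check: both totals count every token exactly once
sum(doc.length) == sum(term.frequency)  # TRUE
```

Deriving doc.length, vocab, and term.frequency from one tokenisation (rather than mixing the corpus, tdm3, and DTM) keeps the dimensions consistent with what createJSON expects.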
I am trying to import a single column from a set of text files, where each file is a single day of data. I want to take the mean of each day's wind speed. Here is the code I have written for that:
daily.wind.speed <- c()
file.names <- dir("C:\\Users\\User Name\\Desktop\\R project\\Sightings Data\\Weather Data", pattern =".txt")
for(i in 1:length(file.names))
{
##import data file into data frame
weather.data <-read.delim(file.names[i])
## extract wind speed column
wind.speed <- weather.data[3]
##Attempt to fix numeric error
##wind.speed.num <- as.numeric(wind.speed)
##Take the column mean of wind speed
daily.avg <- colMeans(wind.speed,na.rm=T)
##Add daily average to list
daily.wind.speed <- c(daily.wind.speed,daily.avg)
##Print for troubleshooting and progress
print(daily.wind.speed)
}
This code seems to work on some files in my data set, but others give me this error during this section of the code:
> daily.avg <- colMeans(wind.speed,na.rm=T)
Error in colMeans(wind.speed, na.rm = T) : 'x' must be numeric
I am also having trouble converting these values to numeric, and am looking for options either to convert my data to numeric or to take the mean in a different way that doesn't encounter this issue.
> as.numeric(wind.speed.df)
Error: (list) object cannot be coerced to type 'double'
(example of weather.data omitted)
Even though this is not a reproducible example, the problem is that you are applying a matrix function to something that isn't a numeric matrix, so it won't work. Just change colMeans to mean.
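For example, a minimal sketch of that fix, assuming the column may have been read in as character rather than numeric (the values here are stand-ins):

```r
# stand-in for weather.data[3]: a one-column data frame read as character
wind.speed <- data.frame(V3 = c("3.2", "4.1", "NA"))
# take mean() of the column itself, coercing to numeric first
# (anything non-numeric becomes NA and is dropped by na.rm = TRUE)
daily.avg <- mean(as.numeric(as.character(wind.speed[[1]])), na.rm = TRUE)
daily.avg  # 3.65
```

The as.character() wrapper makes the coercion safe even on older R versions where the column would have been read as a factor.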
I have worked a lot with MaxEnt in R recently (the dismo package), but only using cross-validation to validate my model of bird habitats (a single species). Now I want to use a self-created test sample file. I had to pick these points for validation by hand and can't use random test points.
So my R-script looks like this:
library(raster)
library(dismo)
setwd("H:/MaxEnt")
memory.limit(size = 400000)
punkteVG <- read.csv("Validierung_FL_XY_2016.csv", header=T, sep=";", dec=",")
punkteTG <- read.csv("Training_FL_XY_2016.csv", header=T, sep=";", dec=",")
punkteVG$X <- as.numeric(punkteVG$X)
punkteVG$Y <- as.numeric(punkteVG$Y)
punkteTG$X <- as.numeric(punkteTG$X)
punkteTG$Y <- as.numeric(punkteTG$Y)
##### mask NA ######
mask <- raster("final_merge_8class+le_bb_mask.img")
dataframe_VG <- extract(mask, punkteVG)
dataframe_VG[dataframe_VG == 0] <- NA
dataframe_TG <- extract(mask, punkteTG)
dataframe_TG[dataframe_TG == 0] <- NA
punkteVG <- punkteVG*dataframe_VG
punkteTG <- punkteTG*dataframe_TG
#### add the raster dataset ####
habitat_all <- stack("blockstats_stack_8class+le+area_8bit.img")
#### MODEL FITTING #####
library(rJava)
system.file(package = "dismo")
options(java.parameters = "-Xmx1g" )
setwd("H:/MaxEnt/results_8class_LE_AREA")
### backgroundpoints ###
set.seed(0)
backgrVMmax <- randomPoints(habitat_all, 100000, tryf=30)
backgrVM <- randomPoints(habitat_all, 1000, tryf=30)
### Renner (2015) PPM modelfitting Maxent ###
maxentVMmax_Renner<-maxent(habitat_all,punkteTG,backgrVMmax, path=paste('H:/MaxEnt/Ergebnisse_8class_LE_AREA/maxVMmax_Renner',sep=""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=10",
"replicatetype=subsample",
"randomtestpoints=20",
"randomseed=true",
"testsamplesfile=H:/MaxEnt/Validierung_FL_XY_2016_swd_NA"))
After the maxent() command I ran into multiple errors. First I got an error stating that it needs more than 0 (the default) "randomtestpoints", so I added "randomtestpoints=20" (which hopefully doesn't stop the program from using the file). Then I got:
Error: Test samples need to be in SWD format when background data is in SWD format
Error in file(file, "rt") : cannot open the connection
The thing is, when I ran the script with the default crossvalidation like this:
maxentVMmax_Renner<-maxent(habitat_all,punkteTG,backgrVMmax, path=paste('H:/MaxEnt/Ergebnisse_8class_LE_AREA/maxVMmax_Renner',sep=""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=10"))
...all works fine.
I also tried multiple things to get my csv validation data into the correct format: two columns (labeled X and Y), three columns (labeled species, X and Y), and other variants. I would rather use the "punkteVG" vector (the validation data) I created with read.csv... but it seems MaxEnt wants its own file.
I can't imagine my problem is so uncommon. Someone must have used the argument "testsamplesfile" before.
I found out what the problem was. So here it is, for others to enjoy:
The correct maxent-command for a Subsample-file looks like this:
maxentVMmax_Renner<-maxent(habitat_all, punkteTG, backgrVMmax, path=paste('H:/MaxEnt',sep=""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=1",
"replicatetype=Subsample",
"testsamplesfile=H:/MaxEnt/swd.csv"))
Of course, there cannot be multiple replicates, since you have only one subsample.
Most importantly the "swd.csv" Subsample-file has to include:
the X and Y coordinates
the values at the respective points (e.g. with "extract(habitat_all, PunkteVG)")
the first column needs to consist of the word "species" with the header "Species" (since MaxEnt uses the default "species" if you don't define one in the occurrence data)
So the last point was the issue here. Basically, if you don't define the species column in the subsample file, MaxEnt will not know how to assign the data.
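For illustration, a hedged sketch of what building such an swd.csv might look like; the coordinates and the predictor value are made-up stand-ins (in the real script they would come from extract(habitat_all, punkteVG)):

```r
# hypothetical sketch of the SWD layout described above:
# a "Species" column first, then the coordinates, then the
# predictor values at those points (all values here are stand-ins)
swd <- data.frame(Species = "species",
                  X = c(4421000, 4423500),
                  Y = c(3312000, 3315500),
                  layer = c(3, 5))
write.csv(swd, "swd.csv", row.names = FALSE)
```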
We are using igraph and R to detect communities in a network. The detection using cluster_walktrap is working great:
e <- cluster_walktrap(g)
com <-membership(e)
print(com)
write.csv2(com, file ="community.csv", sep=",")
The result is printed fine using print, with each vertex number and the community it belongs to, but we have a problem writing the result to the csv file and I get an error: cannot coerce class "membership" to a data.frame
How can I write the result of membership in a file ?
Thanks
Convert the membership object to numeric. write.csv and write.csv2 expect a data frame or matrix; the command tries to coerce the object into a data frame, which the class membership resists. Since a membership is really just a vector, you can convert it to numeric. Either:
write.csv2(as.numeric(com), file ="community.csv")
Or:
com <- as.numeric(com)
write.csv2(com, file ="community.csv")
Oh, and you don't need the sep = "," argument for write.csv2.
If you want to create table of vertex names/numbers and groups:
com <- cbind(V(g),e$membership) #V(g) gets the number of vertices
com <- cbind(V(g)$name,e$membership) #To get names if your vertices are labeled
I don't know if you guys resolved the problem, but I did the following in R:
```
# apply the community-detection method
com <- spinglass.community(graph_builted,
                           weights = graph_builted$weights,
                           implementation = "orig",
                           update.rule = "config")

labels <- com$names       # the vertex names
groups <- com$membership  # the community indices

# store the results in a data frame
res <- data.frame(type = "spinGlass1", labels, groups)

# then save the .csv file
write.csv(res, "spinglass-communities.csv")
```
That resolves the problem for me.
Best regards.
I am trying to save a histogram for every file in a list. I cannot load more than one file at a time due to their large size. Normally I would use a symbolic object name for each file's histogram and iterate the name for each item in the list. I am having trouble figuring out how to do this in R, so instead I attempted to save each histogram as a column of a data.frame. The code is as follows:
filelist <- list.files("dir/")
file.hist <- data.frame(check.rows = FALSE)
for(i in 1:length(filelist)) {
file <- read.csv(paste0("dir/", filelist[i]))
file.hist[[i]] <- hist(file$Value, breaks = 200)
}
The error message that results is:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(breaks = c(0, 200, :
replacement has 6 rows, data has 0
I have googled the error message and it seems like it might be related to how you initialize the data frame, although I have to admit that my brain is fried this close to Thanksgiving. Has anyone out there dealt with and solved a similar problem? I am not married to this approach.
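A sketch that sidesteps the data.frame issue entirely: a histogram object is itself a list of six components (breaks, counts, and so on), so store one per file in a list instead. The file names and Value column here mirror the question's setup but are stand-ins, written to a temporary directory so the example is self-contained:

```r
# create two stand-in CSV files, each with a numeric Value column
dir <- tempdir()
write.csv(data.frame(Value = rnorm(100)), file.path(dir, "day1.csv"), row.names = FALSE)
write.csv(data.frame(Value = rnorm(100)), file.path(dir, "day2.csv"), row.names = FALSE)

filelist <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
# plot = FALSE returns the histogram object without drawing it,
# and a list (unlike a data.frame) can hold one such object per file
file.hist <- lapply(filelist, function(f) hist(read.csv(f)$Value, breaks = 200, plot = FALSE))
names(file.hist) <- basename(filelist)
```

Each element of file.hist can then be replotted later with plot(file.hist[[i]]) without reloading the original file.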
I am trying to write an input file that requires a single line in the first row telling whether the file is sparse and, if so, how many variable levels there are. I know how to append a single line to the end of a file, but can't find a way to prepend a line to the beginning. Any suggestions?
library(e1071)
library(caret)
library(Matrix)
library(SparseM)
iris2 <- iris
iris2$sepalOver5 <- ifelse(iris2$Sepal.Length >= 5, 1, -1)
head(iris2)
summary(iris2)
trainRows <- sample(1:nrow(iris2), nrow(iris2) * .66, replace = F)
testRows <- which(!(1:nrow(iris2) %in% trainRows))
sum(testRows %in% trainRows)
sum(trainRows %in% testRows)
vtu1 <- c('Sepal.Width','Petal.Length','Petal.Width','Species')
dv1 <- dummyVars( ~., data = iris2[,vtu1], sparse = T)
train <- iris2[trainRows,]
test <- iris2[testRows,]
trainX <- as.matrix.csr(predict(dv1, train))
testX <- as.matrix.csr(predict(dv1, test))
trainY <- train[,'sepalOver5']
testY <- test[,'sepalOver5']
write.matrix.csr( as(trainX , "matrix.csr"), file= "amz.train" , fac = TRUE)
headString <- paste('sparse ', max(trainX@ja), sep = '')
I'd basically like to insert/append headString into amz.train in the first row. Any suggestions?
It is generally not possible to prepend to the start of a file (and where workarounds exist, they are inefficient, because file systems support appending but not inserting at the front; this holds for any programming language).
Three options come to mind:
Read in the file, write the other information first, followed by the rest of the content of the file (might also be inefficient)
Write the information you want to prepend first
If you have a writer that cannot append (write.matrix, for instance, has no append option), you could try to merge this meta information with the data frame and then write it as a whole.
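A minimal base-R sketch of the first option (read the existing lines, then rewrite the file with the header in front), using stand-in content for amz.train:

```r
# stand-in for the already-written amz.train file
f <- tempfile(fileext = ".train")
writeLines(c("1:3 2:5.2 3:2 6:1", "1:3.7 2:1.5 3:0.2 4:1"), f)

headString <- "sparse 6"
# read the whole file, then rewrite it with the header line first
body <- readLines(f)
writeLines(c(headString, body), f)
readLines(f)[1]  # "sparse 6"
```

This reads the entire file into memory, so it is only practical while the sparse matrix file stays reasonably small.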
Since you are using a specialized format, I wouldn't recommend storing this meta-information this way.
Your file would look like:
sparse 6
1:3 2:5.2 3:2 6:1
1:3.7 2:1.5 3:0.2 4:1
1:3.2 2:6 3:1.8 6:1
And then there is option 4:
Rather, consider keeping a separate meta file which contains information such as the file name, whether it is sparse, and the number of levels. Here you could append, and if you repeat this process it would be preferable, since it avoids reading in weirdly formatted files.