I'm brand new to Julia so forgive any ignorance.
I'm hoping to be able to use Doc2Vec from Python's gensim module in Julia. However, I am running up against an issue where the names from the TaggedDocument Python object do not survive the automatic conversion when the result is assigned to a variable in Julia.
This seems to be a known issue, but not one where I can clearly see how to implement the solution: https://github.com/JuliaPy/PyCall.jl/issues/175
# import modules
using PyCall
gensim = pyimport("gensim")
Doc2Vec = pyimport("gensim.models.doc2vec")
# create simple "taggedDocuments"
a = Doc2Vec.TaggedDocument(words=gensim.utils.simple_preprocess("This is some text"), tags = ["tag01"])
# setup the Doc2Vec model
model = Doc2Vec.Doc2Vec(size = 100, min_count = 1, dm = 1)  # note: `size` is called `vector_size` in gensim >= 4.0
# use the "taggedDocument" to populate the vocab attributes.
model.build_vocab(a)
# This results in - AttributeError("'tuple' object has no attribute 'words'")
# one idea I had was to try to re-attach the names to the Julia object
tnames = (:words, :tags);
c = (;zip(tnames, a)...)
# However, when these get passed back into Python the names are lost again
model.build_vocab(c)
# and again - AttributeError("'tuple' object has no attribute 'words'")
My current assumption is that if I can force the output of Doc2Vec.TaggedDocument() to not be automatically converted, and instead to be stored as a PyObject, then the names shouldn't be lost. To me this seems like it should be simple with PyCall, but reading the Types section here: https://github.com/JuliaPy/PyCall.jl hasn't helped. So I'm wondering if anyone has a potential solution.
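Reading further, the PyCall README does mention a lower-level pycall(f, returntype, args...; kwargs...) form where the return type is given explicitly, so perhaps something like the following (an untested sketch, reusing the objects above) would keep the result as a PyObject and preserve the names:
# ask PyCall to leave the result as a PyObject instead of converting it
a_py = pycall(Doc2Vec.TaggedDocument, PyObject;
              words = gensim.utils.simple_preprocess("This is some text"),
              tags = ["tag01"])
# build_vocab expects an iterable of documents, so wrap the single document in a vector
model.build_vocab([a_py])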
Thanks in advance.
How do you save gl_speech_op output to an object within R?
I successfully ran googleLanguageR to convert an audio file to text within the Google Cloud Platform. I can see the output, but I don't know how to save it to an object within RStudio.
Sample code is below. I am using R Notebook.
library(googleLanguageR)
library(tidyverse)
###let's get Craig Watkins
gl_auth("D:/Admin/Documents/Google API JSON Authenticate/My Project two test-db5d6330925e.json")
watkins <- gl_speech("gs://testtwoibm/craig watkins 2018_05_07_14_08_08.flac",
encoding = c("FLAC"), sampleRateHertz = 44100, languageCode = "en-US",
maxAlternatives = 1L, asynch = TRUE)
## Send to gl_speech_op() for status or finished result
gl_speech_op(watkins)
[Screenshot: RStudio notebook output showing the converted speech-to-text.]
The easiest way to save the output of any operation to an object in R is to assign it via the assignment operator <-.
In your case, you would simply assign it to an object like this:
transcript <- gl_speech_op(watkins)
One small reminder: this will also work if the asynchronous API request hasn't finished transcribing yet; however, the object will then not contain any information. In your case it will be a list of 2 with two NULL elements. Once finished, the object will contain both the transcript and the timings.
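If you want to wait for the result, a minimal polling sketch (assuming the finished result is the two-element list of transcript and timings described above) could look like this:
transcript <- gl_speech_op(watkins)
# while the async job is unfinished, the elements are NULL, so poll until done
while (is.null(transcript$transcript)) {
  Sys.sleep(10)  # wait a bit before asking the API again
  transcript <- gl_speech_op(watkins)
}
transcript$transcript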
I understand you want the output as text. If this is the case, then you can use capture.output:
new_obj = capture.output(gl_speech_op(watkins))
new_obj
I think I have exhausted the entire internet looking for an example / answer to my query about getting an H2O MOJO model to predict within R Shiny. We have created a bunch of models and wish to predict scores in an R Shiny front end where users enter values. However, with the following code to implement the prediction we get the following error:
Warning: Error in checkForRemoteErrors: 6 nodes produced errors; first
error: No method asJSON S3 class: H2OFrame
library(h2o)       # provides h2o.predict_json()
library(jsonlite)  # provides toJSON()

dataInput <- dfName
dataInput <- toJSON(dataInput)
rawPred <- as.data.frame(h2o.predict_json(model = "folder/mojo_model.zip",
                                          json = dataInput,
                                          genmodelpath = "folder/h2o-genmodel.jar"))
Can anyone help with some pointers?
Thanks,
Siobhan
This is not a Shiny issue. The error indicates that you're trying to use toJSON() on an H2OFrame (instead of an R data.frame), which will not work because the jsonlite library does not support that.
Instead you can convert the H2OFrame to a data.frame using:
dataInput <- toJSON(as.data.frame(dataInput))
I can't guarantee that toJSON() will generate the correct input for h2o.predict_json() since I have not tried it, so you will have to try it out yourself. Note that the only way this can work is if it is a 1-row data.frame, because the h2o.predict_json() function expects a single row of data encoded as JSON. If you're trying to score multiple records, you'd have to loop over the rows. If for some reason toJSON() doesn't give you the right format, then you can use a function I wrote in this post here to create the JSON string from a data.frame manually.
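For the multi-row case, a hedged sketch of that loop (df here is a placeholder for your data.frame, and the paths are the ones from the question; the same toJSON() caveat applies):
library(h2o)
library(jsonlite)

# score a multi-row data.frame one row at a time, since
# h2o.predict_json() expects a single JSON-encoded row per call
preds <- lapply(seq_len(nrow(df)), function(i) {
  row_json <- toJSON(df[i, , drop = FALSE])
  as.data.frame(h2o.predict_json(model = "folder/mojo_model.zip",
                                 json = row_json,
                                 genmodelpath = "folder/h2o-genmodel.jar"))
})
preds <- do.call(rbind, preds)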
There is a ticket open to create a better version of h2o.predict_json() that will support making predictions from a MOJO on data frames (with multiple rows) without having to convert to JSON first. This will let you avoid dealing with JSON altogether.
An alternative is to use an H2O binary model instead of a MOJO, along with the standard predict() function. The only requirement is that the model must be loaded into the H2O cluster's memory.
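A minimal sketch of that alternative (the model path is a placeholder, and dataInput is assumed to be a plain data.frame here):
library(h2o)
h2o.init()  # the binary model must live in a running H2O cluster

model <- h2o.loadModel("folder/h2o_binary_model")  # path to a model saved with h2o.saveModel()
pred  <- as.data.frame(h2o.predict(model, as.h2o(dataInput)))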
The following works now, using the JSON formatting from the first two lines and single quotes around the value that contains spaces.
df <- data.frame(V1 = 1, V2 = 1, CMPNY_EL_IND = 1, UW_REGION_NAME = "'LONDON & SE'")
# build "name":value pairs from the first row, then wrap them in braces to form the JSON string
dfstr <- sapply(1:ncol(df), function(i) paste(paste0('\"', names(df)[i], '\"'), df[1,i], sep = ':'))
json <- paste0('{', paste0(dfstr, collapse = ','), '}')
dataPredict <- as.data.frame(h2o.predict_json(model = "D:\\GBM_model_0_CMP.zip", json = json, genmodelpath = "D:\\h2o-genmodel.jar", labels = TRUE))
This is a tricky one as I can't provide a reproducible example, but I'm hoping that others may have had experience dealing with this.
Essentially I have a function that pulls a large quantity of data from a DB, cleans and reduces its size, and loops through some parameters to produce a series of lm model objects, parameter values and other reference values. This is compiled into a complex list structure that totals about 10 MB.
It's then supposed to be saved as an RDS file on AWS S3, where it's retrieved in a production environment to build predictions.
e.g.
library(plyr)  # for llply()

## Externally defined (in a package):
get_db_data <- function(db.connection, some.parameters) {}  # retrieves db data
clean_data  <- function(db.data, some.parameters) {}        # cleans and filters data based on parameters
lm_model    <- function(clean.data) {}                      # builds lm model based on clean.data

db.connection <- db.connection.object

build_model_list <- function(db.connection) {
  clean_and_build_models <- function(db.connection, other.parameters) {
    db.data    <- get_db_data(db.connection, some.parameters)
    clean.data <- clean_data(db.data, some.parameters)
    lm.model   <- lm_model(clean.data)
    return(list(lm.model, other.parameters))
  }
  looped.model.object <- llply(some.parameters, clean_and_build_models)
  return(looped.model.object)
}

model.list <- build_model_list(db.connection)
saveRDS(model.list, "~/a_place/model_list.RDS")
The issue I'm getting is that the 'model.list' object, which is only ~10 MB in memory, inflates to many GB when I save it locally as RDS or try to upload it to AWS S3.
I should note that though the function processes very large quantities of data (~ 5 million rows), the data used in the outputs is no larger than a few hundred rows.
Reading the limited info on this on Stack Exchange, I've found that moving some of the externally defined functions (as part of a package) inside the main function (e.g. clean_data and lm_model) helps reduce the RDS save size.
This however has some big disadvantages.
Firstly, it's trial and error and follows no clear logical order; with frequent crashes and a couple of hours taken to build the list object, it's a very long debugging cycle.
Secondly, it'll mean my main function will be many hundreds of lines long, which will make future alterations and debugging much trickier.
My question to you is:
Has anyone encountered this issue before?
Any hypotheses as to what's causing it?
Has anyone found a logical non-trial-and-error solution to this?
Thanks for your help.
It took a bit of digging but I did actually find a solution in the end.
It turns out it was the lm model objects that were the guilty party. Based on this very helpful article:
https://blogs.oracle.com/R/entry/is_the_size_of_your
It turns out that the lm.object$terms component includes an environment component that references the objects that were present in the environment when the model was built. Under certain circumstances, when you call saveRDS, R will try to pull those environment objects into the saved object.
As I had ~0.5 GB sitting in the global environment and a list of ~200 lm model objects, this caused the RDS object to inflate dramatically, as it was actually trying to compress ~100 GB of data.
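A hedged illustration of the mechanism (a toy example, not my original code): a formula built inside a function captures that function's environment, including any large local objects, and serialize() drags them along.
f <- function() {
  big.junk <- rnorm(1e7)       # ~80 MB local object the model doesn't need
  lm(mpg ~ wt, data = mtcars)  # the formula's environment captures big.junk
}
small.model <- f()
length(serialize(small.model, NULL))  # tens of MB rather than a few KB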
To test whether this is what's causing the problem, execute the following code:
as.matrix(lapply(lm.object, function(x) length(serialize(x, NULL))))  # serialized size of each component
This will tell you if the $terms component is inflating.
The following code will remove the environmental references from the $terms component:
rm(list=ls(envir = attr(lm.object$terms, ".Environment")), envir = attr(lm.object$terms, ".Environment"))
Be warned, though: it will also remove all the objects in the environment it references (here, your global environment).
For model objects you could also simply delete the reference to the environment, for example like this:
# data and model taken from the ?lm example
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
attr(lm.D9$terms, ".Environment") <- NULL
saveRDS(lm.D9, file = "path_to_save.RDS")
This unfortunately breaks the model, but you can add an environment manually after loading it again.
lm.D9 <- readRDS("path_to_save.RDS")
attr(lm.D9$terms, ".Environment") <- globalenv()
This helped me in my specific use case and looks a bit safer to me...
Neither of these two solutions worked for me.
Instead I have used:
downloaded_object <- storage_download(connection, "path")
read_RDS <- readRDS(downloaded_object)
The answer by mhwh mostly solved my problem, but with the additional step of creating an empty list and copying into it whatever was relevant from the model object. This might be due to additional (undocumented) environment references associated with the model class I used.
mm <- felm(formula = formula, data = data, keepX = TRUE, ...)  # felm() from the lfe package
# Make an empty list and copy into it what we need:
mm_cp <- list()
mm_cp$coefficients <- mm$coefficients
# mm_cp$ <- something else from mm you might need ...
mm_cp$terms <- terms(mm)
attr(mm_cp$terms, ".Environment") <- NULL
saveRDS(mm_cp, file = "path_to_save.RDS")
Then when we need to use it:
mm_cp <- readRDS("path_to_save.RDS")
attr(mm_cp$terms, ".Environment") <- globalenv()
In my case the file went from 5.5 GB to 13 KB. Additionally, reading in the file used to allocate >32 GB of memory, more than 6 times the file size. This also reduced execution time significantly (no need to recreate various environments?).
Environment references sound like an excellent contender for a new chapter in the R Inferno book.
Reference class definitions can pile up quite a few lines of code in R. When methods are defined inside a reference class, a couple of methods plus the field definitions give you a rather confusing class definition – at least it's hard to read at 300+ lines. And I have other issues:
roxygen2 documentation does not really work out of the box as it does for functions.
auto-suggest using the $ operator works for functions and lists of functions, but not for methods in RC, only for field names
I don't know how I could possibly split method definitions across several files. All the code of my package resides in 2 or 3 files that contain the class definitions.
So speaking in code, why should I not do something like this?
someDummy <- setRefClass("someDummy", fields = list(df = "matrix",
                                                    multiplier = "numeric"))
test <- someDummy()

# a plain closure returning a list of functions that operate on the RC object
thingsYouCanDo <- function() {
  populate <- function(rc, mtrx, multi) {
    # RC objects have reference semantics, so this mutates the caller's object
    rc$df <- mtrx
    rc$multiplier <- multi
  }
  multiply <- function(rc) {
    rc$df * rc$multiplier
  }
  return(list(populate = populate,
              multiply = multiply))
}

te <- thingsYouCanDo()
te$populate(test, matrix(1:12, 4, 3), 5)
test
te$multiply(test)
Are there any well-written packages on CRAN that make use of RC and are documented well? Speaking of documentation, I do not mean a neat website, but .Rd-based documentation.
What I have seen a lot lately in other people's source code is functions that contain functions or lists of functions. Should I rather use that?
I have found part of an answer to my own question: the lme4 package uses quite a few RC classes and also documents them using .Rd files.
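Also worth noting: methods don't have to be defined inline in the setRefClass() call. The generator it returns has a $methods() function (see ?ReferenceClasses), so method definitions could be spread across several files of a package. A minimal sketch using the class from the question:
# in another file: attach methods to the generator after the class is defined
someDummy$methods(
  populate = function(mtrx, multi) {
    df <<- mtrx          # <<- assigns to the object's fields
    multiplier <<- multi
  },
  multiply = function() {
    df * multiplier
  }
)

test2 <- someDummy()
test2$populate(matrix(1:12, 4, 3), 5)
test2$multiply()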
I am using tableNominal{reporttools} to produce frequency tables. The way I understand it, tableNominal() produces LaTeX code which has to be copied and pasted into a text file and then saved as .tex. But is it possible to simply export the table produced, as can be done with print(xtable(table), file = "path/outfile.tex")?
You may be able to use either latex or latexTranslate from the "Hmisc" package for this purpose. If you have the necessary program infrastructure, the output gets sent to your TeX engine. (You may be able to improve the level of our answers by adding specific examples.)
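For instance, a hedged sketch of the Hmisc route (the data frame here is made up for illustration):
library(Hmisc)
df <- data.frame(group = c("a", "b"), n = c(40, 60))
latex(df, file = "outfile.tex", rowname = NULL)  # writes the table's LaTeX markup to outfile.tex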
Looks like that function does not return a character vector, so you need to use a strategy to capture the output from cat(). Using the example in the help page:
capture.output(TN <- tableNominal(vars = vars, weights = weights, group = group,
                                  cap = "Table of nominal variables.", lab = "tab: nominal"),
               file = "outfile.tex")