LDA gensim model in a Flask HTTP API - Memory issues

I am new to machine learning and this is the first time I am using Python's gensim to extract topics from text.
I successfully trained a model (for 100 topics) and then had the idea to use that model in an HTTP API that I built with Python Flask. The endpoint gives back terms for a given text.
By the way, the model is loaded when I initialize the API.
After trying this out in production, memory on a small VM (~1 GB RAM) was exhausted and I finally got an error:
    tags = tags + lda.topic_words(topic_index, num_of_keywords_for_topic, model, words)
  File "/var/app/tagbee/lda.py", line 64, in topic_words
    x2 = model.get_topic_terms(topicid=topic_index, topn=number_of_keywords)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/ldamodel.py", line 1224, in get_topic_terms
    topic = self.get_topics()[topicid]
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/ldamodel.py", line 1204, in get_topics
    topics = self.state.get_lambda()
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/ldamodel.py", line 269, in get_lambda
    return self.eta + self.sstats
MemoryError: Unable to allocate 96.6 MiB for an array with shape (100, 253252) and data type float32
So I have some questions:
Can a gensim LDA model be used that way, I mean in an HTTP API?
If yes, what is the trick to make it happen? If it needs at least 90MB of memory per request, how does it scale?
Is there any alternative approach?
Thank you in advance!

Your question seems to be related to LDA or gensim only incidentally. The main point seems to be how to maintain (and reuse) an object in memory across a number of Flask requests.
Inspired by the Flask documentation and the answers to the question "Flask - Store values in memory between requests", I propose the following approach:
from flask import g  # global context shared across requests

def get_lda_model():
    if 'lda' not in g:
        g.lda = load_lda_model()  # placeholder: read the model file here
    return g.lda

@app.route('/example_request_path', methods=['POST'])
def my_request():
    lda = get_lda_model()
    # use the lda model here...
Once the LDA model is loaded, you can reuse it very quickly across a number of requests without reloading it into memory. As long as your model is not changed across requests, it does not matter whether this approach is thread-safe.
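On the memory question: the traceback shows that get_lambda() evaluates self.eta + self.sstats, which allocates a fresh num_topics x vocabulary float32 array on every call (100 x 253,252 x 4 bytes ≈ 96.6 MiB, exactly the failed allocation). A minimal sketch of two mitigations, assuming the model was saved with gensim's own save() (the file path and the helper name are placeholders):

from gensim.models import LdaModel

# Load once at startup; mmap='r' memory-maps the model's large arrays
# read-only instead of copying them onto the heap.
lda = LdaModel.load('path/to/lda.model', mmap='r')

# Compute the topic-term matrix once and keep it around, so each request
# only indexes into an existing array instead of re-allocating ~96.6 MiB.
topic_term = lda.get_topics()

def topic_words(topic_index, topn):
    row = topic_term[topic_index]
    best = row.argsort()[-topn:][::-1]  # indices of the topn largest weights
    return [(lda.id2word[i], float(row[i])) for i in best]

With the matrix precomputed at startup, per-request memory use stays flat no matter how many requests hit the endpoint.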

Related

You are passing KerasTensor an intermediate Keras symbolic input/output

I'm trying to run the code from section 5.4.2 of F. Chollet's book "Deep Learning with R", on visualizing convnet filters. The code is as follows:
library(keras)
model <- application_vgg16(
  weights = 'imagenet',
  include_top = FALSE
)
layer_name <- 'block3_conv1'
filter_index <- 1
layer_output <- get_layer(model, layer_name)$output
loss <- k_mean(layer_output[,,,filter_index])
grads <- k_gradients(loss, model$input)[[1]]
but it throws: Error in py_call_impl(): ! RuntimeError: tf.gradients is not supported when eager execution is enabled. Use tf.GradientTape instead.
According to this GitHub issue, this can easily be resolved by adding two lines of code at the beginning of the script:
library(tensorflow)
tf$compat$v1$disable_eager_execution()
and indeed, tf$executing_eagerly() then returns FALSE instead of TRUE. Now, however, k_gradients() throws another error:
Error in `py_call_impl()`:
! TypeError: You are passing KerasTensor(type_spec=TensorSpec(shape=(), dtype=tf.float32, name=None), name='tf.math.reduce_mean/Mean:0', description="created by layer 'tf.math.reduce_mean'"), an intermediate Keras symbolic input/output, to a TF API that does not allow registering custom dispatchers, such as `tf.cond`, `tf.function`, gradient tapes, or `tf.map_fn`. Keras Functional model construction only supports TF API calls that *do* support dispatching, such as `tf.math.add` or `tf.reshape`. Other APIs cannot be called directly on symbolic Keras inputs/outputs. You can work around this limitation by putting the operation in a custom Keras layer `call` and calling that layer on this symbolic input/output.
How to overcome this issue and visualize convnet filters?
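No answer is given above, but the first error message already points at the TF 2.x route: compute the gradient with tf$GradientTape instead of k_gradients(). A minimal sketch under eager execution (so the disable_eager_execution() call is no longer needed; the 150x150 input size is an assumption carried over from the book's example):

library(keras)
library(tensorflow)

model <- application_vgg16(weights = 'imagenet', include_top = FALSE)

# Sub-model exposing the activations of the target layer
feature_extractor <- keras_model(
  inputs = model$input,
  outputs = get_layer(model, 'block3_conv1')$output
)

# A trainable image to run gradient ascent on
input_img <- tf$Variable(tf$random$uniform(shape(1, 150, 150, 3)))

with(tf$GradientTape() %as% tape, {
  activation <- feature_extractor(input_img)
  loss <- tf$reduce_mean(activation[, , , 1])  # filter_index 1
})
grads <- tape$gradient(loss, input_img)  # plays the role of k_gradients(...)[[1]]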

Use Azure custom-vision trained model with tensorflow.js

I've trained a model with Azure Custom Vision and downloaded the TensorFlow files for Android
(see: https://learn.microsoft.com/en-au/azure/cognitive-services/custom-vision-service/export-your-model). How can I use this with tensorflow.js?
I need a model (a .pb file) and weights (a .json file). However, Azure gives me a .pb file and a text file with tags.
From my research I also understand that there are different kinds of .pb files, but I can't find out which type Azure Custom Vision exports.
I found the tfjs converter, which converts a TensorFlow SavedModel (is the .pb file from Azure a SavedModel?) or a Keras model to a web-friendly format. However, I need to fill in "output_node_names" (how do I get these?), and I'm also not 100% sure that my .pb file for Android counts as a "tf_saved_model".
I hope someone has a tip or a starting point.
Just parroting what I said here to save you a click. I do hope that the option to export directly to tfjs becomes available soon.
These are the steps I did to get an exported TensorFlow model working for me:
Replace PadV2 operations with Pad. This Python function should do it: input_filepath is the path to the .pb model file, and output_filepath is the full path of the updated .pb file that will be created.
import tensorflow as tf

def ReplacePadV2(input_filepath, output_filepath):
    # Load the frozen graph (TF 1.x API; tf.compat.v1.GraphDef in TF 2.x)
    graph_def = tf.GraphDef()
    with open(input_filepath, 'rb') as f:
        graph_def.ParseFromString(f.read())

    # Rewrite every PadV2 op as Pad, dropping its extra constant-value input
    for node in graph_def.node:
        if node.op == 'PadV2':
            node.op = 'Pad'
            del node.input[-1]
            print("Replaced PadV2 node: {}".format(node.name))

    with open(output_filepath, 'wb') as f:
        f.write(graph_def.SerializeToString())
Install tensorflowjs 0.8.6 or earlier; converting frozen models is deprecated in later versions.
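For instance, a hedged one-liner (pinning exactly 0.8.6 is an assumption; any release at or below 0.8.6 should do):

pip install tensorflowjs==0.8.6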
When calling the converter, set --input_format to tf_frozen_model and --output_node_names to model_outputs. This is the command I used:
tensorflowjs_converter --input_format=tf_frozen_model --output_json=true --output_node_names='model_outputs' --saved_model_tags=serve path\to\modified\model.pb folder\to\save\converted\output
Ideally, tf.loadGraphModel('path/to/converted/model.json') should now work (tested for tfjs 1.0.0 and above).
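For completeness, a minimal browser-side usage sketch (the model path is the placeholder from the line above; the 224x224x3 input shape is an assumption about the exported network):

// Run inside an async function or an ES module with top-level await
import * as tf from '@tensorflow/tfjs';

const model = await tf.loadGraphModel('path/to/converted/model.json');
const input = tf.zeros([1, 224, 224, 3]);  // substitute a real preprocessed image tensor
const output = model.predict(input);
output.print();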
Partial answer:
Trying to achieve the same thing - here is the start of an answer - making use of the output_node_names:
tensorflowjs_converter --input_format=tf_frozen_model --output_node_names='model_outputs' model.pb web_model
I am not yet sure how to incorporate this into the same code - do you have anything, @Kasper Kamperman?

Is an rJava object exportable with future (a package for asynchronous computing in R)?

I'm trying to speed up my R code with the future package, using a multicore plan on Linux. In the future definition I'm creating a Java object and trying to pass it to .jcall(), but I'm getting a null value for the Java object inside the future. Could anyone please help me resolve this? Below is sample code:
library("future")
plan(multicore)
library(rJava)
.jinit()
# preprocess is a user-defined function
preprocess <- function(a) {
  # some preprocessing task here
  # time-consuming statistical analysis here
  return(lreturn)  # return a list of 3 components
}
my_value <- preprocess(a = value)

obj <- .jnew("java.custom.class")
f <- future({
  .jcall(obj, "V", "CustomJavaMethod", my_value)
})
Basically, I'm dealing with large streaming data. In the code above I'm sending the string of streaming data to a user-defined function for statistical analysis, which returns a list of 3 components. I then want to send this list to a custom Java class [java.custom.class] for further processing, using a custom Java method [CustomJavaMethod].
Without future my code runs fine, but I'm receiving 12 streaming records per minute and the processing falls behind, with a noticeable delay.
Currently I'm using Unix with 16 cores. With the future package my processing finishes fast, but I have traced through my code and something goes wrong in .jcall().
Hope this clarifies my pain.
(Author of the future package here:)
Unfortunately, there are certain types of objects in R that cannot be sent to another R process for further processing. To clarify, this is a limitation of those types of objects, not of the parallel framework used (here the future framework). The simplest example of such an object may be a file connection, e.g. con <- file("my-local-file.txt", open = "wb"). I've documented some examples in the section 'Non-exportable objects' of the 'Common Issues with Solutions' vignette (https://cran.r-project.org/web/packages/future/vignettes/future-4-issues.html).
As mentioned in the vignette, you can set an option (*) so that the future framework looks for these types of objects and gives an informative error before attempting to launch the future ("early stopping"). Here is your example with this check activated:
library("future")
plan(multisession)
## Assert that global objects can be sent back and forth between
## the main R process and background R processes ("workers")
options(future.globals.onReference = "error")
library("rJava")
.jinit()
end <- .jnew("java/lang/String", " World!")
f <- future({
  start <- .jnew("java/lang/String", "Hello")
  .jcall(start, "Ljava/lang/String;", "concat", end)
})
# Error in FALSE :
# Detected a non-exportable reference ('externalptr') in one of the
# globals ('end' of class 'jobjRef') used in the future expression
So, yes, your example actually works when using plan(multicore). The reason is that 'multicore' uses forked processes (available on Unix and macOS but not Windows). However, I would try my best not to limit your software to parallelizing only on "forkable" systems; if you can find an alternative approach, I would aim for that. That way your code will also work on, say, a huge cloud cluster.
(*) The reason these checks are not enabled by default is that (a) they are still in beta testing, and (b) they come with overhead, because we basically need to scan all the globals for non-supported objects. Whether these checks will be enabled by default in the future will be discussed over at https://github.com/HenrikBengtsson/future.
The code in the question calls an unknown Java method, my_value is undefined, ... it is hard to know what you are really trying to achieve.
Take a look at the following example, maybe you can get inspiration from it:
library(future)
plan(multicore)
library(rJava)
.jinit()
end = .jnew("java/lang/String", " World!")
f <- future({
  start = .jnew("java/lang/String", "Hello")
  .jcall(start, "Ljava/lang/String;", "concat", end)
})
value(f)
[1] "Hello World!"

How to save a model when using MXnet

I am using MXnet for training a CNN (in R) and I can train the model without any error with the following code:
model <- mx.model.FeedForward.create(
  symbol = network,
  X = train.iter,
  ctx = mx.gpu(0),
  num.round = 20,
  array.batch.size = batch.size,
  learning.rate = 0.1,
  momentum = 0.1,
  eval.metric = mx.metric.accuracy,
  wd = 0.001,
  batch.end.callback = mx.callback.log.speedometer(batch.size, frequency = 100)
)
But as this process is time-consuming, I run it on a server overnight, and I want to save the model so that I can use it after the training finishes.
I used:
save(list = ls(), file="mymodel.RData")
and
mx.model.save("mymodel", 10)
But neither of them saves the model! For example, when I load "mymodel.RData", I cannot predict the labels for the test set!
Another example: when I load "mymodel.RData" and try to plot the model with the following code:
graph.viz(model$symbol$as.json())
I get the following error:
Error in model$symbol$as.json() : external pointer is not valid
Can anybody give me a solution for saving and then loading this model for future use?
Thanks
You can save the model during training by adding a checkpoint callback:
model <- mx.model.FeedForward.create(
  symbol = network,
  X = train.iter,
  ctx = mx.gpu(0),
  num.round = 20,
  array.batch.size = batch.size,
  learning.rate = 0.1,
  momentum = 0.1,
  eval.metric = mx.metric.accuracy,
  wd = 0.001,
  epoch.end.callback = mx.callback.save.checkpoint("model_prefix"),
  batch.end.callback = mx.callback.log.speedometer(batch.size, frequency = 100)
)
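A sketch of restoring such a checkpoint afterwards (the prefix matches the callback above; epoch 20 is assumed since num.round = 20):

# mx.callback.save.checkpoint() writes model_prefix-symbol.json plus
# model_prefix-00NN.params once per epoch; reload any saved epoch with:
model <- mx.model.load("model_prefix", iteration = 20)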
An MXNet model is an R list, but its first component is not an R object but a C++ pointer, so it cannot be saved and reloaded as an R object. The model therefore needs to be serialized to behave as an actual R object. The serialized object is also a list, but its first element is a text string containing the model information.
To save a model:
modelR <- mx.serialize(model)
save(modelR, file="~/model1.RData")
To retrieve it and use it again:
load("~/model1.RData", verbose=TRUE)
model <- mx.unserialize(modelR)
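After unserializing, the model should behave like the original, so prediction works again. A brief usage sketch (test.iter stands for a test-set iterator like the training iterator in the question):

preds <- predict(model, test.iter)
pred.label <- max.col(t(preds)) - 1  # assumes class labels indexed from 0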
The best practice for saving a snapshot of your training progress is to use save_checkpoint (http://mxnet.io/api/python/module.html#mxnet.module.Module.save_checkpoint) as part of the callback after every training epoch. In R the equivalent command is probably mx.callback.save.checkpoint, but I'm not using R and I'm not sure about its usage.
Using these snapshots also allows you to take advantage of the low-cost AWS Spot market (https://aws.amazon.com/ec2/spot/pricing/), which for example now offers an instance with 16 K80 GPUs for $3.8/hour, compared to the on-demand price of $14.4. Such 80-90% discounts are common in the spot market and can optimize the speed and cost of your training, as long as you use these snapshots correctly.

What to do when a NOAA ERDDAP dataset is not found?

I'm trying to download some gridded ERDDAP data using the rnoaa package in R. While the data retrieval works perfectly for some datasets, I'm having problems getting the data for certain others. For example, when I run:
library(rnoaa)
ds.info <- erddap_info("noaa_pfeg_95de_54ab_a60a")
erddap_grid(ds.info,
            time = c("2005-01-01", "2015-01-01"),
            altitude = c(0, 0),
            latitude = c(3.25, 3.75),
            longitude = c(72.5, 73.25),
            fields = "all")
I get the following error:
`Error: (404) - Resource not found: /erddap/griddap/ncdcOwDly.csv (Currently unknown datasetID=ncdcOwDly)`.
The error is not really consistent, because the call sometimes works when I try different time spans. But I get it pretty much every single time I try to download data from the datasets noaa_pfeg_95de_54ab_a60a, noaa_pfeg_1a4b_0c2a_2365 and some others by NOAA-NCDC.
Because erddap_grid works for some datasets but not for others, I'm inclined to think it's not a bug. Maybe it is a problem with the ERDDAP server, or maybe something to do with my API key? Is there a way around it?
Update (2015-01-10): It seems it is a server problem. When I try to download the data using the address generated by the web interface (see below), I get the same error. I guess I'll just have to wait until "they" fix the problem with the database.
http://coastwatch.pfeg.noaa.gov/erddap/griddap/ncdcOw6hr.csv?u[(2006-01-01):1:(2015-01-09T18:00:00Z)][(10.0):1:(10.0)][(3.25):1:(3.75)][(72.5):1:(73.25)],v[(2006-01-01):1:(2015-01-09T18:00:00Z)][(10.0):1:(10.0)][(3.25):1:(3.75)][(72.5):1:(73.25)]
ERDDAP servers often become overloaded and 404 on some requests. These are public-facing servers that do heavy data lifting, after all.
So the answer here is to try again after waiting some time (please wait a while, to be nice to the ERDDAP administrators), and to contact the server administrator to make sure that your IP address has not been blacklisted for making too many requests.
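A minimal retry sketch along those lines, reusing the query from the question (the number of attempts and the wait time are arbitrary assumptions):

library(rnoaa)

# Retry the griddap request with a pause between attempts; give up after
# max_tries. All query arguments are the ones from the question above.
fetch_with_retry <- function(max_tries = 5, wait_sec = 60) {
  ds.info <- erddap_info("noaa_pfeg_95de_54ab_a60a")
  for (i in seq_len(max_tries)) {
    res <- tryCatch(
      erddap_grid(ds.info,
                  time = c("2005-01-01", "2015-01-01"),
                  altitude = c(0, 0),
                  latitude = c(3.25, 3.75),
                  longitude = c(72.5, 73.25),
                  fields = "all"),
      error = function(e) NULL
    )
    if (!is.null(res)) return(res)
    Sys.sleep(wait_sec)  # be nice to the server before retrying
  }
  stop("Dataset still unavailable after ", max_tries, " attempts")
}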
