tf.Transform is handy for feature preprocessing, but it is not efficient to run on a large dataset without distributed computation. tf.Transform runs on Apache Beam, which, to my understanding, can use multiple runners (Dataflow, the Spark runner, etc.), but I can't find any example of running tf.Transform on Spark. I am wondering whether it is supported at the moment.
I don't think you can run tf.Transform on Spark at this time.
tf.Transform is written in Python, and Beam's Spark runner only supports Java. AFAIK only Google's Cloud Dataflow runner works with Python and tf.Transform. There is one article that mentions PySpark, but I'm not sure how that fits in.
There are ongoing Beam runner developments, and the one furthest along is probably the Flink runner, which has a Python SDK, but it is still under development and support and examples are very sparse. Here is a Stack Overflow post about setting it up.
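To make the runner question concrete, below is a rough sketch of a small tf.Transform pipeline on Beam's local DirectRunner. The runner is chosen purely through Beam pipeline options, which is why Python SDK support in the runner is the blocking issue. The metadata helpers shown here match recent tf.Transform releases and may differ in older ones, so treat this as an illustration rather than a canonical recipe.

```python
# Rough sketch: tf.Transform on Beam's DirectRunner (local execution).
# Module paths for the metadata helpers vary between tf.Transform releases.
import tempfile

import apache_beam as beam
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from apache_beam.options.pipeline_options import PipelineOptions
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils


def preprocessing_fn(inputs):
    # A full-pass analyzer: scaling needs the global min/max of 'x'.
    return {"x_scaled": tft.scale_to_0_1(inputs["x"])}


raw_data = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
raw_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(
        {"x": tf.io.FixedLenFeature([], tf.float32)}))

# The runner is selected through ordinary Beam pipeline options; only
# runners with a Python SDK (DirectRunner, DataflowRunner) can execute this.
options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=options) as pipeline:
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        raw_pcoll = pipeline | beam.Create(raw_data)
        (transformed_pcoll, transformed_metadata), transform_fn = (
            (raw_pcoll, raw_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
```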
Can one implement automated UI testing for any UI designed using Azure GUIX? I came across an earlier post asking the same question.
Azure GUIX automated testing specifically for languages
I searched for Azure Test Harness but could not find much on it. Any suggestions?
We have added a string-fit test in GUIX Studio for the 6.1.9.1 patch release, under Edit -> Run String Fit Test. This release has been pushed to the App Store and will be rolling out over the next couple of days. The test ensures that every string in every language fits within its assigned widgets. Of course you can assign and re-assign strings at runtime, which Studio can't test for; only the known assignments of widgets and fonts can be statically tested by GUIX Studio.
For our own internal regression testing, we build our test apps as Win32 apps and write Python scripts that generate events into those apps to drive them. We compute md5sums of the canvas memory and compare the computed values with "golden values" to ensure nothing has been broken. We haven't yet instrumented anything similar to support on-target regression testing, but we have this feature in our backlog; I will see if we can get it on the priority list for the next release.
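For anyone who wants to replicate that kind of harness, the golden-value comparison itself is simple. In the sketch below, capture_canvas_memory and the event-injection step are placeholders for whatever mechanism your own Win32 test build exposes; they are not GUIX APIs.

```python
# Illustrative sketch of golden-value regression checking: hash the raw
# canvas/framebuffer bytes after driving the UI and compare against a
# stored "golden" hash.
import hashlib
import json


def md5_of_canvas(canvas_bytes):
    """Return the md5 hex digest of the raw canvas memory."""
    return hashlib.md5(canvas_bytes).hexdigest()


def check_against_golden(test_name, canvas_bytes, golden_file="golden_values.json"):
    """Compare the current canvas hash with the stored golden value."""
    with open(golden_file) as f:
        golden = json.load(f)  # e.g. {"main_screen_after_click": "9e107d..."}
    actual = md5_of_canvas(canvas_bytes)
    expected = golden[test_name]
    if actual != expected:
        raise AssertionError(
            "%s: canvas hash %s does not match golden %s" % (test_name, actual, expected))


# Usage (placeholders for your own harness):
# inject_click(app, x=120, y=48)                  # hypothetical event injection
# check_against_golden("main_screen_after_click", capture_canvas_memory(app))
```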
Best Regards
I am writing a machine learning toolkit to run an algorithm with different settings in parallel (each process runs the algorithm for one setting). I am deciding between mpi4py and Python's built-in multiprocessing.
There are a few pros and cons I am considering.
Easy-to-use:
mpi4py: there seem to be more concepts to learn and a few more tricks needed to make it work well
multiprocessing: quite easy and clean API
Speed:
mpi4py: people say it is more low-level, so I expect it can be faster than Python multiprocessing?
multiprocessing: much slower compared with mpi4py?
Clean and short code:
mpi4py: seems to require more code
multiprocessing: preferred; easy-to-use API
The working context is that I aim to run the code on a single computer or a GPU server; I am not really targeting running across different machines on a network (which only MPI can do).
Since the main goal is machine learning, the parallelization does not need to be highly optimal; the key goal is to keep the code base easy, clean, and quick to maintain while still exploiting the benefits of parallelization.
Given the background described above, is multiprocessing likely to be enough, or is there a strong reason to use mpi4py?
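For concreteness, this is roughly what I have in mind for the multiprocessing version (a minimal sketch; run_algorithm and the settings dicts are placeholders for my own code):

```python
# Minimal multiprocessing sketch: one worker process per setting.
# run_algorithm() is a placeholder for the actual training routine.
from multiprocessing import Pool


def run_algorithm(setting):
    """Train/evaluate the algorithm for a single setting and return a result."""
    # ... real work would happen here ...
    return {"setting": setting, "score": setting["lr"] * setting["depth"]}


if __name__ == "__main__":
    settings = [{"lr": lr, "depth": d} for lr in (0.01, 0.1) for d in (3, 5)]
    with Pool(processes=4) as pool:
        results = pool.map(run_algorithm, settings)
    print(results)
```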
By using mpi4py you can divide the task across multiple processes, but on a single computer with a limited number of cores the benefit will be limited. However, you might find it handy during training.
mpi4py is constructed on top of the MPI-1/2 specifications and provides an object-oriented interface which closely follows the MPI-2 C++ bindings.
MPI for Python provides MPI bindings for the Python language, allowing programmers to exploit multiple-processor computing systems.
MPI for Python supports convenient, pickle-based communication of generic Python objects as well as fast, near C-speed, direct array data communication of buffer-provider objects.
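For comparison, the same "one setting per process" pattern in mpi4py looks roughly like the sketch below (launched with mpiexec, which is one of the extra concepts the question mentions); run_algorithm is again just a placeholder:

```python
# Rough mpi4py sketch of the "one setting per process" pattern.
# Run with: mpiexec -n 4 python run_settings.py
from mpi4py import MPI


def run_algorithm(setting):
    """Placeholder for the actual training routine."""
    return {"setting": setting, "score": setting["lr"] * setting["depth"]}


comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # The root rank builds one setting per process and scatters them.
    settings = [{"lr": 0.01 * (i + 1), "depth": i + 2} for i in range(size)]
else:
    settings = None

my_setting = comm.scatter(settings, root=0)   # pickle-based object communication
my_result = run_algorithm(my_setting)
results = comm.gather(my_result, root=0)      # root collects all results

if rank == 0:
    print(results)
```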
I know the Java language very well, and I created a custom library in Java.
For execution I am using jybot to run the scripts.
When I run scripts that involve an Oracle database connection, I get errors such as cx_Oracle not found, but when I run the same script with the pybot option I get no errors.
I understand that when I execute the script using jybot, the way prerequisite folders are checked is different.
I want to know which is better, or has more functionality, for creating our custom library: Java or Python.
I also want to know the difference between jybot and pybot when it comes to the execution of scripts.
You are asking three questions:
1. What is the difference between pybot (Python) and Robot Framework running on Jython (jybot)?
2. What is the better approach for developing custom libraries?
3. What causes the Oracle problems?
For question 1, the answer is that in principle the same core code runs whether Robot Framework is executed on Python or on Jython, so in that sense it shouldn't matter much. However, as most people run the pure Python flavour, that version would probably be the better choice from a support perspective. That said, if you and your colleagues are more comfortable with Java, then Jython may be the better option for you.
Regarding question 2: this follows the same line as the answer to question 1. If you feel more comfortable with Java, then that should be fine. However, since Robot Framework at its core is a Python application (even on Jython), it makes more sense to write libraries in Python. This has also been asked before, and a tutorial about the Remote Library approach is also good to read. In any case, the official documentation holds great examples as well.
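To show why the Python route is lightweight, here is a minimal sketch of a Python keyword library (the class and keyword names are only examples); every public method becomes a keyword, and the file is kept Python-2 compatible so it also runs under Jython/jybot:

```python
# ExampleLibrary.py - a minimal Robot Framework keyword library sketch.
# Every public method of the class becomes a keyword (e.g. "Values Should Match").
class ExampleLibrary(object):

    ROBOT_LIBRARY_SCOPE = "GLOBAL"

    def values_should_match(self, actual, expected):
        """Fails unless the two arguments are equal as strings."""
        if str(actual) != str(expected):
            raise AssertionError("%s != %s" % (actual, expected))

    def count_items(self, items):
        """Returns the number of items in a list passed from a test case."""
        return len(items)


# In a .robot file you would import it with:
#   Library    ExampleLibrary.py
```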
For your last question, please provide more details, or better yet, create a new question for it.
There are a few options for accessing R libraries in Spark:
directly using SparkR
using language bindings like rpy2 or rscala
using a standalone service like OpenCPU
It looks like SparkR is quite limited, OpenCPU requires keeping an additional service running, and the bindings can have stability issues. Is there something specific to the Spark architecture that makes it hard to use any of these solutions?
Do you have any experience with integrating R and Spark that you can share?
The main language for the project seems like an important factor.
If pyspark is a good way for you to use Spark (meaning that you are accessing Spark from Python), accessing R through rpy2 should not be much different from using any other Python library with a C extension (see the sketch at the end of this answer).
There are reports of users doing so (although with occasional questions such as "How can I partition pyspark RDDs holding R functions" or "Can I connect an external (R) process to each pyspark worker during setup").
If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations would be the way to go.
If your main language is Scala, rscala should be your first try.
While the combo pyspark + rpy2 would seem the most "established" (as in "uses the oldest and probably most-tried codebase"), this does not necessarily mean it is the best solution (and young packages can evolve quickly). I would first assess which language is preferred for the project and try the options from there.
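If you do go the pyspark + rpy2 route, the basic pattern is to initialise rpy2 inside the worker rather than on the driver, for example in a mapPartitions function. A minimal sketch, assuming R and rpy2 are installed on every worker node (Spark will not set that up for you):

```python
# Minimal pyspark + rpy2 sketch: call an R function from each worker.
from pyspark import SparkContext


def r_sqrt_partition(values):
    # Import inside the function so R is initialised in the worker
    # process (once per partition), not on the driver.
    import rpy2.robjects as robjects
    r_sqrt = robjects.r["sqrt"]
    for v in values:
        yield float(r_sqrt(v)[0])


sc = SparkContext(appName="rpy2-from-pyspark")
result = (sc.parallelize([1.0, 4.0, 9.0, 16.0], 2)
            .mapPartitions(r_sqrt_partition)
            .collect())
print(result)  # [1.0, 2.0, 3.0, 4.0]
```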
There don't seem to be many options for deploying predictive models in production, which is surprising given the explosion of Big Data.
I understand that the open PMML standard can be used to export models as an XML specification. This can then be used for in-database scoring/prediction. However, it seems that to make this work you need to use the PMML plugin from Zementis, which means the solution is not truly open source. Is there an easier, open way to map PMML to SQL for scoring?
Another option would be to use JSON instead of XML to output model predictions. But in this case, where would the R model sit? I'm assuming it would always need to be mapped to SQL...unless the R model could sit on the same server as the data and then run against that incoming data using an R script?
Any other options out there?
The following is a list of the alternatives that I have found so far for deploying an R model in production. Please note that the workflow for using these products varies significantly from one to another, but they are all oriented toward facilitating the process of exposing a trained R model as a service:
openCPU
AzureML
DeployR
yhat (already mentioned by @Ramnath)
Domino
Sense.io
The answer really depends on what your production environment is.
If your "big data" are on Hadoop, you can try this relatively new open source PMML "scoring engine" called Pattern.
Otherwise you have no choice (short of writing custom model-specific code) but to run R on your server. You would use save to save your fitted models in .RData files and then load and run corresponding predict on the server. (That is bound to be slow but you can always try and throw more hardware at it.)
How you do that really depends on your platform. Usually there is a way to add "custom" functions written in R. The term is UDF (user-defined function). In Hadoop you can add such functions to Pig (e.g. https://github.com/cd-wood/pigaddons) or you can use RHadoop to write simple map-reduce code that would load the model and call predict in R. If your data are in Hive, you can use Hive TRANSFORM to call external R script.
There are also vendor-specific ways to add functions written in R to various SQL databases. Again look for UDF in the documentation. For instance, PostgreSQL has PL/R.
You can create RESTful APIs for your R scripts using plumber (https://github.com/trestletech/plumber).
I wrote a blog post about it (http://www.knowru.com/blog/how-create-restful-api-for-machine-learning-credit-model-in-r/), using the deployment of credit models as an example.
In general, I do not recommend PMML because the packages you used might not support translation to PMML.
A common practice is scoring a new/updated dataset in R and moving only the results (IDs, scores, probabilities, other necessary fields) into the production environment/data warehouse.
I know this has its limitations (infrequent refreshes, reliance upon IT, data-set size and computing-power restrictions) and may not be the cutting-edge answer many (of your bosses) are looking for; but for many use cases this works well (and is cost-friendly!).
It’s been a few years since the question was originally asked.
For rapid prototyping I would argue the easiest approach currently is to use the Jupyter Kernel Gateway. This allows you to add REST endpoints to any cell in your Jupyter notebook. This works for both R and Python, depending on the kernel you’re using.
This means you can easily call any R or Python code through a web interface. When used in conjunction with Docker it lends itself to a microservices approach to deploying and scaling your application.
Here’s an article that takes you from start to finish to quickly set up your Jupyter Notebook with the Jupyter Kernel Gateway.
Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users
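To make the idea concrete, here is roughly what an endpoint cell looks like in the Kernel Gateway's notebook-http mode; the annotation-comment style and the injected REQUEST variable are from memory, so double-check them against the Kernel Gateway documentation for your version:

```python
# A single notebook cell exposed as an HTTP endpoint by Jupyter Kernel Gateway
# (notebook-http mode). The cell is selected by the annotation comment below,
# and the gateway injects the incoming request as a JSON string named REQUEST.

# GET /score
import json

req = json.loads(REQUEST)                           # injected by the kernel gateway
x = float(req.get("args", {}).get("x", ["0"])[0])   # query parameters arrive as lists
print(json.dumps({"score": 2.0 * x + 1.0}))         # printed output becomes the response
```

It is started with something like `jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.notebook_http --KernelGatewayApp.seed_uri=./model_notebook.ipynb` (flag names from memory), after which a request to /score?x=3 returns the JSON printed by that cell.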
For moving solutions to production, the leading approach in 2019 is to use Kubeflow. Kubeflow was created and is maintained by Google, and makes "scaling machine learning (ML) models and deploying them to production as simple as possible."
From their website:
You adapt the configuration to choose the platforms and services that you want to use for each stage of the ML workflow: data preparation, model training, prediction serving, and service management.
You can choose to deploy your workloads locally or to a cloud environment.
Elise from Yhat here.
As @Ramnath and @leo9r mentioned, our software allows you to put any R (or Python, for that matter) model directly into production via REST API endpoints.
We handle real-time or batch, as well as all of the model testing and versioning + systems management associated with the process.
This case study we co-authored with VIA SMS might be useful if you're thinking about how to get R models into production (their data science team was recoding into PHP prior to using Yhat).
Cheers!