How do I extract non-HANA ECC tables into R?

I find that there's very little documentation on how to extract SAP tables into R.
I'm not talking about SAP HANA.
Currently it's quite troublesome: I have to extract the SAP tables manually through a GUI, export them to a tabular format, and only then can I import them with my R script.
The solution I'm currently exploring is to have my SAP colleagues export those SAP tables into a SQL database, so that I can query the tables from R.
Ideally I want to cut this seemingly unnecessary step of having the SAP tables exported into a database.

For SAP R/3 systems (or what you call ECC), your best bet would be executing remote function calls (i.e. RFC).
Normally these would be supported by open source interfaces for at least the more recent versions (e.g. 4.6 or above).
However, they are fairly scarce and I know of only one such implementation in R, the RSAP package. You'd also need to download the NW RFC SDK, and there may be further requirements depending on your OS (e.g. which Visual C++ runtime you'd need on Windows, etc.).
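For illustration, here is a minimal sketch of what an RFC-based pull could look like with RSAP. The connection parameters and the table/field names are placeholders, and the exact arguments of RSAPConnect/RSAPReadTable may vary between versions, so check the package documentation before relying on this.
    # Minimal sketch, assuming the RSAP package and the NW RFC SDK are installed.
    # All connection parameters below are placeholders for your own system.
    library(RSAP)

    con <- RSAPConnect(ashost = "sap.example.com",   # application server (placeholder)
                       sysnr  = "00",
                       client = "100",
                       user   = "MYUSER",
                       passwd = "********",
                       lang   = "EN")

    # RSAPReadTable wraps RFC_READ_TABLE; here we pull a few fields from the
    # flight demo table SFLIGHT, filtered by a WHERE-style option.
    flights <- RSAPReadTable(con, "SFLIGHT",
                             options = list("CARRID = 'LH'"),
                             fields  = list("CARRID", "CONNID", "FLDATE", "PRICE"))

    RSAPClose(con)
    head(flights)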
There's also a slightly more widely recognised equivalent in Python, the PyRFC.
On the other hand, you may try Robotic Process Automation (RPA) to interact with the GUI in an automated way. One of the options is UiPath, but there are others. This way you could automate the table extraction - and at the same time you can also call R scripts directly from the RPA.
Overall - to be honest - the solution with extracting tables into a separate database does seem to be the best alternative (compared to what I've described above).
Note: the above presumes that, for whatever reason (usually security), you cannot access the database underlying ECC directly through ODBC calls. Otherwise the instructions for connecting and running SQL from R are the same as for HANA or similar.

Consider using RODBC. This package lets you add different ODBC sources and use them from RStudio.
Follow this article and don't be put off by the word "HANA" - this approach works with any database, not only HANA.
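For example, once your administrator has set up an ODBC data source for the underlying database, the R side is just standard RODBC. The DSN name, credentials and query below are placeholders only:
    # Sketch only: "ECC_DSN" stands for whatever ODBC data source is
    # configured for the database underlying your ECC system.
    library(RODBC)

    ch <- odbcConnect("ECC_DSN", uid = "myuser", pwd = "mypassword")

    # Query a table directly with SQL (MARA is the material master table).
    mara <- sqlQuery(ch, "SELECT MATNR, MTART, MATKL FROM MARA WHERE MTART = 'FERT'")

    close(ch)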

Related

Using R in Apache Spark

There are some options to access R libraries in Spark:
using SparkR directly
using language bindings like rpy2 or rscala
using a standalone service like OpenCPU
It looks like SparkR is quite limited, OpenCPU requires running an additional service, and the bindings can have stability issues. Is there something specific to the Spark architecture that makes it hard to use any of these solutions?
Do you have any experience with integrating R and Spark you can share?
The main language for the project seems like an important factor.
If pyspark is a good way to use Spark for you (meaning that you are accessing Spark from Python), accessing R through rpy2 should not be much different from using any other Python library with a C extension.
There are reports of users doing so (although with occasional questions such as "How can I partition pyspark RDDs holding R functions" or "Can I connect an external (R) process to each pyspark worker during setup").
If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations would be the way to go.
If your main language is Scala, rscala should be your first try.
While the combo pyspark + rpy2 would seem the most "established" (as in "uses the oldest and probably most-tried codebase"), this does not necessarily mean that it is the best solution (and young packages can evolve quickly). I'd assess first what is the preferred language for the project and try options from there.
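For reference, a minimal SparkR sketch (Spark 2.x-style API; the master setting and the toy function are placeholders) showing how ordinary R code can be pushed out to the workers with spark.lapply:
    # Minimal SparkR sketch, assuming SPARK_HOME points at a Spark 2.x installation.
    library(SparkR)

    sparkR.session(master = "local[*]", appName = "r-on-spark-demo")

    # Run an ordinary R function in parallel on the workers; each element of
    # the input list is processed remotely and the results are collected back.
    fit_one <- function(seed) {
      set.seed(seed)
      idx <- sample(nrow(cars), 30)
      coef(lm(dist ~ speed, data = cars[idx, ]))
    }
    results <- spark.lapply(1:8, fit_one)

    sparkR.session.stop()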

Connecting to a SSAS cube using R

Is it possible to query a SQL Server Analysis Services cube using R?
I have this cube on a different, external server and I am working from my machine, but I have admin privileges on the server with my domain account.
To put it simply, I want to create an Analysis Services solution, using some mining algorithm to examine data contained in the cube. I can do that with Excel, for instance, but I need to use R to take advantage of several interesting clustering algorithms different from those MS offers.
Is that possible? Which package should I import in R?
Maybe something useful is coming from here:
https://www.linkedin.com/groups/Importing-MDX-OLAP-cube-data-77616%2ES%2E265568694?view=&gid=77616&item=265568694&type=member&commentID=discussion%3A265568694%3Agroup%3A77616&trk=hb_ntf_COMMENTED_ON_GROUP_DISCUSSION_YOU_COMMENTED_ON&_mSplash=1
At the moment it's not available on CRAN, but maybe you can give it a try.
In the meantime you can also use an ODBC / OLEDB bridge and use the RODBC package from inside R.
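As a rough sketch of the bridge approach - the DSN and the MDX query below are purely illustrative, and whether it works depends on the OLE DB provider and bridge you have available:
    # Sketch only: "SSAS_BRIDGE" is a hypothetical DSN pointing at an
    # ODBC/OLEDB bridge configured against the Analysis Services instance.
    library(RODBC)

    ch <- odbcConnect("SSAS_BRIDGE")

    mdx <- "SELECT {[Measures].[Sales Amount]} ON COLUMNS,
                   [Date].[Calendar Year].MEMBERS ON ROWS
            FROM [Adventure Works]"

    cube_data <- sqlQuery(ch, mdx)
    close(ch)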
Hope this helps a little
Tom

Options for deploying R models in production

There doesn't seem to be too many options for deploying predictive models in production which is surprising given the explosion in Big Data.
I understand that the open-source PMML standard can be used to export models as an XML specification, which can then be used for in-database scoring/prediction. However, it seems that to make this work you need to use the PMML plugin by Zementis, which means the solution is not truly open source. Is there an easier, open way to map PMML to SQL for scoring?
Another option would be to use JSON instead of XML to output model predictions. But in this case, where would the R model sit? I'm assuming it would always need to be mapped to SQL...unless the R model could sit on the same server as the data and then run against that incoming data using an R script?
Any other options out there?
The following is a list of the alternatives that I have found so far for deploying an R model in production. Please note that the workflow varies significantly from one product to another, but they are all oriented toward making it easier to expose a trained R model as a service:
openCPU
AzureML
DeployR
yhat (already mentioned by @Ramnath)
Domino
Sense.io
The answer really depends on what your production environment is.
If your "big data" are on Hadoop, you can try this relatively new open source PMML "scoring engine" called Pattern.
Otherwise you have no choice (short of writing custom model-specific code) but to run R on your server. You would use save to store your fitted models in .RData files, then load them on the server and run the corresponding predict on the incoming data. (That is bound to be slow, but you can always try to throw more hardware at it.)
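In its simplest form that is just base R. The sketch below uses saveRDS/readRDS (which store and restore a single object, much like save/load), with a toy glm standing in for your actual model:
    # On the modelling machine: fit the model and persist it to disk.
    fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
    saveRDS(fit, "model.rds")        # or: save(fit, file = "model.RData")

    # On the production server: load the model and score incoming data.
    fit <- readRDS("model.rds")
    incoming <- mtcars[1:5, c("mpg", "wt")]   # stand-in for the new data batch
    scores <- predict(fit, newdata = incoming, type = "response")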
How you do that really depends on your platform. Usually there is a way to add "custom" functions written in R. The term is UDF (user-defined function). In Hadoop you can add such functions to Pig (e.g. https://github.com/cd-wood/pigaddons) or you can use RHadoop to write simple map-reduce code that would load the model and call predict in R. If your data are in Hive, you can use Hive TRANSFORM to call external R script.
There are also vendor-specific ways to add functions written in R to various SQL databases. Again look for UDF in the documentation. For instance, PostgreSQL has PL/R.
You can create RESTful APIs for your R scripts using plumber (https://github.com/trestletech/plumber).
I wrote a blog post about it (http://www.knowru.com/blog/how-create-restful-api-for-machine-learning-credit-model-in-r/) using the deployment of credit models as an example.
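A minimal plumber sketch looks roughly like this; the endpoint, parameters and model file are illustrative only:
    # api.R - a minimal plumber endpoint; model.rds and the inputs are placeholders.
    library(plumber)

    model <- readRDS("model.rds")

    #* Score one observation
    #* @param mpg numeric
    #* @param wt numeric
    #* @post /predict
    function(mpg, wt) {
      newdata <- data.frame(mpg = as.numeric(mpg), wt = as.numeric(wt))
      list(probability = unname(predict(model, newdata, type = "response")))
    }
You then serve the file with plumber::plumb("api.R")$run(port = 8000) and POST to /predict.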
In general, I do not recommend PMML because the packages you used might not support translation to PMML.
A common practice is scoring a new/updated dataset in R and moving only the results (IDs, scores, probabilities, other necessary fields) into the production environment/data warehouse.
I know this has its limitations (infrequent refreshes, reliance upon IT, data set size/computing power restrictions) and may not be the cutting edge answer many (of your bosses) are looking for; but for many use-cases this works well (and is cost friendly!).
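In practice that can be as simple as a scheduled R script that scores the batch and writes only the results back. In the sketch below (using the DBI and odbc packages) the DSN, table names and model file are all placeholders:
    # Sketch: score a new batch in R and push only the results to the warehouse.
    library(DBI)
    library(odbc)

    model <- readRDS("model.rds")

    con <- dbConnect(odbc::odbc(), dsn = "warehouse_dsn")
    new_batch <- dbGetQuery(con, "SELECT id, mpg, wt FROM staging.new_customers")

    results <- data.frame(
      id        = new_batch$id,
      score     = predict(model, newdata = new_batch, type = "response"),
      scored_at = Sys.time()
    )

    dbWriteTable(con, SQL("analytics.model_scores"), results, append = TRUE)
    dbDisconnect(con)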
It’s been a few years since the question was originally asked.
For rapid prototyping I would argue the easiest approach currently is to use the Jupyter Kernel Gateway. This allows you to add REST endpoints to any cell in your Jupyter notebook. This works for both R and Python, depending on the kernel you’re using.
This means you can easily call any R or Python code through a web interface. When used in conjunction with Docker it lends itself to a microservices approach to deploying and scaling your application.
Here’s an article that takes you from start to finish to quickly set up your Jupyter Notebook with the Jupyter Kernel Gateway.
Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users
For moving solutions to production the leading approach in 2019 is to use Kubeflow. Kubeflow was created and is maintained by Google, and makes "scaling machine learning (ML) models and deploying them to production as simple as possible."
From their website:
You adapt the configuration to choose the platforms and services that you want to use for each stage of the ML workflow: data preparation, model training, prediction serving, and service management.
You can choose to deploy your workloads locally or to a cloud environment.
Elise from Yhat here.
Like @Ramnath and @leo9r mentioned, our software allows you to put any R (or Python, for that matter) model directly into production via REST API endpoints.
We handle real-time or batch, as well as all of the model testing and versioning + systems management associated with the process.
This case study we co-authored with VIA SMS might be useful if you're thinking about how to get R models into production (their data sci team was recoding into PHP prior to using Yhat).
Cheers!

How much data do I need to have to make use of Presto?

How much data do I need to have to make use of Presto? The web site states that it can query data sizes from gigabytes to petabytes. I understand how it is used to query very large datasets, but is anyone using it for hundreds of gigabytes?
Currently, Presto is most useful if you already have an existing Hive installation. If you are using Hive, you should definitely try Presto. If all your data fits in a relational database like PostgreSQL or MySQL on a single machine, and you are happy with the performance, then keep using that.
However, Presto should be much faster than either of those databases on a single machine for analytic queries because it executes a query in parallel. Neither of those databases parallelize the execution of individual queries. At the moment, using Presto requires setting up HDFS and Hive (even on a single machine), so getting started will be more work than if you already have an existing Hive installation.
Or, you can take a look at Impala - which has been available as production-ready software for six months. Like Presto, Impala is a distributed SQL query engine for data in HDFS that circumvents MapReduce. Unlike Presto, there is a commercial vendor providing support (Cloudera).
That said, David's comments about data size still apply. Use the right tool for the job.

connecting to salesforce using SAS

I have a client with salesforce enterprise edition. I need to connect to and extract the salesforce data using Base SAS (SAS/Access for ODBC is licensed).
How can this be achieved? Is it possible to map a libname using an ODBC engine, or is it necessary to use the web APIs?
Don't know about Salesforce support for ODBC, but certainly it is possible to map a libname using ODBC.
http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/access_odbc_7365.pdf
has examples; basically it is:
libname <name> odbc <connection options>;
Salesforce specific SAS product:
http://support.sas.com/documentation/cdl/en/bidsag/61236/HTML/default/viewer.htm#a003279102.htm
That requires more than base SAS, of course, but if SAS BI is an option it seems pretty easy to configure.
Another option that is not free (well, it seems to be free for 15 days):
http://blogs.datadirect.com/2012/02/sas-access-to-salesforce-crm-for-superior-odbc-integration-with-sas.html
It doesn't seem like there is a free ODBC driver for Salesforce, though, unless something's changed - for example, http://success.salesforce.com/ideaView?id=08730000000Bqqu suggests it's something desired but not available.
So for a free solution you may want to use the Web API...

Resources