With R and Shiny Pro it is possible to implement multi-user analytical applications.
When a database is used to store intermediate data, giving multiple users access to the database becomes very relevant and necessary.
Currently I'm using MonetDB / MonetDB.R configured (as is usual for R) for single-user access, which means that user operations are executed in sequence.
I would like to implement some type of connection pooling with the DB.
Judging from past SO responses, the driver does not include connection pooling.
Are there alternatives within these toolsets?
I am not aware of any connection pool implemented for R DBI connections. The setup you describe seems rather special. You could just create a connection for every client session. MonetDB limits the number of concurrent connections, however; to raise this limit, set max_clients to a higher value, for example by starting mserver5 with --set max_clients=1000 or (if you use monetdbd) by running monetdb set nclients=1000 somedb. Of course, DBI connection pooling would also be a feature request for Shiny Pro and not for MonetDB.R.
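As a sketch of the one-connection-per-session approach in Shiny (the host, database, credentials, and table here are placeholders, and the exact dbConnect() arguments for MonetDB.R may vary by version):

library(shiny)
library(DBI)
library(MonetDB.R)

server <- function(input, output, session) {
  # One DBI connection per Shiny session; all values below are placeholders
  con <- dbConnect(MonetDB.R(), host = "localhost", dbname = "demo",
                   user = "monetdb", password = "monetdb")
  # Give the connection back when the session ends
  session$onSessionEnded(function() dbDisconnect(con))

  output$preview <- renderTable({
    dbGetQuery(con, "SELECT * FROM sometable LIMIT 10")
  })
}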
Related
How can I run normal R code on a SQL Server without using the Microsoft rx functions? I think the compute context "RxInSqlServer" isn't the right one, but I couldn't find good information about the other compute context options.
Is this possible with this statement?
rxSetComputeContext(ComputeContext)
Or can I only use it to run rx functions? Another option could be to set the server connection in RStudio or Visual Studio?
My problem is: I want to analyse data from Hadoop via an ODBC connection on the SQL Server, so I would like to use the performance of the remote SQL Server and not the data in SQL Server. And then I want to analyse the Hadoop data with sparklyr.
Summary: I want to use the performance of the remote server, not the SQL Server data. So RStudio should not run locally; it should run on and use the memory of the remote server.
Thanks!
The concept of a compute context in Microsoft R Server is, “Where will the computation be performed?”
When setting compute context, you are telling Microsoft R Server that computation will occur on either the local machine (with either “local” or “localpar” compute contexts), or, the script will be executed on a remote machine which has Microsoft R Server installed on it. Remote compute contexts are defined by creating a compute context object, and then setting the context to that object.
For SQL Server, you would create an RxInSqlServer() object, and then call rxSetComputeContext() on that object. For Hadoop, the object would be created via the RxHadoopMR() call.
In code, it would look something like:
CC <- RxHadoopMR( < context defined here > )
rxSetComputeContext(CC)
To see usage for defining a context, please see the documentation (enter "?RxHadoopMR" in the R Client, no quotes).
Any call to an "rx" function after this will be performed on the Hadoop cluster, with no data being transferred to the client other than the results.
RxInSqlServer() would follow the same pattern.
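For example, a minimal sketch of that pattern (the connection string is a placeholder, not a value from the question):

# Define a SQL Server compute context, then make it the active context
connStr <- "Driver=SQL Server;Server=myServer;Database=myDB;Trusted_Connection=True"
sqlCC <- RxInSqlServer(connectionString = connStr)
rxSetComputeContext(sqlCC)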
Note: To perform any remote computation, Microsoft R Server must be installed on that machine.
If you wish to run a standard R function on a remote compute context, you must wrap that function in a call to rxExec(). rxExec() is designed as an interface to parallelize any open-source R function and allow for its execution on a remote context. Please see the documentation (enter "?rxExec" in the R Client, no quotes) for usage.
For information on efficient parallelization, please see this blog: https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2016/11/14/performance-optimization-when-using-rxexec-to-parallelize-algorithms/
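As an illustration only (the algorithm and data are arbitrary choices, not from the question), a standard R function such as kmeans can be pushed to the current compute context like this:

# Run open-source kmeans remotely, four times in parallel;
# results come back to the client as a list
results <- rxExec(kmeans, x = scale(iris[, 1:4]), centers = 3, timesToRun = 4)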
You called out "without using the Microsoft rx-functions", and I am interpreting this as, "I would like to use open-source R algorithms on data in SQL Server". With Microsoft R Server, you must use rxExec() as the interface to run open-source R. If you want to use no rx functions at all, you will need to query the data to your local machine and then use open-source R. To interface with a remote context using Microsoft R Server, the bare minimum is rxExec().
This is how you will be able to achieve the first part of your ask: "How can I run normal R code on a SQL Server without using the Microsoft rx functions? I think the compute context "RxInSqlServer" isn't the right one?"
For your second ask: "My problem is: I want to analyse data from Hadoop via an ODBC connection on the SQL Server, so I would like to use the performance of the remote SQL Server and not the data in SQL Server. And then I want to analyse the Hadoop data with sparklyr."
First, I'd like to comment that with the release of Microsoft R Server 9.1, you can use sparklyr in-line with an MRS Spark connection, for some examples, please see this blog: https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2017/04/19/new-features-in-9-1-microsoft-r-server-with-sparklyr-interoperability/
Secondly, what you are trying to do is very involved. I can think of two ways that this is possible.
One is, if you have SQL Server PolyBase, you can configure SQL Server to make a virtual table referencing data in Hadoop, similar to Hive. After you have referenced your Hadoop data in SQL Server, you would use an RxInSqlServer() compute context on these tables. This would analyse the data in SQL Server and return the results to the client.
Here is a detailed blog explaining an end-to-end setup on Cloudera and SQL Server: https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2016/10/17/integrating-polybase-with-cloudera-using-active-directory-authentication/
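Once the PolyBase external table exists, the R side might look like this sketch (the table name and connection string are hypothetical):

# Point an RxSqlServerData source at the PolyBase external table
connStr <- "Driver=SQL Server;Server=myServer;Database=myDB;Trusted_Connection=True"
hadoopTable <- RxSqlServerData(table = "dbo.HadoopExternalTable",
                               connectionString = connStr)
# Compute in-database; only the results travel back to the client
rxSetComputeContext(RxInSqlServer(connectionString = connStr))
rxSummary(~ ., data = hadoopTable)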
The second, which I would NOT recommend, is untested and hacky, and has the following prerequisites:
1) Your Hadoop cluster must have OpenSSH installed and configured
2) Your SQL Server Machine must have the ability to SSH into your Hadoop Cluster
3) You must be able to place an SSH Key on your SQL Server machine in a directory which the R Services process has the ability to access
And I need to add another disclaimer here: there is no guarantee of this working, and it likely will not work. The software was not designed to operate in this fashion.
You would then do the following:
On your client machine, you would define a custom function which contains the analysis that you wish to perform; this can be open-source R functions, rx functions, or a mix.
In this custom function, before calling any other R or rx functions, you would define an RxHadoopMR compute context object which points to your cluster, referencing the SSH key in the directory on the SQL Server machine as if you were executing from that machine (in the same way that you would define the RxHadoopMR object if you were to do a remote Hadoop operation from your client machine).
Within this custom function, immediately after the RxHadoopMR object is defined, you would call rxSetComputeContext() on it.
Still in this custom function, write the actual script which will operate on the data in Hadoop.
After this function is defined, you would define an RxInSqlServer() compute context object on the client machine.
You would set your compute context to RxInSqlServer()
Then you would call rxExec() with your custom function as an input.
What this will do is execute your custom function on the SQL Server machine, which would hopefully cause it to define its compute context as your Hadoop cluster and pull the data over SSH for analysis on the SQL Server machine, returning the results to the client.
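Putting those steps together, the shape of it would be roughly the sketch below (every name, path, and data source here is hypothetical, and, per the disclaimers above, this is untested and may simply not work):

# Custom function: this body runs ON the SQL Server machine via rxExec()
hadoopJob <- function() {
  # Defined as if executing from the SQL Server machine; the key path is a placeholder
  hadoopCC <- RxHadoopMR(sshUsername = "hadoopuser",
                         sshHostname = "mycluster.example.com",
                         sshSwitches = "-i /var/opt/mssql/keys/id_rsa")
  rxSetComputeContext(hadoopCC)
  # The actual analysis of the Hadoop data would go here
  rxSummary(~ ., data = myHdfsDataSource)  # myHdfsDataSource is hypothetical
}

# On the client: set the SQL Server context, then push the function to it
rxSetComputeContext(RxInSqlServer(connectionString = connStr))
rxExec(hadoopJob)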
With that said, this is not how Microsoft R Server was designed to be used, and if you wish to optimize performance, please use Option One and configure PolyBase.
I am not sure if this is the right place to ask this question. If it is not, please point me to the right place.
I must build a multi-user, stateful (sessions; object persistence) web application that uses .NET in the backend and must connect to R in order to perform calculations on data that lies in a SQL Server 2016 DB. Basically, I need to connect an MS-based backend with R.
Everything is clear, except for the problem that I need to find an R server that handles sessions. I know shiny but I can't use it (long story).
rApache and openCPU do not handle sessions.
Rserve for Windows is very limited: parallel connections are not supported, subsequent connections share the same namespace, and sessions are not supported (the latter two being consequences of the lack of parallel connections).
Finally, I have seen Rook (i.e. Run R/Rook as a web server on startup), but I can't find anywhere, even in the docs, whether it is able to deal with sessions. My question is: is there a non-stateless R web server, or does anyone know whether Rook is stateless?
EDIT:
Apparently, this question has been around for longer: http://jeffreyhorner.tumblr.com/about#comment-789093732
When using web services (we're specifically using asmx and WCF) with ASP.NET, what is the best way to establish a SQL connection? Right now, I'm establishing a new connection for each web service call, but I'm not convinced this will be very efficient when there are thousands of users connecting. Any insight on this topic would be much appreciated.
What you are doing is fairly standard.
Assuming you are using the same connection string, the connections will be coming from the connection pool, which is already the most efficient way to get connections.
Only doing the work required and closing the connection on each call is good practice.
One thing you can do is cache results and return the cached results for calls that are not likely to result in changed data over the life of the cache item. This will reduce database calls.
It is strongly recommended that you always close the connection when you are finished using it so that the connection will be returned to the pool. You can do this using either the Close or Dispose methods of the Connection object, or by opening all connections inside a using statement in C#. Connections that are not explicitly closed might not be added or returned to the pool.
You should add "Pooling = true" (and add a non-zero "Min Pool Size") to the connection string.
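For example (the server and database names are placeholders):

Server=myServer;Database=myDB;Integrated Security=true;Pooling=true;Min Pool Size=5;Max Pool Size=100;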
Let the provider handle connection pooling for you; don't try to do better than it - you will fail.
With the default connection settings the provider will maintain a connection pool. When you close/dispose, the connection is actually just released to the pool; it is not necessarily really closed.
By default, SqlConnections make use of connection pooling, which allows the system to manage the re-use of previous connection objects rather than truly creating "new" connections for each request, up to a pool maximum value. And it's built in, so you don't really have to do anything to leverage it.
Writing your own pooling/connection manager is fraught with peril, and leads to all manner of evil, so it seems to me allowing the system to manage your connections from the pool is probably your best bet.
I'm a total unix-way guy, but now our company is creating a new application on an ASP.NET + SQL Server cluster platform.
I know the best and most efficient principles and ways to scale load, but I want to know the MS background of horizontal scaling.
The question is pretty simple: are there any built-in abilities in ASP.NET to access the least-loaded SQL server from a SQL Server cluster?
Any words, libs, links are highly appreciated.
I also would be glad to hear best SQL Server practices or success stories around this theme.
Thank you.
Pavel
SQL Server clustering is not load balancing, it is for high-availability (e.g. one server dies, cluster is still alive).
If you are using SQL Server clustering, the cluster is active/passive, in that only one server in the cluster ever owns the SQL instance, so you can't split load across both of them.
If you have two databases you're using, you can create two SQL instances and have one server in the cluster own one of the two instances, and the other server own the other instance. Then, point connection strings for one database to the first instance, and connection strings for the second database to the second instance. If one of the two instances fails, it will failover to the passive server for that instance.
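For example, the two connection strings might look like this (the instance and database names are placeholders):

Data Source=ClusterName\InstanceA;Initial Catalog=DatabaseOne;Integrated Security=True;
Data Source=ClusterName\InstanceB;Initial Catalog=DatabaseTwo;Integrated Security=True;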
An alternative (still not load-balancing, but easier to setup IMO than clustering) is database mirroring: http://msdn.microsoft.com/en-us/library/ms189852.aspx. For mirroring, you specify the partner server name in the connection string: Data Source=myServerAddress;Initial Catalog=myDataBase;User Id=myUsername;Password=myPassword;Failover Partner=myBackupServerAddress; ADO.Net will automatically switch to the failover partner if the primary fails.
Finally, another option to consider is replication. If you replicate a primary database to several subscribers, you can split your load to the subscribers. There is no built-in functionality that I am aware of to split the load, so your code would need to handle that logic.
I have heard the term connection pooling and looked for some references by googling it... But I can't get the idea of when to use it....
When should I consider using connection pooling?
What are the advantages and disadvantages of connection pooling?
Any suggestions....
The idea is that you do not open and close a single connection to your database; instead, you create a "pool" of open connections and then reuse them. Once a single thread or procedure is done, it puts the connection back into the pool so that it is available to other threads. The idea behind it is that typically you don't have more than some 50 parallel connections and that opening a connection is time- and resource-consuming.
When should I consider using connection pooling?
Always, for a production system.
What are the advantages and disadvantages of connection pooling?
Advantages:
Performance. Use a fixed pool of connections and avoid the costly creation and release of connections.
Shared infrastructure. If your database is shared between several apps, you don't want one app to exhaust all connections. Pooling helps to limit the number of connections per app.
Licensing. Depending on your database license, the number of concurrent clients may be limited. You can set a pool with the number of authorized connections. If no connection is available, the client waits until one is available, or times out.
Connectivity issues. The connection pool, which sits between the client and the database, can provide handy features such as "ping" tests, connection retries, etc., transparently to the client. In the worst case, there is a timeout.
Monitoring. You can monitor the pool, see the number of active connections, etc.
Disadvantage:
You need to set it up and configure it, which is usually trivial.
You should use connection pooling whenever the time to establish a connection is greater than zero (pretty much always) and when there is a sufficient average usage such that the connection is likely to be used again before it times out.
Advantages are that it's much faster to open/close connections, as they're not really opened and closed; they're just checked out from and returned to a pool.
Disadvantage would be in some connection pools you'll get an error if all pooled connections are in use. This usually is a good thing as it indicates a problem with the calling code not closing connections, but if you legitimately need more connections than are in the pool and haven't configured it properly, you could get errors where you wouldn't otherwise.
And of course there will be other pros and cons depending on the specific environment you're working in and database.
In .NET, if you are using the same connection string for data access then you already have connection pooling. The concept is to reuse an idle connection without having to tear it down & recreate it, thereby saving server resources.
This is, of course, taking into consideration that you are closing open connections upon completion of your work.
Connection pooling enables the re-use of an existing but currently unused database connection. By using it you eliminate the overhead of connecting to and disconnecting from the database server. It provides a significant performance boost in most cases. The only reason I can think of not to use it is if your software won't be connecting frequently enough to keep the connections alive, or if there's some bug in the pooling layer.
If we are required to communicate with the database multiple times, then it is not recommended to create a separate connection object every time, because creating and destroying connection objects impacts performance.
To overcome this problem we should use a connection pool.
If we want to communicate with the database, we request a connection from the pool. Once we have the connection, we can use it to communicate with the database.
After completing our work, we can return the connection object back to the pool instead of destroying it.
The main advantage of connection pooling is to reuse the same connection object multiple times.
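As an illustration of that checkout/return lifecycle, here is a minimal sketch using R's pool package (R to match the earlier questions on this page; the SQLite in-memory database is just a stand-in):

library(pool)
library(DBI)

# Create a pool of connections instead of one connection per request
pool <- dbPool(RSQLite::SQLite(), dbname = ":memory:")

# Queries on the pool check a connection out and return it automatically
dbGetQuery(pool, "SELECT 1 AS ok")

# Explicit checkout/return, mirroring the lifecycle described above
con <- poolCheckout(pool)
dbGetQuery(con, "SELECT 1 AS ok")
poolReturn(con)   # give the connection back to the pool, don't destroy it

poolClose(pool)   # close all pooled connections at application shutdown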