Currently I'm building a Shiny app that runs several queries against a PostgreSQL database (mainly SELECT and INSERT statements). The application works, but I'm trying to make it faster. When I compare the execution times of the same query using the RPostgreSQL package and a database client like Postico, it takes about 8 times longer with RPostgreSQL.
Any ideas for boosting performance, or for better ways of connecting to a PostgreSQL database from R?
Thanks
Have you ever heard of the dbplyr package (with the b)?
I would recommend it because it lets dplyr (without the b) be used with SQL databases.
There are many advantages, since the way you interact with your database shifts from pulling whole tables into R and then manipulating them locally, to sending dplyr verbs to the database and bringing back only the results you need. (The before/after diagrams illustrating this shift are from a great article entitled "Databases using R" by Edgar Ruiz (2017). You should take a look at it HERE for more details.)
The main advantages presented by Mr. Ruiz are, and I quote:
"
1) Run data exploration over all of the data - Instead of coming up with a plan to decide what data to import, we can focus on analyzing the data inside the database, which in turn should yield faster insights.
2) Use the SQL Engine to run the data transformations - We are, in effect, pushing the computation to the database because dplyr is sending SQL queries to the database.
3) Collect a targeted dataset - After becoming familiar with the data and choosing the data points that will either be shared or modeled, a final query can then be used to bring back only that data into memory in R.
4) All your code is in R! - Because we are using dplyr to communicate with the database, there is no need to change language, or tools, to perform the data exploration. "
So, you will probably gain the speed you are looking for with dbplyr/dplyr.
You should give it a try.
You can find more information about it and how to establish the connection with your PostgreSQL Server using the DBI package at:
https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html
and
https://rviews.rstudio.com/2017/05/17/databases-using-r/
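To make this concrete, here is a minimal sketch of the dbplyr workflow against PostgreSQL. The connection details and the "flights" table are placeholders; substitute your own host, database, credentials and table name.

library(DBI)
library(dplyr)
library(dbplyr)

# Placeholder connection details -- substitute your own.
con <- dbConnect(
  RPostgres::Postgres(),   # RPostgreSQL::PostgreSQL() works here too
  host = "localhost", dbname = "mydb",
  user = "me", password = "secret"
)

flights_db <- tbl(con, "flights")   # lazy reference; no data is pulled yet

monthly <- flights_db %>%
  group_by(month) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE))

show_query(monthly)          # inspect the SQL that dbplyr generates
result <- collect(monthly)   # only the small aggregated result comes into R

dbDisconnect(con)

The mean() above is translated to AVG() and executed by PostgreSQL, which is usually where the speed gain over pulling raw rows into R comes from.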
I am trying to reconstruct a Cognos Transformer cube in Snowflake.
1. Do we have an option to build an OLAP cube in Snowflake (like SSAS, Cognos Transformer)?
2. Any recommendations on what the approach should be, or the steps to follow?
Currently there is no option in Snowflake similar to an SSAS cube. Once data is loaded into its databases, Snowflake lets you query it much as you would a traditional OLTP database.
For data aggregation, Snowflake offers a rich set of built-in functions. Together with UDFs, stored procedures and materialized views, you can build custom solutions for precomputed aggregations.
For data analysis you still have to rely on third-party tools; Snowflake provides a variety of connectors for accessing its database objects from other analytical tools.
There are plans to introduce an integrated tool for data aggregation, analysis and reporting in the near future.
Use TM1 to build your OLAP cube, then run Cognos over the top of the TM1 cube.
TM1 will have no problem shaping your Snowflake data into an OLAP structure.
Snowflake is not a multidimensional database; it offers analytical statements such as GROUP BY CUBE, as Oracle also does, but the result is more like a matrix of aggregations. There is no drill-down or drill-up of the kind that SSAS cubes, PowerCubes and other multidimensional databases (MDDBs) offer.
One option could be to simulate OLAP by creating ad hoc aggregations and using JavaScript to drill down and up (see the sketch below). But in my experience, operations equivalent to drilling often take more than 10 seconds unless extremely large resources are available, so Snowflake is probably not the best solution for such use cases.
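For illustration, here is a minimal sketch of such an ad hoc aggregation pushed to Snowflake with GROUP BY CUBE. It is shown from R via DBI/odbc only because that is a convenient client; any SQL client would issue the same statement. The DSN name and the sales table with region, product and amount columns are assumptions made up for the example.

library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "snowflake_dsn")   # assumed DSN

cube <- dbGetQuery(con, "
  SELECT region, product, SUM(amount) AS total_amount
  FROM sales
  GROUP BY CUBE (region, product)
  ORDER BY region, product
")
# The result holds one row per (region, product) combination plus the
# subtotal rows (NULL in the rolled-up column) and the grand total --
# a flat matrix of aggregations, not a drillable cube.

dbDisconnect(con)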
I currently have a table in BigQuery with a size of 100+ GB that I would like to retrieve into R. I am using the list_tabledata() function from the bigrquery package in R, but it takes a huge amount of time.
Does anyone have recommendations for handling this amount of data in R and for boosting performance? Any packages or tools?
tabledata.list is not a great way to consume a large amount of table data from BigQuery - as you note, it's not very performant. I'm not sure if bigrquery has support for table exports, but the best way to retrieve data from a large BigQuery table is to use an export job. This will dump the data to files on Google Cloud Storage that you can then download to your desktop. You can find more info on exporting tables in our documentation.
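Newer releases of bigrquery do expose export (extract) jobs through bq_table_save(), so a rough sketch of that route might look like the following. The project, dataset, table and bucket names are placeholders, and the exact arguments may differ between package versions.

library(bigrquery)

tbl_ref <- bq_table("my-project", "my_dataset", "big_table")   # placeholders

# Export to Google Cloud Storage; the wildcard lets BigQuery shard the
# output into multiple compressed CSV files.
bq_table_save(
  tbl_ref,
  destination_uris = "gs://my-bucket/big_table/export-*.csv.gz",
  destination_format = "CSV",
  compression = "GZIP"
)

# Download the files (e.g. with gsutil or the Cloud Console):
#   gsutil -m cp "gs://my-bucket/big_table/export-*.csv.gz" ./export/
# then read them locally:
files <- list.files("export", pattern = "\\.csv\\.gz$", full.names = TRUE)
big_table <- do.call(rbind, lapply(files, read.csv))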
Another option would be: instead of bringing that large volume of data to your code, try to bring your code to the data. This can be challenging in terms of implementing the logic in BigQuery SQL; JavaScript UDFs might help. It depends.
If this is not doable, I would recommend either using sampled data or revisiting your model.
I have a big analytics module in my system and plan to use Vertica for it.
Someone suggested that we also use Vertica for the rest of our app (a standard CRUD app with models from our domain) so as not to manage multiple databases.
Would Vertica fit this dual scenario?
High-frequency UPDATEs are probably where Vertica lags the worst. I would avoid using it for such data models.
Alec - I would like to respectfully challenge your comments on Vertica. In no way do you need to denormalize or sort data before loading. Vertica also holds the record for the fastest data loading of any database.
You also talk about Vertica not being able to do complex analytics as well as an RDBMS. Vertica IS an RDBMS, and it can do analytics faster than any other RDBMS - they prove it over and over.
As for your numbers: in my use case I load roughly 5 million records per second into my Vertica cluster and have hundreds of billions of records.
So Yaron - I would highly recommend you look at Vertica before you rule it out based on this information.
As is often the case these days, a meaningful answer depends on what you need to do. In a general sense, 'big data' solutions have grown out of large-data-volume deficiencies in RDBMS systems. No 'big data' solution can compete with the core capabilities of RDBMS systems, i.e. complex analytics, but RDBMS systems are poor (expensive) solutions for large-data-volume processing. Practical solutions, for now, have to be hybrid solutions. Vertica can be good once data is loaded, but I believe (not an expert) it requires denormalisation of data and pre-sorting before loading to perform at its best. For large data volumes this may add significantly to the required resources. There is a definite benefit to using one system for all your needs, but there are also benefits to keeping your options open.
The approach I take is to store and index new data and then provide specific feeds to various reporting/analytic engines as required. This separates the collection and storage of raw data from the complex analytic processing. I am happy to provide more details if you are interested. This separation addresses a core problem which has always been present in database systems. In the past you used to hear 'store fast, report slowly, or store slowly, report fast, but you cannot do both'. The search for a complete solution has, in the last few years, spawned the many NoSQL offerings which typically address the 'store fast' task. Some systems also provide impressive query performance by storing data in memory or cache, but this requires many servers for large data volumes. I believe NoSQL and SQL solutions can, and will, be integrated, but this is still down the track.
To give you some context, I work with scenarios where at least 1 billion records a day are loaded. If you are dealing with say 100 million records a day (big is relative), then your Vertica approach will probably suffice, otherwise I think you need to expand your options.
Test it. Each use case is different. Assuming Vertica is a solution for every use case is almost as bad as using MongoDB for every use case.
Vertica is a high-performance analytics database: column-oriented, designed to analyze incredibly large datasets, and able to scale horizontally. It's also expensive, hard to administer, and its documentation is spotty. The payoff in the right environment can easily be worth the work, obviously.
MySQL is a traditional RDBMS: row-oriented, designed to model relationships between structured data, and it works well at single-node scale (though many companies have retrofitted it to great success, exempli gratia, Facebook). It's incredibly well documented, works on seemingly any platform, language, or framework, and can be used by anyone.
My guess is that using Vertica for an employee address-book database is like showing up to a blue-collar job in a $3000 suit. Sure, it works, but is it the right tool for the job? Maybe - if you already have a Vertica license and your applications already have the requisite data adaptors/ORM/etc., go ahead and give it a shot. It's still a SQL database, so it should work fine in those situations. If your goal is minimal programming as opposed to optimal performance, though, why use Vertica at all? Sounds like something simpler would be more ideal. Vertica may or may not give better performance in a regular CRUD application environment, since it's not optimized for that, but you can always test both and see.
Vertica has many issues with high concurrency (many small transactions per minute).
In MPP systems the data is segmented across the cluster, and at times a cluster-level lock must be taken (mainly at commit time), so many commits mean many cluster-level exclusive (X) locks.
High concurrency is less of a requirement in DWH and reporting workloads, so Vertica is a good fit for those.
In most cases OLTP solutions (like CRM and so on) are required to provide high concurrency, and for that Vertica is a bad choice.
Thanks
I've nearly finished the development of a project and would like to test its performance, especially the database query calls. I'm using LINQ to SQL to search by username, but I've only got around 10 'users' in my database, so I can't really get a decent speed reading. How can I simulate thousands/millions of users in the database without actually creating new records? I've read about Selenium, but it seems that is better suited to repeated actions (simulating concurrent users?). Are there any other tools I should look into, or are there any options in VS 2008 (Professional Edition)?
Thanks
You can "trick" SQL Server into thinking there are more records than there actually are in a table using the approach outlined in this article. See the section on False SQL Server Statistics
e.g.
UPDATE STATISTICS TableName WITH ROWCOUNT=100000
will create statistics for the table as if it had 100,000 rows. You can then see what effect this has on the execution plan. But note that this is undocumented functionality, so it may give quirky behaviour.
You could just populate your table with sample data. There are various tools available to help with that, like Red Gate's SQL Data Generator. I prefer actually having large data volumes, as I think the results will be more accurate.
Old subject, combined with new tools: What would be the best/appropriate way to query data for a web application from an AspenTech IP21 (InfoPlus.21) data historian?
In the past, I've used some pretty awful queries via the Aspen SqlPlus ODBC driver, but that doesn't seem like the right approach, as it doesn't seem to install on Win 7 at all.
Anyone here have experience with that?
1) Make sure you have an appropriate version of the Aspen tools; later ones (7.1, 7.2) run on Windows 7 with no problems.
2) I have worked with Aspen IP21 for over 15 years and have never had issues with SQL performance compared to other databases like Oracle or SQL Server, as long as IP21 is on an appropriate server and the query is written appropriately for the structure of the database. Doing a join against a timestamp is going to produce a slow query. Depending on what you want to accomplish, there are multiple other ways to get data: through the HISTORY pseudo table, the AGGREGATES table, or other query techniques that are specific to IP21.
3) ODBC is still the most standard, easiest, and in my experience best-performing way to get data from IP21 from any client: ASP, .NET, web pages, other databases, VB programs, Excel VBA, etc. It may just need some optimization tweaking, probably in how the SQL is written.
I've had extensive experience using the normal SQLPlus drivers in C#/ASP.NET and performance has never been an issue. While the ODBC drivers work, I have encountered certain limitations, such as not always returning SELECT results.
As for how to check 'out of spec':
If this is for real-time values and not for ranges of time, I would suggest using record references to simply select the current value. That way the entire query stays in memory.
For time ranges you will have to select the ranges and iterate over them, which is more costly.
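To make both approaches concrete, here is a rough sketch of the corresponding queries over the SQLplus ODBC driver, shown from R with the DBI/odbc packages purely for illustration (any ODBC-capable client would issue the same SQL). The DSN name, the tag name, the IP_AnalogDef/HISTORY field names and the timestamp format are assumptions; check them against your own IP.21 version.

library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "IP21")   # assumed DSN

# Current value straight from the tag record (cheap; stays in memory):
current <- dbGetQuery(con,
  "SELECT name, ip_input_value FROM ip_analogdef WHERE name = 'FIC101'")

# A time range from the HISTORY pseudo table (more costly):
trend <- dbGetQuery(con, "
  SELECT name, ts, value
  FROM history
  WHERE name = 'FIC101'
    AND ts BETWEEN '01-JAN-17 00:00:00' AND '02-JAN-17 00:00:00'
")

dbDisconnect(con)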