Performing table merges outside of Oracle in R

If I have a set of tables that I need to extract from an Oracle server, is it always more efficient to join the tables within Oracle and have the system return the joined table, or are there cases where it would be more efficient to return two tables into R (or python) and merge them within R/Python locally?
For this discussion, let's presume that the two servers are equivalent and both have similar access to the storage systems.

I will not go into the efficiencies of the join itself, but any time you are moving data from a database into R, keep the size of the data in mind. If the dataset after joining will be much smaller (say, after an inner join), it is usually best to join in the database. If the data will expand significantly after the join (say, a cross join), then joining after extraction may be better. If there is not much difference, my preference would be to join in the database, as it can be better optimized. In fact, if the data is already in the database, try to do as much of the data preprocessing as possible before extracting it.
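To make the two options concrete, here is a minimal R sketch of both approaches, assuming a DBI connection con (e.g. via ROracle or odbc) and two hypothetical tables orders and customers joined on customer_id; the names are illustrative, not from the question.

library(DBI)

# Option 1: push the join to Oracle and pull back only the joined result
joined_db <- dbGetQuery(con, "
  SELECT o.order_id, o.amount, c.customer_name
  FROM   orders o
  JOIN   customers c ON o.customer_id = c.customer_id")

# Option 2: pull both tables in full and merge locally in R
orders    <- dbGetQuery(con, "SELECT order_id, amount, customer_id FROM orders")
customers <- dbGetQuery(con, "SELECT customer_id, customer_name FROM customers")
joined_r  <- merge(orders, customers, by = "customer_id")  # base-R equivalent of an inner join

The deciding factor is usually how many rows cross the wire: option 1 transfers only the joined result, while option 2 transfers both tables in full before the merge.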

Related

Should I use WITH instead of a JOIN on a table with a lot of data?

I have a MariaDB table which contains a lot of metadata and is very big in terms of bytes.
I have columns A, B in that table along with other columns.
I would like to join that table with another table (stuff) in order to get column C from it.
So I have something like:
SELECT metadata.A, metadata.B, stuff.C
FROM metadata
JOIN stuff ON metadata.D = stuff.D
This query sometimes takes a very long time. I suspect it's because (AFAIK, please correct me if I'm wrong) JOIN stores the result of the join in some side table, and because the metadata table is very big it has to copy a lot of data even though I don't use it, so I thought about optimizing it with WITH as follows:
WITH m AS (SELECT A, B, D FROM metadata),
     s AS (SELECT C, D FROM stuff)
SELECT * FROM m JOIN s ON m.D = s.D;
The execution plan is the same (checked using EXPLAIN), but I think it will be faster, since the side tables created by WITH (again, AFAIK WITH also creates side tables, please correct me if I'm wrong) will be smaller and only contain the needed data.
Is my logic correct? Is there some way I can test that in MariaDB?
More likely, there is some form of cache speeding up one query or the other.
The Query cache is usually recognizable by a query time that is only about 1ms. It can be turned off via SELECT SQL_NO_CACHE ... to get a timing to compare against.
The other likely cache is the buffer_pool. Data is read from disk into the buffer_pool unless it is already there. The simple workaround for strange timings is to run the query twice and take the second 'time'.
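In R, that second-run comparison might look roughly like the following, assuming a DBI/RMariaDB connection con and the metadata/stuff tables from the question; SQL_NO_CACHE matters only if the query cache is enabled, and you can drop it and rely on the second-run timing if your server rejects it.

library(DBI)

join_sql <- "SELECT SQL_NO_CACHE metadata.A, metadata.B, stuff.C
             FROM metadata JOIN stuff ON metadata.D = stuff.D"

with_sql <- "WITH m AS (SELECT A, B, D FROM metadata),
                  s AS (SELECT C, D FROM stuff)
             SELECT SQL_NO_CACHE * FROM m JOIN s ON m.D = s.D"

time_second_run <- function(con, sql) {
  invisible(dbGetQuery(con, sql))                  # first run warms the buffer_pool
  system.time(dbGetQuery(con, sql))[["elapsed"]]   # time the second run
}

time_second_run(con, join_sql)
time_second_run(con, with_sql)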
Your hypothesis that WITH creates 'small' temp tables falls apart because the work needed to read the original tables is the same with or without WITH.
Please provide SHOW CREATE TABLE for the two tables. There are a couple of datatype issues that may be involved -- big TEXTs or BLOBs.
The newly-added WITH opens up the possibility of recursive CTEs (and other things). And it provides a way to materialize a temp table that is used more than once. Neither of those applies in your query, so I would not expect any performance improvement.

Join index or collect stats: which is better in Teradata?

I am facing an issue with one of my FACT tables.
Through the same job,
I call a procedure to load this FACT table and then a second procedure to collect stats on it.
As part of a new requirement, I need to create a join index which will also include the above-mentioned fact table.
I believe that the join index will be maintained whenever there is a change in any of the involved tables. So what will happen in the above scenario? Will my collect stats procedure wait for the join index maintenance to complete, or will there be contention because of the simultaneous occurrence of collect stats and join index maintenance?
Regards,
Anoop
The Join Index will automatically be maintained by Teradata when ETL processes add, change, or delete data in the table(s) referenced by the Join Index. The Join Index will have to be removed if you apply DDL changes to table(s) referenced in the Join Index that affect the columns participating in the Join Index or before you can DROP the table(s) referenced in the Join Index.
Statistics collection on either the Join Index or Fact table should be reserved until after the ETL for the Fact table has been completed or during a regular stats maintenance period. Whether you collect stats after each ETL process or only during a regular stats maintenance period is dependent on how much of the data in your Fact table is changing during each ETL cycle. I would hazard a guess that if you are creating a join index to improve performance of querying the fact table you likely do not need to collect stats on the same fact table after each ETL cycle unless this ETL cycle is a monthly or quarterly ETL process. Stats collection on the JI and fact table can be run in parallel. The lock required for COLLECT STATS is no higher than a READ. (It may in fact be an ACCESS lock.)
Depending on your release of Teradata, you may be able to take advantage of the THRESHOLD options to let the optimizer determine whether or not statistics actually need to be recollected. I believe this was included in Teradata 14 as a stepping stone toward the automated statistics maintenance introduced in Teradata 14.10.
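For illustration only, this is roughly what submitting such a collection from R might look like, assuming a DBI/odbc connection con to Teradata 14 or later; my_fact_table and fact_date_key are placeholders, and the exact THRESHOLD syntax can vary by release, so verify it against the documentation for your version.

library(DBI)

# Placeholder object names; check the THRESHOLD clause syntax against
# your Teradata release before using it.
dbExecute(con, "
  COLLECT STATISTICS
    USING THRESHOLD 10 PERCENT
    COLUMN (fact_date_key)
  ON my_fact_table")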

Query on multiple tables using dbGetQuery of RMySQL package

With the help of dbConnect, multiple connections were established with SQL DBs (say, DB1 and DB2). How can I write a query that involves tables from both DB1 and DB2? Does dbGetQuery allow querying only one DB? Can the sqldf package be leveraged after the DB connections have been made?
This isn't the answer you're looking for, but I've had the same problem.
In short, I would drop the idea of doing any joins/grouping/subqueries between tables in one (or more) DBs in SQL. With the newer big data packages in R, specifically dplyr or data.table, there's truly almost no need. The only exception I can think of where SQL is faster is when your query results are large enough to take up too much RAM.
An interesting use-case for me is the following: my tables coming from an MPP database are around 20B rows. The problem: query an entire result set of 2M rows and use dplyr::group_by() to group on 3 variables, or just do the GROUP BY in SQL to return the final result of 100k rows. (A sketch of the R side of this pattern follows this answer.)
Timing-wise, there's always a tipping point where R or SQL is faster, and except for maybe a dimension-table join in MySQL, R is almost always faster for everything. (My example is on the tipping point for my hardware.)
With dplyr as easy to use as SQL, I'm not sure we need to ask this question anymore.
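As a minimal sketch of that pull-then-join/group pattern, assuming two separate DBI connections con1 (to DB1) and con2 (to DB2) with illustrative table and column names: dbGetQuery runs against a single connection, so the cross-database join and the grouping both happen locally in R.

library(DBI)
library(dplyr)

# Pull only the needed columns from each database over its own connection
t1 <- dbGetQuery(con1, "SELECT id, grp1, grp2, grp3, value FROM db1_table")
t2 <- dbGetQuery(con2, "SELECT id, label FROM db2_table")

result <- t1 %>%
  inner_join(t2, by = "id") %>%                  # cross-database join done locally
  group_by(grp1, grp2, grp3) %>%                 # the 3-variable grouping described above
  summarise(total = sum(value), .groups = "drop")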

SQLite3 database performance

I want to create a database. Simple, I think. Just to store a phone number, date, time, and a note.
Is it better (for database performance) to use a new table for every phone number and its notes, or one table with all the information in it?
The right way is to normalize your data (hence, use as many tables as needed).
If you split your data into several tables (assuming you use indexes), write performance will be better.
Read performance depends on the size of the data (namely the notes), but I would argue that having more tables is also better, except if indexing is out of the question (and there is no real reason for that) and you would otherwise need to join tables to get the data. Even then, I don't think it would be a big trade-off.
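As one possible normalized layout (a phones table plus a notes table keyed to it), here is an illustrative RSQLite sketch; the schema is only a sketch under those assumptions, not the only reasonable design.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

dbExecute(con, "CREATE TABLE phones (
                  id     INTEGER PRIMARY KEY,
                  number TEXT NOT NULL UNIQUE)")
dbExecute(con, "CREATE TABLE notes (
                  id       INTEGER PRIMARY KEY,
                  phone_id INTEGER NOT NULL REFERENCES phones(id),
                  date     TEXT NOT NULL,   -- e.g. '2024-01-15'
                  time     TEXT NOT NULL,   -- e.g. '09:30'
                  note     TEXT)")
dbExecute(con, "CREATE INDEX idx_notes_phone ON notes(phone_id)")

dbExecute(con, "INSERT INTO phones (number) VALUES ('555-0100')")
dbExecute(con, "INSERT INTO notes (phone_id, date, time, note)
                VALUES (1, '2024-01-15', '09:30', 'called back')")

dbGetQuery(con, "SELECT p.number, n.date, n.time, n.note
                 FROM notes n JOIN phones p ON p.id = n.phone_id
                 WHERE p.number = '555-0100'")
dbDisconnect(con)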
SQLite can write millions of rows/s and read another more, are you sure you want to ask this question?

LINQ to Entities performance issues with a large set of data

I am currently working with EF4, and in one of my scenarios I am using a join to retrieve data, but the result set is so large that EF4 even fails to generate the query plan. As a workaround I tried to load the data into simple generic lists (selecting all data from both tables) and then join the two lists, but I still get an OutOfMemoryException: one table contains around 100k records and the second around 50k records, and I want to join them in a query. Still no luck using EF. Please suggest any workaround for this.
I can't think of any scenario where you would need a result set containing 100k+ records. It may not be the answer you want, but the best way to improve performance is to reduce the amount of records that you're dealing with.
What we did was write custom SQL and execute it with Context.Database.SqlQuery(sql, params).
