With the help of dbConnect, multiple connections were established with SQL DBs (say, DB1 and DB2). How can I write a query that involves tables from DB1 and DB2? Does dbGetQuery allow querying one only one DB? Can sqldf package be leveraged after the DB connections have been made?
This isn't the answer you're looking for, but I've had the same problem.
In short, I would drop the idea of doing any joins/grouping/subquerys between tables in 1 (or more) DBs in SQL. With the newer big data packages in R, specifically with dplyr or data.table there's truly almost no need. The only exception I can think of where SQL is faster is when your query results are large enough to take up too much RAM.
An interesting use-case for me is the following: My tables coming from an MPP database are around 20B rows. Problem: Query an entire result set of 2M rows, and use dplyr::group_by() to group on 3 variables, or just do the GROUP BY in SQL to return the final result of 100k rows.
Timing wise, there's always a tipping point where R or SQL is faster, and except for maybe a dimension table join in MySQL, R is almost always faster for everything. (My example is on the tipping point for my hardware.)
With dplyr as easy to use as SQL, I'm not sure we need to ask this question anymore.
Related
If I have a set of tables that I need to extract from an Oracle server, is it always more efficient to join the tables within Oracle and have the system return the joined table, or are there cases where it would be more efficient to return two tables into R (or python) and merge them within R/Python locally?
For this discussion, let's presume that the two servers are equivalent and both have similar access to the storage systems.
I will not go into the efficiencies of joining itself but anytime you are moving data from a database into R kep the size into account. If the dataset after joining will be much smaller (maybe after an inner join) it might be best to join in db. If the data is going to expand significantly after join (say cross join) then joining it after extraction might be better. If there is not much difference then my preference would be to join in db as it can be better optimized. In fact if the data is already in db try to do as much of data preprocessing before extracting it out.
I want create database. Simple I think. Just to storage number of phone, date, time and note.
Better (for database perfomance) use new table for every phone number and notes or one table and all information in it?
The right way is to normalize your data (hence, use as much tables as needed).
If you split your data into several tables (assuming you use indexed) write performance will be better.
Regarding read performance, depends on the size of the data (namely notes), but I would argue that having more tables is also better - except if indexing is out of the question (no reason for that really) and if you would otherwise need to join tables to get data. Even then, I don't think it would be a big trade-off.
SQLite can write millions of rows/s and read another more, are you sure you want to ask this question?
Here's my problem.
I want to ingest lots and lots of data .... right now millions and later billions of rows.
I have been using MySQL and I am playing around with PostgreSQL for now.
Inserting is easy, but before I insert I want to check if that particular records exists or not, if it does I don't want to insert. As the DB grows this operation (obviously) takes longer and longer.
If my data was in a Hashmap the look up would be o(1) so I thought I'd create a Hash index to help with lookups. But then I realised that if I have to compute the Hash again every time I will slow the process down massively (and if I don't compute the index I don't have o(1) lookup).
So I am in a quandry, is there a simple solution? Or a complex one? I am happy to try other datastores, however I need to be able to do reasonably complex queries e.g. something to similar to SELECT statements with WHERE clauses, so I am not sure if no-sql solutions are applicable.
I am very much a novice, so I wouldn't be surprised if there is a trivial solution.
Nosql Stores are good for handling huge inserts and updates
MongoDB has really good feature for update/Insert (called as upsert) based on whether the document is existing.
Check out this page from mongo doc
http://www.mongodb.org/display/DOCS/Updating#Updating-UpsertswithModifiers
Also you can checkout the safe mode in mongo connection. Which you can set it as false to get more efficiency in inserts.
http://www.mongodb.org/display/DOCS/Connections
You could use CouchDB. Its no SQL so you can't do queries per se, but you can create design documents that allow you to run map/reduce functions on your data.
we have two sqlite DB's , we have a requirement to "attach" one to other and perform some joins. we have some questions/concerns as below:
say we have attached DB1 with DB2 and performing some SELECT's , can some other thread concurrently UPDATE/INSERT on DB2 or DB1 with a different connection ?
is there a separate C API to attach or we need to use "sqlite3_step"
how is the performance with ATTACH.
Thanks in Advance
DEE
Another thread can concurrently alter either database, but this will mean that at some point the database can be locked for the querying thread. See here about concurrency with SQLite.
ATTACH is a one step operation, you can us sqlite3_exec.
Performance is a tough thing to predict and will vary greatly with schema, indexing, usage, and data stored (and some other factors too like page size). In some cases, ATTACH can be slower than if all data is in one database. My personnal experience was that separating large datasets was faster for inserts and affected final query output minimally/imperceptibly. Your mileage may vary.
I have over 1.500.000 data entries and it's going to increase gradually over time. This huge amount of data would come from 150 regions.
Now should I create 150 tables to manage this increasing huge data? Will this be efficient? I need fast operation. ASP.NET and Oracle will be used.
If all the data is the same, don't split it in to different tables. Take a look at Oracle's table partitions. One-hundred fifty partitions (or more) split out by region (or more) is probably more in line with what you're going to be looking for.
I would also recommend you look at the Oracle Database Performance Tuning Tips & Techniques book and browse Ask Tom on Oracle's website.
Only 1.5 M rows? Not a lot really...
Use one table; working out how to write a 150-way union across 150 tables will be murder.
1.5 million rows doesn't really seem like that much. How many people are accessing the table(s) at any given point? Do you have any indexes setup? If you expect it to grow much larger, you may want to look into partitioning in databases.
FWIW, I work with databases on a regular basis with 100M+ rows. It shouldn't be this bad unless you have thousands of people using it at a time.
1 table per region is way not normalized; you're probably going to lose a bunch of efficiency there. 1 table per data entry site is pretty unusual too. Normalization is huge, it will save you a ton of time down the road, so I'd make sure you're not storing any duplicate data.
If you're using oracle, you shouldn't need to have multiple tables. It'll support a lot more than 1.5 million rows. If you need to speed up data access, you can try a snowflake schema to pull in commonly accessed data.
If you mean 1,500,000 rows in a table then you do not have much to worry about. Oracle can handle much larger loads than that with ease.
If you need to identify the regions that the data came in, you can create a Region table and tie the ID from that to the big data table.
IMHO, you should post more details and we can help you better.
A database with 2,000 rows can be slow. It all depends on your database design, index, keys and most important is the hardware configuration your database server is running on. The way your application uses this data is also important. Is a read intensive database or transaction intensive? There is no right answer to what you are asking right now.
You first need to consider what operations are going to access the table. How will inserts be performed? Will the existing rows be updated, and if so how? By how much will the rows grow, and what percentage of them will grow? Will rows get deleted? By what criteria? How will you be selecting data? By what criteria and how many per query?
Data partition can be used for volume of data much larger than 1.5m rows. Look into optimizing
the SQL query ,batch processing and storage of data.