I have the following situation:
library(RODBC)
channel <- odbcConnect("ABC", uid="DEF", pwd="GHI")
df <- sqlQuery(channel, query)
The number of rows is 10M+. Is there any faster way to read the data?
The data is in an Oracle database.
This should really be a comment, but it is too long for one.
When executing SQL there are a few likely bottlenecks:
1. Executing the query itself
2. Downloading the data from the database
3. Converting the data to align with language-specific types (e.g. R integers rather than BIGINT, etc.)
If your query runs fast when executed directly in the database UI, it is unlikely that the bottleneck is the query execution itself. This is also immediately clear if your query contains only simple [RIGHT/LEFT/INNER/OUTER] JOINs, as these are not "complex" query operators as such. Slowness at this stage is more often caused by complex nested queries using WITH clauses or window functions. One solution here would be to create a VIEW so that the query is pre-optimized.
Now what is more likely to be the problem is 2. and 3. You state that your table has 10M+ rows. Let's assume your table is financial and has only 5 columns, all 8-byte floats ( FLOAT(8) ) or 8-byte integers. The amount of data to be downloaded is then 8 * 5 * 10M bytes = 400 MB, or about 0.37 GiB, which is roughly 3.2 Gbit, and that will take some time to download depending on your connection. Assuming you have a 10 Mbit/s connection, the download under optimal conditions would take at least 320 seconds. And this is the case where your data has only 5 columns! It is not unlikely that you have many more.
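The back-of-envelope arithmetic can be sketched in a few lines (assuming 8-byte values and a 10 Mbit/s link, as above):

```python
rows, cols, bytes_per_value = 10_000_000, 5, 8

total_bytes = rows * cols * bytes_per_value   # 400,000,000 bytes
total_gib = total_bytes / 1024**3             # size in GiB
total_bits = total_bytes * 8

link_bits_per_s = 10 * 10**6                  # 10 Mbit/s connection
seconds = total_bits / link_bits_per_s        # best-case transfer time

print(round(total_gib, 2), round(seconds))    # 0.37 GiB, ~320 seconds
```

With more columns, or a slower link, the transfer time scales up proportionally.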
Now for 3. it is more difficult to predict the time spent without careful code profiling. This is the step where RODBC, odbc or RJDBC has to convert the data into types that R understands. I am sorry to say that here it becomes a question of trial and error to figure out which packages work best. However, for Oracle specifically, I would assume DBI + odbc, or ROracle (which appears to be developed by Oracle themselves), would be a rather safe bet for a good contender.
Do keep in mind, however, that the total time spent getting data from any database into R is an aggregate of the above steps. Some databases provide optimized methods for downloading queries/tables as flat files (CSV, Parquet etc.), and this can in some cases speed things up quite significantly, but at the cost of having to read from disk. This is often also more complex than executing the query directly, so one has to evaluate whether it is worth the trouble, or whether it is worth just waiting for the original query to finish executing within R.
I am new to InfluxDB and am trying to compare the performance of MariaDB and InfluxDB 2.0. To do so, I benchmark about 350,000 rows stored in a txt file (30 MB).
I use executemany to write multiple rows into the database when using MariaDB, which took about 20 seconds for all rows (using Python).
So I tried the same with InfluxDB using the Python client; below are the major steps of how I do it.
# Configuring the write API
write_api = client.write_api(write_options=WriteOptions(batch_size=10_000, flush_interval=5_000))
# Creating the point (7 fields in total)
p = Point("Test").field("column_1", value_1).field("column_2", value_2)
# Appending the point to build up a list
data.append(p)
# Then writing the data as a whole into the database; I do this after collecting
# 200,000 points (this had the best performance), then I clear the variable "data" to start again
write_api.write("bucket", "org", data)
Executing this takes about 40 seconds, which is double the time of MariaDB.
I have been stuck on this problem for quite some time now, because the documentation suggests writing in batches, which I do, and in theory it should be faster than MariaDB.
But probably I am missing something.
Thank you in advance!
It takes some time to shovel 20MB of anything onto the disk.
executemany probably does batching. (I don't know the details.)
It sounds like InfluxDB does not do as good a job.
To shovel lots of data into a table:
Given a CSV file, LOAD DATA INFILE is the fastest. But if you have to first create that file, it may not win the race.
"Batched" INSERTs are very fast: INSERT ... VALUES (1, 11), (2, 22), ... For 100 rows, that runs about 10 times as fast as single-row INSERTs. Beyond 100 or so rows, it gets into "diminishing returns".
Combining separate INSERTs into a "transaction" avoids transactional overhead. (Again there is "diminishing returns".)
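Both ideas can be sketched with Python's built-in sqlite3 standing in for MariaDB (the table layout and row counts here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val INTEGER)")
rows = [(i, i * 11) for i in range(1_000)]

# Multi-row INSERT: one statement per 100 rows instead of 1,000 statements
for start in range(0, len(rows), 100):
    chunk = rows[start:start + 100]
    placeholders = ", ".join("(?, ?)" for _ in chunk)
    flat = [v for row in chunk for v in row]
    conn.execute(f"INSERT INTO t VALUES {placeholders}", flat)

# One transaction around many single-row INSERTs: commits once at the end
# instead of paying transactional overhead per statement
with conn:
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2000
```

The same pattern applies with a MariaDB connector; only the connect call and the driver's parameter-count limits differ.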
There are a hundred packages between the user and the database; InfluxDB is yet another one. I don't know the details.
I'm trying to ascertain if there are any limits to the size of a script passed to Informix via ODBC.
My Informix script size is going to run into a few megabytes (approximately 3.5K INSERT rows to a TEMP table), and is of the form...
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
...
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
...followed by a section to return a SELECT list based on an existing table...
SELECT
t1.field_1,
t1.field_2,
...
t1.field_n,
t2.field_2
FROM
table_1 AS t1
INNER JOIN
temp_table_2 AS t2
ON t1.field_1 = t2.field_1
Are there any limits to the size of the script, or, for that matter, of the in-memory TEMP table? I'm estimating (hoping?) that 3.5K rows (we're only looking at one or two columns) would not cause an issue or affect the server adversely (there'd easily be enough memory). Please note that my only means of communication is via ODBC, and this is a proprietary database: I cannot create actual data tables on the server.
The reason I'm asking is that I previously generated a script of considerable size, but instead of putting the 3.5K IDs in a TEMP table (with associated data), I used an IN condition to look up the IDs only (processing could take place once the records were located). However, I cannot be certain whether it was the script editor (which was some kind of interface to the database) that baulked, limits on the IN condition, or the size of the script itself that caused the problem; basically, the script would not run. After this we vi'ed a script, saved it to a folder, and attempted to execute it, with similar (but not identical) results (sorry, I don't have the error messages from either attempt; this was done a little while ago).
Any Informix-oriented tips in this area would really be appreciated! :o)
Which version of Informix are you using? Assuming it is either 12.10 or 14.10, there is no specific limit on the size of a set of statements, but a monstrosity like you're proposing is cruel and unusual punishment for a database server (it is definitely abusing your server).
It can also be moderately risky; you have to ensure you quote any data provided by the user correctly to avoid the problem of Little Bobby Tables.
You should be preparing one INSERT statement with two placeholder values:
INSERT INTO table(field_1, field_2) VALUES(?,?)
You should then execute this repeatedly, providing the different values. This will be more effective than making the server parse 3,500 similar statements. In ESQL/C, you can declare an INSERT cursor which will buffer the sets of values, reducing the round trips to the server — that can also be very valuable. I'm not sure whether that's an option in ODBC; probably not.
At the very least, you should experiment with using a prepared statement. Sending 3,500 x 60+ bytes = 210 KB to the server is doable, but you'd be sending less data (at the cost of more round trips, which can be a factor) if you prepare the statement once and execute it repeatedly with new parameters each time. You also avoid the security risks of converting the values to strings. (Since you've not stated the types of the values, it's not certain there's a risk. If they're numeric, or things like date and time, they're very low risk. If they're character strings, the risk is considerable: not insuperable, but not negligible.)
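In outline, the prepared-statement approach looks like this. The sketch uses Python's sqlite3 as a stand-in for an ODBC connection to Informix (the qmark ? placeholder style is the same); the table and values are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (field_1 INTEGER, field_2 TEXT)")

# 3,500 rows of parameters, kept as native values, never spliced into SQL text
rows = [(i, f"value_{i}") for i in range(3_500)]

# One INSERT with two placeholders, executed repeatedly: the statement is
# parsed once and re-executed with new bindings, and because the values are
# bound rather than quoted into the string, injection is not a concern.
conn.executemany("INSERT INTO t (field_1, field_2) VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 3500
```

With pyodbc against Informix the calls would be the same shape (cursor.executemany with ? placeholders); only the connection string differs.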
Older versions of Informix had smaller limits on the size of a set of statements — 64 KiB, and before that, 32 KiB. You're unlikely to be using an old enough version for that to be a problem, but the rules have changed over time.
I am running R Studio and R 3.5.2.
I have loaded around 250 parquet files using sparklyr::spark_read_parquet from S3a.
I need to collect the data from Spark (installed by sparklyr):
spark_install(version = "2.3.2", hadoop_version = "2.7")
But for some reason it takes ages to do the job. Sometimes the task is distributed across all CPUs and sometimes only one is working.
Please advise how you would solve this issue of dplyr::collect or sparklyr::sdf_collect running for ages.
Please also understand that I can't provide you with the data; with a small amount of data it works reasonably fast.
That is expected behavior. dplyr::collect, sparklyr::sdf_collect and Spark's native collect all bring the entire dataset to the driver node.
Even when feasible (you need at least 2-3 times more memory than the actual size of the data, depending on the scenario), it is bound to take a long time, with the driver's network interface being the most obvious bottleneck.
In practice, if you're going to collect all the data anyway, it typically makes more sense to skip the network and platform overhead and load the data directly using native tools (given the description, that would mean downloading the files to the driver and converting them to an R-friendly format file by file).
I have recently started working on big data. Specifically, I have several GBs of data and I have to do computation (addition, modification) on it frequently. Since any computation on the data takes a lot of time, I have been thinking about how to store the data for quick computation. These are the options I have looked into:
Plain text file: The only advantage of this technique is that inserting data is very easy. Changes to existing data are pretty slow, since there is no way to search for records efficiently.
Database: Insertion and modification of data are simplified. However, since this is an ongoing research project, the schema may need to be updated frequently depending on experimental results (this has NOT happened up to now, but may well happen in the near future). Besides, moving the data around is not as simple as with plain files. Moreover, I have noticed that querying the data is not as quick as when it is stored in XML.
XML: Using BeautifulSoup, merely loading the XML file containing all the data takes around 15 minutes and consumes ~15 GB of RAM. Since it is quite normal to run scripts multiple times a day, ~15 minutes per invocation seems awfully long. The advantage is that once the data is loaded, I can search/modify elements (tags) fairly quickly.
JSON and YAML: I have not looked into them deeply. They would surely reduce the disk space needed to store the file (relative to XML). However, I have found no way to query records efficiently when data is stored in these formats, short of loading everything and filtering in code (unlike a database or XML).
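Loading and filtering JSON in code looks like this (a minimal sketch; the record layout is invented):

```python
import json

raw = '[{"id": 1, "score": 0.9}, {"id": 2, "score": 0.4}, {"id": 3, "score": 0.7}]'
records = json.loads(raw)  # parses into ordinary lists and dicts

# "Query": filter records with a comprehension, then modify them in place
high = [r for r in records if r["score"] > 0.5]
for r in high:
    r["flagged"] = True

print([r["id"] for r in high])  # [1, 3]
```

This gives search and modification, but unlike a database there is no index: every query is a full scan over the loaded data.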
What do you suggest I do? Do you have any other option in mind?
If you're looking for a flexible database for a large amount of data, MongoDB may be the technology you are looking for.
MongoDB belongs to the family of the NoSQL database systems and is:
based on JSON-alike documents
highly performant even with large amounts of data
schema-free
document-based
open-source
queryable
indexable
It allows you to modify your schema flexibly in the future, makes it quite easy to insert data (1.), to modify the data and its structure (2.), is faster than XML (3.), and is JSON-based for efficient storage (4.).
The size of int is 4 bytes; long long int is 8 bytes and can hold about 19 digits. An unsigned long long int is also 8 bytes and can handle larger values than long long int, but still fewer than 20 digits. Is there any way to handle numbers of more than 20 digits?
#include<iostream>
using namespace std;
int main()
{
    unsigned long long int a;
    cin >> a;
    if (a > 789456123789456123123) // want to take a number with more digits than this
    {                              // (this literal already overflows unsigned long long)
        cout << "a is larger and big data" << endl;
    }
}
I searched for a while but didn't find helpful content; everything is about Java's BigInteger.
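For comparison, the 19/20-digit limits above are easy to verify in a language with built-in arbitrary-precision integers such as Python; in C++ itself the usual route is a big-number library such as Boost.Multiprecision or GMP:

```python
# unsigned long long is 64 bits: its maximum value has 20 digits
ull_max = 2**64 - 1
print(ull_max, len(str(ull_max)))  # 18446744073709551615, 20 digits

# Python ints are arbitrary precision, so values beyond 20 digits just work
a = 789456123789456123124
print(a > 789456123789456123123)   # True
```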
I am currently porting some code from a big-endian to a little-endian system (Linux x86_64) which interacts with an Oracle database.
I am trying to make sure that the data does not get corrupted because of endianness differences, type changes (long is 8 bytes on 64-bit architectures), etc.
Unfortunately I have just one database.
So when I want to compare the output of the old and the new code, I can't because they connect to the same DB!
I am not an Oracle person, so I am looking for a solution where I can, in sequence:
1. Ask Oracle to remember the database state (millions of records) at a point in time.
2. Run the big-endian code.
3. For each column in each table, collect stats like average, max, min, stddev.
4. Ask Oracle to revert to the saved state.
5. Run the little-endian code.
6. For each column in each table, collect stats like average, max, min, stddev.
And then compare the data from steps 3 and 6.
Granted, this wouldn't be as good as row-by-row comparison, but given the volume of data, this seems to be an acceptable solution.
Is this possible in Oracle without too much resource utilization (like adding new disks, because that's a lot of red tape and takes too much time)?
Thanks!
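Steps 3 and 6 (per-column summary stats) can be scripted generically. Here is a minimal sketch using Python's sqlite3 as a stand-in for the Oracle connection (with a driver like cx_Oracle only the connect call would differ, and Oracle additionally offers STDDEV() as a built-in aggregate); the table, columns, and data are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (reading REAL, offset_ms INTEGER)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)",
                 [(float(i), i * 10) for i in range(1, 6)])

def column_stats(conn, table):
    """Collect (avg, min, max) for every column of the table."""
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return {c: conn.execute(
        f"SELECT AVG({c}), MIN({c}), MAX({c}) FROM {table}").fetchone()
        for c in cols}

before = column_stats(conn, "measurements")
# ... run the other build of the code, restore the snapshot, recompute ...
after = column_stats(conn, "measurements")
print(before == after)  # True here; a difference would flag endian problems
```

Comparing the two stats dictionaries is cheap regardless of row count, since only a handful of aggregates per column cross the network.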