I have an ETL process in Talend that extracts data from SQL Server and Oracle, compares the two, and then inserts the transformed data into Oracle.
The ETL flow is:
tSQLInput    -- Row1 --+
                       +--> tMap (comparison) --> tOracleOutput
tOracleInput -- Row2 --+
With the same job, inserting the 213,058 rows takes only 3 seconds, but updating them takes around 15 minutes.
Insertion runs at about 10,000 rows/sec, while the update manages only about 332 rows/sec.
I want to improve the performance of the update step.
tOracleOutput Basic Settings
tOracleOutput Advanced Settings
Can anybody please shed some light on why the update alone takes so much longer?
Many thanks in advance.
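For context, the usual suspects for a gap like this are that the update is issued one row at a time against key columns that have no index, while the inserts can be appended in bulk. Below is a minimal sketch outside Talend, assuming the cx_Oracle driver and purely hypothetical table/column names, of the batched-update pattern; inside Talend the equivalents would be the batch/commit settings of tOracleOutput plus an index on the update key.

# A minimal sketch (outside Talend) of batched updates against Oracle.
# Connection string, table and column names are hypothetical placeholders.
import cx_Oracle

conn = cx_Oracle.connect("user", "password", "dbhost:1521/service")
cur = conn.cursor()

# Without an index on the key column, every per-row UPDATE has to scan the table;
# an index on the update key is usually the single biggest win.
# cur.execute("CREATE INDEX ix_target_id ON target_table (id)")

# Stand-in for the compared rows produced upstream: (key, new_value) pairs.
changes = [(1, "A"), (2, "B"), (3, "C")]
rows = [(new_value, key) for key, new_value in changes]

# One round trip per batch instead of one statement per row, then a single commit.
cur.executemany("UPDATE target_table SET value = :1 WHERE id = :2", rows)
conn.commit()
conn.close()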
I am new to InfluxDB and I am trying to compare the performance of MariaDB and InfluxDB 2.0, so I am benchmarking about 350,000 rows stored in a txt file (30 MB).
With MariaDB I use 'executemany' to write multiple rows into the database, which takes about 20 seconds for all rows (using Python).
I tried the same with InfluxDB using its Python client; below are the major steps of how I do it.
# Configuring the write API
write_api = client.write_api(write_options=WriteOptions(batch_size=10_000, flush_interval=5_000))

# Creating the Point
p = Point("Test").field("column_1", value_1).field("column_2", value_2)  # having 7 fields in total

# Appending the point to build a list
data.append(p)

# Then writing the data as a whole into the database; I do this after collecting
# 200,000 points (this had the best performance), then I clear the variable "data" to start again
write_api.write("bucket", "org", data)
Executing this takes about 40 seconds, which is double the time of MariaDB.
I have been stuck on this for quite some time now, because the documentation suggests writing in batches, which I do, and in theory it should be faster than MariaDB.
I am probably missing something.
Thank you in advance!
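For comparison, here is one way to structure the batching as a self-contained sketch: hand each point to the WriteApi as soon as it is built and let batch_size / flush_interval do the chunking, then close the client inside the timed section so the final flush is counted. The URL, token, org, bucket, and the input parsing below are placeholder assumptions, not taken from the question.

# A self-contained sketch of letting the WriteApi do the batching (details are assumed).
from influxdb_client import InfluxDBClient, Point, WriteOptions

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=WriteOptions(batch_size=10_000, flush_interval=5_000))

with open("benchmark.txt") as f:            # hypothetical input file, one row per line
    for line in f:
        values = line.strip().split(",")    # assumes comma-separated columns
        p = Point("Test")
        for i, v in enumerate(values[:7], start=1):
            p = p.field(f"column_{i}", float(v))
        # Hand each point over immediately; the WriteApi buffers and flushes
        # in the background according to batch_size / flush_interval.
        write_api.write(bucket="my-bucket", org="my-org", record=p)

# close() flushes whatever is still buffered; keep it inside the timed section,
# otherwise the benchmark under-counts the actual write time.
write_api.close()
client.close()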
It takes some time to shovel 20MB of anything onto the disk.
executemany probably does batching. (I don't know the details.)
It sounds like InfluxDB does not do as good a job.
To shovel lots of data into a table:
Given a CSV file, LOAD DATA INFILE is the fastest. But if you have to first create that file, it may not win the race.
"Batched" INSERTs are very fast: INSERT ... VALUE (1,11), (2, 22), ... For 100 rows, that runs about 10 times as fast as single-row INSERTs. Beyond 100 or so rows, it gets into "diminishing returns".
Combining separate INSERTs into a "transaction" avoids transactional overhead. (Again there is "diminishing returns".)
There are a hundred packages between the user and the database; InfluxDB is yet another one. I don't know the details.
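To make the batched-INSERT and single-transaction points concrete, here is a minimal Python sketch using the mariadb connector. The database, table, and column names are hypothetical, and the 1,000-row slices passed to executemany stand in for multi-row INSERTs.

# A minimal sketch of batched inserts into MariaDB (names and sizes are illustrative).
import mariadb

conn = mariadb.connect(host="localhost", user="user", password="pw", database="bench")
conn.autocommit = False                               # keep the whole load in one transaction
cur = conn.cursor()

rows = [(i, f"value_{i}") for i in range(350_000)]    # stand-in for the parsed txt file

batch = 1_000                                         # ~100+ rows per batch is where the big win is
for start in range(0, len(rows), batch):
    cur.executemany("INSERT INTO t (id, col_1) VALUES (?, ?)", rows[start:start + batch])
conn.commit()                                         # one commit avoids per-row transactional overhead
conn.close()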
I have the following situation:
library(RODBC)
channel <- odbcConnect("ABC", uid="DEF", pwd="GHI")
df <- sqlQuery(channel,query)
The number of rows is 10M+. Is there a faster way to read the data?
The data is in an Oracle database.
This should really be a comment, but it will be too long for that purpose.
When executing SQL there are a few likely bottlenecks:
1. Executing the query itself
2. Downloading the data from the database
3. Converting the data to language-specific types (e.g. R integers rather than BIGINT, etc.)
If your query runs fast when executed directly in the database UI, the bottleneck is unlikely to be the query execution itself (1). That is also usually clear when the query only contains simple [RIGHT/LEFT/INNER/OUTER] JOINs, as these are not "complex" query operators as such. Slow execution is more often caused by complex nested queries, WITH clauses, or window functions; the solution there would be to create a VIEW so that the query is pre-optimized.
What is more likely to be the problem is 2. and 3. You state that your table has 10M+ rows. Let's assume it is a financial table with only 5 columns, all 8-byte floats (FLOAT(8)) or 8-byte integers. The amount of data to be downloaded is then 8 * 5 * 10M bytes = 400 MB (about 0.37 GiB, or roughly 3.2 Gbit), which itself takes time to move depending on your connection. Over a 10 Mbit connection the download under optimal conditions would take at least about 5 minutes. And that is the case where your data has only 5 columns; it is not unlikely that you have many more.
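The same back-of-envelope estimate as a few lines of Python, in case you want to plug in your own column count and bandwidth (the inputs are the assumptions above, not measurements):

# Back-of-envelope transfer-time estimate using the assumptions above.
bytes_per_value = 8            # FLOAT(8) / BIGINT
columns = 5
rows = 10_000_000
bandwidth_bits_per_s = 10e6    # a 10 Mbit/s connection

total_bytes = bytes_per_value * columns * rows      # 400,000,000 bytes (~0.37 GiB)
seconds = total_bytes * 8 / bandwidth_bits_per_s    # ~320 s, i.e. a bit over 5 minutes
print(f"{total_bytes / 1024**3:.2f} GiB, ~{seconds / 60:.1f} minutes at 10 Mbit/s")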
For 3. it is more difficult to predict the time spent without careful code profiling. This is the step where RODBC, odbc, or RJDBC has to convert the data into types that R understands, and here it unfortunately becomes a question of trial and error to figure out which package works best. For Oracle specifically, I would assume DBI with the odbc or ROracle backends (the latter seems to be developed by Oracle themselves) would be a rather safe bet for a good contender.
Do keep in mind, however, that the total time spent getting data from any database into R is the aggregate of the steps above. Some databases provide optimized methods for exporting queries/tables as flat files (CSV, Parquet, etc.), and this can in some cases speed things up significantly, but at the cost of having to read from disk. That route is also often more complex than simply executing the query, so you have to evaluate whether it is worth the trouble or whether it is better to just wait for the original query to finish inside R.
I am trying to build a model in R that takes real-time input (heartbeat) from a sensor and performs an operation when the heartbeat value goes beyond a baseline. Is there a way to get the real-time data from the sensor into R, directly or indirectly?
I will use Windows Server 2008 R2 with 32 GB of memory and 12 cores. I am planning to take data (3 bytes in total: 1 for a unique ID and 2 for the reading) from a watch I custom designed and process it in R, basically sending a notification (email) to the user when his/her heartbeat goes beyond a baseline value. Is this possible in R? Can the input data be received in real time in R in any way or form? What other languages/platforms do you recommend for this kind of project?
Any recommendations for databases (AWS, Oracle, Apache, ...) to use for data collection?
I have been having trouble with a database for the last month or so... (it was fine in November).
(S0 Standard tier - not even the lowest tier.) - Fixed in update 5
SELECT statements are causing my database to throttle (even time out).
To make sure it wasn't just a problem with my database, I've:
Copied the database... same problem on both (unless I increase the tier).
Deleted the database and created it again (a blank database) from Entity Framework code-first.
The second one proved more interesting. Now my database has 'no' data, and it still maxes out the DTU and makes things unresponsive.
Firstly... is this normal?
I do have more complicated databases at work that use at most about 10% of the DTU at the same level (S0), so I'm perplexed. This is just one user and one (currently empty) database, and I can make it unresponsive.
Update 2:
From the copy (the one with data, ~10,000 records): I upgraded it to Standard S2 (potentially 5x more powerful than S0) and had no problems. I then downgraded it to S0 again and ran:
SET STATISTICS IO ON
SET STATISTICS TIME ON
select * from Competitions -- 6 records here...
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(6 row(s) affected)
Table 'Competitions'. Scan count 1, logical reads 3, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 407 ms, elapsed time = 21291 ms.
Am I misunderstanding Azure databases; do they need to keep warming up? If I run the same query again, it is immediate. If I close the connection and run it again, it's back to ~20 seconds.
Update 3:
At the S1 level, the same query above completes the first time in ~1 second.
Update 4:
Back at the S0 level... first query:
(6 row(s) affected)
Table 'Competitions'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 35 ms.
Nothing is changing on these databases apart from the tier. After roaming around one of my live sites (different database, schema, and data) on S0, it peaked at 14.58% (it's a stats site).
It's not my best investigation, but I'm tired :D
I can give more updates if anyone is curious.
Update 5 - fixed, sort of:
The first few 100% spikes were all from the same table. After updating the schema and removing a GEOGRAPHY field (the data in that column was always null), activity has moved to the later, smaller peaks of ~1-4%, with result times back in the very low milliseconds.
Thanks for the help,
Matt
The cause of the crippling 100% DTU was a GEOGRAPHY field:
http://msdn.microsoft.com/en-gb/library/cc280766.aspx
Removing this from my queries fixed the problem. Removing it from my EF models will hopefully make sure it never comes back.
I do want to use the geography field in Azure (eventually, though probably not for a few months), so if anyone knows why it was causing an unexpected amount of DTU to be spent on a (currently always null) column, that would be very useful for future knowledge.
In one part of our application we have to save some data. Saving takes some time, anywhere from a few seconds to a few minutes. To handle this, and to avoid a SQL timeout, we have the following process in place.
Note: This was all set up by our DBA, and the process has been in use for several years without any problems.
1: Call a stored procedure that places a job in a table and returns a job ID. The input parameter to this stored procedure is an EXEC statement that executes another stored procedure with its parameters.
2: Then I call another stored procedure, passing the job ID, which returns the status of the job: completed (true) or not (false), plus an error message.
3: I loop over step 2 at most 5 times, with a sleep of 2 seconds between attempts. If the status returned is completed, or there is an error, I break out of the loop. If the status is still incomplete after 5 attempts, I just display a message that it is taking too long and to please visit the page later (this isn't ideal, but so far the process has always finished within 2-3 iterations and never hit the 5-attempt maximum, though it could).
Now I am wondering if there is a better way to do the above, or whether, even today after several years, this is still the best option I have?
Note: Using ASP.NET 2.0 Webforms / ADO.NET
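For reference, here is the enqueue-then-poll flow from steps 1-3 as a minimal sketch. It is written in Python with pyodbc purely for illustration (the real stack is ASP.NET 2.0 / ADO.NET), and the procedure names, connection string, and result shapes are hypothetical.

# A minimal sketch of the enqueue-then-poll pattern described in steps 1-3.
# Procedure names, connection string and result columns are hypothetical.
import time
import pyodbc

conn = pyodbc.connect("DSN=AppDb;UID=user;PWD=pw")
cur = conn.cursor()

# Step 1: enqueue the job; the procedure is assumed to return a job ID.
job_id = cur.execute(
    "{CALL dbo.EnqueueJob (?)}", ("EXEC dbo.SaveData @CustomerId = 42",)
).fetchval()

# Steps 2-3: poll the status at most 5 times, sleeping 2 seconds between attempts.
for attempt in range(5):
    completed, error = cur.execute("{CALL dbo.GetJobStatus (?)}", (job_id,)).fetchone()
    if completed or error:
        break
    time.sleep(2)
else:
    print("Still running - please check back later.")   # the 'taking too long' branch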
I would look into queueing the operation, or if that's not an option (sounds like it might not be) then I would execute the stored procedure asynchronously, and use a callback to let you know when it's finished.
This question might help:
Ping to stored procedure to know execution completed in .net?