Slow querying speed for RODBC with Vertica - odbc

I usually access Vertica in two ways: vsql on the command line and RODBC through R. However, queries taking ~20s in vsql will usually take 10-15 minutes through RODBC. Does anyone else have this problem?

If you dig into vertica.log, you might be able to see when your SQL statement is actually getting processed, or whether it's being held up by queuing or something else.
Calling with the same user?

Very probably this is a Fetch issue. I'd suggest:
Option 1: continue using RODBC and increase the number of rows retrieved per Fetch cycle (rows_at_time). For example:
ch <- odbcConnect("mydsn", uid="mouser", pwd="XXX", rows_at_time=8192)
Option 2: try replacing RODBC with RJDBC.
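For option 2, a minimal RJDBC sketch might look like the following (the jar location, host, port, database and table names are placeholders, not values from the question):
library(RJDBC)
# path to the Vertica JDBC jar is environment-specific
drv  <- JDBC(driverClass = "com.vertica.jdbc.Driver",
             classPath   = "/opt/vertica/java/lib/vertica-jdbc.jar")
conn <- dbConnect(drv, "jdbc:vertica://myhost:5433/mydb", "mouser", "XXX")
df   <- dbGetQuery(conn, "SELECT * FROM my_schema.my_table LIMIT 100")
dbDisconnect(conn)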

Related

Faster way to read data from oracle database in R

I have the following situation:
library(RODBC)
channel <- odbcConnect("ABC", uid="DEF", pwd="GHI")
df <- sqlQuery(channel,query)
The number of rows is 10M+. Is there any faster way to read the data?
The data is in an Oracle database.
This definitely should be a comment, but it would be too long for that purpose.
When executing SQL there are a few likely bottlenecks:
Executing the query itself
Downloading the data from the database
Converting the data to align with language-specific types (e.g. R integers rather than BIGINT, etc.)
If your query runs fast when executed directly in the database UI, the bottleneck is unlikely to be the query execution itself. This is also immediately clear if your query only contains simple [RIGHT/LEFT/INNER/OUTER] JOINs, as these are not "complex" query operators as such. Slowness at this step is more often caused by complex nested queries using WITH clauses or window functions. The solution here would be to create a VIEW so that the query is pre-optimized.
Now what is more likely to be the problem is 2. and 3. You state that your table has 10M rows. Let's assume your table is financial and has only 5 columns, all of which are 8-byte floats (FLOAT(8)) or 8-byte integers. The amount of data to be downloaded is then 8 * 5 * 10M bytes, roughly 0.37 GiB, which itself will take some time to download depending on your connection. Assuming you have a 10 Mbit/s connection, the download under optimal conditions would take at least ~5 minutes. And this is the case where your data has only 5 columns! It is not unlikely that you have many more.
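As a rough back-of-the-envelope check, here is the same estimate in R (the column count, value width and link speed are assumptions, not measurements):
rows        <- 10e6                                  # 10M rows
cols        <- 5                                     # assumed number of columns
bytes_each  <- 8                                     # 8-byte floats / integers
link_mbit   <- 10                                    # assumed 10 Mbit/s connection
total_bytes <- rows * cols * bytes_each              # 4e8 bytes
total_gib   <- total_bytes / 1024^3                  # ~0.37 GiB
seconds     <- total_bytes * 8 / (link_mbit * 1e6)   # ~320 s under ideal conditions
c(GiB = total_gib, seconds = seconds)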
Now for 3. it is more difficult to predict the amount of time spent without careful code profiling. This is the step where RODBC, odbc or RJDBC has to convert the data into types that R understands. I am sorry to say that here it becomes a question of trial and error to figure out which packages work best. For Oracle specifically, I would assume that DBI + odbc, or ROracle (which appears to be developed by Oracle itself), would be a rather safe bet for a good contender.
Do however keep in mind that the total time spent on getting data from any database into R is the sum of the above steps. Some databases provide optimized methods for downloading queries/tables as flat files (csv, parquet etc.), and this can in some cases speed things up quite significantly, but at the cost of having to read from disk. This often also becomes more complex than executing the query directly, so one has to evaluate whether it is worth the trouble, or whether it is worth just waiting for the original query to finish within R.
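If you do want to try the DBI route, a minimal sketch might look like this (the DSN, credentials and query variable are taken from the question; the chunk size is an assumption to tune):
library(DBI)
library(odbc)
con <- dbConnect(odbc::odbc(), dsn = "ABC", uid = "DEF", pwd = "GHI")
# dbSendQuery() + dbFetch(n = ...) pulls the result set in chunks,
# which keeps memory use predictable for very large results
res   <- dbSendQuery(con, query)
parts <- list()
repeat {
  chunk <- dbFetch(res, n = 500000)     # 500k rows per fetch, tune to taste
  if (nrow(chunk) == 0) break
  parts[[length(parts) + 1]] <- chunk
}
df <- do.call(rbind, parts)
dbClearResult(res)
dbDisconnect(con)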

Performance issue for session with TPT write

I have a session which takes data from AS400 and loads it into Teradata. The source has 182 columns and ~19 million records. I am using a TPT write connection for writing the data to Teradata. The target table is truncate-and-load and the mapping is a straight mapping. Still I am getting a throughput of only 6 rows/sec.
Break it down: is it a source/transformation/target bottleneck?
Modify the mapping and add a filter (condition: FALSE) straight after the source and see how fast it runs; do the same right before the target (if you have a little bit more than 'straight').
Then get back to us :)
More concretely: 182 columns is a bit much... perhaps you need to increase the buffer block size to allow for at least 100 rows per block.

sqlite3 performance testing - how to quickly reset/clear the cache

I hope this is a simple question.
When doing query performance testing, running an identical, consecutive query will always return a response faster than the first attempt (generally, significantly faster).
What's the easiest/fastest method to 'reset' sqlite3 back to its default state?
Running VACUUM can take quite a while and is obviously doing more than simply 'resetting' things.
Thank you,
So, it seems as though sqlite3 doesn't have the ability to do this on its own. You can compensate by flushing the page cache/inodes on Linux, running the following as root:
echo 3 > /proc/sys/vm/drop_caches
For it to be effective for performance testing, you'll need to run this command between each iteration. The value won't change (which is counter-intuitive), but each time the value is written to the file, the flush process is triggered.
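If you want to script the comparison, here is a hypothetical R timing harness built on that idea (Linux only; it assumes R is run as root so the drop_caches write succeeds, and the database file and table name are placeholders). Opening a fresh connection on each run also avoids SQLite's own in-process page cache:
library(DBI)
library(RSQLite)
time_cold <- function(db_path, sql, runs = 3) {
  sapply(seq_len(runs), function(i) {
    system("sync && echo 3 > /proc/sys/vm/drop_caches")  # flush OS caches (needs root)
    con <- dbConnect(RSQLite::SQLite(), db_path)
    elapsed <- system.time(dbGetQuery(con, sql))["elapsed"]
    dbDisconnect(con)
    elapsed
  })
}
time_cold("test.db", "SELECT count(*) FROM big_table")    # placeholder file/table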

Why does sqlQuery from SAP HANA using RODBC return no data if requesting 18 or more rows

I have a 64-bit Windows 7 machine with HANA Client installed and an ODBC connection to an SAP HANA database. In the ODBC Data Source administrator, I have added a connection to the required database and can successfully connect.
I am trying to use RStudio to retrieve data for analysis using R. I am finding that queries returning a handful of rows ("TOP 1" to "TOP 17") successfully return all 71 columns of data for the requested number of rows, but when I query with "TOP 18" or a higher number of rows, I get all the column titles but 0 rows returned.
So the query:
res<-sqlQuery(ch, 'SELECT TOP 17 * FROM "SAPL2P"."/BIC/PZRPA_CNO" WHERE "/BIC/ZRPA_DCD"=\'CONFIRMED\'')
results in 17 rows of data, but
res2<-sqlQuery(ch, 'SELECT TOP 18 * FROM "SAPL2P"."/BIC/PZRPA_CNO" WHERE "/BIC/ZRPA_DCD"=\'CONFIRMED\'')
has 0 rows of data.
Any ideas what could be causing data not to be returned for more than 17 rows?
OK, the problem here really is how R on Windows handles UTF data from ODBC (as has already been described).
A quick search around SO shows that this problem is pretty common for R on Windows with a lot of different DBMS.
For SAP HANA what worked for me was to add the following parameter to the ODBC DSN (in the ODBC driver settings -> Settings... -> Special property settings):
CHAR_AS_UTF8 | TRUE
This makes the SAP HANA ODBC driver handle SQL_C_CHAR as UTF-8.
The solution turned out to require 2 parts as the issue arose due to Baltic characters within the SAP data. The 2 steps were:
Adding the property 'CHAR_AS_UTF8' with a value of 'TRUE' in the ODBC DSN settings as Lars suggested above.
Opening the channel in RStudio with an extra parameter DBMSencoding="UTF-8"
So the RStudio command to open the channel to the HANA data is now:
ch <- odbcConnect("HANA_QA_DS", uid="aaaaaaaa", pwd="bbbbbbbb", DBMSencoding="UTF-8")
Thanks, Lars. Your input was instrumental to solving this!
I had the same issue.
Adding the property 'CHAR_AS_UTF8' = 1 was not an option for me, since it has to be defined on the system of every new user using the code.
In my solution I cast the column to NCLOB.
res<-sqlQuery(ch, 'SELECT TO_NCLOB("COLUMNNAME") as "COLUMNNAME" FROM "SAPL2P"."MYSCHEMA"')
This might be complicated when selecting *. However, usually only certain known columns are affected by undesired characters.
I had a similar issue. At first, I had no rows returned at all. Adding believeNRows=FALSE and rows_at_time=1 to the initial connection helped in at least bringing back some data (until a failure occurred). In my case, I could only retrieve 5 rows, with the 6th one failing, which helped me to identify the problem row in the database.
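For reference, the connection call looked roughly like this (the DSN and credentials are placeholders):
ch <- odbcConnect("HANA_DSN", uid = "myuser", pwd = "mypwd",
                  believeNRows = FALSE, rows_at_time = 1)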
In the end, I believe it has to do with encoding as per your suggestion above. I found that my issue was with the inverted single quote character.
I found the following information on the RODBC cran site:
The other internationalization issue is the character encoding used. When R and the DBMS are running on the same machine this is unlikely to be an issue, and in many cases the ODBC driver has some options to translate character sets. SQL is an ANSI (US) standard, and DBMSs tended to assume that character data was ASCII or perhaps 8-bit. More recently DBMSs have started (optionally or by default) to store data in Unicode, which unfortunately means UCS-2 on Windows and UTF-8 elsewhere. So cross-OS solutions are not guaranteed to work, but most do.
SAP HANA is a database that stores data in Unicode.
I tried various options to set the encoding:
setting the DBMSencoding property in the RODBC odbcConnect call to UTF-8, latin1, ISO8859-1 and UCS-2, without any luck
I also tried setting the encoding in RStudio itself, without luck
In the end, I settled on creating a view in SAP HANA to replace the problem character in SQL:
replace(content, x, y) as content, where x was the problem character and y the replacement.
From that point on, RODBC could retrieve the data without problems.
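The same idea can also be applied directly in the query from R, without creating a view (the schema, table and column names here are placeholders, and '?' stands in for the problem character):
res <- sqlQuery(ch, 'SELECT REPLACE("CONTENT", \'?\', \'\') AS "CONTENT" FROM "MYSCHEMA"."MYTABLE"')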
This is a sample connection showing how to pass CHAR_AS_UTF8=TRUE in the ODBC connection string (PHP example):
$conn = odbc_connect("Driver=$driver;ServerNode=$host;Database=$db_name;CHAR_AS_UTF8=TRUE;", $username, $password, SQL_CUR_USE_DRIVER);

Asp.net sql server 2005 timeout issue

Hi,
We are getting timeouts in our ASP.NET application. We are using SQL Server 2005 as the DB.
The queries run very fast in Query Analyzer. However, when we check the time through the Profiler, it shows a time that is many times more than what we get in Query Analyzer.
(Parameter sniffing is not the cause.)
Any help is much appreciated.
Thanks
We are on a SAN
Cleared the counters. The new counters are
ASYNC_NETWORK_IO 540 9812 375 78
WRITELOG 70 1828 328 0
The timeout happens only on a particular SP with a particular set of params. If we change the params and access the app, it works fine. We ran the Profiler and found that the SP's BatchCompleted event shows up in the Profiler only after the timeout happens on the ASP.NET side. If we restart the server, everything works fine.
If we remove the plan from the cache, the app works fine. However, we have taken parameter sniffing into consideration in the SP. What else could be the reason?
If I were to take a guess, I would assume that the background database load from the webserver is escalating locks and causing the whole thing to slow down. Then you take a large-ish query, run it, and that causes lock (and resource) contention.
I see this ALL THE TIME with companies complaining of performance problems with their client-server applications when going from a single SQL server to a cluster. In the web world, we hit those issues much earlier.
The solution (most times) to lock issues is one of the following:
* Refactor your queries to work better (storing SCOPE_IDENTITY instead of calling it 5 times, for example)
* Use the NOLOCK hint everywhere it makes sense.
EDIT:
Also, try viewing the server with the new 2008 SQL Management Studio 'Activity Monitor'. You can find it by right-clicking on your server and selecting 'Activity Monitor'.
Go to the Processes section and look at how many processes are 'waiting'. Your wait time should be near 0. If you see a lot of stuff under 'Wait Type', post a screenshot and I can give you an idea of what the next step is.
Go to the Resource Waits section and see what the numbers look like there. Your waiters should always be near-0.
And 'Recent Expensive Queries' is awesome to look at to find out what you can do to improve your general performance.
Edit #2:
How much slower is it? Your SAN seems to be taking up about 10 seconds' worth, but if you are talking 20 seconds vs. 360 seconds, then that would not be relevant, and there are no waits for locks, so I guess I am drawing a blank. If the difference is between 1 second and 10 seconds, then it seems to be network related.
Run the following script to create this stored proc:
CREATE PROC [dbo].[dba_SearchCachedPlans]
@StringToSearchFor VARCHAR(255)
AS
/*----------------------------------------------------------------------
Purpose: Inspects cached plans for a given string.
------------------------------------------------------------------------
Parameters: @StringToSearchFor - string to search for e.g. '%<MissingIndexes>%'.
Revision History:
03/06/2008 Ian_Stirk@yahoo.com Initial version
Example Usage:
1. exec dba_SearchCachedPlans '%<MissingIndexes>%'
2. exec dba_SearchCachedPlans '%<ColumnsWithNoStatistics>%'
3. exec dba_SearchCachedPlans '%<TableScan%'
4. exec dba_SearchCachedPlans '%CREATE PROC%MessageWrite%'
-----------------------------------------------------------------------*/
BEGIN
-- Do not lock anything, and do not get held up by any locks.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SELECT TOP 100
st.TEXT AS [SQL],
cp.cacheobjtype,
cp.objtype,
DB_NAME(st.dbid) AS [DatabaseName],
cp.usecounts AS [Plan usage],
qp.query_plan
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
WHERE CAST(qp.query_plan AS NVARCHAR(MAX)) LIKE @StringToSearchFor
ORDER BY cp.usecounts DESC
END
Then execute:
exec dba_SearchCachedPlans '%<MissingIndexes>%'
And see if you are missing any recommended indexes.
When SQL Server creates a plan it saves it, along with any recommended indexes. Just click on the query_plan column text to see the graphical plan. At the top there will be recommended indexes you should consider implementing.
I don't have the answer for you, because I'm not a guru. But I do remember reading on some SQL blogs recently that SQL 2008 has some extra things you can add to the query/stored procedure so it calculates things differently. One thing you could try searching for is query 'hints'. Also, how SQL uses the current 'statistics' makes a difference, so look that up too. And the execution plan is only generated for the first run -- if that plan doesn't work with different parameter values (because there would be a vast difference in what is searched/returned), it can cause this behavior, I think.
Sorry I can't be more helpful. I'm just getting my feet wet with SQL Server performance at this level. I bet if you asked someone like Brent Ozar he could point you in the right direction.
I've had this exact same issue a couple of times before. It seemed to happen to me when a particular user was on the site when it was deployed. When that user would run certain stored procedures with their ID it would timeout. When others would run it, or I would run it from the DB, it would run in no time. We had our DBA's watch everything they could and they never had an answer. In the end, everything was fixed whenever I re-deployed the site and the user was not already logged in.
I've had similar issues, and in my case it had to do with the SP recompiling. Specifically, it was my use of temp tables vs. table variables.
