We are using a pipelined function to return the results of a SELECT statement; each PIPE ROW call returns one row of output. The SELECT behind each piped row takes **~15 sec** to execute, so if the function pipes **n** rows, the total time is **~n*15 sec**, which causes performance problems when the number of rows returned is huge.
So, is there a way to process all the PIPE ROW records in parallel, so that irrespective of how many rows are piped, the elapsed time is roughly the same as if the function returned **one row**? I am not sure whether using a parallel hint in the function call will improve the performance.
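(For reference, a minimal sketch of the parallel-enabled shape such a function can take; the collection type, names, and hint below are made up, and whether it actually runs in parallel depends on the calling query and system settings.)

CREATE TYPE num_tab AS TABLE OF NUMBER;
/

CREATE OR REPLACE FUNCTION f (cur SYS_REFCURSOR)
RETURN num_tab PIPELINED
PARALLEL_ENABLE (PARTITION cur BY ANY)
AS
  n NUMBER;
BEGIN
  LOOP
    FETCH cur INTO n;
    EXIT WHEN cur%NOTFOUND;
    -- the expensive per-row work would go here
    PIPE ROW (n);
  END LOOP;
  RETURN;
END;
/

-- invoked with a parallel row source, e.g.:
-- SELECT * FROM TABLE(f(CURSOR(SELECT /*+ PARALLEL(t 4) */ n FROM t)));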
Thanks
Bharath
I am trying to write a function in Python that uses SQLite, and while I managed to get it to work, there is a behavior in SQLite that I don't understand when using COUNT. When I run the following, SQLite counts as expected, i.e. it returns an int:
SELECT COUNT (*) FROM Material WHERE level IN (?) LIMIT 10
However, when I add an OFFSET to the end, as shown below, SQLite returns an empty list, in other words nothing:
SELECT COUNT (*) FROM Material WHERE level IN (?) LIMIT 10 OFFSET 82
While omitting the OFFSET is an easy fix, I don't understand why SQLite returns nothing. Is this the expected behavior for the command I gave?
thanks for reading
When you execute that COUNT(*), it returns only a single row.
The LIMIT clause limits the number of rows returned. Setting the limit to 10 has no effect here, because the query returns only a single row anyway.
OFFSET skips the specified number of rows before any are returned; here it skips past the only row there is.
In simple terms, your query translates to: count the rows, then return up to 10 rows starting from the 83rd position. Since the count produces a single row, the result is always empty.
Read about LIMIT and OFFSET
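A minimal demonstration, reusing the table from the question (the count value is made up):

SELECT COUNT(*) FROM Material;                     -- one row, e.g. 137
SELECT COUNT(*) FROM Material LIMIT 10;            -- still just that one row
SELECT COUNT(*) FROM Material LIMIT 10 OFFSET 82;  -- skips the only row: empty result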
I'm inserting data into a table from another table in Teradata using the query below, and I want to keep running this statement until the target table reaches 20GB. So I want to run the statement in a loop. However, the loop I wrote gives a "query invalid" error when I try to execute it. Could you please help, as I'm new to Teradata? Thanks.
insert into schema1.xyx select * from schema2.abc;
if/loop/etc. are only allowed in Stored Procedures.
Looping a billion times will be quite inefficient (and will result in much more than 20GB). Better to check the current size of table abc in dbc.TableSizeV, calculate how many copies you need, and then cross join:
insert into schema1.xyx
select t.*
from schema2.abc AS t
cross join
  ( -- calculated number of loops
    select top 100000 *
    -- any table with a large number of rows
    from sys_calendar.calendar
  );
But it is much easier to use sampling. Calculate the number of rows needed and then:
insert into schema1.xyx
select *
from schema2.abc
sample with replacement 100000000;
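For the "calculate the number of rows needed" step, a sketch (dbc.TableSizeV and its columns are standard Teradata, but the arithmetic is only illustrative):

-- total bytes currently used by the source table, summed over all AMPs
select sum(CurrentPerm)
from dbc.TableSizeV
where DatabaseName = 'schema2'
  and TableName = 'abc';

-- then: copies needed = 20GB / that size, and the SAMPLE row count is
-- copies needed * (select count(*) from schema2.abc)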
I want to generate uniformly distributed random numbers twice. I used profvis to profile the code, and found that the second runif call takes much more time than the first one. Is there any way to avoid this?
L is just an integer between 50 and 100. Please ignore the second line.
Additionally, in each iteration of my loop, I rbind a new record onto the current records data.frame. This rbind operation is also time-consuming.
If I knew the number of records in advance, I could initialize a data.frame to store all of them, but it isn't known until the loop ends. Is there a faster way to append a row to an existing data.frame?
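(A commonly suggested alternative, sketched here with made-up record contents: collect the rows in a list and rbind once at the end.)

# collect each record in a list; lists grow cheaply, unlike data.frames
rows <- list()
i <- 0
repeat {
  rec <- data.frame(x = runif(1), y = runif(1))  # stand-in for a real record
  i <- i + 1
  rows[[i]] <- rec
  if (i >= 100) break                            # stand-in for the real stopping rule
}
result <- do.call(rbind, rows)                   # one rbind instead of one per iteration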
Or you can try this simple example to see how the second runif turns out:
library(profvis)
profvis({
  runif(100000, 0, 1)
  runif(100000, 0, 1)
})
I'm trying to test the ResultBufferSize field when working with Vertica 7.2.3 using ODBC.
From my understanding this field should affect the size of the result set:
ResultBufferSize
but even with a value of 1 I get 20K results.
Any way to make it work?
ResultBufferSize is the size of the result buffer configured at the ODBC data source, not at runtime.
You get the actual size of a fetched buffer by preparing the SQL statement (SQLPrepare()), counting the result columns (SQLNumResultCols()), and then, for each column found, calling SQLDescribeCol().
Good luck -
Marco
I need to add a whole other answer to your comment, Tsahi.
I'm not completely sure if I still misunderstand you, though.
Maybe clarifying how I do it in an ODBC based SQL interpreter sheds some light on the matter.
SQLPrepare() on a string containing, say, "SELECT * FROM foo", returns SQL_SUCCESS, and the passed statement handle becomes valid.
SQLNumResultCols(stmt, &colcount) on that statement handle returns the number of result columns in its second parameter.
In a for loop from 0 to (colcount-1), I call SQLDescribeCol(), to get, among other things, the size of the column - that's how many bytes I'd have to allocate to fetch the biggest possible occurrence for that column.
I allocate enough memory to be able to fetch a block of rows instead of just one row in a subsequent SQLFetchScroll() call - for example, a block of 10,000 rows. For this, I need to allocate, for each of the colcount columns, 10,000 times the maximum possible fetchable size, plus a two-byte integer for the null indicator of each column. These two together - the data area and the null-indicator area, for 10,000 rows in my example - make up the fetch buffer size, in other words the result buffer size.
For the prepared statement handle, I call SQLSetStmtAttr() to set SQL_ATTR_ROW_ARRAY_SIZE to 10,000 rows.
SQLFetchScroll() will return either 10,000 rows in one call, or, if the table foo contains fewer rows, all rows in foo.
This is how I understand it to work.
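A condensed sketch of that sequence in C (error handling omitted; the connection handle is assumed to already exist, "foo" is the example table from above, and binding every column as character data is a simplification):

#include <sql.h>
#include <sqlext.h>
#include <stdlib.h>

#define BLOCK_ROWS 10000

/* Fetch "SELECT * FROM foo" in blocks of BLOCK_ROWS rows. */
void fetch_blocks(SQLHDBC dbc)
{
    SQLHSTMT stmt;
    SQLSMALLINT colcount;
    SQLULEN fetched = 0;

    SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt);
    SQLPrepare(stmt, (SQLCHAR *)"SELECT * FROM foo", SQL_NTS);
    SQLNumResultCols(stmt, &colcount);

    /* One data array and one null-indicator array per column. */
    SQLCHAR **data = malloc(colcount * sizeof *data);
    SQLLEN **ind = malloc(colcount * sizeof *ind);

    for (SQLSMALLINT c = 0; c < colcount; c++) {
        SQLCHAR name[128];
        SQLSMALLINT namelen, type, digits, nullable;
        SQLULEN colsize;

        SQLDescribeCol(stmt, c + 1, name, sizeof name, &namelen,
                       &type, &colsize, &digits, &nullable);

        /* (colsize + 1) bytes per value, BLOCK_ROWS values deep. */
        data[c] = malloc(BLOCK_ROWS * (colsize + 1));
        ind[c] = malloc(BLOCK_ROWS * sizeof **ind);
        SQLBindCol(stmt, c + 1, SQL_C_CHAR, data[c], colsize + 1, ind[c]);
    }

    SQLSetStmtAttr(stmt, SQL_ATTR_ROW_ARRAY_SIZE,
                   (SQLPOINTER)(SQLULEN)BLOCK_ROWS, 0);
    SQLSetStmtAttr(stmt, SQL_ATTR_ROWS_FETCHED_PTR, &fetched, 0);

    SQLExecute(stmt);
    /* Each call fills the bound arrays with up to BLOCK_ROWS rows. */
    while (SQLFetchScroll(stmt, SQL_FETCH_NEXT, 0) != SQL_NO_DATA) {
        /* 'fetched' rows are now available; process them here. */
    }
    SQLFreeHandle(SQL_HANDLE_STMT, stmt);
}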
You can do the maths the other way round:
You set the max fetch buffer.
You prepare and describe the statement and columns as explained above.
For each column, you count two bytes for the null indicator plus the maximum possible fetch size as reported by SQLDescribeCol(), giving the number of bytes that need to be allocated for one row.
You integer-divide the max fetch buffer by that per-row byte count.
And you use the result of that integer division in the SQLSetStmtAttr() call that sets SQL_ATTR_ROW_ARRAY_SIZE.
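For example (numbers made up): say the maximum fetch buffer is 64 MB and a row has 5 columns totalling 1,000 bytes of data; adding 2 bytes of null indicator per column, one row needs 1,010 bytes. 67,108,864 / 1,010 = 66,444 by integer division, so SQL_ATTR_ROW_ARRAY_SIZE would be set to 66,444.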
Hope it makes some sense ...
Problem: What is the best way to loop through a vector of IDs, passing one ID at a time as an argument to a function, so that the function runs once per ID until it has been run 30 times, once for each of the 30 IDs in my vector?
Additional Info: I have a complex function that retrieves data from several different sources, manipulates it, writes it to a different table, and then emails me when it's done. It has several arguments that are hard-coded, and an ID argument that I enter manually each time I want to run it.
I'm sorry that I can't give a lot of specifics, but this is an extremely simplified version of my setup:
# Manually Entered Arguments
ID <- 3402
Arg1 <- "Jon_Doe"
Arg2 <- "Jon_Doe@gmail.com"

# Run Function
RunFun <- function(ID, arg1, arg2) {...}
Now, I have 30 non-sequential IDs (all numerical) that I have imported from an Excel column using:
ID.Group <- scan()
I know that it is extremely inefficient to run each ID through the function one at a time, but the complexity of the function and technological limitations only allow for one to be run at a time.
I am just getting started with R, so I'm sorry if any of this didn't make sense. I have spent the last 5 hours trying to figure this out so any help would be greatly appreciated.
Thank you!
The Vectorize function is actually a wrapper around mapply and is often used when vectorization is not a natural outcome of the function body. If you wrote the function with default values for arg1 and arg2, like this:
RunFun <- function(ID, arg1 = "Jon_Doe", arg2 = "Jon_Doe@gmail.com") {...}
V.RunFun <- Vectorize(RunFun)
V.RunFun(IDvector)
This is often used with integrate or outer, which require that their function arguments return a vector of the same length as their input.
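A toy demonstration of the pattern (the function body, IDs, and return value are made up; only the Vectorize mechanics are real):

# stand-in for the real function; the body just echoes its inputs
RunFun <- function(ID, arg1 = "Jon_Doe", arg2 = "Jon_Doe@gmail.com") {
  paste("processed", ID, "for", arg1)
}
V.RunFun <- Vectorize(RunFun)
ID.Group <- c(3402, 5871, 9923)  # stand-in for the IDs read via scan()
V.RunFun(ID.Group)               # runs RunFun once per ID, collecting the results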