Vertica, ResultBufferSize has no effect - odbc

I'm trying to test the ResultBufferSize field when working with Vertica 7.2.3 using ODBC.
From my understanding, this field should affect the result set.
But even with a value of 1 I get 20K results.
Is there any way to make it work?

ResultBufferSize is the size of the result buffer as configured at the ODBC data source, not at runtime.
You get the actual size of a fetched buffer by preparing the SQL statement with SQLPrepare(), counting the result columns with SQLNumResultCols(), and then, for each column found, running SQLDescribeCol().
Good luck -
Marco

I need to add a whole other answer to your comment, Tsahi.
I'm not completely sure if I still misunderstand you, though.
Maybe clarifying how I do it in an ODBC based SQL interpreter sheds some light on the matter.
SQLPrepare() on a string containing, say, "SELECT * FROM foo", returns SQL_SUCCESS, and the passed statement handle becomes valid.
SQLNumResultCols(&stmt,&colcount) on that statement handle returns the number of columns in its second parameter.
In a for loop from 0 to (colcount-1), I call SQLDescribeCol() (note that ODBC itself numbers result columns starting at 1) to get, among other things, the size of the column. That's how many bytes I'd have to allocate to fetch the biggest possible occurrence for that column.
I allocate enough memory to be able to fetch a block of rows instead of just one row in a subsequent SQLFetchScroll() call. For example, a block of 10,000 rows. For this, I need to allocate, for each column in colcount, 10,000 times the maximum possible fetchable size, plus a two-byte integer for the null indicator for each column. These two areas, the data area and the null indicator area, allocated for 10,000 rows in my example, make up the fetch buffer size, in other words, the result buffer size.
For the prepared statement handle, I call a SQLSetStmtAttr() to set SQL_ATTR_ROW_ARRAY_SIZE to 10,000 rows.
SQLFetchScroll() will return either 10,000 rows in one call, or, if the table foo contains fewer rows, all rows in foo.
This is how I understand it to work.
You can do the maths the other way round:
You set the max fetch buffer.
You prepare and describe the statement and columns as explained above.
For each column, you count two bytes for the null indicator, and the maximum possible fetch size as from SQLDescribeCol(), to get the sum of bytes for one row that need to be allocated.
You integer divide the max fetch buffer by the sum of bytes for one row.
And you use that integer divide result for the call of SQLSetStmtAttr() to set SQL_ATTR_ROW_ARRAY_SIZE.
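The reverse calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not ODBC code: the function name rows_per_fetch and the sample column sizes are made up, and the two-byte null indicator is taken from the description above rather than from any driver documentation.

```python
def rows_per_fetch(max_buffer_bytes, column_sizes, indicator_bytes=2):
    """Return how many rows fit into a fetch buffer of the given size.

    column_sizes holds the maximum fetchable size per column, i.e. the
    values one would collect from SQLDescribeCol() (hypothetical here).
    """
    # Bytes needed for one row: data area plus null indicator per column.
    row_bytes = sum(size + indicator_bytes for size in column_sizes)
    if row_bytes <= 0:
        raise ValueError("need at least one column with a positive size")
    # Integer division; always fetch at least one row per call.
    return max(1, max_buffer_bytes // row_bytes)


# Example: three columns of at most 4, 8 and 88 bytes, 1,000,000-byte buffer.
row_array_size = rows_per_fetch(1_000_000, [4, 8, 88])
print(row_array_size)
```

The result is the value that would then be passed to SQLSetStmtAttr() for SQL_ATTR_ROW_ARRAY_SIZE.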
Hope it makes some sense ...

Building an index in Excel

Working in Excel365, what would you say is the most resource-effective formula for building an index from percentage changes?
Assume you have a time series of percentage changes of any variable (e.g. daily changes in a stock price) in A2:A1000 in the form of a dynamic array, and you want to build an index starting at 100 in column B. In its simplest form, you would enter 100 in B1, enter B1*(1+A2) in B2 and copy that formula down to (in this case) B1000. But how would you suggest to do this in the most resource effective way, so that B1:B1000, or at least B2:B1000 becomes a dynamic array following the length of A2#, i.e. if A2# is 2345 rows (instead of 999 rows as in the example above), B1# becomes 2346 rows (or B2# 2345 rows if that solution is simpler)?
I do not have access to the values of the underlying variable, only to the percentage change, and I have many columns I need to build indexes for, therefore it is preferable if it is as resource-effective as possible.
Thanks a million for any ideas!
Kindly,
Johan
P.S. Using OFFSET() to get a dynamic array doesn't work, since the calculation is iterative (index value at t+1 is dependent on the index value at t), thus yielding a circular reference error. Instead I have tried BYROW() with LAMBDAs without much success and I'm not convinced that they are very resource-effective anyway. A seemingly simple problem that has thrown me into a dead-end street...
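For reference, the recurrence described in the question (each index value is the previous one times 1 plus the change) is a running product. A minimal Python sketch of that calculation, outside Excel and with a made-up function name and sample changes, assuming Python 3.8 or later for the initial argument:

```python
from itertools import accumulate


def index_from_changes(changes, start=100.0):
    """Build an index series from a sequence of percentage changes.

    The result has one more element than `changes`, because the
    starting level is included as the first value.
    """
    return list(
        accumulate(changes, lambda level, change: level * (1 + change), initial=start)
    )


# Example: +50% followed by -50%.
levels = index_from_changes([0.5, -0.5])
print(levels)
```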

Count with limit and offset in sqlite

I am trying to write a function in Python that uses SQLite, and while I managed to get it to work, there is a behavior in SQLite that I don't understand when using COUNT. When I run the following, SQLite counts as expected, i.e. it returns an int.
SELECT COUNT (*) FROM Material WHERE level IN (?) LIMIT 10
However, when I add an offset to the end, as shown below, SQLite returns an empty list, in other words nothing.
SELECT COUNT (*) FROM Material WHERE level IN (?) LIMIT 10 OFFSET 82
While omitting the offset is an easy fix, I don't understand why SQLite returns nothing. Is this the expected behavior for the command I gave?
Thanks for reading.
When you execute that COUNT(*), it returns only a single row.
The LIMIT clause limits the number of rows returned. You are setting the limit to 10, which has no effect here because the query returns only a single row.
OFFSET skips the specified number of rows of the result. It is applied to that single result row, not to the rows being counted.
In simple terms, your query translates to: count the number of rows, then return up to 10 result rows starting from the 83rd position. Since the result has only a single row, the query always returns empty.
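The behavior can be reproduced with Python's built-in sqlite3 module. This is a small sketch on an in-memory database; the table layout and the 90 sample rows are made up for illustration. The last query shows one possible way to count within a LIMIT/OFFSET window, by applying them in a subquery first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Material (id INTEGER PRIMARY KEY, level INTEGER)")
# 90 sample rows, all with level = 1 (made-up data).
conn.executemany("INSERT INTO Material (level) VALUES (?)", [(1,)] * 90)

# COUNT(*) yields exactly one result row, so LIMIT 10 changes nothing.
count_only = conn.execute(
    "SELECT COUNT(*) FROM Material WHERE level IN (?) LIMIT 10", (1,)
).fetchall()

# OFFSET 82 skips 82 result rows; the single result row is skipped too.
with_offset = conn.execute(
    "SELECT COUNT(*) FROM Material WHERE level IN (?) LIMIT 10 OFFSET 82", (1,)
).fetchall()

# Apply LIMIT/OFFSET to the matching rows first, then count what is left.
windowed = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT 1 FROM Material WHERE level IN (?) LIMIT 10 OFFSET 82)",
    (1,),
).fetchone()[0]

print(count_only, with_offset, windowed)
```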
Read about LIMIT and OFFSET

column slots in data.table

I have a dataset x with 350m rows and 4 columns. When joining two columns from a dataset i of 13m rows and 19 columns, I encounter the following error:
Internal logical error. DT passed to assign has not been allocated enough column slots. l=4, tl=4, adding 1
I have checked Not Enough Columns Slots but there the problem appears to be in the number of columns. Since I have only a few, I would be surprised if this was the issue.
Also, I found https://github.com/Rdatatable/data.table/issues/1830, where the error is related to "column slots", but I do not understand what they are. When checking truelength, I obtain
> truelength(x)
[1] 0
> truelength(i)
[1] 0
My understanding is that setting, for example, alloc.col(x,32) or alloc.col(i,32), or both, could solve the issue. However, I don't understand what this does or what the issue is. Can anyone offer an explanation?
Part of what makes data.table so efficient is it tries to be smart about memory usage (whereas base data.frames tend to end up getting copied left and right in regular usage, e.g., setting names(DF) = col_names can actually copy all of DF despite only manipulating an attribute of the object).
Part of this, in turn, is that a data.table is always allocated a certain size in memory to allow for adding/subtracting column pointers more fluidly (from a memory perspective).
So, while actual columns take memory greedily (when they're created, sufficient memory is claimed to store the nrow(DT)-size vector), the column pointers, which store addresses where to find the actual data (you can think of this ~like~ column names, if you don't know the grittier details of pointers), have a fixed memory slot upon creation.
alloc.col forces the column pointer address reserve process; this is most commonly used in two cases:
Your data needs a lot of columns (by default, room is allocated for 1024 pointers more than there are columns at definition)
You've loaded your data from RDS (since readRDS/load don't know to allocate this memory for a data.table upon loading, we have to trigger this ourselves)
I assume Frank is right and that you're experiencing the latter. See ?alloc.col for some more details, but in most cases, you should just run alloc.col(x) and alloc.col(i). Except on highly constrained machines, allocating 1024 column pointers requires relatively little memory, so you shouldn't spend too much effort skimping and trying to figure out the right quantity.

Queries on the same big data dataset

Let's say I have a very big dataset (billions of records), one that doesn't fit on a single machine, and I want to run multiple unknown queries against it (it's a service where a user can choose a certain subset of the dataset and I need to return the max of that subset).
For the computation itself I was thinking about Spark or something similar. The problem is that I'm going to have a lot of IO/network activity, since Spark will have to keep re-reading the dataset from disk and distributing it to the workers, instead of, for instance, dividing the data among the workers once when the cluster comes up and then just asking each worker to do the work on certain records (by their number, for example).
So, to the big data people here, what do you usually do? Just have Spark redo the read and distribution for every request?
If I want to do what I described above, do I have no choice but to write something of my own?
If the queries are known but the subsets unknown, you could precalculate the max (or whatever the operator) for many smaller windows / slices of the data. This gives you a small and easily queried index of sorts, which might allow you to calculate the max for an arbitrary subset. In case a subset does not start and end neatly where your slices do, you just need to process the ‘outermost’ partial slices to get the result.
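As a rough single-machine sketch of that idea for the max operator (function names, slice size and sample data are all made up; a real system would keep the slices on disk or across workers):

```python
def build_slice_maxima(data, slice_size):
    """Precalculate the max of each consecutive slice of the data."""
    return [max(data[i:i + slice_size]) for i in range(0, len(data), slice_size)]


def range_max(data, maxima, slice_size, lo, hi):
    """Max of data[lo:hi] (hi exclusive), using precalculated slice maxima."""
    if not 0 <= lo < hi <= len(data):
        raise ValueError("need 0 <= lo < hi <= len(data)")
    first_full = -(-lo // slice_size)  # first slice lying fully inside the range
    last_full = hi // slice_size       # one past the last fully covered slice
    if first_full >= last_full:
        # No slice is fully covered: scan the raw data directly.
        return max(data[lo:hi])
    candidates = list(maxima[first_full:last_full])
    # Process the 'outermost' partial slices from the raw data.
    if lo < first_full * slice_size:
        candidates.append(max(data[lo:first_full * slice_size]))
    if last_full * slice_size < hi:
        candidates.append(max(data[last_full * slice_size:hi]))
    return max(candidates)


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
maxima = build_slice_maxima(data, 4)
print(maxima, range_max(data, maxima, 4, 2, 11))
```

Only the two partial slices at the ends touch raw data; everything in between is answered from the small list of maxima.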
If the queries are unknown, you might want to consider storing the data in a MPP database or use OLAP cubes (Kylin, Druid?) depending on the specifics; or you could store the data in a columnar format such as Parquet for efficient querying.
Here's a precalculating solution based on the problem description in the OP's comment to my other answer:
Million entries, each has 3k name->number pairs. Given a subset of the million entries and a subset of the names, you want the average for each name for all the entries in the subset. So each possible subset (of each possible size) of a million entries is too much to calculate and keep.
Precalculation
First, we split the data into smaller 'windows' (shards, pages, partitions).
Let's say each window contains around 10k rows with roughly 20k distinct names and 3k (name,value) pairs in each row (choosing the window size can affect performance, and you might be better off with smaller windows).
Assuming ~24 bytes per name and 2 bytes for the value, each window contains 10k*3k*(24+2 bytes) = 780 MB of data plus some overhead that we can ignore.
For each window, we precalculate the number of occurrences of each name, as well as the sum of the values for that name. With those two values we can calculate the average for a name over any set of windows as:
Average for name N = (sum of sums for N)/(sum of counts for N)
Here's a small example with much less data:
Window 1
{'aaa':20,'abcd':25,'bb':10,'caca':25,'ddddd':50,'bada':30}
{'aaa':12,'abcd':31,'bb':15,'caca':24,'ddddd':48,'bada':43}
Window 2
{'abcd':34,'bb':8,'caca':22,'ddddd':67,'bada':9,'rara':36}
{'aaa':21,'bb':11,'caca':25,'ddddd':56,'bada':17,'rara':22}
Window 3
{'caca':20,'ddddd':66,'bada':23,'rara':29,'tutu':4}
{'aaa':10,'abcd':30,'bb':8,'caca':42,'ddddd':38,'bada':19,'tutu':6}
The precalculated Window 1 'index' with sums and counts:
{'aaa':[32,2],'abcd':[56,2],'bb':[25,2],'caca':[49,2],'ddddd':[98,2],'bada':[73,2]}
This 'index' will contain around 20k distinct names and two values for each name, or 20k*(24+2+2 bytes) = 560 KB of data. That's more than a thousand times smaller than the data itself.
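A minimal Python sketch of the precalculation and the averaging step, using the small example windows from above. The function names are hypothetical, and in practice each index would be stored and loaded separately rather than kept in one list.

```python
from collections import defaultdict


def build_window_index(rows):
    """Precalculate [sum, count] per name for one window of rows."""
    index = defaultdict(lambda: [0, 0])
    for row in rows:
        for name, value in row.items():
            index[name][0] += value
            index[name][1] += 1
    return dict(index)


def averages(indices, names):
    """Average per name = (sum of sums) / (sum of counts) over the indices."""
    totals = {name: [0, 0] for name in names}
    for index in indices:  # indices could also be loaded one at a time
        for name in names:
            if name in index:
                totals[name][0] += index[name][0]
                totals[name][1] += index[name][1]
    # Names that never occur are left out instead of dividing by zero.
    return {name: s / c for name, (s, c) in totals.items() if c}


window1 = [
    {'aaa': 20, 'abcd': 25, 'bb': 10, 'caca': 25, 'ddddd': 50, 'bada': 30},
    {'aaa': 12, 'abcd': 31, 'bb': 15, 'caca': 24, 'ddddd': 48, 'bada': 43},
]
window2 = [
    {'abcd': 34, 'bb': 8, 'caca': 22, 'ddddd': 67, 'bada': 9, 'rara': 36},
    {'aaa': 21, 'bb': 11, 'caca': 25, 'ddddd': 56, 'bada': 17, 'rara': 22},
]
window3 = [
    {'caca': 20, 'ddddd': 66, 'bada': 23, 'rara': 29, 'tutu': 4},
    {'aaa': 10, 'abcd': 30, 'bb': 8, 'caca': 42, 'ddddd': 38, 'bada': 19, 'tutu': 6},
]

indices = [build_window_index(w) for w in (window1, window2, window3)]
result = averages(indices, ['aaa', 'tutu'])
print(result)
```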
Querying
Now let's put this in action: given an input spanning 1 million rows, you'll need to load (1M/10k)=100 indices or 56 MB, which fits easily in memory on a single machine (heck, it would fit in memory on your smartphone).
But since you are aggregating the results, you can do even better; you don't even need to load all of the indices at once, you can load them one at a time, filter and sum the values, and discard the index before loading the next. That way you could do it with just a few megabytes of memory.
More importantly, the calculation should take no more than a few seconds for any set of windows and names. If the names are sorted alphabetically (another worthwhile pre-optimization) you get the best performance, but even with unsorted lists it should run more than fast enough.
Corner cases
The only thing left to do is handle the case where the input span doesn't line up exactly with the precalculated windows. This requires a little bit of logic for the two 'ends' of the input span, but it can be easily built into your code.
Say each window contains exactly one week of data, from Monday through Sunday, but your input specifies a period starting on a Wednesday. In that case you would have to load the actual raw data from Wednesday through Sunday of the first week (a few hundred megabytes as we noted above) to calculate the (count,sum) tuples for each name first, and then use the indices for the rest of the input span.
This does add some processing time to the calculation, but with an upper bound of 2*780MB it still fits very comfortably on a single machine.
At least that's how I would do it.

Metadata of a Spark DataFrame (RDD)

I am benchmarking Spark in R via "sparklyr" and "SparkR". I test different functions on different test data. In two particular cases, where I count the number of zeros in a column and the number of NAs in a column, I noticed that no matter how big the data is, the result is there in less than a second. All the other computations scale with the size of the data.
So I don't think that Spark computes anything there, but rather that those values are stored somewhere in the metadata, having been computed while loading the data. I tested my functions and they always give me the right result.
Can anyone confirm whether the number of zeros and the number of nulls in a column are stored in a DataFrame's metadata, and if not, why does it return so quickly with the correct value?
There is no metadata associated to a Spark DataFrame that would contain columnar data; therefore, my guess is that the performance difference you measured is caused by something else. Hard to tell without a reproducible example.
