I am trying to write a function in Python that uses SQLite, and while I managed to get it to work, there is a behavior in SQLite that I don't understand when using the COUNT command. When I run the following, SQLite counts as expected, i.e. returns an int.
SELECT COUNT (*) FROM Material WHERE level IN (?) LIMIT 10
However, when I add an offset to the end, as shown below, SQLite returns an empty list, in other words nothing.
SELECT COUNT (*) FROM Material WHERE level IN (?) LIMIT 10 OFFSET 82
While omitting the offset is an easy fix, I don't understand why SQLite returns nothing. Is this the expected behavior for the query I gave?
Thanks for reading.
When you execute that COUNT(*), it returns only a single row.
The LIMIT clause limits the number of rows returned. You are setting the limit to 10, which has no effect here, because only a single row is returned anyway.
OFFSET skips the specified number of rows before any are returned, which also has no effect here.
In simple terms, your query translates to: count the rows, then return up to 10 rows of that result, starting from the 83rd position. Since there is only a single row, it will always come back empty.
Read about LIMIT and OFFSET
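If you want to see this from Python, here is a minimal sketch using the standard sqlite3 module (the Material table and its contents are made up to mirror the question):

import sqlite3

# In-memory database with a stand-in for the Material table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Material (id INTEGER PRIMARY KEY, level INTEGER)")
conn.executemany("INSERT INTO Material (level) VALUES (?)", [(1,)] * 200)

# COUNT(*) collapses everything into one row, so LIMIT 10 changes nothing...
print(conn.execute(
    "SELECT COUNT(*) FROM Material WHERE level IN (?) LIMIT 10", (1,)
).fetchall())   # [(200,)]

# ...and OFFSET 82 skips past that single row, leaving nothing to return.
print(conn.execute(
    "SELECT COUNT(*) FROM Material WHERE level IN (?) LIMIT 10 OFFSET 82", (1,)
).fetchall())   # []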
I have this sample table:
NAME SIZE
sam 100
skr 200
sss 50
thu 150
I want to do this query:
select total(size) > 300 from sample;
but my table is very big, so I want it to stop computing total(size) early if it's already greater than 300, instead of going through the entire table. (There are no negative sizes in my table.) Is there any way to do this in SQLite?
I've found a way to allow it to stop early, by using a window function, but unfortunately it makes it slower for a different reason. I hope someone else has a way to do this faster. To truly make it fast, you might need to create a custom aggregate function.
Window function method
A normal aggregate function like total() will always add all of the rows it's given, but you can use an aggregate window function instead to add only some of the rows:
select name, size,
total(size) over (rows between unbounded preceding
and current row)
from sample
will give you
sam|100|100.0
skr|200|300.0
sss|50|350.0
thu|150|500.0
in which the third column is a cumulative sum. You can see in this result that you'd like to stop this query once you see the 350. You can do this by putting the above query into a subquery and using the EXISTS operator:
select exists(
select 1
from (select total(size) over (rows between unbounded preceding
and current row)
as total_size
from sample)
where total_size > 300)
This will filter the query down to the rows whose running total is > 300, and then stop and return true (1) as soon as it finds one of them. If it never finds such a row, it returns false (0).
However: this version can take longer than simply
select total(size) > 300 from sample
because it re-calculates the sum for each row, instead of just adding the next row's size to the running total.
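If it helps to see it run, here is a small Python sqlite3 sketch that builds the sample table and executes the EXISTS query above (window functions need the bundled SQLite to be 3.25 or newer):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (name TEXT, size INTEGER)")
conn.executemany("INSERT INTO sample VALUES (?, ?)",
                 [("sam", 100), ("skr", 200), ("sss", 50), ("thu", 150)])

# EXISTS stops as soon as one row with a running total above 300 is found.
query = """
SELECT EXISTS(
    SELECT 1
    FROM (SELECT total(size) OVER (ROWS BETWEEN UNBOUNDED PRECEDING
                                   AND CURRENT ROW) AS total_size
          FROM sample)
    WHERE total_size > 300)
"""
print(conn.execute(query).fetchone()[0])  # 1 -- the running total passes 300 at the third row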
I am new to Snowflake, and I'm running a query to get a couple of days' data. It returns more than 200 million rows and takes a few days. I tried running the same query in Jupyter, and the kernel restarts/dies before the query ends. Even if it got into Jupyter, I doubt I could analyze the data in any reasonable timeline (but maybe using dask?).
I am not really sure where to start - I am trying to check the data for missing values, and my first instinct was to use Jupyter - but I am lost at the moment.
My next idea is to stay within Snowflake and check the columns there with case statements (e.g. sum(case when column_value = '' then 1 else 0 end) as number_missing_values).
Does anyone have any ideas/direction I could try - or know if I'm doing something wrong?
Thank you!
Not really the answer you are looking for, but:
sum(case when column_value = '' then 1 else 0 end) as number_missing_values
When you say missing value, this will only find values that are an empty string.
This can also be written in a simpler form as:
count_if(column_value = '') as number_missing_values
The database already knows how many rows are in a column, and it knows how many NULL values there are. If loading data into a table, it might make more sense not to load empty strings and to use NULL instead; then, for no compute cost, you can run:
count(*) - count(column) as number_empty_values
Also of note, if you have two tables in Snowflake you can compare them via MINUS, aka:
select * from table_1
minus
select * from table_2
This is useful for finding missing rows; you do have to run it in both directions, though.
Then you can HASH rows, or hash the whole table via HASH_AGG.
But normally when looking for missing data, you have an external system, so the driver is 'what can that system handle' and finding common ground.
Also, in the past we were searching for bugs in our processing that caused duplicate data (where we needed/wanted no duplicates), so the above, and COUNT DISTINCT-like commands, came in useful.
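If you'd rather drive this from Python than a worksheet, a rough sketch with the snowflake-connector-python package might look like the following; the connection parameters, table name and column list are placeholders, the empty-string check assumes string columns, and only one tiny aggregate row ever leaves Snowflake:

import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials -- fill in for your account.
conn = snowflake.connector.connect(account="my_account", user="my_user",
                                   password="my_password", warehouse="my_wh",
                                   database="my_db", schema="my_schema")

columns = ["col_a", "col_b", "col_c"]  # the columns you want to profile
checks = ", ".join(
    f"count_if({c} = '') AS empty_{c}, count(*) - count({c}) AS null_{c}"
    for c in columns)

cur = conn.cursor()
cur.execute(f"SELECT {checks} FROM my_big_table")  # aggregation stays in Snowflake
print(dict(zip([d[0] for d in cur.description], cur.fetchone())))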
We've never experienced this before. We are importing Tab Delimited TXT files that include numeric columns. Negative numbers have the indicator behind the number and not in front (e.g. 550.00- rather than -550.00).
We are using SQLite Expert Professional. When reviewing the import results in the table, any number with the negative sign at the back was converted to having the negative sign at the front, but every one of these cells is highlighted in blue (we are not sure why SQLite Expert is doing this but assume it has meaning). In addition, when querying and summing, these values are being ignored, causing the resulting value to be higher than expected.
The field types are FLOAT and DECIMAL
We have googled and cannot find any results about negative sign location.
Appreciate any assistance on how to handle this.
An UPDATE query could alter the values by subtracting the incorrect value from 0 after replacing the negative sign with nothing.
The UPDATE could be based upon :-
UPDATE vneg SET iv = (0 - replace(iv,'-','')) WHERE instr(iv,'-') > 1;
e.g. :-
DROP TABLE IF EXISTS vneg;
CREATE TABLE IF NOT EXISTS vneg (iv);
INSERT INTO vneg VALUES ('100.35'),('133.44-'),('25.453-');
SELECT * FROM vneg; -- First Result (before)
UPDATE vneg SET iv = (0 - replace(iv,'-','')) WHERE instr(iv,'-') > 1;
SELECT * FROM vneg; -- Second result (after)
Before the update :-
100.35
133.44-
25.453-
After the Update :-
100.35
-133.44
-25.453
And then using SELECT sum(iv) AS summed FROM vneg; results in :-
-58.543
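If you want to try this quickly from Python before touching the real table, a small sqlite3 sketch with the same sample values:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vneg (iv)")
conn.executemany("INSERT INTO vneg VALUES (?)",
                 [("100.35",), ("133.44-",), ("25.453-",)])

# Flip trailing negative signs: strip the '-' and subtract from 0.
conn.execute("UPDATE vneg SET iv = (0 - replace(iv,'-','')) "
             "WHERE instr(iv,'-') > 1")

print([row[0] for row in conn.execute("SELECT iv FROM vneg")])
# ['100.35', -133.44, -25.453]  (the first value was never touched, so it stays text)
print(conn.execute("SELECT sum(iv) AS summed FROM vneg").fetchone()[0])
# -58.543 (give or take float rounding)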
I'm using SQLite3 and trying to query for recent rows. So I'm having SQLite3 insert a unix timestamp into each row with strftime('%s','now'). My Table looks like this:
CREATE TABLE test(id INTEGER PRIMARY KEY, time);
INSERT INTO test (time) VALUES (strftime('%s','now')); --Repeated
SELECT * FROM test;
1|1516816522
2|1516816634
3|1516816646 --etc lots of rows
Now I want to query for only recent entries, for example, I'm trying to get all rows with a time within the last hour. I'm trying the following SQL query:
SELECT * FROM test WHERE time > strftime('%s','now')-60*60;
However, that always returns all rows regardless of the value in the time column. I really don't know what's going on.
Also, if I put WHERE time > strftime('%s','now') it'll return nothing (which is expected) but if I put WHERE time > strftime('%s','now')-1 then it'll return everything. I don't know why.
Here's one more example:
sqlite> SELECT *, strftime('%s','now')-1 AS window FROM test WHERE time > window;
1|1516816522|1516817482
2|1516816634|1516817482
3|1516816646|1516817482
It seems that SQLite3 thinks the values in the middle column are greater than the values in the right column!?
This isn't at all what I expect. Can someone please tell me what's going on? Thanks!
The purpose of strftime() is to format values, so it returns a string.
When you try to do computations with its return value, the database must convert it into a number. And a number and a string do not compare the way you might expect: in SQLite's comparison rules, any numeric value is considered less than any text value, which is why every stored TEXT timestamp ends up "greater than" your computed cutoff.
You must ensure that both values in a comparison have the same data type.
The best way to do this is to store numbers in the table:
INSERT INTO test (time)
VALUES (CAST(strftime('%s','now') AS MAKE_THIS_A_NUMBER_PLEASE));
(Or just declare the column type as something with numeric affinity.)
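A quick Python sqlite3 sketch showing the difference between storing the text and storing a number (CAST AS INTEGER is used here; the day-old timestamp just makes the wrong result visible):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, time)")

# A timestamp from a day ago, stored once as TEXT and once as a number.
conn.execute("INSERT INTO test (time) VALUES (strftime('%s','now','-1 day'))")
conn.execute("INSERT INTO test (time) VALUES "
             "(CAST(strftime('%s','now','-1 day') AS INTEGER))")

# strftime('%s','now')-3600 is a number, and in SQLite any number compares
# as less than any string, so the TEXT row always looks "recent".
for row in conn.execute(
        "SELECT id, typeof(time), "
        "time > strftime('%s','now') - 3600 AS within_last_hour FROM test"):
    print(row)
# (1, 'text', 1)     <- wrong: a day-old TEXT value still passes the filter
# (2, 'integer', 0)  <- right: the numeric value is correctly excluded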
I'm trying to test the field: ResultBufferSize when working with Vertica 7.2.3 using ODBC.
From my understanding this field should affect the result set.
ResultBufferSize
but even with value 1 I get 20K results.
Any way to make it work?
ResultBufferSize is the size of the result buffer configured at the ODBC data source. Not at runtime.
You get the actual size of a fetched buffer by preparing the SQL statement - SQLPrepare(), counting the result columns - SQLNumResultCols(), and then, for each found column, running SQLDescribeCol().
Good luck -
Marco
I need to add a whole other answer to your comment, Tsahi.
I'm not completely sure if I still misunderstand you, though.
Maybe clarifying how I do it in an ODBC based SQL interpreter sheds some light on the matter.
SQLPrepare() on a string containing, say, "SELECT * FROM foo", returns SQL_SUCCESS, and the passed statement handle becomes valid.
SQLNumResultCols(stmt, &colcount) on that statement handle returns the number of columns in its second parameter.
In a for loop from 0 to (colcount-1), I call SQLDescribeCol(), to get, among other things, the size of the column - that's how many bytes I'd have to allocate to fetch the biggest possible occurrence for that column.
I allocate enough memory to be able to fetch a block of rows instead of just one row in a subsequent SQLFetchScroll() call. For example, a block of 10,000 rows. For this, I need to allocate, for each column in colcount, 10,000 times the maximum possible fetchable size, plus a two-byte integer for the null indicator of each column. These two areas, the allocated data area and the allocated null indicator area, for 10,000 rows in my example, make up the fetch buffer size, in other words, the result buffer size.
For the prepared statement handle, I call a SQLSetStmtAttr() to set SQL_ATTR_ROW_ARRAY_SIZE to 10,000 rows.
SQLFetchScroll() will return either 10,000 rows in one call, or, if the table foo contains fewer rows, all rows in foo.
This is how I understand it to work.
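To put some numbers on that, a back-of-the-envelope sketch in Python, with made-up column sizes and the two-byte indicator from the description above:

# Hypothetical per-column sizes in bytes, as SQLDescribeCol() might report them.
col_sizes = [4, 8, 256, 32]
rows_per_fetch = 10_000    # the value set via SQL_ATTR_ROW_ARRAY_SIZE
indicator_size = 2         # null indicator per column, per the description above

data_area = sum(col_sizes) * rows_per_fetch
indicator_area = indicator_size * len(col_sizes) * rows_per_fetch
result_buffer_size = data_area + indicator_area
print(result_buffer_size)  # 3_080_000 bytes for this made-up row layout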
You can do the maths the other way round:
You set the max fetch buffer.
You prepare and describe the statement and columns as explained above.
For each column, you count two bytes for the null indicator, and the maximum possible fetch size as from SQLDescribeCol(), to get the sum of bytes for one row that need to be allocated.
You integer divide the max fetch buffer by the sum of bytes for one row.
And you use the result of that integer division in the call to SQLSetStmtAttr() to set SQL_ATTR_ROW_ARRAY_SIZE.
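And the same arithmetic in the other direction, again a rough Python sketch with made-up sizes:

max_fetch_buffer = 1_000_000  # bytes you are willing to allocate
col_sizes = [4, 8, 256, 32]   # hypothetical sizes from SQLDescribeCol()
indicator_size = 2            # null indicator per column, per the description above

bytes_per_row = sum(col_sizes) + indicator_size * len(col_sizes)  # 308
row_array_size = max_fetch_buffer // bytes_per_row                # 3246
# row_array_size is the value to pass to SQLSetStmtAttr() for SQL_ATTR_ROW_ARRAY_SIZE.
print(row_array_size)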
Hope it makes some sense ...