I have a table that looks like this.
| id | coords |
| 0  | [1,0],[4,3],[4,9],[9,3],[1,8] |
| 1  | [3,6],[3,8],[7,4],[5,2],[2,1] |
... and more
There will be around 70k-100k rows at most, and the CPU is not very powerful.
What is the fastest and least CPU-intensive SQLite statement I can use to determine which id has any given coordinate? No two ids share a coordinate.
Example.
SELECT * FROM mytable WHERE coords LIKE '%[[]3,8]%'
I imagine the LIKE statement above will get pretty intensive, right?
You should always try to have a properly normalized database; in this case, the coordinate list is not in first normal form.
If you move the coordinates to a separate table, you can search for them with a simple and obvious query, which can be sped up with an index:
CREATE TABLE MyTable (
ID,
[...]
);
CREATE TABLE MyCoordinates (
MyTableID,
CoordX,
CoordY
);
SELECT MyTableID FROM MyCoordinates WHERE CoordX = ? AND CoordY = ?;
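With 70k-100k coordinate rows, the index mentioned above is what keeps this cheap: a composite index on both coordinate columns turns the query into a single index seek instead of a full table scan. A minimal sketch, using the table and column names above (the index name is arbitrary):
CREATE INDEX MyCoordinates_XY ON MyCoordinates(CoordX, CoordY);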
My data is partitioned in ADL in the following manner:
year=2022/month=9/day=7/hour=5/city=chicago/ [...a bunch of data files...]
year=2022/month=9/day=7/hour=5/city=london/ [...a bunch of data file...]
So it's partitioned by year, month, day, hour, and city. Hour is in 24-hour format, with values ranging from 0 to 23. Also, please note that the directory names are not zero-padded: as you can see in the example, month=9 appears as a single digit and not 09, and the same goes for day and hour. The thing to note is that this data is produced by another process, so we can't change the way the partition directories appear.
My goal is to create an external table to read this data from Kusto. To optimize any query on this table, I would like to take advantage of the virtual column feature. In the case of our data, the actual year, month, day, hour, and city values are not part of the data itself but appear only in the partition directory names (as is common in many big data scenarios). With that in mind, I created the external table as follows:
.create-or-alter external table myexternaltable
(
...data column fields...
...data column fields...
...data column fields...
... etc. ...
)
kind=adl
partition by (date_id:datetime, city:string)
pathformat = (
"year=" datetime_pattern("yyyy",date_id)
"/month=" datetime_pattern("MM",date_id)
"/day=" datetime_pattern("dd",date_id)
"/hour=" datetime_pattern("HH",date_id)
"/city=" city
)
dataformat=parquet
(
...ADL endpoint...
)
As you can see, I defined two virtual columns, i.e. date_id and city.
The external table creation was successful, but I don't think it's looking into the right partitions when I query it:
external_table('myexternaltable') | where date_id == datetime(2022-9-3-5) | where city == 'london' | take 1
This returned no rows, even though there is data in the corresponding location. I suspect the issue is that my pathformat uses a padded digit format, i.e. it probably searches year=2022/month=09/day=03/hour=05, whereas the data exists in year=2022/month=9/day=3/hour=5. Is that the reason? If so, what is the correct pathformat for this sort of requirement?
After some playing around, I found the following hack to work:
.create-or-alter external table myexternaltable
(
...data column fields...
...data column fields...
...data column fields...
... etc. ...
)
kind=adl
partition by (year:string,
month:string,
day:string,
hour:string,
city:string)
pathformat = (
"year=" year
"/month=" month
"/day=" day
"/hour=" hour
"/city=" city
)
dataformat=parquet
(
...ADL endpoint...
)
Now both of the following queries work quite fast. As you can see, it no longer matters whether I use a leading 0 in the query, because behind the scenes Kusto removes it when converting the number to a string, since I defined year, month, day, and hour as strings:
external_table('myexternaltable') | where year==2022 | where month == 09 | where day==03 | where hour==05 | where city=='london' | take 1
external_table('myexternaltable') | where year==2022 | where month == 9 | where day==3 | where hour==5 | where city=='london' | take 1
It would still be good to find a way to get a single virtual column like date_id, of type datetime, so one can perform datetime arithmetic on it; a single virtual column would also look neater than four.
Updated: I believe the following should work. See the example in the docs:
pathformat = (datetime_pattern("'year='yyyy'/month='M'/day='d",date_id))
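Extending that same pattern to the full layout in the question, with a non-padded hour ('H') and the city segment appended, might look like the following. This is an untested sketch built from the documented pathformat syntax, not something verified against a live cluster:
pathformat = (datetime_pattern("'year='yyyy'/month='M'/day='d'/hour='H", date_id) "/city=" city)
The single-letter M, d, and H patterns match non-padded month, day, and hour values, which is exactly what the directory names require.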
I have a need to be able to query Azure Data Explorer (ADX) tables dynamically, that is, using application-specific metadata that is also stored in ADX.
If this is even possible, the way to do it seems to be via the table() function. In other words, it feels like I should be able to simply write:
let table_name = <non-trivial ADX query that returns the name of a table as a string>;
table(table_name) | limit 10
But this query fails, since I am trying to pass a variable to the table() function: "a parameter, which is not scalar constant string can't be passed as parameter to table() function". The suggested workaround doesn't really help, since all the possible table names are not known ahead of time.
Is there any way to do this all within ADX (i.e. without multiple queries from the client) or do I need to go back to the drawing board?
If you know the desired output schema, you could potentially achieve this using union. (Note that in this case the result schema will be the union of all the tables, so you'll need to explicitly project the columns you're interested in.)
let TableA = view() { print col1 = "hello world"};
let TableB = view() { print col1 = "goodbye universe" };
let LabelTable = datatable(table_name:string, label:string, updated:datetime)
[
"TableA", "MyLabel", datetime(2019-10-08),
"TableB", "MyLabel", datetime(2019-10-02)
];
let GetLabeledTable = (l:string)
{
    toscalar(
        LabelTable
        | where label == l
        | order by updated desc
        | limit 1
    )
};
let table_name = GetLabeledTable('MyLabel');
union withsource = T *
| where T == table_name
| project col1
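With the sample data above, GetLabeledTable('MyLabel') should resolve to "TableA" (toscalar returns the first column of the top row, and TableA's label has the most recent updated value), so the union filter keeps only TableA's rows and the query prints col1 = "hello world".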
To see if there is a Todd in my database, I currently do the following:
SELECT * FROM MyTable WHERE name='Todd' LIMIT 1
I then check the cursor to see whether its size == 1. Is there a way to return a 0 or 1 from the select statement, depending on whether the condition is true or false, rather than a list of fields?
You can do
SELECT COUNT(*) AS it_exists
FROM
(
SELECT 1
FROM MyTable
WHERE name = 'Todd'
LIMIT 1
) q;
The inner select guarantees that the LIMIT is applied: if you hypothetically had thousands of matching rows, the database engine would stop and return a result after the first one instead of going through all of them.
Output
| it_exists |
|-----------|
| 1 |
Here is a SQLFiddle demo.
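Since this is SQLite, you can also get the 0/1 answer directly with EXISTS, which short-circuits on the first match just like the LIMIT trick:
SELECT EXISTS (SELECT 1 FROM MyTable WHERE name = 'Todd') AS it_exists;
Both forms should perform the same; EXISTS just states the intent a bit more directly.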
The question probably is quite confusing.
In effect, I have the following:
WatchList table
UserId | FilmId
3      | 77
etc    | etc
These are foreign keys to the following tables:
FilmDB - Film_title, Film_plot, Film_Id, etc.
and
aspnet_memberships - UserId, Username, etc.
Now, I presume I will need to use a join, but I am struggling with the syntax.
I would like to use COUNT on the WatchList and return the most frequent FilmIds and their counterpart information, but I'd then like to return the REST of the FilmDB results, essentially giving me a list of ALL films, with those found in the WatchList sorted to the top by watch frequency.
Does that make sense? Thanks.
SELECT *
FROM filmdb
LEFT JOIN (
SELECT filmid, count(*) AS cnt
FROM watch_list
GROUP BY filmid) AS a
ON filmdb.film_id = a.filmid
ORDER BY isnull(cnt, 0) DESC;
http://sqlfiddle.com/#!3/46b16/10
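As an aside, isnull is SQL Server-specific. On engines that lack it, COALESCE is the standard-SQL spelling of the same fallback, so a portable version of the query would be:
SELECT *
FROM filmdb
LEFT JOIN (
    SELECT filmid, count(*) AS cnt
    FROM watch_list
    GROUP BY filmid) AS a
ON filmdb.film_id = a.filmid
ORDER BY COALESCE(cnt, 0) DESC;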
You did not specify whether the query should be grouped by film_id or user_id. The example I have provided is grouped by user; if you change that to group only by film_id, you will get the watch count across all users for each film.
You need to use a subquery to get the count and then order the results by the count descending to get an ordered list.
SELECT
*
FROM
(
SELECT
WatchList.Film_Id,
WatchCount=COUNT(*),
FilmDB.Film_Title
FROM
WatchList
INNER JOIN FilmDB ON FilmDB.Film_Id=WatchList.Film_Id
GROUP BY
WatchList.UserID,
WatchList.Film_Id,
FilmDB.Film_Title
)AS X
ORDER BY
WatchCount DESC
I am new to Hive and I am trying to count distinct words_value entries across my whole words column.
id                 | words
435400064446779392 | [{"words_value":"i","words_id":"1"},{"words_value":"hate","words_id":"2"}]
Notice that the words column is an array. I have many more rows; the above is just an example.
I have tried:
SELECT words.words_value,count(words.words_value) from T1 GROUP BY words.words_value WITH ROLLUP;
But it counts within each row rather than across the whole column.
Does anyone have any idea?
The explode UDTF is useful for converting nested data structures into the ordinary tables that ordinary SQL statements work with. Since you have an array of maps, you need to use explode twice.
select count(distinct value)
from (
    select explode(col)
    from (
        select explode(words) from mytable
    ) subquery1
) subquery2
where key = "words_value";
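An equivalent formulation uses LATERAL VIEW, which keeps the base table's columns (such as id) available and reads top-down. A sketch assuming words is an array<map<string,string>>; the aliases (w, kv, k, v) are arbitrary names, and mytable stands in for your table as above:
SELECT COUNT(DISTINCT kv.v)
FROM mytable
LATERAL VIEW explode(words) w AS word_map
LATERAL VIEW explode(word_map) kv AS k, v
WHERE kv.k = 'words_value';
Each LATERAL VIEW joins every row to its exploded elements, so the filter keeps only the words_value entries before the distinct count.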