Select conversations from a SQLite chatlog - sqlite

I have a SQLite table representing a chatlog. The two important columns for this question are 'content' and 'timestamp'.
I need to group the messages in the chatlog by conversation. Each message is a single line, so a conversation can be selected by joining its messages with newlines using group_concat:
group_concat(content, CHAR(10))
I want to identify a conversation as any run of messages that are within a length of time (such as 15 minutes) of each other. A conversation can be any length (including just a single message, if there are no other messages within 15 minutes of it).
Knowing this, I can identify whether a message is the start or part of a conversation as
WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
But this is as far as I've gotten. I can make a column 'is_new_convo' using
WITH ordered_messages AS (
SELECT content, timestamp
FROM messages
ORDER BY timestamp
), conversations_identified AS (
SELECT *,
CASE
WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
THEN 0
ELSE 1
END AS is_new_convo
FROM ordered_messages
) SELECT * FROM conversations_identified
How can I then form a group of messages from where is_new_convo = 1 to the last subsequent is_new_convo = 0?
Here is some sample data and the expected result.

If you take the running sum of the is_new_convo column from the first row up to a given row, you get the number of times a new conversation has started so far. That count is an ID shared by every message in a conversation, because is_new_convo is 0 for messages that continue a conversation and so leaves the sum unchanged. Using this, we can compute the conversation ID for every message, then group by it for group_concat. This doesn't require referencing the original table multiple times, so the 'WITH' clauses aren't needed.
SELECT group_concat(content, CHAR(10)) as conversation
FROM (
SELECT content, timestamp,
SUM(is_new_convo) OVER (ORDER BY timestamp) as conversation_id
FROM (
SELECT content, timestamp,
CASE
WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
THEN 0
ELSE 1
END AS is_new_convo
FROM messages
)
) GROUP BY conversation_id
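As a minimal sketch of the running-sum technique, the following Python script runs the same query shape against an in-memory SQLite database. The sample rows are made up for illustration, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (content TEXT, timestamp INTEGER)")
# Hypothetical sample rows (timestamps in seconds): gaps of 900 s or more start a new conversation.
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [("hi", 0), ("hello", 60), ("anyone there?", 3600), ("yes", 3700), ("late", 10000)],
)

query = """
SELECT conversation_id,
       COUNT(*) AS n_messages,
       group_concat(content, CHAR(10)) AS conversation
FROM (
    SELECT content, timestamp,
           -- running total of "new conversation" flags = conversation ID
           SUM(is_new_convo) OVER (ORDER BY timestamp) AS conversation_id
    FROM (
        SELECT content, timestamp,
               CASE WHEN timestamp - LAG(timestamp, 1, timestamp)
                         OVER (ORDER BY timestamp) < 900
                    THEN 0 ELSE 1 END AS is_new_convo
        FROM messages
    )
)
GROUP BY conversation_id
ORDER BY conversation_id
"""
conversations = conn.execute(query).fetchall()
for conversation_id, n_messages, text in conversations:
    print(conversation_id, n_messages, repr(text))
```

Note that the first row gets is_new_convo = 0 because LAG defaults to the row's own timestamp, so IDs start at 0. SQLite does not document the order in which group_concat joins rows, so treat the line order inside each conversation as likely but not guaranteed.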

Related

android room database Dao two queries?

How do I return the results of two queries using one @Query statement?
I have a database of items with a single table. Each item has a due date (saved as a long in the Room database) or no due date (saved as -1 in the database). I would like to have a query that returns all items with due dates in ascending order and then return all of the remaining items, sorted by a timestamp that is saved in the database. The timestamp represents the calendar date and time when the item was originally saved to the Room database.
Here is an example of the output I expect, using a U.S. calendar for the due dates:
8/17/2022 (August 17, 2022 due date)
8/19/2022 (due date)
12/15/2022 (due date)
5601 timestamp (no due date)
4200 timestamp (no due date)
1150 timestamp (no due date)
The below query in the Dao returns the expected results of the first part of the query, the ascending due dates. So how do I append the below query with the second part where I also return the items that have no due dates and show their timestamps in descending order? I tried multiple ways to use UNION, UNION ALL, etc. with no luck.
@Query("SELECT * FROM cards WHERE cardDuedatentime !=-1 ORDER BY cardDuedatentime ASC")
First sort by the boolean expression cardDuedatentime = -1 to get all the rows with no due date at the bottom of the resultset.
Then use conditional sorting with a CASE expression to sort the rows with no due date descending and the rows with a valid due date ascending:
SELECT *
FROM cards
ORDER BY cardDuedatentime = -1,
CASE WHEN cardDuedatentime = -1 THEN -timestamp ELSE cardDuedatentime END;
If you want only 1 column in the results:
SELECT CASE WHEN cardDuedatentime = -1 THEN timestamp ELSE cardDuedatentime END time
FROM cards
ORDER BY cardDuedatentime = -1,
CASE WHEN cardDuedatentime = -1 THEN -timestamp ELSE cardDuedatentime END;
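The conditional sort above can be tried out with a short Python script against an in-memory SQLite database. This is only a sketch: the rows are invented, and the due dates are stored as yyyymmdd integers as a stand-in for the real long values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (cardDuedatentime INTEGER, timestamp INTEGER)")
# Hypothetical rows, deliberately inserted out of order; -1 means "no due date".
conn.executemany(
    "INSERT INTO cards VALUES (?, ?)",
    [(-1, 4200), (20221215, 10), (-1, 5601), (20220817, 20), (-1, 1150), (20220819, 30)],
)

rows = conn.execute(
    """
    SELECT CASE WHEN cardDuedatentime = -1 THEN timestamp ELSE cardDuedatentime END AS time
    FROM cards
    ORDER BY cardDuedatentime = -1,  -- 0 (has due date) sorts before 1 (no due date)
             CASE WHEN cardDuedatentime = -1 THEN -timestamp ELSE cardDuedatentime END
    """
).fetchall()
ordered = [row[0] for row in rows]
print(ordered)
```

Negating the timestamp inside the CASE expression is what turns the ascending sort into a descending one for the rows without a due date.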
If I understand your question correctly, then I believe that you could use:-
@Query("WITH cte1 AS (SELECT * FROM cards WHERE cardDueDatentime != -1 ORDER BY cardDueDatentime ASC),cte2 AS (SELECT * FROM cards WHERE cardDueDatentime = -1 ORDER BY timestamp ASC) SELECT * FROM cte1 UNION ALL SELECT * FROM cte2;")
The following was used to test/demonstrate:-
DROP TABLE IF EXISTS cards;
CREATE TABLE IF NOT EXISTS cards (cardDueDatentime INTEGER,timestamp INTEGER, othercolumns TEXT);
INSERT INTO cards VALUES
(strftime('%s','2022-08-17'),strftime('%s','now'),'A')
,(-1,5601,'A')
,(strftime('%s','2022-08-11'),strftime('%s','now'),'A')
,(-1,4201,'A')
,(strftime('%s','2022-12-15'),strftime('%s','now'),'A')
,(-1,1150,'A')
;
WITH
cte1 AS (SELECT * FROM cards WHERE cardDueDatentime != -1 ORDER BY cardDueDatentime ASC),
cte2 AS (SELECT * FROM cards WHERE cardDueDatentime = -1 ORDER BY timestamp ASC)
SELECT * FROM cte1 UNION ALL SELECT * FROM cte2;
DROP TABLE IF EXISTS cards;
The result from executing the above (using the Navicat for SQLite tool):-
Two CTEs (Common Table Expressions, which act like temporary tables) were used, each extracting one of the two sets of data and, importantly, sorting it independently. They are then combined via UNION ALL, and the combined result is not sorted again (as a sort at that level would affect the complete set of data).
Note how the data has purposefully been inserted so that they are not appropriately sorted.
An even simpler way would be to use:-
@Query("SELECT * FROM cards ORDER BY cardDueDatentime=-1 ASC,cardDueDatentime ASC, timestamp ASC;")
With the data above, this produces the same result. This works because cardDueDatentime=-1 evaluates to either true (1) or false (0). A due date of -1 therefore yields 1 and a valid datetime yields 0, so the valid datetimes precede the invalid (-1) ones. The subsequent sort fields then order the rows within each set.
If you wanted to treat any negative value (less than 0) as an invalid date, then you could use something like:-
@Query("SELECT * FROM cards ORDER BY cardDueDatentime<0 ASC,max(CAST(cardDueDatentime AS INTEGER),0) ASC, timestamp ASC;")
So if you had additional rows inserted such as :-
....
,(-2,111,'B')
,(-3,11,'C')
,(-1,1,'X')
Then the result would be:-
Whilst with the first simpler SELECT, with the additional data, the result (WRONG) would be:-
i.e. for the C row, -3 is not equal to -1, so the first query sorts it as if it were a valid date, whereas < 0 treats it as an invalid date and includes it in the set of invalid dates.
However, with < 0 allowing -3 into the invalid-date set, a second sort on cardDueDatentime alone would place -3 before -2 and -2 before -1. The max function therefore maps every negative value to 0, so all invalid dates compare equal in the second sort field, and the third sort field is then the applicable one within the set of invalid dates.
This could be useful if, for some reason, you wanted to have different sets/types of invalid dates without affecting the query.
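To make the difference between the = -1 and < 0 variants concrete, here is a small Python sketch that runs both ORDER BY clauses against an in-memory SQLite table. The rows and their values are illustrative assumptions, not the data from the screenshots.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cards (cardDueDatentime INTEGER, timestamp INTEGER, othercolumns TEXT)"
)
# Hypothetical rows: two valid due dates plus several negative "no due date" markers.
conn.executemany(
    "INSERT INTO cards VALUES (?, ?, ?)",
    [
        (1660694400, 9000, "A"),
        (-1, 5601, "A"),
        (-2, 111, "B"),
        (-3, 11, "C"),
        (-1, 1, "X"),
        (1660176000, 9001, "A"),
    ],
)

eq_order = [
    row[0]
    for row in conn.execute(
        "SELECT cardDueDatentime FROM cards "
        "ORDER BY cardDueDatentime = -1 ASC, cardDueDatentime ASC, timestamp ASC"
    )
]
lt_order = [
    row[0]
    for row in conn.execute(
        "SELECT cardDueDatentime FROM cards "
        "ORDER BY cardDueDatentime < 0 ASC, "
        "max(CAST(cardDueDatentime AS INTEGER), 0) ASC, timestamp ASC"
    )
]
print(eq_order)  # -3 and -2 are sorted among the valid dates
print(lt_order)  # all negative values are grouped last, ordered by timestamp
```

With two arguments, SQLite's max is the scalar function rather than the aggregate, which is why it can be used directly in ORDER BY here.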

TERADATA: Is it possible to ignore rows in an OLAP partition when the condition is met and still pass the value down when it isn't met?

I'm partitioning data based on a customer's previous order. If the customer previously added a service to their account (they either have the service or they don't), I want that value to carry down to the next row for that customer for all orders, regardless of the order status. However, I don't want services from canceled orders to be carried into the next order; I want to skip those rows and bring down the value from the previously completed order. Does anyone know if this is possible? If I add the field to the PARTITION BY clause, it partitions by order status instead of reporting the order status from the previous completed order.
SUM(SUBSCR1_ORD) OVER (
    PARTITION BY ACCT_NO
    ORDER BY ORDER_DATE
    ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
) AS EXISTING_SVC1
This is what I'd want the results to look like for the EXISTING_SVC columns, based on activity in the SUBSCR1_ORD and SUBSCR2_ORD columns with special handling of ORDER_STATUS:
ACCT_NO | ORDER_DATE | ORDER_STATUS | SUBSCR1_ORD | SUBSCR2_ORD | EXISTING_SVC1 | EXISTING_SVC2
1234 | 6/5/2022 | Complete | 1 | null | 0 | 0
1234 | 6/6/2022 | Canceled | -1 | 1 | 1 | 0
1234 | 6/7/2022 | Complete | null | 1 | 1 | 0
Use LAG with IGNORE NULLS and a CASE expression to "pull down" the prior value.
SELECT Acct_No, Order_Date, Order_Status, Subscr1_Ord, Subscr2_Ord,
LAG(CASE WHEN Order_Status='Canceled' THEN NULL ELSE Subscr1_Ord END,1,0)
IGNORE NULLS
OVER(PARTITION BY Acct_No ORDER BY Order_Date)
AS Existing_Svc1,
LAG(CASE WHEN Order_Status='Canceled' THEN NULL ELSE Subscr2_Ord END,1,0)
IGNORE NULLS
OVER(PARTITION BY Acct_No ORDER BY Order_Date)
AS Existing_Svc2
FROM MyTable
ORDER BY Order_Date;
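For readers without a Teradata system at hand, the behaviour of LAG ... IGNORE NULLS with the CASE expression can be approximated in plain Python. This is a sketch of the semantics rather than of Teradata itself; the function name and dict keys are assumptions, and the rows are the sample rows from the question.

```python
def lag_ignore_nulls(rows, value_key, default=0):
    """Emulate LAG(CASE WHEN canceled THEN NULL ELSE value END, 1, default) IGNORE NULLS.

    `rows` must already be sorted by order date; partitioning is by acct_no.
    """
    last_seen = {}  # acct_no -> most recent non-null value from a non-canceled order
    result = []
    for row in rows:
        acct = row["acct_no"]
        # The value for this row comes only from earlier rows in the same partition.
        result.append(last_seen.get(acct, default))
        value = row[value_key]
        if row["order_status"] != "Canceled" and value is not None:
            last_seen[acct] = value
    return result


orders = [
    {"acct_no": 1234, "order_date": "2022-06-05", "order_status": "Complete",
     "subscr1_ord": 1, "subscr2_ord": None},
    {"acct_no": 1234, "order_date": "2022-06-06", "order_status": "Canceled",
     "subscr1_ord": -1, "subscr2_ord": 1},
    {"acct_no": 1234, "order_date": "2022-06-07", "order_status": "Complete",
     "subscr1_ord": None, "subscr2_ord": 1},
]
existing_svc1 = lag_ignore_nulls(orders, "subscr1_ord")
existing_svc2 = lag_ignore_nulls(orders, "subscr2_ord")
print(existing_svc1, existing_svc2)
```

Turning canceled rows into NULL is what lets IGNORE NULLS skip them, so the value from the last completed order is the one carried down.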

Automatically count DB records by interval and write result periodically to aggregation table

I have a SQLite3 DB with the following three-column layout:
typ (1=gas or 0=electrical power) | time (seconds since epoch) | value (float)
In it, I record events from a gas meter that fires for every 10 liters of gas consumed. When the gas heating is active, this is once every ~20 seconds. The value written together with the timestamp is 0 (zero).
I want to automatically fill an aggregation table with the count of all records within each 10-minute interval.
I had success with this query to get the counts within the intervals:
select time/600*600+600 _time, count(*) _count
from data
where typ = 1 and value = 0
group by _time
order by _time
But how would I achieve the following:
run this query regularly every 10 minutes (or at every INSERT with a TRIGGER?) at xx:10 / xx:20 / xx:30 / ...
write the resulting count of only the last 10 minutes to an aggregation table together with the interval end time.
I of course could do this with a program (e.g. PHP) but I'd prefer a DB-only solution if possible.
Thanks for any help.
This trigger will run for every inserted row, and tries to insert a corresponding row in an aggregate table if one does not already exist. Then it increments the counter value in the aggregate table for the timespan of the newly inserted row.
create trigger data_after_insert after insert on data -- any trigger name will do
begin
insert or ignore into aggregateData(startTime, counter) values ((new.time / 600) * 600, 0);
update aggregateData set counter = counter + 1 where startTime = (new.time / 600) * 600;
end;
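A quick way to check the trigger's behaviour is to run it from Python against an in-memory database. The sketch below assumes that aggregateData uses startTime as its primary key (INSERT OR IGNORE needs a uniqueness constraint to have any effect) and uses a made-up trigger name and sample times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE data (typ INTEGER, time INTEGER, value REAL);
    -- Assumed schema: startTime must be unique for INSERT OR IGNORE to skip duplicates.
    CREATE TABLE aggregateData (startTime INTEGER PRIMARY KEY, counter INTEGER NOT NULL);

    -- data_after_insert is a hypothetical trigger name.
    CREATE TRIGGER data_after_insert AFTER INSERT ON data
    BEGIN
        INSERT OR IGNORE INTO aggregateData(startTime, counter)
            VALUES ((new.time / 600) * 600, 0);
        UPDATE aggregateData SET counter = counter + 1
            WHERE startTime = (new.time / 600) * 600;
    END;
    """
)

# Hypothetical gas-meter events (typ = 1, value = 0) at various epoch seconds.
conn.executemany("INSERT INTO data VALUES (1, ?, 0)", [(0,), (20,), (599,), (600,), (1250,)])
buckets = conn.execute(
    "SELECT startTime, counter FROM aggregateData ORDER BY startTime"
).fetchall()
print(buckets)
```

Because time is an integer, new.time / 600 is integer division, so each event falls into the bucket that starts at the preceding multiple of 600 seconds. As written, the trigger counts every inserted row; adding a WHEN clause on typ and value would be one way to restrict it to gas events.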
I think that I found an easier solution which in the end creates the same result:
Just turn my aggregate query into a view:
CREATE VIEW _aggregate as
select time/600*600+600 _time, count(*) _count
from data
where typ = 1 and value = 0
group by _time
order by _time
This gives me exactly my desired result if I do a:
select * from _aggregate
It's good enough to have the aggregated values at runtime and not to store them. Or do you see a substantial difference to your solution?

Big Query - Google Analytics - Time diff between first visit and purchase

Trying to get a list:
visitorid, time first visit, time of hit where transaction occurred.
What I've written only grabs rows that have transaction revenue. I am also trying to convert visitStartTime, which is a Unix timestamp, to a regular date via Date(visitStartTime), but that's failing in the GROUP BY because of the outputted date.
Any direction super helpful.
SELECT
fullvisitorID,
visitNumber,
visitStartTime,
hits.transaction.transactionRevenue
FROM
[75718103.ga_sessions_20150310],
[75718103.ga_sessions_20150309],
[75718103.ga_sessions_20150308],
[75718103.ga_sessions_20150307],
[75718103.ga_sessions_20150306],
[75718103.ga_sessions_20150305],
[75718103.ga_sessions_20150304],
[75718103.ga_sessions_20150303],
[75718103.ga_sessions_20150302]
WHERE totals.transactions >=1
GROUP BY
fullvisitorID, visitNumber, visitStartTime, hits.transaction.transactionRevenue;
visitStartTime is defined as POSIX time in the Google Analytics schema, i.e. the number of seconds since the epoch. A BigQuery TIMESTAMP is encoded as the number of microseconds since the epoch. Therefore, to get the start time as a TIMESTAMP, I used TIMESTAMP(INTEGER(visitStartTime*1000000)). hits.time contains the number of milliseconds since the first hit, so to get the time of a transaction it needs to be multiplied by 1000 to reach microsecond granularity, hence TIMESTAMP(INTEGER(visitStartTime*1000000 + hits.time*1000)). Since hits is a repeated RECORD, no GROUP BY is necessary; the data model already has all the hits grouped together.
Putting it all together:
SELECT
fullVisitorId,
timestamp(integer(visitStartTime*1000000)) as start_time,
timestamp(integer(visitStartTime*1000000 + hits.time*1000)) as transaction_time
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.transaction.transactionRevenue > 0
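The unit conversions in this answer (seconds and milliseconds to microseconds) can be checked outside BigQuery. The following Python sketch mirrors the arithmetic only; the function name and the sample values are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamps(visit_start_time_s, hit_time_ms):
    """Return (start_time, transaction_time) as UTC datetimes.

    visit_start_time_s: seconds since the epoch (like visitStartTime).
    hit_time_ms: milliseconds since the first hit of the visit (like hits.time).
    """
    start_us = visit_start_time_s * 1_000_000   # seconds -> microseconds
    hit_us = start_us + hit_time_ms * 1_000     # milliseconds -> microseconds
    return (
        EPOCH + timedelta(microseconds=start_us),
        EPOCH + timedelta(microseconds=hit_us),
    )


# Hypothetical visit: started at epoch second 1378800000, transaction hit 90.5 s later.
start_time, transaction_time = to_timestamps(1378800000, 90500)
print(start_time.isoformat(), transaction_time.isoformat())
```

The difference between the two values is the time from the start of the visit to the transaction hit within that same visit.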
Mosha's solution is simple and elegant, but too simple: it calculates the time between the first pageview and each transaction inside one visit, so it does not calculate the time between a visitor's first visit and first transaction. If you calculate the average time using Mosha's query, it is 1.33 minutes, but with the query I created it is 9.91 minutes. My SQL skills are quite rusty, so it can probably be improved.
Mosha's query (avg. time between the first pageview and each transaction inside one visit):
SELECT ROUND(AVG(MinutesToTransaction),2) AS avgMinutesToTransaction FROM (
SELECT
fullVisitorId,
timestamp(integer(visitStartTime*1000000)) as start_time,
timestamp(integer(visitStartTime*1000000 + hits.time*1000)) as transaction_time,
ROUND((TIMESTAMP_TO_SEC(timestamp(integer(visitStartTime*1000000 + hits.time*1000))) - TIMESTAMP_TO_SEC(timestamp(integer(visitStartTime*1000000)) )) / 60, 2) AS MinutesToTransaction
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.transaction.transactionRevenue > 0
)
My query (avg. time between the first visit and the first transaction of one visitor):
SELECT ROUND(AVG(MinutesToTransaction),2) AS avgMinutesToTransaction FROM (
SELECT firstInteraction.fullVisitorId,
MIN(firstInteraction.visitId) AS firstInteraction.visitId,
TIMESTAMP(INTEGER(MIN(firstInteraction.visitStartTime)*1000000)) AS timeFirstInteraction,
firstTransaction.visitId,
firstTransaction.timeFirstTransaction,
FIRST(BOOLEAN(firstInteraction.visitId = firstTransaction.visitId)) AS transactionInFirstVisit,
ROUND((TIMESTAMP_TO_SEC(firstTransaction.timeFirstTransaction) - TIMESTAMP_TO_SEC(TIMESTAMP(INTEGER(MIN(firstInteraction.visitStartTime)*1000000)))) / 60, 2) AS MinutesToTransaction
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] firstInteraction
INNER JOIN (
SELECT
fullVisitorId,
visitId,
TIMESTAMP(INTEGER(MIN(visitStartTime*1000000 + hits.time*1000))) AS timeFirstTransaction
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.type = "TRANSACTION"
GROUP BY 1, 2
) AS firstTransaction
ON firstInteraction.fullVisitorId = firstTransaction.fullVisitorId
GROUP BY 1, 4, 5
)
I left some extra fields so if you use it without the first SELECT you can see some interesting data.
Ps: Thanks Mosha for showing how to calculate the time.

Conditional pl/sql output

I need to get a value of 1 or 0 from a DB query, which in turn should do the following:
get some field value from a table
compare that field value to some literal (defined in query itself)
if value will not match the literal and query is executed in certain time period (i.e. from 9:00 AM to 10:00 AM) it should return 0, else 1
the response may include multiple rows (see below)
So far I have the following:
select instr(field, 'literal') from table_name where trunc(time) = trunc(sysdate)
which returns 1 if field from table table_name contains 'literal' (where clause checks if truncated time in table_name is equal to truncated system time).
What I can't get is how I can:
introduce a time constraint (basically, if its from 9:00 AM to 10:00 AM always return 1)
handle several response rows, meaning that if any of the response rows will return 1 then I need only 1 row with 1 value in it
Thanks in advance.
P.S.: Please comment on the question if something is left vague.
It sounds like you want a CASE statement. It would be helpful if you posted the DDL to create the table, some DML to populate the data, and the expected output. You seem to have conflicting requirements about what you want returned if the query is run between 9:00 and 10:00 AM. Initially you say "if ... query is executed in certain time period ... it should return 0, else 1", but later you say "if its from 9:00 AM to 10:00 AM always return 1". My guess is that you want something like
SELECT MAX(zero_or_one)
FROM (
SELECT (CASE WHEN to_char( sysdate, 'HH24' ) = '09'
THEN 1
WHEN instr( column_name, 'literal' ) > 0
THEN 1
ELSE 0
END) zero_or_one
FROM table_name
WHERE trunc(date_column) = trunc(sysdate)
)
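The logic of this query can be sketched in plain Python to see how the CASE branches and the outer MAX interact. This is an illustration of the semantics, not Oracle code; the function name is made up, and it assumes the field values are non-null strings from rows already filtered to today's date.

```python
from datetime import datetime


def zero_or_one(fields, literal, now):
    """Mirror SELECT MAX(CASE ...) over the filtered rows.

    Returns 1 if it is the 09:00 hour or any field contains the literal,
    0 if rows exist but none match, and None when there are no rows
    (MAX over an empty set is NULL).
    """
    flags = [1 if now.hour == 9 or literal in field else 0 for field in fields]
    return max(flags) if flags else None


afternoon = datetime(2024, 1, 1, 14, 0)
morning = datetime(2024, 1, 1, 9, 30)
print(zero_or_one(["abc", "x literal x"], "literal", afternoon))
print(zero_or_one(["abc"], "literal", afternoon))
print(zero_or_one(["abc"], "literal", morning))
```

Taking the maximum over the per-row flags is what collapses several rows into a single value, so one matching row is enough to return 1.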
