I'm using Neo4j / Cypher to store and retrieve some data based on a graph model.
Let's suppose the following model: I have a set of nodes (type=child) that are connected through a relationship (type=CONNECTED_TO).
C1 -[:CONNECTED_TO]-> C2 -[:CONNECTED_TO]-> C3 -[:CONNECTED_TO]-> C4
If I want to query a path starting from C1 to C4 without knowing the intermediates:
MATCH p=
(a:child {id:'c1Id'}) -[:CONNECTED_TO*0..]-(z:child {id:'c4Id'})
RETURN p
So far so good.
Now suppose that each child is contained in a parent, and I want to start the query from a parent ID:
P1 -[:CONTAINS]-> C1
P2 -[:CONTAINS]-> C2
P3 -[:CONTAINS]-> C3
P4 -[:CONTAINS]-> C4
The query looks like:
MATCH p=
(a:parent {id:'p1Id'})
-[:CONTAINS]->
(cStart:child)
-[:CONNECTED_TO*0..]-
(cEnd:child)
<-[:CONTAINS]-
(z:parent {id:'p4Id'})
RETURN p
This gives me the correct result, namely the following path:
P1 -[:CONTAINS]-> C1 -[:CONNECTED_TO]-> C2 -[:CONNECTED_TO]-> C3 -[:CONNECTED_TO]-> C4 <-[:CONTAINS]- P4
What I would like to do is query this path from P1 to P4 using the child topology, but also retrieve all the parents containing the intermediate children.
How can I improve my last Cypher query to additionally return:
P2 -[:CONTAINS]-> C2
P3 -[:CONTAINS]-> C3
Is it possible? Or maybe my model design is not appropriate for this use case? In that case, how should I improve it to address this query?
Thanks!
You can use a list comprehension combined with a pattern comprehension. Note that nodes(p) includes the two parent endpoints, so nodes(p)[1..-1] keeps only the child nodes; for each of those, the pattern comprehension picks its containing parent:
MATCH p=
(a:parent {id:'p1Id'})
-[:CONTAINS]->
(cStart:child)
-[:CONNECTED_TO*0..]-
(cEnd:child)
<-[:CONTAINS]-
(z:parent {id:'p4Id'})
RETURN p,
[n IN nodes(p)[1..-1] | [(n)<-[:CONTAINS]-(par:parent) | par][0]] AS parents
I have a table event_log in Athena with logs collected from an event processing system. There are various stages in the system, and each stage processes events in sequential order. The start_time column indicates the time at which an event entered the system, and end_time is the time at which it exited. The system processes millions of events per day, and we have a year of data in the table below.
event_id  event_type  start_time  end_time
E1        TypeA       T1          T4
E2        TypeB       T2          T6
M1        TypeM       T2          T6
E3        TypeA       T3          T7
E4        TypeB       T4          T7
E5        TypeA       T5          T8
M2        TypeM       T5          T8
E6        TypeB       T6          T9
E7        TypeA       T7          T10
E8        TypeB       T8          T11
M3        TypeM       T8          T11
There is a special type of event, TypeM (marker events). I have to calculate the processing latency of these special events from these logs. From the table above, this can be achieved by filtering events of that type and computing the latency as end_time - start_time. In addition to that, I want to augment the latency with one more piece of information: the number of events that were actively being processed in the various stages of the system while the marker event was being processed.
-- sample event_log table
CREATE TABLE event_log AS
SELECT * FROM (
VALUES
('E1','TypeA', 1, 4),
('E2','TypeB', 2, 6),
('M1','TypeM', 2, 6),
('E3','TypeA', 3, 7),
('E4','TypeB', 4, 7),
('E5','TypeA', 5, 8),
('M2','TypeM', 5, 8),
('E6','TypeB', 6, 9),
('E7','TypeA', 7, 10),
('E8','TypeB', 8, 11),
('M3','TypeM', 8, 11)
) AS t (event_id, event_type, start_time, end_time)
-- filtered marker table
CREATE TABLE marker_table AS
SELECT * FROM event_log
WHERE event_type = 'TypeM'
-- Join with the filtered marker table on the marker's start and end time
SELECT mark.*, count(processed_events_in_band.event_id) AS events_processed_count
FROM event_log processed_events_in_band
JOIN marker_table mark
ON processed_events_in_band.end_time BETWEEN mark.start_time AND mark.end_time
WHERE processed_events_in_band.event_type != 'TypeM'
GROUP BY mark.event_id, mark.event_type, mark.start_time, mark.end_time
Expected result:

event_id  event_type  start_time  end_time  events_processed_count
M1        TypeM       T2          T6        2 (E1, E2)
M2        TypeM       T5          T8        4 (E2, E3, E4, E5)
M3        TypeM       T8          T11       4 (E5, E6, E7, E8)
The table is partitioned on end_time (daily), and I have been using the partitions to reduce the data scanned. A single day's data can be up to 10M rows, and the query should scale to that. The query took around 17 minutes with a marker table of 18K rows and event logs of 10M rows. There are around 2K Parquet files to scan for those 10M rows, so I don't think S3 read latency is the issue here.
How do I optimize this query? What is the best way to get this data efficiently?
To improve performance:
be aware that CREATE TABLE will write the output of the query to disk (doc). Consider using a common table expression instead:
WITH marker_table AS (
    SELECT * FROM event_log
    WHERE event_type = 'TypeM'
)
SELECT ...
try using a condition with an = sign in the join. Presto will then do a hash join, which is a lot more efficient. In your case I would try truncating the start time down and the end time up, and writing the ON condition as equality of the truncated times (see the sketch after this list)
always place the largest table to the left of the join (doc)
if you want to return the list of events being processed, and not just the count, you can use the array_agg function (doc) combined with array_distinct to produce a list of unique entries, and array_join to join it into a string
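Putting the equality-join and array_agg points together, a sketch of what the rewritten query could look like (untested; it assumes start_time/end_time are epoch seconds, and the daily 86400-second bucket is an assumption to tune to your typical marker durations):

WITH marker_table AS (
    SELECT * FROM event_log WHERE event_type = 'TypeM'
),
marker_buckets AS (
    -- expand each marker into the daily buckets it spans, so the join
    -- below can be an equality (hash) join instead of a range join
    SELECT m.event_id, m.start_time, m.end_time, bucket
    FROM marker_table m
    CROSS JOIN UNNEST(sequence(m.start_time / 86400, m.end_time / 86400)) AS t (bucket)
)
SELECT mb.event_id,
       mb.end_time - mb.start_time AS latency,
       count(e.event_id) AS events_processed_count,
       array_join(array_distinct(array_agg(e.event_id)), ', ') AS events_processed
FROM event_log e                  -- largest table on the left of the join
JOIN marker_buckets mb
  ON e.end_time / 86400 = mb.bucket                       -- equality: enables a hash join
 AND e.end_time BETWEEN mb.start_time AND mb.end_time     -- exact range check
WHERE e.event_type != 'TypeM'
GROUP BY mb.event_id, mb.start_time, mb.end_time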
I have a table that has a column like this:
table1:

c1   c2   c3
.    a    .
.    a    .
.    a    .
.    a    .
.    b    .
.    b    .
.    c    .
How do I get a result like the following?

a          b          c
count(a)   count(b)   count(c)
Of course, there is an auxiliary table like the one below:
--field table
d1 d2
a
b
c
Transferring comments into an answer.
If there were an entry in table1.c2 with d as the value, is it correct to guess/assume that you'd want a fourth column of output, with d as the name and the count of the number of d values as the value? And there'd be an extra row in the auxiliary table too. That's pretty tricky.
You'd probably be better off with a result table with N rows, one for each value in the table1.c2 column, with the first column identifying the value and the second the count:
SELECT c2, COUNT(c2) FROM table1 GROUP BY c2 ORDER BY c2
To generate a single row with the names and counts as shown requires a dynamically built SQL statement — you write an SQL statement that generates the SQL (or the key components of the SQL) for a second statement that you actually execute to get the result. The main reason for it being dynamic like that is that the number of columns in the result set is not known until you run a query that determines which values exist in table1.c2. That's non-trivial — doable, but non-trivial.
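For illustration, a minimal generator sketch (hypothetical; it assumes derived-table support, which 11.50 has): each output row is one column expression of the final pivot query. Stitch the rows together with commas, wrap them in a SELECT ... FROM dual, and execute that as the second statement.

-- generator: one output row per distinct value in table1.c2
SELECT '(SELECT COUNT(*) FROM Table1 WHERE c2 = ''' || c2 || ''') AS ' || c2
  FROM (SELECT DISTINCT c2 FROM Table1) AS v;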
I forget whether 11.50 has a built-in sysmaster:sysdual table. I ordinarily use a regular one-column, one-row table called dual. You can get the result you want, if your Table1.C2 has values a through e in it, with:
SELECT (SELECT COUNT(*) FROM Table1 WHERE c2 = 'a') AS a,
(SELECT COUNT(*) FROM Table1 WHERE c2 = 'b') AS b,
(SELECT COUNT(*) FROM Table1 WHERE c2 = 'c') AS c,
(SELECT COUNT(*) FROM Table1 WHERE c2 = 'd') AS d,
(SELECT COUNT(*) FROM Table1 WHERE c2 = 'e') AS e
FROM dual;
This gets the information you need. I don't think it is elegant, but "works" beats "doesn't work".
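Under the same assumption (values a through e), a single-scan alternative using conditional aggregation may be worth trying; COUNT ignores NULLs, so each CASE expression counts only its own value:

SELECT COUNT(CASE WHEN c2 = 'a' THEN 1 END) AS a,
       COUNT(CASE WHEN c2 = 'b' THEN 1 END) AS b,
       COUNT(CASE WHEN c2 = 'c' THEN 1 END) AS c,
       COUNT(CASE WHEN c2 = 'd' THEN 1 END) AS d,
       COUNT(CASE WHEN c2 = 'e' THEN 1 END) AS e
  FROM Table1;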
Need help. I am putting my requirement in simple steps here. I have data like below.
with x as (
select 'a=x AND b=y AND c=z' C1, 'a' C2, '100' C3 from dual union all
select 'a=x AND b=y AND c=z' C1, 'b' C2, '200' C3 from dual union all
select 'a=x AND b=y AND c=z' C1, 'c' C2, '300' C3 from dual union all
select 'a=x AND d=y AND c=z' C1, 'd' C2, '400' C3 from dual union all
select 'a=x AND e=y AND c=z' C1, 'e' C2, '500' C3 from dual
)
select * from x;
My output looks like below:
C1 C2 C3
------------------------------
a=x AND b=y AND c=z a 100
a=x AND b=y AND c=z b 200
a=x AND b=y AND c=z c 300
a=x AND d=y AND c=z d 400
a=x AND e=y AND c=z e 500
I am looking for a query to get the output below. I have a condition in one column (C1), and I have lookup data in the same table in different columns (C2 and C3). I want to replace each value in C1 that matches a string in column C2 with the corresponding value from column C3.
100=x AND 200=y AND 300=z a 100
100=x AND 200=y AND 300=z b 200
100=x AND 200=y AND 300=z c 300
100=x AND 400=y AND 300=z d 400
100=x AND 500=y AND 300=z e 500
My exact requirement is: I have a table with a column containing a WHERE condition (C1) like the above. The conditions use business column names; a second column holds the business name (C2), and a third column holds the actual physical column name in the DB (C3), all in the same table. I am looking for a query that replaces the business names in C1 by matching column C2 and substituting the corresponding value from column C3.
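One way to approach this, sketched against the sample data above (not a tested solution): number the lookup rows, then apply the C2 -> C3 replacements one per step with a recursive subquery factoring clause (Oracle 11gR2+), keeping the rows once every replacement has been applied. Appending '=' to the search string anchors the match to the token before the = sign; real business names may need stricter boundary handling (e.g. REGEXP_REPLACE):

WITH x AS (
  SELECT 'a=x AND b=y AND c=z' c1, 'a' c2, '100' c3 FROM dual UNION ALL
  SELECT 'a=x AND b=y AND c=z' c1, 'b' c2, '200' c3 FROM dual UNION ALL
  SELECT 'a=x AND b=y AND c=z' c1, 'c' c2, '300' c3 FROM dual UNION ALL
  SELECT 'a=x AND d=y AND c=z' c1, 'd' c2, '400' c3 FROM dual UNION ALL
  SELECT 'a=x AND e=y AND c=z' c1, 'e' c2, '500' c3 FROM dual
),
lookups AS (
  -- number the replacement rules so the recursion applies one per step
  SELECT c2, c3, ROW_NUMBER() OVER (ORDER BY c2) AS rn FROM x
),
rewritten (orig_c1, new_c1, rn) AS (
  SELECT c1, c1, 0 FROM (SELECT DISTINCT c1 FROM x)
  UNION ALL
  -- e.g. 'a=' -> '100='; the trailing '=' anchors the match
  SELECT r.orig_c1, REPLACE(r.new_c1, l.c2 || '=', l.c3 || '='), r.rn + 1
  FROM rewritten r
  JOIN lookups l ON l.rn = r.rn + 1
)
SELECT r.new_c1, x.c2, x.c3
FROM x
JOIN rewritten r ON r.orig_c1 = x.c1
WHERE r.rn = (SELECT MAX(rn) FROM lookups);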
I want to randomly select a row from a table where multiple minimum values can exist in the column Number. For example, I have a table containing titles (Ti), numbers (Nu), and categories (Ca) like so:
Ti Nu Ca
A 0 c7
W 0 c7
Y 0 c7
C 0 c9
H 3 c9
This query will return a random row where Number equals 0 AND Ca equals c7:
SELECT * FROM Table WHERE ((Ca = 'c7') And Nu = ( SELECT min(Nu) FROM Table )) ORDER BY RANDOM() LIMIT 1;
But when the Table contains:
Ti Nu Ca
A 3 c7
W 1 c7
Y 5 c7
C 0 c9
H 3 c9
The above query does not return anything. I would expect the row "W 1 c7" being returned. What am I doing wrong?
Your sub-query always matches the global minimum, Nu = 0, regardless of what Ca is, and since there is no row where Nu = 0 and Ca = 'c7', the query returns nothing. You'll probably need a correlated sub-query like this:
SELECT * FROM [Table] x
WHERE Ca = 'c7' And Nu = (SELECT min(Nu) FROM [Table] where x.Ca = Ca)
ORDER BY RANDOM() LIMIT 1;
If you add another row containing Ti="Z", Nu=1, Ca="c7" then you should see the Ti value flip between "W" and "Z" in the returned row.
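For example, a quick sanity check of the corrected query (hypothetical data; after this insert, W and Z tie for min(Nu) = 1 in category c7, so repeated runs should return either row at random):

INSERT INTO [Table] (Ti, Nu, Ca) VALUES ('Z', 1, 'c7');

SELECT * FROM [Table] x
WHERE Ca = 'c7' And Nu = (SELECT min(Nu) FROM [Table] where x.Ca = Ca)
ORDER BY RANDOM() LIMIT 1;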
Suppose 1,000,000 records arranged as:
c1_v1 c2_v1 c3_v1 d1
c1_v1 c2_v1 c3_v2 d2
c1_v1 c2_v1 c3_v3 d3
...
c1_v1 c2_v2 c3_v1 d999
c1_v1 c2_v2 c3_v2 d1000
...
c1_v999 c2_v999 c3_v998 d999999
c1_v999 c2_v999 c3_v999 d1000000
Say that we need all three conditions (c1_vx, c2_vx, c3_vx) to query the result (dx), but a single condition value such as c1_v1 may be the same across different records. An alternative way to lay out the records:
c1_v1
c2_v1
c3_v1 : d1
c3_v2 : d2
c3_v3 : d3
...
c2_v2
c3_v1 : d999
c3_v2 : d1000
...
c1_v999
c2_v999
c3_v998: d999999
c3_v999 : d1000000
How should the tables be designed for the fastest queries? (Just queries; I don't care about insert/update/delete performance.)
Thanks!
A typical query operation is like select d from t_table where c1 = 'UA1000_2048X32_MCSYN' and c2 = '1.234' and c3 = '2.345';
Well, then you just need a composite index on {c1, c2, c3}.
Ideally, you'd also cluster the table, so retrieving d involves just an index seek without a table heap access. SQLite approximates clustering with WITHOUT ROWID tables (3.8.2+); alternatively, consider creating a covering index on {c1, c2, c3, d} instead (see the sketch below).
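For concreteness, a sketch of both options (the index name is made up; the column types follow the comment below about c1 being a string and c2/c3 being doubles):

-- Option 1: a covering index, so the query is answered from the index alone
CREATE INDEX idx_t_c1_c2_c3_d ON t_table (c1, c2, c3, d);

-- Option 2 (SQLite 3.8.2+): a WITHOUT ROWID table stores rows in primary-key
-- order, which behaves like a clustered index on (c1, c2, c3)
CREATE TABLE t_table (
    c1 TEXT NOT NULL,
    c2 REAL NOT NULL,
    c3 REAL NOT NULL,
    d  TEXT,
    PRIMARY KEY (c1, c2, c3)
) WITHOUT ROWID;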
c1 is a string like UA1000_2048X32_MCSYN; c2 and c3 are real (double) numbers.
I'd refrain from trying to equate numbers with strings in your query - some DBMSes can't use an index in these situations, and SQLite might be one of them. Instead, just write the query in the most natural way, without single quotes around the number literals:
select d from t_table
where c1 = 'UA1000_2048X32_MCSYN' and c2 = 1.234 and c3 = 2.345;
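To check that the index is actually used, EXPLAIN QUERY PLAN helps (output varies by SQLite version; the index name refers to the sketch above):

EXPLAIN QUERY PLAN
select d from t_table
where c1 = 'UA1000_2048X32_MCSYN' and c2 = 1.234 and c3 = 2.345;
-- expect something like:
-- SEARCH t_table USING COVERING INDEX idx_t_c1_c2_c3_d (c1=? AND c2=? AND c3=?)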