How to visualize concurrent event data on some kind of timeline - bigdata

I have a large CSV data file with over 5 million records. Each record contains a Date Time Began and a Date Time Ended.
Here is an example of what the data looks like:
2019-08-06 16:07:25,2019-08-06 16:07:42
2019-08-06 17:21:42,2019-08-06 17:21:59
2019-08-06 15:43:03,2019-08-06 15:43:20
2019-08-06 13:48:13,2019-08-06 13:48:30
2019-08-06 16:18:56,2019-08-06 16:19:13
2019-08-06 14:34:10,2019-08-06 14:34:27
2019-08-06 16:59:47,2019-08-06 17:00:04
2019-08-06 16:14:57,2019-08-06 16:15:14
2019-08-06 13:04:38,2019-08-06 13:04:55
2019-08-06 16:09:28,2019-08-06 16:09:45
My goal is to visualize the data so I can identify the times with the highest number of active concurrent connections. Ideally, the narrower the time interval we can look at, the better. The data spans about two months.
Can anyone suggest an approach I can use to tackle this?
I have tried using Python to loop through the entire file for each record to identify how many concurrent connections there were. It worked well in a small-scale test with a few thousand records, but with over 5 million records that approach won't scale: comparing every record against every other one means looping through 5 million rows, 5 million times.

I'll address only the central problem: how to transform your data to show the degree of parallelism of the transactions.
I use (Oracle) SQL to demonstrate the idea.
Let's assume this is your data:
select * from tab;
START_DT END_DT
------------------- -------------------
06.08.2019 16:07:25 06.08.2019 16:07:42
06.08.2019 16:07:30 06.08.2019 16:07:35
06.08.2019 16:07:33 06.08.2019 16:07:50
06.08.2019 16:07:50 06.08.2019 16:07:55
In the first step you split each row into two parts (using UNION ALL).
The first part identifies the start of the transaction and increases the count of active transactions by 1.
The second part identifies the end of the transaction and decreases the count of active transactions by 1.
As the timestamp you use either the start or the end date of the transaction.
select start_dt trans_dt ,1 trans_cnt from tab union all
select end_dt trans_dt ,-1 trans_cnt from tab
order by 1;
TRANS_DT TRANS_CNT
------------------- ----------
06.08.2019 16:07:25 1
06.08.2019 16:07:30 1
06.08.2019 16:07:33 1
06.08.2019 16:07:35 -1
06.08.2019 16:07:42 -1
06.08.2019 16:07:50 1
06.08.2019 16:07:50 -1
06.08.2019 16:07:55 -1
With the data prepared this way, you only need to accumulate the transaction count, which is done with the aggregate function SUM in its analytic form, ordered by the transaction timestamp. More on that later.
Finally, you consolidate the case where more than one transaction starts or ends at the same timestamp: GROUP BY the timestamp and take the MAX of the parallel degree.
The full query:
with trans as (
select start_dt trans_dt ,1 trans_cnt from tab union all
select end_dt trans_dt ,-1 trans_cnt from tab),
trans_cum as (
select trans_dt,trans_cnt,
sum(trans_cnt) over (order by trans_dt, trans_cnt) parallel_trans_cnt
from trans)
select trans_dt, max(parallel_trans_cnt) parallel_trans_cnt
from trans_cum
group by trans_dt
order by 1;
Result:
TRANS_DT PARALLEL_TRANS_CNT
------------------- ------------------
06.08.2019 16:07:25 1
06.08.2019 16:07:30 2
06.08.2019 16:07:33 3
06.08.2019 16:07:35 2
06.08.2019 16:07:42 1
06.08.2019 16:07:50 1
06.08.2019 16:07:55 0
This data can be easily visualized with an ordinary x,y plot.
Some care should be taken when one transaction ends in the same second as the next one starts.
Were two transactions active during that second, or only one?
The query above assumes the latter; if you want the former, simply adjust the accumulation logic to
sum(trans_cnt) over (order by trans_dt, trans_cnt DESC) parallel_trans_cnt
Within the same second you will then first count the starting transactions (which increase the parallel degree) and only then the ending ones.
You may implement this logic in any programming language, but note that with SQL 1) the solution is very concise, 2) you are not limited by main memory, and 3) you get parallel processing out of the box.
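If two months of second-level data produce too many points to plot at once, you can bucket the accumulated counts to a coarser grid and plot the peak per bucket. A minimal sketch, assuming Oracle (TRUNC(date, 'MI') truncates to the minute; pick the bucket size to match your zoom level):
with trans as (
select start_dt trans_dt ,1 trans_cnt from tab union all
select end_dt trans_dt ,-1 trans_cnt from tab),
trans_cum as (
select trans_dt, trans_cnt,
sum(trans_cnt) over (order by trans_dt, trans_cnt) parallel_trans_cnt
from trans)
-- peak parallelism per one-minute bucket
select trunc(trans_dt, 'MI') minute_dt, max(parallel_trans_cnt) parallel_trans_cnt
from trans_cum
group by trunc(trans_dt, 'MI')
order by 1;
The same sweep-line idea (sort the +1/-1 events, take a cumulative sum) is a single O(n log n) pass in any language, so even a plain Python implementation avoids the 5-million-times-5-million loop from the question.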

Related

divide counts in one column where a condition is met

I am trying to determine the on-time delivery rate of orders.
The column of interest is Delivery on time, which contains 0 (not on time) or 1 (on time). How can I calculate, in SQL, the on-time rate for each person? Basically, count the number of 0s over the total count (0s and 1s) for each person, and the same for on time (count of 1s over the total count)?
Here's a data example:
Week  Delivery on time  Person
----  ----------------  ------
1     0                 sARAH
1     0                 sARAH
1     1                 sARAH
2     1                 vIC
2     0                 Vic
You may aggregate by person, and then take the average of the on time statistic:
SELECT Person, AVG(1.0*DeliveryOnTime) AS OnTime,
AVG(1.0 - DeliveryOnTime) AS NotOnTime
FROM yourTable
GROUP BY Person;
The above is SQL Server syntax and might have to change slightly depending on your actual database, which you did not reveal to us.
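For the sample data, and assuming a case-insensitive collation (the SQL Server default) so that 'vIC' and 'Vic' land in the same group, the query would return approximately:
Person OnTime   NotOnTime
------ -------- ---------
sARAH  0.333333 0.666667
vIC    0.500000 0.500000
Under a case-sensitive collation, 'vIC' and 'Vic' would count as two different persons, so watch the consistency of the Person column.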

Selecting max value across columns as opposed to across rows

I am attempting to select the max value from several separate columns, per dimension listed in a row, like so:
Input Dataset
Person|Date#1 |Date#2 |Date#3 |Date#4
------+--------+--------+--------+---------
Matt |12/01/18|01/15/19|02/15/19|04/15/18
Dave |01/15/18|01/02/19|03/15/19|11/01/19
Desired result:
Person|Max Date|
------+--------+
Matt |02/15/19|
Dave |11/01/19|
Once you fix up your dates to a proper format like YYYY-MM-DD, so the table looks like this:
Person Date#1 Date#2 Date#3 Date#4
---------- ---------- ---------- ---------- ----------
Matt 2018-12-01 2019-01-15 2019-02-15 2018-04-15
Dave       2018-01-15 2019-01-02 2019-03-15 2019-11-01
it becomes trivial:
SELECT Person, max("Date#1", "Date#2", "Date#3", "Date#4") AS "Max Date" FROM mytable;
Person Max Date
---------- ----------
Matt 2019-02-15
Dave 2019-11-01
Remember, SQLite does not have dedicated date or time types. It uses strings or numbers to hold those values. When storing dates as strings, they have to be formatted in a way that can be compared meaningfully. '04/15/18' is greater than '01/15/19' because the character 4 is greater than the character 1. None of the standard time string formats have that problem.
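If you are not on SQLite: the variadic max() used above is SQLite-specific, but MySQL, Oracle, and PostgreSQL offer GREATEST() for the same purpose (identifier quoting and NULL handling vary by engine, so treat this as a sketch):
SELECT Person, GREATEST("Date#1", "Date#2", "Date#3", "Date#4") AS "Max Date" FROM mytable;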

Count rows until you get to the current owning team value... Kusto, countof()

I have this Kusto code that I have been trying to develop, and any help would be greatly appreciated.
The objective is to count rows up to the first occurrence of the CurrentOwningTeam value in the OwningTeamId column, using Kusto (Application Insights).
I packed the owning team number and parsed the value into a column of its own. I need to count the owning teams until I get to the current owning team.
Example columns:
[CODE]
OwningTeamId  CurrentOwningTeam  CreateDate  RequestType
155523        888888             2017-07-02  PRIMARY
256924        888888             2017-08-02  TRANSFER
888888        888888             2017-09-02  TRANSFER
954005        888888             2017-10-02  TRANSFER
888888        888888             2017-11-02  TRANSFER
155523        888888             2017-12-02  TRANSFER
954005        888888             2017-13-02  TRANSFER
888888        888888             2017-14-02  TRANSFER
[/CODE]
I think you can match the current owning team with the countof() function, but I don't know how to go about it using regex. The values differ for each owning team on every incident, which is why I capture the owning team on the incident first and then try to count to the very first instance of the CurrentOwningTeam number in the OwningTeamId column. In other words, I want to count the number of rows it takes to get to the very first owning team; in this case, it would be three.
Note: OwningTeamId and CurrentOwningTeam can change on every incident; I first capture the CurrentOwningTeam and then try to match it in the OwningTeamId column.
Note: this is just one incident, but I am trying to handle multiple incidents.
Below is how I got the Current Owning Team Value.
[CODE]
| extend CurrentOwningTeam=pack_array(OwningTeamId)
| parse CurrentOwningTeam with * "[" CurrentOwningTeam:int "]" *
| serialize CurrentOwningTeam
[/CODE]
I tried using row_number() but it will not work for multiple incidents, only per incident, so I have to use count or countof functions or another way of doing it.
Thanks for the clarification. Here is a suggestion for a query that counts time-ordered rows until a certain condition is reached (the count is scoped per IncidentId key).
datatable(IncidentId:string, OwningTeamId:string, CurrentOwningTeam:string, CreateDate:datetime, RequestType:string)
[
'Id1','155523','888888',datetime(2017-02-07),'PRIMARY',
'Id1','256924','888888',datetime(2017-02-08),'TRANSFER',
'Id1','888888','888888',datetime(2017-02-09),'TRANSFER',
'Id1','954005','888888',datetime(2017-02-10),'TRANSFER',
'Id1','888888','888888',datetime(2017-02-11),'TRANSFER',
'Id1','155523','888888',datetime(2017-02-12),'TRANSFER',
'Id1','954005','888888',datetime(2017-02-13),'TRANSFER',
'Id1','888888','888888',datetime(2017-02-14),'TRANSFER',
// Id2
'Id2','155523','888888',datetime(2017-02-07),'PRIMARY',
'Id2','256924','888888',datetime(2017-02-08),'TRANSFER',
'Id2','999999','888888',datetime(2017-02-09),'TRANSFER',
'Id2','954005','888888',datetime(2017-02-10),'TRANSFER',
'Id2','888888','888888',datetime(2017-02-11),'TRANSFER',
'Id2','155523','888888',datetime(2017-02-12),'TRANSFER',
'Id2','954005','888888',datetime(2017-02-13),'TRANSFER',
'Id2','888888','888888',datetime(2017-02-14),'TRANSFER',
]
| order by IncidentId, CreateDate asc
| extend c= row_cumsum(1, IncidentId!=prev(IncidentId))
| where OwningTeamId == CurrentOwningTeam
| summarize arg_min(CreateDate, c) by IncidentId
Result:
IncidentId CreateDate c
Id1 2017-02-09 00:00:00.0000000 3
Id2 2017-02-11 00:00:00.0000000 5
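The key step is row_cumsum(1, IncidentId != prev(IncidentId)), which numbers the time-ordered rows 1..n within each incident, restarting whenever the IncidentId changes. Filtering to rows where OwningTeamId == CurrentOwningTeam and taking arg_min(CreateDate, c) then returns, per incident, the row number of the earliest match.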
Here are links to the docs for the arg_min() aggregation, which finds the earliest record, and for the row_cumsum() (cumulative sum) function:
https://learn.microsoft.com/en-us/azure/kusto/query/arg-min-aggfunction
https://learn.microsoft.com/en-us/azure/kusto/query/rowcumsumfunction
I figured it out by using the row number directly in a grouping inside the table, then finally summing to get my total count.
[CODE]
| serialize Id
| extend RowNumber=row_number(1, (Id) ==Id)
| summarize TotalOwningTeamChanges=sum(RowNumber) by Id
[/CODE]
After that I got the minimum date to restrict the data set to everything up to the first instance of the current OwningTeamName.
[CODE]
//Outside the scope of the table.
| extend ExtractFirstOwningTeamCreateDate=CreateDate2
| extend VeryFirstOwningTeamCreateDate=MinimumCreateDate
| where FirstOwningTeamRow == true or MinimumCreateDate <= ExtractFirstOwningTeamCreateDate
| serialize VeryFirstOwningTeamCreateDate
[/CODE]

Use Oracle PARTITION BY and OVER clause to retrieve section numbers

My Table comprises 4 Columns (Patient, Sample, Analysis and Component). I am trying to write a query that will look at the combination of Patient, Analysis and Component for each record and assign a "Section Number".
The numbering should re-start for every patient.
See the expected output below. Patient 1010 has 3 samples, but all have the same Analysis-Component combination, hence they all get the same section (1).
The counting then restarts for patient 2020. This patient has 2 samples with different Analysis-Component combinations, hence they are placed in separate sections 1 and 2.
Patient Sample Analysis Component Section Number
_______ ______ ________ _________ ______________
1010 720000140249 CALC Calcium 1
1010 720000140288 CALC Calcium 1
1010 720000140288 CALC Calcium 1
2020 720000190504 ALB Albumin 1
2020 720000160504 ALB Albumin Pct 2
3030 720000134568 CALC Calcium 1
3030 720000123404 ALB Albumin 2
3030 720000160765 ALB Albumin Pct 3
I have written the following query, but all it does is group samples with the same Component into one section. It does not consider the Patient or Analysis at all.
Your help is much appreciated (as always!)
select
    x.patient, x.sample_number, x.analysis, x.component,
    a.myRowCount
from
    X_PREV_PAT_RESULTS x inner join (
        select distinct
            x1.COMPONENT
            , ROW_NUMBER() OVER (ORDER BY x1.COMPONENT) myRowCount
        from X_PREV_PAT_RESULTS x1
        group by x1.patient ) A on x.COMPONENT = A.COMPONENT
order by a.myRowCount, x.patient;
My guess is that you want
dense_rank() over (partition by patient
order by analysis desc, component) myRowCount
What happens with rows after a tie? If patient 1010 also got an ALB analysis, would that row have a myRowCount of 2 or 4? rank() would return 4; dense_rank() would return 2.
How are you determining the order of rows for a particular patient? It appears that you're going in reverse alphabetical order for analysis and then alphabetically for component, which seems like a pretty unusual ordering.
select x.patient, x.sample_number, x.analysis, x.component,
       dense_rank() over (partition by x.patient
                          order by x.analysis desc, x.component) section_number
from X_PREV_PAT_RESULTS x
where exists (select 1 from X_PREV_PAT_RESULTS x1 where x1.COMPONENT = x.COMPONENT);
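With the descending analysis order (matching the first snippet), this reproduces the section numbers from the question's expected output:
Patient Sample       Analysis Component   Section Number
------- ------------ -------- ----------- --------------
1010    720000140249 CALC     Calcium     1
1010    720000140288 CALC     Calcium     1
1010    720000140288 CALC     Calcium     1
2020    720000190504 ALB      Albumin     1
2020    720000160504 ALB      Albumin Pct 2
3030    720000134568 CALC     Calcium     1
3030    720000123404 ALB      Albumin     2
3030    720000160765 ALB      Albumin Pct 3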

SQL Server - Group by, having and count in a mix

I have a database with a long list of records. Most of the columns have foreign keys to other tables.
Example:
ID SectorId BranchId
-- -------- --------
5 3 5
And then I have tables with sectors, branches, etc.
My issue:
I want to know how many records have sector 1, 2, 3 ... n. So what I want is a GROUP BY on Sector and then some COUNT(*) that tells me how many records there are for each.
Expected output
So for instance, if I have 20 records the result might look like this:
SectorId Count
-------- -----
1 3
2 10
3 4
4 6
My attempts so far
I do not normally work a lot with databases and I have been trying to solve this for 1.5 hours. I have tried something like this:
SELECT COUNT(*)
FROM Records r
GROUP BY r.Sector
WHERE r.Date BETWEEN '2011-01-01' AND '2011-12-31'
But... errors and problems all over!
I would really appreciate some help. I do know this is probably very simple.
Thanks!
The sequence of clauses in your query is not correct; it should be like this:
SELECT COUNT(*)
FROM Records r
WHERE r.Date BETWEEN '2011-01-01' AND '2011-12-31'
GROUP BY r.Sector
The output will contain only the counts, i.e.:
count
-----
3
10
4
6
If you want to fetch both the sector and the count, then you need to modify the query a little:
SELECT r.Sector, COUNT(*) as Count
FROM Records r
WHERE r.Date BETWEEN '2011-01-01' AND '2011-12-31'
GROUP BY r.Sector
The output will be like this:
Sector Count
------ -----
1      3
2      10
3      4
4      6
Your query was partially right, but it needs some modification. If I write it this way:
SELECT r.SectorID,COUNT(*) AS count
FROM Records r
WHERE r.Date BETWEEN '2011-01-01' AND '2011-12-31'
GROUP BY r.SectorID
Then the output will be:
SectorID Count
-------- -----
1        3
2        10
3        4
4        6
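The question title also mentions HAVING, which neither answer needed: WHERE filters rows before grouping, while HAVING filters the groups afterwards. For example, to keep only sectors with more than 5 records in the period (the threshold is just an illustration):
SELECT r.SectorID, COUNT(*) AS Count
FROM Records r
WHERE r.Date BETWEEN '2011-01-01' AND '2011-12-31'
GROUP BY r.SectorID
HAVING COUNT(*) > 5;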
