MarkLogic XQuery finding all timestamps within 1 minute of each other - xquery

Sorry for this rather specific use case.
I have a sequence of documents that all have a timestamp field:
2021-12-15T03:06:04Z
2021-12-15T03:06:14Z
2021-12-15T03:06:24Z
2021-12-15T03:06:34Z
2021-12-15T03:06:44Z
2021-12-15T03:07:04Z
2021-12-15T03:17:04Z
My aim is to identify which documents are all within 1 minute of each other, and delete all documents except one of those (so we only have 1 document per 60-second interval). Which document is kept is not important.
Are there any dateTime functions or functx functions I could leverage to tackle this elegantly? A big caveat of the data is that the timestamps are random; there are no patterns in the timestamps coming back that we could build off of. There are also around 1k documents that this needs to be run on every 3 hours, so performance is also an issue.
Thank you in advance to anyone who replies.

You can group them by the dateTime formatted to minute precision, use that as the key for a map, and put each value under its key. At the end, there will only be one entry per minute.
let $dates := ("2021-12-15T03:06:04Z",
"2021-12-15T03:06:14Z",
"2021-12-15T03:06:24Z",
"2021-12-15T03:06:34Z",
"2021-12-15T03:06:44Z",
"2021-12-15T03:07:04Z",
"2021-12-15T03:17:04Z")!xs:dateTime(.)
let $dates-by-minute := map:map()
let $_group :=
for $date in $dates
let $key := fn:format-dateTime($date, "[Y01]/[M01]/[D01] [H01]:[m01]")
return map:put($dates-by-minute, $key, $date)
return
map:keys($dates-by-minute) ! map:get($dates-by-minute, .)

Related

How to access unaggregated results when aggregation is needed due to dataset size in R

My task is to get total inbound leads for a group of customers, leads by month for the same group of customers, and the conversion rate of those leads.
The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting the total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                         COUNT(*)
                                         from table
                                         GROUP BY org_id
                                         ORDER BY org_id")
DOMO is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument (which is a MySQL query) and various others which aren't important right now.
Sample data looks like this:
org_id, inserted_at, lead_converted_at
1 10/17/2021 2021-01-27T03:39:03
2 10/18/2021 2021-01-28T03:39:03
1 10/17/2021 2021-01-28T03:39:03
3 10/19/2021 2021-01-29T03:39:03
2 10/18/2021 2021-01-29T03:39:03
I have looked through many aggregation tutorials online, but none of them seem to cover how to get data that is needed pre-aggregation (such as the number of leads per month per org, which isn't possible once the aggregation has occurred, because in the sample above the aggregation would remove the ability to see more than one instance of org_id 1, for example) from a dataset that needs to be aggregated just to be accessed in the first place. Maybe I just don't understand this well enough to know the right questions to ask. Any direction appreciated.
If you're unable to fit your data in memory, you have a few options. You could process the data in batches (e.g. one year at a time) so that it fits in memory, or use a package like chunked to help.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column to the month and group by org_id and month.
To get the conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when conversion_date is not null then 1 else 0 end) as convert_count
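For illustration, here is a rough sketch of what that combined query might look like in MySQL, using the column names from the sample data above (inserted_at, lead_converted_at) and the placeholder table name from the question; adjust names and date handling to your actual schema.
-- Sketch only: monthly lead counts and conversion rate per org.
-- Assumes inserted_at is a DATE/DATETIME column; names come from the sample data.
SELECT org_id,
       DATE_FORMAT(inserted_at, '%Y-%m') AS lead_month,
       COUNT(*) AS lead_count,
       SUM(CASE WHEN lead_converted_at IS NOT NULL THEN 1 ELSE 0 END) AS convert_count,
       SUM(CASE WHEN lead_converted_at IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS conversion_rate
FROM table
GROUP BY org_id, lead_month
ORDER BY org_id, lead_month
This returns one already-aggregated row per org per month, which is small enough to pull into R and reshape as needed.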

Sample Data in Cosmos (Return every 3rd data point)

We are looking to pull a few parameters off a few thousand Cosmos documents, arrange these, and return them to the client for graphing.
The trouble is the number of points quickly gets into the tens of thousands, which is much more resolution than the front end needs to spot trends. It's also SLOW.
While I can easily return all the data to the API and then sample the data there, it would seem to be more efficient to just skip every 3rd data point or so.
Is there a way to do something like .Skip(x => x % 3 == 0) in cosmos?
It would be doable in a stored procedure (don't include in the results any rows whose row number is not divisible by the sampling frequency).
Looking at the docs, there do not seem to be any ranking functions (not even a row_number()) that could do it in pure SQL.

Is there a way of looking up a datetime in a datetime reference table and returning corresponding data

I have been searching for the answer to this for a while, but with no joy. Hoping you DAX geniuses can help out!
I have a table of transactional data with a datetime column (in the format "dd/mm/yyyy hh:mm:ss").
I want to look this datetime up in a separate 'shift reference' table to add a new column to my transactional data, i.e. if it falls between two datetimes (which it always will), Start Time and End Time, there will be a corresponding shift associated with it.
The format of this table is
Start time - End Time - Shift Pattern
In this table we have the datetime (in the same format as before) the shift started ("Start_Time"), when it ended ("End_Time"), and which 'Shift' was working. I want to use my transactional datetime to look up which shift was on when the transaction took place.
I've tried combinations of LOOKUPVALUE/CALCULATE/MAX, and on some occasions it has returned values, but never correct ones!
I hope this makes sense!
Best Regards,
Colin
You can use this code to add a calculated column with the Shift value looked up based on the transaction timestamp.
Shift = CALCULATE (
    VALUES ( Shift[SHIFT_PATTERN] ),
    FILTER (
        Shift,
        Shift[start_time] <= [Timestamp] && Shift[end_time] > [Timestamp]
    )
)
Just to show another option, it is also possible to calculate transaction facts sliced by Shift without adding a column to the fact table. You can handle it in measures like the one below.
Transaction Count =
SUMX (
    Shift,
    COUNTROWS (
        FILTER (
            'Transaction',
            'Transaction'[Timestamp] >= Shift[start_time]
                && 'Transaction'[Timestamp] < Shift[end_time]
        )
    )
)

VB or macro to exclude periods of time from time duration calculation in Excel

I have an Excel table which contains thousands of incident tickets. Each ticket typically carries over a few hours or a few days, and I usually calculate the total duration by subtracting the opening date and time from the closing date and time.
However, I would like to exclude, and not count, out-of-office hours (night time), weekends and holidays.
I have therefore created two additional reference tables, one of which contains the non-working hours (e.g. every day from 7 pm until 7 am, all day Saturday and Sunday, and a list of public holidays).
Now I need to find some sort of VB macro that would automatically calculate each ticket's "real duration" by removing from the total ticket time any time that falls within that list.
I had a look around this website and other forums; however, I could not find what I am looking for. If someone can help me achieve this, I would be extremely grateful.
Best regards,
Alex
You can use the NETWORKDAYS function to calculate the number of working days in the interval. You actually seem to be perfectly set up for it: it takes a start date, an end date, and a reference to a range of holidays. By default it counts all non-weekend days as working days.
For calculating the intraday time, you will need some additional magic. Assuming that tickets are only opened and closed during business hours, it would look like this:
first_day_hrs := dayend - ticketstart
last_day_hrs := ticketend - daystart
inbetween_hrs := (NETWORKDAYS(ticketstart, ticketend, rng_holidays) - 2) * (dayend - daystart)
total_hrs := first_day_hrs + inbetween_hrs + last_day_hrs
Of course, the names should in reality refer to Excel cells. I recommend using lists and/or names.

ORDER BY timestring as time instead of a string

I have a table with a column of times such as
| time|
|=====|
| 9:20|
|14:33|
| 7:35|
In my query, I have ORDER BY time, but it sorts the times as strings, so the result is ordered as
|14:33|
| 7:35|
| 9:20|
What do I have to do to my ORDER BY statement to get the result to be sorted as times so it would result in
| 7:35|
| 9:20|
|14:33|
One solution is to pad hours that do not include a leading 0 with one in the query itself and then perform the sort.
SELECT * FROM <table> ORDER BY SUBSTR('0' || time, -5, 5);
Here's a breakdown of what the SUBSTR call is doing.
|| is the string concatenation operator in SQLite. So '0' || '7:35' gives '07:35', and '0' || '14:33' gives '014:33'. Since we're only interested in a string like HH:MM, we only want the last 5 characters of this concatenated string.
If you instead store the original strings with a leading 0 for hours and minutes, then you can simply order by the time column and get the desired results. The stored column would look like:
|14:33|
|07:35|
|09:20|
That will also make it easy to use the time column as an actual time value and do computations on it. For example, if you wanted to add 20 minutes to all times, that can simply be achieved with:
SELECT TIME(time, '+20 minutes') FROM <table>;
The reason for the 0 padding is that SQLite currently only understands 24-hour times in the form 'HH:MM', not 'H:MM'.
The SQLite documentation on date and time functions is a good reference page for these.
The best way is to store the time as seconds: either as a Unix timestamp (recommended), or as the number of seconds since midnight.
In the second case, 7:35 becomes 7*3600 + 35*60 = 27300 and the representation of 14:33 becomes 52380; store them as integers. Similarly, Unix timestamps store times as the number of seconds since 1970.
You can then sort them as plain integers and use utility methods to easily handle the conversion.
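As a rough sketch of the seconds-since-midnight idea applied to the existing column (keeping the <table> placeholder from the question), the conversion can also be done on the fly in SQLite:
-- Sketch only: pad to 'HH:MM', then turn hours and minutes into integer seconds.
SELECT time,
       CAST(SUBSTR('0' || time, -5, 2) AS INTEGER) * 3600
         + CAST(SUBSTR('0' || time, -2, 2) AS INTEGER) * 60 AS seconds_since_midnight
FROM <table>
ORDER BY seconds_since_midnight;
The same expression could be used in a one-off UPDATE if you decide to migrate the column to an integer type and sort on it directly.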
