How can I generate synthetic transactional data? - associations

I am working on association rules, so I need a transactional dataset, which is unavailable in the UCI repository; therefore I need to generate transactional data. Transactional data is a set of transactions, and each transaction contains a subset of items. The Groceries dataset is an example of a transactional database.
Let D be a transactional database, T = {t1, t2, t3, ..., tn} the set of transactions, and I = {i1, i2, i3, ..., im} the set of items; then transactional data looks like:
TID Items
001 i1,i2,i5
002 i5,i6,i8,i10
003 i1,i4
004 i6,i4,i8
Thanks

Based on your definition, it looks like you're trying to generate a two-dimensional array. In JavaScript you could do something like this:
var t = 5,  // number of transactions
    d = [], // the two-dimensional array of transactions
    r = 10, // maximum number of items per transaction
    s = 10; // item labels run from i0 to i(s-1)
for (var i = 0; i < t; i++) {
    d.push([]);
    var n = Math.ceil(Math.random() * r); // size of this transaction
    for (var j = 0; j < n; j++) {
        d[i].push("i" + Math.floor(Math.random() * s));
    }
}
Here t is the number of transactions, d is the two-dimensional array of transactions, r is the maximum number of items in transaction i, and s is the number of distinct item labels: each item is the string "i" followed by a number between 0 and s-1. Note that items within a transaction can repeat, so deduplicate each row if your miner expects sets. Running the above and printing d (console.log(d)) could give you something like this:
0 i3, i8
1 i5, i6, i8
2 i1, i2, i5
3 i3, i8
4 i9, i1, i7, i3, i5
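If you'd rather work in Python, here is a minimal sketch of the same idea (the function name and default parameters are my own); it samples without replacement, so no item repeats within a transaction:
import random

def make_transactions(t=5, max_items=4, n_items=10, seed=42):
    """Generate t transactions, each a random subset of items i1..i<n_items>."""
    random.seed(seed)
    items = [f"i{k}" for k in range(1, n_items + 1)]
    data = []
    for tid in range(1, t + 1):
        size = random.randint(1, max_items)          # items in this transaction
        basket = sorted(random.sample(items, size))  # sampled without replacement
        data.append((f"{tid:03d}", basket))
    return data

for tid, basket in make_transactions():
    print(tid, ",".join(basket))
This prints one TID and its comma-separated items per line, in the same layout as the question.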

Here is an open-source app that leverages the R package conjurer to generate transactional data.

Related

Azure CosmosDB, count with subquery

I'm currently using CosmosDB to store some review data. I'm trying to retrieve the average rating for a specific date and also the number of records that have a rating below 3.
The queries that I'm using to retrieve the average rating and to count the records with ratings below 3 are:
**Ratings**
SELECT avg(c.x_review) as avg_x_review,
avg(c.y_review) as avg_y_review,
{date} as date
FROM c where c.date = {date}
**Criticals (below 3)**
SELECT count(1) FROM c where c.date = {date}
AND (
c.x_review < 3 OR
c.y_review < 3
)
I want all of this inside just one query. The data will be retrieved by an Azure Function HTTP trigger, and I will save the JSON generated by this query in another container.
I've been searching for tips, but nothing seems to work. One example of the kind of query I'm trying to reach is:
SELECT avg(c.x_review) as avg_x_review,
avg(c.y_review) as avg_y_review,
count(SELECT * FROM c where c.date = {date}
AND (
c.x_review < 3 OR
c.y_review < 3
)
) as criticals
FROM c where c.date = {date}
...or something like that, where I could have both aggregates inside a single query.
I'm expecting to generate a JSON like this:
{
"avg_x_review": 5,
"avg_y_review": 4,
[...]
[...]
"criticals": 0,
"date": "2022-02-28"
}
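One way to get there without a subquery is conditional aggregation: count a row as 1 when it is critical and 0 otherwise, and SUM that flag alongside the averages. A rough sketch using the Python SDK follows; the Cosmos SQL ternary (?:) operator usage and the placeholder account/database/container names are assumptions to verify against your setup:
import json
from azure.cosmos import CosmosClient

# Hypothetical names: replace the account URL, key, database and container.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<db>").get_container_client("<container>")

# Conditional aggregation: SUM a 1/0 flag instead of a COUNT subquery.
query = """
SELECT AVG(c.x_review) AS avg_x_review,
       AVG(c.y_review) AS avg_y_review,
       SUM((c.x_review < 3 OR c.y_review < 3) ? 1 : 0) AS criticals
FROM c
WHERE c.date = @date
"""

rows = list(container.query_items(
    query=query,
    parameters=[{"name": "@date", "value": "2022-02-28"}],
    enable_cross_partition_query=True,
))
result = {**rows[0], "date": "2022-02-28"}  # add the date back before saving
print(json.dumps(result))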

Find the values missing between two tables

Pretty new to KQL. Have a very basic question.
Say I have two tables, Table1 and Table2, each of which has a column named id.
What I am looking for is a query to find the ids which are present in Table1 but not in Table2.
I saw set_difference(), but where I am stuck is generating the arrays to be passed to it.
Thanks in advance.
You could consider using a leftanti/rightanti join, or !in.
Examples:
This returns a table with a single column x containing the values 1, 4, 7, 22, 25, 28:
let T1 = range x from 1 to 30 step 3;
let T2 = range y from 10 to 20 step 1;
T1
| join kind=leftanti T2 on $left.x == $right.y
and so does this:
let T1 = range x from 1 to 30 step 3;
let T2 = range y from 10 to 20 step 1;
T1
| where x !in((T2 | project y))

Creating even ranges based on values in an oracle table

I have a big table, about 100k rows, whose PRIMARY KEY is of datatype NUMBER. The data in this column is populated using a random number generator.
So my question is: is there a SQL query that can partition the table evenly into ranges of values? E.g., if my column values are:
1
2
3
4
5
6
7
8
9
10
and I want them broken into three partitions, then I would expect output like this:
Range 1 1-3
Range 2 4-7
Range 3 8-10
It sounds like you want the WIDTH_BUCKET() function.
This query will give you the start and end range for a table of 1250 rows split into 20 buckets based on id:
with bkt as (
    select id
         , width_bucket(id, 1, 1251, 20) as id_bucket
    from t23
)
select id_bucket
     , min(id) as bkt_start
     , max(id) as bkt_end
     , count(*)
from bkt
group by id_bucket
order by 1;
The two middle parameters specify the min and max values; the last parameter specifies the number of buckets. The output is the rows between the minimum and maximum values, split as evenly as possible into the specified number of buckets. Be careful with the min and max parameters; I've found that poorly chosen bounds can have an odd effect on the split.
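To see why the bounds matter, here is a small Python model of WIDTH_BUCKET's documented semantics (my own re-implementation, for illustration): values below the minimum fall into bucket 0 and values at or above the maximum into bucket n+1, so bounds that don't cover the data skew the split.
def width_bucket(x, lo, hi, n):
    """Rough Python model of Oracle's WIDTH_BUCKET(x, lo, hi, n)."""
    if x < lo:
        return 0          # underflow bucket
    if x >= hi:
        return n + 1      # overflow bucket
    return int((x - lo) / ((hi - lo) / n)) + 1

# Bounds that stop short of the data pile rows into the overflow bucket:
print([width_bucket(x, 1, 8, 3) for x in range(1, 11)])
# [1, 1, 1, 2, 2, 3, 3, 4, 4, 4] -- ids 8..10 all land in bucket 4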
This solution works without the width_bucket function. While it is more verbose and certainly less efficient, it will split the data as evenly as possible, even if some ID values are missing.
CREATE TABLE t AS
SELECT rownum AS id
FROM dual
CONNECT BY level <= 10;

WITH
data AS (
    SELECT id, rownum AS row_num
    FROM t
),
total AS (
    SELECT count(*) AS total_rows
    FROM data
),
parts AS (
    SELECT rownum AS part_no, total.total_rows, total.total_rows / 3 AS part_rows
    FROM dual, total
    CONNECT BY level <= 3
),
bounds AS (
    SELECT parts.part_no,
           parts.total_rows,
           parts.part_rows,
           COALESCE(LAG(data.row_num) OVER (ORDER BY parts.part_no) + 1, 1) AS start_row_num,
           data.row_num AS end_row_num
    FROM data
    JOIN parts
      ON data.row_num = ROUND(parts.part_no * parts.part_rows, 0)
)
SELECT bounds.part_no, d1.ID AS start_id, d2.ID AS end_id
FROM bounds
JOIN data d1
  ON d1.row_num = bounds.start_row_num
JOIN data d2
  ON d2.row_num = bounds.end_row_num
ORDER BY bounds.part_no;
   PART_NO   START_ID     END_ID
---------- ---------- ----------
         1          1          3
         2          4          7
         3          8         10

Random sampling without replacement in longitudinal data

My data is longitudinal.
VISIT ID VAR1
1 001 ...
1 002 ...
1 003 ...
1 004 ...
...
2 001 ...
2 002 ...
2 003 ...
2 004 ...
Our end goal is to pick out 10% of the IDs at each visit to run a test. I tried using PROC SURVEYSELECT to do SRS without replacement, with VISIT as the strata, but the final sample had duplicated IDs. For example, ID=001 might be selected in both VISIT=1 and VISIT=2.
Is there any way to do that using SURVEYSELECT or another procedure? (R is also fine.) Thanks a lot.
This is possible with some fairly creative data step programming. The code below uses a greedy approach, sampling from each visit in turn and considering only ids that have not previously been sampled. If more than 90% of the ids for a visit have already been sampled, fewer than 10% are output. In the extreme case, when every id for a visit has already been sampled, no rows are output for that visit.
/*Create some test data*/
data test_data;
    call streaminit(1);
    do visit = 1 to 1000;
        do id = 1 to ceil(rand('uniform')*1000);
            output;
        end;
    end;
run;

data sample;
    /*Create a hash object to keep track of unique IDs not sampled yet*/
    if 0 then set test_data;
    call streaminit(0);
    if _n_ = 1 then do;
        declare hash h();
        rc = h.definekey('id');
        rc = h.definedata('available');
        rc = h.definedone();
    end;
    /*Find out how many not-previously-sampled ids there are for the current visit*/
    do ids_per_visit = 1 by 1 until(last.visit);
        set test_data;
        by visit;
        if h.find() ne 0 then do;
            available = 1;
            rc = h.add();
        end;
        available_per_visit = sum(available_per_visit,available);
    end;
    /*Read through the current visit again, randomly sampling from the not-yet-sampled ids*/
    samprate = 0.1;
    number_to_sample = round(available_per_visit * samprate,1);
    do _n_ = 1 to ids_per_visit;
        set test_data;
        if available_per_visit > 0 then do;
            rc = h.find();
            if available = 1 then do;
                if rand('uniform') < number_to_sample / available_per_visit then do;
                    available = 0;
                    rc = h.replace();
                    samples_per_visit = sum(samples_per_visit,1);
                    output;
                    number_to_sample = number_to_sample - 1;
                end;
                available_per_visit = available_per_visit - 1;
            end;
        end;
    end;
run;

/*Check that there are no duplicate IDs*/
proc sort data = sample out = sample_dedup nodupkey;
    by id;
run;
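The question also allows R; as a language-agnostic illustration, here is the same greedy idea sketched in Python/pandas instead (the helper name is mine; the VISIT and ID columns match the question):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def greedy_sample(df, rate=0.1):
    """Sample ~rate of the ids at each visit, never reusing an id."""
    taken = set()    # ids sampled at any earlier visit
    pieces = []
    for _, grp in df.groupby("VISIT", sort=True):
        avail = grp.loc[~grp["ID"].isin(taken), "ID"].unique()
        n = round(len(avail) * rate)
        if n:
            chosen = rng.choice(avail, size=n, replace=False)
            taken.update(chosen)
            pieces.append(grp[grp["ID"].isin(chosen)])
    return pd.concat(pieces, ignore_index=True) if pieces else df.iloc[0:0]

# Example: 100 ids seen at two visits; no id is ever sampled twice.
df = pd.DataFrame({"VISIT": np.repeat([1, 2], 100), "ID": list(range(100)) * 2})
sample = greedy_sample(df)
assert not sample["ID"].duplicated().any()
Like the data step above, it takes fewer than 10% from a visit when most of its ids were already sampled at earlier visits.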

How to design a database (SQLite) for multi-condition queries?

Suppose 1,000,000 records arranged as:
c1_v1 c2_v1 c3_v1 d1
c1_v1 c2_v1 c3_v2 d2
c1_v1 c2_v1 c3_v3 d3
...
c1_v1 c2_v2 c3_v1 d999
c1_v1 c2_v2 c3_v2 d1000
...
c1_v999 c2_v999 c3_v998 d999999
c1_v999 c2_v999 c3_v999 d1000000
say that we need three conditions (c1_vx, c2_vx, c3_vx) to query the result (dx), but a single condition value such as c1_v1 may be shared by different records. An alternative layout of the records:
c1_v1
c2_v1
c3_v1 : d1
c3_v2 : d2
c3_v3 : d3
...
c2_v2
c3_v1 : d999
c3_v2 : d1000
...
c1_v999
c2_v999
c3_v998: d999999
c3_v999 : d1000000
How should the tables be designed for the fastest query? (Just query; don't care about insert/update/delete.)
Thanks!
A typical query operation is like select d from t_table where c1 = 'UA1000_2048X32_MCSYN' and c2 = '1.234' and c3 = '2.345';
Well, then you just need a composite index on {c1, c2, c3}.
Ideally, you'd also cluster the table, so retrieving d involves only an index seek with no table heap access. SQLite doesn't support clustered indexes as such, although a WITHOUT ROWID table with PRIMARY KEY (c1, c2, c3) achieves much the same effect. Alternatively, consider creating a covering index on {c1, c2, c3, d} instead.
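As a quick sanity check, here is a sketch using Python's sqlite3 module (the table and index names are illustrative): EXPLAIN QUERY PLAN should show the covering index satisfying the query without touching the table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t_table (c1 TEXT, c2 REAL, c3 REAL, d TEXT)")
# Covering index: c1, c2, c3 drive the lookup; including d lets the
# index alone answer the query.
con.execute("CREATE INDEX idx_cover ON t_table (c1, c2, c3, d)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT d FROM t_table WHERE c1 = ? AND c2 = ? AND c3 = ?",
    ("UA1000_2048X32_MCSYN", 1.234, 2.345),
).fetchall()
print(plan)  # expect a line like: ... USING COVERING INDEX idx_cover ...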
c1 is a string like UA1000_2048X32_MCSYN; c2 and c3 are real (double) numbers.
I'd refrain from trying to equate numbers with strings in your query; some DBMSes can't use an index in these situations, and SQLite might be one of them. Instead, just write the query in the most natural way, without single quotes around the number literals:
select d from t_table
where c1 = 'UA1000_2048X32_MCSYN' and c2 = 1.234 and c3 = 2.345;
