I’m trying to write a query where I limit the number of sub-results I get per category, and could use some help on whether there is a good function for this.
Quick Example:
| ID | Category | Value | A bunch of other important columns |
|-----------|-----------------|--------------|-------------------------------------------|
| 1 | A | GUID | |
| 2 | A | GUID | |
| 3 | A | GUID | |
| 4 | A | GUID | |
| 5 | B | GUID | |
| 6 | B | GUID | |
I want to return only N GUIDs per category (largely because I’m hitting the 64 MB Kusto query result limit for some categories that won’t be useful anyway).
The top-nested operator looks good at first, BUT I don’t want to do any aggregation, and it drops the other important columns. Per the note in the documentation, I could use Ignore=max(1) to remove the aggregation, serialize all my other columns into a single value, and unpack after the filter, but that feels like I’m doing something very wrong.
I've also tried something like:
| partition by Category ( top 3 by Value)
But it's limited to 64 partitions, and I need closer to 500.
Any idea of a good pattern to do this?
Here you go: pack each row into a property bag with pack_all(), use make_list() with a size limit to keep at most N per category, then expand and unpack to get the original columns back:
let NumItemsPerCategory = 3;
datatable(ID:long, Category:string, Value:guid)
[
1, "A", guid(40b73f8f-78d2-4eae-bd5b-b3e00f38ac33),
2, "A", guid(043ee507-aadf-4453-bcc6-d8f4f541b043),
3, "A", guid(f71d3cc0-ce46-474f-9dcd-f3883fa08859),
4, "A", guid(bf259fc8-e9fe-4a99-a296-ca81e1fa250a),
5, "B", guid(d8ee3ac7-da76-4e87-a9ed-e5a37c943ad2),
6, "B", guid(282e74ff-3b71-407c-a2a7-92bb1cb17b27),
]
// pack_all() turns each row into a property bag; make_list() keeps at most N of them per Category
| summarize PackedItems = make_list(pack_all(), NumItemsPerCategory) by Category
| project-away Category
// expand back to one row per item and restore the original columns
| mv-expand PackedItem = PackedItems
| evaluate bag_unpack(PackedItem)
| project-away PackedItems
Result:
| ID | Category | Value |
|----|----------|--------------------------------------|
| 1 | A | 40b73f8f-78d2-4eae-bd5b-b3e00f38ac33 |
| 2 | A | 043ee507-aadf-4453-bcc6-d8f4f541b043 |
| 3 | A | f71d3cc0-ce46-474f-9dcd-f3883fa08859 |
| 5 | B | d8ee3ac7-da76-4e87-a9ed-e5a37c943ad2 |
| 6 | B | 282e74ff-3b71-407c-a2a7-92bb1cb17b27 |
I have a set of 3 relational tables that I want to convert to a single table in DynamoDB. Each table holds data for a different tranType. Each table has Id, tranDate as its key. For a given Id, tranDate and tranType, there are multiple rows.
My access pattern is: get the data for a given Id and tranDate, which should return the rows for all tranTypes.
The rows in each individual table are within 400KB for a given Id, tranDate, but if I add up the rows for a given Id and tranDate across the 3 tables, they exceed 400KB.
Definitions
Table1
Id, tranDate,tranType,col1,col2,col3,col4
Table2
Id, tranDate,tranType,col1,col2,col3,col4,col5
Table3
Id, tranDate,tranType,col1,col2
Table1 (Sample Data)
1, 2018-12-01,'DETAIL',12,13,14,'A'
1, 2018-12-01,'DETAIL',15,23,11,'B'
1, 2018-12-01,'DETAIL',17,33,24,'C'
1, 2018-12-01,'DETAIL',19,43,14,'D'
2, 2018-12-01,'DETAIL',11,13,14,'A1'
2, 2018-12-01,'DETAIL',12,23,11,'B1'
1, 2018-11-01,'DETAIL',42,13,14,'X'
1, 2018-11-01,'DETAIL',45,23,11,'Y'
1, 2018-11-01,'DETAIL',47,33,24,'Z'
Table2 (Sample Data)
1, 2018-12-01,'SUMMARY',12,13,14,'A','S'
1, 2018-12-01,'SUMMARY',15,23,11,'B','B1'
2, 2018-12-01,'SUMMARY',17,33,24,'C','D1'
2, 2018-12-01,'SUMMARY',22,43,14,'D','D2'
2, 2018-12-01,'SUMMARY',33,13,14,'A1' ,'D3'
Table3 (Sample Data)
1, 2018-12-01,'GEO',11,'MI'
1, 2018-12-01,'GEO',12,'NY'
1, 2018-12-01,'GEO',11,'AL'
2, 2018-12-01,'GEO',14,'DE'
2, 2018-12-01,'GEO',15,'PA'
Given Id=1, tranDate='2018-12-01' -- Expected Results
1, 2018-12-01,'DETAIL',12,13,14,'A'
1, 2018-12-01,'DETAIL',15,23,11,'B'
1, 2018-12-01,'DETAIL',17,33,24,'C'
1, 2018-12-01,'DETAIL',19,43,14,'D'
1, 2018-12-01,'SUMMARY',12,13,14,'A','S'
1, 2018-12-01,'SUMMARY',15,23,11,'B','B1'
1, 2018-12-01,'GEO',11,'MI'
1, 2018-12-01,'GEO',12,'NY'
1, 2018-12-01,'GEO',11,'AL'
Based on your description, one possible design would be to use the concatenation of id + date as the partition key, and use the transaction type as the sort key, possibly combining the sort key with one of the other ids.
So, your table might look like this:
PK | TranType | data
----------------+-------------------+------------------------------------------
"1:2018-12-01" | "DETAIL" | ["12,13,14,A", "15,23,11,B",...]
"1:2018-12-01" | "SUMMARY" | ["12,13,14,A,S", "15,23,11,B,B1"]
"1:2018-12-01" | "GEO" | ["11,MI", "12,NY", "11,AL"]
"2:2018-12-01" | "DETAIL" | [...]
"2:2018-12-01" | "SUMMARY" | [...]
"2:2018-12-01" | "GEO" | [...]
Assuming the data payload is not so big that an item exceeds the 400KB item size limit, that will work.
Another possibility would be to further break the data down into discrete attributes and create a composite column for the sort key, formed from the transaction type prefix and one of the ids in the data, or just a numeric index (this really depends on what those other columns in your example really mean).
An example, assuming col1 is unique within DETAIL and within SUMMARY, might look like this:
PK               | TTID          | states      | c2 | c3 | c4   | c5
-----------------+---------------+-------------+----+----+------+------
"1:2018-12-01"   | "DETAIL:12"   |             | 13 | 14 | 'A'  |
"1:2018-12-01"   | "DETAIL:15"   |             | 23 | 11 | 'B'  |
"1:2018-12-01"   | "DETAIL:17"   |             | 33 | 24 | 'C'  |
"1:2018-12-01"   | "DETAIL:19"   |             | 43 | 14 | 'D'  |
"1:2018-12-01"   | "SUMMARY:12"  |             | 13 | 14 | 'A'  | 'S'
"1:2018-12-01"   | "SUMMARY:15"  |             | 23 | 11 | 'B'  | 'B1'
"1:2018-12-01"   | "GEO:11"      | ["MI","AL"] |    |    |      |
"1:2018-12-01"   | "GEO:12"      | ["NY"]      |    |    |      |
"2:2018-12-01"   | "DETAIL:11"   |             | 13 | 14 | 'A1' |
"2:2018-12-01"   | "DETAIL:.."   |             | ...
"2:2018-12-01"   | "SUMMARY:17"  |             | ...
"2:2018-12-01"   | "SUMMARY:.."  |             | ...
"2:2018-12-01"   | "GEO:.."      | ...         |
There is not one single answer to this question. Design the schema based on the data that you have and how you access your data.
I'm using nested apply calls to get a list of p-values from cor.test between each pair of columns of two tables.
hel_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
The otud data.frame is 90x11 (90 rows, 11 columns, i.e. dim(otud) is 90 11) and will be used with different data.frames.
bc and hel are both 90x2 data.frames, so for each of them I get 2*11 = 22 p-values out of these functions:
bc_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
hel_plist<-apply(hel, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
For bc I get an output with dim = NULL: a nested list of elements of the form bcname$otuname$p-value (a format that I have always got from these scripts and am happy with).
But for hel I get an output with dim 11 2: an 11x2 table with the p-values written inside.
Shortened examples of the output:
hel_plist
+--------+--------------+--------------+
| | axis1 | axis2 |
+--------+--------------+--------------+
| Otu037 | 1.126362e-18 | 0.01158251 |
| Otu005 | 3.017458e-2 | NULL |
| Otu068 | 0.00476002 | NULL |
| Otu070 | 1.27646e-15 | 5.252419e-07 |
+--------+--------------+--------------+
bc_plist
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
Why is it like that when the input formats are all the same? (Shortened examples below.)
bc
+-------+-----------+-----------+
| group | axis1 | axis2 |
+-------+-----------+-----------+
| 1B041 | 0.125219 | 0.246319 |
| 1B060 | -0.022412 | -0.030227 |
| 1B197 | -0.088005 | -0.305351 |
| 1B222 | -0.119624 | -0.144123 |
| 1B227 | -0.148946 | -0.061741 |
+-------+-----------+-----------+
hel
+-------+---------------+---------------+
| group | axis1 | axis2 |
+-------+---------------+---------------+
| 1B041 | -0.0667782322 | -0.1660606406 |
| 1B060 | 0.0214470932 | -0.0611351008 |
| 1B197 | 0.1761876858 | 0.0927570627 |
| 1B222 | 0.0681058251 | 0.0549292399 |
| 1B227 | 0.0516864361 | 0.0774155225 |
| 1B235 | 0.1205676221 | 0.0181712761 |
+-------+---------------+---------------+
How could I force my scripts to always produce "flat" outputs as in the case of bc?
OK, the different outputs are caused by the NULL results from the conditional function in the bc_plist case. If I modify the code to replace the possible NULLs with NAs, I get a 2D table in every case.
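A minimal toy illustration of that behaviour (not from the original data): apply() only simplifies to a matrix when every call returns a value of the same length, and a NULL result (length 0) keeps the output as a list.
apply(matrix(1:4, 2), 2, function(x) if (sum(x) > 3) sum(x))         # returns a list: the first column gives NULL
apply(matrix(1:4, 2), 2, function(x) if (sum(x) > 3) sum(x) else NA) # returns a numeric vector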
So to keep things consistent:
bc_nmds_plist<-apply(bc_nmds, 2, function(x) { apply(stoma_otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}else NA}) })
And I get a 2D table out for bc_nmds_plist too.
So I guess this can be called solved, as I now have a piece of code that produces predictable output on any correct input.
If anyone has an idea how to force the output to conform to the previous bc_plist format instead, I would still be interested, as I actually prefer that form:
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
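For reference, one way to keep that nested-list shape regardless of how many results are significant would be to build the lists explicitly instead of relying on apply()'s simplification. A minimal sketch, assuming bc and otud contain only the numeric columns used in the apply() calls above (lapply() never simplifies, so the result stays a list of lists either way):
bc_plist <- lapply(as.data.frame(bc), function(x) {
  res <- lapply(as.data.frame(otud), function(y) {
    p <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value
    if (p < 0.05) p else NULL
  })
  Filter(Negate(is.null), res)  # drop the non-significant entries entirely
})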
Sum of variable values by group, with certain values excluded conditional on the other variable.
How can I do this elegantly without transposing?
In the table below, for each (fTicker, DATE_f) I want to sum the values of wght, with the wght of the sTicker in question excluded from the sum.
For example, excl_val for sTicker = A given (fTicker = XLK, DATE_f = 6/20/2003) equals the wght of AAPL on 6/20/2003 in XLK plus the wght of AA on 6/20/2003 in XLK, but not the wght for sTicker = A itself.
+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker | DATE_f | wght | excl_val |
+---------+---------+-----------+-------------+-------------+
| A | XLK | 6/20/2003 | 0.087600002 | 1.980834016 |
| A | XLK | 6/23/2003 | 0.08585 | 1.898560068 |
| A | XLK | 6/24/2003 | 0.085500002 | |
| AAPL | XLK | 6/20/2003 | 0.070080002 | |
| AAPL | XLK | 6/23/2003 | 0.06868 | |
| AAPL | XLK | 6/24/2003 | 0.068400002 | |
| AA | XLK | 6/20/2003 | 1.910754014 | |
| AA | XLK | 6/23/2003 | 1.829880067 | |
| AA | XLK | 6/24/2003 | 1.819775 | |
+---------+---------+-----------+-------------+-------------+
There are several fTicker groups with many sTickers in them (10 to 70), and some sTickers may belong to several fTickers. The end result should be an excl_val for each sTicker on each DATE_f and for each fTicker.
I did it by transposing in SAS, with a resulting file of about 6 GB, but the same approach in R blew memory up to 40 GB and is basically unworkable.
In R, I got as far as this:
weights$excl_val <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))
but it's just a simple sum (without excluding the necessary observation), and there is a mismatch between the row lengths. If I could condition the sum to exclude the sTicker's own wght observation from the summation, I think it would work.
About the length of excl_val: I computed it in Excel for just 2 cells, which is why it's short.
Thank you!
Arsenio
When you have data in a data.frame, it is better if the rows are meaningful (in particular, the columns should have the same length): in this case, excl_val looks like a separate vector. After putting the information it contains into the data.frame, things become easier.
# Sample data
k <- 5
d <- data.frame(
  sTicker = rep(LETTERS[1:k], k),
  fTicker = rep(LETTERS[1:k], each=k),
  DATE_f  = sample( seq(Sys.Date(), length=2, by=1), k*k, replace=TRUE ),
  wght    = runif(k*k)
)
excl_val <- sample(d$wght, k)
# Add a "valid" column to the data.frame
d$valid <- ! d$wght %in% excl_val
# Compute the sum
library(plyr)
ddply(d, c("fTicker","DATE_f"), summarize, sum=sum(wght[valid]))
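If the goal is exactly the leave-one-out sum from the question (the (fTicker, DATE_f) group total minus each row's own wght), a base-R sketch along these lines should also work on d as built above, assuming each sTicker appears at most once per (fTicker, DATE_f) as in the sample data:
# group total per (fTicker, DATE_f), minus the row's own wght:
# this yields excl_val for every sTicker row at once, without transposing
grp_total  <- ave(d$wght, d$fTicker, d$DATE_f, FUN = function(w) sum(w, na.rm = TRUE))
d$excl_val <- grp_total - d$wght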