Divide one record by another in Kusto - azure-data-explorer

| OsType  | Type | count | P50 | P99 |
|---------|------|-------|-----|-----|
| Linux   | Gen2 | 1635  | 39  | 159 |
| Windows | Gen2 | 1638  | 44  | 149 |
| Linux   | Gen1 | 1647  | 43  | 133 |
| Windows | Gen1 | 1687  | 46  | 138 |
I want to compare the P99 of Gen1 vs. Gen2. How can I write a Kusto query for it?

1. Based on summarize
let t = datatable (OsType:string, Type:string, ["count"]:int, P50:int, P99:int)
[
"Linux" ,"Gen2" ,1635 ,39 ,159
,"Windows" ,"Gen2" ,1638 ,44 ,149
,"Linux" ,"Gen1" ,1647 ,43 ,133
,"Windows" ,"Gen1" ,1687 ,46 ,138
];
t
| summarize ratio = round(1.0 * take_anyif(P99, Type == "Gen1") / take_anyif(P99, Type == "Gen2"),2) by OsType
| OsType  | ratio |
|---------|-------|
| Linux   | 0.84  |
| Windows | 0.93  |
2. Based on join
let t = datatable (OsType:string, Type:string, ["count"]:int, P50:int, P99:int)
[
"Linux" ,"Gen2" ,1635 ,39 ,159
,"Windows" ,"Gen2" ,1638 ,44 ,149
,"Linux" ,"Gen1" ,1647 ,43 ,133
,"Windows" ,"Gen1" ,1687 ,46 ,138
];
t
| where Type == "Gen1"
| join kind=inner (t | where Type == "Gen2" | project-rename P99_Gen2 = P99) on OsType
| project OsType, P99_Gen1 = P99, P99_Gen2, ratio = round(1.0 * P99 / P99_Gen2, 2)
| OsType  | P99_Gen1 | P99_Gen2 | ratio |
|---------|----------|----------|-------|
| Linux   | 133      | 159      | 0.84  |
| Windows | 138      | 149      | 0.93  |
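As a cross-check, the same per-OS ratio can be computed outside Kusto; a minimal Python sketch of the grouping logic, with the P99 values hard-coded from the table above:

```python
# P99 values keyed by (OsType, Type), taken from the table above.
p99 = {
    ("Linux", "Gen1"): 133, ("Linux", "Gen2"): 159,
    ("Windows", "Gen1"): 138, ("Windows", "Gen2"): 149,
}

def gen1_vs_gen2_ratio(p99_by_key):
    """Return {OsType: round(Gen1 P99 / Gen2 P99, 2)}, mirroring the summarize query."""
    os_types = {os for os, _ in p99_by_key}
    return {
        os: round(p99_by_key[(os, "Gen1")] / p99_by_key[(os, "Gen2")], 2)
        for os in os_types
    }

print(gen1_vs_gen2_ratio(p99))  # {'Linux': 0.84, 'Windows': 0.93}
```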

Related

How to combine countpct and binomCI into the same summary statistic to be used in the tableby function?

I'm using the tableby function from the arsenal package to create summary tables. For most of the statistics I need to generate, this package gives me exactly the format I'm asked for, except for one. I need to get something like this in a single cell:
n (%) [95%CI of the percentage]
For now I'm using the countpct function, which gives me the "n (%)", and binomCI, which gives me the proportion with its 95% CI, but that doubles the number of rows in my final table, so it's not ideal.
How could I get everything on the same line?
I tried to see if I could create another function from the original ones, but I don't really understand their syntax.
Thanks for your help.
EDIT: Here is a reproducible example.
Code for the original functions can be found here.
So this is what I have now:
data <- NULL
data$Visit2 <- c(rep("Responder", 121), rep("Not Responder", 29),
                 rep("Responder", 4), rep("Not Responder", 47))
data$Group <- c(rep("Tx", 150), rep("No Tx", 51))
data <- as.data.frame(data)
library(arsenal)
my_controls <- tableby.control(test = FALSE, total = FALSE,
                               cat.stats = c("countpct", "binomCI"),
                               conf.level = 0.95)
summary(tableby(Group ~ Visit2,
                data = data,
                control = my_controls),
        digits = 2, digits.p = 3, digits.pct = 1)
# Results :
| | No Tx (N=51) | Tx (N=150) |
|:-------------------------------|:-----------------:|:-----------------:|
|**Visit2** | | |
| Not Responder | 47 (92.2%) | 29 (19.3%) |
| Responder | 4 (7.8%) | 121 (80.7%) |
| Not Responder | 0.92 (0.81, 0.98) | 0.19 (0.13, 0.27) |
| Responder | 0.08 (0.02, 0.19) | 0.81 (0.73, 0.87) |
And this is what I want:
| | No Tx (N=51) | Tx (N=150) |
|:----------------|:-------------------------:|:------------------------:|
|**Visit2**       |                           |                          |
| Not Responder | 47 (92.2%) [81.1, 97.8] | 29 (19.3%) [13.3, 26.6] |
| Responder | 4 (7.8%) [2.2, 18.9] | 121 (80.7%) [73.4, 86.7] |
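For reference, the bracketed figures are just a binomial confidence interval on the percentage. A stdlib-only Python sketch using the Wilson score interval (chosen here for illustration; arsenal's binomCI may use a different method, so its bounds can differ slightly):

```python
import math

def wilson_ci(successes, n, z=1.959964):
    """Wilson score confidence interval for a binomial proportion.
    z defaults to the normal quantile for a 95% interval."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# "Not Responder" in the No Tx group: 47 of 51
lo, hi = wilson_ci(47, 51)
print(f"47 (92.2%) [{100 * lo:.1f}, {100 * hi:.1f}]")
```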

How to combine R data frames with non-exact criteria (greater/less than condition)

I have to combine two R data frames which have trade and quote information. Like a join, but based on a timestamp in seconds. I need to match each trade with the most recent quote. There are many more quotes than trades.
I have this table with stock quotes. The Timestamp is in seconds:
+--------+-----------+-------+-------+
| Symbol | Timestamp | bid | ask |
+--------+-----------+-------+-------+
| IBM | 10 | 132 | 133 |
| IBM | 20 | 132.5 | 133.3 |
| IBM | 30 | 132.6 | 132.7 |
+--------+-----------+-------+-------+
And these are trades:
+--------+-----------+----------+-------+
| Symbol | Timestamp | quantity | price |
+--------+-----------+----------+-------+
| IBM | 25 | 100 | 132.5 |
| IBM | 31 | 80 | 132.7 |
+--------+-----------+----------+-------+
I think a native R function or dplyr could do it - I've used both for basic purposes but not sure how to proceed here. Any ideas?
So the trade at 25 seconds should match with the quote at 20 seconds, and the trade #31 matches the quote #30, like this:
+--------+-----------+----------+-------+-------+-------+
| Symbol | Timestamp | quantity | price | bid | ask |
+--------+-----------+----------+-------+-------+-------+
| IBM | 25 | 100 | 132.5 | 132.5 | 133.3 |
| IBM | 31 | 80 | 132.7 | 132.6 | 132.7 |
+--------+-----------+----------+-------+-------+-------+
Consider merging on a calculated field by increments of 10. Specifically, calculate a column holding each Timestamp's multiple of 10 in both datasets, and merge on that field together with Symbol.
Below, within is used to assign the helper field, mult10, and transform to remove it afterwards:
final_df <- transform(merge(within(quotes, mult10 <- floor(Timestamp / 10) * 10),
                            within(trades, mult10 <- floor(Timestamp / 10) * 10),
                            by = c("Symbol", "mult10")),
                      mult10 = NULL)
Now if the multiple of 10 does not suffice for your needs, adjust it to the level you require, such as 15, 5, 2, etc.
within(quotes, mult10 <- floor(Timestamp / 15) * 15)
within(quotes, mult10 <- floor(Timestamp / 5) * 5)
within(quotes, mult10 <- floor(Timestamp / 2) * 2)
Even more, you may need to use ceiling and floor for the two data sets respectively, to calculate the nearest multiple above the quote's Timestamp and the nearest multiple below the trade's Timestamp:
within(quotes, mult10 <- ceiling(Timestamp / 15) * 15)
within(trades, mult10 <- floor(Timestamp / 5) * 5)
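More generally, "match each trade with the most recent quote at or before it" is an as-of (rolling) join; in R, data.table's rolling joins implement it directly. The matching step itself is just a binary search over the sorted quote timestamps. A stdlib-only Python sketch of that logic, with the example data hard-coded:

```python
from bisect import bisect_right

# (Timestamp, bid, ask), sorted by Timestamp, all for symbol IBM.
quotes = [
    (10, 132.0, 133.0),
    (20, 132.5, 133.3),
    (30, 132.6, 132.7),
]
# (Timestamp, quantity, price)
trades = [(25, 100, 132.5), (31, 80, 132.7)]

quote_times = [q[0] for q in quotes]

def most_recent_quote(trade_ts):
    """Return the last quote whose Timestamp is <= trade_ts, or None."""
    i = bisect_right(quote_times, trade_ts) - 1
    return quotes[i] if i >= 0 else None

for ts, qty, price in trades:
    q = most_recent_quote(ts)
    print(ts, qty, price, q[1], q[2])  # trade 25 pairs with quote 20; trade 31 with quote 30
```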

Error in mySparkDF.show() : could not find function "mySparkDF.show"

I want to start working with SparkR. I followed tutorials, but I get the below error:
library(SparkR)
Sys.setenv(SPARK_HOME="/Users/myuserhone/dev/spark-2.2.0-bin-hadoop2.7")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
spark <- sparkR.session(appName = "mysparkr", Sys.getenv("SPARK_HOME"), master = "local[*]")
csvPath <- "file:///Users/myuserhome/dev/spark-data/donation"
mySparkDF <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "?")
mySparkDF.show()
But I get:
Error in mySparkDF.show() : could not find function "mySparkDF.show"
Not sure what I'm doing wrong; in addition, I don't have code completion for the Spark functions like read.df(...).
Also, if I try
show(describe(mySparkDF))
or
show(summary(mySparkDF))
I get the schema metadata in the results, and not the expected describe output:
SparkDataFrame[summary:string, id_1:string, id_2:string, cmp_fname_c1:string, cmp_fname_c2:string, cmp_lname_c1:string, cmp_lname_c2:string, cmp_sex:string, cmp_bd:string, cmp_bm:string, cmp_by:string, cmp_plz:string]
Anything I'm doing wrong?
show is not used that way in SparkR, nor does it serve the same purpose as the same-name command in PySpark; you should use either head or showDF:
df <- as.DataFrame(faithful)
show(df)
# result:
SparkDataFrame[eruptions:double, waiting:double]
head(df)
# result:
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
showDF(df)
# result:
+---------+-------+
|eruptions|waiting|
+---------+-------+
| 3.6| 79.0|
| 1.8| 54.0|
| 3.333| 74.0|
| 2.283| 62.0|
| 4.533| 85.0|
| 2.883| 55.0|
| 4.7| 88.0|
| 3.6| 85.0|
| 1.95| 51.0|
| 4.35| 85.0|
| 1.833| 54.0|
| 3.917| 84.0|
| 4.2| 78.0|
| 1.75| 47.0|
| 4.7| 83.0|
| 2.167| 52.0|
| 1.75| 62.0|
| 4.8| 84.0|
| 1.6| 52.0|
| 4.25| 79.0|
+---------+-------+
only showing top 20 rows

Different behavior between pdf_document and bookdown::pdf_document2 when using compareGroups

I am having an issue when knitting documents with bookdown::pdf_document2 that doesn't appear when using the standard pdf_document.
Specifically, I am using the compareGroups library and the export2md function to output comparison tables such as the one shown below:
This is successful when I use output: pdf_document. However, the table is not properly created when I use output: bookdown::pdf_document2.
There are clearly differences in the tex files, and I am able to manually copy the table from the tex output by pdf_document into the pdf_document2 version. Does anyone have any thoughts on how to get bookdown to correctly create the table? I have created a repo reproducing the bug for more details: https://github.com/vitallish/bookdown-bug
Overview
bookdown::pdf_document2() is different from rmarkdown::pdf_document(): the former sets $opts_knit$kable.force.latex to TRUE, while the latter leaves it at the default value (FALSE).
check .md file
I think the process from .md to .tex should be the same, so the difference in the .tex files is probably due to a difference in the .md files. I ran the following code to keep the intermediate .md files:
rmarkdown::render('pdf_document.Rmd', clean = FALSE)
file.remove('pdf_document.utf8.md');
rmarkdown::render('pdf_document2.Rmd', clean = FALSE)
file.remove('pdf_document2.utf8.md');
pdf_document.knit.md
Table: Summary descriptives table by groups of `Sex'
Var Male N=1101 Female N=1193 p.overall
----------------------------------------------- --------------- ----------------- -----------
Recruitment year: 0.506
1995 206 (18.7%) 225 (18.9%)
2000 390 (35.4%) 396 (33.2%)
2005 505 (45.9%) 572 (47.9%)
Age 54.8 (11.1) 54.7 (11.0) 0.840
Smoking status: <0.001
Never smoker 301 (28.1%) 900 (77.5%)
Current or former < 1y 410 (38.3%) 183 (15.7%)
Former >= 1y 360 (33.6%) 79 (6.80%)
Systolic blood pressure 134 (18.9) 129 (21.2) <0.001
Diastolic blood pressure 81.7 (10.2) 77.8 (10.5) <0.001
pdf_document2.knit.md
\begin{table}
\caption{(\#tab:md-output)Summary descriptives table by groups of `Sex'}
\centering
\begin{tabular}[t]{l|c|c|c}
\hline
Var & Male N=1101 & Female N=1193 & p.overall\\
\hline
Recruitment year: & & & 0.506\\
\hline
\ \ \ \ 1995 & 206 (18.7\%) & 225 (18.9\%) & \\
\hline
\ \ \ \ 2000 & 390 (35.4\%) & 396 (33.2\%) & \\
\hline
\ \ \ \ 2005 & 505 (45.9\%) & 572 (47.9\%) & \\
\hline
Age & 54.8 (11.1) & 54.7 (11.0) & 0.840\\
\hline
Smoking status: & & & <0.001\\
\hline
\ \ \ \ Never smoker & 301 (28.1\%) & 900 (77.5\%) & \\
\hline
\ \ \ \ Current or former < 1y & 410 (38.3\%) & 183 (15.7\%) & \\
\hline
\ \ \ \ Former >= 1y & 360 (33.6\%) & 79 (6.80\%) & \\
\hline
Systolic blood pressure & 134 (18.9) & 129 (21.2) & <0.001\\
\hline
Diastolic blood pressure & 81.7 (10.2) & 77.8 (10.5) & <0.001\\
\hline
\end{tabular}
\end{table}
That explains why you see different appearance in the pdf output.
explore
To further explore the reason,
> pdf <- rmarkdown::pdf_document()
> pdf2 <- bookdown::pdf_document2()
> all.equal(pdf, pdf2)
[1] "Length mismatch: comparison on first 11 components"
[2] "Component “knitr”: Component “opts_knit”: target is NULL, current is list"
[3] "Component “pandoc”: Component “args”: Lengths (8, 12) differ (string compare on first 8)"
[4] "Component “pandoc”: Component “args”: 8 string mismatches"
[5] "Component “pandoc”: Component “ext”: target is NULL, current is character"
[6] "Component “pre_processor”: target, current do not match when deparsed"
[7] "Component “post_processor”: target is NULL, current is function"
Since knitr converts R Markdown to pandoc markdown, I guessed that $knitr caused the difference in the .md files.
> all.equal(pdf$knitr, pdf2$knitr)
[1] "Component “opts_knit”: target is NULL, current is list"
> pdf2$knitr$opts_knit
$bookdown.internal.label
[1] TRUE
$kable.force.latex
[1] TRUE
kable is the function that outputs the table, so $knitr$opts_knit$kable.force.latex might be the root cause.
verify
To test my assumption,
pdf3 <- pdf2
pdf3$knitr$opts_knit$kable.force.latex = FALSE
rmarkdown::render('pdf_document3.Rmd', clean = FALSE, output_format = pdf3)
file.remove('pdf_document3.utf8.md')
pdf_document3.knit.md
Var Male N=1101 Female N=1193 p.overall
----------------------------------------------- --------------- ----------------- -----------
Recruitment year: 0.506
1995 206 (18.7%) 225 (18.9%)
2000 390 (35.4%) 396 (33.2%)
2005 505 (45.9%) 572 (47.9%)
Age 54.8 (11.1) 54.7 (11.0) 0.840
Smoking status: <0.001
Never smoker 301 (28.1%) 900 (77.5%)
Current or former < 1y 410 (38.3%) 183 (15.7%)
Former >= 1y 360 (33.6%) 79 (6.80%)
Systolic blood pressure 134 (18.9) 129 (21.2) <0.001
Diastolic blood pressure 81.7 (10.2) 77.8 (10.5) <0.001
Wa oh!
Advanced
Actually, compareGroups::export2md uses knitr::kable as its workhorse:
> compareGroups::export2md
function (x, which.table = "descr", nmax = TRUE, header.labels = c(),
caption = NULL, ...)
{
if (!inherits(x, "createTable"))
stop("x must be of class 'createTable'")
...
if (ww %in% c(1)) {
...
table1 <- table1[-1, , drop = FALSE]
return(knitr::kable(table1, align = align, row.names = FALSE,
caption = caption[1]))
}
if (ww %in% c(2)) {
table2 <- prepare(x, nmax = nmax, c())[[2]]
...
return(knitr::kable(table2, align = align, row.names = FALSE,
caption = caption[2]))
}
}
which uses kable.force.latex as an internal option to adjust its output. If you browse the GitHub repository of knitr, you can find the following code in the R/utils.R file:
kable = function(
x, format, digits = getOption('digits'), row.names = NA, col.names = NA,
align, caption = NULL, format.args = list(), escape = TRUE, ...
) {
# determine the table format
if (missing(format) || is.null(format)) format = getOption('knitr.table.format')
if (is.null(format)) format = if (is.null(pandoc_to())) switch(
out_format() %n% 'markdown',
latex = 'latex', listings = 'latex', sweave = 'latex',
html = 'html', markdown = 'markdown', rst = 'rst',
stop('table format not implemented yet!')
) else if (isTRUE(opts_knit$get('kable.force.latex')) && is_latex_output()) {
# force LaTeX table because Pandoc's longtable may not work well with floats
# http://tex.stackexchange.com/q/276699/9128
'latex'
} else 'pandoc'
if (is.function(format)) format = format()
...
structure(res, format = format, class = 'knitr_kable')
}
Conclusion
$knitr$opts_knit$kable.force.latex = TRUE causes bookdown::pdf_document2() to insert LaTeX code in the .md file, while rmarkdown::pdf_document() preserves the markdown code, which gives pandoc the chance to produce a pretty table.
I don't think this is a bug. Yihui Xie (the author of bookdown) might have a specific reason to do this, and bookdown::pdf_document2() never needed to behave the same as rmarkdown::pdf_document().
This problem with export2md has been solved in the latest version of the compareGroups package (4.0), available on GitHub. You can install this newest version by typing:
library(devtools)
devtools::install_github("isubirana/compareGroups")
I hope this version will be submitted to CRAN very soon.

Recursive query with sub-graph aggregation

I am trying to use Neo4j to write a query that aggregates quantities along a particular sub-graph.
We have two stores, Store1 and Store2, one with supplier S1, the other with supplier S2. We move 100 units from Store1 into Store3 and 200 units from Store2 to Store3.
We then move 100 units from Store3 to Store4. So now Store4 has 100 units, of which approximately 33 originated from supplier S1 and 66 from supplier S2.
I need the query to effectively return this information, E.g.
S1, 33
S2, 66
I have a recursive query to aggregate all the movements along each path:
MATCH p=(store1:Store)-[m:MOVE_TO*]->(store2:Store { Name: 'Store4'})
RETURN store1.Supplier, reduce(amount = 0, n IN relationships(p) | amount + n.Quantity) AS reduction
Returns:
| store1.Supplier | reduction |
|-----------------|-----------|
| S1              | 200       |
| S2              | 300       |
| null            | 100       |
Desired:
| store1.Supplier | reduction |
|-----------------|-----------|
| S1              | 33.33     |
| S2              | 66.67     |
What about this one:
MATCH (s:Store) WHERE s.name = 'Store4'
MATCH (s)<-[t:MOVE_TO]-()<-[r:MOVE_TO]-(supp)
WITH t.qty as total, collect(r) as movements
WITH total, movements, reduce(totalSupplier = 0, r IN movements | totalSupplier + r.qty) as supCount
UNWIND movements as movement
RETURN startNode(movement).name as supplier, round(100.0*movement.qty/supCount) as pct
Which returns:
supplier  pct
Store1    33
Store2    67
Returned 2 rows in 151 ms
So the following is pretty ugly, but it works for the example you've given.
MATCH (s4:Store { Name:'Store4' })<-[r1:MOVE_TO]-(s3:Store)<-[r2:MOVE_TO*]-(s:Store)
WITH s3, r1.Quantity as Factor, SUM(REDUCE(amount = 0, r IN r2 | amount + r.Quantity)) AS Total
MATCH (s3)<-[r1:MOVE_TO*]-(s:Store)
WITH s.Supplier as Supplier, REDUCE(amount = 0, r IN r1 | amount + r.Quantity) AS Quantity, Factor, Total
RETURN Supplier, Quantity, Total, toFloat(Quantity) / toFloat(Total) * Factor as Proportion
I'm sure it can be improved.
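Outside of Cypher, the desired numbers fall out of simple proportional blending: each store's stock is a mix of supplier origins, and every movement carries the source store's mix with it proportionally. A minimal Python sketch of that accounting (the data structures are hypothetical; the movements are hard-coded from the example):

```python
# Each store's stock as {supplier: units}; movements are processed in order.
stock = {
    "Store1": {"S1": 100.0},  # Store1 supplied by S1
    "Store2": {"S2": 200.0},  # Store2 supplied by S2
}
movements = [
    ("Store1", "Store3", 100),
    ("Store2", "Store3", 200),
    ("Store3", "Store4", 100),
]

for src, dst, qty in movements:
    total = sum(stock[src].values())
    dst_mix = stock.setdefault(dst, {})
    for supplier, units in stock[src].items():
        share = qty * units / total  # move proportionally per supplier
        stock[src][supplier] = units - share
        dst_mix[supplier] = dst_mix.get(supplier, 0.0) + share

print({s: round(u, 2) for s, u in stock["Store4"].items()})  # {'S1': 33.33, 'S2': 66.67}
```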
