How to aggregate sum all the columns in Kusto? - azure-data-explorer

For the following datatable, is there a way to get the expected result without having to specify all the columns one by one? The problem is that my real table has 20+ columns, so I am looking for a cleaner solution.
Expected Result
Col1Sum | Col2Sum | Col3Sum | Col4Sum
--------------------------------------
3 | 3 | 3 | 3
Table + Query
datatable(Col1: int, Col2: int, Col3: int, Col4: int)
[
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
]
| summarize
Col1Sum = sum(Col1),
Col2Sum = sum(Col2),
Col3Sum = sum(Col3),
Col4Sum = sum(Col4);

You can generate the query using a Kusto query:
datatable(Col1: int, Col2: int, Col3: int, Col4: int)
[
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
]
| getschema
| extend SumColumn = strcat(ColumnName, "Sum = sum(", ColumnName, ") ")
| summarize replace('"|\\[|]', "", tostring(make_list(SumColumn)))
| project Query = strcat("summarize ", Column1)
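The same query-generation idea can be sketched outside Kusto as well. This is a minimal Python sketch, assuming you already have the column names as a list (the helper name build_summarize and the column list are hypothetical); it only builds the query text and does not run it against a cluster.

```python
def build_summarize(columns):
    """Return a KQL summarize clause that sums every listed column."""
    parts = [f"{name}Sum = sum({name})" for name in columns]
    return "summarize " + ", ".join(parts)


# Hypothetical column list; in practice it would come from the table schema.
columns = ["Col1", "Col2", "Col3", "Col4"]
query = build_summarize(columns)
print(query)
```

The generated text can then be pasted after the table name and a pipe, the same way as the output of the getschema query above.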

Related

alternative of for loop in R

cell_support_xyz <- function(level, zero)
{
for(i in 1:level[1]){
for(j in 1:level[2]){
for(k in 1:level[3]){
cat("cell (", i, ", ", j, ", ", k,") --> support set = (",
+!(i == zero[1]), ", ", +!(j == zero[2]), ", ", +!(k == zero[3]), ")\n", sep = "")
}
}
}
}
#Example 1
l<-c(2,3,2)
z<-c(1,1,1)
> cell_support_xyz(l,z)
cell (1, 1, 1) --> support set = (0, 0, 0)
cell (1, 1, 2) --> support set = (0, 0, 1)
cell (1, 2, 1) --> support set = (0, 1, 0)
cell (1, 2, 2) --> support set = (0, 1, 1)
cell (1, 3, 1) --> support set = (0, 1, 0)
cell (1, 3, 2) --> support set = (0, 1, 1)
cell (2, 1, 1) --> support set = (1, 0, 0)
cell (2, 1, 2) --> support set = (1, 0, 1)
cell (2, 2, 1) --> support set = (1, 1, 0)
cell (2, 2, 2) --> support set = (1, 1, 1)
cell (2, 3, 1) --> support set = (1, 1, 0)
cell (2, 3, 2) --> support set = (1, 1, 1)
The above code works just fine. But I want to avoid for loop. Here I used 3 for loops (because the length of both argument vectors is 3). If the length of vectors increases or decreases the function won't work (I need to adjust for loops accordingly); which is why I want to replace for-loop with some efficient alternative that works for any length. Any suggestion?
Here is one way to remove the for loops and make the solution flexible enough for input of any length.
We use expand.grid to create all possible combinations of level and use apply rowwise to create a string to print.
cell_support_xyz <- function(level, zero) {
tmp <- do.call(expand.grid, lapply(level, seq))
abc <- apply(tmp, 1, function(x)
cat(sprintf('cell (%s) --> support set = (%s)\n',
toString(x), toString(+(x != zero)))))
}
l<-c(2,3,2)
z<-c(1,1,1)
cell_support_xyz(l, z)
#cell (1, 1, 1) --> support set = (0, 0, 0)
#cell (2, 1, 1) --> support set = (1, 0, 0)
#cell (1, 2, 1) --> support set = (0, 1, 0)
#cell (2, 2, 1) --> support set = (1, 1, 0)
#cell (1, 3, 1) --> support set = (0, 1, 0)
#cell (2, 3, 1) --> support set = (1, 1, 0)
#cell (1, 1, 2) --> support set = (0, 0, 1)
#cell (2, 1, 2) --> support set = (1, 0, 1)
#cell (1, 2, 2) --> support set = (0, 1, 1)
#cell (2, 2, 2) --> support set = (1, 1, 1)
#cell (1, 3, 2) --> support set = (0, 1, 1)
#cell (2, 3, 2) --> support set = (1, 1, 1)
You can do that in 2 steps:
l<-c(2,3,2)
z<-c(1,1,1)
cells <- expand.grid(lapply(l, seq))
t(apply(cells, 1, function(x) 1L*!(x == z)))
cells contains all the combinations. If the order matters, you can simply reorder it:
cells <- dplyr::arrange(cells, Var1, Var2, Var3)
Then, for each row (apply(,1,)) you can use == which is already vectorized to compare the entire row to the entire z vector.
Multiplying by 1L makes it integer, same as +.
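For comparison, the same "build all combinations, then compare each one to the zero vector" idea can be sketched in Python. This is only an illustration of the logic, not a replacement for the R answers; the function names cell_support and format_cells are made up for the example. Note that itertools.product varies the last index fastest, which matches the order of the original nested loops.

```python
from itertools import product


def cell_support(level, zero):
    """Yield (cell, support) pairs for every combination of 1..level[i]."""
    ranges = [range(1, n + 1) for n in level]
    for cell in product(*ranges):
        # 1 where the index differs from the zero vector, 0 where it matches
        support = tuple(int(c != z) for c, z in zip(cell, zero))
        yield cell, support


def format_cells(level, zero):
    """Format each cell the same way as the R function's output."""
    return [
        "cell ({}) --> support set = ({})".format(
            ", ".join(map(str, cell)), ", ".join(map(str, support))
        )
        for cell, support in cell_support(level, zero)
    ]


lines = format_cells([2, 3, 2], [1, 1, 1])
print("\n".join(lines))
```

Because nothing depends on the number of dimensions, the same call works for level and zero vectors of any (equal) length.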

count cumulative values across factor levels over time in r

I have a very large dataframe that looks like so:
month <- c(201101, 201101, 201101, 201102, 201102, 201102, 201103, 201103, 201103, 201104, 201104, 201104)
su <- as.factor(c("045110B238", "045110B238", "045110B238", "045110B238", "045110B238", "045110B238", "045110B238", "045110B238", "045110B238", "045110B238", "045110B238", "045110B238"))
item <- as.factor(c("045110B238A01", "045110B238A02", "045110B238A03", "045110B238A01", "045110B238A02", "045110B238A03", "045110B238A01", "045110B238A02", "045110B238A03", "045110B238A01", "045110B238A02", "045110B238A03"))
item.dlq <- c(1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1)
df <- data.frame(month, su, item, item.dlq)
Using the item.dlq variable, I count the cumulative number of months for which each item has item.dlq == 1:
library(dplyr)
df <- data.frame(df %>%
group_by(item, grp = cumsum(item.dlq == 0)) %>%
mutate(item.cum.dlq = cumsum(item.dlq)))
which should give me a vector like so:
item.cum.dlq <- c(1, 1, 1, 2, 0, 2, 3, 1, 3, 4, 2, 4)
Based on the information above, I would like to
create a variable that counts the number of consecutive months in which ALL items for the su have values of dlq==1.
count the number of consecutive months when at least 1 itemcode has a value of 1. For example, where month is equal to 201102 (i.e. 2/2011), item 045110B238A02 has item.dlq == 0, so only 2/3 items have dlq == 1.
Note that there is only one value of su in the example above, but there are many in the full data frame I am working with. I would also like to compress the data frame, if possible, to avoid carrying around unnecessary observations. Here is what the raw data would look like without compressing:
su.cum.fulldlq <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 2, 2, 2) ## all items dlq ==1
su.cum.partdlq <- c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0) ## at least 1 item but not all have dlq == 1
If the data frame were compressed, it would look like so:
month <- c(201101, 201102, 201103, 201104)
su <- c("045110B238", "045110B238", "045110B238", "045110B238")
su.cum.fulldlq <- c(1, 0, 1, 2)
su.cum.partdlq <- c(0, 1, 0, 0)
I was thinking something along the lines of this, but I keep getting error messages.
df <- data.frame(df %>%
group_by(su, month)) %>%
mutate(burden = n_distinct(itemcode)) # count number of items
mutate(dlq.items = n_distinct(dlq == 1)) %>% # count number of items where dlq == 1
mutate(full.dlq = ifelse(burden == dlq_items, 1, 0)) %>% # if number of items equals the number of items with dlq == 1, then full.dlq == 1.
After this I am not certain at all.
Is there a way to do so using dplyr? If not, any other approaches would be welcome. If something is not clear please comment and I will change it. Either way, any help or suggestions would be greatly appreciated. Thanks so much!
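This is not dplyr, but the logic described above can be pinned down with a small plain-Python sketch. It assumes the "partial" counter means "at least one item, but not all, has dlq == 1" (as in the comment on su.cum.partdlq), and that a streak resets to 0 whenever its condition fails; the function name su_streaks is made up for the example.

```python
from collections import defaultdict


def su_streaks(rows):
    """rows: iterable of (month, su, item, dlq) tuples.

    Returns one (month, su, full_streak, part_streak) tuple per su and month,
    i.e. the compressed form of the data.
    """
    flags_by_group = defaultdict(list)
    for month, su, _item, dlq in rows:
        flags_by_group[(su, month)].append(dlq)

    full = defaultdict(int)  # running streak per su: all items have dlq == 1
    part = defaultdict(int)  # running streak per su: some, but not all, items
    result = []
    for su, month in sorted(flags_by_group):  # months in order within each su
        flags = flags_by_group[(su, month)]
        all_dlq = all(f == 1 for f in flags)
        some_dlq = any(f == 1 for f in flags) and not all_dlq
        full[su] = full[su] + 1 if all_dlq else 0
        part[su] = part[su] + 1 if some_dlq else 0
        result.append((month, su, full[su], part[su]))
    return result


months = [201101] * 3 + [201102] * 3 + [201103] * 3 + [201104] * 3
items = ["045110B238A01", "045110B238A02", "045110B238A03"] * 4
dlq = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
rows = [(m, "045110B238", i, d) for m, i, d in zip(months, items, dlq)]

compressed = su_streaks(rows)
for row in compressed:
    print(row)
```

On the example data the two streak columns come out as 1, 0, 1, 2 and 0, 1, 0, 0, which matches the compressed vectors shown in the question.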

Abelian group quotient in sage

Let d1 and d2 be matrices over the integers Z. How can I compute the group quotient ker d1 / im d2 in Sage?
So far I've been able to compute a basis for the kernel and image as follows:
M24 = MatrixSpace(IntegerRing(),2,4)
d1 = M24([-1,1, 1,-1, -1,1, 1,-1])
kerd1 = d1.right_kernel().basis()
M43 = MatrixSpace(IntegerRing(),4,3)
d2 = M43([1,1,-1, 1,-1,-1, 1,-1,1, 1,1,1])
imd2 = d2.column_space().basis()
which gives output:
kerd1 = [
(1, 0, 0, -1),
(0, 1, 0, 1),
(0, 0, 1, 1)
]
imd2 = [
(1, 1, 1, 1),
(0, 2, 0, -2),
(0, 0, 2, 2)
]
I tried to compute the quotient like this:
Z4.<a,b,c,d> = AbelianGroup(4, [0,0,0,0])
G = Z4.subgroup([a/d, b*d, c*d])
H = Z4.subgroup([a*b*c*d, b^2/d^2, c^2*d^2])
G.quotient(H)
But I got a NotImplementedError.
I found two ways to do this:
d1 = matrix(ZZ,4,2, [-1,1, 1,-1, -1,1, 1,-1]).transpose()
d2 = matrix(ZZ,4,3, [1,1,-1, 1,-1,-1, 1,-1,1, 1,1,1])
(d1.right_kernel() / (d2.column_space())).invariants()
# OUTPUT: (2, 2)
ChainComplex([d2, d1]).homology()[1]
# OUTPUT: C2 x C2

Is there a way to obtain coefficients for each step of the optimization algorithm in glm function?

When one performs a logit regression in R, it is possible to obtain the coefficients after the optimization algorithm has converged (or not) with the coefficients() function:
library(MASS)
data(menarche)
glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age,
family=binomial(logit), data=menarche)
coefficients(glm.out)
## (Intercept) Age
## -21.226395 1.631968
Is there a way to obtain coefficients for each step of the optimization algorithm to trace its steps?
The internals of glm.fit have changed (see the comment from @John), so use this instead. It does not rely on line positions within the internals. Instead, it intercepts each call to cat in glm.fit and prints coefold right after the iteration message, so although it still depends on the internals, it should be a bit less fragile. This worked for me in R 4.1 and 4.2.
library(MASS)
data(menarche)
trace(glm.fit, quote(cat <- function(...) {
base::cat(...)
if (...length() >= 3 && identical(..3, " Iterations - ")) print(coefold)
}))
glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age,
family=binomial(logit), data=menarche,
control = glm.control(trace = TRUE))
untrace(glm.fit)
Previous solution
The control= argument with the value shown causes the deviance to print and the trace statement will cause the coefficient values to print:
trace(glm.fit, quote(print(coefold)), at = list(c(22, 4, 8, 4, 19, 3)))
glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age,
family=binomial(logit), data=menarche,
control = glm.control(trace = TRUE))
The output will look like this:
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
NULL
Deviance = 27.23412 Iterations - 1
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] -20.673652 1.589536
Deviance = 26.7041 Iterations - 2
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] -21.206854 1.630468
Deviance = 26.70345 Iterations - 3
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .... step 22,4,8,4,19,3
[1] -21.226370 1.631966
Deviance = 26.70345 Iterations - 4
To remove the trace use:
untrace(glm.fit)
Note that in the trace call, coefold is the name of a variable used internally in the glm.fit source code, and the numbers refer to statement positions in that source code, so either may need to be changed if the glm.fit source changes. I am using "R version 3.2.2 Patched (2015-10-19 r69550)".
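To make it clearer what the traced values represent, here is a minimal Python sketch of a Newton-Raphson fit for a grouped binomial logit model that records the coefficients at every iteration. It is not glm.fit: it starts from zero coefficients rather than R's starting values, and the data below is hypothetical (it is not the menarche data set), so the numbers will not match the output above.

```python
import math


def fit_logit_newton(x, successes, totals, max_iter=25, tol=1e-8):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson, keeping every iterate."""
    b0, b1 = 0.0, 0.0
    history = [(b0, b1)]
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, successes, totals):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = yi - ni * p      # contribution to the score (gradient)
            w = ni * p * (1.0 - p)   # contribution to the information matrix
            g0 += resid
            g1 += xi * resid
            h00 += w
            h01 += xi * w
            h11 += xi * xi * w
        # Solve the 2x2 system (information matrix) * step = score
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        history.append((b0, b1))
        if max(abs(d0), abs(d1)) < tol:
            break
    return history


# Hypothetical grouped data: age, number of successes, group size
ages = [10, 11, 12, 13, 14, 15]
succ = [1, 3, 8, 13, 17, 19]
size = [20, 20, 20, 20, 20, 20]

history = fit_logit_newton(ages, succ, size)
for step, (b0, b1) in enumerate(history):
    print(step, b0, b1)
```

Each printed row plays the role of one coefold value in the traced R output: the coefficients as they stand after that iteration.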

Rearrange data for ANOVA

I haven't quite got my head around R and how to rearrange data. I have an old SPSS data file that needs rearranging so I can conduct an ANOVA in R.
My current data file has this format:
ONE <- matrix(c(1, 2, 777.75, 609.30, 700.50, 623.45, 701.50, 629.95, 820.06, 651.95,"nofear","nofear"), nr=2,dimnames=list(c("1", "2"), c("SUBJECT","AAYY", "BBYY", "AAZZ", "BBZZ", "XX")))
And I need to rearrange it to this:
TWO <- matrix(c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 777.75, 701.5, 700.5, 820.06, 609.3, 629.95, 623.95, 651.95), nr=8, dimnames=list(c("1", "1", "1", "1", "2", "2", "2", "2"), c("SUBJECT","AA", "ZZ", "XX", "RT")))
I am sure that there is an easy way of doing it, rather than hand coding. Thanks for the consideration.
This should do it. You can tweak it a bit, but this is the idea:
library(reshape)
THREE <- melt(as.data.frame(ONE),id=c("SUBJECT","XX"))
THREE$AA <- grepl("AA",THREE$variable)
THREE$ZZ <- grepl("ZZ",THREE$variable)
THREE$variable <- NULL
# cleanup
THREE$XX <- as.factor(THREE$XX)
THREE$AA <- as.numeric(THREE$AA)
THREE$ZZ <- as.numeric(THREE$ZZ)
The reshape package and reshape() both help with this kind of thing, but in this simple case, where you have to generate the variables anyway, hand coding is pretty easy: just take advantage of automatic replication (recycling) in R.
TWO <- data.frame(SUBJECT = rep(1:2,each = 4),
AA = rep(1:0, each = 2),
ZZ = 0:1,
XX = 1,
RT = as.numeric(t(ONE[, c("AAYY", "AAZZ", "BBYY", "BBZZ")]))) # column order must match the AA/ZZ pattern above
That gives the TWO you asked for, but it doesn't generalize easily to a larger ONE. I think this makes more sense:
n <- nrow(ONE)
TWO <- data.frame(SUBJECT = rep(ONE$SUBJECT, 4),
AB = rep(1:0, each = n),
YZ = rep(0:1, each = 2*n),
fear = ONE$XX,
RT = unlist(ONE[,2:5]))
This latter version uses more descriptive variable names and handles the likely case that your data is actually much bigger, with XX (fear) varying and more subjects. Also, since you're reading it in from an SPSS data file, ONE is actually a data frame with numeric columns and with the character columns stored as factors. The actual reshaping is only this part of the code:
TWO <- data.frame(SUBJECT = rep(ONE$SUBJECT, 4),
fear = ONE$XX,
RT = unlist(ONE[,2:5]))
You could add in other variables afterward.
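The wide-to-long step itself is language independent, so here is a small Python sketch of it, assuming each row of ONE is a dict keyed by column name (the helper wide_to_long is made up for the example). The AA and ZZ indicators are derived from the measure column names, the same idea as the grepl() calls in the first answer.

```python
def wide_to_long(rows, measure_columns):
    """Turn one row per subject into one row per subject and measure."""
    long_rows = []
    for row in rows:
        for col in measure_columns:
            long_rows.append({
                "SUBJECT": row["SUBJECT"],
                "AA": int(col.startswith("AA")),  # 1 for AA*, 0 for BB*
                "ZZ": int(col.endswith("ZZ")),    # 1 for *ZZ, 0 for *YY
                "XX": row["XX"],
                "RT": row[col],
            })
    return long_rows


ONE = [
    {"SUBJECT": 1, "AAYY": 777.75, "BBYY": 700.50, "AAZZ": 701.50,
     "BBZZ": 820.06, "XX": "nofear"},
    {"SUBJECT": 2, "AAYY": 609.30, "BBYY": 623.45, "AAZZ": 629.95,
     "BBZZ": 651.95, "XX": "nofear"},
]

TWO = wide_to_long(ONE, ["AAYY", "AAZZ", "BBYY", "BBZZ"])
for row in TWO:
    print(row)
```

Here XX is kept as the original label rather than recoded to 1, and the order of measure_columns controls the row order within each subject.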
