Here is my sample data and my query. How can I calculate the percentage of failures and create a new column value based on that percentage?
let M = datatable (Run_Date:datetime , Job_Name:string , Job_Status:string , Count:int )
["2020-10-21", "Job_A1", "Succeeded", 10,
"2020-10-21", "Job_A1", "Failed", 8,
"10/21/2020", "Job_B2", "Succeeded", 21,
"10/21/2020", "Job_C3", "Succeeded", 21,
"10/21/2020", "Job_D4", "Succeeded", 136,
"10/21/2020", "Job_E5", "Succeeded", 187,
"10/21/2020", "Job_E5", "Failed", 4
];
M
| summarize count() by Job_Name, Count, summary = strcat(Job_Name, " failed " , Count, " out of ", Count ," times.")
And the desired output is below.
You could try something like this:
datatable (Run_Date:datetime , Job_Name:string , Job_Status:string , Count:int )
[
"2020-10-21", "Job_A1", "Succeeded", 10,
"2020-10-21", "Job_A1", "Failed", 8,
"10/21/2020", "Job_B2", "Succeeded", 21,
"10/21/2020", "Job_C3", "Succeeded", 21,
"10/21/2020", "Job_D4", "Succeeded", 136,
"10/21/2020", "Job_E5", "Succeeded", 187,
"10/21/2020", "Job_E5", "Failed", 4
]
| summarize failures = sumif(Count, Job_Status == "Failed"), total = sum(Count) by Job_Name, Run_Date
| project Job_Name, failure_rate = round(100.0 * failures / total, 2), Summary = strcat(Job_Name, " failed " , failures, " out of ", total ," times.")
| extend ['Alert if failure rate > 40%'] = iff(failure_rate > 40, "Yes", "No")
Result:
| Job_Name | failure_rate | Summary | Alert if failure rate > 40% |
|----------|--------------|-----------------------------------|-----------------------------|
| Job_A1 | 44.44 | Job_A1 failed 8 out of 18 times. | Yes |
| Job_B2 | 0 | Job_B2 failed 0 out of 21 times. | No |
| Job_C3 | 0 | Job_C3 failed 0 out of 21 times. | No |
| Job_D4 | 0 | Job_D4 failed 0 out of 136 times. | No |
| Job_E5 | 2.09 | Job_E5 failed 4 out of 191 times. | No |
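If you only need the jobs that actually breach the threshold (for example, to drive an alert), you could filter on the computed rate; a minimal sketch appended to the query above, reusing its columns and the same illustrative 40% threshold:
| where failure_rate > 40
| project Job_Name, failure_rate, Summary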
Here is my data set in Kusto; I am trying to generate a "releaseRank" column based on the release column value.
Input Dataset:
let T = datatable(release:string, metric:long)
[
"22.05", 20,
"22.04", 40,
"22.03", 50,
"22.01", 560
];
T
|take 100;
Desired output:
I found that Kusto has serialize and row_number():
T
|serialize
|extend releaseRank = row_number()
|take 100;
But if the release value is repeated, I need the releaseRank to be the same. For example, with the data set below I am not getting the desired output:
let T = datatable(release:string, metric:long)
[
"22.05", 20,
"22.05", 21,
"22.04", 40,
"22.03", 50,
"22.01", 560
];
T
|serialize
|extend releaseRank = row_number()
|take 100;
Expected output:
This should do what you want:
let T = datatable(release:string, metric:long)
[
"22.05", 20,
"22.05", 21,
"22.04", 40,
"22.03", 50,
"22.01", 560
];
T
| sort by release desc , metric asc
| extend Rank=row_rank(release)
| release | metric | Rank |
|---------|--------|------|
| 22.05   | 20     | 1    |
| 22.05   | 21     | 1    |
| 22.04   | 40     | 2    |
| 22.03   | 50     | 3    |
| 22.01   | 560    | 4    |
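If row_rank is not available in your environment, a join-based sketch gives the same tied ranking using only distinct, sort, row_number and join (this assumes the release strings sort correctly in descending lexicographic order):
let T = datatable(release:string, metric:long)
[
"22.05", 20,
"22.05", 21,
"22.04", 40,
"22.03", 50,
"22.01", 560
];
// rank each distinct release once, then attach that rank to every row
let Ranks = T
| distinct release
| sort by release desc
| extend releaseRank = row_number();
T
| join kind=inner (Ranks) on release
| project release, metric, releaseRank
| sort by releaseRank asc, metric asc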
I am stuck with a Kusto query.
This is what I want to do: show the day-wise sales amount alongside the previous month's sales amount on the same day.
datatable(DateStamp:datetime, OrderId:string, SalesAmount:int)
[
"02-01-2019", "I01", 100,
"02-01-2019", "I02", 200,
"02-02-2019", "I03", 250,
"02-02-2019", "I04", 150,
"02-03-2019", "I13", 110,
"01-01-2019", "I10", 20,
"01-02-2019", "I11", 50,
"01-02-2019", "I12", 30,
]
| extend SalesDate = format_datetime(DateStamp, 'MM/dd/yyyy')
| summarize AmountOfSales = sum(SalesAmount) by SalesDate
This is what I see.
And instead, this is what I want to show as the result:
I couldn't figure out how to use multiple summarize operators in one query.
Here's an option:
datatable(DateStamp:datetime, OrderId:string, SalesAmount:int)
[
"02-01-2019", "I01", 100,
"02-01-2019", "I02", 200,
"02-02-2019", "I03", 250,
"02-02-2019", "I04", 150,
"02-03-2019", "I13", 110,
"01-01-2019", "I10", 20,
"01-02-2019", "I11", 50,
"01-02-2019", "I12", 30,
]
| summarize AmountOfSales = sum(SalesAmount) by bin(DateStamp, 1d)
| as hint.materialized = true T
| extend prev_month = datetime_add("Month", -1, DateStamp)
| join kind=leftouter T on $left.prev_month == $right.DateStamp
| project SalesDate = format_datetime(DateStamp, 'MM/dd/yyyy'), AmountOfSales, AmountOfSalesPrevMonth = coalesce(AmountOfSales1, 0)
| SalesDate  | AmountOfSales | AmountOfSalesPrevMonth |
|------------|---------------|------------------------|
| 01/01/2019 | 20            | 0                      |
| 01/02/2019 | 80            | 0                      |
| 02/01/2019 | 300           | 20                     |
| 02/02/2019 | 400           | 80                     |
| 02/03/2019 | 110           | 0                      |
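The as hint.materialized=true operator simply binds the name T to the daily summary so it can be self-joined against its own previous-month dates. If you also want the month-over-month delta, one more extend on top of the projected result is enough; the column name MoM_change below is only illustrative:
| extend MoM_change = AmountOfSales - AmountOfSalesPrevMonth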
I am writing a query in Kusto to parse heartbeat data from a sensor. This is what I've written:
datatable(timestamp:datetime, healthycount:int, unhealthycount:int, origin:string)
[
datetime(1910-06-11), 10, 1, 'origin',
datetime(1910-05-11), 9, 2, 'origin'
]
| summarize latest = arg_max(timestamp, *) by origin
| project healthy = healthycount,
unhealthy = unhealthycount
This outputs data like this:
+--------------+----------------+
| healthy | unhealthy |
+--------------+----------------+
| 10 | 1 |
+--------------+----------------+
However, I want to represent this data as a pie chart, but to do that I need the data in the following format:
+----------------+-------+
| key | value |
+----------------+-------+
| healthy | 10 |
| unhealthy | 1 |
+----------------+-------+
Is it possible to do this? What terminology am I looking for?
Here is one way:
datatable(timestamp:datetime, healthycount:int, unhealthycount:int, origin:string)
[
datetime(1910-06-11), 10, 1, 'origin',
datetime(1910-05-11), 9, 2, 'origin'
]
| summarize arg_max(timestamp, *) by origin
| extend Pack = pack("healthycount", healthycount, "unhealthycount", unhealthycount)
| mv-expand kind=array Pack
| project key = tostring(Pack[0]), value = toint(Pack[1])
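Since the goal is a pie chart, you can also let the client (Kusto Explorer or the web UI) draw it directly by appending a render instruction to the query above; render is a client-side visualization hint:
| render piechart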
In Unix, I am printing the unique values of the first character of a field. I am also printing a count of the unique field lengths. Now I would like to do both together. This is easy to do in SQL, but I'm not sure how to do it in Unix with awk (or grep, sed, ...).
PRINT FIRST UNIQ LEADING CHAR
awk -F'|' '{print substr($61,1,1)}' file_name.sqf | sort | uniq
PRINT COUNT OF FIELDS WITH LENGTHS 8, 10, 15
awk -F'|' 'NR>1 {count[length($61)]++} END {print count[8] ", " count[10] ", " count[15]}' file_name.sqf | sort | uniq
DESIRED OUTPUT
first char, length 8, length 10, length 15
a, 10, , 150
b, 50, 43, 31
A, 20, , 44
B, 60, 83, 22
The fields that start with an upper or lower 'a' are never length 10.
The input file is a | delimited .sqf with no header. The field is varChar 15.
sample input
56789 | someValue | aValue | otherValue | 712345
46789 | someValue | bValue | otherValue | 812345
36789 | someValue | AValue | otherValue | 912345
26789 | someValue | BValue | otherValue | 012345
56722 | someValue | aValue | otherValue | 712345
46722 | someValue | bValue | otherValue | 812345
desired output
a: , , 2
b: 1, , 1
A: , , 1
B: , 1,
'a' has two instances that are length 15
'b' has one instance each of length 8 and 15
'A' has one instance that is length 15
'B' has one instance that is length 10
Thank you.
I think you need a better sample input file, but I guess this is what you're looking for:
$ awk -F' \\| ' -v OFS=, '{k=substr($3,1,1); ks[k]; c[k,length($3)]++}
END {for(k in ks) print k": "c[k,6],c[k,10],c[k,15]}' file
A: 1,,
B: 1,,
a: 2,,
b: 2,,
Note that since all the field lengths in the sample are 6, I printed that count instead of 8. With the right data you should be able to get the output you expect. Note, however, that the order of the keys is not preserved; pipe the output through sort if you need it ordered.
I'm trying to port code from R to Scala to perform customer analysis. I have already computed the Recency, Frequency and Monetary factors in Spark into a DataFrame.
Here is the schema of the Dataframe :
df.printSchema
root
|-- customerId: integer (nullable = false)
|-- recency: long (nullable = false)
|-- frequency: long (nullable = false)
|-- monetary: double (nullable = false)
And here is a data sample as well :
df.orderBy($"customerId").show
+----------+-------+---------+------------------+
|customerId|recency|frequency| monetary|
+----------+-------+---------+------------------+
| 1| 297| 114| 733.27|
| 2| 564| 11| 867.66|
| 3| 1304| 1| 35.89|
| 4| 287| 25| 153.08|
| 6| 290| 94| 316.772|
| 8| 1186| 3| 440.21|
| 11| 561| 5| 489.70|
| 14| 333| 57| 123.94|
I'm trying to find, for each column, the interval on a quantile vector that contains each value, given a probability segment.
In other words, given a probability vector of non-decreasing breakpoints,
in my case it will be the quantile vector, find the interval containing each element of x;
i.e. (pseudo-code),
if i <- findInterval(x,v),
for each index j in x
v[i[j]] ≤ x[j] < v[i[j] + 1] where v[0] := - Inf, v[N+1] := + Inf, and N <- length(v).
In R, this translates to the following code :
probSegment <- c(0.0, 0.25, 0.50, 0.75, 1.0)
RFM_table$Rsegment <- findInterval(RFM_table$Recency, quantile(RFM_table$Recency, probSegment))
RFM_table$Fsegment <- findInterval(RFM_table$Frequency, quantile(RFM_table$Frequency, probSegment))
RFM_table$Msegment <- findInterval(RFM_table$Monetary, quantile(RFM_table$Monetary, probSegment))
I'm kind of stuck with the quantile function, though.
In an earlier discussion with @zero323, he suggested that I use the percent_rank window function as a shortcut. I'm not sure that I can apply percent_rank in this case.
How can I apply a quantile function to a DataFrame column with Spark in Scala? If that is not possible, can I use the percent_rank function instead?
Thanks.
Well, I still believe that percent_rank is good enough here. The percent_rank window function is computed as
    percent_rank(x) = (rank(x) - 1) / (n - 1)
where n is the number of rows in the partition. Let's define pr as
    pr = percent_rank(x) = (rank(x) - 1) / (n - 1)
Transforming as follows,
    rank(x) = pr * (n - 1) + 1,
gives a definition of a percentile used, according to Wikipedia, by Microsoft Excel.
So the only thing you really need is a findInterval UDF which will return the correct interval index. Alternatively, you can use rank directly and match on rank ranges.
Edit
OK, it looks like percent_rank is not a good idea after all:
WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation
I am not exactly sure what the point of moving all the data to a single partition to call a non-aggregate function is, but it looks like we are back to square one. It is possible to use zipWithIndex on a plain RDD:
import org.apache.spark.sql.{Row, DataFrame, Column}
import org.apache.spark.sql.types.{StructType, StructField, LongType}
import org.apache.spark.sql.functions.udf
val df = sc.parallelize(Seq(
(1, 297, 114, 733.27),
(2, 564, 11, 867.66),
(3, 1304, 1, 35.89),
(4, 287, 25, 153.08),
(6, 290, 94, 316.772),
(8, 1186, 3, 440.21),
(11, 561, 5, 489.70),
(14, 333, 57, 123.94)
)).toDF("customerId", "recency", "frequency", "monetary")
df.registerTempTable("df")
sqlContext.cacheTable("df")
A small helper:
def addRowNumber(df: DataFrame): DataFrame = {
// Prepare new schema
val schema = StructType(
StructField("row_number", LongType, false) +: df.schema.fields)
// Add row number
val rowsWithIndex = df.rdd.zipWithIndex
.map{case (row: Row, idx: Long) => Row.fromSeq(idx +: row.toSeq)}
// Create DataFrame
sqlContext.createDataFrame(rowsWithIndex, schema)
}
and the actual function:
def findInterval(df: DataFrame, column: Column,
probSegment: Array[Double], outname: String): DataFrame = {
val n = df.count
// Map quantiles to indices
val breakIndices = probSegment.map(p => (p * (n - 1)).toLong)
// Add row number
val dfWithRowNumber = addRowNumber(df.orderBy(column))
// Map indices to values
val breaks = dfWithRowNumber
.where($"row_number".isin(breakIndices:_*))
.select(column.cast("double"))
.map(_.getDouble(0))
.collect
// Get interval
val f = udf((x: Double) =>
scala.math.abs(java.util.Arrays.binarySearch(breaks, x) + 1))
// Final result
dfWithRowNumber
.select($"*", f(column.cast("double")).alias(outname))
.drop("row_number")
}
and example usage:
scala> val probs = Array(0.0, 0.25, 0.50, 0.75, 1.0)
probs: Array[Double] = Array(0.0, 0.25, 0.5, 0.75, 1.0)
scala> findInterval(df, $"recency", probs, "interval").show
+----------+-------+---------+--------+--------+
|customerId|recency|frequency|monetary|interval|
+----------+-------+---------+--------+--------+
| 4| 287| 25| 153.08| 1|
| 6| 290| 94| 316.772| 2|
| 1| 297| 114| 733.27| 2|
| 14| 333| 57| 123.94| 3|
| 11| 561| 5| 489.7| 3|
| 2| 564| 11| 867.66| 4|
| 8| 1186| 3| 440.21| 4|
| 3| 1304| 1| 35.89| 5|
+----------+-------+---------+--------+--------+
but I guess it is far from optimal.
Spark 2.0+:
You could replace manual rank computation with DataFrameStatFunctions.approxQuantile. This would allow for faster interval computation:
val relativeError: Double = ??? // choose the target precision; 0.0 computes exact quantiles
val breaks = df.stat.approxQuantile("recency", probs, relativeError)
This can be achieved with Bucketizer. Using the same data frame as in the example above:
import org.apache.spark.ml.feature.Bucketizer
val df = sc.parallelize(Seq(
(1, 297, 114, 733.27),
(2, 564, 11, 867.66),
(3, 1304, 1, 35.89),
(4, 287, 25, 153.08),
(6, 290, 94, 316.772),
(8, 1186, 3, 440.21),
(11, 561, 5, 489.70),
(14, 333, 57, 123.94)
)).toDF("customerId", "recency", "frequency", "monetary")
val targetVars = Array("recency", "frequency", "monetary")
val probs = Array(0.0, 0.25, 0.50, 0.75, 1.0)
val outputVars = for(varName <- targetVars) yield varName + "Segment"
val breaksArray = for (varName <- targetVars) yield df.stat.approxQuantile(varName, probs, 0.0)
val bucketizer = new Bucketizer()
.setInputCols(targetVars)
.setOutputCols(outputVars)
.setSplitsArray(breaksArray)
val df_e = bucketizer.transform(df)
df_e.show
Result:
targetVars: Array[String] = Array(recency, frequency, monetary)
outputVars: Array[String] = Array(recencySegment, frequencySegment, monetarySegment)
breaksArray: Array[Array[Double]] = Array(Array(287.0, 290.0, 333.0, 564.0, 1304.0), Array(1.0, 3.0, 11.0, 57.0, 114.0), Array(35.89, 123.94, 316.772, 489.7, 867.66))
+----------+-------+---------+--------+--------------+----------------+---------------+
|customerId|recency|frequency|monetary|recencySegment|frequencySegment|monetarySegment|
+----------+-------+---------+--------+--------------+----------------+---------------+
| 1| 297| 114| 733.27| 1.0| 3.0| 3.0|
| 2| 564| 11| 867.66| 3.0| 2.0| 3.0|
| 3| 1304| 1| 35.89| 3.0| 0.0| 0.0|
| 4| 287| 25| 153.08| 0.0| 2.0| 1.0|
| 6| 290| 94| 316.772| 1.0| 3.0| 2.0|
| 8| 1186| 3| 440.21| 3.0| 1.0| 2.0|
| 11| 561| 5| 489.7| 2.0| 1.0| 3.0|
| 14| 333| 57| 123.94| 2.0| 3.0| 1.0|
+----------+-------+---------+--------+--------------+----------------+---------------+