I have the following DB schema and I'd like to find the best way to select the list of sort keys that are common to both PK_A and PK_B:
+---------------+---------+
| PK | SortKey |
+---------------+---------+
| | SK_A |
| PK_A | SK_B |
| | SK_C |
| - - - - - - - | |
| | SK_B |
| PK_B | SK_C |
| | SK_D |
+---------------+---------+
So when I select by PK_A and by PK_B, it should return only SK_B and SK_C.
Any help is appreciated.
Simple answer: you can't do it (in one call).
DynamoDB is not a relational database, and operations such as intersection are not supported.
You'd need to query() once for each partition key and then compute the intersection yourself, as sketched below.
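For illustration, a minimal sketch of that pattern in Scala, assuming the AWS SDK for Java v2 and the names from the diagram (the table name MyTable and the attribute names PK and SortKey are assumptions; pagination is omitted for brevity):

import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, QueryRequest}

// Fetch every sort key stored under one partition key.
def sortKeys(client: DynamoDbClient, pk: String): Set[String] = {
  val request = QueryRequest.builder()
    .tableName("MyTable")
    .keyConditionExpression("PK = :pk")
    .expressionAttributeValues(
      Map(":pk" -> AttributeValue.builder().s(pk).build()).asJava)
    .build()

  client.query(request).items().asScala
    .map(_.get("SortKey").s())
    .toSet
}

val client = DynamoDbClient.create()

// Two queries, then a client-side set intersection.
val common = sortKeys(client, "PK_A") intersect sortKeys(client, "PK_B")
// expected: Set(SK_B, SK_C)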
I am writing a query in Kusto to parse heartbeat data from a sensor. This is what I've written:
datatable(timestamp:datetime, healthycount:int, unhealthycount:int, origin:string)
[
    datetime(1910-06-11), 10, 1, 'origin',
    datetime(1910-05-11), 9, 2, 'origin'
]
| summarize latest = arg_max(timestamp, *) by origin
| project healthy = healthycount,
          unhealthy = unhealthycount
This outputs data like this:
+--------------+----------------+
| healthy | unhealthy |
+--------------+----------------+
| 10 | 1 |
+--------------+----------------+
However, I want to represent this data as a pie chart, but to do that I need the data in the following format:
+----------------+-------+
| key | value |
+----------------+-------+
| healthy | 10 |
| unhealthy | 1 |
+----------------+-------+
Is it possible to do this? What terminology am I looking for?
Here is one way:
datatable(timestamp:datetime, healthycount:int, unhealthycount:int, origin:string)
[
    datetime(1910-06-11), 10, 1, 'origin',
    datetime(1910-05-11), 9, 2, 'origin'
]
| summarize arg_max(timestamp, *) by origin
| extend Pack = pack("healthycount", healthycount, "unhealthycount", unhealthycount)
| mv-expand kind=array Pack
| project key = tostring(Pack[0]), value = toint(Pack[1])
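As a side note, the terminology you're looking for is unpivoting (reshaping data from wide to long). Once the data is in key/value form, you can append | render piechart to the query to draw the pie chart directly in Kusto Explorer or the Azure Data Explorer web UI.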
I am trying to do the following:
Connected to DevCluster
[cqlsh 5.0.1 | Cassandra 3.10.0.1695 | DSE 5.1.1 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
user#cqlsh:test> desc table del28;
CREATE TABLE test.del28 (
sno int PRIMARY KEY,
dob date,
name range_dates,
ssss_details map<text, date>,
ssss_range map<text, frozen<map<date, date>>> );
CREATE INDEX idx_ssss_range ON test.del28 (keys(ssss_range));
CREATE INDEX ssss_details_idx ON test.del28 (values(ssss_details));
CREATE INDEX ssss_range_idx ON test.del28 (values(ssss_range));
user#cqlsh:test> select * from del28;
sno | dob | name | ssss_details | ssss_range
-----+------+--------------------------------------+----------------------------------------------+---------------------------------
5 | null | {start: 2014-03-05, end: 2018-04-05} | {'hello': 2014-05-05} | {'1': {2018-04-05: 2012-02-05}}
8 | null | {start: 2018-03-04, end: 2018-08-02} | {'hello8': 2018-08-08} | {'8': {2018-08-08: 2012-02-08}}
2 | null | {start: 2018-03-04, end: 2018-05-05} | {'hello': 2018-05-05} | {'1': {2018-07-08: 2018-09-01}}
4 | null | {start: 2014-03-04, end: 2018-04-02} | {'hello1': 2014-05-02} | {'1': {2018-04-08: 2012-02-04}}
7 | null | {start: 2014-03-04, end: 2018-04-02} | {'hello4': 2014-05-03, 'hello5': 2014-05-02} | {'2': {2018-04-08: 2012-02-04}}
6 | null | {start: 2014-03-04, end: 2018-04-02} | {'hello2': 2014-05-02, 'hello3': 2014-05-03} | {'2': {2018-04-08: 2012-02-04}}
9 | null | {start: 2014-03-04, end: 2018-04-02} | {'hello7': 2014-05-02, 'hello8': 2014-05-03} | {'2': {2018-04-08: 2012-02-04}}
3 | null | {start: 2014-03-04, end: 2018-04-02} | {'hello': 2014-05-02} | {'1': {2018-04-08: 2012-02-04}}
(8 rows)
My question is: can I use filters on ssss_range, and if so, how? If not, what is the best way to store this data? The idea is that there is a number or text label followed by a pair of dates, e.g. house1: {2012-04-05: 2013-02-05}, house2: {2013-04-08: 2014-02-04}, ... for one particular user, where each date pair records the period that person stayed there. I tried splitting the dates into the 'name' column, but that did not work for me either. There is also a lot of other information in this record.
I should be able to query by house, e.g. where aaa = 'house1', something like that, and also by dates, e.g. where from_date > '' and to_date < ''.
I am okay with changing how the data is stored if that makes it easier to query; any collections or data types are fine.
Please suggest the right approach.
Thanks
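For what it's worth, one hedged sketch of an alternative model (hypothetical table and column names, not a definitive answer): store each stay as its own row, with the label and start date as clustering columns, so both the equality predicate and the date-range predicate are served by the primary key rather than by secondary indexes:

CREATE TABLE test.user_stays (
    sno int,
    house text,
    from_date date,
    to_date date,
    PRIMARY KEY ((sno), house, from_date)
);

-- Stays of user 1 in 'house1' that started within a date window:
SELECT * FROM test.user_stays
WHERE sno = 1
  AND house = 'house1'
  AND from_date >= '2012-01-01'
  AND from_date < '2014-01-01';

Restricting to_date in the same query would still require ALLOW FILTERING or a second table clustered by to_date, so the best layout depends on which predicate dominates your workload.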
I'm trying to port code from R to Scala to perform customer analysis. I have already computed the Recency, Frequency and Monetary factors in Spark, in a DataFrame.
Here is the schema of the DataFrame:
df.printSchema
root
|-- customerId: integer (nullable = false)
|-- recency: long (nullable = false)
|-- frequency: long (nullable = false)
|-- monetary: double (nullable = false)
And here is a data sample as well:
df.orderBy($"customerId").show
+----------+-------+---------+------------------+
|customerId|recency|frequency| monetary|
+----------+-------+---------+------------------+
| 1| 297| 114| 733.27|
| 2| 564| 11| 867.66|
| 3| 1304| 1| 35.89|
| 4| 287| 25| 153.08|
| 6| 290| 94| 316.772|
| 8| 1186| 3| 440.21|
| 11| 561| 5| 489.70|
| 14| 333| 57| 123.94|
+----------+-------+---------+------------------+
I'm trying to find, for each column, the interval of a quantile vector that each value falls into, given a vector of probabilities.
In other words, given a vector v of non-decreasing breakpoints (in my case the quantile vector), find the interval containing each element of x; i.e. (pseudo-code):
if i <- findInterval(x, v),
then for each index j in x:
v[i[j]] <= x[j] < v[i[j] + 1], where v[0] := -Inf, v[N+1] := +Inf, and N <- length(v).
In R, this translates to the following code:
probSegment <- c(0.0, 0.25, 0.50, 0.75, 1.0)
RFM_table$Rsegment <- findInterval(RFM_table$Recency, quantile(RFM_table$Recency, probSegment))
RFM_table$Fsegment <- findInterval(RFM_table$Frequency, quantile(RFM_table$Frequency, probSegment))
RFM_table$Msegment <- findInterval(RFM_table$Monetary, quantile(RFM_table$Monetary, probSegment))
I'm kind of stuck with the quantile function, though.
In an earlier discussion with @zero323, he suggested that I use the percent_rank window function as a shortcut. I'm not sure that I can apply percent_rank in this case.
How can I apply a quantile function to a DataFrame column in Scala Spark? If that is not possible, can I use percent_rank instead?
Thanks.
Well, I still believe that percent_rank is good enough here. The percent_rank window function is computed as:
pr = (rank - 1) / (n - 1)
where rank is the row's rank under the given ordering and n is the number of rows. Transforming as follows:
100 * pr
gives the definition of a percentile used, according to Wikipedia, by Microsoft Excel.
So the only thing you really need is a findInterval UDF which will return the correct interval index. Alternatively you can use rank directly and match on rank ranges.
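For illustration, a minimal sketch of that idea (hypothetical; it uses the same df and probability vector as the code below, and the unpartitioned Window is exactly what triggers the warning discussed in the edit):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{percent_rank, udf}

val probs = Array(0.0, 0.25, 0.50, 0.75, 1.0)

// Interval index = number of probability breakpoints at or below the
// row's percent_rank; this mirrors R's findInterval over the quantiles.
val interval = udf((pr: Double) => probs.count(_ <= pr))

df.withColumn("Rsegment",
  interval(percent_rank().over(Window.orderBy($"recency")))).show

For this particular dataset it reproduces the intervals shown in the example output further down, although rounding at the quantile indices can make the two approaches differ in general.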
Edit
OK, it looks like percent_rank is not a good idea after all:
WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation
I am not exactly sure what the point of moving all the data to a single partition just to call a non-aggregate function is, but it looks like we are back to square one. It is possible to use zipWithIndex on a plain RDD:
import org.apache.spark.sql.{Row, DataFrame, Column}
import org.apache.spark.sql.types.{StructType, StructField, LongType}
import org.apache.spark.sql.functions.udf
val df = sc.parallelize(Seq(
  (1, 297, 114, 733.27),
  (2, 564, 11, 867.66),
  (3, 1304, 1, 35.89),
  (4, 287, 25, 153.08),
  (6, 290, 94, 316.772),
  (8, 1186, 3, 440.21),
  (11, 561, 5, 489.70),
  (14, 333, 57, 123.94)
)).toDF("customerId", "recency", "frequency", "monetary")

df.registerTempTable("df")
sqlContext.cacheTable("df")
A small helper:
def addRowNumber(df: DataFrame): DataFrame = {
  // Prepare new schema with a leading row_number column
  val schema = StructType(
    StructField("row_number", LongType, false) +: df.schema.fields)

  // Add row number
  val rowsWithIndex = df.rdd.zipWithIndex
    .map{case (row: Row, idx: Long) => Row.fromSeq(idx +: row.toSeq)}

  // Create DataFrame
  sqlContext.createDataFrame(rowsWithIndex, schema)
}
and the actual function:
def findInterval(df: DataFrame, column: Column,
    probSegment: Array[Double], outname: String): DataFrame = {

  val n = df.count

  // Map quantile probabilities to row indices in the sorted column
  val breakIndices = probSegment.map(p => (p * (n - 1)).toLong)

  // Add row number
  val dfWithRowNumber = addRowNumber(df.orderBy(column))

  // Map indices to break values
  val breaks = dfWithRowNumber
    .where($"row_number".isin(breakIndices:_*))
    .select(column.cast("double"))
    .map(_.getDouble(0))
    .collect

  // Get interval index via binary search over the breaks
  val f = udf((x: Double) =>
    scala.math.abs(java.util.Arrays.binarySearch(breaks, x) + 1))

  // Final result
  dfWithRowNumber
    .select($"*", f(column.cast("double")).alias(outname))
    .drop("row_number")
}
and example usage:
scala> val probs = Array(0.0, 0.25, 0.50, 0.75, 1.0)
probs: Array[Double] = Array(0.0, 0.25, 0.5, 0.75, 1.0)
scala> findInterval(df, $"recency", probs, "interval").show
+----------+-------+---------+--------+--------+
|customerId|recency|frequency|monetary|interval|
+----------+-------+---------+--------+--------+
| 4| 287| 25| 153.08| 1|
| 6| 290| 94| 316.772| 2|
| 1| 297| 114| 733.27| 2|
| 14| 333| 57| 123.94| 3|
| 11| 561| 5| 489.7| 3|
| 2| 564| 11| 867.66| 4|
| 8| 1186| 3| 440.21| 4|
| 3| 1304| 1| 35.89| 5|
+----------+-------+---------+--------+--------+
but I guess it is far from optimal.
Spark 2.0+:
You could replace manual rank computation with DataFrameStatFunctions.approxQuantile. This would allow for faster interval computation:
val relativeError: Double = ??? // desired target precision, to be chosen
val breaks = df.stat.approxQuantile("recency", probs, relativeError)
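Note that with breaks obtained this way, the orderBy/addRowNumber pass above is no longer needed at all; the resulting array can be fed straight into the same binary-search UDF (and a relativeError of 0.0 computes exact quantiles, at extra cost).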
This can be achieved with Bucketizer. Using the same data frame as in the example above:
import org.apache.spark.ml.feature.Bucketizer
val df = sc.parallelize(Seq(
  (1, 297, 114, 733.27),
  (2, 564, 11, 867.66),
  (3, 1304, 1, 35.89),
  (4, 287, 25, 153.08),
  (6, 290, 94, 316.772),
  (8, 1186, 3, 440.21),
  (11, 561, 5, 489.70),
  (14, 333, 57, 123.94)
)).toDF("customerId", "recency", "frequency", "monetary")

val targetVars = Array("recency", "frequency", "monetary")
val probs = Array(0.0, 0.25, 0.50, 0.75, 1.0)
val outputVars = for (varName <- targetVars) yield varName + "Segment"
val breaksArray = for (varName <- targetVars)
  yield df.stat.approxQuantile(varName, probs, 0.0)

val bucketizer = new Bucketizer()
  .setInputCols(targetVars)
  .setOutputCols(outputVars)
  .setSplitsArray(breaksArray)

val df_e = bucketizer.transform(df)
df_e.show
Result:
targetVars: Array[String] = Array(recency, frequency, monetary)
outputVars: Array[String] = Array(recencySegment, frequencySegment, monetarySegment)
breaksArray: Array[Array[Double]] = Array(Array(287.0, 290.0, 333.0, 564.0, 1304.0), Array(1.0, 3.0, 11.0, 57.0, 114.0), Array(35.89, 123.94, 316.772, 489.7, 867.66))
+----------+-------+---------+--------+--------------+----------------+---------------+
|customerId|recency|frequency|monetary|recencySegment|frequencySegment|monetarySegment|
+----------+-------+---------+--------+--------------+----------------+---------------+
| 1| 297| 114| 733.27| 1.0| 3.0| 3.0|
| 2| 564| 11| 867.66| 3.0| 2.0| 3.0|
| 3| 1304| 1| 35.89| 3.0| 0.0| 0.0|
| 4| 287| 25| 153.08| 0.0| 2.0| 1.0|
| 6| 290| 94| 316.772| 1.0| 3.0| 2.0|
| 8| 1186| 3| 440.21| 3.0| 1.0| 2.0|
| 11| 561| 5| 489.7| 2.0| 1.0| 3.0|
| 14| 333| 57| 123.94| 2.0| 3.0| 1.0|
+----------+-------+---------+--------+--------------+----------------+---------------+
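Note that Bucketizer's segment indices are zero-based: five splits define four buckets (0.0 through 3.0), and the last bucket is inclusive of its upper split, whereas R's findInterval with the same breakpoints returns 1 through 5 (5 only for the maximum value). If you want R-style numbering you can add 1 to each segment column, accepting that the maximum value then lands in segment 4 rather than 5.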
For the following schema:
Animal
- age
- gender
- size
Cat extends Animal
- fur_color
Snake extends Animal
- scales_color
Elephant extends Animal
- tusks_size
When I do $em->getRepository('AcmeDemoBundle:Animal')->findAll() I receive a collection of Animal objects without their subclass properties.
When I do $em->getRepository('AcmeDemoBundle:Cat')->findAll() I receive the objects with their subclass (Cat) properties, however I get only Cat objects (no snakes or elephants).
1) Is there a way to get all the animals, not as base Animal objects but as their actual leaf subclass types?
E.g., for a database like this:
Animals table:
ID | discr | age | gender | size | fur_color | scales_color | tusks_size
1 | snake | 2 | male | 20ft | NULL | green | NULL
2 | cat | 3 | female | 5ft | red | NULL | NULL
3 | eleph | 6 | male | 99ft | NULL | NULL | 40ft.
4 | cat | 2 | male | 6ft | grey | NULL | NULL
I'd like to receive a Collection of:
Snake (id: 1, age: 2, gender: male, size: 20ft, scales_color: green)
Cat (id: 2, age: 3, gender: female, size: 5ft, fur_color: red)
Elephant (id: 3, age: 6, gender: male, size: 99ft, tusks_size: 40ft.)
Cat (id: 4, age: 2, gender: male, size: 6ft, fur_color: grey)
2) If it's not possible with STI... is it possible with Class Table Inheritance?
Indeed, it seems I had an error in my configuration. Recreating the bundle and writing the entities again fixed the problem, as @Bez and @Cerad suggested.
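For reference, a typical single-table-inheritance mapping that makes findAll() on the base repository hydrate leaf classes looks roughly like this (a sketch with namespaces and most fields omitted; the discriminator map is what tells Doctrine which subclass each row becomes):

<?php
use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 * @ORM\InheritanceType("SINGLE_TABLE")
 * @ORM\DiscriminatorColumn(name="discr", type="string")
 * @ORM\DiscriminatorMap({"cat" = "Cat", "snake" = "Snake", "eleph" = "Elephant"})
 */
class Animal
{
    /** @ORM\Id @ORM\Column(type="integer") @ORM\GeneratedValue */
    protected $id;

    /** @ORM\Column(type="integer") */
    protected $age;

    // gender and size omitted for brevity
}

/** @ORM\Entity */
class Cat extends Animal
{
    /** @ORM\Column(type="string", nullable=true) */
    protected $fur_color;
}

With a mapping like this, $em->getRepository('AcmeDemoBundle:Animal')->findAll() returns Cat, Snake and Elephant instances, and the same holds for Class Table Inheritance (@ORM\InheritanceType("JOINED")).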