I have a DynamoDB table like depicted in the attached image and I'm looking for ways to query the table based on lon and lat fields. More specifically, I'm looking for all the results with Lon between X and Y and Lat between A and B.
Is there any way to do that? I created indexes for both Lon and Lat, but the values are strings.
Thanks a lot for your help!
You can come up with a good hash function (let's say f) and use the schema below for DynamoDB:
| pk         | sk                | lat   | lon  | name       |
| hashvalue1 | 48.80#2.35#Fac    | 48.80 | 2.35 | Fac du     |
| hashvalue1 | 48.83#2.36#Groupe | 48.83 | 2.36 | Groupe Hos |
here f(48.80, 2.35) = hashvalue1
f(48.83, 2.36) = hashvalue1
Whenever you have to query for lat1 and lon1, calculate f(lat1, lon1) and query the table with that value as the partition key.
The problem with this solution is coming up with a good hash function: if too many items map to the same value, it becomes a hot key, and in the worst case you may have to recalculate the hash of every item already stored in the table. This approach is well documented here and here.
I would suggest going with Elasticsearch; it will give you much more flexibility for future use cases.
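A minimal sketch of what such a function f could look like, assuming a fixed grid of 0.1-degree cells. The names cell_key, cells_for_box, and CELLS_PER_DEGREE are made up for illustration; they are not part of DynamoDB or any library. The second helper lists every cell that overlaps a bounding box, since a "Lon between X and Y and Lat between A and B" search usually spans more than one cell:

```python
import math

# Assumption: 10 cells per degree (0.1-degree cells). Tune this to your data density.
CELLS_PER_DEGREE = 10


def cell_key(lat: float, lon: float, cells_per_degree: int = CELLS_PER_DEGREE) -> str:
    """Hypothetical hash f: map a point to the id of the grid cell containing it."""
    row = math.floor(lat * cells_per_degree)
    col = math.floor(lon * cells_per_degree)
    return f"{row}:{col}"


def cells_for_box(lat_min, lat_max, lon_min, lon_max,
                  cells_per_degree: int = CELLS_PER_DEGREE) -> list:
    """Return the keys of all grid cells that overlap the bounding box."""
    rows = range(math.floor(lat_min * cells_per_degree),
                 math.floor(lat_max * cells_per_degree) + 1)
    cols = range(math.floor(lon_min * cells_per_degree),
                 math.floor(lon_max * cells_per_degree) + 1)
    return [f"{r}:{c}" for r in rows for c in cols]


if __name__ == "__main__":
    print(cell_key(48.83, 2.36))
    print(cells_for_box(48.75, 48.95, 2.25, 2.45))
```

With this layout you would run one query per returned key and then filter the items on lat and lon. For that filter to compare numerically, lat and lon should be stored as numbers rather than strings, because strings compare lexicographically.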
I have two sets of points in two separate tables like this :
t1 :
Point_1 |Lat | Long
..................
Point_n |Lat |Long
and
t2 :
Pt_1 |Lat | Long
..................
Pt_m |Lat |Long
with no relation between the two tables.
What would be the best way (least resources) to identify the top 3 closest points in t2 for each point in t1, particularly when t1 and t2 are huge? Maybe geohashing?
What I tried and seems to work fine with small datasets is :
t1
| extend blah=1
| join kind=fullouter (t2 |extend blah=1) on blah
| extend distance = geo_distance_2points(Long,Lat,Long1,Lat1)
| sort by point, distance asc
| extend rnk = row_number(1,point <> prev(point))
| where rnk<=3
|project point, pt, distance, rnk
Please pardon the sloppiness; I'm learning.
Thank you!
Try reducing the data size on both sides of the join operator by filtering out irrelevant or ill-formatted rows and columns. Perhaps you can use geo_point_in_polygon() or geo_point_in_circle() to throw out irrelevant data.
Try using a broadcast join or maybe a shuffle join:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/broadcastjoin
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/shufflequery
You can use the S2, geohash, or H3 hashing functions in two ways:
a. For each table, combine nearby points into one representative point. The idea is to use the hash cell's central point as a representative for all points that reside in the cell. This will reduce table sizes. Something like:
datatable(lng:real, lat:real)
[
10.1234, 53,
10.3579, 53,
10.6842, 53,
]
| summarize by hash = geo_point_to_s2cell(lng, lat, 8)
| project geo_s2cell_to_central_point(hash)
b. Calculate hash value for each point and join on the hash value. Something like:
let t1 =
datatable(lng:real, lat:real)
[
10.3579, 53,
10.6842, 53,
];
let t2 =
datatable(lng:real, lat:real)
[
10.1234, 53,
];
t1 | extend hash = geo_point_to_s2cell(lng, lat, 8)
| join kind=fullouter hint.strategy=broadcast (t2 | extend hash = geo_point_to_s2cell(lng, lat, 8)) on hash
The partition operator might also speed up the query:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/partitionoperator
I found what I think to be a better way to do this and want to share it.
Firstly, the issue with tessellation / geo-hashing is this:
Let's assume you have two sets of points with coordinates in two tables T1 and T2 and want to calculate the closest point in T2 for each point in T1. Now let's assume you have a point in T1 very close to the border of a geo-hash cell, and another point in T2 close to the same border, but in the neighboring geo-hash cell. Using the join method based on hash id, the algorithm will never calculate the distance between these two points, although they are very close, so the end result will miss this pair.
A better way to join the two tables to calculate inter-point distances is to generate a join key based on truncated coordinates. For each point in each table, we create this key based on the relevant inter-point distance (the maximum inter-point distance we care about).
Example: for a point with coordinates (45.1234; -120.5678) the join key could be 45.1-120.6 (truncation and concatenation). With this rounding and the join method, we would capture everything in table 2 within approximately 15 km of that point in table 1. Going for 45-120 as the join key will capture everything within about 150 km. This significantly reduces the joined table and avoids the caveats of the geo-hashing method.
At this point I'm better at writing prose than code :), but I hope what I described above makes sense. It certainly works for my project while circumventing the resource problems (CPU/memory).
Happy you've found a way that works for you. Another option that you may try is also taking into account neighbor cells.
H3 hash has such capability: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/geo-h3cell-rings-function
Something like this:
let h3_resolution = 8;
let t1 = datatable(lng1:real, lat1:real)
[
40.75864778392896, -73.97856558479198,
40.74860253711237, -73.98577679198793,
40.741092676839024, -73.9902397446769,
];
let t2 = datatable(lng2:real, lat2:real)
[
40.75594965648444, -73.98157034840024,
40.766085141039774, -74.01798702196743
];
t1
| extend hash = geo_point_to_h3cell(lng1, lat1, h3_resolution)
| join kind = inner (
t2
| extend rings = geo_h3cell_rings(geo_point_to_h3cell(lng2, lat2, h3_resolution),1)
| project lng2, lat2, hash_array = array_concat(rings[0], rings[1])
| mv-expand hash_array to typeof(string)
) on $left.hash == $right.hash_array
| project-away hash, hash_array
| extend distance = geo_distance_2points(lng1, lat1, lng2, lat2)
| project p1 = tostring(pack_array(lng1, lat1)), p2 = pack_array(lng2, lat2), distance
| sort by distance asc
| summarize closest_3_points = make_list(p2, 3) by p1
I want to show a graph of minimum value, maximum value and difference between maximum and minimum for each timeslice.
It works ok for min and max
| parse "FromPosition *)" as FromPosition
| timeslice 2h
| max(FromPosition) ,min(FromPosition) group by _timeslice
but I couldn't find the correct way to specify the difference.
e.g.
| (max(FromPosition)- min(FromPosition)) as diffFromPosition by _timeslice
returns the error: Unexpected token 'b' found.
I've tried a few different combinations to declare them on different lines as suggested on https://help.sumologic.com/05Search/Search-Query-Language/aaGroup. e.g.
| int(FromPosition) as intFromPosition
| max(intFromPosition) as maxFromPosition , min(intFromPosition) as minFromPosition
| (maxFromPosition - minFromPosition) as diffFromPosition
| diffFromPosition by _timeslice
without success.
Can anyone suggest the correct syntax?
Try this:
| parse "FromPosition *)" as FromPosition
| timeslice 2h
| max(FromPosition), min(FromPosition) by _timeslice
| _max - _min as diffFromPosition
| fields _timeslice, diffFromPosition
The group by is for the min and max functions to know what range to work with; it is not a group by for the overall search query. That's why you were getting the syntax errors, and it's one reason I prefer to just use by as above.
For these kinds of queries I usually prefer a box plot where you would just do:
| min(FromPosition), pct(FromPosition, 25), pct(FromPosition, 50), pct(FromPosition, 75), max(FromPosition) by _timeslice
Then select box plot as the graph type. It looks great on a dashboard and provides a lot of detailed information about deviation and such at a glance.
I have an example pedigree with a structure as shown here.
My ultimate goal is to extract the ancestry of certain people in the so-called trio format, which is a table with the columns id, mom, and dad.
In my example, the result for the pedigree of the two most recent persons G and H would be
+-----+-----+-----+
| id | mom | dad |
+-----+-----+-----+
| D | A | B |
| E | C | B |
| G | D | E |
| H | F | E |
+-----+-----+-----+
The closest thing I could come up with in AQL is the following query.
LET last_generation = ['people/G', 'people/H']
FOR person IN last_generation
FOR v, e, p in 1..10 OUTBOUND person is_mom, is_dad
LET role = contains('mom', e._id) ? 'mom': 'dad'
SORT e._from DESC
RETURN DISTINCT {'id': DOCUMENT('people', e._from)._key,
'parent': DOCUMENT('people', e._to)._key,
'role': role}
Although the result is not yet in the right format, post-processing is easy.
Now my questions are:
I am forced to use the DISTINCT keyword to ensure uniqueness of rows. However, I would like to avoid unnecessary traversal in the first place rather than filtering. Ideally, I think I need the option uniqueEdges: "global", which is sadly not available any more. For instance, after having processed the pedigree of person G, I don't want to traverse the part of the pedigree shared between G and H (i.e., person E and its parents) again. Using uniqueVertices: "global" is not an option, because I would then miss the edge H --> E.
Is there some kind of option to know the edge collection type during a traversal rather than using the kind of cumbersome checking I do? Please note that it is not an option for me to put the sex into a property of the person (which is reasonable for most humans), because in reality I am dealing with plants, which can (usually) be mother and father at the same time.
I have a stream of numbers, and in every cycle I need to compute the average of the last N of them. This can, of course, be solved using an array where I store the last N numbers; in every cycle I shift it, add the new one, and compute the average.
N = 3
+---+-----+
| a | avg |
+---+-----+
| 1 | |
| 2 | |
| 3 | 2.0 |
| 4 | 3.0 |
| 3 | 3.3 |
| 3 | 3.3 |
| 5 | 3.7 |
| 4 | 4.0 |
| 5 | 4.7 |
+---+-----+
The first N numbers (where there "isn't enough data for counting the average") don't interest me much, so the results there may be anything/undefined.
My question is, can this be done without using an array, that is, with static amount of memory? If so, then how?
I'll do the coding myself - I just need to know the theory.
Thanks
Think of this as a black box containing some state. If you control the input stream, you can draw conclusions about the state. In your sliding window array-based approach, it is fairly obvious that if you feed a bunch of zeros into the algorithm after the original input, you get a bunch of averages with a decreasing number of non-zero values taken into account. The last one has just one original non-zero value, so if you multiply that by N you get the last input back. Using that and the second-to-last output, which accounts for two non-zero inputs, you can reconstruct the second-to-last input, and so on.
So essentially your algorithm needs to maintain sufficient state to reconstruct the last N elements of input, at least if you formulate it as an on-line algorithm. I don't think an off-line algorithm can do any better, except if you consider it reading the input multiple times, but I don't have as strong an argument for that.
Of course, in some theoretical models you can avoid the array and e.g. encode all the state into a single arbitrary length integer, but that's just cheating the theory, and doesn't make any difference in practice.
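To make the reconstruction argument concrete, here is a small sketch. The class SlidingAverage and the function recover_last_inputs are hypothetical names, and Fraction is used only so the arithmetic is exact. The second function treats the averager as a black box, feeds it zeros, and decodes the last N inputs from the outputs alone:

```python
from collections import deque
from fractions import Fraction


class SlidingAverage:
    """Array-based moving average over the last n inputs (pre-history counts as zeros)."""

    def __init__(self, n: int):
        self.n = n
        self.window = deque([Fraction(0)] * n, maxlen=n)
        self.total = Fraction(0)

    def push(self, x) -> Fraction:
        x = Fraction(x)
        self.total += x - self.window[0]  # add the new value, drop the oldest
        self.window.append(x)             # maxlen evicts the oldest automatically
        return self.total / self.n


def recover_last_inputs(box: SlidingAverage, last_output) -> list:
    """Recover the last n inputs using only the box's outputs, by feeding it zeros."""
    n = box.n
    # sums[k] is the sum of the newest n - k original inputs.
    sums = [Fraction(last_output) * n]
    for _ in range(n - 1):
        sums.append(box.push(0) * n)
    # Consecutive differences give the inputs oldest-first; the final sum is the newest.
    return [sums[k] - sums[k + 1] for k in range(n - 1)] + [sums[n - 1]]


if __name__ == "__main__":
    box = SlidingAverage(3)
    out = None
    for value in [1, 2, 3, 4, 3, 3, 5, 4, 5]:
        out = box.push(value)
    print(recover_last_inputs(box, out))
```

Because the outputs alone determine the last N inputs, whatever state the box keeps must carry at least that much information.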
I am quite a beginner in data warehouse design. I have read some theory, but recently ran into a practical problem with the design of an OLAP cube. I use a star schema.
Let's say I have 2 dimension tables and 1 fact table:
Dimension Gazetteer:
dimension_id
country_name
province_name
district_name
Dimension Device:
dimension_id
device_category
device_subcategory
Fact table:
gazetteer_id
device_dimension_id
hazard_id (measure column)
area_m2 (measure column)
A "business object" (which is actually a minefield) can have multiple devices, is located in a single location (Gazetteer), and occupies X square meters.
So in order to know which device categories there are, I created a fact per each device in hazard like this:
+--------------+---------------------+-----------------------+-----------+
| gazetteer_id | device_dimension_id | hazard_id | area_m2 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 321 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 654 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 987 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
I defined a measure "number of hazards" as distinct-count of hazard_id.
I also defined a "total area occupied" measure as a sum of area_m2.
Now I can use the dimension gazetteer and device and know how many hazards there are with given dimension members.
But the problem is area_m2: because it is defined as a sum, it gives a value n times higher than the actual area, where n is the number of devices of the hazard object. For example, the data above would give 18000 m2.
How would you solve this problem?
I am using the Pentaho stack.
Thanks in advance
[moved from comment]
If a hazard_id is a minefield, and you're looking at mines by region (gazetteer) and size of minefields by gazetteer, maybe you could make a Hazard dimension that holds the area of the hazard. Alternatively, you could make a null-device entry in the Device dimension table, where only the null-device row gets area_m2 set and the real devices get area_m2 = 0.
If you need to answer queries like: total area of minefields containing device 321, the second approach isn't going to easily answer these questions, which suggests that making a Hazard dimension might be a better approach.
I would also consider adding a device-count fact, which could have the num devices of each type per hazard.
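To illustrate the double counting and the Hazard-dimension idea outside of Pentaho, here is a small Python sketch over the three fact rows from the question. The function names are made up, and this only shows the arithmetic the cube would need to reproduce, not how to configure it: the area is counted once per hazard_id instead of once per fact row.

```python
# The three fact rows from the question: one hazard, three devices.
facts = [
    {"gazetteer_id": 123, "device_dimension_id": 321,
     "hazard_id": "0a0a-502c-11aa1331e98", "area_m2": 6000},
    {"gazetteer_id": 123, "device_dimension_id": 654,
     "hazard_id": "0a0a-502c-11aa1331e98", "area_m2": 6000},
    {"gazetteer_id": 123, "device_dimension_id": 987,
     "hazard_id": "0a0a-502c-11aa1331e98", "area_m2": 6000},
]


def naive_total_area(rows) -> int:
    """Plain SUM(area_m2): counts the area once per device row."""
    return sum(row["area_m2"] for row in rows)


def total_area(rows) -> int:
    """Area counted once per hazard, as if it lived in a Hazard dimension."""
    area_by_hazard = {row["hazard_id"]: row["area_m2"] for row in rows}
    return sum(area_by_hazard.values())


def area_for_device(rows, device_id) -> int:
    """Total area of the hazards that contain the given device."""
    return total_area([row for row in rows if row["device_dimension_id"] == device_id])


if __name__ == "__main__":
    print(naive_total_area(facts))      # inflated by the number of devices
    print(total_area(facts))            # one area per hazard
    print(area_for_device(facts, 321))  # "minefields containing device 321"
```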