DynamoDB data structure / architecture to support a set of particular queries - amazon-dynamodb

I currently have a lambda function pushing property data into a DynamoDB with streams enabled.
When a new item is added, the stream triggers another Lambda function which should query against a second DynamoDB table to see if there is a 'user query' in the table matching the new object in the stream.
The items in the first table which are pushed into the stream look like this...
{
  Item: {
    partitionKey: 'myTableId',
    bedrooms: 3,
    latitude: 52.4,
    longitude: -2.6,
    price: 200000,
    toRent: false,
  },
}
The second table contains active user queries. For example, one user is looking for a house within a 30-mile radius of his location, priced between £150,000 and £300,000.
An example of this query object in the second table looks like this...
{
  Item: {
    partitionKey: 'myTableId',
    bedrooms: 3,
    minPrice: 150000,
    maxPrice: 300000,
    minLatitude: 52.3,
    maxLatitude: 52.5,
    minLongitude: -2.7,
    maxLongitude: -2.5,
    toRent: false,
    userId: 'userId',
  },
}
When a new property enters the stream, I want to trigger a lambda which queries against the second table. I want to write something along the lines of...
get me all user queries where bedrooms == streamItem.bedrooms AND minPrice < streamItem.price AND maxPrice > streamItem.price AND minLatitude < streamItem.latitude AND maxLatitude > streamItem.latitude.
Ideally I want to achieve this via queries and filters, without scanning.
I'm happy to completely restructure the tables to suit the above requirements.
Been reading and reading and haven't found a suitable answer, so hoping an expert can point me in the right direction!
Thank you in advance

There's no silver bullet with DynamoDB here. Your only tools are the PK/SK lookup by value and range, filters to brute force things after that, and GSIs to give an alternate point of view. You're going to have to get creative. The details depend on your circumstances.
For example, if you know you're getting all those specific values every time, you can construct a PK like bed#<bedrooms>#rent#<toRent> and an SK of price. Then for those three attributes you can do exact index-based resolution and filter for the geo attributes.
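For illustration, a minimal sketch of the stream-triggered Lambda under that scheme (Node.js, AWS SDK v2; the UserQueries table name and its pk attribute are assumptions, and the price range is handled in the filter rather than as a sort key, since a stored user query holds a price range rather than a single price):
const AWS = require('aws-sdk');
const documentClient = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName !== 'INSERT') continue;
    // Convert the stream image into a plain JS object
    const property = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);

    const { Items } = await documentClient.query({
      TableName: 'UserQueries', // hypothetical table holding the stored user queries
      KeyConditionExpression: 'pk = :pk',
      // exact resolution on bedrooms/toRent via the partition key,
      // brute-force filtering for the price and geo ranges
      FilterExpression:
        'minPrice <= :price AND maxPrice >= :price AND ' +
        'minLatitude <= :lat AND maxLatitude >= :lat AND ' +
        'minLongitude <= :lng AND maxLongitude >= :lng',
      ExpressionAttributeValues: {
        ':pk': 'bed#' + property.bedrooms + '#rent#' + property.toRent,
        ':price': property.price,
        ':lat': property.latitude,
        ':lng': property.longitude,
      },
    }).promise();

    // Items now holds the stored user queries matched by the new property
  }
};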
If you wanted, you could quantize the price range values (store pre-determined price ranges as singular values) and put that into the PK as well. For example, divide prices into 50k chunks, each of which is named after its leading value. If someone wanted 150,000 to 250,000 then you'd look up using two PKs, the "150" and "200" blocks (see the sketch below).
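A rough sketch of that quantization idea (names and the 50k block size are illustrative):
const BLOCK = 50000;

// 230000 -> 200000, i.e. the "200" block
const priceBlock = price => Math.floor(price / BLOCK) * BLOCK;

// A query for 150,000 to 250,000 maps onto the "150" and "200" blocks,
// so it would be stored under (or looked up with) two partition keys.
function blocksForRange(minPrice, maxPrice) {
  const blocks = [];
  for (let block = priceBlock(minPrice); block < maxPrice; block += BLOCK) {
    blocks.push(block);
  }
  return blocks;
}

// blocksForRange(150000, 250000) -> [150000, 200000]
// Partition keys could then look like 'bed#3#rent#false#price#' + block.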
You get PK/SK + GSI + filter. That's it. So it's up to you to invent a solution using them, aiming for efficiency.

Related

Dropdown from values of a database field

I have an issue related to data filtering. I have a Google Drive table to store data, and I want to show one field of this data source in a dropdown so I can filter by this field (Country).
The problem is that this dropdown filter only shows the countries that appear on the current page of the list. For example, if the first page contains one country (Thailand), the dropdown will only show Thailand.
If we move to the second page of the list, which has another two countries (Spain and Portugal), then the dropdown will only show Spain and Portugal.
What I really want is a dropdown which shows all the countries, whether or not they are on the current page, but I don't know how to fix it.
This is the configuration of the Country Selector:
The help says we should use #datasource.model.fields.COUNTRY.possibleValues, but if I use this parameter as Options, nothing is displayed in the selector.
I have spent a lot of hours trying to fix this issue and can't find the solution, so I would like to check with you whether it's a bug or I'm doing something wrong...
Could you help me?
You are using the same datasource for your dropdown and table, and with #distinct()#sort() you are filtering items that are already loaded into the browser (as opposed to the whole dataset stored in the database).
You need to have a separate datasource for your dropdown. There are at least three techniques to do this:
Possible values
You can predefine allowed values for your Country field and use them to populate dropdown options both in the create form and in table filtering via #datasource.model.fields.Country.possibleValues, as you mentioned in the question:
Create model for countries
By introducing a dedicated related model for countries you get the following benefits:
normalized data (you will not store the same country multiple times)
you'll be able to keep your countries list clean (with the current approach it is possible to have the same country with different spellings like 'US', 'USA', 'United State', etc.)
when app users create new records they will be able to choose the country they need from a dropdown (as opposed to error-prone typing for every new record).
your dropdown bindings will be as simple as these:
// for names
#datasources.Countries.items..Names
// for options
#datasources.Countries.items.._key
// for value
#datasource.query.filters.Country._key._equals
Create Calculated Model
With a Calculated Model you'll be able to extract the unique country values from your table. Your server query script can look similar to this:
function getUniqueCountries_() {
  var consumptions = app.models.Consumption.newQuery().run();
  var countries = [];
  // Track which countries have already been added so each appears only once.
  consumptions.reduce(function (allCountries, consumption) {
    if (!allCountries[consumption.Country]) {
      var country = app.models.CountryCalc.newRecord();
      country.Name = consumption.Country;
      countries.push(country);
      allCountries[consumption.Country] = true;
    }
    // The accumulator must be returned, otherwise it is undefined on the next iteration.
    return allCountries;
  }, {});
  return countries;
}
However, as your Consumption table grows, this can introduce significant performance overhead. In that case I would rather look in the direction of Cloud SQL and a Calculated SQL model.
Note:
I gave a pretty broad answer that also covers similar situations where the number of field options can be unlimited (as opposed to the limited number of countries).

Arango DB performace: edge vs. DOCUMENT()

I'm new to ArangoDB with graphs. I simply want to know whether it is faster to build edges or to use DOCUMENT() for very simple 1:1 connections where querying the graph is not needed.
LET a = DOCUMENT(@from)
FOR v IN OUTBOUND a CollectionAHasCollectionB
  RETURN MERGE(a, {b: v})
vs
LET a = DOCUMENT(@from)
RETURN MERGE(a, {b: DOCUMENT(a.bId)})
A simple benchmark you can try:
Create the collections products, categories and an edge collection has_category. Then generate some sample data:
FOR i IN 1..10000
  INSERT {_key: TO_STRING(i), name: CONCAT("Product ", i)} INTO products

FOR i IN 1..10000
  INSERT {_key: TO_STRING(i), name: CONCAT("Category ", i)} INTO categories

FOR p IN products
  LET random_categories = (
    FOR c IN categories
      SORT RAND()
      LIMIT 5
      RETURN c._id
  )
  LET category_subset = SLICE(random_categories, 0, RAND()*5+1)
  UPDATE p WITH {
    categories: category_subset,
    categoriesEmbedded: DOCUMENT(category_subset)[*].name
  } INTO products
  FOR cat IN category_subset
    INSERT {_from: p._id, _to: cat} INTO has_category
Then compare the query times for the different approaches.
Graph traversal (depth 1..1):
FOR p IN products
  RETURN {
    product: p.name,
    categories: (FOR v IN OUTBOUND p has_category RETURN v.name)
  }
Look-up in categories collection using DOCUMENT():
FOR p IN products
  RETURN {
    product: p.name,
    categories: DOCUMENT(p.categories)[*].name
  }
Using the directly embedded category names:
FOR p IN products
  RETURN {
    product: p.name,
    categories: p.categoriesEmbedded
  }
Graph traversal is the slowest of all three, the look-up in another collection is faster than the traversal, but by far the fastest query is the one with the embedded category names.
If you query the categories for just one or a few products however, the response times should be in the sub-millisecond area regardless of the data model and query approach and therefore not pose a performance problem.
The graph approach should be chosen if you need to query for paths with variable depth, long paths, shortest path etc. For your use case, it is not necessary. Whether the embedded approach is suitable or not is something you need to decide:
Is it acceptable to duplicate information, and potentially have inconsistencies in the data? (If you want to change a category name, you need to change it in every product record instead of in just one category document that products refer to via its immutable ID.)
Is there a lot of additional information per category? If so, all that data needs to be embedded into every product document that has that category - basically trading memory / storage space for performance
Do you need to retrieve a list of all (distinct) categories often? You can do this type of query very cheaply with the separate categories collection. With the embedded approach it will be much less efficient, because you need to go over all products and collect the category info (see the sketch below).
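For illustration, the distinct-categories query in the two models might look roughly like this (AQL, using the collections from the benchmark above):
// with a separate categories collection: one small scan
FOR c IN categories
  RETURN c.name

// with embedded names: every product has to be visited
FOR p IN products
  FOR name IN p.categoriesEmbedded
    RETURN DISTINCT name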
Bottom line: you should choose the data model and approach that fits your use case best. Thanks to ArangoDB's multi-model nature you can easily try another approach if your use case changes or you run into performance issues.
Generally speaking, the latter variant
LET a = DOCUMENT(@from)
RETURN MERGE(a, {b: DOCUMENT(a.bId)})
should have lower overhead than the full-featured traversal variant. This is because the DOCUMENT variant will do a point lookup of a document whereas the traversal variant is very general purpose: it can return zero to many results from a variable number of collections, needs to keep track of the path seen etc.
When I tried both variants in a local test case, the non-traversal variant was also a lot faster, supporting this claim.
However, the traversal-based variant is more flexible: it can also be used should there be multiple edges (no 1:1 mapping) and for longer paths.

Firebase for complex query. A no go?

I've moved from parse-server to Firebase for my new project, but I've reached a point in the project where I'm beginning to think it was a bad idea.
Basically, I'm making an app where people can post information about concerts going on in their town.
My first challenge was to filter the events so a user only gets events in his/her own town. I did this by structuring the data by city:
{
  concerts: {
    "New york": {
      ...,
      ...
    },
    "Chicago": {
      ...,
      ...
    }
  }
}
Then I figured I needed another filter for the type of concert, e.g. rock, pop, etc., so I thought I'd do another restructure. However, there will probably need to be 5-10 more filters, and it will become very hard to structure the database in a good way.
I thought about a multi-field query, but this isn't allowed:
firebase.database().ref("concerts")
.orderByChild("type").equalTo("rock")
.orderByChild("length").equalTo("2")
.orderByChild("artist").equalTo("beatles")
I thought about fetching everything from the server and then filtering the result in the client. However, I see two problems with this:
There might be a ton of unnecessary data being downloaded.
Some concerts will be locked to only certain users (e.g. users who have gone to at least 10 other concerts), and there is a security aspect to downloading these concerts to users who aren't allowed to see them.
I thought about combining filters to create query keys, like this, but with over 10 filters it will become too complex.
Is there a solution to this or should I forget about firebase for this use case?
Thanks in advance
Incredibly complex queries can be crafted in Firebase. The data needs to be stored in a structure that lends itself to being queried and most importantly, don't be afraid of duplicate data.
For example, let's assume we have an app that enables a user to select a concert for a particular year and month, a specific city, and a particular genre.
There are 3 parameters
year_month
city
genre
The UI first queries the user to select a City
Austin
then the UI asks to select a year and month
201704
then a genre
Rock
Your Firebase structure looks like this
concerts
  concert_00
    city: Memphis
    year_month: 201706
    genre: Country
    city_date_genre: Memphis_201706_Country
  concert_01
    city: Austin
    year_month: 201704
    genre: Rock
    city_date_genre: Austin_201704_Rock
  concert_02
    city: Seattle
    year_month: 201705
    genre: Disco
    city_date_genre: Seattle_201705_Disco
Your UI has already polled the user for the query info and, with that, builds a query string
Austin_201704_Rock
and then query the 'city_date_genre' node for that string and you have your data.
What if the user wanted to know all of the concerts in Austin for April 2017?
queryStartingAt("Austin_201704").queryEndingAt("Austin_201704\uf8ff")
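In web-SDK terms (matching the style of the code in the question; the ordering child and values follow the structure above), that could look roughly like this:
const ref = firebase.database().ref('concerts');

// Exact match on the composite key
ref.orderByChild('city_date_genre')
  .equalTo('Austin_201704_Rock')
  .once('value', snapshot => console.log(snapshot.val()));

// Prefix match: all Austin concerts in April 2017, regardless of genre
ref.orderByChild('city_date_genre')
  .startAt('Austin_201704')
  .endAt('Austin_201704\uf8ff')
  .once('value', snapshot => console.log(snapshot.val()));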
You could easily expand on this by adding another query node and changing the order of the data
concerts
  concert_00
    city: Memphis
    year_month: 201706
    genre: Country
    city_date_genre: Memphis_201706_Country
    city_genre_date: Memphis_Country_201706
And depending on which order the user selects their data, you could query the associated node.
Adding additional nodes is a tiny amount of data and allows for very open ended queries for the data you need.
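A rough sketch of writing those composite keys from a web client (the saveConcert helper and its fields are illustrative, not from the answer):
function saveConcert(concert) {
  // Build the composite lookup keys alongside the plain fields
  const data = Object.assign({}, concert, {
    city_date_genre: concert.city + '_' + concert.year_month + '_' + concert.genre,
    city_genre_date: concert.city + '_' + concert.genre + '_' + concert.year_month,
  });
  // push() generates a unique child key under /concerts
  return firebase.database().ref('concerts').push(data);
}

// saveConcert({ city: 'Austin', year_month: '201704', genre: 'Rock' });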
I see this is an old post, but I'd like to take this opportunity to point others running into similar Firebase issues to AceBase, which is a free and open source alternative to the Firebase realtime database. The lack of proper querying and indexing options in Firebase was one of the reasons AceBase was built. Using AceBase you could query your data like so:
const snapshots = await db.ref('concerts')
  .query()
  .filter('city', '==', 'New York')
  .filter('date', 'between', [today, nextWeek]) // today & nextWeek being Dates
  .filter('genre', 'in', ['rock', 'blues', 'country'])
  .get();
Because AceBase supports indexing, adding 1 or more indexes to the queried fields will make those queries run incredibly fast, even with millions of records. It supports simple indexes, but also FullText and Geo indexes, so you could also query your data with a location and keywords:
.filter('location', 'geo:nearby', { lat: 40.730610, long: -73.935242, radius: 10000 }) // New York center with 10km radius
.filter('title', 'fulltext:contains', '"John Mayer" OR "Kenny Wayne Shepherd"')
If you want to limit results to allow paging, simply add skip and limit: .skip(80).limit(20)
Additionally, if you want the query to deliver realtime results, so that any newly added concert immediately notifies your app, simply adding event listeners will upgrade it to a realtime query:
const results = await db.ref('concerts')
  .query()
  .filter('location', 'geo:nearby', { lat: 40.730610, long: -73.935242, radius: 10000 })
  .on('add', addConcert)
  .on('remove', removeConcert)
  .get();

function addConcert(match) {
  results.push(match.snapshot);
  updateUI();
}

function removeConcert(match) {
  const index = results.findIndex(r => r.ref.path === match.ref.path);
  results.splice(index, 1);
  updateUI();
}
If you want to know more about AceBase, check it out at npm: https://www.npmjs.com/package/acebase. AceBase is free and its entire source code is available on GitHub. Have fun!

How to query to get the list of instance count of every freebase types?

I want to annotate the corpus using Freebase types, but almost every instance in Freebase has several types, so I decided to choose the most common type as the instance's type. Is there a way to get the list of instance counts for every type? I found this query, but it doesn't seem right because the result only has around 400 types, and I think there are far more types than that.
[{
  "id": null,
  "name": null,
  "type": "/freebase/type_profile",
  "/freebase/type_profile/instance_count": []
}]
I question the premise, but let's talk about that at the end after answering your question.
That's (close to) the correct query. When I ask for the count by adding "return": "count", I get 17,972, which sounds about right. Perhaps your query framework is adding a "limit": 400 somehow?
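For reference, a sketch of that count query (the same shape as yours, just returning the number of matching types instead of the types themselves):
[{
  "id": null,
  "type": "/freebase/type_profile",
  "return": "count"
}]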
Since you want the most common, why don't we modify the query to sort them? Due to a quirk in the sorting, nulls sort last (or first in our reversed sort), so we'll also add a qualifier to filter them out. We could use >0, but since presumably you aren't interested in low-frequency types, let's use >1000 instead.
The final query looks like this:
[{
  "id": null,
  "name": null,
  "type": "/freebase/type_profile",
  "instance_count>": 1000,
  "instance_count": null,
  "sort": "-instance_count"
}]
which will return an ordered list of 849 types sorted in descending order by instance count.
You'll probably want to do a little hand curation of the resulting list to eliminate things like /common/topic, /common/document, /book/isbn, /book/pagination, etc. Mediator types won't also have /common/topic, so you could filter on that first (but depending on the types of things in your corpus, they may all be topics (i.e. entities) to start with).
Now back to the premise that most frequent == best. Depending on your application, you may actually want more specific (which usually means lower-frequency) types rather than broader, high-frequency types. For example, Deceased Person rather than Person, or Politician, Author, or Athlete in preference to Person. You may want to consider using the least frequent type (that is used at least some threshold number of times). The other thing you may want to do is blacklist non-commons types (i.e. types rooted at /base/... or /user/...), which haven't been as carefully curated.
EDIT - word of warning:
Those counts were last updated in 2012. That should be fine for an exercise like this where you just want a rough ordering, but if you need current stats, you'll need to either count occurrences in the Freebase data dump or figure out the separate Stats API which I'm not sure is public/documented http://freebase-site.googlecode.com/svn/trunk/www/lib/queries/stats.sjs

CouchDB: Merging Objects in Reduce Function

I'm new to CouchDB, so bear with me. I searched SO for an answer, but couldn't narrow it down to this specifically.
I have a map function which creates values for a user. Users have seen different product pages, and we want to tally the types and products they've seen.
var emit_values = {};
emit_values.name = doc.name;
...
emit_values.productsViewed = {};
emit_values.productsViewed[doc.product] = 1
emit([doc.id, doc.customer], emit_values);
In the reduce function, I want to gather different values into that productsViewed object for that given user. So after the reduce, I have this:
productsViewed: {
  book1: 1,
  book3: 2,
  book8: 1
}
Unfortunately, doing this creates a reduce overflow error. According to the other posts, this is because the productsViewed object is growing in size in the reduce function, and Couch doesn't like that. Specifically:
A common mistake new CouchDB users make is attempting to construct complex aggregate values with a reduce function. Full reductions should result in a scalar value, like 5, and not, for instance, a JSON hash with a set of unique keys and the count of each.
So, I understand this is not the right way to do this in Couch. Does anyone have any insight into how to properly gather values into a document after reduce?
You simply build a view with the customer as the key
emit(doc.customer, doc.product);
Then you can call
/:db/_design/:name/_view/:name?key=":customer"
to get all products an user has viewed.
If a customer can have viewed a product several times, you can build a multi-key view
emit([doc.customer, doc.product], null);
and reduce it with the built-in function _count
/:db/_design/:name/_view/:name?startkey=[":customer","\u0000"]&endkey=[":customer","\u9999"]&reduce=true&group_level=2
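A minimal sketch of the corresponding design document (the design doc and view names are illustrative):
{
  "_id": "_design/stats",
  "views": {
    "products_by_customer": {
      "map": "function (doc) { if (doc.customer && doc.product) { emit([doc.customer, doc.product], null); } }",
      "reduce": "_count"
    }
  }
}
Querying /:db/_design/stats/_view/products_by_customer?group_level=2 then returns one row per [customer, product] pair, with the number of views as the value.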
You have to accept that you cannot "construct complex aggregate values" with CouchDB by requesting a view. If you want a data structure like your desired payload
productsViewed: {
  book1: 1,
  book3: 2,
  book8: 1
}
I recommend using an _update handler on the customer doc. Every request that logs a product visit then adds a value to the customer's productsViewed property instead of creating a new doc.
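A rough sketch of such an _update handler (the design doc name, handler name, and product query parameter are illustrative):
{
  "_id": "_design/customers",
  "updates": {
    "log_view": "function (doc, req) { if (!doc) { return [null, 'customer not found']; } var product = req.query.product; doc.productsViewed = doc.productsViewed || {}; doc.productsViewed[product] = (doc.productsViewed[product] || 0) + 1; return [doc, 'logged ' + product]; }"
  }
}
Each POST to /:db/_design/customers/_update/log_view/:customer_id?product=book1 then increments the counter directly on the customer document, so the aggregated productsViewed object lives in the doc instead of being built by a reduce.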
