I have a problem doing complex queries in the Firestore database. I have read the documentation countless times, so I know where the limitations are, but I wonder whether there is a way to structure the data so it supports my use cases. Let me explain the use cases first:
I have a list of jobs and a list of users, and I want to be able to list/filter jobs according to some criteria and to list/filter users according to some criteria.
Job
JOB ID
- job type (1 of predefined values)
- salary (any number value)
- location (any value)
  - long
  - lat
- rating (1 - 5)
- views (any number value)
- timeAdded (any timestamp value)
- etc.
User
User ID
- experiences (0, 1 or more of predefined values)
  - experience1
    - jobCategory
    - jobName
    - timeEmployed
  - experience2
  - etc.
- languages (0, 1 or more of predefined values)
  - language1
    - languageName
    - proficiency
  - language2
  - etc.
- location (any value)
  - long
  - lat
- rating (1 - 5)
- views (any number value)
- timeLastActive (any timestamp value)
- etc.
Filtering by a field which can only have one value is fine, even when I add sorting by "timeAdded" or a range filter.
1) The problem is when I introduce a second range filter, such as jobs with "salary" higher than 15 bucks and at the same time "rating" higher than 4. Is there a way to get around this problem when I have N range filters?
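To make this concrete, here is roughly what I am attempting (a sketch with the Firestore web SDK; collection and field names as in my model above):

import firebase from "firebase/app";
import "firebase/firestore";

const db = firebase.firestore();

// This works: one equality filter plus one range filter,
// ordered by the range-filtered field (needs a composite index).
const works = db.collection("jobs")
  .where("jobType", "==", "Bartender")
  .where("salary", ">", 15)
  .orderBy("salary");

// This is rejected: range filters on two different fields in one query.
const rejected = db.collection("jobs")
  .where("salary", ">", 15)
  .where("rating", ">", 4);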
2) The second problem is that I cannot use a logical OR. Let's say, filter jobs where "jobCategory" is Bartender or Server.
3) Another problem is filtering by fields which can have more than one value, e.g. a user can speak more than one language. If I want to filter users who speak English, it is not possible. Not to mention filtering users who speak e.g. English OR French.
I know I can model the data so that I use the language as the name of a field, like -english = true, but when I introduce a range filter on top of this, I need to create a Firestore composite index, which is very inconvenient since I can have around 20 languages and around 50 job types at the same time, and I would have to create indexes for all the combinations together with the different range filters... is this assumption correct?
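Concretely, the modeling I have in mind looks like this (a sketch, reusing db from the snippet above; the exact field names are just examples):

// One boolean field per language inside a map on the user document:
const user = {
  rating: 4.5,
  languages: { english: true, french: true },
};

// Equality on a map sub-field works on its own...
const speaksEnglish = db.collection("users")
  .where("languages.english", "==", true);

// ...but adding a range filter requires a composite index on
// (languages.english, rating), and one such index per language.
const englishAndRated = speaksEnglish.where("rating", ">", 4);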
4) How would I filter jobs which are up to 20 km from a certain position? The radius and the position are up to the user to choose.
5) What if I want to filter by all those fields at the same time? E.g. filter by a certain "jobCategory", location and radius, "salary" higher than something and "rating" higher than something, and sort it all by "timeAdded".
Is this possible with Firestore / Realtime Database? Can I model the data in some way to support this, or do I have to look for an alternative DB solution? I really like the real-time aspect of it; it will come in handy when it is time to implement a chat feature in the app. Is it solvable with Cloud Functions? I am trying to avoid doing multiple requests, merging them together and sending the result to the client, since there can be any combination of filters.
If not doable with Firebase, do you know of any alternatives similar to Firestore with better querying options? I really hope I am just missing something :)
Thank you!
I am trying to replicate Firebase Cohorts using BigQuery. I tried the query from this post: Firebase exported to BigQuery: retention cohorts query, but the results I get don't make much sense.
I managed to get the users for period_lag 0 similar to what I can see in Firebase; however, the rest of the numbers don't look right:
Results:
One of the period_lag values is missing (I only see 0, 1 and 3, no 2), and the user counts for each lag period don't look right either! I would expect to see something like this:
Firebase Cohort:
I'm pretty sure that the issue is in how I replaced the parameters in the original query with those from Firebase. Here are the bits that I have updated in the original query:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
JOIN `dataset.app_events_*` AS questions
ON questions.user_dim.app_info.app_instance_id = answers.user_dim.app_info.app_instance_id
-- WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
(...)
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2017-11-01'))
ORDER BY cohort, period_lag, period_label
So I'm using user_dim.first_open_timestamp_micros instead of create_date and user_dim.app_info.app_instance_id instead of id and parent_id. Any idea what I'm doing wrong?
I think there is a misunderstanding about how and which data to retrieve into the activities table. Let me state the differences between the case presented in the other StackOverflow question you linked and the case you are trying to reproduce:
In the other question, answers.creation_date refers to a date value that is not fixed and can have different values for a single user. I mean, the same user can post two different answers on two different dates; that way, you will end up with two activities entries, like: {[ID:user1, date:2018-01], [ID:user1, date:2018-02], [ID:user2, date:2018-01]}.
In your question, the use of answers.user_dim.first_open_timestamp_micros refers to a date value that is fixed in the past, because as stated in the documentation, that variable refers to "the time (in microseconds) at which the user first opened the app". That value is unique, and therefore, for each user you will only have one activities entry, like: {[ID:user1, date:2018-01], [ID:user2, date:2018-02], [ID:user3, date:2018-01]}.
I think that is the reason why you are not getting information about the lagged retention of users, because you are not recording each time a user accesses the application, but only the first time they did.
Instead of using answers.user_dim.first_open_timestamp_micros, you should look for another value from the ones available in the documentation link I shared before, possibly event_dim.date or event_dim.timestamp_micros, although you will have to take into account that these fields refer to an event and not to a user, so you should do some pre-processing first. For testing purposes, you can use some of the publicly available BigQuery exports for Firebase.
Finally, as a side note, it is pointless to JOIN the table with itself here (the join on app_instance_id adds no information that is not already in answers), so your edited Standard SQL query would be better written as:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
GROUP BY id, period
I would like to filter out certain sources and mediums (specifically email clients). I need to keep the dimension as one column (I already use the maximum number of dimensions, 7).
The filter works fine when I have only one sourceMedium, such as:
ga:sourceMedium!=amail.centrum.cz / referral
The filter doesn't work at all when I use two sourceMedium values:
ga:sourceMedium!=amail.centrum.cz / referral,ga:sourceMedium!=mail.google.com / referral
It doesn't matter if I use AND or OR; the query doesn't output the desired data.
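For reference, here is roughly how such a request might be issued (a sketch assuming the Node googleapis client against the v3 Core Reporting API; the view ID, dates and auth are placeholders):

import { google } from "googleapis";

async function emailClientSessions(auth: any) {
  const analytics = google.analytics({ version: "v3", auth });
  return analytics.data.ga.get({
    ids: "ga:XXXXXXXX", // placeholder view (profile) ID
    "start-date": "30daysAgo",
    "end-date": "today",
    metrics: "ga:sessions",
    dimensions: "ga:sourceMedium",
    // Per the Core Reporting API docs, ';' combines filters with AND
    // and ',' combines them with OR:
    filters: "ga:sourceMedium!=amail.centrum.cz / referral;ga:sourceMedium!=mail.google.com / referral",
  });
}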
I assume there is supposed to be some delimiter which would identify amail.centrum.cz / referral as one string, delimited from the next one. I already tried using ' at the beginning and at the end of the string, but it seems that it doesn't work.
Is there anything that I missed in the docs, or anything else? Looking for your help :)
BTW: I'm aware of the workaround: pull the data out of GA and filter it manually (compare the output data against my list of email clients that I would like to exclude).
I want to keep track of a multi-stage processing job.
I likely just need the following fields:
batchId (guid) | eventId (guid) | statusId (int) | timestamp | message (string)
There is a relatively small number of events per batch.
I want to be able to easily query events that have a statusId less than n (still being processed or didn't finish processing).
Would using multiple rows for each status change, and querying for the latest status, be the best approach? I would use a global secondary index, but statusId does not seem like a good candidate for a hash key (there are fewer than 10 statuses).
Instead of using multiple rows for every status change, you could update the same event row and use a technique described in the DynamoDB documentation in the section 'Use a Calculated Value'. Basically, this involves adding another attribute (say derivedStatusId) which is derived by appending a random number to statusId at the time of writing to DynamoDB. For example, for a statusId of 2, derivedStatusId could be one of {"2-00", "2-01", .. "2-99"}. Setting up a Global Secondary Index on derivedStatusId would give you some fan-out that will help prevent the index from becoming hot.
If you are sure that you will use this index only for unfinished events, then removing the derivedStatusId attribute from the record when it transitions to a finished status will remove it from the index as well, which is a good property if events are expected to finish processing eventually, even if the records themselves stay around forever. This technique is called a "Sparse Index" and is described in more detail here.
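A sketch of how this could look (assuming the Node aws-sdk DocumentClient; the table name, index name and key schema are hypothetical):

import { DynamoDB } from "aws-sdk";

const dc = new DynamoDB.DocumentClient();
const TABLE = "ProcessingEvents"; // hypothetical table keyed on (batchId, eventId)

// On write, append a random two-digit suffix so the GSI partition key
// has up to 100 distinct values per status.
async function putEvent(batchId: string, eventId: string, statusId: number, message: string) {
  const shard = String(Math.floor(Math.random() * 100)).padStart(2, "0");
  await dc.put({
    TableName: TABLE,
    Item: { batchId, eventId, statusId, message, ts: Date.now(), derivedStatusId: `${statusId}-${shard}` },
  }).promise();
}

// To read all events with a given status, query every shard of the
// hypothetical GSI on derivedStatusId and merge the results. For
// "statusId less than n", repeat this for each status below n.
async function eventsWithStatus(statusId: number) {
  const shards = Array.from({ length: 100 }, (_, i) =>
    dc.query({
      TableName: TABLE,
      IndexName: "derivedStatusId-index",
      KeyConditionExpression: "derivedStatusId = :s",
      ExpressionAttributeValues: { ":s": `${statusId}-${String(i).padStart(2, "0")}` },
    }).promise()
  );
  return (await Promise.all(shards)).flatMap(r => r.Items ?? []);
}

// Sparse index: removing the attribute on completion drops the item
// from the GSI automatically.
async function markFinished(batchId: string, eventId: string) {
  await dc.update({
    TableName: TABLE,
    Key: { batchId, eventId },
    UpdateExpression: "REMOVE derivedStatusId",
  }).promise();
}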
From your question, it seems like keeping a status history is a desired property (I assume this because you want to have multiple rows for status changes). Consider putting this historical information in the same row: DynamoDB supports list data types and also has a generous 400KB item limit, which may just allow you to capture all the desired historical information in the same record.
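For example, each status change could be appended to a list attribute on the same item (same hypothetical client and table as above):

async function recordStatusChange(batchId: string, eventId: string, statusId: number, message: string) {
  await dc.update({
    TableName: TABLE,
    Key: { batchId, eventId },
    // Create the list on first use, then append one history entry.
    UpdateExpression: "SET statusHistory = list_append(if_not_exists(statusHistory, :empty), :entry)",
    ExpressionAttributeValues: {
      ":empty": [],
      ":entry": [{ statusId, message, ts: Date.now() }],
    },
  }).promise();
}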
I have a dataset of objects, e.g. cars. I want to make a system where users are presented an object and can decide whether or not they like it. But I want to show them each car only once.
- allCars
  - car1
  - car2
  ...
  - car348237
- carsLiked
  - user1
    - carsLiked
      - car123
      - car234
    - carsNotLiked
      - car321
  - user2
    - carsLiked
    - carsNotLiked
Given some user, e.g. user1, how can I make a selection from allCars WITHOUT the cars that the user has already seen? In SQL I would do something like "WHERE carId NOT IN (car123, car234, car321)".
Any idea how I can do this in Firebase (without filtering on the client side; I know how to do that)? Is any structure possible, using some kind of index? I struggled for some time but didn't find a solution.
Denormalization is key.
I would replicate the set of all cars in each user's object, and then delete each car reference once that car has been displayed to the user.
cars: {
  CAR_AUTO_ID: {
    // car object
  }
},
users: {
  user1: {
    car_selection: {
      CAR_AUTO_ID: true // reference to the car above
      ...
    },
    cars_liked: {
    },
    cars_disliked: {
    }
  }
}
Coming from SQL it might sound like a lot of replication, but that's the way to go with Firebase.
In case you have something like 10K+ cars, of course the above would be overkill. If users are presented a random car, then I would focus on a random number generator and store only the numbers already picked. In that case the best approach would be a priority-ordered list, with keys generated by something like an incrementing counter.
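A sketch of that flow with the JavaScript SDK (paths follow the structure above; error handling omitted):

import firebase from "firebase/app";
import "firebase/database";

const db = firebase.database();

// One-time fan-out: copy every car key into the user's selection set.
async function seedSelection(uid: string) {
  const cars = await db.ref("cars").once("value");
  const selection: Record<string, boolean> = {};
  cars.forEach(car => { selection[car.key as string] = true; });
  await db.ref(`users/${uid}/car_selection`).set(selection);
}

// Pick the next unseen car: read a single key from the selection set.
async function nextCarId(uid: string) {
  const snap = await db.ref(`users/${uid}/car_selection`).limitToFirst(1).once("value");
  let carId: string | null = null;
  snap.forEach(child => { carId = child.key; });
  return carId;
}

// Record the verdict and drop the car from the selection in one
// multi-location update, so it is never shown again.
async function vote(uid: string, carId: string, liked: boolean) {
  await db.ref().update({
    [`users/${uid}/car_selection/${carId}`]: null,
    [`users/${uid}/${liked ? "cars_liked" : "cars_disliked"}/${carId}`]: true,
  });
}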
I am working with the practice repository in preparation for doing upcoming work with a large enterprise client using BQ. The repository link is: google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910
I have 3 questions in relation to the sample repository and a query that was run (the query that motivated the questions is at the bottom of this post):
1) What is the difference between customDimensions.index, customDimensions.value and hits.customDimensions.index, hits.customDimensions.value?
2) If a single hit has multiple custom dimensions/metrics how is that returned/queried? I only see single dimensions matching at the hit level in the sample data.
3) There are no custom metric values passed in the example data; what will those values look like?
Here is the query that motivated the previous 3 questions:
SELECT hits.page.pagePath AS urls,
hits.time,
customDimensions.index,
customDimensions.value,
hits.customMetrics.index,
hits.customMetrics.value,
trafficSource.medium,
hits.customVariables.index,
hits.customVariables.customVarName,
hits.customVariables.customVarValue
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
Every record in that table represents one Google Analytics session. BigQuery has the concept of nested fields, and that's how individual hits are defined: they are nested inside the hits record.
Answering your questions:
1) customDimensions.index and customDimensions.value are the index and value for user- or session-scoped custom dimensions. hits.customDimensions.index and hits.customDimensions.value are custom dimensions set at the hit scope. The scope is defined when you create the custom dimension through the GA interface. Indexes are integers from 1 to 20 (as defined in the Admin section) and the value is the string passed as the value for that custom dimension. More info about Custom Dimensions/Metrics
2) Both hits and hits.customDimensions are REPEATED records in BigQuery. So in essence every row in that BQ table looks like this:
|- date
|- (....)
+- hits
|- time
+- customDimensions
|- index
|- value
But when you query the data, it is FLATTENed by default. Because of this flattening, if a single hit has multiple custom dimensions or metrics, the result will show multiple rows, one for each.
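For instance, a hit at hits.time 1000 carrying two hit-level custom dimensions would come back flattened into two rows, roughly like this (values are made up):

hits.time  hits.customDimensions.index  hits.customDimensions.value
1000       1                            red
1000       2                            large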
3) They should look the same as customDimensions, but the values are INTEGERs instead of STRINGs.
For a simpler and more educational dataset I suggest that you create a brand new BQ table and load the data provided on this developer document page.
PS: Tell my good friends at Cardinal Path that Eduardo said Hello!