Dealing with lots of data in Firebase for a recommender system - firebase

I am building a recommender system where I use Firebase to store and retrieve data about movies and user preferences.
Each movie can have several attributes, and the data looks as follows:
{
"titanic":
{"1997": 1, "english": 1, "dicaprio": 1, "romance": 1, "drama": 1 },
"inception":
{ "2010": 1, "english": 1, "dicaprio": 1, "adventure": 1, "scifi": 1}
...
}
To make the recommendations, my algorithm requires all the data (movies) as input, which it matches against a user profile.
However, in production mode I need to retrieve more than 10,000 movies. While the algorithm can handle this relatively fast, it takes a lot of time to load this data from Firebase.
I retrieve the data as follows:
firebase.database().ref(moviesRef).on('value', function(snapshot) {
// snapshot.val();
}, function(error){
console.log(error)
});
I am therefore wondering if you have any thoughts on how to speed things up? Are there any plugins or techniques known to solve this?
I am aware that denormalization could help split the data up, but the problem is really that I need ALL movies and ALL the corresponding attributes.

My suggestion would be to use Cloud Functions to handle this.
Solution 1 (Ideally)
If you can calculate suggestions every hour / day / week
You can use a Cloud Functions cron job that fires daily or weekly and calculates recommendations for each user. This way you can achieve a result more or less similar to what Spotify does with their weekly playlists / recommendations.
The main advantage of this is that your users wouldn't have to wait for all 10,000 movies to be downloaded. Instead, the download happens in a cloud function (say, every Sunday night), which compiles a list of 25 recommendations and saves it into the user's data node, which you can download when the user accesses their profile.
Your cloud functions code would look like this :
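const functions = require('firebase-functions');
const admin = require('firebase-admin'); // assumed: the Admin SDK is used for server-side database access
admin.initializeApp();

exports.weekly_job = functions.pubsub.topic('weekly-tick').onPublish((event) => {
  // Return the promise so the function keeps running until the writes finish.
  return getMoviesAndUsers();
});

function getMoviesAndUsers () {
  // once() reads each node a single time; on() would leave a listener attached.
  return Promise.all([
    admin.database().ref(moviesRef).once('value'),
    admin.database().ref(allUsersRef).once('value')
  ]).then(([moviesSnapshot, usersSnapshot]) => {
    return createRecommendations(moviesSnapshot.val(), usersSnapshot.val());
  });
}

function createRecommendations (movies, allUsers) {
  // do something magical with movies and allUsers here.
  // then write the recommendations to each user's profile, kind of like
  return userRef.update({"userRecommendations" : {"reco1" : "Her", "reco2" : "Black Mirror"}});
  // etc.
}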
Forgive the pseudo-code. I hope this gives an idea though.
Then on your frontend you would only have to get the userRecommendations for each user. This way you can shift the bandwidth and computing from the user's device to a cloud function. As for efficiency, without knowing how you calculate recommendations, I can't make any suggestions.
Solution 2
If you can't calculate suggestions every hour / day / week, and you have to do it each time user accesses their recommendations panel
Then you can trigger a cloud function every time the user visits their recommendations page. A quick cheat solution I use for this is to write a value into the user's profile like : {getRecommendations:true}, once on pageload, and then in cloud functions listen for changes in getRecommendations. As long as you have a structure like this :
userID > getRecommendations : true
And if you have proper security rules so that each user can only write to their own path, this method would also give you the correct userID of the user making the request, so you will know which user to calculate recommendations for. A cloud function could most likely pull 10,000 records faster and save the user bandwidth, and would finally write only the recommendations to the user's profile (similar to Solution 1 above). Your setup would look like this:
[Frontend Code]
//on pageload
userProfileRef.update({"getRecommendations" : true});
userRecommendationsRef.on('value', function(snapshot) { gotUserRecos(snapshot.val()); });
[Cloud Functions (Backend Code)]
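const functions = require('firebase-functions');
const admin = require('firebase-admin'); // assumed: the Admin SDK is used for server-side database access
admin.initializeApp();

exports.userRequestedRecommendations = functions.database.ref('/users/{uid}/getRecommendations').onWrite(event => {
  const uid = event.params.uid;
  // Read both nodes once and return the promise chain so the function waits for the write.
  return Promise.all([
    admin.database().ref(moviesRef).once('value'),
    admin.database().ref(userRefFromUID).once('value')
  ]).then(([moviesSnapshot, userSnapshot]) => {
    const movies = moviesSnapshot.val();
    const usersMovieTasteInformation = userSnapshot.val();
    // do something magical with movies and user's preferences here.
    // then
    return userRecommendationsRef.update({"getRecommendations" : {"reco1" : "Her", "reco2" : "Black Mirror"}});
  });
});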
Since your frontend will be listening for changes at userRecommendationsRef, as soon as your cloud function is done, your user will see the results. This might take a few seconds, so consider using a loading indicator.
P.S 1: I ended up using more pseudo-code than originally intended, and removed error handling etc. hoping that this generally gets the point across. If there's anything unclear, comment and I'll be happy to clarify.
P.S. 2: I'm using a very similar flow for a mini-internal-service I built for one of my clients, and it's been happily operating for longer than a month now.

Firebase's best practice for structuring NoSQL JSON data is to "avoid nesting data", but you said you don't want to change your data. In that case, you can make REST calls to individual nodes (the node of each movie) of the Firebase database.
Solution 1) You can create a fixed number of threads via a ThreadPoolExecutor. From each worker thread, you can make an HTTP (REST) request as below. Based on your device's performance and memory, you can decide how many worker threads to run. The code could look something like this:
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import okhttp3.OkHttpClient; /* assumes OkHttp 3 or later */
import okhttp3.Request;
import okhttp3.Response;

/* Creates a pool with a fixed number of threads that are reused for queued tasks */
ExecutorService threadPoolExecutor = Executors.newFixedThreadPool(10); /* you have 10 different worker threads */
OkHttpClient client = new OkHttpClient(); /* OkHttpClient is the HTTP client; one instance is shared by all threads */
for (int i = 0; i < 100; i++) { /* you can load the first 100 movies */
    final int index = i; /* a lambda can only capture effectively final variables */
    /* the 10 worker threads read the movies 10 at a time */
    threadPoolExecutor.execute(() -> {
        /* OkHttp request */
        /* urlStr can be something like "https://earthquakesenotifications.firebaseio.com/movies" */
        /* the Firebase REST API expects the path to end in .json */
        Request request = new Request.Builder().url(urlStr + "/" + index + ".json?print=pretty").build();
        /* Note: Firebase, by default, stores an index for every array.
           This assumes all your movies are stored in a movies JSON array,
           so you read the first (0) in one task, the second (1) in the next, and so on. */
        try (Response response = client.newCall(request).execute()) {
            String str = response.body().string();
            /* parse str and store the movie */
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
}
threadPoolExecutor.shutdown();
Solution 2) Solution 1 is not based on the listener/observer pattern. Firebase actually uses push technology: whenever a particular node changes in the Firebase NoSQL JSON, every client that has a listener attached to that node gets the new data via onDataChange(DataSnapshot dataSnapshot) { }. For this you can attach a listener to the movies node like below:
DatabaseReference moviesRef = FirebaseDatabase.getInstance().getReference().child("movies");
moviesRef.addValueEventListener(new ValueEventListener() {
    @Override
    public void onDataChange(DataSnapshot dataSnapshot) {
        for (DataSnapshot movie : dataSnapshot.getChildren()) {
            /* You could show each movie in a ListView right away, but even with a RecyclerView, updating the list item by item is still slow. */
            /* So store each movie in a Movies ArrayList instead. */
        }
        /* When everything completes, update the RecyclerView once. */
    }
    @Override
    public void onCancelled(DatabaseError databaseError) {
    }
});

Although you stated your algorithm needs all the movies and all attributes, it does not mean that it processes them all at once. Any computation unit has its limits, and within your algorithm, you probably chunk the data into smaller parts that your computation unit can handle.
Having said that, if you want to speed things up, you can modify your algorithm to parallelize fetching and processing of the data/movies:
| fetch  | -> |process | -> | fetch  | ...
|chunk(1)|    |chunk(1)|    |chunk(3)|
(in parallel) | fetch  | -> |process | ...
              |chunk(2)|    |chunk(2)|
With this approach, you can hide almost all of the processing time (except for the last chunk) if processing really is faster than fetching (though you have not said how "relatively fast" your algorithm runs compared to fetching all the movies).
This "high-level" approach is probably your best chance if fetching the movies is really slow, although it requires more work than simply flipping a hypothetical "speed up" switch in a library. It is nevertheless a sound approach when dealing with large chunks of data.
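A minimal sketch of that overlap in JavaScript, assuming the movies are already split into chunks. fetchChunk and processChunk are hypothetical stand-ins (a timer instead of a real Firebase read, and a simple attribute-overlap score instead of your actual algorithm), so treat this as an illustration of the scheduling only:

```javascript
// Hypothetical stand-in for a network call: resolves with one chunk of movies.
function fetchChunk(chunks, index) {
  return new Promise((resolve) => setTimeout(() => resolve(chunks[index]), 10));
}

// Hypothetical stand-in for the recommender's per-chunk work:
// counts how many attributes each movie shares with the user profile.
function processChunk(chunk, profile) {
  return Object.entries(chunk).map(([title, attrs]) => ({
    title,
    score: Object.keys(attrs).filter((a) => profile[a]).length,
  }));
}

// Start fetching chunk i+1 before processing chunk i, so processing
// overlaps with the next download instead of waiting for it.
async function pipeline(chunks, profile) {
  const results = [];
  let pending = fetchChunk(chunks, 0);
  for (let i = 0; i < chunks.length; i++) {
    const chunk = await pending;
    if (i + 1 < chunks.length) pending = fetchChunk(chunks, i + 1);
    results.push(...processChunk(chunk, profile));
  }
  return results;
}
```

Calling pipeline(chunks, userProfile) returns a promise for the scored movies in chunk order. Only one fetch is in flight at a time here; whether fetching several chunks concurrently helps depends on your connection and data size.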

Related

I don't understand the order that code executes when calling onAppear

I have run into this problem a few times now while coding, and I think I just don't understand the order in which SwiftUI executes the code.
I have a method in my context model that gets data from Firebase, which I call in .onAppear. But the method doesn't execute its last line after running the whole for loop.
And when I set breakpoints in different places, it seems that the code first runs straight through without entering the for loop, then returns to the method and does one iteration of the for loop, then jumps to some other strange place and then back to the method again...
I guess I just don't get it?
Has it something to do with main/background thread? Can you help me?
Here is my code.
Part of my UI-view that calls the method getTeachersAndCoursesInSchool
VStack {
//Title
Text("Settings")
.font(.title)
Spacer()
NavigationView {
VStack {
NavigationLink {
ManageCourses()
.onAppear {
model.getTeachersAndCoursesInSchool()
}
} label: {
ZStack {
// ...
}
}
}
}
}
Here is the for-loop of my method:
//Get a reference to the teacher list of the school
let teachersInSchool = schoolColl.document("TeacherList")
//Get teacherlist document data
teachersInSchool.getDocument { docSnapshot, docError in
if docError == nil && docSnapshot != nil {
//Create temporary modelArr to append teachermodel to
var tempTeacherAndCoursesInSchoolArr = [TeacherModel]()
//Loop through all FB teachers collections in local array and get their teacherData
for name in teachersInSchoolArr {
//Get reference to each teachers data document and get the document data
schoolColl.document("Teachers").collection(name).document("Teacher data").getDocument {
teacherDataSnapshot, teacherDataError in
//check for error in getting snapshot
if teacherDataError == nil {
//Load teacher data from FB
//check for snapshot is not nil
if let teacherDataSnapshot = teacherDataSnapshot {
do {
//Set local variable to teacher data
let teacherData: TeacherModel = try teacherDataSnapshot.data(as: TeacherModel.self)
//Append teacher to total contentmodel array of teacherdata
tempTeacherAndCoursesInSchoolArr.append(teacherData)
} catch {
//Handle error
}
}
} else {
//TODO: Error in loading data, handle error
}
}
}
//Assign all teacher and their courses to contentmodel data
self.teacherAndCoursesInSchool = tempTeacherAndCoursesInSchoolArr
} else {
//TODO: handle error in fetching teacher Data
}
}
The method assigns data correctly to the tempTeacherAndCoursesInSchoolArr but the method doesn't assign the tempTeacherAndCoursesInSchoolArr to self.teacherAndCoursesInSchool in the last line. Why doesn't it do that?
Most of Firebase's API calls are asynchronous: when you ask Firestore to fetch a document for you, it needs to communicate with the backend, and - even on a fast connection - that will take some time.
To deal with this, you can use two approaches: callbacks and async/await. Both work fine, but you might find that async/await is easier to read. If you're interested in the details, check out my blog post Calling asynchronous Firebase APIs from Swift - Callbacks, Combine, and async/await | Peter Friese.
In your code snippet, you use a completion handler for handling the document that getDocument returns once the asynchronous call completes:
schoolColl.document("Teachers").collection(name).document("Teacher data").getDocument { teacherDataSnapshot, teacherDataError in
// ...
}
However, the code for assigning tempTeacherAndCoursesInSchoolArr to self.teacherAndCoursesInSchool is outside of the completion handler, so it will be called before the completion handler is even called.
You can fix this in a couple of ways:
Use Swift's async/await for fetching the data, and then use a Task group (see Paul's excellent article about how they work) to fetch all the teachers' data in parallel, and aggregate them once all the data has been received.
You might also want to consider using a collection group query - it seems like your data is structured in a way that should make this possible.
Generally, iterating over the elements of a collection and performing Firestore queries for each of the elements is considered a bad practice as it drags down the performance of your app, since it will perform N+1 network requests when instead it could just send one single network request (using a collection group query).

Firebase firestore complicated query

I'm wondering if this is possible, and if it's a good solution to my problem.
I want users to be able to subscribe to content. The content is associated with an id.. for instance:
'JavaScript': 1,
'C++': 2,
'Python': 3,
'Java': 4,
Let's say a user subscribes to 1, 3, and 4.
So their user json data would appear as:
'subscribed_to': [1,3,4]
Now in my firestore, I have posts. Each post gets assigned a content_id (1-4 for instance), so when I query for the content that this user is subscribed to, how would I do that as efficiently as possible?
This is indeed a complex but common case. I would recommend setting up a data structure similar to:
{
"subscriptions": {
javascript: { ... },
python: { ... },
java: { ... }
},
"users": {
1: {
subscribed_to: ['javascript', 'python']
}
}
}
It's very important that your subscribed_to prop uses the doc names, because this is the part that allows you to query them (the docs).
The big problem: how do I query this data? I don't have joins!
Case 1:
Assuming you already have your user data when the app loads...
// db is assumed to be an initialized Firestore instance
const fetchDocs = async (collection, docs) => {
  const fetches = docs.map(id => collection.doc(id).get())
  const responses = await Promise.all(fetches)
  return responses.map(r => r.data())
}
const userSubscriptions = ['javascript', 'python']
// pass a collection reference, since fetchDocs calls collection.doc(id)
const mySubscriptions = await fetchDocs(db.collection('subscriptions'), userSubscriptions)
Behind the scenes, the SDK will group all the queries and do its best to deliver them together. This works well, but I'm 99% sure you still have to pay for each query individually each time the user logs in.
Case 2:
Create a dashboard view collection and pre-calculate the dashboard behind the scenes. This approach involves cloud functions that listen for changes to the user (and maybe to the subscriptions as well) and copy each individual doc into another collection, let's say subscriptions_per_users. This is a more complex approach and would take more time to explain, but if you have a big application where costs are important and the user is going to subscribe to a lot of things, then you might want to look into it.
Source: My own experience... a little of google can also help, there seems to be a lot of opinions about it, find what works best for you.

Firebase real time database, get values from different locations in one snapshot

I am working on using the Firebase database in a Unity project. I read that you want to structure your database as flat as possible for performance, so for a leaderboard I am structuring it like this:
users
uid
name
uid
name
leaderboard
uid
score
uid
score
I want a class that can get this data and trigger a callback when it is done. I run this in my constructor.
root.Child("leaderboard").OrderByChild("score").LimitToLast(_leaderboardCount).ValueChanged += onValueChanged;
To trigger the callback with the data I wrote this.
void onValueChanged(object sender, ValueChangedEventArgs args)
{
int index = 0;
List<string> names = new List<string>();
foreach (DataSnapshot snapshot in args.Snapshot.Children)
{
root.Child("players").Child(snapshot.Key).GetValueAsync().ContinueWith(task =>
{
if (task.Result.Exists)
{
names.Add(task.Result.Child("name").Value.ToString());
index++;
if (index == args.Snapshot.ChildrenCount)
{
_callback(names);
}
}
});
}
}
I am wondering if there is a better way to do this that I am missing? I'm worried if the tasks finish out of order my leaderboard will be jumbled. Can I get the names and scores in one snapshot?
Thanks!
You can only load:
a single complete node
a subset of all child nodes matching a certain condition under a single node
It seems that what you are trying to do is get a subset of the child nodes from multiple roots, which isn't possible. It actually looks like you're trying to do a join, which on Firebase is something you have to do with client-side code.
Note that loading the subsequent nodes is often a lot more efficient than developers expect, since Firebase can usually pipeline the requests (everything goes over a single web socket connection). For more on this, see http://stackoverflow.com/questions/35931526/speed-up-fetching-posts-for-my-social-network-app-by-using-query-instead-of-obse/35932786#35932786.
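On the ordering worry from the question: a client-side join can keep the leaderboard order if the lookups are collected by position rather than by completion time. Here is a minimal sketch of the idea in JavaScript rather than C#. fetchName is a hypothetical stand-in for the per-user lookup, with random delays so the lookups finish out of order; in Unity, Task.WhenAll gives the same guarantee of results in input order.

```javascript
// Hypothetical stand-in for a per-user lookup (one read per uid);
// the random delay makes lookups complete in an unpredictable order.
function fetchName(users, uid) {
  const delay = Math.floor(Math.random() * 20);
  return new Promise((resolve) => setTimeout(() => resolve(users[uid].name), delay));
}

// Join leaderboard entries with user names. Promise.all resolves to an array
// in the order of the input promises, regardless of which lookup finishes first.
async function joinLeaderboard(leaderboard, users) {
  const names = await Promise.all(
    leaderboard.map((entry) => fetchName(users, entry.uid))
  );
  return leaderboard.map((entry, i) => ({ name: names[i], score: entry.score }));
}
```

The callback then fires once, after all names have arrived, with rows in the same order as the query result.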

Firestore transactions getting triggered multiple times resulting in wrong data

So I have a cloud function that is triggered each time a transaction is liked/unliked. This function increments/decrements the likesCount. I've used firestore transactions to achieve the same. I think the problem is the Code inside the Transaction block is getting executed multiple times, which may be correct as per the documentation.
But my likes count is being updated incorrectly at certain times.
return firestore.runTransaction(function (transaction) {
return transaction.get(transRef).then(function (transDoc) {
let currentLikesCount = transDoc.get("likesCount");
if (event.data && !event.data.previous) {
newLikesCount = currentLikesCount == 0 || isNaN(currentLikesCount) ? 1 : transDoc.get("likesCount") + 1;
} else {
newLikesCount = currentLikesCount == 0 || isNaN(currentLikesCount) ? 0 : transDoc.get("likesCount") - 1;
}
transaction.update(transRef, { likesCount: newLikesCount });
});
});
Has anyone had a similar experience?
Guys finally found out the cause for this unexpected behaviour.
Firestore isn't suitable for maintaining counters if your application is going to be traffic intensive. They have mentioned it in their documentation. The solution they suggest is to use a Distributed counter.
Many realtime apps have documents that act as counters. For example,
you might count 'likes' on a post, or 'favorites' of a specific item.
In Cloud Firestore, you can only update a single document about once
per second, which might be too low for some high-traffic applications.
https://cloud.google.com/firestore/docs/solutions/counters
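For illustration, the idea behind a distributed counter can be sketched in plain JavaScript. This is an in-memory simulation, not Firestore code: createCounter, increment and getCount are hypothetical names, and each array element stands in for one shard document.

```javascript
// In-memory stand-in for a counter made of several shard documents.
function createCounter(numShards) {
  return { shards: Array.from({ length: numShards }, () => ({ count: 0 })) };
}

// Each write picks one shard at random, so concurrent writers
// rarely contend on the same document.
function increment(counter, delta = 1) {
  const shardId = Math.floor(Math.random() * counter.shards.length);
  counter.shards[shardId].count += delta;
}

// Reading the total means summing over all shards.
function getCount(counter) {
  return counter.shards.reduce((sum, shard) => sum + shard.count, 0);
}
```

The trade-off is visible even in the sketch: writes get cheaper to spread out, but every read of the total has to touch all shards.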
I wasn't convinced with that approach as it's too complex for a simple use case, which is when I stumbled across the following blog
https://medium.com/evenbit/on-collision-course-with-cloud-firestore-7af26242bc2d
They used a combination of Firestore and the Firebase Realtime Database, so that each covers the other's weaknesses.
Cloud Firestore is sitting conveniently close to the Firebase Realtime
Database, and the two are easily available to use, mix and match
within an application. You can freely choose to store data in both
places for your project, if that serves your needs.
So, why not use the Realtime database for one of its strengths: to
manage fast data streams from distributed clients. Which is the one
problem that arises when trying to aggregate and count data in the
Firestore.
It's not correct to say that Firestore is an upgrade to the Realtime Database (as it is advertised); it is a different database with a different purpose, and both can and should coexist in a large-scale application. That's my thought.
It might have something to do with what you're returning from the function, as you have
return transaction.get(transRef).then(function (transDoc) { ... })
And then another return inside that callback, but no return inside the inner-most nested callback. So it might not be executing the transaction.update. Try removing the first two return keywords and add one before transaction.update:
firestore.runTransaction(function (transaction) {
transaction.get(transRef).then(function (transDoc) {
let currentLikesCount = transDoc.get("likesCount");
if (event.data && !event.data.previous) {
newLikesCount = currentLikesCount == 0 || isNaN(currentLikesCount) ? 1 : transDoc.get("likesCount") + 1;
} else {
newLikesCount = currentLikesCount == 0 || isNaN(currentLikesCount) ? 0 : transDoc.get("likesCount") - 1;
}
return transaction.update(transRef, { likesCount: newLikesCount });
});
});
Timeouts
First of all, check your Cloud Functions logs to see if you get any timeout messages.
Function execution took 60087 ms, finished with status: 'timeout'
If so, fix your function so that it returns a promise (for example Promise.resolve()). The log should then show:
Function execution took 344 ms, finished with status: 'ok'
Idempotency
Secondly, write your data so that the function is idempotent. When your function runs, write a value to the document that you are reading. You can then check if that value exists before running the function again.
See this example for ensuring that functions are only run once.
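A minimal sketch of that idempotency check, again as an in-memory simulation rather than real Cloud Functions code. It assumes each trigger delivery carries a unique event ID; applyLike and processedEvents are hypothetical names.

```javascript
// In-memory stand-in for the document being updated. processedEvents plays
// the role of the marker value that is written to the document.
function applyLike(doc, eventId, delta) {
  // A repeated delivery of the same event is detected and skipped.
  if (doc.processedEvents[eventId]) return false;
  doc.likesCount = (doc.likesCount || 0) + delta;
  doc.processedEvents[eventId] = true;
  return true;
}
```

In a real function, the check and both writes would have to happen inside the same transaction, otherwise two deliveries of the same event could still both pass the check.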

Re-create template while switching routes

How can we re-create a template while switching routes?
For example, I have a subscriber template. It detects when the user scrolls down to the bottom of the page and subscribes to more data. It takes several parameters.
Example:
amazing_page.html
{{#each}}
{{amazing_topic}}
{{/each}}
{{>subscriber name='topics' count=5}}
subscriber.js
//rough sample code
Template.subscriber.onCreated(function() {
var self = this;
var type = Template.currentData().name;
var count = Template.currentData().count;
var user = Template.currentData().user;
var skipCount = 0;
self.autorun(function(c){
self.subscribe(type, skipCount, user);
var block = true;
$(window).scroll(function() {
if (($(window).scrollTop() + $(window).height()) >= ($(document).height()) && block) {
block = false;
skipCount = skipCount + count;
console.log(type);
console.log(skipCount);
self.subscribe(type, skipCount, user, {
onReady: function() {
block = true;
},
onStop: function() {
console.log('stopped');
}
});
}
});
})
});
I use this template with different parameters in different routes.
The problem is that if the user visits several routes and then scrolls down on one page, all the subscriber instances created on the other pages still run on this page. Moreover, they keep the increased values of their variables and execute all of their logic.
I found a poor workaround: comparing Route.getName (for example) with the name parameter of the subscriber. It is not the best option. Can someone help me find a good practice for this? :)
Simple Example:
We have 3 different routes:
1)News
2)Videos
3)Topics
Each of these route templates includes the subscriber template, and subscription on scroll works fine.
Ok, now let's visit all of them: News, Videos, Topics.
Good, now scroll down and... I have three instances of the subscriber template, each of which subscribes to its own publication, because they are not destroyed when we switch routes.
As a result, when the user scrolls the Topics page, subscriptions for News and Videos are triggered too, and data from those collections is loaded as well ;)
And - this is a problem :)
UPD:
Looks like we found a solution. If I use Template.instance (autorun/subscribe), it starts working as expected, except for some strange cases :)
First of all, when I go to another route, the next iteration (scroll down) returns data from the old, destroyed template plus an error. On the following iteration it starts subscribing to the correct data. Hmm... it looks like I have a mistake in the autorun section... or not?
It sounds like you have multiple subscriptions to the same collection and that therefore the list of documents shown in various contexts can change in unexpected ways. Meteor manages multiple subscriptions on the same collection by synchronizing the union of the selected documents.
The simplest way to manage each of your views is to make sure that the data context for a particular view uses a .find() with the query you need. This will typically be the same query that your publication is using.
A different but less efficient approach is to .stop() the subscription when you leave a view.
