I want to stress-test some things in a Symfony application. For that reason I want to load a rather large amount of dummy data into the DB.
My fixture's load-function looks like this:
public function load(ObjectManager $manager)
{
    for ($i = 1; $i <= 1000000; $i++) {
        [... init entity ...]
        $manager->persist($work);
        if ($i % 800 === 0) {
            var_dump("records done: ".$i);
            $manager->flush();
            $manager->clear();
        }
    }
}
I played around with different batch sizes (somewhere between 50 and 10000) before flushing and clearing, but it doesn't change the general picture: the processing time needed for the individual batches increases bit by bit. In the beginning it's around 4 seconds per 1000 entities, but before even half of the 1000000 entities are written to the DB it's already above 10 seconds per 1000 entities.
Why is that the case?
I thought that with flushing and clearing, the execution time per batch wouldn't degrade significantly.
Are there further options to speed this up?
I'm trying to solve some kind of task route assigning problem with Optaplanner. The task on a user's way can be of a different type. If there occurs a task with the type lunch_delivery then there needs to be a task lunch_pick before it (lunch_pick task always exists in the list of all the tasks).
I'm wondering if I can somehow connect these two tasks as it is necessary that the same user does both the lunch_pick and the lunch_delivery task (and not that one user would pick the lunch and the other would deliver it).
My current approach, which doesn't give good results, is that the task of type lunch_pick has a parameter linked_task holding the ID of the lunch_delivery task. The code in the constraint section looks like this:
Constraint lunchPickConflict(ConstraintFactory factory) {
    return factory.from(Task.class)
            .filter(task -> task.getVisitType().equals("lunch_pick"))
            .penalize("lunch pick not scheduled good",
                    HardMediumSoftScore.ONE_MEDIUM,
                    task -> task.lunchConflictWeight());
}
The lunchConflictWeight function is simple for now and is defined as:
public int lunchConflictWeight(){
    int score = 0;
    // we want that lunch pick and lunch delivery are close to each other and that delivery is after pick-up
    Task nextTask = getNextTask();
    while (nextTask != null){
        if (nextTask.visitType.equals("lunch_delivery") && nextTask.getId() == linked_task){
            // this is what we want - we found a task that is connected with our lunch_pick
            break;
        }
        else if(nextTask.visitType.equals("normal")){
            score +=3;
        }
        nextTask = nextTask.getNextTask();
    }
    if (nextTask == null){
        // the lunch pick was the last task or the delivery is done by some other user -> big penalty
        score = 100;
    }
    return score;
}
What would be a better solution for this problem? Can I somehow force that lunch_pick and lunch_delivery would be done by the same user? When I see my results I can easily see that just swapping two tasks between users would bring significant improvement in the score.
+extra question: Is there a way that one constraint would penalize hard in some cases and medium in others?
A) Usually, we don't assign lunch tasks. Instead, we get a shadow variable to delay the start time of the first task after noon by 1 hour. I think I cover it in the shadow vars video.
B) If lunch_pick and lunch_delivery are always assigned sequentially to the same user, why not just make them one task?
I'm creating an API based on Cosmos DB and ASP.NET Core 3.0, using the Cosmos DB 4.0 preview 1 .NET Core SDK. I implemented paging using the OFFSET and LIMIT clause. I'm seeing the RU charge increase significantly the higher in the page count you go. Example for a page size of 100 items:
Page 1: 9.78 RU
Page 10: 37.28 RU
Page 100: 312.22 RU
Page 500: 358.68 RU
The queries are simply:
SELECT * from c OFFSET [page*size] LIMIT [size]
Am I doing something wrong, or is this expected? Does OFFSET require scanning the entire logical partition? I'm querying against a single partition key with about 10000 items in the partition. It seems like the more items in the partition, the worse the performance gets. (See also comment by "Russ" in the uservoice for this feature).
Is there a better way to implement efficient paging through the entire partition?
Edit 1: Also, I notice that queries in the Cosmos Emulator slow way down too when using OFFSET/LIMIT in a partition with 10,000 items.
Edit 2: Here is my repository code for the query. Essentially, it is wrapping the Container.GetItemQueryStreamIterator() method and pulling out the RU while processing IAsyncEnumerable. The query itself is the SQL string above, no LINQ or other mystery there.
public async Task<RepositoryPageResult<T>> GetPageAsync(int? page, int? pageSize, EntityFilters filters){
    // Enforce default page and size if null
    int validatedPage = GetValidatedPageNumber(page);
    int validatedPageSize = GetValidatedPageSize(pageSize);
    IAsyncEnumerable<Response> responseSet = cosmosService.Container.GetItemQueryStreamIterator(
        BuildQuery(validatedPage, validatedPageSize, filters),
        requestOptions: new QueryRequestOptions()
        {
            PartitionKey = new PartitionKey(ResolvePartitionKey())
        });
    var pageResult = new RepositoryPageResult<T>(validatedPage, validatedPageSize);
    await foreach (Response response in responseSet)
    {
        LogResponse(response, COSMOS_REQUEST_TYPE_QUERY_ITEMS); // Read RU charge
        if (response.Status == STATUS_OK && response.ContentStream != null)
        {
            CosmosItemStreamQueryResultSet<T> responseContent = await response.ContentStream.FromJsonStreamAsync<CosmosItemStreamQueryResultSet<T>>();
            pageResult.Entities.AddRange(responseContent.Documents);
            foreach (var item in responseContent.Documents)
            {
                cache.Set(item.Id, item); // Add each item to cache
            }
        }
        else
        {
            // Unexpected status. Abort processing.
            return new RepositoryPageResult<T>(false, response.Status, message: "Unexpected response received while processing query response.");
        }
    }
    pageResult.Succeeded = true;
    pageResult.StatusCode = STATUS_OK;
    return pageResult;
}
Edit 3:
Running the same raw SQL from cosmos.azure.com, I noticed in query stats:
OFFSET 0 LIMIT 100: Output document count = 100, Output document size = 44 KB
OFFSET 9900 LIMIT 100: Output document count = 10000, Output document size = 4.4 MB
And indeed, inspecting the network tab in the browser reveals 100 separate HTTP queries, each retrieving 100 documents! So OFFSET currently appears to be applied not at the database but at the client, which retrieves EVERYTHING before throwing away the first 99 queries' worth of data. This can't be the intended design? Isn't the query supposed to tell the database to return only 100 items total, in 1 response, not all 10000 so the client can throw away 9900?
Based on the code, the client is skipping the documents, which explains the increase in RUs.
I tested the same scenario in the browser (cosmos.azure.com, which uses the JS SDK) and the behavior is the same: as the offset grows, the RU charge increases.
This is documented in the official documentation, under Remarks: https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-offset-limit
The RU charge of a query with OFFSET LIMIT will increase as the number of terms being offset increases. For queries that have multiple pages of results, we typically recommend using continuation tokens. Continuation tokens are a "bookmark" for the place where the query can later resume. If you use OFFSET LIMIT, there is no "bookmark". If you wanted to return the query's next page, you would have to start from the beginning.
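To make the difference concrete, here is a minimal toy model in Python. It does not use the Cosmos DB SDK and the numbers are not RUs. It only counts how many items each strategy has to walk over, assuming that OFFSET always restarts from the first item while a continuation token resumes where the last page ended. The names `ITEMS`, `page_by_offset` and `page_by_token` are made up for this sketch, and the token is a plain integer here, whereas real continuation tokens are opaque strings.

```python
from typing import List, Optional, Tuple

# Stand-in for one logical partition with 10,000 documents (assumption).
ITEMS = [f"doc-{i:05d}" for i in range(10_000)]


def page_by_offset(offset: int, limit: int) -> Tuple[List[str], int]:
    """Return (page, items_scanned). The scan always starts at item 0."""
    scanned = min(offset + limit, len(ITEMS))
    return ITEMS[offset:offset + limit], scanned


def page_by_token(token: Optional[int], limit: int) -> Tuple[List[str], Optional[int], int]:
    """Return (page, next_token, items_scanned). The token is the resume position."""
    start = token or 0
    page = ITEMS[start:start + limit]
    end = start + len(page)
    next_token = end if end < len(ITEMS) else None  # None means no more pages
    return page, next_token, len(page)


# Walk all 100 pages of 100 items with both strategies and add up the work.
offset_cost = sum(page_by_offset(p * 100, 100)[1] for p in range(100))

token_cost = 0
token: Optional[int] = None
for _ in range(100):
    page, token, scanned = page_by_token(token, 100)
    token_cost += scanned

print(offset_cost, token_cost)
```

In this model the offset walk touches far more items in total than the token walk, which mirrors the growth in RU charge reported in the question.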
I am calling a method "allComments()" in the ionViewWillEnter() method, but it does not bind name and profile every time I enter the view. I am not able to find any valid reason. There's no error in the code.
allComments() {
    this.comments=[];
    let arr=[];
    this.com=[];
    console.log(this.challengeId)
    this.https.get('https://dareost.firebaseio.com/comments/'
        +this.challengeId+'.json').map(res=>res.json()).subscribe(x=>{
        this.com=x
        for(let key in this.com) {
            this.comments.push(this.com[key])
        }
        this.length = this.comments.length;
        for(let i=0;i<this.comments.length;i++){
            let a=this.comments[i]
            console.log(a)
            for(let key in a){
                arr.push(a[key])
                this.arr.push(a[key])
            }
        }
        for(let i=0;i<this.arr.length;i++){
            for(let j=0;j<this.users.length;j++){
                console.log(this.arr[i].commentedBy)
                if(this.arr[i].commentedBy == this.users[j].uid){
                    console.log(this.users[j].name1)
                    this.arr[i].name=this.users[j].name1
                    console.log(this.arr[i])
                    this.arr[i].profile=this.users[j].profileUrl
                }
            }
        }
    });
}
Sometimes it prints this.arr[i].name on the console and sometimes it doesn't.
ionViewWillEnter() only runs for a short time, and heavy loops and heavy processing may not finish during it. It depends on the device's RAM and processing speed: a fast device can process heavy data in that window, while a slow device may not.
So I have a cloud function that is triggered each time a transaction is liked/unliked. This function increments/decrements the likesCount. I've used Firestore transactions to achieve this. I think the problem is that the code inside the transaction block is getting executed multiple times, which may be correct as per the documentation.
But my likes count is being updated incorrectly at certain times.
return firestore.runTransaction(function (transaction) {
    return transaction.get(transRef).then(function (transDoc) {
        let currentLikesCount = transDoc.get("likesCount");
        if (event.data && !event.data.previous) {
            newLikesCount = currentLikesCount == 0 || isNaN(currentLikesCount) ? 1 : transDoc.get("likesCount") + 1;
        } else {
            newLikesCount = currentLikesCount == 0 || isNaN(currentLikesCount) ? 0 : transDoc.get("likesCount") - 1;
        }
        transaction.update(transRef, { likesCount: newLikesCount });
    });
});
Has anyone had a similar experience?
Guys, I finally found out the cause of this unexpected behaviour.
Firestore isn't suitable for maintaining counters if your application is going to be traffic intensive. They have mentioned it in their documentation. The solution they suggest is to use a Distributed counter.
Many realtime apps have documents that act as counters. For example,
you might count 'likes' on a post, or 'favorites' of a specific item.
In Cloud Firestore, you can only update a single document about once
per second, which might be too low for some high-traffic applications.
https://cloud.google.com/firestore/docs/solutions/counters
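The distributed counter idea can be sketched as a small in-memory model. This is not Firestore code: each list slot stands in for one shard document, and the class and method names are invented for the illustration. The point is only that writes are spread over several shards, so no single document takes every update, and the total is obtained by summing the shards on read.

```python
import random


class ShardedCounter:
    """Toy model of a distributed counter: N shards, sum on read."""

    def __init__(self, num_shards: int, seed: int = 0) -> None:
        self.shards = [0] * num_shards      # one slot per shard "document"
        self._rng = random.Random(seed)     # seeded only to keep runs repeatable

    def increment(self, delta: int = 1) -> int:
        """Apply delta to one randomly chosen shard and return its index."""
        shard = self._rng.randrange(len(self.shards))
        self.shards[shard] += delta
        return shard

    def total(self) -> int:
        """Reading the counter means adding up every shard."""
        return sum(self.shards)


counter = ShardedCounter(num_shards=10)
for _ in range(1000):
    counter.increment()       # 1000 likes
for _ in range(200):
    counter.increment(-1)     # 200 unlikes

print(counter.total())
```

The trade-off in this design is that reads become more expensive (one read per shard) in exchange for higher write throughput.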
I wasn't convinced by that approach as it's too complex for a simple use case, which is when I stumbled across the following blog post:
https://medium.com/evenbit/on-collision-course-with-cloud-firestore-7af26242bc2d
These guys used a combination of Firestore and the Firebase Realtime Database, thereby offsetting the weaknesses of each.
Cloud Firestore is sitting conveniently close to the Firebase Realtime
Database, and the two are easily available to use, mix and match
within an application. You can freely choose to store data in both
places for your project, if that serves your needs.
So, why not use the Realtime database for one of its strengths: to
manage fast data streams from distributed clients. Which is the one
problem that arises when trying to aggregate and count data in the
Firestore.
It's not correct to say that Firestore is an upgrade to the Realtime Database (as it is advertised); it is a different database with different purposes, and both can and should coexist in a large-scale application. That's my thought.
It might have something to do with what you're returning from the function, as you have
return transaction.get(transRef).then(function (transDoc) { ... })
And then another return inside that callback, but no return inside the inner-most nested callback. So it might not be executing the transaction.update. Try removing the first two return keywords and add one before transaction.update:
firestore.runTransaction(function (transaction) {
    transaction.get(transRef).then(function (transDoc) {
        let currentLikesCount = transDoc.get("likesCount");
        if (event.data && !event.data.previous) {
            newLikesCount = currentLikesCount == 0 || isNaN(currentLikesCount) ? 1 : transDoc.get("likesCount") + 1;
        } else {
            newLikesCount = currentLikesCount == 0 || isNaN(currentLikesCount) ? 0 : transDoc.get("likesCount") - 1;
        }
        return transaction.update(transRef, { likesCount: newLikesCount });
    });
});
Timeouts
First of all, check your Cloud Functions logs to see if you get any timeout messages.
Function execution took 60087 ms, finished with status: 'timeout'
If so, fix your function so that it returns a promise, for example Promise.resolve(). The log should then show:
Function execution took 344 ms, finished with status: 'ok'
Idempotency
Secondly, write your data so that the function is idempotent. When your function runs, write a value to the document that you are reading. You can then check if that value exists before running the function again.
See this example for ensuring that functions are only run once.
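As a rough illustration of the idempotency check described above, here is a minimal in-memory sketch. It is not Cloud Functions or Firestore code, and `LikesDocument`, `apply_like_event` and the event IDs are hypothetical names. The document remembers which events it has already applied, so a redelivered event changes nothing. In a real database the check and the write would have to happen inside the same transaction.

```python
class LikesDocument:
    """Stand-in for the document that holds the counter."""

    def __init__(self) -> None:
        self.likes_count = 0
        # Marker values written alongside the data: IDs of events already applied.
        self.processed_event_ids = set()


def apply_like_event(doc: LikesDocument, event_id: str, delta: int) -> bool:
    """Apply the event at most once. Return False if it was already applied."""
    if event_id in doc.processed_event_ids:
        return False                      # duplicate delivery, do nothing
    doc.likes_count += delta
    doc.processed_event_ids.add(event_id)
    return True


doc = LikesDocument()
results = [
    apply_like_event(doc, "evt-1", +1),
    apply_like_event(doc, "evt-1", +1),   # the same event delivered again
    apply_like_event(doc, "evt-2", +1),
    apply_like_event(doc, "evt-3", -1),
]

print(results, doc.likes_count)
```

Without the marker check, the redelivered "evt-1" would be counted twice, which is the kind of drift described in the question.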
I am building a recommender system where I use Firebase to store and retrieve data about movies and user preferences.
Each movie can have several attributes, and the data looks as follows:
{
"titanic":
{"1997": 1, "english": 1, "dicaprio": 1, "romance": 1, "drama": 1 },
"inception":
{ "2010": 1, "english": 1, "dicaprio": 1, "adventure": 1, "scifi": 1}
...
}
To make the recommendations, my algorithm requires all the data (movies) as input, which is then matched against a user profile.
However, in production mode I need to retrieve over 10,000 movies. While the algorithm can handle this relatively fast, it takes a lot of time to load this data from Firebase.
I retrieve the data as follows:
firebase.database().ref(moviesRef).on('value', function(snapshot) {
// snapshot.val();
}, function(error){
console.log(error)
});
I am therefore wondering if you have any thoughts on how to speed things up. Are there any plugins or techniques known to solve this?
I am aware that denormalization could help split the data up, but the problem is really that I need ALL movies and ALL the corresponding attributes.
My suggestion would be to use Cloud Functions to handle this.
Solution 1 (Ideally)
If you can calculate suggestions every hour / day / week
You can use a Cloud Functions cron job that fires daily / weekly and calculates recommendations per user. This way you can achieve a result more or less similar to what Spotify does with their weekly playlists / recommendations.
The main advantage of this is that your users wouldn't have to wait for all 10,000 movies to be downloaded. The download would happen in a cloud function, say every Sunday night, which would compile a list of 25 recommendations and save it into the user's data node. You can then download that list when the user accesses their profile.
Your cloud functions code would look like this :
exports.weekly_job = functions.pubsub.topic('weekly-tick').onPublish((event) => {
    // return the promise so the function keeps running until the work is done
    return getMoviesAndUsers();
});
function getMoviesAndUsers () {
    // once() reads the data a single time; on() would leave a listener attached
    return Promise.all([
        firebase.database().ref(moviesRef).once('value'),
        firebase.database().ref(allUsersRef).once('value')
    ]).then(function(snapshots) {
        var movies = snapshots[0].val();
        var allUsers = snapshots[1].val();
        return createRecommendations(movies, allUsers);
    });
}
function createRecommendations (movies, allUsers) {
    // do something magical with movies and allUsers here.
    // then write the recommendations to each user's profile, kind of like this, etc.
    return userRef.update({"userRecommendations" : {"reco1" : "Her", "reco2" : "Black Mirror"}});
}
Forgive the pseudo-code. I hope this gives an idea though.
Then on your frontend you would only have to get the userRecommendations for each user. This way you can shift the bandwidth and computing from the user's device to a cloud function. As for efficiency, without knowing how you calculate recommendations, I can't make any suggestions.
Solution 2
If you can't calculate suggestions every hour / day / week, and you have to do it each time user accesses their recommendations panel
Then you can trigger a cloud function every time the user visits their recommendations page. A quick cheat solution I use for this is to write a value into the user's profile like : {getRecommendations:true}, once on pageload, and then in cloud functions listen for changes in getRecommendations. As long as you have a structure like this :
userID > getRecommendations : true
And if you have proper security rules so that each user can only write to their own path, this method also gives you the correct userID of the user making the request, so you will know which user to calculate recommendations for. A cloud function could most likely pull 10,000 records faster and save the user bandwidth, and would finally write only the recommendations to the user's profile (similar to Solution 1 above). Your setup would look like this:
[Frontend Code]
//on pageload
userProfileRef.update({"getRecommendations" : true});
userRecommendationsRef.on('value', function(snapshot) { gotUserRecos(snapshot.val()); });
[Cloud Functions (Backend Code)]
exports.userRequestedRecommendations = functions.database.ref('/users/{uid}/getRecommendations').onWrite(event => {
    const uid = event.params.uid;
    // once() reads a single time and returns a promise; return the chain so the function waits for it
    return firebase.database().ref(moviesRef).once('value').then(function(moviesSnapshot) {
        const movies = moviesSnapshot.val();
        return firebase.database().ref(userRefFromUID).once('value').then(function(userSnapshot) {
            const usersMovieTasteInformation = userSnapshot.val();
            // do something magical with movies and user's preferences here.
            // then
            return userRecommendationsRef.update({"getRecommendations" : {"reco1" : "Her", "reco2" : "Black Mirror"}});
        });
    });
});
Since your frontend will be listening for changes at userRecommendationsRef, as soon as your cloud function is done, your user will see the results. This might take a few seconds, so consider using a loading indicator.
P.S 1: I ended up using more pseudo-code than originally intended, and removed error handling etc. hoping that this generally gets the point across. If there's anything unclear, comment and I'll be happy to clarify.
P.S. 2: I'm using a very similar flow for a mini-internal-service I built for one of my clients, and it's been happily operating for longer than a month now.
The best practice for Firebase's NoSQL JSON structure is to "avoid nesting data", but you said you don't want to change your data. So, in your case, you can make a REST call to any particular node of the Firebase database (the node of each movie).
Solution 1) You can create a fixed number of threads via a ThreadPoolExecutor. From each worker thread, you can make an HTTP request (REST call) as below. Based on your device's performance and memory, you can decide how many worker threads to run. The code snippet could look something like this:
/* Creates a pool with a fixed number of worker threads that are reused for all submitted tasks */
ExecutorService threadPoolExecutor = Executors.newFixedThreadPool(10); /* you have 10 different worker threads */
OkHttpClient client = new OkHttpClient(); /* OkHttpClient is the HTTP client; one instance can be shared */
for (int i = 0; i < 100; i++) { /* you can load the first 100 movies */
    final int index = i; /* a lambda can only capture (effectively) final variables */
    /* the 10 worker threads read the movies concurrently */
    threadPoolExecutor.execute(() -> {
        /* OkHttp request */
        /* urlStr is the base path, something like "https://earthquakesenotifications.firebaseio.com/movies" */
        /* the Firebase REST API expects ".json" at the end of the path */
        Request request = new Request.Builder().url(urlStr + "/" + index + ".json?print=pretty").build();
        /* Note: Firebase, by default, stores an index for every array.
           Since you are storing all your movies in the movies JSON array,
           it is easiest to read the first (0) from the first worker thread,
           the second (1) from the second worker thread and so on. */
        try (Response response = client.newCall(request).execute()) {
            String str = response.body().string();
            /* use str here, e.g. parse it and store the movie */
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
}
threadPoolExecutor.shutdown();
Solution 2) Solution 1 is not based on the Listener-Observer pattern. Firebase actually has PUSH technology. This means that whenever a particular node changes in the Firebase NoSQL JSON, the client that has a connection listener for that node gets the new data via onDataChange(DataSnapshot dataSnapshot) { }. For this you can attach a listener to the movies node like below:
DatabaseReference moviesReference = FirebaseDatabase.getInstance().getReference().getRoot().child("movies");
moviesReference.addValueEventListener(new ValueEventListener() {
    @Override
    public void onDataChange(DataSnapshot dataSnapshot) {
        for (DataSnapshot o : dataSnapshot.getChildren()) {
            /* you could show the ith movie in a ListView, but even with a RecyclerView, showing each movie one at a time is still slow, */
            /* so store each movie in a Movies ArrayList instead. When everything completes, you can update the RecyclerView */
        }
    }
    @Override
    public void onCancelled(DatabaseError databaseError) {
    }
});
Although you stated your algorithm needs all the movies and all attributes, it does not mean that it processes them all at once. Any computation unit has its limits, and within your algorithm, you probably chunk the data into smaller parts that your computation unit can handle.
Having said that, if you want to speed things up, you can modify your algorithm to parallelize fetching and processing of the data/movies:
              | fetch  | -> |process | -> | fetch  | ...
              |chunk(1)|    |chunk(1)|    |chunk(3)|
(in parallel) | fetch  | -> |process | ...
              |chunk(2)|    |chunk(2)|
With this approach, you can save almost the whole processing time (all but the last chunk) if processing is really faster than fetching (though you have not said how "relatively fast" your algorithm runs compared to fetching all the movies).
This "high level" approach to your problem is probably your best bet if fetching the movies is really slow, although it requires more work than simply activating a hypothetical "speed up" button of a library. It is a sound approach when dealing with large chunks of data.
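A minimal sketch of this overlap in Python, assuming the data can be fetched in independent chunks. `fetch_chunk` and `process_chunk` are hypothetical stand-ins for the real network call and the real algorithm, and the sketch uses a simpler variant than the diagram: a single background worker that fetches one chunk ahead while the current chunk is being processed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_chunk(index: int) -> List[int]:
    """Hypothetical stand-in for a slow fetch of one chunk of movies."""
    return list(range(index * 3, index * 3 + 3))


def process_chunk(chunk: List[int]) -> int:
    """Hypothetical stand-in for the recommendation algorithm on one chunk."""
    return sum(chunk)


def run_pipeline(num_chunks: int,
                 fetch: Callable[[int], List[int]],
                 process: Callable[[List[int]], int]) -> List[int]:
    """Process chunk i while chunk i + 1 is being fetched in the background."""
    results: List[int] = []
    if num_chunks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)              # start the first fetch
        for index in range(num_chunks):
            chunk = pending.result()                 # wait for the current chunk
            if index + 1 < num_chunks:
                pending = pool.submit(fetch, index + 1)  # prefetch the next one
            results.append(process(chunk))           # overlaps with the prefetch
    return results


totals = run_pipeline(4, fetch_chunk, process_chunk)
print(totals)
```

Whether this helps in practice depends on fetching being the slow, I/O-bound part, and on the algorithm being able to work chunk by chunk.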