I am running a scheduled task every day. The scheduler calls an API endpoint that calculates each user's income and adds a commission on their purchases. There are around 50K-100K insert operations after checking their current income status.
The problem is that it keeps getting slower. The first 10K records are inserted within 10 minutes; after that, each operation takes more and more time. For 50K purchases it takes around 2 hours. After a certain number of operations the time increases almost exponentially, and the inserts get slower and slower.
How can I make these operations faster?
Thanks
In my project, every user has a document with a field f. My project uses the sum of every user's f field frequently, probably a hundred thousand times a day, and the project is planned to have millions of users.
Obviously it is not efficient to calculate the sum every time I need it, so my plan is to have an additional document to track the sum. Every time a user's f changes, the sum is updated too.
But I think round-off error may accumulate over time, so I plan to recalculate the sum every 24 hours or 7 days.
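Concretely, the plan looks something like this sketch (assuming the Node Admin SDK; the users collection name and the aggregates/fSum document path are just illustrative):

import * as admin from 'firebase-admin';

admin.initializeApp();
const db = admin.firestore();

// Update a user's `f` and bump the running sum atomically in one transaction.
async function updateF(uid: string, newF: number): Promise<void> {
  const userRef = db.collection('users').doc(uid); // assumed collection name
  const sumRef = db.doc('aggregates/fSum');        // assumed aggregate document

  await db.runTransaction(async (tx) => {
    const snap = await tx.get(userRef);
    const oldF = (snap.data()?.f as number) ?? 0;
    tx.update(userRef, { f: newF });
    // set + merge creates the aggregate document if it doesn't exist yet
    tx.set(
      sumRef,
      { total: admin.firestore.FieldValue.increment(newF - oldF) },
      { merge: true },
    );
  });
}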
My problem is: if I have a million documents, does collection.get() still work? What about a billion? I've noticed WriteBatch has a 500-operation limit. Does collection.get() have a limit too?
You can get as many documents as you can fit in memory on the machine where you issued the query. The backend will stream the results until you run out.
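If the goal is the periodic recalculation, one way to avoid materializing millions of snapshots at once is to stream the query and keep only a running total. A rough sketch, assuming the Node Admin SDK and an illustrative users collection with the field f:

import * as admin from 'firebase-admin';

const db = admin.firestore();

// Recalculate the sum by streaming the query instead of calling .get(),
// so documents are processed one at a time rather than all held in memory.
async function recalculateSum(): Promise<number> {
  let total = 0;
  // select('f') keeps the payload down to the single field we need
  const stream = db.collection('users').select('f').stream();
  for await (const doc of stream as AsyncIterable<FirebaseFirestore.QueryDocumentSnapshot>) {
    total += (doc.get('f') as number | undefined) ?? 0;
  }
  return total;
}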
I'm trying to implement Azure MobileServices in Xamarin.Forms, following this tutorial: https://learn.microsoft.com/en-us/azure/developer/mobile-apps/azure-mobile-apps/quickstarts/xamarin-forms/offline
but I noticed that synchronization is very slow. For example, I synchronized a database containing 15 tables and about 60k records, and the entire process took about 6 minutes! The result improves only a little if I rerun the operation on a database that is already synchronized.
Is it possible to improve the entire process?
I have doubts that this technology is widely used, because there is very little documentation on the internet and it is often out of date.
In this case, what are the alternatives?
Firstly, 60K records take a long time to synchronize the first time; that's inevitable given the amount of data to transfer, so 6 minutes is not a surprise.
However, you should implement the pieces needed for incremental sync. That includes ensuring your model has the UpdatedAt and CreatedAt timestamps plus a globally unique ID, and naming your query when you call PullAsync(). Something like:
await mTable.PullAsync("allItems", mTable.CreateQuery()); // the query name enables incremental sync
More information: https://azure.github.io/azure-mobile-apps/howto/client/dotnet/#syncing-an-offline-table
I have configured the Google Analytics raw data export to BigQuery.
Could anyone from the community suggest efficient ways to query the intraday data? We have noticed a problem with the intraday sync (e.g. the 15-minute delay): the volume of streamed data grows almost exponentially at that sync frequency.
For example:
Every day, the (T-1) batch table (ga_sessions_yyyymmdd) syncs 15-20 GB with 3.5M-5M records.
On the other side, the intraday stream (with the 15-minute delay) writes more than ~150 GB per day with ~30M records.
https://issuetracker.google.com/issues/117064598
It is not cost-effective to persist and query this data.
Is this a product bug or expected behavior, given that the exponentially growing data is not cost-effective to use?
Querying BigQuery costs $5 per TB, and streaming inserts cost ~$50 per TB.
In my view, it is not a bug; it is a consequence of how the data is structured in Google Analytics.
Each row is a session, and inside each session you have a number of hits. As we can't afford to wait until a session is completely finished, every time a new hit (or group of hits) occurs, the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).
I have already built some streaming pipelines in Google Dataflow with session windows (not sure if that is what Google uses internally), and I faced the same dilemma: wait and export the aggregate only once, or export continuously and accept the exponential growth.
Some advice I can give you about querying the ga_realtime_sessions table:
Only query the columns you really need (no SELECT *);
use the view that is exported in conjunction with the daily ga_realtime_sessions_yyyymmdd table; it doesn't affect the size of the query, but it will keep you from querying duplicated data (see the sketch below).
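A rough sketch of both points with the Node BigQuery client; the project, dataset, date suffix, and selected columns are placeholders rather than anything from the question:

import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery();

async function intradaySessions(): Promise<void> {
  // Query the exported view (deduplicated) instead of the raw intraday table,
  // and name only the columns you need: BigQuery bills for the bytes in the
  // referenced columns, so SELECT * is what makes these queries expensive.
  const query = `
    SELECT visitId, totals.visits, totals.pageviews
    FROM \`my-project.my_ga_dataset.ga_realtime_sessions_view_20240101\`
  `;
  const [rows] = await bigquery.query({ query });
  console.log(`Fetched ${rows.length} intraday sessions`);
}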
I'm currently testing Firebase on a non-production Firebase app that only I work on.
When I query the database after there has not been any query during the last 24 hours, the query takes about 8 seconds. Once a query is done, the next ones take a normal amount of time (about 100 ms).
This is not about query caching: by "next queries" I mean new queries that are not the same as the first one.
To reproduce it (a rough timing sketch follows the steps):
Create a database node called users, whose children are user records (first name, last name, age, gender, etc.)
Add 500,000 users to this node
Get a user by its UID and measure the time. (It should take about 100ms)
Wait 24 hours (I don't know the exact time, but I'm sure about 24 hours)
Get any user by its UID and measure the time. (It should take about 8sec)
Get any user by its UID and measure the time. (It should take about 100ms)
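One way to run the measurement in the steps above, assuming the Node Admin SDK and the users/<uid> layout described (the databaseURL is a placeholder):

import * as admin from 'firebase-admin';

admin.initializeApp({
  databaseURL: 'https://<your-project>.firebaseio.com', // placeholder
});

// Read a single user by UID and print how long the round trip took.
async function timeUserRead(uid: string): Promise<void> {
  const start = Date.now();
  const snapshot = await admin.database().ref(`users/${uid}`).once('value');
  console.log(`read ${snapshot.key}: ${Date.now() - start} ms, exists=${snapshot.exists()}`);
}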
I want to know whether this is a known issue with the Firebase Realtime Database.
I reached out to Firebase support; they were able to recreate the issue and saw a wait time of about 6 seconds. Here is their answer after the investigation:
It looks like this is intended behavior. Realtime Database queries work by building the index in-memory, which takes time linear in the number of nodes at that location. Once the index is built things are very fast, but the initial build can take a while, especially for large locations.
If you want the index to stay in memory on the database, you should have a listener always listening for this query.
So basically the database takes a long time to process the first query because it has to index the large node.
The problem can be solved by keeping a listener on the database or querying the database every few hours.
In production it is not very likely that you will face this problem, because the database is being accessed by users all the time; but if your database is not accessed constantly and you don't want users to experience that long wait, you should use the solution discussed above.
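A minimal sketch of that keep-warm workaround, assuming the Node Admin SDK; the users path comes from the reproduction steps above, limiting to one child just keeps the warm-up read small, and the query you keep listening to should match the one you actually serve:

import * as admin from 'firebase-admin';

const db = admin.database();

// Attach a listener once at process start and never detach it, so the
// server keeps the index/location for this query warm between requests.
const keepWarm = db.ref('users').orderByKey().limitToFirst(1);
keepWarm.on(
  'value',
  () => {
    // no-op: the point is only to keep the query listening
  },
  (error) => console.error('keep-warm listener cancelled:', error),
);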
Firebase keeps recently used data in its internal cache. This cache is cleared after a few minutes.
But the exact numbers depend on how much data you're loading and how you're loading that data. Without seeing a specific setup that shows how to reproduce these numbers there really isn't much anyone can say.
I am trying to figure out whether a web development project is feasible. So far I have learned that the total size of the proposed database (30 million rows, 5 columns, about 3 GB of storage) is well within my budget in terms of storage requirements. However, because of the anticipated large number of queries that users will make against the database, I am not sure whether the server can provide adequate performance within my budget.
I will be using this grid (a live demo of performance benchmarks for 300,000 rows - http://demos.telerik.com/aspnet-ajax/grid/examples/performance/linq/defaultcs.aspx). Inserting a search term in the "product name" box and pressing enter takes 1.6 seconds from query to rendered results. It seems to me (a newbie) that if 300,000 rows take 1.6 seconds, 30 million rows must take much longer, so I am trying to figure out:
what the increase in time would be as rows are added, up to 30 million
what the increase in time would be for each additional 1000 people using the search grid at the same time.
what hardware requirements are necessary to reduce the delays to an acceptable level
Hopefully, if I can figure that out, I can get a more realistic assessment of feasibility. FYI: the database need not be updated very regularly; it is more for read-only purposes.
Can this problem be prototyped on paper for these 3 points?
Even wide ballpark estimates would help: without considering optimisation, am I talking hundreds of dollars for 5,000 users to have searches below 10 seconds each, thousands, or tens of thousands of dollars?
[It will be the ASP.NET RadControls for AJAX Grid, on one of these cloud-hosted servers: 4,096 MB RAM, 160 GB disk space, and either Microsoft SQL Server 2008 R2 or SQL Server 2012.]
The database need not be updated very regularly; it is more for read-only purposes.
Your search filters allow for substring searches, so regular database indexes are not going to help you and the search will go row by row.
It looks like your data would probably fit in 5 GB of memory or so. I would store the whole thing in memory and search there.
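A leading-wildcard substring match (LIKE '%term%') cannot use an ordinary B-tree index, so the work grows roughly linearly with the row count: if 300,000 rows take about 1.6 seconds end to end, 100x the rows means on the order of 100x the scanning on comparable hardware. That is why holding the few-gigabyte dataset in RAM and scanning it there is attractive. A minimal sketch of the in-memory approach, with an illustrative row shape rather than the actual schema:

// Rows are loaded from the database once at startup (and refreshed on
// whatever schedule the data changes), then searched with a linear scan.
interface ProductRow {
  id: number;
  productName: string;
  unitPrice: number;
  unitsInStock: number;
  category: string;
}

const rows: ProductRow[] = []; // populate from the database at startup

function searchByProductName(term: string): ProductRow[] {
  const needle = term.toLowerCase();
  return rows.filter((r) => r.productName.toLowerCase().includes(needle));
}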