Typical causes for a SoftLayer autoscale group to stay in "scaling" or "suspended"

When we deploy an autoscale group in SoftLayer, we sometimes see it stay in "scaling" status for hours or even days. At other times groups end up suspended for reasons that are not obvious to us. Could you list a few typical reasons why an autoscale group stays in "scaling" or "suspended" mode indefinitely?
In my experience this happens when I use a very large image or create very large VM instances. It also happens when our account hits an hourly limit that prevents the autoscale group's members from being created. There is also a chance that compute resources such as CPU, memory, or disk are exhausted, or nearly exhausted, in the data center where I am about to create the autoscale group. Is there a complete list of reasons we can refer to, in order to better plan future use?

Check out this link: https://knowledgelayer.softlayer.com/articles/auto-scale-terms#status
Scaling: Used when any members are not yet provisioned or have an active provisioning transaction running. In scaling status, an auto scale group cannot be triggered for scaling and only non-actionable properties can be edited.
Busy: Used when any members have an active transaction but the transaction is not a provisioning or reclaim transaction. In busy status, an auto scale group cannot be triggered for scaling and only non-actionable properties can be edited.
Suspended: Used when a member is not scaling or busy, but has been set to Suspended. In Suspended status, an auto scale group cannot be triggered for scaling; however, any property can be edited. Any actions that would occur as a result of a group create or edit don't occur until the user resumes the auto scale group. An auto scale group can be suspended manually by the user or automatically if there is an error scaling the auto group for any reason. An auto scale group can only be resumed by a user.
Active: An auto scale group is active if it is not scaling, busy, or suspended. When active, an auto scale group value can be edited and all triggers can be invoked.
This is the documentation available about the statuses; that information should help you understand better what is going on. If you see "scaling", it is because there are provisioning processes running (your VMs may be very large so provisioning takes a long time, several members may have provisioning processes running, or the provisions may not have been approved yet).
"Suspended" indicates an error of some kind, for example when the data center does not have enough capacity, a provisioning has failed, or the other reasons you mentioned. There can be many causes for an error, and there is no documentation listing all of them.
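If you want to check the group's status programmatically instead of through the portal, you can query the SoftLayer API. Below is only a rough sketch against the public REST endpoint; the group ID and credentials are placeholders, it assumes Node 18+ for the built-in fetch, and the exact object mask you need may differ:

```typescript
// Sketch: read an auto scale group's current status via the SoftLayer REST API.
// SL_USERNAME, SL_API_KEY and GROUP_ID are placeholders.
const SL_USERNAME = process.env.SL_USERNAME ?? "my-user";
const SL_API_KEY = process.env.SL_API_KEY ?? "my-api-key";
const GROUP_ID = 123456; // your SoftLayer_Scale_Group id

async function getScaleGroupStatus(groupId: number): Promise<void> {
  const url =
    `https://api.softlayer.com/rest/v3.1/SoftLayer_Scale_Group/${groupId}/getObject.json` +
    `?objectMask=${encodeURIComponent("mask[status]")}`;

  const res = await fetch(url, {
    headers: {
      // SoftLayer REST calls authenticate with HTTP basic auth (username:apiKey).
      Authorization:
        "Basic " + Buffer.from(`${SL_USERNAME}:${SL_API_KEY}`).toString("base64"),
    },
  });
  if (!res.ok) throw new Error(`SoftLayer API error: ${res.status}`);

  const group = await res.json();
  // status.keyName is typically ACTIVE, SCALING, BUSY or SUSPENDED.
  console.log("auto scale group status:", group.status?.keyName);
}

getScaleGroupStatus(GROUP_ID).catch(console.error);
```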
Regards

Related

How to handle offline aggregation using Firestore?

I have been scouring the internet for days on a solution to this problem.
That is, how do I handle aggregation when there is no network connection? I have a task management app that aggregates metadata about user tasks. For example, a task can contain tags that are aggregated and shown to the user in a daily dashboard. This would be easy if the user were always online, since I could use a transaction or a Cloud Function to aggregate, but when the user is offline the aggregation will appear to be incorrect until the user restores their network connection.
Aggregation queries are explained here:
https://firebase.google.com/docs/firestore/solutions/aggregation
Which states a limitation:
Offline support - Client-side transactions will fail when the user's device is offline, which means you need to handle this case in your app and retry at the appropriate time.
However, there has yet to be any example or documentation on how to 'handle this case'. How would I go about addressing this problem?
Some thoughts:
I could cache the item if a transaction fails. This item would then be aggregated on top of the stored aggregation. However, going down this route would mean I can't take advantage of Firestore's "offline mode", because I would be using my own cache on every write while offline anyway.
I could aggregate on demand. That is, never store the aggregation. This is going to be very heavy on read depending on how many tasks a user has. Furthermore, if the aggregation will need to be shared as insights to other users, this option will not work because other users do not have access to the tasks.
I'm at a loss and any help would be appreciated, thanks!
After a lot of research and trial and error I found a solution that can address this problem gracefully.
FieldValue.increment to the rescue.
What FieldValue.increment does is bypass the use of a transaction while respecting Firestore's default offline cache behaviour. It requires using set or update on the field directly. The drawback is that you cannot use withConverter on the collection for type safety. I'm willing to live with that drawback considering how useful FieldValue.increment is.
I've done multiple tests and can confirm that the values can be incremented/decremented multiple times locally while offline. This offline value is reflected in a get or snapshot call to the cache. When the network connection is restored, the values are updated on the server.
The value itself is not stored in the cache; the cache simply stores the "difference" in the FieldValue sentinel for when it is time to update it on the server.
This method only works with incrementing and decrementing values. Storing averages will not be possible using this method. That is because the true total number of items is not known at the time of its calculation when offline.
Instead, the total number of items is stored alongside the total value. The average is then calculated when and as needed. This way the average is always accurate from a local perspective when offline, and it is also accurate online once the total value and count have been synced.
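To make this concrete, here is a minimal sketch of the pattern with the modular web SDK; the collection name, field names, and document layout are assumptions for the example, not something from the original answer:

```typescript
import { initializeApp } from "firebase/app";
import { doc, getDoc, getFirestore, increment, setDoc } from "firebase/firestore";

const app = initializeApp({ /* your Firebase config */ });
const db = getFirestore(app);

// Record a completed task's value, online or offline. The write updates the
// local cache immediately and syncs once connectivity returns. The returned
// promise only resolves after the server acknowledges the write, so we
// deliberately don't await it here.
function recordTaskValue(userId: string, value: number): void {
  const statsRef = doc(db, "userStats", userId); // hypothetical layout
  setDoc(
    statsRef,
    {
      totalValue: increment(value), // running total
      taskCount: increment(1),      // number of items contributing to the total
    },
    { merge: true }
  ).catch(console.error);
}

// The average is derived on read rather than stored, so it stays consistent
// with whatever has been aggregated locally while offline.
async function readAverage(userId: string): Promise<number> {
  const snap = await getDoc(doc(db, "userStats", userId));
  const data = snap.data() ?? { totalValue: 0, taskCount: 0 };
  return data.taskCount > 0 ? data.totalValue / data.taskCount : 0;
}
```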

Cloud Firestore throttling high-volume update syncing

(Note: sorry if I am using the relational DB terms here.)
Let's say I have ten clients connected to a database. This database has a sustained throughput of about 1k updates per second. Obviously, sending 1k updates per second to a web browser (let's say 1MB of data changes per second) is not going to be a good experience for the end user. Does Firebase have any controls as to how much data a client can 'accept' before it starts throttling it? I understand it may batch requests, but my point here is that Google can accept data/updates faster than a browser (potentially on a phone with a weak internet connection) can consume them, so what controls or techniques are in place to manage this experience for the end user?
The only items I see from the docs are:
You should not update a single document more than once per second. If you update a document too quickly, then your application will experience contention, including higher latency, timeouts, and other errors.
https://firebase.google.com/docs/firestore/best-practices#updates_to_a_single_document
This topic is covered here; setting aside the language the code is written in, the linked code in that answer can assist.
In general terms, if your client application is configured to listen for Firestore updates, it will receive all the update events for that listener (just as you mentioned is happening).
You can consider polling Firebase for changes. The poll can even be an extension of the client application code where the code tracks the frequency of the updates being received and has a maximum value of updates per second which, when reached, results in the client disconnecting as a listener and performing periodic polls for the data.
The listener could then be re-established after a period to continue the normal workflow when there are fewer updates again.
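A rough sketch of that fallback with the web SDK, where the threshold, polling interval, and document path are made-up values:

```typescript
import { doc, getDoc, onSnapshot } from "firebase/firestore";
import { db } from "./firebase"; // assumes an initialized Firestore instance

const MAX_UPDATES_PER_SECOND = 5; // made-up threshold
const POLL_INTERVAL_MS = 10_000;  // made-up polling interval

let unsubscribe: (() => void) | null = null;
let pollTimer: ReturnType<typeof setInterval> | null = null;
let updatesThisSecond = 0;

function startListening(path: string): void {
  unsubscribe = onSnapshot(doc(db, path), (snap) => {
    updatesThisSecond++;
    render(snap.data());
    if (updatesThisSecond > MAX_UPDATES_PER_SECOND && unsubscribe) {
      // Too many updates: drop the listener and fall back to periodic polling.
      unsubscribe();
      unsubscribe = null;
      pollTimer = setInterval(async () => {
        const polled = await getDoc(doc(db, path));
        render(polled.data());
        // Once the update rate drops again, clear pollTimer and call
        // startListening(path) to restore the normal realtime workflow.
      }, POLL_INTERVAL_MS);
    }
  });
}

// Reset the per-second counter once a second.
setInterval(() => (updatesThisSecond = 0), 1_000);

function render(data: unknown): void {
  console.log("latest data:", data);
}
```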
The above being said, this is not optimal and treats the symptom rather than the cause. If a listener is returning too many updates, you should consider the structure of the data and look to isolate the updates to only require updates to listeners that require it.
Similarly, the large updates can be mitigated by ensuring smaller records contain the changes resulting in less data.
A generalized example is where two fields of data are updated, but the record is 150 fields in size. Rather than returning the full 150 fields, shard the fields into different data sets, so the two fields are in their own record with an additional reference field used to correlate with a second data set of the remaining 148 fields (plus the reference field).
When the smaller record is updated, the client application receives the small update, determines if the update is applicable to itself, and if so, fetches the corresponding larger record.
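A small sketch of that shape with the web SDK, where the two collection names, the reference field, and the relevance check are invented for the example:

```typescript
import { doc, getDoc, onSnapshot } from "firebase/firestore";
import { db } from "./firebase"; // assumes an initialized Firestore instance

interface LiveStatus {
  ownerId: string;   // used to decide whether the update matters to this client
  value: number;     // one of the two frequently-changing fields
  recordRef: string; // id of the large, rarely-changing record
}

// "liveStatus" holds only the hot fields plus a reference; "records" holds the
// other ~148 fields. Clients listen to the small documents and fetch the large
// record only when an update is actually relevant to them.
function watchRecord(recordId: string, currentUserId: string) {
  return onSnapshot(doc(db, "liveStatus", recordId), async (snap) => {
    const status = snap.data() as LiveStatus | undefined;
    if (!status || status.ownerId !== currentUserId) return;

    // Only now pay for the large read.
    const full = await getDoc(doc(db, "records", status.recordRef));
    console.log("full record:", full.data());
  });
}
```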
To prevent high volumes of writes from overwhelming the client's snapshot listeners, you could periodically duplicate the writes to a proxy collection that the client watches instead.
Documents would need a field to record the time of the last duplicate write to the proxy collection, and the process performing the writes should avoid making writes to the proxy collection until after the frequency duration has elapsed.
A small number of unnecessary writes may still occur due to any concurrent processes you have, but these might be insignificant in practice (with a reasonably long duplication frequency).
If the data belongs to a user, rather than being global data, then you could conceivably adjust the frequency of writes per user to suit their connection, either dynamically or based on user configuration.
In this way, your processes get to control the frequency of writes seen by clients, without needing to throttle or otherwise reject ingress writes (which would presumably be bad news for the upstream processes).
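A minimal sketch of that write path with the Admin SDK; the collection names, the lastMirroredAt field, and the 30-second duplication window are assumptions:

```typescript
import { initializeApp } from "firebase-admin/app";
import { FieldValue, getFirestore } from "firebase-admin/firestore";

initializeApp();
const db = getFirestore();

const MIRROR_INTERVAL_MS = 30_000; // assumed duplication frequency

// Write the hot document as often as needed, but mirror it to the proxy
// collection (which clients listen to) at most once per interval.
async function writeWithMirror(docId: string, data: Record<string, unknown>) {
  const liveRef = db.collection("hotData").doc(docId);
  const proxyRef = db.collection("hotDataProxy").doc(docId);

  await liveRef.set({ ...data, updatedAt: FieldValue.serverTimestamp() }, { merge: true });

  const proxySnap = await proxyRef.get();
  const lastMirroredAt: Date | undefined = proxySnap.get("lastMirroredAt")?.toDate();

  if (!lastMirroredAt || Date.now() - lastMirroredAt.getTime() >= MIRROR_INTERVAL_MS) {
    await proxyRef.set(
      { ...data, lastMirroredAt: FieldValue.serverTimestamp() },
      { merge: true }
    );
  }
}
```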
Relevant part of the documentation below.
https://firebase.google.com/docs/firestore/best-practices#realtime_updates
Limit the collection write rate (1,000 operations/second): Keep the rate of write operations for an individual collection under 1,000 operations/second.
Limit the individual client push rate (1 document/second): Keep the rate of documents the database pushes to an individual client under 1 document/second.

Delete data in dynamoDB without bringing site down

I have a multi-tenant product offering and use a DynamoDB database, so all our web requests are served from DynamoDB. I have a use case where I want to move a tenant's data from one region to another; this would be a background process.
How do I ensure the background process does not hog the database? Otherwise it will give a bad user experience and may bring the website down.
Is there a way to provision dedicated read and write capacity for the background process?
You cannot dedicate read and write capacity units to specific processes, but you could temporarily change the table's capacity mode to on-demand for the move, and then switch it back to provisioned mode later when the move is complete. You can make this capacity mode switch once every 24 hours. By changing to on-demand capacity mode, you are less likely to be throttled in this specific situation.
That said, without knowing your current table capacity mode and the capacity settings on those tables, it is difficult for me to make concrete recommendations.
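For reference, the switch itself is a single UpdateTable call. Here is a sketch with the AWS SDK for JavaScript v3; the table name, region, and provisioned throughput numbers are placeholders:

```typescript
import { DynamoDBClient, UpdateTableCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({ region: "us-east-1" });

async function switchToOnDemand(tableName: string): Promise<void> {
  // Switch the table to on-demand for the duration of the migration.
  await client.send(
    new UpdateTableCommand({
      TableName: tableName,
      BillingMode: "PAY_PER_REQUEST",
    })
  );
}

async function switchBackToProvisioned(tableName: string): Promise<void> {
  // At least 24 hours after the last switch, move back to provisioned mode
  // with whatever capacity you normally run with.
  await client.send(
    new UpdateTableCommand({
      TableName: tableName,
      BillingMode: "PROVISIONED",
      ProvisionedThroughput: { ReadCapacityUnits: 100, WriteCapacityUnits: 50 },
    })
  );
}
```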
Sorry, but the answer from Kirk is not the best idea if saving money is the goal. DynamoDB has a TTL feature: when you want to delete something, you set an expiry on the item and treat it as gone once the TTL has passed (filtering expired items out of query results until they are physically removed).
The item is not deleted immediately; it is scheduled for deletion in the background later. Those TTL deletions do not consume your precious write capacity units the way deleting items one by one does, which saves you money and is exactly what the feature is for.
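A sketch of that pattern with the AWS SDK for JavaScript v3; the table, key schema, and attribute names are placeholders, and the TTL attribute must hold an epoch timestamp in seconds:

```typescript
import {
  DynamoDBClient,
  UpdateItemCommand,
  UpdateTimeToLiveCommand,
} from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({ region: "us-east-1" });

// One-time setup: enable TTL on the table, pointing it at an "expiresAt" attribute.
async function enableTtl(): Promise<void> {
  await client.send(
    new UpdateTimeToLiveCommand({
      TableName: "TenantData", // placeholder
      TimeToLiveSpecification: { AttributeName: "expiresAt", Enabled: true },
    })
  );
}

// "Delete" an item by giving it an expiry in the past. DynamoDB removes it in
// the background, and those deletes consume no write capacity; until the
// background delete runs, filter expired items out of your query results.
async function expireItem(tenantId: string, itemId: string): Promise<void> {
  const nowInSeconds = Math.floor(Date.now() / 1000);
  await client.send(
    new UpdateItemCommand({
      TableName: "TenantData",
      Key: { pk: { S: tenantId }, sk: { S: itemId } }, // placeholder key schema
      UpdateExpression: "SET expiresAt = :t",
      ExpressionAttributeValues: { ":t": { N: String(nowInSeconds) } },
    })
  );
}
```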

Exactly-once semantics in Dataflow stateful processing

We are trying to cover the following scenario in a streaming setting:
calculate an aggregate (let’s say a count) of user events since the start of the job
The number of user events is unbounded (hence only using local state is not an option)
I'll discuss three options we are considering; the first two are prone to data loss and the guarantees of the final one are unclear. We'd like to get more insight into this final one. Alternative approaches are of course welcome too.
Thanks!
Approach 1: Session windows, datastore and Idempotency
Sliding windows of x seconds
Group by userid
update datastore
Update datastore would mean:
1. Start trx
2. datastore read for this user
3. Merging in new info
4. datastore write
5. End trx
The datastore entry contains an idempotency id that equals the sliding window timestamp
Problem:
Windows can be fired concurrently and can hence be processed out of order, leading to data loss (confirmed by Google)
Approach 2: Session windows, datastore and state
Sliding windows of x seconds
Group by userid
update datastore
Update datastore would mean:
1. Pre-check: check if state for this key-window is true; if so, we skip the following steps
2. Start trx
3. datastore read for this user
4. Merging in new info
5. datastore write
6. End trx
7. Store in state for this key-window that we processed it (true)
Re-execution will hence skip duplicate updates
Problem:
A failure between steps 5 and 7 will not write to the local state, causing re-execution and potentially counting elements twice.
We can circumvent this by using multiple states, but then we could still drop data.
Approach 3: Global window, timers and state
Based on the article Timely (and Stateful) Processing with Apache Beam, we would create:
A global window
Group by userid
Buffer/count all incoming events in a stateful DoFn
Flush x time after the first event.
A flush would mean the same as Approach 1
Problem:
The guarantees for exactly-once processing and state are unclear.
What would happen if an element was written to state and the bundle was then re-executed? Is state restored to before that bundle?
Any links to documentation in this regard would be very much appreciated. E.g. how does fault-tolerance work with timers?
From your Approaches 1 and 2 it is unclear whether your concern is out-of-order merging or loss of data. I can think of the following.
Approach 1: Don't immediately merge the session window aggregates, because of the out-of-order problem. Instead, store them separately, and after a sufficient amount of time you can merge the intermediate results in timestamp order.
Approach 2: Move the state into the transaction. This way, any temporary failure will prevent the transaction from completing and merging the data, and subsequent successful processing of the session window aggregates will not result in double counting.
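To make the second suggestion concrete, here is one way to interpret "move the state into the transaction" with the Cloud Datastore Node.js client: the set of window timestamps already applied lives on the aggregate entity itself, so the idempotency check and the merge commit atomically. The entity kind and property names are invented for this sketch.

```typescript
import { Datastore } from "@google-cloud/datastore";

const datastore = new Datastore();

// Merge a window's count into the per-user aggregate. A retried window either
// finds its timestamp already recorded (and becomes a no-op) or applies exactly
// once, because the check and the update commit in the same transaction.
async function mergeWindowCount(
  userId: string,
  windowTimestamp: string,
  windowCount: number
): Promise<void> {
  const key = datastore.key(["UserAggregate", userId]);
  const tx = datastore.transaction();
  await tx.run();
  try {
    const [entity] = await tx.get(key);
    const current = entity ?? { totalCount: 0, appliedWindows: [] as string[] };

    if (current.appliedWindows.includes(windowTimestamp)) {
      await tx.rollback(); // already merged; nothing to do
      return;
    }

    tx.save({
      key,
      data: {
        totalCount: current.totalCount + windowCount,
        // In practice you would trim this list (e.g. keep a watermark) rather
        // than let it grow forever.
        appliedWindows: [...current.appliedWindows, windowTimestamp],
      },
    });
    await tx.commit();
  } catch (err) {
    await tx.rollback().catch(() => undefined);
    throw err; // let the runner retry the bundle
  }
}
```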

Is there any way to monitor one single class of object in terms of cache?

I am trying to determine which implementation of the data structure would be best for the web application. Basically, I will maintain one "State" object for each unique user; the State is cached for some time when the user logs in, and after the non-sliding expiration period the state is saved to the DB. So in order to balance the DB load and the IIS memory, I have to determine the best (expected) timeout for the cache.
My question is: how do I monitor the cache activity for this particular set of objects? I tried perfmon, and it gives roughly the percentage of the total memory limit used, but no idea of the size (or, even better, a list of all cached objects along with their sizes and other performance data).
One last thing: I expect the program to handle 100,000+ cached users, and each of them may make a request roughly every 10-60 seconds. So performance does matter to me.
What exactly are you trying to measure here? If you just want to know the size of your in-memory State instances at any given time, you can use an application-level counter and add/subtract every time you create/remove an instance of State. If you know your State size and how many State instances you have, you can multiply. And since you already count on getting 100,000+ users each requesting at least once a minute, you can actually do the math up front.
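The platform in the question is ASP.NET, but the bookkeeping idea is platform-neutral. A toy sketch of that counting (class and constant names invented, and the per-instance size is just an assumed estimate):

```typescript
// Toy sketch: count live State instances and estimate their memory footprint.
// In ASP.NET you would do the same around cache insert/remove callbacks.
interface State {
  userId: string;
  payload: Record<string, unknown>;
}

const ESTIMATED_STATE_BYTES = 2_048; // assumed rough size of one State instance

class CountingStateCache {
  private items = new Map<string, State>();

  set(userId: string, state: State): void {
    this.items.set(userId, state);
  }

  evict(userId: string): void {
    this.items.delete(userId);
  }

  get count(): number {
    return this.items.size;
  }

  get estimatedBytes(): number {
    return this.items.size * ESTIMATED_STATE_BYTES;
  }
}

// "Doing the math" for the expected load: 100,000 users at ~2 KB each is
// roughly 200 MB of State before any cache overhead.
const projectedBytes = 100_000 * ESTIMATED_STATE_BYTES;
console.log(`projected State footprint: ${(projectedBytes / (1024 * 1024)).toFixed(0)} MB`);
```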
