Mongodb automatically write into capped collection - collections

I need to manage the acquisition of many record at hour. About 1000000 records. And I need to get every second the last insert value for every primary key. It works quit well with sharding. I was thinking to try the use os capped collection to get only the last record for every primary key. In order to do this, I made two separated insert, there is a way, into mongodb, to make some kind of trigger to propagate the insert into a collection to another collection?

MongoDB does not have any support for triggers or similar behavior.
The only way to do this is to make it happen in your code. So the code that writes the first entry should also write the second.
People have definitely requested triggers. If they are necessary for your solution, please cast a vote on the feature request.

I disagree with "triggers is needed". People, MongoDB was created to be very fast and to provide as basic functionalities as can be. This is a power of this solution.
I think that here the best think is to create triggers inside Your application as a part of Data Access layer.

Related

Fetch new entities only

I thought Datastore's key was ordered by insertion date, but apparently I was wrong. I need to periodically look for new entities in the Datastore, fetch them and process them.
Until now, I would simply store the last fetched key and wrongly query for anything greater than it.
Is there a way of doing so?
Thanks in advance.
Datastore automatically generated keys are generated with uniform distribution, in order to make search more performant. You will not be able to understand which entity where added last using keys.
Instead, you can try couple of different approaches.
Use Pub/Sub and architecture your app so another background task will consume this last added entities. On entities add in DB, you will just publish new Event into Pub/Sub with key id. You event listener (separate routine) will receive it.
Use names and generate you custom names. But, as you want to create sequentially growing names, this will case performance hit on even not big ranges of data. You can find more about this in Best Practices of Google Datastore.
https://cloud.google.com/datastore/docs/best-practices#keys
You can add additional creation time column, and still use automatic keys generation.

Retrieving Random Single Items in Dynamo

We are trying to get our heads wrapped around a design question, which is not really easy in any DB. We have 100,000 random items, (could be a lot more), (we are talking a truly random key, we'll use UUIDs,) and we want to hand them out one at a time. Order is not important. We are thinking that we'll create a dynamo table of the items, then delete them out of that table as they are assigned. We can do a conditional delete to make sure that we have not already given the item away. But, when trying to find an item in the first place, if we do a scan or a query with a limit of 1, will it always hit the same first available record? I'm wondering what the ramifications are. Dynamo will shard on the UUID. We are worried about everyhone trying to hit on the same record all the time. First one would of course get delete, then they could all hit on the second one, etc.
We could set up a memcache/redis instance in elastic cache, and keep a list of the available UUDS in there. We can do a random select of items from this using redis SPOP, which gets a random item and deletes it. We might have a problem where we could get out of sync between the two, but for the most part this would work.
Any thoughts on how to do this without the cache would be great. If dynamo does scans starting at different points, that would be dandy.
I have the same situation with you that have a set of million of UUID as key in DynamoDB and I need to random select some of them in a API call. For the performance issue and easy implementation. I did use Redis as you said.
add the UUID to a Set in Redis
when the call comes, SPOP a UUID from the set
with that UUID, del in DynamoDB
The performance of Scan operation is bad, should try to avoid it as best as you can.

Solution for previewing user changes and allowing rollback/commit over a period of time

I have asked a few questions today as I try to think through to the solution of a problem.
We have a complex data structure where all of the various entities are tightly interconnected, with almost all entities heavily reliant/dependant upon entities of other types.
The project is a website (MVC3, .NET 4), and all of the logic is implemented using LINQ-to-SQL (2008) in the business layer.
What we need to do is have a user "lock" the system while they make their changes (there are other reasons for this which I won't go into here that are not database related). While this user is making their changes we want to be able to show them the original state of entities which they are updating, as well as a "preview" of the changes they have made. When finished, they need to be able to rollback/commit.
We have considered these options:
Holding open a transaction for the length of time a user takes to make multiple changes stinks, so that's out.
Holding a copy of all the data in memory (or cached to disk) is an option but there is heck of a lot of it, so seems unreasonable.
Maintaining a set of secondary tables, or attempting to use session state to store changes, but this is complex and difficult to maintain.
Using two databases, flipping between them by connection string, and using T-SQL to manage replication, putting them back in sync after commit/rollback. I.e. switching on/off, forcing snapshot, reversing direction etc.
We're a bit stumped for a solution that is relatively easy to maintain. Any suggestions?
Our solution to a similar problem is to use a locking table that holds locks per entity type in our system. When the client application wants to edit an entity, we do a "GetWithLock" which gets the client the most up-to-date version of the entity's data as well as obtaining a lock (a GUID that is stored in the lock table along with the entity type and the entity ID). This prevents other users from editing the same entity. When you commit your changes with an update, you release the lock by deleting the lock record from the lock table. Since stored procedures are the api we use for interacting with the database, this allows a very straight forward way to lock/unlock access to specific entities.
On the client side, we implement IEditableObject on the UI model classes. Our model classes hold a reference to the instance of the service entity that was retrieved on the service call. This allows the UI to do a Begin/End/Cancel Edit and do the commit or rollback as necessary. By holding the instance of the original service entity, we are able to see the original and current data, which would allow the user to get that "preview" you're looking for.
While our solution does not implement LINQ, I don't believe there's anything unique in our approach that would prevent you from using LINQ as well.
HTH
Consider this:
Long transactions makes system less scalable. If you do UPDATE command, update locks last until commit/rollback, preventing other transaction to proceed.
Second tables/database can be modified by concurent transactions, so you cannot rely on data in tables. Only way is to lock it => see no1.
Serializable transaction in some data engines uses versions of data in your tables. So after first cmd is executed, transaction can see exact data available in cmd execution time. This might help you to show changes made by user, but you have no guarantee to save them back into storage.
DataSets contains old/new version of data. But that is unfortunatelly out of your technology aim.
Use a set of secondary tables.
The problem is that your connection should see two versions of data while the other connections should see only one (or two, one of them being their own).
While it is possible theoretically and is implemented in Oracle using flashbacks, SQL Server does not support it natively, since it has no means to query previous versions of the records.
You can issue a query like this:
SELECT *
FROM mytable
AS OF TIMESTAMP
TO_TIMESTAMP('2010-01-17')
in Oracle but not in SQL Server.
This means that you need to implement this functionality yourself (placing the new versions of rows into your own tables).
Sounds like an ugly problem, and raises a whole lot of questions you won't be able to go into on SO. I got the following idea while reading your problem, and while it "smells" as bad as the others you list, it may help you work up an eventual solution.
First, have some kind of locking system, as described by #user580122, to flag/record the fact that one of these transactions is going on. (Be sure to include some kind of periodic automated check, to test for lost or abandoned transactions!)
Next, for every change you make to the database, log it somehow, either in the application or in a dedicated table somewhere. The idea is, given a copy of the database at state X, you could re-run the steps submitted by the user at any time.
Next up is figuring out how to use database snapshots. Read up on these in BOL; the general idea is you create a point-in-time snapshot of the database, do whatever you want with it, and eventually throw it away. (Only available in SQL 2005 and up, Enterprise edition only.)
So:
A user comes along and initiates one of these meta-transactions.
A flag is marked in the database showing what is going on. A new transaction cannot be started if one is already in process. (Again, check for lost transactions now and then!)
Every change made to the database is tracked and recorded in such a fashion that it could be repeated.
If the user decides to cancel the transaction, you just drop the snapshot, and nothing is changed.
If the user decides to keep the transaction, you drop the snapshot, and then immediately re-apply the logged changes to the "real" database. This should work, since your requirements imply that, while someone is working on one of these, no one else can touch the related parts of the database.
Yep, this sure smells, and it may not apply to well to your problem. Hopefully the ideas here help you work something out.

Calculating Number Of Comments/Posts

I'm using ASP.net and an SQL database. I have a blog like system where a number of comments are made against a post and I want to display the number of those comments next to the post. To get that number I could either hold it in the post record and add/subtrack when a comment is added or deleted or I could use the SQL to calculate the number of comments using a query each time a user hits the page. The latter seems to be a bad idea as its going to hit my SQL database harder however holding the number against the record feels like it could be error prone. What do you think is best coding practice in this case?
Always start with a normalized database (your second option). Only denormalize if you have an absolute necessity for performance reasons. Designing it in the denormalized way (which is error-prone as you guessed) is premature optimization. With proper indexes it should be fine calculating the number on the fly.
I think the SQL statement should be fine. The other is duplication of data you already have. A count query should be quick.
Don't optimize prematurely. Use the simple solution and pagefault in optimizations only when they're needed.
I would query the database each time you want the information. I would revisit it later if you find that performance is lacking (optimize later). For the traffic most blog type applications will get, that should be sufficient.
Perhaps get the count back as part of the main thread query so as to limit the number of hits on the actual DB from the webserver. But I would always query the actual count and not try and keep it in a field, data will eventually get out of sync as that is reality.
To increase performance, you could keep a flag in the main table to indicate if the item has any comments but only use this as a 'hint' as to whether or not to perform an additional query to count and retrieve comments at a later time.
Imagine a photo gallery that returns 50 photos to rotate through. Each photo could have its own comments.
The initial page load would return a list of photos plus a flag indicating if a photo has comments.
When a photo is displayed, if the comments flag is set to True, your app would make an ajax request to count and fetch the comments for that photo.
If only 3 out of the 50 photos have comments, you just saved yourself 47 additional requests!
This does denormalize the data, but on a limited level.
Creating hints can really help improve performance for very busy sites.
Depending on how your data model looks...Don't add the total post count to the main thread record, it is error prone, you should calculate the comment count when needed based on the thread ID, IMHO
Caching the pages and updating that cache as comments are added/removed would be a good option a long with the SQL count query if you are that worried about the number of queries happening against the db..
I usually use an indexed view for this kind of thing. This allows you to denormalize the data for quick retrieval, but there is no way for it to get out of sync. Folks will also not be confused and think the view is the master of the data. I have mostly used the standard sku of SS2K5, so I have to specify the (noexpand) hint to get it to actually use the index on the view (enterprise will do it automatically). So for standard sku, I always create a wrapper view that everyone hits so I know the hint is always in place.
Coding this on the web page, so hopefully no syntax errors ;)
create view postCount__
as
select
threadId
,postCount=count_big(*)
from thread
group by threadId
go
create unique clustered index postCount__xpk_threadid on postCount__(threadId)
go
create view postCount
as
select
threadId
,postCount=cast(postCount as int)
from postCount__ with (noexpand)
go
So I use a nomenclature on the actual indexed view to let everyone know not to query it directly. Instead they look for the associated wrapper view that enforces the noexpand hint. Using an indexed view forces you to do count_big, so I often cast down to int in the wrapper view to be able to keep our asp.net code lazily using 32 bit ints. It would be better to omit the cast, but it hasn't been of any significant impact for me.
EDIT - I can tell you that forum software always denormalizes the post count to the thread table. It kills the DB to continually count the post count on every page view if you have an active forum. I love that mssql has indexed views so you can define the denormalization declaratively rather than maintain it yourself.

How do I implement "pessimistic locking" in an asp.net application?

I would like some advice from anyone experienced with implementing something like "pessimistic locking" in an asp.net application. This is the behavior I'm looking for:
User A opens order #313
User B attempts to open order #313 but is told that User A has had the order opened exclusively for X minutes.
Since I haven't implemented this functionality before, I have a few design questions:
What data should i attach to the order record? I'm considering:
LockOwnedBy
LockAcquiredTime
LockRefreshedTime
I would consider a record unlocked if the LockRefreshedTime < (Now - 10 min).
How do I guarantee that locks aren't held for longer than necessary but don't expire unexpectedly either?
I'm pretty comfortable with jQuery so approaches which make use of client script are welcome. This would be an internal web application so I can be rather liberal with my use of bandwidth/cycles. I'm also wondering if "pessimistic locking" is an appropriate term for this concept.
It sounds like you are most of the way there. I don't think you really need LockRefreshedTime though, it doesn't really add anything. You may just as well use the LockAcquiredTime to decide when a lock has become stale.
The other thing you will want to do is make sure you make use of transactions. You need to wrap the checking and setting of the lock within a database transaction, so that you don't end up with two users who think they have a valid lock.
If you have tasks that require gaining locks on more than one resource (i.e. more than one record of a given type or more than one type of record) then you need to apply the locks in the same order wherever you do the locking. Otherwise you can have a dead lock, where one bit of code has record A locked and is wanting to lock record B and another bit of code has B locked and is waiting for record A.
As to how you ensure locks aren't released unexpectedly. Make sure that if you have any long running process that could run longer than your lock timeout, that it refreshes its lock during its run.
The term "explicit locking" is also used to describe this time of locking.
I have done this manually.
Store the primary-key of the record to a lock table, and mark record
mode attribute to edit.
When another user tries to select this record, indicate the user's
ready only record.
Have a set-up maximum time for locking the records.
Refresh page data for locked records. While an user is allowed to
make changes, all other users are only allowed to check.
Lock table should have design similar to this:
User_ID, //who locked
Lock_start_Time,
Locked_Row_ID(Entity_ID), //this is primary key of the table of locked row.
Table_Name(Entity_Name) //table name of the locked row.
Remaining logic is something you have to figure out.
This is just an idea which I implemented 4 years ago on special request of a client. After that client no one has asked me again to do anything similar, so I haven't achieved any other method.

Resources