Efficiently maintaining a cache of distinct items in a huge DB table - asp.net

I have a very large (millions of rows) SQL table which represents name-value pairs (one column for the name of a property, the other for its value). In my ASP.NET web application I have to populate a control with the distinct values available in the name column. This set of values is usually no bigger than 100, and most likely around 20. Running the query
SELECT DISTINCT name FROM nameValueTable
can take a significant time on this large table (even with the proper indexing etc.). I especially don't want to pay this penalty every time I load this web control.
So caching this set of names should be the right answer. My question is how to promptly update the set when there is a new name in the table. I looked into the SQL Server 2005 Query Notification feature, but the table gets updated frequently and only very seldom with an actually new distinct name. The notifications would flow in all the time, and the web server would probably waste more time than it saved by setting this up.
I would like to find a way to balance the time used to query the data, with the delay until the name set is updated.
Any ideas on how to efficiently manage this cache?

A little normalization might help. Break out the property names into a new table, and reference it from the original table through a foreign key on an int ID. You can then query the new table to get the complete list, which will be really fast.

Figuring out your pattern of usage will help you come up with the right balance.
How often are new values added? Are newly added values always unique? Is the table mostly updated? Do deletes occur?
One approach may be to have a SQL Server insert trigger that checks the cache table to see whether the inserted name is already there and, if it is not, adds it.

Add a unique increasing sequence MySeq to your table. You may want to try and cluster on MySeq instead of your current primary key so that the DB can build a small set then sort it.
SELECT DISTINCT name FROM nameValueTable Where MySeq >= ?;
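Set ? to the highest MySeq value your cache has seen so far. With >=, the last row already seen is read again, which is harmless because the cache is a set.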
You will always have a lag between your cache and the DB so, if this is a problem you need to rethink the flow of the application. You could try making all requests flow through your cache/application if you manage the data:
requests --> cache --> db
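The incremental refresh described above can be sketched as follows. This is a minimal illustration and not the answerer's code: it uses Python with an in-memory SQLite database standing in for SQL Server, the class name `DistinctNameCache` is hypothetical, and it assumes MySeq is assigned in strictly increasing order on insert.

```python
import sqlite3


class DistinctNameCache:
    """Keeps the set of distinct names and refreshes it incrementally."""

    def __init__(self, conn):
        self.conn = conn
        self.names = set()
        self.last_seq = 0  # highest MySeq value already folded into the cache

    def refresh(self):
        # Only rows added since the previous refresh are scanned.
        rows = self.conn.execute(
            "SELECT name, MAX(MySeq) FROM nameValueTable "
            "WHERE MySeq > ? GROUP BY name",
            (self.last_seq,),
        ).fetchall()
        for name, seq in rows:
            self.names.add(name)
            self.last_seq = max(self.last_seq, seq)
        return self.names


# Demo data (hypothetical values).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nameValueTable ("
    "MySeq INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO nameValueTable (name, value) VALUES (?, ?)",
    [("color", "red"), ("size", "L"), ("color", "blue")],
)
cache = DistinctNameCache(conn)
first = sorted(cache.refresh())   # full scan on the first call
conn.execute("INSERT INTO nameValueTable (name, value) VALUES ('weight', '2kg')")
second = sorted(cache.refresh())  # scans only the one new row
```

Note that a sketch like this only notices inserted rows. Names that disappear through deletes, or that change through in-place updates which do not touch MySeq, would need a periodic full refresh.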

If you're not allowed to change the actual structure of this huge table (for example, due to huge numbers of reports relying on it), you could create a holding table of these 20 values and query against that. Then, on the huge table, have a trigger that fires on an INSERT or UPDATE, checks to see if the new NAME value is in the holding table, and if not, adds it.
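A rough sketch of the holding-table idea, assuming Python with SQLite in place of SQL Server. The table name `distinctNames` and the trigger names are made up for illustration, and SQL Server trigger syntax differs (it works on the `inserted` pseudo-table and would typically use an `IF NOT EXISTS` check instead of `INSERT OR IGNORE`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nameValueTable (name TEXT NOT NULL, value TEXT);
CREATE TABLE distinctNames (name TEXT PRIMARY KEY);  -- the small holding table

-- On INSERT: add the name to the holding table unless it is already there.
CREATE TRIGGER trg_names_insert AFTER INSERT ON nameValueTable
BEGIN
    INSERT OR IGNORE INTO distinctNames (name) VALUES (NEW.name);
END;

-- On UPDATE of the name column: same check for the new name.
CREATE TRIGGER trg_names_update AFTER UPDATE OF name ON nameValueTable
BEGIN
    INSERT OR IGNORE INTO distinctNames (name) VALUES (NEW.name);
END;
""")

conn.executemany(
    "INSERT INTO nameValueTable (name, value) VALUES (?, ?)",
    [("color", "red"), ("color", "blue"), ("size", "L")],
)
conn.execute("UPDATE nameValueTable SET name = 'shade' WHERE value = 'blue'")

# The web control reads the tiny table instead of running DISTINCT on the big one.
names = [row[0] for row in conn.execute("SELECT name FROM distinctNames ORDER BY name")]
```

The holding table in this sketch only grows. If names can vanish from the big table, removing them from the holding table would need a delete trigger or an occasional rebuild.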

I don't know the specifics of .NET, but I would pass all the update requests through the cache. Are all the update requests done by your ASP.NET web application? Then you could make a Proxy object for your database and have all the requests directed to it. Taking into consideration that your database only has key-value pairs, it is easy to use a Map as a cache in the Proxy.
Specifically, in pseudocode, all the requests would be as following:
// the client invokes cache.get(key)
if (cacheMap.has(key)) {
    return cacheMap.get(key);
} else {
    // cache miss: load from the database, remember it, and return it
    value = database.retrieve(key);
    cacheMap.put(key, value);
    return value;
}
// the client invokes cache.put(key, value)
cacheMap.put(key, value);
if (writeThrough) {
    database.put(key, value);
}
Also, in the background you could have an Evictor thread which ensures that the cache does not grow too big. In your scenario, where you have a set of values frequently accessed, I would set an eviction strategy based on Time To Idle: if an item is idle for more than a set amount of time, it is evicted. This ensures that frequently accessed values remain in the cache. Also, if your cache is not write-through, you need to have the evictor write to the database on eviction.
Hope it helps :)
-- Flaviu Cipcigan
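The cache-with-evictor idea from the answer above could look roughly like this in Python. It is only a sketch under assumptions: the class name `IdleEvictingCache` is hypothetical, the database is modelled as any dict-like object, the clock is injectable so the demo is deterministic, and `evict_idle` is called explicitly here where the answer suggests a background thread (which would also need a lock around the shared state).

```python
import time


class IdleEvictingCache:
    def __init__(self, database, max_idle_seconds, write_through=True,
                 clock=time.monotonic):
        self.database = database      # dict-like backing store (assumption)
        self.max_idle = max_idle_seconds
        self.write_through = write_through
        self.clock = clock
        self.entries = {}             # key -> (value, last_access_time)
        self.dirty = set()            # keys not yet written to the database

    def get(self, key):
        if key in self.entries:
            value, _ = self.entries[key]
        else:
            value = self.database[key]  # cache miss: load from the store
        self.entries[key] = (value, self.clock())  # refresh the idle timer
        return value

    def put(self, key, value):
        self.entries[key] = (value, self.clock())
        if self.write_through:
            self.database[key] = value
        else:
            self.dirty.add(key)

    def evict_idle(self):
        """Drop entries idle longer than max_idle, flushing unwritten ones."""
        now = self.clock()
        idle = [k for k, (_, t) in self.entries.items() if now - t > self.max_idle]
        for key in idle:
            value, _ = self.entries.pop(key)
            if key in self.dirty:
                self.database[key] = value
                self.dirty.discard(key)


# Demo with a fake clock so the timings are explicit.
now = [0.0]
db = {"a": 1}
cache = IdleEvictingCache(db, max_idle_seconds=10, write_through=False,
                          clock=lambda: now[0])
cache.put("b", 2)    # held in the cache only, not yet in db
now[0] = 5.0
cache.get("a")       # loaded from db, last access at t=5
now[0] = 12.0
cache.evict_idle()   # "b" has been idle for 12s and is evicted and flushed
```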

Related

How to introduce a new column in dynamo DB running in production?

I have a use case where DynamoDB is running in production and I need to add a new column IDUpdatedAt which will also be serving as a sort key for one of the GSIs.
I tried a thing in test where my application adds the new rows with IDUpdatedAt, it's working fine but what about the existing rows? How to add the values for those?
Also the new rows will not be added without IDUpdatedAt, but how will the search be impacted for older rows?
PS: IDUpdatedAt is being used as a filter in the application, i.e., user can search for specific ID and can get results sorted by date. That's why IDUpdatedAt is also a part of GSI (sort key).
Please help.
You've got the right idea by adding the field to new items. After all, DynamoDB does not enforce a particular schema outside of the primary key.
This also happens to be a very useful feature, especially when defining a GSI on that attribute; if the attribute exists on the item, it ends up in the index! For example, imagine modeling an email inbox in DDB where each item represents an email. You could include an attribute 'is_read' and define a GSI using that attribute.
If the 'is_read' attribute exists on the item, it's in the index. Otherwise, it's not. A cool way to use GSIs to implement filtering.
Pretty neat stuff!
However, there is no way to retroactively update all items with a new attribute other than manually updating each item (or in batches). The equivalent in SQL databases is defining a new column. Unfortunately, an analogous operation in DDB does not exist.
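A manual backfill of that kind might be sketched as below. This is an assumption-laden illustration, not a tested migration: `backfill_attribute` and `value_for` are hypothetical names, `table` is expected to behave like a boto3 DynamoDB `Table` resource (only its `scan` and `update_item` methods are used), and the choice of value for old items is left to the caller because the source does not say what it should be.

```python
def backfill_attribute(table, key_names, attribute, value_for):
    """Scan for items missing `attribute` and set it, one item at a time.

    table      -- object with boto3-Table-like scan() and update_item()
    key_names  -- names of the table's primary key attributes
    value_for  -- caller-supplied function: item -> value to store
    """
    updated = 0
    scan_kwargs = {
        "FilterExpression": "attribute_not_exists(#a)",
        "ExpressionAttributeNames": {"#a": attribute},
    }
    while True:
        page = table.scan(**scan_kwargs)
        for item in page.get("Items", []):
            table.update_item(
                Key={k: item[k] for k in key_names},
                UpdateExpression="SET #a = :v",
                # Do not overwrite a value the application wrote in the meantime.
                ConditionExpression="attribute_not_exists(#a)",
                ExpressionAttributeNames={"#a": attribute},
                ExpressionAttributeValues={":v": value_for(item)},
            )
            updated += 1
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return updated
        scan_kwargs["ExclusiveStartKey"] = last_key  # continue with the next page
```

Against a real table, a failed condition surfaces as an exception that a production script would catch and skip, and a full scan consumes read capacity, so throttling the loop is worth considering. Items only show up in the GSI once they carry the attribute.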

How can I query for all new and updated documents since last query?

I need to query a collection and return all documents that are new or updated since the last query. The collection is partitioned by userId. I am looking for a value that I can use (or create and use) that would help facilitate this query. I considered using _ts:
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value]
The problem with _ts is that it is not granular enough and the query could miss updates made in the same second by another client.
In SQL Server I could accomplish this using an IDENTITY column in another table. Let's call the table version. In a transaction I would create a new row in the version table and do the updates to the other table (including updating the version column with the new value). To query for new and updated rows I would use a query like this:
SELECT * FROM table WHERE userId=[some-user-id] and version > [some-value]
How could I do something like this in Cosmos DB? The Change Feed seems like the right option, but without the ability to query the Change Feed, I'm not sure how I would go about this.
In case it matters, the (web/mobile) clients connect to data in Cosmos DB via a web api. I have control of the entire stack - from client to back-end.
As the statements in this link:
Today, you see all operations in the change feed. The functionality
where you can control change feed, for specific operations such as
updates only and not inserts is not yet available. You can add a “soft
marker” on the item for updates and filter based on that when
processing items in the change feed. Currently change feed doesn’t log
deletes. Similar to the previous example, you can add a soft marker on
the items that are being deleted, for example, you can add an
attribute in the item called "deleted" and set it to "true" and set a
TTL on the item, so that it can be automatically deleted. You can read
the change feed for historic items, for example, items that were added
five years ago. If the item is not deleted you can read the change
feed as far as the origin of your container.
Change feed is not available for your requirements.
My idea:
Use Azure Function Cosmos DB Trigger to collect all the operations in your specific cosmos collection. Follow this document to configure the input of azure function as cosmos db, then follow this document to configure the output as azure queue storage.
Get the ids of changed items and send them into queue storage as messages. When you want to query the changed items, just read the messages from the queue at a specific interval to consume them, and after that clear the entire queue. No items will be missed.
With your approach, you can get added/updated documents and save a reference value (the _ts and id fields) somewhere (such as a blob):
SELECT * FROM collection WHERE userId=[some-user-id] AND _ts > [some-value] and id !='guid' order by _ts desc
This is similar to the approach we use to read data from Eventhub, storing checkpointing information (epoch number, sequence number and offset value) in a blob. Only one function at a time can take a lease on that blob.
If you go with ChangeFeed, you can create a listener (Function or Job) that listens for all added/updated documents in the collection and stores them in another collection; while saving the data you can add an identity/version field to every document. This approach may increase your Cosmos DB bill.
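One way to make the checkpoint idea robust against several writes landing in the same second is to re-read the last second and skip what was already returned. The sketch below simulates this over an in-memory list of documents in Python. It is not Cosmos DB client code: `Checkpoint` and `fetch_changes` are hypothetical names, the list comprehension stands in for a query with `_ts >= @last_ts`, and it relies on `_etag` changing on every update of a document.

```python
class Checkpoint:
    """How far a reader has got: the last _ts plus what was seen at that _ts."""

    def __init__(self):
        self.last_ts = -1
        self.seen_at_last_ts = set()  # (id, _etag) pairs already returned


def fetch_changes(documents, user_id, checkpoint):
    # Stand-in for: WHERE userId = @user AND _ts >= @last_ts
    candidates = [
        d for d in documents
        if d["userId"] == user_id and d["_ts"] >= checkpoint.last_ts
    ]
    # Keep strictly newer documents, plus same-second ones not returned before.
    fresh = [
        d for d in candidates
        if d["_ts"] > checkpoint.last_ts
        or (d["id"], d["_etag"]) not in checkpoint.seen_at_last_ts
    ]
    if candidates:
        newest = max(d["_ts"] for d in candidates)
        if newest > checkpoint.last_ts:
            checkpoint.seen_at_last_ts = set()
            checkpoint.last_ts = newest
        checkpoint.seen_at_last_ts |= {
            (d["id"], d["_etag"]) for d in candidates if d["_ts"] == newest
        }
    return fresh


# Demo: "c" is written in the same second as "b", after the first read.
docs = [
    {"id": "a", "userId": "u1", "_ts": 100, "_etag": "e1"},
    {"id": "b", "userId": "u1", "_ts": 101, "_etag": "e2"},
]
cp = Checkpoint()
r1 = [d["id"] for d in fetch_changes(docs, "u1", cp)]
docs.append({"id": "c", "userId": "u1", "_ts": 101, "_etag": "e3"})
r2 = [d["id"] for d in fetch_changes(docs, "u1", cp)]
r3 = [d["id"] for d in fetch_changes(docs, "u1", cp)]
```

The sketch assumes that a write never becomes visible with a `_ts` older than the stored checkpoint; if that cannot be guaranteed, the reader would have to re-scan a wider window.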
This is what the transaction consistency levels are for: https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels
Choose strong consistency and your queries will always return the latest write.
Strong: Strong consistency offers a linearizability guarantee. The
reads are guaranteed to return the most recent committed version of an
item. A client never sees an uncommitted or partial write. Users are
always guaranteed to read the latest committed write.

What should be best approach to keep data in memory temporarily(at user level & for reuse the data) in ASP.NET?

Currently I am using 'Session' to keep the datatables in memory.
But after doing some research, I came to know that it is not a good practice,
e.g.
Session("Syllabus") = RegistartionLogic.GetSyllabusInfo(Session("StudentID"))
Requirement:
The items of dropdown will be different based on student-type.
The dropdown data will be fetched from DB and these controls are used
in more than one screen.
Multiple DB call is not preferred from different screens for same
data.
So I need to call only one DB call, keep the data in memory and
then read data from memory next time onward.
I tried with 'cache' as well, but the issue was: "Cache is not unique to the user. The scope of the data caching is within the application domain, unlike 'session'. Every user is able to access these objects."
Kindly help me out.
For your scenario, HttpContext.Current.Cache should work.
Yes cache is not unique to the user. But cache key can be made unique.
var studentId = GetStudentIdFromRequest();
var cacheKey = "SyllabusInfoCacheKey_" + studentId;
Then you can make use of the unique cachekey, to insert and later get values for the particular student.
Session is also at Application level. Every user has an ASP.NET_SessionId cookie that is sent from the client side and used on the server side to store/retrieve values.
Note: For Session and Asp.Net Cache to work in a load balanced environment, load balancer should be sticky.
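The per-user cache key idea can be illustrated independently of ASP.NET. The following Python sketch uses a plain dictionary as the shared, application-wide cache; `SyllabusCache` and `load_syllabus` are hypothetical stand-ins for `HttpContext.Current.Cache` and the real database call.

```python
class SyllabusCache:
    def __init__(self, loader):
        self._loader = loader  # function: student_id -> data (one DB call)
        self._store = {}       # shared store, visible to every request

    @staticmethod
    def key_for(student_id):
        # The key, not the cache, is what makes the entry user-specific.
        return "SyllabusInfoCacheKey_" + str(student_id)

    def get(self, student_id):
        key = self.key_for(student_id)
        if key not in self._store:
            self._store[key] = self._loader(student_id)
        return self._store[key]


db_calls = []


def load_syllabus(student_id):
    # Stand-in for the real database call.
    db_calls.append(student_id)
    return ["Course list for student %s" % student_id]


cache = SyllabusCache(load_syllabus)
cache.get(42)   # first screen: hits the database
cache.get(42)   # second screen: served from the cache
cache.get(7)    # different student: separate entry, separate DB call
```

A real application would also give entries an expiration so that stale dropdown data is eventually reloaded.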

Column Specific SqlCacheDependency in ASP.NET

I want to have a column specific SqlCacheDependency.
The row-specific SqlCacheDependency works, but I don't know how to make a column-specific SqlCacheDependency.
Example:
The query:
SELECT
[Extent1].[Price] AS [Price]
FROM [dbo].[Products] AS [Extent1]
where [Extent1].[ID] = 31167
causes the notification if the Row with ID = 31167 changes.
But the problem is that the cache becomes invalid if any column of that row changes, whereas I want the cache to become invalid only if the Price of ID 31167 changes.
I googled it for a long time but didn't find any help.
Thanks
Any help is appreciated.
SqlDependency (which is used by SqlCacheDependency) does not offer column-level control. The semantics of the change notifications is that they are sent if a row returned or used by the query "might have changed."
If this is an important feature for you, you would need to implement it yourself, probably using triggers and either Service Broker to queue and deliver the change notifications, or through polling as with the old-style table-based notifications in .NET.
Another possibility might be to pull your columns of interest off into a separate table (either an explicit copy or one maintained with triggers) or a new table that you then join to the original one with views or SPs. Each row would then only contain the column you're interested in (plus the PK).
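The trigger-plus-polling variant could be sketched like this, using Python with SQLite in place of SQL Server. The `PriceChanges` table, the trigger name and `changed_since` are invented for the example, and SQL Server would express the column check differently inside the trigger body.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (ID INTEGER PRIMARY KEY, Name TEXT, Price REAL);
CREATE TABLE PriceChanges (
    ChangeId INTEGER PRIMARY KEY AUTOINCREMENT,
    ProductId INTEGER NOT NULL
);

-- Log a change only when the Price column actually gets a different value.
CREATE TRIGGER trg_price_changed
AFTER UPDATE OF Price ON Products
WHEN OLD.Price IS NOT NEW.Price
BEGIN
    INSERT INTO PriceChanges (ProductId) VALUES (NEW.ID);
END;

INSERT INTO Products (ID, Name, Price) VALUES (31167, 'Widget', 9.99);
""")


def changed_since(conn, last_change_id):
    """Poll the log: return the new high-water mark and the affected product ids."""
    rows = conn.execute(
        "SELECT ChangeId, ProductId FROM PriceChanges "
        "WHERE ChangeId > ? ORDER BY ChangeId",
        (last_change_id,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_change_id
    return new_last, {product_id for _, product_id in rows}


conn.execute("UPDATE Products SET Name = 'Gadget' WHERE ID = 31167")
after_rename = changed_since(conn, 0)    # nothing logged: Price untouched
conn.execute("UPDATE Products SET Price = 12.5 WHERE ID = 31167")
after_reprice = changed_since(conn, 0)   # one log row for product 31167
```

The application would invalidate only the cache entries whose product ids come back from the poll.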

how to generate unique id per user?

I have a web page, Default.aspx, which generates an ID for each new user; the ID is then submitted to the database when a button on Default.aspx is clicked.
If another user enters at the same time, the ID will be the same until they press the button on Default.aspx.
How can I get rid of this issue so that each user is allotted a unique ID?
I am using read/write code to generate the unique ID.
You could use a Guid as the ID. To generate a unique ID:
Guid id = Guid.NewGuid();
Another possibility is to use an automatically incremented primary column in the database so that it is the database that generates the unique identifiers.
Three options:
1. Use a GUID: Guid.NewGuid() will generate unique GUIDs. GUIDs are, of course, much longer than an integer.
2. Use interlocked operations to increment a shared counter. Interlocked.Increment is thread safe. This will only work if all the requests happen in the same AppDomain: either process recycling or a refresh of the code will create a new AppDomain and restart the count.
3. Use an IDENTITY column in the database. The database is designed to handle this. Within the request that inserts the new row, use SCOPE_IDENTITY to select the value of the identity and update the in-memory data (ORMs should handle this for you). (This is SQL Server; other databases have equivalent functionality.)
Of these, #3 is almost certainly best.
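For comparison, here are the three options side by side in a small Python sketch. It is an analogy and not .NET code: `uuid.uuid4` plays the role of Guid.NewGuid, a lock-guarded counter plays the role of Interlocked.Increment, and SQLite's AUTOINCREMENT with `lastrowid` plays the role of an IDENTITY column with SCOPE_IDENTITY. The table and function names are made up.

```python
import itertools
import sqlite3
import threading
import uuid

# Option 1: a random GUID/UUID, generated without any coordination.
guid = uuid.uuid4()

# Option 2: a shared in-process counter guarded by a lock.
# Like the AppDomain caveat above, it restarts when the process restarts.
_counter = itertools.count(1)
_counter_lock = threading.Lock()


def next_local_id():
    with _counter_lock:
        return next(_counter)


# Option 3: let the database assign the id when the row is inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")


def insert_user(name):
    cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    return cur.lastrowid  # id generated by the database for this insert


alice_id = insert_user("alice")
bob_id = insert_user("bob")
```

With option 3 the ID only exists once the row is inserted, so the page would show it after the button click instead of reserving it up front.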
You could generate a Guid:
Guid.NewGuid()
Or you could let the database generate it for you upon insert. One way to do this is via a Sequence. See the wikipedia article for Surrogate Keys
From the article:
A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data.
The Sequence/auto-incremented column option is going to be simpler, and easier to remember when manually querying your DB (during debugging), but the DBA at my work says he's gotten 20% increases in performance by switching to Guids. He was using Oracle, and his database was huge, though :)
I use a static utility method to generate IDs: it takes the full datetime (including seconds), appends a random number of, say, 3 or 4 characters, and returns the whole thing, which you can then save to the database.
