I want a collection for storing two types: string and DateTime.
The string should be the key of my collection and the DateTime is the time the item was inserted into the collection. I want to remove items from the collection in FIFO order.
The collection should reject duplicate keys and be queryable by DateTime, so that if I want to know the number of items older than a given date it can answer.
There is no single built-in C# datatype that does all those things with maximal efficiency, mostly because, as you indicated, there are two things you'd have to look up by.
That being said, a Dictionary<string, DateTime> will be the simplest solution that gives you all the features you need, basically out of the box. However, that collection will give O(n) complexity for the DateTime lookups, and worse-than-O(1) removal time. That is probably not a big deal, but you didn't describe your performance requirements, the expected sizes of your dataset, or which access types happen most frequently.
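As a minimal sketch of that simplest approach (names here are illustrative), note how both the older-than count and the FIFO removal are O(n) scans:

using System;
using System.Collections.Generic;
using System.Linq;

var items = new Dictionary<string, DateTime>();

// Duplicate keys are rejected by checking before adding.
string key = "example";
if (!items.ContainsKey(key))
    items.Add(key, DateTime.UtcNow);

// "Older than" is an O(n) scan over the values.
DateTime cutoff = DateTime.UtcNow.AddMinutes(-5);
int olderCount = items.Values.Count(t => t < cutoff);

// FIFO removal is worse than O(1): find the oldest entry, then remove it.
if (items.Count > 0)
{
    string oldest = items.OrderBy(kv => kv.Value).First().Key;
    items.Remove(oldest);
}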
To improve on the "older-than-DateTime" lookup performance and the FIFO removal, you could also keep a second index, such as a SortedList. More memory usage and somewhat-slower overall insertion time but DateTime and removal queries will be faster. For "older-than-DateTime" you can use a binary search of the SortedList.Keys.
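Here is a rough sketch of that two-index approach (hypothetical class and member names; it assumes timestamps are unique, so real code would need a tie-breaker such as a sequence number for duplicate DateTimes):

using System;
using System.Collections.Generic;

public class TimestampedSet
{
    private readonly Dictionary<string, DateTime> byKey =
        new Dictionary<string, DateTime>();
    private readonly SortedList<DateTime, string> byTime =
        new SortedList<DateTime, string>();

    public bool Add(string key)
    {
        if (byKey.ContainsKey(key))
            return false;                    // reject duplicate keys
        DateTime now = DateTime.UtcNow;
        byKey.Add(key, now);
        byTime.Add(now, key);
        return true;
    }

    // FIFO removal: the oldest timestamp sits at index 0 of the sorted index.
    public string RemoveOldest()
    {
        if (byTime.Count == 0)
            return null;
        string key = byTime.Values[0];
        byTime.RemoveAt(0);
        byKey.Remove(key);
        return key;
    }

    // Count items older than the cutoff with a binary search over the
    // sorted keys (SortedList.Keys is an IList, so we search it manually).
    public int CountOlderThan(DateTime cutoff)
    {
        IList<DateTime> keys = byTime.Keys;
        int lo = 0, hi = keys.Count;         // find first index >= cutoff
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (keys[mid] < cutoff)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;                           // everything before it is older
    }
}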
It sounds like System.Collections.Generic.Dictionary<string, DateTime> should do the trick. It has methods to process the collection as you need.
I was just designing something and was wondering whether this is bad programming practice after all.
If I were to have a Dictionary and have the TValue updating in real time (here I mean every frame or every physics frame), would I be terribly mistaken?
The design is, in the end, meant to pick a single GameObject out of the Dictionary, with the TValue being the comparison factor by which the TKeys are sorted. I was doing this with a List, but a Dictionary seemed the more rational choice since I wanted to pair another value with each item for the comparison.
According to the MSDN documentation the performance of Dictionary for retrieving a value is close to O(1), meaning that the time to retrieve an item is independent of the size (number of elements stored) of the Dictionary.
Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.
I don't know the details of your project but I think you can update the TValue in every frame without too much performance overhead.
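As a hedged sketch (assuming a Unity MonoBehaviour; the class name and the distance metric are just placeholders), updating every value each frame is an O(n) walk over the entries, while each individual write stays close to O(1):

using System.Collections.Generic;
using UnityEngine;

public class NearestTracker : MonoBehaviour
{
    private readonly Dictionary<GameObject, float> scores =
        new Dictionary<GameObject, float>();

    void Update()
    {
        // Snapshot the keys so we can write values back while iterating.
        var keys = new List<GameObject>(scores.Keys);
        foreach (var go in keys)
            scores[go] = Vector3.Distance(go.transform.position,
                                          transform.position);
    }
}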
I have decided to implement the following ID strategy for my documents, which combines the document "type" with the ID:
doc.id = "docType_" + Guid.NewGuid().ToString("n");
// create document in collection
This results in IDs such as the following for my documents:
usr_19d17037ea7f41a9b20db1a90f71d30d
usr_89fe82c93b264076aa1b6e1fb4813aaf
usr_2aa58c1c970a4c5eaa206a755c1c7bf4
msg_ec43510732ae47a6a5d5f323b7461d68
msg_3b03ceeb7e06490d998c3e368b435851
With a RangeIndex policy in place on the ID, I should be able to query the collection for specific types. For example:
SELECT * FROM c WHERE STARTSWITH(c.id, 'usr_') AND ...
Since this is a web application with many different document types, many of my app's queries would implement this STARTSWITH filter by default.
My main concern here is the use of a random GUID string on the ID. I know that in SQL Server I have had issues with index performance and fragmentation while using random GUIDs on the primary key in a clustered index.
Is there a similar concern here? It seems that in DocumentDB, the care of managing indexes has been abstracted away from you. Would a sequential ID be more ideal/performant in any way?
tl;dr: Use separate fields for the type and a GUID-only ID and use hash indexes on both.
This answer is necessarily going to be somewhat opinionated given the nature of your questions. Let me first address what appears to be your primary concern, namely the fragmentation of indexes affecting performance.
DocumentDB assumes the use of GUIDs, and a hash index (as opposed to a range index) is ideally suited to finding the one matching entity by GUID. On the other hand, if you want to find a set of documents by looking at the beginning of the string, I suspect that would probably be more performant with a range index. This assumes that STARTSWITH is only optimized when used with range indexes, but I don't know for a fact that it is optimized even when you have a range index.
My recommendation would be to use separate fields for the type and a GUID-only ID and use hash indexes on both. This gives you the advantage of being assured that queries like the one you show would be highly performant and that queries which combine a type clause with other parameters would also be able to use at least one index. Note, hash indexes of this type (say 2x 3 bytes = 6 bytes/document) are highly space efficient, so don't worry about needing two of them. Those two combined should be much smaller than one range index, which needs to have enough precision to cover the entire length of your type+GUID.
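A sketch of the recommended document shape (the field and property names here are illustrative, not prescribed by DocumentDB):

using System;

var user = new
{
    id = Guid.NewGuid().ToString("n"), // GUID only, served by a hash index
    type = "usr",                      // document type as its own field
    name = "Alice"
};
// The STARTSWITH filter then becomes a simple equality filter:
//   SELECT * FROM c WHERE c.type = 'usr' AND ...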
Other than the performance and space reasons already discussed, I can see a couple of other disadvantages to combining the type with the GUID:
1. When trying to retrieve a single document (both for direct use and as part of a foreign key lookup), having the GUID separate and using a hash index will be faster and more space efficient than using a range index on the combined field.
2. Combining the type with the ID greatly complicates certain migrations that commonly need to be done at a later date. Let's say that you decide to break your users into authors and readers, for example. Users are foreign key referenced in other document types (blog post author, reader comment, etc.) by the user ID. If that ID includes the type, then you would need to not only change the user documents to accomplish the migration but also find and change every foreign key. If the two fields (GUID and type) were separate, then you'd only need to change the user documents.
Agile software craftsmanship is largely about making decisions that provide flexibility down the road.
As for the use of a sequential index, the trend in databases in general and NoSQL in particular, is that the complexity of providing a monotonically increasing sequential ID is greater than the space-efficiency advantages of that over a GUID. If you are going to stick with DocumentDB, I recommend that you just go with the flow and use GUIDs.
Everyone was telling me that a List is heavy on performance, so I was wondering: is it the same with a Dictionary? A Dictionary doesn't have a fixed size, so is there also a Dictionary with a fixed size, just like a normal array?
Thanks in advance!
A list can be heavy on performance, but it depends on your use case.
If your use case is the indexing of a very large data set, in which you plan to search for elements during runtime, then a Dictionary will behave with O(1) Time Complexity for retrievals (which is great!).
If you plan to insert/remove a little bit of data here and there at runtime then that's okay. But, if you plan to do constant insertions at runtime then you will be taking a hit on performance due to the hashing and collision handling functions.
If your use case requires a lot of insertions, removals, or iteration through consecutive data, then a list would be a good fit and fast. But if you are planning to search constantly at runtime, then a list could take a hit performance-wise.
Regarding the Dictionary and size:
If you know the size/general range of your data set then you could technically account for that and initialize accordingly. Or you could write your own Dictionary and Hash Table implementation.
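For instance, a small sketch: both List and Dictionary constructors accept an initial capacity, so you can account for a known size up front and avoid repeated resizing/rehashing as items are added.

using System.Collections.Generic;

// Pre-size the collections when the expected element count is known.
var lookup = new Dictionary<string, int>(capacity: 1000);
var items = new List<int>(capacity: 1000);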
In all:
Each data structure has its advantages and disadvantages, so think about what you plan to do with the data at runtime, then pick accordingly.
Also, keeping a data structure time and space complexity table is always handy :P
This depends on your needs.
If you just add items and then iterate a List in a sequential way, it is a good choice.
If you have a key for every item and need fast random access by key - use Dictionary.
In both cases you can specify the initial size of the collection to reduce memory allocation.
If you have a varying number of items in the collection, you'll want to use a List rather than recreating an array each time the item count changes.
With a dictionary, it's a little easier to get to specific items in the collection, given you have a key and just need to look it up, so performance is a little better when getting an item from the collection.
List<T> and Dictionary<TKey, TValue> live in the System.Collections.Generic namespace, and both are mutable types. There is a System.Collections.Immutable namespace, but it's not yet supported in Unity.
I have a Dictionary<int, string> cached (for 20 minutes) that has ~120 ID/Name pairs for a reference table. I iterate over this collection when populating dropdown lists and I'm pretty sure this is faster than querying the DB for the full list each time.
My question is more about if it makes sense to use this cached dictionary when displaying records that have a foreign key into this reference table.
Say this cached reference table is an EmployeeType table. If I were to query and display a list of employee names and types, should I query for EmployeeName and EmployeeTypeID and use my cached dictionary to look up each EmployeeTypeID's name as the records are displayed, or is it faster to just have the DB grab the EmployeeName and JOIN to get the EmployeeType string, bypassing the cached Dictionary altogether?
I know both will work but I'm interested in what will perform the fastest. Thanks for any help.
Optimization 101 says don't do it unless you need to: see Tips for optimizing C#/.NET programs.
But, yes, if this really is a totally static lookup for the lifetime of the application AND it takes up very little RAM then caching it would seem fairly harmless and a Dictionary lookup from RAM will be faster than a trip to the database.
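As a hedged sketch of that kind of cache (assuming the System.Runtime.Caching assembly is referenced; the cache key and the LoadFromDatabase helper are hypothetical):

using System;
using System.Collections.Generic;
using System.Runtime.Caching;

static class EmployeeTypeCache
{
    public static Dictionary<int, string> Get()
    {
        var cache = MemoryCache.Default;
        if (cache["EmployeeTypes"] is Dictionary<int, string> cached)
            return cached;                         // served from RAM

        Dictionary<int, string> fresh = LoadFromDatabase();
        cache.Set("EmployeeTypes", fresh,
                  DateTimeOffset.Now.AddMinutes(20)); // 20-minute expiry
        return fresh;
    }

    private static Dictionary<int, string> LoadFromDatabase()
    {
        // Placeholder: query the ~120-row reference table here.
        return new Dictionary<int, string>();
    }
}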
As for the 2nd part, you might as well let the database do the join; it'll probably have that table in RAM already, and the increased network payload would seem small.
But again, if you don't need to do it, don't do it! The danger here is that you do this one, then another, then another, the code grows ever more complex and RAM fills up with things you think you might need but which in fact are used rarely leaving less space for the OS/ORM/DB to do its work. Let the compiler, ORM and database decide what to keep in memory instead - they have a much bigger team focused on optimization!
I know you won't like the answer, but common sense dictates you do the easiest thing and, if it's too slow, then remedy it.
I'll explain myself. If you cache it, it'll probably be faster, as you wouldn't be hitting the database every time you load the page, but the gain might not be noticeable for what you're doing (i.e. you might have some other bottleneck that makes the gain insignificant), defeating the purpose of caching in the first place.
The only way, again, is to do it the easiest way (no caching), and only if you're not happy do you go the extra bit.
I have a series of objects I have created:
Item
Order
Song
etc.
Each object has a reasonable number of properties, and I use a DataReader where I pass it "SELECT * FROM <objectname>" and then fill a collection of objects and return the collection. This works as: GetOrdersCollection(), GetSongsCollection(), etc.
I understand SELECT * to be a performance problem, and additionally, sometimes I prefer to include additional columns in the select statement which do not exist in the object, and have those all returned as well.
So my question is, what is the best way to approach this problem?
Should I create a new object for every query type?
I tried performing a check to see if column is in datareader before storing it, but this presents perf. issues. Is there a negligible perf. way to avoid IndexOutOfRange?
Should I just use a DataTable and read right from the table?
I understand SELECT * to be a performance problem,
It's not a performance problem if there are only a few columns, or you need all of the columns anyway.
1. Should I create a new object for every query type?
You should create a new object for each table, and a new method for each query type.
2. I tried performing a check to see if column is in datareader before storing it, but this presents perf. issues. Is there a negligible perf. way to avoid IndexOutOfRange?
If you are referring to your fields by name rather than index, there shouldn't be any IndexOutOfRange problems. If you are referring to your fields by index, you can loop thru them where your index is less than the column Count(), and there shouldn't be any IndexOutOfRange problems.
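A minimal sketch of both styles (the connection string, table, and column name are placeholders):

using System;
using System.Data.SqlClient;

class ReaderDemo
{
    static void Main()
    {
        using (var conn = new SqlConnection("<connection string>"))
        using (var cmd = new SqlCommand("SELECT * FROM Orders", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // By name: no index to get wrong.
                    object id = reader["OrderId"];

                    // By index: bounded by FieldCount, so no IndexOutOfRange.
                    for (int i = 0; i < reader.FieldCount; i++)
                        Console.WriteLine("{0} = {1}",
                            reader.GetName(i), reader.GetValue(i));
                }
            }
        }
    }
}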
3. Should I just use a DataTable and read right from the table?
That's a perfectly good approach to start out with. Consider spending some time to learn a simple ORM as others have suggested. SubSonic is a good "first" ORM.
Performance-wise, reading from a forward-only data structure like a DataReader is going to net you the best speed and resource conservation.
On the other hand, the cost of populating objects (like an ORM does) can be negligible so long as you are not returning more than a handful of objects.
Your first step should be to profile your database and ensure that you have proper indexes. Write some tests to see where your largest time expense is in the process and optimize the target areas that cost you the most.
Are there any reasons you can't use a simple ORM generator like SubSonic? This will allow you to very easily access these types of collections, and they'll be strongly typed. You also won't have to worry about the SQL since the queries will be built by SubSonic.