Everyone was telling me that a List is heavy on performance, so I was wondering whether the same is true for a Dictionary, since a Dictionary doesn't have a fixed size either. Is there also a Dictionary with a fixed size, just like a normal array?
Thanks in advance!
A list can be heavy on performance, but it depends on your use case.
If your use case is indexing a very large data set that you plan to search at runtime, then a Dictionary will give you O(1) average-case retrievals (which is great!).
If you plan to insert or remove a little data here and there at runtime, that's fine. But if you plan to do frequent insertions at runtime, you will take a performance hit from the hashing and collision-handling work.
If your use case involves a lot of insertions, removals, and iteration over consecutive data, then a List would be a good, fast fit. But if you plan to search it constantly at runtime, a List can take a hit performance-wise, since searching by value is O(n).
Regarding the Dictionary and size:
If you know the size/general range of your data set then you could technically account for that and initialize accordingly. Or you could write your own Dictionary and Hash Table implementation.
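For example (a minimal C# sketch; the 10,000 here is just an assumed upper bound for your data set), you can pass an initial capacity to the Dictionary constructor so the internal buckets are allocated up front:

// Pre-size the dictionary so it doesn't have to grow/rehash while you fill it.
// 10000 is an assumed upper bound, not a hard limit.
var lookup = new Dictionary<int, string>(10000);
lookup.Add(42, "some value");   // stays O(1) amortized; no resize until the capacity is exceeded

It's not a truly fixed-size dictionary the way an array is fixed; it just avoids the intermediate resizes while you fill it.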
In all:
Each data structure has its advantages and disadvantages. So think about what you plan to do with the data at runtime, then pick accordingly.
Also, keeping a data structure time and space complexity table is always handy :P
This depends on your needs.
If you just add items and then iterate over them sequentially, a List is a good choice.
If you have a key for every item and need fast random access by key, use a Dictionary.
In both cases you can specify the initial size of the collection to reduce memory allocation.
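As a rough C# illustration of both points (the names and sizes here are made up):

// Sequential add-then-iterate: a List is the natural fit.
var names = new List<string>(1000);          // optional initial capacity
names.Add("Alice");
foreach (var name in names) { /* ... */ }

// Fast random access by key: a Dictionary is the natural fit.
var nameById = new Dictionary<int, string>(1000);
nameById[7] = "Alice";
if (nameById.TryGetValue(7, out var found))
{
    // found == "Alice", located without scanning the whole collection
}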
If the number of items in the collection varies, you'll want to use a list rather than recreating an array each time the count changes.
With a dictionary, it's a little easier to get to specific items in the collection, given you have a key and just need to look it up, so performance is a little better when getting an item from the collection.
List<T> and Dictionary<TKey, TValue> are part of the System.Collections.Generic namespace and are mutable types. There is a System.Collections.Immutable namespace, but it's not yet supported in Unity.
First off, I know how Firestore works and have spent a lot of time evaluating different approaches for a good structure. Still, I am considering the following scenario:
There is a database of known recipes. Users can add recipes, but they have to be confirmed to be real recipes and not just some variations. So every user can choose recipes from the user-generated list of recipes to state that they know how to cook them (or add new ones).
Now I want users to share their list of recipes with others, but this is where I am not sure how this can best be accomplished using Firestore. The trick is that I want to show all the recipes at once and don't want to paginate them.
I am currently evaluating two possibilities:
Subcollections
Whenever a user shares his list, the user looking at said list will have to load the entire list of the recipes which can result in a high amount of document reads (I suppose realistically ~50, in very rare cases maybe 1000).
Pros:
More natural structure
Easier to maintain (e.g. deleting a recipe, checking if a specific one exists)
Easier to add fields (e.g. timeOfCreation, comment, personalRating, ...)
Cons:
Can result in a high amount of reads in the long run
Arrays
I could save every known recipe (the id and an imageURL) inside the user's document (or in a single subdocument "KnownRecipes") as an array. This array could take the form of:
recipesKnown: [{rid: 293ndwa, imageURL: image1.com, timeAdded: 8371201332},
{rid: 9012831, imageURL: image1.com, timeAdded: 8371201871},
{rid: jd812da, imageURL: image1.com, timeAdded: 8371201118},
...
]
Pros:
I only need one document read whenever someone wants to see another user's list
Reading a user's list is probably faster
Cons:
It's hard to update a specific recipe (e.g. someone wants to change the imageURL: I need to change the list locally and send the entire document as an update to the server - since I cannot just change a single element in the array)
When a user ends up with around 1000 recipes (this will maybe never happen, but it could), Firestore's 1 MiB document size limit could be reached. A possible workaround would be to create a separate document and split the array across the two documents.
For me, the idea with subcollections seems to be the "cleaner" solution to this problem, but maybe I am missing some arguments on why one of these solutions would be superior to the other.
My most common queries are as follows (ordered descending by importance):
Which recipes can a user cook
Add a recipe a user can cook to the user's list
Who can cook a specific recipe (there is a Recipe -> Cooks subcollection)
Update an existing recipe a user can cook
The answer to your question depends on the level of scalability you want to achieve.
If by design the amount of sub-data you want to store is limited and very low, you should use arrays, since you reduce the number of document reads, which means lower costs.
If your sub-data is supposed to increase "unlimitedly" over time, you should use sub-collections.
If you're building a database which is not supposed to scale in any direction (Proof of concept, very small business, etc.) just go with what you feel more comfortable with.
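To make that read-cost difference concrete, here is a minimal sketch using the Google.Cloud.Firestore C# client (the collection names users/knownRecipes, the recipesKnown field and the userId value are taken from or assumed from the question; treat this as an illustration, not a drop-in implementation):

FirestoreDb db = FirestoreDb.Create("my-project-id");   // assumed project id
string userId = "someUserId";                           // hypothetical

// Array approach: the whole shared list costs exactly 1 document read.
DocumentSnapshot userSnap = await db.Collection("users").Document(userId).GetSnapshotAsync();
var recipesKnown = userSnap.GetValue<List<Dictionary<string, object>>>("recipesKnown");

// Subcollection approach: 1 document read per known recipe (~50, rarely ~1000).
QuerySnapshot recipeDocs = await db.Collection("users").Document(userId)
                                   .Collection("knownRecipes")
                                   .GetSnapshotAsync();

With the array approach you can still append with FieldValue.ArrayUnion, but changing a single existing element means rewriting the array, as the question already notes.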
I'm researching the same question...
One of the questions is whether the data held in the document will ever go past 1 MB, which is the limit for a document. Researching a bit on how much plain text can be held in 1 MB: it's a hell of a lot. Still, if the data were to grow incredibly large, it would eventually break. Thus, if you are thinking really big, go with sub-collections.
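As a rough back-of-the-envelope check (the ~100 bytes per entry is an assumption, not a measured number): each entry in the question holds an id, an image URL and a timestamp, plus the field names, so call it ~100 bytes; 1 MiB / ~100 B ≈ 10,000 entries. So ~1,000 recipes should fit comfortably in one document, but the ceiling is closer than it first appears once entries carry longer URLs or extra fields.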
If we had to use the Firebase element logic the answer would be sub-collections.
Still, I guess the major point is the data pulled. If you fetch the user document you will directly be pulling down that whole MB of data. With a sub-collection it won't load by default, and even when you do load it you can still lazy-load.
I guess for the kind of setup you are doing sub-collections.
A key is an additional pro/con of the collection approach:
a key could help to avoid duplicates, but this requires thinking about what the definition of a duplicate is (which might change);
the array's no-key behavior could be emulated via auto-generated IDs.
p.s. #Thomas's list of pros/cons in the question has been quite helpful.
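For instance, a hedged C# sketch of both behaviors with the Google.Cloud.Firestore client, reusing db and userId from the earlier sketch (knownRecipes, the field names and the rid value are assumptions based on the question):

CollectionReference knownRecipes =
    db.Collection("users").Document(userId).Collection("knownRecipes");

var data = new Dictionary<string, object>
{
    { "imageURL", "image1.com" },
    { "timeAdded", FieldValue.ServerTimestamp }
};

// "Key" behavior: use the recipe id as the document id, so re-adding the
// same recipe overwrites it instead of creating a duplicate.
string rid = "293ndwa";   // example id from the question
await knownRecipes.Document(rid).SetAsync(data);

// "No-key" / array-like behavior: let Firestore generate an auto-id,
// which allows duplicates unless you check first.
await knownRecipes.AddAsync(data);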
The official recommendation from the team is, to my knowledge, to put all data types into a single collection and have something like a type=someType field on the documents to distinguish the types.
Now, if we assume large databases with partitioning, where different object types can:
Have completely different fields (so no common field for partitioning)
Be related (through references)
How do we organize things so that items that belong together end up in the same partition?
For example, let's say we have:
User
BlogPost
BlogPostComment
If we store them as separate types with type=user|blogPost|blogPostComment in the same collection, how do we ensure that a user, his blog posts, and all the corresponding comments end up in the same partition?
Is there some best practice for this?
[UPDATE]
Can you ever avoid cross-partition queries completely? Should that be a goal? Or you just try to minimize them?
For example, you can partition your data perfectly for 99% of cases/queries but then you need some dashboard to show aggregates from all-the-data. Is that something you just accept as inevitable and try to minimize or is it possible to avoid it completely?
I've written about this somewhat extensively in other similar questions regarding Cosmos.
Basically, when dealing with many different logical entity types in a single Cosmos collection the easiest option is to put a generic (or abstract, as you refer to it) partition key on all your documents. At this point it's the concern of the application to make sure that at runtime the appropriate value is chosen. I usually name this document property either partitionKey, routingKey or something similar.
This is extremely important when designing for optimal query efficiency as your choice of partition keys can have a huge impact on query and throughput performance. A generic key like this lets you design the optimal storage of your data as it benefits whatever application you're building.
Even something like tenant does not make sense as different tenants might have wildly different data size and access patterns. Instead you could include the tenantId at runtime as part of your partition key as a kind of composite.
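As a sketch of that pattern in C# (the class and property names here are hypothetical, and the composition rule is just one example): every document type carries the same generic partition key property, and at runtime you compose its value so related documents land together.

// Every document type shares one generic partition key property;
// its value is composed at runtime so related documents co-locate.
public class CosmosDocument
{
    public string Id { get; set; }
    public string PartitionKey { get; set; }   // e.g. "tenant1_user42"
    public string Type { get; set; }           // "user" | "blogPost" | "blogPostComment"
}

public static class PartitionKeys
{
    // Hypothetical composition rule: tenantId + owning userId, so a user,
    // their posts and those posts' comments land in the same logical partition.
    public static string For(string tenantId, string owningUserId) => $"{tenantId}_{owningUserId}";
}

// Usage: all three documents share the same partition key value.
// var user    = new CosmosDocument { Id = "42", Type = "user",            PartitionKey = PartitionKeys.For("tenant1", "42") };
// var post    = new CosmosDocument { Id = "p7", Type = "blogPost",        PartitionKey = PartitionKeys.For("tenant1", "42") };
// var comment = new CosmosDocument { Id = "c1", Type = "blogPostComment", PartitionKey = PartitionKeys.For("tenant1", "42") };

Note that with this particular rule a comment is keyed by the post owner's id rather than the commenter's; that kind of trade-off is exactly the decision the application has to own.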
UPDATE:
For certain query patterns it might be possible to serve them entirely out of a single partition. It's definitely not the end of the world if things end up going cross partition though. The system is still quick. If possible, limiting the amount of partitions that need to be touched for a given query is ideal but you're never going to get away from it 100% of the time.
A partition should hold data related to a group that is expected to grow, for instance a Tenant, which will group many documents (which can be of different types, as you have mentioned). So the partition key in this instance should be the TenantId. Partitioning is more about the data relating to a group than about the type of data. If the data is related to a User then you could use the UserId; however, many users may comment on the same posts, so it doesn't seem like a good candidate for a partition key unless there is some de-normalization of the user info so it doesn't have to relate back to the other users directly, if that makes sense.
Ran into an annoying problem - I need some way to tell if the bucket I'm trying to fill is empty or not (the buckets are stored as an array of value type structs for key-value pairs).
If I were to reserve a key value for marking buckets empty, that would just mean that any data unfortunate enough to stumble on that key value would never be accessible.
On the other hand, including a boolean in the KVP struct would increase the size of the struct from 16 to 24 bytes (such a waste, and I'm tight on memory as it is). Has anybody figured out a good solution for this?
This is a problem that is as intrinsic to hash tables as collisions. A related problem is dealing with deleting from a hash table, again, in the context of collisions. There's no solution that doesn't involve some compromise in performance, so it's pretty common to see hash table implementations that have a particular key that is illegal.
By far the most direct solution is to just special-case the key value that you're using to mean empty. That is, if the user is trying to store a key value 0, you just put it in a special array you keep around for that purpose.
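A minimal C# sketch of that special-casing idea (hypothetical structure; key 0 is reserved as the "empty bucket" marker and the one legitimate 0 key is parked in a dedicated side slot, so the bucket struct stays at 16 bytes):

struct Bucket               // 16 bytes: long key + long value, no extra flag
{
    public long Key;        // 0 means "empty" -- the reserved sentinel
    public long Value;
}

class FixedHashTable
{
    private readonly Bucket[] _buckets = new Bucket[1024];

    // Side slot for the single key that collides with the sentinel.
    private bool _hasZeroKey;
    private long _zeroValue;

    public void Add(long key, long value)
    {
        if (key == 0) { _hasZeroKey = true; _zeroValue = value; return; }

        int i = (int)((ulong)key % (ulong)_buckets.Length);
        while (_buckets[i].Key != 0 && _buckets[i].Key != key)
            i = (i + 1) % _buckets.Length;          // linear probing; assumes the table never fills up

        _buckets[i] = new Bucket { Key = key, Value = value };
    }

    public bool TryGet(long key, out long value)
    {
        if (key == 0)                               // special-case the reserved key
        {
            value = _zeroValue;
            return _hasZeroKey;
        }

        int i = (int)((ulong)key % (ulong)_buckets.Length);
        while (_buckets[i].Key != 0)                // key 0 marks an empty bucket
        {
            if (_buckets[i].Key == key) { value = _buckets[i].Value; return true; }
            i = (i + 1) % _buckets.Length;
        }
        value = 0;
        return false;
    }
}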
Really lame hash tables that only work with pointers don't usually have this issue, since you can always find a pointer value which the caller can't pass in (such as a pointer to an object you own). Obviously hash tables using linked lists or array elements don't have this problem either, but then, there's a massive performance penalty for those.
You could probably find some clever way to encode it inside the table itself, by using multiple elements. The only way this would be better is if it's somehow unified with deleted-element handling or something else, so it would be free or faster than checking some separate list.
First, I do not want to use Lucene as a database, per se, but rather as the primary look-up for displaying lists to the user. This would be a canned search to Lucene where we would pull, say, all user information to be displayed in a grid list. We are building an ASP.Net web application, first of all. Is it a good idea to pull, from Lucene initially, a list of items (that can be paged) to display to the user in some sort of grid format? The only time we would call the database is when a user selects a specific record to view or update.
My concern is stale data coming from Lucene. I have been looking for information about adds and updates to an index, but it is unclear to me if my scenario is better suited for a database rather than Lucene. My other developers and I have been going back and forth about this, but unfortunately, we don't know enough about how Lucene handles writes and reads.
I'm not sure if it's a good or bad fit for your use case. Hopefully I can give you some insight on how Lucene stores its data, and you can make a decision from that.
Lucene is extremely quick if you want to search for an item in its index. The time it takes to index items isn't so quick. It's by no means slow if you look at everything it's doing, but it adds complexity that you need to understand and deal with.
Lucene is essentially a document store. So each item in Lucene is a Document, which can hold a certain number of fields. Those fields are essentially key-value pairs, though right now Lucene only supports string and byte[] as values, and strings only as keys. Each field can be indexed and/or analyzed (or neither). Indexing simply means you can search against that field's data, generally only via exact matches and wildcards. Analyzing gives you better searching capabilities, since it will take the string and tokenize it. Depending on the analyzer, it will tokenize it differently. The most common approach is whitespace tokenization plus stopwords: essentially marking each word as a term unless it's a stopword (a, an, the, as, etc.).
The real killer when Lucene is used this way is that you can't update a document in an index in place. When you pull out a document to update it and change a field, the call to UpdateDocument() actually marks the old document as deleted and inserts a new document.
Notice I said it marks it as deleted. That introduces another thing related to Lucene indexes: optimization of the index. When you write to an index, every so often a segment of the index is written to disk. (It's temporarily stored in RAM for fast indexing.) When you run a search on an index, Lucene needs to open all those different segments to find the terms to search against (it has to order them in a way, too). This means that if you have many segments, searching can be slow. A call to Optimize() will not only merge the segments together, it will also remove any documents marked for deletion, thus lowering your index size as well.
However, optimizing your index requires around 1.5x more space while the optimization is being done, sometimes more. Fortunately, Lucene.net is transactional during an optimization, which means not only will your index not be corrupt if an optimization fails, any existing IndexReader you have open will still be able to search and read from the index when you're optimizing it.
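To make the update/optimize behavior concrete, here is a rough Lucene.Net sketch (API names from the 2.9/3.0-era library this answer is describing; the index path and the "id"/"name" fields are made up):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

var dir = FSDirectory.Open(new System.IO.DirectoryInfo("index"));
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);

// Each item is a Document made up of string fields.
var doc = new Document();
doc.Add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));        // exact-match key
doc.Add(new Field("name", "Jane Smith", Field.Store.YES, Field.Index.ANALYZED));  // tokenized, searchable
writer.AddDocument(doc);

// "Updating" actually deletes the old document (matched by term) and adds a new one.
writer.UpdateDocument(new Term("id", "42"), doc);

// Merge segments and physically drop the deleted documents;
// needs roughly 1.5x the index size in free disk while it runs.
writer.Optimize();
writer.Close();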
In short, if it were me: if you're expecting to get only one result from a search each time, I might not recommend Lucene. Lucene especially shines when you're searching across many documents and pulling back many documents. It's an inverted index and it's good at that. For a single lookup, you may be better off with a database. Unfortunately, the only way you'll really find out is to benchmark it. Fortunately, at least Lucene.Net is very easy to set up for something like that.
Also, if you do use Lucene.Net, consider our 2.9.4g branch. You may not be able to use it, since it is technically not release code, but it is a bit faster than normal lucene, as we've added generics and removed a bit of the costly boxing done in previous versions.
Lucene is not a good fit for the scenario you're describing. You're looking at caching data.
Why not use the ASP.NET cache? If you need a more robust caching solution, there's memcached and a whole host of other ones... even NoSQL stores like mongo, redis, etc.
Obviously, you'll need to manually remove items from the cache on updates to stop serving stale data.
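For example, a hedged sketch against the classic System.Web HttpRuntime cache (UserRow and GetUsersFromDatabase are placeholders for your own grid model and data access):

using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Caching;

public static class UserGridCache
{
    public static List<UserRow> GetUserGrid()
    {
        // Serve the grid from the cache; fall back to the database when it's cold.
        var users = (List<UserRow>)HttpRuntime.Cache["user-grid"];
        if (users == null)
        {
            users = GetUsersFromDatabase();                  // placeholder data-access call
            HttpRuntime.Cache.Insert("user-grid", users, null,
                DateTime.UtcNow.AddMinutes(10),              // absolute expiration
                Cache.NoSlidingExpiration);
        }
        return users;
    }

    public static void InvalidateUserGrid()
    {
        // Call this after any insert/update so the next request re-reads fresh data.
        HttpRuntime.Cache.Remove("user-grid");
    }
}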
I think this is a viable solution, and I say this because there is a major open source content management system that is using a technique very similar to what you've described. It's called Umbraco, and its version 5 is going to be using a customized version of Lucene.NET for a sort of cache.
You can look at the project and source here: http://umbraco.codeplex.com/SourceControl/changeset/view/5a7c9af9bbf9
I have a flex webapp that retrieves some names & addresses from a database. Project works fine but I'd like to make it faster. Instead of making a call to the database for each name request, I could pre-load all names into an array & filter the array when the user makes a request. Before I go down this route though I wanted to check if it is even feasible to have an application w/ 50,000 or 1 million elements in an array? What is the limit b/f it slows down the app? (I anticipate that it will have a lot to do w/ what else is going on in my app but for this sake lets just assume the app ONLY consists of this huge array).
Searching through a large array can be slower than necessary, particularly if you're talking about 1 million records.
Can you split it into a few still-large-but-smaller arrays? If you're always searching by account number, then divide them up based on the first digit or two digits.
To directly answer your question though, pure AS3 processing of a 50,000 element array should be fine. Once you get over 250,000 I'd think you need to break it up.
Displaying that many UI elements is different though. If you try to bind a chart to a dataProvider with 10,000 elements, it's too much. Same for a list or datagrid.
But for pure model data, not ui bound, I'd recommend up to 250,000 in my experience.
If you're loading large amounts of data (not sure if you're using Lists though), you could check out James Ward's post about using AsyncListView with paging to grab the data in chunks as it's needed. Gonna try and implement something like this soon. His runnable example uses 100,000 rows with paging of 100 (works with HttpService/AMF type calls):
http://www.jamesward.com/2010/10/11/data-paging-in-flex-4/
Yes, you could probably stuff a few million items in an array if you wanted to, and the Flash player wouldn't yell at you. But do you really want to?
Is the application going to take longer to start if it has to download the entire database locally before being able to work? If the additional time needed to download that much data isn't significant, are a few database lookups really worth optimizing?
If you have a good use case to do this, you're going to have to pay attention to the way you use those data structures. Looping over the array to find an item is going to be a bit slow, so you'll want to create indexes locally, most likely by using a few hash structures. The more flexible you allow the search queries to be, the more interesting the indexing issues will be.
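The Flex code itself would be ActionScript, but the indexing idea is language-agnostic; here is a hedged C# sketch of building hash indexes once at load time so per-request lookups don't loop over the whole array (Contact, AccountNumber and Name are made-up names, and account numbers are assumed unique):

using System;
using System.Collections.Generic;
using System.Linq;

public record Contact(string AccountNumber, string Name, string Address);

public class LocalDirectory
{
    private readonly Dictionary<string, Contact> _byAccount;          // O(1) exact lookups
    private readonly Dictionary<char, List<Contact>> _byNameInitial;  // narrows prefix searches

    public LocalDirectory(IReadOnlyList<Contact> all)
    {
        // Build the indexes once when the data is preloaded,
        // instead of scanning the full array on every request.
        _byAccount = all.ToDictionary(c => c.AccountNumber);
        _byNameInitial = all
            .GroupBy(c => char.ToUpperInvariant(c.Name[0]))
            .ToDictionary(g => g.Key, g => g.ToList());
    }

    public Contact FindByAccount(string accountNumber) =>
        _byAccount.TryGetValue(accountNumber, out var c) ? c : null;

    public IEnumerable<Contact> FindByNamePrefix(string prefix) =>
        _byNameInitial.TryGetValue(char.ToUpperInvariant(prefix[0]), out var bucket)
            ? bucket.Where(c => c.Name.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            : Enumerable.Empty<Contact>();
}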