is .net Guid guranteed to be unique when called within a tick? - guid

There are 10,000 ticks in a millisecond.
one of answers in How does C# generate GUIDs? mentions that there are "emergency uniquifier bits" but I don't know enough about that to really tell if I call Guid.GetNext() at tick 1 and 2, I get different GUID.

Related

Storing and querying for announcements between two datetimes

Background
I have to design a table to store announcements in DynamoDB. Each announcement has the following structure:
{
"announcementId": "(For the frontend to identify an announcement to the backend)",
"author": "(id of author)",
"displayStartDatetime": "",
"displayEndDatetime": "",
"title": "",
"description": "",
"image": "(A url to an image)",
"link": "(A single url to another page)"
}
As we are still designing the table, alterations to the structure are permitted. In particular, announcementId, displayStartDatetime and displayEndDatetime can be changed.
The main access pattern is to find the current announcements. Users have a webpage which they can see all current announcements and their details.
Every announcement has a date for when to start showing it (displayStartDatetime) and when to stop showing it (displayEndDatetime). The announcement is should still be kept in the table after the current datetime is past displayEndDatetime for reference for admins.
The start and end datetime are precise to the minute.
Problem
Ideally, I would like a way to query the table for all the current announcements in one query.
However, I have come to the conclusion that it is impossible to fuse two datetimes in one sort key because it is impossible to order two pieces of data of equal importance (e.g. storing the timestamps as a string will mean one will be more important/greater than the other).
Hence, as a compromise, I would like to sort the table values by displayEndDatetime so that I can filter out past announcements. This is because, as time goes on, there will be more past announcements than future announcements, so it will be more beneficial to optimise that.
Compromised Solution
Currently, my (not very good) solutions are:
Use one "hot" partition key and use the displayEndDatetime as the sort key.
This allows me to filter out past announcements, but it also means that all the data is in a single partition. I could run a scheduled job every now and then to move the past announcements to a different spaced out partitions.
Scan through the table
I believe Scan will look at every item in the table before it performs any filtering. This solution doesn't seem as good as 1. but it would be the simplest to implement and it would allow me to keep announcementId as the partition key.
Scan a GSI of the table
Since Scan will look through every item, it may be more efficient to create a GSI (announcementId (PK), displayEndDatetime (SK)) and scan through that to retrieve all the announcementIds which have not passed. After that, another request could be made to get all the announcements.
Question
What is the most optimised solution for storing all announcements and then finding current announcements when using DynamoDB?
Although I have listed a few possible solutions for sorting the displayEndDatetime, the main point is still finding announcements between the start and end datetime.
Edit
Here are the answers to #tugberk's questions on the background:
What is the rate of writes you anticipate receiving (i.e. peak writes per second you need to handle)?
I am uncertain of how the admins will use this system, announcements can be very regular (about 3/day) or very infrequent (about 3/month).
How much new data do you anticipate storing daily, and how do you think this will grow?
As mentioned above, this could be about 3 announcements a day or 3 a month. This is likely to remain the same for as long as I should be concerned about.
What is the rate of reads (e.g. peak reads per second)?
I would expect the peak reads per second to be around 500-1000 reads/s. This number is expected to grow as there are more users.
How many announcements a user can see at a time (i.e. what's avg/max number of announcements will be visible at any point in time)? Practically thinking, this shouldn't be more than a few (e.g. 10-20 at most).
I would expect the maxmimum number of viewable announcements to be up to 30-40. This is because there could be multiple long-running announcements along with short-term announcements. On average, I would expect about 5-10 announcements.
What is the data inconsistency gap you are happy to have here (i.e. do you need seconds level precision, or would you be happy to have ~1min delay on displaying and hiding announcements)?
I think the speed which the announcement starts showing is important, especially if the admins decide that this is a good platform for urgent announcements (likely urgent to the minute). However, when it stops showing is less important, but to avoid confusing the users the announcement should stop display at most 4 hours after it is past its display end datetime.
This type of questions are always hard to answer here as there is so many assumptions on the answer as it's really hard to have all the facts. But I will try to give you so ideas, which may help you think about your data storage choice as well as giving you further options.
I know what I am doing, and really need to use DynamoDB
Edited this answer based on the OP's answers to my original questions.
As you really need to us DynamoDB for this for internal reasons, I think it's more suitable to store the data in two DynamoDB tables for both serving reads and writes as nearly all access patterns I can think of will hit multiple partitions if you have one table. You can get away with a GSI, but it's not too straight forward how to do it, and I am not sure whether there is any advantage to doing it that way.
The core thing you need to optimize for is the reads as you mentioned it can go up to 2K/rps which is big enough to make this the part where you optimize your architecture against. Based on your assumptions of having 3 announcements a day, it's nothing to worry about as far as the writes are concerned.
General idea is this:
I would consider using one DynamoDB table to handle writes where you can configure author identifier as the partition key, and announcement identifier as the sort key (and make your primary key as the combination of both). This will allow you to query all the announcements for a given author easily.
I would also have a second DynamoDB table to handle reads, where you will only store active announcements which your application can query and retrieve all of it with a Scan query (i.e. O(N)), which is not a concern as you mentioned there will only be 30-40 active announcments at any point in time. Let's imagine this to be even 500, you are still OK with this structure. In terms of partition and sort key, I would just have an active boolean field as the partition key, which you will always have it as true, you can have the announcement id as the sort key, and make the combination of both as the primary key. If you care about the sort of these announcements, you can adjust the sort key accordingly but make sure it's unique (i.e. consider concatenating the announcement identifier, e.g. {displayBeginDatetime-in-yyyyMMddHHmmss-format}-{announcementId}. With this way you will guarantee that you will only hit one partition. However, you can actually simplify this and have the announcement identifier as the partition key and primary key as I am nearly sure that DynamoDB will store all your data in one partition as it's going to be so small. Better to confirm this though as I am not 100% sure. The point here is that you are much better of ensuring hitting one partition with this query.
Here is how this may work, where there are some edge cases I am overlooking:
record the write inside the first DynamoDB for an announcement. When an announcement is written, configure displayEndDatetime as the TTL of that row, with the assumption that you don't need this record in this table when an announcement expires.
have a job running for N minute (one or more, depending on the data inconsistency gap you can handle), which will Scan the entire DynamoDB table across partitions (do it in a paginated way), and makes decisions on which announcements are currently visible. Then, write your data into the second DynamoDB table, which will handle the reads, in the structure we have established above so that your consumer can read from this w/o worrying about any filtering as the data is already filtered (e.g. all the announcements here are visible ones). Note that Scan is fine here as you are running this once every N minutes, with the assumption that you are ok with at least 1 minute + processing time data inconsistency gap. I would suggest running this every 10 minutes or so, if you don't have strong data consistency requirements.
On the read storage system, also configure displayEndDatetime as the TTL for the row so that it gets automatically deleted.
Configure DynamoDB streams on the first DynamoDB table, which has 24 hours retention and exactly once delivery guarantee, and have a lambda consumer of this stream, which to handle when an item is deleted (will happen when TTL kicks in for a particular row) to keep a record of this announcements somewhere else, for longer retention reasons, and will need to expose it through different access pattern (e.g. show all the announcements per author so that they can reenable old announcements), as you mentioned in you question. You can configure a lambda event sourcing with DynamoDb streams, which will allow you to handle failures with retries, etc. Make sure that your logic in these lambdas are idempotent so that you can retry safely.
The below is the parts from my original question, which are still relevant to anyone who might be trying to achieve the same. So, I will leave them here but they are less relevant as the OP needs to use DynamoDB.
Why DynamoDB?
First of all, I would question why you need DynamoDB for this, as it seems like your requirements are more read heavy than it's being write heavy, where I think DynamoDB shines the most due to its partitioned out of the box nature.
Below questions would help you understand whether you really need DynamoDB for this, or can you get away with a more flexible data storage system:
what is the rate of writes you anticipate receiving (i.e. peak writes per second you need to handle)?
how much new data do you anticipate storing daily, and how do you think this will grow?
what is the rate of reads (e.g. peak reads per second)?
How many announcements a user can see at a time (i.e. what's avg/max number of announcements will be visible at any point in time)? Practically thinking, this shouldn't be more than a few (e.g. 10-20 at most). This will help you understand whether you need will be OK pulling all the visible announcements in one go, or need a pagination system.
What is the data inconsistency gap you are happy to have here (i.e. do you need seconds level precision, or would you be happy to have ~1min delay on displaying and hiding announcements)?
Actually, I don't need DynamoDB
Based on my assumptions on your consumption and admin needs for this use case, I believe you don't need DynamoDB for this with the assumption of not having high number of writes for this (which might be wrong), and if these assumptions are correct, the above is a super over engineered solution for you. Let's say it's correct, I think you are better of using PostgreSQL for this, which can give you easy ability to change your access pattern as you see fit with further indexing, and for the current access pattern you have, you can have a range query over the start and end times.

Unity + Firebase: is it possible to append data to a keys value, or do I have to retrieve keys data every time?

I'm a bit worried that I will reach the free data limits of Firebase in a student project.
Basically my question is:
is it possible to append to the end of the string instead of retrieving key and value, appending and uploading again.
What I want to achieve:
I have to create statistics of user right/wrong answers for particular questions.
I want to have a kvp:
answers: 1r/5w/3r
Where number is the number of users guesses and r/w means right wrong. Whenever the guessing session ends I want to add /numberOfGuesses+RightOrWrongAnswer and the end.
I'm using Unity 2018.
Thank you in advance for all the help!
I don't know how your game is architected or how many people are playing, but I'd be surprised if you hit your free limit on a student project (you can store 1GB and download 10GB). That string is 8 bytes, let's assume worst case scenario: as a UTF32 string, that would be 32 bytes of data - you'd have to pull that down 312 million times to hit a cap (there'll be some overhead, but I can't imagine it being a hugely impactful). If you're afraid of being charged, you can opt to not have a credit card on file to be doubly sure you stay on a student budget.
If you want to reduce the amount of reading/writing though, I might suggest that instead of:
key: <value_string> (so, instead of session_id: "1r/5w/3r")
you structure more like:
key:
- wrong: 5
- right: 3
So have two more values nested under your key. One for all the wrong answers, just an incrementing integer. Then one for all the right answers: just an incrementing integer.
The mechanism to "append" would be a transaction, and you should use these whether you're mutating a string or counter. Firebase tries to be smart with data usage and offline caching, but you don't get much more control other than that.
If order really matters, you might want to get cleverer. You'll generally want to work with the abstractions Realtime Database gives you though to maximize any inherent optimizations (it likes to think in terms of JSON documents, so think about your data layout similarly). This may not be as data optimal, but you may want to consider instead using a ledger of some kind (perhaps using ServerValue.Timestamp to record a single right or wrong answer, and having a cloud function listening to sum up the results in the background after a game - this would be especially useful if you plan on having a lot of users trying to write the same key at the same time).

API design: naming "I want one more value outside time boundaries"

I'm designing an API to query the history of a value over a time period. Think about a temperature value, and you want to query all the values for today.
I have a from and a to parameter to specify the boundaries of the query.
The values available may not exactly match the boundaries requested. For example, if from is 2016-02-17T00:00:00Z, the first value may be on 2016-02-17T00:04:30Z. To fully represent a graph of the period, it is necessary to retrieve one more value outside the given range. The value on 2016-02-16T23:59:30Z is useful and it would be convenient for the user to not have to make another query to retrieve it.
So as the API designer I'm thinking about a parameter with a pair a of boolean values that would tell for each boundary: give me one more value if there is no value exactly on the boundary.
My question is how to name this parameter as English is not my native language.
Here are a few ideas I have so far but with which I'm not totally satisfied:
overflow=true,true
overstep=true,true
edges=true,true
I would also appreciate any links to existing APIs with that feature, either web API or in programming languages.
Is it possible to make this more of a function/RPC that a traditional rest resource endpoint, so rather than requesting data for a resource between 2 dates like
/myResource?from=x&to=x
something more like
/getGraphData?graphFrom=&graphTo=x
Whilst its only a naming thing, it makes it a bit more acceptable to retrieve results for a task wrapped with outer data, rather than violating parameters potentially giving unexpected or confusing results.

Alternative to GUID with Scalablity in mind and Friendly URL

I've decided to use GUID as primary key for many of my project DB tables. I think it is a good practice, especially for scalability, backup and restore in mind. The problem is that I don't want to use the regular GUID and search for an alternative approach. I was actually interested to know what Pinterest i using as primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even it it is stores as string. Is there any way to achieve this?
Furthermore, youtube also use a different kind of hashing technique which I can't figure it out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me shorten url like scheme.
I prefer the shortest one, but I know that it won't guarantee to be unique. I first thought about doing something like this:
DateTime dt1970 = new DateTime(1970, 1, 1);
DateTime current = DateTime.Now;
TimeSpan span = current - dt1970;
Result Example:
1350433430523.66
Prints the total milliseconds since 1970, But what happens if I have hundreds thousands of writes per second.
I mainly prefer the non BIGINT Auto-Increment solution because it makes a lot less headache to scale the DB using 3rd party tools as well as less problematic backup/restore functionality because I can transfer data between servers and such if I want.
Another sophisticated approach is to tailor the solution towards my application. In the database, the primary key will also contain the username (unique and can't be changed by the user), so I can combine the numerical value of the name with the millisecond number which will give me a unique numerical string. Because the user doesn't insert data as such a high rate, the numerical ID is guarantee to be unique. I can also remove the last 5 figures and still get a unique ID, because I assume that the user won't insert data at more than 1 per second the most, but I would probably won't do that (what do you think about this idea?)
So I ask for your help. My data is assumes to grow very big, 2TB a year with ten of thousands new rows each second. I want URLs to look as "friendly" as possible, and prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube like GUID's you can see this answer. They are basically keeping a database table of all random video ID's they are generating. When they request a new one, they check the table for any collisions. If they find a collision, they try to generate a new one.
Long Primary Keys
You could use a long (e.g. 275001120966638272) as a primary key, however if you have multiple servers generating unique identifiers you'll have to partition them somehow or introduce a global lock, so each server doesn't generate the same unique identifier.
Twitter Snowflake ID's
One solution to the partitioning problem with long ID's is to use snowflake ID's. This is what Twitter uses to generate it's ID's. All generated ID's are made up of the following parts:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
One extra bit is reserved for future purposes. Since the ID's use timestamp as the first component, they are time sortable (which is very important for query performance).
Base64 Encoded GUID's
You can use ShortGuid which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case sensitive which may not be good for URL's if you are lower-casing them.
Base32 Encoded GUID's
There is also base32 encoding of GUID's, which you can see this answer for. These are slightly longer than ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4) but the advantage is that they can be all lower case.
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors e.g. If Pintrest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made i.e. their next post would be 2. You could also use YouTube's approach with their video ID but specific to a user, this could lead to some ridiculously short URL's.
The first, simplest and practical scenario for unique keys
is the increasing numbering sequence of the write order,
This represent the record number inside one database providing unique numbering on a local scale : this is the -- often met -- application level requirement.
Next, the numerical approach based on a concatenation of time and counters is commonly used to ensure that concurrent transactions in same wagons will have unique ids before writing.
When the system gets highly threaded and distributed, like in highly concurrent situations, do some constraints need to be relaxed, before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's a good practice.
A key reference system can provide independence from the underlying database system.
This provides one more level of integrity for the database when the evoked scenario occurs : backup, restore, scale, migrate and perhaps prove some authenticity.
This article Generating Globally Unique Identifiers for Use with MongoDB
by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight about database and informatics.
UUID are 128 bits length. They introduce an amount of entropy
high enough to ensure a practical uniqueness of labels.
They can be represented by a 32 hex character strings.
Enough to write several thousands of billions of billions
of decimal number.
Here are a few more questions that can occur when considering the overall principle and the analysis:
should primary keys of database
and Unique Resource Location be kept as two different entities ?
does this numbering destruct the sequentiality in the system ?
Does providing a machine host number (h),
followed by a user number (u) and time (t) along a write index (i)
guarantee the PK huti to stay unique ?
Now considering the DB system:
primary keys should be preserved as numerical (be it hexa)
the database system relies on it and this implies performance considerations.
their size should be fixed,
the system must answer rapidly to tell if it's potentially dealing with a PK or not.
Hashids
The hashing technique of Youtube is hashids.
It's a good choice :
the hash are shorts and the length can be controlled,
the alphabet can be customized,
it is reversible (and as such interesting as short reference to the primary keys),
it can use salt.
it's design to hash positive numbers.
However it is a hash and as such the probability exists that a collision happen. They can be detected : unique constraint is violated before they are stored and in such case, should be run again.
Consider the comment to this answer to figure out how much entropy it's possible to get from a shorten sha1+b64 recipe.
To anticipate on the colliding scenario,
calls for the estimation of the future dimension of the database, that is, the potential number of records. Recommended reading : Z.Bloom, How Long Does An ID Need To Be ?
Milliseconds since epoch
Cited from the previous article, which provides most of the answer to the problem at hand with a nice synthetic style
It may not be necessary for you to encode every time since 1970
however. If you are only interested in keeping recent records close to
each other, you only need enough values to ensure that you don’t have
more values with the same prefix than your database can cache at once
What you could do is convert a GUID into only numeric by converting all the letters into numbers in the guid. Here is a example of what that would look like. It's abit long but if that is not a problem this could be one way of going about generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code i used to generate the string above. But i would probably recommend you using a long primary key insteed although it can be abit of a pain it's probably a safer way to do it then the function below.
string generateKey()
{
Guid guid = Guid.NewGuid();
string newKey = "";
foreach(char c in guid.ToString().Replace("-", "").ToCharArray())
{
if(char.IsLetter(c))
{
newKey += (int)c;
}
else
{
newKey += c;
}
}
return newKey;
}
Edit:
I did some testing with only taking the 20 first numbers and out of 5000000 generated keys 4999978 was uniqe. But when using 25 first numbers it is 5000000 out of 5000000. I would recommend you to do some more testing if going with this method.

What are some good practices when returning LARGE amounts of data from an ASP.net web service? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I'm currently working on some web services for a client of ours. Before we make it available to them we'd like to optimize performance of our database calls, as very LARGE amounts of data can potentially be returned. (could be tens of thousands of objects, could be millions, each object containing about 12 lists of other objects)
We don't want to strain our servers, nor do we want to limit the web service unnecessarily.
One of the web service methods returns all data within a specified date range, I was thinking that if the amount of data being returned was larger than a set amount, return a message saying something like:
"Data too large, please reduce date range"
Is limiting the user's scope like that a good idea??
I have to limit the amount of data our client can retrieve in one shot, but still keep at as convenient as possible for them. I mean, they're programmers too so it doesn't have to be too simple, but simple enough to use.
What are some good practices concerning returning large amount of data through a web service??
Thanks!
You might be able to adapt the common technique of paging data displayed in a list or grid. The call to the database specifies the number of records to return, and the page number.
So, for example, if they are displaying 10 records on a page, only 10 records will be returned for display. Records 1 - 10 (or 0 - 9, if you prefer) are returned for page 1, and 11 - 20 for page 2, and so on.
Also often returned is the total number of records available.
This way, the user can continue scrolling through a large number of records, or they can choose to refine their search criteria to yield a smaller resultset.
You could consider this kind of paging or chunking approach for your web service. The web service call supplies the number of records to be sent in each chunk, and the "page" or "chunk" number. The web service returns the requested records, along with the total number of records available.
With this approach, the developer who is consuming the web service remains in control.
The calling code can be placed in a loop so that it continues requesting chunks, if that behavior is desirable. If someone really wants a truckload of records, they can just set the number of records argument to be a very large number (or you could make that an optional parameter, and return everything if it is null, zero, empty).
It really depends on your needs. This seems more of a design question than a coding question to me, but in our systems we have two approaches. I'll share them to give you some ideas to consider.
In the first one, we're the ones providing the data. We allow customers to download transaction data on their accounts, and for some customers this can be a fairly large amount of data. We limit them to X days worth of data, and they are fine with this.
In the second one, we're consuming data from a web service from an established vendor that tracks vehicle location data, and other data of interest to our dispatchers and management. Every truck in our fleet gives regular updates of their geolocation, plus other data (loading/unloading/driver on break, etc)
In these web services, we request a date range, and the service returns a set limit of records (1000 records per call).
In addition to passing in the date range, we pass in a "position" integer field. On the first call the "position" is set to zero.
The web service returns a "More Data Exists" boolean field.
If "More Data Exists"=true, then we call the web service again with the "position" parameter incremented, and repeat until "More Data Exists" = false (It's a simple while loop in our code)
I think the first one is great for either programmers OR end users. The second works fine when dealing with programmers.
That sounds like potentially lot of data for a webservice.
But here's a page from MSDN which talks about setting up the server to send/receive 'large' volumes of data: http://msdn.microsoft.com/en-us/library/aa528822.aspx
You can give them a chunk of data per time like (1000 obj) They have to specified a start index and amount of data they want and you just use that to pull your data.
In SQL there is TAKE and Skip that can do this easy.

Resources