UUID/GUID Advantages and Disadvantages
I am a big proponent of GUIDs as a surrogate key.
To misquote David Packard:
Ids are too important to be left to the database.
I find it much more practical to describe the entire model (including Ids) in the code than to expect the database to "suggest" a value for such an essential thing as Id.
The biggest problem with UUIDs (or Microsoft UUID implementation - GUID) is their 128bit size. Similar alternatives with a very high chance of being unique (like MongoDB ObjectId (96bit)) are also larger than 64bit.
Can we have a 64bit key with easiness of GUID generation and (significantly) low collision chance?
No. 128bits are why UUIDs are (almost certainly) unique and can be generated from anywhere, without any coordination. For an interesting math explanation, check Eric Clippert's Socks, birthdays, and hash collisions.
GUIDs are not sequential, and they can even be used for randomization.
entities.OrderBy(x => Guid.NewGuid()).ToList();
Is this a good idea?
Probably not. GUIDs are made to be unique, not for randomization.
Can we prove that we would get a better result with the new Random().Next()
?
We can run a statistical test, like Die Hard tests and get a "quality of random number generator." There are several tests like Die Hard tests. It would be interesting to check how "random" are GUIDs.
Is Random().Next()
the good idea?
For cryptographic random numbers use RandomNumberGenerator
(or the RNGCryptoServiceProvider
in the older .NETs).
If you have GUID as the clustered index, you will have O(log n) insert speed instead of O(1). However, modern databases know how to work with GUIDs, e.g. SQL Server does not sort GUIDs by bits they are stored, but by segment based on the MAC address. It is still O(log n), but it would be much faster in practice.
So, no randomization with GUID?
We could use only randomized bits 11th byte of a 16-byte wide value.
Practical? No. But, still possible.
What Are Little GUIDs Made Of?
GUIDs are made from 4 parts:
- 60 bits - Timestamp
- 48 bits - Computer identifier,
- 14 bits - Uniquifier
- 6 bits - Fixed
For more info about GUID generation and format check GUIDs are globally unique, but substrings of GUIDs aren't
Popular 64bit keys
Another popular approach is to use 64bit keys and to have a central registration authority or coordination between generation nodes. Attempts like this are much more complex and require serious investment in deployment and additional architectural complexity. Many popular services are using similar solutions:
- Twitter Snowflake Id
- Twitter's (original) Snowflake was made from the timestamp, worker number, and sequence number.
- For details check announcing snowflake or snowlake on GitHub
- Not directly connected but still interesting Twitpocalypse - Before Snowlake, Twitter was using 32bit integer for IDs
- Instagram Id
- Instagram is using (or at least was using in 2012) - "41 bits for a time in milliseconds (gives us 41 years of IDs with a custom epoch), 13 bits that represent the logical shard ID, 10 bits that represent an auto-incrementing sequence, modulus 1024."
- For more info about Instagram's problem and details about the implementation, check Sharding & IDs at Instagram
Having a 64bit key instead of 128bit (GUID) has some advantages:
- Faster Indexing
- Predefined integral types in C# (and most modern languages)
- Predefined integral types in SQL Servers (and most modern databases)
Building Blocks
Timestamp
Epoch time is a system describing a point in time using seconds. It is represented as a 32 bit in most scenarios, which gives us about 136 years in total. The minimum date is 1901-12-13, and the maximum is 2038-01-19. The "final point in time" is known as the year 2038 problem or the Epochalypse (but of course).
If we want to keep the 32bits and the seconds as the smallest point, you don't have to limit yourself to the year 2038; choose a new arbitrary time, instead of 00:00:00 UTC on 1 January 1970. Calculations for converting to "human-readable" format should be quick and straightforward.
If you want to use milliseconds, minutes (or maybe even a number that represents 7.6 seconds) instead of seconds will depend on the properties of the system you are building.
Worker Number / Shard Id / Device Id
If we want to skip some coordination between workers (id creators), we can add an Id for each worker node for distributed systems. 8bits will give you 256 nodes. You can skip this part if you are sure that the system will not be distributed.
Autoincrement or Random
If we can spare bits, we can add a random number to the Id. If we use enough bits for randomness and have a small enough timestamp for our use case, we can make a system with a very low chance of Id collision. With randomness, instead of auto-increment, we don't have to worry about previously generated Ids; we will just run the same function and get the id. If we decide to use autoincrement, we will have to coordinate Ids between every function call. It would not be as complex as Id coordination in a distributed system; we could use the Singleton pattern.
Multitentancy & Other Data Grouping
In the id for multitenant solutions, we can also embed tenant id. If we put the tenant id as the first parameter, we can rely heavily on sorted sets and clustered indexing. Also, this makes Ids humanly identifiable - e.g., the first four digits represent Tenant Id.
Can we use Composite Indexing instead of inserting Tenant Id in the primary Id?
Of course, and you should, this is just another approach.
Event Sourcing and State Ids
For the most straightforward solutions, where we always store the entire state, we could create an Id consisting of IdEntity and the second IdState. IdState part should be sequential or timestamp. With this approach, we would have a humanly identifiable Id and sortable by IdEntity and then by date of creation. For the systems with multiple types of entities in the same storage, we can add a third part - a type of entity.
For "true event sourcing" systems that do not store states but event data, we can add yet another part to the id - type of command (probably at the end).
Working with Binary, Decimal, and Hexadecimal Encoding
You can do many clever tricks with bitwise and shift operators. Still, you will lose (human) readability from the decimal-based base system, e.g., the first three numbers are customer Id, the following two numbers represent a type of entity, etc. (72 is easier to understand than 1001000)
For 128bit keys, you don't have many options; there is no support for general variables in this size. You can always convert exactly 128bits to GUID and use UUID (e.g., unique identifier in SQL Server) as store type - even if you are not generating the Ids with the GUID generation algorithm. But keep in mind that software that uses GUIDs is optimized to work with GUID-generated data, not any 128bits (check above, for example).
Conclusions - Is it worth creating Custom Format Id?
Custom format Ids are not meant to be a replacement for auto-increment integers and GUIDs. Auto-increment integers are small, fast, and sorted, and GUIDs are easy to create and use, even in complex distributed systems.
Custom format Ids are an interesting alternative for:
- Humanly-readable Ids - but more optimized than using string
- Optimized large distributed systems - that would like to use 64bits (or less) per key
- Ids for URLs - when using a descriptive string in URL is not practical or would put too much burden on the system
- Aggregate root Ids - you can identify the type of entity from the id
- etc.
It's just another tool for your toolbox. Use it when it makes sense.