Using UUIDs as Database Identifiers
Introduction
Why would anyone want to use a UUID as a database identifier? It's ugly. It's (pseudo)random. It takes more space than a regular number. And it can kill your database's performance. All of that is true. So is there any practical use case and can it be implemented to not slow down the database? I think so.
Numeric limitations
Traditional numeric identifiers are small and fast, but come with some limitations and trade-offs, especially in the distributed environment:
- Their sequential nature means that they have to be centrally generated - usually by the Database itself in a form of a Sequence or a form of an auto-generated value.
- If we fetch values from a Sequence, we need to either use a library (like Hibernate, which I don't like to use at all) to cache and manage generated Sequence values, or we need to implement this logic for ourselves.
- If we use any kind of Database-side id auto-generation, we need to retrieve the generated value, which also means more logic and can differ from DB to DB.
- The sequential nature of numeric identifiers makes it easy for a potentially hostile party to estimate the next/previous value of an identifier (which can cause an Insecure Direct Object Reference attack) and/or to gain information about the volumes of the data processed by your application.
- They do have a maximum value that can be generated (even if it's insanely high), so systems that process huge amounts of data may potentially reach that max value and will have to reset the identifier or create a composite one.
Of course UUIDs also don't come without their flaws. The most significant one is that they are much less readable than traditional numbers. For me that's a small price to pay for solving issues listed above, but everyone has to make a decision for oneself.
How unique is unique enough?
Can we avoid all of that limitations by just using UUIDs? Yes, but we have to know what we're doing. There are many different strategies and implementations that allow us to generate an UUID. The simplest one in Java is:
UUID generatedUuid = UUID.randomUUID();
But is this UUID a good candidate for a Database identifier? Most certainly not. It is an "UUID v4" type of unique identifier. That means, simply speaking, it tries to be as random and unique as possible. And that is where we can get in trouble, when we will try to use it as a Database Identifier.
Database Identifiers are indexed. Whether it is a Primary Key or a Foreign Key, we will use those Identifiers in SQL Statements to look up and join data. And the very random nature of the standard type UUID v4 makes it horribly bad for indexing. Why?
Database Indexes are made of binary trees, that are sorted data structures. When using numeric identifiers, the new values will always be greater than the ones that are already present in the Database. They are sequential after all. 2 comes after 1 and 3 comes after 2. Using a UUIDv4 pretty much guarantees that the generated values won't be sequential and Database will have to place them in random spots in the Index. That can make things very slow during the Insert/Update process and quickly turn into a bottleneck for an entire system.
Can we generate a UUID value that will be random enough to be used as a Database Identifier, but at the same time will maintain a sequential nature of plain old numeric Identifiers? Let's take a look at one of UUID's Constructors:
/**
* Constructs a new {@code UUID} using the specified data. {@code
* mostSigBits} is used for the most significant 64 bits of the {@code
* UUID} and {@code leastSigBits} becomes the least significant 64 bits of
* the {@code UUID}.
*
* @param mostSigBits
* The most significant bits of the {@code UUID}
*
* @param leastSigBits
* The least significant bits of the {@code UUID}
*/
public UUID(long mostSigBits, long leastSigBits) {
this.mostSigBits = mostSigBits;
this.leastSigBits = leastSigBits;
}
You can create a UUID by passing 2 long values into it. How can we use that to our advantage? UUID is just 16 bytes of binary data - which happens to match 2 longs - each long takes 8 bytes of space. So maybe we could pass some sort of sequential data into the most significant part of the UUID value, to make it sequential/sortable? And maybe we could pass some random data into the least significant part of the UUID value, to make it unique enough for our purposes? That way instead of UUIDv4 we will create a custom type of an UUID value. One that will be a perfect drop-in replacement for numeric Database Identifiers and will solve their issues at the cost of slightly more space taken per row (16 bytes instead of 4):
import java.security.SecureRandom;
import java.util.UUID;
public abstract class IdGenerator {
private static final SecureRandom RANDOM = new SecureRandom();
private static final MonotonicClock CLOCK = new MonotonicClock();
private IdGenerator() {
}
/**
* Custom UUID generator - Prefix COMB: combination of the creation millisecond (prefix) with random bytes.
* Such UUIDs are sortable (by the time of their generation) and can be used for optimized DB PK/FK indexing.
*
* @return UUID generated ID
*/
public static UUID generateId() {
return new UUID(CLOCK.tick(), randomValue());
}
/**
* You can customize this to maximize the uniqueness - for example adding a hashCode of the host's IP address
* or an Instance ID that would be generated during the start of the application.
*/
private static long randomValue() {
return Thread.currentThread().getId() + RANDOM.nextLong();
}
/**
* Clock that returns a unique value for every tick().
* Can generate 10k unique values per 1 millisecond.
*/
private static class MonotonicClock {
private static final long MULTIPLIER = 10_000L;
private long last;
MonotonicClock() {
last = read();
}
synchronized long tick() {
long current = read();
if (last < current) {
last = current;
} else {
last++;
}
return last;
}
private long read() {
return System.currentTimeMillis() * MULTIPLIER;
}
}
}
Storage considerations
OK but how should we store those values on the Database side? From my experiences it is best to use binary data types. Oracle has RAW(16), H2 has Binary etc. Don't use any form of Varchar because it takes more space and will be slower when performing Joins. Storing the ids that way will also to enable you to use Database side functions to generate compatible values - for example for populating test database. You can use functions sys_guid() from Oracle or NEWID() from H2.
Performance and Collisions
UUIDs generated by the code above behave exactly the same as regular numbers in terms of sortability. The only performance penalty can come from their increased size, but modern databases should handle them without a sweat.
As with every pseudo-random generator, there is always some chance to generate a collision. I've never experienced it myself and if you do, you can always bump up the entropy of the randomValue() method or refactor the Monotonic Clock to be more precise/generate more values per 1 millisecond.
Conclusion
I'm using this method of generating Database Identifiers for some time now on a distributed data processing system that handles millions of rows. It's fast, safe and very convenient.
A great read, thanks!
ReplyDelete