Does a surrogate (INT) key almost always yield better performance than a unique natural (VARCHAR) key (in MySQL)?

I am struggling to understand what datatype to use for a MySQL database table.

Let's say we are a book publishing company and we need to create a database of all our books and authors in MySQL. We have around 500,000 books. Each book has a unique ISBN (for example, 978-3-16-148410-0).

So we have two options here to store our books:

  1. Create an id VARCHAR(24) NOT NULL natural primary key column and store our ISBNs there, or
  2. Create a surrogate id INT NOT NULL AUTO_INCREMENT primary key, plus a separate isbn VARCHAR(24) column with a UNIQUE constraint
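
For concreteness, the two options could be sketched as the following table definitions. This is only an illustration: the table names books_natural and books_surrogate, the title column, and the constraint name uq_books_isbn are assumptions, not part of the original schema.

```sql
-- Option 1: the ISBN itself is the primary key (natural key).
CREATE TABLE books_natural (
  id    VARCHAR(24)  NOT NULL,   -- holds the ISBN, e.g. '978-3-16-148410-0'
  title VARCHAR(255) NOT NULL,   -- hypothetical extra column
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Option 2: surrogate integer key plus a unique constraint on the ISBN.
CREATE TABLE books_surrogate (
  id    INT          NOT NULL AUTO_INCREMENT,
  isbn  VARCHAR(24)  NOT NULL,
  title VARCHAR(255) NOT NULL,   -- hypothetical extra column
  PRIMARY KEY (id),
  UNIQUE KEY uq_books_isbn (isbn)
) ENGINE=InnoDB;
```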

From what I understand, the general consensus is not to use VARCHAR(n) as a primary key, because it takes more storage and slows down lookups and joins, and generally that makes sense to me.

However, if all our operations target the ISBN (SELECT * FROM books WHERE isbn = ?, UPDATE, DELETE, etc.), why not use the VARCHAR(24) as the primary key?

What I don't understand is this: if you have an immutable natural key (like a book's ISBN) and 95% of all database operations use that field anyway, shouldn't a VARCHAR(24) primary key always outperform a surrogate key design?

I feel like having a surrogate AUTO_INCREMENT INT key here is totally meaningless. It does not give any benefits.

Or am I missing something fundamental here when it comes to choosing primary keys?



Solution 1:[1]

I would use the ISBN as the primary key.

Primary key lookups in MySQL's default storage engine InnoDB are more efficient than lookups by secondary index.
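
The reason is that InnoDB stores the row data in the primary key index itself, while a secondary index entry only carries the primary key value, so a lookup through a secondary index has to make a second trip into the primary key index. A rough way to compare the two access paths is sketched below. The table books_lookup_demo, its columns, and the literal values are hypothetical, and the EXPLAIN output will depend on your data and MySQL version.

```sql
-- Hypothetical surrogate-key table, for illustration only.
CREATE TABLE books_lookup_demo (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
  isbn  VARCHAR(24)  NOT NULL,
  title VARCHAR(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_isbn (isbn)
) ENGINE=InnoDB;

-- Lookup by primary key: a single descent of the clustered index,
-- whose leaf pages hold the full row.
EXPLAIN SELECT * FROM books_lookup_demo WHERE id = 42;

-- Lookup by ISBN: first descend the uq_isbn index to find the id,
-- then descend the clustered index to fetch the rest of the row.
EXPLAIN SELECT * FROM books_lookup_demo WHERE isbn = '978-3-16-148410-0';
```

If the ISBN were the primary key, the second query would take the shorter path of the first.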

It's true an integer takes less storage space than a 24-character varchar, but in your case, I assume you have to store the ISBN anyway. If you could use an integer instead of the ISBN, that would save storage.

The comment above that natural keys tend to violate uniqueness is a good warning in general. The violations usually come from the marketing department. ;-)

But for a given dataset, you can be sure that the natural key is free of duplicates. If you do experience an error reading the ISBN in your library collection, the librarian will have to resolve that manually. But I don't expect that to happen very often for 500,000 books.

Tip: Define the varchar with a binary collation, and it'll be a bit faster to do string comparisons. For example:

CREATE TABLE Books (
 isbn varchar(24) COLLATE utf8mb4_bin,
 -- ...other columns...
 PRIMARY KEY (isbn)
) DEFAULT CHARSET=utf8mb4;

Solution 2:[2]

Surrogate performance (General discussion)

  • "almost always"? No.

  • AUTO_INCREMENT that is not even used for Joining -- why bother having it if you have a "natural PK"; it takes space without providing any benefit.

  • UUID/GUID -- Frequently worse. This is due to lack of "locality of reference".

  • Many-to-many mapping table -- Always worse. Best: The natural PK is the pair of ids. A secondary key is the pair, but in opposite order.

  • Space: Since the PK is silently included in every secondary index, the bigger the PK, the bulkier the secondary index is. If you have exactly one secondary index, the size is a tossup. With more than two secondaries, space is being chewed up.

  • Range scan on the PK. If you need to use BETWEEN (etc) it is often beneficial to have that range key be the PK. If, instead, the range scan is going through a secondary index (very efficiently), but then having to reach into the data's BTree, this is a lot of extra work.

  • "Index merge". Some DB vendors perform AND or OR by gathering the "row identifiers", then and'ing or or'ing the lists. Then looking up the actual rows. The way InnoDB is structured, this is almost never worth doing.

  • Locality of reference. An auto-inc value lays out the data in roughly chronological order. This may be a benefit. Or it may not. Take sensor data or stock quotes. The main queries look up multiple rows based on sensor_id or ticker or author or user, not datetime, yet the data arrives in time-order. It is actually better (overall) to have PRIMARY KEY(ticker, datetime). (Or if there can be dups, then PRIMARY KEY(sensor_id, datetime, id), INDEX(id).) I saw one system double in throughput when they rearranged the PK of their biggest table in that way.

  • I have created hundreds (thousands?) of tables. Scanning through their PKs, I find that only 1/3 use an AUTO_INCREMENT surrogate PK.
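
As an illustration of the many-to-many point in the list above, a mapping table with a natural composite PK might look like the sketch below. The table name book_author, the column names, and the index name are assumptions made for this example.

```sql
-- Hypothetical many-to-many table linking books to authors.
-- No AUTO_INCREMENT id: the pair of ids is the primary key.
CREATE TABLE book_author (
  book_id   INT UNSIGNED NOT NULL,
  author_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (book_id, author_id),            -- the pair of ids
  INDEX idx_author_book (author_id, book_id)   -- same pair, opposite order
) ENGINE=InnoDB;
```

With this layout, "all authors of a book" is answered from the primary key and "all books by an author" from the secondary index, without a surrogate id taking part in either.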

Bottom line: Since I ponder your question every time I type CREATE TABLE, I would say that a "natural" key is better 2/3 of the time.
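
The locality-of-reference case can be sketched in the same way. The table names quotes and sensor_readings, the price and reading columns, and the sample values are hypothetical. Only the key layouts PRIMARY KEY(ticker, datetime) and PRIMARY KEY(sensor_id, datetime, id), INDEX(id) come from the discussion above.

```sql
-- Rows for one ticker are stored next to each other, ordered by time.
CREATE TABLE quotes (
  ticker     VARCHAR(10) CHARACTER SET ascii NOT NULL,
  `datetime` DATETIME      NOT NULL,
  price      DECIMAL(12,4) NOT NULL,   -- hypothetical payload column
  PRIMARY KEY (ticker, `datetime`)
) ENGINE=InnoDB;

-- Variant for data where (sensor_id, datetime) can repeat: the
-- AUTO_INCREMENT id makes the PK unique, and INDEX(id) satisfies
-- InnoDB's rule that an auto-increment column must lead some index.
CREATE TABLE sensor_readings (
  sensor_id  INT UNSIGNED    NOT NULL,
  `datetime` DATETIME        NOT NULL,
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  reading    DOUBLE          NOT NULL,  -- hypothetical payload column
  PRIMARY KEY (sensor_id, `datetime`, id),
  INDEX (id)
) ENGINE=InnoDB;

-- A typical query then reads one contiguous slice of the clustered index.
SELECT `datetime`, price
FROM quotes
WHERE ticker = 'ABC'
  AND `datetime` >= '2024-01-01'
  AND `datetime` <  '2024-02-01';
```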

Looking at ISBN

On to your specific case, ISBN. Look at your tables. You have one that 'wants' to have ISBN as its PK, correct? How often do you JOIN to that table? How many secondary indexes are there on that table? You probably never do a range query like WHERE ISBN > "...", correct?
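
One possible way to answer the secondary-index question, and to see how much space each index takes, is sketched below. It assumes the table is named books, lives in the current schema, uses InnoDB persistent statistics, and that your account is allowed to read the mysql schema. The sizes are estimates, not exact figures.

```sql
-- List every index on the (assumed) books table.
SHOW INDEX FROM books;

-- Approximate on-disk size of each index: the 'size' statistic is a
-- page count, so multiply by the InnoDB page size to get bytes.
SELECT index_name,
       stat_value * @@innodb_page_size AS approx_bytes
FROM mysql.innodb_index_stats
WHERE database_name = DATABASE()
  AND table_name = 'books'
  AND stat_name = 'size';
```

Comparing these numbers for a VARCHAR(24) PK against an INT PK on a copy of the table would show how much the wider PK inflates each secondary index.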

If you have id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, you will probably also have UNIQUE(isbn). But not vice versa.

ISBN should probably be VARCHAR(24) CHARACTER SET ascii COLLATE ascii_bin. This helps (a little) with both speed and space.
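
A minimal sketch of that column definition in context, assuming a hypothetical table books_ascii_demo with a title column that keeps the utf8mb4 table default:

```sql
CREATE TABLE books_ascii_demo (
  -- ascii needs 1 byte per character and ascii_bin compares raw bytes,
  -- so no case or accent folding is done on the key.
  isbn  VARCHAR(24) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  title VARCHAR(255) NOT NULL,   -- hypothetical column, uses the table default charset
  PRIMARY KEY (isbn)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```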

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Bill Karwin
Solution 2 Rick James