'Is it more efficient to create one categorical column, or two exclusive boolean columns?
Would it be more "efficient" (in terms of memory usage, searching, indexing, any or all) to add one new categorical column (color, values BLUE, RED, YELLOW) or 2+ boolean columns (is_blue, values 0/1, is_red, values 0/1, and is_yellow, values 0/1), where exactly one value must be true across the boolean columns?
My use case is MySQL, if it makes a difference. Would love an in-depth answer.
Solution 1:[1]
Wherever a value can be set separately, don't worry about "efficiency", and sacrifice memory for lower processing times, by creating separate columns.
But for values that are mutually exclusive (like in your colors case), you should definitely use one categorical column.
Even if you needed separate Boolean values, you would still use a categorical column (even if you have to invent a surrogate key). Then you'd define a domain table for the categories. The domain table would contain the Boolean flags that correspond to each category.
If you were to put the Boolean values directly into the original table, you'd be violating 3rd normal form. Also you'll have issues with indexing (e.g. ignored due to lack of selectivity) and query execution plans.
Solution 2:[2]
Don't worry about "efficiency". Think about how you're going to use the data.
...
Your possible definitions of "efficient" include "memory usage, searching, indexing, any or all", but probably the most important is to use the data structure that most fits how you work with the data.
Say you're making a table of people who work at the company, so you figure your options are:
employee_type char(10) /* constrained to "individual" or "manager" */
or you could have
is_individual boolean,
is_manager boolean,
As far as memory usage or searching or indexing it probably won't make any difference, and you shouldn't worry about it unless you know there's a problem.
However...
Say a year from now you have to add a new type of employee, "executive". If you went with the original single-field approach, you would just change a constraint:
employee_type char(10) /* "individual" or "manager" or "executive" */
versus having to add another boolean column
is_individual boolean,
is_manager boolean,
is_executive boolean
In this case, the single-field approach seems to be flexible.
On the other hand...
What if you have to add a new type of employee, "consultant"? And what if "consultant" isn't an actual employee type, but an attribute of a person, meaning you can have a consultant individual as well as a consultant manager. In that case, you won't just be able to add "consultant" to the possible values for your single employee_type column.
Maybe you need a combination? Maybe you need
is_individual boolean,
is_manager boolean,
is_consultant boolean
but you have to constrain the data so that every record must have is_individual OR is_manager, and then is_consultant is independent of that.
Don't optimize your code and data prematurely.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Top-Master |
| Solution 2 | Andy Lester |
