Information Architecture Blog

Data Normalization Is Not Just About Relational Models or Databases

Data normalization is about normalizing logic. Unfortunately, data normalization is often incorrectly considered to be a process applied only to physical models or physical databases, specifically relational models and databases. There are just as many benefits, if not more, to normalizing semantic models (conceptual data models) and logical data models as there are to normalizing physical data models or databases. The greatest benefits are realized when ontology or other semantic models and logical data models are normalized to at least 4th normal form (4NF). Some data architects even prefer 5th normal form (5NF).
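To make the 4NF idea concrete, here is a minimal sketch (the Employee, Skill, and Language facts are hypothetical, not taken from any particular model) of why bundling two independent multi-valued facts into one structure violates 4NF, and how decomposing them removes the redundancy without losing information:

```python
# Un-normalized: every (skill, language) combination must be stored, even though
# the two facts are independent of each other (a multivalued dependency).
employee_skill_language = {
    ("e1", "modeling", "English"),
    ("e1", "modeling", "French"),
    ("e1", "sql",      "English"),
    ("e1", "sql",      "French"),
}

# 4NF decomposition: each independent fact lives in its own relation.
employee_skill    = {(e, s) for (e, s, _l) in employee_skill_language}
employee_language = {(e, l) for (e, _s, l) in employee_skill_language}

# The original relation is recoverable as a join, so no information is lost.
rejoined = {(e, s, l)
            for (e, s) in employee_skill
            for (e2, l) in employee_language if e == e2}
assert rejoined == employee_skill_language
```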

Ironically, in this day and age almost all physical models and databases are DE-normalized, typically for performance reasons. At least this has been true at most of the companies where I have worked. This is especially true for specialized databases such as data warehouses and data marts. Therefore, to say that normalization only applies to relational physical models or databases is being very shortsighted and ignorant (meaning lacking awareness or knowledge).

Who Invented Data Normalization?

I have read articles that incorrectly claim normalization was invented by the father of relational modeling. Codd may have been the first one to introduce the concept of using mathematical systems to construct more sound relational models, and he most likely coined the phrase "database normalization". And, without a doubt, Codd's work was important and noteworthy. The same is true of the work done in later years by Boyce and others that added to the normal forms. But they didn't really invent normalization. Behind the principles of normalization are mathematical processes, such as first-order logic (also called first-order predicate calculus) and set theory, that existed LONG before computers, databases, and data modeling formalisms and tools were ever conceived, let alone in existence.
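For readers who want to see what that first-order logic and set theory look like in practice, here is a minimal sketch (the order/customer column names are hypothetical) of the definition underpinning most normal forms: a functional dependency X -> Y holds only if any two tuples that agree on X also agree on Y.

```python
def fd_holds(rows, x_cols, y_cols):
    """Return True if the functional dependency x_cols -> y_cols holds in rows."""
    seen = {}
    for row in rows:
        x = tuple(row[c] for c in x_cols)
        y = tuple(row[c] for c in y_cols)
        if x in seen and seen[x] != y:
            return False          # two tuples agree on X but differ on Y
        seen[x] = y
    return True

orders = [
    {"order_id": 1, "customer_id": "c1", "customer_city": "Oslo"},
    {"order_id": 2, "customer_id": "c1", "customer_city": "Bergen"},  # conflict
]

# customer_id -> customer_city fails here, signalling a dependency problem
# that a 3NF decomposition would expose and remove.
print(fd_holds(orders, ["customer_id"], ["customer_city"]))  # False
```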

Do All Models Representing the Conceptual and Logical Data Architecture Need To Be Normalized?

Normalization needs to be applied to all models representing Conceptual and Logical Data Architectures. This applies to all data models, not just logical data models depicted using E/R, ERD, and UML formalisms and notations. It also applies to models representing Conceptual Data Architectures, such as Terms and Facts Models, Semantic Models, and Ontology Models defined and managed in OWL or OWL2 (Semantic Technology). Yep, I said it - ontology models created and managed using OWL or OWL2 (and other types of RDF languages) need to be normalized. 

I completely understand that normalization is applied differently for ontologies or semantic models built using Semantic Technology (such as OWL and OWL2 models) than it is for data models built using ER, ERD, and UML notations. I also am fully aware of, and over the moon about, the fact that most Semantic Technology modeling tools integrate business rules with the ontologies they manage. Also, I am aware of the fact that most of the tools immediately notify modelers about rule inconsistencies or violations they attempt to enter in their models. In some cases, the tools flat out disallow inconsistencies or rule violations regardless of how far apart the rules are depicted in the models - even if the inconsistent rules or violations of rules are 300 relationships away from each other. (The number 300 is just an arbitrary number I chose, not the maximum allowed.) However, regardless of the advantages of the ontology-defining languages, the powerful tools that manage them, and the fact those tools essentially slap a modeler's hands if they violate modeling rules, normalization processes still need to be applied.
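As an illustration of the kind of rule checking I am describing, here is a minimal sketch using the owlready2 Python library (the Person/Organization classes, the "acme" individual, and the bundled reasoner are assumptions for the example, not any particular vendor tool): the reasoner flags an individual that violates a disjointness rule, no matter where in the ontology the conflicting assertions live.

```python
# A minimal sketch, assuming the owlready2 library and its bundled HermiT reasoner
# (which requires Java). The classes and the individual are hypothetical.
from owlready2 import (get_ontology, Thing, AllDisjoint, sync_reasoner,
                       OwlReadyInconsistentOntologyError)

onto = get_ontology("http://example.org/demo.owl")

with onto:
    class Person(Thing): pass
    class Organization(Thing): pass
    AllDisjoint([Person, Organization])   # business rule: the two classes are disjoint

    acme = Person("acme")
    acme.is_a.append(Organization)        # this assertion violates the disjointness rule

try:
    sync_reasoner()                       # reasons over the whole ontology at once
    print("Ontology is consistent")
except OwlReadyInconsistentOntologyError:
    print("Rule violation detected: 'acme' cannot be both a Person and an Organization")
```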

Biggest Pet Peeves Related to Data Normalization?

While I'm on the topic of normalization, I'd like to state a couple of my biggest pet peeves related to normalization in today's world. (1) Vendor-produced models that are represented as being "normalized" to at least 3NF, yet they lack definitions, real business rules, relationship labels, specific relationships, and often cardinality - which means they CANNOT be normalized. (2) Data Architects (DAs) and Database Architects/Analysts (DBAs) who completely misunderstand what it means for a model to be normalized.

It is absolutely astounding to me how many models are misrepresented as being normalized when in fact they are not normalized to a minimum of 3rd normal form - let alone 4th or 5th normal form. There are hundreds of books and articles written on the topic of data normalization, so I'm baffled at the number of models misrepresented as being fully normalized.

I cannot even guess the number of times over the course of my career I have had to struggle to keep a straight face when someone handed me a "normalized" logical data model - they swore it was normalized to 3rd normal form - and one or more of the entities or classes in that model contained over 200 attributes! One model even had an entity with over 400 attributes! 

Early in my career I created and managed physical data models. I quickly moved on to logical data models and then conceptual data models. I have created hundreds of data models of varying types and at all levels - ontology/semantic/conceptual and logical data models. Also, I have taught modeling classes for almost as long as I have been modeling. This is in addition to the fact that I constantly study and read works written by information management and modeling masters. Therefore, I'm sure what I'm about to say will be no surprise to most seasoned and/or proficient data modelers. In each example I mentioned earlier ("normalized" models with 200-plus or 400-plus attributes), I didn't need to read the entity, attribute, or relationship definitions (most of which were missing); I didn't need to know any of the business rules associated with the model (most of which were nonexistent except for a few rules somewhat loosely represented in cardinality); nor did I even need to actually LOOK at the model to know that there was not a chance in heck that those models were even remotely close to being in 3NF, let alone 4NF or 5NF!

In my entire career, I have never managed or reviewed a properly normalized logical data model that contained a single entity consisting of more than 10 or 15 attributes, give or take a few. This number of attributes for a single entity or class is inclusive of the identifier(s) or key(s) that are required by some data modeling notations. In fact, in most properly normalized models, the majority of the entities (or classes) will consist of approximately 5 to 10 attributes. This is especially true in models normalized to 4NF or 5NF.

The main problem causing the logical data models to fail at being normalized? In most cases, the attempt to normalize was based purely on attribute or even physical column names. It was clear the modelers did not take into account any definitions, business rules, explicit and specific relationships, actual data instances or usage, or any other factors that need to be considered during the definition and normalization process. Additionally, many of the models were based on specific technology or implementation choices.

Each time an attribute was duplicated within the same model, it usually had a different name and appeared in a different entity, but it was the same attribute. It is important to understand I am not referring to one or more attributes that are role-named for correct reasons, such as an identifying attribute of a super-type entity that is role-named as an identifying attribute of each of the sub-type entities, or an attribute that is involved in multiple relationships between or among the same entities. Nor am I referring to attributes that migrate as foreign keys in model notations that display foreign keys. I am referring to attributes that, for a variety of reasons, incorrectly have multiple names, usually in multiple entities.
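Here is a minimal sketch (the cust_no/customer_number attributes and the sample values are hypothetical) of why actual data instances matter when hunting for these duplicates: two attributes with different names that hold the same values are a strong candidate for being the same attribute modeled twice, which the definitions and business rules should then confirm or rule out.

```python
# Hypothetical entities with differently named attributes that may be duplicates.
customer = [
    {"cust_no": "C-100", "cust_name": "Acme"},
    {"cust_no": "C-200", "cust_name": "Globex"},
]
sales_order = [
    {"order_id": 1, "customer_number": "C-100"},
    {"order_id": 2, "customer_number": "C-200"},
]

cust_no_values         = {row["cust_no"] for row in customer}
customer_number_values = {row["customer_number"] for row in sales_order}

# Matching value sets do not prove the attributes are the same (it could be a
# legitimate foreign key), but they flag a candidate that name-only analysis misses.
if customer_number_values <= cust_no_values:
    print("'customer_number' may be the same attribute as 'cust_no' - check definitions")
```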

In Summary .....

Data normalization is a set of mathematical processes applied to logic; data normalization is about normalizing logic, not just relational models or physical databases. When managing Information Architectures, especially Enterprise Information Architectures, most benefits of data normalization are realized when applying it to Semantic Models and Logical Data Models. This is especially true in today's world, when most physical models are de-normalized. Also, the normalization process needs to take into account many factors, not just the names of the attributes.