Django and sub-classing (#54 - #56) - Data modelling for developers

When modelling data, we often need to distinguish between different types of objects. For example, a mapping application may need to store information about both roads and lakes, or a library catalogue may need to store data about both books and DVDs. With these examples, the objects can be quite diverse and the details stored about each may be very different.

Should the information about each be stored in different tables or the same table? A very simple and non-specific table structure can be used to store information about absolutely anything. For example, you might have a table called “OBJECT” with the columns “ATTRIBUTE1”, “ATTRIBUTE2”, ATTRIBUTE3, etc.. This sort of table can be used to describe anything at all, but then the use of each column needs to be specified and documented in fine detail or it becomes very difficult to get any benefit from the data stored. The more specific a table is, the more useful its data can be.

A typical process in data modelling is to categorise objects into classes where the information about each class or type of object is stored separately from that of other objects that are different in significant ways. Of course, the first requirement is to decide what is meant by “significant”. Is a road significantly different from a lake? Is a book significantly different from a DVD? The answer will normally depend on the usage of the data, particularly for searching or editing. For a mapping application that provides route planning, for example, objects (like a road) which can be included in a route are significantly different from objects which cannot be included in a route (like a lake).

In such a situation, it would be normal for roads and lakes to be treated as different classes of object, one of which requires route information while the other does not. Such different classes would commonly be stored in different tables and modelled as different classes in Django.

Sub-classes

What about the differences between a two-way street and a one-way street? Are those differences significant enough to warrant a different table? Imagine we have a table STREET that stores the name of a street, the type of street, the geographic path it follows and the authority that is responsible for its maintenance. Would we store both two-way streets and one-way streets in this table or should we have a TWO_WAY_STREET table and a ONE_WAY_STREET table? Here is the nub of sub-classing and one of the fundamental difficulties of data modelling.

If we have a different table for each possible type of street, we could end up with tens or hundreds of different tables for various paving materials, the varied importance of different roads, loading limits, height limits and many other things. Finding a route could then involve querying each of these tables or a constantly-changing subset of them based on particular search criteria.

Alternatively, if we store all the information in one table, we may end up with many columns that are only relevant to one specific type of road or maybe just a limited set of road types. Thus, a tollway may require toll information which will not be useful for a freeway. A seasonal road may require information about when it is closed or open, but a major highway would rarely need this information.

Which should we choose? Many very specific tables, or few tables with many columns that are irrelevant to most of the objects stored in the table?

This brings sub-classing sharply into focus.

Simply put, fundamentally different classes are stored in different tables, while similar classes can be grouped into one table. Between these two extremes is the area where classes are similar, but different enough to require different storage.

Once again, this is most easily shown by example.

Lakes and roads are fundamentally different. On some levels, freeways and tollways are fundamentally different, but in other ways they are very much the same. If one were to look at the ideal table for each, the only difference would be the information that relates to tolls and any limitations that may arise from a road being a tollroad.

Let’s consider three options. We could:

Use a single table (eg. ROADS) with extra columns for toll information which are left blank for freeways.
Use one table for tollways (eg. TOLLWAYS) and one for freeways (eg. FREEWAYS) with the appropriate columns for each.
Use one table for both tollways and freeways (eg. ROADS) containing the columns that are shared between the two types of roads and another table (TOLLWAYS) that stores toll information, with rows inserted only for tollways.

In Django, option 1 is the only one to use if tollways and freeways are to belong to the same Python class. However, Django will allow all three options to be used while still having the different types of roads belonging to different Python classes.

Analysing the data we are modelling to define the classes and/or sub-classes required is one of the essential skills in data modelling – whether we use Django or not.

Road example

Continuing with this road example, freeways and tollroads, we will now look at the options available in Django and restate the three options considered above. We could:

Use a single table (eg. ROADS) with extra columns for toll information which are left blank for freeways.
Use one table for tollways (eg. TOLLWAYS) and one for freeways (eg. FREEWAYS) with the appropriate columns for each. Our basic road columns will appear in both tables.
Use one table for both tollways and freeways (eg. ROADS) containing the columns that are shared between the two types of roads and another table (TOLLWAYS) that stores toll information, with rows inserted only for tollways.

All the Python code included below needs to be put into the models.py file included in the Django project.

Let’s start with a base class which includes all of the attributes needed for all roads:

# A base class for the shared road attributes
class RoadBase(models.Model):
    Name = models.CharField(max_length=250)
    RoadID = models.CharField(max_length=20)
    Type = models.CharField(max_length=20)
    Maintainer = models.CharField(max_length=250)
    Length = models.FloatField()
    SpeedLimit = models.FloatField()

    class Meta:
        abstract = True

Option 1

Now Option 1 (Case A) can be achieved in Django by using the following Python code:

# Case A - One table containing all the attributes
class FreewayA(RoadBase):

    # Freeway attributes
    UsageRestrictions = models.CharField(max_length=20, blank=True)

    class Meta:
        abstract = True


class TollRoadA(FreewayA):

    # Tollroad attributes
    IsTollroad = models.BooleanField(default=False)
    HasTollBooth = models.BooleanField(default=False)
    eTagType = models.CharField(max_length=250, blank=True)
    Cost = models.FloatField(blank=True)

    class Meta:
        abstract = True

# Single table with all of the columns
class RoadA(TollRoadA):

    class Meta:
        db_table = "ROADA"

This code defines one table called “ROADA”. Using the makemigrations command with your manage.py file, the necessary database modifications can be determined. Using the sqlmigrate command will then display the SQL necessary to make these modifications. In our case, the SQL generated for a SQLite database from adding these classes is (formatted so that it is easier to read):

BEGIN;

--
-- Create model RoadA
--
CREATE TABLE "ROADA" (
  "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
  "Name" varchar(250) NOT NULL,
  "RoadID" varchar(20) NOT NULL,
  "Type" varchar(20) NOT NULL,
  "Maintainer" varchar(250) NOT NULL,
  "Length" real NOT NULL,
  "SpeedLimit" real NOT NULL,
  "UsageRestrictions" varchar(20) NOT NULL,
  "IsTollroad" bool NOT NULL,
  "HasTollBooth" bool NOT NULL,
  "eTagType" varchar(250) NOT NULL,
  "Cost" real NOT NULL);

COMMIT;

One table has been created containing all of the attributes for roads, freeways and tollroads. Note that our freeway-specific and tollroad-specific attributes are still marked “NOT NULL”, but these will be stored as empty strings in SQLite. To store true NULL values in the database, we would need to include “null=True” in our attribute definitions, but this can allow confusion to arise between empty strings and NULL. If we are using an Oracle database, then empty strings are stored as NULL anyway.

Option 2

Our second option we will call Case B, and our goal is to have two tables, one for freeways and one for tollways, with each containing the common columns as well as the columns specific to each type of road. Our Python code is:

# Case B: Two tables with necessary columns in each
class FreewayB(RoadBase):

    # Freeway attributes
    UsageRestrictions = models.CharField(max_length=20)

    class Meta:
        db_table = "FREEWAYB"


class TollRoadB(RoadBase):

    # Tollroad attributes
    IsTollroad = models.BooleanField()
    HasTollBooth = models.BooleanField()
    eTagType = models.CharField(max_length=250)
    Cost = models.FloatField()

    class Meta:
        db_table = "TOLLROADB"

In this example, our freeway and tollroad classes each inherit from our basic road class with all of its attributes and then add some of their own. The end result is two leaf classes, each stored in a table.

After using makemigrations, the sqlmigrate command shows the following SQL:

BEGIN;

--
-- Create model FreewayB
--
CREATE TABLE "FREEWAYB" (
  "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
  "Name" varchar(250) NOT NULL,
  "RoadID" varchar(20) NOT NULL,
  "Type" varchar(20) NOT NULL,
  "Maintainer" varchar(250) NOT NULL,
  "Length" real NOT NULL,
  "SpeedLimit" real NOT NULL,
  "UsageRestrictions" varchar(20) NOT NULL);

--
-- Create model TollRoadB
--
CREATE TABLE "TOLLROADB" (
  "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
  "Name" varchar(250) NOT NULL,
  "RoadID" varchar(20) NOT NULL,
  "Type" varchar(20) NOT NULL,
  "Maintainer" varchar(250) NOT NULL,
  "Length" real NOT NULL,
  "SpeedLimit" real NOT NULL,
  "IsTollroad" bool NOT NULL,
  "HasTollBooth" bool NOT NULL,
  "eTagType" varchar(250) NOT NULL,
  "Cost" real NOT NULL);

COMMIT;

Two tables will be created, with the first seven columns in each being the same. This reflects our Option 2.

Option 3

Our third option (Case C) aims to store the shared road attributes in one table, while another table stores the attributes specific to freeways and a third table stores the attributes specific to tollroads. Our simple Python is:

# Case C: Two tables with necessary columns in each
class RoadC(RoadBase):

    class Meta:
        db_table = "ROADC"


class FreewayC(RoadC):

    # Freeway attributes
    UsageRestrictions = models.CharField(max_length=20)

    class Meta:
        db_table = "FREEWAYC"


class TollRoadC(RoadC):

    # Tollroad attributes
    IsTollroad = models.BooleanField()
    HasTollBooth = models.BooleanField()
    eTagType = models.CharField(max_length=250)
    Cost = models.FloatField()

    class Meta:
        db_table = "TOLLROADC"

Again we have two leaf classes, but this time each of the classes is stored in two tables, one being the shared ROADC table. Each freeway will have a row in the ROADC table and a row in the FREEWAYC table. Likewise, each tollroad will have a row stored in the ROADC table and another row in the TOLLROADC table.

After using makemigrations, the sqlmigrate command shows the following SQL for creating the three tables:

BEGIN;

--
-- Create model RoadC
--
CREATE TABLE "ROADC" (
  "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
  "Name" varchar(250) NOT NULL,
  "RoadID" varchar(20) NOT NULL,
  "Type" varchar(20) NOT NULL,
  "Maintainer" varchar(250) NOT NULL,
  "Length" real NOT NULL,
  "SpeedLimit" real NOT NULL);

--
-- Create model FreewayC
--
CREATE TABLE "FREEWAYC" (
  "roadc_ptr_id" integer NOT NULL PRIMARY KEY REFERENCES "ROADC" ("id"),
  "UsageRestrictions" varchar(20) NOT NULL);

--
-- Create model TollRoadC
--
CREATE TABLE "TOLLROADC" (
  "roadc_ptr_id" integer NOT NULL PRIMARY KEY REFERENCES "ROADC" ("id"),
  "IsTollroad" bool NOT NULL,
  "HasTollBooth" bool NOT NULL,
  "eTagType" varchar(250) NOT NULL,
  "Cost" real NOT NULL);

COMMIT;

The ROADC table contains the six shared columns (plus the unique identifier column) as we would expect. One extra column is then added to each of the FREEWAYC and TOLLROADC tables to connect the rows in these tables to the rows in the ROADC table. This extra reference column contains the key of the relevant row in the ROADC table, which ensures that there can be no rows in these tables that are not linked to rows in the ROADC table.

Notes

If you are set up to use GeoDjango, you could also store the geographic path of the roads in the RoadBase class using:

Path = models.LineStringField()

However, this article will not include this since it complicates the example and the SQL generated.

Reasons for choosing each sub-classing method

Continuing to look at sub-classing in Django and why people might use the different storage results which the different sub-classing methods produce, I have tried to list advantages and disadvantages for each option. Often, though, what one person sees as an advantage, another sees as a curse. It is not possible to include everybody’s personal preferences, so I have tried to present a list which will give a general flavour of the pros and cons of these options. It is important to note that this comparison does not include any performance issues – we plan to look at those later.

In this summary, I refer to analysing the data a few times. For road data, the sort of analysis referred to could be:

summing the length of the roads,
finding the average speed limit,
counting the number of roads maintained by each maintainer,
finding the number of roads with each different speed limit,
or any other simple or more complex analysis.

Our modelling will have an impact on how easy or difficult this analysis is.

Option 1 – A single ROADS table

In this option, we used a single table called “ROADSA” so that all of the data about roads is collected into one table, independent of the type of road. This makes querying and fetching simpler. Statistical comparisons (counts, totals, averages, etc.) are also easier to make in one table rather than between tables.

The main disadvantage is that each type of road contains information that relates to the other type of road, and this makes the querying of individual classes a little more complex. In our example, we had a tollroad column IsTollRoad which is only used to distinguish between whether a road is a tollroad or not. It exists only because all types of roads are being kept in the same table. There are other ways of achieving the same thing, such as a class name column.

Advantages

All the data is in one table.
Easy to combine all types of road for analysis.
Reporting of non-type-specific information is easy.
Use of the data by external users is easiest.

Disadvantages

Harder to isolate different types of road for selection or reporting.
Each type of road stores attributes that are only relevant to other types of roads. (This can give us lots of nullable attributes – generally bad as they may then not be filled out for the road types where they are needed.)
Extra storage space is used for unused attributes in each row.

None of these advantages or disadvantages are likely to be very significant, although anything that makes it easy for bad data to be stored is dangerous.

By this stage you may be thinking either that it is obvious that it doesn’t really matter which option we pick, or that Option 1 is clearly the best/worst. However, an old proverb in the Bible says, “The one who states his case first seems right, until the other comes and examines him” (Proverbs 18:17), so let’s consider the other two options also.

Option 2 – A table for each type of road

In this option we created a ‘FREEWAYB’ table and a ‘TOLLROADB’ table, each of which contained both shared road attributes and some attributes specific to the type of road.

Note that the IsTollroad attribute was included in the tollroad class even though it will always be true.

Advantages

Different types of road store only attributes relevant to that type of road.
Each table contains only the roads of a given type.
Attributes shared with other types of roads are stored in columns of the same name and datatype.
Analysis of individual types of roads is a little easier.
Reporting for each type of road is easy.

Disadvantages

Analysing shared attributes for both types of road requires the joining of tables. This is a little more complex.
The existence of two tables gives the impression that different types of roads are completely different.
Duplicated roads (one copy of each type) are likely to be easy to create because uniqueness is harder to maintain across two tables.
Integration with external users is easy.

None of these advantages or disadvantages are going matter much either – although, again, anything that allows possible duplication of data should be viewed with some suspicion.

Option 3 – Shared base table

This is the most complicated of the options, although conceptually it probably makes the most sense. We tend to think of both freeways and tollroads as roads with some minor distinguishing factors, and this option reflects that simple categorisation.

The “road” information is stored in a ‘ROADC’ table, while the information about the freeway aspect of freeways is included in a ‘FREEWAYC’ table and the tollroad-specific aspects of tollroads is stored in a table called ‘TOLLROADC’.

Advantages

Easiest to code.
Avoids duplication of data wherever it matters with simple uniqueness constraints in each table.

Disadvantages

Table structure is the most complex.
Information for each road is stored across two tables.
Analysis of leaf classes can be more complex.
Selection and reporting can be more complex.
Integration with external users is more complex.

Overall, the most significant disadvantages of this option relate to complexity. Integration with other users still should not be very difficult, but it is certainly more complex than the other two options. Note that the data in Option 3 can easily be presented to external users in the formats created by either Option 1 or Option 2 (or both!) through the use of simple database views.

The main difficulties of this option may well be with the way that Django handles this method of sub-classing: but we will discuss this more in the next issue.

So what do you think? Which option would you take so far?

Django and sub-classing (#54 – #56)