Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: How do I represent a hierarchy in RDBMS?
38 points by m33k44 on May 28, 2019 | hide | past | favorite | 63 comments
Say, I have following hierarchy:

   Animal ==+==> Dog
            |
            +==> Cat
and another "class", Pet, that references Animal.

   Pet --1-----+--> Animal
The way I can represent this in RDBMS is by creating tables for each class, Pet, Animal, Cat and Dog. And then additional tables for the relationships, i.e.

   PetAnimal(pet_id, animal_id) [one-to-many]
   AnimalDog(animal_id, dog_id) [one-to-one]
   AnimalCat(animal_id, cat_id) [one-to-one]
This way I am able to represent the hierarchy of Animal, Dog and Cat. So, tomorrow if I want to add Rabbit, then I will just add Rabbit and AnimalRabbit tables.

My only concern is that the Animal table will grow rapidly as more pets and animal types are added and will be a performance issue. What is a better way to represent hierarchical structures in RDBMS to avoid both storage space explosion and performance issues?




You've asked about storage size and performance, which are very unlikely to be an issue for storing rows of two integers.

What you really want to be thinking about is the data access patterns.

Are you going to be querying the entire data set in a way that can only be practically expressed or executed as a graph algorithm? My guess is no, because that's specialized and uncommon. What you want to avoid is giving up the reliability, maturity, predictability of an RDBMS unless you really have to, because you really need the secret sauce of a graph (or other specialized) database.

Even if you do have graph algorithms to run, I would still recommend an RDBMS for primary storage and day to day single-animal lookups if that's a frequent use case. Load data into a secondary graph database to handle offline or one-off analysis. Unless your entire application is truly centered around graph algorithms.

Switching off of an RDBMS very quickly becomes a tail-wagging-the-dog situation if you only look at the data and not the access use cases.

The important thing to note is that the entity relationships alone aren't the right information to allow anyone to make a meaningful recommendation.



Your storage space concern is unfounded, particularly without denormalizing your relationships (to gain performance).

It seems you have ‘animals’, some of which can be ‘pets’, but not all.

You also have ‘dogs’, ‘cats’, ‘rabbits’, etc., which may or may not be pets, but are all animals.

So, my questions to you, what attributes do all animals share?

What attributes do all pets share?

What attributes are unique to dogs?

What attributes are unique to cats?

There are two main ways I could model this in an RDBMS, but I need to know more specifics.


  > What attributes do all pets share?
Pets have owner, feeding time, outdoor time etc.

  > What attributes are unique to dogs?
Type, Height, Medical record, Sex, Barking pitch etc.

  > What attributes are unique to cats?
Type, Colour, Weight, Medical record, Sex etc.


So, since you listed ‘sex’ as a unique attribute for both cats and dogs, I think you need to alter your perspective.

We want to model the relationships between entities, not necessarily represent OOP ‘objects’ in a DB.

So, you should have an ‘animal’ table that would have fields for any attribute that is shared by all animals, such as ‘sex’, ‘medical record’, and ‘weight’ (I am assuming all animals will have a gender).

Medical record would be a lookup key, probably not an integer, so you cannot enumerate or get an off by one error, use a UUID or hashid.

I am assuming non-pets can have medical records.

Your dog and cat tables would reference the animal table via a foreign key for those details.

For cat type and dog type, I would make those foreign key references to lookup tables for each of the respective types.

Animal table would also have a foreign key reference to the pet table, but this can be NULL, if the animal is not a pet.

Thoughts?


So something similar to what @rodeoclown described below?


Exactly.


If you're using Postgres, you can use recursive queries using common table expressions. Disqus implemented this sort of thing 10 years ago: https://pastebin.com/Fe2twMRr (I can't find the original talk for some reason)

Here's the meat of the solution:

  CREATE TABLE comments (
        id SERIAL PRIMARY KEY,
        message VARCHAR,
        author VARCHAR,
        parent_id INTEGER REFERENCES comments(id)
  );
  INSERT INTO comments (message, author, parent_id)
  VALUES ('This thread is really cool!', 'David', NULL), 
  ('Ya David, we love it!', 'Jason', 1), ('I agree David!', 'Daniel', 1), 
  ('gift Jason', 'Anton', 2),
  ('Very interesting post!', 'thedz', NULL), 
  ('You sir, are wrong', 'Chris', 5), 
  ('Agreed', 'G', 5), ('Fo sho, Yall', 'Mac', 5);
What we’ve done now, is setup a basic comment model. We’ve included the message, the author, and the parent comment (which is optional). Now let’s learn how to use a recursive query to easily re-order this date, showing us a fully threaded view, sorted in ascending order by id.

  WITH RECURSIVE cte (id, message, author, path, parent_id, depth)  AS (
  SELECT  id,
          message,
          author,
          array[id] AS path,
          parent_id,
          1 AS depth
  FROM    comments
  WHERE   parent_id IS NULL

  UNION ALL
 
  SELECT  comments.id,
          comments.message,
          comments.author,
          cte.path || comments.id,
          comments.parent_id,
          cte.depth + 1 AS depth
  FROM    comments
          JOIN cte ON comments.parent_id = cte.id
  )
  SELECT id, message, author, path, depth FROM cte
  ORDER BY path;


As other people are saying, you haven't really given enough information for someone to suggest a definitive data model.

By "Pet" class, do you mean a table for pet records?

Pets are animals that humans decide to take care of. Should a "Pet" be a state of an Animal? So, should it be a simple boolean column on the Animal table?

Having an AnimalRabbit table implies a many-to-many relationship between Animal and Rabbit. Are you storing different rabbit types as parent records? If so, why not avoid the join table, store "rabbit" as a record in the Animal table, and have the different rabbit types in the Rabbit table.

Without more detail on what you're trying to do from an application standpoint, it's tough to give recommendations. Entity-Relationship Modeling purists believe in defining entities and their relationships without implementation detail. I've always found that to be a crock. You need a certain amount of detail to model better.


  > By "Pet" class, do you mean a table for pet records?
Yes.

  > Having an AnimalRabbit table implies a many-to-many relationship between Animal and Rabbit. Are you storing different rabbit types as parent records?
The whole point on making an Animal table was so that it could be referenced by the Pet table, because Pets can be more than one type. If I include Dog and Cat directly in Pet table then there will be lots of fields in the Pet table empty; also tomorrow if I want to add Rabbit as a Pet then I need to modify the Pet table to include Rabbit fields with null values in the previous rows. Having a separate Animal table allows me to add new types of Pets without modifying the Pet table and also saves table space.


To reiterate, without knowing about your application, it isn't easy to make recommendations.

It sounds like the Animal table is going to be the join table between Pet and the different types of animal tables. Correct?

You haven't mentioned a need to store differentiating information - e.g. species, genus, family - in the different animal tables (Dog, Cat, etc.). So it sounds like you could skip the one table per animal notion and just have the Animal table contain animal types like rabbit, dog, etc. Then it becomes the "one" in the one-to-many relationship with Pet.


  > It sounds like the Animal table is going to be the join table between Pet and the different types of animal tables. Correct?
Yes, that is correct.


A simple solution is to have a single table with the fields: id, parent, name

id is your primary key. parent is the id of the parent (if one exists) or a sentinel value (NULL or 0) if no parent exists, and name is "Dog", "Cat", "Pet", etc.

If you structure your queries well this will be very performant up to 1 - 2 million records on commodity hardware (e.g. an EC2 t2-large instance) and may be suitable for your needs for much larger data-sets, depending on your application.

For improved lookup performance, you might split the fixed-width and variable-width (varchar) columns into separate tables. Then you'd end up with an "animal_names" table consisting of (id, name) and an "animal_relationships" table consisting of (id, parent_id, name_id) -- but again, that depends on usage patterns and which specific queries you need to be performant.


The question is not really about hierarchies, but more about representing diverse-but-similar domain entities.

It's a problem found in GUI modeling also, where various GUI widgets will or can have different properties. Some properties will be shared, such as "parent_ID" (AKA "container"), but many are specific to a particular widget or subset of widgets.

Further, new widgets may be added down the road, and we don't want to have to invent a new table for each one. Current RDBMS modelling techniques handle these situations poorly.

I've been a fan of "dynamic relational" to handle these kinds of problems, but so far it hasn't been implemented: https://news.ycombinator.com/item?id=15352786


There is no RDBMS that will choke on a table with every animal in existence in it.

To my knowledge, the two most common ways of handling this: self-referential nested set[1] and with a join table[2]

1. https://en.wikipedia.org/wiki/Nested_set_model

2. https://coderwall.com/p/lixing/closure-tables-for-browsing-t...


Are there any significant differences between AnimalCat and AnimalDog? Feels like there can just be an "animals" table with an enum "type" field or a one to many animaltype(id, type).

Normally with such hierarchies it doesn't neccesary need to represent the same structure unless there is a significant difference as we can filter by animaltype with a where clause. The same approach cannot be applied to rmdbs design as to oo because the paradigms are different.


  > Are there any significant differences between AnimalCat and AnimalDog? Feels like there can just be an "animals" table with an enum "type" field
The AnimalCat and AnimalDog are just relationship tables. The alternative to not using separate relationship tables is, just as you said, to use an enum type field in the Animal table. I was actually trying to avoid using an extra field in the table to indicate type, hence my question on what are better approaches to avoid both table size increase and performance issues.


An enumerated field will be represented most likely as an integer value in the row, so keep that in mind.

Otherwise, you are going to use a foreign key lookup to get your type, which will also be an integer stored in that same row.

But, if you wanted to add a new animal type, and you used a lookup table, even if you have billion of rows, you are not messing with the table with a bunch of data. That may not be the case with an enum.

It sucks to ALTER a live table in production with billions of rows being actively INSERTed.


What is the reason for avoiding an extra field considering the alternatives mentioned throughout this post?


Just one 64 bit size field in a 1 billion row table will amount to about 8 GB data for that table.

Edit: Basically what I am looking for is a trade-off between space and performance.


8gb seems quite small in the grand scheme of things. Storage is fairly cheap nowadays. I'm sure there will be other concerns with scaling when you do hit the 1 billion mark.


Start with the animal table:

Animal(animal_id PK, ... animal attributes)

No need for join tables or surrogate keys on these tables. The animal_id can serve as both the primary key and foreign key.

Dog(animal_id PK FK, ... dog only attributes) Cat(animal_id PK FK, ... cat only attributes)

No need for surrogate key on pet, and don't forget the owner key:

Pet(owner_id PK FK, animal_id PK FK)

This allows multiple owners of one animal. If you don't want that you need a unique constraint on animal_id.


What are the pros and cons of this approach?


The pros are you aren't creating unnecessary sequences and tables, which has both storage space and join complexity savings.

The cons are...none, really; the extra surrogate keys aren't buying you anything.


This is really a domain modeling question masquerading as a data modeling one. If you really were modeling animals, then you could put the whole thing in one animals table, with self-referencing columns. This is because you only have one 'type' of domain concept, with details that not all instances of the class will share. So you leave those columns NULL.

So you do one animals table with an animal_id and species_type columns, the species_type column could be a enumerated string field or foreign key to a species table, and pet as a boolean field. Now soon you might want to know 'whose' pet it is, so you'll need a human table to store the humans, and then you'd need a human_id field on the animals table to store that information. You 'could' use a pets join table if you want pets to be a first class domain concept.

All these decisions revolve around what your app will be doing. Ideally you'd have a REST web architecture, this makes domain modeling a bit easier. Anything that has it's own web page, give a database table.


In Postgres you can use tree data with LTREE.

http://patshaughnessy.net/2017/12/11/trying-to-represent-a-t...


great for data that is mostly read intensive. not so great if you want lots of writes.

@maxk42 approach with the relationship table works well for both read and write intensive applications.


Depending the DB, some databases have build-in support for HierarchyId or ParentId ( eg. SQL server : https://docs.microsoft.com/en-us/sql/relational-databases/ta... )

Otherwise nested sets of closure tables are popular db patterns.

I love the trick of nested sets though. Calculating a number for each side of a type do your RightBower and LeftBower can be used ( https://images.app.goo.gl/Bv2wiC3gG6DW869r5 )


You can just use Animal_id as the PK for all the tables (in the other tables it will also be an FK to Animals) and dump the relationship tables.

> My only concern is that the Animal table will grow rapidly as more pets and animal types are added and will be a performance issue.

Why? You aren't duplicating data anywhere (except perhaps the unnecessary IDs), so why do you think there is a more compact representation? It doesn't matter much what table the data is in as long as it appears only once.


Sorry, I am embarking on a flight so maybe I will have to come back to you later.

For the time being: https://stackoverflow.com/questions/3413459/designing-sql-da... - mine is the "chosen" answer and I hope will give you a pointer in the right direction.


Does this need to be tabular data? If you want to model relationships, graph databases are great.

Despite their name, relational databases are not great at relationships. They're great at tables, but can also handle relationships if you have to. Many problems can be solved by an RDBMS, but as this question and many others like it show, it's not always a natural fit.


> Does this need to be tabular data?

No, not a requirement. But what will I lose if I choose graph DB over RDBMS?


From experience, so many things that graph dbs are not worth the effort in something like 99% of cases. Stick with postgres or similar if you can!


Graphdbs are excellent at joins when a relationship exists (they are O(1), not O(logn) like SQL). The query language many of them use (Cypher) is very expressive for these types of queries. Take a look, the basics are very easy to pick up.

The biggest draw-back is that you likely already know SQL and would need to learn a lot up front. The resources out there are good, but there are so many more good SQL references. If you want to do aggregations and the like, SQL Databases might be a little more suited for those tasks. Popular SQL DBs are also a lot more stable and optimized, just given their age.

To make use of these features though, all your data must fit on one box.


The difference in experience is really the only major downside to graph DBs. They're an excellent tool for many of the situations where we currently use RDBMSs, mostly because that's what we're familiar with while graph DBs are new and unfamiliar.

I've now been using Neo4J for a couple of months, and I don't want to go back. If relations between different items are what matters to you, then graph DBs make that so painless compared to the joins from relational DBs.

The main thing relational DBs still have over graph DBs is tables. Give me a table with all items from this table that meet this condition. Graph DBs can do that too, but it's less natural, because they don't store their data in tables, and nothing guarantees that all items of a certain type actually have the same properties. RDBMSs are statically typed, Graph DBs dynamically typed.


Familiarity, tons of features, existing well-worn supporting libraries for lots of languages and ecosystems, and depending on which one also safety/reliability and, for a variety of common query patterns including many that look pretty damn graphy, performance and predictability.

FWIW my experience is with Neo4j.



Use the right data model and - for example - plug Agensgraph into Postgres. Plug your data in Agensgraph. Problem solved.

https://github.com/bitnine-oss/agensgraph


He's not talking about an object hierarachy, but more of a type hierarchy



great for data that is mostly read intensive. not so great if you have lots of writes.

@maxk42 approach with the relationship table works well for both read and write intensive applications.


I agree if you can use a recursive table, do it. If you're using an ORM do it as the ORM allows you and think about performance later as soon as it grows big.

If tables are small (or you have good indexes), a non-optimal structure will still work for you.


  > I agree if you can use a recursive table
Could you please elaborate on that? Which table will have recursive relation in this case?


  Category ID  Category Name Category Parent
  -----------  ------------- ---------------
  1            Animal        NULL
  2            Dog           1
  3            Cat           1
  4            Rabbit        1
  5            Beagle        2
  6            Lion          3
  7            Tiger         3
  8            Bengal Tiger  7
A Bengal Tiger (8) is a Tiger (7) is a Cat (3) is an Animal (1).


just for your sanity in a relational database alway put a root id for hierarchies. Also, you might want to skip the null and use a self reference depending on how you need to build the hierarchy. Root only matters if you have multiple hierarchies in a table.


+1 for root-id, this is a huge failing in a lot of threaded discussion table designs.


Yeah, it really helps if you have to do some processing in the stored procedures and it helps with the query speed if properly indexed. I have seen folks put a level number in the table also, but that really depends on what the hierarchy represents.


One possible solution is a junction table: https://en.wikipedia.org/wiki/Associative_entity

An RDBMs within indexes is probably fine.


Recursive relation within the same table. A single table can solve this problem. Some animal is pet and some are not this is not their class but these are their property.


  > Recursive relation within the same table. A single table can solve this problem.
How? Which table will recurse?


Not parent commenter but a single table is enough. Each row pointing to its parent row.


Also not parent commenter, and not a guru, but I did manage to implement this with an ORM.

https://github.com/BitWorks/xbrlstudio/blob/master/BookDatab...


There will never be so many animals that they won't fit into RAM on any reasonable DB server (even including a smartphone).


If you used Neo4j you would get an easy hierarchy. If you need multiple inheritance then no problem. If all of your classes have completely different properties then no problem. If some discovery reclassified the tree somehow then no problem. But it all depends on how important this use case is to the overall application.


Have a table for Types

The types table have references among them

Types Table

id name parent_id

1 Animal null

2 Pet 1

3 Dog 2

4 Cat 2

Have the actual animals reference the Types

id name type_id

1 Rover 3

2 Oscar 4

Now, you can get the relationships out of these.


A self referencing foreign key (to the same table).


I agree with the general sentiment that this is a better stack overflow question. But, it's made it to the HN front page, so it deserves some discussion.

In general, there is no one-size-fits-all answer, because the answer is really driven by your data. Which brings up the most important point:

Your database is your application's source of truth. It must have the best data model, and will most likely outlast all source code that you write.

So don't try to model your data as if you're writing an object oriented application. Model your data at the database first, and then write your application to fit the model of your data. (Don't put the cart before the horse.)

Anyway, the general patterns in this case are:

- One table with lots of null columns

- Lots of tables that join: (Animal joins pet joins Dog/Cat/Fish)

- Separate tables (Dog/Cat/Fish, no joins)

All approaches have tradeoffs. The one big table approach will have "waste" with the null columns, but remember that joining pet to animal to dog has a penalty on read, and that separate tables has implementation complexity.

IMO: If your entities are known, (IE, you won't add a new type of animal every month,) and your database is read-heavy, go with a single table with null columns. Otherwise, you need to justify the complexity of multiple tables.


One "trick" applicable to the mega table solution (one table with all data) to preserve performance (kind of) is to do indicies that spans each type, or do views (which technically will be almost the same as indicies to the DB internally). I'm by no means a DB pro, but I've looked up the same answer and been working with these cases (with these twists).


This would almost certainly be removed on stackoverflow.


This is a question better posted to Stack Overflow. With more technical implementation details.


There's no problem with posting questions like this to HN! If it gratifies intellectual curiosity, it's just fine.

https://news.ycombinator.com/newsguidelines.html

Edit: I should add something important, though. This submission works because there haven't been similar questions recently. If it started a pattern of people posting questions of this sort, that would quickly be less interesting, and therefore off topic by the curiosity principle.


If it triggers heated debate among experienced people, it probably qualifies as an "intellectual curiosity". If most come to a consensus about the solution(s), then it's more of a technical how-to question. The problem is that it can be hard to know ahead of time; newbies often stumble into age-old conundrums. So far, I find this question in the first camp: it's not something the current RDBMS are well-equipped to handle for reasons described in a nearby reply.


I guess I'd have to say it's not a stack overflow question without those implementation details at any rate, and the whole point of a question like this is you are nowhere near getting implementation details together.

Maybe better suited to https://dba.stackexchange.com/




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: