A random event from my calendar serializes to 740 bytes of iCalendar. An extremely busy calendar with one event every 15 minutes, 9am-5pm, 7 days a week, still only contains 11,680 entries for a full year, which works out to around 8 MB.
I'd suggest that, rather than designing a schema at all, a calendar is a good example of a problem that might be far better implemented as a scan. Optimizing an iCalendar parser to traverse a range of dumped events at GB/sec-like throughputs would mean the above worst-case calendar could be scanned in single-digit milliseconds.
Since optimizing a parser is a much simpler problem to solve once than fixing or extending a bad data model after it has many users, and since the very first task involving your new data model is probably to write an iCalendar importer/exporter anyway, I think this would be a really great trade-off.
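Back-of-envelope: 11,680 events × 740 bytes ≈ 8.6 MB, so even a modest 1 GB/s scan rate gets through the whole year in under 10 ms.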
Wouldn't you have to re-invent things like range searches, which are 'BETWEEN ... AND ...' queries in SQL? The same goes for finding events for users 1, 2 and 3, etc.
In a real application you'd probably have user accounts of some sort that are already stored in a relational database, and then you'd suddenly have to scan for events in a directory and connect those to the records in the database.
So there might be some specific set of applications where you are right, but there are specific things that a database is really good at, which would make it a really good choice. With the proper indices you'd probably get the same or even better throughput, unless you come up with some clever directory structure for your events, which would in fact be an index, only on one dimension, whereas in a database you can create indices for many dimensions and combinations of dimensions.
I didn't mean to imply avoiding use of a database entirely, almost any DB system tasked with copying a few long strings around in a simple query won't perform much worse than a literal raw disk file.
Even just something like: CREATE TABLE calendar(id, user_id, blob)
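Spelled out a little more (just a sketch; column names and types assumed, and I'm assuming a users table already exists per the sibling comment):

    CREATE TABLE calendar (
      id      bigserial PRIMARY KEY,
      user_id bigint NOT NULL REFERENCES users(id),
      ical    text   NOT NULL  -- the raw serialized VEVENT, scanned at read time
    );
    CREATE INDEX ON calendar (user_id);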
Yes, sure. I can imagine, though, that normally you'd also want to be able to query on details of an event, in which case having most things in columns would make sense, because you can combine it with JOIN queries, etc.
Also, in the context of web applications, you probably already have a database and probably don't have persisted disks on your application servers, which adds complexity to the file-based scenario. In that case using blobs is a perfectly fine solution indeed.
Still you are right that in many cases, let's say a desktop application, you are probably better off reading directly from tens of files on disk rather than having to deal with the complexity of a database.
The same applies to vector databases. I read an article a few months ago that spoke about just storing vectors in files and looping through them instead of setting up a vector database and the performance was pretty much the same for the author's use case.
If I understood him correctly, I think this is where some languages' collection libraries ought to shine. PHP/Laravel collections or C# LINQ, for example. Tell it how to load a record into the collection, add one-liners for each criterion you want to define, and in a few dozen lines you're free to go.
Showed up to a doctor's appointment during their lunch break once because iCal supports the time format "TZ is whatever is local, trust me" and the device I used to add the appointment was set to UTC, not UTC+1.
My doctor's calendar app vendor was pretty happy I found the root cause of their very occasional mystery appointment drift, too.
If I invite you (who lives in Spain, let’s say) to fly over and enjoy my New Year’s party (in New Zealand) it definitely is not going to occur at “midnight local time wherever you are” - Trust me.
(Actually there’s a famous missing person’s case related to a New Year’s party in New Zealand … I don’t think that timezones were part of what went wrong, but I can’t be sure.)
It is also a data model that amounts to a de facto standard across all real calendaring apps. The text serialization sure is ugly, and who would want to touch it unless they had to, but that's just a parsing problem. The wiki page shows 14 component types; who is ever going to think of that many when designing a replacement schema? Why not just use the data model you already have 'for free'?
A de facto standard for interchange ("feeds" announcing changes to activities), not for a database (remembering a large set of activities indefinitely).
The appropriate schemas are vastly different: a message like iCalendar should be simple and self-contained (leaving cleverness to the applications exchanging it), while a database like the article discusses should be normalized (nontrivial structures and queries aren't a problem).
> > you can tell it was designed before XML/JSON were "hot"
> or you can tell it was designed by people who care more about performance.
I doubt parsing iCal is significantly more performant than JSON for most use cases. In fact I can imagine it being less so in more cases than it is more so, and as close to the same as makes no odds in the vast majority of cases.
Parsing iCal was far more performant than parsing JSON as your parser could simply start at the first byte, and follow the spec — whereas a JSON parser would first need to wait for ten years before JSON was invented.
The original comment implied that parsing JSON was less efficient (either in terms of coding required, or the resulting CPU use) than parsing iCal.
While it is true that dealing with JSON by hand is going to be more work than just importing an iCal library, I'm comparing apples to apples and suggesting dealing with JSON "by hand" is no more faff than dealing with iCal "by hand", and that dealing with either via a good library equally gives no benefit to iCal.
Most data formats used in high-performance scenarios, for example in finance, are very basic for a reason. You also get the benefit of being able to start processing the file immediately, line by line, instead of having to read and parse the entire file first.
Maybe you can do without a DBMS, but you still need a good schema for your data, and it isn't a bunch of iCalendar records: for example, instead of a stable user ID there are only mutable and repeated email addresses.
My takeaway is that converting everything to UTC is fine if it's historical data; even a Unix timestamp is fine. For future datetimes, however, it's more complicated than that.
Having built recurring stuff in the past (date-based with no time component, luckily for me), I think you gain a lot of usability by generating a row for each occurrence of the event.
Inevitably the user will come back and say "oh, I want it monthly except this specific instance", or if it's a time-based event, "this specific one should be half an hour later". You could just store the exceptions to the rule as their own data structure, but then you need to correlate the exception to the scheduler 'tick', and if they can edit the schedule, well, you're S.O.L. either way, but I think having concrete occurrences is potentially easier to recover from.
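To make that concrete, roughly the shape I mean (a sketch, names made up; when the rule is edited you regenerate the future rows):

    CREATE TABLE recurrence_rule (
      id    bigserial PRIMARY KEY,
      rrule text NOT NULL                       -- e.g. 'FREQ=MONTHLY;BYMONTHDAY=1'
    );

    CREATE TABLE occurrence (
      id        bigserial PRIMARY KEY,
      rule_id   bigint REFERENCES recurrence_rule(id),
      starts_at timestamptz NOT NULL,           -- editable per occurrence ("half an hour later")
      cancelled boolean NOT NULL DEFAULT false  -- "monthly except this specific instance"
    );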
The worst I have ever messed up during an interview was building a simple booking system that had to do recurring appointments. I hadn't felt that lost and confused trying to accomplish something in code since my early university days lol
To this day whenever I have to work with anything datetime related I dread it, it just does not click in my head for some reason
Asking a design question like "Design a booking system, but with recurring appointments" is like asking "Write a function to order a list of strings, but it has to work with arbitrary UTF-8 strings, in any locale, respecting alphabetical ordering conventions". It's a deceptively simple question that gets impossibly hairy if you try to make it work in a real-world, general way.
(As someone who hires) the correct answer to any question involving datetimes is always "use a [good] library - do not attempt to reimplement all of human civilization from scratch".
Don’t feel bad about it. I was tasked with improving a calendar in a CMS to support a bunch more functions, and was basically eaten alive by it. I had 8 years or so of experience by then.
It is deceptively hard and requires excellent data modeling skills.
Modeling a system is an under-appreciated skill. In a new domain, however, this should really start with an analysis of the entire problem so as to capture both the static structure of the system (i.e. class model) and its dynamic behavior (i.e. use cases). Jumping directly into a static database model tends to leave out the dynamic behavior. That might be OK in a simple CRUD app like this one, but could be a big mistake in more complex systems.
I'm a data scientist, and I have had a few seemingly simple "how would you build a schema" job interview questions that I had a difficult time with on the spot.
So the last one I remember was: how would you build a product table with coupons? OK, so two tables, right? No big deal. Well, we are going to need to keep a history, right? So now I need to track updates and have datetimes for different products and coupons. And now I should think about how to index the tables, and gosh, is my join to get the discounted price a good way to do that? Most coupons only allow a person to use them once; how the hell am I going to implement that?
They probably just wanted the simple product + coupon table, but I spun on it for quite a while like a madman.
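(Looking back, the use-it-once rule at least is just a uniqueness constraint; something like this made-up sketch:)

    CREATE TABLE coupon_redemption (
      coupon_id   bigint NOT NULL,
      customer_id bigint NOT NULL,
      redeemed_at timestamptz NOT NULL DEFAULT now(),
      PRIMARY KEY (coupon_id, customer_id)  -- "once per customer" enforced by the key itself
    );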
I'd say this is exactly what the interviewers wanted. They're interested in how you break down the problem, the types of solutions you consider, your understanding of the trade-offs involved. For example, I interviewed somebody who was adamant they could prevent double-booking by polling an end-point and storing the state in Redux. Fantastic JavaScript skills, terrible knowledge of databases.
I don't like these questions. Data warehouse schema design is out of scope for data science, it's data engineering. Yes we have to do a lot of ad-hoc data engineering along the way, but it's such a strange thing to interview for in lieu of the many possible data/math/stats skills and more directly relevant programming skills. It signals a lack of respect for division of labor and specialization, and that lack of respect will be visible in the form of a stretched inefficient team.
To be clear, I think the ability to be your own data engineer is a great attribute as a data scientist. I just don't think it's reasonable to expect it: it's not part of the core job description.
I expect a skilled data scientist to be able to articulate the complexities of the real world in relational data. Otherwise, how the hell will they be able to infer the real world from relational data?
> this should really start with an analysis of the entire problem so as to capture both the static structure of the system (i.e. class model) and its dynamic behavior (i.e. use cases).
How does the deliverable for this sort of analysis look? Do you have a link to some example documents (or maybe chapters in a book or whatever)?
When I did analysis for a living years ago, we used a subset of Unified Modeling Language (UML) and delivered the result as a Word document containing both the use cases and a class model. That format seems a bit heavy to me now, but the basic idea is still valid.
Coincidentally, the guy who taught me this process just published a book about it, which he now calls "Agile with Blueprints": https://www.agilewithblueprints.com/
Caveat: It's hard to convey the value of this approach because it shines most on complex systems that are hard to wrap your head around, and it feels like overkill when used on the kind of small project that one might want to start with. That said, I learned a lot of valuable lessons from it.
I found this to be a good introduction and a well-chosen domain to demonstrate modelling.
The term “anchor” feels kind of weird to me, but the explanation is so concrete/grounded (like an actual anchor) that I guess it works well enough.
The concept of defining the attributes via a question is solid, a great way to get clarity quickly. Too often we jump to a minimal column/property name without defining what question we're trying to answer, and thus never shake loose any ambiguity in the mind of the customer(s).
This term was inherited from Anchor Modeling (https://en.wikipedia.org/wiki/Anchor_modeling). "Entity" is pervasive, but I don't want to use this word because it may bring unnecessary baggage and assumptions.
Also, this term is heavily overloaded: basically everything in CS is either an object or an entity, lol.
> The concept of defining the attributes via a question is solid, great way to get clarity quickly.
Thank you, this is an important confirmation for the validity of this approach.
I’ve been thinking about this since writing the gp comment yesterday and quite like the anchor terminology.
Entity/object and “aggregate root” bring all sorts of complications/ambiguity.
The example you gave of a non-anchor (price was it?) was helpful to clarify.
Examples (and counter examples) are far easier for human brains to understand than “definitions”.
So often the example domain for these sorts of exercises is "a shopping cart website", and I find shopping websites such a poor example, because the specifics of how the shopping experience should work at a small website are very custom to what is being sold, to whom, why, etc.
Other common domains when learning about databases:
- a book lending library (I found this a good domain when I was learning, but people's experiences of libraries are very different now...)
- a blog (posts, comments, etc.), but only fools like me write their own blog software now... (and/or allow the public to comment? Yuck!)
I’ve written scripts that query Outlook appointments and found that their handling of recurring appointments was very odd! Writing a query that simply asks “what am I doing today?” is far from a trivial exercise, as a consequence.
The choices and trade-offs around storing every instance of a recurring appointment versus storing just the pattern plus the exceptions have some pretty weird/nuanced implications. And I think the optimum approach really depends on the way in which the tool is used… which you don’t know until after the tool is used! (I guess that lends further proof that the most important aspect of software is that it’s open to being changed in unexpected ways… as any software worth writing will also need to be modified later.)
Timezones will bend your mind, especially around the transition times.
Assuming your timezone jumps forward one hour for daylight savings time and falls back one hour for transition to standard time...
When your time skips forward one hour, your 1 hour event may now be displayed as spanning two hours - the second hour will not be reachable/does not exist.
When your time falls back one hour, your 1 hour event may now show as spanning 2 hours or 0 hours.
Timezones are a man-made construct, so don't hardcode values, because things will change...
It gets even more interesting if you're tracking time while considering movement. First time I ran into this was timetracking aboard a ship that is at sea for multiple days. You can cross timezones repeatedly in both directions, the date line, can have a local start datetime that is after the local end datetime, etc.
You don't even need a ship to get into some real confusion.
Arizona is on Mountain Standard Time, with no daylight savings.
The Navajo Nation, which overlaps the state of Arizona, is MST with DST (since much of it lies outside Arizona, where that's more common).
The Hopi Reservation, which is fully contained inside both the Navajo Nation and Arizona, is MST with no DST.
You can drive 35 miles from Gray Mountain, AZ (non-reservation) through Tuba City (Navajo Nation) to Moenkopi (Hopi Reservation) and experience noDST->DST->noDST in just over half an hour.
Continue southeast for 100 miles to experience a bonus noDST->DST->noDST->DST->noDST->DST transition.
If you rely on your phone's automatic clock adjustment, best of luck!
Are you suggesting just making the calendar UTC with zero timezones? I agree that they are something we created, but we created them for better local organization. I work with customers globally and everyone uses them. This is a pretty ridiculous ask, in my opinion.
Rather, just focus on eliminating the Daylight Savings concept from the few localities that still use it, as it tends to cause the most confusion across timezones, especially when planning past an upcoming shift.
> Rather, just focus on eliminating the Daylight Savings concept from the few localities that still use it
Oh, certainly, change the laws and cultures of foreign states and peoples in order to simplify your code. Can you get them to just write everything in ASCII while you're at it?
One approach is to store everything in UTC, and display in the timezone of the user.
Dealing with timezones - including DST - properly is a must-have, no way around it. I live in a country that uses DST; a lot of Europe does. If my calendar were off by 1 hour for half the year, I'd consider it broken and seriously doubt the competence of its authors. This is the core domain of a calendar app! It would be like an email app that just silently drops every other email.
I'd love for us to ditch DST, by the way; I hate it every time. It's bad for the economy, it's bad for our health, it's bad for software.
I don't trust UTC for future dates, only for things that have already happened. Future dates are tentative in that their UTC representation can change. For example, if I converted a Kazakh user's event slated for April 1, 2024 to UTC before February 1, 2024, the event time would be off by one hour.
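You can reproduce this with any system that ships the IANA tz database; here's a Postgres-flavoured sketch (the output depends on which tzdata version is installed, which is exactly the point):

    -- the same Kazakh wall-clock time, interpreted under different tzdata releases:
    SELECT timestamp '2024-04-01 12:00' AT TIME ZONE 'Asia/Almaty';
    -- old rules (UTC+6):     2024-04-01 06:00:00+00
    -- current rules (UTC+5): 2024-04-01 07:00:00+00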
Timezones are not carved in stone. Prepare for that.
A location could go from a +1/-1 DST regime, as in most of the USA/Canada, to a fixed offset with no transitions.
There are various ways to adapt, but the user-friendly way involves a lot more work in the app, especially if the app thought that timezone data wouldn't change.
Indeed. There are multiple timezone rule changes each year, causing gaps in time. Also, not every timezone offset is a neat 1-hour increment; some are 15 minutes or worse. There are even people living in the same geographical location but in different timezones!
Dealing with timezones will drive you mad, quickly.
The people who designed DST scheduled the transitions for when most people wouldn't be awake or affected: early in the morning, in the middle of a Saturday/Sunday weekend, for the USA/Canada.
But remote teamwork threw a wrench in that. 2AM Sunday meetings sound unlikely, until your team needs to communicate with a team several hours ahead or behind and Sunday is a regular workday for one of the teams.
> Dealing with timezones will drive you mad, quickly.
I guess you mean "implementing timezone logic yourself". Other than that, the suggested approach (store everything in UTC) and translating to the relevant user timezone with a TZ database in the frontend is the way to go.
Not sure; that's good advice for times in the past, or for times in the next few months, but if we're arranging to meet at midday local time in two years' time, I don't think a change in timezone rules in 2025 should cause our meeting to take place at a different time.
I suppose you could store the meeting in UTC and use the creation time of the meeting to decide to use the 2024 timezone rules for conversion not the 2026 rules, but that seems pretty confusing too!
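A common compromise (a sketch, not gospel) is to store the wall-clock time plus the zone name, and derive the UTC instant at read time, so a rule change moves the derived instant rather than the agreed local time:

    CREATE TABLE meeting (
      id         bigserial PRIMARY KEY,
      local_time timestamp NOT NULL,  -- '2026-06-01 12:00', no offset baked in
      tz_name    text      NOT NULL   -- e.g. 'Pacific/Auckland'
    );

    -- interpreted with whatever rules are current when you ask:
    SELECT local_time AT TIME ZONE tz_name AS utc_instant FROM meeting;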
You don't want to store two dates for an event. It's easier to store the start time of the event and a duration for the event. This will make the logic for updating your events simple. You can always calculate the end time of the event based off the start and duration.
> You don't want to store two dates for an event. It's easier to store the start time of the event and a duration for the event.
Frankly speaking you could store all three (begin, end, duration) and just use whatever you need for different purposes. Just introduce a single point of entry/update that keeps alternative representations in sync.
In practice, performance problems are a nonissue. Your API should consist of start time and duration, because that is how people think about appointments and meetings (an hour-long meeting starting at noon). It's a pain to write a UI that updates two pieces of information when one piece changes. If you are truly worried about query performance, you can denormalize the data before saving by storing start/end times as a datetime range field, but it still makes no sense to expose that in your API.
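That denormalization might look something like this (Postgres assumed, names invented); it also makes day-spanning queries easy:

    CREATE TABLE appointment (
      id        bigserial   PRIMARY KEY,
      starts_at timestamptz NOT NULL,
      duration  interval    NOT NULL,
      period    tstzrange   NOT NULL  -- kept in sync as tstzrange(starts_at, starts_at + duration)
    );
    CREATE INDEX ON appointment USING gist (period);

    -- "everything touching Sunday", including a Saturday 20:00 + 6h party:
    SELECT * FROM appointment
    WHERE period && tstzrange('2024-01-07 00:00+00', '2024-01-08 00:00+00');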
How would your code generate all the appointments for a given day?
Requirement: Must be able to handle appointments that span a day, i.e. show all Sunday appointments when there's a party appointment that starts at 8PM Saturday and ends at 2AM Sunday, or in your data model, Saturday 20:00 for six hours.
Showing the duration is helpful, but so is the exact end time. Visually, as a user, I would like to see the exact time the appointment ends instead of calculating it in my head. It is not that hard to work out that an appointment starting at 2:30 with a duration of 4:15 ends at 6:45, but still.
As for the API, it makes a lot of sense to expose the end time. If, for example, you are creating a calendar widget, then it has start and end datetimes for all events. With only a duration available in the API output, you have to calculate the end time yourself. More lines of code for you.
When fetching from the API you would in most cases limit it to certain dates, for example next week. So you already have to deal with start and end times; not having the end time makes no sense.
No developer has ever asked for a duration output in our scheduling API. It would be useful to include, but since no one has asked for it, I think having the end time is more critical. https://developer.makeplans.com/#attributes-1
Years ago I worked on a calendar application with recurrence. After lots of research I settled on using RRules to represent this, which I was very pleased with. That initial work was when I was at an agency.
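(For anyone who hasn't met them: an RRULE is a single declarative line in the iCalendar format, with EXDATE lines carrying the exceptions. Illustrative values:

    RRULE:FREQ=MONTHLY;BYDAY=2TU
    EXDATE:20240312T100000Z

The first line means "every second Tuesday of the month"; the EXDATE knocks out the single occurrence on 2024-03-12.)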
Later I joined the company full time and discovered, to my amazement, that a contractor from a different company had removed the RRules in favour of creating and destroying instances of events on the fly. It had little to no fault tolerance, so sometimes the script (which did other things that would sometimes fail) would fail to create new events. You'd have monthly recurring events with missing months.
I found it so frustrating that, after all the thought and research I had put in, someone had removed my work with nowhere near as much effort. It took just a few weeks at that company to realise that the CEO expected the Engineering team to pump out features (that nobody used) at his will and, given the uncertainty of the job market, I sadly stayed there for 2 years.
Unrelated footnote: After Googling them, it's really sad to see what are blatantly fake reviews by the CEO on Glassdoor all written in the same style with nothing bad to say. I (and a bunch of other people I know who worked there) hated him, but the silver lining is that I wrote some of my best essays there. The CTO was hopeless too.
I once implemented the backend of a calendar and resource control for a low-code platform.
The control is highly customizable, with a lot of views to choose from: daily, monthly, yearly... but also resource views (you can book resources with custom groupings, by plugin, by resource ID, whatever...). You define "plugins" on the data sources: which are the from and to columns, the title column, what the resource is (it may come from a foreign key / 1:1 relationship, or 1:N if it's from a "child" data source, or from the same data source/table).
Furthermore, I implemented different appointment series to choose from (monthly, weekly (with which weekdays), daily...), and which column values should be copied. Also appointment conflicts (or conflicts only when they book the same resource). You could also configure buffers before and after appointments where no other appointment can be placed.
That was a lot of fun, and also a challenge at times, regarding time zones and summer/winter time in Europe and so on :-)
> This is one of those aspects which is not really touched on in most courses.
Thank you! As I was working on all of this, the more it seemed to me also that what you're saying is true. The goal of this approach is to quickly get a very concrete deliverable that you can confirm with the business people and then proceed with implementing in the specific database, using your favorite table design strategy.
I'm genuinely curious what you mean when you say "relational data"? I've seen this phrase thrown around and I think it's something of a misconception.
The way you use the term implies that you're referring to the type of data, but the term generally refers to the method used for storing the data.
This distinction is important because it leads to a circular reasoning dynamic: many of us are accustomed to storing the data in tabular form using a relational data model. But choosing to use that particular model to represent objects or entities or ideas does not make those objects or entities or ideas fundamentally relational data.
Is that really true these days? Setting up Postgres read replicas with automatic failover across multiple machines is pretty trivial in the cloud with services like RDS, Spanner, etc. And although doing it in your own datacenter is still a big job, it's far from impossible.
Just curious: why? I have never worked with NoSQL, but I always envy the people who do, because they are a bit further away from the business than us data warehouse modellers. They are usually our upstream, get paid more, and get less hassle from the business.
This is an extremely misleading statement. You have to jump through hoops to do expressive queries, with a limited set of index primitives, a query optimizer that doesn't take advantage of anything that resembles nested JSONB, and syntax and storage overhead that will leave you thinking there's nothing good about storing data with rich JSONic structure to begin with... the worst of all worlds.
And yet almost all of the major websites you use today rely on NoSQL.
Also, if you have non-traditional data structures, e.g. document, star, graph, or time series, then storing them in a SQL database will cause you nothing but problems.
There are no black/white answers in tech. Always right tool for the right job.
> And yet almost all of the major websites you use today rely on NoSQL.
They may use NoSQL in specific use cases, but certainly not exclusively. Using the right tool for the job is crucial; otherwise, you’re doing yourself and your product a disservice.
In this case, NoSQL database architecture and internals provide little to no advantage over relational databases. I can’t imagine building a calendar implementation with NoSQL. Some flexible parts of the event model might be stored as NoSQL, but in general? No way.
Edit: looks like I wrote my comment as you were editing yours. We agree :-)
It's a weird argument because the article is wrong.
Google Calendar is not implemented on top of a traditional SQL database but rather on top of Spanner which is more akin to a NoSQL database with a SQL front end.
Up to chapter 6 the method is completely database-independent. In chapter 7, a classic relational table approach is explained.
You can choose a different approach if you want to use something like Cassandra, MongoDB or DynamoDB (or some other NoSQL solution). I am not an expert in all three, but I hope some day to write an "alternative ending" for this tutorial and show how it can be applied in those environments.
Whatever database you choose, you still must do all the work of logical modeling (the first six chapters) for your application. If you don't do it explicitly, you'll just have to do it implicitly, intertwined with physical design and thus inherently more confusing.
The answer, somewhat counter-intuitive for most people, is that NoSQL is very rigid. It is counter-intuitive because having no required schema up front appears to be more flexible, not less.
However, having your database not handle the schema means your application must do it; there is no way around it. If you ask for a DayEvent and you get back something totally different, what do you do?
The rigidness in most NoSQL (assuming some form of document store like MongoDB) comes from its inability to combine data in new ways in a performant manner (joins). This is what SQL excels at. That implies you need to design your data in exactly the way it is going to be consumed, because you can't easily recombine the pieces in different ways as you iterate on your application. Generally you must know your data access patterns in advance to create a well-behaved NoSQL database. Changes are hard. This is rigid.
Thus, it actually makes more sense to go from SQL to NoSQL as you gain experience and discover the data access patterns. The advantage of NoSQL is not flexibility; that is actually its disadvantage! The advantage is rather its horizontal scalability. However, a decent SQL server with a competently designed schema will go a very long way.
I think you have a very thoughtful take, but I believe it's a mistake to think of 'NoSQL' as a monolithic category.
There's a very wide spectrum, from an evolvable document-oriented data model with strongly consistent secondary indexes, transactions, aggregations, and joins, down to simplistic key/value stores like DynamoDB and Cassandra that do force you into the very waterfall-like posture that I think you are spot on in pointing out.
Because events are related to users and they both are related to timezones and events can be related to each other. MongoDB is really good for storing big blobs of data you want to retrieve quickly, with some basic search and index, but it's awful at relations between data.