VSCode does the "correct" behavior of Bad #3, but doesn't even need the "bad" part about pushing the bytewise caret position around: it logically maintains two characters, but visually coalesces the middle position and the front position into one. I wonder why it wasn't mentioned.
It's probably bad because there isn't an additional kludge added: decomposition of combined character entities for editing. This would involve a concept of sub-character (code-point) 'tabs' (which ideally would be distinct UTF-8 entities).
Another option would be that deleting the 'a' doesn't completely delete it but instead replaces it with a zero-width space or zero-width non-joiner, so it looks like Bad #1 (but is Unicode-compliant) and hitting delete again gives Bad #3.
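For what it's worth, here is a rough Python sketch of that placeholder idea. It is only an illustration: `delete_forward` and `SKIN_TONE_MODIFIERS` are names made up for this example, only the five Fitzpatrick modifiers are recognized, and the assumption that a second delete simply removes the placeholder (leaving the bare modifier) is mine, not something taken from the article.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
# Assumption: only the skin tone modifiers U+1F3FB..U+1F3FF count as "modifiers" here.
SKIN_TONE_MODIFIERS = range(0x1F3FB, 0x1F400)


def delete_forward(text, index):
    """Delete the code point at `index` (hypothetical editor command).

    If a skin tone modifier follows, the deleted code point is replaced by a
    ZWNJ placeholder so the modifier still has something to sit on. Deleting
    the placeholder itself removes it for real.
    """
    if not 0 <= index < len(text):
        return text
    followed_by_modifier = (
        index + 1 < len(text) and ord(text[index + 1]) in SKIN_TONE_MODIFIERS
    )
    if followed_by_modifier and text[index] != ZWNJ:
        # First delete: swap the base for an invisible placeholder.
        return text[:index] + ZWNJ + text[index + 1:]
    # Ordinary delete, or second delete on the placeholder.
    return text[:index] + text[index + 1:]


text = "a\U0001F3FD"              # 'a' followed by a medium skin tone modifier
once = delete_forward(text, 0)    # ZWNJ + modifier
twice = delete_forward(once, 0)   # bare modifier
```

The buffer still shrinks to the bare modifier after two presses, so the user pays one extra keystroke for the intermediate state.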
The whole example is a bit contrived though; nobody is going to enter a skin tone modifier character by hand in daily use. They'll select an appropriately colored emoji.
IMO, having delete trigger a zero-width space insertion would be the worst option. I mention in a sibling comment that VSCode gets around this by having two separate logical caret positions combined into a single visual position. So the byte offset of the cursor changes as expected, while still maintaining "Unicode correctness", for whatever that's worth.
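To make the "two logical positions, one visual position" idea concrete, here is a small Python sketch. It is not how VSCode is actually implemented, just a model of the behavior described above; `logical_byte_offsets` and `visual_columns` are hypothetical helpers, and the clustering rule is deliberately simplified to "a skin tone modifier attaches to whatever precedes it".

```python
# Assumption: only the skin tone modifiers U+1F3FB..U+1F3FF extend a cluster.
SKIN_TONE_MODIFIERS = range(0x1F3FB, 0x1F400)


def logical_byte_offsets(text):
    """UTF-8 byte offset of every code point boundary (the logical caret stops)."""
    offsets = [0]
    for ch in text:
        offsets.append(offsets[-1] + len(ch.encode("utf-8")))
    return offsets


def visual_columns(text):
    """Visual column for each logical caret stop.

    A stop that sits just before a modifier is drawn at the front of the
    cluster, so it shares a column with the stop before the base character.
    """
    columns = []
    for i in range(len(text) + 1):
        # Count the cluster starts (non-modifier code points) before this stop.
        col = sum(1 for ch in text[:i] if ord(ch) not in SKIN_TONE_MODIFIERS)
        inside_cluster = 0 < i < len(text) and ord(text[i]) in SKIN_TONE_MODIFIERS
        if inside_cluster:
            col -= 1  # coalesce the middle stop with the front of the cluster
        columns.append(col)
    return columns


text = "xa\U0001F3FDy"                 # 'x', 'a' + skin tone modifier, 'y'
offsets = logical_byte_offsets(text)   # [0, 1, 2, 6, 7]
columns = visual_columns(text)         # [0, 1, 1, 2, 3]
```

The caret has five logical stops but only four distinct visual columns: the byte offset moves from 1 to 2 while the drawn caret stays in column 1.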
I disagree that the "Bad #3" example in the "Emoji Modifiers" section is actually bad though - it's the outcome I would expect of an editor.