Having recently tried to implement a good MQTT library for embedded devices from scratch (surprisingly, there is none that can actually do async message delivery), I found the protocol to have a surprising number of shortcomings. The biggest one concerns the actual reliability of QoS1/2 transfers, due to how sessions and message IDs work:
Each message in MQTT carries a 16-bit message ID. The protocol allows retries of message deliveries for QoS1/2, but the revised specifications permit retries only if the session got disconnected.
This raises the question of what a client should actually do when it does not get an ACK within a certain timeframe. Since retries are not allowed, an alternative is to close the connection. That allows the message to be resent, but opens up other problems. First of all, reconnecting when only a single message timed out introduces a huge overhead, which is not what we want for small connected devices.
In addition, there is an issue with the persistent session feature, which requires the client and server to track all message IDs across connection attempts. The implication is that if a client does not get an ACK for a certain message ID, it can never reuse that ID - not even after reconnecting (assuming clean_session=false).
The tiny 16-bit message ID space also requires clients to remember sent messages and prevents them from easily dropping/cancelling pending transmissions that are outdated or superseded. The server might still respond to those IDs, and if we send a new message with the same ID, the response is ambiguous. So technically the client would need to track all used message IDs until they are ACKed. That might be never, which is not reasonable for a small IoT device. Common clients just ignore the problem and reuse message IDs whenever convenient, which means they are not actually reliable. A bigger ID space would have made the problem go away by allowing unique IDs. But with the current spec it seems hard to really provide reliability and non-ambiguity using the MQTT QoS guarantees.
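To make the bookkeeping concrete, here is a minimal sketch of what strict ID tracking would look like. The class and method names are my own, not taken from any MQTT library, and the `max_id` parameter exists only so the behaviour can be shown with a tiny ID space.

```python
class PacketIdAllocator:
    """Hands out MQTT packet IDs and never reuses one that is still in flight.

    Hypothetical helper for illustration; not part of any real MQTT library.
    """

    def __init__(self, max_id=0xFFFF):
        # Valid IDs are 1..max_id; 0 is not a valid MQTT packet ID.
        self._max_id = max_id
        self._in_flight = set()
        self._next = 1

    def _advance(self, value):
        # Wrap from max_id back to 1, skipping 0.
        return value % self._max_id + 1

    def acquire(self):
        """Return a free ID, or None if every ID is still awaiting an ACK."""
        if len(self._in_flight) >= self._max_id:
            return None
        while self._next in self._in_flight:
            self._next = self._advance(self._next)
        packet_id = self._next
        self._in_flight.add(packet_id)
        self._next = self._advance(self._next)
        return packet_id

    def release(self, packet_id):
        """Call when the matching ACK arrives; the ID becomes reusable."""
        self._in_flight.discard(packet_id)
```

An ID whose ACK never arrives stays in `_in_flight` forever, which is exactly the leak described above: the set can only grow unless the client gives up and reuses IDs anyway.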
So my conclusion was that there is no way to provide real reliability with the QoS classes MQTT defines, since real reliability leaves no room for ambiguity.
From my point of view the best way to add reliability on top of MQTT is to add a custom reliability layer on the application layer and just use QoS0 transmissions. Those work somewhat better.
Depending on the application, a different protocol (fully custom, HTTP long polling with a persistent connection, Thrift, gRPC, etc.) might also be a reasonable choice.
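As a rough illustration of the custom reliability layer over QoS0, the payload of each publish could carry its own header with a much wider ID. The frame layout below (1-byte kind, 8-byte message ID, then the application payload) is purely an assumption for this sketch, as are the names `KIND_DATA`, `KIND_ACK`, `encode` and `decode`.

```python
import struct

# Assumed frame layout: 1-byte kind, 8-byte unsigned message ID,
# network byte order, no padding (9 bytes in total).
HEADER = struct.Struct("!BQ")
KIND_DATA = 0
KIND_ACK = 1


def encode(kind, msg_id, payload=b""):
    """Build a frame to send as the body of a QoS0 publish."""
    return HEADER.pack(kind, msg_id) + payload


def decode(frame):
    """Split a received frame into (kind, msg_id, payload)."""
    kind, msg_id = HEADER.unpack_from(frame)
    return kind, msg_id, frame[HEADER.size:]
```

With a 64-bit ID the sender never has to reuse an ID, so a late ACK for a cancelled message cannot be mistaken for an ACK of a newer one. The cost is 9 extra bytes per message.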
I'm pretty ignorant on this topic, but shouldn't retrying and missing messages be handled by the TCP layer? You don't want multiple network layers doing the same work, after all.
Yes, TCP already guarantees reliable byte streaming. However, messages can get lost at higher levels: some MQTT libraries or brokers will drop messages if they are out of memory or some internal queue is full. Or the application-level software does the same, or forgets to explicitly ACK a message (some MQTT libraries delegate all ACK sending to the user and don't handle it in the library). In those cases the remote peer needs to handle the missing ACK in a reasonable fashion.
We can learn from the mechanisms TCP uses for reliable end-to-end packet transmission and adapt them at the application layer for reliable message delivery. For example, any pair of applications that need to exchange messages can use a windowing mechanism to efficiently keep track of which sequential message IDs have been transmitted, which have been acknowledged, and which are yet to be acknowledged. The sender stops transmitting and waits for ACKs when the window of unacked messages is full, puts a timeout on these waits, and resets the window to recover and retransmit. We can also collect performance statistics that provide visibility without much fuss.
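A minimal sketch of such a sender-side window, under a few assumptions: the `send` callable stands in for whatever actually transmits a message (for example a QoS0 publish), ACKs arrive via `on_ack`, and the caller invokes `poll` periodically. All names here are hypothetical, and this variant retransmits overdue messages individually rather than resetting the whole window.

```python
import time


class WindowedSender:
    """Sender side of a simple application-level sliding window."""

    def __init__(self, send, window_size=4, timeout=5.0, clock=time.monotonic):
        self._send = send              # callable(seq, payload), supplied by the caller
        self._window_size = window_size
        self._timeout = timeout        # seconds to wait before retransmitting
        self._clock = clock            # injectable for testing
        self._next_seq = 0             # Python ints do not wrap, so IDs stay unique
        self._unacked = {}             # seq -> (payload, time of last transmission)
        self.retransmits = 0           # simple statistic for visibility

    def try_send(self, payload):
        """Send if the window has room; return the sequence number or None."""
        if len(self._unacked) >= self._window_size:
            return None
        seq = self._next_seq
        self._next_seq += 1
        self._unacked[seq] = (payload, self._clock())
        self._send(seq, payload)
        return seq

    def on_ack(self, seq):
        """Free the window slot; unknown or duplicate ACKs are ignored."""
        self._unacked.pop(seq, None)

    def poll(self):
        """Retransmit every message whose ACK is overdue."""
        now = self._clock()
        for seq, (payload, sent_at) in list(self._unacked.items()):
            if now - sent_at >= self._timeout:
                self._unacked[seq] = (payload, now)
                self.retransmits += 1
                self._send(seq, payload)
```

The receiver would have to ACK each sequence number and discard duplicates it has already seen, since a retransmission can arrive after the original was in fact delivered.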