
Start with rewriting the Azure Python SDK.


Honest response, if somewhat off-topic, since I can't pass this up: How could it be improved/what's been most painful for you?

Full disclosure, I'm one of the <many> maintainers working on a subset of the Azure Python SDK. Currently, there actually _is_ a large-scale rewrite in progress to bring the various SDKs up to a consistent level of quality and Python standards, since it's no secret the original batch of SDKs grew rather organically. As such, this is a VERY APT time to hear this sort of comment. (And yes, I'm taking it totally deadpan even if it wasn't necessarily meant that way :P)

Do feel encouraged to file issues on the azure-sdk-for-python GitHub as well, as things come to mind; there are more formal triage processes there than "I happened to read this over lunch" :)


Hi, I didn't expect such an offer, and the original comment was made in a rantier tone than it should have been, but let's try.

Disclaimer: some (or maybe all?) of the things I'm mentioning here might be caused by the API itself rather than the SDK. The order is the order in which things come to mind, rather than importance.

1. Some kind of strongly typed errors. I've turned the SDK upside down, and the only error that seems to pop up 90% of the time is CloudError with CloudError data. So our SaaS application (which utilizes Azure IaaS heavily) has a bunch of regexes that parse the messages contained in CloudError and throw something from our domain (a rough sketch of that workaround is at the end of this list). Wrapping SDK errors by the implementing clients is a smart thing to do, but not this way :). Some messages change from time to time, so we try to make our regexes more tolerant, but you get the picture. In general, error handling is very strange, with low-level errors sometimes flowing into high-level calls.

2. Documentation with more examples. Sometimes the easiest thing to do is look at the API or CLI documentation and try to replicate it with the SDK, but parameters are named differently, etc.

3. For some reason the Storage API/SDK keeps changing, and it's like it's from another world compared to compute and network, which look pretty similar to each other. The storage portion is... wild.

4. Improved handling of data disks. Updating the VM with an array of disks has shown poor performance (Azure support helped us and said it can't be improved because it's a cross-RM event) but also poor reliability. A separate set of methods to attach/detach disks to VMs would be great (the dance we do today is sketched at the end of this list).

5. Because we heavily rely on asynchronous operations, what we usually do is make a raw HTTP request for an operation (like a VM deployment), serialize it and store it, then after some time deserialize it and check whether the operation is done. We do the same with hundreds of requests at the same time, which lets us process a lot of cloud operations without much in the way of resources on our side :) . It works, but I find the raw request and serialization a bit clunky (rough sketch at the end of this list). We're also stuck with a couple of older SDK libraries because we've noticed the raw responses changed in the newer ones, so now it takes effort to find all of the differences and consolidate the serialization. For example, the headers have had multiple URLs to check the deployment status: sometimes it's `async-url`, sometimes `location-url`, etc. This probably isn't something the SDK is to blame for, and it's a niche case, but serializable async responses would be nice :D .
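
To make point 1 concrete, our wrapping layer is roughly this shape (heavily simplified; the domain exceptions and regex patterns below are illustrative placeholders, not our real ones):

    # Rough sketch of our CloudError -> domain error translation.
    # The exception classes and patterns are made up for illustration.
    import re
    from msrestazure.azure_exceptions import CloudError

    class QuotaExceededError(Exception):
        pass

    class DiskBusyError(Exception):
        pass

    # Hand-maintained regexes against the free-text service messages.
    _ERROR_PATTERNS = [
        (re.compile(r"quota", re.I), QuotaExceededError),
        (re.compile(r"disk .* is (currently )?attached", re.I), DiskBusyError),
    ]

    def translate(exc: CloudError) -> Exception:
        message = str(exc)  # the service message only reaches us as free text
        for pattern, domain_exc in _ERROR_PATTERNS:
            if pattern.search(message):
                return domain_exc(message)
        return exc  # nothing matched: surface the raw CloudError

It works, but every service-side wording change means another regex tweak.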
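
For point 4, the attach dance today is roughly this (simplified; resource names are placeholders, and this is the older azure-mgmt-compute surface we're on):

    # Attaching a managed disk by mutating the VM's data_disks array and
    # pushing the whole VM back. Names and LUN handling simplified.
    from azure.mgmt.compute.models import DataDisk, ManagedDiskParameters

    disk = compute_client.disks.get("my-rg", "my-data-disk")
    vm = compute_client.virtual_machines.get("my-rg", "my-vm")

    vm.storage_profile.data_disks.append(DataDisk(
        lun=len(vm.storage_profile.data_disks),  # naive LUN pick; real code is smarter
        name=disk.name,
        create_option="Attach",
        managed_disk=ManagedDiskParameters(id=disk.id),
    ))

    # One full create_or_update round trip per attach; this is the slow/flaky part.
    compute_client.virtual_machines.create_or_update("my-rg", "my-vm", vm).wait()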
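
And for point 5, the bookkeeping is roughly this shape (simplified; `store` and the token handling are our own, and the header names are just the ones we've seen, which is exactly the inconsistency I mean):

    # Persist the status URL from the raw response, poll it again later.
    import json
    import requests

    def save_pending_operation(raw_response, store, op_id):
        headers = raw_response.headers
        # Which header carries the status URL varies by operation/SDK version.
        status_url = headers.get("Azure-AsyncOperation") or headers.get("Location")
        store.put(op_id, json.dumps({"status_url": status_url}))

    def check_pending_operation(store, op_id, bearer_token):
        record = json.loads(store.get(op_id))
        resp = requests.get(
            record["status_url"],
            headers={"Authorization": "Bearer " + bearer_token},
        )
        # ARM async operations typically report Succeeded/Failed/InProgress
        # in the body; some older patterns rely on 200 vs 202 instead.
        return resp.json().get("status", "InProgress")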

I have to commend the effort to break up the original SDK into independent libraries. Versioning and dependency management became easier than with one monolithic "4.0.0" scheme.

Maybe some of these things are out of scope, and our use case is disproportionately affected by certain issues; it's just my two cents. I'm less satisfied with Azure as a service (its unreliability, sub-par support, idiosyncrasies and poor "hypervisor" performance) than with the SDK, but the SDK gets all the blame because it's the part we interface with.


Hah, feel no regret about the potentially ranty tone; frankly, I often prefer to get feedback from people who are willing/able to rant, since they're both willing to be candid AND clearly have some interesting thoughts built up.

I'm honestly appreciative that you'd give this long a response; I've unironically been sharing it with peers internally for the interesting spread of feedback it gives. I have to give a similar disclaimer of "no promises" BUT...

1. 100% agreed with you. We're currently working through this exact discussion for ServiceBus, actually, so it's VERY apt you mention this now; we've been working with the architects and other-language SDK owners to find proper semantic unification between Python's inclination toward more fine-grained strongly typed errors vs. e.g. dotnet's inclination toward unified exception types with "code"/"reason" fields. I can only speak for the data plane SDKs here (vs. -mgmt, that's a different team), but our architects have within the last 2 days expressed their alignment with the pattern you're requesting. (We explicitly, as a guideline, want users to not have to rely on any ad-hoc string parsing to distinguish exceptions; there's a rough sketch of the direction after this list.) We've also been attempting (to lead into #2) to produce both better samples EXPLICITLY showing common failure modes and "best practice defensive coding", and conceptual guidelines (e.g. here [0]).

2. Preaching to the choir. Samples are a First Class Entity for our Track2 APIs (complete with validation, smoke tests, etc.), used not only as inlined samples in docstrings and refdocs, but as long-form examples of E2E scenarios as you suggest (and for more esoteric subject matter that may not be covered in primary long-form docs).

3. I'm not 100% sure I'm thinking of the right component when you say Storage; do you mean e.g. `azure-storage-blob`? #4 implies you may mean more of the block storage/VM interconnect logic, and unfortunately I can't speak with deep familiarity there, other than to note that the ARM template parameters for configuring this haven't always been "great" (e.g. I think I recall there being a need to format a disk as a separate step? Not sure if that's the sort of thing you're referring to; half thinking out loud, but that's more distant from my area). I would be curious to know specifically which transition was painful, though, and what kept changing, since it's a "core awareness" going forward that migration pain is a Major Friction Point for users ("well duh" says every dev ever), so I'm curious to see if your specific callout is something we'd have been aware of/can impact.

4. See the above re: the core surface area of "working with disks"; and to your later point, some of this may be more service-side than SDK, but your mention of attach/detach for disks is salient and something I can perhaps throw a shout at someone about.

5. To make sure I've understood: you're serializing, like, metadata/a record of the long-running operation and handling it yourself? The track 2 APIs use something called a Long Running Operation poller internally to facilitate this sort of thing (a rough sketch follows this list). It may not solve your scenario depending on how your control flow gets passed off or your need for serialization, so it may not actually change anything tractably for you, but I mention it in case. I'm also curious what you mean by "is a bit clunky", more precisely. And in terms of raw response consistency, yeah, that's not surprising; that's not one of the things I think we pin in backcompat (or even feasibly could, to your point), BUT we may be able to take it more into account if there's any way to offer a continuous experience when this changes and folks were relying on it. Regardless, it's something for me to keep in mind when doing design discussions with the service teams.
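
(Re: 1, to sketch the direction in track 2 / azure-core terms; simplified, and the exact exception types and error codes vary per service:)

    # The goal: typed exceptions plus structured code/status fields,
    # so no message-text regexing. Client objects assumed to already exist.
    from azure.core.exceptions import ResourceNotFoundError, HttpResponseError

    try:
        container_client.delete_container()
    except ResourceNotFoundError:
        pass  # typed: the common case needs no string parsing at all
    except HttpResponseError as exc:
        code = exc.error.code if exc.error else None  # structured error info
        print("delete failed: code=%s status=%s" % (code, exc.status_code))
        raise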
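
(And re: 5, in case it's useful: track 2 pollers expose a continuation token that's just a string, intended for exactly the persist-now-resume-later pattern. I believe the mgmt-plane begin_* methods accept it back via continuation_token=, but do double-check against your SDK versions. Rough shape, with compute_client, vm_params and store assumed to exist on your side:)

    # Persisting and resuming a track 2 long-running operation.
    poller = compute_client.virtual_machines.begin_create_or_update(
        "my-rg", "my-vm", vm_params
    )
    token = poller.continuation_token()  # plain string, safe to serialize/store
    store.put("deploy-my-vm", token)

    # ... later, possibly in a different process ...
    resumed = compute_client.virtual_machines.begin_create_or_update(
        "my-rg", "my-vm", vm_params,
        continuation_token=store.get("deploy-my-vm"),
    )
    if resumed.done():
        vm = resumed.result()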

Glad the library fragmentation is well received. I'll candidly admit I was worried about that (was a user, not a maintainer when it happened, and did the normal grumpy-engineer thing of "they moved my cheese") but it does seem to have worked out nicely and given some good modularity.

Finally: two cents? This was ten cents! I feel like I owe you big time for having given such a well-thought-out response. Thank you, sincerely; you didn't have to do this, and you've done us a big favor.

Absolutely get where you're coming from re: SDK being the public face, and that's what leads me to be eager to try and improve what we can. (And selfishly gotta make sure I can feel proud of the code that has my name attached to it :P ) I cannot tell you how much Reliability resonates with me. If nothing else, I can assure you this dev is in your corner; hardening, reliability, stability are the mantras within my purview. (signed, an-artist-formerly-known-as-SDET.)

[0] https://github.com/Azure/azure-sdk-for-python/tree/master/sd...


Hi, nice conversation going here :) . Let me just follow up on a few points.

3. Correct, I'm referring to azure-storage-blob and azure-mgmt-storage. I'm not so concerned with the unmanaged disks, which use storage accounts (thank god we're past that :D ), more with working with regular blobs. From the initialization of the BlobClient to operations that are sometimes weird in their input parameters, it's not up to par with, let's say, azure-mgmt-compute (a sketch of the kind of flow I mean is at the end of this comment).

4. This point is not related to point 3. I'm referring entirely to managed data disks, which are a first-class* entity (like a NIC, for example; not a Page Blob any more). Attaching them to an existing VM is a nightmare, and because of some features we do that often. They also lack basic things like per-disk disaster recovery (but that is way out of the scope of the SDK).

5. Correct, we're doing it ourselves. The LROPoller also shows up in other parts of the SDK that we use (if it's the same poller), but I don't think we were able to utilize it here. By clunky, I meant that the response itself sometimes has headers with redundant values; we're not sure which one to use, so we just go from experience (if there are two similar ones), etc. I haven't looked at the docs recently, but the "raw" requests are not documented that well.

* - I did say that managed disks are a first-class entity, but we found out the hard way that managed disks and managed snapshots actually just live in some storage accounts on the backplane, and when something goes south, storage-level errors leak out to our logic :) . Also, throttling, thresholds and quotas apply as if it were just another storage account, and even worse, we don't control how the disks are distributed between those backplane storage accounts, so we get funky issues from time to time. We solve them by being extra careful and conservative with some operations, but heh, talk about leaky abstractions. This is not an SDK thing; I just wanted to point it out for anyone reading.
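
For point 3, the kind of basic flow I mean (simplified; the connection string, container and blob names are placeholders) - figuring out which client to construct and what each operation wants as input took more trial and error than the compute/network equivalents:

    # Basic blob upload/download with azure-storage-blob (v12-style client).
    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        conn_str=connection_string,
        container_name="vm-artifacts",
        blob_name="image-2021-01.vhd",
    )
    blob.upload_blob(data, overwrite=True)
    payload = blob.download_blob().readall()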


Yeah, agreed that the raw requests were/are somewhat clunky/not well documented (they were added at one point as an escape hatch for when things didn't work as expected/the service did unexpected things for some requests). I would love to hear about what weird input parameters you are seeing in the azure-storage-blob package, however.

And +100 on not having to manage quotas/storage accounts/figure out how many accounts I need to distribute disks across in order to avoid throttling. There are many things I'd rather do :).



