They found out about the failing transistors via colleagues at a conference. Have any of you learned of something of this magnitude in the same way? It got me thinking that I need to interact with my fellow devs more often.
This is a classic thing in industry: they qualify a process that works and performs well, but then the process has to change for reason XYZ, often because it's a bit too expensive or doesn't align with the rest of their processes. The small change turns out not to be that small, and it takes a while to identify, because by the time you catch it you might be further down the line. It would be caught by a QA process rather than a QC process, and that QA step might have been deemed unnecessary at that point because you had no reason to fault the part.
The second part is that some things are rated and verified but not tested extensively. Since you might only have a prototype, you might mistake a component failure for a quirk of the prototype when in fact you have a deeper problem. Tight timelines, plus the fact that you never considered that problem because it shouldn't have been one, can catch you really off guard. This is usually where people testing the same thing in an exotic environment can ring alarm bells for others, and that often happens at conferences...
People often underestimate how badly you can get bitten by little details like these that turn into huge ones.
Depending on the electronics and where the MOSFETs are, if I were them I would probably trash the electronics, take the spare that they have, validate the incoming components, rebuild a control box, and re-integrate it, provided that this is doable. It's expensive, but if you have no choice, it gives you a backup system that you can test code on before pushing it to the actual probe, and it might help with problem solving by letting you do measurements and tests on the actual setup... provided that they have the time and resources. Otherwise I wouldn't YOLO it, given that it might just straight up not work at the moment you need it the most. A little delay is better than nothing, and they can spend the time re-checking parts of the design that might also be weak...
But heh, who am I but a random guy on the internet...
They found it was cured by lemon juice, but they didn't understand the details. Over years, they switched to lime juice (less vitamin C), put it in copper pipes (leaches vitamin C). But ships were faster so there was more fresh food available, masking the problem. Then scurvy starts mysteriously popping up again 100 years after it was first "cured."
Hard to keep track of the effects of all the details in the face of various co-dependent things changing simultaneously. Recipe for surprises.
Yep. Chatting with other practitioners is a powerful way to learn how things actually work. There are tons of things that "everyone" knows that are not well documented, and therefore unavailable to people outside the network.
This is a more-consequential example of the things you can learn by chatting with others; it is an extreme example of, "Hey, are you guys using components from Widget Inc.? Their datasheets are good, but sometimes we get a bad batch."
Those little things can save you a ton of time. In this case, it may have prevented mission-failure.
Part of the blame falls to NASA, too. If the outcome is your responsibility, then open-loop trust of a vendor for a known failure-mode may not be acceptable. Integration rad-hard testing may be requisite.
In the spacecraft environment, qualifying components is very difficult -- there's a good chance that NASA has these MOSFETs on an approved list because they've worked well before and have had few (or known) faults. They're probably not on that list anymore.
This is what forums like this one are for. Ordinary news isn't going to have more than a passing mention of the xz hack, or log4j, or meltdown, or heartbleed. Find (or start) a private group chat for technologists you know to share news like this.
I can't believe the manufacturer didn't alert them and they had to hear it from another customer. Surely the manufacturer wouldn't want to be named as the reason that a spacecraft orbiting Jupiter went dark due to their faulty components.
The article mentions that the defense sector discovered the issue. Rad-hard defense electronics have more stringent TID (total ionizing dose) requirements than space, due to a need to survive in nuclear war scenarios. Space usually caps out at 100 krad, with some very stringent environments needing up to 300 krad. Defense can go all the way up to 1 MRad in some cases.
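To make those tiers concrete, here is a minimal sketch that checks a part's rated TID against the figures quoted above. The tier names and function names are made up for illustration and are not an official classification.

```python
# Illustrative TID requirements in krad, using the figures quoted above.
# The tier names are hypothetical labels, not an official taxonomy.
TID_REQUIREMENT_KRAD = {
    "typical_space": 100,
    "stringent_space": 300,
    "defense_max": 1000,  # 1 MRad expressed in krad
}


def meets_tid(part_rating_krad: float, environment: str) -> bool:
    """Return True if the part's rated TID covers the environment's requirement."""
    return part_rating_krad >= TID_REQUIREMENT_KRAD[environment]


def environments_covered(part_rating_krad: float) -> list[str]:
    """List every tier whose requirement the rating meets, in table order."""
    return [
        env
        for env, required in TID_REQUIREMENT_KRAD.items()
        if part_rating_krad >= required
    ]
```

A part that only holds up to 300 krad would pass both space tiers here and still fail the defense tier, which is the kind of gap described above.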
My guess is the parts failed TID at the more stringent levels, and Infineon didn't follow up with NASA or their contractor because they assumed that NASA was okay with the lower rad tolerance levels typical of space. Usually that would be the case, but Europa Clipper is special because it's going to an extremely harsh radiation environment.
The big question for me is: did the Europa Clipper program order a lower TID and try to upscreen, or did they order the high TID part? If it's the former, it's on NASA. If it's the latter, that's extremely concerning because Infineon should know that nobody orders expensive high TID parts for funsies, and they should have followed up with all customers as soon as they confirmed there was an issue. Just assuming NASA over-specified a part is absurd. The rad hard electronics market is small, everyone knows each other. Trust is king.
Finally, I'm not sure if it's the part in question, but it looks like Infineon discontinued their 1 MRad MOSFETs in 2020, citing low order volumes: https://irf.com/product-info/hi-rel/alerts/fv5-d-21-0004.pdf. In the light of this reporting, I have to wonder if there was more to it than that?
> and Infineon didn't follow up with NASA or their contractor because they assumed that NASA was okay with the lower rad tolerance levels typical of space
It's more likely that Infineon's folks talking to NASA were equally clueless about this change.
Ultimately, NASA bought a part with a specified TID tolerance. Any manufacturer of space qualified parts keeps detailed records of lot acceptance testing as well as who purchased from that lot. The reps interfacing with NASA didn't necessarily need to know that there was a process change, but as soon as test failures below the datasheet spec were communicated from customers and confirmed, Infineon's quality department should have immediately reached out to NASA (or more specifically NASA's contractor working on the electronics).
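As a rough sketch of what that lookup amounts to (the record layout, lot IDs, and customer names below are invented for illustration, not Infineon's actual system): given per-lot purchaser records and confirmed failure levels, finding who to call is a simple join.

```python
from dataclasses import dataclass, field


@dataclass
class Lot:
    """Hypothetical lot record: datasheet TID spec and who bought from the lot."""
    lot_id: str
    datasheet_tid_krad: float
    purchasers: list[str] = field(default_factory=list)


def customers_to_notify(lots: list[Lot], confirmed_failures: dict[str, float]) -> list[str]:
    """Return purchasers of any lot with a confirmed failure below datasheet spec.

    confirmed_failures maps lot_id to the lowest dose (krad) at which a
    failure was confirmed for that lot.
    """
    notify: set[str] = set()
    for lot in lots:
        failed_at = confirmed_failures.get(lot.lot_id)
        if failed_at is not None and failed_at < lot.datasheet_tid_krad:
            notify.update(lot.purchasers)
    return sorted(notify)
```

The point is that the sales reps never need to be in the loop; the quality department can derive the notification list from records it already keeps.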
> Infineon's quality department should have immediately reached out to NASA (or more specifically NASA's contractor working on the electronics).
Is there any actual evidence they didn't reach out to every single buyer of the electronics?
The article goes out of its way to say Infineon did not contact NASA. But even in your description, they would not have, they would have contacted NASA's contractor working on the electronics.
I still go back to "if there was actual evidence that Infineon did not notify who it was supposed to, the article probably would have cited it". There isn't, so they instead cast aspersions.
Instead they make a bunch of hay about a statement from Infineon that seems totally innocuous - they didn't notify people they didn't know about. Shocker.
Look, I actually hate Infineon. I've been forced to try to make their wifi and bluetooth modules work properly before ;-)
But this kind of lazy-at-best journalism doesn't help anyone.
Or maybe those making a living on selling a product claiming certain parameters should raise their voice when those parameters are not met, regardless of whether that product is used for space travel or to turn a kettle on and off, for f's sake.
Also the hallway conversation thing. Most of the time it’s small talk and minor social interaction, every now and then it’s critical out of band information that would not have shown up in normal processes.
To me it's a matter of fostering serendipity, and it's a bit ironic that research has shown conferences to be a great place for serendipity to take place, as that's what happened here.
I experienced this kind of situation, where only by chance conversation was a crisis averted, very much at my last FT. So much that I'm working on a startup for fostering serendipitous communication for remote teams, like private notes from coworkers left on stackoverflow questions (or anything on the web)
Probably inevitable these days, given that hallway conversations are going to be a pretty random thing. Of course, that assumes someone thinks something is important enough to put in chat and doesn't mind putting it out in public. (Ignore $XYZ project that other group is doing. It's got all sorts of problems.)
Your highly usable dashboard will get filled with 99% of worthless fluff just because it's there and somebody feels the need to always say something.
Have you ever been in one of those meetings that just won't finish despite everything being done? Making it written doesn't solve the problem. Instead, it makes it worse.
On point. This is also why good CI/CD automatically alerts users to major issues. Humans just aren't good at paying attention to a long stream of mostly boring information.
Computers are good at this though.
Now the only question is how to automate the spec comparison, so that the parts used are automatically checked against the current spec and any mismatch is flagged.
And that starts with a computer readable spec that is updated by the manufacturer.
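A minimal sketch of what that check could look like, assuming the manufacturer published specs as JSON keyed by part number. The feed format, part number, and parameter names are all hypothetical; no such feed is known to exist.

```python
import json


def find_spec_violations(requirements: dict, vendor_spec_json: str) -> list[str]:
    """Compare our minimum requirements against a (hypothetical) vendor spec feed.

    requirements: {part_number: {parameter: minimum_required_value}}
    vendor_spec_json: JSON text of the form {part_number: {parameter: value}}
    """
    published = json.loads(vendor_spec_json)
    issues = []
    for part, needed in requirements.items():
        spec = published.get(part)
        if spec is None:
            issues.append(f"{part}: no published spec")
            continue
        for param, minimum in needed.items():
            value = spec.get(param)
            # A missing parameter is treated as a violation, not a pass.
            if value is None or value < minimum:
                issues.append(f"{part}: {param} is {value}, need at least {minimum}")
    return issues
```

Run against every feed update, an empty list means nothing changed for the worse, and anything else is the alert that otherwise only arrives via a hallway conversation.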
Yes, all the time. This is normal. Big news of problems travels through back channels. Nobody is gonna announce their big fuckup for the world to see, unless compelled by law, and even then most won't do it. We had to sign NDAs to find out about severe Intel processor silicon bugs. Obviously you're not gonna read that info on Hacker News afterwards.