PCIe trouble with 4TB Crucial T500 NVMe SSD for >1 power cycle on MSI PRO X670-P

bunnie · 2024-12-28T10:43:21 1735382601

Reading the thread it looks like the issue is leakage power on the internal 3.3v line. When the system is off 1.9v is still present. This is not uncommon, although 1.9v is a bit high. A lot of laptops have explicit active pull downs on power supplies to clamp them to zero when power is off to ensure peripherals are not accidentally powered on by stray leakage (because laptops are extremely low power by design and there is not enough stray leakage to bring the power lines down in a sleep state). My guess is main boards might not have this feature because normally there is enough off state loading that it takes care of itself. However, maybe in this case the loading is not enough.

A dirty fix could be to just put a static load on the 3.3v line to ground. I'd start with a 1/4w resistor around 100 ohms and just stick it from 3.3v to ground to see if that soaks up the stray current. If it works just leave it, it's about 0.1 watts of static power and no big deal for a non portable setup.
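A quick sanity check of the "about 0.1 watts" figure, as a minimal sketch. It assumes an ideal 3.3 V rail while the system is on, and the function name is made up for illustration.

```python
def static_load_power(rail_volts: float, load_ohms: float) -> float:
    """Power dissipated by a resistor tied from the rail to ground (P = V^2 / R)."""
    return rail_volts ** 2 / load_ohms


power_w = static_load_power(3.3, 100.0)
# Roughly 0.109 W, so a 1/4 W part has some margin at 100 ohms.
print(f"{power_w:.3f} W")
```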

The larger picture is that the controller on the nvme might not hit its power on reset condition because it may be rated to run at 1.8v (just a guess), so 3.3v is not going low enough for the controller to perceive the system has been power cycled. Usually a supplemental power monitor is needed in those cases to ensure a reset is generated in case of leakage problems like this.

starslab · 2024-12-28T19:06:02 1735412762

Hi! I'm the OP from the Level1Techs thread.

That HDMI power has some grunt behind it. During power-off state with that 1.90v phantom voltage, I put a 48ohm resistor between 3V3 and ground, the phantom voltage only dropped to 1.80v, and the SSD still didn't work when I powered the machine back on.

oneplane · 2024-12-28T19:28:10 1735414090

Depending on the PMIC and the SSD DC conditioning, even 1.2v might be enough for it to brownout/latchup without self-resetting. (or it might power up the PHY partially or in a bad state and never link up)

Try more resistors in series? (or just a bigger one if you have any -- scratch that we needed smaller ;-) ).

starslab · 2024-12-28T20:52:41 1735419161

12 ohms brings the rail down to 1.47 volts, still no SSD. 6 ohms is enough to finally break/trip whatever circuit is allowing this situation, bringing the rail down to 0v in power-off. Of course, that's almost 2 watts of constant draw during the power-on state, so not a long-term solution.
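For anyone who wants to play with these numbers, here is a rough sketch that treats the leak as a voltage source with a series resistance. That model is an assumption, not something measured in the thread, and the helper names are hypothetical. The inputs are the readings reported above: 1.90 V unloaded, 1.80 V at 48 ohms, 1.47 V at 12 ohms.

```python
def source_resistance(v_open: float, v_loaded: float, r_load: float) -> float:
    """Series resistance implied by a voltage-divider reading (assumes a linear source)."""
    return r_load * (v_open - v_loaded) / v_loaded


def load_power(rail_volts: float, r_load: float) -> float:
    """Power burned in the load resistor at a given rail voltage (P = V^2 / R)."""
    return rail_volts ** 2 / r_load


V_OPEN = 1.90  # phantom voltage with no extra load, from the thread
for r_load, v_loaded in [(48.0, 1.80), (12.0, 1.47)]:
    leak_ma = 1000.0 * v_loaded / r_load
    r_src = source_resistance(V_OPEN, v_loaded, r_load)
    print(f"{r_load:g} ohm load: {leak_ma:.1f} mA leak, ~{r_src:.1f} ohm source")

# Power in the 6 ohm load once the 3.3 V rail is actually up.
print(f"6 ohm at 3.3 V: {load_power(3.3, 6.0):.2f} W")
```

The two readings imply different source resistances (a few ohms each), and the 6 ohm load collapses the rail entirely, so the leak is probably not a simple linear source. The sketch is only good for order-of-magnitude estimates.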

oneplane · 2024-12-28T21:00:51 1735419651

Oof, that is a giant leak somewhere. It's really sad we have to go to some shady websites to find schematics for mainboards, otherwise we could just get to the cause of this pretty quickly.

bunnie · 2024-12-30T16:17:18 1735575438

Hmm the usual solution in this case is to put a FET to ground that is turned on when the standby signal goes active. The reason you need such a low impedance is that a linear resistance pulls less current as the voltage goes down so the solution is less effective the stronger the leakage path is.

The amount of leakage is on the high side, so it does point to some device that is wired to the standby power that is probably 'powering' the 3.3v line via some ESD network between the power rail and pads. This is usually indicative of a part that was not chosen correctly for the job, or the correct part was wired to the wrong power bus.

If I were the designer I'd put the resistor in, power down the system, get a cup of coffee, come back and image the system with a thermal imager and the problem chip will be the 'hot one' (slightly warm, if you're unlucky it's hot enough to just feel with your finger).

2 watts of static power is nothing for a beefy PC motherboard but it will need a special heat-sunk resistor which is annoying. The bigger problem is whatever is supplying the leakage is possibly an i/o pad and eventually that thing will fry if you keep pulling power from it (but from the description this feels possibly like a buck regulator that is 'off' and just leaking current, in which case it's fine to draw that amount of power).

It's one of those annoying things where it's hard to say who did what wrong. On the one hand, you could say the motherboard is out of spec by not bringing 3.3v to 0; but in reality 'we all know' that at power off things always hang out around 0.7v (a diode drop above ground) for a very long time without active countermeasures. On the other hand, you can say the drive is out of spec for not going into reset when 3.3v goes out of regulation. I haven't looked at the ATX spec specifically so maybe this edge case is specified, but when you are doing a fully integrated design this is just one of the things you learn to check for in integration testing.

numpad0 · 2024-12-28T22:49:41 1735426181

6 Ohms! Might as well just jumper it(don't)

Does it sound like reverse current through SBD? They have higher reverse current and leaky I-V curve. 3.3V of drop must mean something inline.

starslab · 2024-12-28T21:12:33 1735420353

> scratch that we needed smaller ;-)

Well... Needed smaller in terms of resistance, but needed bigger in terms of power rating, in the interests of not catching fire.

tfwnopmt · 2024-12-28T04:25:25 1735359925

HDMI provides power - that's how old chromecasts can work without a separate power plug.

The comment about NPNs and PNPs is garbage, but there is a design fault with the board - it shouldn't allow HDMI power to flow backwards into the motherboard when the motherboard shuts off. That would likely cause a power rail sequencing issue on the board or SSD, leading to latch-up of various ICs, and non-detection of the SSD on the following bootup.

ssl-3 · 2024-12-28T11:34:08 1735385648

HDMI does provide power, but this is not how Chromecast (or similar) devices have ever been powered.

It supplies 5v at up to 50mA from a sink device like a TV.

That's only a quarter of a Watt, which is perhaps enough for something like an EDID ROM, or maybe a switch or perhaps an extender. It is not enough power to run a Chromecast.

HDMI 2.1b Amendment 1 [0] can supply up to 300mA at 5v, but that specification is only a year or so old. It requires a special cable. And 1.5 Watts maximum isn't enough to run a Chromecast, either. (The intent is to be able to use it to run a somewhat thirstier extender than the earlier specifications would permit.)

0: https://www.hdmi.org/spec21sub/cablepower
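The two power budgets mentioned above are just P = V x I. A tiny sketch using only the figures quoted in this comment (the dictionary labels are made up for illustration):

```python
HDMI_VOLTS = 5.0

# Current limits as quoted in the comment above, in amps.
budgets_amps = {
    "classic +5V pin": 0.050,
    "HDMI Cable Power": 0.300,
}

for label, amps in budgets_amps.items():
    watts = HDMI_VOLTS * amps
    print(f"{label}: {watts:.2f} W")
```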

kalleboo · 2024-12-28T14:05:50 1735394750

> It supplies 5v at up to 50mA from a sink device like a TV.

And USB is also only supposed to supply 100 mA until the device negotiates for more.

But literally every device in the real-world just wires the port to the 5V rail with 2 A overcurrent protection and your "dumb" USB-powered fan gadget can draw as much as it wants without any negotiation.

I can totally see TVs doing the same

mschuster91 · 2024-12-28T21:27:36 1735421256

> But literally every device in the real-world just wires the port to the 5V rail with 2 A overcurrent protection

Except Macs, Macbooks, iMacs, I think also at least the Thunderbolt Display from <very many years ago>. They all have a software overcurrent protection that is very triggerhappy. No negotiation and it will whine and shut the offending device off, and same if the negotiated current draw is exceeded.

Might be worth a try somewhen when I'm rich enough to afford a dynamic resistor bank to verify all the characteristics...

kalleboo · 2024-12-29T09:38:05 1735465085

I've never had any issues running dumb USB loads off any of my MacBooks. Just tested it, no problem running 1.7 A of dumb resistors just soldered to the power pins with nothing on the data pins at all (not even the passive "apple charging" resistors) https://kalleboo.com/linked/usb-dummy-load.jpg

Macs will shut down a port if it goes over 2.4 A (IIRC) without USB-PD negotiation (mainly with the cable rather than the device).

But the USB standard says they should limit to 100 mA without USB 1.x negotiation, and it's not doing that.

userbinator · 2024-12-28T22:08:58 1735423738

I've looked at Macbook (pre M1) schematics; they do the same as any other PC laptop. The USB power switches do not have adjustable current limits.

indrora · 2024-12-29T20:47:05 1735505225

> But literally every device in the real-world just wires the port to the 5V rail with 2 A overcurrent protection

Not quite. To be USB Compliant, you have to do some work here and there. There's about six different options. The most common is overcurrent detection, such as is seen in [1]. There is a whole specification built by USB-IF on how to handle higher current ("battery charge") situations, spurred by apple [2], with all sorts of weird corner cases [3].

Now, USB-C changes that and specifically calls out that a "compliant" downstream device has to negotiate USB PD or declare yourself a USB-2.0 type-C device. [4] It's not uncommon for newer devices that conform strictly to the USB4 specification to not even power a port that hasn't negotiated USB-PD or Legacy PD -- if you encounter devices that get weird when powered via a usb-c to usb-c cable but work fine on a usb a-to-c cable, you've seen someone skimp out on $0.00001 in resistors.

[1] https://www.microchip.com/en-us/development-tool/EVB-USB2514... [2] https://www.usb.org/document-library/battery-charging-v12-sp... [3] https://www.graniteriverlabs.com/en-us/technical-blog/usb-ba... [4] https://community.infineon.com/t5/Knowledge-Base-Articles/Te....

gbil · 2024-12-28T05:37:19 1735364239

>HDMI provides power - that's how old chromecasts can work without a separate power plug.

I still have the first Chromecast released, it doesn't operate without external power plugged in so I'm not sure about the validity of your comment, at least for the chromecast part

kuschku · 2024-12-28T09:31:38 1735378298

The first chromecast actually operated without external power, but it only worked with some TVs.

It's possible yours didn't provide enough power via HDMI, but at least ours worked just fine.

ssl-3 · 2024-12-28T11:47:25 1735386445

It is possible that your memory of a device from a decade ago is faulty. No Chromecast has ever been able to be powered by HDMI alone. That has never been a thing.

You may instead be remembering the fact only some TVs back then were successful at powering the Chromecast without an external power brick, using a USB port on the TV itself to power up the Chromecast.

In applications where this worked (and it often did work, although it also often did not work), it could provide a solution that existed entirely on the back of the TV with nothing additional plugged into the wall.

But it was still [micro] USB that provided the power to the OG streaming stick, not HDMI.

kuschku · 2024-12-28T12:14:19 1735388059

> It is possible that your memory of a device from a decade ago is faulty. No Chromecast has ever been able to be powered by HDMI alone. That has never been a thing.

It is not - I still use my 11yo Chromecast Gen1 today. And it still works fine without USB power (as long as you don't try to play YouTube videos).

altcognito · 2024-12-28T14:11:59 1735395119

I also had this device and would concur it was supposed to work without USB power, but in my experience worked extremely poorly.

lightedman · 2024-12-28T13:19:49 1735391989

"You may instead be remembering the fact only some TVs back then were successful at powering the Chromecast without an external power brick, using a USB port on the TV itself to power up the Chromecast."

I'm looking at my first gen plugged into the ARC HDMI port on my Vizio TV. It is ONLY attached to the HDMI port and nothing else.

486sx33 · 2024-12-28T18:24:13 1735410253

+1, my Vizio powers this as well. It also powers lots of stuff via USB.

Maybe because it’s NOT a smart tv and doesn’t have some crazy android chip SoC to constantly power. I mean obviously you can make a power supply that could do both - or neither. But it likely comes down to price for the manufacturer of the tv

smileybarry · 2024-12-28T15:20:20 1735399220

Right, but I think it wasn't a real intended use case and that some TVs provided amperage over the spec (maybe by accident? simpler circuit bridging the same power pin for USB and HDMI?).

I had the same first gen Chromecast (may even have it lying around somewhere) but it came with explicit directions to use the included power cable, so maybe they updated the included guide some time after release.

photon_rancher · 2024-12-29T01:59:49 1735437589

They probably just provide extra power over the port. It costs extra to design an extra supply for a specific port so it’s probably shared, and likewise also costs extra to current limit each port. So more than likely a cost saving measure

bradfitz · 2024-12-28T06:05:37 1735365937

https://www.hdmi.org/spec21sub/cablepower

rzzzt · 2024-12-28T08:36:07 1735374967

  Connection is the same as attaching an ordinary, "wired" HDMI Cable, except 
  that active cables can only be attached in one direction: One end of the cable 
  is specifically labeled for attachment to the HDMI Source (transmitting) 
  device, and the other end of the cable must be attached to the HDMI Sink 
  (receiving) device. If the cable is attached in reverse, no damage will occur, 
  but the connection will not work.

  HDMI Cables with HDMI Cable Power include a separate power connector for use 
  with source devices that do not support the HDMI Cable Power feature.

This is not your run-of-the-mill HDMI cable for sure.

numpad0 · 2024-12-28T13:54:02 1735394042

No, not that feature. HDMI supported 5V/55mA power out for years. It's meant for EDID ROM chips and maybe HDMI selectors too, not Linux based computers, but some TVs could take it in gross violation of the specification and its spirit.

nosrepa · 2024-12-28T06:40:27 1735368027

And the serial number of that power plug is MST3K-US

LeifCarrotson · 2024-12-28T04:41:26 1735360886

And by "the board" I trust you mean the MSI PRO X670-P WIFI motherboard.

There's nothing incorrect about the behavior of the SSD when it's being operated outside the prescribed voltage and power thresholds.

If there's a trickle (and to be clear, the 5V at 300 mA available from an HDMI cable is a trickle for a full motherboard) of current into the 3V3 bus on the ATX connector, something will be the very lowest PMIC to turn on. It's just that on this system, the SSD was the first thing. If anything, the SSD will probably be highly tolerant of brownouts because its LDO will run at around 1.9V.

Dylan16807 · 2024-12-28T07:17:18 1735370238

> There's nothing incorrect about the behavior of the SSD when it's being operated outside the prescribed voltage and power thresholds.

I'd put some more emphasis on "when", though. If it never comes back when power comes back that's not particularly correct.

crest · 2024-12-28T12:34:41 1735389281

That's because if this theory is correct from the point of view of the SSD there was no reboot yet, because there was never any total power loss.

smileybarry · 2024-12-28T15:22:24 1735399344

It should still handle PCIe probing and (logical) reconnection without a reboot, though, e.g.: PCIe redirection for a VM.

Dylan16807 · 2024-12-28T13:33:38 1735392818

It handles warm reboots without power loss just fine, so it deciding now it needs to wait for power loss seems like a flaw.

wtallis · 2024-12-29T02:34:57 1735439697

If the SSD reacts to the start of a brown-out with supply voltage dropping way below spec as a signal that an unplanned power loss is happening, then it may do an emergency flush and shutdown that leaves it simply waiting for power to finish dropping to zero. It makes at least some sense for the drive to not try to wake up from that state without a clean power cycle.

Dylan16807 · 2024-12-29T05:21:49 1735449709

I think "makes at least some sense" and "not particularly correct" can be true at the same time.

hulitu · 2024-12-28T06:59:32 1735369172

> There's nothing incorrect about the behavior of the SSD when it's being operated outside the prescribed voltage and power thresholds.

It shall set itself in Reset state.

shadowpho · 2024-12-28T17:09:35 1735405775

Only a few devices are actually able to do that. The vast majority require proper voltage sequencing, because to do otherwise is to add cost to your IC.

LeifCarrotson · 2024-12-28T13:41:54 1735393314

That would be nice, but in practice, the SSD requires its power rails to start up in a particular sequence and with very particular voltages.

magic_smoke_ee · 2024-12-28T04:46:01 1735361161

The reality is retail PC electronics, like much consumer electronics with short lifespans, are designed/engineered and manufactured more-or-less like disposable e-waste garbage. Eevblog Dave or Bigclive might be able to get to the bottom of the circuit or manufacturing design error, albeit with some help if it turns out to be a digital-or-up-the-stack issue.

KeplerBoy · 2024-12-28T10:07:05 1735380425

meh, I rarely have electronics fail these days. Whatever corners designers are cutting seem perfectly adequate to be cut to make stuff affordable.

lazide · 2024-12-29T09:56:09 1735466169

The rise of mass-produced cheap ICs with somewhat reasonable behavior is the cause. It’s cheaper to add some logic to something when you’re making a million or more of them than when it’s an additional couple of discrete components and an additional circuit you need to add yourself.

globnomulous · 2024-12-28T20:18:53 1735417133

My office stereo has physical connections between the following devices (simplifying a bit)

- Speakers connect via speaker wire to monoprice 7x200 amp

- Monoprice amp connects via RCA to denon x3800h

- X3800h receives HDMI from desktop computer and sends HDMI to a monitor.

- Same computer connects via Displayport to the same monitor

I used to hear an infuriating buzz when my 2080TI started to work hard. It changed depending on the screen output, GPU strain, and mouse activity but was constant. It acted like a combination ground loop cum coil whine.

The first fix I discovered was to ground my monoprice amp to the 2080 TI PCB by wrapping one end of the exposed-copper (12 awg, I think) grounding wire through and around one of the holes in the board and attaching the other end to the Monoprice amp's grounding pin.

This fixed the issue completely.

Then I realized I could fix the issue more elegantly and eliminate the need for grounding: I removed the grounding wire and replaced my normal HDMI and Displayport cables with fiber optic HDMI and Displayport cables. The buzz has never recurred.

I've never delved further into the problem, but my conclusion is the same as yours: there's a design fault somewhere on the board, which is causing electricity to flow in ways it shouldn't. I'm using an MSI z690 ddr4 edge wifi board. Same brand, same generation, as the board where this guy is having his SSD power issue.

I still hear a weird, loud buzz through the stereo (including a separate amp and separate pair of speakers) when my partner runs her hair dryer upstairs, even though my stereo runs on its own separate circuit, so regardless of the design issues in the board, there's definitely also an issue in my electrical system.

tinfever · 2024-12-28T21:44:49 1735422289

Interestingly, the PCIe 8-pin power cable into a GPU doesn't carry all of the return current. If you put a current clamp meter around the +12V wires and then the ground wires, you'll measure more amps on the +12V wires than the ground wires. This means some of the return current goes through the PCIe slot into motherboard and makes its way back to the PSU. This lets the GPU create audio noise because GPUs draw high current pulses at the frame rate of your monitor, which means the return current through the motherboard has high current pulses, which can create ground bounce on the motherboard where the ground voltage level moves up and down and that can affect other devices in the system.

I don't totally know how that noise would travel over the ground shield of the HDMI cable into the analog section of the Denon receiver though. Maybe some of that GPU return current is going through the HDMI cable, through the Denon receiver to mains earth, and then through your building wiring back to the ATX PSU? Grounding is freaking weird.

globnomulous · 2024-12-29T05:05:29 1735448729

Oh, wow, yeah, that's really interesting. I don't understand electricity or know nearly enough about electrical engineering to be sure I understand the effect or flow you're describing, but if I (dimly) grasp what you're saying, it would explain exactly the behavior I observed.

Grounding really is incredibly weird (and, again, I say this as someone who is shamefully ignorant of electrical principles). It's no surprise that some 'audiophiles' become so superstitious about electricity. Its behavior in a stereo can be mysterious. Just looking at an amp funny seems like enough to cause a ground loop.

transpute · 2024-12-28T20:50:27 1735419027

Power conditioner can improve AC isolation

https://www.amazon.com/Furman-AC-215A-Conditioner-Auto-Reset...

https://surgestop.com/surge-products/m-474.html

globnomulous · 2024-12-28T21:05:34 1735419934

Thanks, this is great advice. I'm using two SurgeX SX 2120-SEQ power conditioner+sequencers -- one for the desktop devices and one for the stereo.

I'm baffled that, even with the conditioners and even though I'm on a separate circuit in my office, the hairdryer is still able to do something to affect the electricity in my office.

alduin32 · 2024-12-29T04:38:56 1735447136

> the hairdryer is still able to do something to affect the electricity in my office.

This may indicate that your neutral line is undersized and/or damaged.

globnomulous · 2024-12-29T05:06:12 1735448772

How could I test this?

alduin32 · 2024-12-29T09:22:20 1735464140

A first thing to test would be that your voltages are nominal, but the exact details depend on how many phases are coming from the transformer, how they are wired, and whether you are on a TT, TN-C-S or other kind of grounding system, which depends mostly on where you live. Also, you need to take your voltages both at low impedance (simulates a load) and at high impedance (negligible load, "classical" meters are generally high impedance).

Generally, you want to measure the voltage difference between live and neutral depending on the load. However, depending on the tools you have access to, taking this reading properly can be a bit tricky, both because simple high-impedance multimeters can easily be tricked by ghost voltages caused by bad connections and induction from other cables, and also because understanding what to measure requires knowing how the electrical system is wired.

If you know you are in a TT system with 240V between Live/Neutral, I can tell you my procedure for inspecting neutrals. In a two-pole TN-C-S system with 120V between L1/Neutral and 240V between L1/L2, I suppose it would be similar, except that we'd have to do more tests (both L1 and L2 to neutral, and I imagine also L1 to L2).

EDIT: a first simple check to do is to check, using any multimeter, if there is voltage drop in your office when the hairdryer is in use.

globnomulous · 2025-01-01T12:47:48 1735735668

Thanks so much for the write-up!

0xTJ · 2024-12-28T12:07:15 1735387635

The HDMI source, not the HDMI sink, provides the power at 5 V. As far as I know, every Chromecast required an external power connection.

zamadatix · 2024-12-28T05:40:14 1735364414

On the topic of odd failure modes involving Crucial SSDs and MSI motherboards (though one that seems to actually be the drive's fault) I have a T705 which at some point started only coming up as x2 lanes instead of x4 no matter which board I put it into (with no visible damage or indication as to why, though I did try to wipe down the contact side with some rubbing alcohol anyways).

The particularly interesting part is I have a new x870 motherboard which supports m.2 slot 2 as being 0x, 2x, or 4x CPU direct lanes depending if you want 4x, 2x, or 0x to go to the USB 4 ports respectively. At first it sounds like a good combo - put the drive which wants to run at x2 only in the extra slot where x2 only mode is a reasonable tradeoff and still get great bandwidth because those lanes are pcie 5 and not through the chipset. For whatever reason though that drive only ever comes up in an x4 slot (at x2 speed) but not any x2 slots I've tried. I don't know enough about PCIe to assume why that is for sure but it seemed odd to me it was any way but "something is wrong with the 3rd or 4th lane and setting the slot to x2 lets the first 2 work at x2 the same as when the slot is set to x4 and it only comes up as x2".

tfwnopmt · 2024-12-28T08:25:35 1735374335

I came across this in a manual/datasheet:

>16. Link Width Negotiation in the Presence of Bad Lanes

>In an effort to maximize the link width when one or more lanes of a multi-lane link are not functioning correctly (i.e., reliable communication of training sets across the lane is not possible), PES64H16G2 down-stream switch ports automatically attempt a lane reversed configuration when doing so has the potential to enhance the achievable link width. For example, if lane 1 of a x4 link is not operating correctly, the device's downstream switch port attached to the link attempts a lane reversed configuration to form a x2 link using lanes 2 and 3 (Figure 7.4(d)). If the link partner accepts the lane reversed configuration, the optimal x2 link will be formed using lanes 2 and 3. If the link partner does not accept the lane reversed configuration, but instead requests a lane configuration supported by the PES64H16G2 (e.g., x1 link using lane 0), the device accepts the configuration and forms the reduced width link. Otherwise, if the lane numbering agreement fails, the device automatically re-trains the link from the Detect state. During this re-training, the PES64H16G2 port does not re-attempt a lane reversed configuration, but rather tries to form the link without reversing the lanes. As a result, a x1 link is formed using lane 0 (Figure 7.4 (e)).

My guess is it's likely a bad BGA solder ball on Lane1, or possibly ESD damage if you took the SSD out and molested it or rubbed it on a cat right before it broke. Does it indicate it's using reversed lanes?
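To make the quoted passage a bit more concrete, here is a toy model of that fallback logic. It is only a sketch: real link training is a negotiation between both ends, and the function name, the `partner_accepts_reversal` flag, and the "contiguous lanes from one edge" rule are simplifying assumptions, not anything taken from the datasheet.

```python
from typing import Optional, Set, Tuple

WIDTHS = (4, 2, 1)  # candidate link widths, widest first


def widest_link(
    bad_lanes: Set[int],
    total_lanes: int = 4,
    partner_accepts_reversal: bool = True,
) -> Optional[Tuple[int, bool]]:
    """Return (width, lane_reversed) for the widest link avoiding all bad lanes.

    Toy rule: a normal link of width w uses lanes 0..w-1, and a lane-reversed
    link uses the top w lanes instead.
    """
    for width in WIDTHS:
        if width > total_lanes:
            continue
        normal = set(range(width))
        flipped = set(range(total_lanes - width, total_lanes))
        if not (normal & bad_lanes):
            return width, False
        if partner_accepts_reversal and not (flipped & bad_lanes):
            return width, True
    return None  # no usable configuration in this model


# Lane 1 bad on a x4 port: reversal rescues a x2 link on lanes 2 and 3.
print(widest_link({1}))
# Same fault, but the partner refuses reversal: falls back to x1 on lane 0.
print(widest_link({1}, partner_accepts_reversal=False))
```

In this model a healthy x4 device gives `(4, False)`, and the two fault cases mirror the datasheet's Figure 7.4(d) and (e) descriptions. It says nothing about why a drive would fail to come up at all in a x2 slot.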

zamadatix · 2024-12-28T13:47:26 1735393646

Nice digging, that lines up perfectly with the observed behavior! I'll have to poke around and see if anything indicates that's the operational mode to be sure.

The failure mode was that one day I just noticed it was copying sequential data from another drive slower than it normally did. Don't recall it ever having been touched after install (it is the heatsinkless variant of the T705 4TB mounted on the motherboard m.2 heatsink for that slot). Temps always reported quite reasonable, even when under stress bench load (which was rare, the drive was just a secondary drive for loading games). Since then it's been popped between about 10 boards in confusion though haha. No cat yet!

magicalhippo · 2024-12-28T05:56:35 1735365395

PCIe devices are required to boot up using x1 lane only, and then negotiate further lanes with upstream.

AFAIK it shouldn't matter if they're direct to CPU or not, at least not logically.

I note the drive is Gen5 capable, does it negotiate x2 5.0 lanes or something else?

zamadatix · 2024-12-28T13:37:38 1735393058

Negotiates to 2x 5.0 so long as the board it's plugged into supports it. 2x 4.0 or 3.0 otherwise. Hadn't tested even lower.

lizknope · 2024-12-29T01:43:56 1735436636

I just got a Crucial T700 last month which is a PCIE Gen5 x4 NVMe M.2 drive.

I put it in an ASUS PRIME Z890M-PLUS motherboard with an Intel Core Ultra 7 265K

Started to install Fedora Linux version 41. The drive would just completely disappear from the OS and the kernel would report I/O errors on a missing device. Sometimes this happened during the initial install. Sometimes 5 minutes after the install when starting a terminal. I couldn't even type "ls" because the "ls" command is on the drive that went away.

Saw reports of PCIE Gen5 incompatibilities so I moved it to a Gen4 slot and then it worked.

But the machine had so many other random crashes and errors reported in system logs saying "This is a hardware error not software" and stuff like that. Returned it all.

Just got an AMD Ryzen 9 9950X and Gigabyte X870E AORUS PRO

The Gen5 drive seems to be working at Gen5 speeds.

lspci -vv shows

02:00.0 Non-Volatile memory controller: Micron/Crucial Technology T700 NVMe PCIe SSD (prog-if 02 [NVM Express])

                LnkSta: Speed 32GT/s, Width x4
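If you want to check the negotiated link from a script rather than by eye, a small sketch like this can pull the speed and width out of `lspci -vv` text. The regex and function name are my own, and it assumes the `LnkSta:` line looks like the one above (optionally with a parenthesised note such as "(downgraded)" after the speed).

```python
import re
from typing import Optional, Tuple

# Matches e.g. "LnkSta: Speed 32GT/s, Width x4" or "Speed 16GT/s (downgraded), Width x2".
LNKSTA_RE = re.compile(
    r"LnkSta:\s*Speed\s+([\d.]+)GT/s(?:\s*\([^)]*\))?,\s*Width\s+x(\d+)"
)

# Per-lane transfer rate in GT/s mapped to PCIe generation.
GEN_BY_SPEED = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def parse_lnksta(lspci_text: str) -> Optional[Tuple[float, int]]:
    """Return (speed in GT/s, lane count) from the first LnkSta line, or None."""
    match = LNKSTA_RE.search(lspci_text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


sample = "                LnkSta: Speed 32GT/s, Width x4"
speed, width = parse_lnksta(sample)
print(f"Gen{GEN_BY_SPEED[speed]} x{width}")  # Gen5 x4
```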

sebazzz · 2024-12-28T07:00:03 1735369203

I have something similar with my webcam, which is connected to my Samsung monitor usb hub, which is connected to a usb-c dongle, which is connected to my work laptop.

If my laptop crashes during a Microsoft Teams call, possibly due to the webcam, it will not show up in Windows again without it physically being disconnected from the USB hub in my Samsung monitor. I can disconnect the USB-C dongle or the monitor from USB, change ports, power off the laptop, it doesn't matter because that doesn't work. Only physically disconnecting and reconnecting it makes it show up in device manager again.

userbinator · 2024-12-28T08:26:52 1735374412

This is a good cautionary story of why random parts-swapping can be a waste of time and money. Getting out the DMM and measuring voltages is something fewer and fewer people know how to do when troubleshooting electronics, but it certainly saved the OP here; I'd go a little further and figure out why the monitor seems to be leaking power into its HDMI input when switched off --- possibly an ESD-damaged MOSFET or similar?

The issue does not occur when the monitor is connected via DisplayPort.

https://en.wikipedia.org/wiki/DisplayPort#DP_PWR_(pin_20)

Standard DisplayPort cable connections do not use the DP_PWR pin.

There's also an interesting paragraph there, about some nonstandard cables connecting that pin through.

Arcanum-XIII · 2024-12-28T10:00:55 1735380055

Not all DMMs have probes small enough to connect to the lane, if it's even possible. What's more, you need to know where to put them, which can be daunting without the proper knowledge. Switching hardware is easier, faster and often the best solution in those cases.

Finding a hardware fault is hard. Tracing it is even harder.

userbinator · 2024-12-28T10:11:32 1735380692

I think there's something wrong with your DMM probes if you can't measure the ATX power connector with them.

hamandcheese · 2024-12-29T18:01:03 1735495263

On the other hand, I recently fried a motherboard while trying to probe it with a multimeter. My fat fingers shorted out two adjacent pins, causing a loud spark and magic smoke.

BearOso · 2024-12-28T15:43:19 1735400599

Since we're talking SSDs, I wonder if we could get some attention to the Phison E18 degradation issue [1]. Only one manufacturer, Kingston, has put out firmware containing Phison's fix, while the others just ignore it.

A bunch of these drives with this controller were on sale during black Friday, so a lot more people are going to have problems in a month or so.

1. https://www.reddit.com/r/pcmasterrace/comments/1f1piwf/psa_p...

ciupicri · 2024-12-29T01:26:46 1735435606

Kingston doesn't seem to offer any support for Linux, so their new firmware is virtually non-existent to me. Why can't I just download the firmware and use standard nvme-cli tools to update the SSD, beats me. If Seagate (which by the way uses Phison E18 too) can do it, so can Kingston, Samsung, Crucial, Western Digital and many others.

Even better would be to use the Linux Vendor Firmware Service (https://fwupd.org/).

userbinator · 2024-12-29T01:07:37 1735434457

That sounds like NAND degradation (retention failures) which can only be partially worked around in firmware (and causing more write cycles on already-marginal QLC). Unfortunately the real solution is "use better NAND", which is unlikely to happen unless enough people demand it.

ciupicri · 2024-12-29T01:32:57 1735435977

Kingston KC3000 supposedly uses Micron 176L TLC memory [1].

The Seagate Firecuda 530 datasheet clearly says "Built with a Seagate-validated E18 controller and the latest 3D TLC NAND". A review is more precise: "Phison PS5018-E18" & "Micron B47R 176-layer 3D TLC NAND" [2].

[1]: https://www.tomshardware.com/reviews/kingston-kc3000-m2-ssd-...

[2]: https://www.kitguru.net/components/ssd-drives/simon-crisp/se...

userbinator · 2024-12-29T03:12:55 1735441975

B47R is indeed TLC, rated for only 1000 cycles (and 35k in SLC mode, at 1/3 the capacity.) There's also the question of whether this is "true" Micron NAND, or SpecTek which is basically Micron's rejects (and rated for even fewer cycles; only 300 in the case of their B16A.)
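To put those cycle ratings in perspective, here is a rough sketch of the arithmetic. It uses only the cycle counts quoted above; the 1 TB capacity and the write amplification factor of 1 are assumptions for illustration, not figures from the thread or from a datasheet.

```python
# Rough endurance arithmetic from the cycle counts quoted in the comment above.
# Assumptions (not from the thread): a hypothetical 1 TB drive and a write
# amplification factor of 1, so host writes equal NAND writes.

TLC_CYCLES = 1_000             # rated program/erase cycles in TLC mode
SLC_CYCLES = 35_000            # rated program/erase cycles in SLC mode
SLC_CAPACITY_FRACTION = 1 / 3  # SLC mode exposes a third of the TLC capacity


def total_writes_tb(capacity_tb, cycles, capacity_fraction=1.0):
    """Total data (in TB) that can be written before the rated cycles run out."""
    return capacity_tb * capacity_fraction * cycles


capacity_tb = 1.0  # hypothetical drive size
tlc_total = total_writes_tb(capacity_tb, TLC_CYCLES)
slc_total = total_writes_tb(capacity_tb, SLC_CYCLES, SLC_CAPACITY_FRACTION)

print(f"TLC mode: about {tlc_total:.0f} TB of writes")
print(f"SLC mode: about {slc_total:.0f} TB of writes "
      f"({slc_total / tlc_total:.1f}x the TLC figure)")
```

Under those assumptions, SLC mode gives up two thirds of the capacity but still allows roughly an order of magnitude more total writes.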

bb88 · 2024-12-28T22:55:40 1735426540

So these guys [1] mention something similar where HDMI from a TV is backfeeding 40-50 volts into a cable box. This could be because of many things from electrical outlet wiring to power supply issues on the monitor to a bad component on the monitor giving a high voltage, or the monitor is badly grounded, etc, etc.

I read the original thread but it doesn't look like you've measured the voltage at the HDMI port wrt motherboard ground. I think we're assuming it's 5 volts, but it could be higher, and it could have shorted (or weakened) a component on your motherboard. And that would explain why a 100 ohm resistor didn't give a meaningful voltage drop.

If you need an isolation solution, Amazon sells a 50ft fiber optic one way HDMI cable [2]. The thing I don't know is if there's any actual copper to provide power over the link. There are other options which transmit the HDMI signal over pure multimode fiber as well [3].

Or you can go with a DP KVM, since you're on L1T, they sell a few DP models. I have one I purchased from L1T, and I like it a lot.

Definitely though I would check out the outlets to make sure they were wired correctly. Incorrectly wired outlets because someone tried to DIY it in the US is absolutely a problem.

[1] https://www.avsforum.com/threads/hdmi-cable-backfeeding-volt...

[2] https://www.amazon.com/HDMI-FURUI-HDCP2-2-18Gbps-Subsampling...

[3] https://fibercommand.com/products/8k-fiber-plugs?gQT=1
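One way to sanity-check the "didn't give a meaningful voltage drop" observation is to treat the leakage path as an ideal voltage source with a series resistance and plug in the measurement reported earlier in the thread (1.90 V with no added load, 1.80 V with a 48 ohm resistor). This is a sketch under a big assumption: the real path probably runs through protection diodes and is not linear, and the 0.3 V target below is a made-up threshold, not a datasheet value.

```python
# Thevenin-style estimate of the leakage source from two voltage readings.
# Assumptions: the leakage behaves like a linear source (it probably does not),
# and V_TARGET is a hypothetical "low enough to reset" level.


def source_resistance(v_open, v_loaded, r_load):
    """Series resistance implied by the drop from v_open to v_loaded."""
    return r_load * (v_open - v_loaded) / v_loaded


def loaded_voltage(v_open, r_source, r_load):
    """Voltage across r_load when fed through r_source (voltage divider)."""
    return v_open * r_load / (r_source + r_load)


def load_for_target(v_open, r_source, v_target):
    """Load resistance needed to pull the rail down to v_target."""
    return r_source * v_target / (v_open - v_target)


V_OPEN = 1.90    # phantom voltage with no added load (from the thread)
V_LOADED = 1.80  # phantom voltage with the 48 ohm resistor (from the thread)
R_LOAD = 48.0    # ohms
V_TARGET = 0.3   # hypothetical target, for illustration only

r_s = source_resistance(V_OPEN, V_LOADED, R_LOAD)
v_100 = loaded_voltage(V_OPEN, r_s, 100.0)
r_needed = load_for_target(V_OPEN, r_s, V_TARGET)
p_on = 3.3 ** 2 / r_needed  # dissipation in that load once the 3.3 V rail is up

print(f"implied source resistance: {r_s:.2f} ohm")
print(f"expected rail voltage with a 100 ohm load: {v_100:.2f} V")
print(f"load needed to reach {V_TARGET} V: {r_needed:.2f} ohm "
      f"({p_on:.1f} W when the rail is at 3.3 V)")
```

If the linear model is anywhere near right, the source is only a few ohms, so a 100 ohm resistor barely moves the rail, and a resistor small enough to clamp it would burn tens of watts whenever the machine is on. That points toward removing the back-feed or adding an active pull-down rather than a static load.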

starslab · 2024-12-28T23:43:31 1735429411

I already own one of those fiber-hdmi cables. Brilliant, but sometimes doesn't interoperate with DVI devices using passive DVI -> HDMI adapters. I've no idea if it has any copper conductors for HDMI power, though one end is labelled for the source and one for the display, suggesting that however it's designed it's not bi-directional.

I'd love a DisplayPort KVM, but not every device that comes across my bench has a DisplayPort output, and those few that have DisplayPort but no HDMI can be accommodated with one of those commodity DisplayPort -> HDMI adapters. This situation is actually getting worse over time, not better, as many modern devices and laptops are skipping DisplayPort in favor of USB-c alt-mode.

This issue has actually persisted through a monitor change on my testbench. It has happened with a Samsung SyncMaster 204T through my KVM switch, an HP ZR24w through my KVM switch, and the ZR24w directly connected. I don't think this is an issue with the rest of my equipment.

This electrical was done about 15 years ago, by a ticketed electrician. One of those $5 plug testers indicates all is well, and I have no reason to believe there's any issue here.

By almost pure coincidence, I have an MSI PRO X870-P motherboard on order. I'm looking forward to seeing if this same 3V3 leakage issue is present on this board too.

amelius · 2024-12-28T15:48:22 1735400902

I have a similar problem with a Jetson board. If I turn off the power long enough (one night) and then turn it on, the only PCI card is not recognized and I have to power-cycle it to get it running.

structural · 2024-12-28T21:05:32 1735419932

Mind sharing what board/Jetson module you've seen this on? I've seen this exact symptom very intermittently on a custom board and we've wondered for a long time if it was an issue with a specific type of module (or manufacturing lot of modules).

amelius · 2024-12-28T22:06:00 1735423560

This one: https://www.avermedia.com/professional/product-detail/D315%2...

My startup logic now power-cycles it until the PCI board is recognized; it works, but it's not a great solution.

structural · 2024-12-28T22:18:46 1735424326

Interesting, we're using a completely different module (Xavier NX). And the same, disgustingly hacky, fix, of forcing a reset until it works.

amelius · 2024-12-28T22:27:49 1735424869

I also run these commands:

    echo 1 > /sys/bus/pci/rescan
    sleep 1

Sometimes it brings the PCI card back, so I just run this as part of my boot sequence.
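For anyone wanting to script the same workaround, a minimal sketch of "rescan until the card shows up" might look like the following. The `/sys/bus/pci/rescan` file and the per-device `vendor` and `device` files are standard Linux sysfs; the function names, the retry counts and the IDs in the usage comment are made up for illustration, and writing to `rescan` needs root.

```python
import time
from pathlib import Path


def pci_device_present(vendor_id, device_id, sysfs_root=Path("/sys")):
    """Return True if a PCI device with these IDs (e.g. "0x1234") is listed."""
    devices_dir = sysfs_root / "bus" / "pci" / "devices"
    if not devices_dir.is_dir():
        return False
    for dev in devices_dir.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
            device = (dev / "device").read_text().strip().lower()
        except OSError:
            continue  # entry vanished or is unreadable; skip it
        if vendor == vendor_id.lower() and device == device_id.lower():
            return True
    return False


def rescan_until_present(vendor_id, device_id, attempts=5, delay_s=1.0,
                         sysfs_root=Path("/sys")):
    """Trigger PCI bus rescans until the device appears or attempts run out."""
    rescan = sysfs_root / "bus" / "pci" / "rescan"
    for _ in range(attempts):
        if pci_device_present(vendor_id, device_id, sysfs_root):
            return True
        rescan.write_text("1\n")  # same effect as: echo 1 > /sys/bus/pci/rescan
        time.sleep(delay_s)
    return pci_device_present(vendor_id, device_id, sysfs_root)


# Hypothetical usage (IDs are placeholders, look yours up with lspci -nn):
#   ok = rescan_until_present("0x1234", "0x5678")
```

A rescan only helps when the link trained but enumeration missed the device; if the card never came out of reset, a power cycle is still needed.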

qingcharles · 2024-12-28T07:59:53 1735372793

I hate faults like that.

Used to work in PC repair. Man brings in PC, mouse right click doesn't work. Everything else operates perfectly.

Replaced in this order: mouse, IO card, hard drive with fresh OS, RAM, CPU, graphics card, motherboard. Still no right-click.

Replaced the PSU last. Right-click works. FML.

Frenchgeek · 2024-12-28T09:34:04 1735378444

You didn't have to replace the house's wiring at least (Happened to an aunt of mine: Gave her a computer, it worked perfectly outside of her home. The electrician was a tad horrified. She still scoffed when I suggested the computer wasn't the problem first.)

Moru · 2024-12-28T09:47:25 1735379245

I plugged my old Atari into an outlet in the old basement in a different building. The HDD-cable started burning.

The electric company plugged in some device to measure power over time. Turns out the power was slightly below normal but within tolerances. The OEM power supply that was powering my Atari wasn't up to standards. If I remember right, badly designed PSUs can feed too high a current if the voltage is too low. Or something like that, it was a very long time ago...

ajb · 2024-12-28T10:08:07 1735380487

Many switch-mode power supplies will increase their current draw if the voltage drops; that's why many of them will work on both 120 and 240V, while old-school power supplies need a manual switch. I had a brownout once and thought my washing machine was broken because that was the only thing that stopped working (until evening, when I switched on the lights). That was back in the days of incandescents; oddly, though, LED lights still dim with lower power. I don't know how they do voltage conversion.

We have so many cheap power supplies in our houses that it would not surprise me if at least some become unsafe if the source voltage drops too low. Being unsafe with only a slight drop is weird though.
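The "draws more current when the voltage drops" behaviour follows from a switch-mode supply acting roughly as a constant-power load. A quick sketch of the arithmetic, assuming a hypothetical 100 W load and a flat 90% efficiency (both made-up numbers; real efficiency varies with input voltage and load):

```python
def input_current_a(load_w, mains_v, efficiency=0.9):
    """Approximate input current (A) of a constant-power supply.

    Assumes the output power stays fixed and the efficiency does not change
    with input voltage, which is only roughly true in practice.
    """
    return load_w / (efficiency * mains_v)


LOAD_W = 100.0  # hypothetical load

for mains_v in (240, 120, 100):  # nominal high, nominal low, brownout
    amps = input_current_a(LOAD_W, mains_v)
    print(f"{mains_v:>3} V mains: about {amps:.2f} A")
```

Halving the mains voltage doubles the input current, which is why an undersized input stage or cable runs hotter during a brownout.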

qingcharles · 2024-12-29T23:29:41 1735514981

Yeah, lucky it didn't get that far, "Sorry bro, I gotta try just replacing your house."

donalhunt · 2024-12-29T10:45:56 1735469156

Reminds me of an old hwops story where one machine just constantly failed despite replacing every part on the tray multiple times. The conclusion was that the tray was bad.

Google's definition of a server was (and still is afaik) based on the tray (chassis) so there was no way to replace it. IIRC it was "retired" with vengeance leaving a gap in the cabinet — a warning to other trays to behave.

ksec · 2024-12-29T07:26:59 1735457219

>Replaced the PSU last. Right-click works. FML.

In my experience, always replace the DRAM first, then the PSU, and then swap the motherboard.

I don't think people realise how many faults there are with DRAM, PSUs and motherboards. DRAM quality has gotten a lot better in the past 10 years, so that is less of an issue. The PSU, however, is where the cost cutting happens, and more often than not it causes problems.

jauntywundrkind · 2024-12-28T04:35:23 1735360523

I can't get my Crucial P3+ to wake from sleep.

I'd like to dig in more but I haven't had this issue with any other SSD in this system. Pretty close to saying I'm done with Crucial.

wtallis · 2024-12-28T04:50:38 1735361438

Is this on a Linux system? NVMe power management has always been hit or miss for consumer SSDs under Linux because the SSD vendors don't write their firmware against the NVMe spec, they write it to work with the Microsoft Windows NVMe driver and any feature Windows doesn't use is liable to be broken. This applies to basically every SSD brand, by the way.

jauntywundrkind · 2024-12-28T06:23:31 1735367011

Yes, it's an NVMe.

Western Digital & OCZ nvme drives have both worked fine in this system, so I'm feeling a bit salty about this. Would like to try some Samsung drives at some point.

(Running Linux 6.11.7 atm.)

NewJazz · 2024-12-28T04:40:17 1735360817

I've had a similar experience with a crucial nvme drive, but a kernel update seems to have introduced a quirk-based fix. Not sure how much of a kludge that fix is, though.

wtallis · 2024-12-28T04:58:16 1735361896

The quirks tables in the Linux NVMe drivers are impressive and depressing:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

And they're not even close to being comprehensive.

fulafel · 2024-12-28T05:11:55 1735362715

Interesting that there are also some anti-quirk special cases in the vendor combo function (second link above), so a certain platform is excepted from the quirk workaround:

    /*
     * Exclude some Kingston NV1 and A2000 devices from
     * NVME_QUIRK_SIMPLE_SUSPEND. Do a full suspend to save a
     * lot fo energy with s2idle sleep on some TUXEDO platforms.
     */
    if (dmi_match(DMI_BOARD_NAME, "NS5X_NS7XAU") ||
        dmi_match(DMI_BOARD_NAME, "NS5x_7xAU") ||
        dmi_match(DMI_BOARD_NAME, "NS5x_7xPU") ||
        dmi_match(DMI_BOARD_NAME, "PH4PRX1_PH6PRX1"))
        return NVME_QUIRK_FORCE_NO_SIMPLE_SUSPEND;
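To see whether a given machine would hit that exclusion, the board name the kernel compares against is exposed in sysfs. The sketch below mirrors only the board-name comparison from the excerpt, as an exact and case-sensitive string match; it does not reproduce the SSD vendor/device check that surrounds it in the driver, and the helper names are made up.

```python
from pathlib import Path
from typing import Optional

# Board names copied from the kernel excerpt above.
EXCLUDED_BOARDS = ("NS5X_NS7XAU", "NS5x_7xAU", "NS5x_7xPU", "PH4PRX1_PH6PRX1")


def read_board_name(dmi_dir=Path("/sys/class/dmi/id")) -> Optional[str]:
    """Return the DMI board name, or None if it cannot be read."""
    try:
        return (dmi_dir / "board_name").read_text().strip()
    except OSError:
        return None


def is_excluded_board(board_name: Optional[str]) -> bool:
    """True if the board name exactly matches one of the listed boards."""
    return board_name is not None and board_name in EXCLUDED_BOARDS


board = read_board_name()
print(f"board name: {board!r}, on the exclusion list: {is_excluded_board(board)}")
```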

wtallis · 2024-12-28T05:21:33 1735363293

I think some of those issues probably stem from the fact that there's not really any alignment between the NVMe spec and the PCIe spec with respect to power management capabilities. I've encountered drives that have implicit dependencies where certain NVMe power management features only work as intended when certain PCIe power management features are available, but there's no way for the drive to express those requirements to the host system, and no standard compliance test suite that will reveal the broken behavior that can occur in the wild.

Sometimes figuring out who to blame for misbehaving hardware requires custom kernel patches, a hardware protocol analyzer at the M.2 slot, and reverse-engineering the motherboard firmware. Most of the entries in the quirks tables are based on a lot of guess-work and inferences because the kernel developers don't have the resources to fully investigate and reproduce these kinds of issues (and the hardware vendors simply don't care about thoroughly ironing out these bugs). It really sucks when you have to look at power flow out of the laptop battery and try to figure out from that whether your SSD is pulling more power than it should.
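For the battery-side measurement, Linux exposes the discharge rate through the power_supply class in sysfs. A minimal sketch, assuming the battery is named `BAT0` (it varies by machine) and that the driver reports either `power_now` in microwatts or `current_now` and `voltage_now` in microamps and microvolts:

```python
from pathlib import Path
from typing import Optional


def battery_power_w(bat_dir=Path("/sys/class/power_supply/BAT0")) -> Optional[float]:
    """Return the current battery power flow in watts, or None if unavailable."""

    def read_int(name):
        try:
            return int((bat_dir / name).read_text().strip())
        except (OSError, ValueError):
            return None

    power_uw = read_int("power_now")  # microwatts, when the driver provides it
    if power_uw is not None:
        return abs(power_uw) / 1e6

    current_ua = read_int("current_now")  # microamps
    voltage_uv = read_int("voltage_now")  # microvolts
    if current_ua is not None and voltage_uv is not None:
        return abs(current_ua) * voltage_uv / 1e12

    return None


print(f"battery power: {battery_power_w()} W")
```

Sampling this with the machine idle, once with the suspect SSD installed and once with a known-good one, gives a crude difference figure. It measures the whole system, so it only hints at what the drive itself is drawing.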

jandrese · 2024-12-28T08:06:13 1735373173

Oh yeah, and in some cases if the system attempts to go into S2 sleep it simply bricks the SSD forever. I lost a whole lab worth of drives once before I figured it out. The vendor was the opposite of helpful, refusing to acknowledge the problem and then washing their hands of it and walking away. The only solution I've found requires a hardware modification of the drive, downloading a rip of the vendor's internal repos from a sketchy Russian website, building a new firmware from scratch, and then flashing it with some custom hardware.

fulafel · 2024-12-28T05:32:24 1735363944

Wow. I guess this also explains some of the s2idle troubles: with S3 sleep there are vendor-tested motherboard+peripheral combos that are shown to work with the power states attempted by suspend, and any hw/fw bugs get troubleshot before they make it out of the vendor's lab.

hulitu · 2024-12-28T07:03:11 1735369391

That would explain why, sometimes, my Linux will not find the NVMe SSD when booting. (MSI mobo with Kingston SSD.)

doubled112 · 2024-12-28T15:27:53 1735399673

I have a pair of ASUS Vivobook laptops with Kingston NVMEs.

While running the factory install of Windows, those NVMEs would cause a BSOD every third or so boot. Clean install didn't help either, nor any driver or firmware update.

No Linux install has shown any signs of problems.

chupasaurus · 2024-12-28T15:02:37 1735398157

Model or at least year of that SSD? Early on, Kingston used faulty controllers that randomly fail to initialize and degrade with power cycles.

hulitu · 2024-12-28T16:00:30 1735401630

Since last year.

Astronaut3315 · 2024-12-28T15:33:48 1735400028

I returned a Crucial P3+ after I discovered a massive performance degradation with Bitlocker. It was slower than spinning rust. Seems these drives have some unresolved firmware issues.

okanat · 2024-12-28T15:08:23 1735398503

I bought the same model SSD for my Thinkpad P1 last month and saw the exact issue. I had to return it because it was breaking the NVMe detection completely. So it wasn't a broken unit but a design issue after all?

bdavbdav · 2024-12-29T14:41:44 1735483304

I had the same on an AORUS X570. DisplayPort cables with the power line connected at both ends (it shouldn’t be, but on many it is) would cause BIOS resets, corruption and memory retraining.

blagie · 2024-12-29T19:22:03 1735500123

I was an exclusive user of Crucial for memory and storage until about a year ago. My general thought was that:

- It would give me a trusted supply chain, since the company makes the silicon; and

- I would have a credible company standing behind it, one which wasn't likely to want to tarnish its reputation by cutting corners.

The thinking was very much along the lines of "No one got fired buying IBM." And I think it was pretty correct for most of the past quarter-century. Historically, storage had a lot of counterfeits and shenanigans, and a credible vendor was nice. Price/performance for memory was adequate; there was a modest premium.

However, post-2020, I bought a defective Crucial DIMM (and didn't find out it was defective until I was past the return window). The RMA experience was strange. Crucial said they could either:

- Replace it with an inferior part with different, slower timings, which may or may not have worked in my system

- Give me a quickly-expiring store credit for "fair market value" (never disclosing what that was, and stopped responding to emails when I asked)

Neither of these was helpful at all.

Reading online, there were many similar stories, unfortunately. They seem to be going the same direction as Sandisk / Western Digital. I replaced it with a cheap TeamGroup DIMM which worked without problems.

I'm not quite sure what to do about the continued enshittification. There seem to be almost no credible brands left.

sciencesama · 2024-12-29T22:24:04 1735511044

I have a similar issue with an NVMe drive in the WLAN slot on a Lenovo ThinkPad Gen 8!

geor9e · 2024-12-28T06:50:26 1735368626

Why's a random tech support forum post from yesterday with 2 people replying getting reposted to HN?

ejiblabahaba · 2024-12-28T15:23:59 1735399439

For what it's worth, this post just helped me explain several years of failure to wake from sleep state, across several different MSI-based machines, when I've connected them to an HDMI port in my TV. I think this debug is interesting in its own right, and unlike 99% of the content on this website, it was directly and immediately useful to me. I doubt I'm the only one, too.

transpute · 2024-12-28T16:41:31 1735404091

This post described a rare interoperability failure with unexpected root cause, of possible interest to:

  Motherboard designers
  People upgrading PCs/laptops
  SSD firmware developers
  BIOS developers attempting PCIe device boot
  OS/hypervisor developers attempting PCIe device reset

If you don't like this HN story, you could contribute your first story to HN.

aprilnya · 2024-12-28T07:38:06 1735371486

I personally found it interesting.

frantathefranta · 2024-12-28T15:19:49 1735399189

Slow week but people probably enjoy the methodical troubleshooting.

undertaken · 2024-12-28T21:07:05 1735420025

anecdotal/weird computer experience:

I have a rebadged Tongfang laptop (NB02 GMxRGxx w/ Ryzen 9) and upgraded it shortly after purchase.

The machine arrived with lower capacity Samsung SODIMMs. Swapped in 64GB of Crucial DDR5.

Shortly afterwards the machine became unstable to the point of RMA. Kernel logs clogged with all sorts of panics related to NVMe, PCIe, and filesystem. Freezing. Reboots.

Spent hours diagnosing it. Many permutations of kernel command line arguments; pcie, acpi tables, iommu. All for naught.

The machine passed memtest86 / memtest86+ with flying colors.

bonnie++ absolutely trashed it. reliably.

Occasionally the NVMe drives fell off the PCI bus and it wouldn't boot until I disabled the slot in the BIOS, power cycled, then re-enabled the slot.

Fast forward to me getting fed up with a dysfunctional system: I attempted an RMA and gave them the rundown of all the weird, seemingly chipset-related failures.

They pushed back with "Try our RAM again."

I nearly had an aneurysm when everything was stable again.

After thanking the support staff profusely I bought larger capacity Samsung DIMMs in the same chip family. Still running flawlessly after almost a year.

Maybe try new RAM for yucks? ;)