Memory Mapping an FPGA from an STM32 (serd.es)
157 points by hasheddan 4 months ago | 64 comments



Be veeeery careful. The STM32H7 QSPI peripheral is FULL OF very nasty bugs, especially the second version (supports writes) that you find in STM32H7B chips. You are currently avoiding them by having QSPI mapped as device memory, but the minute you attempt to use it with cache or run code from it, or (god help you) put your stack, heap, and/or vector table on a QSPI device, you are in for a world of poorly-debuggable 1:1,000,000 failures. STM knows but refuses to publicly acknowledge them, even though they privately admit some other customers have "hit similar issues". Issues I've found, demonstrated to them, and written reliable reproductions of:

* non-4-byte-sized writes are randomly lost (about 1 in a million writes) if QSPI is writeable and not cached

* non-4-byte-sized writes are randomly rounded up in size to 2 or 4 bytes with garbage, overwriting nearby data (about 1 in a million writes) if QSPI is writeable and cached

* when PC, SP, and VTOR all point to QSPI memory, any interrupt has about a 1 in a million chance of reading garbage instead of the proper vector from the vector table if it interrupts an LDM/STM instruction targeting the QSPI memory and it is cached and misses the cache

Some of these have workarounds that I found (contact me). I am refusing to disclose them to STM until they acknowledge the bugs publicly.

I recommend NOT using STM32H7 chips in any product where you want QSPI memory to work properly.
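
For anyone stuck with these parts in the meantime, the obvious (if slow) way to sidestep the first two failure modes is to never issue a narrow write to the mapped region at all - do a read-modify-write of the containing 32-bit word instead. A minimal sketch (base address hypothetical; this is not one of the undisclosed workarounds mentioned above):

    #include <stdint.h>

    /* Hypothetical base of the memory-mapped QSPI region. */
    #define QSPI_BASE 0x90000000u

    /* Write one byte using only aligned 32-bit bus accesses: read the
     * containing word, patch one byte in RAM, write the word back. */
    static void qspi_write_byte(uint32_t offset, uint8_t value)
    {
        volatile uint32_t *word =
            (volatile uint32_t *)(QSPI_BASE + (offset & ~3u));
        uint32_t tmp = *word;                    /* aligned 32-bit read  */
        ((uint8_t *)&tmp)[offset & 3u] = value;  /* modify the copy, not the bus */
        *word = tmp;                             /* aligned 32-bit write */
    }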


What the hell is going on at ST? Every STM uC I've tried to use in the past few years has had showstopper bugs with loads of very similar complaints online dating back to the release of the part. Bugs that have been in the wild for years and still exist in the current production run.

After burning enough company time chasing bugs through ST's crappy silicon, I've had to just swear them off entirely. We're an Atmel house now. Significantly fewer (zero) problems, and some pretty nifty features like UPDI.


In college, our SoC design instructor told us that to pass the class, our modules should be better than ST's "which is not that high of a bar" :P


It seems endemic with embedded devices. Only big customers get the true list of errata, and of course the errata are random PDFs rather than a useful format. Even just having them on an ftp site with all the errata in one spot would save so much pain!


Sometimes you don't even get a PDF. Their Ethernet drivers for Cortex-M7 have been broken for years with subtle cache coherency bugs and the only discussion of it is a forum thread with now obsolete example code.


Oh no! I luckily never dealt with the Ethernet drivers, but I had plenty of small issues with other drivers.


They churn out new parts and don't bring in fixes. See all the chips in their lineup that have a USB host controller. Every one of them (they use Synopsys IP) will fail with multiple LS devices through a hub. We talked to our FAE about this and they have no plans to fix it. The bug has existed for years and the bad IP is being baked into all the new chips still. Solution? Just use yet another chip for its host controller, and don't use a hub.


Have you tried PSoC parts? How does that stack with other microcontrollers?


Thanks for the heads up. I have a design at fab that uses the H7's OctoSPI so this concerns me. I steered away from the memory mapped mode because it seemed too good to be true - wanted to be able to qsort() and put heaps in this extra space.

I suspect ST only ever tested it with the single PSRAM part they intend this mode for. My intent is to use indirect mode and manually poke the peripheral, though DMA will still have to happen.

Back on the PIC32MX platform there was a similar type of bug that, as far as I know, nobody else has documented: if any interrupt fires while the PMP peripheral is doing a DMA, there is a 1 in a million chance that it will silently drop 1 byte. I noticed this because all my accesses were 32-bit (4 bytes) and broke horribly at the misalignment. The solution is to disable all interrupts while doing DMA.
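
For reference, the workaround is just a critical section around the transfer. A sketch assuming XC32's interrupt builtins and a hypothetical pmp_dma_transfer() helper that blocks until the transfer completes:

    #include <stdint.h>

    /* Hypothetical helper that starts a PMP DMA and blocks until it finishes. */
    extern void pmp_dma_transfer(void *dst, const void *src, uint32_t len);

    void pmp_dma_transfer_safe(void *dst, const void *src, uint32_t len)
    {
        /* XC32 builtin: globally disables interrupts, returns prior Status. */
        uint32_t status = __builtin_disable_interrupts();

        pmp_dma_transfer(dst, src, len);   /* no interrupt can fire mid-DMA */

        if (status & 1u)                   /* IE was set on entry (Status bit 0) */
            __builtin_enable_interrupts();
    }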


It is worse: I think they also did not test random access. I suspect their test was to fill PSRAM linearly and then read it back and verify linearly. Random word accesses in uncached mode also randomly lose writes. I am unable to replicate this quickly on purpose, only randomly, so I guess it is under 1 in 100 million, which is why it is not in my list above. My workarounds avoid these crashes too, though.


I have encountered issues with QSPI (mostly caused by the annoying prefetch queue), which is why I am switching to the FMC for FPGA interfacing (i.e. not using OCTOSPI). That was the whole point of this experiment, validating FMC as a replacement for my legacy OCTOSPI-based MCU-APB bridge. I have a previous board using QSPI reliably in indirect mode (i.e. not memory mapped), but found it was full of pain when memory mapped, specifically with writes. So that firmware memory maps it for reads but switches to indirect mode for writes, and has cache disabled.

So far I have it working quite reliably: my test firmware does a loopback test at boot with 100K reads/writes of a 32-bit register (originally written with the intent of using it for link training of the PLLs to optimize read/write capture timing, though I never ended up using it as such), and my iperf test can push tens of thousands of packets per second without issue.
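
The loopback test is nothing fancy - roughly this shape, with the scratch register address in the FMC window being hypothetical:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical read/write scratch register in the FPGA, visible
     * through the FMC's memory-mapped window. */
    #define FPGA_SCRATCH_REG (*(volatile uint32_t *)0x60000000u)

    /* Walk a pseudorandom sequence through the register, verifying
     * every readback against what was written. */
    int fmc_loopback_test(void)
    {
        uint32_t pattern = 0xdeadbeefu;
        for (int i = 0; i < 100000; i++) {
            pattern = pattern * 1664525u + 1013904223u;  /* LCG step */
            FPGA_SCRATCH_REG = pattern;
            uint32_t readback = FPGA_SCRATCH_REG;
            if (readback != pattern) {
                printf("Loopback fail at iteration %d: wrote %08lx, read %08lx\n",
                       i, (unsigned long)pattern, (unsigned long)readback);
                return -1;
            }
        }
        return 0;
    }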


The NXP IMXRT-series chips have a similar EMC (external memory controller) as well as "FlexIO" - PIO-like programmable IO. I've used both for this kind of FPGA interface without issue.

The IMXRT1064 is around $7 and is also an M7 core with an HS USB PHY, programmable PLL-connected LVDS clock output, 2 EMACs, excellent hardened IP generally.


I have some RT1176's in my "to try" pile.

The big thing holding me back was that their crypto accelerators were all locked behind NDAs (a dealbreaker for F/OSS work) while the ST ones are documented in the freely downloadable datasheet you can just google up.

But I did find some third party wrapper libraries that seemed to be able to use the crypto registers so it might be possible to figure things out from that. I haven't tried yet.

The other issue I had with the RT is that they lack internal flash, so PCB complexity is slightly higher than with an STM32.


> I have some RT1176's in my "to try" pile.

Keep in mind the dual-core 11xx chips are a bit harder to boot than the rest of the line - but you probably need the power domain flexibility for most FPGA projects (1064 has way fewer practically-usable 1v8 banks.)

> crypto accelerators were all locked behind NDAs

I've been able to use every bit of hard IP and high-assurance boot from registers using no vendor code whatsoever.

Here's what you are looking for:

https://github.com/JayHeng/imxrt-level2-boot/blob/master/dev...

> The other issue I had with the RT is that they lack internal flash

The IMXRT1064 has a 4MB Winbond QSPI chip in-package, by the way!

> PCB complexity is slightly higher than with an STM32.

The Xilinx FPGA that is sitting next to your MCU incurs multiple orders of magnitude more PCB-complexity than a little QSPI flash, haha.


> 100K reads/writes of a 32-bit register

You'll hit almost no bugs if you keep accessing the same address in a loop. Lucky you :)


Yeah but again, we're talking about the FMC here not the OCTOSPI.

Have you hit issues with the FMC? From what other people are telling me, the OCTOSPI is full of land mines and the FMC is pretty decent. The worst erratum I've encountered so far is two dummy clocks with CS# asserted at the end of a read burst.


lol, I'm working on an FMC-FPGA interface at work right now and discovered this same chipselect behavior.


It's a documented erratum, 2.6.1 on page 8 of ES0491.


As far as using QSPI memory, one thing I have planned (and will be thoroughly testing) is using an external SPI flash as configuration data storage. Right now if I want to store any nonvolatile settings with power loss protection I need to burn two 128 kB erase blocks (one primary and one secondary, so I can ping-pong data between them and not lose anything if I have a power loss during a write cycle or similar) of the on-chip flash, space that I'd much rather use for firmware.

MicroKVS expects to be able to memory map data fetches (uncached), but is fine with using indirect access for writes.
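
For the curious, the A/B scheme is the standard one; a sketch of the block-selection logic (this is not MicroKVS code, and the addresses, magic value, and header layout are all hypothetical):

    #include <stdint.h>

    #define BLOCK_A_ADDR 0x081C0000u   /* hypothetical on-chip flash addresses, */
    #define BLOCK_B_ADDR 0x081E0000u   /* one 128 kB erase block per copy */
    #define CONFIG_MAGIC 0xC0FF1234u   /* hypothetical valid-marker, written last */

    struct config_header {
        uint32_t magic;      /* only valid once the payload is fully written */
        uint32_t sequence;   /* generation counter, increments on each save */
        uint32_t crc;        /* CRC over the payload */
    };

    /* Pick the newest block with an intact header. A power loss mid-write
     * leaves one block stale but valid, so the previous config survives.
     * (A real implementation would verify the CRC too, not just the magic,
     * and fall back to defaults if neither block is valid.) */
    static const struct config_header *active_config_block(void)
    {
        const struct config_header *a = (const void *)BLOCK_A_ADDR;
        const struct config_header *b = (const void *)BLOCK_B_ADDR;
        int a_ok = (a->magic == CONFIG_MAGIC);
        int b_ok = (b->magic == CONFIG_MAGIC);

        if (a_ok && b_ok)
            return (a->sequence > b->sequence) ? a : b;
        return a_ok ? a : b;
    }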


But if I can memory map the FPGA via the FMC, I can simply put an APB memory mapped QSPI controller on the FPGA and store my config there, using the same flash for the FPGA bitstream as well.

This saves a chip on the board, reduces the amount of PCB routing required, and eliminates use of the sketchy OCTOSPI peripheral entirely. Testing that out is on my list of things to do on this board eventually.


I almost always include I2C EEPROM - just too cheap and pretty easy to route.


That can't be memory mapped, so I'd need to rewrite my KVS code which currently expects to be able to return a pointer to the raw on-flash image of the config data. Doable but a pain.


I recommend checking out SpinalHDL generally - I do a ton of this very same kind of work with these same chips (7 series, US+) and would never look back to Verilog!

AXI (and all memory-mapped bus protocol schemes) becomes very very pleasant. SV interfaces get you 5% of the way there, though!

Also - I was under the impression that S1000-2M is a higher-end material, not cost-optimized? (But not Rogers, of course.)


S1000-2 is quite cheap and lossy (Df 0.016), slightly better than Isola 370HR (0.021) but nowhere near the stuff I usually use. At my usual Chinese board house it's one of the lowest cost substrates available for prototypes since it's always in stock and there's no need to special order.

For higher end digital work I typically reach for Taiwan Union TU872SLK (Df 0.009) which also has a better range of prepregs and glass styles available to help minimize fiber weave effect. Still quite a bit lossier than e.g. RO4350B but far less expensive and if you have decent equalizers on your SERDES the difference is typically not significant unless you're making some kind of humongous backplane. I get wide open eyes with just a tiny bit of post-cursor emphasis on the TX FFE at 10.3125 Gbps on TU872SLK for my typical shortish high speed tracks (FPGA to SFP+ cage).


Also S1000-2 is not rated/controlled past 1GHz. It shouldn't vary that much so for small runs the risk is minimal. But for volume production that's exactly the sort of thing you never want to have to investigate in hindsight.


Curious who you are using in CN for higher-speed FPGA boards, if you can share!

I haven't seen these as directly-advertised options at any of my usual suspects.


Multech (multech-pcb.com) is my preferred manufacturer these days for high end stuff. I've done six layer HDI any-layer via stackups, ten layers with filled via-in-pad, RO4350B, TU872SLK, flex, 75 micron trace/space, etc. And that's nowhere near the limit of their capabilities, I just haven't needed higher end yet.

I have some 25/100G stuff in the pipe for probably some time next year that I plan to make with them too.

Their website undersells, I get the impression most of the actual sales contacts are word of mouth. I talk to my sales rep by Skype mostly (the alternatives are expensive international phone calls or WeChat).

The really cool thing is that you get a 10+ page QA report with every order including measured copper/dielectric/soldermask thicknesses, hole sizes, ionic contamination measurements, and a ton of other metrics. And they send the TDR strips and polished cross section with every order as their way of saying "look, we actually did the QA, double check our measurements if you don't trust us". (I actually have repeated some of the measurements to spot-check and got results within a few percent of their QA department, no surprises there).

And they don't make silent gerber changes or anything. They do a full CAM review and send you working gerbers and a list of suggested DFM tweaks for you to sign off before beginning manufacture. If something doesn't look right you have a chance to say "wait there's a problem".

For example, one time they wanted to make a really large width adjustment for impedance on some RF traces that I had carefully modeled in an EM solver. But they didn't make a bad board without telling me, they flagged it on the CAM review and we went back and forth before realizing the mistake was on their end (they had calculated impedance assuming solder mask over the traces, while they were actually exposed copper). They re-ran the numbers which then closely matched my simulations, I signed off on the modified design, and the board was manufactured without issue.


While we're talking about this sort of architecture, I'd like to plug Elixir.

For some development hardware, we had Elixir running on the ARM of a Zynq Ultrascale, running in tandem with some digital logic. It required one C code "port" that integrated with UIO to expose the registers to our application and then we had a great programming environment.
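
The userspace half of that UIO port is pleasantly small - open the UIO node and mmap() the register window, then everything is just pointer dereferences. A sketch (the /dev/uio0 path and 64 kB map size are assumptions that would come from the device tree):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAP_SIZE 0x10000u   /* assumed size of the PL register window */

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        volatile uint32_t *regs = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        printf("reg[0] = 0x%08x\n", (unsigned)regs[0]);  /* read a hypothetical register */
        regs[1] = 1;                                     /* poke another */

        munmap((void *)regs, MAP_SIZE);
        close(fd);
        return 0;
    }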

Elixir for embedded doesn't get talked about that much, but that is actually the origin story of Erlang (software component of telephony hardware). Basic language features like binary pattern matching work very well, and the concurrency approach makes it very easy to write clean, performant real-time software. We had a lot of functionality that used digital logic and then had the stateful stuff in software, and it worked very well.

Plus, I could then do stuff like trivially spin up a web UI with a graphical display of all the register state, updating live (Phoenix LiveView), and be happy that running it wasn't going to interfere with the realtime stuff.

We did this using Nerves, which is a Linux platform set up to boot the BEAM and nothing else (e.g. no init system, just a special PID 1 binary that boots the BEAM and lets it handle all other processes). It had some plus points, like making firmware upgrades trivial and simplifying the system, but not being a "normal" Linux platform was a bit irritating sometimes. You could equally well just run Elixir as an application normally.


This is dope. I work with Zynq/Versal quite a bit and respect and understand (conceptually) the decisions you have made!

You get to own every aspect of your toolchain and with that will come a lot of power.

Are you familiar with:

https://github.com/corundum/corundum

Perhaps you can build a support package for your platform.


This is really crisp work and nice to see. Before the Zynq era I worked with some designs that used a DSP or StrongARM along with a medium-sized FPGA, where the FPGA would be both the glue logic for RAM as well as custom peripherals, but I've been out of that world for a while. It would be fun to find an application for a big FPGA and a modern microcontroller.


Neat! I love that H7 chip and its gargantuan instruction manual... and you didn't even mention its 2nd core :)


H735 is one of the single core SKUs. Just a 550 MHz M7.

Would not surprise me if the M4 was there and fused off (i.e. same die as multicore H7 offerings), but it's not active.


Probably not. The dual-core parts are DIE450 (which is shared with some single-core parts like the H750 series!), but STM32H735 is DIE483.


I have a H735 on a retired board slated for decap so we'll find out once I open it up.

Do you know if it's fabbed in house, TSMC, or Samsung? I've seen ST silicon from all 3 foundries but the only thing I've seen stated publicly is 40nm. When I get it opened up it should be easy to tell, TSMC and Samsung processes have distinctive features on them that I recognize by sight.


No idea - I'm reading the die IDs out of the STM32Cube DB. I haven't looked at the silicon, but I have no reason to doubt what the DB says, especially since it confirms that a lot of allegedly different parts use the same dies.


> distinctive features on them that I recognize by sight

Now this would be a cool blog post!


Real quick high-level question, sorry: most of your embedded projects going forward are MCU+FPGA to do what? I thought a custom router, but 284 Mbps isn't nearly fast enough for a network.


The intent is for the high performance datapath to live entirely in FPGA (and the project you're probably thinking of is switching, not routing).

The MCU is for control plane only. Several hundred Mbps between the control and data plane is more than enough for an SSH management CLI and poking registers on the FPGA to move a port to a different VLAN in response to a CLI command or add an ACL rule or something.
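
i.e. the whole "move a port to a VLAN" operation on the MCU side collapses to a single register write through the mapped window (addresses and register layout hypothetical):

    #include <stdint.h>

    /* Hypothetical per-port VLAN registers exposed by the FPGA's APB bridge. */
    #define FPGA_APB_BASE    0x60000000u
    #define PORT_VLAN_REG(p) (*(volatile uint32_t *)(FPGA_APB_BASE + 0x100u + 4u * (p)))

    /* "vlan 42" typed into the SSH CLI for port 3 ends up here; the
     * packet datapath itself never touches the MCU. */
    void set_port_vlan(unsigned int port, uint16_t vlan)
    {
        PORT_VLAN_REG(port) = vlan;
    }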


It's a good question. A lot of FPGA projects I see (including some real life products I've looked into recently) don't really need an FPGA. One I was asked to evaluate recently could easily have been done with a microcontroller with PWM outputs. The frequencies involved were well under 40MHz. Yes, there were a couple of multiplications going on in the FPGA, but those could've been easily handled by a microcontroller. An RP2040 would've sufficed instead of what they had - a microcontroller + an FPGA.


The projects in question include things like a 48 port gigabit Ethernet switch with packet datapath in the FPGA, and dual 10/25G SFP28 uplinks. You're not doing that on an MCU. Also higher end oscilloscope work (e.g. 10 Gsps 12-bit JESD204B).

But a STM32 is more than sufficient for the management interface on both.


I think a lot of people don't fully appreciate how fast a modern "microcontroller" is. That 'H735 is probably faster than every computer I had up to and including the iBook G4 I used until early 2009.


I keep running into that also. It's like the common mental model of a microcontroller froze around Y2K as a sort of headless VIC-20. I had an FAE, and a good one, from a major supplier tell me "you can't implement a filter" on a low-end micro that was roughly as powerful as an early nineties DSP.


Cortex-Ms, man. A lot of 'em you just give 3.3V and a couple of bypass caps, and GCC will use the single-cycle hardware MAC (for M4 and above) if you just write the straightforward C code, and you can put it on there in 600ms with DFU. I'm a hobbyist, not an embedded wizard, but it really seems *pretty* good compared to what I understand about the old days.
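
Case in point - the textbook multiply-accumulate loop, which GCC at -O2 for an M4/M7 target turns into hardware multiply-accumulate instructions with no intrinsics or assembly (a sketch, not pulled from any particular project):

    #include <stdint.h>

    /* Plain C FIR filter kernel: no intrinsics, no CMSIS-DSP. On an
     * M4/M7 the inner loop becomes multiply-accumulate instructions. */
    int32_t fir(const int16_t *coeff, const int16_t *sample, int taps)
    {
        int32_t acc = 0;
        for (int i = 0; i < taps; i++)
            acc += (int32_t)coeff[i] * sample[i];
        return acc;
    }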

(Like I like retro stuff and during COVID I bought an old DSP56k dev board with a book about the assembly language but oh boy, oh dear)


They're amazing. And - you can run a PC emulator on an ESP32. Sure, you need the fancy RAM. OK. And then people will tell me an ESP32 can't do things that people definitely were doing on bare-bones PCs in the eighties.


Zynq 7010s are $2.50 and are a hell of a lot more chip than an RP2040. If you already have the design (or copy one of the 50 available), it's a good option when you don't want to fight the chip.

PIO has extraordinarily sloppy timing (skew in all categories) compared to the cheapest and smallest FPGAs.


Where are you getting them for $2.50?? The XC7Z010-1CLG225C is $74.83 at Digikey in qty 1.

Checking sketchier places Win-Source has the CLG400 package for $22.20 and even the cheapest aliexpress seller wants $4.84 for something marked as a 7Z010 that may or may not be legit.

Also "fight the chip" is pretty much the definition of what I did last time I did a zynq project. Just give me a plain FPGA and MCU with no wizards or GUIs or automatic code generation.


https://www.aliexpress.us/item/3256803970893483.html

I've ordered trays (and they send the OEM tray) - unique barcodes, legit.

> Just give me a plain FPGA and MCU with no wizards or GUIs or automatic code generation.

You can pretty much cut out all of their tools and get a pure Yocto/Vivado TCL build for the bitstream for the 7 series Zynqs. Very low touch.

Their IO planner (in the Vivado IP integrator) is somewhat necessary for complex peripheral scenarios and is one of the few things I ever use Xilinx GUI applications for anymore.


I was interested to see if they're legit, or at least what state they're in, so I grabbed a couple. Might try to compare them against some genuine ones with CT and destructive inspection.

On the chance they're half reasonable, thanks for the link.


Let me know if the barcodes are anything but unique/perfect. This has been the case with many vendors, but these chips are cheap enough that you can try 50 vendors.


Will do. Not too worried about trying low cost parts to gain some minimal confidence in a possible source, they're compelling for weekend side-projects at least.

I've previously struggled to roll the dice for higher end parts as the cost difference isn't as extreme and had some obvious reballed parts a few years ago. If they're OK then their $20 XC7K325T will be at the top of my list...



That's low-end Zynq and Artix, and I'm thinking more >676-ball Kintex, though I appreciate any discussion on sourcing.

In the same way that that Aliexpress vendor lists the 7010 parts at 1/10th the price of LCSC, some of their $20-60 listings are also shockingly cheap in comparison.


Ah, missed the K! For the K420T, I pay around $30-45 from CN (not Aliexpress) sources with original barcodes (but these vendors are quite a bit harder to deal with.)


I have some obviously reballed (but well done) aliexpress XCKU5P's that I got for $55 a while back. Haven't tested yet but the price was so good I couldn't resist.


I’ve been bitten before but the risk/reward is probably worth it for parts like that.

Thanks for your write-ups btw, been following glScope development for a while.


I don’t even find Vivado that bad, but maybe I have Stockholm syndrome. Or maybe it’s because I’m forced to use Intel Quartus right now and wish it was Vivado every time.


How do you assemble your board with BGA packages? Or do you procure the parts and then send them somewhere for assembly?


Vapor phase reflow.


What parts do you have to fight? I've been using Zynqs for almost a decade, and I really enjoy them. But that's for personal projects with a lot of freedom, so I'm curious what problems arise in more commercial/professional settings.

Nowadays, I often just wish I had their ARMv8 chips instead of the old 32-bit ARMv7 architecture, because that's showing its age; but that's par for the course of using ARMv7, and it doesn't affect the PL side (much, except for interfacing sometimes).


> Zynq 7010

I've always wanted to do an FPGA project but haven't looked seriously into where to start. Can the Zynq 7010 handle something like data transfer from a 4K image sensor to a USB 3 transceiver?

> PIO has extraordinarily sloppy timing

Do you have any data on this? I was under the impression the PIO has fairly precise timing if you set up your clocks right, but maybe I've been misled here.


> Can the Zynq 7010 handle something like data transfer from a 4K image sensor to a USB 3 transceiver?

The Zynq 7010 is less-than-ideal for this because you'd have to use some kind of USB3 interface PHY - which would increase cost and be pretty limited functionality-wise.

If you use an FPGA with transceivers (some 7 series Artix chips - https://www.lcsc.com/product-detail/Programmable-Logic-Devic..., most 7 series Virtex/Kintex chips, all US/US+ chips), you can implement USB3 without an external PHY: https://github.com/enjoy-digital/usb3_pipe

The Z7010 probably has the area to do this type of translation but not the transceivers. There are other chips in the 7 series Zynq family with capable transceivers, but they are much more expensive ($15-$35 from CN).

> I was under the impression the PIO has fairly precise timing if you set up your clocks right, but maybe I've been misled here.

No measured data, but when I was once implementing JTAG and SPI at 50MHz+ with an extremely overclocked chip, the edges were very inconsistent in relation to each other and in pulse width - 5-15ns (estimating from memory, they were sloppy.)

PIO is very precise within its specified capabilities, this range is just very low compared to cheap FPGAs.


Thanks for the reply and the links! I'm okay with a more expensive FPGA chip if it significantly simplifies the implementation. I'm probably biting off more than I can chew trying to start FPGA dev with something like this, so I'll also look into simpler starter projects.

> PIO is very precise within its specified capabilities, this range is just very low compared to cheap FPGAs.

Good to know, I don't plan on pushing RP2040s to their limit any time soon. They're still excellent for lower speed projects.


Embedded projects are never about doing things as fast as computers: we have full scale computers (and routers, and firewalls, and switches) for that.

Embedded is about solving problems more physical in nature, as you are physically closer to reality in nearly all aspects.

--------

An MCU + FPGA project could implement... say... the VFIR IrDA (Infrared) protocol at 16Mbit.

Traditional IrDA is widely supported at SIR and MIR levels (up to 1.152 Mbit or so). Anything faster and the equipment has basically been lost to the 1990s (and it never was very popular anyway).

I'd explain IrDA as a remote control on steroids. It's infrared-based (like TV remote controls), so you need to line up both devices and have them looking at each other. Infrared can reliably travel about 3 meters over the open air in a variety of conditions. IrDA allows for bidirectional communications. It's a truly wireless protocol, albeit one that requires significant alignment to function correctly. But ~3 meters is good range and practical for many applications.

Nominally, you could use an entire MCU to handle the encoding / decoding of these light pulses. However, that's a bit redundant. It's far more cost-efficient to dedicate a few LUTs in an FPGA to the task.

Yes, the MCU is needed for the final application-level / OSI layer 4/5/6/7 aspects of the IrDA protocol. But the lowest PHY and MAC levels of the protocol can and (probably) should be handled by a small section of the FPGA.

Upgrading from the standard MCU-supported 1 Mbit to 16 Mbit would be a 16x improvement to communications compared to what's readily available with commercial off-the-shelf solutions. If you've determined that IR communications are a good fit for whatever you're doing, maybe that 16x improvement is going to be useful.

------------

EDIT: The "physicality" of this is because photodiodes react very quickly to light pulses, and an expensive enough transistor can amplify that at the ~100MHz speeds needed to run VFIR (at least in theory; I've never done this).

The FPGA (or MCU if you go that route...) just needs to clock at 100MHz or so, and interpret the start-of-frame and end-of-frame signals, while also interpreting a few other low-level details. Overall, this turns the sequence of light pulses into bits-and-bytes for higher-level processing (which code can and should handle).
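
On the MCU side the split could look as simple as this - the FPGA core has already turned light pulses into framed bytes, and firmware just drains a FIFO (register map entirely hypothetical):

    #include <stdint.h>

    /* Hypothetical registers of an FPGA-side VFIR PHY/MAC core. */
    #define IRDA_STATUS         (*(volatile uint32_t *)0x60001000u)
    #define IRDA_RX_FIFO        (*(volatile uint32_t *)0x60001004u)
    #define STATUS_RX_NOT_EMPTY 0x1u

    /* The FPGA handles the ~100 MHz pulse decode and framing; firmware
     * only ever sees whole bytes and implements the upper IrDA layers. */
    int irda_read_frame(uint8_t *buf, int maxlen)
    {
        int n = 0;
        while ((IRDA_STATUS & STATUS_RX_NOT_EMPTY) && n < maxlen)
            buf[n++] = (uint8_t)IRDA_RX_FIFO;
        return n;
    }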


> Embedded projects are never about doing things as fast as computers: we have full scale computers (and routers, and firewalls, and switches) for that.

That's a bit of an overly broad statement.

I have several times at work done projects that use FPGAs to do things faster than a computer can do, e.g. some specific processing of a full 10Gbps UDP stream, which is much easier to get 100% reliable without packet drop in digital logic than it is in software land.

The FPGA+CPU combo allows you to do a LOT of things. I tend to use just a straight Zynq so I already have my memory buses wired up from processing system to digital logic, but this is an interesting architecture.



