Parsing RISC-V assembly (utk.edu)
56 points by azhenley on Oct 26, 2020 | 13 comments


> since the ones I looked at use crude, regex-based parsers that don't maintain information about the structure.

It's assembly, not anything like C, C++, etc. It's simple. You're supposed to be able to handle it with regexes.

> If you are looking for a hand-written lexer and parser for RISC-V assembly that builds a parse tree

Huh? Parse tree for assembly?

> For example, "xor xor, xor, xor, xor" will be parsed just fine, despite it not being legal RISC-V assembly.

Then ... why parse it??????

See, this is what happens when you refer to assembling as compiling. They are not the same.


Yes, if it's a basic assembler.

No, if you're going to implement an assembler that supports constants and symbols which can be combined into expressions. Expressions like that are easy to parse with a small grammar, for example[0]:

    〈primary expression〉->〈constant〉|〈symbol〉|〈local operand〉|@|
                        (〈expression〉)|〈unary operator〉〈primary expression〉
    〈term〉->〈primary expression〉|〈term〉〈strong operator〉〈primary expression〉
    〈expression〉->〈term〉|〈expression〉〈weak operator〉〈term〉
    〈unary operator〉-> + | - | ~ | $ | &
    〈strong operator〉-> * | / | // | % | << | >> | &
    〈weak operator〉-> + | - | | | ^
Your code might be hard to read if you don't build a parse tree for it.

[0] - http://www.mmix.cs.hm.edu/doc/mmixal.pdf
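
For illustration, the grammar above maps almost directly onto a small recursive-descent parser. This is only a sketch under some assumptions: constants are decimal only, local operands are left out, the left-recursive rules are turned into loops, and the names (`tokenize`, `Parser`, `parse`) and the tuple-shaped tree nodes are made up for this example rather than taken from MMIXAL.

```python
import re

# Multi-character operators are listed before their single-character prefixes.
TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|//|<<|>>|[-+~$&*/%|^()@])")

UNARY = {"+", "-", "~", "$", "&"}
STRONG = {"*", "/", "//", "%", "<<", ">>", "&"}
WEAK = {"+", "-", "|", "^"}


def tokenize(text):
    """Split an expression into tokens (decimal constants only, by assumption)."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of expression")
        self.i += 1
        return tok

    def primary(self):
        tok = self.take()
        if tok in UNARY:                      # <unary operator> <primary expression>
            return ("unary", tok, self.primary())
        if tok == "(":                        # ( <expression> )
            inner = self.expression()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if tok == "@":                        # current location
            return ("here",)
        if tok.isdigit():
            return ("const", int(tok))
        if tok[0].isalpha() or tok[0] == "_":
            return ("symbol", tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def term(self):
        # <term> -> <primary> | <term> <strong operator> <primary>
        node = self.primary()
        while self.peek() in STRONG:
            op = self.take()
            node = ("binary", op, node, self.primary())
        return node

    def expression(self):
        # <expression> -> <term> | <expression> <weak operator> <term>
        node = self.term()
        while self.peek() in WEAK:
            op = self.take()
            node = ("binary", op, node, self.term())
        return node


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing token {parser.peek()!r}")
    return tree


print(parse("base + 4 * (n - 1)"))
```

Because `&` is both a unary and a strong operator, the parser decides by position: at the start of a primary expression it is unary, after a primary expression it is binary.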


> Huh? Parse tree for assembly?

Why not? The structure you need to represent something like

    lea eax, [eax + eax * 4]
is tree-shaped. It's not as general as many other parse trees, since you don't have arbitrary nesting. But it is a tree-shaped thing that represents a parse, and if you need a name for it, "parse tree" is reasonable. The list containing all the labels/instructions/whatnot in the program is also a trivial tree.

(I'm aware that RISC-V addressing modes are probably simpler than the x86 example above.)
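
To make the tree shape concrete, here is one way the `lea` line could be represented once parsed. It is a hand-built sketch: the node classes (`Reg`, `Imm`, `BinOp`, `Mem`, `Instr`) and the `evaluate` helper are hypothetical names for this example, not the structure of any particular assembler.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    name: str


@dataclass
class Imm:
    value: int


@dataclass
class BinOp:
    op: str
    left: object
    right: object


@dataclass
class Mem:
    address: object  # the expression inside [ ... ]


@dataclass
class Instr:
    mnemonic: str
    operands: list


# lea eax, [eax + eax * 4]
tree = Instr("lea", [
    Reg("eax"),
    Mem(BinOp("+", Reg("eax"), BinOp("*", Reg("eax"), Imm(4)))),
])


def evaluate(node, regs):
    """Walk an address expression and compute its value from register contents."""
    if isinstance(node, Reg):
        return regs[node.name]
    if isinstance(node, Imm):
        return node.value
    if isinstance(node, BinOp) and node.op in ("+", "*"):
        left, right = evaluate(node.left, regs), evaluate(node.right, regs)
        return left + right if node.op == "+" else left * right
    raise TypeError(f"cannot evaluate {node!r}")


# With eax = 3 the address expression is 3 + 3 * 4.
print(evaluate(tree.operands[1].address, {"eax": 3}))
```

The nesting only goes two levels deep, but walking it is already a tree traversal, which is hard to express cleanly if the operand was only ever matched by a regex.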


Only if you want a crude assembler. If you want something like MASM/TASM, then you definitely need a proper parser.


If we move away from general-purpose computing toward a mix of specialized processors, will there be a whole set of standardized RISC-V extensions for GPU, ML, etc.? Or is RISC-V going to mainly stick to general-purpose code?

If there are a ton of extensions, will we need to compile all code for a particular machine so that it generates the right code for that machine's mix of special-purpose chips? What will that mean for virtualization and/or containerization?


To my knowledge, and looking at http://riscv.org, this is supposed to be an open-ISA (instruction set architecture). Their specification allows chip manufacturers to write their own extensions using the "custom" opcodes.

The Kendryte K210 is a RISC-V-compliant CPU. It has off-core components, such as what they're calling a KPU. The ML and GPU cores are controlled via I/O, not by the CPU directly. These are called platform-level components. In general, they are controlled through MMIO at hard-wired memory addresses. You can see the KPU (their ML accelerator) here: https://s3.cn-north-1.amazonaws.com.cn/dl.kendryte.com/docum...

See section 3.2.

I think the extensions are meant to be modular. Right now, not many embedded devices allow for the H mode, and hypervisor-ing is still in development. Currently, I know of Machine mode, Supervisor mode, and User mode, but since the change from 2018 to 2019, they have really started to ramp up the virtualization ISA support.


So RISC-V is open, but the extensions are often proprietary?

And what does virtualization even mean when extensions proliferate? Will you need a distinct physical machine for each combination of extensions that someone might want to virtualize?


Proprietary = custom... it just means that it's particular to a single chip manufacturer. They have set aside opcode space for doing just that. RISC-V can't forecast everything someone will want to do with a RISC-V chip. Instead, they promise not to use those opcodes so that standard instructions won't conflict with a chip manufacturer's "custom" instructions.
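
As a rough sketch of what "set aside opcode space" means in practice: for 32-bit instructions the major opcode sits in the low 7 bits, and the base opcode map labels four of those values custom-0 through custom-3 (the spec marks custom-2 and custom-3 as also reserved for RV128, so treat those two with some care). The helper name `custom_slot` is made up for this example.

```python
# Major opcodes (bits 6..0) that the RISC-V base opcode map leaves for custom use.
CUSTOM_OPCODES = {
    0b0001011: "custom-0",
    0b0101011: "custom-1",
    0b1011011: "custom-2",  # listed as custom-2/rv128 in the spec
    0b1111011: "custom-3",  # listed as custom-3/rv128 in the spec
}


def custom_slot(insn):
    """Return the custom slot name for a 32-bit instruction word, or None."""
    return CUSTOM_OPCODES.get(insn & 0x7F)


print(custom_slot(0x0000000B))  # major opcode 0001011
print(custom_slot(0x00000013))  # addi x0, x0, 0 (a standard instruction)
```

An assembler or disassembler can use a check like this to decide whether to hand an instruction word to a vendor-specific decoder.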

Here's a post from SiFive talking specifically about DSAs in RISC-V: https://www.sifive.com/blog/part-1-fast-access-to-accelerato...

In terms of virtualization, most of the extensions can be emulated. This happens a lot with the hodgepodge of extensions laid on Intel/AMD, such as SSE, SSSE3, AVX, and so forth. Just because the underlying physical machine doesn't support an extension doesn't mean that the guest can't use it. At the operating system level, the OS can read the misa (Machine ISA) register to see which extensions are supported, and emulate those which are not. I don't think RISC-V solves the issues that virtualizing Intel/AMD also suffers from, and I don't think that's really their goal.

In terms of the host, if there is an extension that cannot be emulated, then yes, I would think you'd need the physical machine to be able to support it.
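
A small sketch of the misa check mentioned above: bits 0 through 25 of misa correspond to the extension letters A through Z. The example assumes the raw register value has already been obtained somehow (misa is a machine-level CSR, and an implementation may return zero if it does not provide it); the function names and the sample value are made up for illustration.

```python
def misa_extensions(misa):
    """Return the letters of the extensions whose bits (0..25) are set in misa."""
    return "".join(chr(ord("A") + bit) for bit in range(26) if (misa >> bit) & 1)


def missing_extensions(misa, wanted):
    """Return the letters in `wanted` that misa does not report, in the given order."""
    have = misa_extensions(misa)
    return "".join(letter for letter in wanted if letter not in have)


# Hypothetical RV64 value: MXL = 2 in the top two bits, plus A, C, I, M, S, U.
sample = 0x8000000000141105

print(misa_extensions(sample))
print(missing_extensions(sample, "IMAFD"))  # candidates for trap-and-emulate
```

Anything reported as missing would have to be handled by trapping the illegal instruction and emulating it in software, which is the approach the comment describes.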


The SDKs for the non-big-name (i.e. not Nvidia, Google, or AMD) accelerators have very poor developer experiences. (Granted, CUDA/ROCm isn't much better.) A number of the embedded SDKs require you to use their custom framework and won't support TensorFlow/Torch models out of the box. ONNX and similar conversion frameworks target only the big-name chips, not off-the-shelf generic AI accelerator chips. Embedded systems need more software engineering.


This works for the basics of RISC-V assembly, but fails for a lot of assembly code that GCC accepts. It should be fine for most human-written code, but machine-generated code is also common.

Most notably, labels starting with a period are not accepted, but GCC-generated assembly has tons of labels like that.
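
For reference, a label matcher that covers the period case might look like the sketch below. It assumes GNU-as-style symbol names (letters, digits, `_`, `.`, and `$`, not starting with a digit) plus purely numeric local labels; the `leading_label` helper is a made-up name and this is not how the parser in the article is structured.

```python
import re

# Either a symbol name (may start with '.', '_' or '$') or a numeric local label,
# followed by a colon.
LABEL = re.compile(r"\s*([A-Za-z_.$][A-Za-z0-9_.$]*|\d+):")


def leading_label(line):
    """Return the label at the start of a line, or None if the line has none."""
    m = LABEL.match(line)
    return m.group(1) if m else None


for line in [".L2:", "main: addi a0, a0, 1", "1:", "  addi a0, a0, 1"]:
    print(repr(line), "->", leading_label(line))
```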


Thanks! That should be an easy fix. The documentation is terrible.

What else did I miss?


> Unfortunately, the documentation on legal syntax is absolutely terrible.

I totally agree. Determining the semantics of weird assembly is even harder. At least for syntax you can just confirm whether GNU as accepts it.

> Did I miss anything else?

I haven't looked at what it is actually parsing at all, but it correctly accepts things like %hi, strings delimited by quotes, @function, and labels that are just numbers. I had expected some of those to be rejected based on the post.

I don't know what else might fail, because I can't try any auto-generated code because of the period label issue. I would recommend trying the assembly generated by compiling this project; it shows a bunch of weird pieces of syntax.


Why? For assembly you generally just need a tokenizer/lexer and that's about it. But a full-on parser? Are you trying to make a macro assembler or something?



