A compiler is often regarded as a mysterious piece of software. It takes a program written in a high-level language, applies dozens of transformations to it, and then spits out optimized machine code. It sounds like black magic.
In this article, however, I would like to offer a counterargument by implementing a new dummy instruction in LLVM, and hopefully demonstrate how surprisingly straightforward it is once you know your way around.
Here is our plan: add a new instruction to the RISC-V target; make this instruction available through a feature flag; and finally use an assembler to assemble a program. Every file I mention from now on is assumed to be inside the llvm/lib/Target/RISCV directory unless otherwise specified. I will be using LLVM 20.1, in case you would like to reproduce the steps.
The instruction #
We will be creating an instruction called foo that takes two operands in registers and stores the result in another register. In RISC-V parlance, this is an R-type (register) instruction. For example, we expect instructions of the form:
foo x1, x2, x3
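R-type instructions in RISC-V are 32 bits wide (like most other instructions) and have the following encoding, listed from the least significant bit upward: opcode (bits 6-0), rd (bits 11-7), funct3 (bits 14-12), rs1 (bits 19-15), rs2 (bits 24-20), and funct7 (bits 31-25).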
The first field is called opcode, is 7 bits wide, and identifies the instruction. We will be using the opcode 0b0001011, also called custom-0, which is reserved in the ISA specification for non-standard instructions. The next field, rd, encodes the destination register, which is one of the 32 available general purpose registers x0 through x31. Fields funct3 and funct7 complement the opcode and help further identify which instruction this is. For example, an opcode usually identifies a larger class of instructions (e.g. arithmetic and logic instructions) and these fields identify whether it is an addition, subtraction, shift, exclusive or, etc. For our purposes here, both will be set to all zeroes. Finally, fields rs1 and rs2 encode the two source operands.
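To make the field layout concrete, here is a minimal Python sketch that packs the six R-type fields into a 32-bit word. It is only an illustration of the encoding described above, not code taken from LLVM; the names encode_rtype and OPC_CUSTOM_0 are my own choices for this example.

```python
# Illustrative sketch only: packs R-type fields by hand.
OPC_CUSTOM_0 = 0b0001011  # the custom-0 opcode used for foo in this article


def encode_rtype(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack the six R-type fields into one 32-bit instruction word."""
    # (value, width in bits, position of the lowest bit)
    fields = [
        (funct7, 7, 25),
        (rs2, 5, 20),
        (rs1, 5, 15),
        (funct3, 3, 12),
        (rd, 5, 7),
        (opcode, 7, 0),
    ]
    word = 0
    for value, width, shift in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word |= value << shift
    return word


# foo x1, x2, x3  ->  rd = 1, rs1 = 2, rs2 = 3
word = encode_rtype(funct7=0, rs2=3, rs1=2, funct3=0, rd=1, opcode=OPC_CUSTOM_0)
print(f"{word:#010x}")  # prints 0x0031008b
```

Since funct3 and funct7 are both zero for foo, only the opcode and the three register numbers contribute to the final word.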
Interlude #
Before going into the implementation, we need to talk about TableGen. This is a domain-specific language used in the LLVM project that helps us declare a bunch of "records". It is easier to understand with an example, so consider the following TableGen snippet.
class Person {
  string name;
  int age;
}

def Gustavo : Person {
  let name = "Gustavo Leite";
  let age = 30;
}
There are two types of top-level declarations: classes and definitions. Classes declare a template for creating records. Inside them, we declare which fields records must provide along with their respective types. Definitions create the records themselves. In the example above, we define a class Person that has a name and an age, then we declare a record Gustavo of type Person and fill in the necessary fields. Classes can be parameterized and inherit from other classes. This turns TableGen into a powerful templating engine for defining records.
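If you are more at home in a general-purpose language, the following Python sketch is a loose analogy for the class/definition split. It is not how TableGen works internally; Student, adult, and the records table are hypothetical names I made up for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Plays the role of a TableGen class: lists the fields a record must provide."""
    name: str
    age: int


@dataclass(frozen=True)
class Student(Person):
    """Inherits the fields of Person and adds one more (hypothetical example)."""
    school: str


def adult(name: str) -> Person:
    """Rough stand-in for a parameterized class: one parameter, the rest fixed."""
    return Person(name=name, age=18)


# Definitions create the actual records; here they are values in a table.
records = {
    "Gustavo": Person(name="Gustavo Leite", age=30),
    "Alice": Student(name="Alice Example", age=20, school="Example University"),
    "Bob": adult("Bob Example"),
}

for record_name, record in records.items():
    print(record_name, record)
```

The analogy breaks down quickly (TableGen records are plain data that a backend later walks over), but it captures the idea that classes describe a shape and definitions fill it in.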
Implementing the instruction #
Now it's time for the fun stuff. If you look in the RISCVInstrFormats.td file you can find TableGen classes for the base instruction types in RISC-V. In particular, there are classes for defining R-type instructions.
// File: llvm/lib/Target/RISCV/RISCVInstrFormats.td:333
class RVInstRBase<bits<3> funct3, RISCVOpcode opcode, dag outs,
                  dag ins, string opcodestr, string argstr>
    : RVInst<outs, ins, opcodestr, argstr, [], InstFormatR> {
  bits<5> rs2;
  bits<5> rs1;
  bits<5> rd;

  let Inst{24-20} = rs2;
  let Inst{19-15} = rs1;
  let Inst{14-12} = funct3;
  let Inst{11-7} = rd;
  let Inst{6-0} = opcode.Value;
}

class RVInstR<bits<7> funct7, bits<3> funct3, RISCVOpcode opcode, dag outs,
              dag ins, string opcodestr, string argstr>
    : RVInstRBase<funct3, opcode, outs, ins, opcodestr, argstr> {
  let Inst{31-25} = funct7;
}
Class RVInstRBase takes a bunch of parameters (funct3, opcode, ...), defines new fields rd, rs1, and rs2 for the destination and source registers, and defines which bits in the instruction correspond to what. This is achieved through the Inst field. Bits 6-0 correspond to the opcode, bits 11-7 to the destination register, bits 14-12 to funct3, etc. Note that these bit ranges correspond exactly to the R-type encoding I showed you earlier!
There is also a derived class RVInstR that inherits from RVInstRBase and defines the funct7 field on top of it. This is what I meant when I said that inheritance enables a kind of templating for records. Each class in the hierarchy serves as a template for the classes that inherit from it.
With all that said, in order for us to define our foo instruction, we must create a record of type RVInstR and set the parameters accordingly. We will do that in the file RISCVInstrInfo.td. Here it is:
let mayLoad = 0, mayStore = 0, hasSideEffects = 0 in {
  def FOO : RVInstR<
    /*funct7=*/0b0000000,
    /*funct3=*/0b000,
    /*opcode=*/OPC_CUSTOM_0,
    /*outs=*/(outs GPR:$rd),
    /*ins=*/(ins GPR:$rs1, GPR:$rs2),
    /*opcodestr=*/"foo",
    /*argstr=*/"$rd, $rs1, $rs2"
  >;
}
In the first line, we state some properties of our instruction: it never loads from memory; it never stores to memory; and it produces no visible side-effects in the architectural state. Inside this block, we define the FOO record from RVInstR. Fields funct7, funct3, and opcode are exactly what you expect. The field outs lists which operands are outputs. In this case rd is an output in a register from the GPR class (General Purpose Register). The ins field states that rs1 and rs2 are both input operands from the same GPR class. The opcodestr states the assembly mnemonic for this instruction, which is foo. Finally, argstr declares how the operands should be printed to and parsed from assembly: just the registers separated by commas, with the destination coming first.
And that's it. No, really, we just added a new instruction. See how easy that was? Eleven lines of code!
It is important to know that TableGen files do nothing on their own. During the build process, the llvm-tblgen tool is invoked; it reads the records and generates C++ code based on them. Because of this we don't need to write any C++ by hand.
We can now compile LLVM with the RISC-V target enabled and test our new addition. I'll leave the compilation part as an exercise. After the build completes (it may take some time), we can write our test program:
dummy_function:
  foo a0, a0, a1
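And assemble it with (if your clang does not default to a RISC-V target, also pass a target triple such as --target=riscv64-unknown-elf):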
./build/bin/clang -c -o dummy.o dummy.s
Then disassemble it with:
./build/bin/llvm-objdump -d dummy.o
We get the disassembled program back as:
dummy.o: file format elf64-littleriscv
Disassembly of section .text:
0000000000000000 <dummy_function>:
0: 00b5050b foo a0, a0, a1
Mind blowing, isn't it?
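If you want to convince yourself that the word 00b5050b really is our instruction, here is a small Python sketch that splits it back into fields using the bit ranges from RVInstRBase and RVInstR. It is a hand-written illustration, not the disassembler LLVM generates; bits, decode_rtype, and disassemble_foo are names I picked for the example, and the register list is the standard RISC-V ABI naming (a0 is x10, a1 is x11).

```python
# ABI names for x0..x31, in register-number order.
ABI_NAMES = (
    ["zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2", "s0", "s1"]
    + [f"a{i}" for i in range(8)]      # a0..a7  are x10..x17
    + [f"s{i}" for i in range(2, 12)]  # s2..s11 are x18..x27
    + [f"t{i}" for i in range(3, 7)]   # t3..t6  are x28..x31
)


def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of word as an integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_rtype(word):
    """Split a 32-bit word into the six R-type fields."""
    return {
        "funct7": bits(word, 31, 25),
        "rs2": bits(word, 24, 20),
        "rs1": bits(word, 19, 15),
        "funct3": bits(word, 14, 12),
        "rd": bits(word, 11, 7),
        "opcode": bits(word, 6, 0),
    }


def disassemble_foo(word):
    """Return the assembly text if word encodes foo, otherwise None."""
    f = decode_rtype(word)
    if (f["opcode"], f["funct3"], f["funct7"]) != (0b0001011, 0, 0):
        return None
    rd, rs1, rs2 = ABI_NAMES[f["rd"]], ABI_NAMES[f["rs1"]], ABI_NAMES[f["rs2"]]
    return f"foo {rd}, {rs1}, {rs2}"


print(disassemble_foo(0x00B5050B))  # prints: foo a0, a0, a1
```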
Defining a new feature #
I could have ended this post in the previous section, but there is one other detail that is crucial when you are adding a new set of instructions to a target. RISC-V is highly modular and each extension is gated behind a feature flag. For example, if you wish to use vector instructions in your program, you need to tell the assembler so, otherwise it won't recognize them.
This is achieved by creating a new record of type RISCVExtension inside the file RISCVFeatures.td. I decided to call this extension dummy, therefore we add the following lines to this file.
def FeatureVendorXDummy
  : RISCVExtension<0, 1, "Dummy Instruction Extension">;

def HasVendorXDummy
  : Predicate<"Subtarget->hasVendorXDummy()">
  , AssemblerPredicate<(all_of FeatureVendorXDummy),
                       "'XDummy' (Dummy Instruction Extension)">;
The first record defines the extension, passing the major and minor version (e.g. 0.1) and a string with a description of the extension. The record name is important: you must use FeatureVendorX as a prefix, because the extension name (xdummy) is derived from the record name. The second record defines a predicate that will be used in TableGen to "fence" the foo instruction to this particular extension.
Now, we go back to the instruction encoding definition and add the new predicate:
diff --git i/llvm/lib/Target/RISCV/RISCVInstrInfo.td w/llvm/lib/Target/RISCV/RISCVInstrInfo.td
index 62424ed57565..098f34697f49 100644
--- i/llvm/lib/Target/RISCV/RISCVInstrInfo.td
+++ w/llvm/lib/Target/RISCV/RISCVInstrInfo.td
@@ -2088,7 +2088,7 @@ def : Pat<(binop_allwusers<add> GPR:$rs1, immop_oneuse<AddiPair>:$rs2),
 (AddiPairImmSmall imm:$rs2))>;
 }

-let mayLoad = 0, mayStore = 0, hasSideEffects = 0 in {
+let Predicates = [HasVendorXDummy], mayLoad = 0, mayStore = 0, hasSideEffects = 0 in {
   def FOO : RVInstR<
     /*funct7=*/0b0000000,
     /*funct3=*/0b000,
Compile everything again and assemble the program like this:
./build/bin/clang -march=rv64g_xdummy -c -o dummy.o dummy.s
This states that we would like to compile to rv64g and enable the xdummy extension. You should try omitting this new flag to see what happens.
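Conceptually, the predicate works like a lookup: each gated mnemonic names the extension it needs, and the assembler only accepts it when that extension was enabled. The Python sketch below illustrates that idea under heavy simplification. It is not LLVM's actual -march parser (real strings also carry single-letter extensions and version numbers), and parse_march, REQUIRED_EXTENSION, and is_available are hypothetical names.

```python
def parse_march(march):
    """Split a string like 'rv64g_xdummy' into a base and a set of extensions.

    Simplification: only underscore-separated multi-letter extensions are
    handled; the base (e.g. 'rv64g') is kept as an opaque string.
    """
    base, *extensions = march.lower().split("_")
    return base, set(extensions)


# Which extension each gated mnemonic requires (only foo in this article).
REQUIRED_EXTENSION = {"foo": "xdummy"}


def is_available(mnemonic, march):
    """Return True if the mnemonic may be used with the given -march string."""
    _, extensions = parse_march(march)
    needed = REQUIRED_EXTENSION.get(mnemonic)
    return needed is None or needed in extensions


print(is_available("foo", "rv64g_xdummy"))  # prints True
print(is_available("foo", "rv64g"))         # prints False
```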
Parting words #
Compilers are fun! Both theory and practice are beautiful. In this post I hope to have convinced you, even a little bit, that tinkering with compilers is not impossible. I glossed over a lot of details in order to keep the reading short and interesting. It is important to note that TableGen is used all throughout LLVM, Clang, and MLIR for different purposes. I was intentionally vague here because I plan to write about it in the future.
If you would like to learn more, I recommend reading the files RISCVRegisterInfo.td, RISCVInstrFormats.td, and RISCVInstrInfo.td in their entirety. You will build a better idea of how to implement new instructions.
Acknowledgements #
I would like to thank Professor Hervé Yviquel from the Institute of Computing at Unicamp for reading a draft of this post.