Sunday, July 19, 2020

QEMU TCG

[I'm not sure what the best title for this post is, so I just used these two words. If you arrived here, either the title or something below contains the keywords you care about.]


Recently I needed to touch the TCG code in QEMU to fix something. I have been using QEMU for years, and VirtualBox and VMware for even longer. But most of the time, CPU virtualization is done by the real CPU with help from the kernel, via KVM in the case of QEMU. I had heard about user-mode emulation, and had even read the VirtualBox documentation, which explains in some detail how they do it, and of course some fancy slide decks, but I never had a chance to really dig into the source code to learn exactly how it's done.

So after countless days and nights I finally reached a reasonably deep understanding. I'd better record it somewhere, because it's so complex that the details are hard to remember even after just a couple of months.

Below I'll describe the steps for analyzing its execution flow.

I was never a real virtualization guy until last year, when I needed to fix something deep inside the virtualized CPU. So if some terms or concepts below are wrong, please forgive me and substitute whatever names you think are correct.

Ingredients


We need the QEMU 5.0.0 source code, GDB, a Linux system (I use Ubuntu under Windows Subsystem for Linux on Windows 10) on an x86_64 Intel CPU, and a hand-written assembly program to be emulated as a user-mode process.


Build the user mode process


QEMU can run in several different modes (https://en.wikipedia.org/wiki/QEMU#Operating_modes). The one we use here is user-mode emulation.

So we need a very simple user-mode program with just two assembly instructions, like the following:
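global _start

section .data
	align 16

section .text
_start:
	; The only instruction we care about: load a 64-bit immediate into RAX.
	mov rax, 0x1234567890abcdef

	; Not part of the exercise (there is no caller to return to).
	ret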
Then assemble and link it:
ray@DESKTOP:/mnt/d$ nasm -felf64 -g tcg.asm
ray@DESKTOP:/mnt/d$ ld -g tcg.o -o tcg
Let's double-check that the binary really contains the 'mov' instruction, and look at its bytes in hex:
ray@DESKTOP:/mnt/d$ gdb ./tcg
GNU gdb (Ubuntu 9.1-0ubuntu1) 9.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./tcg...
(gdb) set output-radix 16
Output radix now set to decimal 16, hex 10, octal 20.
(gdb) set disassemble-next-line on
(gdb) show disassemble-next-line
Debugger's willingness to use disassemble-next-line is on.
(gdb) disassemble _start
Dump of assembler code for function _start:
   0x0000000000401000 <+0>:     movabs $0x1234567890abcdef,%rax
   0x000000000040100a <+10>:    retq
End of assembler dump.
(gdb) x/10xb 0x0000000000401000
0x401000 <_start>:      0x48    0xb8    0xef    0xcd    0xab    0x90    0x78    0x56
0x401008 <_start+8>:    0x34    0x12
(gdb)
This piece of asm code is not perfect. It will crash if you run it directly from the shell, because _start has no caller for 'ret' to return to. But the only thing we care about is the 'mov', so it's good enough for our exercise. ('ret' is not part of this exercise.)
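
The ten bytes in the dump above can also be decoded by hand. The following is a minimal Python sketch, not anything taken from QEMU: the function name decode_mov_imm64 and the REG64 table are my own, and it only understands the single encoding used here (a REX.W prefix, an opcode in the B8..BF range, and a little-endian 64-bit immediate).

```python
# Decode the 10 bytes GDB printed at 0x401000: REX prefix, opcode, 64-bit immediate.
code = bytes([0x48, 0xB8, 0xEF, 0xCD, 0xAB, 0x90, 0x78, 0x56, 0x34, 0x12])

# Register numbers 0..7 as encoded in the low 3 bits of the opcode.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_imm64(insn: bytes):
    """Decode 'REX.W + B8+rd imm64' (mov r64, imm64); raise on anything else."""
    if len(insn) < 10:
        raise ValueError("need 10 bytes")
    rex, opcode, imm = insn[0], insn[1], insn[2:10]
    if rex & 0xF8 != 0x48:      # 0100 1xxx: REX prefix with W=1 (64-bit operand)
        raise ValueError("expected a REX.W prefix")
    if rex & 0x01:              # REX.B would select r8..r15; not handled in this sketch
        raise ValueError("REX.B is not handled here")
    if opcode & 0xF8 != 0xB8:   # B8..BF: mov reg, imm; low 3 bits pick the register
        raise ValueError("expected opcode B8+rd")
    reg = REG64[opcode & 0x07]
    value = int.from_bytes(imm, "little")   # x86 immediates are little-endian
    return reg, value


reg, value = decode_mov_imm64(code)
print(f"mov {reg}, {value:#x}")
```

The same helper can be pointed at other 'movabs' byte sequences of this form, which is handy for comparing guest and host encodings.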

We also run it in GDB to make sure that RAX really is loaded with 0x1234567890abcdef:
(gdb) b _start
Breakpoint 1 at 0x401000
(gdb) r
Starting program: /mnt/d/tcg

Breakpoint 1, 0x0000000000401000 in _start ()
=> 0x0000000000401000 <_start+0>:       48 b8 ef cd ab 90 78 56 34 12   movabs $0x1234567890abcdef,%rax
(gdb) p $rax
$1 = 0x0
(gdb) si
0x000000000040100a in _start ()
=> 0x000000000040100a <_start+10>:      c3      retq
(gdb) p $rax
$2 = 0x1234567890abcdef
(gdb)
Now the user-mode target looks OK and we can move on to the next step.

Build QEMU


This step is to build QEMU.

After extracting the QEMU 5.0.0 source code, create a build directory inside it and run configure from there:
ray@DESKTOP:/mnt/d/qemu-5.0.0/build$ ../configure --target-list=x86_64-linux-user --extra-cflags=-g3 --enable-debug --enable-system --enable-user --enable-linux-user --enable-debug-tcg --enable-debug-info --disable-sdl --disable-vnc --disable-kvm
--enable-debug-tcg is very useful while debugging TCG: as far as I know, it compiles extra consistency checks into the code generator.

Then build QEMU with make.
ray@DESKTOP:/mnt/d/qemu-5.0.0/build$ make -j $(python3 -c 'import multiprocessing as mp; print(int(mp.cpu_count() * 1.5))') 

Execution Flow


In this step we use GDB to discover how TCG decodes a single instruction, generates the intermediate code and then the real host code, and finally executes that whole bulk of code just to emulate one 'mov' instruction.

The first thing that we need to understand is how TCG does the above jobs.

In our case TCG uses a dynamically generated function to emulate just one instruction (normally it handles a bunch of instructions at once for better performance; more on that later). That function consists of three parts: a prologue that saves some registers and reserves stack space, a subroutine that performs the actual work of the target instruction, and an epilogue that cleans up and returns to the caller. So we'll trace the code in the order prologue -> subroutine -> epilogue.

We start QEMU via GDB with the following command:
ray@DESKTOP:/mnt/d/qemu-5.0.0/build$ gdb --args x86_64-linux-user/qemu-x86_64 -L / -d op,in_asm,out_asm -singlestep ../../tcg
GNU gdb (Ubuntu 9.1-0ubuntu1) 9.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from x86_64-linux-user/qemu-x86_64...
warning: File "/mnt/d/qemu-5.0.0/.gdbinit" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /mnt/d/qemu-5.0.0/.gdbinit
line to your configuration file "/home/ray/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/home/ray/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
(gdb) set output-radix 16
Output radix now set to decimal 16, hex 10, octal 20.
(gdb) set disassemble-next-line on
(gdb) show disassemble-next-line
Debugger's willingness to use disassemble-next-line is on.
(gdb)
There are a couple of things on the command line that deserve attention.

First, "-d op,in_asm,out_asm" tells QEMU to log the target instruction 'mov' in the 'IN' log, the intermediate code in the 'OP' log and the final host code in the 'OUT' log. These logs are very useful for understanding the whole process. We built QEMU with enable-debug-tcg in the previous step, although as far as I can tell these -d items do not depend on it.

Second, "-singlestep" tells TCG to translate instructions one at a time; otherwise it translates them in batches, which is obviously faster.

Then, we need to do the first important step:
(gdb) break disas_insn
Breakpoint 1 at 0x13c792: file /mnt/d/qemu-5.0.0/target/i386/translate.c, line 4488.
(gdb) r
Starting program: /mnt/d/qemu-5.0.0/build/x86_64-linux-user/qemu-x86_64 -L / -d op,in_asm,out_asm -singlestep ../../tcg
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffff230700 (LWP 5566)]
PROLOGUE: [size=45]
0x7ffff0000000:  55                       pushq    %rbp
0x7ffff0000001:  53                       pushq    %rbx
0x7ffff0000002:  41 54                    pushq    %r12
0x7ffff0000004:  41 55                    pushq    %r13
0x7ffff0000006:  41 56                    pushq    %r14
0x7ffff0000008:  41 57                    pushq    %r15
0x7ffff000000a:  48 8b ef                 movq     %rdi, %rbp
0x7ffff000000d:  48 81 c4 78 fb ff ff     addq     $-0x488, %rsp
0x7ffff0000014:  ff e6                    jmpq     *%rsi
0x7ffff0000016:  33 c0                    xorl     %eax, %eax
0x7ffff0000018:  48 81 c4 88 04 00 00     addq     $0x488, %rsp
0x7ffff000001f:  c5 f8 77                 vzeroupper
0x7ffff0000022:  41 5f                    popq     %r15
0x7ffff0000024:  41 5e                    popq     %r14
0x7ffff0000026:  41 5d                    popq     %r13
0x7ffff0000028:  41 5c                    popq     %r12
0x7ffff000002a:  5b                       popq     %rbx
0x7ffff000002b:  5d                       popq     %rbp
0x7ffff000002c:  c3                       retq


Thread 1 "qemu-x86_64" hit Breakpoint 1, disas_insn (s=0x0, cpu=0x0) at /mnt/d/qemu-5.0.0/target/i386/translate.c:4488
4488    {
=> 0x000000000813c792 <disas_insn+0>:   f3 0f 1e fa     endbr64
   0x000000000813c796 <disas_insn+4>:   55      push   %rbp
   0x000000000813c797 <disas_insn+5>:   48 89 e5        mov    %rsp,%rbp
   0x000000000813c79a <disas_insn+8>:   53      push   %rbx
   0x000000000813c79b <disas_insn+9>:   48 81 ec 48 02 00 00    sub    $0x248,%rsp
   0x000000000813c7a2 <disas_insn+16>:  48 89 bd b8 fd ff ff    mov    %rdi,-0x248(%rbp)
   0x000000000813c7a9 <disas_insn+23>:  48 89 b5 b0 fd ff ff    mov    %rsi,-0x250(%rbp)
(gdb)
disas_insn is the entry point for decoding 'mov' on the i386 target (Intel CPUs). By the time we hit it, a PROLOGUE log message has appeared (it actually includes the epilogue as well); this code is generated during QEMU initialization by tcg_target_qemu_prologue. The instructions up to and including 'jmpq *%rsi' are the prologue, and everything after it is the epilogue. 'jmpq *%rsi' is where the prologue transfers execution to the subroutine that performs the 'mov'.

We can verify this in GDB:
(gdb) x/19i 0x7ffff0000000
   0x7ffff0000000:      push   %rbp
   0x7ffff0000001:      push   %rbx
   0x7ffff0000002:      push   %r12
   0x7ffff0000004:      push   %r13
   0x7ffff0000006:      push   %r14
   0x7ffff0000008:      push   %r15
   0x7ffff000000a:      mov    %rdi,%rbp
   0x7ffff000000d:      add    $0xfffffffffffffb78,%rsp
   0x7ffff0000014:      jmpq   *%rsi
   0x7ffff0000016:      xor    %eax,%eax
   0x7ffff0000018:      add    $0x488,%rsp
   0x7ffff000001f:      vzeroupper
   0x7ffff0000022:      pop    %r15
   0x7ffff0000024:      pop    %r14
   0x7ffff0000026:      pop    %r13
   0x7ffff0000028:      pop    %r12
   0x7ffff000002a:      pop    %rbx
   0x7ffff000002b:      pop    %rbp
   0x7ffff000002c:      retq
(gdb)
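
One detail that is easy to trip over: QEMU's log prints the stack adjustment as 'addq $-0x488, %rsp', while GDB prints the same instruction as 'add $0xfffffffffffffb78,%rsp'. The short Python sketch below (the helper name to_signed64 is mine) shows that these are the same two's-complement value, and adds up how much stack the prologue consumes. The assumption that the caller's 'callq' pushed an 8-byte return address comes from the x86_64 calling convention, not from the log.

```python
MASK64 = (1 << 64) - 1


def to_signed64(value: int) -> int:
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


# GDB shows the immediate as an unsigned 64-bit pattern; QEMU's log shows it signed.
frame_adjust = to_signed64(0xFFFFFFFFFFFFFB78)

pushed = 6 * 8        # rbp, rbx, r12, r13, r14, r15
return_address = 8    # pushed by the callq that entered the prologue (assumption)
total = return_address + pushed - frame_adjust

print(hex(frame_adjust))     # same value as in 'addq $-0x488, %rsp'
print(total, total % 16)     # bytes below the caller's stack pointer, and alignment
```

The total is a multiple of 16, which is consistent with the generated code keeping the stack 16-byte aligned for any calls it makes.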
From the backtrace, we can also see the calling sequence:
(gdb) bt
#0  disas_insn (s=0x0, cpu=0x0) at /mnt/d/qemu-5.0.0/target/i386/translate.c:4488
#1  0x000000000814b5e6 in i386_tr_translate_insn (dcbase=0x7ffffffed660, cpu=0x853c070)
    at /mnt/d/qemu-5.0.0/target/i386/translate.c:8570
#2  0x00000000080d16fe in translator_loop (ops=0x8479220 <i386_tr_ops>, db=0x7ffffffed660, cpu=0x853c070,
    tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1) at /mnt/d/qemu-5.0.0/accel/tcg/translator.c:102
#3  0x000000000814b7c3 in gen_intermediate_code (cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1)
    at /mnt/d/qemu-5.0.0/target/i386/translate.c:8632
#4  0x00000000080cf6c1 in tb_gen_code (cpu=0x853c070, pc=0x401000, cs_base=0x0, flags=0x40c0b3, cflags=0xff000000)
    at /mnt/d/qemu-5.0.0/accel/tcg/translate-all.c:1718
#5  0x00000000080cc7a3 in tb_find (cpu=0x853c070, last_tb=0x0, tb_exit=0x0, cf_mask=0x0)
    at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:407
#6  0x00000000080ccfb6 in cpu_exec (cpu=0x853c070) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:731
#7  0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207
#8  0x00000000080dc30c in main (argc=0x7, argv=0x7ffffffee1e8, envp=0x7ffffffee228)
    at /mnt/d/qemu-5.0.0/linux-user/main.c:872
(gdb) 
Now, we let disas_insn finish its job:
(gdb) finish
Run till exit from #0  disas_insn (s=0x0, cpu=0x0) at /mnt/d/qemu-5.0.0/target/i386/translate.c:4488
0x000000000814b5e6 in i386_tr_translate_insn (dcbase=0x7ffffffed660, cpu=0x853c070)
    at /mnt/d/qemu-5.0.0/target/i386/translate.c:8570
8570        pc_next = disas_insn(dc, cpu);
   0x000000000814b5d3 <i386_tr_translate_insn+88>:      48 8b 55 d0     mov    -0x30(%rbp),%rdx
   0x000000000814b5d7 <i386_tr_translate_insn+92>:      48 8b 45 f0     mov    -0x10(%rbp),%rax
   0x000000000814b5db <i386_tr_translate_insn+96>:      48 89 d6        mov    %rdx,%rsi
   0x000000000814b5de <i386_tr_translate_insn+99>:      48 89 c7        mov    %rax,%rdi
   0x000000000814b5e1 <i386_tr_translate_insn+102>:     e8 ac 11 ff ff  callq  0x813c792 <disas_insn>
=> 0x000000000814b5e6 <i386_tr_translate_insn+107>:     48 89 45 f8     mov    %rax,-0x8(%rbp)
Value returned is $1 = 0x40100a
(gdb) finish
Run till exit from #0  0x000000000814b5e6 in i386_tr_translate_insn (dcbase=0x7ffffffed660, cpu=0x853c070)
    at /mnt/d/qemu-5.0.0/target/i386/translate.c:8570
translator_loop (ops=0x8479220 <i386_tr_ops>, db=0x7ffffffed660, cpu=0x853c070,
    tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1) at /mnt/d/qemu-5.0.0/accel/tcg/translator.c:106
106             if (db->is_jmp != DISAS_NEXT) {
=> 0x00000000080d16fe <translator_loop+671>:    48 8b 45 d0     mov    -0x30(%rbp),%rax
   0x00000000080d1702 <translator_loop+675>:    8b 40 18        mov    0x18(%rax),%eax
(gdb) finish
Run till exit from #0  translator_loop (ops=0x8479220 <i386_tr_ops>, db=0x7ffffffed660, cpu=0x853c070,
    tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1) at /mnt/d/qemu-5.0.0/accel/tcg/translator.c:106
----------------
IN:
0x00401000:  48 b8 ef cd ab 90 78 56  movabsq  $0x1234567890abcdef, %rax
0x00401008:  34 12

gen_intermediate_code (cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1)
    at /mnt/d/qemu-5.0.0/target/i386/translate.c:8633
8633    }
=> 0x000000000814b7c3 <gen_intermediate_code+95>:       90      nop
   0x000000000814b7c4 <gen_intermediate_code+96>:       48 8b 45 f8     mov    -0x8(%rbp),%rax
   0x000000000814b7c8 <gen_intermediate_code+100>:      64 48 33 04 25 28 00 00 00      xor    %fs:0x28,%rax
   0x000000000814b7d1 <gen_intermediate_code+109>:      74 05   je     0x814b7d8 <gen_intermediate_code+116>
   0x000000000814b7d3 <gen_intermediate_code+111>:      e8 78 9b f2 ff  callq  0x8075350 <__stack_chk_fail@plt>
   0x000000000814b7d8 <gen_intermediate_code+116>:      c9      leaveq
   0x000000000814b7d9 <gen_intermediate_code+117>:      c3      retq
(gdb) 
We see the new 'IN' log. It shows the same instruction and byte sequence that we saw earlier when launching the 'tcg' binary directly with GDB. We can also verify it against host memory:
(gdb) x/1i 0x00401000
   0x401000:    movabs $0x1234567890abcdef,%rax
(gdb) x/10xb 0x00401000
0x401000:       0x48    0xb8    0xef    0xcd    0xab    0x90    0x78    0x56
0x401008:       0x34    0x12
(gdb)
Let's keep finishing each function in the backtrace until we see this message:
(gdb) finish
Run till exit from #0  gen_intermediate_code (cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1)
    at /mnt/d/qemu-5.0.0/target/i386/translate.c:8633
tb_gen_code (cpu=0x853c070, pc=0x401000, cs_base=0x0, flags=0x40c0b3, cflags=0xff000000)
    at /mnt/d/qemu-5.0.0/accel/tcg/translate-all.c:1719
1719        tcg_ctx->cpu = NULL;
=> 0x00000000080cf6c1 <tb_gen_code+572>:        64 48 8b 04 25 a0 ff ff ff      mov    %fs:0xffffffffffffffa0,%rax
(gdb) finish
Run till exit from #0  tb_gen_code (cpu=0x853c070, pc=0x401000, cs_base=0x0, flags=0x40c0b3, cflags=0xff000000)
    at /mnt/d/qemu-5.0.0/accel/tcg/translate-all.c:1719
OP:
 ld_i32 tmp11,env,$0xfffffffffffffff0
 movi_i32 tmp12,$0x0
 brcond_i32 tmp11,tmp12,lt,$L0

 ---- 0000000000401000 0000000000000000
 movi_i64 tmp0,$0x1234567890abcdef
 mov_i64 rax,tmp0
 movi_i64 tmp3,$0x40100a
 st_i64 tmp3,env,$0x80
 exit_tb $0x0
 set_label $L0
 exit_tb $0x7ffff0000043

OUT: [size=53]
0x7ffff0000100:  8b 5d f0                 movl     -0x10(%rbp), %ebx
0x7ffff0000103:  85 db                    testl    %ebx, %ebx
0x7ffff0000105:  0f 8c 1e 00 00 00        jl       0x7ffff0000129
0x7ffff000010b:  48 bb ef cd ab 90 78 56  movabsq  $0x1234567890abcdef, %rbx
0x7ffff0000113:  34 12
0x7ffff0000115:  48 89 5d 00              movq     %rbx, (%rbp)
0x7ffff0000119:  48 c7 85 80 00 00 00 0a  movq     $0x40100a, 0x80(%rbp)
0x7ffff0000121:  10 40 00
0x7ffff0000124:  e9 ed fe ff ff           jmp      0x7ffff0000016
0x7ffff0000129:  48 8d 05 13 ff ff ff     leaq     -0xed(%rip), %rax
0x7ffff0000130:  e9 e3 fe ff ff           jmp      0x7ffff0000018

0x00000000080cc7a3 in tb_find (cpu=0x853c070, last_tb=0x0, tb_exit=0x0, cf_mask=0x0)
    at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:407
407             tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
   0x00000000080cc786 <tb_find+84>:     8b 7d a8        mov    -0x58(%rbp),%edi
   0x00000000080cc789 <tb_find+87>:     8b 4d cc        mov    -0x34(%rbp),%ecx
   0x00000000080cc78c <tb_find+90>:     48 8b 55 d0     mov    -0x30(%rbp),%rdx
   0x00000000080cc790 <tb_find+94>:     48 8b 75 d8     mov    -0x28(%rbp),%rsi
   0x00000000080cc794 <tb_find+98>:     48 8b 45 b8     mov    -0x48(%rbp),%rax
   0x00000000080cc798 <tb_find+102>:    41 89 f8        mov    %edi,%r8d
   0x00000000080cc79b <tb_find+105>:    48 89 c7        mov    %rax,%rdi
   0x00000000080cc79e <tb_find+108>:    e8 e2 2c 00 00  callq  0x80cf485 <tb_gen_code>
=> 0x00000000080cc7a3 <tb_find+113>:    48 89 45 e0     mov    %rax,-0x20(%rbp)
Value returned is $2 = (TranslationBlock *) 0x7ffff0000040 <code_gen_buffer+19>
(gdb)
From above, we can see two new logs: 'OP' and 'OUT'.

The 'OP' log shows the intermediate code that TCG generated to represent 'mov'. It is not real machine code, just pseudo opcodes and variables that describe the operations TCG needs to perform.

'OUT' log shows the real host ASM instructions that will be executed on the host CPU. In our case, we can tell that these are real x86_64 instructions.
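
To get a feel for what the 'OP' list means, here is a toy Python interpreter for exactly the ops in the log above. This is only a sketch for reading the log: TCG does not interpret the ops this way, it compiles them into the host code shown in 'OUT'. The names run_ops, OPS and GLOBALS are mine, env is modeled as a plain dict from byte offset to value, and the only global I map is rax at offset 0x0 (the mem_offset that GDB reports for it in tcg_ctx).

```python
# The OP log, transcribed as tuples: (opcode, operands...).
OPS = [
    ("ld_i32", "tmp11", "env", -0x10),
    ("movi_i32", "tmp12", 0x0),
    ("brcond_i32", "tmp11", "tmp12", "lt", "L0"),
    ("movi_i64", "tmp0", 0x1234567890ABCDEF),
    ("mov_i64", "rax", "tmp0"),
    ("movi_i64", "tmp3", 0x40100A),
    ("st_i64", "tmp3", "env", 0x80),
    ("exit_tb", 0x0),
    ("set_label", "L0"),
    ("exit_tb", 0x7FFFF0000043),
]

# Globals live in env; rax sits at offset 0x0 (assumption taken from the tcg_ctx dump).
GLOBALS = {"rax": 0x0}


def run_ops(ops, env):
    """Interpret the op list against env (offset -> value); return the exit_tb value."""
    labels = {op[1]: i for i, op in enumerate(ops) if op[0] == "set_label"}
    temps = {}

    def write(dst, value):
        temps[dst] = value
        if dst in GLOBALS:                  # writes to a global also land in env
            env[GLOBALS[dst]] = value

    pc = 0
    while pc < len(ops):
        name, *args = ops[pc]
        pc += 1
        if name in ("movi_i32", "movi_i64"):
            write(args[0], args[1])
        elif name == "mov_i64":
            write(args[0], temps[args[1]])
        elif name == "ld_i32":
            write(args[0], env.get(args[2], 0))
        elif name == "st_i64":
            env[args[2]] = temps[args[0]]
        elif name == "brcond_i32":
            lhs, rhs, cond, label = args
            if cond != "lt":
                raise NotImplementedError(cond)
            if temps[lhs] < temps[rhs]:
                pc = labels[label]
        elif name == "exit_tb":
            return args[0]
        elif name != "set_label":
            raise NotImplementedError(name)
    raise RuntimeError("op list ended without exit_tb")


env = {}
exit_value = run_ops(OPS, env)
print(hex(exit_value), {hex(k): hex(v) for k, v in env.items()})
```

On the normal path the block stores the immediate into the rax slot, stores 0x40100a (the address of the next guest instruction) at offset 0x80, and exits with 0. If the value loaded from offset -0x10 is negative, the block exits early through label L0 without touching rax.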

The intermediate code is just stored in a list, so GDB cannot disassemble it directly. But the final code (the subroutine) is real code in memory, so we can easily verify it:
(gdb) x/9i 0x7ffff0000100
   0x7ffff0000100 <code_gen_buffer+211>:        mov    -0x10(%rbp),%ebx
   0x7ffff0000103 <code_gen_buffer+214>:        test   %ebx,%ebx
   0x7ffff0000105 <code_gen_buffer+216>:        jl     0x7ffff0000129 <code_gen_buffer+252>
   0x7ffff000010b <code_gen_buffer+222>:        movabs $0x1234567890abcdef,%rbx
   0x7ffff0000115 <code_gen_buffer+232>:        mov    %rbx,0x0(%rbp)
   0x7ffff0000119 <code_gen_buffer+236>:        movq   $0x40100a,0x80(%rbp)
   0x7ffff0000124 <code_gen_buffer+247>:        jmpq   0x7ffff0000016
   0x7ffff0000129 <code_gen_buffer+252>:        lea    -0xed(%rip),%rax        # 0x7ffff0000043 <code_gen_buffer+22>
   0x7ffff0000130 <code_gen_buffer+259>:        jmpq   0x7ffff0000018
(gdb) x/53xb 0x7ffff0000100
0x7ffff0000100 <code_gen_buffer+211>:   0x8b    0x5d    0xf0    0x85    0xdb    0x0f    0x8c    0x1e
0x7ffff0000108 <code_gen_buffer+219>:   0x00    0x00    0x00    0x48    0xbb    0xef    0xcd    0xab
0x7ffff0000110 <code_gen_buffer+227>:   0x90    0x78    0x56    0x34    0x12    0x48    0x89    0x5d
0x7ffff0000118 <code_gen_buffer+235>:   0x00    0x48    0xc7    0x85    0x80    0x00    0x00    0x00
0x7ffff0000120 <code_gen_buffer+243>:   0x0a    0x10    0x40    0x00    0xe9    0xed    0xfe    0xff
0x7ffff0000128 <code_gen_buffer+251>:   0xff    0x48    0x8d    0x05    0x13    0xff    0xff    0xff
0x7ffff0000130 <code_gen_buffer+259>:   0xe9    0xe3    0xfe    0xff    0xff
(gdb) 
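We should also notice the target addresses of the two 'jmp' instructions. Both return to the epilogue: the first, on the normal path, lands on 'xor %eax,%eax' at 0x7ffff0000016, so the function returns 0; the second, taken when the check at the top of the block fails, lands one instruction later at 0x7ffff0000018, so the value that 'lea' just put in RAX is returned instead.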
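
The branch targets can be recomputed from the raw bytes of the 'OUT' dump. The Python sketch below uses the usual x86 rule that a 32-bit relative displacement is added to the address of the next instruction; the helper name rel32_target is mine, and the instruction lengths are simply read off the dump.

```python
BASE = 0x7FFFF0000100   # address of the first generated byte (tcg_ctx->code_buf)

# The 53 bytes of the OUT log, one string per instruction.
code = bytes.fromhex(
    "8b5df0"                      # movl   -0x10(%rbp), %ebx
    "85db"                        # testl  %ebx, %ebx
    "0f8c1e000000"                # jl     rel32
    "48bbefcdab9078563412"        # movabsq $0x1234567890abcdef, %rbx
    "48895d00"                    # movq   %rbx, (%rbp)
    "48c785800000000a104000"      # movq   $0x40100a, 0x80(%rbp)
    "e9edfeffff"                  # jmp    rel32
    "488d0513ffffff"              # leaq   disp32(%rip), %rax
    "e9e3feffff"                  # jmp    rel32
)


def rel32_target(insn_addr: int, insn_len: int) -> int:
    """Next-instruction address plus the signed 32-bit displacement that ends the instruction."""
    end = insn_addr - BASE + insn_len
    disp = int.from_bytes(code[end - 4:end], "little", signed=True)
    return insn_addr + insn_len + disp


print(hex(rel32_target(0x7FFFF0000105, 6)))   # jl: the early-exit path inside the block
print(hex(rel32_target(0x7FFFF0000124, 5)))   # jmp: epilogue entry that zeroes EAX
print(hex(rel32_target(0x7FFFF0000129, 7)))   # lea: value loaded into RAX
print(hex(rel32_target(0x7FFFF0000130, 5)))   # jmp: epilogue entry that keeps RAX
```

Note that the value produced by 'lea' is 0x7ffff0000043, which is the TranslationBlock address 0x7ffff0000040 from the backtrace with the two low bits set.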

We also should pay attention to a global variable:
(gdb) p *tcg_ctx
$3 = {
  pool_cur = 0x85859f0 "\352\r\353\r\354\r\355\r\356\r\357\r\360\r\361\r\362\r\363\r\364\r\365\r\366\r\367\r\370\r\371\r\372\r\373\r\374\r\375\r\376\r\377\r", pool_end = 0x858cc10 "", pool_first = 0x8584c00, pool_current = 0x8584c00,
  pool_first_large = 0x0, nb_labels = 0x1, nb_globals = 0x24, nb_temps = 0x31, nb_indirects = 0x0, nb_ops = 0xb,
  code_buf = 0x7ffff0000100 <code_gen_buffer+211> "\213]\360\205\333\017\214\036",
  tb_jmp_reset_offset = 0x7ffff000009c <code_gen_buffer+111>,
  tb_jmp_insn_offset = 0x7ffff00000a0 <code_gen_buffer+115>, tb_jmp_target_addr = 0x0, reserved_regs = 0x30,
  tb_cflags = 0xff000000, current_frame_offset = 0x80, frame_start = 0x80, frame_end = 0x480,
  frame_temp = 0x84f7b40 <tcg_init_ctx+2848>, code_ptr = 0x7ffff0000135 <code_gen_buffer+264> "", temps_in_use = 0x0,
  goto_tb_issue_mask = 0x0, vecop_list = 0x0, code_gen_prologue = 0x7ffff0000000, code_gen_epilogue = 0x7ffff0000016,
  code_gen_buffer = 0x7ffff000002d <code_gen_buffer>, code_gen_buffer_size = 0x7ffefd3,
  code_gen_ptr = 0x7ffff0000140 <code_gen_buffer+275>, data_gen_ptr = 0x0,
  code_gen_highwater = 0x7ffff7ffec00 <code_gen_buffer+134212563>, tb_phys_invalidate_count = 0x0, cpu = 0x0,
  pool_labels = 0x0, exitreq_label = 0x8584c10, free_temps = {{l = {0x1800000000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
        0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {
        0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0,
        0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
        0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}},
  temps = {{reg = TCG_REG_EBP, val_type = TEMP_VAL_REG, base_type = TCG_TYPE_I64, type = TCG_TYPE_I64,
      fixed_reg = 0x1, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0, mem_allocated = 0x0,
      temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0, mem_base = 0x0, mem_offset = 0x0,
      name = 0x82ae35b "env", state = 0x2, state_ptr = 0x8585908}, {reg = TCG_REG_EAX, val_type = TEMP_VAL_MEM,
      base_type = TCG_TYPE_I32, type = TCG_TYPE_I32, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0,
      mem_coherent = 0x0, mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0,
      mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0xa8, name = 0x82c384c "cc_op", state = 0x3,
      state_ptr = 0x858590c}, {reg = TCG_REG_EAX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I64,
      type = TCG_TYPE_I64, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0,
      mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0,
      mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0x90, name = 0x82c3852 "cc_dst", state = 0x3,
      state_ptr = 0x8585910}, {reg = TCG_REG_EAX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I64,
      type = TCG_TYPE_I64, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0,
      mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0,
      mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0x98, name = 0x82c3859 "cc_src", state = 0x3,
      state_ptr = 0x8585914}, {reg = TCG_REG_EAX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I64,
      type = TCG_TYPE_I64, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0,
      mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0,
      mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0xa0, name = 0x82c3860 "cc_src2", state = 0x3,
      state_ptr = 0x8585918}, {reg = TCG_REG_EBX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I64,
      type = TCG_TYPE_I64, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x1,
      mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x1234567890abcdef,
      mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0x0, name = 0x82c38e0 <reg_names> "rax", state = 0x3,
      state_ptr = 0x858591c}, {reg = TCG_REG_EAX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I64,
      type = TCG_TYPE_I64, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0,
      mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0,
      mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0x8, name = 0x82c38e4 <reg_names+4> "rcx", state = 0x3,
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb)
tcg_ctx plays a very important role in TCG. 'code_ptr', 'code_buf', 'code_gen_prologue' and 'code_gen_epilogue' are the fields to pay attention to. 'code_gen_prologue' and 'code_gen_epilogue' point to the prologue and epilogue respectively. 'code_ptr' is not interesting for such a simple 'mov'; at this point it just points past the last generated byte (0x7ffff0000135 is code_buf plus the 53 bytes of the 'OUT' log). 'code_buf' is very tricky. I'll show why later.
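
The pointer values in the dump can be cross-checked against the sizes printed in the logs. This Python sketch only does arithmetic on numbers copied from the GDB output above; the dictionary keys are the field names from tcg_ctx, and the variable names are mine.

```python
# Pointer fields copied from 'p *tcg_ctx'.
ctx = {
    "code_gen_prologue": 0x7FFFF0000000,
    "code_gen_epilogue": 0x7FFFF0000016,
    "code_gen_buffer":   0x7FFFF000002D,
    "code_buf":          0x7FFFF0000100,
    "code_ptr":          0x7FFFF0000135,
    "code_gen_ptr":      0x7FFFF0000140,
}

PROLOGUE_SIZE = 45   # from "PROLOGUE: [size=45]"
OUT_SIZE = 53        # from "OUT: [size=53]"

# The prologue and epilogue sit in front of the area used for translated blocks.
prologue_bytes = ctx["code_gen_buffer"] - ctx["code_gen_prologue"]
# Offset of the epilogue inside that first chunk (right after 'jmpq *%rsi').
epilogue_offset = ctx["code_gen_epilogue"] - ctx["code_gen_prologue"]
# Bytes emitted for our single 'mov'.
generated_bytes = ctx["code_ptr"] - ctx["code_buf"]
# Gap between the end of our code and where the next block will start.
gap_to_next = ctx["code_gen_ptr"] - ctx["code_ptr"]

print(prologue_bytes, hex(epilogue_offset), generated_bytes, gap_to_next)
```

So code_gen_buffer begins right after the 45 bytes of prologue and epilogue, code_ptr ends up exactly 53 bytes past code_buf, and code_gen_ptr is the next 16-byte-aligned address after that.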

The next step is to verify that the generated code is actually executed. We put a breakpoint on the prologue and continue:
(gdb) b *0x7ffff0000000
Breakpoint 2 at 0x7ffff0000000
(gdb) bt
#0  0x00000000080cc7a3 in tb_find (cpu=0x853c070, last_tb=0x0, tb_exit=0x0, cf_mask=0x0)
    at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:407
#1  0x00000000080ccfb6 in cpu_exec (cpu=0x853c070) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:731
#2  0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207
#3  0x00000000080dc30c in main (argc=0x7, argv=0x7ffffffee1e8, envp=0x7ffffffee228)
    at /mnt/d/qemu-5.0.0/linux-user/main.c:872
(gdb) c
Continuing.

Thread 1 "qemu-x86_64" hit Breakpoint 2, 0x00007ffff0000000 in ?? ()
=> 0x00007ffff0000000:  55      push   %rbp
(gdb) bt
#0  0x00007ffff0000000 in ?? ()
#1  0x00000000080cbffc in cpu_tb_exec (cpu=0x853c070, itb=0x7ffff0000040 <code_gen_buffer+19>)
    at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:172
#2  0x00000000080ccd22 in cpu_loop_exec_tb (cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>,
    last_tb=0x7ffffffed9a8, tb_exit=0x7ffffffed9a0) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:619
#3  0x00000000080ccfd2 in cpu_exec (cpu=0x853c070) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:732
#4  0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207
#5  0x00000000080dc30c in main (argc=0x7, argv=0x7ffffffee1e8, envp=0x7ffffffee228)
    at /mnt/d/qemu-5.0.0/linux-user/main.c:872
(gdb)
So, from the above we can see that the prologue is called from cpu_tb_exec (accel/tcg/cpu-exec.c:172). Before stepping through it, let's check the guest RAX:
(gdb) f 4
#4  0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207
207             trapnr = cpu_exec(cs);
   0x000000000810d63d <cpu_loop+44>:    48 8b 45 e0     mov    -0x20(%rbp),%rax
   0x000000000810d641 <cpu_loop+48>:    48 89 c7        mov    %rax,%rdi
   0x000000000810d644 <cpu_loop+51>:    e8 5e f7 fb ff  callq  0x80ccda7 <cpu_exec>
=> 0x000000000810d649 <cpu_loop+56>:    89 45 dc        mov    %eax,-0x24(%rbp)
(gdb) p env->regs[0]
$7 = 0x0
(gdb)
The guest RAX currently is 0 (zero).

Line 172 of accel/tcg/cpu-exec.c, the cpu_tb_exec frame in the backtrace above, contains the following C code:
ret = tcg_qemu_tb_exec(env, tb_ptr);
And here is the definition of tcg_qemu_tb_exec:
# define tcg_qemu_tb_exec(env, tb_ptr) \
    ((uintptr_t (*)(void *, void *))tcg_ctx->code_gen_prologue)(env, tb_ptr)
No need to explain how important env is in QEMU, but the second parameter is more interesting:
(gdb) f 1
#1  0x00000000080cbffc in cpu_tb_exec (cpu=0x853c070, itb=0x7ffff0000040 <code_gen_buffer+19>)
    at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:172
172         ret = tcg_qemu_tb_exec(env, tb_ptr);
(gdb) p tb_ptr
$1 = (uint8_t *) 0x7ffff0000100 <code_gen_buffer+211> "\213]\360\205\333\017\214\036"
(gdb) p tcg_ctx->code_buf
$2 = (tcg_insn_unit *) 0x7ffff0000100 <code_gen_buffer+211> "\213]\360\205\333\017\214\036"
(gdb)
Searching the previous GDB output, we notice that tb_ptr is the same as tcg_ctx->code_buf.

So the generated host code is called like this: env is the first parameter and tcg_ctx->code_buf is the second. Under the x86_64 calling convention on Linux, the first argument is passed in RDI and the second in RSI, so RDI holds env and RSI holds tcg_ctx->code_buf. If we recall 'jmpq *%rsi' in the prologue, it is now clear how execution is transferred from the prologue to the subroutine. Likewise, 'movq %rdi, %rbp' in the prologue explains why the generated code addresses the guest state relative to RBP.
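
The hand-off described above can be mimicked with ordinary Python functions. This is a toy model, not QEMU code: prologue and generated_code are stand-ins I made up for the prologue and for the 53 generated bytes, env is a dict instead of the real CPU state, and the keys "rax", "pc" and "exit_flag" are labels of mine for the slots at offsets 0x0, 0x80 and -0x10.

```python
TB_ADDR = 0x7FFFF0000040   # TranslationBlock address from the backtrace


def generated_code(env):
    """Toy stand-in for the code at tb_ptr; returns what the host code leaves in RAX."""
    if env.get("exit_flag", 0) < 0:      # movl -0x10(%rbp),%ebx ; testl ; jl
        return TB_ADDR | 3               # leaq -0xed(%rip),%rax ; jmp 0x7ffff0000018
    env["rax"] = 0x1234567890ABCDEF      # movabsq $...,%rbx ; movq %rbx,(%rbp)
    env["pc"] = 0x40100A                 # movq $0x40100a,0x80(%rbp)
    return 0                             # jmp 0x7ffff0000016 (xorl %eax,%eax)


def prologue(env, tb_ptr):
    """Toy stand-in for calling code_gen_prologue(env, tb_ptr)."""
    rdi, rsi = env, tb_ptr               # first argument in RDI, second in RSI
    rbp = rdi                            # movq %rdi,%rbp
    return rsi(rbp)                      # jmpq *%rsi ; the epilogue's retq hands RAX back


env = {"rax": 0}
ret = prologue(env, generated_code)
print(hex(ret), hex(env["rax"]), hex(env["pc"]))
```

The point of the model is only the shape of the call: the caller passes two values, the prologue keeps the first one as the base pointer for guest state and jumps to the second one, and whatever the generated code leaves in RAX becomes the return value seen by cpu_tb_exec.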

Let's see what happens after the whole generated code has executed, by setting a breakpoint in the epilogue:
(gdb) b *0x7ffff000001f
Breakpoint 3 at 0x7ffff000001f
(gdb) c
Continuing.

Thread 1 "qemu-x86_64" hit Breakpoint 3, 0x00007ffff000001f in ?? ()
=> 0x00007ffff000001f:  c5 f8 77        vzeroupper
(gdb) ni
0x00007ffff0000022 in ?? ()
=> 0x00007ffff0000022:  41 5f   pop    %r15
(gdb)
0x00007ffff0000024 in ?? ()
=> 0x00007ffff0000024:  41 5e   pop    %r14
(gdb)
0x00007ffff0000026 in ?? ()
=> 0x00007ffff0000026:  41 5d   pop    %r13
(gdb)
0x00007ffff0000028 in ?? ()
=> 0x00007ffff0000028:  41 5c   pop    %r12
(gdb)
0x00007ffff000002a in ?? ()
=> 0x00007ffff000002a:  5b      pop    %rbx
(gdb)
0x00007ffff000002b in ?? ()
=> 0x00007ffff000002b:  5d      pop    %rbp
(gdb)
0x00007ffff000002c in ?? ()
=> 0x00007ffff000002c:  c3      retq
(gdb)
0x00000000080cbffc in cpu_tb_exec (cpu=0x853c070, itb=0x7ffff0000040 <code_gen_buffer+19>)
    at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:172
172         ret = tcg_qemu_tb_exec(env, tb_ptr);
   0x00000000080cbfd7 <cpu_tb_exec+279>:        48 c7 c0 a0 ff ff ff    mov    $0xffffffffffffffa0,%rax
   0x00000000080cbfde <cpu_tb_exec+286>:        64 48 8b 00     mov    %fs:(%rax),%rax
   0x00000000080cbfe2 <cpu_tb_exec+290>:        48 8b 80 a0 00 00 00    mov    0xa0(%rax),%rax
   0x00000000080cbfe9 <cpu_tb_exec+297>:        48 89 c1        mov    %rax,%rcx
   0x00000000080cbfec <cpu_tb_exec+300>:        48 8b 55 d8     mov    -0x28(%rbp),%rdx
   0x00000000080cbff0 <cpu_tb_exec+304>:        48 8b 45 d0     mov    -0x30(%rbp),%rax
   0x00000000080cbff4 <cpu_tb_exec+308>:        48 89 d6        mov    %rdx,%rsi
   0x00000000080cbff7 <cpu_tb_exec+311>:        48 89 c7        mov    %rax,%rdi
   0x00000000080cbffa <cpu_tb_exec+314>:        ff d1   callq  *%rcx
=> 0x00000000080cbffc <cpu_tb_exec+316>:        48 89 45 e8     mov    %rax,-0x18(%rbp)
(gdb) bt
#0  0x00000000080cbffc in cpu_tb_exec (cpu=0x853c070, itb=0x7ffff0000040 <code_gen_buffer+19>)
    at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:172
#1  0x00000000080ccd22 in cpu_loop_exec_tb (cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>,
    last_tb=0x7ffffffed9a8, tb_exit=0x7ffffffed9a0) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:619
#2  0x00000000080ccfd2 in cpu_exec (cpu=0x853c070) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:732
#3  0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207
#4  0x00000000080dc30c in main (argc=0x7, argv=0x7ffffffee1e8, envp=0x7ffffffee228)
    at /mnt/d/qemu-5.0.0/linux-user/main.c:872
(gdb) f 3
#3  0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207
207             trapnr = cpu_exec(cs);
   0x000000000810d63d <cpu_loop+44>:    48 8b 45 e0     mov    -0x20(%rbp),%rax
   0x000000000810d641 <cpu_loop+48>:    48 89 c7        mov    %rax,%rdi
   0x000000000810d644 <cpu_loop+51>:    e8 5e f7 fb ff  callq  0x80ccda7 <cpu_exec>
=> 0x000000000810d649 <cpu_loop+56>:    89 45 dc        mov    %eax,-0x24(%rbp)
(gdb) p env->regs[0]
$8 = 0x1234567890abcdef
(gdb)
Now we can see that the guest RAX holds 0x1234567890abcdef, exactly the same value we saw earlier when running 'mov' directly in GDB.

Summary


I hope you now have some idea of how TCG actually works.

But because 'mov' is very simple and straightforward, this exercise does not show the more complex parts of the job. For example, if the input instruction is an AVX instruction, TCG might need to use one of its template-based helper functions. That part is also quite challenging to understand.

Another thing is performance. While singlestep makes TCG easy to learn, in the real world TCG does the job in bulk. The TranslationBlock structure is the key player here. The input is a group of guest instructions (delimited by branch instructions), and TranslationBlocks are connected to each other with jumps. TCG also has a hash table (keyed by the address of the instruction) that caches previously translated blocks, so a new translation is only needed when a block is not found.
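
The caching idea can be sketched in a few lines of Python. This is a simplified model, not QEMU's data structure: the class names TbCache and TranslationBlock here are my own Python stand-ins, the key (pc, cs_base, flags) is borrowed from the tb_gen_code arguments in the backtrace, and the translate callback stands in for the whole translation pipeline.

```python
from dataclasses import dataclass


@dataclass
class TranslationBlock:
    """Toy stand-in for a translated block: the lookup key plus the 'host code'."""
    pc: int
    cs_base: int
    flags: int
    host_code: str


class TbCache:
    """Look up a block by (pc, cs_base, flags); translate only on a miss."""

    def __init__(self, translate):
        self._translate = translate   # hypothetical callback: (pc, cs_base, flags) -> block
        self._blocks = {}
        self.misses = 0

    def find(self, pc, cs_base=0, flags=0):
        key = (pc, cs_base, flags)
        tb = self._blocks.get(key)
        if tb is None:
            # Not translated yet: generate the block once and remember it.
            self.misses += 1
            tb = self._translate(pc, cs_base, flags)
            self._blocks[key] = tb
        return tb


def fake_translate(pc, cs_base, flags):
    return TranslationBlock(pc, cs_base, flags, host_code=f"host code for {pc:#x}")


cache = TbCache(fake_translate)
first = cache.find(0x401000)
second = cache.find(0x401000)      # served from the cache, no new translation
print(first is second, cache.misses)
```

The second lookup of the same guest address returns the block that was generated the first time, which is why code that runs repeatedly only pays the translation cost once.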

According to its documentation, TCG was based on another compiler project (to me, TCG works like a JVM). So it includes a disassembler. As far as I could find, there are three disassemblers in QEMU: TCG, disas and capstone. disas is used by TCG to print the 'PROLOGUE', 'IN' and 'OUT' log messages, so they are well formatted and easy to understand. I haven't figured out how capstone works. That may need another post if I get a chance to learn it.

Finally, I think the picture below is useful for someone who is totally new to TCG and is wondering how it works. The same question bothered me for a very long time until I could do the debug sessions above. These calling sequences are quite similar between QEMU 5 and QEMU 2. QEMU 1 and older differ much more, but I guess nobody really cares now.



