[I'm not sure what the best title for this post is, so I just used those two words. If you arrived here, either the title or something below has the keywords that you care about.]
Recently I needed to touch the TCG code in QEMU to fix something. I have been using QEMU for years, and VirtualBox and VMware for even longer. But most of the time, CPU virtualization is done by the real CPU through the kernel, via KVM in QEMU's case. I had heard about user-mode emulation, and had even read VirtualBox's documentation, which explains in some detail how they do it, along with some fancy slide decks, but I never had a chance to really dig into the source code to see exactly how it's done.
So after countless days and nights I finally reached a reasonably deep understanding. I'd better record it somewhere, because it's complex and the details are hard to remember even a couple of months later.
Below I'll describe the steps to analyze its execution flow.
I was never a real virtualization guy until last year, when I needed to fix something deep inside the virtualized CPU. So if some terms or concepts below are wrong, please forgive me and substitute whatever names you think are correct.
Ingredients
We need the QEMU 5.0.0 source code, GDB, a Linux system (I use Ubuntu on Windows Subsystem for Linux on Windows 10) on an x86_64 Intel CPU, and a hand-crafted assembly program to be emulated as a user-mode process.
Build the user mode process
QEMU can run in several different modes (https://en.wikipedia.org/wiki/QEMU#Operating_modes). The one we use here is user-mode emulation.
So we need to make a very simple user-mode process with just two assembly instructions like the following:
    global _start

    section .data
    align 16

    section .text
    _start:
        mov rax, 0x1234567890abcdef
        ret;
Then compile it:
    ray@DESKTOP:/mnt/d$ nasm -felf64 -g tcg.asm
    ray@DESKTOP:/mnt/d$ ld -g tcg.o -o tcg
Let's double check that the compiled binary does have the instruction 'mov' and its bytes in hex:
    ray@DESKTOP:/mnt/d$ gdb ./tcg
    GNU gdb (Ubuntu 9.1-0ubuntu1) 9.1
    Copyright (C) 2020 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from ./tcg...
    (gdb) set output-radix 16
    Output radix now set to decimal 16, hex 10, octal 20.
    (gdb) set disassemble-next-line on
    (gdb) show disassemble-next-line
    Debugger's willingness to use disassemble-next-line is on.
    (gdb) disassemble _start
    Dump of assembler code for function _start:
       0x0000000000401000 <+0>:     movabs $0x1234567890abcdef,%rax
       0x000000000040100a <+10>:    retq
    End of assembler dump.
    (gdb) x/10xb 0x0000000000401000
    0x401000 <_start>:      0x48 0xb8 0xef 0xcd 0xab 0x90 0x78 0x56
    0x401008 <_start+8>:    0x34 0x12
    (gdb)
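The ten bytes in the dump above can also be decoded by hand. The following is a small Python sketch of that decoding, assuming the usual x86_64 encoding of this instruction: a REX prefix with the W bit set, the opcode 0xb8 plus the register number, and an 8-byte little-endian immediate. The variable names are only for illustration.

```python
# Bytes copied from the `x/10xb 0x401000` dump above.
insn = bytes([0x48, 0xb8, 0xef, 0xcd, 0xab, 0x90, 0x78, 0x56, 0x34, 0x12])

rex = insn[0]      # 0x48: REX prefix, W bit set means 64-bit operand size
opcode = insn[1]   # 0xb8 + register number: move an immediate into a register
imm = int.from_bytes(insn[2:], "little")  # immediates are stored little-endian

rex_w = bool(rex & 0x08)   # bit 3 of the REX prefix is W
reg = opcode - 0xb8        # 0 selects RAX when REX.B is clear

print(hex(imm), rex_w, reg)
```

This is why the value appears "backwards" in the byte dump: 0xef is the lowest byte of 0x1234567890abcdef and comes first in memory.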
This piece of asm code is not perfect. It will crash if you run it directly from the shell, because _start has no caller for 'ret' to return to. But the only thing we care about is 'mov', so it's good enough for our exercise. ('ret' is not part of this exercise.)
We also run it in GDB to make sure that RAX is loaded with the value 0x1234567890abcdef:
    (gdb) b _start
    Breakpoint 1 at 0x401000
    (gdb) r
    Starting program: /mnt/d/tcg

    Breakpoint 1, 0x0000000000401000 in _start ()
    => 0x0000000000401000 <_start+0>:       48 b8 ef cd ab 90 78 56 34 12   movabs $0x1234567890abcdef,%rax
    (gdb) p $rax
    $1 = 0x0
    (gdb) si
    0x000000000040100a in _start ()
    => 0x000000000040100a <_start+10>:      c3      retq
    (gdb) p $rax
    $2 = 0x1234567890abcdef
    (gdb)
Now the user-mode target looks ok and we can move to the next step.
Build QEMU
This step is to build QEMU.
After extracting QEMU 5.0.0 source code, run:
    ray@DESKTOP:/mnt/d/qemu-5.0.0/build$ ../configure --target-list=x86_64-linux-user --extra-cflags=-g3 --enable-debug --enable-system --enable-user --enable-linux-user --enable-debug-tcg --enable-debug-info --disable-sdl --disable-vnc --disable-kvm

--enable-debug-tcg is very useful when debugging TCG. You'll see the benefit later.
Then build QEMU with make.
    ray@DESKTOP:/mnt/d/qemu-5.0.0/build$ make -j $(python3 -c 'import multiprocessing as mp; print(int(mp.cpu_count() * 1.5))')
Execution Flow
In this step we use GDB to discover how TCG decodes a single instruction, generates the intermediate code and then the real host code, and finally executes that whole bulk of code just to emulate one 'mov' instruction.
The first thing that we need to understand is how TCG does the above jobs.
TCG uses a dynamically generated function to emulate a single instruction in our case (or a batch of instructions for better performance; more on that later). That function has three parts: a prologue that saves some registers and reserves stack space, a subroutine that performs the actual work of the target instruction, and an epilogue that cleans up and returns to the caller. So we'll trace the code in the order prologue -> subroutine -> epilogue.
We start QEMU via GDB with the following command:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | ray@DESKTOP:/mnt/d/qemu-5.0.0/build$ gdb --args x86_64-linux-user/qemu-x86_64 -L / -d op,in_asm,out_asm -singlestep ../../tcg GNU gdb (Ubuntu 9.1-0ubuntu1) 9.1 Copyright (C) 2020 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from x86_64-linux-user/qemu-x86_64... warning: File "/mnt/d/qemu-5.0.0/.gdbinit" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load". To enable execution of this file add add-auto-load-safe-path /mnt/d/qemu-5.0.0/.gdbinit line to your configuration file "/home/ray/.gdbinit". To completely disable this security protection add set auto-load safe-path / line to your configuration file "/home/ray/.gdbinit". For more information about this security protection see the "Auto-loading safe path" section in the GDB manual. E.g., run from the shell: info "(gdb)Auto-loading safe path" (gdb) set output-radix 16 Output radix now set to decimal 16, hex 10, octal 20. (gdb) set disassemble-next-line on (gdb) show disassemble-next-line Debugger's willingness to use disassemble-next-line is on. (gdb) |
First, "-d op,in_asm,out_asm" tells QEMU to log the target instruction 'mov' in the 'IN' log, the intermediate code in the 'OP' log and the final host code in the 'OUT' log. This is very useful for understanding the whole process. I got these logs from the build configured with --enable-debug-tcg in the previous step.
Second, "-singlestep" tells TCG to translate instructions one at a time; otherwise it translates them in batches, which is faster.
Then, we need to do the first important step:
    (gdb) break disas_insn
    Breakpoint 1 at 0x13c792: file /mnt/d/qemu-5.0.0/target/i386/translate.c, line 4488.
    (gdb) r
    Starting program: /mnt/d/qemu-5.0.0/build/x86_64-linux-user/qemu-x86_64 -L / -d op,in_asm,out_asm -singlestep ../../tcg
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    [New Thread 0x7fffff230700 (LWP 5566)]
    PROLOGUE: [size=45]
    0x7ffff0000000:  55                       pushq    %rbp
    0x7ffff0000001:  53                       pushq    %rbx
    0x7ffff0000002:  41 54                    pushq    %r12
    0x7ffff0000004:  41 55                    pushq    %r13
    0x7ffff0000006:  41 56                    pushq    %r14
    0x7ffff0000008:  41 57                    pushq    %r15
    0x7ffff000000a:  48 8b ef                 movq     %rdi, %rbp
    0x7ffff000000d:  48 81 c4 78 fb ff ff     addq     $-0x488, %rsp
    0x7ffff0000014:  ff e6                    jmpq     *%rsi
    0x7ffff0000016:  33 c0                    xorl     %eax, %eax
    0x7ffff0000018:  48 81 c4 88 04 00 00     addq     $0x488, %rsp
    0x7ffff000001f:  c5 f8 77                 vzeroupper
    0x7ffff0000022:  41 5f                    popq     %r15
    0x7ffff0000024:  41 5e                    popq     %r14
    0x7ffff0000026:  41 5d                    popq     %r13
    0x7ffff0000028:  41 5c                    popq     %r12
    0x7ffff000002a:  5b                       popq     %rbx
    0x7ffff000002b:  5d                       popq     %rbp
    0x7ffff000002c:  c3                       retq

    Thread 1 "qemu-x86_64" hit Breakpoint 1, disas_insn (s=0x0, cpu=0x0)
        at /mnt/d/qemu-5.0.0/target/i386/translate.c:4488
    4488    {
    => 0x000000000813c792 <disas_insn+0>:   f3 0f 1e fa             endbr64
       0x000000000813c796 <disas_insn+4>:   55                      push   %rbp
       0x000000000813c797 <disas_insn+5>:   48 89 e5                mov    %rsp,%rbp
       0x000000000813c79a <disas_insn+8>:   53                      push   %rbx
       0x000000000813c79b <disas_insn+9>:   48 81 ec 48 02 00 00    sub    $0x248,%rsp
       0x000000000813c7a2 <disas_insn+16>:  48 89 bd b8 fd ff ff    mov    %rdi,-0x248(%rbp)
       0x000000000813c7a9 <disas_insn+23>:  48 89 b5 b0 fd ff ff    mov    %rsi,-0x250(%rbp)
    (gdb)

The PROLOGUE log above shows the prologue and epilogue code that TCG generated at 0x7ffff0000000. We can verify it in GDB:
    (gdb) x/19i 0x7ffff0000000
       0x7ffff0000000:      push   %rbp
       0x7ffff0000001:      push   %rbx
       0x7ffff0000002:      push   %r12
       0x7ffff0000004:      push   %r13
       0x7ffff0000006:      push   %r14
       0x7ffff0000008:      push   %r15
       0x7ffff000000a:      mov    %rdi,%rbp
       0x7ffff000000d:      add    $0xfffffffffffffb78,%rsp
       0x7ffff0000014:      jmpq   *%rsi
       0x7ffff0000016:      xor    %eax,%eax
       0x7ffff0000018:      add    $0x488,%rsp
       0x7ffff000001f:      vzeroupper
       0x7ffff0000022:      pop    %r15
       0x7ffff0000024:      pop    %r14
       0x7ffff0000026:      pop    %r13
       0x7ffff0000028:      pop    %r12
       0x7ffff000002a:      pop    %rbx
       0x7ffff000002b:      pop    %rbp
       0x7ffff000002c:      retq
    (gdb)
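One detail that can look like a mismatch: QEMU's PROLOGUE log prints `addq $-0x488, %rsp`, while GDB prints `add $0xfffffffffffffb78,%rsp`. These are the same operand, shown once as a signed and once as an unsigned 64-bit number. Here is a quick Python check; the helper name to_signed64 is made up for this sketch.

```python
MASK64 = (1 << 64) - 1

def to_signed64(value):
    """Interpret a 64-bit unsigned value as a signed two's-complement number."""
    value &= MASK64
    return value - (1 << 64) if value & (1 << 63) else value

gdb_operand = 0xfffffffffffffb78    # as printed by GDB
frame = -to_signed64(gdb_operand)   # bytes the prologue reserves on the host stack
print(hex(frame))
```

The result matches the `add $0x488,%rsp` in the epilogue, which releases the same amount of stack again.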
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | (gdb) bt #0 disas_insn (s=0x0, cpu=0x0) at /mnt/d/qemu-5.0.0/target/i386/translate.c:4488 #1 0x000000000814b5e6 in i386_tr_translate_insn (dcbase=0x7ffffffed660, cpu=0x853c070) at /mnt/d/qemu-5.0.0/target/i386/translate.c:8570 #2 0x00000000080d16fe in translator_loop (ops=0x8479220 <i386_tr_ops>, db=0x7ffffffed660, cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1) at /mnt/d/qemu-5.0.0/accel/tcg/translator.c:102 #3 0x000000000814b7c3 in gen_intermediate_code (cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1) at /mnt/d/qemu-5.0.0/target/i386/translate.c:8632 #4 0x00000000080cf6c1 in tb_gen_code (cpu=0x853c070, pc=0x401000, cs_base=0x0, flags=0x40c0b3, cflags=0xff000000) at /mnt/d/qemu-5.0.0/accel/tcg/translate-all.c:1718 #5 0x00000000080cc7a3 in tb_find (cpu=0x853c070, last_tb=0x0, tb_exit=0x0, cf_mask=0x0) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:407 #6 0x00000000080ccfb6 in cpu_exec (cpu=0x853c070) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:731 #7 0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207 #8 0x00000000080dc30c in main (argc=0x7, argv=0x7ffffffee1e8, envp=0x7ffffffee228) at /mnt/d/qemu-5.0.0/linux-user/main.c:872 (gdb) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | (gdb) finish Run till exit from #0 disas_insn (s=0x0, cpu=0x0) at /mnt/d/qemu-5.0.0/target/i386/translate.c:4488 0x000000000814b5e6 in i386_tr_translate_insn (dcbase=0x7ffffffed660, cpu=0x853c070) at /mnt/d/qemu-5.0.0/target/i386/translate.c:8570 8570 pc_next = disas_insn(dc, cpu); 0x000000000814b5d3 <i386_tr_translate_insn+88>: 48 8b 55 d0 mov -0x30(%rbp),%rdx 0x000000000814b5d7 <i386_tr_translate_insn+92>: 48 8b 45 f0 mov -0x10(%rbp),%rax 0x000000000814b5db <i386_tr_translate_insn+96>: 48 89 d6 mov %rdx,%rsi 0x000000000814b5de <i386_tr_translate_insn+99>: 48 89 c7 mov %rax,%rdi 0x000000000814b5e1 <i386_tr_translate_insn+102>: e8 ac 11 ff ff callq 0x813c792 <disas_insn> => 0x000000000814b5e6 <i386_tr_translate_insn+107>: 48 89 45 f8 mov %rax,-0x8(%rbp) Value returned is $1 = 0x40100a (gdb) finish Run till exit from #0 0x000000000814b5e6 in i386_tr_translate_insn (dcbase=0x7ffffffed660, cpu=0x853c070) at /mnt/d/qemu-5.0.0/target/i386/translate.c:8570 translator_loop (ops=0x8479220 <i386_tr_ops>, db=0x7ffffffed660, cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1) at /mnt/d/qemu-5.0.0/accel/tcg/translator.c:106 106 if (db->is_jmp != DISAS_NEXT) { => 0x00000000080d16fe <translator_loop+671>: 48 8b 45 d0 mov -0x30(%rbp),%rax 0x00000000080d1702 <translator_loop+675>: 8b 40 18 mov 0x18(%rax),%eax (gdb) finish Run till exit from #0 translator_loop (ops=0x8479220 <i386_tr_ops>, db=0x7ffffffed660, cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1) at /mnt/d/qemu-5.0.0/accel/tcg/translator.c:106 ---------------- IN: 0x00401000: 48 b8 ef cd ab 90 78 56 movabsq $0x1234567890abcdef, %rax 0x00401008: 34 12 gen_intermediate_code (cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1) at /mnt/d/qemu-5.0.0/target/i386/translate.c:8633 8633 } => 0x000000000814b7c3 <gen_intermediate_code+95>: 90 nop 0x000000000814b7c4 
<gen_intermediate_code+96>: 48 8b 45 f8 mov -0x8(%rbp),%rax 0x000000000814b7c8 <gen_intermediate_code+100>: 64 48 33 04 25 28 00 00 00 xor %fs:0x28,%rax 0x000000000814b7d1 <gen_intermediate_code+109>: 74 05 je 0x814b7d8 <gen_intermediate_code+116> 0x000000000814b7d3 <gen_intermediate_code+111>: e8 78 9b f2 ff callq 0x8075350 <__stack_chk_fail@plt> 0x000000000814b7d8 <gen_intermediate_code+116>: c9 leaveq 0x000000000814b7d9 <gen_intermediate_code+117>: c3 retq (gdb) |
    (gdb) x/1i 0x00401000
       0x401000:    movabs $0x1234567890abcdef,%rax
    (gdb) x/10xb 0x00401000
    0x401000:       0x48 0xb8 0xef 0xcd 0xab 0x90 0x78 0x56
    0x401008:       0x34 0x12
    (gdb)
    (gdb) finish
    Run till exit from #0  gen_intermediate_code (cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, max_insns=0x1)
        at /mnt/d/qemu-5.0.0/target/i386/translate.c:8633
    tb_gen_code (cpu=0x853c070, pc=0x401000, cs_base=0x0, flags=0x40c0b3, cflags=0xff000000)
        at /mnt/d/qemu-5.0.0/accel/tcg/translate-all.c:1719
    1719        tcg_ctx->cpu = NULL;
    => 0x00000000080cf6c1 <tb_gen_code+572>:        64 48 8b 04 25 a0 ff ff ff      mov    %fs:0xffffffffffffffa0,%rax
    (gdb) finish
    Run till exit from #0  tb_gen_code (cpu=0x853c070, pc=0x401000, cs_base=0x0, flags=0x40c0b3, cflags=0xff000000)
        at /mnt/d/qemu-5.0.0/accel/tcg/translate-all.c:1719
    OP:
     ld_i32 tmp11,env,$0xfffffffffffffff0
     movi_i32 tmp12,$0x0
     brcond_i32 tmp11,tmp12,lt,$L0

     ---- 0000000000401000 0000000000000000
     movi_i64 tmp0,$0x1234567890abcdef
     mov_i64 rax,tmp0
     movi_i64 tmp3,$0x40100a
     st_i64 tmp3,env,$0x80
     exit_tb $0x0
     set_label $L0
     exit_tb $0x7ffff0000043

    OUT: [size=53]
    0x7ffff0000100:  8b 5d f0                 movl     -0x10(%rbp), %ebx
    0x7ffff0000103:  85 db                    testl    %ebx, %ebx
    0x7ffff0000105:  0f 8c 1e 00 00 00        jl       0x7ffff0000129
    0x7ffff000010b:  48 bb ef cd ab 90 78 56  movabsq  $0x1234567890abcdef, %rbx
    0x7ffff0000113:  34 12
    0x7ffff0000115:  48 89 5d 00              movq     %rbx, (%rbp)
    0x7ffff0000119:  48 c7 85 80 00 00 00 0a  movq     $0x40100a, 0x80(%rbp)
    0x7ffff0000121:  10 40 00
    0x7ffff0000124:  e9 ed fe ff ff           jmp      0x7ffff0000016
    0x7ffff0000129:  48 8d 05 13 ff ff ff     leaq     -0xed(%rip), %rax
    0x7ffff0000130:  e9 e3 fe ff ff           jmp      0x7ffff0000018

    0x00000000080cc7a3 in tb_find (cpu=0x853c070, last_tb=0x0, tb_exit=0x0, cf_mask=0x0)
        at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:407
    407             tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
       0x00000000080cc786 <tb_find+84>:     8b 7d a8        mov    -0x58(%rbp),%edi
       0x00000000080cc789 <tb_find+87>:     8b 4d cc        mov    -0x34(%rbp),%ecx
       0x00000000080cc78c <tb_find+90>:     48 8b 55 d0     mov    -0x30(%rbp),%rdx
       0x00000000080cc790 <tb_find+94>:     48 8b 75 d8     mov    -0x28(%rbp),%rsi
       0x00000000080cc794 <tb_find+98>:     48 8b 45 b8     mov    -0x48(%rbp),%rax
       0x00000000080cc798 <tb_find+102>:    41 89 f8        mov    %edi,%r8d
       0x00000000080cc79b <tb_find+105>:    48 89 c7        mov    %rax,%rdi
       0x00000000080cc79e <tb_find+108>:    e8 e2 2c 00 00  callq  0x80cf485 <tb_gen_code>
    => 0x00000000080cc7a3 <tb_find+113>:    48 89 45 e0     mov    %rax,-0x20(%rbp)
    Value returned is $2 = (TranslationBlock *) 0x7ffff0000040 <code_gen_buffer+19>
    (gdb)
The 'OP' log shows the intermediate code that TCG generated to represent 'mov'. It is not real machine code, just pseudo opcodes and variables that describe the operations TCG needs to perform.
The 'OUT' log shows the real host instructions that will be executed on the host CPU. In our case, we can tell that they are real x86_64 instructions.
The intermediate code is just stored in a list and cannot be inspected directly in GDB. But the final code (the subroutine) is real code in memory, so we can easily verify it:
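To make the OP log a bit more concrete, here is a toy Python evaluator for the five ops between the `----` marker and the first exit_tb. It is only a sketch of what the ops mean, not how QEMU handles them: QEMU compiles the ops into host code (the OUT log) rather than interpreting them. The names run_ops, RAX_OFFSET and PC_OFFSET are made up for this illustration, the offsets 0x0 and 0x80 are read off the OUT log, and the three ops before the marker are left out.

```python
# In the OUT log rax is stored at 0x0(%rbp) and the next guest pc at
# 0x80(%rbp), where %rbp holds env. Both names are hypothetical.
RAX_OFFSET = 0x00
PC_OFFSET = 0x80

def run_ops(ops, env):
    """Toy evaluator for the handful of ops that appear in the OP log."""
    temps = {}
    for op, *args in ops:
        if op == "movi_i64":        # load an immediate into a temporary
            dst, imm = args
            temps[dst] = imm
        elif op == "mov_i64":       # copy a temporary into another variable
            dst, src = args
            if dst == "rax":        # rax is a global that lives inside env
                env[RAX_OFFSET] = temps[src]
            else:
                temps[dst] = temps[src]
        elif op == "st_i64":        # store a temporary into env at an offset
            src, offset = args
            env[offset] = temps[src]
        elif op == "exit_tb":       # leave the block with a return value
            return args[0]
    return None

ops = [
    ("movi_i64", "tmp0", 0x1234567890abcdef),
    ("mov_i64", "rax", "tmp0"),
    ("movi_i64", "tmp3", 0x40100a),
    ("st_i64", "tmp3", PC_OFFSET),
    ("exit_tb", 0x0),
]
env = {}
ret = run_ops(ops, env)
print(hex(env[RAX_OFFSET]), hex(env[PC_OFFSET]), ret)
```

Read this way, the block does two things: it writes the immediate into the guest's RAX, and it records 0x40100a (the address of the following 'ret') as the next guest pc before leaving the block.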
    (gdb) x/9i 0x7ffff0000100
       0x7ffff0000100 <code_gen_buffer+211>:        mov    -0x10(%rbp),%ebx
       0x7ffff0000103 <code_gen_buffer+214>:        test   %ebx,%ebx
       0x7ffff0000105 <code_gen_buffer+216>:        jl     0x7ffff0000129 <code_gen_buffer+252>
       0x7ffff000010b <code_gen_buffer+222>:        movabs $0x1234567890abcdef,%rbx
       0x7ffff0000115 <code_gen_buffer+232>:        mov    %rbx,0x0(%rbp)
       0x7ffff0000119 <code_gen_buffer+236>:        movq   $0x40100a,0x80(%rbp)
       0x7ffff0000124 <code_gen_buffer+247>:        jmpq   0x7ffff0000016
       0x7ffff0000129 <code_gen_buffer+252>:        lea    -0xed(%rip),%rax        # 0x7ffff0000043 <code_gen_buffer+22>
       0x7ffff0000130 <code_gen_buffer+259>:        jmpq   0x7ffff0000018
    (gdb) x/53xb 0x7ffff0000100
    0x7ffff0000100 <code_gen_buffer+211>:   0x8b 0x5d 0xf0 0x85 0xdb 0x0f 0x8c 0x1e
    0x7ffff0000108 <code_gen_buffer+219>:   0x00 0x00 0x00 0x48 0xbb 0xef 0xcd 0xab
    0x7ffff0000110 <code_gen_buffer+227>:   0x90 0x78 0x56 0x34 0x12 0x48 0x89 0x5d
    0x7ffff0000118 <code_gen_buffer+235>:   0x00 0x48 0xc7 0x85 0x80 0x00 0x00 0x00
    0x7ffff0000120 <code_gen_buffer+243>:   0x0a 0x10 0x40 0x00 0xe9 0xed 0xfe 0xff
    0x7ffff0000128 <code_gen_buffer+251>:   0xff 0x48 0x8d 0x05 0x13 0xff 0xff 0xff
    0x7ffff0000130 <code_gen_buffer+259>:   0xe9 0xe3 0xfe 0xff 0xff
    (gdb)
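The jump targets in this listing can be recomputed from the raw bytes, which shows where the subroutine hands control back. For an x86 relative jump the target is the address of the next instruction plus a signed 32-bit little-endian displacement. A small Python sketch (rel32_target is a made-up helper name):

```python
def rel32_target(insn_addr, insn_len, disp_bytes):
    """Target of an x86 relative jump: next instruction address + signed disp."""
    disp = int.from_bytes(disp_bytes, "little", signed=True)
    return insn_addr + insn_len + disp

# jl at 0x7ffff0000105: 0f 8c 1e 00 00 00 (6 bytes)
jl_target = rel32_target(0x7ffff0000105, 6, bytes([0x1e, 0x00, 0x00, 0x00]))
# jmp at 0x7ffff0000124: e9 ed fe ff ff (5 bytes)
jmp1_target = rel32_target(0x7ffff0000124, 5, bytes([0xed, 0xfe, 0xff, 0xff]))
# jmp at 0x7ffff0000130: e9 e3 fe ff ff (5 bytes)
jmp2_target = rel32_target(0x7ffff0000130, 5, bytes([0xe3, 0xfe, 0xff, 0xff]))

print(hex(jl_target), hex(jmp1_target), hex(jmp2_target))
```

Both jmp targets land in the prologue/epilogue listing from earlier: 0x7ffff0000016 is the `xor %eax,%eax` and 0x7ffff0000018 is the `add $0x488,%rsp` right after it. So the subroutine never returns by itself; it always jumps back into the epilogue.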
We also should pay attention to a global variable:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | (gdb) p *tcg_ctx $3 = { pool_cur = 0x85859f0 "\352\r\353\r\354\r\355\r\356\r\357\r\360\r\361\r\362\r\363\r\364\r\365\r\366\r\367\r\370\r\371\r\372\r\373\r\374\r\375\r\376\r\377\r", pool_end = 0x858cc10 "", pool_first = 0x8584c00, pool_current = 0x8584c00, pool_first_large = 0x0, nb_labels = 0x1, nb_globals = 0x24, nb_temps = 0x31, nb_indirects = 0x0, nb_ops = 0xb, code_buf = 0x7ffff0000100 <code_gen_buffer+211> "\213]\360\205\333\017\214\036", tb_jmp_reset_offset = 0x7ffff000009c <code_gen_buffer+111>, tb_jmp_insn_offset = 0x7ffff00000a0 <code_gen_buffer+115>, tb_jmp_target_addr = 0x0, reserved_regs = 0x30, tb_cflags = 0xff000000, current_frame_offset = 0x80, frame_start = 0x80, frame_end = 0x480, frame_temp = 0x84f7b40 <tcg_init_ctx+2848>, code_ptr = 0x7ffff0000135 <code_gen_buffer+264> "", temps_in_use = 0x0, goto_tb_issue_mask = 0x0, vecop_list = 0x0, code_gen_prologue = 0x7ffff0000000, code_gen_epilogue = 0x7ffff0000016, code_gen_buffer = 0x7ffff000002d <code_gen_buffer>, code_gen_buffer_size = 0x7ffefd3, code_gen_ptr = 0x7ffff0000140 <code_gen_buffer+275>, data_gen_ptr = 0x0, code_gen_highwater = 0x7ffff7ffec00 <code_gen_buffer+134212563>, tb_phys_invalidate_count = 0x0, cpu = 0x0, pool_labels = 0x0, exitreq_label = 0x8584c10, free_temps = {{l = {0x1800000000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = { 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, {l = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}}, temps = {{reg = TCG_REG_EBP, val_type = TEMP_VAL_REG, base_type = TCG_TYPE_I64, type = TCG_TYPE_I64, 
fixed_reg = 0x1, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0, mem_allocated = 0x0, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0, mem_base = 0x0, mem_offset = 0x0, name = 0x82ae35b "env", state = 0x2, state_ptr = 0x8585908}, {reg = TCG_REG_EAX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I32, type = TCG_TYPE_I32, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0, mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0, mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0xa8, name = 0x82c384c "cc_op", state = 0x3, state_ptr = 0x858590c}, {reg = TCG_REG_EAX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I64, type = TCG_TYPE_I64, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0, mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0, mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0x90, name = 0x82c3852 "cc_dst", state = 0x3, state_ptr = 0x8585910}, {reg = TCG_REG_EAX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I64, type = TCG_TYPE_I64, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0, mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0, mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0x98, name = 0x82c3859 "cc_src", state = 0x3, state_ptr = 0x8585914}, {reg = TCG_REG_EAX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I64, type = TCG_TYPE_I64, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0, mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0, mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0xa0, name = 0x82c3860 "cc_src2", state = 0x3, state_ptr = 0x8585918}, {reg = TCG_REG_EBX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I64, type = TCG_TYPE_I64, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x1, mem_allocated = 0x1, 
temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x1234567890abcdef, mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0x0, name = 0x82c38e0 <reg_names> "rax", state = 0x3, state_ptr = 0x858591c}, {reg = TCG_REG_EAX, val_type = TEMP_VAL_MEM, base_type = TCG_TYPE_I64, type = TCG_TYPE_I64, fixed_reg = 0x0, indirect_reg = 0x0, indirect_base = 0x0, mem_coherent = 0x0, mem_allocated = 0x1, temp_global = 0x1, temp_local = 0x0, temp_allocated = 0x0, val = 0x0, mem_base = 0x84f7398 <tcg_init_ctx+888>, mem_offset = 0x8, name = 0x82c38e4 <reg_names+4> "rcx", state = 0x3, --Type <RET> for more, q to quit, c to continue without paging--q Quit (gdb) |
The next step is to verify that the generated code is actually executed. We put a breakpoint at the prologue and continue:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | (gdb) b *0x7ffff0000000 Breakpoint 2 at 0x7ffff0000000 (gdb) bt #0 0x00000000080cc7a3 in tb_find (cpu=0x853c070, last_tb=0x0, tb_exit=0x0, cf_mask=0x0) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:407 #1 0x00000000080ccfb6 in cpu_exec (cpu=0x853c070) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:731 #2 0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207 #3 0x00000000080dc30c in main (argc=0x7, argv=0x7ffffffee1e8, envp=0x7ffffffee228) at /mnt/d/qemu-5.0.0/linux-user/main.c:872 (gdb) c Continuing. Thread 1 "qemu-x86_64" hit Breakpoint 2, 0x00007ffff0000000 in ?? () => 0x00007ffff0000000: 55 push %rbp (gdb) bt #0 0x00007ffff0000000 in ?? () #1 0x00000000080cbffc in cpu_tb_exec (cpu=0x853c070, itb=0x7ffff0000040 <code_gen_buffer+19>) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:172 #2 0x00000000080ccd22 in cpu_loop_exec_tb (cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, last_tb=0x7ffffffed9a8, tb_exit=0x7ffffffed9a0) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:619 #3 0x00000000080ccfd2 in cpu_exec (cpu=0x853c070) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:732 #4 0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207 #5 0x00000000080dc30c in main (argc=0x7, argv=0x7ffffffee1e8, envp=0x7ffffffee228) at /mnt/d/qemu-5.0.0/linux-user/main.c:872 (gdb) |
    (gdb) f 4
    #4  0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207
    207         trapnr = cpu_exec(cs);
       0x000000000810d63d <cpu_loop+44>:    48 8b 45 e0     mov    -0x20(%rbp),%rax
       0x000000000810d641 <cpu_loop+48>:    48 89 c7        mov    %rax,%rdi
       0x000000000810d644 <cpu_loop+51>:    e8 5e f7 fb ff  callq  0x80ccda7 <cpu_exec>
    => 0x000000000810d649 <cpu_loop+56>:    89 45 dc        mov    %eax,-0x24(%rbp)
    (gdb) p env->regs[0]
    $7 = 0x0
    (gdb)

The backtrace above also shows that the prologue is called from cpu_tb_exec at accel/tcg/cpu-exec.c:172. At that line we see the following C code:

    ret = tcg_qemu_tb_exec(env, tb_ptr);

tcg_qemu_tb_exec is a macro:

    # define tcg_qemu_tb_exec(env, tb_ptr) \
        ((uintptr_t (*)(void *, void *))tcg_ctx->code_gen_prologue)(env, tb_ptr)
    (gdb) f 1
    #1  0x00000000080cbffc in cpu_tb_exec (cpu=0x853c070, itb=0x7ffff0000040 <code_gen_buffer+19>)
        at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:172
    172         ret = tcg_qemu_tb_exec(env, tb_ptr);
    (gdb) p tb_ptr
    $1 = (uint8_t *) 0x7ffff0000100 <code_gen_buffer+211> "\213]\360\205\333\017\214\036"
    (gdb) p tcg_ctx->code_buf
    $2 = (tcg_insn_unit *) 0x7ffff0000100 <code_gen_buffer+211> "\213]\360\205\333\017\214\036"
    (gdb)

So the generated host code is called like this: the first parameter is env and the second is tb_ptr, which has the same value as tcg_ctx->code_buf here. Under the x86_64 calling convention for Linux, env is passed in RDI and tb_ptr in RSI. If we recall the 'jmpq *%rsi' in the prologue, it becomes clear how execution is transferred from the prologue to the subroutine.
Let's see what happens after the whole generated code has executed, by setting a breakpoint in the epilogue:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 | (gdb) b *0x7ffff000001f Breakpoint 3 at 0x7ffff000001f (gdb) c Continuing. Thread 1 "qemu-x86_64" hit Breakpoint 3, 0x00007ffff000001f in ?? () => 0x00007ffff000001f: c5 f8 77 vzeroupper (gdb) ni 0x00007ffff0000022 in ?? () => 0x00007ffff0000022: 41 5f pop %r15 (gdb) 0x00007ffff0000024 in ?? () => 0x00007ffff0000024: 41 5e pop %r14 (gdb) 0x00007ffff0000026 in ?? () => 0x00007ffff0000026: 41 5d pop %r13 (gdb) 0x00007ffff0000028 in ?? () => 0x00007ffff0000028: 41 5c pop %r12 (gdb) 0x00007ffff000002a in ?? () => 0x00007ffff000002a: 5b pop %rbx (gdb) 0x00007ffff000002b in ?? () => 0x00007ffff000002b: 5d pop %rbp (gdb) 0x00007ffff000002c in ?? () => 0x00007ffff000002c: c3 retq (gdb) 0x00000000080cbffc in cpu_tb_exec (cpu=0x853c070, itb=0x7ffff0000040 <code_gen_buffer+19>) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:172 172 ret = tcg_qemu_tb_exec(env, tb_ptr); 0x00000000080cbfd7 <cpu_tb_exec+279>: 48 c7 c0 a0 ff ff ff mov $0xffffffffffffffa0,%rax 0x00000000080cbfde <cpu_tb_exec+286>: 64 48 8b 00 mov %fs:(%rax),%rax 0x00000000080cbfe2 <cpu_tb_exec+290>: 48 8b 80 a0 00 00 00 mov 0xa0(%rax),%rax 0x00000000080cbfe9 <cpu_tb_exec+297>: 48 89 c1 mov %rax,%rcx 0x00000000080cbfec <cpu_tb_exec+300>: 48 8b 55 d8 mov -0x28(%rbp),%rdx 0x00000000080cbff0 <cpu_tb_exec+304>: 48 8b 45 d0 mov -0x30(%rbp),%rax 0x00000000080cbff4 <cpu_tb_exec+308>: 48 89 d6 mov %rdx,%rsi 0x00000000080cbff7 <cpu_tb_exec+311>: 48 89 c7 mov %rax,%rdi 0x00000000080cbffa <cpu_tb_exec+314>: ff d1 callq *%rcx => 0x00000000080cbffc <cpu_tb_exec+316>: 48 89 45 e8 mov %rax,-0x18(%rbp) (gdb) bt #0 0x00000000080cbffc in cpu_tb_exec (cpu=0x853c070, itb=0x7ffff0000040 <code_gen_buffer+19>) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:172 #1 0x00000000080ccd22 in cpu_loop_exec_tb (cpu=0x853c070, tb=0x7ffff0000040 <code_gen_buffer+19>, 
last_tb=0x7ffffffed9a8, tb_exit=0x7ffffffed9a0) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:619 #2 0x00000000080ccfd2 in cpu_exec (cpu=0x853c070) at /mnt/d/qemu-5.0.0/accel/tcg/cpu-exec.c:732 #3 0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207 #4 0x00000000080dc30c in main (argc=0x7, argv=0x7ffffffee1e8, envp=0x7ffffffee228) at /mnt/d/qemu-5.0.0/linux-user/main.c:872 (gdb) f 3 #3 0x000000000810d649 in cpu_loop (env=0x8544350) at /mnt/d/qemu-5.0.0/linux-user/x86_64/../i386/cpu_loop.c:207 207 trapnr = cpu_exec(cs); 0x000000000810d63d <cpu_loop+44>: 48 8b 45 e0 mov -0x20(%rbp),%rax 0x000000000810d641 <cpu_loop+48>: 48 89 c7 mov %rax,%rdi 0x000000000810d644 <cpu_loop+51>: e8 5e f7 fb ff callq 0x80ccda7 <cpu_exec> => 0x000000000810d649 <cpu_loop+56>: 89 45 dc mov %eax,-0x24(%rbp) (gdb) p env->regs[0] $8 = 0x1234567890abcdef (gdb) |
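env->regs[0] was 0x0 before the generated code ran and is 0x1234567890abcdef afterwards, so the 'mov' really was emulated. One more thing worth a look is the value the generated code returns. The OP log has two exits, `exit_tb $0x0` and `exit_tb $0x7ffff0000043`. The first jumps to 0x7ffff0000016, where `xor %eax,%eax` makes the return value 0. The second loads RAX with `lea` and jumps to 0x7ffff0000018, skipping the xor. As far as I understand QEMU's headers, the low two bits of this value carry an exit reason and the remaining bits carry a TranslationBlock pointer. Treat the mask value 3 and the name TB_EXIT_MASK below as my assumption, not something the logs show. A Python sketch of the split:

```python
TB_EXIT_MASK = 3  # assumption: the low two bits hold the exit reason

def split_exit_value(ret):
    """Split a returned value into (TB pointer bits, exit reason bits)."""
    return ret & ~TB_EXIT_MASK, ret & TB_EXIT_MASK

normal = split_exit_value(0x0)
requested = split_exit_value(0x7ffff0000043)
print([hex(v) for v in normal], [hex(v) for v in requested])
```

The pointer part of the second value, 0x7ffff0000040, is the same address that shows up as tb=0x7ffff0000040 in the backtraces above.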
Summary
Hopefully you now have some idea of how TCG actually works.
But because 'mov' is very simple and straightforward, this exercise does not show the more complex parts of the job. For example, if the input instruction is an AVX instruction, TCG might need to use one of its template-based helper functions. That part is also quite challenging to understand.
Another topic is performance. While singlestep makes TCG easy to study, in the real world TCG translates in bulk. The TranslationBlock structure is the key player here. The input is a group of instructions (delimited by branch instructions), and TranslationBlocks are connected to each other with jumps. TCG also has a hash table (keyed by the address of the instruction) that caches previously translated code, so a new translation is needed only when the lookup misses.
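The caching idea can be sketched in a few lines of Python. This is a toy model, not QEMU's code: tb_find and tb_gen_code borrow their names from the backtraces above, but the bodies are hypothetical stand-ins, and the real tb_gen_code takes more than pc (the backtrace shows pc, cs_base, flags and cflags), so the real lookup key is presumably richer than in this sketch.

```python
translated = {}        # pc -> translated block, a stand-in for the TB hash table
translate_calls = []   # records every time a translation actually happens

def tb_gen_code(pc):
    """Hypothetical stand-in for translation: note the call, return a label."""
    translate_calls.append(pc)
    return f"tb@{pc:#x}"

def tb_find(pc):
    """Return the cached block for pc, translating only on a miss."""
    tb = translated.get(pc)
    if tb is None:
        tb = tb_gen_code(pc)
        translated[pc] = tb
    return tb

first = tb_find(0x401000)    # miss: translates and caches
second = tb_find(0x401000)   # hit: reuses the cached block
print(first, second, len(translate_calls))
```

The second lookup never reaches tb_gen_code, which is the whole point: code that runs in a loop is translated once and then executed many times.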
According to its documentation, TCG was based on another compiler project (to me, TCG works like a JVM). So it includes a disassembler. As far as I found, there are three disassemblers in QEMU: TCG, disas and capstone. disas is used by TCG to print the 'PROLOGUE', 'IN' and 'OUT' log messages, so they are well formatted and easy to understand. I haven't figured out how capstone works. Maybe that needs another post if I get a chance to learn it.
Finally, I think the picture below is useful for someone who is totally new to TCG and wonders how it works. The same question bothered me for a very long time until I could do the debug sessions above. These calling sequences are quite similar between QEMU 5 and QEMU 2. QEMU 1 and older differ much more, but I guess nobody really cares about them now.