To get a systematic view of how QEMU works, we decompose it into a collection of events that share internal resources. Each event is listed below with a short description of the activities and common objects involved:
Build: the process of source code translation. The common structures of the compiler framework are shared with the Restore event if the code conversion engine is common to all emulation threads. A new code fragment, translated from the source guest code, is added to the code cache.
Restore: for performance reasons, the guest PC is only updated once, at the end of a code fragment. If a guest exception is raised in the middle of a code fragment, for example a page fault from a memory instruction, the guest PC must be restored. This is done by table lookup or by reconstruction, so that precise guest architectural state is preserved. In practice, an SVM does not generate PC information in regular code fragments, in order to reduce space overhead (QEMU adopts this strategy); reconstruction using the same compiler framework as Build is used instead. See subsection 2.2.3 Reconstruct PC Address for a detailed explanation.
Chain / Unchain: these two events modify instructions in code fragments. If the code cache is shared, issues arise such as modify-when-use and synchronization between the I and D caches. Subsection 2.4 Engineering Challenge discusses these issues further.
Flush: the code cache is out of space and all translated code is abandoned immediately. This has been discussed in subsection 2.2.1 Internal Data Structures in QEMU. After the flush, some old code fragments may still live in the code cache, but that does not mean other threads can stay within them, because a subsequent Build will insert new blocks into the code cache.
SMC: Self-Modifying Code refers to changes made to existing guest code that has already been translated. See the discussion in 2.3 Multi-Core Awareness.
Find: locate a translated code fragment using the guest PC. If the target block is not found, invoke Build to translate a new block and insert it into the code cache.
Execute: the emulation thread executes the code fragment found in Find or generated in Build, as described in 2.2.2 From Guest Instruction to Target Machine Code.
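The interplay of Find, Build, Flush and Execute can be sketched in a few lines. This is a toy model under stated assumptions, not QEMU's implementation: `CodeFragment`, `CodeCache`, `emulate` and `toy_translate` are hypothetical names, a fragment is modelled as a Python callable rather than machine code, and the cache capacity is counted in fragments rather than bytes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CodeFragment:
    guest_pc: int                     # guest address the fragment was translated from
    run: Callable[[dict], int]        # runs the fragment and returns the next guest PC


class CodeCache:
    """Toy code cache keyed by guest PC (hypothetical, capacity in fragments)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.fragments: Dict[int, CodeFragment] = {}
        self.builds = 0
        self.flushes = 0

    def find(self, guest_pc: int) -> Optional[CodeFragment]:
        # Find: look up a translated fragment by guest PC.
        return self.fragments.get(guest_pc)

    def flush(self) -> None:
        # Flush: abandon every translated fragment at once.
        self.fragments.clear()
        self.flushes += 1

    def build(self, guest_pc: int, translate) -> CodeFragment:
        # Build: translate and insert; flush first if the cache is full.
        if len(self.fragments) >= self.capacity:
            self.flush()
        fragment = translate(guest_pc)
        self.fragments[guest_pc] = fragment
        self.builds += 1
        return fragment


def emulate(cache: CodeCache, translate, cpu_state: dict, start_pc: int, steps: int) -> int:
    """Run `steps` iterations of Find -> (Build) -> Execute and return the final PC."""
    pc = start_pc
    for _ in range(steps):
        fragment = cache.find(pc)
        if fragment is None:
            fragment = cache.build(pc, translate)
        pc = fragment.run(cpu_state)      # Execute
    return pc


def toy_translate(guest_pc: int) -> CodeFragment:
    # Made-up "guest program": four fragments at 0, 4, 8, 12 that jump in a ring.
    def run(state: dict) -> int:
        state["executed"] = state.get("executed", 0) + 1
        return (guest_pc + 4) % 16
    return CodeFragment(guest_pc, run)
```

With a cache large enough for the whole ring, each guest address is translated once and then reused; with a cache that is too small, Flush fires repeatedly and the same addresses are rebuilt. Chain, Unchain, Restore and SMC are left out of this sketch.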
Figure 11: State diagram of QEMU events
[Diagram states: Find, Flush, Build, Chain, SMC, Execute, Restore]
The flow of events is illustrated as a state transition diagram in Figure 11. The event Unchain (not shown in the figure) is initiated entirely by the SIGUSR1 signal handler, asynchronously, so it can interrupt any other event. Lock acquisition for Unchain therefore needs special treatment; otherwise deadlock could occur. The transition from SMC to Restore (dashed line) stands for self-invalidation, which happens when a code fragment writes data to the guest addresses of the source code it was translated from. It mimics the hardware synchronization mechanism between the I and D caches (for example, in x86, writing instruction bytes to where the next PC points). ARM does not enforce I/D coherence automatically. Therefore, for our ARM guest, SMC is purely code invalidation that never reaches Restore and only ends in Execute.
Figure 12: Software layout in QEMU
The overall software layout of QEMU is illustrated in Figure 12. Internal events are shown in green, QEMU-allocated memories (code cache, SDRAM and Flash) in light grey, and the remaining boxes are helper functions. The CPU, memory and IO modules are included in the emulation thread, while the IO thread is responsible for interactions between QEMU and the host OS. These activities are mostly SVM interface updates (keyboard, mouse and screen) and guest hardware implemented using host system services (the RTC timer).
From the perspective of hardware replication, providing multiple cores is fairly simple. QEMU already reserves a memory pool for storing the architectural state of each VCPU, and memory and IO elements are shared among all VCPUs. Everything seems ready for parallel emulation. But QEMU still uses a sequential emulation model similar to Figure 10(a), in which all guest cores run in a non-preemptive, time-shared fashion. This implementation simplifies multi-core emulation into a lock-free setting and handles additional guest cores with ease. Shared components (mainly the code cache and its derivatives) can be accessed without any contention; ordering and atomicity are preserved by round-robin scheduling and exclusive access to the guest memory system. Likewise, adding a core merely requires allocating memory for the new architectural state, plus a little personalization such as the CPU ID or marking a portion of the guest address space for core-private devices. However, target hardware resources are poorly utilized. The time available for emulating each core is inversely proportional to the number of guest cores. Even worse, when guest threads are evenly distributed over all virtual processors, QEMU suffers from slow guest code execution, since it has no way to tell when and where the guest code is making progress.
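The inverse proportionality follows directly from round-robin scheduling on a single host thread. The sketch below is only an illustration of that arithmetic; `run_round_robin` is a hypothetical helper, and a "slice" stands for one non-preemptive turn of a guest core.

```python
from collections import Counter


def run_round_robin(num_vcpus: int, total_slices: int) -> Counter:
    """Count how many time slices each VCPU receives on ONE host thread."""
    slices = Counter()
    current = 0
    for _ in range(total_slices):
        slices[current] += 1                  # "emulate" one slice of this guest core
        current = (current + 1) % num_vcpus   # then hand over to the next core
    return slices


def share_per_vcpu(num_vcpus: int, total_slices: int) -> float:
    """Fraction of host time that guest core 0 receives."""
    return run_round_robin(num_vcpus, total_slices)[0] / total_slices
```

Because only one guest core runs at a time, no lock is needed around shared state, but each of N cores receives only about 1/N of the host thread's time.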
PQEMU relaxes QEMU's emulation model, in which only one event is active at a time, and lets all emulation threads run freely as in Figure 10(b). The code cache and the translation engine are shared in the current PQEMU implementation. Translated code can be reused by all emulation threads, since each VCPU keeps its own architectural state in registers (the VM manager sets this up before entering the code cache). Although different guest emulation threads may contend for the code cache, such incidents are relatively infrequent for many parallel applications. Besides, this implementation takes less engineering effort, since no special work is needed to keep the code cache coherent.
[Table header: Write Event (Build, Restore, Chain, Unchain, Flush, SMC) | Read Event (Find, Execute); row group: Write]
Table 1: Disposition of event combinations in PQEMU
With parallel emulation, many events from different VCPU threads happen simultaneously, so serialization must be enforced to manipulate shared, writable objects correctly. We tabulate all event combinations and their possible dispositions in Table 1, where S indicates serialization and X means don't care (free run). The following properties explain the reasons for the serialized combinations (marked S) in Table 1. Unlike the read events (Find and Execute), every write event is serialized with itself (S on the diagonal of Table 1).
• Build and Restore share the same translation engine, so a lock is required.
• Eliminating a code fragment (Flush and SMC) is permitted only when it is not being referenced by another thread (in Build, Restore, Chain, Unchain, Find or Execute). Apart from Build, the other five events operate on code fragments in the code cache (even if only for comparison, as in Restore when recovering the faulting guest PC address), and a sudden removal of a code fragment would break them. Build is in the list because it registers identifiers for new code fragments and a pointer to free code cache space before translation begins. In addition, Flush and SMC are themselves serialized to prevent incomplete code erasure.
• Execute and Chain / Unchain raise problems of target cache synchronization and code modification, as stated above. A lock is required.
• Build and Find are serialized to prevent the same guest address from being translated more than once. This could be removed for better performance, since such situations are rare and serialization here is overkill. If the serialization is taken away, Restore needs an extra check on the validity of the code fragment, in case another thread is in the Build state at the same time. No code invalidation takes place between two successive Builds, because Flush and SMC are serialized with all other events, so two code fragments for the same guest address will end up in the code cache. The redundant code fragment is used once by the thread that generated it and then sits unreferenced until it is removed (by Flush or SMC).
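The rules in the list above can be written down as a small lookup function. This is a sketch under stated assumptions and does not reproduce Table 1 itself: `disposition` is a hypothetical helper, any pair the list does not mention is assumed to be X, and the Chain / Unchain pair is taken as S because the two events modify instructions in the same code fragments.

```python
WRITE_EVENTS = ("Build", "Restore", "Chain", "Unchain", "Flush", "SMC")
READ_EVENTS = ("Find", "Execute")
ERASERS = {"Flush", "SMC"}            # events that eliminate code fragments
PATCHERS = {"Chain", "Unchain"}       # events that modify instructions in fragments


def disposition(a: str, b: str, relax_build_find: bool = False) -> str:
    """Return 'S' (serialize) or 'X' (free run) for two concurrent events."""
    pair = frozenset((a, b))
    if a == b:
        # Write events serialize with themselves; read events do not.
        return "S" if a in WRITE_EVENTS else "X"
    if pair & ERASERS:
        # Flush and SMC serialize with every other event, and with each other.
        return "S"
    if pair == {"Build", "Restore"}:
        return "S"                    # shared translation engine
    if "Execute" in pair and pair & PATCHERS:
        return "S"                    # cache synchronization and code modification
    if pair == PATCHERS:
        return "S"                    # ASSUMPTION: both patch the same fragments
    if pair == {"Build", "Find"}:
        # Avoids duplicate translation; optional, as discussed in the last bullet.
        return "X" if relax_build_find else "S"
    return "X"                        # ASSUMPTION: unmentioned pairs run freely
```

The `relax_build_find` flag corresponds to removing the Build / Find serialization for performance, at the cost of occasional redundant code fragments.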
Figure 13: Lock scheme in current PQEMU
To increase parallelism and performance effectively, locks are applied with fine-grained control and carry either a weak or a strong attribute. A strong lock equals exclusive access in QEMU: it keeps all other emulation threads out of the common code cache (equivalent to a pause). A weak lock corresponds to an ordinary lock that blocks only the relevant accesses. In other words, PQEMU reduces the lock strength from one big strong lock in the original QEMU to a few small, dispersed ones.
[Diagram: events Find, Build, Restore, Execute, Chain, Unchain, Flush and SMC in PQEMU, annotated with RWlock A (read lock), RWlock A (write lock), Spinlock B and Spinlock C]
Figure 14: Lock protection scheme in current PQEMU
The lock scheme in PQEMU is illustrated in Figure 13. It conforms to the dispositions in Table 1, with the S for Build and Find aggressively turned into X. To prevent code fragments from being wiped out while they are in execution, RWlock A is used in read/write mode as the only weak/strong lock in PQEMU. Whenever a thread moves to a state that refers to an existing code fragment, it must first acquire A as a read lock. Flush and SMC, in turn, acquire A as a strong write lock, so that no other thread is allowed to run at the same time. The read lock on A is acquired at the beginning of the VM manager (the transition to state Find in Figure 11) and released at the end of the emulation cycle (on the way back to Find). Although code fragments could in theory be guarded at a finer, per-fragment granularity, we treat the entire code cache as one unit for engineering simplicity. Spinlock B protects the common compiler-related data structures, and spinlock C serializes the instruction modification requests from Chain and Unchain. An emulation thread must hold the read lock on A before acquiring B or C for Build / Restore or Chain / Unchain, to keep the code fragment from being wiped out, or simply switch to the write lock on A if it moves to the Flush or SMC state. The whole lock protection scheme is illustrated in Figure 14. With this hierarchical lock layout, we are free of deadlock, and concurrency on shared objects is preserved across all emulation threads.
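The lock ordering described above (read lock on A first, then B or C; write lock on A for Flush and SMC) can be sketched as follows. This is a minimal illustration and not PQEMU code: `RWLock` and `SharedCodeCache` are hypothetical classes, Python mutexes stand in for the spinlocks, code fragments are plain strings, and the reader/writer lock has no fairness policy.

```python
import threading
from contextlib import contextmanager


class RWLock:
    """Minimal reader/writer lock standing in for RWlock A (no fairness policy)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()


class SharedCodeCache:
    def __init__(self):
        self.lock_a = RWLock()             # weak (read) / strong (write) lock
        self.lock_b = threading.Lock()     # stands in for spinlock B (translation engine)
        self.lock_c = threading.Lock()     # stands in for spinlock C (code patching)
        self.fragments = {}                # guest PC -> translated fragment
        self.links = {}                    # guest PC -> chained successor PC

    def find_or_build(self, pc):
        with self.lock_a.read():           # Find needs only the read lock on A
            fragment = self.fragments.get(pc)
            if fragment is None:
                with self.lock_b:          # Build: A (read) first, then B
                    fragment = self.fragments.setdefault(pc, f"frag@{pc:#x}")
            return fragment

    def chain(self, src, dst):
        with self.lock_a.read():           # Chain: A (read) first, then C
            with self.lock_c:
                if src in self.fragments and dst in self.fragments:
                    self.links[src] = dst

    def flush(self):
        with self.lock_a.write():          # Flush: strong lock, everyone else is out
            self.fragments.clear()
            self.links.clear()
```

Every path takes A before B or C and never holds B and C together, so no cycle of lock dependencies can form. One simplification to note: `setdefault` under B keeps a single fragment per guest address, whereas the relaxed Build / Find scheme in the text tolerates a redundant fragment.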
Next, we discuss issues with memory. On a weakly-ordered memory architecture, the question is how to enforce effectively, on the target machine, the atomicity that the source ISA imposes. ARM defines two main groups of atomic instructions, based on swap and on exclusive mechanisms. The swap instruction swp acts like the exchange instruction xchg in x86, except that it can operate on two distinct registers (one source and one destination register). This rules out realizing swp directly with xchg unless the destination and source registers are identical. PQEMU realizes this coalesced memory load and store by taking a common spinlock X (different from those in CPU) at the start of swp, analogous to the effect of the #LOCK prefix in x86 and #ASSERT on ARM. But this occupies the whole system bus and could seriously limit scalability.
A more elaborate approach to exclusion was introduced in ARM after ARMv5. The new approach (i.e. ldrex and strex) is similar to the load-linked and store-conditional pair on MIPS. A common table records every load-linked address that is still waiting for a closing store-conditional instruction, and an address is erased if it collides with the address of a subsequent store. Any store-conditional whose address is not in the table is regarded as a failed attempt: no data is written to memory, and a failure value is returned. However, monitoring all store instructions in software is inefficient and costly in PQEMU. As a compromise, a snapshot of the memory content is appended to each table entry. If strex operates on an address that is not in the table, or the data in guest memory does not agree with the snapshot taken at the time of ldrex, the attempt is unsuccessful.
We defer the generation of such guest atomic instructions until the final code emitting stage, to avoid the larger code size and complex control that would result if code generation were performed with IR at earlier stages. The dedicated emitters are akin to those for ordinary memory accesses, except that they insert additional spinlock code for the common table at the start and end of the output code fragment. The sequence of execution is determined by the target hardware at run-time, as on a real machine, rather than predefined by QEMU. This resemblance is a benefit when the emulator is used to test a multi-threaded program, a scenario that the round-robin emulation model (Figure 10(a)) could never reproduce (even with aggressively randomized ordering, it is sequential from the viewpoint of executing guest instructions). PQEMU honors guest program order in translation, and there is no memory reordering in output code fragments, since PQEMU knows nothing about aliasing.
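The snapshot compromise for ldrex and strex can be illustrated with a short model. It is a sketch under stated assumptions, not the emitted code: `ExclusiveMonitor` and `atomic_add` are hypothetical names, guest memory is a Python dict, the table is keyed by VCPU id (the text does not say how the table is keyed), and a mutex stands in for the spinlock on the common table.

```python
import threading


class ExclusiveMonitor:
    """Software model of the ldrex/strex table with content snapshots."""

    def __init__(self, memory: dict) -> None:
        self.memory = memory              # guest memory: address -> word
        self.table = {}                   # ASSUMPTION: vcpu id -> (address, snapshot)
        self.lock = threading.Lock()      # stands in for the spinlock on the table

    def ldrex(self, cpu: int, addr: int) -> int:
        with self.lock:
            value = self.memory[addr]
            self.table[cpu] = (addr, value)    # record address and content snapshot
            return value

    def strex(self, cpu: int, addr: int, value: int) -> int:
        """Return 0 on success and 1 on failure, as strex does on ARM."""
        with self.lock:
            entry = self.table.pop(cpu, None)
            if entry is None or entry[0] != addr:
                return 1                       # address not in the table
            if self.memory[addr] != entry[1]:
                return 1                       # memory changed since ldrex
            self.memory[addr] = value
            return 0


def atomic_add(monitor: ExclusiveMonitor, cpu: int, addr: int, delta: int) -> None:
    # Typical guest retry loop: repeat ldrex/strex until the store succeeds.
    while True:
        old = monitor.ldrex(cpu, addr)
        if monitor.strex(cpu, addr, old + delta) == 0:
            return
```

Ordinary stores do not touch the table in this model, which is the point of the compromise: an intervening store is detected only through the changed content. One consequence is that a store sequence that restores the original value between ldrex and strex goes unnoticed, since the comparison is by value.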
Last, the ARM11 multi-core processor supports an N-N interrupt delivery model, in which all cores are informed simultaneously as soon as an interrupt is received. Hardware can achieve this by asserting a signal wire within one cycle, but QEMU can only notify the emulation threads sequentially, which leads to unbalanced utilization. PQEMU addresses this by asserting all core-private interrupt-notification flags in a round-robin order of guest cores, making request serving fairer.
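One possible reading of the round-robin notification order is that the core notified first rotates with every interrupt. The sketch below illustrates that reading only; `InterruptDistributor` and its fields are hypothetical, and the actual PQEMU data structures are not described in the text.

```python
class InterruptDistributor:
    """Toy model: set every core-private flag, rotating which core goes first."""

    def __init__(self, num_cores: int) -> None:
        self.num_cores = num_cores
        self.pending = [False] * num_cores   # core-private interrupt-notification flags
        self._first = 0                      # core notified first on the next interrupt

    def raise_irq(self) -> list:
        """Assert all flags and return the order in which cores were notified."""
        order = [(self._first + i) % self.num_cores for i in range(self.num_cores)]
        for core in order:
            self.pending[core] = True
        # ASSUMPTION: rotate the starting core so no core is always notified first.
        self._first = (self._first + 1) % self.num_cores
        return order
```

Over many interrupts each core is notified first equally often, so no single core consistently gets a head start in serving requests.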
Figure 15 shows the revised software layout in PQEMU for a dual-core system (we omit the detailed boxes for simplicity). Most modules are identical to those of unmodified QEMU (Figure 12), except that there are multiple emulation threads for a system with multi-core support, and the code cache is shared among them (labelled unified to distinguish it from the alternative of separate code caches).
Figure 15: Revised software layout for a dual-core guest in PQEMU
[Diagram: emulation threads group containing emulation thread #0 (CPU 0) and emulation thread #1 (CPU 1), with Memory, IO and the unified code cache, alongside the IO thread]