A quick overview of the Scan Before Execute (SBE) strategy
==========================================================

The SBE strategy was chosen as our initial implementation because it
was simpler to implement, and thus less prone to obscure bugs.  In a
nutshell, SBE works on a page-by-page basis, dynamically creating
virtualized code pages which mirror the real code pages.  The only
modifications currently necessary are to replace the first byte of
each instruction we want to virtualize with an INT3 instruction; the
rest of the instructions are identical.  We use a trick which exploits
the split instruction/data (I&D) TLB caches to temporarily place the
modified code page in the address space where the real code page
lives.  This effectively makes the page 'execute-only', an attribute
not naturally provided by IA32 paging, and simplifies a lot of things.

As mentioned, the main benefit of this strategy is its simplicity.
The 'translated' code is merely an echo of the real code, with
breakpoint instructions replacing instructions which need to be
virtualized.  Code offsets are maintained, and thus there is no need
to store meta information regarding where translated code fragments
are stored, etc.

The downside of this strategy is that the technique for virtualizing
certain instructions (replacing them with INT3) is expensive.  Each
execution of such an instruction invokes an exception, which means a
transition from ring3 (where the guest executes) to ring0 (where the
monitor executes), processing in the monitor, and a return transition
back to the guest.  Since the INT3 is not specific to any particular
virtualized instruction, processing has to be more generic and thus
even more expensive.  If there are a large number of out-of-page
branches (for example, C function calls outside of the page) or
calculated branches (for example, optimized switch statements or
dereferenced function calls), then the performance penalties can be
significant.
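As a hedged sketch of the page-mirroring step described above (the
function and parameter names here are invented for illustration, not
actual plex86 code), building an SBE shadow page amounts to copying
the real page and stamping INT3 over the first byte of each
virtualized instruction:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096
#define OPCODE_INT3 0xCC   /* one-byte breakpoint instruction */

/* Build the virtualized mirror of one guest code page.
 * 'virt_offsets' holds the page offsets of the instructions to be
 * virtualized; only the first opcode byte of each is replaced with
 * INT3, and the rest of the page is an identical echo of the real
 * code.  (Hypothetical helper, not actual plex86 code.) */
static void sbe_make_shadow_page(const uint8_t *guest_page,
                                 uint8_t *shadow_page,
                                 const unsigned *virt_offsets,
                                 unsigned n_offsets)
{
    unsigned i;

    /* The shadow page starts as a byte-for-byte copy... */
    memcpy(shadow_page, guest_page, PAGE_SIZE);

    /* ...then each virtualized instruction gets a breakpoint. */
    for (i = 0; i < n_offsets; i++)
        shadow_page[virt_offsets[i]] = OPCODE_INT3;
}
```

Because only single bytes are overwritten, all code offsets within
the page are preserved, which is exactly what lets SBE avoid storing
translation meta information.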
Quasi dynamic translation: allowing for more flexible generated code
====================================================================

Rather than generating modified (breakpointed) code which is very
rigid, in the sense that instructions are always placed at exactly the
same offsets as in the original code, I propose here a "quasi dynamic
translation" scheme, which allows for more flexibility and performance
in the generated code.  Here, the generated code sequence for a given
guest instruction can be arbitrary.  This gives us the flexibility to
insert calls to handler code, rather than generating traps to the VM
monitor, as well as a place to add instrumentation in the future.
However, the general design philosophy is in fact to generate code
which is a simple 1:1 translation whenever possible; in other words,
we just pass the instruction through unmodified.

Traditional dynamic translation techniques tend to be quite heavy,
especially when emulating IA32, due to the modeling of the various
protection checks, 8/16/32-bit register usage, arithmetic flags
handling, scheduling of native registers across static and dynamic
branches, etc.  Thus, for a generic IA32-to-any-architecture dynamic
translation, there is often a large amount of code overhead that needs
to be generated to handle certain IA32 code properly.  But since
plex86 is specific to IA32-capable hosts, we can heavily exploit the
natural mechanisms whenever possible, and eliminate such overhead and
complexity in the translation.  A large amount of IA32 code consists
of user-oriented instructions which access the general registers,
arithmetic flags, and user segment registers.  It is quite possible to
use these instructions as-is, eliminating much of the management of
such constructs, without worrying about the scheduling of resources
across branches.  The reason I call it "quasi" dynamic translation is
that we really don't do much translation at all.
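To make the 1:1 philosophy concrete, here is a minimal sketch in C
(the names, and the choice of a near CALL to a common handler, are
assumptions for illustration, not the actual plex86 emitter): a
pass-through instruction is copied verbatim into the translation
buffer, while one needing virtualization is replaced by a call.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned len;         /* guest instruction length in bytes */
    int      needs_virt;  /* nonzero if it must be virtualized */
} insn_info_t;

/* Emit translated code for one guest instruction into 'out'.
 * Pass-through instructions are copied unmodified; virtualized ones
 * are replaced by a 5-byte near CALL (0xE8 rel32) to a handler
 * routine.  Returns the number of bytes emitted.  (Illustrative
 * sketch only; a real emitter would compute rel32 per call site.) */
static unsigned emit_insn(const uint8_t *guest, const insn_info_t *info,
                          uint8_t *out, int32_t handler_rel32)
{
    if (!info->needs_virt) {
        memcpy(out, guest, info->len);   /* simple 1:1 pass-through */
        return info->len;
    }
    out[0] = 0xE8;                        /* CALL rel32 opcode */
    memcpy(out + 1, &handler_rel32, sizeof handler_rel32);
    return 5;
}
```

Note how the common case costs nothing beyond the copy: the guest's
own register and flags usage is left to execute natively.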
Using the guest data segments directly
======================================

All guest data segments {ES,SS,DS,FS,GS} are virtualized by the
monitor such that they can be used directly by translated code, thus
most translated code can be a 1:1 pass-through.  Only the DPL is
currently adjusted.  This lets the native segmentation limit/access
checks do the work for us.  Similarly, as long as we are careful that
the general registers and the arithmetic flags contain the values
expected by the guest code, we can pass through instructions which
make use of them.  As such, whenever we generate code which is not a
mere pass-through of guest instructions, we need to save/restore such
guest register state, so that the assumption of native use will be
valid for the following code.

Virtualizing the guest CS segment accesses
==========================================

The guest code segment, CS, needs to be virtualized differently.
Since we would be executing code from a translation buffer, the
addresses generated by accessing the CS segment are unrelated to those
in the real guest CS segment.  In general practice, data accesses via
a 'CS:' prefix are not very common.  A simple approach is to
virtualize these instructions so they are emulated by the monitor.
Perhaps later, we could generate code to handle these instructions
directly, or generate calls to special handler routines.

The main use of the CS segment is, of course, from branch
instructions.  Some branch instructions involve a simple static
offset, which is applied to the CS.base value to determine the linear
address of the target code.  Other branch instructions use a
calculated (dynamic) offset, which cannot be determined at translation
time.

Dynamic branch targets
======================

We need a mechanism for dynamically determining if there is already
translated code for a given guest target address.  If so, then we can
jump to it.
If not, then code at the target address must be translated first,
followed by a jump to the newly translated code.  The translation
could be performed either in the ring0 monitor space, or by ring3
handler code; the monitor space is a good first-effort place to do it.

To accomplish this, we can generate code which calls a common branch
handler routine.  This routine would be responsible for saving any
guest state it needs to modify, performing the branch target lookup,
invoking a translation routine if necessary, restoring any modified
guest state, and transferring to the translated code.  It is a more
efficient use of translated-code buffer space and cache-lines to
centralize this sort of functionality into a few optimized handler
routines, rather than inline it in many places in generated code.

Static branch targets
=====================

A simple-minded approach to handling these would be to use the same
handler routines as for dynamic branch targets.  This might be a good
initial approach, as it is quite simple, though it does not produce
optimal code.  At the other end of the spectrum, we could first
translate the code at the target location, and then generate a direct
jump to that translated code.  This eliminates the intermediate branch
handling routine.  At first glance, this seems like an obvious,
end-all solution.  However, consider what happens when the guest
modifies its own code.  Such modifications require an invalidation of
the associated translated code.  Imagine that there may be many
translated sequences of code which make calls to a routine which has
been modified.  If we had hardcoded calls to the invalidated routine
in the calling code, then we would need a way to propagate the
invalidation back to the calling instructions, and to the instructions
which called those instructions, etc.  Thus, if we hardcode branches,
we need to maintain a representation of control flow as we dynamically
translate code.  Or simply flush all translated code.
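One possible shape for the branch target lookup used by the common
branch handler (a sketch under invented names; the actual data
structure is an open design choice) is a small direct-mapped hash from
guest linear address to translated-code address:

```c
#include <stdint.h>
#include <stddef.h>

#define TB_HASH_SIZE 1024   /* power of two; arbitrary for this sketch */

typedef struct {
    uint32_t guest_laddr;   /* guest linear address of the target */
    void    *xlated_code;   /* corresponding translated code, or NULL */
} tb_entry_t;

static tb_entry_t tb_hash[TB_HASH_SIZE];

/* Record a translation for a guest branch target. */
static void tb_record(uint32_t guest_laddr, void *xlated_code)
{
    tb_entry_t *e = &tb_hash[(guest_laddr >> 2) & (TB_HASH_SIZE - 1)];
    e->guest_laddr = guest_laddr;
    e->xlated_code = xlated_code;
}

/* Look up a guest branch target: returns the translated-code address
 * on a hit, or NULL if the target must be translated first. */
static void *tb_lookup(uint32_t guest_laddr)
{
    tb_entry_t *e = &tb_hash[(guest_laddr >> 2) & (TB_HASH_SIZE - 1)];
    if (e->xlated_code != NULL && e->guest_laddr == guest_laddr)
        return e->xlated_code;
    return NULL;
}
```

On a miss, the branch handler would invoke the translator and then
record the result with tb_record before transferring to the new code.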
There are ideas which lie somewhere in the middle of these two
approaches.  For example, if we look at regions of code as single
entities, then static branches within a region could be generated
simply as static branches to generated code.  Branches outside the
defined region could be coded to use the dynamic branch handler.
Invalidation of code within a given region could then invalidate only
that region of translated code.  This would eliminate the need for
maintaining control flow information, at the expense of introducing
extra overhead for longer branches.  The notion of a region is
arbitrary, but perhaps some multiple of the 4k page size is
convenient.  Viewing code in atomic regions also has the advantage of
allowing code regions to be re-optimized as entities, without
interfering with external calling code, for much the same reasons as
with code invalidation.

Handling self-modifying code
============================

As in the SBE method, guest code pages for which we have
corresponding translated code are marked with read-only permissions.
The resulting page fault gives us the opportunity to invalidate the
translated code related to that page, or to the specific range of
affected addresses.

Handling self-examining code
============================

This is not a problem, as data accesses work naturally, and see the
actual untranslated guest code.  Data accesses via the CS: prefix are
virtualized.

Correlating translated instructions with their guest addresses
==============================================================

It is important to maintain information which allows for the
correlation between the addresses of translated code and the
associated guest code.  The translation logic needs to convert from a
guest address to a translated code address, and the exception handling
logic needs to convert in the other direction, to find what real guest
address effectively caused the exception.
At least some of this information needs to be mapped into the address
space such that the special branch handling routines can function.

An illustrated view of how it would work
========================================
In the diagram, there is an assembly code fragment which represents
some arbitrary code we are translating at a given time.  Imagine that
we are translating beginning at 'label10'.  Graphic (a) depicts, in
the abstract, the control flow of the instruction stream as the
translator scans it.  Non-branch instructions, such as the 'addl',
have only a forward flow, as the next instruction is implicitly
executed.  Conditional branch instructions have two possible targets:
the specified branch target if the condition is met, and the next
instruction if it is not.  For example, the 'jcs' and 'jz'
instructions are conditional branches.  In implementation, these
instructions might be translated by generating branches to branch
handling routines.  Unconditional branches, such as the 'ret'
instruction, have only one possible flow.  For each potential path of
control flow where the target is known, the translator can reiterate
the process.  When the flow hits an unconditional branch, the
translation process can end.

Now imagine that at some point in the future, we begin translating at
a new address, 'label9'.  As we have already translated the sequence
at 'label10' which follows, we would not necessarily want to
retranslate that sequence.  Since the newly generated code would be
located at an arbitrary place in the translation buffer, it is not
easy to align it so that execution will naturally flow into the
translated code sequence for 'label10'.  Thus, we can simply code a
branch to it instead, creating a sort of artificial control flow,
depicted in (b).

Re-optimizing translated code
=============================

Because of the timing of when the translation logic encounters certain
code sequences, the situation in (b) will occur occasionally.  It is
not a huge loss, but jumps do flush the instruction pipelines, wasting
cycles.  If such an instruction is involved in a tight loop, the
impact can be substantial.
We could view figure (b) instead as (c), noting that the extra colored
shim represents an extra branch placed there because the code
sequences are not really contiguous.  If this translated code is not
executed much, there is not much point in optimizing it.  However, if
it is evident, from whatever usage data we can gather, that this
translated code is used frequently, there is an optimization which can
be done.  It is possible to regenerate the code region, reordering
translated sequences such that logically contiguous ones are placed
contiguously in the buffer, eliminating the extra branch instructions
previously imposed by the translation engine.  This is depicted in (d).

This reordering could be done occasionally by the VM monitor, though
if the translation information is available to the host user space
program, it may be possible to offload it to a separate user space
thread.  This would allow such a thread to execute on an idle
processor.  This process could look for code regions worth optimizing,
re-order them, and then post the results to the monitor.  The monitor
would be free to accept the changes, or to ignore them based on
updates to the region or other reasons.

Dealing with 16-bit code
========================

[Fill this in]