Fast Host Service Interface Design for Embedded Java Application Processor

Kuan-Nian Su

Department of Computer Science National Chiao Tung University

Hsinchu, Taiwan

Chun-Jen Tsai

Department of Computer Science National Chiao Tung University

Abstract: This paper proposes a fast inter-processor communication (IPC) interface for a dual-core Java application processor. The dual-core Java application processor is an SoC composed of a RISC core and a double-issue Java bytecode execution core. The proposed fast IPC mechanism gives Java system software a high-level way to invoke a host processor service routine from Java source code. The proposed IPC has much lower overhead than the standard Java Native Interface (JNI). Unlike other fast native call interfaces, which are designed for VM interpreters, the proposed IPC mechanism is designed exclusively for communication between two physical, hardwired processor cores. Based on the experimental results, the proposed mechanism is very promising for embedded Java runtime environments.

I. INTRODUCTION

The Java runtime environment is becoming very important for embedded applications. Because of its machine-code-level portability across different operating systems and processors, it has been selected by many standards organizations (such as 3GPP and DVB) as the standard application environment for multimedia-capable mobile devices and set-top boxes [6][7]. A dual-core Java application processor was proposed in [1]. The Java application processor is composed of a host processor core (a PowerPC 405 in [1]) and a double-issue Java bytecode execution core. The latter is referred to as the Bytecode Execution Engine (BEE) core.

Fig. 1 shows the proposed joint software-hardware architecture, which supports the full Java runtime environment (JRE). When a Java application is executed, the class loader running on the host processor core loads and parses the main class file into the runtime method/class data structures and stores the runtime information in the method area. The BEE core is then initialized to fetch and execute the bytecodes of the application class files from the method area. To reduce the implementation cost of the Java bytecode execution core logic, some complex operations, such as floating-point operations and system resource access operations (e.g., memory allocation, media accelerators, and I/O services), are implemented on the host processor core as service routines. The communication between the BEE core and the host processor core must go through an efficient inter-processor communication interface.

[Figure: block diagram containing the Bytecode Execution Engine (BEE), the Method Area, the Dynamic Class Loader and Verifier, and the Host Processor (PPC 405), linked by the IPC.]

Fig. 1 Proposed dual-core JRE architecture.

Ideally, the IPC should allow Java source code to invoke a host processor service routine. One possible approach is to implement the Java Native Interface (JNI) [2]. JNI is designed to handle situations where Java applications need to call a library implemented in native binary form (usually a dynamically loaded library). As a two-way interface, JNI supports two types of function calls: down-calls and up-calls. In a down-call, a Java application invokes a native function; in an up-call, a native function invokes a JNI interface function to access Java application resources such as fields and data. Because of its general and flexible nature, the overhead of JNI is fairly high. It involves call stack conversion as well as dynamic loading and calling of native functions (handled automatically by the OS). Therefore, it is not efficient enough for the one-way host system service invocation needed in the proposed dual-core JRE architecture.

Most VM implementations leave a back-door interface for fast native system function calls. However, these designs generally target software-based VM interpreters, where calling native operating system services from a native VM interpreter application is quite straightforward. In this paper, we propose the design of a Fast Host Service Interface (FHSI) for invoking host system routines from the Java processor. A special Java class, mmes.native, is implemented as a one-way BEE-to-host calling interface that makes service calls to the host processor transparent to Java source programs. Any reference to a method of this special class from a Java application is intercepted at the bytecode level by the BEE core and turned into a native call to a host service routine. Neither stack conversion nor data structure conversion is necessary for such native calls, since all parameters are passed to the host processor directly by exporting the internal Java stack to the host processor through a memory-mapped I/O mechanism. FHSI aims to provide an extensible and efficient design for inter-processor communication between the host processor and the Java bytecode processor.

The paper is organized as follows. Section II describes the details of FHSI, including runtime method resolution and the parameter passing mechanism. Section III describes the implementation platform and shows some experimental results. Some concluding remarks are given in Section IV.

II. FAST HOST SERVICE INTERFACE

The proposed Fast Host Service Interface has two major steps. First, the Java application invokes a method defined in mmes.HostService through fast dynamic method resolution. Then, an interrupt is raised to pass the arguments to the host system service routine.

A. Fast Dynamic Method Resolution

In our proposed dual-core JRE architecture, the dynamic class loader is a routine running on the host processor core. It locates and loads Java classes at the request of the BEE core (triggered by, for example, a “new” instruction). This dynamic class loader should not be confused with the Java class loader. A Java application can have more than one class loader, but the dynamic class loader running on the host processor is unique. This loader converts class files into Java runtime information images and incorporates them into a large runtime information structure. Each class file is stored in the structure shown in Fig. 2, which is composed of four parts: the Constant Pool TOC (table of contents), the constant pool, the field information, and the method information.

[Figure: memory layout of the runtime information for one class. Labels recoverable from the figure: a Header with reserve, Field Info Addr, and Method Info Addr (16 bits); a Constant Pool TOC with entries addr 0, addr 1, …, addr n; Constant Pool Data (same as that in the class file); records indexed 0 to k-1 holding name index, descriptor index, access flag, and heap offset; data space (8 bytes); and per-method records indexed 0 to m-1 holding access flag, argument count, max stack, max local, and the method bytecodes. All values are in big-endian format.]

Fig. 2 Definition of Java runtime information.

The last three parts are copied directly from the original Java class file, and the offsets of the field information and method information are given by “Field Info Addr” and “Method Info Addr”. Each entry in the Constant Pool TOC is the address (relative to the base address) of a tag in the constant pool, which is extracted directly from the class file [8]. The class loader resolves some indirect references in advance so that dynamic resolution at runtime is faster and simpler. An example is shown in Fig. 3. A bytecode instruction, invokestatic 1D, refers to constant pool entry 1D, whose “Methodref_info” structure is a symbolic reference to a method declared in a class. A typical Java VM resolves this symbolic reference at runtime. Our class loader instead resolves such references during class loading whenever possible, to speed up runtime operations.
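As an illustration of how such an image might be read, the following minimal Java sketch decodes a header and a TOC entry from a byte array. The exact byte layout is an assumption made for this example only (three 16-bit header fields followed by 16-bit TOC entries); the class name RuntimeImageSketch, its methods, and the sample values are hypothetical and are not taken from the actual implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class RuntimeImageSketch {
    // Assumed layout: reserve, Field Info Addr, Method Info Addr (16 bits each),
    // followed by 16-bit Constant Pool TOC entries. This is a guess for illustration.
    static final int HEADER_BYTES = 6;

    private final ByteBuffer image;

    RuntimeImageSketch(byte[] bytes) {
        // All values in the image are big-endian, which is also ByteBuffer's default.
        this.image = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN);
    }

    /** Offset of the field information, relative to the image base. */
    int fieldInfoAddr() { return image.getShort(2) & 0xFFFF; }

    /** Offset of the method information, relative to the image base. */
    int methodInfoAddr() { return image.getShort(4) & 0xFFFF; }

    /** Offset of the tag of the given constant pool entry, relative to the image base. */
    int tocEntry(int index) {
        return image.getShort(HEADER_BYTES + 2 * index) & 0xFFFF;
    }

    /** Builds a tiny image with made-up offsets so the sketch can be run. */
    static byte[] sampleImage() {
        ByteBuffer b = ByteBuffer.allocate(16).order(ByteOrder.BIG_ENDIAN);
        b.putShort((short) 0);       // reserve
        b.putShort((short) 0x0040);  // Field Info Addr (hypothetical value)
        b.putShort((short) 0x0080);  // Method Info Addr (hypothetical value)
        b.putShort((short) 0x000C);  // TOC addr 0 (hypothetical value)
        b.putShort((short) 0x000F);  // TOC addr 1 (hypothetical value)
        return b.array();
    }

    public static void main(String[] args) {
        RuntimeImageSketch img = new RuntimeImageSketch(sampleImage());
        System.out.printf("field info at 0x%04X, method info at 0x%04X, TOC[1] = 0x%04X%n",
                img.fieldInfoAddr(), img.methodInfoAddr(), img.tocEntry(1));
    }
}
```

Masking with 0xFFFF treats the 16-bit values as unsigned, so offsets above 0x7FFF are not read as negative numbers.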


Fig. 3 Dynamic method resolution of Java.

Fig. 4 shows our mechanism for fast resolution. The class loader uses the memory space following a “Methodref_info” entry to store the target address of the method reference. At runtime, the instruction invokestatic 1D refers to entry 1D of the Constant Pool TOC and reads the data of that entry. The Java BEE core then retrieves the target address, which points directly to the method entry. With this mechanism, dynamic resolution is faster at runtime. Symbolic references to other information, such as interfaces and fields, are handled in the same way.

[Figure: a Constant Pool entry that holds the resolved target address of a method entry. The method entry is labelled with the fields access flag, arg_cnt, max stack, and max locals; the example values shown are 000A, 0004, 0003, and 0002, followed by a byte sequence beginning 033DA700……….]

Fig. 4 Fast dynamic method resolution.
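The resolution scheme described above can be modelled in software as follows. This is only a sketch under stated assumptions: the 4-byte width of the cached address slot, the sizes of the buffers, the method table keyed by a descriptor string, and the address 0x0120 are all hypothetical. The 5-byte Methodref layout (tag 10, class index, name-and-type index) follows the class file format [8].

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

final class FastResolutionSketch {
    static final int TAG_METHODREF = 10;   // CONSTANT_Methodref tag in the class file format
    static final int METHODREF_BYTES = 5;  // tag(1) + class_index(2) + name_and_type_index(2)

    final ByteBuffer pool = ByteBuffer.allocate(64); // constant pool data, big-endian by default
    final int[] toc = new int[0x20];                 // constant pool index -> offset of its tag

    /** Load time: store a Methodref entry and reserve the space after it for the target. */
    void addMethodref(int cpIndex, int classIndex, int nameAndTypeIndex) {
        toc[cpIndex] = pool.position();
        pool.put((byte) TAG_METHODREF)
            .putShort((short) classIndex)
            .putShort((short) nameAndTypeIndex);
        pool.putInt(0); // assumed 4-byte slot, filled in once the target is known
    }

    /** Load time: resolve the symbolic reference once and cache the method entry address. */
    void resolveAtLoadTime(int cpIndex, String symbolicName, Map<String, Integer> methodTable) {
        int target = methodTable.get(symbolicName);
        pool.putInt(toc[cpIndex] + METHODREF_BYTES, target);
    }

    /** Run time: "invokestatic cpIndex" needs only a TOC read and an entry read. */
    int invokestaticTarget(int cpIndex) {
        int entry = toc[cpIndex];
        if (pool.get(entry) != TAG_METHODREF) {
            throw new IllegalStateException("constant pool entry is not a Methodref");
        }
        return pool.getInt(entry + METHODREF_BYTES);
    }

    public static void main(String[] args) {
        Map<String, Integer> methodTable = new HashMap<>();
        methodTable.put("mmes/hostservice.print(I)V", 0x0120); // hypothetical method entry address

        FastResolutionSketch image = new FastResolutionSketch();
        image.addMethodref(0x1D, 2, 3);
        image.resolveAtLoadTime(0x1D, "mmes/hostservice.print(I)V", methodTable);
        System.out.printf("invokestatic 1D -> 0x%04X%n", image.invokestaticTarget(0x1D));
    }
}
```

The string lookup happens once, in resolveAtLoadTime; invokestaticTarget performs no name comparison at all, which is where the runtime saving comes from.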

B. Java Stack Operation for Method Invocation

According to the Java VM specification [8], when an instance method is invoked, a reference to its instance is passed in local variable 0 of the stack frame. In the Java programming language, the instance is accessible through the keyword this. Class (static) methods do not have an instance, so the arguments of a class method start at local variable 0. Our method invocation is therefore set up by pushing the arguments onto the top of the stack. When the frame for the new method is created, the arguments passed to the method become the initial values of the new frame's local variables.

Note that the stack has only one pair of pointers: VP (variable pointer) and SP (stack pointer). VP points to the first local variable of the current method, while SP points to the next entry above the top of the stack. Besides the local variables, each frame stores some bookkeeping information for program execution: Previous JPC (Java program counter) records the return point, and Previous VP holds the VP of the previous frame.

After the information of the invoker is recorded, the new method is invoked. When it returns, its return value is pushed onto the operand stack of the invoker's frame, and VP and SP are restored. Fig. 5 and Fig. 6 show the transformation of the stack for method invocation and return.

The key idea of the proposed Fast Host Service Interface is to let the Java BEE core identify and intercept all method invocations on a special Java class (mmes.HostService) and then signal an interrupt to the host service handler routine running on the host processor. The caller's stack is also exposed to the host processor service routine.


Fig. 5 Java stack operation for method invocations.

Fig. 6 Stack transformation for method return.
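The stack transformation for invocation and return can be sketched as a small software model. The position of Previous JPC and Previous VP inside the frame (directly above the local variables) is an assumption made for this example, as are the class name JavaStackSketch and the program counter values; only the roles of VP, SP, Previous JPC, and Previous VP come from the text.

```java
final class JavaStackSketch {
    final int[] stack = new int[64];
    int vp = 0;   // index of local variable 0 of the current method
    int sp = 0;   // index of the next free entry above the top of the stack
    int jpc = 0;  // Java program counter

    void push(int value) { stack[sp++] = value; }

    int pop() { return stack[--sp]; }

    /** The caller has already pushed argCount arguments; they become locals 0..argCount-1. */
    void invoke(int argCount, int maxLocals, int targetPc, int returnPc) {
        int newVp = sp - argCount;
        int prevVp = vp;
        sp = newVp + maxLocals;  // reserve room for the remaining local variables
        push(returnPc);          // Previous JPC: where the invoker resumes
        push(prevVp);            // Previous VP: the invoker's first local variable
        vp = newVp;
        jpc = targetPc;
    }

    /** Returns an int: restore the invoker's VP and JPC, drop the frame, push the result. */
    void returnInt(int maxLocals) {
        int result = pop();
        int savedJpc = stack[vp + maxLocals];
        int savedVp = stack[vp + maxLocals + 1];
        sp = vp;                 // the arguments are consumed together with the frame
        vp = savedVp;
        jpc = savedJpc;
        push(result);
    }

    public static void main(String[] args) {
        JavaStackSketch s = new JavaStackSketch();
        s.push(99);                                // a value already held by the invoker
        s.push(7);                                 // argument 0
        s.push(35);                                // argument 1
        s.invoke(2, 2, 100, 13);                   // static call: 2 arguments, 2 locals
        s.push(s.stack[s.vp] + s.stack[s.vp + 1]); // callee body: local 0 + local 1
        s.returnInt(2);
        System.out.println("top of stack = " + s.stack[s.sp - 1] + ", jpc = " + s.jpc);
    }
}
```

Because the arguments are already in place when the frame is created, no copying is needed to turn them into local variables; only VP and SP move.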

C. Execution Flow of Fast Host Service Interface

Fig. 7 shows a detailed example of the Fast Host Service Interface, in which a Java application calls an I/O service, mmes.HostService.print(), on the host side.

1. The Java compiler compiles the call to mmes.HostService.print() into an indirect reference, which is composed of an invocation instruction and a unique ID that refers to the constant pool.

2. The proposed fast method resolution mechanism resolves this indirect reference and starts executing the bytecode sequence of mmes.hostservice.print(). Each method in the class mmes.hostservice contains inline user-defined bytecodes that assign the unique service ID to the interface register, ServiceID, and signal an interrupt to the host processor.

3. Upon reception of the interrupt, the host processor executes the host service routine that corresponds to the ServiceID register. Although the proposed BEE core already has three registers that hold the top three stack elements, additional custom data registers are provided for parameter passing. These custom data registers are used for host services only and are kept consistent with the operand stack, so the service routine on the host processor can access its parameters directly through them.

[Figure panels, as far as they can be recovered:

1. Java code: mmes.hostservice.print ( args ) ; → B8 00 1D

2. package mmes; public class hostservice { public static void print ( int args0 ) { // inline bytecodes: record a unique ID …

3. Related ISR in C language …]

Fig. 7 Execution flow of FHSI.
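The three steps of this flow can be imitated in plain Java to make the division of work visible. The sketch below is a software model only: the register block InterfaceRegs, the ID value SERVICE_PRINT, the dispatch table, and the console buffer are hypothetical names introduced for illustration, and the interrupt is modelled as a direct method call rather than a hardware signal.

```java
import java.util.HashMap;
import java.util.Map;

final class FhsiFlowSketch {
    /** Stand-in for the memory-mapped interface registers shared by the two cores. */
    static final class InterfaceRegs {
        int serviceId;                  // the ServiceID register
        final int[] data = new int[4];  // custom data registers mirroring the operand stack top
    }

    /** A host service routine; it reads its parameters from the interface registers. */
    interface HostService {
        void run(InterfaceRegs regs);
    }

    static final int SERVICE_PRINT = 1; // hypothetical service ID

    final InterfaceRegs regs = new InterfaceRegs();
    final Map<Integer, HostService> isrTable = new HashMap<>();
    final StringBuilder console = new StringBuilder(); // stands in for the host's output device

    FhsiFlowSketch() {
        isrTable.put(SERVICE_PRINT, r -> console.append(r.data[0]).append('\n'));
    }

    /** Host side (step 3): the handler selects the routine named by the ServiceID register. */
    void onInterrupt() {
        HostService service = isrTable.get(regs.serviceId);
        if (service == null) {
            throw new IllegalStateException("unknown service ID " + regs.serviceId);
        }
        service.run(regs);
    }

    /** BEE side (step 2): stand-in for the inline bytecodes of mmes.hostservice.print(int). */
    void print(int arg0) {
        regs.data[0] = arg0;             // the argument is exported as-is, without conversion
        regs.serviceId = SERVICE_PRINT;  // record the unique service ID
        onInterrupt();                   // signal the host (modelled here as a direct call)
    }

    public static void main(String[] args) {
        FhsiFlowSketch fhsi = new FhsiFlowSketch();
        fhsi.print(42);                  // step 1: what the application's call resolves to
        System.out.print(fhsi.console);
    }
}
```

Adding a new host service in this model means adding one ID and one table entry, which mirrors the extensibility claimed for the interface.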

III. EXPERIMENTAL RESULTS

The proposed dual-core JRE is implemented on an SoC emulation platform, the Xilinx ML405. The platform is based on a Virtex-4 FPGA with a PowerPC hard IP core. Both the processor frequency and the bus frequency are 100 MHz. On the RISC side, a thin OS kernel is used for the experiments. Some host services (supporting new object, new array, and print-out operations) are encapsulated in an ISR.

The RTL model of the Java BEE core is written in VHDL. The synthesis report produced by Synplify Pro for the Virtex-4 device is shown in TABLE I.

TABLE I. SYNTHESIS REPORT OF THE BEE CORE

Device               Virtex-4 xc4vfx20 ff672-10
Number of Slices     3390 out of 8544 (39%)
Maximum Frequency    104.170 MHz

A. Interrupt Overhead of the Target Platform

The communication efficiency between the Java BEE core and the host processor core is crucial for such a heterogeneous dual-core model. In general, interrupt-driven communication and polling are the two common methods of inter-processor communication. TABLE II shows the communication latency of each method. The cycle column gives the number of clock cycles required, and the milliseconds column is the product of the cycle count and the clock period (10 nanoseconds). Although polling has lower latency, we chose the interrupt mechanism for its flexibility.

TABLE II. THE LATENCY OF INTER-PROCESSOR COMMUNICATION

Method              #Cycle    Milliseconds
Interrupt-driven    474       0.05
Polling             135       0.01

B. Efficiency of the Proposed Host Service Invocation

TABLE III compares the proposed dual-core JRE with the standard CVM running on the same emulation platform. Each value in TABLE III is the execution time in milliseconds of one Java operation (method invocation, native method invocation, or new object) executed for 10,000 iterations. Lower values indicate higher performance.

Method invocation is an ordinary Java-to-Java call, which goes through a symbolic reference to a method declared in a class. This experiment measures the efficiency of dynamic resolution. The proposed JRE is more than 10 times faster for this operation (3 ms versus 42 ms in TABLE III), owing to the fast dynamic method resolution design.

Native method invocation means that a Java application invokes a dummy native C function 10,000 times. Although the proposed JRE incurs extra interrupt overhead on every native call, its performance is still slightly better than that of CVM (163 ms versus 171 ms). According to the experiment, interrupt overhead accounts for 98.2% of the execution time (15,979,980 of 16,269,995 cycles); only 1.8% is spent executing Java bytecode. IPC overhead is unavoidable in a dual-core processor, but in exchange the overall system performance is higher.
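The figures quoted for the native invocation test can be cross-checked with a few lines of arithmetic. The sketch below uses only the cycle counts, the iteration count, and the 10-nanosecond clock period stated in this section; the class and method names are hypothetical. It gives about 162.7 ms in total (the 163 ms of TABLE III), an interrupt share of about 98.2%, and roughly 1,598 interrupt cycles versus 29 bytecode cycles per call.

```java
final class InterruptOverheadSketch {
    static final long TOTAL_CYCLES = 16_269_995L;     // whole native invocation test
    static final long INTERRUPT_CYCLES = 15_979_980L; // part attributed to interrupts
    static final int ITERATIONS = 10_000;
    static final double CLOCK_PERIOD_NS = 10.0;       // 100 MHz processor and bus clock

    /** Total execution time in milliseconds. */
    static double totalMillis() {
        return TOTAL_CYCLES * CLOCK_PERIOD_NS / 1e6;
    }

    /** Fraction of all cycles spent on interrupt overhead. */
    static double interruptShare() {
        return (double) INTERRUPT_CYCLES / TOTAL_CYCLES;
    }

    /** Average interrupt overhead per native call, in cycles. */
    static double interruptCyclesPerCall() {
        return (double) INTERRUPT_CYCLES / ITERATIONS;
    }

    /** Average cycles per native call spent executing Java bytecode. */
    static double bytecodeCyclesPerCall() {
        return (double) (TOTAL_CYCLES - INTERRUPT_CYCLES) / ITERATIONS;
    }

    public static void main(String[] args) {
        System.out.printf("total time:        %.1f ms%n", totalMillis());
        System.out.printf("interrupt share:   %.1f %%%n", 100.0 * interruptShare());
        System.out.printf("interrupt / call:  %.0f cycles%n", interruptCyclesPerCall());
        System.out.printf("bytecode / call:   %.0f cycles%n", bytecodeCyclesPerCall());
    }
}
```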

For example, we used two benchmarks, PI and SIEVE, to measure full system performance. For PI, which calculates π to 500 decimal digits, the proposed dual-core JRE takes 176 milliseconds while Sun’s CVM takes 1,086 milliseconds. For SIEVE, the proposed dual-core JRE and Sun’s CVM take 2,887 milliseconds and 26,199 milliseconds, respectively.

The last experiment tested the overhead of “new object,” which instantiates an object through the “new” bytecode instruction. This is the worst case for the proposed system because of the memory management overhead on the host processor side. This overhead can be reduced if the memory management functions are optimized for the proposed system.

TABLE III. EXPERIMENTAL RESULT

Operation                              Dual-core JRE    Sun’s CVM
method invocation (java-call-java)     3 ms             42 ms
native invocation (java-call-c)        163 ms           171 ms
new object                             375 ms           42 ms
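To compare the measurements on one scale, the speedup of the dual-core JRE over CVM can be computed as the CVM time divided by the dual-core time. The sketch below uses only the timings reported in this section; the class name SpeedupSketch is hypothetical. The ratios come out at about 14x for method invocation, 1.05x for native invocation, 0.11x for new object (that is, CVM is about 9 times faster there), 6.2x for PI, and 9.1x for SIEVE.

```java
final class SpeedupSketch {
    /** Speedup of the dual-core JRE over CVM; values above 1 mean the dual-core JRE is faster. */
    static double speedup(double dualCoreMs, double cvmMs) {
        return cvmMs / dualCoreMs;
    }

    public static void main(String[] args) {
        String[] names = {
            "method invocation", "native invocation", "new object", "PI", "SIEVE"
        };
        // Each row holds {dual-core JRE time, CVM time} in milliseconds, as reported above.
        double[][] millis = {
            {3, 42}, {163, 171}, {375, 42}, {176, 1086}, {2887, 26199}
        };
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-18s %.2fx%n", names[i], speedup(millis[i][0], millis[i][1]));
        }
    }
}
```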

IV. CONCLUSIONS

This paper presents a dual-core Java application processor and a fast IPC mechanism that lets the Java core invoke system service routines running on the host core. The proposed IPC mechanism is extensible, and the experimental results show that it is also efficient: native invocation is slightly faster than on Sun’s CVM running on the same platform, even though each call includes interrupt overhead.

V. ACKNOWLEDGEMENT

This research is partly funded by the National Science Council, Taiwan, R.O.C., under grant number NSC 97-2220-E-009-024.

REFERENCES

[1] H. J. Ko and C. J. Tsai, “A Double-issue Java Processor Design For Embedded Application,” Proc. of IEEE Int. Symp. on Circuits and Systems, May 2007.

[2] S. Liang, The Java Native Interface: Programmer’s Guide and Specification, Addison-Wesley, 1999.

[3] M. Schoeberl, “Evaluation of a Java Processor,” Tagungsband Austrochip 2005, pp. 127-134, Oct. 2005.

[4] S. Nino, T. Mori, Y. Ko, Y. Shibata, and K. Oguri, “FPGA Implementation of a Statically Reconfigurable Java Environment for Embedded Systems,” IEEE Int. Symp. on Field-Programmable Technology, 2007.

[5] D. Kurzyniec and V. Sunderam, “Efficient Cooperation between Java and Native Codes – JNI Performance Benchmark,” http://janet-project.sourceforge.net/papers/jnibench.pdf

[6] Sun Microsystems, J2ME Building Blocks for Mobile Devices, Sun Microsystems White Paper, May 2000.

[7] Sun Microsystems, Connected, Limited Device Configuration Specification Version 1.0a, Sun Microsystems White Paper, May 2000.

[8] T. Lindholm and F. Yellin, The Java Virtual Machine Specification, Addison-Wesley, 1996.
