3D Graphics with OpenGL ES and M3G – P43
APPENDIX A: FIXED-POINT MATHEMATICS

In these examples we showed the programs as lists of assembly instructions. They cannot be compiled into a working program without some modifications. Here is an example of an inlined assembly routine that you can actually call from your C program (using an ARM GCC compiler):

    INLINE int mul_fixed_fixed( int a, int b )
    {
        int result, tmp;
        __asm__ ( "smull %0,%1,%2,%3 \n\t"
                  "mov %0,%0,lsr #16 \n\t"
                  "orr %0,%0,%1,lsl #16 \n\t"
                  : "=&r" (result), "=&r" (tmp)
                  : "r" (a), "r" (b) );
        return result;
    }

Here the compiler allocates the registers and binds result to argument %0, tmp to %1, a to %2, and b to %3. For result and tmp, = means that the register is going to be written to, and & indicates that the register cannot be used for anything else inside this __asm__ statement. The first instruction performs a signed multiply of a and b, storing the low 32 bits in result and the high 32 bits in tmp. The second instruction shifts result right by 16 bits, and the third shifts tmp left by 16 bits and combines it with result using a bitwise OR. The interested reader may want to consult a more in-depth exposition of GCC inline assembly [S03, Bat].

Another compiler that is used a lot for mobile development is the ARM RVCT compiler. It, too, handles the register allocation of the inline assembly. RVCT goes a step further, though: there is no need to specify registers and their constraints, as they are handled automatically by the compiler. Here is the previous example code in the inline assembler format used by RVCT:

    INLINE int mul_fixed_fixed( int a, int b )
    {
        int result, tmp;
        __asm
        {
            smull result, tmp, a, b
            mov result, result, lsr #16
            orr result, result, tmp, lsl #16
        }
        return result;
    }

For a list of supported instructions, check the ARM Instruction Set Quick Reference Card [Arm].

A.3 FIXED-POINT METHODS IN JAVA

Fixed-point routines in Java work almost exactly as in C, except that you do not have to struggle with the portability of 64-bit integers, because the long type in Java is always 64 bits. Also, since there is no #define nor an inline keyword in Java, you need to find alternative means of getting your code inlined. This is crucially important, because the method call overhead will otherwise eliminate any benefit you get from the faster arithmetic. One way to be sure is to inline your code manually, and that is what you will probably end up doing anyway as soon as you need to go beyond the basic 16.16 format. Note that the standard javac compiler does not do any inlining; see Appendix B for suggestions on other tools that may be able to do it.
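To make this concrete, here is a minimal sketch of 16.16 fixed-point helpers in Java, including a four-element dot product (DOT4) computed at 64-bit intermediate resolution. The class and method names are our own illustration, not part of M3G or any standard API:

    // Minimal 16.16 fixed-point helpers; illustrative names only.
    public final class Fixed
    {
        public static final int ONE = 1 << 16;   // 1.0 in 16.16 format

        // Multiply two 16.16 numbers; the 64-bit intermediate avoids
        // overflow, and the final shift restores the 16.16 scaling.
        public static int mul( int a, int b )
        {
            return (int)(((long)a * (long)b) >> 16);
        }

        // Four-element dot product at 64-bit intermediate resolution:
        // the products are accumulated in 32.32 format, so only one
        // final shift is needed.
        public static int dot4( int ax, int ay, int az, int aw,
                                int bx, int by, int bz, int bw )
        {
            long sum = (long)ax * bx + (long)ay * by
                     + (long)az * bz + (long)aw * bw;
            return (int)(sum >> 16);
        }
    }

In a real inner loop you would copy the bodies of these methods into the loop by hand, rather than call them, since javac will not inline them for you.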
The benefit of using fixed-point in Java depends greatly on the Java virtual machine. The benefit can be very large on VMs that leverage Jazelle (see Appendix B), or just-in-time (JIT) or ahead-of-time (AOT) compilation, but very modest on traditional interpreters. To give a ballpark estimate, a DOT4 done in fixed-point using 64-bit intermediate resolution might be ten times faster than a pure float routine on a compiling VM, five times faster on Jazelle, but only twice as fast on an interpreter. On a traditional interpreter, float is relatively efficient because it requires only one bytecode for each addition, multiplication, or division. Fixed-point, on the other hand, takes extra bytecodes due to the bit-shifting. The constant per-bytecode overhead is very large on a software interpreter.

On Jazelle, integer additions and multiplications get mapped directly to native machine instructions, whereas float operations require a function call. The extra bytecodes are still there, however, taking their toll. Finally, a JIT/AOT compiler looks at longer sequences of bytecode and can probably combine the bit-shifts with other operations in the compiled code, as we did in the previous section.

To conclude, using fixed-point arithmetic generally does pay off in Java, and even more so with the increasing prevalence of Jazelle and JIT/AOT compilers. There is a caveat, though: if you need to do a lot of divides, or need to convert between fixed and float frequently, you may be better off just using floats and spending your optimization efforts elsewhere. Divides are very slow regardless of the number format and the VM, and will quickly dominate the execution time. Also, they are much slower in 64-bit integer than in 32-bit floating point!

APPENDIX B: JAVA PERFORMANCE TUNING

Although M3G offers a lot of high-level functionality implemented in efficient native code, it will not write your game for you. You need to create a lot of Java code yourself, and that code will ultimately make or break your game, so it had better be good.

The principles of writing efficient code on the Java ME platform are much like those on any other platform. In order to choose the best data structures and algorithms, and to implement them in the most efficient way, you need to know the strengths and weaknesses of your target architecture, programming language, and compiler. The problem compared to native platforms is that there are more variables and unknowns: a multitude of different VMs, using different acceleration techniques, running on different operating systems and hardware. Hence, spending a lot of time optimizing your code on an emulator, or on just one or two devices, can easily do you more harm than good.

In this appendix we briefly describe the main causes of performance problems in Java ME, and suggest some techniques to overcome them. This is not to be taken as final truth; your mileage may vary, and the only way to be sure is to profile your application on the devices that you are targeting. That said, we hope this will help you avoid the most obvious performance traps, and also better understand some of the decisions that we made when designing M3G.

B.1 VIRTUAL MACHINES

The task of the Java Virtual Machine is to execute Java bytecode, just like a real, nonvirtual CPU executes its native assembly language. The instruction set of the Java VM is in stark contrast to that of any widely used embedded CPU, however. To start with, bytecode instructions take their operands off the top of an internal operand stack, whereas native instructions pick theirs from a fixed set of typically sixteen registers. The arbitrary depth of the operand stack prevents it from being mapped to the registers in a straightforward manner. This increases the number of costly memory accesses compared to native code. The stack-based architecture is very generic, allowing implementations on almost any imaginable processor, but it is also hard to map efficiently onto a machine that is really based on registers. Another complication is that bytecode instructions have variable length, compared to the fixed-length codewords of a RISC processor.
This makes bytecode very compact: most instructions require just one byte of memory, whereas native instructions are typically four bytes each. The downside is that instruction fetching and decoding become more complex. Furthermore, the bytecode instruction set is a very mixed bag, having instructions at widely varying levels of abstraction. The bytecodes range from basic arithmetic and bitwise operations to things that are usually considered to be in the operating system's domain, such as memory allocation (new). Most of the bytecodes are easily mapped to native machine instructions, except for having to deal with the operand stack, but some of the high-level ones require complex subroutines and interfacing with the operating system. Adding into the equation the facts that all memory accesses are type-checked and bounds-checked, that memory must be garbage-collected, and so on, it becomes clear that designing an efficient Java VM, while maintaining security and robustness, is a formidable task.
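To make the stack-based execution model concrete, here is a trivial method together with, in the comments, the bytecode that javac produces for it; the exact listing can be verified with javap -c:

    // A trivial method and (in comments) the bytecode javac emits for it.
    static int madd( int a, int b, int c )
    {
        return a * b + c;
        // iload_0   : push a onto the operand stack
        // iload_1   : push b
        // imul      : pop a and b, push a*b
        // iload_2   : push c
        // iadd      : pop a*b and c, push the sum
        // ireturn   : pop the result and return it
    }

Even this three-operand expression takes six stack-oriented bytecodes, whereas a register-based CPU could evaluate it in two instructions, or in just one if it has a multiply-accumulate instruction such as ARM's MLA.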
There are three basic approaches that virtual machines take to execute bytecode: interpretation, just-in-time compilation, and ahead-of-time compilation. The predominant approach in mobile devices is interpretation: bytecodes are fetched, decoded, and translated into machine code one by one. Each bytecode instruction takes several machine instructions to translate, so this method is obviously much slower than executing native code. The slowdown used to be some two orders of magnitude in early implementations, but has since been reduced to a factor of 5–10, thanks to assembly-level optimizations in the interpreter loops.

The second approach is to compile (parts of) the program into machine code at runtime. These just-in-time (JIT) compilers yield good results in long-running benchmarks, but perform poorly when only limited time and memory are available for the compiler and the compiled code. The memory problems are exacerbated by the fact that compiled code can easily take five times as much space as bytecode. Moreover, runtime compilation will necessarily delay, interrupt, or slow down the program execution. To minimize the disturbance, JIT compilers are restricted to very basic and localized optimizations. In theory, the availability of runtime profiling information should allow JIT compilers to produce smaller and faster code than any static C compiler, but that would require a drastic increase in the available memory, and the compilation time would still remain a problem for interactive applications. Today, we estimate well-written C code to outperform embedded JIT compilers by a factor of 3–5.

The third option is to compile the program into native code before it is run, typically at installation time. This ahead-of-time (AOT) tactic allows the compiler to apply more aggressive optimizations than is feasible at runtime. On the other hand, the compiled code consumes significantly more memory than the original bytecode.

Any of these three approaches can be accelerated substantially with hardware support. The seemingly obvious solution is to build a CPU that uses Java bytecode as its machine language. This has been tried by numerous companies, including Nazomi, Zucotto, inSilicon, Octera, NanoAmp, and even Sun Microsystems themselves, but to our knowledge all such attempts have failed either technically or commercially, or both.

The less radical approach of augmenting a conventional CPU design with Java acceleration seems to be working better. The Jazelle extension to ARM processors [Por05a] runs the most common bytecodes directly on the CPU, and manages to pull that off at a negligible extra cost in terms of silicon area. Although many bytecodes are still emulated in software, this yields performance roughly equivalent to current embedded JIT compilers, but without the excessive memory usage and annoying interruptions. The main weakness of Jazelle is that it must execute each and every bytecode separately, whereas a compiler might be able to turn a sequence of bytecodes into just one machine instruction.

Taking a slightly different approach to hardware acceleration, Jazelle RCT (Runtime Compilation Target) [Por05b] augments the native ARM instruction set with additional instructions that can be used by JIT and AOT compilers to speed up array bounds checking and exception handling, for example. The extra instructions also help to reduce the size of the compiled machine code almost to the level of the original bytecode.

As an application developer, you will encounter all these different types of virtual machines. In terms of installed base, traditional interpreters still have the largest market share, but Jazelle, JIT, and AOT are quickly catching up. According to the JBenchmark ACE results database (www.jbenchmark.com/ace), most newer devices appear to be using one of these acceleration techniques. Jazelle RCT has not been used in any mobile devices as of this writing, but we expect it to be widely deployed over the next few years.

B.2 BYTECODE OPTIMIZATION

As we pointed out before, Java bytecode is less than a perfect match for modern embedded RISC processors. Besides being stack-based and having instructions at wildly varying levels of abstraction, it also lacks many features that native code can take advantage of, at least when using assembly language. For instance, there are no bytecodes corresponding to the kind of data-parallel (SIMD) instructions that are now commonplace even in embedded CPUs and can greatly speed up many types of processing. To take another example, there are no conditional (also known as predicated) instructions to provide a faster alternative to short forward branches.

Most of the bytecode limitations can be attributed to the admirable goal of platform independence, and are therefore acceptable. It is much harder to accept the notoriously poor quality of the code that the javac compiler produces. In fact, you are better off assuming that it does no optimization whatsoever. For instance, if you compute a constant expression like 16*a/4 in your inner loop, rest assured that the entire expression will be meticulously evaluated at every iteration, and of course using real multiplies and divides rather than bit-shifts (as in a<<2).

The lack of optimization in javac is presumably because it trusts the virtual machine to apply advanced optimization techniques at runtime. That may be a reasonable assumption in the server environment, but not on mobile devices, where resources are scarce and midlet start-up times must be minimized. Traditional interpreters and Jazelle take a serious performance hit from badly optimized bytecode, but just-in-time and ahead-of-time compilers are not immune, either.
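Since javac will not do it for you, hoist loop-invariant expressions and apply strength reduction by hand. The sketch below illustrates this with the 16*a/4 example from above; the method names are our own:

    // Naive version: javac compiles the loop body exactly as written,
    // re-evaluating 16*a/4 with a real multiply and divide per iteration.
    static int sumNaive( int[] data, int a )
    {
        int sum = 0;
        for( int i = 0; i < data.length; i++ )
            sum += data[i] * (16 * a / 4);
        return sum;
    }

    // Hand-optimized version: the invariant is hoisted out of the loop, and
    // the power-of-two arithmetic is folded into one shift (16*a/4 == a<<2).
    static int sumOptimized( int[] data, int a )
    {
        int scale = a << 2;
        int sum = 0;
        for( int i = 0; i < data.length; i++ )
            sum += data[i] * scale;
        return sum;
    }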
If the on-device compiler could trust javac to inline trivial methods, eliminate constant expressions and common subexpressions, convert power-of-two multiplications and divisions into bit-shifts, and so on, it could spend more time on things that cannot be done at the bytecode level, such as register allocation or eliminating array bounds checking.

Given the limitations of javac, your best bet is to use other off-line compilers, bytecode optimizers, and obfuscators, such as GCJ (gcc.gnu.org/java), mBooster (www.innaworks.com), DashO (www.preemptive.com), ProGuard (proguard.sourceforge.net), Java Global Optimizer (www.garret.ru/~knizhnik/javago/ReadMe.htm), Bloat (www.cs.purdue.edu/s3/projects/bloat/), or Soot (www.sable.mcgill.ca/soot/). None of these tools is a superset of the others, so it might make sense to use more than one on the same application.

B.3 GARBAGE COLLECTION

All objects, including arrays, are allocated from the Java heap using the new operator. They are never explicitly deallocated; instead, the garbage collector (GC) automatically reclaims any objects that are no longer referenced by the executing program.

Automatic garbage collection eliminates masses of insidious bugs, but also bears significant overhead. Explicit memory management using malloc and free has been shown to be faster and to require less physical memory. For example, in a study by Hertz and Berger [HB05], the best-performing garbage collector degraded application performance by 70% compared to an explicit memory manager, even when the application only used half of the available memory. Performance of the garbage collector declined rapidly as memory was running out. Thus, for best performance, you should leave some reasonable percentage of the Java heap unused. More importantly, you should not create any garbage while in the main loop, so as not to trigger the garbage collector in the first place.

Pitfall: There is no reliable way to find out how much memory your midlet is consuming, or how much more it has available. The numbers you get from Runtime.getRuntime().freeMemory() are not to be trusted, because you may run out of native heap before you run out of Java heap, or vice versa, and because the Java heap may be dynamically resized behind your back.

A common technique to avoid generating garbage is to allocate a set of objects and arrays at the setup stage and then reuse them throughout your code. In other words, start off your application by allocating all the objects that you are ever going to need, and then hold on to them until you quit the midlet. Although this is not object-oriented and not very pretty, it goes a long way toward eliminating the GC overhead, though not all the way. There are built-in methods that do not facilitate object reuse, forcing you to create a new instance when you really only wanted to change some attribute. Even worse, there are built-in APIs that allocate and release temporary objects internally without you ever knowing about it.

Strings are particularly easy to trip on, because they are immutable in Java. Thus, concatenating two strings creates a new String object simply because the existing ones cannot be changed. If you need to deal with strings on a per-frame basis, for example to update the player's score, you need to be extra careful to avoid creating any garbage. Perhaps the only way to be 100% sure is to revert to C-style coding and only use char arrays.
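As a concrete illustration of both techniques, preallocating at setup and using char arrays instead of String, here is a minimal sketch of a garbage-free per-frame score display; the class and its names are our own invention:

    // Sketch of garbage-free score formatting: the buffer is allocated
    // once at setup and reused every frame, so no String objects are
    // ever created in the main loop. Names are illustrative only.
    class ScoreDisplay
    {
        private final char[] buffer = new char[8];   // allocated once, reused forever

        // Writes the score into the buffer as eight digits with leading
        // zeros and returns the buffer; no objects are created per call.
        char[] format( int score )
        {
            for( int i = buffer.length - 1; i >= 0; i-- )
            {
                buffer[i] = (char)('0' + score % 10);
                score /= 10;
            }
            return buffer;
        }
    }

The buffer can then be handed to a method that accepts char arrays directly, such as Graphics.drawChars in MIDP, so that no String is created per frame.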
B.4 MEMORY ACCESSES

One of the most frequent complaints that C programmers have about Java is the lack of direct memory access. Indeed, there are no pointers in the Java programming language, and no bytecode instructions to read or write arbitrary memory locations. Instead, there are only references to strongly typed objects that reside in the garbage-collected heap. You do not know where in physical memory each particular object lies at any given time, nor how many bytes it occupies. Furthermore, all memory accesses are type-checked, and in the case of arrays, also bounds-checked. These restrictions are an integral part of the Java security model, and one of the reasons the platform is so widely deployed, but they also rule out many optimizations that C programmers are used to.

As an example, consider a bitmap image stored in RGBA format at 32 bits per pixel. In C, you would use a byte array, but still access the pixels as integers where necessary, to speed up copying and some other operations. The lack of type-checking in C therefore allows you to coalesce four consecutive memory accesses into one. Java does not give you that flexibility: you need to choose either bytes or integers and stick with that choice. To take another example, efficient floating-point processing on FPU-less devices requires custom routines that operate directly on the integer bit patterns of float values, and that is something you cannot do in Java. To illustrate, the following piece of C code computes the absolute value of a float in just one machine instruction, but relies on pointer casting to do so:

    float fabs(float a)
    {
        int bits = *(int*)(&a);    // extract the bit pattern
        bits &= 0x7fffffff;        // clear the sign bit
        return *(float*)(&bits);   // cast back to float
    }

Type-checking is not the only thing in Java that limits your choice of data structures and algorithms. For example, if you want to build an aggregate object (such as an array of structures) in C, you can either inline the component objects (the structures) or reference them with pointers; Java only gives you the latter option. Defining a cache-friendly data structure where objects are aligned at, say, 16-byte boundaries is another thing that you cannot do in Java. Moreover, you do not have the option of quickly allocating local variables from the CPU stack. Finally, the lack of pointer arithmetic forces you to follow object references even when the target address could be computed without any memory accesses.

Unlike type checking, array bounds checking does not limit your choice of data structures. It does impose a performance penalty, though, and the more dimensions you have in the array, the higher the cost per access. Thus, you should always use a flat array, even if the data is inherently multidimensional; for instance, a 4 × 4 matrix should be allocated as a flat array of 16 elements. Advanced JIT/AOT compilers may be able to eliminate a range check if the array index can be proven to be within the correct range. The compiler is more likely to come up with the proof if you use new int[100] rather than new int[getCount()] to allocate an array, and index<100 instead of index<getCount() to iterate over its elements. Do not let this complicate your code too much, however, as this sort of optimization may be beyond the capabilities of the current compilers.
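Putting the flat-array advice into practice, here is a sketch of a 4 × 4 matrix-vector transform over a flat array of 16 elements; the row-major layout and the names are our own choice:

    // Sketch: a 4x4 matrix stored as a flat, row-major array of 16 floats.
    // Compared to float[4][4], each element access needs only one bounds
    // check and one indirection instead of two. Names are illustrative.
    static void transform( float[] m, float[] v, float[] out )
    {
        // Constant loop bounds and a flat layout make it easier for an
        // advanced JIT/AOT compiler to prove the indices are in range.
        for( int row = 0; row < 4; row++ )
        {
            int base = row * 4;
            out[row] = m[base]     * v[0]
                     + m[base + 1] * v[1]
                     + m[base + 2] * v[2]
                     + m[base + 3] * v[3];
        }
    }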
To minimize memory accesses in general, it is a good idea to use the built-in primitive types such as int and float rather than objects. Also, the input parameters and local variables of a method are likely to be faster than class variables or instance variables. Finally, using System.arraycopy pays off almost universally: it amounts to a native memcpy with some extra type-checking and range-checking up front. The savings can be huge compared to doing the same checks for each element separately.

B.5 METHOD CALLS

Method invocations in Java are more expensive and more restricted than function calls in C or C++. The virtual machine must first look up the method from an internal symbol table, and then check the type of each argument against the method signature. A C/C++ function call, on the other hand, requires very few machine instructions.

In general, private methods are faster to call than public or protected ones, and stand a better chance of being inlined. Also, static methods are faster than instance methods, and final methods are faster than those that can be re-implemented in derived classes. synchronized methods are by far the slowest, and should be used only when necessary. Depending on the VM, native methods can also bear high overhead, particularly if large objects or arrays are passed to or from native code.

As a final note, code and data are strictly separated in Java. There is no way for a method to read or write its own bytecode or that of any other method. There is also no way to transfer program control to the data area, or in fact anywhere else than one of the predefined method entry points. These restrictions are absolutely mandatory from the security standpoint, but they have the unfortunate side-effect that any kind of runtime code generation is prevented. In other words, you could not implement a JIT compiler in Java!
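To put the modifier advice into practice, a hot-path helper might be declared as in the following sketch; the class and its methods are our own illustration:

    // Illustrative only: declaring hot-path helpers so that they are as
    // cheap to call (and as easy to inline) as the VM allows.
    public class Particle
    {
        private float x, y, z;

        // private + static: invoked with the cheap invokestatic bytecode,
        // never dispatched virtually, and a good inlining candidate for
        // an optimizing VM or an off-line bytecode tool.
        private static float clamp( float v, float lo, float hi )
        {
            return v < lo ? lo : (v > hi ? hi : v);
        }

        // Copying instance fields into locals keeps the inner work in
        // fast local-variable slots instead of repeated field accesses.
        public void confine( float limit )
        {
            float px = x, py = y, pz = z;
            x = clamp( px, -limit, limit );
            y = clamp( py, -limit, limit );
            z = clamp( pz, -limit, limit );
        }
    }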
