Some optimizations
Written by Mooneer Salem on Monday 2nd of August, 2010 in General
Just an update to tell everyone what’s going on. :)
The other night, I was having a discussion on freenode with a user named futilius, and explained my issues with Kite’s slowness and my rationale behind The Big LLVM Changeover. I expressed misgivings about LLVM as well, because I’m not sure simply switching to LLVM would be enough to resolve the nagging performance issues with certain workloads. Luckily, he reminded me of what I learned in CS architecture classes: in particular, cache behavior inside the CPU.
First, though, let me back up for a second. The 1.0.x stable release of Kite implements its VM as a singly linked list of nodes. Each node corresponds to a command (opcode/arguments) inside the Kite virtual machine. Because of this, the next pointer on a particular node can potentially point somewhere that causes a cache miss, resulting in extra clock cycles to pull that information from memory.
The object system also currently has a large amount of overhead. Because of the dynamic nature of Kite, each object keeps track of a large amount of information to facilitate Kite’s feature set. This results in a minimum size of 120 bytes (at least on 64-bit Intel processors) per Kite object. Note that this also includes primitives such as integers and floating point numbers.
So, after that get-together, I got to work. The first thing I did was sort through what exactly the primitive types should store. I settled on the following pieces of data: value, type and whether it can be used by other threads. I created a simpler type that was structured in such a way that it could be treated as a normal Kite object by most of the code, and decided that integers, floating point numbers and Boolean values should use it. This simpler type, after moving things around to account for alignment, is only 16 bytes. (Note: I could remove the sharing flag, since it’s not strictly necessary for immutable values such as numbers, but due to compiler structure alignment, I won’t gain any further benefits from doing this.)
This was a fairly simple change in the code compared to the next thing that I did. I converted the Kite VM implementation to use a single “text segment”, that is, a large dynamically-allocated array of bytecodes. The opcode structure was converted into a single type that could represent all possible opcodes and their arguments (32 bytes each, by the way), and the code generation and execution phases were updated to reflect the new layout. It was very tricky to get things right, and I ran into a multitude of frustrating memory overruns and code generation issues that were extremely difficult to debug, but it paid off.
Below is one of the sample programs that I ran to test the code changes:
i = 0; while(i < 1000000) [ i = i + 1; ];
Before the changes:
harry:build mooneer$ time bin/kite test.kt
real    0m0.851s
user    0m0.834s
sys     0m0.015s
harry:build mooneer$ time bin/kite test.kt
real    0m0.834s
user    0m0.820s
sys     0m0.013s
harry:build mooneer$ time bin/kite test.kt
real    0m0.809s
user    0m0.795s
sys     0m0.012s
harry:build mooneer$
After the changes:
harry:build mooneer$ time bin/kite test.kt
real    0m0.698s
user    0m0.682s
sys     0m0.008s
harry:build mooneer$ time bin/kite test.kt
real    0m0.697s
user    0m0.688s
sys     0m0.008s
harry:build mooneer$ time bin/kite test.kt
real    0m0.711s
user    0m0.700s
sys     0m0.009s
harry:build mooneer$
Anyway, you can check out the latest version from svn and give it a test run. :) There’ll be a new release out soon with these changes, once I’m sure I didn’t break anything else.