Friday, March 28, 2008

SCALE III, part 4

The SCALE API is as simple as it gets. It consists of just two functions: ParseFromFiles() and ParseFromMemory(). The ParseFromFiles() function takes an array of filenames, loads the files into memory, and passes them to ParseFromMemory().

ParseFromMemory() itself takes an array of strings, each string in the array represents a translation unit. Inside ParseFromMemory(), the SCALE engine is initialised and parsing occurs.

The SCALE Engine has 2 distinct stages: tokenisation and parsing.


Tokenisation

Tokenisation of a translation unit is the process of breaking up the entire translation unit into separate entities known as "lexemes". A lexeme is the smallest "unit" that SCALE understands - they are analogous to words in english.

In English, lexemes (eg, "run", "play", "discombobulate") are delimited by whitespace. That means, lexemes (or words) are defined to be separated by whitespace - if there's a bunch of characters that aren't separated by whitespace, it's considered to be a single word.

I'm sure that I'm about to be mob lynched by all the linguists out there, so I think I'll stop with the analogies to real languages. In SCALE, lexemes are delimited by a certain number of characters which where described in part 2 of this series. It's the tokeniser's job to break apart the entire translation unit into lexemes that then go into the parsing stage. For example, let's say we have the following translation unit:


Type ObjectName
{
~Frobnicate
{
Foo = 5;
}
}


The tokeniser would break up that code into the following list of tokens:

Lexeme Index Row, Column
"Type" 0 1, 1
"ObjectName" 5 1, 6
"{" 17 2, 1
"~Frobnicate" 24 3, 5
"{" 41 4, 5
"Foo" 52 5, 9
"=" 56 5, 13
"5" 58 5, 15
";" 59 5, 16
"}" 66 6, 5
"}" 69 7, 1


As you can see, whitespace and comments are ignored - only the useful information is extracted. This list of tokens is then passed to the parser.


Parsing

The job of the parser is to make sense of the raw tokens given to it by the tokeniser. At its core, the SCALE parser's job is to check syntax and format/convert the information into useful data structures. The parser is responsible for type checking, syntax & formatting, and conversion of data. The parser just sequentially runs through each token in the list, and performs checking to ensure that it's correct. If an unexpected token is found (eg. missing an opening brace), the parser errors out. When parsing object types, object names, modules, variable names and variable values, the parser does checks to ensure the validity of the token and enforces things like naming rules and types.

The output of the parser is a large associative data structure (a std::map<>, specifically) called ParsedData. The ParsedData structure contains all the types which were parsed by the parser. ParsedData is a std::map whose key is a type name, and whose key values is a data structure called TypeData.

The TypeData structure contains all the parsed objects that are of the specified type. It, too, is a std::map associative container. The key value is an object name, and the key value type is a structure called ObjectData.

The ObjectData structure contains all the modules contained in a particular object. Like the other data structures, this is another associative container (std::map). The key is a module name, and the key value is an array of "ModuleData" data structures (remember, it's legal to have multiple modules of the same name in an object).

The ModuleData structure finally contains a map of all the variables in that module. The map's key is a variable name, the value is a VariableData structure.

Finally, the VariableData structure contains the data for a variable, including the type.

If you didn't understand that, here's a helpful diagram that may help:




Performance

SCALE III, as the name implies, is the third time I've attempted to implement SCALE. The first iteration was, well, terrible. It was slow and incredibly fragile with overly strict and verbose syntax. It was abandoned quite quickly, and SCALE II was born. SCALE II had the exact same syntax and naming/typing rules as SCALE III, but it's impementation was very different. SCALE II was an attempt at single-pass lexical analysis. There was no tokenisation or parsing or compiling phase, it was all done in-place. And it was a bloated, slow, complex piece of code. I had believed that I could save some performance by making it single pass - but it was a lot more difficult than I would have imagined. That's why SCALE III is done in two passes, like a lot of other lexers out there.

SCALE II relied heavily on simple arrays to store data - and searching arrays was pretty slow as it was an O(n) operation. For namespace collisions and type checking, this resulted in a terrible O(n^2) running time. SCALE III now relies on sorted associative arrays: std::map is usually implemented as a binary red-black tree and it guarantees O(log n) time complexity when searching, indexing, inserting and removing from the tree. Thanks to this, large portions of the engine were optimised from O(n^2) running time to O(n log n), which is a very significant improvement.

SCALE III also supports multithreaded processing of translation units (well, SCALE II did as well, but it wasn't quite as robust). Since translation units can be processed separately, they are dispatched to a thread pool where they are processed in parallel. Since name resolution and checking across translation unit boundaries requires that all the information be available, it's delayed until all translation units have been processed. Then, the thread pool is destructed and namespace resolution occurs on a single thread. That means that some of the processing is required to be serial, but the majority of processing (tokenisation, most of the parser) can be run in parallel for a significant performance increase over single-threaded processing.

Here are some benchmark results on my processor (Core 2 Duo E6600):
2x 7.5mb files (15mb total), MT OFF: 6505ms
2x 7.5mb files (15mb total), MT ON: 3679ms


With multithreading on, the total running time is reduced by 43%. That means that with multithreading on, SCALE is over 76% faster! Of course, I only have a dual core to test it on. The gains would be even larger on a quad core, or octo core. This multithreaded system is scalable up to the number of translation units available, so benefits would likely be seen up to 16-core and beyond, given that there's likely to be a large number of translation units.

Of course, it certainly could be faster. I haven't even run SCALE through a profiler yet, since there's practically no need to optimise. In any sane, non-pathological real-world case, we'd be looking at miniscule amounts of data. At most, maybe 1mb of text data spread out across dozens of files. Since the loading is faster on lots of small files as opposed to a couple of large files, SCALE can parse an entire game's worth of data almost instantly.

0 comments: