AGI Programming => AGI Development Tools => Topic started by: AGKorson on January 12, 2022, 12:57:29 AM

Title: Sierra's AGI Compiler (CG.EXE) Disassembled
Post by: AGKorson on January 12, 2022, 12:57:29 AM
I finally finished disassembling an early version of the AGI compiler, 'CG.EXE', to better understand the syntax and structure of original AGI source code. There are quite a few differences between this compiler's syntax and the currently accepted AGI syntax used by most modern compilers such as WinAGI and AGIStudio (i.e. 'canon'). Here is a detailed discussion of the CG compiler (to act as a future reference), and I'll follow with a discussion of the most significant differences between 'canon' and what the original Sierra compiler actually supported.

The CG version I disassembled is version 3.14, from 1984 (I don't have the original file date; it got lost after moving/copying it over the past few years). But since it says it's from 1984, it appears to be a very early version 2 compiler. (If anybody has any more information on the exact date of this file, or what versions it may be specifically targeted to, that would be very helpful. Also, if you know of a more recent version of CG.EXE, I'd love to have it to do some comparisons.)

Running the compiler:
The compiler is a native MSDOS program (obviously, I think- but maybe not...) The usage form is
Code:
cg room [room...] [-o output_directory] [-b buffer_size] [-v]

There are three possible command switches:

Arguments (command switches and source files) can be in any order. Each argument, including source code files ('room') must be separated by one or more spaces. Filenames have to be valid MSDOS filenames (no long-filenames) and can include path info and wildcard characters ('?' and '*').

If no arguments are passed, the usage information is displayed. If no source files are passed (only one or more command switches), the program switches to console mode, and source files can then be typed in one at a time. Pressing ENTER adds a file to the list; pressing CTRL+Z and then ENTER sends the list to the program, which then processes the files.

Source filenames without an extension are assumed to have the '.cg' extension. If specified, any extension will work just fine. You can also pass a list of files by preceding the file that has the list with the '@' symbol; i.e. 'cg @roomlist.txt' will open the file 'roomlist.txt' and read each line as an input source file name. You can also specify DOS environment variables 'HEAD' and 'TAIL', which are added to the beginning and end of the file list passed on the command line. (I don't know what value that might have; it may be a generic feature that was automatically added when the cg.exe file was built.)
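
A minimal Python sketch of the argument handling described above, covering only the default '.cg' extension and '@' list files. It is not the compiler's actual code: the function names and the read_list parameter are made up for illustration, and it assumes that names read from a list file get the same default-extension treatment as names typed on the command line. Wildcards and the HEAD/TAIL variables are left out.

```python
import ntpath


def _read_list_file(path):
    # One source file name per line; skipping blank lines is an assumption.
    with open(path) as listing:
        return [line.strip() for line in listing if line.strip()]


def expand_sources(args, read_list=_read_list_file):
    """Turn command-line source arguments into a flat list of file names."""
    names = []
    for arg in args:
        if arg.startswith("@"):
            # '@file' names a text file that lists the source files.
            names.extend(read_list(arg[1:]))
        else:
            names.append(arg)
    # ntpath understands DOS-style paths regardless of the host OS.
    # Names without an extension get '.cg'; any other extension is kept.
    return [n if ntpath.splitext(n)[1] else n + ".cg" for n in names]
```

For example, expand_sources(["room1", "log2a3.txt"]) gives ["room1.cg", "log2a3.txt"].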

The source filename (not the extension) must include a number, or the compiler will throw an error. The output file for a source is always the source filename truncated at the first number, with that number as the extension. For example, 'logic1.cg' output is 'logic.1', 'log2a3.txt' becomes 'log.2', etc.
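
The naming rule can be sketched in Python as follows. This is only an illustration: output_name is a made-up helper, it assumes "the first number" means the first run of consecutive digits, and it assumes any directory part of the source path is dropped (the output directory being controlled by -o).

```python
import ntpath
import re


def output_name(source):
    """Derive the compiled logic's file name from a source file name."""
    # Drop any DOS-style directory prefix and the extension.
    stem = ntpath.splitext(ntpath.basename(source))[0]
    # Find the first run of digits in what is left.
    match = re.search(r"\d+", stem)
    if match is None:
        raise ValueError(f"source name has no number: {source}")
    # Text before the number becomes the name, the number the extension.
    return stem[:match.start()] + "." + match.group()
```

With this sketch, output_name("logic1.cg") gives "logic.1" and output_name("log2a3.txt") gives "log.2", matching the examples above.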

The maximum number of source files (including those specified by name, by wildcard and in file lists) is 200, which seems odd since AGI allows up to 256 logics in a game.

General compiler behavior:
In general, Sierra's compiler is intentionally designed to allow for changes/additions to the action and test commands without needing to rebuild the compiler itself. The compiler only manages the syntax used to access the commands; the commands themselves all need to be declared for the compiler every time it is run. The compiler only has a small number of keywords that are hard coded in the program. Because of this, even this early compiler can compile all versions of AGI source code. (With the exception of support for shorthand syntax for multiplication and division. More on that later.)

The compiler processes each source file separately. The sequence of actions is
 
The compiler uses a single pass when compiling source. I may use the term 'preprocessor' occasionally throughout this article, but keep in mind that, unlike more modern compilers, all input is compiled linearly in a single pass, so all declarations and defines need to be listed BEFORE they show up in source code.

The compiler converts all text fields (symbols) it encounters into a hash value (by summing the ascii values of all characters in the symbol, and then returning that value MOD 203). This value is then compared against entries in the compiler's symbol hash table to determine what it represents. If more than one symbol has the same hash value, the compiler creates linked lists for each hash value to avoid conflicts. If the number of symbols is too large such that all memory is used up to hold them, the compiler will throw an error and quit. (I have no idea what that maximum number might be, but on legacy equipment, memory was often a problem so it had to be closely monitored.) For obvious reasons, no duplicate symbols are allowed; if a duplicate is detected, the compiler throws an error. Symbols are case sensitive.

I don't know why the hash table is limited to 203 entries. There may be some valid mathematical reason for picking 203, but I don't know what it is. Regardless, the compiler easily handles that, by linking all symbols with the same hash value, and then using text comparisons when searching for a symbol that shares a hash value with other symbols.

The use of the hash table appears to be a speed enhancer; after converting each symbol to a hash number, it is very fast to then search for the matching number and extract relevant symbol information, instead of doing multiple string comparisons every time a symbol is encountered in source code. On modern CPUs, this wouldn't be a big deal, but in the 1980s, this would likely have been noticeable in showing increased compile speed.
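
To make the hashing scheme concrete, here is a small Python sketch of a symbol table that behaves as described: the hash is the sum of the ASCII values MOD 203, symbols with the same hash share a chain, duplicates are rejected, and lookups are case sensitive. The class and function names are made up, and Python lists stand in for the compiler's linked lists.

```python
BUCKETS = 203  # size of the hash table, as found in the disassembly


def symbol_hash(symbol):
    """Sum the ASCII values of the characters, then take MOD 203."""
    return sum(symbol.encode("ascii")) % BUCKETS


class SymbolTable:
    def __init__(self):
        # One chain per hash value; a list stands in for a linked list.
        self.buckets = [[] for _ in range(BUCKETS)]

    def add(self, symbol, info):
        chain = self.buckets[symbol_hash(symbol)]
        # Same hash is fine, same text is a duplicate symbol error.
        if any(name == symbol for name, _ in chain):
            raise KeyError(f"duplicate symbol: {symbol}")
        chain.append((symbol, info))

    def lookup(self, symbol):
        # Only symbols sharing the hash need a text comparison.
        for name, info in self.buckets[symbol_hash(symbol)]:
            if name == symbol:
                return info
        return None
```

Note that a plain character sum gives every anagram the same hash, so 'ab' and 'ba' always land in the same chain. For reference, symbol_hash("if") is 4 (105 + 102 = 207, and 207 MOD 203 = 4).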

On startup, the following symbols are added to the hash table: %include, %tokens, %test, %action, %flag, %var, %object, %define, %message, %view, #include, #tokens, #test, #action, #flag, #var, #object, #define, #message, #view, goto, if, else, FLAG, OBJECT, MSG, WORD, NUM, MSGNUM, VIEW, VAR, ANY, WORDLIST. These can be broken down into three groups:

Shorthand Syntax:
The compiler provides limited support for shorthand syntax in lieu of command names. But instead of recognizing the shorthand and directly emitting the appropriate byte code, the compiler inserts the matching command symbol into the data stream, as if it had been typed in the source code, and then compiles that symbol. This means the declarations of the commands behind the shorthand must exactly match the internal spelling. For example, you can't declare the assignn action command (byte code 3) under a custom name; you must declare it as 'assignn'. (You could still use a #define to create an alias for assignn, though.) The supported shorthand commands are:
This compiler version (3.1.4) does not include shorthand for multiplication or division, which makes me believe it's probably an early AGI version 2 tool (multiplication and division weren't added until version 2.411).
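
The "insert the symbol, then compile it" behavior can be sketched in Python as below. This is purely illustrative. The table holds a single entry, and that entry is an assumption borrowed from canon syntax ('v1 = 1' meaning assignn); only the spelling 'assignn' comes from the compiler itself. The function name and the shape of declared_actions are made up.

```python
# Hypothetical shorthand table: operator -> internal command spelling.
SHORTHAND = {"=": "assignn"}


def expand_shorthand(tokens, declared_actions):
    """Rewrite [lhs, op, rhs] into the tokens of an ordinary command call."""
    lhs, op, rhs = tokens
    name = SHORTHAND[op]
    # The inserted symbol is compiled like typed text, so it must have
    # been declared under exactly this spelling.
    if name not in declared_actions:
        raise LookupError(f"'{name}' has not been declared")
    return [name, "(", lhs, ",", rhs, ")"]
```

With 'assignn' declared as byte code 3, expand_shorthand(["v1", "=", "1"], {"assignn": 3}) yields the same tokens as typing assignn(v1, 1). If the command had been declared under any other name, the expansion fails, which is why the declaration has to match the internal spelling.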

Miscellaneous Syntax Information:
Commas and semi-colons are completely interchangeable. You can use either to separate arguments in a command, or to mark the end of a line.

The compiler does not require an end-of-line marker. Commands can be separated by one or more spaces, a line feed, a semi-colon, or a comma. Only line feeds (ASCII value 10) mark new lines; carriage returns (ASCII value 13) are completely ignored. For example:
     assignn(v1, 1); [ OK
     assignn(v2; 2) ,, assignn(v3;3)  [ OK
     assignn
     (
     v1
     ,
     1
     )
     [ OK; same as first line
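
A rough Python sketch of a tokenizer that accepts all three forms above. It is a simplification, not the compiler's real scanner: commas and semi-colons are simply treated like spaces, '[' always starts a comment (quoted strings are not handled), and both function names are made up.

```python
import re


def tokenize(source):
    """Split source text into command names, parentheses and arguments."""
    tokens = []
    # Carriage returns are ignored; only line feeds end a line.
    for line in source.replace("\r", "").split("\n"):
        code = line.split("[", 1)[0]       # '[' comments run to end of line
        code = re.sub(r"[;,]", " ", code)  # ',' and ';' are interchangeable
        tokens.extend(re.findall(r"[()]|[^\s()]+", code))
    return tokens


def parse_commands(tokens):
    """Group tokens into (name, [arguments]) pairs."""
    commands, i = [], 0
    while i < len(tokens):
        if i + 1 >= len(tokens) or tokens[i + 1] != "(":
            raise SyntaxError(f"expected '(' after {tokens[i]}")
        close = tokens.index(")", i)
        commands.append((tokens[i], tokens[i + 2:close]))
        i = close + 1
    return commands
```

Fed the three examples above, parse_commands(tokenize(text)) produces four assignn commands, and the first and last are identical.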


For numeric arguments, the compiler does not enforce unsigned byte values. If a number is greater than 255, the compiler uses the number MOD 256. Negative numbers also compile without error; the compiler converts them to two's complement (and likewise keeps only the low 8 bits if the result does not fit in a byte).
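
The wrap-around can be reproduced in a couple of lines of Python. The helper name is made up; masking with 0xFF keeps the low 8 bits, which for negative numbers is the low byte of the two's-complement value.

```python
def to_byte(value):
    """Reduce any integer to the single byte the compiler would emit."""
    # Python integers behave as two's complement under '&', so this
    # handles both values above 255 and negative values.
    return value & 0xFF
```

So 300 compiles as 44, -1 as 255, and -300 as 212.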

The only supported comment tag is the open square bracket ([). Double-slash (//) is not supported by the compiler, nor are block comments.

A 'return' command (byte code 0) is automatically added by the compiler. If the source code ends with a 'return' command, the resulting compiled logic will in fact end with two return byte codes.
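
As a tiny illustration of the doubled return, here is a Python sketch that treats the compiled commands as a byte string. The function name is made up; the byte values used are the ones mentioned in this post (0 for return, 3 for assignn).

```python
RETURN = 0x00  # byte code of the 'return' command


def finish_logic(bytecode):
    """Append the automatic trailing return to compiled command bytes."""
    # The compiler does not check whether the source already ended
    # with 'return', so an explicit one is simply followed by another.
    return bytes(bytecode) + bytes([RETURN])
```

For example, assignn(v1, 1) compiles to the bytes 03 01 01, and the finished logic ends 03 01 01 00. If the source had its own 'return' at the end, the result ends 03 01 01 00 00.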

Syntax Differences:
So what are the biggest differences between original Sierra AGI syntax (as enforced by the CG.EXE compiler) and current fan-based 'canon' syntax? And do any of them warrant adjusting what modern compilers enforce?

OK, that was quite a bit of information! Honestly, though, it was mostly a thought exercise - I enjoy taking apart the binary code of AGI tools to learn all the inner workings. The AGI community doesn't seem particularly strong these days, so I don't really expect there's much discussion to be had around any kind of 'official' or 'canon' AGI syntax. I'll probably just add those things that I think are worth it into the next WinAGI without worrying too much about whether it would be compatible with AGIStudio or any other compiler.

Anyway, feel free to comment/discuss/correct any of the above information. I'd be happy to respond.
Title: Re: Sierra's AGI Compiler (CG.EXE) Disassembled
Post by: lskovlun on January 12, 2022, 07:07:47 AM
Nice write-up!
Title: Re: Sierra's AGI Compiler (CG.EXE) Disassembled
Post by: ZvikaZ on January 12, 2022, 08:39:35 AM
I agree with Lars :)

Few comments:
1.
Regarding the file's date. As far as I remember, directories were added only in MSDOS version 2 (can someone confirm that? I think it's implied by https://en.wikipedia.org/wiki/MS-DOS, but it's not 100% clear). Anyway, according to that Wikipedia article, MSDOS 2 was released in October 1983. So, if I'm right here, that's the earliest possible date for starting work on a program that supports directories. But the guys at Sierra had to buy the new MSDOS version in October, learn its new features, design around them, implement that, then there's the December vacation...
So we could guess that it's from the beginning of 1984 at the earliest.
However, an alternate story would be that Sierra had their hands on a DOS 2 before it was released (maybe as part of the special relations with IBM around that time), and in that case, the earliest date can be earlier...

2.
Quote
The compiler is a native MSDOS program (obviously, I think- but maybe not...)
Well, it's not obvious. IIRC, I once read an interview with one of the Lucas developers, and he said that they developed on another machine. I don't remember which one, but he said that the IBM-PC was too weak to do real work on, so they used some stronger machine and cross-compiled to the PC.

3.
Code:
cg room [room...] [-o output_directory] [-b buffer_size] [-v]

That's very interesting!
DOS usually used slashes (/) for parameters. That was no problem, since DOS-1 didn't have directories.
When they added directories, they had to use the strange backslash (\), because the regular slash (/) was already used.
So, it's interesting that Sierra's tool used the more Unix-like tradition of - for parameters.
Maybe it could be related to the previous item, and hint that the tool was used on other machines as well?
But that's really a far-fetched guess...

4.
To be honest, I haven't (yet) read all the syntax differences details you wrote. However, I'm just curious if you checked with the original AGI sources that we have - do they all compile with "canon" compilers? With that CG compiler?

5.
Again, great work, and thanks!
Title: Re: Sierra's AGI Compiler (CG.EXE) Disassembled
Post by: Charles on January 12, 2022, 09:23:53 AM
If it's the same file OmerMor posted a couple years ago, the date is August 1, 1986 (10:29:58 AM).
CRC32: 022E711B
MD5: A70C533DCBDAA58491DD09A10AC1F251
SHA-1: 4700B789548A53FBE3600ADDB55E4B1C492436E0

http://sciprogramming.com/community/index.php?topic=1814.0
Title: Re: Sierra's AGI Compiler (CG.EXE) Disassembled
Post by: Collector on January 12, 2022, 07:36:34 PM
Nice. Would you mind making an entry on this on the Wiki?
Title: Re: Sierra's AGI Compiler (CG.EXE) Disassembled
Post by: MusicallyInspired on January 12, 2022, 11:03:51 PM
Really nice to see this much attention given to AGI still. :)
Title: Re: Sierra's AGI Compiler (CG.EXE) Disassembled
Post by: klownstein on January 15, 2022, 02:24:45 AM
I am definitely an amateur programmer, if that, and have not done any programming in the last two decades except to use AGIstudio and WinAGI to work on AGI projects as a personal hobby. So, my programming experience is very limited. My experience with AGI comes exclusively from playing AGI games as a child, from asking a handful of questions on these forums, and from the tools, like WinAGI, that others have made. So, while I don't really have much to contribute to the discussion about the original AGI compiler and its inner workings, I very much appreciate those who do dissect it, understand it, and share that expertise with us. Thank you all.
-klownstein
Title: Re: Sierra's AGI Compiler (CG.EXE) Disassembled
Post by: pmkelly on January 22, 2022, 05:56:34 AM
Fantastic writeup!

A note on the hash table: This approach is very common, both for symbol tables in compilers and for lots of other applications (most programming languages these days include a hash table class in their standard library, and yes, they are still important for performance even on modern machines). The number 203 would just be the number of "buckets"; it doesn't mean that only 203 symbols can be present, since if there are more than that there will be buckets holding multiple linked entries (as you mentioned). Normally a prime number is used for the number of buckets, and 203 is not prime (7 x 29), so I'm not sure why they would have picked it. See https://en.wikipedia.org/wiki/Hash_table

It's interesting that the compiler operates in a single pass. This is how the AGI Studio compiler worked; it was the first time I'd written a compiler and I didn't really know what I was doing. The following year I did a course on compiler construction as part of my CS degree, learnt about abstract syntax trees, and realised "ah, that's how it's supposed to be done". Having said that, single-pass compilers weren't uncommon for the time. A well-known example is Turbo Pascal, and its single-pass strategy was a contributor to its speed.

I'm also quite surprised to learn that the action & test commands were not hard-coded into the compiler. If I recall correctly, each had a corresponding numeric id in the compiled bytecode, which means it would have been necessary to ensure that the definitions you used matched up with the target interpreter.

Regarding the "canon" syntax, I think additions are fine as long as it doesn't break backwards compatibility with existing source. There could be an option to choose between the syntaxes. The CG syntax would mostly be for historical interest, or for compiling any sierra source code if there's any of that around.