Author Topic: Sierra's AGI Compiler (CG.EXE) Disassembled  (Read 1454 times)

0 Members and 1 Guest are viewing this topic.

Offline AGKorson

Sierra's AGI Compiler (CG.EXE) Disassembled
« on: January 12, 2022, 12:57:29 AM »
I finally finished disassembling an early version of the AGI compiler, 'CG.EXE' to better understand the syntax and structure of original AGI source code. There are quite a few differences between this compiler's syntax and the currently accepted AGI syntax used by most modern compilers (such as WinAGI and AGIStudio) i.e. 'canon'. Here is a detailed discussion of the CG compiler (to act as a future reference) and I'll follow with a discussion on the most significant differences between 'canon' and what the original Sierra compiler actually supported.

The CG version I disassembled is version 3.14, from 1984 (I don't have the original file date; it got lost after moving/copying it over the past few years). But since it says it's from 1984, it appears to be a very early version 2 compiler. (If anybody has any more information on the exact date of this file, or what versions it may be specifically targeted to, that would be very helpful. Also, if you know of a more recent version of CG.EXE, I'd love to have it to do some comparisons.)

Running the compiler:
The compiler is a native MSDOS program (obviously, I think- but maybe not...) The usage form is
Code: [Select]
cg room [room...] [-o output_directory] [-b buffer_size] [-v]There are three possible command switches:
  • -o to set the output directory (where the compiled logic files will be saved) - The default is the current directory. The output directory can be any legal MSDOS directory name, including relative paths. If the path is not valid, the compiler will quit with an error. If the argument following the -o switch is missing, the current directory is used. If the argument is a valid directory, but does not exist, the compiler will run, but will raise an error at the end when it can't write the output files.
  • -b to adjust buffer size - The default is 1000h (4K). There's really no reason to change this, but it can be adjusted if you really want to. If the argument following the -b switch is non-numeric or missing, a value of zero is used, which will cause the compiler to fail immediately. Larger values means disk writes are minimized, but take up more memory. This was more of a concern in the 1980s than on today's systems.
  • -v to set 'verbose' mode - This switch does not take an argument. When used, it causes the output to display additional information, including the number of symbols and messages found in the source file, and information on the symbol hash table (explained below).

Arguments (command switches and source files) can be in any order. Each argument, including source code files ('room') must be separated by one or more spaces. Filenames have to be valid MSDOS filenames (no long-filenames) and can include path info and wildcard characters ('?' and '*').

If no arguments are passed, the usage information is displayed. If no source files are passed (only one or more command switch) the program switches to console mode, and source files can then be typed in separately. Pressing ENTER adds a file to the list, pressing CTRL+Z and then ENTER sends that list to the program, which are then processed.

Source filenames without an extension are assumed to have the '.cg' extension. If specified, any extension will work just fine. You can also pass a list of files by preceding the file that has the list with the '@' symbol; i.e. 'cg @roomlist.txt' will open the file 'roomlist.txt' and read each line as an input source file name. You can also specify DOS environment variables 'HEAD' and 'TAIL', which are added to the beginning and end of the file list passed on the command line. (I don't know what value that might have; it may be a generic feature that was automatically added when the cg.exe file was built.)

Source filename (not extension) must include a number, or the compiler will throw an error. The output file for a source is always the sourcefile truncated at the first number, with the number as the extension. For example, 'logic1.cg' output is 'logic.1', 'log2a3.txt' becomes 'log.2', etc.

The maximum number of source files (including those specified by name, by wildcard and in file lists) is 200, which seems odd since AGI allows up to 256 logics in a game.

General compiler behavior:
In general, Sierra's compiler is intentionally designed to allow for changes/additions to the action and test commands without needing to rebuild the compiler itself. The compiler only manages the syntax used to access the commands; the commands themselves all need to be declared for the compiler every time it is run. The compiler only has a small number of keywords that are hard coded in the program. Because of this, even this early compiler can compile all versions of AGI source code. (With the exception of support for shorthand syntax for multiplication and division. More on that later.)

The compiler processes each source file separately. The sequence of actions is
 
  • parse input file name, build output file name
  • reset compiler parameters
  • open source file and assign to buffer space
  • create output file (overwriting any existing file) and assign to buffer space
  • initialize the hash table, adding predefined symbols
  • allocate space for output, and for AGI messages
  • compile the source (errors are displayed as they are encountered)
  • close the output file
  • display results, including total number of errors encountered
  • repeat for next source file

The compiler uses a single pass when compiling source; I may use the term 'preprocessor' occasionally through out this article, but keep in mind that unlike more modern compilers, all input is compiled linearly, in a single pass; so all declarations and defines need to be listed BEFORE they show up in source code.

The compiler converts all text fields (symbols) it encounters into a hash value (by summing the ascii values of all characters in the symbol, and then returning that value MOD 203). This value is then compared against entries in the compiler's symbol hash table to determine what it represents. If more than one symbol has the same hash value, the compiler creates linked lists for each hash value to avoid conflicts. If the number of symbols is too large such that all memory is used up to hold them, the compiler will throw an error and quit. (I have no idea what that maximum number might be, but on legacy equipment, memory was often a problem so it had to be closely monitored.) For obvious reasons, no duplicate symbols are allowed; if a duplicate is detected, the compiler throws an error. Symbols are case sensitive.

I don't know why the hash table is limited to 203 entries. There may be some valid mathematical reason for picking 203, but I don't know what it is. Regardless, the compiler easily handles that, by linking all symbols with the same hash value, and then using text comparisons when searching for a symbol that shares a hash value with other symbols.

The use of the hash table appears to be a speed enhancer; after converting each symbol to a hash number, it is very fast to then search for the matching number and extract relevant symbol information, instead of doing multiple string comparisons every time a symbol is encountered in source code. On modern CPUs, this wouldn't be a big deal, but in the 1980s, this would likely have been noticeable in showing increased compile speed.

On startup, the following symbols are added to the hash table: %include, %tokens, %test, %action, %flag, %var, %object, %define, %message, %view, #include, #tokens, #test, #action, #flag, #var, #object, #define, #message, #view, goto, if, else, FLAG, OBJECT, MSG, WORD, NUM, MSGNUM, VIEW, VAR, ANY, WORDLIST. These can be broken down into three groups:

  • 'Preprocessor' symbols:
    The symbols starting with '%' or '#' are compiler commands that add symbols to the hash table which tell the compiler how to handle other symbols it will encounter later. Note that there are two versions of each, with either the '#' or '%' character to start. There is no difference between them; the compiler just offers the flexibility to use either format. I will only refer to the '%' version even though either version is perfectly acceptable to the compiler. (I assume that in early MSDOS compilers, there may have been a difference between commands starting with either symbol, but this compiler doesn't care which is used).
              ​‌‍‎‏‪‫‬ 
    • %include: This symbol is used to include another source or header file. Syntax is:
           %include "filename.ext"

      The filename must be enclosed in double-quotes. It can include path information, but not wildcards. Included files can be nested, but there is a limit of 5 layers of nesting. More than that will cause an error.
              ​‌‍‎‏‪‫‬ 
    • %tokens: This symbol is used to load the WORDS.TOK file. Syntax is:
           %tokens "WORDS.TOK"

      The filename doesn't have to be WORDS.TOK, but it must be a valid AGI word file. It must be enclosed in double-quotes. It can include path information, but not wildcards. If you do not assign a WORDS.TOK file with the %tokens command, the compiler will enter an infinite loop if it tries process a 'said' command.
              ​‌‍‎‏‪‫‬ 
    • %test/%action: These symbols are used to declare AGI command symbols. None of the AGI commands are hard coded into the compiler; they must be declared at the beginning of the source code, typically in a header file that is included with the %include command. Syntax is:
           %test testname([arg1, arg2, ...]) cmdnum
           %action actionname([arg1, arg2, ...]) cmdnum

      The names of commands are well established by precedent (and by example from released original game code), but you can assign any name you want for any command. The arguments must be enclosed in parentheses, separated by commas. If no arguments, empty parentheses must be included. Arguments must be one of the pre-defined argument types (see below). The number of the command (the byte value that goes into the compiled logic) must follow the command declaration. It is imperative that the arguments assigned match exactly with the AGI command associated with the byte command number; the compiler will happily create code that has any combination of command byte values and arguments, but if they don't match what the interpreter expects, the code won't run.
              ​‌‍‎‏‪‫‬ 
    • %flag/%var/%object/%view: Flags, variables, screen objects and views can be assigned symbols using its designated preprocessor command. Note that there isn't a symbol command for numbers; the compiler doesn't convert numbers into a symbol. Syntax is:
           %flag flagname flagnum
           %var variablename varnum
           %object objname objnum
           %view viewname viewnum

      These symbol types correspond to the expected argument types that are used in command declarations. For example, if a command is declared with %action as '%action assignn(VAR, NUM)', a symbol of type %var must be passed as the first argument, and a number must be passed as the second argument. Note that these types are completely arbitrary, meaning as long as the command declaration matches the argument type passed in source code, the compiler won't have any problem. For example:
           %action assignv(VIEW, OBJECT) 3
           %view   a_view     5
           %object an_object 10
           assignv(a_view, an_object);

      will compile, and when run in AGI it would assign variable 10 to variable 5. Of course there is little practical value in using constructs such as this.
      The OBJECT argument type is used in most original Sierra game source for both screen objects and inventory objects. Care must be taken by the programmer to keep them straight, because the compiler won't do it for you.
              ​‌‍‎‏‪‫‬ 
    • %define: The generic define command allows you to assign a symbol to a number, text, or another symbol. This allows you to do things like:
           %var v0      0
           %define currentroom v0

      or
           %var v200   200
           %define lvar1  v200
           %define counter  lvar1
           addn(counter, 1); [ same as addn(lvar1, 1) or addn(v200, 1)

      Symbols can be nested in this manner as deep as you want. As long as each symbol is defined before it is referenced, the compiler will continue substituting define values until it reaches a non-text symbol type (var, num, flag, msgnum, etc.)
              ​‌‍‎‏‪‫‬ 
    • %message: message symbols are just a bit different from other symbol assignments. Syntax is:
           %message msgnum "message text"
      Note that the number precedes the text value, unlike other symbols, where the number is last. The message text must be included in quotes. It cannot be split, i.e. the compiler will not automatically concatenate multi-line strings. The message text can be a symbol that was previously %defined to be a string value. For example
           %define aboutmsg "AGI Game, by Author"
           %message 1 aboutmsg

      will compile with no errors.
 ‏‪‫‬
  • Keywords:
    There are only three key words that the compiler recognizes - 'if', 'else' and 'goto'. The syntax for 'if' and 'else' used by the compiler is identical to the AGI 'canon' syntax. 'if' statements must be in parentheses, curly brackets separate code blocks, etc. The 'and' and 'or' features are the same ('&&' and '||' as operators, and 'or'ed tests must be in parentheses). The exclamation point '!' is used for negation of test commands.
    The goto command does not use parentheses. Using parentheses will cause an error. The syntax is:
         goto label
    Labels are defined with a colon followed by the label name, with no space between them. If there is a space between them, the compiler will create a label using a null string (""), and the following text will be interpreted by the compiler as the next command symbol.
 ‏‪‫‬
  • Argument types:
    Arguments used in action and test commands must be designated as one of ten different argument types. When declaring a command, the compiler expects a symbol with a type that matches the argument type, and it must be one of the predefined argument types for each argument. They are case sensitive, so 'var' is not the same as 'VAR'. With the exception of numbers and vocabulary words (from WORDS.TOK), all arguments must be passed as a symbol that was previously declared using the appropriate 'preproccessor' command. For example, if an argument is of type FLAG, you must pass a symbol declared with the '%flag' command.
    • FLAG: Use this argument type when you want a command to use a flag argument.
    • OBJECT: Use this argument type when you want a command to use a screen object argument.
    • MSG: This argument type is not used by AGI; it appears to be a legacy type that no longer works. The internal value assigned to it will actually create an error if you try to use this argument type in a command declaration.
    • WORD: This is another legacy argument type. If used, the compiler will expect a single word from the WORDS.TOK file. There are no AGI commands that take a single word as an argument. (Earlier versions of AGI did use a 'said' command that took a single word as an argument.)
    • NUM: Use this argument type when you want a command to use a numeric argument value.
    • MSGNUM: Use this argument type when you want a command to use a message number as an argument value. The message must be properly declared with %message before it is used in a command.
    • VIEW: Use this argument type when you want a command to use a view argument.
    • VAR: Use this argument type when you want a command to use a variable argument.
    • ANY: This argument type should not be used in command declarations. It is an internal type that is used when the compiler is handling shorthand syntax such as 'v0 = v1;'. Technically, you could use this in a command declaration, in which case, any valid symbol would be compiled. But using strict argument typing helps prevent bugs by forcing the game programmer to use correct argument types.
    • WORDLIST: This argument type is only used by the 'said' command. It acts as a placeholder for one or more words from the WORDS.TOK file. Its placement at the end of the list suggests it was added after the 'said' command was changed from having a single word argument to a variable number of arguments. Unlike modern AGI compilers (WinAGI and AGIStudio for example), argument values are not passed as double-quoted strings; instead they are passed quote free, and the dollar sign ($) is used in place of spaces. For example, if your word in WORDS.TOK is 'save game', it would look like this in a 'said' command:
           said(save$game)
Shorthand Syntax:
The compiler provides limited support for shorthand syntax in lieu of command names. But instead of just recognizing the shorthand command and directly adding the appropriate byte code, the compiler actually inserts the matching command symbol into the data stream, as if it had been typed in the source code and then compiles that symbol. This means the declarations of shorthand commands must exactly match the internal spelling. For example, you can't create a custom action command for the assignn function (byte code 3); you must declare it as 'assignn'. (You could create a #define value to assign another different command text value to assignn though.) The supported shorthand commands are:
  • '++' and '--' can be used as shorthand for increment/decrement. The operators must precede the variable being modified. So
         ++variable1; [ OK
    will compile, but
         variable1++; [ syntax error
    will throw an error
  • assignn/assignv, addn/addv and subn/subv can be replaced with 'v# = #/v#', 'v# += ##/v#' and 'v# -= ##/v#'. For addition and subtraction you can't use the longer notation 'v# = v# + ##'. For example:
         v1 = v2;  [ OK, same as assignv(v1, v2);
         v1 += 1;  [ OK, same as addn(v1, 1);
         v1 -= v3; [ OK  same as subv(v1, v3);
         v1 = v1 + 2; [ NOT OK, syntax error
  • left and right indirection can be replaced with '@v# = #/v#' and 'v# = @v#'. Note this is different from modern compilers that use the asterisk/star character (*). For example:
         @v1 = 1;   [ OK, same as lindirectn(v1, 1);
         @v2 = v3;  [ OK, same as lindirectv(v2, v3);
         v4 = @v5;  [ OK, same as rindirect(v4, v5);
         *v1 = 1;   [ NOT OK, wrong indirect symbol
  • for test commands, '==', '>',  and '<', can be used in place of  equaln/equalv, greatern/greaterv and lessn/lessv. '!=', '>=' and '<=' can be used in place of the negated versions of these commands.
  • flags can be tested by their name; i.e. 'if(flag1)' will compile as if it were 'if(isset(flag1)'.
  • variables can also be tested by name; 'if(var1)' will compile as if it were 'if(greatern(var1, 0)'
This compiler version (3.1.4) does not include shorthand for multiplication or division, which makes me believe it's probably an early AGI version 2 tool (multiplication and division weren't added until version 2.411).

Miscellaneous Syntax Information:
Commas and semi-colons are completely interchangeable. You can use either to separate arguments in a command, or to mark the end of a line.

The compiler does not require the end of line marker. Commands can be separated by one or more spaces, a line feed (not a carriage return), a semi-colon, or a comma. Line feeds (ascii value 10) but not carriage returns (ascii value 13) mark new lines. Carriage returns are completely ignored.For example:
     assignn(v1, 1); [ OK
     assignn(v2; 2) ,, assignn(v3;3)  [ OK
     assignn
     (
     v1
     ,
     1
     )
     [ OK; same as first line


For numeric arguments, the compiler does not enforce unsigned byte values. If a number value is greater than 255, the compiler uses number MOD 256. Negative numbers will also compile without error; the compiler converts them to 2s-complement (and will also MOD it if result is > 8 bits).

The only supported comment tag is the open square bracket ([). Double-slash (//) is not supported by the compiler, nor are block comments.

A 'return' command (byte code 0) is automatically added by the compiler. If the source code ends with a 'return' command, the resulting compiled logic will in fact end with two return byte codes.

Syntax Differences:
So what are the biggest differences between original Sierra AGI syntax (as enforced by the CG.EXE compiler) and current fan-based 'canon' syntax? And do any of them warrant adjusting what modern compilers enforce?
  • The additional argument declaration types, as well as the action and test declarations, do give more flexibility to writing source code. WinAGI and AGIStudio have built-in command declarations, and manage argument types by prefix. That's probably sufficient, although adding the ability to declare your own command names as an override (overload?) is something I think I might add to the next iteration of WinAGI.
  • WinAGI already supports negative numbers, which I think should be added to the 'canon' syntax.
  • Sierra's CG compiler doesn't use double-quoted strings for 'said' command arguments. The more I think about it, the more I like that idea, because I think it would make source code a bit easier on the eyes, allowing you to quickly tell the difference between actual message text and said command arguments. I might add this as an option for future WinAGI releases.
  • Indirection is also one that's different (using the '@' instead of '*'), but I don't think it's worth making any changes/additions to 'canon' syntax rules; the star is probably a lot more intuitive to modern programmers.
  • Although original compiler is very liberal with how it manages separations (allowing commas and semi-colons to be treated the same for example), I think the stricter enforcement in the 'canon' syntax is actually a good thing.
  • Labels, and increment/decrement having their operators BEFORE the symbol is something that WinAGI already supports, even though current 'canon' syntax doesn't include it. I don't think the post-fix versions should be eliminated, but I do believe the canon rules should be changed to allow the pre-fix versions as well.
  • Long-hand arithmetic (i.e. v1 = v1 + 1; instead of v1 += 1;) is also something that I think is fine to leave in 'canon' even though not supported by original Sierra compiler. Using goto like an AGI command (by requiring the destination to be enclosed in parentheses) is also something that should remain in 'canon', despite not being supported by the original.

OK, that was quite a bit of information! Honestly, though, it was mostly a thought exercise - I enjoy taking apart the binary code of AGI tools to get to the learn all the inner workings. The AGI community doesn't seem particularly strong these days, so I don't really expect there's much discussion to be had around any kind of 'official' or 'canon' AGI syntax. I'll probably just add those things that I think are worth it into the next WinAGI without worrying too much about whether it would be compatible with AGIStudio or any other compiler.

Anyway, feel free to comment/discuss/correct any of the above information. I'd be happy to respond.
[/list]
« Last Edit: January 14, 2022, 10:39:14 AM by AGKorson »



Offline lskovlun

Re: Sierra's AGI Compiler (CG.EXE) Disassembled
« Reply #1 on: January 12, 2022, 07:07:47 AM »
Nice write-up!

Offline ZvikaZ

Re: Sierra's AGI Compiler (CG.EXE) Disassembled
« Reply #2 on: January 12, 2022, 08:39:35 AM »
I agree with Lars :)

Few comments:
1.
Regarding the file's date. As far as I remember, directories were added only in MSDOS version 2 (can someone confirm that? I think it's implied from https://en.wikipedia.org/wiki/MS-DOS, but it's not 100% clear). Anyway, according to that Wikipedia article, MSDOS 2 was released on October 1983. So, if I'm right here, that's the earliest possible date for starting to work on a program that has directories support. But the guys at Sierra had to buy the new MSDOS version on October, learn its new features, design to use them, implement that, then there's vacation at December...
So, we could guess that it's from the 1984's beginning, in the earliest case.
However, an alternate story would be that Sierra had their hands on a DOS 2 before it was released (maybe as part of the special relations with IBM around that time), and in that case, the earliest date can be earlier...

2.
Quote
The compiler is a native MSDOS program (obviously, I think- but maybe not...)
Well, it's not obvious. IIRC, I once read an interview with one of Lucas developers, and he said that they developed on another machine. I don't remember which one, but he said that the IBM-PC was too weak to do real work with it, so they used some stronger machine, and cross-compiled to PC.

3.
Code: [Select]
cg room [room...] [-o output_directory] [-b buffer_size] [-v]That's very interesting!
DOS usually used slashes (/) for parameters. That was no problem, since DOS-1 didn't have directories.
When they added directories, they had to use the strange backslash (\), because the regular slash (/) was already used.
So, it's interesting that Sierra's tool used the more Unix-like tradition of - for parameters.
Maybe it could be related to the previous item, and hint that the tool was used on other machines as well?
But that's really a far-fetched guess...

4.
To be honest, I haven't (yet) read all the syntax differences details you wrote. However, I'm just curious if you checked with the original AGI sources that we have - do they all compile with "canon" compilers? With that CG compiler?

5.
Again, great work, and thanks!

Offline Charles

Re: Sierra's AGI Compiler (CG.EXE) Disassembled
« Reply #3 on: January 12, 2022, 09:23:53 AM »
If it's the same file OmerMor posted a couple years ago, the date is August 1, 1986 (10:29:58 AM).
CRC32: 022E711B
MD5: A70C533DCBDAA58491DD09A10AC1F251
SHA-1: 4700B789548A53FBE3600ADDB55E4B1C492436E0

http://sciprogramming.com/community/index.php?topic=1814.0

Offline Collector

Re: Sierra's AGI Compiler (CG.EXE) Disassembled
« Reply #4 on: January 12, 2022, 07:36:34 PM »
Nice. Would you mind making an entry on this on the Wiki?
KQII Remake Pic

Offline MusicallyInspired

Re: Sierra's AGI Compiler (CG.EXE) Disassembled
« Reply #5 on: January 12, 2022, 11:03:51 PM »
Really nice to see this much attention given to AGI still. :)
Brass Lantern Prop Competition

Offline klownstein

Re: Sierra's AGI Compiler (CG.EXE) Disassembled
« Reply #6 on: January 15, 2022, 02:24:45 AM »
I am definitely an amateur programmer, if that, and have not done any programming in the last two decades except to use AGIstudio and WinAGI to work on AGI projects as a personal hobby. So, my programming experience is very limited. My experience with AGI comes exclusively from playing AGI games as a child, from asking a handful of questions on these forums, and from the tools, like WinAGI, that others have made. So, while I don't really have much to contribute to the discussion about the original AGI compiler and it's inner workings, I very much appreciate those who do dissect it, understand it, and share that expertise with us. Thank you all.
-klownstein

Offline pmkelly

Re: Sierra's AGI Compiler (CG.EXE) Disassembled
« Reply #7 on: January 22, 2022, 05:56:34 AM »
Fantastic writeup!

A note on the hash table: This approach is very common, both for symbol tables in compilers as well as lots of other applications (most programming language these days include a hash table class in their standard library, and yes they are still important for performance even on modern machines). The number 203 would just be the number of "buckets", it doesn't mean that only 203 symbols can be present, since if there are more than that there will be entries with multiple links (as you mentioned). Normally a prime number is used for the number of buckets, so i'm not sure why they would have picked 203. See https://en.wikipedia.org/wiki/Hash_table

It's interesting that the compiler operates in a single pass. This is how the AGI Studio compiler worked; this was the first time I'd written a compiler and I didn't really know what I was doing. The following year I did a course on compiler construction as part of my CS degree and learnt about abstract syntax trees and realised "ah, that's how it's supposed to be done". Having said that, single-pass compilers weren't uncommon for the time. A well-known example is Turbo Pascal, and it's single-pass strategy was a contributor to its speed.

I'm also quite surprised to learn that the action & test commands were not hard-coded into the compiler. If I recall correctly, each had a corresponding numeric id in the compiled bytecode, which means it would have been necessary to ensure that the definitions you used matched up with the target interpreter.

Regarding the "canon" syntax, I think additions are fine as long as it doesn't break backwards compatibility with existing source. There could be an option to choose between the syntaxes. The CG syntax would mostly be for historical interest, or for compiling any sierra source code if there's any of that around.


SMF 2.0.19 | SMF © 2021, Simple Machines
Simple Audio Video Embedder

Page created in 0.026 seconds with 23 queries.