CHAPTER 9   DIRECTIVES IN A86

Segments in A86

The following discussion applies when A86 is assembling a .COM file. See the next chapter for the discussion of segmentation for .OBJ files.

A86 views the 86 computer's memory space as having two parts: The first part is the program, whose contents are the object bytes generated by A86 during its assembly of the source. A86 calls this area the CODE SEGMENT. The second part is the data area, whose contents are generated by the program after it starts running. A86 calls this area the DATA SEGMENT.

Please note well that the only difference between the CODE and DATA segments is whether the contents are generated by the program or the assembler. The names CODE and DATA suggest that program code is placed in the CODE segment, and data structures go in the DATA segment. This is mostly true, but there are exceptions. For example, there are many data structures whose contents are determined by the assembler: pointer tables, arrays of pre-defined constants, etc. These tables are assembled in the CODE segment.

In general, you will want to begin your program with the directive DATA SEGMENT, followed by an ORG statement giving the address of the start of your data area. You then list all your program variables and uninitialized data structures, using the directives DB, DW, and STRUC. A86 will allocate space starting at the address given in the ORG statement, but it will not generate any object bytes in that space.

After your data segment declarations, you provide a CODE SEGMENT directive. If the program starts at any location other than the standard 0100, you give an ORG giving the address of the start of your program. You follow this with the program itself, together with any assembler-generated data structures. A short program illustrating this suggested usage follows:

  DATA SEGMENT
      ORG 08000
      ANSWER_BYTE  DB ?
      CALL_COUNT   DW ?
  CODE SEGMENT
      JMP MAIN

  TRAN_TABLE:
      DB 16,3,56,23,0,9,12,7

  MAIN:
      MOV BX,TRAN_TABLE
      XLATB
      MOV ANSWER_BYTE,AL
      INC CALL_COUNT
      RET

A86 allows you to intersperse CODE SEGMENTs and DATA SEGMENTs throughout your program; but in general it is best to put all your DATA SEGMENT declarations at the top of your program, to avoid problems with forward referencing.

CODE ENDS and DATA ENDS Statements

For compatibility with Intel/IBM assemblers, A86 provides the CODE ENDS and DATA ENDS statements. The CODE ENDS statement is ignored; we assume that you have not nested a CODE segment inside a DATA segment. The DATA ENDS statement is equivalent to a CODE SEGMENT statement.

The ORG Directive

Syntax:   ORG address

ORG moves the output pointer (the location counter at which assembly is currently taking place within the current segment) to the value of the operand, which should be an absolute constant, or an expression evaluating to an absolute, non-forward-referenced constant.

ORG is most often used in a DATA segment, to control the location of the data area within the segment. For example, in programs that fit entirely into 64K, you provide an ORG directive as the first line within your DATA segment at the top of your program. The location given by the ORG is some location that you are sure will be beyond the end of your program. If you are sure that your program will not go beyond 8K (02000 hex), your program can look like this:

  DATA SEGMENT
      ORG 02000    ; data goes here, beyond the end of the program

  (your data segment variable and buffer declarations go here)

  DATA ENDS

  (your program goes here)

There is a special side effect to ORG when it is used in the CODE segment. If you begin your code segment with ORG 0, then A86 knows that you are not assembling a .COM program; but are instead assembling a code segment to be used in some other context (examples: programming a ROM, or assembling a procedure for older versions of Turbo Pascal).
The output file will start at 0, not 0100 as in a .COM file; and the default extension for the output file will be .BIN, not .COM.

Other than in the above example, you should not in general issue an ORG within the CODE segment that would lower the value of the output pointer. This is because you thereby put yourself in danger of losing part of your assembled program. If you re-assemble over space you have already assembled, you will clobber the previously-assembled code. Also, be aware that the size of the output program file is determined by the value of the code segment output pointer when the source program ends. If you ORG to a lower value at the end of your program, the output program file will be truncated to the lower-value address.

Again, almost no program producing a .COM file will need any ORG directive in the code segment. There is an implied ORG 0100 at the start of the program. You just start coding instructions, and the assembler will put them in the right place.

The EVEN Directive

Syntax:   EVEN

The EVEN directive coerces the current output pointer to an even value. In a DATA SEGMENT or STRUC, it does so by adding 1 to the pointer if the pointer was odd; doing nothing if the pointer was already even. In a code segment, it outputs a NOP if the pointer was odd. EVEN is most often used in data segments, before a sequence of DW directives. The 16-bit machines of the 86 family fetch words more quickly when they are aligned onto even addresses; so the EVEN directive ensures that your program will have the faster access to those DW's that follow it. (This speed improvement will not be seen on the 8-bit machines, most notably the 8088 of the original IBM-PC.)

Data Allocation Using DB, DW, DD, DQ, and DT

The 86 computer family supports the three fundamental data types BYTE, WORD, and DWORD. A byte is eight bits, a word is 16 bits (2 bytes), and a doubleword is 32 bits (4 bytes).
In addition, the 87 floating point processor manipulates 8-byte quantities, which we call Q-words, and 10-byte quantities, which we call T-bytes. The A86 data allocation statement is used to specify the bytes, words, doublewords, Q-words, and T-bytes which your program will use as data. The syntax for the data allocation statement is as follows:

  (optional var-name) DB (list of values)
  (optional var-name) DW (list of values)
  (optional var-name) DD (list of values)
  (optional var-name) DQ (list of values)
  (optional var-name) DT (list of values)

The variable name, if present, causes that name to be entered into the symbol table as a memory variable with type BYTE (for DB), WORD (for DW), DWORD (for DD), QWORD (for DQ), or TBYTE (for DT). The variable name should NOT have a colon after it, unless you wish the name to be a label (instructions referring to it will interpret the label as the constant pointer to the memory location, not its contents).

The DB statement is used to reserve bytes of storage; DW is used to reserve words. The list of values to the right of the DB or DW serves two purposes. It specifies how many bytes or words are allocated by the statement, as well as what their initial values should be. The list of values may contain a single value or more than one, separated by commas. The list can even be missing; meaning that we wish to define a byte or word variable at the same location as the next variable.

If the data initialization is in the DATA segment, the values given are ignored, except as place markers to reserve the appropriate number of units of storage. The use of "?", which in .COM mode is a synonym for zero, is recommended in this context to emphasize the lack of actual memory initialization. When A86 is assembling .OBJ files, the ?-initialization will cause a break in the segment (unless ? is embedded in a nested DUP containing non-? terms, in which case it is a synonym for zero).
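
The colon rule described above can be sketched as follows. This is a minimal illustration only, assuming a .COM program; ITEM_COUNT, PTR_TABLE, and MAIN are hypothetical names chosen for the example:

```asm
DATA SEGMENT
    ORG 08000
ITEM_COUNT  DW ?            ; no colon: a memory variable of type WORD

CODE SEGMENT
    JMP MAIN                ; step over the assembler-generated table

PTR_TABLE:                  ; colon: a label, i.e. a constant pointer
    DW 1,2,3

MAIN:
    MOV AX,ITEM_COUNT       ; loads the word stored at ITEM_COUNT
    MOV BX,PTR_TABLE        ; loads the address of PTR_TABLE, not its contents
    RET
```

The second MOV follows the same pattern as MOV BX,TRAN_TABLE in the sample program at the start of this chapter.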

A special value which can be used in data initializations is the DUP construct, which allows the allocation and/or initialization of blocks of data. The expression n DUP x is equivalent to a list with x repeated n times. "x" can be either a single value, a list of values, or another DUP construct nested inside the first one. The nested DUP construct needs to be surrounded by parentheses. All other assemblers, and earlier versions of A86, require parentheses around all right operands to DUP, even simple ones; but this requirement has been removed for simple operands in the current A86.

Here are some examples of data initialization statements, with and without DUP constructs:

  CODE SEGMENT
  DW 5                   ; allocate one word, init. to 5
  DB 0,3,0               ; allocate three bytes, init. to 0,3,0
  DB 5 DUP 0             ; equivalent to DB 0,0,0,0,0
  DW 2 DUP (0,4 DUP 7)   ; equivalent to DW 0,7,7,7,7,0,7,7,7,7

  DATA SEGMENT
  XX      DW ?           ; define a word variable XX
  YYLOW   DB             ; no init value: YYLOW is low byte of word var YY
  YY      DW ?
  X_ARRAY DB 100 DUP ?   ; X_ARRAY is a 100-byte array
  D_REAL  DQ ?           ; double precision floating variable
  EX_REAL DT ?           ; extended precision floating variable

A character string value may be used to initialize consecutive bytes in a DB statement. Each character will be represented by its ASCII code. The characters are stored in the order that they appear in the string, with the first character assigned to the lowest-addressed byte. In the DB statement that follows, five bytes are initialized with the ASCII representation of the characters in the string 'HELLO':

  DB 'HELLO'

Note that except for string comparisons described in the previous chapter, the DB directive is the only place in your program that strings of length greater than 2 may occur. In all other contexts (including DW), a string is treated as the constant number representing the ASCII value of the string; for example, CMP AL,'@' is the instruction comparing the AL register with the ASCII value of the at-sign.
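
As a hedged illustration of the two string contexts just described, the fragment below places a string in a DB and uses a one-character string as a numeric constant. GREETING and GREETING_LEN are hypothetical names, and the fragment is meant to be read, not executed as a complete program:

```asm
CODE SEGMENT
GREETING:
    DB 'HELLO'                  ; five bytes, in order: hex 48 45 4C 4C 4F
GREETING_LEN EQU $-GREETING     ; the assembler computes the length, 5

    CMP AL,'@'                  ; '@' acts as the constant hex 40
```

Letting the assembler compute the length from the output pointer avoids recounting characters by hand whenever the string changes.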

Note further that 2-character string constants, like all constants in the 8086, have their bytes reversed. Thus, while DB 'AB' will produce hex 41 followed by hex 42, the similar looking DW 'AB' reverses the bytes: hex 42 followed by hex 41.

For compatibility, A86 now accepts double quotes, as well as single quotes, for strings in DB directives.

The DD directive is used to initialize 32-bit doubleword pointers to locations in arbitrary segments of the 86's memory space. Values for such pointers are given by two numbers separated by a colon. The segment register value appears to the left of the colon; and the offset appears to the right of the colon. In keeping with the reversed-bytes nature of memory storage in the 86 family, the offset comes first in memory. For example, the statement

  DD 01234:05678

appearing in a CODE segment will cause the hex bytes 78 56 34 12 to be generated, which is a long pointer to segment 01234, offset 05678.

DD, DQ, and DT can also be used to initialize large integers and floating point numbers. Examples:

  DD 500000   ; half million, too big for most 86 instructions
  DD 3.5      ; single precision floating point number
  DQ 3.5      ; the same number in a double precision format
  DT 3.5      ; the same number in an extended precision format

The STRUC Directive

The STRUC directive is used to define a template of data to be addressed by one of the 8086's base and/or index registers. The syntax of STRUC is as follows:

  (optional strucname) STRUC (optional effective address)

The optional structure name given at the beginning of the line can appear in subsequent expressions in the program, with the operator TYPE applied to it, to yield the number of bytes in the structure template.

The STRUC directive causes the assembler to enter a mode similar to DATA SEGMENT: assembly within the structure declares symbols (the elements of the structure), using a location counter that starts out at the address following STRUC.
If no address is given, assembly starts at location 0. An option not available to the DATA SEGMENT is that the address can include one base register [BX] or [BP] and/or one index register [SI] or [DI]. The registers are part of the implicit declaration of all structure elements, with the offset value increasing by the number of bytes allocated in each structure line. For example:

  LINE  STRUC [BP]      ; the template starts at [BP]
        DB 80 DUP (?)   ; these 80 bytes advance us to [BP+80]
  LSIZE DB ?            ; this 1 byte advances us to [BP+81]
  LPROT DB ?
        ENDS

The STRUC just given defines the variables LSIZE, equivalent to B[BP+80], and LPROT, equivalent to B[BP+81]. You can now issue instructions such as MOV AL,LSIZE; which automatically generates the correct indexing for you.

The mode entered by STRUC is terminated by the ENDS directive, which returns the assembler to whatever segment (CODE or DATA) it was in before the STRUC, with the location counter restored to its value within that segment before the STRUC was declared.

Forward References

A86 allows names for a variety of program elements to be forward referenced. This means that you may use a symbol in one statement and define it later with another statement. For example:

      JNZ TARGET
      .
      .
  TARGET:
      ADD AX,10

In this example, a conditional jump is made to TARGET, a label farther down in the code. When JNZ TARGET is seen, TARGET is undefined, so this is a forward reference.

Earlier versions of A86 were much more restricted in the kinds of forward references allowed. Most of the restrictions have now been eased, for convenience as well as compatibility with other assemblers. In particular, you may now make forward references to variable names. You just need to see to it that A86 has enough information about the type of the operand to generate the correct instruction. For example, MOV FOO,AL will cause A86 to correctly deduce that FOO is a byte variable.
You can even code a subsequent MOV FOO,1 and A86 will remember that FOO was assumed to be a byte variable. But if you code MOV FOO,1 first, A86 won't know whether to issue a byte or a word MOV instruction; and will thus issue an error message. You then specify the type by MOV FOO B,1.

In general, A86's compatibility with That Other assembler has improved dramatically for forward references. Now, for most programs, you need only sprinkle a very few B's and W's into your references. And you'll be rewarded: in many cases the word form is longer than the byte form, so that the other assembler winds up inserting a wasted NOP in your program. You'll wind up with tighter code by using A86!

Forward References in Expressions

A86 now allows you to add or subtract a constant number from a forward reference symbol; and to append indexing registers to a forward reference symbol. This covers a vast majority of expressions formerly disallowed. For the remaining, more complicated expressions, there is a trick you can use to work your way around almost any case where you might run into a forward reference restriction. The trick is to move the expression evaluation down in your program so that it no longer contains a forward reference; and forward reference the evaluation answer. For example, suppose you wish to advance the ES segment register to point immediately beyond your program. If PROG_SIZE is the number of bytes in your program, then you add (PROG_SIZE+15)/16 to the program's segment register value. This value is known at assembly time; but it isn't known until the end of the program.
You do the following:

  MOV AX,CS         ; fetch the program's segment value
  ADD AX,SEG_SIZE   ; use a simple forward reference
  MOV ES,AX         ; ES is now loaded as desired

Then at the end of the program you evaluate the expression:

  PROG_SIZE EQU $
  SEG_SIZE  EQU (PROG_SIZE+15)/16

The EQU Directive

Syntax:   symbol-name EQU expression
          symbol-name EQU built-in-symbol
          symbol-name EQU INT n

The expression field may specify an operand of any type that could appear as an operand to an instruction.

As a simple example, suppose you are writing a program that manipulates a table containing 100 names and that you want to refer to the maximum number of names throughout the source file. You can, of course, use the number 100 to refer to this maximum each time, as in MOV CX,100, but this approach suffers from two weaknesses. First of all, 100 can mean a lot of things; in the absence of comments, it is not obvious that a particular use of 100 refers to the maximum number of names. Secondly, if you extend the table to allow 200 names, you will have to locate each 100 and change it to a 200. Suppose, instead, that you define a symbol to represent the maximum number of names with the following statement:

  MAX_NAMES EQU 100

Now when you use the symbol MAX_NAMES instead of the number 100 (for example, MOV CX,MAX_NAMES), it will be obvious that you are referring to the maximum number of names in the table. Also, if you decide to extend the table, you need only change the 100 in the EQU directive to a 200 and every reference to MAX_NAMES will reflect the change.

You could also take advantage of A86's strong typing, by changing MAX_NAMES to a variable:

  MAX_NAMES DB ?

or even an indexed quantity:

  MAX_NAMES EQU [BX+1]

Because the A86 language is strongly typed, the instruction for loading MAX_NAMES into the CX register remains exactly the same in all cases: simply MOV CX,MAX_NAMES.
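
The constant form of the equate might be used as in the sketch below. NEXT_NAME is a hypothetical label added for the example, and the loop body is left as a placeholder comment:

```asm
MAX_NAMES EQU 100           ; the only line to edit if the table grows

    MOV CX,MAX_NAMES        ; assembles exactly like MOV CX,100
NEXT_NAME:
    ; (process one name here)
    LOOP NEXT_NAME          ; repeat once for each of the MAX_NAMES entries
```

Changing the 100 to 200 in the EQU line changes the loop count without touching the instructions.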

Equates to Built-In Symbols

A86 allows you to define synonyms for any of the assembler reserved symbols, by EQUating an alternate name of your choosing, to that symbol. For example, suppose you were coding a source module that is to be incorporated into several different programs. In some programs, a certain variable will exist in the code segment. In others, it will exist in the stack segment. You want to address the variable in the common source module, but you don't know which segment override to use. The solution is to declare a synonym, QS, for the segment register. QS will be defined by each program: the code-segment program will have a QS EQU CS at the top of it; the stack-segment program will have QS EQU SS. The source module can use QS as an override, just as if it were CS or SS. The code would be, for example, QS MOV AL,VARNAME.

The NIL Prefix

A86 provides a mnemonic, NIL, that generates no code. NIL can be used as a prefix to another instruction (which will have no effect on that instruction), or it can appear by itself on a line. NIL is provided to extend the example in the previous section, to cover the possibility of no overrides. If your source module goes into a program that fits into 64K, so that all the segment registers have the same value, then code QS EQU NIL at the top of that program.

Interrupt Equates

A86 allows you to equate your own name to an INT instruction with a specific interrupt number. For example, if you place TRAP EQU INT 3 at the top of your program, you can use the name TRAP as a synonym for INT 3 (the debugger trap on the 8086).

Duplicate Definitions

A86 contains the unique feature of duplicate definitions. We have already discussed local symbols, which can be redefined to different values without restriction. Local symbols are the only symbols that can be redefined. However, any symbol can be defined more than once, as long as the symbol is defined to be the same value and type in each definition.

This feature has two uses. First, it eases modular program development. For example, if two independently-developed source files both use the symbol ESC to stand for the ASCII code for ESCAPE, they can both contain the declaration ESC EQU 01B, with no problems if they are combined into the same program.

The second use for this feature is assertion checking. Your deliberate redeclaration of a symbol name is an assertion that the value of the symbol has not changed; and you want the assembler to issue you an error message if it has changed. Example: suppose you have declared a table of options in your DATA segment; and you have another table of initial values for those options in your CODE segment. If you come back months later and add an option to your tables, you want to be reminded to update both tables in the same way. You should declare your tables as follows:

  DATA SEGMENT
  OPTIONS:
      .
      .
  OPT_COUNT EQU $-OPTIONS     ; OPT_COUNT is the size of the table

  CODE SEGMENT
  OPT_INITS:
      .
      .
  OPT_COUNT EQU $-OPT_INITS   ; second OPT_COUNT had better be the same!

The = Directive

Syntax:   symbol-name = expression
          symbol-name = built-in-symbol
          symbol-name = INT n

The equals sign directive is provided for compatibility with That Other assembler. It is identical to the EQU directive, with one exception: if the first time a symbol appears in a program is in an = directive, that symbol will be taken as a local symbol. It can be redefined to other values, just like the generic local symbols (letter followed by digits) that A86 supports. (If you try to redefine an EQU symbol to a different value, you get an error message.) The = facility is most often used to define "assembler variables", that change value as the assembly progresses.

The PROC Directive

Syntax:   name PROC NEAR
          name PROC FAR
          name PROC

PROC is a directive provided for compatibility with Intel/IBM assemblers. I don't like PROC; and I recommend that you do not use it, even if you are programming for those assemblers.

The idea behind PROC is to give the assembler a mechanism whereby it can decide for you what kind of RET instruction you should be providing. If you specify NEAR in your PROC directive, then the assembler will generate a near (same segment) return when it sees RET. If you specify FAR in your PROC directive, the assembler will generate a far RETF return (which will cause both IP and CS to be popped from the stack). If you simply leave well enough alone, and never code a PROC in your program, then RET will mean near return throughout your program.

The reason I don't like PROC is because it is yet another attempt by the assembler to do things "behind your back". This goes against the reason why you are programming in assembly language in the first place, which is to have complete control over the code generated by your source program. It leads to nothing but trouble and confusion.

Another problem with PROC is its verbosity. It replaces a simple colon, given right after the label it defines. This creates a visual clutter in the program, that makes the program harder to read.

A86 provides an explicit RETF mnemonic so that you don't need to use PROC to distinguish between near and far return instructions. You can use RET for a near return and RETF for a far return. Even if you are programming in that other assembler, and you need to code a far return, I recommend that you create a RETF macro (it would have the single line DB 0CBH), and stay away from PROCs entirely.

The ENDP Directive

Syntax:   [name] ENDP

The only action A86 takes when it sees an ENDP directive is to return the assembler to its (sane) default state, in which RET is a near return. NOTE that this means that A86 does not support nested PROCs, in which anything but the innermost PROC has the FAR attribute. I'm sorry if I am blunt, but anybody who would subject their program to that level of syntactic clutter has rocks in their head.
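
The PROC-free style recommended above can be sketched as two routines that differ only in their return mnemonic. NEAR_ROUTINE and FAR_ROUTINE are hypothetical names, and the INC AX bodies are placeholders:

```asm
NEAR_ROUTINE:               ; a plain colon label, no PROC needed
    INC AX
    RET                     ; near return: pops IP only

FAR_ROUTINE:                ; intended to be called from another segment
    INC AX
    RETF                    ; far return: pops IP, then CS
```

The kind of return is visible on the line that performs it, so nothing depends on a directive elsewhere in the source.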

The LABEL Directive

Syntax:   name LABEL NEAR
          name LABEL FAR
          name LABEL BYTE
          name LABEL WORD

LABEL is another directive provided for compatibility with Intel/IBM assemblers. A86 provides less verbose ways of specifying all the above LABEL forms, except for LABEL FAR.

LABEL defines "name" to have the type given, and a value equal to the current output pointer. Thus, LABEL NEAR is synonymous with a simple colon following the name; and LABEL BYTE and LABEL WORD are synonymous with DB and DW, respectively, with no operands.

LABEL FAR does have a unique functionality, not found in other assemblers. It identifies "name" as a procedure that can be called from outside this program's code segment. Such procedures should have RETFs instead of RETs. Furthermore, I have provided the following feature, unique to A86: if you CALL the procedure from within your program, A86 will generate a PUSH CS instruction followed by a NEAR call to the procedure. Other assemblers will generate a FAR call, having the same functional effect; but the FAR call consumes more program space, and takes more time to execute.

WARNING: you cannot use the above CALL feature as a forward reference; the LABEL FAR definition must precede any CALLs to it. This is unavoidable, since the assembler must assume that a CALL to an undefined symbol takes 3 program bytes. All assemblers will issue an error in this situation.
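
The ordering requirement in the warning can be sketched as follows. FAR_ENTRY and MAIN are hypothetical names, and the comment on the CALL restates the behavior described in this section, not verified assembler output:

```asm
    JMP MAIN                ; step over the procedure body

FAR_ENTRY LABEL FAR         ; defined before any CALL that refers to it
    INC AX
    RETF                    ; far return: pops IP, then CS

MAIN:
    CALL FAR_ENTRY          ; per the text: PUSH CS, then a near CALL
    RET
```

Because PUSH CS supplies the segment half of the return address, the RETF inside the procedure still finds both IP and CS on the stack.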