
x86 Assembly from my understanding

Soooo this article (or maybe even a series of articles, who knows?) will be about x86 assembly, or rather, what I understood from it and my road from the bottom up, hopefully reaching a good level of understanding.

Memory :

Memory is a sequence of octets (aka 8 bits), each with a unique integer assigned to it. This article calls that integer the Effective Address (EA); Intel's manuals call it the physical address and reserve "effective address" for the offset part. In this particular CPU architecture (the i8086), an octet is designated by a pair: a segment number and an offset within that segment.

  • The Segment is a set of 64 consecutive Koctets (1 Koctet = 1024 octets).
  • And the offset is to specify the particular octet in that segment.

The offset and the segment are each encoded in 16 bits, so each takes a value between 0 and 65535.

Important :

The relation between the Effective Address and the Segment & Offset is as follows :

Effective address = 16 x segment + offset

Keep in mind that this equation is written in decimal. From here on we use hexadecimal by convention, where the same equation reads Effective address = 10h x segment + offset.

  • Example :

    Take the physical address (or Effective Address, this article uses the two terms interchangeably) 12345h (the h stands for hexadecimal, which can also be written 0x12345), with the register DS = 1230h and the register SI = 0045h. The CPU calculates the physical address by multiplying the content of the segment register DS by 10h (or 16) and adding the content of the register SI, so we get : 1230h x 10h + 45h = 12345h

    Now if you are a clever one (I know you are, since you are reading this <3) you may say that the physical address 12345h can be written in more than one way... and you are right: adding 1 to the segment can always be cancelled by subtracting 10h from the offset, which gives, more precisely, 2^12 = 4096 different ways !!!
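To double check the arithmetic above, here is a small Python sketch (Python rather than assembly, used purely as a calculator). The names `physical_address` and `aliases` are made up for this illustration, and the 20-bit mask assumes a real 8086 with its 20 address lines.

```python
def physical_address(segment: int, offset: int) -> int:
    """Return the 20-bit physical address for a 16-bit segment:offset pair."""
    # segment x 10h + offset, kept to 20 bits like the 8086 address bus
    return ((segment << 4) + offset) & 0xFFFFF


def aliases(physical: int) -> list:
    """List every (segment, offset) pair that points at the same physical address."""
    pairs = []
    for segment in range(0x10000):  # every possible 16-bit segment value
        offset = (physical - (segment << 4)) & 0xFFFFF
        if offset <= 0xFFFF:  # the offset must also fit in 16 bits
            pairs.append((segment, offset))
    return pairs


print(hex(physical_address(0x1230, 0x0045)))  # the DS:SI example
print(len(aliases(0x12345)))                  # how many ways to write 12345h
```

The pair (1234h, 0005h) is one of the entries in that list, next to the (1230h, 0045h) pair used in the example.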

Registers

The 8086 CPU has 14 registers, each 16 bits wide. From the POV of the user, the 8086 has 3 groups of 4 16-bit registers (general, pointer/index and segment registers), plus one status register in which 9 bits are used as flags, and a 16-bit program counter (the instruction pointer, IP) that the program cannot read or write directly.

General Registers

General registers take part in arithmetic, logic, and addressing too.

Each half of a general register is accessible as an 8-bit register, which made it easier to port code from the 8080 (which had 8-bit registers).
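As a quick illustration of how the two halves relate to the full register, here is a Python sketch of the bit arithmetic. `split_register` and `join_register` are hypothetical helper names, not CPU instructions.

```python
def split_register(value: int) -> tuple:
    """Split a 16-bit register value into its (high, low) 8-bit halves."""
    return (value >> 8) & 0xFF, value & 0xFF


def join_register(high: int, low: int) -> int:
    """Rebuild the 16-bit value from its two 8-bit halves."""
    return ((high & 0xFF) << 8) | (low & 0xFF)


ah, al = split_register(0x1234)  # pretend AX = 1234h
print(hex(ah), hex(al))
print(hex(join_register(ah, al)))
```

So with AX = 1234h, AH holds 12h and AL holds 34h, and writing to either half changes the matching part of AX.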

Now here are the Registers we can find in this section:

AX: This is the accumulator. It is 16 bits wide and is divided into two 8-bit registers, AH and AL, so it can also take part in 8-bit instructions. It is generally used for arithmetic and logical instructions, but on the 8086 the accumulator does not have to be the destination operand. Example:

ADD AX, AX ;(AX = AX + AX)

BX: This is the base register. It is 16 bits wide and is divided into two 8-bit registers, BH and BL, so it can also take part in 8-bit instructions. It is used to store the value of the offset. Example:

MOV BL, [500h] ;(BL = the byte stored at offset 500h in DS, not the number 500h itself)

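CX: This is the counter register. It is 16 bits wide and is divided into two 8-bit registers, CH and CL, so it can also take part in 8-bit instructions. It is used in looping and rotation. Example:

MOV CX, 0005
again:          ; "again" is just an example label
LOOP again      ; CX = CX - 1, then jump back to again while CX is not 0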

DX: This is the data register. It is 16 bits wide and is divided into two 8-bit registers, DH and DL, so it can also take part in 8-bit instructions. It is used in multiplication and input/output port addressing. Example:

MUL BX ;(DX:AX = AX * BX, high word in DX, low word in AX)
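To see why MUL needs two registers for its result, here is a Python sketch of an unsigned 16-bit multiplication. `mul16` is a made-up name that only models where the bits of the 32-bit product end up.

```python
def mul16(ax: int, bx: int) -> tuple:
    """Model an unsigned 16-bit MUL: return the (DX, AX) pair holding the product."""
    product = (ax & 0xFFFF) * (bx & 0xFFFF)  # can need up to 32 bits
    dx = (product >> 16) & 0xFFFF            # high word goes to DX
    new_ax = product & 0xFFFF                # low word stays in AX
    return dx, new_ax


dx, ax = mul16(0x1234, 0x0100)
print(hex(dx), hex(ax))
```

Here 1234h x 0100h = 123400h, which does not fit in 16 bits, so DX receives 0012h and AX receives 3400h.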

Addressing and registers…again

I realized what I wrote here before was almost gibberish, sooo here we go again I guess ?

Well, let's take a step back to the notion of effective addresses VS relative ones.

Effective = 10h x Segment + Offset . Part1

When trying to access a specific memory location, we use the notation [Segment:Offset]. For example, assuming DS = 0100h, we want to write the value 0x0005 to the memory location at physical address 1234h. What do we do ?

  • Answer :
    MOV [DS:0234h], 0x0005
    

    Why ? Let's break it down :

    We already know that Effective = 10h x Segment + Offset, so here we have : 1234h = 10h x DS + Offset. Since DS = 0100h, we end up with the simple equation 1234h = 1000h + Offset, therefore the Offset is 0234h.

    Simple, right ? Now for another example.
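The same breakdown can be written as a tiny Python sketch. `offset_for` is a hypothetical helper that just rearranges the equation and refuses segments that cannot reach the address.

```python
def offset_for(physical: int, segment: int) -> int:
    """Solve physical = 10h x segment + offset for the offset."""
    offset = physical - (segment << 4)
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("this segment cannot reach that physical address")
    return offset


print(hex(offset_for(0x1234, 0x0100)))  # DS = 0100h, target = 1234h
```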

Another example :

What if we now have this instruction ?

MOV [0234h], 0x0005

What does it do ? You might or might not be surprised that it does the exact same thing as the other snippet of code. Why though ? Because the CPU implicitly uses DS as the segment for data accesses when the instruction doesn't name one. So if you don't specify a register (we will get to this later) or a segment, the offset is treated as an offset into the DS segment.

Segment + Register <3

Consider DS = 0100h and BX = BP = 0234h and this code snippet:

MOV [BX], 0x0005 ; NOTE : IT'S NOT THE SAME AS MOV BX, 0x0005. Refer to earlier paragraphs

Well you guessed it right, it also does the same thing, but now consider this :

MOV [BP], 0x0005

If you answered that it's the same one, you are wrong. This is because the segment used implicitly changes according to the register that holds the offset. Here is the explicit equivalent of the two commands above:

MOV [DS:BX], 0x0005
MOV [SS:BP], 0x0005

The General rule of thumb is as follows :

  • If the offset register is DI, SI or BX, the segment used is DS.
  • If it's BP, then the segment is SS. (SP also works with SS, but only implicitly through instructions like PUSH and POP; the 8086 can't use [SP] as a memory operand.)
  • Note

    The values of the registers CS, DS and SS are automatically initialized by the OS when launching the program, so these segments are implicit. AKA : if we want to access specific data in memory, we just need to specify its offset. Also, you can't load an immediate value directly into a segment register, so something like

    MOV DS, 0x0005 ; Is INVALID
    MOV DS, AX ; This one is VALID
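Here is a Python sketch of that rule of thumb. The table `DEFAULT_SEGMENT` and the function `resolve` are made up for this illustration, and SS = 0200h is an assumed value, since the example above only fixes DS, BX and BP.

```python
# Which segment register the CPU pairs with each pointer register by default
DEFAULT_SEGMENT = {"BX": "DS", "SI": "DS", "DI": "DS", "BP": "SS"}


def resolve(pointer: str, registers: dict) -> int:
    """Physical address reached by [pointer] when no segment is written out."""
    segment = registers[DEFAULT_SEGMENT[pointer]]
    return ((segment << 4) + registers[pointer]) & 0xFFFFF


# DS, BX and BP come from the example; SS = 0200h is an assumption
registers = {"DS": 0x0100, "SS": 0x0200, "BX": 0x0234, "BP": 0x0234}

print(hex(resolve("BX", registers)))  # MOV [BX], ... goes through DS
print(hex(resolve("BP", registers)))  # MOV [BP], ... goes through SS
```

Even though BX and BP hold the same offset, the two writes land at different physical addresses (1234h and 2234h with these values).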
    

The ACTUAL thing :

Enough technical rambling, and now we shall go to the fun part, the ACTUAL CODE. But first, some names you should be familiar with :

  • Mnemonics : Or Instructions, are the...well...instructions executed by the CPU, like MOV, ADD, MUL...etc. They are case insensitive, but I like them better in UPPERCASE.
  • Operands : These are the arguments passed to the instructions, like MOV dst, src, and they can be anything from a memory location, to a variable, to an immediate value.

Structure of an assembly program :

While there is no "standard" structure, I prefer to go with this one :

org 100h
.data
    ; variables and constants

.code
    ; instructions

MOV dst, src

The MOV instruction copies the second operand (src) to the first operand (dst). The source can be a memory location, an immediate value, or a general-purpose register (AX, BX, CX, DX). As for the destination, it can be a general-purpose register or a memory location.

These types of operands are supported:

MOV REG, memory
MOV memory, REG
MOV REG, REG
MOV memory, immediate
MOV REG, immediate

REG: AX, BX, CX, DX, AH, AL, BL, BH, CH, CL, DH, DL, DI, SI, BP, SP.

memory: [BX], [BX+SI+7], variable

immediate: 5, -24, 3Fh, 10001101b

For segment registers, only these types of MOV are supported:

MOV SREG, memory
MOV memory, SREG
MOV REG, SREG
MOV SREG, REG

SREG: DS, ES, SS, and only as second operand: CS.

REG (16-bit registers only, since segment registers are 16 bits wide): AX, BX, CX, DX, DI, SI, BP, SP.

memory: [BX], [BX+SI+7], variable

Note : The MOV instruction cannot be used to set the value of the CS and IP registers

Variables :

Let's say you want to use a specific value multiple times in your code, do you prefer to call it using something like var1 or E4F9:0011 ? If your answer is the second option, you can gladly skip this section, or even better, seek therapy.

Anyways, we have two types of variables, bytes and words (which are two bytes). To define a variable, we use the following syntax:

name DB value ; To Define a Byte
name DW value ; To Define a Word

name - can be any letter or digit combination, though it should start with a letter. It's possible to declare unnamed variables by not specifying the name (this variable will have an address but no name).

value - can be any numeric value in any supported numbering system (hexadecimal, binary, or decimal), or the "?" symbol for variables that are not initialized.

Example code :

org 100h
.data
x db 33
y dw 1350h

.code
MOV AL, x
MOV BX, y

Arrays :

We can also define arrays instead of single values, using comma separated values. Like this for example:

a db 48h, 65h, 6Ch, 6Ch, 6Fh, 00h
b db 'Hello', 0

Surprise surprise, the arrays a and b are identical. The reason is that characters are first converted to their ASCII values and then stored in memory!!! Wonderful, right ? And guess what, accessing values in assembly LOOKS THE SAME AS IN C !!! (One catch: the index is a byte offset, so in a DW array element number i sits at index 2 x i.)

MOV AL, a[0] ; Copies 48h to AL
MOV BL, b[0] ; Also Copies 48h to BL

You can also use any of the memory index registers BX, SI, DI, BP, for example:

MOV SI, 3
MOV AL, a[SI]
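If you want to check the ASCII values yourself, this short Python sketch prints the bytes that a string like 'Hello', 0 turns into. The variable names are only for this illustration.

```python
text = "Hello"

# Each character becomes its ASCII code, then the 0 terminator is appended
stored = list(text.encode("ascii")) + [0]
print([f"{byte:02X}h" for byte in stored])

si = 3  # same index as MOV SI, 3
print(f"{stored[si]:02X}h")  # the byte that b[SI] would read
```

The letter l appears twice in Hello, so 6Ch shows up twice in a row, and index 3 lands on the second one.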

If you need to declare a large array you can use DUP operator. The syntax for DUP:

number DUP ( value(s) )

number - number of duplicates to make (any constant value).

value - expression that DUP will duplicate.

for example:

c DB 5 DUP(9)
;is an alternative way of declaring:
c DB 9, 9, 9, 9, 9

one more example:

d DB 5 DUP(1, 2)
;is an alternative way of declaring:
d DB 1, 2, 1, 2, 1, 2, 1, 2, 1, 2
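The expansion that DUP performs can be mimicked in a couple of lines of Python. `dup` here is a hypothetical stand-in for the operator, not something the assembler provides.

```python
def dup(count: int, *values: int) -> list:
    """Repeat the given value(s) count times, like `count DUP(values)`."""
    return list(values) * count


print(dup(5, 9))     # c DB 5 DUP(9)
print(dup(5, 1, 2))  # d DB 5 DUP(1, 2)
```

Note that 5 DUP(1, 2) reserves 10 bytes, not 5, because the whole group inside the parentheses is repeated.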

Of course, you can use DW instead of DB if it's required to keep values larger than 255, or smaller than -128. DW cannot be used to declare strings.

LEA

LEA (Load Effective Address) is an instruction used to get the offset of a specific variable. We will see later how it's used, but first, here is something we will need :

In order to tell the assembler about the data type, these prefixes should be used:

BYTE PTR - for byte.
WORD PTR - for word (two bytes).

For example:

BYTE PTR [BX] ; byte access.
WORD PTR [BX] ; word access.

The assembler supports shorter prefixes as well:

  • b. - for BYTE PTR
  • w. - for WORD PTR

In certain cases the assembler can calculate the data type automatically.

  • Example :
    org 100h
    .data
    VAR1 db 50h
    VAR2 dw 1234h
    .code
    MOV AL, VAR1 ; We check the value of VAR1 by putting it in AL
    MOV AX, VAR2 ; Same here
    LEA BX, VAR1 ; BX receives the Address of VAR1
    MOV b.[BX], 44h
    MOV AL, VAR1 ; We effectively changed the content of the VAR1 variable
    LEA BX, VAR2
    MOV w.[BX], 5678h
    MOV AX, VAR2
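To picture what the byte and word writes do to memory, here is a Python sketch that replays the example on a tiny fake data segment. The offsets 0 and 1 for VAR1 and VAR2 are assumptions (the real ones depend on where the assembler places the data), and the helper names are made up. The low-byte-first order is how x86 stores words.

```python
memory = bytearray(16)  # a tiny stand-in for the data segment

VAR1 = 0x00  # hypothetical offset of VAR1 (1 byte)
VAR2 = 0x01  # hypothetical offset of VAR2 (2 bytes)


def write_byte(offset: int, value: int) -> None:
    memory[offset] = value & 0xFF


def write_word(offset: int, value: int) -> None:
    memory[offset] = value & 0xFF             # low byte first (little-endian)
    memory[offset + 1] = (value >> 8) & 0xFF  # then the high byte


def read_word(offset: int) -> int:
    return memory[offset] | (memory[offset + 1] << 8)


write_byte(VAR1, 0x50)    # VAR1 db 50h
write_word(VAR2, 0x1234)  # VAR2 dw 1234h

bx = VAR1                 # LEA BX, VAR1
write_byte(bx, 0x44)      # MOV b.[BX], 44h
bx = VAR2                 # LEA BX, VAR2
write_word(bx, 0x5678)    # MOV w.[BX], 5678h

print(hex(memory[VAR1]), hex(read_word(VAR2)))
```

The byte write touches one cell and the word write touches two, which is exactly the information the b. and w. prefixes give the assembler.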
    

Constants :

Constants in Assembly only exist until the code is assembled, meaning that if you disassemble your code later, you won't see your constant definitions.

Defining constants is pretty straightforward :

name EQU value

Of course constants can't be changed, and aren't stored in memory. So they are like little macros that live in your code.

⚐ :

Now comes the notion of Flags, which are bits in the Status register. They are set by logical and arithmetical instructions and can take a value of 1 or 0. Here are 8 of the 9 flags that exist for the 8086 CPU (the one left out is the Trap Flag, TF, used for single-step debugging) :

  • Carry Flag(CF): Set to 1 when there is an unsigned overflow, for example when you add 255 + 1 (256 is not in range [0, 255]). By default it's set to 0.
  • Overflow Flag(OF): Set to 1 when there is a signed overflow, for example when you add 100 + 50 (150 is not in range [-128, 127]). By default it's set to 0.
  • Zero Flag(ZF): Set to 1 when the result is 0. By default it's set to 0.
  • Auxiliary Flag(AF): Set to 1 when there is an unsigned overflow out of the low nibble (4 bits), or in human words : when there is a carry inside the number. For example when you add 29h + 4Ch : 9h + Ch = 15h, so we keep the 5, carry the 1 into 2 + 4, and AF is set to 1.
  • Parity Flag(PF): Set to 1 when the result has an even number of one bits, and 0 if it has an odd number of one bits. Even if a result is a word, only the low 8 bits are analyzed.
  • Sign Flag(SF): Self explanatory, set to 1 if the result is negative and 0 otherwise (it is a copy of the most significant bit of the result).
  • Interrupt Enable Flag(IF): When it's set to 1, the CPU reacts to interrupts from external devices.
  • Direction Flag(DF): When this flag is set to 0, string instructions process memory forward (toward higher addresses); if it's set to 1, they go backward.
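To play with the arithmetic flags, here is a Python sketch that models an 8-bit addition and works out the six flags that depend on the result. `add8` is a made-up helper that follows the definitions in the list above. It is a model for experimenting, not how the CPU is actually wired.

```python
def add8(a: int, b: int) -> tuple:
    """Add two 8-bit values and return (result, flags) like an 8-bit ADD."""
    a &= 0xFF
    b &= 0xFF
    total = a + b
    result = total & 0xFF  # only 8 bits fit in the destination
    flags = {
        "CF": int(total > 0xFF),                               # unsigned overflow
        "OF": int(((a ^ result) & (b ^ result) & 0x80) != 0),  # signed overflow
        "ZF": int(result == 0),                                # result is zero
        "AF": int((a & 0x0F) + (b & 0x0F) > 0x0F),             # carry out of low nibble
        "PF": int(bin(result).count("1") % 2 == 0),            # even number of 1 bits
        "SF": int((result & 0x80) != 0),                       # top bit of the result
    }
    return result, flags


for a, b in [(255, 1), (100, 50), (0x29, 0x4C)]:
    result, flags = add8(a, b)
    print(f"{a:02X}h + {b:02X}h = {result:02X}h", flags)
```

The three pairs are the examples from the list: 255 + 1 sets CF (and ZF, since the 8-bit result wraps to 0), 100 + 50 sets OF, and 29h + 4Ch sets AF.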

Author: Crystal

Created: 2024-05-03 Fri 20:47
