Compiler

and non-determinism plays an important role in this subject.

The automaton that responds with a YES or NO signal is called an acceptor.

Given an input string, the accepter can accept or reject the input string.

Input file

Maybe you are interested!

Control Unit

Storage

Output

Figure A.1 illustrates the configuration of a general automatic device.

The automaton is assumed to operate on a discrete time scale. At any given time, the controller is in an internal state, and is scanning symbols on the input file. The internal state of the controller at the next step is determined by the next state or transition function.

This transition function outputs the next state, the current input symbol, and the information stored in temporary storage. During the transition from one time interval to the next, either information is output or temporary storage is changed.

The word “configuration” refers to three elements: the current state of the controller, the input file, and temporary storage. The automatic transition of a device from one configuration to another is called a move.

We need to distinguish between deterministic and non-deterministic automata. A deterministic automaton or automata uniquely determines its current configuration. If we know the internal state and the contents of its temporary storage, we can predict the next state of the automata with certainty. This is not true for non-deterministic automatons. At each instant, a non-deterministic automaton has more moves. Therefore, we can predict only a subset of the possible next states. The relationship between deterministic and non-deterministic automata plays an important role in this subject.

The automaton that responds with a YES or NO signal is called an acceptor.

Given an input string, the accepter can accept or reject the input string.

A.2. Application of automaton in programming language

Although we emphasize the abstractness and precision of formal languages and automata, it is important to see that these concepts have many common applications in computer science and are actually a common thread that connects many specialized fields together. In this section, we give examples to show that what we are learning here is not just a set of abstract concepts, but something that helps us understand many important things that are used in practice.

Formal languages and grammars are widely used in programming languages. Syntax descriptions of languages are common in most programming literature, but if you want to write a compiler, or if you want to explain the correctness of a program, a specific description of the program is necessary. For example, let's create a small language similar to Pascal.

For example

In this language, it is assumed that a valid statement identifier is the set of all strings that start with a character and are followed by an arbitrary number of characters or digits. The following grammar describes its identifiers under this assumption.

<id>

	<letter><rest>,
<rest>		<letter><rest>\|<digit><rest>\| 
<letter>		a\|b\|…
<digit>		0\|1…

<id>, <letter>, <digit>, and <rest> are variables a, b, …, 0, 1, … are the terminating characters

A derivation for the identifier a0 l

=> a <rest>

=> a <digit> <rest>

=> a0 <rest>

=> a0

Defining a programming language in terms of a grammar is common and very useful. But there are other ways to define a programming language that are appropriate in many cases. For example, we describe a language in terms of an acceptor, which considers an accepted string to be part of the language. To understand this more clearly, some formal definition of an automaton is needed.

An automaton can be represented by a graph with nodes representing internal states and edges representing transitions. The labels of the edges indicate what happens (in terms of input or output) during the transition. For example, Figure 1.3 shows the transition from state 1 to state 2 that occurs when the input symbol is a. With this visual illustration, we can describe language identifiers in another way.

For example

Figure A.2

Letter

1 2

Dig it

Letter or Digi

Letter or Digit

Figure A.3

Figure A.3 shows an automaton that accepts all valid identifiers. Suppose that the automaton starts in state 1: we denote it by an arrow (not originating from any node) into this state. The string to be tested is read from left to right, one symbol at a time. If the first symbol is a letter, the automaton moves to state 2, after which the remaining symbols of the string are irrelevant. State 2

represents the “accept” state of the acceptor. Conversely, if the first symbol read in is a digit, the automaton will move to state 3 which is the “not accepted” state of the acceptor and will remain in that state (invalid identifier). Here we assume that the input symbols are only characters or digits.

Compilers and other machines that translate a program from one language to another also use the ideas presented in the examples above. Programming languages can be defined precisely through grammars.

APPENDIX B

PROGRAMMING LANGUAGES AND INTERPRETER PROGRAMS

B.1. Different levels of programming languages

In order for a computer to be able to execute an algorithm, it is necessary to write the algorithm in the form of "command" lines according to certain conventions that the computer can execute directly or write in a form that can automatically generate a form that the computer can execute directly. The set of symbols and rules for writing commands to express an algorithm is called a programming language . The rules for writing a program are called the syntax of the programming language. Each program will have a certain meaning that we call the semantics of the program.

There are many different classes of programming languages. According to the level of formalization, programming languages are divided into the following classes:

Machine language . A program in machine language is a sequence of machine instructions that the CPU can execute directly. It is the only programming language that the computer "understands". In the hierarchy of languages that communicate with computers, this is the lowest level but the efficiency of the program will be the highest because we can fully exploit the capabilities of the machine. Depending on the hardware design, each type of computer has a different machine language. Instructions written in machine language are generally in binary form or their variations in the hexadecimal number system.

The following example is a program written in machine language for a computer using an Intel 8086 microprocessor.

Table B.1. Machine language

Code on binary system

Hexadecimal code	Meaning
1001 0001 0110 0100 0001 0000	A1 64 10	Load 2 byte number from 1064 to AX
0000 0011 0110 0110 0001 0000	03 65 10	Add AX with the 2 byte number at 1066, the result is on AX.
1010 0011 0000 0000 0010 1011	A3 00 2B	Convert the result from AX to two bytes starting from 2B00

This program adds two two-byte integers at addresses 1064 and 1066. The result is in the two bytes starting at address 2B00. The first column is the hexadecimal instruction, the middle column is the binary equivalent (the actual image of the program in memory), and the right column is the explanation. AX is the name of a 16-bit register in the 8086 microprocessor.

Through the above example, we see that machine language is not really suitable for the majority of computer users because to write or understand a program, people have to memorize the code numbers of the instructions mechanically, and these numbers do not have clear meaning. On the other hand, because the instruction sets of processors may be different, it is impossible to use a program written on one processor to run on a computer using a different type of processor.

Assembly language . To overcome the above disadvantages of machine language, people proposed a language to communicate with the machine at a more formal level called assembly language. Basically, assembly language has structures very similar to machine language. The difference is that in assembly language, instructions can be written in the form of text code. Text code represents the instruction code or objects in the instruction (in machine language, it is the instruction code and the address of the object). The instruction code in lowercase is just English words with clear meaning, and the object is named by us according to the concept of that object. For example, if the above program is used to add the length and width of a rectangle to calculate the semi-perimeter, in ASM assembly language, we only need to write

Table B.2. Program written in Assembly

MOV AX LENGTH

ADD AX_WIDE

MOV AXIAL_CIRCUMFERENCE

The word MOV comes from the English word MOVE, which means to move, and the word ADD means to add. The first instruction means to load the data that we named CHIEU_DAI into the AX register, the second instruction means to add the number in the AX register with the data that we named CHIEU_RONG. We see that although it is still cumbersome and depends on a specific type of computer, assembly language is much easier to use than machine language.

In order for an assembly language program to run on a computer, it needs to be translated into machine language. Obviously, each assembly language used for a certain type of machine needs a suitable compiler. When translating, the two objects CHIEU_DAI and CHIEU_RONG mentioned above will be automatically replaced by two specific addresses, not necessarily 1064 and 1066 as in the example above. Therefore, we do not need to care about arranging specific addresses after translating and running the program. The program that translates assembly language is called an assembler.

Algorithmic language (also called algorithmic language) . We have seen that machine language and assembly language both depend on the command system of a specific type of machine. They are not really suitable for the majority of computer users. People want to express algorithms with commands that have practical meaning and are independent of any specific type of machine. For example, in the above example, it is enough to write NUA_CHU_VI = CHIEU_DAI + CHIEU_RONG. Since the early 50s, people have built universal programming languages with commands close to natural language and mathematical language. These programming languages are called high-level programming languages . Because they only aim to express algorithms independent of specific computers, they are also called algorithmic languages . As with assembly language, each high-level programming language on a specific machine requires a compiler to translate the programs into the machine language of that machine before it can be executed.

Note that each assembly language instruction is generally translated into one machine language instruction, while each high-level language instruction is often equivalent to multiple machine instructions. For example, the instruction NUA_CHU_VI = CHIEU_DAI + CHIEU_RONG will be translated into 3 machine instructions.

There are two types of translation: Interpeter is the type that translates each instruction to understand the work to be done and executes it immediately, but does not necessarily create the corresponding code in machine language. If an instruction needs to be executed many times, it must also be translated many times.

The BASIC language that was popular in the 80s often followed the interpreted mode. Compilers would translate the entire original program (called the source program) into a corresponding program in machine language (called the target program), then load the target program into the computer for execution. The reason why in Vietnamese we call these two types of translation "interpretation" and "compilation" is because the nature of translation is somewhat similar to foreign language translation. Interpretation is like the work of an interpreter, translating as it is said. Translation is the work of a compiler, based on a complete document, we write a complete translation at once.

The first high-level language was created in 1957, FORTRAN (FORmula TRANslator - Formula Translator). Nowadays, there are many high-level programming languages such as PASCAL or C. The following is a program to solve a quadratic equation written in PASCAL and FORTRAN. Even if you have no idea about these languages, you can still understand what the following program says.

Table B.3. Program written in Pascal

(*Program fragment on PASCAL*)

DELTA := B*B - 4*A*C;

IF DELTA > 0 THEN

BEGIN

X1 := (- B + SQRT(DELTA))/(2*A);

X2 := (- B - SQRT(DELTA))/(2*A);

WRITE (X1,X2);

END

WRITE('No idea');

.....