|
6 LANGUAGE
PROCESSING
6.1 Lexical Analysis
6.2 Grammar Rules
6.3 Parsing
6.4 Syntax Analysis
6.5 Lexicons
6.6 Semantic
Analysis
6.7 Summary
6.8 References
Language processing has a wide range of
applications. Many computer programs are designed to process
language inputs. The examples are numerous. Compilers and
interpreters for programming languages must process statements in
the language and convert them into the machine's internal
language. Query languages for databases must convert the user's
query into a form that the database management system can
recognize. Command languages, languages for expert system shells,
natural languages, all must be handled as distinct languages by
computer systems.
Artificial languages as well as natural
languages can all be treated according to the same principles. No
matter what the language, the same steps are applied to
understand sentences in any language. Language processors must
separate the sequence of characters into words and other symbols,
they must apply rules for recognizing valid sentences or
statements in the language and contain methods for deriving the
meaning of those sentences.
In language processing, analysis can be
described as three steps:
1. Lexical analysis,
in which the input characters are grouped into lexical
elements.
2. Syntax analysis, in which
lexical elements are grouped hierarchically into trees with
collective meaning.
3. Semantic analysis, in
which the meaning of the derived syntax trees is determined.
In this chapter the same division will be
followed.
6.1 Lexical
Analysis
Lexical analysis is the first step in language
processing. The main task of a lexical analyzer is to read the
characters from an input stream and produce as output a sequence
of lexical elements. These lexical elements, also called tokens,
can be used as input for following steps of language analysis. A
token consists of one or more characters and is the smallest
meaningful linguistic unit of the language, such as a word, a
number, or any other simple or compound symbol of the language.
For example, the lexical analysis of the input string:
"the man eats the
apple"
will result in the following 5 tokens:
"the", "man", "eats",
"the", and "apple". The blanks separating the
characters of these tokens are normally eliminated during lexical
analysis.
An input sentence may not only contain words,
but also numbers and special signs, as illustrated by the
following example from a computer language:
" position = 2 * (last -
first) "
Lexical analysis of this sentence will result
in the following tokens: "position", "=",
"2", "*", "(","last",
"-", "first", and ")".
To illustrate lexical analysis we will develop
a number of definitions which will convert a text string into a
sequence of tokens. Before that can be done we first have to
define what tokens are and how they are composed of characters.
For that purpose we will use a syntax notation. One of the well
known methods to describe the syntax of a language is the
Backus-Naur Form (BNF). Many variations of BNF have been
developed for the definition of programming languages. A simple
kind of grammar is known as a "context free" grammar.
For example, to describe the syntax of tokens in original BNF we
may use:
<token> ::=
<word> | <number> | <sign>
This syntax rule says: a token is either a
word, or a number, or a (special) sign. The vertical bar |
is a binary operator used in syntax rules to separate
alternatives. In this grammar, a syntactic name is always
enclosed by the brackets pair < >. The ::= marks are used
as a definition symbol.
Based on this syntax rule we can define very
easily in Elisa the corresponding definition:
token (Input) -> optional (token);
token (input) = ( word (input), number (input), sign (input) );
This definition is a typical example of
defining alternatives. The expression in the definition rule is a
serial-expression. Normally, all the expressions of a
serial-expression are evaluated, resulting in a stream of data
items. However, because the result specification of the
definition has the optional quantifier, at most one result
will be returned. That means that we may read this definition as:
if the input is a word, return it to the caller; if the input is
not a word, but it is a number, return it to the caller; if the
input is not a word, nor a number, but it is a sign, return it to
the caller.
After we have defined a token the next step is
to define a word, a number, and a sign. We are again using BNF
for the syntax definition. We start with the definition of a
word:
<word> ::=
<letter> | <word> <letter>
This is an example of a recursive syntax
definition. It says: a word is a letter or it is an already
existing word followed by a letter. Consequently, a word is a
sequence of letters.
Also this syntax rule can be transformed into a
definition, although it is a little bit more complicated than the
token definition because of the repetition of letters. In
addition, we not only want to recognize a word in the
input text but we also like to form a word token, which requires
some supplementary code. The definition is:
word (Input) -> optional (token);
word (input) = [ Char:= Character (input);
if not letter (Char) then return;
start = Start (input);
repeat [ Char:= NextCharacter (input);
if not letter (Char) then ready];
Token (input, start)];
This definition has also an optional result
specification. The definition starts with reading the current
character of the input text. If this character is not a letter
then, according to the syntax, it is not the beginning of a word.
In that case, control returns to the caller without a result.
If the current character is a letter then it is
the beginning of a word and the start position of this word in
the input text will be recorded.
After that, following characters are read in a
loop until a character is read which is not a letter. This marks
the end of the word.
Finally, a token is formed of the characters
belonging to the word.
This definition makes use of several functions,
mainly belonging to the basic input component which will be
defined later on. One exception is the letter function. A
letter is defined as follows:
<letter> ::=
'a'|'b'| ..... |'z'|'A'|'B'| ..... |'Z'
The corresponding definition is:
letter (character) -> boolean;
letter (C) = 'a' <= C and C <= 'z' or 'A' <= C and C <= 'Z';
The second alternative of a token is a number.
As an example, we will use a very simple syntax for numbers:
<number> ::=
<digit> | <number> <digit>
This syntax definition is similar to the syntax
of words. It says that a number is a sequence of digits.
Consequently, we can use a comparable definition for number
tokens:
number (Input) -> optional (token);
number (input) = [ Char:= Character (input);
if not digit (Char) then return;
start = Start (input);
repeat [ Char:= NextCharacter (input);
if not digit (Char) then ready ];
Token (input, start) ];
Syntactically, a digit is defined as:
<digit> ::= '0'|
'1'| '2'| ... '9'
The corresponding definition is:
digit (character) -> boolean;
digit (C) = '0' <= C and C <= '9';
The third and last alternative of a token is a
(special) sign. We assume that a sign is any character which is
not a letter or a digit. The corresponding definition is:
sign (Input) -> token;
sign (input) = [ Char:= Character (input);
start = Start (input);
Char:= NextCharacter (input);
Token (input, start) ];
The word, number, and sign
definitions are using other definitions such as NextCharacter,
and Token. They are defined in a basic input component
(see Figure 6-1) which is responsible for the processing of
character input from text strings.
component TextInput;
type Input;
Input (text) -> Input;
End (Input) -> boolean;
Character (Input) -> character;
NextCharacter (Input) -> character;
SkipSpaces (Input) -> nothing;
Start (Input) -> integer;
Token (Input, start = integer) -> text;
begin
EOT = character(255); << End Of Text character >> Input (Text) = Input: [ Text; index:= 1; last = size(Text) ]; End (input) = input.index > input.last; Character (input) = if End (input) then EOT
else input.Text [input.index]; NextCharacter (input) = [ input.index:= input.index + 1; Character (input) ]; SkipSpaces (input) = [ if Character(input) <> ' ' then return;
repeat if NextCharacter (input) <> ' ' then ready]; Start (input) = input.index; Token (input, start) = [ size = input.index - start;
token = array (character, size); << text >>
[ i = 1 .. size;
token [i] := input.Text [start + i - 1] ];
token];end component TextInput;
Figure 6-1: Basic Input
Component
This component is responsible for the
operations on the input text. It creates an input-descriptor, it
reads character by character the input text, it tests for the end
of the input string, it skips spaces, and it creates text strings
for tokens.
Based on the input component and other
definitions, we are now able to construct a general procedure for
converting a text string into a sequence of tokens:
tokenize (Input) -> optional (sequence);
tokenize (Input) = [ SkipSpaces (Input);
if End (Input) then return (nil);
token (Input) & tokenize (Input) ];
This recursive definition skips the spaces of
the input text, it returns a nil value at the end of the
input sequence and it returns a sequence of tokens in a
uni-directional list.
The complete component for lexical analysis is
shown in Figure 6-2.
component LexicalAnalysis;
type sequence = term;
tokenize (Input) -> optional (sequence);
begin
type token = text;
token & sequence => sequence;
nil = null(sequence); tokenize (Input) -> optional (sequence);
tokenize (Input) = [ SkipSpaces (Input);
if End (Input) then return (nil);
token (Input) & tokenize (Input) ]; token (Input) -> optional (token);
token (input) = ( word (input), number (input), sign (input) ); word (Input) -> optional (token);
word (input) = [ Char:= Character (input);
if not letter (Char) then return;
start = Start (input);
repeat [ Char:= NextCharacter (input);
if not letter (Char) then ready];
Token (input, start)]; letter (character) -> boolean;
letter (C) = 'a' <= C and C <= 'z' or 'A' <= C and C <= 'Z'; number (Input) -> optional (token);
number (input) = [ Char:= Character (input);
if not digit (Char) then return;
start = Start (input);
repeat [ Char:= NextCharacter (input);
if not digit (Char) then ready ];
Token (input, start) ]; digit (character) -> boolean;
digit (C) = '0' <= C and C <= '9'; sign (Input) -> token;
sign (input) = [ Char:= Character (input);
start = Start (input);
Char:= NextCharacter (input);
Token (input, start) ];end component LexicalAnalysis;
Figure 6-2: Example of a
component for lexical analysis
We may now use this in the following session:
use TextInput, LexicalAnalysis; Text:= "the man eats the apple";
tokenize (Input (Text) )?
The result is:
"the" & ("man" & ("eats" & ("the" & ("apple" & [ ]))))
If we continue the session with:
Text:= " position = 2 * (last - first) "; tokenize (Input (Text) )?
the resulting sequence is:
"position" & ("=" & ("2" & ("*" & ( "(" & ("last" &
( "-" & ("first" & ( ")" & [ ]))))))))
Exercises:
6.1 In many programming languages an
identifier (or name) is defined as a word which may also
contain digits, except for the first position. Write a syntax
for an identifier in BNF. Adapt the lexical analysis
component of Figure 6-2 such that identifiers are processed
instead of words.
6.2 Numbers in many languages may
also contain decimal points and may be followed by an
exponent part, specifying the power of ten. Write a syntax
for such a number and change the corresponding definition of
number accordingly.
6.3 In various programming languages
some combinations of signs are used for symbols with specific
meanings. Examples are: :=, <>, <=, and so on. Make
a selection of those symbols and adapt the lexical analysis
component of Figure 6-2 such that each valid combination of
signs will be returned as one token.
6.4 The lexical analysis component
of Figure 6-2 has no error detection, because all characters
are valid. If we restrict the number of acceptable signs to
the period, the comma, and the left and right parentheses as
specified by the following syntax:
<sign> ::=
'.'|','|'('|')'
then error detection should be introduced
to detect all the non-acceptable characters. Where? How?
6.2 Grammar
Rules
Natural languages as well as artificial
languages are structured according to sets of rules. That means
that sentences in a language are much more than just arbitrary
strings of words or other symbols. Every sentence in a language
must be in conformance with the grammar of the language.
A grammar is a set of rules specifying what
strings of symbols are acceptable as sentences of that language.
It specifies how the symbols must be grouped together into
phrases and what orderings of these phrases are allowed. Given a
grammar for a language, we can look at any sequence of symbols
and see whether it meets the criteria for being an acceptable
sentence. Furthermore, if the sequence is indeed acceptable, a
structure representing natural grouping of words can be
established.
A grammar of a language describes its syntax.
One of the methods to describe the syntax of a language is the
Backus-Naur Form as described earlier. For example, we may
describe the syntax of English sentences in BNF in the following
way:
<sentence> ::=
<noun_phrase> <verb_phrase>
<noun_phrase> ::=
<determiner> <noun> | <noun>
<verb_phrase> ::=
<verb> <noun_phrase> | <verb>
These rules state that an English sentence
consists of smaller parts, called phrases. Each rule
specifies a form that a certain kind of phrase can take.
The first rule says that a sentence may
consist of a phrase called noun_phrase followed by a
phrase called verb_phrase. These two phrases are commonly
known as the subject and predicate of the sentence.
In the same way is a noun_phrase defined
by the second rule: a noun_phrase may consist of a determiner
followed by a noun or it may be a noun without a determiner.
Informally, a noun phrase is a group of words that names one or
more things. Such a phrase contains a word, the "noun",
which gives the main class that the thing belongs to. Thus
"the man" names a man, "the house" names a
house, and so on.
A determiner is a word belonging to a
group of limiting noun modifiers. Examples of determiners are:
a, an, the,
each, all, some, no, few, many, ...
his, her, its, ...
this, that, these, those,
which, what, whose, ...
A noun is a word that is the name of
something. Words as man, apple, bird, action, and house are all
examples of nouns. A noun may be preceded by a determiner.
The third rule says that a verb_phrase
may consist of a phrase called verb followed by a called noun_phrase,
or simply a verb.
A verb is a word that often expresses
action or being. Words as eat, sing, write, do, and make are all
examples of verbs.
A grammar can be used in two ways. It can be
used to generate all possible legal sentences which are
formed in accordance with the grammar rules. For example, if we
use the above mentioned English grammar rules together with the
examples we gave for determiners, nouns, and verbs, numerous
sentences could be generated, such as:
a man eats an apple
the apple sings the bird
each house writes her action
These sentences are syntactically correct but
are not always meaningful.
A grammar can also be used to recognize
a given sentence. A recognizer decides whether a given sentence
belongs to the language. In other words, it recognizes if the
sentence can be generated by the given grammar. The recognition
process is essentially the inverse of the generation process.
Because the recognition process is of greater practical value we
will mainly discuss how grammars can be used for recognition.
The recognition process tries to disassemble a
given sentence into its constituents according to the rules of
the grammar. Therefore, this process is often called parsing.
If a sentence cannot be parsed according to the grammar rules,
the given sentence is not a legal sentence in the language.
6.3 Parsing
If we want to operate on a sentence or if we
want to attach a meaning to a sentence one of the first tasks is
to parse the sentence. Parsing is the transformation of a linear
structure into a hierarchical structure. Such a hierarchical
structure can best be represented as a tree structure, called a parse
tree. Suppose we have the sentence:
the boy throws a stone
After parsing, the corresponding parse tree
will be as given in Figure 6-3. It shows how the parse tree of a
sentence contains, as sub-trees, parse trees of phrases.
sentence
/ \
/ \
/ \
noun_phrase verb_phrase
/ \ / \
/ \ / \
/ \ / \
determiner noun verb noun_phrase
| | | / \
| | | / \
| | | determiner noun
| | | | |
the boy throws a stone
Figure 6-3:
Example of a parse tree
A parse tree of a sentence is a tree with the
following properties:
1. The root is labeled by the sentence
name.
2. All internal nodes are labeled by phrase
names.
3. All leaf nodes are labeled by words of
the parsed sentence.
The problem of constructing a parse tree for a
sentence, given a grammar, is what we call the parsing problem. A
computer program that constructs parse trees for sentences of a
language is called a parser.
However, not all character strings are valid
sentences in a grammar. For example, try to parse:
throws the boy a stone
There does not exist a valid parse tree for
this sentence in our limited grammar although this may be
acceptable in English as part of a valid question. Therefore,
grammars for practical purposes will be much more complicated
than ours.
6.4 Syntax
Analysis
Syntax analysis involves the parsing and
analysis of the syntax of the input text according to the grammar
rules of the language. In many cases, parse trees are constructed
that can also be used for further processing such as semantic
analysis.
In this section we will describe how syntax
analysis can be formulated and how parse trees can be generated
based on the grammar rules of a language. Let us start, as an
example, with our first English grammar rule:
<sentence> ::=
<noun_phrase> <verb_phrase>
This rule can be read as follows:
- An input string is a sentence if the
string can be divided into two parts and the first part
is a noun-phrase and the second part is a verb-phrase.
The main problem is how to make the right
division because there are many ways to cut a sentence into two
pieces. The obvious one for our example sentence can be expressed
in a diagram:
the boy throws a stone
<--------------------> <---------------------->
<noun_phrase> <verb_phrase>
How is this division obtained? The general
procedure is to try to recognize a <noun_phrase> at
the beginning of a sentence and then try to find a <verb_phrase>
in what is left. If a first attempt is not successful, then
backtracking is needed for another trial. In general, a number of
different attempts may be required before the right division is
found. At the end of this, we should have used up exactly the
words of the sentence, no more and no less. And because noun
phrases and verb phrases are also composed of smaller sub-phrases
we have a recurring task of decomposing phrases into proper
sub-phrases.
As we have done earlier, we assume that each
grammar rule is represented by a corresponding definition. Thus,
there is a definition for sentence, for noun_phrase,
for verb_phrase, and so on. We further assume that all
those definitions are similar in nature and are based on the same
principles. That means that if we have found a scheme for a sentence
definition, we can also use that for other grammar rules. Each
definition has an input parameter, which gives the beginning of a
phrase and an output parameter which returns the end of a phrase.
Furthermore a definition returns the corresponding parse tree.
So, let us start with a version of a definition
for parsing sentences:
sentence (Phrase, var (RestPhrase) ) -> optional (ParseTree);
sentence (Start, End) = [ Middle := Start;
NP = noun_phrase (Start, var (Middle) );
VP = verb_phrase (Middle, var (End) );
Sentence (NP, VP)];
The first line is a specification which says
that a sentence has an input parameter which is a phrase, and a var
parameter which will be used as an output parameter to return the
position of the rest of the phrase which does not belong to the
sentence. The result specification contains the optional
quantifier. It specifies that if the input phrase is a sentence
the corresponding parse tree will be returned, otherwise
backtracking will occur.
The body of the definition rule consists of
four elements:
It first defines a local variable Middle,
which will be used later on; its initial value will be the start
position of the input phrase.
Then the definition of the noun_phrase
will be called. The noun_phrase starts at the same position as
the sentence and stops somewhere in the middle of the sentence.
If the sentence does not start with a noun_phrase then
backtracking occurs and no ParseTree will be returned.
Otherwise, the end of the noun phrase is assigned to the local
variable Middle and the result of the noun_phrase is named
NP.
After that the definition of the verb_phrase
will be called. The verb_phrase starts where the noun_phrase
was ended at the Middle position. In case of a successful
match it will return the end of the verb phrase, and thus the end
of the sentence, via the End parameter to the caller. If
there is no verb phrase discovered, backtracking occurs to the
preceding NP definition and a new alternative for a noun
phrase will be tried, if defined. If the definition of the noun
phrase has also an optional result-specification no
alternative for the noun phrase will be available. In that case,
backtracking will occur and no ParseTree will be returned.
If the noun phrase and the verb phrase are both
matching then a corresponding node of the ParseTree will
be returned. This node is a term with the following
specification:
Sentence (ParseTree, ParseTree) => ParseTree;
The corresponding type definitions are:
type Phrase = sequence;
type RestPhrase = Phrase;
type ParseTree = term;
These type definitions assume that a phrase is
a sequence of tokens as produced by the lexical analyzer.
And this makes the first version of the sentence
definition complete. Because this version may be simplified,
another version can be defined. This second version is derived
from the first version by doing some substitutions for NP
and VP, without loss of functionality:
sentence (Phrase, var (RestPhrase) ) -> optional (ParseTree);
sentence (Start, End) = [ Middle := Start;
Sentence (noun_phrase (Start, var (Middle) ),
verb_phrase (Middle, var (End) )) ];
Let us now start with the second grammar rule:
<noun_phrase> ::=
<determiner> <noun> | <noun>
This rule can be read as follows:
- A phrase is a noun_phrase if, either the
phrase can be divided into two parts and the first part
is a determiner and the second part is a noun, or, the
whole phrase is a noun.
The grammar rule says that there are two
alternatives for a noun_phrase: the first alternative is a
determiner followed by a noun, the second alternative is a single
noun. We discussed already in the beginning of this chapter the
question how to represent syntactic alternatives in a definition.
As there, we will include the alternatives in a
sequence-expression, as shown in the following example:
noun_phrase (Phrase, var (RestPhrase)) -> optional (ParseTree);
noun_phrase (Start, End) =
[ Middle:= Start;
( NounPhrase (determiner (Start, var (Middle)),
noun (Middle, var (End))),
NounPhrase (null (ParseTree),
noun (Start, var (End)))) ];
The definition body starts with the first
alternative in a serial-expression. If the input string is a
determiner followed by a noun, the corresponding node of the ParseTree
will be returned. This node is a phrase structure with the
following term specification:
NounPhrase (ParseTree, ParseTree) => ParseTree;
If the first alternative does not succeed then
the second alternative will be tried. If the input string is a
single noun then the corresponding node of the ParseTree
will be returned.
If none of the attempts succeed backtracking
occurs and no ParseTree is returned.
Our third grammar rule has a similar structure:
<verb_phrase> ::=
<verb> <noun_phrase> | <verb>
This rule can be read as follows:
- A phrase is a verb_phrase if, either the
phrase can be divided into two parts and the first part
is a verb and the second part is a noun_phrase, or, the
whole phrase is a verb.
This grammar rule makes use of the already
existing grammar rule for noun_phrases. The corresponding
definition is:
verb_phrase (Phrase, var (RestPhrase)) -> optional (ParseTree);
verb_phrase (Start, End) =
[ Middle:= Start;
( VerbPhrase (verb (Start, var (Middle)),
noun_phrase (Middle, var (End))),
VerbPhrase (verb (Start, var (End)),
null (ParseTree))) ];
This definition is similar to the
noun_phrase definition. The corresponding node of the ParseTree
has the following specification:
VerbPhrase (ParseTree, ParseTree) => ParseTree;
In a grammar rule the order of the
alternatives, and therefore in the sequence-expression of the
corresponding definition, is of importance. For example, if we
change the order of the two alternatives in the verb_phrase in
the following way:
<verb_phrase> ::=
<verb> | <verb> <noun_phrase>
and we change also the order of the
corresponding alternatives in the verb_phrase definition
then the <verb> will be satisfied without the
following <noun_phrase>. That means that parsing of
the sentence:
the boy throws a stone
stops after the verb
"throws", leaving the rest of the sentence unparsed.
The general rule for ordering alternatives in a grammar rule is
to specify the most encompassing cases first.
Associated with incorporating alternatives in
definitions is the specification of the result of the definition.
In all examples so far, we used the optional quantifier.
This means that even when more than one alternative could be
applied, only the first matching alternative will be returned,
which is what we usually will expect. However, sometimes there
are cases where we want to see all the matching alternatives. For
example, we may be interested in all possible parsings of
ambiguous phrases. In that case the quantifier multi
should be used.
6.5 Lexicons
As we have seen, words are belonging to special
categories , such as determiner, noun, verb, and so on. In real
applications the words are often belonging to an application
domain and are stored, together with other characteristics in a
lexicon or dictionary. For our exercises, that will not be
necessary. We will show how words can be defined by definitions
as if they were stored in a dictionary.
Our first example is the definition of a
limited number of determiners:
determiner (Phrase, var (RestPhrase) ) -> optional (ParseTree);
determiner (=nil, Rest) = no (ParseTree);
determiner ("a" & Tail , Rest) = [ Rest:= Tail; Determiner ("a")];
determiner ("the" & Tail , Rest) = [ Rest:= Tail; Determiner ("the")];
determiner ("all" & Tail , Rest) = [ Rest:= Tail; Determiner ("all")];
determiner ("some" & Tail , Rest) = [ Rest:= Tail; Determiner ("some")];
Each definition rule tests if the head of the
list is a specific determiner. If so, then the rest of the list
will be given back via the output parameter and the corresponding
node of the ParseTree will be returned by using the
following specification:
Determiner (text) => ParseTree;
Our next example is the definition of a limited
number of nouns:
noun (Phrase, var (RestPhrase) ) -> optional (ParseTree);
noun (=nil, Rest) = no (ParseTree);
noun ("man" & Tail, Rest) = [ Rest:= Tail; Noun ("man") ];
noun ("apple" & Tail, Rest) = [ Rest:= Tail; Noun ("apple") ];
noun ("boy" & Tail, Rest) = [ Rest:= Tail; Noun ("boy") ];
noun ("stone" & Tail, Rest) = [ Rest:= Tail; Noun ("stone") ];
noun ("Jack" & Tail, Rest) = [ Rest:= Tail; Noun ("Jack") ];
This definition is similar to the definition of
determiner. The Noun node is specified by:
Noun (text) => ParseTree;
Our final example is the definition of a
limited number of verbs:
verb (Phrase, var (RestPhrase) ) -> optional (ParseTree);
verb (=nil, Rest) = no (ParseTree);
verb ("sings" & Tail, Rest) = [ Rest:= Tail; Verb ("sings") ];
verb ("eats" & Tail, Rest) = [ Rest:= Tail; Verb ("eats") ];
verb ("throws" & Tail, Rest) = [ Rest:= Tail; Verb ("throws") ];
This definition is similar to the two preceding
definitions. The Verb node is specified by:
Verb (text) => ParseTree;
We have now defined all the elements of a set
of very limited grammar rules. The next step is to glue inputs
and outputs of lexical analysis and syntax analysis together in
one parsing definition:
Parse (text) -> optional (ParseTree);
Parse (Text) = [ End:= null (sequence);
sentence (tokenize (Input (Text)), var (End) ) ];
With these building blocks we are now able to
construct a component for syntax analysis (see Figure 6-4):
component SyntaxAnalysis;
type ParseTree = term;
Parse(text) -> optional (ParseTree);
begin
type Phrase = sequence;
type RestPhrase = Phrase;
nil = null (ParseTree); Sentence (ParseTree, ParseTree) => ParseTree;
NounPhrase (ParseTree, ParseTree) => ParseTree;
VerbPhrase (ParseTree, ParseTree) => ParseTree;
Determiner (text) => ParseTree;
Noun(text) => ParseTree;
Verb(text) => ParseTree; Parse (text) -> optional (ParseTree);
Parse (Text) = [ End:= null (sequence);
sentence (tokenize (Input (Text)), var (End) ) ]; sentence (Phrase, var (RestPhrase) ) -> optional (ParseTree);
sentence (Start, End) =
[ Middle := Start;
Sentence (noun_phrase (Start, var (Middle) ),
verb_phrase (Middle, var (End) )) ]; noun_phrase (Phrase, var (RestPhrase)) -> optional (ParseTree);
noun_phrase (Start, End) =
[ Middle:= Start;
( NounPhrase (determiner (Start, var (Middle)),
noun (Middle, var (End))),
NounPhrase (null (ParseTree),
noun (Start, var (End)))) ]; verb_phrase (Phrase, var (RestPhrase)) -> optional (ParseTree);
verb_phrase (Start, End) =
[ Middle:= Start;
( VerbPhrase (verb (Start, var (Middle)),
noun_phrase (Middle, var (End))),
VerbPhrase (verb (Start, var (End)),
null (ParseTree))) ]; determiner (Phrase, var (RestPhrase) ) -> optional (ParseTree);
determiner (=nil, Rest) = no (ParseTree);
determiner ("a" & Tail , Rest) = [ Rest:= Tail; Determiner ("a")];
determiner ("the" & Tail , Rest) = [ Rest:= Tail; Determiner ("the")];
determiner ("all" & Tail , Rest) = [ Rest:= Tail; Determiner ("all")];
determiner ("some" & Tail , Rest) = [ Rest:= Tail; Determiner ("some")]; noun (Phrase, var (RestPhrase) ) -> optional (ParseTree);
noun (=nil, Rest) = no (ParseTree);
noun ("man" & Tail, Rest) = [ Rest:= Tail; Noun ("man") ];
noun ("apple" & Tail, Rest) = [ Rest:= Tail; Noun ("apple") ];
noun ("boy" & Tail, Rest) = [ Rest:= Tail; Noun ("boy") ];
noun ("stone" & Tail, Rest) = [ Rest:= Tail; Noun ("stone") ];
noun ("Jack" & Tail, Rest) = [ Rest:= Tail; Noun ("Jack") ]; verb (Phrase, var (RestPhrase) ) -> optional (ParseTree);
verb (=nil, Rest) = no (ParseTree);
verb ("sings" & Tail, Rest) = [ Rest:= Tail; Verb ("sings") ];
verb ("eats" & Tail, Rest) = [ Rest:= Tail; Verb ("eats") ];
verb ("throws" & Tail, Rest) = [ Rest:= Tail; Verb ("throws") ];end component SyntaxAnalysis;
Figure 6-4: Example of a
component for syntactic analysis
We can now use the components in a session:
use TextInput, LexicalAnalysis, SyntaxAnalysis;
and we request the system to:
Parse ("the boy throws a stone")?
The parsing of the input text results into a
tree structure corresponding to the following parse tree:
Sentence ( NounPhrase ( Determiner ("the"), Noun ("boy")),
VerbPhrase ( Verb ("throws"), NounPhrase ( Determiner ("a"),
Noun ("stone"))))
Compare this with the parse tree of Figure 6-3.
If we continue the session with:
Parse ("boy throws")?
the resulting parse tree is:
Sentence ( NounPhrase ( [ ], Noun ("boy") ),
VerbPhrase ( Verb ("throws"), [ ] ) )
Exercise:
- 6.5 Assume that we have defined the
syntax of arithmetic expressions by means of the
following rules:
<expr> ::= <term>
| <expr> + <term> | <expr> -
<term>
<term> ::=
<factor> | <term> * <factor> |
<term> / <factor>
<factor> ::=
<identifier> | <number> | (<expr>)
Develop a component for the syntax analysis
of these arithmetic expressions.
6.6 Semantic
Analysis
Semantic analysis is concerned with the meaning
of sentences in a language. Once sentences are parsed by means of
syntax analysis the next step is to understand the sentences.
A sentence may be grammatically correct but
still lack meaning. For example, the sentence "The flower
drove 250 beats per hour" is grammatically correct. However,
a flower cannot drive and its speed is not measured in beats per
hour. The semantic rules for a language differentiate between
meaningful sentences and gibberish. During semantic analysis,
these semantic rules are applied.
Defining the semantic rules of natural
languages is an extremely difficult problem that is the subject
of on-going research. Solutions to the problem of formalizing the
syntax and semantics of a language have only been successfully
applied to simple subsets of natural languages. Those subsets not
only limit the grammar to simple sentences, but they also make
assumptions about the kinds of sentences the user will use.
Typical examples are found in the area of data base applications
where dialogues are limited to the domain covered by the data
base.
In this section we will show how semantic
analysis may help to understand sentences in a specific domain.
The application domain we have chosen is the problem of finding
optimal routes between cities as described in the preceding
chapter. We will use already existing components and we will
develop a natural language interface based on a limited number of
sentences. The system will accept two kinds of sentences:
questions and assertions.
Users can ask for routes between two cities.
Examples of specific questions include:
What is the distance between Amsterdam and
Nice?
Distance Berlin and Vienna?
What is the best route from Rome to Paris?
What are the routes from Paris to Vienna?
Users can also enter new information about
cities and their distances. Examples of such assertions are:
The distance between Amsterdam and Brussels
is 220 kilometers.
Distance Brussels and Paris is 297 km.
The sentences which will be accepted by our
system are limited to a number of specific patterns. Examples of
such patterns are:
What is the distance between X and Y?
Distance X and Y?
What is the best route from X to Y?
What are the routes from X to Y?
The distance between X and Y is Z
kilometers.
Distance X and Y is Z km.
A pattern serves as a model for a variety of
sentences. A pattern may contain a number of variables which
specify where input sentences may deviate from the pattern. In
our examples, X, Y, and Z are pattern variables. During pattern
matching where an input sentence is matched against one of the
pattern sentences, the pattern variables are representing the
variable parts of the input sentence. Note that the patterns we
use here are not the same as the patterns used in parameter
expressions for pattern matching of terms, although both kinds of
pattern serve the same purpose. In this context we use the word
"pattern" to describe a number of possible sentences.
To illustrate the basic principles of pattern
matching for sentences we will use in our system a very simple
procedure. For a successful match, an input sentence has to match
word for word and token for token with the pattern sentence. The
only exceptions are pattern variables which match any token. In
all other cases, such as deviations of letters in upper and lower
case, or differences in punctuation symbols, or in the total
number of tokens will lead to an unsuccessful match. If there is
no pattern sentence found which matches the input sentence, then
the input sentence cannot be recognized by the system.
Associated with each pattern is a semantic
action of the form:
Pattern 1 => Action 1
Pattern 2 => Action 2
..................................
Pattern n => Action n
If an input sentence matches a pattern then the
corresponding semantic action will be executed. The semantic
action depends on the recognized pattern.
In our program, a pattern is represented by a
text string and the associated semantic action is coded as a text
string of two characters. All patterns are collected in a pattern
list:
PatternList =
{ pattern ("Q1", "What is the distance between X and Y?"),
pattern ("Q1", "Distance X and Y?"),
pattern ("Q2", "What is the best route from X to Y?"),
pattern ("Q3", "What are the routes from X to Y?"),
pattern ("A1", "The distance between X and Y is Z kilometers."),
pattern ("A1", "Distance X and Y is Z km.")};
Mark that for some patterns the same semantic
actions are specified. In the code for semantic actions we used
the letter Q for questions and the letter A for assertions,
although this distinction is not essential.
Patterns are converted to sequences of tokens
as specified by the following definition:
pattern (text, text) -> optional (sequence);
pattern (Label, Pattern) = Label & tokenize (Input (Pattern));
The tokenize procedure is defined by the
Lexical Analysis component. The label for the semantic action is
placed in front of the pattern sequence.
Input sentences are also converted to sequences
of tokens. After an input sequence has been converted, the
resulting sequence of tokens will be used to search for a
matching pattern. The matching procedure is defined by:
Match (Input = sequence, Pattern = sequence) -> optional (sequence);
Match (= nil, = nil) = nil;
Match (= nil, X) = no (sequence);
Match (X, = nil) = no(sequence);
Match (H1 & T1, H1 & T2) = Match (T1, T2);
Match (H1 & T1, H2 & T2) = if Variable (H2) then H1 & Match (T1, T2)
else no (sequence);
Match (X , Y ) = no (sequence);
This procedure matches the input and pattern
sequences, token for token. For a successful match all tokens
must be equal, with the exception of input tokens which are
matching pattern variables, as defined by the one but last
definition rule. In that case the input token will be part of an
output sequence destined for further processing. In all other
cases, the pattern sequence is not matching with the input
sequence, no output sequence will be returned, and another
pattern will be tried for matching.
In our program, a pattern variable is
represented by a single, capital letter, as defined by:
Variable (term) -> boolean;
Variable (text: T) = [ C = T[1]; size (T) == 1 and 'A' <= C and C <= 'Z'];
If there is a match between the input sequence
and a pattern sequence, the matching pattern has been found, and
no further matching is necessary; the output sequence can be used
in the corresponding semantic action. Take for example the
following input sentence:
What is the best route from Amsterdam to
Vienna?
This matches with the third pattern:
What is the best route from X to Y?
The resulting output sequence will be:
Amsterdam & (Vienna & nil)
The corresponding semantic action is coded as "Q2".
The meaning of a sentence is defined by the
semantic action code and the output sequence. In our program, all
semantic actions are defined by one definition:
Action(Code = term, sequence) -> optional (term);
Action(Q, =nil) = no (term);
Action("Q1", X & (Y & =nil)) = Distance (SearchBestFirst(Map, Node(X), Node(Y)));
Action("Q2", X & (Y & =nil)) = Route (SearchBestFirst(Map, Node(X), Node(Y)));
Action("Q3", X & (Y & =nil)) = { Route (SearchBestFirst(Map,Node(X), Node(Y))) };
Action("A1", X & (Y & (Z & =nil))) = [ add (Map, arc(Node(X), Node(Y), Cost(Z))); term: "accepted"];
This definition defines the meaning of three
questions and one assertion. The first question asks for the
distance of two cities. To answer that question the SearchBestFirst
function is called for the two cities X and Y, as
defined in the preceding chapter. The Node converts a
token into a city node. The Distance function computes the
distance from the output of SearchBestFirst. The answer
will be returned to the user.
The second and third kinds of questions are
similar in nature. The Route function computes the route
between two cities.
The last definition rule defines the semantic
action for an assertion. It adds new information about the
distance between two cities to the Map and returns the
word "accepted" to the user.
The Route function needs some
explanation. It makes some cosmetic changes to the output as
created by SearchBestFirst. For example, if we ask for the
route between Amsterdam and Rome SearchBestFirst returns
as one of the answers:
cost(Rome, 2152) & (cost(Nice, 1429) & (cost(Paris, 517) & (cost(Amsterdam, 0) & [ ])))
However, we like to return the inverse list of
cities, starting with Amsterdam, and without the cost
information. This requires some additional programming. The Route
definition is:
Route (sequence) -> term;
Route ( S) = term:{ prioritems ( {State (Items (S) )} )};
The Items function returns a stream of nodes
from sequence S and State returns the city information from a
node. The prioritems function is used to reverse the order
of the nodes.
Based on these definitions we can now construct
a component for semantic analysis of our application domain (see
Figure 6-5).
component SemanticAnalysis;
UserInput (text) -> term;
begin
nil = null (sequence);
term & sequence => sequence;
text & sequence => sequence; Map = Graph: alist (Arc); PatternList =
{ pattern ("Q1", "What is the distance between X and Y?"),
pattern ("Q1", "Distance X and Y?"),
pattern ("Q2", "What is the best route from X to Y?"),
pattern ("Q3", "What are the routes from X to Y?"),
pattern ("A1", "The distance between X and Y is Z kilometers."),
pattern ("A1", "Distance X and Y is Z km.")}; pattern (text, text) -> optional (sequence);
pattern (Label, Pattern) = Label & tokenize (Input (Pattern)); UserInput (text) -> term;
UserInput (input) = [ Result:= nil;
[ Result:= Process (input) ];
if Result <> nil then Result
else term: "Sentence not understood!"]; Process (text) -> optional (term);
Process (input) = [ Input = tokenize (Input (input));
Pattern = items (PatternList);
Action (Head (Pattern), Match (Input, Tail (Pattern))) ]; Match (Input = sequence, Pattern = sequence) -> optional (sequence);
Match (= nil, = nil) = nil;
Match (= nil, X) = no (sequence);
Match (X, = nil) = no(sequence);
Match (H1 & T1, H1 & T2) = Match (T1, T2);
Match (H1 & T1, H2 & T2) = if Variable (H2) then H1 & Match (T1, T2)
else no (sequence);
Match (X , Y ) = no (sequence); Variable (term) -> boolean;
Variable (text: T) = [ C = T[1]; size (T) == 1 and 'A' <= C and C <= 'Z']; Action(Code = term, sequence) -> optional (term);
Action(Q, =nil) = no (term);
Action("Q1", X & (Y & =nil)) = Distance (SearchBestFirst(Map, Node(X), Node(Y)));
Action("Q2", X & (Y & =nil)) = Route (SearchBestFirst(Map, Node(X), Node(Y)));
Action("Q3", X & (Y & =nil)) = { Route (SearchBestFirst(Map,Node(X), Node(Y))) };
Action("A1", X & (Y & (Z & =nil))) = [ add (Map, arc(Node(X), Node(Y), Cost(Z))); term: "accepted"]; Distance (sequence) -> term;
Distance ( cost ( State, integer: Cost) & Z) =
term: "The distance is " & text (Cost, -1) & " kilometers."; Node (term) -> Node;
Node (text: T) = Node: symbol (T); Route (sequence) -> term;
Route (S) = term: { prioritems( {State (Items (S))} )}; Cost (term) -> Cost;
Cost (text: T) = Cost: integer (T); Items (sequence) -> multi (term);
Items ( =nil) = no (term);
Items ( Head & Tail) = (Head, Items (Tail)); State (term) -> State;
State( cost (State, Cost)) = State; Head (sequence) -> optional (term);
Head (Head & Tail) = Head; Tail (sequence) -> optional (sequence);
Tail (Head & Tail) = Tail; end component SemanticAnalysis;
Figure 6-5: Example of a
component for semantic analysis
We may now use this component for any
collection of cities we are interested in. Let us start with a
session:
use TextInput;
use LexicalAnalysis;
use SearchBestFirst;
use SemanticAnalysis;
Because the SemanticAnalysis component
starts with an empty map, we first have to fill the map with
information. Let us use the information about the distances
between cities as shown in Figure 20-3. We enter a number of
assertions:
UserInput ("Distance Amsterdam and Berlin is 669 km.")?
UserInput ("Distance Berlin and Vienna is 648 km.")?
UserInput ("Distance Vienna and Rome is 1150 km.")?
UserInput ("Distance Amsterdam and Paris is 517 km.")?
UserInput ("Distance Paris and Vienna is 1271 km.")?
UserInput ("Distance Paris and Nice is 912 km.")?
UserInput ("Distance Nice and Vienna is 1130 km.")?
UserInput ("Distance Nice and Rome is 723 km.")?
We may now enter a dialogue with the system and
ask questions, such as:
UserInput ("What is the distance between Amsterdam and Rome?")?
"The distance is 2152 kilometers."UserInput ("Distance Amsterdam and Nice?")?
"The distance is 1429 kilometers."UserInput ("What is the best route from Amsterdam to Nice?")?
{ Amsterdam,
Paris,
Nice}
These examples illustrate how pattern matching
can be used to recognize specific input sentences and how the
corresponding semantic actions may create the necessary response.
It is easy to envision for a given application domain all kinds
of pattern sentences, in full and in abbreviated forms, in native
and in foreign languages, which are useful for a number of users.
If we compare pattern matching with syntax
analysis as described in previous sections, we may say that a
pattern also represents a syntax rule. As with a syntax rule it
describes the fixed parts and variable parts of a sentence.
Compare this with a syntax rule for an if statement as used in
programming languages:
if B then S1 else
S2
Such a syntax rule is also a pattern with fixed
and variable parts. In this pattern, B, S1, and S2 may be
considered as pattern variables, although we prefer to call them
syntactic units because they may represent compound structures.
On the other hand, we may also may make patterns as
The distance between X and Y is
Z kilometers.
more precise by using a different notation:
The distance between <City1>
and <City2> is <Number> kilometers.
The syntax of <City1>, <City2>,
and <Number> can now be defined in normal syntax
rules. In this way pattern variables can represent compound
syntactic structures. This means that pattern matching can be a
quite sophisticated process because it may involve syntax
analysis as described earlier.
Exercises:
6.6 One of the drawbacks of the
current pattern matching procedure is its inflexibility in
spelling. Design a pattern matching procedure which is
independent of upper and lower case letters.
6.7 Another disadvantage of the
current pattern matching procedure is that words in the input
sentence may match with pattern variables, which are meant to
be numbers. Design a pattern matching procedure which will
accept only numbers where numbers are required.
6.7 Summary
- Language processing has a wide range of
applications. Artificial languages as well as natural
languages can all be treated according to the same
principles. Analysis of languages can be described as
three steps: lexical analysis, syntax analysis, and
semantic analysis.
- A grammar can be used in two ways. It can
be used to generate all possible legal sentences
and it can also be used to recognize a given
sentence.
- Grammar notations can be used to translate
syntax definitions easily into corresponding definitions.
- The main task of a lexical analyzer is to
read the characters from an input stream and produce as
output a sequence of tokens that can be used as input for
following steps of language analysis. A token consists of
one or more characters and is the smallest meaningful
linguistic unit of the language, such as a word, a
number, or any other simple or compound symbol of the
language.
- Components for standard text input,
lexical analysis, and syntax analysis have been defined.
- Syntax analysis involves the parsing and
analysis of the syntax of the input text according to the
grammar rules of the language.
- Semantic analysis is concerned with the
meaning of sentences in a language. Once sentences are
parsed by means of syntax analysis the next step is to
understand the sentences.
- Pattern matching is a useful technique for
handling a limited number of sentences in a specific
application domain.
- A pattern serves as a model for a variety
of sentences. A pattern may contain a number of variables
which specify where input sentences may deviate from the
pattern.
6.8 References
Different strategies for language processing
are discussed by several authors. Winston(1984) gives a general
introduction to natural language understanding. Winston and
Horn(1984) describe language processing based on Augmented
Transition Trees (ATTs). Charniak and McDermott(1985) give an
introduction to natural language parsing and are discussing
natural language processing based on Augmented Transition Nets
(ATNs).
Applications of Definite Clause Grammars (DFGs)
as used in Prolog, are discussed by Clocksin and Mellish(1984),
Marcus(1986), Walker(1987), and Bratko(1990).
 |
|
Part 2: Data Structures
|
|
Chapter 6: Language
Processing |
|