|
Academic Open Internet Journal |
Volume 13, 2004 |
ANALYSIS OF
SENTENCES FOR SYNTACTIC VALIDITY
1. Research Scholar, Department of Computer Science
& Engineering
2. Prof. and Head, Department of Computer Science
& Engineering
3. Student, Department of Electrical &
Electronics Engineering
Email: rnalliah@yahoo.co.in1,
Karthikv2k@yahoo.co.uk3
Abstract: The aim of this work is the design and implementation
of a system which analysis the natural
language input sentence to see if it is grammatically correct. To do so it uses
different modules to identify the words of the structure in which these words
appear is correct. The dictionary is maintained in a static database contained
in a separate file. Another module is used for conversion of the input string
(sentence) in to a list for further processing. The main program parses the
list of words to check for grammatical validity, using the other two modules.
Once the syntax analysis is performed, and the sentence is found to be grammatically correct, semantic
analysis may be performed to understand the meaning of the sentence. However
semantic analysis will not be undertaken if the sentence was syntactically
wrong.
Key Words: Predicate Logic, Natural Language Processing, Syntax
Analysis, Semantic Analysis
Introduction to Natural Language Understanding
1 One of the most challenging tasks that a computer has
been faced with is that of natural language understanding. Although a computer
has now got the capability to compute the value of ‘pi’ to the 100th
decimal point, we still have not produced computer program whose overall
linguistic performance rivals that of people. Understanding a natural language
is hard. It requires both linguistic knowledge of the particular language being
used and world knowledge relating to the topic being discussed. We have
developed here a system that tries to understand a
single sentence from the English language. Once we can understand single
sentences we can build on the software to understand a text of more than one sentence.
However with many sentences there is a problem – a sentence may not be clearly
understood until it is considered in a wider context, i.e. taking one or more
of the following sentences into account. We should however keep in mind fact
that once we have a system which can understand a single sentence; it will be
possible to develop one that can understand two or more sentences at a time. In
our project we will concentrate on the understanding of single sentences which
can be classified into3 parts:
·
Lexicon or
dictionary where each word of the sentence is understood.
·
Syntax analysis
– where the combination (in which the words of the sentence occur) is checked
for validity with respect to rules of grammar in English.
·
Semantic
analysis – where a check is made to see if the sentence makes any sense. So
that we actually have is two phase.
·
Phase I : Understanding of words.
·
Phase II: Understanding the complete sentence with the
help of phase
1.2 Understanding
of Words: Lexicon
A
separate database called the LEXICON is maintained wherein information about
each word can be found. The information includes
·
What figure of
speech it belongs to noun, verb preposition etc.
·
What tense does
the verb belongs to i.e. what is the root verb?
·
Extra
information about a word can be contained using semantic markers.
1.3.1. Understanding of Sentences (Syntax, Semantics)
The second and more difficult part
of NLU is combining words to form a structure representing the meaning of a
sentence. It relies on a variety of sources of information.
i Knowledge
of the language being used
ii Knowledge of the domains being used
iii Knowledge
of the conventions for language use that the speakers for language share.
This
phase is the real interpretation phase which can be further subdivided into 3
components.
1.3.2. Syntactic Analysis
Linear sequences of words are
transformed into structures that show how the words relate to each other. Some
word sequences may be rejected if they violate the languages rules for how
words may be combined. For example, our English syntactic analyzer would reject
the sentence
“Dog the ran in park the”, but it would
accept
“The book ran to the park”.
1.3.3. Semantic Analysis
The structures created by the
syntactic analyser are assigned meanings. In other
words, a mapping is made between the syntactic structures and objects in the
task domain. Structures for which no such mapping is possible may be rejected.
For e.g. In most universes, the sentence
“Colourless green ideas sleep furiously”
Would be rejected as semantically anomalous.
1.3.3. Pragmatic Analysis
The structure representing what
was said reinterpreted to determine what was actually meant. For example, the
sentence “Do you know the time”, should be interpreted as a request to be told
a time. A computer would be tempted to answer simply in either the affirmative
or the negative. The boundaries between the syntactic phase and the semantic
phase are fuzzy. The phases are sometimes performed separately or all at once.
If they are performed in sequence one phase may need to appeal for help from
another phase.
For ex: Rama went to the
shop with red roof. This sentence can be syntactically understood in two
different ways.
i)
Rama went to a shop which
happened to have a red roof.
ii) Rama went, accompanied by a red roof, to a shop.
It is the semantic phase that will
decide that the former is correct. It is desirable that the two phases interact
with each other, so that the correct meaning of a sentence is understood. In
our project, we have a syntactic analysis which depends on a lexicon that is
maintained in a static database. Using the lexicon, the system first finds out
if the word exists and if so whether it is a verb, noun, prep.etc.It
next finds out if the words’ placing the sentence is grammatically correct. The
method in which this will be performed will be discussed in chapter 4.We have
also included a morphological component which
i) Obtained the root verb from various
conjugation of the verb in different tenses
e.g. for the root verb “play” we can have many conjugations.
Plays,
played, was playing, is playing etc.
ii) The root word can be obtained from its
plural form
Word Root
Dogs’ dog
Bats bat and so on.
Further
some exceptions have also been included.
For
e.g. the word “go” will appear in the past tense as “went”. And the plural of
“goose” is “geese” not “gooses”. A few of these irregularities have been
incorporated into our database. An exhaustive list of all irregularities may
also be included. However, we have not done so due to limitations of time and
resources.

2. Suitability of Prolog for Natural Language
Understanding
Natural language processing lies within the scope of
artificial intelligence. Hence it is quite obvious that the language we choose
for our project should be language that can be used for various applications of
artificial intelligence. For those who might ask why conventional languages
such as BASIC, COBOL & PASCAL aren’t considered we take the liberty to
answer with a quotation from meek (3) “some people regard achieving a solution
to a problem while using an unsuitable language as a kind of challenge somehow,
they must show that there technical mastery is so great that they can overcome
the limitations of the language in tackling something seemingly intractable.
This is an immature, unprofessional and ultimately self indulgent attitude”.
Hence we need to select an appropriate AI language for one NLP problem. Let us
look at some of the special features that an AI language is expected to have
i. Particularly
good facilities for manipulating lists, since lists are such widely used structure in almost all
AI Programme.
i.
A variety of
data types to describe the many kinds of information a large system needs.
ii.
The ability to decompose
the system into small understandable units so that it is relatively easy to
make changes to one part of the system without disturbing the entire thing.
iii.
Flexible control
structure that facilitate both recursion and parallel decomposition of the system.
iv.
The ability to
communicate with the system interactively both for program development and for
maximally effective use of the finished system.
v.
The ability to
produce efficient code so that system performance is acceptable.
vi.
Late binding
times for such things as the size of data structure, or the type of an object
to be operated on.
vii.
Pattern matching
facilities for data and to determine control. Pattern matching for data is an
important part of using a large knowledge base. Pattern matching for control
forms is the basis for execution of production systems.
viii.
Facilities for
building complex knowledge structures, such as frames, so that related pieces
of information can be grouped together and accessed as a unit.
ix.
Facilities for
performance of some kind of automatic deduction and for storing a database of
assertions that provides the basis for deduction.
x.
Mechanism by
which the programmer can provide additional knowledge that can be used to focus
be attention of the system where it is likely to be most profitable.
xi.
Control
structures which facilitate goal directed behavior (top down behavior) in
addition to the more data directed behavior (or bottom up processing).
xii.
Ability to
intermix procedures and declarative data structures in whatever way best suits
a particular task.
With
respect to the above mentioned attributes the two strong contenders for our
project was reduced to LISP & PROLOG. The results of our study can be
summarized as follows.
.Lists
LISP:
yes
PROLOG:
Lists and records
.Decomposability
into Easy-to-change Pieces
LISP:
function definition
PROLOG:
relations
.
Flexible control; structures
LISP:
recursion
PROLOG:
recursion, backtracking via goals
.Interactive
ness
LISP:
Yes
PROLOG:
like LISP
.Efficiency
LISP:
compiler available
PROLOG:
compile clause heads, indexing of caluses, not full
unification
.Late
Binding Time
LISP:
Yes
PROLOG:
Yes
.Pattern
matching for data and control
LISP:
PROLOG:
Unification
.Automatic
Deduction
LISP:
PROLOG:
Procedures are theorems
.Knowledge
structuring
LISP:
properly lists
PROLOG:
records
.Attention
Focusing
LISP:
PROLOG:
goal-directed behavior, indexing
.Goal-directed
Behavior
LISP:
PROLOG:
backward chaining
.Ability
to Intermix Procedures and Data
LISP:
Uniform representation as lists
PROLOG:
View theorems either as static structures or as procedures.
As we can see, LISP is more
AI language .But it was not chosen since it had no good pattern matching or
automatic deduction facilities. No does LISP have attention focusing on goal
directed behavior. On the other hand PROLOG has the above mentioned it was specially
chosen due to its excellent facilities for unification and backtracking.
3. ENGLISH GRAMMAR (From the point of Natural
Language understanding)
3.1. Introduction
English which began life as a modest brought to
Girl French
.
3.2 Ambiguities in English
The language we use id very disorderly. Its words are
ambiguous, its syntax confused. Many sentences are imperfect expressions of
thought because the language is only partly rational.Mechanised
parsing is difficult because many words are verbs at one time and nouns at
another, and the language is full of irregularities and exceptions.
Difficulties in computer parsing of English will be understood with words like
nothing. We would have instructed the computer that if “A is better than B” and
“B is better than C” then “A is better than C”.Apply
this rule to a couple of sentences involving “Nothing”.
“Nothing”
is better than a good book”.
“A
magazine is better than nothing”.
Our smart friend, the computer,
would deduce that a magazine is better than a good book.
Parsing a sentence is also
freight with the dangers of context.
Take
a sentence.
“Anita wants to be a doctor”.
This
could be the about a seven year old kid who changes her mind, on a weekly
basis, as all children of her age. In this instance a computer would be making
as blunder if it eagerly tried to help by spouting the admission procedures for
the best medico schools in the state. Of course, the computer cannot be blamed
for errors which arise purely out of the ambiguity of English, which can
proudly claim to be the most ambiguous of all natural languages in the
world.
To be accepted as having a native-like command
of the English language requires familiarity with a wide variety of idiomatic
expressions. A mature speaker of this language will be able to pick his way
with unconscious case. But a foreign student (or a computer), whose
understanding of idiom is not quite complete, will lock for explicit guidance.
A sophisticated Natural language Processor should be able to provide this sort
of guidance if the computer is to stimulate a native user of the language.
3.2 English
Grammar
It would be a formidable task to encapsulate
all rules of grammar that the English obeys into a system. On top of that,
English is such a versatile language that a sentence that was rejected as
grammatically incorrect just a decade ago would be accepted without a murmur.
Nowadays people don’t
care about uptight rules that infinitives should not be split and that a
sentence should not end with a preposition. In fact Bernard Shaw once wrote in
a letter to the Times (in 1997).
“There is a busy body on your staff that
devotes a lot of time to chasing split infinitives. Every good literary
craftsman splits his infinitives when the sense demands it. I call for the
immediate dismissal of this pendant. It is of no consequence whether he decides
to go quickly to go, or to quickly go. The important thing is that he should go
at once”.
From Wren and Martin, our good
old high school book on English Grammar, we gathered that the best approach to
a new language was to first understand the different parts of speech-the nouns,
verbs, adjectives etc.
This
itself, if undertaken exhaustively, would have taken us six months. we have
common nouns, proper nouns and abstract nouns, not to mention the various pronouns.Adjectives, we found, had types such as
demonstrative adjective (this, that)and Exclamatory
Adjectives, as well as a separate chapter on comparison of adjectives.Pronouns,which
we had thought of dismissing as not too important, had a mind-boggling
variety-personal Emphatic,Indefinite,Demonstrative
and Distributive pronouns to name only a few.
An attempt to study verbs led
to transitive and intransitive and chapters on moods (!) and tense of verbs.
After through study of the bewildering maze that is English grammar, we
selected the most important aspects. Due to limitations of time, we were unable
to implement all aspects of English grammar: to do so we require a couple of
years atleast.what we did implement was the most
common tenses and the most common types of sentences. Gerunds and infinitives
were ignored, as were pronouns, and comparison of adjectives.
3.4.
In this section,
we describe the portion of grammar that we implemented in our system. In our
system can understand four types of sentences-narative,
Imperative, Exclamatory and Interrogative. Of these the type with the widest
range of sentences allowed is the Narrative type.
The Narrative Type:
This
sentence type conveys some information or the other.
It is always of the form (NounPhrase, VerbPhrase).Some
examples of the narrative types are
i.
The boy played
in the park with the sweet girl(simple past)
ii.
The king was
laughing happily(continuous past)
iii.
The cat is
kicking the ball towards the table(continuous past)
iv.
Rama will go to
v.
The Imperative Type:
This
includes sentences where in order laid down such as “shut down the door”.Here, we can see that it is of the form verb phrase
followed by Noun phrase
The Exclamatory Type:
This is used to express intense
emotions such as surprise, anger etc.The only types
of exclamatory sentence that our system accepts are those of the form “what a
pleasant surprise!”
It is mandatory that this
sort of sentence ends in an exclamation mark (!)
The Interrogative Type:
This, as we know is used for
asking questions. The sorts of sentences that will be accepted are
“Who played in the park?”
“Who is playing now?”
and even our familiar “who framed Roger rabbit?”
The sentences much necessarily
end in a question mark (?) for the last two types; of course, the sentences
must end with a full stop.
4. Grammar Rules in Prolog
Introduction:
In the previous chapter we
discussed English Grammar. Sentences in English (or any other natural language,
for that matter) are many more than just arbitrary sequences of words. We
cannot string together any set of words and make a reasonable sentence. At the
very least we must conform to what we consider grammatical. A grammar for a
language is a set of rules for specifying what sequences of words are
acceptable as sentences of that language. it specifies
how the words are grouped into phase and what orderings of the phrases are
allowed. Given a language for a grammar, we can look at any sequence of words
and see whether it meets the criteria for being an acceptable sentence. If the
sequence is indeed acceptable, the process of verifying this will have
established something of the underlying structure of the sentence. This chapter
will discuss how rules of English grammar are implemented in PROLOG. Using our
software, the noun phrase and verb phrase can be identified.Furhter
all parts of speech that occur with in a given sentence (i.e.) nouns, verbs,
adjectives, adverbs, prepositions & conjunctions will be listed out. All
this, if and only if the sentence to be analysed,obeyed
all rules of English grammar(which are specified in the previous chapter).we
should remember here, that whereas grammar rules will be checked correctly, no
attempt will be made to find out if the sentence made any sentence. That is,
semantics or meaning of the sentence is completely ignored. Thus the sentence
“the boy went to the park with the friendly dog” will be accepted, as well the
sentence “the park went out to the dog with the friendly boy”, where the latter
is
a perfectly correct
sentence with respect to the grammar, though it’s meaning of course seems
nonsensical.
5. Grammar Rules:
Let
us first study the notion of grammar and languages with their formal definition.
Definition:
1: A grammar G (Z) is a finite non
empty set of rules is a symbol which must appear as the left part of at least
one rule. It is called the distinguished symbol. Where it is obvious from the
context (or if we don’t care) what the distinguished symbol Z is, we write G
instead of G (Z).
Definition: 2: Let G (Z) be a grammar. A string x is called sentential form if x‘s derivable from the sentential from consisting only of
terminal symbols. The
language I G (Z) is the set of sentences.
L
((G) =X/Z x and x in VT +
Where VT is the set of terminal symbols.
Thus the language is first a
subset of the all terminal symbols. The structure of a sentence is given by the
grammar. Several different grammars can however generate the same language.
E.g. .: we write a grammar
for a language containing variations on the sentence “the big elephant ate the
peanut”. The grammar G (sentence) is
<Sentence> ::=<
subject><predicate>
<Subject> :: <article><adjective><hom>|<Pronoun>
<Predicate> :: <verb><direct object>
<Direct object>
::=< article><noun>
<Pronoun>:=he
<Article>: the
<Adjective>:: big
<verbs: =ate
<noun>:=elephant/peanut
The set of terminals is: he, the, ate,
elephant, peanut.
The language l (g) is the set of sentence of
these terminals which are dervible from the distinguished symbol:
Sentence
Some of the sentences are
He ate the peanut
He ate the elephant
The big peanut ate the elephant
The elephant ate the big elephant
4.3: Using the basic idea of
language &grammar obtained from the above formal definition, the Chomsky
hierarchy may be discussed.
Chomsky (56) defined 4 basic
classes of languages in terms of grammar which are tables (v, t, p, and z)
where
The language of a grammar is a set of
terminal strings which can be generated from z. The different in the four types
of grammar is in the form of the rewriting rules allowed in p.
We have chosen here context free grammar
to represent one idea of the English language in the following manner.
<Sentence> ::=< noun
phrase><verb phrase>
<nounphrase> ::=< article><noun>/<noun>
<verbphrase> ::=< verb>/<verb >
<adverb>
<verbphrase>
::=< verb><nounphrase>/<verb><nounphrase><adverb>
<verbphrase> ::=<’verb>
<prep.pharse>/<verb><prep.phrase><adverb>
<prep.Phrase>::= <prep><nounphrase>/prep><nounphrase><prephrase>
<verbphrase>::= <verbphrase><conj><verbphrase>
<article
>::=the /an/a
<adj>::= small/fat/brown/happy………
<noun>::=boy/king/queen/table……….
<verb>::=
agree/like/do/play/go/make…….
<Prep>::=
to/in/which/ at……..
<conj>::=
and/or/………
<adverb>::=
happily/slowly/quickly……/..
Using thus grammar top down
parsing will be done to find out if the sentence that was input following the
rules given above.
Syntax Analysis- Implementation
5.1 Introduction:
This chapter will explain the actual
implementation of our project in the software we have developed.
The software is split up in to
various modules.
5.2 String/List Converter:
The sentence for analysis is input in the
form of a string. This is then converted in to a list where each words of the
sentence are taken to be one element of the list.
Other than this, capital letters are
converted in to lower case letters, the words “and” and or converted to “and”
and “or” respectively. ?,! and are converted to qm, em and fs.
E.g. Input string:
“The boy played”
Output list:
[The, boy, played, fs]
The way this is
done is using the built in predicate”fronttoken”.
Using this predicate each words of the sentence can be analysed
character by character.
5.3. Lexicon:
The lexicon, as
explained earlier has a finite set of words. These words are grouped according
to their types. That is, all adjectives will adjectives will be contained in a
list called member and so on. In the case of verbs, the root verb only will be
stored. Form the root verb rules are defined in the MORPHO module, as to how
the root can be constructed from its various conjugations. Similarly MORPHO is
used to find the singular form of plural words. In the lexicon we have defined
a predicate “member” to check if a given element is contained in a given list
of elements. Using the above function we define that a word is a noun if it is
a member of the list apple, boy, cat, doll etc… (So also we can define verbs,
adjectives, adverbs etc.)
5.4 Morpho:
The main works of the morpho is to find out the roots of words. Here first symbol
to string conversion is done.
We defined four predicates,
root1, root2.root3, and root.
Root1: position of “s” as in
“plays” is found out and it is removed. For
verbs like “goes”,
“go” is the root word. Here “es” has to be removed. This
is in the case of present tense.
Root
2: Past tense: position of “d” and “ed”are identified,
and removed accordingly to find out the root verb.
Root
3: Continuous tense: position of “ing” was identified
and removed to get the root word.
Root 4: Noun: Singular of nouns was found from the
plural form by removing an “s” from the correct position of the word. Special
cases were defined separately.
5.5 Main
Program:
The main predicate used here was
parse(s), where s is the input sentence in the form of
a list.
For each parse we check for the noun-phrase
and verb-phrase.
5.5.1.
Noun phrase: the predicate nnp detects the nounphrase. We have 3 types for nnp.
nnp(S, T1)-article,
adjective, noun.
nnp (S, T1)-article, noun
nnp(S, T1)-noun
Each time the head if the sentence is taken
and checked for the membership in the respective type and the remaining of the
sentence is the verb phrase is left in T1.
5.5.2.
Verb phrase:
we have defined 4 types of verb phrases.
Vvp2- verb or verb, adverb
Vvp3 – Verb, prepphrase
or verb, adverb, prepphrases
Where prepphrases-
preposition, nounphrase
Preposition, nounphrase,
Prephrases
Vvp- verb,
nounphrases or verb, nounphrase,
adjective
Vvp- verb, nounphrases,
prepphrase or uup,
adjective.
For exclamatory and interrogative
sentences, the head of the list is taken and checked whether it is a member of
interrogative list and if so further processing is done. For these sentences,
the last element of the list is taken and checked if it is equal to ‘em’ or’qm’ and hence the
respective sentence is recognized.
Using all of the above modules, the
system accepts sentences in the form of a string. It analysed
this sentence and outputs the details about its syntax if and only if the
syntax was correct.
For e.g: Input: ”The boy ran to the park”
Output:
Article (the)
Noun
(boy)
Verb-past
tense (ran)
Preposition
(to)
Article
(the)
Noun
(park)
The boy ran to the park.
If the sentence was not syntactically
correct it would simply output.
“The sentence is syntactically wrong”.
Semantic Analysis - A Study
6.1. Introduction:
For better understanding of a sentence in a natural
language we need knowledge about the meaning of the words that appear in the
sentence. To do so we must include semantic markers with each word. To
illustrate the need for semantic markets consider the following example:
The boy
put the book on the table.
The
table put the boy on the book.
Where the first sentence is
perfectly correct both syntactically and semantically, the second one is
correct syntactically but not semantically. Human-being at once will understand
the flaw in the sentence. But to make our system understand that the table is
an inanimate object and hence incapable of performing actions, we may have a
semantic marker with each object-ANIMATE or INANIMATE.The
process of determining the correct meaning of an individual word is called
“word sense disambiguation” or “Lexical Disambiguation”. This is done by
associating with each word in the Lexicon information about the context in
which each of the words may appear .Each word in the other words must be determined.Sometimes only very straight forward information
about each word sense is necessary. For example,
John decided to meet Mary
at
John decided to meet Mary
at the park. (LOCATION)
Here the correct meaning
of the words “
Physical Object (Table, Chair, Ball)
Animate Object (Boy, Girl, Dog)
Abstract Object (Heat, Anger)
Using these, the appropriate
meaning of a word can be selected.
For better understanding more and more
semantic markers will be required. Also we may have included what sort of
object will be required by a verb.
E.g.: the verb “put” will require
(usually) an inanimate object. That is
John put the book on the table.
But we may also have special
cases such as
John put the puppy on the table
Here
the puppy can’t be defined as INANIMATE since it can walk, bark and so on.
Similarly we may identify the synonyms of verbs and classify them into broad
categories such as
§
REQUESTIVE
§
ADVISIVE
§
EXPRESSIVE
§
PERMISSIVE
§
OBLIGATIVE
§
EXPOSITIVE
§
STIPULATIVE
From the appropriate type we may do propositional mapping which will
lead to effective interpretation of our input sentence.
Conclusion
In this paper we have presented an analysis of
sentence for semantic validity using PROLOG. Here we have used predicate logic
of both syntax analysis and semantic analysis. This concept is suitable for all
artificial intelligence, expert systems, Natural language systems and
perception systems for vision and touch. It encompass programs that understand English
natural language of the user, such as English language, but it is central to
the interaction of humans, there has been used for teaching languages.
References
[1]
Clocksin W.F. and Mellish C.S., ‘Programming in PROLOG’, Springer Verlag,
[2]
Carl Townsend
‘Introduction to Turbo Prolog, BPP Publications’, 1987.
[3]
Elaine Rich,’
Artificial Intelligence’, McGraw – Hill, 1983.
[4]
David Gries,’Compiler construction for Digital Computers’, John
Wiley & Sons, 1971.
[5]
D.Sheela, ‘Representation
of Sentence of Effective Interpretation’, 1988.
[6]
R.Ramanan, ‘A Natural
Language Interface to Data Bases’, 1988.
[7]
Wren P.C., &
Martin H, ‘High School English Grammar & Composition’, S.Chand
& Co., 1980.
[8]
Chaniak E., Riesberg, C.k. and M.C. Dermott
H.D.W., ‘Artificial Intelligence Programming’,
[9]
Godfrey Howard,
‘A guide to Good English in the 1980’s’Pelham Books’, 1985.
[10]
Winston P.H., ‘Artificial Intelligence’, Addision
Wesley.
Technical College - Bourgas,