ÀÎÒ   Àâòîìàòè÷åñêàÿ Îáðàáîòêà Òåêñòà

ãëàâíàÿ î íàñ ïðîäóêòû ñêà÷àòü  äåìî òåõíîëîãèè   ^

A short description of Dialing Project. Alexey Sokirko, 2001.

A short description of Dialing Project

Introduction

Before the First Level Semantics

Basic Semantic Description

Building Semantic Nodes

Building all possible semantic relations

Building Tree Variants

After the First Level Semantics

Acknowledgments

References

 

A short description of Dialing Project ^

This paper gives a short description of a  natural language processing system (NLP-system) that is called Dialing Project (1999-2001). Dialing Project is a Russian-to-English machine translation system that is based on French-to-Russian machine translation system [1] developed by Dr.Leontieva and her group in 1976-84. The author of the paper pays attention mainly to the semantic module of Dialing Project  since he  participated in its development.

 

Introduction ^

Dialing System consists of six modules of text processing:

  1. Graphematics;
  2. Morphology;
  3. Group Syntax;
  4. First-level Semantics;
  5. Transfer;
  6. Synthesis.

On each stage there is a special formal language in which one can translate the input text. Firstly, the input text is translated to graphematical language, then from graphematical  to morphological  and so on. Evidently, these languages  share some words between each other. In principle it is possible to  translate from one language to another without intermediate languages. In practice one should avoid it because it violates the well-known Law of Demeter("each component should only know about closely related components").  For example, one should avoid references to Russian graphematical representation  from English Synthesis stage. Still this scheme makes   possible to use the modules in different combinations,  or even stand-alone, that"s why  one can convert the  machine translation system Dialing to various types of NLP systems.

The main symbols of  these six special languages are descriptors that are  special constants which  describe one word or group of words of the input text. We can build complex construction based on descriptors, i.e. sets or lists of descriptors, logical forms and so on. Each descriptor has its meaning and a set of rules that can assign it to a particular word or group of words. For example, descriptor "EOS" means end of sentence, and its set of assigning rules is based on analysis of full stops, exclamation or question marks.  While working with natural languages one can find it helpful to use the same descriptors for different natural languages. The shared descriptors always have the same meaning, but can differ in sets of assigning rules.

In text processing a descriptor can be attached to a word or group. In Dialing Project any group of words always has a single main word. A group can be continuous and not.  A continuous group can be defined by lower and upper borders. A non-continuous group can be defined only word by word. A group of two words is called a relation.

Besides the modules of analysis there are dictionaries that hold predefined information about words and groups. This is the most important part of the system, since it is the only source of real knowledge for the computer. When we find an entry in dictionaries that matches the input word or group, we say that we find an interpretation. If there are many interpretations for one text item then it is the case of ambiguity. Normally we should choose only one interpretation for a text item. Speaking roughly the process of analysis can be viewed as a sequence of interpretation choices. In most cases an interpretation of  some text item A depends  on interpretations of neighbor items, which depend  themselves on the interpretation of A. In other  words ambiguous items are cross dependent. That"s why we are to process all text variants of  the word"s context and choose the best. If  the number of variants is too large  then the programmer should narrow the context or simply cut some variants off. Such narrowing is an unavoidable source of interpretation errors that a computer usually makes.

 

Before the First Level Semantics ^

Now let us describe Dialing process stages in more detail. The first processor (Graphematics) divides the input text into words, digital sequences, alpha-numeral sequences. It finds electronic addresses, file names, some abbreviations. Graphematical procedure splits input text into  sentences, paragraphs, and headings.

The second processor is morphology. For each input word form the procedure finds all morphological interpretations in the morphology dictionary. Morphology interpretation is a tuple:

<Normal form, Part of Speech, Grammems>, for example:

doing -> <do, VERB, ing-form>

In addition the morphological processor applies heuristic context-dependent rules that resolve some simple cases of morphological ambiguity.

The next stage is occupied by syntax and fragmentation modules that are closely connected.  The fragmentation  module finds clauses in Russian  compound sentences. The syntax module builds continuous syntax groups inside clauses (NP, PP etc).  Any word of a syntax group has a determined morphological interpretation whereas clause can include some items that are not fully interpreted. Since Russian language has free word order, syntax groups of Dialing Project cover 70% of words. We consider that the rest of the syntax structure(30%) should be built on the next stage, i.e. First Level Semantics.

 

Basic Semantic Description ^

The First Level Semantic  analysis builds a semantic graph for one Russian sentence.  Speaking generally, the concept of "semantics" is not well defined in computational linguistics. Theoretical linguists and art scholars  understand this word deeper than specialists of applied linguistics. Usually pure humanists think that natural language semantics  is a science that deals with definitions of  words and collocations. Accordingly,  they feel that formal linguistics should build formal definitions. However the last investigations in this domain convince us that full formal descriptions of  words cannot be effectually combined to result in a structure for the whole input text. The main reason of it is the size of the descriptions.  For example, consider a definition of word "Table" [2]:

"Table" = "a piece of furniture having a smooth flat top supported by one or more vertical legs" (Wordnet).

Now imagine what a mess the input text will be after replacing all original words by such definitions.

If formal definitions cannot be used in computational linguistics due to their size, then what is the difference between semantics and syntax? Following Dr. Leontieva[9-10] we assume that the difference is in the relations and descriptors. Semantic relations are universal while syntax relations are language-dependent. Thus semantic structure is a graph with semantic relations. The number of nodes in the graph is approximately  equal to the number of  words in the input text, but there are some exceptions.  Certain words  can be divided into two or more nodes, for example,  German word "Donaudampfschifffahrtsgesellschaftskapitänswitwenrentenauszahlungsstelle". will be translated into 8 nodes ("Danube", "steamship company", "society", "capitan", "widow", "rent",  "payment", "office" ). On the other hand, some words can be united in one semantic node, for example "one hundred and forty" -> "140". But the most frequent correspondence  is  "one word -> one node", which makes the whole structure transparent. Consider the example of analysis:

Input Text: I write with a pencil
Syntax Structure:  NP => I;  V -> write; PP ->  with a pencil

Semantic StructureAgent(I, write); Instrument (pencil, write)

Although the semantic structure is  very simple and therefore very productive,  it is necessary to understand that it is very hard to get such structure without  errors.  In this light one should  not overvalue computational semantics in whole. For example, if you remove the semantic module  from Dialing Project  the translation will become only 20% worse.

Now we are ready  to describe the  semantic machinery used in Dialing Project.   There are three types of objects to be discussed:

  1. semantic relations;
  2. semantic descriptors;
  3. semantic categories.

Most  of semantic relations we use are wide spread (see, for example Universal Networking Language). Semantic relations are written in the following format:

R(A,B), where R is a relation name, A is slave semantic node and B is  master semantic node. 

Formula R(A,B) with some R,A, and B  agrees  with the input text iff  "A is R  for B" , i.e.:

R(A,B) <=>  "A is R  for B".

For example, formula Author(Leo Tolstoy, War and Peace) agrees with the text "the author of "War and Peace" Leo Tolstoy" since the text implies that Leo Tolstoy is the author of  "War and Peace".

The next principal object of our semantics is semantic descriptor.   There are about 40 semantic descriptors in the dictionary, like Animative, Rank, Organization etc. A formula that is built on logical operations (conjunction, disjunction) and semantic descriptors as atoms  is attached to each entry in the dictionary and to each actant pattern.  For example:

SemDes(president) = Power & Rank.

Initially such formulae were introduced to choose the right sense of words, if more than one sense exist. Subsequently semantic descriptor acquired some special meaning that means that

SemDes(A) = SemDes(B) <=> "Meanings of A and B have something in common".

The third semantic instrument is semantic categories.  There are four of them:1. Situations; 2.Objects; 3. Properties; 4 Operators. Situations(go, explosion, ...) always have two latent valencies: where  and when it happened.  Objects(keyboardmonitor, ...) have only one latent valency: where it is located . Properties(red, old..) modify situations and objects. Operators (not, yet,...) only modify semantic structure, they never correspond  to separate semantic nodes,.

All these instruments are used in description of word senses. Consider an example:

Title  =  read
Sense = 1
Category  = Situation
SemDes = Process
Valency = Agent(C, A1);  Object(C, A2)
SemDes1 = Animative
GramDes1 = subj
SemDes2 = Information & Thing
GramDes2 = obj

That was a short description of the adopted semantic theory that is used in Dialing Project. Let us proceed with the description of the program.

The input of the semantic module is a set of syntax linkages. The output is semantic structure. While  building the resulting semantic structure the program faces the following problems:

  1. Morphological ambiguity, since it cannot be fully resolved on the syntax stage.
  2. Lexical ambiguity, for example  I saw a bat, where a bat might refer to an animal or, among others, a table tennis bat.
  3. Structural ambiguity, for example "John sold the dog apple",  where at least two interpretations are possible:  "John sold the apple to the dog" or "John sold the apple that somehow resembles a dog"

The types of ambiguity constitute the design of the semantic module:

input text -> M1,...,Mn – morphological variants

each Mi -> L1,...,Ln – lexical  variants

each Li -> S1,...,Sn – structural  variants

 

Building Semantic Nodes ^

On the first level the procedure works with one morphological variant. In short this level is aimed to build semantic nodes and to assign them dictionary interpretations.  The most interesting parts of the level are as follows:

  1. Finding time groups;
  2. Finding fixed collocations;
  3. Getting all semantic interpretations for each semantic node.

The procedures 1) and 2) deal with collocations, and  it is time to have  a look at collocations in Dialing Project.  Algorithmically, there are two distributions: conditional/unconditional collocations and closed/open collocations.  Conditional collocations differ from unconditional ones in the method of finding. To find an unconditional collocation in a text one should only know its syntax and morphological properties. On the contrary, there must be strong semantic evidence for conditional collocations, that"s why  we have to find conditional collocations only on the semantic stage. Here are examples:

Unconditional:

  1. to be chip off the old block
    Yes, Tom is just a chip off the old block.
  2. come hell or high water
    Finish your homework come hell or high water!

Conditional:

  1. clear the air
    1. His apology cleared the air
    2. Drivers! Clear the air with Flame Gard Filters
  2. climb the wall
    1. If I don't go on a vacation soon, I'll climb the wall.
    2. If you can't climb the wall, build a door

Unconditional collocations are usually more idiomatic and longer than conditional.

The difference between open and closed collocations is more rigid. Any closed collocation corresponds  to only one semantic node in a semantic structure while  open collocations have as many semantic nodes as their words. Words of an open collocation can be connected with other nodes of the semantic structure, but of course, there must be  some irregularity  either in syntax of the collocation, or in its translation. Each node of an open collocation has a separate dictionary interpretation whereas a closed collocation can be interpreted only as a whole.

Examples:

Open collocations:

  1. 1.time groups with  the special use of colon
    1. at 7 : 30
    2. at 7 : 30 Friday evening
  2. crack a smile
    1. I finally got him to crack a smile.
    2. He would crack a forced smile

Closed collocations

  1. cut and dried
    That's my answer, cut and dried
  2. cross the Rubicon
    He crossed the Rubicon      

Obviously, this distribution is not absolute but it is very practical. It is clear that the computer can process unconditional/closed collocations quicker than conditional/open ones. If  we want to forget about optimization issues we should make all collocations open (just in case) and conditional.

Collocations are stored in the collocation dictionaries, some of which will be described below.

The dictionary of fixed functional words consists of  complex conjunctions (in order to, so that...) and complex prepositions (on behalf of ...).They are closed and unfortunately open, since there are some examples like this:

(1) to be, or not to be — that is the question // not conjunction

(2) God has given us complete autonomy, that is self-government //conjunction

The next dictionary is Dialing Thesaurus. Its structure  resembles Wordnet. The basic relations are hypernymy/hyponymy and meronymy/holonymy. One thesaurus relation can link two synsets together. That is the common place. However there are some differences from standard thesauri:

  1. Only noun phrase can be inserted into  Dialing Thesaurus,  no VP.
  2. Each thesaurus entry can be supplied with a valency structure that is expressed by means of semantic relations and descriptors.
  3. Any thesaurus synset can be used as semantic descriptor to select a word"s actant.

By default all thesaurus entries are open and unconditional collocations.

The next collocation dictionary is Time Group Dictionary. Time groups fill Time valencies, for example:

On Monday we go to the Zoo.

These collocations are open and unconditional, even though non-prepositional time phrases can be conditional, for example

(1) Sunday morning will be all right  (not Time valency)

(2) We"ll meet Sunday morning (Time valency)

The syntax of time groups is very special, that"s why we consider them collocations:

(1)  Tuesday September 4 1:34 AM ET

(2)  in the 1980s

The last collocation dictionary is called Fixed Collocation Dictionary. Most of entries of this dictionary are verb phrases, like clear the air or climb the wall,  and considered unconditional and closed, for example chip off the old block.

Besides the collocation dictionaries there is also one big dictionary for words that is called Russian Common-Puprose Dictionary. Thus, the Dialing Project uses the following Russian dictionaries:

  1. Russian Common-Puprose Dictionary (main);
  2. Thesaurus;
  3. Time Group Dictionary;
  4. Functional Words Dictionary;
  5. Fixed Collocations Dictionary. 

Each entry  of these dictionaries is identified by Title and SenseId. In other words, dictionary   interpretation is the following tuple:

 <Dictionary Name,  Title, SenseId> 

If no dictionary interpretation can be found for a word, the default one is used. For example, there is a default interpretation for transitive verbs. Of course, the program should choose only one dictionary interpretation for one semantic node, that"s why the program has to search among all lexical variants (a variant of attaching  dictionary interpretations to semantic nodes).

 

Building all possible semantic relations ^

The main steps the program makes while processing  one lexical variant  concern semantic relations:

  1. Building valency structure of each node;
  2. Building all possible relations;
  3. Processing  similar nodes;

The first  procedure builds valency structure of  each node. Here are further complications to worry us.  Firstly, some valencies are not compatible with the others, and this information is written in the dictionary in a special format. Secondly, there are default valencies that should be added  in any case to some types of nodes. For example, any Situation node has  a default Time valency. Thirdly, some valencies are mandatory, and some valencies are not.

The second procedure builds all possible semantic relations without checking semantic conditions. It uses morphological and syntax information about nodes and information that is written in GramDes field [3] of node"s article. For example, let X be a node. Let An be the n-th valency of word X.  If  the dictionary article of X includes  formula "GramDesnof+NP"  then  each PP with the preposition "of" will be connected to X. Usually this procedure builds much more relations than needed.  It is a task for the next procedures to choose the right hypothesis.

The ratio  R/N, where R  is  the number of all possible relations and N is  the number of nodes can measure  the  complicity of the input text. If R/N  > 1,7 then the current version of the program invokes heuristic procedures that delete unpromising relations.

It is well known that the worst enemies of NLP are ambiguity and similarity that"s why the procedure for similar nodes is one of the important part of this level. Similar words occupy only one  valency place, that"s why there should be a proxy (we call it Multiple Node Actant, or MNA ) that  represents all similar words. There are several MNA syntax types:

(1)  I like Peter, Mary, and John (basic)

(2)  I like neither Peter nor Mary (corr. conj)

(3) That will do more harm than good  (comparative)

After some kind of MNA is found  then we need to identify all possible nodes that should be subdued by this MNA and to find all possible parents of this MNA.

Our solution of the last problem resembles Link Parser[3] Similar Rule where linkage of sentence

I like Peter and John  

is built from two linkages

I like Peter

I  like John.

In short, node X becomes a parent of MNA if X subdues two children of the MNA:

 

Building Tree Variants ^

The last procedure searches the best tree variant and it is the slowest part of the whole program.  Obviously it is impossible to check all tree variants for a sentence of standard length (20-30 words). We know two implemented methods to restrict the search space. The first method  that is used in ETAP System[4] chooses the best variant for each node locally. Initially it finds the root of the clause, then it  chooses root"s children and  so on.  It means that choices on different levels are independent. Evidently that is not a correct method because it is not clear in which order  the program should  go through the nodes. But we don"t  want to criticize it since the existence of fully correct methods  is questionable. The second known method was developed in Mikrokosmos Project by Beale Stephen[5]. It is based on Constraint Satisfaction [6]. In short, this method divides the graph into subgraphs with minimal  ratio N/M, where N is the number of nodes in this subgraph and M is the number of nodes of the subgraph that are connected  to nodes that are not in this subgraph.

After finding the first subgraph division it solves ambiguity in each subgraph.  Then it finds new subgraphs and so on. One can see that Mikrokosmos method  is not also fully correct since  choices in different subgraphs are independent, but we know that the only correct method is too slow. That"s why, to some extent, the Dialing tree variant procedure is a kind of Microkosmos  method.

Our procedure hinges on the assumption that the number of cycles in the graph of all possible relations is small. For Russian language this fact is proved experimentally,  though there is no evidence  that it is true for other languages. If the number of cycles is small then the following steps are reasonable:

  1. Delete one relation in each cycle in all possible ways that produce a set of uncycled graph variants, for example, if there is one cycle with three  nodes then there are three ways to make the graph uncycled;
  2. If there is only one relation that points to a node, then fix this relation. We are sure that all fixed relations will be in the resulting variant, because the result variant should be connected;
  3. Find best tree variants in each maximal connected component that has no fixed relation.
  4. Combine all best tree variants in the resulting tree variant.

One can see that besides the number of cycles the number of fixed relations is also crucial for this procedure.  If the input graph has no fixed relation then  the whole  procedure turns out to be the full search. In practice the part of fixed relations is 20%, which makes the procedure several times quicker than the full search.  This method was tested and no counter-examples were found.

Now the key question is the evaluation of tree variants. There are several parameters of different importance that produces the value of a tree variant:

  1. Number of Connected components;
  2. Projectivity
  3. Length of relations
  4. Valency order;
  5. Mandatory and optional valencies;
  6. Matching Semantic Descriptors;
  7. ...

There are more than twenty parameters the program uses. The most important thing about them is that the parameters should be linguistically transparent – no special search trick is admitted. Thus,  Dialing Project uses transparent search algorithm (with fixed relation) and linguistically motivated set of parameters to estimate variants. We believe,  any NLP system should be designed in such way:

  • search algorithm should be simple and performs nearly full search;
  • measures should  be linguistically transparent.

 

After the First Level Semantics ^

The First Level Semantics sends the best semantic structure to the Transfer stage, which performs  two main tasks. Firstly, it maps  the  semantic structure  to the  English Semantic Dictionary, i.e. finds the best translation for a semantic node. To find the best translation one should consider semantic node"s  descriptors and valencies.  Secondly,   the program makes some changes in the  semantic structure to prepare it for English Synthesis. For example,  in Dialing Project  all phrases like:

The girl is beautiful

The beautiful  girl

The girl who is beautiful

have the same semantic structure Attrib(beautiful, girl) with different syntax information, hence Transfer  should have a  procedure which  inserts node "be" in the structure like this:

girl       be  
     
  beautiful   girl   beautiful

The last step is English Synthesis. Its main task is to produce a string of English Words, therefore all problems with word order are solved here.  The Word Order Problem is relatively simple when the semantic graph has only one connected component. The things become worse when the graph is not connected. In this case we are to put all connected components one after another in the output sentence, which usually  produces unsound results. Now we say  that in Dialing Project it is crucial to achieve fully connected graphs on the First  Level Semantic Stage. To achieve it we need to enlarge  our semantic dictionaries.

 

Acknowledgments ^

I would like to thank all members and consultants of Dialing Project and especially Dialing Company President Eduard Khachukaev, who funded this project.  More information about the Project can be found on www.aot.ru.

 

References ^

[1] The Concise Oxford Dictionary of Linguistics, © Oxford University Press 1997

[2] English as 2nd Language http://esl.about.com

[3] Temperley D, Lafferty J., Sleator D. 1995.Link Grammar Parser  http://www.link.cs.cmu.edu/link

[4] Russian Academy of Sciences Institute for Information Transmission Problems, ETAP-3 system  http://proling.iitp.ru/Etap/index_e.html

[5] Beale Stephen. (1996) Hunter-Gatherer: Applying Constraint Satisfaction, Branch-and-Bound and Solution Synthesis to Natural Language Semantics NMSU CRL Technical Report. MCCS-96-292.

[6] Tsang E. 1993. Foundation of Constraint Satisfaction. Academic Press, London.

[7] Michael Blekhman, Boris Pevzner; First Steps of Language Engineering in the USSR: The 50s through 70s; http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-11/mtr-11-6.htm

[8] W.John Hutchins; Machine translation over fifty years; http://ourworld.compuserve.com/homepages/WJHutchins/HEL.htm

[9] Leontieva Nina. Textual Facts as Units of Coherent Text Semantic Analysis // Arbeitspapiere der GMD 671. Darmstadt, 1992. 0,6

[10] Leontieva Nina. ROSS: Semantic Dictionary for Text Understanding and Summarization // META. - 1995.- 40, No 1. Montreal, 1995.

 

 

[1] Some information about the system can be found in [7] and [8].

[2] Many examples in this paper are translated into English to help reader to understand problems that exist in Russian.

[3] = Grammatical Description Field.

 

ãëàâíàÿ î íàñ ïðîäóêòû ñêà÷àòü  äåìî òåõíîëîãèè   ^

 
Ðàçðàáîòêà DiP.
© 2003 ÀÎÒ. Âñå ïðàâà çàùèùåíû.