PROCEEDINGS OF THE 9th INTERNATIONAL CONFERENCE "LINGUISTIC RESOURCES AND TOOLS FOR PROCESSING THE ROMANIAN LANGUAGE"
16-17 MAY 2013

Editors
Elena Mitocariu
Mihai Alex Moruz
Dan Cristea
Dan Tufis
Marius Clim

Organisers
Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi
Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest
Institute for Computer Science, Romanian Academy, Iasi

The publication of this volume was supported by the Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi.

PROGRAM COMMITTEE
Corneliu Burileanu, Faculty of Electronics, Telecommunications and Information Technology, University of Bucharest and Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest
Constantin Ciubotaru, Institute of Mathematics and Computer Science, Academy of Science, Chisinau
Mihaela Colhon, Informatics Department, Faculty of Exact Science, University of Craiova
Dan Cristea, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi and Institute for Computer Science, Romanian Academy, Iasi branch
Neculai Curteanu, Institute for Computer Science, Romanian Academy, Iasi branch
Cristina Florescu, Institute of Romanian Philology "Al. Philippide", Romanian Academy, Iasi branch
Corina Forascu, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi and Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest
Gabriela Haja, Institute of Romanian Philology "Al. Philippide" of Iasi, Romanian Academy
Adrian Iftene, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi
Diana Zaiu Inkpen, School of Information Technology Engineering, University of Ottawa
Radu Ion, Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest
Catalina Maranduc, Institute of Linguistics "Iorgu Iordan - Al.
Rosetti", Romanian Academy, Bucharest
Rada Mihalcea, Computer Science and Engineering, University of North Texas
Vivi Nastase, School of Information Technology Engineering, University of Ottawa
Ionut Pistol, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi
Dan Stefanescu, Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest
Elena Isabelle Tamba, Institute of Romanian Philology "Al. Philippide", Romanian Academy, Iasi branch
Horia-Nicolai Teodorescu, Institute for Computer Science, Romanian Academy, Iasi branch and "Gheorghe Asachi" Technical University of Iasi
Amalia Todirascu, Departement d'informatique, Universite de Strasbourg
Diana Trandabat, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi and Institute for Computer Science, Romanian Academy, Iasi branch
Dan Tufis, Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest
Cristina Vertan, Research Group "Computerphilology" (UHH), University of Hamburg
Adriana Vlad, Faculty of Electronics, Telecommunications and Information Technology, University of Bucharest and Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest
Marius Zbancioc, Institute for Computer Science, Romanian Academy, Iasi branch

ISSN 1843-911X

ORGANIZING COMMITTEE
Dan Cristea, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi and Institute for Computer Science, Romanian Academy, Iasi branch
Sabina Deliu, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi
Corina Forascu, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi and Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest
Lucian Gadioi, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi
Daniela Gifu, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi
Gabriela Haja, Institute of Romanian Philology "Al.
Philippide", Romanian Academy, Iasi branch
Elena Mitocariu, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi
Alex Moruz, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi and Institute for Computer Science, Romanian Academy, Iasi branch
Madalin Ionel Patrascu, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi and Institute of Romanian Philology "Al. Philippide", Romanian Academy, Iasi branch
Ionut Pistol, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi
Liviu Andrei Scutelnicu, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi and Institute for Computer Science, Romanian Academy, Iasi branch
Radu Simionescu, Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi
Dan Tufis, Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest
Elena Isabelle Tamba, Institute of Romanian Philology "Al. Philippide", Romanian Academy, Iasi branch

TABLE OF CONTENTS

FOREWORD

CHAPTER 1 LANGUAGE RESOURCES
EXTRACTING LEXICAL DICTIONARIES FROM COMPARABLE CORPORA - IN WHAT CONDITIONS DOES IT WORTH?
    Elena Irimia
DATABASE ON THE MEDICOPHARMACEUTICAL TERMINOLOGY [mpht] IN VARIOUS DISCURSIVE SPACES: ELABORATION-RELATED ISSUES
    Stelian Dumistracel, Doina Hreapca, Luminita Botosineanu
ELECTRONIC LINGUISTIC RESOURCES FOR HISTORICAL STANDARD ROMANIAN
    Elena Boian, Svetlana Cojocaru, Constantin Ciubotaru, Alexandru Colesnicov, Ludmila Malahov, Mircea Petic
CLRE - PARTIAL RESULTS IN THE DEVELOPMENT OF A ROMANIAN LEXICOGRAPHIC CORPUS
    Madalin Ionel Patrascu, Elena Tamba, Marius-Radu Clim, Ana Catana-Spenchiu
SUGGESTIONS FOR THE CLASSIFICATION OF TEXTS
    Catalina Maranduc
RELYING ON LANGUAGE
    Dan Stoica

CHAPTER 2 TEXT PROCESSING
ROMANIAN-ENGLISH STATISTICAL TRANSLATION AT RACAI
    Tiberiu Boros, Stefan Dumitrescu, Radu Ion, Dan Stefanescu, Dan Tufis
STATISTICS ON DERIVATION AND ITS REPRESENTATION IN THE ROMANIAN WORDNET
    Verginica Barbu Mititelu
INSTANTIATING CONCEPTS OF THE ROMANIAN WORDNET
    Stefan Daniel Dumitrescu, Verginica Barbu Mititelu
STEPS TO A NEW DTD AND SCD-BASED DICTIONARY ENTRY PARSER.
OPTIMIZING RECURSIVENESS IN SENSE DEPENDENCY HYPERGRAPHS
    Neculai Curteanu, Alex Moruz, Svetlana Cojocaru
ROMANIAN ETYMOLOGICAL CHAINS - A PRELIMINARY ANALYSIS
    Raluca Moiseanu, Dan Cristea
VIRTUAL CIVIC IDENTITY
    Daniela Gifu, Dan Stoica, Dan Cristea

CHAPTER 3 SPEECH PROCESSING
ROMANIAN CORPUS FOR SPEECH-TO-TEXT ALIGNMENT
    Anca-Diana Bibiri, Dan Cristea, Laura Pistol, Liviu Andrei Scutelnicu, Adrian Turculet
DATA-DRIVEN METHODS FOR PHONETIC TRANSCRIPTION OF OUT-OF-VOCABULARY (OOV) WORDS
    Tiberiu Boros, Radu Ion, Dan Stefanescu
USING FUNCTION WORDS FOR GUIDING THE PREDICTION OF THE ROMANIAN INTONATION
    Doina Jitca, Vasile Apopei, Otilia Paduraru
MAXIMUM ENTROPY BASED MACHINE TRANSLITERATION APPLICATIONS AND RESULTS
    Adrian Zafiu, Tiberiu Boros

INDEX OF AUTHORS

FOREWORD

The series of events organised by the Consortium for the Informatisation of the Romanian Language (ConsILR) has reached its 9th edition this year.
With a history that goes back to 2001, the ConsILR series of events has evolved over its 12 years of existence, attracting more and more interest from linguists and computational linguists, but also from researchers in the humanities and from PhD and master's students in Computational Linguistics, all with a major interest in the study of the Romanian language from a computational perspective. The series started in the format of a workshop and was transformed in 2010 into a conference, in order to gain international visibility, being addressed also to researchers working on the Romanian language from outside Romania. This year's event was organised in Miclauseni, in the old and romantic, recently rehabilitated Sturdza castle, which we believed would create a perfect atmosphere for concentration and brainstorming, inviting dialogue and debate.

The organisers of the Conference Linguistic Resources And Tools For Processing The Romanian Language have been, as in previous years, the Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iasi and two research institutes of the Romanian Academy: the "Mihai Draganescu" Research Institute for Artificial Intelligence, Bucharest, and the Institute for Computer Science of the Iasi branch. The organisers were also pleased to accept a satellite workshop organised by the "Alexandru Philippide" Institute of Romanian Philology of the Iasi branch of the Academy. The workshop, titled The Romanian Academic Lexicography. Challenges of Going Computational, is meant to show the progress made in computational approaches to lexicography in the research institutes of the Academy that shepherd the Thesaurus Dictionary of the Romanian Language.

In the period since the previous Conference, an event of major importance for the field of Language Technology has taken place in Europe: in September last year, the series of White Papers Languages in the Digital Age was published by the META-NET consortium as bilingual editions (in English and in each of 30 European languages). Each bilingual edition¹ includes a comprehensive description of the linguistic features of one of the European languages, and tables showing the comparative positioning of the languages with respect to existing resources and technological development. In these tables Romanian is placed in a rather privileged position (compared with the majority of European languages) with respect to automatic translation, close to the majority of the other languages as regards text analysis and the acquisition of resources for text and speech, but still in a trailing position in the domain of speech processing. This comparative study shows clearly not only that there is still a lot to be done in all domains of Romanian Language Technology, but also that we have made only the first steps towards big science and towards shaking hands with the big industry dedicated to this domain.

A serious alarm signal, stressing that 21 of the 30 European languages face "digital extinction" unless serious research efforts are made, was sent out in a synchronised manner all over Europe (through press releases, TV and radio interviews) on September 26th 2012, the very European Day of Languages. By marking this day, the Council of Europe recognises the importance of fostering and developing the rich linguistic and cultural heritage of our continent.
Unfortunately, the alarm does not seem to be perceived with the same intensity by the stakeholders who decide the finances of Europe: a likely reduction of the proposed European budget allocated to research for Horizon 2020 (the period 2014-2020, which also includes the Eighth Framework Programme) has been announced and has created panic within the European research community. If this turns out to be the case, the Language Technology domain will certainly not thrive in the next 7 years, and the threat of digital extinction for some of the languages of Europe could become a sad reality. Romanian is certainly one of them, digital extinction meaning that a language is used less and less on the internet (because of the lack of stable technologies able to translate it, to summarise it, or to support its automatic interpretation, such as text mining, parsing, crawling, etc.). As the internet has become the principal means of communication nowadays, the danger could indeed be very serious, and an explosion of foreign influence from the dense languages could manifest itself in spoken Romanian as well, sooner than we all expect (this is because the very young speakers are also the most easily influenced, while also having a big influence on the trendy evolution of the colloquial language).

But it is well known that text and speech technologies cannot exist without resources, since resources are the ground from which linguistic technological development draws its sap. Resources and technology go hand in hand, and this is the very reason why this Conference exists. In keeping with this truth, the volume, which includes 16 papers, is structured in 3 parts: Language Resources, Text Processing and Speech Processing. We have opted to mix in this volume the papers addressing the Conference with those addressing the theme of the Workshop.
Each paper has been reviewed by at least two members of the Programme Committee and, in accordance with international practice, the accepted papers were returned to the authors for final corrections and answers to the reviewers' comments. The volume does not include the presentations for which we received only abstracts, although we have recommended them for direct presentation (as part of the Workshop), and in one or two cases we accepted essay-like formats of the papers.

As in other editions, the complete program of the Conference and audio-video recordings of the talks can be consulted online (at http://consilr.info.uaic.ro/2013/), thanks to MEDIAEC - the Multimedia Laboratory of the "Alexandru Ioan Cuza" University.

Iasi, Bucuresti, May 2013
The editors

¹ The English-Romanian edition: Diana Trandabat, Elena Irimia, Verginica Barbu Mititelu, Dan Cristea, Dan Tufis. (2012) The Romanian Language in the Digital Age / Limba romana in era digitala, in White Paper Series, Eds. Georg Rehm and Hans Uszkoreit, Berlin, Springer, ISBN 978-3-642-30702-7, 87 p.; also available online.

CHAPTER 1 LANGUAGE RESOURCES

EXTRACTING LEXICAL DICTIONARIES FROM COMPARABLE CORPORA - IN WHAT CONDITIONS DOES IT WORTH?

ELENA IRIMIA
Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest, Romania
elena@racai.ro

Abstract

In previous papers we described DEACC, a tool that extracts lexical dictionaries from comparable corpora (CC) based on an algorithm introduced by Reinhard Rapp in 1999 and extended by us through various amendments and heuristics. While earlier experiments limited the evaluation to the accuracy of the results (the percentage of source words that received correct target translations), we consider that an analysis of the amount of new translation information that such a method can provide is also necessary.
Accordingly, we look for answers to the following questions: How many new words (not seen in the seed lexicon) were extracted? How many of the new words have accurate translations? How big should a seed lexicon be so that the newly acquired words justify the extraction work?

Keywords: comparable corpora, parallel corpora, parallel data extraction, phrase alignment, machine translation

1. Introduction

Using parallel data to extract translation knowledge is the established practice in machine translation technology. But having sufficient parallel corpora available for one's translation needs is a privileged position, enjoyed only by the most widely spoken languages. Usually, acquiring parallel corpora (PC) involves paying intellectual property rights for a text in the source language and its human translation in the target language. To be used in automatic translation, such a corpus must subsequently be aligned at document/paragraph/sentence level, morphologically and/or syntactically annotated, etc.; this is expensive in terms of both money and time. For less economically and culturally important languages (a category in which one can include, among many others, the Eastern-European/Balkan languages), the amount of digitized text, and implicitly the amount of parallel data, is significantly smaller.

One of the ideas put forward for improving the situation of under-resourced languages in Machine Translation (MT) was to collect a less pretentious type of corpora: comparable corpora. The EAGLES - Expert Advisory Group on Language Engineering Standards Guidelines (1996)² define a comparable corpus as "one which selects similar texts in more than one language or variety." The condition of data parallelism is replaced by the weaker condition of similarity, which makes the corpora much easier to procure.
News articles about the same subjects or Wikipedia entries describing the same entities or concepts can be downloaded and collected to compile a comparable corpus.

² http://www.ilc.pi.cnr.it/EAGLES96/browse.html

Of course, to use CC as a basis for MT one needs techniques for extracting translation information from them. Parallel information can be identified at document, paragraph, sentence or sub-sentential level. Intuitively, the translation information that can be extracted from CC is less reliable than the translation tables extracted from PC; this is why the first approaches in the field envisaged a baseline machine translation system built on as much parallel data as could be acquired and then improved by adding data extracted from CC.

2. Methodologies for extracting translation data from comparable corpora

As already mentioned in the Introduction, under-resourced languages must rely on parallel information scattered on the web to close the gap to the technologies of the privileged languages. Methods for collecting data and developing CC have been created and documented, but we will not dwell on them here. For more insight into these matters, we recommend the ACCURAT project report (Paramita et al, 2011).

Parallel data may be found in CC at any textual level: document, sentence, phrase, word. Different algorithms have been developed, each focusing on a specific level of granularity. For CC extracted from structured platforms like Wikipedia, which organizes its entries according to interlingual identifiers, document alignment is an easy task. For other situations, where there are no obvious links between documents in different languages, a variety of alignment techniques have been developed. Tao and Zhai (2002) employ a variant r of Pearson's correlation coefficient to compute similarities between words in the documents corresponding to the two languages.
Using the word similarity measure r(x, y), they construct a document similarity function in which x and y are words from documents $d_1$ and $d_2$ and $p(x|d)$ is the probability of occurrence of x in d. The alignment precision of this algorithm does not rise above 86%. Vu et al. (2009) improve on it by adding a Date-Window filter to reduce the search space (assuming that documents on the same subject are created around the same date). Furthermore, a second filter called Title-n-Content favours alignment candidates for which at least one title word of the source document has a translation in the content of the target document. They also add a linguistic feature concerning terms (multi-word expressions acting like single units) and replace Pearson's correlation coefficient with the Discrete Fourier Transform to compute the similarity score of two frequency distributions. Vu et al. (2009) report an increase in alignment precision of 4% for En-Chinese and 8% for En-Malay compared with (Tao and Zhai, 2002).

(Munteanu & Marcu, 2002) and (Munteanu, 2006) use a Cross-Lingual Information Retrieval (CLIR) technique: they translate the source words of a document using a lexicon and use the translations to construct a query which is run against the collection of target documents. The top k documents returned by the IR engine are the most probable pairings for the query document. This approach is designed to ensure high recall rather than high precision of the alignment.

Another approach is to translate the target documents with MT systems and compare the translated document $D'$ with the candidates $D_j$. A classical technique is to identify the similar documents $D_j$ through a vector-based clustering algorithm. Montalvo et al. (2006) use named entities and their cognates to perform cross-lingual clustering and obtain 90% accuracy.
Abdul-Rauf and Schwenk (2009) measure the closeness of the $D_j$ documents with TER, a standard MT evaluation metric, and select the document with the smallest distance: $D^* = \arg\min_{D_j} \mathrm{TER}(D', D_j)$.

Ion et al. (2011) designed an Expectation Maximization algorithm to align different types of textual units, including documents. They drew an analogy with the IBM-1 model for word alignment, where the translation probability is computed through an EM algorithm and the hidden variable a models an assignment (1:1 word alignments). Similarly, an assignment between two sets of documents (a 1:1 sequence of document correspondences) can be modeled by a hidden variable {true/false} and is determined by word translations between pairs of documents. The hypothesis is that some pairs of translation equivalents are better indicators of a correct document correspondence than others.

Parallel sentence extraction techniques are based on the assumption that comparable documents/corpora may contain some sentences which are reciprocal translations. Most of the approaches described above for document alignment have been adapted and used for sentence alignment. After first crawling the web for pages with similar URLs, Resnik and Smith (2003) use a lexicon based on parallel data to compute alignment scores between documents or sentences. Similarly, Zhao and Vogel (2002) find (nearly) parallel sentences in comparable documents through dynamic programming. For each possible n:m alignment between the sentences, they compute an alignment score based on a word alignment model, use special insertion and deletion models, and find a path which maximizes the total alignment probability. Abdul-Rauf and Schwenk (2009) use IR techniques (WER, TER) and simple filters like the sentence length ratio to identify the most similar sentence in the target language.

More recent approaches have been developed at RACAI (within the European project ACCURAT) and are described in (Stefanescu et al, 2012).
LEXACC requires aligned document pairs for a better precision of the sentence alignment. The algorithm interpolates five feature functions: 1) a translation overlap score for content words using GIZA++ format dictionaries, 2) a translation overlap score for functional words, 3) the alignment obliqueness score, 4) a punctuation score and 5) a score indicating whether strong content word translations are found at the beginning and the end of each sentence in the given pair.

Phrasal alignment approaches follow similar steps. A standard phrase alignment algorithm relying on the Viterbi path of the word alignment, a binary classifier algorithm and an algorithm based on lexical features are the three techniques used by Hewavitharana and Vogel (2011); the best performance in terms of precision, recall and F-measure is reported for the last technique. PEXACC, developed together with LEXACC at RACAI for the purposes of the ACCURAT project, linearly combines a set of feature functions $f_i$ (which output translation similarity scores between 0 and 1) to obtain the final parallelism score P for two phrases e (in the source language) and f (in the target language), i.e. a weighted sum of the $f_i(e, f)$.

The most popular method for extracting lexical dictionaries from CC, on which we based the construction of our tool, is described and used by Rapp (1999). It relies on external seed dictionaries and is based on the hypothesis that a word $target_1$ is a candidate translation of a word $source_1$ if the words with which $target_1$ co-occurs within a particular window in the target corpus are translations of the words with which $source_1$ co-occurs within the same window in the source corpus. The translation correspondences between the words in the window are extracted from the seed dictionaries.
A co-occurrence matrix is computed both for the source and for the target corpus: each of its rows corresponds to a word type in the corpus and each column corresponds to a word type in the base lexicon. Finally, similarity scores are computed between all the source vectors and all the target vectors obtained in the previous step, and translation correspondences are set between the most similar source and target vectors. Different similarity scores have been used in variants of this approach; see (Gamallo, 2008) for a discussion of the efficiency of several similarity metrics combined with two weighting schemes: simple occurrences and log-likelihood.

3. DEACC: adapting and extending the original Rapp's approach

3.1. What is new in DEACC

Initially, the co-occurrence matrix is constructed from the co-occurrence frequencies in the corpus. In a subsequent step, the frequencies are replaced by log-likelihood (LL) scores, which are able to eliminate word-frequency effects and favour significant word pairs. In our approach, this is followed by an LL filtering step, in which all the words that co-occur with an LL smaller than a threshold are eliminated. The filtering was motivated by the need to reduce computational costs in space and time, and is also justified by the intuition that not all the words that occur at a specific moment together with another word are significant in the general context of our approach (the LL score is a good measure of this significance).

The seed lexicons we used in our experiments are translation tables, automatically extracted with GIZA++ from parallel corpora. In such a table, a source word can have multiple translations and each (source, target) pair is associated with a translation probability. This introduces polysemy in our seed lexicons, a situation which is avoided and not discussed in the standard approach.
Other approaches either keep for reference only the first translation candidate in the dictionary or give different weights to the possible translations according to their frequencies in the target corpus. We think one needs to take advantage of all possible translations, as the semantic content of a linguistic construction is rarely expressed in another language through an identical syntactic or lexical structure. Our solution was to distribute the log-likelihood of a word pair $(w_1, w_2)$ in the source language over all the possible translations $t_i$ of $w_2$ in the target language as follows:

$LL(w_1, t_i) = LL(w_1, w_2) \cdot p(w_2, t_i)$

where $p(w_2, t_i)$ is the probability of the word $w_2$ being translated as $t_i$.

As the purpose of DEACC (and of all the other tools in the ACCURAT project) was to extract from CC data that would enrich the information already available from parallel corpora, it seemed reasonable to focus (just as Rapp (1999) did) on open class (versus closed class) words. Because in many languages auxiliary and modal verbs can also be main verbs, and most often POS-taggers do not discriminate correctly between the two roles, we decided to eliminate their main verb occurrences as well. For this purpose, the user is asked to provide a list of all these types, with all their forms, in the languages of interest.

Being based on word counting, the method is sensitive to the frequency of the words: the higher the frequency, the better the performance. In previous works, the evaluation protocol was conducted on frequent words, usually those with a frequency above 100, an option that ensures very accurate translation candidates. However, even if this causes a loss of precision, the frequency threshold must be lowered when we are interested in extracting more data; in our tool, this parameter can be set by the user according to his/her needs.
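
The distribution of log-likelihood scores over the translations in a polysemous seed lexicon, described earlier in this section, can be illustrated with a short Python sketch. The function name and the data layout are hypothetical, and summing the contributions when two source words share a target translation is an assumption of the example; the text above does not specify how such collisions are handled.

```python
def project_vector(source_vector, seed_lexicon):
    """Map a source-language LL vector into target-language space.

    source_vector: {w2: LL(w1, w2)} for one source word w1.
    seed_lexicon:  {w2: {t_i: p(w2, t_i)}}, a polysemous translation table.
    Each translation t_i receives LL(w1, w2) * p(w2, t_i).
    """
    projected = {}
    for w2, ll in source_vector.items():
        for translation, prob in seed_lexicon.get(w2, {}).items():
            # Assumption: contributions to the same target word are summed.
            projected[translation] = projected.get(translation, 0.0) + ll * prob
    return projected


# Toy Romanian-English example with made-up scores and probabilities.
vector = {"casa": 10.0, "drum": 4.0}
lexicon = {
    "casa": {"house": 0.7, "home": 0.3},
    "drum": {"road": 1.0},
}
projected = project_vector(vector, lexicon)
```

In this toy example the score of "casa" is split between "house" and "home" in proportion to their translation probabilities, so no candidate translation is discarded.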

Following the conclusions of Gamallo's (2008) experiments, we used the DiceMin function as the vector similarity measure. In computing the similarity scores, we did not allow cross-POS translation by default (a noun can be translated only by a noun, etc.); the user can decide whether to allow the application to cross the boundaries between parts of speech through a parameter in the configuration file. Each choice has its rationale: we know that a word is not always expressed through the same part of speech when translated into another language; on the other hand, putting all the words in the same bag increases the number of computations and the risk of error. For proper nouns, which are more likely to be translated into a graphically similar form from one language to another, we introduced a cognate score (based on the Levenshtein Distance), which is used in computing the similarity metric to boost the cognate candidates.

If the user's machine has multiple processors, the application can call a function that splits the time-consuming problem of computing the vector similarities and runs it in parallel. The tool is implemented in the C# programming language, under the .NET Framework 2, and is language independent, provided that the corpus is POS-tagged according to the MULTEXT-East tag set³ and that the user manually enters in the configuration file the list of source and target verbal forms to be ignored by the algorithm.

³ http://nl.ijs.si/ME/V3/msd/html/msd.html

3.2. Initial Experiments

The seed lexicon we used is a word-to-word sub-part of a translation table, extracted with GIZA++ from corpora in different registers. Only the content words were kept. The translation table can be loaded as two different dictionaries: EN-RO (64,613 polysemic entries) and RO-EN (66,378 polysemic entries).
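
For reference, the DiceMin measure mentioned in Section 3.1, together with the optional restriction to same-POS candidates, can be sketched in Python as below. The DiceMin formula follows its usual definition (twice the sum of the coordinate-wise minima, divided by the sum of all weights of both vectors); the function names, the `same_pos_only` flag and the data layout are hypothetical and do not reproduce DEACC's configuration options. Source vectors are assumed to have already been mapped into target-language space through the seed lexicon.

```python
def dice_min(u, v):
    """DiceMin similarity between two sparse vectors given as {feature: weight}."""
    shared = set(u) & set(v)
    numerator = 2.0 * sum(min(u[k], v[k]) for k in shared)
    denominator = sum(u.values()) + sum(v.values())
    return numerator / denominator if denominator else 0.0


def rank_candidates(source_pos, source_vec, targets, same_pos_only=True, top_k=10):
    """Rank target words by DiceMin similarity to one source word.

    targets: {target_word: (pos, vector)}.
    When same_pos_only is True, cross-POS candidates are skipped.
    """
    scored = []
    for word, (pos, vec) in targets.items():
        if same_pos_only and pos != source_pos:
            continue
        scored.append((dice_min(source_vec, vec), word))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(word, score) for score, word in scored[:top_k]]


# Toy example with made-up weights.
source_vec = {"house": 7.0, "road": 1.0}
targets = {
    "home": ("N", {"house": 6.0, "road": 2.0}),
    "go": ("V", {"house": 7.0, "road": 1.0}),
    "street": ("N", {"road": 5.0}),
}
same_pos = rank_candidates("N", source_vec, targets)
any_pos = rank_candidates("N", source_vec, targets, same_pos_only=False)
```

Skipping cross-POS pairs reduces the number of similarity computations, at the cost of missing translations that change part of speech.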

Tests have been conducted on two CC of different sizes and types/registers:

1. A corpus of articles extracted at RACAI from Wikipedia: 743,194 words for Romanian, 809,137 words for English; a strongly comparable corpus, with little noise (due to the fairly similar structure of the wiki pages, which facilitated the elimination of boilerplate).
2. A corpus compiled by USFD in the ACCURAT project: journalistic texts downloaded from Google News through a heuristic based on a list of English paper titles, translated into Romanian. For more details, see (Paramita et al, 2011).

The pre-processing (tokenization, insertion of diacritics, lemmatization, POS-tagging) of the comparable corpora has been described in (Irimia, 2012) and we will not detail it here.

Initially, we manually compiled a gold standard lexicon of around 1,500 words (common nouns, proper nouns, verbs and adjectives) from the Wikipedia corpus. Under the conditions described by the default parameters in the configuration file, the precision-1 (the number of times a correct translation candidate of the test word is ranked first, divided by the number of test words) and precision-10 (the number of times a correct candidate appears in the top 10, divided by the number of test words) scores were computed:

Table 1: P-1 and P-10 for the 1,500 test words from the Wikipedia corpus

POS            Precision-1   Precision-10
common nouns   0.5739        0.7381
proper nouns   0.6956        0.7336
adjectives     0.4943        0.6292
verbs          0.6620        0.8275

Because the initial experiments with the USFD corpus were very disappointing, we acknowledged the need to correct some POS annotations and also to introduce two different frequency thresholds for the two corpora (English: 7,280,609 words; Romanian: 2,170,425 words), to compensate for the difference in size. We also used the Levenshtein Distance for all the analyzed POS, to boost those scores that correspond to graphically similar translations.
The boost is applied after all the similarity scores between a given source word and all the target words have been computed. Words were considered cognates when LD < 0.3, and the boost consisted of multiplying the similarity score by 10. All the resulting scores above 1 were reduced to 0.99. After all these heuristics, the results became more reasonable, but still did not reach the performance obtained on the Wikipedia corpus. We see this as a consequence of the serious difference in the degree of comparability between the two corpora. We constructed gold-standard dictionaries with 100 entries each for common nouns, verbs and adjectives, and computed the Precision-1 and Precision-10 scores:

Table 2: P-1 and P-10 for the 300 test words from the USFD corpus

POS            Precision-1   Precision-10
common nouns   0.2909        0.5454
adjectives     0.3663        0.5049
verbs          0.24          0.48

The effect of introducing the cognate test for all the POS was important for many of the good results: it produced more forms of the same lemma as possible translations, which is consistent with the rich morphology of Romanian and is very useful in a dictionary. This phenomenon occurred for around 46% of the correctly translated nouns, 39% of the correctly translated adjectives and 29% of the correctly translated verbs.

3.3. Using DEACC results to improve SMT systems

There are two basic directions in making use of the translation data extracted from CC to increase the performance of SMT systems: adding parallel data extracted from comparable corpora to the training parallel corpus (PC), or constructing mixture translation models and interpolated language models from PC and CC. But before thinking about how to integrate our lexical dictionaries into an SMT system, we need to evaluate how reliable the data we obtained are. Using a seed lexicon extracted from a large and diverse corpus, as seen in the previous experiments, led to good P-1 and P-10 scores.
This approach is suited to domain-adaptation techniques, in situations where the available parallel corpus is general and the comparable corpus is from a specific domain. But for settings where the parallel corpus (and, implicitly, the seed lexicon) is small, the method might produce less accurate results. We experimented with a lexicon extracted from the "1984" English-Romanian corpus⁴; the one-to-one, content-word-only version of this small seed lexicon has only 2,870 entries, as opposed to the seed lexicon used in our first experiments, which had around 265,000 entries. We also used the opportunity to vary some other parameters of the application: the frequency, the co-occurrence window and the LD yes/no option. As can be seen in Table 3, we did not experiment with all the possible parameter combinations, but guided our decisions by the results of the previous experimental step. The first set of experiments was composed of four settings:

General/F50-10/LDyes/w5
General/F30-10/LDyes/w5
1984/F50-10/LDyes/w5
1984/F30-10/LDyes/w5

from which we learned that the two different frequency ratios have no significant influence. We continued by keeping F30-10 and varying the window w to 10, and we noticed a good improvement (the largest being P-1 for adjectives: from 0.3 to 0.48).

⁴ http://nl.ijs.si/ME/Vault/CD/docs/mte-d21 f7node7.html

Table 3: general vs. 1984 dictionaries (a dash marks a combination that was not tested; an asterisk marks the best setting for each dictionary, shown in bold in the original)

General Dictionary
          F50-10                        F30-10
          LDyes          LDno           LDyes          LDno
          w5     w10     w5     w10     w5     w10*    w5     w10
N  P-1    0.185  -       -      -       0.17   0.25    -      0.09
N  P-10   0.46   -       -      -       0.46   0.5     -      0.26
A  P-1    0.29   -       -      -       0.3    0.48    -      0.17
A  P-10   0.57   -       -      -       0.57   0.65    -      0.46
V  P-1    0.18   -       -      -       0.19   0.26    -      0.17
V  P-10   0.53   -       -      -       0.53   0.6     -      0.48

"1984" Dictionary
          F50-10                        F30-10
          LDyes          LDno           LDyes          LDno
          w5     w10     w5     w10     w5     w10*    w5     w10
N  P-1    0.12   -       -      -       0.13   0.12    -      -
N  P-10   0.36   -       -      -       0.36   0.39    -      -
A  P-1    0.23   -       -      -       0.26   0.26    -      -
A  P-10   0.51   -       -      -       0.51   0.53    -      -
V  P-1    0.12   -       -      -       0.11   0.1     -      -
V  P-10   0.38   -       -      -       0.38   0.41    -      -

Next, we set w to 10 and changed LD to no, which produced a serious decrease, to practically the worst scores for the general seed dictionary. The best parameter combination for the general seed dictionary (F30-10, LDyes, w10; see the starred column of the General Dictionary part of the table) was used to test the "1984" seed dictionary: the results improved for P-10, but decreased or remained the same for P-1 (see the starred column of the "1984" Dictionary part). As expected, there is a significant decrease in performance when using a small seed dictionary, ranging from 13% to 22% for P-1 and from 11% to 19% for P-10. But we can still use such a dictionary to extract new information from comparable corpora when the available parallel data are poor. For the two best settings, S1 and S2 (the starred columns in the table), we computed the number of new words against their specific seed dictionaries.

Table 4: The number of new words extracted from the USFD corpus

                                   A             N             V
total number of extracted forms    1887          5530          2945
new forms vs. 1984 dict.           1620 (~86%)   4820 (~77%)   2486 (~84%)
new forms vs. general dict.        604 (~32%)    1572 (~28%)   638 (~21%)

Then we computed the P-1 and P-10 scores for S1 and S2, on gold standards manually validated for 700 word forms for each of the noun, adjective and verb categories.
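The P-1 and P-10 scores used throughout these experiments can be computed along the lines of the sketch below. It assumes one reading of the definitions given earlier: a test word counts as a hit when at least one gold-standard translation appears among its top k ranked candidates. The function name, the data structures and the toy data are hypothetical.

```python
def precision_at_k(ranked_candidates, gold, k):
    """Share of test words with a correct translation in the top k.

    ranked_candidates: dict mapping a test word to its candidate list,
        best candidate first.
    gold: dict mapping a test word to the set of accepted translations.
    """
    if not ranked_candidates:
        return 0.0
    hits = 0
    for word, candidates in ranked_candidates.items():
        accepted = gold.get(word, set())
        if any(c in accepted for c in candidates[:k]):
            hits += 1
    return hits / len(ranked_candidates)


# Toy data (invented for the example).
ranked = {
    "house": ["casa", "acasa"],
    "dog": ["pisica", "caine"],
    "tree": ["floare"],
}
gold_standard = {"house": {"casa"}, "dog": {"caine"}, "tree": {"copac"}}

p1 = precision_at_k(ranked, gold_standard, 1)    # one hit out of three words
p10 = precision_at_k(ranked, gold_standard, 10)  # two hits out of three words
```

Computing the score separately for each part of speech, as in the tables, only requires splitting the test words by POS before calling the function.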
Table 5 shows that the percentage of correctly translated new words when using the general dictionary is insignificant in relation to the extraction effort: around 20 new words in 1,000 are in the first position of the candidate lists. By contrast, for the 1984 dictionary, where the computational costs are reduced (~20 minutes for ~12,000 new word forms), there are, on average, 18% correct translations in the first positions of the candidate lists.

Table 5: P-1 and P-10 scores for the new words extracted from the USFD corpus

POS   Seed dictionary   P-1    P-10
A     1984              0.21   0.32
A     general           0.02   0.04
N     1984              0.20   0.38
N     general           0.02   0.05
V     1984              0.15   0.33
V     general           0.01   0.02

4. Conclusions

In terms of new information added to the available parallel data, the whole process of extraction using a big seed dictionary was costly and almost futile: too little information for too much work. We already had a lot of parallel corpora from different domains and we wanted to extract new information from a comparable corpus which was quite general (the News domain). However, our experiments showed that when a small seed dictionary is the only one available and the comparable corpus lies far outside the dictionary's scope (news versus prose dated from 1949), the procedure is recommended, either for domain adaptation or for under-resourced language pairs, as explained in the introduction.

References

Abdul-Rauf, S., and Schwenk, H. (2009). Exploiting Comparable Corpora with TER and TERp. In Proceedings of the 2nd Workshop on Building and Using Comparable Corpora, ACL-IJCNLP, 46-54.
Gamallo, P. (2008). Evaluating two different methods for the task of extracting bilingual lexicons from comparable corpora. In Proceedings of LREC Workshop on Comparable Corpora, Marrakech, Morocco, 19-26. ISBN: 2-9517408-4-0.
Hewavitharana, S., and Vogel, S. (2011). Extracting parallel phrases from comparable data. In ACL: Proceedings of the Fourth Workshop on Building and Using Comparable Corpora, Portland, Oregon, USA, 61-68.
Ion, R. (2012). PEXACC: A Parallel Sentence Mining Algorithm from Comparable Corpora. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 2181-2188.
Ion, R., Ceausu, Al., and Irimia, E. (2011). An Expectation Maximization Algorithm for Textual Unit Alignment. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC 2011), held at the 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, USA.
Irimia, E. (2012). Experimenting with extracting lexical dictionaries from comparable corpora for English-Romanian language pair. In Proceedings of the 5th Workshop on Building and Using Comparable Corpora: "Language Resources for Machine Translation in Less-Resourced Languages and Domains", Istanbul, Turkey, 49-55.
Montalvo, S., Martinez, R., Casillas, A., and Fresno, V. (2006). Multilingual Document Clustering: a Heuristic Approach Based on Cognate Named Entities. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, Sydney, 1145-1152.
Munteanu, D. S., and Marcu, D. (2002). Processing comparable corpora with bilingual suffix trees. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), Philadelphia, USA, 289-29.
Munteanu, D. S. (2006). Exploiting Comparable Corpora. PhD Thesis, University of Southern California.
Paramita, M., Aker, A., Gaizauskas, R., Clough, P., Barker, E., Mastropavlos, N., and Tufis, D. (2011). D3.4 Report on methods for collection of comparable corpora. http://www.accurat-project.eu/index.php?p=deliverables
Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. In ACL-1999: 37th Annual Meeting of the Association for Computational Linguistics, Proceedings of the conference, Maryland, USA, 519-526.
Resnik, P., and Smith, N. A. (2003). The Web As a Parallel Corpus. Computational Linguistics Journal, 29:3.
Ștefănescu, D., Ion, R., and Hunsicker, S. (2012). Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012), Trento, Italy.
Tao, T., and Zhai, C. X. (2002). Mining Comparable Bilingual Text Corpora for Cross Language Information Integration. In Proceedings of KDD'05, Chicago, Illinois, USA.
Vu, T., Aw, A. T., and Zhang, M. (2009). Feature-based Method for Document Alignment in Comparable News Corpora. In Proceedings of the 12th Conference of the European Chapter of the ACL, Athens, Greece, 843-851.
Zhao, B., and Vogel, S. (2002). Full-text story alignment models for Chinese-English bilingual news corpora. In Interspeech 2002: 7th International Conference on Spoken Language Processing, Denver, Colorado, USA, 517-520.

DATABASE ON THE MEDICO-PHARMACEUTICAL TERMINOLOGY [MPHT] IN VARIOUS DISCURSIVE SPACES: ELABORATION-RELATED ISSUES*

STELIAN DUMISTRACEL, DOINA HREAPCA, LUMINITA BOTOSINEANU
The Romanian Academy, "A. Philippide" Institute of Romanian Philology, Iasi Branch - Romania
steliand@naic.ro, {doina.hrepca, lumi.botosineanu}@gmail.com

* Main consultants: Pharm. Irina DUMISTRACEL, PhD, Technical Manager, Athlone Laboratories Ltd, Roscommon, Ireland; Dan-Dom PLETEA, MD, College of Doctors Iasi.

Abstract

The elaboration of a database with systematic information on the two levels of use of the MPhT (endogenous discourse - of the specialists, and exogenous discourse - of their communication with the public that "consumes" the products of the research and of the industry in the field) is analysed starting from the following concepts and challenges: a) discourse spaces, terminological levels, and linguistic barriers from the perspective of pragmatics; b) general characteristics of the MPhT in terms of diachrony and synchrony, as terminology with high socio-cultural impact; c) the expression of the metalinguistic function of language in the texts of direct contact with the public and in specialized lexicographic works whose objective is to instruct the consumer in the field of medicine and pharmacy; d) effects of globalization: the opportunities for effective linguistic contact between the specialist and the public in the MPhT field; e) the necessity and the possibility of elaborating a database of the type "Medico-pharmaceutical Security Glossary", in order to make the specialized communication with the public more effective and to protect the consumer's rights.

Keywords: medico-pharmaceutical terminology, discursive spaces, terminology levels, linguistic barriers, socio-cultural impact

1. Introduction

1.1. A working project

The elaboration of a database comprising systematic information that distinguishes between the two levels of use of the [MPhT] (endogenous discourse - of the specialists, and exogenous discourse - of their communication with the public that "consumes" the products of their research and of the industry in the field) is analysed starting from the following concepts and challenges:

[1] pragmatic-discursive spaces; terminology levels and linguistic barriers from the perspective of pragmatics; communication contract;
[2] general characteristics of the [MPhT] in terms of diachrony, as terminology within a field with high socio-cultural impact;
[3] the expression of the metalinguistic function of language in texts of direct contact with the public and in specialized lexicographic works whose objective is the training of the consumer in the field of medicine and pharmacy;
[4] effects of globalization: the opportunities for effective linguistic contact between the specialist and the public in the MPhT field;
[5] the necessity and possibility of elaborating a database of the "Medico-pharmaceutical Security Glossary" [MPhSG] type, in order to make the specialized communication with the public more effective and to protect the consumer's rights.

These are, in fact, the matters we shall treat in the present paper.

1.2. Previous research

The starting points of this approach are the monographs published by Stelian Dumistracel and the articles written in collaboration and published in the past two years, as well as several papers with the same status, presented during various scientific events (currently published). We summarize them briefly in the following lines. They are thematic approaches to pragmatic problems in the relationship between the specialist and the consumer in the medico-pharmaceutical field. These problems concern the diastratic variation, meaning the differences at the level of professional language and particularized discourses, which depends, firstly, on the degree of education and, secondly, on the general scientific education of the speakers (socio-cultural differences). They also concern the diaphasic variation, which involves the different ways of expression in terms of communication performance, including the expressive modalities per se (disease and treatment carry various psycho-linguistic implications).
We have paid attention to text analyses of selected publications from various epochs, which underline the application of a programme that answers the demands of competence expression regarding the diastratic variation. As for the markers of competence in diaphasic variation, we have paid due attention to the "paratext" structures (prefaces, various types of explanatory notes, etc.). As personal and strictly specialized bibliography, we refer to the following titles: (Dumistracel, 2000); (Dumistracel, 2006a); (Dumistracel et al., 2011a); (Dumistracel et al., 2011b); (Dumistracel et al., 2012).

2. Concepts and terminology

2.1. Pragmatic-discursive space

We use the syntagm pragmatic-discursive space ([PDS] hereinafter), starting from certain concepts coined by Dominique Maingueneau, who distinguishes the discursive space (an element of the triad that also comprises the discursive field and the discursive universe) as the ideological positioning (identity) of the enunciator. The cited author acknowledges the relation between these concepts and the concept of «champ scientifique» coined by Pierre Bourdieu and developed in the study with the same name (Bourdieu, 1976). Generally, our new approach is determined by the fact that the cited starting point refers especially to ideology, such as philosophical schools or political currents (cf. Charaudeau - Maingueneau 2002: 97; 453-454). On the other hand, the PDS concept aims at being distinguished from the general concept of «(discursive) field», which is confrontation-oriented in various areas of spirituality and of the social field. Other works by Pierre Bourdieu that reflect this area of preoccupations are entitled Le champ politique or Le champ religieux dans le champ de manipulation symbolique; see also Le champ journalistique et la télévision [the title of a series of TV shows, 1996].
Starting from Bourdieu, Joseph Jurt argues for the symbolic concept of «champ littéraire», in a study referring to the theory of the literary field and to the "internationalization" of literature (cf. Jurt 2001: passim). In general, when we study the PDS concept we focus on the aspects of pragmatics, as mentioned below. More precisely, we underline the discourse adjustment made by assessing the characteristics specific to the communication setting and to the personal data of the interlocutors. We have developed our concept based on information regarding communication spaces and registers (Dumistracel, 2006b) and on the concepts of «linguistic competence» and «linguistic variation» defined by Eugeniu Coseriu (Dumistracel et al. 2012).

[Figure 1. Pragmatic-discursive space (© Dumistracel et al.). Diagram with two branches: A. Objective data (communication space: public, socio-professional, personal; communication register) and B. Subjective data (linguistic competence: idiomatic, expressive; linguistic variation: diatopic, diastratic, diaphasic).]

For a brief analysis of Fig. 1, we mention the following coordinates and components. Firstly, two communication coordinates are illustrated: the objective data (A) and the subjective data (B). The category of objective data comprises the communication situation within a given space - the public, socio-professional and personal space. The verbal, paraverbal, and nonverbal correspondents (adequate registers) are related to these spaces: formal public register, informal public register, and familiar register (the last one has also evolved towards an intimate or even ludic register).
Regarding the subjective data of communication, related to the personal aptitudes of the interlocutors, the first reference is to their linguistic competence, mainly to the idiomatic competence (how well the individual knows a language). The second reference is to the expressive competence, which represents the performance "in given situations and concerning certain things, with certain interlocutors" (Coseriu, 1992-1993), hence the adequacy to the communication situation, to the theme of speech, and to the interlocutor. In other words, this means the capacity/performance of the emitter to place himself on the same level of idiomatic and expressive competence as the receptor.

2.2. Terminological levels

We have launched the syntagm terminological levels (Dumistracel, 2000: passim; cf. also Dumistracel et al. 2011a: I, § 1-2) with reference to the distinct levels of discourse specific to communication in a specialized field, such as the medico-pharmaceutical one, in terms of endogenous and exogenous discourse (see, above, § 1.1). The starting point in this matter was the analysis of the inventory of terms in the Vademecum published by Gheorghe Danila (1999). Of the 4,500 entries, only approximately 1% represent terms that can be accepted generally as known and used by a trained public (for the results of the detailed analysis, see also Dumistracel et al. 2011b: § 2). In terms of pragmatics, it is worth making a general distinction between terminologies of exegesis per se, in fields accessible only to specialists (astronomy, mathematics, chemistry, linguistics, etc.), and terminologies in fields with high socio-cultural impact, of interest to both exegesis and the public. In this second category - where two terminological levels function permanently - one can find, for instance, the legal and administrative terminology, the terminology of religious cult and, the most significantly illustrated, the medico-pharmaceutical terminology.
Obviously, for our country, one could add the banking and Internet terminology, and maybe other fields.

2.3. The issue of linguistic barriers

Considering the above-mentioned aspects, the existence of the specialized level in the terminologies of the second category - among which the one used for the communication between the specialist (doctor or pharmacist), on one side, and the patient, on the other - leads to the creation of true linguistic barriers. They result from the social, cultural, status, role, strategic, emotional, etc. differences in society, seen as barriers.

2.4. Communication contract

The presented elements point out the communication setting and the factors that govern this action. To assess it as a whole, one can start (with good results) from the concept of «communication contract», a development of the «reading contract» concept, imagined by Eliseo Veron as a "communication relation" (Veron 1997: passim). We present some brief data on the subject. The communication process involves three basic factors: [a] the image of the one who sends a message, the place that he ascribes to himself concerning what "he says"; [b] the image of the one to whom the message is destined, the place ascribed to him; [c] the relation created, based on these images, between enunciator and addressee. In fact, the contractual definition of speech, which means the existence of two subjects in a relation of intersubjectivity, has known various formulations, for convergent visions: "intersubjectivity" (Benveniste), "dialogism" (Bakhtin), "collective intention" (Searle), "joint intentionality" (F. Jacques), "negotiation" (Kerbrat-Orecchioni); for an overall presentation, cf. Dumistracel 2006a: 36.

3. General characteristics of [MPhT] from a diachronic perspective

3.1. MPhT characteristics in various stages

The characteristics of the [MPhT] at the beginnings of its constitution within the realities of the national culture were analyzed by N.A. Ursu, in a monograph dedicated to the formation of the Romanian scientific terminology. As an essential starting point, we note the translation from French and German, in the second half of the eighteenth century and the first half of the nineteenth century, of various specialized texts. We refer here to short treatises on general medicine and on balneology, as well as brochures with instructions for epidemics (smallpox, plague), on the cure of diseases, or on general norms of hygiene (Ursu, 1962). Within this framework, books of the "house doctor" type, dealing with sanitation, hygiene education and treatment, occupy a special place, considering the lack of specialized practitioners. In the phase 1760-1860, but also later, accessibility in terms of communication was involuntarily ensured for the readers of these publications (other than specialists) by the constant presence of loan translations (which represent "transparent" lexemes and syntagms) and mostly of folk medical terms, besides the neological loans per se. For the current analysis, we have identified a seemingly paradoxical aspect: the linguists who studied the field from the perspective of the history of the language of culture show (secondarily, of course) a certain dissatisfaction with that mixture. The main reason was that it prolonged the process of creating what would become the terminology of the modern Romanian literary language, which, aspiring toward the level "of the endogenous discourse", proved to have eliminated the loan translations to a great extent and to have decisively got rid of the folk terms.
Another aspect that draws attention - from the same perspective of the study - is the classification (with little differentiation) of all translations done at the end of the eighteenth century and the beginning of the nineteenth century as "books meant to popularize scientific knowledge". In fact, besides such publications and other manuals, publications in the fields of medicine and pharmacy are actually books for medical and sanitary education or instruction per se, thus involving an exogenous discourse (see especially the printings of Ștefan Vasile Episcupescul; Dumistracel et al., 2012: I, § 5.3.1).

3.2. Communication perspective

There are not many studies on the specifics of medical terminology in terms of communication; the best-known study in Romania is the chapter written by Christian Baylon and Xavier Mignot, entitled Limbaj si comunicare medicală (Medical language and communication), part of the monograph Comunicarea (Paris, 1994, translated into Romanian in 2000). Besides general issues, such as "the medical language" and "the medical information", the chapter deals especially with aspects such as "the doctor-patient communication" (hence, on the level we call "exogenous discourse") and "the written communication among doctors" ("the endogenous discourse"). It is worth mentioning the study of the so-called "cryptic function" of the medical language (not of the respective terminology, which has been encrypted anyway since the beginnings of "the division of labour", besides empirical cures, through magical and occult practices).
This function refers both to the use of jargon as professional language in medical practice and of argot as a communication strategy, aimed at keeping the patient from understanding certain (at least) unpleasant aspects (a cryptic function in action), and to the "technical character" of language, as a "potential cryptic function", which gives a "potential power" to the user. Hence, the two authors also take into account the level of language, as well as that of the medical discourse, depending on the information transmission (Baylon & Mignot, 2000).

4. The expression of the metalinguistic function of language

4.1. Stages in the study of diastratic variation

Firstly, we shall outline aspects regarding the diastratic variation in exogenous discourses, in three phases: a) the first half of the nineteenth century; b) the first half of the twentieth century; c) a recent particular specialized work. Secondly, we will refer to the interest in communication with the public in current highly specialized lexicographic works (representing the endogenous discourse, with a minimum opening towards the public).

Books for medical and sanitary instruction and "house doctor" type of dictionaries

The following paragraphs comment on the idea of "hygiene-medical-sanitary" instruction starting from a Western model from the end of the eighteenth century, which illustrates the application of the Enlightenment orientation towards mass dissemination of scientific knowledge. We are talking about a "house doctor" in German, printed in Leipzig. The - very instructive - title reads as follows: Immanuel Stange, Der Hausarzt oder Anzeige der bewährtesten Hausmittel, und Anweisung sie zur Verhütung oder Heilung der Krankheiten gehörig zu gebrauchen. Ein Handbuch für Landgeistliche, Hausväter und andere Personen, die an Orten leben, wo kein Arzt ist (Leipzig, Liebeskind, 1797), which means: "House doctor or advice on the best known remedies and indications on preventing and curing diseases. A handbook for country priests, heads of families, and for other persons who live in places where there is no doctor".

We find truly remarkable the intuition regarding the necessity of using a language adjusted to the various categories of readers in the works - translations or original works - of a pioneer of Romanian medical publications, the Walachian doctor Ștefan Episcupescul. He is the author of unsigned translations from Greek, which were attributed to him, and of no fewer than five printed books, which appeared between 1829 and 1846 (for a detailed analysis see Dumistracel et al. 2012: part II). As we cannot discuss here in detail the various means through which this author manifests his competence in diaphasic variation, we shall only cite the facts that illustrate the concern for the differentiation of the text depending on the instruction of the reader categories (diastratic variation). Hence, we shall present certain facts from Episcupescul's book entitled Practica doctorului de casă. Cunostinta apărări s'a tămăduiri boalelor bărbătesti, femeesti si copilăresti [The Practice of the House Doctor. The Knowledge of Preventing and Curing the Diseases of Men, Women, and Children] (1846).

Figure 2. Title page of the book written by Ștefan Vasilie Episcupescul, Practica doctorului de casă [House Doctor Practice]

To get an overall picture, firstly we find warnings from contact texts (with a paratext status).
After the title page, we find out that the volume is printed "for the Doctor and the people" and that it is "elaborated for the health benefit of the community", an idea also detailed through a distinction within the paratext representing a "dedication" to "Barbu De Stirbeiu", great ban (governor) and "knighted to various orders"; we present the mention below: "The book comprises, My lord!, a presentation suitable for all the categories of our society: noble, urban, and rural, with the simplest indications and the easiest means for any sick person to replace a doctor if necessary" (op. cit., p. VI; our italics). The idea of adapting the text depending on the specialized reader and on the public is illustrated in the general presentation of the book contents (p. XLVII sqq.): "for the people" [a], which represents various practical advice, and "for doctors" [b], "the medicine theory and practice", etc. We illustrate the conformation to this communication option through titles from the "Scara cuprinderii cărții" ["The book contents"] (p. 509; we have not translated the quotations, as the names of the diseases are transparent as neologisms):

[a] "for the people": "oranduiala imbracamintei", "~ hranei", "taina impreunarii", "zamislirea pruncului", "ingrijirea lahuzii", but also "epizootikon, veterineria" (terms indexed by "cresterea si tinerea sanatatii dobitoacelor" - see p. 87-88);
[b] "for doctors": "Terapia firii: stenia si astenia sanatatii"; "Terapia metodica: boala stomahului - morbus gastricus"; "~ racelii - refrigheratio"; "~ urechii - otitis"; "~ buboiului si a sugiului - furunculus panaritium"; "~ pubertatii - hlorosis si nostalghia"; "~ intunecimii lintei ochilor - cataracta", etc. (see p. 102 sqq.).

Figure 3. Title page of the book written by Vasile Bianu and Ioan Glavan, Doctorul de casă sau Dicționarul sănătății [House Doctor or Health Dictionary]

Even when describing the treatment for various diseases, in the texts "for doctors" there are natural alternations between the technical terms of the profession, which are frequently loans or translations from Greek, and the folk medical terms or common language words with special meanings in the communication regarding the care for the ill. For instance, "Varsatul spuzos - scarlatina"; "Boala sângeraturii matcii - menoraghia" [the Greek-based version for Menorrhagia 'condition of the uterus...'] (Episcupescul 1846: 243; 271). All these prove the special gift as a communicator of the doctor Ștefan Vasile Episcupescul.

For the first half of the twentieth century, an example of performance regarding the application of the diastratic competence is represented by another "house doctor" book, which employs the syntagm even in the title. We refer to the dictionary published by two doctors, Vasile Bianu and Ioan Glavan: Doctorul de casă sau Dicționarul sănătății [House Doctor or the Dictionary of Health], a work awarded by the Romanian Academy. The references in the present text concern the second edition of the book, published in 1929 (a massive volume of 804 pages, in 25 x 20 format). A true bestseller, the dictionary reached its fourth edition in 1942. A first necessary mention is that the two authors - famous specialists - also had important contact with various segments of the public. Bianu was also a military doctor (which helped him know the folk terminology for diseases and treatments in this field) and, concerning his public career, he was also a deputy (he dealt with issues related to sanitation education). In his turn, Glavan, a university professor, published a significant number of works in the field.
In order to illustrate the concern for efficient communication with the readers, we shall only present the lexicographic correlation between entry words representing neologistic technical terms and folk terms in a relation of synonymy, in Tables 1 and 2, with material taken from Bianu - Glavan 1929: passim.

Table 1: Synonymic correspondences for terms belonging to science and therapeutics
acne - cosi, funigei;
cataracta - perdea;
cefalalgie, cefalee - see durere de cap ("durerea acuta de cap se cheama cefalalgie si cea cronica se numeste cefalee");
constipatiune - incuiere, incuietura;
diabet - boala de zahar;
diaree - cufureala, esire afara, pantecarie, partuica, treapad, urdinare;
fisuri - crapaturi sau pleznituri la sezut (la anus);
fortifiante, intaritoare - see tonice;
idiotie (see this word) - nerozie, imbecilitate - un grad mai mic de prostie;
intoxicatie - see otravire;
laringe - beregata, gatlej, rasuflatoare;
ocluziune, ~ intestinala - or incurcatura de mate (see intestin);
placenta - casa copilului;
scabie - see raie;
strangulare - gatuire, sugrumare;
tuberculoza - atac, tusa seaca, ftizie, hectica, oftica

Table 2: Synonymic correspondences for folk medical terms
abuba - abces alveolar;
mat - see intestin;
bale - see saliva;
buline (colloquial) - capsule;
caldura - febra;
ciuma - pesta, pesta bubonica, pesta orientala;
curatenie - purgativ;
dropica - see idropizie;
inmoiere de creieri - ramolisment cerebral;
lesin - see sincopa;
nebunie - see alienatie mentala;
pogana - see afte;
rac - see cancer;
sapunas, sapunei - see supozitor;
soare sec, soarele in cap - see congestiune si insolatiune;
sucitura - see entorza;
vitriol - see sulfuric (acid)

However, the correspondences illustrated in Tables 1 and 2 are not the only object of the concern for eliminating or, at least, attenuating the linguistic barriers related to the
diastratic competence of the readers of the Dictionary. For instance, we find synonymy between neologisms of various ages (!), such as "'fin - see influenza", "tablets or plates" (the second meaning is out of use). On the other hand, the attention paid to diatopic variation is reflected by the presentation of the synonymic correspondence between words within folk speech, such as "sapunei - see odogaci" or "masalarita - see nebunarita", as well as by the richness of regional synonyms for names of plants of interest for treatments and diets. For instance, for the term potato there are no fewer than 15 equivalents (a serious competition for the inventory of the Academy Dictionary and of a dictionary of synonyms): "Cartofi, bandraburce, baraboi, barabule, bologeane, cartoafe, crumpene, crumpeni, crumpiri, grumciri, hadeburce, mere de pamant, picioci, piciorca, poame de pamant, fermer (Solanum tuberosum, fam. Solaneelor)".

Figure 4. First cover of the book 1000 de boli pe intelesul tuturor [1000 diseases in plain language], vol. II

The few elements mentioned above illustrate the competence of the authors of the analyzed House Doctor regarding diastratic variation in terms of [MPhT] and they explain, of course, the success of their book with the public. The idea has been resumed nowadays, especially regarding alternative treatments and naturist medicine. For instance, Doctorul de casa (Bucharest, Rom Direct Impex, 1994) is a translation of J. Frank Hurdle, A country doctor's common sense health manual (1975); "The house doctor" or "The doctor of the house" are also titles of blogs1. The qualities of the works briefly presented in § 4.1 are also significantly highlighted through a comparison with a specialized dictionary recommended, from an editorial perspective, as a largely accessible work. Hence, the translation from French - made by "Dr.
Cosmin Pop" - of the work of the two French authors, Ch. Prudhomme and J.-F. d'Ivernois, with a rather plain original title, Connaitre et comprendre 1000 maladies de A a Z (Paris, 2009), is presented in "the reading threshold" constituted by the inscription on the first cover. The cover "overestimates" the competence of the addressee in terms of "knowing things" - "in plain language" and "medical encyclopaedia indispensable to the family" - through formulas meant to seduce the buyer. While the text on the cover might have been a mere editorial strategy, the declared intention is also present on the title page itself: "1000 de boli pe intelesul tuturor" ["1000 diseases in plain language"]. However, this is by no means equivalent to "connaitre et comprendre" (we shall not discuss here other comparable appealing formulas comprised in the texts that appear on another "reading threshold", the fourth cover of each of the two volumes of the Romanian version of the book). We have detailed the aspects mentioned above considering that, at first glance, "1000 diseases" could be considered a current counterpart of the Bianu - Glavan dictionary, which, even mutatis mutandis, does not correspond to reality. Most entry words represent scholarly technical terms, such as choanal atresia, erythrasma, chronic subdural hematoma, polymyositis, tularaemia, etc., while the description of the disease usually belongs to the same register. For instance, Verucile [the Warts] [D 7]: „tumori benigne ale pielii provocate de virusuri din familia papilomavirusurilor umane (HPV)..." ["benign skin tumours caused by the human papillomavirus (HPV)..."]. There are also simple correspondences probably representing differences brought by various medical schools: dracunculoza sau filarioza de Medina [dracunculiasis or Medina worm filariasis], otita cronica colesteatomatoasa sau colesteatomul [chronic otitis cholesteatomatosa or cholesteatoma].
The cited examples illustrate the status of the targeted readers: specialized doctors or medical students (such situations are also present in Bianu - Glavan 1929: passim, but they are much less frequent). However, one may still identify various degrees of accessibility in the case of the diseases whose names are neologisms that have become part of the common lexicon, such as angina [angina], bronsita [bronchitis], cancer [cancer], diaree [diarrhoea], rujeola [measles], sancru [chancre], tuberculoza [tuberculosis], etc. For some of them, there are, sporadically, folk correspondences; for instance, "Antraxul sau (vezi) carbunele" ["Anthrax or (see) coal"], a word (erroneously alphabetised in the entry list) under which the disease is described or, in a more complicated way, „Lobstein (boala) sau boala oaselor de sticla sau osteopsatiroza" ["Lobstein (disease) or brittle bone disease or osteopsathyrosis"]. There are also other situations of equivalence. For instance, „bot de iepure sau (vezi) palatoschisis" ["harelip or (see) palatoschisis"] or „paduchi sau pediculoza" ["lice or pediculosis"], but only in the Index; however, the index is not well elaborated, because it does not indicate, technically, which of the elements within the synonymic series is the entry word and which is the variant. For instance, in the above-cited cases, the scholarly term is the entry, and the folk variant is only some sort of commentary (we shall not discuss here other technical dysfunctionalities).

1 cf. http://healthyl3-annelisse.blogspot.ro/ or http://www.gustos.ro/articole/sfaturi-practice/aloe-vera-doctorul-casei.html
Finally, cases such as „boala somnului sau tripanosomiaza africana" ["sleeping sickness or African trypanosomiasis"] or „viermele solitar sau (vezi) teniaza cu taenia saginata" ["tapeworm or (see) taeniasis with taenia saginata"] seemingly belong to the same category. It is obvious that such an approach actually ignores the common reader, interested in a diagnosis and/or a treatment, but the analyzed dictionary has incontestable merits for the reader familiar with the endogenous discourse. However, we have studied "1000 diseases in plain language" because this encyclopaedia-like lexicon is, to a certain extent, a pale counterpart to works that openly claim the profile of a strictly specialized dictionary, such as the one we study hereinafter. We do note, however, that Rusu 2012 also treats with interest the issue of folk medical terminology.

4.2. A reliable dictionary

The competence regarding the diastratic variation in the linguistic relation between the doctor (pharmacist) and the beneficiary of specialized services is convincingly illustrated by several precepts formulated by the authors of the dictionary Rusu 2012. An example is the following: "From a pragmatic perspective, getting to know the medical terms in circulation [on the level of common and folk speech] can serve a better communication between doctor and patient". Below, we cite the presentation of the realities of medical practice: "The doctor-patient dialog takes place on two distinct language levels [this is another formula for what we call «terminological levels»]: the doctor needs precise, mainly anatomical terms, to locate the disease, as well as a clear expression of the symptoms. The patient may indicate the location and characteristics only approximately or in a totally different verbal code".
There is a discussion on the issue of the so-called "exaggerations", which complicate the communication process: the manifestation of shyness, but also of "vulgarity" (which should be considered with "tolerance"). Under these circumstances, "the use of folk medical terms for both speakers" is required, and "Experienced doctors spontaneously adapt the language to the patient's age, profession, reserve, or, on the contrary, the behaviour to the limit of mutual respect, the lack of confidence expressed by the patient" (Rusu 2010: 1433; see also the references to the need to (re)humanise the medical act through the dialog with the patient, in the Introduction to the fourth edition, p. 19). See also the criticism of what the author of the preface to the fourth edition, Dr. Gabriel Ungureanu, calls "the aggressive invasion of anglophone terms" in the past few years, as well as of "the irritating filling of the medical language with Americanisms", sometimes "out of pure intellectual snobbism" (op. cit., p. 9). Thus, in the most demanding guide to scientific medical terminology in our country - where there are numerous strictly specialized sections that we cannot enumerate here for reasons of space -, as Rusu (2010) represents, in fact, the contemporary higher level of the [MPhT] belonging to the endogenous discourse, given all the possibilities of expression for the potential cryptic function (cf. § 3.2), the folk terms are considered very important for performance. By indicating their scientific correspondents or their meaning, these terms are made available to the specialists in a glossary that comprises over one thousand entries. These conclusions of professional common sense, after all, appear in the introduction to the Glossary of folk medical terms, whose presence is motivated in a highly persuasive manner from the perspective of the imperatives of professional communication.

Excursus.
A brief assessment allows us to appreciate that a recent publication on the Romanian book market, Dictionar medical ilustrat [Illustrated Medical Dictionary], the translation (planned to be published in 12 volumes) of the Italian original SALUTE. Dizionario medico (Milan, RCS Quotidiani, 2006), represents - considering its intention - a welcome compromise regarding the addressee. In parallel with the (moderated) endogenous discourse, there are (not only within the articles per se) sections which have in view the pragmatic-discursive space of the presumptive patient; certain subdivisions of the articles are even accessible to readers with a certain level of instruction and an average cultural background. A simple enumeration of certain types of article substructure highlights this communicative opening; type [1]: a) the generally accessible definition of the entry word and, in parallel, in a special case, "Prophylaxis"; b) treatment; type [2]: a) definition; b) causes; c) symptoms; d) diagnosis; e) treatment (cf. Dictionar 2013: passim). However, it is obvious that such a dictionary is far from representing a... competition for an information tool of the [MPhSG] type.

5. Effects of the globalization in the [MPhT] field

5.1. Categories of specialized terms

For the study of the linguistic barriers that emerged - mostly in the past decades - through the globalization of industry and of the pharmaceutical market, as a result of international enactments also adopted in Romania and with important linguistic effects, the issue of three categories of names requires further investigation:
- [a] the denomination issue in terms of "inventing" the drug/medicine (= M);
- [b] the denomination issue in the phase of prescribing M within the doctor-patient relationship;
- [c] the issue of the denominations used when selling M in pharmacies.
In order to determine the level of communicative performance of the [a] category of texts, representing the "invention" and "launching" of M, linguistic analysis must consider the effects of terminological regulations, referring to the following criteria:
- [α] innovative M,
- [β] generic M, for which a study of special interest is that of the so-called "umbrella terms".
The study of the performance of the term by the internal criteria of the field requires considering the text of the laws regarding the "visibility tests" and, on the other hand, official texts such as Ghid privind denumirea medicamentelor de uz uman [Guideline on the trade name of medicinal products for human use] (2008), etc. The limitations of encoding the "invented" term may be tracked down by interpreting the interdictions within the "Decision tree" scheme, elaborated according to the instructions of the World Health Organization (WHO), which we present in Figure 5. On the one hand, the term can be rejected if there is a possibility of identification with similar medicines, that is, a similarity in the written and/or verbal form, considering the medical context and/or the conditions of use and/or the administration route. On the other hand, it can be rejected if - regarding the remedies within the same or a different therapeutic class - the invented term provides indications on a public health issue. Furthermore, it can be rejected if there is a possibility of identifying a similarity taking into account the medical context and/or the conditions of use and/or the administration route (we underline the formulations that we have emphasized with italics!). We are dealing here with "branding" and "marketing" issues.
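The rejection criteria described above can be sketched as a small screening routine. This is only an illustrative sketch under stated assumptions, not the procedure used by the WHO or by any national agency: the similarity measure (Python's `difflib.SequenceMatcher` applied to the written form only), the threshold of 0.8, and the function names are all hypothetical. The caller is assumed to have already restricted `existing_names` to medicines with a comparable medical context, conditions of use, or administration route, and to supply the judgment on public health indications.

```python
from difflib import SequenceMatcher


def written_similarity(name_a: str, name_b: str) -> float:
    """Return a ratio in [0, 1] for the written forms (case-insensitive).

    Assumption: a generic string-similarity ratio stands in for the expert
    assessment of similarity "in the written and/or verbal form".
    """
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def assess_invented_name(candidate, existing_names,
                         public_health_concern=False, threshold=0.8):
    """Return "ID rejection" or "ID acceptance" for an invented denomination.

    existing_names: names of medicines already judged to share a comparable
    context with the candidate (hypothetical pre-filtering by the caller).
    public_health_concern: reviewer's judgment that the name provides
    indications on a public health issue.
    threshold: hypothetical cut-off, chosen here only for illustration.
    """
    if public_health_concern:
        return "ID rejection"
    for name in existing_names:
        if written_similarity(candidate, name) >= threshold:
            return "ID rejection"
    return "ID acceptance"


# Example with names mentioned in this paper, used purely as sample strings.
decision = assess_invented_name("Hemorzon", ["Urinal", "Calmocard"])
```

A real assessment would also have to cover the verbal (phonetic) form, which a character-based ratio does not capture.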
Approach on the issues related to international common denominations (ICD) within the proposed invented denominations (ID)

1) The similarity between an invented denomination and an ICD (of the medicine in question or a different ICD) - in the written and/or verbal form - taking into account the medical context and/or the conditions of use and/or administration route of the medicines in question is treated as follows:

DECISION TREE
Proposed ID: Marathon/Viagra
ICD of the medicine in question: Sildenafil citrate
Similarity with the ICD of the medicine in question - identification of similarity in the written and/or verbal form: YES - ID rejection; NO - ID acceptance
Similarity with a different ICD - identification of similarity in the written and/or verbal form taking into account the medical context and/or the conditions of use and/or the administration route: YES - ID rejection; NO - ID acceptance

Figure 5. Decision tree proposition for the medicine Marathon/Viagra

For the [b] phase, the research on the respective terminology considers the existence and practical functioning of the regulations imposed on the family doctor, concerning the concept "Brand vs ICD" [= International Common Denomination], comprised within the documents:
- prescriptions following the ICD norms;
- medicine prescriptions according to a "framework agreement" (for instance, that of 2011);
- "pharmaceutical rules of good practice", etc.
The study of legislation referring to the denominations of medicine leads to the conclusion that the respective legislation - though one cannot directly accuse it of exacerbating a "cryptic" function of language - imposes the use of a foreign and even rebarbative terminology for the Romanian consumer.
This is why the investigation - in terms of knowledge of the pragmatic-discursive space - proposes the following objectives:
- the concrete manner of social functioning of the current legislation on the "international" denomination of the medicine;
- the procedures for decoding the terms within the endogenous discourse in the functioning of the relation "trade name" vs "ICD".
For the [c] phase, the [MPhT] research involves [α] the doctor-pharmacist relationship, as endogenous discourse, a theme rarely discussed so far, and [β] the study of the relationship between the pharmacist and the beneficiary of the treatment, as exogenous discourse. This relationship takes place where the medicines are commercialized, but this issue has been almost absent from Romanian pragmalinguistic research; it is, however, especially interesting for the elaboration of the work we shall refer to in § 6. Concerning the [α] component, some of the aspects are studied, for instance, in a paper signed by Dr. Rodica Chirculescu, Relatia doctor-farmacist, colaborare si raspundere asumata [Doctor-pharmacist relationship, collaboration and assumed responsibility]2. On the same site3, there is an article on the [β] component, entitled Principiile comunicarii farmacist-client [Principles of the pharmacist-client communication], signed by Anda Pacurar4. There is no doubt regarding the need to study this general issue, as well as to elaborate the planned "Glossary", mostly because there are absurd perspectives, such as the one presented within an interview (with Dr. Cristian Carstoiu), entitled precisely Comunicarea farmacist-pacient [Pharmacist-patient communication], published in "Practica farmaceutica" [Pharmaceutical Practice] (vol. V, no. 3-4, 2012, p. 130-132).
In this interview, the communication barrier issue - approached superficially and unprofessionally - is simply dismissed in a sentence that, "defending" the pharmacist, invokes, besides the lack of time, the difference in education. The sentence reads as follows: "often patients simply do not have the necessary knowledge". Of course, aspects that are even more... human - though general - are considered: "For advice, one should use a language adjusted to the degree of information of a patient, and it should avoid specific terms"5.

5.2. Structure of contemporary denomination

We cannot propose to analyze the strictly current specialized nomenclators; a brief presentation in this sense was included in Vademecum medicamentorum, published by Gheorghe Danila (cf. § 2.2). Compared to the 1% value of the "transparent" names of medicines for the public with medium education, estimated based on the cited source, the percentage is even lower nowadays. We make this statement considering the fact that the Romanian [MPhT] has been radically changing, a characteristic of the globalization of medicine production and commercialisation, with a focus on the existence and functioning of the two above-defined distinct terminological levels regarding the communication between emitter and receptor (cf. § 2.2). The reference is made to the endogenous discourse, which has become internationalized, with effects on the exogenous one. On the other hand, it is also interesting to study the effects of using a so-called pharmaceutical lingua franca, based on the scientific terms for the active principles of medicines.

2 cf. http://www.pharma-business.ro/oportxmitati/relatia-doctor-fannac asumata.html
3 www.pharma-business.ro
4 http://www.pharma-business.ro/oport
5 http://heppyportal.projectize.eu/database/publications/publication_45_ro.aspx
Concerning this system, insurmountable linguistic barriers have started emerging in terms of common communication (we could consider a profile previous to the one illustrated by Hepites 1862; cf. below, § 6.2).

5.3. The issue of elaborating a Medico-pharmaceutical Security Glossary

Our interest - during the planning stage for an overall investigation - in the elaboration of a "Medico-pharmaceutical Security Glossary" also concerns the diversion that alternative treatments [AT] represent for the medicines per se ("allopathic", in specialized terms), a real competition in terms of the contact with the users. These treatments concern the so-called naturist products, "nutritional supplements" or "food supplements", present in pharmacies. The pharmacies represent a trade setting where an image transfer occurs in favour of the products in the [AT] class. In the area that we are referring to, transparency is not prohibited, regarding either the verbal or the iconic message; both act directly and effectively through the contact texts represented by packaging, leaflets, and advertising. (In this case, we are not talking about prescriptions, which constitute contact texts within the strictly interdisciplinary discursive space, because the prescriptions use only the terms for the active substances, and not the trade names.) Without providing any more details (however, see Dumistracel et al. 2011b), we shall present several types of "hyper-transparent" terms on the [AT] level. In pharmacies, there are products whose names - in the contact texts such as packaging and advertising - are transparent, in the sense that they are relatively easily associated, on a certain level of idiomatic competence, with terms for diseases, treatment, or even substances. For instance, Colonhelp, Urinal, Acneogel, Hepatobil, or Calmocard (with three layers of the reception level through the text on the packaging: „Calmocard [1]. Calmant cardiac [2].
Contribuie la buna functionare a inimii [3]" ["Calmocard [1]. Cardiac analgesic [2]. Contributes to an optimal heart functioning [3]"]). The same opening toward immediate acceptance goes for medicines from Calmoplant, Larvalbina, Tutunstop, to "Hapciu" („Ceai Hapciu" ["Hapciu Tea"] and „Trusa Hapciu - un tratament natural contra racelii si gripei" ["Hapciu Kit - a natural treatment against cold and flu"]). Nevertheless, there is a significant distance between the possibility of deciphering the name of a certain medicine (though it may be semitransparent), Hemorzon, and the name of... a competitor, a "nutritional supplement", HemoroEasy (pronounced approximately as hemoroizi [haemorrhoids, in Romanian]; the commercial is resounding: "HemoroEasy cures you when you have hemoroizi").

6. Elaboration of a [MPhSG] database

6.1. Antecedents on MPhT decrypting

A sui-generis opening toward deciphering the [MPhT] can be traced back to the emergence of the first Romanian pharmacopoeia, the one published by Constantin C. Hepites in 1862. In the specialized literature, it is defined as a bilingual presentation - in Latin and Romanian -, which itself constitutes a significant step toward "democratization" in terms of communication in this pragmatic-discursive space. Hence, compared to the traditional scholarly nomenclators, which show the prestige of Latin in the sphere of sciences, in Hepites 1862 the specialized information related to various remedies (constituting the so-called "monographs") is presented in two columns: the first (a smaller text) is in Latin, and the second is in Romanian. However, we are especially interested in the linguistic transparency, which begins in the very title of the monographs, an efficient area of the paratext: the names of the remedies are provided not only in Latin and Romanian, but also in French and German. This way, this nomenclator actually becomes a multilingual one.
We illustrate this view through the titles of two remedies based on «deer antler» (the spelling is the original one): "Cornu Cervi, Raspatum - Cornii de cerbu, Rasatura - Gall. Corne de cerf râpée, Germ. Hirschhorn geraspeltes" and "Cornu cervi ustum - Cornii de cerbu arsii - Gall. Corne de cerf brûlée, Germ. Weisgebranntes Hirschhorn" (Hepites 1862: 69). There is additional information in this sense, discovered after a minute research, regarding the animal pharmaceutical remedies, within Hepites' pharmacopoeia, present in the collection of the Museum of Pharmacy History in Sibiu. In that period, the pharmacists' interest concerned the substances used to treat diseases. For instance, "Castoreum - Castoreu", "Cetaceum - Spermaceti!", "Ossa Sepiae - Osse de sepii" (Hepites 1862: 48, 51, 126); or: "Cancrorum lapides (ochi de raci), Conchae (scoici), Fel bovinum (fiere de bou), Ichthyocolla (clei de peste), Sebum ovillum (seu de oaie)" (Toma et al. 2012: passim). N.B. "Bila de bou" is still used today, and the substance called castoreu was registered as an antispasmodic remedy and as an emmenagogue in Bianu - Glavan 1929: s.v.

6.2. Downsides of leaflets

We have seen the low effectiveness of the presence - on the book market - of term inventories such as "1000 diseases in plain language" (cf. § 4.1) and of the (brief) information on the WHO norms for enciphering the "invented" common denominations, that is, those of the new products (cf. § 5.1). These aspects make it easy to accept, in terms of consumer protection, the idea of the necessity of elaborating, in the [MPhT] field, specialized nomenclators and glossaries with explanations accessible to the public. Obviously, the elaboration of such a work involves many issues, some of which are rather hard to identify before ordering and applying the facts referring to: [A] elaborating the general necessary database (categories of sources, material transcription, etc.)
and [B] selecting the title-words depending on the two corresponding terminological levels, in order to elaborate the list of terms to appear in a planned [MPhSG]. The task becomes even more difficult because of the lack of proper Romanian dictionaries with more or less common or even "folk" (meaning really "in plain language") terms for diseases and treatments. In regard to the elaboration of a database that reflects the terminology belonging to the current level of the exogenous discourse, the most important aspect is that the medicine treatment for certain diseases has been evolving rapidly; hence, new terms appear all the time. On the other hand, it is difficult to establish terminological correspondences within a certain area of treatments, meant to orient toward a prospecting approach of a synonymic nature. In principle, as a rough guide, a terminological group of remedies can be outlined starting from the core, represented by [the scientific name of the disease] + [the generic name of the remedy used for its treatment]. Around this core, an onomasiologic group can be outlined; this group comprises, from a linguistic perspective, firstly the word family of the basic term. In this phase, we do not intend to present samples or lexicographically organized materials; however, we do provide a brief example of this working hypothesis. Around the core formed by the term spasm (defined by the Romanian Explicative Dictionary as an "involuntary, strong contraction, with variable duration, of a muscle or of a group of muscles") and by the term antispasmodic ("a medicine against spasms"), the file [F] may comprise the terms spasmodic, spastic, spasmofilie, spasmogen, spasmolitic, and antispastic.
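The working hypothesis above (a core of two terms, surrounded by an onomasiologic group) can be pictured as a simple record. The following is a minimal sketch, assuming a hypothetical `TermFile` structure and a naive stem test for word-family membership; neither the class nor the stems "spasm" and "spast" come from the project itself, and only the listed terms are taken from the example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class TermFile:
    """Hypothetical record for one file [F]: a core plus its related terms."""
    disease_term: str   # scientific name of the disease (first core element)
    remedy_term: str    # generic name of the remedy (second core element)
    related_terms: list = field(default_factory=list)

    def all_terms(self):
        """Return the core terms followed by the rest of the group."""
        return [self.disease_term, self.remedy_term, *self.related_terms]

    def add(self, term):
        """Add a term to the group unless it is already recorded."""
        if term not in self.all_terms():
            self.related_terms.append(term)


def word_family(terms, stems):
    """Keep the terms containing at least one stem (a naive assumption)."""
    return [t for t in terms if any(stem in t for stem in stems)]


# The example from the text: core "spasm" + "antispasmodic" and its group.
spasm_file = TermFile(
    disease_term="spasm",
    remedy_term="antispasmodic",
    related_terms=["spasmodic", "spastic", "spasmofilie",
                   "spasmogen", "spasmolitic", "antispastic"],
)
family = word_family(spasm_file.all_terms(), stems=("spasm", "spast"))
```

A substring test is used only to keep the sketch short; a real word-family assignment would rely on lexicographic judgment rather than on string matching.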
This inventory will be subjected to a selection for the list of words [LW] of the [MPhSG], depending on the results of the surveys with doctors and pharmacists regarding the presence/occurrence of some of these terms in their discussions with various categories of patients and in the paratext structures represented by leaflets. In fact, such an approach also includes the issue of the lexicographic presentation of texts. We could consider giving up the alphabetical order of entries in favour of a presentation by onomasiologic groups, following the concept of "structural lexicology" formulated by Hallig - Wartburg 1963, through a "rational system of concepts", a matter which we cannot discuss here. In any case, for such a lexicographic formula, an index of words solves the issue of easy orientation. Obviously, the design and elaboration of [F] should start from basic nomenclators describing the medicines present at a certain point in time (such as Farmacopeea romana [The Romanian Pharmacopoeia], elaborated under the patronage of the National Medicines Agency). It should also start from lists within the documents issued periodically by official bodies; these documents are extremely important because they present the advantage of reflecting the product circulation. For instance, in a list comprising the international common denominations (ICD) and the common terms for the medicines available to the persons with medical insurance within the social health insurance system (issued by the National Health Insurance House), there are around 2,000 names (in electronic format, the document has 44 p. x 47 names per page). For [F], essential criteria must be considered: medicine classes by the ATC (Anatomical Therapeutic Chemical) system, by the administration route, maybe even by the presentation forms, etc.
Another way of assessing and enriching [F] is the extraction of terms from the paragraphs reserved for "treatment" within complex articles such as those in Dictionar 2013. In regard to the selection for the [LW], the decisive element is the experience of the potential collaborators in this project, doctors and pharmacists, especially the latter. This occurs because, as can easily be concluded, a great part of the patients who come to purchase medicines did not go to a doctor first; this way, the pharmacy is often the setting where symptoms are described and medication recommendations are obtained. It is no less true that even the patients who come in with a prescription also choose to discuss the selection of medication with the pharmacist, starting from common names per se, which correspond to the coded notes of doctors. In fact, the regulations in force do mention the functioning of a consultation room within the space of the pharmacy. Of course, the idea is also to ensure, if necessary, confidentiality (in that professional setting, the concepts of "pharmaceutical care" and "confidentiality of the information" are quite common; cf. Cristea 2013). Besides the relationship with the qualified personnel (doctors, pharmacists), the patient also has the opportunity of a direct contact with the [MPhT] that goes beyond reading the prescription (often indecipherable; this has actually become proverbial: when it is said of someone that he "has a doctor's handwriting", the idea is that the handwriting is indecipherable). We refer to the consultation of the main paratext structure: the Leaflet of medicines (subtitled "Information for the user") and, more rarely, to the consultation of the same type of structures represented by the inscriptions on packaging or even on the bottle (we take into account, as always, the priority of the most used medicines).
An analysis sample of the leaflet for the medicine called, in the pragmatic-discursive space of trade, SERMION 30mg (a name followed by the scientific "gloss" NICERGOLINE) is meant to clarify the difficulties of the approach. It is also meant to maintain a certain optimism (moderate, of course) regarding the practical possibilities of the project. Firstly, all the section titles of the leaflet are formulated in terms that are accessible to the reader with medium education; this should also be taken into account considering the interest in deciphering the technical terms within the text of several sections. They are as follows:
- section [1] "What is Sermion 30 mg and what it is used for";
- paragraphs within section [2] "Before you take Sermion 30mg";
- section [4] "Possible adverse effects".
Those utterances represent around 25% of the entire leaflet text and they comprise a rather large number of diseases and therapies. We mention that we do not take into account utterances that belong to the pragmatic-discursive space of an endogenous level and that are of no interest to the common reader, such as "the class of ergot alkaloid derivatives" (in section [1]), and, above all, aberrations from the perspective of the user's competency, which we shall refer to hereinafter.

Often, in pharmacies, when the patient tries to initiate a discussion on issues within the Leaflet of a medicine (and for all the right reasons, considering the subtitle "Information for the user"), the pharmacist blocks the potential dialog by replying "The Leaflet is for US!" So who is right? Both interlocutors are. The patient is right because he is addressed through such a text, not to mention the direct information formulas regarding his particular situation (for instance, for the medicine of reference, in section 2: "The use of Sermion 30 mg during pregnancy or if you intend to become pregnant should be extremely cautious").
He is also right considering the presence of a discourse that claims to be exogenous, as the medical terms used are among the generally known ones (for instance, in the same situation, "Tell your doctor... if you have kidney diseases...; if you have high/low blood pressure", etc.). However, only the pharmacist can help the patient, even by simply translating utterances that clearly belong to the endogenous discourse, though, we underline again, the text is addressed to the patient. For instance, in the case of Sermion, the patient (in section 2, paragraph "Interactions") is informed of the following: "Tell your doctor if you use... anticoagulant and plaquetary antiaggregant medicines - as nicergoline inhibits platelet aggregation and reduces blood viscosity, it is necessary to frequently monitor the parameters of blood coagulation in case of the more prone patients". The utterance itself - just like many others - is grammatically incongruent. Actually, the patients are also preoccupied by the terminology; in texts on social networks on the Internet there are sarcastic characterizations of the names of medicines, some of which are considered, in the ludic register, "funny" or "stupid", etc.⁶

6.3. Necessity of computer-based sources
In order to elaborate the database in question, we plan to use mainly computer-based sources, so as to permanently ensure practical operations of correlation/assessment. Furthermore, for the same reasons, we intend to edit the corpus material and the [MPhSG] in electronic format.

7. Conclusions
We believe that the above-discussed facts justify our interest in the elaboration of a database on the medico-pharmaceutical terminology representing the concrete communication possibilities characteristic of the pragmatic-discursive space of the user of medicine treatment.
In any case, we exclude the (chimerical) illusion that communication profitability could be obtained, in this field, by training the patients to learn the codes of specialists, meaning the scientific terms for diseases and treatments. This illusion was hazardously considered fully possible (cf. Marin-Omer 2003), in surprising unawareness of the communication realities within a pragmatic-discursive space with the highest socio-cultural relevance. Whether our starting point represents the result of a correct assessment, and whether the elaboration of a practical working tool of the type "Medico-pharmaceutical Security Glossary" is considered realistic, we shall find out on this occasion, the first "declaration" of our intention. We look forward to and welcome objections and suggestions of any nature, as well as any potential criticism from specialists.

⁶ cf., for instance, http://e4ari.blogspot.ro/2009/12/deniHTuri-haioase-de-medicamente.hte as well as http://www.krossfire.ro/un-plic-de-fluimucil/

References
A. Sources
Bianu V., Glavan I. (1929). Doctorul de casa sau Dictionarul sanatatii..., Editia a ELa revazuta si marita, Cluj, Institutul de Arte Grafice Cartea Romaneasca S.A.
Danila Gh. (1999). Vademecum medicamentorum, Iasi, Polirom.
Dictionar (2013). Dictionar medical ilustrat, vol. I-X, Bucuresti, Editura Litera International [traducere dupa SALUTE. Dizionario medico, Milano, RCS Quotidiani, 2006].
Episcupescul Ș.V. (1846). Practica doctorului de casa. Cunostinta aparari s'a tamaduiri boalelor barbatesti, femeesti si copilaresti. C-o prescurtare de hirurgie, de materie medicala si de veterinerie, pentru doctor si norod. In tipografia Colegiului Sf. Sava, Bucuresti.
Hepites C. (1862). Pharmacopea romana, Typographia Jurnalului National, XIV + 790 p., Bucuresci.
Prudhomme Ch., D'Ivernois F. (2012). 1000 de boli pe intelesul tuturor. O enciclopedie medicala indispensabila familiei, vol. I-II, Bucuresti, Editura Orizonturi.
Rusu V.
(2010). Dictionar medical, Editia a IV-a revizuita si adaugita, Bucuresti, Editura Medicala.
B. Exegeses
Baylon Ch., Mignot X. (2000). Comunicarea, traducere de Ioana Ocneanu si Ana Zastroiu, Editura Universitatii "Alexandru Ioan Cuza", Iasi.
Bourdieu P. (1976). Le champ scientifique. "Actes de la Recherche en Sciences Sociales", vol. 2, 88-104.
Charaudeau P., Maingueneau D. (2002). Dictionnaire d'analyse du discours, Paris, Editions du Seuil.
Coseriu E. (1994). Competenta lingvistica. Prelegeri și conferinte, excerpted from "Anuar de lingvistica si istorie literara", XXXIII (1992-1993), 27-47.
Cristea A.N. (2013). Consilierea pacientului in farmacia de comunitate. "Pharma Business" (www.pharma-business.ro/opormnitati/consilierea-pacienUilui-in-farmacia-de-comunitate.html).
Dumistracel S. (2000). Paliere terminologice. "Cronica", XXV, no. 1, 19.
Dumistracel S. (2006a). Limbajul publicistic romanesc din perspectiva stilurilor functionale, Iasi, Institutul European.
Dumistracel S. (2006b). Discursul repetat in textul jurnalistic. Tentatia instituirii comuniunii fatice prin mass-media, Editura Universitatii "Alexandru Ioan Cuza", Iasi.
Dumistracel S., Stoica D., Dumistracel I. (2011a). Barrieres linguistiques, notamment au niveau terminologique, dans des domaines de communication a grand impact socio-culturel. Approche pragmalinguistique. Conference internationale "La formation en terminologie".
Dumistracel S., Hreapca D., Botosineanu L. (2011b). Paliere terminologice din perspectiva barierelor lingvistice: incifrare si transparenta in terminologia medico-farmaceutica. In the volume comprising the Acts of the International Conference "Paradigm of the ideological discourse. Dynamics of terminologies and (re)modelling of the systems of ideas" (fourth edition; PID 4), Facultatea de Litere, Universitatea "Dunarea de Jos", Galati.
Dumistracel S., Hreapca D., Botosineanu L. (2012).
Variatie diastratica si variatie diafazica in comunicarea specializata: paliere terminologice. Spatiul discursiv al publicatiilor romanesti de instruire si educatie medico-sanitara (I). In the volume comprising the Acts of the International Conference.
Hallig R., Wartburg W. (1963). Begriffssystem als Grundlage für die Lexikographie. Versuch eines Ordnungsschemas, second edition, Berlin, Akademie Verlag.
Jurt J. (2001). La theorie du champ litteraire et l'internationalisation de la litterature⁷.
Marin-Omer I. (2003). The Role of Medical Special Code and Slang in Communication between Doctor and Patient in Oncology Departments, in vol. Limba si vorbitorii, Bucuresti (Tatiana Slama-Cazacu ed.), Editura Arvin-Press, 272-287.
Toma E.C., Mesaros A.M., Carata A. (2012). Remedii farmaceutice de origine animala prezente in prima farmacopee romana de la 1862 si in colectia Muzeului de Istorie a Farmaciei din Sibiu⁸.
Ursu N.A. (1962). Formarea terminologiei stiintifice romanesti, Editura Știintifica, Bucuresti.
Veron E. (1997). Entre l'epistemologie et la communication. In "Hermes" 21, Sciences et medias, Paris, Editions CNRS, 23-32.
C. Regulations, legislation
Ghid privind exprimarea concentratiei in denumirea comerciala a medicamentelor de uz uman (2010).
Hotararea Consiliului Știintific al Asociatiei Nationale a Medicamentului si Dispozitive Medicale, nr. 2/29.02.2008: Ghidul privind denumirea medicamentelor de uz uman. Id., Anexa 1: abordarea problemelor referitoare la denumirile comune internationale (DCI) in cadrul denumirilor inventate propuse (DI).
Ordin nr. 75/2010 pentru aprobarea Regulilor de buna practica farmaceutica.
Ordinul Ministerului Sanatatii din 1 aprilie 2009 privind prescrierea retetelor pe DCI.
Reglementari privind modalitatea de gestionare a propunerilor de denumiri comerciale tip „umbrela" si alte denumiri comerciale pentru medicamentele de uz uman in raport cu denumiri ale suplimentelor alimentare, ale produselor cosmetice si ale dispozitivelor medicale (2012).
Regulamentul de prescriere medicamente din Contractul Cadru 2011, anexa 30.

⁷ www.freidok.uni-freiburg.de
⁸ http://www.srif.eu/fisiere/7978JxYa^TOMA_Remedii%20origine%20animala%202012_preg_prezentare.pdf

ELECTRONIC LINGUISTIC RESOURCES FOR HISTORIC STANDARD ROMANIAN
ELENA BOIAN, SVETLANA COJOCARU, CONSTANTIN CIUBOTARU, ALEXANDRU COLESNICOV, LUDMILA MALAHOV, MIRCEA PETIC
Institute of Mathematics and Computer Science, Academy of Sciences of Moldova, Chisinau, Republic of Moldova
lena@math.md

Abstract
This article describes the digitization of old Romanian texts and the problems of their recognition, and motivates the necessity of creating specific electronic resources mirroring the history of the standard Romanian language. We analyze printed texts since the 16th century, when Romanian typography begins. We also provide statistics on the results of recognizing a 19th-century Romanian text with modern OCR (optical character recognition) software.

Keywords: digitization, Romanian linguistic resources, text recognition, language technology

1. Introduction
The main directions of cultural policy in the regions where the Romanian language is spoken refer to the study, evaluation and digitization of cultural and historic heritage. The process of heritage digitization requires solving many problems related to the recognition, editing, translation, interpretation, circulation and reception of texts printed in Romanian and other modern languages. These problems become more complicated for Romanian, as we need to consider the historic period when the source was printed, and there are several such periods.
This paper presents a short description of the periods of the Romanian language evolution and of the development of the main language components (alphabet, lexicon, and orthography) specific to each period. For a given period, we propose a technology to obtain these components. In particular, we study the problem of digitizing printed Romanian texts that use different writing systems, starting from the 16th century (Ivanescu, 1980). The first book printed in the Romanian territory was the Church-Slavonic Liturgy Book (1508) edited by the Serbian hieromonk Makarie. The first printed book in Romanian appeared in Brasov in 1535 (Panaitescu, 1965). It was The Romanian Catechism published by deacon Coresi. The National Library of the Republic of Moldova possesses approximately 21,000 old and rare books. The collection contains approx. 20 books printed in Romanian in the Romanian Cyrillic and transitional scripts in Bessarabia (Chisinau and Dubasari). Public libraries of Saint Petersburg keep important quantities of old Romanian books (16th-19th centuries). For example, there are 66 titles in The Catalog of Cyrillic editions of Southern Slavonians and Romanians: 45 volumes are of Southern Slavonian origin, while 21 can be attributed to Romanian lands (Valori, 2008).

In its history, the Romanian language has passed through a long and rich evolution. Existing studies explain the appearance of each vowel and consonant at each specific stage of the language evolution, which is necessary to determine the alphabet and its specific letters (Ivanescu, 1980; Munteanu & Tara, 1978). This information permits us to construct linguistic resources and to use specific tools for a specific period of the language history. Our work is a long-term project that is now at its beginning. We implement it using the principle "from now into the depths of time".
In this paper we describe our approach to the digitization of Romanian texts from the 20th century back to the 19th. Three types of texts can be distinguished:
1. Moldavian Cyrillic script, which was used in 1924-1989 and is still used in Transnistria;
2. Latin script with additional letters, which differ depending on the period;
3. Transitional script.
We performed this categorization based on the alphabet. We should note that each of these periods can be subdivided on the basis of the corresponding orthography and lexicon. The structure of the paper is as follows: state of the art in old text recognition, with orientation to South-Eastern Europe (Sec. 2); a short list of the historic periods of the Romanian language and script evolution (Sec. 3-4); exposition of techniques to digitize and to recognize printed texts (Sec. 5); examples and considerations on recognizing texts from specific periods (Sec. 6).

2. State of the art in working with historical texts of South-Eastern Europe
The digitization and preservation of historical linguistic heritage is a priority domain in the digital agenda for Europe. The EU highlights the necessity for coordinated effort in the domain and has undertaken extensive actions to stimulate this process. These actions include the development of the Europeana virtual library, supported by a resolution of the European Parliament of May 5, 2010, and the adoption of the Work Plan for Culture 2011-2014. Let us also mention the European Commission Recommendation on the digitization and online accessibility of cultural material and digital preservation of October 27, 2011. For the Romanian historical linguistic heritage, the solution of this problem presents specific difficulties: a large number of periods in the language evolution; a relatively small number and wide dispersion of deposited resources; a wide variety of alphabets used, in particular several so-called "transitional" (mixed Latin-Cyrillic) alphabets.
The difficulties in the digitization and preservation of this heritage lie in the correct recognition of characters and in the lack of adequate lexicons corresponding to the periods when the texts were printed. One solution to the lexicon problem could be the alignment of old texts to contemporary linguistic norms (Moruz & al., 2012). As to OCR of printed and handwritten Cyrillic characters, we can mention a paper (Kornienko & al., 2011) where both standard ABBYY FineReader and AI techniques, in particular artificial neural networks, are used. Methods based on knowledge technologies have been applied to the digital archive and multimedia library for Bulgarian traditional culture and folklore (Pavlov & al., 2011). Problems of transliteration caused by the parallel use of two alphabets, Cyrillic and Latin, which appear in the processing of written texts in modern Serbian, were solved by applying monolingual and multilingual corpora and various e-dictionaries (Vitas & al., 2003).

3. Periods of evolution of the Romanian language
The history of the Romanian language comprises two epochs. The first is that of the formation of the Daco-Romanian dialect and lasts from the taking of Sarmizegetusa (106 A.D.) until the 15th century (Ivanescu, 1980). The Cyrillic alphabet was used at the end of this epoch because of the domination of the Orthodox Church. The second epoch (16th-20th centuries), that of the evolution of standard Romanian, begins with the appearance of the first texts written in Romanian as the result of a long and complex development (Munteanu & Tara, 1978). This second epoch can be divided into two major stages. The first stage begins with the appearance of the first Romanian literary texts and ends at the beginning of the 18th century.
This stage can be subdivided into three periods:
1. 1532-1588, the first steps in language standardization;
2. 1588-1656, consolidation of the main variants of standard Romanian (Muntenian, Moldavian, and South-West-Ardealian);
3. 1656-1715, mutual influence of variants.
In 1688 Biblia de la Bucuresti [the Bible of Bucharest] appeared. Its publication became a milestone in the linguistic unification that led to the second stage of the second epoch (Ghetie, 1978). This second stage covers 1715-1960 and consolidates a unified supra-dialectal language. We can subdivide this stage into four periods:
1. 1715-1780, the first unification, at approx. 1750;
2. 1780-1836, linguistic diversification;
3. 1836-1881, stabilization of the main norms of the unified standard language;
4. 1881-1960, fixing of the norms of the modern standard Romanian language.
The last period also marks the stylistic consolidation of standard Romanian. In 1904, the orthography was changed to be definitively based on the phonetic principle, which standard Romanian has kept until now, with some further refinements.

4. Periods of Romanian scripts development
By the 17th century, a Romanian Cyrillic script had appeared, with up to 47 letters. Most letters were taken from Old Church Slavonic. Several Greek letters were added to convey names exactly. An original Romanian letter was used as the prefix or preposition în, îm (in), or as the modern letter î at the beginning of words. Varlaam's Homiliary was printed in 1643 with this script (Fig. 1). The first Romanian ABC book was printed in Balgrad (Alba Iulia) in 1699, and in 1757 D. Evstatievich published a Romanian grammar. In the 18th century Romanian belles-lettres appeared.
From 1830 until the official adoption of the Romanian Latin-based alphabet in 1862, the script was not thoroughly regulated, and at least seven variants of so-called "transitional alphabets", mixing Cyrillic and Latin letters, were used (Fig. 4, 7). For example, e - € (1830) - £ (1846); k - k; m - nit; s - #3 - dz - d (1846). The usage of the Latin-based script in Romania did not influence typographic practice in Bessarabia. After the ceding of Bessarabia to imperial Russia in 1812, the official language was changed to Russian. In 1833, Romanian was excluded from all official communications but remained in eparchial administration until 1873. The church typography in Chisinau was closed in 1883 and reopened in 1906. Besides church books, we can also mention: ABC books (1814 and later, 1861, 1863); a booklet on the emancipation of serfs (1861); calendars, etc. Instructive booklets on agriculture and hygiene in Romanian were published and distributed by local authorities. In 1867-1871, the Romanian version of the Chisinau Eparchial Gazette was printed in civil Slavonic script with several traditional letters and y-like u (8). In several cases, transitional (1859) and even Latin-based scripts were used (Ciobanu, 1923). In 1880-1890, printing in the Romanian language ceased in Bessarabia, resuming at the beginning of the 20th century. The religious printing used both church and civil scripts.

It is necessary to distinguish the Romanian Cyrillic alphabet from the Moldavian Cyrillic alphabet (Fig. 8). The former was used for Romanian writing from the 14th-15th centuries until 1862. The latter is, in fact, an adaptation of the Russian Cyrillic alphabet to reproduce Romanian phonetics by Russian orthographical norms, which led to some odd orthographical effects.
This second variant, based on the Russian alphabet, was used in the Moldavian Autonomous Soviet Socialist Republic (MASSR) in 1930-1932 and 1938-1940, then in the Moldavian Soviet Socialist Republic (MSSR) from its formation in 1940 until 1989. This alphabet is still used in Transnistria. In 1932-1938, the Latin-based alphabet was used in the MASSR. We can therefore distinguish the following periods in the development of the Romanian script since the publication of Varlaam's Homiliary (Tab. 1).

Table 1: Development of Romanian script since 1642

Romania and Bessarabia:
1642 - 1710: Romanian Cyrillic script

Romania:
1710 - 1830: modified Romanian Cyrillic script
1830 - 1862: mixed Cyrillic-Latin transitional script
1862 - 1904: Latin-based script
1904 - 1960: modified Latin-based script
1960 - 1993: modified Latin-based script
1993 - now: modern Romanian Latin-based script

Bessarabia:
1710 - 1814: modified Romanian Cyrillic script
1814 - 1880: Cyrillic scripts based on Russian civil and Old Church Slavonic scripts; occasionally, transitional and Latin-based script
1880 - 1905: no Romanian typography
1905 - 1918: Cyrillic script based on Russian civil script
1919 - 1940, 1941 - 1944: modified Latin-based script
1940 - 1941: Moldavian Cyrillic script [see above in the text on the situation in the MASSR]
1944 - 1989: Moldavian Cyrillic script; in 1967 the letter ӂ appeared
1989 - now: modern Romanian Latin-based script [see above in the text on the situation in Transnistria]

There are other factors besides script that characterize the periods of language development, namely orthography and lexicon. Fig. 1-8 show examples of printed texts from different periods of the language evolution.

Figure 1: Varlaam's Homiliary, Iasi, 1643 "Romanian book of learning during the year and at the Christian feasts, and of the Great Saints.
Under the order and all costs paid by Vasilie [Lupu], Prince and Ruler of Moldavia, compiled and translated from many sources, from Slavonic into Romanian, by Varlaam the Metropolitan of Moldavia. At the Ruler's typography."

Figure 2: Horologion, 1748

Figure 3: Lord's Prayer. In: Book of Akathists with many selected prayers for humbleness of each Christian. Printed for the third time. Blaj: Typography at the Theological School, 1786

Figure 4: Chronicles of the State of Moldova, published for the first time ever by Mihail Kogalniceanu. Volume I. Iasii. Available in all bookstores. 1852

Figure 8: A text printed in the Moldavian Cyrillic alphabet (1967-1989), used till now in Transnistria. From: M. Eminescu, "Epigones"

5. Recognition of characters in printed texts
Manuscript digitization and recognition is complicated because it requires additional operations, such as adjusting the contrast, cleaning the image, and text segmentation. We also need to develop special recognition algorithms and specialized lexicons. In what follows, we only take into account Romanian texts printed with Latin letters. The process of digitization and recognition consists of the following stages:
• Digitization of the text, resulting in its graphical electronic copy.
• Recognition by standard techniques, namely, using OCR (Optical Character Recognition) software, possibly with training. Without OCR, conversion procedures using Artificial Intelligence techniques should be applied. Transliteration of the text is performed taking into account specific letters from the initial text.
• Verification of the recognized text is performed using reusable resources specialized for the corresponding period.

Figure 9: Technological stages of printed text recognition [diagram: printed document (15th-20th century); digitization (scanning); image of the printed text; recognition (OCR or AI procedures); recognized text; conversion/transliteration; verification of the text against electronic reusable resources, with expert editing and automated suggestions; resulting text in the Latin alphabet]

Digitizing texts means scanning them and obtaining their electronic version as an image. OCR is used to recognize text from its image. Standard OCR systems use different methods to recognize texts. We tried two systems: IRIS and ABBYY FineReader. The results of experiments in the recognition of a printed 19th-century text are presented in Section 6. We found that IRIS does not offer the possibility to select an arbitrary fragment of the image during training. Therefore, we cannot correct the fragmentation proposed by the system. This system does not satisfy our purposes, as it is impossible to train it to recognize old printed Romanian text. The ABBYY FineReader OCR system allowed us to adapt it to the alphabet of the corresponding period. We trained the system by enlarging the alphabet. It should be noted that an OCR system recognizes the actual text if its internal spelling checker uses lexical resources that correspond to the historical period of the text. OCR systems using standard (modern) lexicons do not always obtain a satisfactory result. To improve the results, we need further processing of the scanned text. Pattern recognition techniques are used to identify individual characters in the page, including punctuation, spaces and line ends. The recognized text appears as an editable file. Transliteration is a strictly individual process that is dependent on the examined period.
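The period dependence of transliteration can be illustrated with a minimal sketch. It is not the authors' software: the table `CYRILLIC_TO_LATIN` and the helpers `build_tables` and `transliterate` are hypothetical names, and the table is a deliberately incomplete assumption that covers only a few Moldavian Cyrillic letters treated here as one-to-one. Context-dependent letters are left out on purpose, so they are reported instead of being guessed.

```python
# Hypothetical, deliberately incomplete table for the Moldavian Cyrillic script.
# Only letters treated in this sketch as one-to-one are listed (an assumption);
# every other letter ends up in the "unknown" report below.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "д": "d", "е": "e", "з": "z",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "э": "ă",
}


def build_tables(pairs):
    """Return (forward, backward) translation tables for a one-to-one mapping."""
    latin = list(pairs.values())
    if len(set(latin)) != len(latin):
        # A reversible transliteration needs a one-to-one letter mapping.
        raise ValueError("mapping is not one-to-one, so it cannot be reversed")
    forward = str.maketrans(pairs)
    backward = str.maketrans({lat: cyr for cyr, lat in pairs.items()})
    return forward, backward


def transliterate(text, table):
    """Apply the table and list the letters it does not cover."""
    unknown = sorted({ch for ch in text if ch.isalpha() and ord(ch) not in table})
    return text.translate(table), unknown


FORWARD, BACKWARD = build_tables(CYRILLIC_TO_LATIN)

latin, unknown = transliterate("мамэ", FORWARD)   # Cyrillic spelling of "mamă"
restored, _ = transliterate(latin, BACKWARD)      # back to the Cyrillic form
print(latin, restored, unknown)
```

A real table for a given period would be considerably larger and would need context rules, which a plain character table of this kind cannot express; the list of unknown letters is one simple way to route such cases to an expert.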
Transliteration uses programs that depend on the initial text and contain information on the specific letters in that text. It presupposes the creation of bidirectional relations between two writing systems, so that a specialist could reconstruct the original text from its transliterated variant. Transliteration should be performed only when necessary. Text verification is performed by a special application that uses specific resources for the historic period of the printed text (Burlaca & al., 2010). Newly obtained words can be entered into the corresponding lexicon.

6. Results of experiments in recognition of printed 19th-century texts
6.1. Processing of texts in the Moldavian Cyrillic script
To perform OCR (Fig. 9) of such texts, it is necessary to train the OCR system to recognize the additional letter ӂ (since 1967) and to provide the corresponding lexicon. For the end of the period (1951-1989), we can obtain the dictionary by transliterating the modern Romanian dictionary in the Latin script. The transliteration is not simple because of several irregularities in this system of writing, e.g.:
• absence of и (i) in the Cyrillic equivalent of words like paine (пыне - bread), caine (кыне - dog); in other words containing diphthongs, this letter is kept: caraitor (кырыитор - croaking), taraitura (тырыитурэ - creeping);
• replacement of a with n (instead of a) in words like functia (...)

Figure 10: Digitized text, 1894 (Densusianu, 1984, p. 130)

Romarm, desi au avuta o miie de an[ se sufere in vasiunele barbare, care au distrusa tote operele mareje ale architecture! romane, in catu acesta fapta a ramasa pana adt in dicerea populara "n 'a ramasa petra pe petra", totust nici moravurile njel sufletnln loru nu s'a tn^batacitii. Ei au pastrate o adanca intimitate si doiosie in vieta familiara.
Casatoria este incungiurata de-o mulfime de ceremonil cand grave, cand vesele. Miresa este "o, fata de tmparatn", mirele "ficioru de imparata". ceea ce indica respecta si fericire. Casatoria este "pe vieta si morte". pentru aceea si jelirea la m6rtea unuia dintre sot} este adanca §i lunga. In ceealalta lume tnsa er' se tntelnescn pentru a trai impreuna. Cultula mosilora (sufletele raposatilorn) este in forte mare 0- n6re pana adt. Anumite sarbatort peste anu suntu consacrate acestui culta.
Figure 11: Text recognized with OCR system IRIS

The next step was manual correction of the text from Figure 11, resulting in the text shown in Figure 12. Words in old writing are underlined.

Romanil. desi au avutu o miie de am se sufere invasiunele barbare, care au distrusu. tote operele marete ale architecture! romane, in catu acestu faptu a remasu pana agTi in dicerea populara „n'a remasu petra pe petra". tojusi nici moravurile nici sufletulu loru nu s'a inselbatacitu. Ei au pasjratu o adanca intimitate si doiosie in vieta familiara. Casatoria este incungiurata de-o mul^ime de ceremonil cand grave, cand vesele. Miresa este „o fata de imperatu", mirele , jltiorti de lmperatu", ceea ce indica respectu si fericire. Casatoria este „pe vieta si morte". pentru aceea si jelirea la mortea unuia dintre soti este adanca si lunga. In ceealalta lume inse er' se intelnescu pentru a trai impreuna. Cujtulu mosiloru (sufletele reposaftloru) este in forte mare pnore pana adi. Anumite serbatori peste anu sjmtu consacrate acestui cultu.
Figure 12: Manual correction of the text

6.2. Processing of texts in the Latin script with additional letters
To illustrate the described technology, we investigate the recognition and verification of a digitized text from (Densusianu, 1984) that was published in 1894 (Fig. 10). The text in Fig. 10 was recognized with the OCR system IRIS in Romanian mode, which uses a modern lexicon. As we compare the resulting (Fig. 11) and source (Fig.
10) texts we see that the unrecognized words are those written in the old orthography with letters specific to the 19th century. For example, we got tnsalbatacitu instead of inselbatacitu. This result cannot be improved, because IRIS in its training mode does not permit arbitrary fragmentation of the image and imposes its own fragmentation. The use of the modern lexicon led, for example, to avutu being recognized as avuta, while the right word is avut in this context. Words from the 19th-century lexicon were not recognized because their correct recognition requires dictionaries specific to the corresponding period, which, in our case, would contain words like remasu, vieta, imperatu, etc. Underlined words in Figure 11 are those that are erroneous or written differently from the modern Romanian language.

*Romanii, desi *au *avuul o *miie de *am se sufere *mvasiunele barbare, care *aji *distrusu *tote operele marete ale *architecturei romane, in *catu *acestu *faptu a *remasu pana *adi in *dicerea populara „n'a *remasu *petra pe *petra", *tojtus! *nici moravurile *nici *s,ufletulu *]oru nu s'a *inselbatacitii. *E! *au *pastratu o adanca intimitate si *a^jfi§|e in *vig£a familiara. Casatoria este *incungiurata de-o multime de *ceremonii cand grave, cand vesele. *Miresa este „o fata de *imperatu". mirele „*ficioru de *imperatu". ceea ce *indic& *respectu si fericire. Casatoria este „pe *vieta si *morte" pentru aceea si jelirea la *m6rtea unuia dintre *soti este adanca si lunga. In *ceealalta lume *inse *er se *intelnescu pentru a trai impreuna. *Cultulu *mosiloru (sufletele *reposatiloru) este in *forte mare *onore pana *adl Anumite *serbatori peste *anu * suntu consacrate *acestui *cnltii.
Figure 13: Text checked with RomSP

The corrected text was checked with the RomSP spelling checker (Burlaca & al, 2010), whose lexicon contains approx. 1 million words of modern Romanian (Fig. 13). An asterisk * marks words that the spelling checker did not recognize and that can be attributed to the 19th-century lexicon. The source text in Fig. 10 contains 130 words; 57% of them were found correct and 43% were suspicious. The "correct" words are those whose spelling has been kept intact since the 19th century, for example: sufere, acesta, fericire. "Suspicious" words are those affected by the changes in orthography, for example: ceealalta (cealalta), doiosie (duiosie), miie (mie), avutu (avut), adi (azi). Note that only some of the "old" words contain specific letters. To recognize the text correctly, we need to train the OCR system to recognize the specific letters and to add to the lexicon a set of new words specific to the 19th century, for example: avutu, miie, nici, doiosie, vieta, ficioru, etc.

The OCR system ABBYY Fine Reader has more elaborate training features. We used this system to perform another experiment with the same text from Fig. 10. The system recognizes the whole Unicode set of letters in many font faces. The user can select any subset of Unicode as a "user-defined language" and add to it his own lexicon (list of words). In rare cases, a "real" training over images of letters can be necessary, but we did not use it in this case. First of all, we instructed the system to include as recognizable the specific letters of 19th-century Romanian:
• u (a final letter, can be mute or pronounced),
• e (is pronounced as diphthong ea),
• 6 (is pronounced as diphthong oa),
• d (is pronounced as z or dz),
• i (i is written now, with special rules of pronunciation),
• e (is used as a).

The resulting text is shown in Fig. 14 (accuracy of 63%).

Romanii, desi au avutu. o miie de am se sufere in- vasiunele barbare, care au distrusu tote operele marete ale architecture!
romane, in caul acestu faptu a remasu pana ad! in cjicerea populara „n'a remasu petra pe petra", torus! nici moravurile nic! sufletulu loru nu s'a inselbatacitu. E! au pastratu o adanca intimitate si do- iosie in vie^a familiara. Casatoria este incungiurata de-o multime de eeremoni! cand grave, cand vesele. Miresa este „o iata de imperatu", mirele „ficioru de imperatu", ceea ce indica respectu si fericire. Casateria este „pe vie|a si morte", pentru aceea si jelirea la mortea u- nuia dintre sot! este adanca si lunga. In ceealalta lume inse er' se intelnescu pentru a trai impreuna. Cultulu mosiloru (sufletele reposatiloru) este in forte mare o- nore pana aAi. Anumite serbatori peste anu suntu consacrate acestu! culnl. Figure 14: Text recognized with ABBYY Fine Reader set for the 19th-century alphabet, without spell checking As the next step, the system was equipped with a dictionary containing words marked in Fig. 12, namely, those that do not exist in the modern lexicon. This lexicon was set as the additional one to the modern Romanian lexicon. This time ABBY Fine Reader recognized the source image (Fig. 10) with the accuracy of 98% correct words and 2% of suspicious words (Fig. 15). It is seen that most of "bad" words were not recognized because of poor image quality (adi, cjicerea, a(JT)). Comparing Fig. 14 and Fig. 15, we see that even hyphenated words were recognized correctly. Romanii, desi au avutu o miie de an! se sufere invasiunele barbare, care au distrusu tote operele marete ale architecture! romane, in caul acestu fapul a remasu pana adi in cjicerea populara „n'a remasu petra pe petra" torus! nic! moravurile nic! sufletulu loru nu s'a inselbatacitu. E! au pastratu o adanca intimitate si doiosie in viefa familiara. Casatoria este incungiurata de-o multime de eeremoni! cand grave, cand vesele. Miresa este „o fata de imperatu", mirele „ficioru de imperatu", ceea ce indica respectu si fericire. 
Casatoria este „pe vie|a si morte", pentru aceea si jelirea la mortea unuia dintre so^i este adanca si lunga. in ceealalta lume inse er' se intelnescu pentru a trai impreuna. Cultulu mosiloru (sufletele reposa|iloru) este in forte mare onore pana a(JT. Anumite serbatori peste anu suntu consacrate acestu! cultu.

Figure 15: Text recognized with ABBYY Fine Reader set for the 19th-century alphabet, with spell checking and an additional dictionary (accuracy 98%)

Thus equipped, FineReader was used to recognize another five pages from the same source (Densusianu, 1894) and, later, pages from another book of the same period. The results are summarized in Tab. 2. The errors can be attributed to the absence of words from the lexicon or to poor image quality.

Table 2: Results of experiments in OCR of 19th-century texts

Mode of recognition                             Correct words   Suspicious words
IRIS                                            57%             43%
ABBYY FR, no training                           63%             37%
ABBYY FR, trained, dictionary's source page     98%             2%
ABBYY FR, trained, more pages, the same book    95%             5%
ABBYY FR, trained, pages from another book      95.4%           4.6%

To obtain better results in the verification of printed text, two things are needed for the corresponding historic period:
• the scanner (scanning software) has to be trained to recognize the specific characters,
• a lexicon of the words used in that period has to be compiled.

6.3. Processing of texts in the transitional script

There are at least seven versions of the transitional (mixed Cyrillic and Latin) script. Most of the letters of this script can be recognized with ABBYY Fine Reader by forming the "language" from the corresponding Unicode glyphs. Only one specific Romanian Cyrillic letter is absent from Unicode.
It is necessary to include in the language an equivalent for this letter (linguists simply use an arrow; we may use, for example, a Slavonic yus) and to train the system on its graphical forms in different font faces. We experimented with the text from Fig. 7 and obtained an accuracy of 93.2%. Given the small volume of training material and the poor scan quality, this is quite a good result.

7. Conclusions

Digitized resources are specific records kept in a database accessible through the Internet. To ease user access to these resources, it is necessary to develop interfaces and a special technology that allows text recognition. For each period of the language's development, our technology is oriented towards solving two main problems: 1) development of algorithms to recognize the alphabets of the specific period; and 2) development of the tools and interfaces needed to create the corresponding linguistic resources (lexicons). This would make it possible to recognize words and to align texts with contemporary linguistic norms. As we move from one period to another, we can reuse previously elaborated tools and resources, thus implementing the principle "from now into the depths of time".

The proposed technology can be used to form and complete specific linguistic resources with new words extracted from digitized materials and certified by language experts. It would allow the construction of parallel corpora of different kinds. Further development of the proposed technology would make it possible to transliterate digitized text into modern Romanian, to customize graphics, to build corpora, and to preserve the original texts. Specific electronic resources can be placed on the Internet for public access, contributing to the development of the informational and communicative media for the Romanian language.
Moreover, these resources constitute an essential support for researchers, and conversions into modern standard text can be used as didactic materials in teaching.

References

Burlaca, O., Ciubotaru, C., Cojocaru, S., Colesnicov, A., Magariu, G., Malahov, L., Petic, M., Verlan, T. (2010). Applications based on reusable linguistic resources. Multilinguality and interoperability in language processing with emphasis on Romanian, 461-476.

Cartea Moldovei (sec XVII - inc. sec XX). (1992). Editii cu caractere chirilice (sec XVII - inc. sec XX). Catalog general. Chisinau. [Moldavian Books (XVII-beg.XX cen.). Editions in Cyrillic Script (XVII-beg.XX cen.). General Catalog. Chisinau, 1992. - In Romanian.]

Ciobanu, S. (1923). Cultura romaneasca in Basarabia sub stapanirea rusa. [Ciobanu, S. Romanian culture in Bessarabia under Russian rule. Chisinau, 1923. - In Romanian.] http://ww.scribd.com/doc/75147025/%C5%9Etefan-Ciobanu-Culmra^ romaneasc%C4%83-in-Basarabia-sub-st%C4%83panirea-rus%C4%83-1923

Densusianu, A. (1894). Istoria limbii si literaturii romane. Iasi. [Densusianu, A. History of the Romanian language and literature. Iasi, 1894. - In Romanian.] http://rn.scribd.com/doc/123035210/Istoria-limbii-si4iteraturii-romane

Ghetie, I. (1978). Istoria limbii romane literare. Bucuresti. [Ghetie I. History of the standard Romanian language. Bucharest, 1978. - In Romanian.]

Ivanescu, G. (1980). Istoria limbii romane. Iasi. [Ivanescu, G. History of the Romanian language. Iasi, 1980. - In Romanian.]

Корниенко С.И., Айдаров Ю.Р., Гагарина Д.А., Черепанов Ф.М., Ясницкий Л.Н. (2011). Программный комплекс для распознавания рукописных и старопечатных текстов. Информационные ресурсы России, №1, с. 35-37. [Kornienko S.I. et al. Program tools for recognition of handwritten and old-printed texts. Informational Resources of Russia, 2011, nr. 1, p. 35-37. - In Russian.]

Moruz, M., Iftene, A., Moruz, A., Cristea, D. (2012).
Semi-automatic alignment of old Romanian words using lexicons. Proceedings of the 8-th International Conference „Linguistic resources and tools for processing of the Romanian language", Iasi, Editura Universitatii „A.I. Cuza", 119-125.

Munteanu, Jara, V. (1978). Istoria limbii romane literare. Editura Didactica si Pedagogica, Bucuresti. [Munteanu, Jara, V. History of the standard Romanian language. Editura Didactica si Pedagogica, Bucharest, 1978. - In Romanian.]

OCR (Optical Character Recognition) Technology. http://www.unescap.org/stat/pop-it/pop-guide/capture_ch01.pdf

Panaitescu, P. (1965). Inceputurile si biruinta scrisului in limba romana, Bucuresti. [Panaitescu, P. The beginning and the victory of the Romanian writing. Bucuresti, 1965. - In Romanian.]

Pavlov, R., Bogdanova, G., Paneva-Marinova, D., Todorov, T., Rangochev, K. (2011). Digital archive and multimedia library for Bulgarian traditional culture and folklore. International Journal "Information Theories and Applications", Vol. 18, Number 3, 276-288.

RRRL: Reusable Resources for the Romanian Language: http://www.math.md/elrr/

Valori Bibliofile-2008. Gazeta bibliotecarului, Iunie-Iulie 2008, nr. 6-7, p. 1. [Bibliophile Values-2008. Librarian's Gazette, June-July 2008, nr. 6-7, p. 1. - In Romanian.] http:/787.248.191.115/birrm/publicatii/files/3/93 .pdf

Vitas, D., Krstev, C., Obradovic, I., Popovic, L., Pavlovic-Lazetic, G. (2003). An Overview of Resources and Basic Tools for the Processing of Serbian Written Texts. http://pomcare.matf.bg.ac.rs/~cvetana/biblio/Solun03MATF.pdf

CLRE - PARTIAL RESULTS IN THE DEVELOPMENT OF A ROMANIAN LEXICOGRAPHIC CORPUS

MADALIN IONEL PATRASCU, ELENA TAMBA, MARIUS-RADU CLIM, ANA-VERONICA CATANA-SPENCHIU

1 The Romanian Academy, "A. Philippide" Institute of Romanian Philology, Iasi Branch - Romania
2 "Alexandru Ioan Cuza" University of Iasi, Faculty of Computer Science, Iasi - Romania

ioneipatrascu@info.uaic.ro, isabelle.tamba@gmail.com, marius.clim@gmail.com, anaspenchiu@gmail.com

Abstract

The aim of this paper is to present the current status of the creation of an essential Romanian lexicographic corpus, which contains eDTLR (the digitized version of the Romanian Language Thesaurus Dictionary) and other essential Romanian dictionaries (old and new, general or specialized), aligned at entry level.

Keywords: Romanian lexicography, CLRE, eDTLR, computerized lexicography, linguistic resources, computerized lexicographic instruments

1. Introduction

The CLRE project is financed by CNCS - UEFISCDI, PN II - Human Resources area, a programme meant to encourage the training of young research teams, for a period of three years (August 2010-July 2013). The team consists of three lexicographers and an IT specialist. The CLRE project aims at building a corpus that will include 100 dictionaries from the bibliography of the Romanian Language Thesaurus Dictionary, aligned at entry level and, partially, at meaning level. The purposes of the Essential Romanian Lexicographic Corpus are: to build a scanned corpus of the reference dictionaries of DLR, aligned at entry level and, partially, at sense level; to obtain a software environment that allows interactive consultation; and to develop a quasi-exhaustive list of words of the Romanian language, starting from the aligned corpus.

2. Principles of Development

This project clearly shows the need to build a missing bridge between two very different directions of scientific research. The approaches to scientific problems, and the ways in which solutions are developed, differ significantly between Computer Science and Lexicography.
However, interdisciplinary cooperation can lead to surprising results, not only for the parties involved, but also for research in general. The computer science part of this project aims at obtaining the dictionaries from the established bibliography in electronic format, processing them by OCR (optical character recognition, the conversion from image to text), storing them in a database, segmenting the text at entry level, then aligning the data at word level and, where possible and where the necessary information exists, at meaning level. For each dictionary in work, the processing stages follow the order given above. The interoperability of stages across dictionaries supports the adaptation and optimization of the working process to the particularities encountered. Nevertheless, the most viable solution should cover a wide range of problems, because treating each dictionary individually, according to its specific features, would waste time and require significant effort.

A basic principle of this project is the possibility of extracting partial data with a high degree of coherence. To this end, all modules process and store information that can be used regardless of the completion stage of the processes. The software tools developed in this project respect the principles of portability and free access over the Internet. All results obtained in this way can be used by querying the database, which eliminates the need to consult any of the 100 dictionaries physically. Complex searches are also possible, which increases interactivity and usefulness, and by providing all these components over the Internet, various obstacles to accessing the information source are removed.

3. Storing and Securing Data

All processed data have a different legal character.
Some of the dictionaries from the CLRE bibliography are protected by the national patrimony law [1] or by copyright law [2]. As such, few dictionaries are freely usable from this point of view. For this reason we must secure and limit access to data such as: the folders containing the scanned corpus, the text obtained after optical character recognition, the segmented entries resulting from the processed dictionaries, and the aligned definitions. These data are stored on an SQL-based data platform.

[1] Law no. 182/2000 regarding the protection of movable national cultural heritage, republished in 2008.
[2] Law no. 8/1996 regarding copyright and related rights, modified by Law no. 285 of June 23, 2004 and Urgency Ordinance 123 of 1st September 2005.

Web access is provided through the project address http://85.122.23.90 and the primary stored data are available at ftp://85.122.23.90. As a security measure, the password is encrypted in order to protect personal information. Access to the computer tools that can affect the database is also divided into levels of rights. However, to support the user and simplify access to the software platform, all services are accessed with the same username and password, with rights depending on the rank of the user account.

4. The Process of Dictionary Digitization

Digitization is the process by which, using a scanner, a document (a book) is transposed from physical form (paper, manuscript, book, volume) into electronic format (pdf, tif, jpg files). This method facilitates access to information: the document can be consulted online and accessed simultaneously from several locations. Moreover, it is a means of distributing rare documents, whose physical consultation may cause their deterioration.
4.1. The Acquisition of Dictionaries (Scanning)

This first stage is carried out with a professional planetary scanner, which photographs the pages in a controlled environment. For optimal quality of the captured images, the following settings must be taken into account:
- cold white light from auxiliary lamps;
- the light produced by camera flashes is not used, because it makes the characters brighter, thus lowering the contrast between the letters and the background, and in other cases it causes blurring, which seriously reduces the quality of the captured image;
- the photo cameras are in manual mode;
- the environment is not exposed to any other source of light;
- the exposure times are between 1/15 s and 1/30 s, depending on paper quality;
- auto focus;
- the ISO level is set to 200.

One of the major problems in the scanning process is the quality of the paper. For example, thin sheets affect this stage because of their transparency, which lets the letters from the reverse and from the following pages show through in the capture. This effect can be minimized by placing a matt black board under the page. More details can be found in the tutorial offered by E-BOOK ENLIGHTENMENT [3].

[3] http://en.flossmanuals.net/e-book-enlightenment/scanning-book-pages/

4.2. Capture Editing

The captured images are edited with the application BookDrive Editor Pro [4]. In this stage, the images are cropped in order to separate the content of the dictionary page from the rest of the capture. Another task is rotating the images to portrait orientation, so that the lines of text are correctly aligned with the bottom of the page. In the next step, the new images are saved in tif format with dynamic compression, in black-and-white.
In this stage, some pages require more advanced editing, which involves adjusting the contrast and the levels of the primary colors, or reducing noise.

[4] http://www.atiz.com/bookdrive-editor-pro/

4.3. The Process of Optical Character Recognition (OCR)

The images acquired in the previous stage were processed and indexed in a database. The processing consisted of character recognition (OCR), applied to each page with the ABBYY FineReader 9.0 library. The information was thus stored in the database in two forms: as an image of the dictionary page and as text. The efficiency of the OCR may be affected by paper quality, by the conditions in which the dictionaries were stored, by the passing of time, or by the typographic quality. As a result, the recognized text may contain errors introduced by the OCR process.

5. The Primary Editing of Data (Splitting into Entries)

The unit of indexing information within dictionaries is the entry, the lexicographic definition of a word. The digitization process is followed by the stage of identification/segmentation of dictionary entries.

Figure 1: Types of dictionary

Analysing the formatting styles and the position within the page is the most viable solution: a comparison of the text recognized by OCR with the original shows that the information contained in the text itself should not be relied upon, as there is no certainty about the validity of the processed data. However, this solution is hindered by the diversity of formatting styles. In practice, there are dictionaries in which the entry titles:
- are aligned before, after, or on the same level with the definition;
- are preceded or not by alphanumeric series or punctuation marks;
- have the same formatting style as the body of the definition;
- are written in lower case, upper case, or a combination of these.

To address these problems, an algorithm was created that analyses each page of the dictionary and identifies the title terms that appear in the text, based on both the formatting of each word and its position within the page. This approach aims at finding the common element in each set of features that characterize the dictionaries.
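The idea of locating entry titles from formatting and position, without trusting the recognized text, can be sketched as follows. This is a minimal illustration only: the `Word` record, the `find_entry_titles` function, the bold-at-left-margin heuristic and the 10-pixel tolerance are all assumptions made for this example, not the project's actual algorithm, which has to cope with the much wider variety of layouts listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical record for one word of OCR output."""
    text: str   # recognized text (may contain OCR errors)
    bold: bool  # formatting attribute reported by the OCR engine
    left: int   # x-coordinate of the word's left edge, in pixels
    line: int   # index of the text line on the page


def find_entry_titles(words: List[Word], margin_tolerance: int = 10) -> List[str]:
    """Return the words that look like entry titles on one page.

    Assumed heuristic: a title is the first word of a line, set in bold,
    and starting at the leftmost margin of the page. Only style and
    position are inspected, never the recognized text itself.
    """
    if not words:
        return []
    page_margin = min(w.left for w in words)

    # Keep the leftmost word of every line.
    first_on_line = {}
    for w in words:
        if w.line not in first_on_line or w.left < first_on_line[w.line].left:
            first_on_line[w.line] = w

    titles = []
    for line in sorted(first_on_line):
        w = first_on_line[line]
        if w.bold and w.left - page_margin <= margin_tolerance:
            titles.append(w.text)
    return titles
```

A dictionary whose titles are indented, or set in the same style as the definition body, would need a different rule for the same function; that variability is what makes a single heuristic insufficient in practice.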

For the Romanian->English direction, model #3 was the best performing of the five, with a BLEU score of 57.01. For the English->Romanian direction, scores were a bit lower, model #2 having the highest score, 53.94 BLEU points.

Table 3: BLEU scores for various translation flows

RO->EN              EN->RO
Model #   BLEU      Model #   BLEU
#1        56.31     #1        52.43
#2        56.49     #2        53.94
#3        57.01     #3        49.97
#4        56.79     #4        49.12
#5        56.89     #5        48.70

The next step was to estimate the translation time of the ALL corpus. Moses offers two different translation options: the default translation search and the cube pruning search algorithm. There are two adjustable parameters: the stack size and the beam search. These parameters were set manually to gain insight into their influence on translation speed and quality. We present only model #3 for the RO->EN direction. The translation time includes the loading time of the language model and of the translation/generation tables. The test machine is a dedicated server with 16 cores (8 physical + 8 virtual, running at 2.6 GHz) and 12 GB of RAM.

Table 4: Model #3 RO->EN: Parameter variation, translation time and BLEU scores

Stack Size Param.   Beam Search Param.   Translation Time (s)   BLEU Score
(default)           (default)            3074                   57.01
100                 (default)            1611                   56.69
50                  (default)            831                    56.05
20                  (default)            391                    54.97
15                  (default)            307                    54.36
10                  (default)            229                    53.16
5                   (default)            144                    51.35
(default)           100                  83                     39.17
(default)           10                   83                     43.29
(default)           2                    87                     47.17
(default)           1                    93                     49.63
(default)           0.5                  151                    51.80
(default)           0.1                  169                    55.84
100                 1                    106                    49.63
Cube pruning algorithm with stack size 2000   167               56.29

Table 4 shows the translation times and BLEU scores (RO->EN direction) measured on the test files (1,200 sentences) for different settings of the Stack Size and Beam Search parameters.
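One way to read Table 4 is as a budgeted choice: among the settings that finish within an acceptable time, take the one with the highest BLEU score. The sketch below encodes the rows of Table 4 and applies that rule; the function name `best_within_budget` and the 200-second budget used in the usage line are assumptions made for illustration, not part of the reported experiments.

```python
# (setting, translation time in seconds, BLEU score), copied from Table 4.
TABLE_4 = [
    ("stack default, beam default", 3074, 57.01),
    ("stack 100, beam default", 1611, 56.69),
    ("stack 50, beam default", 831, 56.05),
    ("stack 20, beam default", 391, 54.97),
    ("stack 15, beam default", 307, 54.36),
    ("stack 10, beam default", 229, 53.16),
    ("stack 5, beam default", 144, 51.35),
    ("stack default, beam 100", 83, 39.17),
    ("stack default, beam 10", 83, 43.29),
    ("stack default, beam 2", 87, 47.17),
    ("stack default, beam 1", 93, 49.63),
    ("stack default, beam 0.5", 151, 51.80),
    ("stack default, beam 0.1", 169, 55.84),
    ("stack 100, beam 1", 106, 49.63),
    ("cube pruning, stack 2000", 167, 56.29),
]


def best_within_budget(rows, max_seconds):
    """Return the row with the highest BLEU among those not slower than
    max_seconds, or None if no setting fits the budget."""
    candidates = [row for row in rows if row[1] <= max_seconds]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[2])


# Hypothetical budget of 200 seconds for the 1,200 test sentences.
choice = best_within_budget(TABLE_4, 200)
print(choice)
```

With a very large budget the same function returns the default-parameter row, so the selected setting depends entirely on how much translation time is considered acceptable.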
Even though the best-performing translation was achieved with the default parameters (BLEU score: 57.01), its translation time is very long; we found that the best compromise is the cube pruning algorithm with stack size 2,000, which obtains a marginally lower BLEU score of 56.29. With the cube pruning algorithm, increasing the stack size beyond 2,000 did not produce any noticeable score improvement on our test set.

4. Cascaded translation

To improve the quality of automatically translated texts, they are usually post-edited by human experts. To speed up post-editing, Ehara (2011) presented the EIWA ensemble, which uses a commercial rule-based MT system (specialized in patent translation) for the first step and a Moses-based SMT system for the second step (called statistical post-editing).

Romanian-English Statistical Translation at RACAI

In (Tufis and Dumitrescu, 2012) we introduced the notion of cascaded translation using the same SMT system trained on different parallel data. Except for the training data and the different parameter settings, the two systems are incarnations of the same basic system. The first system, S1, trained on the parallel data {C_A, C_B}, learnt to produce draft translations from L_A to L_B. The second translation system, S2, trained on the "parallel" data {S1(C_A), C_B}, learnt how to improve the draft translations. There are several other methodological differences between our system and the one described in (Ehara, 2011). EIWA does not work in real time because, before the proper translation of a text T, the SMT post-editor is trained on a text similar to T. The similar text is constructed from a large parallel patent corpus (3,186,284 sentence pairs) by selecting, for each sentence in T, an average of 127 similar sentences. Contrary to Ehara (2011), we found that setting the distortion parameter to a non-null value improves the translation quality.
In our system, translation of a new, unseen text is achieved in real time (no retraining at translation time).

Based on the experiments reported in the previous section, we used the two best-performing models (model #3 for the RO->EN direction and model #2 for the EN->RO direction) with the cube pruning search algorithm to translate each side of the ALL parallel corpus {C_RO, C_EN}. We obtained two new corpora: for the RO->EN direction, a translated corpus in English paired with its reference translation, {S1(C_RO), C_EN}, and for the EN->RO direction, a translated corpus in Romanian paired with its reference translation, {C_RO, S1(C_EN)}. After the translations, the two newly obtained "parallel" corpora were processed as discussed in Chapter 1. Using the same NLP tool that we used to annotate the original corpus, we annotated the translated corpora with lemmas, CTAGs and MSDs. Each of the two "parallel" corpora was used as training material for a second layer of the translation architecture, with the purpose of validating our intuition that a cascaded translation system may improve its translation accuracy by learning from its own mistakes.

4.1. Second layer translation system (S2)

To translate from a broken language L into a better version of L (L being either English or Romanian), we trained 9 models to see which one would perform best. Table 5 shows the models chosen (the notations used in the Details column have the same meanings as in Table 2) and Table 6 shows the translation times and BLEU scores using the cube pruning and default translation algorithms. The same models were used for both translation directions.
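The data flow of the cascade can be sketched as follows. This is a toy illustration only: a position-aligned word-substitution table stands in for a trained Moses system, and all function names (`train_word_table`, `translate`, `train_cascade`, `cascade_translate`) are hypothetical. What the sketch does reproduce is the order of operations described above: S1 is trained on {C_A, C_B}, the source side is translated with S1, and S2 is trained on {S1(C_A), C_B}.

```python
from collections import Counter, defaultdict


def train_word_table(pairs):
    """Toy stand-in for SMT training: map each source token to the target
    token that most often occupies the same position in the paired sentence."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            counts[s][t] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def translate(table, sentence):
    """Translate token by token; unknown tokens are copied unchanged."""
    return " ".join(table.get(tok, tok) for tok in sentence.split())


def train_cascade(corpus_a, corpus_b):
    """Train S1 on {C_A, C_B}, then S2 on {S1(C_A), C_B}."""
    s1 = train_word_table(zip(corpus_a, corpus_b))
    drafts = [translate(s1, sentence) for sentence in corpus_a]  # S1(C_A)
    s2 = train_word_table(zip(drafts, corpus_b))
    return s1, s2


def cascade_translate(s1, s2, sentence):
    """Apply the two layers in sequence: S2(S1(sentence))."""
    return translate(s2, translate(s1, sentence))


# Tiny made-up corpus, for illustration only.
c_ro = ["o casa", "o fata"]
c_en = ["a house", "a girl"]
s1, s2 = train_cascade(c_ro, c_en)
print(cascade_translate(s1, s2, "o fata"))
```

Because S2 only ever sees S1 output on its source side, both of its sides are in the same language, which is why the second layer behaves like an automatic post-editor rather than a translator.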
Tiberiu Boros, Stefan Dumitrescu, Radu Ion, Dan Stefanescu, Dan Tufis

Table 5: Translation flows variants for the second translation system

Model   Details
#1      t0-0 m0
#2      t1-1 g1-0 m0
#3      t1-1 g1-2 t2-2 g1,2-0 m0,m2
#4      t1-1 g1-3 t3-3 g1,3-0 m0,m3
#5      t1-1 g1-3 t3-3 g1,3-0 m0,m3 r3
#6      t1-1 g1-2 t2-2 g2-3 t3-3 g1,3-0 m0,m2,m3
#7      t0,1-0,1 m0
#8      t0,1,2-0,1,2 m0,m2
#9      t1,2-t1,2 m0,m2

The S2 translations (in both directions) were performed using the cube pruning search with stack size 2,000. The reordering model is the Moses default, with the only difference that in model #5 we used MSDs as the reordering factor. For testing S2 we used the same test files as for S1, as they were translated with the best S1 models: model #3 for the RO->EN direction and model #2 for the EN->RO direction (see Table 3). The reference translations for the two directions were T_EN and T_RO respectively (1,200 sentences each).

For the RO->EN direction, the BLEU score of the S1+S2 system improved from that of the best S1 model (57.01) to a new BLEU score of 60.90. The fact that the S2 translation based on model #7 (surface form & lemma with reduced MSD to surface form & lemma with reduced MSD, using only the surface language model) was the fastest and most accurate is not surprising: we "translated" from partly broken English into presumably better English. The generation steps in models #2, #3, #4, #5 and #6 were more detrimental than useful, but the information on the lemma eliminated some candidates from the search space. This observation suggests that there were few inflected word forms to be corrected and that most error corrections came from a more precise retrieval of the translation equivalents. Interestingly, the translation time using the default Moses parameters is very close to that of the cube pruning search (because the chosen model has just phrase translation and no generation component), but yields approximately 0.14 BLEU point increase.
Table 6: RO->EN: S2(S1(T_RO))

Model #   Transl. time (s)     BLEU with       Transl. time (s) with   BLEU with
          with cube pruning    cube pruning    default params.         default params.
#1        195                  60.42           257                     60.65
#2        186                  59.59           4745                    60.12
#3        175                  55.68           4129                    56.12
#4        281                  55.50           3994                    56.18
#5        221                  55.45           4104                    56.20
#6        244                  55.16           5016                    55.98
#7        108                  60.74           143                     60.90
#8        144                  58.50           254                     58.61
#9        136                  58.50           249                     58.61

Table 7 shows that for the EN->RO direction, the S2 system models #7 and #8 have a similar performance, increasing the BLEU score from the original 53.94 points to 54.44 (0.5 BLEU point net increase). As with the RO->EN direction, the S2 models that employ generation steps actually decrease the score slightly.

Table 7: EN->RO: S2(S1(T_EN))

Model #   Transl. time (s)     BLEU with       Transl. time (s) with   BLEU with
          with cube pruning    cube pruning    default params.         default params.
#1        254                  54.41           154                     54.42
#2        1443                 52.14           556                     52.55
#3        1051                 53.50           594                     53.50
#4        543                  53.59           798                     53.59
#5        530                  53.59           613                     53.59
#6        805                  53.56           997                     53.56
#7        282                  54.43           167                     54.44
#8        417                  54.41           287                     54.44
#9        403                  54.40           280                     54.42

Another interesting experiment was to evaluate the simple cascaded systems without feature models, that is (S1=#1)+(S2=#1), and to compare their performance with the direct translations and with the best feature-model cascaded systems. The results are shown in Table 8.

Table 8: S2(S1(T_source))

RO->EN'->EN         EN->RO'->RO
Model #   BLEU      Model #   BLEU
#1+#1     60.47     #1+#1     54.29
#3+#7     60.90     #2+#7     54.44

The increased accuracy due to various feature combinations versus the baseline system is apparent from Tables 6 and 7 compared to the results in Table 3. Table 8 shows that the direct translations (S1 with any model) for both directions have BLEU scores lower than the cascaded system (S1+S2) even when feature models were not used (model #1+#1).
Thus, we can support the statement that the morphological features and the cascading idea are beneficial to the overall accuracy of translations (at least between Romanian and English).

Finally, we took the cascading idea one step further by repeating the entire train-translate process (step 2), obtaining S3(S2(S1(T_source))). We observed that the translation stabilized, with very few sentences being changed (around 1%) and with only minor changes (increasing or even decreasing the BLEU score by less than 0.05 points). We concluded that further cascading would not bring significant improvements.

Overall, using our cascaded system we obtain a 3.89 BLEU point increase for the RO→EN direction and a smaller 0.5 BLEU point increase for the more difficult EN→RO direction. In (Dumitrescu et al., 2013) we showed that the cascaded translation is beneficial for translating both in-domain and out-of-domain input texts.

5. Analysis of errors in cascaded translation

We were interested in which translations were most distant from the reference, assuming that these were bad translations. For each sentence i we computed the similarity scores SIM between its translations and the reference translation. These scores were computed with the same BLEU-4 function used for bitexts. As with the BLEU score applied to a bitext, 100 means a perfect match and 0 means a complete mismatch. Thus, we obtained 1,200 pairs of scores $SIM_i^{S1}$ and $SIM_i^{S1+S2}$. We also computed the average similarity scores as $\frac{1}{1200}\sum_{i=1}^{1200} SIM_i^{S_\alpha}$, where $S_\alpha$ is S1 or S1+S2. As expected, the average SIM scores produce the same ranking as the BLEU scores, although they are slightly higher (e.g. 61.18 for S1 and 63.58 for S1+S2 for the RO→EN direction). We briefly comment on the results of this analysis for the Romanian-English translation direction. We manually analysed the test set translations. We identified 3 sentences whose translations had a zero SIM score for both systems.
The explanation was that the reference translation was wrongly aligned to the source sentence. S1 produced 72 perfect translations (score 100) while S1+S2 produced 105. Only 57 perfect translations were common to S1 and S1+S2, meaning that S1+S2 actually deteriorated a few of the originally correct translations. Analyzing the 15 "deteriorated" translations, we noticed that the S1+S2 output was identical to the S1 output and to the reference, except that the latter two either had a differently capitalized letter, which marginally lowered the score, or had multiword units joined by underscores (e.g. as well as vs. as_well_as). This was a small bug which has been removed and which, overall, brought a 0.05 increase in the BLEU score. Capitalization and punctuation are other sources of lower scores against the reference. All these examples show the sensitivity of the BLEU scoring method, especially for very short sentences.

Another important variable is the amount of change from one layer to the other: out of all sentences, around 37% had a BLEU increase while around 20% had a BLEU decrease (but see the comment on the underscore difference); the remaining 43% were not changed in any way. The table below shows some examples of differences between the standard translation system and the cascaded one.

Table 9: Comparison between the standard SMT and the cascaded SMT

Example 1
  S1 (BLEU 0.36): the area is situated in the Golful normand-breton , in the southern part of the Manecii
  S2 (BLEU 0.83): the area is situated in the Normano-Breton Gulf, on the south side of the Manecii
  Reference:      the area is situated in the Normano-Breton Gulf, on the south side of the English Channel

Example 2
  S1 (BLEU 0.58): other information, provided that they have a suitability and reliability can be reasonably demonstrated.'
  S2 (BLEU 0.84): other information, provided that its suitability and reliability can be reasonably demonstrated.'
  Reference:      other information provided that its suitability and reliability can be reasonably demonstrated .'

Example 3
  S1 (BLEU 0.3):  ( 13 februarie 1934) is an American actor, film and television.
  S2 (BLEU 0.58): (February 13, 1934) is an American film and television actor.
  Reference:      (February 13, 1934) is an American film, stage and television actor.

Example 4
  S1 (BLEU 0.46): Speer was made available to historians and other scholars .
  S2 (BLEU 0.35): Speer was made available to historians and scholars.
  Reference:      Speer made himself widely available to historians and other enquirers .

We can see that, in general, sentences are improved. This usually happens in three distinct ways:
- S1 fails to translate words which are subsequently translated by S2. This is counter-intuitive, because essentially no new information is added to the system. It happens because some phrases are automatically pruned from S1's phrase table, so that, for example, unigrams found in the training corpus may be missing from the phrase table. S2, on the other hand, does not miss these unigrams and therefore translates them, as shown in the first sentence in Table 9.
- Better word ordering. In the third sentence in Table 9 we can see that S2 translates February and then reorders the day and month to match the English date format.
- General phrase substitution. Sometimes there are more appropriate phrases, as shown in the second sentence. While S1's translation might be accurate from a Romanian word-for-word perspective, S2 manages to find a better phrase than '... provided that they have a ...'.

However, the system sometimes degrades sentences, usually by shortening them. In example 4, we see that 'other' from S1's translation is removed by S2.

6. Experiments on translating in-domain versus out-of-domain texts

The astute reader might have noticed that our evaluations used in-domain data. The text data have been randomly extracted from the ALL corpus.
Although not seen during the training phase, the test data qualify as in-domain data. In (Dumitrescu et al., 2012) we provided a detailed analysis of experiments with several translation systems, corresponding to the 7 distinct domains (see the Introduction section) plus the system trained on the concatenation of the 7 domain-specific corpora (the ALL corpus). As these types of experiments are very time consuming, we considered only one translation direction (RO→EN). Depending on the way the training was performed, we obtained 48 MT systems with different performance: 16 baseline systems (see Table 2, model #1: t0-0 m0) and 32 factored translation systems.

Eight baseline systems were generated as surface phrase-based systems from the domain-specific corpora, using the same language model built from the ALL corpus; their performance is shown in Table 10. The other eight baseline systems were generated only from the respective domain-specific corpora; their performance is shown in Table 11.

Table 10: Translation results using the baseline systems with the domain-independent LM

Training corpus   Test domain
                  DGT     EPL     LIT     MED     NWS     SPK     WIKI5
DGT               51.45   31.98   9.80    34.16   27.74   15.45   21.1
EPL               37.73   40.97   13.13   29.97   31.98   23.34   22.74
LIT               8.31    8.76    14.09   12.44   9.01    11.49   12.48
MED               25.76   18.7    6.97    54.54   15.85   12.05   15.81
NWS               26.15   31.82   10.47   25.83   40.07   20.02   22.21
SPK               20.21   28.08   13.75   26.96   24.33   27.95   22.7
WIKI5             30.67   31.59   13.35   31.66   32.2    22.11   29.51
ALL               51.43   40.89   18.00   53.46   39.31   26.73   29.95
(diff)            0.02    0.08    -3.91   1.08    0.76    1.22    -0.44

Table 11: Translation results using the baseline systems with the domain-dependent LM

Training corpus   Test domain
                  DGT     EPL     LIT     MED     NWS     SPK     WIKI5
DGT               51.94   25.24   7.55    22.17   19.16   10.98   16.51
EPL               26.99   40.85   10.96   19.34   24.51   19.89   18.51
LIT               6.93    6.33    14.33   10.34   7.60    8.79    12.3
MED               17.24   12.11   5.18    55.34   12.60   8.39    13.58
NWS               15.64   24.31   8.3     16.48   40.23   15.24   19.05
SPK               10.42   18.91   11.49   15.34   16.24   28.65   16.9
WIKI5             18.10   21.99   11.07   21.03   24.08   17.83   29.99
ALL               51.43   40.89   18.00   53.46   39.31   26.73   29.95
(diff)            0.51    -0.04   -3.67   1.88    0.92    1.92    0.04

The 32 factored translation systems were constructed as described above, in Section 3, selecting the best performing model flow for our language pair and direction (RO→EN). As shown in Table 3, this was model #3: t1-1 g1-2 t2-2 g1,2-0 m0,m2. Depending on how the lexical language model (LLM, m0) and the grammatical language model (GLM, m2) were built and used in the respective factored translation systems, we generated the following:
- 8 systems using LLM (m0) and GLM (m2) generated from the ALL corpus; their performance is shown in Table 12.
- 8 systems using LLM (m0) generated from the ALL corpus and GLM (m2) generated from the domain-specific corpora; their performance is shown in Table 13.
- 8 systems using LLM (m0) and GLM (m2) generated from the domain-specific corpora; their performance is shown in Table 14.
- 8 systems using LLM (m0) generated from the domain-specific corpora and GLM (m2) generated from the ALL corpus; their performance is almost identical to that shown in Table 14 and is not discussed here.
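The count of 48 systems follows from crossing the eight training corpora with the choice of language model source. The sketch below merely enumerates that grid; the variable names and the labels "ALL" and "domain" for the LM source are illustrative assumptions, not identifiers from the experiments.

```python
from itertools import product

# Seven domain corpora plus their concatenation (the ALL corpus).
DOMAINS = ["DGT", "EPL", "LIT", "MED", "NWS", "SPK", "WIKI5"]
TRAINING_CORPORA = DOMAINS + ["ALL"]

# Hypothetical labels: an LM is built either from the ALL corpus or from
# the same corpus the translation model was trained on.
LM_SOURCES = ["ALL", "domain"]

# Baseline systems (model #1): one surface LM per system.
baseline_systems = list(product(TRAINING_CORPORA, LM_SOURCES))

# Factored systems (model #3): a lexical LM (m0) and a grammatical LM (m2),
# each chosen independently.
factored_systems = list(product(TRAINING_CORPORA, LM_SOURCES, LM_SOURCES))

total_systems = len(baseline_systems) + len(factored_systems)
```

Each of the four bullet points above corresponds to one (LLM source, GLM source) pair, with eight training corpora per pair.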
Table 12: Factored translation results using domain-independent lexical and grammatical LMs

Training corpus   Test domain
                  DGT     EPL     LIT     MED     NWS     SPK     WIKI5
DGT               46.43   30.72   10.02   30.91   25.43   15.95   21.16
EPL               33.32   39.12   12.86   27.05   28.62   22.67   22.47
LIT               9.3     9.64    14.31   15.8    10.57   13.32   13.44
MED               23.45   18.87   7 A     48.65   15.76   13.04   16.07
NWS               25.51   30.7    11.04   24.42   38.03   19.9    22.83
SPK               19.71   26.83   13.28   24.28   22.9    26.66   21.69
WIKI5             28.21   29.59   13.19   29.43   29.65   21.7    28.55
ALL               45.63   38.08   16.11   45.72   35.77   25.64   28.08
(diff)            0.8     1.04    -1.8    2.93    2.26    1.02    0.47

Table 13: Factored translation results using domain-independent lexical LM and domain-dependent grammatical LM

Training corpus   Test domain
                  DGT     EPL     LIT     MED     NWS     SPK     WIKI5
DGT               46.51   30.55   9.94    30.35   24.73   15.58   20.91
EPL               32.98   39.06   12.82   26.12   28.06   22.48   21.96
LIT               9.13    9.4     14.43   15.67   10.42   12.98   13.28
MED               23.1    18.41   6.71    49.55   15.64   12.54   15.91
NWS               25.1    30.66   10.92   24.17   38.49   19.33   22.61
SPK               19.32   26.4    13.12   23.45   22.58   26.82   21.53
WIKI5             27.88   29.12   13.17   28.64   29.48   21.28   28.77
ALL               45.63   38.08   16.11   45.72   35.77   25.64   28.08
(

RI-00000059-n   n   Peștera_Polovragi x   Peștera in județul Gorj   ENG30-09238926-n   instance_hypernym

The name of an administrative building is often used to refer to the institution whose headquarters it houses or to the head of that institution: "Guvernul a anuntat anterior participarea premierului Ponta la CSAT joi, de la ora 14.00, insa ulterior Palatul Victoria si-a retras anuntul [...]" (http://cursdeguvernare.ro/ponta-a-anuntat-sedinta-csat-inaintea-lui-antonescu-privind-atentatul-din-bulgaria.html accessed on 26th February, 2013). Thus, there appears the need to add further metonymic synsets in such cases, given that there already are such metonymies in wordnet: see White_House:1 {the chief executive department of the United States government} and White_House:2 {the government building that serves as the residence and office of the President of the United States}.
Counties and their capitals were already in RoWN due to a BalkaNet initiative of adding some area-specific concepts to the wordnets in the project. What is important is that from these toponyms we can create the names of the inhabitants of the places or of the persons born in those places. This can be an automatic task: given the list of toponyms and a list of specific suffixes (i.e. suffixes that create names of inhabitants or of people born in a place), we can automatically attach the suffixes to the toponyms. The results are automatically validated on the web, by establishing a threshold for the number of occurrences of a word above which we validate the form. We resort to manual intervention whenever it is necessary. The same approach to validating automatically derived words in Romanian was adopted by Petic (2012). A gloss can also be attached automatically to these synsets designating names of people. This remains future work.

4.2. Using the API to improve the current RoWN

Besides extending RoWN with new instantiations, while developing RoWordNetLib we identified some errors in the current version of the wordnet and gained some interesting insights. We present only a few of them below.

One technical error identified was the use of XML-reserved characters in the definitions and in the SUMO tag "TYPE". For example, the tag was written as "". The character ">" is not allowed in the content between tags, as it usually means the end of a tag. Instead, the character should be represented as "&gt;". Using the SAX XML parser we were able to identify these minor errors and correct them.

Stefan Daniel Dumitrescu, Virginica Barbu Mititelu

Another issue we discovered during the development of RoWordNetLib was that some of the synsets did not have a DEFINITION tag: 53 of the almost 60000 synsets were missing a definition. The synsets have been corrected by adding the missing definitions programmatically, using RoWordNetLib.
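Both checks described above can be sketched in a few lines. The example below is not RoWordNetLib code (the API itself is written in Java); it is a minimal Python illustration using only the standard library, and the sample document, its element layout and the synset IDs are invented for the demonstration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Invented sample; the element layout is an assumption for illustration only.
SAMPLE = """<ROWN>
  <SYNSET><ID>demo-1</ID><DEFINITION>first gloss</DEFINITION></SYNSET>
  <SYNSET><ID>demo-2</ID></SYNSET>
  <SYNSET><ID>demo-3</ID><DEFINITION>  </DEFINITION></SYNSET>
</ROWN>"""


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def synsets_missing_definition(xml_text):
    """Return the IDs of synsets with no DEFINITION element or an empty one."""
    root = ET.fromstring(xml_text)
    missing = []
    for synset in root.iter("SYNSET"):
        definition = synset.findtext("DEFINITION")
        if definition is None or not definition.strip():
            missing.append(synset.findtext("ID"))
    return missing


missing_ids = synsets_missing_definition(SAMPLE)

# Reserved characters are replaced by entities before being written out.
escaped = escape("a > b & c")
```

A parser-based check of this kind reports the position of the first offending character, which makes such errors easy to locate in a large file.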
Having all synsets indexed means that we can perform different counts to obtain some interesting insights and statistics. For example, before adding the geonames.org entities, we had 41063 noun synsets, 10397 verb synsets, 3066 adverb synsets and 4822 adjective synsets. We can also obtain a relation frequency table. For example, we have 3889 instance_hypernym relations, meaning there are 3889 instantiations of wordnet concepts. While RoWN contains 48316 hypernym relations, it contains only 3 near_participle relations and only one near_domain_topic relation. In total we have 37 different relations. Other statistics can be obtained, such as the synsets with the most instantiations, the synsets with the most hyponyms (or any other relation), etc.

5. Conclusions

This article presented another step in the development of the Romanian WordNet. We have written a Java API specifically to allow easy access to and editing of RoWN. The API itself was designed to be as easy to use as possible, and easy to extend with new functionalities as RoWN itself evolves. The API offers basic access to RoWN, whether in the original XML format or stored as an SQL database, basic operations to manipulate synsets, set operations (such as union, difference and intersection) applied to RoWN objects, as well as more complex operations such as breadth-first search on the noun graph. The development of the API has allowed us to identify and correct a few small inconsistencies in the wordnet, and also to obtain insights into its structure and entity distribution using the embedded statistics functions.

The main achievement using the API was to actually expand RoWN by semi-automatically instantiating concepts with entities extracted from an external source (GeoNames - geonames.org). We focused on geographical entities belonging to the classes: administrative buildings, airports, caves, deltas, islands, monasteries, passes, hills, and plains.
It is worth mentioning that geography was chosen because of the large amount of such information available on the Internet. However, this is only the starting point of this experiment and other domains will also be covered. Some of the enumerated classes are of utmost importance for the domain, while others are relevant for the administrative domain and occur frequently in news of national concern, in which they offer the local coordinates of the event. All these are different categories in GeoNames, so we manually specified for each of them its correspondent in the wordnet.

For future work, we intend to make the API freely available on a public platform such as sourceforge or github. Currently, while perfectly usable, the code needs optimizing and lacks comments on each of its functions. Usage examples also need to be given to help users understand the structure of RoWordNetLib and how to use the code. Also as future work, we intend to use the developed API to create an extension of RoWN containing most of the Romanian entities in geonames.org, as well as entities from other sources. This extension will be optional when loading RoWN with RoWordNetLib.

Instantiating Concepts of the Romanian Wordnet

Mountains and rivers are important geographical entities. In the case of the former, for example, we consider that it is not enough to simply include them as instances of mountain. Geographers distinguish groups, ranges, mounts and peaks. We want to capture the same information in our network. That is why we decided to depart from GeoNames in such cases and to start from other available sources of such information when adding them to RoWN. This will be future work.

References

Buscaldi D., Rosso P. (2008). Using GeoWordNet for Geographical Information Retrieval. Revised Selected Papers CLEF-2008. Springer-Verlag, LNCS(5706), 863-866.
DEX - Coteanu, I., L. Seche, M. Seche eds. (1996). Dictionar explicativ al limbii romane (DEX). Bucuresti: Editura Univers Enciclopedic.
Fellbaum, C. (ed.) (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Petic, M. (2011). Automatizarea procesului de creare a resurselor lingvistice computationale, PhD Thesis. Institutul de Matematica si Informatica al ASM.
Tufis, D., Barbu, E. (2004). A Methodology and Associated Tools for Building Interlingual Wordnets. Proceedings of the 5th LREC Conference, 1067-1070.
Tufis, D., Cristea, D., Stamou, S. (2004). BalkaNet: Aims, Methods, Results and Perspectives. Special Issue on BalkaNet of the Romanian Journal of Information Science and Technology, 7(1-2), 9-43.

STEPS TO A NEW DTD AND SCD-BASED DICTIONARY ENTRY PARSER. OPTIMIZING RECURSIVENESS IN SENSE DEPENDENCY HYPERGRAPHS

NECULAI CURTEANU 1, ALEX MORUZ 1,2, SVETLANA COJOCARU 3

1 Institute of Computer Science, Romanian Academy, Iasi Branch, Romania
2 Faculty of Computer Science, "Al. I. Cuza" University, Iasi, Romania
3 Institute of Mathematics and Computer Science, Chisinau, Republic of Moldova

ncurteanu@yahoo.com, alex.moruz@gmail.com, svetlana.cojocaru@math.md

Abstract

In previous papers we developed the dictionary-entry text version of the parsing method of SCD (Segmentation-Cohesion-Dependency) configurations, which was applied to six of the largest (Romanian, French, German, and Russian) thesaurus-dictionaries, with outstanding efficiency and portability results. In the SCD method, the Dependency Hypergraph (DH) describes, for a dictionary, the specific pre-established dependency relations between the sense marker classes of that dictionary. The DH of a dictionary is akin to its fingerprint. The present paper solves the following problem: transforming the sense DHs with non-embedded cycles and / or troublesome (e.g. disconnected) hypernodes into DHs having only structurally embedded recursive cycles and linearly connected hypernodes.
The DH optimization is based on a total ordering of literal enumeration(s) within sense marker classes, obtaining linearly embedded cycles for all DHs that represent an SCD parsing level. This solution opens the effective possibility of constructing the least upper bound (LUB) of several optimized DHs, the associated parametrized grammars of such LUB DHs yielding the formal descriptions of a sound DTD and of a general SCD-based parser for very large dictionaries.

Keywords: sense marker-depending renaming of the literal enumeration, total ordering of sense levels, parametrized grammar

1. Introduction

In (Curteanu et al., 2008, 2010, 2012) we applied the method of Segmentation-Cohesion-Dependency (SCD) configurations to model and parse the following six, sensibly different, largest Romanian, French, German, and Russian thesaurus-dictionaries: DLR (The Romanian Thesaurus - new format), DAR (The Romanian Thesaurus - old format), TLF (Le Tresor de la Langue Francaise), DWB (Deutsches Worterbuch - GRIMM), GWB (Goethe-Worterbuch), and DMLRL (Dictionary of Modern Literary Russian Language). Parsing a dictionary entry means identifying its lexicographic segments (the first SCD configuration, SCD-config1), extracting its sense tree (SCD-config2), and parsing the atomic sense definitions (SCD-config3).

When applied to dictionary entry parsing, the method of SCD configurations merges the following sequence of (at least) three specific configurations, i.e. lexical-semantics sense levels:
(a) The first one, abbreviated hereafter SCD-config1, performs the segmentation and dependencies for the lexicographic segments (Hauser & Storrer, 1993), (Erjavec et al., 2001) of each dictionary entry.
(b) Stepping down into the lexicographic segments of a thesaurus-dictionary entry, the second SCD configuration (SCD-config2) usually parses the lexicographic segment of sense description, extracting its sense tree structure (Curteanu et al., 2008, 2010). Actually, SCD-config2 parses the entry sense definitions of larger lexical-semantics granularity in the sense description segment: primary, secondary, and literal / numeral enumeration senses.
(c) The third SCD configuration (SCD-config3) continues to refine the sense definitions of SCD-config2, parsing each node in the generated sense tree to obtain the atomic definitions / senses, i.e. the finest-grained meanings of the dictionary entry.

In (Curteanu et al., 2010, 2012), we gave a structural analysis of the dictionary entry parsing (DEP) process, comparing the classical approaches to DEP with the method of SCD configurations applied to dictionary entry text. In the classical (also called standard) DEP, which relies largely on lexicographic formal grammars, e.g. (Neff & Boguraev, 1989), (Tufis et al., 1999), the LexParse system in (Hauser & Storrer, 1993) and (Lemnitzer & Kunze, 2005), the main problem is that the sense tree construction of each entry is recursively embedded and mixed within the definition parsing procedures. The formal proof of optimality for the parsing method of SCD configurations compared to the standard DEP (Curteanu et al., 2012) shows that the latter DEP strategy contains three embedded cycles, corresponding to the parsing processes of lexicographic segment recognition, sense tree extraction (for entry senses defined by explicit marker classes), and atomic definition parsing.
The main power and novelty of the SCD-based parsing method is that it succeeds in separating the three above-mentioned parsing cycles / processes and running them sequentially, on independent parsing levels (viz. SCD configurations). Since the process of sense tree construction in the method of SCD configurations can be made completely detachable from that of parsing the atomic sense definitions, the whole DEP process with the SCD-based method is much more efficient and robust, actually optimal (Curteanu et al., 2010).

There are (at least) two distinct and specific features of the SCD parsing method: (a) the breadth-first search of all the sense markers of an entry, and (b) working directly on the sense marker sequence(s), and only on them (for the SCD configurations of larger semantic granularity senses), to compute the sense tree of the entry. These properties of the new parsing method of SCD configurations have been effectively tested by in-depth parsing experiments on six of the largest Romanian, French, German, and Russian thesaurus-dictionaries (Curteanu et al., 2010, 2012). Remarkably, the proposed parsing method is a completely formal grammar-free approach, with the parsing program for a new thesaurus readily adaptable within weeks (depending on its lexicographic modeling task, specific to each dictionary), thus providing an outstanding portability of the parsing program, on both linguistic and computational grounds.

As a major computational component of the SCD parsing method, the Dependency Hypergraph (DH) of a thesaurus-dictionary embodies (through the SCD lexicographic modeling) the pre-established hierarchy between the sense marker classes of that dictionary, being actually its true semantic "fingerprint".
The study of DHs for various thesaurus-dictionaries has a special importance for both lexicography and parsing technology: (a) DHs have to reflect the regular (and irregular) representations of the sense dependencies. (b) The comparison between various specific DHs is the best opportunity to simplify, regularize, and standardize (b1) the dictionary sense marker classes, (b2) the rules encoding the sense definition markers, and (b3) the dependencies that can be established between the sense / subsense marker classes of the dictionary. The careful analysis of DHs for various thesaurus-dictionaries, based on the parsing method of SCD configurations, has important consequences for cross-linguistic lexicography-lexicology research: establishing better, standard and optimal design rules for the dictionary sense markers, which entail optimized DHs of sense dependencies for the most complex thesaurus-dictionaries, whether existing or newly designed.

We strongly emphasize that this paper does not deal with effective parsing experiments on one or several dictionaries, but with the formal representation and optimization of DHs as computational objects in the parsing method of SCD configurations, with important consequences for the design of a new, procedural DTD and of a formal, general SCD-based parser for very large thesaurus-dictionaries.

2. The Problem of Non-Embedded Call Cycles in Dependency Hypergraphs

[Figure 1: DHs for DLR, DAR, and DMLRL (Curteanu et al., 2012)]

The project of a new, procedural DTD for dictionaries, based on the formalization of the SCD parsing method, is presented in (Curteanu & Moruz, 2012b). It is based on parametrized (par-)grammars encoding the sense DHs that correspond to the SCD parsing levels (SCD configurations) in a dictionary entry.
Two par-grammars for DLR are enclosed, as a typical sample from a larger package of combined grammars for the six above-mentioned dictionaries. This package should be constructed as the "least upper bound" (LUB) of all the par-grammars written for the parsed dictionaries on the SCD configurations (parsing levels). Such a package of par-grammars should represent the DTD description of an SCD-based, formally defined parser for large dictionary entries, and thoroughly extends the current DTD in the standard (XCES TEIP5, 2007).

The soundness analysis of sense structure definitions in thesaurus-dictionaries, carried out in (Curteanu & Moruz, 2012a), revealed the special problems raised by the recursive calls between secondary sense markers (i.e. filled and empty diamonds, ♦ and ◊, or their sense marker equivalents) and the literal enumerations (i.e. a), b), c), ...) in certain special entries, presented in Section 3. One final element has to be explained in order to understand the problem whose solution we propose in this paper.

The main problem we are dealing with in this paper is the following: how to transform DHs of the kind in Fig. 1 (with non-embedded recursive call cycles and non-connected hypernodes) into linearly recursive DHs, with completely embedded call cycles, such as the DHs in Fig. 3. The optimized DHs displayed in Fig. 3 are suitable for LUB-computing (by unification-matching algorithms), the LUB(DH_j) being the DH in Fig. 4, whose par-grammar is devised in Section 6 as the new, procedural DTD representing the primary and secondary sense marker classes on the SCD-config2 parsing level.

3.
Atypical Entries Generating DHs with Non-Embedded Cycles

The in-depth analysis in (Curteanu & Moruz, 2012a) discussed the cyclic calls between secondary sense markers and literal enumerations, and pointed out examples of such atypical entries in DMLRL, DLR, and DAR, where the recursive calls for literal enumeration are mixed with secondary sense markers (filled and empty diamonds, or their correspondents). These entries are "LUMINA" in DLR, "CAL" in DAR, and "ELI" in DMLRL (Curteanu & Moruz, 2012a); see the excerpts below.

(Ex. 3.1) The entry "LUMINA" in DLR Romanian thesaurus-dictionary (excerpt):

LUMINA s.f. A. I. (Predomina sensul concret de radiatie; in opozitie cu intuneric) 1. (Adesea cu determinari calificative) Radiatie care face corpurile vizibile. ii (Ca atribut al universului, al naturii ambiante; componenta a lumii inconjuratoare) Laudaii-l toate stealele si.........goneascd Cat vafi camp de gonit §i lumind de zdrit". ALECSANDRI, O. I, 8. a) (Ca radiatie solara, element al peisajului diurn) Voi intoarce lumira soarelui de cdtrd voi, de va fi intunrearecu (a. 1600). CUV. D. BATR. II, 49/9. Lmnina soarelui face dzua. PRAV. 141.......... .........Deopotrivd se gdseste-n toate Amestecatd umbra si lumind. ISANOS, V. 281. ♦ Loc. adj. De lumina = a) luminos, sclipitor; spec. (despre ochi) stralucitor. Deundzi... md simtii citfiindat ca intr-un nor intunecos ... Ancufo! tu ai prefdcut acel nor......ODOBESCU, S. I, 143..........Ochi de lumind avea fiul lui Ieronim, privirea lui in noapte fulgera. ROMANIA LITERARA, 1970, nr. 93, 17/3; b) (despre un spatiu, un loc) in care patrunde lumina (A I 1), plin de lumina Acest loc ... era pe atunci, in 1650, un ochi de lumind in mijlocul marelui codru al Capotestilor. IORGA, C. I. II, 5; c) (despre plante) care traieste la lumina (A I 1). Dupd o fazd de 2-3 ani cu flora de buruieni de lumind, urmeaza faza de fdneata cu ierburi cu rizomi. CHIRITA, P. 71. | Loc. adv.
Pe (sau, rar, la) lumina = in timpul zilei (I 2), de.....................ARH3VA R. I, 87/20. A inviat din mortiLumina ducdndu-o Celor din mormintel EMINESCU, O. IV, 359. Zdmbetul sfdnt al martindui care-ntrevede ... lumina vieUi eteme. CARAGIALE, O. II, 64.................................................................................. b) (Ca radiatie reflectata de luna; element al peisajului nocturn) Luna, ... fire are lumina ce iase den ea sa turbure udaturile trupului. CORESL EV. 81................................ Neculai Curteanu, Alex Moruz, Svetlana Cojocaru (Ex. 3.2) The entry "CAL" in DAR Romanian thesaurus-dictionary (excerpt): NewPrg CAL s.m. Cheval. NewPrg Is. Numele generic al spe^ei cavaline; spec, individ masculin... NewPrg Adecd amii cailoru zdbalele ingurd la... ... {a large block of definitions and DefExems of the entry CAL} NewPrg In compozifii: NewPrg c) (Entom.) Cal-de-apa = o specie acalului-dracului, numita... NewPrg Calul-dracului = a.) insecta cu corpul lung... | (De aici) Baba rea... ; -b.) = cal-de-apa... NewPrg Calul-popii = a.) calul-dracului...; -b.) = cal-de-apa... Insecta lunga si cu aripile patate... NewPrg Cal-turtit = c a 1 u 1-d r a c u 1 u i... NewPrg b.) (Zool.; la romanii din A.-U.) Cal-de apa s. (dupa germ. Nilpferd) -cal-de-Nil = h i p o p o t a m LB., BARCIANU . NewPrg 2Q P. anal. (Mor.) Caii cu spetezele tin cosul si alcatuesc (Ex. 3.3) The entry "EM" in DMLRL Russian thesaurus-dictionary (excerpt): 2. B npH#aTOHHOH Hacra cjio>KHoro npexyiO)KeHHfl o6o3HanaeT nencTBHe, oGycuoBUHBaiomee co6oh to, o neivr coo6maeTCJi b raaBHOH Hacra. Kozda 6 pa36oimuKa odnaeow ne 63hjiu, To Mnozue eu\e 6bi nocmpadanu. MnxanK. BemeH, nec 3. 06o3Ha4aeT pa3JiHHHBie ottchkh >KejiaeMOCTH zteHCTBHs; |) Co6cTBeHHo ^cenaeMOCTb. Yhujicr 6u cbiu..........0 Ecjih 6bi, Koraa 6m, xoTb 6m h t. n. O, ecnu 6u Kozda-nu6ydb Cdbinacb nosma CHoeudeubRl ITymK. IIocji. k lOxmiry. [HnKOJiica:] Xomb 6u dueu3U0H nam 6uji cKopee zomoe. BynraKOB, ^hh Typ6. 0 C Heonp. 
rnar............Bom 6u noiiMamb! A. OcTp. He 6bijio hh rpouia............ McxynambCH 6u! Kynp. Ben. ny^eiib. // YnoTp. j\m Bbipa>KeHHa onaceHHH no noBo^y KaKoro-Ji. He^cejiaTejiLHoro rqvlctbwl (c OTpHizaHHeM). He 3a6ojien 6u oh. 0 C Heonp. (J), rjiar., HMeiomeii nepe^ co6oh OTpnuaHHe. — Fjrndu, — eoeopw, — 6a6ouKa, ne Kycamb 6u me6e noKmnl.........JlecK. BoHTejibHHna. & Tojibko 6m (6) He. — IJo une otceua kok xoneuib odeeaiicn, .. mojibKO 6 ne Kaotcduii Mecmi 3aKa3bieana cede Hoeue nnambR, a npeofcnue dpocana uoeeuieubKue. ITyinK. Apan neTpa Beji. ... .........5) IIo>KejiaHHe. Ycnoeue h 6u npednonen ne nodnucueamb. JI. Tojtct. Hhcbmo A. O. MapKcy, 27 MapTa 1899. 0 C Heonp. (J), mar. IIooxomumbCR 6u no-Hacmonu\eMy, na kohh 6u deuez dodumb, — Menmaji cmapiiK. T. MapKOB, CTporoBH. j B coneTaHHH c npeaHKaTHBHMMH HapeHHflMH co 3Han. ^0Ji>KeHCTB0BaHHii, Heo6xoztHMOCTH, bo3mo»:hocth.......Bcned cmv Kociuiucb njieuiusbie noeumnuKu: «Ilomuuie 6u nado, ..................b) }KejiaHHe-npocb6a, cobct hjth npe#Jio>KeHHe (o6bihho npn MecT. 2ji.). [MapHHa:] H nezo 3acyemuiicR? Cuden 6u: Hex. JXnm BaHii. ... — Tbi du, Cepeoica, ece-mami nozoeopun.......• npHiiiB. Kam. nenb. r) ^CejiaeMOCTb uejiecoo6pa3Horo ......... 0 C Heonp. (j), rnar. Bom 6u ecmynumbCH 3a Ylaena-mol...... M. TopbKKH, Marb............. 4. Total-Ordering for Sense Marker Classes in Dependency Hypergraphs For building linearized recursive DHs, i.e. DHs without non-embedded cycles between the sense marker classes (which is the problem enounced in Section 2), we propose the following informal description for the solution to this problem (see also Fig. 4): (a) To the marker classes of primary senses there are assigned increasing scores accordingly to their decreasing priority, actually to their decreasing semantic granularity of each sense meaning. For instance, to the four primary senses in the DLR thesaurus-dictionary (root senses and the sense marker classes A., B., C, I., II., III., IV.,..., and 1., 2., 3., ...) 
one can assign as priority scores the numbers 2, 4, 6, and 8.

(b) The first level of literal enumeration, i.e. pa)., pb)., pc)., ..., assigned to all the primary senses in DLR, receives the score p = 9, which is greater than all the scores allocated to the primary senses in DLR. If, in a dictionary, the first level of literal enumeration is refined by further literal enumerations (e.g. in the German DWB), encoded by 2a)., 2b)., 2c)., ... and 3a)., 3b)., 3c)., ..., these two additional levels of literal enumeration receive the priority scores 10 and 11, respectively (see Fig. 4).

(c) The secondary senses and their markers are treated as a second package of senses, playing a distinct role compared to the package of primary senses, since secondary senses are considered to have a substantially smaller lexical-semantic granularity than the primary ones. This is why we assign them special priority scores, correlated with the literal enumerations used within their levels. For example, the filled and empty diamonds ♦ and ◊ may receive the scores 12 and 14, respectively. The literal enumerations associated with secondary senses, denoted by ♦a)., ♦b)., ♦c)., ... and ◊a)., ◊b)., ◊c)., ..., may receive the priority scores 13 and 15, respectively. If necessary, several layers of literal enumeration can be added to the basic level, as shown at point (b) above, together with the corresponding encoding of the additional enumeration refinements. These score allocations allow the recursive sense calls in the sample entries displayed in (Ex. 3.1-3.3) to be represented by the DHs in Fig. 3. Thus the proposed solution supports linearized recursive DHs, eliminating the non-embedded call cycles in the DHs.
This allows DHs to be represented with par-grammars and makes the LUB-computing of par-grammars tractable, yielding the DTD for the SCD-config2 parsing level (of primary and secondary senses).

(d) Between the secondary sense markers ♦ and ◊ in DLR there exists a dependency, established within the first DH of Fig. 3: senses marked with ♦ are more general than those marked with ◊. The same is true for the corresponding sense markers // and ◊ in DMLRL. No clear dependency relation has been established so far between the semantic granularities of the senses delimited by the markers || and | in the DAR thesaurus-dictionary (Curteanu et al., 2012: Fig. 4, p.43). To these secondary sense markers one may assign equal priority scores, with equal scores attributed to their literal enumeration refinements; all the atomic sense definitions in DAR are situated under them. Thus one cannot establish dependency relations between, e.g., a literal in the enumeration refining the sense marker "||" and the sense marker "|"; the reverse statement, obtained by swapping the "||" and "|" markers, holds as well (Fig. 2).

Figure 2: Non-dependency relation between the || and | secondary sense markers in DAR

(e) Finally, the atomic sense definitions receive the lowest priority (represented by the greatest even natural number among the sense scores), since their lexical-semantic granularity is the smallest. For instance, the atomic senses of the RegDef, BoldDef or ItalDef definitions (Curteanu et al., 2012) may all receive the priority score 16 (or 18) if there are no established dependency relations among them, while the literal enumeration under such (a block of atomic) sense definitions should be assigned the priority score 17 (or 19). Under such an enumeration, no other lexical-semantic refinement is permitted.
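The score allocation in points (a) to (e) can be illustrated with a minimal sketch. It assumes the example scores given above for DLR; the class names (`Root`, `LatCapLet`, `RomNumb`, `ArabNumb`, `LitEnum1`, and so on) and the function `attach_markers` are hypothetical labels introduced only for this illustration, not part of the parser described in the paper. The sketch shows why a total order removes call cycles: each marker is attached under the nearest preceding marker with a strictly smaller score, so the parent relation can only form a tree.

```python
# Hypothetical class names; the scores follow the example allocation in (a)-(e) for DLR.
# A smaller score means a more general sense (larger lexical-semantic granularity).
PRIORITY = {
    "Root": 2,               # primary senses: 2, 4, 6, 8
    "LatCapLet": 4,          # A., B., C., ...
    "RomNumb": 6,            # I., II., III., ...
    "ArabNumb": 8,           # 1., 2., 3., ...
    "LitEnum1": 9,           # first level of literal enumeration
    "LitEnum2": 10,          # second level (2a)., 2b)., ...)
    "LitEnum3": 11,          # third level (3a)., 3b)., ...)
    "FilledDiamond": 12,     # secondary sense marker, filled diamond
    "FilledDiamondEnum": 13,
    "EmptyDiamond": 14,      # secondary sense marker, empty diamond
    "EmptyDiamondEnum": 15,
    "AtomicDef": 16,         # atomic sense definitions
    "AtomicDefEnum": 17,
}


def attach_markers(markers):
    """Return, for each marker in reading order, the index of its parent marker.

    A marker is attached under the nearest preceding open marker with a
    strictly smaller score; open markers with an equal or larger score are
    closed first. The first marker (and any marker with no smaller
    predecessor) gets None.
    """
    parents = []
    open_path = []  # indices of currently open senses, scores strictly increasing
    for index, marker in enumerate(markers):
        score = PRIORITY[marker]
        while open_path and PRIORITY[markers[open_path[-1]]] >= score:
            open_path.pop()
        parents.append(open_path[-1] if open_path else None)
        open_path.append(index)
    return parents


# A made-up marker sequence that skips the Roman-numeral level.
SAMPLE = ["Root", "LatCapLet", "ArabNumb", "LitEnum1", "LitEnum1",
          "FilledDiamond", "ArabNumb", "LatCapLet"]
```

For the made-up `SAMPLE` sequence, `attach_markers(SAMPLE)` attaches both `LitEnum1` items under the first `ArabNumb`, and the second `ArabNumb` closes the open enumeration and diamond before returning to the `LatCapLet` level. Any other encoding works equally well as long as the comparison between classes stays a total order.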
The total ordering procedure for the representation of sense marker classes in DHs, including their literal enumeration refinements, can be replaced by any other numerical or literal encoding of the sense priority scores within dictionary entries, provided that it preserves the total ordering of the sense definitions entailed by their lexical-semantic granularities and delimited by the corresponding sense markers.

5. Least Upper Bounds of Optimized Dependency Hypergraphs

Figure 3: Linearly recursive (optimized) DHs for DLR, DAR, and DMLRL - version 2013

Figure 4: The two DHs as the LUB (unification procedure) outcome of the three DHs in Fig. 3

In (Curteanu & Moruz, 2012b), the construction of the procedural DTDs on the three main parsing levels (SCD configurations) was outlined as the result of building LUBs of the par-grammars derived from DHs, which in turn were designed on each SCD configuration of the six thesaurus-dictionaries considered.
The solution to the problem of optimizing DHs, based on the total ordering of sense marker classes (including the literal enumerations), opens a much simpler approach to computing the procedural DTD: it becomes the (only) par-grammar of the LUB DH, i.e. LUB_i(DH_i), such as the two DHs in Fig. 4, which are obtained through matching-unification algorithms as the LUB outcome of the three optimized DHs in Fig. 3 above.

6. Parametrized Grammars for Linear-Cyclicity Dependency Hypergraphs

We propose the following par-grammar for the first DH in Fig. 4, which is one of the two LUB DHs obtained from the three optimized DHs in Fig. 3, on the SCD-Config2 parsing level for DLR, DAR, and DMLRL. The grammar rules are grouped in packages according to the direction of generation: descending rules go towards less general senses (e.g. from the A. to the B. enumeration), ascending rules return to super-ordinated senses (e.g. from the C. to the B. enumeration), expressing the Enumeration Closing Condition (ECC) in (Curteanu et al., 2012), while splitting rules are calls to the enumeration partitioning. The grammar rule attributes are parent and item. The parent of a node is the sense from which that node is generated, and the item of an element is its position in the list of its sister elements. In order to jump over sense levels, as most dictionaries do (e.g. from the A. to the C. enumeration class), we use a dummy node for each skipped level, since the grammar is built such that it cannot generate a lower sense level without its super-ordinated level (this is a correctness restriction). The dummy nodes derive the empty string and are not itemized (the item attribute is never incremented for them).

Table 1: Par-grammar for the first DH in Fig.
4, as the LUB(DH_i) outcome of the optimized DHs in Fig. 3

► entry | LatCapLetA | LatCapLetB | LatCapLetC   //primary_sense
entry -> newPrg e LatCapLetA;
    parent(LatCapLetA) = e; item(LatCapLetA) = 0
entry -> e entry e LatSmallLet;
    parent(LatSmallLet) = e; item(LatSmallLet) = 0
LatCapLetC -> LatCapLetC_Mrk FilledDiamond;
    parent(LatCapLetC_Mrk) = parent(LatCapLetC); item(LatCapLetC_Mrk) = item(LatCapLetC) + 1;
    parent(FilledDiamond) = LatCapLetC_Mrk; item(FilledDiamond) = 0
LatCapLetC -> LatCapLetC_Dummy FilledDiamond;
    parent(LatCapLetC_Dummy) = parent(LatCapLetC); item(LatCapLetC_Dummy) = item(LatCapLetC);
    parent(FilledDiamond) = LatCapLetC_Mrk; item(FilledDiamond) = 0
LatCapLetC -> LatCapLetC_Mrk;
    parent(LatCapLetC_Mrk) = parent(LatCapLetC); item(LatCapLetC_Mrk) = item(LatCapLetC) + 1;
==splitting==
LatCapLetC -> LatCapLetC_Mrk LatSmallLet;
    parent(LatCapLetC_Mrk) = parent(LatCapLetC); item(LatCapLetC_Mrk) = item(LatCapLetC) + 1;
    parent(LatSmallLet) = LatCapLetC_Mrk; item(LatSmallLet) = 0
==ascending==
LatCapLetC -> LatCapLetB;
    parent(LatCapLetA) = parent(parent(LatCapLetC)); item(LatCapLetA) = item(parent(LatCapLetC))

==enumeration==
==descending==
LatSmaLet -> LatSmaLet_Mrk LatSmaLet2;
    parent(LatSmaLet_Mrk) = parent(LatSmaLet); item(LatSmaLet_Mrk) = item(LatSmaLet) + 1;
    parent(LatSmaLet2) = LatSmaLet_Mrk; item(LatSmaLet2) = 0;
LatSmaLet2 -> LatSmaLet2_Mrk LatSmaLet3;
    //attributes are similar to those in the previous rule
LatSmaLet3 -> LatSmaLet3_Mrk FilledDiamond;
    parent(LatSmaLet3_Mrk) = parent(LatSmaLet3);
    //attributes are similar to those in the previous rule
==ascending==
FilledDiamond -> LatSmaLet3, if parent(FilledDiamond) = LatSmaLet3;
    parent(LatSmaLet3) = parent(parent(FilledDiamond)); item(LatSmaLet3) = item(parent(FilledDiamond))
LatSmaLet3 -> LatSmaLet2, if parent(LatSmaLet3) = LatSmaLet2;
    parent(LatSmaLet2) = parent(parent(LatSmaLet3)); item(LatSmaLet2) = item(parent(LatSmaLet3))
LatSmaLet2 -> LatSmaLet, if parent(LatSmaLet2) = LatSmaLet;
    parent(LatSmaLet) = parent(parent(LatSmaLet2)); item(LatSmaLet) = item(parent(LatSmaLet2))
LatSmaLet -> LatCapLetC, if parent(LatSmaLet) = LatCapLetC;
    parent(LatCapLetC) = parent(parent(LatSmaLet)); item(LatCapLetC) = item(parent(LatSmaLet))
LatSmaLet -> LatCapLetB, if parent(LatSmaLet) = LatCapLetB;
    parent(LatCapLetB) = parent(parent(LatSmaLet)); item(LatCapLetB) = item(parent(LatSmaLet))
LatSmaLet -> LatCapLetA, if parent(LatSmaLet) = LatCapLetA;
    parent(LatCapLetA) = parent(parent(LatSmaLet)); item(LatCapLetA) = item(parent(LatSmaLet))
LatSmaLet -> "", if parent(LatSmaLet) = entry;

► FilledDiamond | EmptyDiamond | BoldMrk | ItalMrk   //secondary_sense
==descending==
FilledDiamond -> ♦ EmptyDiamond;
    parent(♦) = parent(FilledDiamond); item(♦) = item(FilledDiamond) + 1;
    parent(EmptyDiamond) = ♦; item(EmptyDiamond) = 0
FilledDiamond -> FilledDiamond_Dummy EmptyDiamond;
    parent(FilledDiamond_Dummy) = parent(FilledDiamond); item(FilledDiamond_Dummy) = item(FilledDiamond);
    parent(EmptyDiamond) = FilledDiamond_Dummy; item(EmptyDiamond) = 0
FilledDiamond -> ♦;
    parent(♦) = parent(FilledDiamond); item(♦) = item(FilledDiamond) + 1;
==ascending==
FilledDiamond -> LatCapLetC;
    parent(LatCapLetC) = parent(parent(FilledDiamond)); item(LatCapLetC) = item(parent(FilledDiamond))
FilledDiamond -> LatSmallLet, if parent(FilledDiamond) = LatSmallLet;
    parent(LatSmallLet) = parent(parent(FilledDiamond)); item(LatSmallLet) = item(parent(FilledDiamond))
==splitting==
FilledDiamond -> ♦ LatSmaLetFD1;
    parent(♦) = parent(FilledDiamond); item(♦) = item(FilledDiamond) + 1;
    parent(LatSmaLetFD1) = ♦; item(LatSmaLetFD1) = 0
LatSmaLetFD1 -> LatSmaLetFD1_Mrk LatSmaLetFD2;
    parent(LatSmaLetFD1_Mrk) = parent(LatSmaLetFD1); item(LatSmaLetFD1_Mrk) = item(LatSmaLetFD1) + 1;
    parent(LatSmaLetFD2) = LatSmaLetFD1_Mrk; item(LatSmaLetFD2) = 0
LatSmaLetFD2 -> LatSmaLetFD2_Mrk LatSmaLetFD3;
    parent(LatSmaLetFD2_Mrk) = parent(LatSmaLetFD2); item(LatSmaLetFD2_Mrk) = item(LatSmaLetFD2) + 1;
    parent(LatSmaLetFD3) = LatSmaLetFD1_Mrk; item(LatSmaLetFD3) = 0
LatSmaLetFD3 -> LatSmaLetFD3_Mrk LatSmaLetFD3;
    parent(LatSmaLetFD3_Mrk) = parent(LatSmaLetFD3); item(LatSmaLetFD3_Mrk) = item(LatSmaLetFD3) + 1;
    parent(LatSmaLetFD3) = parent(LatSmaLetFD3_Mrk); item(LatSmaLetFD3) = item(LatSmaLetFD3_Mrk)
==ECC==
LatSmaLetFD1 -> FilledDiamond
LatSmaLetFD2 -> LatSmaLetFD1
LatSmaLetFD3 -> LatSmaLetFD2

Table 2: Schematic grammars for DLR entry parsing on the lexicographic-segment and sense-tree levels

// lexicographic segment parsing in DLR / DAR
entry -> entryMarker entryRootSense entryBody entryTail
entryBody -> S
S -> Seg | Seg S
Seg -> Mrk Root_sense Body_sense Tail_sense
Mrk -> "" | depTreeNode_SCD1
Root_sense -> "" | text | subSegMrk
Body_sense -> "" | senseBodyDLR | frBodyDAR | roBodyDAR | senseBodyDAR | nestDAR | MorphologicalPart
Tail_sense ->

// sense tree parsing in DLR
senseBodyDLR -> sense
sense -> senseMarker definition sense_list
sense_list -> sense sense_list | ""
definition -> definition | defItem
defItem -> MorfDef | spSpecDef | specDef | regDef defExemList
defExemList -> defExemPair defExemList | defExemPair
defExemPair -> quote sigle
regDef -> regDefPart regDef | regDefPart defItem
regDefPart -> regDefPartComponent regDefPart | regDefPartComponent
regDefPartComponent -> gloss | reference | synonym | sigle | specDef | spSpecDef
specDef -> (specDefPart specDefRec) | (specDefPart)
specDefRec -> specDefPart specDefRec | specDefPart
specDefPart -> specDefKeyword | freeText

The effective construction of a new, procedural DTD and of a corresponding SCD-based general parser for large dictionaries is the result of the following steps:

(S1): For each new dictionary, in the process of its lexicographic modeling (Curteanu et al., 2010, 2012), the DHs for the three main parsing levels (viz.
SCDk configurations, k = 1-3) have to be well defined, and the calls between the three parsing levels have to be structurally embedded. Where necessary, the essential process of their optimization has to be applied, i.e. the linearization of their recursion by eliminating the non-embedded call cycles between sense marker classes and literal enumerations.

(S2): On each SCDk configuration, k = 1-3, for the dependency hypergraphs SCDk-DH_i (i = 1-n) defined for n distinct dictionaries, the LUB_i(SCDk-DH_i) = SCDk-DH, k = 1-3, has to be defined. The optimization procedure in Section 4 ensures the soundness of this process. E.g., the three SCD2-DH_i (i = 1-3, for DLR, DAR, DMLRL) have been defined in Fig. 3, and their LUB, SCD2-DH = LUB_i(SCD2-DH_i), i = DLR, DAR, DMLRL, is displayed in Fig. 4.

(S3): For each SCDk-DH, k = 1-3, its par-grammars represent the procedural DTD_k of the i = 1-n considered dictionaries. Their representational DTD is the unified package of the three par-grammars on the SCDk configurations, k = 1-3. E.g., the par-grammar in Table 1 is associated with SCD2-DH = LUB_i(SCD2-DH_i), i = DLR, DAR, DMLRL.

(S4): Several par-grammars have to be integrated within each par-grammar_k associated with the SCDk parsing level (k = 1-3).

(S4a): Par-grammars for constructing the dependency trees of the lexicographic structures on each SCDk level (e.g., the schematic grammars for segment and sense tree parsing in Table 2 for SCD2).

(S4b): Backus-Naur grammars and their LUB outcome(s), for the atomic sense definitions of the involved dictionaries.

(S4c): The procedurally connected and / or LUB-computed par-grammars for all the formal grammars considered above should constitute the new procedural DTD, obtained incrementally for the n dictionaries at hand.

The procedural DTD and its associated SCD-based parser for very large dictionaries still require substantial effort and innovative solutions in order to be accomplished.

7.
Conclusion and Continuation

The DH optimization involves the following remarks, which lead to the solution of our problem: (a) The literal enumeration a)., b)., c)., ... under the I., II., III., ... primary sense markers is not the same as a)., b)., c)., ... under 1., 2., 3., ..., since the lexical-semantic granularity of the former literal enumeration is strictly larger than that of the latter. (b) The same holds, even more clearly in practice, for the secondary sense marker ♦ super-ordinating the marker ◊. (c) For a sound parsing of dictionary entries, the solution to the DH optimization problem entails a renaming of the literal enumerations that depends on the sense marker, totally ordering these sense splitting processes in DHs. In (Curteanu & Moruz, 2012b), a par-grammar has been proposed to represent the DLR DH, the first (optimized) DH in Fig. 3. In the presence of non-optimized DHs, computing their par-grammars, and then their least upper bound (LUB) par-grammar, is an intricate process. Solving the problem of DH optimization radically changes the way of obtaining the general DTD and dictionary parser (Section 6): instead of computing the LUB of par-grammars from non-optimized DHs, we apply the optimization procedure to the DHs of the involved dictionaries, compute their LUB DH(s), and write its (or their) corresponding par-grammar(s).

The project of a new, procedural DTD and of a general SCD-based parser for the largest thesaurus-dictionaries is a huge challenge because it would make possible a direct comparison among the sense marker classes used in the most computerized languages, and among the adequacy of the lexicographic sense markers and the lexical-semantic granularity of the lexicographic units they delimit within various large dictionaries. It brings effective means for the standardization of such complex constructions and for their automatic (and efficient) parsing.
As further developments of the standardized thesauri one can mention the design of an optimal and cross-linguistically compatible network of Romanian electronic dictionaries, similar to a very good dictionary network project, viz. the German Woerterbuch-Netz, with possible links to well-known foreign dictionaries.

References

Curteanu, N., Trandabat, D., Moruz, A. M. (2008). Extracting Sense Trees from the Romanian Thesaurus by Sense Segmentation & Dependency Parsing. Proceedings of CogAlex Workshop, Manchester, 55-63, http://aclweb.org/anthology/W/W08/W08-1908.pdf

Curteanu, N., Moruz, A., Trandabat, D. (2010). An Optimal and Portable Parsing Method for Romanian, French, and German Large Dictionaries. CogAlex II - The Second Workshop on Cognitive Aspects of the Lexicon, COLING-2010, Beijing, China, 38-47, http://www.aclweb.org/anthology-new/W/W10/W10-3407.pdf

Curteanu, N., Cojocaru, S., Burca, E. (2012). Parsing the Dictionary of Modern Literary Russian Language with the Method of SCD Configurations. The Lexicographic Modeling. Computer Science Journal of Moldova, Academy of Sciences of Moldova, Vol. 20, No. 1 (58), 42-81, http://www.math.md/files/csjm/v20-n1/v20-n1-(pp42-82).pdf

Curteanu, N., Moruz, A. (2012a). Toward the Soundness of Sense Structure Definitions in Thesaurus-Dictionaries. Parsing Problems and Solutions. Computer Science Journal of Moldova, Academy of Sciences of Moldova, Vol. 20, No. 3 (60), 275-303, http://www.math.md/files/csjm/v20-n3/v20-n3-(pp275-303).pdf

Curteanu, N., Moruz, A. (2012b). A Procedural DTD Project for Dictionary Entry Parsing Described with Parameterized Grammars. CogALex-3 Proceedings, The Third Workshop on Cognitive Aspects of the Lexicon, COLING-2012, Bombay, India, 127-136, http://aclweb.org/anthology-new/W/W12/W12-5110.pdf

Erjavec, T., Evans, R., Ide, N., Kilgarriff, A. (2001). From Machine Readable Dictionaries to Lexical Databases: the CONCEDE Experience.
Research Report on TEI-CONCEDE LDB Project, Univ. of Ljubljana, Slovenia. Consortium for Central European Dictionary Encoding - INCO-COPERNICUS project no. PL96-1152.

Hauser, R., Storrer, A. (1993). Dictionary Entry Parsing Using the LexParse System. Lexicographica 9, 174-219.

Lemnitzer, L., Kunze, C. (2005). Dictionary Entry Parsing, ESSLLI.

Neff, M., Boguraev, B. (1989). Dictionaries, Dictionary Grammars and Dictionary Entry Parsing. Proc. of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, British Columbia, Canada, 91-101.

Tufis, D., Rotariu, G., Barbu, A. M. (1999). TEI-Encoding of a Core Explanatory Dictionary of Romanian. Proceedings of the 5th Conference on Computational Lexicography, COMPLEX 1999, Pecs, Hungary (F. Kiefer, G. Kiss, and J. Pajzs eds.), 219-228.

XCES TEI Standard, Variant P5 (2007). http://www.tei-c.org/Guidelines/P5/

ROMANIAN ETYMOLOGICAL CHAINS - A PRELIMINARY ANALYSIS

RALUCA MOISEANU1, DAN CRISTEA2

1 Alexandru Ioan Cuza University of Iasi, Computer Science Faculty, Computational Linguistics Department
2 Romanian Academy, Institute for Theoretical Computer Science
{raluca.moiseanu, dcristea}@info.uaic.ro

Abstract

In this paper we describe the preliminary steps towards a recursive reconstruction of the origin of Romanian words, together with the positioning of their loans within a time frame, as reflected in the European Linguistic Thesauri. A pilot application accepts a Romanian word as input and accesses online linguistic resources, such as eDTLR - The Thesaurus Dictionary of the Romanian Language in electronic form, displaying etymological information. The etymology of a word is subsequently searched in foreign sources (for the time being only French and Italian online dictionaries), in order to compute its etymological trajectory. Import years, where available, are used to place the approximate time of each import on the time axis.
The research intends to highlight a methodological framework on which a future full-scale investigation could be anchored.

Keywords: etymon, online dictionaries, database, parser

1. Introduction

This project has been triggered by the need for a dynamic and complex structured database able to provide the etymological information of any Romanian word (except those with unknown etymology). In our attempt to recreate the etymological chain of a word we shall, first of all, provide an insight into what etymology as a science is, as well as into the main features of Romanian etymology. Once the theoretical background is established, we shall move on to the linguistic resources and technologies used to support the generation of etymological chains. An etymological chain is a string of one or more etymons along with their origin language and entry year. As a data structure, etymological chains are graphs (Alt, 2006) that have a root word in the studied language (Romanian, in our case) and one or more descendants from source languages (Central and Eastern languages, in our case, with which the Romanian language has had contact throughout the years). The paper describes the beta version of the application used to automatically extract the information from online Italian and French dictionaries, a version that has been tested on 2000 XML files from eDTLR, the Romanian Thesaurus Dictionary in electronic form (Cristea et al., 2007), corresponding to the same number of dictionary entries.

2. Etymology as a science

Derived from the Greek etymon, meaning "true sense", and the suffix -logia, denoting "the study of", etymology as a science studies the origin of words.
Etymology considers words as having either an internal origin (formed in the target language by applying transformation rules specific to the lexicon or the grammar of that language, through affixation, compounding and conversion) or an external origin (through borrowings, or loans, from one or more languages). Regardless of the acceptance channel, etymology has to decipher the phonetic and morphological transformations from the original word to the current word. The linguistic calque is situated at the border between the internal and the external generation of words, as the new words are formed within the language itself by imitating an external structure. An etymon can come from two or more languages, either during the same period of time or in different periods of time. This is called multiple etymology. Most Romanian words have multiple etymology, Latin being referred to as an indirect source. Romanian is a Romance language, belonging to the Italic branch of the Indo-European language family, and has much in common with languages such as French, Italian, Spanish and Portuguese. However, the closest to Romanian are the other Eastern Romance dialects, spoken south of the Danube: the Aromanian/Macedo-Romanian, Megleno-Romanian and Istro-Romanian dialects. An alternative name for Romanian, used by linguists to distinguish it from the other Eastern Romance languages, is Daco-Romanian, referring to the area where it is spoken (which corresponds roughly to the onetime Roman province of Dacia). Marius Sala et al. (1988) considered 2581 words as being representative of the Romanian vocabulary.
The etymological structure of this vocabulary is shown below:
• Romance elements 71.66 %, out of which:
  ❖ 30.33 % Latin
  ❖ 22.12 % French
  ❖ 15.26 % Classical Latin
  ❖ 3.95 % Italian
• Internally formed 3.91 % (most from Latin etymons)
• Slavic 14.17 %, out of which:
  ❖ 9.18 % Old Slavic
  ❖ 2.6 % Bulgarian
  ❖ 1.12 % Russian
  ❖ 0.85 % Serbian-Croatian
  ❖ 0.23 % Ukrainian
  ❖ 0.19 % Polish
• German 2.47 %
• Neo-Greek 1.7 %
• Thracian-Dacian, a sub-layer, 0.96 %
• Hungarian 1.43 %
• Turkish 0.73 %
• English 0.07 % (and growing)
• Onomatopoeias 0.19 %
• Unknown origin 2.71 %

The data listed above has been used to establish the first two Latin languages of focus for this preliminary study.

3. Data collection

The collection of resources (online dictionaries) and the simulation trials of manually generated etymological chains represented the starting point of the project. The manually gathered etymological chains were also used as validators for the application (Burhui, 2013) that has been put together for the automatic generation of etymological chains. The quest for online resources has proved rather sinuous, as many of the online etymology dictionaries or online dictionaries did not display etymological data. For the purpose of this paper we have narrowed down the area of research to Italian and French only, which together account for 26.07 % (22.12 % French plus 3.95 % Italian) of the representative vocabulary of the Romanian language (Sala, 1988). Once we had identified the two online sources that seemed to best fit our purposes, http://www.sapere.it for Italian and http://www.cnrtl.fr/etymologie for French, we extracted from these dictionaries a list of notations used to mark the etymon (such as: fr., fr.ant., it., ital., lat., lat.class., lat.vulg., lat.mediev. etc.). This list has been included as an external resource into the program. What follows below is a list of examples of etymological chains, manually extracted from the two online dictionaries (Italian and French).
In these examples, the details that we would want the application to return upon interrogation are also indicated: POS, gender, entry year, and source language.

> bastard s.m. din it. bastardo; IT. bastardo sec. XV dal fr. ant. batard; FR. batard 1150 l'orig. de bastard est obsc.;
  ro.bastard <- it.bastardo <- fr.batard <- unknown etymon

> ciment s.n. sec. XIX din it. cimento, fr. ciment; IT. cimento s.m. dal lat. caementum; FR. ciment s.m. 1165-70 du lat. class. caementum;
  ro.ciment <- it.cimento <- lat.caementum
  ro.ciment <- fr.ciment <- lat.caementum

> cortina s.f. din it. cortina; IT. cortina n.f. dal lat. tardo cortina;
  ro.cortina <- it.cortina <- lat.tardo.cortina

> paladin s.m. din fr. paladin, it. paladino; FR. paladin s.m. 1552 empr. à l'ital. paladino; IT. paladino n.m. dal lat. mediev. palatinum;
  ro.paladin <- fr.paladin <- it.paladino <- lat.mediev.palatinum
  ro.paladin <- it.paladino <- lat.mediev.palatinum

> sopran s.m. din fr., it. soprano; FR. soprano 1768 du lat. vulg. superanus; IT. soprano dal lat. vulg. superanus;
  ro.sopran <- it.soprano <- lat.superanus
  ro.sopran <- fr.soprano <- lat.superanus

> vaccin s.n. 1827 din fr. vaccin, lat. vaccinus, cf. it. vaccino; FR. vaccin 1801 du lat. vaccinu(s); IT. vaccino dal lat. vaccinu(m);
  ro.vaccin <- fr.vaccin <- lat.vaccinu(s/m)
  ro.vaccin <- it.vaccino <- lat.vaccinu(s/m)

> vagabond s.m. 1795 din fr. vagabond, lat. vagabundus, cf. it. vagabond; FR. vagabond 1382 du lat. vagabundu(s); IT. vagabondo dal lat. vagabundu(m).
  ro.vagabond <- fr.vagabond <- lat.vagabundu(s/m)
  ro.vagabond <- it.vagabond <- lat.vagabundu(s/m)

As shown in the above examples, we have manually extracted the entry year, where available, and have also listed the etymons with unknown origins. Most of the above examples have double etymology, the etymon being both Italian and French, with both pointing to Latin as an indirect origin of the Romanian words.
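The chains listed above can be held in a small graph structure. The following is a minimal sketch, not the application's actual code: the names `Etymon`, `SOURCES`, `linear_chains` and `show` are hypothetical, and the data is the "sopran" example as recorded above (only the French entry carries a year). Each node maps to the list of etymons it was borrowed from, and every path from the Romanian root to a node without further sources is one linear chain.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Etymon:
    """One node of an etymological chain (hypothetical structure)."""
    lang: str                   # language notation, e.g. "ro", "fr", "it", "lat.vulg"
    form: str                   # the word form
    year: Optional[int] = None  # entry year, when the dictionary gives one


RO = Etymon("ro", "sopran")
FR = Etymon("fr", "soprano", 1768)
IT = Etymon("it", "soprano")
LAT = Etymon("lat.vulg", "superanus")

# word -> the etymons it comes from (double etymology gives two sources)
SOURCES: Dict[Etymon, List[Etymon]] = {
    RO: [FR, IT],
    FR: [LAT],
    IT: [LAT],
}


def linear_chains(sources: Dict[Etymon, List[Etymon]], root: Etymon) -> List[List[Etymon]]:
    """Enumerate every path from the root to an etymon with no recorded source."""
    chains: List[List[Etymon]] = []

    def walk(node: Etymon, path: List[Etymon]) -> None:
        # Ignore sources already on the path, so a circular reference cannot loop.
        parents = [p for p in sources.get(node, []) if p not in path]
        if not parents:
            chains.append(path)
            return
        for parent in parents:
            walk(parent, path + [parent])

    walk(root, [root])
    return chains


def show(chain: List[Etymon]) -> str:
    """Render a chain in the notation used in the examples above."""
    return " <- ".join(f"{e.lang}.{e.form}" for e in chain)
```

Calling `[show(c) for c in linear_chains(SOURCES, RO)]` yields one line per chain, in the order in which the sources are listed for each word.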
From this early stage, four types of etymological chains can already be seen:

[Diagram: four types of etymological chains (type1 to type4), each linking a root to its origins orig1, orig2 and orig3]

4. The application and a comparison with other approaches

A beta version of the application (Burhui, 2013) allows a user to input a Romanian word, from which it generates one or more linear etymological chains. At this stage the application searches for the entry in the Romanian lexicographic thesaurus (eDTLR) and, once found, extracts the etymological information. If the etymological sources indicate a French or an Italian origin, it directs the search to the corresponding French (http://www.cnrtl.fr/etymologie) or Italian (http://www.sapere.it) online dictionary, parses the etymological information and displays it. The year of the import is filled in as the year of the first citation.

Figure 1: The general architecture of the system

A high-level overview of the application design is shown in Figure 1. A graphical interface able to display the four schemes highlighted in the previous section remains to be implemented. Susan Alt (2006) describes etymons as words, located in time and space, which stand in a particular diachronic relation to other words, and etymological links as the etymological relations between linguistic units. In her attempt to define a model of etymological structures she uses the TLFI (http://www.tlfi.fr) as the primary linguistic material from which to recover data. The nodes of her graph are lexical entries in diverse lexicographic sources. In linear chains, the first entry is the anchoring word, the second one represents its direct etymon, the third one the etymon of the first etymon, and so on. In the case of compound words, her graphs diverge towards two entries, each one continued with its corresponding sources.
Alt pays particular attention to the type of links between the entries (such as loan word relations or compound word relations). This type of information will also be inserted in our graphs once the parsers become refined enough to distinguish it in the source dictionaries.

5. Conclusions

Our preliminary manual investigations, as well as the first experiments done with the tool, have brought to light a high number of entries with unknown or uncertain etymons, which could easily become the subject of statistics based on this project. Moreover, the attachment of import dates makes it possible to detect some incorrectly dated etymons (which, as mentioned, are taken in our primary source from the date of the first mention of the imported word). Among the peculiar etymological chains obtained during our manual trials, we have come across entries for which the entry year of the first or second etymon is later than that of the word in the target language. What we believe to be incorrectly dated etymons would have to be validated against a collection of online dictionaries, rather than just one dictionary. The main difficulty that we have faced so far is the lack of online resources that contain both the etymon and the entry year, or the lack of online resources altogether (Bulgarian, Slavic, etc.). Among the resources that we have found so far (English, German, Spanish), differences in the notation of the etymon in each dictionary make the parsing challenging. However, in the future we aim to increase the number of online dictionaries accessed, the most needed for studying the origins of Romanian being the German, Bulgarian, Russian, Turkish, Greek, English, Polish, Ukrainian, Hungarian and Latin dictionaries. The language barrier is not to be neglected either. The Greek, Turkish, Russian and Slavic dictionaries pose a real problem when retrieving the required information.
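The dating check mentioned in this section (an etymon whose entry year is later than that of the word borrowed from it) can be sketched as follows. This is only an illustration under stated assumptions: the function name `misdated_links` and the tuple layout are hypothetical, the first two records use the years quoted in the examples of Section 3, and the third record is invented solely to show what a flagged case looks like.

```python
from typing import List, Optional, Tuple

# (borrowed word, its entry year, etymon, etymon entry year); years may be missing
Link = Tuple[str, Optional[int], str, Optional[int]]

LINKS: List[Link] = [
    ("ro.vaccin", 1827, "fr.vaccin", 1801),      # years from the examples in Section 3
    ("ro.vagabond", 1795, "fr.vagabond", 1382),  # years from the examples in Section 3
    ("ro.example", 1700, "fr.example", 1750),    # invented record, for illustration only
    ("ro.cortina", None, "it.cortina", None),    # no years available: cannot be checked
]


def misdated_links(links: List[Link]) -> List[Link]:
    """Return the links whose etymon is dated later than the word borrowed from it."""
    return [
        link
        for link in links
        if link[1] is not None and link[3] is not None and link[3] > link[1]
    ]
```

Links with a missing year are skipped rather than flagged, which matches the remark that such cases would have to be validated against several dictionaries before being treated as errors.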
Although the first steps have been taken, the project is far from complete and so far raises more questions than it answers. We believe that the research, only the inception of which is described in this paper, could convincingly motivate the creation of an international consortium that would develop this project at European scale. Similar initiatives have already been suggested for other languages: (Alt, 1996) for French, or the Etymology explorer (http://roots.robestone.com) for English. Agreeing on common conventions for the notation of etymological chains, and sharing lexicographic resources, dictionary parsing technologies and the software that builds the etymological graphs, could result in interchangeable etymological graphs that would cover ever denser parts of a map of linguistic influences. Their correlation with historical events could bring new insights into cultural interferences, correct errors and reveal unknown linguistic and historical facts.

Acknowledgments: We are grateful to Gabriela Haja, from the "A. Philippide" Institute of the Romanian Academy, for coining the idea of etymological chains, and to Andrei Scutelnicu and Alin Placinta-Salaru, from Computational Linguistics at the Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iasi, and Anca Bibiri, from the Department of Interdisciplinary Studies of the same University, for contributing to the elaboration of the software and for acquiring information about dictionaries.

References

Burhui A. (2013). Reconstruirea lanturilor etimologice pentru limba romana (The reconstruction of etymological chains for Romanian language).
Dissertation thesis in Computational Linguistics, "Alexandru Ioan Cuza" University of Iasi, Faculty of Computer Science.
Cristea D., Raschip M., Forascu C., Haja G., Florescu C., Aldea B., Danila E. (2007). The Digital Form of the Thesaurus Dictionary of the Romanian Language. In Proceedings of SPeD-2007 (Speech Technology and Human-Computer Dialogue), Iasi.
Hristea T. (1984). Structura etimologica a lexicului romanesc modern, in: Theodor Hristea (coord.), Mioara Avram, Grigore Brancus, Gheorghe Bulgar, Georgeta Ciompec, Ion Diaconescu, Rodica Bogza-Irimie, Flora Suteu, Sinteze de limba romana, Bucharest.
Moroianu C. (2005). Dublete etimologice (Etymological doublets), Bucharest.
Sala M. (coord.), Mihaela Birladeanu, Maria Iliescu, Liliana Macarie, Ioana Nichita, Mariana Ploae-Hanganu, Maria Theban, Ioana Vintila-Radulescu (1988). Vocabularul reprezentativ al limbilor romanice (The representative vocabulary of Romance languages). Editura Stiintifica si Enciclopedica, Bucharest.
Alt S. (2006). Data Structures for Etymology: Towards an Etymological Lexical Network, Bulag.
Dictionaries: eDTLR - The Thesaurus Dictionary of Romanian Language in electronic form; http://www.cnrtl.fr/etymologie; http://www.sapere.it; http://dexonline.ro

VIRTUAL CIVIC IDENTITY

DANIELA GIFU1, DAN STOICA2, DAN CRISTEA1,3

1 "Alexandru Ioan Cuza" University, Faculty of Computer Science, Iasi - Romania
2 "Alexandru Ioan Cuza" University, Faculty of Letters, Iasi - Romania
3 Institute for Theoretical Computer Science, Romanian Academy - Iasi branch, Romania
{daniela.gifu, dcristea}@info.uaic.ro, dstoica_ro@yahoo.com

Abstract

The paper presents a study on a typology of the civic identities of public contributors commenting on online articles in forums, and on the possibilities of identifying them automatically. We analyse the dialogic means and explore the automatic extraction of features from forum utterances.
The research suggests new perspectives for defining types of online commentators on public discourse in domains such as politics, arts, education, etc. In the investigation we apply some pragmalinguistic approaches to communication, mainly taken from the areas of polyphony and enunciation. The classification of user profiles makes use of criteria that take into consideration common topics, expression of sentiments, style features, lexical n-grams, morphosyntactic analytics and pragmatic features. Our purpose was to lay the basis for a thorough classification of categories of publics and to suggest ways of identifying them automatically, for the benefit of editors of media institutions, specialists in public communication, intelligence agencies, political structures, etc.

Keywords: civic identity, pragmalinguistics, semantic classes, journals forums, editors.

1. Introduction

Nowadays, a part of reality has moved into cyberspace, and the same has happened to the bar or sidewalk chats traditional in older times. Almost every public page we visit is cast into a stream of ongoing discussions, comments and gossip, thus becoming a property jointly owned by its composer and any person who may want to react. Civic identity seems to be manifested on the Internet without constraints. This study attempts to build a model for identifying the civic identity of an individual, as revealed through online channels, by highlighting decision features whose values can be extracted automatically. The investigation focuses on a corpus of online journals' forums, from which the commentators' profiles are extracted. Profiling the civic identity of readers of articles should exploit their inputs, the readers being seen here in the position of writers of short forum comments.
The process rests on a panoply of pragmatic markers, extracted by linguistic methods at the following levels: lexical (tracking patterns of specific vocabulary), syntactic (grammar errors, punctuation, enumerations, repetitions, use of emoticons, etc.), semantic (frequent use of some semantic classes) and discourse (rhetorical markers). Using these features, the resulting portrait should be characterised along the following dimensions: the capacity to stay within the article's topic, the capacity to express opinions on another forum user's comment, the tendency to present oneself rather than follow the forum's debate, the degree to which commentators assume their identity as individuals, and the preoccupation with really participating in the debate opened by the article (vs. just the desire to assert vague, general ideas). When the media product is on the Internet, the actors of cyberspace who decide to interact with it have a great many possibilities to shadow or even hide their identities and, of course, their communicative intentions. As such, the attempt to determine the civic identity of people hiding their identities as individuals seems impossible. To date, there are no consistent instruments or studies on the different nature of forum users, and statistics are used just to group reactions to the journalistic material. Moreover, the lack of studies on the true nature of forum writers makes it impossible to apply advanced statistical calculations in order to rank positions or attitudes. A smart argument put in well-formed phrases could reveal a civic activist, but also a good PR person from a political party trying to influence the readers of the forum; an upset man who wants to let it out on any subject could reveal a shy person who accepts to express himself only from behind the protection of anonymity. Basic criteria like age, gender and education level are insufficient for determining the civic identity of the people under study.
The markers used by specialists in pragmalinguistic analysis could reveal in one's discourse a lot more about the personality of the writer than the writer would be willing to unveil. Based on this kind of finding, a typology of forum users from the point of view of their civic identity is possible. This approach shows the importance of a natural language processing system capable of extracting basic linguistic features from large amounts of online text and organizing them as a collection of pragmatic knowledge aimed at inventorying the profiles of online commentators. The outcome of the study could provide tools that public speakers can use to improve their future discourses. This is why the effort to mentally represent the interlocutor - and, if not the actual interlocutor, the general profile s/he belongs to - is important in improving one's ability to communicate. There are many ways to enhance one's capacity to represent others well before or during an online interaction. One of them is to analyze their public discourse, in order to extract information that can be used to orient one's own discourse and make it efficient. A good understanding of the respective publics of media products, for example, could serve to improve editorial policies and so better serve the communities involved.

Section 2 presents the state of the art. Section 3, after a short description of the corpus analyzed, collected during the two hot months of the presidential crisis (July - August 2012), presents the methodology applied in identifying the lexical-semantic and pragmatic features of online civic identity. Finally, Section 4 presents some conclusions and directions for future work.

2. State of the art

Our study combines automatic user profiling techniques (opinion mining, authorship classification) with pragmatic and linguistic studies of computer-mediated communication.
At present, many systems collect various information about millions of people on the Web. Some current systems rely on information manually provided by users. In others, information is often obtained from users' actions. In this case, user profiling requires inference over the acquired information, both observable and unobservable data, such as users' behaviour (Schiaffino and Amandi, 2009), (Zukerman & Albrecht, 2001). A user's behaviour and profile can be obtained from this information using different techniques, such as machine learning and statistical methods. A wide range of techniques has thus been used to create user profiles, such as Bayesian networks (Nurmi, 2006), (Withby et al., 2005), (Weiwei et al., 2007), (Mui et al., 2001), (Garcia et al., 2007), fuzzy models (Grishchenko, 2004), (Sabater et al., 2002), (Manchala, 1998), association rules (Adomavicius & Tuzhilin, 2001), mechanisms of text classification (Trandabat et al., 2012), (Gifu & Cristea, 2012), and more. The discourse/text output of users (posts, comments, forum messages) is used to infer elements of the authors' identity (gender, age, level of education and much more). In these text productions a user expresses his/her opinions about a given topic and interacts with other users. Content analysis is used in several applications to identify conflicts (Denis et al., 2012) or to detect various opinions (Grivel and Bousquet, 2011). The challenge is to involve theories of pragmalinguistics, mainly from the works on polyphony and enunciation (Ducrot and Anscombre, 1989, Plantin, 2005 and also Tutescu, 2005, Kerbrat-Orecchioni, 1999, Maingueneau, 2000). Language is no longer seen as a means to represent the world (the referential function of language), but as a means of argumentation in linguistic interactions among human beings.
Enunciation means making a choice from the infinite offers of a given language: a choice of words, a choice of the order in which the words are uttered, a choice of tone, of the intensity of the voice, and so on. Making those choices reveals a social profile of the enunciator, and this is what we will try to track down in order to set up patterns: we will search for patterns of linguistic behaviour that reveal patterns of social profiles. To situate our research, we mention that the French journal HERMES has published over the years papers on communication and the Internet, on social relations and online communication, and on civic exchange in cyberspace (Loh, 2009, Akiyoshi, 2009, Cardon, 2007, Oliveri, 2011), and also that (Holt, 2004) might be a model of how to use particularities of language use to determine the kind of citizen the speaker is. The idea expressed by Holt in his Dialogue on the Internet is that email discussion messages are often expressed in a familiar register, with slang, abbreviations and profanity, and that their composers frequently seem to delight in disregarding traditional rules such as those governing syntax, conventional logic, evidence and idea development. (Mortensen, 2003) discusses the use of language productions to understand the mind of a player. (Stoica, 2001) comments on the degrees of liberty authors have when writing for traditional, printed scientific journals and when writing for the web. Pragmatic and rhetoric studies identify several relevant features for characterizing specific genres focusing on the expected audience (scientific articles vs. popular science articles) (Hyland, 2009). Some research projects collect new media communication documents (Lin, 2007), (Stark and Dürscheid, 2011) to study their features for classification purposes.

3. A case study

The methods, techniques and tools in the development phase of the project create the premises for a thorough investigation of the categorisation of online civic identities, drawn from statistics on large amounts of textual data. The approach has a high degree of generality, which makes it applicable to other types of investigation, provided they rely on text analytics.

3.1. The corpus

To draw preliminary conclusions on how the online "civic identity" is configured, we collected, stored and processed 11,100 relevant texts/day/newspaper (summing up 146,000 words)1, published during July-August 2012 (July 01-06, 2012 - a week before the President's suspension; July 07-11, 2012 - a week after the President's suspension; August 11-16, 2012 - a week before the President's return to the Cotroceni Palace) by three important Romanian online newspapers with similar profiles2 (Evenimentul zilei, Gândul, Jurnalul Național) but which usually display totally disjoint opinions and journalistic styles on any political topic. This is the hot political period when the President was suspended.

3.2. Methodology

In the following, we briefly describe the steps of our analysis:
- by attentive reading, we identified 10 types of commentators, which can be called: the-decent, the-porn-aggressive, the-incitator, the-linkable, the-affected, the-author-attacker, the-supporter, the-intellectual, the-rational, the-irrational (see Table 1).
- after manually processing the whole corpus, it turned out that 6 of the 10 profiles were rather accidental (too few data): the-decent, the-porn-aggressive, the-linkable, the-author-attacker, the-supporter and the-intellectual. As the average of their occurrences was under 5%, we eliminated these texts. Only the remaining 4 profiles are quantitatively analysed below.
- we established a number of features (belonging to the lexical, syntactic, semantic and discourse levels of analysis) that are, more or less, subject to automatic extraction: declared ID (hide, partial expose, expose, invented, etc.); use of emoticons, familiarity in dialogue, jokes, punctuation, etc.; the semantic classes rational, emotional (with their sub-classes) and swear; comments that follow the topic, that have no correlation with the topic, that are connected to other comments, that are aggressive, etc.; number of appearances of the ID per article and number of appearances of the ID in other online publication(s);
- all comments belonging to the same type, irrespective of their authors' actual identity, have been put in the same folder;
- consequently, we adorned all texts with values for the established features, either manually or automatically. For instance, the semantic level has been automatically annotated with values for each of the semantic classes residing under the general classes emotional, rational and swear: in total, 12 semantic classes;
- these data are discussed below as possible input for training a classifier to recognise the civic identities (portrait types).

Table 1: Profile typology after manual annotation
[Table not recoverable from the source. Its columns are the ten commentator profiles (Decent, Porn-aggressive, Incitator, Linkable, Affected, Author-attacker, Supporter, Intellectual, Rational, Irrational) with their abbreviations; its rows appear to cover style markers, rational and negative tonality, relation to the topic and to other comments, incitement of other commentators, and counts per article.]

1 We are aware that the actual dimension of our corpus is still insufficient to obtain an accurate categorization of the classification criteria, but in this study we are interested in investigating a research methodology rather than in arriving at precise conclusions about types of civic identities, as revealed by text analytics.
2 These are national dailies of general information, tabloids with a circulation of tens of thousands of copies per edition each. The newspapers were monitored on their websites: Evenimentul zilei - www.evz.ro, Gândul - www.gandul.info, Jurnalul Național - www.jurnalul.ro.

3.3. Lexical-semantic features

After eliminating six of the initially identified, manually annotated profiles, together with their comments, the remaining corpus was processed with the DAT3 tool (initially intended to analyse political discourses). Out of the 33 semantic classes in DAT, arranged hierarchically - see two examples of XML class definitions in (1) -, we selected only those noticed to have dominant tonalities: rational, with 5 subclasses (uncertain, inhibition, intuition, certain and determine); emotional, with 2 subclasses (positive and negative), each of them having 3 further subclasses (positive with moderation, firmness and spectacular, and negative with anxiety, anger and sadness); and swear.

(1) [XML class definitions not preserved in the source]

3 DAT (Discourse Analysis Tool) has some similarities with LIWC (Linguistic Inquiry and Word Count), used during the American presidential elections in 2008 (Pennebaker, 2001). The Romanian lexicon behind DAT contains a collection of over 9,500 entries (roots and lemmas).

The placement of classes in hierarchies means that, when an occurrence belonging to a lower-level class is detected in the input file, all counters in the hierarchy, from that class to the root, are incremented. For instance, for Evenimentul zilei, Fig. 1 shows the results output by DAT when analysing the streams of textual data for each semantic class.
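The counting rule just described (an occurrence of a lower-level class increments every counter from that class up to the root) can be sketched as follows. This is a minimal illustration and not DAT's code: the `PARENT` table only encodes the classes named in this section, while the three-word `LEXICON` and its class assignments are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# child class -> parent class (None marks a root), as named in the text
PARENT: Dict[str, Optional[str]] = {
    "rational": None, "emotional": None, "swear": None,
    "uncertain": "rational", "inhibition": "rational", "intuition": "rational",
    "certain": "rational", "determine": "rational",
    "positive": "emotional", "negative": "emotional",
    "moderation": "positive", "firmness": "positive", "spectacular": "positive",
    "anxiety": "negative", "anger": "negative", "sadness": "negative",
}

# Hypothetical lexicon fragment: word -> most specific class (our assumption).
LEXICON: Dict[str, str] = {"sigur": "certain", "poate": "uncertain", "furios": "anger"}


def count_classes(tokens: Iterable[str], lexicon: Dict[str, str]) -> Counter:
    """Count class occurrences, propagating each hit up to the root class."""
    counts: Counter = Counter()
    for token in tokens:
        cls = lexicon.get(token.lower())
        while cls is not None:  # walk from the detected class to the root
            counts[cls] += 1
            cls = PARENT[cls]
    return counts


counts = count_classes(["Sigur", "furios", "poate", "da"], LEXICON)
print(dict(counts))
```

With this rule a root such as rational accumulates the hits of all its subclasses, which is why the general classes can be compared across profiles even when different subclasses are triggered.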
So, we analysed 4 profiles of online commentators (abbreviated with "C"), which we considered to be predominant in cyberspace, as follows:
- in the first type of commentator, C5, self-confidence predominates (the class certain); he is, rather, the type of dynamic blogger (the class emotional). In general, he comments in line with the subject, being convinced of his ideas (the class firmness);
- the second type, C10, is unsure (the class uncertain). He comments in line with the subject, because he is looking for a way into the dialogue;
- the third type, C3, uses insulting language (the classes swear, negative, anger). He prefers to shock the audience; in general he is off topic or latches onto other commentators;
- the last type, C9, adopts a rational discourse (the class rational), with sustainable arguments (the class determine), and often has a moderate tone (the class moderate) on political topics.

Figure 1: Analysis of user's profiles in Evenimentul zilei journal
[Bar chart of dominant tonalities in Evenimentul zilei: percentage per semantic class for the series C1(Ev.), C2(Ev.), C3(Ev.) and C4(Ev.); the values are not recoverable from the source.]

3.4. A comparative lexical-semantic analysis between the profiles of two online journals

We present below a chart with two streams of data, collected during the presidential crisis, representing comments from the two online journals, Gândul and Evenimentul zilei. Our experience shows that an absolute difference below the threshold of 0.75% should be considered irrelevant and, therefore, ignored in the interpretation. Apart from simply computing frequencies, the system can also perform comparative studies. The assessments made are comprehensive over the selected classes because they represent averages over collections of texts, not just a single text.
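As an illustration of such a comparison, the sketch below computes, for each semantic class, the difference between the average frequencies of two streams and drops the differences whose absolute value is below the 0.75% threshold. It is a sketch under our own assumptions, not the DAT library: the function name, the data layout (class name mapped to a list of per-text percentages) and the numbers are hypothetical.

```python
from statistics import mean
from typing import Dict, List

THRESHOLD = 0.75  # percentage points, the threshold stated in the text


def class_differences(
    x: Dict[str, List[float]],
    y: Dict[str, List[float]],
    threshold: float = THRESHOLD,
) -> Dict[str, float]:
    """Per-class average(x) - average(y), keeping only relevant differences."""
    diffs = {}
    for cls in x.keys() & y.keys():  # classes present in both streams
        diff = mean(x[cls]) - mean(y[cls])
        if abs(diff) >= threshold:  # smaller differences are ignored
            diffs[cls] = diff
    return diffs


# Made-up per-text frequencies (in %) for two streams of comments.
stream_x = {"rational": [10.0, 12.0], "anger": [3.0, 3.5]}
stream_y = {"rational": [8.0, 9.0], "anger": [3.0, 3.0]}
print(class_differences(stream_x, stream_y))
```

A positive value means the class is more frequent in the first stream, a negative one that it is more frequent in the second; classes missing from the result are those whose difference falls under the threshold.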
To exemplify, one type of graphic considered for the interpretation was the one-to-one difference, given by Formula (2), included in the DAT Mathematical Functions Library:

DiffH = average(x) - average(y)    (2)

where x and y are two streams, average(x) and average(y) are the average frequencies of x and y over the whole stream, and the difference is computed for each selected class. So, the graphical representation in Figure 2, where the commentator C1 of Gândul is compared against the commentator C1 of Evenimentul zilei, should be interpreted as follows:
- the first profile, C1, is much better argued than the second one (the classes rational, firmness), with self-confidence predominating (the class certain), and is uttered in an affective tone (the classes emotional, negative);
- the second profile, C1, is more emotionally involved in his comments, showing upset, even anxiety (the classes anxiety and anger). He prefers to comment with sustainable arguments (the class determine), but often with a cautious tone (the class moderate), because he has no intention of starting a dispute with the others.

Figure 2: A comparative analysis between the users' profiles in the journals Gândul and Evenimentul zilei
[Bar chart of the comparative analysis for type C1: percentage per semantic class for the series C1(G.) and C1(Ev.); the values are not recoverable from the source.]

3.5. The pragmatic perspective

The pragmatic analysis should be based on knowledge of the civic intentions of the commentator in connection with the meanings of the article or of the other comments. Only a good knowledge of the civic aspirations of the receptors, together with knowing that the editor himself knows this spectrum of civic aspirations, could let a human analyst succeed in interpreting the whole range of subtleties of a comment. It is clear that pragmatics makes up a good deal of the forum interpretation process.
It is nevertheless true that an experienced human analyst would succeed in acquiring these facets of the pragmatic context of a comment even with little direct knowledge of them. It is like an act of reverse engineering in which the analyst is able to infer the civic behaviour of the speaker or of the receptor from the text itself. A closer look at a pragmatic analysis of online comments reveals the following aspects: interpretation of the text in terms of psychological distance between the partners, opponents, etc.; defining the transmitter's attitude before and after the instance of communication; determining the receptor's attitude (i.e. being pro, against or undecided); pursuing echoes of the article in the audience (immediately) or in time (offline comments), etc.; discovering the writer's intentions by highlighting the semantic roles of different sentence constituents (reiterations, expressions, etc.).

4. Conclusions

Discourse is a place where personality is disclosed, but not at a level of certainty that could lead to establishing incontestable patterns. Furthermore, on the Internet, the possibilities for manipulating information are endless. Manipulation by people who design web sites or participate in discussion groups can give clues as to whether the information on a site is reliable. However, by using statistical tools and pragmatic methods we can confront these risks on safer ground than before. Some features were mentioned earlier only for theoretical reasons, as their effective recovery from text is still beyond present-day technology. For instance, detecting the intentions of the online commentator, a feature falling within the pragmatic perspective, is not yet technologically feasible.
An author of a text is conscious that he provokes a reaction in the reader's mind, so his/her message has an intentional component (we talk mainly about conscious intentions, as they can be reflected in the author's and/or the editor's convictions about how the reader could be influenced). However, the automatic detection of authors' intentions, apart from the lines of research triggered by the Attentional State Theory (Grosz & Sidner, 1986) and Rhetorical Structure Theory (Mann & Thompson, 1988), is still far from conclusive. This research opens a new direction for the study of online journals' commentators in areas such as politics, culture, education, etc. The DAT tool can become a useful instrument for editorial policy and public relations departments. The study presented shows how one could shape profiles of commentators on the forums of online publications (which are permanently changing). As cyberspace is the perfect environment for hiding one's identity, some risks arise from this. An analysis along the lines presented in this study could prove helpful to different categories of beneficiaries, mainly media editors and PR specialists. They could use the results of such analyses to plan their policy better and to adapt to categories of public they might not even imagine to be part of the general public (as they call it). Public segmentation is a continuous activity for PR specialists, and it has to be performed using adequate criteria for each topic they want to develop in a discourse. This kind of research could and will be continued: as society changes, media techniques change, the relation between media and their readers changes all the time and, last but not least, civic identity changes, but the need to know whom you can count on remains of paramount importance.

Acknowledgments: In performing this research, the first author was supported by the POSDRU/89/1.5/S/63663 grant.

References

Adomavicius, G., Tuzhilin, A. (2001).
Using Data Mining Methods to Build Customer Profiles, IEEE Computer 34:2.
Akiyoshi, M. (2009). Les Japonais en ligne: le prisme des generations et des classes sociales, in HERMES (55).
Cardon, D. (2007). Le style deliberatif de la «blogosphere citoyenne», in HERMES (47).
Denis, A., Quignard, M., Freard, D., Detienne, F., Baker, M. and Barcellini, F. (2012). Detection de conflits dans les communautes epistemiques en ligne, Grenoble.
Ducrot, O. and Anscombre, J.-C. (1989). Logique, structure, enonciation. Lectures sur le langage, Minuit.
Garcia, P., Amandi, A., Schiaffino, S., Campo, M. (2007). Evaluating Bayesian Networks' Precision for Detecting Students' Learning Styles. Computers and Education 49:3.
Gifu, D., Cristea, D. (2012). Multi-dimensional analysis of political language. Future Information Technology, Application, and Service: FutureTech2012 (volume 1), Springer, Netherlands (James J., Jong Hyuk Park, Victor Leung, Taeshik Shon, Cho-Li Wang Eds.).
Grishchenko, V. (2004). A fuzzy model for context-dependent reputation, at the Trust, Security and Reputation Workshop at ISWC, Hiroshima, Japan.
Grivel, L., Bousquet, O. (2011). A discourse analysis methodology based on semantic principles - an application to brands, journalists and consumers discourses, Journal of Intelligence Studies in Business 1.
Grosz, B.J., Sidner, C.L. (1986). Attentional State Theory, Journal of Computational Linguistics, 12:3, 175-204, The MIT Press, Cambridge, MA, USA.
Hyland, K. (2009). Academic Discourse: English In A Global Context (Continuum Discourse).
Holt, R. (2004). Dialogue on the Internet. Language, Civic Identity, and Computer-Mediated Communication, Westport, Conn., PRAEGER.
Lin, J. (2007). Automatic Author Profiling of Online Chat Logs, M.S. Thesis, Naval Postgraduate School, Monterey.
Loh, C. (2009). Une ancienne deputee de Hong Kong sur la Toile: le site «Civic Exchange» (entretien avec Eric Sautede), in HERMES (55).
Kerbrat-Orecchioni, C. (1999).
L'enonciation: De la subjectivite dans le langage. 4-eme edition. Paris: Armand Colin.
Maingueneau, D. (2000). Analyser les textes de communication. Paris: Nathan.
Manchala, D.W. (1998). Trust metrics, models and protocols for electronic commerce transactions, Proceedings of the 18th International Conference on Distributed Computing Systems.
Mann, W.C., Thompson, S.A. (1988). Rhetorical Structure Theory. Toward a functional theory of text organization, Text - Interdisciplinary Journal for the Study of Discourse, 8:3, 243-281.
Mortensen, T. E. (2003). Pleasures of the player. Flow and control in online games, Volda University College.
Mui, L., Mohtashemi, M., Ang, C., Szolovits, P., Halberstadt, A. (2001). Ratings in distributed systems: a Bayesian approach, Proceedings of the Workshop on Information Technologies and Systems (WITS).
Nurmi, P. (2006). A Bayesian framework for online reputation systems, Proceedings of the Advanced Int'l Conference on Telecommunications and Int'l Conference on Internet and Web Applications and Services.
Pennebaker, J. W., Francis, Martha E., Booth, R. J. (2001). Linguistic Inquiry and Word Count - LIWC2001, Mahwah, NJ, Erlbaum Publishers.
Plantin, C. (2005). L'Argumentation, PUF, Que sais-je?
Pragmatics, (2006)/(2011). Metaphysics Research Lab, CSLI, Stanford University.
Oliveri, N. (2011). La cyberdependance: un objet pour les sciences de l'information et de la communication, in HERMES (59).
Sabater, J., Sierra, C. (2002). Social ReGreT, a reputation model based on social relations, SIGecom Exchanges 3.1.
Schiaffino, S., Amandi, A. (2009). Intelligent user profiling, Artificial intelligence, Lecture Notes In Computer Science, Vol. 5640. Springer-Verlag, Berlin, Heidelberg (Max Bramer ed.).
Stark, A., Dürscheid, C. (2011). SMS4science: An international corpus-based texting project and the specific challenges for multilingual Switzerland, Crispin Thurlow/Kristine Mroczek (Hrsg.): Digital Discourse.
Language in the New Media. Oxford: Oxford University Press.
Stoica, D. (2001). Modalites de la communication scientifique, in NOESIS. Travaux du Comite Roumain d'Histoire et de Philosophie des Sciences, vol. XXVI. Bucuresti, Editura Academiei Romane.
Trandabat, D., Irimia, E., Barbu Mititelu, V., Cristea, D., Tufis, D. (2012). The Romanian Language In The Digital Age. META-NET White Paper Series, Springer.
Tutescu, M. (1998). L'argumentation. Introduction a l'etude du discours, Bucuresti, Ed. Universitatii.
Zukerman, I. and Albrecht, D. (2001). Predictive Statistical Models for User Modeling. User Modeling and User-Adapted Interaction, 11(1-2).
Weiwei, Y., Donghai, G., Sungyoung, L., Young-Koo, L., Heejo, L. (2007). Bayesian Memory-Based Reputation System, in Proceedings of the 3rd international conference on Mobile multimedia communications.
Withby, A., Josang, A., Indulska, J. (2005). Filtering Out Unfair Ratings in Bayesian Reputation Systems, Journal of Management Research.

CHAPTER 3
SPEECH PROCESSING

ROMANIAN CORPUS FOR SPEECH-TO-TEXT ALIGNMENT

ANCA-DIANA BIBIRI1, DAN CRISTEA2,3, LAURA PISTOL2,3, LIVIU-ANDREI SCUTELNICU2,3, ADRIAN TURCULET1

1 "Al. I. Cuza" University, Department of Interdisciplinary Research in Social-Human Sciences, Iasi - Romania
2 "Al. I. Cuza" University, Faculty of Computer Science, Iasi - Romania
3 Institute of Computer Science, Romanian Academy, Iasi - Romania
anca.bibiri@gmail.com, {dcristea, laura.pistol, liviu.scutelnicu}@info.uaic.ro, aturcu@uaic.ro

Abstract

In this paper we present the methodology employed in the creation of an aligned speech-to-text Romanian corpus. The corpus uses recordings from the AMPER-ROM and AMPRom projects as well as ad-hoc recordings of continuous speech. The protocol for speech recording and labelling, as well as the manual annotation procedure, are described. The corpus is intended to be used for training a speech segmentation module and an automatic speech-to-text aligner module.
Keywords: Corpus, speech-to-text, alignment, PRAAT

1. Introduction

Since the early days of intonation research, automatic transcription of the intonation in speech corpora has been on the wish list of many researchers in phonetics, linguistics, and discourse analysis. For several decades, linguists have gathered a great amount of audio material to study aspects of spoken language. Unfortunately, some of the recordings carry different dialectal marks and other sources of variation, for example background noise, different phonetic intonation, differences in the timing of intonation, voice changes, etc.

Alignment of phonemes and text is the first stage of data processing necessary to provide usable training data for many phoneme-to-text conversion systems, including the most successful symbolic rule-based systems and most neural network systems (Bullinaria, 2011). A common requirement in speech technology is to align two different symbolic representations of the same linguistic message, for instance, phonemes with letters (Damper et al., 2005). As dictionaries become ever bigger, manual alignment becomes less and less tenable, yet automatic alignment is a hard problem for a language like Romanian.

In this paper we describe a methodology for building an aligned speech-to-text corpus for Romanian. The goal of the investigation is to set the principles for acquiring a significant corpus of signal-text aligned recordings, to be used for training a speech segmenter and a speech-to-text aligner module. By exploiting already existing continuous speech tracks, doubled by their textual transcriptions, an automatic aligner could be used to produce a large corpus of speech aligned to its textual transcription, thus creating the prerequisite for training a speech recognition system for Romanian.
Other applications of speech-to-text alignment systems are found in fields such as multimedia indexing, training of large vocabularies for speech recognition, health-related research, etc.

2. Corpora

2.1. AMPER-ROM[ANIA]

L'Atlas Multimedia Prosodique de l'Espace Roman (AMPER) is a latest-generation atlas which combines principles of geolinguistics with techniques of instrumental phonetics and of informatics. The atlas is conceived as an interactive database bringing together data collection and acoustic analysis concerning prosodic features of linguistic varieties specific to the Romance languages.

The Romanian Multimedia Prosodic Atlas (AMPRom) is the first prosodic atlas which aims to present the main intonation patterns of the Romanian language varieties, identified both at the level of the diatopic variants of the standard language and at the level of the dialect variants. During the prosodic dialectal investigations, two questionnaires are used: AMPER-ROM[ANIA] and AMPRom.

The first questionnaire consists of a series of statements (45 sentences) established by morpho-syntactic and phonetic criteria: declarative-affirmative and declarative-negative sentences, and total interrogative-affirmative and interrogative-negative sentences, all having the syntactic structure SVO (subject - verb - object). The S and O receive, in turn, adjectival and/or prepositional determinants; the nouns and adjectives used in the utterances are trisyllabic oxytones (the last syllable of the word is stressed), paroxytones (the penultimate syllable of the word is stressed) and proparoxytones (the antepenultimate syllable of the word is stressed). Since in the Romanian language the negation usually receives the stress of the phrase, the declarative-negative and interrogative-negative sentences were also introduced in the questionnaire.
The words occur both to the right and to the left of the verb, in order to capture all the prosodic indices (S - subject, V - verb, O - object, Adj - adjective; note that the subject is interchangeable with the object):

[S + V + O / S + Adj + V + O / S + V + O + Adj / S + S + V + O / S + V + O + S]

AMPER-ROM questionnaire (sequence). Each sentence is labelled with a unique code in order to identify the sentence when the acoustic analysis is made: bwt, dwk, fwt, gwt, kwt, pwt, swk, twg, twk, zwt.

twk Nevasta vede un căpitan. / The wife sees a captain.
kwt Un căpitan vede nevasta. / A captain sees the wife.
dwk Nevasta tinerea vede un căpitan. / The young wife sees a captain.
gwt Un căpitan elegant vede nevasta. / An elegant captain sees the wife.
swk Nevasta frumoasă vede un căpitan. / The beautiful wife sees a captain.
pwt Pasărea vede nevasta. / The bird sees the wife.
zwk Nevasta harnică vede un căpitan. / The hardworking wife sees a captain.
bwt Pasărea papagal vede nevasta. / The parrot bird sees the wife.
twg Nevasta vede un căpitan elegant. / The wife sees an elegant captain.
fwt Pasărea frumoasă vede nevasta. / The beautiful bird sees the wife.

The AMPER-ROM questionnaire also contains sentences with broad focus, as in the following examples. The labels of the sentences represent: twkae1 - the declarative affirmative sentence with the focus on the first element (the subject); twkie2 - the interrogative affirmative sentence where the object is stressed; twknev - the declarative negative sentence with focus on the verb.

twkae1 Nevasta vede un căpitan. / The wife sees a captain.
twkie2 Nevasta vede un căpitan? / The wife sees a captain?
twknev Nevasta nu vede un căpitan. / The wife does not see a captain.

2.2.
AMPRom

In order to capture a larger number of Romanian intonation patterns in their territorial distribution, a second questionnaire includes other statements, simpler (with fewer formal constraints), which facilitate the contact with the subjects and prepare them for the fixed questionnaire. It includes about 100 sentences and has two variants: a short version (compulsory, with 84 sentences) and an extended version (optional, with 111 sentences); the latter is applied only at some points of inquiry.

Types of syntactic structures that make up the AMPRom questionnaire:

- VO structures (with included subject):
1a: L-ai văzut pe Ion? / Have [you] seen John?
3a: Ai văzut fetele? / Have [you] seen the girls?
- Structures pursuing the relation between word order and prosody:
1b: Pe Ion l-ai văzut? / Was it John that you have seen?
3b: Fetele le-ai văzut? / Was it the girls that you have seen?
- VS/SV structures:
25a: Vine Ion. / There comes John. 25b: Ion vine. / John is coming.
28a: Cine vine? / Who is coming? 28b: Ion vine. / John is coming.
- Structures with double negation elements both in the question and in the answer:
26: Nu vine nime(ni) la noi? / There comes nobody/none to us?
30: N-a venit nime(ni) la noi. / Nobody/none came to us.
- Structures in which modulators are used (adverbs of manner and semi-adverbs - sure, precisely, certainly, immediately, surely, maybe, whether, really - or even modal verbs - I think, it might):
20b: Chiar vine Ion? / Really, is John coming?
21a: Sigur/Precis (că) vine. / Sure/precisely he is coming.
23c: Cred că vine. / I think he is coming.
- Structures containing different types of questions (partial, alternative, confirmation):
56a: Cât e ceasul? / What time is it?
41: Vii ori nu vii? / Are you coming or not?
55b: Pleci mâine la Iași, nu-i așa? / You are going tomorrow to Iasi, aren't you?
- Structures containing vocative addressing and calling:
40: Ion (Ioane), dă-mi un măr (te rog)! / Ion (John), give me an apple (please)!
35a: Ana! / Ann!, 35b: Maria! / Mary!
- Structures that require an intonation of continuity (in suspension):
49: Apucă-te/Ia și-nvață, că de nu... / Start/Put yourself to work/to learn, or else...
- Exclamatory structures:
84: Ce batic frumos ai! / What a beautiful scarf [you] have!
- Structures on intercalation prosody:
74a: Tata mi-a zis: Du-te repede și cheam-o pe soră-ta! / My father said, 'Go quickly and call your sister!'
74b: Du-te repede și cheam-o pe soră-ta! mi-a zis tata. / Go quickly and call your sister! my father said.
- Structures containing enumerations:
66: Am fost la piață/târg și am cumpărat: roșii, ceapă, morcov și ardei. / I was at the market/fair and bought tomatoes, onions, carrots and peppers.
- Structures containing a sequence of short sentences:
79: De dimineață m-am trezit, am pregătit micul dejun și apoi am plecat la serviciu. / This morning I woke up, I made breakfast and then I went to work.
- Sentences with the same structure (V) for the affirmative, interrogative and imperative mood:
80: Așteaptă. / [He/she] waits. 81: Așteaptă? / Does [he/she] wait? 82: Așteaptă! / Wait! / Așteaptă-mă! / Wait for me!
- Structures with a focus on different constituents:
4a: Pe Vasile l-ai văzut? / Was it Basil that you saw? 4b: L-ai văzut pe Vasile? / Did you see Basil? 58: Bei vin? / Are you drinking wine?
- Structures with a successive focus on constituents:
64: Mănânci pește? / Are you eating fish? 65a: Mănânci pește? / Are you eating fish?
- Affective structures:
56f: E/îi amiază? / Is it/It's noon? It's already noon?
59: Bei vin? / Are you drinking wine?

The extended form of the questionnaire contains other types of syntactic structures:

- Structures pursuing the prosody of idioms and phrases:
89 a, b, c...: da de unde!/what?
no way!; nu mai spune!/yah, do not say!; ce folos?/so, what?; nici vorbă/pomeneală!/no way!/not at all!; cum/unde să facă ea așa ceva?/what/how did she do that?; da mai știi?/that could be?; ei și?/so, what?
- Structures containing greetings and politeness:
91: Bună ziua! / Good afternoon!, 97: Poftim! / There you go! / Na! / Here! - Mulțumesc/mulțam! / Thank you/Thanks!
- Structures that use adverbs and adverbial phrases to strengthen the assertion and negation:
104: Da, sigur/firește/negreșit! / Yes, sure/surely/no doubt!
105: Nicidecum! / No way! Niciodată! / Never! Nici în ruptul capului! / On no account!
- Imprecations:
107 a, b, c...: Arde-l-ar focu' să-l ardă! / May he burn in hell! Lua-l-ar naiba/dracu să-l ia! / The hell/the devil with him! Fir-ar/fi-o-ar a dracului! / Damn it/Damn with it! - Du-te dracului/la dracu/la satana! / Go to the devil/to Satan!

The statements are recorded at least three times and are obtained through indirect questions and by verbal and non-verbal hints at the context (facial expressions, gestures), and/or by creating speech situations during the continuous dialogue between the investigator and the informant. In rural areas, two native subjects are used, representative of the local speech, with elementary education, middle-aged, who speak naturally under the conditions of the investigation. In urban areas the surveys are twofold: besides the informants belonging to the low and/or middle class, with influences of the local dialect, subjects with higher education, speaking a cultivated language, are also used.

2.3.
The IIT corpus

The IIT continuous speech corpus consists of recordings, summing up to 45 minutes of continuous speech, uttered in an office environment and following a standard voice recording procedure, by three female speakers who currently speak standard Romanian, aged between 33 and 50, having no pathological disorder and originating from the geographical area of North-Eastern Romania (the Iasi district). The recordings are single-channel, with a sampling frequency of 22050 Hz and 16-bit resolution. The sentences chosen for recording are paragraphs from "Amintiri din copilarie" (Childhood Memories), by the classical Romanian writer Ion Creanga, and dialogues from sketches by the Romanian writer and dramatist Ion Luca Caragiale. The choice of these classical literary works was imposed by the necessity for the corpus to be copyright-free. The size of the IIT database is shown in the following table:

Table 1: Size of the database (only for the writer Ion Creanga)

sentences              341
vocabulary size        2000
words (occurrences)    6505
words per sentence     19.07

3. Notation of sounds, phonemes, graphemes

In the following, by sound we mean a segment of a speech track, as it is heard by a human or recorded by a machine. A sound, in general, is characterised by steady physical parameters (amplitude, frequency) and corresponds to a letter in an alphabetic transcription. There is a huge variance of sounds corresponding to the same letter, depending on the articulatory and co-articulatory conditions of the sounds and on other factors, such as the context of communication and the speaker (sex, age, tonality, momentary physical and psychological state).

A phoneme is the conceptualization of a sound. The Romanian language has 31 phonemes. As such, one cannot say that phonemes are recorded.
Only sounds can be recorded, but out of them, phonemes are deciphered (interpreted) and, accordingly, noted. In the real world, a phoneme does not exist, but we can say "this sound records the phoneme a". The phonemes are noted in the International Phonetic Alphabet - IPA (see below).

The speech-to-text alignment conventions are based on the mappings between the two planes of expression of language: the concrete plane (of the substance of the language), populated with sounds, and the abstract plane (of the form, the linguistic plane), where phonemes coexist. These two planes are both doubled by two levels of expression: phonic and graphic (as suggested by Figure 1).

                      The phonic level    The graphic level
The concrete plane    sound               letter
The abstract plane    phoneme             grapheme

Figure 1: The speech to text correspondence

Although it is a phonemic language (sounds are transcribed as they are pronounced), Romanian has some particularities:

- the sounds z and s in dezbat vs. desfac, or răzbate vs. răsplăti, have as a variant /S/: /deSbat/, /rəSbate/; îmbrac and învăț have as a variant /N/: /iNbrak/, /iNvəts/; as a result of the neutralization of the opposition between /z/ and /s/, and between /n/ and /m/ respectively, such examples show occurrences of the archiphonemes /S/ and /N/;
- the use of the morphematic principle in order to maintain the formal identity of the words, especially in the alternation of the diphthongs oa and ua, and ea and ia respectively: oa ~ o: oameni - om, toată - tot, oală - ol; ua ~ u: băcăuan - Bacău, flăcăuaș - flăcău, and in the case of some neologisms: acuarelă, scuar; ea ~ e: teamă - tem, cheamă - chem, ceas - cesulei, ea - ele; ia ~ ie: iarna - ierni, piatră - pietre, or in the situation when there is no alternation: chiar, ghiaur;
- the morphematic principle is rarely used to differentiate morphemes: aceea(și) vs. aceiași, ea vs.
ia;
- the etymological spelling is maintained (totally or partially): eu, el, ei, ele, eram; absent, lied, watt, subțire, fotbal; alură, bleu; the most typical case is that of loans from English: computer, laptop, site, whisky, weekend.

Romanian spelling includes graphemes created using diacritical marks (because of the lack of specific letters in the Latin alphabet: ă, â, î, ș, ț) as well as polyvalent and compound graphemes having different contextual values.

There are polyvalent vocalic graphemes <e>, <i>, <o>, <u>, noting both the vowels /e/, /i/, /o/, /u/ and the corresponding semivowels /e̯/, /i̯/, /o̯/, /u̯/, as well as the sequence of a vowel + a dependent semivowel: <e> = [ie] in eu, eram, vie; <i> = [ii] in cais, ființă, oiște or [ii] in academia, ia 'blouse'; <o> = [uo] in fwr, Jpu] in merituos; <u> = [uu] in aur or [uu] in luând. In some cases, according to the morphematic principle, the graphemes <e>, <o> also note the semivowels /i̯/, /u̯/: aceea, ea, oameni, vioara.

The consonantal graphemes <c>, <g>, <k>, <n>, <x> have double values depending on the context where they occur: /k/ and /tʃ/, /g/ and /dʒ/, /k/ and /c/, /n/ and /N/, /ks/ and /gz/ in car and cer, gară and ger, kaliu and kaki, nas and învăț, ax and exemplu.

There are graphemes composed of two or three letters: <ce>, <ci>, <ge>, <gi>, <ch>, <gh>, <che>, <ghe>, <chi>, <ghi>: ceas, arici, geam, ungi, chem, ghem, cheamă, gheață, ochi, unchi, unghi.

The description of the phonemic system of the Romanian language has several interpretations, with different numbers of phonemes, depending on the authors' theoretical and methodological assumptions. The Romanian linguist E. Petrovici (1956) proposes in his phonemic theory the largest number of phonemes: 5/7 vowels and 72 consonants, and E. Vasiliu (1965) the smallest number: 7 vowels, 20 consonants, and one special phoneme called 'syllabic juncture'. For our corpus we propose a simple phonemic system, which best corresponds to Romanian writing, in accordance with the Latin alphabet.
This phonemic system (Turculet, 1999) is made up of 7 vowels: /e/, /i/, /ə/, /a/, /ɨ/, /o/, /u/, 4 semivowels: /e̯/, /i̯/, /o̯/, /u̯/, and 20 consonants ([c], [ɟ] are considered allophones of the phonemes /k, g/) - see Table 2.

Table 2: Symbols for consonants

Manner / Place   Bilabial   Labiodental   Dental-alveolar   Alveolar   Alveolo-palatal   Velar     Glottal
Plosive          /p/ /b/                  /t/ /d/                                        /k/ /g/
Nasal plosive    /m/                      /n/
Fricative                   /f/ /v/       /s/ /z/                      /ʃ/ /ʒ/                     /h/
Affricate                                 /ts/                         /tʃ/ /dʒ/
Lateral                                                     /l/
Trill                                                       /r/

The reduced vowel i, asyllabic and voiceless, specific to the Romanian language and called 'final, asyllabic, post-consonantal i', as in [lupi], [potsi] (it occurs rarely within a compound word, at the morpheme boundary: [orikind], [kitsiva]), is treated as a variant of the semivowel /i̯/. Thus, its phonetic label is [I] and its phonematic one is /i̯/ (it occurs after a consonant in final position and between two consonants in medial position).

The front rounded vowels [ö] and [ü], originating in some French and German loans, can be considered as situated at the periphery of the Romanian phonetic and phonemic system: [alürə], [blö], [röntgen] or [rentgən]. They are usually realised as the diphthongs [eo] and [iu], respectively.

Regarding the correspondence between phonemes and graphemes, we propose some simple solutions according to the combinations of Romanian letters used in writing. They concern the evaluation of compound graphemes (see supra).
The compound graphemes in the examples [tʃas], [aritʃ], [dʒam], [undʒ] are reduced to the simple graphemes <c>, <g>, followed by the 'latent' phonemes /e/ and /i/ (a possible solution proposed in generative phonology), with the phonemic transcription /tʃeas/, /aritʃi/, /dʒeam/, /undʒi/; the trigraphs <che>, <chi>, <ghe>, <ghi>, followed by the vowels <a>, <o>, <u> or in final position, are reduced to <ch>, <gh>: [camə], [car], [cor], [cul], [ɟatsə], [ɟozdan], [oc], [unɟ], with the phonemic transcription /ceamə/, /ciar/, /cior/, /ciul/, /ɟiatsə/, /ɟiozdan/, /oci/, /unɟi/. The compound graphemes are, in fact, the digraphs <ch> and <gh>, as in [cem] /cem/, [ɟem] /ɟem/, [camə] /ceamə/, [ɟatsə] /ɟeatsə/, [oc] /oci/, [unc] /unci/, [unɟ] /unɟi/.

Figure 2 shows an example of a speech-to-text alignment: a partial interrogative sentence uttered by a subject from Bucharest (Cristina Dabuleanu, 49 years old, computer programmer): Cum te cheamă? (What is your name?).

[Figure 2: Praat screen in the speech-to-text alignment of the utterance Cum te cheamă? Tiers shown: words (cum te cheama), syllables, graphemes (c u m t e ch e a m a), sounds (k u m t e c i a m a), phonemes (k u m t e c e a m a); horizontal axis: Time (s), up to 0.8366.]

Some loans (most of them from English) follow the writing and pronunciation rules of the source language; they will be marked with a special sign. Their letters/graphemes and sounds/phonemes will be maintained as they are in the foreign language: [læptop], [sajt].

4. Speech-to-text alignment

The purpose of the manual speech-to-text alignment is to determine with precision the boundaries of the sounds belonging to the phonic layer and to align them with the letters of the grapheme layer. The task is done by one of the co-authors, who has extensive experience in reading spectrograms and labelling phonemes. Using the graphical interface and listening to the audio track in Praat, she identifies the acoustic changes in order to determine the phoneme boundaries.
The annotation levels are: utterance, word, syllable, phoneme and grapheme. Table 3 shows the notations used with Praat in the alignment process.

Table 3: 9 tracks revealed by Praat, shown at different moments of time; apart from duration, the first 3 tracks (sound, syllable and word) represent manual annotation, while the other 5 are automatically recorded

Time   Duration   Sound      Syllable   Word        Intensity (dB)   F0 (Hz)   F1 (Hz)   F2 (Hz)   F3 (Hz)
0      0.09       n          ne         nevasta     74               0         390       1768      3096
0.09   0.05       eAef       ne         nevasta     73               205       463       1835      3088
0.14   0.06       v          \'1vas     nevasta     70               0         567       1484      2220
0.2    0.12       a::        \'1vas     nevasta     78               237       807       1523      3187
0.32   0.06       s          \'1vas     nevasta     61               0         806       1728      3208
0.38   0.06       t          ta         nevasta     75               0         621       1484      2965
0.44   0.06       a          ta         nevasta     79               228       875       1529      3092
0.5    0.08       v          \'1ve      vede-un     63               0         648       1375      2780
0.58   0.06       e          \'1ve      vede-un     77               230       497       2252      2927
0.64   0.07       d          di-un      vede-un     70               0         371       2291      2958
0.71   0.08       i\nvu\~^   di-un      vede-un     70               236       399       1087      2776
0.79   0.04       \ng        di-un      vede-un     71               0         452       1010      2593
0.83   0.06       k          c\sw       c\swpitan   63               0         257       1634      2912
0.89   0.04       \sw        c\sw       c\swpitan   76               223       527       1528      2827
0.93   0.09       p          pi         c\swpitan   53               0         464       1903      3146
1.02   0.04       i          pi         c\swpitan   69               212       389       2504      3353
1.06   0.1        t          \'1tan     c\swpitan   53               0         301       2541      3395
1.16   0.09       a\~^\~v:   \'1tan     c\swpitan   66               146       942       1746      3229
1.25   0.06       n          \'1tan     c\swpitan   62               0         1039      3137      3568

PRAAT is a flexible tool for the analysis of acoustic speech signals. It offers a wide range of standard and non-standard procedures, including spectrographic analysis, articulatory synthesis and neural networks.

Speech segmentation is the process of identifying the boundaries between words, syllables and phonemes. Performed manually, this process attaches a label to each segment. For example, once the words have been segmented and labelled, the syllables of each word are segmented and, finally, the sounds that compose them.
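The relation between the sound, syllable and word tracks of Table 3 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the script used for the corpus: the rows are the first seven entries of Table 3 (the word nevasta), the labels are simplified to plain ASCII (stress and length marks omitted), and the helper names merge_tier and is_contiguous are hypothetical.

```python
from itertools import groupby

# (start time, duration, sound, syllable, word): first seven rows of Table 3,
# with labels simplified to plain ASCII (an assumption made for readability).
ROWS = [
    (0.00, 0.09, "n", "ne", "nevasta"),
    (0.09, 0.05, "e", "ne", "nevasta"),
    (0.14, 0.06, "v", "vas", "nevasta"),
    (0.20, 0.12, "a", "vas", "nevasta"),
    (0.32, 0.06, "s", "vas", "nevasta"),
    (0.38, 0.06, "t", "ta", "nevasta"),
    (0.44, 0.06, "a", "ta", "nevasta"),
]

def is_contiguous(rows, tol=1e-6):
    """Check that each sound segment starts where the previous one ends."""
    return all(abs(a[0] + a[1] - b[0]) <= tol for a, b in zip(rows, rows[1:]))

def merge_tier(rows, column):
    """Merge consecutive rows sharing the label in `column` into (label, start, end)."""
    intervals = []
    for label, group in groupby(rows, key=lambda r: r[column]):
        group = list(group)
        start = group[0][0]
        end = round(group[-1][0] + group[-1][1], 3)  # start of last sound + its duration
        intervals.append((label, start, end))
    return intervals

syllables = merge_tier(ROWS, 3)  # syllable tier derived from the sound rows
words = merge_tier(ROWS, 4)      # word tier derived from the sound rows
```

In this reading, the syllable and word boundaries always coincide with sound boundaries, so the higher tiers can be checked against the sound tier rather than annotated independently.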
The steps in the analysis of a speech waveform are as follows:

- The script reads sound files (.wav format - Waveform Audio File Format) from a user-specified folder;
- It then creates a TextGrid (which consists of a number of tiers; an interval tier is a connected sequence of labelled intervals, with boundaries in between);
- Selecting both the .wav and the TextGrid files opens a spectrogram window in which the annotation is made manually: 3 tiers are opened in order to annotate words, syllables and phonemes;
- Once the speech signal is segmented and labelled, pushing the run button generates a text file as output, including different parameters: the fundamental frequency (F0, in the three points of a vowel - F1, F2 and F3), the duration and the intensity of the acoustic signal.

For the speech-to-text alignment of the corpus, the supra-segmental features of the utterance are also taken into consideration: the stress, the intonation and the break indices (as indicated by punctuation marks). A more appropriate rendering is that used in ToBI (Tones and Break Indices; see footnote 1) - a framework for developing community-wide conventions for transcribing the intonation and prosodic structures of spoken utterances in a language variety. A ToBI framework system for a language variety is grounded on the intonation system and on the relationship between intonation and the prosodic structures of the language.

5. Conclusions

In this paper we presented a methodology for the manual annotation of an aligned speech-to-text corpus for Romanian, as well as the phonetic peculiarities of this language. The intention is to use this corpus to train a speech segmentation and aligner program (let's call it a SEG-ALI module) that would be able to detect the boundaries of sounds in correlation with a text track where the textual transcription is noted.
Different parameters of the speech signal, some of them suggested in this paper when presenting the processing capabilities of the Praat system, will be exploited by a learning system that will finally train the SEG-ALI module. A top-down strategy will, most probably, be employed for this purpose: first searching for the pauses in the sound track and aligning them with the boundaries between sentences and words, and then using higher-level features to detect phoneme boundaries between the pauses of the continuous speech. Once such a SEG-ALI module is obtained, it could be used to automatically segment and align a very large corpus of parallel tracks containing human-produced continuous speech and its textual transcription. In the long run, the intention is to acquire a large corpus of aligned speech-to-text records that will be used in training a speech recognition system for Romanian. Knowing the high costs incurred by the manual segmentation of the voice track and its alignment against the text track, our hope is to arrive at a very good performance of the SEG-ALI module, which would permit the automatic acquisition of a very large corpus in a short time and at reduced cost.

We also do not neglect the possibility of using a bootstrapping strategy in acquiring a high quality aligned corpus: use the manually annotated corpus as a core corpus on which a beta version (v0) of a SEG-ALI module is first trained. Then use this SEG-ALI-v0 to segment and align a larger corpus, and involve specialised humans to correct it. This activity is expected to take less time than building the corpus from scratch, and also to cost less. Once finished, use this larger corpus to retrain the SEG-ALI module to a new and enhanced version, v1, and so on.
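The first step of the top-down strategy mentioned above, searching for pauses in the sound track, could be sketched as follows. This is a minimal illustration, not the planned SEG-ALI implementation: it assumes samples normalised to the range [-1, 1], and the function name find_pauses, the 10 ms frame, the RMS threshold and the minimum pause length are all hypothetical values chosen for the example.

```python
import math

def find_pauses(samples, rate, frame_ms=10, threshold=0.02, min_pause_ms=100):
    """Return (start_s, end_s) spans whose frame RMS stays below `threshold`.

    All parameter defaults are illustrative assumptions, not tuned values.
    """
    frame_len = max(1, int(rate * frame_ms / 1000))
    min_frames = max(1, int(min_pause_ms / frame_ms))
    n_frames = len(samples) // frame_len
    pauses, run_start = [], None
    for i in range(n_frames + 1):
        if i < n_frames:
            frame = samples[i * frame_len:(i + 1) * frame_len]
            rms = math.sqrt(sum(x * x for x in frame) / frame_len)
            quiet = rms < threshold
        else:
            quiet = False  # sentinel frame: closes a pause that runs to the end
        if quiet and run_start is None:
            run_start = i
        elif not quiet and run_start is not None:
            if i - run_start >= min_frames:
                pauses.append((run_start * frame_len / rate, i * frame_len / rate))
            run_start = None
    return pauses

# Synthetic check: 0.3 s of a 100 Hz tone, 0.2 s of silence, 0.3 s of tone.
rate = 1000
tone = [0.5 * math.sin(2 * math.pi * 100 * n / rate) for n in range(300)]
signal = tone + [0.0] * 200 + tone
print(find_pauses(signal, rate))  # [(0.3, 0.5)]
```

The detected pause spans would then be matched against sentence and word boundaries of the transcription, leaving the finer phoneme boundaries between pauses to higher-level features.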
Footnote 1. Tones and Break Indices: http://ww.cs.indiana.edu/~port/teach/306/tobi.summary.html

References

AMPER - Atlas Multimedia Prosodique de l'Espace Roman, http://w3.u-grenoble3.fr/dialecto/AMPER/amper.html
AMPROM - http://amprom.uaic.ro/
Handbook of the International Phonetic Alphabet. A Guide to the use of the International Phonetic Alphabet (1999). Cambridge University Press.
Boersma, P., Weenink, D. (2013). Praat: doing phonetics by computer [Computer program]. Version 5.3.42, retrieved 8 February 2013 from http://www.praat.org/
Bullinaria, John A. (2011). Text to Phoneme Alignment and Mapping for Speech Technology: A Neural Networks Approach, IJCNN, IEEE, 625-632.
Damper, R. I., Marchand, Y., Marsters, J.-D. S., Bazin, A. I. (2005). Aligning text and phonemes for speech technology applications using an EM-like algorithm. International Journal of Speech Technology, no. 8, 149-162.
Hosom, J.-P. (2009). Speaker-independent phoneme alignment using transition-dependent states, Speech Communication, no. 51, 352-368.
Petrovici, E. (1956). Sistemul fonematic al limbii romane (The phonemic system of the Romanian language), in Studii si Cercetari Lingvistice, VII: 1-2, 7-21.
Turculet, A. (1999). Introducere in fonetica generala si romaneasca (Introduction to general and Romanian phonetics), Demiurg Editorial House, Iasi.
Vasiliu, E. (1965). Fonologia limbii romane (The phonology of the Romanian language), Editura Stiintifica, Bucharest.

DATA-DRIVEN METHODS FOR PHONETIC TRANSCRIPTION OF OUT-OF-VOCABULARY (OOV) WORDS

TIBERIU BOROS1, RADU ION1, DAN STEFANESCU2

1 Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest, Romania
2 University of Memphis, USA

{tibi, radu}@racai.ro, {dstfnscu}@memphis.edu

Abstract

Letter to Phoneme conversion (L2P) is a crucial problem in any modern text-to-speech (TTS) synthesis system. The L2P conversion is routinely done with the help of a lexicon.
An inherent problem of this approach is that, regardless of the size of the lexicon, there will always be out of vocabulary (OOV) words, for which a method for automatic phonetic transcription is required. In this paper we present our L2P system, which uses a set of 4 methods for obtaining phonetic transcriptions for OOV words. We compare our results with existing state of the art methods, showing that our system is up to par.

Keywords: letter-to-phoneme conversion, text-to-speech synthesis, out of vocabulary words.

1. Introduction

Predicting pronunciation for OOV words is a major challenge for any TTS system. While this can be a simpler task for languages where there is a clear relationship between letters and their phonetic transcription (e.g. Romanian, which has a preponderantly phonemic orthography), for others, such as English, it may pose considerable difficulties. Consequently, phonetic transcription is a key component in every TTS system, but this is not its only application. Other tasks, like spelling correction, can be addressed using phonetic transcription by means of phonetic similarity and perceptive search.

L2P conversion usually means detecting a set of language-dependent rules that will map letters to phonemes. These rules may be written by linguists or automatically inferred from a given list of word/phonetic transcription pairs. Phonetic transcription is the next step, where the possible L2P rules are applied to the OOV word's written form and the best phonetic transcription is selected according to an optimality criterion. Various scientific studies have focused on automatically extracting L2P conversion rules from available hand-made transcriptions (Black et al., 1998; Jiampojamarn et al., 2008; Paget et al., 1998).
At this point, we should note that phonetic transcription of OOV words does not address other phonetic transcription problems such as homograph disambiguation. OOV words are not included in the training lexicon and it is impossible to infer that these words have multiple pronunciations depending on their senses.

In this paper we describe Bermuda, a system for automatic phonetic transcription of words, starting from the alignment provided by GIZA++ (Och and Ney, 2003). Bermuda combines a set of four different data-driven methods which will be detailed later in this paper. Our entire training process is automatic: there is no need for manual intervention in finding alignments between words and phonetic transcriptions. We intend to extend this system to include other state of the art methods for automatic phonetic transcription.

2. Related work

Phonetic transcription is an area of active research, which has produced a multitude of solutions, mostly based on machine learning (ML) methods. Basically, their objective is to generate sequences of phonemes (phonetic transcription) from sequences of letters (words).

Divay and Vitale (1997) presented an L2P method that used a large number of context-sensitive and context-free rules, with a minimum number of ordering constraints, for the phonetic transcription of words. Another approach was to use part-of-speech (POS) tagging methods (Hidden Markov Models) and to treat each individual letter independently, as if it were a word in a sentence that required POS tagging (Taylor, 2005). This method did not yield high accuracy results. Later, it was shown that better results could be obtained by pairing letter substrings with phoneme substrings (Bisani & Ney, 2002; Marchand & Damper, 2000; Jiampojamarn et al., 2008), instead of treating each letter individually.
The reason resides in the fact that phonetic transcriptions are context dependent: the next phoneme in line depends on the current and previous letters and, reportedly, also on the next letters (Demberg, 2007). Multinomial classifiers have also been used to predict phonetic transcription based on features extracted from letters and groups of letters inside words (Black et al., 1998; Jiampojamarn et al., 2008; Paget et al., 1998).

All ML methods require training data, but obtaining such a corpus is not straightforward. Lexicons usually contain words with associated phonetic transcriptions. However, the relationship between letters and phonemes is not always one-to-one. For example, not all words have the same number of letters as the number of phonemes in their phonetic transcription (e.g. feared: F IH R D) and, even if the number of phonemes is equal to the number of letters, this does not necessarily imply that only one-to-one alignments exist between them (e.g. experience: IH K S P IH R IY AH N S; the letter x spawns two phonemes, 'K' and 'S', and the ending 'e' is silent). This relationship is captured by what is known as L2P alignments.

The Expectation-Maximization (EM) algorithm and its variants have been used to find one-to-one or many-to-many alignments between letters and phonemes in (Black et al., 1998; Jiampojamarn et al., 2008; Paget et al., 1998). Given the fact that certain pairs of letters and phonemes are much more frequent than others, EM can be employed in order to automatically detect the most probable alignments given a list of pairs of words and their transcriptions as training data.

2.1. System overview

Bermuda's architecture is organized in two layers.
The first layer uses two methods for obtaining phonetic transcriptions of words: the first method implements the Dictionary Lookup or Probability Smoothing (DLOPS) algorithm (see section 3.1) and the second method (Phonetic Transcription Classifier - PTC) is based on a MaxEnt classifier (see section 3.2). The second layer is designed to automatically correct systematic failures of the first layer methods. As shown in section 5, chaining the second layer correction method (ERC) to the output of the first layer methods gives an increase in accuracy ranging from 1 to 7%.

DLOPS is a data-driven algorithm that generates phonetic transcriptions of OOV words by optimally adjoining maximal spans of phonetic transcriptions found in a transcription dictionary, corresponding to adjacent parts of the input word. The MaxEnt classifier uses features constructed from contextual letters, groups of letters and previously predicted phonemes in order to predict the phonetic transcription of an input word.

Starting from the alignment between letters and phonemes, Bermuda trains the first layer methods, building models for DLOPS and PTC (sections 3.1 and 3.2). Next, the first layer's methods are used to predict the phonetic transcriptions of the words inside the training lexicon. ERC uses features similar to the PTC features, supplemented by features extracted from the predicted phonetic transcriptions which, at this step, have become available. Systematic errors in the phonetic transcriptions obtained using the first layer methods are corrected using ERC.

2.2. Letter to phoneme alignment

According to Jiampojamarn et al. (2008), the L2P task is characterized by a hidden structure that connects the input set (letters) to the output set (phonemes). Pairing (aligning) the two sets is not a straightforward problem (in section 1 we presented the example of the word experience).
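To make the alignment problem concrete, the following minimal sketch pairs letters and phonemes by position, which is the naive one-to-one strategy that fails on the two examples quoted in this paper. The function name `naive_pairs` is hypothetical and only serves the illustration; it is not part of Bermuda.

```python
def naive_pairs(word, phonemes):
    """Pair the i-th letter with the i-th phoneme (positional, one-to-one)."""
    return list(zip(word, phonemes))

# "experience" has 10 letters and 10 phonemes, yet the positional pairs are
# wrong from the third letter on, because x maps to K S and the final e is silent.
experience = naive_pairs("experience", "IH K S P IH R IY AH N S".split())

# "feared" has 6 letters but only 4 phonemes, so zip() silently drops "ed".
feared = naive_pairs("feared", "F IH R D".split())

for letter, phoneme in experience:
    print(letter, phoneme)
print(len(feared), "pairs for a 6-letter word")
```

The mismatches in both outputs are what a learned alignment (EM-based, or GIZA++ in Bermuda's case) is meant to resolve.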
Bermuda uses GIZA++ (Och and Ney, 2003) to find alignments between the input word, segmented at letter level, and its composing phonemes. GIZA++ is a free toolkit for generating word alignments in a parallel corpus. It is usually used to create training data for machine translation (MT) systems but, as Rama et al. (2009) showed, it can also be used to pre-process training data for L2P conversion systems. For each training lexicon we run GIZA++ for a primary letter to phoneme alignment with default parameters (10 iterations of IBM-1, HMM, IBM-3 and IBM-4 models). The available dictionary is split into two files. The first file contains one word per line with its letters separated by spaces, so that GIZA++ treats them as words in the source language. The second file contains, on line number n, the phonetic symbols that "translate" the word on line n of the first file, also separated by spaces (regarded as words in the target language).

3. First layer methods

This section focuses on the first layer methods. We introduce the Dictionary Lookup or Probability Smoothing (DLOPS) algorithm (section 3.1) and we explain how we used the Maximum Entropy classifier to predict the pronunciation of OOV words (section 3.2).

3.1. The DLOPS algorithm

DLOPS is a recursive, divide and conquer algorithm. Although its name starts with Dictionary Lookup, this does not mean that it tries to retrieve whole words from a dictionary. Instead, it attempts to get the phonetic transcription of a group of letters, either by doing a table lookup or by approximating the transcription from smaller contained units. Its primary goal is to predict the pronunciation of OOV words, without getting into the problem of disambiguating between heteronyms: words having the same spelling and different pronunciations.
This would require additional contextual, semantic or etymological information about a word, and such information is not available for standalone OOV words. The pseudo code for our method is:

Input:
• w[] - vector containing the letters of the word
• n - size of the vector (number of letters)
• table - hash table containing groups of letters and their phonetic transcriptions with probabilities
Output:
• t[] - vector of phonetic transcriptions and their scores

1. DLOPS(w[]) {
2.   if (exists(table[w])) then
3.     return transcriptions from table[w];
4.   else
5.     idx <- findPivot(w);
6.     return MergeResults(DLOPS(w[1...idx]), DLOPS(w[idx...n]));
7.   endif
8. }

The algorithm performs a dictionary lookup (line 2) and, if there is a corresponding set of phonemes for the given letter sequence, returns all possible phonetic transcriptions with their associated probabilities (line 3). If the lookup fails, the algorithm seeks an optimal split position in the letter sequence (line 5). Once this location is obtained, the phonetic transcription of the given letter sequence is approximated using the phonetic transcriptions of the two overlapping substrings (see the next paragraph). Given that the two substrings overlap on the character located at the juncture point, we expect the candidate phonetic transcriptions for the two substrings to overlap as well. The score $S$ of a transcription candidate, composed of two adjoined phoneme sequences $S_1$ and $S_2$, is computed using the original transcription probabilities of these phoneme sequences given the letter sequences w[1...idx] and w[idx...n],

$$P_1 = P(S_1 \mid w[1 \ldots idx]); \quad P_2 = P(S_2 \mid w[idx \ldots n]) \qquad (1)$$

and a fusion probability.
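The recursion in the pseudo code can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Bermuda's implementation: transcriptions are stored as tuples with one phoneme unit per letter, the pivot is the midpoint of the letter sequence, the score is just the product of the two transcription probabilities (the fusion probability is left out), and two-letter sequences missing from the table are split without overlap, as the paper describes for that case. When no candidates agree on the shared letter, the sketch keeps the right-hand unit, which is an arbitrary choice. `toy_table` and all function names are hypothetical.

```python
from itertools import product

def find_pivot_midpoint(letters):
    """Midpoint pivot: split the letter sequence in the middle."""
    return len(letters) // 2

def ranked(candidates):
    """Turn {transcription: score} into a list sorted by descending score."""
    return sorted(candidates.items(), key=lambda item: -item[1])

def merge_results(left, right, overlap=True):
    """Adjoin left and right candidates; with overlap, they share one letter."""
    pairs = list(product(left, right))
    if overlap:
        # Prefer pairs that agree on the unit of the shared pivot letter;
        # if none agree, fall back to the full Cartesian product.
        agreeing = [(l, r) for l, r in pairs if l[0][-1] == r[0][0]]
        pairs = agreeing or pairs
    merged = {}
    for (l_units, l_p), (r_units, r_p) in pairs:
        units = (l_units[:-1] if overlap else l_units) + r_units
        # Simplification: product of probabilities, no fusion probability.
        merged[units] = max(merged.get(units, 0.0), l_p * r_p)
    return ranked(merged)

def dlops(letters, table):
    if letters in table:                      # dictionary lookup
        return ranked(table[letters])
    if len(letters) == 1:
        raise KeyError(f"no transcription for letter {letters!r}")
    if len(letters) == 2:                     # non-overlapping split
        return merge_results(dlops(letters[0], table),
                             dlops(letters[1], table), overlap=False)
    idx = find_pivot_midpoint(letters)        # both halves contain letters[idx]
    return merge_results(dlops(letters[:idx + 1], table),
                         dlops(letters[idx:], table))

# Hypothetical toy table: letter group -> {per-letter phoneme units: probability}.
toy_table = {
    "a": {("AE",): 0.6, ("AH",): 0.4},
    "b": {("B",): 1.0},
    "s": {("S",): 0.9, ("Z",): 0.1},
    "ab": {("AE", "B"): 0.8, ("AH", "B"): 0.2},
    "bs": {("B", "S"): 1.0},
}

print(dlops("abs", toy_table)[0])
```

In this toy run, "abs" is not in the table, so it is split into "ab" and "bs", which overlap on "b"; the candidates that agree on B are joined and ranked by the product of their probabilities.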
The fusion probability is a smoothing function applied over a 5-symbol phoneme sequence composed of the last two symbols of $S_1$ before the fusion index and the next three symbols (equation 2):

$$S = P_1 \cdot P_2 \cdot \prod_{j=f-k}^{f+k} P_j \qquad (2)$$

where:
• $P_1$, $P_2$ - emission probabilities of the phoneme sequences $S_1$ and $S_2$
• $f$ - the fusion index
• $P_j$ - N-gram interpolation model for position $j$, using a smoothing function
• $k$ - half of the length of the fusion window ($k = 2$)

FindPivot is a function that maximizes the transcription probability of the first ranking transcription candidate for one or both letter substrings. We have experimented with different functions for estimating the pivot location (line 5 of the pseudo code, findPivot; see Table 1 for results). For the first test (the FP1 version of findPivot) we calculated the position by splitting the word in half. For the second test (FP2), we tried to detect an index position that would yield the highest score for the first ranking candidate in the transcription probabilities table for the letter sequence found either to the left or to the right. For the third test (FP3), we looked for an index position that would maximize the transcription score of the first ranking transcription candidate (the highest score after merging overlapped results) for both the left and the right letter sequences, if they are contained in the transcription table. When this is not possible, the FP3 version of the findPivot function falls back to the FP2 version.

A runtime example is illustrated in Figure 1. The word chosen for L2P conversion is "absenteeism" and we explain the execution of our method using the FP1 pivot function (which splits the letter sequence in the middle). This example has an equal execution depth for each node. This does not apply to all cases, and some strings generate unequal execution depths for the nodes.

The algorithm has to cope with a couple of exceptions. If no transcription candidates overlap, we use the Cartesian product of all the transcription candidates.
If the input sequence has a length of 2 and there are no transcription candidates for this letter sequence, we split the input string into non-overlapping sequences ("ab" -> "a" + "b"). For the algorithm to always return results, the database must contain transcription candidates for every letter in the alphabet of the target language.

Results for each FP function were computed on CMUDICT.06D (English) (CMU, 2011), BRULEX (French) (Content et al., 1990) and CELEX (German) (Baayen et al., 1995) to show how each FP function influences the results. The tests were performed using 10-fold cross-validation.

Figure 1: Execution of DLOPS for the word "absenteeism"

We calculated the precision (P) and the Mean Reciprocal Rank (MRR) for each FP function we used (Table 1). As shown, the FP3 function gets the best results, so we used its output when chaining the second layer ERC method. The results obtained using the FP3 function are very close to those obtained using CART (Black et al., 1998).

$$P = \frac{\text{number of correct first ranking suggestions}}{\text{total number of words}} \qquad (3)$$

$$MRR = \frac{1}{\text{total number of words}} \sum_{i} \frac{1}{rank_i} \qquad (4)$$

where $rank_i$ is the rank of the correct transcription.
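The two evaluation measures can be computed as in the short sketch below. It assumes that each test word is summarized by the rank at which the correct transcription appears in the candidate list, and that a word whose correct transcription is missing altogether (rank `None`) contributes zero; the latter convention and the function names are assumptions made for this illustration. The ranks in the example are made-up values, not results from the paper.

```python
def precision_at_1(ranks):
    """Share of words whose correct transcription is the first suggestion."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all words; missing answers (None) count as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Hypothetical ranks of the correct transcription for four test words.
example_ranks = [1, 2, 1, 4]
print(precision_at_1(example_ranks))        # 2 of 4 words are correct at rank 1
print(mean_reciprocal_rank(example_ranks))  # (1 + 1/2 + 1 + 1/4) / 4
```

MRR rewards systems that place the correct transcription near the top even when it is not first, which is why it is always at least as high as P.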
Table 1: Experimental results with FP functions (P and MRR for FP1, FP2 and FP3 on CMUDICT, BRULEX and CELEX-GERMAN; values listed in extraction order)
55.16 72.55 53.21 76.22 80.41 93.99 79.63 93.79 57.22 74.54 81.41 93.99 80.63 93.79 80.64 93.28 82.94 93.28

DLOPS training

For DLOPS we extract n-grams up to order 5 from the phonetic transcription symbols by moving a context window and counting occurrences of symbol sequences. Next, we build a model consisting of a set of letter sequences and their possible phonetic transcriptions with corresponding probabilities. We compute transcription probabilities for sequences of 1, 2, 3 and 4 letters (equation 5):

$$P(PT_k \mid LS_i) = \frac{C(LS_i, PT_k)}{C(LS_i)} \qquad (5)$$

where:
• $C(LS_i)$ - number of occurrences of letter sequence $i$
• $C(LS_i, PT_k)$ - number of occurrences of letter sequence $i$ with phonetic transcription $k$
• $P(PT_k \mid LS_i)$ - emission probability of phonetic transcription $k$ given the letter sequence $i$

3.2. Phonetic Transcription with Maximum Entropy

The second method used for the task of phonetic transcription (PT) is based on a Maximum Entropy (MaxEnt) classifier. MaxEnt classifiers have been used in the NLP field to solve problems such as detecting sentence boundaries (Reynar and Ratnaparkhi, 1997; Agarwal, 2005), POS tagging (Ratnaparkhi, 1996), text classification (Nigam et al., 1999), etc. The guiding principle of MaxEnt is to construct a statistical prediction model from training data without making assumptions about unseen data: among the models consistent with the training data, it prefers the most uniform one, that is, the one with maximum entropy. This principle is thoroughly described in Berger et al. (1996). We apply this methodology to phonetic transcription by employing a publicly available MaxEnt classifier, SharpEntropy (see footnote 1). In order to do this, we need to reframe the phonetic transcription problem as a label prediction process applied to each letter inside the word. Each letter is now described by an object characterized by a set of n features (corresponding to a point inside the n-dimensional feature space).
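The per-letter reframing can be sketched as follows. The feature templates follow the ones this section lists for the MaxEnt classifier (two symmetric letter windows, three n-grams ending with the letter, three n-grams starting with the letter, and the previously predicted phoneme). The padding symbol "#", the feature numbering, and the rule that a one-sided n-gram is dropped when it would contain more than one padding symbol are assumptions read off the paper's "abolish" example, not a documented specification. The function names and the use of "-" for a silent letter are hypothetical as well.

```python
PAD = "#"

def letter_features(word, pos, prev_phoneme):
    """Build the numbered feature dict for the letter at index pos."""
    padded = PAD * 3 + word + PAD * 3
    c = pos + 3                                  # index of the letter in padded
    feats = {}
    for i in (1, 2):                             # features 1-2: symmetric windows
        feats[i] = padded[c - i:c + i + 1]
    for num, i in ((3, 3), (4, 2), (5, 1)):      # features 3-5: n-grams ending here
        gram = padded[c - i:c + 1]
        if gram.count(PAD) <= 1:                 # assumed exclusion rule
            feats[num] = gram
    for num, i in ((6, 1), (7, 2), (8, 3)):      # features 6-8: n-grams starting here
        gram = padded[c:c + i + 1]
        if gram.count(PAD) <= 1:
            feats[num] = gram
    feats[9] = prev_phoneme                      # feature 9: previous phoneme
    return feats

def training_instances(word, labels):
    """One (features, label) pair per letter, using gold labels as history."""
    instances, prev = [], PAD
    for pos, label in enumerate(labels):
        instances.append((letter_features(word, pos, prev), label))
        prev = label
    return instances

# "-" stands for a letter with no phoneme of its own (assumed notation).
for feats, label in training_instances(
        "abolish", ["AH", "B", "AA", "L", "IH", "SH", "-"]):
    print(feats, "->", label)
```

Each (features, label) pair is one training event for the classifier; at prediction time the previous phoneme would come from the classifier's own output instead of the gold labels.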
We experimented with features extracted from a limited context window, divided into lexical features (based on the letters of the word) and phonetic features (based on previously predicted labels). After testing different feature sets, we chose the one yielding the best results (see Table 2 for an example). For a given letter L, we have the following features:
• $l_{-i} \ldots L \ldots l_{+i}$, for i = 1, 2: features 1 and 2 in Table 2,
• $l_{-i} \ldots L$, for i = 1..3: features 3 to 5,
• $L \ldots l_{+i}$, for i = 1..3: features 6 to 8,
• $p_{-1}$: feature 9,
where $l_{-i}$ is the previous i-th letter and $l_{+i}$ is the next i-th letter; $p_{-1}$ is the previously predicted phoneme. We have tested some other features based on word length, the position of the letter in the word, whether the letter is a vowel or not, etc., but the use of such features did not improve the model.

There are cases when certain features are excluded (see Table 2). For example, the $l_{-3} \ldots L$ feature (the 4-gram ending with the given letter) is never used for the first letter of a word, mainly because the information it encodes is already contained in the $l_{-1} L$ feature. Moreover, this feature would be completely non-discriminative in these cases, because its value would be identical for all words beginning with the same letter.

Table 2: Features and labels for the letters of the word "abolish"

Letter   Features                                                          Label
a        1:#ab 2:##abo 5:#a 6:ab 7:abo 8:abol 9:#                          AH
b        1:abo 2:#abol 4:#ab 5:ab 6:bo 7:bol 8:boli 9:AH                   B
o        1:bol 2:aboli 3:#abo 4:abo 5:bo 6:ol 7:oli 8:olis 9:B             AA
l        1:oli 2:bolis 3:abol 4:bol 5:ol 6:li 7:lis 8:lish 9:AA            L
i        1:lis 2:olish 3:boli 4:oli 5:li 6:is 7:ish 8:ish# 9:L             IH
s        1:ish 2:lish# 3:olis 4:lis 5:is 6:sh 7:sh# 9:IH                   SH
h        1:sh# 2:ish## 3:lish 4:ish 5:sh 6:h# 9:SH                         (none)

1 http://www.codeproject.com/Articles/11090/Maximum-Entropy-Modeling-Using-SharpEntropy

4. Second layer methods

We noticed systematic errors in the predictions of both the DLOPS method and the MaxEnt PT method, so we added a second layer method trained to correct these errors. This task is also performed by a MaxEnt classifier. Here, we use different features than the ones used in the first MaxEnt classifier. The already predicted labels for all the letters in a word are used to add further (phonetic based) features. The system is then trained to re-label all the letters inside the word based on the initial prediction and the correct label (according to the training data). This is done in order to ensure cohesion at the phonetic level, correcting certain predictions that would be unpronounceable. Thus, when doing error correction we use the following features for a given letter L, having the phonetic transcription P:
• $l_{-i} \ldots L \ldots l_{+i}$, for i = 1, 2: features 1 and 2 in Table 5,
• $l_{-i} \ldots L$, for i = 1..3: features 3 to 5,
• $L \ldots l_{+i}$, for i = 1..3: features 6 to 8,
• $p_{-i} \ldots P \ldots p_{+i}$, for i = 1, 2: features 9 and 10,
• $p_{-i} \ldots P$, for i = 1..3: features 11 to 13,
• $P \ldots p_{+i}$, for i = 1..3: features 14 to 16,
where $l_{-i}$ is the previous i-th letter and $l_{+i}$ is the next i-th letter; $p_{-i}$ is the previous i-th predicted phoneme and $p_{+i}$ is the next i-th predicted phoneme.

Since there are two first layer methods, an error correction classifier must be trained for each of them. Thus, we end up with two error correction models, each trained to correct the systematic errors of one prediction method. This means there are actually four ways to perform phonetic transcription: DLOPS, PTC, DLOPS + ERC and PTC + ERC.

5. Comparison to other methods

In this section we compare our method to other approaches to L2P conversion. Only the results obtained using the same dictionaries and similar evaluation methods were taken into consideration.
We express the performance of the algorithm in terms of word accuracy rate, and we use the first ranking result as the transcription candidate when we calculate the scores (the DLOPS method produces several transcription suggestions, each with an associated confidence score). We do not use n-best score functions or letter error rates because they do not correctly assess whether this phonetic transcription tool can be used in text-to-speech synthesis, where only the first ranking candidate is used for the phonetic transcription of a word.

Table 3 contains the results obtained by our method compared to various other methods (the best results are marked in bold): CART Decision Tree System (Black et al., 1998); 1-1 Align, M-M Align, HMM: one-one alignments, many-many alignments, HMM with local prediction (Jiampojamarn et al., 2007); Constraint Satisfaction Inference (CSIF) (Bosch & Canisius, 2006); minimum error rate training with an A* search decoder (MeR+A*) (Rama et al., 2009); averaged perceptron (Perceptron) and Margin Infused Relaxed Algorithm (MIRA) (Jiampojamarn et al., 2008). The results of the CART method for Romanian are taken from Stan et al. (2011) (the training corpus is identical to the one we used).

We conducted tests on all dictionaries (except the Romanian one) using the datasets on the Pronalsyl website. Each dictionary was 10-folded (divided into 10 sets) and the final score was computed as the average of the 10 scores obtained by testing against each set while training on the other 9.

We have to acknowledge that the score obtained for the CMUdict lexicon is lower than expected. This can be explained by the fact that it contains many non-English words, which are hard to predict because they do not follow the same phonetic transcription rules as the English ones, a fact also noted in Black et al. (1998) and Jiampojamarn et al. (2007).
Such words are practically isolated examples, so the evidence for inferring phonetic transcription rules for them is practically non-existent. On the one hand, if they are found in the test set, there will be no similar examples to learn from in the training data. On the other hand, if they are found in the training set, they will merely be a source of noise for the model. Consequently, it is practically impossible to predict their pronunciation. Bermuda's score for this lexicon is 68%, which is 3% below MIRA's performance and 2% below that of the Perceptron. On the NETtalk lexicon, Bermuda's PTC+ERC method outranks all other methods by 2%. The only method outside Bermuda which was applied to the Romanian lexicon is CART (Stan et al., 2011). For this lexicon, Bermuda's best score is 6% higher than CART's accuracy of 87%. On the CELEX and BRULEX lexicons, the PTC+ERC method again places third, after MIRA and Perceptron.

$$SCORE = \frac{1}{10} \sum_{i=1}^{10} S_i \qquad (6)$$

where $S_i$ is the score obtained by testing on test set $i$ and training on the other 9.

Looking at the performance figures, one might consider MIRA a better L2P system. However, the reader should bear in mind that we did NOT perform any preprocessing on the training sets. The letters and phonemes were automatically aligned with GIZA++ and NO supplementary intervention was conducted on the alignments. The other systems have pre-processing steps which include removing heteronyms, words that have no more than four letters, or functional words. We did not include such steps in the first set of tests, firstly because we aimed at developing a purely data-driven phonetic transcription system and secondly because they are very hard to reproduce identically for an accurate comparison. The above-mentioned pre-processing steps can still be performed, but we do not recommend this practice since it might lead to unreliable results.
For example, removing the words that have fewer than four letters will considerably reduce the evidence needed for predicting the phonetic transcription of short words. We also encourage leaving heteronyms inside the training lexicon, because the classifiers will learn consistent rules from their letter-phoneme pairs. Instead of removing words based on letter counts and duplicate entries, we recommend filtering out all non-English words, as we did in our next experiment.

Table 3 - Experimental results
Columns (Bermuda): DLOPS FP3, PTC, DLOPS FP3+ERC, PTC+ERC
Columns (other methods): Perceptron, MIRA, CART, 1-1 Align, 1-1+CSIF, 1-1 HMM, M-M Align, M-M+HMM, MeR+A*
Rows and values, in extraction order:
CMUdict 57.00 UK BEEP 64.07 67.22 63.60 68.29 71.03 71.99 57.80 60.30 62.90 62.10 65.10 65.60 63.81 72.41 67.96 73.56 NETtalk 53.14 BRULEX 79.17 68.55 59.70 69.19 64.87 67.82 90.99 85.79 91.68 93.89 94.51 87.00 CELEX 79.27 90.17 CELEX 78.11 86.99 92.25 95.13 95.32 86.50 88.20 90.60 90.90 86.71 90.49 84.89 Romanian 85.74 93.29 91.81 91.05 92.84 93.61 89.38 86.60 87.50 87.60 91.10 91.40 90.63 93.34 87

Using the WordNet (WN) lexical ontology (Miller, 1995), we cleaned CMUDict of all the non-compliant English words. For each entry inside CMUDict we used the WN interface to check whether there was a corresponding entry inside WN. All unknown entries were removed from CMUDict. The purpose of such a trial is straightforward: if one wants to train a method for automatic L2S conversion of OOV words in a given language, there is no need to create difficulties by introducing foreign words that do not follow the same phonetic transcription rules as the target language. After removing foreign words and abbreviations, the number of entries in CMUDict was reduced to 46K words.
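The filtering step can be sketched as a lookup-based filter over the lexicon. The paper does not say which WordNet interface was used, so the lookup is passed in as a function and a small stand-in vocabulary replaces WordNet here; with NLTK and its WordNet data installed, one possible lookup would be `lambda w: bool(wordnet.synsets(w))` (an assumption, not the authors' setup). The names `filter_lexicon` and `toy_vocabulary` are hypothetical, and the transcriptions of the foreign entries are deliberately left as placeholders.

```python
def filter_lexicon(entries, is_known_word):
    """Keep only the (word, transcription) pairs whose word passes the lookup."""
    return [(word, trans) for word, trans in entries
            if is_known_word(word.lower())]

# Stand-in for WordNet: a tiny vocabulary of words taken from this paper.
toy_vocabulary = {"abolish", "feared", "experience"}

lexicon = [
    ("abolish", "AH B AA L IH SH"),
    ("feared", "F IH R D"),
    ("aachen", "<transcription omitted>"),        # placeholder, not real data
    ("zawistowski", "<transcription omitted>"),   # placeholder, not real data
]

cleaned = filter_lexicon(lexicon, lambda w: w in toy_vocabulary)
print([word for word, _ in cleaned])
```

Passing the lookup as a parameter keeps the filter independent of any particular lexical resource, so the same code could be reused with a different word list for another target language.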
Some examples of removed entries are:
• Italian: braggiotti, castelli, castelluccio
• German: aachen, abbenhaus, schlender, schlenker
• Polish: zawistowski

Table 4 shows the results obtained by our methods on the cleaned CMUDict lexicon, using the same 10-fold validation methodology. There is a clear improvement over the results previously obtained on the unmodified lexicon.

Table 4 - Experiments on the filtered CMUDict

Method             Result
DLOPS FP3          60.44
DLOPS FP3 + ERC    69.35
PTC                72.33
PTC + ERC          75.45

6. Conclusions

We described the Bermuda system, which implements two methods for data-driven phonetic transcription and a method for error correction. This system can be used within a TTS system for L2P conversion of OOV words, but also for problems like perceptive search and spelling correction. The required training data is freely available on the Internet (downloadable from the Pronalsyl Challenge website, see section 5) for the languages we have tested (English, French, Dutch and German) and it can also be generated from existing resources, if these contain phonetic transcriptions. Also, our tests showed that a cleaner version of CMUDict (obtained using WN) significantly increases the accuracy of the results (from 68 to 77 percent). Furthermore, in the case of CMUDict, we have conducted paired t-tests between the runs without error correction and the runs with error correction. Specifically, we have compared DLOPS FP3 with DLOPS FP3+ERC and PTC with PTC+ERC, and we can report significant increases of the mean word accuracies when using error correction, at a significance level $\alpha$ of much less than 0.0001.

It is important to state that, in contrast to the other existing systems, Bermuda is freely available on-line at the RACAI TOOLS website (footnote 2). We offer a downloadable version and also an on-line test version of Bermuda that has been trained using the internal Romanian lexicon, CMUdict and UK BEEP.
We also offer International Phonetic Alphabet (IPA) transcriptions for both Romanian and English. In the immediate future, we will add French and German to the online version and we will also include a page for our perceptive search tool (based on Google APIs). Moreover, the phonetic transcription module is already used within the Bermuda Voice Synthesizer system, also publicly available at the RACAI Romanian TTS demo page (footnote 3).

The next version of Bermuda will add other current state-of-the-art methods for phonetic transcription to its inventory of already implemented techniques. We are also working on a voting system that will increase the accuracy of the tool. While tweaking some parameters of the DLOPS algorithm (fusion probability function, cut-off factor for the letter-phoneme pairs etc.) we noticed that it can achieve better results on some lexicons. We conclude that always using the same values for these parameters is not effective and that, in order to obtain better results, they should be tuned for each lexicon. Consequently, in the near future we plan to include a minimum error rate training (MERT) option for these parameters. Furthermore, we plan to address the reverse problem of phoneme to grapheme (P2G) conversion.

2 http://nlptools.racai.ro/nlptools/index.php?page=phontrans
3 http://nlptools.racai.ro/nlptools/index.php?page=tts

References

Baayen, R., Piepenbrock, R., and Gulikers, L. (1995). The CELEX lexical database. Linguistic Data Consortium, University of Pennsylvania, Philadelphia.
Bisani, M., and Ney, H. (2002). Investigations on joint-multigram models for grapheme-to-phoneme conversion. Proceedings of the 7th International Conference on Spoken Language Processing, 105-108.
Black, A., Lenzo, K., and Pagel, V. (1998). Issues in building general letter to sound rules. ESCA Speech Synthesis Workshop, Jenolan Caves.
Bosch, A., and Canisius, S. (2006). Improved morpho-phonological sequence processing with constraint satisfaction inference. Proceedings of the Eighth Meeting of the ACL-SIGPHON at HLT-NAACL, 41-49.
CMU (2011). Carnegie Mellon Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
Content, A., Mousty, P., and Radeau, M. (1990). Une base de données lexicales informatisée pour le français écrit et parlé. L'Année Psychologique, 90:551-566.
Demberg, V. (2007). Phonological constraints and morphological preprocessing for grapheme-to-phoneme conversion. Proceedings of ACL-2007.
Divay, M., and Vitale, A. J. (1997). Algorithms for grapheme-phoneme translation for English and French: Applications. Computational Linguistics, 23(4):495-524.
Jiampojamarn, S., Cherry, C., and Kondrak, G. (2008). Joint processing and discriminative training for letter-to-phoneme conversion. Proceedings of ACL-2008: Human Language Technology Conference, Columbus, Ohio, 905-913.
Laurent, A., Deleglise, P., and Meignier, S. (2009). Grapheme to phoneme conversion using an SMT system. Interspeech.
Marchand, Y., and Damper, R. I. (2000). A multistrategy approach to improving pronunciation by analogy. Computational Linguistics, 26(2):195-219.
Och, F. J., and Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51.
Pagel, V., Lenzo, K., and Black, A. (1998). Letter to sound rules for accented lexicon compression. International Conference on Spoken Language Processing, Sydney, Australia.
Rama, T., Singh, A. K., and Kolachina, S. (2009). Modeling Letter-to-Phoneme Conversion as a Phrase Based Statistical Machine Translation Problem with Minimum Error Rate Training. Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP, 124-127, Suntec, Singapore.
Stan, A., Yamagishi, J., King, S., and Aylett, M. (2011). The Romanian Speech Synthesis (RSS) corpus: building a high quality HMM-based speech synthesis system using a high sampling rate. Speech Communication, 53(3):442-450.
Taylor, P. (2005). Hidden Markov Models for grapheme to phoneme conversion. Proceedings of the 9th European Conference on Speech Communication and Technology.

USING FUNCTION WORDS FOR GUIDING THE PREDICTION OF THE ROMANIAN INTONATION

VASILE APOPEI, DOINA JITCA, OTILIA PADURARU
Institute of Computer Science of the Romanian Academy, Iasi Branch, Romania
jdoina@iit.tuiasi.ro

Abstract

This paper presents a method for determining the syntactic markers of a given text, starting from a set of function words. The syntactic markers usually consist of sequences of function words, combined or not with content words. We selected a set of key words and searched for the different lexical contexts of each key word in a large Romanian text corpus. The context including a key word was structured into a set of morpho-lexical descriptions (sequences). The prosodic aspects of the morpho-lexical contexts were analyzed starting from the utterances of the corresponding texts. Each context will be assigned a set of prosodic markers which can be further processed by the prosodic prediction module of a Romanian Text-to-Speech (TtS) system. The syntactic markers are useful for guiding a prediction module for Romanian intonation to correctly generate the prosodic phrasing of the input text and the melodic contour of each phrase.

Keywords: function words, morpho-lexical contents, prosodic marks

1. Introduction

The goal of this paper is to build an inventory of syntactic markers, starting from a Romanian text corpus and from the detection of the function words and their morpho-lexical contexts within a given text. A prosodic module has to assign prosodic markers (focus events, boundary events, etc.) to these syntactic markers in order to generate an adequate F0 contour for an input text.
This preliminary step is useful for improving the prosodic predictor of a Romanian TtS system, developed starting from a previously described functional model of Romanian intonation (Jitca and Apopei 2007, 2009). The functional model-based predictor assigns a functional label to each prosodic word (accentual unit). Consequently, a phrase is described by a sequence of functional labels. The task of the prosodic predictor is to assign a melodic contour to the input text, starting from the prosodic markers deduced from the analysis of the syntactic markers. Each contour has a particular functional description.

Not all focus events of neutral sentences can be predicted based on lexical analysis, because the occurrence of a focus event sometimes has a prosodic reason (Kratzer & Selkirk 2007). According to the authors of that study, the distribution of the major phrase stresses (major focuses) in all-new sentences is determined by the principles underlying the syntax-phonology interface, whereas the distribution of the minor phrase stresses (the rest of the focuses) is apparently a matter for the phonology per se, and is determined by the principles of prosodic structure organization.

In most cases, the syntactic markers can be correlated with the predicate prosodic units (accentual units - AUs, intermediate phrases - ips, intonational phrases - IPs) of an F0 contour. Correctly setting the positions of the predicate units between phonological phrases entails a correct identification of the beginning of the next phrase and of the position of its focused prosodic unit, by taking into account different types of functional prosodic unit structures. Our approach consists in generating rules that allow the identification of various syntactic markers, without the need to resort to the syntactic structure of the text.
The role of the predicate prosodic units within the prosodic hierarchy of an F0 contour is detailed in chapter 2. Several examples of markers, accompanied by different lexical contexts, are presented in chapters 3 and 4.

2. A functional perspective on the prosodic events

Apart from the text semantics, intonation has its own meaning, resulting from its own grammar and several functional categories of prosodic constituents (Selkirk 1995; Schwarzschild 1999). In these papers, the authors analyzed a text from a functionally semantic perspective to obtain an Information Structure (IS) in terms of 'Focus' and 'Given' marks. At the prosody level, 'Focus' is associated with the 'Stress' prosodic category, while 'Given' is assigned the 'Destress' category. However, the analysis of intonational contours reveals 'Focus' constituents without a pitch accent and pitch accents without a focus. For this reason, other authors have not agreed with deriving the expression of 'Focus' and 'Given' constituents directly from their marks (Fery & Vieri 2006). They have suggested that the prosodic events result entirely from the interaction between the constraints governing the prosodic organization of the clause (the prosodic reasons) and the general constraints governing the prosodic expression ('Stress'-'Focus' and 'Destress'-'Given') of the discourse status. In their opinion, the relation between the discourse structure and prosody relies on the ranking of several constraints. Three of them (two resulting from the 'Stress-Focus' association and one from the 'Destress-Given' association) relate the accent or its absence to the discourse structure. The others govern the position of the prosodic prominences resulting from the phrasal stress, head alignment to the intonational phrase (IP) boundaries and head alignment to the phonological phrase boundaries.
Consequently, accent assignment to the heads results from the prosodic constraints, while deviation from this default is imposed, when necessary, by the discourse constraints.

In our intonational model, the prosodic constraints are expressed by functional label sequences (Jitca & Apopei 2009), assigned to different melodic contour types applied at the IP and intermediate phrase (ip) levels. These functional label sequences are further translated into F0 pattern sequences, each pattern having its own prominence. The functional labels of a sequence correspond to the prosodic constituents (usually prosodic words) of a phrase (IP/ip). The main functions are the following:
• PUSH and POP - correspond to the delimitative units of a phrase. In descending contours, the PUSH accentual units mark the beginning of a phrase, while the POP accentual units mark its end. In neutral intonation, a PUSH unit is more prominent than a POP one.
• LINK - corresponds to a prosodic unit endowed with a predication function at the prosodic level. It links the initial AU/AU group to the final AU/AU group within an intermediate or intonational phrase. At the morphological level, the 'Link' unit may correspond to a verbal constituent or not, while the predicate units frequently correspond to adverbial and prepositional constituents or to nouns derived from verbs.
• FOCUS (F) - corresponds to a prominent prosodic unit with a target tone reaching the maximal pitch level in an affirmative statement.

A prosodic constituent of a phrase may cumulate two functions. For example, in neutral intonational contours, when the target tone of a PUSH unit reaches the top level of the tonal space, a PUSH+FOCUS unit is generated. In contrastive focus intonational contours, when a POP unit has a target tone reaching the top level of the tonal space, a POP+FOCUS unit is generated.
Therefore, the functional analysis of an intonational contour has led to different melodic contours, described by sequences of functional accentual units: PUSH - LINK+FOCUS - POP, (PUSH+FOCUS) - POP, PUSH - (POP+FOCUS), etc. In our research, we have examined how certain functional prosodic units can be correlated with certain morpho-lexical constituents to generate rules for the prosody prediction of an input text. For this study we have limited the analysis to a set of function words and to their contexts, extracted by searching for them in different text corpora.

3. Using the function words by the prosodic prediction module

In what follows, we shall analyze the intonational contours of a short Romanian text, to illustrate how the function words predict the prosodic events of the intonational contour corresponding to the text.

Example: Dificultatea inerenta in cadrul acestor competitii apare deoarece activitatea evaluatorilor nu este una absolut cuantificabila. (The inherent difficulty within these competitions emerges because the assessors' work is not absolutely quantifiable.)

The lexical analysis of the text led to the following lexical cues: in cadrul (within), acestor (these), deoarece (because), nu (not), and absolut (absolutely).

In this paper, the term 'lexical event' will refer to the occurrence of a function word. A 'morpho-syntactic event' will refer to the occurrence of a sequence of words with a particular morphological function sequence. A model of building up rules to predict prosodic markers is presented in Table 1. The symbols in this table have the following meaning:
• NG = nominal group;
• V = verb;
• N = noun;
• VAux = auxiliary verb;
• P_mark = predicate marker;
• F_mark = focus marker;
• VA_mark = auxiliary verb marker;
• Bi_mark = break index i, i = 2, 3 or 4. In Table 1, only B2_mark and B4_mark are present.
The first four rows in Table 1 present rules based on lexical events. Here, 'in cadrul' (within), 'deoarece' (because), 'nu' (not), and 'absolut' (absolutely) are function words. For example, the rule in the third row has the following meaning: if a word sequence composed of a nominal group containing more than two words, followed by the word 'nu' (not), followed by a verbal group is detected, then the nominal group receives (<-) a break index 3 mark and 'nu' (not) receives a focus marker. The last five rows present rules based on morpho-syntactic events. An example of a morpho-syntactic event is the occurrence of the verb 'apare' (emerges) after a nominal group containing more than two words (row 5 in Table 1). In this case, the nominal group will end in a B4_mark.

Table 1: A model of building up rules to predict prosodic markers
(Rule: if 'sequence' is detected then 'prosodic markers' are set)
Event | Event type | Sequence | Prosodic markers
in+NG | lexical | in cadrul + {NG} | in cadrul <- P_mark
deoarece | lexical | {V} + deoarece + {NG} | V <- F_mark; deoarece <- P_mark
nu | lexical | {NG>2 words} + nu + {VG} | {NG>2 words} <- B3_mark; nu <- F_mark
absolute comparison degrees | lexical | absolut + {adjective} | absolut <- P_mark
{NG>2 words}+V+deoarece | morpho-syntactic | {NG>2 words} + {VG} | {NG>2 words} <- B4_mark
dificultatea inerenta | morpho-syntactic | {N} + {adjective} | B2_mark
acestor competitii | morpho-syntactic | Determinant + N | B2_mark
este | morpho-syntactic | {VAux} | VA_mark
point | lexical | word + . | {word+.} <- B4_mark

After detecting all lexical and morpho-lexical events, the predictor maps the input text at the prosodic level. As a result, the words will be translated into accentual units. In the selected example, each word is assigned an accentual unit, except for the clitic 'in', which forms an accentual unit together with 'cadrul'. For each event, the prediction module is endowed with one or more predefined rules, used to check its morpho-syntactic context.
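A rule of this kind can be sketched in a few lines of Python. The sketch below encodes only the 'nu' rule from the third row of Table 1, and it assumes that the text has already been split into chunks (nominal groups, verbal groups, single words) by some earlier step. The `Chunk` class, the chunk labels and the function name are hypothetical and are not taken from the prediction module itself; the three-word nominal group in the usage note is likewise an invented illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A hypothetical pre-chunked unit: 'NG', 'VG', or 'W' (single word)."""
    kind: str
    words: list
    markers: list = field(default_factory=list)


def apply_nu_rule(chunks):
    """Row 3 of Table 1: {NG>2 words} + nu + {VG}.

    When the sequence is detected, the nominal group receives B3_mark
    and the word 'nu' receives F_mark. Chunks are modified in place.
    """
    for i in range(len(chunks) - 2):
        ng, nu, vg = chunks[i], chunks[i + 1], chunks[i + 2]
        if (ng.kind == "NG" and len(ng.words) > 2
                and nu.kind == "W" and nu.words == ["nu"]
                and vg.kind == "VG"):
            ng.markers.append("B3_mark")
            nu.markers.append("F_mark")
    return chunks
```

Calling `apply_nu_rule` on a three-word nominal group followed by 'nu' and a verbal group sets both markers, while a two-word nominal group leaves the chunks unmarked under this strict reading of the "more than two words" condition.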
The existence of at least two rules means that the contexts have different mappings at the prosodic level. When processing a rule, the predictor uses one or more prosodic markers which characterize the mapping of the input text at the prosodic level. There are three types of prosodic markers:
• markers for FOCUS prosodic units. Such a marker corresponds to a word in the context of the rule being processed;
• markers for LINK prosodic units. Such a marker also corresponds to a word in the context of the rule being processed;
• markers for the end boundary of an intonational phrase. They are similar to the Break Indices of the ToBI annotation system:
  - B4_mark is used for an IP boundary, corresponding to a Break Index 4;
  - B3_mark is used for an ip boundary, corresponding to a Break Index 3;
  - B2_mark is used for a phonological group boundary, corresponding to a Break Index 2.

Using the prosodic markers deduced after processing the rules in Table 1, the prediction module has generated the following phrasing for the selected text:

{[(Dificultatea inerenta) in cadrul (acestor competitii)]} {[apare deoarece (activitatea evaluatorilor)][nu este una absolut cuantificabila]}.

Here, the IPs are demarcated by '{ }', the ips by '[ ]', and the minor phrases by '( )'. The words selected by the predictor for predicate (link) intonation appear in underline, while the focused words appear in bold. The prosody prediction module (Jitca & Apopei 2011) has been designed to use these prosodic markers during the phrasing process and also when selecting a melodic contour for each phrase from the utterance tree hierarchy.

4. Building the set of lexical events starting from function words

In this section, we shall briefly present the results of an analysis of the occurrence of a set of function words and of their accompanying contexts in a selected Romanian text corpus. This corpus consists of the text of George Orwell's novel "1984" and has 108,000 words.
The set of function words was chosen taking into account the discourse markers analyzed by Teodorescu (2005), the statistical structure of words in a literary Romanian corpus analyzed by Vlad et al. (2011) and our own remarks concerning the implications of certain function words on prosody. The search for the selected function words and their accompanying contexts within the text corpus was performed with our own Visual C++ program. The program outputs the number of occurrences of each word in the input list and a set of lists containing the contexts of the searched words.

The selected set of function words contains the following words: acest (this), adica (that is, i.e., I mean), asadar (therefore), astfel (thus), atunci (then, at that time, in this case/situation, in these circumstances; atunci cand = when), cand (when), care (which, who), caci (because), carei (whose), catre (to, toward), chiar (even), cate (how many), cum (how), daca (if), deci (therefore), deoarece (since, because, as), desi (even if, even though), doar (only, just), dupa (after, following), inca (yet, still, even; inca o data = one more time, again), incat (that; astfel incat = so that), in (in), insa (but), insasi (herself, itself), intotdeauna (always), intrucatva (somewhat), la (to, at), nici (neither, nor, even), nu (no, not), numai (only), parca (I think/thought; de parca = as if), pentru (for), tocmai (just, precisely, exactly, very, right), tot (all, everything), totusi (however, but).

Fig. 1 depicts the number of occurrences of the selected function words within the text corpus used in the present study.
This figure shows: a large number of statements containing prepositions (in, la), the relative pronoun care and negation words (nici, nu); a high rate of occurrence of the indefinite pronoun/adjective tot, in all its forms (tot, toata, toti, toate); a relatively high rate of occurrence of particular prepositions (pentru, dupa) and adverbs (parca, cand, atunci).

[Fig. 1: Number of occurrences of the selected function words within the text corpus]

[...] i s-a parut curios atunci a fost faptul ca in vis vorbele acestea [...] ([...] to him at that time was the fact that these words, in his dream...)
[...] ca le vedea atunci in ochii mari ai mamei si ai surorii [...] (... that he saw them then [...])
Winston si-a dat seama atunci ca batranului tocmai i se intamplase cine stie-ce [...] (Winston realized then that something terrible happened to the old man.)
[...] (... becomes truth.)
Daca nu poate, atunci macar sa-l deformeze sau sa-l manjeasca. (If not, then at least to deform or mar it...)
[...] chiar si atunci ar fi putut suporta sa traiasca langa ea, [...] (... even then would bear to live with her.)
Lui Winston i-a sarit atunci inima din loc. (Winston's heart jumped into his throat then.)

The contexts accompanying a function word are elicited by the computer program and are then further processed in two stages. During the first stage, the program detects all contiguous sequences (contexts) of the function words. In the second stage, these sequences are manually extended with new words, to build up meaningful phrases, ready for a subsequent utterance. Table 3 presents the list of contiguous contexts including the function word 'atunci'. These contexts will be analyzed from a prosodic point of view in order to assign a set of rules containing prosodic markers to the corresponding function word. The utterances of the sentences associated with the selected contexts allow us to build up rules for setting adequate prosodic markers, which will further be used by the prediction module for phrasing and melodic contour selection.
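The search step described above (counting the occurrences of each function word and listing its contexts) can be illustrated with a short Python sketch. It is not the Visual C++ program used in the study: the function name, the tokenization by a simple regular expression and the fixed window of neighbouring tokens are all assumptions made for illustration, and the manual second stage is not covered.

```python
import re
from collections import Counter, defaultdict


def find_contexts(text, function_words, window=1):
    """Count each function word and collect its left/right neighbours.

    `window` is the number of tokens kept on each side (an assumption;
    the original program detects contiguous function-word sequences).
    """
    tokens = re.findall(r"\w+", text.lower())
    targets = set(function_words)
    counts = Counter()
    contexts = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok in targets:
            counts[tok] += 1
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts[tok].append((left, right))
    return counts, contexts


sample = ("Daca nu poate, atunci macar sa-l deformeze. "
          "Chiar si atunci ar fi putut.")
counts, contexts = find_contexts(sample, ["atunci", "nu"])
```

With the two sample sentences, `counts` records two occurrences of 'atunci' and one of 'nu', and `contexts["atunci"]` holds one (left, right) pair per occurrence, which is the raw material for a table such as Table 3.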
Table 3: The contiguous contexts of function words containing the word 'atunci'
Left context | Key word | Right context
[de] abia ca [si] [ceva] care - ca chiar [si] [iar] daca dar [si] insa de parca inseamna deci cand decat atunci numai desi [pe] aia doar acel; acea numai abia [de] pe incoace; incolo pentru ca parca nici [niciodata; ceva care; in care] pana inca - fiindca si [tot] -- imediat; exact; precis; tocmai; VG cum atunci ca inca

Table 4 presents the set of meaningful phrases built up after the second stage of processing the contexts of the word 'atunci'. Some of them have been uttered and are ready for prosody analysis at the IP/ip level, using the functions presented in Section 2.

Table 4: Excerpt of meaningful sentences selected for further utterance and prosodic analysis
Word | Phrases including morpho-lexical contexts
atunci | Este imposibil sa intri acolo altfel decat cu treburi oficiale, si chiar si atunci patrunzi numai printr-un labirint de retele de sarma ghimpata. (It is impossible to enter there, except on official business, and even then, you can enter only through a maze of wire networks.)
atunci | Lucrul ramane valabil si atunci cand acelasi eveniment trebuie modificat de mai multe ori in cursul aceluiasi an. (This thing remains valid even when the same event must be changed several times during the same year.)
atunci | Parca vede si acum pretioasa bucatica de ciocolata care pe atunci inca se mai masura in uncii. (Even now, he sees in his mind's eye the precious piece of chocolate which was still measured in ounces at that time.)
atunci | Julia nu pune la indoiala tezele partidului decat atunci cand ii afecteaza propria ei viata intr-un fel sau altul. (Julia does not question the party's theses except when these affect her own life in one way or another.)
atunci | Winston si-a dat seama atunci ca batranului tocmai i se intamplase cine stie-ce lucru cumplit. (Winston realized then that something terrible happened to the old man.)
atunci | Daca toate documentele povestesc aceeasi gogorita, atunci minciuna se transfera in istorie si devine adevar. (If all documents tell about the same bugbear, then the lie passes into history and becomes truth.)
atunci | Idealul celor de jos, atunci cand se intampla ca acestia sa aiba vreun scop in viata, este desfiintarea tuturor diferentelor intre oameni coplesiti de greutatile vietii. (Lower class people's ideal, when it happens that they have a purpose in life, is the abolition of all differences between people overwhelmed by the hardships of life.)
atunci | Din moment ce toate aceste bunuri nu mai constituiau proprietate privata, atunci insemna ca formau proprietate publica. (Since all these goods have no longer been private property, it means that they turned into public property.)
atunci | Vede armata eurasiana navalind peste frontiera pana atunci neatinsa si scurgandu-se spre sudul Africii. (He sees the Eurasian army rushing across the hitherto untouched border, and running toward South Africa.)

5. Conclusions

In this paper, we have proposed a method by which the prosody prediction module uses function words and their morpho-lexical contexts to generate prosodic markers. These markers will be further used during the phrasing process and when selecting a melodic contour for each phrase. The analysis of the rate of occurrence of the function words presented in Section 4 has allowed us to find the most frequently encountered contexts. The sentences in the selected text corpus including these contexts have been elicited for further utterance. The prosodic analysis of the parallel text-speech corpus has led to finding prosodic markers which will be assigned to morpho-lexical contexts.
Acknowledgments: This study has been conducted within the research program of the Institute of Computer Science of the Romanian Academy.

References

Fery C., Vieri S. L. (2006). Focus projection and prosodic prominence in nested foci. Language 82, 131-150.
Jitca D., Apopei V. (2007). Corpus de voce pentru limba romana adnotat cu etichete functionale la nivelul unitatilor de accentuare, Lucrarile atelierului "Resurse lingvistice si instrumente pentru prelucrarea limbii romane", Iasi, 31-39.
Jitca D., Apopei V., Jitca M. (2009). The F0 contour Modelling as Functional Accentual Unit Sequences, International Journal of Speech Technology, 12:(2-3), 75-82.
Jitca D., Apopei V. (2011). An Intonation Prediction Module for Romanian TTS System, as a Prosodic Tree Generator, SPED-2011, IEEE Conference Publications Program, IEEEXplore Digital Library.
Kratzer A., Selkirk E. (2007). Phase theory and prosodic spellout: The case of verbs. The Linguistic Review 24, Special issue on Prosodic Phrasing, (Sonia Frota and Pilar Prieto, eds.), 93-135.
Schwarzschild R. (1999). GIVENness, AVOIDF and Other Constraints on the Placement of Accent, Natural Language Semantics 7, 141-177.
Selkirk E. (1995). Sentence Prosody: Intonation, Stress and Phrasing. Handbook of Phonological Theory, (John Goldsmith ed.), Cambridge, MA: Blackwell, 550-569.
Teodorescu H.N. (2005). A proposed Theory in Prosody Generation and Perception: The Multidimensional Contextual Integration Principle of Prosody, Trends in Speech Technology, Editura Academiei Romane, 109-118.
Vlad A., Mitrea A., Ciuca S., Luca A. (2011). A study on the statistical structure of words and of word digrams in a literary Romanian corpus, SPED-2011, IEEE Conference Publications Program, IEEEXplore Digital Library.

MAXIMUM ENTROPY BASED MACHINE TRANSLITERATION. APPLICATIONS AND RESULTS

ADRIAN ZAFIU (1), TIBERIU BOROS (2)
(1) University of Pitesti, Electronics, Communications and Computers Department, Pitesti, Romania
(2) Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy, Bucharest, Romania
adrian.zafiu@upit.ro, tibi@racai.ro

ABSTRACT

Transliteration has been previously used in the field of Natural Language Processing (NLP), with emphasis on machine translation (MT) between languages that are either incompatible at the phonetic level or employ very different alphabet systems. In this article we propose a new statistical method for transliteration and we discuss the possibility of using transliteration for two new tasks, besides MT. The first task refers to multilingual search based on the phonetic similarity between words (what we call perception-based search) and the second task is linked to text-to-speech (TTS) synthesis in a multilingual environment. The method that we propose for transliteration is similar to direct orthographic mapping in the sense that it does not require any intermediate phonetic level. Our experiments currently focus on the following languages: English, Bulgarian, Romanian and French. For the above mentioned languages, we seek to answer two questions: "can transliteration be achieved based on limited lexical context classification?" and "what other applications besides MT can benefit from transliteration?".

Keywords: machine translation, maximum entropy optimization, transliteration, text-to-speech synthesis

1. Introduction

Machine translation (MT) systems are often faced with the task of handling words that do not have a (known) corresponding translation (e.g. proper nouns, some technical terms, etc.). When the two languages share similar orthographic inventories, it is common practice to leave such words as they appear in the original text.
Such a resolution is not possible when the two languages are highly incompatible at the orthographic and phonetic levels (for example, the English sounds 'L' and 'R' collapse into a single sound in Japanese). A solution to this task is to convert the original words, using a set of mappings from one orthographic system to another, in such a way that the resulting word would have a similar phonetic representation. Transliteration is the process used by the component responsible for this type of operation. By definition, transliteration means converting letter by letter from one writing system to another, and transcription is the process of phonetically mapping words between languages. However, most transliteration systems work by mapping the letters of a word to similarly sounding letters in the target language. So, to be consistent with other research papers, we will use the term transliteration to refer to the task of mapping the letters of a word in the source language to letters of a "pseudo-word" in the target language so that the two words have similar pronunciations.

In the past, several methods for transliterating between two languages were introduced, mainly focused on automatic transliteration between English, Chinese, Japanese, Korean and Arabic. In (Knight & Graehl, 1997), finite state transducers were used to transliterate between Japanese and English. Their method was later adapted in (Stalls and Knight, 1998) for bidirectional transliteration between English and Arabic. Similar methods for transliteration were presented in (Jung et al., 2000), (Meng et al., 2001), (Virga & Khudanpur, 2003). In their work, (Haizhou et al, 2004) classify the above mentioned methods as phonetic approaches to transliteration. They propose a new technique that focuses on direct orthographic mapping (DOM). Their method is also referred to as n-gram based transliteration.
In this paper we seek to answer two questions: "can transliteration be achieved based on limited lexical context classification?" and "what other applications besides MT can benefit from transliteration?". To answer the first question, we propose a data-driven method for transliteration based on a MaxEnt classifier (see section 3). The proposed method performs transliteration at the orthographic level without using an intermediate phonetic level, and it only requires a lexicon composed of original words in the source language with their corresponding transliterations in the target language. (Haizhou et al, 2004) introduce a comparison between transliterations obtained with their method and those obtained with an ID3 algorithm for limited context classification. However, it has been shown that this algorithm (ID3) is outranked by other classifiers when applied to letter-to-sound (LTS) conversion (Black et al, 1998), (Jiampojamarn et al., 2008), (Pagel et al. 1998), (Bisani & Ney, 2002), (Marchand & Damper, 2000), (Demberg, 2007). Given the similarities between LTS and transliteration, it is likely that other classifiers could perform better than ID3. For the latter question, we set out to see if transliteration can improve TTS synthesis (first application, section 4.1) and we propose a multilingual phonetic perception based search technique (second application, section 4.2) that can highly improve user experience with search engines, travel assistants and navigation systems.

2. Building the training lexicons

This research was initially focused on improving the performance of a TTS system when handling out-of-vocabulary (OOV) words. For objective reasons, we focused our attention on transliteration between English, Bulgarian, French and Romanian. When we ran our TTS experiments, we noticed a problem with some OOV words belonging to the foreign word class. Most of these words originated from English and French.
We added Bulgarian to our list because it uses a different alphabet from the others. Our work was focused on minimizing the impediments posed by foreign OOV words in Romanian TTS synthesis and, in our case, we had to handle words coming from the above mentioned languages.

To our knowledge, there are no freely available transliteration lexicons between any of these languages. For this reason, we set out to create our own corpora, which will be made publicly available for research. The general method for building transliteration lexicons, as presented in (Knight & Graehl, 1997), is:
1. Choose a set of representative words for the source language and obtain their phonetic transcriptions manually or automatically, using rules specific to the source language.
2. Adjust the phonetic transcriptions using hand-written rules that map from the phonetic inventory of the source language to the phonetic inventory of the target language.
3. Manually or automatically map back to orthography using rules specific to the target language.

The first transliteration lexicon we created was an English to Romanian corpus. We chose the CMUDict as a starting point in our development and we proceeded using the phonetic transcriptions provided inside. However, the CMUDict contains many foreign words adapted to English, such as: Italian: braggiotti, castelli, castelluccio; German: aachen, abbenhaus, schlender, schlenker; Polish: zawistowski. Because we aimed (in this case) at learning transliteration rules only from native English words to Romanian, we filtered out all foreign words and proper names, leaving 40,606 entries in the CMUDict. The remaining data was converted to Romanian transliterations using a set of hand-written rules with post-validation (see table 1 for examples).

Table 1: English to Romanian transliteration examples
En phoneme | Example word | English phonetic transcription | Romanian transliteration
AA | odd | AA D | ad
AE | at | AE T | et
AH | hut | HH AH T | hat
AO | ought | AO T | ot
AW | cow | K AW | cau
AY | hide | HH AY D | haid
B | be | B IY | bi

[...] 2) (without requiring [...] languages).

Table 2: Bulgarian to Romanian transliteration examples
BG orthography | Word | RO orthography
б | Банско | b
д | Видин | d
щ | Свищов | s
я | Смолян | ia
х | Хасково | h

The French to Romanian transliteration corpus was created similarly to the English to Romanian lexicon (see table 3 for some example rules). In this case, we used the Brulex pronunciation lexicon as a starting point.

Table 3: Examples of French to Romanian transliteration corpus
FR phoneme | Romanian orthography

3. Automatic transliteration using a limited lexical context classifier

The task of transliteration can be formulated as finding a set of rules which, applied to an input sequence of orthographic symbols/characters specific to the source language, generates a set of symbols/characters specific to the target language. The goal is to maximize the value of a similarity function between the two phonetic representations of the original and the processed words. There are two types of methods used for transliteration:
• Type 1: Phonetic based methods require three sets of rules to be applied, for orthography-to-sound conversion (for the source language), phonetic adaptation (between source and target languages) and sound-to-orthography conversion (for the target language);
• Type 2: Direct orthographic methods do not require such knowledge, as the idea is to infer rules for direct conversion at the orthographic level.

All sub-tasks of the first type of methods are prone to errors when applied separately, and the overall error rate is higher than that of the second type of methods, hence our choice to base our research on direct orthographic mapping.
Transliteration does not have a one-to-one (bijective) correspondence between the source and target sequences at the orthographic level. One orthographic symbol can spawn two or more orthographic symbols in the target language or can even have a void (NULL character) mapping. This means that before proceeding with the training process, transliteration requires alignments between the letters of the source and target words. These alignments can be obtained using the Expectation Maximization (EM) algorithm (Hartley, 1958), (Dempster et al., 1977).

We based our method on a limited lexical context Maximum Entropy (MaxEnt) classifier. MaxEnt classifiers have been previously used in natural language processing (NLP) for part-of-speech (POS) tagging (Ratnaparkhi, 1996), sentence splitting (Reynar and Ratnaparkhi, 1997), (Agarwal, 2005) etc. Maximum Entropy builds a model that maximizes entropy by assuming a uniform distribution for unseen data (Berger et al., 1996).

Treating transliteration as a classification task means that for each orthographic symbol in the input sequence the system has to predict a label using features extracted from a limited lexical context window. The label represents an orthographic symbol, a group of symbols or an empty sequence (NULL character) in the target language. The concatenation of the predicted labels represents the transliteration of the input sequence into the target language.
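The per-symbol labelling described above can be sketched as a left-to-right loop in Python. This is only an illustration of the control flow: the trained MaxEnt model is replaced by a toy lookup function, and the names `context_window`, `transliterate` and `toy_predict`, the padding symbol and the window of two symbols on each side are assumptions made for the example. The sample word follows the English to Romanian pair hide / haid.

```python
NULL = ""   # empty target sequence (the NULL label)
PAD = "#"   # hypothetical padding symbol for word boundaries


def context_window(word, i, size=2):
    """Return the symbols from i-size to i+size, padded at the edges."""
    padded = PAD * size + word + PAD * size
    return padded[i:i + 2 * size + 1]


def transliterate(word, predict):
    """Label each source symbol in turn and concatenate the labels.

    `predict(window, previous_label)` stands in for the classifier.
    """
    labels = []
    previous = PAD
    for i in range(len(word)):
        label = predict(context_window(word, i), previous)
        labels.append(label)
        previous = label
    return "".join(labels)


# Toy stand-in for a trained model: it looks only at the current symbol.
TOY_TABLE = {"h": "h", "i": "ai", "d": "d", "e": NULL}


def toy_predict(window, previous_label):
    current = window[len(window) // 2]
    return TOY_TABLE.get(current, current)


result = transliterate("hide", toy_predict)
```

Here one source symbol maps to two target symbols ('i' to 'ai') and another maps to the NULL label ('e'), so the four predicted labels concatenate to "haid". A real classifier would also use the neighbouring symbols and the previous label, which the toy function ignores.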
Using $s$ to denote the current orthographic symbol, $s_i$ to denote the orthographic symbol at distance $i$ from the current symbol and $l_{-1}$ to represent the previously assigned label (for the previous symbol), we have the following features:
• $s_{-1}, s$ - current symbol plus previous symbol;
• $s_{-2}, s_{-1}, s$ - current symbol plus the previous two symbols;
• $s, s_{+1}$ - current symbol plus the following symbol;
• $s, s_{+1}, s_{+2}$ - current symbol plus the following two symbols;
• $s_{-1}, s, s_{+1}$ - current symbol plus the previous and following symbols (the identity feature);
• $l_{-1}$ - previously assigned label (for output cohesion).

This set of features was chosen through a trial and error process. The current symbol alone was not informative enough compared to the combination of the current symbol plus the previous and following symbols, thus we named this composite feature the identity feature. Increasing the current context window length did not significantly improve the prediction accuracy of the model and in some cases led to overtraining. Ignoring the previously assigned label had a large negative impact on the accuracy, and adding more labels to the history did not yield statistically relevant higher accuracy rates.

3.1. Evaluation

We evaluated our transliteration accuracy using a 10-fold cross-validation methodology. Each training lexicon was divided into 10 equal subsets and we measured our system's accuracy by averaging the prediction accuracy on each subset while training on the other nine. We present the results in terms of Word Accuracy Rates (WAR), defined as the number of fully correct transliterated words divided by the total number of words (see table 4).

Table 4: Current transliteration results in terms of word accuracy rates
Source \ Target | English | Romanian | Bulgarian | French
English | - | 78.15% | 77.12% | N/A
Romanian | 43.18% | - | 97.34% | 92.08%
Bulgarian | N/A | 97.21% | - | N/A
French | N/A | 56.45% | N/A | -

4. Application of transliteration

Recent focus on improving accessibility and general access to information through constant advances in the field of human-computer interaction has led to the wide spread of spoken language processing technologies applied in computers and micro-devices, with an increased interest in speech synthesis, speech recognition and improved text accessibility. In this section, we focus on the mainstream task of TTS synthesis and on how transliteration can help improve the quality of speech synthesis from arbitrary texts. We then introduce an application for transliteration, which improves user experience with search engines, GPS systems and any other type of travel assistant. To our knowledge, the latter application has not been suggested by other authors.

4.1. Transliteration and text-to-speech synthesis

We start by pointing out that text-to-speech synthesis has to be able to synthesize voice from any arbitrary or unrestricted text. This involves a series of pre-processing steps such as: converting numbers, dates and formulas to their spoken form; phonetically transcribing words; syllabification; lexical stress prediction etc. Such tasks can normally be performed using large lexicons of already processed words, but there are always exceptions, in the form of out-of-vocabulary (OOV) words, which have to be treated automatically. Over the years, a number of machine-learning (ML) methods have been proposed to solve each of the above stated subtasks, with the main focus on OOV words that are also native language words. It is expected that there are also foreign words that fall into the class of OOV.
Such situations are fatal for TTS synthesis, as leaving such words unchanged and applying the same rules for phonetic transcription or syllabification as those specific to the native language of the TTS system would yield faulty results. We differentiate two ways to handle such words. The first method is to perform the above mentioned sub-tasks using a custom set of rules adapted to the foreign language from which the word originates. However, having different sets of rules for more than the native language is challenging. The second method (that we propose here) is to use transliteration on these foreign words and to convert them to pseudo-native words. This facilitates using a single package of native rules for the tasks of phonetic transcription, syllabification and lexical stress. The difference between the two methods (figure 1) is that the first method applies phonetic transcription with syllabification and lexical stress rules for the foreign language(s), followed by an adaptation at the phonetic level between the two languages, while the second method uses transliteration to produce a pseudo-native word and then uses the native rule sets to reach the final goal.

[Figure 2: Foreign word handling in TTS synthesis. First path: foreign word -> phonetic transcription, syllabification and lexical stress with foreign language rules -> foreign language phonetics -> phonetic adaptation -> native phonemes. Second path: foreign word -> transliteration -> pseudo-word -> phonetic transcription, syllabification and lexical stress with native language rules -> native phonemes.]

There are several reasons for using transliteration in TTS synthesis. First, we argue that constructing or obtaining lexicons for the TTS tasks of syllabification, lexical stress and phonetic transcription is more demanding than building transliteration lexicons. Secondly, the ML methods used for generating the prosody of the TTS system are trained using native utterances.
It is likely that using custom rules for syllabification and lexical stress on foreign words would generate previously unseen data, which impedes the correct functioning of such methods. Applying syllabification and lexical stress to the pseudo-words obtained by transliteration is likely to produce pronunciations different from those of a native speaker of the foreign language. In practice, however, this does not hinder understanding, since many non-native speakers would pronounce such foreign words similarly, misplacing the lexical stress and making adaptations at the phonetic level.

4.2. Detecting which words require transliteration in TTS

A problem common to both approaches to foreign word adaptation for TTS synthesis is detecting when an OOV word is a foreign word and, if so, what its source language is. A partial solution is to use a lookup table of word forms for each foreign language for which the system has transliteration rules. Such a list is easier to obtain than a list of fully processed words, and it can be built by crawling documents written in the respective languages. Any OOV word found by the TTS system is checked against these precompiled lists; once the word is found in the lexicon of some language, it can be transliterated to a native pseudo-word using the rule set specific to that source language. It is also important to keep a separate word-form list for the native language and to check that the word is not in this list, since some words have identical orthographies in several languages (e.g. "minus" is written identically in Romanian and English). This latter list is important for determining when not to apply transliteration. There are, however, cases where a word or a group of words does not appear in any lexicon (as may be the case for uncommon proper nouns).
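The lexicon lookup described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `detect_source_language`, the language codes and the tiny word lists are hypothetical stand-ins for lists that would be built by crawling documents in each language.

```python
from typing import Dict, Optional, Set

# Hypothetical toy word-form lists (assumptions for illustration only).
NATIVE_WORDS: Set[str] = {"minus", "casa", "oras"}
FOREIGN_WORDS: Dict[str, Set[str]] = {
    "en": {"minus", "weekend", "backup"},
    "fr": {"bonjour", "boulevard"},
}


def detect_source_language(
    word: str,
    native: Set[str] = NATIVE_WORDS,
    foreign: Dict[str, Set[str]] = FOREIGN_WORDS,
) -> Optional[str]:
    """Return the code of the foreign language whose word list contains
    `word`, or None when the lookup gives no reason to transliterate."""
    w = word.lower()
    # A word present in the native list is never transliterated, even if
    # it is spelled identically in a foreign language (e.g. "minus").
    if w in native:
        return None
    # Otherwise, the first foreign list containing the word decides which
    # source-language rule set would be used for transliteration.
    for lang, words in foreign.items():
        if w in words:
            return lang
    # Not found in any list: the lookup alone cannot decide.
    return None
```

A call such as `detect_source_language("weekend")` would select the English rule set in this toy setup, while `detect_source_language("minus")` returns `None` because the native list takes precedence.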
For words that appear in no lexicon, the decision can be based on orthography: because some orthographic symbols (especially those with diacritics) or groups of symbols are uncommon in certain languages, the assumption that a word should be transliterated can arise from testing for such occurrences. For example, characters such as "y" or groups like "ck" are highly uncommon in Romanian. The entire process is summarized in Figure 3.

Figure 3: OOV word handling in TTS synthesis (flowchart: the word is first checked against the native lexicon; if it is found, normal processing for TTS proceeds; otherwise the foreign lexicons are checked to see if they contain the word, leading to transliteration and a processed word).

4.3. Perception based search

As explained earlier in this article, perception based search is a method for finding persons, street names, cities, etc. based on the phonetic perception of what a word "sounds like". Using the proposed method, we obtain possible spellings for a word written "as heard" in the native language of the user. As an example, suppose that we know nothing about a city except that it sounds like "YI AE N T S YI AW". There is no other information about either the country or the language in which the name is written, and therefore no indication of which orthography to use in order to find out more about this location. Perception based search makes it possible to obtain the exact spelling (in the source language) of this location just by typing the word in one's native language. A Romanian native speaker would input the word as "ienjiau", an English native speaker would enter "ientsiaw" and a Bulgarian native speaker would type "fl6mifly". The answer would be "?si£fl", a location situated in the north-east of China, in Hebei province. To our knowledge, the closest method to the one proposed here is described in (Krishnan et al., 2009), (WO Patent WO/2009/005,961, 2009).
As we will show, there is an important difference between using phonetic representations (their method) and mapping directly at the orthographic level:
1. A non-native speaker's perception of what a word sounds like is influenced by the phonetic inventory of their native language, and it is not 100% accurate because not all languages share the same inventory;
2. Conversion rules from orthography to sounds and back are complex, and there are cases where no combination of orthographic symbols would generate the perceived phonetic sequence;
3. Multiple spellings can generate the same phonetic sequence (homophones);
4. As mentioned by (Knight & Graehl, 1997), back transliteration does not share the same flexibility as forward transliteration.
All of the above reduce the reliability of using phonetic representations to measure the similarity between two words originating from different languages. To overcome these problems we propose a different strategy: given an input string in a native language, we transliterate all known locations, names, etc. from their source languages into the speaker's native language and directly compare the resulting strings with the input, using a function such as the Levenshtein distance. Each language has unique characteristics that dictate the phonetic inventory, the phonetic transcription rules, the way native speakers perceive words from other languages and the way they would spell these words (a process accompanied by information loss). The direction of transliteration should be chosen according to the highest accuracy obtained when transliterating OOV words.
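The comparison strategy described above can be sketched as follows. This is a hedged illustration rather than the system itself: `perception_search`, the `transliterate` callback and the toy English to Romanian pseudo-word table are hypothetical, and the table entries are made-up stand-ins for the output of a trained transliteration model.

```python
from typing import Callable, Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def perception_search(
    query: str,
    names: List[str],
    transliterate: Callable[[str], str],
    top_k: int = 3,
) -> List[Tuple[str, int]]:
    """Rank known names by the distance between the user's 'as heard'
    spelling and each name transliterated into the user's language."""
    scored = [
        (name, levenshtein(query.lower(), transliterate(name).lower()))
        for name in names
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Made-up pseudo-words standing in for a trained forward transliterator.
TOY_EN_TO_RO: Dict[str, str] = {
    "Manchester": "mencestar",
    "Liverpool": "livarpul",
    "Leeds": "lidz",
}

results = perception_search(
    "mancestar", list(TOY_EN_TO_RO), lambda name: TOY_EN_TO_RO[name]
)
```

In this toy setup the best-ranked entry in `results` is "Manchester", since its pseudo-word differs from the query by a single letter. A real system would replace the lookup table with the trained transliteration model and might normalize the distance by word length.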
To illustrate the choice of direction: if the search string is in Romanian and we want to check English names against it, we transliterate from English to Romanian and compare the results, not the other way around, because English to Romanian transliteration (regarded by (Knight & Graehl, 1997) as forward transliteration) works a lot better than Romanian to English transliteration (back transliteration). For the same reason, we also use forward transliteration if the input string is in English.

The method proposed in the patent uses what is referred to as a phonetically normalized character set for word encoding. Words are stored in a database, and these phonetically normalized encodings are used to perform the search. No details are given on the construction of the phonetically normalized character set or on the models used for converting words into this type of representation.

5. Conclusions and future work

We presented a method that can be automatically trained for transliterating between any pair of languages, and we thoroughly tested our system on English, French, Bulgarian and Romanian. Using a limited-context classifier to obtain transliterations for these languages is a viable solution.

In section 4.1 we proposed a strategy for handling OOV foreign words, which are one of the plagues of unrestricted TTS synthesis. Although word accuracy rates (WAR) are lower for some language pairs, in a real scenario not all words are OOV, and even when transliteration fails to produce a fully correct word, the letter accuracy rate is very high, indicating that there may well be only one incorrectly classified letter. This means that using the pseudo-word, even if it is not fully correct, is preferable to using the unmodified foreign word.
The transliteration corpora we created will be made publicly available for research purposes. Our current priority is increasing the number of lexicons. The next language of interest to us is German.

Using Romanian as a pivot, we will add another transliteration lexicon, from English to Bulgarian. We plan to exploit the fact that the accuracy of Romanian to Bulgarian transliteration is very high (above 99% letter accuracy rate and 97% word accuracy rate), which allows the following procedure:
1. Train to transliterate from Romanian to Bulgarian;
2. Use our tool to transliterate the Romanian pseudo-words from the English to Romanian corpus into Bulgarian pseudo-words, thus generating English to Bulgarian mappings.

Evaluating the perception based search methodology poses a series of challenges. In order to correctly assess the performance of the system in real conditions, we have to use native speakers of the languages in which the search is performed. The test corpus has to be created manually and has to contain a significant number of entries in order to correctly assess the system accuracy. A comparison with other multilingual search algorithms is also required for a thorough validation of the presented idea.

References

Bisani, M. and Ney, H. (2002). Investigations on joint-multigram models for grapheme-to-phoneme conversion. Proceedings of the 7th International Conference on Spoken Language Processing, 105-108.
Black, A., Lenzo, K. and Pagel, V. (1998). Issues in building general letter to sound rules. ESCA Speech Synthesis Workshop, Jenolan Caves.
Bosch, A. and Canisius, S. (2006). Improved morpho-phonological sequence processing with constraint satisfaction inference. Proceedings of the Eighth Meeting of the ACL-SIGPHON at HLT-NAACL, 41-49.
CMU (2011). Carnegie Mellon Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
Demberg, V. (2007). Phonological constraints and morphological preprocessing for grapheme-to-phoneme conversion. Proceedings of ACL-2007.
Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1-38.
Hartley, H. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14, 174-194.
Jiampojamarn, S., Cherry, C. and Kondrak, G. (2008). Joint processing and discriminative training for letter-to-phoneme conversion. Proceedings of ACL-2008: Human Language Technology Conference, Columbus, Ohio, 905-913.
Jung, S. Y., Hong, L. S. and Paek, E. (2000). An English to Korean Transliteration Model of Extended Markov Window. Proceedings of COLING.
Knight, K. and Graehl, J. (1997). Machine transliteration. Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, Somerset, New Jersey, 128-135.
Krishnan S, H., Bendapudi, P. and Gore, A. S. (2009). WIPO Patent No. 2009005961. Geneva, Switzerland: World Intellectual Property Organization.
Li, H., Zhang, M. and Su, J. (2004). A joint source-channel model for machine transliteration. Proceedings of the 42nd ACL Annual Meeting, Barcelona, Spain, 159-166.
Li, M., Zhang, Y., Zhu, M. and Zhou, M. (2006). Exploring distributional similarity based models for query spelling correction. Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL, 1025-1032.
Marchand, Y. and Damper, R. I. (2000). A multistrategy approach to improving pronunciation by analogy. Computational Linguistics, 26(2), 195-219.
Meng, H. M., Lo, W.-K., Chen, B. and Tang, K. (2001). Generate Phonetic Cognates to Handle Name Entities. English-Chinese cross-language spoken document retrieval, ASRU.
Pagel, V., Lenzo, K. and Black, A. (1998). Letter to sound rules for accented lexicon compression. International Conference on Spoken Language Processing, Sydney, Australia.
Rama, T., Singh, A. K. and Kolachina, S. (2009). Modeling Letter-to-Phoneme Conversion as a Phrase Based Statistical Machine Translation Problem with Minimum Error Rate Training. Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, Suntec, Singapore, 124-127.
Stalls, B. G. and Knight, K. (1998). Translating Names and Technical Terms in Arabic Text. Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages.
Virga, P. and Khudanpur, S. (2003). Transliteration of Proper Names in Crosslingual Information Retrieval. Proceedings of ACL 2003 workshop MLNER.

INDEX OF AUTHORS

Apopei, Vasile, 175
Barbu Mititelu, Verginica, 99, 109
Bibiri, Anca-Diana, 151
Boian, Elena, 35
Boros, Tiberiu, 81, 163, 185
Botosineanu, Luminita, 13
Catana-Spenchiu, Ana, 51
Ciubotaru, Constantin, 35
Clim, Marius-Radu, 51
Cojocaru, Svetlana, 35, 119
Colesnicov, Alexandru, 35
Cristea, Dan, 131, 139, 151
Curteanu, Neculai, 119
Dumistracel, Stelian, 13
Dumitrescu, Stefan Daniel, 81, 109
Gifu, Daniela, 139
Hreapca, Doina, 13
Ion, Radu, 81, 163
Irimia, Elena, 3
Jitca, Doina, 175
Malahov, Ludmila, 35
Maranduc, Catalina, 59
Moiseanu, Raluca, 131
Moruz, Alex, 119
Paduraru, Otilia, 175
Patrascu, Madalin Ionel, 51
Petic, Mircea, 35
Pistol, Laura, 151
Scutelnicu, Liviu Andrei, 151
Stoica, Dan, 71, 139
Stefanescu, Dan, 81, 163
Tamba, Elena, 51
Tufis, Dan, 81
Turculet, Adrian, 151
Zafiu, Adrian, 185

Printed at the printing house of the "Alexandru Ioan Cuza" University Publishing House, Iași, 700109, tel/fax 0232 314947. Published: 2013. Order: 152. Information and orders: www.editura.uaic.ro, editura@uaic.ro