Several attempts have been made to study the linguistic features of the Arabic translations of the Gospels in order to identify the different textual traditions. Most of these projects rely on verbal agreement between texts to define their identities. In some cases, other techniques were applied on a reduced scale to selected texts. The limitations of these attempts lie in the fact that the studies used only a small number of readings, selected from only a few manuscripts, and that the selection was neither formalized nor automated.
To fill these gaps, our project offers automated linguistic corpus processing. All transcribed texts undergo morphosyntactic annotation, which associates lexical, grammatical and inflectional properties (tense, mood, voice, aspect, person, number, gender and case) with the annotated text. These properties allow the system to perform complex searches based on abstract representations of a specific word, sentence, paragraph, syntactic structure or occurrence.
In order to formalize all possible verbal tokens, we defined a taxonomy of inflectional classes for Arabic verbs. This taxonomy allows the system to encode three kinds of variation simultaneously in the lexical representation: inflectional, morphophonemic and orthographic.
The morphosyntactic annotation of a text consists essentially in associating lexical, grammatical and inflectional information with the forms present in that text. The aim of the annotation is to make it possible to run complex searches over an abstract representation of a word and to list its contexts of occurrence. A search is made up of search criteria. These criteria may bear, for example, on variations of a canonical form, on syntactic patterns, on inflectional features, or on a complex combination of these. We chose Unitex, a multilingual corpus processor, to process the texts of our prototype. Unitex was originally designed for French and English. Processing other languages requires linguistic annotation choices, as well as adaptations and adjustments to the software.
Compared with French and English, Arabic is distinguished by three characteristics:
To provide solutions for annotating forms as they are realized on the surface, we made the following decisions.
A word is a sequence of one or more segments, each enclosed in braces { }, in the following format:
{inflected_form,lemma.CAT:inflectional-features}
The following two examples illustrate two agglutinated sequences of segments:
Around a noun:
وبافعالهم
{وَ,.CONJC} {بِ,.PREP} {افعال,فعل.N:q} {هم,ه.PRO+Gen:3mp}
Around a verb:
فاعطاهم
{فَ,.CONJC} {اعطا,أعطى.V+pro:aP3ms} {هم,ه.PRO+Acc:3mp}
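The segment notation in the two examples above can be read mechanically. The following Python sketch is only an illustration of that layout, not part of Unitex: the names `Segment`, `SEGMENT_RE` and `parse_word` are hypothetical, and the treatment of the codes after `+` (such as `Gen`, `Acc` or `pro`) as a list of extra codes attached to the category is an assumption based on the examples alone.

```python
import re
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    form: str                # inflected form as written in the annotation
    lemma: str               # may be empty, as in {وَ,.CONJC}
    category: str            # e.g. N, V, PRO, PREP, CONJC
    codes: Tuple[str, ...]   # extra codes after "+", e.g. ("Gen",) (assumed reading)
    features: str            # raw inflectional feature string, left undecoded


# One segment: {form,lemma.CAT+codes:features}; the ":features" part is optional.
SEGMENT_RE = re.compile(
    r"\{([^{},]*),"        # inflected form, up to the comma
    r"([^{}.]*)\."         # lemma, up to the dot (possibly empty)
    r"([^{}:]+)"           # category with optional +codes
    r"(?::([^{}]*))?\}"    # optional inflectional features
)


def parse_word(annotation: str) -> List[Segment]:
    """Split an annotated word into its segments, in order."""
    segments = []
    for form, lemma, cat, feats in SEGMENT_RE.findall(annotation):
        category, *codes = cat.strip().split("+")
        segments.append(
            Segment(form.strip(), lemma.strip(), category, tuple(codes), feats.strip())
        )
    return segments


# The two annotated words given in the text.
NOUN_EXAMPLE = "{وَ,.CONJC} {بِ,.PREP} {افعال,فعل.N:q} {هم,ه.PRO+Gen:3mp}"
VERB_EXAMPLE = "{فَ,.CONJC} {اعطا,أعطى.V+pro:aP3ms} {هم,ه.PRO+Acc:3mp}"

noun_segments = parse_word(NOUN_EXAMPLE)
verb_segments = parse_word(VERB_EXAMPLE)

for seg in verb_segments:
    print(seg.category, seg.codes, seg.features)
```

The feature strings (`q`, `aP3ms`, `3mp`) are kept as opaque text here, since decoding them depends on the feature inventories defined for each category.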
Below we detail the values of the inflectional features for the main grammatical categories, namely verbs, nouns and adjectives.
For a verb, the inflectional features are:
For a noun or an adjective, the inflectional features are:
A language changes over time and varies according to place and social setting. In the case of Arabic, grammatical variation, such as differences in the structure of words, phrases or sentences, can be observed by comparing the same translated Gospel text across different manuscripts. One of the most common differences between these manuscripts is the way in which tenses are formed. Our corpus processor can formalize all the instances of the same lemma and draw the variation graph of its tenses on a timeline. This type of analysis assists the researcher in building a complex system for classifying the textual traditions. For example, if a Greek verb requires an accusative for its object, the corpus processor can identify the case of all the objects in the Arabic translation of the same verb wherever it is found in the transcribed manuscripts. The results can be grouped and sorted by case. This process can be carried out both horizontally (within the same manuscript) and vertically (across all the manuscripts).
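The grouping by case described above can be sketched in a few lines of Python. This is a simplified illustration under several assumptions, not the project's implementation: it only looks at a pronoun segment directly attached to the verb, it recognizes only the two case codes that appear in the annotation examples (`Acc` and `Gen`), and the manuscript labels `MS_A` and `MS_B`, the function `object_cases` and the toy readings (including the verb followed by a `Gen` pronoun) are invented for the demonstration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Segment(NamedTuple):
    form: str
    lemma: str
    category: str
    codes: Tuple[str, ...]
    features: str


# Assumption: only the case codes visible in the annotation examples.
CASE_CODES = {"Acc", "Gen"}


def object_cases(
    corpus: Dict[str, List[List[Segment]]], verb_lemma: str
) -> Tuple[Dict[str, Counter], Counter]:
    """Count the case of pronoun objects attached to a given verb lemma.

    Returns per-manuscript counts ("horizontal") and counts over the
    whole corpus ("vertical").
    """
    per_manuscript: Dict[str, Counter] = defaultdict(Counter)
    for manuscript, words in corpus.items():
        for word in words:
            # Look at each segment together with the one that follows it.
            for seg, nxt in zip(word, word[1:]):
                if (
                    seg.category == "V"
                    and seg.lemma == verb_lemma
                    and nxt.category == "PRO"
                ):
                    for code in nxt.codes:
                        if code in CASE_CODES:
                            per_manuscript[manuscript][code] += 1
    overall: Counter = Counter()
    for counts in per_manuscript.values():
        overall.update(counts)
    return dict(per_manuscript), overall


# Toy data: the verb segment comes from the annotation example; the
# manuscripts and the Gen reading are hypothetical.
GIVE = Segment("اعطا", "أعطى", "V", ("pro",), "aP3ms")
THEM_ACC = Segment("هم", "ه", "PRO", ("Acc",), "3mp")
THEM_GEN = Segment("هم", "ه", "PRO", ("Gen",), "3mp")

toy_corpus = {
    "MS_A": [[GIVE, THEM_ACC], [GIVE, THEM_ACC]],
    "MS_B": [[GIVE, THEM_ACC], [GIVE, THEM_GEN]],
}

by_manuscript, overall = object_cases(toy_corpus, GIVE.lemma)
for manuscript in sorted(by_manuscript):
    print(manuscript, sorted(by_manuscript[manuscript].items()))
print("all manuscripts", sorted(overall.items()))
```

Objects expressed as full noun phrases would need case information on the noun itself and a way to link it to the verb, which this sketch does not attempt.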
Orthographic variations may help to reveal the identity of the scribe and may also contribute to locating a text on a timeline. The use of diacritics in Arabic Gospel manuscripts remains unstudied. Formalizing and abstracting this use may help the scholar to define sets of standards by which texts can be grouped according to their orthographic variations.
Many aspects of lexico-semantic variation are linked to a sociolinguistic framework. This interplay between society, culture, religion and language is very strong in Arabic manuscripts of the Gospels. From this perspective, the corpus allows scholars to define semantic arrays through a process of abstraction.
Syntax reflects the principles and processes by which sentences are constructed in particular languages. In the case of the Arabic Gospels, the syntax underlying the text is one of the strongest indicators of the linguistic origin of the manuscript. Since it is a sacred text, the translators did their best to keep their Arabic translation very close to the original (Greek, Syriac, Latin and potentially other oriental languages). As a result, they compromised the Arabic syntax and imported that of the original, rendered with Arabic words. The annotated corpus allows the scholar to define a specific syntactic structure as a pattern and to find all the Arabic instances in the transcribed manuscripts that follow it. For each verse, the researcher can extract its syntax and compare it with that of the same verse in other languages (this requires that the same verse in those languages be annotated in the corpus). This tool may help in formally identifying the Vorlagen of the translations.
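As a rough illustration of searching for a syntactic pattern, the Python sketch below matches a sequence of grammatical categories against the category sequence of each reading. It is a plain stand-in for the idea, not how the corpus processor expresses its patterns; the function `find_pattern`, the use of `None` as a wildcard and the manuscript labels `MS_A` and `MS_B` are assumptions made for the example. The category sequences are those of the two annotated words given earlier.

```python
from typing import Dict, List, Optional, Sequence


def find_pattern(
    categories: Sequence[str], pattern: Sequence[Optional[str]]
) -> List[int]:
    """Return the start positions where `pattern` matches `categories`.

    A pattern element of None matches any category.
    """
    size = len(pattern)
    if size == 0:
        return []
    hits = []
    for start in range(len(categories) - size + 1):
        window = categories[start:start + size]
        if all(p is None or p == c for p, c in zip(pattern, window)):
            hits.append(start)
    return hits


# Category sequences of the two annotated example words.
NOUN_WORD = ["CONJC", "PREP", "N", "PRO"]
VERB_WORD = ["CONJC", "V", "PRO"]

# Hypothetical readings of one passage in two manuscripts.
readings: Dict[str, List[str]] = {
    "MS_A": NOUN_WORD + VERB_WORD,
    "MS_B": VERB_WORD,
}

# Pattern: a verb immediately followed by a pronoun.
verb_pronoun = ["V", "PRO"]
matches = {ms: find_pattern(cats, verb_pronoun) for ms, cats in readings.items()}
print(matches)
```

Comparing a verse with its counterpart in another language would then amount to extracting such category sequences on both sides, provided both are annotated with comparable categories.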
It is common in Arabic manuscripts of the Gospels to find different forms for the same inflected word. For example, the same lemma, or canonical form of a word, may have different broken plural forms. This type of analysis allows the scholar to perform a search based on the dialectal pattern of a broken plural identified in a specific text. In this case, the search returns all the lemmas affected by this pattern. A more advanced search may return the geographical or historical distribution of the lemmas affected by a specific inflectional pattern.
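A minimal Python sketch of this kind of search might group the attested forms by lemma and keep the lemmas that show more than one form for the same category and feature string, together with the manuscripts in which each form occurs. Everything in the sample data is a placeholder: the forms, lemmas and manuscript labels are invented, the function name `variant_forms` is hypothetical, and the feature value `q` is simply copied from the noun example without assuming what it encodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One record per attested segment: (manuscript, form, lemma, category, features)
Record = Tuple[str, str, str, str, str]


def variant_forms(
    records: Iterable[Record], category: str, features: str
) -> Dict[str, Dict[str, List[str]]]:
    """Map each lemma with several attested forms to {form: [manuscripts]}."""
    table: Dict[str, Dict[str, set]] = defaultdict(lambda: defaultdict(set))
    for manuscript, form, lemma, cat, feats in records:
        if cat == category and feats == features:
            table[lemma][form].add(manuscript)
    # Keep only lemmas attested with more than one distinct form.
    return {
        lemma: {form: sorted(mss) for form, mss in forms.items()}
        for lemma, forms in table.items()
        if len(forms) > 1
    }


# Placeholder data, for illustration only.
toy_records: List[Record] = [
    ("MS_A", "form_1", "lemma_x", "N", "q"),
    ("MS_B", "form_2", "lemma_x", "N", "q"),
    ("MS_C", "form_1", "lemma_x", "N", "q"),
    ("MS_A", "form_3", "lemma_y", "N", "q"),
    ("MS_B", "form_3", "lemma_y", "N", "q"),
    ("MS_A", "form_4", "lemma_z", "V", "aP3ms"),
]

variants = variant_forms(toy_records, category="N", features="q")
print(variants)
```

If each manuscript carried a date or a place of origin, the manuscript lists returned here could be mapped to those values to approximate the historical or geographical distribution mentioned above.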
Chapter | Number of manuscripts | Number of verses | Total number of verses (manuscripts × verses) | Number of words
JN01 | 21 | 17 | 357 | 3850
JN06 | 39 | 6 | 234 | 3400
JN18 | 35 | 9 | 315 | 4850
LK08 | 45 | 7 | 315 | 5350
LK15 | 45 | 22 | 990 | 11528
MK13 | 37 | 13 | 481 | 5820
MT07 | 38 | 7 | 266 | 2920
MT16 | 36 | 4 | 144 | 2257
Totals | | 85 | 3102 | 39975