佛經中英語料非監督式自動句對齊之研究

呂玟儀

Title:	佛經中英語料非監督式自動句對齊之研究 Unsupervised Sentence Alignment of Corpora of Chinese-English Buddhist Texts
Authors:	呂玟儀
Keywords:	句對齊;中英文佛經句對語料庫;動態規劃演算法;Sentence Alignment;The Parallel Corpora of Chinese and English Buddhist Texts;Dynamic Programming
Issue Date:	Jul-2019
Abstract:	From the time of the Buddha, the dharma was taught in a variety of languages and dialects. The teachings were preserved orally for a long time before eventually being written down. As Buddhism spread along routes of travel and trade, the Buddhist texts were translated into many languages, including early translations into Tocharian, Khotanese, and Gāndhārī, Chinese translations beginning in the Han Dynasty, and Tibetan translations beginning during the Tang Dynasty. In modern times, as Buddhism reached Western countries, the texts were translated into many Western languages as well. Language is an essential tool for communication between people, and today's online translation tools make both language learning and communication faster and more convenient. Advances in deep learning have greatly improved the accuracy of automatic translation systems. To reach acceptable translation quality, however, these methods must be trained on a corpus containing a large number of sentence-level parallel pairs in the two languages.
However, although Buddhist texts exist in many languages, a well-constructed, sentence-aligned parallel corpus of them is still lacking. This thesis therefore studies unsupervised sentence alignment, with the aim of finding an algorithm that can efficiently align Chinese-English Buddhist texts. For evaluation, we use texts from the "Taishō Shinshū Daizōkyō" that are available in both Chinese and English: randomly selected passages and short sutras from the "Chang Ahan Jing (Dīrgha Āgama)", and the "Foshuo Amituo Jing". The Chinese and English texts are first split into sentences and then segmented into words. Each English sentence is converted into a set of English terms. Each Chinese sentence is converted into a set of English translation terms by looking up its words in Chinese-English lexical data that combines Buddhist, classical Chinese, and general-purpose dictionaries. Each sentence is thus represented as a vector of terms, so that measuring the similarity between two sentences reduces to measuring the similarity between two vectors, computed from the terms they share using concepts from information retrieval. Given this similarity measure, we apply a dynamic-programming alignment algorithm to find the optimal sentence alignment. The results are evaluated by precision, recall, and F1-measure under two standards, rigid and relaxed. The averages are 0.5957 (rigid precision), 0.6774 (rigid recall), 0.6335 (rigid F1-measure), 0.7847 (relaxed precision), 0.7133 (relaxed recall), and 0.7454 (relaxed F1-measure). These results indicate that the proposed method is effective. A close examination of the error cases reveals several causes of incorrect alignments: insufficient English definitions for Chinese terms, a large number of redundant terms that match nothing, word segmentation errors, large differences in how the Chinese and English texts are divided into sentences, and overly complex correspondences between Chinese and English sentences.
The goal of this thesis is to design a practical sentence alignment approach for Chinese and English Buddhist texts in order to build parallel corpora. Based on the error analysis, we propose feasible improvements, and in future work we will continue to refine the method to achieve higher precision.
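The pipeline described in the abstract (dictionary lookup, term-vector similarity, dynamic-programming alignment) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The abstract does not name the similarity function, the permitted alignment shapes, or the scoring of unaligned sentences, so the sketch assumes cosine similarity over bag-of-words counts, the alignment shapes 1-1, 1-2, 2-1, 1-0 and 0-1, a score of zero for skipped sentences, and an objective that maximizes the sum of similarities. The names gloss_bag, merge, align, and BEADS, and the toy lexicon, are hypothetical.

```python
import math
from collections import Counter

# Allowed alignment shapes as (Chinese sentences, English sentences).
# This set is an assumption; the abstract does not list the shapes used.
BEADS = [(1, 1), (1, 2), (2, 1), (1, 0), (0, 1)]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gloss_bag(zh_words, lexicon):
    """Turn segmented Chinese words into a bag of English glosses.

    `lexicon` is a hypothetical dict mapping a Chinese word to a list of
    English translation terms; words missing from it contribute nothing.
    """
    bag = Counter()
    for word in zh_words:
        bag.update(lexicon.get(word, []))
    return bag


def merge(bags):
    """Add up the term counts of several sentence bags."""
    total = Counter()
    for bag in bags:
        total.update(bag)
    return total


def align(zh_bags, en_bags):
    """Return the alignment with the highest total similarity.

    The result is a list of (Chinese indices, English indices) pairs.
    Skipped sentences (1-0 and 0-1) score zero, which is an assumption.
    """
    n, m = len(zh_bags), len(en_bags)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                gain = 0.0
                if di and dj:
                    gain = cosine(merge(zh_bags[i:ni]), merge(en_bags[j:nj]))
                if best[i][j] + gain > best[ni][nj]:
                    best[ni][nj] = best[i][j] + gain
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the chosen path.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return pairs[::-1]


# Toy data: a made-up lexicon and pre-tokenized, lowercased sentences.
lexicon = {"如是": ["thus"], "我": ["i"], "聞": ["hear"],
           "佛": ["buddha"], "說": ["speak"], "比丘": ["monk"]}
zh_sentences = [["如是", "我", "聞"], ["佛", "說", "比丘"]]
en_sentences = [["thus", "i", "hear"], ["buddha", "speak"], ["monk", "listen"]]

zh_bags = [gloss_bag(s, lexicon) for s in zh_sentences]
en_bags = [Counter(s) for s in en_sentences]
print(align(zh_bags, en_bags))
```

On this toy input the sketch pairs the first Chinese sentence with the first English sentence and the second Chinese sentence with the second and third English sentences together, a simple case of the differing sentence division that the abstract lists among the error sources. Because the objective is a plain sum of similarities, it tends to favor many small pairs over merged ones; a real system would likely need length normalization or penalties for skipped sentences.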
URI:	http://172.27.2.131/handle/123456789/862
Appears in Collections:	佛教學系 (Department of Buddhist Studies)
