Side-conditioned language models have recently shown promising results in

Sequence-to-sequence translation methods based on generation with a

In recent years statistical word alignment models have been widely used for various Natural Language Processing (NLP) problems. In this paper we describe a platform-independent, object-oriented implementation (in Java) of a word alignment algorithm. This algorithm is based on the first three IBM models. This is ongoing work in which we are trying to explore possible enhancements to the IBM models, especially for related languages such as the Indian languages. We have been able to improve the performance by introducing a similarity measure (the Dice coefficient) and by using a list of cognates and a morphological analyzer. Information about cognates is especially relevant for Indian languages because these languages have many borrowed and inherited words that are common to more than one language. For our experiments on English-Hindi word alignment, we also tried to use a bilingual dictionary to bootstrap the Expectation Maximization (EM) algorithm. After training on 7399 sentence-aligned sentences, we compared the results with GIZA++, an existing word alignment tool. The results indicate that although the performance of our word aligner is lower than that of GIZA++, it can be improved by adding techniques such as smoothing to take care of the data sparsity problem. We are also working on further improvements using morphological information and a better similarity measure. This word alignment tool is in the form of an API and is being developed as part of Sanchay (a collection of tools and APIs for NLP with a focus on Indian languages).

We describe a series of five statistical models of the translation process and give algorithms for estimating the parameters of these models given a set of pairs of sentences that are translations of one another. We define a concept of word-by-word alignment between such pairs of sentences. For any given pair of such sentences each of our models assigns a probability to each of the possible word-by-word alignments. We give an algorithm for seeking the most probable of these alignments. Although the algorithm is suboptimal, the alignment thus obtained accounts well for the word-by-word relationships in the pair of sentences. We have a great deal of data in French and English from the proceedings of the Canadian Parliament. Accordingly, we have restricted our work to these two languages, but we feel that because our algorithms have minimal linguistic content they would work well on other pairs of languages. We also feel, again because of the minimal linguistic content of our algorithms, that it is reasonable to argue that word-by-word alignments are inherent in any sufficiently large bilingual corpus.
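To make the two ideas in the abstracts concrete (estimating translation parameters with EM, then seeking the most probable word-by-word alignment), here is a minimal Python sketch of the simplest case, IBM Model 1. It is an illustration under stated assumptions, not the Sanchay API or the GIZA++ implementation: the function names `train_model1` and `best_alignment` and the toy corpus are made up for this example, the empty (NULL) source word is left out for brevity, and none of the enhancements mentioned above (Dice coefficient, cognates, dictionary bootstrapping, smoothing) are included.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(f | e) with EM, as in IBM Model 1 (NULL word omitted).

    pairs: list of (source_words, target_words) sentence pairs.
    Returns a dict mapping (f, e) to the probability of target word f
    given source word e.
    """
    tgt_vocab = {f for _, fs in pairs for f in fs}
    # Uniform starting point over all word pairs that co-occur in a sentence pair.
    t = {(f, e): 1.0 / len(tgt_vocab)
         for es, fs in pairs for e in es for f in fs}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links leaving e
        # E-step: distribute each target word over the source words
        # of its sentence, in proportion to the current t(f | e).
        for es, fs in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize the expected counts into probabilities.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


def best_alignment(t, es, fs):
    """For each target word, return the index of its most probable source word."""
    return [max(range(len(es)), key=lambda i: t.get((f, es[i]), 0.0))
            for f in fs]


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    corpus = [
        (["the", "house"], ["das", "Haus"]),
        (["the", "book"], ["das", "Buch"]),
        (["a", "book"], ["ein", "Buch"]),
    ]
    t = train_model1(corpus, iterations=10)
    print(best_alignment(t, ["the", "house"], ["das", "Haus"]))
```

In Model 1 every target word chooses its source word independently, so the per-word maximum in `best_alignment` is the exact most probable alignment. The later models add distortion and fertility parameters that couple these choices, which is why the search for them is only approximate.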