Unsupervised Cross-lingual Word Embeddings Based on Subword Alignment

EasyChair Preprint 2254
12 pages • Date: December 25, 2019

Abstract

Cross-lingual word embeddings are crucial building blocks for multilingual models, and recent studies indicate that they can be obtained without any cross-lingual resources. However, experimental results indicate that the performance of such cross-lingual word embeddings degrades on distant language pairs such as English-Japanese. In this paper, we propose an unsupervised method for obtaining cross-lingual word embeddings that uses subword alignment to capture trivially translatable word pairs, such as named entities and loanwords. These words tend to be unambiguously translatable and thus provide a more reliable signal for building a bilingual dictionary. Our method first obtains initial cross-lingual word embeddings with an existing unsupervised method and uses them to induce a bilingual dictionary, from which it learns a subword alignment model. It then uses the induced alignment model to extract the word pairs whose surface forms are alignable, constructing a high-quality bilingual dictionary. Finally, we use the resulting bilingual dictionary to obtain high-quality cross-lingual word embeddings. Experimental results on four language pairs, English-Japanese, English-Finnish, English-Spanish, and English-Italian, indicate that cross-lingual word embeddings obtained with our method outperform an existing method, especially on distant language pairs (by 3% on English-Japanese and 2% on English-Finnish).

Keyphrases: Cross-lingual, Natural Language Processing, Representation Learning, bilingual lexicon induction, cross-lingual word embeddings, distant language, multilingual, word embedding