Document Type
Honors Thesis
Abstract
This study explored the importance of, and the challenges facing, the application of machine translation (MT) to Québécois, a low-resource variety of French native to the Canadian province of Québec. The history of MT in Canada is discussed, and a Québécois-English MT engine was trained to investigate practical questions around applying automated translation to this low-resource French variant. Québécois has only rarely been the focus of published MT research, even within the Canadian governmental setting, which relies primarily on human translation. Marian, an open-source neural machine translation (NMT) toolkit, was used to train an MT engine on the 36th Canadian Parliament’s aligned Québécois-English Hansards (debate transcripts). Parliamentary debates are a common source of training data for MT, but the Canadian Parliament data has not been widely used. The engine’s BLEU score, an automated metric of translation quality, was 32.0, indicating moderately good translation quality. Further qualitative analyses were performed by translating authentic Québécois texts taken from a range of linguistic domains (interviews, health/medical, technology, and politics) and a hand-crafted set of sentences containing “challenge” words that were expected to be difficult for the engine to translate. The resulting engine, trained on Hansards data, struggled with basic Québécois French-English translation across multiple domains. In-domain, the engine’s overall automatic BLEU score of 32 is within the normal range for a new engine. It did not perform as well on a test set of sentence probes based on Québécois terminology, and when translating modern-day news magazine stories it produced only strings that would require post-editing. With the addition of a large, aligned, bilingual Canadian dataset, a satisfactory specialized MT engine for this French variant could be built.
In the meantime, MT researchers and administrators in the Canadian governmental setting are advised to continue pairing MT output with human translators and post-editors, an arrangement with which the Canadian Translation Bureau has demonstrated greater comfort for some years.
Publication Date
5-1-2020
Language
English
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation
Soueid, Sophie, "DEBATABLE DATA: UTILIZING FRENCH CANADIAN PARLIAMENTARY RECORDS TO BUILD A MACHINE TRANSLATION ENGINE FOR A LOW-RESOURCE FRENCH VARIANT (QUÉBÉCOIS)" (2020). 2020 Spring Honors Capstone Projects. 48.
https://mavmatrix.uta.edu/honors_spring2020/48