Objectives

Systematically investigate and explore the space of possible design choices in Hybrid MT, using sophisticated machine-learning (ML) technologies, in order to provide optimal support for Hybrid MT design. The variants of hybrid MT explored in this way will subsume the approaches investigated in recent work, such as

  • using phrases from different types of MT in e.g. phrase-based SMT
  • syntactic reordering prior to phrase-based SMT training
  • system combination approaches, either parallel in multi-engine MT (MEMT) or sequential in statistical post-editing (SPMT), effectively treating the participating MT systems as stand-alone "black boxes"
  • using richer linguistic information in phrase-based SMT (e.g. in factored models or hierarchical SMT)
  • learning resources (e.g. transfer rules, transduction grammars) for probabilistic rule-based MT
  • approaches focused on scarcely resourced scenarios where sufficiently large amounts of bitext training resources are not readily available (e.g. METIS).

The systematic variation and optimisation along these dimensions will most likely reveal new combinations of the various components and approaches that have so far only been investigated in isolation.

A further important objective of WP 2 is to build bridges to the machine-learning community in order to explore the choice space for Hybrid MT systematically and jointly, as ML technologies are likely to allow a well-directed combination of the different components.

Description of work

Task 2.1: Data Preparation

In close cooperation with Pillar II (Resource Infrastructure), we will develop a sample bitext corpus (English, German, Spanish and one "new" language (e.g. Czech) if possible based on a 1,000-sentence section of the Europarl corpus), annotated with phrases and linguistic information from different (basic and hybrid) MT approaches, including SMT, phrase-based SMT, EBMT and RBMT. We will evaluate random samples for annotation consistency and omissions. We will also collect human judgements on the quality of alternative phrases using a suitable MT-evaluation tool ("Appraise", Federmann, 2010).
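As a minimal sketch of what one record of such an annotated corpus might look like, the following Python fragment models a source sentence with phrase annotations from several MT approaches and shows simple checks for omissions and malformed spans on a random sample. The class names, field names, engine labels and example sentences are all hypothetical; the work plan does not prescribe a schema.

```python
import random
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical engine labels; the text names MT paradigms, not a fixed schema.
ENGINES = ("SMT", "PBSMT", "EBMT", "RBMT")

@dataclass
class PhraseAnnotation:
    engine: str                       # MT approach that proposed the phrase
    src_span: tuple                   # (start, end) source token offsets, end exclusive
    target: str                       # proposed target-language phrase
    human_rank: Optional[int] = None  # 1 = best; None if not yet judged

@dataclass
class AnnotatedSentence:
    sent_id: str
    source: str
    annotations: list = field(default_factory=list)

def missing_engines(sentence, engines=ENGINES):
    """Return the engines that contributed no annotation to this sentence."""
    seen = {a.engine for a in sentence.annotations}
    return [e for e in engines if e not in seen]

def invalid_spans(sentence):
    """Return annotations whose span falls outside the source token range."""
    n = len(sentence.source.split())
    return [a for a in sentence.annotations
            if not (0 <= a.src_span[0] < a.src_span[1] <= n)]

def sample_for_review(corpus, k, seed=0):
    """Draw a reproducible random sample of sentences for manual checking."""
    rng = random.Random(seed)
    return rng.sample(corpus, min(k, len(corpus)))

# Toy records invented for illustration only.
corpus = [
    AnnotatedSentence("s1", "das ist ein Test", [
        PhraseAnnotation("PBSMT", (0, 2), "this is"),
        PhraseAnnotation("RBMT", (2, 4), "a test"),
    ]),
    AnnotatedSentence("s2", "ein Haus", [
        PhraseAnnotation("SMT", (0, 3), "a house"),  # span exceeds sentence length
    ]),
]

for s in sample_for_review(corpus, k=2):
    print(s.sent_id, missing_engines(s), len(invalid_spans(s)))
```

Fixing the random seed keeps the reviewed sample reproducible, which may help when several annotators are asked to check the same sentences.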

Task 2.2: Learning Optimal Choice Selection for Hybrid MT

We will use the resource established in Task 2.1 for feature extraction and training ML methods to systematically explore optimal combination possibilities for Hybrid MT architectures. This task will take into account the influence of context-based word choice on hybrid translation quality.
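To illustrate the general idea of learning to choose among alternative phrases, the sketch below trains a simple perceptron reranker on toy examples in which a human judgement marks the preferred candidate. This is only one possible stand-in: the perceptron, the feature set and the example data are assumptions made for illustration, not the methods the project commits to (the work plan mentions, e.g., MBL and SVM).

```python
from collections import defaultdict

def features(candidate):
    """Map a candidate phrase to sparse features (illustrative choices only)."""
    engine, source, target = candidate
    src_len = len(source.split())
    tgt_len = len(target.split())
    return {
        "engine=" + engine: 1.0,
        "len_diff": float(abs(src_len - tgt_len)),
        "bias": 1.0,
    }

def score(weights, feats):
    """Dot product of the weight vector and a sparse feature dict."""
    return sum(weights[name] * value for name, value in feats.items())

def choose(weights, candidates):
    """Return the index of the highest-scoring candidate (first one on ties)."""
    return max(range(len(candidates)),
               key=lambda i: score(weights, features(candidates[i])))

def train(examples, epochs=10):
    """examples: list of (candidates, best_index) pairs from human judgements."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for candidates, best in examples:
            guess = choose(weights, candidates)
            if guess != best:
                # Perceptron update: move towards the preferred candidate
                # and away from the wrongly chosen one.
                for name, value in features(candidates[best]).items():
                    weights[name] += value
                for name, value in features(candidates[guess]).items():
                    weights[name] -= value
    return weights

# Toy training data invented for illustration: (engine, source, target) triples.
examples = [
    ([("PBSMT", "das Haus", "house the big"), ("RBMT", "das Haus", "the house")], 1),
    ([("RBMT", "ein Test", "a test it is"), ("PBSMT", "ein Test", "a test")], 1),
]
weights = train(examples)

held_out = [("EBMT", "ein Haus", "a house there"), ("SMT", "ein Haus", "a house")]
print(held_out[choose(weights, held_out)])
```

In a realistic setting the feature function would draw on the linguistic annotations and context of the corpus from Task 2.1 rather than on phrase length alone.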

Task 2.3: Challenges

We will evaluate the resulting Hybrid-MT models against (i) state-of-the-art phrase-based SMT systems, (ii) current syntax-enhanced SMT systems (e.g. Hiero, Joshua or Moses with factored models), (iii) phrase-based SMT models that use phrases derived from different types of MT, and (iv) system combinations (both MEMT and SPMT). We will also integrate the novel hybrid systems into system combinations. We will participate in shared tasks and open evaluation campaigns, in particular those organised by the Workshops on Statistical Machine Translation (WMT).

In addition, we will evaluate our approach at approx. M23 using a simple phrase-level combination SMT approach, and perform a final evaluation of the ML-based approach at M35, measuring the effect on translation quality of integrating the hybrid system into multi-engine MT/system combinations. The results will be published. Technical work will be completed in M33, leaving time for wrap-up and reporting.

Task 2.4: Workshops

We will organise or co-organise three workshops: one project-internal workshop (on Data Preparation, around M9) and two joint workshops (ML/MT, around M21 and M33) with representatives from both the machine learning and the machine translation communities (reflecting the main ML approaches, including MBL, SVM, etc., and the main MT approaches).

Deliverables

No.        Title                               Due Date       Type
D2.1       Annotated Hybrid Sample MT Corpus   M12            Software
D2.2       Optimal Choice Selection in MT      M24            Report
D2.3       Evaluation Report                   M36            Report
D2.4.1-3   Workshops                           M8, M21, M33   Other