Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT (ML4HMT-2012)

Call for Papers

The workshop and associated shared task are an effort to trigger a systematic investigation into improving state-of-the-art hybrid machine translation, making use of advanced machine-learning (ML) methodologies. It follows the ML4HMT-11 workshop, which took place last November in Barcelona. The first workshop also road-tested a shared task (and associated data set) and laid the basis for a broader reach in 2012.

Regular Papers ML4HMT-12

We are soliciting original papers on hybrid MT, including (but not limited to):

use of machine learning methods in hybrid MT;
system combination: parallel in multi-engine MT (MEMT) or sequential in statistical post-editing (SPMT);
combining phrases and translation units from different types of MT;
syntactic pre-/re-ordering;
using richer linguistic information in phrase-based or in hierarchical SMT;
learning resources (e.g., transfer rules, transduction grammars) for probabilistic rule-based MT.

Full papers should be anonymous and follow the COLING full paper format.

Shared Task ML4HMT-12

The main focus of the Shared Task is to address the question:

“Can Hybrid MT and System Combination techniques benefit from extra information (linguistically motivated, decoding, runtime, confidence scores, or other meta-data) from the systems involved?”

Participants are invited to build hybrid MT systems and/or system combinations by using the output of several MT systems of different types, as provided by the organisers. While participants are encouraged to use machine learning techniques to exploit the additional meta-data information sources, other general improvements in hybrid and combination-based MT are also welcome in the challenge. For systems that exploit additional meta-data, the challenge is that this meta-data is highly heterogeneous and specific to each individual system.

Shared task participants are invited to submit system description papers (7 pages, not anonymised), which should follow the COLING short paper format.

Data

The ML4HMT-12 Shared Task involves (ES-EN) and (ZH-EN) data sets, in each case translating into EN.

(ES-EN): Participants are given a bilingual development set aligned at the sentence level. Each "bilingual sentence" contains: 1) the source sentence, 2) the target (reference) sentence and 3) the corresponding multiple output translations from four systems, based on different MT approaches (Apertium, Ramirez-Sanchez, 2006; Lucy, Alonso and Thurmair, 2003; Moses, Koehn et al., 2007). The output has been annotated with system-internal meta-data information derived from the translation process of each of the systems.
Downloads — available from github.com/cfedermann/ML4HMT-2012
1. tuning data ES→EN: data/tuningDataESEN.tar.bz2 (57.1 MB)
2. test data ES→EN: data/ml4hmt-12.testDataESEN.tar.bz2 (6.1 MB)
(ZH-EN): A corresponding data set for ZH-EN with output translations from three systems (Moses; ICT_Chiero, Mi et al., 2009; Huajian RBMT) will be provided.
Downloads
1. Data for ZH→EN is available for download from the LDC. Participants have to fill out a user agreement form and individually obtain the data packages from the LDC.

Participants are challenged to build an MT mechanism that, where possible, makes effective use of the system-specific MT meta-data output. They can provide solutions based on open-source systems, or develop their own mechanisms. The development set can be used for tuning the systems during the development phase. Final submissions have to include translation output on a test set, which will be made available one week after the training data release. Data will be provided to build language/reordering models, possibly re-using existing resources from MT research.

Participants can also make use of additional tools (linguistic analysis, confidence estimation, etc.) if their systems require them, but they have to explicitly declare this upon submission, so that their systems are judged as "unconstrained" systems. This will allow for a better comparison between participating systems.

Submission Details

Shared task results should be submitted via email attachment. Please compress your results as a .zip or .gz archive and send them via email. Use "ML4HMT-12 Shared Task Submission" as the mail subject. Shared task results are due by October 28th (extended deadline).
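As an illustration of the packaging step, the following is a minimal sketch in Python that collects translation output files into a single .zip archive using the standard zipfile module. The function name package_results and any file names passed to it are hypothetical; the organisers do not prescribe a particular archive layout or naming scheme here.

```python
import zipfile
from pathlib import Path


def package_results(output_files, archive_path):
    """Write the given translation output files into one .zip archive.

    Both arguments are illustrative: pass whatever files your system
    produced and whatever archive name you intend to attach to the email.
    """
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in output_files:
            f = Path(f)
            # Store each file under its base name, without directory prefixes.
            zf.write(f, arcname=f.name)
    return archive_path
```

A call such as package_results(["my_system.es-en.txt"], "submission.zip") would then produce an archive ready to attach; both names are placeholders.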

System output will be judged via peer-based human evaluation as well as automatic evaluation. During the evaluation phase, participants will be requested to rank system outputs of other participants through a web-based interface (Appraise; Federmann, 2010). Automatic metrics include BLEU (Papineni et al., 2002), TER (Snover et al., 2006) and METEOR (Lavie, 2005).
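For participants who want a rough check during development, the sketch below computes a simplified corpus-level BLEU score in Python. It assumes whitespace tokenisation, a single reference per segment and no smoothing, so its values will not necessarily match the scoring scripts used in the official evaluation. The function names ngrams and corpus_bleu are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU: one reference per segment, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty applies when the hypothesis is not longer than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)
```

The function takes two parallel lists of strings (system output and reference translations for the same segments) and returns a value between 0 and 1.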

Results from the automatic evaluation of submitted shared task results will be made available to participants on November 12th (updated date) so that they can be referred to in system description papers. As the manual evaluation will take longer, its results will be presented and published at the workshop.

Papers need to be written in English and can be either 1) research papers, or 2) shared task system description papers. Please indicate the type of your submission by choosing the correct "Submission Type". Note that research papers need to be properly anonymised; system description papers may include author information. The submission deadline for research papers is October 22nd (extended), while system description papers need to be submitted by November 12th (updated).

Workshop Participation

If you are interested in our workshop and intend to participate, we would much appreciate it if you could inform us of your intent beforehand so that we can better plan the workshop; to do so, send an email. Participation is required for at least one author of any accepted paper.

Important Dates 2012

August 15th Shared Task training data release
August 23rd Shared Task test data release
October 22nd Workshop research paper submission
October 28th Shared Task translation results submission deadline
November 7th Workshop research paper accept/reject notification
November 12th Shared Task automatic evaluation results release
November 12th Shared Task system description paper submission
November 13th Workshop and Shared Task camera ready paper due
December 9th COLING 2012 Pre-conference workshop, 9am

Organizers

Prof. Josef van Genabith, Dublin City University (DCU) and Centre for Next Generation Localisation (CNGL)
Prof. Toni Badia, Universitat Pompeu Fabra and Barcelona Media (BM)
Christian Federmann, German Research Center for Artificial Intelligence (DFKI), contact person
Dr. Maite Melero, Barcelona Media (BM)
Dr. Marta R. Costa-jussà, Barcelona Media (BM)
Dr. Tsuyoshi Okita, Dublin City University (DCU)

Program committee

Eleftherios Avramidis (German Research Center for Artificial Intelligence, Germany)
Prof. Sivaji Bandyopadhyay (Jadavpur University, India)
Dr. Rafael Banchs (Institute for Infocomm Research - I2R, Singapore)
Prof. Loïc Barrault (LIUM - University of Le Mans, France)
Prof. Antal van den Bosch (Centre for Language Studies, Radboud University Nijmegen, Netherlands)
Dr. Grzegorz Chrupala (Saarland University, Saarbrücken, Germany)
Prof. Jinhua Du (Xi'an University of Technology (XAUT), China)
Dr. Andreas Eisele (Directorate-General for Translation (DGT), Luxembourg)
Dr. Cristina España-Bonet (Technical University of Catalonia, TALP, Barcelona)
Dr. Declan Groves (Center for Next Generation Localisation, Dublin City University, Ireland)
Prof. Jan Hajic (Institute of Formal and Applied Linguistics, Charles University in Prague)
Prof. Timo Honkela (Aalto University, Finland)
Dr. Patrick Lambert (LIUM - University of Le Mans, France)
Prof. Qun Liu (Institute of Computing Technology, Chinese Academy of Sciences, China)
Dr. Maite Melero (Barcelona Media Innovation Center, Spain)
Dr. Tsuyoshi Okita (Dublin City University, Ireland)
Prof. Pavel Pecina (Institute of Formal and Applied Linguistics, Charles University in Prague)
Dr. Marta R. Costa-jussà (Barcelona Media Innovation Center, Spain)
Dr. Felipe Sanchez Martinez (Escuela Politecnica Superior, Universidad de Alicante, Spain)
Dr. Nicolas Stroppa (Google, Zurich, Switzerland)
Prof. Hans Uszkoreit (German Research Center for Artificial Intelligence, Germany)
Dr. David Vilar (German Research Center for Artificial Intelligence, Germany)

Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT (ML4HMT-12)

Mumbai (India) · December 9th, 2012

In conjunction with 24th International Conference on Computational Linguistics (COLING 2012)