Challenges in Pronoun Resolution System for Biomedical Text
Ngan Nguyen1, Jin-Dong Kim1, Junichi Tsujii1,2,3
1 University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, 113-0033 Japan
2 University of Manchester, Oxford Road, Manchester, M13 9PL, UK
3 National Centre for Text Mining, 131 Princess Street, Manchester, M1 7DN, UK
{nltngan, jdkim, tsujii}@is.s.u-tokyo.ac.jp
Abstract
This paper presents our findings on the feasibility of pronoun resolution for biomedical texts, in comparison with pronoun resolution for the newswire domain. In our experiments, we built a simple machine learning-based pronoun resolution system and evaluated it on three different corpora: MUC, ACE, and GENIA. Comparative statistics not only reveal notable issues in constructing an effective pronoun resolution system for a new domain, but also provide a comprehensive view of the corpora often used for this task.
Link: pdf
Vox Populi Annotation: Measuring Intensity of Ideological Perspectives by Aggregating Group Judgments
Wei-Hao Lin and Alexander Hauptmann
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
{whlin,alex}@cs.cmu.edu
Abstract
Polarizing discussions about political and social issues are common in mass media. Annotations of the degree to which a sentence expresses an ideological perspective can be valuable for evaluating computer programs that automatically identify strongly biased sentences, but such annotations remain scarce. We annotated the intensity of ideological perspectives expressed in 250 sentences by aggregating judgments from 18 annotators. We proposed methods for determining the number of annotators and assessing reliability, and showed that the sentence-level annotations of ideological perspectives were reliable across different annotator groups.
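The abstract does not spell out how judgments were aggregated or how reliability was assessed. As a rough illustration only, the sketch below assumes mean aggregation per sentence and a split-half check: annotators are repeatedly divided into two random groups, and the Pearson correlation between the two groups' aggregated scores is averaged. The function names and the toy ratings are hypothetical.

```python
import random
from statistics import mean

def aggregate(ratings):
    """Mean rating per sentence; ratings[a][s] is annotator a's score for sentence s."""
    n_sentences = len(ratings[0])
    return [mean(r[s] for r in ratings) for s in range(n_sentences)]

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def split_half_reliability(ratings, trials=200, seed=0):
    """Average correlation between aggregated scores of two random annotator halves."""
    rng = random.Random(seed)
    idx = list(range(len(ratings)))
    scores = []
    for _ in range(trials):
        rng.shuffle(idx)
        half = len(idx) // 2
        group_1 = aggregate([ratings[i] for i in idx[:half]])
        group_2 = aggregate([ratings[i] for i in idx[half:]])
        scores.append(pearson(group_1, group_2))
    return mean(scores)

# Toy data (invented): 4 annotators rating 5 sentences on a 1-5 intensity scale.
ratings = [
    [1, 2, 3, 4, 5],
    [1, 3, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 2, 4, 4, 4],
]
sentence_scores = aggregate(ratings)
reliability = split_half_reliability(ratings)
```

A reliability value close to 1 would suggest that different annotator groups produce similar sentence-level scores; other agreement measures could be substituted for the correlation.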
Link: pdf
Automatic phone segmentation of expressive speech
Laure Charonnat, Gaëlle Vidal and Olivier Boeffard
IRISA – Institut de Recherche en Informatique et Systèmes Aléatoires
Université de Rennes 1, Enssat, Lannion, France
{Laure.Charonnat,Gaëlle.Vidal}@enssat.fr, Olivier.Boeffard@irisa.fr
Abstract
In order to improve the flexibility and precision of an automatic phone segmentation system for one type of expressive speech, the dubbing of fiction movies into French, we developed both the phonetic labelling process and the alignment process. The automatic labelling system relies on automatic grapheme-to-phoneme conversion, which includes all the variants of the phonetic chain, and on HMM modelling. In this article, we distinguish three sets of phone models: a set of context-independent models, a set of left- and right-context-dependent models, and a mixed set that combines phone and triphone models according to the alignment precision obtained for each broad phonetic class. The three sets of models are evaluated on a test corpus. On the one hand, we observe a slight decrease in the phonetic labelling score, mainly due to pause insertions; on the other hand, the mixed set of models gives the best alignment precision.
Link: pdf
L-ISA: Learning Domain Specific Isa-Relations from the Web
Alessandra Potrich, Emanuele Pianta
Fondazione Bruno Kessler
38050 Povo (Trento), Italy
potrich@fbk.eu, pianta@fbk.eu
Abstract
Automated extraction of ontological knowledge from text corpora is a relevant task in Natural Language Processing. In this paper, we focus on the problem of finding hypernyms for relevant concepts in a specific domain (e.g. Optical Recording) in the context of a concrete and challenging application scenario (patent processing). To this end, information available on the Web is exploited. The extraction method includes four main steps. First, the Google search engine is used to retrieve possible instances of isa-patterns reported in the literature. Then, the returned snippets are filtered on the basis of lexico-syntactic criteria (e.g. the candidate hypernym must be expressed as a noun phrase without complex modifiers). In a further filtering step, only candidate hypernyms compatible with the target domain are kept. Finally, a candidate ranking mechanism is applied to select one hypernym as the output of the algorithm. The extraction method was evaluated on 100 concepts of the Optical Recording domain. Moreover, the reliability of isa-patterns reported in the literature as predictors of isa-relations was assessed by manually evaluating the template instances remaining after lexico-syntactic filtering, for 3 concepts of the same domain. While more extensive testing is needed, the method appears promising, especially for its portability across different domains.
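The pattern-matching, filtering, and ranking steps described above can be sketched roughly as follows. This is not the authors' implementation: the two patterns are generic examples of isa-patterns, the stop-word trimming is a crude stand-in for the lexico-syntactic filter, the domain-compatibility filter is omitted, ranking is by raw frequency, and the snippets are supplied by hand rather than retrieved from a search engine. All names and example snippets are hypothetical.

```python
import re
from collections import Counter

# Words that end a simple noun phrase in this toy filter (an assumption).
STOP = {"that", "which", "for", "of", "in", "with", "and", "used", "is", "are"}

# Crude noun-phrase stand-in: one to four consecutive words.
NP = r"(\w+(?: \w+){0,3})"

def isa_patterns(term):
    """Instantiate example isa-patterns for a target term."""
    t = re.escape(term)
    return [
        re.compile(r"\b" + t + r" is an? " + NP, re.I),
        re.compile(r"\b" + t + r" (?:and|or) other " + NP, re.I),
    ]

def candidate_hypernyms(term, snippets):
    """Collect candidate hypernyms, cutting each match at the first stop word."""
    found = []
    for snippet in snippets:
        for pattern in isa_patterns(term):
            for match in pattern.finditer(snippet):
                words = []
                for word in match.group(1).lower().split():
                    if word in STOP:
                        break
                    words.append(word)
                if words:
                    found.append(" ".join(words))
    return found

def best_hypernym(term, snippets):
    """Rank candidates by frequency and return the top one, or None."""
    counts = Counter(candidate_hypernyms(term, snippets))
    return counts.most_common(1)[0][0] if counts else None

# Invented snippets standing in for search-engine results.
snippets = [
    "A DVD is an optical disc that stores video.",
    "The DVD is an optical disc used for data storage.",
    "DVD is a format.",
    "DVD and other media are sold here.",
]
top = best_hypernym("DVD", snippets)
```

With these toy snippets, the most frequent surviving candidate is selected; a real system would also need to normalize plurals and check candidates against the target domain.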
Link: pdf
Learning Morphology with Morfette
Grzegorz Chrupała, Georgiana Dinu, Josef van Genabith
Dublin City University, Dublin 9, Ireland; Universität des Saarlandes, D-66041 Saarbrücken, Germany; Dublin City University, Dublin 9, Ireland
gchrupala@computing.dcu.ie, dinu@coli.uni-sb.de, josef@computing.dcu.ie
Abstract
Morfette is a modular, data-driven, probabilistic system which learns to perform joint morphological tagging and lemmatization from morphologically annotated corpora. The system is composed of two learning modules which are trained to predict morphological tags and lemmas using a Maximum Entropy classifier. A third module dynamically combines the predictions of the Maximum Entropy models and outputs a probability distribution over tag-lemma pair sequences. The lemmatization module exploits the idea of recasting lemmatization as a classification task by using class labels which encode mappings from wordforms to lemmas. Experimental evaluation results and error analysis on three morphologically rich languages show that the system achieves high accuracy with no language-specific feature engineering or additional resources.
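To make the idea of lemmatization as classification concrete, the sketch below encodes each wordform-to-lemma mapping as a class label of the form (number of characters to strip from the end, suffix to append). This suffix-rule encoding is a simplifying assumption for illustration; the abstract does not specify the label encoding Morfette actually uses, and the function names are hypothetical.

```python
def lemma_class(wordform, lemma):
    """Encode the wordform -> lemma mapping as (chars_to_strip, suffix_to_add)."""
    k = 0
    while k < min(len(wordform), len(lemma)) and wordform[k] == lemma[k]:
        k += 1  # k ends up as the length of the longest common prefix
    return (len(wordform) - k, lemma[k:])

def apply_class(wordform, label):
    """Recover a lemma by applying a class label to a wordform."""
    strip, suffix = label
    return wordform[:len(wordform) - strip] + suffix

# Different wordforms can share one class, which is what lets a
# classifier generalize to wordforms it has not seen in training.
label_walked = lemma_class("walked", "walk")
label_talked = lemma_class("talked", "talk")
label_studies = lemma_class("studies", "study")
predicted = apply_class("parties", label_studies)
```

Because the labels form a finite set learned from the training corpus, any standard classifier can predict them from features of the wordform and its context.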
Link: pdf
Automatic Learning and Evaluation of User-Centered Objective Functions for Dialogue System Optimisation
Verena Rieser and Oliver Lemon
School of Informatics, University of Edinburgh, UK
{vrieser,olemon}@inf.ed.ac.uk
Abstract
The ultimate goal when building dialogue systems is to satisfy the needs of real users, but quality assurance for dialogue strategies is a non-trivial problem. The applied evaluation metrics and resulting design principles are often obscure, emerge by trial-and-error, and are highly context dependent. This paper introduces data-driven methods for obtaining reliable objective functions for system design. In particular, we test whether an objective function obtained from Wizard-of-Oz (WOZ) data is a valid estimate of real users’ preferences. We test this in a test-retest comparison between the model obtained from the WOZ study and the models obtained when testing with real users. We can show that, despite a low fit to the initial data, the objective function obtained from WOZ data makes accurate predictions for automatic dialogue evaluation, and, when automatically optimising a policy using these predictions, the improvement over a strategy simply mimicking the data becomes clear from an error analysis.
Link: pdf
Chinese Core Ontology Construction from a Bilingual Term Bank
Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying
Department of Computing, the Hong Kong Polytechnic University
E-mail: csyrchen@comp.polyu.edu.hk, csluqin@comp.polyu.edu.hk, cswjli@comp.polyu.edu.hk, csgycui@comp.polyu.edu.hk
Abstract
A core ontology is a mid-level ontology which bridges the gap between an upper ontology and a domain ontology. Automatic Chinese core ontology construction can help to model domain knowledge quickly. A graph-based core ontology construction algorithm (COCA) is proposed to automatically construct a core ontology from an English-Chinese bilingual term bank. The algorithm computes the mapping strength from a selected Chinese term to a WordNet synset associated with an upper-level SUMO concept. The strength is measured using a graph model that integrates several mapping features from multiple information sources. The features include a multiple-translation feature between Chinese core terms and WordNet, an extended string feature, and a part-of-speech feature. Evaluation of COCA, repeated on an English-Chinese bilingual term bank with more than 130K entries, shows that the algorithm performs better than our previous work and can better serve the semi-automatic construction of mid-level ontologies.
Link: pdf
Semantically Annotated Snapshot of the English Wikipedia
Jordi Atserias, Hugo Zaragoza, Massimiliano Ciaramita, Giuseppe Attardi
Yahoo! Research Barcelona, U. Pisa, on sabbatical at Yahoo! Research
C/Ocata 1
Barcelona 08003
Spain
{jordi, hugoz, massi}@yahoo-inc.com, attardi@di.unipi.it
Abstract
This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikipedia. In recent years Wikipedia has become a valuable resource for both the Natural Language Processing (NLP) community and the Information Retrieval (IR) community. Although NLP technology for processing Wikipedia already exists, not all researchers and developers have the computational resources to process such a volume of information. Moreover, the use of different versions of Wikipedia processed differently might make it difficult to compare results. The aim of this work is to provide easy access to syntactic and semantic annotations for researchers of both NLP and IR communities by building a reference corpus to homogenize experiments and make results comparable. These resources, a semantically annotated corpus and an “entity containment” derived graph, are licensed under the GNU Free Documentation License and available from http://www.yr-bcn.es/semanticWikipedia.
Link: pdf
Building the Valency Lexicon of Arabic Verbs
Viktor Bielický, Otakar Smrž
Institute of Formal and Applied Linguistics, Charles University in Prague
Malostranské náměstí 25, Prague 1, 118 00, Czech Republic
padt@ufal.mff.cuni.cz
Abstract
This paper describes the building of a valency lexicon of Arabic verbs using a morphologically and syntactically annotated corpus, the Prague Arabic Dependency Treebank, as its primary source. We present the theoretical account of valency developed within the Functional Generative Description theory. We apply the framework to Arabic and discuss various valency-related phenomena with reference to examples from the corpus. We then outline the methodology and the linguistic and technical resources used in building the lexicon. Valency lexicons can find application in automatic parsing as well as in language generation.
Link: pdf
F0 of Adolescent Speakers – First Results for the German Ph@ttSessionz Database
Christoph Draxler, Florian Schiel, Tania Ellbogen
BAS Bavarian Archive of Speech Signals, University of Munich, Germany
draxler@phonetik.uni-muenchen.de, schiel@phonetik.uni-muenchen.de, ellbogen@phonetik.uni-muenchen.de
Abstract
The first release of the German Ph@ttSessionz speech database contains read and spontaneous speech from 864 adolescent speakers and is the largest database of its kind for German. It was recorded via the WWW in over 40 public schools in all dialect regions of Germany. In this paper, we present a cross-sectional study of f0 measurements on this database. The study documents the profound changes in male voices at ages 13-15. Furthermore, it shows that on a perceptual mel scale, there is little difference in the relative f0 variability of male and female speakers. A closer analysis reveals that f0 variability depends on the speech style and on both the length and the type of the utterance. The study provides statistically reliable voice parameters of adolescent speakers for German. The results may contribute to making spoken dialog systems more robust by restricting user input to utterances with low f0 variability.
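For readers unfamiliar with the mel scale, the sketch below converts f0 values from Hz to mel and measures their spread. It assumes the widely used formula mel = 2595 * log10(1 + f / 700) and uses the population standard deviation as the variability measure; the abstract does not state which mel formula or which variability measure the study used, so both are assumptions, and the f0 values are invented.

```python
import math
from statistics import pstdev

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mel (assumed formula, see the note above)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def f0_variability_mel(f0_values_hz):
    """Spread of f0 on the mel scale: population standard deviation in mel."""
    return pstdev(hz_to_mel(f) for f in f0_values_hz)

# Invented f0 tracks (Hz) for a lower-pitched and a higher-pitched speaker.
low_voice = [100.0, 110.0, 120.0, 130.0, 140.0]
high_voice = [200.0, 220.0, 240.0, 260.0, 280.0]

variability_low = f0_variability_mel(low_voice)
variability_high = f0_variability_mel(high_voice)
```

Because the mel scale compresses higher frequencies, the gap in spread between the two speakers is smaller in mel than in Hz, which is the kind of comparison a perceptual scale is meant to support.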
Link: pdf