Information Extraction: Techniques, Advances and Challenges

Heng Ji
Computer Science Department and Linguistics Department
Queens College and Graduate Center, City University of New York
June 12, 2012

Outline

• Introduction
• Basic Information Extraction (IE)
• Advanced IE
• Popular Research Directions
    • Enhance Quality
    • Enhance Portability
    • Cross-source IE
    • IE for Noisy Data
• Resources

Query name: Jim Parsons
Source document: eng-WL-11-174592-12943233
Entity type: PER
KB node: E0300113
Slots listed in the query: per:date_of_birth, per:age, per:country_of_birth, per:city_of_birth
School Attended: University of Houston

Slot Types

Person:
per:alternate_names, per:date_of_birth, per:age, per:country_of_birth, per:stateorprovince_of_birth, per:city_of_birth, per:origin, per:date_of_death, per:country_of_death, per:stateorprovince_of_death, per:city_of_death, per:cause_of_death, per:countries_of_residence, per:stateorprovinces_of_residence, per:cities_of_residence, per:schools_attended, per:title, per:member_of, per:employee_of, per:religion, per:spouse, per:children, per:parents, per:siblings, per:other_family, per:charges

Organization:
org:alternate_names, org:political/religious_affiliation, org:top_members/employees, org:number_of_employees/members, org:members, org:member_of, org:subsidiaries, org:parents, org:founded_by, org:founded, org:dissolved, org:country_of_headquarters, org:stateorprovince_of_headquarters, org:city_of_headquarters, org:shareholders, org:website
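As a rough illustration of how the slot inventory and a query fit together, the sketch below stores the slot names listed above and a query record for the Jim Parsons example. The class name `SlotFillingQuery`, its field names, and the helper `slots_to_fill` are hypothetical. The sketch also assumes that the slots listed in a query are ones the system should skip, which the slides do not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Slot names copied from the slot type inventory on the slide.
PERSON_SLOTS: List[str] = """
per:alternate_names per:date_of_birth per:age per:country_of_birth
per:stateorprovince_of_birth per:city_of_birth per:origin per:date_of_death
per:country_of_death per:stateorprovince_of_death per:city_of_death
per:cause_of_death per:countries_of_residence per:stateorprovinces_of_residence
per:cities_of_residence per:schools_attended per:title per:member_of
per:employee_of per:religion per:spouse per:children per:parents per:siblings
per:other_family per:charges
""".split()

ORG_SLOTS: List[str] = """
org:alternate_names org:political/religious_affiliation org:top_members/employees
org:number_of_employees/members org:members org:member_of org:subsidiaries
org:parents org:founded_by org:founded org:dissolved org:country_of_headquarters
org:stateorprovince_of_headquarters org:city_of_headquarters org:shareholders
org:website
""".split()

SLOTS_BY_TYPE: Dict[str, List[str]] = {"PER": PERSON_SLOTS, "ORG": ORG_SLOTS}


@dataclass
class SlotFillingQuery:
    """Hypothetical record for one query; field names are illustrative."""
    name: str
    doc_id: str
    entity_type: str
    kb_node: str
    listed_slots: List[str] = field(default_factory=list)


def slots_to_fill(query: SlotFillingQuery) -> List[str]:
    """Return the slots for the query's entity type, minus the listed ones.

    ASSUMPTION: slots listed in the query are to be skipped.
    """
    skip = set(query.listed_slots)
    return [s for s in SLOTS_BY_TYPE[query.entity_type] if s not in skip]


example_query = SlotFillingQuery(
    name="Jim Parsons",
    doc_id="eng-WL-11-174592-12943233",
    entity_type="PER",
    kb_node="E0300113",
    listed_slots=["per:date_of_birth", "per:age",
                  "per:country_of_birth", "per:city_of_birth"],
)
remaining = slots_to_fill(example_query)
print(len(remaining), "slots left to fill")
```

Under that assumption, per:schools_attended stays in the list of slots to fill, which is consistent with the "School Attended" answer in the example.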

How much Inference is Needed?

• It is difficult to push performance above F = 0.30
• High entry cost for competitive performance; needs good performance at each IE component

Why KBP is more difficult than ACE

• Cross-slot Inference (per:children)
    People Magazine has confirmed that actress Julia Roberts has given birth to her third child a boy named Henry Daniel Moder. Henry was born Monday in Los Angeles and weighed 8 lbs. Roberts, 39, and husband Danny Moder, 38, are already parents to twins Hazel and Phinnaeus who were born in November 2006.
    son-of (Julia Roberts, Henry Moder) & spouse-of (Julia Roberts, Danny Moder) → (usually) son-of (Danny Moder, Henry Moder)
• 25% of the examples involve coreference that is beyond current system capabilities, such as nominal anaphors:
    • “Alexandra Burke is out with the video for her second single … taken from the British artist’s debut album”
    • “a woman charged with running a prostitution ring … her business, Pamela Martin and Associates”
• Systems would benefit from specialists that are able to reason about times, locations, family relationships, and employment relationships.
• It places more emphasis on cross-document entity resolution, which received limited effort in ACE.
• It forces systems to deal with redundant and conflicting answers across large corpora.
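The cross-slot inference example above (a child of one spouse is usually also a child of the other) can be sketched as a small rule over extracted facts. This is only an illustration: the `Fact` tuple and the function `infer_children` are hypothetical names, and the rule is a default that a real system would treat as a lower-confidence candidate rather than a certainty.

```python
from typing import Dict, NamedTuple, Set


class Fact(NamedTuple):
    relation: str  # slot name, e.g. "per:children" or "per:spouse"
    subject: str   # the entity the slot belongs to
    value: str     # the slot filler


def infer_children(facts: Set[Fact]) -> Set[Fact]:
    """Propose per:children fillers for the spouse of a known parent.

    Default rule only: the source text marks the inference as "usually" true.
    """
    # Build a symmetric spouse lookup.
    spouses: Dict[str, Set[str]] = {}
    for f in facts:
        if f.relation == "per:spouse":
            spouses.setdefault(f.subject, set()).add(f.value)
            spouses.setdefault(f.value, set()).add(f.subject)

    inferred: Set[Fact] = set()
    for f in facts:
        if f.relation == "per:children":
            for partner in spouses.get(f.subject, set()):
                candidate = Fact("per:children", partner, f.value)
                if candidate not in facts:
                    inferred.add(candidate)
    return inferred


extracted = {
    Fact("per:children", "Julia Roberts", "Henry Moder"),
    Fact("per:spouse", "Julia Roberts", "Danny Moder"),
}
for fact in infer_children(extracted):
    print(fact)
```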

Cross-media IE

(Lee et al., 2010; Qi et al., 2011)

Fact Type Examples in Cross-Media IE

Cross-lingual IE

Query name: Ang Lee
Source document: XIN20030616.0130.0053
Entity type: PER
KB node: E0300112
Slots listed in the query: per:date_of_birth, per:spouse, per:children
Parent: Li Sheng
Birth-place: Taiwan Pindong City
Residence: Hua Lian
Attended-School: NYU

(Snover et al., 2011)

Alternative Cross-lingual IE Pipelines

References:
• Riloff et al., 2002
• Sudo et al., 2004
• Hakkani-Tür et al., 2007
• Snover et al., 2011

Impact of MT Errors on Cross-lingual IE

• Re-training extraction components directly from MT output did not help
    • MT errors were too diverse to generalize
• 59% of the missing errors were due to text, query or answer translation errors; 20% were due to slot filling errors
• Name translation is a bottleneck

Source Text
俄塔社援引紧急情况部莫斯科市总局新闻处处长博贝列夫 (Bo Bei Lie Fu)的话...

Reference Translation
The Russian news agency Tass, quoting Director Bobylev of the news office of the Moscow city headquarters of the Emergency Situation Department...

Various MT System Translations
• Russia 's Tass news agency quoted the ministry for emergency situations of the Moscow city , Director of Information Services , German Gref...
• Itar-Tass quoted the Emergency Situations Ministry in Moscow City Administration Director Bo , yakovlev...
• Russia 's Tass news agency of the Ministry of Emergency Situations Moscow city administration of Addis Ababa , Director of Information Services...
• Russian news agency quoted the ministry of emergency situations in Moscow city administration of the Director of Information Services , A. Kozyrev...
• Itar-Tass quoted the Emergencies Ministry in Moscow , the Director of information in 1988 lev...

Cross-lingual Validation

Features are grouped by their characteristics (Scope; Depth; Language).

Scope: Global (Cross-system); Depth: Shallow; Language: English
    f1: frequency of […] that appears in all baseline outputs
    f2: number of conflicting slot types in which answer a appears in all baseline outputs

Scope: Local; Depth: Shallow; Language: English
    f3: conjunction of t and whether a is a year answer
    f4: conjunction of t and whether a includes numbers or letters
    f5: conjunction of place t and whether a is a country name

Scope: Local; Depth: Deep (based on IE); Language: English
    f6: conjunction of per:origin t and whether a is a nationality
    f7: if t=per:title, whether a is an acceptable title
    f8: if t requires a name answer, whether a is a name
    f9: whether a has appropriate semantic type

Scope: Global (Within-Document); Depth: Deep (based on IE); Language: English
    f10: conjunction of org:top_members/employees and whether there is a high-level title in s
    f11: conjunction of alternative name and whether a is an acronym of q

Scope: Global (Cross-document in comparable corpora); Depth: Shallow (Statistics)
    f12 (Chinese): conditional probability of q/q' and a/a' appear in the same document
    f13 (English): conditional probability of q/q' and a/a' appear in the same sentence
    f14 (Both): co-occurrence of q/q' and a/a' appear in coreference links

Scope: Global (Cross-document in comparable corpora); Depth: Deep (Fact-based on InfoNets); Language: English
    f15: co-occurrence of q/q' and a/a' appear in relation/event links
    f16: conditional probability of q/q' and a/a' appear in relation/event links
    f17: mutual information of q/q' and a/a' appear in relation/event links

Achieved 11% absolute F-measure gain (Snover et al., 2011)
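To make a few of the validation features concrete, here is a minimal sketch of three of them for a query q, a candidate answer a, and a slot type t: f3 (conjunction of t with a year test), f11 (alternative-name slot with an acronym test), and f13 (conditional probability of q and a appearing in the same sentence). The function names, the year pattern, the acronym heuristic, and the use of plain substring matching are all assumptions made for illustration, not the definitions used by Snover et al. (2011).

```python
import re
from typing import Dict, List, Union

# ASSUMPTION: a "year answer" is a four-digit number from 1000 to 2099.
YEAR_PATTERN = re.compile(r"^(1[0-9]{3}|20[0-9]{2})$")


def is_year(answer: str) -> bool:
    return bool(YEAR_PATTERN.match(answer.strip()))


def is_acronym(answer: str, query: str) -> bool:
    """Heuristic: answer equals the initials of the capitalized query words."""
    initials = "".join(w[0] for w in query.split() if w[0].isupper())
    return len(answer) > 1 and answer.upper() == initials.upper()


def cond_prob_same_sentence(query: str, answer: str, sentences: List[str]) -> float:
    """Estimate P(a in sentence | q in sentence) by substring matching."""
    with_query = [s for s in sentences if query in s]
    if not with_query:
        return 0.0
    return sum(answer in s for s in with_query) / len(with_query)


def validation_features(query: str, answer: str, slot: str,
                        sentences: List[str]) -> Dict[str, Union[str, bool, float]]:
    return {
        # f3: conjunction of the slot type and the year test
        "f3": f"{slot}&is_year={is_year(answer)}",
        # f11: alternative-name slot combined with the acronym test
        "f11": slot.endswith(":alternate_names") and is_acronym(answer, query),
        # f13: same-sentence conditional probability
        "f13": cond_prob_same_sentence(query, answer, sentences),
    }


# Toy sentences invented for this example.
toy_sentences = [
    "Ang Lee attended NYU.",
    "Ang Lee directed a film.",
    "NYU is in New York.",
]
print(validation_features("Ang Lee", "NYU", "per:schools_attended", toy_sentences))
```

In a validation setting, feature values like these would be fed to a classifier or re-ranker that decides whether to keep each candidate answer.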

IE for ASR Output

Problems

• Using an IE system trained on newswire, performance degrades notably: 15% relative when the system is tested on broadcast news transcriptions, and 27% relative when ASR output is used instead of reference transcriptions (Makhoul et al., 2005)
• Optimizing based on downstream applications (IE) is better than optimizing F-measure (Favre et al., 2008)
• Need better pronoun resolution for speech conversation

Possible Solutions

• Optimize ASR and speech segmentation for downstream applications (JHU 2012 Summer Workshop on “Complementary Evaluation Measures for Speech Transcription”)
• Use n-best hypotheses, ASR lattices, word confusion networks, phonemes or graphemes for IE
• Improve pronoun resolution by incorporating automatic speaker role identification techniques
• Apply Modality, Polarity, Genericity analysis to reduce uncertainty
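One simple way to use n-best hypotheses for IE, sketched below under stated assumptions, is to run a name tagger on every hypothesis and keep the mentions whose score-weighted support passes a threshold. The gazetteer tagger `tag_names`, the function `vote_over_nbest`, the hypothesis scores, and the threshold are all hypothetical stand-ins. A real system would plug in a trained tagger and ASR posterior scores.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy gazetteer standing in for a trained name tagger (hypothetical).
GAZETTEER: Dict[str, str] = {"moscow": "GPE", "tass": "ORG"}

Mention = Tuple[str, str]  # (lowercased token, entity type)


def tag_names(text: str) -> List[Mention]:
    """Tag tokens found in the toy gazetteer."""
    return [(tok.lower(), GAZETTEER[tok.lower()])
            for tok in text.split() if tok.lower() in GAZETTEER]


def vote_over_nbest(nbest: List[Tuple[str, float]],
                    threshold: float = 0.5) -> Dict[Mention, float]:
    """Keep mentions whose normalized hypothesis weight reaches the threshold."""
    total = sum(score for _, score in nbest)
    if total <= 0:
        return {}
    support: Dict[Mention, float] = defaultdict(float)
    for text, score in nbest:
        # Count each mention once per hypothesis.
        for mention in set(tag_names(text)):
            support[mention] += score / total
    return {m: w for m, w in support.items() if w >= threshold}


# Invented 3-best list with made-up scores.
hypotheses = [
    ("tass quoted the ministry in moscow", 0.5),
    ("task quoted the ministry in moscow", 0.3),
    ("tass quoted the minister in most cow", 0.2),
]
print(vote_over_nbest(hypotheses, threshold=0.75))
```

The same voting idea extends to lattices or word confusion networks, where the weights would come from arc posteriors instead of whole-hypothesis scores.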

Segmenting Speech for IE

(Favre et al., 2008)

Resources

Resources: Data Sets

• ACE IE: http://projects.ldc.upenn.edu/ace/data/
    IE training data for English/Chinese/Arabic/Spanish
• CoNLL 2002: http://www.cnts.ua.ac.be/conll2002/ner.tgz
    Name tagging training data for Dutch and Spanish
• CoNLL 2003: http://www.cnts.ua.ac.be/conll2003/ner.tgz
    Name tagging training data for English and German
• KBP 2009-2012:
    http://www.nist.gov/tac/2012/KBP/data.html
    http://nlp.cs.qc.cuny.edu/kbp/2011/
    http://nlp.cs.qc.cuny.edu/kbp/2010/
    Knowledge Base Population for English, Chinese and Spanish
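The CoNLL name tagging files are plain text with one token per line, whitespace-separated columns, the name tag in the last column, and a blank line between sentences. The reader below is a minimal sketch under that assumption. The function name `read_conll` is hypothetical, and the sample lines are written here in the column layout of the English data rather than read from the distributed files.

```python
from typing import Iterable, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (token, name tag) pairs


def read_conll(lines: Iterable[str]) -> List[TaggedSentence]:
    """Group token lines into sentences, keeping the first and last columns."""
    sentences: List[TaggedSentence] = []
    current: TaggedSentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = [
    "-DOCSTART- -X- O O",
    "",
    "U.N. NNP I-NP I-ORG",
    "official NN I-NP O",
    "Ekeus NNP I-NP I-PER",
    "heads VBZ I-VP O",
    "for IN I-PP O",
    "Baghdad NNP I-NP I-LOC",
    ". . O O",
    "",
]
parsed = read_conll(sample)
names = [(tok, tag) for tok, tag in parsed[0] if tag != "O"]
print(names)
```

To read a downloaded file, the same function can be given an open file handle, since it only iterates over lines.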

Resources: Publicly Available IE Toolkits

• NYU IE: http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
• University of Sheffield IE: http://gate.ac.uk/download/index.html
• NLTK: http://nltk.sourceforge.net/
• CUNY KBP system:
    http://nlp.cs.qc.cuny.edu/kbptoolkit-1.5.0.tar.gz
    http://nlp.cs.qc.cuny.edu/Temporal_Slot_Filling_1.0.1.tar.gz

Name Taggers:
• Stanford Name Tagger: http://nlp.stanford.edu/ner/index.shtml
• UIUC Name Tagger: http://cogcomp.cs.illinois.edu/page/software_view/NETagger
• CUNY Name Taggers:
    • Chinese tagger: http://nlp.cs.qc.cuny.edu/ChineseNameTagger.tar.gz
    • English tagger: http://nlp.cs.qc.cuny.edu/en_nametagging_release.tar.gz

Coreference Resolvers:
• JavaRAP: http://aye.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html
• BART: http://bart-coref.org/
• The Illinois Coreference Package: http://cogcomp.cs.illinois.edu/page/software_view/18
• Reconcile: http://www.cs.utah.edu/nlp/reconcile/
• CherryPicker: http://www.hlt.utdallas.edu/~altaf/cherrypicker.html

Thank You and Join the IE World!