Relation Extraction using Distant Supervision, SVMs, and Probabilistic First Order Logic

Malcolm W. Greaves CMU-CS-14-128 May 2014

School of Computer Science Computer Science Department Carnegie Mellon University Pittsburgh, PA

Thesis Committee William Cohen, Chair Tom M. Mitchell

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Copyright © 2014 Malcolm W. Greaves

This research was generously supported by Google Inc. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of Google Inc., Carnegie Mellon University, or any other entity.

Keywords: Information extraction, Machine Learning, Natural Language Processing, Probabilistic First-Order Logic, Relation Extraction, Big Data, Large Scale Machine Learning, Support Vector Machines, Cost-Sensitive Learning

“Each new program that is built is an experiment. It poses a question to nature, and its behavior offers clues to an answer.”

Allen Newell (1975)

CARNEGIE MELLON UNIVERSITY

Abstract

Computer Science Department, School of Computer Science

Master of Science in Computer Science

Relation Extraction using Distant Supervision, SVMs, and Probabilistic First Order Logic

Malcolm W. Greaves

We are drowning in information and having difficulty finding knowledge: useful and actionable information. Recent studies estimate that humanity has stored in excess of 295 exabytes (295 × 10^18 bytes) of data. Much of this data is stored as unstructured text, such as news articles, message boards and forums, texts, emails, status updates, tweets, and nearly a billion webpages. In this thesis, we present an approach to extracting the knowledge contained in this unstructured text. We define our problem as one of relation extraction: given a document, extract all instantiations of well-defined binary relations present in the text. To this end, we use distant supervision and a novel probabilistic first-order logic system, combined with co-reference resolution, to identify candidate relation instances. These candidates are then classified by a series of cost-augmented, soft-margin, binary Support Vector Machines to produce the final relation extractions. On a corpus of 5.7 million newswire articles covering 27 different relations, the system achieves an across-relation, micro-averaged F1 of 42.02%. On a smaller, targeted search consisting of 10 thousand documents, it achieves an F1 of 33.15%.
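For readers unfamiliar with the metric, the following is a minimal sketch of how an across-relation, micro-averaged F1 is commonly computed: true positives, false positives, and false negatives are pooled over all relations before precision and recall are taken. The function name, the relation names, and the counts are hypothetical and chosen only for illustration; they are not taken from the thesis or its evaluation code.

```python
from typing import Dict, Tuple

# Hypothetical per-relation counts: (true positives, false positives, false negatives).
Counts = Tuple[int, int, int]


def micro_f1(per_relation: Dict[str, Counts]) -> float:
    """Pool the counts over all relations, then compute a single F1 score."""
    tp = sum(c[0] for c in per_relation.values())
    fp = sum(c[1] for c in per_relation.values())
    fn = sum(c[2] for c in per_relation.values())
    if tp == 0:
        # No correct extractions at all: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts for two illustrative relations (not results from the thesis).
example = {"bornIn": (8, 2, 4), "worksFor": (2, 8, 6)}
score = micro_f1(example)
```

Because the counts are pooled, relations with many instances influence the micro-averaged score more than rare ones, which is the usual contrast with a macro average that weights every relation equally.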

Acknowledgements

I stand on the shoulders of giants. For this, I am eternally grateful. I acknowledge all of those who have studied the relation extraction task and have disseminated their knowledge. Without the results compiled from a community of scientists, engineers, and researchers, this thesis work would not exist.

I want to give a special thanks to my thesis advisor, Professor William Cohen. William has been an absolutely excellent mentor and teacher. He taught me many lessons, ranging from extremely useful, practical algorithmic implementations to how to balance research and life and the mindset required for research. He has graciously given me his time through the years to help me become a more effective researcher and, ultimately, person. I also sincerely appreciate his financial generosity toward me as a graduate student.

In addition to Professor Cohen, I want to thank Professor Tom Mitchell. Tom was the first person to give me a chance to learn about natural language processing, information extraction, and machine learning. When I was a first-semester freshman, Tom invited me to sit in on his research group meetings, where I became acquainted with the world of research. His seemingly carefree attitude, mixed with his firm, directed, and intense focus, has always inspired me. He gave me an excellent foothold to step into the world of research. I would also like to th