NATURAL LANGUAGE INTERFACES TO DATABASES

Yohan Chandra

Thesis Prepared for the Degree of MASTER OF SCIENCE

UNIVERSITY OF NORTH TEXAS December 2006

APPROVED:
Rada Mihalcea, Major Professor
Yan Huang, Committee Member
Robert P. Brazile, Committee Member
Armin R. Mikler, Graduate Coordinator
Krishna Kavi, Chair of the Department of Computer Science and Engineering
Oscar Garcia, Dean of the College of Engineering
Sandra L. Terrell, Dean of the Robert B. Toulouse School of Graduate Studies

Chandra, Yohan, Natural Language Interfaces to Databases. Master of Science (Computer Science), December 2006, 62 pp., 11 tables, 15 illustrations, bibliography, 18 titles.

Natural language interfaces to databases (NLIDB) are systems that aim to bridge the gap between the languages used by humans and computers by automatically translating natural language sentences into database queries. This thesis proposes a novel approach to NLIDB that uses graph-based models. The system starts by collecting as much information as possible from existing databases and sentences, and transforms this information into a knowledge base. Given a new question, the system uses this knowledge to analyze the sentence and translate it into the corresponding database query statement. The graph-based NLIDB system uses English as the natural language, a relational database model, and SQL as the formal query language. In experiments with natural language questions run against a large database of U.S. geography information, the system performed well compared with the state of the art in the field.

CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1. INTRODUCTION
1.1. Definition and Motivation
1.1.1. Natural Language Interfaces to Databases
1.2. Scope
1.3. Linguistic Problems
1.3.1. Ambiguity
1.3.2. Nominal Compound Problem
1.3.3. Grammatical Correctness
1.3.4. Conjunction and Disjunction
1.4. Advantages and Disadvantages
1.4.1. Advantages of NLIDB
1.4.2. Disadvantages of NLIDB

CHAPTER 2. BACKGROUND
2.1. Overview
2.1.1. Pattern Matching Systems
2.1.2. Syntax Based Systems
2.1.3. Semantic Grammar Systems
2.1.4. Intermediate Representation Languages
2.2. Recent Development
2.2.1. NALIX
2.2.2. PRECISE
2.2.3. WASP

CHAPTER 3. SYSTEM OVERVIEW
3.1. Basic Concepts
3.2. System Architectures
3.2.1. Obtaining Knowledge from a Database
3.2.2. Obtaining Knowledge from Sentences
3.3. The Database

CHAPTER 4. SELECT EXTRACTION
4.1. Pattern Extraction Model
4.1.1. Graph Based 1
4.1.2. Graph Based 2
4.2. Vectorial Model
4.3. Pre-evaluation

CHAPTER 5. WHERE EXTRACTION
5.1. Identifying Hints
5.1.1. User Defined Dictionary
5.1.2. Matching to the Database
5.2. Building WHERE using Shortest Path Approach
5.2.1. Defining Values
5.2.2. Constructing Relations

CHAPTER 6. EVALUATION A