
Universität Leipzig
Faculty of Mathematics and Computer Science
Database Department

A comparison of HTML-aware tools for Web data extraction
Diploma thesis

Leipzig, September 2008

Submitted by Xavier Azagra Boronat
Master's program in Computer Science

Supervising professor: Prof. Dr. Erhard Rahm
Advisor: Dr. Andreas Thor


Index

1- Introduction

2- Data extraction process
  2.1- Characteristics of the data extraction process
  2.2- Representation of Web page elements
  2.3- HTML problems to extract data
  2.4- Ideal characteristics for a Web page to extract data: An example

3- Data extraction tools
  3.1- Related work
  3.2- A taxonomy for characterizing Web data extraction tools
  3.3- Overview of tools
  3.4- Descriptive comparison of HTML-based tools

4- Tests using the data extraction tools
  4.1- Overview of tests
  4.2- Methodology
  4.3- Problems with some of our tools
  4.3- General data extraction tests
    4.3.1- Basic data extractions
    4.3.2- Data extraction from Web search engines
    4.3.2- Data extraction from eBay
    4.3.3- Data extraction from dynamic content Web pages
  4.4- Resilience against changing HTML code
    4.4.1- Testing the resilience of our tools
    4.4.2- Structure
    4.4.3- Test 1: Delete a table column next to the extracted data
    4.4.4- Test 2: Delete previous content from the extracted data
    4.4.5- Test 3: Making modifications to DIV and SPAN tags
    4.4.6- Test 4: Duplicating extracted data
    4.4.7- Test 5: Changing order of extracted data
    4.4.8- A concrete example: Improving resilience with Robomaker against structure changes

  4.5- Precision in extracted data
    4.5.1- Precision extracting a date field
    4.5.2- Extracting data from simple text
    4.5.3- Extracting data from formatted text...............................