Information Extraction - UI

Information Extraction: Part of Speech & Named Entity Recognition
Bayu Distiawan Trisedya
[email protected]
http://bit.ly/1l6Wu5V

Extracting Information from Text

Data is stored digitally:
- Images, video, music, text

What information is stored (on the internet)?
How can we use that information?

What information is stored (on the internet)?

Structured data:

  Name              GPE
  Barack Obama      USA
  Joko Widodo       Indonesia
  Malcolm Turnbull  Australia
  Najib Razak       Malaysia

Unstructured data:

"Malcolm Bligh Turnbull is the 29th and current Prime Minister of Australia and the Leader of the Liberal Party, having assumed office in September 2015. He has served as the Member of Parliament for Wentworth since 2004."

Finding Information


How do we get a machine to understand text? One approach to this problem:

- Convert the unstructured data of natural-language sentences into structured data (tables, relational databases, etc.)
- Once the data are structured, we can use query tools such as SQL

Getting meaning from text in this way is called Information Extraction.
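As a minimal illustration of the point above, once (Name, GPE) pairs like those in the structured-data example have been extracted, they can be loaded into a database and queried with SQL. A sketch using Python's standard sqlite3 module (the table and column names are our own choices):

```python
import sqlite3

# Store extracted (Name, GPE) pairs as structured data. The rows reuse
# the example pairs from the structured-data slide above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leaders (name TEXT, gpe TEXT)")
conn.executemany(
    "INSERT INTO leaders VALUES (?, ?)",
    [("Barack Obama", "USA"), ("Joko Widodo", "Indonesia"),
     ("Malcolm Turnbull", "Australia"), ("Najib Razak", "Malaysia")],
)

# Query the structured data just like any other database table.
row = conn.execute(
    "SELECT name FROM leaders WHERE gpe = ?", ("Australia",)
).fetchone()
print(row[0])  # Malcolm Turnbull
```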




Named Entity Recognition (NER) 

A very important sub-task: find and classify names in text, for example: 


The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.

Slide acknowledgement: Stanford NLP


In the same passage, the recognized entities are classified by type: Person (e.g. Andrew Wilkie, Rob Oakeshott, Tony Windsor), Date (the 2010 election), Location, and Organization (e.g. Labor, the Greens).

Slide acknowledgement: Stanford NLP

NER Task

Given a sentence, produce the list of tokens with their POS tags and entity tags. Example:

Input:
"George Washington was unanimously elected the first President of the United States"

Output:
(S
  (PERSON George/NNP Washington/NNP)
  was/VBD
  unanimously/RB
  elected/VBN
  the/DT
  first/JJ
  President/NNP
  of/IN
  the/DT
  (GPE United/NNP States/NNPS))

ML Sequence Model Approach to NER

Training:
1. Collect a set of representative training documents
2. Label each token with its entity class, or other (O)
3. Design feature extractors appropriate to the text and classes
4. Train a sequence classifier to predict the labels from the data

Testing:
1. Receive a set of testing documents
2. Run sequence model inference to label each token
3. Output the recognized entities appropriately

Slide acknowledgement: Stanford NLP

Encoding Classes for Sequence Labeling

  Token  Fred   showed  Sue    Mengqiu  Huang  's  new  painting
  IO     PER    O       PER    PER      PER    O   O    O
  IOB    B-PER  O       B-PER  B-PER    I-PER  O   O    O

Slide acknowledgement: Stanford NLP
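The advantage of IOB over IO is that adjacent entities stay separable: "Sue" and "Mengqiu Huang" both begin with B-PER, so a decoder can split them, while plain IO tags would merge the run of PER tokens into one name. A minimal sketch of decoding IOB tags back into entity spans (the function name is ours, not from any library):

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (label, text) entity spans.

    B-XXX starts a new entity of class XXX; I-XXX continues the
    current one; O is outside any entity.
    """
    spans = []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            spans.append([tag[2:], [tok]])       # open a new entity
        elif tag.startswith("I-") and spans:
            spans[-1][1].append(tok)             # extend the current entity
    return [(label, " ".join(words)) for label, words in spans]

tokens = ["Fred", "showed", "Sue", "Mengqiu", "Huang",
          "'s", "new", "painting"]
iob = ["B-PER", "O", "B-PER", "B-PER", "I-PER", "O", "O", "O"]
print(iob_to_spans(tokens, iob))
# [('PER', 'Fred'), ('PER', 'Sue'), ('PER', 'Mengqiu Huang')]
```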

Features for Sequence Labeling

- Words
  - Current word (essentially like a learned dictionary)
  - Previous/next word (context)
- Other kinds of inferred linguistic classification
  - Part-of-speech tags
- Label context
  - Previous (and perhaps next) label
- Word substrings (e.g. cotrimoxazole, ciprofloxacin, sulfamethoxazole)

Slide acknowledgement: Stanford NLP
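A sketch of a feature extractor along these lines. The function name and the exact feature set are our own illustrative choices, not a fixed API:

```python
def token_features(tokens, pos_tags, prev_label, i):
    """Feature dict for position i: current/previous/next word,
    POS tag, previous label, and simple substring/shape cues."""
    word = tokens[i]
    return {
        "word": word,                                    # learned-dictionary feature
        "prev_word": tokens[i - 1] if i > 0 else "<S>",  # left context
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</S>",
        "pos": pos_tags[i],                              # inferred linguistic class
        "prev_label": prev_label,                        # label context
        "suffix3": word[-3:].lower(),                    # word-substring feature
        "has_digit": any(c.isdigit() for c in word),     # shape cue
    }

feats = token_features(["The", "Dow", "fell", "22.6", "%"],
                       ["DT", "NNP", "VBD", "CD", "NN"], "O", 3)
print(feats["prev_word"], feats["has_digit"])  # fell True
```

Such a dictionary is exactly the shape of input a maximum entropy (logistic regression) classifier consumes, one decision per token.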

Sequence Problems

Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences. We can think of our task as one of labeling each item.

POS tagging:

  Chasing  opportunity  in  an  age  of  upheaval
  VBG      NN           IN  DT  NN   IN  NN

Named entity recognition:

  Murdoch  discusses  future  of  News  Corp.
  PERS     O          O       O   ORG   ORG

Slide acknowledgement: Stanford NLP

MEMM Inference in Systems

For a Conditional Markov Model (CMM), a.k.a. a Maximum Entropy Markov Model (MEMM), the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions. A larger space of sequences is usually explored via search.

Local context (the decision point is position 0):

  Position  -3   -2   -1    0     +1
  Tag       DT   NNP  VBD   ???   ???
  Word      The  Dow  fell  22.6  %

Features at the decision point:

  W0         22.6
  W+1        %
  W-1        fell
  T-1        VBD
  T-1-T-2    NNP-VBD
  hasDigit?  true

(Ratnaparkhi 1996; Toutanova et al. 2003, etc.)

Slide acknowledgement: Stanford NLP
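The simplest form of this decision process is greedy left-to-right decoding, where each classification sees the observations plus the previous decision. A sketch, with a toy rule-based classifier standing in for a trained maximum entropy model (beam or Viterbi-style search would explore more of the sequence space than this greedy loop does):

```python
def greedy_decode(tokens, classify):
    """Label tokens left to right; each decision conditions on the
    observation sequence and the previous decision (CMM/MEMM style)."""
    labels, prev = [], "<S>"
    for i in range(len(tokens)):
        label = classify(tokens, i, prev)  # one local decision at a time
        labels.append(label)
        prev = label                       # feed the decision forward
    return labels

# Toy stand-in classifier: capitalized words begin or continue a name.
def toy_classify(tokens, i, prev):
    if tokens[i].istitle():
        return "I" if prev in ("B", "I") else "B"
    return "O"

print(greedy_decode(["New", "York", "is", "big"], toy_classify))
# ['B', 'I', 'O', 'O']
```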



Inference in Systems

(Diagram: at the sequence level, a sequence model plus an inference procedure assigns labels to sequence data; at the local level, feature extraction maps local data to features for a maximum entropy classifier, trained with conjugate gradient optimization and smoothed with quadratic penalties.)

POS & NER on Python: Tools Preparation

1. Install Anaconda Python (http://telaga.cs.ui.ac.id/~b.distiawan/summer_school/tools)
2. Install the nltk module: pip install nltk
3. Download the nltk models:
   - Enter Python interactive mode
   - Type: import nltk; nltk.download()
   - Download the models: maxent_ne_chunker, maxent_treebank_pos_tagger, tagsets
     (the punkt and words resources are also needed by the tokenizer and NE chunker)

POS & NER on Python

Make file D:\NER\ner.py:

import nltk

sentence = ("George Washington was unanimously elected "
            "the first President of the United States")

# Step 1: split the sentence into tokens
tok_sentence = nltk.word_tokenize(sentence)
print(tok_sentence)

# Step 2: tag each token with its part of speech
tagged_sentence = nltk.pos_tag(tok_sentence)
print(tagged_sentence)

# Step 3: chunk named entities from the tagged tokens
tree = nltk.ne_chunk(tagged_sentence)
print(tree)
tree.draw()

POS & NER on Python

Step 1: Tokenization. Split a sentence into its tokens (words):

tok_sentence = nltk.word_tokenize(sentence)
print(tok_sentence)

POS & NER on Python

Step 2: POS Tagging. Identify the part of speech (POS) of each token:

tagged_sentence = nltk.pos_tag(tok_sentence)
print(tagged_sentence)

To see the list of tags in the tag set:

nltk.help.upenn_tagset()

POS & NER on Python

Step 3: NE Recognition. Given the tokens and their POS tags, identify the entity of each token:

tree = nltk.ne_chunk(tagged_sentence)
print(tree)

POS & NER on Python

Error in recognition:

(S
  (PERSON George/NNP)
  (GPE Washington/NNP)
  was/VBD
  unanimously/RB
  elected/VBN
  the/DT
  first/JJ
  President/NNP
  of/IN
  the/DT
  (GPE United/NNP States/NNPS))

Why does this happen?
How can we fix (improve) the NER?

How to Train a POS Model

1. Prepare the corpus
   - IO / IOB format; example: http://telaga.cs.ui.ac.id/~b.distiawan/POS/corpus
   - Put the *.iob files in your corpus directory, e.g. D:/POS/corpus

How to Train a POS Model

2. Build the POS model using your own corpus
   - Download the training script: http://telaga.cs.ui.ac.id/~b.distiawan/POS/trainer.zip
   - Extract the zip file to D:/POS/
   - Run the following commands:

     D:
     cd POS\trainer
     python train_tagger.py "D:\POS\corpus" --reader nltk.corpus.reader.tagged.TaggedCorpusReader --sequential b --classifier Maxent --filename D:\POS\model.pickles --no-eval

How to Train a POS Model

3. Use the model
   - Copy D:\POS\trainer\nltk_trainer to D:\POS\
   - Make D:\POS\pos.py:

import nltk
import pickle

# Load the trained tagger (pickles must be opened in binary mode)
with open(r"D:\POS\model.pickles", "rb") as f:
    tagger = pickle.load(f)

sentence = "John Kuhns eats pizza"
tok_sentence = nltk.word_tokenize(sentence)
tagged_sentence = tagger.tag(tok_sentence)
print(tagged_sentence)

How to Train an NER Model

1. Prepare the corpus
   - IO / IOB format; example: http://telaga.cs.ui.ac.id/~b.distiawan/NER/corpus
   - Put the *.iob files in your corpus directory, e.g. D:/NER/corpus

How to Train an NER Model

2. Build the NER model using your own corpus
   - Download the training script: http://telaga.cs.ui.ac.id/~b.distiawan/NER/trainer.zip
   - Extract the zip file to D:/NER/
   - Run the following commands:

     D:
     cd NER\trainer
     python train_chunker.py "D:\NER\corpus" --reader nltk.corpus.reader.ChunkedCorpusReader --fraction 0.85 --sequential '' --classifier Maxent --filename D:\NER\model.pickles --no-eval

How to Train an NER Model

3. Use the model
   - Copy D:\NER\trainer\nltk_trainer to D:\NER\
   - Make D:\NER\ner2.py:

import nltk
import pickle

# Load the trained chunker (pickles must be opened in binary mode)
with open(r"D:\NER\model.pickles", "rb") as f:
    chunker = pickle.load(f)

sentence = "John Kuhns eats pizza"
tok_sentence = nltk.word_tokenize(sentence)
tagged_sentence = nltk.pos_tag(tok_sentence)
tree = chunker.parse(tagged_sentence)
tree.draw()

How to Train an NER Model

Try another sentence:

sentence = "George Washington eats pizza"

What happened? Take a moment to fix your model, for example with training lines such as:

George      NNP  B-PERS
Washington  NNP  I-PERS
was         VBD  O
the         DT   O
first       JJ   O
president   NN   O
of          IN   O
America     NNP  B-LOC
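To work with such *.iob training lines programmatically, a small parser can turn them into (token, POS, tag) triples. This is only a sketch for inspecting your data; the nltk-trainer corpus readers handle the actual loading during training:

```python
def parse_iob(text):
    """Parse whitespace-separated 'token POS tag' lines into
    (token, pos, tag) triples, skipping blank or malformed lines."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) == 3:                # expect exactly token, POS, tag
            rows.append(tuple(parts))
    return rows

sample = """George NNP B-PERS
Washington NNP I-PERS
was VBD O"""
print(parse_iob(sample))
# [('George', 'NNP', 'B-PERS'), ('Washington', 'NNP', 'I-PERS'), ('was', 'VBD', 'O')]
```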

Homework

Build your own model using a large dataset. You can make your own IOB file or generate it from an available corpus, e.g.:

http://schwa.org/projects/resources/wiki/Wikiner