Information Extraction: Part of Speech & Named Entity Recognition
Bayu Distiawan Trisedya
[email protected] http://bit.ly/1l6Wu5V
Extracting Information from Text
Data is stored digitally: images, video, music, text
What information is stored (on the internet)?
How can we use that information?
What information is stored (on the internet)?
Structured data:

  Name              GPE
  Barack Obama      USA
  Joko Widodo       Indonesia
  Malcolm Turnbull  Australia
  Najib Razak       Malaysia

Unstructured data:
“Malcolm Bligh Turnbull is the 29th and current Prime Minister of Australia and the Leader of the Liberal Party, having assumed office in September 2015. He has served as the Member of Parliament for Wentworth since 2004.”
Finding Information
How do we get a machine to understand the text? One approach to this problem:
Convert the unstructured data of natural language sentences into structured data: tables, relational databases, etc.
Once the data are structured, we can use query tools such as SQL.
Getting meaning from text is called Information Extraction.
Named Entity Recognition (NER)
A very important sub-task: find and classify names in text, for example:
The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Slide acknowledgement: Stanford NLP
Named Entity Recognition (NER)
In the passage above, the highlighted entity classes are Person, Date, Location, and Organization.
Slide acknowledgement: Stanford NLP
NER Task
Given a sentence, produce the list of tokens with their POS tags and entity tags. Example:
Input:
“George Washington was unanimously elected the first President of the United States”
Output:
(S (PERSON George/NNP Washington/NNP) was/VBD unanimously/RB elected/VBN the/DT first/JJ President/NNP of/IN the/DT (GPE United/NNP States/NNPS))
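The bracketed output above can be post-processed into (entity, label) pairs. A minimal pure-Python sketch (the regex and the list of entity labels are illustrative assumptions; it only handles the single-level bracketing shown above):

```python
import re

output = ("(S (PERSON George/NNP Washington/NNP) was/VBD unanimously/RB "
          "elected/VBN the/DT first/JJ President/NNP of/IN the/DT "
          "(GPE United/NNP States/NNPS))")

# Each entity chunk looks like "(LABEL word/TAG word/TAG ...)"
chunks = re.findall(r"\((PERSON|GPE|ORGANIZATION|LOCATION|DATE)\s+([^)]+)\)", output)
entities = [(" ".join(tok.split("/")[0] for tok in words.split()), label)
            for label, words in chunks]
print(entities)  # [('George Washington', 'PERSON'), ('United States', 'GPE')]
```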
ML sequence model approach to NER
Training:
1. Collect a set of representative training documents
2. Label each token for its entity class or other (O)
3. Design feature extractors appropriate to the text and classes
4. Train a sequence classifier to predict the labels from the data
Testing:
1. Receive a set of testing documents
2. Run sequence model inference to label each token
3. Appropriately output the recognized entities
Slide acknowledgement: Stanford NLP
Encoding classes for sequence labeling

  Token:  Fred   showed  Sue    Mengqiu  Huang  's  new  painting
  IO:     PER    O       PER    PER      PER    O   O    O
  IOB:    B-PER  O       B-PER  B-PER    I-PER  O   O    O

Slide acknowledgement: Stanford NLP
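The two encodings can be related in code. A minimal sketch of an IO-to-IOB conversion (the function name is illustrative). Note that the conversion is lossy: IO cannot mark a boundary between adjacent entities, so "Mengqiu" comes out as I-PER rather than the correct B-PER, which is exactly why IOB is the richer encoding:

```python
def io_to_iob(tags):
    """Convert IO tags (e.g. PER, O) to IOB tags (B-PER, I-PER, O)."""
    iob = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            iob.append("O")
        elif tag == prev:
            iob.append("I-" + tag)   # same class as previous token: continue
        else:
            iob.append("B-" + tag)   # class change: begin a new entity
        prev = tag
    return iob

io_tags = ["PER", "O", "PER", "PER", "PER", "O", "O", "O"]
print(io_to_iob(io_tags))
# Gives I-PER for "Mengqiu" where the gold IOB tags have B-PER:
# the Sue / Mengqiu Huang boundary is unrecoverable from IO tags.
```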
Features for sequence labeling
Words
  Current word (essentially like a learned dictionary)
  Previous/next word (context)
Other kinds of inferred linguistic classification
  Part-of-speech tags
Label context
  Previous (and perhaps next) label
Word substrings
  e.g. cotrimoxazole, ciprofloxacin, sulfamethoxazole
Slide acknowledgement: Stanford NLP
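The feature types listed above could be collected into a feature dictionary per token, roughly as follows (a sketch only; the function name, the sentinel symbols, and the exact feature set are illustrative assumptions):

```python
def token_features(tokens, pos_tags, prev_label, i):
    """Build a feature dict for token i from the feature types above."""
    word = tokens[i]
    return {
        "word": word,                                    # current word
        "prev_word": tokens[i - 1] if i > 0 else "<S>",  # left context
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "</S>",
        "pos": pos_tags[i],                              # POS tag
        "prev_label": prev_label,                        # label context
        "suffix3": word[-3:],                            # word substring
        "has_digit": any(c.isdigit() for c in word),
    }

feats = token_features(["Murdoch", "discusses", "future"],
                       ["NNP", "VBZ", "NN"], "O", 0)
print(feats["word"], feats["pos"], feats["suffix3"])
```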
Sequence problems
Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences. We can think of our task as one of labeling each item.

POS tagging:
  Chasing/VBG opportunity/NN in/IN an/DT age/NN of/IN upheaval/NN

Named entity recognition:
  Murdoch/PERS discusses/O future/O of/O News/ORG Corp./ORG

Slide acknowledgement: Stanford NLP
MEMM inference in systems
For a Conditional Markov Model (CMM), a.k.a. a Maximum Entropy Markov Model (MEMM), the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions. A larger space of sequences is usually explored via search.

Local context (decision point at position 0):
  Position:  -3   -2   -1    0     +1
  Word:      The  Dow  fell  22.6  %
  Tag:       DT   NNP  VBD   ???   ???

Features:
  W0 = 22.6, W+1 = %, W-1 = fell, T-1 = VBD, T-1-T-2 = NNP-VBD, hasDigit? = true, …

(Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
Slide acknowledgement: Stanford NLP
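The left-to-right, one-decision-at-a-time inference loop can be sketched as follows, with a hand-written toy rule function standing in for the trained maximum-entropy local classifier (everything here is an illustrative assumption, not the actual Stanford implementation, and greedy search is used instead of beam search):

```python
def toy_classifier(features):
    """Stand-in for a trained MaxEnt local classifier (toy rules)."""
    if features["has_digit"]:
        return "CD"
    if features["prev_tag"] == "<S>":
        return "DT"
    if features["word"][0].isupper():
        return "NNP"
    return "VBD"

def greedy_memm_tag(tokens):
    """One decision per token, conditioned on the word and the previous tag."""
    tags = []
    prev_tag = "<S>"
    for word in tokens:
        features = {"word": word,
                    "prev_tag": prev_tag,  # previous decision feeds the next one
                    "has_digit": any(c.isdigit() for c in word)}
        tag = toy_classifier(features)
        tags.append(tag)
        prev_tag = tag
    return tags

print(greedy_memm_tag(["The", "Dow", "fell", "22.6", "%"]))
```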
Inference in Systems
[Diagram: at the sequence level, a sequence model plus inference runs over sequence data; at the local level, feature extraction over local data feeds maximum entropy models (classifier type and features), producing a label per item, with optimization by conjugate gradient and smoothing via quadratic penalties.]
Slide acknowledgement: Stanford NLP
POS & NER on Python: Tools Preparation
1. Install Anaconda Python (http://telaga.cs.ui.ac.id/~b.distiawan/summer_school/tools)
2. Install the nltk module: pip install nltk
3. Download the nltk models:
   Enter Python interactive mode and type:
   import nltk; nltk.download()
   Download the models: maxent_ne_chunker, maxent_treebank_pos_tagger, tagsets
POS & NER on Python
Make the file D:\NER\ner.py:

import nltk

sentence = "George Washington was unanimously elected the first President of the United States"
tok_sentence = nltk.word_tokenize(sentence)
print(tok_sentence)
tagged_sentence = nltk.pos_tag(tok_sentence)
print(tagged_sentence)
tree = nltk.ne_chunk(tagged_sentence)
print(tree)
tree.draw()
POS & NER on Python
Step 1: Tokenization
Split a sentence into tokens (words):

tok_sentence = nltk.word_tokenize(sentence)
print(tok_sentence)
POS & NER on Python
Step 2: POS Tagging
Identify the part of speech (POS) of each token:

tagged_sentence = nltk.pos_tag(tok_sentence)
print(tagged_sentence)

To see the list of the tag set:
nltk.help.upenn_tagset()
POS & NER on Python
Step 3: NE Recognition
Given the words and their POS tags, identify the entity of each token:

tree = nltk.ne_chunk(tagged_sentence)
print(tree)
POS & NER on Python
Error on recognition:

(S (PERSON George/NNP) (GPE Washington/NNP) was/VBD unanimously/RB elected/VBN the/DT first/JJ President/NNP of/IN the/DT (GPE United/NNP States/NNPS))

Why does this happen?
How can we fix (improve) the NER?
How to Train a POS Model
1. Prepare the corpus
   IO / IOB format. Example: http://telaga.cs.ui.ac.id/~b.distiawan/POS/corpus
   Put the *.iob files in your corpus directory, e.g. D:/POS/corpus
How to Train a POS Model
2. Build a POS model using your own corpus
   Download the training script: http://telaga.cs.ui.ac.id/~b.distiawan/POS/trainer.zip
   Extract the zip file to D:/POS/
   Run the following commands:

   >>> D:
   >>> cd POS\trainer
   >>> python train_tagger.py "D:\POS\corpus" --reader nltk.corpus.reader.tagged.TaggedCorpusReader --sequential b --classifier Maxent --filename D:\POS\model.pickles --no-eval
How to Train a POS Model
3. Use the model
   Copy D:\POS\trainer\nltk_trainer to D:\POS\
   Make D:\POS\pos.py:

import nltk
import pickle

tagger = pickle.load(open(r"D:\POS\model.pickles", "rb"))
sentence = "John Kuhns eats pizza"
tok_sentence = nltk.word_tokenize(sentence)
tagged_sentence = tagger.tag(tok_sentence)
print(tagged_sentence)
How to Train an NER Model
1. Prepare the corpus
   IO / IOB format. Example: http://telaga.cs.ui.ac.id/~b.distiawan/NER/corpus
   Put the *.iob files in your corpus directory, e.g. D:/NER/corpus
How to Train an NER Model
2. Build an NER model using your own corpus
   Download the training script: http://telaga.cs.ui.ac.id/~b.distiawan/NER/trainer.zip
   Extract the zip file to D:/NER/
   Run the following commands:

   >>> D:
   >>> cd NER\trainer
   >>> python train_chunker.py "D:\NER\corpus" --reader nltk.corpus.reader.ChunkedCorpusReader --fraction 0.85 --sequential '' --classifier Maxent --filename D:\NER\model.pickles --no-eval
How to Train an NER Model
3. Use the model
   Copy D:\NER\trainer\nltk_trainer to D:\NER\
   Make D:\NER\ner2.py:

import nltk
import pickle

chunker = pickle.load(open(r"D:\NER\model.pickles", "rb"))
sentence = "John Kuhns eats pizza"
tok_sentence = nltk.word_tokenize(sentence)
tagged_sentence = nltk.pos_tag(tok_sentence)
tree = chunker.parse(tagged_sentence)
tree.draw()
How to Train an NER Model
Try another sentence:
sentence = "George Washington eats pizza"
What happened?
Take a moment to fix your model, e.g. by adding training data such as:

George NNP B-PERS
Washington NNP I-PERS
was VBD O
the DT O
first JJ O
president NN O
of IN O
America NNP B-LOC
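The training example above has to end up in a whitespace-separated .iob file, one token per line, with a blank line ending each sentence. A minimal sketch of writing one (the file name "train.iob" is illustrative; match the column format to the example corpus linked earlier):

```python
# Write one training sentence in word POS entity-tag format.
rows = [
    ("George", "NNP", "B-PERS"), ("Washington", "NNP", "I-PERS"),
    ("was", "VBD", "O"), ("the", "DT", "O"), ("first", "JJ", "O"),
    ("president", "NN", "O"), ("of", "IN", "O"), ("America", "NNP", "B-LOC"),
]
with open("train.iob", "w") as f:
    for word, pos, tag in rows:
        f.write(f"{word} {pos} {tag}\n")
    f.write("\n")  # blank line marks the end of the sentence
```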
Homework
Build your own model using a large dataset. You can make your own IOB file or generate it from an available corpus, e.g.:
http://schwa.org/projects/resources/wiki/Wikiner