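Reading IOB Format and the CoNLL-2000 Chunking Corpus
CoNLL-2000 is a corpus of text that has already been tagged and chunked using IOB notation.
The corpus provides three chunk types: NP, VP, and PP.
For example: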
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
...
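To make the IOB notation concrete, here is a minimal sketch in plain Python (it does not use NLTK) that groups tagged tokens into chunks. The function name `iob_to_chunks` is hypothetical, and treating a stray `I-` tag as "outside" is a simplifying assumption.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs."""
    chunks = []
    current = None  # the chunk being built, or None when outside any chunk
    for word, pos, iob in tagged:
        if iob.startswith("B-"):
            # B- always opens a new chunk of the named type
            current = (iob[2:], [word])
            chunks.append(current)
        elif iob.startswith("I-") and current is not None and current[0] == iob[2:]:
            # I- continues the chunk that is currently open
            current[1].append(word)
        else:
            # "O" (or a stray I- tag, by assumption) closes the open chunk
            current = None
    return chunks


sentence = [
    ("he", "PRP", "B-NP"),
    ("accepted", "VBD", "B-VP"),
    ("the", "DT", "B-NP"),
    ("position", "NN", "I-NP"),
]
print(iob_to_chunks(sentence))
# [('NP', ['he']), ('VP', ['accepted']), ('NP', ['the', 'position'])]
```

B marks the first token of a chunk, I marks a token inside a chunk, and O marks a token outside any chunk.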
The function nltk.chunk.conllstr2tree() builds a tree representation from a string in this format.
For example:
>>> import nltk
>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''
>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
The resulting tree is shown in the figure:
With the CoNLL-2000 chunking corpus, we can do the following:
# Access the chunked corpus file
>>> from nltk.corpus import conll2000
>>> print(conll2000.chunked_sents('train.txt')[99])
(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)

# If we are only interested in NP chunks, we can write:
>>> print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])
(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)
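The effect of chunk_types=['NP'] can be pictured at the level of IOB tags: tokens in other chunk types are treated as being outside any chunk. The sketch below illustrates that idea in plain Python. The function name `restrict_chunk_types` is hypothetical, and this is not NLTK's implementation.

```python
def restrict_chunk_types(tagged, chunk_types):
    """Replace IOB tags of unwanted chunk types with 'O'."""
    result = []
    for word, pos, iob in tagged:
        # iob[2:] strips the "B-" or "I-" prefix, leaving the chunk type
        keep = iob != "O" and iob[2:] in chunk_types
        result.append((word, pos, iob if keep else "O"))
    return result


tagged = [
    ("Over", "IN", "B-PP"),
    ("a", "DT", "B-NP"),
    ("cup", "NN", "I-NP"),
    ("of", "IN", "B-PP"),
    ("coffee", "NN", "B-NP"),
]
for triple in restrict_chunk_types(tagged, ["NP"]):
    print(triple)
```

In the output, "Over" and "of" carry the tag O while the NP tokens keep their tags, which mirrors the second tree above.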
Simple Evaluation and Baselines
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> grammar = r"NP: {<[CDJNP].*>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%
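To show how the scores relate to one another, the sketch below computes chunk-level precision, recall, and F-measure for a toy example in plain Python. It assumes the F-measure is the harmonic mean of precision and recall, and that a guessed chunk counts as correct only if its span and type match the gold chunk exactly. The helper names are hypothetical.

```python
def chunk_spans(iob_tags):
    """Return the set of (start, end, type) spans encoded by a list of IOB tags."""
    spans = set()
    start, ctype = None, None
    # A trailing "O" sentinel closes a chunk that runs to the end of the sentence
    for i, tag in enumerate(list(iob_tags) + ["O"]):
        inside = tag.startswith("I-") and start is not None and tag[2:] == ctype
        if not inside:
            if start is not None:
                spans.add((start, i, ctype))
                start, ctype = None, None
            if tag.startswith("B-"):
                start, ctype = i, tag[2:]
    return spans


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


gold = ["B-NP", "I-NP", "O", "B-NP", "O"]
guess = ["B-NP", "I-NP", "B-NP", "B-NP", "O"]

gold_spans, guess_spans = chunk_spans(gold), chunk_spans(guess)
correct = len(gold_spans & guess_spans)
precision = correct / len(guess_spans)  # 2 of 3 guessed chunks are right
recall = correct / len(gold_spans)      # 2 of 2 gold chunks were found
print(precision, recall, f_measure(precision, recall))

# The reported baseline figures are consistent with this formula:
print(round(f_measure(0.706, 0.678), 3))  # 0.692
```

IOB accuracy is a per-token figure and is typically higher than the chunk-level scores, because a chunk is only counted as correct when every one of its tokens is tagged correctly.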
We can build a chunker on top of a unigram tagger.
# We define a chunker class with a constructor and a parse method that chunks new sentences.
Example 7-4. Noun phrase chunking with a unigram tagger.

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Train on (POS tag, chunk tag) pairs, ignoring the words themselves
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
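Note the parse method. Its workflow is as follows:
1. Take a POS-tagged sentence as input.
2. Extract the POS tags from that sentence.
3. Use self.tagger, the tagger trained in the constructor, to assign an IOB chunk tag to each POS tag.
4. Extract the chunk tags and combine them with the original sentence.
5. Convert the result into a chunk tree.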
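Steps 1 to 4 can be sketched without NLTK by replacing the unigram tagger with a dictionary that maps each POS tag to its most frequent chunk tag. This is only an illustration under stated assumptions: the names `train_unigram` and `chunk_tags` are hypothetical, unseen POS tags fall back to "O" here, and step 5 (building the tree) is left out.

```python
from collections import Counter, defaultdict


def train_unigram(train_sents):
    """Map each POS tag to the chunk tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        for word, pos, chunktag in sent:
            counts[pos][chunktag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def chunk_tags(model, sentence, default="O"):
    """Assign a chunk tag to every (word, pos) pair of a POS-tagged sentence."""
    pos_tags = [pos for word, pos in sentence]                      # step 2
    chunktags = [model.get(pos, default) for pos in pos_tags]       # step 3
    return [(w, p, c) for (w, p), c in zip(sentence, chunktags)]    # step 4


train = [
    [("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")],
    [("a", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("slept", "VBD", "O")],
]
model = train_unigram(train)
print(chunk_tags(model, [("the", "DT"), ("cat", "NN"), ("ran", "VBD")]))
# [('the', 'DT', 'B-NP'), ('cat', 'NN', 'I-NP'), ('ran', 'VBD', 'O')]
```

Because the model looks only at the POS tag, the word "ran" is handled correctly even though it never appeared in the toy training data.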
Once the chunker is defined, we train it on the chunking corpus.
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
>>> unigram_chunker = UnigramChunker(train_sents)
>>> print(unigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.8%
    F-Measure:     83.2%

# The following code shows what the chunker has learned:
>>> postags = sorted(set(pos for sent in train_sents
...                      for (word, pos) in sent.leaves()))
>>> print(unigram_chunker.tagger.tag(postags))
[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'),
 (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'),
 ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'),
 ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'),
 ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'),
 ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'),
 ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'),
 ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'),
 ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'),
 ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]
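Similarly, we can build a bigram chunker: rename the class to BigramChunker and use nltk.BigramTagger in place of nltk.UnigramTagger.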
>>> bigram_chunker = BigramChunker(train_sents)
>>> print(bigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.3%
    Recall:        86.8%
    F-Measure:     84.5%
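One way to see why a bigram model can help is that it conditions each chunk tag on the previous chunk tag as well as the current POS tag. The plain-Python sketch below illustrates that idea on toy data. It is not NLTK's BigramTagger, the function names are hypothetical, and falling back to "O" for unseen contexts is an assumption.

```python
from collections import Counter, defaultdict


def train_bigram(train_sents):
    """Map (previous chunk tag, POS tag) to the most frequent chunk tag."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        prev = None  # no previous chunk tag at the start of a sentence
        for word, pos, chunktag in sent:
            counts[(prev, pos)][chunktag] += 1
            prev = chunktag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag_bigram(model, pos_tags, default="O"):
    """Tag left to right, feeding each predicted chunk tag into the next context."""
    prev, out = None, []
    for pos in pos_tags:
        tag = model.get((prev, pos), default)
        out.append(tag)
        prev = tag
    return out


train = [[("the", "DT", "B-NP"), ("dog", "NN", "I-NP"),
          ("chased", "VBD", "O"), ("rice", "NN", "B-NP")]]
model = train_bigram(train)
print(tag_bigram(model, ["DT", "NN", "VBD", "NN"]))
# ['B-NP', 'I-NP', 'O', 'B-NP']
```

In this toy data NN is tagged I-NP once and B-NP once, so a model that sees only the POS tag cannot separate the two cases, while the previous chunk tag does.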
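Training Classifier-Based Chunkers
The chunkers discussed so far, the regular-expression chunker and the n-gram chunkers, decide which chunks to create entirely on the basis of POS tags. Sometimes, however, POS tags are not enough to determine how a sentence should be chunked.
For example: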
(3) a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
    b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.
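Although the tag sequences are identical, the two sentences clearly should be chunked differently.
We therefore need to use information about the words themselves to supplement the POS tags.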
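The point can be checked mechanically: any chunker whose only input is the POS sequence must give both sentences the same chunk tags. In the sketch below, `pos_only_chunker` is a hypothetical stand-in for such a chunker and its tag mapping is arbitrary. One plausible reading of the intended chunking is that "the farmer" and "rice" are separate noun phrases in (a), while "my computer monitor" is a single noun phrase in (b).

```python
sent_a = [("Joey", "NN"), ("sold", "VBD"), ("the", "DT"),
          ("farmer", "NN"), ("rice", "NN"), (".", ".")]
sent_b = [("Nick", "NN"), ("broke", "VBD"), ("my", "DT"),
          ("computer", "NN"), ("monitor", "NN"), (".", ".")]


def pos_only_chunker(sentence):
    """Hypothetical chunker that sees only POS tags (arbitrary toy mapping)."""
    mapping = {"DT": "B-NP", "NN": "I-NP"}
    return [mapping.get(pos, "O") for _, pos in sentence]


tags_a = [pos for _, pos in sent_a]
tags_b = [pos for _, pos in sent_b]

print(tags_a == tags_b)                                      # True
print(pos_only_chunker(sent_a) == pos_only_chunker(sent_b))  # True
```

Whatever mapping is used, the second comparison stays true, so at least one of the two sentences would be chunked incorrectly if their correct chunkings differ.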
One way to make use of word content is to chunk sentences with a classifier-based tagger.
The basic code for a classifier-based NP chunker is shown below:
# The second class is essentially a wrapper around the tagger that turns it into a chunker.
# During training, it maps the chunk trees in the training corpus to tag sequences.
# In the parse method, it converts the tag sequence produced by the tagger back into a chunk tree.

class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # algorithm='megam' relies on the external MEGAM binary being installed
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
Next, we define a feature extraction function:
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.7%
    F-Measure:     83.2%
We can improve this classifier-based tagger by adding the previous POS tag as a feature.
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "", " "
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.6%
    Precision:     81.9%
    Recall:        87.1%
    F-Measure:     84.4%
Beyond the two POS tags, we can also add the current word itself as a feature.
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "", " "
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "word": word, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  94.2%
    Precision:     83.4%
    Recall:        88.6%
    F-Measure:     85.9%
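We can try adding several more features to improve the chunker's performance. The code below, for example, adds lookahead features, paired features, and a more complex contextual feature. The last of these, tags-since-dt, builds a string describing the set of POS tags encountered since the most recent determiner.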
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "", " "
...     else:
...         prevword, prevpos = sentence[i-1]
...     if i == len(sentence)-1:
...         nextword, nextpos = " ", " "
...     else:
...         nextword, nextpos = sentence[i+1]
...     return {"pos": pos,
...             "word": word,
...             "prevpos": prevpos,
...             "nextpos": nextpos,
...             "prevpos+pos": "%s+%s" % (prevpos, pos),
...             "pos+nextpos": "%s+%s" % (pos, nextpos),
...             "tags-since-dt": tags_since_dt(sentence, i)}
>>> def tags_since_dt(sentence, i):
...     tags = set()
...     for word, pos in sentence[:i]:
...         if pos == 'DT':
...             tags = set()
...         else:
...             tags.add(pos)
...     return '+'.join(sorted(tags))
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  95.9%
    Precision:     88.3%
    Recall:        90.7%
    F-Measure:     89.5%
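To see what the tags-since-dt feature looks like in practice, the sketch below repeats the tags_since_dt function from above as standalone Python and applies it to every position of a short sentence. The example sentence is taken from the corpus output shown earlier, and no NLTK import is needed.

```python
def tags_since_dt(sentence, i):
    """Join the POS tags seen before position i since the most recent determiner."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()  # a determiner resets the collected tags
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))


sentence = [("Over", "IN"), ("a", "DT"), ("cup", "NN"),
            ("of", "IN"), ("coffee", "NN")]
for i, (word, pos) in enumerate(sentence):
    print(word, repr(tags_since_dt(sentence, i)))
# Over ''
# a 'IN'
# cup ''
# of 'NN'
# coffee 'IN+NN'
```

The feature looks only at tokens before position i, so the value for "cup" is empty: the determiner "a" has just reset the set. If no determiner has occurred yet, the feature collects tags from the start of the sentence.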