---
title: NLP - ckiptagger
tags: self-learning, NLP
---
{%hackmd BkVfcTxlQ %}
# **_NLP - ckiptagger_**
> [name=BessyHuang] [time=Sun, Apr 12, 2020]
# **Course Outline**
[TOC]
:::warning
**_Reference:_**
* [CKIP Lab - Software and Resource Downloads](https://ckip.iis.sinica.edu.tw/resource/)
* Chinese word segmentation and entity recognition system
    * [Online demo](https://ckip.iis.sinica.edu.tw/service/ckiptagger/)
    * [GitHub: CkipTagger](https://github.com/ckiplab/ckiptagger)
    * [Python API: ckiptagger](https://pypi.org/project/ckiptagger/)
    * [Paper: Why Attention? Analyze BiLSTM Deficiency and Its Remedies in the Case of NER](https://arxiv.org/abs/1908.11046)
* [Progress and Challenges in Chinese Natural Language Processing (NLP)](https://allen108108.github.io/blog/2019/11/01/%E4%B8%AD%E6%96%87%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86%20(NLP)%20%E7%9A%84%E9%80%B2%E5%B1%95%E8%88%87%E6%8C%91%E6%88%B0/)
:::
---
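## **Notes**
### Ckiptagger
* An open-source automatic word segmentation tool developed in Taiwan (CKIP Lab)
* The source code can be modified to add new functions or features as needed, for text processing and semantic analysis.
* Functions
    * Traditional Chinese word segmentation (WS)
    * Part-of-speech tagging (POS)
    * [Named entity recognition (NER) for 18 entity types](https://github.com/ckiplab/ckiptagger/wiki/Entity-Types)
* Features
    * Improved word segmentation performance
    * Can leave the text intact (no automatic deletion or modification of characters)
    * Supports sentences of unlimited length
    * User customization: recommend and coerce dictionaries
* [Advantages](https://www.ithome.com.tw/news/132838)
> In a Chinese word segmentation test on the ASBC 4.0 Chinese corpus test set of as many as 50,000 sentences,
> CkipTagger performs far better than China's Jieba:
> Academia Sinica's tool reaches 97.49% segmentation accuracy,
> whereas Jieba reaches only 90.51%.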
:::info
(WS) prec | (WS) rec | (WS) f1 | (POS) acc
:::
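The column names above refer to precision, recall, and F1 for word segmentation. As a rough illustration of what those numbers measure, the sketch below scores a predicted segmentation against a reference one by comparing character spans. This is the common span-based definition and is only an assumption here; the notes do not say exactly how the cited evaluation was scored. The `gold` and `pred` lists are made-up examples, not ckiptagger output.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def ws_scores(gold_words, pred_words):
    """Span-based precision, recall, and F1 for one segmented sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical reference and predicted segmentations of the same text
gold = ["她", "自始自終", "還", "是", "一", "個", "人"]
pred = ["她", "自始", "自終", "還", "是", "一個", "人"]
precision, recall, f1 = ws_scores(gold, pred)
print(f"prec={precision:.4f} rec={recall:.4f} f1={f1:.4f}")
```

Four of the seven predicted words have the same boundaries as the reference, so all three scores come out to 4/7 in this example.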
### Jieba vs. Ckiptagger
* [The strongest Chinese NLP tool: CKIPtagger](https://mc.ai/%E6%9C%80%E5%BC%B7%E4%B8%AD%E6%96%87%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86%E5%B7%A5%E5%85%B7ckiptagger/)
---
## **Chinese Word Segmentation Tool: Hands-on**
### Installation (for GPU)
```
$ pip3 install ckiptagger
$ pip3 install tensorflow-gpu
$ pip3 install gdown
```
### Basic Example
* Download the model and extract it to ./data/
    * model: data.zip (2GB)
```python=
from ckiptagger import data_utils

# Downloads data.zip into the current directory and extracts it to ./data/
data_utils.download_data_gdown("./")
```
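Because the archive is about 2GB, it can be convenient to skip the download when the model is already on disk. The helper below is a minimal sketch: `ensure_model` is a hypothetical name, not part of ckiptagger, and it assumes that an existing `data` directory means the model was fully extracted, which would not hold after an interrupted download.

```python
import os


def ensure_model(base_dir="./"):
    """Return the model directory, downloading it only when it is missing.

    Hypothetical helper: it only checks that <base_dir>/data exists.
    """
    data_dir = os.path.join(base_dir, "data")
    if not os.path.isdir(data_dir):
        # Imported lazily so the existence check works without loading ckiptagger
        from ckiptagger import data_utils
        data_utils.download_data_gdown(base_dir)
    return data_dir
```

Calling `ensure_model("./")` returns `./data`, which is the path passed to `WS`, `POS`, and `NER`.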
* Run word segmentation (WS), part-of-speech tagging (POS), and named entity recognition (NER)
```python=
import os
from ckiptagger import WS, POS, NER

# Select which GPU to use
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Load the models (disable_cuda=False lets them run on the GPU)
ws = WS("./data", disable_cuda=False)
pos = POS("./data", disable_cuda=False)
ner = NER("./data", disable_cuda=False)

# Run the WS-POS-NER pipeline
sentence_list = [
    "你現在一時的同情,不是幫助,只會讓她更痛苦,因為你永遠不會是她的家人,而她自始自終還是一個人。",
    "因為是兩個人做的事情,有人牽著,去哪裡都可以;有人回應著,說什麼也可以,因為那是兩個人的事情,就算再無聊,它都變得好幸福。——慕橙",
    "如果人的記憶,只能選擇一秒鐘的額度,我希望,就是這一瞬間。——光晞",
    "如果時間能夠倒轉那有多好,當然那是不可能的事情,但是能有讓時間暫停的魔法,那就是攝影,不是嗎?攝影師讓瞬間,變成永恆的魔法。——攝影師 ",
]
word_sentence_list = ws(
    sentence_list,
    # sentence_segmentation=True,  # To consider delimiters
    # segment_delimiter_set={",", "。", ":", "?", "!", ";"},  # This is the default set of delimiters
    # recommend_dictionary=dictionary1,  # words in this dictionary are encouraged
    # coerce_dictionary=dictionary2,  # words in this dictionary are forced
)
pos_sentence_list = pos(word_sentence_list)
entity_sentence_list = ner(word_sentence_list, pos_sentence_list)

print('WS:', word_sentence_list)
print('POS:', pos_sentence_list)
print('NER:', entity_sentence_list)

# Print each entity found in the third sentence
for name in entity_sentence_list[2]:
    print(name)
```
* Output
```
WS: [['你', '現在', '一時', '的', '同情', ',', '不', '是', '幫助', ',', '只', '會', '讓', '她', '更', '痛苦', ',', '因為', '你', '永遠', '不會', '是', '她', '的', '家人', ',', '而', '她', '自始自終', '還', '是', '一', '個', '人', '。'], ['因為', '是', '兩', '個', '人', '做', '的', '事情', ',', '有', '人', '牽', '著', ',', '去', '哪裡', '都', '可以', ';', '有', '人', '回應', '著', ',', '說', '什麼', '也', '可以', ',', '因為', '那', '是', '兩', '個', '人', '的', '事情', ',', '就算', '再', '無聊', ',', '它', '都', '變', '得', '好', '幸福', '。', '—', '—', '慕橙'], ['如果', '人', '的', '記憶', ',', '只', '能', '選擇', '一', '秒鐘', '的', '額度', ',', '我', '希望', ',', '就', '是', '這', '一瞬間', '。', '—', '—', '光晞'], ['如果', '時間', '能夠', '倒轉', '那', '有', '多', '好', ',', '當然', '那', '是', '不可能', '的', '事情', ',', '但是', '能', '有', '讓', '時間', '暫停', '的', '魔法', ',', '那', '就', '是', '攝影', ',', '不', '是', '嗎', '?', '攝影師', '讓', '瞬間', ',', '變成', '永恆', '的', '魔法', '。', '—', '—', '攝影師', ' ']]
POS: [['Nh', 'Nd', 'Nd', 'DE', 'VJ', 'COMMACATEGORY', 'D', 'SHI', 'VC', 'COMMACATEGORY', 'Da', 'D', 'VL', 'Nh', 'Dfa', 'VH', 'COMMACATEGORY', 'Cbb', 'Nh', 'D', 'D', 'SHI', 'Nh', 'DE', 'Na', 'COMMACATEGORY', 'Cbb', 'Nh', 'VH', 'D', 'SHI', 'Neu', 'Nf', 'Na', 'PERIODCATEGORY'], ['Cbb', 'SHI', 'Neu', 'Nf', 'Na', 'VC', 'DE', 'Na', 'COMMACATEGORY', 'V_2', 'Na', 'VC', 'Di', 'COMMACATEGORY', 'VCL', 'Ncd', 'D', 'VH', 'SEMICOLONCATEGORY', 'V_2', 'Na', 'VC', 'Di', 'COMMACATEGORY', 'VE', 'Nep', 'D', 'VH', 'COMMACATEGORY', 'Cbb', 'Nep', 'SHI', 'Neu', 'Nf', 'Na', 'DE', 'Na', 'COMMACATEGORY', 'Cbb', 'D', 'VH', 'COMMACATEGORY', 'Nh', 'D', 'VG', 'DE', 'Dfa', 'VH', 'PERIODCATEGORY', 'DASHCATEGORY', 'DASHCATEGORY', 'Nb'], ['Cbb', 'Na', 'DE', 'Na', 'COMMACATEGORY', 'Da', 'D', 'VC', 'Neu', 'Nf', 'DE', 'Na', 'COMMACATEGORY', 'Nh', 'VK', 'COMMACATEGORY', 'D', 'SHI', 'Nep', 'Nd', 'PERIODCATEGORY', 'DASHCATEGORY', 'DASHCATEGORY', 'Nb'], ['Cbb', 'Na', 'D', 'VAC', 'Nep', 'V_2', 'Dfa', 'VH', 'COMMACATEGORY', 'D', 'Nep', 'SHI', 'A', 'DE', 'Na', 'COMMACATEGORY', 'Cbb', 'D', 'V_2', 'VL', 'Na', 'VHC', 'DE', 'Na', 'COMMACATEGORY', 'Nep', 'D', 'SHI', 'Na', 'COMMACATEGORY', 'D', 'SHI', 'T', 'QUESTIONCATEGORY', 'Na', 'VL', 'Nd', 'COMMACATEGORY', 'VG', 'VH', 'DE', 'Na', 'PERIODCATEGORY', 'DASHCATEGORY', 'DASHCATEGORY', 'Na', 'WHITESPACE']]
NER: [set(), {(40, 41, 'CARDINAL', '兩'), (3, 4, 'CARDINAL', '兩')}, {(11, 14, 'TIME', '一秒鐘'), (31, 33, 'PERSON', '光晞')}, set()]
(11, 14, 'TIME', '一秒鐘')
(31, 33, 'PERSON', '光晞')
```
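The three results line up in a specific way that is easy to miss in the raw printout. The WS and POS lists are parallel, with one tag per word. Each NER tuple is `(start, end, type, text)`, where `start` and `end` are character offsets into the original sentence rather than word indices. The sketch below checks both points using the third sentence's values copied from the output above, so it runs without loading the models.

```python
# Values copied from the printed output for the third sentence
sentence = "如果人的記憶,只能選擇一秒鐘的額度,我希望,就是這一瞬間。——光晞"
words = ['如果', '人', '的', '記憶', ',', '只', '能', '選擇', '一', '秒鐘', '的', '額度',
         ',', '我', '希望', ',', '就', '是', '這', '一瞬間', '。', '—', '—', '光晞']
tags = ['Cbb', 'Na', 'DE', 'Na', 'COMMACATEGORY', 'Da', 'D', 'VC', 'Neu', 'Nf', 'DE', 'Na',
        'COMMACATEGORY', 'Nh', 'VK', 'COMMACATEGORY', 'D', 'SHI', 'Nep', 'Nd',
        'PERIODCATEGORY', 'DASHCATEGORY', 'DASHCATEGORY', 'Nb']
entities = {(11, 14, 'TIME', '一秒鐘'), (31, 33, 'PERSON', '光晞')}

# WS and POS are parallel lists: pair each word with its tag
pairs = list(zip(words, tags))
print(" ".join(f"{word}({tag})" for word, tag in pairs))

# NER offsets index characters of the sentence, not positions in `words`
for start, end, entity_type, text in sorted(entities):
    print(entity_type, sentence[start:end], sentence[start:end] == text)
```

In this sentence WS splits `一` and `秒鐘` into two words, while NER reports the single span `一秒鐘`. That is why the offsets have to be read against the sentence string.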
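* Define a dictionary
```python=
from ckiptagger import construct_dictionary

word_to_weight = {
    "慕橙": 1,
    "一個人": 1,
    "一個": 2,  # Higher weight, so the text is segmented as '一個' '人' rather than '一個人'
}
# dictionary1 is not defined anywhere in these notes; as an assumption,
# both dictionaries are built from the same weights so that the call runs.
dictionary1 = construct_dictionary(word_to_weight)
dictionary2 = construct_dictionary(word_to_weight)
print(dictionary2)

# Reuses ws and sentence_list from the basic example
word_sentence_list = ws(
    sentence_list,
    # sentence_segmentation=True,  # To consider delimiters
    # segment_delimiter_set={",", "。", ":", "?", "!", ";"},  # This is the default set of delimiters
    recommend_dictionary=dictionary1,  # words in this dictionary are encouraged
    coerce_dictionary=dictionary2,  # words in this dictionary are forced
)
```
* `recommend_dictionary`: recommended (supporting) dictionary
    * Ex: if a sentence contains "梁慕橙" but the dictionary defines only "慕橙", ckiptagger will recognize "梁慕橙" as a person name.
* `coerce_dictionary`: forced dictionary
    * Any word that appears in the defined dictionary and carries the higher weight is forced to be segmented as a word.
    * Ex: if a sentence contains "梁慕橙" but the dictionary defines only "慕橙", ckiptagger will make "慕橙" a word but will not recognize it as a person name.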
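To make the weight rule concrete, here is a toy resolver for forced words. It is not ckiptagger's algorithm, whose internals these notes do not describe. `coerce_spans` is a hypothetical function that only illustrates the idea that, when dictionary entries overlap, the entry with the higher weight wins. The sample sentence is made up for the illustration.

```python
def coerce_spans(sentence, word_to_weight):
    """Toy illustration: choose non-overlapping dictionary matches, heaviest first."""
    candidates = []
    for word, weight in word_to_weight.items():
        if not word:
            continue  # ignore empty entries
        start = sentence.find(word)
        while start != -1:
            candidates.append((weight, start, start + len(word), word))
            start = sentence.find(word, start + 1)
    # Higher weight first; ties go to the longer word, then the earlier position
    candidates.sort(key=lambda c: (-c[0], -(c[2] - c[1]), c[1]))
    taken = [False] * len(sentence)
    chosen = []
    for weight, start, end, word in candidates:
        if not any(taken[start:end]):
            for i in range(start, end):
                taken[i] = True
            chosen.append((start, end, word))
    return sorted(chosen)


word_to_weight = {"慕橙": 1, "一個人": 1, "一個": 2}
sentence = "而她自始自終還是一個人。——梁慕橙"  # made-up sample sentence
forced = coerce_spans(sentence, word_to_weight)
print(forced)
```

With these weights the toy picks `一個` over the overlapping `一個人`, and it marks `慕橙` without the leading `梁`. This mirrors the behaviour the notes describe for the forced dictionary.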