NLP - ckiptagger - HackMD



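---
title: NLP-ckiptagger
tags: self-learning, NLP
---

{%hackmd BkVfcTxlQ %}

# **_NLP-ckiptagger_**
> [name=BessyHuang] [time=Sun, Apr 12, 2020]

# **Course Outline**
[TOC]

:::warning
**_Reference:_**
* [CKIP Lab - Software and Resource Downloads](https://ckip.iis.sinica.edu.tw/resource/)
* Chinese word segmentation and entity recognition system
    * [Online demo](https://ckip.iis.sinica.edu.tw/service/ckiptagger/)
    * [GitHub: CkipTagger](https://github.com/ckiplab/ckiptagger)
    * [Python API: ckiptagger](https://pypi.org/project/ckiptagger/)
    * [Paper: Why Attention? Analyze BiLSTM Deficiency and Its Remedies in the Case of NER](https://arxiv.org/abs/1908.11046)
* [Progress and Challenges in Chinese Natural Language Processing (NLP)](https://allen108108.github.io/blog/2019/11/01/%E4%B8%AD%E6%96%87%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86%20(NLP)%20%E7%9A%84%E9%80%B2%E5%B1%95%E8%88%87%E6%8C%91%E6%88%B0/)
:::

---

## **Notes**
### Ckiptagger
* An open-source automatic word segmentation tool developed in Taiwan (CKIP Lab)
* The source code can be modified to fit your own needs, adding new functions or features for text processing and semantic analysis.
* Functions
    * Traditional Chinese word segmentation (WS)
    * Part-of-speech tagging (POS)
    * [Named entity recognition (NER) covering 18 entity types](https://github.com/ckiplab/ckiptagger/wiki/Entity-Types)
* Features
    * Improved word segmentation performance
    * Can leave the input untouched (no automatic deletion or alteration of characters)
    * Supports sentences of unlimited length
    * User customization: recommended and forced (coerce) dictionaries
* [Advantages](https://www.ithome.com.tw/news/132838)

> In a Chinese word segmentation test on the ASBC 4.0 Chinese corpus test set, which contains as many as 50,000 sentences,
> CkipTagger performed far better than China's Jieba:
> Academia Sinica's tool reached a word segmentation accuracy of 97.49%,
> whereas Jieba reached only 90.51%.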
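To make segmentation scores such as the ones quoted above more concrete, here is a minimal sketch of one common way to score word segmentation: each word is turned into a character span, and precision, recall and F1 are computed over the spans that match exactly. Whether the cited benchmark used exactly this procedure is an assumption; `to_spans`, `ws_scores` and the `pred` segmentation are hypothetical and made up for illustration.

```python
def to_spans(words):
    """Turn a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def ws_scores(gold_words, pred_words):
    """Span-based precision, recall and F1 for one segmented sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Reference segmentation and a made-up system output for the same text.
gold = ["而", "她", "自始自終", "還", "是", "一", "個", "人"]
pred = ["而", "她", "自始自終", "還", "是", "一個人"]

precision, recall, f1 = ws_scores(gold, pred)
print(f"prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")
# prec=0.833 rec=0.625 f1=0.714
```

Merging `一`, `個`, `人` into one word costs three reference words but only one predicted word, which is why recall drops further than precision here.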

> :::info
> Metrics compared: (WS) prec, (WS) rec, (WS) f1, (POS) acc
> :::

### Jieba vs. Ckiptagger
* [The strongest Chinese NLP tool: CKIPtagger](https://mc.ai/%E6%9C%80%E5%BC%B7%E4%B8%AD%E6%96%87%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86%E5%B7%A5%E5%85%B7ckiptagger/)

---

## **Chinese Word Segmentation Tool: Hands-on**
### Installation (for GPU)
```
$ pip3 install ckiptagger
$ pip3 install tensorflow-gpu
$ pip3 install gdown
```

### Basic example
* Download the model; it is extracted to ./data/
    * model: data.zip (2 GB)
```python=
from ckiptagger import data_utils

# Download data.zip via gdown and extract it to ./data/
data_utils.download_data_gdown("./")
```

* Run word segmentation (WS), part-of-speech tagging (POS), and named entity recognition (NER)
```python=
import os
from ckiptagger import WS, POS, NER

# Select which GPU to use
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Load the models (disable_cuda=False lets them run on the GPU)
ws = WS("./data", disable_cuda=False)
pos = POS("./data", disable_cuda=False)
ner = NER("./data", disable_cuda=False)

# Run the WS-POS-NER pipeline
sentence_list = [
    "你現在一時的同情,不是幫助,只會讓她更痛苦,因為你永遠不會是她的家人,而她自始自終還是一個人。",
    "因為是兩個人做的事情,有人牽著,去哪裡都可以;有人回應著,說什麼也可以,因為那是兩個人的事情,就算再無聊,它都變得好幸福。——慕橙",
    "如果人的記憶,只能選擇一秒鐘的額度,我希望,就是這一瞬間。——光晞",
    # The trailing space is restored here: the output below ends with a WHITESPACE token
    "如果時間能夠倒轉那有多好,當然那是不可能的事情,但是能有讓時間暫停的魔法,那就是攝影,不是嗎?攝影師讓瞬間,變成永恆的魔法。——攝影師 ",
]

word_sentence_list = ws(
    sentence_list,
    # sentence_segmentation=True,  # To consider delimiters
    # segment_delimiter_set={",", "。", ":", "?", "!", ";"},  # This is the default set of delimiters
    # recommend_dictionary=dictionary1,  # words in this dictionary are encouraged
    # coerce_dictionary=dictionary2,  # words in this dictionary are forced
)
pos_sentence_list = pos(word_sentence_list)
entity_sentence_list = ner(word_sentence_list, pos_sentence_list)

print('WS: ', word_sentence_list)
print('POS: ', pos_sentence_list)
print('NER: ', entity_sentence_list)

# Print the entities found in the third sentence, one per line
for name in entity_sentence_list[2]:
    print(name)
```

* Output
```
WS:  [['你', '現在', '一時', '的', '同情', ',', '不', '是', '幫助', ',', '只', '會', '讓', '她', '更', '痛苦', ',', '因為', '你', '永遠', '不會', '是', '她', '的', '家人', ',', '而', '她', '自始自終', '還', '是', '一', '個', '人', '。'],
 ['因為', '是', '兩', '個', '人', '做', '的', '事情', ',', '有', '人', '牽', '著', ',', '去', '哪裡', '都', '可以', ';', '有', '人', '回應', '著', ',', '說', '什麼', '也', '可以', ',', '因為', '那', '是', '兩', '個', '人', '的', '事情', ',', '就算', '再', '無聊', ',', '它', '都', '變', '得', '好', '幸福', '。', '—', '—', '慕橙'],
 ['如果', '人', '的', '記憶', ',', '只', '能', '選擇', '一', '秒鐘', '的', '額度', ',', '我', '希望', ',', '就', '是', '這', '一瞬間', '。', '—', '—', '光晞'],
 ['如果', '時間', '能夠', '倒轉', '那', '有', '多', '好', ',', '當然', '那', '是', '不可能', '的', '事情', ',', '但是', '能', '有', '讓', '時間', '暫停', '的', '魔法', ',', '那', '就', '是', '攝影', ',', '不', '是', '嗎', '?', '攝影師', '讓', '瞬間', ',', '變成', '永恆', '的', '魔法', '。', '—', '—', '攝影師', ' ']]
POS:  [['Nh', 'Nd', 'Nd', 'DE', 'VJ', 'COMMACATEGORY', 'D', 'SHI', 'VC', 'COMMACATEGORY', 'Da', 'D', 'VL', 'Nh', 'Dfa', 'VH', 'COMMACATEGORY', 'Cbb', 'Nh', 'D', 'D', 'SHI', 'Nh', 'DE', 'Na', 'COMMACATEGORY', 'Cbb', 'Nh', 'VH', 'D', 'SHI', 'Neu', 'Nf', 'Na', 'PERIODCATEGORY'],
 ['Cbb', 'SHI', 'Neu', 'Nf', 'Na', 'VC', 'DE', 'Na', 'COMMACATEGORY', 'V_2', 'Na', 'VC', 'Di', 'COMMACATEGORY', 'VCL', 'Ncd', 'D', 'VH', 'SEMICOLONCATEGORY', 'V_2', 'Na', 'VC', 'Di', 'COMMACATEGORY', 'VE', 'Nep', 'D', 'VH', 'COMMACATEGORY', 'Cbb', 'Nep', 'SHI', 'Neu', 'Nf', 'Na', 'DE', 'Na', 'COMMACATEGORY', 'Cbb', 'D', 'VH', 'COMMACATEGORY', 'Nh', 'D', 'VG', 'DE', 'Dfa', 'VH', 'PERIODCATEGORY', 'DASHCATEGORY', 'DASHCATEGORY', 'Nb'],
 ['Cbb', 'Na', 'DE', 'Na', 'COMMACATEGORY', 'Da', 'D', 'VC', 'Neu', 'Nf', 'DE', 'Na', 'COMMACATEGORY', 'Nh', 'VK', 'COMMACATEGORY', 'D', 'SHI', 'Nep', 'Nd', 'PERIODCATEGORY', 'DASHCATEGORY', 'DASHCATEGORY', 'Nb'],
 ['Cbb', 'Na', 'D', 'VAC', 'Nep', 'V_2', 'Dfa', 'VH', 'COMMACATEGORY', 'D', 'Nep', 'SHI', 'A', 'DE', 'Na', 'COMMACATEGORY', 'Cbb', 'D', 'V_2', 'VL', 'Na', 'VHC', 'DE', 'Na', 'COMMACATEGORY', 'Nep', 'D', 'SHI', 'Na', 'COMMACATEGORY', 'D', 'SHI', 'T', 'QUESTIONCATEGORY', 'Na', 'VL', 'Nd', 'COMMACATEGORY', 'VG', 'VH', 'DE', 'Na', 'PERIODCATEGORY', 'DASHCATEGORY', 'DASHCATEGORY', 'Na', 'WHITESPACE']]
NER:  [set(), {(40, 41, 'CARDINAL', '兩'), (3, 4, 'CARDINAL', '兩')}, {(11, 14, 'TIME', '一秒鐘'), (31, 33, 'PERSON', '光晞')}, set()]
(11, 14, 'TIME', '一秒鐘')
(31, 33, 'PERSON', '光晞')
```

* Defining dictionaries
```python=
from ckiptagger import construct_dictionary

word_to_weight = {
    "慕橙": 1,
    "一個人": 1,
    "一個": 2,  # higher weight, so the text is segmented as '一個' '人' rather than '一個人'
}
dictionary2 = construct_dictionary(word_to_weight)
print(dictionary2)

# dictionary1 is not defined in the original note; as an assumption for
# illustration, a recommended dictionary is built here in the same way.
dictionary1 = construct_dictionary({"慕橙": 1})

word_sentence_list = ws(
    sentence_list,
    # sentence_segmentation=True,  # To consider delimiters
    # segment_delimiter_set={",", "。", ":", "?", "!", ";"},  # This is the default set of delimiters
    recommend_dictionary=dictionary1,  # words in this dictionary are encouraged
    coerce_dictionary=dictionary2,  # words in this dictionary are forced
)
```

* `recommend_dictionary`: recommended (supporting) dictionary
    * Ex: if a sentence contains "梁慕橙" but the dictionary only defines "慕橙", ckiptagger still recognizes "梁慕橙" as a person name.

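* `coerce_dictionary`: forced dictionary
    * Any word listed in the defined dictionary is forced to become a word; when entries overlap, the one with the higher weight wins.
    * Ex: if a sentence contains "梁慕橙" but the dictionary only defines "慕橙", ckiptagger makes "慕橙" a word, but it is not recognized as a person name.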
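The role of the weights in the forced dictionary can be pictured with a small pure-Python toy. This is only a sketch of the idea that a higher-weight entry wins when entries overlap; it is not ckiptagger's actual algorithm, and it falls back to single characters where the real tool would run its model. The function `toy_force_segment` is hypothetical.

```python
def toy_force_segment(sentence, word_to_weight):
    """Toy forced segmentation: higher-weight dictionary words win overlaps."""
    # Collect every occurrence of every dictionary word.
    matches = []
    for word, weight in word_to_weight.items():
        if not word:
            continue
        start = sentence.find(word)
        while start != -1:
            matches.append((weight, len(word), start, word))
            start = sentence.find(word, start + 1)

    # Try higher weights first, then longer words, then earlier positions.
    matches.sort(key=lambda m: (-m[0], -m[1], m[2]))

    covered = [False] * len(sentence)
    forced = {}  # start index -> forced word
    for weight, length, start, word in matches:
        if not any(covered[start:start + length]):
            covered[start:start + length] = [True] * length
            forced[start] = word

    # Emit forced words; everything else becomes single characters
    # (an assumption: the real tool segments these parts with its model).
    tokens, i = [], 0
    while i < len(sentence):
        word = forced.get(i, sentence[i])
        tokens.append(word)
        i += len(word)
    return tokens


text = "她自始自終還是一個人"
print(toy_force_segment(text, {"慕橙": 1, "一個人": 1, "一個": 2}))
# ['她', '自', '始', '自', '終', '還', '是', '一個', '人']
print(toy_force_segment(text, {"慕橙": 1, "一個人": 2, "一個": 1}))
# ['她', '自', '始', '自', '終', '還', '是', '一個人']
```

Swapping the two weights flips which entry is forced, which matches the note's comment that giving "一個" the heavier weight yields '一個' '人' rather than '一個人'.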
