ckiplab/ckiptagger: CKIP Neural Chinese Word ... - GitHub
CKIP Neural Chinese Word Segmentation, POS Tagging, and NER
CkipTagger
Also: Chinese README (中文 README)
GitHub
https://github.com/ckiplab/ckiptagger
PyPI
https://pypi.org/project/ckiptagger
Documentation
https://github.com/ckiplab/ckiptagger/wiki
Author / Maintainers
Peng-Hsuan Li @ CKIP (author / maintainer)
Wei-Yun Ma @ CKIP (maintainer)
Introduction
This open-source library implements neural CKIP-style Chinese NLP tools:
(WS) word segmentation
(POS) part-of-speech tagging
(NER) named entity recognition
Related demo sites
CkipTagger
CKIP CoreNLP
CKIPWS (classic)
Features
Performance improvements
Does not automatically delete, change, or add characters
Supports indefinitely long sentences
Supports user-defined recommended-word lists and must-word lists
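ASBC 4.0 Test Split (50,000 sentences)
| Tool             | (WS) prec | (WS) rec | (WS) f1 | (POS) acc |
|------------------|-----------|----------|---------|-----------|
| CkipTagger       | 97.49%    | 97.17%   | 97.33%  | 94.59%    |
| CKIPWS (classic) | 95.85%    | 95.96%   | 95.91%  | 90.62%    |
| Jieba-zh_TW      | 90.51%    | 89.10%   | 89.80%  | --        |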
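The (WS) f1 column can be cross-checked against the precision and recall columns, since F1 is conventionally the harmonic mean of the two. The sketch below is only a sanity check, not part of CkipTagger; `f1_score` is a name introduced here for illustration. Because the table reports rounded percentages, a recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the benchmark table.
reported = {
    "CkipTagger": (97.49, 97.17, 97.33),
    "CKIPWS (classic)": (95.85, 95.96, 95.91),
    "Jieba-zh_TW": (90.51, 89.10, 89.80),
}

for tool, (prec, rec, f1) in reported.items():
    recomputed = f1_score(prec, rec)
    print(f"{tool}: reported {f1:.2f}, recomputed {recomputed:.2f}")
```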
Installation
tl;dr:
    pip install -U ckiptagger[tf,gdown]
CkipTagger is a Python library hosted on PyPI. Requirements:
python>=3.6
tensorflow>=1.13.1 / tensorflow-gpu>=1.13.1 (one of them)
gdown (optional, for downloading model files from Google Drive)
(Minimum installation) If you have already set up tensorflow and would like to download the model files yourself:
    pip install -U ckiptagger
(Complete installation) If you have just set up a clean virtual environment and want everything, including GPU support:
    pip install -U ckiptagger[tfgpu,gdown]
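Before installing, it can help to see which of the listed requirements are already present. The following is a minimal sketch using only the standard library; `check_requirements` and `has_gdown` are hypothetical helper names, and the sketch only checks that the packages are importable. It does not verify the tensorflow>=1.13.1 version constraint.

```python
import importlib.util
import sys


def check_requirements():
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("python>=3.6 is required")
    # Both the tensorflow and tensorflow-gpu packages provide the "tensorflow" module.
    if importlib.util.find_spec("tensorflow") is None:
        problems.append("tensorflow (or tensorflow-gpu) is not installed")
    return problems


def has_gdown():
    """gdown is optional; it is only needed for the Google Drive download."""
    return importlib.util.find_spec("gdown") is not None


for problem in check_requirements():
    print("missing:", problem)
print("gdown available:", has_gdown())
```

If tensorflow is reported missing, the `[tf]` or `[tfgpu]` extra shown above is the relevant choice; if only gdown is missing, the minimum installation still works as long as the model files are downloaded another way.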
Usage
Complete demo script: demo.py. The following sections assume:
    from ckiptagger import data_utils, construct_dictionary, WS, POS, NER
1. Download model files
The model files are available on several mirror sites:
iis-ckip
gdrive-ckip
gdrive-jacobvsdanniel
You can download and extract them to the desired path with one of the included API functions.
    # Downloads to ./data.zip (2GB) and extracts to ./data/
    # data_utils.download_data_url("./")  # iis-ckip
    data_utils.download_data_gdown("./")  # gdrive-ckip
./data/model_ner/pos_list.txt -> POS tag list, see Wiki / Technical Report no. 93-05
./data/model_ner/label_list.txt -> Entity type list, see Wiki / OntoNotes Release 5.0
./data/embedding_* -> character/word embeddings, see Wiki
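After downloading and extracting, a quick check that the files named above actually exist can catch an interrupted download early. This is a minimal sketch, assuming the archive was extracted to `./data`; `missing_model_files` is a hypothetical helper, and it checks only the three paths mentioned in this section, not the full contents of the archive.

```python
from pathlib import Path


def missing_model_files(data_dir="./data"):
    """Return the expected paths (relative to data_dir) that were not found."""
    root = Path(data_dir)
    missing = []
    for relative in ("model_ner/pos_list.txt", "model_ner/label_list.txt"):
        if not (root / relative).is_file():
            missing.append(relative)
    # The embeddings are described only by the pattern ./data/embedding_*
    if not any(root.glob("embedding_*")):
        missing.append("embedding_*")
    return missing


print(missing_model_files("./data"))
```

An empty list means all three checks passed; anything else lists what to look for before loading the models.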
2. Load model
    # To use GPU:
    #    1. Install tensorflow-gpu (see Installation)
    #    2. Set CUDA_VISIBLE_DEVICES environment variable, e.g. os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    #    3. Set disable_cuda=False, e.g. ws = WS("./data", disable_cuda=False)
    # To use CPU:
    ws = WS("./data")
    pos = POS("./data")
    ner = NER("./data")
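The GPU steps in the comments above can be gathered into one place. The sketch below is an illustration only: `device_kwargs` is a hypothetical helper, not part of ckiptagger. It sets `CUDA_VISIBLE_DEVICES` and returns the `disable_cuda=False` keyword argument shown above, or returns no extra arguments for the CPU case. The model constructors are left commented out so the snippet runs without the model files.

```python
import os


def device_kwargs(gpu_id=None):
    """Hypothetical helper: keyword arguments to pass to WS/POS/NER.

    gpu_id=None keeps the CPU behaviour of WS("./data").
    Otherwise the GPU is selected through CUDA_VISIBLE_DEVICES and
    disable_cuda=False is returned, following the comments above.
    """
    if gpu_id is None:
        return {}
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return {"disable_cuda": False}


kwargs = device_kwargs(gpu_id=None)  # use gpu_id=0 for the first GPU
print(kwargs)

# With ckiptagger installed and the model files in ./data:
# ws = WS("./data", **kwargs)
# pos = POS("./data", **kwargs)
# ner = NER("./data", **kwargs)
```

Calling the helper before constructing the models mirrors the order of the steps in the comments: the environment variable is set first, then the models are created.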
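3. (Optional) Create dictionary
You can supply words that WS should give special consideration to, along with their relative weights.
    word_to_weight = {
        "土地公": 1,
        "土地婆": 1,
        "公有": 2,
        "": 1,  # empty word: absent from the printed dictionary
        "來亂的": "啦",  # non-numeric weight: absent from the printed dictionary
        "緯來體育台": 1,
    }
    dictionary = construct_dictionary(word_to_weight)
    print(dictionary)
Output:
    [(2, {'公有': 2.0}), (3, {'土地公': 1.0, '土地婆': 1.0}), (5, {'緯來體育台': 1.0})]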
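The printed dictionary suggests a structure: words grouped by length, weights converted to floats, and the empty word and the non-numeric weight left out. The following pure-Python sketch reproduces that printed structure for the example input. It is inferred from the output alone and is an assumption about the behaviour, not the implementation of `construct_dictionary`; `group_words_by_length` is a name introduced here.

```python
def group_words_by_length(word_to_weight):
    """Group words by character length, keeping only usable entries.

    Entries with an empty word or a weight that cannot be converted
    to float are skipped (an assumption based on the printed output).
    """
    length_to_words = {}
    for word, weight in word_to_weight.items():
        if not word:
            continue
        try:
            weight = float(weight)
        except (TypeError, ValueError):
            continue
        length_to_words.setdefault(len(word), {})[word] = weight
    # Sort by word length; lengths are unique keys, so the dicts are never compared.
    return sorted(length_to_words.items())


word_to_weight = {
    "土地公": 1,
    "土地婆": 1,
    "公有": 2,
    "": 1,
    "來亂的": "啦",
    "緯來體育台": 1,
}
print(group_words_by_length(word_to_weight))
```

Seeing the structure this way makes it easier to spot why an entry did not make it into a dictionary, for example a weight accidentally given as text.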
4. Run the WS-POS-NER pipeline
    sentence_list = [
        "傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。",
        "美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。",
        "",
        "土地公有政策??還是土地婆有政策。.",
        "… 你確定嗎… 不要再騙了……",
        "最多容納59,000個人,或5.9萬人,再多就不行了.這是環評的結論.",
        "科長說:1,坪數對人數為1:3。2,可以再增加。",
    ]
    word_sentence_list = ws(
        sentence_list,
        # sentence_segmentation=True,  # To consider delimiters
        # segment_delimiter_set={",", "。", ":", "?", "!", ";"},  # This is the default set of delimiters
        # recommend_dictionary=dictionary1,  # words in this dictionary are encouraged
        # coerce_dictionary=dictionary2,  # words in this dictionary are forced
    )
    pos_sentence_list = pos(word_sentence_list)
    entity_sentence_list = ner(word_sentence_list, pos_sentence_list)
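To get a feel for what a delimiter set means for the example sentences, the sketch below splits a sentence after each delimiter and keeps the delimiter attached. It only illustrates the idea of delimiter-based splitting. How `ws` uses `sentence_segmentation` and `segment_delimiter_set` internally is not described here, so nothing below should be read as the library's behaviour; `split_on_delimiters` is a hypothetical function.

```python
import re

# Delimiters copied from the commented-out default set in the call above.
DEFAULT_DELIMITERS = {",", "。", ":", "?", "!", ";"}


def split_on_delimiters(sentence, delimiters=DEFAULT_DELIMITERS):
    """Split a sentence after each delimiter, keeping the delimiter attached."""
    pattern = "(" + "|".join(re.escape(d) for d in sorted(delimiters)) + ")"
    pieces = re.split(pattern, sentence)
    segments = []
    # re.split with a capturing group alternates: text, delimiter, text, ...
    for i in range(0, len(pieces), 2):
        text = pieces[i]
        delimiter = pieces[i + 1] if i + 1 < len(pieces) else ""
        if text or delimiter:
            segments.append(text + delimiter)
    return segments


for segment in split_on_delimiters("土地公有政策??還是土地婆有政策。."):
    print(repr(segment))
```

Joining the returned segments gives back the original sentence, so no characters are added or removed by the split.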
5. (Optional) Release memory
    del ws
    del pos
    del ner
6. Show Results
    def print_word_pos_sentence(word_sentence, pos_sentence):
        assert len(word_sentence) == len(pos_sentence)
        for word, pos in zip(word_sentence, pos_sentence):
            print(f"{word}({pos})", end="\u3000")
        print()
        return

    for i, sentence in enumerate(sentence_list):
        print()
        print(f"'{sentence}'")
        print_word_pos_sentence(word_sentence_list[i], pos_sentence_list[i])
        for entity in sorted(entity_sentence_list[i]):
            print(entity)
Output:

    '傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。'
    傅達仁(Nb)　今(Nd)　將(D)　執行(VC)　安樂死(Na)　,(COMMACATEGORY)　卻(D)　突然(D)　爆出(VJ)　自己(Nh)　20(Neu)　年(Nf)　前(Ng)　遭(P)　緯來(Nb)　體育台(Na)　封殺(VC)　,(COMMACATEGORY)　他(Nh)　不(D)　懂(VK)　自己(Nh)　哪裡(Ncd)　得罪到(VJ)　電視台(Nc)　。(PERIODCATEGORY)
    (0, 3, 'PERSON', '傅達仁')
    (18, 22, 'DATE', '20年前')
    (23, 28, 'ORG', '緯來體育台')

    '美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。'
    美國(Nc)　參議院(Nc)　針對(P)　今天(Nd)　總統(Na)　布什(Nb)　所(D)　提名(VC)　的(DE)　勞工部長(Na)　趙小蘭(Nb)　展開(VC)　認可(VC)　聽證會(Na)　,(COMMACATEGORY)　預料(VE)　她(Nh)　將(D)　會(D)　很(Dfa)　順利(VH)　通過(VC)　參議院(Nc)　支持(VC)　,(COMMACATEGORY)　成為(VG)　該(Nes)　國(Nc)　有史以來(D)　第一(Neu)　位(Nf)　的(DE)　華裔(Na)　女性(Na)　內閣(Na)　成員(Na)　。(PERIODCATEGORY)
    (0, 2, 'GPE', '美國')
    (2, 5, 'ORG', '參議院')
    (7, 9, 'DATE', '今天')
    (11, 13, 'PERSON', '布什')
    (17, 21, 'ORG', '勞工部長')
    (21, 24, 'PERSON', '趙小蘭')
    (42, 45, 'ORG', '參議院')
    (56, 58, 'ORDINAL', '第一')
    (60, 62, 'NORP', '華裔')

    ''

    '土地公有政策??還是土地婆有政策。.'
    土地公(Nb)　有(V_2)　政策(Na)　?(QUESTIONCATEGORY)　?(QUESTIONCATEGORY)　還是(Caa)　土地(Na)　婆(Na)　有(V_2)　政策(Na)　。(PERIODCATEGORY)　.(PERIODCATEGORY)
    (0, 3, 'PERSON', '土地公')

    '… 你確定嗎… 不要再騙了……'
    …(ETCCATEGORY)　 (WHITESPACE)　你(Nh)　確定(VK)　嗎(T)　…(ETCCATEGORY)　 (WHITESPACE)　不要(D)　再(D)　騙(VC)　了(Di)　…(ETCCATEGORY)　…(ETCCATEGORY)

    '最多容納59,000個人,或5.9萬人,再多就不行了.這是環評的結論.'
    最多(VH)　容納(VJ)　59,000(Neu)　個(Nf)　人(Na)　,(COMMACATEGORY)　或(Caa)　5.9萬(Neu)　人(Na)　,(COMMACATEGORY)　再(D)　多(D)　就(D)　不行(VH)　了(T)　.(PERIODCATEGORY)　這(Nep)　是(SHI)　環評(Na)　的(DE)　結論(Na)　.(PERIODCATEGORY)
    (4, 10, 'CARDINAL', '59,000')
    (14, 18, 'CARDINAL', '5.9萬')

    '科長說:1,坪數對人數為1:3。2,可以再增加。'
    科長(Na)　說(VE)　:1,(Neu)　坪數(Na)　對(P)　人數(Na)　為(VG)　1:3(Neu)　。(PERIODCATEGORY)　2(Neu)　,(COMMACATEGORY)　可以(D)　再(D)　增加(VHC)　。(PERIODCATEGORY)
    (4, 6, 'CARDINAL', '1,')
    (12, 13, 'CARDINAL', '1')
    (14, 15, 'CARDINAL', '3')
    (16, 17, 'CARDINAL', '2')
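Each entity in the output above is a tuple of start offset, end offset, entity type, and text, and the offsets index characters of the input sentence. The sketch below checks this against the first example sentence, using data copied from the output; it needs no model files. `entity_offsets_match` and `group_by_type` are hypothetical helper names.

```python
def entity_offsets_match(sentence, entities):
    """True if every (start, end, type, text) tuple satisfies sentence[start:end] == text."""
    return all(sentence[start:end] == text for start, end, _type, text in entities)


def group_by_type(entities):
    """Collect entity texts under their entity type, in order of appearance."""
    grouped = {}
    for _start, _end, entity_type, text in sorted(entities):
        grouped.setdefault(entity_type, []).append(text)
    return grouped


# Sentence and entities copied from the first example in the output above.
sentence = "傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。"
entities = {
    (0, 3, "PERSON", "傅達仁"),
    (18, 22, "DATE", "20年前"),
    (23, 28, "ORG", "緯來體育台"),
}

print(entity_offsets_match(sentence, entities))
print(group_by_type(entities))
```

The end offset is exclusive, which is why ordinary Python slicing recovers the entity text directly.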
Model Details
Please see:
Peng-Hsuan Li, Tsu-Jui Fu, and Wei-Yun Ma. 2020. Why Attention? Analyze BiLSTM Deficiency and Its Remedies in the Case of NER. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI/arXiv).
LICENSE
Copyright (c) 2019 CKIP Lab.
This Work is licensed under the GNU General Public License v3.0 without any warranties. The license text in full can be getting access at the file named COPYING-GPL-3.0. Any person obtaining a copy of this Work and associated documentation files is granted the rights to use, copy, modify, merge, publish, and distribute the Work for any purpose. However if any work is based upon this Work and hence constitutes a Derivative Work, the GPL-3.0 license requires distributions of the Work and the Derivative Work to remain under the same license or a similar license with the Source Code provision obligation.
For commercial license without the Source Code conveying liability, please contact