UTF-8 Byte Sequences - Markus ICU/Unicode - Google Sites
(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/utf-8-bytes.html)

Markus W. Scherer
2002-aug-10

UTF-8 is specified with a simple algorithm, but its large number of sequence lengths and its byte value restrictions result in a large number of illegal byte sequences. A conformant decoder must detect malformed sequences and well-formed but otherwise illegal sequences.

A simple UTF-8 decoder function may return a pair of values (a boolean "is legal" flag and a 32-bit code point) while moving an index to the input code units ahead past the decoded sequence. It is possible to return unique values for each illegal sequence, together with the "legal" flag set to false. With the following suggested error values it is simple to subsume the "is legal" flag in the return value and to test the "is legal" status just by testing if the return value is at or above 80000000 (or 110000 for Unicode), or if it is a surrogate D800..DFFF.

The following table lists well-formed legal and illegal sequences as well as malformed ones, with suggested, unique error return values for illegal ones. It assumes that the decoder function always consumes at least one byte, and after a lead byte consumes as many trail bytes as the lead byte indicates, but that it also stops consuming bytes as soon as (before) it finds the first non-trail byte after a lead byte. This suggested behavior helps with resynchronizing after an illegal sequence.

Other possible error handling strategies would result in fewer or more illegal sequences and values. For example, a much simpler strategy is to treat each of the sequences listed as illegal below as a sequence of single-byte errors, with only 128 error return values but slow resynchronization. Another example is to synchronize as suggested below but to return only 6 different values like -1..-6 indicating the length of the illegal sequence.

The suggested error return values all have bit 31 set, except for single surrogate values (with a *) which are suggested to be returned with their natural values (with the "legal" flag set to false, of course).

The table further assumes the original definition of UTF-8. The Unicode Standard additionally forbids values 110000..7FFFFFFF.

For a Unicode UTF-8 decoder function that follows the suggested scheme for best-effort resynchronization, the ratio of illegal sequences to legal ones is about 2000:1 (decimal)! By comparison, for a similarly synchronizing UTF-16 decoder this ratio is almost the inverse, about 1:500 (decimal).

All values below are written in hexadecimal notation.
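The consumption rule described above (always consume at least one byte, and stop before the first non-trail byte) can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the names `MIN_VALUE`, `lead_length`, `decode_one`, and `decode_all` are hypothetical, and because the table of unique error values is not reproduced here, the sketch uses the simpler alternative mentioned in the text, returning -1..-6 for the length of the illegal sequence. It assumes a Unicode decoder, so values at or above 110000 and non-shortest forms are rejected, and single surrogates are returned with their natural values and the "legal" flag set to false.

```python
from typing import Iterator, Tuple

# Smallest code point that needs each sequence length (index = length);
# used to reject non-shortest ("overlong") forms. Hypothetical helper table.
MIN_VALUE = [0, 0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000]


def lead_length(b: int) -> int:
    """Sequence length indicated by a lead byte under the original UTF-8
    definition (up to 6 bytes), or 0 if b cannot start a sequence."""
    if b < 0x80:
        return 1
    if b < 0xC0:
        return 0  # a trail byte on its own
    if b < 0xE0:
        return 2
    if b < 0xF0:
        return 3
    if b < 0xF8:
        return 4
    if b < 0xFC:
        return 5
    if b < 0xFE:
        return 6
    return 0  # FE and FF never start a sequence


def decode_one(data: bytes, i: int) -> Tuple[bool, int, int]:
    """Decode one sequence starting at data[i].

    Returns (is_legal, value, next_index). For illegal input, value is -n
    where n is the number of bytes consumed, except that a single surrogate
    is returned with its natural value.
    """
    b = data[i]
    length = lead_length(b)
    if length == 0:
        return False, -1, i + 1       # always consume at least one byte
    if length == 1:
        return True, b, i + 1

    value = b & (0x7F >> length)      # payload bits of the lead byte
    j = i + 1
    while j < i + length:
        if j >= len(data) or (data[j] & 0xC0) != 0x80:
            return False, -(j - i), j  # stop before the first non-trail byte
        value = (value << 6) | (data[j] & 0x3F)
        j += 1

    if value < MIN_VALUE[length] or value >= 0x110000:
        return False, -length, j      # well-formed but illegal
    if 0xD800 <= value <= 0xDFFF:
        return False, value, j        # single surrogate, natural value
    return True, value, j


def decode_all(data: bytes) -> Iterator[Tuple[bool, int]]:
    """Yield (is_legal, value) for every sequence in data."""
    i = 0
    while i < len(data):
        ok, value, i = decode_one(data, i)
        yield ok, value
```

With this variant the "is legal" flag can also be recovered from the value alone: a result is illegal if it is negative or lies in D800..DFFF. Because the decoder stops before the first non-trail byte, a truncated sequence followed by ASCII (for example the bytes E2 82 41) yields one error of length 2 and then the legal code point 41, rather than swallowing the following character.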