UTF-8 Byte Sequences - Markus ICU/Unicode - Google Sites

2025-02-02

文章推薦指數： 80 %

投票人數：10人

UTF-8 is specified with a simple algorithm, but its large number of sequence lengths and its byte value restrictions result in a large number of illegal ... SearchthissiteSkiptomaincontentSkiptonavigationUTF-8ByteSequences(Movedhereunchangedfromdefuncthttp://www.mindspring.com/~markus.scherer/unicode/utf-8-bytes.html)MarkusW.Scherer2002-aug-10UTF-8isspecifiedwithasimplealgorithm,butitslargenumberofsequencelengthsanditsbytevaluerestrictionsresultinalargenumberofillegalbytesequences.Aconformantdecodermustdetectmalformedsequencesandwell-formedbutotherwiseillegalsequences.AsimpleUTF-8decoderfunctionmayreturnapairofvalues—aboolean"islegal"flaganda32-bitcodepoint—whilemovinganindextotheinputcodeunitsaheadpastthedecodedsequence.Itispossibletoreturnuniquevaluesforeachillegalsequence,togetherwiththe"legal"flagsettofalse.Withthefollowingsuggestederrorvaluesitissimpletosubsumethe"islegal"flaginthereturnvalueandtotestthe"islegal"statusjustbytestingifthereturnvalueisatorabove80000000(or110000forUnicode),orifitisasurrogateD800..DFFF.Thefollowingtablelistswell-formedlegalandillegalsequencesaswellasmalformedones,withsuggested,uniqueerrorreturnvaluesforillegalones.Itassumesthatthedecoderfunctionalwaysconsumesatleastonebyte,andafteraleadbyteconsumesasmanytrailbytesastheleadbyteindicates,butthatitalsostopsconsumingbytesassoonas(before)itfindsthefirstnon-trailbyteafteraleadbyte.Thissuggestedbehaviorhelpsresynchronizingafteranillegalsequence.Otherpossibleerrorhandlingstrategieswouldresultinfewerormoreillegalsequencesandvalues.Forexample,amuchsimplerstrategyistotreateachofthesequenceslistedasillegalbelowasasequenceofsingle-byteerrors,withonly128errorreturnvaluesbutslowresynchronization.Anotherexampleistosynchronizeassuggestedbelowbuttoreturnonly6differentvalueslike-1..-6indicatingthelengthoftheillegalsequence.Thesuggestederrorreturnvaluesallhavebit31set,exceptforsinglesurrogatevalues(witha*)whicharesuggestedtobereturnedwiththeirnaturalvalues(withthe"legal"flagsettofalse,ofcourse).ThetablefurtherassumestheoriginaldefinitionofUTF-8.TheUnicodeStandardadditionallyforbidsvalues110000..7FFFFFFF.ForaUnicodeUTF-8decoderfunctionthatfollowsthesuggestedschemeforbest-effortresynchronizationtheratioofillegalsequencestolegalonesisabout200010:1!Bycomparison,forasimilarlysynchronizingUTF-16decoderthisratioisalmosttheinverse,about1:50010.Allvaluesbelowarewritteninhexadecimalnotation.ReportabusePagedetailsPageupdatedGoogleSitesReportabuse

請為這篇文章評分？

延伸文章資訊

UTF-8 - 维基百科，自由的百科全书

UTF-8（8-bit Unicode Transformation Format）是一種針對Unicode的可變長度字元編碼，也是一种前缀码。它可以用一至四个字节对Unicode字符集中的所有...

UTF-8 - OpenHome.cc

Unicode 的實作方式之一UTF-8（8-bit Unicode Transformation Format），使用可變 ... 作為位元組順序記號（Byte-Order Mark，BOM）...

UTF-8 - 字嗨！

因此它成為E-mail、網頁、程式碼等各種純文字文件處理Unicode時最通用的編碼方式。編碼方式. 碼點範圍, 序列長度, Byte 1, Byte 2, Byte 3, Byte 4. 0...

UTF 8 - :: 痞客邦::

BIG-5 使用兩個byte 的固定長度編碼， UTF-8 使用1 到4 個byte 的浮動長度編碼 ( 例如字母C ，在UTF-8 只會用一個byte ，中文字大部分會有3 個byte ...

請問"李襎"這個字是算幾個BYTE - iT 邦幫忙

他沒有算錯, 在UTF-8 的編碼，一個中文3 bytes big5 一個中文算2 byte, 不同的編碼，中文的長度不同. 3 則回應分享. 回應; 沒有幫助. ccsh1205 （發問者）...

UTF-8 Byte Sequences - Markus ICU/Unicode - Google Sites

文章推薦指數： 80 %

請為這篇文章評分？

延伸文章資訊

最新文章

相關網站資訊

中日口譯課程

中國生產力中心口譯評價

紙的應用