UTF-8 Byte Sequences - Markus ICU/Unicode - Google Sites

文章推薦指數: 80 %
投票人數:10人

UTF-8 is specified with a simple algorithm, but its large number of sequence lengths and its byte value restrictions result in a large number of illegal ... SearchthissiteSkiptomaincontentSkiptonavigationUTF-8ByteSequences(Movedhereunchangedfromdefuncthttp://www.mindspring.com/~markus.scherer/unicode/utf-8-bytes.html)MarkusW.Scherer2002-aug-10UTF-8isspecifiedwithasimplealgorithm,butitslargenumberofsequencelengthsanditsbytevaluerestrictionsresultinalargenumberofillegalbytesequences.Aconformantdecodermustdetectmalformedsequencesandwell-formedbutotherwiseillegalsequences.AsimpleUTF-8decoderfunctionmayreturnapairofvalues—aboolean"islegal"flaganda32-bitcodepoint—whilemovinganindextotheinputcodeunitsaheadpastthedecodedsequence.Itispossibletoreturnuniquevaluesforeachillegalsequence,togetherwiththe"legal"flagsettofalse.Withthefollowingsuggestederrorvaluesitissimpletosubsumethe"islegal"flaginthereturnvalueandtotestthe"islegal"statusjustbytestingifthereturnvalueisatorabove80000000(or110000forUnicode),orifitisasurrogateD800..DFFF.Thefollowingtablelistswell-formedlegalandillegalsequencesaswellasmalformedones,withsuggested,uniqueerrorreturnvaluesforillegalones.Itassumesthatthedecoderfunctionalwaysconsumesatleastonebyte,andafteraleadbyteconsumesasmanytrailbytesastheleadbyteindicates,butthatitalsostopsconsumingbytesassoonas(before)itfindsthefirstnon-trailbyteafteraleadbyte.Thissuggestedbehaviorhelpsresynchronizingafteranillegalsequence.Otherpossibleerrorhandlingstrategieswouldresultinfewerormoreillegalsequencesandvalues.Forexample,amuchsimplerstrategyistotreateachofthesequenceslistedasillegalbelowasasequenceofsingle-byteerrors,withonly128errorreturnvaluesbutslowresynchronization.Anotherexampleistosynchronizeassuggestedbelowbuttoreturnonly6differentvalueslike-1..-6indicatingthelengthoftheillegalsequence.Thesuggestederrorreturnvaluesallhavebit31set,exceptforsinglesurrogatevalues(witha*)whicharesuggestedtobereturnedwiththeirnaturalvalues(withthe"legal"flagsettofalse,ofcourse).ThetablefurtherassumestheoriginaldefinitionofUTF-8.TheUnicodeStandardadditionallyforbidsvalues110000..7FFFFFFF.ForaUnicodeUTF-8decoderfunctionthatfollowsthesuggestedschemeforbest-effortresynchronizationtheratioofillegalsequencestolegalonesisabout200010:1!Bycomparison,forasimilarlysynchronizingUTF-16decoderthisratioisalmosttheinverse,about1:50010.Allvaluesbelowarewritteninhexadecimalnotation.ReportabusePagedetailsPageupdatedGoogleSitesReportabuse



請為這篇文章評分?