FAQ - UTF-8, UTF-16, UTF-32 & BOM - Unicode


UTF-8, UTF-16, UTF-32 & BOM

General questions, relating to UTF or Encoding Form

Q: Is Unicode a 16-bit encoding?
In its first version, from 1991 to 1995, Unicode was a 16-bit encoding, but starting with Unicode 2.0 (July, 1996), the Unicode Standard has encoded characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.

Q: Can Unicode text be represented in more than one way?
Yes, there are several possible representations of Unicode data, including UTF-8, UTF-16 and UTF-32. In addition, there are compression transformations such as the one described in UTS #6: A Standard Compression Scheme for Unicode (SCSU).

Q: What is a UTF?
A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.

Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must map all code points (except surrogate code points) to unique byte sequences. This includes reserved (unassigned) code points and the 66 noncharacters (including U+FFFE and U+FFFF).

The SCSU compression method, even though it is reversible, is not a UTF because the same string can map to very many different byte sequences, depending on the particular SCSU compressor. [AF]

Q: Where can I get more information on encoding forms?
For the formal definition of UTFs see Section 3.9, Unicode Encoding Forms in The Unicode Standard. For more information on encoding forms see UTR #17: Unicode Character Encoding Model. [AF]

Q: How do I write a UTF converter?
The freely available open source project International Components for Unicode (ICU) has UTF conversion built into it. The latest version may be downloaded from the ICU Project web site. [AF]

Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?
None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as U+FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.

A conformant process must not interpret illegal or ill-formed byte sequences as characters; however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.

Q: Which of the UTFs do I need to support?
UTF-8 is most common on the web. UTF-16 is used by Java and Windows. UTF-8 and UTF-32 are used by Linux and various Unix systems. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing. [AF]

Q: What are some of the differences between the UTFs?
The following table summarizes some of the properties of each of the UTFs.

Name                        UTF-8    UTF-16   UTF-16BE    UTF-16LE       UTF-32   UTF-32BE    UTF-32LE
Smallest code point         0000     0000     0000        0000           0000     0000        0000
Largest code point          10FFFF   10FFFF   10FFFF      10FFFF         10FFFF   10FFFF      10FFFF
Code unit size              8 bits   16 bits  16 bits     16 bits        32 bits  32 bits     32 bits
Byte order                  N/A      <BOM>    big-endian  little-endian  <BOM>    big-endian  little-endian
Fewest bytes per character  1        2        2           2              4        4           4
Most bytes per character    4        4        4           4              4        4           4

In the table, <BOM> indicates that the byte order is determined by a byte order mark, if present at the beginning of the data stream; otherwise it is big-endian.
[AF]

Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?
UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used. [AF]

Q: Is there a standard method to package a Unicode character so it fits an 8-bit ASCII stream?
There are several options for making Unicode fit into an 8-bit format:

- Use UTF-8. This preserves ASCII, but not Latin-1, because the characters >127 are different from Latin-1. UTF-8 uses the bytes in the ASCII range only for ASCII characters. Therefore, it works well in any environment where ASCII characters have a significance as syntax characters, e.g. file name syntaxes, markup languages, etc., but where all other characters may use arbitrary bytes.
  For example: “Latin Small Letter s with Acute” (015B) would be encoded as two bytes: C5 9B.
- Use Java or C style escapes, of the form \uXXXX or \xXXXX. This format is not standard for text files, but well defined in the framework of the languages in question, primarily for source files.
  For example: The Polish word “wyjście” with character “Latin Small Letter s with Acute” (015B) in the middle (ś is one character) would look like: “wyj\u015Bcie”.
- Use the &#xXXXX; or &#DDD; numeric character escapes as in HTML or XML. Again, these are not standard for plain text files, but well defined within the framework of these markup languages.
  For example: “wyjście” would look like “wyj&#x015B;cie”.
- Use Punycode for converting labels that are part of network identifiers into a form compatible with ASCII labels. This method is required as part of IDNA2008 and earlier for Internationalized Domain Names (IDN). It reencodes Unicode into a subset of ASCII characters containing only letters, digits, and the hyphen.
  For example: the domain name “wyjście.com” would look like “xn--wyjcie-5ib.com”, with the “xn--” prefix marking it as Punycode and with any ASCII characters collected at the front.
- Use SCSU.
  This format compresses Unicode into 8-bit format, preserving most of ASCII, but using some of the control codes as commands for the decoder. However, while ASCII text will look like ASCII text after being encoded in SCSU, other characters may occasionally be encoded with the same byte values, making SCSU unsuitable for 8-bit channels that blindly interpret any of the bytes as ASCII characters.
  For example: “<SC2> wyjÛcie” where <SC2> indicates the byte 0x12 and “Û” corresponds to byte 0xDB. [AF]

Q: Which method of packing Unicode characters into an 8-bit stream is the best?
The choice of approach depends on the circumstances:

- UTF-8 is the most widely used ASCII-compatible encoding form for Unicode. It is designed to be used transparently, meaning that any part of the data that was in ASCII is still in ASCII (and without change in relative location) and no other parts are. It is also reasonably compact and independent of byte order issues. It has become the preferred form for Unicode text files. The downside of UTF-8 is that without converting into a format that can be displayed on your system, you cannot tell which non-ASCII characters are in your data.
- Character escapes or numeric character entities let you see which code point had been escaped, even if you are working in an ASCII environment or don't have the font to view the character, or even when you might be unable to recognize the character. In the right circumstances, they are appropriate in source code or source documents. Character escapes and entities use more space, which makes them unattractive except for occasional use.
- Punycode is required as part of the domain name protocols, but not suitable for anything but short strings.
- SCSU was designed for compression of short strings. It uses the least space, but cannot be used transparently in most 8-bit environments. [AF]

Q: Which of these formats is the most standard?
All of these require that the receiver can understand the format, but UTF-8 is considered one of the three equivalent Unicode Encoding Forms and is therefore standard. The use of language-style escapes or numeric character escapes outside their given context would definitely be considered non-standard, but could be a good solution for internal data transmission. SCSU is itself a standard (for compressed data streams), but few general purpose receivers support SCSU, so it is again most useful in internal data transmission. [AF]

UTF-8 FAQ

Q: What is the definition of UTF-8?
UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition, see Section 2.5, Encoding Forms and Section 3.9, Unicode Encoding Forms in The Unicode Standard. See, in particular, Table 3-6 UTF-8 Bit Distribution and Table 3-7 Well-formed UTF-8 Byte Sequences, which give succinct summaries of the encoding form. Make sure you refer to the latest version of the Unicode Standard, as the Unicode Technical Committee has tightened the definition of UTF-8 over time to more strictly enforce unique sequences and to prohibit encoding of certain invalid characters. There is an Internet RFC 3629 about UTF-8. UTF-8 is also defined in Annex D of ISO/IEC 10646. See also the question above, How do I write a UTF converter?

Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian?
Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings; it has nothing to do with byte order. [AF]

Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying system uses ASCII or EBCDIC encoding?
There is only one definition of UTF-8. It is precisely the same, whether the data were converted from ASCII or EBCDIC based character sets. However, byte sequences from standard UTF-8 won’t interoperate well in an EBCDIC system, because of the different arrangements of control codes between ASCII and EBCDIC. UTR #16: UTF-EBCDIC defines a specialized UTF that will interoperate in EBCDIC systems.
[AF]

Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one 4-byte sequence or as two separate 3-byte sequences?
The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence. However, there is a widespread practice of generating pairs of 3-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26: Compatibility Encoding Scheme for UTF-16: 8-bit (CESU-8) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it were UTF-8, due to the similarity of the formats. [AF]

Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error. [AF]

UTF-16 FAQ

Q: What is UTF-16?
UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16. [AF]

Q: What are surrogates?
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.

Q: What’s the algorithm to convert from UTF-16 to code points?
The Unicode Standard used to contain a short algorithm; now there is just a bit distribution table that shows the relation between surrogates and the resulting supplementary code points, but does not give an algorithm. Here are three short code snippets that translate the information from the bit distribution table into C code that will convert to and from UTF-16.

Using the following type definitions

    #include <stdint.h>

    typedef uint16_t UTF16;
    typedef uint32_t UTF32;

the first snippet calculates the high (or leading) surrogate from a character code C.

    const UTF16 HI_SURROGATE_START = 0xD800;

    UTF16 X = (UTF16) C;                    /* low 16 bits of C */
    UTF32 U = (C >> 16) & ((1 << 5) - 1);   /* plane number, 1..16 */
    UTF16 W = (UTF16) U - 1;
    UTF16 HiSurrogate = (UTF16) (HI_SURROGATE_START | (W << 6) | (X >> 10));

where "X", "U" and "W" correspond to the labels used in Table 3-5 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.

    const UTF16 LO_SURROGATE_START = 0xDC00;

    UTF16 X = (UTF16) C;
    UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | (X & ((1 << 10) - 1)));

Finally, the reverse, where hi and lo are the high and low surrogate, and "C" the resulting character

    UTF32 X = ((hi & ((1 << 6) - 1)) << 10) | (lo & ((1 << 10) - 1));
    UTF32 W = (hi >> 6) & ((1 << 5) - 1);
    UTF32 U = W + 1;
    UTF32 C = (U << 16) | X;

A caller would need to ensure that "C", "hi", and "lo" are in the appropriate ranges. [AF]

Q: Isn’t there a simpler way to do this?
There is a much simpler computation that does not try to follow the bit distribution table.

    // constants
    const UTF32 LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
    // negative as a signed value; relies on unsigned wraparound
    const UTF32 SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

    // computations: code point to surrogates
    UTF16 lead = LEAD_OFFSET + (codepoint >> 10);
    UTF16 trail = 0xDC00 + (codepoint & 0x3FF);

    // computations: surrogates back to code point
    UTF32 codepoint = (lead << 10) + trail + SURROGATE_OFFSET;

[MD]

Q: Why are some people opposed to UTF-16?
People familiar with variable width East Asian character sets such as Shift-JIS (SJIS) are understandably nervous about UTF-16, which sometimes requires two code units to represent a single character. They are well acquainted with the problems that variable-width codes have caused. However, there are some important differences between the mechanisms used in SJIS and UTF-16:

Overlap:

- In SJIS, there is overlap between the leading and trailing code unit values, and between the trailing and single code unit values. This causes a number of problems:
  - It causes false matches. For example, searching for an “a” may match against the trailing code unit of a Japanese character.
  - It prevents efficient random access. To know whether you are on a character boundary, you have to search backwards to find a known boundary.
  - It makes the text extremely fragile. If a unit is dropped from a leading-trailing code unit pair, many following characters can be corrupted.
- In UTF-16, the code point ranges for high and low surrogates, as well as for single units, are all completely disjoint. None of these problems occur:
  - There are no false matches.
  - The location of the character boundary can be directly determined from each code unit value.
  - A dropped surrogate will corrupt only a single character.

Frequency:

- The vast majority of SJIS characters require 2 units, but characters using single units occur commonly and often have special importance, for example in file names.
- With UTF-16, relatively few characters require 2 units. The vast majority of characters in common use are single code units. Even in East Asian text, the incidence of surrogate pairs should be well less than 1% of all text storage on average. (Certain documents, of course, may have a higher incidence of surrogate pairs, just as phthisique is a fairly infrequent word in English, but may occur quite often in a particular scholarly text.) The recent increased use of emoji means that the percentage of widely-used supplementary characters has also increased. [AF]

Q: Will UTF-16 ever be extended to more than a million characters?
No. Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other UTFs) can represent larger integers, these policies mean that all encoding forms will always represent the same set of characters. Over a million possible codes is far more than enough for the goal of Unicode of encoding characters, not glyphs. Unicode is not designed to encode arbitrary data. If you wanted, for example, to give each “instance of a character on paper throughout history” its own code, you might need trillions or quadrillions of such codes; noble as this effort might be, you would not use Unicode for such an encoding. [AF]

Q: Are there any 16-bit values that are invalid?
Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆. [AF]

Q: What about noncharacters? Are they invalid?
Not at all. Noncharacters are valid in UTFs and must be properly converted. For more details on the definition and use of noncharacters, as well as their correct representation in each UTF, see the Noncharacters FAQ.

Q: Because most supplementary characters are uncommon, does that mean I can ignore them?
Most supplementary characters (expressed with surrogate pairs in UTF-16) are not too common. However, that does not mean that supplementary characters should be neglected. Among them are a number of individual characters that are very popular, as well as many sets important to East Asian procurement specifications. Among the notable supplementary characters are:

- many popular emoji and emoticons
- symbols used for interoperating with Wingdings and Webdings
- numerous small sets of CJK characters important for procurement, including personal and place names
- variation selectors used for all ideographic variation sequences
- numerous minority scripts important for some user communities
- some highly salient historic scripts, such as Egyptian hieroglyphics

Ken Lunde has an interesting presentation file on this topic, with a Top Ten list: Why Support Beyond-BMP Code Points?
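
The rule given above for invalid 16-bit values (a high surrogate must be followed by a low surrogate, and a low surrogate must be preceded by a high one) can be sketched in C as follows. This is only a minimal illustration, not code from the Unicode Standard or from ICU; the names Utf16Scan and utf16_scan are hypothetical. The sketch counts BMP code points, supplementary code points, and unpaired surrogates in a buffer of 16-bit code units. Noncharacters such as U+FFFE are counted as ordinary BMP code points, since they are valid in UTFs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of scanning a buffer of UTF-16 code units (hypothetical type). */
typedef struct {
    size_t bmp;            /* code points encoded as one code unit */
    size_t supplementary;  /* code points encoded as a surrogate pair */
    size_t unpaired;       /* surrogates with no partner (ill-formed) */
} Utf16Scan;

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Hypothetical helper: classify every code unit in units[0..len). */
static Utf16Scan utf16_scan(const uint16_t *units, size_t len)
{
    Utf16Scan r = {0, 0, 0};
    assert(units != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        uint16_t u = units[i];
        if (is_high_surrogate(u)) {
            if (i + 1 < len && is_low_surrogate(units[i + 1])) {
                r.supplementary++;
                i++;            /* consume the trailing surrogate as well */
            } else {
                r.unpaired++;   /* high surrogate not followed by a low one */
            }
        } else if (is_low_surrogate(u)) {
            r.unpaired++;       /* low surrogate not preceded by a high one */
        } else {
            r.bmp++;            /* includes noncharacters such as U+FFFE */
        }
    }
    return r;
}
```

A converter could run such a scan first and report an error whenever the unpaired count is nonzero. A nonzero supplementary count indicates input for which BMP-only handling would not be sufficient.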
Q: How should I handle supplementary characters in my code?
Compared with BMP characters as a whole, the supplementary characters occur less commonly in text. This remains true now, even though many thousands of supplementary characters have been added to the standard, and a few individual characters, such as popular emoji, have become quite common. The relative frequency of BMP characters, and of the ASCII subset within the BMP, can be taken into account when optimizing implementations for best performance: execution speed, memory usage, and data storage.

Such strategies are particularly useful for UTF-16 implementations, where BMP characters require one 16-bit code unit to process or store, whereas supplementary characters require two. Strategies that optimize for the BMP are less useful for UTF-8 implementations, but if the distribution of data warrants it, an optimization for the ASCII subset may make sense, as that subset only requires a single byte for processing and storage in UTF-8.

Q: What is the difference between UCS-2 and UTF-16?
UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.

UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters.

Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters, nor would it be able to support most emoji, for example. [AF]

UTF-32 FAQ

Q: What is UTF-32?
Any Unicode character can be represented as a single 32-bit unit in UTF-32. This single 4-byte code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. UTF-32 is a subset of the encoding mechanism called UCS-4 in ISO 10646. For more information, see Section 3.9, Unicode Encoding Forms in The Unicode Standard.
[AF]

Q: Should I use UTF-32 (or UCS-4) for storing Unicode strings in memory?
This depends. It may seem compelling to use UTF-32 as your internal string format because it uses one code unit per code point. However, Unicode characters are rarely processed in complete isolation. Combining character sequences may need to be processed as a unit, for example. This issue not only affects complex scripts, but also seemingly simple things like emoji. Defining your APIs so they work primarily with strings and substrings, instead of characters and character offsets, will make it easier to correctly support combining character sequences. This will also make the distinction between working in UTF-32 and other encoding forms less relevant.

The downside of UTF-32 is that it forces you to use 32 bits for each character, when only 21 bits are ever needed. The number of significant bits needed for the average character in common texts is much lower, making the ratio effectively that much worse. Increasing the storage for the same number of characters does have its cost in applications dealing with large volumes of text data: it can mean exhausting cache limits sooner; it can result in noticeably increased read/write times or in reaching bandwidth limits; and it requires more space for storage. What a number of implementations do is to represent strings with UTF-8 or UTF-16, but individual character values with UTF-32.

If you frequently need to access APIs that require string parameters to be in UTF-32, it may be more convenient to work with UTF-32 strings all the time. In many situations the extra storage does not matter, and the convenience of having a fixed number of code units per character can be the deciding factor.

The chief selling point for Unicode is providing a representation for all the world’s characters, eliminating the need for juggling multiple character sets and avoiding the associated data corruption problems. These features were enough to swing industry to the side of using Unicode as in-memory format. While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling. [AF]

Q: How about using UTF-32 interfaces in my APIs?
Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16 APIs the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the required functionality at the high levels.

If it's ever necessary to locate the nth character, indexing by character can be implemented as a high level operation. However, while converting from such a UTF-16 code unit index to a character index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as characters, instead of code units, resulted in a 10× degradation. While there are some interesting optimizations that can be performed, it will always be slower on average. Therefore locating other boundaries, such as grapheme, word, line or sentence boundaries, proceeds directly from the code unit index, not indirectly via an intermediate character code index.

In situations where it is necessary to work with the "units" that the user interacts with, indexing by Unicode character gives only limited advantage over indexing by code unit: many times what users perceive as a single unit, an emoji for example, is represented as a combining or other character sequence, and it makes little difference in iterating over such "units" whether the underlying code uses 16-bit or 32-bit code units.

Q: Doesn’t it cause a problem to have only UTF-16 string APIs, instead of UTF-32 char APIs?
Almost all international functions (upper-, lower-, titlecasing, case folding, drawing, measuring, collation, transliteration, grapheme-, word-, linebreaks, etc.) should take string parameters in the API, not single code-points (UTF-32). Single code-point APIs almost always produce the wrong results except for very simple languages, either because you need more context to get the right answer, or because you need to generate a sequence of characters to return the right answer, or both.
For example, any Unicode-compliant collation (see UTS #10: Unicode Collation Algorithm (UCA)) must be able to handle sequences of more than one code-point, and treat that sequence as a single entity. Trying to collate by handling single code-points at a time would get the wrong answer. The same will happen for drawing or measuring text a single code-point at a time; because scripts like Arabic are contextual, the width of x plus the width of y is not equal to the width of xy. Once you get beyond basic typography, the same is true for English as well; because of kerning and ligatures the width of “fi” in the font may be different than the width of “f” plus the width of “i”. Casing operations must return strings, not single code-points; see https://www.unicode.org/charts/case/. In particular, the title casing operation requires strings as input, not single code-points at a time.

Storing a single code point in a struct or class instead of a string would exclude support for graphemes, such as “ch” for Slovak, where a single code point may not be sufficient, but a character sequence is needed to express what is required. In other words, most API parameters and fields of composite data types should not be defined as a character, but as a string. And if they are strings, it does not matter what the internal representation of the string is.

Given that any industrial-strength text and internationalization support API has to be able to handle sequences of characters, it makes little difference whether the string is internally represented by a sequence of UTF-16 code units, or by a sequence of code-points (= UTF-32 code units). Both UTF-16 and UTF-8 are designed to make working with substrings easy, by the fact that the sequence of code units for a given code point is unique. [AF]

Q: Are there exceptions to the rule of exclusively using string parameters in APIs?
The main exceptions are very low-level operations such as getting character properties (e.g. General Category or Canonical Class in the UCD). For those it is handy to have interfaces that convert quickly to and from UTF-16 and UTF-32, and that allow you to iterate through strings returning UTF-32 values (even though the internal format is UTF-16).

Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-32? As one 4-byte sequence or as two 4-byte sequences?
The definition of UTF-32 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence.

Q: How do I convert an unpaired UTF-16 surrogate to UTF-32?
If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must treat this as an error. By representing such an unpaired surrogate on its own, the resulting UTF-32 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. [AF]

Byte Order Mark (BOM) FAQ

Q: What is a BOM?
A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol. [AF]

Q: Where is a BOM useful?
A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format. It can also serve as a hint indicating that the file is in Unicode, as opposed to in a legacy encoding, and furthermore, it acts as a signature for the specific encoding form used. [AF]

Q: What does ‘endian’ mean?
Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian. When data is exchanged, bytes that appear in the "correct" order on the sending system may appear to be out of order on the receiving system. In that situation, a BOM would look like 0xFFFE, which is a noncharacter, allowing the receiving system to apply byte reversal before processing the data. UTF-8 is byte oriented and therefore does not have that issue. Nevertheless, an initial BOM might be useful to identify the data stream as UTF-8. [AF]

Q: When a BOM is used, is it only in 16-bit Unicode text?
No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples:

Bytes        Encoding Form
00 00 FE FF  UTF-32, big-endian
FF FE 00 00  UTF-32, little-endian
FE FF        UTF-16, big-endian
FF FE        UTF-16, little-endian
EF BB BF     UTF-8

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature, an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts. [AF]

Q: What should I do with U+FEFF in the middle of a file?
In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character. [AF]

Q: I am using a protocol that has BOM at the start of text. How do I represent an initial ZWNBSP?
Use U+2060 WORD JOINER instead.

Q: How do I tag data that does not interpret U+FEFF as a BOM?
Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16. [MD]

Q: Why wouldn’t I always use a protocol that requires a BOM?
Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP.

Do not tag every string in a database or set of fields with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM).

Q: How should I deal with BOMs?
Here are some guidelines to follow:

- A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
- Some protocols allow optional BOMs in the case of untagged text. In those cases:
  - Where a text data stream is known to be plain text, but of unknown encoding, a BOM can be used as a signature. If there is no BOM, the encoding could be anything.
  - Where a text data stream is known to be plain Unicode text (but not which endian), then a BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
- Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
- Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used. (See also Q: What is the difference between UCS-2 and UTF-16?) [AF]
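
The BOM byte signatures listed earlier in this FAQ can be checked with a few comparisons. The following C sketch is a hedged illustration rather than a reference implementation; BomKind and detect_bom are hypothetical names. One assumption deserves attention: the bytes FF FE 00 00 are read here as a UTF-32 little-endian BOM, although they could also be a UTF-16 little-endian BOM followed by U+0000. Preferring the UTF-32 reading is a choice made by this sketch, not a rule stated in the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encoding forms that a leading BOM can identify (hypothetical enum). */
typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
} BomKind;

/* Hypothetical helper: identify a BOM at the start of data[0..len) and
   store its length in bytes in *bom_len (0 when no BOM is found). */
static BomKind detect_bom(const unsigned char *data, size_t len, size_t *bom_len)
{
    /* UTF-32 signatures come first because FF FE 00 00 begins with the
       UTF-16LE signature FF FE (an assumption of this sketch). */
    static const struct {
        BomKind kind;
        size_t n;
        unsigned char bytes[4];
    } table[] = {
        { BOM_UTF32_BE, 4, {0x00, 0x00, 0xFE, 0xFF} },
        { BOM_UTF32_LE, 4, {0xFF, 0xFE, 0x00, 0x00} },
        { BOM_UTF8,     3, {0xEF, 0xBB, 0xBF} },
        { BOM_UTF16_BE, 2, {0xFE, 0xFF} },
        { BOM_UTF16_LE, 2, {0xFF, 0xFE} },
    };
    assert(data != NULL || len == 0);
    assert(bom_len != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (len >= table[i].n && memcmp(data, table[i].bytes, table[i].n) == 0) {
            *bom_len = table[i].n;
            return table[i].kind;
        }
    }
    *bom_len = 0;
    return BOM_NONE;
}
```

When the result is BOM_NONE, the guidelines above still apply: text known to be Unicode but of unknown byte order should be interpreted as big-endian, and text of unknown encoding could be anything. A caller that finds a UTF-8 BOM would typically skip bom_len bytes before handing the data to a consumer that expects ASCII characters at the start.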


