UTF-8, UTF-16, UTF-32 & BOM
General questions, relating to UTF or Encoding Form
Q: Is Unicode a 16-bit encoding?
No. In its first version, from 1991 to 1995, Unicode was a 16-bit encoding, but starting with Unicode 2.0 (July 1996), the Unicode Standard has encoded characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.
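The code unit counts just described can be sketched in C as follows. This is only an illustration: the function names are hypothetical, and the sketch assumes the caller passes a Unicode scalar value (in the range 0..10FFFF and not a surrogate code point).

```c
#include <assert.h>
#include <stdint.h>

/* Code units needed for one Unicode scalar value in each encoding form.
   Assumes cp is in 0..10FFFF and is not a surrogate code point. */

int utf8_code_units(uint32_t cp)      /* 8-bit units: one to four */
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;       /* U+0000..U+007F    */
    if (cp < 0x800)   return 2;       /* U+0080..U+07FF    */
    if (cp < 0x10000) return 3;       /* U+0800..U+FFFF    */
    return 4;                         /* U+10000..U+10FFFF */
}

int utf16_code_units(uint32_t cp)     /* 16-bit units: one or two */
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}

int utf32_code_units(uint32_t cp)     /* 32-bit units: always one */
{
    assert(cp <= 0x10FFFF);
    return 1;
}
```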
Q: Can Unicode text be represented in more than one way?
Yes, there are several possible representations of Unicode data, including UTF-8, UTF-16 and UTF-32. In addition, there are compression transformations such as the one described in UTS #6: A Standard Compression Scheme for Unicode (SCSU).
Q: What is a UTF?
A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.
Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must map all code points (except surrogate code points) to unique byte sequences. This includes reserved (unassigned) code points and the 66 noncharacters (including U+FFFE and U+FFFF).
The SCSU compression method, even though it is reversible, is not a UTF because the same string can map to very many different byte sequences, depending on the particular SCSU compressor. [AF]
Q: Where can I get more information on encoding forms?
For the formal definition of UTFs see Section 3.9, Unicode Encoding Forms in The Unicode Standard. For more information on encoding forms see UTR #17: Unicode Character Encoding Model. [AF]
Q: How do I write a UTF converter?
The freely available open source project International Components for Unicode (ICU) has UTF conversion built into it. The latest version may be downloaded from the ICU Project web site. [AF]
Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?
None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as U+FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.
A conformant process must not interpret illegal or ill-formed byte sequences as characters; however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
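As an illustration of the replacement-marker strategy, here is a minimal sketch of a UTF-8 decoding step in C. The type and function names are hypothetical, and the recovery policy (emit U+FFFD, then resume at the byte that broke the sequence) is one of the options mentioned above, not the only conformant one. Overlong forms and encoded surrogates are also reported as U+FFFD; for simplicity they are consumed as a single error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t code_point;  /* decoded value, or 0xFFFD on error */
    size_t   length;      /* bytes consumed (always at least 1) */
} utf8_result;

/* Decode one code point from s[0..n-1], where n >= 1. */
utf8_result utf8_decode_one(const uint8_t *s, size_t n)
{
    assert(s != NULL && n > 0);
    utf8_result r = { 0xFFFD, 1 };
    uint8_t b0 = s[0];
    size_t need;       /* continuation bytes expected after b0 */
    uint32_t cp, min;  /* min = smallest value allowed for this length */

    if (b0 < 0x80) { r.code_point = b0; return r; }
    else if ((b0 & 0xE0) == 0xC0) { need = 1; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { need = 2; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { need = 3; cp = b0 & 0x07; min = 0x10000; }
    else return r;     /* stray continuation byte or invalid lead byte */

    for (size_t i = 1; i <= need; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            r.length = i;  /* resume at the byte that broke the sequence */
            return r;
        }
        cp = (cp << 6) | (s[i] & 0x3Fu);
    }
    r.length = need + 1;
    /* Reject overlong forms, surrogates and values above 10FFFF. */
    if (cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF))
        r.code_point = cp;
    return r;
}
```

For the byte pair <C3 41>, this sketch reports U+FFFD with a length of 1, so the caller continues at 41 and decodes it as “A”.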
Q: Which of the UTFs do I need to support?
UTF-8 is most common on the web. UTF-16 is used by Java and Windows. UTF-8 and UTF-32 are used by Linux and various Unix systems. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing. [AF]
Q: What are some of the differences between the UTFs?
The following table summarizes some of the properties of each of the UTFs.

Name                        UTF-8    UTF-16   UTF-16BE     UTF-16LE        UTF-32   UTF-32BE     UTF-32LE
Smallest code point         0000     0000     0000         0000            0000     0000         0000
Largest code point          10FFFF   10FFFF   10FFFF       10FFFF          10FFFF   10FFFF       10FFFF
Code unit size              8 bits   16 bits  16 bits      16 bits         32 bits  32 bits      32 bits
Byte order                  N/A      <BOM>    big-endian   little-endian   <BOM>    big-endian   little-endian
Fewest bytes per character  1        2        2            2               4        4            4
Most bytes per character    4        4        4            4               4        4            4

In the table, <BOM> indicates that the byte order is determined by a byte order mark, if present at the beginning of the data stream, otherwise it is big-endian. [AF]
Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?
UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used. [AF]
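The two byte serializations can be sketched in C as below. The function names are hypothetical; the sketch only shows how one code unit is written most significant byte first (BE) or least significant byte first (LE), independent of the byte order of the machine running the code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Big-endian: most significant byte first. */
void utf16_store_be(uint16_t unit, uint8_t out[2])
{
    assert(out != NULL);
    out[0] = (uint8_t)(unit >> 8);
    out[1] = (uint8_t)(unit & 0xFF);
}

/* Little-endian: least significant byte first. */
void utf16_store_le(uint16_t unit, uint8_t out[2])
{
    assert(out != NULL);
    out[0] = (uint8_t)(unit & 0xFF);
    out[1] = (uint8_t)(unit >> 8);
}

void utf32_store_be(uint32_t unit, uint8_t out[4])
{
    assert(out != NULL);
    out[0] = (uint8_t)(unit >> 24);
    out[1] = (uint8_t)((unit >> 16) & 0xFF);
    out[2] = (uint8_t)((unit >> 8) & 0xFF);
    out[3] = (uint8_t)(unit & 0xFF);
}

void utf32_store_le(uint32_t unit, uint8_t out[4])
{
    assert(out != NULL);
    out[0] = (uint8_t)(unit & 0xFF);
    out[1] = (uint8_t)((unit >> 8) & 0xFF);
    out[2] = (uint8_t)((unit >> 16) & 0xFF);
    out[3] = (uint8_t)(unit >> 24);
}
```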
Q: Is there a standard method to package a Unicode character so it fits an 8-bit ASCII stream?
There are several options for making Unicode fit into an 8-bit format:
- Use UTF-8. This preserves ASCII, but not Latin-1, because the characters >127 are different from Latin-1. UTF-8 uses the bytes in the ASCII range only for ASCII characters. Therefore, it works well in any environment where ASCII characters have a significance as syntax characters, e.g. file name syntaxes, markup languages, etc., but where all other characters may use arbitrary bytes.
  For example: “Latin Small Letter s with Acute” (015B) would be encoded as two bytes: C5 9B.
- Use Java or C style escapes, of the form \uXXXX or \xXXXX. This format is not standard for text files, but well defined in the framework of the languages in question, primarily for source files.
  For example: The Polish word “wyjście” with character “Latin Small Letter s with Acute” (015B) in the middle (ś is one character) would look like: “wyj\u015Bcie”.
- Use the &#xXXXX; or &#DDDD; numeric character escapes as in HTML or XML. Again, these are not standard for plain text files, but well defined within the framework of these markup languages.
  For example: “wyjście” would look like “wyj&#x015B;cie”.
- Use Punycode for converting labels that are part of network identifiers into a form compatible with ASCII labels. This method is required as part of IDNA2008 and earlier for Internationalized Domain Names (IDN). It reencodes Unicode into a subset of ASCII characters containing only the letters, digits and hyphen.
  For example: the domain name “wyjście.com” would look like “xn--wyjcie-5ib.com”, with the “xn--” prefix marking it as Punycode and with any ASCII characters collected at the front.
- Use SCSU. This format compresses Unicode into 8-bit format, preserving most of ASCII, but using some of the control codes as commands for the decoder. However, while ASCII text will look like ASCII text after being encoded in SCSU, other characters may occasionally be encoded with the same byte values, making SCSU unsuitable for 8-bit channels that blindly interpret any of the bytes as ASCII characters.
  For example: “<SC2>wyjÛcie”, where <SC2> indicates the byte 0x12 and “Û” corresponds to byte 0xDB. [AF]
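To make the UTF-8 option concrete, here is a small sketch of a UTF-8 encoder in C that reproduces the C5 9B example for U+015B. The function name and the convention of returning 0 for values that cannot be encoded are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is a surrogate
   code point or lies above 10FFFF. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* ASCII stays a single byte */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx followed by three 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Every byte produced for a non-ASCII character has its high bit set, which is why the ASCII byte values never appear except for ASCII characters.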
Q: Which method of packing Unicode characters into an 8-bit stream is the best?
The choice of approach depends on the circumstances:
- UTF-8 is the most widely used ASCII-compatible encoding form for Unicode. It is designed to be used transparently, meaning that any part of the data that was in ASCII is still in ASCII (and without change in relative location) and no other parts are. It is also reasonably compact and independent of byte order issues. It has become the preferred form for Unicode text files. The downside of UTF-8 is that without converting into a format that can be displayed on your system, you cannot tell which non-ASCII characters are in your data.
- Character escapes or numeric character entities let you see which code point had been escaped, even if you are working in an ASCII environment or don't have the font to view the character, or even when you might be unable to recognize the character. In the right circumstances, they are appropriate in source code or source documents. Character escapes and entities use more space, which makes them unattractive except for occasional use.
- Punycode is required as part of the domain name protocols, but not suitable for anything but short strings.
- SCSU was designed for compression of short strings. It uses the least space, but cannot be used transparently in most 8-bit environments.
[AF]
Q: Which of these formats is the most standard?
All of them require that the receiver can understand that format, but UTF-8 is considered one of the three equivalent Unicode Encoding Forms and therefore standard. The use of language-style escapes or numeric character escapes out of their given context would definitely be considered non-standard, but could be a good solution for internal data transmission. SCSU is itself a standard (for compressed data streams) but few general purpose receivers support SCSU, so it is again most useful in internal data transmission. [AF]
UTF-8 FAQ
Q: What is the definition of UTF-8?
UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition, see Section 2.5, Encoding Forms and Section 3.9, Unicode Encoding Forms in The Unicode Standard. See, in particular, Table 3-6 UTF-8 Bit Distribution and Table 3-7 Well-formed UTF-8 Byte Sequences, which give succinct summaries of the encoding form. Make sure you refer to the latest version of the Unicode Standard, as the Unicode Technical Committee has tightened the definition of UTF-8 over time to more strictly enforce unique sequences and to prohibit encoding of certain invalid characters. There is also an Internet RFC about UTF-8, RFC 3629. UTF-8 is also defined in Annex D of ISO/IEC 10646. See also the question above, How do I write a UTF converter?
Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian?
Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings; it has nothing to do with byte order. [AF]
Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying system uses ASCII or EBCDIC encoding?
There is only one definition of UTF-8. It is precisely the same, whether the data were converted from ASCII or EBCDIC based character sets. However, byte sequences from standard UTF-8 won’t interoperate well in an EBCDIC system, because of the different arrangements of control codes between ASCII and EBCDIC. UTR #16: UTF-EBCDIC defines a specialized UTF that will interoperate in EBCDIC systems. [AF]
Q: How do I convert a UTF-16 surrogate pair to UTF-8? As one 4-byte sequence or as two separate 3-byte sequences?
The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence. However, there is a widespread practice of generating pairs of 3-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it were UTF-8, due to the similarity of the formats. [AF]
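One way to guard against the accidental mix-up is to look for 3-byte sequences that encode surrogate code points. The sketch below is an assumption-laden illustration (the function name is hypothetical): it relies on the fact that in well-formed UTF-8 the lead byte ED is only ever followed by a byte in the range 80..9F, while an encoded surrogate has ED followed by A0..BF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if s[0..n-1] contains a 3-byte sequence that encodes a
   surrogate code point (ED A0..BF xx), as CESU-8 data does for
   supplementary characters; returns 0 otherwise. */
int has_encoded_surrogate(const uint8_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i + 1 < n; i++) {
        if (s[i] == 0xED && s[i + 1] >= 0xA0 && s[i + 1] <= 0xBF)
            return 1;
    }
    return 0;
}
```

For example, U+10000 is F0 90 80 80 in UTF-8, which passes the check, whereas the pair of 3-byte sequences ED A0 80 ED B0 80 is flagged.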
Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error. [AF]
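A converter following this rule might look like the sketch below. The names and the error convention (returning a sentinel value and stopping) are assumptions for illustration; a real converter could equally signal the error by substituting U+FFFD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UTF_CONVERSION_ERROR ((size_t)-1)

/* Convert n UTF-16 code units to UTF-8. out must have room for 3 * n
   bytes. Returns the number of bytes written, or UTF_CONVERSION_ERROR
   if the input contains an unpaired surrogate. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, uint8_t *out)
{
    assert(in != NULL && out != NULL);
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {         /* leading surrogate */
            if (i + 1 >= n || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return UTF_CONVERSION_ERROR;        /* no trailing surrogate */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
            i++;                                    /* pair consumed */
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return UTF_CONVERSION_ERROR;            /* trailing without leading */
        }
        if (cp < 0x80) {
            out[o++] = (uint8_t)cp;
        } else if (cp < 0x800) {
            out[o++] = (uint8_t)(0xC0 | (cp >> 6));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (uint8_t)(0xE0 | (cp >> 12));
            out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else {                                    /* single 4-byte sequence */
            out[o++] = (uint8_t)(0xF0 | (cp >> 18));
            out[o++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```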
UTF-16 FAQ
Q: What is UTF-16?
UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.
Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16. [AF]
Q: What are surrogates?
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
Q: What’s the algorithm to convert from UTF-16 to code points?
The Unicode Standard used to contain a short algorithm; now there is just a bit distribution table that shows the relation between surrogates and the resulting supplementary code points, but does not give an algorithm. Here are three short code snippets that translate the information from the bit distribution table into C code that will convert to and from UTF-16.
Using the following type definitions
    #include <stdint.h>
    typedef uint16_t UTF16;
    typedef uint32_t UTF32;
the first snippet calculates the high (or leading) surrogate from a character code C.
    const UTF16 HI_SURROGATE_START = 0xD800;
    UTF16 X = (UTF16) C;                     /* low 16 bits of C */
    UTF32 U = (C >> 16) & ((1 << 5) - 1);    /* bits 16..20 of C */
    UTF16 W = (UTF16) U - 1;                 /* 4-bit field stored in the surrogate */
    UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | (X >> 10);
where "X", "U" and "W" correspond to the labels used in Table 3-5 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.
    const UTF16 LO_SURROGATE_START = 0xDC00;
    UTF16 X = (UTF16) C;
    UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | (X & ((1 << 10) - 1)));
Finally, the reverse, where hi and lo are the high and low surrogate, and "C" the resulting character
    UTF32 X = ((hi & ((1 << 6) - 1)) << 10) | (lo & ((1 << 10) - 1));
    UTF32 W = (hi >> 6) & ((1 << 5) - 1);
    UTF32 U = W + 1;
    UTF32 C = (U << 16) | X;
A caller would need to ensure that "C", "hi", and "lo" are in the appropriate ranges. [AF]
Q: Isn’t there a simpler way to do this?
There is a much simpler computation that does not try to follow the bit distribution table.
    // constants
    const UTF32 LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
    const UTF32 SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;
    // computations: from a supplementary code point to its surrogate pair
    UTF16 lead = LEAD_OFFSET + (codepoint >> 10);
    UTF16 trail = 0xDC00 + (codepoint & 0x3FF);
    // computations: from a surrogate pair back to the code point
    UTF32 codepoint = (lead << 10) + trail + SURROGATE_OFFSET;
[MD]
Q: Why are some people opposed to UTF-16?
People familiar with variable width East Asian character sets such as Shift-JIS (SJIS) are understandably nervous about UTF-16, which sometimes requires two code units to represent a single character. They are well acquainted with the problems that variable-width codes have caused.
However, there are some important differences between the mechanisms used in SJIS and UTF-16:
Overlap:
- In SJIS, there is overlap between the leading and trailing code unit values, and between the trailing and single code unit values. This causes a number of problems:
  - It causes false matches. For example, searching for an “a” may match against the trailing code unit of a Japanese character.
  - It prevents efficient random access. To know whether you are on a character boundary, you have to search backwards to find a known boundary.
  - It makes the text extremely fragile. If a unit is dropped from a leading-trailing code unit pair, many following characters can be corrupted.
- In UTF-16, the code point ranges for high and low surrogates, as well as for single units, are all completely disjoint. None of these problems occur:
  - There are no false matches.
  - The location of the character boundary can be directly determined from each code unit value.
  - A dropped surrogate will corrupt only a single character.
Frequency:
- The vast majority of SJIS characters require 2 units, but characters using single units occur commonly and often have special importance, for example in file names.
- With UTF-16, relatively few characters require 2 units. The vast majority of characters in common use are single code units. Even in East Asian text, the incidence of surrogate pairs should be well less than 1% of all text storage on average. (Certain documents, of course, may have a higher incidence of surrogate pairs, just as phthisique is a fairly infrequent word in English, but may occur quite often in a particular scholarly text.) The recent increased use of emoji means that the percentage of widely-used supplementary characters has also increased.
[AF]
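Because the ranges for leading surrogates, trailing surrogates and single units are disjoint, a character boundary can be found by inspecting at most two code units. The following C sketch illustrates this; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* D800..DBFF: the top six bits are 110110. */
int utf16_is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }

/* DC00..DFFF: the top six bits are 110111. */
int utf16_is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/* Returns the index of the first code unit of the character that
   contains s[i]. No backwards search beyond one unit is ever needed. */
size_t utf16_char_start(const uint16_t *s, size_t i)
{
    assert(s != NULL);
    if (i > 0 && utf16_is_trail(s[i]) && utf16_is_lead(s[i - 1]))
        return i - 1;   /* s[i] is the second half of a surrogate pair */
    return i;           /* single unit, or first half of a pair */
}
```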
Q: Will UTF-16 ever be extended to more than a million characters?
No. Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other UTFs) can represent larger integers, these policies mean that all encoding forms will always represent the same set of characters. Over a million possible codes is far more than enough for the goal of Unicode of encoding characters, not glyphs. Unicode is not designed to encode arbitrary data. If you wanted, for example, to give each “instance of a character on paper throughout history” its own code, you might need trillions or quadrillions of such codes; noble as this effort might be, you would not use Unicode for such an encoding. [AF]
Q: Are there any 16-bit values that are invalid?
Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆. [AF]
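The rule translates directly into a validation loop. This is a sketch with a hypothetical function name; it reports the position of the first unpaired surrogate so the caller can decide how to handle the error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first unpaired surrogate in s[0..n-1],
   or n if the sequence is well-formed UTF-16. */
size_t utf16_first_unpaired(const uint16_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {
            /* A leading surrogate must be followed by a trailing one. */
            if (i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                i++;            /* skip the trailing half of a valid pair */
            else
                return i;
        } else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            return i;           /* trailing surrogate with no leading one */
        }
    }
    return n;
}
```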
Q: What about noncharacters? Are they invalid?
Not at all. Noncharacters are valid in UTFs and must be properly converted. For more details on the definition and use of noncharacters, as well as their correct representation in each UTF, see the Noncharacters FAQ.
Q: Because most supplementary characters are uncommon, does that mean I can ignore them?
Most supplementary characters (expressed with surrogate pairs in UTF-16) are not too common. However, that does not mean that supplementary characters should be neglected. Among them are a number of individual characters that are very popular, as well as many sets important to East Asian procurement specifications. Among the notable supplementary characters are:
- many popular emoji and emoticons
- symbols used for interoperating with Wingdings and Webdings
- numerous small sets of CJK characters important for procurement, including personal and place names
- variation selectors used for all ideographic variation sequences
- numerous minority scripts important for some user communities
- some highly salient historic scripts, such as Egyptian hieroglyphics
Ken Lunde has an interesting presentation file on this topic, with a Top Ten list: Why Support Beyond-BMP Code Points?
Q: How should I handle supplementary characters in my code?
Compared with BMP characters as a whole, the supplementary characters occur less commonly in text. This remains true now, even though many thousands of supplementary characters have been added to the standard, and a few individual characters, such as popular emoji, have become quite common. The relative frequency of BMP characters, and of the ASCII subset within the BMP, can be taken into account when optimizing implementations for best performance: execution speed, memory usage, and data storage.
Such strategies are particularly useful for UTF-16 implementations, where BMP characters require one 16-bit code unit to process or store, whereas supplementary characters require two. Strategies that optimize for the BMP are less useful for UTF-8 implementations, but if the distribution of data warrants it, an optimization for the ASCII subset may make sense, as that subset only requires a single byte for processing and storage in UTF-8.
Q: What is the difference between UCS-2 and UTF-16?
UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.
UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters.
Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters, nor would it be able to support most emoji, for example. [AF]
UTF-32 FAQ
Q: What is UTF-32?
Any Unicode character can be represented as a single 32-bit unit in UTF-32. This single 4-byte code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. UTF-32 is a subset of the encoding mechanism called UCS-4 in ISO 10646. For more information, see Section 3.9, Unicode Encoding Forms in The Unicode Standard. [AF]
Q: Should I use UTF-32 (or UCS-4) for storing Unicode strings in memory?
This depends. It may seem compelling to use UTF-32 as your internal string format because it uses one code unit per code point. However, Unicode characters are rarely processed in complete isolation. Combining character sequences may need to be processed as a unit, for example. This issue not only affects complex scripts, but also seemingly simple things like emoji. Defining your APIs so they work primarily with strings and substrings, instead of characters and character offsets, will make it easier to correctly support combining character sequences. This will also make the distinction between working in UTF-32 and other encoding forms less relevant.
The downside of UTF-32 is that it forces you to use 32 bits for each character, when only 21 bits are ever needed. The number of significant bits needed for the average character in common texts is much lower, making the ratio effectively that much worse. Increasing the storage for the same number of characters does have its cost in applications dealing with large volumes of text data: it can mean exhausting cache limits sooner; it can result in noticeably increased read/write times or in reaching bandwidth limits; and it requires more space for storage. What a number of implementations do is to represent strings with UTF-8 or UTF-16, but individual character values with UTF-32.
If you frequently need to access APIs that require string parameters to be in UTF-32, it may be more convenient to work with UTF-32 strings all the time. However, in many situations that does not matter, and the convenience of having a fixed number of code units per character can be the deciding factor.
The chief selling point for Unicode is providing a representation for all the world’s characters, eliminating the need for juggling multiple character sets and avoiding the associated data corruption problems. These features were enough to swing industry to the side of using Unicode as in-memory format. While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling. [AF]
Q: How about using UTF-32 interfaces in my APIs?
Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16 APIs the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the required functionality at the high levels.
If it is ever necessary to locate the nth character, indexing by character can be implemented as a high level operation. However, while converting from such a UTF-16 code unit index to a character index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as characters, instead of code units, resulted in a 10× degradation. While there are some interesting optimizations that can be performed, it will always be slower on average. Therefore locating other boundaries, such as grapheme, word, line or sentence boundaries, proceeds directly from the code unit index, not indirectly via an intermediate character code index.
In situations where it is necessary to work with the "units" that the user interacts with, indexing by Unicode character gives only limited advantage over indexing by code unit: many times what users perceive as a single unit, an emoji for example, is represented as a combining or other character sequence, and it makes little difference in iterating over such "units" whether the underlying code uses 16-bit or 32-bit code units.
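The scan mentioned above, from a character (code point) index to a code unit index, can be sketched as follows. The function name is hypothetical and the sketch assumes well-formed UTF-16 input; it shows why the operation is linear in the index rather than constant time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the code unit index at which the k-th code point (counting
   from 0) starts in s[0..n-1], or n if the string has fewer than k + 1
   code points. Assumes well-formed UTF-16. */
size_t utf16_offset_of_code_point(const uint16_t *s, size_t n, size_t k)
{
    assert(s != NULL || n == 0);
    size_t i = 0;
    while (i < n && k > 0) {
        /* A leading surrogate starts a two-unit character. */
        i += ((s[i] & 0xFC00) == 0xD800 && i + 1 < n) ? 2 : 1;
        k--;
    }
    return i;
}
```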
Q: Doesn’t it cause a problem to have only UTF-16 string APIs, instead of UTF-32 char APIs?
Almost all international functions (upper-, lower-, titlecasing, case folding, drawing, measuring, collation, transliteration, grapheme-, word-, linebreaks, etc.) should take string parameters in the API, not single code points (UTF-32). Single code point APIs almost always produce the wrong results except for very simple languages, either because you need more context to get the right answer, or because you need to generate a sequence of characters to return the right answer, or both.
For example, any Unicode-compliant collation (see UTS #10: Unicode Collation Algorithm (UCA)) must be able to handle sequences of more than one code point, and treat that sequence as a single entity. Trying to collate by handling single code points at a time would get the wrong answer. The same will happen for drawing or measuring text a single code point at a time; because scripts like Arabic are contextual, the width of x plus the width of y is not equal to the width of xy. Once you get beyond basic typography, the same is true for English as well; because of kerning and ligatures the width of “fi” in the font may be different than the width of “f” plus the width of “i”. Casing operations must return strings, not single code points; see https://www.unicode.org/charts/case/. In particular, the title casing operation requires strings as input, not single code points at a time.
Storing a single code point in a struct or class instead of a string would exclude support for graphemes, such as “ch” for Slovak, where a single code point may not be sufficient, but a character sequence is needed to express what is required. In other words, most API parameters and fields of composite data types should not be defined as a character, but as a string. And if they are strings, it does not matter what the internal representation of the string is.
Given that any industrial-strength text and internationalization support API has to be able to handle sequences of characters, it makes little difference whether the string is internally represented by a sequence of UTF-16 code units, or by a sequence of code points (= UTF-32 code units). Both UTF-16 and UTF-8 are designed to make working with substrings easy, by the fact that the sequence of code units for a given code point is unique. [AF]
Q: Are there exceptions to the rule of exclusively using string parameters in APIs?
The main exceptions are very low-level operations such as getting character properties (e.g. General Category or Canonical Class in the UCD). For those it is handy to have interfaces that convert quickly to and from UTF-16 and UTF-32, and that allow you to iterate through strings returning UTF-32 values (even though the internal format is UTF-16).
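An iteration interface of that kind might be sketched as below. The function name is hypothetical, and returning U+FFFD for an unpaired surrogate is just one possible error policy chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the code point starting at s[*pos] as a UTF-32 value and
   advances *pos past it. Requires *pos < n. An unpaired surrogate is
   reported as U+FFFD and consumes one code unit. */
uint32_t utf16_next(const uint16_t *s, size_t n, size_t *pos)
{
    assert(s != NULL && pos != NULL && *pos < n);
    uint32_t unit = s[(*pos)++];
    if ((unit & 0xFC00) == 0xD800 && *pos < n &&
        (s[*pos] & 0xFC00) == 0xDC00) {
        uint32_t trail = s[(*pos)++];       /* consume the trailing half */
        return 0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00);
    }
    if ((unit & 0xF800) == 0xD800)
        return 0xFFFD;                      /* unpaired surrogate */
    return unit;                            /* BMP code point */
}
```

A caller would typically loop with `while (pos < n)` and pass each returned value to a property lookup.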
Q: How do I convert a UTF-16 surrogate pair to UTF-32? As one 4-byte sequence or as two 4-byte sequences?
The definition of UTF-32 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence.
Q: How do I convert an unpaired UTF-16 surrogate to UTF-32?
If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must treat this as an error. By representing such an unpaired surrogate on its own, the resulting UTF-32 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. [AF]
Byte Order Mark (BOM) FAQ
Q: What is a BOM?
A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol. [AF]
Q: Where is a BOM useful?
A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format. It can also serve as a hint indicating that the file is in Unicode, as opposed to a legacy encoding, and furthermore, it acts as a signature for the specific encoding form used. [AF]
Q: What does ‘endian’ mean?
Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian. When data is exchanged, bytes that appear in the "correct" order on the sending system may appear to be out of order on the receiving system. In that situation, a BOM would look like 0xFFFE which is a noncharacter, allowing the receiving system to apply byte reversal before processing the data. UTF-8 is byte oriented and therefore does not have that issue. Nevertheless, an initial BOM might be useful to identify the data stream as UTF-8. [AF]
Q: When a BOM is used, is it only in 16-bit Unicode text?
No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples:

Bytes         Encoding Form
00 00 FE FF   UTF-32, big-endian
FF FE 00 00   UTF-32, little-endian
FE FF         UTF-16, big-endian
FF FE         UTF-16, little-endian
EF BB BF      UTF-8
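The signature table can be turned into a small detection routine. The following C sketch uses hypothetical names; note that FF FE 00 00 is inherently ambiguous (it could also be a UTF-16 little-endian BOM followed by U+0000), and this sketch resolves the ambiguity in favor of UTF-32 by testing the 4-byte signatures first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16BE, BOM_UTF16LE, BOM_UTF32BE, BOM_UTF32LE
} bom_kind;

/* Inspect the first bytes of a data stream for a BOM signature. */
bom_kind detect_bom(const uint8_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    /* 4-byte signatures first: FF FE 00 00 also begins with FF FE. */
    if (n >= 4 && memcmp(p, "\x00\x00\xFE\xFF", 4) == 0) return BOM_UTF32BE;
    if (n >= 4 && memcmp(p, "\xFF\xFE\x00\x00", 4) == 0) return BOM_UTF32LE;
    if (n >= 3 && memcmp(p, "\xEF\xBB\xBF", 3) == 0)     return BOM_UTF8;
    if (n >= 2 && memcmp(p, "\xFE\xFF", 2) == 0)         return BOM_UTF16BE;
    if (n >= 2 && memcmp(p, "\xFF\xFE", 2) == 0)         return BOM_UTF16LE;
    return BOM_NONE;
}
```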
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature: an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts. [AF]
Q: What should I do with U+FEFF in the middle of a file?
In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character. [AF]
Q: I am using a protocol that has BOM at the start of text. How do I represent an initial ZWNBSP?
Use U+2060 WORD JOINER instead.
Q: How do I tag data that does not interpret U+FEFF as a BOM?
Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16. [MD]
Q: Why wouldn’t I always use a protocol that requires a BOM?
Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP.
Do not tag every string in a database or set of fields with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM).
Q: How should I deal with BOMs?
Here are some guidelines to follow:
- A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
- Some protocols allow optional BOMs in the case of untagged text. In those cases,
  - Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
  - Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
- Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
- Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used. (See also Q: What is the difference between UCS-2 and UTF-16?)
[AF]