UTF-16 - Wikipedia

2024-11-26


From Wikipedia, the free encyclopedia

Variable-width encoding of Unicode, using one or two 16-bit code units

UTF-16
[Figure: The first 2^16 Unicode code points. The stripe of solid gray near the bottom is the surrogate halves used by UTF-16 (the white region below the stripe is the Private Use Area).]
Language(s): International
Standard: Unicode Standard
Classification: Unicode Transformation Format, variable-width encoding
Extends: UCS-2
Transforms / Encodes: ISO/IEC 10646 (Unicode)

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact, this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier, now obsolete fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed.[1]

UTF-16 is used by systems such as the Microsoft Windows API, the Java programming language and JavaScript/ECMAScript. It is also sometimes used for plain text and word-processing data files on Microsoft Windows. It is rarely used for files on Unix-like systems. Since May 2019, Microsoft has begun supporting UTF-8 (as well as UTF-16) and encouraging its use.[2]

UTF-16 is the only web encoding incompatible with ASCII[3] and never gained popularity on the web, where it is used by under 0.002% (a little over one thousandth of one percent) of web pages.[4] UTF-8, by comparison, accounts for 98% of all web pages.[5] The Web Hypertext Application Technology Working Group (WHATWG) considers UTF-8 "the mandatory encoding for all [text]" and holds that, for security reasons, browser applications should not use UTF-16.[6]

UTF-16 is also used by SMS in practice: the SMS standard specifies its fixed-width predecessor UCS-2, which cannot represent most emoji characters, so the variable-length UTF-16 is needed to support all of them.[citation needed]

Contents
1 History
2 Description
2.1 U+0000 to U+D7FF and U+E000 to U+FFFF
2.2 Code points from U+010000 to U+10FFFF
2.3 U+D800 to U+DFFF
2.4 Examples
3 Byte-order encoding schemes
4 Usage
5 See also
6 Notes
7 References
8 External links

History

In the late 1980s, work began on developing a uniform encoding for a "Universal Character Set" (UCS) that would replace earlier language-specific encodings with one coordinated system. The goal was to include all required characters from most of the world's languages, as well as symbols from technical domains such as science, mathematics, and music. The original idea was to replace the typical 256-character encodings, which required 1 byte per character, with an encoding using 65,536 (2^16) values, which would require 2 bytes (16 bits) per character.

Two groups worked on this in parallel, ISO/IEC JTC 1/SC 2 and the Unicode Consortium, the latter representing mostly manufacturers of computing equipment. The two groups attempted to synchronize their character assignments so that the developing encodings would be mutually compatible. The early 2-byte encoding was originally called "Unicode", but is now called "UCS-2".[7]

When it became increasingly clear that 2^16 characters would not suffice,[1] IEEE introduced a larger 31-bit space and an encoding (UCS-4) that would require 4 bytes per character. This was resisted by the Unicode Consortium, both because 4 bytes per character wasted a lot of memory and disk space, and because some manufacturers were already heavily invested in 2-byte-per-character technology. The UTF-16 encoding scheme was developed as a compromise and introduced with version 2.0 of the Unicode standard in July 1996.[8] It is fully specified in RFC 2781, published in 2000 by the IETF.[9][10]
In the UTF-16 encoding, code points less than 2^16 are encoded with a single 16-bit code unit equal to the numerical value of the code point, as in the older UCS-2. The newer code points greater than or equal to 2^16 are encoded by a compound value using two 16-bit code units. These two 16-bit code units are chosen from the UTF-16 surrogate range 0xD800–0xDFFF, which had not previously been assigned to characters. Values in this range are not used as characters, and UTF-16 provides no legal way to code them as individual code points. A UTF-16 stream, therefore, consists of single 16-bit code units outside the surrogate range for code points in the Basic Multilingual Plane (BMP), and pairs of 16-bit code units within the surrogate range for code points above the BMP.

UTF-16 is specified in the latest versions of both the international standard ISO/IEC 10646 and the Unicode Standard. "UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard."[11] UTF-16 will never be extended to support a larger number of code points or to support the code points that were replaced by surrogates, as this would violate the Unicode Stability Policy with respect to general category or surrogate code points.[12] (Any scheme that remains a self-synchronizing code would require allocating at least one BMP code point to start a sequence. Changing the purpose of a code point is disallowed.)

Description

Each Unicode code point is encoded as either one or two 16-bit code units. How these 16-bit code units are stored as bytes then depends on the endianness of the text file or communication protocol.

A "character" may need from as few as two bytes to fourteen[13] or even more bytes to be recorded. For instance, an emoji flag character takes 8 bytes, since it is "constructed from a pair of Unicode scalar values"[14] (and those values are outside the BMP and require 4 bytes each).

U+0000 to U+D7FF and U+E000 to U+FFFF

U+D800 to U+DFFF have a special purpose, see below.
Both UTF-16 and UCS-2 encode code points in this range as single 16-bit code units that are numerically equal to the corresponding code points. These code points in the Basic Multilingual Plane (BMP) are the only code points that can be represented in UCS-2.[citation needed] As of Unicode 9.0, some modern non-Latin Asian, Middle-Eastern, and African scripts fall outside this range, as do most emoji characters.

Code points from U+010000 to U+10FFFF

Code points from the other planes (called Supplementary Planes) are encoded as two 16-bit code units called a surrogate pair, by the following scheme:

UTF-16 decoder (rows: high surrogate; columns: low surrogate)

High \ Low    DC00      DC01      ...    DFFF
D800          010000    010001    ...    0103FF
D801          010400    010401    ...    0107FF
⋮             ⋮         ⋮         ⋱      ⋮
DBFF          10FC00    10FC01    ...    10FFFF

- 0x10000 is subtracted from the code point (U), leaving a 20-bit number (U') in the hex number range 0x00000–0xFFFFF. Note that for these purposes, U is defined to be no greater than 0x10FFFF.
- The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.
- The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF.
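The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `encode_surrogate_pair` is made up for this example and is not part of any standard library.

```python
from typing import Tuple


def encode_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates.

    Hypothetical helper for illustration; follows the three steps in the text.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    u_prime = code_point - 0x10000        # 20-bit value U'
    w1 = 0xD800 + (u_prime >> 10)         # high ten bits -> high surrogate W1
    w2 = 0xDC00 + (u_prime & 0x3FF)       # low ten bits  -> low surrogate W2
    return w1, w2


print([hex(w) for w in encode_surrogate_pair(0x10437)])  # ['0xd801', '0xdc37']
```

Code points in the BMP are rejected here because they are stored directly as a single code unit and never go through this split.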
Illustrated visually, the distribution of U' between W1 and W2 looks like:[15]

    U' = yyyyyyyyyyxxxxxxxxxx  // U - 0x10000
    W1 = 110110yyyyyyyyyy      // 0xD800 + yyyyyyyyyy
    W2 = 110111xxxxxxxxxx      // 0xDC00 + xxxxxxxxxx

The high surrogate and low surrogate are also known as "leading" and "trailing" surrogates, respectively, analogous to the leading and trailing bytes of UTF-8.[16]

Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair. This simplifies searches a great deal. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units (i.e. the type of code unit can be determined by the range of values in which it falls). UTF-8 shares these advantages, but many earlier multi-byte encoding schemes (such as Shift JIS and other Asian multi-byte encodings) did not allow unambiguous searching and could only be synchronized by re-parsing from the start of the string. UTF-16 is not self-synchronizing, however, if one byte is lost or if traversal starts at a random byte.

Because the most commonly used characters are all in the BMP, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135).

The Supplementary Planes contain emoji, historic scripts, less used symbols, less used Chinese ideographs, etc. Since the encoding of Supplementary Planes contains 20 significant bits (10 of 16 bits in each of the high and low surrogates), 2^20 code points can be encoded, divided into 16 planes of 2^16 code points each. Including the separately handled Basic Multilingual Plane, there are a total of 17 planes.
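Because the three value ranges are disjoint, a code unit can be classified without looking at its neighbours, and a decoder only needs one unit of lookahead. The following Python sketch shows this; the names `unit_kind` and `decode_units` are hypothetical, and raising an error on an unpaired surrogate is a choice made for this sketch rather than required behaviour.

```python
from typing import List


def unit_kind(unit: int) -> str:
    """Classify one 16-bit code unit purely by the range it falls in."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low"
    return "bmp"


def decode_units(units: List[int]) -> List[int]:
    """Turn a list of 16-bit code units into a list of code points.

    Assumption of this sketch: unpaired surrogates raise ValueError.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        kind = unit_kind(unit)
        if kind == "bmp":
            code_points.append(unit)          # BMP: code unit equals code point
            i += 1
        elif (kind == "high" and i + 1 < len(units)
              and unit_kind(units[i + 1]) == "low"):
            high = unit - 0xD800              # ten high bits of U'
            low = units[i + 1] - 0xDC00       # ten low bits of U'
            code_points.append(0x10000 + (high << 10) + low)
            i += 2
        else:
            raise ValueError(f"unpaired surrogate 0x{unit:04X} at index {i}")
    return code_points


print([hex(cp) for cp in decode_units([0x0024, 0xD801, 0xDC37, 0x20AC])])
# ['0x24', '0x10437', '0x20ac']
```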
U+D800 to U+DFFF

The Unicode standard reserves these code point values for the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points. However, Windows allows unpaired surrogates in filenames[17] and other places, which generally means they have to be supported by software in spite of their exclusion from the Unicode standard.

UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and a large amount of software does so, even though the standard states that such arrangements should be treated as encoding errors.

It is possible to unambiguously encode an unpaired surrogate (a high surrogate code point not followed by a low one, or a low one not preceded by a high one) in the format of UTF-16 by using a code unit equal to the code point. The result is not valid UTF-16, but the majority of UTF-16 encoder and decoder implementations do this when translating between encodings.[citation needed]

Examples

To encode U+10437 (𐐷) to UTF-16:
- Subtract 0x10000 from the code point, leaving 0x0437.
- For the high surrogate, shift right by 10 (divide by 0x400), then add 0xD800, resulting in 0x0001 + 0xD800 = 0xD801.
- For the low surrogate, take the low 10 bits (remainder of dividing by 0x400), then add 0xDC00, resulting in 0x0037 + 0xDC00 = 0xDC37.

To decode U+10437 (𐐷) from UTF-16:
- Take the high surrogate (0xD801) and subtract 0xD800, then multiply by 0x400, resulting in 0x0001 × 0x400 = 0x0400.
- Take the low surrogate (0xDC37) and subtract 0xDC00, resulting in 0x37.
- Add these two results together (0x0437), and finally add 0x10000 to get the final decoded UTF-32 code point, 0x10437.

The following table summarizes this conversion, as well as others.
Character    Binary code point           Binary UTF-16                              UTF-16 hex code units    UTF-16BE hex bytes    UTF-16LE hex bytes
$ U+0024     0000 0000 0010 0100         0000 0000 0010 0100                        0024                     00 24                 24 00
€ U+20AC     0010 0000 1010 1100         0010 0000 1010 1100                        20AC                     20 AC                 AC 20
𐐷 U+10437    0001 0000 0100 0011 0111    1101 1000 0000 0001 1101 1100 0011 0111    D801 DC37                D8 01 DC 37           01 D8 37 DC
𤭢 U+24B62    0010 0100 1011 0110 0010    1101 1000 0101 0010 1101 1111 0110 0010    D852 DF62                D8 52 DF 62           52 D8 62 DF

Byte-order encoding schemes

UTF-16 and UCS-2 produce a sequence of 16-bit code units. Since most communication and storage protocols are defined for bytes, and each unit thus takes two 8-bit bytes, the order of the bytes may depend on the endianness (byte order) of the computer architecture.

To assist in recognizing the byte order of code units, UTF-16 allows a byte order mark (BOM), a code point with the value U+FEFF, to precede the first actual coded value.[nb 1] (U+FEFF is the invisible zero-width non-breaking space/ZWNBSP character.)[nb 2] If the endian architecture of the decoder matches that of the encoder, the decoder detects the 0xFEFF value, but an opposite-endian decoder interprets the BOM as the noncharacter value U+FFFE reserved for this purpose. This incorrect result provides a hint to perform byte-swapping for the remaining values.

If the BOM is missing, RFC 2781 recommends[nb 3] that big-endian (BE) encoding be assumed. In practice, due to Windows using little-endian (LE) order by default, many applications assume little-endian encoding. It is also reliable to detect endianness by looking for null bytes, on the assumption that characters less than U+0100 are very common. If more even bytes (starting at 0) are null, then it is big-endian.

The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type. When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text, and a U+FEFF at the beginning should be handled as a ZWNBSP character. Most applications ignore a BOM in all cases despite this rule.
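As a rough illustration of the BOM rule, the Python sketch below inspects the first two bytes of a buffer and falls back to big-endian when no BOM is present, following the RFC 2781 recommendation quoted above. The function name `split_bom` is hypothetical; `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are constants from Python's standard library.

```python
import codecs
from typing import Tuple


def split_bom(data: bytes) -> Tuple[str, bytes]:
    """Return (codec name, payload without BOM) for UTF-16 encoded bytes.

    Hypothetical helper: with no BOM it assumes big-endian, which many
    real applications do not do (they assume little-endian instead).
    """
    if data.startswith(codecs.BOM_UTF16_BE):      # bytes FE FF
        return "utf-16-be", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):      # bytes FF FE
        return "utf-16-le", data[2:]
    return "utf-16-be", data


# BOM (FF FE) followed by the UTF-16LE bytes of U+10437 from the table.
order, payload = split_bom(bytes.fromhex("FFFE 01D8 37DC"))
print(order, hex(ord(payload.decode(order))))     # utf-16-le 0x10437
```

A caller that already knows the encoding is labelled UTF-16BE or UTF-16LE would skip this step and treat a leading U+FEFF as an ordinary character, as described above.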
For Internet protocols, IANA has approved "UTF-16", "UTF-16BE", and "UTF-16LE" as the names for these encodings (the names are case insensitive). The aliases UTF_16 or UTF16 may be meaningful in some programming languages or software applications, but they are not standard names in Internet protocols.

Similar designations, UCS-2BE and UCS-2LE, are used to show versions of UCS-2.

Usage

UTF-16 is used for text in the OS API of all currently supported versions of Microsoft Windows, including Windows 10, and of at least all versions since Windows CE/2000/XP/2003/Vista/7.[18] In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages.[19][20] Older Windows NT systems (prior to Windows 2000) only support UCS-2.[21] Files and network data tend to be a mix of UTF-16, UTF-8, and legacy byte encodings.

While there has been some UTF-8 support even in Windows XP,[22] it was improved (in particular the ability to name a file using UTF-8) in Windows 10 insider build 17035 and the April 2018 update. As of May 2019, Microsoft recommends that software use UTF-8 instead of UTF-16.[2]

The IBM i operating system designates CCSID (code page) 13488 for UCS-2 encoding and CCSID 1200 for UTF-16 encoding, though the system treats them both as UTF-16.[23]

UTF-16 is used by the Qualcomm BREW operating systems; the .NET environments; and the Qt cross-platform graphical widget toolkit.

Symbian OS, used in Nokia S60 handsets and Sony Ericsson UIQ handsets, uses UCS-2. iPhone handsets use UTF-16 for Short Message Service instead of the UCS-2 described in the 3GPP TS 23.038 (GSM) and IS-637 (CDMA) standards.[24]

The Joliet file system, used in CD-ROM media, encodes file names using UCS-2BE (up to sixty-four Unicode characters per file name).

The Python language environment officially only uses UCS-2 internally since version 2.0, but the UTF-8 decoder to "Unicode" produces correct UTF-16. Since Python 2.2, "wide" builds of Unicode are supported which use UTF-32 instead;[25] these are primarily used on Linux. Python 3.3 no longer ever uses UTF-16; instead, an encoding that gives the most compact representation for the given string is chosen from ASCII/Latin-1, UCS-2, and UTF-32.[26]

Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0.
JavaScript may use UCS-2 or UTF-16.[27] As of ES2015, string methods and regular expression flags have been added to the language that permit handling strings from an encoding-agnostic perspective.

In many languages, quoted strings need a new syntax for quoting non-BMP characters, as the C-style "\uXXXX" syntax explicitly limits itself to 4 hex digits. The following examples illustrate the syntax for the non-BMP character "𝄞" (U+1D11E, MUSICAL SYMBOL G CLEF):
- The most common (used by C++, C#, D, and several other languages) is to use an upper-case 'U' with 8 hex digits such as "\U0001D11E".[28]
- In Java 7 regular expressions, ICU, and Perl, the syntax "\x{1D11E}" must be used; similarly, in ECMAScript 2015 (JavaScript), the escape format is "\u{1D11E}".
- In many other cases (such as Java outside of regular expressions),[29] the only way to get non-BMP characters is to enter the surrogate halves individually, for example: "\uD834\uDD1E" for U+1D11E.

String implementations based on UTF-16 typically define lengths of the string and allow indexing in terms of these 16-bit code units, not in terms of code points. Neither code points nor code units correspond to anything an end user might recognize as a "character". The things users identify as characters may in general consist of a base code point and a sequence of combining characters (or might be a sequence of code points of some other kind, for example Hangul conjoining jamos); Unicode refers to this construct as a grapheme cluster.[30] Applications dealing with Unicode strings, whatever the encoding, must therefore cope with the fact that this limits their ability to arbitrarily split and combine strings.

UCS-2 is also supported by the PHP language[31] and MySQL.[7]

Swift, version 5, Apple's preferred application language, switched from UTF-16 to UTF-8 as the preferred encoding.[32]

See also

- Comparison of Unicode encodings
- Plane (Unicode)
- UTF-8
- UTF-32

Notes

[nb 1] UTF-8 encoding produces byte values strictly less than 0xFE, so either byte in the BOM sequence also identifies the encoding as UTF-16 (assuming that UTF-32 is not expected).
[nb 2] Use of U+FEFF as the character ZWNBSP instead of as a BOM has been deprecated in favor of U+2060 (WORD JOINER); see Byte Order Mark (BOM) FAQ at unicode.org. But if an application interprets an initial BOM as a character, the ZWNBSP character is invisible, so the impact is minimal.
[nb 3] RFC 2781 section 4.3 says that if there is no BOM, "the text SHOULD be interpreted as being big-endian." According to section 1.2, the meaning of the term "SHOULD" is governed by RFC 2119. In that document, section 3 says "... there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course".

References

1. "What is UTF-16?". The Unicode Consortium. Unicode, Inc. Retrieved 29 March 2018.
2. "Use the Windows UTF-8 code page - UWP applications". docs.microsoft.com. Retrieved 2020-06-06. "As of Windows Version 1903 (May 2019 Update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. [..] CP_ACP equates to CP_UTF8 only if running on Windows Version 1903 (May 2019 Update) or above and the ActiveCodePage property described above is set to UTF-8. Otherwise, it honors the legacy system code page. We recommend using CP_UTF8 explicitly."
3. "HTML Living Standard". w3.org. 2020-06-10. Retrieved 2020-06-15. "UTF-16 encodings are the only encodings that this specification needs to treat as not being ASCII-compatible encodings."
4. "Usage Statistics of UTF-16 for Websites, June 2021". w3techs.com. Retrieved 2021-06-17.
5. "Usage Statistics of UTF-8 for Websites, November 2021". w3techs.com. Retrieved 2021-11-10.
6. "Encoding Standard". encoding.spec.whatwg.org. Retrieved 2018-10-22. "The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding. [..] The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that UTF-8 is now the mandatory encoding for all text things on the Web."
7. "MySQL :: MySQL 5.7 Reference Manual :: 10.1.9.4 The ucs2 Character Set (UCS-2 Unicode Encoding)". dev.mysql.com.
8. "Questions about encoding forms". Retrieved 2010-11-12.
9. ISO/IEC 10646:2014 "Information technology – Universal Coded Character Set (UCS)" sections 9 and 10.
10. The Unicode Standard version 7.0 (2014) section 2.5.
11. "The Unicode® Standard Version 10.0 – Core Specification. Appendix C Relationship to ISO/IEC 10646" (PDF). Unicode Consortium. Archived (PDF) from the original on 2022-10-09. Section C.2, page 913 (pdf page 10).
12. "Unicode Character Encoding Stability Policies". unicode.org.
13. "It's not wrong that "🤦🏼‍♂️".length == 7". hsivonen.fi. Retrieved 2021-03-15.
14. "Apple Developer Documentation". developer.apple.com. Retrieved 2021-03-15.
15. Yergeau, Francois; Hoffman, Paul (February 2000). "UTF-16, an encoding of ISO 10646". tools.ietf.org. Retrieved 2019-06-18.
16. Allen, Julie D.; Anderson, Deborah; Becker, Joe; Cook, Richard, eds. (2014). "3.8 Surrogates" (PDF). The Unicode Standard, Version 7.0 – Core Specification. Mountain View: The Unicode Consortium. p. 118. Archived (PDF) from the original on 2022-10-09. Retrieved 3 November 2014.
17. "Maximum Path Length Limitation". Microsoft. 2022-07-18. Retrieved 2022-10-10. "[…] the file system treats path and file names as an opaque sequence of WCHARs"
18. Unicode (Windows). Retrieved 2011-03-08. "These functions use UTF-16 (wide character) encoding (…) used for native Unicode encoding on Windows operating systems."
19. "Unicode". microsoft.com. Retrieved 2009-07-20.
20. "Surrogates and Supplementary Characters". microsoft.com. Retrieved 2009-07-20.
21. "Description of storing UTF-8 data in SQL Server". microsoft.com. 7 December 2005. Retrieved 2008-02-01.
22. "[Updated] Patch for cmd.exe for windows xp for cp 65001 - Page 2 - DosTips.com". www.dostips.com. Retrieved 2021-06-17.
23. "UCS-2 and its relationship to Unicode (UTF-16)". IBM. Retrieved 2019-04-26.
24. Selph, Chad (2012-11-08). "Adventures in Unicode SMS". Twilio. Archived from the original on 2012-11-09. Retrieved 2015-08-28.
25. "PEP 261 – Support for "wide" Unicode characters". Python.org. Retrieved 2015-05-29.
26. "PEP 0393 – Flexible String Representation". Python.org. Retrieved 2015-05-29.
27. "JavaScript's internal character encoding: UCS-2 or UTF-16? · Mathias Bynens".
28. "ECMA-334: 9.4.1 Unicode escape sequences". en.csharp-online.net. Archived from the original on 2013-05-01.
29. Lexical Structure: Unicode Escapes in "The Java Language Specification, Third Edition". Sun Microsystems, Inc. 2005. Retrieved 2019-10-11.
30. "Glossary of Unicode Terms". Retrieved 2016-06-21.
31. "PHP: Supported Character Encodings - Manual". php.net.
32. "UTF-8 String". Swift.org. 2019-03-20. Retrieved 2020-08-20.

External links

- A very short algorithm for determining the surrogate pair for any code point
- Unicode Technical Note #12: UTF-16 for Processing
- Unicode FAQ: What is the difference between UCS-2 and UTF-16?
- Unicode Character Name Index
- RFC 2781: UTF-16, an encoding of ISO 10646
- java.lang.String documentation, discussing surrogate handling
Retrieved from "https://en.wikipedia.org/w/index.php?title=UTF-16&oldid=1115307272"