What are Unicode, UTF-8, and UTF-16? - Stack Overflow

2024-11-30

Article recommendation score: 80%

Number of voters: 10

Asked 12 years, 8 months ago. Modified 7 months ago. Viewed 343k times.

Question (score 475):

What's the basis for Unicode, and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well, but it's not clear to me.

In VSS, when doing a file comparison, sometimes there is a message saying the two files have differing UTF's. Why would this be the case?

Please explain in simple terms.

Tags: unicode, encoding, utf-8, utf-16

Edited Feb 18 at 17:51 by Peter Mortensen. Asked Feb 11, 2010 at 0:12 by SoftwareGeek.

Comments:

(143) Sounds like you need to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets! It's a very good explanation of what's going on.
– Brian Agnew, Feb 11, 2010 at 0:14

(5) This FAQ from the official Unicode website has some answers for you.
– Nemanja Trifunovic, Feb 11, 2010 at 16:10

(4) @John: it's a very nice introduction, but it's not the ultimate source: It skips quite a few of the details (which is fine for an overview/introduction!)
– Joachim Sauer, Jun 24, 2011 at 10:22

(6) The article is great, but it has several mistakes and represents UTF-8 in somewhat conservative light. I suggest reading utf8everywhere.org as a supplement.
– Pavel Radzivilovsky, May 27, 2012 at 4:53

(2) Take a look at this website: utf8everywhere.org
– Vertexwahn, Jan 14, 2016 at 12:37

9 Answers

Answer (score 652, +250 bounty):

Why do we need Unicode?

In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).

But for argument's sake, let's say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but this is not fine for Joe the software developer. Approximately half the world uses non-Latin characters and using ASCII is arguably inconsiderate to these people, and on top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence the first 128 are identical to ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.

Memory considerations

So how many bytes give access to what characters in these encodings?

UTF-8:
  1 byte: Standard ASCII
  2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
  3 bytes: BMP
  4 bytes: All Unicode characters

UTF-16:
  2 bytes: BMP
  4 bytes: All Unicode characters

It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese, Japanese, and Korean (CJK) characters.
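The byte counts in the two lists above can be checked with a few lines of Python. This is only a sketch: the four sample characters and the helper name `encoded_sizes` are chosen here for illustration and are not part of the original answer.

```python
# Sample characters (written as escapes so the file encoding does not matter).
# These particular characters are an illustrative choice, one per size class.
SAMPLES = {
    "A": "standard ASCII",
    "\u00e9": "European script (Latin small e with acute)",
    "\u3042": "BMP (Hiragana letter a)",
    "\U00024B62": "outside the BMP (rare CJK ideograph)",
}


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-be" is used instead of "utf-16" so that no byte order mark
    # is prepended and counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch, description in SAMPLES.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {description}: "
          f"UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Running it on other characters is a quick way to see which size class a given script falls into.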
If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, using UTF-8 could be up to 1.5 times less memory efficient than UTF-16. When dealing with large amounts of text, such as large web pages or lengthy word documents, this could impact performance.

Encoding basics

Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.

UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.

UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions maps to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream which indicates endianness. Thus, if you are reading UTF-16 input, and no endianness is specified, you must check for this.

As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.

Practical programming considerations

Character and string data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.
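As a rough illustration of the conversion step and of the byte order mark check described above, here is a minimal Python sketch. The function name `utf16_to_utf8` is hypothetical, and treating BOM-less input as big-endian is an assumption of this sketch; in practice the byte order may be fixed by the file format or protocol instead.

```python
import codecs


def utf16_to_utf8(data):
    """Decode UTF-16 bytes, honouring a byte order mark if present,
    and re-encode the text as UTF-8."""
    if data.startswith(codecs.BOM_UTF16_LE):      # bytes FF FE
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        text = data[2:].decode("utf-16-be")
    else:
        # Assumption of this sketch: with no BOM, treat the input as
        # big-endian. Real inputs may need out-of-band information.
        text = data.decode("utf-16-be")
    return text.encode("utf-8")


# The same two letters stored with either byte order convert identically.
little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(utf16_to_utf8(little), utf16_to_utf8(big))
```

Python's built-in "utf-16" codec performs a similar BOM check on its own when decoding; the explicit version is shown only to make the steps visible.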
Recommended, default, and dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.

Library support: The libraries you are using support some kind of encoding. Which one? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1, 2, and even 3 byte characters can occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly since they occur very rarely.

Counting characters: There exist combining characters in Unicode. For example, the code point U+006E (n), and U+0303 (a combining tilde) forms ñ, but the code point U+00F1 forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example, and 1 for the latter. This isn't necessarily wrong, but it may not be the desired outcome either.

Comparing for equality: A, А, and Α look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ. One is a letter, and the other is a Roman numeral. In addition, we have the combining characters to consider as well. For more information, see Duplicate characters in Unicode.

Surrogate pairs: These come up often enough on Stack Overflow, so I'll just provide some example links:
  Getting string length
  Removing surrogate pairs
  Palindrome checking

Edited Feb 20 at 20:28 by Peter Mortensen. Answered Feb 28, 2013 at 5:24 by DPenner1.

Comments:

(11) Excellent answer, great chances for the bounty ;-) Personally I'd add that some argue for UTF-8 as the universal character encoding, but I know that that's an opinion that's not necessarily shared by everyone.
– Joachim Sauer, Feb 28, 2013 at 9:09

(3) Still too technical for me at this stage. How is the word hello stored in a computer in UTF-8 and UTF-16?
– FirstNameLastName, May 12, 2013 at 18:17

(1) Could you expand more on why, for example, the BMP takes 3 bytes in UTF-8? I would have thought that since its maximum value is 0xFFFF (16 bits) then it would only take 2 bytes to access.
– mark, Oct 7, 2014 at 22:14

(2) @mark Some bits are reserved for encoding purposes. For a code point that takes 2 bytes in UTF-8, there are 5 reserved bits, leaving only 11 bits to select a code point. U+07FF ends up being the highest code point representable in 2 bytes.
– DPenner1, Oct 8, 2014 at 3:40

(1) BTW - ASCII only defines 128 code points, using only 7 bits for representation. It is ISO-8859-1/ISO-8859-15 which define 256 code points and use 8 bits for representation. The first 128 code points in all these 3 are the same.
– Tuxdude, Feb 15, 2016 at 21:34

Answer (score 92):

Unicode
  is a set of characters used around the world

UTF-8
  a character encoding capable of encoding all possible characters (called code points) in Unicode.
  code unit is 8-bits
  use one to four code units to encode Unicode
  00100100 for "$" (one 8-bits); 11000010 10100010 for "¢" (two 8-bits); 11100010 10000010 10101100 for "€" (three 8-bits)

UTF-16
  another character encoding
  code unit is 16-bits
  use one to two code units to encode Unicode
  00000000 00100100 for "$" (one 16-bits); 11011000 01010010 11011111 01100010 for "𤭢" (two 16-bits)

Answered Jan 6, 2015 at 7:45 by wengeezhang.

Comments:

The character before "two 16-bits" does not render (Firefox version 97.0 on Ubuntu MATE 20.04 (Focal Fossa)).
– Peter Mortensen, Feb 20 at 20:30

Answer (score 34):

Unicode is a fairly complex standard. Don't be too afraid, but be prepared for some work! [2]

Because a credible resource is always needed, but the official report is massive, I suggest reading the following:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) An introduction by Joel Spolsky, Stack Exchange CEO.
To the BMP and beyond! A tutorial by Eric Muller, Technical Director then, Vice President later, at The Unicode Consortium (the first 20 slides and you are done)

A brief explanation:

Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but covers only Latin (seven bits/character can represent 128 different characters). Unicode is a standard with the goal to cover all possible characters in the world (can hold up to 1,114,112 characters, meaning 21 bits/character maximum. Current Unicode 8.0 specifies 120,737 characters in total, and that's all).

The main difference is that an ASCII character can fit into a byte (eight bits), but most Unicode characters cannot. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this:

Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called a code point.

An encoding form maps a code point to a code unit sequence. A code unit is the way you want characters to be organized in memory, 8-bit units, 16-bit units and so on. UTF-8 uses one to four units of eight bits, and UTF-16 uses one or two units of 16 bits, to cover the entire Unicode of 21 bits maximum. Units use prefixes so that character boundaries can be spotted, and more units mean more prefixes that occupy bits. So, although UTF-8 uses one byte for the Latin script, it needs three bytes for later scripts inside the Basic Multilingual Plane, while UTF-16 uses two bytes for all these. And that's their main difference.

Lastly, an encoding scheme (like UTF-16BE or UTF-16LE) maps (serializes) a code unit sequence to a byte sequence.

character: π
code point: U+03C0
encoding forms (code units):
  UTF-8: CF 80
  UTF-16: 03C0
encoding schemes (bytes):
  UTF-8: CF 80
  UTF-16BE: 03 C0
  UTF-16LE: C0 03

Tip: a hexadecimal digit represents four bits, so a two-digit hex number represents a byte.

Also take a look at plane maps on Wikipedia to get a feeling of the character set layout.

Edited Feb 20 at 20:41 by Peter Mortensen. Answered Oct 27, 2015 at 1:03 by Neuron.

Comments:

Joel Spolsky is no longer the CEO.
– Peter Mortensen, Feb 20 at 20:33

Answer (score 32):

The article What every programmer absolutely, positively needs to know about encodings and character sets to work with text explains all the details.

Writing to buffer

If you write the symbol あ to a 4-byte buffer with UTF-8 encoding, your binary will look like this:

00000000 11100011 10000001 10000010

If you write the symbol あ to a 4-byte buffer with UTF-16 encoding, your binary will look like this:

00000000 00000000 00110000 01000010

As you can see, depending on what language you use in your content, this will affect your memory accordingly.

Example: For this particular symbol, あ, UTF-16 encoding is more efficient since we have 2 spare bytes to use for the next symbol. But it doesn't mean that you must use UTF-16 for the Japanese alphabet.

Reading from buffer

Now if you want to read the above bytes, you have to know in what encoding they were written and decode them back correctly.

E.g. if you decode this:

00000000 11100011 10000001 10000010

as UTF-16, you will end up with 臣, not あ.

Note: Encoding and Unicode are two different things. Unicode is the big (table) with each symbol mapped to a unique code point. E.g. the あ symbol (letter) has a (code point): 3042 (hex). Encoding on the other hand, is an algorithm that converts symbols to more appropriate way, when storing to hardware.

3042 (hex) -> UTF-8 encoding -> E38182 (hex), which is the above result in binary.

3042 (hex) -> UTF-16 encoding -> 3042 (hex), which is the above result in binary.

Edited Feb 20 at 20:46 by Peter Mortensen. Answered Jan 17, 2017 at 22:41 by InGeek.

Comments:

Great answer, which I upvoted. Would you be so kind to check if this part of your answer is how you thought it should be (because it does not make sense): "converts symbols to more appropriate way".
– bomben, Sep 14, 2021 at 9:47

(1) The title of the reference, "What every programmer absolutely, positively needs to know about encodings and character sets to work with text", is close to being plagiarism of Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
– Peter Mortensen, Feb 20 at 20:48

Answer (score 22):

Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.

Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.

Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.

Edited Feb 20 at 20:09 by Peter Mortensen. Answered Jul 5, 2010 at 5:04 by dan04.

Comments:

(4) It's worth noting that Microsoft still refers to UTF-16 as Unicode, adding to the confusion. The two are not the same.
– Mark Ransom, Jan 12, 2017 at 17:57

Answer (score 12):

Unicode is a standard which maps the characters in all languages to a particular numeric value called a code point. The reason it does this is that it allows different encodings to be possible using the same set of code points.

UTF-8 and UTF-16 are two such encodings. They take code points as input and encode them using some well-defined formula to produce the encoded string.

Choosing a particular encoding depends upon your requirements. Different encodings have different memory requirements, and depending upon the characters that you will be dealing with, you should choose the encoding which uses the fewest bytes to encode those characters.
For more in-depth details about Unicode, UTF-8 and UTF-16, you can check out this article: What every programmer should know about Unicode

Edited Feb 20 at 20:52 by Peter Mortensen. Answered Mar 25, 2017 at 15:10 by Kishu Agarwal.

Answer (score 9):

Why Unicode? Because ASCII has just 128 characters (code points 0 to 127). Those from 128 to 255 differ in different countries, and that's why there are code pages. So they said: let's have code points up to 1114111.

So how do you store the highest code point? You'll need to store it using 21 bits, so you'll use a DWORD having 32 bits with 11 bits wasted. So if you use a DWORD to store a Unicode character, it is the easiest way, because the value in your DWORD matches exactly the code point.

But DWORD arrays are of course larger than WORD arrays and of course even larger than BYTE arrays. That's why there is not only UTF-32, but also UTF-16. But UTF-16 means a WORD stream, and a WORD has 16 bits, so how can the highest code point 1114111 fit into a WORD? It cannot!

So they put everything higher than 65535 into a DWORD which they call a surrogate pair. Such a surrogate pair is two WORDs and can be detected by looking at the first 6 bits.

So what about UTF-8? It is a byte array or byte stream, but how can the highest code point 1114111 fit into a byte? It cannot! Okay, so they put in also a DWORD, right? Or possibly a WORD, right? Almost right!

They invented UTF-8 sequences, which means that every code point higher than 127 must get encoded into a 2-byte, 3-byte or 4-byte sequence. Wow! But how can we detect such sequences? Well, everything up to 127 is ASCII and is a single byte. What starts with 110 is a two-byte sequence, what starts with 1110 is a three-byte sequence and what starts with 11110 is a four-byte sequence. The remaining bits of these so-called "start bytes" belong to the code point.

Now depending on the sequence, following bytes must follow. A following byte starts with 10, and the remaining bits are 6 bits of payload bits and belong to the code point. Concatenate the payload bits of the start byte and the following byte/s and you'll have the code point. That's all the magic of UTF-8.
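The start-byte and following-byte rules just described can be sketched in Python as follows. The function name `decode_utf8_sequence` is made up for this illustration, and the sketch is deliberately incomplete: it does not reject overlong encodings, surrogate code points, or values above 1114111, which a real decoder must do.

```python
def decode_utf8_sequence(data):
    """Decode the first UTF-8 sequence in `data`.

    Returns (code_point, number_of_bytes_consumed).
    """
    first = data[0]
    if first < 0x80:                  # 0xxxxxxx: ASCII, a single byte
        return first, 1
    if first >> 5 == 0b110:           # 110xxxxx: two-byte sequence
        length, code_point = 2, first & 0b00011111
    elif first >> 4 == 0b1110:        # 1110xxxx: three-byte sequence
        length, code_point = 3, first & 0b00001111
    elif first >> 3 == 0b11110:       # 11110xxx: four-byte sequence
        length, code_point = 4, first & 0b00000111
    else:
        raise ValueError("not a valid start byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:         # following bytes look like 10xxxxxx
            raise ValueError("expected a following byte starting with 10")
        # Append the 6 payload bits of this following byte.
        code_point = (code_point << 6) | (byte & 0b00111111)
    return code_point, length


# The Euro sign is stored as the three bytes E2 82 AC.
print(decode_utf8_sequence(bytes([0xE2, 0x82, 0xAC])))  # (8364, 3)
```

Calling the function in a loop, advancing by the returned length each time, would decode a whole byte string one code point at a time.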
Edited Feb 20 at 20:35 by Peter Mortensen. Answered Jan 15, 2014 at 14:02 by brighty.

Comments:

(3) UTF-8 example of the € (Euro) sign decoded as a UTF-8 3-byte sequence:
E2 = 11100010
82 = 10000010
AC = 10101100
As you can see, E2 starts with 1110, so this is a three-byte sequence. As you can see, 82 as well as AC start with 10, so these are following bytes. Now we concatenate the "payload bits": 0010 + 000010 + 101100 = 10000010101100, which is decimal 8364. So 8364 must be the code point for the € (Euro) sign.
– brighty, Jan 15, 2014 at 14:18

Answer (score 6):

ASCII - Software allocates only one 8-bit byte in memory for a given character. It works well for English characters, as their corresponding decimal values fall below 128. Adopted characters (as in loanwords like façade) fall outside that range and are not covered by ASCII. Example C program.

UTF-8 - Software allocates one to four variable 8-bit bytes for a given character. What is meant by variable here? Let us say you are sending the character 'A' through your HTML pages in the browser (HTML is UTF-8). The corresponding decimal value of A is 65; when you convert it into binary it becomes 01000001. This requires only one byte. However, when you want to store European characters, it requires two bytes, so you need UTF-8; this includes adopted characters like 'ç' in the word façade. When you go for Asian characters, you require a minimum of two bytes and a maximum of four bytes. Similarly, emojis require three to four bytes. UTF-8 will solve all your needs.

UTF-16 will allocate a minimum of 2 bytes and a maximum of 4 bytes per character; it will not allocate 1 or 3 bytes. Each character is represented in either 16 bits or 32 bits.

Then why does UTF-16 exist? Originally, Unicode was 16 bit, not 8 bit. Java adopted the original version of UTF-16.

In a nutshell, you don't need UTF-16 anywhere unless it has already been adopted by the language or platform you are working on.

A Java program invoked by a web browser uses UTF-16, but the web browser sends characters using UTF-8.
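To make the "2 bytes or 4 bytes, never 1 or 3" point concrete, the following Python sketch splits a code point into UTF-16 code units by hand. The function name `to_utf16_code_units` is hypothetical; the arithmetic is the standard surrogate pair calculation (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```python
def to_utf16_code_units(code_point):
    """Return the list of 16-bit code units that encode `code_point`."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point <= 0xFFFF:
        # BMP character: a single code unit equal to the code point.
        return [code_point]
    # Supplementary character: 20 bits remain after removing 0x10000.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits, prefix 110110
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits, prefix 110111
    return [high, low]


for cp in (0x24, 0x24B62):
    units = to_utf16_code_units(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```

Each code unit is two bytes, so the result is always two or four bytes long, in line with the answer above.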
Edited Feb 20 at 20:56 by Peter Mortensen. Answered Dec 6, 2018 at 8:41 by Siva.

Comments:

"You don't need UTF-16 anywhere unless it has already been adopted by the language or platform": This is a good point, but here is a non-inclusive list: JavaScript, Java, .NET, SQL NCHAR, SQL NVARCHAR, VB4, VB5, VB6, VBA, VBScript, NTFS, Windows API….
– Tom Blodget, Dec 8, 2018 at 13:49

Re "when you want to store European characters, it requires two bytes, so you need UTF-8": Unless code pages are used, e.g. CP-1252.
– Peter Mortensen, Feb 20 at 20:59

Re "the web browser sends characters using UTF-8": Unless something like ISO 8859-1 is specified on a web page (?). E.g.
– Peter Mortensen, Feb 20 at 21:15

Answer (score 2):

UTF stands for Unicode Transformation Format. Basically, in today's world there are scripts written in hundreds of other languages, formats not covered by the basic ASCII used earlier. Hence, UTF came into existence.

UTF-8 has character encoding capabilities and its code unit is eight bits, while that for UTF-16 is 16 bits.

Edited Feb 20 at 20:43 by Peter Mortensen. Answered Aug 30, 2016 at 9:39 by Krishna Ganeriwal.