What are Unicode, UTF-8, and UTF-16? - Stack Overflow
475
What's the basis for Unicode and why the need for UTF-8 or UTF-16?
I have researched this on Google and searched here as well, but it's not clear to me.
In VSS, when doing a file comparison, sometimes there is a message saying the two files have differing UTF's. Why would this be the case?
Please explain in simple terms.
unicode encoding utf-8 utf-16
edited Feb 18 at 17:51 by Peter Mortensen
asked Feb 11, 2010 at 0:12 by SoftwareGeek
143
Sounds like you need to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets! It's a very good explanation of what's going on.
– Brian Agnew
Feb 11, 2010 at 0:14
5
This FAQ from the official Unicode website has some answers for you.
– Nemanja Trifunovic
Feb 11, 2010 at 16:10
4
@John: it's a very nice introduction, but it's not the ultimate source: It skips quite a few of the details (which is fine for an overview/introduction!)
– Joachim Sauer
Jun 24, 2011 at 10:22
6
The article is great, but it has several mistakes and represents UTF-8 in somewhat conservative light. I suggest reading utf8everywhere.org as a supplement.
– Pavel Radzivilovsky
May 27, 2012 at 4:53
2
Take a look at this website: utf8everywhere.org
– Vertexwahn
Jan 14, 2016 at 12:37
9 Answers
652
+250
Why do we need Unicode?
In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).
But for argument's sake, let's say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but this is not fine for Joe the software developer. Approximately half the world uses non-Latin characters and using ASCII is arguably inconsiderate to these people, and on top of that, he is closing off his software to a large and growing economy.
Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence the first 128 are also identical to ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.
Memory considerations
So how many bytes give access to what characters in these encodings?
UTF-8:
1 byte: Standard ASCII
2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
3 bytes: BMP
4 bytes: All Unicode characters
UTF-16:
2 bytes: BMP
4 bytes: All Unicode characters
It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese, Japanese, and Korean (CJK) characters.
If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, using UTF-8 could be up to 1.5 times less memory efficient than UTF-16. When dealing with large amounts of text, such as large web-pages or lengthy word documents, this could impact performance.
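The byte counts listed above can be checked directly. The following is a minimal sketch in Python (the language and the sample characters are my own choice for illustration, not part of the answer); it measures how many bytes a single character occupies in each encoding.

```python
# Sample characters chosen for illustration only; they are not from the answer.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin-1 letter (e with acute)",
    "\u05d0": "Hebrew letter alef",
    "\u3042": "Hiragana letter a (BMP)",
    "\U0001d11e": "Musical symbol G clef (outside the BMP)",
}


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, label in samples.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Note that the BMP characters never exceed three bytes in UTF-8 or two bytes in UTF-16, while the character outside the BMP needs four bytes in both.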
Encoding basics
Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.
UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.
UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions map to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream which indicates endianness. Thus, if you are reading UTF-16 input, and no endianness is specified, you must check for this.
As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.
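To make the surrogate pair and byte order mark ideas concrete, here is a small Python sketch. The function names `to_surrogate_pair` and `detect_utf16_byte_order` are hypothetical helpers written for this illustration; only `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` come from the standard library.

```python
import codecs


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def detect_utf16_byte_order(data):
    """Return the byte order announced by a leading BOM, or None if absent."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return None


high, low = to_surrogate_pair(0x1D11E)  # U+1D11E MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")
print(detect_utf16_byte_order(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
```

When no BOM is present the sketch returns `None`; in that case the byte order has to come from somewhere else, such as a protocol header or an explicit encoding name like UTF-16BE.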
Practical programming considerations
Character and string data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.
Recommended, default, and dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.
Library support: The libraries you are using support some kind of encoding. Which one? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1, 2, and even 3 byte characters can occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly since they occur very rarely.
Counting characters: There exist combining characters in Unicode. For example, the code point U+006E (n), and U+0303 (a combining tilde) forms ñ, but the code point U+00F1 forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example, and 1 for the latter. This isn't necessarily wrong, but it may not be the desired outcome either.
Comparing for equality: A, А, and Α look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ. One is a letter, and the other is a Roman numeral. In addition, we have the combining characters to consider as well. For more information, see Duplicate characters in Unicode.
Surrogate pairs: These come up often enough on Stack Overflow, so I'll just provide some example links:
Getting string length
Removing surrogate pairs
Palindrome checking
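The counting and equality points above are easy to reproduce. This is a minimal Python sketch using the standard `unicodedata` module; NFC normalization is one common way to make the two spellings of ñ compare equal, though the answer itself does not prescribe a particular approach.

```python
import unicodedata

decomposed = "n\u0303"   # U+006E followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"   # U+00F1 LATIN SMALL LETTER N WITH TILDE

# A naive count of code points gives different answers for the same visible text.
print(len(decomposed), len(precomposed))
print(decomposed == precomposed)

# NFC composes the base letter and the combining mark into a single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)

# Look-alike letters remain distinct characters even after normalization.
for ch in ("A", "\u0410", "\u0391"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Normalization handles the combining-character case, but it does not merge the Latin, Cyrillic, and Greek look-alikes, which stay separate code points.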
edited Feb 20 at 20:28 by Peter Mortensen
answered Feb 28, 2013 at 5:24 by DPenner1
11
Excellent answer, great chances for the bounty ;-) Personally I'd add that some argue for UTF-8 as the universal character encoding, but I know that that's an opinion that's not necessarily shared by everyone.
– Joachim Sauer
Feb 28, 2013 at 9:09
3
Still too technical for me at this stage. How is the word hello stored in a computer in UTF-8 and UTF-16?
– FirstName LastName
May 12, 2013 at 18:17
1
Could you expand more on why, for example, the BMP takes 3 bytes in UTF-8? I would have thought that since its maximum value is 0xFFFF (16 bits) then it would only take 2 bytes to access.
– mark
Oct 7, 2014 at 22:14
2
@mark Some bits are reserved for encoding purposes. For a code point that takes 2 bytes in UTF-8, there are 5 reserved bits, leaving only 11 bits to select a code point. U+07FF ends up being the highest code point representable in 2 bytes.
– DPenner1
Oct 8, 2014 at 3:40
1
BTW - ASCII only defines 128 code points, using only 7 bits for representation. It is ISO-8859-1/ISO-8859-15 which define 256 code points and use 8 bits for representation. The first 128 code points in all these 3 are the same.
– Tuxdude
Feb 15, 2016 at 21:34
92
Unicode
is a set of characters used around the world
UTF-8
a character encoding capable of encoding all possible characters (called code points) in Unicode.
code unit is 8-bits
use one to four code units to encode Unicode
00100100 for "$" (one 8-bits); 11000010 10100010 for "¢" (two 8-bits); 11100010 10000010 10101100 for "€" (three 8-bits)
UTF-16
another character encoding
code unit is 16-bits
use one to two code units to encode Unicode
00000000 00100100 for "$" (one 16-bits); 11011000 01010010 11011111 01100010 for "𤭢" (two 16-bits)
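The bit patterns above can be reproduced with a few lines of Python. The helper name `bits` is made up for this sketch, and big-endian byte order is assumed for the UTF-16 output so that it matches the way the patterns are written here.

```python
def bits(text, encoding):
    """Return the encoded bytes of `text` as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


# UTF-8: one, two, and three 8-bit code units.
print(bits("$", "utf-8"))
print(bits("\u00a2", "utf-8"))       # cent sign
print(bits("\u20ac", "utf-8"))       # euro sign

# UTF-16 (big-endian, no BOM): one and two 16-bit code units.
print(bits("$", "utf-16-be"))
print(bits("\U00024b62", "utf-16-be"))
```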
answered Jan 6, 2015 at 7:45 by wengeezhang
The character before "two 16-bits" does not render (Firefox version 97.0 on Ubuntu MATE 20.04 (Focal Fossa)).
– Peter Mortensen
Feb 20 at 20:30
34
Unicode is a fairly complex standard. Don’t be too afraid, but be prepared for some work! [2]
Because a credible resource is always needed, but the official report is massive, I suggest reading the following:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) An introduction by Joel Spolsky, Stack Exchange CEO.
To the BMP and beyond! A tutorial by Eric Muller, Technical Director then, Vice President later, at The Unicode Consortium (the first 20 slides and you are done)
A brief explanation:
Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but covers only Latin (seven bits/character can represent 128 different characters). Unicode is a standard with the goal to cover all possible characters in the world (can hold up to 1,114,112 characters, meaning 21 bits/character maximum. Current Unicode 8.0 specifies 120,737 characters in total, and that's all).
The main difference is that an ASCII character can fit to a byte (eight bits), but most Unicode characters cannot. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this:
Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called a code point.
An encoding form maps a code point to a code unit sequence. A code unit is the way you want characters to be organized in memory, 8-bit units, 16-bit units and so on. UTF-8 uses one to four units of eight bits, and UTF-16 uses one or two units of 16 bits, to cover the entire Unicode of 21 bits maximum. Units use prefixes so that character boundaries can be spotted, and more units mean more prefixes that occupy bits. So, although UTF-8 uses one byte for the Latin script, it needs three bytes for later scripts inside a Basic Multilingual Plane, while UTF-16 uses two bytes for all these. And that's their main difference.
Lastly, an encoding scheme (like UTF-16BE or UTF-16LE) maps (serializes) a code unit sequence to a byte sequence.
character: π
code point: U+03C0
encoding forms (code units):
UTF-8: CF 80
UTF-16: 03C0
encoding schemes (bytes):
UTF-8: CF 80
UTF-16BE: 03 C0
UTF-16LE: C0 03
Tip: a hexadecimal digit represents four bits, so a two-digit hex number represents a byte.
Also take a look at plane maps on Wikipedia to get a feeling of the character set layout.
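The π example above can be verified with a short Python sketch. The helper `encode_hex` is a name invented for this illustration; the codec names `utf-8`, `utf-16-be`, and `utf-16-le` are Python's spellings of the encoding schemes.

```python
def encode_hex(ch, scheme):
    """Serialize `ch` with the given encoding scheme and return the bytes as hex."""
    return " ".join(f"{byte:02X}" for byte in ch.encode(scheme))


pi = "\u03c0"  # GREEK SMALL LETTER PI
print(f"code point: U+{ord(pi):04X}")
for scheme in ("utf-8", "utf-16-be", "utf-16-le"):
    print(f"{scheme}: {encode_hex(pi, scheme)}")
```

The single 16-bit code unit 03C0 is the same in both UTF-16 schemes; only the order in which its two bytes are written differs.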
edited Feb 20 at 20:41 by Peter Mortensen
answered Oct 27, 2015 at 1:03 by Neuron
Joel Spolsky is no longer the CEO.
– Peter Mortensen
Feb 20 at 20:33
32
The article What every programmer absolutely, positively needs to know about encodings and character sets to work with text explains all the details.
Writing to buffer
If you write to a 4 byte buffer, symbol あ with UTF8 encoding, your binary will look like this:
00000000 11100011 10000001 10000010
If you write to a 4 byte buffer, symbol あ with UTF16 encoding, your binary will look like this:
00000000 00000000 00110000 01000010
As you can see, depending on what language you would use in your content this will affect your memory accordingly.
Example: For this particular symbol: あ UTF16 encoding is more efficient since we have 2 spare bytes to use for the next symbol. But it doesn't mean that you must use UTF16 for the Japanese alphabet.
Reading from buffer
Now if you want to read the above bytes, you have to know in what encoding it was written to and decode it back correctly.
e.g. If you decode this:
00000000 11100011 10000001 10000010
into UTF16 encoding, you will end up with 臣 not あ
Note: Encoding and Unicode are two different things. Unicode is the big (table) with each symbol mapped to a unique code point. e.g. あ symbol (letter) has a (code point): 3042 (hex). Encoding on the other hand, is an algorithm that converts symbols to more appropriate way, when storing to hardware.
3042 (hex) -> UTF8 encoding -> E38182 (hex), which is above result in binary.
3042 (hex) -> UTF16 encoding -> 3042 (hex), which is above result in binary.
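Here is a hedged Python sketch of the write/read mismatch described above. The answer does not say which byte order or which bytes produce 臣; the sketch assumes the first two UTF-8 bytes (E3 81) are read as one little-endian UTF-16 code unit, which gives U+81E3.

```python
hiragana_a = "\u3042"  # the symbol used in the answer, code point 3042 (hex)

utf8_bytes = hiragana_a.encode("utf-8")        # E3 81 82
utf16_bytes = hiragana_a.encode("utf-16-be")   # 30 42
print(utf8_bytes.hex(), utf16_bytes.hex())

# Reading with the same encoding that was used for writing round-trips cleanly.
print(utf8_bytes.decode("utf-8") == hiragana_a)
print(utf16_bytes.decode("utf-16-be") == hiragana_a)

# Assumption: the first two UTF-8 bytes are misread as a little-endian
# UTF-16 code unit. The result is a different character entirely.
misread = utf8_bytes[:2].decode("utf-16-le")
print(f"U+{ord(misread):04X}")
```

The third UTF-8 byte is left over in this scenario; a strict UTF-16 decoder given all three bytes would report truncated input rather than silently ignore it.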
edited Feb 20 at 20:46 by Peter Mortensen
answered Jan 17, 2017 at 22:41 by InGeek
Great answer, which I upvoted. Would you be so kind to check if this part of your answer is how you thought it should be (because it does not make sense): "converts symbols to more appropriate way".
– bomben
Sep 14, 2021 at 9:47
1
The title of the reference, "What every programmer absolutely, positively needs to know about encodings and character sets to work with text", is close to be plagiarism of the Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
– Peter Mortensen
Feb 20 at 20:48
22
Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.
Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.
Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.
edited Feb 20 at 20:09 by Peter Mortensen
answered Jul 5, 2010 at 5:04 by dan04
4
It's worth noting that Microsoft still refers to UTF-16 as Unicode, adding to the confusion. The two are not the same.
– Mark Ransom
Jan 12, 2017 at 17:57
12
Unicode is a standard which maps the characters in all languages to a particular numeric value called a code point. The reason it does this is that it allows different encodings to be possible using the same set of code points.
UTF-8 and UTF-16 are two such encodings. They take code points as input and encode them using some well-defined formula to produce the encoded string.
Choosing a particular encoding depends upon your requirements. Different encodings have different memory requirements, and depending upon the characters that you will be dealing with, you should choose the encoding which uses the fewest bytes to encode those characters.
For more in-depth details about Unicode, UTF-8 and UTF-16, you can check out this article,
What every programmer should know about Unicode
edited Feb 20 at 20:52 by Peter Mortensen
answered Mar 25, 2017 at 15:10 by Kishu Agarwal
9
Why Unicode? Because ASCII has just 128 characters (code points 0 to 127). Those from 128 to 255 differ in different countries, and that's why there are code pages. So they said: let’s have code points up to 1114111.
So how do you store the highest code point? You'll need to store it using 21 bits, so you'll use a DWORD having 32 bits with 11 bits wasted. So if you use a DWORD to store a Unicode character, it is the easiest way, because the value in your DWORD matches exactly the code point.
But DWORD arrays are of course larger than WORD arrays and of course even larger than BYTE arrays. That's why there is not only UTF-32, but also UTF-16. But UTF-16 means a WORD stream, and a WORD has 16 bits, so how can the highest code point 1114111 fit into a WORD? It cannot!
So they put everything higher than 65535 into a DWORD which they call a surrogate-pair. Such a surrogate-pair is two WORDs, each of which can be detected by looking at its first 6 bits.
So what about UTF-8? It is a byte array or byte stream, but how can the highest code point 1114111 fit into a byte? It cannot! Okay, so they put in also a DWORD right? Or possibly a WORD, right? Almost right!
They invented utf-8 sequences which means that every code point higher than 127 must get encoded into a 2-byte, 3-byte or 4-byte sequence. Wow! But how can we detect such sequences? Well, everything up to 127 is ASCII and is a single byte. What starts with 110 is a two-byte sequence, what starts with 1110 is a three-byte sequence and what starts with 11110 is a four-byte sequence. The remaining bits of these so called "start bytes" belong to the code point.
Now depending on the sequence, following bytes must follow. A following byte starts with 10, and the remaining bits are 6 bits of payload bits and belong to the code point. Concatenate the payload bits of the start byte and the following byte/s and you'll have the code point. That's all the magic of UTF-8.
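The start-byte and following-byte rules above translate almost directly into code. This is a minimal Python sketch, not a complete decoder: the function name `decode_utf8_sequence` is made up for illustration, and it deliberately skips checks a real decoder needs, such as rejecting overlong forms and surrogate code points.

```python
def decode_utf8_sequence(data):
    """Decode the UTF-8 sequence at the start of `data`.

    Returns (code point, number of bytes consumed).
    """
    first = data[0]
    if first < 0x80:                 # 0xxxxxxx: plain ASCII, a single byte
        return first, 1
    if first >> 5 == 0b110:          # 110xxxxx: two-byte sequence
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx: three-byte sequence
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110xxx: four-byte sequence
        length, value = 4, first & 0x07
    else:
        raise ValueError("not a valid start byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:        # every following byte must start with 10
            raise ValueError("expected a following byte starting with 10")
        value = (value << 6) | (byte & 0x3F)   # append 6 payload bits
    return value, length


code_point, consumed = decode_utf8_sequence(bytes.fromhex("E282AC"))
print(code_point, consumed)
```

Feeding it the three bytes E2 82 AC concatenates the payload bits 0010, 000010, and 101100, which is the code point of the euro sign.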
edited Feb 20 at 20:35 by Peter Mortensen
answered Jan 15, 2014 at 14:02 by brighty
3
utf-8 example of € (Euro) sign decoded in utf-8 3-byte sequence: E2=11100010 82=10000010 AC=10101100. As you can see, E2 starts with 1110 so this is a three-byte sequence. As you can see, 82 as well as AC starts with 10 so these are following bytes. Now we concatenate the "payload bits": 0010 + 000010 + 101100 = 10000010101100 which is decimal 8364. So 8364 must be the code point for the € (Euro) sign.
– brighty
Jan 15, 2014 at 14:18
6
ASCII - Software allocates only one 8-bit byte in memory for a given character. It works well for English characters, as their corresponding decimal values fall below 128. Adopted characters such as the ç in the loanword façade fall outside that range and cannot be represented in ASCII. Example: C program.
UTF-8 - Software allocates one to four variable 8-bit bytes for a given character. What is meant by variable here? Let us say you are sending the character 'A' through your HTML pages in the browser (HTML is typically UTF-8). The corresponding decimal value of A is 65, and when you convert it into binary it becomes 01000001. This requires only one byte. However, when you want to store European characters such as the 'ç' in the word façade, it requires two bytes, so you need UTF-8. However, when you go for Asian characters, you require minimum of two bytes and maximum of four bytes. Similarly, emojis require three to four bytes. UTF-8 will solve all your needs.
UTF-16 will allocate minimum 2 bytes and maximum of 4 bytes per character, it will not allocate 1 or 3 bytes. Each character is either represented in 16 bit or 32 bit.
Then why does UTF-16 exist? Originally, Unicode was 16 bit not 8 bit. Java adopted the original 16-bit form of Unicode, which was later extended into UTF-16.
In a nutshell, you don't need UTF-16 anywhere unless it has already been adopted by the language or platform you are working on.
A Java program invoked by a web browser uses UTF-16, but the web browser sends characters using UTF-8.
edited Feb 20 at 20:56 by Peter Mortensen
answered Dec 6, 2018 at 8:41 by Siva
"You don't need UTF-16 anywhere unless it has been already been adopted by the language or platform": This is a good point but here is a non-inclusive list: JavaScript, Java, .NET, SQL NCHAR, SQL NVARCHAR, VB4, VB5, VB6, VBA, VBScript, NTFS, Windows API….
– Tom Blodget
Dec 8, 2018 at 13:49
Re "when you want to store European characters, it requires two bytes, so you need UTF-8": Unless code pages are used, e.g. CP-1252.
– Peter Mortensen
Feb 20 at 20:59
Re "the web browser sends characters using UTF-8": Unless something like ISO 8859-1 is specified on a web page (?). E.g.