UTF-8 - Gentoo Wiki
UTF-8
UTF-8 is a variable-length character encoding, which in this instance means that it uses 1 to 4 bytes per symbol. The code points that fit in a single UTF-8 byte are exactly the ASCII characters, giving the character set full backwards compatibility with ASCII. As a result, ASCII text is unchanged and Latin-script text grows only slightly in size, because most of its characters still need only one byte. Users of Eastern scripts such as Japanese, whose characters have been assigned a higher byte range, are less happy, as this results in as much as a 50% redundancy in their data.
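The size trade-off described above can be checked with a short Python sketch. The sample characters and the comparison against EUC-JP are illustrative choices, not taken from the original article.

```python
# Number of bytes UTF-8 needs for a few sample characters.
samples = {
    "A": "ASCII letter",
    "\u00e9": "accented Latin letter (e with acute)",
    "\u65e5": "Japanese kanji",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, description in samples.items():
    print(f"{description}: {utf8_lengths[ch]} byte(s)")

# The same Japanese text in UTF-8 and in the legacy EUC-JP encoding.
text = "\u65e5\u672c\u8a9e"  # three kanji
utf8_size = len(text.encode("utf-8"))
eucjp_size = len(text.encode("euc_jp"))
overhead_percent = (utf8_size - eucjp_size) * 100 // eucjp_size
print(utf8_size, eucjp_size, overhead_percent)
```

For this sample the UTF-8 form is half as large again as the EUC-JP form, which is the 50% figure mentioned above.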
Contents
1 Character encodings
1.1 What is a character encoding?
1.2 The history of character encodings
1.3 What is Unicode?
1.4 What UTF-8 can do
2 Setting up UTF-8 in Gentoo
2.1 Finding or creating UTF-8 locales
2.2 Setting the locale
2.3 Alternatively, using eselect to set locales
3 Application support
3.1 (V)FAT
3.2 Filenames
3.3 The system console
3.4 Ncurses and Slang
3.5 KDE, GNOME, and Xfce
3.6 X11 and fonts
3.7 Window managers and terminal emulators
3.8 Vim, emacs, xemacs, and nano
3.9 Shells
3.10 Irssi
3.11 Mutt
3.12 links and elinks
3.13 Samba
3.14 Testing it all out
4 Reported issues and problems
4.1 System configuration files (in /etc)
5 External resources
Character encodings
What is a character encoding?
Computers themselves do not understand printed text as a human would. For computers, every character of text is represented by a number. Traditionally, each set of numbers used to represent alphabets and characters (known as a coding system, encoding, or character set) was limited in size due to limitations in computer hardware.
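As a minimal illustration of "every character is a number", the Python built-ins ord() and chr() convert between a character and its code number. The sample string is arbitrary.

```python
text = "Hi!"

# ord() gives the number behind each character ...
code_numbers = [ord(ch) for ch in text]
print(code_numbers)

# ... and chr() turns the numbers back into characters.
restored = "".join(chr(n) for n in code_numbers)
print(restored)
```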
The history of character encodings
The most common (or at least the most widely accepted) character set is ASCII (American Standard Code for Information Interchange). It is widely held that ASCII is the most successful software standard ever created. Modern ASCII was standardized in 1986 (ANSI X3.4, RFC 20, ISO/IEC 646:1991, ECMA-6) by the American National Standards Institute.
ASCII is strictly seven-bit, meaning that it uses bit patterns representable with seven binary digits, which provides a range of 0 to 127 in decimal. These include 33 non-visible control characters: 32 between 0 and 31, plus the final control character, DEL or delete, at 127. Characters 32 to 126 are visible characters: a space, punctuation marks, Latin letters and numbers.
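The ranges above can be tallied in a few lines of Python. This is only a counting sketch; the variable names are made up for the example.

```python
# Seven bits give 2**7 distinct values: 0 through 127.
ascii_size = 1 << 7

# Control characters: 0-31 plus DEL (127).
control_codes = [n for n in range(ascii_size) if n < 32 or n == 127]

# Visible characters: space (32) through tilde (126).
visible_codes = [n for n in range(ascii_size) if 32 <= n <= 126]

print(ascii_size, len(control_codes), len(visible_codes))
print("".join(chr(n) for n in visible_codes))
```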
The eighth bit in ASCII was originally used as a parity bit for error checking. If error checking is not desired, it is left as 0. This means that, with ASCII, each character is represented by a single byte.
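A rough sketch of how a parity bit can occupy the eighth bit follows. It assumes even parity (the article does not say which scheme was used), and with_even_parity is a hypothetical helper name.

```python
def with_even_parity(code: int) -> int:
    """Return the byte whose top bit makes the total number of 1 bits even.

    Assumption: even parity; real links also used odd parity or none.
    """
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit ASCII code")
    parity = bin(code).count("1") % 2  # 1 if the 7 data bits hold an odd count
    return code | (parity << 7)


# 'A' is 0b1000001 (two 1 bits): the parity bit stays 0.
byte_a = with_even_parity(ord("A"))
# 'C' is 0b1000011 (three 1 bits): the parity bit is set.
byte_c = with_even_parity(ord("C"))
print(hex(byte_a), hex(byte_c))
```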
Although ASCII was enough for communication in modern English, in other European languages that include accented characters, things were not so easy. The ISO 8859 standards were developed to meet these needs. They were backwards compatible with ASCII, but instead of leaving the eighth bit blank, they used it to allow another 128 characters in each encoding. ISO 8859's limitations soon came to light, and there are currently 15 variants of the ISO 8859 standard (8859-1 through to 8859-16; the number 8859-12 was abandoned). Outside of the ASCII-compatible byte range of these character sets, there is often conflict between the letters represented by each byte. To complicate interoperability between character encodings further, Windows-1252 is used in some versions of Microsoft Windows instead for Western European languages. It closely resembles ISO 8859-1, but it assigns printable characters to the byte range 0x80 to 0x9F, which ISO 8859-1 reserves for control codes. All of these sets do retain ASCII compatibility.
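The conflict in the upper byte range can be seen by decoding one and the same byte under several of these encodings. The byte values 0xE4 and 0x80 are arbitrary picks for the illustration; the codec names are those of Python's standard library.

```python
raw = bytes([0xE4])

# One byte, four encodings: the upper half means something different in each.
interpretations = {
    name: raw.decode(name)
    for name in ("iso8859_1", "iso8859_5", "iso8859_7", "cp1252")
}
for name, char in interpretations.items():
    print(f"{name}: {char}")

# Windows-1252 puts a printable character where ISO 8859-1 has a control code.
euro_cp1252 = bytes([0x80]).decode("cp1252")
ctrl_latin1 = bytes([0x80]).decode("iso8859_1")

# The lower (ASCII) half is identical everywhere.
ascii_agrees = all(b"A".decode(name) == "A" for name in interpretations)
print(euro_cp1252, repr(ctrl_latin1), ascii_agrees)
```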
The necessary development of completely different encodings for non-Latin alphabets, such as the multibyte EUC (Extended Unix Code), which is used for Japanese and Korean (and to a lesser extent Chinese), created more confusion. Other operating systems still used different character sets for the same languages, for example, Shift-JIS and ISO-2022-JP. Users wishing to view Cyrillic glyphs had to choose between KOI8-R for Russian and Bulgarian or KOI8-U for Ukrainian, as well as all the other Cyrillic encodings such as the unsuccessful ISO 8859-5 and the common Windows-1251 set. All of these character sets broke most compatibility with ASCII. It should be mentioned, though, that the KOI8 encodings place Cyrillic characters in Latin order, so if the eighth bit is stripped, text is still decipherable on an ASCII terminal through case-reversed transliteration.
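The KOI8 property mentioned above can be reproduced in Python by clearing the top bit of each byte. The sample word (Russian for "hello") is an arbitrary choice.

```python
word = "\u041f\u0440\u0438\u0432\u0435\u0442"  # Russian "Privet" in Cyrillic

koi8_bytes = word.encode("koi8_r")

# Clear the eighth bit of every byte, as a 7-bit channel would.
stripped = bytes(b & 0x7F for b in koi8_bytes).decode("ascii")

# The result is a Latin transliteration with the letter case reversed.
print(stripped)
```

The capital first letter comes out lowercase and the lowercase rest comes out uppercase, which is the "case-reversed transliteration" the text refers to.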
All of this has led to mass confusion, and to an almost total inability for multilingual communication, especially across different alphabets. Enter Unicode.
What is Unicode?
Unicode throws away the traditional single-byte limit of character sets. It uses 17 "planes" of 65,536 code points to describe a maximum of 1,114,112 characters. Because the first plane, a.k.a. the "Basic Multilingual Plane" or BMP, contains almost every character a user will ever need, many have made the wrong assumption that Unicode was a 16-bit character set.
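The arithmetic behind these figures, and the way a code point maps to its plane, can be sketched as follows. plane_of is a hypothetical helper written for this example.

```python
import sys

PLANES = 17
CODE_POINTS_PER_PLANE = 1 << 16  # 65,536

total_code_points = PLANES * CODE_POINTS_PER_PLANE
print(total_code_points)


def plane_of(ch: str) -> int:
    """Return the plane number of a single character (0 is the BMP)."""
    return ord(ch) >> 16


bmp_letter = plane_of("A")
bmp_kanji = plane_of("\u65e5")
beyond_bmp = plane_of("\U0001F600")  # an emoji, needs more than 16 bits
print(bmp_letter, bmp_kanji, beyond_bmp)

# Python's own upper limit is the last code point of plane 16.
highest = sys.maxunicode
```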
Unicode has been mapped in many different ways, but the two most common are UTF (Unicode Transformation Format) and UCS (Universal Character Set). A number after UTF indicates the number of bits in one unit, while the number after UCS indicates the number of bytes. UTF-8 has become the most widespread means for the interchange of Unicode text as a result of its eight-bit clean nature; it is therefore the subject of this document.
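To make the unit sizes concrete, the sketch below encodes a few characters with Python's utf-8, utf-16-le and utf-32-le codecs (the little-endian variants are chosen only so that no byte order mark is added). encoded_sizes is a hypothetical helper.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes each transformation format needs for ch."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


sizes_ascii = encoded_sizes("A")           # one 8-bit unit, one 16-bit unit, one 32-bit unit
sizes_euro = encoded_sizes("\u20ac")       # three 8-bit units, one 16-bit unit
sizes_emoji = encoded_sizes("\U0001F600")  # four 8-bit units, two 16-bit units

for label, sizes in (("A", sizes_ascii), ("euro sign", sizes_euro), ("emoji", sizes_emoji)):
    print(label, sizes)

# ASCII text is byte-for-byte identical in UTF-8.
ascii_unchanged = "plain text".encode("utf-8") == b"plain text"
```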
What UTF-8 can do
UTF-8 allows users to work in a standards-compliant and internationally accepted multilingual environment, with a comparatively low data redundancy. It is the preferred way for transmitting non-ASCII characters over the Internet, through Email, IRC, or almost any other medium. Despite this, many people regard UTF-8 in online communication as abusive. It is always best to be aware of the attitude towards UTF-8 in a specific channel, mailing list, or Usenet group before using non-ASCII UTF-8.
Setting up UTF-8 in Gentoo
Finding or creating UTF-8 locales
Now that the principles behind Unicode have been laid out, get ready to start using UTF-8 locally!
For users interested in more knowledge, further explanation can be found in the Gentoo Localization Guide.
Next, the user needs to decide whether a UTF-8 locale is available for the language of choice, or whether one needs to be generated.
user $ locale -a | grep 'en_GB'
en_GB
en_GB.utf8
From the output of the above command, look for a result with a suffix similar to .UTF-8. If there is no result with a similar suffix, a UTF-8 compatible locale must be created.
The command lists the suffix in lowercase without any hyphens. glibc understands both forms of the suffix, but many other programs do not, the most common example of which is X. So it is best to always use UTF-8 in preference to utf8.
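For scripts that post-process the output of locale -a, the preferred spelling can be restored with a small helper. This is only a sketch: prefer_utf8_suffix is a hypothetical function, and it deliberately ignores locale names that carry an @modifier after the suffix.

```python
import re


def prefer_utf8_suffix(locale_name: str) -> str:
    """Rewrite a trailing .utf8/.utf-8 suffix (any case) as .UTF-8."""
    return re.sub(r"\.utf-?8$", ".UTF-8", locale_name, flags=re.IGNORECASE)


# Sample names as they might appear in `locale -a` output.
listed = ["en_GB", "en_GB.utf8", "en_GB.UTF-8", "en_GB.iso88591"]
normalized = [prefer_utf8_suffix(name) for name in listed]
print(normalized)
```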
Note: Only execute the following code if the system does not have a UTF-8 locale available for the language of choice.
Replace "en_GB" with the desired locale setting:
root # localedef -i en_GB -f UTF-8 en_GB.UTF-8
Another way to include a UTF-8 locale is to add it to the /etc/locale.gen file and generate the necessary locales using the locale-gen command. Locales will be written to the locale archive, /usr/lib/locale/locale-archive.
CODE Line in /etc/locale.gen
en_GB.UTF-8 UTF-8
root # locale-gen
 * Generating 1 locales (this might take a while) with 1 jobs
 * (1/1) Generating en_GB.UTF-8 ... [ ok ]
 * Generation complete
Setting the locale
There is one environment variable that needs to be set in order to use the new UTF-8 locales: LC_CTYPE (optionally modify the LANG variable to change the system language as well). There are many different ways to set it; some system administrators prefer to only have a UTF-8 environment for a specific user, in which case they set it in that user's ~/.profile (for /bin/sh, Bourne shell users), or ~/.bash_profile or ~/.bashrc (for /bin/bash, Bourne again shell users). More details and best practices can be found in the Localization Guide.
Still others prefer to set the locale globally. One specific circumstance where the author particularly recommends doing this is when /etc/init.d/xdm is in use, because this init script starts the display manager and desktop before any of the aforementioned shell startup files are sourced. In other words, the desktop is started before any of these variables are loaded into the environment.
Setting the locale globally should be done using the /etc/env.d/02locale file. This file should look something like the following:
FILE /etc/env.d/02locale Demonstration of en_GB.UTF-8
## (As always, change "en_GB.UTF-8" to the appropriate locale value; each language has a different value!)
LANG="en_GB.UTF-8"
Note: It is possible to substitute the LC_CTYPE variable for the LANG variable. For more information on the categories affected by using LC_CTYPE, read the GNU locale page.
Next, the environment must be updated by running the following commands:
root # env-update
>>> Regenerating /etc/ld.so.cache...
root # source /etc/profile
Now, run locale with no arguments to see if the correct variables have been loaded in the environment:
root # locale
LANG=en_GB.utf8
LC_CTYPE="en_GB.utf8"
LC_NUMERIC="en_GB.utf8"
LC_TIME="en_GB.utf8"
LC_COLLATE="en_GB.utf8"
LC_MONETARY="en_GB.utf8"
LC_MESSAGES="en_GB.utf8"
LC_PAPER="en_GB.utf8"
LC_NAME="en_GB.utf8"
LC_ADDRESS="en_GB.utf8"
LC_TELEPHONE="en_GB.utf8"
LC_MEASUREMENT="en_GB.utf8"
LC_IDENTIFICATION="en_GB.utf8"
LC_ALL=
The values of locale environment variables that have been explicitly set, e.g. in an export statement (if using bash), are listed without double quotes. Those whose value has been inherited from other locale environment variables have their values in double quotes.
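The quoting convention just described lends itself to a small parser. The sketch below works on a shortened copy of the sample output above rather than calling locale itself; classify_locale_output is a hypothetical helper name.

```python
sample_output = '''LANG=en_GB.utf8
LC_CTYPE="en_GB.utf8"
LC_NUMERIC="en_GB.utf8"
LC_ALL='''


def classify_locale_output(output: str):
    """Split `locale` output into explicit, inherited and unset variables."""
    explicit, inherited, unset = {}, {}, []
    for line in output.splitlines():
        name, _, value = line.partition("=")
        if not value:
            unset.append(name)            # e.g. "LC_ALL="
        elif len(value) >= 2 and value[0] == value[-1] == '"':
            inherited[name] = value[1:-1]  # quoted: derived from another variable
        else:
            explicit[name] = value         # unquoted: set directly
    return explicit, inherited, unset


explicit, inherited, unset = classify_locale_output(sample_output)
print("explicit:", explicit)
print("inherited:", inherited)
print("unset:", unset)
```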
Alternatively, using eselect to set locales
Although it is good to maintain the system as described above, it is also possible to check and set the configured locale using the eselect utility.
Use eselect to list the available locales on the system:
root # eselect locale list
  [1]   C
  [2]   POSIX *
  [3]   en_GB.utf8
  [ ]   (free form)
With eselect, setting the locale is as simple as listing them. Once the correct locale has been determined, invoke:
root # eselect locale set 3
Setting LANG to en_GB.utf8 ...
Check the result:
root # eselect locale list
  [1]   C
  [2]   POSIX
  [3]   en_GB.utf8 *
  [ ]   (free form)
In case it is preferred to have /etc/env.d/02locale with .UTF-8 instead of .utf8, run the appropriate eselect command:
root # eselect locale set en_GB.UTF-8
Setting LANG to en_GB.UTF-8 ...
root # eselect locale list
  [1]   C
  [2]   POSIX
  [3]   en_GB.utf8
  [4]   en_GB.UTF-8 *
  [ ]   (free form)
Running the following command will update the variables in the shell:
root # env-update && source /etc/profile
>>> Regenerating /etc/ld.so.cache...
That is everything. The system is now using UTF-8 locales. The next hurdle is the configuration of the applications used from day to day.
Application support
When Unicode first started gaining momentum in the software world, multibyte character sets were not well suited to languages like C, which is the base language of most commonly used programs. Even today, some programs are not able to handle UTF-8 properly. Fortunately the majority of programs, especially the common ones, are supported.
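One reason byte-oriented programs struggle is that "one byte equals one character" no longer holds. The following Python sketch, using an arbitrary sample word, shows the two counts diverging and a byte-wise cut landing in the middle of a character.

```python
word = "na\u00efve"  # "naive" written with i-diaeresis
as_bytes = word.encode("utf-8")

char_count = len(word)      # what the user sees
byte_count = len(as_bytes)  # what a byte-oriented program counts
print(char_count, byte_count)

# Cutting after three bytes splits the two-byte sequence for the accented letter.
truncated = as_bytes[:3]
try:
    truncated.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False
print(truncated, cut_is_valid)
```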
(V)FAT
For UTF-8 support in FAT filesystems see the FAT article.
Filenames
For changing the encoding of filenames, app-text/convmv can be used.
root # emerge --ask app-text/convmv
The format of the convmv command is as follows:
root # convmv -f
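Conceptually, converting a filename means decoding its bytes with the old encoding and encoding the result as UTF-8. The Python sketch below shows that transformation on a single name held in memory; it is not how convmv is implemented, it touches no files, and reencode_name and the sample name are made up for the illustration.

```python
def reencode_name(raw_name: bytes, source_encoding: str) -> bytes:
    """Reinterpret a filename's bytes from source_encoding as UTF-8 bytes."""
    return raw_name.decode(source_encoding).encode("utf-8")


# A name stored as ISO 8859-1: the accented e is the single byte 0xE9.
old_name = b"caf\xe9.txt"
new_name = reencode_name(old_name, "iso8859_1")
print(old_name, "->", new_name)

# Pure ASCII names are unaffected by the conversion.
ascii_name = reencode_name(b"notes.txt", "iso8859_1")
```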