UTF-8 - Gentoo Wiki

UTF-8 is a variable-length character encoding, which in this instance means that it uses 1 to 4 bytes per symbol. Single-byte UTF-8 sequences encode ASCII, giving the character set full backwards compatibility with ASCII. As a result, ASCII text is unchanged in UTF-8 and Latin-script text grows only slightly, because most of its characters still need just one byte. Users of East Asian scripts such as Japanese, whose characters have been assigned a higher byte range, are less happy, as this results in as much as a 50% increase in the size of their data.

Contents

1 Character encodings
    1.1 What is a character encoding?
    1.2 The history of character encodings
    1.3 What is Unicode?
    1.4 What UTF-8 can do
2 Setting up UTF-8 in Gentoo
    2.1 Finding or creating UTF-8 locales
    2.2 Setting the locale
    2.3 Alternatively, using eselect to set locales
3 Application support
    3.1 (V)FAT
    3.2 Filenames
    3.3 The system console
    3.4 Ncurses and Slang
    3.5 KDE, GNOME, and Xfce
    3.6 X11 and fonts
    3.7 Window managers and terminal emulators
    3.8 Vim, emacs, xemacs, and nano
    3.9 Shells
    3.10 Irssi
    3.11 Mutt
    3.12 links and elinks
    3.13 Samba
    3.14 Testing it all out
4 Reported issues and problems
    4.1 System configuration files (in /etc)
5 External resources

Character encodings

What is a character encoding?

Computers themselves do not understand printed text as a human would. For computers, every character of text is represented by a number. Traditionally, each set of numbers used to represent alphabets and characters (known as a coding system, encoding, or character set) was limited in size due to limitations in computer hardware.

The history of character encodings

The most common (or at least the most widely accepted) character set is ASCII (American Standard Code for Information Interchange). It is widely held that ASCII is the most successful software standard ever created. Modern ASCII was standardized in 1986 (ANSI X3.4, RFC 20, ISO/IEC 646:1991, ECMA-6) by the American National Standards Institute.

ASCII is strictly seven-bit, meaning that it uses bit patterns representable with seven binary digits, which provides a range of 0 to 127 in decimal. These include 33 non-visible control characters: 32 of them between 0 and 31, plus the final control character, DEL or delete, at 127. Characters 32 to 126 are visible characters: a space, punctuation marks, Latin letters and numbers.

The eighth bit in ASCII was originally used as a parity bit for error checking. If error checking is not desired, it is left as 0. This means that, with ASCII, each character is represented by a single byte.

Although ASCII was enough for communication in modern English, in other European languages that include accented characters, things were not so easy. The ISO 8859 standards were developed to meet these needs. They were backwards compatible with ASCII, but instead of leaving the eighth bit blank, they used it to allow another 128 characters in each encoding. ISO 8859's limitations soon came to light, and there are currently 15 variants of the ISO 8859 standard (8859-1 through to 8859-16, with 8859-12 never published). Outside of the ASCII-compatible byte range of these character sets, there is often conflict between the letters represented by each byte. To complicate interoperability between character encodings further, Windows-1252 is used in some versions of Microsoft Windows instead for Western European languages. This is a superset of ISO 8859-1 that assigns printable characters to bytes ISO 8859-1 reserves for control codes, so the two differ in several places; these sets do all retain ASCII compatibility.

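To make the conflict between ISO 8859 variants concrete, the following sketch decodes one and the same byte (0xA4) under two of them. The helper name show_byte_as is hypothetical, and the sketch assumes an iconv that knows both charset names, as the one bundled with glibc does.

```shell
#!/bin/sh
# Hypothetical helper: interpret the single byte 0xA4 (octal 244) in the
# charset named by $1 and print the hex bytes of its UTF-8 representation.
show_byte_as() {
    printf '\244' | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

show_byte_as ISO-8859-1    # the byte means U+00A4, CURRENCY SIGN
show_byte_as ISO-8859-15   # the byte means U+20AC, EURO SIGN
```

The two calls print different byte sequences for the same input, which is exactly why text in one of these encodings cannot be read reliably without knowing which variant was used.
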
The necessary development of completely different encodings for non-Latin scripts, such as the multibyte EUC (Extended Unix Code), which is used for Japanese and Korean (and to a lesser extent Chinese), created more confusion. Other operating systems still used different character sets for the same languages, for example, Shift-JIS and ISO-2022-JP. Users wishing to view Cyrillic glyphs had to choose between KOI8-R for Russian and Bulgarian or KOI8-U for Ukrainian, as well as all the other Cyrillic encodings such as the unsuccessful ISO 8859-5, and the common Windows-1251 set. All of these character sets broke most compatibility with ASCII. It should be mentioned, however, that KOI8 encodings place Cyrillic characters in Latin order, so if the eighth bit is stripped, text is still decipherable on an ASCII terminal through case-reversed transliteration.

All of this has led to mass confusion, and to an almost total inability for multilingual communication, especially across different alphabets. Enter Unicode.

What is Unicode?

Unicode throws away the traditional single-byte limit of character sets. It uses 17 "planes" of 65,536 code points to describe a maximum of 1,114,112 characters. Because the first plane, the "Basic Multilingual Plane" or BMP, contains almost every character a user will ever need, many have made the wrong assumption that Unicode is a 16-bit character set.

Unicode has been mapped in many different ways, but the two most common are UTF (Unicode Transformation Format) and UCS (Universal Character Set). A number after UTF indicates the number of bits in one unit, while the number after UCS indicates the number of bytes. UTF-8 has become the most widespread means for the interchange of Unicode text as a result of its eight-bit clean nature; it is therefore the subject of this document.

What UTF-8 can do

UTF-8 allows users to work in a standards-compliant and internationally accepted multilingual environment, with a comparatively low data redundancy. It is the preferred way for transmitting non-ASCII characters over the Internet, through email, IRC, or almost any other medium. Despite this, many people regard UTF-8 in online communication as abusive. It is always best to be aware of the attitude towards UTF-8 in a specific channel, mailing list, or Usenet group before using non-ASCII UTF-8.

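The variable-length nature of UTF-8 is easy to observe from a shell. The sketch below counts the bytes used by four characters, one for each possible sequence length; the helper name byte_len is hypothetical, and the characters are written as octal escapes of their UTF-8 bytes so the script itself stays pure ASCII.

```shell
#!/bin/sh
# Hypothetical helper: print how many bytes the string in $1 occupies.
# wc -c counts bytes, not characters, whatever the current locale is.
byte_len() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

byte_len "$(printf 'A')"                  # U+0041, plain ASCII
byte_len "$(printf '\303\251')"           # U+00E9, e with acute accent
byte_len "$(printf '\342\234\224')"       # U+2714, heavy check mark
byte_len "$(printf '\360\237\230\200')"   # U+1F600, outside the BMP
```

The counts grow from one byte for ASCII to four bytes for a character outside the Basic Multilingual Plane.
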
Setting up UTF-8 in Gentoo

Finding or creating UTF-8 locales

Now that the principles behind Unicode have been laid out, get ready to start using UTF-8 locally!

Users interested in more background can find further explanation in the Gentoo Localization Guide.

Next, determine whether a UTF-8 locale is already available for the language of choice, or whether one needs to be generated.

    user $ locale -a | grep 'en_GB'
    en_GB
    en_GB.utf8

In the output of the above command, look for a result with a suffix similar to .UTF-8. If there is no result with a similar suffix, a UTF-8 compatible locale must be created.

The command lists the suffix in lowercase without any hyphens. glibc understands both forms of the suffix, but many other programs do not; the most common example is X. It is therefore best to always use UTF-8 in preference to utf8.

Note: Only execute the following code if the system does not have a UTF-8 locale available for the language of choice.

Replace "en_GB" with the desired locale setting:

    root # localedef -i en_GB -f UTF-8 en_GB.UTF-8

Another way to include a UTF-8 locale is to add it to the /etc/locale.gen file and generate the necessary locales using the locale-gen command. Locales will be written to the locale archive, /usr/lib/locale/locale-archive.

CODE Line in /etc/locale.gen

    en_GB.UTF-8 UTF-8

    root # locale-gen
     * Generating 1 locales (this might take a while) with 1 jobs
     *  (1/1) Generating en_GB.UTF-8 ... [ ok ]
     * Generation complete

Setting the locale

There is one environment variable that needs to be set in order to use the new UTF-8 locales: LC_CTYPE (optionally modify the LANG variable to change the system language as well). There are also many different ways to set it; some system administrators prefer to only have a UTF-8 environment for a specific user, in which case they set it in that user's ~/.profile (for /bin/sh, Bourne shell users), ~/.bash_profile or ~/.bashrc (for /bin/bash, Bourne again shell users). More details and best practices can be found in the Localization Guide.

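A per-user setup might look like the following sketch, which only exports LC_CTYPE when a matching UTF-8 locale is actually installed. The helper name has_utf8_locale and the en_GB language are assumptions for illustration; the helper reads a locale list on standard input and accepts both spellings of the suffix (utf8 and UTF-8).

```shell
#!/bin/sh
# Hypothetical helper: succeed if the locale list on stdin contains a UTF-8
# variant of the language given in $1 (for example en_GB).
has_utf8_locale() {
    grep -Eiq "^$1\.utf-?8\$"
}

# Lines one might append to ~/.profile or ~/.bash_profile:
if locale -a 2>/dev/null | has_utf8_locale en_GB; then
    export LC_CTYPE="en_GB.UTF-8"
fi
```

Guarding the export this way avoids pointing LC_CTYPE at a locale that has not been generated yet.
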
Still others prefer to set the locale globally. One specific circumstance where the author particularly recommends doing this is when /etc/init.d/xdm is in use, because this init script starts the display manager and desktop before any of the aforementioned shell startup files are sourced. In other words, the desktop is started before any of the variables are loaded into the environment.

Setting the locale globally should be done using the /etc/env.d/02locale file. This file should look something like the following:

FILE /etc/env.d/02locale Demonstration of en_GB.UTF-8

    ## (As always, change "en_GB.UTF-8" to the appropriate locale value; each language has a different value!)
    LANG="en_GB.UTF-8"

Note: It is possible to substitute the LC_CTYPE variable for the LANG variable. For more information on the categories affected by using LC_CTYPE, read the GNU locale page.

Next, the environment must be updated by running the following commands:

    root # env-update
    >>> Regenerating /etc/ld.so.cache...
    root # source /etc/profile

Now, run locale with no arguments to see if the correct variables have been loaded into the environment:

    root # locale
    LANG=en_GB.utf8
    LC_CTYPE="en_GB.utf8"
    LC_NUMERIC="en_GB.utf8"
    LC_TIME="en_GB.utf8"
    LC_COLLATE="en_GB.utf8"
    LC_MONETARY="en_GB.utf8"
    LC_MESSAGES="en_GB.utf8"
    LC_PAPER="en_GB.utf8"
    LC_NAME="en_GB.utf8"
    LC_ADDRESS="en_GB.utf8"
    LC_TELEPHONE="en_GB.utf8"
    LC_MEASUREMENT="en_GB.utf8"
    LC_IDENTIFICATION="en_GB.utf8"
    LC_ALL=

The values of locale environment variables that have been explicitly set, e.g. in an export statement (if using bash), are listed without double quotes. Those whose value has been inherited from other locale environment variables have their values in double quotes.

Alternatively, using eselect to set locales

Although it is good to maintain the system as described above, it is also possible to check and set the configured locale using the eselect utility.

Use eselect to list the available locales on the system:

    root # eselect locale list
      [1]   C
      [2]   POSIX *
      [3]   en_GB.utf8
      [ ]   (free form)

With eselect, setting the locale is as simple as listing the available ones. Once the correct locale has been determined, invoke:

    root # eselect locale set 3
    Setting LANG to en_GB.utf8 ...

Check the result:

    root # eselect locale list
      [1]   C
      [2]   POSIX
      [3]   en_GB.utf8 *
      [ ]   (free form)

If it is preferred to have /etc/env.d/02locale with .UTF-8 instead of .utf8, run the appropriate eselect command:

    root # eselect locale set en_GB.UTF-8
    Setting LANG to en_GB.UTF-8 ...
    root # eselect locale list
      [1]   C
      [2]   POSIX
      [3]   en_GB.utf8
      [4]   en_GB.UTF-8 *
      [ ]   (free form)

Running the following command will update the variables in the shell:

    root # env-update && source /etc/profile
    >>> Regenerating /etc/ld.so.cache...

That is everything. The system is now using UTF-8 locales. The next hurdle is the configuration of the applications used from day to day.

Application support

When Unicode first started gaining momentum in the software world, multibyte character sets were not well suited to languages like C, which is the base language of most commonly used programs. Even today, some programs are not able to handle UTF-8 properly. Fortunately the majority of programs, especially the common ones, are supported.

(V)FAT

For UTF-8 support in FAT file systems, see the FAT article.

Filenames

For changing the encoding of filenames, app-text/convmv can be used.

    root # emerge --ask app-text/convmv

The format of the convmv command is as follows:

    root # convmv -f <current-encoding> -t utf-8 <filename>

Substitute iso-8859-1 with the charset being converted from:

    root # convmv -f iso-8859-1 -t utf-8 filename

By default convmv only prints what it would do; add the --notest option to actually rename the files.

For changing the contents of files, use the iconv utility. It comes bundled with sys-libs/glibc and should be installed on all Gentoo systems. Substitute iso-8859-1 with the charset being converted from. After running the command, be sure to check for sane output:

    root # iconv -f iso-8859-1 -t utf-8 filename

iconv writes to standard output, so to convert a file, another file must be created:

    root # iconv -f iso-8859-1 -t utf-8 filename > newfile

The recode (app-text/recode) package can also be used for this purpose.

The system console

To enable UTF-8 on the console, edit /etc/rc.conf. Set unicode="yes" and read the comments; it is important to have a font that has a good range of characters to make the most of Unicode. For this to work, make sure the Unicode locale has been properly created.

The keymap variable, set in /etc/conf.d/keymaps, should have a Unicode keymap specified.

CODE Example /etc/conf.d/keymaps snippet

    ## (Change "uk" to the right local layout)
    keymap="uk"

Ncurses and Slang

Note: Ignore any mention of Slang in this section if it is not installed or unneeded.

It is wise to add unicode to the global USE flags in /etc/portage/make.conf, and then to re-emerge sys-libs/ncurses and sys-libs/slang. Portage will do this automatically if the --changed-use or --newuse options are used. Run the following command to pull in the packages:

    root # emerge --update --deep --newuse @world

Packages that link to these libraries also need to be rebuilt now that the USE changes have been applied. The tool used for this (revdep-rebuild) is part of the app-portage/gentoolkit package.

    root # revdep-rebuild --library libncurses.so.5
    root # revdep-rebuild --library libslang.so.1

KDE, GNOME, and Xfce

All of the major desktop environments have full Unicode support, and will require no further setup than what has already been covered in this guide. This is because the underlying graphical toolkits (Qt or GTK2) are UTF-8 aware. Consequently, all applications running on top of these toolkits should be UTF-8 aware out of the box.

On GTK based applications, the key sequence for hexadecimal Unicode input is Ctrl+Shift+u followed by the hexadecimal code point. As an example, the Unicode character ✔, which has the code point U+2714, can be written as Ctrl+Shift+u+2714+ENTER, being rendered as ✔.

The exceptions to this rule come in Xlib and GTK1. GTK1 requires an iso-10646-1 FontSpec in the ~/.gtkrc, for example -misc-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1. Also, applications using Xlib or Xaw will need to be given a similar FontSpec, otherwise they will not work.

Note: If an old GNOME 1 control center version is available, use that instead. Pick any iso10646-1 font from there.

CODE Example ~/.gtkrc (for GTK1) that defines a Unicode compatible font

    style "user-font" {
        fontset="-misc-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1"
    }
    widget_class "*" style "user-font"

If an application has support for both a Qt and GTK2 GUI, the GTK2 GUI will generally give better results with Unicode.

X11 and fonts

TrueType fonts have support for Unicode, and most of the fonts that ship with Xorg have extensive character support, although, obviously, not every single glyph available in Unicode has been created for every font.

Also, many font packages in Portage are Unicode aware. See the Fontconfig page for more information on recommended fonts and configuration.

Window managers and terminal emulators

Window managers not built on GTK or Qt generally have very good Unicode support, as they often use the Xft library for handling fonts. If the window manager does not use Xft for fonts, then it is still possible to use the FontSpec mentioned in the previous section as a Unicode font.

Terminal emulators that use Xft and support Unicode are harder to come by. Aside from Konsole and GNOME Terminal, the best options in Portage are x11-terms/rxvt-unicode, x11-terms/xfce4-terminal, gnustep-apps/terminal, x11-terms/mlterm, or plain x11-terms/xterm when built with the unicode USE flag and invoked as uxterm. app-misc/screen supports UTF-8 too, when invoked as screen -U or when the following is put into ~/.screenrc:

CODE ~/.screenrc for UTF-8

    defutf8 on

Vim, emacs, xemacs, and nano

Vim provides full UTF-8 support, and also has built-in detection of UTF-8 files. For further information in Vim, use :help mbyte.txt.

GNU Emacs since version 23 and XEmacs version 21.5 have full UTF-8 support. GNU Emacs 24 also supports editing bidirectional text.

Nano has provided full UTF-8 support since version 1.3.6.

Shells

Currently, bash provides full Unicode support through the GNU readline library. Z Shell (zsh) offers Unicode support with the unicode USE flag.

The C shell, tcsh and ksh do not provide UTF-8 support at all.

Irssi

Irssi has complete UTF-8 support, although it does require a user to set an option.

    [irssi] /set term_charset UTF-8

For channels where non-ASCII characters are often exchanged in non-UTF-8 charsets, the /recode command may be used to convert the characters. Type /help recode for more information.

Mutt

The Mutt mail user agent has very good Unicode support. To use UTF-8 with Mutt, nothing needs to be put in the configuration files. Mutt will work in a Unicode environment without modification if all the configuration files (signature included) are UTF-8 encoded.

Note: It is still possible to see '?' in mails read with Mutt. This is a result of people using a mail client which does not indicate the charset used. There is little one can do about this other than to ask them to configure their client correctly.

Further information is available from the Mutt Wiki.

links and elinks

These are commonly used text-based browsers, and this section shows how to enable UTF-8 support in them. For both elinks and links there are two ways to go about this: using the Setup option from within the browser, or editing the config file. To set the option through the browser, open a site with elinks or links and press Alt+S to enter the Setup menu, then select Terminal options, or press T. Scroll down and select the last option, UTF-8 I/O, by pressing Enter. Then save and exit the menu. On links one may have to press Alt+S again and then press S to save. The config file option is shown below.

CODE Enabling UTF-8 for elinks/links

    ## (For elinks, edit /etc/elinks/elinks.conf or ~/.elinks/elinks.conf and add the following line)
    set terminal.linux.utf_8_io = 1

    ## (For links, edit ~/.links/links.cfg and add the following line)
    terminal "xterm" 0 1 0 us-ascii utf-8

Samba

Samba is a software suite which implements the SMB (Server Message Block) protocol for Unix-like systems such as macOS, Linux and FreeBSD. The protocol is also sometimes referred to as the Common Internet File System (CIFS). Samba also includes the NetBIOS system, used for file sharing over Windows networks.

Add the following lines under the [global] section:

    root # nano -w /etc/samba/smb.conf

    dos charset = 1255
    unix charset = UTF-8
    display charset = UTF-8

Testing it all out

There are numerous UTF-8 test websites around, and most of the popular browsers in Gentoo have full UTF-8 support.

When using one of the text-only web browsers, make absolutely sure a Unicode-aware terminal is used.

If certain characters are displayed as boxes with letters or numbers inside, then the current font does not have glyphs for those characters. Instead, it displays a box with the hex code of the character.

    unicode-table.com
    A W3C UTF-8 Test Page
    A UTF-8 test page provided by the University of Frankfurt

Reported issues and problems

System configuration files (in /etc)

Most system configuration files (such as /etc/fstab) do not support UTF-8. It is recommended to stick with the ASCII character set for these files.

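One way to verify that a configuration file has stayed within ASCII is to count the bytes that fall outside the ASCII range. The following is a minimal sketch; the helper name count_non_ascii is hypothetical, and /etc/fstab is used only because the text above mentions it.

```shell
#!/bin/sh
# Hypothetical helper: print the number of non-ASCII bytes in the file $1.
# tr deletes every byte in the ASCII range (octal 000-177); whatever is
# left over is counted by wc.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Example: inspect /etc/fstab if it is readable on this system.
if [ -r /etc/fstab ]; then
    count_non_ascii /etc/fstab
fi
```

A result of 0 means the file is pure ASCII; any other number is the count of bytes that belong to multibyte or legacy 8-bit characters.
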
External resources

    The Wikipedia entry for Unicode
    The Wikipedia entry for UTF-8
    Unicode.org
    UTF-8.com
    RFC 3629
    RFC 2277
    Characters vs. Bytes
    The GNU C Library: Locales and Internationalization
    Unifoundry.com - Unicode Tutorial
    unicode USE flag description

This page is based on a document formerly found on our main website gentoo.org. The following people contributed to the original document: Thomas Martin, Alexander Simonov, Shyam Mani. They are listed here because wiki history does not allow for any external attribution. If you edit the wiki article, please do not add yourself here; your contributions are recorded on each article's associated history page.

Retrieved from "https://wiki.gentoo.org/index.php?title=UTF-8&oldid=1045427"


