A quick tale about FEFF, an invisible UTF-8 character that ...
文章推薦指數: 80 %
Our friend FEFF means different things, but it's basically a signal for a program on how to read the text. It can be UTF-8 (more common), UTF-16 ... Search Submityoursearchquery Forum Donate EumirGaspar EumirGaspar Today,weencounteredanerrorwhiletryingtocreatesomedatabaseseedsfromaCSV.ThisCSVwasoriginallygeneratedbymeusingaRubyscriptwhichpipedtheoutputtoafileandsavedasaCSV.TheCSVwascheckedintoGitandhadbeenusedforawhileuntilwehadtoupdatesomepartsofitbyaddinganewcolumnandfixingsomevalues.Whilewedon’tknowtheexactreasonyet,mytheoryisthatsomehow,ExcelforMac(weareallusingMacs)addedsomeadditionalmetadatatoitevenaftersavingthefileasaCSV.Thisinturnmadeanyoneusingtheseedreceivethefollowingerror:CSV::MalformedCSVError:Illegalquotinginline1.IopenedtheCSVfileandnothinglookedsuspicious.Myfirstthoughtwassomeleft/rightquotationmarksweresomehowmixedintothefileinsteadofjustthe‘normal’doublequotes:".Butuponfurtherinvestigation,therewasnothingoutoftheordinary.Thisledmetojustwipeoutthewholefile,andactuallytypeoutthefirstrowagain.Isavedthatfileagainandranthemigration:CSV::MalformedCSVError:Illegalquotinginline1.What?!Okay,thiswasdrivingmenuts.Iopenedupanewfile,typedtheexactsinglelineagain,andranthemigration.Itworked.Sowhatwasinthatfile?!Onlyonewaytofindout:catcompanies.csv|pbcopy|pbpaste>temp.csv rmcompanies.csv mvtemp.csvcompanies.csv gitdiffSoOSXhasthesetwofunctionsthatareveryuseful:pbcopyandpbpaste.Basicallyanythingpipedtopbcopygetsintoyourclipboardandpbpasteputswhatyouhaveonyourclipboardtostandardoutput(stdout).Butitremovesallformatting.VeryusefulwhenyouwanttojustcopysometextfromsomewhereandyouwanttopasteitintoaWYSIWYGeditorwithoutalltheformatting.LikewhenwritinganemailfromGmail,forexample.Ithenremovedtheoriginalfileandsavedthenew‘unformatted’filewiththesamefilenamesoIcouldseethedifference.Andwefinallysawtheinvisibleman:TheinvisiblecharactershowinginAtlassian’sBitbucket.Theinvisiblecharacter’sactualname!AquickGooglesearchtoldusthatourfriendU+FEFFwascalledaZEROWIDTHNO-BREAKSPACE.Also,aquicktriptoWikipediatoldusabouttheactualusesforU+FEFF,morecommonlyknownasByteordermarkorBOM.OurfriendFEFFmeansdifferentthings,butit’sbasicallyasignalforaprogramonhowtoreadthetext.ItcanbeUTF-8(morecommon),UTF-16,orevenUTF-32.FEFFitselfisforUTF-16—inUTF-8itismorecommonlyknownas0xEF,0xBB,or0xBF.Frommyunderstanding,whentheCSVfilewasopenedinExcelandsaved,Excelcreatedaspaceforourinvisiblestowaway,U+FEFF.Andinfrontofthefiletoboot!Exceldidsomemagic,anditwasprobablysavedinUTF-16insteadofUTF-8.UTF-8doesnotunderstandBOMandjusttreatsitasanon-charactersovisually,thefilewasokay.ButRuby’sCSVthoughtthattherewassomethingwrongbecauseitassumedthefileitwasreadingwasUTF-8anditcouldn’tignoreMr.U+FEFF.Solessonlearned:don’topen(andsave!)aCSVfileinExcelifyouwanttofeedittoRuby’sCSVparser.Ifyoudoeverencounteranerrorlikethat,besuretolookforhiddencharactersnotshownbyyoureditor.Ifyoustillcan’tseeitandareusingOSX,thenpbcopyandpbpastewillhelpyouout—theystripoutanyformattingorhiddencharactersfromtextinadditiontocopyingandpastingit. ADVERTISEMENT ADVERTISEMENT ADVERTISEMENT EumirGaspar EumirGaspar Cryptoenthusiast.Rubydeveloperbyday,CTO/Elixirdeveloperatnight.SASSloverallday,everyday. Ifyoureadthisfar,tweettotheauthortoshowthemyoucare.Tweetathanks Learntocodeforfree.freeCodeCamp'sopensourcecurriculumhashelpedmorethan40,000peoplegetjobsasdevelopers.Getstarted ADVERTISEMENT
延伸文章資訊
- 1The FEFF Project
The FEFF Project at the University of Washington specializes in theoretical methods for spectrosc...
- 2Zero Width No-Break Space: U+FEFF
Symbol: , Name of the character: zero width no-break space, Unicode number for the sign: U+FEFF,...
- 3Unicode Character (U+FEFF) - Compart
U+FEFF is the unicode hex value of the character Zero Width No-Break Space (BOM, ZWNBSP). Char U+...
- 4The FEFF9 code - The FEFF Project - University of Washington
FEFF is an automated program for ab initio multiple scattering calculations of X-ray Absorption F...
- 5位元組順序記號 - 维基百科
位元組順序記號(英語:byte-order mark,BOM)是位於碼點 U+FEFF 的統一碼字符的名称。當以UTF-16或UTF-32來將UCS/統一碼字符所組成的字串編碼時,這個字符被用來...