Handle UTF8 file with BOM - Real's Java How-to
From Wikipedia, the byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

The common BOMs are:

    Encoding       Representation (hexadecimal)   Representation (decimal)
    UTF-8          EF BB BF                       239 187 191
    UTF-16 (BE)    FE FF                          254 255
    UTF-16 (LE)    FF FE                          255 254
    UTF-32 (BE)    00 00 FE FF                    0 0 254 255
    UTF-32 (LE)    FF FE 00 00                    255 254 0 0

UTF-8 files are a special case because adding a BOM to them is not recommended. The presence of a UTF-8 BOM can break other tools like Java.
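The table of common BOMs can be turned into a small detection routine. The following is only a sketch: the class name `BomSniffer` and the method `detectBom` are hypothetical names chosen for illustration, and the routine simply compares the leading bytes of a buffer against the byte sequences listed in the table.

```java
class BomSniffer {

    // Hypothetical helper: returns the encoding name from the table whose
    // BOM matches the start of 'head', or "none" if no BOM is recognized.
    static String detectBom(byte[] head) {
        // The 4-byte UTF-32 marks are tested first because FF FE 00 00
        // begins with the UTF-16 (LE) mark FF FE. The two cannot always be
        // told apart (UTF-16 LE text starting with U+0000 looks the same).
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32 (BE)";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32 (LE)";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16 (BE)";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16 (LE)";
        return "none";
    }

    // True if 'data' begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detectBom(sample));
    }
}
```

Java bytes are signed, so each byte is masked with `0xFF` before it is compared with the values from the table.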
Java assumes that UTF-8 data does not have a BOM, so if a BOM is present it is not discarded and it is seen as data.

To create a UTF-8 file with a BOM, open the Windows Notepad, create a simple text file and save it as utf8.txt with the encoding UTF-8.

Now if you examine the file content as binary, you see the BOM at the beginning.

If we read it with Java:

    import java.io.*;

    public class x {
      public static void main(String[] args) {
        try {
          FileInputStream fis = new FileInputStream("c:/temp/utf8.txt");
          // The UTF-8 decoder does not strip a leading BOM
          BufferedReader r = new BufferedReader(new InputStreamReader(fis,
              "UTF8"));
          for (String s = ""; (s = r.readLine()) != null;) {
            System.out.println(s);
          }
          r.close();
          System.exit(0);
        }
        catch (Exception e) {
          e.printStackTrace();
          System.exit(1);
        }
      }
    }

The output contains a strange character at the beginning because the BOM is not discarded:

    ?helloworld

This behaviour is documented in the Java bug database. There will be no fix for now because it would break existing tools like javadoc or XML parsers.

Apache Commons IO provides some tools to handle this situation. The BOMInputStream class detects the BOM and, if required, can automatically skip it and return the subsequent byte as the first byte in the stream.

Or you can do it manually. The next example converts a UTF-8 file to ANSI. We check the first line for the presence of the BOM and, if it is present, we simply discard it.

    import java.io.*;

    public class UTF8ToAnsiUtils {

      // FEFF because this is the Unicode char represented by the UTF-8
      // byte order mark (EF BB BF).
      public static final String UTF8_BOM = "\uFEFF";

      public static void main(String[] args) {
        try {
          if (args.length != 2) {
            System.out
                .println("Usage: java UTF8ToAnsiUtils utf8file ansifile");
            System.exit(1);
          }

          boolean firstLine = true;
          FileInputStream fis = new FileInputStream(args[0]);
          BufferedReader r = new BufferedReader(new InputStreamReader(fis,
              "UTF8"));
          FileOutputStream fos = new FileOutputStream(args[1]);
          Writer w = new BufferedWriter(new OutputStreamWriter(fos, "Cp1252"));
          for (String s = ""; (s = r.readLine()) != null;) {
            // Only the first line can start with the BOM
            if (firstLine) {
              s = UTF8ToAnsiUtils.removeUTF8BOM(s);
              firstLine = false;
            }
            w.write(s + System.getProperty("line.separator"));
            w.flush();
          }

          w.close();
          r.close();
          System.exit(0);
        }
        catch (Exception e) {
          e.printStackTrace();
          System.exit(1);
        }
      }

      // Drops the leading U+FEFF char if the string starts with it
      private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
          s = s.substring(1);
        }
        return s;
      }
    }
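As a variation on the manual approach, the BOM can also be skipped once when the reader is opened, so the line-reading loop does not need a first-line flag. This is only a sketch using JDK classes: the class `BomSkippingReader` and its methods `open` and `firstLine` are hypothetical names, and it assumes the input is UTF-8. It relies on `mark` and `reset` to peek at the first decoded character.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BomSkippingReader {

    // Hypothetical helper: opens a UTF-8 reader positioned after the BOM,
    // if the stream starts with one.
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);                  // remember the start of the stream
        if (reader.read() != '\uFEFF') { // first char is not the BOM
            reader.reset();              // so put it back
        }
        return reader;
    }

    // Hypothetical convenience method for the demo: first line of 'data'.
    static String firstLine(byte[] data) {
        try (BufferedReader r = open(new ByteArrayInputStream(data))) {
            return r.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                          'h', 'e', 'l', 'l', 'o'};
        System.out.println(firstLine(withBom)); // prints "hello"
    }
}
```

In a program like the converter above, the reader returned by `open(new FileInputStream(args[0]))` could replace the directly constructed `BufferedReader`, which would make the first-line check unnecessary.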