Handle UTF8 file with BOM - Real's Java How-to
From Wikipedia, the byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional and, if used, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

The common BOMs are:

| Encoding | Representation (hexadecimal) | Representation (decimal) |
|---|---|---|
| UTF-8 | EF BB BF | 239 187 191 |
| UTF-16 (BE) | FE FF | 254 255 |
| UTF-16 (LE) | FF FE | 255 254 |
| UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 |
| UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 |

UTF-8 files are a special case because adding a BOM to them is not recommended. The presence of a UTF-8 BOM can break other tools, such as Java.
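The byte signatures listed above can be checked directly against the first bytes of a file. The following is a minimal sketch, not part of the original how-to: the class name `BomSniffer` and the method `detect` are hypothetical, and the check is only a heuristic, since a UTF-16 (LE) text whose first character is U+0000 would be indistinguishable from UTF-32 (LE).

```java
public class BomSniffer {

    // BOM signatures taken from the table of common BOMs.
    // UTF-32 (LE) is tested before UTF-16 (LE) because FF FE 00 00 starts with FF FE.
    private static final String[] NAMES = {
        "UTF-8", "UTF-32 (BE)", "UTF-32 (LE)", "UTF-16 (BE)", "UTF-16 (LE)"
    };
    private static final int[][] SIGNATURES = {
        {0xEF, 0xBB, 0xBF},
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xFE, 0xFF},
        {0xFF, 0xFE}
    };

    /** Returns the encoding named by a leading BOM, or null if no BOM is found. */
    public static String detect(byte[] head) {
        for (int i = 0; i < SIGNATURES.length; i++) {
            if (startsWith(head, SIGNATURES[i])) {
                return NAMES[i];
            }
        }
        return null;
    }

    // Compares unsigned byte values, since Java bytes are signed.
    private static boolean startsWith(byte[] data, int[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0};
        byte[] plain = {'h', 'i'};
        System.out.println(detect(utf8));
        System.out.println(detect(utf16le));
        System.out.println(detect(plain));
    }
}
```

In practice the first four bytes of the file are enough to pass to `detect`; a `null` result only means that no BOM was found, not that the file is not Unicode.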
Java assumes that UTF-8 data has no BOM: if a BOM is present, it is not discarded and is seen as data.

To create a UTF-8 file with a BOM, open Windows Notepad, create a simple text file and save it as utf8.txt with the encoding UTF-8.

Now if you examine the file content as binary, you see the BOM at the beginning.

If we read it with Java:

```java
import java.io.*;

public class x {
  public static void main(String args[]) {
    try {
      FileInputStream fis = new FileInputStream("c:/temp/utf8.txt");
      // The UTF8 decoder does not strip a leading BOM
      BufferedReader r = new BufferedReader(new InputStreamReader(fis, "UTF8"));
      for (String s = ""; (s = r.readLine()) != null;) {
        System.out.println(s);
      }
      r.close();
      System.exit(0);
    }
    catch (Exception e) {
      e.printStackTrace();
      System.exit(1);
    }
  }
}
```

The output contains a strange character at the beginning because the BOM is not discarded:

```
?helloworld
```

This behaviour is documented in the Java bug database. There will be no fix for now because it would break existing tools like javadoc or XML parsers.

Apache Commons IO provides some tools to handle this situation. The BOMInputStream class detects the BOM and, if required, can automatically skip it and return the subsequent byte as the first byte in the stream.

Or you can do it manually. The next example converts a UTF-8 file to ANSI. We check the first line for the presence of the BOM and, if present, we simply discard it.

```java
import java.io.*;

public class UTF8ToAnsiUtils {

  // FEFF because this is the Unicode char represented by the UTF-8 byte order mark (EF BB BF).
  public static final String UTF8_BOM = "\uFEFF";

  public static void main(String args[]) {
    try {
      if (args.length != 2) {
        System.out.println("Usage : java UTF8ToAnsiUtils utf8file ansifile");
        System.exit(1);
      }

      boolean firstLine = true;
      FileInputStream fis = new FileInputStream(args[0]);
      BufferedReader r = new BufferedReader(new InputStreamReader(fis, "UTF8"));
      FileOutputStream fos = new FileOutputStream(args[1]);
      Writer w = new BufferedWriter(new OutputStreamWriter(fos, "Cp1252"));
      for (String s = ""; (s = r.readLine()) != null;) {
        // Only the first line can start with the BOM
        if (firstLine) {
          s = UTF8ToAnsiUtils.removeUTF8BOM(s);
          firstLine = false;
        }
        w.write(s + System.getProperty("line.separator"));
        w.flush();
      }
      w.close();
      r.close();
      System.exit(0);
    }
    catch (Exception e) {
      e.printStackTrace();
      System.exit(1);
    }
  }

  // Drops the leading U+FEFF character if the string starts with it
  private static String removeUTF8BOM(String s) {
    if (s.startsWith(UTF8_BOM)) {
      s = s.substring(1);
    }
    return s;
  }
}
```
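The manual approach can also be applied at the byte level, before any decoding takes place, so that the reader never sees the BOM. The following is a minimal sketch using only the standard library; the class name `BomSkipper` and the method `skipUtf8Bom` are hypothetical, and this is not the Apache Commons IO implementation. It relies on `java.io.PushbackInputStream` to put back the first bytes when they are not EF BB BF.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class BomSkipper {

    /**
     * Wraps the stream and consumes a leading EF BB BF if present.
     * Any other leading bytes are pushed back untouched.
     */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        int n = 0;
        while (n < head.length) {
            int read = pin.read(head, n, head.length - n);
            if (read < 0) {
                break;
            }
            n += read;
        }

        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: give the bytes back so the caller sees the full content.
        if (!hasBom && n > 0) {
            pin.unread(head, 0, n);
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file saved with a UTF-8 BOM.
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'l', 'l', 'o'};
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // prints "hello"
        }
    }
}
```

To use it on a real file, pass a `FileInputStream` to `skipUtf8Bom` instead of the `ByteArrayInputStream`. Compared with checking the first line, this keeps the BOM handling independent of how the text is later split into lines.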