Handle UTF8 file with BOM - Real's Java How-to


From Wikipedia, the byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

The common BOMs are:

    Encoding      Representation (hexadecimal)   Representation (decimal)
    UTF-8         EF BB BF                       239 187 191
    UTF-16 (BE)   FE FF                          254 255
    UTF-16 (LE)   FF FE                          255 254
    UTF-32 (BE)   00 00 FE FF                    0 0 254 255
    UTF-32 (LE)   FF FE 00 00                    255 254 0 0

UTF-8 files are a special case because it is not recommended to add a BOM to them. The presence of a UTF-8 BOM can break other tools like Java.
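
As an aside, the byte signatures listed in the BOM table can be checked directly against the first bytes of a file. The following is only a minimal sketch: the class name BomSniffer and the method detect are hypothetical names chosen for illustration, not part of the JDK or of any library. The 4-byte UTF-32 signatures are tested before the 2-byte UTF-16 ones because FF FE 00 00 also starts with FF FE.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper: guesses the encoding from a leading BOM.
class BomSniffer {

  // Returns the encoding suggested by the BOM, or null when no BOM matches.
  // Caveat: UTF-16 (LE) text whose first character is U+0000 would be
  // reported as UTF-32 (LE), since both start with FF FE 00 00.
  static String detect(byte[] head) {
    if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
    if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
    if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
    if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
    if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
    return null;
  }

  // Compares the unsigned value of each leading byte with the expected prefix.
  private static boolean startsWith(byte[] data, int... prefix) {
    if (data.length < prefix.length) return false;
    for (int i = 0; i < prefix.length; i++) {
      if ((data[i] & 0xFF) != prefix[i]) return false;
    }
    return true;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.out.println("Usage : java BomSniffer file");
      return;
    }
    // Read at most the first 4 bytes of the file.
    byte[] buf = new byte[4];
    int n = 0;
    try (InputStream in = new FileInputStream(args[0])) {
      int r;
      while (n < buf.length && (r = in.read(buf, n, buf.length - n)) != -1) {
        n += r;
      }
    }
    String encoding = detect(Arrays.copyOf(buf, n));
    System.out.println(encoding == null ? "no BOM found" : encoding);
  }
}
```

A missing BOM says nothing about the encoding, so a null result here only means that no signature from the table was found.
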
In fact, Java assumes that UTF-8 files don't have a BOM, so if the BOM is present it won't be discarded and it will be seen as data.

To create a UTF-8 file with a BOM, open the Windows Notepad, create a simple text file and save it as utf8.txt with the encoding UTF-8.

Now if you examine the file content as binary, you see the BOM at the beginning.

If we read it with Java:

    import java.io.*;

    public class x {
      public static void main(String args[]) {
        try {
          FileInputStream fis = new FileInputStream("c:/temp/utf8.txt");
          // Decode the bytes as UTF-8; a leading BOM is NOT removed by the decoder.
          BufferedReader r = new BufferedReader(new InputStreamReader(fis, "UTF8"));
          for (String s = ""; (s = r.readLine()) != null;) {
            System.out.println(s);
          }
          r.close();
          System.exit(0);
        }
        catch (Exception e) {
          e.printStackTrace();
          System.exit(1);
        }
      }
    }

The output contains a strange character at the beginning because the BOM is not discarded:

    ?helloworld

This behaviour is documented in two entries of the Java bug database. There will be no fix for now because it would break existing tools like javadoc or XML parsers.

Apache Commons IO provides some tools to handle this situation. The BOMInputStream class detects the BOM and, if required, can automatically skip it and return the subsequent byte as the first byte in the stream.

Or you can do it manually. The next example converts a UTF-8 file to ANSI. We check the first line for the presence of the BOM and, if present, we simply discard it.

    import java.io.*;

    public class UTF8ToAnsiUtils {

      // FEFF because this is the Unicode char represented by the UTF-8 byte order mark (EF BB BF).
      public static final String UTF8_BOM = "\uFEFF";

      public static void main(String args[]) {
        try {
          if (args.length != 2) {
            System.out.println("Usage : java UTF8ToAnsiUtils utf8file ansifile");
            System.exit(1);
          }

          boolean firstLine = true;
          FileInputStream fis = new FileInputStream(args[0]);
          BufferedReader r = new BufferedReader(new InputStreamReader(fis, "UTF8"));
          FileOutputStream fos = new FileOutputStream(args[1]);
          // Cp1252 is the Windows "ANSI" code page used for the output file.
          Writer w = new BufferedWriter(new OutputStreamWriter(fos, "Cp1252"));
          for (String s = ""; (s = r.readLine()) != null;) {
            // The BOM can only appear at the very start of the file.
            if (firstLine) {
              s = UTF8ToAnsiUtils.removeUTF8BOM(s);
              firstLine = false;
            }
            w.write(s + System.getProperty("line.separator"));
            w.flush();
          }

          w.close();
          r.close();
          System.exit(0);
        }
        catch (Exception e) {
          e.printStackTrace();
          System.exit(1);
        }
      }

      // Drops the leading U+FEFF character if the string starts with it.
      private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
          s = s.substring(1);
        }
        return s;
      }
    }
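
The manual approach above removes the BOM after decoding, from the first line of text. The same idea can also be applied one level lower, on the raw bytes, before the stream is handed to a reader. The following is a minimal sketch using only java.io; the class name Utf8BomSkipper and the method skipUtf8Bom are hypothetical names chosen for illustration, and only the UTF-8 signature EF BB BF is handled.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: removes a leading UTF-8 BOM at the byte level.
class Utf8BomSkipper {

  // Returns a stream positioned after the UTF-8 BOM if one is present,
  // or at the very first byte otherwise.
  static InputStream skipUtf8Bom(InputStream in) throws IOException {
    PushbackInputStream pin = new PushbackInputStream(in, 3);
    byte[] head = new byte[3];
    int n = 0;
    int r;
    // Read up to 3 bytes; a single read() call may return fewer.
    while (n < head.length && (r = pin.read(head, n, head.length - n)) != -1) {
      n += r;
    }
    boolean hasBom = n == 3
        && (head[0] & 0xFF) == 0xEF
        && (head[1] & 0xFF) == 0xBB
        && (head[2] & 0xFF) == 0xBF;
    if (!hasBom && n > 0) {
      // Not a BOM: push the bytes back so the caller sees them as data.
      pin.unread(head, 0, n);
    }
    return pin;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.out.println("Usage : java Utf8BomSkipper utf8file");
      return;
    }
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        skipUtf8Bom(new FileInputStream(args[0])), StandardCharsets.UTF_8))) {
      for (String s; (s = r.readLine()) != null;) {
        System.out.println(s);
      }
    }
  }
}
```

Working on bytes keeps the BOM handling separate from the line-reading loop, so the loop no longer needs a firstLine flag.
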


