Why Python 3 doesn't write the Unicode BOM - Peter Bloomfield


I've been using Python scripts to automatically edit and output Windows Resource files (.rc) for C++ projects in Visual Studio 2013. When handling Unicode, Windows and Visual Studio always want little endian UTF-16 encoding, and the resource file should always start with the Unicode BOM (Byte Order Mark). However, despite the promises in the documentation, I found that Python wasn't outputting the BOM automatically.

I nearly resorted to outputting it manually, but as is often the case with Python, the correct approach is simpler than it seems.

What is the Unicode BOM?

The Unicode BOM (Byte Order Mark) is a character (U+FEFF) which can occur at the start of a Unicode text file to indicate what endianness the data is stored in. It's very helpful for portability, as it means programs on different systems can automatically detect the encoding. This allows them to display, edit, and store the text appropriately.

Endianness is only relevant for encodings which use more than one byte per code unit, such as UTF-16 and UTF-32. There is also a BOM for UTF-8. However, it's only used to identify the file as being UTF-8, as opposed to ASCII or some other encoding. Byte order is irrelevant in that case, and the Unicode Standard neither requires nor recommends a BOM for UTF-8.

Python and BOM

According to the Python documentation on reading and writing Unicode data:

"Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read."

From this, it sounds like any UTF-16 or UTF-32 encoding will automatically take care of the BOM. However, try running the following code in a Python 3 script:

    with open("output.txt", mode="w", encoding="utf-16-le") as f:
        f.write("Hello World.")

Open the output file in an editor which reports the encoding, such as the excellent Notepad++ on Windows. You'll see UTF-16 (or UCS-2) Little Endian, but it will say there is no BOM.

Where did the BOM go?
To answer that, look at the encoding argument in the code snippet above. It's set to utf-16-le, which explicitly indicates little endian encoding. It turns out that if you explicitly specify endianness, Python assumes you don't need a BOM. This is actually mentioned in the official documentation, but not particularly clearly.

Instead, change the encoding to utf-16. This lets Python use the platform's native byte order, and it assumes that a BOM is therefore necessary. Here's the modified code snippet:

    with open("output.txt", mode="w", encoding="utf-16") as f:
        f.write("Hello World.")

Once again, run that as a Python 3 script and then open the output file. Assuming you're running on a little endian system (which should apply to anything running a Windows OS), the encoding should show up as UTF-16 (or UCS-2) Little Endian with BOM.
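The difference between the two encoding names can also be checked without an editor, by reading back the first two bytes that Python wrote. The following is a minimal sketch rather than code from the original scripts: leading_bytes and write_utf16_le_with_bom are hypothetical helper names, and the temporary files are only there to keep the example self-contained. The second helper shows the manual route (naming utf-16-le and writing U+FEFF yourself), which may be worth considering if the output must be little endian even when the script runs on a big endian machine.

```python
import codecs
import os
import tempfile


def leading_bytes(encoding, text="Hello World.", count=2):
    """Write text to a temporary file using the given encoding and
    return the first `count` bytes of that file."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, mode="w", encoding=encoding) as f:
            f.write(text)
        with open(path, mode="rb") as f:
            return f.read(count)
    finally:
        os.remove(path)


def write_utf16_le_with_bom(path, text):
    """Hypothetical helper: always write little endian UTF-16 with a BOM,
    regardless of the native byte order of the machine."""
    with open(path, mode="w", encoding="utf-16-le") as f:
        f.write("\ufeff")  # U+FEFF is encoded as the bytes FF FE in utf-16-le
        f.write(text)


if __name__ == "__main__":
    # Explicit endianness: no BOM, the file starts with the "H" of the text.
    print(leading_bytes("utf-16-le"))
    # Generic name: the file starts with a BOM in native byte order.
    print(leading_bytes("utf-16") == codecs.BOM_UTF16)
```

On a little endian machine the second call should return the same two bytes as codecs.BOM_UTF16_LE, which matches what Notepad++ reports as "Little Endian with BOM".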


