How to remove \xa0 from string in Python? - Stack Overflow

Asked 10 years, 4 months ago. Modified 3 months ago. Viewed 402k times.

Question (score 331)

I am currently using BeautifulSoup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?

I tried using: line = line.replace(u'\xa0', ' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):

EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?

Tags: python, python-2.7, unicode, beautifulsoup, utf-8

asked Jun 12, 2012 at 9:12 by zhuyxn; edited Jul 14, 2020 at 15:32 by ivanleoncz

Comments (4):

tried that already, 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128) – zhuyxn Jun 12, 2012 at 9:19

(18) embrace Unicode. Use u''s instead of ''s. :-) – jpaugh Jun 12, 2012 at 9:26

(2) tried using str.replace(u'\xa0', ' ') but got "u"s everywhere instead of \xa0s :/ – zhuyxn Jun 12, 2012 at 9:30

If the string is the unicode one, you have to use the u' ' replacement, not the ' '. Is the original string the unicode one?
– pepr Jun 12, 2012 at 10:51

15 Answers

Answer (score 385)

\xa0 is actually the non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.

    string = string.replace(u'\xa0', u' ')

When you call .encode('utf-8'), the Unicode string is encoded to UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.

answered Jul 19, 2012 at 17:42 by samwize; edited Jun 11, 2019 at 1:45 by TFD

Comments (6, one not shown):

(15) I don't know a huge amount about Unicode and character encodings.. but it seems like unicodedata.normalize would be more appropriate than str.replace – dbr Sep 9, 2013 at 7:45

Yours is workable advice for strings, but note that all references to this string will also need to be replaced. For example, if you have a program that opens files, and one of the files has a non-breaking space in its name, you will need to rename that file in addition to doing this replacement. – user67416 Sep 23, 2014 at 10:52

(3) U+00a0 is a non-breakable space Unicode character that can be encoded as the byte b'\xa0' in latin1 encoding, or as the two bytes b'\xc2\xa0' in utf-8 encoding. It can be represented as &nbsp; in html. – jfs Jan 20, 2015 at 12:39

(4) When I try this, I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 397: ordinal not in range(128). – jds May 28, 2015 at 22:15

I tried this code on a list of strings, it didn't do anything, and the \xa0 character remained. If I reencoded my text file to UTF-8, the character would appear as an uppercase A with a caret on its head, and when I encoded it in Unicode the Python interpreter crashed. – MushroomMan Jul 20, 2016 at 22:02

Answer (score 302)

There are many useful things in Python's unicodedata library. One of them is the .normalize() function.
Try:

    new_str = unicodedata.normalize("NFKD", unicode_str)

Replace NFKD with any of the other normalization forms listed in the linked documentation (NFC, NFD, NFKC) if you don't get the results you're after.

answered Jan 8, 2016 at 4:24 by Jamie

Comments (8, three not shown):

(3) Not so sure, you may want normalize('NFKD', '1º\xa0dia') to return '1º dia' but it returns '1o dia' – Faccion Nov 8, 2017 at 14:58

(5) here is the docs about unicodedata.normalize – TT-- Dec 4, 2017 at 15:04

(3) ah, if the text is Korean, do not try this. All the characters come out broken. – Cho Oct 17, 2019 at 9:05

(2) This solution changes Russian letter й to an identically looking sequence of two unicode characters. The problem here is that strings that used to be equal do not match anymore. Fix: use "NFKC" instead of "NFKD". – Markus Apr 21, 2020 at 19:23

(2) This is awesome. It changes the one-letter string ﷼ to the four-letter string ریال that it actually is. So it's much easier to replace when needed. You'd normalize and then replace, without having to care which one it was. normalize("NFKD", "﷼").replace("ریال", ''). – AmirShabani Apr 29, 2021 at 7:55

Answer (score 33)

After trying several methods, to summarize it, this is how I did it. Following are two ways of avoiding/removing \xa0 characters from a parsed HTML string.

Assume we have our raw html as following:

    # Tags reconstructed from the rendered text; the original markup did not
    # survive in this copy. Only the text and the &nbsp; positions are certain.
    raw_html = '<p>Dear Parent,&nbsp;</p><p><span>This is a test message,&nbsp;</span><span>kindly ignore it.&nbsp;</span></p><p><span>Thanks</span></p>'

So let's try to clean this HTML string:

    from bs4 import BeautifulSoup
    text_string = BeautifulSoup(raw_html, "lxml").text
    print(text_string)
    # u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code produces these \xa0 characters in the string. To remove them properly, we can use two ways.

Method #1 (Recommended): The first one is BeautifulSoup's get_text method with the strip argument set to True. So our code becomes:

    clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
    print(clean_text)
    # Dear Parent,This is a test message,kindly ignore it.Thanks

Method #2: The other option is to use Python's unicodedata library:

    import unicodedata
    text_string = BeautifulSoup(raw_html, "lxml").text
    clean_text = unicodedata.normalize("NFKD", text_string)
    print(clean_text)
    # u'Dear Parent, This is a test message, kindly ignore it. Thanks'

I have also detailed these methods on this blog, which you may want to refer to.

answered Jan 16, 2018 at 16:57 by AliRazaBhayani

Comments (2):

(4) get_text(strip=True) really did a trick. Thanks m8 – ChewChew Nov 24, 2021 at 18:57

this is very specific for raw html returning unicode after cleaning with bs4 or regex. Works perfectly, but it will not remove line breaks or tabs – Y4RD13 May 9 at 12:18

Answer (score 29)

Try using .strip() at the end of your line. line.strip() worked well for me.

answered Jul 21, 2015 at 21:50 by user3590113

Answer (score 21)

try this:

    string.replace('\\xa0', ' ')

answered Jun 12, 2012 at 9:20 by user278064

Comments (2):

(6) @RyanMartin: this replaces four bytes: len(b'\\xa0') == 4 but len(b'\xa0') == 1. If possible, you should fix the upstream code that generates these escapes. – jfs Jan 20, 2015 at 12:43

(3) This solution worked for me: string.replace('\xa0', ' ') – JenyaPu Jul 4, 2020 at 14:31

Answer (score 14)

I ran into this same problem pulling some data from a sqlite3 database with Python. The above answers didn't work for me (not sure why), but this did:

    line = line.decode('ascii', 'ignore')

However, my goal was deleting the \xa0s, rather than replacing them with spaces.
I got this from this super-helpful unicode tutorial by Ned Batchelder.

answered Dec 11, 2012 at 20:39 by user1774699; edited Jun 20, 2020 at 9:12 by CommunityBot

Comments (4):

(15) You are now removing anything that isn't an ASCII character, you are probably masking your actual problem. Using 'ignore' is like shoving through the shift stick even though you don't understand how the clutch works.. – MartijnPieters ♦ Dec 11, 2012 at 20:58

@MartijnPieters The linked unicode tutorial is good, but you are completely correct - str.encode(..., 'ignore') is the Unicode-handling equivalent of try: ... except: .... While it might hide the error message, it rarely solves the problem. – dbr Sep 9, 2013 at 7:43

(2) for some purposes like dealing with EMAIL or URLS it seems perfect to use .decode('ascii', 'ignore') – andilabs Dec 12, 2014 at 10:15

(2) samwize's answer didn't work for you because it works on Unicode strings. line.decode() in your answer suggests that your input is a bytestring (you should not call .decode() on a Unicode string; to enforce it, the method is removed in Python 3). I don't understand how it is possible to see the tutorial that you've linked in your answer and miss the difference between bytes and Unicode (do not mix them). – jfs Jan 20, 2015 at 12:49

Answer (score 12)

Try this code:

    import re
    re.sub(r'[^\x00-\x7F]+', ' ', 'paste your string here').decode('utf-8', 'ignore').strip()

answered Mar 20, 2017 at 13:04 by shiva

Answer (score 11)

Python recognizes it as a space character, so you can split the line without arguments and join the pieces with a normal space:

    line = ' '.join(line.split())

answered Apr 23, 2019 at 7:16 by JonhyBeebop

Answer (score 9)

I ended up here while googling for a problem with a non-printable character. I use MySQL UTF-8 general_ci and deal with the Polish language. For problematic strings I have to proceed as follows:

    text = text.replace('\xc2\xa0', ' ')

It is just a fast workaround, and you should probably try to fix it with the right encoding setup instead.
answered Feb 22, 2014 at 12:09 by andilabs; edited Jun 10, 2015 at 15:30

Comments (1):

(2) this works if text is a bytestring that represents a text encoded using utf-8. If you are working with text, decode it to Unicode first (.decode('utf-8')) and encode it to a bytestring only at the very end (if the API does not support Unicode directly, e.g., socket). All intermediate operations on the text should be performed on Unicode. – jfs Jan 20, 2015 at 12:57

Answer (score 6)

In BeautifulSoup, you can pass get_text() the strip parameter, which strips white space from the beginning and end of the text. This will remove \xa0 or any other white space if it occurs at the start or end of the string. BeautifulSoup replaced an empty string with \xa0 and this solved the problem for me.

    mytext = soup.get_text(strip=True)

answered Jan 19, 2015 at 14:51 by Mark; edited Jan 19, 2015 at 15:25 by shauryachats

Comments (1):

(9) strip=True works only if &nbsp; is at the beginning or end of each bit of text. It won't remove the space if it is in between other characters in the text. – jfs Jan 20, 2015 at 13:01

Answer (score 5)

It's the equivalent of a space character, so strip it:

    print(string.strip())  # no more xa0

answered Mar 6, 2019 at 17:23 by 8bitjunkie

Comments (1):

(5) This will only remove it if it's at the beginning or end of the string. – Bill Jan 18, 2021 at 23:55

Answer (score 4)

0xA0 (Unicode) is 0xC2A0 in UTF-8. .encode('utf8') will just take your Unicode 0xA0 and replace it with UTF-8's 0xC2A0. Hence the apparition of 0xC2s... Encoding is not replacing, as you've probably realized now.

answered Jun 12, 2012 at 12:02 by dda; edited Sep 26, 2012 at 5:55

Comments (1):

(1) 0xc2a0 is ambiguous (byte order). Use b'\xc2\xa0' bytes literal instead.
– jfs Jan 20, 2015 at 13:03

Answer (score 2)

You can try string.strip(). It worked for me! :)

answered Jan 30, 2021 at 14:13 by SaemaMiftah; edited Jan 30, 2021 at 14:54 by sta

Answer (score 1)

Generic version with a regular expression (it will remove all the control characters):

    import re

    def remove_control_chart(s):
        # Matches a literal backslash, an 'x' and any two characters
        # (for example the four-character escape text \xa0).
        return re.sub(r'\\x..', '', s)

answered Jul 2, 2018 at 12:28 by ranaFire; edited Aug 30, 2018 at 6:23

Answer (score 1)

This is how I solved this issue when I encountered \xa0 in an HTML-encoded string.

I discovered that a non-breaking space is inserted to ensure that a word and the subsequent HTML markup are not separated when a page is resized.

This presents a problem for the parsing code, as it introduces codec encoding issues. What made it hard was that we are not privy to the encoding used. On Windows machines it can be latin-1 or CP1252 (Western ISO), but more recent OSes have standardized on UTF-8. By normalizing the unicode data, we strip \xa0:

    my_string = unicodedata.normalize('NFKD', my_string).encode('ASCII', 'ignore')

answered Jul 6 at 3:13 by AmroYounes
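The approaches discussed in this thread can be compared side by side. What follows is only a minimal sketch, assuming Python 3 (where every str is already Unicode) rather than the Python 2.7 of the question. The helper names replace_nbsp, normalize_nfkc, collapse_whitespace and clean_bytes are made up for illustration and do not come from any answer or library.

```python
import unicodedata

NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE, the same character as chr(160)


def replace_nbsp(text: str) -> str:
    """Swap every non-breaking space for an ordinary space; nothing else changes."""
    return text.replace(NBSP, " ")


def normalize_nfkc(text: str) -> str:
    """Compatibility normalization: turns NBSP into a space, but also rewrites
    other compatibility characters (for example 'º' becomes 'o')."""
    return unicodedata.normalize("NFKC", text)


def collapse_whitespace(text: str) -> str:
    """Split on any whitespace run (NBSP counts as whitespace for str.split)
    and rejoin the pieces with single ordinary spaces."""
    return " ".join(text.split())


def clean_bytes(raw: bytes, encoding: str = "utf-8") -> bytes:
    """For byte input: decode first, clean the text, encode only at the end."""
    return replace_nbsp(raw.decode(encoding)).encode(encoding)


sample = "Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks"

print(replace_nbsp(sample))
print(repr(collapse_whitespace("a\xa0\xa0b \n c")))
print(repr(normalize_nfkc("1º\xa0dia")))      # '1o dia'
print(repr("\xa0edge only\xa0".strip()))      # strip() touches only the ends
print(NBSP.encode("latin-1"))                 # b'\xa0'
print(NBSP.encode("utf-8"))                   # b'\xc2\xa0'
print(clean_bytes(b"a\xc2\xa0b"))
```

Of these, a plain replace is the narrowest change. Normalization and whitespace collapsing both alter more than the non-breaking space, which may or may not be what you want, and the last two encode() calls show why UTF-8 output contains the extra \xc2 byte mentioned in the question.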