regexp: document and implement invalid UTF-8 treated as U+FFFD
golang/go #48749 · Closed · ComaVN opened this issue Oct 3, 2021 · 7 comments
Labels: Documentation, NeedsInvestigation
Milestone: Backlog

ComaVN commented Oct 3, 2021 (edited)

What version of Go are you using (go version)?

$ go version
go version go1.16.8 linux/amd64

Does this issue reproduce with the latest release?

It reproduces with the Go Playground, which I assume is the latest version.

What operating system and processor architecture are you using (go env)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/roel/.cache/go-build"
GOENV="/home/roel/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/roel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/roel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/snap/go/8408"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/snap/go/8408/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.8"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/roel/dev/json-api-golang/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3447792839=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I tried to validate a user-provided string, which might contain non-UTF-8 data, using a regex. See https://play.golang.org/p/j-jsteknY0M for a concise example.

What did you expect to see?

The regexp package docs say: "All characters are UTF-8-encoded code points."

So, I expected matching on non-UTF-8 strings to either:

- always give an error, or
- always return false

What did you see instead?

For some regexes, it returns false; for some it returns true. It never returns an error.

Particularly, regexes referencing or containing the Unicode REPLACEMENT CHARACTER (\ufffd, �) inside a bracket expression return true (but only if there are other characters in the same bracket). See the playground example.

I understand that the immediate solution for me is to just check for invalid UTF-8 first, before regexing. However, the actual behaviour was so unexpected to me, even if it's technically undefined when reading the docs, that it might be a good idea to at least document this.

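The workaround mentioned at the end of the report (check for invalid UTF-8 first, then run the regex) can be sketched roughly as follows. This is only an illustration: the pattern namePattern and the helper validInput are hypothetical names that do not appear in the issue or its playground example; utf8.ValidString and regexp.MustCompile are the standard library calls.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// namePattern is a hypothetical validation pattern, not taken from the issue.
var namePattern = regexp.MustCompile(`^[a-z]+$`)

// validInput reports whether s is well-formed UTF-8 and matches namePattern.
// Invalid UTF-8 is rejected before the regexp engine ever sees it, so the
// result does not depend on how regexp treats malformed bytes.
func validInput(s string) bool {
	if !utf8.ValidString(s) {
		return false
	}
	return namePattern.MatchString(s)
}

func main() {
	fmt.Println(validInput("gopher"))     // well-formed and matching
	fmt.Println(validInput("go\xffpher")) // contains a byte that is not valid UTF-8
	fmt.Println(validInput("Gopher"))     // well-formed, but does not match
}
```

Doing the UTF-8 check separately keeps the "is this text at all?" question out of the pattern itself, which is the property the reporter was hoping the regexp package would provide.
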
robpike (Contributor) commented Oct 3, 2021

It may not be clear and it may not be right, but this behavior is how strings work in Go. Invalid UTF-8 gets turned, one byte at a time, into U+FFFD. Here is text from the spec about ranges over strings, which is as good a description as any of how Go handles invalid UTF-8. It's part of the language itself to do it this way:

"If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string."

From the point of view of the matching algorithm, the regexp compiler has already overwritten all the invalid UTF-8 when it built the engine using Go's rules to interpret the string.

There is no way to fix this compatibly other than to provide a flag or other mechanism to avoid this interpretation. Given that the code is all runes inside, though, even that may be infeasible.

Working as intended, and unfortunate.

You are right that your best bet is likely to validate the string ahead of time, or else elide the invalid UTF-8 altogether.

ComaVN (Author) commented Oct 3, 2021

"Invalid UTF-8 gets turned, one byte at a time, into U+FFFD"

This does not explain why a simple regex explicitly looking for U+FFFD does NOT match on invalid utf8. Maybe there's some optimization going on for regexes such as \ufffd or \x{fffd} that turns them into simple string.Contains or similar?

Anyway, thanks for the quick response. I just wanted to save someone some time hair pulling like I did today :)

mknyszek changed the title from "Unexpected behaviour of regex containing unicode replacement character on non-utf8 strings" to "regexp: unexpected behaviour of regex containing unicode replacement character on non-utf8 strings" Oct 4, 2021
mknyszek added the NeedsInvestigation label Oct 4, 2021
mknyszek added this to the Backlog milestone Oct 4, 2021

mknyszek (Contributor) commented Oct 4, 2021

Based on @ComaVN and @robpike's conversation, I'm going to close this issue.

mknyszek closed this as completed Oct 4, 2021

robpike (Contributor) commented Oct 5, 2021

I think there is a real bug here for some cases. Reopening.

robpike reopened this Oct 5, 2021
robpike assigned robpike and rsc and unassigned robpike Oct 5, 2021

rsc (Contributor) commented Oct 5, 2021

The literals (lines 13-18) are buggy in https://play.golang.org/p/j-jsteknY0M and should be fixed. The literal search needs to not kick in for U+FFFD.

rsc mentioned this issue Oct 6, 2021: regexp: behavior on invalid UTF-8 input is undocumented and inconsistent #38006 (Closed)

rsc (Contributor) commented Oct 6, 2021

This is a duplicate of #38006, which I've merged into this issue because this issue had more commentary. That issue was marked as just needing a documentation update.

I started to look into fixing this, but it's fairly complex to get all the cases in all the matching engines.

For the record, the coherent behavior options are:

1. Invalid UTF-8 does not match any character classes, nor a U+FFFD literal (nor \x{fffd}).
2. Each byte of invalid UTF-8 is treated identically to a U+FFFD in the input, as a utf8.DecodeRune loop might.

RE2 uses Rule 1. Because it works byte at a time it can also provide \C to match any single byte of input, which matches invalid UTF-8 as well. This provides the nice property that a match for a regexp without \C is guaranteed to be valid UTF-8.

Unfortunately, today Go has an incoherent mix of these two, although mostly Rule 2. This is a deviation from RE2, and it gives up the nice property, but we probably can't correct that at this point. In particular .* already matches entire inputs today, valid UTF-8 or not, and I doubt we can break that. The right solution for Go is probably to adopt Rule 2 officially, fixing the few places that deviate from Rule 2.

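The "one byte at a time" decoding that the spec quote and Rule 2 describe can be seen without regexp at all. The sketch below is illustrative only: the helper decodeRunes and the sample string are made up for this example, while range over a string and utf8.DecodeRuneInString are the standard mechanisms the thread refers to.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeRunes walks s the way a utf8.DecodeRune loop would and returns every
// rune it sees. Each byte that is not part of valid UTF-8 is reported as
// utf8.RuneError (U+FFFD) with a width of one byte.
func decodeRunes(s string) []rune {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		out = append(out, r)
		s = s[size:]
	}
	return out
}

func main() {
	// Hypothetical input: two bytes that are not valid UTF-8 between 'a' and 'b'.
	s := "a\xff\xfeb"

	// A range loop applies the same rule: every invalid byte yields U+FFFD
	// and the iteration advances by a single byte.
	for i, r := range s {
		fmt.Printf("offset %d: %U\n", i, r)
	}

	fmt.Println(len(decodeRunes(s))) // four runes for four bytes
}
```

Under Rule 2 the regexp engines would see the same rune sequence as this loop, which is why an invalid byte and a genuine U+FFFD in the input become indistinguishable to a pattern.
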
rsc changed the title from "regexp: unexpected behaviour of regex containing unicode replacement character on non-utf8 strings" to "regexp: document that invalid UTF-8 is treated as U+FFFD (and fix)" Oct 6, 2021
rsc changed the title from "regexp: document that invalid UTF-8 is treated as U+FFFD (and fix)" to "regexp: document and implement invalid UTF-8 treated as U+FFFD" Oct 6, 2021
gopherbot added the Documentation label Oct 6, 2021

gopherbot commented Oct 7, 2021

Change https://golang.org/cl/354569 mentions this issue: regexp: document and implement that invalid UTF-8 bytes are the same as U+FFFD

gopherbot closed this as completed in 702e337 Oct 11, 2021
rsc removed their assignment Jun 23, 2022
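To check how a given Go toolchain behaves, a small probe along these lines may help. It is a sketch, not the playground program from the issue: the variable names and the single-byte input are assumptions made for illustration. The first two patterns correspond to cases the thread says already followed Rule 2 (.* matching any input, and a bracket expression containing U+FFFD plus another character). The bare \x{fffd} literal is the case identified as buggy, so its result presumably depends on whether the toolchain includes the change referenced above; no output is claimed for it here.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// anyLine: the thread notes that .* matches entire inputs, valid UTF-8 or not.
	anyLine = regexp.MustCompile(`^.*$`)
	// class: a bracket expression with U+FFFD and one other character.
	class = regexp.MustCompile(`[\x{fffd}a]`)
	// literal: a bare U+FFFD literal, the case the thread calls buggy.
	literal = regexp.MustCompile(`\x{fffd}`)
)

func main() {
	bad := "\xff" // a single byte that is not valid UTF-8

	fmt.Println("^.*$        :", anyLine.MatchString(bad))
	fmt.Println("[\\x{fffd}a] :", class.MatchString(bad))
	// Version-dependent: compare this line across Go releases.
	fmt.Println("\\x{fffd}    :", literal.MatchString(bad))
}
```

If all three lines agree, the toolchain is applying Rule 2 consistently for these patterns; a mismatch on the last line reproduces the inconsistency the issue was opened for.
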