Emojis in Your Data
Everything you need to know about emoji in your database or dataset.

Jonathan Law · Oct 5 · 5 min read · Towards Data Science

Let us face it: it is 2021 and emojis are inevitable. You see them everywhere, from chats to product reviews and, in some cases, usernames. Today we will answer some questions:

- What are emojis
- How are emojis stored in a database
- How important is it to retain emojis in my dataset or database
- What's next 🤷

What are emojis

Everyone knows what emojis are, but what are they really? Are they images or fonts, and why do different systems show the same laughing emoji differently?

To start with, an emoji is drawn as a glyph, much like a character in a font. Behind each laughing face is a code point, written in hexadecimal. Taking the 🤓 "nerd face" emoji as an example, its code point (denoted by U+) is U+1F913, as listed in Emojipedia.

Each code point refers to an entry in a universally understood dictionary called Unicode. If the dictionary has the word, you get its definition. If you try looking up a Chinese word in an English dictionary, you will not find a definition for it. The same idea applies to how Unicode works on your system: if your system does not contain a glyph for a code point, it cannot show 🤓. And just as Cambridge and Oxford give different definitions for the same English word, different systems have their own glyph designs that do not deviate from the original meaning.

How are emojis stored in a database

Now that you know how a computer system represents emojis, how are they stored in databases? We know that emojis are just code points, so do we need a special Unicode column type, or do we store them as strings and parse them as emojis later?

In most databases, you can tell the engine that a particular run of characters should be interpreted as a Unicode code point rather than taken as a literal string. This is called escaping. Below are a few examples of escaping and storing emojis in databases. Different databases have slightly different requirements for escaped characters, so read each database's documentation on escaping carefully.

    -- POSTGRES
    SELECT ('\+01F913'), (U&'\+01F913'), (U&'\d83e\dd13')
    -- POSTGRES RESULT: "\+01F913", "🤓", "🤓"

    -- BIGQUERY
    SELECT ('\U0001F913')
    -- BIGQUERY RESULT: "🤓"

As shown in the Postgres example, the first column is stored literally as the string '\+01F913'. With the U& prefix used in the second column, the database understands that these characters are not a regular string but a hexadecimal code point.

Notice that we added a 0 in front of 1F913. This is because the Postgres documentation specifies that this form of Unicode escape requires "a backslash followed by a plus sign followed by a six-digit hexadecimal code point number". Our original code point has only 5 digits, so we zero pad it to 6 digits, which does not change its value.

The third column is an example of a surrogate pair of code units, which together make up one code point. As mentioned in the same Postgres documentation, surrogate pairs (16 bits + 16 bits) exist to compose code points larger than U+FFFF (16 bits).

    F (hex)    = 1111 (binary)                = 4 bits
    FFFF (hex) = 1111 1111 1111 1111 (binary) = 16 bits

This raises the number of available code points to over a million (U+0000 through U+10FFFF). That is the least of our concerns, however, since Postgres combines the surrogate pair into one code point before storing it. The value stored in the third column is therefore the same code point as in the second column.

The same goes for BigQuery, whose documentation states that its \U escape requires exactly 8 hexadecimal digits. That is why the example above zero pads the code point to 0001F913.

How important is it to retain emojis in my dataset or database

[Image by author]

Two very similar English texts, yet very different in terms of expression. As emojis increasingly become the go-to way of expressing feelings, NLP models and datasets should adapt to store and process such information. New emojis are constantly added, and the complexity has increased with the introduction of variants and skin tones. Adapting to every emoji may not be easy, but it is worth noting that unicode.org publishes a list of emoji frequency that helps us understand which few emojis we should pay attention to.

No doubt storing emojis in databases has become common practice, and it should be. Google has a whitepaper stating that enabling the largest character set available, which includes emoji, is one of the good practices for storing user passwords. Product names and descriptions on most online shopping platforms contain emojis too, and we definitely would not want to strip those out when users save them or when you perform inference for your business case.

What's next 🤷

Emojis are here to stay. It is up to developers to manage that extra information, and up to data practitioners to make sense of it.

[Image taken from the author's Telegram message]

Emojis have become an integral part of our daily conversation, and in some cases an emoji is the whole answer to a question. How to manage this information, whether by replacing each emoji with its English meaning or by encoding it as its own token, is something to think about.

Here are some cool tools that might help you understand Unicode a little better (in no particular order):

- Unicode to hex converter
- Unicode Code Converter
- Unicode and its bytes table
- Emojipedia
- unicode.org
- Unicode surrogate pair understanding and calculator

About the author: I'm Jonathan Law Hui Hao, a full stack developer based in Malaysia who works with tech stuff and enjoys learning and working on RPA and Machine Learning!

Thanks to Elliot Gunn.
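As a worked example for the escaping section of this article, the zero padding and surrogate pair arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of any database client. The escape formats assume the six-digit `\+XXXXXX` form the article quotes for Postgres and the eight-digit `\U` form it describes for BigQuery.

```python
from typing import Tuple

NERD_FACE = 0x1F913  # code point U+1F913, the "nerd face" emoji


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000    # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a surrogate pair back into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def postgres_escape(code_point: int) -> str:
    """Hypothetical helper: six-digit form, for use inside U&'...'."""
    return f"\\+{code_point:06X}"


def bigquery_escape(code_point: int) -> str:
    """Hypothetical helper: eight-digit \\U form."""
    return f"\\U{code_point:08X}"


high, low = to_surrogate_pair(NERD_FACE)
print(f"{high:04x} {low:04x}")     # d83e dd13
print(postgres_escape(NERD_FACE))  # \+01F913
print(bigquery_escape(NERD_FACE))  # \U0001F913
```

The surrogate pair printed here matches the `\d83e\dd13` used in the third Postgres column, and combining it again with `from_surrogate_pair` gives back U+1F913.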