Emojis in Your Data


Everything you need to know about emoji in your database or dataset.

Jonathan Law · Oct 5 · 5 min read

(Header image created by the author)

Let us face it, it is 2021 and emojis are inevitable. You see them everywhere, from chats to product reviews, and in some cases usernames. Today we will be answering some questions:

- What are emojis
- How are emojis stored in a database
- How important is it to retain emojis in my dataset or database
- What's next 🤷

What are emojis

Everyone knows what emojis are, but what are they really? Are they an image or a font, and why do different systems show the same laughing emoji differently?

For starters, an emoji is displayed as a glyph; think of it as a character in a font. Behind each laughing face, the emoji is a hexadecimal code point. Taking the 🤓 "nerd face" emoji as an example, its hexadecimal code point (denoted by U+) is U+1F913, as listed in Emojipedia.

Each code point refers to an entry in a universally understood dictionary called Unicode. If the dictionary has the word, you get its definition. If you try looking up a Chinese word in an English dictionary, you will not get that word's definition. The same concept applies to how Unicode works on your system: if your system does not contain the glyph for a code point, it will not be able to show 🤓. And just as Cambridge and Oxford give different definitions for the same English word, different systems have their own glyph designs without deviating from the original meaning.

How are emojis stored in a database

Now that you know how emojis are represented by a computer system, how are they stored in databases? We know that emojis are just hexadecimal code points, so do we need a special Unicode column type, or do we store the value as a string and parse it as an emoji later?

In most databases, you can tell the engine that a particular block of characters should be interpreted as a Unicode code point rather than as a literal string. This process is called escaping. Below are a few examples of escaping and storing emojis in databases. Different databases have slightly different requirements for escaped characters, so read each database's documentation on escaping carefully.

    -- POSTGRES
    SELECT ('\+01F913'), (U&'\+01F913'), (U&'\d83e\dd13')
    -- POSTGRES RESULT: "\+01F913", "🤓", "🤓"

    -- BIGQUERY
    SELECT ('\U0001F913')
    -- BIGQUERY RESULT: "🤓"

As shown in the Postgres example, the first column is stored literally as the string '\+01F913'. By escaping with U& in the second column, the database now understands that this block of characters is not a regular string, but should be read as a hexadecimal code point.

(Image by author)

Notice that we added a 0 in front of 1F913. This is because the Postgres documentation specifies that this form of Unicode escape requires "a backslash followed by a plus sign followed by a six-digit hexadecimal code point number". Our original code point has only 5 digits, so we zero-pad it to 6 digits, which does not change its value.

The third column is an example of a surrogate pair of code units. A surrogate pair of code units makes up one code point. As mentioned in the same Postgres documentation, surrogate pairs (16 bit + 16 bit) exist to compose code points larger than U+FFFF (16 bit).

    F (hex) = 1111 (binary) = 4 bit
    FFFF = 1111 1111 1111 1111 = 16 bit

This increased the number of available code points to over a million. However, that is the least of our concerns, since Postgres combines the surrogate pair into one code point before storing it. The final column is essentially just storing the hexadecimal code as bytes.

The same goes for BigQuery, whose documentation indicates that its \U Unicode escape requires 8 hexadecimal digits. Therefore, as shown in the example above, we zero-pad the code point to 8 digits.

How important is it to retain emojis in my dataset or database

(Image by author)

Two very similar English texts, yet very different in terms of expression. As emojis increasingly become the go-to way of expressing feelings, NLP models and datasets should adapt to accommodate storing and processing such information.

New emojis are constantly being added, and the complexity has increased with the introduction of variants and skin tones. Adapting to every emoji may not be easy, but it is worth noting that unicode.org publishes a list of emoji frequency, which helps us understand which emojis are most worth paying attention to.

No doubt storing emojis in databases has become a common practice, and it should be. Google has a whitepaper stating that enabling the largest character set available, which includes emoji, is one of the good practices in handling user passwords. Product names and descriptions on most online shopping platforms contain emojis too, and we definitely would not want to strip those out when users are saving them or when you are performing inference for your business case.

What's next 🤷

Emojis are here to stay, and it is up to developers to manage that extra information, and to data practitioners to make sense of it.

(Image taken from the author's Telegram message)

Emojis have become an integral part of our daily conversations, and in some cases an emoji is itself the answer to a question. How to manage this information, whether by replacing each emoji with its relevant English word meaning or by encoding it as another token, is something to think about.

Here are some cool tools that might help you understand Unicode a little better (in no particular order):

- Unicode to hex converter
- Unicode Code Converter
- Unicode and its bytes table
- Emojipedia
- unicode.org
- Unicode surrogate pair understanding and calculator

Jonathan Law

I'm Jonathan Law Hui Hao, a full stack developer based in Malaysia who works with tech stuff and enjoys learning and working on RPA and Machine Learning!

Thanks to Elliot Gunn.
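The zero padding and surrogate pair arithmetic described in the article can be checked outside the database. The following is a minimal Python sketch, not part of the original article; the helper names `zero_pad` and `to_surrogate_pair` are hypothetical and chosen only for illustration. It builds the 6-digit and 8-digit escape forms for 🤓 and derives the same surrogate pair used in the Postgres example.

```python
from typing import Tuple


def zero_pad(code_point: int, width: int) -> str:
    """Return the code point as uppercase hex, left-padded with zeros."""
    return format(code_point, "X").zfill(width)


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


nerd_face = "\U0001F913"
code_point = ord(nerd_face)                # 0x1F913

# Text of the escape sequences as they would be typed into a SQL literal.
postgres_escape = "\\+" + zero_pad(code_point, 6)   # six-digit form
bigquery_escape = "\\U" + zero_pad(code_point, 8)   # eight-digit form

high, low = to_surrogate_pair(code_point)
utf8_bytes = nerd_face.encode("utf-8")     # bytes a UTF-8 database would store

print(postgres_escape, bigquery_escape, hex(high), hex(low), utf8_bytes)
```

The strings produced here are only the escape text; whether a database interprets them still depends on its own literal syntax, such as the U& prefix in Postgres.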

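One of the options mentioned in the article, replacing an emoji with its English word meaning, can be sketched with Python's standard `unicodedata` module. This is an assumption-laden illustration rather than the author's method: the function name `replace_emoji_with_names` is hypothetical, and the test for "is this an emoji" is a rough heuristic (category "So" at or above U+1F000) that ignores skin tone modifiers, ZWJ sequences, and older emoji in lower code point ranges.

```python
import unicodedata


def replace_emoji_with_names(text: str) -> str:
    """Replace pictographic symbols with a :snake_case: form of their Unicode name."""
    pieces = []
    for ch in text:
        # Rough heuristic (assumption): "Symbol, other" characters in the
        # supplementary pictograph ranges are treated as emoji.
        if ord(ch) >= 0x1F000 and unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                pieces.append(":" + name.lower().replace(" ", "_") + ":")
                continue
        pieces.append(ch)
    return "".join(pieces)


print(replace_emoji_with_names("Great product 🤓"))
```

A production pipeline would more likely rely on a maintained emoji list, since the Unicode character name is not always the wording people associate with an emoji.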

