3 (Or More) Ways to Open a CSV in Python


Ah. What a classic. The one piece of code that I end up writing over and over again; you would think I would have stashed it away by now. Not going to lie, I usually have to Google it, while thinking, is this the right way? Should I just open the csv file and iterate it? Should I import the csv module? Should I just use Pandas? Does it matter? Probably not.

So, let's try them all. Not that it matters what's slower, but sometimes you do run across the 2.5GB csv file, so it's probably not a bad idea to check out the options.

We will be using an open source dataset, outstanding student loan debt by state. All my code and the file can be found on GitHub. Let's just open the file, read the rows, split the columns up, and call that work.

Options for working with CSV files in Python.

- Just open it.
- Python standard csv module.
- Pandas.

Just Open the CSV File Already.

The first option is just to open the file like you would anything else, and then read the lines one at a time. There are some sub-options here.

- Read all the lines into a list. readlines()
- Read one line at a time. readline()
- Read into a single string. read() (doesn't apply in this situation for how we want to deal with the data).

The idea behind just opening a file and calling readlines() or readline() is that it's simple. With readlines() you will get a list back that contains a row for each line or record in your csv file. Using readline() you can just get one line at a time. Seems like you would maybe want to use readline() if you wanted to keep memory down, but most of the time who cares?

Let's check out readlines() first.
```python
from time import time


def open_csv_file(file_location: str) -> None:
    # Read every line of the file into a list, then split each one.
    with open(file_location, 'r') as f:
        data = f.readlines()
        for line in data:
            split_line(line)


def split_line(line: str) -> None:
    # Naive column split: break the line on every comma.
    column_data = line.split(',')
    print(column_data)


if __name__ == '__main__':
    t1 = time()
    open_csv_file(file_location='PortfoliobyBorrowerLocation-Table1.csv')
    t2 = time()
    print('The total time taken was {t} seconds'.format(t=str(t2 - t1)))
```

Let's try readline() next. We would expect it to be slightly slower because it just requires a touch more code.

```python
from time import time


def open_csv_file(file_location: str) -> None:
    with open(file_location, 'r') as f:
        # readline() returns an empty string at end of file, which ends
        # the loop before an empty "row" gets split and printed.
        line = f.readline()
        while line:
            split_line(line)
            line = f.readline()


def split_line(line: str) -> None:
    # Naive column split: break the line on every comma.
    column_data = line.split(',')
    print(column_data)


if __name__ == '__main__':
    t1 = time()
    open_csv_file(file_location='PortfoliobyBorrowerLocation-Table1.csv')
    t2 = time()
    print('The total time taken was {t} seconds'.format(t=str(t2 - t1)))
```

I ran each method 3 times, and readlines() came out a little faster. No surprise there.

Import csv…. what could be easier?

Both those methods seem fairly straightforward. Let's check out the built-in csv module in Python. This should be easier to use in theory because we won't have to split out our own columns etc.

```python
from time import time
import csv


def open_csv_file(file_location: str) -> None:
    # newline='' is what the csv docs recommend when opening files for csv.reader.
    with open(file_location, newline='') as f:
        csv_reader = csv.reader(f)
        # Each row comes back already split into a list of column values.
        for row in csv_reader:
            print(row)


if __name__ == '__main__':
    t1 = time()
    open_csv_file(file_location='PortfoliobyBorrowerLocation-Table1.csv')
    t2 = time()
    print('The total time taken was {t} seconds'.format(t=str(t2 - t1)))
```

Interesting: faster than readline() but slightly slower than readlines() and splitting the columns ourselves. This is a little strange to me; I just assumed that the csv module offered more than just convenience.

Figure: opening csv files in Python, performance comparison.

Who can say csv and Python in the same sentence and not think of Pandas? I have my complaints about it, but with the rise of data science, it's here to stay. I have to say, of all the options, reading a csv file with Pandas is the easiest to use and remember.
What makes Pandas nice is that to open a file into a dataframe all you have to do is call pandas.read_csv(). Also, as you can see, calling iterrows() will allow you to easily iterate over the rows.

```python
from time import time
import pandas


def open_csv_file(file_location: str) -> None:
    # Load the whole file into a dataframe, then walk it row by row.
    dataframe = pandas.read_csv(file_location)
    for index, row in dataframe.iterrows():
        print(row['Location'], row['Balance (in billions)'], row['Borrowers (in thousands)'])


if __name__ == '__main__':
    t1 = time()
    open_csv_file(file_location='PortfoliobyBorrowerLocation-Table1.csv')
    t2 = time()
    print('The total time taken was {t} seconds'.format(t=str(t2 - t1)))
```

Oh boy, easy to use, but performance-wise, yikes. Say goodbye to my nice-looking chart!! Ha ha!

Oh, and you can't forget that piece of junk Dask. I know it wasn't really made to read one csv file, but I have to poke at it anyways. If nothing else, to make myself feel better.

```python
from time import time
import dask.dataframe as dd


def open_csv_file(file_location: str) -> None:
    # Same idea as the Pandas version, but through a Dask dataframe.
    df = dd.read_csv(file_location)
    for index, row in df.iterrows():
        print(row['Location'], row['Balance (in billions)'], row['Borrowers (in thousands)'])


if __name__ == '__main__':
    t1 = time()
    open_csv_file(file_location='PortfoliobyBorrowerLocation-Table1.csv')
    t2 = time()
    print('The total time taken was {t} seconds'.format(t=str(t2 - t1)))
```

Bahaha!

Nice! It's always fun to go back to the simple stuff. Loading csv files might be for the birds, but any data engineer is probably going to have to do it a few times a year. My vote is for readlines(); it's fast and not that complicated.

I know some people might argue about the nuances of the different tools, and there are good reasons to use each one, I'm sure. But I think it's important to just look at the basics of loading and iterating a CSV file with all the different tools. Mostly because in the real world we might just pick something in the heat of the moment, especially as a data engineer, and thousands of files later, when things grow, come to the realization that tool choice and speed did matter after all.

Oh, by the way, in case you were curious and have heard a lot about the student loan debacle: you will notice we were using a dataset of federal student loans per state. Here it is. Classic, way to go Cali.
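One of those nuances probably deserves a quick sketch. Splitting a line on commas is fast, but it knows nothing about quoting, and that is one thing the csv module does give you beyond convenience. The sample text below is made up for illustration (it is not taken from the student loan file), and split_rows and csv_rows are hypothetical helper names, so treat this as a rough comparison under those assumptions rather than a benchmark.

```python
import csv
import io

# Hypothetical sample data, NOT from the student loan dataset.
# The first data row has a quoted field that contains a comma.
SAMPLE = 'Location,Note\n"Washington, D.C.",capital\nOhio,state\n'


def split_rows(text: str) -> list:
    """Split each line on commas, like the readlines() approach."""
    return [line.rstrip('\n').split(',') for line in io.StringIO(text)]


def csv_rows(text: str) -> list:
    """Parse the same text with csv.reader, which honors quoted fields."""
    return list(csv.reader(io.StringIO(text)))


if __name__ == '__main__':
    # The naive split cuts the quoted field in two and keeps the quotes.
    print(split_rows(SAMPLE)[1])
    # csv.reader returns the quoted field as a single column.
    print(csv_rows(SAMPLE)[1])
```

If your files never contain quoted commas, the plain split is fine and the speed numbers above stand. If they might, the small slowdown from csv.reader is likely buying you correct columns.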
Daniel, 2019-11-27


