Python Pandas read_csv: Load Data from CSV Files

文章推薦指數: 80 %
投票人數:10人

The Python Pandas read_csv function is used to read or load data from CSV files. We examine the comma-separated value format, tab-separated files, ... Skiptocontent CSV(comma-separatedvalue)filesareacommonfileformatfortransferringandstoringdata.Theabilitytoread,manipulate,andwritedatatoandfromCSVfilesusingPythonisakeyskilltomasterforanydatascientistorbusinessanalysis.Inthispost,we’llgooverwhatCSVfilesare,howtoreadCSVfilesintoPandasDataFrames,andhowtowriteDataFramesbacktoCSVfilespostanalysis. PandasisthemostpopulardatamanipulationpackageinPython,andDataFramesarethePandasdatatypeforstoringtabular2Ddata. TableOfContents LoadCSVfilestoPythonPandas1.FileExtensionsandFileTypes2.DataRepresentationinCSVfilesOtherDelimiters/Separators–TSVfilesDelimitersinTextFields–Quotechar3.Python–Paths,Folders,FilesFindingyourPythonPathFileLoading:AbsoluteandRelativePaths4.PandasCSVFileLoadingErrorsAdvancedReadCSVFilesSpecifyingDataTypesSkippingandPickingRowsandColumnsFromFileCustomMissingValueSymbols CSVFormatAdvantagesandDisadvantagesAdditionalReading LoadCSVfilestoPythonPandas ThebasicprocessofloadingdatafromaCSVfileintoaPandasDataFrame(withallgoingwell)isachievedusingthe“read_csv”functioninPandas: #LoadthePandaslibrarieswithalias'pd' importpandasaspd #Readdatafromfile'filename.csv' #(inthesamedirectorythatyourpythonprocessisbased) #Controldelimiters,rows,columnnameswithread_csv(seelater) data=pd.read_csv("filename.csv") #Previewthefirst5linesoftheloadeddata data.head() Whilethiscodeseemssimple,anunderstandingofthreefundamentalconceptsisrequiredtofullygraspanddebugtheoperationofthedataloadingprocedureifyourunintoissues: Understandingfileextensionsandfiletypes–whatdothelettersCSVactuallymean?What’sthedifferencebetweena.csvfileanda.txtfile?UnderstandinghowdataisrepresentedinsideCSVfiles–ifyouopenaCSVfile,whatdoesthedataactuallylooklike?UnderstandingthePythonpathandhowtoreferenceafile –whatistheabsoluteandrelativepathtothefileyouareloading?Whatdirectoryareyouworkingin?CSVdataformatsanderrors–commonerrorswiththefunction. Eachofthesetopicsisdiscussedbelow,andwefinishthistutorialbylookingatsomemoreadvancedCSVloadingmechanismsandgivingsomebroadadvantagesanddisadvantagesoftheCSVformat. 1.FileExtensionsandFileTypes Thefirststeptoworkingwithcomma-separated-value(CSV)filesisunderstandingtheconceptoffiletypesandfileextensions. Dataisstoredonyourcomputerinindividual“files”,orcontainers,eachwithadifferentname.Eachfilecontainsdataofdifferenttypes–theinternalsofaWorddocumentisquitedifferentfromtheinternalsofanimage.Computersdeterminehowtoreadfilesusingthe“fileextension”,thatisthecodethatfollowsthedot(“.”)inthefilename.So,afilenameistypicallyintheform“.”.Examples: project1.DOCX–aMicrosoftWordfilecalledProject1. shanes_file.TXT–asimpletextfilecalledshanes_file IMG_5673.JPG–AnimagefilecalledIMG_5673. Otherwellknownfiletypesandextensionsinclude:XLSX:Excel,PDF:PortableDocumentFormat,PNG–images,ZIP–compressedfileformat,GIF–animation,MPEG–video,MP3–musicetc.Seeacompletelistofextensionshere. ACSVfileisafilewitha“.csv”fileextension,e.g.“data.csv”,“super_information.csv”.The“CSV”inthiscaseletsthecomputerknowthatthedatacontainedinthefileisin“commaseparatedvalue”format,whichwe’lldiscussbelow. Fileextensionsarehiddenbydefaultonalotofoperatingsystems.Thefirststepthatanyself-respectingengineer,softwareengineer,ordatascientistwilldoonanewcomputeristoensurethatfileextensionsareshownintheirExplorer(Windows)orFinder(Mac)windows. Folderwithfileextensionsshowing.BeforeworkingwithCSVfiles,ensurethatyoucanseeyourfileextensionsinyouroperatingsystem.Differentfilecontentsaredenotedbythefileextension,orlettersafterthedot,ofthefilename.e.g.TXTistext,DOCXisMicrosoftWord,PNGareimages,CSViscomma-separatedvaluedata. Tocheckiffileextensionsareshowinginyoursystem,createanewtextdocumentwithNotepad(Windows)orTextEdit(Mac)andsaveittoafolderofyourchoice.Ifyoucan’tseethe“.txt”extensioninyourfolderwhenyouviewit,youwillhavetochangeyoursettings. InMicrosoftWindows:Open ControlPanel> AppearanceandPersonalization.  Now,clickonFolderOptionsorFileExplorerOption,asitisnowcalled>Viewtab. Inthistab,underAdvanceSettings,youwillseetheoptionHideextensionsforknownfiletypes.Uncheckthisoption andclickonApplyandOK.InMacOS:OpenFinder>Inmenu,clickFinder>Preferences,ClickAdvanced,Selectthecheckboxfor“Showallfilenameextensions”. 2.DataRepresentationinCSVfiles A“CSV”file,thatis,afilewitha“csv”filetype,isabasictextfile.AnytexteditorsuchasNotePadonwindowsorTextEditonMac,canopenaCSVfileandshowthecontents.SublimeTextisawonderfulandmulti-functionaltexteditoroptionforanyplatform. CSVisastandardforstoringtabulardataintextformat,wherecommasareusedtoseparatethedifferentcolumns,andnewlines(carriagereturn/pressenter)usedtoseparaterows.Typically,thefirstrowinaCSVfilecontainsthenamesofthecolumnsforthedata. AndexampletabledatasetandthecorrespondingCSV-formatdataisshowninthediagrambelow. Comma-separatedvaluefiles,orCSVfiles,aresimpletextfileswherecommasandnewlinesareusedtodefinetabulardatainastructuredway. NotethatalmostanytabulardatacanbestoredinCSVformat–theformatispopularbecauseofitssimplicityandflexibility.Youcancreateatextfileinatexteditor,saveitwitha.csvextension,andopenthatfileinExcelorGoogleSheetstoseethetableform. OtherDelimiters/Separators–TSVfiles Thecommaseparationschemeisbyfarthemostpopularmethodofstoringtabulardataintextfiles. However,thechoiceofthe‘,’commacharactertodelimiterscolumns,however,isarbitrary,andcanbesubstitutedwhereneeded.Popularalternativesincludetab(“\t”)andsemi-colon(“;”).Tab-separatefilesareknownasTSV(Tab-SeparatedValue)files. WhenloadingdatawithPandas,theread_csvfunctionisusedforreadinganydelimitedtextfile,andbychangingthedelimiterusingthesep parameter. DelimitersinTextFields–Quotechar OnecomplicationincreatingCSVfilesisifyouhavecommas,semicolons,ortabsactuallyinoneofthetextfieldsthatyouwanttostore.Inthiscase,it’simportanttousea“quotecharacter”intheCSVfiletocreatethesefields. ThequotecharactercanbespecifiedinPandas.read_csvusingthequotechar argument.Bydefault(aswithmanysystems),it’ssetasthestandardquotationmarks(“).Anycommas(orotherdelimitersasdemonstratedbelow)thatoccurbetweentwoquotecharacterswillbeignoredascolumnseparators. Intheexampleshown,asemicolon-delimitedfile,withquotationmarksasaquotecharisloadedintoPandas,andshowninExcel.Theuseofthequotecharallowsthe“NickName”columntocontainsemicolonswithoutbeingsplitintomorecolumns. OtherthancommasinCSVfiles,Tab-separatedandSemicolon-separateddataispopularalso.Quotecharactersareusedifthedatainacolumnmaycontaintheseparatingcharacter.Inthiscase,the‘NickName’columncontainssemicoloncharacters,andsothiscolumnis“quoted”.Specifytheseparatorandquotecharacterinpandas.read_csv 3.Python–Paths,Folders,Files WhenyouspecifyafilenametoPandas.read_csv,Pythonwilllookinyour“currentworkingdirectory“.YourworkingdirectoryistypicallythedirectorythatyoustartedyourPythonprocessorJupyternotebookfrom. Pandassearchesyour‘currentworkingdirectory’forthefilenamethatyouspecifywhenopeningorloadingfiles.TheFileNotFoundErrorcanbeduetoamisspelledfilename,oranincorrectworkingdirectory. FindingyourPythonPath YourPythonpathcanbedisplayedusingthebuilt-inos module.TheOSmoduleisforoperatingsystemdependentfunctionalityintoPythonprogramsandscripts. Tofindyourcurrentworkingdirectory,thefunctionrequiredisos.getcwd().The  os.listdir() functioncanbeusedtodisplayallfilesinadirectory,whichisagoodchecktoseeiftheCSVfileyouareloadingisinthedirectoryasexpected. #Findoutyourcurrentworkingdirectory importos print(os.getcwd()) #Out:/Users/shane/Documents/blog #Displayallofthefilesfoundinyourcurrentworkingdirectory print(os.listdir(os.getcwd()) #Out:['test_delimted.ssv','CSVBlog.ipynb','test_data.csv'] Intheexampleabove,mycurrentworkingdirectoryisinthe‘/Users/Shane/Document/blog’directory.AnyfilesthatareplacesinthisdirectorywillbeimmediatelyavailabletothePythonfileopen()functionorthePandasreadcsvfunction. Insteadofmovingtherequireddatafilestoyourworkingdirectory,youcanalsochangeyourcurrentworkingdirectorytothedirectorywherethefilesresideusing os.chdir(). FileLoading:AbsoluteandRelativePaths Whenspecifyingfilenamestotheread_csvfunction,youcansupplybothabsoluteorrelativefilepaths. Arelativepath isthepathtothefileifyoustartfromyourcurrentworkingdirectory.Inrelativepaths,typicallythefilewillbeinasubdirectoryoftheworkingdirectoryandthepathwillnotstartwithadrivespecifier,e.g.(data/test_file.csv).Thecharacters‘..’areusedtomovetoaparentdirectoryinarelativepath.Anabsolutepath isthecompletepathfromthebaseofyourfilesystemtothefilethatyouwanttoload,e.g.c:/Documents/Shane/data/test_file.csv.Absolutepathswillstartwithadrivespecifier(c:/ord:/inWindows,or‘/’inMacorLinux) It’srecommendedandpreferredtouserelativepathswherepossibleinapplications,becauseabsolutepathsareunlikelytoworkondifferentcomputersduetodifferentdirectorystructures. LoadingthesamefilewithPandasread_csvusingrelativeandabsolutepaths.Relativepathsaredirectionstothefilestartingatyourcurrentworkingdirectory,whereabsolutepathsalwaysstartatthebaseofyourfilesystem. 4.PandasCSVFileLoadingErrors Themostcommonerror’syou’llgetwhileloadingdatafromCSVfilesintoPandaswillbe: FileNotFoundError: Fileb'filename.csv'doesnotexistAFileNotFounderroristypicallyanissuewithpathsetup,currentdirectory,orfilenameconfusion(fileextensioncanplayaparthere!)UnicodeDecodeError: 'utf-8'codeccan'tdecodebyteinposition:invalidcontinuationbyteAUnicodeDecodeErroristypicallycausedbynotspecifyingtheencodingofthefile,andhappenswhenyouhaveafilewithnon-standardcharacters.Foraquickfix,tryopeningthefileinSublimeText,andre-savingwithencoding‘UTF-8’.pandas.parser.CParserError:Errortokenizingdata.ParseErrorscanbecausedinunusualcircumstancestodowithyourdataformat–trytoaddtheparameter“engine=’python'”totheread_csvfunctioncall;thischangesthedatareadingfunctioninternallytoaslowerbutmorestablemethod. AdvancedReadCSVFiles TherearesomeadditionalflexibleparametersinthePandasread_csv()functionthatareusefultohaveinyourarsenalofdatasciencetechniques: SpecifyingDataTypes Asmentionedbefore,CSVfilesdonotcontainanytypeinformationfordata.Datatypesareinferredthroughexaminationofthetoprowsofthefile,whichcanleadtoerrors.Tomanuallyspecifythedatatypesfordifferentcolumns,the dtypeparametercanbeusedwithadictionaryofcolumnnamesanddatatypestobeapplied,forexample: dtype={"name":str,"age":np.int32}. Notethatfordatesanddatetimes,theformat,columns,andotherbehaviourcanbeadjustedusingparse_dates,date_parser,dayfirst,keep_date parameters. SkippingandPickingRowsandColumnsFromFile The nrowsparameterspecifieshowmanyrowsfromthetopofCSVfiletoread,whichisusefultotakeasampleofalargefilewithoutloadingcompletely.Similarlythe skiprows parameterallowsyoutospecifyrowstoleaveout,eitheratthestartofthefile(provideanint),orthroughoutthefile(providealistofrowindices).Similarly,the usecols parametercanbeusedtospecifywhichcolumnsinthedatatoload. CustomMissingValueSymbols WhendataisexportedtoCSVfromdifferentsystems,missingvaluescanbespecifiedwithdifferenttokens.The na_valuesparameterallowsyoutocustomisethecharactersthatarerecognisedasmissingvalues.ThedefaultvaluesinterpretedasNA/NaNare: ‘’,‘#N/A’,‘#N/AN/A’,‘#NA’,‘-1.#IND’,‘-1.#QNAN’,‘-NaN’,‘-nan’,‘1.#IND’,‘1.#QNAN’,‘N/A’,‘NA’,‘NULL’,‘NaN’,‘n/a’,‘nan’,‘null’. #AdvancedCSVloadingexample data=pd.read_csv( "data/files/complex_data_example.tsv",#relativepythonpathtosubdirectory sep='\t' #Tab-separatedvaluefile. quotechar="'", #singlequoteallowedasquotecharacter dtype={"salary":int}, #Parsethesalarycolumnasaninteger usecols=['name','birth_date','salary'].#Onlyloadthethreecolumnsspecified. parse_dates=['birth_date'], #Intepretthebirth_datecolumnasadate skiprows=10, #Skipthefirst10rowsofthefile na_values=['.','??'] #Takeany'.'or'??'valuesasNA )  CSVFormatAdvantagesandDisadvantages Aswithalltechnicaldecisions,storingyourdatainCSVformathasbothadvantagesanddisadvantages.Beawareofthepotentialpitfallsandissuesthatyouwillencounterasyouload,store,andexchangedatainCSVformat: Ontheplusside: CSVformatisuniversalandthedatacanbeloadedbyalmostanysoftware.CSVfilesaresimpletounderstandanddebugwithabasictexteditorCSVfilesarequicktocreateandloadintomemorybeforeanalysis. However,theCSVformathassomenegativesides: Thereisnodatatypeinformationstoredinthetextfile,alltyping(dates,intvsfloat,strings)areinferredfromthedataonly.There’snoformattingorlayoutinformationstorable–thingslikefonts,borders,columnwidthsettingsfromMicrosoftExcelwillbelost.Fileencodingscanbecomeaproblemiftherearenon-ASCIIcompatiblecharactersintextfields.CSVformatisinefficient;numbersarestoredascharactersratherthanbinaryvalues,whichiswasteful.YouwillfindhoweverthatyourCSVdatacompresseswellusingzipcompression. Asandaside,inanefforttocountersomeofthesedisadvantages,twoprominentdatasciencedevelopersinboththeRandPythonecosystems,WesMcKinneyandHadleyWickham,recentlyintroducedtheFeatherFormat,whichaimstobeafast,simple,open,flexibleandmulti-platformdataformatthatsupportsmultipledatatypesnatively. AdditionalReading OfficialPandasdocumentationfortheread_csvfunction.Python3Notesonfilepaths,workingdirectories,andusingtheOSmodule.DatacampTutorialonloadingCSVfiles,includingsomeadditionalOScommands.PythonHowLoadingCSVtutorial. Postnavigation ←PreviousPostNextPost→ Subscribe Notifyof newfollow-upcomments newrepliestomycomments Label {} [+] Name* Email* Website Label {} [+] Name* Email* Website 18Comments InlineFeedbacks Viewallcomments manjunaths 4yearsago hello,thearticleisreallygood i’mfacingaproblemwhileimportingthecsvfile.wheniimportthecsvfilethedatatypeofsomecolumnswillchangeandwontbethesameasitwasinthecsv. likenumericwillbechangedtoobjectorfloat.asihave100columnsicantchangeeachcolumnafterimporting plssuggesthowtoimportandpreventthechangeofdtypesofcoulmns 0 Reply manjunaths Replyto  manjunaths 4yearsago andihavesomeblankcellsinthosecolumnsinwhichthedatatypeischangingwhileimporting 0 Reply Manish 3yearsago Verynicewrite-up! 1 Reply ParasJain 3yearsago Helloallthearticleisreallygood, buthowtoexportthecontentofvariabledataintoanothercsv 0 Reply lol 3yearsago Stillgettingerror: parserError:Errortokenizingdata.Cerror:Expected1feildsinline3,saw37 0 Reply MasonPham Replyto  lol 3yearsago HelloAll,mycsvhavesomethinglikethis: “Alumina12″long” Usuallywithquotechar=‘”‘,Pandaswillignoresomethingwithinthedoublequotationbutinmycase,itwillonlytake“Alumina12”andskiptherestwhichcausetroubles.HowcanIwritethecodetoimportwithpandas? 0 Reply SorinCirco 3yearsago excellentmaterial! 0 Reply RyanHodder 3yearsago Heymate, Appreciatethearticle,wasamassivehelp!spentafewhoursscouringthewebforbasicread_csvproblemtroubleshooting.Thanksagain 0 Reply vicky 3yearsago alifesaver..readlotsoftutorialsbuttheydidnotshowhowtoactuallyloadthedata.thanks. 0 Reply suma 2yearsago Nicesolutioni’vegotfromthissite 0 Reply GolamMuktadir 2yearsago Thanksforyournicearticle 1 Reply NikolaGalvan 2yearsago Hithere!Thankyouforyourblogpost!IjuststartedusingpandasandwenloadingthecsvfileIgetthefollowingerror: TypeError:descriptor‘axes’for‘BlockManager’objectsdoesn’tapplyto‘SingleBlockManager’object Idon’tunderstandwhatIamdoingwrong… Haveyoueverencounteredthiserror? Thanksandhaveaniceday!! 0 Reply NikolaGalvan 2yearsago Hithereagain! IjustnoticedthattheerrorcamefromanoutdatedversionofPandas. Afterupdatingeverythingworksfine! Thanks,justwantedtoletyouknow!! 0 Reply TyroneFrielinghaus 1yearago Ireallylikedhowyouwentintodetail:Itrulyhatereadingexplanationsthatleaveoutcrucialinformationforunderstanding.It’smuchbettertobemoreverbosethannot!!Thanks! 1 Reply JamieHeiney 1yearago Great! 0 Reply Khan 1yearago Hi Soplainlyexplained.Lovethepost.Itsimplyworksforme. 0 Reply Fernando 1yearago IwastryingtoimportmycsvfileandIhadalotoferrors.However,thistutorialhelpedmeatosolvealltheerrorsigot 0 Reply Shane Author Replyto  Fernando 1yearago HiJuan–CSVfilesplayingwithPandascanbeanightmare.I’mgladthattheposthelpedyouout! 0 Reply   wpDiscuz GetDataUpdatesviaEmail Enteryouremailaddresstosubscribetothisblogandreceivenotificationsofnewpostsbyemail. EmailAddress Subscribe wpDiscuzInsert   LoadingComments...   WriteaComment... Email(Required) Name(Required) Website



請為這篇文章評分?