The Python Pandas read_csv function is used to read or load data from CSV files. We examine the comma-separated value format, tab-separated files, ...
Skiptocontent
CSV(comma-separatedvalue)filesareacommonfileformatfortransferringandstoringdata.Theabilitytoread,manipulate,andwritedatatoandfromCSVfilesusingPythonisakeyskilltomasterforanydatascientistorbusinessanalysis.Inthispost,we’llgooverwhatCSVfilesare,howtoreadCSVfilesintoPandasDataFrames,andhowtowriteDataFramesbacktoCSVfilespostanalysis.
PandasisthemostpopulardatamanipulationpackageinPython,andDataFramesarethePandasdatatypeforstoringtabular2Ddata.
TableOfContents
LoadCSVfilestoPythonPandas1.FileExtensionsandFileTypes2.DataRepresentationinCSVfilesOtherDelimiters/Separators–TSVfilesDelimitersinTextFields–Quotechar3.Python–Paths,Folders,FilesFindingyourPythonPathFileLoading:AbsoluteandRelativePaths4.PandasCSVFileLoadingErrorsAdvancedReadCSVFilesSpecifyingDataTypesSkippingandPickingRowsandColumnsFromFileCustomMissingValueSymbols CSVFormatAdvantagesandDisadvantagesAdditionalReading
LoadCSVfilestoPythonPandas
ThebasicprocessofloadingdatafromaCSVfileintoaPandasDataFrame(withallgoingwell)isachievedusingthe“read_csv”functioninPandas:
#LoadthePandaslibrarieswithalias'pd'
importpandasaspd
#Readdatafromfile'filename.csv'
#(inthesamedirectorythatyourpythonprocessisbased)
#Controldelimiters,rows,columnnameswithread_csv(seelater)
data=pd.read_csv("filename.csv")
#Previewthefirst5linesoftheloadeddata
data.head()
Whilethiscodeseemssimple,anunderstandingofthreefundamentalconceptsisrequiredtofullygraspanddebugtheoperationofthedataloadingprocedureifyourunintoissues:
Understandingfileextensionsandfiletypes–whatdothelettersCSVactuallymean?What’sthedifferencebetweena.csvfileanda.txtfile?UnderstandinghowdataisrepresentedinsideCSVfiles–ifyouopenaCSVfile,whatdoesthedataactuallylooklike?UnderstandingthePythonpathandhowtoreferenceafile –whatistheabsoluteandrelativepathtothefileyouareloading?Whatdirectoryareyouworkingin?CSVdataformatsanderrors–commonerrorswiththefunction.
Eachofthesetopicsisdiscussedbelow,andwefinishthistutorialbylookingatsomemoreadvancedCSVloadingmechanismsandgivingsomebroadadvantagesanddisadvantagesoftheCSVformat.
1.FileExtensionsandFileTypes
Thefirststeptoworkingwithcomma-separated-value(CSV)filesisunderstandingtheconceptoffiletypesandfileextensions.
Dataisstoredonyourcomputerinindividual“files”,orcontainers,eachwithadifferentname.Eachfilecontainsdataofdifferenttypes–theinternalsofaWorddocumentisquitedifferentfromtheinternalsofanimage.Computersdeterminehowtoreadfilesusingthe“fileextension”,thatisthecodethatfollowsthedot(“.”)inthefilename.So,afilenameistypicallyintheform“.”.Examples:
project1.DOCX–aMicrosoftWordfilecalledProject1.
shanes_file.TXT–asimpletextfilecalledshanes_file
IMG_5673.JPG–AnimagefilecalledIMG_5673.
Otherwellknownfiletypesandextensionsinclude:XLSX:Excel,PDF:PortableDocumentFormat,PNG–images,ZIP–compressedfileformat,GIF–animation,MPEG–video,MP3–musicetc.Seeacompletelistofextensionshere.
ACSVfileisafilewitha“.csv”fileextension,e.g.“data.csv”,“super_information.csv”.The“CSV”inthiscaseletsthecomputerknowthatthedatacontainedinthefileisin“commaseparatedvalue”format,whichwe’lldiscussbelow.
Fileextensionsarehiddenbydefaultonalotofoperatingsystems.Thefirststepthatanyself-respectingengineer,softwareengineer,ordatascientistwilldoonanewcomputeristoensurethatfileextensionsareshownintheirExplorer(Windows)orFinder(Mac)windows.
Folderwithfileextensionsshowing.BeforeworkingwithCSVfiles,ensurethatyoucanseeyourfileextensionsinyouroperatingsystem.Differentfilecontentsaredenotedbythefileextension,orlettersafterthedot,ofthefilename.e.g.TXTistext,DOCXisMicrosoftWord,PNGareimages,CSViscomma-separatedvaluedata.
Tocheckiffileextensionsareshowinginyoursystem,createanewtextdocumentwithNotepad(Windows)orTextEdit(Mac)andsaveittoafolderofyourchoice.Ifyoucan’tseethe“.txt”extensioninyourfolderwhenyouviewit,youwillhavetochangeyoursettings.
InMicrosoftWindows:Open ControlPanel> AppearanceandPersonalization. Now,clickonFolderOptionsorFileExplorerOption,asitisnowcalled>Viewtab. Inthistab,underAdvanceSettings,youwillseetheoptionHideextensionsforknownfiletypes.Uncheckthisoption andclickonApplyandOK.InMacOS:OpenFinder>Inmenu,clickFinder>Preferences,ClickAdvanced,Selectthecheckboxfor“Showallfilenameextensions”.
2.DataRepresentationinCSVfiles
A“CSV”file,thatis,afilewitha“csv”filetype,isabasictextfile.AnytexteditorsuchasNotePadonwindowsorTextEditonMac,canopenaCSVfileandshowthecontents.SublimeTextisawonderfulandmulti-functionaltexteditoroptionforanyplatform.
CSVisastandardforstoringtabulardataintextformat,wherecommasareusedtoseparatethedifferentcolumns,andnewlines(carriagereturn/pressenter)usedtoseparaterows.Typically,thefirstrowinaCSVfilecontainsthenamesofthecolumnsforthedata.
AndexampletabledatasetandthecorrespondingCSV-formatdataisshowninthediagrambelow.
Comma-separatedvaluefiles,orCSVfiles,aresimpletextfileswherecommasandnewlinesareusedtodefinetabulardatainastructuredway.
NotethatalmostanytabulardatacanbestoredinCSVformat–theformatispopularbecauseofitssimplicityandflexibility.Youcancreateatextfileinatexteditor,saveitwitha.csvextension,andopenthatfileinExcelorGoogleSheetstoseethetableform.
OtherDelimiters/Separators–TSVfiles
Thecommaseparationschemeisbyfarthemostpopularmethodofstoringtabulardataintextfiles.
However,thechoiceofthe‘,’commacharactertodelimiterscolumns,however,isarbitrary,andcanbesubstitutedwhereneeded.Popularalternativesincludetab(“\t”)andsemi-colon(“;”).Tab-separatefilesareknownasTSV(Tab-SeparatedValue)files.
WhenloadingdatawithPandas,theread_csvfunctionisusedforreadinganydelimitedtextfile,andbychangingthedelimiterusingthesep parameter.
DelimitersinTextFields–Quotechar
OnecomplicationincreatingCSVfilesisifyouhavecommas,semicolons,ortabsactuallyinoneofthetextfieldsthatyouwanttostore.Inthiscase,it’simportanttousea“quotecharacter”intheCSVfiletocreatethesefields.
ThequotecharactercanbespecifiedinPandas.read_csvusingthequotechar argument.Bydefault(aswithmanysystems),it’ssetasthestandardquotationmarks(“).Anycommas(orotherdelimitersasdemonstratedbelow)thatoccurbetweentwoquotecharacterswillbeignoredascolumnseparators.
Intheexampleshown,asemicolon-delimitedfile,withquotationmarksasaquotecharisloadedintoPandas,andshowninExcel.Theuseofthequotecharallowsthe“NickName”columntocontainsemicolonswithoutbeingsplitintomorecolumns.
OtherthancommasinCSVfiles,Tab-separatedandSemicolon-separateddataispopularalso.Quotecharactersareusedifthedatainacolumnmaycontaintheseparatingcharacter.Inthiscase,the‘NickName’columncontainssemicoloncharacters,andsothiscolumnis“quoted”.Specifytheseparatorandquotecharacterinpandas.read_csv
3.Python–Paths,Folders,Files
WhenyouspecifyafilenametoPandas.read_csv,Pythonwilllookinyour“currentworkingdirectory“.YourworkingdirectoryistypicallythedirectorythatyoustartedyourPythonprocessorJupyternotebookfrom.
Pandassearchesyour‘currentworkingdirectory’forthefilenamethatyouspecifywhenopeningorloadingfiles.TheFileNotFoundErrorcanbeduetoamisspelledfilename,oranincorrectworkingdirectory.
FindingyourPythonPath
YourPythonpathcanbedisplayedusingthebuilt-inos module.TheOSmoduleisforoperatingsystemdependentfunctionalityintoPythonprogramsandscripts.
Tofindyourcurrentworkingdirectory,thefunctionrequiredisos.getcwd().The os.listdir() functioncanbeusedtodisplayallfilesinadirectory,whichisagoodchecktoseeiftheCSVfileyouareloadingisinthedirectoryasexpected.
#Findoutyourcurrentworkingdirectory
importos
print(os.getcwd())
#Out:/Users/shane/Documents/blog
#Displayallofthefilesfoundinyourcurrentworkingdirectory
print(os.listdir(os.getcwd())
#Out:['test_delimted.ssv','CSVBlog.ipynb','test_data.csv']
Intheexampleabove,mycurrentworkingdirectoryisinthe‘/Users/Shane/Document/blog’directory.AnyfilesthatareplacesinthisdirectorywillbeimmediatelyavailabletothePythonfileopen()functionorthePandasreadcsvfunction.
Insteadofmovingtherequireddatafilestoyourworkingdirectory,youcanalsochangeyourcurrentworkingdirectorytothedirectorywherethefilesresideusing os.chdir().
FileLoading:AbsoluteandRelativePaths
Whenspecifyingfilenamestotheread_csvfunction,youcansupplybothabsoluteorrelativefilepaths.
Arelativepath isthepathtothefileifyoustartfromyourcurrentworkingdirectory.Inrelativepaths,typicallythefilewillbeinasubdirectoryoftheworkingdirectoryandthepathwillnotstartwithadrivespecifier,e.g.(data/test_file.csv).Thecharacters‘..’areusedtomovetoaparentdirectoryinarelativepath.Anabsolutepath isthecompletepathfromthebaseofyourfilesystemtothefilethatyouwanttoload,e.g.c:/Documents/Shane/data/test_file.csv.Absolutepathswillstartwithadrivespecifier(c:/ord:/inWindows,or‘/’inMacorLinux)
It’srecommendedandpreferredtouserelativepathswherepossibleinapplications,becauseabsolutepathsareunlikelytoworkondifferentcomputersduetodifferentdirectorystructures.
LoadingthesamefilewithPandasread_csvusingrelativeandabsolutepaths.Relativepathsaredirectionstothefilestartingatyourcurrentworkingdirectory,whereabsolutepathsalwaysstartatthebaseofyourfilesystem.
4.PandasCSVFileLoadingErrors
Themostcommonerror’syou’llgetwhileloadingdatafromCSVfilesintoPandaswillbe:
FileNotFoundError: Fileb'filename.csv'doesnotexistAFileNotFounderroristypicallyanissuewithpathsetup,currentdirectory,orfilenameconfusion(fileextensioncanplayaparthere!)UnicodeDecodeError: 'utf-8'codeccan'tdecodebyteinposition:invalidcontinuationbyteAUnicodeDecodeErroristypicallycausedbynotspecifyingtheencodingofthefile,andhappenswhenyouhaveafilewithnon-standardcharacters.Foraquickfix,tryopeningthefileinSublimeText,andre-savingwithencoding‘UTF-8’.pandas.parser.CParserError:Errortokenizingdata.ParseErrorscanbecausedinunusualcircumstancestodowithyourdataformat–trytoaddtheparameter“engine=’python'”totheread_csvfunctioncall;thischangesthedatareadingfunctioninternallytoaslowerbutmorestablemethod.
AdvancedReadCSVFiles
TherearesomeadditionalflexibleparametersinthePandasread_csv()functionthatareusefultohaveinyourarsenalofdatasciencetechniques:
SpecifyingDataTypes
Asmentionedbefore,CSVfilesdonotcontainanytypeinformationfordata.Datatypesareinferredthroughexaminationofthetoprowsofthefile,whichcanleadtoerrors.Tomanuallyspecifythedatatypesfordifferentcolumns,the dtypeparametercanbeusedwithadictionaryofcolumnnamesanddatatypestobeapplied,forexample: dtype={"name":str,"age":np.int32}.
Notethatfordatesanddatetimes,theformat,columns,andotherbehaviourcanbeadjustedusingparse_dates,date_parser,dayfirst,keep_date parameters.
SkippingandPickingRowsandColumnsFromFile
The nrowsparameterspecifieshowmanyrowsfromthetopofCSVfiletoread,whichisusefultotakeasampleofalargefilewithoutloadingcompletely.Similarlythe skiprows parameterallowsyoutospecifyrowstoleaveout,eitheratthestartofthefile(provideanint),orthroughoutthefile(providealistofrowindices).Similarly,the usecols parametercanbeusedtospecifywhichcolumnsinthedatatoload.
CustomMissingValueSymbols
WhendataisexportedtoCSVfromdifferentsystems,missingvaluescanbespecifiedwithdifferenttokens.The na_valuesparameterallowsyoutocustomisethecharactersthatarerecognisedasmissingvalues.ThedefaultvaluesinterpretedasNA/NaNare: ‘’,‘#N/A’,‘#N/AN/A’,‘#NA’,‘-1.#IND’,‘-1.#QNAN’,‘-NaN’,‘-nan’,‘1.#IND’,‘1.#QNAN’,‘N/A’,‘NA’,‘NULL’,‘NaN’,‘n/a’,‘nan’,‘null’.
#AdvancedCSVloadingexample
data=pd.read_csv(
"data/files/complex_data_example.tsv",#relativepythonpathtosubdirectory
sep='\t' #Tab-separatedvaluefile.
quotechar="'", #singlequoteallowedasquotecharacter
dtype={"salary":int}, #Parsethesalarycolumnasaninteger
usecols=['name','birth_date','salary'].#Onlyloadthethreecolumnsspecified.
parse_dates=['birth_date'], #Intepretthebirth_datecolumnasadate
skiprows=10, #Skipthefirst10rowsofthefile
na_values=['.','??'] #Takeany'.'or'??'valuesasNA
)
CSVFormatAdvantagesandDisadvantages
Aswithalltechnicaldecisions,storingyourdatainCSVformathasbothadvantagesanddisadvantages.Beawareofthepotentialpitfallsandissuesthatyouwillencounterasyouload,store,andexchangedatainCSVformat:
Ontheplusside:
CSVformatisuniversalandthedatacanbeloadedbyalmostanysoftware.CSVfilesaresimpletounderstandanddebugwithabasictexteditorCSVfilesarequicktocreateandloadintomemorybeforeanalysis.
However,theCSVformathassomenegativesides:
Thereisnodatatypeinformationstoredinthetextfile,alltyping(dates,intvsfloat,strings)areinferredfromthedataonly.There’snoformattingorlayoutinformationstorable–thingslikefonts,borders,columnwidthsettingsfromMicrosoftExcelwillbelost.Fileencodingscanbecomeaproblemiftherearenon-ASCIIcompatiblecharactersintextfields.CSVformatisinefficient;numbersarestoredascharactersratherthanbinaryvalues,whichiswasteful.YouwillfindhoweverthatyourCSVdatacompresseswellusingzipcompression.
Asandaside,inanefforttocountersomeofthesedisadvantages,twoprominentdatasciencedevelopersinboththeRandPythonecosystems,WesMcKinneyandHadleyWickham,recentlyintroducedtheFeatherFormat,whichaimstobeafast,simple,open,flexibleandmulti-platformdataformatthatsupportsmultipledatatypesnatively.
AdditionalReading
OfficialPandasdocumentationfortheread_csvfunction.Python3Notesonfilepaths,workingdirectories,andusingtheOSmodule.DatacampTutorialonloadingCSVfiles,includingsomeadditionalOScommands.PythonHowLoadingCSVtutorial.
Postnavigation
←PreviousPostNextPost→
Subscribe
Notifyof
newfollow-upcomments
newrepliestomycomments
Label
{}
[+]
Name*
Email*
Website
Label
{}
[+]
Name*
Email*
Website
18Comments
InlineFeedbacks
Viewallcomments
manjunaths
4yearsago
hello,thearticleisreallygood
i’mfacingaproblemwhileimportingthecsvfile.wheniimportthecsvfilethedatatypeofsomecolumnswillchangeandwontbethesameasitwasinthecsv.
likenumericwillbechangedtoobjectorfloat.asihave100columnsicantchangeeachcolumnafterimporting
plssuggesthowtoimportandpreventthechangeofdtypesofcoulmns
0
Reply
manjunaths
Replyto
manjunaths
4yearsago
andihavesomeblankcellsinthosecolumnsinwhichthedatatypeischangingwhileimporting
0
Reply
Manish
3yearsago
Verynicewrite-up!
1
Reply
ParasJain
3yearsago
Helloallthearticleisreallygood,
buthowtoexportthecontentofvariabledataintoanothercsv
0
Reply
lol
3yearsago
Stillgettingerror:
parserError:Errortokenizingdata.Cerror:Expected1feildsinline3,saw37
0
Reply
MasonPham
Replyto
lol
3yearsago
HelloAll,mycsvhavesomethinglikethis:
“Alumina12″long”
Usuallywithquotechar=‘”‘,Pandaswillignoresomethingwithinthedoublequotationbutinmycase,itwillonlytake“Alumina12”andskiptherestwhichcausetroubles.HowcanIwritethecodetoimportwithpandas?
0
Reply
SorinCirco
3yearsago
excellentmaterial!
0
Reply
RyanHodder
3yearsago
Heymate,
Appreciatethearticle,wasamassivehelp!spentafewhoursscouringthewebforbasicread_csvproblemtroubleshooting.Thanksagain
0
Reply
vicky
3yearsago
alifesaver..readlotsoftutorialsbuttheydidnotshowhowtoactuallyloadthedata.thanks.
0
Reply
suma
2yearsago
Nicesolutioni’vegotfromthissite
0
Reply
GolamMuktadir
2yearsago
Thanksforyournicearticle
1
Reply
NikolaGalvan
2yearsago
Hithere!Thankyouforyourblogpost!IjuststartedusingpandasandwenloadingthecsvfileIgetthefollowingerror:
TypeError:descriptor‘axes’for‘BlockManager’objectsdoesn’tapplyto‘SingleBlockManager’object
Idon’tunderstandwhatIamdoingwrong…
Haveyoueverencounteredthiserror?
Thanksandhaveaniceday!!
0
Reply
NikolaGalvan
2yearsago
Hithereagain!
IjustnoticedthattheerrorcamefromanoutdatedversionofPandas.
Afterupdatingeverythingworksfine!
Thanks,justwantedtoletyouknow!!
0
Reply
TyroneFrielinghaus
1yearago
Ireallylikedhowyouwentintodetail:Itrulyhatereadingexplanationsthatleaveoutcrucialinformationforunderstanding.It’smuchbettertobemoreverbosethannot!!Thanks!
1
Reply
JamieHeiney
1yearago
Great!
0
Reply
Khan
1yearago
Hi
Soplainlyexplained.Lovethepost.Itsimplyworksforme.
0
Reply
Fernando
1yearago
IwastryingtoimportmycsvfileandIhadalotoferrors.However,thistutorialhelpedmeatosolvealltheerrorsigot
0
Reply
Shane
Author
Replyto
Fernando
1yearago
HiJuan–CSVfilesplayingwithPandascanbeanightmare.I’mgladthattheposthelpedyouout!
0
Reply
wpDiscuz
GetDataUpdatesviaEmail
Enteryouremailaddresstosubscribetothisblogandreceivenotificationsofnewpostsbyemail.
EmailAddress
Subscribe
wpDiscuzInsert
LoadingComments...
WriteaComment...
Email(Required)
Name(Required)
Website