3 Best Ways to Read CSV in Python

CSV (comma-separated values) is one of the most commonly used file formats for storing data. It is easy to read and is supported by many everyday applications such as Google Sheets, other spreadsheet programs, databases, and text editors.

So applications inevitably need to read data from and write data to CSV files. In this detailed guide, we will focus on the different ways to read CSV in Python. Writing data to CSV in Python is already covered separately with detailed examples.

Working with CSV Data in Python

Python provides a rich library set to work with files. These are the popular libraries we use to read data from CSV in Python.

  1. csv library
  2. numpy library
  3. pandas library

Given a lot of options, one would generally be inclined towards using the library that has simple code. However, as the data grows, that might not work as expected.

So it is always a good idea to assess the different options and choose the best that suits your requirement.

We have compiled different libraries and their pros and cons in the coming sections. You can go through the quick summary to understand which method to opt for.

Also, your specific use case might require you to skip the headers, read only one row or one column. We have listed down all the different possible cases with code examples.

Best Way to Read CSV Data in Python

If you are working with small to medium-sized data sets, the csv module is sufficient for reading and writing CSV files. However, if you want to work with large data sets, then pandas and NumPy are good choices.

NumPy works well when the data is mostly numeric and you want plain arrays. But if you want more advanced data manipulation functions and elegant handling of missing data, you can opt for the pandas module to read data from CSV in Python.

Throughout this article, we will be using the below movie CSV file as an example.

id,name,year,rating
1,"The Shawshank Redemption",1994,9.2
2,"The Godfather",1972,9.2
3,"The Dark Knight",2008,9.0
4,"The Godfather Part II",1974,9.0
5,"12 Angry Men",1957,9.0
6,"Schindlers List",1993,8.9
7,"The Lord of the Rings: The Return of the King",2003,8.9
8,"Pulp Fiction",1994,8.8
9,"The Lord of the Rings: The Fellowship of the Ring",2001,8.8
10,"The Good the Bad and the Ugly",1966,8.8

Different ways to read CSV in Python

We have listed 3 different ways to read CSV data in Python. These are the most common and efficient ways. However, if you don't want to use any libraries, you can also read the file as plain text and format it using Python code.
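For the plain-text route mentioned above, here is a minimal sketch. It assumes that no field contains a comma inside quotes (true for our movie file, but not for CSV in general), and the SAMPLE text and the parse_plain_text() helper are names made up for this illustration. An in-memory file is used so the snippet runs on its own.

```python
import io

# Inline sample standing in for movies.csv (illustrative data only)
SAMPLE = (
    'id,name,year,rating\n'
    '1,"The Shawshank Redemption",1994,9.2\n'
    '2,"The Godfather",1972,9.2\n'
)


def parse_plain_text(text_file):
    """Split each line on commas and strip surrounding double quotes.

    This naive approach breaks if a quoted field contains a comma.
    """
    rows = []
    for line in text_file:
        line = line.strip()
        if not line:
            # ignore empty lines
            continue
        rows.append([field.strip('"') for field in line.split(',')])
    return rows


rows = parse_plain_text(io.StringIO(SAMPLE))
print(rows)
```

For anything beyond simple files like this one, the csv module below is the safer choice because it understands quoting.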

Method 1. Using the CSV module

This technique uses Python's built-in csv library to handle read and write operations.

Read CSV file into a list of lists

import csv

#open file
with open('movies.csv') as file:

    #reader function returns reader object which is an iterable type
    file_reader = csv.reader(file, delimiter=',')
    res = []

    #each will come as list of strings based on delimiter
    for line in file_reader:
        res.append(line)

print(res)

In the above code snippet, we import the csv module, whose reader() function reads the contents of a CSV file. The reader() function accepts the below parameters –

  1. csvfile – The file object (or any other iterable of strings) to read from.
  2. delimiter – To specify the delimiter of the data (whether it is a comma or tab or any other specific delimiter). By default, it takes a comma value.
  3. Other formatting parameters like escapechar, quotechar, skipinitialspace, strict are allowed. Here is more about these parameters – CSV dialects and formatting parameters.

This function returns a reader object, which can be iterated over and formatted as we like. Make sure the file path is correct; otherwise, you will end up with a FileNotFoundError: [Errno 2] No such file or directory exception.
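To make the delimiter parameter concrete, here is a small sketch that reads tab-separated data and skips the header row with next(). The TSV_SAMPLE text is made-up data for this illustration, and an in-memory file replaces the real file so the snippet is self-contained.

```python
import csv
import io

# Made-up tab-separated data standing in for a real file
TSV_SAMPLE = 'id\tname\tyear\n1\tThe Godfather\t1972\n2\tPulp Fiction\t1994\n'

with io.StringIO(TSV_SAMPLE) as file:
    # tell the reader that fields are separated by tabs, not commas
    reader = csv.reader(file, delimiter='\t')

    # the reader is an iterator, so next() consumes the header row
    header = next(reader)

    # the remaining rows are the data records, each a list of strings
    records = list(reader)

print(header)
print(records)
```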

Load CSV file into a list of dictionaries

If you want to read the CSV file to the Python dictionaries list, use the below code snippet.

import csv

# Make sure the file path is correct, otherwise you will end up with FileNotFoundError
with open('movies.csv') as file:

    # DictReader returns an iterable that yields one dict per row, keyed by the header names
    file_reader = csv.DictReader(file)
    res = []
    # each line comes as a dict mapping column name to string value
    for line in file_reader:
        res.append(line)

print(res)

Load CSV file into custom types

Custom types are often the preferable way to represent data in code. Here is the code thread to read CSV into Python custom objects.

import csv


class Movie:
    def __init__(self, id, name, year, rating):
        self.id = id
        self.name = name
        self.year = year
        self.rating = rating

    def __repr__(self):
        # use the instance attributes; every field is a string as read from the CSV
        return 'Movie(' + self.id + ',' + self.name + ',' + self.year + ',' + self.rating + ')'


# Make sure the file path is correct, otherwise you will end up with FileNotFoundError
with open('movies.csv') as file:
    # DictReader yields one dict per row, keyed by the header names
    file_reader = csv.DictReader(file)
    movies = []
    # each line comes as a dict, which we convert to the Movie custom type
    for line in file_reader:
        id = line['id']
        name = line['name']
        year = line['year']
        rating = line['rating']
        movies.append(Movie(id, name, year, rating))

print(movies)

We have defined a custom type Movie in the above code. After reading the CSV data using the csv module, we extract each field from the data and create custom Movie objects.
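The csv module returns every field as a string, so a custom type is also a natural place to convert values. Below is one possible variation using a dataclass. The TypedMovie class, the load_typed_movies() helper, and the inline SAMPLE data are assumptions made for this sketch, not part of the example above.

```python
import csv
import io
from dataclasses import dataclass

# Inline sample standing in for movies.csv (illustrative data only)
SAMPLE = (
    'id,name,year,rating\n'
    '1,"The Shawshank Redemption",1994,9.2\n'
    '2,"The Godfather",1972,9.2\n'
)


@dataclass
class TypedMovie:
    id: int
    name: str
    year: int
    rating: float


def load_typed_movies(csv_file):
    """Convert each DictReader row into a TypedMovie with real numeric types."""
    return [
        TypedMovie(
            id=int(row['id']),
            name=row['name'],
            year=int(row['year']),
            rating=float(row['rating']),
        )
        for row in csv.DictReader(csv_file)
    ]


typed_movies = load_typed_movies(io.StringIO(SAMPLE))
print(typed_movies[0])
```

A dataclass generates __init__ and __repr__ for us, which avoids hand-written string concatenation.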

Load CSV: only one column

Another common use case while getting the data from files is to read only one or a couple of columns of data instead of loading all the data from the file. Here is the code for this.

import csv

# Make sure the file path is correct, otherwise you will end up with FileNotFoundError
def get_movie_names():
    with open('movies.csv') as file:
        # reader function returns reader object which is an iterable type
        file_reader = csv.reader(file)
        movie_names = []

        # each line comes as a list of strings; index 1 is the second column, name
        # (the header row is not skipped, so the first element is the column title)
        for line in file_reader:
            movie_names.append(line[1])
        return movie_names


print(get_movie_names())

In this code thread, instead of keeping the entire line, we take only index 1 of each row (the second column, name, in the example CSV). Because the header row is not skipped, the first element of the result is the column title, name.
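If you would rather pick the column by its header name than by position, DictReader can do that and it leaves the header row out of the result. This is a sketch under our own naming: get_column() and SAMPLE are hypothetical, and an in-memory file stands in for movies.csv.

```python
import csv
import io

# Inline sample standing in for movies.csv (illustrative data only)
SAMPLE = (
    'id,name,year,rating\n'
    '1,"The Shawshank Redemption",1994,9.2\n'
    '2,"The Godfather",1972,9.2\n'
    '3,"The Dark Knight",2008,9.0\n'
)


def get_column(csv_file, column_name):
    """Return the values of one column, selected by its header name."""
    # DictReader consumes the header row and uses it as the dict keys
    return [row[column_name] for row in csv.DictReader(csv_file)]


names = get_column(io.StringIO(SAMPLE), 'name')
print(names)
```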

Load CSV: only one cell

If we have a requirement to read only one cell from the CSV file, use the below code thread.

import csv

# Make sure the file path is correct, otherwise you will end up with FileNotFoundError
def get_ith_movie_name(index):
    with open('movies.csv') as file:
        # reader function returns reader object which is an iterable type
        file_reader = csv.reader(file)
        movie_names = []

        # each line comes as a list of strings; index 1 is the second column, name
        # (the header row counts as index 0 in the resulting list)
        for line in file_reader:
            movie_names.append(line[1])
        return movie_names[index]


print(get_ith_movie_name(1))

Load CSV: only one row

If you want to import only one row from the CSV file in Python, use the below code thread –

import csv

# Make sure the file path is correct, otherwise you will end up with FileNotFoundError
def get_ith_movie(index):
    with open('movies.csv') as file:
        # reader function returns reader object which is an iterable type
        file_reader = csv.reader(file)
        movies = []
        # each line will come as list of strings based on delimiter
        for line in file_reader:
            movies.append(line)

        #we are returning only ith list or row
        return movies[index]


print(get_ith_movie(1))

Here, in the above code, we are loading all the data of a CSV file and then using the index to read the required row from the data.
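Loading every row just to return one can be wasteful for bigger files. As an alternative sketch, itertools.islice lets us stop reading as soon as the wanted row is reached. The get_row() helper and SAMPLE data are hypothetical names for this illustration. As in the code above, the header row counts as index 0.

```python
import csv
import io
from itertools import islice

# Inline sample standing in for movies.csv (illustrative data only)
SAMPLE = (
    'id,name,year,rating\n'
    '1,"The Shawshank Redemption",1994,9.2\n'
    '2,"The Godfather",1972,9.2\n'
)


def get_row(csv_file, index):
    """Return the row at `index` without storing the other rows."""
    reader = csv.reader(csv_file)
    # islice skips the first `index` rows and yields just the next one;
    # next() falls back to None when the file has fewer rows
    return next(islice(reader, index, index + 1), None)


row = get_row(io.StringIO(SAMPLE), 1)
print(row)

# a single cell is then just an index into that row
print(row[1])
```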


Method 2. Use the NumPy module

The NumPy module is designed to work efficiently with large data sets. It doesn't come built into Python, so we need to install the NumPy module separately.

Pre-requisite: Set up numpy module with the below command.

pip3 install numpy

Below is the code to read the CSV file in Python using NumPy module.

import numpy as np

movies_list = np.loadtxt('movies.csv', delimiter=',', skiprows=1, dtype=[('id', 'i4'), ('name', 'U50'), ('year', 'i4'), ('rating', 'f8')])

print(movies_list)

Here NumPy's loadtxt() method takes the below params –

  1. fname (Mandatory): The file name, path, or file-like object to read.
  2. delimiter (Optional): The delimiter of the data in the CSV file. By default, fields are split on whitespace, so for CSV data you need to pass delimiter=','.
  3. dtype (Optional): dtype can be used to mention the data type of each field in the CSV format. By default, loadtxt() tries to convert all the fields to the float data type. We need to specify the data type for each field to avoid any exceptions.
  4. usecols (Optional): This parameter specifies which columns to load from the CSV. By default, it takes all the columns.
  5. skiprows (Optional): If the CSV file contains field names in the first row, you can skip that row by setting this value to 1. The default value is 0, which skips nothing.
  6. encoding (Optional): Specifies the encoding used to decode the file. Older NumPy releases used 'bytes' as the default; NumPy 2.0 and later default to None.

The loadtxt() function returns a NumPy array. We can use regular Python functions to convert this NumPy array to lists or dictionaries.
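As a rough illustration of that conversion, the sketch below loads a tiny in-memory sample into a structured array and turns it into a list of tuples and a list of dictionaries. The SAMPLE text is made-up data, and the names are left unquoted here to keep the example simple.

```python
import io

import numpy as np

# Made-up sample data; names are unquoted to keep the example simple
SAMPLE = 'id,name,year,rating\n1,The Godfather,1972,9.2\n2,Pulp Fiction,1994,8.8\n'

movies = np.loadtxt(
    io.StringIO(SAMPLE),
    delimiter=',',
    skiprows=1,  # skip the header row
    dtype=[('id', 'i4'), ('name', 'U50'), ('year', 'i4'), ('rating', 'f8')],
)

# tolist() converts each record to a tuple of plain Python values
as_tuples = movies.tolist()

# pair the field names from the dtype with each record to build dictionaries
as_dicts = [dict(zip(movies.dtype.names, record)) for record in as_tuples]

# a single field can be pulled out by name
years = movies['year'].tolist()

print(as_tuples)
print(as_dicts)
print(years)
```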

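Here is the output of the above code. Note that loadtxt() keeps the double quotes as part of each name, and the ninth title loses its closing quote because the U50 type truncates strings to 50 characters.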

[( 1, '"The Shawshank Redemption"', 1994, 9.2)
 ( 2, '"The Godfather"', 1972, 9.2) ( 3, '"The Dark Knight"', 2008, 9. )
 ( 4, '"The Godfather Part II"', 1974, 9. )
 ( 5, '"12 Angry Men"', 1957, 9. ) ( 6, '"Schindlers List"', 1993, 8.9)
 ( 7, '"The Lord of the Rings: The Return of the King"', 2003, 8.9)
 ( 8, '"Pulp Fiction"', 1994, 8.8)
 ( 9, '"The Lord of the Rings: The Fellowship of the Ring', 2001, 8.8)
 (10, '"The Good the Bad and the Ugly"', 1966, 8.8)]

To read only specific columns:

We can use the usecols parameter to specify which columns we want to load from the CSV file. Here is the code thread for this, which loads only the id and name columns –

import numpy as np

data = np.loadtxt('movies.csv', delimiter=',', skiprows=1, dtype=[('id', 'i4'), ('name', 'U50')], usecols=(0, 1))

print(data)

Output:

[( 1, '"The Shawshank Redemption"') ( 2, '"The Godfather"')
 ( 3, '"The Dark Knight"') ( 4, '"The Godfather Part II"')
 ( 5, '"12 Angry Men"') ( 6, '"Schindlers List"')
 ( 7, '"The Lord of the Rings: The Return of the King"')
 ( 8, '"Pulp Fiction"')
 ( 9, '"The Lord of the Rings: The Fellowship of the Ring')
 (10, '"The Good the Bad and the Ugly"')]
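If you need exactly one column, usecols also accepts a single integer, and the result is then a plain one-dimensional array. Here is a small sketch with made-up in-memory data (SAMPLE is our own name) that loads only the year column.

```python
import io

import numpy as np

# Made-up sample data standing in for movies.csv
SAMPLE = 'id,name,year,rating\n1,The Godfather,1972,9.2\n2,Pulp Fiction,1994,8.8\n'

# usecols=2 selects only the third column (year); dtype applies to that column
years = np.loadtxt(
    io.StringIO(SAMPLE),
    delimiter=',',
    skiprows=1,
    usecols=2,
    dtype='i4',
)

print(years)
```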

Read CSV: only one row

To read only one row from a CSV file, you can use indexing on the NumPy array to get the required row.

import numpy as np

data = np.loadtxt('movies.csv', delimiter=',', skiprows=1, dtype=[('id', 'i4'), ('name', 'U50'), ('year', 'i4'), ('rating', 'f8')])

#Get the fifth record using index 4
fifth_record = data[4]

print(fifth_record)

Output:

(5, '"12 Angry Men"', 1957, 9.0)


Method 3. Using the Pandas module

Similar to the NumPy module, we can also use Pandas to read and write data to CSV files.

The pandas library provides more sophisticated functions to manipulate the data and handle missing or malformed data in the data set. It can also handle automatic type conversions and has advanced methods to group or aggregate the data.

Pre-requisite: You should install the Pandas library if you haven’t already.

pip3 install pandas

Now, let’s see how we read data from a CSV file using read_csv() function in the pandas module.

Read CSV into Dataframe

import pandas as pd

# Make sure the file path is correct, otherwise you will end up with FileNotFoundError
def read_csv(file_name, delimiter=','):
    # pd.read_csv returns a pandas DataFrame object
    data_frame = pd.read_csv(file_name, sep=delimiter)

    print('type:', type(data_frame))
    return data_frame

print(read_csv('movies.csv'))

As written in the above code, the read_csv() function takes the file name as an argument and reads the CSV data into a <class 'pandas.core.frame.DataFrame'> object.

Output:

   id                                               name  year  rating
0   1                           The Shawshank Redemption  1994     9.2
1   2                                      The Godfather  1972     9.2
2   3                                    The Dark Knight  2008     9.0
3   4                              The Godfather Part II  1974     9.0
4   5                                       12 Angry Men  1957     9.0
5   6                                    Schindlers List  1993     8.9
6   7      The Lord of the Rings: The Return of the King  2003     8.9
7   8                                       Pulp Fiction  1994     8.8
8   9  The Lord of the Rings: The Fellowship of the Ring  2001     8.8
9  10                      The Good the Bad and the Ugly  1966     8.8

Here are a few important arguments that the read_csv() function supports. The complete list of arguments that the pandas read_csv() function accepts is given in this guide – pandas module.

  1. filepath_or_buffer (Mandatory) – Provide the CSV file path
  2. sep (Optional) – This argument is used for specifying the separator for fields in the CSV file. By default, it is set to a comma.
  3. header (Optional) – Row number where the field names are present. The default is 'infer', which behaves like 0 (the first row) when no names argument is passed.
  4. usecols (Optional) – To read only the specified columns from the data. It takes all the columns if nothing is specified.
  5. skipinitialspace (Optional) – If there are any spaces after the delimiter in fields, we can use this to strip them. It is set to False by default.
  6. low_memory (Optional) – Processes the file in chunks internally, resulting in lower memory use while reading the data from the CSV file. It is set to True if not specified.
  7. on_bad_lines (Optional) – Controls what happens when a record has too many fields: 'error' raises an exception (the default behavior), 'warn' skips the line with a warning, and 'skip' drops it silently. It replaces the older error_bad_lines and warn_bad_lines parameters, which were deprecated in pandas 1.3 and removed in pandas 2.0.
  8. skiprows (Optional) – To skip initial N rows while processing the data.
read_csv() function – Complete Parameters List

pandas.read_csv(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None,
    header='infer', names=_NoDefault.no_default, index_col=None, usecols=None,
    squeeze=None, prefix=_NoDefault.no_default, mangle_dupe_cols=True, dtype=None,
    engine=None, converters=None, true_values=None, false_values=None,
    skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None,
    keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True,
    parse_dates=None, infer_datetime_format=False, keep_date_col=False,
    date_parser=None, dayfirst=False, cache_dates=True, iterator=False,
    chunksize=None, compression='infer', thousands=None, decimal='.',
    lineterminator=None, quotechar='"', quoting=0, doublequote=True,
    escapechar=None, comment=None, encoding=None, encoding_errors='strict',
    dialect=None, error_bad_lines=None, warn_bad_lines=None, on_bad_lines=None,
    delim_whitespace=False, low_memory=True, memory_map=False,
    float_precision=None, storage_options=None)

Read CSV without Headers

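We can use the skiprows argument of the read_csv() function to skip the first row if it contains the headers. Note that with skiprows=1 alone, pandas treats the next row as the header, which is why the first movie appears as the column labels in the output below. Pass header=None as well if you want that row to stay in the data.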

import pandas as pd


# Make sure the file path is correct, otherwise you will end up with FileNotFoundError
def read_csv(delimiter=','):
    # skiprows=1 drops the header line; pd.read_csv returns a pandas DataFrame object
    data_frame = pd.read_csv('movies.csv', sep=delimiter, skiprows=1)
    print('type:', type(data_frame))
    return data_frame


print(read_csv())

Output:

    1                           The Shawshank Redemption  1994  9.2
0   2                                      The Godfather  1972  9.2
1   3                                    The Dark Knight  2008  9.0
2   4                              The Godfather Part II  1974  9.0
3   5                                       12 Angry Men  1957  9.0
4   6                                    Schindlers List  1993  8.9
5   7      The Lord of the Rings: The Return of the King  2003  8.9
6   8                                       Pulp Fiction  1994  8.8
7   9  The Lord of the Rings: The Fellowship of the Ring  2001  8.8
8  10                      The Good the Bad and the Ugly  1966  8.8
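In the output above, the first movie has become the column labels. If you want every record to stay in the data, one option is to combine skiprows=1 with header=None, and optionally supply your own column names through names. The sketch below uses made-up in-memory data, and the replacement column names are our own choice.

```python
import io

import pandas as pd

# Made-up sample data standing in for movies.csv
SAMPLE = 'id,name,year,rating\n1,The Godfather,1972,9.2\n2,Pulp Fiction,1994,8.8\n'

# header=None tells pandas that no remaining row holds column names,
# so the columns are labelled 0, 1, 2, 3
no_header = pd.read_csv(io.StringIO(SAMPLE), skiprows=1, header=None)

# names lets us supply our own labels (these names are just an example)
renamed = pd.read_csv(
    io.StringIO(SAMPLE),
    skiprows=1,
    header=None,
    names=['movie_id', 'title', 'release_year', 'score'],
)

print(no_header)
print(renamed)
```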

Read CSV: Very large CSV File

If you are planning to read a very large CSV file, the chunksize parameter of the read_csv() function is very useful.

This parameter specifies how many records to fetch in each chunk. By default, all the data is loaded into memory at once, which might cause problems in the application, so it is better to use the chunksize argument for large files.

import pandas as pd

# each chunk is a DataFrame with at most 1000 rows
for chunk in pd.read_csv("movies.csv", chunksize=1000):
    print(chunk.shape)

Another parameter related to large data sets is low_memory. It makes the parser process the file in chunks internally, which lowers memory use while parsing. Unlike chunksize, the whole file still ends up in a single DataFrame.
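Reading in chunks is most useful when each chunk is reduced to a small result before the next one is loaded. As a sketch, the code below computes an average rating across chunks without holding the full data set in memory. The generated SAMPLE data and the chunk size of 10 are arbitrary choices for this illustration.

```python
import io

import pandas as pd

# Generate 25 made-up rows so that several chunks are produced
SAMPLE = 'id,rating\n' + ''.join(f'{i},{i % 10}\n' for i in range(1, 26))

total_rows = 0
rating_sum = 0.0
chunk_count = 0

# each chunk is a DataFrame holding at most 10 rows
for chunk in pd.read_csv(io.StringIO(SAMPLE), chunksize=10):
    chunk_count += 1
    total_rows += len(chunk)
    rating_sum += chunk['rating'].sum()

# combine the per-chunk totals into one overall figure
mean_rating = rating_sum / total_rows

print(chunk_count, total_rows, mean_rating)
```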

Frequently Asked Questions

  • Can we read an Excel file as a CSV file in Python?

    The Excel file format differs from the CSV file format. Python has a rich library set to work with Excel as well. For example, you can use read_excel() instead of read_csv() from the Pandas library. Complete code is included with examples.

  • How do I read a CSV file in Python using PyCharm or any IDE?

    You can read a CSV file from PyCharm using any of the methods mentioned above. You need to copy the file path and use the csv module, NumPy, or Pandas to read the CSV.

  • How do I read multiple CSV files in Python?

    There is no built-in option to read multiple CSV files at once in NumPy and Pandas. Alternatively, we can list all files in a directory and read each one of them separately.
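    A minimal sketch of that approach, assuming all the files share the same columns: list the files with glob, read each one with Pandas, and combine them with pd.concat(). The temporary folder and file names below are created only so the example runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

# Made-up file contents; in practice these files already exist on disk
FILES = {
    'a_movies.csv': 'id,name\n1,The Godfather\n2,Pulp Fiction\n',
    'b_movies.csv': 'id,name\n3,The Dark Knight\n',
}

with tempfile.TemporaryDirectory() as folder:
    # write the sample files into a temporary folder
    for file_name, body in FILES.items():
        with open(os.path.join(folder, file_name), 'w', newline='') as f:
            f.write(body)

    # list every CSV file in the folder, sorted for a predictable order
    paths = sorted(glob.glob(os.path.join(folder, '*.csv')))

    # read each file separately, then stack the rows into one DataFrame
    combined = pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)

print(combined)
```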

  • How to load a CSV file from an S3 bucket?

    You can use any of the libraries like boto3 to connect to S3 and then use the above-mentioned methods to import the CSV file in Python.

Follow codethreads.dev for more insightful stories.
