I was given a fairly large .csv file, containing emails my corporation had sent and received since Jan. 1, 2015. I was asked to parse through the emails and insert all of the email addresses, with date stamps, into a database for analytics. For the purpose of this article, I’m only going to cover dealing with large amounts of data in a Pythonic way.

The .csv file was around 2GBs and contained a little over 2 million records. Initially, I tried to open the file in Excel to see what the data looked like. Sadly, my computer started throwing up everywhere and I had to manually kill Excel (sigh).

Generators to the Rescue!

A generator function is mainly a more convenient way of writing an iterator. You don’t have to worry about the iterator protocol (`.next`, `.__iter__`, etc.). It just works.

David Beazley Generator Tricks for Systems Programmers

Generators are a handy feature of Python that allows you to iterate over some data lazily without loading the entire data source into memory at once. This is how it looked:

1
2
3
4
5
6
7
8
9
10
11
12
def get_email_data(csv_fname):
  with open(csv_fname, "r", encoding="latin-1") as email_records:
    for email_record in csv.reader(email_records):
      yield email_record

if __name__ == '__main__':
  filename = "./emailData.csv"
  iter_email = iter(get_email_data(filename))
  next(iter_email)  # Skipping the column names

  for row in iter_email:
    print(row)

It looked like I was riding the gravy train on biscuits after that, until…

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 1: ordinal not in range(128).

The Unicode Sandwich

The best practice for handling text is the “Unicode sandwich” . This means that bytes should be decoded to str as early as possible on input (e.g., when opening a file for reading). The “meat” of the sandwich is the business logic of your program, where text handling is done exclusively on str objects. You should never be encoding or decoding in the middle of other processing. On output, the str are encoded to bytes as late as possible. — Luciano Ramalho, Fluent Python

The reason I received this error message is because when debugging it I was printing to stdout in Windows. Meaning the command line. I was also told that Microsoft SQL Server threw a few errors when importing this .csv due to the encoding (I didn’t get to see the exact error). I might be wrong about this, but apparently the Windows Command Prompt doesn’t support any encoding other than ASCII by default. Windows gives you the ability to change the ‘code page’ which enables utf-8 encodings, but I opted for a different solution since I’ve read that doing so can cause issues elsewhere and I wanted to make sure SQL didn’t have anything to complain about as well.

Here’s the new “get_email_data” generator function:

1
2
3
4
5
def get_email_data():
  with open(csv_fname, "r", encoding="latin-1") as email_records:
    for email_record in csv.reader(email_records):
      ascii_record = (x.encode('ascii', errors='replace').decode() for x in email_record)
      yield ascii_record

Now, we’re encoding the data as soon as it’s available to us! Note the errors=replace parameter. This solution is not ideal for many cases but it solved my needs. When encoding a string, Python gives you multiple methods to pick from. Here are the main three I use:

  • strict: throws a fatal error when it can’t encode a char.
  • ignore: this removes the char entirely
  • replace: replaces the char with a ‘?’

The “replace” option isn’t ideal, but it suited my needs. Using “replace”, I was able to process all of my .csv with no Unicode errors and found that the root of my problem was due to Subject lines containing Emoji’s. Whatever.

Putting the Icing on the Cake

So far to access any element in our email data we have to do it via indexes

email_row[0], email_row[1], email_row[2], etc...

While this isn’t the end of the world as we could surely get by, wouldn’t it just be swell to access row attributes via column name like so:

email_row.date, email_row.from, email_row.to, etc...

NamedTuples to the rescue!

Here’s the final “get_email_data” function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
def get_email_data(csv_fname):
    """
    A generator for the data in the csv. This is because the csv files can often contain millions of records and shouldn't be stored in memory all at once.

    :param csv_fname:
        filename/location of the csv.

    :return:
        yields each row as a namedtuple.
    """
    EmailRecord = namedtuple('EmailRecord', 'date size from_addr  to_addr subject excerpt')
    with open(csv_fname, "r", encoding="latin-1") as email_records:
        for email_record in csv.reader(email_records):
            if len(email_record) == 6: # a valid row
                ascii_email_record = (x.encode('ascii', errors='replace').decode() for x in email_record)
                yield EmailRecord(*ascii_email_record)

Wrapping Up

We just learned how to handle very large files with Python in a way that doesn’t make your computer scream at you. We also used a number of awesome features including:

  • Generators
  • String encoding for Unicode and
  • NamedTuples

Thanks for reading!