Wednesday, August 3, 2016

More on Pandas Data Loading with ArcGIS (Another Example)

Large datasets can be a major problem on systems running 32-bit Python because there is an upper limit on memory use of roughly 2 GB per process.  In practice, programs often fail before they even hit the 2 GB mark, but that is the ceiling.

When working with large data that cannot fit into 2 GB of RAM, how can we push the data into DataFrames?

One way is to chunk it into groups:

import itertools

#--------------------------------------------------------------------------
def grouper_it(n, iterable):
    """
    creates chunks of cursor row objects to make the memory
    footprint more manageable
    """
    it = iter(iterable)
    while True:
        # lazily slice off the next n items; nothing is read until consumed
        chunk_it = itertools.islice(it, n)
        try:
            first_el = next(chunk_it)
        except StopIteration:
            # the source iterator is exhausted, so stop yielding chunks
            return
        yield itertools.chain((first_el,), chunk_it)


This code takes an iterable object (one that defines next() in Python 2.7 or __next__() in Python 3.4) and produces smaller iterators of at most n items each, where n is a whole number (integer).
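
To see what the function produces, here is a quick sketch (not ArcGIS specific) that chunks ten made-up values into groups of four, with grouper_it defined as above:

for chunk in grouper_it(4, range(10)):
    print(list(chunk))

# prints:
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]

Each chunk is itself an iterator, so nothing is pulled into memory until you consume it.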

Example Usage:

import itertools
import os
import json
import arcpy
import pandas as pd


# fc is the path to the source feature class (defined elsewhere)
with arcpy.da.SearchCursor(fc, ["Field1", "Field2"]) as rows:
    groups = grouper_it(n=50000, iterable=rows)
    for group in groups:
        # load one chunk of cursor rows into a DataFrame
        df = pd.DataFrame.from_records(group, columns=rows.fields)
        df['Field1'] = "Another Value"
        # append each chunk to the output CSV
        df.to_csv(r"\\server\test.csv", mode='a')
        del group
        del df
    del groups

This is one way to manage your memory footprint: load the records in smaller batches instead of all at once.
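
One detail to watch with mode='a': every to_csv() call writes its own header row by default, so each chunk would repeat the column names in the output file.  Here is a small sketch of one way around that, writing the header only for the first chunk (it assumes the same imports and grouper_it from above, and the enumerate counter and output path are just for illustration):

with arcpy.da.SearchCursor(fc, ["Field1", "Field2"]) as rows:
    for i, group in enumerate(grouper_it(n=50000, iterable=rows)):
        df = pd.DataFrame.from_records(group, columns=rows.fields)
        # write the column names only with the first chunk
        df.to_csv(r"\\server\test.csv", mode='a', header=(i == 0), index=False)
        del df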

Some considerations on 'n'.  I found that the number of columns, the field lengths, and the data types all affect how large 'n' can be.
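
One way to gauge a workable 'n' for a given schema is to load a single chunk and check how much memory it actually uses.  The sketch below reuses the imports and grouper_it from above; the 50,000-row starting point is just an assumption to adjust from:

with arcpy.da.SearchCursor(fc, ["Field1", "Field2"]) as rows:
    first_chunk = next(grouper_it(n=50000, iterable=rows))
    df = pd.DataFrame.from_records(first_chunk, columns=rows.fields)
    # approximate bytes held by this chunk, including string (object) columns
    mb = df.memory_usage(deep=True).sum() / (1024.0 * 1024.0)
    print("~{:.1f} MB for {} rows".format(mb, len(df)))

If one chunk lands well under the 2 GB ceiling, there is room to raise 'n'; if it is close, lower it.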