Wednesday, August 3, 2016

More on Pandas Data Loading with ArcGIS (Another Example)

Large datasets can be a major problem on systems running 32-bit Python because there is an upper limit on memory use: roughly 2 GB per process.  In practice, programs often fail well before they hit the 2 GB mark, but the ceiling is there.

When working with data too large to fit within that 2 GB of RAM, how can we push it into DataFrames?

One way is to chunk it into groups:

def grouper_it(n, iterable):
    """Creates chunks of cursor row objects to make the memory
    footprint more manageable."""
    it = iter(iterable)
    while True:
        chunk_it = itertools.islice(it, n)
        try:
            first_el = next(chunk_it)
        except StopIteration:
            return
        yield itertools.chain((first_el,), chunk_it)

This code takes an iterable object (one that defines next() in Python 2 or __next__() in Python 3) and yields iterators of size at most n, where n is a positive integer.
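To see the chunking in action without ArcGIS, here is a quick sanity check using a plain range in place of a cursor (the function is repeated here so the snippet runs on its own):

```python
import itertools

def grouper_it(n, iterable):
    """Creates chunks of cursor row objects to make the memory
    footprint more manageable."""
    it = iter(iterable)
    while True:
        chunk_it = itertools.islice(it, n)
        try:
            first_el = next(chunk_it)
        except StopIteration:
            return
        yield itertools.chain((first_el,), chunk_it)

# 10 items in chunks of 3 -> chunk sizes 3, 3, 3, 1
sizes = [len(list(chunk)) for chunk in grouper_it(3, range(10))]
print(sizes)  # [3, 3, 3, 1]
```

Each yielded chunk is itself an iterator, so nothing is materialized until you consume it, which is what keeps the memory footprint small.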

Example Usage:

import itertools
import os
import arcpy
import pandas as pd

out_csv = r"\\server\test.csv"
# fc is the path to the feature class being read
with arcpy.da.SearchCursor(fc, ["Field1", "Field2"]) as rows:
    groups = grouper_it(n=50000, iterable=rows)
    for group in groups:
        df = pd.DataFrame.from_records(group, columns=rows.fields)
        df['Field1'] = "Another Value"
        # write the header only on the first append
        df.to_csv(out_csv, mode='a', header=not os.path.exists(out_csv))
        del group
        del df
    del groups

This is one way to manage your memory footprint: load records in smaller batches instead of all at once.

Some considerations on 'n'.  I found the following affect the right size for 'n': the number of columns, field lengths, and data types.
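One rough way to pick 'n' is to load a small sample chunk, measure its per-row memory with pandas, and divide that into a memory budget. The sample records and the 500 MB budget below are hypothetical stand-ins; in practice you would build the sample DataFrame from the first chunk your cursor returns:

```python
import pandas as pd

# Hypothetical sample rows standing in for one small cursor chunk:
# an integer ID, a 50-character text field, and a float value.
sample = [(1, "a" * 50, 3.14)] * 1000
df = pd.DataFrame.from_records(sample, columns=["oid", "name", "value"])

# deep=True counts the actual string payloads, not just pointers
bytes_per_row = df.memory_usage(deep=True).sum() / len(df)

budget = 500 * 1024 * 1024  # stay well under the 2 GB ceiling
n = int(budget // bytes_per_row)
print(bytes_per_row, n)
```

Wider rows (more columns, longer text fields, object dtypes) raise bytes_per_row and push 'n' down, which matches the effects noted above.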