Friday, August 19, 2016

Pandas DataFrame as a Process Tracker (Postgres example)

Sometimes you need to keep track of the number of rows processed for a given table.

Let's assume you are working in Postgres and you want to do row-by-row operations to perform some sort of data manipulation.  Your user requires you to keep track of each row's changes and wants to know the number of failed updates and the number of successful updates. The output must be in a text file with pretty formatting.

There are many ways to accomplish this task, but let's use pandas, an arcpy.da.UpdateCursor, and some SQL.

import arcpy
import pandas as pd

def create_tracking_table(sde, tables):
    """
    creates a pandas dataframe from a sql statement
    Input:
       sde - sde connection file
       tables - names of the tables to get the counts for
    Output:
       Pandas DataFrame with column names: Table_Name, Total_Rows,
       Processed and Errors
    """
    desc = arcpy.Describe(sde)
    connectionProperties = desc.connectionProperties
    username = connectionProperties.user
    sql = """SELECT
       nspname AS schemaname,relname,reltuples
    FROM pg_class C
     LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
    WHERE
       nspname NOT IN ('pg_catalog', 'information_schema') AND
       relkind='r' AND
       nspname='{schema}' AND
       relname in ({tables})
    ORDER BY reltuples DESC;""".format(
                                   schema=username,
                                   tables=",".join(["'%s'" % t for t in tables])
                                   )
    columns = ['schemaname','Table_Name','Total_Rows']

    con = arcpy.ArcSDESQLExecute(sde)
    rows = con.execute(sql)
    count_df = pd.DataFrame.from_records(rows, columns=columns)
    del count_df['schemaname']
    count_df['Processed'] = 0

    count_df['Errors'] = 0
    return count_df

Now we have a function that will return a DataFrame object from a SQL statement.  It contains 4 fields: Table_Name, Total_Rows, Processed, and Errors.  Table_Name is the name of the table in the database.  Total_Rows is the number of rows in the table (reltuples is the planner's estimate, so it can be approximate).  Processed is the column you increment every time a row gets updated successfully.  Errors is the numeric column that gets incremented whenever an update fails.
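To see the shape of the returned object without a database connection, here is a minimal sketch that builds the same kind of tracking table from hard-coded rows. The schema name, table names, and counts are made up for illustration; they stand in for whatever con.execute(sql) would return.

```python
import pandas as pd

# Hypothetical rows standing in for the query result:
# (schemaname, relname, reltuples)
rows = [
    ("gis", "parcels", 1200.0),
    ("gis", "roads", 350.0),
]
columns = ['schemaname', 'Table_Name', 'Total_Rows']

count_df = pd.DataFrame.from_records(rows, columns=columns)
del count_df['schemaname']   # the schema is not needed in the report
count_df['Processed'] = 0    # successful updates so far
count_df['Errors'] = 0       # failed updates so far

print(count_df)
```

Every table starts with both counters at zero, so the report is valid even if a table is never touched.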

So let's use what we just made:

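count_df = create_tracking_table(sde, tables)
for table in tables:
   with arcpy.da.UpdateCursor(table, "*") as urows:
      for urow in urows:
         try:
            urow[3] += 1
            urows.updateRow(urow)
            # the update worked, so count it as processed
            count_df.loc[count_df['Table_Name'] == '%s' % table, 'Processed'] += 1
         except Exception:
            # the update failed, so count it as an error
            count_df.loc[count_df['Table_Name'] == '%s' % table, 'Errors'] += 1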

The pseudo code above shows that whenever an exception is raised, 'Errors' gets 1 added to it, and whenever a row is updated successfully, 'Processed' gets 1 added to it instead.
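The same counting pattern can be tried without ArcGIS. The sketch below is only an illustration: the tracking table and the fake_tables dictionary are invented stand-ins for create_tracking_table() and the update cursor, and a None value is used to force a failure.

```python
import pandas as pd

# Hypothetical tracking table; in the post it comes from create_tracking_table()
count_df = pd.DataFrame({
    'Table_Name': ['parcels', 'roads'],
    'Total_Rows': [3, 2],
    'Processed': 0,
    'Errors': 0,
})

# Hypothetical rows standing in for what an update cursor would return
fake_tables = {
    'parcels': [[1, 'a', 'x', 10], [2, 'b', 'y', None], [3, 'c', 'z', 30]],
    'roads': [[1, 'm', 'n', 5], [2, 'o', 'p', 6]],
}

for table, rows in fake_tables.items():
    is_table = count_df['Table_Name'] == table   # boolean mask for this table
    for row in rows:
        try:
            row[3] += 1                           # None + 1 raises TypeError
            count_df.loc[is_table, 'Processed'] += 1
        except TypeError:
            count_df.loc[is_table, 'Errors'] += 1

print(count_df)
```

Computing the mask once per table keeps the loop body short and avoids rebuilding the comparison for every row.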

The third part of the task was to output the count table to a text file which can be done easily using the to_string() method.

with open(, 'w') as writer:
   writer.write(count_df.to_string(index=False, col_space=12, justify='left'))
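As a self-contained sketch of that last step, the example below writes a small, made-up tracking table to a text file and reads it back. The output file name and the temporary directory are assumptions for illustration only; use whatever location your user asked for.

```python
import os
import tempfile

import pandas as pd

# Hypothetical tracking table with made-up counts
count_df = pd.DataFrame({
    'Table_Name': ['parcels', 'roads'],
    'Total_Rows': [1200, 350],
    'Processed': [1195, 350],
    'Errors': [5, 0],
})

# Hypothetical output location (a fresh temp directory)
out_path = os.path.join(tempfile.mkdtemp(), "update_counts.txt")

with open(out_path, 'w') as writer:
    # index=False drops the row numbers, col_space pads each column,
    # justify='left' left-aligns the column headers
    writer.write(count_df.to_string(index=False, col_space=12, justify='left'))

with open(out_path) as reader:
    report = reader.read()
print(report)
```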

So there you have it.  We have a nice human readable output table in a text file.


Wednesday, August 3, 2016

More on Pandas Data Loading with ArcGIS (Another Example)

Large datasets can be a major problem with systems that are running 32-bit Python because there is an upper limit on memory use: 2 GB.  Most times programs fail before they even hit the 2 GB mark, but there it is.

When working with large data that cannot fit into the 2 GB of RAM, how can we push the data into DataFrames?

One way is to chunk it into groups:

def grouper_it(n, iterable):
    """
    creates chunks of cursor row objects to make the memory
    footprint more manageable
    """
    it = iter(iterable)
    while True:
        chunk_it = itertools.islice(it, n)
        try:
            first_el = next(chunk_it)
        except StopIteration:
            # nothing left in the source iterator, so stop generating
            return
        yield itertools.chain((first_el,), chunk_it)

This code takes an iterable object (one whose iterator defines next() in Python 2.7 or __next__() in Python 3) and yields a series of smaller iterators of at most n items each, where n is a positive whole number (integer).  The last chunk can be shorter than n.
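A quick way to check the chunking behavior is to run it on a plain range instead of a cursor. This is a minimal sketch that repeats the function so it runs on its own; the chunk size of 4 is arbitrary.

```python
import itertools

def grouper_it(n, iterable):
    """Yield iterators of at most n items each from iterable."""
    it = iter(iterable)
    while True:
        chunk_it = itertools.islice(it, n)
        try:
            first_el = next(chunk_it)
        except StopIteration:
            return
        yield itertools.chain((first_el,), chunk_it)

# Each chunk is consumed with list() before the next one is requested
chunks = [list(chunk) for chunk in grouper_it(4, range(10))]
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Because every chunk reads from the same underlying iterator, each chunk should be fully consumed before asking for the next one; otherwise the leftover items end up in the following chunk.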

Example Usage:

import itertools
import os
import json
import arcpy
import pandas as pd

with arcpy.da.SearchCursor(fc, ["Field1", "Field2"]) as rows:
     groups = grouper_it(n=50000, iterable=rows)
     for group in groups:
         df = pd.DataFrame.from_records(group, columns=rows.fields)
         df['Field1'] = "Another Value"
         df.to_csv(r"\\sever\test.csv", mode='a')
         del group
         del df
     del groups

This is one way to manage your memory footprint by loading records in smaller bits.
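One thing to watch when appending chunks with mode='a' is that to_csv writes the column header each time by default. The sketch below shows one possible way around that, writing the header for the first chunk only. It is an illustration under assumptions: the records, the temp file, and the small chunks() helper (a shorter islice-based chunker, not grouper_it) are all made up to keep the example runnable without ArcGIS.

```python
import itertools
import os
import tempfile

import pandas as pd

def chunks(n, iterable):
    """Hypothetical helper: yield lists of at most n items."""
    it = iter(iterable)
    # iter(callable, sentinel) stops when islice returns an empty list
    return iter(lambda: list(itertools.islice(it, n)), [])

# Made-up records standing in for search cursor rows
records = [(i, "value %d" % i) for i in range(7)]
fields = ["Field1", "Field2"]
out_csv = os.path.join(tempfile.mkdtemp(), "test.csv")

for i, group in enumerate(chunks(3, records)):
    df = pd.DataFrame.from_records(group, columns=fields)
    # write the header with the first chunk only, then append data rows
    df.to_csv(out_csv, mode='a', header=(i == 0), index=False)

result = pd.read_csv(out_csv)
print(len(result))
```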

Some considerations on 'n'.  I found that the following affect how large 'n' can be: number of columns, field length, and data types.