Tuesday, May 8, 2012

Finding Duplicates (3 Ways)

In this post I will show some methods of finding duplicate values in an enterprise SDE database.
The general query is quite simple.  We will perform a SELECT on the table and return the values and the number of times it repeats within that table
First I will use cx_Oracle module to query a table and return both the count and field value that has duplicates:
import cx_Oracle
#.... connection 
#.... create cursor
columnName = "ClientName" 
table = "clients" 
sql = "SELECT %s, count(*) from %s GROUP BY %s HAVING count(*) > 1" % (columnName, table, columnName) 
print cursor.execute(sql).fetchall()
That will return a 2 column list of tuples where as a developer you can then parse that information as needed.

Let's assume you want to do it a more ArcPy way.  ArcSDESQLExecute class can be used to perform the same SQL statement.  The ArcSDESQLExecute class provides a means of executing SQL statements via an ArcSDE connection (source: webhelp.esri.com).

# Two ways to create the object, which also creates the connection to ArcSDE. 
# Using the first method, pass a set of strings containing the connection properties: 
# sdeConn = arcpy.ArcSDESQLExecute("gpserver3","5151","#","toolbox","toolbox") 
# Using the second method pass the path to a valid ArcSDE connection file 
# sdeConn = arcpy.ArcSDESQLExecute("data\Connection to GPSERVER3.sde")
sdeConn = arcpy.ArcSDESQLExecute(r"c:\temp\connectionFile.sde"

columnName = "ClientName" 
table = "clients" 
sql = "SELECT %s, count(*) from %s GROUP BY %s HAVING count(*) > 1" % (columnName, table, columnName) 

results = sdeConn.execute(sql)
for result in results:
   print result # List with [column value, # of duplicates]

You can always turns to ArcToolbox for help if you do not like using SQL. Using Summary Statistics you can get the count of the values in a specific field. It most likely will be slower for large data sets compared to the other two method mentioned above, but it can be used for any supported data set in ArcGIS.

Happy Duplicate Finding

No comments: