Friday, November 4, 2011

Parallel Python with GIS

I've started digging into Parallel Python to see whether I can cut processing times on long-running tasks by dividing the workload among a cluster of computers.

I've already run into some problems:
  • Spatial data is not serializable
  • Parallel Python lacks good documentation
  • There is no built-in way to upload input data or download results if your tasks work with files
  • Servers time out
  • Tasks randomly restart
You'll need arcpy installed on every machine in the cluster.
With that in place, getting started is as easy as this:

import pp

ppservers = ('',)  # tuple of ppserver addresses (left blank here)
job_server = pp.Server(ppservers=ppservers, ncpus=0)

Setting ncpus=0 prevents any worker processes from running on the local machine. To submit a job:

libs = ("arcpy",)

job_server.submit(function,      # function to run
                  (variables,),  # tuple of arguments passed to the function
                  (),            # depfuncs: helper functions the task calls (none here)
                  libs)          # modules the function needs imported
job_server.wait() # waits for the job to complete
job_server.print_stats() # print some stats about the server and task
del job_server

It's that simple to run the task.
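
Since spatial data won't serialize, one workaround is to hand each task plain values (a dataset path plus an OBJECTID range) and let the function open the data itself on the remote machine. Here is a minimal sketch of that idea, not something taken from the Parallel Python docs. The names make_chunks and build_task_args and the UNC path are hypothetical, made up for illustration, and the pickle round trip is only a stand-in check, on the assumption that pp pickles task arguments.

```python
import pickle


def make_chunks(first_oid, last_oid, chunk_size):
    """Split an inclusive OBJECTID range into (start, end) tuples."""
    chunks = []
    start = first_oid
    while start <= last_oid:
        end = min(start + chunk_size - 1, last_oid)
        chunks.append((start, end))
        start = end + 1
    return chunks


def build_task_args(dataset_path, chunks):
    """Pair the dataset path with each chunk; strings and ints pickle cleanly."""
    return [(dataset_path, start, end) for start, end in chunks]


# Hypothetical dataset path on a share that every cluster machine can reach.
dataset = r"\\fileserver\gis\parcels.gdb\parcels"
tasks = build_task_args(dataset, make_chunks(1, 10, 4))

# Stand-in check: every argument tuple survives a pickle round trip.
for task in tasks:
    assert pickle.loads(pickle.dumps(task)) == task
```

Each tuple in tasks could then be passed as the argument tuple of its own job_server.submit call, with the submitted function using arcpy to read only the rows in its range.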