I analyzed the problem, and tried several methods for solving the constituent parts of the problem, until I found a good solution. Then I translated it fully onto the GPU using OpenCV's GPU-accelerated bindings for Python. I also learned that pre-allocating certain crucial segments of memory on the GPU made for a 2.5x speedup in those parts of the program.