FormulaFolios Data Processing

Ben Pham


Backend Engineer

Web Developer

Database Specialist

PostgreSQL

Ruby

Ruby on Rails

Database Optimization

Overview

This was one of my final projects and the biggest highlight of my career at FormulaFolios: improving the background calculation of high-water marks in one of our internal trading platforms. In order to calculate a high-water mark, you first need to backfill that data through https://apexclearing.com/. Basically, this platform was our own internal version of Orion Wealth Management, an application that monitors risk assessment and wealth management. Once we backfill that data, we pass it through a set of service classes that run the calculations to produce the HWM. The traders have the formula sheets for those, so we just implement the calculations.

Problem

The problem was that after backfilling all of that data and then iterating through the dataset to run the high-water mark calculations, the overnight jobs responsible for this would still be running into the morning, often not finishing until late morning and eventually noon. Ideally, those overnight jobs should be completed before the traders come in to work so they can make important trade decisions based on each portfolio's risk assessment and the market's predicted or current trends based on the graphs we provide.

Experiment

Observation

The biggest bottleneck in your average web application is database calls. The more costly the database calls and the more frequently you need to make them, the slower your application or daily jobs will be and the more memory they will consume.

Hypothesis

My assumption was that we would need to index our most frequently queried columns and cache repeated queries to avoid multiple hits to the database. Down the line, the calculation methods in our application need to pull the data that we backfilled into our database. If we don't have an index there, it will take longer to search for that data.
Think of it like a bookshelf. If you went to the library and all of the books were organized in alphabetical order, it would be much easier to find the book you want, correct? On the other hand, imagine if books were placed on whatever shelf someone liked and you had to find the one you need. That would take forever, right?

Experimentation and Results

First, we needed to go through our calculation service classes to see what the most common calls were, and then use Ruby's Benchmark module to benchmark each individual calculation. Each one took about 40 seconds to 1 minute. Data-wise, we have to multiply that by a few thousand records per batch of trade files, which really adds up.
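As a rough illustration of how those timings were gathered, here is a minimal sketch using Ruby's Benchmark module; HighWaterMarkCalculator is a hypothetical stand-in for our actual service classes.

```ruby
require 'benchmark'

# Hypothetical sketch: time one high-water mark calculation per account.
# HighWaterMarkCalculator is a placeholder for the real service class.
Account.find_each do |account|
  elapsed = Benchmark.realtime do
    HighWaterMarkCalculator.new(account).call
  end
  puts "HWM calculation took #{elapsed.round(2)}s for account #{account.id}"
end
```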
We were already splitting this work into Sidekiq processes to divide out the workload, so roughly divide the single-process calculation time by 4 to get the total time it would take to finish. A raw run of this WITH Sidekiq ON took about 3 hours 30 minutes. Without Sidekiq, the job wouldn't finish by the time I got home from work. Either way, it was still inefficient.
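For context, the fan-out looked roughly like the sketch below; the worker and service names are hypothetical, but enqueueing one Sidekiq job per account is what spreads the workload across processes.

```ruby
# Hypothetical sketch of the per-account fan-out; class and queue names are placeholders.
class HighWaterMarkWorker
  include Sidekiq::Worker
  sidekiq_options queue: :overnight, retry: 3

  def perform(account_id)
    account = Account.find(account_id)
    HighWaterMarkCalculator.new(account).call
  end
end

# Enqueue one job per account so multiple Sidekiq processes can share the work.
Account.pluck(:id).each { |id| HighWaterMarkWorker.perform_async(id) }
```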
Another thing I noticed was that we had a method to check whether the account has any gaps (it verifies whether the daily log for that account has been updated from the latest trade files or not). We constantly call this method to double-check whether we need to recalculate the account, and it takes 50 seconds to 1 minute 34 seconds per call; multiply that by the thousands of records you have and the run becomes very slow.
I decided to try caching this check, and each calculation then took an average of 10 seconds. Fortunately, the state that we cache stays valid, because the monitoring start date only needs to update every 24 hours when we recalculate the high-water marks, and we only run these jobs overnight every 24 hours. Because of this, caching was a viable solution, as we don't need to worry about an outdated start date mid-calculation.
With just the cache on its own, I tested the job process and it took less than 20 minutes. From 3.5 hours to 20 minutes is a HUGE improvement. This works because each account has a few hundred daily logs to parse through; we just need one cached check per account that all of those logs can reuse instantly, rather than hitting the database over and over for the exact same check.
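A minimal sketch of that per-account check cache, assuming a hypothetical gap_check method and Rails' low-level cache (the key name and 24-hour expiry mirror the overnight schedule described above):

```ruby
# Hypothetical sketch: cache the expensive "is this account up to date?" check once per account.
# The cached value is safe to reuse for the whole run because the monitoring start date
# only changes once per 24-hour overnight cycle.
def account_up_to_date?(account)
  Rails.cache.fetch("hwm/account_up_to_date/#{account.id}", expires_in: 24.hours) do
    gap_check(account) # the original slow query against the daily logs
  end
end
```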
I had yet to implement the indices at this point, since I was still finding the best keys to use for the log table itself. When fetching a log, it took about 2-3 seconds per log for each account. There was a sort key on the trade date, but not on the account ID itself. After switching this to a composite index on both the trade date and the account ID, the time dropped to 0.37 seconds to fetch a log per account. This is probably the quickest I could get: the planner was still using a Sequential Scan (a Parallel Sequential Scan, to be precise), one of the less efficient scan types. I couldn't get it to use a B-tree index scan or a top-N heapsort without changing the query itself, and then it would no longer be correct when pulling the account log from a business logic standpoint.
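The index change itself was a standard Rails migration; a sketch along these lines, with hypothetical table and column names, captures the composite index on trade date and account ID:

```ruby
# Hypothetical migration sketch: replace the single-column sort key with a
# composite index covering both columns the log lookup filters on.
# (Column order should match how the query filters and sorts; this is illustrative.)
class AddCompositeIndexToDailyLogs < ActiveRecord::Migration[6.0]
  def change
    add_index :daily_logs, [:account_id, :trade_date]
  end
end
```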
Overall, after combining both solutions, the composite index plus the cache on the account-updated check, the entire job process took 12 minutes.

Conclusion

Going from 3 hours 30 minutes to 20 minutes is significant, and nearly halving that again to 12 minutes is another significant improvement. Fortunately, we could use caching here because the check for whether the account has been parsed doesn't need to expire until the next day, despite being a complex check, and the indices used to fetch each log are now being hit efficiently, saving memory, time, and money.
Note that the production server, depending on how many processes are running at once, can be a bit slower than staging or a local test, but it should still complete within half an hour. The conclusion is that the best solution for speeding up the high-water mark calculations was a combination of caching and composite indices, without having to risk changing any queries and potentially breaking the existing business logic.

Summary of Project

Always try indexing first. A simple index is the cheapest optimization to start with on a new database table where you expect certain columns to be queried repeatedly. Of course, on a very large database, creating an index can lock the table, so be very careful when running these migrations during production releases and database maintenance.
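To avoid that locking on a large PostgreSQL table, one option is to build the index concurrently; a sketch of what that looks like in a Rails migration (table and column names are illustrative):

```ruby
# Sketch: build the index without taking a long write lock on the table.
# CREATE INDEX CONCURRENTLY cannot run inside a transaction, so the
# migration has to opt out of the DDL transaction first.
class AddTradeDateIndexConcurrently < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  def change
    add_index :daily_logs, :trade_date, algorithm: :concurrently
  end
end
```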
When trying caching, check your cache hit rate. If your cache is being hit a lot and the dataset doesn't need to expire for a set amount of time, then a cache is viable. If you need the data to reflect the latest state immediately, a cache won't be a viable solution. Also be careful of hard-to-catch bugs: cache-related bugs are difficult to diagnose because they depend on stale state.
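As one way to eyeball that hit rate, assuming a Redis-backed cache store, Redis' own counters give a rough instance-wide figure; a small sketch:

```ruby
require "redis"

# Sketch, assuming a Redis-backed Rails cache: read Redis' hit/miss counters
# to get a rough cache hit rate for the whole instance.
stats  = Redis.new.info("stats")
hits   = stats["keyspace_hits"].to_f
misses = stats["keyspace_misses"].to_f
total  = hits + misses
puts total.zero? ? "no cache traffic yet" : "cache hit rate: #{(hits / total * 100).round(1)}%"
```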
I'd always advise trying the above solutions first before reaching for more expensive options, such as throwing more money at memory by running more Sidekiq processes, when you haven't exhausted the simple solutions yet. Caches and indices carry over between environments, whereas a background job processor requires implementation changes depending on whether you choose Sidekiq, Resque, DelayedJob, etc.

Other Researched Optimization Techniques

I read through an article on subquery caching, which I found interesting for cases where complex queries cannot be simplified due to the nature of the underlying business logic they are tied to. Being able to extract parts of a query, cache them, and reuse the parts that stay the same can be a great strategy to improve performance.
Subquery caching can be complex and should only be considered once missing indices have been added to avoid inefficient Sequential Scans over large record sets, decent cache hit rates are being met, the PostgreSQL configuration is fully tuned, and queries aren't holding locks for too long.
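As a rough illustration of the idea (not the approach from the article, just a sketch with hypothetical models): the stable part of a query can be computed once, cached, and then fed into the outer query.

```ruby
# Sketch: cache the expensive, stable subquery (accounts already up to date today)
# and reuse its result in the outer query instead of re-running it every time.
up_to_date_ids = Rails.cache.fetch("hwm/up_to_date_account_ids", expires_in: 24.hours) do
  Account.joins(:daily_logs)
         .where(daily_logs: { trade_date: Date.current })
         .pluck(:id)
end

stale_accounts = Account.where.not(id: up_to_date_ids)
```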

Posted Jan 13, 2025

Research and optimization project from my time at FormulaFolios to speed up the high-water mark data backfill jobs using caching and database indices


Clients

FormulaFolios

Tags

Backend Engineer

Web Developer

Database Specialist

PostgreSQL

Ruby

Ruby on Rails
