Origins

This tool originated when a team of us contemplated a data conversion project. While the final database was under 10GB, the legacy side consisted of over 30 individual systems, each running either one of two bespoke mainframe applications or a SQL Server-based practice management system. The new incarnation integrated an Oracle-based legal practice management system with Oracle Financials and more.

Data conversions can be a big deal. Even from a simple engineering point of view, there is a lot to worry about. There is all the data to be marshalled: database extracts, proprietary binary dumps, spreadsheets, comma-separated values in flat files... Then there can be thousands of scripts, and these come in a variety of flavours: data manipulation obviously, data definition, stored procedures, reports, export and import utilities to run, utilities to transform spreadsheets into CSV files... Just executing them isn't enough, however. We wanted to automate as much as possible; having someone sit and execute scripts by hand is slow, wasteful and error-prone. The tool had to execute these scripts automatically, with no extra input from the user. This gives us the first requirements for the tool we wanted: to run both SQL scripts and operating-system commands, unattended.
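A minimal sketch in Python makes the idea of unattended execution concrete. This is not ScriptRunner itself; the connect string, the script names and the assumption that each .sql file ends with EXIT (and sets WHENEVER SQLERROR EXIT FAILURE so errors surface as a non-zero exit code) are placeholders for illustration.

```python
import subprocess

SQLPLUS_CONNECT = "scott/tiger@legacy"   # placeholder credentials, not real ones

def run_one(script: str) -> int:
    """Run a single script with no user interaction and return its exit code."""
    if script.endswith(".sql"):
        # -s silences the SQL*Plus banner; @script runs the file.
        # The .sql file is assumed to end with EXIT and to set
        # WHENEVER SQLERROR EXIT FAILURE so failures come back as non-zero codes.
        cmd = ["sqlplus", "-s", SQLPLUS_CONNECT, f"@{script}"]
    else:
        cmd = ["sh", script]             # exports, CSV transformations, etc.
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    for s in ["create_schema.sql", "extract_legacy.sh", "load_clients.sql"]:
        print(s, "->", run_one(s))
```

Dispatching on file type like this is what lets one driver work through a mixed pile of database and operating-system scripts without anyone at the keyboard.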

And to do these automatically, we need a number of further capabilities, which the rest of this section works through.

Scripts fail. They fail spectacularly; they fail quietly; and in data conversions, where data quality is uneven, they fail unpredictably. During a conversion, any failure should cause the run to stop before errors are propagated. Once erroneously converted data starts being re-used, the only sensible way to recover is to give up and start again from a full database backup, and that is too time-consuming to contemplate doing frequently. To fail immediately and be able to restart from the point of failure is a better option. Note that the point of failure may be the last chunk of committed rows.
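As a sketch of the fail-and-restart idea, the hypothetical runner below stops at the first non-zero exit code and records each successful script in a progress file, so a re-run resumes where the last one failed. It resumes at script granularity only, so the caveat about the last chunk of committed rows still applies; the file and script names are invented.

```python
import subprocess
from pathlib import Path

PROGRESS = Path("progress.log")          # records scripts that finished cleanly

def already_done() -> set:
    return set(PROGRESS.read_text().split()) if PROGRESS.exists() else set()

def run_scripts(scripts):
    done = already_done()
    for script in scripts:
        if script in done:
            continue                     # finished in an earlier run; skip it
        if subprocess.run(["sh", script]).returncode != 0:
            raise SystemExit(f"{script} failed; re-run to resume from this point")
        with PROGRESS.open("a") as log:
            log.write(script + "\n")     # only scripts that succeeded are recorded

if __name__ == "__main__":
    run_scripts(["010_extract.sh", "020_transform.sh", "030_load.sh"])
```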

The data often needs to be loaded in a particular order, so the order of script execution matters. In fact, some scripts have very specific preconditions that must be satisfied before they can run. Not all the scripts will be under the control of the data conversion team; they may have been contributed by a vendor, by the DBA or by other project teams, and these cannot simply be optimized or re-written the way you'd like.

Nothing should be able to run before its dependencies have successfully finished.
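One way to enforce that rule is to record each script's dependencies and walk them in topological order. The sketch below uses Python's standard graphlib and an invented dependency table purely for illustration.

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# Invented dependency table: each script maps to the scripts it depends on.
DEPENDS_ON = {
    "create_schema.sql": [],
    "load_clients.sql":  ["create_schema.sql"],
    "load_matters.sql":  ["load_clients.sql"],
    "vendor_fixup.sql":  ["load_matters.sql"],   # vendor-supplied, run as-is
}

def execution_order(deps):
    """Return the scripts in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

if __name__ == "__main__":
    for script in execution_order(DEPENDS_ON):
        print(script)   # a real run would execute each and stop on any failure
```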

Data conversions are naturally long-running jobs. There is a lot of data to process, and individual legacy rows may even be processed multiple times for different reasons. One immediate implication is that a data conversion should not be run from some developer's desktop: the risk of interruption by a corporate re-boot policy or an idle time-out is unacceptable. Our tool, ScriptRunner, should be able to run without an interactive session.

Furthermore, unfettered access to the production machines was not possible. In particular, there was no chance of running a GUI on them. These factors supported our inclination for a command-line tool.

Notwithstanding the earlier comment about desktop PCs, in practice that is where development occurs. Since our developer PCs ran Windows and the database machines were Solaris, any tool had to be portable between Windows and Solaris.

To help optimize the conversion process, it would be nice to be able to run scripts in parallel. As we progressed, we found it useful to have threads that waited for all parallel tasks to complete, and to have threads spawned without the parent thread waiting.
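In Python, which we use for the sketches here, those two patterns look roughly like this; the script names are invented and run_script merely stands in for real execution.

```python
import threading

def run_script(name):
    print("running", name)               # stand-in for actually executing the script

# Pattern 1: run several scripts in parallel and wait for all of them.
batch = [threading.Thread(target=run_script, args=(name,))
         for name in ("load_clients.sql", "load_matters.sql", "load_time.sql")]
for t in batch:
    t.start()
for t in batch:
    t.join()                             # the parent continues only when all are done

# Pattern 2: spawn a script and carry on; the parent never joins this thread.
threading.Thread(target=run_script, args=("refresh_statistics.sql",)).start()
print("parent carries on immediately")
```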

Financial systems are periodically audited, but different applications will store the same number in different ways; a debt, for example, could be positive or negative. An auditor may well insist on knowing how the old number was transformed into the new number. The auditor, coming in months after the conversion is signed off, may ask to see not only the relevant scripts but also some proof that they were the scripts that actually ran. Thus, we need a clear and persistent audit trail of the scripts executed.

This is our final requirement, but it proved both very important in theory and extremely useful in practice. During a run, the audit trail can be used to monitor progress. After failed runs, it is invaluable for debugging. After successful runs, the performance of scripts can be assessed and targets for optimization identified. And it helps settle arguments with DBAs.
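A sketch of the kind of audit record we mean: one line per script, with a checksum so an auditor can tie the entry to the exact file that ran, plus start and end times and the exit status. The CSV layout and file names are illustrative, not ScriptRunner's actual format.

```python
import csv
import hashlib
import subprocess
from datetime import datetime, timezone

AUDIT_LOG = "audit_trail.csv"            # illustrative name and layout

def run_with_audit(script):
    """Run one script and append what ran, when, and how it ended to the trail."""
    with open(script, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()   # ties the entry to the exact file
    started = datetime.now(timezone.utc).isoformat()
    returncode = subprocess.run(["sh", script]).returncode
    finished = datetime.now(timezone.utc).isoformat()
    with open(AUDIT_LOG, "a", newline="") as log:
        csv.writer(log).writerow([script, digest, started, finished, returncode])
    return returncode

if __name__ == "__main__":
    run_with_audit("load_clients.sh")
```

The start and end times in each record are also what make the post-run performance analysis mentioned above possible.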