The ETL from Hell – Diagnosing Batch System Performance Issues

Too often, the batch systems that underlie a lot of database processing just grow without conscious design. When runs start to extend beyond their allotted time, and tuning no longer solves the problem, it is often discovered that batches are run in series, with draconian error handling. It is time to impose some rational design, and Nigel is a seasoned healer of batch processes.

Overview

Batch systems, which perform housekeeping jobs without human intervention, are often used with databases, usually for the population of data warehouses but more generally for any regular backend processing such as accounting processes.

In this article, I’ll be discussing the typical problems in batch processing, showing how to determine their cause, and describing how to resolve them. We will concentrate on an overnight batch run because this is such a common way to populate a data warehouse, but the same principles will apply to any batch system, whenever it is run.

Systems that are designed for high availability have additional challenges, and processing will already be designed so that maintenance can be carried out while the system is available. These systems can still benefit from the principles outlined in this article because control of the process can still be an issue.

The ETL From Hell Described

We’ll use a typical example of a system that receives data from remote sources and populates a local data warehouse for interrogation by the business. The objective is to produce scheduled reports, predefined reports or to allow ad hoc queries.

This system will probably have developed piecemeal, in that developers each take a definition of functionality and use it to create additional tasks to load the data (often termed ETL or ‘Extract, Transform and Load’). These tasks are then scheduled to run either manually or by using a task scheduler such as SQL Server Agent. Initially the system works well and the business is happy but, over time, data is accumulated, the source systems grow and more data feeds and tasks are added. This causes the process to take more time to run, there are more errors, and the total processing time grows. Eventually the system is not ready for use in time; when it is required by the business. This inevitably strains the relationship between the departments. Something must be done, but what?

Issue Identification and Denial

Often the development department will take a long time to admit there is an issue. The business will complain that the data is incorrect or not available when required, but this will be assumed by the developers to be an unusual event due to unexpected issues causing the batch to overrun. Production staff such as DBAs who are tasked with supporting the system will try to keep the system running and attempt to optimize the component processes. Eventually, the IT staff will conclude that the system very occasionally processes the batch in a normal fashion, but most of the time it overruns or fails. Rogue processes will be identified and an attempt made to optimize them but often these ignore the underlying problem that there is another process running at the same time that is interfering with it.

The staff who are working on this will take more and more effort to ensure that the overnight batch completes in time, by means of focusing in on the detail. Eventually, however, the truth begins to dawn that the problem is due to the fundamental lack of design in the system.

A Strategy for Initial Issue-Resolution

Once the department has acknowledged there is a design problem it is faced with making a decision of how to resolve it. It is often difficult to be objective about this. The development department now has strained relations with the business, and is very defensive about their system and what the underlying causes are. The staff who are supporting the system are weary of the task and do not wish to admit that all the work they have been putting in has not been successful.

Sometimes, a redesign is considered to be an intolerable solution. Management decide that the system is working in the best way it possibly can, and the only strategy is to increase the hardware, change the product, or use a turnkey solution. The outcome is always very expensive.

  • The hardware route is always a temporary fix and is often prevalent in companies where the system is supported by a division of the infrastructure department.
  • Changing the product (SQL Server replaced by Oracle/DB2 perhaps) means redevelopment of the system and sometimes works, usually because staff who are expert in the product and building batch systems are recruited and the can work with current staff who have a deep knowledge of what is required.
  • Even more expensive can be the acquisition of a product that is marketed as an expert system for the business area. These can take a long time to implement and often do not address the underlying problem and similar optimization processes need to be performed.

A less draconian approach is to split the system up into parts, each running on their own isolated servers. This creates its own difficulties. Firstly, there will be some inevitable duplication of processes: Changes to any process will then need to be implemented on multiple systems. If each system is assigned different data to process in order to avoid this, a new complication appears : synchronization. Some processes will need the completed data from other processes in order to run, and so dealing with timing issues, and data traffic, will become complicated. A typical problem is that of dealing with tasks being scheduled to run at a specific time. This causes two main problems.

  • The data may not be ready for the task to run causing an error and downstream processes needing to be rerun
  • Even if everything runs well there needs to be a gap to allow the previous process to complete. This creates slack in the process which eats into the processing window time. It often means that tasks grow and control features are added to the task, thereby adding complexity to the system.

The Tactical Approach to Resolution

First of all, collect the existing knowledge about the system. The support staff will almost certainly have some statistics about what has been happening. Often they will have reports of long-running SQL statements from the overnight run which they have been looking to optimize. This is a good start but make sure that this is documented. Usually a few processes will be found to be causing major issues due to their long-running. Often these processes are not the underlying cause but are being blocked by others running at the same time. After a few iterations of optimization, the system can be improved. A note should be taken of problem tasks and how long they take to run which will be important information for a strategic solution.

The most important task is to get as much baseline information as possible. How long is a process likely to take? What is unusual? It is a good idea to alter the schedule for a run so that a problem task runs on its own. This will give an idea of how long it needs and also whether or not it is affected by other tasks that are running on the system.

Approach to Resolution – Strategic

To resolve a problem, the first step is to define the problem precisely. Although it is obvious now to everyone that the issue is that the data is not ready, this does not give any information to work from. The definition of the problem needs to be broken down in a reasonably structured way such as this.

Processing Time Window

  1. When does the system need to be available?
  2. When does the business expect the system to be available.

When can the system start processing the data?

  1. Initially often the system needs to be taken down to start processing the data
  2. Later the design will allow some data to be processed when it is available but it is a good idea to start with a fixed processing start time

What is the processing time-window?

  1. The above will define a start and end time for the overnight batch run which is the objective for the system

Task list

Create a list of all tasks within the batch system – excel is a good tool for this. From the existing diagnosis there is probably an idea of the amount of time needed for each task. There needs to be a minimum time, maximum time, and an expected time. The expected time will be the time that is needed for the process to run

Dependencies

Dependencies External

  1. What are the sources for data?
  2. When is the data source data available?
  3. For each of the sources give a time at which the data is available

Which tasks are dependent on source data?

  1. For each task (usually an import task) give a list of the source data needed.

Dependencies Internal

  1. What are the dependencies between tasks?
  2. For each task give a list of the tasks on which it is dependent

This should now give a complete list of dependencies for the system

Monitoring Information

Include some sort of monitoring. This should gather data about each major tasks and especially when the batch system starts and completes, and whether there are any errors. It is a good idea to make this visible to the business, preferably with graphs, so that they can see when the system is available and progress that is being made. It will also take some pressure off support staff and start to rebuild the relationship with the business.

Define Threads

  1. Review the task list and dependencies and combine tasks into threads
  2. Define dependencies at task level for each of the threads.
  3. This will come from the internal and external dependencies but will be identified by thread or task.

Processing time

For each of the tasks assign an expected processing time. This does not need to be accurate as it will be monitored as the system is redesigned.

Initial Review

Now we have enough information for an initial review of the system

We can calculate and expected processing time for each thread and, using the dependencies, calculate when each thread can start.

The dependencies are between tasks within a thread rather than a thread but initially assume that each thread runs when a dependent thread is complete. It will probably be found that this either cannot be completed or runs well outside the processing window. Therefore alter the details by changing the start times to take into account the task dependencies for long-running tasks and also for threads that are dependent on each other.

From this, form a list of the order in which tasks can run. This should give a total time for a single threaded system. This should give an idea of the scale of the problem.

It might be a good idea now to change the system to be more single-threaded and follow the task dependencies given. It will probably still be necessary to schedule processes by time-slot so it may not be possible to run everything in isolation.

From this it should be possible to gather more information about how long each task should take in isolation

System Monitoring

Now we need to gather data about the system to identify in detail what the issues are, and to maintain a history of the processing.

Each task needs to have a log of the start and end time. This should come from the scheduler – the SQL Server Agent maintains this in the history tables but this needs to be extracted into a log table for review.

For long-running or problematic tasks add logging for statements that may be an issue. This should also log the number of rows affected. I create a table called Trace to which I log the start and end time of a task, statements start and end and rows affected. This table has a flexible format and is used for troubleshooting and depends on the format of the data that the developer decides to log.

From this the task list can be updated and the processing time amended

Strategic Review

Now we should have the information we need to redesign the system and to decide how to proceed.

The basis for the redesign will be the task list with processing times and dependencies.

Look at the history and try to get an accurate baseline estimate for the expected processing time per task. There will be anomalous values but it should be clear whether this is due to increased data sizes on particular days or whether the task needs to allow for the extra time.

It is now time to decide on a way forward and these are my suggestions.

Batch System Control

We need to implement a controller to schedule each process. This will include logging the start and end time of each task, dependencies between tasks, the time window for processing for each task and its priority and the logging and alerting of errors. It should also alert when a task is taking longer than expected.

All this logging will be written to a central repository and can be used to control the system. The error and monitoring information from the tasks themselves (Trace table) should be used for troubleshooting rather than for scheduling or control.

The completion of threads can be used as an event to alert downstream systems to the availability of data. This can also be made available to the business to let them ascertain whether batch processing is complete for their data.

The scheduler should take input from a table which contains a list of tasks and dependencies.

For an overnight run, a new batch will be created and this will enable all tasks scheduled within that batch to be run until the batch is complete. Note that multiple batches can be created and dependencies created between them but care should be taken not to make this too convoluted.

Each task will have a time window during which it can run. To solve the problem of External data dependencies, a task detects the availability of that data by one of these methods

  • polling for a file and making a copy for import
  • a time-based event indicating that an external system is available
  • detection of a semaphore from an external system

These tasks will be the start point for a scheduled thread and all other tasks will be ultimately dependent on them. The scheduler could be used for running these tasks and there is an advantage in this as it allows logging to the central repository, polling tasks will only be logged when and event occurs and the task is considered complete. This information is needed as it indicates when the source data is available – important when reviewing schedules and altering time windows.

The scheduler will now chose the next task to run from the dependencies (an import task will usually have a single dependency of the data detection task).

The simplest implementation of this will be to have a single scheduler running in a loop, checking for tasks that have completed and running the next task. This can be synchronous; running a task, logging its start and end then selecting the next task to run. This is usually all that is needed as we now have very little slack because every task now runs as soon as possible, if any task terminates with an error, other tasks will run that are not dependent on the task that was in error. The system will automatically processes the dependent threads when the error is fixed.

This is not always possible though so the next simplest implementation is to split the system into multiple batches and have multiple copies of the scheduler running which service each batch. Beware that certain tasks will be found to be incompatible due to resources needed. Often the simplest way around this is to place these tasks in a single batch so they do not run together.

Eventually it will be discovered that the scheduler needs more intelligence and flexibility.

Each task should be given a resource estimate – I would suggest a 1-5 rating and a total resource available to the system.

Then the scheduler should select a number of tasks that are available to run within the resource limit and execute them all asynchronously. The scheduler will then poll the schedule table to see if any tasks have completed and whether any more can be run. This will require monitoring and regular tweaking to provide optimal performance.

From this implementation we should have a flexible system which provides information about what has run on the system and what is possible for further processing.

Further Enhancements

The above system works best if tasks are small and it is best to redesign these tasks so that they perform single functions. This is especially important with SSIS packages in which there is a tendency to include processing control so that a package becomes a thread. Sometimes this is necessary but usually it is better to take this control out of the package, break it down into smaller tasks and schedule them with the rest of the system. In this way the logging, scheduling and error handling of the scheduler can be leveraged. Otherwise the package will need to handle logging of the individual processes and also possibly implement reprocessing from a failure point.

The processing-time information that you will gain from the schedule-monitoring and the list of task dependencies will probably indicate where certain threads should be reconfigured. It will probably be found that the data that is important to the business can be made available earlier by changing the processing of tasks. For instance, if there is an import report which is needed early in the day that only needs some of the data provided by a thread, then change the tread to optimize for that data, and then perform the rest of the processing later. This may mean that the overall time is longer and it may even require the duplication of processing or output data, but this compromise will have the result of a system that is more suited to the business requirements.

When errors are found in the output of a task, further tests should be included as a final step in a thread before the data is released to the business. This will be scheduled as a task and the scheduler will alert on failure so that the data can be investigated. This means that common errors can be trapped and the department does not have the embarrassment of continually presenting incorrect data, nor the need to manually check output on every run.

Summary

Any batch process must be carefully monitored and documented. It should be designed to allow concurrent processing. Individual tasks should be as simple and ‘atomic’ as possible, with a clear definition of their data dependencies. They should test the results before completion and report errors to the scheduler. It should be possible to change priorities for the order of processing to allow as much completion as possible for the business. The batch process should fail gracefully on error so that only those processes downstream of a task with an error and dependent on data from it are affected.

We started this article with a system which was not performing and not delivering timely data. Have gathered information about that system and made some tactical changes to alleviate the issues then we redesigned the processing control to automatically monitor the system, gather historic processing data, alert on errors and automatically reprocess from where the system failed.

This approach should allow the system to be flexible and predictive. For any new processes an indication of the effect on the total processing time estimated and their dependencies easily enforced without recoding existing processes.

We have identified several stages in the evolution of a batch system

  1. Stage 1 : Ad hoc scheduled tasks
  2. Stage 2 : Optimization of existing tasks
  3. Stage 3 : Re-scheduling of tasks
  4. Stage 4 : Monitoring and alerting of system with respect to deadlines
  5. Stage 5 : Gathering of task performance statistics
  6. Stage 6 : Central schedule and monitoring implementation with dependencies
  7. Stage 7 : Asynchronous and disparate system scheduling with load balancing