Repeatability & Reliability in Copying Data

Posted by jack on July 26, 2015

This post centres on T-SQL development but I apply similar principles in other environments – in Force.com Apex classes for example.

The principles can be summed up as:-

  • If the procedure is not run at the expected time/point in a process, then it can be run later without causing a problem.
  • If a procedure is run repeatedly, the second and subsequent runs do not cause a problem.
  • Make as few assumptions as possible about the external environment.

At this point I suspect most experienced developers will stop reading – if you do not start with these principles, you quickly learn them the hard way. For others, the following is an example of what I mean.

BUSINESS REQUIREMENT: In table SOURCE we have new rows arriving every day, with a single dateTime stamp dateLoaded recording their arrival. That column and the other attributes in the table are non-unique – i.e. we have a “heap” with no identifiable primary key. We need to copy these rows from there to an identically structured TARGET table. There are a lot of rows in both tables, making a complete copy on each update an inefficient process. Both SOURCE and TARGET table structures are a given and may not be amended.

The simplest T-SQL that might achieve this aim, run once daily, could be:

INSERT INTO TARGET SELECT * FROM SOURCE WHERE dateLoaded > getDate()-1

[In T-SQL the getDate() function returns the current dateTime; T-SQL keywords are not case-sensitive, and under the usual default collation neither are identifiers]

Using this would be a bad idea for these reasons:-

  1. If you ran it twice in succession, it would load the same set of records into TARGET again.
  2. If you missed running it one day, it would not recover – the missed records would remain missed for all time, or at least until the TARGET is re-loaded from scratch.
  3. If the server clock is adjusted or the job server delays or brings forward a job, then missed/duplicated rows might arise without notice.
  4. Using * in the query to represent all columns will work but is bad practice – if the columns are later amended in either or both tables, or are in a different order in the two tables, then either the query will fail or (worse) could map incorrectly between tables. For the sake of brevity the remainder of this post uses COLUMNLIST, but always substitute the actual column names (an example with made-up column names follows this list).
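To illustrate that last point, here is the naive query again with an explicit column list. The column names customerRef and amount are invented purely for illustration – substitute your own; only the column handling changes, the other problems listed above remain.

INSERT INTO TARGET (dateLoaded, customerRef, amount)

SELECT dateLoaded, customerRef, amount FROM SOURCE WHERE dateLoaded > getDate()-1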

Here is an improved version of the code that could overcome these objections:-

INSERT INTO TARGET (COLUMNLIST)

SELECT COLUMNLIST FROM SOURCE WHERE dateLoaded > (SELECT MAX(dateLoaded) FROM TARGET)

This revision will not load the same rows twice, and if it is not run for a while then the next time it runs it will catch up. The drawback is that the sub-query (SELECT MAX(dateLoaded) FROM TARGET) will slow the operation down – especially if there is no index on dateLoaded in the two tables. Some options present themselves here:-

  • If permitted, add indexes to the dateLoaded columns on the two tables (a sketch follows this list)
  • Apply a relative age limit on the oldest date you expect to include
  • Create a table to retain the last time the job was run and reference that in place of the sub-query
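For the first option, a minimal sketch – assuming you are free to add indexes, and with index names of my own invention – might be:

CREATE NONCLUSTERED INDEX IX_SOURCE_dateLoaded ON SOURCE (dateLoaded)

CREATE NONCLUSTERED INDEX IX_TARGET_dateLoaded ON TARGET (dateLoaded)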

To expand on the second of these – amending the query by adding a final clause AND dateLoaded > getDate()-5 will not invalidate the process, provided you are 100% sure that within 4 days you will pick up any issues and get them fixed.
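As a sketch, the improved query above would then become:

INSERT INTO TARGET (COLUMNLIST)

SELECT COLUMNLIST FROM SOURCE WHERE dateLoaded > (SELECT MAX(dateLoaded) FROM TARGET) AND dateLoaded > getDate()-5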

Better, provided you have the ability, is to retain a dateTime entry of your own (the simplest for our example is a control table in the SQL database, but it could be another type of persistent storage).

For example, a single-row table SYNCHLOG with a single dateTime column lastUpdate:
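As a one-off set-up, a minimal definition for that control table might be – the seed date is arbitrary, anything older than your existing data will do:

CREATE TABLE SYNCHLOG (lastUpdate dateTime NOT NULL)

INSERT INTO SYNCHLOG (lastUpdate) VALUES ('19000101')

The load itself then becomes: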

UPDATE SYNCHLOG SET lastUpdate = (SELECT MAX(dateLoaded) FROM TARGET)

INSERT INTO TARGET (COLUMNLIST)

SELECT COLUMNLIST FROM SOURCE WHERE dateLoaded > (SELECT MAX(lastUpdate) FROM SYNCHLOG)

UPDATE SYNCHLOG SET lastUpdate = (SELECT MAX(dateLoaded) FROM TARGET)

Notice here I have included what might be considered some redundancy – if at the END of this run we are setting the value of SYNCHLOG.lastUpdate, why do I need to repeat the process on the next run? My reason is that I cannot be 100% sure the value in my control table has not been amended by some other process in between runs – so, just to be sure, I set it again before the data load.

In practice I would create and drop a temporary table for this purpose – unless I wanted to elevate the control table to be a permanent log table instead – for example holding a record of the number of rows added each run.
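As a sketch of that last idea – the table and column names here are my own invention – the run history could be captured like this, reading @@ROWCOUNT immediately after the INSERT into TARGET:

CREATE TABLE SYNCHLOG_HISTORY (runDate dateTime NOT NULL, rowsAdded int NOT NULL)

DECLARE @rowsAdded int;

[HERE RUN THE INSERT INTO TARGET … SELECT … FROM SOURCE STEP FROM ABOVE]

SET @rowsAdded = @@ROWCOUNT;

INSERT INTO SYNCHLOG_HISTORY (runDate, rowsAdded) VALUES (getDate(), @rowsAdded);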

CHECKING YOUR WORK

I would also consider putting in a check to verify all is working as expected – for example, in this case we might do the following:

DECLARE @countTargetRecords bigint;

DECLARE @countSourceRecords bigint;

DECLARE @timeCutoff dateTime;

[HERE RUN THE LOAD PROCESS….]

SELECT @timeCutoff = MAX(dateLoaded) FROM TARGET

SELECT @countTargetRecords = COUNT(*) FROM TARGET

SELECT @countSourceRecords = COUNT(*) FROM SOURCE WHERE dateLoaded <= @timeCutoff

IF  @countSourceRecords <> @countTargetRecords

BEGIN

[…. raise the alarm….. ]

END

Notice that I included a time cutoff – this is to cover the situation where the SOURCE table is continuously getting new records – without a cutoff the alarm might be raised incorrectly due to these new records.
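As one option for the "raise the alarm" step inside that BEGIN … END block – the message text is just a placeholder of my own, and in most set-ups an error of severity 16 will make an Agent job step report failure – something like this would do:

DECLARE @msg varchar(200);

SET @msg = 'Row count mismatch after load: SOURCE=' + CAST(@countSourceRecords AS varchar(20)) + ', TARGET=' + CAST(@countTargetRecords AS varchar(20));

RAISERROR(@msg, 16, 1);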
