Repeatability & Reliability in Copying Data

Posted by jack on July 26, 2015

This post centres on T-SQL development but I apply similar principles in other environments – in Force.com Apex classes for example.

The principles can be summed up as:-

  • If the procedure is not run at the expected time or point in a process, it can be run later without causing a problem.
  • If a procedure is run repeatedly, the second and subsequent runs do not cause a problem.
  • Make as few assumptions as possible about the external environment.

At this point I suspect most experienced developers will stop reading – if you do not start with these principles, you quickly learn them the hard way. For others, the following is an example of what I mean.

BUSINESS REQUIREMENT: In table SOURCE we have new rows arriving every day, with a single dateTime stamp dateLoaded recording their arrival. That column and the other attributes in the table are non-unique – ie we have a “heap” with no identifiable primary key. We need to copy these rows from there to an identically structured TARGET table. There are a lot of rows in both tables, making a complete copy on each update inefficient. Both SOURCE and TARGET table structures are a given and may not be amended.
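For concreteness, a minimal sketch of what such a pair of heaps might look like – the attribute columns here are invented for illustration, since the real structures are a given:

-- illustrative structures only: note there is no primary key, so both tables are heaps
CREATE TABLE SOURCE (dateLoaded dateTime NOT NULL, attribute1 varchar(50) NULL, attribute2 int NULL)
CREATE TABLE TARGET (dateLoaded dateTime NOT NULL, attribute1 varchar(50) NULL, attribute2 int NULL)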

The simplest T-SQL that might achieve this aim daily could be:

INSERT INTO TARGET SELECT * FROM SOURCE WHERE dateLoaded > getDate()-1

[In T-SQL the getDate() function returns the current dateTime; T-SQL is not case-sensitive] 

Using this would be a bad idea for these reasons:-

  1. If you ran it twice in succession, it would load the same set of records into TARGET again.
  2. If you missed running it one day, it would not recover – the missed records would remain missed for all time, or at least until TARGET is re-loaded from scratch.
  3. If the server clock is adjusted, or the job server delays or brings forward a job, then missed or duplicated rows might arise without notice.
  4. Using * in the query to represent all columns will work but is bad practice – if the columns are later amended in either or both tables, or are in a different order in the two tables, then either the query will fail or (worse) could map incorrectly between the tables. For the sake of brevity the remainder of this post uses COLUMNLIST, but always substitute the actual column names.

Here is improved code that could overcome these objections:-

INSERT INTO TARGET(COLUMNLIST)
SELECT COLUMNLIST FROM SOURCE WHERE dateLoaded > (SELECT MAX(dateLoaded) FROM TARGET)

This revision will not load the same rows twice, and if it is not run for a while then the next time it runs it will catch up. The drawback is that the sub-query (SELECT MAX(dateLoaded) FROM TARGET) will slow the operation – especially if there is no index on dateLoaded in the two tables. Some options present themselves here:

  • If permitted, add indexes to the dateLoaded columns on the two tables
  • Retain a relative age limit on the oldest date you expect to include
  • Create a table to retain the last time the job was run and reference that in place of the sub-query
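To take the first of these – assuming you are in fact permitted to change the tables – a minimal sketch (the index names are my own):

CREATE INDEX IX_SOURCE_dateLoaded ON SOURCE (dateLoaded)
CREATE INDEX IX_TARGET_dateLoaded ON TARGET (dateLoaded)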

To expand on the second of these – amending the query by adding a final clause AND dateLoaded > getDate()-5 will not invalidate the process, provided you are 100% sure that within 4 days you will pick up any issues and get them fixed. For example:
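INSERT INTO TARGET(COLUMNLIST)
SELECT COLUMNLIST FROM SOURCE
WHERE dateLoaded > (SELECT MAX(dateLoaded) FROM TARGET)
AND dateLoaded > getDate()-5 -- relative age limit: never look back more than 5 days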

Better, provided you have the ability, is to retain a dateTime entry recording the last load point (simplest for our example is a control table in the SQL database, but it could be another type of persistent storage).

For example, a single-row table SYNCHLOG with a single dateTime column lastUpdate.
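A one-off setup sketch – the seed value here is an assumption, anything older than your oldest data will do:

CREATE TABLE SYNCHLOG (lastUpdate dateTime NOT NULL)
INSERT INTO SYNCHLOG (lastUpdate) VALUES ('19000101') -- seed earlier than any real dateLoaded

The daily load then becomes: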

UPDATE SYNCHLOG SET lastUpdate = (SELECT MAX(dateLoaded) FROM TARGET)

INSERT INTO TARGET(COLUMNLIST)
SELECT COLUMNLIST FROM SOURCE WHERE dateLoaded > (SELECT MAX(lastUpdate) FROM SYNCHLOG)

UPDATE SYNCHLOG SET lastUpdate = (SELECT MAX(dateLoaded) FROM TARGET)

Notice here I have included what might be considered some redundancy – if at the END of this run we are setting the value of SYNCHLOG.lastUpdate, why do I need to repeat the process on the next run? My reason is that I cannot be 100% sure that between runs the value in my control table has not been amended by some other process – so, just to be sure, I set it again before the data load.

In practice I would create and drop a temporary table for this purpose – unless I wanted to elevate the control table to a permanent log table instead, for example holding a record of the number of rows added on each run.
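A sketch of that permanent-log variant – the extra columns (runAt, rowsAdded) are my own suggestion, not a requirement:

CREATE TABLE SYNCHLOG (runAt dateTime NOT NULL, lastUpdate dateTime NOT NULL, rowsAdded int NOT NULL)

DECLARE @rowsAdded int;

INSERT INTO TARGET(COLUMNLIST)
SELECT COLUMNLIST FROM SOURCE
WHERE dateLoaded > (SELECT MAX(lastUpdate) FROM SYNCHLOG)
SET @rowsAdded = @@ROWCOUNT -- capture immediately: the next statement resets it

INSERT INTO SYNCHLOG (runAt, lastUpdate, rowsAdded)
SELECT getDate(), MAX(dateLoaded), @rowsAdded FROM TARGET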

CHECKING YOUR WORK

I would also consider putting in a check to verify all is working as expected – for example, in this case we might do this check:

DECLARE @countTargetRecords bigint;
DECLARE @countSourceRecords bigint;
DECLARE @timeCutoff dateTime;

-- [HERE RUN THE LOAD PROCESS....]

SELECT @timeCutoff = MAX(dateLoaded) FROM TARGET
SELECT @countTargetRecords = COUNT(*) FROM TARGET
SELECT @countSourceRecords = COUNT(*) FROM SOURCE WHERE dateLoaded <= @timeCutoff

IF @countSourceRecords <> @countTargetRecords
BEGIN
    -- [.... raise the alarm..... ]
END

Notice that I included a time cutoff – this is to cover the situation where the SOURCE table is continuously getting new records – without a cutoff the alarm might be raised incorrectly due to these new records.


Cast Iron – returning SQL query results that match multiple key values

Posted by jack on July 7, 2015

This post concerns an annoying limitation of the Database Query activities when compared to the Salesforce equivalent Query Objects – namely that the latter allows the input of a multi-row selection parameter using a SOQL IN clause, whilst the Database Query SQL activity does not. This is particularly obvious when (as in my case) your task is to take a bunch of Salesforce records and check them against the database for updates.

Other Cast Iron users point to the Data Quality Lookup activity as a workaround, but I have a better approach using stored procedures. The solution is a stored procedure that accepts its input as a comma-separated list of values in a single string and internally converts that input to the key values to match in the query; a database Call Procedure activity then collects the matching rows.

To create the input parameter string for the Call Procedure input I used a For Each loop upstream, accumulating the key values from the Salesforce data (in my case I actually used a SPLIT of that data to handle larger data volumes). Remember to remove the initial comma (or the ending one, depending on how you build the string) after the For Each loop.

Derivation of a suitable stored procedure is documented at http://www.sommarskog.se/arrays-in-sql-2005.html#XML – this is specific to Microsoft SQL Server 2005 (which is what I am using); for 2000 there is a (slower, worse) alternative, and for later versions there are more elegant and quicker options. No doubt there are similar approaches for Oracle and DB2.
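As a rough sketch of the idea using the XML technique from that link – the view name (dbo.vw_AccountMatch) and key column (externalId) are placeholders for illustration, not a real schema:

CREATE PROCEDURE dbo.usp_GetRowsByKeyList
    @keyList varchar(max) -- eg 'A001,A002,A003' built by the upstream For Each loop
AS
BEGIN
    SET NOCOUNT ON;
    -- turn the comma separated list into XML: <k>A001</k><k>A002</k>...
    -- (assumes the values contain no XML special characters such as & or <)
    DECLARE @x xml;
    SET @x = CAST('<k>' + REPLACE(@keyList, ',', '</k><k>') + '</k>' AS xml);

    -- return the rows whose key appears in the list
    SELECT v.*
    FROM dbo.vw_AccountMatch AS v
    WHERE v.externalId IN (
        SELECT t.c.value('.', 'varchar(50)')
        FROM @x.nodes('/k') AS t(c)
    );
END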

Since I already had a view, used in an earlier Cast Iron implementation where each individual Salesforce record was matched using a For Each loop one record at a time (ie very slow), all I had to do was include that view in my stored procedure and return the matched rows (see this post on use of the Copy Of function to get the data from a stored procedure).

The net result of this is now two datasets for comparison:

  • The Salesforce Objects that require checking for update
  • The matching database values to use for that check and provide updates.

My next post will concern approaches to the comparison of two datasets in Cast Iron; as usual, I will avoid the use of more than simple javascript functions in the solution, making the most of built-in Cast Iron capabilities.

SQL Function – sensible display formats depending on the size of the number

Posted by jack on June 20, 2012

I am working on dashboards – this requires the interface to display a wide range of numbers in sensible formats specific to each row, ie a mix of currency amounts and counts of units ranging from under 10 to over a million. I want to display these in different formats depending on the amount. We cannot do this in the end-user tool I am using to generate the user interface (well, not without a lot of proprietary coding I would have to learn and repeat on every dashboard), so I wrote this SQL function to convert an input number into a text display value.
CREATE FUNCTION [dbo].[numberToScaledText]
(
@testValue float
)
RETURNS VARCHAR(255)
AS
BEGIN
DECLARE @strOut Varchar(255)

-- pick a display scale based on the size of the number
SELECT @strOut =
    CASE WHEN @testValue > 1E7 THEN
        -- tens of millions and up: one decimal place of millions, eg 15.5mill
        CAST(ROUND(CAST(@testValue AS bigint)/1E6, 1) AS varchar(50)) + 'mill'
    ELSE CASE WHEN @testValue > 1E6 THEN
        CAST(ROUND(CAST(@testValue AS bigint)/1E6, 1) AS varchar(50)) + 'mill'
    ELSE CASE WHEN @testValue > 1E3 THEN
        -- thousands: one decimal place, eg 5.4k
        CAST(ROUND(CAST(@testValue AS int)/1E3, 1) AS varchar(50)) + 'k'
    ELSE CASE WHEN @testValue > 100 THEN
        CAST(CAST(@testValue AS int) AS varchar(6))
    ELSE
        CAST(ROUND(@testValue, 1) AS varchar(20))
    END END END END

RETURN @strOut
END

GO
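For example:

SELECT dbo.numberToScaledText(15500000) -- 15.5mill
SELECT dbo.numberToScaledText(5432)     -- 5.4k
SELECT dbo.numberToScaledText(7.26)     -- 7.3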

New SQL Server function – sentenceCase

Posted by jack on May 16, 2012

An extension of previous versions to cover conversion of all-upper-case text to normal sentence case. If you have embedded proper nouns or initials it won’t give exact results, but it is a lot prettier than a lot of shouting….

USE [DataMart]
GO
/****** Object:  UserDefinedFunction [dbo].[sentenceCase]    Script Date: 05/16/2012 12:39:55 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
--lifted from
--http://classicasp.aspfaq.com/general/how-do-i-convert-a-name-to-proper-case.html
CREATE FUNCTION [dbo].[sentenceCase]
(
@strIn VARCHAR(255)
)
RETURNS VARCHAR(255)
AS
BEGIN
IF @strIn IS NULL
    RETURN NULL

DECLARE
    @strOut VARCHAR(255),
    @i INT,
    @Up BIT,
    @c VARCHAR(1),
    @cminus1 VARCHAR(1)

SELECT
    @strOut = '',
    @i = 1,
    @Up = 1

WHILE @i <= DATALENGTH(@strIn)
BEGIN
    SET @c = SUBSTRING(@strIn, @i, 1)
    IF @i > 1
        SET @cminus1 = SUBSTRING(@strIn, @i - 1, 1)

    -- a full stop (or a space following one) means the next letter starts a sentence;
    -- the apostrophe case is carried over from the name-casing original
    IF @c IN ('.', '''') OR (@c = ' ' AND @cminus1 = '.')
    BEGIN
        SET @strOut = @strOut + @c
        SET @Up = 1
    END
    ELSE
    BEGIN
        IF @Up = 1
            SET @c = UPPER(@c)
        ELSE
            SET @c = LOWER(@c)

        SET @strOut = @strOut + @c
        SET @Up = 0
    END
    SET @i = @i + 1
END
RETURN @strOut
END
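For example:

SELECT dbo.sentenceCase('THIS IS SHOUTING. IT REALLY IS.')
-- returns: This is shouting. It really is.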


Dynamic SQL layout

Posted by jack on March 30, 2012

A simple thing, this realisation – in forming a SQL statement within a sproc, any tabs, line breaks and spaces you include in the built-up SQL query are accurately reproduced in the query variable. This makes it MUCH easier to debug the variable content whilst building up the query.

Here is my example, which involves building up a pivoted dataset by successively joining the entries in one table N times for the N different values of one attribute field (category). The point of the post is not the actual query, but the layout of the SQL code in the variable @viewSQL [to see this, replace exec(@viewSQL) with print @viewSQL at the end].

USE [Interactions]
GO
/****** Object:  StoredProcedure [dbo].[sp_inspectionImportOverallPivot]    Script Date: 03/30/2012 15:51:49 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
/*
create proc sp_inspectionImportOverallPivot  31Mar2012
successively joins rows from the table
renaming the repeated key fields in each sub-view
and ends up with a view we can use for loading to salesforce that pivots the
inspectionImportOverall tabName entries
*/
CREATE proc [dbo].[sp_inspectionImportOverallPivot] as
declare @sectionCount int;
declare @counter int;
declare @viewSQL varchar(8000);
declare @tabName varchar(255);
declare @viewName varchar(255);
declare @firstViewName varchar(255);
select @counter = 1
If Exists (Select 'True' from INFORMATION_SCHEMA.TABLES
Where Table_Name = 'vw_inspectionImportOverallPivot')
Drop View vw_inspectionImportOverallPivot

-- the line breaks and spaces inside these literals are reproduced verbatim in
-- @viewSQL, which keeps the generated view readable when printed
select @viewSQL =
'Create View vw_inspectionImportOverallPivot as
Select * from (
'
select @sectionCount = count(distinct TabName)
from dbo.InspectionsImportOverall
while @counter <= @sectionCount
begin
    -- pick the @counter-th distinct tab name
    Select @tabName = TabName from (
        Select ROW_NUMBER() OVER(ORDER BY TabName DESC) AS TabNumber, TabName
        from (select distinct TabName from dbo.InspectionsImportOverall) D
    ) X where TabNumber = @counter
    Select @viewName = 'v' + replace(replace(@tabName, ' ', ''), '&', '')
    Select @firstViewName = Case When @counter = 1 then @viewName else @firstViewName END

    select @viewSQL = @viewSQL + '    SELECT  Account_ID as Ac' + Cast(@Counter as Varchar(3)) + ',
    Inspector_ID as Ins' + Cast(@Counter as Varchar(3)) + ', InspectionDate as IDate' + Cast(@Counter as Varchar(3)) + ',
    Score as ' + replace(replace(@tabName, ' ', ''), '&', '') + 'Score from dbo.InspectionsImportOverall where TabName = '''
        + @tabName + ''' ) ' + @viewName

    select @viewSQL = @viewSQL + Case When @Counter > 1 then '
    ON ' + @firstViewName + '.Ac1 = ' + @viewName + '.Ac' + Cast(@Counter as Varchar(3)) + ' AND
    ' + @firstViewName + '.Ins1 = ' + @viewName + '.Ins' + Cast(@Counter as Varchar(3)) + ' AND
    ' + @firstViewName + '.IDate1 = ' + @viewName + '.IDate' + Cast(@Counter as Varchar(3))
    else '' END

    select @viewSQL = @viewSQL + Case when @counter = @sectionCount then '' else '
    LEFT OUTER JOIN (
    ' END

    select @counter = @counter + 1
end
exec (@viewSQL)
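A quick way to exercise it – as noted above, swap exec(@viewSQL) for print @viewSQL first if you want to inspect the generated layout:

exec dbo.sp_inspectionImportOverallPivot
select top 10 * from vw_inspectionImportOverallPivot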


A useful SQL Server custom function to proper-case a name

Posted by jack on March 30, 2012

Custom SQL functions are a pretty handy way to cover common data cleaning activities. I wrote this a while back but find it a really useful function – I expect you can find better versions out there, and this one is anglo-centric:-

USE [DataMart]
GO
/****** Object:  UserDefinedFunction [dbo].[ProperCaseName]    Script Date: 03/30/2012 15:44:43 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE Function [dbo].[ProperCaseName](@InputString Nvarchar(255))
Returns NVarchar(255)
As
Begin
Declare @ProperCaseString NVarchar(255)
Declare @counter int
Declare @startPosition int
Set @counter = 0
Set @ProperCaseString = LTRIM(RTRIM(@InputString))
--upper-case the first character, lower-case the rest
Set @ProperCaseString = Upper(Substring(@ProperCaseString,1,1)) + Lower(Substring(@ProperCaseString,2,254))
--check for and eliminate duplicated intervening spaces
doubleSpacesCheck:
Set @counter = Charindex('  ', @ProperCaseString)
If @counter > 0
    Set @ProperCaseString = Replace(@ProperCaseString, '  ', ' ')
If @counter > 0 goto doubleSpacesCheck
--find position of intervening spaces and proper case the words
Set @startPosition = 1
interveningSpacesCheck:
Set @counter = Charindex(' ', @ProperCaseString, @startPosition)
If @counter > 0
    Set @ProperCaseString = Substring(@ProperCaseString, 1, @counter)
        + Upper(Substring(@ProperCaseString, @counter+1, 1)) + Substring(@ProperCaseString, @counter+2, 255)
Set @startPosition = @counter + 1
If @counter > 0 goto interveningSpacesCheck
--find position of intervening dashes and proper case the words
Set @startPosition = 1
interveningDashesCheck:
Set @counter = Charindex('-', @ProperCaseString, @startPosition)
If @counter > 0
    Set @ProperCaseString = Substring(@ProperCaseString, 1, @counter)
        + Upper(Substring(@ProperCaseString, @counter+1, 1)) + Substring(@ProperCaseString, @counter+2, 255)
Set @startPosition = @counter + 1
If @counter > 0 goto interveningDashesCheck
Return (@ProperCaseString)
End;
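For example:

SELECT dbo.ProperCaseName('  mary-jane   SMITH ')
-- returns: Mary-Jane Smith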

Using Cast Iron to integrate Salesforce to the data warehouse

Posted by jack on February 12, 2012

I integrate Salesforce Enterprise to the SQL Server based data warehouse using IBM Websphere Cast Iron Cloud Edition.

Cast Iron includes a built-in connector for Salesforce and what they term a “secure connector” – an applet you install inside the firewall on the secured LAN, which connects out to the Cast Iron service on the Internet using a secure key, enabling encrypted data communications without making holes in firewalls.

Synchronising the data warehouse with Salesforce

My normal approach is a standard data warehouse one; each Salesforce object is mirrored by a staging table in the data warehouse that has the same data elements as the Salesforce object.

Cast Iron is configured to synchronise data between Salesforce and staging tables at regular intervals.
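As an illustration, the staging table mirroring (say) the Account object might look like this – the field list is abridged and the names are my own convention, not a Cast Iron requirement:

-- hypothetical staging table for the Salesforce Account object (fields abridged)
CREATE TABLE stg_Account (
    Id               char(18)      NOT NULL PRIMARY KEY, -- Salesforce record Id
    Name             nvarchar(255) NULL,
    LastModifiedDate dateTime      NULL -- used to pick up deltas on each synchronisation
)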

Errors in copying from Salesforce are posted to a custom error object in Salesforce (why? Because one possible error is loss of availability of the data warehouse – not much point trying to post to the data warehouse in those circumstances).

Copying data from the data warehouse to Salesforce

Customer transactions are copied into the data warehouse from other applications, and from there loaded into Salesforce.

Cast Iron  triggers stored procedures to start a load process (or we can use SQL Server Agent or another local mechanism – but it is sometimes neater to keep all the data integration processes within Cast Iron).

Errors in this case are posted to the data warehouse (since one possible error is loss of availability of Salesforce – not much point trying to post there in those circumstances).

Closed loop feedback

All applications are occasionally subject to non-availability – in these circumstances some records may get missed from synchronisation. For critical records I deploy a feedback loop along these lines:

  • Source record changes are synchronised to the target
  • The target is itself a source for a second synchronisation of a skeleton record (Primary key and datetime stamps only)
  • The target for this skeleton is alongside the original source records
  • A regular process re-sends any record updates included in the source but not the target skeleton records.

Since data synchronisation over the Internet takes place over an interval, this is not an instant correction, but I have found it works well where a one-hour delay in synchronisation accuracy is acceptable.
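As a sketch of that final re-send check in T-SQL – the table and column names here are illustrative only, not a real schema:

-- source records whose latest change never came back in the skeleton copy
SELECT s.recordId, s.lastModified
FROM dbo.CriticalSource AS s
LEFT JOIN dbo.TargetSkeleton AS k
    ON  k.recordId = s.recordId
    AND k.lastModified >= s.lastModified
WHERE k.recordId IS NULL -- these are the updates to re-send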