Repeatability & Reliability in Copying Data

Posted by jack on July 26, 2015

This post centres on T-SQL development, but I apply similar principles in other environments – in Force.com Apex classes, for example.

The principles can be summed up as:-

  • If the procedure is not run at the expected time/point in a process, then it can be run later without causing a problem.
  • If a procedure is run repeatedly, the second and subsequent runs do not cause a problem.
  • Make as few assumptions as possible about the external environment.

At this point I suspect most experienced developers will stop reading – if you do not start with these principles, you quickly learn them the hard way. For others, the following is an example of what I mean.

BUSINESS REQUIREMENT: In table SOURCE we have new rows arriving every day, with a single dateTime stamp dateLoaded recording their arrival. That column and the other attributes in the table are non-unique – i.e. we have a “heap” with no identifiable primary key. We need to copy these rows from there to an identically structured TARGET table. There are a lot of rows in both tables, making a complete copy on each update an inefficient process. Both SOURCE and TARGET table structures are a given and may not be amended.

The simplest T-SQL that might achieve this aim daily could be

INSERT INTO TARGET SELECT * FROM SOURCE WHERE dateLoaded > getDate()-1

[In T-SQL the getDate() function returns the current dateTime; T-SQL is not case-sensitive] 

Using this would be a bad idea for these reasons:-

  1. If you ran it twice in succession, it would load the same set of records into TARGET again.
  2. If you missed running it one day, it would not recover – the missed records would remain missed for all time, or at least until the TARGET is re-loaded from scratch.
  3. If the server clock is adjusted or the job server delays or brings forward a job, then missed/duplicated rows might arise without notice.
  4. Using * in the query to represent all columns will work, but it is bad practice – if the columns are later amended in either or both tables, or are in a different order in the two tables, then either the query will fail or (worse) could map columns incorrectly between the tables. For the sake of brevity, the remainder of this post uses COLUMNLIST, but always substitute the actual column names.

Here is improved code that overcomes these objections:-

INSERT INTO TARGET(COLUMNLIST)

SELECT COLUMNLIST FROM SOURCE WHERE dateLoaded > (SELECT MAX(dateLoaded) FROM TARGET)

This revision will not load the same rows twice, and if it is not run for a while then the next time it runs it will catch up. The drawback is that the sub-query (SELECT MAX(dateLoaded) FROM TARGET) will slow the operation – especially if there is no index on dateLoaded in the two tables. Some options present themselves here:-

  • If permitted, add indexes to the dateLoaded columns on the two tables
  • Retain a relative age limit on the oldest date you expect to include
  • Create a table to retain the last time the job was run and reference that in place of the sub-query
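Taking the first of these options, the indexes would be along these lines (the index names are my own):-

CREATE INDEX IX_SOURCE_dateLoaded ON SOURCE (dateLoaded)

CREATE INDEX IX_TARGET_dateLoaded ON TARGET (dateLoaded)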

To expand on the second of these – amending the query by adding a final clause AND dateLoaded > getDate()-5 will not invalidate the process provided you are 100% sure that within 4 days you will pick up any issues and get them fixed.
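For example, combining the high-water-mark sub-query with a relative age limit (COLUMNLIST again standing in for the real column names):-

INSERT INTO TARGET(COLUMNLIST)

SELECT COLUMNLIST FROM SOURCE
WHERE dateLoaded > (SELECT MAX(dateLoaded) FROM TARGET)
AND dateLoaded > getDate()-5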

Better, provided you have the ability, is to retain a dateTime entry of your own (simplest for our example is a control table in the SQL database, but it can be another type of persistent storage).

For example, a single-row table SYNCHLOG with a single dateTime column lastUpdate.
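A minimal sketch of creating and seeding it might look like this – the seed date is just an arbitrary value far enough in the past that the first run picks up everything:-

CREATE TABLE SYNCHLOG (lastUpdate dateTime NOT NULL)

INSERT INTO SYNCHLOG (lastUpdate) VALUES ('19000101')

The load process then becomes:-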

UPDATE SYNCHLOG SET lastUpdate = (SELECT MAX(dateLoaded) FROM TARGET)

INSERT INTO TARGET(COLUMNLIST)

SELECT COLUMNLIST FROM SOURCE WHERE dateLoaded > (SELECT MAX(lastUpdate) FROM SYNCHLOG)

UPDATE SYNCHLOG SET lastUpdate = (SELECT MAX(dateLoaded) FROM TARGET)

Notice that I have included what might be considered some redundancy – if at the END of this run we are setting the value of SynchLog.lastUpdate, why do I need to repeat the process on the next run? My reason is that I cannot be 100% sure that between runs the value in my control table has not been amended by some other process – so, just to be sure, I set it again before the data load.

In practice I would create and drop a temporary table for this purpose – unless I wanted to elevate the control table to be a permanent log table instead – for example holding a record of the number of rows added each run.
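As a sketch of that log-table variant – LOADLOG and its columns are my own invention here – each run might append a row recording the cutoff used and the number of rows copied:-

DECLARE @cutoff dateTime
DECLARE @rowsAdded int

SELECT @cutoff = MAX(dateLoaded) FROM TARGET

INSERT INTO TARGET(COLUMNLIST)
SELECT COLUMNLIST FROM SOURCE WHERE dateLoaded > @cutoff

SET @rowsAdded = @@ROWCOUNT

INSERT INTO LOADLOG (runDate, cutoffUsed, rowsAdded)
VALUES (getDate(), @cutoff, @rowsAdded)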

CHECKING YOUR WORK

I would also consider putting in a check to verify all is working as expected – for example, in this case we might do this check

DECLARE @countTargetRecords bigint;

DECLARE @countSourceRecords bigint;

DECLARE @timeCutoff dateTime;

[HERE RUN THE LOAD PROCESS….]

SELECT @timeCutoff = MAX(dateLoaded) FROM TARGET

SELECT @countTargetRecords = COUNT(*) FROM TARGET

SELECT @countSourceRecords = COUNT(*) FROM SOURCE WHERE dateLoaded <= @timeCutoff

IF  @countSourceRecords <> @countTargetRecords

BEGIN

[…. raise the alarm….. ]

END

Notice that I included a time cutoff – this covers the situation where the SOURCE table is continuously receiving new records; without a cutoff, the alarm might be raised incorrectly because of rows that arrived after the load completed.


Salesforce.com – so much more than just a CRM

Posted by jack on April 26, 2013

One issue that deters organisations from adopting Salesforce is the higher cost of licences – if you do some simple arithmetic you quickly end up with a formidable number, especially when comparing Salesforce Enterprise with much of the competition. And the SaaS model means that after spending all that money you don’t even own a copy of the software!

There are a number of reasons why this kind of comparison can be misleading. The phrase “perpetual licence” may legally mean effective ownership, but in practical terms, unless you have support and maintenance in place (normally costing 20–30% of the original box price per year), the value of what you purchased quickly decays. Even with support and maintenance it is easy to get trapped on old versions of software because of the cost of upgrades – one client I worked for had a highly configured Pivotal installation that ran their business and could not upgrade it. They were running CRM without email integration!

Like much of IT, software packages of any type do not have as long a life as other assets – which is why accountants tend to write them off over many fewer years than they allow for fixed plant, for example. With SaaS like Salesforce (other brands are available) you are always on the latest version for as long as you subscribe.

But the accounting cost of software is not the theme I want to address here; rather, it is how Salesforce can quickly become a key component well outside the core CRM use when integrated with other systems, and how that in turn can enrich the CRM. My example is a recent development for a client selling hotel rooms, but it can apply to many other businesses.

Hotel rooms can be booked using a wide variety of rate codes; some rates include or exclude items like breakfast or late check-out, some are exclusive to specific consumers or travel agency groups, and some are special offers (effectively sale prices). The number and variety of rates is constantly changing as everyone concerned seeks new ways to grow the business.

As the use of these rates changes, so do the information reporting requirements for monitoring performance. With several thousand rates in use, rather than individually select every rate required in every different report, the client’s data warehouse has evolved a set of reporting flags and descriptions to allow rapid production and one-point maintenance of such reports.

But the data warehouse has no user front end, so each time a new rate is required, or an existing one is moved into or out of a selected group of rates for a specific requirement, the poor data warehouse manager must be asked to adjust the values in this reference data – and I have an aversion to being embedded in business processes, repeatedly doing simple, dull SQL updates. Other users were asking the report writer to manually include/exclude rates to get the reports they wanted quickly – neglecting the reference data and causing reports that were intended to be comparable to end up including different sets of rates.

In essence, a single table of reference data can be represented as a single object in Salesforce Enterprise – so I could rapidly design a new object based on the data fields in the SQL table, using entirely drag & drop customisation. The natural candidate for the Salesforce object Name is the rate code itself (as that is the one fixed value in the table), but Salesforce does not allow a Unique constraint to be imposed on object Names, so I did have to apply a trigger to prevent duplicates. Here is the code for my trigger (Rate_Code_Description__c is the name of the custom object):-

trigger rateCodeDuplicatePreventer on Rate_Code_Description__c (before insert, before update) {

    Map<String, Rate_Code_Description__c> rateCodeMap = new Map<String, Rate_Code_Description__c>();
    for (Rate_Code_Description__c rateCode : System.Trigger.new) {

        // Make sure we do not treat a rate code that is not changed during an update as a duplicate
        if ((rateCode.Name != null) && (System.Trigger.isInsert ||
            (rateCode.Name != System.Trigger.oldMap.get(rateCode.Id).Name))) {
            // Make sure another new rateCode in the same batch isn't also a duplicate
            if (rateCodeMap.containsKey(rateCode.Name)) {
                rateCode.Name.addError('Duplicate rate code in the same batch');
            } else {
                rateCodeMap.put(rateCode.Name, rateCode);
            }
        }
    }
    // Using a single query, find all the rate codes with the same name that already exist
    for (Rate_Code_Description__c rateCode : [SELECT Name FROM Rate_Code_Description__c WHERE Name IN :rateCodeMap.keySet()]) {
        Rate_Code_Description__c newRateCode = rateCodeMap.get(rateCode.Name);
        newRateCode.Name.addError('This rate code is already in Salesforce');
    }
}

Standard Salesforce drag & drop form design allowed me to group the various checkboxes into related sections to help users manage their own rates. In this client there is good data discipline and cross-functional cooperation, so I have not chosen to apply varying field-level security – but I could have done so with a little more time.

Integration between Salesforce and the data warehouse was undertaken using Cast Iron – the integration is two-way but not symmetrical since, with the exception of the rate code (the object Name) itself, we wanted all the other entries to be mastered in Salesforce, even if they originate from the data warehouse table. And we need to ensure that users get confirmation that entries put into Salesforce have correctly updated the data warehouse table.

New rates can be input in the data warehouse – this happens automatically where a new rate code is in use in incoming reservation data but does not exist in the reference data table. In this event a stub record is created using globally applicable rules to set a few flags to sensible defaults and taking a description from the first incoming reservation record. This new rate is then synched into Salesforce by IBM Cast Iron using a Poll Inserted Rows database activity. Once in Salesforce, a workflow rule sends an email to a responsible member of staff (whose identity varies according to the value of the rate code) so that person can correct the description and set the flags to the appropriate values.
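As a rough sketch of the stub-record step on the warehouse side – the reservation and reference table names and columns here are invented purely for illustration – the rule amounts to something like:-

-- create stub reference rows for rate codes seen in reservations but not yet in the reference table
INSERT INTO RateCodeReference (rateCode, description, includeInRevenueReports)
SELECT r.rateCode,
       MIN(r.rateDescription),   -- take a description from an incoming reservation record
       0                         -- sensible default flag value
FROM IncomingReservations r
LEFT JOIN RateCodeReference rc ON rc.rateCode = r.rateCode
WHERE rc.rateCode IS NULL
GROUP BY r.rateCode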

The Salesforce object is connected to a second Cast Iron orchestration using a Poll Updated Objects activity – this sends the updated values back to the data warehouse table. Finally, the response from the update activity is used to set a single date field on the Salesforce object, confirming to the user that the integration has worked correctly and their input has updated the data warehouse table.

Total time taken to develop, test and implement this was less than two days, and the job of maintaining the rate-related data is put firmly where it should be – with the business users.

Now, here is the gravy – with all this rate code information in Salesforce, I can leverage it to improve the CRM, providing users with information on customer, partner and supplier related activities based on both this rate code data and the other objects already in Salesforce. We’ve not done that part yet, so I will write about that another day.

Jack

New SQL Server function – sentenceCase

Posted by jack on May 16, 2012

An extension of previous versions to cover conversion of all upper case text to normal sentence case. If you have embedded proper nouns or initials it won’t give exact results, but the output is a lot prettier than a lot of shouting….

USE [DataMart]
GO
/****** Object:  UserDefinedFunction [dbo].[sentenceCase]    Script Date: 05/16/2012 12:39:55 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
--lifted from
--http://classicasp.aspfaq.com/general/how-do-i-convert-a-name-to-proper-case.html
CREATE FUNCTION [dbo].[sentenceCase]
(
@strIn VARCHAR(255)
)
RETURNS VARCHAR(255)
AS
BEGIN
IF @strIn IS NULL
RETURN NULL

DECLARE
@strOut VARCHAR(255),
@i INT,
@Up BIT,
@c VARCHAR(1),
@cminus1 varchar(1)

SELECT
@strOut = '',
@i = 0,
@Up = 1

WHILE @i <= DATALENGTH(@strIn)
BEGIN
SET @c = SUBSTRING(@strIn,@i,1)
If @i > 1
BEGIN
SET @cminus1 = SUBSTRING(@strIn,@i-1,1)
END
IF @c IN ('.','''') or @i = 0  or (@c = ' ' and @cminus1 = '.')
BEGIN
SET @strOut = @strOut + @c
SET @Up = 1
END
ELSE
BEGIN
IF @up = 1
SET @c = UPPER(@c)
ELSE
SET @c = LOWER(@c)

SET @strOut = @strOut + @c
SET @Up = 0
END
SET @i = @i + 1
END
RETURN @strOut
END
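For example:-

SELECT dbo.sentenceCase('THIS IS SHOUTING. really it is.')

-- returns: This is shouting. Really it is.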


A useful SQL Server custom function to proper-case a name

Posted by jack on March 30, 2012

Custom SQL functions are a pretty handy way to cover common data cleaning activities. I wrote this a while back but find it a really useful function – I expect you can find better versions out there, and this one is anglo-centric:-

USE [DataMart]
GO
/****** Object:  UserDefinedFunction [dbo].[ProperCaseName]    Script Date: 03/30/2012 15:44:43 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE Function [dbo].[ProperCaseName](@InputString Nvarchar(255))
Returns NVarchar(255)
As
Begin
Declare @ProperCaseString NVarchar(255)
Declare @counter int
Declare @startPosition int
Set @counter = 0
Set @ProperCaseString = LTRIM(RTRIM(@InputString))
Set @ProperCaseString = Upper(Substring(@ProperCaseString,1,1))+Lower(Substring(@ProperCaseString,2,254))
--check for and eliminate duplicated intervening spaces
doubleSpacesCheck:
Set @counter = Charindex('  ',@ProperCaseString)
If @counter > 0
Set @ProperCaseString = Replace(@ProperCaseString,'  ',' ')
If @counter > 0  goto doubleSpacesCheck
--find position of intervening spaces and proper case the words
Set @startPosition = 1
interveningSpacesCheck:
Set @counter = Charindex(' ',@ProperCaseString,@startPosition)
If @counter > 0
Set @ProperCaseString = Substring(@ProperCaseString,1,@counter)
+Upper(Substring(@ProperCaseString,@counter+1,1))+Substring(@ProperCaseString,@counter+2,255)
Set @startPosition = @counter + 1
If @counter > 0 goto interveningSpacesCheck
--find position of intervening dashes and proper case the words
Set @startPosition = 1
interveningDashesCheck:
Set @counter = Charindex('-',@ProperCaseString,@startPosition)
If @counter > 0
Set @ProperCaseString = Substring(@ProperCaseString,1,@counter)
+Upper(Substring(@ProperCaseString,@counter+1,1))+Substring(@ProperCaseString,@counter+2,255)
Set @startPosition = @counter + 1
If @counter > 0 goto interveningDashesCheck
Return (@ProperCaseString)
End;
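For example:-

SELECT dbo.ProperCaseName('JOHN   SMITH-JONES')

-- returns: John Smith-Jones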

Using Cast Iron to integrate Salesforce to the data warehouse

Posted by jack on February 12, 2012

I integrate Salesforce Enterprise to the SQL Server based data warehouse using IBM Websphere Cast Iron Cloud Edition.

Cast Iron includes a built-in connector for Salesforce and what they term a “secure connector” – an applet you install inside the firewall on the secured LAN that connects to the Cast Iron service on the Internet using a secure key, enabling encrypted data communications without making holes in firewalls.

Synchronising the data warehouse with Salesforce

My normal approach is a standard data warehouse one; each Salesforce object is mirrored by a staging table in the data warehouse that has the same data elements as the Salesforce object.

Cast Iron is configured to synchronise data between Salesforce and staging tables at regular intervals.
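As an illustration only – showing just a handful of the standard fields – a staging table mirroring the Salesforce Contact object might look like this:-

CREATE TABLE stg_Contact (
    Id nvarchar(18) NOT NULL PRIMARY KEY,   -- Salesforce record Id
    FirstName nvarchar(40) NULL,
    LastName nvarchar(80) NOT NULL,
    Email nvarchar(80) NULL,
    SystemModstamp dateTime NOT NULL        -- last modified in Salesforce; useful when polling for changes
)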

Errors in copying from Salesforce are posted to a custom error object in Salesforce (why? Because one possible error is loss of availability of the data warehouse – not much point trying to post to the data warehouse in those circumstances).

Copying data from the data warehouse to Salesforce

Customer transactions are copied into the data warehouse from other applications, and from there loaded into Salesforce.

Cast Iron  triggers stored procedures to start a load process (or we can use SQL Server Agent or another local mechanism – but it is sometimes neater to keep all the data integration processes within Cast Iron).

Errors in this case are posted to the data warehouse (since one possible error is loss of availability of Salesforce – not much point trying to post there in those circumstances).

Closed loop feedback

All applications are occasionally subject to non-availability – in these circumstances some records may get missed from synchronisation. For critical records I deploy a feedback loop along these lines:-

  • Source record changes are synchronised to the target
  • The target is itself a source for a second synchronisation of a skeleton record (Primary key and datetime stamps only)
  • The target for this skeleton is alongside the original source records
  • A regular process re-sends any record updates included in the source but not the target skeleton records.

Since data synchronisation over the Internet takes place over an interval, this is not an instant correction, but I have found it works well where a one-hour delay in synchronisation accuracy is acceptable.
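To make the re-send step concrete, here is a minimal T-SQL sketch of the check – SOURCE_RECORDS, SKELETON_COPY and their columns are hypothetical stand-ins for the real tables:-

-- SOURCE_RECORDS holds the originals; SKELETON_COPY holds the keys and timestamps echoed back from the target
SELECT s.recordId
FROM SOURCE_RECORDS s
LEFT JOIN SKELETON_COPY k
    ON k.recordId = s.recordId AND k.lastModified >= s.lastModified
WHERE k.recordId IS NULL
  AND s.lastModified < DATEADD(hour, -1, getDate())  -- allow the normal synchronisation interval to elapse first

Any recordId returned has a change that never made it round the loop and is queued to be sent again.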

Welcome to the blog

Posted by jack on April 24, 2010

This is the new blog of Jack, Principal Consultant at Albacore Ltd.

Here I will be giving some of my impressions and thoughts whilst working with clients on their CRM and data warehousing implementations.

I will try to remember to put in interesting and useful technical stuff, as well as musings on the wider context.

I work in the London, UK area, and my client base over recent years has ranged from large public sector and international businesses to small outfits with three or four staff.

Our new business strategy is to target “small growing companies” in London & South East England – we think there is a growing market for services to businesses with fewer than 100 employees. Ten years ago these firms had maybe one server, in-house email, and the rest of their computing was MS Office.

Today not only has the server rack filled up with more servers, but with the onset of SaaS (software as a service) offerings like Salesforce and Webex they are increasingly moving their business out to the Cloud.

The current generation of entrepreneurs and CEOs were brought up in the networked, information-driven age, and they expect to deploy IT support throughout their business.

But whilst they pay attention to the input of information, the networking of that information is often neglected – new processes spawn new applications. Sometimes these are linked into the existing ones to avoid duplicating data entry, but combining the stored data from these systems to provide an integrated view across all aspects of the business is rarely attempted outside the large corporate data warehouse.

I have been designing just such an integration for a client – integrating a range of systems including Cloud, email servers, voice call logs, website, direct marketing and database systems to provide a single, consistent means of monitoring all aspects of the operation. There is nothing magical or new about any of the technology, but the price, reliability and capability of software now available makes this achievable with a fraction of the resources it would have required in the past.

My next posts will discuss some aspects of the design as it evolves.
