Persistence of Data in Customer Data Platforms

December 5, 2016

The CDP Institute’s definition of a Customer Data Platform describes it as a “persistent, unified customer database”.  Most of the CDP discussion focuses on the “unified” bit, since collecting data from different sources and linking it to cross-channel identities is a huge challenge.  But “persistent” is worth some thought as well.

“Persistent” is in the definition to distinguish CDPs from solutions that read data from external source systems without storing it internally.  The two main classes of these are real time interaction managers, which assemble data to guide Web and call center interactions*, and integration platforms like Jitterbit, Mulesoft, Zapier, and Boomi, which act as switchboards to shuttle data between different systems without storing the data themselves.

The value of persistence is obvious: storing data lets CDP users look back over time to find patterns, calculate trends, build aggregates, and access details that might be lost or inaccessible in source systems.  Persistence is especially important for identity resolution, which needs historical data such as the same device accessing different accounts or different devices being used simultaneously.  On a practical level, it’s often easier to work with data stored inside the CDP than to read that data from an external system.  Indeed, the owners of external systems are often unwilling to allow external access to their data because they fear it will interfere with operational performance.  And they’re often right.

But it’s not enough to say that persistence is important.  Persistence also has its costs, most obviously in extracting, moving and storing the persisted data.  There are also performance penalties from having more data to sort through.  It’s true that storage is cheap and big data technology scales almost indefinitely.  But if you copy enough data these costs still become significant.  At the extremes, there are certain kinds of data it doesn’t make sense to persist (in most cases), such as minute-by-minute changes in customer location, local weather, or stock portfolio values.   Most CDP applications only need to know those values while an interaction is happening, so all that’s needed is to look up the current value at the start of an interaction.  Storing a continuous history would be overkill, although it often does make sense to save a snapshot of those values at the time of the transaction.  As the mention of location may suggest, persistence can also raise privacy issues.

In other words, the question isn’t whether persistence is needed but which data to persist.  Choices must be made.

The first step is to distinguish three categories: data which must be persisted, data which might be persisted, and data which should never be persisted.  You can then start asking which data falls into each category.  The real answers will depend on your situation but here are some thoughts.

  • Required data.  At a bare minimum, this includes data for identity resolution.  That information is the key to linking all other data, whether stored inside or outside the CDP.  Historical information is critical to the process so relying purely on external data isn’t an option.  Required data also includes information that is often lost in source systems, such as past addresses or contact phone numbers.  It extends to derived values that don’t exist in the source systems, such as trends and aggregates.
  • Optional data.  This is most of the customer profile details and behavior histories loaded into a typical CDP.  These could theoretically be read from source systems but are often not available in practice.  Reasons could include slow access (especially to support real-time interactions), need for preparation or processing (too slow or inefficient to do on demand), or refused permission from the system owner.   Reformatting and indexing data for easy access are other good reasons to load it into the CDP.  So is looking for patterns in data streams – sometimes called complex event processing – which needs a readily-accessible history of previous information.

The call on whether to persist can be close when there are truly massive amounts of detail – think Web logs – which are used only occasionally.  Having them available is extremely convenient, especially when summaries are not sufficient substitutes.  For example, customer segmentation projects may need the underlying details to reclassify customers using different segment definitions.  Simply storing the customer’s current segment with each interaction won’t work.  This type of after-the-fact reclassification is a common requirement and one of the big advantages of having the details in the CDP.  But it might not be worth loading the data if you’ll do the analysis just once every three years – although having the data easily available might result in doing the analysis more frequently.  Chicken, meet egg.

  • Excluded data. The clear cases are information that shouldn’t be stored for privacy, security or regulatory reasons.  Beyond that, you’re mostly in the realm of cost-benefit analysis.  One reasonable rule of thumb is you don’t want to load data that changes frequently, must be current when used, and is only used rarely.  Mobile device location used for customer service is a good example.  The difficulty and timeliness of accessing the data in source systems is also a factor: the easier it is to read the data externally, the less value you get from loading it into the CDP.  But terms like “frequently”, “rarely”, “difficulty”, and “timeliness” are all relative, so there are no fixed rules here.  Sorry.
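For readers who think better in code, here is a minimal sketch of the three categories and the rule of thumb above.  Everything in it is an assumption made for illustration: the `DataSource` fields, the `persistence_category` function, and the order of the checks are hypothetical, not part of any CDP product.  In practice the yes/no flags would be judgment calls, which is exactly the problem noted above.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """Hypothetical description of one candidate data feed."""
    name: str
    restricted: bool = False           # barred for privacy, security or regulatory reasons
    needed_for_identity: bool = False  # feeds identity resolution
    lost_in_source: bool = False       # history the source system overwrites or discards
    changes_frequently: bool = False
    must_be_current: bool = False
    rarely_used: bool = False


def persistence_category(src: DataSource) -> str:
    """Return 'required', 'optional' or 'excluded' for a data source.

    The ordering of the checks is an assumption: restrictions win over
    everything else, then the 'must persist' cases, then the rule of thumb.
    """
    if src.restricted:
        return "excluded"
    if src.needed_for_identity or src.lost_in_source:
        return "required"
    # Rule of thumb: changes frequently, must be current when used, used rarely.
    if src.changes_frequently and src.must_be_current and src.rarely_used:
        return "excluded"
    return "optional"


location = DataSource("mobile_location", changes_frequently=True,
                      must_be_current=True, rarely_used=True)
device_links = DataSource("device_account_links", needed_for_identity=True)
web_logs = DataSource("web_logs", changes_frequently=True, rarely_used=True)

for source in (location, device_links, web_logs):
    print(source.name, "->", persistence_category(source))
```

Note that the Web logs come out as "optional" rather than "excluded" because they don't need to be current when used, which matches the close call described earlier.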

If you’re thinking that the boundaries between these categories are pretty vague, here’s some bad news: it gets worse.  You might want to store the recent portion of a data stream that’s too large to keep in its entirety, in the way that surveillance tapes are kept for a period and then erased if nothing important happened.  Or, you might read time-sensitive information directly from operational systems to support real-time interactions, but then upload the same information in overnight batches for historical analysis.  And let’s not even get started on the fact that your CDP itself can have different types of storage with different levels of detail and access speed.  Or that answers will change over time as you find and discard uses for particular pieces of data.  Or that CDP technology itself will evolve.
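The surveillance-tape approach can be sketched in a few lines.  This is only an illustration under stated assumptions: the event dictionaries, the `important` flag, the 30-day default window and the `prune_stream` function are all hypothetical, and a real CDP would handle retention in its storage layer rather than in application code.

```python
from datetime import datetime, timedelta


def prune_stream(events, now, keep_for=timedelta(days=30)):
    """Keep recent events plus any older ones flagged as important.

    Each event is assumed to be a dict with a 'timestamp' (datetime)
    and an 'important' (bool) key.
    """
    cutoff = now - keep_for
    return [e for e in events if e["important"] or e["timestamp"] >= cutoff]


now = datetime(2016, 12, 5)
stream = [
    {"id": 1, "timestamp": datetime(2016, 11, 30), "important": False},  # recent
    {"id": 2, "timestamp": datetime(2016, 10, 1), "important": False},   # old, routine
    {"id": 3, "timestamp": datetime(2016, 10, 1), "important": True},    # old, flagged
]

kept = prune_stream(stream, now)
print([e["id"] for e in kept])
```

Here the routine October event is erased while the flagged one survives, just as a tape would be pulled from the rotation if something happened on it.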

Given these ambiguities, how should you think about persistence in planning your CDP?  As ever, the foundation is specific business uses: what data do you need, in which formats and how quickly, to support your intended applications?  Can you meet those needs by reading directly from the source systems or do you need to load it into the CDP?  If it must be in the CDP, is the cost to load and store it acceptable?  Beyond this relatively static analysis, remember that there may be future uses for the data and that you have choices in how you manage it in your CDP.

Bottom line: persistence in a CDP isn’t a simple topic.  But if you only remember that you’ll likely need a combination of internal storage and external access, you’ve already started in the right direction.

* This is a large and complicated category, which could easily occupy several blog posts by itself.  See this Forrester Wave for a good overview of the classic real time interaction managers, nearly all of which are now baked into enterprise marketing suites.  See this Gartner report for an overview of digital personalization engines, a newer group with overlapping functions.