One of the primary goals for every business using data is to build a comprehensive understanding of their customers. That effort makes complete sense: the more you know about your customers, the more value you are able to provide them and the more they will pay you.
In the digital age, the most common term used for building a complete customer profile is a “360-degree view of the customer” (often abbreviated to “customer 360”). This coveted customer 360 serves as the foundation for driving significant value, from uncovering previously hidden insights to enabling powerful ML use cases like personalization.
Every business wants those outcomes, but as anyone working in data knows, actually building a 360-degree view of the customer is hard. Why is it so difficult in the world of modern data tooling?
The initial challenge of building a unified view of the customer is collecting all of the data your organization has about its customers into a single place. For most companies, this means aggregating data from a huge number of sources, which itself is a serious technical challenge.
While IT departments have been aggregating customer data since the beginning of databases, the first set of SaaS tools to promise a kind of ‘automated’ customer 360 in response to this need were cloud marketing CDPs. Unfortunately, these CDPs were built with limited integration flexibility and proprietary customer profile models, so they ultimately failed to deliver on their promise of a golden customer record and actually exacerbated the underlying problem by creating additional data silos.
In the wake of that failure, companies were reminded that building a 360-degree view of their customer is a fundamentally technical problem, and responsibility for these projects began to shift back to data and engineering teams (a trend we are currently seeing play out across the market).
Thankfully for those technical teams, the recent commoditization of customer data pipelines and scalability of data warehouses have made it much easier to pull every bit of customer data into a centralized store.
It turns out, though, that solving the initial challenge of centralization is only half the battle.
Before we dig into the details, let’s define what a 360-degree view of the customer actually looks like in your data warehouse. On a basic level, your customer 360 exists as a table with one row per user and a set of columns that represent everything you know about that user. Those columns are often called user traits or attributes. Traits generally fall into a few categories.
The first is unique identifiers, which are all of the unique IDs you have for your user from every tool. These could range from anonymousId values from various web sessions to customerId values from payments systems.
The second kind of traits are known user attributes, which are all of the data points about a customer that are pulled in from various tools and data sources. These are often demographic (job title, age, etc.), behavioral (lead source, product usage, etc.) and stage or state related (sales opportunity status, active/inactive status, subscription status, etc.).
The third kind of traits are computed traits, which are calculated by combining data sets that contain information related to the user. These are also called user features. User features are often related to key business metrics and are used to drive all kinds of insights and optimizations. A few examples are total revenue per user, last 10 products viewed, average time between logins, etc. Some of these features are fairly standardized. For example, every eCommerce company needs features related to products viewed, abandoned carts, etc. Other use cases and business models require custom features.
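To make the three categories concrete, here is a minimal sketch of what a single customer 360 row might look like. All field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one customer 360 row: one record per user,
# with columns grouped into the three trait categories described above.
@dataclass
class Customer360Row:
    # 1. Unique identifiers collected from every tool
    user_id: str
    anonymous_ids: list = field(default_factory=list)  # cookie/device IDs
    email: str = ""
    customer_id: str = ""                              # e.g. from the payments system

    # 2. Known user attributes pulled from sources
    job_title: str = ""
    lead_source: str = ""
    subscription_status: str = ""

    # 3. Computed traits (user features)
    lifetime_revenue: float = 0.0
    last_10_products_viewed: list = field(default_factory=list)
    avg_days_between_logins: float = 0.0

row = Customer360Row(
    user_id="u_123",
    anonymous_ids=["anon_a1", "anon_b2"],
    email="jane@example.com",
    lifetime_revenue=412.50,
)
```

In a warehouse this would be one wide table rather than application code, but the shape is the same: identifiers, known attributes and computed features side by side per user.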
If you imagine having a complete repository of every user trait and feature, shipping projects like powerful product analytics or granular marketing audiences becomes significantly easier (and faster).
At first glance, producing this unified table might not seem too difficult. Unfortunately, computing the customer 360 is much harder than it looks. Here’s why.
When it comes to data, the typical user journey begins with anonymous activity on a website or app before eventually signing up, logging in or making a purchase. When that key identifying event happens, the user generally provides a known identifier (or set of identifiers) like email or phone number. As a result of this anonymous → known transition, the anonymous events are associated with anonymous identifiers like a cookie ID or device ID before being associated with the known identifiers (email, phone). In order to accurately construct this user journey, all of those activities must be logically combined into a single timeline of events for that user.
Semantic user features like “number of products viewed before first purchase” must be computed on the logically combined user journey, not the raw events. For most companies, though, managing anonymous identifiers and known identifiers, then combining the raw data, is a massive undertaking.
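As a toy illustration of why the combined timeline matters, here is a sketch (with made-up event and ID values) that resolves anonymous events to a known identifier and then computes “number of products viewed before first purchase” over the merged, time-ordered journey:

```python
# Hypothetical mapping discovered at login: anonymous ID -> known ID
id_map = {"anon_a1": "jane@example.com"}

events = [
    {"user": "anon_a1", "ts": 1, "type": "product_viewed"},
    {"user": "anon_a1", "ts": 2, "type": "product_viewed"},
    {"user": "jane@example.com", "ts": 3, "type": "purchase"},
    {"user": "jane@example.com", "ts": 4, "type": "product_viewed"},
]

def views_before_first_purchase(events, known_id, id_map):
    # Resolve every event to its known identifier, then sort into one timeline
    timeline = sorted(
        (e for e in events if id_map.get(e["user"], e["user"]) == known_id),
        key=lambda e: e["ts"],
    )
    views = 0
    for e in timeline:
        if e["type"] == "purchase":
            return views
        if e["type"] == "product_viewed":
            views += 1
    return views

print(views_before_first_purchase(events, "jane@example.com", id_map))  # → 2
```

Computed over the raw events alone, the anonymous views would be orphaned and the feature would come out wrong; the merge step is what makes the number meaningful.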
Things get even more complex in multi-device scenarios where the same end user can be associated with multiple anonymous identifiers such as a cookie ID on a browser and a device ID on a mobile device. All these identities must be tied to one end user. To make things worse, these associations are most often discovered over time. For example, if the user creates an account with their email address on the browser, then creates an account with their phone number on the mobile device, you may not know that they are the same user until you have additional information tying together email and phone (i.e., the user providing their phone number during a checkout event on the browser after logging in via email).
Even for this basic case, the required logic is significant:
- Anonymous activities in the browser must be associated with the email address after the initial account creation in the browser
- Anonymous activities in the mobile device must be associated with the phone number after account creation in the mobile app
- All activities (anonymous and known) and unique identifiers (email, phone, etc.) must be merged into a single timeline and user profile representing a single user’s journey
Stitching identities like this requires maintaining an identity graph and computing its transitive closure, and doing that in SQL is non-trivial.
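To show what the transitive closure involves, here is a minimal union-find sketch of an identity graph. Each observed link (e.g., email and phone appearing together on a checkout event) merges two identity clusters, and the closure falls out of repeated unions. The identifiers are invented for illustration:

```python
parent = {}

def find(x):
    # Resolve x to the root of its identity cluster, with path compression
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    # Merge the clusters containing a and b
    parent[find(a)] = find(b)

# Identifier pairs, discovered over time
links = [
    ("cookie_1", "jane@example.com"),    # browser signup
    ("device_9", "+1-555-0100"),         # mobile signup
    ("jane@example.com", "+1-555-0100"), # phone provided at browser checkout
]
for a, b in links:
    union(a, b)

# All four identifiers now resolve to a single user
roots = {find(x) for x in ["cookie_1", "device_9", "jane@example.com", "+1-555-0100"]}
assert len(roots) == 1
```

In practice this runs at warehouse scale over millions of identifiers, often expressed as iterative SQL or a graph job rather than in-memory Python, which is exactly where the difficulty lies.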
Further, this stitching needs to be done not only for event-based user journeys, but also for data extracted from other SaaS sources like marketing tools, CRMs, customer success tools and payment systems via ETL. Some of this data is event-based (like payments) while some is basic relational data (like lead records from a CRM), so decisions have to be made about timestamps on relational data and how to structure joins into a single table.
And if that wasn’t enough, everything we just described refers to deterministic identity stitching, but there are many use cases for non-deterministic or probabilistic identity stitching.
It’s no wonder that this is the first major roadblock most companies face when they set out to build their customer 360.
Data points with timestamps (events) are fundamental for building a 360-degree view of the customer, but user traits and features in the customer 360 table rarely map 1-1 to events. Features are highly semantic, involving multiple dimensions and events, which creates additional challenges for the team building them.
For example, a feature like “user lifetime revenue” would require summing transactions from the website and mobile app, as well as any subscription revenue, and reconciling with financial transactions from the payment system. Even this simple use case requires working with multiple events from four different data sources and performing multiple mathematical operations.
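A simplified sketch of that feature, with invented source names and amounts, might look like this: sum across the first-party sources, then reconcile against the payment processor’s record:

```python
# Hypothetical per-source transaction lists for one user
web_orders    = [{"user": "u_123", "amount": 40.0}]
mobile_orders = [{"user": "u_123", "amount": 25.0}]
subscriptions = [{"user": "u_123", "amount": 10.0}, {"user": "u_123", "amount": 10.0}]
payments      = [{"user": "u_123", "amount": 85.0}]  # processor's record

def lifetime_revenue(user, *sources):
    # Sum a user's transaction amounts across any number of sources
    return sum(r["amount"] for src in sources for r in src if r["user"] == user)

computed  = lifetime_revenue("u_123", web_orders, mobile_orders, subscriptions)
collected = lifetime_revenue("u_123", payments)
assert computed == collected == 85.0  # reconciliation check passes
```

The real version additionally has to handle refunds, currencies, duplicate events and the identity resolution described earlier, which is why even a “simple” feature becomes a project.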
Even individual events have semantic complexity. For example, an “added to cart” event seems simple on the surface, but could occur on different first-party platforms (web and mobile), third-party platforms (like affiliate sites) or even at different positions on the same page (header or features section).
Semantic features and events also need to take into account the output of the identity stitching step mentioned above: the raw events need to be associated with the anonymous and known identifiers, while features need to be computed over all events across all IDs belonging to a user.
Accomplishing this requires a significant amount of complex, repetitive SQL joins and unions across multiple events and identities.
Another ‘hidden’ challenge teams face with feature semantics is spending huge amounts of time computing attributes from the ground up, even though the schemas of the data sources and even the metrics themselves are largely standardized. This dynamic is pervasive for metrics that are important for ML use cases. For example, the semantics for a metric like “total user revenue in the last 30 days” don’t vary significantly from company to company.
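As an illustration of how standardized this logic is, here is a sketch of a trailing-window revenue metric. The field names and timestamps are invented, but the semantics are the ones most companies re-implement from scratch:

```python
import time

def revenue_last_n_days(transactions, user, n_days, now=None):
    # Total transaction amount for a user within a trailing window
    now = now if now is not None else time.time()
    cutoff = now - n_days * 86400  # window start, in epoch seconds
    return sum(t["amount"] for t in transactions
               if t["user"] == user and t["ts"] >= cutoff)

txns = [
    {"user": "u_123", "ts": 1_000_000, "amount": 50.0},  # outside the window
    {"user": "u_123", "ts": 9_990_000, "amount": 20.0},  # inside the window
]
print(revenue_last_n_days(txns, "u_123", 30, now=10_000_000))  # → 20.0
```

Nothing here is company-specific, which is the point: a standard library of such features would spare each team from rebuilding them.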
For business models like eCommerce, 90% of key metrics could work off of a standard set of features, which would drastically accelerate time to market for companies building customer 360.
Once you compute semantic features, you also need to keep track of important metadata related to those features. At a high level, you need a description of the metric (what it means), time of last update, provenance (e.g., who defined and built the metric) and any access/ownership requirements. Having all of your metrics definitions and metadata in one centralized location makes it easy for data producers (data engineers, analytics engineers) and data consumers (analysts, product managers, data scientists, marketers) to collaborate—it gives consumers access, clarity and confidence without sacrificing visibility and control for producers.
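A metric registry capturing that metadata can be sketched as follows; the fields mirror the list above and the entry contents are illustrative:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MetricMetadata:
    name: str
    description: str            # what the metric means
    last_updated: str           # time of last update
    owner: str                  # provenance: who defined and built it
    allowed_roles: List[str] = field(default_factory=list)  # access requirements

registry = {}

def register(meta: MetricMetadata):
    # One centralized location for definitions and metadata
    registry[meta.name] = meta

register(MetricMetadata(
    name="lifetime_revenue",
    description="Sum of all completed transactions per user",
    last_updated="2022-06-01",
    owner="analytics-engineering",
    allowed_roles=["analyst", "data_scientist"],
))
```

A real catalog would live alongside the warehouse with lineage and access enforcement, but the shape of the record is the same.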
Tracking historic versions of the metrics is also important, especially for ML algorithms. For example, a churn algorithm may model off of features like revenue and website activity in a 7-day period prior to the churn date. If the definition of revenue changes, the historic version of the metric must be marked as deprecated and the new metric recomputed. Because recomputing the entire history of a metric can be very costly, it’s best to recompute on demand.
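One way to sketch that versioning-plus-lazy-recompute pattern (the definitions and numbers are made up):

```python
versions = []  # append-only list of metric definitions
cache = {}     # (version, user) -> computed value

def define(compute_fn):
    # A new definition deprecates the previous one instead of
    # eagerly rebuilding the full history
    if versions:
        versions[-1]["deprecated"] = True
    versions.append({"v": len(versions) + 1, "fn": compute_fn, "deprecated": False})

def value(user, data):
    # Compute under the current definition, only when asked
    v = versions[-1]
    key = (v["v"], user)
    if key not in cache:
        cache[key] = v["fn"](user, data)
    return cache[key]

define(lambda user, data: sum(d["amount"] for d in data))                # v1: gross revenue
define(lambda user, data: sum(d["amount"] - d["refund"] for d in data))  # v2: net revenue

data = [{"amount": 100.0, "refund": 10.0}]
print(value("u_123", data))       # → 90.0 (computed under v2)
print(versions[0]["deprecated"])  # → True
```

The old definition stays available for auditing and for ML training sets that were built against it, while fresh values are only materialized on demand.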
Needless to say, building out the models and pipelines required to manage event semantics and metadata based on an identity graph (that you are also managing) is an extremely complex undertaking.
Current tooling for semantics and metadata
As you would expect, multiple kinds of tools have been built to address these pain points. There are several “metrics layer” or “semantic layer” tools that help data teams more easily manage metric semantics like the ones discussed above, while data cataloging and observability tools address the metadata challenge from a variety of angles at various points in the pipeline.
Many of these tools are great, but most are also young and have yet to reach full maturity in their solutions to these problems. More importantly, most of these tools are, understandably, built for defining global metrics in a traditional batch ETL workflow. So, while the current tools do make managing aspects of semantics and metadata easier than doing everything manually or building your own tooling, they still require a significant amount of configuration and management for projects like customer 360, which focus on deriving user-specific metrics from customer data.
Lastly, the end-user experience for most of these tools requires deep expertise in a particular language, specifically either SQL or Python, meaning that the number of people within an organization who can build and manage these components of the customer 360 is limited. When it comes to user metrics, this limitation forces a significant amount of translation between multiple teams, which is a big reason customer 360 projects tend to take so long.
We’ll also be open about the punchline here: our team is actively building tooling to solve these problems for customer data. Stay tuned or reach out if you want to join our beta!
Even if your team builds the identity stitching and user feature layers, maintaining the underlying pipelines, scheduling and infrastructure requires full time data engineering and data ops teams. Pipelines can require orchestration across tools like dbt, Airflow, ETL jobs and more, necessitating a health monitoring layer to ensure continued operation through the lifecycle of the data flow and compute process.
The resources required for this infrastructure and orchestration are a significant roadblock for companies building customer 360.
Our team is actively building a solution for these problems and we will reveal much more detail in future posts. The product is currently in beta and our first step is making identity stitching as easy as editing a config file. You can see the beta product page here, and reach out to our team if customer 360 has been painful for you to build—we would love your feedback on our approach and the early version of the product.
This article was originally published on RudderStack’s website.