What Is a Data Lake vs. a Data Warehouse?

Today, organizations handle many types of data from many sources, and to make sense of this incoming ‘gold’ (so to speak) they need a data storage and analytics solution. The question of where an organization should house its data in order to act on it often leads to a discussion of Data Lakes and Data Warehouses. And while many organizations have heard of both solutions, many are unclear on the differences and benefits and, ultimately, on which solution is best for them.

I recently sat down with Ted Sfikas, Tealium’s Director of Solutions Consulting for North America and LATAM, to ask him about the differences between a Data Lake and a Data Warehouse. Is one better than the other? How can you best maximize the value of each solution? And what are his key recommendations for choosing the best type of solution for a brand?

So Ted, what is a Data Warehouse?

A Data Warehouse is a repository – it’s where structured data goes to rest. Engineers build ‘entry criteria’ for the data, based on the data models that the organization wants to analyze. The end result is a well-structured set of data whose structure forms the schema. Data entering a Data Warehouse must meet very specific requirements and is tightly controlled. The tables the organization uses to produce meaningful analytics are defined in advance so that this business intelligence can be produced as efficiently as possible.

A Data Warehouse is an important thing – it’s not a place where data is thrown in without considering what will happen next. There are very distinct purposes for loading information into a Data Warehouse, and the many technologies connected to it depend on the data conforming to those principles.

So to summarize, a Data Warehouse is a repository of structured data that has been collected and organized according to a pre-built, rigid model (schema-on-write) based on very specific uses of that data. Because only very well-organized and specific data is collected with this approach, using this data once it’s in the Data Warehouse is fast and easy – but the setup is harder and you may potentially miss data that will be valuable in the future because you’re only collecting what’s in the schema.
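To make schema-on-write concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for a warehouse. The `orders` table, its three columns, and the helper names are hypothetical and chosen only for illustration; a real warehouse would have its own models and loading tools.

```python
import sqlite3

# Hypothetical schema for illustration: the warehouse accepts only these columns.
SCHEMA_COLUMNS = ("customer_id", "order_total", "order_date")


def create_warehouse():
    """Create an in-memory table whose schema exists before any data is written."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "customer_id INTEGER NOT NULL, "
        "order_total REAL NOT NULL, "
        "order_date TEXT NOT NULL)"
    )
    return conn


def load_record(conn, record):
    """Write a record only if it satisfies the schema; return True on success."""
    try:
        # Keep only the fields the schema knows about; anything else is dropped.
        row = tuple(record[col] for col in SCHEMA_COLUMNS)
    except KeyError:
        return False  # a required field is missing, so the record is rejected
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return False  # a NOT NULL constraint was violated
    return True


conn = create_warehouse()

# The extra "referrer" field is not in the schema, so it is never stored.
accepted = load_record(
    conn,
    {"customer_id": 1, "order_total": 42.5, "order_date": "2018-01-15", "referrer": "social"},
)

# This record lacks required fields and is turned away at write time.
rejected = load_record(conn, {"customer_id": 2})

# Because everything in the table conforms, analysis is a simple query.
total = conn.execute("SELECT SUM(order_total) FROM orders").fetchone()[0]
```

The trade-off described above shows up directly: querying is easy because every row conforms, but the `referrer` value is gone for good because it was not in the schema at write time.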

And comparatively – what is a Data Lake?

A Data Lake is the IT industry’s response to the need to store data in an unstructured format. Data Lakes are made to handle the new, “Big Data” aspects of our industry. Think of a box of bottled water, ready to drink, compared with a lake. The Data Lake is not packaged and ready to use; it’s got all the same ingredients, but they still need to be put together. Data from external sources (e.g., social, video, voice, text) gets poured into this lake through a number of different streams (i.e., channels) all related to the customer, but again, it’s not in a structured format. What you end up with is one massive collection of data in the Data Lake, where it’s simply not efficient to point analytical tools at it the way you can in a Data Warehouse.

In contrast, the way we work with data in a Data Lake is to use search and tagging capabilities to indicate which pieces of data we want to capture and place into a single object. Once that structuring is done, we can analyze. Why do it that way? Because for some types of information, it’s a lot easier to get data into a Data Lake and form a schema later than to deal with a Data Warehouse’s schema restrictions up front.

So to summarize, a Data Lake is a repository of unstructured data that’s not rigidly filtered during collection. Rather, the raw data is simply loaded into the lake and modeled and structured later (schema-on-read). Because far more data is collected with this approach, accessing the data takes a little more work and requires certain technology, but it takes very little setup and there’s much less risk of missing valuable information.

That’s why a Data Warehouse is referred to as schema-on-write: the minute you write the data to disk, you have to have a schema in place. In a Data Lake, by contrast, raw data enters the repository before it’s structured, and so it’s schema-on-read.
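As a rough illustration of schema-on-read, the sketch below keeps raw events exactly as they arrive and applies a schema only when someone asks a question. The in-memory list standing in for the lake, the event fields, and the function names are all assumptions made for the example, not a description of any particular product.

```python
import json

# The "lake": raw events stored exactly as they arrived, with no schema enforced.
raw_lake = []


def ingest(raw_line):
    """Store the raw text untouched; nothing is validated or dropped on the way in."""
    raw_lake.append(raw_line)


def read_with_schema(fields):
    """Apply a schema at read time: parse each event and keep the requested fields."""
    rows = []
    for line in raw_lake:
        event = json.loads(line)
        # Only events that carry every requested field fit this particular view.
        if all(field in event for field in fields):
            rows.append({field: event[field] for field in fields})
    return rows


# Events from different channels, each with a different shape.
ingest('{"customer_id": 1, "channel": "social", "text": "love it"}')
ingest('{"customer_id": 2, "channel": "voice", "duration_sec": 95}')
ingest('{"customer_id": 1, "order_total": 42.5}')

# Two different schemas applied to the same raw data, decided after collection.
by_channel = read_with_schema(["customer_id", "channel"])
orders = read_with_schema(["customer_id", "order_total"])
```

Nothing was rejected at ingest time, so both views can be built later from the same raw events; the cost is that every read has to parse and filter the data first.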

Is one better than the other? As in – who would want to use a Data Lake and who would want to use a Data Warehouse?

In larger companies where there may be a business intelligence department, several business units may be producing many pieces and types of analytics every day. And even the savviest of professionals don’t want to be burdened with dealing with numerous sources of data and having to understand where it all came from – they want it ready to go so they can act on it. So in an organization with a larger business intelligence department, a Data Warehouse is much more suitable.

A Data Warehouse is suitable when data is being used in precise ways and there are teams to back up managing the data in those ways. But recent uses of Machine Learning and the resulting Artificial Intelligence have placed more emphasis on getting raw data; you’ll find that the traditional Data Scientist will likely be using a Data Lake because of this, as the information they need has to be dynamic, raw, and far more agile. People in these roles may not even know how they’re going to use the data when it is first collected, and that’s the point of the Lake.

How does Tealium fit into the Data Lake and Data Warehouse comparison?

What’s great is that Tealium’s solution works with both Data Lakes and Data Warehouses. Just as a Data Lake and a Data Warehouse are very complementary methods of storing, collecting and interpreting data, Tealium is like a third, complementary place for data to go.

Tealium is where technology can be pre-programmed to react to data in real-time. The Tealium solution brings in clean data, and then makes that data actionable in real-time. It’s complementary in the sense that it allows for governed data collection and enrichment, orchestrates actions based on business rules, and ultimately allows a business to programmatically define the formation of a customer profile in an automated fashion. Instead of spending days confirming how to work with customers that are changing all the time, those decisions are now automated by Tealium and the business is finally able to react in real-time, something a Data Warehouse and a Data Lake are not designed for. As this is all happening, Tealium can store data in Redshift and S3 repositories that the Data Warehouse and Data Lake can work with, and so Tealium extends the data supply chain to be available in real-time.

What are some of the key things you would recommend a brand look for and consider in choosing a Data Lake or Data Warehouse tool?

Most organizations today have a Data Warehouse, and many of them have heard of the Data Lake or are in the process of building one. It’s important to understand that the cost differences can be enormous.

According to PricewaterhouseCoopers, a Data Warehouse can cost in the range of $10M to set up and build because there are a lot of processes to create, models to build, and a lot of training to do. A Data Lake can be set up for a much lower cost, at around 20% of that of the Warehouse. Why? Because a Data Lake can use open-source software like Hadoop and, by definition, will not require any lengthy modeling. There are open-source alternatives to choose from, but organizations can also buy commercial off-the-shelf software for their Data Lake initiative. So cost is a big factor. Organizationally, the company and its personnel must be ready for this change. Having the right people on staff and the right processes in place to leverage the results is mandatory.

Customers have a lot more choice with Data Lakes. And they can be a lot easier to set up.

I’d recommend looking at the organization first to see how it runs best. It may be better to get data in an unstructured, informal manner where you’re helping business units gain value in an informal way. If so, then a Data Lake would be best suited for them.

Thank you, Ted!

Want more information on how Tealium’s solution is complementary to both Data Lakes and Data Warehouses? Contact us for a live demo today!
