One of the biggest drawbacks of the data lake is that, without proper data pipeline management and cataloging, you can easily end up with a data swamp that is difficult to get any real value from. While it’s easy to put data in the lake, it’s tougher to sift through all of that information to find exactly what you need once it’s in there.
A data process framework for your business needs is essential to utilizing the data lake and preventing the creation of a “data swamp.” Having well-defined data catalogs for your data lake will help data scientists and stakeholders utilize the data for business benefit.
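To make the idea of a catalog concrete, here is a minimal sketch in Python. It is a toy, in-memory illustration and not any particular catalog product: the `CatalogEntry` and `DataCatalog` names, the fields, the storage paths, and the sample datasets are all assumptions made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset in the lake (fields are illustrative)."""
    name: str
    path: str          # hypothetical storage location inside the lake
    data_format: str   # e.g. "json", "jpeg", "parquet"
    owner: str         # team responsible for the dataset
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """A toy in-memory catalog; a real one would persist this metadata."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Reject duplicates so every dataset name maps to exactly one entry.
        if entry.name in self._entries:
            raise ValueError(f"dataset already registered: {entry.name}")
        self._entries[entry.name] = entry

    def find_by_tag(self, tag: str) -> list[str]:
        # Return the names of all datasets carrying the tag, in sorted order.
        return sorted(e.name for e in self._entries.values() if tag in e.tags)


# Hypothetical datasets, invented for illustration only.
catalog = DataCatalog()
catalog.register(CatalogEntry("web_clicks", "lake/raw/web_clicks/", "json",
                              "marketing", {"clickstream", "raw"}))
catalog.register(CatalogEntry("product_photos", "lake/raw/product_photos/", "jpeg",
                              "merchandising", {"images", "raw"}))
catalog.register(CatalogEntry("daily_sales", "lake/curated/daily_sales/", "parquet",
                              "finance", {"sales", "curated"}))

print(catalog.find_by_tag("raw"))  # ['product_photos', 'web_clicks']
```

Even a record this simple (what the data is, where it lives, what format it’s in, and who owns it) gives data scientists and stakeholders a way to look data up instead of wading through the lake by hand.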
Lastly, data lakes generally require more storage space than data warehouses because they will be used to store all of your organization’s data, including unstructured data like images and videos.
So How Do I Know Which One I Should Use?
The answer to the question of whether you should use a data lake or a data warehouse comes down to asking yourself how you want to use your data.
What type of analysis do you want to do? As we discussed, each of these options has pros and cons, and each suits different types of analysis because it was built for a different purpose.
In this era of a growing focus on machine learning, the raw data you find in a data lake is well suited to that kind of work, which makes the data lake a natural fit for data scientists.
On the other hand, the clean, structured approach to data stored in a data warehouse is the perfect solution for business professionals who want analytics but don’t have the know-how to sift through raw data to find the insights they need. The data and query system is more normalized through the use of such tools as batch reporting, business intelligence, and visualizations.
One note: If you’re already using a data warehouse and want to jump into the data lake, either by moving to one or by adding one to your existing environment (it’s possible to use both), you’ll need to undergo a transition period because the data stored in each isn’t the same.