Implementing an Enterprise Data Lake using Amazon (AWS) S3

Introduction & Background

In the modern digital world, many small to medium-sized organizations (and even some fairly large ones) still stumble over the problem of ever-growing data and struggle to become truly data-driven, because the underlying architecture is not strong enough to help them achieve that mission. In this post, I will focus on that topic. The beneficiaries will be engineers and organizations who are just beginning to think about moving away from their traditional data warehousing models while staying on a big data platform; in other words, those who still follow many traditional DWH principles while either already adopting big data frameworks and technologies or just thinking of going in that direction, and more specifically those on the AWS platform. That said, it is not difficult to safely rethink the whole approach in terms of other cloud providers like GCP or Azure in place of AWS. Now let's get on to the technicalities of the topic!

The troublesome architecture in a nutshell


Every organization has some kind of relational database as one of its source systems, and almost all of them also have additional, ever-growing, and increasingly difficult-to-handle data files, whether structured or unstructured. Most of these architectures evolved over time, piece by piece, as each incremental demand was bolted onto the existing structure, and over the years they turned into legacy systems that are tedious to maintain and handle. More often than not, even a small change to this architecture or to a specific flow has an adverse domino effect on the whole system, so the majority of organizations are very cautious about, and even reluctant to, change. This is a situation I have observed in a few organizations myself.

The architecture in the above diagram is only a typical example; many more combinations that become just as clumsy are possible these days. The technologies/tools listed above are also just the common ones; there could be many less common ones, and the databases could even include NoSQL stores and the like.

It all starts with the data that resides in the transactional systems. While most of it may live in one (or multiple) relational database(s), a potentially significant amount of data comes in other formats like CSV, JSON, XML, etc. When the different teams within an organization use different types of ETL tools to process and store this data, it is also very common for them to use a variety of ETL and load orchestration tools like Spark, Jenkins, Airflow, NiFi, StreamSets, etc. Ultimately, the data processed by these tools ends up scattered in a similar fashion. Initially, everything might have been kept in a database for consumption, but as requirements grow, and for the flexibility of the developers who consume the data, the target systems eventually vary greatly as well: relational databases and different types of files like CSV, JSON, Avro, Parquet, ORC, etc. This is especially prevalent in organizations that heavily use storage systems like AWS S3, and now that every organization has started using cloud services, the situation is similar for users of GCP and MS Azure. For these reasons, it becomes increasingly complex to maintain this diverse data, which keeps moving back and forth in multiple directions; eventually, there are cases where the data is unknowingly (sometimes even knowingly) duplicated in multiple formats, leading to inconsistencies, because the processes that create the redundant copies are not atomic. This is both a maintenance overhead and a source of chaos in the organization, and it makes it very hard for the enterprise to become truly data driven.

A short summary of the pain points of this type of architecture

  • Data keeps growing in volume, variety, velocity & veracity, and thus begins to become a real big-data problem
  • Data scientists and software developers find it easier to work with files (like JSON, Avro, CSV, etc.) than to read from and write to tables for their development, modelling, and training purposes. They end up creating files, and, to make this data available to downstream systems, the files eventually get loaded into tables in the relational databases.
  • Data movements and migrations increase day by day, since the relational databases remain the source of truth for downstream systems while everyone other than the analysts creates and interacts with data in other ways.
  • The traction of tools like AWS SageMaker among data scientists takes this problem to another level, as SageMaker works with S3 as its data source and target. With increasing usage of SageMaker, the database overhead multiplies many fold, because the data has to be exported to S3 as files and vice versa.
  • To cater to this wide variety, the processes and logic also end up spread across various platforms/tech stacks and scheduled in different orchestration environments, which is yet another maintenance overhead.
  • Scaling the systems is also painful for these reasons, and the guardians of the environments, the tech support specialists, have a tough time because the systems are not resilient to change.
There are in fact many more issues than these, but these are some of the common ones. Here is how the problem can really be solved with an enterprise-wide data lake implementation.

The Data Lake with AWS S3

The architecture here involves the following components

  • Storage Tier
  • Ingestion & ETL Tier
  • Abstraction Tier
  • Consumption Tier

Storage Tier

This is the heart of the architecture, where all the data is stored. The difference from the previous architecture's storage is that this tier is backed by Amazon S3 instead of a relational database. A dedicated bucket (or multiple, based on the needs) would be created to ingest all the data into respective folders, which could also be partitioned on a case-by-case basis for better storage and access.
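
To make the layout concrete, here is a minimal boto3 sketch of one way the storage tier could be organized, with a single lake bucket, a folder (prefix) per data domain, and date-based partitions; the bucket name, prefixes, region, and file names are all hypothetical.

    # Minimal sketch of a possible storage-tier layout (all names are hypothetical)
    import boto3

    s3 = boto3.client("s3", region_name="eu-west-1")

    # Create the lake bucket (CreateBucketConfiguration is required outside us-east-1)
    s3.create_bucket(
        Bucket="example-enterprise-datalake",
        CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
    )

    # Land a file under a partitioned prefix, e.g.
    # s3://example-enterprise-datalake/sales/orders/ingest_date=2021-01-01/
    with open("orders-0001.parquet", "rb") as f:
        s3.put_object(
            Bucket="example-enterprise-datalake",
            Key="sales/orders/ingest_date=2021-01-01/orders-0001.parquet",
            Body=f,
        )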

Ingestion & ETL Tier

All the jobs that ingest data from the various sources, developing, modelling, and transforming it as needed, together form the ingestion tier. This tier covers the ETL technologies and frameworks that we use to ingest data into the S3 buckets and folders according to functionality; for example, Spark, Kafka, standalone Python applications, Airflow, NiFi, StreamSets, etc. fall under this tier. And because we unify and standardize the whole architecture around a common storage system, it also gives us the flexibility to rein in the sprawl of ETL frameworks and converge on a uniform method.
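
As an illustration, here is a minimal PySpark sketch of one such ingestion job: it reads raw CSV files from a landing area and writes them to the lake as partitioned Parquet. The bucket names and paths are hypothetical and would depend on the layout chosen for the storage tier.

    # Minimal PySpark ingestion sketch (bucket names and paths are hypothetical)
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders-ingestion").getOrCreate()

    # Read the raw files dropped by the source systems
    raw = (spark.read
           .option("header", "true")
           .csv("s3a://example-landing-zone/orders/*.csv"))

    # Write to the lake as Parquet, partitioned by ingestion date
    (raw.withColumn("ingest_date", F.current_date())
        .write
        .mode("append")
        .partitionBy("ingest_date")   # partitioning keeps downstream scans cheap
        .parquet("s3a://example-enterprise-datalake/sales/orders/"))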

Abstraction Tier

Since the data in S3 consists entirely of files, in order to provide data access capabilities to non-developers (like analysts) and developers alike, we can create a data abstraction layer with Hive. In other words, Hive sits on top of the S3 data to give users SQL capabilities, so the data can be accessed through tables, views, and so on. The SQL capabilities the analysts are comfortable with are therefore not lost, and it also benefits the developers who want them.
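
For example, the Parquet data written by the ingestion job above could be exposed as an external Hive table. The sketch below uses Spark with Hive support to issue the DDL; the database, table, column, and bucket names are hypothetical, and a configured Hive metastore is assumed.

    # Minimal abstraction-layer sketch: an external Hive table over the S3 data
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-abstraction")
             .enableHiveSupport()       # assumes a Hive metastore is configured
             .getOrCreate())

    spark.sql("CREATE DATABASE IF NOT EXISTS datalake")
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS datalake.orders (
            order_id    STRING,
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (ingest_date DATE)
        STORED AS PARQUET
        LOCATION 's3a://example-enterprise-datalake/sales/orders/'
    """)
    # Register the partitions that the ingestion jobs wrote directly to S3
    spark.sql("MSCK REPAIR TABLE datalake.orders")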

Consumption Tier

With the data lake containing all the data as files, the consumption tier, which boosts performance and usability further, comprises tools and frameworks like Presto (for more performant SQL capabilities) and Alluxio (for the fastest possible data retrieval). This layer can also include any other consuming applications, whether in-house, custom-built, or already available open-source options, for data access and retrieval.
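
As an example of what consumption could look like, the sketch below queries the Hive table defined above through Presto using the PyHive client. The coordinator host, catalog, schema, and table names are all assumptions that depend on how the cluster is set up.

    # Minimal consumption sketch: querying the lake through Presto via PyHive
    from pyhive import presto

    conn = presto.connect(
        host="presto-coordinator.example.internal",  # hypothetical coordinator host
        port=8080,
        catalog="hive",
        schema="datalake",
    )
    cursor = conn.cursor()
    cursor.execute("""
        SELECT ingest_date, count(*) AS orders, sum(amount) AS revenue
        FROM orders
        GROUP BY ingest_date
        ORDER BY ingest_date
    """)
    for row in cursor.fetchall():
        print(row)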

Advantages of the new approach

  • Solving the big-data problem, in the big-data standard way with a simplified architecture
  • No need to worry about growing data, as the storage system is S3, which greatly reduces the effort required to scale and maintain databases
  • Storing data in optimized, performant formats like Parquet, ORC and Avro is also possible.
  • Data migration from environment to environment, and from files to tables or tables to files for the data scientists, is minimized and in most cases close to zero, as the data is already in file formats in S3
  • When the resources used are all in AWS (like EC2, EMR, Data Pipeline, etc.), it is much more efficient to read from and write to S3, which improves performance many fold.
  • For cases like data scientists using SageMaker, which can read from and write to S3 directly, this architecture removes the need to build processes for exporting and importing the data from and to the database for use in SageMaker
  • This architecture leads one to devise and construct homogeneous ETL jobs instead of having them in the diverse tech stacks and different orchestration platforms followed currently.
  • Non-developers like analysts can also continue to work with the data, as they do not lose SQL capabilities
  • The cost of owning and maintaining a database is drastically reduced
  • DevOps could be much easier than it used to be.
  • Scalability is now a consideration only for the compute layer and not the storage layer, thanks to S3
(Diagram: a sample illustration of the scalability of this model)
Based on the needs of the organization, this ecosystem can be further optimized and tuned, but it is already one of the nicest models for implementing a data lake for the organization, and as one can see, it outweighs the other data warehousing and data lake approaches by a wide margin.

Well now, you could start fishing inside your data lake too!
