AWS implementations

10 Key Considerations for Cloud Data Lakes

and why AWS has the best tech stack to implement a data lake

Gaurav Thalpati
AWS in Plain English
5 min read · Feb 6, 2022


Introduction

Data lakes are one of the most critical building blocks of any data ecosystem.

Many enterprises have been using data lakes for several years, mainly as a dumping ground for all of their data: hot, cold, warm, incremental, transactional, historical, archived, and more.

Data lakes have become more popular than ever with the advent of lakehouse architectures. Please read my blog post on the trends in 2022 to learn more about lakehouse architectures.

So why are data lakes so popular even in modern data ecosystems?

Data lakes are the backbone of lakehouse architectures. They serve as a single repository for all the data, which can be processed, transformed & analyzed by any compute engine of choice. Such a repository needs to be flexible, secure, durable, scalable & highly available to meet ever-increasing data demands.

The data lake remains the most critical layer in such architectures. It should be accessible 24/7 to data engineers and many other personas, including data scientists, analysts, business users, and executives.

Here is a list of 10 key considerations for building cloud data lakes & how various AWS services can be used to implement them.

1. Single Source of Truth

A data lake should hold all the data in its “as-is” raw form. It should have data from all source systems without creating silos. “Acquire once, use multiple times” should be the approach when populating the data lake. Whenever a business unit needs data from a new source system, it should be onboarded into the central data lake so that everyone else can use it too.

Amazon S3 is the most popular & widely used service for implementing a data lake.
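
As a rough sketch, here is how such a central repository might be bootstrapped with boto3. The bucket name & region are placeholders, and the versioning & public-access settings are sensible defaults rather than anything prescribed here:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create a central bucket to act as the single data repository.
s3.create_bucket(Bucket="acme-central-data-lake")

# Enable versioning so accidental overwrites or deletes are recoverable.
s3.put_bucket_versioning(
    Bucket="acme-central-data-lake",
    VersioningConfiguration={"Status": "Enabled"},
)

# Block all public access by default.
s3.put_public_access_block(
    Bucket="acme-central-data-lake",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```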

2. Security & Data Protection

All data within the data lake should be encrypted using appropriate approaches. Leverage AWS encryption mechanisms like SSE-S3 (S3-managed keys) or SSE-KMS (keys managed in AWS KMS).
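
For example, you could make SSE-KMS the bucket default so that every new object is encrypted at rest. A minimal sketch, assuming the bucket from earlier and a placeholder KMS key ARN:

```python
import boto3

s3 = boto3.client("s3")

# Set SSE-KMS as the bucket's default encryption.
s3.put_bucket_encryption(
    Bucket="acme-central-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
                },
                # Reduce KMS request costs for high-volume workloads.
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```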

Ensure that PII attributes are masked, encrypted, or tokenized based on your data abstraction approach. PII/PHI data should NOT be accessible to ineligible users.

3. Governance & Sharing

Like security, data governance is a day-0 activity.

Implement the right access control mechanisms so that data can be accessed and modified only by the relevant users. Use AWS Lake Formation to manage fine-grained access control. Lake Formation can manage access at the table as well as the column level, and a newer feature can restrict access to specific cells within a table based on data filters.

Enable data sharing across clusters or accounts to avoid redundancy and maintain a single source of truth. Lake Formation can help share data & control access across AWS accounts.
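
As an illustration, a column-level grant with boto3 might look like this; the database, table, columns & role ARN are all hypothetical:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on only the non-sensitive columns of a table.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst-role"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date", "amount"],
        }
    },
    Permissions=["SELECT"],
)
```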

4. ACID Compliance, Handling of Updates & Time Travel

Most data lakes have limited support for the ACID guarantees a database provides. Objects in a data lake are immutable, making it challenging to update any data in place.

Use open table formats like Apache Iceberg, Apache Hudi, or Delta Lake to implement such features.

Also, the latest features in Lake Formation (governed tables) can help implement ACID transactions & time travel on S3.
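
Here is one way this looks in practice: an ACID upsert & a time-travel query against an Iceberg table using PySpark. This sketch assumes the Iceberg runtime & AWS integration jars are available to Spark, and all catalog & table names are placeholders; the exact time-travel syntax depends on your Iceberg & Spark versions:

```python
from pyspark.sql import SparkSession

# Spark session with an Iceberg catalog backed by the AWS Glue Data Catalog.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.warehouse",
            "s3://acme-central-data-lake/warehouse")
    .getOrCreate()
)

# A toy batch of incoming change records, registered as a temp view.
updates = spark.createDataFrame(
    [(1001, "2022-02-06", 99.50)], ["order_id", "order_date", "amount"]
)
updates.createOrReplaceTempView("updates")

# ACID upsert into an Iceberg table stored on S3.
spark.sql("""
    MERGE INTO lake.sales_db.orders t
    USING updates s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: read the table as of an earlier point in time (Spark 3.3+).
spark.sql(
    "SELECT * FROM lake.sales_db.orders TIMESTAMP AS OF '2022-01-01 00:00:00'"
).show()
```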

5. Easy Integration with other Compute Engines

Modern architectures demand separate scaling of compute & storage. They also need the flexibility to have various compute engines to process & analyze data based on the workload.

Data lakes should be flexible enough to allow easy connectivity & integration with native services (Amazon Athena, Redshift Spectrum) as well as external compute engines (Databricks, Snowflake).
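
For instance, querying the lake through Athena needs no cluster at all; a minimal boto3 sketch with placeholder database & output location:

```python
import boto3

athena = boto3.client("athena")

# Run a serverless SQL query directly against data in the lake.
resp = athena.start_query_execution(
    QueryString="SELECT order_date, SUM(amount) FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://acme-athena-results/"},
)
print(resp["QueryExecutionId"])
```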

6. Scalable, Durable & Available

Scalable: with ever-increasing data demands, you cannot predict how much capacity will be required. Ensure that the data lake is easy & quick to scale.

Durable: data stored today might not be accessed for a decade. Amazon S3 offers eleven nines (99.999999999%) of durability.

Always available: data lakes should be accessible 24/7.

Amazon S3 has all of these features built in; you don't need any manual effort to implement them.

7. Backup & Replication

A data lake should be able to back up & restore data with minimal RTO and RPO. Since S3 is highly durable and available (99.99% availability for Standard storage), you generally don't need to manage such backups manually.

S3 Cross-Region Replication can replicate data to another region, ensuring that data remains available even during a regional failure.
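
A replication rule could be configured like this with boto3. Versioning must already be enabled on both buckets; the IAM role & destination bucket ARNs are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Replicate every new object to a bucket in another region.
s3.put_bucket_replication(
    Bucket="acme-central-data-lake",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
        "Rules": [
            {
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # replicate the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::acme-central-data-lake-dr"},
            }
        ],
    },
)
```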

8. Cost Optimizations

Data lakes can grow in size quickly. Increased storage results in increased cost.

Data lakes should have automated mechanisms for optimizing cost.

Leverage S3 features like lifecycle policies & Intelligent-Tiering to keep costs in check.
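
A lifecycle rule along these lines could tier down the raw zone automatically; the prefix & day thresholds are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Move aging data to cheaper storage tiers automatically.
s3.put_bucket_lifecycle_configuration(
    Bucket="acme-central-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-zone",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```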

9. Data for Everyone!

A data lake should cater to all users.

It should have:

  • Raw data for data scientists
  • Pre-cleansed data for data engineers
  • Curated data for analysts

Always store data in multiple layers like the Bronze (raw), Silver (cleansed) & Gold (transformed) zones. Different enterprises follow different naming standards for these layers, but they serve the same purpose: to segregate data and control access according to users and their roles.
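
As a sketch, the zone layout can be as simple as one prefix per layer within the same bucket; the file & prefix names here are illustrative:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "acme-central-data-lake"  # placeholder

# The same dataset lands in a different zone at each stage of the pipeline.
s3.upload_file("orders_raw.json", BUCKET,
               "bronze/orders/dt=2022-02-06/orders.json")
s3.upload_file("orders_cleansed.parquet", BUCKET,
               "silver/orders/dt=2022-02-06/orders.parquet")
s3.upload_file("orders_daily_agg.parquet", BUCKET,
               "gold/orders_daily/dt=2022-02-06/orders.parquet")
```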

10. Support for Data Lineage & Metadata Discovery

And finally, the most important of all.

Make your data discoverable.

Create the metadata, tag it correctly, and give the right users access so they can use it effectively. AWS Lake Formation provides tag-based access control (LF-TBAC), which can simplify permission management at scale.
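
A rough sketch of that tag-based flow with boto3: define a tag, attach it to a table, then grant permissions by tag expression. All names & ARNs are placeholders:

```python
import boto3

lf = boto3.client("lakeformation")

# Define a tag and its allowed values.
lf.create_lf_tag(TagKey="domain", TagValues=["sales", "marketing"])

# Attach the tag to a table in the Glue Data Catalog.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    LFTags=[{"TagKey": "domain", "TagValues": ["sales"]}],
)

# Grant SELECT on every table carrying domain=sales, now and in the future.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/sales-analyst"
    },
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "domain", "TagValues": ["sales"]}],
        }
    },
    Permissions=["SELECT"],
)
```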

Closing Thoughts

A data lake is the backbone of the modern data ecosystem.

The true value of data is realized only when it is accessed, processed, analyzed & used to generate insights or predictions.

A robust data lake to persist this data will not only help users extract the right value from it but also help them innovate & discover endless possibilities for leveraging it.

I hope you have found this useful. Thanks for reading.

