AWS Lake Formation is a service that automates the process of creating data lakes in AWS. In other words, Lake Formation allows you to build, manage, and secure a data lake automatically, so you don’t have to set up each of the required services manually. This guide builds a theoretical foundation for understanding AWS Lake Formation before demonstrating its configuration in practice.
Setting up a new data lake in AWS and getting it ready to use can be an involved process. You have to create your data store, register it, define data access and security policies, ingest data, and use analytics and machine learning services separately. To accomplish these tasks, you can use different services such as Amazon S3 for storage, AWS Glue crawlers and classifiers for crawling and data categorization, and AWS Glue jobs for data transformation.
Then, for analysis and data visualization, you can use tools such as Athena, QuickSight, and EMR. Using these services and tools individually and configuring them manually to form your data lake is a difficult, time-consuming process.
Lake Formation simplifies this process by automating many of the steps that would otherwise have to be done manually. In essence, Lake Formation helps you to:
- Register the storage locations where your data lake will reside
- Orchestrate data flows
- Create and manage a centralized data catalog containing metadata
- Define fine-grained permissions for data access
Because Lake Formation provides centralized security controls, it helps you build a secure data lake, meaning all your data is available for gaining insights without compromising access control.
In this article, you will learn about:
- What a data lake is, its characteristics, and why we need data lakes
- The Amazon services required to build data lakes
- The functionality of AWS Lake Formation
- A practical demonstration of the Lake Formation process
Before diving into the AWS Lake Formation process and the different AWS services involved, you first need to understand what a data lake is. Let’s start with a brief introduction.
What is a Data Lake?
A data lake is a centralized repository where you can store all of your data without worrying about its structure or scaling restrictions. You can store your data as-is, without structuring it first, and then run different types of analytics on it.
There is a common misunderstanding that a data lake is just a powerful database. In fact, data lakes provide multiple functions such as data ingestion, processing, visualization, and the application of machine learning algorithms. More importantly, unlike in databases, the storage and processing mechanisms of a data lake are decoupled. This allows the use of different AWS data processing and data visualization services.
Data lakes are also different from data warehouses, as you can store both structured and unstructured data in a data lake. Unlike in a data warehouse, there is no need to define the schema or structure of the data up front. This means you do not need to worry about the design of the data or the data-related questions that might arise in the future. You can also apply different types of analytics to your data, such as real-time analytics, machine learning, SQL queries, and text searches, to gain insights.
A data lake usually consists of the following elements:
- Storage
- Data ingestion
- Analytics and machine learning
- Real-time analysis
Why Data Lakes?
Traditional databases and object storage are prone to load problems, and distributed databases have security and data cataloging issues. In contrast, a data lake enables organizations to perform new types of analytics, such as applying machine learning algorithms to new sources like social media, log files, click-stream data, and IoT devices. This helps businesses identify opportunities for faster growth, make new business decisions, and outperform their competitors.
The main reasons for adopting data lakes are:
- Increased operational efficiency
- More data availability
- Lower transactional costs
- Offload capacity from databases and data warehouses
Data lake characteristics
The most important characteristic of data lakes is that they should be data agnostic. This means that they should allow storing of both structured and unstructured data and should not be limited to just one type of file or data structure.
In addition, all the data should be stored in a centralized place to break down data silos (which means that data of one type should not be stored separately from data of another type).
Amazon services to build a data lake
You can build a data lake by using different services offered by Amazon. These services can be categorized into three types:
- Data storage and cataloging services: The preferred AWS services for the data layer are Amazon Simple Storage Service (S3) and S3 Glacier. These services are used to build the data lake itself, not to populate or analyze it. S3 uses the concept of a “bucket” to store all types of content in a central repository. In S3, you can use different storage tiers to categorize your data. These tiers are based on how frequently the data is accessed, and you are charged accordingly; for frequently accessed data, you can select a tier that costs less for access requests and more for storage. AWS Glue is a data processing and cataloging service you can use to catalog the data in your data lake and prevent it from becoming a data swamp. The AWS Glue Data Catalog is the metadata repository that represents your data, and you can use a Glue “crawler” to populate it with tables (see the crawler sketch after this list).
- Data movement services: These services are used to ingest data into your data lake. For real-time data ingestion, you can use Kinesis Data Streams or Kinesis Data Firehose, whereas Amazon API Gateway lets you ingest RESTful data using standard HTTP calls. You can also use AWS Data Exchange to pull public data from third parties and store it with your own data, and Amazon AppFlow to collect data from third-party software and securely transfer it to your data lake.
- Analytics and processing services: Amazon provides a great serverless service called Athena, with which you can run SQL queries directly against data in S3. You can also use Amazon EMR to process and analyze data in batches, whereas AWS Lambda is available to transform raw data collected in real time.
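To make the cataloging side concrete, here is a minimal boto3 sketch that creates and starts a Glue crawler over an S3 path. The crawler name, database name, role ARN, and bucket path are hypothetical placeholders, not values from this article.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names: replace with your own bucket, database, and IAM role.
glue.create_crawler(
    Name="my-datalake-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",  # needs S3 and Glue access
    DatabaseName="my_datalake_db",  # Glue Data Catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/raw/"}]},
)

# Run the crawler; it infers schemas and creates tables in the Data Catalog.
glue.start_crawler(Name="my-datalake-crawler")
```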
Keeping these three categories in mind makes even complex data lake architectures easier to understand.
How does AWS Lake Formation work?
The workings of AWS Lake Formation can be divided into the following three parts:
- Pointing Lake Formation to your data sources: The first step is to specify your data sources. A data source can be Amazon S3 or a relational or NoSQL database. Lake Formation allows you to import data both from existing data sources already stored in AWS and from other external sources, either in bulk or incrementally.
- Crawling and moving data into the data lake: After you define data sources and specify data access credentials, Lake Formation reads your data (and metadata) with the help of a crawler, which moves the data into your new S3 data lake. Lake Formation then manages ETL jobs, data cataloging, security settings, and access control. Crawlers use classifiers that read the data and generate its schema once they recognize the data’s format. You can create custom crawlers for this purpose, but Lake Formation also provides several templates to choose from, based on predefined data sources such as a relational database or AWS CloudTrail logs.
- Using analytics services: Once the data is securely stored, you can use different analytics and machine learning services such as Athena, Redshift, and EMR. Athena is a great service that allows you to query data sets using standard SQL without loading them into a database (see the sketch after this list). With Amazon Redshift, you can run complex SQL queries for data analytics, and EMR is a cluster platform that simplifies running big data frameworks like Apache Hadoop and Spark on AWS.
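As an illustration of the analytics step, the following sketch runs a SQL query with Athena via boto3. The database name, table name, and results bucket are hypothetical assumptions for this example.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database/table names and results bucket.
response = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",
    QueryExecutionContext={"Database": "my_datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)

# Athena runs asynchronously; check the query's state before fetching results.
query_id = response["QueryExecutionId"]
status = athena.get_query_execution(QueryExecutionId=query_id)
print(query_id, status["QueryExecution"]["Status"]["State"])
```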
AWS Lake Formation in practice
Now that we have enough theoretical understanding of Lake Formation, let’s see how to use it on AWS. The steps below will help you create a data lake (in the form of an S3 bucket) and register its location.
For AWS Lake Formation, you first have to create a new administrator user and assign full administration permissions to it, because your root user cannot be the administrator of a data lake.
After logging in to your AWS account, you can use the following steps to create a new admin user.
- Creating the admin user for Lake Formation
To create the admin user account, you will use the IAM dashboard.
- Type IAM in the search bar and select it
- Click on the “Users” option from the dashboard that appears on the left of the screen
- Click on the “Add User” button, and type a name for the new user
- Select both programmatic and AWS Management Console access
- This will open the password textbox, so set a console password
- Uncheck the password reset requirement and click “Next”
Next, you have to grant administrator and AWS Lake Formation permissions to the newly created user (a boto3 sketch of these steps follows the list below). To do so:
- Select “Attach existing policies directly”
- Assign administrator and data admin permissions by searching for and checking AdministratorAccess and AWSLakeFormationDataAdmin
- Review the settings and click the “Create user” button
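If you prefer scripting these console steps, here is a minimal boto3 sketch that creates the same admin user. The user name and password are placeholders; AdministratorAccess and AWSLakeFormationDataAdmin are AWS managed policies.

```python
import boto3

iam = boto3.client("iam")

user_name = "datalake-admin"  # hypothetical user name

iam.create_user(UserName=user_name)

# Console access with a password; PasswordResetRequired=False mirrors
# unchecking the password reset option in the console.
iam.create_login_profile(
    UserName=user_name,
    Password="ChangeMe-Str0ngPassw0rd!",  # placeholder: use a strong secret
    PasswordResetRequired=False,
)

# Attach the AWS managed policies used in the console steps above.
for policy in ("AdministratorAccess", "AWSLakeFormationDataAdmin"):
    iam.attach_user_policy(
        UserName=user_name,
        PolicyArn=f"arn:aws:iam::aws:policy/{policy}",
    )
```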
- Assigning the Lake Formation administrator
For Lake Formation itself, you can follow these steps:
- Type AWS Lake Formation in the search bar and select it (you will be prompted to add an administrator if you are accessing it for the first time)
- Otherwise, click “Admins and database creators” in the dashboard on the left side
- Click on the “Grant” button at the top right
- Select the admin account you created previously
After selecting the administrator for your data lake, log out of your root account and log back in with the newly created admin account. The sketch below shows how the same admin assignment can be scripted.
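A minimal sketch, assuming a hypothetical account ID and the user created earlier. Note that put_data_lake_settings replaces the settings wholesale, so the sketch reads the current settings first and only modifies the admin list.

```python
import boto3

lf = boto3.client("lakeformation")

# Read current settings so we don't overwrite other configuration.
settings = lf.get_data_lake_settings()["DataLakeSettings"]

# Hypothetical account ID and user name.
admin_arn = "arn:aws:iam::123456789012:user/datalake-admin"
settings["DataLakeAdmins"] = [{"DataLakePrincipalIdentifier": admin_arn}]

# put_data_lake_settings replaces the existing settings with this object.
lf.put_data_lake_settings(DataLakeSettings=settings)
```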
- Granting access permissions
- Click on AWS Lake Formation, and then “Admins and database creators”
- Click on the “Grant” button in the Database creators section
- Select the user you created from the drop-down list and assign the “Create database” permission
This user now has full access inside Lake Formation; other users can be given different access permissions. A boto3 sketch of this grant follows.
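The same grant expressed in boto3: CREATE_DATABASE on the catalog resource for the admin user (the principal ARN is a placeholder).

```python
import boto3

lf = boto3.client("lakeformation")

# Grant the user permission to create databases in the Data Catalog.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/datalake-admin"
    },
    Resource={"Catalog": {}},  # the catalog itself, not a specific database
    Permissions=["CREATE_DATABASE"],
)
```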
- Creating the S3 bucket
The next step is to create an object storage bucket for data collection. We will create an S3 bucket, because S3 acts as the storage layer of your data lake. You then need to register that bucket as your data lake location.
- Search S3 and select it
- Click on the “Create Bucket” button present on the top right side
- Name your bucket
- Enable versioning and encryption
- Click on the “Create Bucket” at the bottom of the page
- Now create three folders representing different levels of data quality (for example, raw, cleaned, and curated), or just a single folder, using the “Create folder” button. The sketch below shows the same setup in code.
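A minimal boto3 sketch of the same bucket setup. The bucket name, region, and folder names are assumptions; S3 “folders” are just zero-byte keys ending in a slash.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

bucket = "my-datalake-bucket"  # hypothetical; bucket names must be globally unique

# Outside us-east-1 you must also pass CreateBucketConfiguration.
s3.create_bucket(Bucket=bucket)

# Enable versioning, mirroring the console option above.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Enable default server-side encryption with S3-managed keys.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Folders for different data quality levels (hypothetical names).
for prefix in ("raw/", "cleaned/", "curated/"):
    s3.put_object(Bucket=bucket, Key=prefix)
```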
- Adding the data lake location
- Type AWS Lake Formation in the search bar and select it
- Select “Data lake locations” from the dashboard on the left side and click on the “Register location” button
- Select the bucket you have created, using the “Browse” button
- Click on the “Register location” button, leaving the other options as they are (a boto3 sketch of this registration follows)
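The same registration in boto3, assuming the hypothetical bucket from earlier; UseServiceLinkedRole=True corresponds to leaving the role option at its default in the console.

```python
import boto3

lf = boto3.client("lakeformation")

# Register the S3 bucket (or a prefix within it) as a data lake location.
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-datalake-bucket",  # hypothetical bucket
    UseServiceLinkedRole=True,  # let Lake Formation use its service-linked role
)
```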
With the data lake set up, the next steps are to ingest data, create tables, and read data from them.
Summary
AWS Lake Formation is a service that helps you build your data lakes on AWS. It is a great tool for automatically performing many of the manual steps required to create a data lake: specifying your data sources, moving existing data into your data lake, cleaning and classifying it, and assigning security permissions for access. It is important to understand, however, that Lake Formation is not a data lake itself. Rather, it is a service through which you put together the building blocks of your data lake.