Introduction to Azure Data Lake Storage Gen2
What is Azure Data Lake Storage
A Data Lake is a repository for storing large quantities and varieties of structured, semi-structured, and unstructured data in its native format. In essence, it is a big container of data.
Any number of Files, of any type and size, can be stored in the Azure Data Lake Storage. The files can be of any format, like - text file, XML file, JSON file, CSV file, image file, video file etc. The type of the stored file doesn’t matter, as the file can be uploaded in its raw format.
Data maintained in its raw form inside the Azure Data Lake Storage is Unstructured Data. To use a structured approach, databases can be created within the Azure Data Lake Storage; the data held within those databases is Structured Data.
Once all the data is stored inside the Azure Data Lake Storage, technologies like - Hadoop, Databricks or Data Factory process and analyze the stored data and present the results to the business.
What is Azure Data Lake Storage Gen 2
Azure Data Lake Storage Gen 2 is a set of capabilities dedicated to Big Data Analytics, built on top of Azure Blob Storage. Azure Data Lake Storage Gen 2 is the result of converging the capabilities of two existing Storage Services, i.e., Azure Blob Storage and Azure Data Lake Storage Gen 1.
Features from Azure Data Lake Storage Gen 1, such as file system semantics, directory- and file-level security, and scale, are combined with the low-cost tiered storage, high availability, and disaster recovery capabilities of Azure Blob Storage.
How Azure Data Lake Storage Gen 2 Evolved
One of the key components of Hadoop is the Hadoop Distributed File System (HDFS), which works as the storage layer for Hadoop. The features of HDFS are -
- Fault Tolerant File System - Data is stored in large blocks and replicated at least three times, so it can be easily recovered after unplanned events such as a hardware failure or natural disaster.
- Runs on Commodity Hardware - Setting up HDFS does not require expensive, specialized machines, as HDFS runs on cheap commodity hardware.
- Big Data Platforms Use HDFS - Plenty of Big Data Platforms use HDFS, such as MapReduce, Pig, Hive, Spark etc.
- HDFS in Cloud - HDFS was so widely adopted that whenever a new Big Data Platform was introduced, it was highly likely to support HDFS. So it was no surprise that companies wanted to use HDFS-compatible technologies when designing their Cloud Platforms. Microsoft Azure offers HDFS in the Cloud, and that Service is known as Azure Data Lake Storage Gen 1.
In Cloud Architecture, there came a point when Processing Power and Storage Capacity needed to be scaled independently.
- Processing Power - Increasing the Processing Power was easily sorted out by creating Virtual Machines with a significant amount of CPU and RAM. In this way, the Processing Power was optimized.
- Storage Capacity - Storage requirements can vary substantially from application to application. One application may stream audio and video, another may perform transactions over structured data, one may serve small files, and another may serve large files.
So, Microsoft realized that it already had a Cloud Object Store that had proven to be extremely useful, i.e., Azure Blob Storage.
The Features of Azure Blob Storage are -
- Azure Blob Storage is a large Object Storage in the Cloud, available in all regions.
- Azure Blob Storage is optimized for storing massive amounts of unstructured data. The unstructured data may be in the text or binary format.
- One of the key advantages of Azure Blob Storage is that, it can be used to store varieties of data. It is a general-purpose storage.
- Azure Blob Storage is cost-efficient, which is very important.
- Azure Blob Storage has multiple Performance Tiers. It is designed so that the less frequently data is accessed, the less needs to be paid per Gigabyte. So, depending on how frequently the data is accessed, the Hot, Cool, or Archive tier can be used, with significant savings when the data is not required very frequently.
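As an illustration of how tier choice affects cost, the sketch below compares monthly storage charges across tiers. The per-GB prices here are hypothetical placeholders for illustration, not actual Azure pricing.

```python
# Hypothetical per-GB monthly prices -- illustration only, NOT actual Azure rates.
TIER_PRICES_PER_GB = {"hot": 0.020, "cool": 0.010, "archive": 0.002}

def monthly_storage_cost(size_gb: float, tier: str) -> float:
    """Estimate the monthly storage cost for a given size and access tier."""
    return size_gb * TIER_PRICES_PER_GB[tier]

# Compare the cost of keeping 10 TB (10,000 GB) in each tier.
for tier in TIER_PRICES_PER_GB:
    print(f"{tier:>7}: ${monthly_storage_cost(10_000, tier):,.2f}/month for 10 TB")
```

The point of the exercise is only the shape of the trade-off: rarely-accessed data costs far less per gigabyte in the cooler tiers.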
Microsoft combined the widespread options of HDFS with all the advantages of Azure Blob Storage, and, ended up creating Azure Data Lake Storage Gen 2.
Microsoft recommends using Azure Data Lake Storage Gen 2 for Big Data Storage needs.
If the user has an existing infrastructure that is built on U-SQL language, which was specially designed for Big Data Analysis, the user should use Azure Data Lake Storage Gen 1. Otherwise, the user should always use Azure Data Lake Storage Gen 2.
What is Hierarchical Storage and Hierarchical Namespace
A Hierarchical Storage is a storage in which the collection of objects and files is organized into a tree of folders and nested sub-folders, in the same way that the File System on a computer is organized.
A Hierarchical Namespace organizes objects, or files, into a hierarchy of Directories for efficient data access.
Disadvantages of Azure Blob Storage Not Being a Hierarchical Storage
Though Azure Blob Storage is often organized in a structure that appears to include folders and sub-folders, it is not considered Hierarchical Storage, because this is simply a naming convention. The user can put slashes (“/”) in Blob names to simulate a tree-like Hierarchical Directory Structure, but these are just files in a flat namespace. This approach works, to a certain extent, to organize objects.
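To make this concrete, the sketch below simulates a flat blob namespace: slashes in blob names only look like folders, and "listing a folder" is really a prefix filter over every name. The blob names are made up for illustration.

```python
# A flat namespace: every object is just a name; slashes are part of the name.
blobs = [
    "sales/2023/jan.csv",
    "sales/2023/feb.csv",
    "sales/2024/jan.csv",
    "logs/app.log",
]

def list_virtual_folder(blobs, prefix):
    """Simulate 'listing a folder' by filtering flat names on a prefix."""
    return [name for name in blobs if name.startswith(prefix)]

print(list_virtual_folder(blobs, "sales/2023/"))
# There is no real 'sales/2023' directory object -- only names sharing a prefix.
```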
When manipulation actions like moving, renaming, or deleting are performed on Files or Directories in Azure Blob Storage, the slash-based structure is of no help: without real Directories, applications must perform a separate operation on each of the potentially millions of individual Files stored in Azure Blob Storage to achieve Directory-level tasks.
Manipulation operations on Files or Directories are performed from the client side (front-end) in Azure Blob Storage. Example: if a folder with 5,000 files in a traditional object store, like Azure Blob Storage, needed to be renamed, the 5,000 files would first be copied and then the 5,000 originals deleted, because these operations are orchestrated from the front-end.
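The copy-then-delete behavior described above can be sketched as follows. The helper function is an illustrative stand-in, not a real Azure Blob Storage API call.

```python
def rename_virtual_folder(blobs, old_prefix, new_prefix):
    """Rename a 'folder' in a flat namespace: one copy plus one delete per blob.

    Returns the new name list and the number of per-object operations performed.
    """
    operations = 0
    renamed = []
    for name in blobs:
        if name.startswith(old_prefix):
            renamed.append(new_prefix + name[len(old_prefix):])  # copy to new name
            operations += 2  # one copy + one delete of the original
        else:
            renamed.append(name)
    return renamed, operations

# Renaming a 'folder' holding 5,000 blobs costs 10,000 per-object operations.
blobs = [f"reports/file_{i}.csv" for i in range(5000)]
renamed, ops = rename_virtual_folder(blobs, "reports/", "archive/")
print(ops)  # 10000: 5000 copies + 5000 deletes
```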
Traditional Object Stores have not historically supported a Hierarchical Namespace, because it was thought that a Hierarchical Namespace would limit the scalability of the Object Store.
Advantages of Azure Data Lake Storage Gen 2 Being a Hierarchical Storage
- Azure Blob Storage does not support a Hierarchical Structure, but Hadoop requires a Hierarchical Namespace to integrate with the Storage. That is why Hadoop cannot be integrated with Azure Blob Storage, whereas Azure Data Lake Storage Gen2 supports a Hierarchical Namespace and can therefore be seamlessly integrated with the huge ecosystem of Hadoop software.
- A Hierarchical Namespace handles the tasks of moving, renaming, or deleting Files or Directories by updating a single entry. Azure Data Lake Storage Gen2 is designed to perform operations on whole Folders. Hence, Azure Data Lake Storage Gen2 is easily manageable.
- Manipulation operations on Files or Directories are performed on the server side (back-end) in Azure Data Lake Storage Gen2.
- The Hierarchical Namespace feature has significantly improved the overall performance of many Analytics jobs. This means less Compute Power is required to process the same amount of data, which lowers the total cost of ownership for end-to-end Analytics jobs.
- The Hierarchical Namespace in Azure Data Lake Storage Gen2 scales linearly without degrading either data capacity or performance.
- Once the Hierarchical Namespace is enabled on an Azure Blob Storage Account, it cannot be reverted to the flat file namespace.
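By contrast with the flat-namespace rename, a hierarchical namespace can model a directory as a single entry, so a rename touches one record regardless of how many files the directory holds. The sketch below illustrates the idea; it is a simplified model, not the actual Gen2 implementation.

```python
# Directories are real entries; files hang off them instead of
# encoding the whole path inside every object name.
namespace = {
    "reports": [f"file_{i}.csv" for i in range(5000)],
    "logs": ["app.log"],
}

def rename_directory(namespace, old_name, new_name):
    """Rename a directory by updating a single namespace entry."""
    namespace[new_name] = namespace.pop(old_name)  # one metadata operation
    return 1  # operations performed, independent of file count

ops = rename_directory(namespace, "reports", "archive")
print(ops)  # 1 operation, no matter how many files the directory contains
```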
When Hierarchical Namespace is Not Used
There are some scenarios where a Hierarchical Namespace is not used, because some workloads do not gain any benefit from enabling it.
Example: database backups, or image storage, where the organization of the objects is stored separately from the objects themselves, such as in a separate database.
Important Features of Azure Data Lake Storage Gen 2
A) Integration -
- POSIX Compliant - Azure Data Lake Storage Gen 2 can be integrated with almost all Big Data Analytics platforms. The biggest advantage of Azure Data Lake Storage Gen 2 being POSIX compliant is that it can act as a replacement for the Hadoop Distributed File System (HDFS). This means that the Azure Data Lake Storage Service, or ADLS, can seamlessly integrate with a huge ecosystem of Hadoop software.
- Hadoop Integration - Data can be accessed and managed in ADLS as if it were in the Hadoop Distributed File System (HDFS). ADLS can be integrated with major Hadoop distributions, like - Hortonworks Data Platform, Cloudera Enterprise Data Hub, or projects like - Databricks, Spark, Storm, Flume, Sqoop, Kafka etc.
- Usage of ABFS Driver - There is a driver, called the “ABFS Driver”, which is optimized specifically for Big Data Analytics. This driver is available within all Apache Hadoop environments, including Azure HDInsight and Azure Databricks. Using this driver, data stored in Azure Data Lake Storage Gen 2 can be easily accessed.
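The ABFS driver addresses data through URIs of the form `abfss://<filesystem>@<account>.dfs.core.windows.net/<path>`. The small parser below pulls those parts out of a sample URI; the account, container, and path names are made up for illustration.

```python
from urllib.parse import urlparse

def parse_abfs_uri(uri):
    """Split an abfs(s):// URI into file system, account host, and path."""
    parts = urlparse(uri)
    filesystem, _, host = parts.netloc.partition("@")  # netloc is "<filesystem>@<host>"
    return {
        "scheme": parts.scheme,          # abfs (plain) or abfss (TLS)
        "filesystem": filesystem,        # container / file system name
        "account_host": host,            # <account>.dfs.core.windows.net
        "path": parts.path.lstrip("/"),  # directory/file path inside the file system
    }

uri = "abfss://data@myaccount.dfs.core.windows.net/raw/sales/2024.csv"
print(parse_abfs_uri(uri))
```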
- Other Azure Services Integration - Azure Data Lake Storage Gen 2 can also be integrated with other Azure Services, like - Azure Data Factory can be used to ingest the data into ADLS, Azure Databricks can be used to extract, transform and load the data into ADLS, Azure Event Hub can be used to capture events and store into ADLS, PowerBI can be used to analyze and visualize the data that are stored in ADLS etc.
B) Scalability - In Azure Storage, there is no limit on the amount of data that can be stored, as Azure Storage is scalable by design. This applies to both Azure Blob Storage and Azure Data Lake Storage Gen 2.
The first thing to understand about Azure Data Lake Storage Gen 2 is that it is designed for Enterprise Big Data Analytics. This is the primary purpose of both Data Lake Services. There is no fixed size limit on Files or Accounts. So, Azure Data Lake Storage Gen 2 can store and serve (i.e., process with Hadoop) many Files, with sizes running up to Petabytes or even Exabytes.
C) Cost-Effective -
- Built on Top of Azure Blob Storage - Azure Data Lake Storage Gen 2 is built on top of low-cost Azure Blob Storage. That is why Azure Data Lake Storage Gen 2 costs almost half of the price of Azure Data Lake Storage Gen 1.
- No Need to Move Data - In Azure Data Lake Storage Gen 2, there is no need to move the data for Analytics operations to be performed. Analytics operations can be performed in place, where the data resides in ADLS, which saves a lot of compute cost.
D) Performance - Azure Data Lake Storage Gen 2 is also optimized for high-speed throughput. Because of the Hadoop file system semantics, Azure Data Lake Storage Gen 2 keeps the data close to the Compute and is designed to support the throughput needed for Parallel Processing scenarios.
E) Security - Azure Data Lake Storage Gen 2 has many added security features. For example, ADLS allows permissions to be set at the individual File or Directory level, whereas Azure Blob Storage allows permissions to be set only at the Container level, not on individual Blobs.
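Gen2's file- and directory-level ACLs use POSIX-style read/write/execute bits. The sketch below converts an "rwx" permission string such as "rwxr-x---" into its octal form, the notation commonly used for POSIX permissions; it illustrates the permission model only and is not an Azure API call.

```python
def rwx_to_octal(perm: str) -> str:
    """Convert a 9-character POSIX permission string (e.g. 'rwxr-x---') to octal."""
    assert len(perm) == 9, "expected owner/group/other triplets"
    digits = []
    for i in range(0, 9, 3):
        triplet = perm[i:i + 3]  # one rwx triplet: owner, group, or other
        value = (4 if triplet[0] == "r" else 0) \
              + (2 if triplet[1] == "w" else 0) \
              + (1 if triplet[2] == "x" else 0)
        digits.append(str(value))
    return "".join(digits)

print(rwx_to_octal("rwxr-x---"))  # 750: owner full, group read+execute, others none
```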
F) Fault Tolerant / High Availability / Disaster Recovery - In a typical Hadoop implementation, data is replicated three times in a Cluster, and Azure Data Lake Storage Gen 2 has adopted this practice to ensure that the data remains available in case of a hardware failure, natural disaster etc.
G) Global Footprint - Azure Blob Storage is available in all the regions. Since, Azure Data Lake Storage Gen 2 is built on top of Azure Blob Storage, it is also available in all the regions.
Challenge of Azure Data Lake Storage Gen 2
A) Hard to Query Unstructured Data - Azure Data Lake Storage Gen 2 can store unstructured data, which does not have a schema. Hence, it can be hard to query, or consume, this kind of data.
B) Hard to Manage Data Quality - Azure Data Lake Storage Gen 2 also has an inherent challenge in managing data quality. Unless good protocols are put in place prior to the movement of data, challenges can arise, such as who has access to the data and who manages the data Pipelines.