Introduction to Amazon EMR Cluster

18 min read · Jul 24, 2024

Why Is Amazon EMR Needed at All?

The use of the internet, IoT devices, sensors, social media, and so on has resulted in a vast amount of data being generated every day.
For companies to process and derive insights from this data at Gigabyte or Terabyte scale, or even to train Machine Learning Models on it, a Big Data Technology or Framework with Distributed Computing capability needs to be used.
Examples of such Distributed Computing Frameworks are Apache Hadoop and Apache Spark.

Now, instead of running these Distributed Computing Frameworks on On-Premise machines, it is often preferable to run them on Amazon EMR, which comes with pre-installed Big Data Frameworks, such as Apache Hadoop, Apache Spark, Presto, Flink, TensorFlow and so on.

What is Amazon EMR?

Amazon EMR, or Amazon Elastic MapReduce, is a Service offered by AWS that comes with pre-installed Big Data Frameworks, such as Apache Hadoop, Apache Spark, Presto, Flink, TensorFlow and so on.

Amazon EMR is a Managed Cluster Platform that simplifies running the previously mentioned Big Data Frameworks to process and analyze vast amounts of data at Gigabyte or Terabyte scale for Analytics purposes or Business Intelligence Workloads.

Amazon EMR can easily integrate with different Data Stores and Databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

Understanding the Cluster and Node Concept with Respect to Amazon EMR

Cluster:

When a piece of code is run on a single Amazon EC2 Instance, the piece of code is run on a single machine. Similarly, when a piece of code is run on a personal laptop, it is run on a single machine.

The central concept of Amazon EMR is the Cluster, which is a collection of Amazon Elastic Compute Cloud (EC2) Instances.

That means, when an Amazon EMR Cluster is provisioned, a Cluster of Nodes is provisioned, where each Node in that Cluster is an EC2 Instance provisioned for Amazon EMR.

So, when a piece of code runs on Amazon EMR to process some data, the data is processed across multiple EC2 Instances as a Distributed Computation System.

To explore Amazon EMR, it is necessary to provision a Cluster of Machines, which is called a Cluster of Nodes. Each EC2 Instance is a Node in the Amazon EMR Cluster.
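To make the idea of Distributed Computation a little more concrete, here is a minimal sketch in plain Python. It is not how Amazon EMR or Apache Spark is implemented; it only imitates the pattern on one machine, using threads in place of Nodes. The function names and the word-count task are hypothetical choices made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records out round-robin, one partition per pretend Node."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """The work that each pretend Node does on its own slice of the data."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(records, num_nodes=3):
    """Process every partition in parallel, then merge the partial results."""
    partitions = split_into_partitions(records, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total
```

Calling distributed_word_count(["spark on emr", "hadoop on emr"], num_nodes=2) splits the two lines between two workers and merges their counts, which is the same split, process, and combine shape that a real Cluster applies across EC2 Instances.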

Node:

Each EC2 Instance in an Amazon EMR Cluster is called a Node.

Each Node has a role within the Cluster, referred to as the Node Type. Amazon EMR also installs different software components on each Node Type, giving each Node a role in a Distributed Application, such as Apache Spark.

The following are the three Node Types in an Amazon EMR Cluster -

A. Primary Node/Master Node: The Primary Node/Master Node manages the Amazon EMR Cluster by running software components that coordinate the distribution of Data and Tasks among the other Nodes (Core Nodes and/or Task Nodes) for processing.
The Primary Node/Master Node tracks the status of the Tasks assigned to the other Nodes (Core Nodes and/or Task Nodes) and monitors the health of the Amazon EMR Cluster.
Whenever an Amazon EMR Cluster is provisioned, it always has a Primary Node/Master Node.
An Amazon EMR Cluster can also be created as a Single-Node Cluster, in which case the Cluster has only the Master Node.

However, to process Big Data quickly and efficiently, an Amazon EMR Cluster needs to be provisioned with one Primary Node/Master Node and multiple Core Nodes.

B. Core Node: A Core Node is a Node with software components that run the Tasks assigned to it and store Data in the Hadoop Distributed File System (HDFS) on the Amazon EMR Cluster.

So, the Primary Node/Master Node assigns the Tasks to be executed to the Core Nodes, and coordinates the distribution of the Data on which those Tasks are executed. The Core Nodes store that Data in the Hadoop Distributed File System (HDFS).

A multi-node Amazon EMR Cluster has at least one Core Node.
So, if an Amazon EMR Cluster is provisioned containing 2 EC2 Instances, then one of the EC2 Instances would be the Master Node, and the remaining EC2 Instance would be the Core Node.

Suppose an Amazon EMR Cluster is provisioned containing 5 EC2 Instances. Then one of the 5 EC2 Instances would be the Master Node, and the remaining 4 EC2 Instances would be the Core Nodes.

C. Task Node: A Task Node is a Node with software components that only run the Tasks assigned to it and do not store Data in the Hadoop Distributed File System (HDFS) on the Amazon EMR Cluster.

So, the Primary Node/Master Node assigns the Tasks to be executed to the Task Nodes, and coordinates the distribution of the Data on which those Tasks are executed.
The Task Nodes just execute the assigned Tasks and process that Data, but do not store it in the Hadoop Distributed File System (HDFS).

Task Nodes are optional. Depending on the use case, it needs to be determined if a Task Node will be provisioned in an Amazon EMR Cluster or not.

Suppose an Amazon EMR Cluster is provisioned containing 5 EC2 Instances. Then one of the 5 EC2 Instances would be the Master Node, 3 of the EC2 Instances would be the Core Nodes, and the last remaining EC2 Instance would be the Task Node.

What is the Difference Between a Core Node and a Task Node?

A Task Node is similar to a Core Node in that both run the Tasks assigned to them, but a Task Node does not store Data in the Hadoop Distributed File System (HDFS) on the Amazon EMR Cluster.
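The Node counts in the examples above can be worked out with a small helper. The following is only a sketch of that arithmetic; the function name plan_cluster and the returned dictionary keys are hypothetical, and the only rules it encodes are the ones stated in this article: there is exactly one Master Node, Task Nodes are optional, and a multi-node Cluster has at least one Core Node.

```python
def plan_cluster(total_instances, task_nodes=0):
    """Split a number of EC2 Instances into Master, Core and Task Nodes."""
    if total_instances < 1:
        raise ValueError("a Cluster needs at least one EC2 Instance")
    if task_nodes < 0:
        raise ValueError("the number of Task Nodes cannot be negative")

    # One Instance is always the Master Node; the rest are Core or Task Nodes.
    core_nodes = total_instances - 1 - task_nodes

    # A Single-Node Cluster has no Core Node; any larger Cluster needs one.
    if core_nodes < 0 or (total_instances > 1 and core_nodes < 1):
        raise ValueError("a multi-node Cluster needs at least one Core Node")

    return {"master": 1, "core": core_nodes, "task": task_nodes}
```

For instance, plan_cluster(5) corresponds to the 1 Master Node and 4 Core Nodes example, and plan_cluster(5, task_nodes=1) corresponds to the 1 Master Node, 3 Core Nodes and 1 Task Node example.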

What are the Advantages of Using Amazon EMR?

Amazon EMR Saves Cost: When Amazon EMR is used as On-Demand Compute, there is no need to purchase and maintain any Big Data Infrastructure on the company side, i.e., on-premises.
That means there is no need to buy large servers and maintain them, which would be a lot of work and would incur a lot of expense. So, Amazon EMR helps to save cost.
Just create an AWS account and provision an Amazon EMR Cluster, then spin the Cluster up or down depending on the requirement.

Amazon EMR pricing depends on the following -

Instance type
Number of EC2 Instances that are provisioned as Nodes in the Amazon EMR Cluster
Region in which the Amazon EMR Cluster is launched

Although on-demand pricing offers low rates, the cost can be reduced even further by purchasing Reserved Instances or by using Spot Instances.
Spot Instances can offer significant savings, as low as a tenth of the on-demand price in some cases.
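The pricing factors above can be combined into a rough estimate. The sketch below assumes that the bill has two parts per Instance per hour: the EC2 price and a separate Amazon EMR fee, with any Spot discount applying only to the EC2 part. The function name and all the rates in the usage note are hypothetical placeholders, not real AWS prices, which vary by Instance type and Region.

```python
def estimate_cluster_cost(instance_count, hours, ec2_hourly_rate,
                          emr_hourly_fee, spot_discount=0.0):
    """Rough cost of running a Cluster, in the currency of the given rates.

    spot_discount is the fraction saved on the EC2 price, e.g. 0.9 when the
    Spot price is a tenth of the on-demand price.
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be at least 0 and below 1")

    # The EC2 part of the bill shrinks when Spot Instances are used.
    ec2_cost = instance_count * hours * ec2_hourly_rate * (1.0 - spot_discount)
    # The Amazon EMR fee is assumed to be unaffected by the Spot discount.
    emr_cost = instance_count * hours * emr_hourly_fee
    return round(ec2_cost + emr_cost, 2)
```

With made-up rates of 0.20 for EC2 and 0.05 for Amazon EMR, 3 Instances running for 10 hours would come to 7.5 on-demand, and to 2.1 if the EC2 part were discounted by 90 percent.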

Amazon EMR Integrates with Other AWS Services: Amazon EMR can be easily integrated with other AWS services, such as Amazon S3, Amazon DynamoDB or different Data Stores.

Amazon EMR is Deployed Easily: An Amazon EMR Cluster can be deployed from the AWS Console in a few steps, as the walkthrough later in this article shows.

Amazon EMR is Scalable and Flexible: Amazon EMR provides the flexibility to scale the Cluster up or down as the Computing needs change.

Suppose that every week, on Mondays, there is a very high workload because there is a lot of data to process, while on Tuesdays there is very little data to process. In such cases, it is possible to configure the Amazon EMR Cluster to do Auto-Scaling, so that the following happens -
1. Whenever there is a large dataset to process, Amazon EMR resizes the Cluster to add more EC2 Instances for peak workloads.

2. Whenever there is a smaller dataset to process, Amazon EMR resizes the Cluster to remove some EC2 Instances to control costs when peak workloads subside.
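The effect of Auto-Scaling within fixed bounds can be pictured as a simple clamp. This is only an illustration of the idea, not the algorithm Amazon EMR uses to decide when to resize; the function name and the workload figures are hypothetical.

```python
def target_cluster_size(required_nodes, minimum_size, maximum_size):
    """Keep the Cluster size between the configured minimum and maximum."""
    if minimum_size > maximum_size:
        raise ValueError("minimum_size cannot be larger than maximum_size")
    # Grow toward the demand, but never past the maximum or below the minimum.
    return max(minimum_size, min(required_nodes, maximum_size))
```

If a busy Monday would ideally use 12 Nodes and the limits are 3 and 8, the Cluster is capped at 8. On a quiet Tuesday that needs only 1 Node, the Cluster shrinks no further than 3.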

Amazon EMR is Reliable: Amazon EMR monitors the Nodes, i.e., the EC2 Instances in the Cluster, and by default automatically terminates and replaces an EC2 Instance in case of a failure.

Amazon EMR is Secure: Amazon EMR leverages other AWS services, such as IAM and Amazon VPC, and features, such as Amazon EC2 Key Pairs, to help secure the Cluster and the data.

Why Is a User Created in AWS?

To perform any operation in the AWS Console, an Agent is needed. That Agent can be a User, a Group, etc.

In the context of the Amazon EMR Cluster, a User needs to be created so that, once the Cluster is provisioned, a Jupyter Notebook can be run on it to process the data.

Also, it is possible to assign Permissions to the created User, so that the User does not have access to everything in the AWS account in which it is created. This is done in the IT industry to maintain security.

How to Create a User in AWS Console?

Step 1: Open AWS Console using the Root User and search for IAM (Identity and Access Management). Click on the IAM service in the search results.

Step 2: Click on the option “Users” under the “Access management” category on the left menu.

Step 3: To start creating a new User, click on the button “Create user” in the top right corner of the page.

Step 4: Specify the following user details -

A. Provide “oindrila-emr-user” in the textbox for User name.

B. Select the checkbox for the option “Provide user access to the AWS Management Console — optional”.

C. Then, select the radio button for the option “I want to create an IAM user”.

D. Select the radio button for the option “Custom password” and provide the desired password in the below textbox.

E. Uncheck the checkbox for the option “Users must create a new password at next sign-in — Recommended” as in the above section, the custom password is already provided.

F. Finally, click on the “Next” button.

Step 5: In the “Set permissions” section, since there is no Group created presently, select the radio button for the option “Attach policies directly”.

For now, select the checkbox for the option “AdministratorAccess”. That means the created User will have access to every AWS service in FULL access mode in the AWS account in which it is created.

Finally, click on the “Next” button.

Step 6: In the “Review and create” section, validate the values and options selected for the previous two sections.

Finally, click on the “Create user” button.

Step 7: Finally, the User is created.
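The same User can also be created from code instead of the Console. The sketch below assumes the boto3 library, whose IAM client provides create_user, create_login_profile and attach_user_policy. The helper name create_console_user is hypothetical, and the client is passed in as an argument so that the function itself does not depend on AWS credentials being configured.

```python
# ARN of the AWS managed policy that the Console lists as "AdministratorAccess".
ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"


def create_console_user(iam_client, user_name, password):
    """Create an IAM User with Console access and full administrator rights."""
    # Step 4.A: create the User itself.
    iam_client.create_user(UserName=user_name)

    # Steps 4.B to 4.E: give the User a custom Console password and do not
    # force a password change at the next sign-in.
    iam_client.create_login_profile(
        UserName=user_name,
        Password=password,
        PasswordResetRequired=False,
    )

    # Step 5: attach the policy directly to the User.
    iam_client.attach_user_policy(UserName=user_name, PolicyArn=ADMIN_POLICY_ARN)


# Example usage (requires boto3 and credentials allowed to manage IAM):
#   import boto3
#   create_console_user(boto3.client("iam"), "oindrila-emr-user", "<password>")
```

Attaching AdministratorAccess mirrors the walkthrough above; a narrower policy could be passed instead when the User should not have full access.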

How to Create an Access Key and a Secret Access Key for a User in AWS Console?

Step 1: In the “Users” page, the just created User “oindrila-emr-user” is displayed. Click on the name of the User.

Step 2: In the opened page of the User “oindrila-emr-user”, click on the “Security credentials” link.

Step 3: In the opened page for the “Security credentials” of the User “oindrila-emr-user”, scroll down to the “Access keys” section. Click on the “Create access key” button.

Step 4: In the opened page of “Access key best practices & alternatives”, select the radio button for the option “Command Line Interface (CLI)” under “Use case”.

Scroll down, and, select the checkbox for confirmation and finally click on the “Next” button.

Step 5: Finally, click on the “Create access key” button.

It can be seen from the below image that the Access Key and the Secret Access Key are created for the User “oindrila-emr-user”.

One thing to remember is that the Secret Access Key is shown only at the time of creation. If it is lost, it can’t be retrieved for that User, and a new Access Key and Secret Access Key would need to be created.

So, it is better to download the CSV file containing the information of the Access Key and the Secret Access Key for the User “oindrila-emr-user”.

Finally, click on the “Done” button.
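The same steps can be sketched in code, assuming the boto3 IAM client and its create_access_key call, whose response holds the new key under "AccessKey". The helper name and the CSV column headings are hypothetical. Because the Secret Access Key cannot be read again later, the sketch writes it to a CSV file straight away.

```python
import csv


def create_and_save_access_key(iam_client, user_name, csv_path):
    """Create an Access Key for the User and save both values to a CSV file."""
    response = iam_client.create_access_key(UserName=user_name)
    key = response["AccessKey"]

    # Save the pair immediately: the secret is only returned by this one call.
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Access key ID", "Secret access key"])
        writer.writerow([key["AccessKeyId"], key["SecretAccessKey"]])

    return key["AccessKeyId"]


# Example usage (requires boto3 and suitable IAM permissions):
#   import boto3
#   create_and_save_access_key(boto3.client("iam"), "oindrila-emr-user", "keys.csv")
```

The saved file holds credentials in plain text, so it should be stored somewhere private and never committed to source control.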

How to Open AWS Console for a Particular User?

Step 1: Log out of the AWS account that is currently logged in with the Root User.

Step 2: Open the downloaded CSV file containing the information of the User Name, Password and the Sign-In URL for the created User “oindrila-emr-user”.

Step 3: Copy the Sign-In URL for the created User “oindrila-emr-user” from the CSV file, and paste it into the browser.

The login page for the User “oindrila-emr-user” opens. It can be seen that the value of the “Account ID”, or “account alias”, is already pre-populated.

Step 4: Copy the User name and Password for the created User “oindrila-emr-user” from the CSV file, and paste them into the textboxes for “IAM username” and “Password” respectively. Finally, click on the “Sign in” button.

Step 5: It can be seen from the below image that the AWS Console is now opened with the created User “oindrila-emr-user”.

How to Create an S3 Bucket in AWS Console?

Step 1: Open AWS Console using the created User “oindrila-emr-user”, and search for S3. Click on the S3 service in the search results.

Click on the “Create bucket” button.

Step 2: The Bucket name must be globally unique. So, provide a unique name in the textbox for Bucket name, like “oindrila-emr-raw-data-bucket”.

It is also advisable to select an AWS Region that is close to the physical location from which the AWS service is being used. Currently, I am in India. So, the AWS Region selected is “Asia Pacific (Mumbai) ap-south-1” at the time of creating the S3 Bucket.

Keep all the other configurations in the other sections at their defaults. Finally, click on the “Create bucket” button.

It can be seen from the below image that the S3 Bucket “oindrila-emr-raw-data-bucket” is created successfully.
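For reference, the Bucket could also be created with boto3. The sketch below only builds the arguments for the S3 client's create_bucket call; the helper name is hypothetical. One detail worth knowing is that the Region is passed as a LocationConstraint, except for us-east-1, where that setting has to be left out.

```python
def build_create_bucket_request(bucket_name, region):
    """Build the keyword arguments for the S3 create_bucket call."""
    request = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be given as a constraint.
    if region != "us-east-1":
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request


# Example usage (requires boto3 and credentials with S3 permissions):
#   import boto3
#   region = "ap-south-1"
#   request = build_create_bucket_request("oindrila-emr-raw-data-bucket", region)
#   boto3.client("s3", region_name=region).create_bucket(**request)
```

Since Bucket names are shared across all AWS accounts, the call fails if the chosen name is already taken by someone else.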

What is Amazon VPC?

Amazon VPC stands for Amazon Virtual Private Cloud. It makes it possible to launch AWS resources in a logically isolated virtual network that is defined by the user. This virtual network closely resembles a traditional network operated in a data center, with the benefits of using the scalable infrastructure of AWS.

The following diagram shows an example VPC. The VPC has one subnet in each of the Availability Zones in the Region, EC2 instances in each subnet, and an internet gateway to allow communication between the resources in your VPC and the internet.

How to Create a VPC in AWS Console?

The VPC needs to be created so that the Amazon EMR Cluster can run in that particular VPC.

Step 1: Open AWS Console using the created User “oindrila-emr-user”, and search for VPC. Click on the VPC service in the search results.

Step 2: Click on the “Create VPC” button.

Step 3: Select the radio button for the option “VPC and more”.

Provide a suitable name in the textbox displayed below the checkbox “Auto-generate”, like “oindrila_emr_project”. The names of the VPC, Subnets, etc. are derived from the given name, which can be seen in the “Preview”.

Keep the rest of the configurations at their respective default values and finally, click on the “Create VPC” button.

Step 4: It can be seen from the below image that the VPC is created successfully.

One thing to remember is that, once the work is done, it is better to delete the Amazon VPC.
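A VPC is defined by an address range (a CIDR block), and each Subnet is a smaller slice of that range. The sketch below uses Python's standard ipaddress module to show how such slices are carved out. The range 10.0.0.0/16 and the /20 Subnet size are assumptions chosen for illustration; the actual values used by the “VPC and more” option are shown in its “Preview”. The helper name is hypothetical.

```python
import ipaddress
from itertools import islice


def carve_subnets(vpc_cidr, subnet_prefix, count):
    """Return the first `count` Subnets of the given size inside the VPC range."""
    network = ipaddress.ip_network(vpc_cidr)
    subnets = [
        str(subnet)
        for subnet in islice(network.subnets(new_prefix=subnet_prefix), count)
    ]
    if len(subnets) < count:
        raise ValueError("the VPC range is too small for that many Subnets")
    return subnets
```

For example, carve_subnets("10.0.0.0/16", 20, 4) yields 10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20 and 10.0.48.0/20, four non-overlapping ranges that could each be placed in a different Availability Zone.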

How to Provision an Amazon EMR Cluster in AWS Console?

Currently, the requirement is to create an Amazon EMR Cluster where Jupyter Notebook can be run on Apache Spark.

Step 1: Open AWS Console using the created User “oindrila-emr-user”, and search for EMR. Click on the EMR service in the search results.

Step 2: Click on the “Create cluster” button.

Step 3: Provide a suitable name in the textbox for “Name” of the Cluster to be created, like “oindrila_emr_cluster”.

Keep the default selected EMR release version in the dropdown for “Amazon EMR release”, i.e., “emr-7.2.0”.

Since Amazon EMR comes with pre-installed Big Data Frameworks, in the “Application bundle” section, select the desired Big Data Framework. Currently, the requirement is to select “Apache Spark”.

When “Apache Spark” is selected, the following applications are selected as well -

Livy 0.8.0
Spark 3.5.1
Hadoop 3.3.6
JupyterEnterpriseGateway 2.6.0
Hive 3.1.3

So, these five applications will be installed across the Master Node, Core Nodes and Task Nodes.

By default, when Apache Spark is selected in an Amazon EMR Cluster, it is expected that a Jupyter Notebook would be run on the Cluster. Hence, the software “JupyterEnterpriseGateway 2.6.0” is also selected to be installed. This software helps attach the Jupyter Notebook to the Amazon EMR Cluster.

Step 3: In the “Cluster configuration” section, keep the default selected radio button for the option “Instance groups”.

Under “Instance groups”, keep the default selected “EC2 instance type” options for “Primary” and “Core” respectively.

Since the Task Node is optional in an Amazon EMR Cluster, click on the “Remove instance group” button to remove the Task Node from the Cluster.

Step 4: In the “Cluster scaling and provisioning — required” section, select the radio button for the option “Use EMR-managed scaling”. This will enable the Auto-Scaling feature in the Amazon EMR Cluster to be created.

Provide the value of the “Minimum cluster size” as 3. That means, whenever there is a limited dataset to process, the Amazon EMR Cluster is resized to have 1 Master Node and 2 Core Nodes.

Provide the value of the “Maximum cluster size” as 8. That means, whenever there is a larger dataset to process, the Amazon EMR Cluster is resized to have 1 Master Node and 7 Core Nodes.

There would be no Task Node because, in the “Cluster configuration” section, the provisioning of the Task Node has been removed.

Provide the value of the “Maximum core nodes in the cluster” as 7. Since the “Maximum cluster size” is provided as 8, 1 Node is used as the Master Node and the remaining 7 Nodes are used as Core Nodes when the dataset to process is large.

Provide the value of the “Maximum on-demand instances in the cluster” as the same value as the “Maximum cluster size”, i.e., 8.

Under “Provisioning configuration”, provide the value of the “Instance(s) size” as 2. Amazon EMR will provision this many Core Nodes when the Cluster is launched for the first time. Since the value of the “Minimum cluster size” is provided as 3, the number of Core Nodes should be 2, which is what is provided in this case.
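When a Cluster is created through the API rather than the Console, these scaling limits are expressed as a managed scaling policy. The sketch below builds such a policy as a plain dictionary, using the ComputeLimits field names from the Amazon EMR API. How the Console labels map onto those fields is my own reading, and the helper name is hypothetical.

```python
def build_managed_scaling_policy(minimum_size, maximum_size,
                                 maximum_core_nodes, maximum_on_demand):
    """Build a ManagedScalingPolicy dictionary for a Cluster using Instance groups."""
    if minimum_size > maximum_size:
        raise ValueError("minimum_size cannot be larger than maximum_size")
    return {
        "ComputeLimits": {
            # Instance groups are measured in whole Instances.
            "UnitType": "Instances",
            # Assumed mapping: "Minimum cluster size".
            "MinimumCapacityUnits": minimum_size,
            # Assumed mapping: "Maximum cluster size".
            "MaximumCapacityUnits": maximum_size,
            # Assumed mapping: "Maximum core nodes in the cluster".
            "MaximumCoreCapacityUnits": maximum_core_nodes,
            # Assumed mapping: "Maximum on-demand instances in the cluster".
            "MaximumOnDemandCapacityUnits": maximum_on_demand,
        }
    }
```

With the values chosen above, the call would be build_managed_scaling_policy(3, 8, 7, 8). The resulting dictionary is the shape accepted as the ManagedScalingPolicy argument of the boto3 EMR client's run_job_flow and put_managed_scaling_policy calls.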

Step 5: In the “Networking — required” section, click on the “Browse” button for selecting “Virtual private cloud” to select the previously created VPC, i.e., “oindrila_emr_project-vpc”.

In the opened dialog box, select the VPC “oindrila_emr_project-vpc” and click on the “Choose” button.

It can be seen from the below image that the VPC and Subnet are now populated accordingly with the details of the selected VPC, i.e., “oindrila_emr_project-vpc”.

Step 6: In the “Cluster termination and node replacement” section, keep the default selected radio button for the option “Automatically terminate cluster after idle time (recommended)” under “Termination option”.

Also, keep the default selected time of 1 hour as the “Idle time”. That means if the created Amazon EMR Cluster sits idle for 1 hour, the Cluster would be auto-terminated.

Select the checkbox for “Use termination protection”.

Also, keep the default selected radio button for the option “Turn on” under “Unhealthy node replacement — new”.

Step 7: All the logs that are generated by the Amazon EMR Cluster need to be saved somewhere in Amazon S3. A custom S3 Bucket can be created for this purpose.
If no custom S3 Bucket is created, Amazon EMR will save all the logs in a default S3 Bucket, which would be created by Amazon EMR.

Currently, the S3 Bucket “oindrila-emr-log-bucket” is created to save all the logs. So, in the “Cluster logs” section, click on the “Browse S3” button.

In the opened dialog box, select the S3 Bucket “oindrila-emr-log-bucket” and click on the “Choose” button.

It can be seen from the below image that the details of the selected S3 Bucket “oindrila-emr-log-bucket” are now displayed as “Amazon S3 Location”.

Step 8.A: In the “Identity and Access Management (IAM) roles — required” section, select the radio button for the option “Create a service role” under “Amazon EMR service role”. This will automatically select the “Virtual Private Cloud (VPC)” and “Subnet” with the details of the VPC, i.e., “oindrila_emr_project-vpc”.

In the dropdown for “Security group”, select the Security group details for the VPC, i.e., “oindrila_emr_project-vpc”.

It can be seen from the below image that the “Security group” is now populated with the details of the VPC, i.e., “oindrila_emr_project-vpc”.

Setting up the “Amazon EMR service role” is required because this Role gives Amazon EMR defined permissions to access other AWS services.

Step 8.B: In the “Identity and Access Management (IAM) roles — required” section, select the radio button for the option “Create an instance profile” under “EC2 instance profile for Amazon EMR”. Select the radio button for the option “All S3 buckets in this account with read and write access”.

Finally, click on the “Create cluster” button.

Setting up the “EC2 instance profile for Amazon EMR” is required because it gives each EC2 Instance of the Cluster a Role with defined permissions to access the S3 Buckets for data processing.
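The Console choices made in the steps above correspond roughly to the arguments of the boto3 EMR client's run_job_flow call. The sketch below only assembles those arguments as a dictionary. The helper name is hypothetical, the Instance type "m5.xlarge" is an assumption (the walkthrough keeps whatever default the Console offers), and the subnet ID and Role names in the usage note are placeholders rather than values from this walkthrough. Cluster scaling is left out to keep the sketch short.

```python
def build_cluster_request(name, log_uri, subnet_id, service_role,
                          instance_profile, instance_type="m5.xlarge",
                          core_count=2, idle_seconds=3600):
    """Assemble keyword arguments for the EMR run_job_flow call."""
    applications = ("Spark", "Hadoop", "Hive", "Livy", "JupyterEnterpriseGateway")
    return {
        "Name": name,
        "ReleaseLabel": "emr-7.2.0",
        "Applications": [{"Name": app} for app in applications],
        "LogUri": log_uri,                    # S3 location for the Cluster logs
        "ServiceRole": service_role,          # the Amazon EMR service role
        "JobFlowRole": instance_profile,      # the EC2 instance profile
        "Instances": {
            "Ec2SubnetId": subnet_id,         # a Subnet of the created VPC
            "KeepJobFlowAliveWhenNoSteps": True,
            "TerminationProtected": True,
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
        },
        # Terminate the Cluster after it has been idle for this many seconds.
        "AutoTerminationPolicy": {"IdleTimeout": idle_seconds},
    }


# Example usage (requires boto3; every ID and Role name here is a placeholder):
#   import boto3
#   request = build_cluster_request(
#       "oindrila_emr_cluster", "s3://oindrila-emr-log-bucket/",
#       "<subnet-id>", "<service-role-name>", "<instance-profile-name>")
#   cluster_id = boto3.client("emr").run_job_flow(**request)["JobFlowId"]
```

Keeping the request as a plain dictionary makes it easy to review or store in version control before any Cluster is actually launched.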

Step 9: Usually, it takes more than 7 minutes for the Amazon EMR Cluster to be created.

It can be seen from the below image that the Amazon EMR Cluster “oindrila_emr_cluster” is created, but, is still spinning up.

Finally, the Amazon EMR Cluster “oindrila_emr_cluster” is started with 1 Master Node and 2 Core Nodes.
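Since the Cluster takes several minutes to spin up, a script that launches it usually has to wait before submitting work. The sketch below polls the Cluster state through the boto3 EMR client's describe_cluster call until the Cluster is usable. The helper name, the polling interval and the grouping of states into "ready" and "failed" are my own choices for illustration.

```python
import time

READY_STATES = {"RUNNING", "WAITING"}
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_ready(emr_client, cluster_id, poll_seconds=30, max_polls=60,
                     sleep=time.sleep):
    """Poll the Cluster state until it is ready, has failed, or time runs out."""
    for _ in range(max_polls):
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state in READY_STATES:
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"Cluster {cluster_id} ended in state {state}")
        # Still STARTING or BOOTSTRAPPING: wait a little and check again.
        sleep(poll_seconds)
    raise TimeoutError(f"Cluster {cluster_id} was not ready after {max_polls} polls")


# Example usage (requires boto3 and the ID of a Cluster that was just launched):
#   import boto3
#   wait_until_ready(boto3.client("emr"), "<cluster-id>")
```

boto3 also ships a built-in waiter for this, obtained with get_waiter("cluster_running") on the EMR client, which can be used instead of a hand-written loop.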