Introduction to Distributed Computing

Oindrila Chakraborty
5 min read · Jul 10, 2022


What Is Distributed Computing?

In order to process Big Data, organizations use Distributed Computing: it divides the Big Data into more manageable chunks and distributes the work among computers that can process the data in parallel.

Apache Spark is the de facto standard among Distributed Computing tools used for Big Data Processing and Analytics. Most organizations are either already using Apache Spark to process their Big Data, or are in the process of migrating to it.
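The divide-and-distribute idea can be sketched in a few lines of plain Python. This is only a toy illustration, not Spark itself: the chunks stand in for what Spark calls partitions, and in a real cluster each chunk would be processed on a different machine rather than in a local loop.

```python
# Toy illustration of Distributed Computing's core pattern:
# split the data into chunks, process each chunk independently,
# then combine the partial results.

def split_into_chunks(data, num_chunks):
    """Divide the data into roughly equal chunks (Spark: partitions)."""
    size = (len(data) + num_chunks - 1) // num_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_chunk(chunk):
    """Work done by one worker - here, a partial sum."""
    return sum(chunk)

data = list(range(1, 1_000_001))                 # pretend this is "Big Data"
partials = [process_chunk(c) for c in split_into_chunks(data, 4)]
total = sum(partials)                            # combine the partial results
print(total)                                     # 500000500000
```

The key property is that each chunk can be processed without knowing anything about the others, which is what makes it possible to spread the work across many computers.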

Types of Big Data Processing

When we think in terms of Big Data Processing, there are two types of data that we process: Batch and Streaming. These terms refer to the way we receive our data and the speed at which we receive it.

A) Batch Data -

  • What Is It - Batch Data is data that we have in Storage and that we process all at once.
  • Example - A real-world example of Batch Processing is how telecommunication companies process cellular phone usage each month to generate our monthly phone bills. To do this, they process Batch Data - the phone calls you have made, the text messages you have sent, and any additional charges you have incurred during that billing cycle - to generate your bill. They process that Batch Data in a Batch Job.
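A minimal sketch of such a Batch Job in Python: all of the cycle's usage records already sit in storage, and one pass over the whole batch produces the bills. The record fields and rates here are made up for illustration, not real tariffs.

```python
# Hypothetical batch billing job: process a full billing cycle's
# records all at once (the defining trait of Batch Processing).
from collections import defaultdict

records = [
    {"customer": "A", "kind": "call", "minutes": 10},
    {"customer": "A", "kind": "text"},
    {"customer": "B", "kind": "call", "minutes": 3},
    {"customer": "B", "kind": "charge", "amount": 4.99},
]

RATE_PER_MINUTE = 0.05   # assumed rates for the sketch
RATE_PER_TEXT = 0.10

def bill(records):
    totals = defaultdict(float)
    for r in records:                        # the whole batch, in one pass
        if r["kind"] == "call":
            totals[r["customer"]] += r["minutes"] * RATE_PER_MINUTE
        elif r["kind"] == "text":
            totals[r["customer"]] += RATE_PER_TEXT
        else:                                # one-off additional charges
            totals[r["customer"]] += r["amount"]
    return {c: round(t, 2) for c, t in totals.items()}

print(bill(records))                         # {'A': 0.6, 'B': 5.14}
```

Nothing is billed until the job runs; the data simply accumulates in storage between runs, which is exactly what distinguishes this from the streaming case below.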

B) Streaming Data -

  • What Is It - Streaming Data is data that is continually produced by one or more sources, and must therefore be processed incrementally as it arrives.
  • Example - A real-world example of Stream Processing is how Heart Monitors work. All day long, as you wear your Heart Monitor, it receives new data - tens of thousands of Data Points per day as your heart beats. Every time your heart beats, new data is added to your Heart Monitor in real time. If your Heart Monitor displays your average heart rate for the day, that average must be constantly updated with the new numbers from the incoming stream of data.
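The heart-monitor average can be sketched as an incrementally maintained statistic: each new data point updates the running average in place, with no need to re-read the whole day's data. The class and the sample readings are illustrative, not a real monitor's API.

```python
# Sketch of Stream Processing: update the average incrementally
# as each new data point arrives, instead of recomputing from scratch.

class RunningAverage:
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: new_mean = old_mean + (value - old_mean) / n
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

monitor = RunningAverage()
for bpm in [72, 75, 71, 80, 77]:         # incoming stream of heart rates
    monitor.update(bpm)

print(round(monitor.mean, 1))            # 75.0
```

This per-arrival update is the essence of incremental processing: the state (count and mean) is tiny and constant-sized no matter how long the stream runs.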

Both Batch and Streaming Data have their place in Big Data Analytics.

  • Batch Data is used for things like periodic reporting
  • Streaming Data is used for things like fraud detection, where fraud needs to be identified in real time

Historically, it has been difficult to use these two types of data in conjunction. Thanks to advances in technology, however, combining Batch and Streaming Data is now possible, and it leads to significant advantages in Big Data Analytics.

Types of Data Storage Systems

Once we have collected and processed Batch and Streaming Data, we need somewhere to put it. As you can imagine, storing Big Data requires a lot of space. It is no longer the case that an organization can store all of its data on a single computer or Server.

Today, most organizations store their Big Data in one, or a combination, of the following Storage Systems -

A) Data Warehouse - Data Warehouse technology emerged in the 1980s and provides a centralized repository for storing all of an organization's data. Data Warehouses can be on-premises or in the cloud.

Benefits -

  • Data Warehouses have been around for decades, work well for Structured Data, and are reliable.
  • Since Data Warehouses generally accept only Structured Data, the data is typically clean and easy to query.

Challenges -

  • Data Warehouses can be hard and expensive to scale - if you need more space, for example.
  • You lose a lot of potential value by not taking advantage of Unstructured Data.
  • You often have to deal with Vendor Lock-In. This occurs when your data is stored in a system that does not belong to you.
  • Data Warehouses are very expensive to build, license, and maintain, especially for large Data Volumes, even with the availability of cloud storage.

B) Data Lakes - Data Lakes store data in its raw format. They can store Unstructured as well as Structured Data, and are known to be more horizontally scalable. In other words, it is easy to keep adding more data to a Data Lake.

Benefits -

  • The ability to hold different types of data.
  • Data Lakes are easier to scale, since they are usually cloud-based.
  • Data Lakes rely on cloud storage, which is cheap. This allows organizations to capture and keep all of their data, even if they are not quite sure what to do with it at the time it is collected.
  • Data Lakes help organizations save money by allowing them to separate Data Storage Costs from Data Compute Costs. In other words, you pay to store your data in a Data Lake, but only pay Computation Costs once you need to do something with it.

Challenges -

  • Individuals unfamiliar with raw data can experience a learning curve, or difficulty navigating a Data Lake.
  • Due to larger Volumes of data and the occasional lack of structure, query speeds can suffer in traditional Data Lakes.

C) Lakehouse Platform - The Lakehouse Platform is quickly gaining popularity today as a platform to store and manage all of an organization's data. It provides all of the benefits of Data Lakes, with the addition of some Data Warehousing capabilities, all wrapped up in a platform where Data Teams can work together.

Benefits -

  • A Lakehouse Platform gives you a single source of truth for your data, and can guarantee the validity of that data.
  • You don’t have to worry about maintaining your own physical infrastructure. A third-party service does that for you.
  • Lakehouse Platforms are easily scalable, both in terms of storage and compute.
  • Lakehouse Platforms offer organizations a collaborative space where members of Data Teams with various levels of programming ability can work together.

Today, many organizations are moving towards the Lakehouse Platform because, in addition to having a place to store data, these solutions come with a place to work with data as well.

Written by Oindrila Chakraborty

I have 12+ years of experience in the IT industry. I love learning about data and working with data, and I am happy to share my knowledge with all. Hope this will be of help.