Extracting Insights from Big Data
Techniques to Work with Big Data
After organizations collect and process data, they typically use a variety of techniques to extract insights from it that can help drive business decisions.
The most common techniques used to work with Big Data are the following -
A) Artificial Intelligence -
- What Is It - Artificial Intelligence (AI) is a branch of Computer Science in which computer systems are developed to perform tasks that would typically require human intelligence. AI is a broad field that encapsulates many techniques under its umbrella.
- Example - To contextualize AI, let’s look at a classic example - a Turing Test. In a Turing Test -
1. A human evaluator asks a series of text-based questions to a machine and a human, without being able to see either.
2. The human and the machine answer the questions.
3. If the evaluator cannot differentiate between the human's and the machine's responses, the machine passes the Turing Test. This means that it exhibited human-like behavior, or artificial intelligence.
B) Machine Learning -
- What Is It - Machine Learning (ML) is a subset of Artificial Intelligence that works very well with Structured Data. The goal behind Machine Learning is for machines to learn patterns in your data without you explicitly programming the machines to do so. There are a few types of Machine Learning. The most commonly used type is called Supervised Machine Learning.
- Example - Supervised Machine Learning is commonly used to detect fraud. At a high level, it works like the following -
1. A human being specifies rules for what constitutes fraud. For example - a bank account with more than 20 transactions a month, or an average balance of less than $100.
2. These rules, along with data labeled as either “fraud” or “not fraud”, are fed into a machine through a learning algorithm, and the machine learns what fraudulent data looks like.
3. The machine uses the patterns it has learned to predict fraud on new data.
4. When the model predicts “fraud”, a human manually investigates and verifies the prediction (a minimal sketch of this workflow follows).
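To make this concrete, below is a minimal sketch of that workflow using scikit-learn. The features (monthly transaction count, average balance), the tiny hand-labeled dataset, and the choice of logistic regression are illustrative assumptions, not a real fraud system.

```python
# A minimal sketch of supervised fraud detection with scikit-learn.
# The features and hand-labeled examples below are illustrative assumptions.
from sklearn.linear_model import LogisticRegression

# Each row: [transactions_per_month, average_balance]; label 1 = "fraud", 0 = "not fraud".
X_train = [
    [25, 80],    # many transactions, low balance -> labeled "fraud"
    [30, 50],    # labeled "fraud"
    [5, 1200],   # labeled "not fraud"
    [8, 900],    # labeled "not fraud"
]
y_train = [1, 1, 0, 0]

# Step 2: the algorithm learns what fraudulent data looks like.
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 3: the machine predicts fraud on a new, unseen account.
new_account = [[28, 60]]
print(model.predict(new_account))  # e.g. [1] -> flagged for manual review (Step 4)
```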
C) Deep Learning -
- What Is It - Deep Learning (DL) is a subset of Machine Learning (ML) that uses sets of algorithms modeled on the structure of the human brain (also called Neural Networks). These are much more complex than most Machine Learning models and require significantly more time and effort to build. Unlike Machine Learning, whose performance plateaus after a certain amount of data, Deep Learning continues to improve as data size increases. It performs well on complex datasets like Images, Sequences, and Natural Language.
- Example - Deep Learning is often used to classify images. For example, say that you want to build a Deep Learning model to classify whether an image contains a Koala. You would feed hundreds, thousands, or millions of pictures into a machine - some showing Koalas, and others not. Over time, the model learns what a Koala is and what it is not, and it can identify Koala images more easily and quickly than before.
It is important to note that while humans might recognize Koalas by their fluffy ears or large oval-shaped noses, a machine will detect features that we cannot - things like patterns in the Koala’s fur or the exact shape of its eyes. It can make decisions quickly based on that information.
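As an illustration, below is a minimal sketch of such an image classifier using Keras (TensorFlow). The folder layout ("koala_photos/" with one subfolder per class) and the small network are assumptions for demonstration, not a production model.

```python
# A minimal sketch of a binary image classifier (Koala / not Koala) with Keras.
# The directory "koala_photos/" with one subfolder per class is an assumption.
import tensorflow as tf
from tensorflow.keras import layers

# Keras infers the labels from the subfolder names ("koala", "not_koala").
train_ds = tf.keras.utils.image_dataset_from_directory(
    "koala_photos/", image_size=(128, 128), batch_size=32)

# A small convolutional Neural Network - layers loosely modeled on how the
# brain's visual system detects patterns at increasing levels of abstraction.
model = tf.keras.Sequential([
    layers.Rescaling(1.0 / 255),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),  # probability the image contains a Koala
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)  # performance keeps improving as more images are fed in
```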
D) Data Science -
- What Is It - Data Science is a field that combines tools and workflows from disciplines like Math, Statistics, Computer Science, and Business to process, manage, and analyze data. Data Science is very popular in businesses today as a way to extract insights from Big Data to help inform business decisions.
- Example - Machine Learning and Deep Learning are common tools, among many others, in a Data Scientist’s toolbox to help extract insights from data.
One of the benefits of using these techniques, particularly Machine Learning and Deep Learning, is that they help scale Analytics. Once machines learn how to detect patterns in our data, they can make predictions much faster than humans can.
All of these techniques are used by Data Science practitioners to help extract insights from Big Data. They use these techniques as part of the Data Science Workflow, a series of steps they follow to process, manage, and analyze the data.
Data Science Workflow
The Data Science Workflow is a series of steps that Data Practitioners follow to work with Big Data. It is a cyclical process that often starts with identifying business needs and ends with delivering business value through shared insights.
Step 1: Identifying Business Needs - In this step, business leaders or managers usually come up with a list of questions they want answered. Questions like -
- Should we make changes to our product?
- Which of our customers are at the greatest risk for churn and why?
- Can we save money by changing the way we are pricing our products?
In this phase of the Data Workflow, business leaders identify a set of questions or business goals for Data Practitioners to solve or work towards.
Step 2: Data Ingestion - In the Data Ingestion phase, an organization takes in data. This data could be real-time (Streaming) data that arrives continuously, such as customer transactions added to your data store every time a customer purchases something, or the continuous readings from a Heart Rate Monitor or Fitness Tracker. Other times, data is ingested in batches (Batch), such as loading customer records into your data store from Spreadsheets that exist somewhere else, like a Local Drive. In the Data Ingestion phase, data is in its raw and messy state.
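Below is a minimal Python sketch contrasting the two modes. The file name ("customers.csv") and the simulated event source are assumptions for illustration.

```python
# A minimal sketch contrasting Batch and Streaming ingestion.
import csv
import time

# Batch: load customer records that exist in a spreadsheet export, all at once.
with open("customers.csv", newline="") as f:   # assumed file, for illustration
    raw_records = list(csv.DictReader(f))      # arrives in one batch, still raw
print(f"Batch-loaded {len(raw_records)} raw customer records")

# Streaming: events arrive continuously and are appended as they occur.
def transaction_stream():
    """Stand-in for a real event source (purchases, heart-rate readings, ...)."""
    for amount in [19.99, 5.49, 120.00]:
        yield {"amount": amount, "ts": time.time()}

data_store = []
for event in transaction_stream():
    data_store.append(event)   # ingest each event the moment it arrives
```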
Step 3: Preparing Data - After your organization ingests raw data, it needs to be prepared for use through a data cleaning, or preparation, process referred to as Data Munging. During Data Munging, raw data is cleansed, aggregated, and/or augmented to serve the needs of the team members who use it for things like Machine Learning or Business Analytics. Munging data can mean anything that involves data clean-up - things like Extracting, Standardizing, Joining, Consolidating, or Filtering data.
The goal is to get data to a point where Data Practitioners who use it to train Machine Learning models or generate Business Insights don’t have to fix their input data.
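Below is a minimal Data Munging sketch with pandas, covering standardizing, consolidating, filtering, and joining. The column names and values are assumptions for illustration.

```python
# A minimal sketch of Data Munging with pandas. All sample values are made up.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "country": ["us", "US ", "US ", None],   # inconsistent casing, whitespace, missing value
    "balance": ["100", "250", "250", "75"],  # numbers stored as text
})

clean = (
    raw.drop_duplicates(subset="customer_id")        # consolidate duplicate records
       .assign(
           country=lambda d: d["country"].str.strip().str.upper(),  # standardize text
           balance=lambda d: d["balance"].astype(float),            # fix data types
       )
       .dropna(subset=["country"])                   # filter out unusable rows
)

# Augment with data from another source (a join).
regions = pd.DataFrame({"country": ["US"], "region": ["North America"]})
clean = clean.merge(regions, on="country", how="left")
print(clean)
```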
Step 4: Analyzing Data - During Data Analysis, Data Teams explore munged data to find insights. Often, this is where analysis techniques like Machine Learning and Deep Learning come into play.
During Data Analysis, Data Practitioners also query data and use other traditional Data Science methods to produce insights.
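As a simple illustration, below is the kind of exploratory query a Data Practitioner might run on munged data with pandas; the dataset and column names are assumptions.

```python
# A minimal sketch of exploratory Data Analysis with pandas, answering a
# question like "which customer segment churns the most?". Sample data is made up.
import pandas as pd

customers = pd.DataFrame({
    "segment": ["basic", "basic", "premium", "premium", "basic"],
    "monthly_spend": [10, 12, 55, 60, 9],
    "churned": [1, 0, 0, 0, 1],
})

# Aggregate metrics per group - a typical query behind a Business Insight.
summary = customers.groupby("segment").agg(
    churn_rate=("churned", "mean"),
    avg_spend=("monthly_spend", "mean"),
)
print(summary)  # e.g. basic customers churn more and spend less
```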
Step 5: Sharing Data Insights - Finally, once Data Practitioners generate results from their Data Science projects, these results are typically shared with business leaders or stakeholders, who use them to make business decisions. Data insights are often shared through Interactive Dashboards, Emails, Presentations, and more.
Roles on a Data Science Team
Data Science teams usually include several individuals with different skill sets and the tools they need to work with Big Data. While no two Data Teams look the same, the overall mission of a Data Team is to follow the steps in the Data Science Workflow to help organizations make more informed business decisions.
The following are the different types of Data Practitioners that typically make up Data Science Teams -
A) Platform Administrators -
What Do They Do - Platform Administrators are usually responsible for managing and supporting an organization’s Big Data Infrastructure. These tasks include -
- Setting up and configuring an organization’s Big Data Infrastructure.
- Performing updates and maintenance work.
- Performing health checks.
- Keeping track of how team members are using the Big Data Platforms, for example, by setting up and monitoring Alerts.
- Implementing best practices for managing data.
Additionally, Platform Administrators provide governance to Development Teams around changes, configuration, and upgrades to a Big Data system, and they often evaluate new tools and technologies that can complement the Big Data Infrastructure.
What Do They Need - To perform their duties, Platform Administrators often use tools like the Infrastructure and Monitoring Services that major cloud providers offer to help them keep data secure and scale and manage their Big Data systems.
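As a rough illustration, below is the kind of automated health check a Platform Administrator might script. The threshold and the alert mechanism (printing) are assumptions; in practice this would be wired into a cloud provider's monitoring and alerting services.

```python
# A minimal sketch of a disk-usage health check with an alert threshold.
# The 90% threshold and print-based "alert" are illustrative assumptions.
import shutil

DISK_ALERT_THRESHOLD = 0.90  # alert when a volume is more than 90% full

def check_disk(path="/"):
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction > DISK_ALERT_THRESHOLD:
        # A real setup would page an on-call admin or raise a cloud alert here.
        print(f"ALERT: {path} is {used_fraction:.0%} full")
    else:
        print(f"OK: {path} is {used_fraction:.0%} full")

check_disk()
```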
B) Data Engineers -
What Do They Do - Data Engineers develop, construct, test, and maintain Data Pipelines, which are mechanisms that allow data to move between systems or people.
In the Data Science Workflow, once data is ingested, it needs to be prepared for use for Machine Learning and Business Analytics. This is where a Data Pipeline fits in - taking data from its raw data source and moving it along that Pipeline to where it can be used at different stages of a Data Science project.
What Do They Need - To perform their duties, Data Engineers use tools to build and maintain Data Pipelines, including -
- Programming languages, like Python and Scala.
- Different Data Storage solutions.
- Data processing engines, like Apache Spark (see the sketch below).
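Below is a minimal sketch of one pipeline stage using PySpark, Apache Spark's Python API. The file paths and column names are assumptions for illustration.

```python
# A minimal sketch of an extract-transform-load step in a Data Pipeline
# using PySpark. Paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transactions-pipeline").getOrCreate()

# Extract: read raw data from its source.
raw = spark.read.csv("raw/transactions.csv", header=True, inferSchema=True)

# Transform: clean and reshape the data for downstream use.
prepared = (
    raw.dropDuplicates(["transaction_id"])
       .filter(F.col("amount") > 0)
       .withColumn("amount_usd", F.round("amount", 2))
)

# Load: write the prepared data where Data Scientists and Analysts can use it.
prepared.write.mode("overwrite").parquet("prepared/transactions")
```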
C) Data Scientists -
What Do They Do - Data Scientists take data prepared by Data Engineers and use a variety of methods to extract insights. Data Scientists usually have a strong background in disciplines like Math, Statistics, and Computer Science. They are often tasked with building Machine Learning models, testing those models, and keeping track of their Machine Learning Experiments.
What Do They Need - To perform their duties, Data Scientists use tools like -
- Programming languages, like Python, R, and SQL.
- Machine Learning Libraries.
- Notebook Interfaces, like Databricks Notebooks or Jupyter.
- Systems that help them log and keep track of Machine Learning Experiments (see the sketch below).
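As an illustration, below is a minimal sketch of logging an experiment with MLflow, one widely used tracking system. The synthetic dataset and the model choice are assumptions for demonstration.

```python
# A minimal sketch of tracking a Machine Learning Experiment with MLflow.
# The synthetic dataset and logistic regression model are illustrative assumptions.
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="baseline-logreg"):
    model = LogisticRegression(C=1.0).fit(X_train, y_train)
    mlflow.log_param("C", 1.0)                                  # what was tried
    mlflow.log_metric("accuracy", model.score(X_test, y_test))  # how it performed
```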
D) Data Analysts -
What Do They Do - Data Analysts take data prepared by Data Engineers and extract insights from it. Typically, a Data Analyst presents data in the form of Graphs, Charts, and Dashboards to stakeholders to help them make business decisions.
Data Analysts can also take advantage of the work of Machine Learning Engineers to help derive insights from data.
Data Analysts are typically well-versed in Data Visualization tools and Business Intelligence concepts, and they are often in charge of interpreting Data Insights and effectively communicating their findings to stakeholders.
What Do They Need - To perform their duties, Data Analysts often use -
- The SQL Programming Language (see the sketch below).
- Visualization tools, like Tableau, Power BI, Looker, and others.
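Below is a minimal sketch of the kind of SQL query a Data Analyst might run before charting the results, using Python's built-in sqlite3 with made-up sample data.

```python
# A minimal sketch of an analyst-style SQL aggregation, run via Python's
# built-in sqlite3. The sales table and its values are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("West", 120.0), ("West", 80.0), ("East", 200.0)],
)

# Revenue by region - a typical input to a chart or dashboard.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
"""
for region, total in conn.execute(query):
    print(region, total)
```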
It is important to note that in many small organizations, Data Science Teams sometimes consist of one individual trying to do all of this work on their own, or with little help. This set-up is typically not scalable. Over time, organizations benefit from having multiple individuals on a Data Science Team who can work together and tackle these tasks.
Big Data Use Cases in Different Industries
Thousands of organizations around the world are applying Advanced Analytics to Big Data to enrich and accelerate business outcomes.
A) Oil, Gas and Energy -
Who - Upstream and downstream oil organizations, utility companies, and more.
Goals - Applying Advanced Analytics to large volumes of Sensor, Supply Chain, and Customer Data to improve Exploration, reduce Machinery Downtime, and optimize Sales and Supply Chain operations.
Example Use-Cases -
- Smart Grids - Analyzing sensor data from IoT devices to detect energy consumption patterns, predict future usage, and optimize production, storage, and distribution.
- Predictive Maintenance - Avoiding production failures by analyzing real-time machine data, maintenance schedules and other historical data to predict equipment maintenance.
- Improved Well Production - Analyzing geospatial data to determine optimal well placement, and using real-time insights to improve drilling and well efficiency.
B) Health and Life Sciences -
Who - Large integrated healthcare systems, major pharmaceutical companies, diagnostic labs, and more.
Goals - Applying Advanced Analytics to their large volumes of clinical and research data to accelerate R&D and improve patient outcomes.
Example Use-Cases -
- Precision Medicine - Analyzing clinical and genomic datasets to prescribe targeted treatments specific to an individual’s biology.
- Disease Prediction - Using real-world evidence and public datasets to identify biomarkers that have a high probability of driving the onset of disease.
- Claims Analysis - Applying Machine Learning to large volumes of Claims to determine preventative measures to improve patient health and identify fraud patterns.
C) Retail -
Who - Traditional brick-and-mortar companies and e-commerce companies.
Goals - Applying Advanced Analytics to large volumes of customer, product, and supply chain data to better attract customers, increase basket size, and reduce costs.
Example Use-Cases -
- Targeted Recommendation - Using Machine Learning to mine clickstream, purchase and customer data to provide personalized recommendations.
- Demand Forecasting - Predicting real-time demand and returns at a granular level using new and non-traditional data sources to optimize inventory.
- Optimized Pricing - Improving campaign conversion and return-on-ad-spend by using Big Data to serve the right ad, at the right time, to the right person.
D) Telecom -
Who - Global communication service providers, network and equipment providers, and more.
Goals - Applying Advanced Analytics to large volumes of customer and network data to improve network services and performance, while reducing customer churn.
Example Use-Cases -
- Network Performance - Understanding network checkpoints and automating load balancing in real time.
- Upselling Services - Maximizing customer revenue by using customer usage data to drive cross-selling and upselling of services and products.
- Fraud Prevention - Analyzing SIM card data and other data sources to minimize fraudulent transactions.
E) Financial Services -
Who - Retail and commercial banks, hedge funds, financial technology innovators, and more.
Goals - Applying Advanced Analytics to large volumes of customer and transaction data to reduce risk, boost returns, and improve customer satisfaction.
Example Use-Cases -
- Investment Decisions - Maximizing returns with AI-powered insights based on billions of market signals and alternative data sources.
- Personalized Banking - Delivering the right financial products and guidance to customers with real-time customer insights and predictive analytics.
- Fraud Prevention - Detecting and preventing fraudulent activities (e.g., money laundering, credit card fraud) by leveraging Machine Learning to predict anomalies in real time (a minimal sketch follows this list).
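As a rough illustration of this use case, below is a minimal anomaly-detection sketch using scikit-learn's IsolationForest. The transaction features and sample values are assumptions; production fraud systems are far more elaborate.

```python
# A minimal sketch of anomaly detection for fraud prevention with
# scikit-learn's IsolationForest. All transaction values are made up.
from sklearn.ensemble import IsolationForest

# Historical transactions: [amount, hour_of_day] - illustrative features only.
normal_history = [[25, 12], [40, 14], [18, 9], [60, 19], [35, 13], [22, 11]]

detector = IsolationForest(contamination=0.1, random_state=42)
detector.fit(normal_history)

# Score incoming transactions as they arrive; -1 marks an anomaly.
incoming = [[30, 12], [5000, 3]]   # the second looks like potential fraud
print(detector.predict(incoming))  # e.g. [ 1 -1]
```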
F) Media and Entertainment -
Who - Major publishers, streamers, gaming companies, and more.
Goals - Applying Advanced Analytics to large volumes of audience and content data to deepen audience engagement, reduce churn, and optimize advertising revenues.
Example Use-Cases -
- Content Personalization - Delivering 1:1 experiences to drive engagement and customer satisfaction.
- Sentiment Analytics - Understanding how content is resonating in social channels and using data to find the next most popular article, show, or game.
- Churn Management - Determining which customers are likely to churn and using personalization to prevent them from churning.