Redshift and EMR
Understand the difference between a cloud data warehouse (Redshift) and a big data processing platform (EMR).
Some services in this lesson have no free tier and will incur charges.
AWS Services Used
Learning outcomes
By the end of this lesson, you will be able to:
- Explain what Amazon Redshift is.
- Explain what Amazon EMR is.
- Distinguish between a data warehouse and a big data processing platform.
- Choose between Redshift and EMR for a given scenario.
- Recognize that both services now have serverless options.
Analytics vs processing
Redshift is for SQL analytics on warehouse data. EMR is for running big data frameworks to process data.
A simple memory rule:
- Redshift = Analyze structured data with SQL (The Warehouse).
- EMR = Process massive datasets with frameworks like Spark (The Factory).
1) What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service. It is designed for large-scale data analysis using standard SQL and the Business Intelligence (BI) tools you already use, such as Tableau or Amazon QuickSight.
What makes it fast?
- Columnar Storage: Instead of reading whole rows, it only reads the columns your query needs.
- Massively Parallel Processing (MPP): It distributes your data and queries across multiple nodes to work on them simultaneously.
- Redshift Spectrum: Lets you query data in Amazon S3 directly, without loading it into the warehouse first.
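To make columnar storage and MPP concrete, here is a toy sketch in plain Python. It is not how Redshift is implemented: the sample rows, the function names, and the use of threads as stand-in "nodes" are all assumptions made for illustration. It counts how many values a row layout and a column layout would touch to sum one column, then sums that column in slices and combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales rows: (order_id, region, amount)
ROWS = [
    (1, "eu", 20.0),
    (2, "us", 35.5),
    (3, "eu", 12.5),
    (4, "ap", 50.0),
    (5, "us", 10.0),
    (6, "eu", 7.5),
]


def to_columns(rows):
    """Pivot row tuples into one list per column (a columnar layout)."""
    order_ids, regions, amounts = zip(*rows)
    return {
        "order_id": list(order_ids),
        "region": list(regions),
        "amount": list(amounts),
    }


def values_read_row_store(rows):
    """A row layout touches every field of every row, even to sum one column."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns, needed):
    """A column layout touches only the columns the query names."""
    return sum(len(columns[name]) for name in needed)


def split_across_nodes(values, node_count):
    """Deal values round-robin into node_count slices, one per pretend node."""
    return [values[i::node_count] for i in range(node_count)]


def parallel_sum(values, node_count=3):
    """Each pretend node sums its own slice; the partial sums are then added."""
    slices = split_across_nodes(values, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(sum, slices))
    return sum(partials)


columns = to_columns(ROWS)
row_reads = values_read_row_store(ROWS)
col_reads = values_read_column_store(columns, ["amount"])
total = parallel_sum(columns["amount"])

print(f"row layout reads {row_reads} values, column layout reads {col_reads}")
print(f"total amount: {total}")
```

With three columns, the column layout touches a third of the values for this query. The gap grows as tables get wider, which is why the layout suits analytics queries that scan a few columns across many rows.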
Key takeaway:
- Redshift is not meant to be the day-to-day transactional database behind a live application.
- It is for analytics, dashboards, and reports across huge amounts of history.
2) What is Amazon EMR?
Amazon EMR (Elastic MapReduce) is a managed platform that makes it easy to run big data frameworks like Apache Spark and Apache Hadoop on AWS.
What is it good at?
EMR is built for complex data processing tasks that go beyond querying a table:
- ETL (Extract, Transform, Load): Cleaning and reformatting raw data.
- Machine Learning: Training models on massive datasets.
- Genomics: Processing vast amounts of scientific data.
Key takeaway:
- EMR is for when you need to run custom code or specific big data engines (Spark, Hive, Presto) to crunch through data.
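The "MapReduce" in the EMR name refers to a processing pattern that frameworks like Hadoop run across many machines. The sketch below shows that pattern in plain Python on a handful of hypothetical log lines. It is an illustration of the idea only, not the Hadoop or Spark API, and every name in it is an assumption.

```python
from collections import defaultdict

# Hypothetical raw log lines in the form "<timestamp> <level> <message>".
LOG_LINES = [
    "2024-01-01T00:00:01 ERROR disk full",
    "2024-01-01T00:00:02 INFO started",
    "2024-01-01T00:00:03 ERROR timeout",
    "2024-01-01T00:00:04 WARN slow response",
    "2024-01-01T00:00:05 INFO stopped",
]


def map_phase(line):
    """Emit one (key, 1) pair per line, keyed by log level."""
    level = line.split(" ", 2)[1]
    return (level, 1)


def shuffle_phase(pairs):
    """Group all values that share a key, as a cluster would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


level_counts = reduce_phase(shuffle_phase(map(map_phase, LOG_LINES)))
print(level_counts)
```

On a real cluster, the map and reduce steps would run on different machines over far more data, and the framework would handle the shuffle between them.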
3) The Biggest Difference
The simplest distinction is the Interface:
- Redshift uses SQL. You talk to it like a database.
- EMR uses Frameworks. You typically write code (Python, Java, Scala) to process data, although engines such as Hive and Presto also accept SQL.
| Feature | Amazon Redshift | Amazon EMR |
|---|---|---|
| Role | Data Warehouse | Big Data Platform |
| Primary Interface | SQL | Frameworks (Spark, Hadoop, Flink, etc.) |
| Data Type | Structured / Semi-structured | Any (Raw files, logs, etc.) |
| Best For | Business Intelligence & Reporting | Data engineering & Complex processing |
4) Better Together: The Pipeline
In the real world, these two services often work together in a single data pipeline:
- Raw Data (logs, sensor data) lands in Amazon S3.
- Amazon EMR picks up that raw data, cleans it, and transforms it into a clean format.
- The clean data is loaded into Amazon Redshift.
- Analysts use Redshift to run SQL queries and build dashboards.
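The four steps above can be sketched end to end on a laptop. In this minimal sketch, a plain Python function stands in for the EMR job and an in-memory SQLite database stands in for Redshift; both substitutions, along with the sample sensor lines and all names, are assumptions made so the example runs without an AWS account.

```python
import sqlite3

# Step 1: raw data as it might land in S3 (hypothetical sensor lines).
RAW_LINES = [
    "sensor-a, 21.5 ",
    "sensor-b,not_a_number",
    "SENSOR-A,22.5",
    "",
    "sensor-b, 18.0",
]


def clean(lines):
    """Step 2: the kind of transform an EMR job might perform.

    Drops malformed rows, trims whitespace, and lowercases sensor ids.
    """
    for line in lines:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2:
            continue
        sensor_id, reading = parts
        try:
            yield sensor_id.lower(), float(reading)
        except ValueError:
            continue


def load(rows):
    """Step 3: load clean rows into a table (SQLite stands in for Redshift)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor_id TEXT, reading REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


def average_by_sensor(conn):
    """Step 4: analysts ask questions in SQL."""
    cursor = conn.execute(
        "SELECT sensor_id, AVG(reading) FROM readings "
        "GROUP BY sensor_id ORDER BY sensor_id"
    )
    return cursor.fetchall()


warehouse = load(clean(RAW_LINES))
averages = average_by_sensor(warehouse)
print(averages)
```

The split of responsibilities is the point: the messy, code-heavy cleanup happens before loading, so the warehouse only ever holds rows that are ready for SQL.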
5) Going Serverless
You don't always have to manage servers for these services anymore:
- Redshift Serverless: Automatically provisions and scales capacity. Compute is billed only while the warehouse is running workloads; storage is billed separately.
- EMR Serverless: A serverless runtime for Spark and Hive. You submit your job, and AWS provisions and scales the compute for it.
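As a rough sketch of what "just submit" can look like from code, the example below builds the requests for a Redshift Serverless query (through the Redshift Data API) and an EMR Serverless Spark job using the AWS SDK for Python (boto3). The workgroup name, database, table, role ARN, application ID, and S3 path are hypothetical placeholders, and the helper functions are assumptions for illustration. Nothing is sent to AWS unless you call `run_query` or `submit_job` yourself, which requires boto3, credentials, and existing resources, and may incur charges.

```python
def redshift_serverless_query_request(workgroup, database, sql):
    """Build keyword arguments for the Redshift Data API execute_statement call."""
    return {"WorkgroupName": workgroup, "Database": database, "Sql": sql}


def emr_serverless_job_request(application_id, role_arn, script_uri):
    """Build keyword arguments for the EMR Serverless start_job_run call."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {"sparkSubmit": {"entryPoint": script_uri}},
    }


def run_query(request):
    """Send the query to Redshift Serverless (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without the SDK

    return boto3.client("redshift-data").execute_statement(**request)


def submit_job(request):
    """Submit the Spark job to EMR Serverless (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without the SDK

    return boto3.client("emr-serverless").start_job_run(**request)


# All values below are hypothetical placeholders, not real resources.
query_request = redshift_serverless_query_request(
    workgroup="example-workgroup",
    database="dev",
    sql="SELECT region, SUM(amount) FROM sales GROUP BY region",
)
job_request = emr_serverless_job_request(
    application_id="example-application-id",
    role_arn="arn:aws:iam::123456789012:role/example-emr-serverless-role",
    script_uri="s3://example-bucket/jobs/clean_logs.py",
)

print(query_request)
print(job_request)
```

In both cases the request names what to run and where, and says nothing about node counts or instance types. That omission is the practical meaning of "serverless" here.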
Micro-activity 1: Pick the Better Fit
Warehouse vs. Processing
Which service should you choose for these tasks?
Micro-activity 2: Key Concepts
Analytics Terminology
Match the concept to its definition.
Summary
Amazon Redshift and Amazon EMR are the heavy hitters of AWS data services. Use Redshift when you want to analyze data using SQL and build reports. Use EMR when you need the power of big data frameworks to process, transform, and clean your data.
Knowledge Check
Next lesson
Lesson 4.17: Neptune and QLDB