Introduction xxi
Assessment Test xxx
Chapter 1 History of Analytics and Big Data 1
Evolution of Analytics Architecture Over the Years 3
The New World Order 5
Analytics Pipeline 6
Data Sources 7
Collection 8
Storage 8
Processing and Analysis 9
Visualization, Predictive and Prescriptive Analytics 9
The Big Data Reference Architecture 10
Data Characteristics: Hot, Warm, and Cold 11
Collection/Ingest 12
Storage 13
Process/Analyze 14
Consumption 15
Data Lakes and Their Relevance in Analytics 16
What is a Data Lake? 16
Building a Data Lake on AWS 19
Step 1: Choosing the Right Storage - Amazon S3 Is the Base 19
Step 2: Data Ingestion - Moving the Data into the Data Lake 21
Step 3: Cleanse, Prep, and Catalog the Data 22
Step 4: Secure the Data and Metadata 23
Step 5: Make Data Available for Analytics 23
Using Lake Formation to Build a Data Lake on AWS 23
Exam Objectives 24
Objective Map 25
Assessment Test 27
References 29
Chapter 2 Data Collection 31
Exam Objectives 32
AWS IoT 33
Common Use Cases for AWS IoT 35
How AWS IoT Works 36
Amazon Kinesis 38
Amazon Kinesis Introduction 40
Amazon Kinesis Data Streams 40
Amazon Kinesis Data Analytics 54
Amazon Kinesis Video Streams 61
AWS Glue 64
Glue Data Catalog 66
Glue Crawlers 68
Authoring ETL Jobs 69
Executing ETL Jobs 71
Change Data Capture with Glue Bookmarks 71
Use Cases for AWS Glue 72
Amazon SQS 72
AWS Database Migration Service 74
What is AWS DMS Anyway? 74
What Does AWS DMS Support? 75
AWS Data Pipeline 77
Pipeline Definition 77
Pipeline Schedules 78
Task Runner 79
Large-Scale Data Transfer Solutions 81
AWS Snowcone 81
AWS Snowball 82
AWS Snowmobile 85
AWS Direct Connect 86
Summary 87
Review Questions 88
References 90
Exercises & Workshops 91
Chapter 3 Data Storage 93
Introduction 94
Amazon S3 95
Amazon S3 Data Consistency Model 96
Data Lake and S3 97
Data Replication in Amazon S3 100
Server Access Logging in Amazon S3 101
Partitioning, Compression, and File Formats on S3 101
Amazon S3 Glacier 103
Vault 103
Archive 104
Amazon DynamoDB 104
Amazon DynamoDB Data Types 105
Amazon DynamoDB Core Concepts 108
Read/Write Capacity Mode in DynamoDB 108
DynamoDB Auto Scaling and Reserved Capacity 111
Read Consistency and Global Tables 111
Amazon DynamoDB: Indexing and Partitioning 113
Amazon DynamoDB Accelerator 114
Amazon DynamoDB Streams 115
Amazon DynamoDB Streams - Kinesis Adapter 116
Amazon DocumentDB 117
Why a Document Database? 117
Amazon DocumentDB Overview 119
Amazon DocumentDB Architecture 120
Amazon DocumentDB Interfaces 120
Graph Databases and Amazon Neptune 121
Amazon Neptune Overview 122
Amazon Neptune Use Cases 123
Storage Gateway 123
Hybrid Storage Requirements 123
AWS Storage Gateway 125
Amazon EFS 127
Amazon EFS Use Cases 130
Interacting with Amazon EFS 132
Amazon EFS Security Model 132
Backing Up Amazon EFS 132
Amazon FSx for Lustre 133
Key Benefits of Amazon FSx for Lustre 134
Use Cases for Lustre 135
AWS Transfer for SFTP 135
Summary 136
Exercises 137
Review Questions 140
Further Reading 142
References 142
Chapter 4 Data Processing and Analysis 143
Introduction 144
Types of Analytical Workloads 144
Amazon Athena 146
Apache Presto 147
Apache Hive 148
Amazon Athena Use Cases and Workloads 149
Amazon Athena DDL, DML, and DCL 150
Amazon Athena Workgroups 151
Amazon Athena Federated Query 153
Amazon Athena Custom UDFs 154
Using Machine Learning with Amazon Athena 154
Amazon EMR 155
Apache Hadoop Overview 156
Amazon EMR Overview 157
Apache Hadoop on Amazon EMR 158
EMRFS 166
Bootstrap Actions and Custom AMI 167
Security on EMR 167
EMR Notebooks 168
Apache Hive and Apache Pig on Amazon EMR 169
Apache Spark on Amazon EMR 174
Apache HBase on Amazon EMR 182
Apache Flink, Apache Mahout, and Apache MXNet 184
Choosing the Right Analytics Tool 186
Amazon Elasticsearch Service 188
When to Use Elasticsearch 188
Elasticsearch Core Concepts (the ELK Stack) 189
Amazon Elasticsearch Service 191
Amazon Redshift 192
What is Data Warehousing? 192
What is Redshift? 193
Redshift Architecture 195
Redshift AQUA 198
Redshift Scalability 199
Data Modeling in Redshift 205
Data Loading and Unloading 213
Query Optimization in Redshift 217
Security in Redshift 221
Kinesis Data Analytics 225
How Does It Work? 226
What is Kinesis Data Analytics for Java? 228
Comparing Batch Processing Services 229
Comparing Orchestration Options on AWS 230
AWS Step Functions 230
Comparing Different ETL Orchestration Options 230
Summary 231
Exam Essentials 232
Exercises 232
Review Questions 235
References 237
Recommended Workshops 237
Amazon Athena Blogs 238
Amazon Redshift Blogs 240
Amazon EMR Blogs 241
Amazon Elasticsearch Blog 241
Amazon Redshift References and Further Reading 242
Chapter 5 Data Visualization 243
Introduction 244
Data Consumers 245
Data Visualization Options 246
Amazon QuickSight 247
Getting Started 248
Working with Data 250
Data Preparation 255
Data Analysis 256
Data Visualization 258
Machine Learning Insights 261
Building Dashboards 262
Embedding QuickSight Objects into Other Applications 264
Administration 265
Security 266
Other Visualization Options 267
Predictive Analytics 270
What is Predictive Analytics? 270
The AWS ML Stack 271
Summary 273
Exam Essentials 273
Exercises 274
Review Questions 275
References 276
Additional Reading Material 276
Chapter 6 Data Security 279
Introduction 280
Shared Responsibility Model 280
Security Services on AWS 282
AWS IAM Overview 285
IAM User 285
IAM Groups 286
IAM Roles 287
Amazon EMR Security 289
Public Subnet 290
Private Subnet 291
Security Configurations 293
Block Public Access 298
VPC Subnets 298
Security Options during Cluster Creation 299
EMR Security Summary 300
Amazon S3 Security 301
Managing Access to Data in Amazon S3 301
Data Protection in Amazon S3 305
Logging and Monitoring with Amazon S3 306
Best Practices for Security on Amazon S3 308
Amazon Athena Security 308
Managing Access to Amazon Athena 309
Data Protection in Amazon Athena 310
Data Encryption in Amazon Athena 311
Amazon Athena and AWS Lake Formation 312
Amazon Redshift Security 312
Levels of Security within Amazon Redshift 313
Data Protection in Amazon Redshift 315
Redshift Auditing 316
Redshift Logging 317
Amazon Elasticsearch Security 317
Elasticsearch Network Configuration 318
VPC Access 318
Accessing Amazon Elasticsearch and Kibana 319
Data Protection in Amazon Elasticsearch 322
Amazon Kinesis Security 325
Managing Access to Amazon Kinesis 325
Data Protection in Amazon Kinesis 326
Amazon Kinesis Best Practices 326
Amazon QuickSight Security 327
Managing Data Access with Amazon QuickSight 327
Data Protection 328
Logging and Monitoring 329
Security Best Practices 329
Amazon DynamoDB Security 329
Access Management in DynamoDB 329
IAM Policy with Fine-Grained Access Control 330
Identity Federation 331
How to Access Amazon DynamoDB 332
Data Protection with DynamoDB 332
Monitoring and Logging with DynamoDB 333
Summary 334
Exam Essentials 334
Exercises/Workshops 334
Review Questions 336
References and Further Reading 337
Appendix Answers to Review Questions 339
Chapter 1: History of Analytics and Big Data 340
Chapter 2: Data Collection 342
Chapter 3: Data Storage 343
Chapter 4: Data Processing and Analysis 344
Chapter 5: Data Visualization 346
Chapter 6: Data Security 346
Index 349