Amazon Web Services (AWS) is the cloud services subsidiary of Amazon. It provides many tools and services to develop AI and machine learning models on its platform, from data ingestion, exploration, and transformation to model training, tuning, optimization, and deployment.
Data ingestion
Amazon Athena
Amazon Athena is a serverless, fast, efficient, highly available, durable, and secure query engine for big data. It is based on Presto, an open-source query engine originally created by Facebook to query its own databases with low latency. It queries data stored in Amazon S3 in formats such as CSV, JSON, ORC, Avro, or Parquet using standard SQL. It can also run join queries against JDBC-compliant databases such as MySQL and other data stores such as Amazon Redshift.
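As a sketch of how Athena is typically queried from Python, the PyAthena client (mentioned again in the data preparation section) can send standard SQL to Athena; the S3 staging location, database, and table names below are hypothetical.

    from pyathena import connect

    # Hypothetical S3 staging location for query results and a hypothetical sales_db.orders table.
    conn = connect(s3_staging_dir="s3://my-athena-results/", region_name="us-east-1")
    cursor = conn.cursor()
    cursor.execute("SELECT product_id, COUNT(*) AS orders FROM sales_db.orders GROUP BY product_id LIMIT 10")
    for row in cursor.fetchall():
        print(row)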
Amazon Redshift
Amazon Redshift is a cloud-based data warehouse and relational database management system. It replaces on-premises data warehouses and database systems. It is based on the open-source PostgreSQL project but works very differently, as it is optimized for analytics on very large datasets.
It works with clusters of nodes and slices of nodes to process the SQL queries and retrieve the structured data stored in the nodes. A cluster has a leader node that distributes tasks to the compute nodes and makes them work in parallel. Amazon Redshift is highly scalable, with nodes added when required, and can run very fast queries on petabytes of data. It can be linked to ETL processes and feed analytical workloads (dashboards, visualization, and business intelligence tools) at the enterprise level.
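Because Redshift speaks the PostgreSQL wire protocol, it can be queried with standard PostgreSQL drivers. Below is a minimal sketch using psycopg2; the cluster endpoint, database, credentials, and table are hypothetical.

    import psycopg2

    # Hypothetical Redshift cluster endpoint and credentials.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="awsuser",
        password="********",
    )
    with conn.cursor() as cur:
        # Aggregate a hypothetical orders table; the leader node distributes the work across compute nodes.
        cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region;")
        print(cur.fetchall())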
Amazon Kinesis
Amazon Kinesis is a data streaming platform to ingest, process, analyze, and store real-time, high-throughput streaming data. It is comparable to Apache Kafka, the open-source streaming project initially developed by LinkedIn for its own needs. Streaming data can be video data, transaction data, time-series data, or any data produced continuously. Unlike batch analytics, streaming analytics allows an almost immediate reaction to new events and constantly refreshed outputs for end users and customers. It is ideal, for instance, for price data, fraud detection, and system monitoring.
Amazon Kinesis offers four capabilities: Kinesis Video Streams for video data captured by cameras, Kinesis Data Streams to capture, process, and store streaming data from multiple sources, Kinesis Data Firehose for continuous ETL jobs and data transfer to AWS databases, and Kinesis Data Analytics for transforming and analyzing streaming data.
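As a minimal sketch of ingestion with Kinesis Data Streams, a producer can push JSON records into a stream with the boto3 client; the stream name and record fields are hypothetical.

    import json
    import boto3

    # Hypothetical stream name; assumes AWS credentials are configured in the environment.
    kinesis = boto3.client("kinesis", region_name="us-east-1")
    record = {"sensor_id": "s-42", "temperature": 21.7}
    kinesis.put_record(
        StreamName="sensor-events",
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["sensor_id"],   # records with the same key go to the same shard
    )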
Data exploration
Amazon SageMaker Notebooks
Amazon SageMaker Notebooks are Jupyter-style notebooks that can import, process, and analyze data from AWS data stores. Usually, only small data samples are analyzed in a SageMaker Python notebook. If necessary, Spark jobs can be run from SparkMagic notebooks on an EMR Spark cluster to process the data, or Redshift and Athena can be used directly to explore it.
Amazon Athena
Since Amazon Athena is a database query engine, it can be used for data exploration like in a normal relational database.
Amazon QuickSight
Amazon QuickSight is a business intelligence tool to create interactive dashboards that can be embedded into websites, analytics reports, and emails to share ML insights with the entire organization. It connects seamlessly with all AWS storage and database solutions. It is serverless and therefore scalable: as the number of users grows, it can grow along with them. It allows quick iteration when developing new ML models, as results can quickly be shared with all stakeholders.
AWS Glue
AWS Glue is a serverless extract, transform, and load (ETL) tool to prepare data and identify useful metadata and data transformations from an AWS data lake or data source (Amazon S3, Redshift, ...). The metadata and table definitions are stored in an AWS Glue Data Catalog. It can load the final data into a data store such as Amazon Redshift. It is built on Apache Spark and automatically generates modifiable ETL code in Scala or Python.
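The sketch below shows roughly what such a Glue ETL script looks like in Python (PySpark): it reads a table from the Glue Data Catalog, applies a mapping, and writes Parquet to S3. The database, table, columns, and S3 path are hypothetical.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table registered in the Glue Data Catalog.
    orders = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")

    # Keep and cast a subset of columns, then write the result to S3 as Parquet.
    mapped = ApplyMapping.apply(
        frame=orders,
        mappings=[("order_id", "string", "order_id", "string"),
                  ("amount", "double", "amount", "double")],
    )
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/clean/orders/"},
        format="parquet",
    )
    job.commit()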
Data preparation
Amazon SageMaker Processing Jobs
Amazon SageMaker provides notebooks that a user can use to write Python scripts and access the standard data science and machine learning libraries (Pandas, Matplotlib, Seaborn, scikit-learn, TensorFlow, ...). Athena and Redshift can also be accessed from these notebooks thanks to the Athena client library (PyAthena) and SQL libraries (SQLAlchemy). Complex queries can be sent directly from the notebooks.
Amazon SageMaker Processing is used when the full production dataset needs to be processed and transformed into useful features at scale. The type and the number of instances need to be defined to perform the processing step.
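A minimal sketch of a Processing job with the SageMaker Python SDK follows; the IAM role, script name, instance settings, and S3 paths are hypothetical.

    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.processing import ProcessingInput, ProcessingOutput

    # The instance type and count define the compute resources for the processing step.
    processor = SKLearnProcessor(
        framework_version="0.23-1",
        role="arn:aws:iam::123456789012:role/SageMakerRole",   # hypothetical role
        instance_type="ml.m5.xlarge",
        instance_count=2,
    )
    processor.run(
        code="preprocess.py",   # hypothetical feature-engineering script
        inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                                destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                                  destination="s3://my-bucket/features/")],
    )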
Amazon Elastic MapReduce (EMR)
Amazon EMR is a scalable data processing platform built on frameworks such as Apache Hadoop and Apache Spark. Apache Spark is a very popular distributed processing and analytics engine for big data. Workloads are automatically deployed to clusters and nodes. A SageMaker Notebook can run Spark commands and process data on a Spark cluster. The data can be analyzed and tested with the Amazon Deequ API. Data can be tested for missing or null values, range, correct formatting, completeness, uniqueness, consistency, size, correlation, etc.
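As a minimal sketch of such data quality checks, the PyDeequ wrapper around Deequ can verify constraints on a Spark DataFrame; the S3 path and column names are hypothetical, and the sketch assumes the Deequ jars are available to the Spark session.

    import pydeequ
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite, VerificationResult
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.jars.packages", pydeequ.deequ_maven_coord)
             .getOrCreate())
    df = spark.read.parquet("s3://my-bucket/features/orders/")   # hypothetical dataset

    check = Check(spark, CheckLevel.Error, "Data quality checks")
    result = (VerificationSuite(spark)
              .onData(df)
              .addCheck(check.isComplete("order_id")     # no missing values
                             .isUnique("order_id")       # uniqueness
                             .isNonNegative("amount"))   # value range
              .run())
    VerificationResult.checkResultsAsDataFrame(spark, result).show()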
Model training
Amazon SageMaker Notebooks
Amazon SageMaker Notebooks can use standard machine learning libraries such as scikit-learn, TensorFlow, MXNet, or PyTorch to transform the data, do feature engineering, split the data, and train the models on samples. The libraries are accessed by loading containers with pre-defined environments, through training scripts, or through customized containers.
Some objective metrics such as accuracy have to be defined to evaluate the model performance. Model hyperparameters and parameters can be saved to be examined for model review and evaluation.
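A minimal sketch of a training job in script mode with the SageMaker Python SDK is shown below; the training script, role, S3 paths, hyperparameters, and metric regex are hypothetical.

    from sagemaker.sklearn.estimator import SKLearn

    estimator = SKLearn(
        entry_point="train.py",                                  # hypothetical training script
        framework_version="0.23-1",
        instance_type="ml.m5.xlarge",
        role="arn:aws:iam::123456789012:role/SageMakerRole",     # hypothetical role
        hyperparameters={"max_depth": 5, "n_estimators": 100},
        # Objective metric scraped from the training logs to evaluate model performance.
        metric_definitions=[{"Name": "validation:accuracy",
                             "Regex": "validation accuracy: ([0-9\\.]+)"}],
    )
    estimator.fit({"train": "s3://my-bucket/features/train/"})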
Amazon SageMaker Training Jobs Debugger
Amazon SageMaker Training Jobs Debugger uses rules to check for issues such as overfitting, data imbalance, or vanishing gradients. If a rule is triggered, the training can be stopped to allow debugging of the model and inspection of intermediate steps and objects.
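Built-in Debugger rules can be attached to an estimator such as the one sketched above; the selection below is only an illustration.

    from sagemaker.debugger import Rule, rule_configs

    # Built-in rules that watch the tensors emitted during training.
    rules = [
        Rule.sagemaker(rule_configs.overfit()),
        Rule.sagemaker(rule_configs.class_imbalance()),
        Rule.sagemaker(rule_configs.vanishing_gradient()),
    ]
    # The list is passed to a framework estimator through its rules=... argument.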
Model tuning and optimization
Amazon SageMaker Hyper-Parameter Optimizer
Amazon SageMaker Hyper-Parameter Optimizer can find the best hyperparameters within some ranges to optimize some objective metrics using different methods such as grid search, random search, or Bayesian optimization.
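A minimal sketch of automatic model tuning with the SageMaker Python SDK, reusing the hypothetical estimator and metric from the training section:

    from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:accuracy",
        objective_type="Maximize",
        hyperparameter_ranges={
            "max_depth": IntegerParameter(3, 10),
            "learning_rate": ContinuousParameter(0.01, 0.3),
        },
        metric_definitions=[{"Name": "validation:accuracy",
                             "Regex": "validation accuracy: ([0-9\\.]+)"}],
        strategy="Bayesian",     # search method
        max_jobs=20,             # total training jobs
        max_parallel_jobs=4,     # jobs run concurrently
    )
    tuner.fit({"train": "s3://my-bucket/features/train/"})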
Amazon SageMaker AutoPilot
Amazon SageMaker AutoPilot is the AutoML tool of SageMaker. It analyzes the raw data and the target to be predicted. It chooses the best algorithm candidates, processes the data to create the best features, and automatically trains and tunes the models. The best hyperparameters are automatically selected for each algorithm.
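A minimal sketch of launching an Autopilot job through the SageMaker Python SDK's AutoML class; the role, target column, and dataset location are hypothetical.

    from sagemaker.automl.automl import AutoML

    automl = AutoML(
        role="arn:aws:iam::123456789012:role/SageMakerRole",   # hypothetical role
        target_attribute_name="churn",                         # hypothetical target column
        max_candidates=10,                                     # limit on candidate pipelines
    )
    # Autopilot analyzes the raw CSV data, engineers features, and trains and tunes candidates.
    automl.fit(inputs="s3://my-bucket/raw/customers.csv")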
Amazon SageMaker Experiment Tracking
Amazon SageMaker Experiments tracks multiple model runs and provides auditability, traceability, and reproducibility of these runs. Data, parameters, hyperparameters, and models can be accessed historically to review and reproduce feature engineering, training, tuning, and deployment results. Each experiment includes trials, each trial includes steps, and each step includes tracking information. Versioning and lineage are kept across all the trials.
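A minimal sketch using the sagemaker-experiments package: an experiment and a trial are created, and a training run is associated with them through its experiment configuration. Names are hypothetical and the estimator is the one sketched earlier.

    from smexperiments.experiment import Experiment
    from smexperiments.trial import Trial

    experiment = Experiment.create(experiment_name="churn-prediction",
                                   description="Compare model variants")
    trial = Trial.create(trial_name="sklearn-baseline",
                         experiment_name=experiment.experiment_name)

    # Tie the training job to the trial so its parameters, metrics, and artifacts are tracked.
    estimator.fit(
        {"train": "s3://my-bucket/features/train/"},
        experiment_config={
            "ExperimentName": experiment.experiment_name,
            "TrialName": trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },
    )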
Model deployment
Amazon SageMaker Model Endpoints
Amazon SageMaker Model Endpoints allow the user to interface with a model to get inference results on production data. An endpoint requires the location of the model artifacts (e.g. an S3 bucket), the container image of the model, and some parameters and compute resource configurations. Different variants of the model can be requested to run in parallel. Endpoints are accessed through REST APIs.
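A minimal sketch of deploying the hypothetical estimator from the training section to a real-time endpoint and invoking it with the SageMaker Python SDK:

    # Deploy the trained model behind a managed HTTPS endpoint.
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.large",
        endpoint_name="churn-endpoint",   # hypothetical endpoint name
    )
    # Request an inference for one hypothetical feature vector.
    print(predictor.predict([[34, 2, 79.5]]))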
Amazon SageMaker Model Monitoring
Amazon SageMaker Model Monitoring is used to monitor the deployed model and identify any deviations from a baseline. A baseline is created from the training data using a tool such as Deequ in Apache Spark. Model Monitoring captures the data and model inference results and checks that all the constraints are satisfied; if not, Amazon CloudWatch is triggered and sends warnings about the deviation. AWS CloudTrail saves all the model logs to perform model reviews and debugging.
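A minimal sketch of creating such a baseline with the SageMaker Python SDK's model monitoring support; the role, training data location, and output path are hypothetical.

    from sagemaker.model_monitor import DefaultModelMonitor
    from sagemaker.model_monitor.dataset_format import DatasetFormat

    monitor = DefaultModelMonitor(
        role="arn:aws:iam::123456789012:role/SageMakerRole",   # hypothetical role
        instance_count=1,
        instance_type="ml.m5.xlarge",
    )
    # Suggest baseline statistics and constraints from the training data.
    monitor.suggest_baseline(
        baseline_dataset="s3://my-bucket/features/train/train.csv",
        dataset_format=DatasetFormat.csv(header=True),
        output_s3_uri="s3://my-bucket/monitoring/baseline/",
    )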
Amazon SageMaker A/B Tests
A/B tests are used to improve production models and to test models and hypotheses on production data. Amazon SageMaker A/B tests can be performed using Endpoints. Different training data, model versions, and compute resource configurations can be tested with Amazon SageMaker Model Endpoints. After reviewing the different model results, an improved model can be selected to replace the current one.
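A minimal sketch of an A/B split using production variants on a single endpoint with boto3; the model names, endpoint names, and traffic weights are hypothetical.

    import boto3

    sm = boto3.client("sagemaker")

    # Route 90% of traffic to the current model and 10% to a challenger model.
    sm.create_endpoint_config(
        EndpointConfigName="churn-ab-config",
        ProductionVariants=[
            {"VariantName": "champion", "ModelName": "churn-model-v1",
             "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
             "InitialVariantWeight": 0.9},
            {"VariantName": "challenger", "ModelName": "churn-model-v2",
             "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
             "InitialVariantWeight": 0.1},
        ],
    )
    sm.create_endpoint(EndpointName="churn-endpoint-ab",
                       EndpointConfigName="churn-ab-config")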
Amazon SageMaker Canary Rollouts
With Amazon SageMaker Canary Rollouts, a new model can be deployed as a separate production variant alongside the current model through Endpoints, exposed to a limited number of customers, and progressively expanded to more customers if the model performance is satisfactory.
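Continuing the hypothetical A/B example above, traffic can be shifted progressively toward the new variant by updating the variant weights:

    # Shift more traffic to the challenger once its performance proves satisfactory.
    sm.update_endpoint_weights_and_capacities(
        EndpointName="churn-endpoint-ab",
        DesiredWeightsAndCapacities=[
            {"VariantName": "champion", "DesiredWeight": 0.5},
            {"VariantName": "challenger", "DesiredWeight": 0.5},
        ],
    )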
Amazon SageMaker Batch inference
Amazon SageMaker batch inference (Batch Transform) is an alternative to Endpoints when real-time results are not necessary. Amazon SageMaker reads the batch data from an S3 bucket location, runs inference with a model, and delivers the results to another S3 bucket location.
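A minimal sketch of a Batch Transform job created from the hypothetical estimator of the training section; the S3 locations are hypothetical.

    # Create a transformer from the trained model and run offline inference over a batch of CSV files.
    transformer = estimator.transformer(
        instance_count=1,
        instance_type="ml.m5.large",
        output_path="s3://my-bucket/predictions/",
    )
    transformer.transform(data="s3://my-bucket/features/batch/", content_type="text/csv")
    transformer.wait()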
Model Pipeline
AWS Step Functions
Figure 1. AWS Step Functions. Source: Amazon
AWS Step Functions is an orchestration tool to coordinate the tasks of a machine learning workflow, such as processing the data and running AWS Lambda functions or pre-trained models. It can be used for extract, transform, and load (ETL) processes, for breaking down a complex machine learning codebase to make it more modular, for coordinating batch processing jobs, and for triggering events and notifications. AWS Step Functions is presented through a visual workflow graph.
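The sketch below defines a small two-step workflow in the Amazon States Language and registers it with boto3; the Lambda function ARNs, state machine name, and IAM role are hypothetical.

    import json
    import boto3

    sfn = boto3.client("stepfunctions")

    # Two sequential tasks: preprocess the data, then run inference.
    definition = {
        "StartAt": "PreprocessData",
        "States": {
            "PreprocessData": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:preprocess",
                "Next": "RunInference",
            },
            "RunInference": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run-inference",
                "End": True,
            },
        },
    }
    sfn.create_state_machine(
        name="ml-workflow",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",
    )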
Amazon EventBridge
Figure 2. Amazon EventBridge. Source: Amazon
Amazon EventBridge connects events (changes of state) to workflows. The events can come from SaaS applications (Datadog, OneLogin, PagerDuty, Saviynt, Segment, SignalFx, SugarCRM, Symantec, Whispir, and Zendesk), customized applications, or AWS services. They trigger workflows that can connect to applications, microservices, databases, AWS Lambda functions, and other AWS services, or communicate results.
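As a minimal sketch, an EventBridge rule can react to SageMaker training-job state changes and forward them to a Lambda target; the rule name and function ARN are hypothetical.

    import json
    import boto3

    events = boto3.client("events")

    # Match SageMaker training-job state-change events emitted by AWS.
    events.put_rule(
        Name="training-job-state-change",
        EventPattern=json.dumps({
            "source": ["aws.sagemaker"],
            "detail-type": ["SageMaker Training Job State Change"],
        }),
    )
    # Send matching events to a hypothetical Lambda function that notifies the team.
    events.put_targets(
        Rule="training-job-state-change",
        Targets=[{"Id": "notify", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:notify-team"}],
    )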