Steps to Start Training Your Custom TensorFlow Model in AWS SageMaker

Create an experiment, train and use checkpoints for a Transformer model

In this post we describe the most relevant steps to start training a custom algorithm in Amazon SageMaker without using a custom container. We show how to handle experiments and how to solve some of the common problems that arise when working with custom models in SageMaker script mode. Some basic SageMaker concepts are not covered in detail, so that we can focus on the relevant ones.

The following steps will be explained:

  1. Create an Experiment and Trial to keep track of our experiments
  2. Load the training data to our training instance
  3. Create the scripts to train our custom model, a Transformer
  4. Create an Estimator to train our model with TensorFlow 2.1 in script mode
  5. Create metric definitions to keep track of them in SageMaker
  6. Download the trained model to make predictions
  7. Resume training using the latest checkpoint from a previous training

We will show and describe the most useful and important pieces of code; a link to the full source code is provided at the end.
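As a small preview of step 5, here is a minimal sketch of what metric definitions look like. SageMaker extracts metrics by applying regular expressions to the training logs, with each definition given as a dictionary holding a `Name` and a `Regex`. The log line format, the metric names and the `extract_metrics` helper below are assumptions made for illustration only; the real regexes depend on what your training script prints.

```python
import re

# Hypothetical log line: the actual format depends on what the training script prints.
SAMPLE_LOG = "Epoch 3 Loss 1.2345 Accuracy 0.6789"

# Metric definitions are a list of dicts with "Name" and "Regex" keys.
# The value captured by the group in each regex is what gets recorded.
# The names "train:loss" and "train:accuracy" are illustrative choices.
metric_definitions = [
    {"Name": "train:loss", "Regex": r"Loss ([0-9.]+)"},
    {"Name": "train:accuracy", "Regex": r"Accuracy ([0-9.]+)"},
]


def extract_metrics(log_line, definitions):
    """Hypothetical local helper: apply each regex to one log line
    to check that it captures the intended number."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


parsed = extract_metrics(SAMPLE_LOG, metric_definitions)
print(parsed)
```

Checking the regexes locally like this, before launching a training job, is a cheap way to catch patterns that never match. The same list can then be passed as the `metric_definitions` argument when the Estimator is created.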


