Use TensorFlow to train models on Gradient

TensorFlow is an open source machine learning framework developed by Google. The framework natively supports CPUs and GPUs and is designed for both research and production. TensorFlow is tightly integrated with Gradient; see below for examples, including automatic metrics parsing and built-in distributed training. Gradient supports any version of TensorFlow for Notebooks, Experiments, or Jobs (see TensorFlow Serving for deploying trained models). A set of pre-built TensorFlow containers is provided out of the box, though customers can also bring their own customized TensorFlow containers.

Using a customized TensorFlow container

In Gradient, the ML framework used to execute workloads runs within a Docker container. Containers are lightweight and portable environments that can easily be customized to include various framework versions and other libraries. Any Docker container hosted on a public or private container registry is supported on the Gradient platform. This flexibility makes it easy to switch between different framework versions and to incorporate other libraries to be used alongside the framework itself.
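As an illustration, a customized TensorFlow container might be built by extending one of the official images and pushing the result to any public or private registry. This is a minimal sketch; the base image tag, extra packages, and paths are placeholders, not Gradient requirements:

```dockerfile
# Hypothetical example: extend an official TensorFlow image with extra libraries.
FROM tensorflow/tensorflow:1.13.1-gpu-py3

# Add any additional Python libraries your workload needs (placeholders).
RUN pip install --no-cache-dir pandas scikit-learn

# Copy in your training code (path is a placeholder).
COPY . /workspace
WORKDIR /workspace
```

Once pushed to a registry, the resulting image path can be referenced when launching any Gradient workload, exactly like the pre-built containers.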

Running workloads with TensorFlow

When launching a workload via the web interface, the CLI, or automatically via a pipeline step, you can simply pass in a Docker image path (e.g. <inline-code>tensorflow/tensorflow:1.13.1-gpu-py3<inline-code>). There are also several pre-configured templates available. These templates are updated regularly and are optimized for performance.

A set of pre-built containers can be used as a starting point within Gradient

When using the CLI, the command would look something like this:
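A sketch of such a CLI invocation follows. The subcommand and flag names reflect the Gradient CLI conventions of this era and may differ in your installed version; the project ID, machine type, and training command are placeholders:

```shell
# Hypothetical example: launch a single-node TensorFlow experiment using one
# of the pre-built TensorFlow containers. IDs and commands are placeholders.
gradient experiments run singlenode \
  --name tf-training-example \
  --projectId <your-project-id> \
  --container tensorflow/tensorflow:1.13.1-gpu-py3 \
  --machineType P4000 \
  --command "python train.py"
```

Swapping <inline-code>--container<inline-code> for your own registry path is all that is needed to run a customized container instead.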

Automatic metrics parsing

Gradient will automatically parse models created on the platform and save the model's performance metrics under the entity details as metadata. All that is required is for the workload to declare the model type, in this case as <inline-code>--modelType Tensorflow<inline-code>. These metrics can be used to track iterations of your training runs and saved models. When developing an ML pipeline, these metrics can also be used as inputs to a step in the pipeline. Pipelines can evaluate these metrics before performing a subsequent step, e.g. to prevent pull request merges that degrade your model. GradientCI can forward these metrics to GitHub so you can view them alongside your source code.
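As a sketch, the declaration might be added to an experiment launch like so. Only <inline-code>--modelType<inline-code> is taken from the text above; the <inline-code>--modelPath<inline-code> flag and all other values are assumptions and placeholders:

```shell
# Hypothetical sketch: declare the model type so Gradient parses the saved
# model's metrics automatically. --modelPath and all IDs are assumptions.
gradient experiments run singlenode \
  --name tf-metrics-example \
  --projectId <your-project-id> \
  --container tensorflow/tensorflow:1.13.1-gpu-py3 \
  --machineType P4000 \
  --modelType Tensorflow \
  --modelPath /artifacts \
  --command "python train.py"
```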

GitHub status checks

Distributed TensorFlow

Gradient offers push-button distributed / multi-node training as a first-class citizen (and distributed inference via TensorFlow Serving). Scaling out your workloads with distributed TensorFlow doesn't require any DevOps background and can be accomplished with a few additional lines of code. By specifying the <inline-code>multinode<inline-code> mode and a few additional parameters, you can take any TensorFlow model and execute training across as many nodes as desired. Learn more in the docs. For a primer, read the TensorFlow documentation on Distribution Strategy for multi-worker training.
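A multi-node launch might be sketched as follows. Only the <inline-code>multinode<inline-code> mode comes from the text above; the experiment type, worker/parameter-server flags, counts, and commands are assumptions based on the CLI conventions of this era and may differ in your version:

```shell
# Hypothetical sketch: distributed training with one parameter server and two
# workers. All flag values are placeholders.
gradient experiments run multinode \
  --name tf-distributed-example \
  --projectId <your-project-id> \
  --experimentType GRPC \
  --workerContainer tensorflow/tensorflow:1.13.1-gpu-py3 \
  --workerMachineType P4000 \
  --workerCommand "python train_dist.py" \
  --workerCount 2 \
  --parameterServerContainer tensorflow/tensorflow:1.13.1-gpu-py3 \
  --parameterServerMachineType P4000 \
  --parameterServerCommand "python train_dist.py" \
  --parameterServerCount 1
```

Inside <inline-code>train_dist.py<inline-code>, the model code itself would typically use a TensorFlow Distribution Strategy (e.g. multi-worker training), per the TensorFlow documentation referenced above.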