GPU inference container

Latest images
  • docker-public.artifacts.speechmatics.io/sm-gpu-inference-server-en:10.0.0
  • docker-public.artifacts.speechmatics.io/batch-asr-transcriber-en:10.0.0
  • docker-public.artifacts.speechmatics.io/rt-asr-transcriber-en:10.0.0

Note: Customers who are not using GPU hardware should continue to use transcriber version 9.X.X and below. Transcriber version 10.0.0 has not yet been optimized for CPU-only usage.

This guide will walk you through the steps needed to deploy the Speechmatics GPU inference container. Using a GPU enables more complex models with accuracy improvements on all operating points compared to conventional CPU-based inference. It also allows multiple speech recognition containers to offload heavy inference tasks to a single GPU server, where they can be batched and parallelized efficiently.

The following steps are required to use this in your environment:

  • Check system requirements
  • Pull the Docker Image into your local Docker Registry
  • Run the container

System requirements

The system must have:

  • Nvidia GPU(s) with at least 16GB of GPU memory
  • CUDA compute capability of 7.5 or above, which corresponds to the Turing architecture. Cards with the Volta architecture (7.0) or below are not able to run the models.
  • 24 GB RAM
  • The nvidia-container-toolkit installed (https://github.com/NVIDIA/nvidia-docker)
  • Docker version 19.03 or later (required by the nvidia-container-toolkit)

The raw image size of the GPU inference container is around 15GB.
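
To check that a host meets these requirements before pulling the image, you can query the GPU and driver with nvidia-smi and confirm that Docker can reach the GPU through the nvidia-container-toolkit. This is only a sketch: the compute_cap query field needs a reasonably recent driver, and the CUDA base image tag is just an example.

# GPU name, memory, compute capability and driver version
nvidia-smi --query-gpu=name,memory.total,compute_cap,driver_version --format=csv

# Docker version (should be 19.03 or later)
docker --version

# Confirm Docker can see the GPU via the nvidia-container-toolkit
docker run --rm --gpus=all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi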

The authoritative compatibility matrix is maintained by Nvidia at https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html; search for Triton Inference Server 22.08. At the time of release, it stated:

Release 22.08 is based on CUDA 11.7.1, which requires NVIDIA Driver release 515 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), or 510.47 (or later R510). The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 11.8.

Using a cloud provider

The GPU node can be provisioned in the cloud. Any Azure NC or ND series instance with sufficient GPU memory should work; the Terraform example below provisions a Standard_NC8as_T4_v3 instance.

For example, to create such an instance using Terraform:

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "speechmatics_transcriber" {
  name     = "speechmatics-transcriber"
  location = "West Europe"
}

data "template_cloudinit_config" "setup" {
  gzip          = true
  base64_encode = true

  part {
    content_type = "text/cloud-config"
    content     = <<-EOT
    apt:
      sources:
        docker-ce.list:
          source: deb [arch=amd64 signed-by=$KEY_FILE] https://download.docker.com/linux/ubuntu bionic stable
          keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88
        nvidia-container-toolkit.list:
          source: deb [signed-by=$KEY_FILE] https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
          keyid: C95B321B61E88C1809C4F759DDCAE044F796ECB0
    packages:
      - containerd.io
      - docker-ce
      - docker-ce-cli
      - docker-compose-plugin
      - nvidia-docker2
      - nvidia-driver-525
    power_state:
      delay: now
      mode: reboot
    EOT
  }
}

module "network" {
  source  = "Azure/network/azurerm"
  version = "4.2.0"

  resource_group_name = azurerm_resource_group.speechmatics_transcriber.name
  address_spaces      = ["10.0.1.0/24"]
  subnet_prefixes     = ["10.0.1.0/24"]
  subnet_names        = ["subnet1"]

  depends_on = [
    azurerm_resource_group.speechmatics_transcriber,
  ]
}

module "vm" {
  source  = "Azure/compute/azurerm"
  version = "4.2.0"

  resource_group_name = azurerm_resource_group.speechmatics_transcriber.name

  vm_hostname = "speechmatics-transcriber"
  vm_size     = "Standard_NC8as_T4_v3"

  vm_os_publisher = "Canonical"
  vm_os_offer     = "UbuntuServer"
  vm_os_sku       = "18_04-lts-gen2"
  ssh_key         = null
  ssh_key_values  = ["<Your SSH key>"]
  custom_data     = data.template_cloudinit_config.setup.rendered

  vnet_subnet_id = module.network.vnet_subnets[0]
  public_ip_dns  = ["speechmatics-transcriber"]

  depends_on = [
    azurerm_resource_group.speechmatics_transcriber,
    module.network,
  ]
}

output "public_ip" {
  value = module.vm.public_ip_address
}
output "public_dns" {
  value = module.vm.public_ip_dns_name
}
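
To provision the VM, the standard Terraform workflow applies. The admin user name in the SSH command below is an assumption; it depends on the defaults of the Azure/compute/azurerm module and your own configuration.

terraform init
terraform apply

# Connect using the public DNS name reported in the outputs
ssh azureuser@$(terraform output -raw public_dns)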

Prerequisites

Licensing

There is no specific license for the inference server. It will run using an existing Speechmatics license for the real-time or batch container.

Note: If you do not have a license or access to the docker repository, please contact support@speechmatics.com.

Running the image

If the image has sole use of the GPU(s), CUDA_VISIBLE_DEVICES may be omitted. Otherwise, set it to tell the server which device IDs to target. See the Nvidia/CUDA documentation for details.

docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES \
  -p 8001:8001 \
  <gpu_inference_server_image_name>
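
For example, on a host where the GPU is shared with other workloads, you might restrict the server to the first device by exporting the variable before passing it through. Device IDs are as reported by nvidia-smi; this is only an illustrative invocation:

export CUDA_VISIBLE_DEVICES=0
docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES \
  -p 8001:8001 \
  <gpu_inference_server_image_name>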

When the container starts you should see output similar to this, indicating that the server has started and is ready to serve requests.

I1207 09:34:22.462341 1 server.cc:592] 
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| kaldi_enhanced | 1       | READY  |
| kaldi_standard | 1       | READY  |
| lm_enhanced    | 1       | READY  |
+----------------+---------+--------+
...
I1207 09:34:22.612211 1 grpc_server.cc:4375] Started GRPCInferenceService at 0.0.0.0:8001
I1207 09:34:22.624076 1 http_server.cc:3075] Started HTTPService at 0.0.0.0:8000
I1207 09:34:22.665759 1 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

Once the GPU server is running, follow the instructions for linking a speech container.

Running only one operating point

Operating points represent different levels of model complexity. To save GPU memory and improve throughput, you can run the server with only one operating point loaded. To do this, pass the SM_OPERATING_POINT environment variable to the container and set it to either standard or enhanced.
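
For example, to start the server with only the enhanced operating point loaded (other arguments as in the earlier docker run example):

docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus=all \
  -e SM_OPERATING_POINT=enhanced \
  -p 8001:8001 \
  <gpu_inference_server_image_name>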

Batch and real-time inference

The inference server can run in two modes: batch, for processing whole files and returning the transcript at the end, and real-time, for processing audio streams. The default mode is batch. To configure the GPU server for real-time operation, set the environment variable SM_BATCH_MODE=false by passing it into the docker run command, as shown below.
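
For example, to start the server in real-time mode (other arguments as in the earlier docker run example):

docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus=all \
  -e SM_BATCH_MODE=false \
  -p 8001:8001 \
  <gpu_inference_server_image_name>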

The modes correspond to the two types of client speech container, which are distinguished by their name:

  • rt-asr-transcriber-en:<version>
  • batch-asr-transcriber-en:<version>

The server can only support one of these modes at once.

Monitoring the server

The inference server is based on Nvidia's Triton architecture and can therefore be monitored using Triton's built-in Prometheus metrics or its GRPC/HTTP APIs. To expose these, configure an external mapping for port 8002 (Prometheus) or 8000 (HTTP).
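
For example, to publish the metrics port when starting the server and then scrape it with Prometheus or curl (the /metrics path is Triton's standard metrics endpoint):

docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus=all \
  -p 8001:8001 \
  -p 8002:8002 \
  <gpu_inference_server_image_name>

# Fetch the Prometheus metrics
curl http://localhost:8002/metrics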

Operating points in GPU inference

When inference is offloaded to a GPU server, alternative GPU-specific models are used, so you should not expect identical results compared to CPU-based inference. For convenience, the GPU models are also designated as 'standard' and 'enhanced'.