GPU inference container
Latest images
docker-public.artifacts.speechmatics.io/sm-gpu-inference-server-en:10.0.0
docker-public.artifacts.speechmatics.io/batch-asr-transcriber-en:10.0.0
docker-public.artifacts.speechmatics.io/rt-asr-transcriber-en:10.0.0
Note: Customers who are not using GPU hardware should continue to use transcriber version 9.X.X and below. Transcriber version 10.0.0 has not yet been optimized for CPU-only usage.
This guide will walk you through the steps needed to deploy the Speechmatics GPU inference container. Using a GPU enables more complex models with accuracy improvements on all operating points compared to conventional CPU-based inference. It also allows multiple speech recognition containers to offload heavy inference tasks to a single GPU server, where they can be batched and parallelized efficiently.
The following steps are required to use this in your environment:
- Check system requirements
- Pull the Docker Image into your local Docker Registry
- Run the container
System requirements
The system must have:
- Nvidia GPU(s) with at least 16 GB of GPU memory
- CUDA compute capability of 7.5 or above, which corresponds to the Turing architecture. Cards with the Volta architecture (7.0) or below are not able to run the models.
- 24 GB RAM
- The nvidia-container-toolkit installed (https://github.com/NVIDIA/nvidia-docker)
- Docker version >19.03 (required for the above)
The raw image size of the GPU inference container is around 15GB.
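To confirm that a host meets these requirements, nvidia-smi can report the card name, memory, and compute capability (the compute_cap query field needs a reasonably recent driver), and a throwaway CUDA container is a quick way to verify the nvidia-container-toolkit installation (the CUDA image tag below is only an example):
# Report GPU model, total memory and CUDA compute capability
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv

# Confirm Docker can reach the GPU through the Nvidia container toolkit
docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi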
The authoritative compatibility matrix is maintained by Nvidia at https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html (search for Triton Inference Server 22.08). At the time of release this stated:
Release 22.08 is based on CUDA 11.7.1, which requires NVIDIA Driver release 515 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), or 510.47 (or later R510). The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 11.8.
Using a cloud provider
The GPU node can be provisioned in the cloud. Our SaaS deployment uses Azure Standard_NC8as_T4_v3 instances, but any NC- or ND-series instance with sufficient GPU memory should work.
For example, to create such an instance using Terraform:
provider "azurerm" {
features {}
}
resource "azurerm_resource_group" "speechmatics_transcriber" {
name = "speechmatics-transcriber"
location = "West Europe"
}
data "template_cloudinit_config" "setup" {
gzip = true
base64_encode = true
part {
content_type = "text/cloud-config"
content = <<-EOT
apt:
sources:
docker-ce.list:
source: deb [arch=amd64 signed-by=$KEY_FILE] https://download.docker.com/linux/ubuntu bionic stable
keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88
nvidia-container-toolkit.list:
source: deb [signed-by=$KEY_FILE] https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
keyid: C95B321B61E88C1809C4F759DDCAE044F796ECB0
packages:
- containerd.io
- docker-ce
- docker-ce-cli
- docker-compose-plugin
- nvidia-docker2
- nvidia-driver-525
power_state:
delay: now
mode: reboot
EOT
}
}
module "network" {
source = "Azure/network/azurerm"
version = "4.2.0"
resource_group_name = azurerm_resource_group.speechmatics_transcriber.name
address_spaces = ["10.0.1.0/24"]
subnet_prefixes = ["10.0.1.0/24"]
subnet_names = ["subnet1"]
depends_on = [
azurerm_resource_group.speechmatics_transcriber,
]
}
module "vm" {
source = "Azure/compute/azurerm"
version = "4.2.0"
resource_group_name = azurerm_resource_group.speechmatics_transcriber.name
vm_hostname = "speechmatics-transcriber"
vm_size = "Standard_NC8as_T4_v3"
vm_os_publisher = "Canonical"
vm_os_offer = "UbuntuServer"
vm_os_sku = "18_04-lts-gen2"
ssh_key = null
ssh_key_values = ["<Your SSH key>"]
custom_data = data.template_cloudinit_config.setup.rendered
vnet_subnet_id = module.network.vnet_subnets[0]
public_ip_dns = ["speechmatics-transcriber"]
depends_on = [
azurerm_resource_group.speechmatics_transcriber,
module.network,
]
}
output "public_ip" {
value = module.vm.public_ip_address
}
output "public_dns" {
value = module.vm.public_ip_dns_name
}
Prerequisites
- A license file or a license token - see Licensing
- Access to our docker repository - see Accessing images
Licensing
There is no specific license for the inference server. It will run using an existing Speechmatics license for the real-time or batch container.
Note: If you do not have a license or access to the docker repository, please contact support@speechmatics.com.
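As an illustration, once access has been granted, logging in to the Speechmatics registry (the exact credentials are provided by Speechmatics) and pulling the GPU inference image listed above looks like this:
# Log in with the credentials provided by Speechmatics
docker login docker-public.artifacts.speechmatics.io

# Pull the GPU inference server image
docker pull docker-public.artifacts.speechmatics.io/sm-gpu-inference-server-en:10.0.0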
Running the image
If the image has sole use of the GPU(s), then CUDA_VISIBLE_DEVICES may be omitted. Otherwise, it should be set to indicate to the server which device ID it should target. See the Nvidia/CUDA documentation for details.
docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES \
  -p 8001:8001 \
  <gpu_inference_server_image_name>
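On a shared multi-GPU host you can instead pass an explicit value rather than forwarding the variable from the host environment; for example, restricting the server to the first device (device 0 is only an illustration):
docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES=0 \
  -p 8001:8001 \
  <gpu_inference_server_image_name>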
When the container starts you should see output similar to this, indicating that the server has started and is ready to serve requests.
I1207 09:34:22.462341 1 server.cc:592]
+----------------+---------+--------+
| Model | Version | Status |
+----------------+---------+--------+
| kaldi_enhanced | 1 | READY |
| kaldi_standard | 1 | READY |
| lm_enhanced | 1 | READY |
+----------------+---------+--------+
...
I1207 09:34:22.612211 1 grpc_server.cc:4375] Started GRPCInferenceService at 0.0.0.0:8001
I1207 09:34:22.624076 1 http_server.cc:3075] Started HTTPService at 0.0.0.0:8000
I1207 09:34:22.665759 1 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
Once the GPU server is running, follow the instructions for linking a speech container.
Running only one operating point
Operating points represent different levels of model complexity. To free GPU memory and increase throughput, you can run the server with only one operating point loaded. To do this, pass the SM_OPERATING_POINT environment variable to the container and set it to either standard or enhanced.
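For example, to start the server with only the enhanced models loaded (the same run command as above, with the extra variable):
docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus=all \
  -e SM_OPERATING_POINT=enhanced \
  -p 8001:8001 \
  <gpu_inference_server_image_name>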
Batch and real-time inference
The inference server can run in two modes: batch, for processing whole files and returning the transcript at the end, and real-time, for processing audio streams. The default mode is batch. To configure the GPU server for real-time operation, set the environment variable SM_BATCH_MODE=false by passing it into the docker run command.
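For example, to start the server in real-time mode (again, the same run command with the variable added):
docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus=all \
  -e SM_BATCH_MODE=false \
  -p 8001:8001 \
  <gpu_inference_server_image_name>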
The modes correspond to the two types of client speech container, which are distinguished by their name:
- rt-asr-transcriber-en:<version>
- batch-asr-transcriber-en:<version>
The server can only support one of these modes at once.
Monitoring the server
The inference server is based on Nvidia's Triton architecture and as such can be monitored using Triton's inbuilt Prometheus metrics, or the gRPC/HTTP APIs. To expose these, configure an external mapping for port 8002 (Prometheus metrics) or port 8000 (HTTP).
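For example, adding -p 8002:8002 to the docker run command exposes the metrics port, which can then be scraped from the host; /metrics is Triton's standard Prometheus endpoint:
# Fetch the Prometheus metrics from the running inference server
curl http://localhost:8002/metrics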
Operating points in GPU inference
When inference is offloaded to a GPU server, alternative GPU-specific models are used, so you should not expect identical results compared to CPU-based inference. For convenience, the GPU models are also designated as 'standard' and 'enhanced'.