Linking to a GPU inference container

The GPU inference container allows multiple speech recognition containers to offload heavy inference tasks to a GPU, where they can be batched and parallelized more efficiently.

The CPU container is run as normal, but with the additional environment variable SM_INFERENCE_ENDPOINT, which specifies the gRPC endpoint of the inference server.

Speech containers running in GPU mode use less local CPU and memory, so they can be packed more densely on a server.

docker run \
  --rm \
  -it \
  -e SM_INFERENCE_ENDPOINT=<server>:<port> \
  -v $PWD/license.json:/license.json \
  -v $PWD/example.wav:/input.audio \
  <speech_container_image_name>
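
The example above points at an inference server that is already running at <server>:<port>. As a sketch of running both halves on one host, the commands below create a shared Docker network so the speech container can reach the inference server by name. The GPU inference container image name, its gRPC port, and the GPU flags are placeholders; adapt them from the GPU inference container documentation.

# Assumed image names, network name and port -- adjust to your deployment
docker network create transcription

docker run --rm -d \
  --name inference-server \
  --network transcription \
  --gpus all \
  <gpu_inference_container_image_name>

docker run --rm -it \
  --network transcription \
  -e SM_INFERENCE_ENDPOINT=inference-server:<port> \
  -v $PWD/license.json:/license.json \
  -v $PWD/example.wav:/input.audio \
  <speech_container_image_name>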

When the inference server is not available

At startup, the container makes a TCP connection to the SM_INFERENCE_ENDPOINT server to establish whether it is accessible. If this test fails, the transcription terminates with an error.
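
Because this check is a plain TCP connection, you can reproduce it by hand before launching the container. The snippet below is a minimal sketch using netcat; replace <server> and <port> with the values from SM_INFERENCE_ENDPOINT.

# Verify the inference endpoint accepts TCP connections (5 second timeout)
nc -z -w 5 <server> <port> && echo "endpoint reachable" || echo "endpoint unreachable"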

Batch

In the event of a connection error during transcription, the transcriber will retry for up to 60 seconds using an exponential backoff. The length of this retry period can be configured with the SM_SPLIT_RETRY_TIMEOUT environment variable, which takes a whole number of seconds.
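
For example, to extend the retry window to 120 seconds (the value here is purely illustrative), pass the variable alongside the endpoint:

docker run \
  --rm \
  -it \
  -e SM_INFERENCE_ENDPOINT=<server>:<port> \
  -e SM_SPLIT_RETRY_TIMEOUT=120 \
  -v $PWD/license.json:/license.json \
  -v $PWD/example.wav:/input.audio \
  <speech_container_image_name>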

Real-time

In real-time mode, the transcriber will retry the connection to the server for a maximum of 250 ms before giving up.

For more details, see GPU inference container.