Linking to a GPU inference container
The GPU inference container allows multiple speech recognition containers to offload heavy inference tasks to a GPU, where they can be batched and parallelized more efficiently.
The CPU container is run as normal, but with the additional environment variable SM_INFERENCE_ENDPOINT,
which indicates the gRPC endpoint of the inference server.
Speech containers running in GPU mode use less local CPU and memory, so they can be packed more densely on a server.
docker run \
--rm \
-it \
-e SM_INFERENCE_ENDPOINT=<server>:<port> \
-v $PWD/license.json:/license.json \
-v $PWD/example.wav:/input.audio \
<speech_container_image_name>
When the inference server is not available
At startup, the container will make a TCP connection to the SM_INFERENCE_ENDPOINT
server to establish whether it is accessible.
If this test fails, the transcription will terminate with an error.
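If the container exits with this error, it can help to confirm from the host that the endpoint is reachable before retrying. A minimal sketch, assuming nc (netcat) is available on the host and using the same placeholder server and port as above:
# Check that the inference server's gRPC port accepts TCP connections
nc -zv <server> <port>
# Alternative without netcat: open the TCP socket directly with bash
timeout 5 bash -c 'cat < /dev/null > /dev/tcp/<server>/<port>' \
  && echo "endpoint reachable" \
  || echo "endpoint unreachable"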
Batch
In the event of a connection error during transcription, the transcriber will retry for up to 60 seconds using an exponential backoff.
The length of this retry period can be configured with the SM_SPLIT_RETRY_TIMEOUT
environment variable, which is a whole number of seconds, as shown in the example below.
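For example, to allow the batch transcriber to keep retrying for two minutes instead of the default 60 seconds, the variable can be passed alongside the endpoint in the same docker run command shown earlier (the value 120 here is illustrative):
docker run \
--rm \
-it \
-e SM_INFERENCE_ENDPOINT=<server>:<port> \
-e SM_SPLIT_RETRY_TIMEOUT=120 \
-v $PWD/license.json:/license.json \
-v $PWD/example.wav:/input.audio \
<speech_container_image_name>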
Real-time
In real-time mode, the transcriber will retry connecting to the server for a maximum of 250 ms before giving up.
For more details, see GPU inference container.