GPU workloads are increasingly short-lived. Model evaluations, fine-tuning runs, batch inference, simulations, and GPU-enabled CI/CD checks often need temporary compute capacity, then need that capacity shut down as soon as the job completes.
That changes the automation problem. A useful GPU workflow has to select suitable hardware, package the workload correctly, expose only required ports, capture logs and outputs, and clean up the instance even when the job fails.
This guide uses the Fluence GPU Cloud API as the implementation example for an API-first pattern that can be adapted to short-lived GPU jobs. Fluence documents GPU Cloud API support for browsing plans; deploying containers, VMs, and bare metal; and managing instances through their lifecycle.
Why GPU Jobs Need Purpose-Built Automation
GPU infrastructure has more placement and runtime constraints than standard CPU infrastructure. A job may depend on GPU model, VRAM, CPU and memory allocation, storage, location, CUDA compatibility, and container runtime behavior. NVIDIA’s Container Toolkit documentation notes that Docker GPU access can be controlled with the --gpus option or NVIDIA_VISIBLE_DEVICES, which is exactly the type of runtime detail generic VM scripts often miss.
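As a rough illustration of that runtime detail, a job entrypoint can inspect NVIDIA_VISIBLE_DEVICES before it starts work. The sketch below is a hypothetical helper (gpu_visibility is not part of any toolkit). It assumes the NVIDIA Container Toolkit conventions for the variable: all, none, void or empty, or a comma-separated list of GPU indices or UUIDs.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper: classify what NVIDIA_VISIBLE_DEVICES requests.
# Prints "no-gpu", "all", or "subset:<list>".
gpu_visibility() {
  local devices="${NVIDIA_VISIBLE_DEVICES-}"
  case "$devices" in
    ""|void|none) echo "no-gpu" ;;           # unset, empty, void, none: no GPU exposed
    all)          echo "all" ;;              # every GPU on the host
    *)            echo "subset:$devices" ;;  # explicit indices or UUIDs
  esac
}

# Example use in an entrypoint (not executed here):
#   [[ "$(gpu_visibility)" == "no-gpu" ]] && { echo "no GPU exposed" >&2; exit 1; }
```

Failing fast here is cheaper than discovering a CPU-only container after the job has already billed several minutes of GPU time.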
Automation also has to treat cleanup as a first-class step. A GPU job that launches successfully but leaves an instance running after an error has not been automated safely. The core lifecycle should be: discover capacity, deploy the workload, monitor logs and events, retrieve outputs, and terminate the instance.
Choose the Right Orchestration Model
Teams usually choose between managed batch, Kubernetes, marketplace CLIs, and direct APIs. Google Cloud Batch supports GPU jobs through gcloud or the Batch API, Kubernetes schedules GPUs through device plugins, and Vast.ai’s CLI supports instance lifecycle actions such as checking status, streaming logs, and destroying instances.
| Model | Best fit | Main tradeoff |
| --- | --- | --- |
| Direct GPU cloud API | Short-lived jobs, CI/CD, scripted evaluation | You own lifecycle logic |
| Managed batch service | Queued task execution | Provider-specific job configuration |
| Kubernetes GPU scheduling | Existing platforms and long-running GPU services | Cluster and GPU runtime overhead |
| Marketplace CLI | Ad-hoc experiments | Provider-specific commands |
A direct API is useful when the team wants REST-level control rather than a scheduler. It is not a universal replacement for batch services or Kubernetes. It works best when the job lifecycle is simple enough to script and important enough to control directly.
Prerequisites for an API-First GPU Job
Start with an API key, an account ready for deployment, a CUDA-compatible container image, and an SSH public key if the workload needs administrative access or output retrieval. Fluence API requests use JSON, and all endpoints require an API key sent in the X-API-KEY header.
For Fluence GPU endpoints, use https://api.fluence.dev/gpu as the base URL. Container plan discovery is available at GET /plans/, while VM and bare-metal plans are available at GET /plans/vms and GET /plans/baremetal. The plan response includes GPU, resource, pricing, and location data that automation can use for selection.
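One way to avoid repeating the header and base URL in every call is a small wrapper. This is a minimal sketch, not an official client: fluence_api is a hypothetical name, and the --fail flag is added here so that HTTP errors produce a non-zero exit status.

```bash
#!/usr/bin/env bash
# Base URL as stated above; override API_BASE to point elsewhere.
API_BASE="${API_BASE:-https://api.fluence.dev/gpu}"

# Hypothetical wrapper: fluence_api METHOD PATH [JSON_BODY]
fluence_api() {
  local method="$1" path="$2" body="${3-}"
  # Stop with a clear message rather than sending an unauthenticated request.
  : "${FLUENCE_API_KEY:?FLUENCE_API_KEY must be set}"
  local args=(-sS --fail -X "$method" -H "X-API-KEY: $FLUENCE_API_KEY")
  if [[ -n "$body" ]]; then
    # Only requests with a body need the JSON content type.
    args+=(-H "Content-Type: application/json" -d "$body")
  fi
  curl "${args[@]}" "${API_BASE}${path}"
}

# Example (not executed here):
#   fluence_api GET /plans/
```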
SSH keys are account-level credentials in Fluence. The API supports listing, adding, and removing keys; when adding a key, provide a friendly name and the full publicKey string. Removing a key from the account does not affect instances already deployed with it.
Step-by-Step API Workflow
First, browse plans and choose a plan that matches the workload:
curl https://api.fluence.dev/gpu/plans/ \
  -H "X-API-KEY: $FLUENCE_API_KEY"
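Choosing a plan can be scripted with jq once the response shape is known. The sketch below is illustrative only: pick_plan is a hypothetical helper, and the field names gpu_vram_gb and price_per_hour are placeholders rather than the documented Fluence schema, so adjust the paths to match the real plan response.

```bash
#!/usr/bin/env bash
# pick_plan MIN_VRAM_GB   (reads a JSON array of plans on stdin)
# Prints the id of the cheapest plan with at least MIN_VRAM_GB of VRAM.
# Exits non-zero and prints nothing when no plan qualifies.
# ASSUMPTION: .gpu_vram_gb and .price_per_hour are placeholder field names.
pick_plan() {
  local min_vram="$1"
  jq -er --argjson min "$min_vram" '
    map(select(.gpu_vram_gb >= $min))   # keep plans with enough VRAM
    | sort_by(.price_per_hour)          # cheapest first
    | .[0].plan.id // empty             # no output if nothing matched
  '
}
```

A typical call pipes the plan listing into the helper, for example curl -sS "$API_BASE/plans/" -H "X-API-KEY: $FLUENCE_API_KEY" | pick_plan 40.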
For containers, deploy with POST /gpu/instances/. The Fluence OpenAPI spec requires plan_id, name, and container_settings; container settings include the image, exposed ports, optional environment variables, startup command, registry credentials, constraints, and optional SSH key.
curl -X POST https://api.fluence.dev/gpu/instances/ \
  -H "X-API-KEY: $FLUENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "plan_id": "<plan-id>",
    "name": "gpu-eval-job",
    "container_settings": {
      "image": "<registry>/<image>:<tag>",
      "expose": [{"port": "8080", "protocol": "tcp"}],
      "environment": [{"name": "JOB_MODE", "value": "evaluation"}],
      "startup_command": "python run_job.py"
    },
    "constraints": {"location": "US"},
    "ssh_key": "<ssh-public-key>"
  }'
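When values such as the image name or startup command come from variables, hand-escaped JSON becomes fragile. A minimal sketch, assuming jq is available: build_container_payload is a hypothetical helper that mirrors the required fields of the request above and lets jq handle the quoting.

```bash
#!/usr/bin/env bash
# build_container_payload PLAN_ID NAME IMAGE PORT COMMAND
# Prints a deploy payload with all strings safely JSON-escaped.
build_container_payload() {
  jq -n \
    --arg plan_id "$1" \
    --arg name "$2" \
    --arg image "$3" \
    --arg port "$4" \
    --arg cmd "$5" \
    '{
      plan_id: $plan_id,
      name: $name,
      container_settings: {
        image: $image,
        expose: [{port: $port, protocol: "tcp"}],
        startup_command: $cmd
      }
    }'
}

# Example (not executed here):
#   payload=$(build_container_payload "$PLAN_ID" gpu-eval-job "$IMAGE" 8080 "python run_job.py")
#   curl -sS -X POST "$API_BASE/instances/" -H "X-API-KEY: $FLUENCE_API_KEY" \
#     -H "Content-Type: application/json" -d "$payload"
```

Because --arg always passes values as JSON strings, a startup command that contains quotes or backslashes cannot break the request body.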
For VMs and bare metal, use POST /gpu/instances/vms or POST /gpu/instances/baremetal. Those requests require a chosen option ID as plan_id, an instance name, an SSH key, and an os_image that matches one of the selected plan option’s os_options.
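Since a mismatched os_image is an avoidable validation error, a script can check it before sending the request. The helper below is a hypothetical sketch: it takes the allowed values as plain arguments and leaves extracting os_options from the plan response to the caller, because the response layout is not shown here.

```bash
#!/usr/bin/env bash
# os_image_allowed IMAGE OPTION...
# Succeeds only if IMAGE exactly matches one of the listed options.
os_image_allowed() {
  local wanted="$1"; shift
  local option
  for option in "$@"; do
    [[ "$option" == "$wanted" ]] && return 0
  done
  return 1
}

# Example with made-up option names (not executed here):
#   os_image_allowed "$OS_IMAGE" "${OS_OPTIONS[@]}" || { echo "unsupported os_image" >&2; exit 1; }
```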
After deployment, monitor the job. Fluence exposes container stdout and stderr through GET /gpu/instances/{instance_id}/logs, with an optional last query parameter. It also exposes cluster-level lifecycle events through GET /gpu/instances/{instance_id}/events, including scheduling, image pull, container creation, and startup information.
curl "https://api.fluence.dev/gpu/instances/$INSTANCE_ID/logs?last=50" \
  -H "X-API-KEY: $FLUENCE_API_KEY"

curl "https://api.fluence.dev/gpu/instances/$INSTANCE_ID/events?last=50" \
  -H "X-API-KEY: $FLUENCE_API_KEY"
Connection details become available after the instance is active. Fluence documents ssh_connection for VMs and bare metal, and network.domain plus network.forwarded_ports for containers. Retrieve outputs before termination, then terminate the instance explicitly. Only active instances can be terminated through the delete endpoint, and a successful delete returns no content.
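Because both output retrieval and termination depend on the instance being active, scripts usually need a bounded wait. The following is a generic sketch rather than a Fluence-specific call: wait_until is a hypothetical helper, and the probe command that decides whether the instance is active is left to the caller, since the status field name is not covered here.

```bash
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or the timeout elapses.
wait_until() {
  local timeout="$1" interval="$2"; shift 2
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example with a hypothetical probe function (not executed here):
#   wait_until 600 10 instance_is_active "$INSTANCE_ID"
```

A hard timeout matters as much as the poll itself: without it, a plan that never schedules keeps the pipeline waiting indefinitely.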
A Compact Bash Pattern
A production script should capture the instance ID immediately and guarantee cleanup with a trap:
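#!/usr/bin/env bash
set -euo pipefail

API_BASE="https://api.fluence.dev/gpu"
INSTANCE_ID=""

# Terminate the instance on every exit path, including failures.
# The delete endpoint only accepts active instances, so production code
# should wait for the active state before relying on this call.
cleanup() {
  if [[ -n "$INSTANCE_ID" ]]; then
    curl -sS -X DELETE "$API_BASE/instances/$INSTANCE_ID" \
      -H "X-API-KEY: $FLUENCE_API_KEY" || true
  fi
}
trap cleanup EXIT

# Take the first listed plan; real automation should filter and fall back.
PLAN_ID=$(curl -sS "$API_BASE/plans/" \
  -H "X-API-KEY: $FLUENCE_API_KEY" | jq -r '.[0].plan.id')

# Deploy the container and keep the raw response for ID extraction.
DEPLOY_JSON=$(curl -sS -X POST "$API_BASE/instances/" \
  -H "X-API-KEY: $FLUENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"plan_id\": \"$PLAN_ID\",
    \"name\": \"gpu-ci-job\",
    \"container_settings\": {
      \"image\": \"<registry>/<image>:<tag>\",
      \"expose\": [{\"port\": \"8080\", \"protocol\": \"tcp\"}],
      \"startup_command\": \"python run_job.py\"
    }
  }")

# "// empty" keeps INSTANCE_ID blank (not the string "null") when the
# response has no instance_id, so cleanup never targets a bogus ID.
INSTANCE_ID=$(echo "$DEPLOY_JSON" | jq -r '.instance_id // empty')
if [[ -z "$INSTANCE_ID" ]]; then
  echo "deploy failed: $DEPLOY_JSON" >&2
  exit 1
fi

# Fetch the most recent log lines.
curl -sS "$API_BASE/instances/$INSTANCE_ID/logs?last=50" \
  -H "X-API-KEY: $FLUENCE_API_KEY"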
Real automation should add plan fallback logic, status polling, retries, timeouts, structured logs, artifact retrieval, and explicit handling for validation or business logic errors.
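Two of those additions, retries and plan fallback, can be sketched generically. The helpers below are hypothetical: retry and deploy_with_fallback are illustrative names, RETRY_BASE_DELAY is an assumed tuning variable, and the deploy command is supplied by the caller. Retries are meant for transient failures. A validation error will fail the same way on every attempt.

```bash
#!/usr/bin/env bash
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Re-runs COMMAND with exponential backoff until it succeeds.
retry() {
  local max="$1"; shift
  local attempt=1 delay="${RETRY_BASE_DELAY:-2}"
  until "$@"; do
    if (( attempt >= max )); then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))      # double the wait before the next attempt
    attempt=$(( attempt + 1 ))
  done
}

# deploy_with_fallback DEPLOY_COMMAND PLAN_ID...
# Tries each plan in order; on success stores the plan in DEPLOYED_PLAN.
deploy_with_fallback() {
  local deploy="$1"; shift
  local plan
  for plan in "$@"; do
    if retry 3 "$deploy" "$plan"; then
      DEPLOYED_PLAN="$plan"
      return 0
    fi
    echo "plan $plan failed, trying next" >&2
  done
  return 1
}

# Example with a hypothetical deploy function (not executed here):
#   deploy_with_fallback deploy_container "$PRIMARY_PLAN" "$BACKUP_PLAN"
```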
CI/CD Integration and Production Practices
In CI/CD, the same lifecycle becomes a pipeline step: build or pull a CUDA-compatible image, read API and SSH credentials from secrets, choose a plan, deploy the container, poll logs and events, retrieve artifacts, and terminate the instance in a final or post step.
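A final or post step often runs in a separate shell from the deploy step, so instance IDs have to be persisted somewhere both can read. A minimal sketch, assuming a file shared between steps: INSTANCE_LOG, track_instance, and terminate_tracked are hypothetical names, and the terminate command is supplied by the caller.

```bash
#!/usr/bin/env bash
# File that survives between pipeline steps (path is an assumption).
INSTANCE_LOG="${INSTANCE_LOG:-./created-instances.txt}"

# Record an instance ID as soon as the deploy call returns it.
track_instance() {
  printf '%s\n' "$1" >> "$INSTANCE_LOG"
}

# terminate_tracked TERMINATE_COMMAND
# Calls TERMINATE_COMMAND once per recorded ID. Keeps going after a
# failure and returns non-zero if any termination failed.
terminate_tracked() {
  local terminate="$1" id failed=0
  [[ -f "$INSTANCE_LOG" ]] || return 0
  # Read on fd 3 so the terminate command cannot consume the ID list.
  while IFS= read -r id <&3; do
    [[ -n "$id" ]] || continue
    "$terminate" "$id" || { echo "failed to terminate $id" >&2; failed=1; }
  done 3< "$INSTANCE_LOG"
  return "$failed"
}

# Example with a hypothetical terminate function (not executed here):
#   deploy step:  track_instance "$INSTANCE_ID"
#   post step:    terminate_tracked terminate_instance
```

Continuing past a failed termination ensures one stuck instance does not leave the rest running, while the non-zero return still marks the post step as failed.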
The most important production rule is to make cleanup unavoidable. Track every created instance ID, run cleanup on failure and cancellation, and do not hardcode a single GPU plan without fallback. Limit exposed ports, protect private registry credentials, and validate CUDA/runtime assumptions before deployment. Fluence supports private registry credentials for docker.io and ghcr.io, with a maximum of 10 exposed ports and 64 environment variables for container settings.
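The port and environment variable limits can be checked locally before a request is sent. The sketch below assumes jq and the payload layout used in the deploy example earlier in this guide (container_settings.expose and container_settings.environment); check_container_limits is a hypothetical helper.

```bash
#!/usr/bin/env bash
# check_container_limits   (reads a deploy payload on stdin)
# Succeeds only if the payload has at most 10 exposed ports and
# at most 64 environment variables. Missing arrays count as empty.
check_container_limits() {
  jq -e '
    ((.container_settings.expose // []) | length) <= 10
    and ((.container_settings.environment // []) | length) <= 64
  ' > /dev/null
}

# Example (not executed here):
#   echo "$payload" | check_container_limits || { echo "payload exceeds limits" >&2; exit 1; }
```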
Conclusion
To automate GPU jobs safely, focus on the full lifecycle rather than the launch command alone. The durable pattern is plan discovery, access setup, deployment, monitoring, output retrieval, and termination.
A direct cloud API is a strong fit for short-lived GPU jobs, CI/CD evaluation, batch inference, and scripted experiments where explicit lifecycle control matters. Managed batch and Kubernetes remain better choices for large queues, existing platforms, and complex long-running systems.
For API-first workflows, the Fluence GPU Cloud API provides a concrete path for browsing GPU plans, deploying containers, VMs, or bare metal, inspecting logs and events, and cleaning up resources programmatically.