How to Use Serverless Inference on DigitalOcean Gradient™ AI Platform
Validated on 9 Feb 2026 • Last edited on 27 Feb 2026
DigitalOcean Gradient™ AI Platform lets you build fully-managed AI agents with knowledge bases for retrieval-augmented generation, multi-agent routing, guardrails, and more, or use serverless inference to make direct requests to popular foundation models.
Serverless inference lets you send API requests directly to foundation models without creating or managing an AI agent. The model generates responses without any initial instructions or configuration.
The serverless inference API is available at https://inference.do-ai.run and has the following endpoints:
| Endpoint | Verb | Description |
|----------|------|-------------|
| /v1/models | GET | Returns a list of available models and their IDs. |
| /v1/chat/completions | POST | Sends chat-style prompts and returns model responses. |
| /v1/responses | POST | Sends chat-style prompts and returns text or multimodal model responses. |
| /v1/images/generations | POST | Generates images from text prompts. |
| /v1/async-invoke | POST | Sends text, image, or text-to-speech generation requests to fal models. |
We support both /v1/chat/completions and /v1/responses endpoints for sending prompts. Choose the endpoint that best fits your use case:
Use /v1/chat/completions when building or maintaining chat-style integrations that rely on structured messages with roles such as system, user, and assistant, or when migrating existing chat-based code with minimal changes.
Use /v1/responses when building new integrations or working with newer models that only support the Responses API. It’s also useful for multi-step tool use in a single request, preserving state across turns with store: true, and simplifying requests by using a single input field with improved caching efficiency.
You can use these endpoints through cURL, Python OpenAI, or Gradient SDK.
Retrieve Available Models
The following cURL, Python OpenAI, and Gradient SDK examples show how to retrieve available models.
Send a GET request to the /v1/models endpoint using your model access key.
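Assuming your model access key is exported as the MODEL_ACCESS_KEY environment variable, the request looks like this:

```bash
curl -X GET https://inference.do-ai.run/v1/models \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY"
```

The response is a JSON list of model objects whose id values you can pass as the model parameter in later requests.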
Send Prompt to a Model Using the Chat Completions API
The following cURL, Python OpenAI, and Gradient SDK examples show how to send a prompt to a model. Include your model access key and the following in your request:
model: The model ID of the model you want to use. Get the model ID using /v1/models or on the available models page.
messages: The input prompt or conversation history. Serverless inference does not have sessions, so include all relevant context using this field.
temperature: A value between 0.0 and 1.0 to control randomness and creativity.
max_completion_tokens: The maximum number of tokens to generate in the response. Use this to manage output length and cost.
For Anthropic models, we recommend you specify this parameter for better accuracy and control of the model response. For models by other providers, this parameter is optional and defaults to around 2048 tokens.
max_tokens: This parameter is deprecated. Use max_completion_tokens instead to control the size of the generated response.
You can also use prompt caching and reasoning parameters in your request. For examples, see Use Prompt Caching and Use Reasoning.
Send a POST request to the /v1/chat/completions endpoint using your model access key.
The following example request sends a prompt to a Llama 3.3 Instruct-70B model with the prompt What is the capital of France?, a temperature of 0.7, and maximum number of tokens set to 256.
curl -X POST https://inference.do-ai.run/v1/chat/completions \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3-70b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "temperature": 0.7,
    "max_completion_tokens": 256
  }'
The response includes the generated text and token usage details:
{"choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"audio":null,"content":"The capital of France is Paris.","refusal":null,"role":""}}],"created":1747247763,"id":"","model":"llama3.3-70b-instruct","object":"chat.completion","service_tier":null,"usage":{"completion_tokens":8,"prompt_tokens":43,"total_tokens":51}}
Using the OpenAI Python SDK:

from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI(
    base_url="https://inference.do-ai.run/v1/",
    api_key=os.getenv("MODEL_ACCESS_KEY"),
)

resp = client.chat.completions.create(
    model="llama3-8b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a fun fact about octopuses."},
    ],
)
print(resp.choices[0].message.content)

Using the Gradient SDK:

from gradient import Gradient
from dotenv import load_dotenv
import os

load_dotenv()

client = Gradient(model_access_key=os.getenv("MODEL_ACCESS_KEY"))

resp = client.chat.completions.create(
    model="llama3-8b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a fun fact about octopuses."},
    ],
)
print(resp.choices[0].message.content)
Use Reasoning
For models that support reasoning, you can pass a reasoning parameter in the request body, either in the OpenAI format using reasoning_effort or in the Anthropic format using reasoning.effort. The reasoning effort can be set to none, low, medium, high, or max.
The following cURL example shows how to specify reasoning effort for the Claude Opus 4.5 model in the Anthropic format:
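This request matches the step-by-step output shown next; the prompt, effort level, and token limit are illustrative:

```bash
curl -X POST https://inference.do-ai.run/v1/chat/completions \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic-claude-opus-4.5",
    "messages": [
      {"role": "user", "content": "Calculate 27 * 453 step by step."}
    ],
    "reasoning": {"effort": "medium"},
    "max_completion_tokens": 1024
  }'
```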
The output shows the response broken down step by step, as requested in the model prompt:
{"choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"# Calculating 27 × 453\n\nI'll break this into smaller parts:\n\n**Step 1:** Break down 453 into 400 + 50 + 3\n\n**Step 2:** Multiply each part by 27\n- 27 × 400 = 10,800\n- 27 × 50 = 1,350\n- 27 × 3 = 81\n\n**Step 3:** Add the results\n- 10,800 + 1,350 + 81 = **12,231**","reasoning_content":"I need to calculate 27 * 453.\n\nLet me break this down step by step.\n\n27 * 453 = 27 * (400 + 50 + 3)\n= 27 * 400 + 27 * 50 + 27 * 3\n\n27 * 400 = 10,800\n27 * 50 = 1,350\n27 * 3 = 81\n\n10,800 + 1,350 + 81 = 12,231","refusal":null,"role":"assistant"}}],"created":1771946745,"id":"","model":"anthropic-claude-opus-4.5",...}
Note
For Anthropic models, if you omit the max_tokens parameter for reasoning, we calculate the reasoning token budget as the following percentage of the total tokens passed in max_completion_tokens:

| Effort Level | Reasoning Token Budget (% of max_completion_tokens) |
|--------------|-----------------------------------------------------|
| low | 20% |
| medium | 50% |
| high | 80% |
| max | 95% |
The following cURL example shows how to specify reasoning effort for the Claude Sonnet 4.6 model in the OpenAI format:
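A representative request using the top-level reasoning_effort field; the anthropic-claude-sonnet-4.6 model ID is an assumption here, so confirm the exact ID with /v1/models:

```bash
curl -X POST https://inference.do-ai.run/v1/chat/completions \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic-claude-sonnet-4.6",
    "messages": [
      {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}
    ],
    "reasoning_effort": "low",
    "max_completion_tokens": 1024
  }'
```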
Send Prompt to a Model Using the Responses API
The following cURL, Python OpenAI, and Gradient SDK examples show how to send a prompt using the /v1/responses endpoint. Include your model access key and the following in your request:
model: The model ID of the model you want to use. Get the model ID using /v1/models or on the available models page.
input: The prompt or input content you want the model to respond to.
max_output_tokens: The maximum number of tokens to generate in the response.
temperature: A value between 0.0 and 1.0 to control randomness and creativity.
Send a POST request to the /v1/responses endpoint using your model access key.
The following example request sends a prompt to an OpenAI GPT-OSS-20B model with the prompt What is the capital of France?, a temperature of 0.7, and maximum number of output tokens set to 50.
curl -sS -X POST https://inference.do-ai.run/v1/responses \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai-gpt-oss-20b",
    "input": "What is the capital of France?",
    "max_output_tokens": 50,
    "temperature": 0.7,
    "stream": false
  }'
The response includes structured output and token usage details:
{..."output":[{"content":[{"text":"We need to answer: The capital of France is Paris. This is straightforward.","type":"reasoning_text"}],...},{"content":[{"text":"The capital of France is **Paris**.","type":"output_text"}],...}],..."usage":{"input_tokens":72,"input_tokens_details":{"cached_tokens":32},"output_tokens":35,"output_tokens_details":{"reasoning_tokens":17,"tool_output_tokens":0},"total_tokens":107},...}
Using the OpenAI Python SDK:

from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI(
    base_url="https://inference.do-ai.run/v1/",
    api_key=os.getenv("MODEL_ACCESS_KEY"),
)

resp = client.responses.create(
    model="openai-gpt-oss-20b",
    input="What is the capital of France?",
    max_output_tokens=50,
    temperature=0.7,
)
print(resp.output[1].content[0].text)

Using the Gradient SDK:

from gradient import Gradient
from dotenv import load_dotenv
import os

load_dotenv()

client = Gradient(model_access_key=os.getenv("MODEL_ACCESS_KEY"))

resp = client.responses.create(
    model="openai-gpt-oss-20b",
    input="What is the capital of France?",
    max_output_tokens=50,
    temperature=0.7,
)
print(resp.output[1].content[0].text)
Use Prompt Caching in Chat Completions and Responses API
Anthropic Models
Use prompt caching for Anthropic models in the chat completions API. Specify the cache_control parameter with type: ephemeral and ttl in your JSON request body. The ttl value can be 5m (default) or 1h. The following request body examples show how to use the cache_control parameter.
...{"role":"user","content":{"type":"text","text":"This is cached for 1h.","cache_control":{"type":"ephemeral","ttl":"1h"}}}
...{"role":"developer","content":[{"type":"text","text":"Cache this segment for 5 minutes.","cache_control":{"type":"ephemeral","ttl":"5m"}},{"type":"text","text":"Do not cache this segment"}]}
...{"role":"tool","tool_call_id":"tool_call_id","content":[{"type":"text","text":"Tool output cached for 5m.","cache_control":{"type":"ephemeral","ttl":"5m"}}]}
The JSON response looks similar to the following and shows the number of input tokens cached during this request:
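An illustrative response; the values are representative, not exact, and the usage fields report how many input tokens were written to the cache:

```json
{"id":"chatcmpl-abc123","object":"chat.completion","model":"anthropic-claude-opus-4.5","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"..."}}],"usage":{"prompt_tokens":1500,"completion_tokens":42,"total_tokens":1542,"cache_read_input_tokens":0,"cache_created_input_tokens":1450,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":1450}}}
```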
OpenAI Models
Use prompt caching for OpenAI models for prompts containing 1024 tokens or more in both the chat completions and responses APIs. Caching applies when the input tokens of a request match the input tokens of a previous request, though caching is best-effort and not guaranteed.
To use prompt caching, specify the prompt_cache_retention parameter as either in_memory or 24h. The following request body example shows how to use the prompt_cache_retention parameter:
...{"model":"gpt-4o-mini","prompt_cache_retention":"24h","messages":[{"role":"system","content":"You are a helpful assistant that summarizes text."},{"role":"user","content":"Summarize the following text:\n\nArtificial intelligence is transforming industries by automating tasks, improving efficiency, and enabling new innovations..."}],"temperature":0.2}
The JSON response looks similar to the following and shows the number of input tokens cached during this request:
{"id":"chatcmpl-xyz789","object":"chat.completion","created":1772134300,"model":"gpt-4o-mini","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Artificial intelligence is reshaping industries by automating processes, increasing efficiency, and enabling innovation."}}],"usage":{"prompt_tokens":1200,"completion_tokens":35,"total_tokens":1235,"cache_read_input_tokens":0,"cache_created_input_tokens":1200,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":1200}}}
If you send the request again within the retention window, cached input tokens are used and the response looks like this:
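Illustratively, a repeat of the earlier request within the retention window reports the tokens under cache_read_input_tokens instead (values representative):

```json
{"id":"chatcmpl-xyz790","object":"chat.completion","model":"gpt-4o-mini","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Artificial intelligence is reshaping industries by automating processes, increasing efficiency, and enabling innovation."}}],"usage":{"prompt_tokens":1200,"completion_tokens":35,"total_tokens":1235,"cache_read_input_tokens":1200,"cache_created_input_tokens":0}}
```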
Generate Images
The following cURL, Python OpenAI, and Gradient SDK examples show how to generate an image from a text prompt. Include your model access key and the following in your request:
model: The model ID of the image generation model you want to use. Get the model ID using /v1/models or on the available models page.
prompt: The text prompt to generate the image from.
n: The number of images to generate. Must be between 1 and 10.
size: The desired dimensions of the generated image. Supported values are 256x256, 512x512, and 1024x1024.
Make sure to always specify n and size when generating images.
Send a POST request to the /v1/images/generations endpoint using your model access key.
The following example request sends a prompt to the openai-gpt-image-1 model to generate an image of a baby sea otter floating on its back in calm blue water, with an image size of 1024x1024:
curl -X POST https://inference.do-ai.run/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -d '{
    "model": "openai-gpt-image-1",
    "prompt": "A cute baby sea otter floating on its back in calm blue water",
    "n": 1,
    "size": "1024x1024"
  }'
The response includes a JSON object with a Base64 image string and other details such as image format and tokens used:
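An abridged, illustrative example of the response shape; the b64_json value is truncated here, and fields other than data[0].b64_json may vary by model:

```json
{"created":1772134300,"data":[{"b64_json":"iVBORw0KGgoAAAANSUhEUgAA..."}],"output_format":"png","usage":{"input_tokens":15,"output_tokens":1056,"total_tokens":1071}}
```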
If you want to save the image as a file, pipe the image string to a file using jq and base64:
curl -X POST https://inference.do-ai.run/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -d '{
    "model": "openai-gpt-image-1",
    "prompt": "A cute baby sea otter floating on its back in calm blue water",
    "n": 1,
    "size": "1024x1024"
  }' | jq -r '.data[0].b64_json' | base64 --decode > sea_otter.png
An image named sea_otter.png is created in your current directory after a few seconds.
Using the OpenAI Python SDK:

from openai import OpenAI
from dotenv import load_dotenv
import os, base64

load_dotenv()

client = OpenAI(
    base_url="https://inference.do-ai.run/v1/",
    api_key=os.getenv("MODEL_ACCESS_KEY"),
)

result = client.images.generate(
    model="openai-gpt-image-1",
    prompt="A cute baby sea otter, children’s book drawing style",
    size="1024x1024",
    n=1,
)

b64 = result.data[0].b64_json
with open("sea_otter.png", "wb") as f:
    f.write(base64.b64decode(b64))
print("Saved sea_otter.png")

Using the Gradient SDK:

from gradient import Gradient
from dotenv import load_dotenv
import os, base64

load_dotenv()

client = Gradient(model_access_key=os.getenv("MODEL_ACCESS_KEY"))

result = client.images.generations.create(
    model="openai-gpt-image-1",
    prompt="A cute baby sea otter, children’s book drawing style",
    size="1024x1024",
    n=1,
)

b64 = result.data[0].b64_json
with open("sea_otter.png", "wb") as f:
    f.write(base64.b64decode(b64))
print("Saved sea_otter.png")
Generate Image, Audio, or Text-to-Speech Using fal Models
The following examples show how to generate an image or audio clip, or use text-to-speech with fal models with the /v1/async-invoke endpoint.
The following example sends a request to generate an image using the fal-ai/flux/schnell model.
curl -X POST https://inference.do-ai.run/v1/async-invoke \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "fal-ai/flux/schnell",
    "input": { "prompt": "A futuristic city at sunset" }
  }'
You can update the image generation request to also include the output format, number of inference steps, guidance scale, number of images to generate, and safety checker option:
curl -X POST https://inference.do-ai.run/v1/async-invoke \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "fal-ai/fast-sdxl",
    "input": {
      "prompt": "A futuristic cityscape at sunset, with flying cars and towering skyscrapers.",
      "output_format": "landscape_4_3",
      "num_inference_steps": 4,
      "guidance_scale": 3.5,
      "num_images": 1,
      "enable_safety_checker": true
    },
    "tags": [
      {"key": "type", "value": "test"}
    ]
  }'
The following example sends a request to generate a 60-second audio clip using the fal-ai/stable-audio-25/text-to-audio model:
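A representative request; the seconds_total input field and the prompt are assumptions here, so confirm the model's input schema before using them:

```bash
curl -X POST https://inference.do-ai.run/v1/async-invoke \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "fal-ai/stable-audio-25/text-to-audio",
    "input": {
      "prompt": "Calm ambient music with soft synth pads",
      "seconds_total": 60
    }
  }'
```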
When you send a request to the /v1/async-invoke endpoint, it starts an asynchronous job for the image, audio, or text-to-speech generation and returns a request_id. The job status is QUEUED initially and the response looks similar to the following:
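An illustrative example (the request_id value is a placeholder):

```json
{"request_id":"4f6c9e2a-...","status":"QUEUED"}
```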
Poll the status endpoint periodically using the request_id to check the progress of the job:
curl -X GET "https://inference.do-ai.run/v1/async-invoke/<request_id>/status" \
-H "Authorization: Bearer $MODEL_ACCESS_KEY"
When the job completes, the status updates to COMPLETE. You can then use the /v1/async-invoke/<request_id> endpoint to fetch the complete generated result:
curl -X GET "https://inference.do-ai.run/v1/async-invoke/<request_id>" \
-H "Authorization: Bearer $MODEL_ACCESS_KEY"
The response includes a URL to the generated image, audio, or text-to-speech file, which you can download or open directly in your browser or app:
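An illustrative example for an image job; the exact result structure varies by model, but image models typically return an images array of file URLs:

```json
{"request_id":"4f6c9e2a-...","status":"COMPLETE","result":{"images":[{"url":"https://...","content_type":"image/png","width":1024,"height":1024}]}}
```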
Alternatively, you can call serverless inference from your automation workflows. The n8n community node connects to any DigitalOcean-hosted model using your model access key. You can self-host n8n using the n8n Marketplace app.
Model Access Keys
You can create and manage model access keys in the Model Access Keys section of the Serverless inference page in the DigitalOcean Control Panel or using the API.
Create Keys
To create a model access key, click Create model access key to open the Add model access key window. In the Key name field, enter a name for your model access key, then click Add model access key.
Your new model access key with its creation date appears in the Model Access Keys section. The secret key is visible only once, immediately after creation, so copy and store it securely.
Model access keys are private and incur usage-based charges. Do not share them or expose them in front-end code. We recommend storing them using a secrets manager (for example, AWS Secrets Manager, HashiCorp Vault, or 1Password) or a secure environment variable in your deployment configuration.
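As a minimal sketch, assuming the key is exported in the MODEL_ACCESS_KEY environment variable as in the earlier examples, application code can read it at runtime instead of hard-coding it:

```python
import os

def get_model_access_key() -> str:
    # Read the key from the environment; fail fast if it is missing
    # rather than sending unauthenticated requests.
    key = os.environ.get("MODEL_ACCESS_KEY")
    if not key:
        raise RuntimeError("MODEL_ACCESS_KEY is not set")
    return key

# Pass the key as a bearer token at request time, never in source code:
# headers = {"Authorization": f"Bearer {get_model_access_key()}"}
```

The same pattern works with a secrets manager by swapping the environment lookup for the manager's client call.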
How to Create Model API Key Using the DigitalOcean API
Rename Keys
Renaming a model access key can help you organize and manage your keys more effectively, especially when using multiple keys for different projects or environments.
To rename a key, click … to the right of the key in the list to open the key’s menu, then click Rename. In the Rename model access key window that opens, in the Key name field, enter a new name for your key and then click UPDATE.
How to Rename Model API Key Using the DigitalOcean API
Regenerate Keys
Regenerating a model access key creates a new secret key and immediately and permanently invalidates the previous one. If a key has been compromised or you want to rotate keys for security purposes, regenerate the key, then update any applications using the previous key to use the new key.
To regenerate a key, click … to the right of the key in the list to open the key’s menu, then click Regenerate. In the Regenerate model access key window that opens, enter the name of your key to confirm the action, then click Regenerate access key. Your new secret key is displayed in the Model Access Keys section.
How to Regenerate Model API Key Using the DigitalOcean API
Delete Keys
Deleting a model access key permanently and irreversibly destroys it. Any applications using a destroyed key lose access to the API.
To delete a key, click … to the right of the key in the list to open the key’s menu, then click Delete. In the Delete model access key window that opens, type the name of the key to confirm the deletion, then click Delete access key.
How to Delete Model API Key Using the DigitalOcean API