Inceptron compiler, now open for early access. Auto-compile models for maximum efficiency. Join early access →


Models

MiniMax-M2.5 (MiniMax)

Mode: Inceptron Optimized
Input tokens (per 1M): $0.28
Output tokens (per 1M): $1.10
Cache read: $0.03
Tokens per sec: 40
Quantization: fp8
Size: 230B
Context: 196K
Capabilities: text, code, tool-calling, reasoning

Kimi-K2.5 (Moonshot AI)

Mode: Inceptron Optimized
Input tokens (per 1M): $0.50
Output tokens (per 1M): $2.40
Cache read: $0.12
Tokens per sec: 47
Quantization: int4
Size: 1T
Context: 262K
Capabilities: multimodal, tool-calling, reasoning

Llama-3.3-70B-Instruct (Meta)

Mode: Inceptron Optimized
Input tokens (per 1M): $0.12
Output tokens (per 1M): $0.38
Cache read: N/A
Tokens per sec: 40
Quantization: fp8
Size: 70B
Context: 131K
Capabilities: text, chat, tool-calling

Enterprise-Ready Inference

Run and scale Llama, Qwen, Kimi, and DeepSeek with SLA-backed uptime, zero-retention data handling, and pay-as-you-go pricing, with no GPU ops on your side.
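Rates in the cards on this page are quoted per million tokens, so a request's cost is each token count divided by 1,000,000 times the listed rate. A minimal sketch using the Llama-3.3-70B-Instruct rates above ($0.12 input, $0.38 output):

```python
# Cost estimate for a single request at per-million-token rates.
# Rates below are the Llama-3.3-70B-Instruct prices listed on this page.

INPUT_RATE_PER_M = 0.12   # USD per 1M input tokens
OUTPUT_RATE_PER_M = 0.38  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at per-million-token rates."""
    return (input_tokens * INPUT_RATE_PER_M
            + output_tokens * OUTPUT_RATE_PER_M) / 1_000_000

# e.g. a 2,000-token prompt with a 500-token completion:
print(f"${request_cost(2_000, 500):.6f}")  # → $0.000430
```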

gpt-oss-120b (OpenAI)

Mode: Inceptron Optimized
Input tokens (per 1M): $0.05
Output tokens (per 1M): $0.45
Cache read: $0.025
Tokens per sec: 65
Quantization: fp8
Size: 120B
Context: 131K
Capabilities: text, code, tool-calling, reasoning

DeepSeek-V3.2 (DeepSeek)

Mode: Inceptron Optimized
Input tokens (per 1M): $0.269
Output tokens (per 1M): $0.40
Cache read: $0.13
Tokens per sec: 30
Quantization: fp8
Size: 685B
Context: 163K
Capabilities: text, code, tool-calling, reasoning

Mode: Inceptron Optimized
Input tokens (per 1M): $0.80
Output tokens (per 1M): $2.56
Cache read: $0.20
Tokens per sec: 50
Quantization: fp8
Size: 744B
Context: 200K
Capabilities: text, code, tool-calling, reasoning

DeepSeek-R1-0528 (DeepSeek)

Mode: Inceptron Optimized
Input tokens (per 1M): $0.50
Output tokens (per 1M): $2.00
Cache read: N/A
Tokens per sec: 20
Quantization: fp8
Size: 685B
Context: 164K
Capabilities: JSON mode, MoE, code, reasoning
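DeepSeek-R1-0528 lists JSON mode among its capabilities. On OpenAI-compatible APIs this is conventionally requested with a `response_format` field; a sketch under that assumption (the field name and the exact model ID are assumptions, not confirmed by this page):

```python
# Hypothetical JSON-mode payload builder; assumes the endpoint is
# OpenAI-compatible and accepts the conventional "response_format" field.

def build_json_mode_payload(prompt: str) -> dict:
    """Build a chat-completions payload that asks for a JSON-object reply."""
    return {
        "model": "deepseek-ai/DeepSeek-R1-0528",  # assumed model ID format
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},  # assumed OpenAI-style field
    }
```

POST the JSON-encoded payload to `https://api.inceptron.io/v1/chat/completions` with the same headers as the curl example on this page.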

Run any model on the fastest endpoints

Use our API to deploy any model on one of the most cost-efficient inference stacks available.

Scale seamlessly to a dedicated deployment at any time for optimal throughput.



Curl


curl https://api.inceptron.io/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $INCEPTRON_API_KEY" \
-d '{
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": "How many moons are there in the Solar System?"
    }
  ]
}'
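The same request in Python, using only the standard library. This assumes the API is OpenAI-compatible, which the `/v1/chat/completions` path and response shape suggest but this page does not state outright:

```python
# Python equivalent of the curl call above; standard library only.
# Assumes an OpenAI-compatible endpoint and response shape.
import json
import os
import urllib.request

API_URL = "https://api.inceptron.io/v1/chat/completions"

def chat(prompt: str, model: str = "meta-llama/Llama-3.3-70B-Instruct") -> str:
    """Send one user message and return the assistant's reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['INCEPTRON_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires INCEPTRON_API_KEY to be set):
# print(chat("How many moons are there in the Solar System?"))
```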
