---
title: About the AI Accelerator page
summary: null
url: >-
  https://www.fastly.com/documentation/guides/platform/about-the-ai-accelerator-page
---

[AI Accelerator](https://docs.fastly.com/products/ai-accelerator) is a caching solution for artificial intelligence services from providers like OpenAI. By caching large language model (LLM) API responses and leveraging the cache for semantically similar queries, AI Accelerator can reduce latency and lower your LLM API usage costs.

## Before you begin

Be sure you know how to [access the web interface controls](https://www.fastly.com/documentation/guides/getting-started/navigating-fastly/about-the-web-interface-controls).

AI Accelerator can be enabled in the Fastly control panel by anyone assigned the role of [superuser](https://www.fastly.com/documentation/guides/account-info/user-and-account-management/about-user-roles-and-permissions). Once enabled, all account users will be able to view the metrics.

### Supported LLMs

AI Accelerator currently supports OpenAI, Azure OpenAI, Gemini, and LLMs with OpenAI-compatible APIs.

## Enabling AI Accelerator

To enable AI Accelerator, follow these steps:

1. Log in to the [Fastly control panel](https://manage.fastly.com).
2. Go to **Tools** > **AI Accelerator**.
3. Click **Enable AI Accelerator**.
4. On the Enable AI Accelerator page, click **Enable Now**.

## Configuring your application to use AI Accelerator

After AI Accelerator is enabled, you'll need to create a [read-only API token](https://www.fastly.com/documentation/guides/account-info/user-and-account-management/using-api-tokens) and update your application to use the AI Accelerator endpoint. Refer to the code examples below if you need help updating your application's code.

### OpenAI and OpenAI-compatible code examples

### Python

```python
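from openai import OpenAI

# The OpenAI API key is read from the OPENAI_API_KEY environment variable.
client = OpenAI(
    # Set the API endpoint
    base_url="https://ai.fastly.app/api.openai.com/v1",
    # Set default headers
    default_headers={
        # Replace with your Fastly API token
        "Fastly-Key": "YOUR_FASTLY_TOKEN",
    },
)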
```

### JavaScript

```javascript
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: request.env.OPENAI_API_KEY,
  baseURL: "https://ai.fastly.app/api.openai.com/v1",
  defaultHeaders: {
    // Replace with your Fastly API token
    "Fastly-Key": "YOUR_FASTLY_TOKEN",
  },
});
```

For LLMs with OpenAI-compatible APIs, use `https://ai.fastly.app/compat/openai/<llm-endpoint>` as the base URL.
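
As a minimal sketch, the base URL for an OpenAI-compatible provider can be assembled from the pattern above. The helper name `compat_base_url` and the host `llm.example.com/v1` are hypothetical, and stripping the scheme is an assumption based on the shape of the OpenAI base URL shown earlier; substitute your provider's actual endpoint.

```python
AI_ACCELERATOR_COMPAT_PREFIX = "https://ai.fastly.app/compat/openai"


def compat_base_url(llm_endpoint: str) -> str:
    """Return the AI Accelerator base URL for an OpenAI-compatible endpoint.

    A leading scheme and surrounding slashes are stripped from
    `llm_endpoint` (an assumption of this sketch, not documented behavior).
    """
    endpoint = llm_endpoint.strip()
    for scheme in ("https://", "http://"):
        if endpoint.startswith(scheme):
            endpoint = endpoint[len(scheme):]
    endpoint = endpoint.strip("/")
    if not endpoint:
        raise ValueError("llm_endpoint must not be empty")
    return f"{AI_ACCELERATOR_COMPAT_PREFIX}/{endpoint}"


# Hypothetical provider endpoint, for illustration only.
print(compat_base_url("llm.example.com/v1"))
```

The resulting string can then be passed as `base_url` in the Python example above.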

### Azure OpenAI code examples

### Python

```python
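from openai.lib.azure import AzureOpenAI

azure_key = ""  # Your Azure OpenAI API key
azure_resource = ""  # Assumed placeholder: your Azure OpenAI resource name

client = AzureOpenAI(
    api_key=azure_key,
    api_version="2024-06-01",
    azure_deployment="ai-member-4o-chat",
    azure_endpoint=f"https://ai.fastly.app/{azure_resource}.openai.azure.com",
    default_headers={
        "Fastly-Key": "YOUR_FASTLY_TOKEN",
    },
)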
```

### Gemini code examples

### Python

```python
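import vertexai
from vertexai.generative_models import GenerativeModel

project_region = ""  # Your Google Cloud region
project_id = ""  # Your Google Cloud project ID

vertexai.init(
    location=project_region,
    project=project_id,
    api_endpoint=f"ai.fastly.app/{project_region}-aiplatform.googleapis.com",
    api_transport="rest",
    # Replace with your Fastly API token
    request_metadata=[("fastly-key", "YOUR_FASTLY_TOKEN")],
)

model = GenerativeModel("gemini-pro")
print(model.generate_content("Why is the sky blue?"))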
```

### JavaScript

```javascript
const {VertexAI} = require('@google-cloud/vertexai');

async function generate_from_text_input(projectId = '') {
  const vertexAI = new VertexAI({project: projectId, location: ''});

  const generativeModel = vertexAI.getGenerativeModel({
    model: 'gemini-1.5-flash-001',
  });

  const prompt =
    "What's a good name for a flower shop that specializes in selling bouquets of dried flowers?";

  const resp = await generativeModel.generateContent(prompt);
  const contentResponse = await resp.response;
  console.log(JSON.stringify(contentResponse));
}
```

### Setting and checking headers

You can use the following request and response headers to control and monitor how AI Accelerator caches LLM responses.

| Header name            | Type            | Description                                                                                                                                                                                                                                                                    |
| ---------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `x-semantic-threshold` | Request header  | Controls the similarity threshold for responses from the semantic cache. The default is `0.75`. A lower threshold may increase the likelihood of a cached response at the risk of returning a lower quality response.                                                          |
| `x-semantic-cache-key` | Request header  | Optional user-provided value used to segment responses in the cache. A cached response is returned only when the request's `x-semantic-cache-key` matches and the similarity is above the threshold. If not set, the default value of `_default_` is used. |
| `x-settings-overrides` | Request header  | Controls whether or not semantic cache is enabled. The default is `{"semantic_cache_enabled": true}`.                                                                                                                                                                          |
| `Cache-Control`        | Request header  | Only the `max-age` directive is currently respected. If a request sets `Cache-Control` with a `max-age`, that value (in seconds) is used as the TTL for the cache entry, up to a maximum of 30 days. |
| `x-semantic-cache`     | Response header | Previously `x-cache`. Possible values are `HIT` or `MISS`.                                                                                                                                                                                                                     |
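
The sketch below shows one way to assemble these request headers and to read the `x-semantic-cache` response header. The function names `build_cache_headers` and `is_cache_hit` and the cache key `docs-bot` are hypothetical, and clamping `max-age` locally is a choice of this sketch rather than something AI Accelerator requires.

```python
import json

MAX_TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day maximum TTL from the table above


def build_cache_headers(threshold=None, cache_key=None,
                        semantic_cache_enabled=None, max_age=None):
    """Build AI Accelerator request headers; options left as None are omitted."""
    headers = {}
    if threshold is not None:
        headers["x-semantic-threshold"] = str(threshold)
    if cache_key is not None:
        headers["x-semantic-cache-key"] = cache_key
    if semantic_cache_enabled is not None:
        # Serialized as JSON, matching the default shown in the table.
        headers["x-settings-overrides"] = json.dumps(
            {"semantic_cache_enabled": semantic_cache_enabled}
        )
    if max_age is not None:
        # Clamp to the documented 30-day maximum (a choice of this sketch).
        headers["Cache-Control"] = f"max-age={min(max_age, MAX_TTL_SECONDS)}"
    return headers


def is_cache_hit(response_headers):
    """Return True when the x-semantic-cache response header reports HIT."""
    lowered = {name.lower(): value for name, value in response_headers.items()}
    return lowered.get("x-semantic-cache", "").upper() == "HIT"


headers = build_cache_headers(threshold=0.8, cache_key="docs-bot", max_age=3600)
print(headers)
```

With version 1.x of the `openai` Python package, a dictionary like this can typically be sent on a single request through the `extra_headers` argument, or on every request through `default_headers` as in the examples above.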

## About the AI Accelerator page

The AI Accelerator page provides metrics related to requests, tokens, and origin latency. The page displays the following charts:

- **Total requests:** The total number of requests sent to AI Accelerator.
- **Tokens served from cache:** The estimated number of tokens served from cache based on responses served from cache. A token is an LLM billing unit whose exact measure varies by vendor and LLM version.
- **Estimated time saved:** The estimated amount of time saved in minutes based on responses served from cache.
- **Requests:** The total number of AI Accelerator requests aggregated across your account.
- **Tokens:** The estimated number of tokens served from cache or origin.
- **Origin Latency Percentiles:** The origin latency percentile approximations.

## Purging the cache

> **IMPORTANT:** This information is part of a beta release. For additional details, read our [product and feature lifecycle](https://docs.fastly.com/products/fastly-product-lifecycle#beta) descriptions.

You can purge the entire cache by sending a `POST` request to the AI Accelerator API endpoint. For example, using curl in a terminal application:

```term copy
$ curl -X POST -H "Fastly-Key: YOUR_FASTLY_TOKEN" https://api.fastly.com/ai_accelerator/expire
```

> **IMPORTANT:** The API token must have `purge_all` scope.
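
The same request can be sketched in Python using only the standard library. The helper names `build_purge_request` and `purge_all` are hypothetical; the URL and the `Fastly-Key` header come from the curl command above.

```python
import urllib.request

PURGE_URL = "https://api.fastly.com/ai_accelerator/expire"


def build_purge_request(fastly_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that purges the cache."""
    return urllib.request.Request(
        PURGE_URL,
        method="POST",
        headers={"Fastly-Key": fastly_token},
    )


def purge_all(fastly_token: str) -> int:
    """Send the purge request and return the HTTP status code."""
    with urllib.request.urlopen(build_purge_request(fastly_token)) as response:
        return response.status


# Building the request does not contact the API; call purge_all() to send it.
request = build_purge_request("YOUR_FASTLY_TOKEN")
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect the request before any purge actually happens.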

## Disabling AI Accelerator

To disable AI Accelerator, follow these steps:

1. Update your application code to remove the AI Accelerator integration.
2. Log in to the [Fastly control panel](https://manage.fastly.com).
3. Go to **Account** > **Billing** > **Overview**.
4. Click **Options** next to AI Accelerator, then click **Cancel**.
5. Click **Cancel AI Accelerator**.

## Related content

- [AI Accelerator API documentation](https://www.fastly.com/documentation/reference/api/products/ai_accelerator/)
