
CHUNK_TEXT — Document Chunking

CHUNK_TEXT is a built-in table-valued function that splits text into overlapping chunks suitable for embedding and retrieval-augmented generation (RAG) pipelines. It is markdown-aware — it prefers to break at paragraph boundaries, headings, and list items rather than mid-sentence.

SQL
SELECT chunk_index, chunk_text, chunk_pos
FROM CHUNK_TEXT(
    content,
    [max_chars],
    [overlap_chars],
    [title]
);

All parameters except content are optional and can be passed positionally or as named arguments.
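For example, the following two calls are equivalent (the document text and sizes here are illustrative):

SQL
-- Positional arguments
SELECT * FROM CHUNK_TEXT('Some document text.', 1000, 150);

-- Named arguments
SELECT *
FROM CHUNK_TEXT(
    content => 'Some document text.',
    max_chars => 1000,
    overlap_chars => 150
);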

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| content | TEXT | (required) | The document text to chunk (max 1 MB). |
| max_chars | INT | 3600 | Target maximum chunk size in characters (~900 tokens). |
| overlap_chars | INT | 540 | Number of characters to overlap between adjacent chunks (the default is 15% of max_chars). |
| title | TEXT | NULL | Document title prepended to each chunk. |
| Column | Type | Description |
|--------|------|-------------|
| chunk_index | INT | Zero-based chunk number. |
| chunk_text | TEXT | The chunk content (with the title prefix if provided). |
| chunk_pos | INT | Character offset of the chunk in the original document. |
SQL
SELECT chunk_index, length(chunk_text) AS len, chunk_pos
FROM CHUNK_TEXT(
    content => 'PostgreSQL is a powerful database system...
(long document text)',
    max_chars => 200,
    overlap_chars => 30
);
Output
 chunk_index | len | chunk_pos
-------------+-----+-----------
           0 | 127 |         0
           1 | 181 |        97
           2 | 153 |       248
           3 | 151 |       371
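The overlap is visible in the positions above: each chunk starts before the previous one ends (chunk 0 ends at offset 127, while chunk 1 starts at 97). A window function over chunk_pos makes this explicit. The query below is an illustrative sketch; it assumes no title is passed, since a title prefix would inflate length(chunk_text) relative to the document offsets:

SQL
SELECT chunk_index,
       chunk_pos,
       lag(chunk_pos + length(chunk_text)) OVER (ORDER BY chunk_index)
           - chunk_pos AS overlap_with_prev
FROM CHUNK_TEXT(
    content => '... (long document text) ...',
    max_chars => 200,
    overlap_chars => 30
);

overlap_with_prev is NULL for chunk 0 and is typically close to overlap_chars, varying with the break points actually chosen.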

When title is provided, it is prepended to each chunk so that embedding models have document context:

SQL
SELECT chunk_index, chunk_text
FROM CHUNK_TEXT(
    'Hello world. This is the document body.',
    title => 'My Document'
);
Output
 chunk_index |                    chunk_text
-------------+---------------------------------------------------
           0 | title: My Document | text: Hello world. This is...

CHUNK_TEXT identifies natural break points in the text and assigns them a score based on proximity to the target chunk boundary:

  • Paragraph breaks (double newlines) — strongest break point
  • Markdown headings (## ...) — strong break point
  • List items (- ..., 1. ...) — moderate break point
  • Sentence endings — fallback break point

Code fences (triple backticks) are treated as no-break zones — the function will not split inside a fenced code block.
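For example, in the illustrative document below the target boundary would fall inside the fence, so the split is moved to a paragraph break before or after the fenced block instead:

SQL
SELECT chunk_index, chunk_pos
FROM CHUNK_TEXT(
    content => E'Intro paragraph.\n\n```\nSELECT 1; -- fenced code\nSELECT 2;\n```\n\nClosing paragraph.',
    max_chars => 40
);
-- No chunk boundary lands between the opening and closing ``` lines.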

Combine CHUNK_TEXT with embedding() and HNSW indexes for a complete RAG pipeline:

SQL
-- 1. Create the documents and chunks tables
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT
);

CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    doc_id INT REFERENCES documents(id),
    chunk_index INT,
    chunk_text TEXT,
    embedding VECTOR(768)
);

-- 2. Chunk and embed a document
INSERT INTO chunks (doc_id, chunk_index, chunk_text, embedding)
SELECT
    1,
    c.chunk_index,
    c.chunk_text,
    embedding('default', c.chunk_text)
FROM CHUNK_TEXT(
    content => (SELECT content FROM documents WHERE id = 1),
    max_chars => 1500,
    overlap_chars => 200,
    title => (SELECT title FROM documents WHERE id = 1)
) AS c;

-- 3. Create an HNSW index for fast search
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- 4. Semantic search
SELECT chunk_text, embedding <=> embedding('default', 'How does branching work?') AS distance
FROM chunks
ORDER BY distance
LIMIT 5;
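To search within a single document, add a WHERE filter to the same query. This is the standard pgvector pattern; note that with an approximate HNSW index, a filtered query may need to examine more index candidates to fill the LIMIT:

SQL
SELECT chunk_text,
       embedding <=> embedding('default', 'How does branching work?') AS distance
FROM chunks
WHERE doc_id = 1
ORDER BY distance
LIMIT 5;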

CHUNK_TEXT works in WITH clauses for multi-step processing:

SQL
WITH raw_chunks AS (
    SELECT chunk_index, chunk_text, chunk_pos
    FROM CHUNK_TEXT(
        content => 'Long document...',
        max_chars => 2000
    )
)
SELECT chunk_index, length(chunk_text) AS size
FROM raw_chunks
WHERE length(chunk_text) > 100;
| Input | Result |
|-------|--------|
| NULL content | Zero rows returned. |
| Empty string ('') | Zero rows returned. |
| Content shorter than max_chars | Single chunk containing the full content. |
| Content exceeding 1 MB | Error. |
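Because NULL and empty content simply produce zero rows, a batch job over many documents will silently skip them, while oversized content raises an error. A guard in the driving query avoids the error. The sketch below assumes CHUNK_TEXT can appear in a LATERAL join like other set-returning functions, and that the 1 MB limit is measured in bytes (hence octet_length):

SQL
INSERT INTO chunks (doc_id, chunk_index, chunk_text)
SELECT d.id, c.chunk_index, c.chunk_text
FROM documents AS d
CROSS JOIN LATERAL CHUNK_TEXT(
    content => d.content,
    title => d.title
) AS c
WHERE d.content IS NOT NULL
  AND octet_length(d.content) <= 1024 * 1024;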