
CHUNK_TEXT — Document Chunking

CHUNK_TEXT is a built-in table-valued function that splits text into overlapping chunks suitable for embedding and retrieval-augmented generation (RAG) pipelines. It is markdown-aware — it prefers to break at paragraph boundaries, headings, and list items rather than mid-sentence.

SQL
SELECT chunk_index, chunk_text, chunk_pos
FROM CHUNK_TEXT(
    content,
    [max_chars],
    [overlap_chars],
    [title]
);

All parameters except content are optional and can be passed positionally or as named arguments.
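For example, the following two calls are equivalent (the document text and sizes here are illustrative):

SQL
-- Positional arguments
SELECT * FROM CHUNK_TEXT('Some document text.', 1000, 150);

-- Named arguments
SELECT *
FROM CHUNK_TEXT(
    content => 'Some document text.',
    max_chars => 1000,
    overlap_chars => 150
);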

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| content | TEXT | (required) | The document text to chunk (max 1 MB). |
| max_chars | INT | 3600 | Target maximum chunk size in characters (~900 tokens). |
| overlap_chars | INT | 540 | Number of characters to overlap between adjacent chunks (the default is 15% of max_chars). |
| title | TEXT | NULL | Document title prepended to each chunk. |
| Column | Type | Description |
|--------|------|-------------|
| chunk_index | INT | Zero-based chunk number. |
| chunk_text | TEXT | The chunk content (with the title prefix if provided). |
| chunk_pos | INT | Character offset of the chunk in the original document. |
SQL
SELECT chunk_index, length(chunk_text) AS len, chunk_pos
FROM CHUNK_TEXT(
    content => 'PostgreSQL is a powerful database system...
(long document text)',
    max_chars => 200,
    overlap_chars => 30
);
Output
 chunk_index | len | chunk_pos
-------------+-----+-----------
           0 | 127 |         0
           1 | 181 |        97
           2 | 153 |       248
           3 | 151 |       371
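The overlap is visible in the positions above: each chunk starts before the previous one ends (chunk 0 ends at offset 127, while chunk 1 starts at 97). A window function over chunk_pos makes this explicit. The query below is an illustrative sketch; it assumes no title is passed, since a title prefix would inflate length(chunk_text) relative to the document offsets:

SQL
SELECT chunk_index,
       chunk_pos,
       lag(chunk_pos + length(chunk_text)) OVER (ORDER BY chunk_index)
           - chunk_pos AS overlap_with_prev
FROM CHUNK_TEXT(
    content => '... (long document text) ...',
    max_chars => 200,
    overlap_chars => 30
);

overlap_with_prev is NULL for chunk 0 and is typically close to overlap_chars, varying with the break points actually chosen.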

When title is provided, it is prepended to each chunk so that embedding models have document context:

SQL
SELECT chunk_index, chunk_text
FROM CHUNK_TEXT(
    'Hello world. This is the document body.',
    title => 'My Document'
);
Output
 chunk_index |                    chunk_text
-------------+---------------------------------------------------
           0 | title: My Document | text: Hello world. This is...

CHUNK_TEXT identifies natural break points in the text and assigns them a score based on proximity to the target chunk boundary:

  • Paragraph breaks (double newlines) — strongest break point
  • Markdown headings (## ...) — strong break point
  • List items (- ..., 1. ...) — moderate break point
  • Sentence endings — fallback break point

Code fences (triple backticks) are treated as no-break zones — the function will not split inside a fenced code block.
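For example, in the illustrative document below the target boundary would fall inside the fence, so the split is moved to a paragraph break before or after the fenced block instead:

SQL
SELECT chunk_index, chunk_pos
FROM CHUNK_TEXT(
    content => E'Intro paragraph.\n\n```\nSELECT 1; -- fenced code\nSELECT 2;\n```\n\nClosing paragraph.',
    max_chars => 40
);
-- No chunk boundary lands between the opening and closing ``` lines.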

Combine CHUNK_TEXT with embedding() and HNSW indexes for a complete RAG pipeline:

SQL
-- 1. Create the documents and chunks tables
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT
);

CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    doc_id INT REFERENCES documents(id),
    chunk_index INT,
    chunk_text TEXT,
    embedding VECTOR(768)
);

-- 2. Chunk and embed a document
INSERT INTO chunks (doc_id, chunk_index, chunk_text, embedding)
SELECT
    1,
    c.chunk_index,
    c.chunk_text,
    embedding('default', c.chunk_text)
FROM CHUNK_TEXT(
    content => (SELECT content FROM documents WHERE id = 1),
    max_chars => 1500,
    overlap_chars => 200,
    title => (SELECT title FROM documents WHERE id = 1)
) AS c;

-- 3. Create an HNSW index for fast search
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- 4. Semantic search
SELECT chunk_text, embedding <=> embedding('default', 'How does branching work?') AS distance
FROM chunks
ORDER BY distance
LIMIT 5;
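To search within a single document, add a WHERE filter to the same query. This is the standard pgvector pattern; note that with an approximate HNSW index, a filtered query may need to examine more index candidates to fill the LIMIT:

SQL
SELECT chunk_text,
       embedding <=> embedding('default', 'How does branching work?') AS distance
FROM chunks
WHERE doc_id = 1
ORDER BY distance
LIMIT 5;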

CHUNK_TEXT works in WITH clauses for multi-step processing:

SQL
WITH raw_chunks AS (
    SELECT chunk_index, chunk_text, chunk_pos
    FROM CHUNK_TEXT(
        content => 'Long document...',
        max_chars => 2000
    )
)
SELECT chunk_index, length(chunk_text) AS size
FROM raw_chunks
WHERE length(chunk_text) > 100;
| Input | Result |
|-------|--------|
| NULL content | Zero rows returned. |
| Empty string ('') | Zero rows returned. |
| Content shorter than max_chars | Single chunk containing the full content. |
| Content exceeding 1 MB | Error. |
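Because NULL and empty content simply produce zero rows, a batch job over many documents will silently skip them, while oversized content raises an error. A guard in the driving query avoids the error. The sketch below assumes CHUNK_TEXT can appear in a LATERAL join like other set-returning functions, and that the 1 MB limit is measured in bytes (hence octet_length):

SQL
INSERT INTO chunks (doc_id, chunk_index, chunk_text)
SELECT d.id, c.chunk_index, c.chunk_text
FROM documents AS d
CROSS JOIN LATERAL CHUNK_TEXT(
    content => d.content,
    title => d.title
) AS c
WHERE d.content IS NOT NULL
  AND octet_length(d.content) <= 1024 * 1024;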