CHUNK_TEXT — Document Chunking
CHUNK_TEXT is a built-in table-valued function that splits text into overlapping chunks suitable for embedding and retrieval-augmented generation (RAG) pipelines. It is markdown-aware — it prefers to break at paragraph boundaries, headings, and list items rather than mid-sentence.
Syntax
Section titled “Syntax”SELECT chunk_index, chunk_text, chunk_posFROM CHUNK_TEXT( content, [max_chars], [overlap_chars], [title]);All parameters except content are optional and can be passed positionally or as named arguments.
| Parameter | Type | Default | Description |
|---|---|---|---|
content | TEXT | — | The document text to chunk (max 1 MB). |
max_chars | INT | 3600 | Target maximum chunk size in characters (~900 tokens). |
overlap_chars | INT | 540 | Number of characters to overlap between adjacent chunks (default: 15% of max_chars). |
title | TEXT | NULL | Document title prepended to each chunk. |
Output Columns
Section titled “Output Columns”| Column | Type | Description |
|---|---|---|
chunk_index | INT | Zero-based chunk number. |
chunk_text | TEXT | The chunk content (with title prefix if provided). |
chunk_pos | INT | Character offset of this chunk in the original document. |
Basic Usage
Section titled “Basic Usage”SELECT chunk_index, length(chunk_text) AS len, chunk_posFROM CHUNK_TEXT( content => 'PostgreSQL is a powerful database system... (long document text)', max_chars => 200, overlap_chars => 30); chunk_index | len | chunk_pos-------------+-----+----------- 0 | 127 | 0 1 | 181 | 97 2 | 153 | 248 3 | 151 | 371With Title
Section titled “With Title”When title is provided, it is prepended to each chunk so that embedding models have document context:
SELECT chunk_index, chunk_textFROM CHUNK_TEXT( 'Hello world. This is the document body.', title => 'My Document'); chunk_index | chunk_text-------------+--------------------------------------------------- 0 | title: My Document | text: Hello world. This is...Smart Splitting
Section titled “Smart Splitting”CHUNK_TEXT identifies natural break points in the text and assigns them a score based on proximity to the target chunk boundary:
- Paragraph breaks (double newlines) — strongest break point
- Markdown headings (
## ...) — strong break point - List items (
- ...,1. ...) — moderate break point - Sentence endings — fallback break point
Code fences (triple backticks) are treated as no-break zones — the function will not split inside a fenced code block.
RAG Pipeline Example
Section titled “RAG Pipeline Example”Combine CHUNK_TEXT with embedding() and HNSW indexes for a complete RAG pipeline:
-- 1. Create the documents and chunks tablesCREATE TABLE documents ( id SERIAL PRIMARY KEY, title TEXT, content TEXT);
CREATE TABLE chunks ( id SERIAL PRIMARY KEY, doc_id INT REFERENCES documents(id), chunk_index INT, chunk_text TEXT, embedding VECTOR(768));
-- 2. Chunk and embed a documentINSERT INTO chunks (doc_id, chunk_index, chunk_text, embedding)SELECT 1, c.chunk_index, c.chunk_text, embedding('default', c.chunk_text)FROM CHUNK_TEXT( content => (SELECT content FROM documents WHERE id = 1), max_chars => 1500, overlap_chars => 200, title => (SELECT title FROM documents WHERE id = 1)) AS c;
-- 3. Create an HNSW index for fast searchCREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
-- 4. Semantic searchSELECT chunk_text, embedding <=> embedding('default', 'How does branching work?') AS distanceFROM chunksORDER BY distanceLIMIT 5;Use in CTEs
Section titled “Use in CTEs”CHUNK_TEXT works in WITH clauses for multi-step processing:
WITH raw_chunks AS ( SELECT chunk_index, chunk_text, chunk_pos FROM CHUNK_TEXT( content => 'Long document...', max_chars => 2000 ))SELECT chunk_index, length(chunk_text) AS sizeFROM raw_chunksWHERE length(chunk_text) > 100;Edge Cases
Section titled “Edge Cases”| Input | Result |
|---|---|
NULL content | Zero rows returned. |
Empty string '' | Zero rows returned. |
Content shorter than max_chars | Single chunk with the full content. |
| Content exceeding 1 MB | Error. |
Related Docs
Section titled “Related Docs”- Vector Search — embedding and HNSW indexes
- Built-in Functions — all SQL functions
- fs9 — File System — ingest files to chunk from the filesystem