String Table

The string table is a core component of the binary format that provides string deduplication.

Purpose

In a typical document with 1,000 user records, field values like "active", "Engineering", or city names repeat frequently. Without deduplication, each occurrence stores the full string. The string table stores each unique string once and uses 4-byte indices everywhere else.

Structure

┌─────────────────────────────┐
│ Total Size: u32              │  Size of the entire string table section
│ Count: u32                   │  Number of unique strings
├─────────────────────────────┤
│ Offsets: [u32 × Count]       │  Byte offset of each string in the data section
│ Lengths: [u32 × Count]       │  Length of each string (up to 4 GB)
├─────────────────────────────┤
│ String Data: [u8...]         │  Concatenated UTF-8 string data
└─────────────────────────────┘

How It Works

During Compilation

The writer traverses all values in the document
Every unique string is collected (keys, string values, schema names, field names, ref names, tag names)
Duplicates are eliminated
Each string gets an index (0, 1, 2, …)
The string table is written first in the file
All subsequent encoding uses indices instead of raw strings

During Reading

The reader loads the string table at startup
When decoding a string value, it reads a u32 index
The index maps to an offset and length in the string data
The string is read from the data section

Lookup Performance

String table access is O(1) by index:

index → offsets[index] → offset in data section
index → lengths[index] → number of bytes to read
string = data[offset..offset+length]

Size Impact

Example: 1,000 Users with 5 Fields

Without deduplication:

Field names repeated 1,000 times each
Common values (“active”, “Engineering”) repeated many times
Estimated overhead: ~20-30 KB just for repeated strings

With string table:

Each unique string stored once
References are 4 bytes each
Estimated savings: 60-80% on string data

Extreme Case: Large Tabular Data

For 10,000 rows with 10 fields, field names alone would consume:

Approach	Field Name Storage
JSON (per-field)	~10 × 10,000 × avg(8 bytes) = ~800 KB
TeaLeaf (string table)	10 × avg(8 bytes) + 100,000 × 4 bytes = ~400 KB
TeaLeaf with schema	10 × avg(8 bytes) = ~80 bytes (field names in schema only!)

With schema-typed data, field names appear only in the schema table – the string table contains only the actual string values.

What Gets Deduplicated

String Source	Deduplicated?
Top-level key names	Yes
Object field names	Yes
String values	Yes
Schema names	Yes
Schema field names	Yes
Reference names	Yes
Tag names	Yes

Maximum String Length

String lengths are stored as u32, supporting individual strings up to ~4 GB. The total string table size (all strings + metadata) is also capped at u32::MAX by the table’s Size header field.

Interaction with Compression

The string table itself is not compressed (it’s needed for decoding). However, data sections that reference the string table benefit doubly:

String references are 4 bytes (already compact)
ZLIB compression can further compress repetitive index patterns

Implementation Note

The string table uses a HashMap<String, u32> during compilation for O(1) dedup lookups. The final table is written as parallel arrays (offsets + lengths + data) for O(1) indexed access during reading.

Keyboard shortcuts

TeaLeaf Documentation