I analyzed 79,704 Chinese sentences from Tatoeba to understand which pinyin syllables actually appear in practice. The corpus contains 1,161 unique syllables (there are likely more obscure ones not covered here). Using a Trie data structure to map every syllable, I found patterns in tone distribution, character complexity, and polyphonic pronunciation that differ from what textbooks suggest.

Here’s what the data reveals.

The Trie Structure

A Trie (prefix tree) is a data structure in which each path from the root to a terminal node spells out a complete string. For pinyin, each node is a single letter or tone digit, and following a path like h → a → n → 4 gives you the syllable “han4”.
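A minimal sketch of such a Trie, for illustration only. The names `TrieNode`, `insert`, and `lookup` are hypothetical and not taken from the hanzi-flow code; the three sample syllables are just enough to show how paths share a prefix.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter or tone digit -> TrieNode
        self.characters = []  # hanzi stored where a syllable ends


def insert(root, syllable, character):
    """Walk (and create) one node per symbol, then store the character."""
    node = root
    for symbol in syllable:
        node = node.children.setdefault(symbol, TrieNode())
    node.characters.append(character)


def lookup(root, syllable):
    """Return the characters stored at the end of the path, or [] if absent."""
    node = root
    for symbol in syllable:
        node = node.children.get(symbol)
        if node is None:
            return []
    return node.characters


root = TrieNode()
for syllable, character in [("han4", "汉"), ("hao3", "好"), ("he2", "和")]:
    insert(root, syllable, character)

print(lookup(root, "han4"))                  # ['汉']
print(sorted(root.children["h"].children))   # ['a', 'e']
```

Because the tone digit is the last symbol on every path, a prefix such as "han" exists as a node but stores no characters.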

Here’s the complete h-branch, showing how syllables form from the root to terminal nodes:

Figure: Pinyin Trie, H branch. The h-branch from root to all complete syllables (ha, hao, he, etc.). Dashed lines show other branches.


The Tone Paradox

The neutral tone accounts for only 2.2% of unique syllables but 8.4% of actual character usage. Why? Extremely common grammatical particles and suffixes (的, 了, 吗, 么) all carry the neutral tone:

Figure: Tone Distribution Analysis. Three perspectives on tone: by syllables, by characters, and by frequency.

Key insight: Tone 4 dominates across all measures (27-34%), while neutral tone punches far above its weight due to particle frequency.
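The gap between the two percentages comes from counting the same tone in two ways: once per unique syllable, and once per occurrence. A small sketch of that calculation follows. The counts are made-up toy values, not the Tatoeba figures, and `tone_shares` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical toy data: syllable -> number of occurrences in some corpus.
usage = {"de0": 50, "le0": 20, "shi4": 15, "wo3": 10, "ma1": 5}


def tone_shares(usage):
    """Return (share by unique syllable, share by occurrence) per tone digit."""
    by_syllable = Counter()
    by_usage = Counter()
    for syllable, count in usage.items():
        tone = syllable[-1]          # last symbol is the tone digit, 0 = neutral
        by_syllable[tone] += 1       # each unique syllable counts once
        by_usage[tone] += count      # each occurrence counts
    n_syllables = sum(by_syllable.values())
    n_usage = sum(by_usage.values())
    return (
        {t: c / n_syllables for t, c in by_syllable.items()},
        {t: c / n_usage for t, c in by_usage.items()},
    )


syllable_share, usage_share = tone_shares(usage)
print(syllable_share["0"], usage_share["0"])  # 0.4 0.7
```

In the toy data the neutral tone covers 2 of 5 syllables but 70 of 100 occurrences, the same kind of imbalance the particles produce in the real corpus.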

Syllable Crowding

Most pinyin syllables map to a handful of characters. But a few are extremely crowded:

Figure: Syllable Complexity Distribution. Character count per syllable: most have 3-4 characters, but some have 30+.

The most crowded syllables:

  • yi4: 37 characters (意, 义, 议, 异, 易, 亿, 艺, 益…)
  • shi4: 32 characters (是, 事, 市, 式, 试, 视, 世, 士…)
  • ji4: 30 characters (记, 际, 计, 技, 季, 继, 既, 寄…)

Meanwhile, some syllables map to a single character in this corpus: wo3 (我), le0 (了), ni3 (你).
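Crowding is simply the number of distinct characters that share a syllable. The sketch below groups a handful of (character, syllable) pairs taken from the examples above; the variable names are illustrative, and the tiny sample obviously does not reproduce the corpus counts.

```python
from collections import defaultdict

# A few (character, syllable) pairs from the lists above.
pairs = [
    ("意", "yi4"), ("义", "yi4"), ("议", "yi4"),
    ("是", "shi4"), ("事", "shi4"),
    ("我", "wo3"),
]

chars_by_syllable = defaultdict(set)
for character, syllable in pairs:
    chars_by_syllable[syllable].add(character)

# Rank syllables by how many distinct characters they carry.
crowding = sorted(
    ((syllable, len(chars)) for syllable, chars in chars_by_syllable.items()),
    key=lambda item: item[1],
    reverse=True,
)
average = sum(n for _, n in crowding) / len(crowding)

print(crowding)  # [('yi4', 3), ('shi4', 2), ('wo3', 1)]
print(average)   # 2.0
```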

The Polyphonic Myth

Textbooks emphasize that many Chinese characters have multiple pronunciations. But of the 5,002 characters that appear in this corpus, only about 4% (199 characters) are actually polyphonic. Unicode defines 80,000+ Chinese characters, but this analysis focuses on what learners encounter in real usage:

Figure: Polyphonic Characters. Top 20 characters with multiple pronunciations.

Even when characters have multiple pronunciations, one usually dominates:

  • 的: 4 pronunciations, but de0 is used 99.8% of the time
  • 一: 3 pronunciations, but yi1 is the primary form
  • One character has 5 pronunciations (the most polyphonic character in the corpus)
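How strongly one reading dominates can be measured by counting readings per character. This sketch uses invented observation counts purely for illustration (the readings themselves are real, the frequencies are not), and `dominant_reading` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Hypothetical (character, reading) observations; counts are made up.
observations = (
    [("的", "de0")] * 8 + [("的", "di4")] * 2
    + [("行", "xing2")] * 3 + [("行", "hang2")]
    + [("我", "wo3")] * 5
)

readings = defaultdict(Counter)
for character, syllable in observations:
    readings[character][syllable] += 1


def dominant_reading(character):
    """Return the most frequent reading and its share of that character's uses."""
    counts = readings[character]
    syllable, n = counts.most_common(1)[0]
    return syllable, n / sum(counts.values())


# A character is polyphonic here if more than one reading was observed.
polyphonic = {ch for ch, counts in readings.items() if len(counts) > 1}
polyphonic_share = len(polyphonic) / len(readings)

print(dominant_reading("的"))  # ('de0', 0.8)
print(dominant_reading("行"))  # ('xing2', 0.75)
```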

Syllable Structure

The most common depth for a complete syllable is 4 (3 letters + tone, like ban1 or mao2):

Figure: Depth Distribution. Syllable completion by depth: 41.5% of syllables complete at depth 4 (482 out of 1,161).

The longest syllables (depth 7: 6 letters + tone) are all -uang combinations:

  • chuang1, chuang2, chuang3, chuang4
  • shuang1, shuang3
  • zhuang1, zhuang4
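Since every letter and the tone digit each occupy one node, a syllable's depth equals its string length. A short sketch of the depth count, using a few syllables mentioned in this post and assuming one ASCII symbol per node:

```python
from collections import Counter

# Sample syllables from this post; depth = letters + 1 tone digit.
syllables = ["wo3", "ban1", "mao2", "han4", "chuang1", "shuang3", "zhuang4"]


def depth(syllable):
    """Depth of the terminal node: one level per symbol in the syllable."""
    return len(syllable)


depth_counts = Counter(depth(s) for s in syllables)
max_depth = max(depth_counts)
longest = [s for s in syllables if depth(s) == max_depth]

print(dict(sorted(depth_counts.items())))  # {3: 1, 4: 3, 7: 3}
print(longest)                             # ['chuang1', 'shuang3', 'zhuang4']
```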

Syllable × Tone Coverage

When you strip away tones, 1,161 syllables collapse to 401 base forms. This heatmap shows which of the 5 tones each base syllable occurs in:

Figure: Syllable-Tone Matrix. 401 base syllables × 5 tones. Cell values show character count. Some bases exist across all tones, others only in one or two.

Observations:

  • Most base syllables don’t have all 5 tone variants
  • Neutral tone (column 0) is sparse - only 25 base syllables have it
  • Some bases are tone-specific (appear in only 1-2 tone columns)
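One way to build such a matrix is to split each syllable into its base and tone digit and fill a row of five cells per base. The sketch below assumes that layout; the character counts are illustrative only, and `split_syllable` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative syllable -> character count values, not the full corpus table.
char_counts = {"yi1": 5, "yi2": 4, "yi4": 37, "le0": 1, "wo3": 1}


def split_syllable(syllable):
    """Split 'yi4' into ('yi', 4); tone 0 is the neutral tone."""
    return syllable[:-1], int(syllable[-1])


# One row per base syllable, one column per tone 0-4.
matrix = defaultdict(lambda: [0] * 5)
for syllable, n in char_counts.items():
    base, tone = split_syllable(syllable)
    matrix[base][tone] = n

tones_per_base = {base: sum(1 for n in row if n) for base, row in matrix.items()}
neutral_bases = [base for base, row in matrix.items() if row[0]]
average_tones = len(char_counts) / len(matrix)

print(matrix["yi"])    # [0, 5, 4, 0, 37]
print(neutral_bases)   # ['le']
```

The same ratio applied to the corpus totals, 1,161 syllables over 401 bases, gives the roughly 2.9 tones per base reported in this post.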

Key Findings

  • 1,161 unique syllables found in 79,704 sentences from the Tatoeba corpus
  • 401 base syllables when tones are removed (average 2.9 tones per base)
  • About 96% of characters (all but 199 of 5,002) have only one pronunciation in practice
  • Neutral tone: 2.2% of syllables but 8.4% of usage (particle effect)
  • Top 10 syllables account for 18.8% of all character instances
  • Character homophony: Average 4.5 characters per syllable (range: 1-37)

Technical Notes

This analysis is based on 79,704 Chinese sentences from the Tatoeba corpus. I built a letter-level Trie data structure where each node represents a single letter or tone number, with terminal nodes storing character metadata and frequency data.

Tools: Python 3.9+, matplotlib (charts), graphviz (tree visualization)

Data pipeline:

  1. Extract characters from corpus
  2. Compute pinyin with jieba + pypinyin and/or GPT-4o-mini
  3. Build Trie structure
  4. Analyze distributions, patterns, and edge cases
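The first two steps can be sketched with the standard library alone. This is not the hanzi-flow implementation: `TOY_READINGS` is a hypothetical lookup table standing in for the jieba + pypinyin and/or GPT-4o-mini step, and a context-free table like this cannot resolve polyphonic characters.

```python
import re
from collections import Counter

# Matches the basic CJK Unified Ideographs block only (an assumption;
# rarer characters live in extension blocks not covered here).
HAN = re.compile(r"[\u4e00-\u9fff]")

# Hypothetical stand-in for the real pinyin annotation step.
TOY_READINGS = {"我": "wo3", "是": "shi4", "的": "de0", "你": "ni3", "好": "hao3"}


def extract_characters(sentences):
    """Step 1: keep only Han characters, dropping punctuation and Latin text."""
    return [ch for sentence in sentences for ch in HAN.findall(sentence)]


def syllable_counts(sentences, readings):
    """Step 2: map each character to a syllable and count occurrences."""
    counts = Counter()
    for ch in extract_characters(sentences):
        syllable = readings.get(ch)
        if syllable is not None:   # skip characters missing from the toy table
            counts[syllable] += 1
    return counts


sentences = ["你好！", "我是你的朋友。"]
counts = syllable_counts(sentences, TOY_READINGS)

print(extract_characters(sentences))  # ['你', '好', '我', '是', '你', '的', '朋', '友']
print(counts["ni3"])                  # 2
```

The resulting syllable counts are the kind of input the Trie and the distribution analyses would consume in steps 3 and 4.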

Full analysis code and visualizations: hanzi-flow on GitHub