Yeah, the usual way to compile pattern matching is a decision tree, because once you decide to check one of the arguments it can completely change which of the remaining arguments is the better candidate to look at next. (Of course, in a lazy language, it's even more subtle, but ML is strict.) It's a bit like compiling switch statements, only more so, and I expect compilers for functional languages have less tuning for crazy corner cases like an 18k-entry table than, say, Clang, simply because they have less tuning for crazy corner cases overall. (I once had Clang compile a switch into a lookup table and stuff the table into a 64-bit constant, for crying out loud.)
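To make the "which argument next" point concrete, here is a hand-compiled sketch of a small three-column match over booleans. The match itself and the name `classify` are made up for illustration; this is what a decision tree for it could look like, not the output of any particular compiler.

```c
#include <stdbool.h>

/*
 * Hypothetical ML-style match, first matching row wins:
 *
 *   case (x, y, z) of
 *     (_,     false, true ) => 1
 *   | (false, true,  _    ) => 2
 *   | (_,     _,     false) => 3
 *   | (_,     _,     true ) => 4
 *
 * Testing y first pays off: each outcome leaves a different
 * "best" column to examine next.
 */
int classify(bool x, bool y, bool z)
{
    if (!y) {
        /* y = false: row 2 is dead and x is irrelevant, so test z. */
        return z ? 1 : 3;
    }
    /* y = true: row 1 is dead, and now x is the column worth testing. */
    if (!x)
        return 2;
    /* x = true: only rows 3 and 4 remain, distinguished by z. */
    return z ? 4 : 3;
}
```

No path tests the same column twice, and the y = false branch never looks at x at all.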
For Unicode tables, the usual approach is of course a radix trie, even if the traditional two-level trie from the Unicode 2.0 days is starting to look a bit bloated now and a three-level trie is quite a bit slower. A hash might be a good approach for some things. One potential downside is that hashing completely destroys memory locality, while normal Unicode data will usually use a fairly small subset of the characters; I'm not sure how important this would turn out to be. People (e.g. libutf and IIRC duktape) have also used a sorted table of alternating start- and endpoints of ranges.
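A minimal sketch of the alternating-endpoints idea, assuming half-open ranges stored as a flat sorted array. The table contents, `example_bounds`, and `in_ranges` are illustrative and not taken from libutf or duktape, whose actual layouts may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sorted boundaries: even indices start a range, odd indices end it
 * (exclusive). Illustrative contents only, not a real Unicode property. */
static const uint32_t example_bounds[] = {
    0x0041, 0x005B,   /* A..Z */
    0x0061, 0x007B,   /* a..z */
    0xAC00, 0xD7A4,   /* precomposed Hangul syllables */
};

/* Binary-search for the number of boundaries <= cp. An odd count means
 * the last boundary crossed was a start point, so cp is inside a range. */
bool in_ranges(const uint32_t *bounds, size_t n, uint32_t cp)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (bounds[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo & 1) != 0;
}
```

A call such as `in_ranges(example_bounds, 6, 0x0061)` touches only a handful of table entries, so the table stays compact, at the cost of a logarithmic search per lookup instead of the trie's couple of indexed loads.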
[The 18k figure sounds scarier than it is: 11k of those are precomposed Korean syllables encoded in a well-known systematic way, so for 0xAC00 <= codepoint < 0xAC00 + 19 * 21 * 28 the table just says (if (codepoint - 0xAC00) mod 28 = 0 then LV else LVT).]
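Spelled out as code, that rule is just a range check and a modulus. The enum and function names here are made up for the sketch; only the constants come from the formula above.

```c
#include <stdint.h>

/* Hypothetical result type: not a syllable, LV, or LVT. */
typedef enum { HANGUL_NONE, HANGUL_LV, HANGUL_LVT } hangul_type;

enum {
    S_BASE  = 0xAC00,
    T_COUNT = 28,
    S_COUNT = 19 * 21 * 28   /* 11172 precomposed syllables */
};

hangul_type hangul_syllable_type(uint32_t cp)
{
    if (cp < S_BASE || cp >= (uint32_t)S_BASE + S_COUNT)
        return HANGUL_NONE;
    /* Every 28th code point has no trailing consonant, hence LV. */
    return (cp - S_BASE) % T_COUNT == 0 ? HANGUL_LV : HANGUL_LVT;
}
```

So roughly 11k of the 18k entries collapse into one comparison pair and one remainder, with no table at all.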