TextTokenizer strange behavior with dictionary files / tables
Hi, I am trying to tokenize text that contains punctuation, truncated words, digits, etc. (garbage) using user dictionary tables or files, but the output keeps coming back with this garbage in it. The syntax I am using (which runs without errors) is:
SELECT aa, LOWER(token) AS token, bb
FROM TextTokenizer (
    ON (SELECT * FROM public.t1
        WHERE COALESCE(TRIM(bb), '') = 'j00'
          AND COALESCE(cc, '') <> '') AS texttoparse PARTITION BY ANY
    ON public.dexdim AS dict DIMENSION
    TextColumn ('cc')
    OutputByWord ('true')
    Accumulate ('aa', 'bb')
    UserDictionaryFile ('words.txt')
)
WHERE COALESCE(token, '') <> ''
The dictionary file and table contain only full words, but the output looks like this:
Isn't it correct that the second column (token) should contain only words from the dictionaries (table and file)?