TextTokenizer strange behavior with dictionary files / tables


Hi, I am trying to tokenize text that contains punctuation, truncated words, digits, etc. (garbage) by using user dictionary tables or files, but the output keeps coming back with this garbage. The syntax I am using (which runs without errors) is:

 

SELECT aa, lower(token) AS token, bb
FROM TextTokenizer (
    ON (SELECT * FROM public.t1
        WHERE coalesce(trim(bb), '') = 'j00' AND coalesce(cc, '') <> '') AS texttoparse PARTITION BY ANY
    ON public.dexdim AS dict DIMENSION
    TextColumn ('cc')
    OutputByWord ('true')
    Accumulate ('aa', 'bb')
    UserDictionaryFile ('words.txt')
) WHERE coalesce(token, '') <> ''

 

The dictionary file / table contains only full words, but the output looks like this:

 

764756(3-4j00
868818(37.2)j00
868171).j00
756212+j00

 

Isn't it correct that the second column (token) should contain only the words from the dictionaries (table and file)?
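
If the function itself will not restrict the output to dictionary entries, the only workaround I can think of is joining the result back to the dictionary table, which rather defeats the purpose of the dict / UserDictionaryFile arguments. This is just a rough sketch of the idea: I am assuming the dictionary table has a single word column, called "term" here only for illustration (the real column name is different), and I am not sure the function output can be aliased and joined directly like this; it may need to be wrapped in a subquery.

SELECT t.aa, lower(t.token) AS token, t.bb
FROM TextTokenizer (
    ON (SELECT * FROM public.t1
        WHERE coalesce(trim(bb), '') = 'j00' AND coalesce(cc, '') <> '') AS texttoparse PARTITION BY ANY
    ON public.dexdim AS dict DIMENSION
    TextColumn ('cc')
    OutputByWord ('true')
    Accumulate ('aa', 'bb')
    UserDictionaryFile ('words.txt')
) AS t
JOIN public.dexdim AS d
    ON lower(t.token) = lower(d.term)   -- 'term' is a placeholder for the real dictionary column
WHERE coalesce(t.token, '') <> ''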

 

Thanks