Hi, I am trying to tokenize text that contains punctuation, truncated words, digits, and other garbage by using user dictionary tables and files, but the output keeps coming out with this garbage in it. The syntax I am using (which runs without errors) is:
SELECT aa, lower(token) AS token, bb
FROM TextTokenizer (
    ON (SELECT *
        FROM public.t1
        WHERE coalesce(trim(bb), '') = 'j00'
          AND coalesce(cc, '') <> '') AS texttoparse PARTITION BY ANY
    ON public.dexdim AS dict DIMENSION
    TextColumn ('cc')
    OutputByWord ('true')
    Accumulate ('aa', 'bb')
    UserDictionaryFile ('words.txt')
)
WHERE coalesce(token, '') <> '';
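For what it is worth, stripping the digits and punctuation before the text reaches the tokenizer should suppress the garbage, but that would be pre-cleaning, not the dictionary filtering I am after. A sketch of that workaround, assuming the Postgres-style regexp_replace function is available (otherwise it is the same call as above):

SELECT aa, lower(token) AS token, bb
FROM TextTokenizer (
    -- hypothetical pre-cleaning step: replace anything that is not a letter or a space
    ON (SELECT aa, bb,
               regexp_replace(cc, '[^a-zA-Z ]+', ' ', 'g') AS cc
        FROM public.t1
        WHERE coalesce(trim(bb), '') = 'j00'
          AND coalesce(cc, '') <> '') AS texttoparse PARTITION BY ANY
    ON public.dexdim AS dict DIMENSION
    TextColumn ('cc')
    OutputByWord ('true')
    Accumulate ('aa', 'bb')
    UserDictionaryFile ('words.txt')
)
WHERE coalesce(token, '') <> '';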
The dictionary file and table contain only full words, but the output of the original query looks like this:
764756 | (3-4   | j00 |
868818 | (37.2) | j00 |
868171 | ).     | j00 |
756212 | +      | j00 |
Shouldn't the second column (token) contain only words that appear in the dictionaries (table and file)?
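In the meantime I could filter the tokenizer output against the dictionary table myself. A sketch of that (I am assuming the word column in public.dexdim is named word, so adjust to the real name):

SELECT s.aa, s.token, s.bb
FROM (
    SELECT aa, lower(token) AS token, bb
    FROM TextTokenizer (
        ON (SELECT *
            FROM public.t1
            WHERE coalesce(trim(bb), '') = 'j00'
              AND coalesce(cc, '') <> '') AS texttoparse PARTITION BY ANY
        ON public.dexdim AS dict DIMENSION
        TextColumn ('cc')
        OutputByWord ('true')
        Accumulate ('aa', 'bb')
        UserDictionaryFile ('words.txt')
    )
    WHERE coalesce(token, '') <> ''
) AS s
INNER JOIN public.dexdim AS d
    ON s.token = lower(d.word)   -- keep only tokens that exist in the dictionary table
;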
Thanks