Hi, does anyone know any tips & tricks for speeding up the run time of NaiveBayesTextPredict when it takes about 5 million text rows and applies a token model table of about 160K tokens? I did a sample of 10K text rows and it took about 2 hours (Aster Express, 8 GB RAM worker + 4 GB RAM queen, SSD, i7 2.7 GHz processor). I tried to run it on 1 million rows and cancelled it after 2 days of working. My syntax was a CREATE TABLE AS SELECT FROM NaiveBayesTextPredict, so I don't know whether the lag comes from the table creation or from the prediction step. I created an index on all the columns of the dimension table (the model) but still saw no improvement. What could be done to speed up the process? Is the delay due to the dimension table (structure, distribution, etc.) or to the input table (the prediction text)? Indexing 5 million text rows would be disastrous, I suppose, from a space and performance point of view...
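One way to separate the two costs would be to run the prediction alone, without the CTAS wrapper, on a small slice of the input. The following is only a sketch: the table names are mine, and the argument names (InputTokenColumn, DocIDColumns, ModelType, the model passed as a DIMENSION input) are assumptions based on the Aster Analytics function reference and may differ in your version, so check the docs for your release:

```sql
-- Hypothetical sketch: run the prediction WITHOUT the CREATE TABLE AS
-- wrapper, on a small sample, to see whether the prediction step itself
-- is the bottleneck. Table and argument names are assumptions.
SELECT *
FROM NaiveBayesTextPredict(
    ON (SELECT * FROM prediction_tokens LIMIT 10000)  -- small slice first
        PARTITION BY document_id
    ON nb_token_model AS model DIMENSION   -- model table replicated to each worker
    InputTokenColumn('token')
    DocIDColumns('document_id')
    ModelType('multinomial')
);
```

If this small run is fast but the full CTAS is not, the time is going into materializing the output table rather than into the classification itself.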
I did the same test with a prediction table containing only 2 columns (a document_id and a token word) with 5 million rows against a model with 800K rows (and 3 columns), both tables indexed on all columns. Still, any attempt to do a prediction on more than 1,000 test rows takes more than 12 hours (at which point I cancel it). Is there anything that can be done at table-creation time to speed up the function? I mean for the fact table of prediction text, like DISTRIBUTE BY in conjunction with PARTITION BY, or a columnar table type? I also tried a CLUSTER method on the prediction table but it didn't help.
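On the DISTRIBUTE BY idea: as far as I understand, Aster's SQL-MR functions don't use B-tree indexes at all, so co-locating rows by the PARTITION BY key tends to matter far more than indexing. A hedged sketch of rebuilding the fact table hash-distributed on document_id (table and column names are yours; the DDL shape follows the Aster CREATE TABLE docs and is worth verifying on your version):

```sql
-- Sketch: hash-distribute the prediction fact table on document_id so that
-- all tokens of a given document land on the same worker, matching the
-- function's PARTITION BY document_id and avoiding a repartition at run time.
CREATE TABLE prediction_tokens_dist
    DISTRIBUTE BY HASH(document_id)
AS
SELECT document_id, token
FROM prediction_tokens;

ANALYZE prediction_tokens_dist;  -- refresh optimizer statistics after the rebuild
```

The small model table is the opposite case: since it is consumed as a dimension input, it gets replicated to every worker anyway, so indexing its columns should not change the function's run time.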