Is anyone performing statistical analysis on data using Teradata SQL?
We have an application that is written in 'R' and were wondering if we could incorporate the calculations directly in Teradata SQL.
We do linear algebra, matrix/vector calculations, multidimensional statistical distributions, random number generation and numerical optimization, and we use algorithms for correlation estimation, simulation, aggregation, and disaggregation.
Re: Is anyone performing statistical analysis on data using Teradata SQL?
SAS guys are working with Teradata guys to add statistical functions to Teradata.
I do a lot of simple CASE expressions nested inside SUM, combined with GROUP BY on a key column, as a form of transposing and summarisation. The CASE expressions create many dummy variables, and the SUM and GROUP BY aggregate them to a single row of data per customer ID. This is common data preparation prior to running a linear regression or a similar predictive modelling algorithm. I use QUANTILE to squish my data into percentiles (or even deciles if I can get away with it) as a lazy way of reducing the effects of outliers.
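To make the CASE / SUM / GROUP BY pattern concrete, here is a minimal sketch. It runs against an in-memory SQLite database from Python purely as a stand-in, since I can't assume a Teradata connection; the query itself sticks to standard SQL that should carry over largely unchanged. The table `transactions` and the columns `customer_id`, `channel` and `amount` are made up for illustration.

```python
import sqlite3

# Hypothetical transaction table: several rows per customer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, channel TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "web", 10.0), (1, "store", 25.0), (1, "web", 5.0), (2, "store", 40.0)],
)

# Each CASE expression acts as a dummy variable; SUM + GROUP BY collapse
# the rows to a single row per customer (one column per channel).
PIVOT_SQL = """
SELECT customer_id,
       SUM(CASE WHEN channel = 'web'   THEN 1 ELSE 0 END)        AS n_web,
       SUM(CASE WHEN channel = 'store' THEN 1 ELSE 0 END)        AS n_store,
       SUM(CASE WHEN channel = 'web'   THEN amount ELSE 0.0 END) AS amt_web,
       SUM(CASE WHEN channel = 'store' THEN amount ELSE 0.0 END) AS amt_store
FROM transactions
GROUP BY customer_id
ORDER BY customer_id
"""

rows = conn.execute(PIVOT_SQL).fetchall()
for row in rows:
    print(row)
```

The result has one row per customer with a count and a total per channel, which is the shape a regression or tree algorithm normally expects as input. The percentile binning step is not shown here.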
I don't build the models in Teradata (obviously), but I do score linear regression, C5.0 and CART decision trees, and also back-propagation neural networks as SQL on our Teradata box. Once set up, all data preparation, transformation and predictive scoring runs on the Teradata box. The SQL is massively verbose and not very effectively written (much of it is auto-generated and looks hideous), but it is optimised well and runs amazingly fast.
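As a rough illustration of what "scoring as SQL" means for the simplest case, a linear regression, here is a small sketch that turns fitted coefficients into a SELECT statement. This is not how any particular commercial tool generates its SQL; `linear_score_sql`, the table `customer_features` and the coefficient values are all hypothetical, and SQLite is again only used to check that the generated statement runs.

```python
import sqlite3


def linear_score_sql(table, key, intercept, coefficients, alias="score"):
    """Build a SELECT that computes intercept + sum(weight * column).

    Table and column names are inserted as-is, so they must come from a
    trusted source (e.g. the model definition), never from user input.
    """
    terms = [repr(float(intercept))]
    for column, weight in coefficients.items():
        terms.append(f"({float(weight)!r} * {column})")
    expression = "\n       + ".join(terms)
    return f"SELECT {key},\n       {expression} AS {alias}\nFROM {table}"


# Hypothetical model: score = 1.5 + 0.25 * n_web + 0.5 * amt_store
sql = linear_score_sql(
    "customer_features", "customer_id", 1.5, {"n_web": 0.25, "amt_store": 0.5}
)
print(sql)

# Sanity check against a one-row stand-in table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_features (customer_id INTEGER, n_web INTEGER, amt_store REAL)"
)
conn.execute("INSERT INTO customer_features VALUES (1, 2, 25.0)")
scores = conn.execute(sql).fetchall()
print(scores)
```

Decision trees and neural networks follow the same idea, with nested CASE expressions for the tree splits and longer arithmetic expressions for the network layers, which is why the generated SQL gets so verbose.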
Since 'R' is a freeware tool, I wouldn't expect it or the related tools ('Yale', I think it's called) to be able to autogenerate SQL and do a lot of the data preparation in-database. The main commercial data mining tools have this functionality. I'm biased because I'm ex-SPSS and heavily use Clementine, but SAS apparently has similar basic features. Maybe the freeware tools will be developed in a similar way over the next few years.