[Tutorial] [Python] Analytics on Teradata Database with Vantage

Analytics
Teradata Employee

[Tutorial] [Python] Analytics on Teradata Database with Vantage

Hey team Teradata,

I come across a lot of people having connectivity issues with Teradata Database especially when it comes to a platform of their choice.

For Python users, who want to inspect a database or the data and run analytics on either, I have designed a Jupyter Notebook to help new users get started.

 

8 REPLIES 8
Enthusiast

Re: [Tutorial] [Python] Analytics on Teradata Database with Vantage

@ahmadmansoor  This is really great.

I am not able to download the link as it's asking for login info onto sharepoint, it may be a Teradata internal site.

Would it be possible to share it on the download section so every body can be benefited? Also do you have any documentation for it

Thanks again,

Re: [Tutorial] [Python] Analytics on Teradata Database with Vantage

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Sciences on Teradata Database by ~A7 [AM250152]\n",
    "Requirements: Python 3.5+, teradatasql, teradataml and a Vantage Machine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "┌┬┐┌─┐┌┬┐┌─┐  ┌─┐┌─┐┬┌─┐┌┐┌┌─┐┌─┐\n",
      " ││├─┤ │ ├─┤  └─┐│  │├┤ ││││  ├┤ \n",
      "─┴┘┴ ┴ ┴ ┴ ┴  └─┘└─┘┴└─┘┘└┘└─┘└─┘\n",
      "Welcome to Data Sciences with A7™\n",
      "using Pandas version: 0.23.4\n",
      "using SciKit-Learn version: 0.19.2\n",
      "using Python version: 3.5.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:11:22) \n",
      "[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]\n"
     ]
    }
   ],
   "source": [
    "print(\"┌┬┐┌─┐┌┬┐┌─┐  ┌─┐┌─┐┬┌─┐┌┐┌┌─┐┌─┐\")\n",
    "print(\" ││├─┤ │ ├─┤  └─┐│  │├┤ ││││  ├┤ \")\n",
    "print(\"─┴┘┴ ┴ ┴ ┴ ┴  └─┘└─┘┴└─┘┘└┘└─┘└─┘\")\n",
    "\n",
    "import os, time,sys\n",
    "for char in \"Welcome to Data Sciences with A7™\":\n",
    "    time.sleep(0.05)\n",
    "    print(char, end='', flush=True)\n",
    "\n",
    "# os.system(\"pip install libraries/teradatasql-16.20.0.39-py3-none-any.whl\")\n",
    "# os.system(\"pip install libraries/teradataml-16.20.0.0-py3-none-any.whl\")\n",
    "\n",
    "import teradatasql\n",
    "import teradataml as tdml\n",
    "\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import sklearn # classic ML\n",
    "#import keras # deep learning\n",
    "\n",
    "print(\"\\nusing Pandas version:\",pd.__version__)\n",
    "print(\"using SciKit-Learn version:\", sklearn.__version__)\n",
    "print(\"using Python version:\", sys.version)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this demo, we are going to run some analytics on the Iris dataset. This Notebook is intended to get Python data scientists to get a jump start on analyzing client-side data exisiting in Teradata Databases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Configurations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "host = \"sdt19085.labs.teradata.com\"\n",
    "username = \"user1\"\n",
    "password = \"user1\"\n",
    "database_name = \"demo\"\n",
    "table_name = \"iris\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Connect to SQL Database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "con = teradatasql.connect(None, host=host, user=username, password=password)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load Data\n",
    "We are going to work with two tables: iris and iris_large"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2315912</td>\n",
       "      <td>6.283448</td>\n",
       "      <td>8.236815</td>\n",
       "      <td>10.391486</td>\n",
       "      <td>20.720223</td>\n",
       "      <td>versicolor</td>\n",
       "      <td>FF</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3984584</td>\n",
       "      <td>5.605049</td>\n",
       "      <td>4.246290</td>\n",
       "      <td>5.814689</td>\n",
       "      <td>4.570939</td>\n",
       "      <td>setosa</td>\n",
       "      <td>BF</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5653256</td>\n",
       "      <td>5.312167</td>\n",
       "      <td>6.365951</td>\n",
       "      <td>7.940895</td>\n",
       "      <td>6.101106</td>\n",
       "      <td>setosa</td>\n",
       "      <td>EB</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1145313</td>\n",
       "      <td>6.949874</td>\n",
       "      <td>10.257845</td>\n",
       "      <td>4.598636</td>\n",
       "      <td>5.890818</td>\n",
       "      <td>virginica</td>\n",
       "      <td>CI</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2813985</td>\n",
       "      <td>5.086145</td>\n",
       "      <td>4.033024</td>\n",
       "      <td>9.856735</td>\n",
       "      <td>7.284636</td>\n",
       "      <td>setosa</td>\n",
       "      <td>DD</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         0         1          2          3          4           5   6  7\n",
       "0  2315912  6.283448   8.236815  10.391486  20.720223  versicolor  FF  0\n",
       "1  3984584  5.605049   4.246290   5.814689   4.570939      setosa  BF  1\n",
       "2  5653256  5.312167   6.365951   7.940895   6.101106      setosa  EB  0\n",
       "3  1145313  6.949874  10.257845   4.598636   5.890818   virginica  CI  0\n",
       "4  2813985  5.086145   4.033024   9.856735   7.284636      setosa  DD  0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table_name = \"iris_large\"\n",
    "cursor = con.cursor()\n",
    "cursor.execute(\"SELECT TOP 300 * FROM {}\".format(database_name+\".\"+table_name))\n",
    "raw_data = pd.DataFrame(cursor.fetchall())\n",
    "raw_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Modify Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal-length</th>\n",
       "      <th>sepal-width</th>\n",
       "      <th>petal-length</th>\n",
       "      <th>petal-width</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6.283448</td>\n",
       "      <td>8.236815</td>\n",
       "      <td>10.391486</td>\n",
       "      <td>20.720223</td>\n",
       "      <td>versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.605049</td>\n",
       "      <td>4.246290</td>\n",
       "      <td>5.814689</td>\n",
       "      <td>4.570939</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5.312167</td>\n",
       "      <td>6.365951</td>\n",
       "      <td>7.940895</td>\n",
       "      <td>6.101106</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6.949874</td>\n",
       "      <td>10.257845</td>\n",
       "      <td>4.598636</td>\n",
       "      <td>5.890818</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.086145</td>\n",
       "      <td>4.033024</td>\n",
       "      <td>9.856735</td>\n",
       "      <td>7.284636</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal-length  sepal-width  petal-length  petal-width       class\n",
       "0      6.283448     8.236815     10.391486    20.720223  versicolor\n",
       "1      5.605049     4.246290      5.814689     4.570939      setosa\n",
       "2      5.312167     6.365951      7.940895     6.101106      setosa\n",
       "3      6.949874    10.257845      4.598636     5.890818   virginica\n",
       "4      5.086145     4.033024      9.856735     7.284636      setosa"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cleaned_data = raw_data.drop([0,6,7],axis=1)\n",
    "cleaned_data.columns = [\"sepal-length\", \"sepal-width\",\"petal-length\",\"petal-width\",\"class\"]\n",
    "cleaned_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Initial Analysis of Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       sepal-length  sepal-width  petal-length  petal-width\n",
      "count    300.000000   300.000000    300.000000   300.000000\n",
      "mean       5.727166     8.056860      7.350811     9.512028\n",
      "std        0.811897     2.811425      2.108194     8.420680\n",
      "min        4.172506     1.209392      1.935968    -8.943831\n",
      "25%        5.068811     5.930002      5.900518     4.808855\n",
      "50%        5.616010     7.794138      7.406402     6.256710\n",
      "75%        6.348974     9.884348      8.665233    12.063914\n",
      "max        7.919086    16.079922     12.967613    46.799579 \n",
      "\n",
      "Class Distribution: \n",
      " class\n",
      "setosa        120\n",
      "versicolor     94\n",
      "virginica      86\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(cleaned_data.describe(), \"\\n\")\n",
    "print(\"Class Distribution: \\n\", cleaned_data.groupby('class').size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Visual Analysis of Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cleaned_data.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False, title=\"Iris Dataset: Box and Whiskers Plot\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cleaned_data.hist()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.plotting.scatter_matrix(cleaned_data)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prediction: Training Models (SkLearn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import model_selection\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.metrics import confusion_matrix\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "array = cleaned_data.values\n",
    "X = array[:,0:4]\n",
    "Y = array[:,4]\n",
    "validation_size = 0.20\n",
    "seed = 7\n",
    "X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)\n",
    "seed = 7\n",
    "scoring = 'accuracy'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LR: 0.762500 (0.064684)\n",
      "LDA: 0.783333 (0.078617)\n",
      "KNN: 0.829167 (0.070833)\n",
      "CART: 0.795833 (0.092139)\n",
      "NB: 0.858333 (0.053359)\n",
      "SVM: 0.808333 (0.046398)\n"
     ]
    }
   ],
   "source": [
    "# Create a list of different models\n",
    "\n",
    "models = []\n",
    "models.append(('LR', LogisticRegression()))\n",
    "models.append(('LDA', LinearDiscriminantAnalysis()))\n",
    "models.append(('KNN', KNeighborsClassifier()))\n",
    "models.append(('CART', DecisionTreeClassifier()))\n",
    "models.append(('NB', GaussianNB()))\n",
    "models.append(('SVM', SVC()))\n",
    "\n",
    "# evaluate each model in a loop\n",
    "results = []\n",
    "names = []\n",
    "for name, model in models:\n",
    "\tkfold = model_selection.KFold(n_splits=10, random_state=seed)\n",
    "\tcv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)\n",
    "\tresults.append(cv_results)\n",
    "\tnames.append(name)\n",
    "\tmsg = \"%s: %f (%f)\" % (name, cv_results.mean(), cv_results.std())\n",
    "\tprint(msg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare Algorithms\n",
    "\n",
    "fig = plt.figure()\n",
    "fig.suptitle('Algorithm Comparison')\n",
    "ax = fig.add_subplot(111)\n",
    "plt.boxplot(results)\n",
    "ax.set_xticklabels(names)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: \n",
      " 0.85\n",
      "Confusion Matrix: \n",
      " [[26  0  0]\n",
      " [ 7 13  0]\n",
      " [ 1  1 12]]\n",
      "Report: \n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "     setosa       0.76      1.00      0.87        26\n",
      " versicolor       0.93      0.65      0.76        20\n",
      "  virginica       1.00      0.86      0.92        14\n",
      "\n",
      "avg / total       0.87      0.85      0.85        60\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Make predictions on validation dataset using LDA\n",
    "lda = LinearDiscriminantAnalysis()\n",
    "lda.fit(X_train, Y_train)\n",
    "predictions = lda.predict(X_validation)\n",
    "print(\"Accuracy: \\n\", accuracy_score(Y_validation, predictions))\n",
    "print(\"Confusion Matrix: \\n\", confusion_matrix(Y_validation, predictions))\n",
    "print(\"Report: \\n\", classification_report(Y_validation, predictions))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Teradata ML Library on Vantage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create Connection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "conn_ml = tdml.create_context(host = host, username=username, password = password)\n",
    "\n",
    "from teradataml.analytics.NaiveBayes import NaiveBayes\n",
    "from teradataml.analytics.NaiveBayesPredict import NaiveBayesPredict\n",
    "from teradataml.dataframe.dataframe import DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "table_name = \"iris\"\n",
    "iris_input_train = DataFrame.from_query(\"SELECT * FROM {} WHERE id MOD 5 <> 0\".format(database_name+\".\"+table_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Analyze data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "    id  sepal_length  sepal_width  petal_length  petal_width     species\n",
       "0  138           6.4          3.1           5.5          1.8   virginica\n",
       "1   32           5.4          3.4           1.5          0.4      setosa\n",
       "2  124           6.3          2.7           4.9          1.8   virginica\n",
       "3    8           5.0          3.4           1.5          0.2      setosa\n",
       "4   56           5.7          2.8           4.5          1.3  versicolor\n",
       "5  137           6.3          3.4           5.6          2.4   virginica\n",
       "6  136           7.7          3.0           6.1          2.3   virginica\n",
       "7   62           5.9          3.0           4.2          1.5  versicolor\n",
       "8  117           6.5          3.0           5.5          1.8   virginica\n",
       "9   76           6.6          3.0           4.4          1.4  versicolor"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_input_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/am250152/anaconda3/lib/python3.5/re.py:203: FutureWarning: split() requires a non-empty pattern match.\n",
      "  return _compile(pattern, flags).split(string, maxsplit)\n"
     ]
    }
   ],
   "source": [
    "naivebayes_train = NaiveBayes(formula=\"species ~ petal_length + sepal_width + petal_width + sepal_length\", data=iris_input_train)\n",
    "iris_input_test = DataFrame.from_query(\"SELECT * FROM {} WHERE id MOD 5 = 0\".format(database_name+\".\"+table_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create model and predict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/am250152/anaconda3/lib/python3.5/re.py:203: FutureWarning: split() requires a non-empty pattern match.\n",
      "  return _compile(pattern, flags).split(string, maxsplit)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "############ STDOUT Output ############\n",
       "\n",
       "    id  prediction  loglik_virginica  loglik_setosa  loglik_versicolor\n",
       "0   70  versicolor        -15.236845    -152.472574          -2.353846\n",
       "1   85  versicolor         -7.002832    -249.656534          -2.004556\n",
       "2   40      setosa        -58.353886       0.976841         -35.442558\n",
       "3  105   virginica         -1.583216    -540.563571         -14.859641\n",
       "4   95  versicolor        -10.180244    -198.037172          -1.105673\n",
       "5  100  versicolor        -10.131539    -187.295006          -1.028853\n",
       "6  110   virginica         -6.113021    -654.802111         -28.838515\n",
       "7   35      setosa        -58.198028       0.660203         -34.933600\n",
       "8   15      setosa        -64.716957      -3.554763         -42.613273\n",
       "9   65  versicolor        -12.649648    -138.435759          -2.189800"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "naivebayes_predict_result = NaiveBayesPredict(newdata=iris_input_test,\n",
    "                                       modeldata = naivebayes_train,\n",
    "                                       id_col = \"id\",\n",
    "                                       responses = [\"virginica\",\"setosa\",\"versicolor\"])\n",
    "naivebayes_predict_result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extra Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_table_cols(db,table):\n",
    "    cursor = con.cursor()\n",
    "    cursor.execute(\"select columnname from dbc.columns where databasename = '{}' and tablename='{}';\".format(db,table))\n",
    "    col_names = cursor.fetchall()\n",
    "    col_names = [item.strip(\" \") for sublist in col_names for item in sublist]\n",
    "    return col_names"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
Tags (1)

Re: [Tutorial] [Python] Analytics on Teradata Database with Vantage

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Sciences on Teradata Database by ~A7 [AM250152]\n",
    "Requirements: Python 3.5+, teradatasql, teradataml and a Vantage Machine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "┌┬┐┌─┐┌┬┐┌─┐  ┌─┐┌─┐┬┌─┐┌┐┌┌─┐┌─┐\n",
      " ││├─┤ │ ├─┤  └─┐│  │├┤ ││││  ├┤ \n",
      "─┴┘┴ ┴ ┴ ┴ ┴  └─┘└─┘┴└─┘┘└┘└─┘└─┘\n",
      "Welcome to Data Sciences with A7™\n",
      "using Pandas version: 0.23.4\n",
      "using SciKit-Learn version: 0.19.2\n",
      "using Python version: 3.5.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:11:22) \n",
      "[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]\n"
     ]
    }
   ],
   "source": [
    "print(\"┌┬┐┌─┐┌┬┐┌─┐  ┌─┐┌─┐┬┌─┐┌┐┌┌─┐┌─┐\")\n",
    "print(\" ││├─┤ │ ├─┤  └─┐│  │├┤ ││││  ├┤ \")\n",
    "print(\"─┴┘┴ ┴ ┴ ┴ ┴  └─┘└─┘┴└─┘┘└┘└─┘└─┘\")\n",
    "\n",
    "import os, time,sys\n",
    "for char in \"Welcome to Data Sciences with A7™\":\n",
    "    time.sleep(0.05)\n",
    "    print(char, end='', flush=True)\n",
    "\n",
    "# os.system(\"pip install libraries/teradatasql-16.20.0.39-py3-none-any.whl\")\n",
    "# os.system(\"pip install libraries/teradataml-16.20.0.0-py3-none-any.whl\")\n",
    "\n",
    "import teradatasql\n",
    "import teradataml as tdml\n",
    "\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import sklearn # classic ML\n",
    "#import keras # deep learning\n",
    "\n",
    "print(\"\\nusing Pandas version:\",pd.__version__)\n",
    "print(\"using SciKit-Learn version:\", sklearn.__version__)\n",
    "print(\"using Python version:\", sys.version)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this demo, we are going to run some analytics on the Iris dataset. This Notebook is intended to get Python data scientists to get a jump start on analyzing client-side data exisiting in Teradata Databases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Configurations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "host = \"sdt19085.labs.teradata.com\"\n",
    "username = \"user1\"\n",
    "password = \"user1\"\n",
    "database_name = \"demo\"\n",
    "table_name = \"iris\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Connect to SQL Database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "con = teradatasql.connect(None, host=host, user=username, password=password)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load Data\n",
    "We are going to work with two tables: iris and iris_large"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2315912</td>\n",
       "      <td>6.283448</td>\n",
       "      <td>8.236815</td>\n",
       "      <td>10.391486</td>\n",
       "      <td>20.720223</td>\n",
       "      <td>versicolor</td>\n",
       "      <td>FF</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3984584</td>\n",
       "      <td>5.605049</td>\n",
       "      <td>4.246290</td>\n",
       "      <td>5.814689</td>\n",
       "      <td>4.570939</td>\n",
       "      <td>setosa</td>\n",
       "      <td>BF</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5653256</td>\n",
       "      <td>5.312167</td>\n",
       "      <td>6.365951</td>\n",
       "      <td>7.940895</td>\n",
       "      <td>6.101106</td>\n",
       "      <td>setosa</td>\n",
       "      <td>EB</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1145313</td>\n",
       "      <td>6.949874</td>\n",
       "      <td>10.257845</td>\n",
       "      <td>4.598636</td>\n",
       "      <td>5.890818</td>\n",
       "      <td>virginica</td>\n",
       "      <td>CI</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2813985</td>\n",
       "      <td>5.086145</td>\n",
       "      <td>4.033024</td>\n",
       "      <td>9.856735</td>\n",
       "      <td>7.284636</td>\n",
       "      <td>setosa</td>\n",
       "      <td>DD</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         0         1          2          3          4           5   6  7\n",
       "0  2315912  6.283448   8.236815  10.391486  20.720223  versicolor  FF  0\n",
       "1  3984584  5.605049   4.246290   5.814689   4.570939      setosa  BF  1\n",
       "2  5653256  5.312167   6.365951   7.940895   6.101106      setosa  EB  0\n",
       "3  1145313  6.949874  10.257845   4.598636   5.890818   virginica  CI  0\n",
       "4  2813985  5.086145   4.033024   9.856735   7.284636      setosa  DD  0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table_name = \"iris_large\"\n",
    "cursor = con.cursor()\n",
    "cursor.execute(\"SELECT TOP 300 * FROM {}\".format(database_name+\".\"+table_name))\n",
    "raw_data = pd.DataFrame(cursor.fetchall())\n",
    "raw_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Modify Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal-length</th>\n",
       "      <th>sepal-width</th>\n",
       "      <th>petal-length</th>\n",
       "      <th>petal-width</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6.283448</td>\n",
       "      <td>8.236815</td>\n",
       "      <td>10.391486</td>\n",
       "      <td>20.720223</td>\n",
       "      <td>versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.605049</td>\n",
       "      <td>4.246290</td>\n",
       "      <td>5.814689</td>\n",
       "      <td>4.570939</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5.312167</td>\n",
       "      <td>6.365951</td>\n",
       "      <td>7.940895</td>\n",
       "      <td>6.101106</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6.949874</td>\n",
       "      <td>10.257845</td>\n",
       "      <td>4.598636</td>\n",
       "      <td>5.890818</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.086145</td>\n",
       "      <td>4.033024</td>\n",
       "      <td>9.856735</td>\n",
       "      <td>7.284636</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal-length  sepal-width  petal-length  petal-width       class\n",
       "0      6.283448     8.236815     10.391486    20.720223  versicolor\n",
       "1      5.605049     4.246290      5.814689     4.570939      setosa\n",
       "2      5.312167     6.365951      7.940895     6.101106      setosa\n",
       "3      6.949874    10.257845      4.598636     5.890818   virginica\n",
       "4      5.086145     4.033024      9.856735     7.284636      setosa"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cleaned_data = raw_data.drop([0,6,7],axis=1)\n",
    "cleaned_data.columns = [\"sepal-length\", \"sepal-width\",\"petal-length\",\"petal-width\",\"class\"]\n",
    "cleaned_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Initial Analysis of Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       sepal-length  sepal-width  petal-length  petal-width\n",
      "count    300.000000   300.000000    300.000000   300.000000\n",
      "mean       5.727166     8.056860      7.350811     9.512028\n",
      "std        0.811897     2.811425      2.108194     8.420680\n",
      "min        4.172506     1.209392      1.935968    -8.943831\n",
      "25%        5.068811     5.930002      5.900518     4.808855\n",
      "50%        5.616010     7.794138      7.406402     6.256710\n",
      "75%        6.348974     9.884348      8.665233    12.063914\n",
      "max        7.919086    16.079922     12.967613    46.799579 \n",
      "\n",
      "Class Distribution: \n",
      " class\n",
      "setosa        120\n",
      "versicolor     94\n",
      "virginica      86\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(cleaned_data.describe(), \"\\n\")\n",
    "print(\"Class Distribution: \\n\", cleaned_data.groupby('class').size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Visual Analysis of Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cleaned_data.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False, title=\"Iris Dataset: Box and Whiskers Plot\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cleaned_data.hist()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.plotting.scatter_matrix(cleaned_data)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prediction: Training Models (SkLearn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import model_selection\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.metrics import confusion_matrix\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "array = cleaned_data.values\n",
    "X = array[:,0:4]\n",
    "Y = array[:,4]\n",
    "validation_size = 0.20\n",
    "seed = 7\n",
    "X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)\n",
    "seed = 7\n",
    "scoring = 'accuracy'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LR: 0.762500 (0.064684)\n",
      "LDA: 0.783333 (0.078617)\n",
      "KNN: 0.829167 (0.070833)\n",
      "CART: 0.795833 (0.092139)\n",
      "NB: 0.858333 (0.053359)\n",
      "SVM: 0.808333 (0.046398)\n"
     ]
    }
   ],
   "source": [
    "# Create a list of different models\n",
    "\n",
    "models = []\n",
    "models.append(('LR', LogisticRegression()))\n",
    "models.append(('LDA', LinearDiscriminantAnalysis()))\n",
    "models.append(('KNN', KNeighborsClassifier()))\n",
    "models.append(('CART', DecisionTreeClassifier()))\n",
    "models.append(('NB', GaussianNB()))\n",
    "models.append(('SVM', SVC()))\n",
    "\n",
    "# evaluate each model in a loop\n",
    "results = []\n",
    "names = []\n",
    "for name, model in models:\n",
    "\tkfold = model_selection.KFold(n_splits=10, random_state=seed)\n",
    "\tcv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)\n",
    "\tresults.append(cv_results)\n",
    "\tnames.append(name)\n",
    "\tmsg = \"%s: %f (%f)\" % (name, cv_results.mean(), cv_results.std())\n",
    "\tprint(msg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare Algorithms\n",
    "\n",
    "fig = plt.figure()\n",
    "fig.suptitle('Algorithm Comparison')\n",
    "ax = fig.add_subplot(111)\n",
    "plt.boxplot(results)\n",
    "ax.set_xticklabels(names)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: \n",
      " 0.85\n",
      "Confusion Matrix: \n",
      " [[26  0  0]\n",
      " [ 7 13  0]\n",
      " [ 1  1 12]]\n",
      "Report: \n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "     setosa       0.76      1.00      0.87        26\n",
      " versicolor       0.93      0.65      0.76        20\n",
      "  virginica       1.00      0.86      0.92        14\n",
      "\n",
      "avg / total       0.87      0.85      0.85        60\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Make predictions on validation dataset using LDA\n",
    "lda = LinearDiscriminantAnalysis()\n",
    "lda.fit(X_train, Y_train)\n",
    "predictions = lda.predict(X_validation)\n",
    "print(\"Accuracy: \\n\", accuracy_score(Y_validation, predictions))\n",
    "print(\"Confusion Matrix: \\n\", confusion_matrix(Y_validation, predictions))\n",
    "print(\"Report: \\n\", classification_report(Y_validation, predictions))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Teradata ML Library on Vantage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create Connection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "conn_ml = tdml.create_context(host = host, username=username, password = password)\n",
    "\n",
    "from teradataml.analytics.NaiveBayes import NaiveBayes\n",
    "from teradataml.analytics.NaiveBayesPredict import NaiveBayesPredict\n",
    "from teradataml.dataframe.dataframe import DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "table_name = \"iris\"\n",
    "iris_input_train = DataFrame.from_query(\"SELECT * FROM {} WHERE id MOD 5 <> 0\".format(database_name+\".\"+table_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Analyze data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "    id  sepal_length  sepal_width  petal_length  petal_width     species\n",
       "0  138           6.4          3.1           5.5          1.8   virginica\n",
       "1   32           5.4          3.4           1.5          0.4      setosa\n",
       "2  124           6.3          2.7           4.9          1.8   virginica\n",
       "3    8           5.0          3.4           1.5          0.2      setosa\n",
       "4   56           5.7          2.8           4.5          1.3  versicolor\n",
       "5  137           6.3          3.4           5.6          2.4   virginica\n",
       "6  136           7.7          3.0           6.1          2.3   virginica\n",
       "7   62           5.9          3.0           4.2          1.5  versicolor\n",
       "8  117           6.5          3.0           5.5          1.8   virginica\n",
       "9   76           6.6          3.0           4.4          1.4  versicolor"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_input_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/am250152/anaconda3/lib/python3.5/re.py:203: FutureWarning: split() requires a non-empty pattern match.\n",
      "  return _compile(pattern, flags).split(string, maxsplit)\n"
     ]
    }
   ],
   "source": [
    "naivebayes_train = NaiveBayes(formula=\"species ~ petal_length + sepal_width + petal_width + sepal_length\", data=iris_input_train)\n",
    "iris_input_test = DataFrame.from_query(\"SELECT * FROM {} WHERE id MOD 5 = 0\".format(database_name+\".\"+table_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create model and predict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/am250152/anaconda3/lib/python3.5/re.py:203: FutureWarning: split() requires a non-empty pattern match.\n",
      "  return _compile(pattern, flags).split(string, maxsplit)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "############ STDOUT Output ############\n",
       "\n",
       "    id  prediction  loglik_virginica  loglik_setosa  loglik_versicolor\n",
       "0   70  versicolor        -15.236845    -152.472574          -2.353846\n",
       "1   85  versicolor         -7.002832    -249.656534          -2.004556\n",
       "2   40      setosa        -58.353886       0.976841         -35.442558\n",
       "3  105   virginica         -1.583216    -540.563571         -14.859641\n",
       "4   95  versicolor        -10.180244    -198.037172          -1.105673\n",
       "5  100  versicolor        -10.131539    -187.295006          -1.028853\n",
       "6  110   virginica         -6.113021    -654.802111         -28.838515\n",
       "7   35      setosa        -58.198028       0.660203         -34.933600\n",
       "8   15      setosa        -64.716957      -3.554763         -42.613273\n",
       "9   65  versicolor        -12.649648    -138.435759          -2.189800"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "naivebayes_predict_result = NaiveBayesPredict(newdata=iris_input_test,\n",
    "                                       modeldata = naivebayes_train,\n",
    "                                       id_col = \"id\",\n",
    "                                       responses = [\"virginica\",\"setosa\",\"versicolor\"])\n",
    "naivebayes_predict_result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extra Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_table_cols(db,table):\n",
    "    cursor = con.cursor()\n",
    "    cursor.execute(\"select columnname from dbc.columns where databasename = '{}' and tablename='{}';\".format(db,table))\n",
    "    col_names = cursor.fetchall()\n",
    "    col_names = [item.strip(\" \") for sublist in col_names for item in sublist]\n",
    "    return col_names"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
Tags (1)
Enthusiast

Re: [Tutorial] [Python] Analytics on Teradata Database with Vantage

Hi,

I have a newbie question, does the Teradata Python Module (aka teradataml nowadays if I am not wrong) work also with Teradata Express VM installations that can be freely downloaded from Teradata? Is it the same with Teradata Pyhon Package (described at https://docs.teradata.com/reader/GsM0pYRZl5Plqjdf9ixmdA/MzdO1q_t80M47qY5lyImOA). I am quite confused about these two term Teradata Python Module and Teradata Python Package, I don’t know which of them contains the machine learning functions and which of them can connect to a Teradata Express instance. I tried to connect from a Jupyter Notebook to a TD Express 16.20 via teradataml, I could also load example functions in a table, got them into a dataframe with success but when running the test I got an error that maybe tells me that ONLY Vantage can handle this requests:

 

stringsimilarity_out1 = StringSimilarity (data=strsimilarity_input, comparison_columns=['jaro (src_text1 , tar_text ) AS jaro1_sim',
'LD (src_text1 , tar_text, 2) AS ld1_sim',
'n_gram (src_text1 , tar_text, 2) AS ngram1_sim',
'jaro_winkler (src_text1 , tar_text, 2) AS jw1_sim'],
case_sensitive=True,
accumulate = ["id","src_text1","tar_text"],
data_sequence_column='id')
# Print result dataframe.
stringsimilarity_out1.result

 

and the error:

 

OperationalError                          Traceback (most recent call last)
C:\Radu\Miniconda3\envs\purr2\lib\site-packages\teradataml\analytics\mle\StringSimilarity.py in __execute(self)
    297         try:
--> 298             UtilFuncs._create_view(sqlmr_stdout_temp_tablename, self.sqlmr_query)
    299         except Exception as emsg:

C:\Radu\Miniconda3\envs\purr2\lib\site-packages\teradataml\common\utils.py in _create_view(view_name, query)
    419         try:
--> 420             UtilFuncs._execute_ddl_statement(crt_view)
    421             return True

C:\Radu\Miniconda3\envs\purr2\lib\site-packages\teradataml\common\utils.py in _execute_ddl_statement(ddl_statement)
    355                 cursor = conn.cursor()
--> 356                 cursor.execute(ddl_statement)
    357                 conn.commit()

C:\Radu\Miniconda3\envs\purr2\lib\site-packages\teradatasql\__init__.py in execute(self, sOperation, params)
    538             if not params:
--> 539                 self.executemany (sOperation, None)
    540 

C:\Radu\Miniconda3\envs\purr2\lib\site-packages\teradatasql\__init__.py in executemany(self, sOperation, seqOfParams)
    723                 goside.goFreePointer (self.connection.uLog, pcError)
--> 724                 raise OperationalError (sErr)
    725 

OperationalError: [Version 16.20.0.42] [Session 1004] [Teradata Database] [Error 3707] Syntax error, expected something like ';' between the word 'StringSimilarity' and '('.
 at gosqldriver/teradatasql.(*teradataConnection).formatDatabaseError TeradataConnection.go:1072
 at gosqldriver/teradatasql.(*teradataConnection).makeChainedDatabaseError TeradataConnection.go:1088
 at gosqldriver/teradatasql.(*teradataConnection).processErrorParcel TeradataConnection.go:1127
 at gosqldriver/teradatasql.(*TeradataRows).processResponseBundle TeradataRows.go:1172
 at gosqldriver/teradatasql.newTeradataRows TeradataRows.go:298
 at gosqldriver/teradatasql.(*teradataStatement).QueryContext TeradataStatement.go:122
 at gosqldriver/teradatasql.(*teradataConnection).QueryContext TeradataConnection.go:1751
 at database/sql.ctxDriverQuery ctxutil.go:48
 at database/sql.(*DB).queryDC.func1 sql.go:1464
 at database/sql.withLock sql.go:3032
 at database/sql.(*DB).queryDC sql.go:1459
 at database/sql.(*Conn).QueryContext sql.go:1701
 at main.goCreateRows goside.go:548
 at main._cgoexpwrap_545a47f24425_goCreateRows _cgo_gotypes.go:305
 at runtime.call64 asm_amd64.s:574
 at runtime.cgocallbackg1 cgocall.go:316
 at runtime.cgocallbackg cgocall.go:194
 at runtime.cgocallback_gofunc asm_amd64.s:826
 at runtime.goexit asm_amd64.s:2361

During handling of the above exception, another exception occurred:

TeradataMlException                       Traceback (most recent call last)
<ipython-input-13-e6ba6dea14cf> in <module>
      5                                         case_sensitive=True,
      6                                         accumulate = ["id","src_text1","tar_text"],
----> 7                                         data_sequence_column='id')      8 # Print result dataframe.
      9 stringsimilarity_out1.result

C:\Radu\Miniconda3\envs\purr2\lib\site-packages\teradataml\analytics\mle\StringSimilarity.py in __init__(self, data, comparison_columns, case_sensitive, accumulate, data_sequence_column, data_order_column)
    179             self.__form_tdml_query()
    180             # Execute ML query
--> 181             self.__execute()
    182 
    183     def __validate(self):

C:\Radu\Miniconda3\envs\purr2\lib\site-packages\teradataml\analytics\mle\StringSimilarity.py in __execute(self)
    298             UtilFuncs._create_view(sqlmr_stdout_temp_tablename, self.sqlmr_query)
    299         except Exception as emsg:
--> 300             raise TeradataMlException(Messages.get_message(MessageCodes.TDMLDF_EXEC_SQL_FAILED, str(emsg)), MessageCodes.TDMLDF_EXEC_SQL_FAILED)
    301 
    302         # Update output table data frames.

TeradataMlException: [Teradata][teradataml](TDML_2102) Failed to execute SQL: '[Version 16.20.0.42] [Session 1004] [Teradata Database] [Error 3707] Syntax error, expected something like ';' between the word 'StringSimilarity' and '('.
 at gosqldriver/teradatasql.(*teradataConnection).formatDatabaseError TeradataConnection.go:1072
 at gosqldriver/teradatasql.(*teradataConnection).makeChainedDatabaseError TeradataConnection.go:1088
 at gosqldriver/teradatasql.(*teradataConnection).processErrorParcel TeradataConnection.go:1127
 at gosqldriver/teradatasql.(*TeradataRows).processResponseBundle TeradataRows.go:1172
 at gosqldriver/teradatasql.newTeradataRows TeradataRows.go:298
 at gosqldriver/teradatasql.(*teradataStatement).QueryContext TeradataStatement.go:122
 at gosqldriver/teradatasql.(*teradataConnection).QueryContext TeradataConnection.go:1751
 at database/sql.ctxDriverQuery ctxutil.go:48
 at database/sql.(*DB).queryDC.func1 sql.go:1464
 at database/sql.withLock sql.go:3032
 at database/sql.(*DB).queryDC sql.go:1459
 at database/sql.(*Conn).QueryContext sql.go:1701
 at main.goCreateRows goside.go:548
 at main._cgoexpwrap_545a47f24425_goCreateRows _cgo_gotypes.go:305
 at runtime.call64 asm_amd64.s:574
 at runtime.cgocallbackg1 cgocall.go:316
 at runtime.cgocallbackg cgocall.go:194
 at runtime.cgocallback_gofunc asm_amd64.s:826
 at runtime.goexit asm_amd64.s:2361'

This error "Syntax error, expected something like ';' between the word 'StringSimilarity' and '('." appears usually when TD is lacking the function that is called in the querry....:(...I would be happy to be wrong and to just be a typo in the notebook cell.

 

Thanks

Highlighted
Enthusiast

Re: [Tutorial] [Python] Analytics on Teradata Database with Vantage

Hi again,

 

There is a free Developer Tier of Teradata advertised here https://www.teradata.com/Products/Cloud/VMware but actually it's not a Vantage but a "classical" 16.20.07 version of a TD Database Engine that lacks IMHO the ML functionaluties requred by teradataml package. Do you know if there is any newer free developer tier installation available that contains also the ML functions? Thanks again.

Enthusiast

Re: [Tutorial] [Python] Analytics on Teradata Database with Vantage

Hi
Not sure what do mean by ML but if you are referring to teradata multiload then it is part of tuf utilities and i think it comes with td latest vm ware. Also python mudule needs to be downloaded and installed.
After you configure the environment it should work with vm as well. Hope that helps
Thanks
Enthusiast

Re: [Tutorial] [Python] Analytics on Teradata Database with Vantage

Hi,

 

ML = Machine Learning. It's the analytics functions set that are implemented partially in Teradata Aster and partially in Teradata Database (with the new names they have inside Teradata Vantage). In my example it was the StringSimilarity function.

Teradata Employee

Re: [Tutorial] [Python] Analytics on Teradata Database with Vantage

the basic python package is now teradatasql

 

https://pypi.org/project/teradatasql/

 

thanks

 

dave