Let’s face it: many business users would prefer to use a nice GUI when analyzing their data, rather than having to write code. KNIME is an open-source data analytics platform where end users can do just that – that is, build an analytic workflow using a GUI. In this article, we will look at an example of how to use KNIME to analyze open-source heart disease data.
The open-source data used in the example that follows was obtained from the University of California, Irvine Machine Learning Repository and can be downloaded here: http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. (We will use the processed Switzerland, Cleveland, and VA data sets and the reprocessed Hungarian data set.)
NOTE: The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigators responsible for the collection of the data at each institution. They are:

Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
This data was collected from 920 cardiac patients: 294 were from the Hungarian Institute of Cardiology, 123 were from the University Hospitals in Zurich and Basel, Switzerland, 200 were from the V.A. Medical Center in Long Beach, California, and 303 were from the Cleveland Clinic in Ohio. The combined data set includes 725 men and 193 women; the ages of the patients ranged from 28 to 77 years. Each patient was classified according to the severity of their heart disease, on a scale from 0 (no disease) to 4 (severe disease).
For reference, here is a table of the different variables in our data set:
age: age of patient
sex: gender of patient (1 = male, 0 = female)
chestpaintype: type of chest pain experienced by patient (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
restbps: resting blood pressure (mmHg)
chol: serum cholesterol (mg/dl)
fastbloodsug: fasting blood sugar > 120 mg/dl? (1 = true, 0 = false)
restecg: resting ECG results (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy according to Estes' Criteria)
maxheartrate: maximum heart rate
exindang: exercise-induced angina (1 = yes, 0 = no)
oldpeak: ST depression (a measurement on the patient's ECG readout) induced by exercise relative to rest
slope: slope of peak exercise ST segment (a measurement on the patient's ECG readout; 1 = positive slope, 2 = flat, 3 = negative slope)
numvessels: number of major vessels colored by fluoroscopy (0-3)
defect: type of heart defect present (3 = normal, 6 = fixed defect, 7 = reversible defect)
dxlevel: severity of heart disease (0 = no disease, 1 = mild disease, 2 = moderate disease, 3 = acute disease, 4 = severe disease)
Some of these data sets contain the “?” symbol (without quotes) wherever there are missing values. For now, let's replace those with the value -9. Since these are small data sets, we can do this by opening our favorite text editor and running a "Find and Replace" on each of the input data sets.
Also, with the reprocessed Hungarian data set, it is recommended that you replace all the spaces between the values with commas.
It also helps if you get rid of the ".0"s (without quotes) in the Cleveland data set. (This is so that our input data is formatted consistently).
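If you would rather script these cleanup steps than run manual find-and-replace operations, here is a minimal Python sketch. (The decision to apply all three fixes to every file is a simplification for illustration; in practice the comma fix only matters for the Hungarian file and the ".0" fix for the Cleveland file.)

```python
import re

def clean_line(line: str) -> str:
    """Apply the three cleanup steps described above to one line of data."""
    line = line.strip()
    line = re.sub(r"\s+", ",", line)        # spaces -> commas (Hungarian file)
    line = line.replace("?", "-9")          # "?" missing-value markers -> -9
    line = re.sub(r"\.0(?=,|$)", "", line)  # drop ".0" suffixes (Cleveland file)
    return line

def clean_file(src_path: str, dst_path: str) -> None:
    """Read a raw data file and write a consistently formatted copy."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(clean_line(line) + "\n")
```

The ".0" regex only strips the suffix when it ends a field, so genuinely fractional values (for example, oldpeak readings such as 2.3) are left alone.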
We’ll assume that, once we have preprocessed our data, we can hand it over to our DBA, who then loads it into a Teradata Aster table called [schema].heartdx_prelim. (To see an example of how to create a Teradata Aster table and load .csv data into it, please see Sentiment Analysis with Teradata Aster.)
Data Loading/Prep Work:
In this example, we will be using KNIME 3.2.1, which is licensed under the GNU General Public License, Version 3.
To begin, we open KNIME and create a new workflow. Here is what the KNIME screen looks like:
To create a new workflow, go to File > New…
In the dialog box that appears, select New KNIME Workflow > Next > Give your workflow a title in the “Name of the workflow to create” field > Click “Finish”. Our KNIME screen should now look like this:
We will now connect to the database where our data is stored, which, in this case, is a Teradata Aster database. In the Node Repository section on the bottom left side of the screen, select Database > Connector > Database Connector.
NOTE: Nodes can be selected either by double-clicking their name in the Node Repository panel or by dragging and dropping them to the grid panel in the center of the screen. Also, you can search for any node by using the Node Repository search window in the upper right corner of the Node Repository panel.
Now, right click the new Database Connector node and click “Configure,” as shown below:
In the dialog box that appears, fill out the following fields with the credentials of the database that we are connecting to:
NOTE: In the Database URL field, the value for <protocol> when connecting to a Teradata Aster database is jdbc:ncluster. Also, in order to connect KNIME to Aster, you will need to have the noarch-aster-jdbc-driver.jar file saved on your local system. Once you have downloaded this file, go to KNIME and select File > Preferences > KNIME > Databases > Add File, then navigate to the location where you saved the noarch-aster-jdbc-driver.jar file. Select the .jar file, then click Open > OK.
When our node is configured, it will look like this (in the rectangle under the node, the middle circle is yellow, in case you have trouble seeing the color):
To execute the node and establish our database connection, either right click the node and select “Execute”, or left-click the node and press F7.
Now that we are connected to our database, we need KNIME to look at our specific table of interest, [schema].heartdx_prelim. For this, we can use a Database Reader node:
Our nodes should look like this:
Right click the Database Reader node and click “Configure.” Ok, so for this node, we are going to need to write some code, but don’t worry; this is the only code writing that we will do in this workflow.
In the dialog box that appears, write the following code in the SQL Statement panel:
DROP TABLE IF EXISTS [schema].knime_pract;
CREATE TABLE [schema].knime_pract
DISTRIBUTE BY HASH(age)
COMPRESS LOW AS (
SELECT
(ROW_NUMBER() OVER (ORDER BY age)) AS rownum,
nullif(age, -9) AS age,
nullif(sex, '-9') AS gender,
nullif(chestpaintype, '-9') AS chestpaintype,
CASE WHEN restbps IN (-9, 0) THEN null ELSE restbps END AS restbps,
CASE WHEN chol IN (-9, 0) THEN null ELSE chol END AS chol,
nullif(fastbloodsug, '-9') AS fastbloodsug,
nullif(restecg, '-9') AS restecg,
nullif(maxheartrate, -9) AS maxheartrate,
nullif(exindang, '-9') AS exindang,
nullif(oldpeak, -9) AS oldpeak,
nullif(slope, '-9') AS slope,
nullif(numvessels, -9) AS numvessels,
nullif(defect, '-9') AS defect,
nullif(dxlevel, '-9') AS dxlevel,
CASE WHEN dxlevel IN ('1', '2', '3', '4') THEN '1' WHEN dxlevel = '0' THEN '0' END AS dxpresent
FROM [schema].heartdx_prelim
);
SELECT * FROM [schema].knime_pract;
This takes our [schema].heartdx_prelim table and replaces the -9 placeholder values with true null values (which KNIME displays as ?s). It then takes the result and creates a new table called [schema].knime_pract, from which it then selects everything. Notice also that we have created an ID variable called "rownum" (without quotes) and a new response variable called "dxpresent" (also without quotes). This dxpresent variable indicates whether or not a particular patient has heart disease.
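If you want to sanity-check that SQL logic outside the database, the same cleaning can be sketched in pandas. (This is a rough equivalent for illustration only, using the column names from our table; it is not part of the KNIME workflow.)

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Mirror the SQL: turn -9 sentinels into real nulls and derive dxpresent."""
    df = df.replace(-9, np.nan)
    # restbps and chol treat 0 as missing as well, per the CASE expressions
    df["restbps"] = df["restbps"].replace(0, np.nan)
    df["chol"] = df["chol"].replace(0, np.nan)
    # dxpresent: 1 for any disease level 1-4, 0 for level 0
    df["dxpresent"] = (df["dxlevel"] > 0).astype(int)
    return df
```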
Click “OK” and then run the node. Now, if we right click the Database Reader node and select “Data from Database”, we can see our data in KNIME:
Now that KNIME can both see our data and recognize the missing values as “missing,” we need to tell KNIME how to handle the missing values. For now, we will focus on getting rid of the columns with over 50% missing values. To do this, we will use a Missing Value Column Filter node.
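For intuition, the threshold behavior of the Missing Value Column Filter node can be sketched in pandas like this (a rough stand-in, not the node's actual implementation):

```python
import pandas as pd

def drop_mostly_missing(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose fraction of missing values is at or below the threshold."""
    keep = [col for col in df.columns if df[col].isna().mean() <= threshold]
    return df[keep]
```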
At this point, we will do a little exploration of our remaining data. We will do this using Pie Chart, Histogram, and Correlation nodes (there are many other nodes that can be used for data exploration, but for now, we’ll focus on these three).
To make a Pie chart, do the following:
Notice that our pie chart is in black and white. To add color to our pie chart, do the following:
As we can see, a little over half of our patients have a chest pain type of 4, which is asymptomatic (in other words, they do not have chest pain).
To make a Histogram, do the following:
If you chose the age variable and used a Color Manager node, your histogram should look something like this:
As you can see, the ages of our patients appear to be approximately normally distributed.
A good thing to do when we are working on building a predictive model is to check our predictor variables to see if any of them are highly correlated with each other. Let’s do this now with a Rank Correlation node.
NOTE: The Rank Correlation node calculates correlations between all predictor pairs; the Linear Correlation node will calculate correlations for numeric-numeric pairs and nominal-nominal predictor pairs, but not for numeric-nominal pairs.
It should look something like this:
If we hover our mouse over each of the squares, we can see what the rank correlation is for that pair of variables. We see that chestpaintype is moderately correlated with maxheartrate and exindang (that is, the presence/absence of exercise-induced angina), which intuitively makes sense. Also, maxheartrate, exindang, oldpeak, and slope are all moderately correlated with each other, which makes sense, since all of these measures have to do with the state of a patient’s heart while they are exercising. If moderate correlation doesn’t bother us, then we’re fine to leave all the predictors in for model building.
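If you ever want to cross-check the heatmap outside KNIME, a Spearman rank correlation matrix is a one-liner in pandas. (Shown here on toy data; Spearman works on ranks, so it handles ordinal predictors such as chestpaintype and slope as well as continuous ones like maxheartrate.)

```python
import pandas as pd

def rank_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman rank correlation for every pair of columns."""
    return df.corr(method="spearman")
```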
We’re almost ready to move on to model building, but first, we need to take care of the missing values in our data set. (We can’t just ignore them at this point because the Logistic Regression Learner node that we are about to use won’t work unless we tell KNIME what to do about any missing values…trust me).
Now, let’s partition our main data set into training and test data sets. For this example, we’ll put 70% of our data into the training data set and 30% of our data into the test data set. Also, we’ll use dxpresent (1 = disease present, 0 = no disease) as our response variable.
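Conceptually, the Partitioning node is doing something like the following sketch (a plain shuffled split; the node also offers stratified sampling, which is worth using so that both partitions get a similar mix of dxpresent values):

```python
import random

def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into training and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed makes the split repeatable
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]
```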
Now, we are ready to build some statistical models. In this example, we will build and test a logistic regression model.
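For readers curious about what the Logistic Regression Learner node is actually fitting: it estimates coefficients that map the predictors to the log-odds of dxpresent. A bare-bones version of that fit, for a single predictor plus an intercept, might look like the sketch below (batch gradient descent on toy data; KNIME's learner uses a more sophisticated solver, and this is purely illustrative):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by batch gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # prediction error for this point
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def predict(b0: float, b1: float, x: float) -> int:
    """Classify as 1 when the predicted probability exceeds 0.5."""
    return 1 if sigmoid(b0 + b1 * x) > 0.5 else 0
```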
If you right click on the node and select View: Logistic Regression Result View, you can see your model-building results, as shown below:
If we use a typical α = 0.05 significance threshold, we see that our model considers age, gender, chest pain type, cholesterol, fasting blood sugar, maximum heart rate, exercise-induced angina, and oldpeak to be significant predictors of whether or not a person has heart disease.
How reliable are these model results, though? To answer this question, we’ll run our model on our test data set and score the results.
There isn’t really any configuration that needs to be done with this node for this example, so go ahead and run the node.
If we right click on the node and select Predicted Data, we can see the predictions generated by the node:
To summarize our test results, let's use a Scorer node:
Here, we see that our logistic model’s accuracy at correctly classifying patients as having heart disease/not having heart disease is about 79%. If we right click the Scorer node and select Accuracy Statistics, we can see more measures of accuracy, such as sensitivity, specificity, recall and precision.
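All of these statistics fall out of the confusion matrix, so for a binary response like dxpresent they are easy to compute by hand. A small sketch using the standard definitions:

```python
def score(actual, predicted):
    """Confusion-matrix accuracy statistics for a binary (0/1) classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # a.k.a. recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }
```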
Our current workflow looks like this:
At this point, we have several options to continue our work. Our current model does pretty well at predicting whether or not a patient has heart disease; however, we could try to tune our logistic model to improve our accuracy (or whichever accuracy statistic we care about most). We could also try different models, such as Naïve Bayes and Random Forest, to see if a different type of statistical model gives us better results. If going this route, we can hook up multiple model-building nodes to the same Partitioning node. Make sure to pair each model with its own matching Predictor node and Scorer node, though.
In this post, we have seen how to build an analytic workflow using KNIME connected to an Aster database. Of course, this only scratches the surface of what KNIME can do. For more information, you may either contact me at email@example.com or visit www.knime.org.