Set SQL is a way of operating on data that is radically different from historical concepts of computer programming that can be diagrammed with flow charts. Grab a cup of coffee and review some history that will explain why the industry is mired in cursor logic and loops that cannot exploit the power of today's software and hardware.
Early IBM computers were programmed not with language but with the placement of jumper wires on large circuit boards - essentially building custom electronic circuits. Programmers drew flow charts - a tool that was first used by integrated circuit designers - to plan and document their programs. Flow charts are still used today to document many logical process flows, from computer programs to instructions for representatives in call centers. Circuits and flow charts are useful ways of thinking about serial processes - that is, procedures that operate on one thing at a time.
This seems like a natural way of thinking about anything we do: for example, to make American tea one might start heating water, put the tea in a cup, wait until the water boils, poor water over the tea and let it steep. In some cases the order doesn’t matter – it might save time to start the water boiling first, but it doesn’t affect the outcome. In other cases, the order matters – the tea should be in the cup before the water is poured. However it is done, it is a step-by-step process because a person can only do one thing at a time. Most electronic circuits and computer programs are designed to execute in this way: doing something or some things, checking the results, then doing another thing depending on those results.
The evolution of computer languages before 1970 did not affect this processing model. At first, machine languages simply offered canned methods for accessing specially provided circuits: the machine instruction set. Then, assembler languages provided convenient ways of writing these instructions and automatically resolving memory addresses. Third-generation languages like COBOL merely abstracted the assembler languages to a higher level. One can think of it like this: a COBOL program generates assembler language, which is turned into machine language, which navigates the circuits in the computer. The basic paradigm of doing things one step at a time did not change.
In a 1970 paper published in Communications of the ACM, "A Relational Model of Data for Large Shared Data Banks," mathematician Edgar F. Codd introduced a different way of thinking about the structure of data and the logic of manipulating it. Codd used mathematical set logic to describe the things people were trying to accomplish with data processing. He called his approach a "relational model" because it was based on the concept of a relation, which, strictly speaking in mathematical terms, is a set of ordered pairs. (More formally, any subset of a Cartesian product is a relation.) Codd extended the concept by showing that a relation could be viewed not only as a set of ordered pairs but also as a set of ordered triplets, quadruplets or any number of items (a “tuple”), and this became what we know today as a “table”; hence the term "relational database." The term relation in “relational database” refers to table, not to the relationships between tables.
Codd argued that if data were organized in well-defined sets or “relations,” then set operators could be used to define any view or computation an end-user might want to see from the data. A Set Operator is a way to define a subset of data or a way in which a subset might be combined with another subset. Thus, relational theory addresses not only the structure of data but also operations that can be applied to the data. All of these operations on data are set-based. Many of them are familiar terms: SELECTION, PROJECTION, UNION, INTERSECT, MINUS and JOIN. These are static terms: they don't do anything - they simply define subsets of the data.
When viewing data in this way, flowcharts, circuits and procedural logic become irrelevant because there is no concept in relational theory of data "flowing" through an application program. Data is not considered to be manipulated by procedural methods but is simply expressed or viewed in different ways through combinations of operators.
All of the early relational database management systems (RDBMS) included a language for implementing the set operators. Ingres, for example, had the Quel language. IBM developed the SQL language (initially, Sequel [Structure English Query Language]) for these operators. In SQL, PROJECTION became the select statement and SELECTION (the predicate) became the where clause in SQL. SQL retained the terms UNION, MINUS and INTERSECT, and JOIN was implied by the “FROM” clause. (JOIN was added to SQL as an explicit operator years later.) SQL incorporated the Insert, Update and Delete commands as ways of modifying the data in the tables, and these could be combined with the set operators to identify which subset of the data needed to be changed. Codd’s contention was that these operators would provide a means for users to do anything they needed to do with the data.
Professional computer programmers disagreed. Programmers thought in terms of records, and records corresponded in their minds to rows in relational tables. There was nothing in relational theory to explain how a program could loop through a table, manipulating one row at a time, and doing various things depending on the contents of that row. Although SQL offered Insert, Update and Delete operations on sets, programmers found that the set operators (UNION, MINUS, INTERSECT and JOIN) were likely to incur too much overhead. Many people argued that these operators would require so many computing resources (CPU, Memory and I/O) that a relational implementation would require a larger machine and could operate only on data sets that were smaller than existing database management systems like IMS and IDMS could handle. The experiences of most IT groups in the 1980s proved these suspicions correct.
Relational theory did not address these problems because the model only defined ways of viewing the data. Codd insisted on leaving the details of how data might be physically traversed to the database engine itself: it was not his purpose to describe any sort of programming or hardware model of data access or data manipulation. Those details would be left to the database management software that implemented the relational model of data and set operators. Unfortunately, it would be years before hardware and software evolved to the point where they could efficiently support relational databases as Codd envisioned them.
In the meantime, for SQL/DS and DB2, IBM added the SQL Cursor construct to the COBOL programming language interface to enable programmers to treat a relational database as if it were just a collection of flat files and maintain them in the same manner. A cursor is defined by a SQL Select statement. The select statement defines a subset of the data in the database that the program wants to retrieve. The cursor can be “opened” like a file, and rows can be “fetched” from a cursor just as records are read from a file. A cursor could also be used to improve performance: sometimes a program reading two cursors could join two tables faster than the RDBMS could; and sometimes it was faster to extract a table to a flat file using a cursor, sort the file and reload it into a table than to use ORDER BY in a SQL SELECT statement.
With these familiar concepts defined, a programmer needed only learn the specific syntax of this interface to SQL, and the rest of the program logic can proceed just as when updating a flat (sequential) file. The relational set operators could be largely ignored and the programming paradox for programmers used to working with flat files and navigable databases was solved.
Fast forward to the 21st century - today all commercial RDBMSes are designed such that set-based updates are more efficient than serial updates. Massively Parallel Processing (MPP) engines like Teradata can rapidly execute complex set operations on data that would never have been imagined in the 1980s except by theoreticians like Codd. With this computing power, set processing is the new programming paradigm. Cursor-based updates are often very fast because of the brute force speed of the processors, memory and disk, and they may perform well enough on some RDBMS engines, but converting these to Set SQL will increase the speed even further. The use of Set SQL is the practical performance choice in all relational database systems. Even non-SQL MPP environments like Map-Reduce implementations do not define loops, and where there is an SQL language on a Map-Reduce platform such as Aster Data, there is still no concept of a cursor.
To take full advantage of any RDBMS, the application programmer must think of the engine as a whole and write in the language that the engine can best use. That language is Set SQL. As we have been exploring in this blog, even the old cursor-oriented code can be converted to set processing. Future posts will continue to offer options for converting this old code and get your processing into the 21st century.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.