Declarative vs Procedural approach to Data Science

Learn Aster
Teradata Employee

Declarative vs Procedural:

If you ask questions or solve day to day business problems with a language like SQL, you are using a declarative language. However, if you are using Scala, Python, Java, Perl  etc.,, you are doing procedural stuff.

Declarative languages do not have WHILE{} and FOR{} loops or nested iterations and such. The language primarily depends on stable data structures and frequently used logic/design patterns to be a prerequisites. In relational world, structure is provided by tables and logic is provided by joins, functions, casts, etc., and hence SQL is considered as a declarative language. In fact, if you need any WHILE or FOR loops, they are hidden under stored procedures which are actually 'declared', so it's invoked at the SQL layer abstraction by the user. 

Justification for Declarative: Data is structured, so logic can be imposed on it. Usually ACID compliant (transactions/committed data etc.,).

Procedural languages don't expect rigid formalized structures or to be available to them always. You code in Python, R, Java or Scala to unravel the structures and perform analysis. You can dig through JSON structures, parse binary streams etc., in C, C++, java etc., For accessing relational databases, procedural code can always embed a declarative language like SQL and invoke it through a JDBC layer, however the results are often analyzed by procedural code.

Justification for Procedural: Data cannot be structured that easily and only procerdural code or a programming language can wrangle it case by case. ACID compliance etc., is after thought.

Reality is that most users use either a procedural or a declarative interface as  primary access and then optionally use embed the other. As an example, I use SQL as my primary method for solving problems from a tool like Aqua Data Studio or Teradata Studio with optionally embedding my java code in the Aster SQL/MR functions. My colleague uses R and Python as a front end that calls R and Python libraries that has embedded SQL statements in it. One can argue that SQL has also a procedural interface with variables and cursors like PL/PGSQL and so a procedural language! Yes, that's true too. The interface is procedural, but the access method continues to be declarative. Procedural SQL comes to aid where standard SQL falls short of certain types of processing hard to express in declarative form.

Trade-offs between Procedural and Declarative interfaces - Why some prefer one over the other:

It is unfair to over generalize which set of users prefer which, but there is a subtle pattern. Feel free to comment at the end if you disagree. Declarative interfaces like SQL are meant to simplify the common tasks with a standardized abstraction. Most of the common business tasks/questions/queries can be expressed in SQL. A lot of business analysts use SQL. As more machine learning and analytics becomes well understood and mainstream, they also get embedded and available in SQL as a standardized way of accessing. Teradata Aster SQL/MR is a great example. 

You can see how invoking graph and pathing analytics is easier and intuitive through a native SQL declarative interface.The last SQL statement above invokes a procedural code like R through the SQL stream interface. While using this approach, the declarative interface tries to abstract the map/reduce, graph or other architectural & procedural nuances from the business user! The Teradata Aster SQL/MR/GR also works with semi structured and unstructured data, so you don't need to be writing procedural code to extract information. This is for those users who want to eliminate the programming bottleneck to get to the answer quickly.

A counter example is how you'd invoke SQL from a procedural construct like Java:

The above is from the Postgres user guide.The Java code uses the JDBC interface to access a relational table called mytable.

Here's another example of R procedural code talking to the Aster database, mapping a virtual data frame to a relational table and creating a word cloud out of it. No SQL was involved here.

What about Apache Spark ?

Apache Spark is primarily a NOSQL environment. The primary access method is from a procedural language like Scala, R, Python, Java etc., It also supports SPARK SQL which is again SQL that is referred within the code to work on data frames. See Scala code below that embeds SQL inside.

 

So who would use which ?

Programmers, NO SQL oriented coders & Data Scientists usually love the procedural interface (like R/Python) with the amount of flexibility it provides. However Business Analysts and/or Business Scientists would prefer the simplified declarative type construct as all the data science tools are available packaged to them today to do iterative discovery. Both camps would however agree that time to analytics & insights is the key regardless of which method is used and keep them productive.

Insights sharing tools like Teradata Aster App Center, R Shiny also makes sure that regardless of the method used, the workflow is packaged to be consumed to the end user with a browser interface ...