The merge join requires that both tables (spool files) be sorted on the hash-code of the columns being joined (or subset of columns being joined). This will be the case if they both have the same primary index, or if the optimizer decides that the cost is not too great to sort one or both tables (or spool) to get them in this state.
The hash join does not require that both tables are sorted. The smaller table/spool is "hashed" into memory (sometimes using multiple hash partitions). Then, the larger table is scanned and for each row, it looks up the row from the smaller table in the hashed table that was created in memory. If the smaller table must be broken into partitions to fit into memory, the larger table must also be broken into the same partitions prior to the join.
So, the main advantage of the hash join is that the big table does not have to be sorted and the small table can be much larger than for a product join. However, if the optimizer thinks that the small table is too large, then it will not choose the hash join as it will not be able to fit the small table in memory, even after breaking into partitions.
As with everything, the optimizer will try to determine the best path based on the cost of the alternatives.
Hi Barry, Thanks for reply, but I am finding that Hash join would do the product join, because one by one it is picking all the rows from small table and comparing(scanning) this with all the rows in the larger table.
The difference between the hash join and the product join is what is done with the small table. In the product join, the small table is scanned for each row in the large table.
In the hash join, the small table is loaded into memory based on the hashing of the join columns. So, each time a row from the large table is read, it can take the join columns from the large table through the hashing algorithm which will then point it directly to the correct row(s) from the small table. So, this is a direct access to the small table, rather than a scan of it.
In addition, the hash join supports the notion of partitions. So, even if the small table if fairly large, it can be broken into pieces and each partition is loaded one at a time into memory. Of course, in this case it also has to rearrange the large table into the same partitions to avoid scanning rows from the large table multiple times. This partitioning concept allows the hash join to be done even if the small table is not that small.
Hi Barry, Since it is not the case of merge join and nested join, but how will it decides between product join and Hash Join i.e. as per my understanding I am finding that when, one table is very large and another table is very small.
When the small table is very small (I don't know where the exact cutoff is, but I would guess that it's one data block), the optimizer will choose a product join. As the size of the small table gets larger, it will switch to a hash join.