I am encountering an unexpected situation when using multiple DataConnector operators UNION ALLed together, and would appreciate your input/suggestions.
I have a process that uses TPT DataConnector and TPT Stream to load files that are being continuously received. The process has been working fine, but requires two instances of the DataConnector operator in order to prevent file reading from being a bottleneck.
I now have the requirement of loading the files in the order received. I can accomplish this through the VigilSortField property of DataConnector, but I cannot use multiple instances when using this property.
In order to get around this limitation, I was hoping to achieve the same parallelism by using two distinct DataConnector operators, each processing half the files and UNION ALLing into one Stream operator. This approach seemed to fit my situation well, since two files with different file patterns are received every 5 seconds.
However, I am running into an issue because one of the two files is consistently larger. The DataConnector operator processing the larger files is consistently falling behind, and is unable to ever process its backlog.
Based on the feinholz quote below, I understand that the balancing between the two operators is based on file size rather than file count, so I believe it is expected that the operator processing the larger files would process fewer files per checkpoint.
"When using multiple instances to read from multiple files, we load balance the files across the instances according to the file sizes."
However, my scenario, in which files are constantly received, is creating a situation in which this "larger-file" operator is never able to process its backlog. I would guess that this is occurring because whenever either DataConnector finishes processing all files detected during its directory scan, it causes both operators to checkpoint.
Can anyone suggest a way to resolve this issue? Is there any setting that will change the balancing to file count rather than file size, or prevent the checkpoint from occurring until both DataConnector operators have processed their entire directory scan?
(I am using TPT 14.0, with VigilMaxFiles set to the maximum of 50,000 and VigilWaitTime set to 1. No -l latency_interval is set on the tbuild command.)
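The suspected mechanism can be made concrete with a toy model. The Python sketch below is not TPT code and does not claim to reflect TPT internals: the function name, the per-file read costs, the checkpoint cost, and the rule that an operator with nothing pending ends its scan after a one-second wait are all assumptions chosen for illustration. It compares a policy where the checkpoint fires as soon as either reader finishes its scan with one where it waits for both.

```python
def simulate_backlog(wait_for_both, horizon=600.0, arrival_interval=5.0,
                     small_cost=0.5, large_cost=3.0,
                     checkpoint_cost=2.0, idle_wait=1.0):
    """Return reader B's backlog (in files) once the horizon is reached.

    Hypothetical model: one small file (reader A) and one large file
    (reader B) arrive every arrival_interval seconds. All costs are
    made-up numbers in seconds, not measurements.
    """
    now = 0.0
    done_a = 0.0  # files finished by A
    done_b = 0.0  # files finished by B (fractional progress allowed)
    while now < horizon:
        arrived = int(now // arrival_interval) + 1  # pairs received so far
        time_a = (arrived - done_a) * small_cost    # seconds A needs for its scan
        time_b = (arrived - done_b) * large_cost    # seconds B needs for its scan
        # Assumption: a reader with little or nothing pending still spends
        # idle_wait seconds before its scan is considered finished.
        scan_a = max(time_a, idle_wait)
        scan_b = max(time_b, idle_wait)
        if wait_for_both:
            read_time = max(scan_a, scan_b)  # checkpoint after both finish
        else:
            read_time = min(scan_a, scan_b)  # checkpoint when either finishes
        done_a += min(time_a, read_time) / small_cost
        done_b += min(time_b, read_time) / large_cost
        now += read_time + checkpoint_cost   # no reading during the checkpoint
    arrived = int(now // arrival_interval) + 1
    return arrived - done_b


if __name__ == "__main__":
    print("checkpoint when either finishes:", simulate_backlog(False))
    print("checkpoint when both finish:   ", simulate_backlog(True))
```

With these made-up defaults, the wait-for-both policy settles at a backlog of one file, while the early-checkpoint policy leaves the large-file reader dozens of files behind. That only shows the hypothesis is self-consistent; it says nothing about what TPT actually does.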
A few thoughts.
If you are using the Stream operator, I doubt the file reading will ever be the bottleneck.
The Stream operator will probably always be running slower than the file reading.
Next, if you are worried about order, the UNION ALL will not help you. The data from the 2 DC operators will be merged together into the same data streams that feed the Stream operator and order cannot be guaranteed.
I am also not quite sure what you mean by "The DataConnector operator processing the larger files is consistently falling behind". Logically speaking, processing large files will always take longer than processing small files. Thus, the instance that is reading the small files will always finish prior to the instance processing the larger files.
Another thing. If you are using UNION ALL, the 2 DC operators do not know about each other. They run independently of each other, and thus assignment by file size does not come into play. Only when multiple instances of a single DC operator are specified will the files be spread out between the instances according to size.
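For readers wondering why size-based balancing leads to uneven file counts per instance, here is a rough Python sketch. The thread does not describe the algorithm TPT actually uses, so a common greedy heuristic (largest file first, assigned to the least-loaded instance) is used as a stand-in; the function name and the file names are hypothetical.

```python
import heapq


def balance_by_size(file_sizes, instance_count):
    """Assign files to instances so total bytes per instance stay even.

    file_sizes: dict mapping file name to size in bytes.
    Returns one list of file names per instance.
    Stand-in heuristic only; not taken from TPT.
    """
    # Heap entries are (bytes assigned so far, instance index).
    heap = [(0, i) for i in range(instance_count)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(instance_count)]
    # Largest files first; ties broken by name so the result is deterministic.
    ordered = sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        load, idx = heapq.heappop(heap)          # least-loaded instance
        assignment[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return assignment


if __name__ == "__main__":
    sizes = {"a.dat": 900, "b.dat": 100, "c.dat": 100,
             "d.dat": 100, "e.dat": 100}
    print(balance_by_size(sizes, 2))
```

In this example one instance receives only the single large file and the other receives the four small ones, so the byte totals stay close while the file counts differ.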
Thanks for the reply feinholz. I am not going to pursue multiple reader operators for this process anymore, but I would like to reply to your individual thoughts too.
You are right that file reading is not the actual bottleneck. In my situation, the overall performance is sufficient (read: amazing) when I increase the number of Stream instances. But, I ran into deadlock issues due to rowhash locking, and had to enable Serialize. Since this limited me to one Stream instance, I compensated with additional DataConnector instances, and did see an improvement in performance. But since I agree that file reading is not the bottleneck, I assume this was due to some incidental benefit, such as the multiple DataConnectors increasing the number of IO buffers, and thereby better distributing the data sent to Stream. If this is the case, perhaps changes to other settings would provide the same benefit without requiring the second operator. But trial and error has not led me there yet.
Regarding UNION ALL not guaranteeing order, I understand. This process is only concerned with file order, not row order. The only desire is to load older files before newer files, when there is a backlog.
It's good to know that the file size balancing does not apply in this case. But to rephrase the part that you did not follow -- what concerns me is that it appears that when DataConnector A has finished processing all files in its current directory scan, it and DataConnector B both perform a checkpoint, even though DataConnector B has more files left that it could have processed. I understand that B would naturally take longer since it is processing larger files, but I worry that this premature checkpoint is the cause of DataConnector B falling behind, especially since one DataConnector on its own is sufficient to handle both sets of files. So I wondered if there was any setting (such as the -z option) that could change this behavior, so that DataConnector B would finish all the files in its active directory scan before checkpointing.
It seems I may have wandered off into a use case that isn't worth supporting, though, and that I'm throwing the wrong tools at the problem of increasing throughput when both Serialize and VigilSortField are required.
I will look into the checkpoint issue, to see if we have a synchronization issue regarding the use of UNION ALL. I know that when using multiple instances of a single DC operator, the checkpoint does not take place until all instances complete the processing of the files in their directory.