Database restart - system was in kernel: WARNING: BLLI: 140002903 #lost contact with node 1-1

Teradata Database on Azure
Teradata Employee


## Split net also detected on this node after restart

```
Sep 22 16:28:02 kernel: WARNING: BLLI: 140002903 #lost contact with node 1-1, restart initiated.
Sep 22 16:28:02 kernel: klogd 1.4.1, ---------- state change ----------
Sep 22 16:28:03 Teradata[6892]: INFO: Teradata: 13014 #Event number 33-13014-00 (severity 20, category 3) TPA reset generated by Bynet driver.
Sep 22 16:28:03 kernel: INFO: BLLI: 140002603 #online with 1 nodes.
Sep 22 16:28:08 Teradata[6892]: INFO: Teradata: 13006 #Event number 33-13006-00 (severity 10, category 12) split net detected.
```
From past incidents we usually see some kind of temporary network outage or hardware issue that causes this. On cloud environments such as AWS and Azure we are seeing temporary network timeouts that sometimes last longer than 10 seconds (the current allowed network timeout value for the BYNET). Systems that have two nodes are more susceptible to a split net condition than systems that have three or more nodes. A single-node system is not susceptible to a split net condition.
On a cloud-based system, the resolution is to make sure network connectivity is restored and then bring the system up. The steps are as follows:
1) Run `bam -s` to check the network (sample output below):

```
# bam -s
Version information: BLM commands BLM driver BLM protocol
Node is running in protocol emulation mode.
Node state: attached
Nodes   Routes  Net name
2       2       eth0:0-udp-1001-   <== Make sure you see all of your nodes (2 nodes in this case)
2       2       eth0:0-udp-1002-
```
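If you want to script this check, a minimal sketch is below. It only parses text in the `bam -s` output format shown above (the column layout is assumed from that sample); the helper name `count_nodes` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `bam -s` output on stdin and prints the node
# count reported on the first net line (column layout assumed from the
# sample output in this post: Nodes  Routes  Net name).
count_nodes() {
  awk '$3 ~ /udp/ { print $1; exit }'
}

# Demo against the sample output shown above:
printf '2       2       eth0:0-udp-1001-\n2       2       eth0:0-udp-1002-\n' | count_nodes
# prints: 2
```

In practice you would pipe the live command into it (`bam -s | count_nodes`) and compare the result against the number of nodes you expect before proceeding to the next step.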
2) Check that all vprocs are online with vprocmanager. If any AMPs are offline, set them online (example: `set 0 online`):
```
SMP001-01:~ # /usr/tdbms/bin/vprocmanager
    |                                |              |
    |    ___     __     ____         |    ____    __|__    ____
    |   /      |/  \    ____|    ____|    ____|     |      ____|
    |   ---    |       /    |   /    |   /    |     |     /    |
    |   \___   |       \____|   \____|   \____|     |__   \____|
    Release Version
    VprocManager Utility (Sep 98)

Enter a command, HELP or QUIT:
status not
SYSTEM NAME: mpp                                            17/12/05 09:20:34
All DBS vprocs are fully online.
All PDE nodes are fully online.
Enter a command, HELP or QUIT:

Exiting VprocManager...
SMP001-01:~ #
```
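For a scripted health check, a sketch that scans captured vprocmanager status output for the two "fully online" lines shown in the session above (the function name `all_online` is hypothetical, and the exact output wording is assumed from this post's sample):

```shell
#!/bin/sh
# Hypothetical check: reads vprocmanager status output on stdin and exits 0
# only when both "fully online" summary lines from the sample session are
# present (DBS vprocs and PDE nodes).
all_online() {
  awk '/All DBS vprocs are fully online/ { d = 1 }
       /All PDE nodes are fully online/  { p = 1 }
       END { exit !(d && p) }'
}

# Demo against the sample session output:
if printf 'All DBS vprocs are fully online.\nAll PDE nodes are fully online.\n' | all_online; then
  echo "all vprocs online"
fi
```

If vprocmanager accepts commands on standard input, something like `echo status | /usr/tdbms/bin/vprocmanager | all_online` could drive it non-interactively, but that piping behavior is an assumption worth verifying on your release.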
3) Run `/etc/init.d/tpa start` on one node only to start the database.
We have created JIRAs to address these split net and bynet issues:
1) TAWS-2008 - Two-node systems on AWS (and Azure) seem to have low system availability due to hitting the split net issue
2) TAWS-2009 - Request to lengthen the BYNET timeout on AWS (and Azure) platforms - we plan to provide a package to raise the timeout value from 10 seconds to a higher value to be determined in the JIRA.
We also recommend contacting the cloud vendor to get their server logs, which would show network outages, network errors, and hardware errors. If the vendor will not provide these logs, ask them to check for hardware or network errors around the time of the problem.

The bynet split issue is by design on a 2-node system. The typical case on a node failure is that the larger set of surviving nodes forms a quorum and comes up as the active system. In the case of a 2-node system where both nodes are still alive but just not communicating with each other, a quorum cannot be reached, and to avoid the problem of having two active systems, the DBS is shut down as a preventative measure. This is somewhat unique to the public cloud environments: the BYNET relays are part of the nodes themselves, so by configuration each node in a 2-node system has everything it needs to be a functional system on its own. However, 2-node systems have always been problematic regardless of where they are deployed.

A Teradata system is a closely-coupled system, and the nodes are always cross-verifying that each node is functional and communicating. There is currently a 10-second timeout in this check mechanism: if communications are lost for longer than that, the system goes into error recovery. The investigation should focus on determining why communication was lost for such a short period of time, triggering the timeout. In Azure we know that service events can cause a node to "stall" for up to 30 seconds, and we have a mechanism in place to be notified of these events and extend the timeout to ride through them. AWS, however, does not operate in the same manner, and we don't know of anything in that infrastructure that would produce these short outages. It is important to include AWS to correlate any timeouts on our end with activity on their end. If we can't find a correlation and these are recurring events, then we should look carefully into how long the communication outage lasted. The timeout value is tunable and can be extended, but this needs to be done carefully, as it also affects how quickly the system reacts to a real failure before initiating error recovery.

According to KA S110008EA0A, the definition of a split net is:

A split bynet condition exists when the following conditions are met:

When the system is coming up, it checks rule (1):
Rule 1 states the system MUST have > 50% of the total number of nodes available. If this condition IS met, then the system comes up normally. If this condition is NOT met, then the system will check for rule (2):
Rule 2 states that all of the BYNETs (0 and 1) must be present. If rule 2 is true, then the system will come up. If rule 2 is false, then the system experiences a "split bynet condition".
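The two rules can be expressed as a small decision function. This is only an illustrative sketch of the logic as stated in the KA, not actual BYNET code, and the function name `decide` is hypothetical:

```shell
#!/bin/sh
# Sketch of the split-net rules from KA S110008EA0A (illustrative only).
# Usage: decide TOTAL_NODES AVAILABLE_NODES BYNETS_PRESENT
decide() {
  total=$1; avail=$2; bynets=$3
  if [ $((avail * 2)) -gt "$total" ]; then
    echo "up"          # Rule 1: strictly more than 50% of nodes available
  elif [ "$bynets" -eq 2 ]; then
    echo "up"          # Rule 2: both BYNETs (0 and 1) are present
  else
    echo "split-net"   # neither rule satisfied
  fi
}

decide 3 2 1   # prints "up": 2 of 3 nodes is a quorum (rule 1)
decide 2 1 1   # prints "split-net": no quorum and only one BYNET
```

Note how a 2-node system falls through rule 1 when one node is unreachable (1 of 2 is not more than 50%), which is exactly why two-node systems are the most exposed to this condition.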