How do I troubleshoot my Direct Connect BGP when it goes from UP to DOWN status?

I want to troubleshoot why my Border Gateway Protocol (BGP) session goes from an UP status to a DOWN status and stays in an idle state.

Resolution

If your BGP session goes DOWN, then check the following:

Check for any ongoing AWS Direct Connect maintenance

Direct Connect connections can go DOWN because of ongoing AWS maintenance. Maintenance can last from a few minutes to a few hours, and during this time BGP sessions move to an idle state. The owner of the Direct Connect connection can view AWS maintenance notifications in the Events section of the AWS Personal Health Dashboard.

To configure notifications for Direct Connect scheduled maintenance, see How can I get notifications for Direct Connect scheduled maintenance or events?

Check the Direct Connect link status

For your BGP to be in an UP status, the physical Direct Connect link or layer 1 must also be UP. Check your physical Direct Connect link on the AWS Management Console on the Connections page or by using CloudWatch connection state metrics. If there are issues with your physical layer, then troubleshoot layer 1 (physical layer).
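The CloudWatch check can be scripted. The following is a minimal sketch, not an official procedure: it assumes the ConnectionState metric in the AWS/DX namespace reports 1 for UP and 0 for DOWN, that boto3 is installed with credentials and a Region configured, and it uses a placeholder connection ID (dxcon-EXAMPLE). The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def connection_state_query(connection_id, minutes=60):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DX",
        "MetricName": "ConnectionState",
        "Dimensions": [{"Name": "ConnectionId", "Value": connection_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,
        "Statistics": ["Minimum"],
    }


def down_periods(datapoints):
    """Return sorted timestamps of datapoints where the link reported DOWN."""
    # Assumption: a Minimum below 1 within a period means the link was DOWN
    # at least once during that period.
    return sorted(p["Timestamp"] for p in datapoints if p["Minimum"] < 1)


def fetch_down_periods(connection_id, minutes=60):
    """Query CloudWatch and list the periods in which the link was DOWN."""
    import boto3  # third-party; imported lazily so the helpers work offline

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **connection_state_query(connection_id, minutes)
    )
    return down_periods(response["Datapoints"])


if __name__ == "__main__":
    # "dxcon-EXAMPLE" is a placeholder; substitute your own connection ID.
    for timestamp in fetch_down_periods("dxcon-EXAMPLE"):
        print("Link DOWN during the minute starting", timestamp)
```

If the list is empty for the time window in which BGP went DOWN, the physical link probably stayed UP and the cause is more likely at layer 2 or in the BGP session itself.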

Check if you can ping the Amazon peer IP address from your on-premises router

If you can't ping the Amazon peer IP address, then layer 2 connectivity isn't established and Address Resolution Protocol (ARP) can't resolve the peer. To fix this, troubleshoot layer 2 (data link) issues and check the following:

  • Whether ARP is flapping on your side or on your partner's side.
  • Whether a new device was introduced in the path. If so, then make sure that it allows the VLAN that's configured for the virtual interface (VIF).

Check the BGP debug logs on your customer gateway device

BGP sessions go DOWN for multiple reasons. In the debug logs, the subcode of the BGP Cease NOTIFICATION message (error code 6) helps identify the reason. The following is a list of common subcodes and methods to resolve them:
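To make the subcodes easier to spot in a long debug log, a small lookup can translate them into names. The following sketch uses the Cease subcode names from RFC 4486 (subcodes 1 to 8), RFC 8538 (subcode 9), and RFC 9384 (subcode 10). It assumes that your router logs the pair in the same "6/N" notation used in this article; log formats vary by vendor, and the sample log line is made up for illustration.

```python
import re

# Cease NOTIFICATION (error code 6) subcodes and their standard names.
CEASE_SUBCODES = {
    1: "Maximum Number of Prefixes Reached",
    2: "Administrative Shutdown",
    3: "Peer De-configured",
    4: "Administrative Reset",
    5: "Connection Rejected",
    6: "Other Configuration Change",
    7: "Connection Collision Resolution",
    8: "Out of Resources",
    9: "Hard Reset",
    10: "BFD Down",
}

# Match "6/N" but not fragments of longer numbers or dates such as 16/4 or 6/10/24.
_CEASE_PATTERN = re.compile(r"(?<![\d/])6/(\d{1,2})(?![\d/])")


def find_cease_subcodes(log_text):
    """Return a list of (subcode, name) tuples in the order they appear."""
    results = []
    for match in _CEASE_PATTERN.finditer(log_text):
        subcode = int(match.group(1))
        results.append((subcode, CEASE_SUBCODES.get(subcode, "Unknown")))
    return results


if __name__ == "__main__":
    # Hypothetical log line; real output depends on the router vendor.
    sample = "BGP NOTIFICATION received from neighbor 169.254.255.1 6/7 0 bytes"
    for subcode, name in find_cease_subcodes(sample):
        print(f"Cease 6/{subcode}: {name}")
```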

BGP Cease NOTIFICATION code 6 subcode 1 (6/1)

Check whether your on-premises router advertises more routes than the BGP session allows. On a private or transit VIF, the limit for prefixes advertised from on-premises to AWS is 100 each for IPv4 and IPv6. On a public VIF, the limit is 1,000 routes per BGP session. These limits can't be increased. If you exceed a limit, then summarize your routes or advertise a default route to reduce the number of prefixes.
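The limits above can be checked against a list of prefixes exported from your router. The following is a rough sketch under two assumptions: the prefixes are available as CIDR strings, and each BGP session carries a single address family, so the count is taken per family. The function and dictionary names are hypothetical.

```python
import ipaddress

# Limits as stated in this article: prefixes advertised to AWS per session.
PREFIX_LIMITS = {"private": 100, "transit": 100, "public": 1000}


def prefix_limit_report(prefixes, vif_type):
    """Count unique IPv4 and IPv6 prefixes and flag any family over the limit."""
    limit = PREFIX_LIMITS[vif_type]
    # Normalizing through ip_network removes duplicates such as repeated entries.
    networks = {ipaddress.ip_network(p, strict=False) for p in prefixes}
    report = {}
    for version, label in ((4, "ipv4"), (6, "ipv6")):
        count = sum(1 for n in networks if n.version == version)
        report[label] = {"count": count, "limit": limit, "exceeded": count > limit}
    return report


if __name__ == "__main__":
    # Made-up example: 101 IPv4 prefixes and one IPv6 prefix on a private VIF.
    advertised = [f"10.0.{i}.0/24" for i in range(101)] + ["2001:db8::/32"]
    for family, result in prefix_limit_report(advertised, "private").items():
        print(family, result)
```

A family that is flagged as exceeded is a likely cause of a 6/1 notification, and summarizing the routes into fewer, larger prefixes brings the count back under the limit.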

BGP Cease NOTIFICATION code 6 subcode 4 (6/4) and BGP Cease NOTIFICATION code 6 subcode 10 (6/10)

  • Check Amazon CloudWatch metrics such as ConnectionBpsEgress, ConnectionBpsIngress, VirtualInterfaceBpsEgress, and VirtualInterfaceBpsIngress, and make sure that the bitrate hasn't reached the maximum capacity of the connection.
  • Check for packet loss between the two peers. This includes checking metrics, interface counters, CPU, memory, port utilization, drops, and discards in your router or firewall.
  • Check for interface input and output errors such as CRC, frame, collisions, and carrier by using show interface statistics.
  • Check for worn connectors. If they're worn, then clean or replace the fiber patch lead and SFP module.
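The capacity check in the first bullet comes down to comparing the peak bitrate with the port speed. The following sketch assumes that you already exported the bits-per-second samples (for example, from ConnectionBpsEgress) as plain numbers. The 80 percent warning threshold is an arbitrary assumption for illustration, not an AWS recommendation.

```python
def peak_utilization(bps_samples, port_speed_bps):
    """Return the peak utilization of the port as a percentage."""
    if not bps_samples:
        return 0.0
    return 100 * max(bps_samples) / port_speed_bps


def near_capacity(bps_samples, port_speed_bps, threshold_pct=80.0):
    """True when the peak sample reaches the (assumed) warning threshold."""
    return peak_utilization(bps_samples, port_speed_bps) >= threshold_pct


if __name__ == "__main__":
    # Made-up samples for a 1 Gbps connection.
    samples = [200_000_000, 640_000_000, 950_000_000]
    print(f"Peak utilization: {peak_utilization(samples, 1_000_000_000):.1f}%")
    print("Near capacity:", near_capacity(samples, 1_000_000_000))
```

CloudWatch averages can hide short bursts, so a peak that looks moderate at a one-minute or five-minute period doesn't rule out brief saturation.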

If you're having further issues, see How can I troubleshoot Direct Connect network performance issues?

BGP Cease NOTIFICATION code 6 subcode 6 (6/6)

This notification message appears in your debug logs when the BGP configuration changed on your customer gateway or on the AWS end. If the change was on the AWS end, then check AWS CloudTrail for recorded activity of the VIF API action UpdateVirtualInterfaceAttributes. The CloudTrail record shows which user made the configuration change.
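The CloudTrail lookup can be sketched with boto3. This example assumes that boto3 is installed with credentials and a Region configured, and that the event's requestParameters contain a virtualInterfaceId field; treat that field name as an assumption and inspect a real event to confirm it. The function names are hypothetical.

```python
import json


def summarize_vif_changes(events):
    """Reduce CloudTrail events to the time, user, and VIF of each change."""
    summary = []
    for event in events:
        # CloudTrailEvent is a JSON string that holds the full event record.
        detail = json.loads(event.get("CloudTrailEvent") or "{}")
        params = detail.get("requestParameters") or {}
        summary.append(
            {
                "time": event.get("EventTime"),
                "user": event.get("Username"),
                "vif": params.get("virtualInterfaceId"),  # assumed field name
            }
        )
    return summary


def fetch_vif_changes(max_results=50):
    """Look up recent UpdateVirtualInterfaceAttributes calls in CloudTrail."""
    import boto3  # third-party; imported lazily so the helper works offline

    cloudtrail = boto3.client("cloudtrail")
    response = cloudtrail.lookup_events(
        LookupAttributes=[
            {
                "AttributeKey": "EventName",
                "AttributeValue": "UpdateVirtualInterfaceAttributes",
            }
        ],
        MaxResults=max_results,
    )
    return summarize_vif_changes(response["Events"])


if __name__ == "__main__":
    for change in fetch_vif_changes():
        print(change)
```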

If no configuration changes were made on either side, then contact AWS Support.

BGP Cease NOTIFICATION code 6 subcode 7 (6/7)

This notification appears as a result of a connection collision. A connection collision occurs when a pair of BGP speakers try to establish a connection with each other at the same time and two parallel connections form. To resolve this issue, complete the following steps:

  1. Manually shut down the BGP peering on the customer gateway device. Reconnect after a few minutes.
  2. If this issue occurs frequently, then configure your customer gateway device as the BGP server (passive mode). In this mode, the customer gateway device doesn't initiate the TCP handshake. Instead, it listens on TCP port 179 and accepts connection requests from the peer.

Check for recent configuration changes at your on-premises router

  • Check if TCP port 179 is blocked at the firewall.
  • Check if the BGP configuration was accidentally modified. Verify the local and remote ASN, the MD5 password, and the peer IP address.
  • Check if a recent NAT configuration change now covers the BGP local IP address or the interface IP address. Because the TCP MD5 signature covers the source and destination IP addresses, translating BGP packets causes the session to fail with an MD5 password mismatch. To resolve this, deny access for the interface IP address or allow access for the desired network to the public interface IP address.
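The configuration check in the list above can be sketched as a comparison between the values you expect for the VIF and the values currently on the router. The field names (local_asn, remote_asn, peer_ip, md5_key) are hypothetical, and the sketch assumes that you extract the running values from the router yourself. It reports only which fields differ, so that the MD5 key is never printed.

```python
BGP_FIELDS = ("local_asn", "remote_asn", "peer_ip", "md5_key")


def bgp_config_drift(expected, actual):
    """Return the names of BGP settings that differ or are missing."""
    return [
        field
        for field in BGP_FIELDS
        if expected.get(field) != actual.get(field)
    ]


if __name__ == "__main__":
    # Made-up values for illustration only.
    expected = {
        "local_asn": 65000,
        "remote_asn": 64512,
        "peer_ip": "169.254.255.1",
        "md5_key": "example-key",
    }
    running = dict(expected, remote_asn=64513)
    print("Fields that drifted:", bgp_config_drift(expected, running))
```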

If you have a partner or last-mile service provider, then contact them and inquire about any maintenance events on their end.

AWS OFFICIAL
Updated a year ago