Showing posts with label Notes. Show all posts

Saturday, June 2, 2018

Cisco 5508 WLC HA Datacenter Migration Notes



I recently needed to relocate a redundant pair of Cisco 5508 Wireless LAN Controllers (WLC) from one datacenter to another.  The HA/SSO pair was servicing an entire geographical region, with 400+ APs associated to it.  The requirement was minimal to no downtime during the relocation, so moving both controllers at the same time wasn’t an option.  The notes below are based on the migration procedure I followed.  I’m sure there are plenty of other ways to do this, but this worked for me.  Hopefully it can help others trying to do the same…


Prerequisite Information

Licensing (Base vs. Adder)

Before starting, I would recommend doing a little research on the WLCs’ licenses.  This may determine which device gets relocated first, and how.  For example, if the primary WLC in the HA pair holds the permanent license that covers the number of APs in your environment, then it might be easier to relocate that WLC first (the HA secondary unit inherits the license, so it will continue to function).  Personally, I would check CCO’s licensing page to confirm which WLC’s serial number each license is registered to.

Also, please be aware of the differences between Base licensing vs. Capacity Adder licensing.  Base licensing is acquired upon the purchase of the device, whereas Capacity Adder licenses are purchased as an upgrade to the Base licensing, adding to the Base AP count.

My understanding of Base licensing in the context of deploying HA is that if you have two WLCs with a 50 count AP license, the end AP count after creating the HA/SSO pair is 50.  Base licenses do not aggregate in HA so if you have a requirement to support 100 APs, then a 50 AP Upgrade Adder license would be necessary.
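
To make the arithmetic in the previous paragraph concrete, here is a minimal Python sketch of that reading of the licensing rules.  It only encodes my understanding as stated above (the standby's Base count is not added to the pair); the function names are made up for illustration and are not part of any Cisco tool.

```python
def ha_pair_ap_capacity(primary_base: int, primary_adders: int = 0) -> int:
    """AP count of an HA/SSO pair under the reading above.

    The pair uses the primary unit's Base plus Adder licenses only;
    the standby unit's Base count is not aggregated.
    """
    return primary_base + primary_adders


def adder_needed(required_aps: int, primary_base: int) -> int:
    """Adder AP count required so the pair can cover required_aps."""
    return max(0, required_aps - primary_base)


if __name__ == "__main__":
    print(ha_pair_ap_capacity(50))   # two 50-AP boxes in HA still give 50
    print(adder_needed(100, 50))     # 50
```

Adder licenses are sold in fixed sizes, so in practice the result of adder_needed would be rounded up to the next size that can be purchased.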


Here are links to documents explaining the Base vs. Capacity Adder type licenses.



HA SKU Licensing

There is also a third license type that can be purchased with a specific intent to only be used as HA.  It has a specific “HA” license (e.g., AIR-CT5508-HA-K9) that has a 0 AP count but will inherit the active WLC's license upon a failure.  This “failover” WLC will fully function and will give a 90 day grace period to repair or RMA the failed active unit.  After 90 days, the HA WLC will alert but will continue to function.  This article does a good job explaining the specifics of a WLC running an HA-SKU license.



My Setup (before migration)

  • 2 Cisco 5508 WLCs connected in HA/SSO located in DC “A”.
  • WLC1 is the Active unit and WLC2 is the Standby unit.
  • Both WLCs have 100 AP Base licenses (AIR-CT5508-100-K9).
  • Both WLCs are running version 8.3.141.0.
  • WLC1’s serial number had the upgraded “adder” license registered with Cisco.  The total AP count I had was 500.

Task
  • Relocate the 2 Cisco 5508 WLCs to DC “Z”.


Migration Process
  • In DC “A” powered off WLC2.  WLC1 continued to operate without any issues but just reported the Standby unit was down.
  • Disconnected all network cables (LAN and Redundancy Port), except the Console connection on WLC2.
  • Powered WLC2 back on.
  • Once WLC2 booted, checked redundancy state and disabled SSO.  The WLC rebooted again.
(Cisco Controller) >show redundancy summary 
            Redundancy Mode = SSO ENABLED 
                Local State = ACTIVE 
                 Peer State = UNKNOWN - Communication Down 
                       Unit = Secondary (Inherited AP License Count = 500)
                    Unit ID = C8:9C:1D:xx:xx:xx
           Redundancy State = Non Redundant
               Mobility MAC = 50:3D:E5:xx:xx:xx
            BulkSync Status = Pending


(Cisco Controller) >config redundancy mode disable 


All unsaved configuration will be saved.
And the system will be reset. Are you sure? (y/n)y


(Cisco Controller) >
Saving the configuration...

Configuration Saved!
System will now reboot!
Creating license client restartability thread

Exit Called
Switchdrvr exited!
Restarting system.

  • After the reboot, WLC2 was reconfigured to be “Primary”.
(Cisco Controller) >config redundancy unit primary 

(Cisco Controller) >show redundancy summary 
 Redundancy Mode = SSO DISABLED 
     Local State = ACTIVE 
      Peer State = N/A 
            Unit = Primary
         Unit ID = C8:9C:1D:xx:xx:xx
Redundancy State = N/A 
    Mobility MAC = 50:3D:E5:xx:xx:xx 

  • Saved the configuration one last time and powered off, then shipped WLC2 to new location (DC “Z”).
  • Once the device arrived at DC “Z”, WLC2 was racked and all physical network connections were made.
    • Note: During the WLC installation at DC “Z”, I opted to connect the Redundancy Port (RP) to an L2 infrastructure switch instead of using a back-to-back connection.  Since this DC was a remote location and I was forced to use a facility-provided onsite technician, I had everything pre-connected to switches I controlled.  This way I was able to bring up the HA connection (i.e., no shut on the RP's switchport) without having to schedule the local resource, which would have been time consuming and disruptive to my schedule.  Connecting the RP to an L2 switch is supported on 7.5 or later code as explained in this document.
  • From WLC2’s console, its Management IP and Default Gateway were changed based on the assigned WLAN VLAN in DC “Z”.  Made sure the new IP was reachable on the network and able to access the WebUI etc.
  • WLC2's hostname and any other location-specific parameters were changed to match its new location.
  • Checked license status again.  Since this box did not have the 500 AP count permanent license (was registered to WLC1’s serial), this WLC’s license defaulted back to the base license of 100 APs.  This was not good considering that 400+ APs needed to be migrated.
  • Forced to enable the 500 AP count evaluation license.  Rebooted the WLC to commit the change.
  • Performed the AP migration.
    • Changed the DHCP server's option 43 to inform the APs of the new controller’s IP address (changed the HEX value as explained in this document.)

    • For all registered APs on WLC1 in DC A, any references to primary, secondary or tertiary controller’s name or IP in the HA section were removed.

    • APs were reset/rebooted.
  • Verified all 400+ APs were registered to DC Z’s WLC (was WLC2) and APs were functioning without issues.  At this point DC A’s WLC can be reconfigured and shipped.
  • Performed a backup of DC Z’s WLC configuration (configured as Primary/Standalone).
  • Copied this configuration to DC A’s WLC and rebooted (First had to disable SSO, since it was the Primary HA unit).  Made sure the new configuration took.
    • Note: This step was done because of my misstep with the licensing.  Since this WLC had the 500 count permanent AP license, this unit had to be the primary in the HA.  The plan was to configure this unit as the new primary HA (with SSO enabled) and to swap it out with the existing WLC when it arrived at DC Z.  The existing WLC would then be reconfigured as the secondary HA with SSO.
  • Configured the Redundancy Management and Peer Redundancy Management IP addresses on DC A’s WLC, enabled SSO and rebooted (Remember that DC Z’s WLC was configured as Primary/Standalone).
  • After the reload, the redundancy status was checked to ensure it was Primary with SSO enabled.
(Cisco Controller) >show redundancy summary 
            Redundancy Mode = SSO ENABLED 
                Local State = ACTIVE 
                 Peer State = UNKNOWN - Communication Down 
                       Unit = Primary
                    Unit ID = 50:3D:E5:xx:xx:xx
           Redundancy State = Non Redundant
               Mobility MAC = 50:3D:E5:xx:xx:xx
            BulkSync Status = Pending
  • Saved configuration, Powered off and shipped DC A’s WLC to DC Z.
  • Installed DC A’s WLC and made all physical network connections.  All switchports were kept in VLAN 1 so they could be enabled without the WLC being on the “network”.  Used CDP to verify that the WLC port to switchport mapping was correct.  Once everything checked out, the switchports were shut down.
  • From the existing DC Z’s WLC, the Redundancy Management and Peer Redundancy Management IPs were changed to the values that the "secondary" unit should have.  Saved configuration.
  • DC Z’s WLC was configured to be the secondary HA unit and SSO was enabled.  The WLC saved the configuration and rebooted.  During this reboot, its switchports were shut down to take it off the network (including the RP port connected to another L2 switch).
  • At this point, the WLCs were swapped by enabling DC A’s WLC on the network.  Verified it was reachable.  The RP port was still shut down.  I ensured that this WLC was the unit with the 500 count permanent AP license.
  • Verified that all the APs were re-registering to this WLC. This took about 30 mins.
    • Note: At this point, since the WLCs were swapped, there was only a momentary blip of downtime.  Any FlexConnect locally switched WLANs would continue to function without issue, however any centrally switched WLANs would be disrupted.
  • After DC Z’s WLC rebooted, the redundancy state was re-verified as Secondary HA with SSO.  After that checked out, this WLC was rebooted again for good measure.  While it was rebooting, its switchports were reconfigured out of VLAN1 (as it was before), the appropriate trunk and VLANs were configured, and finally enabled on the network (including the RP ports for both WLCs).
  • Watched the console on each WLC and verified the primary and secondary negotiation process and bulk configuration sync took place.
  • Verified that after the negotiation process, the redundancy states were satisfactory.  From the primary WLC, once the peer state was showing “standby hot” and bulk configuration sync complete, the HA was fully enabled.

(Cisco Controller) >show redundancy sum
            Redundancy Mode = SSO ENABLED 
                Local State = ACTIVE 
                 Peer State = STANDBY HOT 
                       Unit = Primary
                    Unit ID = 50:3D:E5:xx:xx:xx
           Redundancy State = SSO
               Mobility MAC = 50:3D:E5:xx:xx:xx
            BulkSync Status = Complete
Average Redundancy Peer Reachability Latency = 457 Micro Seconds
Average Management Gateway Reachability Latency = 953 Micro Seconds

  • Initiated SSO testing by performing a “redundancy force-switchover” via the CLI.  Performed this on the primary first to see if the secondary took over without issues.
(Cisco Controller) >redundancy force-switchover 

This will reload the active unit and force a switch of activity. Are you sure? (y/N) y

System will now restart! Creating license client restartability thread

Exit Called
Switchdrvr exited!
Restarting system.
  • CLI view from secondary WLC.
(Cisco Controller) >
Blocked: Configurations blocked as standby WLC is still booting up.
         You will be notified once configurations are Unblocked

Unblocked: Configurations are allowed now...

(Cisco Controller) >show redundancy summary 

            Redundancy Mode = SSO ENABLED 
                Local State = ACTIVE 
                 Peer State = STANDBY HOT 
                       Unit = Secondary (Inherited AP License Count = 500)
                    Unit ID = C8:9C:1D:xx:xx:xx
           Redundancy State = SSO
               Mobility MAC = 50:3D:E5:xx:xx:xx
            BulkSync Status = In-Progress
Average Redundancy Peer Reachability Latency = 519 Micro Seconds
Average Management Gateway Reachability Latency = 750 Micro Seconds

  • Performed the same “redundancy force-switchover” on secondary WLC to test HA on that unit and to preempt the roles.
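
The redundancy checks in the steps above were all done by eye from the console.  As a rough sketch of how the same check could be scripted, the Python below parses pasted "show redundancy summary" text and tests the conditions I looked for (SSO enabled, peer STANDBY HOT, bulk sync complete).  The function names are hypothetical, and it only works on text you have already captured; it does not log in to the controller.

```python
def parse_redundancy_summary(text: str) -> dict:
    """Turn 'Key = Value' lines from 'show redundancy summary' into a dict."""
    fields = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        # Split on the first '=' only, so a value such as
        # 'Secondary (Inherited AP License Count = 500)' stays intact.
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def ha_is_healthy(fields: dict) -> bool:
    """True when the pair shows the states checked by hand in the steps above."""
    return (
        fields.get("Redundancy Mode") == "SSO ENABLED"
        and fields.get("Peer State") == "STANDBY HOT"
        and fields.get("BulkSync Status") == "Complete"
    )


SAMPLE = """\
            Redundancy Mode = SSO ENABLED
                Local State = ACTIVE
                 Peer State = STANDBY HOT
                       Unit = Primary
           Redundancy State = SSO
            BulkSync Status = Complete
"""

if __name__ == "__main__":
    print(ha_is_healthy(parse_redundancy_summary(SAMPLE)))  # True
```

Run against the output taken right after the forced switchover (BulkSync Status = In-Progress), the check returns False until the sync finishes, which matches what I waited for on the console.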

Lessons Learned
  • Again, my lack of research on the WLC licensing made this process a little more complicated than it needed to be.  If I had relocated the WLC with the permanent license first, I wouldn't have needed to "swap" the primary WLCs when joining the HA.  I could have simply configured the second relocated WLC as secondary and joined the HA without any downtime.
  • If the WLC's software needs to be upgraded, this might be a good time to do this.  However research must be done to ensure that the new version doesn't conflict with existing APs etc. (i.e., make sure the new software version supports all of your APs).
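
For the DHCP option 43 change made during the AP migration step, the HEX value for Cisco lightweight APs is a small TLV: type 0xf1, a length byte (4 bytes per controller), then the controller management IPs in hex.  The sketch below builds that string in Python.  The function name is my own and the address in the example is a documentation address, not one of my controllers.

```python
import ipaddress


def cisco_option43_hex(controller_ips: list) -> str:
    """Build the option 43 value for Cisco lightweight APs as a hex string.

    Layout: type 0xf1, one length byte (4 bytes per controller),
    then each controller management IPv4 address in hex.
    """
    if not controller_ips:
        raise ValueError("at least one controller IP is required")
    value = b"".join(ipaddress.IPv4Address(ip).packed for ip in controller_ips)
    if len(value) > 255:
        raise ValueError("too many controllers for a one-byte length field")
    return "f1" + format(len(value), "02x") + value.hex()


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder (documentation range), not a real WLC.
    print(cisco_option43_hex(["192.0.2.10"]))  # f104c000020a
```

How the resulting string is entered depends on the DHCP server in use, so check the server's own syntax before pasting it in.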


Sunday, February 18, 2018

EVPN All-Active Multi-Homing - Load Balancing Traffic Analysis

The test conducted below was my attempt to observe and document EVPN’s all-active load balancing capabilities under normal operating conditions.  By examining the BGP routing and Wireshark traces, my objective was to get a detailed understanding of how the MPLS labels were exchanged and used in an EVPN network to achieve the load balancing behavior.

Test Plan
  • Simulate traffic flow from CE_R27 to CE_R28
  • Observe interface statistics to determine traffic routing
  • Observe BGP routing and label exchange
  • Perform targeted packet captures on PE links
  • Examine packet captures for label usage



Test Traffic

Continuous ping from host CE_R27 to CE_R28.

CE_R27#ping 172.16.50.2 rep 2147483647
Type escape sequence to abort.
Sending 2147483647, 100-byte ICMP Echos to 172.16.50.2, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
.. snip ..


Traffic Observation

CE_R27’s Traffic Distribution

CE_R27’s G1 and G2 Ether-channeled interface.  The input rate was double the output. 

CE_R27#sh int po50 | in rate
  Queueing strategy: fifo
  5 minute input rate 50000 bits/sec, 55 packets/sec
  5 minute output rate 25000 bits/sec, 28 packets/sec

CE_R27’s G1 Ethernet interface to PE_MXR01.  This traffic flow selected G1 as the egress interface and sent 28 packets per second (pps).  The input rate was identical to the output, so this would suggest these were the legitimate replies.

CE_R27#sh int g1 | in rate 
  Queueing strategy: fifo
  5 minute input rate 25000 bits/sec, 28 packets/sec
  5 minute output rate 25000 bits/sec, 28 packets/sec

CE_R27’s G2 Ethernet interface to PE_MXR03.  This interface was receiving 28 pps of extra traffic.

CE_R27#sh int g2 | in rate
  Queueing strategy: fifo
  5 minute input rate 25000 bits/sec, 28 packets/sec
  5 minute output rate 0 bits/sec, 0 packets/sec


PE_MXR01’s Traffic Distribution

PE_MXR01’s AC interface to CE_R27 (Gig1).  This AC interface’s input/output rates were in-line with the CE.

admin@PE_MXR01> show interfaces ge-0/0/2.500 detail | grep pps
     Input  packets:              2613229                   28 pps
     Output packets:               696154                   28 pps

PE_MXR01 MPLS interface to P_R03.  Majority of traffic was via P3 in the MPLS core.

admin@PE_MXR01> show interfaces ge-0/0/1.44 detail | grep pps    
     Input  packets:               698531                   28 pps
     Output packets:              2615863                   28 pps

PE_MXR01 MPLS interface to P_R01.  No traffic out to the P1 core router.

admin@PE_MXR01> show interfaces ge-0/0/1.45 detail | grep pps   
     Input  packets:                    2                    0 pps
     Output packets:                    0                    0 pps


PE_MXR02’s Traffic Distribution

PE_MXR02’s AC interface to CE_R28.  The destination AC interface was seeing 28 pps in both directions.  This would indicate that the end host only received and replied 28 pps worth of traffic, not anything more.

admin@PE_MXR02> show interfaces ge-0/0/2.500 detail |grep pps
     Input  packets:              2265249                   28 pps
     Output packets:              2485439                   28 pps

PE_MXR02’s MPLS interface to P_R03.  At this point in the network, the output rate doubled.  Since the AC interface received only 28 pps from CE_R28, this would suggest that this PE duplicated the traffic.

admin@PE_MXR02> show interfaces ge-0/0/1.46 detail |grep pps   
     Input  packets:              2513443                   28 pps
     Output packets:              3069218                   56 pps

PE_MXR02’s MPLS interface to P_R04.  No traffic to P4.

admin@PE_MXR02> show interfaces ge-0/0/1.47 detail |grep pps    
     Input  packets:                23965                    0 pps
     Output packets:                    0                    0 pps



PE_MXR03’s Traffic Distribution

PE_MXR03’s AC interface to CE_R27 (Gig2).  The redundant AC interface to CE_R27 only saw outbound traffic.  This was the 28 pps of extra traffic that was sent from PE_MXR02.

admin@PE_MXR03> show interfaces ge-0/0/3.500 detail |grep pps
     Input  packets:                    0                    0 pps
     Output packets:               170575                   28 pps

PE_MXR03’s MPLS interface to P_R02.  Again, this was the 28 pps of extra traffic received from PE_MXR02.

admin@PE_MXR03> show interfaces ge-0/0/1.48 detail |grep pps    
     Input  packets:              2412297                   28 pps
     Output packets:                  493                    0 pps

PE_MXR03’s MPLS interface to P_R01.  No traffic to P1.

admin@PE_MXR03> show interfaces ge-0/0/1.49 detail |grep pps   
     Input  packets:                    3                    0 pps
     Output packets:                    0                    0 pps
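
To summarize the counters above in one place, here is a small Python sketch that tabulates a few of the observed rates and computes a simple egress/ingress ratio per node.  The numbers are copied from the outputs in this section; the dictionary layout and the function name are just for illustration.  Keep in mind these are rounded averages (the port-channel reported 55 pps where the two members add up to 56), so the ratio is only approximate.

```python
# Packet rates (pps) copied from the interface outputs above.
# Keys are (device, interface, direction).
OBSERVED_PPS = {
    ("CE_R27", "g1", "out"): 28,              # ICMP requests leaving the CE
    ("CE_R27", "g1", "in"): 28,               # replies via PE_MXR01
    ("CE_R27", "g2", "in"): 28,               # extra replies via PE_MXR03
    ("PE_MXR02", "ge-0/0/2.500", "in"): 28,   # replies arriving from CE_R28
    ("PE_MXR02", "ge-0/0/1.46", "out"): 56,   # towards P_R03
    ("PE_MXR02", "ge-0/0/1.47", "out"): 0,    # towards P_R04
}


def replication_factor(ingress_pps: int, egress_pps: list) -> float:
    """Ratio of packets leaving a node to packets entering it."""
    if ingress_pps == 0:
        raise ValueError("no ingress traffic to compare against")
    return sum(egress_pps) / ingress_pps


if __name__ == "__main__":
    pe2 = replication_factor(
        OBSERVED_PPS[("PE_MXR02", "ge-0/0/2.500", "in")],
        [
            OBSERVED_PPS[("PE_MXR02", "ge-0/0/1.46", "out")],
            OBSERVED_PPS[("PE_MXR02", "ge-0/0/1.47", "out")],
        ],
    )
    print(pe2)  # 2.0
```

A ratio of 1.0 is what I would expect for known unicast traffic; 2.0 at PE_MXR02 points at that PE as the place where the replies were being duplicated.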



Wireshark Traffic Analysis

Capture 1 – Link between PE_MXR01 and P_R03

Capture 1, Frame 2 was the ICMP request from CE_R27 to CE_R28 as seen between PE_MXR01 and P_R03.  This request was sent to CE_R28's MAC address of 00:0c:29:49:aa:8c.  




Based on the Type 2 route lookup for this destination MAC, a zero ESI would indicate that the destination was single-homed and therefore a Type 1 aliasing route lookup wasn't necessary.  As seen from the capture, its label usage (bottom label 299776, top label 328) was consistent with the route lookup.

admin@PE_MXR01> show route table bgp.evpn.0 extensive

bgp.evpn.0: 8 destinations, 8 routes (8 active, 0 holddown, 0 hidden)

.. snip..

2:112.112.112.112:50::500::00:0c:29:49:aa:8c/304 MAC/IP (1 entry, 0 announced)
        *BGP    Preference: 170/-101
                Route Distinguisher: 112.112.112.112:50
                Next hop type: Indirect, Next hop index: 0
                Address: 0xb7a31f0
                Next-hop reference count: 6
                Source: 112.112.112.112
                Protocol next hop: 112.112.112.112
                Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                State: <Active Int Ext>
                Local AS:  2345 Peer AS:  2345
                Age: 4:59       Metric2: 1
                Validation State: unverified
                Task: BGP_2345.112.112.112.112
                AS path: I             
                Communities: target:2345:50
                Import Accepted        
                Route Label: 299776    
                ESI: 00:00:00:00:00:00:00:00:00:00
                Localpref: 100         
                Router ID: 112.112.112.112
                Secondary Tables: EVPN_CUSTOMER_G_ELAN_500.evpn.0
                Indirect next hops: 1  
                        Protocol next hop: 112.112.112.112 Metric: 1
                        Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                        Indirect path forwarding next hops: 1
                                Next hop type: Router
                                Next hop: 10.1.1.81 via ge-0/0/1.44
                                Session Id: 0x0
                        112.112.112.112/32 Originating RIB: inet.3
                          Metric: 1                       Node path count: 1
                          Forwarding nexthops: 1
                                Nexthop: 10.1.1.81 via ge-0/0/1.44


admin@PE_MXR01> show route table bgp.evpn.0

bgp.evpn.0: 8 destinations, 8 routes (8 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

.. snip ..

2:112.112.112.112:50::500::00:0c:29:49:aa:8c/304 MAC/IP       
                   *[BGP/170] 00:17:37, localpref 100, from 112.112.112.112
                      AS path: I, validation-state: unverified
                    > to 10.1.1.81 via ge-0/0/1.44, Push 328
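
For readers who want to relate the two labels to the bytes in the capture, the sketch below packs and unpacks MPLS label stack entries in Python using the standard 32-bit layout (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack flag, 8-bit TTL).  The labels are the ones from this capture (328 on top, 299776 at the bottom); the TTL of 64 and the function names are assumptions made for the example, not values read from the trace.

```python
import struct


def encode_label_entry(label: int, bottom: bool, ttl: int = 64, tc: int = 0) -> bytes:
    """Pack one 32-bit MPLS label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    if not 0 <= label < 2**20:
        raise ValueError("label must fit in 20 bits")
    word = (label << 12) | ((tc & 0x7) << 9) | (int(bottom) << 8) | (ttl & 0xFF)
    return struct.pack("!I", word)


def decode_label_stack(data: bytes) -> list:
    """Read entries until the bottom-of-stack bit is set.

    Returns a list of (label, tc, s_bit, ttl) tuples, top of stack first.
    """
    entries = []
    for offset in range(0, len(data), 4):
        (word,) = struct.unpack("!I", data[offset:offset + 4])
        entry = (word >> 12, (word >> 9) & 0x7, (word >> 8) & 0x1, word & 0xFF)
        entries.append(entry)
        if entry[2]:  # S bit set: this was the last label in the stack
            break
    return entries


if __name__ == "__main__":
    stack = encode_label_entry(328, bottom=False) + encode_label_entry(299776, bottom=True)
    for label, tc, s_bit, ttl in decode_label_stack(stack):
        print(label, "bottom of stack" if s_bit else "top/transport")
```

The top entry is the transport label pushed towards 10.1.1.81 (Push 328), and the bottom entry is the Route Label from the Type 2 route.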



Capture 2 – Link between P_R03 and PE_MXR02

Capture 2, Frame 1 was the ICMP request from CE_R27 to CE_R28 as seen from P_R03 to PE_MXR02.  This request was sent to CE_R28's MAC address of 00:0c:29:49:aa:8c.  





As P_R03 forwarded the packet to PE_MXR02, it popped the top label of 328.  The VPN label of 299776 was looked up, associated to the EVPN instance and the packet was then delivered to CE_R28.  At this point, everything looked normal.

admin@PE_MXR02> show route table mpls.0 label 299776

mpls.0: 52 destinations, 52 routes (52 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

299776             *[EVPN/7] 00:18:26, routing-instance EVPN_CUSTOMER_G_ELAN_500, route-type Ingress-MAC, vlan-id 500
                      to table EVPN_CUSTOMER_G_ELAN_500.evpn-mac.0



Capture 2, Frame 2 was the ICMP reply from CE_R28 to CE_R27 as seen from PE_MXR02 to P_R03.  According to Wireshark’s analysis (arrows), this was Frame 1’s corresponding reply. 





The initial Type 2 route lookup would indicate the destination host was multi-homed and as a result, the Type 1 aliasing route's labels should have been used to load balance the return traffic (label 300640 to PE_MXR01 and label 300512 to PE_MXR03).

admin@PE_MXR02> show route table bgp.evpn.0 extensive   

bgp.evpn.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)

.. snip ..

1:111.111.111.111:50::112233445566778899::0/192 AD/EVI (1 entry, 0 announced)
        *BGP    Preference: 170/-101
                Route Distinguisher: 111.111.111.111:50
                Next hop type: Indirect, Next hop index: 0
                Address: 0xb7a5950
                Next-hop reference count: 10
                Source: 111.111.111.111
                Protocol next hop: 111.111.111.111
                Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                State: <Active Int Ext>
                Local AS:  2345 Peer AS:  2345
                Age: 20:54:43   Metric2: 1
                Validation State: unverified
                Task: BGP_2345.111.111.111.111
                AS path: I
                Communities: target:2345:50
                Import Accepted
                Route Label: 300640
                Localpref: 100
                Router ID: 111.111.111.111
                Secondary Tables: EVPN_CUSTOMER_G_ELAN_500.evpn.0
                Indirect next hops: 1
                        Protocol next hop: 111.111.111.111 Metric: 1
                        Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                        Indirect path forwarding next hops: 1
                                Next hop type: Router
                                Next hop: 10.1.1.89 via ge-0/0/1.46
                                Session Id: 0x0
                        111.111.111.111/32 Originating RIB: inet.3
                          Metric: 1                       Node path count: 1
                          Forwarding nexthops: 1
                                Nexthop: 10.1.1.89 via ge-0/0/1.46


1:113.113.113.113:50::112233445566778899::0/192 AD/EVI (1 entry, 0 announced)
        *BGP    Preference: 170/-101
                Route Distinguisher: 113.113.113.113:50
                Next hop type: Indirect, Next hop index: 0
                Address: 0xb7a7630
                Next-hop reference count: 8
                Source: 113.113.113.113
                Protocol next hop: 113.113.113.113
                Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                State: <Active Int Ext>
                Local AS:  2345 Peer AS:  2345
                Age: 21:12:15   Metric2: 1
                Validation State: unverified
                Task: BGP_2345.113.113.113.113
                AS path: I
                Communities: target:2345:50
                Import Accepted
                Route Label: 300512
                Localpref: 100
                Router ID: 113.113.113.113
                Secondary Tables: EVPN_CUSTOMER_G_ELAN_500.evpn.0
                Indirect next hops: 1
                        Protocol next hop: 113.113.113.113 Metric: 1
                        Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                        Indirect path forwarding next hops: 1
                                Next hop type: Router
                                Next hop: 10.1.1.89 via ge-0/0/1.46
                                Session Id: 0x0
                        113.113.113.113/32 Originating RIB: inet.3
                          Metric: 1                       Node path count: 1
                          Forwarding nexthops: 1
                                Nexthop: 10.1.1.89 via ge-0/0/1.46


2:111.111.111.111:50::500::00:1e:e5:c8:0f:f1/304 MAC/IP (1 entry, 0 announced)
        *BGP    Preference: 170/-101
                Route Distinguisher: 111.111.111.111:50
                Next hop type: Indirect, Next hop index: 0
                Address: 0xb7a5950
                Next-hop reference count: 10
                Source: 111.111.111.111
                Protocol next hop: 111.111.111.111
                Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                State: <Active Int Ext>
                Local AS:  2345 Peer AS:  2345
                Age: 3  Metric2: 1
                Validation State: unverified
                Task: BGP_2345.111.111.111.111
                AS path: I
                Communities: target:2345:50
                Import Accepted
                Route Label: 300640
                ESI: 00:11:22:33:44:55:66:77:88:99
                Localpref: 100
                Router ID: 111.111.111.111
                Secondary Tables: EVPN_CUSTOMER_G_ELAN_500.evpn.0
                Indirect next hops: 1
                        Protocol next hop: 111.111.111.111 Metric: 1
                        Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                        Indirect path forwarding next hops: 1
                                Next hop type: Router
                                Next hop: 10.1.1.89 via ge-0/0/1.46
                                Session Id: 0x0
                        111.111.111.111/32 Originating RIB: inet.3
                          Metric: 1                       Node path count: 1
                          Forwarding nexthops: 1
                                Nexthop: 10.1.1.89 via ge-0/0/1.46
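
To spell out the behavior I was expecting, here is a rough Python sketch of aliasing-based next-hop selection for known unicast traffic.  The MAC, ESI, PE addresses and labels are copied from the outputs above; the table layout, the function name and the CRC32 hash are my own simplifications, and a real PE uses its own hashing in hardware.

```python
import zlib

ZERO_ESI = "00:00:00:00:00:00:00:00:00:00"

# Values copied from PE_MXR02's bgp.evpn.0 output.
MAC_TO_ESI = {
    "00:1e:e5:c8:0f:f1": "00:11:22:33:44:55:66:77:88:99",  # multi-homed host
}
MAC_ROUTE = {
    "00:1e:e5:c8:0f:f1": ("111.111.111.111", 300640),  # Type 2 next hop, label
}
ALIASING = {
    "00:11:22:33:44:55:66:77:88:99": [
        ("111.111.111.111", 300640),  # Type 1 AD/EVI from PE_MXR01
        ("113.113.113.113", 300512),  # Type 1 AD/EVI from PE_MXR03
    ],
}


def pick_next_hop(dst_mac: str, flow_key: bytes) -> tuple:
    """Return (remote PE, service label) for one known unicast frame.

    A zero ESI means single-homed, so the Type 2 route is used as-is.
    Otherwise the flow is hashed across every PE advertising the ESI,
    and each frame is sent to exactly one of them.
    """
    esi = MAC_TO_ESI.get(dst_mac, ZERO_ESI)
    candidates = ALIASING.get(esi)
    if esi == ZERO_ESI or not candidates:
        return MAC_ROUTE[dst_mac]
    return candidates[zlib.crc32(flow_key) % len(candidates)]


if __name__ == "__main__":
    # The flow key is arbitrary here; it stands in for the packet header fields.
    print(pick_next_hop("00:1e:e5:c8:0f:f1", b"flow-1"))
```

The point of the sketch is that each reply should leave with one of the two aliasing labels and go to a single PE, never to both.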


However, the Wireshark capture displayed PE_MXR02's frame using a label stack based on PE_MXR03’s Type 3 Ingress-IM route label (bottom 300528, top 330).  Neither the Type 2 label nor the Type 1 aliasing label was used.  This was not in line with the expected behavior for EVPN aliasing.

admin@PE_MXR02> show route table bgp.evpn.0 extensive   

bgp.evpn.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)

.. snip ..

3:113.113.113.113:50::500::113.113.113.113/248 IM (1 entry, 0 announced)
        *BGP    Preference: 170/-101
                Route Distinguisher: 113.113.113.113:50
                PMSI: Flags 0x0: Label 300528: Type INGRESS-REPLICATION 113.113.113.113
                Next hop type: Indirect, Next hop index: 0
                Address: 0xb7a6af0
                Next-hop reference count: 8
                Source: 113.113.113.113
                Protocol next hop: 113.113.113.113
                Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                State: <Active Int Ext>
                Local AS:  2345 Peer AS:  2345
                Age: 34:37      Metric2: 1
                Validation State: unverified
                Task: BGP_2345.113.113.113.113
                AS path: I
                Communities: target:2345:50
                Import Accepted
                Localpref: 100
                Router ID: 113.113.113.113
                Secondary Tables: EVPN_CUSTOMER_G_ELAN_500.evpn.0
                Indirect next hops: 1
                        Protocol next hop: 113.113.113.113 Metric: 1
                        Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                        Indirect path forwarding next hops: 1
                                Next hop type: Router
                                Next hop: 10.1.1.89 via ge-0/0/1.46
                                Session Id: 0x0
                        113.113.113.113/32 Originating RIB: inet.3
                          Metric: 1                       Node path count: 1
                          Forwarding nexthops: 1
                                Nexthop: 10.1.1.89 via ge-0/0/1.46


admin@PE_MXR02> show route table bgp.evpn.0   

bgp.evpn.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

.. snip ..

3:113.113.113.113:50::500::113.113.113.113/248 IM           
                   *[BGP/170] 00:32:50, localpref 100, from 113.113.113.113
                      AS path: I, validation-state: unverified
                    > to 10.1.1.89 via ge-0/0/1.46, Push 330


Furthermore, the EVPN instance output from PE_MXR02 confirmed that both Type 1 and Type 2 labels were received from their respective PEs.  Why this PE ignored these labels when forwarding traffic was unknown.

admin@PE_MXR02> show evpn instance extensive           
Instance: EVPN_CUSTOMER_G_ELAN_500
  Route Distinguisher: 112.112.112.112:50
  Per-instance MAC route label: 299776
  MAC database status                     Local  Remote
    MAC advertisements:                       1       1
    MAC+IP advertisements:                    0       0
    Default gateway MAC advertisements:       0       0
  Number of local interfaces: 1 (1 up)
    Interface name  ESI                            Mode             Status     AC-Role
    ge-0/0/2.500    00:00:00:00:00:00:00:00:00:00  single-homed     Up         Root
  Number of IRB interfaces: 0 (0 up)
  Number of bridge domains: 2
    VLAN  Domain ID   Intfs / up    IRB intf   Mode             MAC sync  IM route label  SG sync  IM core nexthop
    500                  1    1                Extended         Enabled   299840          Disabled
    501                  1    1                Extended         Enabled   299856          Disabled
  Number of neighbors: 2
    Address               MAC    MAC+IP        AD        IM        ES Leaf-label
    111.111.111.111         1         0         2         2         0
    113.113.113.113         0         0         2         2         0
  Number of ethernet segments: 1
    ESI: 00:11:22:33:44:55:66:77:88:99
      Status: Unresolved
      Number of remote PEs connected: 2
        Remote PE        MAC label  Aliasing label  Mode
        113.113.113.113  300512     300512          all-active
        111.111.111.111  300640     300640          all-active

Instance: __default_evpn__
  Route Distinguisher: 112.112.112.112:0
  Number of bridge domains: 0
  Number of neighbors: 0



Capture 2, Frame 3 did not correspond to any additional ICMP request.  This reply appeared to have been replicated by PE_MXR02.  The difference between this frame and the legitimate reply was that its label stack used the Ingress-IM labels for PE_MXR01 (bottom 303216, top 329) to forward towards that PE.

Although PE_MXR02 appeared to “load-balance” the reply traffic, (i.e., Frame 2 going towards PE_MXR03 and Frame 3 towards PE_MXR01) it clearly did not do so using the aliasing method.  It simply replicated this traffic and sent it via another path.  The strange thing was that despite the ICMP test traffic being known unicast and also receiving valid labels (from the Type 2 MAC and Type 1 EAD per EVI routes), the PE still ignored these labels.

Perhaps PE_MXR02 misclassified this frame (as well as Frame 2) as BUM traffic and therefore used the IM label for Ingress replication?  I’m unsure why this would happen.  If it was classified as BUM, I would think this PE would have also replicated this frame and sent it to the other host connected off PE_MXR03 (i.e., CE_R29).  However this was not seen in any of the captures.




admin@PE_MXR02> show route table bgp.evpn.0 extensive   

bgp.evpn.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)

.. snip ..

3:111.111.111.111:50::500::111.111.111.111/248 IM (1 entry, 0 announced)
        *BGP    Preference: 170/-101
                Route Distinguisher: 111.111.111.111:50
                PMSI: Flags 0x0: Label 303216: Type INGRESS-REPLICATION 111.111.111.111
                Next hop type: Indirect, Next hop index: 0
                Address: 0xb7a6790
                Next-hop reference count: 10
                Source: 111.111.111.111
                Protocol next hop: 111.111.111.111
                Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                State: <Active Int Ext>
                Local AS:  2345 Peer AS:  2345
                Age: 34:34      Metric2: 1
                Validation State: unverified
                Task: BGP_2345.111.111.111.111
                AS path: I
                Communities: target:2345:50
                Import Accepted
                Localpref: 100
                Router ID: 111.111.111.111
                Secondary Tables: EVPN_CUSTOMER_G_ELAN_500.evpn.0
                Indirect next hops: 1
                        Protocol next hop: 111.111.111.111 Metric: 1
                        Indirect next hop: 0x2 no-forward INH Session ID: 0x0
                        Indirect path forwarding next hops: 1
                                Next hop type: Router
                                Next hop: 10.1.1.89 via ge-0/0/1.46
                                Session Id: 0x0
                        111.111.111.111/32 Originating RIB: inet.3
                          Metric: 1                       Node path count: 1
                          Forwarding nexthops: 1
                                Nexthop: 10.1.1.89 via ge-0/0/1.46


admin@PE_MXR02> show route table bgp.evpn.0   

bgp.evpn.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

.. snip ..

3:111.111.111.111:50::500::111.111.111.111/248 IM           
                   *[BGP/170] 00:32:47, localpref 100, from 111.111.111.111
                      AS path: I, validation-state: unverified
                    > to 10.1.1.89 via ge-0/0/1.46, Push 329


Conclusion

From a 40,000 foot view, the traffic appeared to be load balanced as expected.  However, after I examined the traffic more closely, this was not the case.  At this point I’m unsure why this occurred.  This was not normal behavior based on my understanding of all-active operation.  It could very well be a simple misconfiguration on my part, but I couldn’t find much information on the Internet about this particular issue.  I will keep searching for the answer though…