Saturday, July 29, 2017

Tunnel Path MTU Discovery in a DMVPN Network: Use with Caution

Everyone knows that one of the main issues in managing a DMVPN network is dealing with fragmentation.  Typically, when calculating the tunnel MTU and MSS, we assume we are working with a network transport with a standard MTU of 1500.  However, I’ve been seeing more cases of Internet services being delivered to customers with a lower than normal MTU.

Recently, I discovered a couple of sites in Europe where an ISP delivered a DSL service with a backend MTU of 1444.  This MTU was not disclosed to us, and the MTU between the router and the service provider edge device was set at 1500, giving the appearance of a normal working MTU.  The site performed normally for a good amount of time until the “tunnel path-mtu discovery” command (or tunnel PMTUD) was enabled on the tunnel interface; we later discovered it had been added accidentally during a maintenance window.  That is when I was alerted to a debilitating performance issue affecting the site.

At first I thought that the tunnel PMTUD feature couldn’t possibly have caused such an issue.  However, after researching and testing it, it became clear why things broke.  I want to share the experience in this post so others don’t get burned by it too.

As a small disclaimer, the following is based on my own experience, so I’m not saying this is a bad feature or that you should never use it.  My advice is to use it with caution and to enable it only after fully understanding how it can affect your network.


Lab Environment

  • Headend Router: Cisco 3945 (ISR G2) with IOS version 15.5(3)M2
  • Branch Router: Cisco 2951 (ISR G2) with IOS version 15.5(3)M2
  • Internet R01: Cisco 1801 with IOS version 12.4(15)T4
  • Internet S01: Cisco 3560 with IOS version 12.2(55)SE5

Diagram & Topology



Technology Overview

When you look at what the Tunnel PMTUD feature is doing, it basically comes down to two things (a minimal configuration sketch follows the list):

  • Copies the DF bit from the original IP header to the new outer (GRE delivery) IP header
  • Listens for ICMP unreachables with “fragmentation needed and DF set” (Type 3, Code 4)
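
For reference, the feature is enabled with a single command under the tunnel interface.  The following is a minimal, hypothetical mGRE example (the interface names, tunnel key, and MTU are illustrative, and the NHRP and IPsec protection lines you would have on a real DMVPN tunnel are omitted for brevity):

interface Tunnel1
 ip mtu 1400
 tunnel source GigabitEthernet0/0
 tunnel mode gre multipoint
 tunnel key 100
 ! Copies the DF bit outward and listens for ICMP Type 3, Code 4
 tunnel path-mtu-discovery

The command also accepts age-timer and min-mtu options, which control the soft-state aging you’ll see later in the “Path MTU Discovery, ager 10 mins, min MTU 92” show output.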



On the other hand, when Tunnel PMTUD is not configured, the default behavior of GRE is to not copy the DF bit from the original IP header.  This allows fragmentation to occur when the packet encounters a path with an MTU lower than 1500.


So with these two bits of information in mind, let’s run through a few different scenarios.  This will show Tunnel PMTUD operation in greater detail.  All scenarios run through a network with a lower than normal MTU of 1300 in its path (see lab diagram).


  1. Traffic sent from Headend LAN Switch to Branch over the DMVPN tunnel with Tunnel PMTUD disabled.
  2. Traffic sent from Headend LAN Switch to Branch over the DMVPN tunnel with Tunnel PMTUD enabled and working.
  3. Traffic sent from Headend LAN Switch to Branch over the DMVPN tunnel with Tunnel PMTUD enabled and not working (ICMP unreachable blocked).

Scenario 1 (Tunnel PMTUD Disabled)

In this scenario, we do not have tunnel PMTUD configured.  The tunnel MTU is set at 1400, and the test traffic is a 1400-byte packet sent from the Headend to the Branch.

  • Ping with a size of 1400 and the DF bit set from the LAN Switch to the Branch network.  The packet makes it to the destination without issue.
LAN_SWITCH#ping 10.100.100.254 size 1400 df

Type escape sequence to abort.
Sending 5, 1400-byte ICMP Echos to 10.100.100.254, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/5/9 ms

  • The Headend router’s tunnel MTU is 1400, and its counters show no fragmentation.
HEADEND#sh ip traffic | i Frag|frag
  Frags: 0 reassembled, 0 timeouts, 0 couldn't reassemble
         0 fragmented, 0 fragments, 0 couldn't fragment






  • The packet reaches router INTERNET_R01, which fragments it to fit the 1300-byte MTU path: one fragment has a length of 1300 and the other 200.  Note: The output below shows only 1 of 5 packets for brevity; a breakdown of where the len 1480 comes from follows the output.
INTERNET_R01#
* Jul 27 16:12:17.770: IP: tableid=0, s=1.1.1.1 (Vlan100), d=3.3.3.1 (FastEthernet0), routed via FIB
* Jul 27 16:12:17.770: IP: s=1.1.1.1 (Vlan100), d=3.3.3.1 (FastEthernet0), g=3.3.3.1, len 1480, forward, proto=50
* Jul 27 16:12:17.770: IP: s=1.1.1.1 (Vlan100), d=3.3.3.1 (FastEthernet0), len 1300, sending fragment
* Jul 27 16:12:17.770:     IP Fragment, Ident = 40355, fragment offset = 0, proto=50
* Jul 27 16:12:17.770: IP: s=1.1.1.1 (Vlan100), d=3.3.3.1 (FastEthernet0), len 200, sending last fragment
* Jul 27 16:12:17.770:     IP Fragment, Ident = 40355, fragment offset = 1280
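
As an aside, the len 1480 in this debug is the original 1400-byte packet after encapsulation (the proto=50 in the debug confirms the packet is ESP).  Assuming ESP in transport mode with AES-CBC (16-byte IV) and HMAC-SHA1-96 (12-byte ICV) plus a 4-byte GRE tunnel key — an assumption on my part, but one that reproduces every packet size in this post — the math works out as follows:

  Inner IP packet                                  1400
  GRE header with tunnel key (8) + IP header (20)   +28  -> 1428-byte GRE packet
  ESP SPI + sequence (8) and IV (16)                +24
  ESP padding (14) + pad length/next header (2)     +16
  ESP ICV (12)                                      +12  -> 1480 bytes on the wire

The fragments then carry 1280 and 180 bytes of the 1460-byte payload (1480 minus the 20-byte IP header), which is why the second fragment’s offset is 1280.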


  • The IP traffic statistics show that 5 packets have been fragmented (into 10 fragments).
INTERNET_R01#sh ip traffic | in Frag|frag
  Frags: 0 reassembled, 0 timeouts, 0 couldn't reassemble
         5 fragmented, 10 fragments, 0 couldn't fragment


  • The Branch router receives the packets and replies.
BRANCH#
Jul 27 16:20:11.634: ICMP: echo reply sent, src 10.100.100.254, dst 10.1.1.253, topology BASE, dscp 0 topoid 0
Jul 27 16:20:11.638: ICMP: echo reply sent, src 10.100.100.254, dst 10.1.1.253, topology BASE, dscp 0 topoid 0
Jul 27 16:20:11.642: ICMP: echo reply sent, src 10.100.100.254, dst 10.1.1.253, topology BASE, dscp 0 topoid 0
Jul 27 16:20:11.650: ICMP: echo reply sent, src 10.100.100.254, dst 10.1.1.253, topology BASE, dscp 0 topoid 0
Jul 27 16:20:11.654: ICMP: echo reply sent, src 10.100.100.254, dst 10.1.1.253, topology BASE, dscp 0 topoid 0


  • The Branch router reassembles the fragmented packets.

BRANCH#sh ip traffic | i Frag|frag
  Frags: 5 reassembled, 0 timeouts, 0 couldn't reassemble
         0 fragmented, 0 fragments, 0 couldn't fragment



Scenario 2 (Tunnel PMTUD Enabled)

In this scenario, tunnel PMTUD is configured.  The tunnel MTU is set at 1400, and the test traffic is a 1400-byte packet sent from the Headend to the Branch.

  • Ping with a size of 1400 and the DF bit set from the LAN Switch to the Branch network.  This fails (the “M” in the output means the switch received an ICMP “fragmentation needed” message).
LAN_SWITCH#p 10.100.100.254 si 1400 df

Type escape sequence to abort.
Sending 5, 1400-byte ICMP Echos to 10.100.100.254, timeout is 2 seconds:
Packet sent with the DF bit set
.M.M.
Success rate is 0 percent (0/5)


  • With Tunnel PMTUD enabled, the router does the following:
    • The router sends an ICMP unreachable to itself to adjust the MTU.
    • The Tunnel PMTUD process calculates the new MTU from the reported value of 1362 and the encapsulation overhead, and drops it down to 1334.
    • The router sends an ICMP unreachable to the source telling it the new MTU is 1334.
HEADEND#
Jul 27 06:09:22.507: ICMP: dst (3.3.3.1) frag. needed and DF set unreachable sent to 1.1.1.1
Jul 27 06:09:22.507: ICMP: dst (1.1.1.1) frag. needed and DF set unreachable rcv from 1.1.1.1 mtu:1362
Jul 27 06:09:22.507: Tunnel1: dest 3.3.3.1, received frag needed (mtu 1362), adjusting soft state MTU from 0 to 1334
Jul 27 06:09:22.507: Tunnel1: tunnel endpoint for transport dest 3.3.3.1, change MTU from 0 to 1334
Jul 27 06:09:24.510: ICMP: dst (10.100.100.254) frag. needed and DF set unreachable sent to 10.1.1.253mtu:1334
Jul 27 06:09:26.514: ICMP: dst (10.100.100.254) frag. needed and DF set unreachable sent to 10.1.1.253mtu:1334


  • When trying to figure out why the tunnel PMTUD process uses a value of 1334, the IPSec Overhead Calculator was a very useful tool.  We can now see why a payload size of 1334, after encryption and GRE, yields a packet size of 1400 (a breakdown that reproduces the numbers follows the next bullet).
  • This begs the question of why PMTUD needs to alter the MTU at this point.  It seems inefficient, because we are clearly sending data that fits the tunnel MTU of 1400; I could understand it if we had sent a 1500-byte packet.
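
Here is the breakdown, under the same transform-set assumption as in Scenario 1 (ESP transport mode, AES-CBC, HMAC-SHA1-96, 4-byte GRE tunnel key — your overhead will differ with a different transform set):

  Inner IP packet                                  1334
  GRE header with tunnel key (8) + IP header (20)   +28  -> 1362-byte GRE packet
  ESP SPI + sequence (8) and IV (16)                +24
  ESP padding (0) + pad length/next header (2)       +2
  ESP ICV (12)                                      +12  -> 1400 bytes on the wire

Under these assumptions the debug also makes sense: 1362 is the GRE packet size that encrypts to exactly 1400 (it shows up as the mtu value in the self-generated unreachable), and subtracting the 28 bytes of GRE-plus-IP overhead gives the 1334 soft-state MTU.  It also looks like the encrypted packet is being held to the tunnel’s own 1400-byte MTU rather than the physical 1500-byte MTU, which would explain the immediate clamp, though I haven’t found that behavior documented.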


  • The Headend’s tunnel interface sets the new MTU of 1334 via the PMTUD feature.
HEADEND#sh int tun1 | in Path        
  Path MTU Discovery, ager 10 mins, min MTU 92
  Path destination 3.3.3.1: MTU 1334, expires 00:09:23

  • The Headend’s IP statistics show that it dropped the packets because it couldn’t fragment them.
HEADEND#sh ip traffic | i Frag|frag
  Frags: 0 reassembled, 0 timeouts, 0 couldn't reassemble
         0 fragmented, 0 fragments, 2 couldn't fragment

  • The LAN Switch re-attempts to send traffic to the Branch site with the new MTU of 1334.  It still fails because we have an MTU of 1300 somewhere in the path.
LAN_SWITCH#ping 10.100.100.254 size 1334 df

Type escape sequence to abort.
Sending 5, 1334-byte ICMP Echos to 10.100.100.254, timeout is 2 seconds:
Packet sent with the DF bit set
.M.M.
Success rate is 0 percent (0/5)

  • The Internet router receives a 1400-byte packet and sends an ICMP unreachable to the source because the packet can’t pass the interface with an MTU of 1300.
INTERNET_R01#
* Jul 27 22:59:31.764: IP: tableid=0, s=1.1.1.1 (Vlan100), d=3.3.3.1 (FastEthernet0), routed via FIB
* Jul 27 22:59:31.764: IP: s=1.1.1.1 (Vlan100), d=3.3.3.1 (FastEthernet0), g=3.3.3.1, len 1400, forward, proto=50
* Jul 27 22:59:31.764: ICMP: dst (3.3.3.1) frag. needed and DF set unreachable sent to 1.1.1.1

  • The Internet router drops the packet because it couldn’t fragment it (the DF bit is set).
INTERNET_R01#sh ip traffic | i Frag|frag
  Frags: 0 reassembled, 0 timeouts, 0 couldn't reassemble
         0 fragmented, 0 fragments, 1 couldn't fragment

  • This time the Headend’s Tunnel PMTUD process does the following:
    • The router receives an ICMP unreachable from the Internet router with an MTU of 1300.
    • The router sends an ICMP unreachable to itself to adjust the MTU.
    • The Tunnel PMTUD process calculates the new MTU based on a value of 1250 and drops it down to 1222.  I was initially unsure how it arrived at 1250; under the overhead assumptions used in the breakdowns here, 1250 works out to exactly the largest GRE packet whose ESP encapsulation still fits in 1300 bytes.
    • The router sends a new ICMP unreachable to the source telling it to drop the MTU to 1222.
HEADEND#
Jul 27 23:24:01.752: ICMP: dst (1.1.1.1) frag. needed and DF set unreachable rcv from 1.0.0.1 mtu:1300
Jul 27 23:24:03.752: ICMP: dst (3.3.3.1) frag. needed and DF set unreachable sent to 1.1.1.1
Jul 27 23:24:03.752: ICMP: dst (1.1.1.1) frag. needed and DF set unreachable rcv from 1.1.1.1 mtu:1250
Jul 27 23:24:03.752: Tunnel1: dest 3.3.3.1, received frag needed (mtu 1250), adjusting soft state MTU from 0 to 1222
Jul 27 23:24:03.752: Tunnel1: tunnel endpoint for transport dest 3.3.3.1, change MTU from 0 to 1222
Jul 27 23:24:05.757: ICMP: dst (10.100.100.254) frag. needed and DF set unreachable sent to 10.1.1.253mtu:1222
Jul 27 23:24:07.761: ICMP: dst (10.100.100.254) frag. needed and DF set unreachable sent to 10.1.1.253mtu:1222

  • The Headend’s tunnel interface sets the new MTU of 1222 via the PMTUD feature.

HEADEND#sh int tun1 | in Path
  Path MTU Discovery, ager 10 mins, min MTU 92
  Path destination 3.3.3.1: MTU 1222, expires 00:01:06

  • When we use the IPSec Overhead Calculator with a payload size of 1222, after encryption and GRE, the packet size is 1288.  This will now fit over the 1300-MTU link.



Note: The Tunnel PMTUD process must know the exact overhead calculations to be able to set the correct MTU.  As an example, I set the payload size 1 byte higher in the calculator, and the total is now bigger than 1300.  The breakdown below shows why.
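
Under the same transform-set assumption as before (ESP transport mode, AES-CBC, HMAC-SHA1-96, 4-byte GRE tunnel key), the 1222-byte case works out to:

  Inner IP packet                                  1222
  GRE header with tunnel key (8) + IP header (20)   +28  -> 1250-byte GRE packet
  ESP fixed overhead (8 + 16 + 2 + 12), zero pad    +38  -> 1288 bytes on the wire

At 1223 bytes, the plaintext no longer aligns with the 16-byte AES block size, so a full extra block of padding is added and the packet jumps from 1288 to 1304 — just over the 1300 limit.  This is also why the 1250 value from earlier makes sense: it is the largest GRE packet that still encrypts to 1300 bytes or less.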



  • The LAN Switch re-attempts to send traffic to the Branch site with the new MTU of 1222.  This time it succeeds.
LAN_SWITCH#p 10.100.100.254 size 1222 df

Type escape sequence to abort.
Sending 5, 1222-byte ICMP Echos to 10.100.100.254, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/4/9 ms

  •  The Internet router with the 1300-MTU link receives the 1288-byte encrypted packet and forwards it (showing only 1 of 5 packets).
INTERNET_R01#
* Jul 27 23:59:53.800: IP: tableid=0, s=1.1.1.1 (Vlan100), d=3.3.3.1 (FastEthernet0), routed via FIB
* Jul 27 23:59:53.800: IP: s=1.1.1.1 (Vlan100), d=3.3.3.1 (FastEthernet0), g=3.3.3.1, len 1288, forward

  •  The Branch router receives the packets and replies.
BRANCH#
Jul 27 23:55:37.862: ICMP: echo reply sent, src 10.100.100.254, dst 10.1.1.253, topology BASE, dscp 0 topoid 0
Jul 27 23:55:37.866: ICMP: echo reply sent, src 10.100.100.254, dst 10.1.1.253, topology BASE, dscp 0 topoid 0
Jul 27 23:55:37.870: ICMP: echo reply sent, src 10.100.100.254, dst 10.1.1.253, topology BASE, dscp 0 topoid 0
Jul 27 23:55:37.874: ICMP: echo reply sent, src 10.100.100.254, dst 10.1.1.253, topology BASE, dscp 0 topoid 0
Jul 27 23:55:37.882: ICMP: echo reply sent, src 10.100.100.254, dst 10.1.1.253, topology BASE, dscp 0 topoid 0



Scenario 3 (Tunnel PMTUD Enabled, ICMP Unreachable Blocked)

In this scenario, tunnel PMTUD is configured, but ICMP unreachables are purposely blocked in the simulated Internet infrastructure.  The tunnel MTU is set at 1400, and the test traffic is a 1400-byte packet sent from the Headend to the Branch.

To simulate a situation where the ICMP unreachables are blocked or lost within a network, an ACL is added to prevent any ICMP unreachables from reaching the Headend router from the Internet router.


ip access-list extended BLOCK_UNREACHABLES
 deny   icmp any any unreachable log
 permit ip any any


INET_SWITCH#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
INET_SWITCH(config)#int f0/5                                
INET_SWITCH(config-if)#ip access-group BLOCK_UNREACHABLES in
INET_SWITCH(config-if)#end



  • With the block in place, let’s see what happens now.  We ping with a size of 1400 and the DF bit set from the LAN Switch to the Branch network.  It fails.
LAN_SWITCH#ping 10.100.100.254 size 1400 df

Type escape sequence to abort.
Sending 5, 1400-byte ICMP Echos to 10.100.100.254, timeout is 2 seconds:
Packet sent with the DF bit set
.M.M.
Success rate is 0 percent (0/5)

  • With Tunnel PMTUD still enabled, the router does the following:
    • The router sends an ICMP unreachable to itself to adjust the MTU.
    • The Tunnel PMTUD process calculates the new MTU from the reported value of 1362, as in Scenario 2, and drops it down to 1334.
    • The router sends an ICMP unreachable to the source telling it the new MTU is 1334.
HEADEND#
Jul 27 18:28:21.312: ICMP: dst (3.3.3.1) frag. needed and DF set unreachable sent to 1.1.1.1
Jul 27 18:28:21.312: ICMP: dst (1.1.1.1) frag. needed and DF set unreachable rcv from 1.1.1.1 mtu:1362
Jul 27 18:28:21.312: Tunnel1: dest 3.3.3.1, received frag needed (mtu 1362), adjusting soft state MTU from 0 to 1334
Jul 27 18:28:21.312: Tunnel1: tunnel endpoint for transport dest 3.3.3.1, change MTU from 0 to 1334
Jul 27 18:28:23.317: ICMP: dst (10.100.100.254) frag. needed and DF set unreachable sent to 10.1.1.253mtu:1334
Jul 27 18:28:25.322: ICMP: dst (10.100.100.254) frag. needed and DF set unreachable sent to 10.1.1.253mtu:1334

  • The Headend’s tunnel interface sets the new MTU of 1334 via the PMTUD feature.
HEADEND#sh int tun1 | in Path      
  Path MTU Discovery, ager 10 mins, min MTU 92
  Path destination 3.3.3.1: MTU 1334, expires 00:07:13

  • The LAN Switch re-attempts to send traffic to the Branch site with the new MTU of 1334.  It still fails because we have an MTU of 1300 somewhere in the path.  Note the plain dots in the output this time: with the unreachables blocked, the Headend never learns of the 1300 limit and no longer sends “M” responses back to the source.
LAN_SWITCH#ping 10.100.100.254 si 1334 df

Type escape sequence to abort.
Sending 5, 1334-byte ICMP Echos to 10.100.100.254, timeout is 2 seconds:
Packet sent with the DF bit set
.....
Success rate is 0 percent (0/5)

  • The Internet router receives a 1400-byte packet.  An MTU-1300 path exists, so an ICMP unreachable is sent to the Headend router (1.1.1.1).
INTERNET_R01#
* Jul 27 18:21:02.661: IP: tableid=0, s=1.1.1.1 (Vlan100), d=3.3.3.1 (FastEthernet0), routed via FIB
* Jul 27 18:21:02.661: IP: s=1.1.1.1 (Vlan100), d=3.3.3.1 (FastEthernet0), g=3.3.3.1, len 1400, forward, proto=50
* Jul 27 18:21:02.661: ICMP: dst (3.3.3.1) frag. needed and DF set unreachable sent to 1.1.1.1

  •  The Internet Switch blocks the ICMP unreachable per the ACL (the “(3/4)” in the log entry is ICMP Type 3, Code 4).
INET_SWITCH#
* Jul 27 20:20:36.813: %SEC-6-IPACCESSLOGDP: list BLOCK_UNREACHABLES denied icmp 1.0.0.1 -> 1.1.1.1 (3/4), 1 packet


At this point the Headend router never sees the ICMP unreachable message, so it does nothing to react.  It doesn’t know to drop the MTU down to 1300 and continues to use the MTU value of 1334 previously set by the Tunnel PMTUD feature.  Without any further mechanism to correct the MTU, the source will continue to send packets with a size of 1334 that get dropped, creating a serious network performance problem (TCP retransmits, etc.).  The soft-state entry does eventually age out (the ager defaults to 10 minutes, as seen in the show output above), but the same broken discovery cycle then presumably repeats.  This scenario demonstrates firsthand that receiving the ICMP unreachables is key for this feature to work correctly.




Conclusion

Besides the fact that the tunnel PMTUD feature seems a bit chatty and inefficient at times, the main caveat I would like to mention is that for it to work in any practical sense, we have to rely on our Internet service providers to generate and forward ICMP unreachables.  In my opinion, that’s a tall order.  We frequently hear of ISPs blocking ICMP for security reasons, and since most common deployments of DMVPN (or any IPsec VPN network) run over the Internet, I think it’s a bad idea to use this feature there in the first place.

The best practice, in my opinion, is to lower the tunnel MTU and re-adjust the MSS for sites that have an Internet service with a lower than normal MTU.  It’s a quick and easy fix; a sketch of what I mean follows.  Otherwise, if there is a specific need for this feature, you just have to be aware of your network environment and ensure the ICMP unreachable prerequisite is met.
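
As an illustration of that fix, here is a minimal sketch for a site like the European ones mentioned earlier, with a 1444-byte backend MTU.  The exact values are mine, not a standard: the idea is tunnel MTU = path MTU minus your measured GRE/IPsec overhead (here I round the overhead up to 100 bytes for headroom), and MSS = tunnel MTU minus 40 bytes for the IP and TCP headers.

interface Tunnel1
 ! 1444-byte path minus ~100 bytes of GRE + IPsec overhead
 ip mtu 1344
 ! 1344 minus 20 (IP) minus 20 (TCP)
 ip tcp adjust-mss 1304

With this in place, TCP sessions through the tunnel are clamped below the real path MTU up front, and nothing depends on ICMP unreachables making it back from the provider network.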