BGP is the protocol used to announce prefixes throughout the Internet. It’s a very robust protocol, well suited to carrying large numbers of prefixes, such as the full Internet table or the internal client prefixes of an ISP.
When a prefix is received in BGP, the path goes through two steps before being chosen as a candidate to populate the RIB.
The first step consists of checking whether the path is valid. If it is, the prefix is installed in the BGP table, and the second step, best-path selection, starts.
In order to pass this first check, the path must meet the following requirements:
- The prefix must not be marked as “not-synchronized”
- There must be a route in the RIB to reach the next-hop
- For prefixes learned through eBGP sessions, the local ASN must not be in the AS_PATH of the prefix
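The three checks above can be sketched in Python. This is only an illustrative model, not router code: the `Path` fields and the `rib_has_route` callback are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Path:
    next_hop: str
    as_path: List[int]
    ebgp: bool                 # learned over an eBGP session?
    synchronized: bool = True  # False models the "not-synchronized" mark


def is_valid(path: Path, local_asn: int,
             rib_has_route: Callable[[str], bool]) -> bool:
    """Return True if the path passes the three validity checks."""
    if not path.synchronized:
        return False
    if not rib_has_route(path.next_hop):
        return False
    if path.ebgp and local_asn in path.as_path:
        return False
    return True


# Hypothetical data: next-hops that the RIB can currently reach.
reachable = {"4.4.4.4", "6.6.6.6"}
from_r4 = Path("4.4.4.4", [65002], ebgp=True)
looped = Path("4.4.4.4", [65002, 65001], ebgp=True)
print(is_valid(from_r4, 65001, reachable.__contains__))  # True
print(is_valid(looped, 65001, reachable.__contains__))   # False
```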
In the second step, the best path to reach the prefix is selected. If there is only one path, no comparison is needed. If there are several paths to reach the prefix, BGP uses a specific algorithm to select the best one, and this is what I want to talk about.
This algorithm dictates the following:
- Prefer the path with the highest WEIGHT
- Prefer the path with the highest LOCAL PREFERENCE
- Prefer the path that was locally originated via a network or redistribute command over one originated via the aggregate-address command
- Prefer the path with the shortest AS_PATH
- Prefer the path with the lowest ORIGIN type
- Prefer the path with the lowest MULTI-EXIT DISCRIMINATOR (MED)
- Prefer eBGP over iBGP
- Prefer the path with the lowest IGP metric to the BGP next-hop
- When both paths are external, prefer the one that was received first (the oldest one)
- Prefer the route that comes from the BGP router with the lowest router ID
- If the originator or router ID is the same for multiple paths, prefer the path with the minimum cluster list length
- Prefer the path that comes from the lowest neighbor address
As you can see, the selection process is quite long, although in most cases the selection doesn’t go further than point 8.
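As a rough illustration, points 1 through 8 can be modelled as a pairwise comparison in Python. This is a simplified sketch under several assumptions: the `Path` fields are hypothetical names, point 3 is reduced to a single "locally originated" flag, and points 9 and later are not modelled.

```python
from dataclasses import dataclass, field
from typing import List

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Path:
    name: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: List[int] = field(default_factory=list)
    origin: str = "igp"
    med: int = 0
    ebgp: bool = False
    igp_metric: int = 0


def better(a: Path, b: Path) -> Path:
    """Return the preferred path, walking points 1 to 8 in order."""
    # Each entry is (value for a, value for b, True if the higher value wins).
    steps = [
        (a.weight, b.weight, True),                             # 1
        (a.local_pref, b.local_pref, True),                     # 2
        (a.locally_originated, b.locally_originated, True),     # 3
        (len(a.as_path), len(b.as_path), False),                # 4
        (ORIGIN_RANK[a.origin], ORIGIN_RANK[b.origin], False),  # 5
    ]
    # 6: MED is compared only when both paths share the same neighboring AS.
    if a.as_path[:1] == b.as_path[:1]:
        steps.append((a.med, b.med, False))
    steps.append((a.ebgp, b.ebgp, True))                        # 7
    steps.append((a.igp_metric, b.igp_metric, False))           # 8
    for va, vb, higher_wins in steps:
        if va != vb:
            return a if (va > vb) == higher_wins else b
    return a  # still tied: later points are not modelled here


r4 = Path("R4", as_path=[65002], ebgp=True, igp_metric=11)
r6 = Path("R6", igp_metric=11)
print(better(r4, r6).name)  # R6 (shorter AS_PATH decides at point 4)
```

The comparison stops at the first point where the two paths differ, which is why an attribute lower in the list never overrides one higher up.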
Let’s study points 1 through 8 and how we can influence them within the following lab. The prefix we are going to be working with is 100.100.100.0/24, announced by R4 and R6:
1.- PATH WITH HIGHEST WEIGHT
Weight is a Cisco-specific attribute, which means it’s not standard. This attribute is local to the router on which it’s configured, so it’s not advertised with the prefix to other peers. It tells the router which path to use to reach the prefix. The highest value wins.
It’s the first attribute checked by BGP, so if there are two different paths for the same prefix but with different Weight values, the path with the highest value wins.
In the lab scenario, R4 and R6 both announce the prefix 100.100.100.0/24, one through an eBGP session and the other through an iBGP session. Let’s check how R2 and R1 see this prefix without changing anything:
R2#show ip bgp
BGP table version is 3, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
* 100.100.100.0/24 4.4.4.4 0 0 65002 i
*>i 6.6.6.6 0 100 0 i
R2#show ip bgp 100.100.100.0/24
BGP routing table entry for 100.100.100.0/24, version 3
Paths: (2 available, best #2, table default)
Advertised to update-groups:
13 16
65002
4.4.4.4 (metric 11) from 4.4.4.4 (4.4.4.4)
Origin IGP, metric 0, localpref 100, valid, external
Local
6.6.6.6 (metric 11) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 100, valid, internal, best
R2 gets two paths for the prefix 100.100.100.0/24: one from an eBGP peer and the other from an iBGP peer. We might initially expect R2 to choose the path through the eBGP peer, because the Administrative Distance of eBGP is lower than that of iBGP, but that’s not what happens.
R2 picks the path from the iBGP peer as the best one because, as we will see later, it has the shortest AS_PATH. Both paths (through R4 and through R6) have the same weight and local-preference, and neither was originated locally on R2, so the tie-breaker is the shorter AS_PATH, which belongs to the path through R6.
Let’s see what happens when the weight parameter is configured on R2:
R2#conf term
R2(config)#router bgp 65001
R2(config-router)#neig 4.4.4.4 weight 200
R2(config-router)#end
R2#clear ip bgp 4.4.4.4
R2#sh ip bgp
BGP table version is 4, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> 100.100.100.0/24 4.4.4.4 0 200 65002 i
* i 6.6.6.6 0 100 0 i
Now R2 takes the path through R4 and announces it to R1 as its best path. However, as we said, the weight attribute is not attached to the prefix, so if R1 had a BGP session with R6, it would prefer the path through R6, as R2 did at the beginning.
Let’s build this BGP session between R1 and R6, and let’s see which path R1 chooses:
R1#sh ip bgp sum
BGP router identifier 1.1.1.1, local AS number 65001
....
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
2.2.2.2 4 65001 30 30 14 0 0 00:24:37 1
6.6.6.6 4 65001 4 3 14 0 0 00:00:31 1
R1#sh ip bgp 100.100.100.0/24
BGP routing table entry for 100.100.100.0/24, version 14
Paths: (2 available, best #1, table default)
Not advertised to any peer
Local
6.6.6.6 (metric 21) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 100, valid, internal, best
65002
4.4.4.4 (metric 21) from 2.2.2.2 (2.2.2.2)
Origin IGP, metric 0, localpref 100, valid, internal
R1#
Although R2 prefers the path through R4, R1 prefers the path through R6 because it has a shorter AS_PATH.
So as I said before, the weight attribute only has local significance, and it’s not attached to the prefix when announced via BGP.
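The behaviour seen on R2 and R1 can be reproduced with a small Python model. It is only a sketch: `advertise` and `best` are hypothetical helpers, and `best` looks at just the two criteria that matter in this example (weight, then AS_PATH length).

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Path:
    via: str
    as_path: Tuple[int, ...]
    weight: int = 0


def advertise(path: Path) -> Path:
    """Model what a peer receives: weight is not carried in the update."""
    return replace(path, weight=0)


def best(paths):
    # Highest weight first, then shortest AS_PATH.
    return min(paths, key=lambda p: (-p.weight, len(p.as_path)))


# R2's view: weight 200 applied to the path learned from R4.
r2_paths = [Path("R4", (65002,), weight=200), Path("R6", ())]
r2_best = best(r2_paths)

# R1's view: R2's best path arrives without the weight, next to R6's own path.
r1_paths = [advertise(r2_best), Path("R6", ())]
print(r2_best.via, best(r1_paths).via)  # R4 R6
```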
2.- PATH WITH HIGHEST LOCAL-PREFERENCE
When all the paths to the destination have the same weight value, the next attribute to be checked is Local-Preference.
Local-preference is a standard attribute, and it’s transmitted only between iBGP peers.
This attribute is set on incoming or outgoing prefixes by applying a route-map to the peer. If the route-map has no statement matching a specific prefix, the local-preference is set for all the prefixes received from or sent to that peer. The highest value wins.
Let’s get back to the original scenario. R4, R3, and R6 are announcing the same 100.100.100.0/24 prefix. R3, however, is announcing this prefix with a local-preference of 150:
R2#sh ip bgp
BGP table version is 7, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*>i100.100.100.0/24 3.3.3.3 0 150 0 i
* 4.4.4.4 0 0 65002 i
* i 6.6.6.6 0 100 0 i
R2#sh ip bgp 100.100.100.0/24
BGP routing table entry for 100.100.100.0/24, version 7
Paths: (3 available, best #1, table default)
Flag: 0x800
Advertised to update-groups:
13 18
Local, (Received from a RR-client)
3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
Origin IGP, metric 0, localpref 150, valid, internal, best
65002
4.4.4.4 (metric 11) from 4.4.4.4 (4.4.4.4)
Origin IGP, metric 0, localpref 100, valid, external
Local, (Received from a RR-client)
6.6.6.6 (metric 11) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 100, valid, internal
This makes R2 select the path through R3 as the best choice, and announce this choice to its other iBGP neighbors, as we can see on R1:
R1#sh ip bgp 100.100.100.0/24
BGP routing table entry for 100.100.100.0/24, version 17
Paths: (1 available, best #1, table default)
Not advertised to any peer
Local
3.3.3.3 (metric 11) from 2.2.2.2 (2.2.2.2)
Origin IGP, metric 0, localpref 150, valid, internal, best
Originator: 3.3.3.3, Cluster list: 2.2.2.2
As we can see, the value of Local-Preference is attached to the prefix.
In order to change this decision, we can configure a route-map on R2 with a higher local-preference value and apply it to the session with R6. After resetting the session with R6 on R2, the prefix announced by R6 will have the highest local-preference value, so R2 will choose this new path. R2 will also announce it this way to its clients:
R2#configure t
R2(config)#route-map LP-200
R2(config-route-map)#set local-preference 200
R2(config-route-map)#exit
R2(config)#router bgp 65001
R2(config-router)#neig 6.6.6.6 route-map LP-200 in
R2(config-router)#end
R2#clear ip bgp 6.6.6.6
R2#sh ip bgp
BGP table version is 8, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*>i100.100.100.0/24 6.6.6.6 0 200 0 i
* i 3.3.3.3 0 150 0 i
* 4.4.4.4 0 0 65002 i
R1#show ip bgp 100.100.100.0/24
BGP routing table entry for 100.100.100.0/24, version 18
Paths: (1 available, best #1, table default)
Not advertised to any peer
Local
6.6.6.6 (metric 21) from 2.2.2.2 (2.2.2.2)
Origin IGP, metric 0, localpref 200, valid, internal, best
Originator: 6.6.6.6, Cluster list: 2.2.2.2
A path without LOCAL_PREF is considered to have the value set with the bgp default local-preference command or, if this is not configured, the default value of 100.
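The fallback order for LOCAL_PREF can be written as a short Python function. This is an illustrative sketch; `effective_local_pref` and its parameters are hypothetical names, with `configured_default` standing in for the value of the bgp default local-preference command.

```python
from typing import Optional

DEFAULT_LOCAL_PREF = 100  # value used when nothing else is configured


def effective_local_pref(received: Optional[int],
                         configured_default: Optional[int] = None) -> int:
    """Return the LOCAL_PREF value that takes part in the comparison."""
    if received is not None:
        return received
    if configured_default is not None:
        return configured_default
    return DEFAULT_LOCAL_PREF


print(effective_local_pref(150))        # 150
print(effective_local_pref(None))       # 100
print(effective_local_pref(None, 300))  # 300
```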
3.- PATH LOCALLY ORIGINATED
This point is reached if all of the above attributes have the same value for all the feasible paths.
Local paths that are sourced by the network or redistribute commands are preferred over local aggregates that are sourced by the aggregate-address command.
Let’s get back to the original scenario.
Now R5 is announcing the prefix 100.100.100.0/30 to R3 over an iBGP session. R3 generates the prefix 100.100.100.0/24 in two ways: as a BGP aggregate using the aggregate-address command, and through the redistribution of its Loopback100 interface:
R3#show ip bgp
BGP table version is 4, local router ID is 3.3.3.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
s>i100.100.100.0/30 5.5.5.5 0 100 0 i
* 100.100.100.0/24 0.0.0.0 32768 i
*> 0.0.0.0 0 32768 ?
R3#sh ip bgp 100.100.100.0/24
BGP routing table entry for 100.100.100.0/24, version 3
Paths: (2 available, best #2, table default)
Advertised to update-groups:
16 17
Local, (aggregated by 65001 3.3.3.3)
0.0.0.0 from 0.0.0.0 (3.3.3.3)
Origin IGP, localpref 100, weight 32768, valid, aggregated, local, atomic-aggregate
Local
0.0.0.0 from 0.0.0.0 (3.3.3.3)
Origin incomplete, metric 0, localpref 100, weight 32768, valid, sourced, best
R3 prefers the path originated via the redistribute command, instead of the one from the aggregate command. And that path is the one announced to R2.
4.- PATH WITH SHORTEST AS_PATH
If none of the above attributes break the tie and the router doesn’t have the prefix locally generated, the next parameter to check is the AS_PATH attribute.
The AS_PATH is a well-known mandatory attribute. It means every prefix has this attribute attached, and every router must understand this attribute. The shorter this attribute is, the more preferable the path.
Let’s get back again to the original scenario, with all the attributes seen so far left at their defaults.
In this scenario, the prefix received from R4 has the longest AS_PATH because it arrives over an eBGP session and carries AS 65002, while the path received from R6 over iBGP has an empty AS_PATH.
R2#sh ip bgp
BGP table version is 61, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*>i100.100.100.0/24 6.6.6.6 0 100 0 i
* 4.4.4.4 0 0 65002 i
That’s why R2 prefers the iBGP path over the eBGP path.
The manipulation of the AS_PATH attribute must be done on an eBGP session. Among iBGP peers it is not possible to manipulate the AS_PATH (although you could hide it with the aggregate-address command, or manipulate it with confederations).
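To make the length comparison concrete, here is a small Python sketch. The `prepend` helper is hypothetical and models one common form of AS_PATH manipulation (repeating an ASN at the front of the path); it is an assumption added for illustration, not something configured in this lab.

```python
from typing import List


def prepend(as_path: List[int], asn: int, times: int = 1) -> List[int]:
    """Return a new AS_PATH with `asn` added `times` times at the front."""
    return [asn] * times + as_path


via_r6: List[int] = []       # learned over iBGP inside AS 65001: empty AS_PATH
via_r4: List[int] = [65002]  # learned over eBGP from AS 65002

print(len(via_r6), len(via_r4))         # 0 1
print(prepend(via_r4, 65002, times=2))  # [65002, 65002, 65002]
```

A longer AS_PATH makes a path less attractive at this point of the algorithm, so prepending only ever pushes traffic away from a path.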
5.- PATH WITH LOWEST ORIGIN
Origin is also a well-known mandatory attribute, like next-hop and as_path. So every BGP prefix has this attribute.
There are 3 origin types: IGP, EGP and INCOMPLETE.
IGP is more preferable than Exterior Gateway Protocol (EGP), and EGP is more preferable than INCOMPLETE.
Typically, when a prefix is generated by the command network, it gets the type IGP, and when it’s redistributed from another protocol, it gets the type INCOMPLETE.
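The ordering of the three origin types can be expressed as a simple ranking in Python. The `ORIGIN_RANK` table and `preferred_origin` function are hypothetical names for this sketch; the lowest rank wins.

```python
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


def preferred_origin(a: str, b: str) -> str:
    """Return the origin code that wins the comparison (lowest rank)."""
    return min(a, b, key=ORIGIN_RANK.__getitem__)


print(preferred_origin("incomplete", "igp"))  # igp
print(preferred_origin("egp", "incomplete"))  # egp
```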
In our scenario, R6 is generating the prefix 100.100.100.0/24 by redistributing its Loopback100 interface:
R6#show route-map
route-map CONN, permit, sequence 10
Match clauses:
interface Loopback100
Set clauses:
Policy routing matches: 0 packets, 0 bytes
R6#conf term
R6(config)#router bgp 65001
R6(config-router)#redistribute connected route-map CONN
R6(config-router)#end
R6#clear ip bgp
R2#sh ip bgp 100.100.100.0/24
BGP routing table entry for 100.100.100.0/24, version 76
Paths: (3 available, best #1, table default)
Advertised to update-groups:
13 18
Local, (Received from a RR-client)
3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
Origin IGP, metric 0, localpref 100, valid, internal, best
Local, (Received from a RR-client)
6.6.6.6 (metric 11) from 6.6.6.6 (6.6.6.6)
Origin incomplete, metric 0, localpref 100, valid, internal
65002
4.4.4.4 (metric 11) from 4.4.4.4 (4.4.4.4)
Origin IGP, metric 0, localpref 100, valid, external
R2 prefers the path through R3 because of the origin type.
In order to change the origin type, a route-map must be used:
R6#conf term
Enter configuration commands, one per line. End with CNTL/Z.
R6(config)#route-map CONN
R6(config-route-map)#set origin igp
R6(config-route-map)#end
R6#clear ip bgp 2.2.2.2
R2#sh ip bgp 100.100.100.0/24
BGP routing table entry for 100.100.100.0/24, version 76
Paths: (3 available, best #1, table default)
Advertised to update-groups:
13 18
Local, (Received from a RR-client)
6.6.6.6 (metric 11) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 100, valid, internal, best
Local, (Received from a RR-client)
3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
Origin IGP, metric 0, localpref 100, valid, internal
65002
4.4.4.4 (metric 11) from 4.4.4.4 (4.4.4.4)
Origin IGP, metric 0, localpref 100, valid, external
6.- PATH WITH THE LOWEST MED
MED comparison only occurs if the first (neighboring) AS is the same in the two paths being compared. There are other implications (check the Cisco reference to learn more about this parameter).
It’s an optional non-transitive attribute, so it may not be passed on to other ASes, and its usage as a tie-breaker between several paths depends on each AS’s policy. The lowest MED is preferred.
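The condition on the neighboring AS can be sketched in Python as follows. This is a simplified model with hypothetical names (`neighbor_as`, `med_winner`); paths are plain dictionaries and the "other implications" mentioned above are not covered.

```python
from typing import List, Optional


def neighbor_as(as_path: List[int]) -> Optional[int]:
    """First AS in the path, or None for a path with an empty AS_PATH."""
    return as_path[0] if as_path else None


def med_winner(a: dict, b: dict) -> Optional[dict]:
    """Return the path with the lowest MED, or None if MED does not decide."""
    if neighbor_as(a["as_path"]) != neighbor_as(b["as_path"]):
        return None  # different neighboring AS: skip the MED step
    if a["med"] == b["med"]:
        return None  # tie: move on to the next step
    return a if a["med"] < b["med"] else b


r3 = {"via": "R3", "as_path": [], "med": 2000}
r6 = {"via": "R6", "as_path": [], "med": 1000}
r4 = {"via": "R4", "as_path": [65002], "med": 0}
print(med_winner(r3, r6)["via"])  # R6
print(med_winner(r3, r4))         # None
```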
MED can be manipulated using a route-map:
R3#conf term
R3(config)#route-map MED
R3(config-route-map)#set metric 20000
R3(config-route-map)#router bgp 65001
R3(config-router)#neig 2.2.2.2 route-map MED out
R3(config-router)#end
R3#clear ip bgp 2.2.2.2
R6#conf term
R6(config)#route-map MED
R6(config-route-map)#set metric 1000
R6(config-route-map)#exit
R6(config)#router bgp 65001
R6(config-router)#neig 2.2.2.2 route-map MED out
R6(config-router)#end
R6#clear ip bgp 2.2.2.2
R2#sh ip bgp
BGP table version is 81, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
* i100.100.100.0/24 3.3.3.3 2000 100 0 i
*>i 6.6.6.6 1000 100 0 i
* 4.4.4.4 0 0 65002 i
7.- PREFER EBGP OVER IBGP
We have reached the most interesting point. In the first part of the post, we saw that the path through R6, which is an iBGP peer, was preferred over the path through R4, which is an eBGP peer.
This is because whether the route was learned via iBGP or eBGP is not considered until all the above attributes are equal. In that case, the path learned through an eBGP session is preferred over the one learned through an iBGP session.
To test this, I have changed the scenario a little. Now R5 keeps an eBGP session with R3, and it announces the prefix 100.100.100.0/24.
R4 has an eBGP session with R2, and it also announces the prefix 100.100.100.0/24. Between R2 and R3 there is an iBGP session, but R2 filters everything towards R3.
In this situation, we see that R2 gets two paths for the prefix 100.100.100.0/24. Both paths have the same attributes, but one of them is through an iBGP peer, and the other one through an eBGP peer:
R2#sh ip bgp
BGP table version is 84, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
* i100.100.100.0/24 5.5.5.5 0 100 0 65003 i
*> 4.4.4.4 0 0 65002 i
R2#sh ip bgp 100.100.100.0/24
BGP routing table entry for 100.100.100.0/24, version 84
Paths: (2 available, best #2, table default)
Advertised to update-groups:
13
65003, (Received from a RR-client)
5.5.5.5 (metric 21) from 3.3.3.3 (3.3.3.3)
Origin IGP, metric 0, localpref 100, valid, internal
65002
4.4.4.4 (metric 11) from 4.4.4.4 (4.4.4.4)
Origin IGP, metric 0, localpref 100, valid, external, best
R2 prefers the path through the eBGP peer, although it has another path through an iBGP peer.
8.- PATH WITH LOWEST IGP METRIC
If all the above attributes are equal and no path has been chosen yet, the next parameter to check is the IGP cost to reach the different next-hops of the prefix.
Getting back to the original scenario, I changed the OSPF cost of R3’s loopback. Now only R6 and R3 are announcing the prefix 100.100.100.0/24:
R2#sh ip bgp
BGP table version is 88, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
* i100.100.100.0/24 3.3.3.3 0 100 0 i
*>i 6.6.6.6 0 100 0 i
R2#sh ip bgp 100.100.100.0/24
BGP routing table entry for 100.100.100.0/24, version 88
Paths: (2 available, best #2, table default)
Advertised to update-groups:
13
Local, (Received from a RR-client)
3.3.3.3 (metric 1010) from 3.3.3.3 (3.3.3.3)
Origin IGP, metric 0, localpref 100, valid, internal
Local, (Received from a RR-client)
6.6.6.6 (metric 11) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 100, valid, internal, best
R2#sh ip route 3.3.3.3
Routing entry for 3.3.3.3/32
Known via "ospf 1", distance 110, metric 1010, type intra area
Last update from 10.10.23.3 on Ethernet0/2, 00:00:47 ago
Routing Descriptor Blocks:
* 10.10.23.3, from 3.3.3.3, 00:00:47 ago, via Ethernet0/2
Route metric is 1010, traffic share count is 1
R2#sh ip route 6.6.6.6
Routing entry for 6.6.6.6/32
Known via "ospf 1", distance 110, metric 11, type intra area
Last update from 10.10.26.6 on Ethernet0/3, 05:23:31 ago
Routing Descriptor Blocks:
* 10.10.26.6, from 6.6.6.6, 05:23:31 ago, via Ethernet0/3
Route metric is 11, traffic share count is 1
R2 prefers the path through R6 because the OSPF metric to reach that next-hop is lower (11 versus 1010). All the other parameters are exactly the same for both paths.
And that’s all for now.
I hope this post helps clarify the BGP best-path selection algorithm!