2009.10.20 NANOG47 Day 2 notes, morning sessions
- Subject: 2009.10.20 NANOG47 Day 2 notes, morning sessions
- From: mpetach at netflight.com (Matthew Petach)
- Date: Tue, 20 Oct 2009 10:08:41 -0700
Here's my notes from this morning's sessions. :)
Off to lunch now!
Matt
2009.10.20 NANOG day 2 notes, first half
Dave Meyer kicks things off at 0934 hours
Eastern time.
Survey! Fill it out!
http://tinyurl.com/nanog47
Cathy Aaronson will start off with a remembrance
of Abha Ahuja. She mentored, chaired working
groups, and helped found the net-grrls group;
she was always in motion, always writing software
to help other people. She always had a smile, always
had lots to share with people.
If you buy a tee shirt, Cathy will match the donation.
John Curran is up next, chairman of ARIN
Thanks to NANOG SC and Merit for the joint meeting;
Add your operator perspective!
Vote today in the NRO number council election!
You can vote with your NANOG registration email.
https://www.arin.net/app/election
Join us tonight for open policy hour (this room)
and happy hour (rotunda)
Participate in tomorrow's IPv6 panel discussion
and the rest of the ARIN meeting.
You can also talk to the people at the election
help desk.
During the open policy hour, they'll discuss the
policies currently on the table.
And please join in the IPv6 panel tomorrow!
If you can, stay for the ARIN meeting, running
through Friday.
This includes policy for allocation of ASN blocks
to RIRs
Allocation of IPv4 blocks to RIRs
Open access to IPv6 (make barriers even lower)
IPv6 multiple discrete networks (if you have non
connected network nodes)
Equitable IPv4 run-out (what happens when the free
pool gets smaller and smaller!)
Tomorrow's Joint NANOG panel
IPv6--emerging success stories
Whois RESTful web service
Lame DNS testing
Use of ARIN templates
consultation process ongoing now; do we want to
maintain email-based access for all template types?
Greg Hankins is up next for 40GbE and 100GbE
standards update--IEEE P802.3ba
Lots of activity to finalize the new standards specs
many changes in 2006-2008 as objectives first developed
After draft 1.0, less news to report as task force
started comment resolution and began work towards the
final standard
Finished draft 2.2 in August; dotting Is, crossing Ts
Working towards sponsor ballot and draft 3.0
On schedule for delivery in June 2010
Copper interface moved from 10 meters to 7 meters.
100m on multimode,
added 125m on OM4 fiber, slightly better grade.
CFP is the module people are working towards as
a standard.
Timeline slide--shows the draft milestones that
IEEE must meet. It's actually hard to get hardware
out the door based around standards definitions.
If you do silicon development and you jump in too
fast, the standard can change under you; but if you
wait too long, you won't be ready when the standard
is fully ratified.
July 2009, Draft 2 (2.2), no more technical changes,
so MSAs have gotten together and started rolling
out pre-standard cards into market.
Draft 3.0 is big next goal, it goes to ballot for
approval for final standards track.
After Draft 3.0, you'll see people start ramping
up for volume production.
Draft 2.x will be technically complete for WG ballot
tech spec finalized
first gen pre-standard components have hit market
technology demonstrations and forums
New media modules:
QSFP modules
created for high density short reach interfaces
(came from Infiniband)
Used for 40GBASE-CR4 and 40GBASE-SR4
CXP modules
proposed for infiniband and 100GE
12 channels
100GbE uses 10 of 12 channels
used for 100GBASE-SR10
CFP Modules
long reach apps
big package
used for SR4, LR4, SR10, ER4
about twice the size of a Xenpak
100G and 40G options for it.
MPO/MTP cable
multi-fiber push-on
high-density fiber option
40GBASE-SR4
12 fiber MPO cable, uses 8 fibers (4 Tx + 4 Rx)
100GBASE-SR10
24 fiber MPO cable, uses 20 fibers (10 Tx + 10 Rx)
this will make cross connects a challenge
Switches and Routers
several vendors working on pre-standard cards,
you saw some at beer and gear last night.
Alcatel, Juniper
First gen tech will be somewhat expensive and
low density
geared for those who can afford it initially and
really need it.
Nx10G LAG may be more cost effective
higher speed interfaces will make 10GbE denser and
cheaper
Density improves as vendors develop higher capacity
systems to use these cards
density requires > 400Gbps/slot for 4x100GbE ports
Cost will decrease as new technology becomes feasible.
Future meetings
September 2009, Draft 2.2 comment resolution
Nov 2009 plenary
Nov 15-20, Atlanta
Draft 3.0 and sponsor ballot
http://grouper.ieee.org/groups/802/3/ba/index.html
You have to go to the meeting to get the password
for the draft, unfortunately.
Look at your roadmap for next few years
get timelines from your vendors
optical gear, switches, routers
server vendors
transport and IP transit providers, IXs
Others?
figure out what is missing and ask for it
will it work with your optical systems
what about your cabling infrastructure
40km 40GbE
Ethernet OAM
Jumbo frames?
There's no 40km offering now; if you need it,
start asking for it!
Demand for other interfaces
standard defines a flexible architecture, enables
many implementations as technology changes
Expect more MSAs as tech develops and becomes cost
effective
serial signalling spec
duplex MMF spec
25Gbps signalling for 100GbE backplane and copper
apps
Incorporation of Energy Efficient Ethernet (P802.3az)
to reduce energy consumption during idle times.
Traffic will continue to increase
Need for TbE is already being discussed by network
operators
Ethernet will continue to evolve as network requirements
change.
Question, interesting references.
Dani Roisman, PeakWeb
RSTP to MST spanning tree migration in a live datacenter
Had to migrate from a Per-vlan RSTP to MST on a
highly utilized network
So, minimal impact to a live production network
define best practices for MST deployment that will
yield maximal stability and future flexibility
Had minimal reference material to base this on
Focus on this is about real-world migration details
read white papers and vendor docs for specs on each
type.
The environment:
managed hosting facility
needed flexibility of any vlan to any server, any rack
each customer has own subnet, own vlan
Dual-uplinks from top-of-rack switches to core.
High number of STP logical port instances
using rapid pvst on core
VLAN count x interface count = logical port instances
Too many spanning tree instances for layer 3 core switch
concerns around CPU utilization, memory, other resource
exhaustion at the core.
Vendor support: per-vlan STP
Cisco: per-vlan is the default config, cannot switch
to single-instance STP
Foundry/Brocade offers a per-vlan mode to interoperate
with Cisco
Juniper MX and EX offer VSTP to interoperate
Force10 FTOS
Are we too spoiled with per-vlan spanning tree?
don't need per-vlan spanning tree, don't want to
utilize alternate path during steady-state since
we want to guarantee 100% capacity during
failure scenario
options:
collapse from per-vlan to single-instance STP
Migrate to standards-based 802.1s MSTP
(multiple spanning tree--but really going to fewer
spanning trees!)
MST introduces new configuration complexity
all switches within region must have same
vlan-to-mst mapping
means any vlan or mst change must be done
universally to all devices in site.
issues with change control; all rack switches
must be touched when making single change.
Do they do one MST that covers all vlans?
Do they pre-create instances?
do all vendors support large instance numbers?
No, some only support instances 1-16
Had to do migration with zero downtime if possible
Used a lab environment with some L3 and L2 gear
Found a way to get it down to one STP cycle of 45secs
Know your roots! Set cores to "highest" STP priority
(lowest value)
Set rack switches to lower-than-default to ensure
they never become root.
Start from roots, then work your way down.
MSTP runs RSTP for backwards compatibility
choose VLAN groups carefully.
Instance numbering
some only support small number, 1-16
starting point
all devices running 802.1w
core 1 root at 8192
core 2 root at 16384
You can pre-configure all the devices with the
spanning tree mapping, but it doesn't go live until
the final command is entered
Don't use vlan 1!
set mst priority for your cores and rack switches.
don't forget MST 0!
vlan 1 hangs out in MST 0!
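A minimal Python sketch (mine, not from the talk; the data
layout is invented) of the sanity check this requirement
implies: MST only forms a single region when the name,
revision, and full vlan-to-instance mapping match on every
switch, so verify that before flipping any spanning-tree modes.

    def same_mst_region(switches):
        # each entry: {"host": ..., "name": ..., "revision": ...,
        #              "vlan_to_instance": {vlan: instance, ...}}
        ref = switches[0]
        return [s["host"] for s in switches[1:]
                if (s["name"], s["revision"], s["vlan_to_instance"])
                != (ref["name"], ref["revision"], ref["vlan_to_instance"])]

    switches = [
        {"host": "core1", "name": "DC1", "revision": 1,
         "vlan_to_instance": {10: 1, 20: 2}},
        {"host": "rack1", "name": "DC1", "revision": 1,
         "vlan_to_instance": {10: 1, 20: 2}},
    ]
    assert same_mst_region(switches) == []  # empty list: safe to migrate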
First network hit: when you change core 1 to
spanning-tree mode mst
step 2, core2 moves to mst mode; brief blocking
moment.
step 3; rack switches, one at a time, go into
brief blocking cycle.
Ongoing maintenance
all new devices must be pre-configured with identical
MST params
any vlan to instance mapping changes, do to core 1
first
no protocol for MST config propagation
vtp follow-on?
MST adds config complexity
MST allows for great multi-vendor interoperability in
a layer 2 datacenter
only deployed a few times--more feedback would be
good.
Q:
Leo Bicknell, ISC; he's done several; he points
half rack switches at one core, other half at
other core; that way in core failure, only half
of traffic sloshes; also, on that way with traffic
on both sides, failed links showed up much more
quickly.
Any device in any rack has to support any vlan
is a scaling problem. Most sites end up going
to Layer3 on rack switches, which scales much
better.
A: Running hot on both sides, 50/50 is good for
making sure both paths are working; active/
standby allows for hidden failures. But
since they set up and then leave, they
needed to make sure what they leave behind is
simple for the customer to operate.
The Layer3 move is harder for managed hosting,
you don't know how many servers will want in a
given rack switch.
Q: someone else comes to mic, ran into same
type of issue. They set up their network
to have no loops by design.
Each switch had 4x1G uplinks; but when they
had flapping, it tended to melt CPU.
Vendor pushed them towards Layer3, but they
needed flexibility for any to any.
They did pruning of vlans on trunk ports;
but they ended up with little "islands" of
MST where vlans weren't trunked up.
Left those as odd 'separate' root islands,
rather than trying to fix them.
A: So many services are built around broadcast
and multicast style topologies that it's hard
to move to Layer3, especially as virtualization
takes off; the ability to move instances around
the datacenter is really crucial for those
virtualized sites.
David Maltz, Microsoft Research
Datacenter challenges--building networks for agility
brief characterization of "mega" cloud datacenters
based on industry studies
costs
pain-points
traffic pattern characteristics in data centers
VL2--virtual layer 2
network virtualization
uniform high capacity
Cloud service datacenter
50k-200k servers
scale-out is paramount; some services have 10s of
servers, others 10s of 1000s.
servers divided up among hundreds of services
Costs for servers dominates datacenter cost:
servers 45%, power infrastructure 25%,
maximize useful work per dollar spent
ugly secret: 10-30% CPU utilization considered "good"
in datacenters
servers not doing anything at all
cause
servers are purchased rarely (quarterly)
reassigning servers is hard
every tenant hoards servers
solution: more agility: any server, any service
Network diagram showing L3/L2 datacenter model
higher in datacenter, more expensive gear, designed
for 1+1 redundancy, scale-up model, higher in model
handles higher traffic levels.
Failure higher in model is more impactful.
10G off rack level, rack level 1G
Generally about 4,000 servers per L2 domain
network pod model keeps us from dynamically
growing/shrinking capacity
VLANs used to isolate properties from each other
IP addresses topologically determined by ARs
Reconfig of IPs and vlan trunks is painful,
error-prone, and takes time.
No performance isolation (vlan is reachability
isolation only)
one service sending/receiving too much stomps on
other services
Less and less capacity available for each server
as you go to higher levels of network: 80:1 to 240:1
oversubscriptions
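To make those ratios concrete, a quick back-of-the-envelope
in Python (my numbers, assuming 1G-attached servers):

    # at full contention, usable bandwidth per 1G server across
    # the oversubscribed layers shrinks to almost nothing
    for ratio in (80, 240):
        print(f"{ratio}:1 -> {1000 / ratio:.1f} Mbps per server")
    # 80:1 -> 12.5 Mbps, 240:1 -> 4.2 Mbps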
2 types of apps: inward facing (HPC) and outward
facing. 80% of traffic is internal traffic; data
mining, ad relevance, indexing, etc.
dynamic reassignment of servers and map/reduce
style computations means explicit TE is almost
impossible.
Did a detailed study of 1500 servers on 79 ToR
switches.
Look at every 5-tuple for every connection.
Most of the flows are 100 to 1000 bytes; lots
of bursty, small traffic.
But most bytes are part of flows that are 100MB
or larger. Huge dichotomy not seen on internet
at large.
median of 10 flows per server to other servers.
how volatile is traffic? cluster the traffic
matrices together.
If you use 40-60 clusters, you cover a day's worth
of traffic. More clusters give a better fit.
traffic patterns change nearly constantly.
80th percentile is 100s; 99th percentile is 800s
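Roughly the kind of clustering described, sketched in Python
(the input file and cluster counts are illustrative, not
Microsoft's actual pipeline):

    import numpy as np
    from sklearn.cluster import KMeans

    # one row per measurement interval: a traffic matrix
    # flattened into a vector
    tms = np.load("traffic_matrices.npy")  # hypothetical input
    for k in (10, 40, 60):
        km = KMeans(n_clusters=k, n_init=10).fit(tms)
        print(k, km.inertia_)  # lower inertia = tighter fit to the day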
server to server traffic matrix; most of the
traffic is diagonal; servers that need to
communicate tend to be grouped to same
top of rack switch.
but off-rack communications slow down the
whole set of server communications.
Faults in datacenter:
high reliability near top of tree, hard to accomplish
maintenance window, unpaired router failed.
0.3% of failure events knocked out all members of
a network redundancy group
typically at lower layers of network, but not always
objectives:
developers want network virtualization; want a model
where all their servers, and only their servers are
plugged into an ethernet switch.
Uniform high capacity
Performance isolation
Layer2 semantics
flat addressing; any server use any IP address
broadcast transmissions
VL2: distinguishing design principles
randomize to cope with volatility
separate names from locations
leverage strengths of end systems
build on proven network technology
what enables a new solution now?
programmable switches with high port density
Fast, cheap, flexible (broadcom, fulcrum)
20 port 10G switch--one big chip with 240G
List price, $10k
small buffers (2MB or 4MB packet buffers)
small forwarding table; 10k FIB entries
flexible environment; general purpose network
processor you can control.
centralized coordination
scale-out datacenters are not like enterprise networks
centralized services already control/monitor health and
role of each server (Autopilot)
Centralized control of traffic
Clos network:
ToR connect to aggs, aggs connect to intermediate node
switches; no direct cross connects.
The bisection bandwidth between each layer is the same,
so there's no need for oversubscription
You only lose 1/n chunk of bandwidth for a single
box; so you can have automated reboot of a device
to try to bring it back if it wigs out.
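The "lose 1/n" point in one line of arithmetic (n here is
illustrative):

    # with n equal-share intermediate switches, rebooting one
    # costs only 1/n of bisection bandwidth
    n = 10
    print(f"capacity with one switch down: {100 * (n - 1) / n:.0f}%")  # 90%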
Use valiant load balancing
every flow is bounced off a random intermediate switch
provably hotspot free for any admissible traffic matrix
works well in practice.
Use encapsulation on cheap dumb devices.
two headers; outer header is for intermediate switch,
intermediate switch pops outer header, inner header
directs packet to destination rack switch.
MAC-in-MAC works well.
leverage strength of endsystems
shim driver at NDIS layer, trap the ARP, bounce to
VL2 agent, look up central system, cache the lookup,
all communication to that dest no longer pays the
lookup penalty.
You add extra kernel drivers to network stack when
you build the VM anyhow, so it's not that crazy.
Applications work with application addresses
AAs are flat names; infrastructure addresses invisible
to apps
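A sketch of the agent's cached lookup as described (names and
data are invented; this is not Microsoft's code):

    # central directory maps application address (AA) to the
    # location address (LA) of the destination's ToR switch
    directory = {"10.1.0.5": "la-tor-17"}   # illustrative entry
    cache = {}

    def resolve(aa):
        if aa not in cache:
            cache[aa] = directory[aa]   # one trip to the central system
        return cache[aa]                # later packets skip the lookup

    print(resolve("10.1.0.5"))  # first packet of the flow pays the penalty
    print(resolve("10.1.0.5"))  # cached thereafter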
How to implement VLB while avoiding need to update
state to every host on every topology change?
many switches are optimized for uplink passthrough;
so it seems to be better to bounce *all* traffic
through intermediate switches, rather than trying
to short-circuit locally.
The intermediate switches all have same IP address,
so they all send to the same intermediate IP, it
picks one switch.
You get anycast+ECMP to get fast failover and good
valiant load balancing.
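A toy version of that forwarding decision (my sketch; the
switch names and flow-tuple layout are invented):

    INTERMEDIATES = ["int-1", "int-2", "int-3", "int-4"]  # one shared anycast IP

    def pick_intermediate(flow_tuple):
        # deterministic per flow (like ECMP hashing) so packets stay
        # in order, while flows spread across all intermediates
        return INTERMEDIATES[hash(flow_tuple) % len(INTERMEDIATES)]

    def encapsulate(payload, flow_tuple, dest_tor):
        # outer header: to the intermediate, which pops it;
        # inner header: directs the packet to the destination ToR
        return {"outer": pick_intermediate(flow_tuple),
                "inner": dest_tor, "payload": payload}

    print(encapsulate(b"data", ("10.1.0.5", 80, "10.2.0.9", 51512, 6), "tor-9"))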
They've been growing this, and found nearly perfect
load balancing.
All-to-all shuffle of 500MB among 75 servers;
get within 94% of perfect balancing; they charge
the gap to the overhead of the extra encapsulation
headers.
NICs aren't entirely full duplex; about 1.8Gb not 2Gb
bidirectional.
Provides good performance isolation as well; as one
service starts up, it has no impact on a service
already running at steady state.
VLB does as well as adaptive routing (TE using
oracle) on datacenter traffic
worst link is 20% busier with VLB; median is same.
And that's assuming perfect knowledge of future
traffic flows.
Related work:
OpenFlow
wow that went fast!
Key to datacenter economics is agility!
any server any service
network is largest blocker
right network model to create is virtual layer 2
per service
VL2 uses:
randomization
name-location separation
end systems
Q: Joe Provo--shim only applied to intra-datacenter
traffic; external traffic is *NOT* encapsulated?
A: Yes!
Q: This looks similar to 802.1aq in IEEE; when you
did the test case, how many did you look at moving
across virtualized domains?
A: because they punt to centralized name system,
there is no limit to how often servers are switched,
or how many servers you use; you can have 10 servers
or 100,000 servers; they can move resources on 10ms
granularity.
Scalability is how many servers can go into VL2 "vlan"
and update the information.
In terms of number of virtual layer 2 environments,
it's looking like 100s to 1000s.
IEEE is looking at MAC-in-MAC for silicon based benefits;
vlans won't scale, so they use the 802.1ah header (24-bit
I-SID), which gives them 16M possibilities, and use IS-IS
to replace spanning tree.
Did they look at moving entire topologies, or just servers?
They don't want to move whole topology, just movement in
the leaves.
Todd Underwood, Google; separate tenants, all work for
the same company, but they all have different stacks,
no coordination among them. this sounds like a
competing federation within the same company; why
does microsoft need this?
A: If you can handle this chaos, you can handle
anything!
And in addition to hosting their own services, they
also do hosting of other outsourced services like
exchange and sharepoint.
Microsoft has hundreds of internal properties
essentially.
Q: this punts on making the software side work
together, right? Makes the network handle it at
the many-to-many layer.
Q: Dani, Peakweb--how often is the shim lookup happening,
is it start of every flow?
A: Yes, start of every flow; that works out well; you
could aggregate, have a routing table, but doing it
per dest flow works well.
Q: Is it all L3, or is there any spanning tree involved?
A: No, network is all L3.
Q: Did you look at Woven at all?
A: Their solution works to about 4,000 servers, but it
doesn't scale beyond that.
Break for 25 minutes now,
11:40 start time. We'll pop in a few more lightning
talks.
Somebody left glasses at beer and gear, reg desk has
them. :)
Break now!
Vote for SC members!!
Next up, Mirjam Kuhne, RIPE NCC,
RIPE Labs, new initiative of RIPE NCC
First there was RIPE, the equivalent of NANOG,
then the NCC came into existence to handle meeting
coordination, act as registrar, run the mailing
lists, etc.
RIPE Labs is a website, and a platform and a tool
for the community
You can test and evaluate new tools and prototypes
contribute new ideas
why RIPE labs?
faster, tighter innovation cycle
provide useful prototypes to you earlier
adapt to the changing environment more quickly
closer involvement of the community
openness
make feedback and suggestions faster and more
effective
http://labs.ripe.net/
many of the talks here are perfect candidates for
material to post on labs, to get feedback from your
colleagues, get research results, post new findings.
How can it benefit you?
get involved, share information, discover others
working on similar issues, get more exposure.
Few rules:
free and civil discussion between individuals
anyone can read content
register before contributing
no service guarantee
content can disappear based on
community feedback
legal or abuse issues
too little resources
What's on RIPE Labs?
DNS Lameness measurement tool
REX, the resource explorer
Intro to internet number resource database
IP address usage movies
16-bit ASN exhaustion data
NetSent next gen information service
Please take a look and participate!
mir at ripe.net or labs at ripe.net
Q: Cathy Aaronson notes that ISP security
BOF is looking for place to disseminate
information; but they should probably get
in touch with you!
Kevin Oberman is up next, from ESnet
DNSSEC Basics--don't fear the signer!
why you should sign your data sooner rather
than later
this is your one shot to experiment with signing
when you can screw up and nobody will care!
later, you screw up, you disappear from the net.
DNSSEC uses public crypto, similar to SSH
DNSSEC uses a trust-anchor system, NOT PKI! No certs!
Starts at root, and traces down.
Root key is well known
Root knows the .net key
.net knows the es.net key
the es.net key signs *.es.net
Perfect time to test and experiment without fear.
Once you publish keys, and people validate, you
don't want to experiment and fail--you will
disappear!
signing your information has no impact.
Only when you publish your keys will it have impact.
It is REALLY getting closer!
Root will be signed 2010
Org and Gov are signed now
com and net should be signed 2011
Multiple ccTLDs are signed; .se led the way,
and they have lots of experience; only once did they
disappear, and that was due to a missing dot in a
config file; not at all DNSSEC related.
Registration issues still being worked on
transfers are of particular concern
an unhappy losing registrar could hurt you!
Implementation
Until your parent is ready
Develop signing policies and procedures
test, test, and test some more
key re-signing
key rolls
management tools
find out how to transfer the initial key to your parent
(when parent decides)
this is a trust issue--are you really "big-bank.com"
If you're brave
you can test validation (very few doing it--test on
internal server first!!) -- if this breaks, your
users will hurt (but not outside world)
You can give your public keys to the DLV (or ITARs)
this can hurt even more!
(DLV is automated, works with BIND out of the box, it's
simpler, but you can choose which way to go)
What to sign?
Forward zone is big win
reverse zone has less value
may not want to sign some or all reverse or forward zones
signing involves 2 types of keys
ZSK and KSK: the ZSK signs the zone data, the KSK signs
the key set and is what you hand up to your parent
keys need to be rolled regularly
if all keys and signatures expire, you lose all access,
period.
use two active keys
data resigned by 2 newest keys
sign at short intervals compared to expiration to
allow time to fix things.
new keys require parent to be notified.
KSKs are kept 'safe', off the network (rotate annually)
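A toy schedule check for the "sign at short intervals compared
to expiration" advice (the dates and intervals are illustrative):

    from datetime import date, timedelta

    validity = timedelta(days=30)      # chosen RRSIG lifetime
    resign_every = timedelta(days=7)   # re-signing interval

    last_signed = date(2009, 10, 1)
    expires = last_signed + validity
    margin = expires - (last_signed + resign_every)
    # weeks of slack to fix a broken signer before you disappear
    print(margin.days, "days of margin")  # 23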
Wait for BIND 9.7, it'll make your life much
easier.
There are commercial shipping products out there.
Make sure there are at least 2 people who can
run it, in case one gets hit by a bus.
Read NIST SP800-81
SP800-81r1 is out for comment now
Read BIND admin reference manual.
Once in a lifetime opportunity!!
Arien Vijin, AMS-IX
an MPLS/VPLS based internet exchange
(started off as a coax cable between routers)
then became Cisco 5500 switch, AMSIX version 1,
then 2001 went to Foundry switches at gig, version 2,
version 3 has optical switching
AMSIX version 3 vs AMSIX version 4
June 2009 version 3
six sites, 2 with core switches in middle
two star networks
E, FE, GE, N*GE connections on BI-15K or RX8 switches
N*10GE connections resiliently connected on switching
platform (MLX16 or MLX32)
two separate networks, one active at any moment in
time.
selection of active network by VSRP
inactive network switch blocks ports to prevent loops
photonic switch basically flips from one network to the
other network.
Network had some scaling problems at the end.
Until now, they could always just buy bigger
boxes in the core to handle traffic.
Summer of 2009, they realized there was no sign of
a bigger switch on the horizon to replace the core.
core switches fully utilized with 10GE ports
limits ISL upgrade
no other switches on market
platform failover introduces short link flap on all
10GE customer ports--this leads to BGP flaps
with more 10G customers this becomes more of an issue
AMSIX version 4 requirements
scale to 2x port count
keep resilience in platform, but reduce impact on
failover (photonic switch layer)
increase amount of 10G customer ports on access switches
more local switching
migrate to single architecture platform
reduce management overhead
use future-proof platform that supports 40GE and 100GE
2010/2011 fully standardized
They moved to 4 central core switches, all meshed
together; every edge switch has 4 links, one to each
core.
Photonic switch for 10G members, to have redundancy
for customers.
MPLS/VPLS-based peering platform
scaling of core switches by adding extra switches in
parallel
4 LSPs between each pair of access switches
primary and secondary (backup) paths defined
OSPF
bfd for fast detection of link failures
RSVP-TE signalled LSPs over predefined paths
primary/secondary paths defined
VPLS instance per vlan
statically defined VPLS peers (LDP signalled)
load balanced over parallel LSPs over all core routers
Layer 2 ACLs instead of port security
manual adjustment for now
(people have to call with new MAC addresses)
Now they're P/PE routers, not core and access
switches. ^_^;
Resilience is handled by LSP switchover from
primary to secondary path; totally transparent
to access router.
If whole switch breaks down, photonic switch
is used to flip all customers to the secondary
switch.
So, they can only run switches at 50% to allow
for photonic failover of traffic.
How to migrate the platform without customer
impact?
Build new version of photonic switch control daemon (PSCD)
No VSRP traps, but LSP state in MPLS cloud
develop configuration automation
describe network in XML, generate configuration from this
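The config-generation idea in miniature (their tooling is
XML-driven; this Python and the device names are purely
illustrative):

    from itertools import combinations

    access = ["pe1", "pe2", "pe3", "pe4"]   # access (PE) switches
    cores = ["p1", "p2", "p3", "p4"]        # core P switches

    # four parallel LSPs per access pair, one via each core,
    # matching the "4 LSPs between each pair" design above
    for a, b in combinations(access, 2):
        for p in cores:
            print(f"lsp {a}-{b}-via-{p}  primary-path [{a} {p} {b}]")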
Move non MPLS capable access switches behind MPLS
routers and PXC as a 10GE customer connection
Upgrade all non MPLS capable 10GE access switches to
Brocade MLX hardware
Define migration scenario with no customer impact
2 colocation sites only for simplicity
double L2 network
VSRP for master/slave selection and loop protection
Move GE access behind PXC
Migrate one half to MPLS/VPLS network
Use PXC to move traffic to MPLS/VPLS network, test
for several weeks.
After six weeks, did the second half of the network.
Now, two separate MPLS/VPLS networks.
Waited for traffic on all backbone links to drop
below 50%; split uplinks to hit all the core P
devices; at that point, traffic then began using
paths through all 4 P router cores.
Migration--Conclusion
Traffic load balancing over multiple core switches
solves scaling issues in the core
Increased stability of the platform
Backbone failures are handled in the MPLS cloud and
not seen at the access level
Access switch failures are handled by PXC for single
pair of switches only
Operational experience
BFD instability
High LC CPU load caused BFD timeouts
resolved by increasing timers
Bug: ghost tunnels
double "up" event for LSP path
results in unequal load balancing
should be fixed in next patch release
multicast replication
replication done on ingress PE, not in core
only uses 1st link of aggregate of 1st LSP
with PIM-SM snooping traffic is balanced over multiple
links but has serious bugs
bugfixes and load balancing fixes scheduled for future
code releases.
Ripe TTM boxes used to measure delay through the fabric,
GPS timestamps.
Enormous amounts of jitter in the fabric, delays up to
40ms in the fabric.
TTM sends 2 packets per minute, with some
entropy change (source port changes)
VPLS CAM entries age out after 60s
for 24-port aggregates, traffic often passes through a port
whose entry isn't programmed (CPU learning), hence the high delay
does not affect real-world traffic, hopefully
will look at changing CAM timing
packet is claustrophobic?
customer stack issue
increased stability
backbone failures handled by MPLS (not seen by customers)
access switch failures handled for a single pair of
switches now
easier debugging of customer ports
swap to a different port using the Glimmerglass
config generation
absolute necessity due to the large size of MPLS/VPLS configs
Scalability (future options)
bigger core
more ports
Some issues were found, but nothing that materially
impacted customer traffic
Traffic load-sharing over multiple links is good.
Q: did anything change for gigE access customers,
or are they still homed to one switch?
A: nothing changed for gigE customers; glimmerglass
is single-mode optical only, and they're too
expensive for cheap GigE ports.
no growth in 1G ports; no more FE ports; it's
really moving to a 10G only fabric.
RAS and Avi are up next
Future of Internet Exchange Points
Brief recap of history of exchange points
0th gen--throw cable over wall; PSI and Sprint
conspire to bypass ANS; third network wanted in,
MAE-East was born
1st commercial gen: FDDI, ethernet; multi-access,
had head of line blocking issues.
2nd gen: ATM exchange points, from AADS/PBNAP to
the MAEs, peermaker
3rd gen: GigE exchange points, mostly nonblocking
internal switches, PAIX, rise of Equinix, LINX,
AMS-iX
4th gen: 10G exchange points, upgrades, scale-out
of existing IXes through 2 or 3 revs of hardware
Modern exchange points are almost exclusively
ethernet based; cheap, no ATM headaches
10GE and Nx10GE have been primary growth for years.
Primarily flat L2 VLAN
IX has IP block (/24 or so)
each member router gets 1 IP
any member can talk to any other via L2
some broadcast (ARP) traffic is needed
well policed
Large IX topology (LINX), running 8x10G or 16x10G
trunks between locations
What's the problem?
L2 networks are easy to disrupt
forwarding loops easy to create
broadcast storms easy to create, no TTL
takes down not only the exchange point, but overwhelms
the peering router control plane as well
today we work around these issues by
locking down port to single MAC
hard coded, or learn single MAC only
single directly connected router port allowed
careful monitoring of member traffic with sniffers
good IXes have well trained staff for rapid responses
Accountability
most routers have poor L2 stat tracking
options in use:
Netflow from member router
no MAC layer info, can't do inbound traffic
some platforms can't do netflow well at all
SFlow from member routers or from IX operator
still sampled, off by 5% or more
MAC accounting from member router
not available on vast majority of platforms today
None integrate well with provider 95th percentile
billing systems (see the sketch after this list)
IXs are a poor choice for delivering billed services
If you can't bill, you can't sell services over the
platform.
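For reference, the usual 95th-percentile computation those
billing systems expect (a common definition; sampling and
rounding details vary by provider):

    def ninety_fifth(samples_mbps):
        # sort a month of 5-minute rate samples, discard the top
        # 5%, bill on the highest remaining sample
        ranked = sorted(samples_mbps)
        return ranked[int(len(ranked) * 0.95) - 1]

    # short bursts (95, 900) land in the discarded 5%
    print(ninety_fifth([10, 12, 11, 95, 13, 12, 11, 10, 12, 11, 13,
                        12, 14, 11, 12, 13, 10, 11, 12, 900, 12]))  # 14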
Security
Anyone can talk to anyone else
vulnerable to traffic injection
poor accounting options make this hard to detect.
when detected, easy to excuse
less security available for selling paid transit
Vulnerable to Denial of Service attacks
can even be delivered from the outside world if
the IX IP block is announced (as is frequently the case)
Vulnerable to traffic interception, ARP/CAM manipulation
Scalability
difficult to scale and debug large layer 2 networks
redundancy provided through spanning-tree or similar
backup-path protocols
large portions of network placed into blocking mode to
provide redundancy.
Manageability
poor controls over traffic rates and or QoS
difficult to manage multi-router redundancy
multiple routers see the same IX/24 in multiple places
creates an "anycast" effect to the peer next-hops
can result in blackholing if there is an IX segmentation
or if there is an outage which doesn't drop link state.
Other issues:
inter-network jumbo-frames support is difficult
no ability to negotiate per-peer MTU
almost impossible to find common acceptable MTU for
everyone
service is constrained to IP only between two routers
can't use for L2 transport handoff
Avi talks about shared broadcast domain architecture
on the exchange points today.
Alternative is to use point to point virtual circuits,
like the ATM exchanges.
Adds overhead to setup process
adds security and accountability advantages
Under ethernet, you can do vlans using 802.1q
handoff multiple virtual circuit vlans.
Biggest issue is limited VLAN ID space
limited to 4096 possible IDs--12-bit ID space
vlan stacking can scale this in transport
but VLANs in this are global across system
Means a 65 member exchange would completely
fill up the VLAN ID space with a full mesh.
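The arithmetic behind that claim (assuming one global VLAN ID
per directed member pair):

    members = 65
    vcs = members * (members - 1)   # 4160 directed VCs in a full mesh
    print(vcs, vcs > 4096)          # 4160 True: the 12-bit space is exhausted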
Traditional VLAN rewrites don't help either.
Now, the exchange also has to be arbiter of all
the VLANs used on the exchange.
Many customers use layer3 switch/routers, so the
vlan may be global across the whole device.
To get away from broadcast domain without using
strict vlans, we need to look at something else.
MPLS as transport rather than Ethernet
solves vlan scaling problems
MPLS pseudowire IDs are 32 bits; 4 billion VCs
VLAN ID not carried with the packet, used only on handoffs
VLAN IDs not a shared resource anymore
Solves VLAN ID conflict problems
members choose a vlan ID per VC handoff
no requirements for vlan IDs to match on each end
solves network scaling problems
using MPLS TE far more flexible than L2 protocols
allows the IX to build more complex topologies,
interconnect more locations, and more efficiently
utilize resources.
The idea is to move the exchange from L2 to L3 to
scale better, give more flexibility, and do better
debugging. You can get better stats, you can do
parallel traffic handling for scaling and redundancy,
and you see link errors when they happen, they aren't
masked by blocked ports.
Security
each virtual circuit would be isolated and secure
no mechanism for a third party to inject or sniff traffic
significantly reduced DOS potential
Accountability
Most platforms provide SNMP measurement for vlan subinterfaces
Members can accurately measure traffic on each VC
without "guesstimation"
capable of integrating with most billing systems.
Now you can start thinking about selling transport
over exchange points, for example
Takes the exchange point out of the middle of the
traffic accounting process.
Services
with more accountability and security, you can offer
paid services
support for "bandwidth on demand" now possible
no longer constrained to IP-only or one-router-only
can be used to connect transport circuits, SANs, etc.
JumboFrame negotiation possible, since MTU is per
interconnect
Could interconnect with existing metro transport
Use Q-in-Q vlan stacking to extend the network onto
third party infrastructures
imagine a single IX platform to service thousands of
buildings!
Could auto-negotiate VC setup using a web portal
Current exchanges mostly work
with careful engineering to protect the L2 core
with limited locations and chassis
with significant redundancy overhead
for IP services only
A new kind of exchange point would be better
could transform a "peering only" platform into a
new "ecosystem" to buy and sell services on.
Q: Arien from AMS-IX asks about MTU--why does it matter?
A: it's for the peer ports on both sides.
Q: they offer private interconnects at AMS-IX, but nobody
wants to do that, they don't want to move to a tagged
port. They like having a single vlan, single IP to
talk to everyone.
A: The reason RAS doesn't do it is that it's limited in
scale, you have to negotiate the vlan IDs with each side;
there's a slow provisioning cycle for it; it needs to
have the same level of speed as what we're used to on IRC.
Need to eliminate the fees associated with the VLAN
setup, to make it more attractive.
It'll burn IPs as well (though for v6, that's not so much
of an issue)
Having people peer with the route-server is also useful
for people who don't speak the language who use the
route servers to pass routes back and forth.
The question of going outside Amsterdam came up, but
the members forbade it, so that it wouldn't compete with
other transit and transport providers.
But within a metro location, it could open more locations
to participate on the exchange point.
Provisioning to many locations is a challenge, but
within the metro region there is a business model
for it.
Anything else, fling your questions at lunch; return at
1430 hours!
LUNCH!! And Vote! And fill out your survey!!