Routing Working Group,
14 May, 2015, at 2:00 p.m.:
ROB EVANS: So, good afternoon. Hope you had a good lunch. My name is Rob Evans and with João Damas I am going to co‑chair this session. A few things first.
First of all, a few thanks. Thanks to Romeo, who is doing the Jabber relay, to Anand, who is doing the minutes, and thanks to Aoife, who is the stenographer and is doing a frankly amazing job of translating the gobbledy‑gook that some of us speak.
João and I dropped the ball a little bit with the minutes: the NCC sent them around to us, we had a look at them, thought they looked OK and forgot to tell the rest of you. João did that this morning. I will ask if there are any immediate reactions on the minutes of the last meeting here? If there are not, we will leave it open for two weeks on the list. So does anyone have any comments on the minutes? No. Good.
Next thing is to have a quick look at the agenda. Are there any suggested changes, we are fairly tight for time so additional stuff is going to be awkward to put in? No.
Microphone etiquette, when you are asking questions please state your name clearly at the microphone and your affiliation.
Next thing is the long awaited selection of the Working Group co‑chair. As you might have heard, all of the Working Groups have recently gone through the process of developing a process for reselecting Working Group chairs. We did it ‑‑ we put this text around fairly early on, and there were virtually no comments. I realise that many of the other Working Groups have gone through some processes and come up with longer descriptions, but I think this is probably sufficient for now. If and when the other Working Groups reach some consensus on theirs, we can have a look and revisit it; otherwise we are going to use this process now. As João mentioned in London, he is going to stand down, and he ‑‑ we asked on the list for any other candidates, and nobody has put themselves forward. So unless there are objections to João returning as Working Group co‑chair ‑‑ if you do have any objections, please state them now. If not, then I think we can welcome João back as co‑chair of the Routing Working Group.
And one other thing I skipped past there: we did agree on a charter a couple of meetings back, but again, João and I, being very lax at this whole co‑chairing business, forgot to send it to the NCC. We will do that, update the web pages, and hopefully be more comprehensive next time.
So, without any further ado, I would like to welcome Cengiz up on first for the first presentation. Thanks, Cengiz.
CENGIZ ALAETTINOGLU: Good afternoon, everybody. I hope you can hear me fine. The last time I gave a talk at the Routing Working Group was in Bologna, and it was not last year, it was in 2001, I believe, and at the time I was talking about RPSL as well as the LIR toolset that I had brought. But today the subject is analytics, and the talk is a little bit deep technically, so any time you have a question, raise your hand and ask; don't wait until the end. Let's make it clear and so forth.
So this route analytics study started back in 2000: we measured one‑way jitter across the continent in three US backbones and one European backbone. And what we found was that 99.99 percent of the time the jitter was less than one millisecond; it was in microseconds. I think that is amazing, especially back in 2000: there were no queuing delays, nothing in the network; the network was extremely clear. I am not going to talk about this study, but if you have any questions about how we measured the jitter one way, how hard it is, what we did and so forth, please catch me afterwards. What we found out was that the remaining 0.01% of the time the jitter was severe.
So what we observed, and actually this was across all of these networks, is that, for example, in this measurement there was a 10‑second window of packet drops followed by a window of up to seven seconds of jitter. So, in general, we have less than one millisecond, but when the jitter kicks in, it is not seven milliseconds, it is actually seven seconds. So it was huge and we wanted to understand this. We also noticed that there was a lot of packet reordering. In this lower graph here, each of these dots is a packet reception, this is when the packet was sent, this is the amount of jitter in seconds, and the line follows the sending ‑‑ I am sorry, the reception order. So these packets are received, then followed by these packets, then by these, and so forth. This is massive reordering of packets, and we were only sending a 1 megabit per second stream. Usually reordering can happen in the network if you have multiple paths and some packets go one way and others the other way; sometimes it can happen inside the chassis, and sometimes it is one fibre path versus the other. But to exercise that you must send a lot of traffic. You can do that in a lab environment perhaps, but not in the Internet when you are sending 1 megabit per second of traffic, which even back in 2000 was nothing; it was a negligible amount of traffic.
So the ‑‑ this looks like a routing loop unwinding. Basically, if you have a routing loop, a packet goes into that loop and starts going around, and eventually its TTL times out and it is dropped. But as the first packet comes, and then the second, they start turning together, and the third packet can actually enter the loop in the middle of those two; it doesn't have to come after the second packet. And if they turn enough times and the loop eventually goes away, whatever is left in the loop at that time will stop turning around and unwind, and the packets will be ordered like this. This was the ‑‑ we wanted to prove that; this had never been seen before at the time.
So what we did is, OK, if this is routing loops, and this network was running IS‑IS, let's collect routing packets as well. There isn't a routing loop, but let's say there is: we were sending one‑way traffic from this host to this other host, and then we also started capturing IS‑IS packets. And then the hypothesis is: did IS‑IS take more than 10 seconds to converge, and during that time did it actually have a loop?
So before I go into that, for those of you who don't know IS‑IS that much, or OSPF, actually, this is the way it works. Each router sends a link state packet. In it the router says what its neighbours are, what the metrics are and so forth. It sends this periodically as well as when there is a change in the local topology: if one of the neighbours goes down, it will send it again saying this neighbour is missing, for example. The second router does the same, and the third, and so forth. These packets are flooded across the network, so I send it to my neighbours, and they send it to their neighbours, and to their neighbours and so forth; eventually everybody receives everybody's link state packets. And they put that into a database, which is what we call the LSDB, the link state database. From that database you can actually construct the whole topology of the network as a graph, and using the shortest path algorithm you can compute the shortest paths in the network. This is how IS‑IS works and how OSPF works when you look from 10,000 feet. Details are ‑‑
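The flooding-and-SPF description above can be illustrated with a minimal Python sketch. It is not anything shown in the talk: the four-router topology, the metrics and the names `LSDB` and `spf` are assumptions made up for illustration. Each entry in `LSDB` stands for one router's link state packet, and `spf` runs Dijkstra's algorithm over that database from one router's point of view.

```python
import heapq

# Hypothetical link state database: one entry per router's LSP,
# listing that router's neighbours and the metric of each link.
LSDB = {
    "A": {"B": 10, "C": 5},
    "B": {"A": 10, "D": 1},
    "C": {"A": 5, "D": 20},
    "D": {"B": 1, "C": 20},
}

def spf(lsdb, root):
    """Dijkstra over the LSDB: {destination: (cost, first_hop)} seen from root."""
    best = {root: (0, None)}
    heap = [(0, root, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if cost > best[node][0]:
            continue  # stale heap entry, a cheaper path was already found
        for neighbour, metric in lsdb.get(node, {}).items():
            new_cost = cost + metric
            # The first hop is fixed when leaving the root and inherited afterwards.
            hop = neighbour if node == root else first_hop
            if neighbour not in best or new_cost < best[neighbour][0]:
                best[neighbour] = (new_cost, hop)
                heapq.heappush(heap, (new_cost, neighbour, hop))
    return best

routes = spf(LSDB, "A")
for destination, (cost, first_hop) in sorted(routes.items()):
    print(destination, cost, first_hop)
```

Every router runs the same computation on its own copy of the database, which is why the copies have to agree for the resulting paths to be consistent.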
So, in this environment, you can actually have routing loops, and it's actually by design. These are called micro loops, and in today's networks they are typically less than 100 milliseconds. A micro loop happens like this: let's say that a link has gone down. The router starts announcing a new link state packet in which that particular neighbour is now missing, and it is flooded across the network. Some routers will receive it first, the routers which are further away will receive it later, and if they happen to run SPF at different times, because it is not synchronised between them, one will say my shortest path is this way and the other will say it is the other way. It is only during this flooding that this can actually happen, and typically flooding in networks is in terms of milliseconds; you just have to put your packet across the network. So it can happen for a very short period of time. We don't bother solving this problem because it's very hard, and the way around it is to do something like fast reroute ‑‑ local protection schemes to get around this problem. But the IGP itself will have this property: basically, during convergence it will have short‑lived transient micro loops. But what we observed is not micro; it's actually 10 seconds, which is huge. So that is the question: did we really have this kind of routing loop in the IGP?
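To make the micro loop concrete, here is a small Python sketch, again with invented router names and hand-written next-hop tables rather than data from the talk. Router A has already run SPF on the new topology and sends traffic for D via B, while B is still using its pre-failure result and sends it via A. The packet bounces until its TTL runs out, which matches the TTL timeout described earlier.

```python
def trace(next_hops, source, destination, ttl=64):
    """Follow per-router next hops; return (path, outcome)."""
    path, node = [source], source
    while node != destination:
        if ttl == 0:
            return path, "ttl-expired"
        node = next_hops[node][destination]
        path.append(node)
        ttl -= 1
    return path, "delivered"

# Hypothetical next hops towards D while the new LSP is still being flooded:
# A has recomputed (link A-D is down), B has not yet.
during_flooding = {"A": {"D": "B"}, "B": {"D": "A"}}

# Once B has also run SPF on the updated database, the loop is gone.
converged = {"A": {"D": "B"}, "B": {"D": "D"}}

print(trace(during_flooding, "A", "D", ttl=4))
print(trace(converged, "A", "D", ttl=4))
```

The loop only exists while the two routers disagree, so its duration is bounded by how long flooding and SPF scheduling take.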
So this is what we collected from the network. Basically, in this network there was a lot of churn in the ‑‑ it goes up to 8 updates a second, and 8 updates is a ‑‑ actually, when you flood you don't want to be overwhelming the other routers: you send it to your neighbour, and you don't want to be overwhelming him with more and more changes. At the time the default parameter was about eight a second; you couldn't send more than that because of the ‑‑ anyway, if you remove all the refreshes, this is the ‑‑ this network was very volatile at the time, actually failing and repairing and so forth; at times you do see up to 4 topology changes in this network.
It has come a long way: look at similar graphs today and they are much quieter; routing is much better today than back in 2000. The bottom graph is the proof ‑‑ so we have two test machines, one on each side of the continent. Each time we receive an LSP we time stamp it, and we look at the difference: how long the LSP took to reach one place versus the other place. And actually, if you look at the graph, it goes all the way to two‑and‑a‑half minutes. In this network not just a 10‑second routing loop, even a two‑and‑a‑half minute one is likely.
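The measurement described above amounts to joining two timestamped LSP logs and taking the difference per LSP. A minimal Python sketch, assuming synchronised clocks on both collectors and using invented LSP identifiers, sequence numbers and timestamps (none of these values come from the talk):

```python
# Hypothetical receive times in seconds, keyed by (LSP ID, sequence number),
# as logged by two collectors on opposite sides of the network.
east = {("R1.00-00", 101): 1000.020, ("R2.00-00", 55): 1003.500}
west = {("R1.00-00", 101): 1000.065, ("R2.00-00", 55): 1153.500}

def flooding_delays(log_a, log_b):
    """Absolute arrival-time difference for every LSP seen by both collectors."""
    return {key: abs(log_a[key] - log_b[key]) for key in log_a.keys() & log_b.keys()}

delays = flooding_delays(east, west)
worst = max(delays.values())
print(f"worst flooding delay: {worst:.1f} s")
```

Any LSP whose delay is measured in seconds rather than milliseconds marks a period in which the two sides of the network were computing paths from different databases.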
Anyway, so to wrap it up, what we found out is that, basically, at times the routing databases were not in sync; as a result, when the SPFs were run, we ended up with routing loops. And the big lesson for us was that routing was extremely valuable in diagnosing performance problems in the network. Because, you know, the jitter wasn't caused by queuing, it was caused by routing and so forth, and we decided to say, hey, we need to pay attention to routing much more. And 15 years later this is how route analytics looks: we have BGP and NetFlow data, performance data, MPLS and all sorts of stuff. Basically, a big layer on top of it is your analytics, the algorithms. It's important to collect the data, but you must use the analytics to make sense out of the data, and sometimes we decide whether the routers will be able to handle, you know, the rate of change and so forth. Or you may do some stuff like health assessment in the network. And then on the top you use this for presentation: you may be diagnosing an issue or saying, hey, can I afford these new customers? If I have to add this particular customer to the network, where is this ‑‑ this much traffic is going to be pushed, can I do that, and so forth. This is how it looks, and it is very similar to how the new SDN architectures would look: there is an analytics layer, and one of these use cases would instead be orchestration; you use this to set paths in the network as opposed to diagnosing why the paths are the way they are.
A use case for this, a really good diagnosing case, actually happened in the US between two US‑wide major service providers. They peer in six locations, and at one of the routers they were doing maintenance, nothing unexpected. They wanted to remove a line card; this router is connected to this other major network, and then they were actually ‑‑ was to ‑‑ they pulled the card and unfortunately the router crashed. It does happen, it's not like ‑‑ it's a hard problem to solve, and it didn't survive and it did crash.
But what wasn't expected is what happened when they did this. They peer in six locations, so first of all you expect the other five locations not to be impacted; on top of it, you would expect the traffic that was exiting in this location to move to those five places, right? I mean, routing will converge and use those other five locations that I have. But that's not what happened. In all six locations traffic stopped; the entire traffic was black‑holed. It lasted less than three minutes, which is too short for humans to go and look into the routers and figure out what happened. In three minutes the problem was already gone, but during those three minutes it had a major impact for this particular service provider; they were running ‑‑ lost 45 minutes of advertising. It doesn't work for three minutes, you go and get a coffee or something else and you come back, and there was ‑‑ there was a significant loss for them. It was a big deal: they contacted the vendors, they blamed the vendors, they looked at the network, and they couldn't figure it out. Meanwhile, we were actually ‑‑ some of the analytics stuff that we were doing at the time was installed in the network but was a different ‑‑ and with that it shows the power of the analytics.
So this is before the incident, before the line card was withdrawn. This is the topology of that network. Here are the six locations where, according to BGP, you can actually reach the other service provider. The routers here, for example the ‑‑ I suppose this is more like purple ‑‑ the purple routers here exit on this one, the yellow‑greenish routers exit here, actually this is green, and then the yellow ones exit here and these others exit there and so forth. So basically you have six exit locations in BGP, and BGP tells you all six locations. If there is no difference in local preference or AS path, your routers basically decide based on the IGP distance: this one says I can exit here and here, which is closest to me? This one is closest to me, I will exit here. So basically BGP determines the exits, that is the BGP next‑hop attribute, and the IGP tells you which of the next hops is closest to you, and you pick the closest one. This is, in a nutshell, how they interact. In reality, actually, this is recursive: you say, hey BGP, tell me what the next‑hop for YouTube is, and it is one one one, and then you say, hey, routing system, give me a route for one one one. It can be an IGP route, it can be an MPLS tunnel, it can also be BGP. If it is BGP, you again get the next‑hop and say, hey, to go to one one one I must go to two two two, give me a route to two two two. Eventually it will resolve, and then you find your next‑hop like that. So this recursion can go multiple layers; usually it's just BGP, then IGP, then a connected route, something like that, but there are cases where another ‑‑ that is how you figure out that these will exit in this particular location.
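The exit selection described above, where BGP supplies the candidate next hops and the IGP distance picks among them, can be sketched in a few lines of Python. The addresses, the distances and the function name `pick_exit` are assumptions for illustration only, and the sketch deliberately ignores every earlier step of BGP best-path selection.

```python
# Hypothetical IGP distances from one router to each BGP next hop (exit router).
igp_distance = {"10.0.0.1": 30, "10.0.0.2": 12, "10.0.0.3": 45}

def pick_exit(candidate_next_hops, igp_distance):
    """Among BGP next hops that the IGP can reach, pick the closest one."""
    reachable = [nh for nh in candidate_next_hops if nh in igp_distance]
    if not reachable:
        return None  # no next hop resolves, so the BGP routes are unusable
    return min(reachable, key=lambda nh: igp_distance[nh])

print(pick_exit(["10.0.0.1", "10.0.0.2", "10.0.0.3"], igp_distance))
```

Running this per router with that router's own distances reproduces the colouring on the slide: each router is assigned to whichever exit is nearest in the IGP.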
And this is the incident. Basically, this is the router whose line card was removed, and it actually crashed. When a router crashes it cannot say, hey, I am down now, because it is no longer running. So its neighbours detect that these links went down, and in today's networks this is typically sub‑second, very quick, and immediately they will send IS‑IS packets saying: I am this router and I am dropping this neighbour, it's no longer there; this is the other router, I am also dropping him. You can see that between these two routers it is less than a second; actually this is like 73611 versus 73612, but the difference is in milliseconds: they both detect within milliseconds that this particular router has gone down. And then eventually the router reboots and comes back. And if you look at the difference in time, the router was down for basically about, if you take the eleven versus 36, about two minutes and 25 seconds. So this is the window in which you must diagnose this problem if you don't want it to happen ever again. But we are collecting BGP data, RIPE RIS is a good example of that, and other people also collect. If you put it together you can do a network ‑‑ you can say, hey, stop right before this particular event, rewind the topology to right before this time, basically, and take these two down. When you do that, you see a different view. This is the same graph as before, but at this time, here are the original six exit points, and there are six other exit points. And who are these routers? Actually, these are core routers. And they do not have any eBGP on them, yet they claim: I can exit to this particular service provider. As a matter of fact, all of these routers go to this router, all of the yellow ones go to this one, the green here ‑‑ I think the colour is a bit hard to differentiate, but on my slides it is different ‑‑ instead of going here, they go here, and so forth.
I just looked at the path that goes to this particular place. This was the path before; this is one of these routers that was exiting at this router, and this is before the failure. And this is after: it goes and stops here. This link is down, but that's not the reason it stopped; it goes and stops here thinking, I can exit the network at this router. And this is the detail of that. This is the path before: 3 hops, one, two and then three, and it's following the BGP route. Then BGP has a next‑hop to resolve, it uses an IS‑IS route, and this is the BGP next hop's route in the IGP. Basically, I need to go to this destination, so go to this exit router, and this is the IGP route advertised by this router. But we know from the IGP that this router is dead. So afterwards, when you repeat this, it has actually two hops instead of three, one and two. It's still following the same destination, but now, if you pay attention, the recursion finds the route in BGP, and where this was a /32, this is a /19 ‑‑ 128 /19.
What is this? Why do we have this BGP route? So basically, this router has crashed and, as we saw in the IGP event, in this network the IGP converged fast, so everybody knows this /32 is no longer reachable. But BGP is not that fast; BGP takes ‑‑ to withdraw it all. This is iBGP, so when you have a session, if you have nothing to tell, you send keepalives, and if you miss three of them your routes will be taken away. That is about three minutes with the timers of the routers at the time: one minute each, that is three minutes. Within those three minutes, before BGP was able to withdraw anything, the router had actually rebooted, so there is no need to withdraw anything; it is re‑sending the announcements and so forth. So there is this three‑minute window where the BGP routes for this dead router are in the system, but the IGP route to get to it is not. Yet in the network there is another route, a /19 in BGP, that can resolve it. Meaning these BGP routes are still valid, because their next‑hop is resolvable. Why is it resolvable? Basically, you are a service provider, you have an address block, multiple address blocks, and you announce them in routing, in your BGP, to other people, because if you don't, nobody can get to you; you must announce your address blocks to the Internet. How do you do that? You announce it in BGP. The sole purpose of this /19 is to announce the address space of the service provider to the world. It has nothing to do with the internal routing. But it is in the routing table of these routers. As a result, it will actually resolve this next hop and you will have this particular problem. As a matter of fact, even though this has been widely published and publicised, I still see this from time to time in quite a few networks.
So, basically, what ‑‑ what this illustrates, the key point here, is that we are very good at designing networks when everything is up. When things fail, the number of combinations of failure modes is so huge that a human cannot really track it; we need software, and I think that is the biggest promise of SDN: you can do a lot of tests and failure scenarios in software, as opposed to us trying to figure out what happens if I fail this one. You wouldn't even think about that. You would think, OK, this BGP route is for this router, here is its loopback, I will get to it. That is what I would do. Luckily, with a tool like this you are able to single‑step and look at what has happened. The fix for this problem is also published, actually this was fixed ‑‑ it is a contribution at NANOG. What you need to do is to poison this BGP route in your IGP: you introduce the same route but give it a very high metric, so that other routes will be preferred before you prefer this dead router's route. You introduce the same /19 in your IGP, with a high metric, and it will never be used. And even though the problem arises because BGP didn't converge as fast as the IGP, you don't want to make iBGP convergence faster; if you did, any time you have a transient loop here and there you would shut down your BGP sessions, and you don't want to be going that way. Some people suggested that, but it's not a good idea.
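The failure and the fix can both be reproduced with a small recursive resolver. The Python sketch below is one reading of the mechanism, not the tool used in the talk: the addresses, metrics and function names are invented, and it assumes that an IGP route for a prefix is preferred over an iBGP route for the same prefix, so that the poisoned /19 replaces the BGP /19 in the lookup table. Each candidate exit is resolved by longest-prefix match, recursing through BGP routes until an IGP route supplies a metric, and the exit with the lowest metric wins.

```python
import ipaddress

def longest_match(rib, address):
    """Return the most specific route in rib that covers address, or None."""
    addr = ipaddress.ip_address(address)
    matching = [p for p in rib if addr in ipaddress.ip_network(p)]
    if not matching:
        return None
    return rib[max(matching, key=lambda p: ipaddress.ip_network(p).prefixlen)]

def igp_cost_to(rib, next_hop, depth=4):
    """Resolve a BGP next hop recursively; return the IGP metric finally used."""
    for _ in range(depth):
        route = longest_match(rib, next_hop)
        if route is None:
            return None
        if route["proto"] == "igp":
            return route["metric"]
        next_hop = route["next_hop"]  # BGP route: resolve its next hop in turn
    return None

def best_exit(rib, exits):
    costs = {e: igp_cost_to(rib, e) for e in exits}
    usable = {e: c for e, c in costs.items() if c is not None}
    return min(usable, key=usable.get) if usable else None

# Hypothetical routing table of one router (all values are illustrative).
normal = {
    "10.0.0.1/32": {"proto": "igp", "metric": 10},           # nearby exit loopback
    "10.0.0.2/32": {"proto": "igp", "metric": 50},           # remote exit loopback
    "10.0.0.9/32": {"proto": "igp", "metric": 5},            # core router loopback
    "10.0.0.0/19": {"proto": "bgp", "next_hop": "10.0.0.9"}, # provider aggregate
}
# Exit 10.0.0.1 crashes: its /32 leaves the IGP, its BGP routes are still present.
crashed = {p: r for p, r in normal.items() if p != "10.0.0.1/32"}
# Fix: the same /19 is also carried in the IGP with a very high metric.
poisoned = dict(crashed)
poisoned["10.0.0.0/19"] = {"proto": "igp", "metric": 1_000_000}

exits = ["10.0.0.1", "10.0.0.2"]
for name, rib in [("normal", normal), ("crashed", crashed), ("poisoned", poisoned)]:
    print(name, best_exit(rib, exits))
```

In the crashed table the dead exit still resolves, through the aggregate and the core router's loopback, and even looks closer than the live exit. With the high-metric /19 in place the dead exit still resolves but loses to the live one on metric.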
The second part of my talk, for which I have five minutes or so left, is basically how we use these analytics in the SDN context. We have at the bottom physical as well as, now, virtual devices: routers, switches, VNFs and so forth. You can talk to these with protocols; to me this part of SDN is boring. OpenFlow or ForCES or NETCONF, I really don't care, and I think there will be different protocols with ‑‑ as a matter of fact, you will have multiple controllers, each one speaking a different kind of protocol. Then this controller provides northbound APIs where we write our applications; applications like the one that diagnosed that problem, that can assess the health of the network and can tell you what will happen under failures, you can write. I think the revolution is here: basically, we can now make networks programmable and make sure our routing doesn't suffer from problems like that. However, if the software is making changes to your network, how do we know that the software is making the right change? What governs that these are any good? Just to give an example of that: today, if you are a service provider and you were to sign up a major customer, your planning group will say, this is a big customer and it has this much data from this location to this particular place, do I have the capacity, and is there any ‑‑ SLA, do I need to configure queues for them? There is a human being that gets involved, right? I mean, it's not fast, it's not at software speed; it's slow, but we do assess the network before we do something like that. Or you may be an enterprise deploying video‑conferencing; people do go and say, I must configure queues, and you go to the routers and set the class of service and so forth. But now that it is being done in software, we need analytics doing exactly the same thing, so we need to implement this logic in software. You cannot just say, push this much data from this location to that, because if you did that, you may actually impact somebody else's data.
I think the rest of the talk I will just breeze through; I was trying to give an example with the bandwidth scheduling application. This is a very exciting application for service providers. What it says is: there is an app that you can tell, hey, I want to send this much data from this time to that time, from this location to this other location, X megabits per second. For example, there are the soccer games in Brazil, the World Cup, and you want to be sending HD video of that to Holland, the Netherlands, and here you are going to send it out as an IPTV stream, but the data comes uncompressed and gets compressed based on what you have here, and you can run this over this network. It's very attractive because usually service providers have an abundance of ‑‑ in their ‑‑ they were getting into a little fight, telling each other, oh, my network is not congested, your network is congested, and so forth. And in the process they revealed that at Verizon the average peak is 36%. 54% unused. And Level 3 is about the same. So basically there is an abundance of bandwidth; if you can sell this extra you will actually make money. And also Google has done this: they run their networks almost all the way to 100 percent, and this creates envy among service providers: they can run at 100 percent, why can't I run my network at 100 percent as well? There is a good reason why we have this spare bandwidth in service provider networks. It's not like Google's network, where Google is the only customer of that data. If you want to implement this application, we could base it on link utilisations; I haven't seen a network which doesn't have this at all. So basically you know what the utilisation is. To do this application you need to predict what it will be when the soccer game is on; that is not too hard, you have the baseline and history and you can project it and figure out what the traffic levels will be tonight. And then you compute the path from Brazil to the Netherlands and say, along this path, do I have the bandwidth? If you have, go ahead and deploy this one.
If you don't, then basically, sorry, I cannot sell you this particular service. Data flows, and when the time is up, you remove this path. That is when the controller comes into the picture. This would be a naive implementation because it misses the point that there is a reason for this spare bandwidth. First, this is the typical utilisation versus delay curve. It has a sharp knee, usually around 65%, and if you pass this point you introduce queuing delays. So basically, if you want to increase the utilisation beyond this point you must have bulk traffic, where it is OK if it is delayed, if it is slow, like back‑up traffic; so beyond this point you can have bulk. But it's easy to fix this problem: it is a go/no‑go decision at 65%. We have the spare bandwidth mainly for handling failures: in a service provider network you have SLAs, so if a link fails in your network you must have capacity, you must have a place to put that traffic somewhere else. And that is why the networks are typically run at less than 50% utilisation; that guarantees you there is another path. But, you know, you can go to 65% and still be OK, but you need to do this by doing simulation of failures: you fail every single link and check whether you have sufficient capacity or not. This simulation, which I don't have much time to explain, basically needs the traffic matrix, and you can use NetFlow for that: you can create a traffic matrix from each router to each other router, with how much traffic is passing. And then basically you can say, hey, if I fail this link, what will the new path be? You can subtract the bandwidth from this link and add it to this link and say: tonight at 9:00 when the game is playing, not just under normal conditions but even under failure conditions, do I have this much bandwidth in my network to actually send this traffic? You can actually make that judgement. And then only if that is the case can you say, hey, go ahead, provision this particular application.
Otherwise you cannot promise that particular ‑‑ you don't want, in the middle of the World Cup championship, the stream to be interrupted or to be impacting somebody else's.
So route analytics basically ‑‑ has been great for analysing and assessing networks, troubleshooting and so forth, and with SDN the same information provides us the necessary means to do what the human being does today: run failure simulations, assess the network and so forth, and it can be coupled with orchestration to set it up for us.
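The admission decision described above, checking the request against predicted load both in the normal case and with each single link failed, can be sketched as follows. Everything here is an assumption made for illustration: the three-router triangle, the capacities, the traffic matrix, fewest-hop routing standing in for the IGP, and the choice to apply the same 65% ceiling in failure cases as in the normal case.

```python
from collections import deque

# Hypothetical network: undirected links (sorted router pairs) -> capacity in Mbit/s.
CAPACITY = {("A", "B"): 1000, ("B", "C"): 1000, ("A", "C"): 1000}
# Hypothetical predicted traffic matrix for the requested time window, in Mbit/s.
DEMANDS = {("A", "C"): 300, ("A", "B"): 200}

def shortest_path(links, src, dst):
    """Fewest-hop path as a list of links (BFS), or None if dst is unreachable."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    previous, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    if dst not in previous:
        return None
    path, node = [], dst
    while previous[node] is not None:
        path.append(tuple(sorted((previous[node], node))))
        node = previous[node]
    return path

def fits(capacity, demands, max_util):
    """Route every demand and check that no link exceeds max_util of its capacity."""
    loads = dict.fromkeys(capacity, 0.0)
    for (src, dst), mbps in demands.items():
        path = shortest_path(list(capacity), src, dst)
        if path is None:
            return False  # the demand cannot be routed at all
        for link in path:
            loads[link] += mbps
    return all(loads[link] <= max_util * capacity[link] for link in capacity)

def admit(capacity, demands, request, max_util=0.65):
    """Accept the request only if it fits normally and under every single-link failure."""
    src, dst, mbps = request
    trial = dict(demands)
    trial[(src, dst)] = trial.get((src, dst), 0) + mbps
    if not fits(capacity, trial, max_util):
        return False
    for failed in capacity:
        surviving = {l: c for l, c in capacity.items() if l != failed}
        if not fits(surviving, trial, max_util):
            return False
    return True

print(admit(CAPACITY, DEMANDS, ("A", "C", 100)))
print(admit(CAPACITY, DEMANDS, ("A", "C", 200)))
```

The second request passes the naive check on current utilisation but is rejected, because losing link A to B would push 700 Mbit/s onto a link allowed to carry 650.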
ROB EVANS: Any questions? No.
CENGIZ ALAETTINOGLU: Either too clear or too confusing. Please catch me during the break if you want to talk deeper and I am happy to talk any aspect of this including ‑‑ it was very interesting in its own right.
ROB EVANS: Thanks very much and welcome back to RIPE. So next we have Yasuhiro Ohara who is going to tell us about BGP
YASUHIRO OHARA: Today I would like to introduce my tool to BGP dump file, please use and give me feedback, that is the main purpose of this presentation.
Actually, how can I use this ‑‑ actually, I wrote a blog article on RIPE Labs, so it's almost the same content, but there are some updates even since the publishing of the blog article, which was yesterday, so I would like to show you. One of the updates is like this: this is the heat map of the reachability in the full route table. I only show the reachability, so we cannot recognise how small the routes serving this reachability are; maybe we have one single /8 for this 17 area, or maybe 256 class Bs, we don't know. But the point is that all of this IP space is reachable from this routing table. That is the thing.
And if the whole space is filled with IP routes then it's going to be blue, and if it is close to 0 it's going to be near to red, and completely 0 is depicted by black. So this is kind of like that; the private address space is not shown in the routing table.
So, back to the tool explanation. The motivation is: as a researcher in an ISP, how can we evaluate our IP transit service, specifically the BGP full routing table? You know, if we compare to the other ISPs, then there are many differences, but we cannot know why there is a difference and what the difference will result in. So, you know, we don't know the answer yet, but I thought we were going to need some kind of tool with which we can analyse the BGP RIB file, so I created it. So yes, if we ‑‑ if our routes are good then we are good, and if we ‑‑ if our routes are not good, then we are not good. That is the thing, I think.
So, I don't know how to evaluate a BGP full route table, but at least we can compare with others, and if there is not so much difference we can say we are OK, right? So that is kind of the main purpose of my tool.
So, the summary is that it is something like 4,000 lines of C, and it supports only the route table files, not the BGP update files. And the big difference from the previous tools is that we can provide longest matching ‑‑ we can construct the full BGP routing table inside the command ‑‑ and then we issue a routing lookup, a longest match lookup. That is the benefit of this tool.
This is the configuration of the BGP route collectors. You will know well about it. But the format ‑‑ RIB table format is like this. In the first we have peer table which describes what is the IP address of the peer or the, what is the AS number of the peer. They are indexed in the first table. Then, there are many routes from there to 255 and so on. And in the each routing entry there is peer one has it and with this BGP attribute and peer two doesn't have it. Peer three has it with the BGP attributes. This is the format. And actually my tool is kind of like following these.
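The layout described above, one peer table followed by per-prefix entries that refer back to peers by index, can be modelled with a few Python dataclasses. This is only a sketch of the idea: the class and field names are invented, the AS numbers and prefix come from documentation ranges, and real dump entries carry more attributes than an AS path.

```python
from dataclasses import dataclass, field

@dataclass
class Peer:
    address: str
    asn: int

@dataclass
class RibEntry:
    peer_index: int  # position of the announcing peer in the peer table
    as_path: list    # stands in for the full set of BGP attributes

@dataclass
class RibDump:
    peers: list = field(default_factory=list)
    routes: dict = field(default_factory=dict)  # prefix -> list of RibEntry

    def paths_for(self, prefix):
        """List (peer ASN, AS path) for every peer that carries the prefix."""
        return [(self.peers[e.peer_index].asn, e.as_path)
                for e in self.routes.get(prefix, [])]

dump = RibDump(
    peers=[Peer("192.0.2.1", 64496), Peer("192.0.2.2", 64497), Peer("192.0.2.3", 64498)],
    routes={"198.51.100.0/24": [RibEntry(0, [64496, 64500]),
                                RibEntry(2, [64498, 64499, 64500])]},
)
print(dump.paths_for("198.51.100.0/24"))
```

A peer that does not carry a prefix simply has no entry under it, which is how the example represents "peer two doesn't have it".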
This is the simple display: if you just specify the RIB file without any options then it's going to list it, and for this routing table entry there are many different, you know, AS paths, because the peers are different. And this is the speed. Actually, our tool is a little bit faster than bgpdump; it doesn't make so much difference, I think. But if you are using the 0 dump parser or any other scripting language then it might be helpful to reduce your time.
So, we can have ‑‑ we can display that index table like this, and we can display the routing table, like what is the routing table of NTT; using like minus P 19, this is the NTT routing table. And you can have the statistics, like how many routes there are, what the distribution of the prefix lengths is, and how many next hops there are; typically some of the peers are overriding their next‑hop, so it is only one. And what the number of origin ASes or AS paths is. So we can draw graphs like this: the X axis is the prefix length and the Y axis is the count, so more than half of the routes are /24. We can consult these kinds of things. This is the routing table lookup. If you specify M ‑‑ it shows you the result of the longest matching. And also, if you prepare a file of addresses, then the command is going to resolve them one by one.
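The two operations described above, the prefix length distribution and the longest-match lookup, can be sketched with Python's standard `ipaddress` module. This is not how the speaker's C tool is implemented: the route table, the AS numbers and the function names are invented for illustration, and a linear scan stands in for the trie a real tool would want at full-table scale.

```python
import ipaddress
from collections import Counter

# Hypothetical routing table: prefix -> origin AS (documentation ranges only).
ROUTES = {
    "192.0.2.0/24": 64496,
    "198.51.100.0/24": 64497,
    "198.51.0.0/16": 64498,
    "203.0.113.0/24": 64499,
}

def prefix_length_histogram(routes):
    """Count how many routes exist for each prefix length."""
    return Counter(ipaddress.ip_network(p).prefixlen for p in routes)

def origin_as(routes, address):
    """Longest-prefix match: origin AS of the most specific covering route."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, asn in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, asn)
    return best[1] if best else None

print(dict(prefix_length_histogram(ROUTES)))
for address in ["198.51.100.7", "198.51.7.7", "8.8.8.8"]:
    print(address, origin_as(ROUTES, address))
```

The first address is covered by both the /16 and the /24 and resolves to the origin of the /24, which is the behaviour that distinguishes a routing lookup from a plain text search of the dump.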
And I implemented the diff‑like comparison tool, which I think I cannot explain because the time is short, so, yes, this is the ‑‑ the slides are open, so you can see this in the ‑‑ if you have any questions, then please come to me. This is the kind of difference: this is the comparison with Level 3, and only Level 3 has this route; that is the interpretation of the diff results. And we can make a ranking, so that in the Route Views file there are 39 capable ISPs but only NTT doesn't have this prefix, so this might be kind of a problem for us, and we might drill down into this case. These kinds of things we can provide with this tool. And also, we can make a ranking by the organisation name, so if you know someone in this organisation, please come to us. And this is the route number, and this is the difference between ‑‑ we can specify multiple BGP RIB files, so this is yesterday's and this is today's RIB file, and what is the difference between the distributions of the prefix length? So you might see how the BGP routing table is growing: in this single day /16 is decreasing by 47 and /24 is increasing by 66; that is the interpretation. And I got a user's voice that ‑‑ that was good: he was going to do some analysis of an NTP server and he wanted to resolve the source addresses of those accesses, and it's going to be like 50 million packets, so 50 million source IP addresses, and he would like to resolve each to an origin AS. Other techniques didn't work for him, but my tool did a great job, he said. So he made some statistics of it. Resolving the 50 million IP addresses took only four minutes. So he was very satisfied with that. And yeah, this is just a toy, but we can create the heat map, although I haven't committed it; that was developed yesterday, but I will.
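Both comparisons mentioned above, routes present in only one table and the day-to-day change in the prefix length distribution, reduce to set and counter arithmetic. A minimal Python sketch with invented prefix sets (the function names are also assumptions, not the tool's options):

```python
import ipaddress
from collections import Counter

def length_counts(prefixes):
    return Counter(ipaddress.ip_network(p).prefixlen for p in prefixes)

def distribution_delta(old, new):
    """Change in the number of routes per prefix length, omitting unchanged lengths."""
    before, after = length_counts(old), length_counts(new)
    return {length: after[length] - before[length]
            for length in sorted(set(before) | set(after))
            if after[length] != before[length]}

def only_in(left, right):
    """Prefixes present in left but missing from right (a diff-like view)."""
    return sorted(set(left) - set(right))

# Hypothetical RIB contents for two consecutive days.
yesterday = {"198.51.0.0/16", "192.0.2.0/24"}
today = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}

print(distribution_delta(yesterday, today))
print(only_in(today, yesterday))
```

The same `only_in` call applied to two providers' tables from the same moment gives the "only this provider has this route" view.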
So this is the NTT full routing table, that is the same as earlier in this presentation, and this is Level 3's, so, you know, there is not so much difference, but we can see a little bit of difference, like here and here. Like this. And, you know, the benefit of creating a tool is that you can repeat the same operation on many files, so that we can have many full routing tables; yeah, actually I got bored with the results like that, they are kind of the same. And also we can have a time machine feature: this is NTT this year, and last year, two years back, three years back, and yeah, this is the last, I mean the earliest RIB file that supports the MRT format, so, yeah, if we create an animation of this thing, then it's going to be kind of like, you know, the evolution of the Internet, right.
So, yeah, that is all for my presentation. I am wrapping up. I created a tool and I think we can do more analysis on BGP full‑route routing tables, and I am kind of running out of ideas, so please come to me and give me ideas. I will develop something for you. So thank you, thank you very much.
ROB EVANS: Are there any questions? No. OK. Thank you very much. So next up we have Alexander Azimov, who is going to try and understand the reported differences between latency in v4 and v6, thank you.
ALEXANDER AZIMOV: Good afternoon. I would like to discuss with you the reason for the difference in latency between IPv4 and IPv6. So, from the early adoption of IPv6 it was always compared with IPv4. It was like a younger brother who always tries to catch up with the elder one, and every comparison was made to see whether IPv6 had already reached IPv4 or not. But for the first time in 2010, in a measurement made by Google, there was a very funny result. They found out that latency for some reason in IPv4 became bigger than in IPv6. So, when I first took a look at the results I was very surprised, and I thought that these results were caused by some short‑lived network ‑‑ but two years later, another measurement talk was presented by Geoff Huston, and there was a timescale, and it was shown that IPv4 ‑‑ IPv6 becomes faster, and for a long period. So, there was no opportunity to explain such results by short‑lived ‑‑ and this is the point, so, there is an opportunity to ‑‑ IPv6, but I had a question: how could it be? Because if we imagine that latency ‑‑ that it is some kind of subgraph of IPv4, then the latency of IPv6 should be not less than in IPv4, so here we have an unsolvable conflict in logic. And then ‑‑ but we have measurements. And so I found out that something very funny is happening with the IPv6 graph. And I was eager to find out what.
So, first of all, I collected a lot of data from Route Views, from RIPE sources, from our own collectors, retrieved their BGP paths and retrieved from these the logical relationships, and I have tried to find out ‑‑ to have minute ‑‑ I found no reliable evidence why IPv6 could be faster than IPv4. The density of peerings in IPv6 was a little bit better than in IPv4, but it was not a reliable explanation. I thought maybe we are not seeing all the paths, because even if you are using hundreds or thousands of BGP speakers and retrieving data, you are not seeing all paths, but we can predict possible paths. In BGP, ignoring route ‑‑ there are only five kinds of possible paths. So I made a closure of what kinds of possible paths there could be in IPv4 and in IPv6. And this proved to be the wrong way, because the density of possible paths in IPv4 proved to be even better than in IPv6. So, there was some kind of deadlock. But we found the way out. We decided to compare not the number of paths but to try to understand what these paths look like and maybe compare the difference. And this proved to be the right way. It showed that possible paths in IPv4 differ from those in IPv6 by about 50%. So, when we speak about measurements that compare latency in IPv4 and IPv6, we are speaking not about comparing latency in the Internet for two different categories of protocol but about comparing two different graphs, the IPv4 graph and the IPv6 graph. And this brought us to the opinion that we need to make some global measurement because, of course, with such difference ‑‑ different from one autonomous system to another. But we don't want to make a global measurement for all to compare latency. It is very hard work and maybe we should do it next time, but we decided to calculate some corresponding value. And for such a corresponding value we chose connectivity. Connectivity is of course a simplification.
It is a mean distance from speakers that we retrieved from Route Views, RIPE and so on: the distance between the speakers and all prefixes that are announced by a single system. With this help, we were now able to compare IPv4 and IPv6. We believe that these values should correspond to the latency. And so, first of all, we decided to find out which autonomous systems have increased latency in IPv6 compared with IPv4. And the results were as we had predicted: the autonomous systems that I mention in this table have the most increased value of our connectivity from IPv6 to IPv4. And as we predicted, if the IPv4 ‑‑ IPv6 graph is a subgraph of IPv4, the latency will be bigger. It's normal. And then we decided to find out who benefits more in IPv4 than IPv6 ‑‑ in IPv6 than IPv4. There wasn't a surprise. So no matter what connectivity was in IPv4, in IPv6 these autonomous systems ‑‑ and Hurricane Electric proved to be a very smart choice. So, this is the top five for connectivity in IPv6 and IPv4. You can see that Hurricane Electric exceeds not only all operators in IPv6 but also all Cloud services. And there is more.
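One way to read the connectivity value described here is as the mean AS-path distance from the observing BGP speakers to the prefixes of one origin AS. The sketch below follows that reading. It is an assumption-laden illustration, not the speaker's actual computation: counting distance in AS hops, collapsing prepended ASNs, and the names `hop_count` and `connectivity` are all choices made for the example, and the sample paths are invented.

```python
from collections import defaultdict


def hop_count(as_path):
    """Number of AS hops in a path, counting consecutive repeats (prepending) once."""
    hops = 0
    previous = None
    for asn in as_path:
        if asn != previous:
            hops += 1
        previous = asn
    return hops


def connectivity(observations):
    """Mean hop count per origin AS over all (speaker, prefix, as_path) observations."""
    totals = defaultdict(lambda: [0, 0])  # origin AS -> [sum of hops, number of paths]
    for _speaker, _prefix, as_path in observations:
        origin = as_path[-1]  # the last AS in the path originated the prefix
        totals[origin][0] += hop_count(as_path)
        totals[origin][1] += 1
    return {origin: total / count for origin, (total, count) in totals.items()}


# Hypothetical observations for one origin AS (64510) in each address family.
ipv4 = [
    ("s1", "192.0.2.0/24", [64500, 64501, 64510]),
    ("s2", "192.0.2.0/24", [64502, 64503, 64501, 64510]),
]
ipv6 = [
    ("s1", "2001:db8::/32", [64500, 64510]),
    ("s2", "2001:db8::/32", [64502, 64502, 64510]),
]
```

Under this reading a lower value means the origin is, on average, closer to the speakers, so comparing `connectivity(ipv4)` with `connectivity(ipv6)` for the same AS gives the kind of per-AS comparison described in the talk.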
If we are comparing direct customers, in IPv4 Hurricane Electric only has 13th place, so it is not in the top five. But in IPv6, in the number of direct customers, it exceeds the nearest competitor by two times.
Let us take a closer look at Hurricane Electric. Have or maybe not have relationships with Hurricane Electric in IPv4, have relationships in IPv6. More than that, some of these providers that are believed to be Tier 1 ‑‑ even buy service from IPv6. And I believe that this is a very interesting thing, because we live in very interesting times, when ‑‑ seems to be unchanged for a decade, could be changed, and it's already changing. And I think that this is very promising news.
So, I was very fast, I am sorry. The conclusion. As I said, the IPv4 and IPv6 graphs prove to be totally different systems. And it is incorrect to say that you are comparing IPv4 and IPv6 in the Internet. You are comparing latency in two different Internets. I believe that the reason for such a big difference is the low volume of traffic that is currently in IPv6, and as soon as the traffic volume increases, the difference will become lower and lower. But, at the same time, there is another very interesting and open question: of course, if the difference becomes lower, the difference in policy will also become lower, and the open question is who will be the elder brother and the ‑‑ so thank you for listening. If you have any questions I would be glad to answer them.
ROB EVANS: Have we got any questions?
SPEAKER: I think this room is very quiet. Did you try to find at least one path where both the IPv4 and IPv6 paths happened to be identical in terms of ASes, and could you tell whether in those cases the delay was similar or different?
ALEXANDER AZIMOV: Yes, of course, because if you use a number of speakers you of course see similar paths and different paths.
SPEAKER: Similar is the same delay, that is what I would expect, but is it?
ALEXANDER AZIMOV: The difference was about 50% ‑‑ I am not speaking about the path ‑‑ the paths that could be, the difference is about 50%, so these are totally different graphs.
SPEAKER: Yes, but there is no single path between source and destination which happened to be the same intersection?
ALEXANDER AZIMOV: Yes, of course, but OK, there are multiple paths between source and destination but ‑‑ at every moment there is only one path. So we are able to compare them.
ROB EVANS: Any more? Thanks again, Alexander. We might be finished before coffee. So, next up is Colin Petrie, who works for the NCC, on some changes they are making to the RIS structure.
COLIN PETRIE: I work for the RIPE NCC and I am here to give some updates on the routing information service. As I am sure you know, the RIPE NCC has been running the routing information service for several years collecting BGP data from many Internet Exchange points from BGP peers at these locations around the world. At the moment, we have got 12 active collectors, there is a multi hop one in Amsterdam and eleven other ones around the world. They run Quagga.
And they store the BGP update messages every five minutes, and they do table dumps every eight hours. They provide a looking glass service through RIPE Stat, and this has been running since about 1999 when the first collector went in, and the data is all published for people to do research. It's all archived, so people can look into the state of the routing table over time and that kind of thing.
There haven't been many changes on it visibly for quite a while. The last new collector was added in 2008. One of the reasons for that, though, is that we have been doing a lot of work on the back end that is not quite as visible, and that is the stuff I wanted to talk about today. We have been replacing ‑‑ originally the data went into a MySQL database, and that had some scaling problems, and we could only hold about three months which was queryable through the web interfaces that we provided. We have been working to replace all of that with Hadoop, which is shared with the RIPE Atlas infrastructure as well; this allows us to store more and still be able to query it. We are now able to provide the historical data as well, rather than just three months' worth of data. And that is then served again through RIPE Stat, so there are a lot of widgets now where you can see the routing history and basically zoom into the past and things like that.
As a result of doing this, we are now able to start adding new collectors into the system again. We are in discussions with several people who have approached us about hosting a route collector at their Internet Exchange. We are currently deploying the next collector at CATNIX in Barcelona and are in discussion with some other parties who have approached us. If you are interested in hosting one at your IXP, come and speak to us.
The other thing that we have been doing is working on replacing the collectors themselves. There were a few issues with the current Quagga‑based implementation. One of the troubles with it is that it's single threaded, which is hard to scale up on modern multi‑core CPUs, and it causes problems as the amount of BGP update activity gets bigger: the more BGP routes are present and the more peers we have, the more it can start to get a bit unstable. There are some issues with the fact that the system has to block new incoming updates while doing a table dump, and has to complete that table dump, which gets bigger as the peers and the table grow, before the hold timer expires and all the BGP sessions drop, which is not exactly optimal. And there are also some data inconsistency issues that we have come across with it. So we decided we wanted to look at whether or not we can re‑architect it and replace it with something else.
The new system that we have been looking at, we have been working on it behind the scenes for a while. Last year, we had an intern at the RIPE NCC, Walter, who was developing a prototype of this. We published a couple of articles on RIPE Labs about it, along with a research paper that he did, and those go into more detail and explain the architecture. That was basically a proof of concept, and the result was that, in principle, it worked: it could produce the same MRT data files as the old collectors, but in a slightly more scalable way. So, this is what the new architecture looks like. The main thing was switching to using ExaBGP instead of Quagga. ExaBGP parses the BGP messages and does the peer handling with its neighbour on the IXP; it takes the messages it's receiving, doesn't attempt to make a routing table, and outputs them into a queuing system, which allows us to then have multiple queues and multiple threads of ExaBGP, each talking to a peer. This allows us to handle more peers and scale it a lot better.
On the ‑‑ once the data gets shipped to the RIPE NCC it goes into a queuing cluster, and you can fan the data out to multiple applications that are listening and using the data. These can run at different speeds and they are not tied to each other; it decouples the system so you can scale it a lot better. These are some of the sample applications that we are developing at the moment, mainly to provide the existing functionality, because we want to do everything we could do with Quagga anyway. So we have something that consumes the data and writes out the BGP update messages in a time series every five minutes, just like Quagga did. We have something that takes the messages and puts them through a state machine to produce a BGP table; every eight hours it writes out a table file, but if that takes a long time it doesn't matter, because it stops consuming from the queue, does its job and starts consuming again afterwards.
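The state machine that turns a stream of updates into a table can be sketched roughly as below. This is an illustration, not the RIPE NCC's application: the class name `RibBuilder` and the message layout (`peer`, `announce`, `withdraw`, `as_path` keys) are hypothetical and do not reflect the actual ExaBGP or queue message format.

```python
class RibBuilder:
    """Keeps the latest route per (peer, prefix) from a stream of updates."""

    def __init__(self):
        self.rib = {}  # (peer, prefix) -> AS path

    def apply(self, message):
        """Apply one decoded update; the dict layout here is hypothetical."""
        peer = message["peer"]
        # Withdrawals remove the peer's route for each listed prefix, if present.
        for prefix in message.get("withdraw", []):
            self.rib.pop((peer, prefix), None)
        # Announcements add or replace the peer's route for each listed prefix.
        for prefix in message.get("announce", []):
            self.rib[(peer, prefix)] = list(message["as_path"])

    def dump(self):
        """Return a point-in-time copy of the table, like a periodic table dump."""
        return dict(self.rib)
```

Because the consumer only reads from a queue, it can pause in `dump()` for as long as the write takes and then resume applying the queued messages, which is the decoupling the talk describes; the BGP sessions themselves are not affected.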
Also, we wanted to preserve the looking glass functionality. That is the same thing: it goes into a state machine and just holds the RIB in memory, and you can then query it, which is basically all a looking glass is. We are then able to put the data straight into our Hadoop‑based back ends as well, so the data is a lot fresher. And we are also looking at a stream‑consumer system, which is similar: if any of you saw the talk in the plenary on Monday about the RIPE Atlas streaming service, it's basically the same thing. It's a web socket interface where you can provide filters to request a subset of the BGP data, or maybe all of it if you are interested in that, and have it streamed to you live.
This is a diagram of how that works, stolen from the presentation that my colleague Massimo did on Monday. Ignoring the part where it talks about Atlas controllers and probes at the top, the data has the same structure.
So I am going to quickly demo this, and hopefully it's actually going to work. So that was a quick bit of Perl that just connects to a RabbitMQ cluster, sits in a loop consuming messages from a queue, in JSON, and decodes them and dumps them out. This is it currently running. We have got a test version of this, and this is just BGP messages coming from ExaBGP in a queue. If I stop it there, you can see there is a bunch of withdrawn messages from a neighbour, there are some more withdraws, some people sending us more specifics for no particular reason. There are some announces and some /24s with an AS path. It's just a decoded version of your BGP data in a stream, for you to then process in an application.
So that is something we are working on at the moment. There were another couple of small things that we were doing. One of them was a new release of the BGP dump library, which fixed a couple of bugs that we had in the old version, and also introduced a new feature, which was to be able to decode messages that came from earlier versions of the RIS collectors in our archive. The previous version didn't actually support the data that we used to produce before 2004, so we added support for that and that now works. What we are looking at doing now is importing that data into our back end system so that RIPE Stat can query it, and that should give an even larger history, historical information in RIPE Stat if you look at the routing history widgets and things like that.
There was another thing that we were looking at. It's not finalised or anything, it was just an idea: there is a specification coming out of the IETF at the moment for additional path support. This is designed to permit a BGP peer to send to another BGP peer not just its best path but multiple paths within its RIB. Let's say a router has ten possible routes to a prefix; normally it can only send the best one to its neighbour, and this allows it to send all ten, with a path discriminator on each. The advantage with this is, let's say, with a peer who is sending us a full table, we only get their best path; this way we get to see all the paths in their network, and get much better visibility into, let's say, large transit networks and things like that. It does have a bit of an issue in that it means even more data for us to store and process, and one of the questions we actually had was whether you think this is ‑‑ if we could have some feedback as to whether that would be useful. At the moment the current RIS system has something like 100 full table peers in it, so there are already 100 different paths for any given prefix in there, so if we add more collectors and this, you could be talking about 10,000 different paths for the same prefix that you can see. Is that actually useful to anybody? That is something we would quite like some feedback on. If we do do that, we need to look at updating the MRT dump format specification, because it doesn't have a way of representing and storing a prefix arriving multiple times from the same peer. So we need to try and extend that and submit a draft to update the specification.
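The storage consequence of additional paths can be shown with a small sketch: once a peer may announce the same prefix several times, the table key has to include a per-path identifier, otherwise each new announcement overwrites the previous one. The function name `store`, the key layout and the sample paths are hypothetical and only illustrate the idea; they are not the MRT format or any RIS code.

```python
def store(rib, peer, prefix, as_path, path_id=None):
    """Store a route; with no path_id, a newer path from the same peer replaces the old one."""
    rib[(peer, prefix, path_id)] = as_path


# Hypothetical announcements from one peer for one prefix.
paths = [[64500, 64510], [64501, 64502, 64510], [64503, 64510]]

# Classic behaviour: one path per (peer, prefix), so only the last announcement survives.
without_add_path = {}
for path in paths:
    store(without_add_path, "peer1", "192.0.2.0/24", path)

# With a path identifier in the key, all announced paths are kept side by side.
with_add_path = {}
for path_id, path in enumerate(paths, start=1):
    store(with_add_path, "peer1", "192.0.2.0/24", path, path_id)
```

The same reasoning applies to the dump format: a record layout that identifies a route only by peer and prefix has no place to put the second and third entries, which is the gap in the MRT specification mentioned above.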
So that is what we have been working on at the moment. I would be interested to know if you have any feedback or any questions; you can ask now at the mic or you can grab me later on at any point and have a chat about any of this.
GERT DORING: User of RIS, and I like what you are doing, because the old system had the issue that if you have very short‑lived announcements, typically by somebody hijacking somebody else's prefix, RIPE Stat couldn't see it; if it's not in a dump it's not there. And I read this as: the new system will show it, and this is very welcome.
JOAO DAMAS: When you are talking about the back end applications like you are trying to replicate the looking glass functionality in the new system you talk about BGP state machine or routing state machine. Is this something you developed yourself or injecting into ‑‑
COLIN PETRIE: At the moment it's a Python application that ‑‑ it came from the prototype that our colleague Walter wrote. We are currently working on it and trying to get it into a production state, but yes, it's something that we wrote internally.
JOAO DAMAS: Is that code you would be able to share with the public at large?
COLIN PETRIE: We don't have an objection in principle to doing so, it needs to be cleaned up a bit. But whether or not ‑‑ I am not sure whether or not the specific input formats that it would take is something that applies to everybody but certainly I don't believe there is an objection to sharing it.
MARTIN LEVY: CloudFlare. Previous heavy user of your data, thank you very much. Can I project you forward and ask you for some commitment: how far will you take your rewritten code, such that how many collectors could you run in the field, how many would RIPE actually commit to running in the field, and would that be Europe and the RIPE region specifically or globally? This is great data, but the more we have, and I really mean the more we have, the more useful it is. So, we have been two, three years without an additional peer being added to the system, and I don't want this to be a restart that then stops again; I want to give you sort of the permission to go forth and multiply.
COLIN PETRIE: So indeed we haven't been able to add new peers to some of the collectors for quite a while, and we have been getting a lot of feedback that people want to be able to peer with us and things like that, so that is one of the reasons why we are doing this. Also, to add the new collectors in as well. At the moment, I believe we are currently talking to all the other RIRs to sort of have agreement in principle that they will help us to put collectors in, at least in every region, and assist with that. So certainly at the moment, strategically, it's not going to expand at a ridiculous pace, it's going to be sort of gradual, but yes, we are hoping to be able to continue to add if the system works out well.
MARTIN LEVY: If I may just quickly respond. No, as a user I would like you to go at a ridiculous pace, being conservative at this point is less interesting.
COLIN PETRIE: So from that perspective, the reason why we were going to take it slowly initially is mainly that, although we were trying to remove all the bottlenecks in the system, we still have to keep up with the incoming data, and we have to be able to scale up the back ends.
MARTIN LEVY: Us networks are making that harder, I understand.
COLIN PETRIE: As we add more peers and more collectors as well, it just goes up. So there is also sort of a budgetary constraint as to how much hardware we throw at this in order to try and keep up with the desire to have a viewpoint of everywhere, so ‑‑ I am not entirely sure what the correct way to answer the question is there, although I think my colleague Romeo wants to say something.
ROMEO ZWART: So, in that respect, maybe I am able to say something more about the specific wording that you chose, whether or not we would be able to commit things in this phase. So the direct answer to that is, no. And some of the reasons that Colin already pointed out are, I think abundantly clear, there are too many degrees of freedom, basically, in what we are developing in the early phase.
Having said that, I am really happy with you expressing this request or your guidance so clearly, because it helps us to formulate our own goals more clearly and help with that to direct senior management and board etc., going to the direction that the community wishes. Thank you for that comment and we will definitely take it on board.
MARTIN LEVY: Thank you both for the response. I wonder if I may suggest you ask for a show of hands or a hum as to those who would be interested in you moving ahead aggressively in this, therefore giving you some potential budget impetus to your management.
ROMEO ZWART: I think the question has been asked. Can we see a show of hands of people who think this is a valuable way of spending resources for the NCC? OK. Thank you very much.
SPEAKER: I wanted to say, the add‑path stuff you mentioned at the end, I am very interested in that. Are you going to allow more people on the multi hop or add another one for more capacity there as well?
COLIN PETRIE: That is the hope, yes. We have had a lot of people wanting to use the multi hop service as well, and it has the same problem, it's got too much at the moment, so hopefully this will free it up to add more to that box or add more multi hop boxes.
THOMAS KING: I think it's really cool what you do, thank you for that, and I want to express my support for the add‑path stuff, it's also very important.
COLIN PETRIE: Thank you very much.
ROB EVANS: Thank you very much, Colin.
So, it's now the end of the agenda. Do we have any other business? If not, I am just going to do a quick update on the BoF we had on Tuesday evening. Some of you may remember that at various points in the group's history the topic of route object authorisation has come up, with the consensus being that it's too complicated. This is now being unfolded into another topic, which is the fact that the RIPE IRR has relatively good quality data for the address space, but also has a bunch of other data which is not authoritative, and which can be ‑‑ so there is, I think, inconsistent quality in it, and there is a desire to try and solve that. At the BoF we discussed a few different approaches to this. The first one is having cross‑registry authorisation, to authorise the hierarchy. The other one was to use RPKI signatures, that is, not generating the route objects from the RPKI but using the RPKI as a mechanism for creating the route objects. Then there was using RDAP and finally, and possibly simplest, just dropping the requirement for the aut‑num holder to authorise the creation of a route object, so you only have the authorisation from the inetnum. So there was quite a lot of discussion around that, no real conclusions apart from the fact we should probably look at those and variants on them. The discussion is mainly going to take place in the Database Working Group. If you are interested, join in on that. If you have questions or comments please feel free to make them.
RUEDIGER VOLK: Sorry to annoy again. The question of how to distinguish good or somewhat less reliable information in the RIPE database wasn't raised as such on Tuesday. For that, actually, one could consider something like: when the authorisation processes take place, one could generate a little bit of metadata that says this actually has been authorised, maybe authorised by a certain credential, and make that metadata available and not mess further with the actual data model. And the idea of, well, OK, marking up metadata that the user cannot change as a database user, well, probably already is around, though not implemented.
ROB EVANS: Thanks. So if there are no other comments, questions, or business, I shall declare us done and say see you in Bucharest. Thank you.
LIVE CAPTIONING BY AOIFE DOWNES RPR
DOYLE COURT REPORTERS LTD, DUBLIN IRELAND.