Wednesday, July 27, 2011

World Cup 2010 - How did your network do?


Someone asked me yesterday about another old article I wrote on the network problems seen during the 2010 World Cup.  They are rightly worried about the performance problems we should expect during next year's London Olympic Games.
I actually wrote two articles - one before the event and one afterwards.  The one before the event is currently available online:
http://www.info4security.com/story.asp?storycode=4123557 
... the one after the event was over is reprinted below.

How well did your network play during the World Cup?

As predicted, the 2010 FIFA World Cup was the biggest global event in web history.  More people watched the matches, more tweets were posted and more pages were viewed and updated than for any previous event.  Of course, this huge spike in traffic caused a number of business problems, reminding us of the inexorable rise in traffic, the blurring between work and personal usage and, most importantly, that IT managers need to plan for the next “big thing” to ensure that business can continue during popular events.

Akamai claimed a peak of 1.6 million simultaneous streams, many in HD, and many broadcasters around the world delivered twice their previous peak traffic. In the UK alone, the BBC delivered 800,000 streams during the England vs. Slovenia match and total UK Internet usage increased by over 30%.  People watched on their PCs, through their mobile phones and iPads, at home, while travelling and in the workplace.  The BBC statistics for June 2010 showed 9.7 million requests for live simulcast content that month, an increase of 26% over the previous month and around a 500% increase on a year ago [source: BBC iStats].  This reminds us of the growing expectation that live TV can be watched online and that, if something is considered newsworthy, users will do so, even during the working day.

As with car traffic to popular events, the problems were in the last few miles.  It wasn’t the main Internet highways that suffered; the physical bottlenecks along country roads and into car parks were replaced by the final connection into the business or, sometimes, the initial connection from the broadcaster.  Some businesses found that their own Internet connections and links to branch or remote offices over the Wide Area Network (WAN) were overloaded by World Cup traffic, meaning that other data couldn’t be sent or received and business-critical applications came to a halt.

During the initial matches, a number of broadcasters had underestimated the demand: their servers failed to keep up with the quantity of requests, leading to outages and poor-quality video, and Twitter and blogs were full of complaints.  It’s clear that a significant part of the demand came during business hours, when employees were presumably at work. By the end of the tournament most of the complaints had subsided, or perhaps viewers went back to watching on their TVs – though the later kick-off times probably helped the broadcasters too.

Any popular event will entice the scammers and malware writers out from their dark corners, and the World Cup was no exception. Eight out of the top ten spam messages during June were related to the World Cup, including countless phishing ploys, and there were many web pages trying to entice people to pay for promised online coverage (often in HD). Scam news articles promised behind-the-scenes footage and led to the familiar “update your codec here” ploy, an attempt to surreptitiously download malware.

The main issue was clearly the impact on business networks – there were stories of network traffic failures, followed by hasty emails asking everyone in an organisation to stop watching the World Cup, and of people saying “if you can’t beat them, join them” as they downed tools for an hour and a half during a particular match.

During just a 30-minute period I picked up the following tweets, some about business traffic and some just complaining that watching a match was impossible:

AgentOwen: Our internet usage at the office has gone way up during World Cup.  We just got a spanking as it’s slowed down our network.
Dave: Received email from IT “Don’t watch the World Cup – we’ve got business to do” – fine, I’ll go down the pub.
Speedvegan: Note to self: Do not schedule any releases while the US is playing in the World Cup. Network slows to a halt.
Flokemon: Ingerland playing, USA playing, corporate network slows down to a crawl, can I get a stream working eeeekk
Epheramaldog: Ha! The entire wi-fi network went down at precisely 3pm, funny that!
Monchote: On no!! Too many people streaming England’s game at work and the company’s network is about to go down.
PixelMagazine: everyeone in the office watching the match on their PC tends to slow the network down – give up calling tech support, it ain’t gonna happen
Mrmahoosive: If the BBC site buffers one more time I will take down our core network so only I can stream!
TvMiller: Americans turn to iPhone for understanding “offsides” call during World Cup match, AT&T network down leads to “bad call” row.

It’s no surprise that different matches caused problems in different places around the world depending on the teams playing at the time.  The countries reporting the highest difficulties usually had the following characteristics:

  • The country was actually in the tournament and playing at the time
  • The local rights-holder delivered the matches online
  • Football is a sport with a large following
  • The match was taking place during the typical working day
  • Typical bandwidth at work of less than 1Mbps per user
  • Online broadcasting is popular, promoted by the local TV broadcaster
  • Employers didn’t bring in large screens and encourage shared viewing

Various broadcast qualities were available, typically taking between 800Kbps and 3.5Mbps (HD) per screen.  Of course, in a business, this fights with normal business traffic, hence the negative impact on other applications and the poor-quality streaming experienced by many users who did attempt to watch.  As most organisations do not have the means to minimise the effect of live video streams, such as stream splitting, each new user accessing video adds another stream with the same network demands.  In large organisations, where Internet traffic is commonly backhauled to and from the data centre across the WAN, remote offices often suffered poor access to the Internet and, worse, even to centralised internal applications.
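To put those numbers in context, here is a quick back-of-the-envelope calculation (a sketch in Python; the office link sizes are assumptions, the bitrates are those quoted above) showing how few simultaneous viewers it takes before business traffic starts to suffer:

# Back-of-the-envelope check: how many live streams fit alongside business
# traffic on a typical office link?  Link sizes are illustrative only.

def max_streams(link_mbps, stream_mbps, business_reserve=0.5):
    """Streams that fit once a share of the link is kept back for business traffic."""
    usable = link_mbps * (1 - business_reserve)
    return int(usable // stream_mbps)

for link in (2, 10, 100):                                # assumed office link sizes in Mbps
    for quality, rate in (("SD", 0.8), ("HD", 3.5)):     # bitrates quoted in the article
        print(f"{link:>4} Mbps link, {quality} at {rate} Mbps: "
              f"{max_streams(link, rate)} concurrent viewers before business traffic suffers")

Even a single HD stream is more than a 2Mbps office connection can carry, before any business traffic at all.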

It wasn’t just internal business networks that buckled under the weight; public wireless networks were also under strain, causing problems for travelling users.  Mobile data usage in the USA increased by 24% and post-match mornings saw a 32% increase in YouTube traffic. This email from a colleague sums it up:

Due to recent bad weather I spent more time than I cared for at several large airports (Chicago, Washington DC).  In the past few days, the wireless service was bogged down by users watching the World Cup.  If anything exciting happened in the match, you could hear the shouts and fan reaction through the terminal as well.   Many times I had no signal, or very limited web access to email with virtually no ability to surf the web…the impact was real and noticeable.

Perhaps, for you, the World Cup didn’t impact your network.  However, it holds lessons for us all – are you ready for the next explosion in demand for popular content? These demands are happening with increasing frequency – take, for example, Tiger Woods’ press conference or President Obama’s State of the Union address in the USA. Local news or political announcements often create significant peaks too.  Some content can be predicted, but a global live newsflash can appear at any time.

IT managers need to look at their own statistics for the World Cup and plan for the next flood of content: look at each office and every country, watch the growing popularity of streaming, and put in place the solutions to ensure business traffic can still be delivered during popular “stream-storms.”

There are a number of different approaches that IT managers can take. If the next set of content demands can be predicted, some organisations may take a strict approach by attempting to block web access to all known sites that stream the content. For example, using web filtering systems, IT management can block access to global sports sites, though users are likely to be unhappy and may still spend time attempting to circumvent the blocking.

A second option would be to block the protocols used for streaming; however, this may include all Real, Microsoft and Flash streams – and in doing so block internal streams, streaming news and standard parts of websites, interfering with work-related web information.  This approach also will not work with on-demand video clips, which are generally delivered as an integral part of ordinary web traffic.

Instead of either of the above approaches, organisations should look to adopt a more flexible attitude.  IT management can improve their network infrastructure to reduce stream usage through real-time stream-splitting, optimise streaming data or allow users to time-shift the content to normal breaks in the working day, as follows.

Firstly, bandwidth management devices at the Internet egress point can be set to define one stream provider as “approved” and give it a high priority (management then encourage employees to use that stream), while other streams are given lower priority or blocked.

Secondly, appliances can be installed within the organisation’s network to split the streams – meaning that a single externally-fetched stream can be delivered to multiple users simultaneously.  This greatly reduces the upstream bandwidth required.

Thirdly, WAN optimisation appliances that support stream splitting can be deployed between offices and at Internet connection points to take a single stream and divide it as needed to serve user demand.

Fourthly, many of the stream splitting appliances can also cache the streams, allowing users to time-shift and watch the game later.
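To make the splitting and caching ideas concrete, here is a minimal sketch in Python of what a stream splitter does: a single upstream fetch, fanned out to every local viewer.  The simulated chunk source and viewer threads are illustrative stand-ins, not any vendor's implementation.

# Minimal sketch of stream splitting: fetch the broadcast once, then fan each
# chunk out to every local viewer.  This toy version just relays byte chunks
# from a single "upstream" iterator to N subscriber queues.

import queue
import threading

def upstream_chunks():
    """Stand-in for the single external stream (assumption: chunked bytes)."""
    for i in range(5):
        yield f"video-chunk-{i}".encode()

def splitter(source, subscribers):
    """One upstream read, many downstream writes."""
    for chunk in source:
        for q in subscribers:
            q.put(chunk)
    for q in subscribers:
        q.put(None)                      # end-of-stream marker

def viewer(name, q):
    while (chunk := q.get()) is not None:
        print(f"{name} received {chunk.decode()}")

subscribers = [queue.Queue() for _ in range(3)]          # three local viewers, one upstream fetch
threads = [threading.Thread(target=viewer, args=(f"viewer-{n}", q))
           for n, q in enumerate(subscribers)]
for t in threads:
    t.start()
splitter(upstream_chunks(), subscribers)
for t in threads:
    t.join()

A real appliance does the same fan-out at the network edge for whichever streaming protocols it supports, so the external request is made only once however many employees watch.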

Happily, this doesn’t mean installing four different kinds of network appliance, as some devices can deliver multiple benefits in one box.

In this way, management can allow (or even encourage) video content whilst minimising the load on the Internet gateway or branch office, by caching locally through a proxy appliance and splitting a single video stream into as many as are needed to meet demand.

There are further benefits to installing solutions that optimise streaming content.  As streaming is embedded in so many business sites, the general load on the network will fall and the quality of delivered web content will improve.  Internal streams (such as CIO broadcasts) are optimised in the same manner as external streaming, and web video-conferencing between users and customers can be enhanced.

Tuesday, July 26, 2011

The speed of light is too slow


Here's an article I wrote in 2006, referring to another article I wrote in 1998. I thought it was worth posting as the issues haven't gone away.

The speed of light is too slow, again!


In 1998, I wrote an article stating that the speed of light was too slow and until we fixed it, users would receive poor web performance due to the inefficiencies of the Internet protocols.  Some people said “greater bandwidth will solve the issue” and promptly forgot about it.

Well, here we are, eight years later.  We still haven’t increased the speed of light, available WAN bandwidth has grown many times over and yet those of us remote from the data we need are still waiting for information; if anything the situation has got worse.

More users than ever are working remotely from corporate data; recent research from Nemertes Research states that “fewer than 10% of workers work at headquarters in the average company”[1]. At the same time, IT departments are consolidating servers to ease the management burden and comply with backup regulations; as an example, Hewlett Packard announced it is cutting back from 85 worldwide data centers to 6[2].

The last eight years have also changed the way that applications are delivered to users: web-based applications are now the norm (often using SSL for encryption), streaming data is used for training and a wealth of rich content is distributed around the typical organization.  Web-based applications consume at least ten times more bandwidth than traditional client-server applications.

Greater bandwidth is not equal to faster throughput


There’s no doubt that adding bandwidth helps delivery of data up to a point and the more users at the remote office, the greater the benefit from adding more bandwidth. 

A simple analogy is a 65-mile length of freeway with a speed limit of 65MPH; when it is empty, a single car can drive the distance in one hour.  If development plans show that the freeway will be used by twice as many cars in eight years, then doubling the number of lanes will provide enough width for the new traffic.  But what of the individual sitting in his or her car – does that car get there any quicker?  The speed limit is still 65MPH, so even though we have doubled the number of lanes, each individual car still takes the same hour to drive the distance.

To take this analogy further, if a car-owner was moving house eight years ago and it took him two trips to take his belongings along this same road, the total time to move house would be four hours (two round-trips).  Today he is moving house again and has ten times as many belongings; unless he hires a truck it will take twenty round-trips, or a total of forty hours, and the number of lanes on the road is irrelevant for that individual.

The enemy of applications – distance


If the enemy of application delivery is not bandwidth, what is it?  It is distance.  To be more exact, the enemy is round-trip time.  And round-trip time is defined by the following:

  • The speed of light.
  • The real distance the data needs to travel (cables don’t go direct from source to destination).
  • Any delays from routers, firewalls and network latency.
  • The server and PC delays at each end.
  • The amount of data that can be transmitted at one time, defined by the protocol being used.

Our protocols are inefficient over the WAN


Now for some mathematics.  Don’t hide, it’s not that bad.

The original design goal of TCP/IP was to create a protocol that was reliable over almost any network.  A sender transmits a small amount of data (a maximum of 64KB) and then waits for acknowledgements (ACKs) from the recipient before sending more.  The equivalent on the freeway is to take one box of belongings at a time along the 65-mile route before driving back empty to collect another box.

To make matters worse, other protocols reduce this maximum (for example MAPI, used by Microsoft Exchange, uses a maximum of 32KB).

So, a single 5MB file needs a minimum of 78 round-trips (or 156 if using MAPI).

Even this assumes that TCP uses its highest window size.  In practice the window size is negotiated and adjusted between the devices based on response time, and TCP rarely reaches a 64KB window on high-latency links.  There have been a number of articles and papers on this; search for "bandwidth delay product" and you'll see, for example, that without window scaling or other optimisations it is not possible to transfer much more than 1Mbit/sec over a satellite link.  This is also a good discussion: http://packetlife.net/blog/2010/aug/4/tcp-windows-and-window-scaling/
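If you want to check the arithmetic yourself, the sums fit in a few lines of Python (treating 5MB as 5,000KB, as above, and assuming a ~550ms satellite round-trip):

# The arithmetic behind the figures above.

FILE_KB  = 5000      # the 5MB file
TCP_WIN  = 64        # largest TCP window without window scaling, in KB
MAPI_WIN = 32        # MAPI's smaller block size, in KB

print(FILE_KB / TCP_WIN)     # 78.125  -> the "minimum of 78 round-trips"
print(FILE_KB / MAPI_WIN)    # 156.25  -> 156 round-trips for MAPI

# Bandwidth-delay product: with a 64KB window and an assumed ~550ms satellite
# round-trip, throughput tops out around 1Mbit/sec however big the link is.
window_bits     = 64 * 1024 * 8
satellite_rtt_s = 0.55
print(window_bits / satellite_rtt_s / 1e6, "Mbit/sec")   # ~0.95 Mbit/sec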

Isn’t the speed of light so fast that this is all still only a theoretical problem?


OK, I admit, the speed of light in a vacuum is pretty fast – 299,792 Km/second or 186,282 miles/second.  However, the speed of light in fiber or copper is around 70% of that in a vacuum[3], roughly 210,000 Km/s.

So, back to our 5MB file, which requires a minimum of 78 round-trips.  Let’s assume the server is in Boston, Massachusetts and the user is in London, a distance of 5,279Km[4]. A single round-trip is double the distance: 10,558Km.  78 round-trips therefore cover 78 * 10,558Km, or 823,524Km.  Divide that distance by 210,000 and you have a minimum of 4 seconds to retrieve the file.
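Or, as a few lines of Python:

# Reproducing the Boston-London figure: pure propagation delay, nothing else.

ROUND_TRIPS    = 78
ONE_WAY_KM     = 5279        # Boston to London
LIGHT_IN_FIBER = 210_000     # Km/s, roughly 70% of the speed of light in a vacuum

total_km = ROUND_TRIPS * (2 * ONE_WAY_KM)
print(total_km, "Km travelled")                        # 823,524Km
print(total_km / LIGHT_IN_FIBER, "seconds minimum")    # ~3.9 seconds, the "minimum of 4 seconds"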

But this is all theoretical and assumes a direct link from the user to the server, no routing delays, no congestion and the optimal TCP window size. 

You can calculate it yourself – it’s twice as bad as you think!


Most PCs have a utility called PING, which can be used to see the real round-trip time between devices across WAN links and the Internet.  Before you start, make sure you are really testing to the destination you think you are; there are online utilities that will tell you where a server is hosted[5].

In theory, our round-trip time between Boston and London could be as short as 50 milliseconds (10,558 divided by 210,000), but try it and you’ll find it is always at least double that.  While writing this near London, I tested the round-trip to three websites hosted near Boston[6] (while most of the USA was asleep, for minimum congestion) and received average round-trip times of 129 milliseconds.  Now that 5MB file will take a minimum of ten seconds to reach me, and this still assumes no server or firewall delays, no congestion on the line, no slow-start, the maximum window size and no additional packets to request the content and deliver approval from the server.
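The same sum with the measured figure instead of the theoretical one:

# The measured ping time in place of the theoretical propagation delay.

measured_rtt_s = 0.129                    # average ping from London to Boston-hosted sites
print(78 * measured_rtt_s, "seconds")     # ~10 seconds for the 5MB file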

Let’s remember, the round-trip between Boston and London is only 10,558Km – the one-way distance of 5,279Km is around one eighth of the earth’s circumference – and the greater the distance, the worse the situation.  Some examples using other round-trip times for the same 5MB file:

            San Francisco – London                     16 seconds
            San Francisco – Sydney                      23 seconds
            Dallas – Beijing                                   21 seconds
            Paris – New Delhi                               12.5 seconds

(Don’t even think about using a satellite – geostationary satellites orbit around 36,000Km above the earth, introducing even greater delays.)

So, to demonstrate the real problem, sometimes there’s only one option: people based in HQ need to jump on an aeroplane and work in a remote office for a week, accessing all the same data that they do at HQ!

What can be done?


In simple terms we need to reduce the number of round-trips that data needs to take to get from a server to a user.  To go back to our analogy of moving house, we could:

1. Throw out some of our unwanted stuff, therefore reducing the number of trips.
2. Optimize our delivery mechanism; hire a truck instead of using a car and get more items in one journey.
3. Prioritize what gets sent first.  Which is more important, the refrigerator or the curling tongs?

In the data world there are also a number of techniques that can work together to achieve faster data delivery.

Object or file caching

Keep a copy of the object at the remote site, using object caching.  When a user requests an object that has already been requested by another user, it can be delivered from the local cache (after checking with the server that the cached copy is still up to date). This reduces WAN bandwidth and latency to almost zero.
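As a rough illustration, here is a toy version of that check-then-serve logic in Python; the origin_* helpers are hypothetical stand-ins for a real conditional HTTP request (If-Modified-Since or an ETag):

# Toy object cache with revalidation: serve the local copy if the origin
# confirms it hasn't changed, otherwise fetch it again over the WAN.

cache = {}   # url -> (version_tag, body)

def origin_version(url):
    """Pretend to ask the origin server for the object's current version tag."""
    return "v1"

def origin_fetch(url):
    """Pretend to fetch the full object over the WAN."""
    return "v1", b"a large video or document body"

def get(url):
    if url in cache:
        tag, body = cache[url]
        if tag == origin_version(url):       # cheap check instead of a full transfer
            return body                      # served locally: near-zero WAN latency
    tag, body = origin_fetch(url)            # full transfer only when needed
    cache[url] = (tag, body)
    return body

get("http://example.com/big-file")   # first request: full fetch over the WAN
get("http://example.com/big-file")   # second request: revalidate, then serve locally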

Byte caching

When an object is not fully cached, techniques to recognize repeated patterns in the data can send tokens instead of the repeated data.  This can send a few bytes instead of large amounts, thus increasing the apparent bandwidth and reducing the time to deliver the content.
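A minimal sketch of the idea, assuming fixed 4KB chunks and one shared dictionary standing in for the appliances at both ends (real products use rolling hashes and variable-length chunks):

# Byte-caching sketch: split data into chunks and send a short token (a hash)
# for any chunk the other side has already seen.

import hashlib
import os

CHUNK = 4096
seen  = {}          # hash -> chunk; one dictionary stands in for both ends here

def encode(data):
    """Replace previously-seen chunks with their 16-byte tokens."""
    out = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        token = hashlib.md5(chunk).digest()
        if token in seen:
            out.append(("token", token))     # a few bytes instead of 4KB
        else:
            seen[token] = chunk
            out.append(("raw", chunk))
    return out

def decode(messages):
    return b"".join(part if kind == "raw" else seen[part] for kind, part in messages)

payload = os.urandom(10_000)
first   = encode(payload)      # nothing seen yet, so every chunk goes as raw data
second  = encode(payload)      # the repeat is all tokens; the bulk data never crosses the WAN again
assert decode(second) == payload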

Protocol Optimization

Hide the inefficiencies of the protocols by sending large blocks of data before waiting for acknowledgements, fast-starting those protocols that are slow to build up transmissions and even anticipating user requests for data (if a user requests the start of a file, these devices can read ahead and fetch the rest of it).
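The benefit of keeping several requests in flight can be simulated in a few lines; fetch_block() and the 130ms round-trip are illustrative assumptions rather than a real protocol:

# Sketch of the "don't wait for every acknowledgement" idea: keep several block
# requests in flight at once instead of strictly one per round-trip.

import time
from concurrent.futures import ThreadPoolExecutor

RTT = 0.13     # assumed round-trip time in seconds

def fetch_block(n):
    time.sleep(RTT)            # simulate one round-trip per block
    return f"block-{n}"

blocks = range(16)

start = time.time()
serial = [fetch_block(n) for n in blocks]                  # one block per round-trip
print("serial:   ", round(time.time() - start, 2), "seconds")

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:            # eight requests in flight
    pipelined = list(pool.map(fetch_block, blocks))
print("pipelined:", round(time.time() - start, 2), "seconds")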

Compression

Use compression technologies between the sites to reduce the bandwidth and round-trips needed.
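Even the standard zlib library shows how far repetitive business data shrinks, and fewer bytes means fewer windows-worth of round-trips:

# Compression between sites: fewer bytes, fewer round-trips.

import zlib

text   = b"Quarterly report boilerplate, repeated headers, XML tags... " * 500
packed = zlib.compress(text, 6)

print(len(text), "bytes raw")
print(len(packed), "bytes compressed")
print(round(len(text) / len(packed)), "times smaller, and proportionally fewer round-trips")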

Bandwidth Management

To make sure the systems use the available bandwidth effectively, set priorities by user group, by server, by application etc.
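One way to picture this is a simple priority queue: when the link is contended, higher-priority classes of traffic are sent first.  The class names and priorities below are purely illustrative:

# Bandwidth-management sketch: drain traffic in priority order, business first.

import heapq

PRIORITY = {"erp": 0, "email": 1, "internal-video": 2, "external-stream": 3}

pending = []
for app, size_kb in [("external-stream", 800), ("erp", 40), ("email", 120), ("erp", 60)]:
    heapq.heappush(pending, (PRIORITY[app], app, size_kb))

while pending:
    prio, app, size_kb = heapq.heappop(pending)
    print(f"sending {size_kb:>4}KB of {app} (priority {prio})")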

Remove inappropriate traffic

Let’s not forget that business traffic is often competing with non-business traffic.  Deploy devices that implement policies to block requests for inappropriate traffic.

Conclusion:  Latency – the application killer


Bandwidth is not enough - distance is the real killer.  Even with unlimited bandwidth, data still travels from server to user slowly due to the repeated trips taken before the full data arrives; we still have to wait.  Organizations need to investigate solutions to solve this problem or applications will be unusable in remote offices.