From: Windows Team Blog
Hi, my name is Arthur de Haan and I am responsible for Test and System Engineering in Windows Live. To kick things off, I’d like to give you a look behind the scenes at Hotmail, and tell you more about what it takes to build, deploy and run the Windows Live Hotmail service on such a massive global scale.
Hosting your mail and data (and our own data!) on our servers is a big responsibility and we take quality, performance, and reliability very seriously. We make significant investments in engineering and infrastructure to help keep Hotmail up and running 24 hours a day, day in and day out, year after year. You will rarely hear about these efforts – you will only read about them on the rare occasion that something goes wrong and our service has run into an issue.
Hotmail is a gigantic service in all dimensions. Here are some of the highlights:
-      We are a worldwide service, delivering localized versions of Hotmail to 59 regional markets, in 36 languages.
-      We host well over 1.3 billion inboxes (some users have multiple inboxes).
-      Over 350 million people are actively using Hotmail on a monthly basis (source: comScore, August 2009).
-      We handle over 3 billion messages a day and filter out over 1 billion spam messages - mail that you never see in your inbox.
-      We are growing storage at over 2 petabytes a month (a petabyte is ~1 million gigabytes or ~1000 terabytes).
-      We currently have over 155 petabytes of storage deployed (70% of storage is taken up with attachments, typically photos).
-      We’re the largest SQL Server 2008 deployment in the world (we monitor and manage many thousands of SQL servers).
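As a rough back-of-the-envelope check on these figures, the Python sketch below derives the average deployed storage per inbox and converts the monthly growth into a daily rate. It uses only the numbers quoted in the list, treats the "over" figures as exact, and assumes decimal units (1 petabyte = 1,000,000 gigabytes) and a 30-day month. The constant and function names are made up for illustration. Deployed storage presumably includes redundant copies, so the per-inbox result is not the amount of mail a typical user keeps.

```python
# Figures quoted in the list above, treated as exact for a rough estimate.
INBOXES = 1.3e9            # hosted inboxes
STORAGE_PB = 155           # deployed storage, in petabytes
GROWTH_PB_PER_MONTH = 2    # storage growth, in petabytes per month
GB_PER_PB = 1_000_000      # assumption: decimal units
MB_PER_GB = 1_000
TB_PER_PB = 1_000


def average_mb_per_inbox(storage_pb, inboxes):
    """Average deployed storage per inbox, in megabytes."""
    return storage_pb * GB_PER_PB * MB_PER_GB / inboxes


def growth_tb_per_day(growth_pb_per_month, days_per_month=30):
    """Storage growth per day, in terabytes (assumes a 30-day month)."""
    return growth_pb_per_month * TB_PER_PB / days_per_month


if __name__ == "__main__":
    print(round(average_mb_per_inbox(STORAGE_PB, INBOXES), 1))  # about 119.2 MB
    print(round(growth_tb_per_day(GROWTH_PB_PER_MONTH), 1))     # about 66.7 TB
```

In other words, the deployment averages on the order of 120 megabytes per inbox and grows by roughly 67 terabytes every day.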
You can imagine that the Hotmail user interface you see in the browser is only the tip of the iceberg – a lot of innovations happen beneath the surface. In this post I will give a high level overview of how the system is architected. We will do deeper dives into some specific features in later posts.
Architecture
Hotmail and our other Windows Live services are hosted in multiple datacenters around the world. Our Hotmail service is organized in logical “scale units,” or clusters. Furthermore, Hotmail has infrastructure that is shared between the clusters in each datacenter:
-      Servers to handle incoming and outgoing mail.
-      Spam filters (we will talk more about spam in a future blog post).
-      Data storage and aggregation from our service health monitoring systems.
-      Monitoring and incident response infrastructure.
-      Infrastructure to manage automated code deployment and configuration updates.
A cluster hosts millions of users (how many depends on the age of the hardware) and is a self-contained set of servers including:
-      Frontend servers – Servers that check for viruses and host the code that talks to your browser or mail client, using protocols such as POP3 and DeltaSync.
-      Backend servers – SQL and file storage servers, spam filters, storage of monitoring and spam data, directory agents and servers handling inbound and outbound mail.
-      Load balancers – Hardware and software used to distribute the load more evenly for faster performance.
Preventing outages and data loss is our top priority and we take utmost care to keep them from happening. We’ve designed our service to handle failure: our assumption is that anything that can fail will do so eventually. We do have hardware failures; with hundreds of thousands of hard drives in use, some are bound to fail. Fortunately, because of the architecture and failure management processes we have in place, customers rarely experience any impact from these failures.
Here are a few of the ways we keep failures contained:
-      Redundancy – We use a combination of SQL server storage arrays to host our data. We use active/passive failover technologies. This is a fancy way of saying that we have multiple servers and copies of your data that are constantly synchronized. If one server has a failure, another one is ready to take over in seconds. All in all we keep four copies of your data on multiple drives and servers to minimize the chance of data loss due to a hardware failure.
-      Another benefit of this architecture is that we can perform planned maintenance (such as deploying code updates or security patches) without downtime for you. Key pieces of our network gear are also duplicated to minimize the chance of network-related outages.
-      Monitoring – We have an elaborate system for monitoring hardware and software. Thousands of servers monitor service health, transactions (for example, sending an e-mail) and system performance for customers all over the world. Because we’re so large, we’re tracking performance and uptime metrics in aggregate as well as at the cluster level, and by geography. We do want to make sure that your individual experiences are reflected back to us, and not getting lost when we look at averages for the entire system. We care about every single user’s experience. We’ll talk more about performance and monitoring in a future post.
-      Response center – We have a round-the-clock response center team that watches over our global monitoring systems and takes action immediately when there is a problem. We have an escalation process that can engage our engineering staff within a few minutes when needed.
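To illustrate why the monitoring point above matters, here is a minimal Python sketch of how a system-wide average can hide one struggling cluster. It is not Hotmail's monitoring code: the cluster names, success rates, and the 99% threshold are all invented for the example, and the simple average assumes every cluster carries the same traffic.

```python
from statistics import mean

# Hypothetical fraction of test transactions (for example, sending an
# e-mail) that succeeded in each cluster. All values are invented.
success_by_cluster = {
    "cluster-a": 0.999,
    "cluster-b": 0.998,
    "cluster-c": 0.900,  # one unhealthy cluster
    "cluster-d": 0.999,
}


def overall_success(rates):
    """Aggregate view: unweighted average success rate across clusters."""
    return mean(list(rates.values()))


def unhealthy_clusters(rates, threshold=0.99):
    """Per-cluster view: names of clusters below the (assumed) threshold."""
    return sorted(name for name, rate in rates.items() if rate < threshold)


if __name__ == "__main__":
    print(round(overall_success(success_by_cluster), 3))
    print(unhealthy_clusters(success_by_cluster))
```

With these made-up numbers the aggregate is about 97.4%, which may not look alarming on its own, while the per-cluster check singles out cluster-c, where one in ten transactions is failing.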
Engineering process
I’ve talked a little bit about our architecture and the steps we are taking to ensure uninterrupted service. No service is static, however; in addition to growth due to usage, we push out updates on a regular basis. So our engineering processes are just as important as our architecture in providing you with a great service. From patches to minor updates to major releases, we take a lot of precautions during our development and rollout process.
Testing and deployment – For every developer on our staff we have a test engineer who works hand in hand with him or her to give input on the design and specs, set up a test infrastructure, write and automate test cases for new features, and measure quality. When we talk about quality, we mean it in the broadest definition of the word: not just stability and reliability, but also ease of use, performance, security, accessibility (for customers with disabilities), privacy, scalability, and functionality in all browsers and clients that we support, worldwide. Given our scale, this is not an easy feat.
And because we’re a free service funded largely by advertising, we need to be highly efficient on an operational basis. So deployment, configuration, and maintenance of our systems are highly automated. Automation also reduces the risk of human error.
Code deployment and change management – We have thousands of servers in our test lab where we deploy and test code well before it goes live to our customers. In the datacenter we have some clusters reserved for testing “dogfood” and beta versions in the final stages of a project. We test every change in our labs, be it a code update, hardware change or security patch, before deploying it to customers.
After all the engineering teams have signed off on a release (including Test and System Engineering) we start gradually upgrading the clusters in the datacenter to push the changes out to customers worldwide. Typically we do this over a period of a few months, not only because it takes time to perform the upgrades without causing downtime for customers, but also because it allows us to watch and make sure there is no loss of quality or performance.
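The gradual, cluster-by-cluster upgrade described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual deployment tooling: the function name staged_rollout, the upgrade and is_healthy callbacks, and the batch size are hypothetical, and a real rollout would wait and observe far longer between batches.

```python
def staged_rollout(clusters, upgrade, is_healthy, batch_size=1):
    """Upgrade clusters one batch at a time, halting on the first bad batch.

    clusters   -- ordered list of cluster names (hypothetical)
    upgrade    -- callback that upgrades a single cluster
    is_healthy -- callback that returns True if a cluster looks healthy
    Returns the list of clusters that were upgraded before stopping.
    """
    upgraded = []
    for start in range(0, len(clusters), batch_size):
        batch = clusters[start:start + batch_size]
        for cluster in batch:
            upgrade(cluster)
            upgraded.append(cluster)
        # Check quality after each batch; the remaining clusters keep
        # running the old version if anything in this batch looks wrong.
        if not all(is_healthy(cluster) for cluster in batch):
            break
    return upgraded


if __name__ == "__main__":
    done = staged_rollout(
        ["c1", "c2", "c3", "c4"],
        upgrade=lambda cluster: None,            # stand-in for a real upgrade
        is_healthy=lambda cluster: cluster != "c2",
    )
    print(done)  # the rollout stops after c2 fails its health check
```

Upgrading in small batches limits how many customers a bad release can reach, at the cost of a longer overall rollout.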
We can also turn individual features on or off. Sometimes we deploy updates but postpone or delay turning them on. In rare cases we have temporarily turned features off, say for security or performance reasons.
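One common way to separate deploying code from turning it on is a feature flag. The Python sketch below shows the idea under simple assumptions: the FeatureFlags class and the flag name are hypothetical and are not taken from the Hotmail codebase, and a production system would typically load flag values from a shared configuration store rather than from memory.

```python
class FeatureFlags:
    """In-memory on/off switches for individual features (illustrative)."""

    def __init__(self, defaults=None):
        self._flags = dict(defaults or {})

    def is_enabled(self, name):
        # Unknown features are treated as off, so newly deployed code
        # stays dark until someone explicitly turns it on.
        return self._flags.get(name, False)

    def set_enabled(self, name, enabled):
        self._flags[name] = bool(enabled)


if __name__ == "__main__":
    flags = FeatureFlags({"new-compose-window": False})  # deployed, not yet on
    print(flags.is_enabled("new-compose-window"))  # False

    flags.set_enabled("new-compose-window", True)   # turn the feature on later
    print(flags.is_enabled("new-compose-window"))  # True

    flags.set_enabled("new-compose-window", False)  # turn it off again if needed
    print(flags.is_enabled("new-compose-window"))  # False
```

Because the switch is data rather than code, a feature can be turned off for security or performance reasons without rolling back the deployment.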
Conclusion
This should begin to give you a sense of the size and scope of the engineering that goes into delivering and maintaining the Hotmail service. We are committed to engineering excellence and continuous improvements of our services for you. We continue to learn as the service grows, and we take all your feedback seriously, so do leave me a comment with your thoughts and questions. I am passionate about our services and so are all the members of the Windows Live team – we may be engineers but we use the services ourselves, along with hundreds of millions of our customers.
Arthur de Haan    
Director, Windows Live Test and System Engineering