The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise (Part 10)


Chapter 33: Putting It All Together

…vision to be both highly available and highly scalable. Lynn and Marty hired a number of people to further augment the team over a period of several years, including Tom Keeven and Mike Fisher (both now partners in AKF), creating a perpetual cycle of "seed, feed, and weed." Experiences were added to the team and talent was boosted. Great individual contributors and managers were recognized and promoted, and some people were asked to leave or left of their own volition if they did not fit into the new culture.

While continuing to focus on ensuring that the teams had the right skills and experiences, the executives simultaneously looked at processes. Most important were the processes that would allow the organizations to learn over time. Crisis management, incident management, postmortem, and change management and control processes were all added within the first week. Morning operations meetings were added to focus on open and recurring incidents and to drive incidents and problems to closure. Project management disciplines were added to keep business and scalability-related projects on track.

And of course there was the focus on technology! It is important to understand that although people, process, and technology were all simultaneously focused on, the most important aspects for long-term growth stem from people first and process second. As we've said time and time again, technology does not get better without having the right team with the right experiences, and people do not learn without the right (and appropriately sized) processes to reinforce lessons learned, thereby keeping issues from happening repeatedly.

Databases and applications were split on the x-, y-, and z-axes of scale. What started out as one monolithic database on the largest server available at the time was necessarily split to allow the system to scale to user demand. Data elements with high read-to-write ratios were replicated using x-axis techniques. Customer information was split from product information, product information was split into several databases, and certain functions like "feedback" were split into their own systems over the period of a few years.
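To make the three axes concrete, here is a toy routing sketch in Python. It is purely illustrative and not from the book; the class, method, and shard names are assumptions, and a production system would use far more robust routing and a stable hash function.

    import random

    class ScaleCubeRouter:
        """Toy illustration of routing across x-, y-, and z-axis splits."""

        def __init__(self, read_replicas, service_pools, customer_shards):
            self.read_replicas = read_replicas      # x-axis: identical clones serving reads
            self.service_pools = service_pools      # y-axis: separate systems per function
            self.customer_shards = customer_shards  # z-axis: data partitioned by customer

        def route_read(self):
            # x-axis: any replica can serve replicated, read-mostly data.
            return random.choice(self.read_replicas)

        def route_function(self, function_name):
            # y-axis: functions such as "feedback" live in their own systems.
            return self.service_pools[function_name]

        def route_customer(self, customer_id):
            # z-axis: a customer's data lives on one shard; a simple (non-stable)
            # hash is used here purely for illustration.
            return self.customer_shards[hash(customer_id) % len(self.customer_shards)]

    router = ScaleCubeRouter(
        read_replicas=["product-replica-1", "product-replica-2"],
        service_pools={"feedback": "feedback-db", "product": "product-db"},
        customer_shards=["customer-shard-0", "customer-shard-1"],
    )

Reads fan out across identical clones, each function resolves to its own pool, and a customer's data maps to one shard, mirroring the x-, y-, and z-axis splits described above.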
Quigo: A Young Product with a Scalability Problem

Quigo started out as a company offering a service based on technology. Relying on a proprietary relevance and learning engine, its first product promised to help increase the returns in the nascent search engine marketing industry for direct-response advertisers. Leveraging this existing technology, the company branched out into offering a private-label contextual advertising platform for premium publishers. AdSonar was born. Early premium branded publishers loved the product and loved the capability to increase their revenue per page over the existing alternatives. However, within months, the new advertising platform had problems. It simply couldn't handle the demand of the new publishers. How did a new product fail so quickly? The product wasn't anywhere near the scale of an eBay, Amazon, or Google; it wasn't even near the scale of competing ad networks. What went wrong and how could it be fixed?

Again, the answer starts with people. The existing team was smart and dedicated, just as with the eBay team. But it lacked experience in large-scale operations and designing platforms for hyper growth. This is when two future AKF Partners were brought onboard. The new executives didn't have direct experience with advertising technology, but their experience with commerce and payment platforms was directly applicable. More importantly, they knew how to focus teams on an objective and how to create a culture that would support the needs of a highly scalable site. Consistent with the layout of this book, it all starts with people. The new team created metrics and goals supporting availability, scalability, and cost. It created a compelling vision of the ideal future and gave the team hope that it could be achieved. Where necessary, it added great engineering and managerial talent.

The new executives also set about adding the right processes to support scalability. Scalability summits, operations meetings, incident management processes, and change management processes were all added within a couple of weeks. Joint Application Design and Architecture Review Boards soon followed. Architectural principles focusing the team on the critical elements of scale were introduced and used during Architecture Review Boards.

And of course the team focused on technology. Again, what ultimately became the AKF Scale Cube was employed to split services, resources, and (where necessary) data elements. Fault isolation was employed where possible to increase scalability. What were the results of all of this work? Within two years, the company had grown more than 100x in transactions and revenue and was successfully sold to AOL.

ShareThis: A Startup Story

ShareThis is a company that is all about sharing. Its products allow people to easily share the things they find online, by consolidating address books and friend lists, so that anything can be shared immediately, without even leaving a Web page. Within six months of launching the ShareThis widget, there were already over 30,000 publishers using it. Witnessing this hyper growth, the cofounder and CEO Tim Schigel met with the AKF Partners to talk about guidance with scalability concerns. Tim is a seasoned veteran of startups, having seen them for more than a decade as a venture capitalist, and was well aware of the need to address scalability early and from a holistic approach.

Michael Fisher from AKF Partners worked with Tim to lay out a short- and long-term plan for scalability. At the top of the list was filling some open positions on his team with great people. One of these key hires was Nanda Kishore as the chief technology officer. Prior to ShareThis, Nanda was a general manager at Amazon.com and knew firsthand how to hire, lead, design, and develop scalable organizations, processes, and products. In addition to other key hires in operations, engineering, product management, and the data warehouse team, there was a dedicated focus on improving processes. Some of the processes that were put in place within the first few weeks were source code control, on-call procedures, bug tracking, and product councils.

After people and process were firmly established, they could properly address scalability within the technology. With a keen focus on managing cost and improving performance, the team worked on reducing the widget payload. It implemented a content delivery network (CDN) solution for caching and moved all serving and data processing into Amazon's EC2 cloud. Because of the ShareThis architecture and the need for large amounts of compute processing for data, this combination of a CDN and public cloud worked exceptionally well.
Under Nanda's leadership, the team reduced the serving cost by more than 56% while experiencing growth rates in excess of 15% per month. All of this sharing activity resulted in terabytes of data that need to be processed daily. The team has produced a data warehousing solution that can scale with the ever-increasing amount of data while reducing the processing time by 1900% in the past six months. Less than two years after the launch, the ShareThis widget reached more than 200 million unique users per month and more than 120,000 publisher sites. ShareThis is a scalability success story because of its focus on people, process, and technology.

Again, it's worth repeating a recurring theme throughout this book: You can't scale without focusing on all three elements of people, process, and technology. Too many books and Web sites feature the flavor-of-the-day technical implementation to fix all needs. Vision, mission, culture, team composition, and focus are the most important elements to long-term success. Processes need to support the development of the team and need to reinforce lessons learned as well as rapid learning. Technology, it turns out, is the easiest piece of the puzzle, but unfortunately the one people tend to focus on first. Just as with complex math equations, one simply needs to iteratively simplify the equation until the component parts are easy to solve. People and organizations are more dynamic and demanding. Although there is no single right solution for them, there is an approach that is guaranteed to work every time. Start with a compelling vision mixed with compassion and hope, and treat your organization as you would your garden. Add in goals and measurements and help the team overcome obstacles.

Process development should focus on those things that help a company learn over time and avoid repeating mistakes. Use process to help manage risks and repeat superior results. Avoid process that becomes cumbersome or significantly slows down product development.

References

We have covered a lot of material in this book. Because of space limitations, we have often only been able to cover this material in a summary fashion. Following are a few of the many resources that can be consulted for more information on concepts related to scalability. Not all of these necessarily share our viewpoints on many issues, but that does not make them or our positions any less valid. Healthy discussion and disagreement is the backbone of scientific advancement. Awareness of different views on topics will give you a greater knowledge of the concept and a more appropriate decision framework.

Blogs

AKF Partners Blog: http://www.akfpartners.com/techblog
Silicon Valley Product Group by Marty Cagan: http://www.svpg.com/blog/files/svpg.xml
All Things Distributed by Werner Vogels: http://www.allthingsdistributed.com
High Scalability Blog: http://highscalability.com
Joel On Software by Joel Spolsky: http://www.joelonsoftware.com
Signal vs Noise by 37Signals: http://feeds.feedburner.com/37signals/beMH
Scalability.org: http://scalability.org

Books

Building Scalable Web Sites: Building, Scaling, and Optimizing the Next Generation of Web Applications by Cal Henderson
Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services by Neil J. Gunther
The Art of Capacity Planning: Scaling Web Resources by John Allspaw
Scalable Internet Architectures by Theo Schlossnagle
The Data Access Handbook: Achieving Optimal Database Application Performance and Scalability by John Goodson and Robert A. Steward
Real-Time Design Patterns: Robust Scalable Architecture for Real-Time Systems (Addison-Wesley Object Technology Series) by Bruce Powel Douglass
Cloud Computing and SOA Convergence in Your Enterprise: A Step-by-Step Guide (Addison-Wesley Information Technology Series) by David S. Linthicum
Inspired: How To Create Products Customers Love by Marty Cagan

Appendices

Appendix A: Calculating Availability

There are many ways of calculating a site's availability. Included in this appendix are five ways that this can be accomplished. In Chapter 6, Making the Business Case, we made the argument that knowing your availability is extremely important in order to make the business case that you need to undertake scalability projects. Downtime equals lost revenue, and the more scalability projects you postpone or neglect to accomplish, the worse your outages and brownouts are going to be. If you agree with all of that, then why does it matter how you calculate outages or downtime or availability? It matters because the better job you do, and the more everyone agrees that your method is the standard way of calculating the measurement, the more credibility your numbers have. You want to be the final authority on this measurement; you need to own it and be the custodian of it. Imagine how the carpet could be pulled out from under your scalability projects if someone disputed your availability numbers in the executive staff meeting.

Another reason that a proper and auditable measurement should be put in place is that for an Internet-enabled service, there is no more important metric than being available to your customers when they need your service. Everyone in the organization should have this metric and goal as part of his or her personal goals. Every member of the technology organization should know the impact on availability that every outage causes. People should question each other about outages and work together to ensure they occur as infrequently as possible. With availability as part of the company's goals, affecting employees' bonuses, salaries, promotions, and so on, this should be a huge motivator to care about this metric.

Before we talk about the five different methods of calculating availability, we need to make sure we are all on the same page with the basic definition of availability. In our vernacular, availability is how often the site is available over a particular duration. It is simply the amount of time the site can be used by customers divided by the total time. For example, if we are measuring availability over one week, we have 10,080 minutes of possible availability: 7 days × 24 hrs/day × 60 min/hr. If our site is available 10,010 minutes during that week, our availability is 10,010 / 10,080 = .9931. Availability is normally stated as a percentage, so our availability would be 99.31% for the week.
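As a concrete illustration, here is a minimal sketch of that calculation in Python; the function name and layout are ours, not the book's.

    def availability_pct(uptime_minutes, total_minutes):
        # Availability is simply uptime divided by total time, stated as a percentage.
        return 100.0 * uptime_minutes / total_minutes

    week_minutes = 7 * 24 * 60                      # 10,080 minutes of possible availability
    print(availability_pct(10010, week_minutes))    # ~99.31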
Hardware Uptime

The simplest and most straightforward measurement of availability is calculating it based on device (or hardware) uptime. Using simple monitoring tools that rely on SNMP traps for catching when devices are having issues, organizations can monitor the hardware infrastructure as well as keep track of when the site's hardware was having issues. On whatever time period availability is to be calculated, the team can look back through the monitoring log and identify how many servers had issues and for what duration. A simple method would be to take the total time of the outage and multiply it by the percentage of the site that was impacted. The percentage would be generated by taking the number of servers having issues and dividing by the total number of servers hosting the site.

As an example, let's assume an access switch failed and the hosts were not dual homed, so it took out 12 Web servers that were attached to it for 1½ hours until someone was able to get in the cage and swap the network device. The site is hosted on 120 Web servers. Therefore, the total downtime would be 9 minutes, calculated as follows:

Outage duration = 1½ hours
Servers impacted = 12
Total servers = 120
90 min × 12/120 = 9 min

With the downtime figured, the availability can be calculated. Continuing our example, let's assume that we want to measure availability over a week and this was our only outage during that week. During a week, we have 10,080 minutes of possible availability: 7 days × 24 hrs/day × 60 min/hr. Because this is our only downtime of the week, we have 10,080 – 9 = 10,071 minutes of uptime. Availability is simply the ratio of uptime to total time expressed as a percentage, so we have 10,071 / 10,080 = 99.91%.

As we mentioned, this is a very simplistic approach to availability. The reason we say this is that the performance of a Web server is not necessarily the experience of your customers. Just because a server was unavailable does not mean that the site was unavailable for the customers; in fact, if you have architected your site properly, a single failure will likely not cause any customer-impacting issues. The best measure of availability will have a direct relation to the maximization of shareholder value; this maximization in turn likely considers the impact to customer experience and the resulting impact to revenue or cost for the company.

This is not meant to imply that you should not measure your servers' and other hardware's availability. You should, however, refer back to the goal tree in Chapter 5, Management 101, shown in Figure 5.2. Device or hardware availability would likely be a leaf on this tree beneath the availability of the ad-serving systems and the registration systems. In other words, the device availability impacts the availability of these services, but the availability of the services themselves is the most important metric. You should use device or hardware availability as a key indicator of your system's health, but you need a more sophisticated and customer-centric measurement for availability.
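A minimal sketch of this prorated-downtime arithmetic, using the same switch-failure numbers (Python; illustrative only, not code from the book):

    def weighted_downtime(outage_minutes, impacted_servers, total_servers):
        # Prorate the outage by the fraction of the server pool that was affected.
        return outage_minutes * impacted_servers / total_servers

    def availability_pct(downtime_minutes, total_minutes):
        return 100.0 * (total_minutes - downtime_minutes) / total_minutes

    week_minutes = 7 * 24 * 60                        # 10,080 minutes in a week
    downtime = weighted_downtime(90, 12, 120)         # 9 minutes of prorated downtime
    print(availability_pct(downtime, week_minutes))   # ~99.91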
Customer Complaints

The next approach to determining availability involves using the customers as a barometer or yardstick for your site's performance. This measurement might be in the form of the number of inbound calls or emails to your customer support center or the number of posts on your forums. Often, companies with very sophisticated customer support services will have real-time tracking metrics on support calls and emails. Call centers measure this every day and have measurements on how many they receive as well as how many they can service. If there is a noticeable spike in such service requests, it is often the fault of an issue with the application.

How could we turn the number of calls into an availability measurement? There are many ways to create a formula for doing this, but they are all inaccurate. One simple formula might be to take the number of calls received on a normal day and the number received during a complete outage; these would serve as your 100% available and 0% available marks. As the number of calls increases beyond the normal-day rate, you start subtracting availability until you reach the amount indicating a total site outage; at that point, you count the time as the site being completely unavailable.

As an example, let's say we normally get 200 calls per hour from customers. When the site is completely down in the middle of the day, the call volume goes to 1,000 per hour. Today, we start seeing the call volume go to 400 per hour at 9:00 AM and remain there until noon, when it drops to 150 per hour. We assume that the site had some issues during this time, and that is confirmed by the operations staff. We mark the period from 9:00 AM to noon as an outage. The percentage of downtime is 25%, calculated as:

Outage duration = 3 hours = 180 min
Normal volume = 200 calls/hr
Max volume = 1,000 calls/hr
Diff (Max – Norm) = 800 calls/hr
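The remainder of this calculation is cut off in the preview, but the linear interpolation it describes can be sketched as follows (Python; the function name and the exact scaling are assumptions based on the description above, which states the result is 25%):

    def downtime_fraction(current_calls, normal_calls, outage_calls):
        # Linear interpolation: normal volume = fully up, total-outage volume = fully down.
        excess = max(current_calls - normal_calls, 0)
        return min(excess / (outage_calls - normal_calls), 1.0)

    # 400 calls/hr against a 200/hr norm and a 1,000/hr total-outage rate -> 25% down
    print(downtime_fraction(400, 200, 1000))   # 0.25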
[...]

…observations. The paired t-test calculates the difference for each pair of measurements; for the upload test, this is each test of the datasets. Then the paired t-test determines the mean of these weight changes and determines whether this is statistically significant. We won't go into the details of how to perform this analysis other than to say that you formulate a hypothesis and a null hypothesis. The hypothesis…

…times, the number of SQL executions, and the number of errors reported by the application. In Table C.1 are the response time results of the upload tests. The results are from 10 separate runs of the test, each with 10 simultaneous executions of the upload service. In the chart are the corresponding response times. In Table C.2 are the response times for the All_Emp report tests. The results are from 10 separate…

…determine how the new version of code should perform compared to the old version, but at this point, it decides to continue to the next step by getting the engineers involved. Step six is to report to the engineers the results of the tests and analysis. AllScale.com gathers the engineers responsible for the various parts of the code that make up the upload and report services and presents their analysis. The engineers…

…C.3 are the summary results of the upload and All_Emp report tests for both the current version 3.1 as well as the previous version of the code base, 3.0. For the rest of the example, we are going to stick with these two tests in order to cover them in sufficient detail and not have to continue repeating the same thing about all the other tests that could be performed. The summary results include the overall…

…repeatedly. They end up running another set of performance tests on version 3.1 of the code before releasing the code to production. Because of their diligence, they feel confident that this version of the code will perform in a manner that is acceptable. This example has covered the steps of performance testing. We have shown how you might decide which tests to perform, how to gather the data, and how to perform…

…overall mean and standard deviation, the number of SQL statements executed for each test, and the number of errors that the application reported. The third column for each test is the difference between the old and new versions' performance. This is where the analysis begins. Step five is to perform the analysis on the data gathered during testing. As we mentioned, Table C.3 has a third column showing the difference…

…differently for capacity planning. For simplicity of this example, we will continue to group them together as a total number of requests. From the graphs, we have put together a summary in Table B.1 of the Web servers, application servers, and the database server. You can see that we have for each component the peak total requests, the number of hosts in the pool, the peak requests per host, and the maximum…

…you are and how much buffer you need. For our exercise, we are going to use 90% for the database and 75% for the Web and application servers. Step 7, our final step, is to perform the calculations. You may recall the formula shown in Figure B.4. This formula states that the capacity or headroom of a particular component is equal to the Ideal Usage Percentage multiplied by the maximum capacity of that component…

…followed by the emp_tng. Therefore, they are the most likely to have a performance problem and are prioritized in the testing sequence. The fourth step is to execute or conduct the tests and gather the data. AllScale.com has automated scripts that run the tests and capture the data simultaneously. The scripts run a fixed number of simultaneous executions of a standard data set or set of instructions. This…

…components. We also perform load and performance testing before each new code release, and we know the maximum requests per second for each component based on the latest code version. In Figure B.1, there are the Web server and application server requests that are being tracked and monitored for AllScale.com. You can see that there are around 125 requests per second at peak for the Web servers. There are also…

…considered the outage percentage and could be used in the calculation of downtime. In this case, we would calculate that this area is 40% of the normal traffic, and therefore the site had a 40% outage for…
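The last fragment describes estimating an outage from a traffic graph: the shortfall in traffic below the normal level, taken as a share of normal traffic, is treated as the outage percentage for the affected period. The surrounding details are elided in the preview, so the sketch below is an interpretation with illustrative numbers only (Python; the function name and per-minute data are assumptions):

    def traffic_outage_pct(expected_per_min, actual_per_min):
        # The shortfall below normal traffic, as a share of normal traffic, is
        # treated as the outage percentage for the period being examined.
        shortfall = sum(max(e - a, 0) for e, a in zip(expected_per_min, actual_per_min))
        return 100.0 * shortfall / sum(expected_per_min)

    expected = [1000] * 180   # normal traffic: 1,000 requests/minute over a 3-hour window
    actual = [600] * 180      # observed traffic during the incident
    print(traffic_outage_pct(expected, actual))   # 40.0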


Table of Contents

  • Part IV: Solving Other Issues and Challenges

    • Chapter 33: Putting It All Together

      • References

      • Appendices

        • Appendix A: Calculating Availability

          • Hardware Uptime

          • Customer Complaints

          • Portion of Site Down

          • Third-Party Monitoring Service

          • Traffic Graph

          • Appendix B: Capacity Planning Calculations

          • Appendix C: Load and Performance Calculations

          • Index

