CS294-1 (RADS) Fall 2006
10/13/06
Andrew Dahl, Jeremy Schiff, Jesse Trutna

Lab 4: Failover with HAProxy

Methods

In this lab we combined our best web farm configuration from Lab 3 with HAProxy to facilitate both load balancing and failover. This required adding HAProxy to the VM running Lighttpd, as well as a dedicated failover machine running a set of backup dispatchers, Memcached, and a MySQL server. The final configuration consisted of the following:

VM51:
• Lighttpd
• HAProxy

VM52 & VM53:
• 3x dispatchers per server

VM54 & VM55:
• MySQL server
• Memcached

VM56 & VM57:
• Load generators

VM50 (Backup):
• 3x dispatchers
• MySQL server
• Memcached

(See the appendix for the web farm configuration diagram.)

To test the load balancing and failover mechanisms of HAProxy, we started by loading our web farm with researchindex_load and successively bringing down each of the main dispatch and database servers. Once it was established that HAProxy was running correctly, we ran a baseline test using the same methods as in Lab 3: researchindex_load was run on two load servers simultaneously with a variable number of concurrent users: 1, 5, 10, 25, 50, 100, 500, 1000 (per load server). This run was then repeated twice, first to simulate the failure of a dispatch server and then to simulate the failure of a database server, each being brought down after the 25 concurrent user run.

Optimizations

Initially we had trouble getting HAProxy to work correctly and were also getting a large number of "500" errors from researchindex_load when trying to generate traffic. After roughly 10 hours of tweaking, the system appeared to be operating in a relatively stable fashion. The following changes were made:

• Added garbage collection to the dispatchers, located in "public/dispatch.fcgi" on all dispatch servers, by changing

    RailsFCGIHandler.process!

  to

    RailsFCGIHandler.process! nil, 10

  This appeared to solve many of the 500 errors and kept the dispatchers alive. In fact, the opposite problem then arose: occasionally, under heavy load, the dispatchers would become unresponsive and unkillable.

• Turned off Ajax in the researchindex_load script (the relevant code was commented "#partially working") by changing

    cfg_forms.use_ajax = true

  to

    cfg_forms.use_ajax = false

• Tweaked HAProxy settings, especially the timeout intervals and the number of connection retries; a sketch of the kind of directives involved is shown after this list. A copy of our HAProxy configuration file is presented in the appendix.

• Removed a call in researchindex_load that parses what it defined as "a partial page". This method was not defined in the file and generated an error when called. We don't believe this method call contributed to the "500" errors, at least in most cases, but it was removed for safety.
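The tuning above amounted to adjusting a handful of directives in the proxy's listener section. The fragment below is an illustrative sketch only, using HAProxy 1.x-era directive names; the listener name, addresses, ports, and values are assumptions, not the settings from our configuration file.

    listen dispatchers 0.0.0.0:8000
        balance roundrobin
        retries 3            # retry a failed connection to a server up to 3 times
        contimeout 5000      # max time (ms) to establish a connection to a server
        clitimeout 50000     # max client-side inactivity (ms)
        srvtimeout 50000     # max server-side inactivity (ms)
        server vm52 10.0.0.52:8000 check
        server vm53 10.0.0.53:8000 check

Shorter connect timeouts and fewer retries cause a dead server to be skipped sooner, at the cost of giving up earlier on servers that are merely slow under load, so these values interact directly with the error rates discussed above.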
Results

HAProxy test

1) Normal operation

2) This picture depicts our complete web farm configuration with a load of 100 concurrent users. In GREEN you see the primary dispatchers, Memcached, and MySQL servers; in BLUE their corresponding backups.

3) 1x dispatch server and 1x MySQL/Memcached server down

Here we have taken down one dispatch server and one MySQL/Memcached server. You can see that the excess load has been taken over by the remaining servers. Note, however, that the backup servers (in BLUE) are not yet used! They are only put into commission when all normal servers are down (this is how HAProxy works).

4) All normal servers down (except the Lighttpd/HAProxy server)

Now all normal servers are down except the server running Lighttpd and HAProxy. As you can clearly see from the picture, all load has now been shifted from the servers that are down (in RED) to the backup servers (in BLUE). We should note that researchindex_load is now generating errors due to timeouts, a result of the backup servers being overloaded.

5) Return to normal operation

Finally we bring all servers back up, and we see that load has correctly shifted back from the backups to the normal servers.

As can also be seen above, a noticeable delay was present between the detection of a server failure or resumption and the redirection of traffic. Additional configuration helped, but never entirely resolved this problem. It is also possible that the HAProxy reporting tool itself was not updating its statistics in real time and hence contributed to these apparent delays, but this was difficult to confirm or deny in practice. Neglecting these small delays, load was distributed appropriately.
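The behaviour seen in pictures 3 and 4, backups idle until every primary is gone, and the lag before traffic moves, is what HAProxy's backup flag and health-check parameters produce. The lines below are a rough sketch; the hosts, ports, and check timings are invented for illustration and are not taken from our configuration.

    server vm52 10.0.0.52:8000 check inter 2000 rise 2 fall 3
    server vm53 10.0.0.53:8000 check inter 2000 rise 2 fall 3
    server vm50 10.0.0.50:8000 check inter 2000 rise 2 fall 3 backup
    # "backup" servers receive traffic only after every non-backup server is down.
    # With checks every 2000 ms, "fall 3" marks a server down only after three
    # consecutive failed checks, and "rise 2" delays its return after recovery.

Check intervals and rise/fall thresholds of this kind are one plausible source of the detection and redirection delays we observed, independent of any lag in the statistics page itself.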
Baseline run

[Graph: # of Concurrent Users vs Response Time (as seen by user); series: VM56, VM57]

HAProxy does not seem to be incurring any additional overhead, at least not a measurable one. In fact, compared to Lab 3, the response times seem slightly better. We might attribute this to HAProxy performing better load balancing than Lighttpd and/or to some of the changes described in the "Optimizations" section. In particular, the removal of the Ajax calls from researchindex_load might contribute greatly to reducing the per-user load on the server. Due to the difficulty of debugging some of the application problems we were experiencing, we were not able to determine the exact impact of this change.

[Graph: # of Concurrent Users vs # of Errors; series: VM56, VM57]

There seems to be nothing unusual here. As expected from previous labs, errors increase roughly linearly once an initial load threshold is reached and the number of concurrent users keeps growing. In addition, both failure cases exhibited nearly identical error graphs and were thus excluded. It is interesting that all the error graphs matched so closely; in future work it would be illuminating to have a better understanding and classification system for errors, since this is not the expected behavior.

[Graph: # of Concurrent Users and VM vs Processing Time (seen by dispatchers); render time, controller time, and database time for VMs 50, 52, and 53]

This graph shows, as discussed in previous labs, that the database is where most of the processing time is spent. This indicates that either reducing the number of queries made or optimizing the current ones could give significant performance improvements.

Failure Runs (Summary)

[Graph: # of Concurrent Users vs Response Time (as seen by user); series: Kill Database, Kill Dispatcher, Baseline, Kill Database (Adjusted)]

[Graph: the same series plotted over the 10-100 concurrent user range]

In each of the failure runs, the failing server was killed after the completion of the 25 user run. The 50 user run was then started. As can be seen in the above graphs, both failure cases show an approximately proportional increase in response time over the 25-100 user range, as expected. Unexpected results for the database and dispatcher failure cases are explored further in the following sections. Due to the latency of the shutdown command, it is possible that some 50 user requests were dispatched to the failing server.

Failover run (Dispatcher)

[Graph: # of Concurrent Users vs Response Time (seen by user); series: VM56, VM57]

Graph 1: In this run we kill a dispatch server right before the 50 concurrent user run is started (50 users per load server). At this point, response time increases proportionally compared to baseline. At around 100 concurrent users we see the response time drop off slightly and then increase again after 500 users. This happens because the single remaining dispatch server is being overloaded and starts to drop requests, which is evident in the error graph and processing times shown and described below.

[...] to someone accidentally restarting the VM, which thus started servicing requests. This is almost certainly the cause of the dip in service time seen around the 500 user mark in the response time chart.

Failover run (Database)

[Graph: # of Concurrent Users vs Response Time (seen by user); series: VM56, VM57]

The results for the database/Memcached failover run [...] It is possible that another team began utilizing these machines while our tests were running. In addition, this test was run immediately after the dispatcher failure test, and it is possible that some of the dispatcher/HAProxy connections had not yet timed out, resulting in extremely long processing times until the old connections were flushed. The overall trends remain similar, and match well if adjusted for the slowdown [...]

[...] did not change over the various runs, even at saturation levels. This is a strong indicator that a more detailed analysis of the specific errors that occurred is needed.

Conclusions

In this lab we successfully set up HAProxy and a backup server to act as a failover for the dispatch, MySQL, and Memcached servers. From this exercise we learned a number of things:

1. HAProxy does not seem to incur a substantial overhead (if any) on the system compared to our web farm configuration from Lab 3. It might actually reduce response times due to better load balancing.

2. Failover with HAProxy works, as demonstrated in the "HAProxy test" section. We did, however, find that it takes HAProxy a considerable amount of time to properly rebalance load when servers go up and down; this could be on the order of minutes. Of course this could be a configuration error, but even with hours of tweaking HAProxy we could not seem to improve the failover times.

3. In both cases of server failure, response time increases [...]

Appendix

Web Farm Configuration

HAProxy Configuration
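The block below is not the configuration we actually ran; it is purely a hypothetical stand-in for the kind of file referred to above, with every hostname, address, port, and value invented for illustration. It shows one primary/backup arrangement for the dispatcher and MySQL tiers, with VM50 as the backup in both cases.

    # Hypothetical example; not the HAProxy configuration used in this lab.
    global
        maxconn 4096
        daemon

    # Dispatcher tier: VM52/VM53 primary, VM50 as backup.
    listen dispatchers 0.0.0.0:8000
        mode tcp
        balance roundrobin
        retries 3
        contimeout 5000
        clitimeout 50000
        srvtimeout 50000
        server vm52 10.0.0.52:8000 check inter 2000 rise 2 fall 3
        server vm53 10.0.0.53:8000 check inter 2000 rise 2 fall 3
        server vm50 10.0.0.50:8000 check inter 2000 rise 2 fall 3 backup

    # MySQL tier: plain TCP forwarding, VM54/VM55 primary, VM50 as backup.
    listen mysql 0.0.0.0:3306
        mode tcp
        balance roundrobin
        contimeout 5000
        clitimeout 50000
        srvtimeout 50000
        server vm54 10.0.0.54:3306 check inter 2000 rise 2 fall 3
        server vm55 10.0.0.55:3306 check inter 2000 rise 2 fall 3
        server vm50 10.0.0.50:3306 check inter 2000 rise 2 fall 3 backup

Forwarding both tiers as plain TCP is only one possible wiring; the report does not record how the database and Memcached failover was actually hooked up, so this is a guess at the shape of the file rather than a reconstruction of it.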