The Practice of System and Network Administration, Second Edition (part 3)

Chapter 6 Data Centers

6.1 The Basics

❖ Warning: Mobile Phones with Cameras

Rented colocation facilities often forbid cameras and therefore forbid mobile phones that include cameras.

6.1.10 Console Access

Certain tasks can be done only from the console of a computer. Console servers and KVM switches make it possible to remotely access a computer’s console. For an in-depth discussion, refer to Section 4.1.8.

Console servers allow you to maintain console access to all the equipment in the data center, without the overhead of attaching a keyboard, video monitor, and mouse to every system. Having lots of monitors, or heads, in the data center is an inefficient way to use the valuable resource of data center floor space and the special power, air conditioning, and fire-suppression systems that are a part of it. Keyboards and monitors in data centers also typically provide a very unergonomic environment to work in if you spend a lot of time on the console of a server attached to a head in a data center.

Console servers come in two primary flavors. In one, switch boxes allow you to attach the keyboard, video monitor, and mouse ports of many machines through the switch box to a single keyboard, video, and mouse (KVM). Try to have as few such heads in the data center as you can, and try to make the environment they are in an ergonomic one.

The other flavor is a console server for machines that support serial consoles. The serial port of each of these machines is connected to a serial device, such as a terminal server. These terminal servers are on the network. Typically, some software on a central server controls them all (Fine and Romig 1990) and makes the consoles of the machines available by name, with authentication and some level of access control. The advantage of this system is that an SA who is properly authenticated can access the console of a system from anywhere: at a desk, at home, or on the road and connected by remote access. Installing a console server improves productivity and convenience, cleans up the data center, and yields more space (Harris and Stansell 2000).

It can also be useful to have a few carts with dumb terminals or laptops that can be used as portable serial consoles. These carts can be conveniently wheeled up to any machine and used as a serial console if the main console server fails or an additional monitor and keyboard are needed. One such cart is shown in Figure 6.15.

Figure 6.15 Synopsys has several serial console carts that can be wheeled up to a machine if the main console server fails or if the one machine with a head in the machine room is in use.

6.1.11 Workbench

Another key feature for a data center is easy access to a workbench with plenty of power sockets and an antistatic surface where SAs can work on machines: adding memory, disks, or CPUs to new equipment before it goes into service or perhaps taking care of something that has a hardware fault. Ideally, the workbench should be near the data center but not part of it, so that it is not used as temporary rack space and so that it does not make the data center messy. These work spaces generate a lot of dust, especially if new hardware is unboxed there. Keeping this dust outside the data center is important.

Lacking space to perform this sort of work, SAs will end up doing repairs on the data center floor and new installs at their desks, leading to unprofessional, messy offices or cubicles with boxes and pieces of equipment lying around. A professionally run SA group should look professional. This means having a properly equipped and sufficiently large work area that is designated for hardware work.

❖ People Should Not Work in the Data Center

Time and time again, we meet SAs whose offices are desks inside the data center, right next to all the machines. We strongly recommend against this.

It is unhealthy for people to work long hours in the data center. The data center has the perfect temperature and humidity for computers, not people. It is unhealthy to work in such a cold room and dangerous to work around so much noise.

It is also bad for the systems. People generate heat. Each person in the data center requires an additional 600 BTU of cooling. That is 600 BTU of additional stress on the cooling system and the power to run it.

It is bad financially. The cost per square meter of space is considerably higher in a data center.

SAs need to work surrounded by reference manuals, ergonomic desks, and so on: an environment that maximizes their productivity. Remote access systems, once rare, are now inexpensive and easy to procure. People should enter the room only for work that can’t be done any other way.

6.1.12 Tools and Supplies

Your data center should be kept fully stocked with all the various cables, tools, and spares you need. This is easier to say than to do. With a large group of SAs, it takes continuous tracking of the spares and supplies and support from the SAs themselves to make sure that you don’t run out, or at least run out only occasionally and not for too long. An SA who notices that the data center is running low on something or is about to use a significant quantity of anything should inform the person responsible for tracking the spares and supplies.

Ideally, tools should be kept in a cart with drawers, so that it can be wheeled to wherever it is needed. In a large machine room, you should have multiple carts. The cart should have screwdrivers of various sizes, a couple of electric screwdrivers, Torx drivers, hex wrenches, chip pullers, needle-nose pliers, wire cutters, knives, static straps, a label maker or two, and anything else that you find yourself needing, even occasionally, to work on equipment in the data center.

Spares and supplies must be well organized so that they can be quickly picked up when needed and so that it is easy to do an inventory. Some people hang cables from wall hooks with labels above them; others use labeled bins of varying sizes that can be attached to the walls in rows. A couple of these arrangements are shown in Figures 6.16 and 6.17. The bins provide a more compact arrangement but need to be planned for in advance of laying out the racks in the data center, because they will protrude significantly into the aisle. Small items, such as rack screws and terminators, should be in bins or small drawers. Many sites prefer to keep spares in a different room with easy access from the data center. A workroom near the data center is ideal. Keeping spares in another room may also protect them from the event that killed the original. Large spares, such as spare machines, should always be kept in another room so that they don’t use valuable data center floor space. Valuable spares, such as memory and CPUs, are usually kept in a locked cabinet.

Figure 6.16 Various sizes of labeled blue bins are used to store a variety of data center supplies at GNAC, Inc.

Figure 6.17 Eircom uses a mixture of blue bins and hanging cables.

If possible, you should keep spares for the components that you use or that fail most often. Your spares inventory might include standard disk drives of various sizes, power supplies, memory, CPUs, fans, or even entire machines if you have arrays of small, dedicated machines for particular functions.

It is useful to have many kinds of carts and trucks: two-wheel hand-trucks for moving crates, four-wheel flat carts for moving mixed equipment, carts with two or more shelves for tools, and so on. Mini-forklifts with a hand-cranked winch are excellent for putting heavy equipment into racks, enabling you to lift and position the piece of equipment at the preferred height in the rack. After the wheels are locked, the lift is stable, and the equipment can be mounted in the rack safely and easily.

6.1.13 Parking Spaces

A simple, cheap, effective way to improve the life of people who work in the data center is to have designated parking spaces for mobile items. Tools that are stored in a cart should have their designated place on the cart labeled. Carts should have labeled floor space where they are to be kept when unused. When someone is done using the floor tile puller, there should be a labeled spot to return the device. The chargers for battery-operated tools should have a secure area. In all cases, the mobile items should be labeled with their return location.

Case Study: Parking Space for Tile Pullers

Two tile pullers were in the original Synopsys data center that had a raised floor. However, because there was no designated place to leave the tile pullers, the SAs simply put them somewhere out of the way so that no one tripped over them. Whenever SAs wanted a tile puller, they had to walk up and down the rows until they found one. One day, a couple of SAs got together and decided to designate a parking space for them. They picked a particular tile where no one would be in danger of tripping over them, labeled the tile to say, “The tile pullers live here. Return them after use,” and labeled each tile puller with, “Return to tile at E5,” using the existing row and column labeling on the walls of the data center. The new practice was not particularly communicated to the group, but as soon as they saw the labels, the SAs immediately started following the practice: It made sense, and they wouldn’t have to search the data center for tile pullers any more.

6.2 The Icing

You can improve your data center above and beyond the facilities that we described earlier. Equipping a data center properly is expensive, and the improvements that we outline here can add substantially to your costs. But if you are able to, or your business needs require it, you can improve your data center by having much wider aisles than necessary and by having greater redundancy in your power and HVAC systems.

6.2.1 Greater Redundancy

If your business needs require very high availability, you will need to plan for redundancy in your power and HVAC systems, among other things. For this sort of design, you need to understand circuit diagrams and building blueprints and consult with the people who are designing the system to make sure that you catch every little detail, because it is the little detail you miss that is going to get you.

For the HVAC system, you may want to have two independent parallel systems that run all the time. If one fails, the other will take over. Either one on its own should have the capacity to cool the room. Your local HVAC engineer should be able to advise you of any other available alternatives.

For the power system, you need to consider many things. At a relatively simple level, consider what happens if a UPS, a generator, or the ATS fails. You can have additional UPSs and generators, but what if two fail? What if one of the UPSs catches fire? If all of them are in the same room, they will all need to be shut down. Likewise, the generators should be distributed. Think about bypass switches for removing failed pieces of equipment from the circuit, in addition to the bypass switch that, ideally, you already have for the UPS. Those switches should not be right next to the piece of equipment that you want to bypass, so that you can still get to them if the equipment is on fire. Do all the electrical cables follow the same path or meet at some point? Could that be a problem?
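The question of whether electrical cables share a path can be checked mechanically once the routes are written down. The following is a minimal sketch, not a method from the book: it assumes you keep a hypothetical inventory (`cable_paths`) that lists, for each power feed, the rooms, risers, and conduits it passes through, and it reports every pair of feeds that has a segment in common. All feed and segment names are invented for illustration.

```python
from itertools import combinations


def shared_segments(paths):
    """Return {(feed_a, feed_b): common_segments} for each pair of feeds that overlap.

    `paths` maps a feed name to the list of physical segments (rooms, risers,
    conduits) that the feed's cabling passes through.
    """
    overlaps = {}
    for (name_a, segs_a), (name_b, segs_b) in combinations(paths.items(), 2):
        common = set(segs_a) & set(segs_b)
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Hypothetical inventory: two UPS feeds and one generator feed.
cable_paths = {
    "ups_a": ["ups_room_1", "riser_east", "conduit_12"],
    "ups_b": ["ups_room_2", "riser_east", "conduit_14"],
    "generator": ["generator_yard", "riser_west", "conduit_20"],
}

overlaps = shared_segments(cable_paths)
for (feed_a, feed_b), segments in overlaps.items():
    print(f"{feed_a} and {feed_b} share: {sorted(segments)}")
```

In this made-up inventory the two UPS feeds both run through the same riser, so a fire or flood in that riser could take out both. A check like this is only as good as the inventory behind it; it does not replace reading the blueprints with the people who designed the system.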
Within the data center, you may want to make power available from several sources. You may want both alternating current (AC) and direct current (DC) power, but you may also want two different sources of AC power for equipment that can have two power supplies or to power each half of a redundant pair of machines. Equipment with multiple power supplies should take power from different power sources (see Figure 6.18).

❖ High-Reliability Data Centers

The telecommunications industry has an excellent understanding of how to build a data center for reliability, because the phone system is used for emergency services and must be reliable. The standards were also set forth when telecommunication monopolies had the money to go the extra distance to ensure that things were done right. Network Equipment Building System (NEBS) is the U.S. standard for equipment that may be put in a phone company’s central office. In Europe, the equipment must follow the European Telecommunication Standards Institute (ETSI) standard. NEBS and ETSI set physical requirements and testing standards for equipment, as well as minimums for the physical room itself. These document in detail topics such as space planning, floor and heat loading, temperature and humidity, earthquake and vibration, fire resistance, transportation and installation, airborne contamination, acoustic noise, electrical safety, electromagnetic interference, electrostatic discharge (ESD) immunity, lightning protection, DC potential difference, and bonding and grounding. We only mention this to show how anal retentive the telecom industry is. On the other hand, when was the last time you picked up your telephone and didn’t receive a dial tone in less than a second?
The NEBS and ETSI standards are good starting places when creating your own set of requirements for a very-high-availability data center.

For a high-availability data center, you also need good process. The SAS-70 standard applies to service organizations and is particularly relevant to companies providing services over the Internet. SAS-70 stands for Statement of Auditing Standards No. 70, which is entitled “Reports on the Processing of Transactions by Service Organizations.” It is an auditing standard established by the American Institute of Certified Public Accountants (AICPA).

Figure 6.18 GNAC, Inc., brings three legs of UPS power into a single power strip. Redundant power supplies in a single piece of equipment are plugged into different legs to avoid simultaneous loss of power to both power supplies if one leg fails.

6.2.2 More Space

If space is not at a premium, it is nice to have more aisle space than you need in your computer room to meet safety laws and to enable you to move equipment around. One data center that Christine visited had enough aisle space to pull a large piece of equipment out of a rack onto the floor and wheel another one behind it without knocking into anything. Cray’s data center in Eagan, Minnesota, had aisles that were three times the depth of the deepest machine. If you are able to allocate this much space, based on your long-term plans, so that you will not have to move the racks later, treat yourself. It is a useful luxury, and it makes the data center a much more pleasant environment.

6.3 Ideal Data Centers

Different people like different features in a data center. To provide some food for thought, Tom and Christine have described the features each would like in a machine room.

6.3.1 Tom’s Dream Data Center

When you enter my dream data center, the first thing you notice is the voice-activated door. To make sure that someone didn’t record your voice and play it back, you are prompted for a dictionary word, which you must then repeat back. The sliding door opens. It is wide enough to fit a very large server, such as an SGI Challenge XL, even though those servers aren’t sold any more. Even though the room has a raised floor, it is the same height as the hallway, which means that no ramp is required.

The room is on the fourth floor of a six-story building. The UPS units and HVAC systems are in the sixth-floor attic, with plenty of room to grow and plenty of conduit space if additional power or ventilation needs to be brought to the room. Flooding is unlikely.

The racks are all the same color and from the same vendor, which makes them look very nice. In fact, they were bought at the same time, so the paint fades evenly. A pull-out drawer at the halfway point of every third rack has a pad of paper and a couple of pens. (I never can have too many pens.) Most of the servers mount directly in the rack, but a few racks have five shelves: two below the drawer, one just above the drawer, and two farther up the rack. The shelves are at the same height on all racks so that it looks neat and are strong enough to hold equipment and still roll out. Machines can be rolled out to do maintenance on them, and the cables have enough slack to permit this. When equipment is to be mounted, the shelves are removed or installed on racks that are missing shelves. Only now you notice that some of the racks (the ones at the far end of the room) are missing shelves in anticipation of equipment that will be mounted and not require shelves.

The racks are 19-inch, four-post racks. The network patch-panel racks, which do not require cooling, have doors on the front and open in the back. The racks are locked together so that each row is self-stable.

Each rack is as wide as a floor tile: 2 feet, or one rack per floor tile. Each rack is 3 feet deep, or 1.5 floor tiles deep. A row of racks takes up 1.5 tiles, and the walkway between them takes an equal amount of space. Thus, every three tiles is a complete rack and walkway combination that includes one tile that is completely uncovered and can therefore be removed when access is required. If we are really lucky, some or all rows have an extra tile between them. Having the extra 2 feet makes it much easier to rack-mount bulky equipment (see Figure 6.19).

The racks are in rows that are no more than 12 racks long. Between every row is a walkway large enough to bring the largest piece of equipment through. Some rows are missing or simply missing a rack or two nearest the walkway. This space is reserved for machines that come with their own rack or are floor-standing servers.

If the room is large, it has multiple walkways. If the room is small, its one walkway is in the middle of the room, where the door is. Another door, used less frequently, is in the back for fire safety reasons. The main door gives an excellent view of the machine room when tours come through. The machine room has a large shatterproof plastic window. Inside the room, by the window, is a desk with three monitors that display the status of the LAN, WAN, and services.

The back of each rack has 24 network jacks, with cabling certified for Cat-6. The first 12 jacks go to a patch panel near the network equipment. The next 12 go to a different patch panel near the console consolidator. Although the consoles do not require Cat-6 copper, using the same copper consistently means that one can overflow network connections into the console space. In case fiber is someday needed, every rack, or simply every other rack, has six pairs of fiber that run back to a fiber patch panel. The popularity of storage-area networks (SANs) is making fiber popular again.

Chapter 10 Disaster Recovery and Data Integrity

A disaster-recovery plan looks at what disasters could hit the company and lays out a plan for responding to those disasters. Disaster-recovery planning includes implementing ways to mitigate potential disasters and making preparations to enable quick restoration of key services.
The plan identifies what those key services are and specifies how quickly they need to be restored.

All sites need to do some level of disaster-recovery (DR) planning. DR planners must consider what happens if something catastrophic occurs at any one of their organization’s sites and how they can recover from it. We concentrate on the electronic data aspects of DR. However, this part of the plan should be built as part of a larger program in order to meet the company’s legal and financial obligations. Several books are dedicated to disaster-recovery planning, and we recommend them for further reading: Fulmer (2000), Levitt (1997), and Schreider (1998).

Building a disaster-recovery plan involves understanding both the risks that your site faces and your company’s legal and fiduciary responsibilities. From this basis, you can begin your preparations. This chapter describes what is involved in building a plan for your site.

10.1 The Basics

As with any project, building a disaster-recovery plan starts with understanding the requirements: determining what disasters could afflict your site, the likelihood that those disasters will strike, the cost to your company if they strike, and how quickly the various parts of your business need to be revived. Once you and your management understand what can happen, you can get a budget allocated for the project and start looking at how to meet and, preferably, beat those requirements.

10.1.1 Definition of a Disaster

A disaster is a catastrophic event that causes a massive outage affecting an entire building or site. A disaster can be anything from a natural disaster, such as an earthquake, to the more common problem of stray backhoes cutting your cables by accident. A disaster is anything that has a significant impact on your company’s ability to do business.

Lack of Planning Can Cause Risk-Taking

A computer equipment manufacturer had a facility in the west of Ireland. A fire started in the building, and the staff knew that the fire-protection system was inadequate and that the building would be very badly damaged. Several staff members went to the data center and started throwing equipment out the window because the equipment had a better chance of surviving the fall than the fire. Other staff members then carried the equipment up the hill to their neighboring building. The staff members in the burning building left when they judged that the fire hazard was too great. All the equipment that they had thrown out the window actually survived, and the facility was operational again in record time. However, the lack of a disaster-recovery plan and adequate protection systems resulted in staff members’ risking their lives. Fortunately, no one was badly injured in this incident. But their actions were in breach of fire safety codes and extremely risky because no one there was qualified to judge when the fire had become too hazardous.

10.1.2 Risk Analysis

The first step in building a disaster-recovery plan is to perform a risk analysis. Risk management is a good candidate for using external consultants because it is a specialized skill that is required periodically, not daily. A large company may hire specialists to perform the risk analysis while having an in-house person responsible for risk management.

A risk analysis involves determining what disasters the company is at risk of experiencing and what the chances are of those disasters occurring. The analyst determines the likely cost to the company if a disaster of each type occurred. The company then uses this information to decide approximately how much money is reasonable to spend on trying to mitigate the effects of each type of disaster.

The approximate budget for risk mitigation is determined by the formula

(Probable cost of disaster − Probable cost after mitigation) × Risk of disaster

For example, if a company’s premises has one chance in a million of being affected by flooding, and if a flood would cost the company $10 million, the budget for mitigating the effects of the flood would be in the range of $10. In other words, it’s not even worth stocking up on sandbags in preparation for a flood. On the other hand, if a company has 1 chance in 3,000 of being within 10 miles of the epicenter of an earthquake measuring 5.0 on the Richter scale, which would cause a loss of $60 million, the budget for reducing or preventing that damage would be in the $20,000 range.

A simpler, smaller-scale example is a large site that has a single point of failure whereby all LANs are tied together by one large router. If it died, it would take one day to repair, and there is a 70 percent chance that failure will occur once every 24 months. The outage would cause 1,000 people to be unable to work for a day. The company estimates the loss of productivity to be $68,000. When the SAs go looking for redundancy for the router, the budget is approximately $23,800 (0.70 × $68,000 ÷ 2, treating the two-year failure probability as an annual one). The SAs also need to investigate the cost of reducing the outage time to hours, for example, by increasing support contract level. If that costs a reasonable amount, it further reduces the amount the company would lose and therefore the amount it should spend on full redundancy.

This view of the process is somewhat simplified. Each disaster can occur to different degrees with different likelihoods and a wide range of cost implications. Damage prevention for one level of a particular disaster will probably have mitigating effects on the amount of damage sustained at a higher level of the same disaster. All this complexity is taken into account by a professional risk analyst when she recommends a budget for the various types of disaster preparedness.

10.1.3 Legal Obligations

Beyond the basic cost to the company, additional considerations need to be taken into account as part of the DR planning process. Commercial companies have legal obligations to their vendors, customers, and shareholders in terms of meeting contract obligations.
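As an aside on the budget formula from Section 10.1.2, the three worked figures can be reproduced with a few lines of Python. This is a rough sketch under stated assumptions rather than a risk-analysis tool: the function name `mitigation_budget` is invented here, the cost after mitigation is assumed to be zero in every example (the text does not give one), and the router case assumes that the 70 percent chance per 24 months is halved to an annual 35 percent, which is the reading that yields the quoted $23,800.

```python
def mitigation_budget(cost_of_disaster, cost_after_mitigation, risk):
    """(Probable cost of disaster - probable cost after mitigation) x risk of disaster."""
    return (cost_of_disaster - cost_after_mitigation) * risk


# Assumption: mitigation removes the whole loss, so cost_after_mitigation is 0.
flood = mitigation_budget(10_000_000, 0, 1 / 1_000_000)
earthquake = mitigation_budget(60_000_000, 0, 1 / 3_000)
# Assumption: 70 percent per 24 months is treated as 35 percent per year.
router = mitigation_budget(68_000, 0, 0.70 / 2)

for name, budget in [("flood", flood), ("earthquake", earthquake), ("router", router)]:
    print(f"{name}: about ${budget:,.0f}")
```

Passing a nonzero `cost_after_mitigation` shows how a partial fix, such as a better support contract that shortens the outage, shrinks the budget that full redundancy can justify.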
Public companies have to abide 264 Chapter 10 Disaster Recovery and Data Integrity by the laws of the stock markets on which they are traded Universities have contractual obligations to their students Building codes and work-safety regulations also must be followed The legal department should be able to elaborate on these obligations Typically, they are of the form “The company must be able to resume shipping product within week” or “The company can delay reporting quarterly results by at most days under these circumstances.” Those obligations translate into requirements for the DR plan They define how quickly various pieces of the physical and electronic infrastructure must be restored to working order Restoring individual parts of the company to working order before the entire infrastructure is operational requires an in-depth understanding of what pieces of infrastructure those parts rely on and a detailed plan of how to get them working Meeting the time commitments also requires an understanding of how long restoring those components will take We look at that further in Section 10.1.5 10.1.4 Damage Limitation Damage limitation is about reducing the cost of the disasters Some damage limitation can come at little or no cost to the company through advance planning and good processes Most damage limitation does involve additional cost to the company and is subject to the cost/benefit analysis that is performed by the risk analysts For little or no cost, there are ways to reduce the risk of a disaster’s causing significant damage to the company or to limit the amount of damage that the disaster can inflict For example, in an area prone to minor flooding, placing critical services above ground level may not significantly increase construction and move-in costs but avoids problems in the future Choosing rack-mountable and reasonably sturdy racks to bolt it into rather than putting equipment on shelves can significantly reduce the impact of a minor earthquake for 
little or no extra cost Using lightning rods in the construction of buildings in an area that is prone to lightning storms is also a cheap way of limiting damage These steps are particularly economical because they fix the problem once, rather than requiring a recurring cost Limiting the damage caused by a major disaster is more costly and always should be subject to a cost/benefit analysis For example, a data center could be built in an underground military-style bunker to protect against tornados and bombs In an earthquake zone, expensive mechanisms allow racks to move independently in a constrained manner to reduce the risk of 10.1 The Basics 265 computer backplanes’ shearing, the major issue with rigidly fixed racks during a strong earthquake These mechanisms for limiting damage are so costly that only the largest companies are likely to be able to justify implementing them Most damage-limitation mechanisms fall somewhere between “almost free” and “outlandishly expensive.” Fire-prevention systems typically fall into the latter category It is wise to consider implementing a fire-protection system that is designed to limit damage to equipment in the data center when activated Local laws and human-safety concerns limit what is possible in this area, but popular systems at the time of writing include inert gas systems and selective, limited-area, water-based systems with early-warning mechanisms that permit an operator to detect and resolve a problem, such as a disk or power supply catching fire, before the fire-protection system is activated Systems for detecting moisture under raised data center floors or in rarely visited UPS or generator rooms are also moderately priced damage-limitation mechanisms Another area that often merits attention is loss of power to a building or campus Short power outages, spikes, or brownouts can be handled by a UPS; longer interruptions will require a generator The more equipment on protected power—freezers at biotech companies, 
call centers at customerservice companies—and the longer you need to be able to run them, the more it will cost See Chapter for building disaster-proof data centers, particularly Section 6.1.1 discusses issues related to picking a good location and Section 6.1.4 explains issues related to power 10.1.5 Preparation Even with a reasonable amount of damage control in place, your organization may still experience a disaster Part of your disaster planning must be preparation for this eventuality Being prepared for a disaster means being able to restore the essential systems to working order in a timely manner, as defined by your legal obligations Restoring services after a disaster can require rebuilding the necessary data and services on new equipment if the old equipment is not operational Thus, you need to arrange a source of replacement hardware in advance from companies that provide this service You also need to have another site to which this equipment can be sent if the primary site cannot be used because of safety reasons, lack of power, or lack of connectivity Make sure that the company providing the standby equipment knows where to send it in 266 Chapter 10 Disaster Recovery and Data Integrity an emergency Make sure that you get turnaround time commitments from the provider and that you know what hardware the company will be able to provide on short notice Don’t forget to take the turnaround time on this equipment into account when calculating how long the entire process will take If a disaster is large enough to require the company’s services, chances are it will have other customers that also are affected Find out how the company plans to handle the situation in which both you and your neighbor have the right to the only large Sun server that was stockpiled Once you have the machines, you need to recreate your system Typically, you first rebuild the system and then restore the data This requires that you have backups of the data stored off-site—usually at a 
commercial storageand-retrieval service You also need to be able to easily identify which tapes are required for restoring the essential services This part of the basic preparation is built on infrastructure that your site should have already put in place An ongoing part of disaster preparation is to try retrieving tapes from the off-site storage company on a regular basis to see how long it takes This time is subtracted from the total amount of time available to completely restore the relevant systems to working order If it takes too long to get the tapes, it may be impossible to complete the rebuild on time For more on these issues, see Chapter 26, particularly Section 26.2.2 A site usually will need to have important documents archived at a document repository for safekeeping Such repositories specialize in DR scenarios If your company has one, you may want to consider also using it to house the data tapes Remember that you may need power, telephone, and network connectivity as part of restoring the services Work with the facilities group on these aspects It may be advisable to arrange an emergency office location for the critical functions as part of the disaster plan Good Preparation for an Emergency Facility A company had a California call center that was used by its customers, predominantly large financial institutions The company had a well-rehearsed procedure to execute in case of a disaster that affected the call center building The company had external outlets for providing power and the call center phone services and appropriate cables and equipment standing by, including tents and folding tables and chairs When a strong earthquake struck in 1991, the call center was rapidly relocated outside and was operational again within minutes Not long after it was set up, it received lots of calls from its customers in New York, who wanted to make sure that services were still available 10.1 The Basics 267 if they required them The call center staff calmly 
reassured customers that all services were operating normally. Customers had no idea that they were talking to someone who was sitting on a chair in the grass outside the building. The call center had to remain outside for several days until the building was certified safe. But from the customers’ perspectives, it remained operational the entire time. The plan to relocate the call center outside in the case of emergency worked well because the most likely emergency was an earthquake and the weather was likely to be dry, at least long enough for the tents to be put up. The company had prepared well for its most likely disaster scenario.

10.1.6 Data Integrity

Data integrity means ensuring that data is not altered by external sources. Data can be corrupted maliciously by viruses or individuals. It can also be corrupted inadvertently by individuals, bugs in programs, and undetected hardware malfunctions. For important data, consider ways to ensure integrity as part of day-to-day operations or the backup or archival process. For example, data that should not change can be checked against a read-only checksum of the data. Databases that should experience small changes or should have only data added should be checked for unexpectedly large changes or deletions. Examples include source code control systems and databases of gene sequences. Exploit your knowledge of the data on your systems to automate integrity checking. Disaster planning also involves ensuring that a complete and correct copy of the corporate data can be produced and restored to the systems. For disaster recovery, it must be a recent, coherent copy of the data with all databases in sync. Data integrity meshes well with disaster recovery.

Industrial espionage and theft of intellectual property are not uncommon, and a company may find itself needing to fight for its intellectual property rights in a court of law. The ability to accurately restore data as it existed on a certain date can also be used to prove ownership of
intellectual property. To be used as evidence, the date of the information retrieved must be accurately known, and the data must be in a consistent state. For both disaster-recovery purposes and use of the data as evidence in a court, the SAs need to know that the data has not been tampered with.

It is important to make sure that the implementers put in place the data-integrity mechanisms that the system designers recommend. It is inadvisable to wait for corruption to occur before recognizing the value of these systems.

10.2 The Icing

The ultimate preparation for a disaster is to have fully redundant versions of everything that can take over when the primary version fails. In other words, have a redundant site with redundant systems. In this section, we look at having a redundant site and some ways a company might be able to make that site more cost-effective.

Although this sounds expensive, in large companies, especially banks, having a redundant site is a minimum. In fact, large companies have stopped using the term disaster recovery and instead use the term contingency planning or business continuity planning.

10.2.1 Redundant Site

For companies requiring high availability, the next level of disaster planning is to have a fully redundant second site in a location that will not be affected by the same disaster. For most companies, this is an expensive dream. However, if a company has two locations with data centers, it may be possible to duplicate some of the critical services across both data centers so that the only problem that remains to be solved is how the people who use those services get access to the redundant site.

Rather than permanently having live redundant equipment at the second site, it can instead be used as an alternative location for rebuilding the services. If the company has a contract for an emergency supply of equipment, that equipment could be sent to the alternative data center site. If the site that
was affected by the disaster is badly damaged, this may be the fastest way to have the services up and running. Another option is to designate some services at each site as less critical and to use the equipment from those services to rebuild the critical services from the damaged site. Sometimes, you are lucky enough to have a design that compartmentalizes various pieces, making it easy to design a redundant site.

10.2.2 Security Disasters

A growing concern is security disasters. Someone breaks into the corporate web site and changes the logo to be obscene. Someone steals the database of credit card numbers from your e-commerce site. A virus deletes all the files it can access. Unlike with natural disasters, no physical harm occurs, and the attack may not be from a physically local phenomenon.

A similar risk analysis can be performed to determine the kinds of measures required to protect data. Architecture decisions have a risk component. One can manage the risk in many ways: by building barriers around the system or by monitoring the system so that it can be shut down quickly in the event of an attack.

We continually see sites that purchase large, canned systems without asking for an explanation of the security risks of such systems. Although no system is perfectly secure, a vendor should be able to explain the product’s security structure, the risk factors, and how recovery would occur in the event of data loss. Chapter 11 covers constructing security policies and procedures that take into account DR plans.

10.2.3 Media Relations

When a disaster occurs, the media will probably want to know what happened, what effect it is having on the company, and when services will be restored. Sadly, the answer to all three questions is usually, “We aren’t sure.” This can be the worst answer you can give a reporter. Handling the media badly during a disaster can cause bigger problems than the original disaster.

We have two simple recommendations on this topic. First, have a
public relations (PR) firm on retainer before a disaster so that you aren’t trying to hire one as the disaster is happening. Some PR firms specialize in disaster management, and some are proficient at handling security-related disasters. Second, plan ahead of time how you will deal with the media. This plan should include who will talk to the media, what kinds of things will and will not be said, and what the chain of command is if the designated decision makers aren’t available. Anyone who talks to the media should receive training from your PR firm.

Note that these recommendations have one thing in common: They both require planning ahead of time. Never be in a disaster without a media plan. Don’t try to write one during a disaster.

10.3 Conclusion

The most important aspect of disaster planning is understanding what services are the most critical to the business and what the time constraints are for restoring those services. The disaster planner also needs to know what disasters are likely to happen and how costly they would be before he can complete a risk analysis and determine the company’s budget for limiting the damage.

A disaster plan should be built with consideration of those criteria. It should account for the time to get new equipment, retrieve the off-site backups, and rebuild the critical systems from scratch. Doing so requires advance planning for getting the correct equipment and being able to quickly determine which backup tapes are needed for rebuilding the critical systems.

The disaster planner must look for simple ways to limit damage, as well as more complex and expensive ways. Preparations that are automatic and become part of the infrastructure are most effective. Fire containment, water detection, earthquake bracing, and proper rack-mount equipment fall into this category. The disaster planner also must prepare a plan for a team of people to execute in case of emergency. Simple plans are often the most
effective. The team members must be familiar with their individual roles and should practice a few times a year.

Full redundancy, including a redundant site, is an ideal that is beyond the budget of most companies. If a company has a second data center site, however, there are ways to incorporate it into the disaster plan at reasonable expense.

Exercises

Which business units in your company would need to be up and running first after a disaster, and how quickly would they need to be operational?
What commitments does your company have to its customers, and how do those commitments influence your disaster planning?
What disasters are most likely to hit each of your sites? How big an area might that disaster affect, and how many of your company’s buildings could be affected?
What would the cost be to your company if a moderate disaster hit one of its locations?
What forms of disaster limitation do you have in place now?
What forms of disaster limitation would you like to implement? How much would each of them cost?
If you lost use of a data center facility because of a disaster, how would you restore service?
What are your plans for dealing with the media in the event of a disaster? What is the name of the PR firm you retain to help you?
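The integrity check described in Section 10.1.6, comparing data that should not change against a stored checksum and flagging unexpectedly large deletions, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the choice of SHA-256, and the 5 percent deletion threshold are hypothetical and do not come from the text. In practice, the baseline manifest would be kept on read-only or off-site media so that it cannot be altered along with the data.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report files modified, deleted, or added since the baseline was taken."""
    return {
        "modified": sorted(
            k for k in baseline if k in current and baseline[k] != current[k]
        ),
        "deleted": sorted(k for k in baseline if k not in current),
        "added": sorted(k for k in current if k not in baseline),
    }


def too_many_deletions(
    report: dict[str, list[str]],
    baseline: dict[str, str],
    max_fraction: float = 0.05,  # assumed policy value, not from the text
) -> bool:
    """Flag an unexpectedly large share of deleted files."""
    if not baseline:
        return False
    return len(report["deleted"]) / len(baseline) > max_fraction
```

A site might build the manifest once for an archive that should never change, store it read-only, and run the comparison from a scheduled job. For data that should only grow, such as a source code repository, the "added" list is expected to be nonempty, while entries under "modified" or "deleted" deserve investigation.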
Chapter 11 Security Policy

Security entails much more than firewalls, intrusion detection, and authentication schemes. Although all are key components of a security program, security administrators also have to take on many different roles, and the skills for those roles are quite diverse. In this chapter, we look at all aspects of security and describe the basic building blocks that a company needs for a successful security program, some guiding principles, and some common security needs. We also briefly discuss how the approaches described here apply in various-size companies and mention some ways in which the approach you take to security may differ in an academic environment.

Security is a huge topic, with many fine books written on it. Zwicky, Chapman, and Cooper (2000) and Bellovin, Cheswick, and Rubin (2003) are excellent books about firewalls. The books by Garfinkel and Spafford (1996, 1997) discuss UNIX and Internet security, along with web security and commerce. Norberg and Russell (2000) and Sheldon and Cox (2000) provide details on Windows security. The book by Miller and Davis (2000) covers the area of intellectual property. Wood (1999) is well known for his sample security policies. Kovacich (1998) deals with the topic of establishing a program for information protection. Neumann (1997) and Denning (1999) cover the topics of risks and information warfare.

Security should be everyone’s job. However, it is important to have individuals who specialize in and focus on the organization’s security needs. Security is a huge, rapidly changing field, and to keep current, SAs working in security must focus all their attention on the security arena. Senior SAs with the right mind-set for security are good candidates for being trained by security specialists to join the security team.

Data security requires more negotiating skills and better contacts throughout the company than does any other area of system administration. People in many
companies perceive computer security as being an obstacle to getting work done. To succeed, you must dispel that notion and be involved as early as possible in any projects that affect the electronic security of the company.

If you learn one thing from this chapter, we hope it is this: The policy that gets adhered to is the one that is the most convenient for the customer. If you want people to follow a policy, make sure that it is their easiest option. For example, if you want people to use encrypted email, make sure that the application you provide makes it as easy as sending regular email, or they won’t do it. Refining a policy or technology to the point that it is easier than all the alternatives is a lot of work for you. However, it is better than spending your time fighting security problems.

Over time, computer use has evolved from requiring physical proximity to remote access from around the world. Security trends have evolved in tandem. How will the computer and network access models change in the future, and what impact will such change have on security?
The trend so far has been a decreasing ability to rely on physical security or on trust, with more use of encryption and strong authentication. Each new access model has required an increased need for planning, education, testing, and new technology.

11.1 The Basics

Two basic patterns in security policy are perimeter security and defense in depth.

Perimeter security is like a castle with a high wall around it. Make a good wall, and you are free to do what you want inside. Put a good firewall at the entry to your network, and you don’t have to worry about what’s inside. Bellovin, Cheswick, and Rubin (2003) refer to this as the crunchy candy shell with the soft gooey center, or a policy of putting all your eggs in one basket and then making sure that you have a really good basket. The problem with perimeter security is that the crunchy candy shell is disappearing as wireless networks become common and as organizations cross-connect networks with partners.

Defense in depth refers to placing security measures at all points in a network. For example, a firewall protects the organization from attacks via the Internet, an antivirus system scans each email message, antimalware software runs on each individual PC, and encryption is used between computers to ensure privacy and authentication.

As with the design of other infrastructure components, the design of a security system should be based on simplicity, usability, and minimalism. Complexity obscures errors or chinks in your armor. An overly complex system will prove to be inflexible and difficult to use and maintain and will ultimately be weakened or circumvented for people to be able to work effectively.

In addition, a successful security architecture has security built into the system, not simply bolted on at the end. Good security involves an in-depth approach with security hooks at all levels of the system. If those hooks are not there from the beginning, they can be very difficult to add and integrate later.

Some
consider security and convenience to be inversely proportional. That is, to make something more secure makes it more difficult to use. This certainly has been true for a number of security products. We believe that when security is done correctly, it takes into account the customer’s ease of use. The problem is that often, it takes several years for technology to advance to this point. For example, passwords are a good start, but putting a password on every application becomes a pain for people who use a lot of applications in a day’s work. However, making things even more secure by deploying a secure single-sign-on system maximizes security and convenience by being much more secure yet nearly eliminating the users’ need to type passwords.

When security is inconvenient, your customers will find ways around it. When security technology advances sufficiently, however, the system becomes more secure and easier to use.

❖ Security and Reliability
Reliability and security go hand in hand. An insecure system is open to attacks that make it unreliable. Attackers can bring down an unreliable system by triggering the weaknesses, which is a denial-of-service (DoS) attack. If management is unconcerned with security, find out whether it is concerned with reliability. If so, address all security issues as reliability issues instead.

11.1.1 Ask the Right Questions

Before you can implement a successful security program, you must find out what you are trying to protect, from whom it must be protected, what the risks are, and what it is worth to the company. These business decisions should be made through informed discussion with the executive management of the company. Document the decisions that are made during this process, and review the final document with management. The document will need to evolve with the company but should not change too dramatically or frequently.

11.1.1.1 Information Protection

Corporate security is about protecting assets. Most often,
information is the asset that a company is most concerned about. The information to be protected can fall into several categories. A mature security program defines a set of categories and classifies information within those categories.

The classification of the information determines what level of security is applied to it. For example, information could be categorized as public, company confidential, and strictly confidential. Public information might include marketing literature, user manuals, and publications in journals or conferences. Company-confidential information might include organization charts, phone lists, internal newsletters with financial results, business direction, articles on a product under development, source code, or security policies. Strictly confidential information would be very closely tracked and available on a need-to-know basis only. It could include contract negotiations, employee information, top-secret product-development details, or a customer’s intellectual property.

Another aspect of information protection includes protecting against malicious alteration, deliberate and accidental release of information, and theft or destruction.

Case Study: Protect Against Malicious Alteration

Staff members from a major New York newspaper revealed to a security consultant that although they were concerned with information being stolen, their primary concern was with someone modifying information without detection. What if a report about a company was changed to say something false? What if the headline was replaced with foul language?
The phrase “Today, the [insert your favorite major newspaper] reported” has a lot of value, which would be diminished if intruders were able to change the paper’s content.

11.1.1.2 Service Availability

In most cases, a company wants to protect service availability. If a company relies on the availability of certain electronic resources to conduct its business, part of the mission of the security team will be to prevent malicious DoS attacks against those resources. Often, companies do not start thinking about this until they provide Internet-based services, because employees generally tend not to launch such attacks against their own company.

11.1.1.3 Theft of Resources

Sometimes, the company wants to protect against theft of resources. For example, if a production line is operated by computer equipment at less than full capacity because the computer has cycles being used for other purposes, the company will want to reduce the chance that compute cycles are used by intruders on that machine. The same applies to computer-controlled hospital equipment, where lives may depend on computing resources being available as needed. E-commerce sites are also concerned with theft of resources. Their systems can be slowed down by bandwidth pirates hiding FTP or chat servers in the infrastructure, resulting in lost business for the e-commerce company.

11.1.1.4 Summary

In cooperation with your management team, decide what you need to protect and from whom, how much that is worth to the company, and what the risks are. Define information categories and the levels of protection afforded to them. Document those decisions, and use this document as a basis for your security program. As the company evolves, remember to periodically reevaluate the decisions in that document with the management team.

Case Study: Decide What Is Important; Then Protect It

As a consultant, one gets to hear various answers to the question “What are you trying to protect?” The answer often, but not
always, is predictable.

A midsize electronic design automation (EDA) company that had a cross-functional information-protection committee regarded customers’ and business partners’ intellectual property, followed by their own intellectual property, as the most important things to be protected. Customers would send this company their chip designs if they were having problems with the tools or if they had a collaborative agreement for optimizing the software for the customers’ designs. The company also worked with business partners on collaborative projects that involved a two-way exchange of information. This third-party information always came with contractual agreements about security measures and restrictive access. The company recognized that if customers or business partners lost trust in its security, particularly by inadvertently giving others access to that information, the company would no longer have access to the information that had made this company the leader in its field, and ultimately, customers would go elsewhere. If someone gained access to the ...

Date posted: 14/08/2014, 14:20


Table of contents

  • The practice of system and network administration, 2nd ed
    • Part II: Foundation Elements
      • 6 Data Centers
        • 6.2 The Icing
        • 6.3 Ideal Data Centers
        • 6.4 Conclusion
      • 7 Networks
        • 7.1 The Basics
        • 7.2 The Icing
        • 7.3 Conclusion
      • 8 Namespaces
        • 8.1 The Basics
        • 8.2 The Icing
        • 8.3 Conclusion
      • 9 Documentation
        • 9.1 The Basics
        • 9.2 The Icing
        • 9.3 Conclusion
      • 10 Disaster Recovery and Data Integrity
        • 10.1 The Basics
        • 10.2 The Icing
        • 10.3 Conclusion
      • 11 Security Policy
        • 11.1 The Basics
