Telecom Risk and Security Part 4 – Facilities
October 14, 2009
Consider a 40-year-old building with much of its original mechanical and electrical infrastructure, including a 40-year-old, 4000 amp, 480 volt aluminum electrical buss duct that has been modified and “tapped” often during its life, with much of the work violating equipment specifications. Add buss insulation and other old materials gradually deteriorating, the duct expanding and contracting over the years, and the fact that aluminum was used in the original installation, either to save money or to try out a new technology, and it all becomes a risk: a risk of buss failure or, at worst, a failure severe enough to cause a massive electrical explosion.
Sound extreme? Now add a couple of additional factors. The building is a mixed-use telecom carrier hotel, with additional space used for commercial collocation and standard commercial office space. This narrows it down to most of the carrier hotel facilities in the US and Europe: old buildings converted to mixed-use carrier hotel and collocation facilities, mainly because of an abundance of vacant space during the mid-1990s and a need for telecom interconnection space following the Telecommunications Act of 1996.
Over the past four years the telecom, Internet, and data center industry has suffered several major electrical events. Some resulted in complete facility outages; others were contained by backup systems that operated as designed, preventing significant disruption to tenants and the services operated within the building.
A partial list of recent carrier hotel and data center facility outages or significant events includes some of the most important facilities in the telecom and Internet-connected industry:
- 365 Main in San Francisco
- RackSpace hosting facilities in Dallas
- Equinix facilities in Australia and France
- MPT in San Jose
- IBM facility in NZ
- Fisher Plaza in Seattle
- Cincinnati Bell
And the list goes on. These facilities are managed by good companies, but they have many issues in common, and most of those issues are human. The resulting outages caused havoc for a wide range of commercial companies, telecom companies, Internet services, and content.
The Human Factor in Facility Failures
Building a modern data center or carrier interconnection point follows a fairly simple series of tasks. Following a data center design and construction checklist, with strict compliance to the process and its individual steps, can often mean the difference between a well-run facility and one that is at risk of failure during a commercial power outage or a systems failure.
In the design/construction phase, data center operators follow a system of:
- Determining the scope of the project
- Developing a data center design specification based on both company and industry standards
- Designing a specific facility based on business scope and budget, which will comply with the standard design specification
- Publishing the design specification and distributing it to several candidate construction management and engineering companies
- Using a strong project manager to drive the construction, permitting, certification, and vendor management process
- Completing systems integration and commissioning prior to actual operations
Of all the above tasks, a complete commissioning plan and integration test is essential to building confidence that the data center or telecom facility will operate as planned. Many past outages have resulted from systems that were not fully tested or integrated prior to operations.
One example is the breaker coordination study. This is the process of ensuring that switchgear and panel breakers, from the point of electrical presentation by the local power utility down to individual breaker panels, are set, tested, and integrated according to vendor specification. Without a complete coordination study, there is no assurance that components within an electrical system will operate correctly either under normal conditions or during equipment failures, which makes the study an essential component of a complete systems integration test. Failure to complete a simple breaker coordination study during commissioning has resulted in major electrical failures in data centers as recently as 2008.
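As a rough illustration of what a coordination study looks for, the sketch below walks a hypothetical breaker hierarchy and flags any upstream breaker whose pickup current or trip delay is not greater than that of a breaker it feeds. This is a deliberately simplified assumption: a real study compares full vendor time-current curves, and every name and setting here is invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Breaker:
    """Hypothetical breaker record; values are illustrative, not vendor data."""
    name: str
    pickup_amps: float      # current at which the breaker begins to trip
    delay_seconds: float    # intentional trip delay at that current
    feeds: list[Breaker] = field(default_factory=list)


def coordination_problems(upstream: Breaker) -> list[str]:
    """Describe every upstream/downstream pair that fails the simplified check."""
    problems = []
    for downstream in upstream.feeds:
        if upstream.pickup_amps <= downstream.pickup_amps:
            problems.append(f"{upstream.name} pickup not above {downstream.name}")
        if upstream.delay_seconds <= downstream.delay_seconds:
            problems.append(f"{upstream.name} delay not above {downstream.name}")
        # Recurse so the whole path down to the last panel is checked.
        problems.extend(coordination_problems(downstream))
    return problems


panel_a = Breaker("Panel A", pickup_amps=400, delay_seconds=0.1)
panel_b = Breaker("Panel B", pickup_amps=800, delay_seconds=0.3)  # mis-set delay
distribution = Breaker("Distribution", 1200, 0.3, [panel_a, panel_b])
switchgear = Breaker("Switchgear", 4000, 0.5, [distribution])

for problem in coordination_problems(switchgear):
    print(problem)
```

In this invented hierarchy, the only pair flagged is Distribution and Panel B, which share the same delay, so a fault below Panel B could trip the larger upstream breaker as well.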
The InterNational Electrical Testing Association (NETA) provides guidance on electrical commissioning for data centers under “full design load” conditions. This includes recommendations for testing performance and operations, including the sequence of operations for electrical, mechanical, building management systems (BMS), and power monitoring/management. The actual levels of NETA testing are:
- Level 1: Submittal Review and Factory Testing
- Level 2: Site Inspection and Verification to Submittal
- Level 3: Installation Inspections and Verifications to Design Drawings
- Level 4: Component Testing to Design Loads
- Level 5: System Integration Tests at Full Design Loads
No company should consider collocation within a facility that cannot produce complete documentation showing that integration testing and commissioning were completed prior to facility operations, and that testing should be at NETA Level 5. In some cases documentation of “retro” testing is acceptable. However, potential tenants should be aware that this is still a compromise, as it is almost impossible to complete a retro-commissioning test in a live facility.
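A prospective tenant could track this requirement with something as small as the sketch below. It assumes, for illustration only, that a facility's documentation can be reduced to the set of levels it covers; the class and function names are hypothetical.

```python
from enum import IntEnum


class CommissioningLevel(IntEnum):
    """The five testing levels listed above."""
    SUBMITTAL_REVIEW_AND_FACTORY_TESTING = 1
    SITE_INSPECTION_AND_VERIFICATION = 2
    INSTALLATION_INSPECTION = 3
    COMPONENT_TESTING_TO_DESIGN_LOADS = 4
    SYSTEM_INTEGRATION_AT_FULL_DESIGN_LOADS = 5


def missing_levels(documented):
    """Return, in order, the levels for which no documentation was produced."""
    return [level for level in CommissioningLevel if level not in documented]


def acceptable_for_collocation(documented):
    """Assumed rule: every level through Level 5 must be documented."""
    return not missing_levels(documented)


# Hypothetical facility whose paperwork stops at component testing.
facility_docs = {CommissioningLevel(n) for n in (1, 2, 3, 4)}
print(acceptable_for_collocation(facility_docs))
print([level.name for level in missing_levels(facility_docs)])
```

Listing the missing levels, rather than returning only a yes or no, gives the tenant a specific item to ask the operator about.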
Bottom Line – even a multi-million dollar facility has no integrity without a detailed design specification and complete integration/commissioning test.
The Human Factor in Continuing Facility Operations
Assuming the facility adequately completes integration and commissioning at NETA Level 5, the next step is ensuring the facility has a comprehensive continuing operations plan to manage its electrical (and mechanical/air conditioning) systems. There are two main recommendations for ensuring the annual, monthly, and even daily equipment maintenance and inspection plans are being completed.
Computerized Maintenance Management System (CMMS)
Data centers and central offices are complex operations, with thousands of moving parts and thousands of things that can potentially break or go wrong. A CMMS tries to bring all those components together into an integrated resource that, according to Wikipedia, includes:
- Work orders: Scheduling jobs, assigning personnel, reserving materials, recording costs, and tracking relevant information such as the cause of the problem (if any), downtime involved (if any), and recommendations for future action
- Preventive maintenance (PM): Keeping track of PM inspections and jobs, including step-by-step instructions or check-lists, lists of materials required, and other pertinent details. Typically, the CMMS schedules PM jobs automatically based on schedules and/or meter readings. Different software packages use different techniques for reporting when a job should be performed.
- Asset management: Recording data about equipment and property including specifications, warranty information, service contracts, spare parts, purchase date, expected lifetime, and anything else that might be of help to management or maintenance workers. The CMMS may also generate Asset Management metrics such as the Facility Condition Index, or FCI.
- Inventory control: Management of spare parts, tools, and other materials including the reservation of materials for particular jobs, recording where materials are stored, determining when more materials should be purchased, tracking shipment receipts, and taking inventory.
To these we can add further steps such as daily equipment inspections, facility walkthroughs, and staff training.
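To make the preventive maintenance item concrete, here is a minimal sketch of how a job might be flagged as due from a calendar schedule and/or a meter reading. The record layout, field names, and intervals are assumptions for illustration and do not reflect any particular CMMS product.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class PMTask:
    """Hypothetical preventive-maintenance record; field names are illustrative."""
    asset: str
    last_done: date
    interval_days: Optional[int] = None       # calendar-based trigger
    hours_at_last_service: float = 0.0
    interval_hours: Optional[float] = None    # meter-based trigger


def is_due(task: PMTask, today: date, meter_hours: float) -> bool:
    """A task is due when either its calendar or its meter interval has elapsed."""
    if task.interval_days is not None:
        if today - task.last_done >= timedelta(days=task.interval_days):
            return True
    if task.interval_hours is not None:
        if meter_hours - task.hours_at_last_service >= task.interval_hours:
            return True
    return False


generator = PMTask("Generator 1", last_done=date(2009, 9, 1),
                   interval_days=90, hours_at_last_service=1200.0,
                   interval_hours=250.0)

# 43 days and 100 run-hours since service: neither trigger has fired.
print(is_due(generator, date(2009, 10, 14), meter_hours=1300.0))
# Same date, but 260 run-hours since service: the meter trigger fires.
print(is_due(generator, date(2009, 10, 14), meter_hours=1460.0))
```

Treating the two triggers as alternatives means a heavily used generator is serviced early, while a rarely used one is still inspected on the calendar.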
SAS 70 Audits
SAS 70 audits are becoming more popular with companies as a way to make the data center operator provide documentation, audited by a neutral evaluator, that it is actually completing the maintenance, security, staffing, and permitting activities promised in marketing materials and sales negotiations.
Wikipedia defines a SAS 70 audit as:
“… the professional standards used by a service auditor to assess the internal controls of a service organization and issue a service auditor’s report. Service organizations are typically entities that provide outsourcing services that impact the control environment of their customers. Examples of service organizations are insurance and medical claims processors, trust companies, hosted data centers, application service providers (ASPs), managed security providers, credit processing organizations and clearinghouses.
There are two types of service auditor reports. A Type I service auditor’s report includes the service auditor’s opinion on the fairness of the presentation of the service organization’s description of controls that had been placed in operation and the suitability of the design of the controls to achieve the specified control objectives. A Type II service auditor’s report includes the information contained in a Type I service auditor’s report and also includes the service auditor’s opinion on whether the specific controls were operating effectively during the period under review.”
Many financial services companies looking at outsourcing now treat a SAS 70 audit as essential when evaluating candidate data center facilities to host their data and applications. Startup companies with savvy investors are demanding SAS 70 audits. In fact, any company planning to outsource its data or applications to a commercial data center should demand to obtain or review the SAS 70 audit for each facility considered.
Otherwise, you are forced to “believe” a marketer’s spin, a salesman’s desperate pitch, or the word of others for confidence that your business will be protected in another company’s facility.
One thing to keep in mind about SAS 70 audits: the audit only reviews items the data center operator chooses to have audited. A company may have very polished SAS 70 audit documentation whose contents still do not include every item you need to confirm the operator has a comprehensive operations plan. Consider finding an experienced consultant to review the SAS 70 document and advise on whether the audit actually includes all the facility maintenance and management items needed to ensure continuing protection from mechanical, monitoring/management, electrical, security, or human staffing failures.
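The scope check described here amounts to comparing what was audited against what you need. The sketch below assumes the audit's scope can be summarized as a list of area names; the required areas are taken from the failure types named above, and the operator's scope is invented for illustration.

```python
# Areas a comprehensive operations plan should cover, per the list above.
REQUIRED_AREAS = {"mechanical", "monitoring/management", "electrical",
                  "security", "staffing"}


def audit_gaps(audited_areas):
    """Return required areas the audit report does not cover, sorted for display."""
    return sorted(REQUIRED_AREAS - set(audited_areas))


# Hypothetical scope taken from an operator's audit report.
operator_scope = ["security", "staffing", "electrical"]
print(audit_gaps(operator_scope))
```

An empty result would mean the report at least touches every area; it says nothing about how thoroughly each one was examined, which is where a consultant's review still matters.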
Finally, Know Your Facility
Facility operators are traditionally reluctant to show a potential customer or tenant the electrical and mechanical diagrams and “as-built” documentation for the facility. Yet this is where you would find a 40-year-old aluminum buss duct, single points of failure, and other infrastructure designs and realities you should know about before putting your business into a data center or carrier hotel.
So, when all other data center and carrier hotel facilities appear equal in geography and interconnections, look for the facilities where your business will suffer the least impact if your interconnections are disrupted, and demand that your candidate data center operator or hosting provider give you complete documentation on the facility, its commissioning, its CMMS, and its SAS 70 audit.
Your business, the global marketplace, and network-connected world depend on forcing the highest possible standards of facility design and operation.
John Savageau, Long Beach