The data center cooling system is the most maintenance-intensive equipment within the site environmental support system, exceeding generators/ATS, UPS/battery, power distribution, fire protection and others. The long-term reliability of the Computer Room A/C (CRAC) system is a function of several items:
1) Selecting the best equipment for the application,
2) Installing the system using industry best practices and proven refrigeration, condenser water or chilled water piping methods,
3) System start-up and commissioning, and
4) Thorough on-going preventive maintenance.
If we look at the definition of “Preventive Maintenance”, it can be broken down into:
Preventive: to stop or keep from happening; make impossible by prior action; hinder.
Maintenance: to keep in a certain condition, as of repair; means of support or sustenance.
Ask five different contractors for a definition of “preventive maintenance” and you will likely get five different answers.
With this in mind, the following discussion centers on what preventive maintenance is and how it is done for a data center. Information technology and data center managers carry a heavy burden to provide a high level of reliability, computer system performance and uptime, so retaining a credible maintenance service contractor with data center experience can be a challenging task.
The construction and on-going maintenance of data center support systems is very expensive but is necessary for today’s information-centric business structure. The “Cost of Ownership” can be broken down into three parts:
2) Equipment installation, and
3) On-going maintenance.
If we assume an equipment life span of 15 years, then equipment maintenance can be the most costly of the three parts before equipment end of life (EOL). It is also true that a well maintained system has a higher efficiency and reliability, all while reducing the total cost of ownership.
The overall goal is to find a data center contractor that deploys a series of maintenance “best practices”. The question is how to run the selection process and what to look for. A systematic approach to qualified maintenance has to start with the contractor’s field technicians. Data center contractors must have a higher level of awareness when working inside a critical space. Not all contractors have this capability. Contractors have to be vetted and interviewed regarding their technical engineering and office support operations and their field technicians’ data center experience.
This can be summarized as follows:
- Contractors are to demonstrate their current and past work with data centers.
- Do they have any credentials within the data center industry such as a manufacturer’s factory training or IT technical group associations such as 7X24 Exchange, AFCOM and others?
- How many technicians are data center experienced and trained? How much of the technicians’ work schedule is data center versus non-data center?
- Perform an onsite interview of technicians and have them demonstrate the operation of the CRAC system. Observe whether they can navigate the display/operation panel without delay and discuss “Sequences of Operation”.
- What is the availability of equipment spare parts at service truck and office/warehouse levels? Parts availability has a direct correlation to low “Mean Time to Repair (MTTR)”. More on this topic will be discussed later.
- Does the contractor comprehend the risks associated with working in the critical space environment?
- Evaluate the contractor’s office support operations for their technical ability with data center mechanical systems. The office support should also understand data center power systems, as the power input into the data center has a direct relation to the cooling load. Knowledge of any emergency power off (EPO) and clean agent shutdown control wiring also needs to be demonstrated.
- Contractor field technicians must have experience with working in a clean agent protected space, where they exist. Data centers with clean agent should have a “bypass switch” that disables the agent discharge assembly so an accidental discharge can be avoided. If your system does not have a bypass switch, then one can be added. At no time is the contractor to perform work in the critical space with an active clean agent system. All contractors should ask prior to commencing any work if the clean agent is disabled. If a contractor performs work with an active system and doesn’t ask if it is disabled, then this contractor is inexperienced within the critical space and should be asked immediately to stop. At this point you should review the contractor’s prowess inside the critical space and set guidelines on the contractor’s actions. Contractor monitoring will be needed and having to schedule an employee to watch over your contractors can be very time consuming and costly, besides the interruption in your staff’s normal work schedules and commitments.
- Ask if the contractor can perform a complimentary audit of one or two CRAC units and provide a report on their findings. This can be very helpful in evaluating the contractor’s technicians, their findings and the depth of the report.
Preventive maintenance (PM) agreements are site specific as to the number and the style of equipment that exist. Scopes of work (SOW) basically fall into two types:
- FULL SERVICE: This type of contract is inclusive of parts, preventive maintenance labor and corrective action labor. Compressor coverage may be optional if the factory compressor warranty has expired. Check whether compressor coverage includes the associated labor and material. Labor coverage should include all labor at any time: overtime, weekends, holidays, etc.
- INSPECTION ONLY: This type of contract provides only a maintenance review or “inspection” of the equipment. Corrective action labor and material is excluded and invoiced separately.
The following are items to consider that apply to both PM agreement types with reference to the inspection portion of the PM visit. These items are important to evaluate any quotation:
- What individual unit documentation will be provided depicting what PM items were checked with associated recorded component operating values?
- How many techs are dispatched to perform maintenance tasks, and what is the estimated number of man-hours per PM visit? The thoroughness of maintenance depends directly on how many man-hours are spent on an in-depth review.
- What is the CRAC unit schedule for service shutdown and how will it impact critical systems cooling?
- Is there any specialty predictive testing to be performed such as infra-red scanning, vibration analysis or megger testing? If so, what, how often and what reporting will be furnished?
- What level of equipment failure forensics will be provided? Some failures are the result of an installation deficiency, and the root cause must be corrected so the same failure doesn’t recur. Repetitive failure of a component such as a compressor can be directly attributed to an installation deficiency and requires identification and correction. This scenario requires engineering support, typically provided by office support operations.
Each of the maintenance contract types will have caveats to consider such as the following:
- Most full maintenance contracts have a clause allowing the “right of first inspection”. This allows the contractor to inspect the equipment coming under coverage for its state of disrepair. This is not out of line, as the contractor must agree that the equipment is in reasonable condition before undertaking a full maintenance agreement. A list of needed repairs will be presented with corrective action labor and material pricing. This list should be broken down into three categories (urgent, necessary, and routine) according to urgency of repair. Sometimes the needed repair list can be a challenge because of its length and the cost of the repairs. This is a good point to break down the list, ask questions and fully understand the level of risk now present in the data center. A significant needed repair list is not the fault of the new service contractor but likely of the one you are replacing. And now you know why.
- Create a method of tracking the repairs performed on each piece of equipment. A good service contractor should reduce the frequency of repairs, especially repetitive repairs.
- Qualify how many man-hours are included within the maintenance portion of the contract. How many technicians are going over the equipment, and for how long, to perform maintenance tasks such as tightening electrical connections, cleaning condensers (if they exist), checking and documenting superheat/subcooling, checking contactor contacts and so on? For example, there is a big difference if you have 40 CRACs and one contractor provides one man for one week and another contractor provides three men for one week. In this example, the three-man week is the correct choice.
- All repairs are invoiced, so the incentive to perform quality inspections may not be present. Qualify the man-hours to perform the maintenance portion, as described in the man-hours item above.
- Be cautious of low pricing as the contractor will attempt to recover any costs on the repairs.
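The 40 CRAC example above comes down to simple arithmetic: how many maintenance man-hours does each unit actually receive per visit? The following is a minimal sketch of that calculation. The 40-hour technician week and the function name `hours_per_unit` are assumptions made for illustration; they are not part of any contract standard.

```python
HOURS_PER_WEEK = 40  # assumption: one technician-week is 40 working hours


def hours_per_unit(technicians: int, weeks: float, units: int,
                   hours_per_week: float = HOURS_PER_WEEK) -> float:
    """Average maintenance man-hours available per CRAC unit for one PM visit."""
    if units <= 0:
        raise ValueError("units must be positive")
    return technicians * weeks * hours_per_week / units


# The two quotations from the example: 40 CRACs, one week on site.
one_tech = hours_per_unit(technicians=1, weeks=1, units=40)
three_techs = hours_per_unit(technicians=3, weeks=1, units=40)

print(f"1 technician:  {one_tech:.1f} man-hours per unit")
print(f"3 technicians: {three_techs:.1f} man-hours per unit")
```

Under the 40-hour assumption, the single technician has about one hour per unit while the three-man crew has about three, which is the difference between a quick look and an in-depth review.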
A large portion of your review of contractors competing for your work is the preventive portion stressed above, for either full service or inspection only. If poor maintenance is performed, repairs are more frequent, which increases the cost of ownership. For example, if maintenance is poor and a 3 phase compressor contactor fails and single phases the power input, the compressor may fail. If good maintenance is performed, the deficient contactor would have been identified and replaced ahead of any single phasing impact on the compressor. You just saved $8K to $10K. Compressor manufacturers have stated that over 80% of the warranty compressors returned failed because of some issue other than a faulty compressor. They reference poor installation and maintenance as the root cause of many returned compressors. This is all the more reason to be cautious and more inquisitive in your review of the base proposal components of any potential service provider.
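Repair frequency is easier to judge when every repair is logged per unit and per component. The following is one possible sketch of such a log, with a helper that flags repeat failures worth a root cause review. The unit names, dates, the `Repair` record and the threshold of two repairs are all hypothetical choices for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Repair:
    unit: str        # equipment tag, e.g. "CRAC-01" (hypothetical naming)
    component: str   # part that was repaired or replaced
    performed: date  # date the repair was completed


def repeat_failures(log, threshold=2):
    """Return {(unit, component): count} for components repaired
    at least `threshold` times on the same unit."""
    counts = Counter((r.unit, r.component) for r in log)
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical repair history for two units.
log = [
    Repair("CRAC-01", "compressor contactor", date(2023, 1, 10)),
    Repair("CRAC-01", "compressor contactor", date(2023, 6, 2)),
    Repair("CRAC-02", "condensate pump", date(2023, 3, 15)),
]

flagged = repeat_failures(log)
for (unit, component), n in flagged.items():
    print(f"{unit}: {component} repaired {n} times, review root cause")
```

A spreadsheet serves the same purpose; the point is that repeated repairs to the same component on the same unit become visible instead of being buried in separate invoices.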
The availability of spare parts has a direct impact on cost and time to repair. Spare parts are those components that have a high frequency of replacement, such as contactors, condensate pumps, condenser fan motors, condenser fan motor speed controls, humidifier bulbs, humidifier floats and others. Compressors should also be considered, with one of each size kept in inventory. Spare parts can be sourced independently or can be the contractor’s responsibility to provide within the contract scope. If an independent parts inventory is purchased, the contractor would replenish it as parts are used so that repairs remain timely.
Multi-year contracts typically come in a 3 year term. This can be offered as an alternative to annual pricing once you have determined, from the discussion above, that a viable contractor has been selected. Extended contracts have benefits for both parties, such as:
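Mean Time to Repair is commonly computed as total repair time divided by the number of repairs, so time spent waiting on a part counts against it. The following is a rough sketch comparing two scenarios. The hour figures are made-up illustrative values, not measured data, and `mttr` is a hypothetical helper name.

```python
def mttr(repair_hours):
    """Mean Time to Repair: total repair hours divided by number of repairs."""
    if not repair_hours:
        raise ValueError("at least one repair is required")
    return sum(repair_hours) / len(repair_hours)


# Hypothetical hours from failure to restored cooling for three repairs.
parts_on_truck = [2.0, 3.0, 1.0]      # part was available immediately
parts_ordered = [26.0, 51.0, 25.0]    # same repairs, plus waiting on delivery

print(f"MTTR with spares on hand: {mttr(parts_on_truck):.1f} h")
print(f"MTTR with parts on order: {mttr(parts_ordered):.1f} h")
```

In this sketch the hands-on labor is the same in both cases; only the wait for parts differs, which is why truck and warehouse stock is worth asking about during contractor evaluation.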
IT Manager Benefits
- Lock in pricing for 3 years
- A cancellation clause with 30 days notice can be included.
Contractor Benefits
- Allows opportunity to offer best pricing with applicable discounts for extended commitment.
- Instills confidence that quality service will be rewarded with extended commitment from client.
In summary, all service contracts of whatever type include the base procedures of general preventive maintenance. Preventive maintenance is the foundation of a reliable and efficient cooling system. A well maintained system also has a lower total cost of ownership. Review the root components of what preventive maintenance is supposed to be all about. Doing so pays off in system reliability and overall satisfaction, with much less stress.