The biggest failure of my career: what happened, what we did about it, and how we won’t do it again.
Time: Wednesday, July 31, 3:30pm - 4pm
Description: In 2017 the Research Computing team at University of Colorado Boulder began the process of refreshing the PetaLibrary, a campus-wide research and academic data storage service. Updates were motivated by aging hardware, the impending end of upstream support, and a hope to improve the service with lessons learned.
We selected a supplier with a seemingly well-architected proposal, but problems experienced during deployment and testing increased our workload beyond expectations. We were required to operate the legacy infrastructure longer than planned, leading to stress within the team. Finally, inadequate planning and risk mitigation led to an over-reliance on project success and a reluctance to enforce requirements and deadlines.
The project began to see success through more open communication and by making adequate allowances for partial project failure. We found it necessary to take a more active role in the project's success, including the removal of components that were unable to meet acceptance requirements. This, coupled with strong contracting, allowed us to amend the solution and avoid total project failure. The amended solution also matches our institutional priorities of openness and flexibility more closely than the original design.
From this experience we have learned to perform large and complex migrations in discrete stages; to decouple legacy decommissioning timelines from deployment timelines; to insist on and provide adequate time for hands-on evaluation of new components; to maintain architectural authority and responsibility; and at all times to assess risk frankly and communicate openly.