To help us keep metrics over time that show what types of reliability issues LAVA has, we follow a small process for dealing with failed health jobs.
You can find failed health jobs on the reports page. Each column of the failure graphs is a hyperlink; selecting it opens a report that lists the failed health jobs for that interval.
If a job has no failure tags or comments, you can add them by selecting the job and then choosing the "Annotate failure" action on the job's page.
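The failure tags and comments are what make the metrics possible: once jobs are annotated, failures can be counted by type. The following is a minimal sketch, assuming the annotated jobs have already been collected into simple records. The `FailedHealthJob` class, its fields, the helper functions and the device names are all hypothetical and illustrative, not part of LAVA itself.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailedHealthJob:
    """Hypothetical record for one failed health job (not a LAVA type)."""
    job_id: int
    device: str
    failure_tags: List[str] = field(default_factory=list)
    comment: str = ""


def tag_counts(jobs):
    """Count failures per tag; jobs without any tag are counted as 'untagged'."""
    counts = Counter()
    for job in jobs:
        if job.failure_tags:
            counts.update(job.failure_tags)
        else:
            counts["untagged"] += 1
    return counts


def needs_annotation(jobs):
    """Return IDs of jobs that have neither failure tags nor a comment."""
    return [job.job_id for job in jobs if not job.failure_tags and not job.comment]


# Hypothetical sample data for one reporting interval.
jobs = [
    FailedHealthJob(101, "panda01", ["network"]),
    FailedHealthJob(102, "panda02", ["network", "bootloader"]),
    FailedHealthJob(103, "beaglexm01"),
]

print(tag_counts(jobs))
print(needs_annotation(jobs))
```

A large "untagged" count is a sign that jobs in that interval still need to be annotated before the metrics are meaningful.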
Once the failure has been documented, you should try to get the device back online.
Platform/LAVA/DevOps/LAVAFailedHealthJob (last modified 2013-01-07 18:07:29)