To maintain metrics over time that help us understand what types of reliability issues LAVA has, we have a small process in place for dealing with failed health jobs.

You can find failed health jobs by looking at the reports page. Each column of the failure graphs includes a hyperlink you can select to view the failed health jobs for that interval. The report you select might be similar to:

If the job has no failure tags or comments, you can fill them in by selecting the job, for example:

From that page, select the "Annotate failure" action.

Once the failure has been documented, you should try to get the affected device back online.


