Rationale

Currently, a Jenkins build instance lingers for some time after a build on it finishes before it is terminated. This is apparently based on the idea that starting an instance has overhead, so reusing an existing one may be useful. But that is not what we want: we want to use an instance once, for a specific build, to guarantee a clean-room environment, and then terminate it immediately so that we incur no additional costs.

User stories

As an EC2 administrator, I want cloud-buildd instances to be terminated as soon as a build finishes (and not reused for other builds).

As an EC2 administrator, I want running instances to be monitored, with user notification at a soft limit and automatic termination at a hard limit.

Assumptions

  1. Jenkins has a hard-coded time for which an instance lingers before it is garbage-collected.
  2. We cannot simply have the instance shut itself down at the end of a build, as this would cause the build to fail.

Design

Slave instance monitoring

Write a monitoring tool which watches running Jenkins slave instances. If the normal time threshold is passed for an instance, notify the owner of that build instance, so they have a chance to look at what is wrong. Then, if the instance is still running at threshold #2, notify the EC2 admin and/or the common devel list.

Additionally, consider whether we need a hard threshold, after which an instance is terminated automatically (with the relevant parties notified).
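
The escalation described above could be sketched roughly as follows. This is only an illustration: the threshold values, the `Action` names and the `classify` function are assumptions made for the example, since this spec does not fix any numbers or interfaces.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Action(Enum):
    NONE = "none"
    NOTIFY_OWNER = "notify_owner"    # normal time threshold passed
    NOTIFY_ADMIN = "notify_admin"    # threshold #2 passed
    TERMINATE = "terminate"          # optional hard threshold passed


# Hypothetical values; the real limits are still to be decided.
OWNER_THRESHOLD = timedelta(hours=2)
ADMIN_THRESHOLD = timedelta(hours=4)
HARD_THRESHOLD = timedelta(hours=8)


def classify(launch_time: datetime, now: datetime) -> Action:
    """Return the strongest action that applies to an instance of this age."""
    age = now - launch_time
    if age >= HARD_THRESHOLD:
        return Action.TERMINATE
    if age >= ADMIN_THRESHOLD:
        return Action.NOTIFY_ADMIN
    if age >= OWNER_THRESHOLD:
        return Action.NOTIFY_OWNER
    return Action.NONE


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(classify(now - timedelta(hours=3), now))
```

Checking the thresholds from the highest down means an instance that has skipped a monitoring run still gets the strongest applicable action.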

Instance lifetime control

There are two ideas for how to implement this:

  1. Use the Jenkins post-build task plugin, which allows running a script after the main part of a build has finished (and has an explicit option to not affect the build result). The idea is that it would be possible to shut down an instance there without causing the build to fail, but that needs to be verified.
  2. Patch the Jenkins EC2 plugin to shut down an instance immediately after the build. A more general, reusable (and upstreamable) approach would be to add UI settings for: a) the time an instance may linger after a build before it is garbage-collected, and b) a cap on the number of builds in a row an instance can serve before being shut down explicitly. For our case, we would set those parameters to 0 (which means ASAP per the Jenkins scheduler, not immediately) and 1 respectively (the latter ensures that Jenkins does not allocate another job to the instance, even if there are jobs in the queue which would otherwise fit even with a linger delay of 0).
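
To make the interplay of the two proposed settings in option 2 concrete, here is a small model of the decision logic in Python. It is not code from the Jenkins EC2 plugin (which is written in Java); the `LifetimePolicy` class and its field and method names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifetimePolicy:
    linger_seconds: int  # setting (a): allowed idle time after a build
    max_builds: int      # setting (b): cap on builds served by one instance

    def can_accept_job(self, builds_served: int) -> bool:
        # An instance that has reached the cap must not get another job,
        # even if matching jobs are waiting in the queue.
        return builds_served < self.max_builds

    def should_terminate(self, builds_served: int, idle_seconds: int,
                         busy: bool = False) -> bool:
        if busy:
            return False  # never terminate while a build is running
        if builds_served >= self.max_builds:
            return True   # cap reached: shut down explicitly
        # Otherwise apply the linger delay, but only after at least one build.
        return builds_served > 0 and idle_seconds >= self.linger_seconds


# The values proposed for our case: linger 0, at most one build.
once_off = LifetimePolicy(linger_seconds=0, max_builds=1)
```

With `once_off`, an instance that has served one build can no longer accept jobs and is due for termination on the next scheduler pass, which matches the "ASAP, not immediately" behaviour noted above.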

Implementation

Slave instance monitoring

Write a standalone script which can be run both manually and as a cronjob. Later, it can be merged into other EC2 monitoring scripts. Figure out how to map instances to their owners (to mail them in case of problems). It is possible to associate arbitrary key/value pairs with an instance, so we could store the email there, but then every tool which starts instances would have to set such a key, which is not the case today and can hardly be enforced for 3rd-party tools. So instead, we will likely need a cronjob-local mapping from instance properties (such as the SSH keypair name) to owner emails.
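
A cronjob-local mapping of this kind might look like the sketch below. The keypair names, email addresses and the `key_name` field are made up for the example, and falling back to the EC2 admin for unknown keypairs is an assumption rather than something this spec decides.

```python
# Hypothetical mapping from SSH keypair name to owner email.
OWNER_BY_KEYPAIR = {
    "alice-buildd": "alice@example.com",
    "bob-buildd": "bob@example.com",
}

# Assumed fallback recipient when the keypair is unknown or missing.
ADMIN_EMAIL = "ec2-admin@example.com"


def owner_email(instance: dict) -> str:
    """Look up the owner of an instance by its keypair name.

    `instance` is a plain dict of instance properties as the monitoring
    script would collect them; only the `key_name` entry is used here.
    """
    return OWNER_BY_KEYPAIR.get(instance.get("key_name"), ADMIN_EMAIL)


if __name__ == "__main__":
    print(owner_email({"id": "i-00000001", "key_name": "alice-buildd"}))
```

Keeping the table in the script (or a file next to it) means instances started by 3rd-party tools need no extra tagging.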

UI Changes

The new Jenkins EC2 parameters should be presented on the standard Jenkins configuration page and be editable from there.

For instance monitoring, the cronjob will log all its actions and will communicate with the outside world by email.
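
As a rough sketch of the log-and-mail behaviour, the cronjob could build its notifications with the Python standard library as shown here. The sender address, subject format and the `build_notification` helper are assumptions for illustration; actually sending the message (for example via `smtplib`) is left out.

```python
import logging
from email.message import EmailMessage

log = logging.getLogger("instance-monitor")

# Hypothetical sender address for the monitoring cronjob.
SENDER = "ec2-monitor@example.com"


def build_notification(instance_id: str, recipient: str, action: str) -> EmailMessage:
    """Log the action and return the email that reports it."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = recipient
    msg["Subject"] = f"[cloud-buildd] instance {instance_id}: {action}"
    msg.set_content(
        f"Instance {instance_id} triggered the monitoring action '{action}'.\n"
        "Please check whether the build is still making progress.\n"
    )
    # Every action is logged, whether or not the mail is delivered later.
    log.info("instance %s: %s, notifying %s", instance_id, action, recipient)
    return msg


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(build_notification("i-00000001", "alice@example.com", "notify_owner"))
```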

Test/Demo Plan

Testing will be done gradually, starting with non-destructive monitoring and moving towards active control (such as instance termination) only later. Changes will first be tested and deployed locally, then on sandboxes if applicable, and then proposed for production use.


Platform/Specs/11.11/CloudBuilddInstanceManagement (last modified 2011-06-08 11:50:53)