The Cloud Foundry Diego Persistence team recently spent a fair amount of time and effort building and refactoring the CI pipeline for our Ceph filesystem, volume driver, and service broker. The end state from this exercise, while not perfect, is nonetheless pretty darn good: it deploys Cloud Foundry, Diego, and a CephFS cluster, along with our volume driver and service broker. It runs our code through unit tests, certification tests, and acceptance tests. It keeps our deployment up to date with the latest releases of Cloud Foundry and the latest development-branch changes to Diego. It does all of this with minimal rework or delay; changes to our driver/broker BOSH release typically flow through the pipeline in about 10 minutes.
But our first attempt at creating the pipeline did not work very well or very quickly, so we thought it would be worth documenting our initial assumptions, what was wrong with them, and some of what we learned while fixing them.
Our First Stab at It
We started with a set of assumptions about what we could run quickly and what would run slowly, and we tried to organize our pipeline around those assumptions to make sure that the quick stuff didn’t get blocked by the slow stuff.
- Cephfs cluster deployment is slow–it requires us to apt-get a largish list of parts and then provision a cluster. This can take 20-30 minutes.
- Since cluster deployment is slow, and we share a bosh release for the cephfs bosh job and our driver and broker bosh jobs, we should only trigger cephfs deployment nightly when nobody is waiting–we shouldn’t trigger it when our bosh release is updated.
- Redeploying Cephfs is not safe–to make sure that it stays in a clean state, we should undeploy it before deploying it again.
- CloudFoundry deployment is slow–we should not automatically pick up new CF releases because it might paralyze our pipeline during the work day.
- The pipeline should clean up on failure–bad deployments of cephfs should get torn down automatically.
What We Eventually Learned
Our first pass at the pipeline (mostly) worked, but it was slow and inefficient. Because we structured it to deploy some of the critical components nightly or on demand, and we tore down the Ceph filesystem VM before redeploying it, any time we needed an update we had to wait a long time. In the case of CephFS, we also had to create a shadow pipeline just for manually triggering CephFS redeployment. It turned out that most of the assumptions above were wrong, so let's take another look at them:
- Cephfs cluster deployment is slow. This is only partially true. Because we installed cephfs using apt-get, we were doing an end-run around BOSH package management, effectively ensuring that we would re-do work in our install script whether it was necessary or not. We switched from apt-get to BOSH-managed Debian packages, and that sped things up a lot: BOSH caches packages and only fetches and compiles things that have actually changed.
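As a sketch of what that switch looks like, a BOSH package spec that wraps the Ceph .debs as release blobs might look something like this (the package and blob names here are hypothetical); because BOSH fingerprints a package's source files, the package is only re-fetched and re-compiled when those files actually change:

```yaml
# packages/cephfs/spec — hypothetical package spec for a BOSH release.
# The .deb files are stored as release blobs rather than fetched via apt-get,
# so BOSH can cache them and skip the install work when nothing changed.
name: cephfs
dependencies: []
files:
- cephfs-debs/ceph_10.2.0.deb           # illustrative blob name
- cephfs-debs/ceph-common_10.2.0.deb    # illustrative blob name
```

The accompanying `packaging` script then just extracts the .debs into `BOSH_INSTALL_TARGET`, instead of running apt-get against upstream mirrors on every deploy.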
- We should only trigger cephfs deployment nightly or we will repeat slow cephfs deployments whenever code changes. This is totally untrue. Bosh is designed to detect changes from one version to the next, so when the broker job or the driver job changes, but cephfs hasn’t changed, deploying the cephfs job will result in a no-op for bosh.
- Redeploying Cephfs is not safe. This might be partially true. In theory our ceph filesystem could get corrupted in ways that would cause the pipeline to keep failing, but treating this operation as unsafe is somewhat antithetical to cloud operations. Bosh jobs should as much as possible be safe to redeploy without removing them.
- CloudFoundry deployment is slow. This is usually not true. New releases of CloudFoundry deploy incrementally just like other BOSH deployments, so only the changed jobs result in deployment changes. The real culprit in slow deployments is stemcell updates: when there is a new BOSH stemcell, BOSH needs to download it before it can deploy. To keep that from slowing down our pipeline during the workday, we created a "nightly stemcell" task in the pipeline that doesn't do anything, but can only run at night. Using the latest stemcell that passed through that task, and setting the stemcell as a trigger in our deploy tasks, ensures that when there is a stemcell change our pipeline picks it up at night and redeploys with it, and that we never have to wait for a stemcell download during the day:
```yaml
# (abridged: unrelated resources and task details elided)
resources:
- name: nightly
  type: time
  source:
    start: 1:00 AM -0800
    stop: 1:15 AM -0800
- name: aws-stemcell

jobs:
- name: nightly-stemcell
  plan:
  - get: nightly
    trigger: true          # runs only inside the nightly window
  - get: bosh-stemcell

- name: teardown-cephfs-cluster
  plan:
  - get: cephfs-bosh-release
  - get: aws-stemcell
  - get: deployments-runtime
  - task: teardown
```
- The pipeline should clean up on failure. This is generally a bad practice. It means that we have no way of diagnosing failures in the pipeline. Teardown after failure also doesn't restore the health of the pipeline unless the deployments in question are re-deployed afterward, but in the case of a deployment error, that could easily result in a tight loop of deployment and undeployment, so we never did that.
Where We Ended Up
After we corrected all of our wrong assumptions, our pipeline is in much better shape:
- Bosh deployments are incremental and frequent. We pick up new releases as soon as they happen, and we re-test against them, so we get early warning of failures even when we didn’t make the breaking changes.
- Our bosh job install scripts are as much as possible idempotent. The only undeploy jobs we have in the pipeline are manually triggered.
- We trigger slow stemcell downloads at night when nobody is working, and stick to the same stemcells during the day to avoid slow downloads.
- Since we share the same bosh release for 3 different deployments (broker, driver, and file system) we trigger deployment of all 3 things whenever our bosh release changes. Since Bosh is clever about not doing anything for unchanged jobs, this is a much easier approach than trying to manage separate versions of the bosh release for different jobs.
- We use concourse serial groups to force serialization between the tasks that deploy things and the tasks that rely on those deployments. Serial groups are far from perfect–they operate as a simple mutex with no read/write lock semantics–but for our purposes they proved to be good enough, and they are far easier than implementing our own locks.
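As an illustration of that last point, serializing the deploy job against the jobs that depend on the deployment is just a matter of putting them in the same serial group (the job names and task files below are hypothetical):

```yaml
# Hypothetical excerpt: jobs sharing a serial group never run concurrently.
jobs:
- name: deploy-cephfs
  serial_groups: [cephfs-deployment]   # simple mutex: one group member at a time
  plan:
  - get: cephfs-bosh-release
    trigger: true
  - task: deploy
    file: ci/deploy-cephfs.yml         # hypothetical task file

- name: run-acceptance-tests
  serial_groups: [cephfs-deployment]   # will not overlap with deploy-cephfs
  plan:
  - get: cephfs-bosh-release
    passed: [deploy-cephfs]
    trigger: true
  - task: acceptance
    file: ci/run-acceptance.yml        # hypothetical task file
```

Because the group is a plain mutex, tests also block each other, not just the deploy; as noted above, that coarseness was an acceptable trade for us.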
The yaml for our current pipeline is here for reference.
In addition to our nightly job to download stemcells, we also run a nightly task to clean up BOSH releases by invoking bosh cleanup. This is a very good idea–otherwise BOSH keeps every release and stemcell that has been uploaded to it, which can quickly use up available disk space.
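A sketch of that cleanup job, reusing the same nightly time resource (the container image is hypothetical, and targeting/credentials are assumed to be configured elsewhere):

```yaml
# Hypothetical excerpt: run `bosh cleanup` once per night.
- name: nightly-bosh-cleanup
  plan:
  - get: nightly                        # same nightly time-resource window
    trigger: true
  - task: bosh-cleanup
    config:
      platform: linux
      image_resource:
        type: docker-image
        source: {repository: bosh-cli}  # hypothetical image containing the BOSH CLI
      run:
        path: bosh
        args: [cleanup]                 # removes unused releases and stemcells
```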
At some point in the future, we will probably want to add additional tasks to the pipeline to clean out our Amazon S3 buckets, but so far we haven’t done that.
A special thanks to Connor Braa who recently joined our team from the Diego team where he did a great deal of Concourse wrangling. Connor is responsible for providing us with most of the insights in this post.