
Wire up CI/CD

Push to main, CI builds the brick, the fleet swaps it in. No SSH involved.

Every piece of LILY infrastructure that runs on a long-lived host deploys the same way: a CI job builds an artifact and uploads it to a Google Cloud Storage bucket, and a small systemd timer on the host pulls it in. No runner ever opens a connection into production. Hosts dial out, never in. This guide is the pattern we use for platform.lilylabs.io, the dataroom, and the docs site you are reading. It works equally well for any app you ship.

The pattern

A CI pipeline on push to main produces two files in a bucket. The first is the artifact itself: a tarball, a single binary, or — once you have adopted voids — a brick. The second is a tiny marker file containing the sha256 of the artifact and a build timestamp. Hosts only watch the marker. When its content changes, a deploy is in flight.

The bucket layout we use looks like this. Pick names that match your service so the layout stays readable.

gs://lly-releases/myapp/myapp-x86_64-linux.tar.gz
gs://lly-releases/myapp/myapp-x86_64-linux.sha256

The marker is what makes this safe. It is uploaded last, after the artifact upload has completed. The puller reads the marker, then fetches the artifact, then verifies the sha256 locally before doing anything with it. A half-uploaded artifact never gets installed: until the new marker lands, hosts still see the previous sha256, and any artifact that does not match the marker fails verification.
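A minimal sketch of the publish side, assuming a two-line marker (sha256 on the first line, a UTC build timestamp on the second). The function names, the marker format, and the GSUTIL override are illustrative assumptions, not LILY's actual CI script.

```shell
#!/bin/sh
# Publish sketch: upload the artifact first and the marker last.
# Assumed marker format: sha256 on line 1, UTC timestamp on line 2.
set -eu

# write_marker ARTIFACT MARKER: record the artifact's sha256 and build time.
write_marker() (
  sum=$(sha256sum "$1") || exit 1
  printf '%s\n%s\n' "${sum%% *}" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$2"
)

# publish ARTIFACT MARKER DEST: copy both files to the bucket prefix DEST.
# GSUTIL may point at a stub for testing; it defaults to the real gsutil.
publish() (
  write_marker "$1" "$2" || exit 1
  "${GSUTIL:-gsutil}" cp "$1" "$3/" || exit 1   # artifact first
  "${GSUTIL:-gsutil}" cp "$2" "$3/"             # marker last
)
```

In a pipeline this would be called as publish myapp-x86_64-linux.tar.gz myapp-x86_64-linux.sha256 gs://lly-releases/myapp. If the artifact upload fails, the marker is never written to the bucket and hosts keep running the previous release.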

On the host

Two systemd units live in /etc/systemd/system/. A timer fires every sixty seconds and starts a oneshot service. The service is a tiny shell script that does five things, in order, and stops at the first failure.

  1. Fetch the marker. Compare to the last-installed sha cached at /var/lib/myapp/installed.sha256. Exit cleanly if unchanged.
  2. Fetch the artifact into a staging directory next to the live one.
  3. Verify the local sha256 matches the marker. Fail loudly if not.
  4. Atomic-swap: ln -sfn /var/lib/myapp/stage /var/lib/myapp/current, then signal or restart the running service.
  5. Hit a local health endpoint. If it does not return 200 within ten seconds, swap the symlink back to the previous release and exit non-zero.
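The five steps above can be sketched as a single shell function. This is an illustration under stated assumptions, not the LILY puller itself: the variable names, the releases/<sha> directory used to keep a previous release for rollback, the marker format (sha256 on line 1), and the single-attempt curl health check are all hypothetical.

```shell
#!/bin/sh
# Puller sketch. All names and defaults below are assumptions.
set -eu

BUCKET=${BUCKET:-gs://lly-releases/myapp}
NAME=${NAME:-myapp-x86_64-linux}
ROOT=${ROOT:-/var/lib/myapp}
FETCH=${FETCH:-gsutil cp}                        # copies SRC to DST
RELOAD=${RELOAD:-systemctl reload myapp.service}
# curl -f fails on HTTP 4xx/5xx; --max-time bounds the request to ten seconds.
HEALTH=${HEALTH:-curl -fsS --max-time 10 http://127.0.0.1:8080/_health}

pull_once() (
  tmp=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmp"' EXIT

  # 1. Fetch the marker; exit cleanly if it matches the cached sha.
  $FETCH "$BUCKET/$NAME.sha256" "$tmp/marker" || exit 1
  want=$(head -n 1 "$tmp/marker")
  have=$(cat "$ROOT/installed.sha256" 2>/dev/null || true)
  if [ "$want" = "$have" ]; then echo "no change"; exit 0; fi

  # 2. Fetch the artifact into a staging directory next to the live one.
  rm -rf "$ROOT/stage" && mkdir -p "$ROOT/stage" || exit 1
  $FETCH "$BUCKET/$NAME.tar.gz" "$ROOT/stage/$NAME.tar.gz" || exit 1

  # 3. Verify the local sha256 against the marker. Fail loudly if not equal.
  got=$(sha256sum "$ROOT/stage/$NAME.tar.gz") || exit 1
  if [ "${got%% *}" != "$want" ]; then echo "sha256 mismatch" >&2; exit 1; fi

  # 4. Unpack into releases/<sha> (assumed layout), swap the symlink, reload.
  rel="$ROOT/releases/$want"
  mkdir -p "$rel" && tar -xzf "$ROOT/stage/$NAME.tar.gz" -C "$rel" || exit 1
  prev=$(readlink "$ROOT/current" 2>/dev/null || true)
  ln -sfn "$rel" "$ROOT/current" || exit 1
  $RELOAD || exit 1

  # 5. Health check; on failure point the symlink back and exit non-zero.
  if $HEALTH >/dev/null; then
    echo "$want" > "$ROOT/installed.sha256"
    echo "rollover ok"
  else
    if [ -n "$prev" ]; then ln -sfn "$prev" "$ROOT/current" && $RELOAD; fi
    echo "health check failed, rolled back" >&2
    exit 1
  fi
)
```

A real script would end with a call to pull_once. Each step uses an explicit "|| exit 1" rather than relying on set -e alone, because errexit is ignored inside a function that is called from an if or || context. The cached sha is only written after the health check passes, so a failed release is retried on the next tick until a new marker replaces it.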

The two unit files are short enough to read in one breath. The service runs the puller script. The timer runs the service.

myapp-puller.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/myapp-pull
User=myapp

myapp-puller.timer
[Timer]
OnBootSec=30s
OnUnitActiveSec=60s
Unit=myapp-puller.service
[Install]
WantedBy=timers.target

Enable the timer once with systemctl enable --now myapp-puller.timer and you are done. Every push to main reaches the host within roughly ninety seconds.
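A short sketch of the enable step with a couple of sanity checks around it. The enable_puller wrapper and the SYSTEMCTL override are illustrative assumptions (the override only exists so the sequence can be exercised without systemd); the systemctl subcommands themselves are standard.

```shell
#!/bin/sh
# Sketch: enable the puller timer and confirm that it is scheduled.
set -eu

enable_puller() {
  sc=${SYSTEMCTL:-systemctl}
  "$sc" daemon-reload &&                     # pick up the new unit files
  "$sc" enable --now myapp-puller.timer &&   # start now and on every boot
  "$sc" list-timers myapp-puller.timer &&    # show the next scheduled run
  "$sc" start myapp-puller.service           # force one pull immediately
}
```

Run enable_puller as root on the host. Starting the service by hand is a convenient way to check the first pull without waiting for the timer.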

journalctl
$ journalctl -u myapp-puller -f
-- boot 4f2b · myapp-puller.service --
12:14:02 marker: 9c64657 (cached: 7d4aec0) · update available
12:14:02 fetching myapp-x86_64-linux.tar.gz (38 MB)
12:14:05 sha256 verified · 9c64657…b1c
12:14:05 staging → /var/lib/myapp/stage
12:14:05 swap → /var/lib/myapp/current
12:14:05 systemctl reload myapp.service
12:14:06 health: GET http://127.0.0.1:8080/_health → 200 (118ms)
12:14:06 rollover ok · installed=9c64657
12:15:06 marker: 9c64657 (cached: 9c64657) · no change
12:16:06 marker: 9c64657 (cached: 9c64657) · no change
12:17:06 marker: a64d552 · update available
12:17:09 health: GET http://127.0.0.1:8080/_health → 503
12:17:09 rolling back to 9c64657
12:17:11 rollback ok · alarm raised

Why pull, not push

Push deploys mean your CI runner holds a credential that can change production. Lose that credential — leaked secret, compromised runner, wrong branch — and an attacker has a shell on every host. Pull deploys invert the trust direction. The runner can write to one bucket and nothing else. The host can read one bucket and nothing else. Neither side can reach the other directly.

There is no inbound SSH to harden, no jump host to babysit, no per-project IAM matrix. Hosts live behind iptables -P INPUT DROP except for ports they actually serve traffic on. The CI service account only needs storage.objectAdmin on a single bucket (overwriting an object requires delete permission, which storage.objectCreator lacks), and that role can be scoped per service.
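The firewall and IAM setup described above might look like the following sketch. Port 443, both service account names, and the read-only storage.objectViewer role for the host are assumptions for illustration; the run helper prints commands instead of executing them when DRY_RUN=1, which is useful for reviewing firewall changes before applying them.

```shell
#!/bin/sh
# Sketch: default-deny inbound firewall plus bucket-scoped IAM grants.
# Port 443 and the service account names are hypothetical.
set -eu

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

lock_down_host() {
  run iptables -A INPUT -i lo -j ACCEPT
  # Allow replies to connections the host opened itself (the puller dialing out).
  run iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
  run iptables -A INPUT -p tcp --dport 443 -j ACCEPT
  # Set the default policy last so the accept rules are already in place.
  run iptables -P INPUT DROP
}

# grant_bucket_roles BUCKET CI_SA HOST_SA
grant_bucket_roles() {
  bucket=$1 ci_sa=$2 host_sa=$3
  run gcloud storage buckets add-iam-policy-binding "$bucket" \
    --member="serviceAccount:$ci_sa" --role=roles/storage.objectAdmin
  run gcloud storage buckets add-iam-policy-binding "$bucket" \
    --member="serviceAccount:$host_sa" --role=roles/storage.objectViewer
}
```

Because both bindings are attached to the bucket rather than the project, neither account gains access to anything outside it.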

The pattern is also self-healing. A host that was offline during a deploy catches up the moment its puller next ticks. There is no separate reconcile loop because the puller is the reconcile loop.

Adapting it for your own app

The structure is identical whether the artifact is a Go binary, a Node bundle, or — once you are on LILY — a brick produced by lly compile or lly void deploy. For most LILY apps you skip the puller entirely: lly void deploy ./void.brick already lands on the fleet through Raptor in under a second, with rollback and health-checking handled by the CAP itself. The pull-based pattern in this guide is for hosts you own, outside the CAP fleet.

Three things to set up the first time. After that the loop runs itself.

  • A GCS bucket with a CI service account that has storage.objectAdmin only on that bucket.
  • A CI step at the end of your pipeline that runs gsutil cp twice: artifact first, marker second.
  • The two systemd units and the puller script above, dropped on the host once via your provisioner of choice.

For LILY-internal apps the CI script lives in .gitlab-ci.yml and the puller script lives in the host image. We treat both as ordinary code, reviewed in MRs, with the puller script unit-tested against fake markers in a temp directory. If you want to copy ours wholesale, ask in the team channel; the platform and docs pullers are nearly byte-identical.
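A test against fake markers can be as small as the sketch below. The needs_update helper, the marker format (sha256 on line 1), and the fixture values are assumptions made for illustration; this is not the LILY test suite.

```shell
#!/bin/sh
# Sketch: testing the "has the marker changed?" check against fake markers.
set -eu

# needs_update MARKER CACHED: succeed when the marker names a sha that differs
# from the one recorded in CACHED. A missing CACHED means "never installed".
needs_update() {
  want=$(head -n 1 "$1") || return 2
  have=$(cat "$2" 2>/dev/null || true)
  [ -n "$want" ] && [ "$want" != "$have" ]
}

test_needs_update() (
  dir=$(mktemp -d)
  trap 'rm -rf "$dir"' EXIT

  # First install: a marker exists but nothing is cached yet.
  printf '9c64657\n2024-01-01T00:00:00Z\n' > "$dir/marker"
  needs_update "$dir/marker" "$dir/installed" || { echo "FAIL: first install"; exit 1; }

  # Unchanged marker: the cached sha matches, so no update is due.
  printf '9c64657\n' > "$dir/installed"
  if needs_update "$dir/marker" "$dir/installed"; then echo "FAIL: unchanged"; exit 1; fi

  # New marker: a different sha means an update is available.
  printf 'a64d552\n' > "$dir/marker"
  needs_update "$dir/marker" "$dir/installed" || { echo "FAIL: new sha"; exit 1; }

  echo "ok"
)
```

Calling test_needs_update prints ok when all three cases pass and exits non-zero with a FAIL line otherwise, which is enough for a CI job to gate on.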

Related

See Deploy a Next.js site for the zero-CI path that uses lly void directly, and Publish a plugin for how plugin authors push to rc.lilylabs.io through the same pull-based scheme.