Insight Notes

The Ciphertext Outlives Its Key

2026-04-24T00:00:00+00:00

Nine years ago, on LinkedIn, I published a short piece asking whether quantum computing would eventually kill Bitcoin. The framing was speculative on both ends. The cryptographic break was decades away. The asset at risk was itself a bet on a cryptographic abstraction. It was a thought experiment about a possible future.

The same question, asked today about a national genetic database, has none of those qualities.

The cryptographic break is no longer comfortably decades away. The asset at risk is not a speculative ledger but the genomic identity of a population. And the harvesting of the encrypted material that will eventually be broken is not hypothetical: it is happening now, on the wires, while the data is being generated.

A packet on a wire

Somewhere on a transatlantic fibre link, a packet leaves a national biobank and travels to a research collaborator abroad. It contains thousands of sequenced genomes, with consent forms, identifiers, and metadata. It is encrypted. The key exchange relies on elliptic-curve cryptography, which is the current standard for protecting data in transit. By every reasonable measure today, the packet is safe.

It will also be copied.

Not necessarily by the recipient, or by the carrier, or by anyone the sender would recognise as an adversary. The copy will be made by a passive observer somewhere along the route, sitting on an interception point and writing ciphertext to long-term storage. The copy does not need to be readable now. Storage is cheap, and the observer can afford to wait.

This is the tactic the security community calls Harvest Now, Decrypt Later. The first half is happening today. The second half is waiting for an instrument that does not yet exist.

The arithmetic that does not work

There is a simple inequality that frames the entire problem, first articulated by cryptographer Michele Mosca. It asks three questions. How long does this data need to remain confidential? How long will the migration to quantum-resistant cryptography take? How long until a cryptographically relevant quantum computer exists?

If the first two added together are greater than the third, the data is already compromised. Not metaphorically. The ciphertext will outlive the cryptography that protects it.
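The inequality is small enough to write down directly. The following is a minimal sketch in Python; the function name and every number in the usage lines are hypothetical illustrations, not estimates taken from this article.

```python
def mosca_exposed(shelf_life_years: float,
                  migration_years: float,
                  years_to_crqc: float) -> bool:
    """Return True when Mosca's inequality x + y > z holds.

    x: how long the data must stay confidential
    y: how long the migration to quantum-resistant cryptography takes
    z: how long until a cryptographically relevant quantum computer exists
    """
    return shelf_life_years + migration_years > years_to_crqc


# Hypothetical inputs, chosen only to show how the comparison behaves.
short_lived = mosca_exposed(shelf_life_years=3, migration_years=10, years_to_crqc=15)
genomic = mosca_exposed(shelf_life_years=80, migration_years=10, years_to_crqc=15)
print("short-lived record exposed:", short_lived)
print("genomic sequence exposed:", genomic)
```

With a confidentiality horizon measured in a human lifetime, the left-hand side dominates for any plausible value of the third variable, which is why the exact arrival date matters so little for this class of data.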

For most categories of data, the arithmetic is uncomfortable but manageable. Financial transactions have confidentiality horizons measured in years. Communications metadata in months. Even classified material, though sensitive, usually has a defined downgrade path.

A genomic sequence has none of these properties. It is confidential for as long as its subject is alive. Unlike a password or a credit card number, it cannot be rotated. It cannot be revoked. It also implicates people who never consented to its disclosure: parents, children, and full siblings each share roughly half of it, and more distant relatives share progressively less but still measurably so. A single sequence, broken thirty years from now, retroactively exposes a family that may not have existed at the time the sequence was collected.

Migration timelines for complex, regulated infrastructure are on the order of a decade. The European coordinated roadmap sets a 2030 target for critical systems and a 2035 endpoint for deprecating vulnerable algorithms. Devices and platforms governed by certification cycles move even more slowly.

The third variable is the contested one. Estimates for a cryptographically relevant quantum computer have compressed significantly over the past two years, driven by hardware progress that has repeatedly outperformed conservative projections. Published research on quantum error correction, reductions in the qubit requirements for breaking RSA and elliptic-curve cryptography, and demonstrations of below-threshold logical qubits have all moved faster than the cryptographic community expected. Whether the date is 2030 or 2040 is, for genomic data, immaterial. Either way, the inequality does not balance.

What HNDL actually requires

The most common misunderstanding about HNDL is that it is a future threat. It is not. It is a present operation whose consequences are deferred.

To execute the harvest, an adversary does not need to compromise the database. They do not need to deploy malware on a research endpoint, phish an administrator, or exploit a vulnerability in a sequencing platform. None of the conventional detection signals fire. There is no incident to respond to.

What they need is a position on the network path between two endpoints, and the storage to write ciphertext to disk at scale. Both have been technically and economically accessible to state-level actors for more than a decade. The only thing missing, until recently, was confidence that the future half of the operation, the decryption, would eventually become feasible.

That confidence now exists. Whatever one's view of specific timelines, the probability that the instrument will arrive within the useful lifetime of today's encrypted traffic is no longer negligible.

The consequence is that the present exposure is already sunk. No cryptographic migration undertaken today can protect data that has already crossed a wire in a form that will be readable in fifteen years. That data is, for practical purposes, copied.

This is the most difficult thing to communicate to non-technical stakeholders. When the migration to post-quantum cryptography completes, it will not make the problem go away. It will stop adding to it.

The cryptography is the easy part

The instinct, facing a problem framed this way, is to focus on algorithms. Which post-quantum schemes are ready? Which has NIST standardised? Which lattice constructions are robust against side-channel attacks? These are important questions, and they have largely been answered: NIST has standardised ML-KEM for key encapsulation and ML-DSA and SLH-DSA for signatures, with ongoing work on alternatives.

The algorithms are the easy part. The difficult work begins the moment they need to be deployed.

Deploying post-quantum cryptography at the scale of a national or supranational data infrastructure requires, first, knowing where cryptography currently lives. Not just in the obvious places (TLS terminators, application-level encryption, certificate authorities), but in firmware on instruments, in hardcoded libraries inside platforms, in the authentication modules of interoperability layers, and in legacy middleware that nobody has touched in five years because it works.

Most organisations do not have this inventory. In a fragmented infrastructure of national systems, institutional procurement, and international vendors, no single actor owns the map. The inventory itself is a governance deliverable before it is a technical one: it requires agreement on scope, on ownership, on what counts as a cryptographic asset, and on who updates the catalogue when something changes.

Once the inventory exists, the second question is who controls the migration path for each item. Some assets are under the direct control of the operator. Others depend on a vendor releasing a firmware update. Others depend on a standards body publishing a new specification. Others depend on a cloud provider rolling out support in a managed service. For any given asset, the operator can migrate no faster than the slowest dependency in that chain.
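One way to make the slowest-dependency rule concrete is to record each asset together with the parties it waits on. This is an illustrative sketch only: the CryptoAsset structure, the asset name, the owners, and all dates are hypothetical and do not describe any real inventory tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class CryptoAsset:
    """One row of a hypothetical cryptographic inventory."""
    name: str
    algorithm: str
    owner: str
    # Party -> earliest date that party can deliver its part of the migration.
    dependencies: Dict[str, date] = field(default_factory=dict)

    def blocking_dependency(self) -> Optional[str]:
        """Return the party with the latest delivery date, if any."""
        if not self.dependencies:
            return None
        return max(self.dependencies, key=lambda party: self.dependencies[party])

    def earliest_migration(self) -> Optional[date]:
        """The asset cannot migrate before its slowest dependency delivers."""
        if not self.dependencies:
            return None
        return max(self.dependencies.values())


# Hypothetical example: an instrument whose firmware terminates TLS.
sequencer = CryptoAsset(
    name="sequencer-firmware-tls",
    algorithm="ECDHE P-256",
    owner="lab operations",
    dependencies={
        "operator change window": date(2026, 9, 1),
        "standards body profile": date(2027, 1, 1),
        "vendor firmware release": date(2029, 6, 1),
    },
)
print(sequencer.blocking_dependency(), sequencer.earliest_migration())
```

Even in this toy form, the record makes the governance point visible: the operator's own readiness date is irrelevant as long as someone else in the chain delivers later.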

This is why crypto-agility, the capacity to swap cryptographic primitives without redesigning the systems that use them, has emerged as the central architectural concern. But crypto-agility is not a property of an algorithm. It is a property of how infrastructure is designed, procured, and operated. It requires abstraction layers that most existing systems do not have, contract terms that most existing vendor relationships do not include, and operational discipline that most existing teams are not organised to deliver.
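A small sketch of what such an abstraction layer looks like in code. Because the Python standard library ships no post-quantum primitives, hash functions stand in here for the swappable primitive; the seal and verify names and the POLICY dictionary are hypothetical. The two ideas being illustrated are that call sites never name an algorithm, and that stored data records which algorithm produced it.

```python
import hashlib
import hmac

# The algorithm choice lives in configuration, not in call sites.
# Changing this one value switches the primitive for all new data.
POLICY = {"digest": "sha256"}


def seal(data: bytes, policy: dict = POLICY) -> dict:
    """Fingerprint data and record which algorithm produced the fingerprint."""
    alg = policy["digest"]
    return {"alg": alg, "digest": hashlib.new(alg, data).hexdigest()}


def verify(data: bytes, envelope: dict) -> bool:
    """Verify with the algorithm named in the envelope, not the current policy."""
    expected = hashlib.new(envelope["alg"], data).hexdigest()
    return hmac.compare_digest(expected, envelope["digest"])


old = seal(b"sample record", {"digest": "sha256"})
new = seal(b"sample record", {"digest": "sha3_256"})
print(old["alg"], verify(b"sample record", old))
print(new["alg"], verify(b"sample record", new))
```

Records sealed under the old policy remain verifiable after the policy changes, which is the property a migration needs: old and new primitives coexist while the archive is worked through.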

In this sense, the quantum migration inherits a familiar pattern. The technical decision, choosing ML-KEM, is a small part of the work. The organisational decision, rebuilding how cryptographic dependencies are managed, is where the difficulty lives.

The jurisdictional complication

For European data, there is an additional layer that makes the arithmetic worse, and for genomic data the layer is particularly thick.

Many of the largest aggregations of human genetic data are held under non-European jurisdictions, whether by sovereign biobanks abroad, by consumer genetic services, or by research consortia hosted on infrastructure provided by non-European hyperscalers. The compromise that has allowed European data to coexist with such arrangements has been customer-managed encryption: the cloud provider operates the infrastructure, but the customer controls the keys. When a foreign authority issues an extraterritorial request for data, the provider can comply in the legal sense, by producing the ciphertext, while the customer's keys remain out of reach. The data is technically disclosed, but unreadable.

This compromise depends entirely on the classical cryptography assumption. It assumes that ciphertext produced by today's algorithms, even if handed to a foreign authority, remains computationally useless. That assumption is what allows the arrangement to be described as privacy-preserving.

When the assumption expires, so does the compromise. Ciphertext collected under an extraterritorial production order today becomes readable material tomorrow, in the same way that ciphertext intercepted on a fibre link does. The legal instrument is different, the adversary is different, but the underlying exposure is the same: data obscured rather than protected, waiting for an instrument to arrive.

For genetic data the consequences are categorical. A genomic dataset disclosed in fifteen years is not a historical curiosity. It is operational intelligence about the people in it, and about their living relatives, and about populations that may have been the subject of targeted study without ever being told. Genetic data does not age out of relevance.

The European regulatory architecture has begun to encode this shift, but not yet in a way that forces action on the timeline the arithmetic demands. The debates around cloud certification, supply-chain assurance, and operational resilience are converging on the right questions. They are not yet converging fast enough.

What can actually be done

There is no version of this problem where the right response is to wait and see. The traffic is being collected now, and the migration path is long.

The useful posture, for any organisation holding long-lived genetic or biometric data, involves three things, none of which are primarily about cryptography.

The first is the inventory. Not an audit for compliance, but an operational map of where cryptographic primitives are used, what their replacement paths look like, and who owns each migration. This is slow work, and it does not produce visible deliverables until it is complete. It is also the prerequisite for everything else.

The second is a procurement posture. New contracts for infrastructure, instruments, and cloud services should require crypto-agility as a functional requirement, not an aspirational one. Specifically: support for hybrid classical-plus-post-quantum schemes during the transition, defined algorithm replacement pathways, and clear vendor commitments on migration timelines. The cost of adding these requirements now is trivial compared to the cost of not having them in five years.
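For readers unfamiliar with the term, a hybrid scheme derives the session key from two independent shared secrets, so that an attacker has to break both exchanges. The sketch below shows only the combining step, using HKDF (RFC 5869) built from the standard library. The function names and the "hybrid-example" label are hypothetical, the two input secrets are placeholders for the outputs of a real classical exchange and a real post-quantum KEM, and this is not a substitute for a vetted protocol implementation.

```python
import hashlib
import hmac


def hkdf_sha256(key_material: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869): extract-then-expand using HMAC-SHA256."""
    if not salt:
        # RFC 5869: an absent salt is replaced by HashLen zero bytes.
        salt = bytes(hashlib.sha256().digest_size)
    prk = hmac.new(salt, key_material, hashlib.sha256).digest()
    output, block, counter = b"", b"", 1
    while len(output) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        output += block
        counter += 1
    return output[:length]


def hybrid_session_key(classical_secret: bytes, pq_secret: bytes) -> bytes:
    """Derive one key from both secrets; recovering it requires knowing both."""
    # Assumes fixed-length secrets, so plain concatenation is unambiguous.
    return hkdf_sha256(classical_secret + pq_secret, salt=b"", info=b"hybrid-example")


# Placeholder secrets for illustration only.
key = hybrid_session_key(b"\x01" * 32, b"\x02" * 32)
print(len(key))
```

The contractual point is that a vendor supporting this pattern can retire the classical half later without redesigning the key schedule.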

The third is a realistic view of what the migration will not fix. The data already in transit, already in backups, already in long-term research archives, is not protected by any future cryptographic upgrade. It requires a separate category of decision: whether to re-encrypt, whether to move to cryptographically different storage, whether to accept the exposure and focus mitigation on the data that can still be saved. These are not questions that have good answers, but they are questions that need to be asked, because the alternative is to discover the answer when the key arrives.

Closing

Nine years ago I asked whether Feynman would kill Bitcoin. The honest answer was probably yes, eventually, but the stakes were containable. A speculative asset can lose its value. A market can be repriced. A protocol can be forked. The harm, if it ever materialised, would be financial and bounded.

The question I am asking now is whether the same instrument, on the same timeline, will expose the genomic identity of populations that have already given their samples to systems built on cryptographic assumptions that have not aged well. The answer is also yes. The stakes are not containable in the same way. A genome cannot be repriced or reissued. The relatives implicated by it cannot be opted out retroactively. The harm, when it materialises, is not financial.

The standard framing of the quantum threat is that a new class of computer will break today's encryption. This is true, but it puts the emphasis in the wrong place. The more accurate framing is that the useful life of our cryptography is shorter than the useful life of our data, and that the gap is being exploited in the present tense. The quantum computer, when it arrives, will be the instrument that converts an existing archive of ciphertext into an existing archive of plaintext. The harvesting is happening now.

The encryption has an expiration date. The data does not.

The Commit Is the Deploy

2026-04-18T00:00:00+00:00

The problem with doing things properly

Episode 2 ended with a two-node Proxmox cluster running several LXC containers, each hosting Docker Compose stacks. The architecture was deliberately segmented: separate containers for network services, media, personal tools, secrets management. Better isolation, clearer boundaries, smaller blast radius.

But segmentation has an operational cost.

Adding a service meant opening an SSH session to the right container, creating the Compose file, writing the environment variables, copying secrets from wherever they were last stored, running the stack, and hoping the configuration matched what was running on the other hosts. Updating a service meant doing this again. Updating across multiple hosts meant doing it several times and keeping track mentally. Rolling back meant remembering what the previous state was.

At some point, I realized the architecture I had built to reduce risk was introducing a different kind of risk: the risk of not changing things because changing them was tedious. Services stopped getting updated. New ideas stayed in a note file instead of being deployed. The infrastructure was sound but operationally stale.

The previous setup, a single server running OpenMediaVault with Watchtower, had the opposite problem. Updates happened automatically with no review, no versioning, and no rollback path. That works until it does not. With multiple hosts, uncoordinated blind updates would have been worse: the same unreviewed change happening independently on different machines, with no record of what changed.

The question was not whether to automate. It was how to automate in a way that remains legible, auditable, and reversible.

The repository as source of truth

The entire service layer is defined in a single private repository. Every Docker Compose file, every environment variable, every encrypted secret, every host-to-stack mapping lives in version control. The repository is not a mirror of what is running. It is the definition of what should be running.

The structure separates what runs where from how it is configured. The repository has three main areas: host/ declares which stacks run on each host via a stacks.yaml file, stacks/ contains the actual Compose definitions and encrypted secrets for each service, and global/ holds shared configuration. The full flow from commit to running containers is described under "How a deploy happens".

Each host has a stacks.yaml that is just a list of stack names. Nothing else. The actual stack definitions, Compose files, environment variables, and SOPS-encrypted secrets, live under stacks/, organized by service name.

Adding a service to a host is a one-line change in stacks.yaml. Moving a service between hosts is two one-line changes. The stack definition itself does not know or care which host runs it.
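The host-to-stack mapping is simple enough to resolve in a few lines. This sketch is not the actual deploy script: it assumes stacks.yaml is a flat list with one "- name" entry per line, that each stack lives in stacks/<name>, and the stack names other than authentik are made up for the example. A real script would more likely use a YAML library.

```python
from pathlib import PurePosixPath
from typing import List


def parse_stack_names(stacks_yaml_text: str) -> List[str]:
    """Parse a flat YAML list ("- name" per line), ignoring comments and blanks."""
    names = []
    for line in stacks_yaml_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.startswith("- "):
            names.append(line[2:].strip().strip("'\""))
    return names


def stack_dirs(names: List[str]) -> List[str]:
    """Map stack names to their definition directories under stacks/."""
    return [str(PurePosixPath("stacks") / name) for name in names]


# Hypothetical stacks.yaml content for one host.
example = """\
# network host
- reverse-proxy
- authentik  # identity provider
"""
names = parse_stack_names(example)
print(stack_dirs(names))
```

Because the stack directory is derived from the name alone, nothing under stacks/ has to change when a name moves from one host's list to another's.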

The repository is hosted on GitHub but mirrored to a local Gitea instance. Deploys pull from the local mirror when available, so they do not depend on a single external service being reachable.

How a deploy happens

A systemd timer on each host runs the deploy script every five minutes. The deploy loop is timer-based rather than event-driven. For this setup, the simplicity and failure isolation of polling outweighed the benefits of a push-based model: no webhook endpoint to expose, no CI/CD pipeline to maintain, no message broker to run. If a cycle fails, the next one retries from a clean state.

The script does four things in sequence.

Pull. It fetches the latest state from the remote repository. The remote is the source of truth. Local changes are discarded via hard reset. This is deliberate: if something was modified directly on a host, it was either a temporary fix that should have been committed, or a mistake that should not persist.

Decrypt. It checks which secret files have changed since the last pull. If any *.secret.env file was modified, or if a decrypted copy is missing, it decrypts only the affected files. The decryption runs SOPS inside a pinned, digest-verified container image, using an AGE private key that lives on the host and is never committed. The decrypted files are written with mode 600 and excluded from version control.

Teardown. If a stack was removed from a host's stacks.yaml since the last deploy, the script detects the orphaned containers and brings them down. This works in two passes: first by comparing running Compose projects against the declared list, then by scanning Docker labels to catch containers that Compose parsing might miss, such as stacks whose directory was deleted or whose environment files are no longer resolvable.

Converge. It iterates over the stacks declared for that host, loads the relevant environment files and decrypted secrets, pulls the latest images, and runs docker compose up -d. If nothing changed, Docker sees the same configuration and does nothing.

The result is convergent. The system continuously aligns itself to the state defined in the repository. A commit is, within five minutes, a deploy.
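The four steps can be sketched as a function that plans one cycle and returns the commands it would run, without executing anything. This is a reconstruction for illustration, not the real script: the branch name, the compose.yaml file name, the stacks/<name> layout, and the use of the stack name as the Compose project name are all assumptions.

```python
from typing import Iterable, List


def plan_cycle(declared: List[str], orphans: Iterable[str],
               branch: str = "main") -> List[List[str]]:
    """Return the commands for one deploy cycle, in execution order."""
    commands: List[List[str]] = [
        # Pull: the remote is the source of truth, local changes are discarded.
        ["git", "fetch", "origin"],
        ["git", "reset", "--hard", f"origin/{branch}"],
    ]
    # Decrypt: would run here; omitted because it depends on host key handling.

    # Teardown: bring down projects that are no longer declared for this host.
    for name in sorted(orphans):
        commands.append(["docker", "compose", "-p", name, "down"])

    # Converge: pull images and bring every declared stack up.
    for name in declared:
        base = ["docker", "compose", "-p", name, "-f", f"stacks/{name}/compose.yaml"]
        commands.append(base + ["pull"])
        commands.append(base + ["up", "-d"])
    return commands


for command in plan_cycle(declared=["dns", "proxy"], orphans={"old-wiki"}):
    print(" ".join(command))
```

Running the same plan twice against an unchanged repository yields the same commands, and Docker Compose treats an unchanged configuration as a no-op, which is what makes the five-minute loop safe to repeat indefinitely.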

This is not a full GitOps controller in the Kubernetes sense. There is no declarative reconciler, no admission webhook, no per-container state comparison. But there is a basic form of drift enforcement: the teardown pass removes any running container that is not declared in the host's stacks.yaml. If something was started manually or left behind from a previous configuration, it gets cleaned up within five minutes. It is a simpler pull-based convergence loop that applies the same core idea, Git as the source of truth, without introducing an additional control plane. For the scale of this infrastructure, a deploy script and a timer are the right tool. The complexity is in the workflow, not in the tooling.
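The two-pass orphan detection reduces to set arithmetic once the inputs are collected. In this sketch the inputs are passed in as plain data rather than queried from Docker; com.docker.compose.project is the label Docker Compose attaches to the containers it creates, while the function name and the example project names are hypothetical.

```python
from typing import Dict, Iterable, Set

COMPOSE_PROJECT_LABEL = "com.docker.compose.project"


def find_orphans(declared: Set[str],
                 compose_projects: Iterable[str],
                 container_labels: Iterable[Dict[str, str]]) -> Set[str]:
    """Return Compose project names that are running but not declared."""
    # Pass 1: projects that Compose itself reports as running.
    from_compose = set(compose_projects) - declared
    # Pass 2: projects visible only through container labels, for example
    # when the stack directory was deleted and Compose can no longer parse it.
    from_labels = {
        labels[COMPOSE_PROJECT_LABEL]
        for labels in container_labels
        if COMPOSE_PROJECT_LABEL in labels
    } - declared
    return from_compose | from_labels


orphans = find_orphans(
    declared={"dns"},
    compose_projects=["dns", "media"],
    container_labels=[
        {COMPOSE_PROJECT_LABEL: "dns"},
        {COMPOSE_PROJECT_LABEL: "ghost"},
        {},  # a container with no Compose label is not covered by this sketch
    ],
)
print(sorted(orphans))
```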

Encrypted secrets in the repository

The first objection to storing infrastructure in Git is usually about secrets. API keys, database passwords, SMTP credentials. They cannot go into a repository in plaintext, even a private one.

SOPS solves this. Every file matching *.secret.env is encrypted with SOPS before being committed, using AGE as the key backend. The values are encrypted with AES-256-GCM under a per-file data key, and that data key is in turn encrypted to the AGE public key listed in the repository's .sops.yaml configuration. The private key stays on each host, in a protected directory under root, and is provisioned once during bootstrap.

What this means in practice: the repository contains files like authentik.secret.env that are fully encrypted. When the deploy script runs, it decrypts them into .decrypted.authentik.secret.env, which Docker Compose then reads as environment variables. The decrypted files are gitignored and never leave the host.
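The selection rule from the decrypt step (decrypt a file if it changed, or if its decrypted copy is missing) can be sketched as a pure function. The .decrypted. prefix follows the naming described above; placing the decrypted copy in the same directory as the encrypted file is an assumption, as are the function names and the example paths.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def decrypted_path(secret_file: str) -> str:
    """Map stacks/x/x.secret.env to stacks/x/.decrypted.x.secret.env."""
    path = PurePosixPath(secret_file)
    return str(path.with_name(".decrypted." + path.name))


def files_to_decrypt(secret_files: Iterable[str],
                     changed: Set[str],
                     present: Set[str]) -> List[str]:
    """Select secrets that changed in the last pull or lack a decrypted copy."""
    return [
        secret for secret in secret_files
        if secret in changed or decrypted_path(secret) not in present
    ]


secrets = ["stacks/a/a.secret.env", "stacks/b/b.secret.env", "stacks/c/c.secret.env"]
todo = files_to_decrypt(
    secrets,
    changed={"stacks/a/a.secret.env"},
    present={"stacks/a/.decrypted.a.secret.env", "stacks/b/.decrypted.b.secret.env"},
)
print(todo)
```

Decrypting only what is needed keeps the private key in use for as short a time as possible on each cycle.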

Day to day, I edit secrets from my laptop, which has an encrypted disk and a VS Code plugin that handles SOPS decryption and re-encryption transparently. Open the file, edit the value, save. The plugin re-encrypts on save. The commit contains only ciphertext. If I ever need to edit secrets from a different machine, the fallback is minimal: any computer with Git, SOPS, and AGE installed is enough. The AGE private key is stored in my password manager, so the recovery path does not depend on a single device.

On the workstation where I edit secrets day to day, the AGE private key does not live on disk. A shell wrapper fetches it from the macOS Keychain at decrypt time, passes it to SOPS via stdin, and discards it. The key exists in memory for the duration of the operation and nowhere else on the filesystem. An encrypted disk that is actively unlocked is not a meaningful barrier against a process reading arbitrary files. The Keychain approach means that even on a running, unlocked machine, the AGE key is not a file that can be copied or stumbled upon. The password manager copy exists for recovery from a different machine, not for routine use.

This workflow has a deliberate trade-off: secrets cannot be edited from a phone or a web browser. For a homelab, this is an acceptable constraint. It is also, arguably, a feature. The inability to modify secrets from an arbitrary device is a security boundary, not a limitation.

The security model is convenience-aware, not convenience-optimized. Compromising a host would expose its AGE key and the decrypted secrets for its local stacks. It would not automatically grant access to all data: services are isolated at the container and network level, and hosts are not directly reachable from outside the network. There is no SSH access enabled by default; the firewall blocks it unless a temporary rule is explicitly created. This is still a significant improvement over the previous setup, where secrets lived in plaintext files on each host, manually copied and occasionally forgotten.

Updates as pull requests

Images in the Compose files are not tagged with latest or alpine. Every image reference includes a static version tag and a SHA256 digest. This means the image that runs is always the exact image that was committed. If an upstream registry is compromised or a tag is reassigned, the digest mismatch prevents the wrong image from being pulled. Mutable tags assume that upstream registries are always trustworthy. Recent supply-chain incidents suggest otherwise.
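A pinning rule like this is easy to check mechanically. The sketch below is a simplified validator, not a full parser for image references: it requires a tag and a sha256 digest, and it uses a deliberately crude heuristic (the tag must contain a digit) to reject names such as latest or alpine. The function name and the example references are hypothetical.

```python
import re

DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def is_pinned(image: str) -> bool:
    """Return True for references of the form name:version@sha256:<64 hex>."""
    name_tag, at_sign, digest = image.partition("@")
    if not at_sign or not DIGEST_RE.fullmatch(digest):
        return False
    name, colon, tag = name_tag.rpartition(":")
    # No colon means no tag; a slash after the last colon means the colon
    # belonged to a registry port such as registry:5000/app.
    if not colon or "/" in tag:
        return False
    # Heuristic: a static version tag contains at least one digit.
    return any(ch.isdigit() for ch in tag)


fake_digest = "sha256:" + "a" * 64  # placeholder, not a real image digest
for ref in ("postgres:16.4@" + fake_digest,
            "nginx:latest@" + fake_digest,
            "nginx:1.27"):
    print(ref.split("@")[0], is_pinned(ref))
```

A check of this kind fits naturally in a pre-commit hook or CI step, so that an unpinned reference never reaches the branch the hosts deploy from.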

Keeping those digests current is Renovate's job. Renovate scans the Compose files weekly, detects available updates, and opens a pull request with the new tag and digest already applied. The PR is reviewed, and once merged, the next deploy cycle picks it up.

Not all updates are treated equally. Services that do not handle sensitive data and whose compromise would not expose other parts of the infrastructure allow digest-only updates to auto-merge: same tag, rebuilt image, no breaking change expected. For everything else, including any minor or major version bump, the PR requires manual review. Major PostgreSQL upgrades, for instance, trigger an explicit warning in the PR body because a major version bump is incompatible with the existing data directory without migration.
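The merge policy amounts to a single predicate. In practice it would be expressed in Renovate's own configuration; the Python version below is only a compact restatement of the rule, with a hypothetical function name and a hypothetical low_risk flag standing for "does not handle sensitive data and would not expose other parts of the infrastructure if compromised". The update type strings mirror the categories Renovate uses.

```python
def requires_manual_review(update_type: str, low_risk: bool) -> bool:
    """Only digest-only updates to low-risk services may auto-merge."""
    auto_merge_allowed = update_type == "digest" and low_risk
    return not auto_merge_allowed


for update_type, low_risk in [("digest", True), ("digest", False),
                              ("minor", True), ("major", False)]:
    print(update_type, low_risk, requires_manual_review(update_type, low_risk))
```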

The result is that updates are deliberate, traceable, and reversible. Every image change is a commit with a clear diff. If an update breaks something, the fix is a revert. If I want to know what version of a service was running three weeks ago, I check the Git history. The PR history also serves as an implicit changelog, which matters for the backup episode: knowing exactly what was running at the time of a backup makes restores meaningful.

Images are reviewed before they run

Renovate ensures that an image update is intentional and traceable. It does not verify what is inside the image.

Trivy runs as a GitHub Actions step on every pull request that Renovate opens. Before a PR can be merged, Trivy scans the updated image against known CVE databases. A high or critical severity finding blocks the merge. The review step that was already required for version bumps now has a second layer: the image content is checked, not just the tag and digest.
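The blocking rule can be sketched as a filter over a scan report. The report shape assumed here (a top-level Results list whose entries carry a Vulnerabilities list with VulnerabilityID and Severity fields) matches Trivy's JSON output as I understand it, but treat it as an assumption; the sample report and its identifiers are invented. In a real workflow the same effect is usually achieved with Trivy's own severity filter and exit code options rather than custom code.

```python
import json
from typing import List

BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def blocking_findings(report_json: str) -> List[str]:
    """Return the IDs of findings severe enough to block a merge."""
    report = json.loads(report_json)
    blocking = []
    # "Results" and "Vulnerabilities" may be missing or null in a clean scan.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING_SEVERITIES:
                blocking.append(vuln["VulnerabilityID"])
    return blocking


# Invented sample report; the identifiers are placeholders.
sample = json.dumps({
    "Results": [
        {"Target": "example-image", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "LOW"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "CRITICAL"},
        ]},
        {"Target": "example-layer", "Vulnerabilities": None},
    ]
})
findings = blocking_findings(sample)
print("merge blocked:", bool(findings), findings)
```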

The two controls are orthogonal. Digest pinning, kept current by Renovate, ensures that the image that runs is the one that was committed and that the commit was deliberate. Trivy checks that the image content has no known high or critical vulnerabilities at the time of review. Neither replaces the other. A clean Trivy report on a mutable tag would mean nothing. A pinned digest with a critical CVE would be a problem regardless of how traceable it is. Together they cover the cases the other misses.

This does not eliminate supply chain risk. An image that passes Trivy today may have a new CVE disclosed tomorrow. The protection is at review time, not continuously. But it means that the update cycle, which Renovate already makes systematic and weekly, includes a security signal at each iteration rather than treating image content as a trusted input.

What this changes operationally

The deploy pipeline removed an entire category of work.

Creating a new service means writing a Compose file and an environment file, encrypting any secrets, adding the stack name to a host's stacks.yaml, and committing. Within five minutes, the service is running. No SSH session needed.

Removing a service means deleting the stack name from stacks.yaml and committing. The teardown pass detects the orphaned containers and removes them.

Moving a service between hosts means editing two stacks.yaml files. The old host tears it down, the new host brings it up.

Rolling back a broken change means reverting the commit. The next deploy cycle restores the previous state.

All of this works from any Git client. A laptop, a tablet, a phone with a Git app. The only operation that requires something beyond a Git client is editing encrypted secrets, which needs any device with the AGE key and SOPS installed.

New LXC containers are provisioned from a common template that already includes Docker, the deploy tooling, and the SSH configuration for the pipeline. Cloning the template, assigning a VLAN, and adding a host/ directory with a stacks.yaml is the entire onboarding process for a new host.

What I would not do again

Relying on automated image updates with no review step. Watchtower on the old OMV server would pull whatever was newest. For a single machine running non-critical services, the blast radius was limited. For a segmented multi-host setup, uncoordinated updates across hosts with no record of what changed would have been a liability.

Using latest or mutable tags in Compose files. A tag that can be reassigned upstream is not a version pin. It is a hope. Digest pinning is more verbose, but it means the deploy is deterministic. What you committed is what runs.

Storing secrets outside version control entirely. Before this setup, secrets lived in files on each host, manually copied, occasionally forgotten. Encrypting them in the repository means they are versioned, backed up, and hard to lose by accident. The trade-off is the key management overhead, but that overhead is small and one-time.

What versioned infrastructure actually means

The shift from "I configured services on hosts" to "the repository defines what runs" is not primarily about automation. It is about legibility.

At any point, I can read the repository and know the complete state of the infrastructure. Which services run on which host. What version of each image is deployed. What secrets exist and when they were last changed. What was different a month ago.

The setup phase is more complex than a traditional Compose-based approach. Once in place, the day-to-day operation is simpler: changes are centralized, repeatable, and do not require direct access to hosts. The system shifts complexity from routine operations to initial design. The cost is upfront. The benefit is that routine changes become trivial.

This is the same principle that makes infrastructure as code valuable in enterprise contexts, but in a homelab, there is an additional dimension. I am the only operator. There is no team to hand off to, no documentation that someone else will read. But there is a future version of myself who will need to understand a decision made today. The commit history is that documentation.

In professional contexts, particularly in public administration, the absence of this kind of traceability is common. Systems are configured manually, updates happen through undocumented procedures, and the gap between what is supposed to be running and what is actually running grows silently. Having operated a system where every change is a commit with a timestamp, a diff, and an author has changed how I evaluate infrastructure governance. The question is no longer whether a system works. It is whether anyone can tell you why it works, and what changed last.

Two Nodes, One Lesson in Constraint

2026-04-15T00:00:00+00:00

The plan that did not survive contact with reality

The original design had four nodes.

A network node for the firewall, DNS, and routing. A compute node for general workloads. A storage node for data-heavy services and ZFS pools. A backup node running Proxmox Backup Server and a Borg endpoint for host-level datasets.

Each had a clear role. The separation was clean on paper. The first three would run Proxmox Virtual Environment, the fourth Proxmox Backup Server as its primary OS.

It did not last.

Why four became two

The problem was power consumption. With all four nodes online, the homelab drew over 400 watts continuously. For a residential setup running 24/7, that is not a minor detail. It is a recurring cost that forces a conversation about what the infrastructure actually needs versus what it would be nice to have.

The first thing to go was the compute node. Its workloads were not demanding enough to justify a dedicated machine. I migrated everything onto the storage node, which had the physical capacity for it, and upgraded its CPU from an Intel N150 to an AMD Ryzen 7 PRO 8845HS with 96 GB of RAM. The storage node became the general-purpose workhorse: compute, storage, and most services on a single machine.

The backup node followed. The original idea was to power it on only when needed, run the backup jobs, then shut it down. In practice, managing scheduled wake-ups and ensuring consistency across timed operations added complexity that was not worth the savings. The simpler answer was a Proxmox Backup Server VM running on the storage node, with its data sitting in a dedicated ZFS dataset with constrained quotas. The bulk of the backup capacity, following a 3-2-1 strategy, lives offsite on S3-compatible object storage. But backup deserves its own episode.

The result is two nodes running Proxmox VE. Power consumption dropped from over 400 watts to around 120.
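To put the difference in terms of energy rather than instantaneous draw, here is the arithmetic as a short sketch. It assumes a constant draw around the clock and takes the approximate figures above at face value (400 W is a lower bound for the old setup, 120 W an approximation for the new one); no electricity price is assumed.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts: float) -> float:
    """Energy used in one year at a constant draw, in kilowatt-hours."""
    return watts * HOURS_PER_YEAR / 1000


before = annual_kwh(400)  # four nodes, "over 400 watts"
after = annual_kwh(120)   # two nodes, "around 120"
print(f"before: {before:.0f} kWh/year")
print(f"after:  {after:.0f} kWh/year")
print(f"saved:  {before - after:.0f} kWh/year ({1 - after / before:.0%})")
```

That is a reduction of roughly 70 percent, or on the order of 2,450 kWh per year under these assumptions.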

Power was not the only physical constraint. The previous server, a single machine running OpenMediaVault with eight mechanical drives, lived in the living room. The noise was constant and impossible to ignore. The current setup sits in an attic room, well away from living and sleeping areas, which solved the proximity problem entirely. But the choice of CPUs, a laptop-class Ryzen 8845HS for the storage node and an efficiency-oriented N150 for the network node, was driven as much by thermal output and fan noise as by power draw. The network node uses a large passive heatsink with a fan that only spins under sustained load, which is rare. The storage node runs Noctua fans throughout, including supplementary ones to keep temperatures manageable during summer. Less heat means less cooling. Less cooling means less noise. The same constraint, approached from three angles.

The two nodes

Both nodes share a few baseline design choices. The Proxmox OS disk on each is a ZFS mirror across two NVMe drives, so that a single drive failure does not take down the host. This was a deliberate decision from the original four-node design, carried forward into the current layout. All devices (both nodes, the Home Assistant Yellow, the inter-node switch, and the active network equipment) sit behind a UPS. The two nodes and the Yellow are connected through the same 10 Gbps switch, which matters less for daily operations than for live migration and backup traffic when they occur.

pve-network

This is the network node. An Intel N150 with 16 GB of RAM, four cores, modest in every respect.

It runs three things: OPNsense as a virtual machine, a UniFi controller LXC for wireless management, and a network-services LXC that hosts the reverse proxy and the identity provider. Episode 1 covered why the firewall is virtualized and what that entails. The point here is different: this node exists because its power draw is low enough to justify keeping it permanently on, and its workload is stable enough that it rarely needs attention.

The N150 is not powerful. It does not need to be. Routing, DNS resolution, VPN termination, and reverse proxying are not computationally expensive. What matters is that the network layer is physically and logically independent from everything else. If the storage node goes down for maintenance, the network stays up. If a container on the storage node misbehaves and consumes all available resources, the firewall and reverse proxy are unaffected.

This is the same separation of concerns described in Episode 1, but at the hardware level.

pve-storage

This is where everything else runs. An AMD Ryzen 7 PRO 8845HS with 96 GB of RAM and eight HDD bays.

The eight bays were the reason this node could absorb the roles of the others. Four of those bays currently hold the RAIDZ2 pool that provides the primary storage for service data and local backups. The remaining four host two MergeFS pools used for temporary and transient files. The disks recovered from the decommissioned backup node will likely expand the RAIDZ2 pool in the future, but there is no immediate need.

The CPU upgrade was necessary once the node started hosting compute workloads that the N150 could never have handled: machine learning inference for the photo library, and an NVR with a Google Coral TPU for object detection.

This node hosts two VMs (Proxmox Backup Server and a Windows machine for a specific use case) and several LXC containers, each dedicated to a functional domain.

The third device

There is a third device in the cluster, though it is not a Proxmox node.

A Home Assistant Yellow sits alongside the two servers. It runs Home Assistant for home automation, but it also serves an infrastructure purpose: it hosts a Corosync instance that provides the quorum vote needed for the Proxmox cluster to function correctly with only two nodes.

A two-node Proxmox cluster has a fundamental problem: if one node goes down, the surviving node cannot establish quorum on its own, which limits its ability to manage cluster resources. Adding a third Corosync voter solves this without adding a third hypervisor.
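The arithmetic behind this is short. A minimal sketch, assuming one vote per participant and a strict-majority rule; the function names are illustrative and not part of any Proxmox or Corosync API.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def has_quorum(votes_online: int, total_votes: int) -> bool:
    """True if the partition that is still online holds a majority."""
    return votes_online >= quorum_threshold(total_votes)


# Two hypervisors only: losing one leaves 1 of 2 votes.
print("2 voters, 1 online:", has_quorum(1, 2))

# Two hypervisors plus a third voter: losing one hypervisor leaves 2 of 3.
print("3 voters, 2 online:", has_quorum(2, 3))
```

With two voters the threshold is two, so a single surviving node is always in the minority. With three voters the threshold is still two, so any one of them can be offline without the cluster losing quorum.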

I could have virtualized Home Assistant on one of the Proxmox nodes. But I already had the Yellow, it consumes very little power, and keeping home automation on a separate physical device means it remains operational during Proxmox maintenance or outages. The Yellow runs on an M.2 SSD rather than a microSD or eMMC, a deliberate choice given that it now carries a cluster-level responsibility. When the servers are down for updates, the lights still work. That felt like the right trade-off.

How the pieces fit together

Choosing the hypervisor

Proxmox was not the result of a comparative evaluation. It was the natural next step from where I was.

Before the multi-node setup, I ran a single server with OpenMediaVault. OMV gave me a Docker Compose plugin for services and a KVM plugin for the occasional virtual machine. It worked, but it was clearly designed as a NAS operating system with virtualization bolted on, not as a hypervisor with storage capabilities built in.

Proxmox offered what OMV could not: native ZFS integration, a proper backup system with Proxmox Backup Server, support for live migration between nodes, and the possibility of experimenting with high availability. It runs on Debian, which means that when something breaks, the debugging tools are standard and the filesystem is accessible. That mattered more than any feature comparison.

The storage node also exposes an NFS share to the network node, which enables migration of containers between hosts when needed. Proxmox restarts a container on the target host rather than moving it live, so the interruption is brief but not zero. This is not something I use routinely, but it means maintenance on one node does not require shutting down everything that runs on it.

When a VM, when a container

This is the decision that shapes the entire compute layer.

Proxmox supports both full virtual machines (KVM) and Linux containers (LXC). They solve different problems, and confusing the two leads to designs that are either wasteful or fragile.

A VM gets its own kernel. It is fully isolated from the host. It can run a different operating system. It carries overhead: memory for the guest kernel, CPU cycles for hardware emulation, a virtual disk that adds a layer between the application and the physical storage.

An LXC container shares the host kernel. It is lighter, faster to start, and uses fewer resources. But it cannot run a different OS, and the isolation boundary is thinner. A misconfigured container can, in certain scenarios, affect the host in ways a VM cannot.

My default choice is LXC. The reasoning is straightforward: the CPUs in this setup are adequate but not overpowered, power consumption is a constant constraint, and every layer of overhead that can be removed should be, unless there is a specific reason to keep it. LXC containers start in seconds, consume only the memory their processes actually use, and are simpler to manage and template. For the workloads I run, the reduced isolation compared to a full VM is an acceptable trade-off, assessed against the actual risk profile of each service. Nearly all containers run unprivileged. The one exception requires limited privileges to pass through the AMD integrated GPU and a TUN device, which the kernel will not map correctly to an unprivileged container regardless of how creative you get with device permissions. I spent weeks confirming this.

The exceptions are few and justified. OPNsense runs as a VM because it is FreeBSD and requires direct access to network interfaces via PCI passthrough. Proxmox Backup Server runs as a VM because it manages its own storage independently and benefits from the stronger isolation boundary. A Windows VM exists for a specific use case that requires it.

Everything else is an LXC container running Docker Compose stacks. The container provides the OS environment and the network identity. Docker provides the application layer.
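The LXC-first rule with explicit exceptions can be written down as a small decision function. This is a sketch of the reasoning above, not a tool used in this setup: the `Workload` fields and the example entries are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    needs_own_kernel: bool = False        # non-Linux guest, e.g. FreeBSD or Windows
    needs_pci_passthrough: bool = False   # direct access to a physical PCI device
    needs_strong_isolation: bool = False  # e.g. manages its own storage


def guest_type(workload: Workload) -> str:
    """Return "vm" only when a stated requirement rules out LXC."""
    if (
        workload.needs_own_kernel
        or workload.needs_pci_passthrough
        or workload.needs_strong_isolation
    ):
        return "vm"
    return "lxc"


workloads = [
    Workload("opnsense", needs_own_kernel=True, needs_pci_passthrough=True),
    Workload("proxmox-backup-server", needs_strong_isolation=True),
    Workload("windows", needs_own_kernel=True),
    Workload("photo-library"),  # an ordinary Docker Compose stack
]

for w in workloads:
    print(f"{w.name}: {guest_type(w)}")
```

The default branch is the container. A VM has to be justified by at least one named requirement, which keeps the exceptions visible.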

The data problem

Here is a question that does not get discussed enough in homelab content: where do your datasets actually live?

Consider a personal photo library managed by Immich. Or a file repository served by Nextcloud. These are large, growing datasets. They are the reason the infrastructure exists. The services that operate on them are replaceable. The data is not.

The naive approach is to put everything inside the container's virtual disk. This is simple, and it works until it does not. The virtual disk is a file on the host filesystem. As it grows, ZFS still sees a single opaque object: it cannot apply properties, quotas, or snapshots to the individual files inside it. If the filesystem inside the virtual disk becomes corrupted, ZFS sees the outer file as intact, and the corruption is invisible to the host's integrity checks. Preallocating large virtual disks wastes space. Thin provisioning avoids that, but adds complexity and fragmentation.

I chose a different approach. The datasets that matter (photos, cloud storage files, media libraries) live directly on ZFS at the host level. Each dataset has its own ZFS filesystem with appropriate properties: compression, record size, quota. This means ZFS can protect them with its own checksumming and scrubbing, independently of whatever is running inside the container.
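As an illustration of one filesystem per dataset, each with its own properties, the sketch below builds the corresponding `zfs create` commands. The pool name, dataset names, and property values are assumptions made up for the example, not the values used in this setup. The commands are only printed, never executed.

```python
import shlex

# Hypothetical pool, dataset names, and property values, for illustration only.
DATASETS = {
    "tank/photos": {"compression": "lz4", "recordsize": "1M", "quota": "2T"},
    "tank/cloud": {"compression": "lz4", "recordsize": "128K", "quota": "1T"},
}


def zfs_create_command(dataset: str, properties: dict[str, str]) -> list[str]:
    """Build the argv for `zfs create` with one -o flag per property."""
    argv = ["zfs", "create"]
    for key, value in properties.items():
        argv += ["-o", f"{key}={value}"]
    argv.append(dataset)
    return argv


for name, props in DATASETS.items():
    # Printed, not executed: review before running anything on a real pool.
    print(shlex.join(zfs_create_command(name, props)))
```

Keeping the properties in one table makes the per-dataset choices explicit and easy to review alongside the rest of the configuration.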

These datasets are then exposed where they are needed. For LXC containers, this is a bind mount: the host directory appears inside the container's filesystem directly, with no virtualization layer in between. For the few cases where a VM needs access, NFS provides the bridge. A dedicated storage-services container also runs a Samba server, sharing selected directories to devices on the local network. This is one of the few services that runs outside Docker, directly on the container OS, because the overhead of containerizing a file-sharing daemon that operates on bind-mounted host paths added complexity without benefit.
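Two details of bind mounts into unprivileged containers are worth sketching: the shape of a mount point entry in the container configuration, and the ownership the host directory needs. The sketch assumes the default Proxmox ID mapping for unprivileged containers, where container IDs 0 to 65535 map to host IDs starting at 100000; a custom mapping changes the numbers. The paths and helper names are hypothetical.

```python
DEFAULT_IDMAP_OFFSET = 100_000  # default host ID for container root (assumed)
IDMAP_RANGE = 65_536            # number of IDs mapped by default (assumed)


def host_id(container_id: int, offset: int = DEFAULT_IDMAP_OFFSET) -> int:
    """Host UID/GID that a container UID/GID appears as under the default map."""
    if not 0 <= container_id < IDMAP_RANGE:
        raise ValueError(f"container id {container_id} is outside the mapped range")
    return offset + container_id


def mount_point_line(index: int, host_path: str, container_path: str) -> str:
    """Format a bind mount entry as it appears in a container's config file."""
    return f"mp{index}: {host_path},mp={container_path}"


# Hypothetical paths: a host dataset exposed inside a container.
print(mount_point_line(0, "/tank/photos", "/mnt/photos"))

# A service running as UID 1000 inside the container needs the host
# directory to be owned by this host UID to be able to write to it.
print("chown target on the host:", host_id(1000))
```

Under the default mapping, UID 1000 in the container is UID 101000 on the host, which is the usual cause of permission errors when a bind-mounted directory is owned by the host's own UID 1000.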

The result is that the service layer and the data layer are decoupled. I can destroy and recreate a container without touching its data. I can snapshot and replicate the datasets independently. I can move a dataset to a different pool or a different RAID configuration without the service knowing.

This is not a minor architectural detail. It is the decision that makes everything else sustainable.

LXC containers as service hosts

After various experiments, the operating model settled on thematic LXC containers, each hosting one or more related Docker Compose stacks.

The grouping is functional, not arbitrary. Services that share data dependencies or network requirements are colocated. Services with distinct security profiles are separated. The NVR, for instance, runs alone because it needs access to a Google Coral TPU for inference and operates in a network segment isolated from everything else.

Each GitOps-managed container is built from a common template. The template includes Docker, the deploy tooling, and the SSH configuration needed for the GitOps pipeline to operate. This means spinning up a new service container is a matter of cloning the template, assigning it to the correct network segment, and adding a stacks entry. The deploy pipeline handles the rest. But the pipeline itself is the subject of Episode 3.

What I would not do again

A dedicated backup node designed to run only intermittently. The coordination cost of scheduled wake-ups, consistency checks, and state synchronization across power cycles was higher than the cost of running a lightweight VM full-time. Simpler won.

Defaulting to VMs before understanding the actual isolation requirements. The overhead is real, and for most self-hosted services behind a reverse proxy, the risk profile does not justify it. LXC-first, with explicit exceptions, would have been the right starting point.

Over-separating roles across physical hardware without measuring whether the workloads justified the separation. A clean diagram is not the same as a good architecture. The constraint that matters is the one you pay for every month, not the one that looked elegant on day one.

What the consolidation taught me

The evolution from four nodes to two was not a failure of planning. It was planning meeting reality.

The original design optimized for separation. The final design optimizes for sustainability. A homelab that costs too much to run will eventually be turned off. A homelab with too many moving parts will eventually be neglected. Neither outcome serves the purpose of having one.

The interesting lesson is that the consolidation did not reduce capability. It reduced waste. The two-node architecture still maintains the separation that matters: the network layer is independent from the compute layer. What it eliminated was the separation that did not pay for itself: a dedicated compute node with idle capacity, a backup node that was only useful intermittently.

In enterprise contexts, this trade-off is rarely visible. Hardware exists in racks that someone else pays for. Power is a line item in a budget that someone else manages. In a homelab, every watt is yours. That constraint forces a clarity about what actually needs to be separate and what is separate only because a diagram looked cleaner that way.

This is, incidentally, one of the more transferable insights. Not every system that could be distributed should be. Not every service that could have its own host needs one. The right architecture is the one that survives contact with its operating environment, not the one that looks best before deployment.

The accountability satisfies the citizen, not the agent

2026-04-14T00:00:00+00:00

Every administrative act in Italy must carry a motivation. Not a summary, not a label: a traceable chain of reasoning that a citizen can read, understand, and challenge before a judge. This principle is not a bureaucratic formality. It is the mechanism through which public power remains accountable.

Agentic AI, by design, operates differently. It pursues outcomes through statistical inference, adapts through feedback, and can coordinate across systems at a speed no human process can match. What it cannot do, today, is explain itself in a way that satisfies the institutional standards that public administration is built on.

Recently, AgID Academy hosted a webinar that put this tension at the centre: autonomy on one side, responsibility on the other. The session drew on a recent vision paper, The Agentic State: Rethinking Government for the Era of Agentic AI, one of the most comprehensive attempts to date to think through what agentic AI means for government, not just as a technology question but as an institutional one. It is worth reading in full.

What follows is not a critique of the paper. It is an attempt to ground some of its arguments in the specific reality of Italian public administration, where I work, and where some of the questions the paper raises become particularly sharp. The core of my argument is simple: the factors that limit agentic AI in public administration today are not primarily technical. They are institutional, legal, and infrastructural. And addressing them is a prerequisite, not an afterthought.

The levels matter

One of the paper's most useful contributions is a taxonomy of agent capabilities, borrowed from autonomous driving, ranging from Level 0 (fully manual) to Level 5 (fully autonomous, no human involvement). Most real deployments today sit between Level 2 (AI classifies and prioritises, humans decide) and Level 3 (agents plan and adapt within defined domains, humans handle edge cases). Level 4, where agents act independently in bounded environments, exists only in constrained private-sector contexts. Level 5 remains research-only.

This distinction is worth keeping in mind, because the conversation around agentic AI tends to collapse these levels. When we talk about agents that orchestrate complete workflows, anticipate citizen needs, and coordinate across organisational boundaries, we are describing something between Level 3 and Level 4. When we talk about agents replacing bureaucratic processes entirely, we imply Level 4 or above. These are not the same thing, and the gap between them is where most of the difficult questions live.

The explainability problem is not new, but AI makes it structural

Public administration has always operated under an explainability requirement. In Italian law, this takes the form of the obbligo di motivazione: every act affecting a citizen must state the factual and legal reasons behind the decision. But the principle is not uniquely Italian. It is a cornerstone of open government more broadly: the idea that public decisions must be transparent, traceable, and contestable.

The Open Government Partnership, the EU's own transparency frameworks, and the AI Act itself all converge on the same expectation: citizens have the right to understand how decisions that affect them are made. When a human official issues a decision, the reasoning can be reconstructed, examined, and challenged. The decision may be wrong, but the process is legible.

AI systems based on machine learning introduce a tension here. A large language model producing a decision or a recommendation does not reason the way a legal system expects reasoning to work. It does not apply a rule to a set of facts through a traceable logical chain. It generates outputs through statistical inference over learned patterns. The result may be correct. The process that produced it is, in the legal and institutional sense, opaque.

The paper engages with this directly. It calls for structured explanations showing how inputs led to outputs, what alternatives were considered, and why specific approaches were chosen. It proposes layered transparency for different audiences: plain-language summaries for citizens, reasoning traces for policy experts, algorithmic inspection for auditors. These are the right architectural goals.

The open question is timing. No current large language model can produce an explanation that satisfies the standard an administrative judge would require when reviewing a challenged act. Explainability research is advancing, but the gap between what a post-hoc interpretability tool can generate and what legal accountability demands remains significant. This is not necessarily a permanent condition, but it is the current one, and it is worth being honest about when planning deployments.

The mismatch is structural, not incidental. It applies everywhere open government principles are taken seriously, not only in Italy. And it means that the path from Level 3 agents (supporting human decisions) to Level 4 agents (making decisions within bounded autonomy) passes through unsolved problems in the explainability of AI systems, not just unsolved problems in their accuracy.

Accountability requires a name

Beyond explainability, there is the question of responsibility.

The paper frames this well: Attribution becomes complex when decisions emerge from interactions between multiple AI systems, training data, and real-time information flows rather than identifiable human judgement. It proposes delegation frameworks, a kind of digital power of attorney, where officials formally authorise agents to act within defined scope, and remain accountable for the parameters they set.

This maps reasonably to a human-on-the-loop model, where an official supervises an autonomous system rather than approving each action. The paper acknowledges this will be the dominant model at scale, and the logic is sound.

Where it becomes difficult, at least in the Italian context, is that administrative accountability is not abstract. When a citizen appeals an act, the administration must identify who made the decision, on what authority, and through what reasoning. The responsabile del procedimento is not a metaphor. It is a named individual with a specific legal role.

Consider a concrete scenario: an agentic system processes a benefit application, cross-references five databases, applies eligibility criteria, and produces a denial. Who signed the act? The official who set the system's parameters six months ago? The one who approved its deployment? The one who received the output and clicked "confirm" without meaningful review?

The paper raises this question explicitly, and it is the right question to raise. The answer will likely require new legal frameworks that define what it means for a human to be accountable for a decision that was substantively made by a machine. That work has not yet been done in most jurisdictions. In Italy, where the procedural framework is particularly detailed, it will require careful legislative and jurisprudential evolution.

Interoperability comes first

The paper's most compelling use cases for agentic AI involve cross-boundary coordination. A citizen reports house damage; agents orchestrate responses across insurance, emergency housing, building permits, and utilities. A business registration triggers identity verification, budget checks, and registry updates across departments, instantly.

These scenarios are genuinely appealing. They also depend on something that does not yet exist in most public administrations: reliable, real-time data interoperability across organisational boundaries.

I have written before about why a single source of truth is not a technical problem but an organisational one. The same logic applies here, amplified. An agentic system that orchestrates across departments needs not just access to data, but semantically consistent, validated, authoritative data with clear ownership and governance.

In the Italian context, where regional health systems maintain separate patient registries, where municipalities run their own demographic databases, where interoperability between state systems often relies on batch file exchanges rather than real-time APIs, the infrastructure for cross-boundary agent orchestration is not yet in place. The agent would be navigating the same fragmentation that frustrates citizens today, just faster.

This is not a reason to dismiss the agentic vision. It is a reason to sequence the work correctly. The paper itself, in its data and privacy chapter, argues for treating data as critical infrastructure and calls for ecosystem-wide governance and open by default principles. That is exactly the right foundation. The point is that this foundation needs to come first, or at least in parallel, rather than after the agent layer is deployed.

And the gap is not only technical. Principles like once only, where the citizen should never be asked to provide information the administration already holds, have been legislated for years in Italy and across the EU. The legal basis exists. What is largely missing is consistent implementation: shared operational practices, institutional willingness to treat another administration's data as authoritative, and a culture that prioritises the citizen's experience across organisational boundaries rather than within them. Digital systems built to enable this, from interoperability platforms to shared registries, exist in many cases. But when they are adopted without the cultural shift that gives them meaning, they remain underused infrastructure: technically available, operationally irrelevant. The problem agentic AI would be asked to solve on the citizen-facing side is, in large part, the consequence of this gap between what is already normatively possible and what is actually practiced.

There is also a deeper question worth asking. Much of the bureaucratic complexity that agents would navigate is the residue of regulatory accumulation: layers of requirements from different eras, applied to modern contexts through incremental adaptation rather than structural redesign. Interoperability standards, API mandates, and above all legislative simplification would reduce the surface area that any agent needs to navigate. In some cases, the best use of resources is not building an agent that manages complexity, but removing the complexity that makes the agent necessary.

Where agentic capabilities deliver value now

None of this means agentic AI has no place in public administration. It means the most productive deployments, at least in the near term, are likely to be those that work within existing accountability structures rather than against them.

There are domains where agentic capabilities at Level 2-3 can deliver real value today:

Internal workflow optimisation, where agents classify, route, and pre-process documents within a single administration, under direct human supervision. The paper's examples of invoice processing, risk-based inspection scheduling, and HR document analysis fall squarely here. These are high-volume, low-complexity tasks where the agent augments human capacity without displacing human authority.

Citizen-facing information services, where agents provide guidance, answer questions, and help citizens navigate procedures, without making binding decisions. This is the assisted interaction model: useful, lower-risk, and already in limited deployment in several countries. The paper's examples from Ukraine (Diia.AI) and Abu Dhabi (TAMM 3.0) show what this looks like at national scale.

Anomaly detection and quality control, where agents monitor data flows and flag inconsistencies for human review. In digital health, for example, an agent that identifies mismatches between regional and local patient records could accelerate reconciliation processes that currently take days.

Back-office coordination within a single organisation, where agents manage scheduling, resource allocation, and internal routing without crossing the organisational boundaries that make interoperability so difficult. The example from Goiás, Brazil, cited in the paper, where an agent reduced project review time from one year to a single week, falls into this category.

What these use cases share is that the agent operates within clear boundaries, a human retains decision authority, and the output is reviewable. They are not the full Agentic State the paper envisions, but they are real, deployable, and valuable.

The conversation that matters

The webinar's framing put autonomy and responsibility side by side, and that pairing captures the core tension well. The paper does not shy away from it. Its governance chapter is one of the strongest in the document, proposing identity binding, behavioural attestation, preview windows, and citizen override mechanisms that go well beyond what most current discussions of AI governance attempt.

The Italian perspective within the paper is notably grounded: agents fully under human control, challenges around scope of competence, regulations on procedural responsibility, and privacy, transparency, and accountability tools. This is the language of people working through the operational detail of how a regulatory framework actually adapts to new technology.

The paper describes a broader vision: outcome-driven governance, self-composing services, and anticipatory government. That vision is worth engaging with seriously. The path toward it, though, runs through work that is not primarily technological: interoperability governance, legal frameworks for algorithmic accountability, legislative simplification, and a shared understanding of what explainability means when the decision-maker is a machine.

That work is less visible than deploying an agent. It is also what determines whether the agent, once deployed, actually serves the citizen or merely adds a layer of sophistication to a system that still does not function as it should.

The question is not whether agentic AI will become capable enough to operate in public administration. It probably will. The question is whether our institutional frameworks can absorb it without breaking the guarantees they exist to protect: that every decision can be explained, that every act has a responsible author, and that every citizen can challenge what the state does in their name. Those guarantees are not obstacles to innovation. They are the reason public administration exists at all.

Building a Home Network You Actually Control

2026-04-11T00:00:00+00:00

Starting point

When I moved to a new house, I had a choice.

I could plug in a consumer router and be done with it. Or I could treat the move as an opportunity to build a network infrastructure from scratch, properly.

I chose the second. This is what that actually involved, and what it taught me.

The first decision: physical or virtual firewall

The question I faced immediately was where to run the firewall.

The obvious path was a dedicated appliance: a small x86 box running pfSense or OPNsense, separate from everything else. Simple, predictable, cheap.

I chose the other path: virtualizing the firewall inside a hypervisor, on a node dedicated exclusively to the network layer. That node now runs OPNsense, the wireless controller, and the services responsible for authentication and traffic routing. It is physically and logically separate from the compute and storage infrastructure.

This separation was not strictly necessary at the start. It became a deliberate design choice once I understood what it meant: the network layer can be reasoned about, maintained, and failed independently from everything running on top of it. This is not a novel concept in enterprise architecture. It is, however, one thing to read about it and another to operate it.

pfSense or OPNsense

Both are FreeBSD-based. Both are capable. The differences that mattered were around approach.

pfSense has a longer history and a larger community. It is also, in places, visibly anchored to older decisions, in the interface and in how certain features are exposed.

OPNsense felt more deliberately designed. The API is first-class. The plugin system is clean. Updates are frequent and well-documented.

I was starting from zero. If I had to learn one, I preferred to learn the one with the more modern design language. OPNsense it was.

The chicken-and-egg problem

Here is something nobody warns you about when you virtualize a firewall for the first time.

To configure OPNsense, you need network access. But OPNsense is what provides network access. If you misconfigure it, you lose the ability to reach the interface you need to fix it.

My previous router became my lifeline during setup: a parallel network I used as a fallback path to reach the hypervisor when the virtual firewall was not yet routing correctly, or was completely broken, which happened more than once.

It took longer than I expected. There were evenings where I rebuilt the virtual machine configuration from scratch because I had gotten into a state I did not understand well enough to fix. That was, in hindsight, the point. You understand a system when you have broken it and repaired it, not before.

Physical infrastructure

The house was cabled during the move. The backbone between floors runs on single-mode fiber with BiDi transceivers at 10 Gbps. The choice of fiber over copper for the backbone was partly practical: fiber cables are thinner and easier to pull through existing conduits inside walls. Using BiDi modules, which transmit and receive on different wavelengths over a single fiber strand, halved the number of physical cables needed between switches.

End-device connections use copper, which is more than sufficient for anything currently attached.

The WAN connection is fixed wireless, with an antenna mounted on the roof. Living on top of a hill, with no shared lightning protection in the building, exposed me to a risk I had not initially considered. A lightning strike on the antenna could propagate through the Ethernet cable directly into the router, and from there into everything else.

The mitigation was deliberate: a copper-to-fiber media converter sits between the antenna's injector and the router. The signal enters as light, not electricity. A surge on the antenna side cannot travel down an optical path. The injector itself is on a UPS, as are the servers and all active network equipment, providing continuity for a minimum viable setup during power interruptions.

Why segmentation

My previous self-hosting experience had taught me one thing clearly: a flat network is not a network design, it is the absence of one.

The problem I had encountered before was that services exposed on local ports were reachable by anything on the same network. Bypassing authentication was sometimes as simple as hitting the direct address and port instead of going through the reverse proxy. This is not a hypothetical. It happened.

Beyond that, I had three distinct concerns that shaped the segmentation design.

The first was service isolation. I did not want client devices to reach service infrastructure directly. I also did not want services to communicate with each other laterally. Traffic between segments should pass through the reverse proxy, not through direct connections.

The second was IoT devices. Some devices in the house run closed-source firmware with unclear data practices. Smart appliances and home automation hardware from certain manufacturers phone home in ways that are not transparent. I was not primarily concerned about intrusion. I was concerned about what data was leaving, and where. Placing these devices in isolated segments where they cannot observe other network traffic is a meaningful constraint, even if imperfect.

The third was cameras. Surveillance cameras are a specific case. The risk of unauthorized access, including from firmware vendors or cloud providers, is real enough to warrant complete isolation: no internet access, no cross-segment reach, no exposure outside what the NVR software needs to receive the video stream.

Segment design

The result is a set of distinct logical zones, each with a specific scope:

The management segment carries only infrastructure: servers, switches, access points, backup systems, and out-of-band management devices. It is never reachable from client networks by default.

The server segment hosts all service containers and virtual machines. Client devices do not have direct access here. They reach services through a reverse proxy that sits inside this segment. Traffic flows in one direction: clients reach the proxy, the proxy reaches the appropriate backend. The management segment remains separate, and its interfaces are accessible only through the administrative VPN tunnel, not through the proxy.

The trusted segment is for personal devices that need broad access to services. The users segment is for family devices, with internet access and selective reach to specific services.

IoT devices are split across two zones depending on whether they require internet connectivity. Those that do not are placed in a fully isolated zone with no outbound path whatsoever. Those that do are allowed outbound internet access but cannot reach any other internal segment.

Cameras have their own dedicated zone: isolated from everything, with no internet access. The only allowed traffic is the inbound stream pulled by the NVR.

Smart media devices and a guest wireless network complete the picture, each with appropriately limited reach.

The network operates on a default-deny posture: inter-segment traffic is blocked by default, and every communication path requires an explicit firewall rule.
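The default-deny posture can be pictured as a lookup against an explicit allow-list. The sketch below is illustrative only: the segment names echo the zones described above, but the specific rules, the `ALLOW_RULES` set, and the `is_allowed` helper are assumptions, not the actual firewall configuration. It models inter-segment traffic only.

```python
# Hypothetical allow-list of (source segment, destination) pairs.
# Anything not listed is dropped: that is what default deny means.
ALLOW_RULES = {
    ("trusted", "server"),       # personal devices reach the reverse proxy
    ("users", "server"),         # family devices reach the reverse proxy
    ("iot_online", "internet"),  # internet-dependent IoT devices, outbound only
}


def is_allowed(src: str, dst: str) -> bool:
    """Return True only when an explicit rule permits src -> dst."""
    return (src, dst) in ALLOW_RULES


proxy_reachable = is_allowed("trusted", "server")  # True
iot_lateral = is_allowed("iot_online", "server")   # False: no rule exists
cameras_out = is_allowed("cameras", "internet")    # False: no rule exists
```

The useful property is that forgetting a rule fails closed: a new segment with no entries can reach nothing until someone decides otherwise.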

Dynamic VLAN assignment

Defining segments is the design. Assigning devices to them reliably at connection time is the operational problem.

For wired connections, segment membership is determined by the switch port. A device plugged into a port gets the VLAN that port belongs to. This works for infrastructure with fixed physical locations.

Wireless is different. Any device can attempt to connect, and the network needs to decide where to put it before it has proven anything about itself.

FreeRADIUS, running inside OPNsense, handles wireless authentication for both networks. The two SSIDs serve different populations and use different mechanisms.

The IoT network uses MAC Authentication Bypass. The access point presents the connecting device's MAC address to FreeRADIUS as the authentication credential. If the MAC is registered, the device is assigned to the corresponding IoT VLAN. If it is not recognized, the default policy applies: the device lands in the most restricted IoT zone, with no internet access and no cross-segment reach. Either way, the device stays within the IoT boundary. The MAC binding only determines the degree of restriction within that boundary, not a level of trust. All IoT VLANs are untrusted from the perspective of access to privileged resources. The default is not rejection. It is automatic containment: a new device connecting for the first time is isolated without any manual intervention and without interrupting anything else.
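The containment-by-default behaviour can be sketched as a dictionary lookup with a fallback. The MAC addresses, VLAN IDs, and function names below are hypothetical, and the sketch assumes the default zone is the same as the no-internet IoT zone; in the real setup this decision is made by FreeRADIUS, not by Python code.

```python
# Hypothetical registry: MAC address -> IoT VLAN ID.
REGISTERED_MACS = {
    "aa:bb:cc:00:00:01": 30,  # IoT zone with outbound internet access
    "aa:bb:cc:00:00:02": 31,  # IoT zone with no outbound path
}

# Assumed default: unknown devices land in the most restricted zone.
DEFAULT_IOT_VLAN = 31


def normalize_mac(mac: str) -> str:
    """Lower-case the address and use ':' as the separator."""
    return mac.strip().lower().replace("-", ":")


def assign_iot_vlan(mac: str) -> int:
    """Never reject: an unregistered MAC is contained, not refused."""
    return REGISTERED_MACS.get(normalize_mac(mac), DEFAULT_IOT_VLAN)
```

Note that the function has no failure branch. Every input maps to some IoT VLAN, which mirrors the point that the MAC only selects the degree of restriction.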

The main network uses WPA3-Enterprise. FreeRADIUS authenticates clients via PEAP-MSCHAPv2, so the credential is a username and password rather than a shared passphrase. Authentication is tied to an identity, not to a device. My own setup uses two separate accounts: one that maps to the trusted segment, one to the untrusted users segment. The choice of which to use at connection time determines the level of access that session will have. Family members authenticate with untrusted credentials and land in the users segment regardless of which device they use.
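Identity-based assignment is commonly expressed through the RFC 3580 tunnel attributes that a RADIUS server returns with an Access-Accept. The sketch below shows that mapping in Python; the account names and VLAN IDs are hypothetical, and whether this setup returns exactly these attributes is an assumption.

```python
from typing import Dict, Optional

# Hypothetical accounts and VLAN IDs.
ACCOUNT_VLANS = {
    "owner-trusted": 10,  # maps to the trusted segment
    "owner-users": 20,    # same person, reduced access
    "family": 20,         # lands in the users segment on any device
}


def access_accept_attributes(username: str) -> Optional[Dict[str, str]]:
    """Return the VLAN reply attributes for a known account, else None."""
    vlan = ACCOUNT_VLANS.get(username)
    if vlan is None:
        return None  # assumed behaviour: unknown identities are rejected
    return {
        "Tunnel-Type": "VLAN",
        "Tunnel-Medium-Type": "IEEE-802",
        "Tunnel-Private-Group-Id": str(vlan),
    }
```

The device never appears in the lookup. The same laptop gets a different VLAN depending on which account was used to join.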

The result is that segment assignment follows a consistent policy rather than manual configuration. Adding a new device means deciding its trust level, either registering its MAC to the appropriate VLAN or letting it fall to the default. The network posture does not depend on remembering to configure anything after the fact.

Remote access

Two separate VPN tunnels run directly on OPNsense, providing different levels of access.

One is for infrastructure management: it reaches the server and management segments and is used for operating the homelab directly. The other is for general access: it reaches only services exposed through the reverse proxy, and is used when connecting from outside the home network.

The separation matters because administrative access and user access are different threat surfaces. A compromised device on the administrative tunnel would have direct access to the hypervisors. The two-tunnel design keeps those surfaces separate without requiring two physical endpoints.

What operating this actually teaches

The value of building and running this infrastructure is not the infrastructure itself.

It is the operational experience that comes from being the only person responsible for every layer of the stack, from the physical cabling to the application running inside a container. There is no ticket to escalate. There is no network team to call. When something breaks, the gap in understanding becomes immediately visible, and filling it is the only path forward.

This is a different kind of learning from reading documentation or following a tutorial. When you misconfigure a firewall rule and lock yourself out, you understand stateful packet inspection in a way that no explanation produces. When you debug why two services cannot communicate and trace it back to a missing DNS override, you understand the relationship between name resolution and service discovery concretely.

In professional contexts, particularly in public sector digital governance, systems are designed and operated by specialists. The people responsible for policy rarely have direct visibility into how those systems behave at the architectural level. Having operated a network where every decision was mine, including the consequences of bad ones, gives me a different frame for reading architecture documentation, evaluating vendor claims, and understanding where the real constraints in a system lie.

It does not replace specialized expertise. It provides a ground-level reference that makes that expertise easier to engage with.

The process you know but cannot describe

2026-03-30T00:00:00+00:00

In an earlier post, I wrote about two software projects I built before I had a formal framework for what I was doing. The conclusion was simple: building tools to manage deadlines and track compliance obligations was, in retrospect, a primitive attempt at process governance. The code was the easy part. Understanding the underlying process well enough to model it was the real work.

I also mentioned a university course on process modelling that finally gave me a language for what I had been doing instinctively. This is that story.

The process I already knew

For the course assignment, I chose to model a process I had been living inside for four years: the sanctioning procedure under the Italian Highway Code (Codice della Strada), as implemented at the small municipality where I worked as a local police officer.

It was a deliberate choice. I was not interested in inventing a fictional process for the exercise. I wanted to see what happened when I tried to formalise something I already knew intimately, from the inside.

The process, in summary: a traffic violation is detected, documented, and notified. The recipient then has several possible responses, each triggering different sub-flows. They can pay the reduced fine within 60 days. They can submit a formal defence. They can appeal to the Justice of the Peace or the Prefect. If nothing happens, the debt is registered and transferred to a collection agency. If there is a formal error in the verbale (the written violation report), the process can be corrected or annulled.

I knew all of this. I had run every path in that diagram personally. I could have described it in prose in five minutes.

What I could not do, before the modelling exercise, was hold the entire thing in my head at once.

The diagram

The model uses BPMN 2.0, with two swim lanes representing the two functional roles involved: Patrol Unit (handling on-site detection and immediate contestation) and Administrative Office (handling documentation, notification, and all subsequent phases).

A few things worth noting in the model:

The process has two start events. This reflects a structural reality of Italian traffic enforcement: contestation can be immediate (on the spot) or deferred (by mail to the registered owner). These are not the same administrative path, even though they converge at the same subsequent steps.

The event-based gateway after notification handles three mutually exclusive waiting states: payment within 60 days (Payment timer event), a request for defensive writings (Request for Defensive Writings message event), and simple inaction (captured by the expiry of the 60-day deadline). This was the hardest part of the model to get right, because in practice the three outcomes feel sequential rather than parallel. The gateway makes explicit that the process is actually suspended, waiting for one of three things to happen.

The loop from the gateway through defensive writings, appeal outcome, and back to the main flow reflects the real complexity of the appeals chain. In practice, this loop rarely completes more than twice. But it exists, and a model that ignores it is wrong.
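The event-based gateway described above can be sketched as a race: the process waits on all three events at once, and whichever occurs first decides the path. The function and event names below are hypothetical, and the sketch simplifies by applying the same 60-day window to both the payment and the defensive-writings request.

```python
from typing import Optional

DEADLINE_DAYS = 60  # from the text: the window that opens at notification


def resolve_gateway(payment_day: Optional[int],
                    defence_day: Optional[int]) -> str:
    """Return the first event to occur after notification (day 0).

    None means the event never happened. Events after the deadline are
    ignored, because the timer has already fired by then.
    """
    candidates = []
    if payment_day is not None and payment_day <= DEADLINE_DAYS:
        candidates.append((payment_day, "payment"))
    if defence_day is not None and defence_day <= DEADLINE_DAYS:
        candidates.append((defence_day, "defensive_writings"))
    if not candidates:
        return "deadline_expired"
    # Earliest day wins; a same-day tie is broken arbitrarily in this sketch.
    return min(candidates)[1]
```

A sequential implementation would check these conditions one after another. Here all of them are evaluated together, which is the distinction the diagram forces into the open.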

What modelling changed

When I finished the first draft of the diagram, I found three things I had not consciously known before, despite having executed the process hundreds of times.

The first was the double start event. I had always thought of immediate and deferred contestation as variations of the same thing. Modelling them made clear they are structurally different paths that happen to share most of their downstream steps. That distinction matters if you ever want to instrument the process or collect metrics separately.

The second was the event-based gateway. In practice, the 60-day wait and the possibility of a defensive request feel like sequential steps: you send the notice, you wait to see if they pay, and if they do not, you check for defensive writings. The model showed that these are actually concurrent waiting states, and that the administrative logic for handling them needs to account for both simultaneously. Several municipalities I later learned about had actually built their software to handle them sequentially, creating edge cases where a payment received on day 59 could be ignored because the system had already opened a defensive-writings sub-process.

The third was the annulment path. In practice, annulment in autotutela was rare enough that I barely thought about it as part of the process. In the model, it appears as a distinct end state that requires its own flow. Making it explicit forced me to think about what triggers it, who authorises it, and how it is recorded.

None of these were revelations. They were things I would have recognised immediately if someone had pointed them out. But I had not pointed them out to myself, because I had no tool that required me to be explicit about them.

On governance work

The reason I find this worth writing about is not the BPMN diagram itself. It is what the exercise revealed about the relationship between tacit knowledge and formal structure.

People who work inside complex processes for years accumulate a detailed operational understanding of how things work. That understanding is real and valuable. It is also largely invisible: it lives in their heads, it is transmitted through observation and apprenticeship, and it is extremely difficult to examine critically from the inside.

Formal modelling is one tool for making that tacit knowledge visible. It does not replace operational experience. It surfaces it, names it, and creates a shared object that can be examined, critiqued, and improved.

This is, in a more structured form, the same thing I was trying to do when I built DESU: surface the compliance tracking logic that was living in spreadsheets and in people's heads, and give it a more explicit, queryable form.

The difference is that with DESU I was solving the problem with code before I understood the structure well enough to model it properly. The result worked, but it was brittle, difficult to extend, and dependent on my own understanding of what the firm actually needed.

If I were to rebuild it now, I would model the process first.

Building compliance tools before I knew what governance meant

2026-03-28T00:00:00+00:00

Between 2019 and 2021, I built two software tools for two very different organisations facing the same underlying problem: they were growing faster than their ability to keep track of their own obligations.

Neither project ended the way I hoped. One was eventually abandoned due to accumulated technical debt and no one left to maintain it. The other was never really adopted at all.

Both taught me more about digital transformation than I expected, not because of the code I wrote, but because of what happened when I tried to introduce it.

DESU: when a spreadsheet stops being enough

DESU stands for deadline support. I built it for a compliance consulting firm I was collaborating with as an external contractor.

The firm provided RSPP (workplace safety officer) consultancy services to a portfolio of clients. As the client base grew past a hundred accounts, so did the volume of active engagements: each client had its own contract, its own inspection schedule, its own renewal dates, its own billing cycle. The team was tracking all of it in spreadsheets.

The problem was not incompetence. It was scale. Spreadsheets work fine for ten clients. At a hundred, with multiple overlapping deadlines per client per month, the cognitive overhead becomes unsustainable. Things start slipping through.

I built DESU to address this directly. The architecture was deliberately simple: a Java/JavaFX desktop client with a PostgreSQL backend, a DAO layer for clean data access, and a set of server-side Python scripts for automated email notifications. Every month, the system would query the database and send a digest of upcoming billing deadlines and expired engagements that still lacked an invoice, the two failure modes the firm cared most about.

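```python
import calendar
from datetime import datetime

# Subject reads: "Engagements to invoice for the month of <month name>"
subject = "Incarichi da fatturare per il mese di " + calendar.month_name[datetime.now().month]
```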
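The monthly digest itself can be sketched as two queries. This is not the original DESU code: `sqlite3` stands in for PostgreSQL so the example runs on its own, and the `engagements` table, its columns, and the `build_digest` function are all hypothetical. Dates are assumed to be stored as ISO strings (`YYYY-MM-DD`).

```python
import sqlite3
from datetime import date
from typing import Dict, List


def build_digest(conn: sqlite3.Connection, today: date) -> Dict[str, List[str]]:
    """Collect the two failure modes: billing due this month, and
    engagements that have ended without an invoice."""
    month_prefix = today.strftime("%Y-%m")

    # Engagements whose billing date falls in the current month.
    to_bill = conn.execute(
        "SELECT client FROM engagements "
        "WHERE substr(billing_date, 1, 7) = ? ORDER BY client",
        (month_prefix,),
    ).fetchall()

    # Engagements already ended that still have no invoice.
    expired = conn.execute(
        "SELECT client FROM engagements "
        "WHERE end_date < ? AND invoiced = 0 ORDER BY client",
        (today.isoformat(),),
    ).fetchall()

    return {
        "to_bill": [row[0] for row in to_bill],
        "expired_uninvoiced": [row[0] for row in expired],
    }
```

The returned dictionary would then be rendered into the email body under a subject line like the one above.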

It was not sophisticated. But it worked, and it addressed a real operational pain point.

I chose to make it open source for a specific reason: I was an external contractor, not a permanent employee. I knew my engagement would end, and I wanted the firm to have the option to hand the project to another developer without being locked into me. FOSS was the cleanest exit strategy I could think of. The code would remain accessible and modifiable by whoever came after.

As far as I know, DESU stayed in use for several years after I left. It eventually stopped being maintained and was replaced, I believe by Notion. Which is entirely reasonable: a general-purpose tool with a managed update cycle is a better fit for a small firm than custom software that requires a developer to evolve it.

The second project: when the problem is not technical

The second project started from a different context. After competing in a business innovation challenge at the local Junior Enterprise, where my team worked on a digital transformation strategy for traditional Piedmontese craft companies, I stayed in contact with one of the firms involved.

The company was a small artisan business. Post-competition, I volunteered to help translate some of the strategy we had presented into something concrete: a lightweight inventory management tool to replace a manual process that was becoming unwieldy as production volumes grew.

I built a small JavaFX application with a CSV-backed data model. CSV as a persistence layer is a pragmatic choice when you do not want to introduce a full database into a context where no one has the skills to maintain one. The tool needed to be operable by non-technical staff without infrastructure overhead.
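A CSV-backed data model amounts to a pair of functions that serialise records to rows and back. The original tool was written in Java; the sketch below uses Python for brevity, and the `Item` fields and function names are hypothetical rather than taken from the actual application.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List, TextIO


@dataclass
class Item:
    sku: str
    name: str
    quantity: int


def save_items(items: List[Item], fh: TextIO) -> None:
    """Write a header row followed by one row per item."""
    writer = csv.DictWriter(fh, fieldnames=[f.name for f in fields(Item)])
    writer.writeheader()
    for item in items:
        writer.writerow(asdict(item))


def load_items(fh: TextIO) -> List[Item]:
    """Read rows back, converting quantity from text to int."""
    return [
        Item(row["sku"], row["name"], int(row["quantity"]))
        for row in csv.DictReader(fh)
    ]


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
save_items([Item("A1", "Walnut board", 12)], buffer)
buffer.seek(0)
restored = load_items(buffer)
```

The appeal in that context is that the file can be opened and corrected in any spreadsheet program, with no server to install or maintain.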

The project never made it into production.

It was not a technical failure. The company's attention was almost entirely focused on the present: on day-to-day operations, on immediate orders, on this week's production run. Any investment of time and energy that did not directly address an immediate operational need felt like a distraction. Digital transformation, even a very modest one, requires a certain willingness to invest in the future. That willingness was not there.

I do not say this as a criticism. Small craft businesses operate under real constraints, and the instinct to focus on what is in front of you is often rational. But it was the first time I encountered something I would later come to recognise as one of the most consistent patterns in digital transformation work: the gap between what an organisation could do and what it is actually ready to do is almost never a technical gap.

What I took from both projects

Looking back, both projects were fundamentally about the same thing: how do you make obligations, deadlines, and data flows visible and manageable in an organisation that has more complexity than its current systems can handle?

DESU was a compliance tracking system. It modelled regulatory relationships between entities, surfaced obligations before they became failures, and tried to make accountability legible to the people responsible for it.

The second project failed for reasons that are entirely familiar in public sector digital transformation: competing priorities, short time horizons, and the difficulty of convincing an organisation to invest in infrastructure that will not pay off until later.

What neither project gave me at the time was a language for what I was actually doing. I was decomposing processes, identifying data flows, mapping responsibilities. I was doing it instinctively, based on operational observation, without any formal framework.

A few years later, working through a university course on process modelling, I finally had a name for it. BPMN gave me a notation and a way of thinking that made explicit what I had been doing informally. Seeing it formalised did not change how I approached problems, but it clarified why certain approaches worked and others did not. The inventory tool had failed partly because I had never properly modelled the process I was trying to support. I had built a solution before understanding the problem well enough.

That lesson, more than anything else, shapes how I approach governance work now. The code is the easy part. Understanding the process it needs to serve is where the real work is.

A Single Source of Truth Is Not a Technical Problem

2026-03-27T00:00:00+00:00

The promise

A citizen crosses an administrative boundary. The core state registry updates her record: new address, new jurisdiction, new assigned service node. The update propagates. And yet, when she presents herself at the local office the next morning, the system still shows the previous assignment. The desk operator apologises, checks manually, and resolves it. Nobody is surprised. This is how it works.

In highly regulated, multi-layered ecosystems, the idea of a single source of truth is often presented as a technical objective. Centralise the data. Remove duplication. Ensure consistency. The logic is appealing: if there is one authoritative copy, there can be no conflict.

In practice, it rarely works that way. Not because the technology is insufficient, but because the systems that maintain citizen data are embedded in institutional structures where duplication is not an accident. It is a consequence of how responsibility is distributed.

Understanding this changed how I think about data architecture in public systems. The problem is not how to store data. It is how organisations agree on it.

Why local copies exist

Consider the typical architecture of a federated institutional network. At the top, a central authoritative ledger holds the canonical identity of every enrolled entity: name, fiscal identifier, residence, service assignment, jurisdictional membership. This is the reference. Below it, each peripheral administrative unit maintains its own copy of that data, enriched with locally managed fields: active entitlements, specialist authorisations, internal identifiers, operational flags that the central ledger does not track.

From the outside, these local copies look like redundancy. A data architect would see them as a problem to be eliminated. But they exist for reasons that are not immediately visible in a system diagram.

Local systems serve as verification layers. When the core state registry sends a notification that a citizen has changed address, or switched service provider, or moved to a different jurisdiction, the local system does not simply overwrite its record. It validates the incoming data against its own state. Some fields map cleanly. Others do not, because the local system tracks things the central record does not, or represents them differently. In some cases, the update triggers a re-verification workflow that takes hours or days.

Meanwhile, the citizen walks into a local office. The desk operator sees the old record. Not out of negligence, but because the validation cycle has not yet completed. The local system is doing exactly what it was designed to do: verify before accepting. The inconsistency is real, but it is the cost of distributed responsibility, not a failure of the system.

Local systems also serve as operational buffers. If the central authoritative ledger goes down, edge authorities can continue to operate because they hold a working copy of the data they need. This is not elegant, but it is resilient. And in regulated public services, resilience is not optional.

Finally, local systems serve as accountability points. When a peripheral administrative unit manages a citizen's entitlements or authorisations, it does so as a data controller with specific legal responsibilities. Holding a local copy is not just a technical convenience. It is a consequence of the regulatory structure that assigns different institutions different roles over the same data.

The notification model and its limits

The architecture I just described typically operates on a push model. The core state registry detects a change and sends a notification downstream to every local system that might be affected. Each local system receives, validates, and integrates the update on its own schedule.

This model has worked for years. It also has a structural problem: it scales duplication.

Every notification creates a new copy of the data at every receiving end. The more local service nodes exist, the more copies are maintained. The more copies exist, the more opportunities there are for drift: a notification that arrives late, a validation that fails silently, a local enrichment that conflicts with a subsequent update.

Over time, the gap between the central ledger and any given local copy becomes a function of timing and implementation quality. The system converges toward consistency, but never guarantees it at any given moment. For routine operations, this is acceptable. For cross-boundary workflows, where one institution needs to trust another institution's data in real time, it is not.
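The drift mechanism can be shown with a toy push model: the centre updates its own record and then notifies each node, and a node that misses one notification silently keeps the old value. Class names, node names, and the `unreachable` parameter are all hypothetical.

```python
from typing import Dict, Iterable, List


class LocalNode:
    """A peripheral unit holding its own copy of the records it receives."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: Dict[str, dict] = {}

    def receive(self, citizen_id: str, record: dict) -> None:
        self.records[citizen_id] = dict(record)


class CentralRegistry:
    """The authoritative ledger, pushing every change downstream."""

    def __init__(self, nodes: List[LocalNode]) -> None:
        self.records: Dict[str, dict] = {}
        self.nodes = nodes

    def update(self, citizen_id: str, record: dict,
               unreachable: Iterable[str] = ()) -> None:
        self.records[citizen_id] = dict(record)
        skipped = set(unreachable)
        for node in self.nodes:
            if node.name not in skipped:  # a lost notification is silent
                node.receive(citizen_id, record)

    def drifted_nodes(self, citizen_id: str) -> List[str]:
        """Names of nodes whose copy differs from the central record."""
        current = self.records.get(citizen_id)
        return [n.name for n in self.nodes
                if n.records.get(citizen_id) != current]
```

Nothing in the node's own state reveals that it is stale. Drift is only detectable by comparing against the centre, which is exactly what the push model does not do.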

But the notification model's limitations become most visible not within a single federated domain, but across domain boundaries. Consider a citizen who moves from one autonomous jurisdiction to another. She registers with the new jurisdiction's service system and is assigned a new local provider. The new jurisdiction begins allocating resources for her primary services. The problem is that the old jurisdiction has no mechanism to learn that she has left. Its registry still lists her as an active entity. Her former provider is still on the books. The old jurisdiction continues to allocate resources for a service that is no longer being delivered.

The result is that the system pays twice for the same citizen's primary entitlements. Not because of fraud or negligence, but because two independent registries, each internally correct, have no shared reference to reconcile against. The old jurisdiction's data is not wrong. It is stale. And without a national-level source of truth that both jurisdictions query, there is no event that triggers the correction. The citizen moved. The data did not.

This is not a theoretical risk. It is a structural inefficiency embedded in any system where autonomous registries operate as independent authorities with no common upstream. The push model, which works tolerably within a single domain, breaks down entirely when the boundary to cross is between domains rather than between local service nodes.

The deeper problem is that the push model encodes a specific assumption about how data should flow: the centre distributes, the periphery absorbs. In practice, local systems do not just absorb. They interpret, enrich, and adapt. The notification is a starting point, not a conclusion.

The architectural shift: from copies to queries

When I started working with systems that were moving beyond this model, the direction was clear in principle but difficult in practice. The shift is from replication to access: instead of maintaining synchronised copies, systems query a shared authoritative source directly when they need current data.

The logic applies at every scale. Within a domain, peripheral units stop maintaining shadow copies and query the core state registry instead. Across domains, the registries themselves stop operating as independent authorities and integrate with a higher-order national ledger that provides the shared reference they lacked. The cross-boundary double allocation described above becomes structurally impossible: when a citizen registers in a new jurisdiction, the national ledger reflects the change, and the old jurisdiction's system sees it the next time it queries.
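The double allocation, and the way a shared ledger removes it, fits in a few lines. The jurisdiction names, citizen ID, and both functions are hypothetical; the sketch models each registry as a set of active citizens and the national ledger as a mapping from citizen to current jurisdiction.

```python
from typing import Dict, List, Set


def funded_in(registries: Dict[str, Set[str]], citizen_id: str) -> List[str]:
    """Jurisdictions whose own registry lists the citizen as active."""
    return sorted(name for name, active in registries.items()
                  if citizen_id in active)


def reconcile(registries: Dict[str, Set[str]],
              national_ledger: Dict[str, str]) -> None:
    """Each jurisdiction drops citizens the shared ledger assigns elsewhere."""
    for name, active in registries.items():
        for citizen_id in list(active):  # copy: the set is modified below
            if national_ledger.get(citizen_id) != name:
                active.discard(citizen_id)


# The citizen moves: the new registry learns of it, the old one does not.
registries = {"old": {"c1"}, "new": set()}
registries["new"].add("c1")
before = funded_in(registries, "c1")  # ['new', 'old']: paid for twice

# With a common upstream, the old registry sees the change when it queries.
national_ledger = {"c1": "new"}
reconcile(registries, national_ledger)
after = funded_in(registries, "c1")   # ['new']
```

Both registries were internally correct before reconciliation. The error only exists in the comparison, which is why neither system could detect it alone.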

In architectural terms, this is a move from push to pull. Sources stop sending notifications. Downstream systems stop maintaining copies of data they do not own. When a desk operator needs to verify a citizen's identity or service assignment, the system makes a real-time API call to the authoritative source and gets the current state.

This eliminates an entire category of consistency problems. There is no lag between the source and the downstream record, because there is no downstream record for that data. The source is queried, not copied. But the trade-off is direct: the pull model eliminates inconsistency at the cost of systemic dependency. Every downstream system now relies on the authoritative source being available, correct, and fast enough for real-time use.
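The trade-off is visible in even the smallest pull-model sketch: the desk client holds no data of its own, so it is always current and always dependent. The class names and the `online` flag are hypothetical stand-ins for a real API and its availability.

```python
from typing import Dict


class SourceUnavailable(Exception):
    """Raised when the authoritative source cannot be reached."""


class AuthoritativeSource:
    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}
        self.online = True

    def get(self, citizen_id: str) -> dict:
        if not self.online:
            raise SourceUnavailable("authoritative source is down")
        return dict(self.records[citizen_id])


class DeskClient:
    """Holds no local copy: every lookup is a query against the source."""

    def __init__(self, source: AuthoritativeSource) -> None:
        self.source = source

    def lookup(self, citizen_id: str) -> dict:
        return self.source.get(citizen_id)
```

There is no stale record to show the operator, and no record at all when the source is down. The operational buffer that local copies provided has to be replaced by availability guarantees.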

This is where the concept of trust becomes central, and not in the technical sense of TLS certificates or API authentication. The trust that matters is institutional. The authoritative source must be highly available, because every downstream system depends on it for routine operations. The API must be well-governed, because every query carries an implicit agreement: the consuming system is accepting the response as authoritative without local validation. And the transition itself must be managed carefully, because dismantling local copies means removing the operational buffers and verification layers that those systems provided.

In the push model, trust was distributed. Each local system trusted its own copy, verified on its own terms. In the pull model, trust is centralised. Every actor must agree that the source is authoritative, that its data is correct, and that its availability is guaranteed. This is not a technical configuration. It is an institutional commitment.

This is where the real difficulty lives. The technical implementation of an API is straightforward. The institutional agreement required to trust that API — to accept that the data it returns does not need local verification, to restructure workflows that depended on having a local copy — that is organisational work, not engineering work.

What makes convergence possible

A single source of truth does not emerge from architecture alone. I have seen systems where the technical layer was well-designed but the institutional alignment was missing, and the result was a central registry that existed alongside local copies that no one was willing to dismantle. The architecture was correct. The organisation had not changed.

What actually enables convergence is not a better database or a faster API. It is a prior agreement on who owns what. Consider a simple field: a citizen's entitlement status. The central ledger knows whether the citizen exists. The peripheral unit knows whether that citizen qualifies for a specific entitlement, because it holds the supporting documentation. If both systems maintain the field independently, they will eventually disagree. The only way to prevent that is to decide, before writing any code, that the entitlement field belongs to one system and one system only, and that every other system queries it rather than copying it.

This sounds obvious. In practice, it requires negotiations that have nothing to do with technology: which institution accepts liability for an incorrect value, what happens when a query fails and the data is needed for a time-sensitive decision, who pays for the infrastructure that makes real-time access possible. These are governance questions, not engineering questions. And they must be answered for every field in the data model, not just the easy ones.
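Once ownership has been negotiated, enforcing it is the simple part. The sketch below assumes a per-field ownership map agreed in advance; the field names, system names, and `write_field` helper are hypothetical.

```python
from typing import Any, Dict

# Hypothetical outcome of the governance negotiation: one owner per field.
FIELD_OWNERS = {
    "residence": "central_ledger",
    "entitlement_status": "peripheral_unit",
}


def write_field(record: Dict[str, Any], system: str,
                field: str, value: Any) -> None:
    """Accept a write only from the system that owns the field."""
    owner = FIELD_OWNERS.get(field)
    if owner is None:
        raise KeyError(f"no owner has been agreed for field {field!r}")
    if system != owner:
        raise PermissionError(f"{system} may read {field!r} but not write it")
    record[field] = value
```

The code is trivial by design. Every entry in `FIELD_OWNERS` represents an agreement about liability and cost that had to be reached before the dictionary could be written.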

The principle known as "once only" — where a citizen should never be asked to provide information the administration already holds — captures the aspiration well. It has been legislated in multiple jurisdictions. The legal basis exists. What is often missing is the trust infrastructure underneath: shared operational practices, institutional willingness to treat another administration's data as authoritative, and a culture that prioritises the citizen's experience across organisational boundaries rather than within them.

The same lesson, in a different system

I have encountered this pattern in a completely different context: my own homelab infrastructure. In The Commit Is the Deploy, I described how I moved from manually configured services on each host to a single Git repository that defines the complete state of the infrastructure.

The parallel is closer than it might seem. Before the repository, each host had its own configuration, its own copy of environment variables, its own version of what was supposed to be running. The configurations would drift. Updates would happen on one host and not another. The gap between intended state and actual state grew silently. It was the same push model at a smaller scale: I would make a change and propagate it manually to each host, hoping for consistency.

The repository solved this by becoming the single source of truth. Hosts no longer maintain their own state. They query the repository and converge to whatever it defines. A commit is a deploy. There are no local copies to reconcile.
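Convergence toward a declared state can be sketched as a diff between what the repository defines and what a host is running. This is an illustration of the idea, not the actual deploy script; the service names, version strings, and `plan_convergence` function are hypothetical.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, str]  # (verb, service, version)


def plan_convergence(desired: Dict[str, str],
                     actual: Dict[str, str]) -> List[Action]:
    """List the actions that bring a host in line with the repository.

    The repository always wins: anything running that it does not define
    is stopped, and any version mismatch is replaced.
    """
    actions: List[Action] = []
    for service, version in sorted(desired.items()):
        if service not in actual:
            actions.append(("start", service, version))
        elif actual[service] != version:
            actions.append(("replace", service, version))
    for service in sorted(set(actual) - set(desired)):
        actions.append(("stop", service, actual[service]))
    return actions
```

A host that already matches the repository yields an empty plan, so running the same commit twice changes nothing.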

But even in that context, the hard part was not writing the deploy script. It was deciding to trust the repository as authoritative, to accept that local modifications would be overwritten, to restructure the operational workflow around a centralised definition rather than distributed improvisation. Trust, again, was the bottleneck. Not trust in the technology, but trust in the decision to give up local control.

The technical layer was the easy part. The shift in how I operated the system was the real change.

Closing

A single source of truth is often described as a data architecture goal. In reality, it is a coordination problem. Technology can support it, but cannot create it.

The systems I have worked with, both in federated institutional networks and in my own infrastructure, taught me the same thing: duplication exists because responsibility is distributed, and it persists because trust is not. Eliminating duplication without first building the institutional trust that replaces it does not create a single source of truth. It creates a central system that no one fully relies on, surrounded by local copies that everyone still depends on.

The path forward is not centralisation for its own sake. It is the slower work of defining who owns what, who validates what, and what it means to accept someone else's data as authoritative. When that trust is established, the architecture follows naturally. When it is not, no amount of engineering will compensate.

Why I Built a Homelab (and Why It Matters)

2026-03-27T00:00:00+00:00

Context

My homelab journey started around 2018.

At the time, I tried to build a small NAS using a Raspberry Pi 4, two SSDs, and a UPS, mainly to run NextcloudPi. The goal was simple: reduce dependency on large tech platforms.

I had been using a terabyte of Google Drive storage - until that option was removed. The alternatives were either paying (not trivial as a student) or figuring things out myself.

I chose the second.

Early failures

It did not go well.

Things would work for a while - until they did not. An update would break something. A configuration would drift. A service would silently stop responding.

My limited experience, combined with the complexity of managing a bare-metal system, meant that I rarely understood why things broke - only that they had. Even though I was already using Linux on desktop, operating a persistent system was a different problem entirely. A desktop you can reboot. A NAS running your only cloud storage, you cannot.

After a while, I would give up and return to some compromise with commercial services.

This cycle repeated for years:

the desire for independence
the frustration of becoming a sysadmin without being one

The shift: Docker

Things changed between 2023 and 2024, when I discovered Docker.

Decoupling the system from the software introduced a different way of thinking. I quickly moved to Docker Compose to manage my services.

Until then, I had mostly worked with Raspberry Pi devices and had limited experience with virtualization (aside from a previous project running on a VM).

Docker made systems:

easier to reproduce
easier to debug
easier to reason about

For the first time, things started to stick.

From self-hosting to homelab

I never abandoned the idea of running my own services, but it has recently evolved into something more structured.

After moving to a new house, I designed the network infrastructure from scratch:

cabling (fiber and Ethernet)
switches
VLAN segmentation (8 VLANs managed via OPNsense running in a VM)

I also moved from a single machine to a small multi-node setup using Proxmox.

Originally, I planned three nodes, but energy costs pushed me toward a two-node architecture.

The result is a more robust system, with better isolation and improved reliability.

Operational maturity

More recently, I introduced a GitOps-style approach.

All services are managed via Docker and deployed through scripts connected to a GitHub repository. This allows:

reproducible deployments
quick rollback
controlled experimentation

Infrastructure changes are now versioned, not improvised.
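As a rough sketch of what "deployed through scripts connected to a repository" can look like, the example below builds the commands for deploying a given Git revision with Docker Compose. It is not the actual script: the repository path, the `compose.yaml` file name, the `origin/main` branch, and the `deploy_commands` and `deploy` function names are assumptions made for illustration. Rollback in this model is just a deploy of an earlier revision.

```python
import subprocess
from pathlib import Path


def deploy_commands(repo_dir: Path, revision: str) -> list[list[str]]:
    """Build the commands that bring a host to the state defined at `revision`."""
    compose_file = repo_dir / "compose.yaml"  # assumed file name
    return [
        # Fetch the latest history from the remote.
        ["git", "-C", str(repo_dir), "fetch", "origin"],
        # Check out exactly the requested revision (detached HEAD).
        ["git", "-C", str(repo_dir), "checkout", "--detach", revision],
        # Recreate services to match the checked-out definition.
        ["docker", "compose", "-f", str(compose_file), "up", "-d", "--remove-orphans"],
    ]


def deploy(repo_dir: Path, revision: str, dry_run: bool = True) -> list[list[str]]:
    """Run the deploy commands, or only return them when `dry_run` is True."""
    commands = deploy_commands(repo_dir, revision)
    if not dry_run:
        for command in commands:
            # check=True stops the deploy at the first failing step.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    repo = Path("/srv/homelab")  # hypothetical checkout location
    # Deploy the tip of the main branch, then show what a rollback would run.
    for command in deploy(repo, "origin/main"):
        print(" ".join(command))
    for command in deploy(repo, "origin/main~1"):
        print(" ".join(command))
```

The dry-run default keeps experimentation controlled: the commands can be inspected before anything on the host changes.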

What a homelab actually teaches

At this point, the homelab is no longer just about independence.

It provides something difficult to replicate in enterprise environments:

a single decision-maker across the entire stack
full visibility, from physical layer to application layer
direct exposure to maintenance costs
constant interaction with failure modes

A homelab forces decisions that are often abstract elsewhere.

You are constantly dealing with:

downtime
backups
redundancy
time investment

Closing

Running even a small system over time makes one thing clear:

Infrastructure is not just about building systems - it is about sustaining them.