I'd love to hear some stories about how you or your organization is using Kubernetes for development! My team is experimenting with using it because our "platform" is getting into the territory of too large to run or manage on a single developer machine. We've previously used Docker Compose to enable starting things up locally, but that started getting complicated.
The approach we're trying now is to have a Helm chart to deploy the entire platform to a k8s namespace unique to each developer and then using Telepresence to connect a developer's laptop to the cluster and allow them to run specific services they're working on locally.
This seems to be working well, but now I'm finding myself concerned with resource utilization in the cluster as devs don't remember to uninstall or scale down their workloads when they're not active any more, leading to inflation of the cluster size.
Yes, people do this kind of thing. As far as I know they all do it in fairly different ways, but what you're describing sounds reasonable. And yes, it does tend to be expensive. The whole point, as you note, is that the env has grown large, so you're hosting a bunch of large personal environments, which gets pricey when (not if) people aren't tidy with them.
Some strategies I've seen people employ to limit the cost implications:
Narrow the interface. Don't give devs direct access to the infra, but rather give them build/tooling that saves some very rich observability data from each run. Think not just metrics/logs, but configurable tracing/debugging as well. This does limit certain debugging techniques by not granting devs full, unfettered access to the environment, but it makes clear when an env is "in use". Once the CI/build job is complete, the env can be reused or torn down, and only the observability data/artifacts need to be retained, which is much cheaper.
Use pools of envs rather than personal envs. You still have to solve the problem of knowing when an env is "in use", and now also have scheduling/reservation challenges that need to be addressed.
Or automatically tear down "idle" envs. The definition of "idle" is going to get complex, and you're definitely going to tear down an env that someone still wants at some point. But if you establish the precedent that envs get destroyed by default after some max-lifetime unless renewed, you can encourage people to treat them as ephemeral resources rather than a home away from home.
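To make the "destroyed by default unless renewed" idea concrete, here's a minimal sketch of the bookkeeping. It assumes each dev namespace carries an expiry annotation; the annotation key, the 72-hour lifetime, and the function names are all made up for illustration. A real reaper would read the annotations from the cluster and run something like `helm uninstall` on what this returns.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical annotation key and default lifetime (assumptions, not a standard).
EXPIRES_AT = "dev-env/expires-at"
MAX_LIFETIME = timedelta(hours=72)


def renew(annotations, now, lifetime=MAX_LIFETIME):
    """Return a copy of the annotations with the expiry pushed out from `now`."""
    updated = dict(annotations)
    updated[EXPIRES_AT] = (now + lifetime).isoformat()
    return updated


def is_expired(annotations, now):
    """An env with no expiry annotation counts as expired: destroyed by default."""
    raw = annotations.get(EXPIRES_AT)
    if raw is None:
        return True
    return datetime.fromisoformat(raw) <= now


def namespaces_to_reap(namespaces, now):
    """`namespaces` maps namespace name -> annotations dict."""
    return sorted(name for name, ann in namespaces.items() if is_expired(ann, now))
```

The point of the design is that doing nothing leads to teardown. Keeping an env alive takes an explicit `renew`, which is also your signal that somebody still cares about it.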
None of these approaches are trivial to implement, and all have serious tradeoffs even when done well. But fundamentally, you can't carry the cavalier attitude of how you treat your laptop as a dev env into the "cloud" (even if it's a private cloud). Rather, the dev envs need to be immutable and ephemeral by default, those properties need to be enforced by frequent refreshes so people acclimate to the constraints they imply, and you need some way to reserve, schedule, and do idle detection on the dev envs so they can be efficiently shared and reaped. Getting a version of these things that works can be a significant culture shock for eng teams used to extended intermittent debugging sessions and to installing random tools on their laptops and having them available forever.
Right now our guidance is that each developer is given a namespace and a Helm chart to install, and the wording is such that developers wouldn't think of it as an ephemeral resource (i.e. people have their Helm installation up for months, and periodically upgrade it).
It would be nice to have users do a fresh install each time they "start" working, and have some way to automatically remove Helm installations after a time period, but we do have times where it's nice to have a longer-lived env because you're working within some accumulated state.
Maybe there's something to automatically scaling down workloads on a cadence or after a certain time period, but it would be challenging to figure out the triggers for that.
You can build a workflow for ephemeral environments with ArgoCD using an ApplicationSet resource with the Pull Request generator and the CreateNamespace=true sync option.
If a developer opens a pull request, create a generated namespace based on the branch name and PR number, then deploy their changes to the cluster, in the new namespace, automatically.
With GitHub, if there is no activity on a PR for some period of time, you can have the PR closed automatically. Once it's closed, Argo no longer sees it as an open PR, so it automatically destroys the environment it created. If the dev wants to keep it active or reopen it, they just make normal git updates to the PR.
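For reference, a rough sketch of what such an ApplicationSet could look like, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts). The org, repo, secret name, chart path, and resource names are placeholders I made up; the template parameters (`{{number}}`, `{{branch_slug}}`, `{{head_sha}}`) are the ones the Pull Request generator exposes. Treat it as a starting point under those assumptions, not a tested config.

```python
import json


def pr_environments_appset(owner, repo, chart_path):
    """Build an ApplicationSet manifest that creates one namespace per open PR."""
    repo_url = f"https://github.com/{owner}/{repo}.git"
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "ApplicationSet",
        # Placeholder name; "argocd" is assumed to be where Argo CD is installed.
        "metadata": {"name": "pr-environments", "namespace": "argocd"},
        "spec": {
            "generators": [{
                "pullRequest": {
                    "github": {
                        "owner": owner,
                        "repo": repo,
                        # Hypothetical secret holding a GitHub token.
                        "tokenRef": {"secretName": "github-token", "key": "token"},
                    },
                    # How often to poll for opened/closed PRs (assumed value).
                    "requeueAfterSeconds": 300,
                }
            }],
            "template": {
                "metadata": {"name": "platform-{{branch_slug}}-{{number}}"},
                "spec": {
                    "project": "default",
                    "source": {
                        "repoURL": repo_url,
                        "targetRevision": "{{head_sha}}",
                        "path": chart_path,
                    },
                    "destination": {
                        "server": "https://kubernetes.default.svc",
                        "namespace": "pr-{{branch_slug}}-{{number}}",
                    },
                    "syncPolicy": {
                        "automated": {"prune": True},
                        "syncOptions": ["CreateNamespace=true"],
                    },
                },
            },
        },
    }


manifest = pr_environments_appset("example-org", "platform", "deploy/chart")
print(json.dumps(manifest, indent=2))
```

One thing to check in your own setup: whether the generated namespace itself is removed when the Application goes away, or only the resources inside it.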
Right now our guidance is that each developer is given a namespace and a Helm chart to install, and the wording is such that developers wouldn't think of it as an ephemeral resource (i.e. people have their Helm installation up for months, and periodically upgrade it).
Right, the tradeoff here is that to maintain that state you're paying for envs even when they're not in use. Extended periods of "accumulated state" are definitely a thing, and you want some escape valve to enable them occasionally. But the way to reduce hosting costs is definitely to make them the exception rather than the rule, which involves adapting workflows to rely more on storing telemetry and analyzing it offline rather than interactively debugging everything.
Maybe there's something to automatically scaling down workloads on a cadence or after a certain time period, but it would be challenging to figure out the triggers for that.
Another approach here is using something like EBS so that stateful pods can be stopped and later reattached to their persistent disks. But unless you do some kind of deep hibernation you lose memory state, and even if you do that you lose socket and other environmental state. IMO telemetry is a stronger long-term strategy, as it can capture the state that hibernation destroys.
I have found with individual dev environments that they cause many issues with outdated service versions. If you are going this route, I would use ScheduledScaler to shut down dev resources after hours.
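The decision an after-hours schedule encodes is simple enough to sketch. This is not ScheduledScaler's config format, just the underlying logic, and the working hours (08:00 to 19:00, Monday to Friday, in whatever timezone `now` is expressed in) are assumptions you'd tune per team.

```python
from datetime import datetime, time

WORK_START = time(8, 0)   # assumed start of the working day
WORK_END = time(19, 0)    # assumed end of the working day


def in_working_hours(now):
    """True on Monday-Friday between WORK_START (inclusive) and WORK_END (exclusive)."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def desired_replicas(now, normal_replicas):
    """Scale dev workloads to zero outside working hours, otherwise leave them alone."""
    return normal_replicas if in_working_hours(now) else 0
```

Scaling to zero rather than uninstalling keeps persistent volumes and the Helm release in place, so devs get their accumulated state back in the morning.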
I have found much more success with PR deployments - every PR gets deployed and wired up to a PR environment which has a full copy of dev and a copy of each PR which runs for 8 hours after the PR is built (I switch out the deploy for a job). Not every service maps well to this model, but it's a good 95% solution in my experience.
Yeah I agree, but for us this would mean like 30 containers. We've tried several times to have some kind of flexible setup where devs could choose which parts to run, but that got complicated with all the various permutations of the containers and devs needing different setups in different situations.
If the complete deployment will run on a high-end laptop, I would suggest it's cheaper and easier to have the devs use local development on kind and Docker Desktop. The Docker licenses will pay for themselves, and you will be able to control costs. You will also incentivize devs to optimize the stack to run in local dev. For the Mac users, it means a $4k laptop.
As a developer, I really like the local dev experience vs deploying everything. On my Mac M2 Ultra, it's more than fast enough for my env.
We use monitoring to check whether our applications are idle. For example, one of our health checks looks at whether data was written to the database within a given period; if nothing has been written in the last month, say, we treat the database as idle. We use Prometheus/Alertmanager to notify us in Slack about idle resources.
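As an illustration of that kind of check (a sketch of the general idea, not the setup described above), idleness can be derived from the timestamp of the newest row in each database. The 30-day threshold approximates "within the last month", and the function and parameter names are made up.

```python
from datetime import datetime, timedelta, timezone

IDLE_AFTER = timedelta(days=30)  # assumed stand-in for "the last month"


def idle_databases(last_writes, now, idle_after=IDLE_AFTER):
    """Return names of databases with no writes inside the window.

    `last_writes` maps database name -> time of the newest row, or None if
    the database has never been written to.
    """
    idle = []
    for name, last_write in last_writes.items():
        if last_write is None or now - last_write > idle_after:
            idle.append(name)
    return sorted(idle)
```

In practice you'd probably export the last-write time as a metric and let an alert rule do the comparison, but the logic is the same.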
We also use HPA to scale resources, though in my experience it takes a lot of time to find optimal settings for it. Each product team has a namespace in the cluster. These clusters have around 200 nodes.
We have hundreds of installations of our application across ~8 data centers, and we use both Ansible and Golang operators to manage our many resources.