Wednesday, September 16, 2020

The missing OS

Preface:

When I joined Google in 2011, I quoted a quip of a friend of mine:
"There are roughly one and a half computers in the world, and Google has one of them."
The world has changed quite a bit since 2011, and there may be half a dozen computers in the world now. That said, for the following text to make sense, when I say "the computer", I mean a very large assembly of individual machines that have been connected to make them act like one computer.

Actual blog post:

The tech landscape of modern microservice deployments can be confusing - it is fast-changing, with a proliferation of superficially similar projects claiming to do similar things. Even for me, as someone fairly deep into technology, it isn't always clear what precise purpose the different projects serve.

I've quipped repeatedly about "Datacenter OS" (at least here and here), and mused about it since I first left Google for my sabbatical in 2015. I recently had the chance to chat with a bunch of performance engineers (who sit very much at the crossing between Dev and Ops), and they reminded me to write up my thoughts. This is a first post, but there may be more coming (particularly on the security models for it).

Warning: This post is pure, unadulterated opinion. It is full of unsubstantiated unscientific claims. I am often wrong.

I claim the following:
When we first built computers, it took a few decades until we had the first real "operating systems". Before a 'real' OS emerged, there were a number of proto-OS -- collections of tools that had to be managed separately and cobbled together. There were few computers overall in the world, and if you wanted to work on one, you had to work at a large research institution or organization. These machines ran cobbled-together OSs that were unique to that computer.

Since approximately 2007, we're living through a second such period: The "single computer" model is being replaced with "warehouse-sized computers". Initially, few organizations had the financial heft to have one of them, but cloud computing is making "lots of individual small computers" accessible to many companies that don't have a billion dollars in cash for a full datacenter.

The hyperscalers (GOOG, FB, but also Tencent etc.) are building approximations to a "proto-datacenter-OS" internally; Amazon is externalizing some of theirs, and a large zoo of individual components for a Datacenter-OS exists as open-source projects.

What does not exist yet is an actual complete DatacenterOS that "regular" companies can just install.

There is a "missing OS" - a piece of software that you install on a large assembly of computers, and that transforms this assembly of computers into "one computer".

What would a "Datacenter OS" consist of? If you look at modern tech stacks, you find that there is a surprising convergence - not in the actual software people are running, but in the "roles" that need to be filled. For each role, there are often many different available implementations.

The things you see in every large-scale distributed infrastructure are:

  1. Some form of cluster-wide file system. Think GFS/Colossus if you are inside Google, GlusterFS or something like it if you are outside. Many companies end up using S3 because the available offerings aren't great.
  2. A horizontally scalable key-value store. Think BigTable if you are inside Google, or Cassandra, or Scylla, or (if you squint enough) even ElasticSearch.
  3. A distributed consistent key-value store. Think Chubby if you are inside Google, or etcd if you are outside. This is not directly used by most applications and mostly exists to manage the cluster.
  4. Some sort of pub/sub message queuing system. Think PubSub, or in some sense Kafka, or SQS on AWS, or perhaps RabbitMQ.
  5. A job scheduler / container orchestrator. A system that takes the available resources, and all the jobs that ought to be running, and a bunch of constraints, and then solves a constrained bin-packing optimization problem to make sure resources are used properly. Think Borg, or to some extent Kubernetes. This may or may not be integrated with some sort of MapReduce-style batch workload infrastructure to make use of off-peak CPU cycles.
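To make the "constrained bin-packing" in item 5 concrete, here is a minimal sketch of a first-fit-decreasing placement loop. It is a toy under loud assumptions: the `Machine` and `Job` types, the two resource dimensions (CPU and memory), and the `schedule` function are all made up for illustration. Real schedulers such as Borg or Kubernetes weigh far more constraints (affinity, priorities, preemption) and do not necessarily use this heuristic.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    cpu: float      # free CPU cores (hypothetical unit)
    mem_gb: float   # free memory in GB (hypothetical unit)
    jobs: list = field(default_factory=list)


@dataclass
class Job:
    name: str
    cpu: float
    mem_gb: float


def schedule(jobs, machines):
    """First-fit decreasing: try the largest jobs first, and put each one
    on the first machine that still has enough free CPU and memory.

    Returns (placement, unplaced), where placement maps job name to
    machine name and unplaced lists the jobs that fit nowhere.
    """
    placement = {}
    unplaced = []
    # Sort by demand, largest first; ties keep their original order.
    for job in sorted(jobs, key=lambda j: (j.cpu, j.mem_gb), reverse=True):
        for m in machines:
            if m.cpu >= job.cpu and m.mem_gb >= job.mem_gb:
                m.cpu -= job.cpu
                m.mem_gb -= job.mem_gb
                m.jobs.append(job.name)
                placement[job.name] = m.name
                break
        else:
            # No machine had room for this job.
            unplaced.append(job.name)
    return placement, unplaced


if __name__ == "__main__":
    machines = [Machine("m1", cpu=4, mem_gb=16), Machine("m2", cpu=2, mem_gb=8)]
    jobs = [Job("web", 2, 4), Job("db", 3, 12), Job("batch", 2, 4)]
    print(schedule(jobs, machines))
```

Note that the heuristic is greedy: it never moves a job once placed, so it can leave a job unplaced even when a smarter packing would have fit everything. That gap is exactly where the "optimization problem" part of a real scheduler lives.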

I find it very worthwhile to think about "what other pieces do I have on a single-laptop-OS that I really ought to have on the DatacenterOS?".

People are building approximations of a process explorer via Prometheus and a variety of other data collection agents.

One can argue that distributed tracing (which everybody realizes they need) is really the Datacenter-OS-strace (and yes, it is crucially important). The question "what is my Datacenter-OS-syslog" is similarly interesting. 
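The strace analogy can be sketched in a few lines. The core trick of distributed tracing is that every request carries a trace ID across service boundaries, so that per-service records can later be stitched into one call tree. The sketch below assumes hypothetical header names (`x-trace-id`, `x-span-id`) and hypothetical helpers (`start_span`, `outgoing_headers`); it is not the wire format or API of any particular tracing system.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str                    # shared by every span in one request
    span_id: str                     # unique to this unit of work
    parent_id: Optional[str] = None  # span that caused this one, if any


def start_span(incoming_headers):
    """Start a span for an incoming request.

    Reuses the caller's trace ID if one was sent (hypothetical header
    names); otherwise this service is the root and starts a new trace.
    """
    trace_id = incoming_headers.get("x-trace-id") or uuid.uuid4().hex
    parent_id = incoming_headers.get("x-span-id")
    return SpanContext(trace_id, uuid.uuid4().hex[:16], parent_id)


def outgoing_headers(ctx):
    """Headers to attach to any request this span makes downstream."""
    return {"x-trace-id": ctx.trace_id, "x-span-id": ctx.span_id}


if __name__ == "__main__":
    frontend = start_span({})                          # root of the trace
    backend = start_span(outgoing_headers(frontend))   # downstream call
    print(frontend)
    print(backend)
```

Each service would log its `SpanContext` alongside timing data; a collector can then group records by `trace_id` and rebuild the tree from the `parent_id` links, which is roughly what strace gives you for free on a single machine.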

A lot of the engineering that goes into observability is porting the sort of introspection capabilities we are used to having on a single machine to "the computer".

Is this "service mesh" that people are talking about just the DatacenterOS version of the portmapper?

There are other things for which we really have no idea how to build the equivalent. What does a "debugger" for "the computer" look like? Clearly, single-stepping on a single host isn't the right way to fix problems in modern distributed systems - your service may be interacting with dozens of other hosts that may be crashing at the same time (or grinding to a halt or whatever), and re-starting and single-stepping is extremely difficult.

Aside from the many monitoring, development, and debugging tools that need to be rebuilt for "the computer", there are many other - even more fundamental - questions that really have no satisfactory answer. Security is a particularly uncharted territory:

What is a "privileged process" for this computer? What are the privilege and trust boundaries? How does user management work? How does cross-service authentication and credential delegation work? How do we avoid re-introducing literally every logical single-machine privilege escalation that James Forshaw describes in his slides into our new OS and the various services running there? Is there any way that a single Linux Kernel bug in /mm does not spell doom for our entire security model?

To keep the post short:

In my opinion, the emerging DatacenterOS is the most exciting thing that has happened in computer science in decades. I sometimes wish I was better at convincing billionaires to give me a few hundred million dollars to invest in interesting problems -- because if there is a problem that I think I'd love to work on, it'd be a FOSS DatacenterOS - "install this on N machines, and you have 'a computer'".

A lot of the technological landscape is easier to understand if one asks the question: What function in "the computer" does this particular piece of the puzzle serve? What is the single-machine equivalent of this project?

This post will likely have follow-up posts, because there are many more ill-thought-out ideas I have on the topic:

  • Security models for a DatacenterOS
  • Kubernetes: Do you want to be the scheduler, or do you want to be the OS? Pick one.
  • How do we get the power of bash scripting, but for a cluster of 20k machines?