Definitely not Windows 95: What operating systems keep things running in space?

ESA's Solar Orbiter mission will face the Sun from within the orbit of Mercury at its closest approach. Credit: ESA/ATG medialab

ESA's recently launched Solar Orbiter will spend years in one of the most unwelcoming places in the Solar System: the Sun. During its mission, Solar Orbiter will get 10 million kilometers closer to the Sun than Mercury. And, mind you, Mercury is close enough to have sustained temperatures reaching 450°C on its Sun-facing surface.

To withstand such temperatures, Solar Orbiter is going to rely on an intricately designed heat shield. This heat shield, however, will protect the spacecraft only when it is pointed directly at the Sun; the sides and the back of the probe have no sufficient protection. Accordingly, ESA developed a real-time operating system (RTOS) for Solar Orbiter that can act under very strict requirements. The maximum allowed off-pointing from the Sun is only 6.5 degrees. Any off-pointing exceeding 2.3 degrees is acceptable only for a very brief period of time. When something goes wrong and dangerous off-pointing is detected, Solar Orbiter is going to have only 50 seconds to react.

"We've got extremely demanding requirements for this mission," says Maria Hernek, head of the flight software systems section at ESA. "Typically, rebooting the platform such as this takes roughly 40 seconds. Here, we've had 50 seconds total to find the issue, have it isolated, have the system operational again, and take recovery action."

To reiterate: this operating system, located far away in space, needs to remotely reboot and recover in 50 seconds. Otherwise, the Solar Orbiter is getting fried.

Billiard ball OS

To deal with such unforgiving deadlines, spacecraft like Solar Orbiter are almost always run by real-time operating systems that work in an entirely different way than the ones you and I know from the average laptop. The criteria by which we judge Windows or macOS are fairly simple. They perform a computation, and if the result of this computation is correct, then a task is considered to be done correctly. Operating systems used in space add at least one more central criterion: a computation needs to be done correctly within a strictly specified deadline. When a deadline is not met, the task is considered failed and terminated. And in spaceflight, a missed deadline quite often means your spacecraft has already turned into a fireball or strayed into an incorrect orbit. There's no point in processing such tasks any further; things must adhere to a very precise clock.

Time, as measured by the clock, is divided into discrete ticks. To simplify, space operating systems are typically designed in such a way that each task is performed within a set number of allocated ticks. It can take three ticks to upload data from sensors; four further ticks are devoted to firing up the engines, and so on. Each possible task is assigned a specific priority, so a higher-priority task can take precedence over a lower-priority task. This way, a software designer knows exactly which task is going to be performed in any given scenario and how much time it is going to take to get it done.
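To make the tick idea concrete, here is a minimal sketch of a tick-driven, fixed-priority scheduler. It is an illustration only, not code from any flight system: the task names, the tick budgets, and the convention that a lower number means a higher priority are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int          # assumption: lower number = higher priority
    ticks_needed: int      # tick budget fixed at design time
    release_tick: int = 0  # tick at which the task becomes ready


def run_schedule(tasks):
    """Return the name of the task that occupies each tick."""
    remaining = {t.name: t.ticks_needed for t in tasks}
    timeline = []
    tick = 0
    while any(remaining.values()):
        ready = [t for t in tasks
                 if t.release_tick <= tick and remaining[t.name] > 0]
        if not ready:
            timeline.append("idle")
        else:
            # The highest-priority ready task always gets the tick,
            # preempting whatever ran before.
            current = min(ready, key=lambda t: t.priority)
            remaining[current.name] -= 1
            timeline.append(current.name)
        tick += 1
    return timeline


# Hypothetical workload: sensor upload takes three ticks, engine firing four.
tasks = [
    Task("read_sensors", priority=2, ticks_needed=3),
    Task("fire_engines", priority=1, ticks_needed=4, release_tick=1),
]
timeline = run_schedule(tasks)
print(timeline)
```

Because nothing in the schedule depends on caches, background apps, or luck, running it again yields the same timeline tick for tick, which is the property the designer relies on.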

To compare this to operating systems we all know, just watch any given speed comparison between modern smartphones. In this one made by EverythingApplePro, the iPhone XS Max and Samsung S10 Plus go head to head opening some popular apps. Before the test, both phones are restarted, and the cache is cleared in them. Samsung opens all the apps in 2 minutes 30 seconds, and the iPhone clocks in at 2 minutes 54 seconds. In the second round, all the apps are closed and opened again without restarting or clearing the RAM. Because the apps are still in RAM, Samsung finishes the opening in 46 seconds, and the iPhone does it in 42 seconds. That's a whopping two-minute time difference between the first try and the second. But if the phones had to run the kind of real-time operating systems used for spaceflight, opening those apps would take exactly the same amount of time no matter how many times you tried it, down to a millisecond.

Beyond time, space operating systems have more tricks up their sleeves. Real-time operation is one thing, and determinism is another. If you somehow convinced Craig Federighi to take part in one of those speed comparisons, gave him full access to the iPhone about to be tested, and asked him to predict exactly how much time it would take for this iPhone to complete the test, he would likely have no idea. Sure, he'd probably say something like "fast," or "fast enough," or even "blazingly fast," but nothing more specific than that. Neither iOS nor Android is a deterministic system. The number of factors that could potentially affect speed results is so huge that making such exact predictions is practically impossible. But if the phone were running a space-grade OS, an engineer with access to the system would know exactly what causes what in a given sequence and could calculate the exact time necessary for any given task. Space-grade software has to be fully predictable and perform within super specific deadlines.

Shooting at the Moon (and beyond) with VxWorks

Back in the Apollo days, operating systems were custom-built for each mission. Sure, some of the code got reused—parts of the software made for the Apollo program made their way to Skylab and the Shuttle program, for instance. But for the most part, things had to be done from scratch.

One small reboot

During their famous descent, Buzz Aldrin and Neil Armstrong left the rendezvous radar antenna on and pointed at the Apollo Command Module (CM) orbiting the Moon. This was a safety measure for the lander to know where the CM was in case it needed to abort the landing. But it turned out the radar was flooding the Apollo Guidance Computer (AGC) with data, which caused it to quickly run out of memory. The infamous 1202 and 1201 alarms simply meant there were no free core sets and no free vector accumulator areas, respectively. The lack of memory made it impossible for the landing programs to complete on time, and this in turn caused repeated restarts of the computer. Still, due to safety measures built into the OS, no critical navigation data was lost during those reboots, and the landing could proceed as planned. The OS simply ran its scheduled tasks, picking up exactly where it had left off.

Eventually, NASA's preferred OS solution came from WindRiver, a company based in Alameda, California. WindRiver released a fully operational commercial off-the-shelf, real-time operating system called VxWorks back in 1987. While VxWorks wasn't the first system of this kind, it quickly became the most widely deployed of them all, meaning VxWorks soon caught the eye of NASA mission designers.

The first mission to fly VxWorks was the Clementine Moon probe, otherwise known as the Deep Space Program Science Experiment. Back in the early 1990s, Clementine marked NASA's shift away from behemoth, Apollo-like programs. Everything was supposed to be lean, developed quickly, and on a tight budget. As such, one of the design choices made for the Clementine probe was to use VxWorks, and the system made a good enough impression to get a second date. VxWorks was the choice for the Mars Pathfinder mission.

Not everything was rosy for this RTOS, though. A bug, the priority inversion problem, caused a lot of trouble for NASA's ground control team. Shortly after landing, Pathfinder's system started to reboot for no apparent reason, which delayed transmitting the collected data back to Earth. It took three weeks to find the problem and another 18 hours to fix it; the issue turned out to be buried deep down in the VxWorks mechanics.

Anatomy of VxWorks

At the heart of VxWorks lies the wind microkernel. Its job is to manage all the interactions between the applications running in the system and the hardware. In VxWorks, the microkernel is responsible for scheduling tasks across the 256 priority levels a task can be assigned. Both preemptive and non-preemptive round-robin scheduling are supported, and the microkernel also handles all communications between tasks.

Tasks in the system can be in one of four states. The "ready" state is the state of a task when it is started. From there, it can either run until it's done or be assigned a specific amount of time for running. A task enters a "blocked" state when it gets preempted by another task with a higher priority or when its allotted number of ticks has run out. The third option is a "delayed" state. A task is delayed while it waits for resources necessary for it to do its job (maybe data samples from a sensor). A delay is always measured by a timer running independently of processing, typically a tick counter that the kernel maintains at all times. When such delays exceed some set values, the system assumes something probably went really wrong and starts rebooting. Finally, there is also the fourth, "suspended" state, where the task's context registers are saved while it is stopped for debugging.
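The four states and the delay watchdog described above can be sketched in a few lines. This is a toy model, not the VxWorks API: the class and function names and the `MAX_DELAY_TICKS` limit are hypothetical, chosen only to show how a tick counter that runs independently of the task can decide that a delay has lasted too long.

```python
from enum import Enum


class State(Enum):
    READY = "ready"
    BLOCKED = "blocked"
    DELAYED = "delayed"
    SUSPENDED = "suspended"


MAX_DELAY_TICKS = 5  # hypothetical limit; real systems tune this per task


class Task:
    def __init__(self, name):
        self.name = name
        self.state = State.READY   # a task starts out ready
        self.delay_ticks = 0

    def wait_for_resource(self):
        """The task needs something (say, sensor samples) that is not there yet."""
        self.state = State.DELAYED
        self.delay_ticks = 0

    def resource_arrived(self):
        self.state = State.READY
        self.delay_ticks = 0


def on_tick(task):
    """Called once per kernel tick; returns True if a reboot should be triggered."""
    if task.state is State.DELAYED:
        task.delay_ticks += 1
        return task.delay_ticks > MAX_DELAY_TICKS
    return False


task = Task("bus_manager")
task.wait_for_resource()
reboots = [on_tick(task) for _ in range(6)]
print(reboots)
```

In this sketch the first five ticks pass quietly and the sixth trips the watchdog, because the task has then been delayed longer than the configured limit.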

Inter-task communication in VxWorks can be done either through a messaging service that allows tasks to exchange data or through semaphores, variables that exist to make sure tasks are interlocked or synchronized when needed. There are two types of semaphores in VxWorks. The first are binary semaphores, which can assume two values: "full" or "empty." Full semaphores are available for tasks, and empty ones are unavailable. When a task starts, it takes an available semaphore, making it "empty" or unavailable for other tasks. When the task is finishing its execution, it relinquishes the semaphore, thus rendering it available for other tasks.

Such binary semaphores are used for synchronizing or interlocking different tasks. The name "semaphore" has railroad connotations, so let's stick with that for an analogy: imagine two trains that need to meet at some point to exchange cargo. In the VxWorks reality, the train that needs to pick the cargo up would create an empty semaphore and hand it over to the train that is carrying this cargo at the moment. Once the cargo-carrying train has unloaded it at the exchange point, this train would release the semaphore, leaving it up for grabs again. The first train (the one that created the semaphore) would then get notified that the semaphore is available, take it, and come in to pick up the cargo.

In addition to binary semaphores, VxWorks includes a second type known as mutual exclusion, or "mutex," semaphores. These allow a task to have the exclusive use of a resource. The main difference with this method is how the semaphore is initialized. Binary semaphores are always created empty. Mutex semaphores are always created full. A task simply creates a full semaphore and takes it immediately, thus making it unavailable to all other tasks until it's through with whatever it is doing. Such semaphores are often used to access communications hardware. A task needs to use such equipment, say, an information bus, until its data transfer is over. Cutting the transmission before it's done would be pointless, hence the need for mutex semaphores.
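The two semaphore flavors can be imitated with Python's standard threading module. This is an analogy rather than VxWorks code: `threading.Semaphore(0)` stands in for a binary semaphore created empty (it is really a counting semaphore, used here with values 0 and 1 only), `threading.Lock()` stands in for a mutex created full, and the train-themed names are made up for the example.

```python
import threading

cargo_ready = threading.Semaphore(0)  # stand-in for a binary semaphore, created "empty"
bus_mutex = threading.Lock()          # stand-in for a mutex, created "full" (available)
log = []


def carrier():
    # Exclusive use of the shared resource for the whole transfer.
    with bus_mutex:
        log.append("carrier: unloaded cargo")
    # Signal the waiting train that the cargo is at the exchange point.
    cargo_ready.release()


def receiver():
    # Blocks here until the carrier releases the semaphore.
    cargo_ready.acquire()
    with bus_mutex:
        log.append("receiver: picked up cargo")


# Start the receiver first on purpose: it still has to wait for the carrier.
threads = [threading.Thread(target=receiver), threading.Thread(target=carrier)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log)
```

The semaphore enforces ordering between the two tasks (pickup can only follow unloading), while the mutex guarantees that only one of them touches the shared resource at a time.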

If this sounds clever, it's because it is. The semaphore system is proprietary, and it became one of VxWorks' selling points. But during those first few weeks Mars Pathfinder spent on the Red Planet, the RTOS still went beautifully downhill.

A Martian bug

The "information bus" working onboard the Mars Pathfinder was a shared memory area used for passing data between different components of the lander. Predictably, this area was a resource locked with a mutex semaphore. As it turned out, there were three tasks involved in causing the mysterious reboots. The first was a high-priority task whose job was to manage the information bus operations. The second was a low-priority task, which once in a while would take the information bus mutex to share meteorological data. The third culprit was a medium-priority communications task.

Here's how this system was supposed to work: the meteorological data-gathering task was supposed to infrequently seize the information bus mutex. On rare occasions when the information bus management task was scheduled to run while the meteorological data-gathering task was running, the higher-priority task would try to get ahold of the same mutex, and therefore it ended up blocked until the lower-priority meteorological data was written to the bus. So far, so good, as data transfers should go from start to finish. But the third, medium-priority communications task entered the scene and caused trouble.

The trouble was that there was an unlikely sequence of events that could schedule the medium-priority task to run when the low-priority meteorology task was running after it caused the high-priority bus-management task to block on the mutex. There was only a split-second window of opportunity for this to happen, but when it did occur, the medium-priority task preempted the low-priority task. One of the many things the halted meteorological data gathering couldn't do on such occasions was release the mutex semaphore to the high-priority bus management task. In consequence, the medium-priority task indirectly blocked the higher-priority task from running, hence the priority inversion. Of course, this caused the bus management task to enter the delayed state. And once the independent timer working in the kernel figured out that the important thread was not running as planned, it assumed something went really wrong and initiated a total reboot.

Such reboots happened roughly half a dozen times in two weeks, but ultimately VxWorks and its design were not to blame. The system could deal with such issues with a trick called "priority inheritance," which causes a low-priority task to temporarily assume the higher priority of another task it has just blocked on a mutex. If priority inheritance had been enabled on the Mars Pathfinder, the meteorological data-gathering task would have simply assumed the high priority of the bus management task for the time the bus management task was waiting on the semaphore. This, in turn, would have prevented the medium-priority communications task from preempting it. All that had to be done was to turn on the priority inheritance option before launch.
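The Pathfinder scenario can be replayed in a toy tick-based simulation. Everything here is an assumption made for illustration: the task names, the tick counts, the release times, and the rule that a mutex user holds the mutex for its entire run. It is not Pathfinder's flight code, and it skips the brief moment in which the high-priority task runs just long enough to request the mutex.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int      # assumption: lower number = higher priority
    release: int       # tick at which the task becomes ready
    work: int          # ticks of CPU time still needed
    needs_mutex: bool  # holds the bus mutex for its whole run


def simulate(inheritance):
    """Return which task runs on each tick, with or without priority inheritance."""
    tasks = [
        Task("meteo", 3, 0, 3, True),   # low priority, takes the mutex first
        Task("bus", 1, 1, 2, True),     # high priority, needs the same mutex
        Task("comms", 2, 2, 4, False),  # medium priority, no mutex
    ]
    owner = None
    timeline = []
    tick = 0
    while any(t.work > 0 for t in tasks):
        ready = [t for t in tasks if t.release <= tick and t.work > 0]
        # Tasks that want the mutex while another task owns it must wait.
        waiters = [t for t in ready
                   if t.needs_mutex and owner is not None and owner is not t]
        runnable = [t for t in ready if all(t is not w for w in waiters)]

        def effective(t):
            # With inheritance, the owner borrows its best waiter's priority.
            if inheritance and t is owner and waiters:
                return min(t.priority, min(w.priority for w in waiters))
            return t.priority

        if not runnable:
            timeline.append("idle")
        else:
            current = min(runnable, key=effective)
            if current.needs_mutex:
                owner = current
            current.work -= 1
            if current.work == 0 and owner is current:
                owner = None  # mutex released when the task finishes
            timeline.append(current.name)
        tick += 1
    return timeline


print("no inheritance:  ", simulate(False))
print("with inheritance:", simulate(True))
```

Without inheritance, the medium-priority task runs for four ticks while the high-priority task sits waiting on a mutex that the preempted low-priority task cannot release; in a real system, that wait is what trips the watchdog. With inheritance, the low-priority task finishes first, and the high-priority task starts on the very next tick.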

Therefore, at the end of the day, Pathfinder's issues stemmed from a human error. VxWorks, thus found not guilty, has gone on to fly on pretty much every rover that has landed on Mars since. Just a few decades after becoming the most widely deployed RTOS on Earth, it managed to become the most popular operating system on the Red Planet, too.

From 2015: An artist's rendering of the BepiColombo mission, a joint ESA/JAXA project, which will take two spacecraft to the harsh environment of Mercury. Credit: ESA

ESA falls for RTEMS

For the last decade, the space operating systems landscape seemed stable. In the US, NASA was mostly happy with using proprietary VxWorks for its most high-profile missions. But in Europe, ESA had its own workhorse. The space agency was heavily invested in developing the open source RTEMS, which, according to ESA's Maria Hernek, is just as capable but comes without expensive licensing fees.

RTEMS was not initially created to fly European spaceships. Its original purpose was actually flying US missiles. This RTOS's history began with a study performed at the Research Development and Engineering Center of the US Army Missile Command back in 1988. Army researchers concluded that using proprietary real-time operating systems caused a number of problems. Most notably, the government did not own the code, so it couldn't modify it in any way. Moreover, the study claimed the responsibility for software failures looked a bit unclear, and RTOSes of that era were too slow for missile systems. For all those reasons, the Army decided to build its own RTOS called Real-Time Executive for Missile Systems. The goal was to make an RTOS that was fast enough for guiding missiles, government-owned, easy to run on different processor families, and license-free.

As RTEMS was taking shape, the US military started to realize that its possible applications reached far beyond firing rockets. Hence the name of the system quickly evolved into the more general Real-Time Executive for Military Systems. And since May 4, 1995, when RTEMS was released as open source and no longer bound to wear a uniform, it has been known as the Real-Time Executive for Multiprocessor Systems.

The European Space Agency has fallen in love with it for two main reasons. The first is that RTEMS was designed from the ground up to be effortlessly ported to new processor families. So, making it work on SPARC LEON radiation-hardened chips developed in Europe for ESA's space missions could be done with relative ease. The second reason is that the system is highly customizable. Based on the same working principles as VxWorks, RTEMS allows programmers more freedom since virtually everything in the system can be changed. ESA was totally free to fiddle with the code.

Scheduling is one of the customizable areas where RTEMS differs from VxWorks. In VxWorks, a programmer is stuck with a preemptive priority-based scheduler for tasks with differing priorities and a round-robin scheduler when multiple tasks have the same priority. It can't be changed. WindRiver built it this way: take it or leave it. RTEMS offers a completely different approach.

Of course, RTEMS has a priority-based scheduler with 256 levels of priority just as in VxWorks. There is also a round-robin scheduling method available. Both are used as default schedulers for single-processor platforms. But in RTEMS, you can dispense with each option and go for one of the numerous other scheduling mechanisms instead. There is the Simple Priority Scheduler, a leaner version of the default schedulers that can work under severe memory constraints. The same low-memory scheduler is also available in a variant designed for symmetric multiprocessing systems with multiple processors running in parallel. Another option entirely is the Earliest Deadline First Scheduler, which, as its name suggests, prioritizes tasks with the earliest deadlines. Plus, if you are not happy with any of RTEMS' options, you are free to throw them all out the window and write your own scheduling algorithm. RTEMS will work with that as well.
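The idea behind earliest-deadline-first scheduling fits in a short sketch. This is a generic illustration of the technique, not the RTEMS implementation or its API: the job tuples, the job names, and the `edf_schedule` function are all hypothetical.

```python
import heapq


def edf_schedule(jobs):
    """Simulate earliest-deadline-first scheduling on a single processor.

    jobs: list of (name, release_tick, deadline_tick, ticks_needed).
    Returns (timeline, missed): the job run on each tick, and the names
    of jobs that finished after their deadline.
    """
    pending = sorted(jobs, key=lambda j: j[1])  # order by release time
    heap = []      # entries are (deadline, name, remaining_ticks)
    timeline = []
    missed = []
    tick = 0
    i = 0
    while i < len(pending) or heap:
        # Admit every job that has been released by now.
        while i < len(pending) and pending[i][1] <= tick:
            name, _, deadline, need = pending[i]
            heapq.heappush(heap, (deadline, name, need))
            i += 1
        if not heap:
            timeline.append("idle")
        else:
            # The job with the nearest deadline always gets the tick.
            deadline, name, remaining = heapq.heappop(heap)
            timeline.append(name)
            remaining -= 1
            if remaining > 0:
                heapq.heappush(heap, (deadline, name, remaining))
            elif tick + 1 > deadline:
                missed.append(name)
        tick += 1
    return timeline, missed


# Hypothetical workload for illustration.
jobs = [
    ("telemetry", 0, 10, 3),
    ("attitude", 1, 4, 2),
    ("thermal", 2, 8, 2),
]
timeline, missed = edf_schedule(jobs)
print(timeline, missed)
```

Unlike a fixed-priority scheduler, nothing here is ranked in advance: the telemetry job starts first, but it gets pushed aside as soon as jobs with nearer deadlines show up, and it still finishes well before its own deadline.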

Since opting for this RTOS, ESA has invested lots of time and effort into qualifying RTEMS to software criticality Level B, which is the second-highest level of software reliability recognized by the agency. The ESA uses Level B status to denote software whose failure would cause “critical” consequences. To achieve that, ESA testers had to execute every single line and every single decision point in the RTEMS code. The only higher criticality—Level A—is where the consequences of failure are “catastrophic.” (Sadly, ESA documents do not specify what “critical” or “catastrophic” mean exactly, but you can easily imagine the ISS crashing down on Brussels.)

“I recall the last time we used VxWorks was in one of the instruments on Sentinel 1 spacecraft,” says Hernek. All other modern European space missions, including the most recent Solar Orbiter, flew with RTEMS onboard.

RTOS on a mission

At this point, VxWorks and RTEMS have been used for decades and are astonishingly good at what they do. In an email exchange discussing real-time operating systems in 2004, Gregory Menke, a NASA software engineer, wrote that in terms of performance, RTEMS and VxWorks were so close that it was impossible to even tell the difference between the two. So, as you might expect, ESA used VxWorks at times, and NASA went for RTEMS on more than one occasion. The two major flight operating systems have even run in parallel on the same spacecraft managing different instruments.

But that doesn't mean the last decade has been all VxWorks and RTEMS in the world of space operating systems. And sometimes, new challengers came from the most unexpected places, like a bitcoin forum post.

Back in 2013, bitcoin core developer Jeff Garzik posted a humble idea to the Bitcoin Talk Forum: what about building some bitcoin resiliency in space?

"I was researching how to sort of make the bitcoin network even more resilient," Garzik says. "And I had an amateur space background: my father took me to Space Shuttle launches; he worked at the White Sands Missile Range." Garzik saw two potential paths: the first, according to him, was to rent bandwidth on an existing satellite and use it to broadcast the blockchain data.