Is RAM wiped before use in another LXC container?

adamgordonbell · on April 5, 2023

My shortcut for answering these questions is just replace 'container' with process and find the answer for that.

A running container is just a process that is using certain kernel features: cgroup, namespaces, pivotroot.

LXC has a different DX around containers than OCI containers do, but the same rules apply.

commandersaki · on April 5, 2023

One exercise on pwn.college in their kernel security module is they have a program that forks, opens the flag file (/flag owned by root), reads the content, and then child process exits.

The program continues to run to allow you to load in shell code that'll be run by the kernel.

Your task is basically to write some shellcode to scan the memory for the flag. So now I know that at least some Linux-es don't clean up after a process exits, and you can get the contents of memory when you have kernel privileges. This is not so easy if you're scanning memory as root since /dev/mem or whatever won't reveal that process memory.

CodesInChaos · on April 5, 2023

By default memory is wiped before it's handed out again, not when it's freed. This improves performance, but means secrets can remain in RAM for longer than necessary, where they can be accessed by privileged attackers (software running as root, DMA without MMU, or hardware attacks). For unprivileged processes eager and lazy zeroing look the same.

seri4l · on April 5, 2023

Apparently there's a kernel config flag to zero the memory on free (CONFIG_INIT_ON_FREE_DEFAULT_ON) but it has a quite expensive performance cost (3-5% according to the docs). I wonder in what kind of scenario it would make sense to enable it.

bhawks · on April 5, 2023

You want to enable this if your concerned about forensic attacks. A simple example would be someone has physical access to your device. They're able to power it down, and boot it with their own custom kernel. If the memory has not been eagerly zeroed they may be able to extract from RAM sensitive data.

This flag puts an additional obstacle in the attacker's path. If you have private key material protecting valuable property, you definitely want to throw up as many roadblocks as possible.

l33t233372 · on April 5, 2023

I don’t understand why this would help prevent cold boot attacks.

Wouldn’t the memory need to bee free’d first for this to have any effect?

bhawks · on April 5, 2023

Yes it would - either through the free syscall or a process exit. This is a defense in depth strategy and not 100% perfect. If you yanked the power cord and a long lived process had sensitive data in memory you're still vulnerable. But if you had a clean power down or very short lifetimes of sensitive data being active in RAM it would afford you additional security.

WalterBright · on April 5, 2023

?? Cutting the power means the RAM contents vanish.

SllX · on April 5, 2023

They vanish eventually which is usually measured in seconds. This can be extended to minutes or hours if someone performs a cold boot attack: https://security.stackexchange.com/questions/10643/recover-t...

l33t233372 · on April 5, 2023

I find that phrasing weird.

A cold boot attack relies on a cold boot of the system to evade kernel protections(as opposed to a warm boot where the kernel can zero memory.)

The name has nothing to do with reducing the temperature of the ram to extend the time it takes bytes to vanish in ram.

SllX · on April 5, 2023

I think it’s a little bit of column A and a little bit of column B, but admit while I remember reading about using technique a long time ago, I’m not sure of the history of the nomenclature. From the StackExchange:

> For those who think this is only theoretical: They were able to use this technique to create a bootable USB device which could determine someone's Truecrypt hard-drive encryption key automatically, just by plugging it in and restarting the computer. They were also able to recover the memory-contents 30 minutes+ later by freezing the ram (using a simple bottle of canned-air) and removing it. Using liquid nitrogen increased this time to hours.

l33t233372 · on April 5, 2023

Reducing the temperature of the RAM can be done to make a cold boot attack easier, but it’s not the origin of the name.

For more details, see the paper Lest We Remember.

SllX · on April 6, 2023

Thanks, TIL! I'll check it out.

WalterBright · on April 5, 2023

i didn't know that. Thanks!

ender341341 · on April 5, 2023

the idea is that well written software would release memory as soon as possible so with it enabled you'd have the secret in memory for as little time as possible.

Though in my mind well written software should be zeroing the memory out before freeing if it held sensitive data.

jamesfinlayson · on April 6, 2023

Yes I thought that was common practice - I remember reading patch notes for something years ago where the program had been updated to always zero the password buffer after checking that it matches (I think in some cases it was kept around for a bit).

bhawks · on April 6, 2023

From a defense in depth perspective you definitely want the implementation to be robust (zero the memory after reading). However you should also consider:

1) abnormal program termination due to signals, memory pressure/oom killer, aborts in other threads serving different requests, and so on. These events could race with the memory zeroing.

2) bugs in the implementation where memory isn't zeroed in all paths

3) interactions between compiling, standard libraries, language runtimes and optimization passes causing memory zeroing to be skipped.

All these cases have happened time and time again in the wild. Hence having additional safety nets is useful.

These patches were endorsed by folks working on chromeos and Android security. I would suppose that they want them to put additional safeguards behind full disk encryption keys and may also be concerned with quality of implementation issues in 3rd party or vendor blobs.

Wowfunhappy · on April 5, 2023

How is the attacker powering down the device while retaining the contents of its RAM?

l33t233372 · on April 5, 2023

Perhaps by using a can of compressed air[0].

[0] https://www.usenix.org/legacy/event/sec08/tech/full_papers/h...

soulofmischief · on April 5, 2023

If your PC is connected to a power strip, it's my understanding that law enforcement can attach a live male-to-male power cable to the power strip and then remove the power strip from the wall while still powering the computer. That, and yeah freezing ram.

ghostpepper · on April 6, 2023

So technically that's removing power, not "powering it down". I guess you'd then warm-boot with your own kernel and hope that the contents of RAM are mostly untouched?

Gracana · on April 5, 2023

Data fades slowly from DRAM, especially if you freeze it first.

MR4D · on April 5, 2023

So, if it's only 3-5% slower, then for $50-100 I could buy a slightly faster processor and never know the difference?

Just trying to check my understanding of what the 3-5% delta is. Seems like a tiny tradeoff for any workstation (I wouldn't notice the difference at least). The tradeoff for servers might vary depending on what they are doing (shared versus owned, etc)

postalrat · on April 5, 2023

How many thousand tradeoffs like this are you willing to pay for?

soulofmischief · on April 5, 2023

This seems beneficial in systems where security concerns trump performance concerns. The above poster has probably made many such trade-offs already and would likely make more. (Full disk encryption, virtualization, protection rings, spectre mitigations, MMIO, ECC, etc.)

With exponentially increasing processor performance it does make sense for workstations where physical access should be considered in the threat model.

MR4D · on April 5, 2023

Lots.

But then again, I run a few companies that deal with sensitive data. If I were just a gamer, I wouldn't care.

gus_massa · on April 5, 2023

I don't understand why it is slower. It has to be zeroed anyway.

In the normal configuration:

Is it not zeroed if the memory is assigned to the same process???

Is it zeroed when the system is idle???

Is it zeroed in batches that are more memory friendly???

KMag · on April 5, 2023

> I don't understand why it is slower. It has to be zeroed anyway.

Memory pages freed from userspace might be reused in kernelspace.

If, for instance, the memory is re-used in the kernel's page cache, then the kernel doesn't need to zero it out before copying the to-be-cached data into the page.

Edit: I seem to remember back in the 1990s that the kernel at least in some cases wouldn't zero-out pages previously used by the kernel before giving them to userspace, sometimes resulting in kernel secrets being leaked to arbitrary userspace processes. Maybe I'm missremembering, and it was just leakage of secrets between userspace processes. In any case, in the 1990s, Linux was way too lax about leaking data from freed pages.

wongarsu · on April 5, 2023

Two reasons:

- lots of code is written under the assumption that free is fast

- memory is zeroed in the background, unless memory pressure forces the kernel to zero when handing it out

vlovich123 · on April 5, 2023

Yes. It’s zeroed in a low priority background process to avoid interfering with foreground apps.

dathinab · on April 5, 2023

> Is it zeroed when the system is idle???

yes mainly that,

and if the system isn't idle but also doesn't use all phys. memory it might not be zeroed for a very long time

> Is it not zeroed if the memory is assigned to the same process???

idk. what the current state of this is in linux but at least in the past for some systems for some use cases related to mapped memory this was the case

shawabawa3 · on April 5, 2023

My guess is it won't always have to be zeroed

e.g. if your code is doing

    ptr = malloc()
    memcpy(mydata, ptr)

You can presumably optimise out the zeroing of the memory

KMag · on April 5, 2023

As far as I know, the Linux kernel never inspects the userspace thread to adjust behavior based on what the thread is going to do next. This would be a very brittle sort of optimization.

More importantly, it's not safe. Another thread in the same process can see ptr between the malloc and the memcpy!

Edit: also, of course, malloc and memcpy are C runtime functions, not syscalls, so checking what happens after malloc() would require the kernel to have much more sophisticated analysis than just looking a few instructions ahead of the calling thread's %%eip/%%rip. While handling malloc()'s mmap() or brk() allocation, the kernel would need to be able to look one or two call frames up the call stack, past the metadata accounting that malloc is doing to keep track of the newly acquired memory, perhaps look at a few conditional branches, trace through the GOT and PLT entries to see where the memcpy call is actually going, and do so in a way that is robust to changes in the C runtime implementation. (Of course, in practice, most C compilers will inline a memcpy implementation, so in the common case, it wouldn't have to chase the GOT and PLT entries, but even then, it's way too complicated for the kernel to figure out if anything non-trivial is happening between mmap()/brk() and the memory being overwritten.)

Edit 2: To be robust in the completely general case, even if it were trivial to identify the inlined memcpy implementation, and it were clearly defined "something non-trivial happens", determining if "something non-trivial happens" between mmap()/brk() and memcyp() would involve solving the halting problem. (Imposssible in the general case.)

mjevans · on April 5, 2023

I'm NOT an expert here, but offhand.

  malloc() == 'reservation' (but not paged in!) memory
  // If touched / updated THEN the memory's paged in

A copy _might_ not even become a copy if the kernel's smart enough / able to setup a hardware trigger to force a copy on writes to that area, at which point the physical memory backing two distinct logical memory zones would be copied and then different.

KMag · on April 5, 2023

That's a good point that Linux doesn't actually allocate the pages until they're faulted in by a read or write. So, if it were doing some kind of thread inspection optimization, it would presumably just need to check if the faulting thread is currently in a loop that will overwrite at least the full page.

However, that wouldn't solve the problem of other threads in the same process being able to see the page before it's fully overwritten, or debugging processes, or using a signal handler to invisibly jump out of the initialization loop in the middle, etc. There are workarounds to all of these issues, but they all have performance and complexity costs.

sumtechguy · on April 5, 2023

malloc gets memory from the heap which may or may not be paged in/reused. That means you may get reused memory from the heap (which is up to the CRT).

If you want make sure it is zero you will want calloc. If you know you are going to copy something in on the next step like your example you probably can skip calloc and just us malloc. calloc is nice for when you are doing thigs like linked lists/trees/buffers and do not want to have steps to clean out the pointers or data.

matt_heimer · on April 5, 2023

Just a guess but since apps can fail to free memory correctly you probably have to zero it on allocation and deallocation (to be secure) when you enable the feature. So you aren't swapping one for the other, you are now doing both.

Denvercoder9 · on April 5, 2023

> Just a guess but since apps can fail to free memory correctly

That's not relevant here; from the perspective of the kernel pages are either assigned to a process, or they're not. If an application fails to free memory correctly, that only means it'll keep having pages assigned to it that it no longer uses, but eventually those pages will always be released (by the kernel upon termination of the process, in the worst case).

dfox · on April 5, 2023

That is the worst case if the process had leaked that part of the heap, but it is an optimal case on process exit. On OS with any kind of process isolation walking over most of the heap before exiting as to "correctly free it" is pure waste of the CPU cycles and in worst case even IO bandwidth (when it causes parts of the heap to be paged in).

samus · on April 5, 2023

Pages can be completely avoided to be paged in if the intention is to just zero them. The kernel could either just "forget" them, or use copy-on-write with a properly zeroed out page as a base.

dfox · on April 6, 2023

The point is that you do not want to do any kind of heap cleanup before exit. The intention isn't to zero the pages, but to outright discard all of them (which is going to be done by the kernel anyway).

CodesInChaos · on April 5, 2023

I'd like the ability to control this at a process or even allocation (i.e. as a flag on mmap) level. That way a password manager could enable this, while a game could disable it.

Bender · on April 5, 2023

Do you mean init_on_alloc=1 and init_on_free=1? Here [1] is a thread on the options and performance impact. FWIW I use it on all my workstations but these days I am not doing anything that would be greatly impacted by it. I've never tried it on a gaming machine and never tried it on a large memory hypervisor.

I wish there were flags similar to this for the GPU memory. Even something that zero's GPU memory on reboot would be nice. I can always see the previous desktop after a reboot for a brief moment.

[1] - https://patchwork.kernel.org/project/linux-mm/patch/20190617...

bayindirh · on April 5, 2023

Any multi-user system where users don't know each other and handle sensitive data.

usr1106 · on April 6, 2023

So every simple home PC with more than 1 tab open where the user is logged in to more than 1 site.

bayindirh · on April 6, 2023

No, more like companies which run HPC clusters to do defense stuff and similar things, and maybe commercial HPC providers.

dathinab · on April 5, 2023

When running non performance sensitive but security sensitive code. Even adding protections summing up to much higher performance penalties can be very acceptable.

E.g. on a crypto key server. Less if it's a server which encrypts data en mass, but e.g. one which signs longer valid auth tokens or one which hold middle layer certificates which are once every few hours used to create a cert used to encrypt/sign data en mass used on a different server etc.

ape4 · on April 5, 2023

I believe the docs but I would have thought that memset() would be really quick - implemented in hardware?

dataflow · on April 5, 2023

"Real quick" is human speak. For large amounts of memory it's still bound by RAM speed for a machine, which is much lower (a couple orders of magnitude I believe) than, say, cache speed. Things might be different if there was a RAM equivalent of SSD TRIM (making the RAM module zero itself without transferring lots of zeros across the bus), but there isn't.

throwaway894345 · on April 5, 2023

I'm completely unfamiliar with how the CPU communicates with the memory modules, but is there not a way for the CPU to tell the memory modules to zero out a whole range of memory rather than one byte/sector/whatever-the-standard-unit-is at a time?

As I type this, I'm realizing how little I know about the protocol between the CPU and the memory modules--if anyone has an accessible link on the subject, I'd be grateful.

dataflow · on April 5, 2023

That's what I referred to as "TRIM for RAM". I'm not aware of it being a thing. And I don't know the protocol, but I'm also not sure it's just a matter of protocol. It might require additional circuitry per bit of memory that would increase the cost.

mjevans · on April 5, 2023

'trim' for RAM is a virtual to physical page table hack. Memory that isn't backed by a page is just a zero, it doesn't need to be initialized. Offhand it's supposed to be before it's handed to a process, but I don't know if there are E.G. mechanisms to use some spare cycles to proactively zero non-allocated memory that's a candidate for being attached to VM space.

andrewf · on April 5, 2023

Oldie but a goodie: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

MarkSweep · on April 5, 2023

Some processors have “hardware store elimination” that makes writing all zeros a bit faster than writing other values.

https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt...

vlovich123 · on April 5, 2023

No. Memset (and bzero) aren’t HW accelerated. There is a special CPU instruction that can do it but in practice it’s faster to do it in a loop. In user space you can frequently leverage SIMD instructions to speed it up (of course those aren’t available in the kernel because it avoids saving/restoring those and FP registers on every syscall (only when you switch contexts).

What could be interesting if there were a CPU instruction to tell the RAM to do it. Then you would avoid the memory bandwidth impact of freeing the memory. But I don’t think there’s any such instruction for the CPU/memory protocol even today. Not sure why.

Arrath · on April 5, 2023

That seems wild to be honest. I know how easy it is to say "well they can just.."

But...wouldn't it be relatively trivial to have an instruction that tells the memory controller "set range from address y to x to 0" and let it handle it? Actually slamming a bunch of 0's out over the bus seems so very suboptimal.

mlyle · on April 5, 2023

> But...wouldn't it be relatively trivial to have an instruction that tells the memory controller "set range from address y to x to 0" and let it handle it?

Having the memory controller or memory module do it is complicated somewhat because it needs to be coherent with the caches, needs to obey translation, etc. If you have the memory controller do it, it doesn't save bandwidth. But, on the other hand, with a write back cache, your zeroing may never need to get stored to memory at all.

Further, if you have the module do it, the module/sdram state machine needs to get more complicated... and if you just have one module on the channel, then you don't benefit in bandwidth, either.

A DMA controller can be set up to do it... but in practice this is usually more expensive on big CPUs than just letting a CPU do it.

It's not really tying up a processor because of superscalar, hyperthreading, etc, either; modern processors have an abundance of resources and what slows things doing is things that must be done serially or resources that are most contended (like the bus to memory).

Arrath · on April 5, 2023

Thanks for the answer!

dathinab · on April 5, 2023

Through modern CPUs are explicitly build to make sure such a loop is fast.

And in some cases on some systems the DRM controller might zero the memory in some situations, in which cases you could say it was done by hardware.

pflanze · on April 5, 2023

> DRM controller

Did you mean DMA controller? Or do you have more information?

dathinab · on April 5, 2023

yes DMA, not the direct rendering manager ;=)

saagarjha · on April 6, 2023

dc zva?

dathinab · on April 5, 2023

really quick still doesn't mean it's free, especially if you always have to zero all the allocated pages even if the process might just have used part of the page.

Also the question is what is this % in relation to?

Probably that freeing get up to 5% slower, which is reasonable given that before you often could use idle time to zero many of the pages or might not have zeroed some of the pages at all (as they where never reused).

dathinab · on April 5, 2023

> don't clean up after a process exits

exactly, the only guarantee is that things are zeroed before handing them out to a different process, but there is some potential time gap between releasing memory back to the kernel and it being cleaned, a gap which can outlive the live of a process

> and you can get the contents of memory when you have kernel privileges. This is not so easy [..] as root

yes, root has much less privileges then the kernel, but often can gain kernel privileges.

But this is where e.g. lockdown mode comes in which denies the root users such privilege escalation (oversimplified, it's complicated). Main problem is that lockdown mode is not yet compatible with suspend to disk (hibernation), even through its documentation implies it is, if your have a encrypted hibernation. (This is misleading as it refers to a not yet existing feature where the kernel creates a encrypted image which is also tamper proof even if root tries to tamper. On the other hand suspend to an encrypted partition is possible in Linux, but not enough for lockdown mode to work.)

zamnos · on April 5, 2023

here, I made it clickable: http://pwn.college

gaudat · on April 5, 2023

Oh it is free too. Gotta try this.

djbusby · on April 5, 2023

The memory wipe is a kernel build-time config option.

CodesInChaos · on April 5, 2023

Do you know what the option is called?

edit: apparently the runtime option is called `init_on_alloc` and the compile-time option (which determines the default of the runtime option) is called `CONFIG_INIT_ON_FREE_DEFAULT_ON`.

jwilk · on April 5, 2023

There are two parameters:

• init_on_alloc (default set by CONFIG_INIT_ON_ALLOC_DEFAULT_ON)

• init_on_free (default set by CONFIG_INIT_ON_FREE_DEFAULT_ON)

djbusby · on April 6, 2023

I didn't even know of run time opt. Thanks!

sgt · on April 5, 2023

adamgordonbell · on April 5, 2023

Developer Experience (Maybe there is a better term, but how you use them)

Docker/ OCI containers tend to be a single process and LXC containers have a collection of processes.

But in either case they are just processes running on the host in a different namespace.

So they feel different to use, but use the same building blocks (to my understanding).

sgt · on April 5, 2023

Unrelated, but your name rang a... bell. You're the corecursive guy. Great podcast!

adamgordonbell · on April 5, 2023

Thanks!

Karellen · on April 5, 2023

Shouldn't "DX" refer to the experience of people hacking on the Docker (or whatever) code itself? Like... how easy the build system is to work with if you're adding a new source file to the docker codebase?

The people working with Docker, even if they are developers doing development work, are still users of Docker, aren't they? I mean, the GUI of an IDE is still part of its UX, right? Even though it's for developers doing development work?

adamgordonbell · on April 5, 2023

I was thinking that developer experience is user experience – where the user is a developer. You are suggesting that the user and developer are different roles because even when the user is a developer there is still the developer who builds that tool

It's possible you are right but I'm not an expert. I always think of the developer experience is the experience of developers using the tools and APIs you produce.

tomcam · on April 5, 2023

Thank you! Annoyed at myself for not thinking it through that way.

Aachen · on April 5, 2023

I found the currently accepted answer (by user10489) interesting, tying together a lot of concepts I had heard of but didn't properly understand. For example, looking in htop, chromium has six processes each using a literal terabyte of virtual memory, which is obviously greater than swap + physical RAM on my ordinary laptop. Or, why does an out-of-memory system hang instead of just telling the next process "nope, we don't have 300KB to allocate to you anymore" and either crash the program or let the program handle the failed allocation gracefully? This answer explains how this all works.

The TL;DR answer to the actual question is: processes generally don't get access to each other's memory unless there is some trust relation (like being the parent process, or being allowed to attach a debugger), and being in a container doesn't change that, the same restrictions apply and you always get zeroed-out memory from the kernel. It's when you use a different allocator that you might get nonzeroed memory from elsewhere in your own process (not a random other process).

dale_glass · on April 5, 2023

> Or, why does an out-of-memory system hang instead of just telling the next process "nope, we don't have 300KB to allocate to you anymore"

Blame UNIX for that, and the fork() system call.

It's a design quirk. fork() duplicates the process. So suppose your web browser consumes 10GB RAM out of the 16GB total on the system, and wants to run a process for anything. Like it just wants to exec something tiny, like `uname`.

1. 10GB process does fork().

2. Instantly, you have two 10GB processes

3. A microsecond later, the child calls exec(), completely destroying its own state and replacing it with a 36K binary, freeing 10GB RAM.

So there's two ways to go there:

1. You could require step 2 to be a full copy. Which means either you need more RAM, a huge chunk of which would always sit idle, or you need a lot of swap, for the same purpose.

2. We could overlook the memory usage increase and pretend that we have enough memory, and only really panic if the second process truly needs its own 10GB RAM that we don't have. That's what Linux does.

The problem with #2 is that dealing with this happens completely in the background, at times completely unpredictable to the code. The OS allocates memory when the child changes memory, like does "a=1" somewhere. A program can't handle memory allocations failures there because as far as it knows, it's not allocating anything.

So what you get is this fragile fiction that sometimes breaks and requires the kernel to kill something to maintain the system in some sort of working state.

Windows doesn't have this issue at all because it has no fork(). New processes aren't children and start from scratch, so firefox never gets another 10GB sized clone. It just starts a new, 36K sized process.

reisse · on April 5, 2023

> Blame UNIX for that, and the fork() system call.

At least that design failure of UNIX has been fixed long ago. There are posix_spawn(3) and various clone(2) flavours which allow to spawn new process without copying the old one. And a lot of memory-intensive software actually use them, so modern Linux distros can be used without memory overprovisioning.

I'd rather blame people who are still using fork(2) for anything that can consume more than 100MB of memory.

drougge · on April 5, 2023

I'm someone who likes to use fork() and then actually use both processes as they are, with shared copy-on-write memory. I'm happy to use it on things consuming much more than 100MB of memory. In fact that's where I like it the most. I'm probably a terrible person.

But what would be better? This way I can massage my data in one process, and then fork as many other processes that use this data as I like without having to serialise it to disk and and then load it again. If the data is not modified after fork it consumes much less memory (only the page tables). Usually a little is modified, consuming only a little memory extra. If all of it is modified it doesn't consume more memory than I would have otherwise (hopefully, not sure if the Linux implementation still keeps the pre-fork copy around).

(And no, not threads. They would share modifications, which I don't want. Also since I do this in python they would have terrible performance.)

reisse · on April 5, 2023

So if I got it right, you're using fork(2) as a glorified shared memory interface. If my memory is (also) right, you can allocate shared read-only mapping with shm_open(3) + mmap(2) in parent process, and open it as a private copy-on-write mapping in child processes.

jacquesm · on April 5, 2023

No, he's using fork the way it is intended.

Shared memory came much later than fork did.

Asooka · on April 5, 2023

I have used fork as a stupid simple memory arena implementation. fork(); do work in the child; only malloc, never free; exit. It is much, much heavier than a normal memory arena would be, but also much simpler to use. Plus, if you can split the work in independent batches, you can run multiple children at a time in parallel.

As with all such stupid simple mechanisms, I would not advise its use if your program spans more than one .c file and more than a thousand lines.

saagarjha · on April 6, 2023

This isn't advisable in many more contexts than that: for example, your calls to malloc can block indefinitely if locks were held at the time of fork.

panzi · on April 5, 2023

posix_spawn() is great, but Linux doesn't implement it. glibc does based on fork()+exec(). Other Unix(-like) OSes do implement posix_spawn() as system call. Also while you can use posix_spawn() in the vast majority of cases, if it doesn't cover certain process setup options that you need you still have to use fork()+exec(). But yeah, it would be good if Linux had it as a system call. It would probably help PostgreSQL.

the8472 · on April 5, 2023

glibc uses vfork+exec to implement posix_spawn, which makes it much faster than fork+exec.

panzi · on April 6, 2023

Yes, of course. Did remember its not fork(), but some other *fork() and couldn't remember the name. But just the kind of thing it does. Its also not exec(), but probably execvp() or execvpe() or something like that.

seba_dos1 · on April 7, 2023

It's important to note because vfork is exactly what makes it close to posix_spawn (it does not copy the page tables), as opposed to regular fork.

panzi · on April 7, 2023

When using vfork() what are you allowed to do? Can you even do IO redirection (piping)? The man page says:

> ... the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork() ...

But the time between fork and exec is exactly where you do a lot of setup, like IO redirection, dropping privileges, setuid(), setting a signal mask (nohup) etc. and I don't think you can do that without setting any variables. You certainly write to the stack when calling a function.

If you can't do these things you can't really use it to implement posix_spawn(). I guess it could use vfork() in the case no actions are required, but only then.

AnIdiotOnTheNet · on April 5, 2023

Can a modern distro really be used without over provisioning? Because the last time I tried it either the DE or display server hard locked immediately and I had to reboot the system.

Having this ridiculous setting as the default has basically ensured that we can never turn it off because developers expect things to work this way. They have no idea what to do if malloc errors on them. They like being able to make 1TB allocs without worrying about the consequences and just letting the kernel shoot processes in the head randomly when it all goes south. Hell, the last time this came up many swore that there was literally nothing a programmer could do in the event of OOM. Learned helplessness.

It's a goddamned mess and like many of Linux's goddamned messes not only are we still dealing with it in 2023, but every effort to do anything about it faces angry ranty backlash.

lxgr · on April 5, 2023

Almost everything in life is overprovisioned, if you think about it: Your ISP, the phone network, hospitals, bank reserves (and deposit insurance)...

What makes the approach uniquely unsuitable for memory management? The entire idea of swapping goes out of the window without overprovisioning as well, for better or worse.

AnIdiotOnTheNet · on April 5, 2023

Perhaps there is some confusion because I used "overprovision" when the appropriate term here is "overcommit", but Windows manages to work fine without unix-style overcommit. I suspect most OSs in history do not use unix's style of overcommit.

> What makes the approach uniquely unsuitable for memory management?

The fact that something like OOM killer even needs to exist. Killing random processes to free up memory you blindly promised but couldn't deliver is not a reasonable way to do things.

Edit: https://lwn.net/Articles/627725/

mrguyorama · on April 5, 2023

What an absurdly whataboutism filled response. Meanwhile Windows has been doing it the correct way for 20 years or more and never has to kill a random process just to keep functioning.

lxgr · on April 5, 2023

So you're saying the correct way to support fork() is to... not support it? This seems pretty wasteful in the majority of scenarios.

For example, it's a common pattern in many languages and frameworks to preload and fully initialize one worker process and then just fork that as often as required. The assumption there is that, while most of the memory is theoretically writable, practically, much of it is written exactly once and can then be shared across all workers. This both saves memory and the time needed to uselessly copy it for every worker instance (or alternatively to re-initialize the worker every single time, which can be costly if many of its data structures are dynamically computed and not just read from disk).

How do you do that without fork()/overprovisioning?

I'm also not sure whether "giving other examples" fits the bill of "whataboutism", as I'm not listing other examples of bad things to detract from a bad thing under discussion – I'm claiming that all of these things are (mostly) good and useful :)

dale_glass · on April 5, 2023

> How do you do that without fork()/overprovisioning?

You use threads. The part that fork() would have kept shared is still shared, the part that would have diverged is allocated inside each thread independently.

And if you find dealing with locking undesirable you can use some sort of message system, like Qt signals to minimize that.

lxgr · on April 6, 2023

> the part that would have diverged is allocated inside each thread independently

That’s exactly my criticism of that approach: It’s conceptually trickier (fork is opt-in for sharing; threads are opt-out/require explicit copying) and requires duplicating all that memory, whether threads end up ever writing to it or not.

Threads have their merits, but so do subprocesses and fork(). Why force developers to use one over the other?

dale_glass · on April 6, 2023

> Threads have their merits, but so do subprocesses and fork(). Why force developers to use one over the other?

I used to agree with you, but fork() seems to have definitely been left behind. It has too many issues.

* fork() is slow. This automatically makes it troublesome for small background tasks.

* passing data is inconvenient. You have to futz around with signals, return codes, socketpair or shared memory. It's a pain to set up. Most of what you want to send is messages, but what UNIX gives you is streams.

* Managing it is annoying. You have to deal with signals, reaping, and doing a bunch of state housekeeping to keep track of what's what. A signal handler behaves like an annoying, really horribly designed thread.

* Stuff leaks across easily. A badly designed child can feed junk into your shared filehandles by some accident.

* It's awful for libraries. If a library wants to use fork() internally that'll easily conflict with your own usage.

* It's not portable. Using fork() automatically makes your stuff UNIX only, even if otherwise nothing stops it from working on Windows.

I think the library one is a big one -- we need concurrency more than ever, but under the fork model different parts of the code that are unaware of each other will step over each other's toes.

lxgr · on April 5, 2023

Unless you're sure that you're going to write to the majority of the copy-on-write memory resulting from fork(), this seems like overkill.

Maybe there should be yet another flavor of fork() that does copy-on-write, but treats the memory as already-copied for physical memory accounting purposes? (Not sure if "copy-on-write but budget as distinct" is actually representable in Linux's or other Unixes' memory model, though.)

magicalhippo · on April 5, 2023

> Maybe there should be yet another flavor of fork() that does copy-on-write, but treats the memory as already-copied for physical memory accounting purposes?

How about a version which copies the pages but marks them read-only in the child process, except for a set of ranges passed to fork (which would be copy-on-write as now). The child process then has to change any read-only pages to copy-on-write (or similar) to modify them.

This allows the OS to double-count and hence deny fork if the range of pages passed to fork leads to out of memory situation. It also allows the OS to deny the child process changing any read-only pages if it would lead to an out of memory situation. Both of those scenarios could be gracefully handled by the processes if they wish.

It would also keep the current positive behavior of the forked process having read access to the parent memory for data structures or similar.

saagarjha · on April 6, 2023

What's the benefit here?

KMag · on April 5, 2023

Option 3: vfork() has existed for a long time. The child process temporarily borrows all of the parent's address space. The calling process is frozen until the child exits or calls a flavor of exec. Granted, it's pretty brittle and any modification of non-stack address space other than changing a variable of type pid_t is undefined behavior before exec is called. However, it gets around the disadvantages of fork() while maintaining all of the flexibility of Unix's separation of process creation (fork/vfork) and process initialization (exec*).

vfork followed immediately by exec gives you Windows-like process creation, and last I checked, despite having the overhead of a second syscall, was still faster than process creation on Windows.

jacquesm · on April 5, 2023

> 2. Instantly, you have two 10GB processes

No, that's not how it works. The process table gets duplicated and copy-on-write takes care of the pages. As long as they are identical they will be shared, there is no way that 10GB of RAM will be allocated to the forked process and that all of the data will be copied.

blitzkrieg3 · on April 5, 2023

This is the only right answer. What actually happens is you instantly have two 10G processes which share the same address space, and:

3. A microsecond later, the child calls exec(), decrementing the reference count to the memory shared with the parent[1] and faulting in a 36k binary, bringing our new total memory usage to 1,045,612KB (1,048,576K + 36K)

CoW has existed since at least 1986, when CMU developed the Mach kernel.

What GP is really talking about is overcommit, which is a feature (on by default) in Linux which allows you to ask for more memory than you have. This was famously a departure from other Unixes at the time[2], a departure that fueled confusion and countless flame wars in the early Internet.

[1] https://unix.stackexchange.com/questions/469328/fork-and-cow... [2] https://groups.google.com/g/comp.unix.solaris/c/nLWKWW2ODZo/...

senko · on April 5, 2023

This is more or less what the second 2. explains:

> 2. We could overlook the memory usage increase and pretend that we have enough memory, and only really panic if the second process truly needs its own 10GB RAM that we don't have. That's what Linux does

"pretend" → share the memory and hope most of it will be read-only or unallocated eventually; "truly needs to own" → CoW

jacquesm · on April 5, 2023

It will never happen. To begin with all of the code pages are going to be shared because they are not modified.

Besides that the bulk of the fork calls are just a preamble to starting up another process and exiting the current one. It's mostly a hack to ensure continuity for stdin/stdout/stderr and some other resources.

msm_ · on April 5, 2023

It will most likely not happen? It's absolutely possible to write a program that forks and both forks overwrite 99% of shared memory pages. It almost never happens, which is GP's point, but it's possible and the reason it's a fragile hack.

What usually happens in practice is you're almost OOM, and one of the processes running in the system writes to a page shared with another process, forcing the system to start good ol' OOM killer.

jacquesm · on April 5, 2023

99% isn't 100%.

Sorry, but no, it can't happen, you can not fork a process and end up with twice the memory requirements just because of the fork. What you can do is to simply allocate more memory than you were using before and keep writing.

The OOM killer is a nasty hack, it essentially moves the decision about what stays and what goes to a process that is making calls way above its pay grade, but overcommit and OOM go hand in hand.

blitzkrieg3 · on April 5, 2023

It does not happen using fork()/exec() as described above. For it to happen we would need to fork() and continue using old variables and data buffers in the child that we used in the parent, which is a valid but rarely used pattern.

worthless-trash · on April 5, 2023

OP is referring to what would happen in the naive implementation, not what actually happens.

reisse · on April 5, 2023

The data won't be copied, but kernel has to reserve the memory for both processes.

l33tman · on April 5, 2023

No, it doesn't. See "overcommit"

reisse · on April 5, 2023

Please read the parent comments. Overcommit is necessary exactly because kernel has to reserve memory for both processes, and overcommit allows to reserve more memory than there is physically present.

If kernel could not reserve memory for forked process, overcommit would not be necessary.

blitzkrieg3 · on April 5, 2023

This is a misconception you and parent are perpetuating. fork() existed in this problematic 2x memory implementation _way_ before overcommit, and overcommit was non-existent or disabled on Unix (which has fork()) before Linux made it the default. Today with CoW we don't even have this "reserve memory for forked process" problem, so overcommit does nothing for us with regard to fork()/exec() (to say nothing of the vfork()/clone() point others have brought up). But if you want you can still disable overcommit on linux and observe that your apps can still create new processes.

What overcommit enables is more efficient use of memory for applications that request more memory than they use (which is most of them) and more efficient use of page cache. It also pretty much guarantees an app gets memory when it asks for it, at the cost of getting oom-killed later if the system as a whole runs out.

lxgr · on April 5, 2023

I think you've got it backwards: With overcommit, there is no memory reservation. The forked processes gets an exact copy of the other's page table, but with all writable memory marked as copy-on-write instead. The kernel might well be tallying these up to some number, but nothing important happens with it.

Only without overcommit does the kernel does need to start accounting for hypothetically-writable memory before it actually is written to.

reisse · on April 5, 2023

Yes, you're right here, that was what I wanted to say but probably was not able to formulate.

enedil · on April 5, 2023

In a sense it does. Page tables would need to be copied.

jacquesm · on April 5, 2023

Yes, but they all point to the same pages. The tables take up a small fraction of the memory of the pages themselves.

enedil · on April 5, 2023

But large fraction, if all you do afterwards is an exec call. Given 8 bytes per page table entry and 4k pages, it's 1/512 memory wasted. So if your process uses 8GB, it's 16MB. Still takes noticeable time if you spawn often.

jacquesm · on April 5, 2023

I've never had the page tables be the cause of out of memory issues. Besides the fact that they are usually pre-allocated to avoid recursive page faults, but nothing would stop you from making the page tables themselves also copy-on-write during a fork.

lxgr · on April 5, 2023

Aren't page tables nested? I don't know if any OS or hardware architecture actually supports it, but I could imagine the parent-level page table being virtual and copy-on-write itself.

eru · on April 5, 2023

Keep in mind that copy-on-write makes analysing the situation a bit more complicated.

nnntriplesec · on April 5, 2023

How so?

pdpi · on April 5, 2023

CoW is a strategy where you don't actually copy memory until you write to it. So, when the 10GB process spawns a child process, that child process also has 10GB of virtual memory, but both processes are backed by the same pages. It's only when one of them writes to a page that a copy happens. When you fork+exec you never actually touch most of those pages, so you never actually pay for them.

(Obviously, that's the super-simplified version, and I don't fully understand the subtleties involved, but that's exactly what GP means: it's harder to analyse)

eru · on April 5, 2023

Thanks for writing the explanation.

To make it slightly more complicated: you don't pay for the 10 GB directly, but you still pay for setting up the metadata, and that scales with the amount of virtual memory used.

whartung · on April 5, 2023

We actually ran into this a long time ago with Solaris and Java.

Java has JSP, Java Server Pages. JSP processing translates a JSP file into Java source code, compiles it, then caches, loads, and executes the resulting class file.

Back then, the server would invoke the javac compiler through a standard fork and exec.

That’s all well and good, save when you have a large app server image sucking up the majority of the machine. As far as we could tell, it was a copy on write kind of process, it didn’t actually try to do the actual work when forking the app server. Rather it tried to do the allocation, found it didn’t have the space or swap, and just failed with a system OOM error (which differs from a Java out of memory/heap error).

As I recall adding swap was the short term fix (once we convinced the ops guy that, yes it was possible to “run out of memory”). Long term we made sure all of our JSPs were pre-compiled.

Later, this became a non issue for a variety of reasons, including being able to run the compiler natively within a running JVM.

peheje · on April 5, 2023

I find your writing style really pleasant and understandable! Much more so than the StackExchange answer. I really like the breakdown into steps, then what could happen steps and the follow ups. Where can I read more (from you?) in this style about OS and memory management?

0x0 · on April 5, 2023

You don't HAVE to use fork() + exec() in Linux as far as I know. There is for example posix_swawn() which does the double whammy combo for you.

Also, if you DO use fork() without immediately doing an exec(), and start writing all over your existing allocations... just don't?

the8472 · on April 5, 2023

posix_spawn isn't powerful enough since it only supports a limited set of process setup operations. So if you need to do some specific syscalls before exec that aren't covered by it then the fork/exec dance is still necessary. In principle one can use vfork+exec instead but that very hard to get right.

What's really needed is io_uring_spawn but that's still a WIP. https://lwn.net/Articles/908268/

l33tman · on April 5, 2023

The Windows example is a non-sequitur as in both cases, you end up with the 36K sized process both in Windows and Linux if you want to spawn a sub-process that exec's. The fork() => exec() path is not the issue (if there is an issue at all here), and if you use threading the memory is not forked like this to start with (on either of the OSes).

I guess the case you want to highlight is more if you for example mmap() 10 GB of RAM on that 16 GB machine that only has 5 GB unused swap space left and where all of the physical RAM is filled with dirty pages already. Should the mmap() succeed, and then the process is killed if it eventually tries to use more pages than will fit in RAM or the backing swap? This is the overcommit option which is selectable on Linux. I think the defaults seem pretty good and accept that a process can get killed long after the "explicit" memory mapping call is done.

josefx · on April 5, 2023

> blame UNIX for that, and the fork() system call.

Given that most code I have seen would not be able to handle an allocation failure gracefully I wouldn't call it "blame", if the OS just silently failed memory allocations on whatever program tried to allocate next you would basically end up with a system where random applications crash, which is similar to what the OOM killer does, just with no attempt to be smart about it. Even better, it is outright impossible to gracefully handle allocation failures in some languages, see for example variable length arrays in C.

Arnavion · on April 5, 2023

No code is written to handle allocation failure because it knows that it's running on an OS with overcommit where handling allocation failure is impossible. Overcommit means that you encounter the problem not when you call `malloc()` but when you do `*pointer = value;`, which is impossible to handle.

kevin_thibedeau · on April 5, 2023

Plenty of code runs on systems without that behavior. Graceful handling of malloc failure is still useful.

Arnavion · on April 5, 2023

I know. I myself write code that checks the result of malloc. I was responding with josefx's words.

tjoff · on April 5, 2023

That is a very weak argument.

Also, why would you bother to handle it gracefully when the OS won't allow you to do it?

Also, outright impossible in some languages? Just don't use VLAs if then? "Problem" solved.

josefx · on April 5, 2023

> Also, why would you bother to handle it gracefully when the OS won't allow you to do it?

There are many situations where you can get an allocation failure even with over provisioning enabled.

> Just don't use VLAs if then? "Problem" solved.

Yes, just don't use that language feature that is visually identical to a normal array. Then make sure that your standard library implementation doesn't have random malloc calls hidden in functions that cannot communicate an error and abort instead https://www.thingsquare.com/blog/articles/rand-may-call-mall.... Then ensure that your dependencies follow the same standards of handling allocation failures ... .

I concede that it might be possible, but you are working against an ecosystem that is actively trying to sabotage you.

tjoff · on April 5, 2023

VLAs are barely used and frowned upon by most. It is not relevant enough to discuss.

Yes, mallocs in standard library is a problem. But this is rather the result of a mindset where over provision exist than anything else.

RcouF1uZ4gsC · on April 5, 2023

In retrospect, fork was big mistake. It complicates a bunch of other things and introduces suboptimal patterns.

t0suj4 · on April 5, 2023

A small reminder: In the age of Unix multiuser systems were very common. Fork was the optimal solution to be able to serve as much concurrent users or programs as possible while keeping the implementation simple.

Today's RAM is cheap.

moonchrome · on April 5, 2023

So much of design constraints in our base abstractions are not relevant today but we're still cobbling together solutions built on legacy technical decisions.

progbits · on April 5, 2023

If you would like to go deeper down the rabbit hole, fasterthanlime recently made a few videos that explore this topic:

https://youtu.be/YB6LTaGRQJg https://youtu.be/c_5Jy_AVDaM https://youtu.be/DpnXaNkM9_M

The code is in Rust but that doesn't matter for the explanation.

saagarjha · on April 6, 2023

> For example, looking in htop, chromium has six processes each using a literal terabyte of virtual memory, which is obviously greater than swap + physical RAM on my ordinary laptop.

Generally browsers allocate large swaths of address space as guard pages (they're unmapped, but accesses into them will trap). Or they'll double map pages, so that the same physical page shows up multiple times in the virtual address space.

jeffbee · on April 5, 2023

In Linux, a task that tries to allocate more than the allowed amount will be immediately terminated and not notified about anything. The reason out-of-memory systems thrash is merely because people do not always bother setting the limits, and the default assumes users prefer a process that continues through hours of thrashing instead of being promptly terminated.

Sakos · on April 5, 2023

Who is "people" in this? If you mean your average user, it's unrealistic to expect them to know or care about details like this. This is the kind of thing that needs sane defaults.

skyeto · on April 5, 2023

What I find to be a rather interesting tidbit related to this is that some applications (e.g. certain garbage collectors) map multiple ranges of virtual memory that address the same physical memory.

GrumpySloth · on April 5, 2023

That’s how you can create a gapless ring buffer: <https://learn.microsoft.com/en-us/windows/win32/api/memoryap...> (see scenario 1 in examples).

hansendc · on April 5, 2023

Uh... Did I miss the patches that add a pre-zeroed page pool to Linux? Wouldn't be the first time I missed something like that getting added, but 6.3-rc5 definitely zeroes _some_ pages at allocation time, and I don't see any indiciation of it consulting a prezeroed page pool: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

gregw2 · on April 5, 2023

It’s not clear to me why the top rated reply which seemed extremely knowledgeable in discussing Linux kernel auto zeroing of memory mentioned a bunch of caveats including mentioning “some memory management libraries may reuse memory” but didn’t say “including glibc”.

That’s a pretty big library to gloss over from a security or working programmer perspective!

(Corrections to perspective welcome.)

IcePic · on April 5, 2023

Even if your libc IS one of the offenders, this libc code runs in a single process, so even though libc does or does not use evil tricks for your process, it doesn't mean you will get pages from some other process memory map, only that you may get your own old pages back.

gregw2 · on April 5, 2023

Thanks, makes sense. (Looks like the poster commenting about glibc has corrected themself.)

hn92726819 · on April 5, 2023

I thought the shared memory they were referring to would be read only memory or something. So yes, the memory is shared, but your process doesn't have any access to overwrite any of it. Likewise, when you call library functions, when it mallocs something, it uses memory that your process has write access to, not the shared library memory.

In the context of the question, I assume the asker was mostly interested in reading some kind of sensitive data from a previous process, not reading the same librsry-code-only memory or something.

Note: all of this could be wrong, it was just my understanding

Edit: looks like this answers that: https://stackoverflow.com/questions/20857134/memory-write-pr...

pmontra · on April 5, 2023

I thought that everybody knows how the CPU and the kernel manage memory. Everything else follows. However as a very good developer told me once, "I don't know how networking works, only that I have to type a URL." "How is that possible?" I replied, "They teach it at a school." And he told me "yes, but I studied graphic design, then discovered that I like programming."

Replace and repeat with SQL, machine language, everything below modern programming languages.

coppsilgold · on April 5, 2023

It is if you tell the kernel.

zero memory on free (3-5% average system performance impact, due to touching cold memory)

  init_on_free=1

zero memory on alloc (<1% average system performance impact)

  init_on_alloc=1

CodesInChaos · on April 5, 2023

Is GPU memory wiped before it's handed to a new process, and is a process prevented from accessing GPU memory of a different process? (assuming low level APIs like Vulkan)

johntb86 · on April 5, 2023

By the Vulkan spec it is guaranteed that memory from one process can't be seen by another. https://registry.khronos.org/vulkan/specs/1.3-extensions/htm... says: In particular, any guarantees made by an operating system about whether memory from one process can be visible to another process or not must not be violated by a Vulkan implementation for any memory allocation.

In theory you could have some sort of complex per-process scrambling system to avoid leaking information, but I think implementations actually just zero the memory.

GPU drivers on different operating systems can be more or less buggy; Windows and Linux generally seem to do the right thing, but MacOS is a bit more haphazard.

NotCamelCase · on April 5, 2023

You can request Windows to zero out every GPU memory allocation, but it's more of a driver thing that's not exposed in APIs. Such option is likely to be off by default in drivers as it might induce additional, unintended overhead. In practice, you are likely to see memory cleared more often than not due to other reasons, though.

You can't just peek another process' GPU memory thru UMD app, either. Per-process virtual memory mechanisms similar to CPUs are also present in GPUs, which is the whole reason that resources are explicitly imported/exported across APIs via special API calls.

saagarjha · on April 6, 2023

In theory yes. In practice, there have been bugs here.

tryauuum · on April 5, 2023

this is certainly more interesting question than OPs

bluedino · on April 5, 2023

Is the confusion coming from when you malloc() memory in C you aren't getting zereoed memory?

dale_glass · on April 5, 2023

I tried on Linux, and got all zeroes.

Of course that's for a trivial program. If you freed something, that probably wasn't returned to the OS, and the next malloc might just recycle it.

bluedino · on April 5, 2023

Malloc a bunch of memory, dirty it, free it, then malloc some more. And then check.

Your first request in your program, you'll get clean memory from the OS.

dale_glass · on April 5, 2023

Yeah, that's what I'm saying.

worthless-trash · on April 5, 2023

One day, the general technical population will understand containers are just processes on a system. No magic under the hood.

__turbobrew__ · on April 6, 2023

Hey, if people keep thinking it is magic I keep getting paid to be a wizard

jcarrano · on April 5, 2023

I'm curious, when did OSes first begin zeroing out pages. Did the first operating systems already zero the memory?

saagarjha · on April 6, 2023

OSes that have paging (and generally memory protection) have been zeroing them out from very early on, because part of the reason you have this is to avoid processes being able to mess with or spy on each other.

ritcgab · on April 5, 2023

I think the main insight is that user-level memory allocation does not necessarily involve the kernel. If the application uses `malloc` to get memory allocated, the real "application" that will (possibly) request free pages from the kernel is the memory allocator. If the space requested can be served by the pages already assigned to the allocator, it can just hand them out. So when you `malloc`, it is not zeroed, unless you know it is freshly from kernel (but you don't know the internal state of the allocator). That is why one needs to manually zero out space returned from `malloc`, or use `calloc` instead.

jk_i_am_a_robot · on April 5, 2023

If I could answer it would be a solid "maybe". If you're willing to consider the Translation Lookaside Buffer as part of the contents of RAM then, if the claims are to be believed, SPECTRE and Meltdown could read the contents of other processes' (and containers) RAM.