One exercise in pwn.college's kernel security module has a program that forks; the child opens the flag file (/flag, owned by root), reads its contents, and then exits.
The program continues to run and lets you load shellcode that will be run by the kernel.
Your task is basically to write some shellcode to scan memory for the flag. So now I know that at least some Linuxes don't clean up after a process exits, and you can get the contents of memory when you have kernel privileges. This is not so easy if you're scanning memory as root, since /dev/mem or whatever won't reveal that process's memory.
By default, memory is wiped before it's handed out again, not when it's freed. This improves performance, but it means secrets can remain in RAM for longer than necessary, where they can be accessed by privileged attackers (software running as root, DMA on a system without an IOMMU, or hardware attacks). To unprivileged processes, eager and lazy zeroing look the same.
Apparently there's a kernel config flag to zero memory on free (CONFIG_INIT_ON_FREE_DEFAULT_ON), but it has a fairly expensive performance cost (3-5% according to the docs). I wonder in what kind of scenario it would make sense to enable it.
You want to enable this if you're concerned about forensic attacks. A simple example: someone has physical access to your device. They're able to power it down and boot it with their own custom kernel. If the memory has not been eagerly zeroed, they may be able to extract sensitive data from RAM.
This flag puts an additional obstacle in the attacker's path. If you have private key material protecting valuable property, you definitely want to throw up as many roadblocks as possible.
Yes it would - either when memory is released back to the kernel (e.g. via munmap) or on process exit. This is a defense-in-depth strategy and not 100% perfect. If you yank the power cord while a long-lived process has sensitive data in memory, you're still vulnerable. But with a clean power-down, or very short lifetimes for sensitive data in RAM, it would afford you additional security.
I think it's a little bit of column A and a little bit of column B, but I admit that while I remember reading about this technique being used a long time ago, I'm not sure of the history of the nomenclature. From StackExchange:
> For those who think this is only theoretical: They were able to use this technique to create a bootable USB device which could determine someone's Truecrypt hard-drive encryption key automatically, just by plugging it in and restarting the computer. They were also able to recover the memory-contents 30 minutes+ later by freezing the ram (using a simple bottle of canned-air) and removing it. Using liquid nitrogen increased this time to hours.
The idea is that well-written software releases memory as soon as possible, so with this enabled you'd have the secret in memory for as little time as possible.
Though in my mind, well-written software that held sensitive data should be zeroing the memory out before freeing it.
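For example, a minimal sketch in C (assuming glibc 2.25+ or a BSD libc that provides explicit_bzero; the handle_secret function is made up for illustration):

    #include <stdlib.h>
    #include <string.h>

    void handle_secret(const char *secret, size_t len) {
        char *buf = malloc(len);
        if (!buf)
            return;
        memcpy(buf, secret, len);

        /* ... use the secret ... */

        /* wipe the buffer before giving the memory back */
        explicit_bzero(buf, len);
        free(buf);
    }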
Yes, I thought that was common practice - I remember reading patch notes for something years ago where the program had been updated to always zero the password buffer after checking that it matched (I think in some cases it had been kept around for a bit).
From a defense-in-depth perspective you definitely want the implementation to be robust (zero the memory after reading it). However, you should also consider:
1) abnormal program termination due to signals, memory pressure/the OOM killer, aborts in other threads serving different requests, and so on. These events can race with the memory zeroing.
2) bugs in the implementation where memory isn't zeroed in all paths
3) interactions between compilers, standard libraries, language runtimes and optimization passes causing memory zeroing to be skipped (see the sketch after this list).
All of these cases have happened time and time again in the wild, hence having additional safety nets is useful.
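To make point 3 concrete, here's a minimal sketch of the classic dead-store problem (check_password and matches are hypothetical; a typical optimizing C compiler is assumed). Because buf is never read after the memset, the compiler is allowed to drop that call entirely at -O2, which is exactly why explicit_bzero/memset_s exist:

    #include <string.h>

    int check_password(const char *input, int (*matches)(const char *)) {
        char buf[64];
        strncpy(buf, input, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        int ok = matches(buf);

        /* dead store: may be optimized away, leaving the password
           on the stack until something else overwrites it */
        memset(buf, 0, sizeof(buf));
        return ok;
    }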
These patches were endorsed by folks working on ChromeOS and Android security. I would suppose they want them as additional safeguards for full-disk-encryption keys, and may also be concerned about quality-of-implementation issues in third-party or vendor blobs.
If your PC is connected to a power strip, it's my understanding that law enforcement can attach a live male-to-male power cable to the power strip and then unplug the strip from the wall while still powering the computer. That, and yeah, freezing the RAM.
So technically that's removing power, not "powering it down". I guess you'd then warm-boot with your own kernel and hope that the contents of RAM are mostly untouched?
So, if it's only 3-5% slower, then for $50-100 I could buy a slightly faster processor and never know the difference?
Just trying to check my understanding of what the 3-5% delta is. Seems like a tiny tradeoff for any workstation (I wouldn't notice the difference at least). The tradeoff for servers might vary depending on what they are doing (shared versus owned, etc)
This seems beneficial in systems where security concerns trump performance concerns. The above poster has probably made many such trade-offs already and would likely make more. (Full-disk encryption, virtualization, protection rings, Spectre mitigations, IOMMU, ECC, etc.)
Given ever-increasing processor performance, it does make sense for workstations where physical access should be considered part of the threat model.
> I don't understand why it is slower. It has to be zeroed anyway.
Memory pages freed from userspace might be reused in kernelspace.
If, for instance, the memory is re-used in the kernel's page cache, then the kernel doesn't need to zero it out before copying the to-be-cached data into the page.
Edit: I seem to remember that back in the 1990s the kernel, at least in some cases, wouldn't zero out pages previously used by the kernel before giving them to userspace, sometimes resulting in kernel secrets being leaked to arbitrary userspace processes. Maybe I'm misremembering, and it was just leakage of secrets between userspace processes. In any case, in the 1990s, Linux was way too lax about leaking data from freed pages.
And if the system isn't idle but also doesn't use all of physical memory, pages might not be zeroed for a very long time.
> Is it not zeroed if the memory is assigned to the same process???
I don't know what the current state of this is in Linux, but at least in the past, for some systems and some use cases related to mapped memory, this was the case.
As far as I know, the Linux kernel never inspects the userspace thread to adjust behavior based on what the thread is going to do next. This would be a very brittle sort of optimization.
More importantly, it's not safe. Another thread in the same process can see ptr between the malloc and the memcpy!
Edit: also, of course, malloc and memcpy are C runtime functions, not syscalls, so checking what happens after malloc() would require the kernel to do much more sophisticated analysis than just looking a few instructions ahead of the calling thread's %eip/%rip. While handling malloc()'s mmap() or brk() allocation, the kernel would need to be able to look one or two call frames up the stack, past the metadata accounting that malloc does to keep track of the newly acquired memory, perhaps look at a few conditional branches, trace through the GOT and PLT entries to see where the memcpy call is actually going, and do so in a way that is robust to changes in the C runtime implementation. (Of course, in practice, most C compilers will inline a memcpy implementation, so in the common case it wouldn't have to chase the GOT and PLT entries, but even then, it's way too complicated for the kernel to figure out whether anything non-trivial is happening between mmap()/brk() and the memory being overwritten.)
Edit 2: To be robust in the completely general case, even if it were trivial to identify the inlined memcpy implementation, and "something non-trivial happens" were clearly defined, determining whether "something non-trivial happens" between mmap()/brk() and memcpy() would involve solving the halting problem. (Impossible in the general case.)
malloc() gives you a 'reservation': the memory isn't actually paged in yet. Only once it's touched/updated does it get paged in.
A copy _might_ not even become a real copy if the kernel is smart enough / able to set up a hardware trigger that forces a copy on writes to that area (copy-on-write), at which point the physical memory backing the two distinct logical memory regions gets duplicated and only then diverges.
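A rough way to see the reservation behaviour on Linux (a sketch, using mmap directly to sidestep allocator details; exact numbers vary by kernel and libc): the peak resident set size only grows once the pages are actually written.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    static long peak_rss_kb(void) {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_maxrss;   /* peak resident set size, in kB on Linux */
    }

    int main(void) {
        size_t len = 256UL * 1024 * 1024;   /* 256 MiB "reservation" */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        printf("after mmap:  ~%ld kB peak RSS\n", peak_rss_kb());

        memset(p, 1, len);                  /* fault every page in */
        printf("after touch: ~%ld kB peak RSS\n", peak_rss_kb());

        munmap(p, len);
        return 0;
    }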
That's a good point that Linux doesn't actually allocate the pages until they're faulted in by a read or write. So, if it were doing some kind of thread inspection optimization, it would presumably just need to check if the faulting thread is currently in a loop that will overwrite at least the full page.
However, that wouldn't solve the problem of other threads in the same process being able to see the page before it's fully overwritten, or debugging processes, or using a signal handler to invisibly jump out of the initialization loop in the middle, etc. There are workarounds to all of these issues, but they all have performance and complexity costs.
malloc gets memory from the heap, which may or may not be paged in or reused. That means you may get reused memory back from the heap (which is up to the C runtime).
If you want to make sure it is zero, you will want calloc. If you know you are going to copy something in on the next step, like in your example, you can probably skip calloc and just use malloc. calloc is nice when you are doing things like linked lists/trees/buffers and do not want extra steps to clean out the pointers or data.
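For instance (a trivial sketch; the node struct is made up):

    #include <stdlib.h>

    struct node {
        int value;
        struct node *next;
    };

    struct node *make_node_calloc(int value) {
        /* calloc hands back zeroed memory, so next is already NULL */
        struct node *n = calloc(1, sizeof *n);
        if (n)
            n->value = value;
        return n;
    }

    struct node *make_node_malloc(int value) {
        /* malloc'd contents are indeterminate, so every field
           has to be initialized explicitly */
        struct node *n = malloc(sizeof *n);
        if (n) {
            n->value = value;
            n->next = NULL;
        }
        return n;
    }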
Just a guess but since apps can fail to free memory correctly, you probably have to zero it on allocation and deallocation (to be secure) when you enable the feature. So you aren't swapping one for the other; you are now doing both.
> Just a guess but since apps can fail to free memory correctly
That's not relevant here; from the perspective of the kernel pages are either assigned to a process, or they're not. If an application fails to free memory correctly, that only means it'll keep having pages assigned to it that it no longer uses, but eventually those pages will always be released (by the kernel upon termination of the process, in the worst case).
That is the worst case if the process had leaked that part of the heap, but it is the optimal case on process exit. On an OS with any kind of process isolation, walking over most of the heap before exiting just to "correctly free it" is a pure waste of CPU cycles and, in the worst case, even of I/O bandwidth (when it causes parts of the heap to be paged in).
Paging the pages in can be avoided entirely if the intention is just to zero them. The kernel could either just "forget" them, or use copy-on-write with a properly zeroed-out page as a base.
The point is that you do not want to do any kind of heap cleanup before exit. The intention isn't to zero the pages, but to outright discard all of them (which is going to be done by the kernel anyway).
I'd like the ability to control this at a per-process or even per-allocation level (i.e. as a flag on mmap). That way a password manager could enable it, while a game could disable it.
Do you mean init_on_alloc=1 and init_on_free=1? Here [1] is a thread on the options and their performance impact. FWIW I use them on all my workstations, but these days I'm not doing anything that would be greatly impacted by them. I've never tried them on a gaming machine and never tried them on a large-memory hypervisor.
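For reference, a rough sketch of enabling it (the exact GRUB workflow varies by distro; the boot parameters and Kconfig options below are documented in the kernel's admin guide):

    # kernel command line, e.g. appended to GRUB_CMDLINE_LINUX in /etc/default/grub
    init_on_alloc=1 init_on_free=1

    # or bake the defaults into a custom kernel build (.config)
    CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
    CONFIG_INIT_ON_FREE_DEFAULT_ON=y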
I wish there were flags similar to this for GPU memory. Even something that zeroes GPU memory on reboot would be nice. After a reboot I can always see the previous desktop for a brief moment.
When running code that is security-sensitive but not performance-sensitive, even protections that add up to much higher performance penalties can be very acceptable.
E.g. on a crypto key server. Less so if it's a server that encrypts data en masse, but think of one that signs longer-lived auth tokens, or one that holds intermediate certificates which are used once every few hours to create a certificate that is then used to encrypt/sign data en masse on a different server, etc.
"Real quick" is human speak. For large amounts of memory it's still bound by RAM speed for a machine, which is much lower (a couple orders of magnitude I believe) than, say, cache speed. Things might be different if there was a RAM equivalent of SSD TRIM (making the RAM module zero itself without transferring lots of zeros across the bus), but there isn't.
I'm completely unfamiliar with how the CPU communicates with the memory modules, but is there not a way for the CPU to tell the memory modules to zero out a whole range of memory rather than one byte/sector/whatever-the-standard-unit-is at a time?
As I type this, I'm realizing how little I know about the protocol between the CPU and the memory modules--if anyone has an accessible link on the subject, I'd be grateful.
That's what I referred to as "TRIM for RAM". I'm not aware of it being a thing. And I don't know the protocol, but I'm also not sure it's just a matter of protocol. It might require additional circuitry per bit of memory that would increase the cost.
'TRIM' for RAM is a virtual-to-physical page table hack. Memory that isn't backed by a page just reads as zero; it doesn't need to be initialized. Offhand, it's supposed to be zeroed before it's handed to a process, but I don't know if there are e.g. mechanisms that use spare cycles to proactively zero non-allocated memory that's a candidate for being attached to a VM space.
No. memset (and bzero) aren't HW-accelerated. There is a special CPU instruction that can do it, but in practice it's often faster to do it in a loop. In user space you can frequently leverage SIMD instructions to speed it up (those aren't available in the kernel, because it avoids saving/restoring the SIMD and FP registers on every syscall; they're only saved when switching contexts).
What could be interesting is a CPU instruction to tell the RAM itself to do it. Then you would avoid the memory-bandwidth impact of freeing the memory. But I don't think any such instruction exists in the CPU/memory protocol even today. Not sure why.
That seems wild to be honest. I know how easy it is to say "well they can just.."
But...wouldn't it be relatively trivial to have an instruction that tells the memory controller "set range from address y to x to 0" and let it handle it? Actually slamming a bunch of 0's out over the bus seems so very suboptimal.
> But...wouldn't it be relatively trivial to have an instruction that tells the memory controller "set range from address y to x to 0" and let it handle it?
Having the memory controller or memory module do it is complicated somewhat because it needs to be coherent with the caches, needs to obey translation, etc. If you have the memory controller do it, it doesn't save bandwidth. But, on the other hand, with a write back cache, your zeroing may never need to get stored to memory at all.
Further, if you have the module do it, the module/sdram state machine needs to get more complicated... and if you just have one module on the channel, then you don't benefit in bandwidth, either.
A DMA controller can be set up to do it... but in practice this is usually more expensive on big CPUs than just letting a CPU do it.
It's not really tying up a processor, because of superscalar execution, hyperthreading, etc., either; modern processors have an abundance of resources, and what slows things down is work that must be done serially, or resources that are most contended (like the bus to memory).
Really quick still doesn't mean it's free, especially if you always have to zero all the allocated pages even when the process may have used only part of a page.
Also, the question is: what is this percentage in relation to?
Probably that freeing gets up to 5% slower, which is reasonable given that before, you could often use idle time to zero many of the pages, or might not have zeroed some of the pages at all (as they were never reused).
Exactly. The only guarantee is that things are zeroed before being handed out to a different process, but there is a potential time gap between releasing memory back to the kernel and it being cleaned, a gap which can outlive the life of a process.
> and you can get the contents of memory when you have kernel privileges. This is not so easy [..] as root
Yes, root has far fewer privileges than the kernel, but it can often gain kernel privileges.
But this is where e.g. lockdown mode comes in, which denies the root user such privilege escalation (oversimplified; it's complicated). The main problem is that lockdown mode is not yet compatible with suspend-to-disk (hibernation), even though its documentation implies it is if you have encrypted hibernation. (This is misleading, as it refers to a not-yet-existing feature where the kernel creates an encrypted image that is also tamper-proof even against root. Suspending to an encrypted partition, on the other hand, is possible in Linux, but that isn't enough for lockdown mode to allow it.)
edit: apparently the runtime option is called `init_on_free` and the compile-time option (which determines the default of the runtime option) is called `CONFIG_INIT_ON_FREE_DEFAULT_ON`.