New CPU Feature ERAPS Reduces Software Tax for Hardware Bugs

01 Nov 2024

With the 5th Gen AMD EPYC CPU, formerly codenamed “Turin,” AMD introduced a new processor feature: ERAPS (Enhanced RAP Security). On Linux, this feature allows relaxing the software mitigations that were put in place after the SpectreRSB hardware vulnerability was disclosed in 2018.

On CPUs with this feature, the hardware itself protects software against certain speculative attacks, so the corresponding software mitigations are no longer required.

This article goes into which software mitigations are not required anymore when using processors with this feature. But before that, let’s catch up on the CPU vulnerabilities and the software mitigations applied to protect against them. Note that the APM release will contain official documentation. This writeup is just my notes - admittedly with a bunch of mistakes that I’m slowly correcting as the patches go through the upstreaming process. I’ll continue clarifying language and making corrections.

Background

Abbreviations

RAP: Return Address Predictor

RSB: Return Stack Buffer

RAS: Return Address Stack

The three acronyms – RAP, RSB, RAS – are used to reference the same store in the CPU. AMD manuals use RAS and RAP. Linux uses RSB. For consistency, I use RSB in this post.

The RSB is not directly used by software. It is a buffer inside the CPU that speeds up speculative execution. Whenever a CALL instruction is executed, the address of the instruction following the CALL is pushed onto the RSB by microcode. The addresses on the RSB are then used to predict return targets and to speculate past RET instructions.
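To make the push-on-CALL, pop-on-RET behaviour concrete, here is a minimal sketch of an RSB as a bounded stack. It is a toy model, not a description of the real hardware: the class name, the method names, and the choice to drop the oldest entry when the buffer is full are all assumptions made for illustration.

```python
class ReturnStackBuffer:
    """Toy model of an RSB: a bounded stack of predicted return addresses."""

    def __init__(self, size=32):
        self.size = size
        self.entries = []

    def on_call(self, return_address):
        # A CALL pushes the address of the instruction that follows it.
        if len(self.entries) == self.size:
            self.entries.pop(0)  # assumption: the oldest entry is lost when full
        self.entries.append(return_address)

    def predict_return(self):
        # A RET consumes the newest entry as its predicted target.
        # None models "the RSB has no prediction to offer".
        return self.entries.pop() if self.entries else None


rsb = ReturnStackBuffer()
rsb.on_call(0x1005)            # outer function is called
rsb.on_call(0x2009)            # nested call inside it
inner = rsb.predict_return()   # prediction for the nested RET
outer = rsb.predict_return()   # prediction for the outer RET
print(hex(inner), hex(outer))
```

Predictions come back in the reverse order of the calls, which is what makes the structure useful for predicting RET targets.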

The SpectreRSB Vulnerability

The Spectre v2 SpectreRSB vulnerability exploited the addresses in the RSB. One userspace process could issue a bunch of CALL instructions, then yield control of the CPU and allow another process to be scheduled in. Any speculative operations on the entries in the RSB from the other userspace process would then use the addresses stuffed by the first malicious process. With this malicious technique, user->user and user->kernel attack vectors are possible. When running virtual machines, a similar technique makes it possible for guest->guest, guest->user, and guest->hypervisor attacks to be carried out.
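The poisoning scenario can be illustrated with the same kind of toy model. This is a sketch under simplifying assumptions: a single shared list stands in for the per-core RSB, and `ATTACKER_GADGET` is a made-up address. Nothing here reflects how a real exploit is written; it only shows why stale entries are a problem.

```python
RSB_SIZE = 32
rsb = []  # one RSB per core, shared by whatever happens to run on it


def call(return_address):
    rsb.append(return_address)
    del rsb[:-RSB_SIZE]  # keep only the newest RSB_SIZE entries


def predicted_return():
    return rsb.pop() if rsb else None


ATTACKER_GADGET = 0xBAD000  # hypothetical address chosen by the attacker

# The attacker process issues a bunch of CALLs, then yields the CPU.
for _ in range(RSB_SIZE):
    call(ATTACKER_GADGET)

# The victim is scheduled in with no mitigation applied. Its next RET is
# predicted from whatever the previous process left behind.
victim_prediction = predicted_return()
print(hex(victim_prediction))
```

The victim never called anything at that address, yet the prediction for its RET points there. That stale, attacker-chosen entry is what the mitigations in the next section remove.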

Software Mitigations

As a result of the SpectreRSB disclosure, Linux added code to mitigate RSB poisoning scenarios – e.g., a guest placing malicious entries in the RSB, and then hardware using those entries in hypervisor or host or other guest contexts.

Software-based mitigations can take two forms:

RSB flushing: clearing the contents of the RSB
RSB stuffing: populating RSB entries via bogus CALL instructions

Both RSB flushing and RSB stuffing address RSB poisoning scenarios. RSB stuffing additionally addresses RSB underflow vectors (not applicable for AMD CPUs), whereas RSB flushing does not. Since RSB stuffing covers everything RSB flushing does and more, it is currently the only method in use in the Linux kernel.

RSB stuffing means the kernel executes 32 (a hard-coded value) CALL instructions back to back. The instruction following each CALL (i.e., the target a RET would speculatively return to) causes a trap. Any speculation based on these RSB entries is hence benign.
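The effect of stuffing can be sketched in a few lines. This is a model of the idea only, not the kernel's implementation (which is written in assembly): `BENIGN_TRAP` is a made-up address standing in for the trapping instruction, and the list stands in for a 32-entry RSB.

```python
RSB_SIZE = 32          # the hard-coded stuffing count
BENIGN_TRAP = 0x5AFE   # hypothetical address of an instruction that traps


def stuff_rsb(rsb, count=RSB_SIZE, capacity=RSB_SIZE):
    """Model `count` CALLs whose return target is a trapping instruction."""
    for _ in range(count):
        rsb.append(BENIGN_TRAP)
        del rsb[:-capacity]  # the RSB holds at most `capacity` entries
    return rsb


poisoned = [0xBAD000] * RSB_SIZE   # state left behind by a malicious process
stuffed = stuff_rsb(poisoned)
print(all(entry == BENIGN_TRAP for entry in stuffed))
```

After 32 benign pushes into a 32-entry buffer, every attacker-controlled entry has been displaced, so any later speculation lands on the trap.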

The Linux mitigation for these RSB vulnerabilities is to stuff the RSB for these events:

Context switches
VMEXITs

Another form of mitigation, with hardware assistance, is the Indirect Branch Prediction Barrier (IBPB). IBPB also clears the RSB, but it is more expensive than stuffing the RSB and is not performed on every context switch. The focus of this article and patchset is the RSB stuffing performed on context switches and VMEXITs.

The New ERAPS Feature

The Enhanced Return Address Predictor Security (ERAPS) feature debuted in the newly-released 5th Gen AMD EPYC processors. This feature addresses the RSB poisoning attack scenario in hardware, making software mitigations unnecessary.

There are several enhancements in the hardware for this feature:

Auto-flush the RSB
- Flush the RSB on INVPCID, on all context switches (triggered by writing to CR3), and even on some writes to CR4. The RSB is flushed even if those operations cause only a partial TLB flush.
- Caveat 1: If context switches do not invoke any of the hardware TLB invalidation instructions or update CR3, as is the case when running in a virtual machine with nested paging (NPT) disabled, the RSB will not be cleared automatically.
  - More notes about caveat 1 below.
Tag addresses in the RSB as host or guest, and prevent cross-predictions
- This ensures a malicious guest cannot influence the speculations made in host/hypervisor context
- Caveat 2: the RSB entries are not tagged with the ASID of the guest, and hence cannot differentiate between a guest and its nested guests
  - This caveat only affects L1 guests exposed to malicious L2 guests in nested VM configurations. To address this situation, this feature provides the hypervisor with a VMCB bit that instructs the hardware to flush the RSB on the next VMRUN when switching from an L2 guest to an L1 guest, ensuring the L1 guest is protected against L2 RSB poisoning attacks.
  - This caveat does not apply in non-nested scenarios, and the hypervisor does not need to set the VMCB bit in that case
  - More notes about caveat 2 below.
Increase the size of the RSB to a new default (64 entries in 5th Gen EPYC CPUs, as opposed to the old 32)
- Increase is automatic for bare metal
- Software can probe the value of the ERAPS CPUID bit to confirm its presence
- Additional CPUID bits indicate the size of the RSB
- Hypervisor sets a new VMCB bit added with this feature to allow the larger RSB size to be usable by the guest
  - Otherwise, virtual machines default to older 32 entries
  - This is to preserve backwards compatibility with existing hypervisors
  - To address caveat 2 above, the hypervisor also needs to set another VMCB bit to clear the RSB before a qualifying VMRUN if this feature has been advertised to the guest.
This new default size of the RSB can change in future CPU generations without needing any software changes
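The first two enhancements, auto-flushing and host/guest tagging, can be sketched as a toy model. Everything in it is an assumption made for illustration: the class and method names are invented, and the choice to return no prediction on a tag mismatch is only one way to model "prevent cross-predictions". The official documentation defines the real behaviour.

```python
class ErapsRsb:
    """Toy model: entries carry a host/guest tag; CR3 writes flush the RSB."""

    def __init__(self, size=64):
        self.size = size
        self.entries = []  # list of (tag, return_address) pairs

    def on_call(self, mode, return_address):
        self.entries.append((mode, return_address))
        del self.entries[:-self.size]

    def on_cr3_write(self):
        # Context switch: the hardware flushes the RSB without software help.
        self.entries.clear()

    def predict_return(self, mode):
        # Assumption: an entry is only used if its tag matches the current mode.
        if self.entries and self.entries[-1][0] == mode:
            return self.entries.pop()[1]
        return None


rsb = ErapsRsb()
rsb.on_call("guest", 0xBAD000)                # guest plants an entry
host_prediction = rsb.predict_return("host")  # tag mismatch: not used
guest_prediction = rsb.predict_return("guest")

rsb.on_call("host", 0x1000)                   # process A makes a call
rsb.on_cr3_write()                            # switch to process B
after_switch = rsb.predict_return("host")     # nothing left to mispredict from
print(host_prediction, hex(guest_prediction), after_switch)
```

In this model neither a guest-tagged entry nor an entry from the previous process can steer a prediction, which is the property the software stuffing used to provide.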

For bare metal kernel and bare metal hypervisor hosts, the increase in the RSB size is automatic: no software support is necessary. The hardware will always use the larger RSB and auto-flush it, irrespective of software configuration or mitigations in place.

On the other hand, software enlightenment for this feature helps remove the software mitigations in favor of the hardware mitigations – effectively removing the tax of double RSB flushing or stuffing in case old software is run on newer hardware.

When executing in a guest context, the hardware only uses the increased RSB size if the hypervisor sets the new ALLOW_LARGER_RAP VMCB bit. A hypervisor that sets this bit is also expected to set the new FLUSH_RAP_ON_VMRUN VMCB bit when necessary, as identified in Caveat 2.
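The sizing rule can be summarised as a small function. This is a sketch of the rule as described in this post, not of any hardware or hypervisor interface: the function and parameter names are invented, and 64 is simply the 5th Gen EPYC default mentioned above.

```python
OLD_RSB_SIZE = 32


def usable_rsb_entries(in_guest, allow_larger_rap, hw_rsb_size=64):
    """Return how many RSB entries the hardware uses in the current context."""
    if not in_guest:
        # Bare metal: the larger RSB is used automatically.
        return hw_rsb_size
    # Guest context: the larger RSB is used only if the hypervisor opted in.
    return hw_rsb_size if allow_larger_rap else OLD_RSB_SIZE


print(usable_rsb_entries(in_guest=False, allow_larger_rap=False))
print(usable_rsb_entries(in_guest=True, allow_larger_rap=False))
print(usable_rsb_entries(in_guest=True, allow_larger_rap=True))
```

A guest under an unenlightened hypervisor keeps seeing a 32-entry RSB, which is why its existing 32-CALL stuffing loop stays sufficient there.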

Compatibility Matrix and Operation Modes

This feature has been designed with backwards compatibility and security in mind. No software changes are required for this feature to work in host, hypervisor, or guest mode. The following matrix shows the effect of the feature on “old”/unenlightened hosts, guests, and “new”/enlightened hosts and guests.

Old host/hypervisor software (old or new guest software):
- Host uses larger RSB automatically
- Host software mitigations still in place. Net effect is that the first 32 entries of the RSB are cleared twice, and the remaining entries are cleared by microcode
- Guest context: only 32 entries of RSB available
- Guest software mitigations in place. Net effect for NPT-enabled guests is that all 32 entries of the RSB are cleared twice on context switches

New host software (old or new guest software):
- Host uses larger RSB automatically
- Host drops software mitigations on VMEXIT and context switches
- RSB preserves host and guest RSB entries on reentry to the same guest after a VMEXIT
- Host extends guest’s RSB size
- Guest context uses the full default RSB buffer
- Host needs to protect L1 guest from malicious L2 guests by setting FLUSH_RAP_ON_VMRUN

New host software, old guest software:
- Guest software mitigations are still in place. The first 32 entries of the RSB are cleared twice for in-guest context switches, while the remaining entries are cleared by microcode

New host software, new guest software:
- Guest software mitigations dropped

Attack Vectors and Mitigation Modes

	User		Kernel/Hypervisor		Guest
	Before	After	Before	After	Before	After
User	RSB Stuffing	ERAPS (RSB cleared on context switch)	SMEP		RSB Stuffing	ERAPS (host/guest tags)
Guest	RSB Stuffing	ERAPS (host/guest tags)	RSB Stuffing after VMEXIT	ERAPS (host/guest tags)	RSB Stuffing	ERAPS (RSB cleared on context switch)

Notes on Caveats, and Confidential Computing

Caveat 1

With NPT disabled, the KVM hypervisor uses the “shadow paging” technique. This method does not cause a change in CR3 when the guest scheduler switches guest processes within the virtual machine. In this case, the auto-RSB flush does not happen on guest context switch despite ERAPS being present. To enable ERAPS for such guests, both these will need to happen:

Hypervisor needs to set FLUSH_RAP_ON_VMRUN for every guest event that causes a context switch or TLB flush on real hardware
Guest software will have to change the hard-coded “32” value that is used for RSB stuffing to increase it to the default RSB size if ERAPS is exposed to the guest. A guest will have to do this to preserve its operational integrity in case the hypervisor has a bug that does not clear the RSB on one of the qualifying conditions.

It is important to note that having NPT disabled is a theoretical case – almost no production host that has the NPT capability disables it.

Due to these shortcomings, the Linux implementation will not expose the larger RSB or the ERAPS CPUID feature bit to virtual machines when NPT is disabled.
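The reasoning behind this decision can be put into two small helpers. They are illustrative only: the names are invented, and the real check lives in the hypervisor's CPUID handling rather than in a function like this.

```python
LEGACY_STUFF_COUNT = 32  # hard-coded stuffing count in existing guest software


def expose_eraps_to_guest(host_has_eraps, npt_enabled):
    """Policy described above: hide ERAPS and the larger RSB without NPT."""
    return host_has_eraps and npt_enabled


def unstuffed_entries(rsb_size, stuff_count=LEGACY_STUFF_COUNT):
    """Entries that a stuffing loop of `stuff_count` CALLs leaves untouched."""
    return max(rsb_size - stuff_count, 0)


print(expose_eraps_to_guest(host_has_eraps=True, npt_enabled=False))
print(unstuffed_entries(rsb_size=64))
print(unstuffed_entries(rsb_size=32))
```

With a 64-entry RSB and no automatic flush, a 32-CALL stuffing loop would leave 32 entries untouched. Keeping shadow-paging guests on the 32-entry RSB avoids that gap without requiring any guest changes.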

Caveat 2

When running nested virtual machines, the RSB entries marked “guest” may correspond to either of the L1 or L2 guests. To protect the L1 guest from a malicious L2 guest wanting to poison the guest RSB entries, the hypervisor, on a VMEXIT from an L2 guest, will set the FLUSH_RAP_ON_VMRUN VMCB bit. This ensures that when the control goes from an L2 guest to an L1 guest, the RSB entries are cleared in the L1 guest’s context. This is a case where the L1 guest relies on the hypervisor to do the right thing.
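The hypervisor-side policy can be sketched as follows. This is a simplified model with invented names (`Vmcb`, `Cpu`, `enter_guest`); only the FLUSH_RAP_ON_VMRUN bit itself comes from the feature description, and the real VMCB layout and KVM code paths are not represented here.

```python
class Vmcb:
    """Hypothetical stand-in holding the one VMCB control bit used here."""

    def __init__(self):
        self.flush_rap_on_vmrun = False


class Cpu:
    def __init__(self):
        self.guest_rsb = []  # entries tagged "guest": shared by L1 and L2

    def vmrun(self, vmcb):
        # Hardware honours the bit on VMRUN and flushes the RSB.
        if vmcb.flush_rap_on_vmrun:
            self.guest_rsb.clear()


def enter_guest(cpu, vmcb, exited_level, entering_level):
    """Hypervisor policy: request a flush only when going from L2 to L1."""
    vmcb.flush_rap_on_vmrun = (exited_level == 2 and entering_level == 1)
    cpu.vmrun(vmcb)


cpu, l1_vmcb = Cpu(), Vmcb()

cpu.guest_rsb.append(0x1111)  # L1 guest makes a call, then exits
enter_guest(cpu, l1_vmcb, exited_level=1, entering_level=1)
preserved = list(cpu.guest_rsb)  # same guest re-entered: entries kept

cpu.guest_rsb.append(0xBAD000)  # malicious L2 guest poisons the RSB
enter_guest(cpu, l1_vmcb, exited_level=2, entering_level=1)
after_l2_exit = list(cpu.guest_rsb)  # flushed before L1 runs again
print(preserved, after_l2_exit)
```

Re-entering the same guest keeps its RSB entries, so the flush cost is only paid on the L2 to L1 transition where the host/guest tag cannot tell the two apart.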

Considerations for Confidential Computing

For Confidential Computing, software running in the guest does not trust the hypervisor software to be bug-free or non-malicious. For both caveats 1 and 2, the guest software must rely on the hypervisor for secure and correct operation. This can go against the confidential computing trust boundaries.

However, both cases behind Caveats 1 and 2, namely NPT disabled and nested virtualization, are disallowed in SEV-SNP Confidential Computing.

All other modes of operation for this feature do not rely on any hypervisor implementation details, and hence are compatible with Confidential Computing trust boundaries.

A final note on confidential computing: the SEV-SNP hardware checks whether the CPUID bits being presented to the guest are allowed on a particular host. That prevents a malicious or buggy hypervisor from exposing the ERAPS feature when the hardware does not have it. This allows the guest software to safely drop the software mitigations if the ERAPS CPUID bit is discovered.

Changelog

v2: 5 Nov 2024:
- The BTC/BTC_NO discussion was only related to the comments in the kernel’s arch/x86/kernel/cpu/bugs.c, and not related to the SpectreRSB issue on AMD or ERAPS, so drop that discussion from this doc, as it caused confusion - and also split out that patch for upstream v2 posting
  - remove mentions of BHB, AMD doesn’t have such a thing; and also the BTB, since that was only used in the BTC context.
- Add a note on IBPB also doing RSB flushes
- Update language to clarify context switches via CR3 writes being the main vector for RSB flushes in hardware, rather than just TLB flushes
- Thanks to Andrew Cooper, Dave Hansen for comments on patch upstream to help clarify language
