New CPU Feature ERAPS Reduces Software Tax for Hardware Bugs

With the 5th Gen AMD EPYC CPU, formerly codenamed “Turin”, AMD introduced a new processor feature: ERAPS, or Enhanced Return Address Predictor Security. On Linux, this feature allows relaxing the software mitigations that were put in place after the SpectreRSB hardware vulnerability was disclosed back in 2018.

This feature protects the hardware itself against certain speculative attacks, so the corresponding software mitigations are no longer required.

This article describes which software mitigations are no longer required on processors with this feature. Before that, let’s catch up on the CPU vulnerabilities and the software mitigations applied to protect against them. Note that the APM release will contain the official documentation. This writeup is just my notes, admittedly with a bunch of mistakes that I’m slowly correcting as the patches go through the upstreaming process. I’ll continue clarifying language and making corrections.

Background

Abbreviations

RAP: Return Address Predictor

RSB: Return Stack Buffer

RAS: Return Address Stack

The three acronyms – RAP, RSB, RAS – are used to reference the same store in the CPU. AMD manuals use RAS and RAP. Linux uses RSB. For consistency, I use RSB in this post.

The RSB is not directly used by software. It is a buffer inside the CPU that speeds up speculative execution. Whenever a CALL instruction is executed, the address of the instruction following the CALL is pushed onto the RSB by microcode. The addresses on the RSB are then used to predict return targets and to speculate past RET instructions.

The SpectreRSB Vulnerability

The Spectre v2 SpectreRSB vulnerability exploited the addresses in the RSB. One userspace process could issue a bunch of CALL instructions, then yield control of the CPU and allow another process to be scheduled in. Any speculative operations on the entries in the RSB from the other userspace process would then use the addresses stuffed by the first malicious process. With this malicious technique, user->user and user->kernel attack vectors are possible. When running virtual machines, a similar technique makes it possible for guest->guest, guest->user, and guest->hypervisor attacks to be carried out.
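To make the poisoning scenario concrete, here is a minimal sketch in C that models the RSB as a small circular stack. It is a toy model and not how the hardware is implemented: the 32-entry depth, the structure layout, and every function name (`rsb_call`, `rsb_predict_ret`, `attacker_poison`) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RSB_ENTRIES 32 /* assumed depth for this toy model */

/* Toy model of the RSB: a small circular stack of return addresses. */
struct rsb {
    uint64_t entry[RSB_ENTRIES];
    unsigned int top;   /* index of the next free slot */
    unsigned int count; /* number of valid entries */
};

/* CALL: the address following the CALL is pushed onto the RSB. */
static void rsb_call(struct rsb *r, uint64_t ret_addr)
{
    r->entry[r->top] = ret_addr;
    r->top = (r->top + 1) % RSB_ENTRIES;
    if (r->count < RSB_ENTRIES)
        r->count++;
}

/* RET: pop the predicted return target; 0 means "no prediction". */
static uint64_t rsb_predict_ret(struct rsb *r)
{
    if (r->count == 0)
        return 0;
    r->top = (r->top + RSB_ENTRIES - 1) % RSB_ENTRIES;
    r->count--;
    return r->entry[r->top];
}

/* Hypothetical attacker: fill the RSB with an address of its choosing. */
static void attacker_poison(struct rsb *r, uint64_t gadget)
{
    for (int i = 0; i < RSB_ENTRIES; i++)
        rsb_call(r, gadget);
}
```

If the attacker calls `attacker_poison()` and the buffer is left untouched across the switch to the victim, the victim's next `rsb_predict_ret()` returns the attacker's address. That is the situation the mitigations described next are meant to prevent.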

Software Mitigations

As a result of the SpectreRSB disclosure, Linux added code to mitigate RSB poisoning scenarios – e.g., a guest placing malicious entries in the RSB, and then hardware using those entries in hypervisor or host or other guest contexts.

Software-based mitigations can take two forms: RSB flushing and RSB stuffing.

Both RSB flushing and RSB stuffing address RSB poisoning scenarios. RSB stuffing additionally addresses RSB underflow vectors (not applicable to AMD CPUs), whereas RSB flushing does not. Since RSB stuffing covers everything that RSB flushing does, it is currently the only method in use in the Linux kernel.

RSB stuffing means the kernel executes 32 (a hard-coded value) CALL instructions in a row. The instruction following each CALL (i.e., the target of the corresponding RET) causes a trap. Any speculative operation on any of these RSB entries is hence benign.
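The effect of stuffing can be sketched with a small C model. The real kernel code is an assembly sequence of CALLs; this sketch only tracks which addresses end up in the buffer. The structure, the function names, the `BENIGN_TARGET` value, and the 64-entry buffer used in the note below are all assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RSB_STUFF_CALLS 32          /* hard-coded count mentioned in the text */
#define MAX_RSB         128         /* upper bound for this toy model only */
#define BENIGN_TARGET   0x1000ULL   /* hypothetical address of the trapping instruction */

struct rsb {
    uint64_t entry[MAX_RSB];
    unsigned int depth; /* number of entries the modeled hardware has */
    unsigned int top;   /* index of the next slot to be written */
};

/* CALL: push a return address, wrapping around when the buffer is full. */
static void rsb_push(struct rsb *r, uint64_t ret_addr)
{
    r->entry[r->top] = ret_addr;
    r->top = (r->top + 1) % r->depth;
}

/* Stuffing: nr_calls CALLs whose return addresses all point at a trap. */
static void rsb_stuff(struct rsb *r, unsigned int nr_calls)
{
    for (unsigned int i = 0; i < nr_calls; i++)
        rsb_push(r, BENIGN_TARGET);
}

/* Count entries that still hold something other than the benign target. */
static unsigned int rsb_stale_entries(const struct rsb *r)
{
    unsigned int n = 0;

    for (unsigned int i = 0; i < r->depth; i++)
        if (r->entry[i] != BENIGN_TARGET)
            n++;
    return n;
}
```

With `depth` set to 32, `rsb_stuff(&r, RSB_STUFF_CALLS)` leaves no stale entries. With a hypothetical 64-entry buffer, the same 32 CALLs leave 32 entries untouched in this model, which is why the stuffing count has to match the buffer size whenever software is the only thing clearing it.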

The Linux mitigation for these RSB vulnerabilities is to stuff the RSB on context switches and on VMEXITs.

Another form of mitigation, with hardware assistance, is the Indirect Branch Prediction Barrier (IBPB). It also clears the RSB, but it is not performed on every context switch, and it is a more expensive operation than stuffing the RSB. The focus of this article and patchset is the RSB stuffing performed on context switches and VMEXITs.

The New ERAPS Feature

The Enhanced Return Address Predictor Security (ERAPS) feature debuted in the newly-released 5th Gen AMD EPYC processors. This feature addresses the RSB poisoning attack scenario in hardware, making software mitigations unnecessary.

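There are several enhancements in the hardware for this feature:

  • The RSB is larger than before
  • The RSB is flushed automatically on context switches
  • RSB entries are tagged as host or guest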

For bare-metal kernels and bare-metal hypervisor hosts, the increase in the RSB size is automatic; no software support is necessary. The hardware will always use the larger RSB and automatically flush it, irrespective of the software configuration or the mitigations in place.

On the other hand, software enlightenment for this feature helps remove the software mitigations in favor of the hardware mitigations – effectively removing the tax of double RSB flushing or stuffing in case old software is run on newer hardware.

When executing in a guest context, the increased RSB size is only used by the hardware if the hypervisor sets the new ALLOW_LARGER_RAP VMCB bit. By setting this bit, the hypervisor also takes responsibility for setting the new FLUSH_RAP_ON_VMRUN VMCB bit when necessary, as described in Caveat 2.

Compatibility Matrix and Operation Modes

This feature has been designed with backwards compatibility and security in mind. No software changes are required for this feature to work in host, hypervisor, or guest mode. The following matrix shows the effect of the feature on “old” / unenlightened hosts, guests, and “new” / enlightened hosts and guests.

Old host / hypervisor software (old or new guest software)
  • Host uses larger RSB automatically
  • Host software mitigations still in place. The net effect is that the first 32 entries of the RSB are cleared twice, and the remaining entries are cleared by microcode
  • Guest context: only 32 entries of the RSB are available
  • Guest software mitigations in place. The net effect for NPT-enabled guests is that all 32 entries of the RSB are cleared twice on context switches

New host software (old or new guest software)
  • Host uses larger RSB automatically
  • Host drops software mitigations on VMEXIT and context switches
  • RSB preserves host and guest RSB entries on reentry to the same guest after a VMEXIT
  • Host extends guest’s RSB size
  • Guest context uses the full default RSB buffer
  • Host needs to protect L1 guest from malicious L2 guests by setting FLUSH_RAP_ON_VMRUN

New host software, old guest software
  • Guest software mitigations are still in place. The first 32 entries of the RSB are cleared twice for in-guest context switches, while the remaining entries are cleared by microcode

New host software, new guest software
  • Guest software mitigations dropped
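The matrix can be condensed into a small decision function. This is only a sketch of my reading of the matrix, assuming ERAPS-capable hardware with NPT enabled. In particular, it assumes that a new guest on an old host keeps its mitigations because the host never exposes the feature to it. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome for one cell of the compatibility matrix (illustrative only). */
struct eraps_outcome {
    bool host_sw_mitigations;  /* host still stuffs the RSB in software */
    bool guest_sw_mitigations; /* guest still stuffs the RSB in software */
    bool guest_full_rsb;       /* guest may use more than 32 RSB entries */
};

/*
 * Assumes ERAPS-capable hardware with NPT enabled.
 * "Enlightened" means the software knows about ERAPS.
 */
static struct eraps_outcome eraps_matrix(bool host_enlightened,
                                         bool guest_enlightened)
{
    struct eraps_outcome o;

    /* An enlightened host drops its own stuffing and extends the guest RSB. */
    o.host_sw_mitigations = !host_enlightened;
    o.guest_full_rsb = host_enlightened;

    /* A guest can drop stuffing only if the host exposes ERAPS to it. */
    o.guest_sw_mitigations = !(host_enlightened && guest_enlightened);

    return o;
}
```

In every cell the hardware protection stays active. The function only captures how much redundant software work remains on top of it.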

Attack Vectors and Mitigation Modes

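|       | User: Before | User: After | Kernel / Hypervisor: Before | Kernel / Hypervisor: After | Guest: Before | Guest: After |
|-------|--------------|-------------|-----------------------------|----------------------------|---------------|--------------|
| User  | RSB Stuffing | ERAPS (RSB cleared on context switch) | SMEP | SMEP | RSB Stuffing | ERAPS (host / guest tags) |
| Guest | RSB Stuffing | ERAPS (host / guest tags) | RSB Stuffing after VMEXIT | ERAPS (host / guest tags) | RSB Stuffing | ERAPS (RSB cleared on context switch) |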

Notes on Caveats, and Confidential Computing

Caveat 1

With NPT disabled, the KVM hypervisor uses the “shadow paging” technique. This method does not cause a change in CR3 when the guest scheduler switches guest processes within the virtual machine. In this case, the automatic RSB flush does not happen on a guest context switch despite ERAPS being present. To enable ERAPS for such guests, both of the following need to happen:

  1. The hypervisor needs to set FLUSH_RAP_ON_VMRUN for every guest event that would cause a context switch or TLB flush on real hardware

  2. Guest software will have to change the hard-coded “32” value that is used for RSB stuffing and increase it to the default RSB size if ERAPS is exposed to the guest. A guest will have to do this to preserve its operational integrity in case the hypervisor has a bug that fails to clear the RSB on one of the qualifying conditions.

It is important to note that having NPT disabled is a theoretical case – almost no production host that has the NPT capability disables it.

Due to these shortcomings, the Linux implementation will not expose the larger RSB or the ERAPS CPUID feature bit to virtual machines when NPT is disabled.
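That policy amounts to a one-line check. The sketch below uses a hypothetical function name and plain boolean inputs; it is not the actual KVM code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical policy helper: expose ERAPS (and the larger RSB) to a
 * guest only when the host CPU has the feature and NPT is enabled.
 */
static bool expose_eraps_to_guest(bool host_has_eraps, bool npt_enabled)
{
    return host_has_eraps && npt_enabled;
}
```

A guest running on shadow paging therefore keeps seeing a 32-entry RSB and no ERAPS CPUID bit, and keeps its existing software mitigations.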

Caveat 2

When running nested virtual machines, the RSB entries marked “guest” may correspond to either of the L1 or L2 guests. To protect the L1 guest from a malicious L2 guest wanting to poison the guest RSB entries, the hypervisor, on a VMEXIT from an L2 guest, will set the FLUSH_RAP_ON_VMRUN VMCB bit. This ensures that when the control goes from an L2 guest to an L1 guest, the RSB entries are cleared in the L1 guest’s context. This is a case where the L1 guest relies on the hypervisor to do the right thing.
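The nested case can be sketched as follows. The structures are simplified stand-ins, not the real VMCB layout or KVM data structures. Clearing the request bit after it has been honored is my assumption, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the relevant VMCB control bit. */
struct vmcb_ctl {
    bool flush_rap_on_vmrun;
};

enum guest_level { GUEST_L1, GUEST_L2 };

struct vcpu {
    struct vmcb_ctl ctl;
    unsigned int guest_rsb_entries; /* modeled count of guest-tagged RSB entries */
};

/* Hypervisor side: after a VMEXIT from L2, request a flush on the next VMRUN. */
static void on_vmexit(struct vcpu *v, enum guest_level from)
{
    if (from == GUEST_L2)
        v->ctl.flush_rap_on_vmrun = true;
}

/* Modeled hardware side: honor the flush request when the guest is entered. */
static void on_vmrun(struct vcpu *v)
{
    if (v->ctl.flush_rap_on_vmrun) {
        v->guest_rsb_entries = 0;
        /* Assumption: the request is cleared once it has been acted on. */
        v->ctl.flush_rap_on_vmrun = false;
    }
}
```

In this model, an exit from L1 followed by re-entry keeps the guest-tagged entries, while an exit from L2 clears them before any guest code runs again.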

Considerations for Confidential Computing

For Confidential Computing, software running in the guest does not trust the hypervisor software to be bug-free or non-malicious. For both Caveats 1 and 2, the guest software must rely on the hypervisor for secure and correct operation. This can go against the confidential computing trust boundaries.

However, both those cases for Caveats 1 and 2 – NPT disabled and nested virtualization – are disallowed in SEV-SNP Confidential Computing.

All other modes of operation for this feature do not rely on any hypervisor implementation details, and hence are compatible with Confidential Computing trust boundaries.

A final note on confidential computing: the SEV-SNP hardware checks whether the CPUID bits being presented to the guest are allowed on a particular host. That prevents a malicious or buggy hypervisor from exposing the ERAPS feature when the hardware does not have it. This ensures that the guest software can safely drop the software mitigations if the ERAPS CPUID bit is discovered.
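On the guest side, the decision reduces to testing one CPUID bit. The sketch below takes the raw EAX value as input so it stays independent of how CPUID is executed. The leaf and bit position (Fn8000_0021 EAX, bit 24) are my recollection of the upstream patches and should be treated as an assumption to verify against the APM. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: leaf and bit position; verify against the APM. */
#define CPUID_LEAF_ERAPS    0x80000021u
#define CPUID_EAX_ERAPS_BIT 24

/* Test the (assumed) ERAPS bit in the EAX value returned for the leaf. */
static bool eax_has_eraps(uint32_t eax)
{
    return (eax >> CPUID_EAX_ERAPS_BIT) & 1u;
}

/* A guest keeps RSB stuffing unless the ERAPS bit is visible to it. */
static bool guest_needs_rsb_stuffing(uint32_t leaf_eax)
{
    return !eax_has_eraps(leaf_eax);
}
```

The value passed in would come from executing CPUID with `CPUID_LEAF_ERAPS` in EAX. Under SEV-SNP, that result is what the hardware has validated, which is what makes the check trustworthy for the guest.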
