## Persistent Memory Objects on the Cheap

International Conference on Supercomputing 2025 (ICS '25). Salt Lake City, Utah, United States. June 11.

**Derrick Greenspan**, Naveed Ul Mustafa, Jongouk Choi, Mark Heinrich, Yan Solihin

CompArch & ARPERS research groups, Cyber Security and Privacy Research Cluster, University of Central Florida

1. Introduction and Background
2. Design
   - Light PMO (LPMO) Design
3. Evaluation
   - LPMO Performance
   - CXL Performance
4. Conclusion

## Persistent Memory (PM)

Examples: Intel Optane, CXL Memory-Semantic SSDs

#### Characteristics

- Slow write performance/Decent read performance
- Poor multithreaded performance (more on this later)

Note: Devices with PM usually have DRAM as well

## Persistent Memory Objects (PMOs)

Primitives: pcreate(), attach(), detach(), psync()

#### **Properties**

- File-less
- Potentially pointer-rich
- Accessed via load-store instructions
- Metadata managed by kernel

#### **Features**

- Fast with minimal metadata
- Crash-consistency
- Security at rest
- Integrity verification at rest

## Thread Scaling

Prior work exhibited poor thread scaling

#### Reasons

- PMOs are hosted entirely in PM
- Encryption and integrity verification on the critical path

## Compute Express Link (CXL)

- Utilizes PCIe interface
- Direct access from CPU to memory
- Heterogeneous memory pools
- Can use PM or Volatile Memory

#### Additional latency

- From controller
- Comparable to NUMA

Goal: High-Performance PMOs

#### CXL 3.1 Specification

- Trusted Execution Environments (TEE) Security Protocol (TSP)
- Range-based memory encryption

I.e., transparent hardware encryption (more on this later).

## Design

Goal: Protect at-rest data from disclosure/corruption

#### Out of Scope

- Side-channel attacks
- Data-remanence attacks (DRAM)

## Light PMO (LPMO) Design

#### Prior work: PMO entirely in PM

- Crash consistency simple
- High latency, low write bandwidth

LPMO can exploit DRAM as cache **without** hardware support

DRAM as cache = Reconfigurable Memory

#### Challenges

Which data should be placed in DRAM?

- All data in DRAM
- Shadow in DRAM, primary in PM
  - Psync: Temporary Shadow Page (TSC) in PM, copy to Primary

What data should be encrypted?

- Primary and shadow page encrypted
- Primary and shadow in plaintext
- Shadow in plaintext, primary in ciphertext

## Reconfigurable Memory Hierarchy (Part 2)

- Optane PM uses local DRAM
- CXL can place system on either side of memory expander

## Page Prediction

Prior work: Demand Faulting

Why not **predict** when pages are needed?

#### **Example Solution: Stream Buffer**

- On fault, predict next X sequential pages (**depth**)
- Works well for access patterns amenable to prediction

## Evaluation

#### **Evaluated Benchmarks**

- Microbenchmarks
  - 2d Convolution (2dConv)
  - Gaussian Elimination (Gauss)
  - LU Decomposition (LU)
  - Tiled Matrix Matrix Multiplication (TMM)
- Filebench (Fileserver, VarMail, WebProxy, WebServer)
- LMDB

| Component | Specifications                    |
|-----------|-----------------------------------|
| MB        | Supermicro X11DPi-NT              |
| CPU       | 2 × Intel Xeon Gold 6230 (20 cores) |
| DRAM      | 4 × 32 GiB DDR4 @ 2666 MHz        |
| PM        | 4 × 128 GiB Intel Optane DIMM     |
| OS        | AlmaLinux 9.0; Linux 5.15.157     |

## LPMO Performance

- All benchmarks have better thread scaling!
- DRAM reduces execution time by ≈21%
- IV + Prediction faster than original GPMO design w/o IV

## LPMO Performance - Filebench

- Only 1.19× faster with DRAM
- 1.81× faster with page prediction
- 1.37× faster with page prediction & **IV**

## CXL Performance

Perform same tests, but emulate CXL latency

- Use opposite-node Optane
- Near configuration: cache allocated from local node
- Far configuration: cache allocated from opposite node
- With CXL alone: 50% slower than original
- With DRAM: 20% faster (despite CXL latency)

## CXL Performance - Filebench

## Conclusion

#### **LPMO**

- Software-based DRAM caching
  - Up to 1.25× faster
- Predictive Decryption
  - Up to 1.81× faster

#### **CXL**

- Introduced Reconfigurable Memory Hierarchy
- CXL latency can be masked by LPMO optimizations

Thank You! Any questions?
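#### Sketch: Stream Buffer Page Prediction

The stream-buffer idea from the design slides (on a fault, also bring in the next X sequential pages, where X is the **depth**) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the LPMO implementation: the class `StreamBufferPredictor`, its methods, and the helper `count_faults` are hypothetical names, and "fetching" a page is modelled only as adding it to a resident set, whereas a real system would presumably decrypt the page and copy it from PM into DRAM.

```python
from typing import Iterable, List, Set


class StreamBufferPredictor:
    """Toy model of demand faulting plus sequential page prediction.

    Hypothetical illustration: pages are integers, and a page is
    "resident" once it has been fetched (in a real system, decrypted
    and copied into DRAM).
    """

    def __init__(self, depth: int, num_pages: int) -> None:
        if depth < 0 or num_pages <= 0:
            raise ValueError("depth must be >= 0 and num_pages must be > 0")
        self.depth = depth          # extra sequential pages fetched per fault
        self.num_pages = num_pages  # size of the object, in pages
        self.resident: Set[int] = set()
        self.faults = 0

    def access(self, page: int) -> List[int]:
        """Touch `page`; return the pages fetched because of this access."""
        if not 0 <= page < self.num_pages:
            raise IndexError(f"page {page} out of range")
        if page in self.resident:
            return []  # hit: no fault, nothing to fetch
        self.faults += 1
        # Fetch the faulting page plus the next `depth` pages, clamped
        # to the end of the object and skipping pages already resident.
        last = min(page + self.depth, self.num_pages - 1)
        fetched = [p for p in range(page, last + 1) if p not in self.resident]
        self.resident.update(fetched)
        return fetched


def count_faults(depth: int, num_pages: int, pattern: Iterable[int]) -> int:
    """Replay an access pattern and return the number of faults taken."""
    predictor = StreamBufferPredictor(depth, num_pages)
    for page in pattern:
        predictor.access(page)
    return predictor.faults
```

In this model, a sequential scan of 16 pages takes 16 faults with depth 0 (pure demand faulting) and 4 faults with depth 3, while a strided pattern that skips past the predicted pages gains nothing. That matches the slide's caveat that prediction works well only for access patterns amenable to it.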