M5: Mastering page migration and memory management for CXL-based tiered memory systems
Published in MICRO, 2024
This paper proposes a HW-SW co-design for hot data detection and migration in CXL-based tiered memory systems. paper slides
Published in MICRO, 2024
This paper is the first-ever characterization study of real commodity CXL Type-2 devices. We also introduce a real-world use case of a Type-2 device as a cache-coherent accelerator for Linux kernel function offloading. paper slides
Published in OSDI, 2024
This paper presents a novel tiered memory management software system based on emerging CXL memory. paper and slides
Published in ISCA (Industry Track), 2024
This is an authentic retrospective of Intel's journey of building SoC-level features and a software ecosystem for accelerators, and integrating various data accelerators into modern Xeon CPU chips. paper slides
Published in ASPLOS, 2024
This paper is an in-depth characterization study of the Intel Data Streaming Accelerator (DSA) in modern Intel Xeon Scalable Processors. It provides an introduction, performance analysis, optimization guide, ecosystem overview, and real use cases of DSA. paper arXiv version slides
Published in MICRO, 2023
This paper is the first-ever characterization study of real commodity CXL memory devices. We develop and open-source the MEMO benchmark for CXL memory testing, and develop an auto-tuning algorithm to make the most out of CXL devices as not only a capacity expander but also a bandwidth expander. paper arXiv version slides
Published in ATC, 2023
This paper proposes to take advantage of the capabilities of SmartNICs in the datacenter to offload a set of expensive kernel memory-management tasks, so that host CPU cycles and cache pollution can be significantly reduced. paper (including video and slides)
Published in HPCA, 2023
Responding to the “datacenter tax” and “killer microseconds” problems, this paper proposes a holistic solution for accelerating μs-scale datacenter applications, leveraging emerging RDMA and cache-coherent accelerator techniques. paper arXiv version slides
Published in MICRO, 2022
This paper proposes an intelligent and flexible mechanism for inbound traffic destination steering to reduce data movement in the cache/memory hierarchy and thus improve inbound I/O performance. paper slides
Published in NSDI, 2022
This paper proposes a versatile and flexible approach for floating-point number representation, storage, and operations in modern RMT-based programmable switches, which can benefit a wide range of distributed applications, including distributed training and database queries. paper (including video) arXiv (extended) version slides
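For context on the problem this entry addresses: RMT switch pipelines generally expose only integer ALUs, so floats must be encoded into a form the dataplane can add natively. The sketch below illustrates one common workaround, fixed-point scaling; it is a generic illustration of the problem setting, not the paper's actual representation or in-switch operations, and `SCALE` is a hypothetical parameter.

```python
# Generic fixed-point sketch for aggregating floats on an integer-only dataplane.
# NOT the paper's scheme; a minimal illustration of the problem setting only.

SCALE = 1 << 16  # hypothetical fixed-point scaling factor

def to_fixed(x: float) -> int:
    """Quantize a float to a fixed-point integer the switch can add natively."""
    return round(x * SCALE)

def from_fixed(n: int) -> float:
    """Recover an approximate float from the accumulated fixed-point sum."""
    return n / SCALE

def switch_aggregate(values):
    """Integer-only accumulation, as an RMT pipeline ALU stage could perform."""
    total = 0
    for v in values:
        total += to_fixed(v)  # end hosts quantize; the switch only adds ints
    return from_fixed(total)

grads = [0.125, -0.5, 0.375]
print(switch_aggregate(grads))  # close to sum(grads)
```

Fixed-point scaling trades dynamic range for integer-only arithmetic, which is exactly the tension that motivates richer in-switch floating-point support.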
Published in ISCA, 2021
This paper proposes the first I/O-aware LLC management mechanism for performance isolation on DDIO-enabled platforms. paper arXiv version slides
Published in HPCA, 2021
This paper proposes a generic accelerator architecture for fine-grained, latency-sensitive queries over various data structures. It also proposes a hybrid scheme for efficiently integrating the accelerator into modern server CPUs. paper slides
Published in ISPASS, 2020
This paper analyzes the performance impact of Intel Data Direct I/O (DDIO) technology on applications, and models DDIO in the gem5 simulator. paper slides video
Published in NSDI (poster), ISCA (full paper), 2019
This paper presents an algorithm/hardware co-design to accelerate gradient aggregation in distributed reinforcement learning training. paper slides
Published in ISCA, 2019
This paper provides an on-CPU, near-cache acceleration solution for cuckoo hash lookup, the core operation of modern virtual switches. paper slides
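To make the accelerated operation concrete: in a cuckoo hash table, every key has two candidate buckets, so a lookup probes at most two locations, giving the bounded, cache-friendly latency that makes it attractive for virtual switching. The sketch below is a minimal textbook two-way cuckoo table for illustration; it is not the paper's hardware design, and the hash functions are placeholder assumptions.

```python
# Minimal two-way cuckoo hash table (illustrative sketch, not the paper's design).

class CuckooTable:
    def __init__(self, size=16):
        self.size = size
        self.buckets = [None] * size  # each slot holds a (key, value) pair or None

    def _h1(self, key):
        return hash(key) % self.size  # placeholder hash functions

    def _h2(self, key):
        return (hash(key) // self.size + 1) % self.size

    def insert(self, key, value, max_kicks=32):
        slot = (key, value)
        idx = self._h1(key)
        for _ in range(max_kicks):
            if self.buckets[idx] is None or self.buckets[idx][0] == slot[0]:
                self.buckets[idx] = slot
                return True
            # Evict the occupant and relocate it to its alternate bucket.
            self.buckets[idx], slot = slot, self.buckets[idx]
            k = slot[0]
            idx = self._h2(k) if idx == self._h1(k) else self._h1(k)
        return False  # too many kicks: table needs resizing/rehashing

    def lookup(self, key):
        # At most two probes -> bounded lookup latency.
        for idx in (self._h1(key), self._h2(key)):
            entry = self.buckets[idx]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None
```

The two-probe lookup path is the hot loop a near-cache accelerator can execute without involving the CPU core's pipeline.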
Published in MICRO, 2018
This paper provides an algorithm/hardware co-design to accelerate gradient aggregation in distributed deep learning training. paper slides