M5: Mastering page migration and memory management for CXL-based tiered memory systems
Published in MICRO, 2024
This paper proposes a HW-SW co-design for hot data detection and migration in CXL-based tiered memory systems. paper slides
Published in MICRO, 2024
This paper is the first-ever characterization study of real commodity CXL Type-2 devices. We also introduce a real-world use case of a Type-2 device as a cache-coherent accelerator for Linux kernel function offloading. paper slides
Published in OSDI, 2024
This paper presents a novel tiered memory management software system based on emerging CXL memory. paper and slides
Published in ISCA (Industry Track), 2024
This is an authentic retrospective of Intel's journey of building SoC-level features and a software ecosystem for accelerators, and integrating various data accelerators into modern Xeon CPU chips. paper slides
Published in ASPLOS, 2024
This paper is an in-depth characterization study of the Intel Data Streaming Accelerator (DSA) in modern Intel Xeon Scalable Processors. It provides an introduction, performance analysis, optimization guide, ecosystem overview, and real use cases of DSA. paper arXiv version slides
Published in MICRO, 2023
This paper is the first-ever characterization study of real commodity CXL memory devices. We develop and open-source the MEMO benchmark for CXL memory testing, and develop an auto-tuning algorithm to make the most out of CXL devices as not only a capacity expander but also a bandwidth expander. paper arXiv version slides
Published in ATC, 2023
This paper proposes to take advantage of the capabilities of SmartNICs in the datacenter to offload a set of expensive kernel memory-management tasks, so that host CPU cycles and cache pollution can be significantly reduced. paper (including video and slides)
Published in HPCA, 2023
Responding to the “datacenter tax” and “killer microseconds” problems, this paper proposes a holistic solution for accelerating μs-scale datacenter applications, leveraging emerging RDMA and cache-coherent accelerator techniques. paper arXiv version slides
Published in MICRO, 2022
This paper proposes an intelligent and flexible mechanism for inbound traffic destination steering to reduce data movement in the cache/memory hierarchy and thus improve inbound I/O performance. paper slides
Published in NSDI, 2022
This paper proposes a versatile and flexible approach for floating-point number representation, storage, and operations in modern RMT-based programmable switches, which can benefit a wide range of distributed applications, including distributed training and database queries. paper (including video) arXiv (extended) version slides
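For context on the problem this entry addresses: RMT switch pipelines generally expose only integer ALUs, so floats must be encoded into a form the dataplane can add natively. The sketch below illustrates one common workaround, fixed-point scaling; it is a generic illustration of the problem setting, not the paper's actual representation or in-switch operations, and `SCALE` is a hypothetical parameter.

```python
# Generic fixed-point sketch for aggregating floats on an integer-only dataplane.
# NOT the paper's scheme; a minimal illustration of the problem setting only.

SCALE = 1 << 16  # hypothetical fixed-point scaling factor

def to_fixed(x: float) -> int:
    """Quantize a float to a fixed-point integer the switch can add natively."""
    return round(x * SCALE)

def from_fixed(n: int) -> float:
    """Recover an approximate float from the accumulated fixed-point sum."""
    return n / SCALE

def switch_aggregate(values):
    """Integer-only accumulation, as an RMT pipeline ALU stage could perform."""
    total = 0
    for v in values:
        total += to_fixed(v)  # end hosts quantize; the switch only adds ints
    return from_fixed(total)

grads = [0.125, -0.5, 0.375]
print(switch_aggregate(grads))  # close to sum(grads)
```

Fixed-point scaling trades dynamic range for integer-only arithmetic, which is exactly the tension that motivates richer in-switch floating-point support.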
Published in ISCA, 2021
This paper proposes the first I/O-aware LLC management mechanism for performance isolation on DDIO-enabled platforms. paper arXiv version slides
Published in HPCA, 2021
This paper proposes a generic accelerator architecture for fine-grained, latency-sensitive queries over various data structures. It also proposes a hybrid scheme for efficiently integrating the accelerator into modern server CPUs. paper slides
Published in ISPASS, 2020
This paper analyzes the performance impact of Intel Data Direct I/O (DDIO) technology on applications, and models DDIO in the gem5 simulator. paper slides video
Published in NSDI (poster), ISCA (full paper), 2019
This paper presents an algorithm/hardware co-design to accelerate gradient aggregation in distributed reinforcement learning training. paper slides
Published in ISCA, 2019
This paper provides an on-CPU, near-cache acceleration solution for cuckoo hash lookup, the core operation of modern virtual switches. paper slides
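To make the accelerated operation concrete: in a cuckoo hash table, every key has two candidate buckets, so a lookup probes at most two locations, giving the bounded, cache-friendly latency that makes it attractive for virtual switching. The sketch below is a minimal textbook two-way cuckoo table for illustration; it is not the paper's hardware design, and the hash functions are placeholder assumptions.

```python
# Minimal two-way cuckoo hash table (illustrative sketch, not the paper's design).

class CuckooTable:
    def __init__(self, size=16):
        self.size = size
        self.buckets = [None] * size  # each slot holds a (key, value) pair or None

    def _h1(self, key):
        return hash(key) % self.size  # placeholder hash functions

    def _h2(self, key):
        return (hash(key) // self.size + 1) % self.size

    def insert(self, key, value, max_kicks=32):
        slot = (key, value)
        idx = self._h1(key)
        for _ in range(max_kicks):
            if self.buckets[idx] is None or self.buckets[idx][0] == slot[0]:
                self.buckets[idx] = slot
                return True
            # Evict the occupant and relocate it to its alternate bucket.
            self.buckets[idx], slot = slot, self.buckets[idx]
            k = slot[0]
            idx = self._h2(k) if idx == self._h1(k) else self._h1(k)
        return False  # too many kicks: table needs resizing/rehashing

    def lookup(self, key):
        # At most two probes -> bounded lookup latency.
        for idx in (self._h1(key), self._h2(key)):
            entry = self.buckets[idx]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None
```

The two-probe lookup path is the hot loop a near-cache accelerator can execute without involving the CPU core's pipeline.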
Published in MICRO, 2018
This paper provides an algorithm/hardware co-design to accelerate gradient aggregation in distributed deep learning training. paper slides