About
I received the B.S. degree in computer science from ShanghaiTech University and the Ph.D. degree in electronic and computer engineering from the Hong Kong University of Science and Technology.
During my Ph.D. studies, I focused on the intersection of signal processing and machine learning, aiming to demystify deep learning with signal processing tools such as sparse coding. Additionally, I was the first to introduce graph neural networks (GNNs) to communication and networking (V1, V2), providing comprehensive theoretical analysis and practical guidelines. GNN-based resource allocation and signal processing have since been deployed in numerous base stations.
After graduation, I joined Microsoft. With large language models (LLMs) becoming a key focus for productivity, I shifted my research toward LLMs and large multimodal models (LMMs) to align with industry interests.
First, I concentrated on understanding the inner workings of these models with signal processing tools, with the goal of enhancing both their trustworthiness and performance. We were among the first to:
1. Analyze the emergence of reasoning and planning capabilities within LLMs and the gap between supervised fine-tuning (SFT) and reinforcement learning (RL), applying these insights to real-world agents.
2. Investigate LMMs using sparse coding tools, applying them to reduce hallucinations.
3. Provide theoretical guidelines for Mixture of Experts (MoE) structures, applying them to vision foundation models.
Second, I focused on training systems for LLMs and LMMs. We developed BlockOptimizers, which can fine-tune 8-billion-parameter models on a single RTX 3090 and 70-billion-parameter models on four A100 GPUs.
Together with my amazing colleagues, I applied these techniques to fields such as embodied AI (Habi, Diffusion Veteran) and AI for Science (Omni-DNA, MIMSID, MuDM, GraphormerV2).
In my spare time, I write blogs on AI, mathematics, and physics, which have attracted more than 10,000 followers and favorites:
- Blogs on Triton Programming
- Blogs on Graph Neural Networks
- Blogs on Navier-Stokes Equations
- Blogs on Nonconvex Optimization