Diffusion

RoHM: Robust Human Motion Reconstruction via Diffusion

Motion reconstruction
webpage https://sanweiliti.github.io/ROHM/ROHM.html
I think this is closely related.
Pipeline
1. Use off-the-shelf per-frame regressors and/or per-frame optimization to obtain initial SMPL-X estimates for each frame. The visibility masks are obtained by randomly masking joints at training time and by computing joint visibility at test time.
  
  $$
  \begin{aligned}
  &\text{Motion sequence: } \tilde{X} \in \mathbb{R}^{N \times d}, \quad X = (\mathbf{R}, \mathbf{P})\\
  &\text{Root trajectory: } \mathbf{R} \in \mathbb{R}^{N \times d_R}\\
  &\text{Local body features: } \mathbf{P} \in \mathbb{R}^{N \times d_P}\\
  &\text{Root joint visibility mask: } M_R \in \{0,1\}^{N \times d_R}\\
  &\text{Local joint visibility mask: } M_P \in \{0,1\}^{N \times d_P}
  \end{aligned}
  $$
2. TrajNet
3. PoseNet
4. TrajControl, an auxiliary module for fine-tuning TrajNet with an additional control signal from the local body pose
  
  I haven't looked at this thoroughly, but I think it is just a fine-tuning technique.
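
The masked input from step 1 can be sketched in a few lines. This is only a minimal illustration under my own assumptions, not RoHM's code: the sizes `N`, `d_R`, `d_P`, the masking probability, the per-entry (rather than per-joint) masking, and the choice to zero out masked entries are all hypothetical. Plain Python lists are used so the snippet has no dependencies.

```python
import random

def random_visibility_mask(num_frames, num_features, mask_prob, rng):
    """Return an N x d mask of 0/1 entries (1 = visible, 0 = masked)."""
    return [
        [0 if rng.random() < mask_prob else 1 for _ in range(num_features)]
        for _ in range(num_frames)
    ]

def apply_mask(sequence, mask):
    """Zero out the masked entries (an assumption; other fill values are possible)."""
    return [
        [value * visible for value, visible in zip(row, mask_row)]
        for row, mask_row in zip(sequence, mask)
    ]

rng = random.Random(0)
N, d_R, d_P = 4, 3, 6  # hypothetical sequence length and feature sizes

# Stand-ins for the root trajectory R and the local body features P.
R = [[rng.gauss(0.0, 1.0) for _ in range(d_R)] for _ in range(N)]
P = [[rng.gauss(0.0, 1.0) for _ in range(d_P)] for _ in range(N)]

# Training-time masks M_R and M_P, drawn at random.
M_R = random_visibility_mask(N, d_R, 0.3, rng)
M_P = random_visibility_mask(N, d_P, 0.3, rng)

R_masked = apply_mask(R, M_R)
P_masked = apply_mask(P, M_P)
```

At test time the random masks would be replaced by computed joint visibility, as noted in step 1.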

PriorMDM: Human Motion Diffusion as a Generative Prior

webpage https://priormdm.github.io/priorMDM-page/
Prior
Text2Motion
MDM Architecture

Long Sequences Generation
Two-Person Generation
Fine-Tuned Motion Control
1. I think this is closely related
- Generate full-body motion controlled by a user-defined set of input features.
- My impression is that the main improvement is adding noise to only part of the motion features.
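
To make the "noise only part of the features" idea concrete, here is a minimal sketch of a forward diffusion step that leaves the user-controlled features clean. It reflects my reading of the note above, not PriorMDM's actual code: the function name `q_sample_partial`, the flat feature vector, and the 0/1 `control_mask` are hypothetical. The noising formula is the standard DDPM forward step, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$.

```python
import math
import random

def q_sample_partial(x0, control_mask, alpha_bar_t, rng):
    """Noise x0 to step t, but keep controlled features untouched.

    x0:           list of clean feature values
    control_mask: list of 0/1 flags (1 = user-controlled, kept clean)
    alpha_bar_t:  cumulative noise-schedule product at step t, in [0, 1]
    """
    x_t = []
    for value, controlled in zip(x0, control_mask):
        eps = rng.gauss(0.0, 1.0)
        noised = math.sqrt(alpha_bar_t) * value + math.sqrt(1.0 - alpha_bar_t) * eps
        x_t.append(value if controlled == 1 else noised)
    return x_t

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0, 0.0]    # hypothetical clean motion features
control_mask = [1, 0, 0, 1]   # first and last features are the control signal
x_t = q_sample_partial(x0, control_mask, alpha_bar_t=0.5, rng=rng)
```

The controlled entries of `x_t` equal those of `x0`, so the denoiser always sees the control signal as given and only has to fill in the rest.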

DiffPose: Toward More Reliable 3D Pose Estimation

webpage https://gongjia0208.github.io/Diffpose/
Pipeline
1. The 2D pose is often estimated from the image with an off-the-shelf 2D pose detector.
  
  Initialize the indeterminate 3D pose distribution $H_k$ from the extracted heatmaps, which capture the underlying uncertainty of the input 2D pose in 3D space.
  
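  I think the difference from other work may be that it uses the 2D pose and related cues to directly obtain what amounts to the "noised" motion as step $k$, whereas other methods add Gaussian noise to the initial estimate and then denoise.
  - Training (DiffPose forward diffusion): because the noise is not Gaussian, a different forward process has to be considered. See Section 4.2.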
2. Perform the reverse diffusion process: a diffusion model $g$ progressively denoises the initial distribution $H_k$ into the desired high-quality determinate distribution $H_0$. We can then sample $h_0 \in \mathbb{R}^{3 \times J}$ from the pose distribution $H_0$ to synthesize the final 3D pose $h_s$.
  - Use the distributions obtained from the forward process to optimize the diffusion model.
  - Context Encoder
    - Spatial-temporal context: extracted from the 2D pose sequence derived from $V_t$ (or from a single 2D pose derived from $I_t$ if $V_t$ is not available)
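
The two pipeline steps above can be sketched as a toy loop: draw initial poses from an uncertainty-aware distribution, denoise each one step by step with a context-conditioned model, then combine the results. Everything here is a hypothetical stand-in rather than DiffPose's implementation: `toy_denoiser` only pulls the pose toward a context vector so that the loop runs, the initial distribution is a per-coordinate Gaussian instead of a heatmap-derived one, and averaging the samples into $h_s$ is my assumption about the aggregation.

```python
import random

def toy_denoiser(h, context, k):
    """Hypothetical placeholder for g: move the pose halfway toward the context."""
    return [x + 0.5 * (c - x) for x, c in zip(h, context)]

def reverse_diffusion(h_init, context, num_steps, denoiser):
    """Run num_steps denoising steps, from step num_steps down to step 1."""
    h = list(h_init)
    for k in range(num_steps, 0, -1):
        h = denoiser(h, context, k)
    return h

def estimate_pose(init_mean, init_std, context, num_samples, num_steps, rng,
                  denoiser=toy_denoiser):
    """Sample initial poses, denoise each, and average them into one pose."""
    samples = []
    for _ in range(num_samples):
        # Stand-in for sampling from the initial pose distribution.
        h_init = [rng.gauss(m, s) for m, s in zip(init_mean, init_std)]
        samples.append(reverse_diffusion(h_init, context, num_steps, denoiser))
    dim = len(init_mean)
    return [sum(s[i] for s in samples) / num_samples for i in range(dim)]

rng = random.Random(0)
context = [0.1, 0.2, 0.3]  # hypothetical flattened context features
h_s = estimate_pose(
    init_mean=[0.0, 0.0, 0.0],
    init_std=[1.0, 1.0, 1.0],
    context=context,
    num_samples=5,
    num_steps=10,
    rng=rng,
)
```

In a real model the denoiser would be a trained network and the context would come from the context encoder; the loop structure is the only part this sketch is meant to show.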