Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
Abstract
We study zero-shot 3D alignment of two given meshes from a short text prompt describing their spatial relation---an essential capability for content creation and scene assembly. Earlier approaches rely primarily on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to capture language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time---updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer---without training a new model. Our framework augments language supervision with geometry-aware objectives: a soft Iterative Closest Point (ICP) variant that encourages controlled surface attachment, and a penetration loss that discourages interpenetration. A phased schedule strengthens contact constraints over time, camera control concentrates views on the interaction region, and randomized restarts improve robustness. To enable evaluation, we curate a benchmark of 50 {mesh pair, prompt} cases spanning diverse categories and relations, and compare against baselines. Across the benchmark, our method yields semantically faithful and physically plausible alignments, improving CLIP similarity while reducing intersection volume.
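The test-time optimization described above can be illustrated with a minimal sketch. The actual method back-propagates CLIP gradients through a differentiable renderer over full meshes; the toy below mocks only the geometric terms (a soft attachment objective plus a penetration penalty under a phased weight schedule) on two spheres, and all names (`attachment_loss`, `penetration_loss`, the radii, the learning rate) are hypothetical choices, not the paper's.

```python
import numpy as np

# Illustrative sketch only: the paper optimizes pose with CLIP-driven
# gradients through a differentiable renderer over meshes. Here we mock
# the two geometry-aware objectives on a pair of spheres so the phased
# contact schedule is easy to see. All names and constants are made up.

R1, R2 = 1.0, 0.5          # radii of the fixed and the movable sphere
CONTACT = R1 + R2          # center distance at exact surface contact

def attachment_loss(c2):
    """Soft-ICP-like term: pull the movable surface onto the fixed one."""
    d = np.linalg.norm(c2)
    return (d - CONTACT) ** 2

def penetration_loss(c2):
    """Penalize interpenetration (centers closer than contact distance)."""
    d = np.linalg.norm(c2)
    return max(0.0, CONTACT - d) ** 2

def grad(f, c2, eps=1e-5):
    """Finite-difference gradient (stands in for autodiff)."""
    g = np.zeros_like(c2)
    for i in range(c2.size):
        e = np.zeros_like(c2)
        e[i] = eps
        g[i] = (f(c2 + e) - f(c2 - e)) / (2 * eps)
    return g

c2 = np.array([3.0, 0.0, 0.0])  # movable sphere starts far away
for step in range(300):
    # phased schedule: the contact constraint strengthens over time
    w_pen = min(1.0, step / 100) * 10.0
    loss = lambda c: attachment_loss(c) + w_pen * penetration_loss(c)
    c2 = c2 - 0.1 * grad(loss, c2)

print(round(float(np.linalg.norm(c2)), 3))  # ≈ 1.5: surfaces in contact
```

In the full method, the translation update above would be joined by rotation and isotropic-scale updates, with a CLIP similarity term replacing the hand-crafted attachment objective as the semantic driver.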