Off-the-shelf Computer Vision models

- CLIP (ViT)
- DINO (ViT)
- VGG-16
- Swin-T (MoBY)
- Swin-T (Object Detection)
- Swin-T (Segmentation)
- Face Parsing
- Face Normals

Models are automatically downloaded when the vision-aided discriminator is initialized.
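
The download-on-initialization behavior can be pictured with a minimal sketch like the one below. This is not the library's code: `LazyBackbone`, `WEIGHT_FILES`, `fake_fetch`, and the file names are hypothetical names introduced only for illustration, and the network download is replaced by a stand-in function so the example runs offline. It only shows the general pattern of fetching weights on first construction and reusing the cached copy afterwards.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical registry: keys and file names are illustrative only,
# not the identifiers used by the actual library.
WEIGHT_FILES = {
    "clip": "clip_vit.pt",
    "dino": "dino_vit.pt",
    "vgg": "vgg16.pt",
}


class LazyBackbone:
    """Fetches weights into a local cache when it is constructed."""

    def __init__(self, name: str, cache_dir: Path, fetch: Callable[[str], bytes]):
        if name not in WEIGHT_FILES:
            raise ValueError(f"unknown backbone: {name!r}")
        cache_dir.mkdir(parents=True, exist_ok=True)
        self.path = cache_dir / WEIGHT_FILES[name]
        # True only when this construction had to fetch the file.
        self.downloaded = not self.path.exists()
        if self.downloaded:
            # Cache miss: fetch once; later constructions reuse the file.
            self.path.write_bytes(fetch(WEIGHT_FILES[name]))


def fake_fetch(filename: str) -> bytes:
    """Stand-in for a network download so the sketch runs offline."""
    return b"weights for " + filename.encode()


def demo():
    """Construct the same backbone twice against one cache directory."""
    with tempfile.TemporaryDirectory() as tmp:
        first = LazyBackbone("clip", Path(tmp), fake_fetch)
        second = LazyBackbone("clip", Path(tmp), fake_fetch)
        return first.downloaded, second.downloaded
```

In this sketch the first construction triggers the fetch and the second finds the cached file, which is the practical consequence of the note above: the first initialization may take longer and may need network access, while later ones should not.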