We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling.

With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while using significantly fewer parameters and a smaller compute budget.

Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on the ImageNet-1K val set. Additionally, EVA-02-CLIP can reach up to 80.4 zero-shot top-1 accuracy on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 of the parameters and ~1/6 of the image-text training data.

We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance.

We hope our efforts enable a broader range of the research community to advance the field in a more efficient, affordable and equitable manner.

Summary of EVA-02 performance

[Image: summary_tab]

Get Started

Best Practice

  • If you would like to use or fine-tune EVA-02 in your project, please start with a shorter schedule & a smaller learning rate than the baseline setting.
  • Using EVA-02 as a feature extractor: #56.
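
The first tip above can be sketched as a small helper that derives a more conservative starting configuration from a baseline one. This is only an illustration: `FinetuneConfig`, `conservative_start`, the 0.5 scale factors, and the baseline numbers are all hypothetical and are not taken from the EVA-02 training scripts. Substitute the values from whichever baseline setting you start from.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FinetuneConfig:
    """Hypothetical container for the two knobs the tip mentions."""
    epochs: int
    base_lr: float


def conservative_start(baseline: FinetuneConfig,
                       schedule_scale: float = 0.5,
                       lr_scale: float = 0.5) -> FinetuneConfig:
    """Return a shorter-schedule, smaller-learning-rate copy of `baseline`.

    The 0.5 defaults are illustrative assumptions, not recommended values.
    """
    if not (0.0 < schedule_scale <= 1.0 and 0.0 < lr_scale <= 1.0):
        raise ValueError("scales must be in (0, 1]")
    return replace(
        baseline,
        # Never drop below one epoch, even for very short baselines.
        epochs=max(1, round(baseline.epochs * schedule_scale)),
        base_lr=baseline.base_lr * lr_scale,
    )


# Placeholder baseline values, chosen only for this example.
baseline = FinetuneConfig(epochs=30, base_lr=1e-4)
first_try = conservative_start(baseline)
print(first_try)
```

If the conservative run is stable, the schedule and learning rate can then be moved back toward the baseline in steps by passing larger scale factors.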

BibTeX & Citation

@article{eva02,
  title={Eva-02: A visual representation for neon genesis},
  author={Fang, Yuxin and Sun, Quan and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
  journal={Image and Vision Computing},
  pages={105171},
  year={2024},
  publisher={Elsevier}
}

Acknowledgement

EVA-01, BEiT, BEiTv2, CLIP, MAE, timm, DeepSpeed, Apex, xFormers, detectron2, mmcv, mmdet, mmseg, ViT-Adapter, detrex, and rotary-embedding-torch.

Contact

  • For help with EVA-02 or to report a bug, please open a GitHub Issue with the label EVA-02. Let's build a better & stronger EVA-02 together :)

  • The BAAI Vision Team is hiring at all levels, including full-time researchers, engineers and interns. If you are interested in working with us on foundation models, self-supervised learning and multimodal learning, please contact Yue Cao ([email protected]) and Xinlong Wang ([email protected]).