KingsmanVince
joined 1 year ago
Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models
(openaccess.thecvf.com)
IIRC, DETR generates a sequence to predict the boxes of objects. I think this paradigm can be applied to such models. "Think before you locate" could be a new path to explore.
The idea is similar to BLIP-2. Both papers use learnable tokens as queries for a transformer decoder. The decoder queries the vision space based on the trainable queries and the prompt.
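To make the "learnable tokens as queries" idea concrete, here is a minimal, dependency-free sketch of single-head cross-attention in plain Python. It is not the implementation from either paper: the names (`cross_attend`, `learned_queries`, `image_features`) and sizes are made up for illustration, the learned Q/K/V projections are left out, the prompt conditioning is omitted, and random vectors stand in for both the trained queries and the image patch features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Each query vector attends over all feature vectors (scaled dot product)."""
    dim = len(queries[0])
    outputs, weights = [], []
    for query in queries:
        # Similarity between this query and every image feature.
        scores = [
            sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
            for feat in features
        ]
        attn = softmax(scores)
        # Weighted sum of the features: what this query "reads" from the image.
        pooled = [
            sum(w * feat[d] for w, feat in zip(attn, features))
            for d in range(dim)
        ]
        outputs.append(pooled)
        weights.append(attn)
    return outputs, weights


random.seed(0)
DIM, NUM_QUERIES, NUM_PATCHES = 8, 4, 16  # hypothetical sizes

# In a real model these would be trainable parameters; random values here.
learned_queries = [
    [random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)
]
# Stand-in for the output of a frozen vision encoder, one vector per patch.
image_features = [
    [random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)
]

outputs, weights = cross_attend(learned_queries, image_features)
print(len(outputs), len(outputs[0]))
```

The point is only the shape of the computation: a small, fixed number of query vectors pulls information out of a variable number of image features, so the decoder output always has `NUM_QUERIES` rows regardless of image size.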
I also want to share some resources.
For PyTorch,
- https://pytorch.org/tutorials/ The basic tutorials cover the fundamentals, but some of the more advanced ones might be outdated.
- https://www.learnpytorch.io/ The author focuses mostly on computer vision, but he gives an overview from research to production.
For TPU,
- https://github.com/ayaka14732/tpu-starter A full guide to using TPUs with JAX.
Indeed, it would be great if the authors did so. I personally found some non-official implementations: