(Paper Summary) Perception Encoder: The best visual embeddings are not at the output of the network


Key Points

After CLIP-style pretraining on images, the authors observe that when training on downstream tasks, using the features of an intermediate layer yields higher performance than using the final output of the network.
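To make the idea concrete, here is a minimal sketch of reading an embedding from an intermediate layer instead of the final one. It is a toy illustration in plain Python, not the paper's model or code: the three "layers" are arbitrary stand-in functions, and the names `forward_with_intermediates`, `select_embedding`, and `best_layer` are hypothetical helpers introduced only for this example.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward_with_intermediates(layers: Sequence[Layer], x: Vector) -> List[Vector]:
    """Run x through every layer and return each layer's output, in order."""
    outputs: List[Vector] = []
    h = x
    for layer in layers:
        h = layer(h)
        outputs.append(h)
    return outputs


def select_embedding(layers: Sequence[Layer], x: Vector, layer_index: int = -1) -> Vector:
    """Return the output of one chosen layer (default -1 is the final output)."""
    return forward_with_intermediates(layers, x)[layer_index]


def best_layer(scores: Sequence[float]) -> int:
    """Return the index of the layer with the highest downstream score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy stand-ins for encoder blocks (assumption: NOT a real vision encoder).
toy_layers: List[Layer] = [
    lambda v: [2.0 * a for a in v],  # layer 0: scale
    lambda v: [a + 1.0 for a in v],  # layer 1: shift
    lambda v: [a * a for a in v],    # layer 2: square (the "final output")
]

x = [1.0, 2.0]
final_embedding = select_embedding(toy_layers, x)                 # last layer
intermediate_embedding = select_embedding(toy_layers, x, layer_index=1)
```

In practice one would train a downstream probe on each layer's features, record a validation score per layer, and pass those scores to `best_layer` to decide which layer to use. With a real framework the same pattern is usually implemented by capturing hidden states during the forward pass rather than by re-running a list of functions.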