(Paper Summary) CodeRetriever: Large-scale Contrastive Pre-training for Code Search (Paper) (Code)
Key points
CodeRetriever learns function-level code semantic representations through large-scale code-text contrastive pre-training.
- Architecture
  - Same architecture as CodeBERT
  - Initialized from GraphCodeBERT weights
  - The text encoder and the code encoder share weights
- Training
  - Contrastive loss
  - Positive pairs
    - Code-Document: a piece of code and its corresponding document
    - Code-Comment: pure code and its corresponding in-line comments (extracted with a code parser (tree-sitter); 1.9 million code-comment pairs collected)
    - Code-Code
      - Code is grouped using NameMatcher and DocMatcher, both based on a SimCSE model.
      - A binary classifier is trained with DocMatcher, run in inference over the grouped code, and only pairs scoring at or above a threshold are kept.
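
To make the training notes above more concrete, here is a minimal sketch of an in-batch contrastive loss plus the threshold filter used for Code-Code pairs. It is not the paper's implementation: the cosine similarity, the temperature of 0.05, the threshold of 0.5, and the function names `in_batch_contrastive_loss` and `keep_confident_pairs` are all assumptions made for illustration. Plain Python lists stand in for encoder outputs; with shared weights, the same encoder would produce both sides of each pair.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against a zero vector
    return [x / norm for x in vec]


def in_batch_contrastive_loss(queries, keys, temperature=0.05):
    """Mean InfoNCE-style loss (assumed form, not taken from the paper).

    keys[i] is the positive for queries[i]; every other key in the batch
    acts as a negative.
    """
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def keep_confident_pairs(scored_pairs, threshold=0.5):
    """Keep (code_a, code_b) pairs whose classifier score is at or above the threshold."""
    return [(a, b) for a, b, score in scored_pairs if score >= threshold]


# Toy batch: two "text" vectors and their matching "code" vectors.
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
code_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(text_vecs, code_vecs)
shuffled_loss = in_batch_contrastive_loss(text_vecs, code_vecs[::-1])

# Hypothetical classifier scores for candidate Code-Code pairs.
candidates = [("f", "g", 0.9), ("f", "h", 0.2)]
kept = keep_confident_pairs(candidates)
```

In this toy batch the loss is close to zero when each text vector lines up with its own code vector and becomes large when the positives are swapped, which is the behavior the contrastive objective rewards. The same loss function could be applied to all three kinds of positive pairs (Code-Document, Code-Comment, Code-Code), since only the source of the pairs differs.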