(Concept summary) Tokenization (tutorial)
Corpus: (hug, 10), (pug, 5), (pun, 12), (bun, 4), (hugs, 5)
BPE (Byte-Pair Encoding)
Tokenization
- Apply the merge rules (e.g., hu+g->hug) sequentially, in the order they were learned.
Merge rule learning: repeatedly add the most frequent adjacent token pair to the vocab.
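The sequential application of merge rules can be sketched as follows. This is a minimal illustration, not a production tokenizer: the function name `bpe_tokenize` and the `MERGES` list are hypothetical, and the three merges are the ones this toy corpus yields (u+g, u+n, h+ug).

```python
def bpe_tokenize(word, merges):
    """Split a word into characters, then apply each merge rule in learned order."""
    tokens = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Merge the pair wherever it appears as two adjacent tokens.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Assumed merge list for the toy corpus in these notes.
MERGES = [("u", "g"), ("u", "n"), ("h", "ug")]

print(bpe_tokenize("hugs", MERGES))  # ['hug', 's']
print(bpe_tokenize("bug", MERGES))   # ['b', 'ug']
```

Order matters: h+ug can only fire after u+g has produced the token ug.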
- 1st Iter
- Vocabulary: [b, g, h, n, p, s, u, ug]
- Corpus: (h+ug, 10), (p+ug, 5), (p+u+n, 12), (b+u+n, 4), (h+ug+s, 5)
- 2nd Iter
- Vocabulary: [b, g, h, n, p, s, u, ug, un]
- Corpus: (h+ug, 10), (p+ug, 5), (p+un, 12), (b+un, 4), (h+ug+s, 5)
- 3rd Iter
- Vocabulary: [b, g, h, n, p, s, u, ug, un, hug]
- Corpus: (hug, 10), (p+ug, 5), (p+un, 12), (b+un, 4), (hug+s, 5)
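The three iterations above can be reproduced with a short training loop. This is a sketch under simple assumptions: words are pre-split into characters, ties go to the first pair encountered, and the names `learn_bpe` and `CORPUS` are illustrative only.

```python
from collections import Counter

CORPUS = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def learn_bpe(word_freqs, num_merges):
    """Return the learned merges and the final split of every word."""
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            tokens = splits[word]
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged.
        for word in splits:
            tokens, merged, i = splits[word], [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            splits[word] = merged
    return merges, splits


merges, splits = learn_bpe(CORPUS, 3)
print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ['hug', 's']
```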
Unigram
Tokenization: pick the segmentation with the highest probability (the product of its token probabilities).
- pug is tokenized as p+ug or pu+g (on a tie, choose with a consistent rule)
- p+u+g : 17/210 * 36/210 * 20/210 = 0.001322
- p+ug : 17/210 * 20/210 = 0.007710
- pu+g : 17/210 * 20/210 = 0.007710
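Finding the most probable segmentation can be sketched with a Viterbi-style dynamic program. The assumptions here are mine: token probabilities are the substring counts of the toy corpus divided by their total, ties are broken toward the later split point, and the names `FREQS` and `best_segmentation` are hypothetical.

```python
import math

# Substring counts for the toy corpus; probabilities are count / total.
FREQS = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17,
         "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5}


def best_segmentation(word, freqs=FREQS):
    """Return (tokens, probability) of the highest-probability split of word."""
    total = sum(freqs.values())
    # best[i] = (log-probability of the best split of word[:i], start of its last token)
    best = [(-math.inf, None)] * (len(word) + 1)
    best[0] = (0.0, None)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in freqs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(freqs[piece] / total)
                # ">=" is the assumed tie rule: prefer the later split point.
                if score >= best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None, 0.0  # word cannot be built from the vocab
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], math.exp(best[-1][0])


print(best_segmentation("hugs")[0])  # ['hug', 's']
print(best_segmentation("pug")[0])   # ['pu', 'g']
```

Any fixed tie rule works; what matters is that the same word always gets the same split.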
Vocab construction: start with every substring in the vocab, then repeatedly remove the substrings whose removal raises the loss the least.
Substring frequencies: (h, 15) (u, 36) (g, 20) (hu, 15) (ug, 20) (p, 17) (pu, 17) (n, 16) (un, 16) (b, 4) (bu, 4) (s, 5) (hug, 15) (gs, 5) (ugs, 5)
- Tokenization of each word
- hug (score 0.071428)
- pu+g or p+ug (score 0.007710)
- pu+n or p+un (score 0.006168)
- bu+n or b+un (score 0.001451)
- hug+s (score 0.001701)
- loss: 10 * (-log(0.071428)) + 5 * (-log(0.007710)) + 12 * (-log(0.006168)) + 4 * (-log(0.001451)) + 5 * (-log(0.001701)) = 169.8
- 1st Iter
- [h, u, g, hu, ug, p, pu, n, un, b, bu, s, hug, gs, ugs]
- 2nd Iter
- [h, u, g, hu, ug, p, pu, n, un, b, bu, s, hug, gs, ugs]
- [h, u, g, hu, ug, p,
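The loss and the effect of removing a token can be checked numerically. This sketch makes several assumptions of its own: the log is natural, probabilities stay at count / 210 after a token is removed (no renormalization), single characters are never removed so every word stays tokenizable, and all names (`segmentations`, `corpus_loss`, `increase`) are illustrative.

```python
import math

CORPUS = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
FREQS = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17,
         "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5}
TOTAL = sum(FREQS.values())  # 210


def segmentations(word, vocab):
    """Yield every way to split word into tokens found in vocab."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in vocab:
            for rest in segmentations(word[i:], vocab):
                yield [head] + rest


def best_prob(word, vocab):
    """Probability of the most likely segmentation of word."""
    return max((math.prod(vocab[t] / TOTAL for t in seg)
                for seg in segmentations(word, vocab)), default=0.0)


def corpus_loss(vocab):
    """Frequency-weighted negative log-likelihood of the corpus."""
    return sum(freq * -math.log(best_prob(word, vocab))
               for word, freq in CORPUS.items())


base = corpus_loss(FREQS)
# Loss increase when each multi-character token is dropped on its own.
increase = {}
for token in FREQS:
    if len(token) > 1:
        reduced = {t: f for t, f in FREQS.items() if t != token}
        increase[token] = corpus_loss(reduced) - base

print(round(base, 1))  # 169.8
for token, delta in sorted(increase.items(), key=lambda kv: kv[1]):
    print(token, round(delta, 2))
```

On this corpus, dropping pu leaves the loss unchanged because p+ug scores the same as pu+g, while dropping hug raises it by about 23.5, so pu is a far better candidate for removal.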
WordPiece
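Tokenization
- Only the final vocab is used; the merges from training are not replayed. Repeatedly take the longest prefix of the remaining word that is in the vocab (e.g., hug -> hu ##g when hu is the longest match).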
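BERT-style WordPiece tokenizers commonly split a word by greedy longest-match lookup against the final vocab instead of replaying merges. A minimal sketch of that lookup follows, assuming a `##` prefix for non-initial pieces and an `[UNK]` fallback for the whole word; the function name and the sample vocab are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate from the right until it is in the vocab.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        tokens.append(match)
        start = end
    return tokens


# Assumed vocab: base characters plus the pieces ##gs and hu.
VOCAB = {"b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu"}

print(wordpiece_tokenize("hugs", VOCAB))  # ['hu', '##gs']
print(wordpiece_tokenize("pug", VOCAB))   # ['p', '##u', '##g']
print(wordpiece_tokenize("bum", VOCAB))   # ['[UNK]']
```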
Merge rule learning: repeatedly add the pair that maximizes (pair frequency) / (first token frequency * second token frequency).
- 1st Iter
- Vocabulary: [b, h, p, ##g, ##n, ##s, ##u, ##gs]
- Corpus: (h ##u ##g, 10), (p ##u ##g, 5), (p ##u ##n, 12), (b ##u ##n, 4), (h ##u ##gs, 5)
- 2nd Iter
- Vocabulary: [b, h, p, ##g, ##n, ##s, ##u, ##gs, hu]
- Corpus: (hu ##g, 10), (p ##u ##g, 5), (p ##u ##n, 12), (b ##u ##n, 4), (hu ##gs, 5)
- 3rd Iter (hu+##gs scores 5/(15*5) = 1/15, above hu+##g at 10/(15*15) = 2/45)
- Vocabulary: [b, h, p, ##g, ##n, ##s, ##u, ##gs, hu, hugs]
- Corpus: (hu ##g, 10), (p ##u ##g, 5), (p ##u ##n, 12), (b ##u ##n, 4), (hugs, 5)
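The pair score can be computed exactly with fractions to see why ##g+##s wins first even though it is the rarest pair. This is a sketch with assumptions of my own: counts are taken over the current splits, ties go to the first pair encountered, and `learn_wordpiece` is an illustrative name.

```python
from collections import Counter
from fractions import Fraction

CORPUS = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def learn_wordpiece(word_freqs, num_merges):
    """Return [(pair, score), ...] for each merge and the final word splits."""
    # Non-initial characters carry the "##" continuation prefix.
    splits = {w: [c if i == 0 else "##" + c for i, c in enumerate(w)]
              for w in word_freqs}
    merges = []
    for _ in range(num_merges):
        token_counts, pair_counts = Counter(), Counter()
        for word, freq in word_freqs.items():
            tokens = splits[word]
            for token in tokens:
                token_counts[token] += freq
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # score = freq(pair) / (freq(first) * freq(second))
        scores = {pair: Fraction(count, token_counts[pair[0]] * token_counts[pair[1]])
                  for pair, count in pair_counts.items()}
        best = max(scores, key=scores.get)
        merges.append((best, scores[best]))
        first, second = best
        new_token = first + (second[2:] if second.startswith("##") else second)
        for word in splits:
            tokens, merged, i = splits[word], [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(new_token)
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            splits[word] = merged
    return merges, splits


merges, splits = learn_wordpiece(CORPUS, 2)
for pair, score in merges:
    print(pair, score)  # ('##g', '##s') 1/20, then ('h', '##u') 1/36
```

Dividing by the individual token frequencies penalizes pairs built from very common tokens such as ##u, which is why the rare but exclusive pair ##g+##s is merged before anything containing ##u.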