(Concept summary) Tokenization (tutorial)

Corpus: (hug, 10), (pug, 5), (pun, 12), (bun, 4), (hugs, 5)

BPE (Byte-Pair Encoding)

1st Iter
- Vocabulary: [b, g, h, n, p, s, u, ug]
- Corpus: (h+ug, 10), (p+ug, 5), (p+u+n, 12), (b+u+n, 4), (h+ug+s, 5)
2nd Iter
- Vocabulary: [b, g, h, n, p, s, u, ug, un]
- Corpus: (h+ug, 10), (p+ug, 5), (p+un, 12), (b+un, 4), (h+ug+s, 5)
3rd Iter
- Vocabulary: [b, g, h, n, p, s, u, ug, un, hug]
- Corpus: (hug, 10), (p+ug, 5), (p+un, 12), (b+un, 4), (hug+s, 5)
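
The three merges above can be reproduced with a short training loop. This is a minimal sketch, not a reference implementation: the function name `train_bpe` and its return values are made up for illustration, and it assumes words are split into single characters with no end-of-word marker.

```python
from collections import Counter

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def train_bpe(word_freqs, num_merges):
    """Toy BPE trainer (hypothetical helper): merge the most frequent adjacent pair each round."""
    splits = {word: list(word) for word in word_freqs}           # start from characters
    vocab = sorted({ch for word in word_freqs for ch in word})   # base vocabulary
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pair_freqs[(a, b)] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)  # first-seen pair wins ties
        merges.append(best)
        vocab.append(best[0] + best[1])
        # Apply the merge to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return vocab, merges, splits

vocab, merges, splits = train_bpe(corpus, 3)
print(vocab)
print({word: "+".join(symbols) for word, symbols in splits.items()})
```

Calling `train_bpe(corpus, n)` with a larger `n` keeps merging until no adjacent pairs remain.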

Unigram

pug is tokenized as p+ug or pu+g (when the scores tie, pick one by a consistent rule)
- p+u+g : 0.001322
- p+ug : 0.007710
- pu+g : 0.007710
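
To see where such scores come from, the sketch below enumerates every segmentation of pug and multiplies the token probabilities. It assumes each token's probability is its substring frequency divided by the sum of all substring frequencies (210 for this corpus); the helper names `segmentations` and `probability` are made up for illustration.

```python
freqs = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17,
         "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5}
total = sum(freqs.values())  # 210

def segmentations(word):
    """Yield every way to split `word` into tokens that exist in `freqs`."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in freqs:
            for rest in segmentations(word[end:]):
                yield [piece] + rest

def probability(tokens):
    """Unigram assumption: tokens are independent, so probabilities multiply."""
    p = 1.0
    for token in tokens:
        p *= freqs[token] / total
    return p

for tokens in segmentations("pug"):
    print("+".join(tokens), round(probability(tokens), 6))
```

Brute-force enumeration is fine for three-letter words but grows exponentially with word length, which is why practical implementations use dynamic programming instead.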

Substring frequencies (sum 210; each token's probability is its frequency / 210): (h, 15) (u, 36) (g, 20) (hu, 15) (ug, 20) (p, 17) (pu, 17) (n, 16) (un, 16) (b, 4) (bu, 4) (s, 5) (hug, 15) (gs, 5) (ugs, 5)
Tokenization of each word
- hug (score 0.071428)
- pu+g or p+ug (score 0.007710)
- pu+n or p+un (score 0.006168)
- bu+n or b+un (score 0.001451)
- hug+s (score 0.001701)
Loss (natural log): 10 * (-log(0.071428)) + 5 * (-log(0.007710)) + 12 * (-log(0.006168)) + 4 * (-log(0.001451)) + 5 * (-log(0.001701)) = 169.8
1st Iter
- Vocabulary: [h, u, g, hu, ug, p, pu, n, un, b, bu, s, hug, gs, ugs]
2nd Iter
- Vocabulary: [h, u, g, hu, ug, p, pu, n, un, b, bu, s, hug, gs, ugs]
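
The per-word scores and the corpus loss in this section can be checked with a small Viterbi search. This is a sketch under stated assumptions: token probability is frequency / 210, the loss uses the natural logarithm, and ties go to the later split point (one arbitrary but consistent rule). The names `best_segmentation` and `corpus_loss` are made up for illustration.

```python
import math

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
freqs = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17,
         "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5}
total = sum(freqs.values())  # 210

def best_segmentation(word, vocab):
    """Viterbi search: return (tokens, probability) of the most likely split."""
    # best[i] holds (probability, tokens) for the prefix word[:i].
    best = [(1.0, [])] + [(0.0, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            prev_prob, prev_tokens = best[start]
            if piece in vocab and prev_tokens is not None:
                prob = prev_prob * vocab[piece] / total
                if prob >= best[end][0]:  # ties: keep the later split point
                    best[end] = (prob, prev_tokens + [piece])
    prob, tokens = best[-1]
    return tokens, prob

def corpus_loss(vocab):
    """Negative log-likelihood of the corpus under each word's best segmentation."""
    return sum(freq * -math.log(best_segmentation(word, vocab)[1])
               for word, freq in corpus.items())

for word in corpus:
    tokens, prob = best_segmentation(word, freqs)
    print(word, "+".join(tokens), round(prob, 6))
loss = corpus_loss(freqs)
print(round(loss, 1))
```

Passing a copy of `freqs` with one token removed to `corpus_loss` shows how much the loss would change without that token, which is the quantity a Unigram trainer uses to decide what to prune.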

WordPiece

1st Iter
- Vocabulary: [b, h, p, ##g, ##n, ##s, ##u, ##gs]
- Corpus: (h ##u ##g, 10), (p ##u ##g, 5), (p ##u ##n, 12), (b ##u ##n, 4), (h ##u ##gs, 5)
2nd Iter
- Vocabulary: [b, h, p, ##g, ##n, ##s, ##u, ##gs, hu]
- Corpus: (hu ##g, 10), (p ##u ##g, 5), (p ##u ##n, 12), (b ##u ##n, 4), (hu ##gs, 5)
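3rd Iter (with score = freq(pair) / (freq(first) * freq(second)), hu ##gs scores 5 / (15 * 5) = 1/15, above hu ##g at 10 / (15 * 15) = 2/45 and the other pairs at 1/21)
- Vocabulary: [b, h, p, ##g, ##n, ##s, ##u, ##gs, hu, hugs]
- Corpus: (hu ##g, 10), (p ##u ##g, 5), (p ##u ##n, 12), (b ##u ##n, 4), (hugs, 5)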
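
The WordPiece iterations can be traced with the sketch below. It rests on assumptions not stated in these notes: the merge score is freq(pair) / (freq(first) * freq(second)), and ties go to the pair seen first in corpus order. `train_wordpiece` is a made-up name, and exact fractions are used so that tied scores compare equal.

```python
from collections import Counter
from fractions import Fraction

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def train_wordpiece(word_freqs, num_merges):
    """Toy WordPiece trainer (hypothetical helper) using the assumed pair score."""
    # The first character stays bare; every later character gets the "##" prefix.
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}
    vocab = sorted({t for s in splits.values() for t in s},
                   key=lambda t: (t.startswith("##"), t))
    merges = []
    for _ in range(num_merges):
        token_freqs, pair_freqs = Counter(), Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for token in symbols:
                token_freqs[token] += freq
            for a, b in zip(symbols, symbols[1:]):
                pair_freqs[(a, b)] += freq
        if not pair_freqs:
            break
        scores = {pair: Fraction(f, token_freqs[pair[0]] * token_freqs[pair[1]])
                  for pair, f in pair_freqs.items()}
        best = max(scores, key=scores.get)  # first-seen pair wins ties
        new_token = best[0] + best[1][2:]   # drop the "##" of the second piece
        merges.append(best)
        vocab.append(new_token)
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(new_token)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return vocab, merges, splits

vocab, merges, splits = train_wordpiece(corpus, 3)
print(merges)
print({word: " ".join(symbols) for word, symbols in splits.items()})
```

Because the score divides by the frequencies of both parts, a pair of rare tokens that almost always occur together can beat a pair that is simply frequent, which is the main difference from the BPE loop.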