Making Misal — India's First Competitive Marathi LLM

1 min read · 13 Apr 2024

Originally published on smallstep.ai on April 13, 2024. Read the full post here: smallstep.ai/making-misal

Overview#

Misal is India's first competitive Marathi LLM — 7B and 1B parameter models pretrained and finetuned on ~2B Marathi tokens, with a custom SentencePiece tokenizer that fixes Llama's 3–5x token inefficiency on Devanagari script.

Highlights:

  • Custom tokenizer — 15K Marathi tokens added to Llama's vocabulary, cutting tokens-per-word 3–5x
  • Pretraining — LoRA-based continued pretraining of Llama2 7B/1B on 2B Marathi tokens, run on an A100
  • Instruction tuning — 200K Marathi instructions curated from Alpaca translations + IndicQuestionGeneration
  • Eval — beat GPT-3.5 on Marathi reading comprehension benchmarks
  • Open-sourced — models on Hugging Face, plus the tokenizer, pretraining configs, and eval framework
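
The tokenizer point can be made concrete with a toy sketch. This is a simplified stand-in, not Misal's actual SentencePiece pipeline: it mimics how a base vocabulary with no Devanagari coverage falls back to UTF-8 bytes (as Llama's tokenizer does for unseen scripts), while adding whole-word Marathi tokens collapses each word to a single token.

```python
# Toy illustration of vocabulary extension (hypothetical helper, not Misal's
# real tokenizer): greedy longest-match over a vocab, with a byte-level
# fallback for text the vocab doesn't cover.

def tokenize(text: str, vocab: set) -> list:
    """Greedy longest-match tokenization; unknown chars fall back to UTF-8 bytes."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Byte-level fallback: each Devanagari character costs 3 byte tokens
            tokens.extend("<0x%02X>" % b for b in text[i].encode("utf-8"))
            i += 1
    return tokens

word = "नमस्कार"  # "hello" in Marathi

base_vocab = set()             # no Marathi coverage: everything byte-falls-back
extended_vocab = {"नमस्कार"}    # vocab extended with the whole word

print(len(tokenize(word, base_vocab)))      # many byte tokens for one word
print(len(tokenize(word, extended_vocab)))  # a single token
```

The real gain comes from adding ~15K high-frequency Marathi subwords learned by SentencePiece, but the mechanism is the same: longer in-vocabulary matches mean fewer tokens per word, which directly cuts training and inference cost.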

Read the full write-up#

The complete technical breakdown — data curation, tokenizer training, pretraining recipe, finetuning, and evals — lives on the smallstep.ai site:

smallstep.ai/making-misal

Coverage#