ChatGPT has revolutionized how we work and study, but there is a growing gap between how well state-of-the-art models perform in English and how well they perform in Italian.

Data Preparation

Data quality matters more than quantity for fine-tuning. Reference datasets:

  • Orca — reasoning traces distilled from GPT-4
  • WizardLM — automatic evolution of instructions
  • Alpaca — instruction-following dataset

The goal: create high-quality instruction-response pairs in Italian.
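A common way to store such pairs is the Alpaca-style JSON schema (instruction, input, output), serialized as JSONL. The sketch below builds a tiny Italian dataset in that format; the field names follow the Alpaca convention, and the example texts and the output filename are purely illustrative.

```python
import json

# Two illustrative Italian instruction-response pairs in the Alpaca-style
# schema: "instruction" is the task, "input" the optional context,
# "output" the expected response.
pairs = [
    {
        "instruction": "Riassumi il seguente testo in una frase.",
        "input": "Il fine-tuning adatta un modello pre-addestrato a un compito specifico.",
        "output": "Il fine-tuning specializza un modello già addestrato.",
    },
    {
        "instruction": "Traduci in italiano: 'Data quality matters more than quantity.'",
        "input": "",
        "output": "La qualità dei dati conta più della quantità.",
    },
]

# Serialize to JSONL (one JSON object per line), the layout most
# fine-tuning scripts expect; ensure_ascii=False preserves accented characters.
with open("italian_instructions.jsonl", "w", encoding="utf-8") as f:
    for p in pairs:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")
```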

Fine-Tuning

Key choices:

  1. Base model — e.g. Llama 2, Mixtral
  2. Tokenizer — crucial for non-English languages
  3. Hyperparameters — learning rate, batch size, epochs
  4. Efficiency techniques — LoRA, QLoRA, Flash Attention
  5. Evaluation — automatic and human metrics
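Of the efficiency techniques above, LoRA is the easiest to illustrate: instead of updating a full weight matrix W, you train two small low-rank matrices A and B and add their scaled product to the frozen weight. The NumPy sketch below shows only the core idea; all dimensions are illustrative, and a real setup would use a library such as Hugging Face PEFT.

```python
import numpy as np

# LoRA in one line: W_eff = W + (alpha / r) * B @ A, where
# A (r x d_in) and B (d_out x r) are trainable and r << min(d_in, d_out).
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init: no change at start

def lora_forward(x):
    # Frozen base path plus the low-rank update, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, the LoRA output equals the frozen base output,
# so training starts exactly from the pretrained model.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters drop from d_out * d_in to r * (d_in + d_out).
full_params = d_out * d_in        # 4096
lora_params = r * (d_in + d_out)  # 1024
```

The parameter saving is what makes single-GPU fine-tuning feasible; QLoRA pushes this further by keeping the frozen weights quantized to 4 bits.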

The challenge for Italian is twofold: less training data is available, and tokenization is less efficient than for English, so the same sentence consumes more of the context window and more compute.