A Pipeline to Improve LLMs in Italian
ChatGPT has revolutionized how we work and study, but there is a growing gap between what SOTA models can do in English and what they can do in Italian.
Data Preparation
Data quality matters more than quantity for fine-tuning. Reference datasets:
- Orca — reasoning traces from GPT-4
- WizardLM — automatic evolution of instructions
- Alpaca — instruction-following dataset
The goal: create high-quality instruction-response pairs in Italian.
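Alpaca popularized a simple JSON schema for such pairs that many fine-tuning pipelines still accept. A minimal sketch of one Italian record in that format (the example text itself is invented for illustration):

```python
import json

# One hypothetical instruction-response pair in the Alpaca schema
# (instruction / input / output). The Italian sentences are made up.
example = {
    "instruction": "Riassumi il seguente testo in una frase.",
    "input": "L'Italia è una repubblica parlamentare situata nell'Europa meridionale, con capitale Roma.",
    "output": "L'Italia è una repubblica parlamentare dell'Europa meridionale con capitale Roma.",
}

# Serialize as one JSON line, the common JSONL layout for training data.
# ensure_ascii=False keeps accented Italian characters readable.
record = json.dumps(example, ensure_ascii=False)
print(record)
```

Collecting thousands of such records, either translated with care or written natively in Italian, is the core of the data-preparation step.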
Fine-Tuning
Key choices:
- Base model — e.g. LLaMA 2 or Mixtral
- Tokenizer — crucial for non-English languages
- Hyperparameters — learning rate, batch size, epochs
- Efficiency techniques — LoRA, QLoRA, Flash Attention
- Evaluation — automatic and human metrics
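To see why LoRA makes fine-tuning affordable, consider the parameter count: instead of updating a full d_out × d_in weight matrix, LoRA trains two low-rank factors B (d_out × r) and A (r × d_in). A back-of-the-envelope sketch, using a hypothetical 4096 × 4096 attention projection of the kind found in 7B-scale models:

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for one LoRA adapter on a d_out x d_in layer."""
    return d_out * rank + rank * d_in

# Hypothetical layer: 4096 x 4096 projection, LoRA rank 8.
full_params = 4096 * 4096                        # frozen weight matrix
lora_params = lora_trainable_params(4096, 4096, 8)

# LoRA trains only a small fraction of the layer's parameters.
print(f"trainable fraction: {lora_params / full_params:.2%}")
```

QLoRA pushes this further by also quantizing the frozen base weights to 4 bits, so the same trick fits on a single consumer GPU.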
The challenge for Italian is twofold: less training data is available, and tokenization is less efficient than for English, since most tokenizers are trained on English-heavy corpora and therefore split Italian words into more subword tokens.
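One rough way to quantify the tokenization penalty is "fertility": the average number of subword tokens per word. The sketch below uses hypothetical token counts; in practice you would obtain real counts from the model's tokenizer (e.g. via Hugging Face `AutoTokenizer`):

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    return num_tokens / num_words

# Hypothetical counts for the same sentence in both languages.
english = fertility(num_tokens=12, num_words=10)  # 1.2 tokens/word
italian = fertility(num_tokens=17, num_words=10)  # 1.7 tokens/word

# Higher fertility means fewer words fit in the same context window
# and more compute is spent per sentence, in training and at inference.
print(f"relative cost of Italian: {italian / english:.2f}x")
```

This is why the tokenizer choice listed above matters so much for non-English languages: a vocabulary with better Italian coverage lowers fertility directly.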