A Pipeline to Improve LLMs in Italian
ChatGPT has revolutionized how we work and study, but there is a growing gap between what SOTA models can do in English and what they can do in Italian.
Data Preparation
Data quality matters more than quantity for fine-tuning. Reference datasets:
- Orca — reasoning traces from GPT-4
- WizardLM — automatic evolution of instructions
- Alpaca — instruction-following dataset
The goal: create high-quality instruction-response pairs in Italian.
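Alpaca popularized a simple JSON schema for such pairs that many fine-tuning pipelines still accept. A minimal sketch of one Italian record in that format (the example text itself is invented for illustration):

```python
import json

# One hypothetical instruction-response pair in the Alpaca schema
# (instruction / input / output). The Italian sentences are made up.
example = {
    "instruction": "Riassumi il seguente testo in una frase.",
    "input": "L'Italia è una repubblica parlamentare situata nell'Europa meridionale, con capitale Roma.",
    "output": "L'Italia è una repubblica parlamentare dell'Europa meridionale con capitale Roma.",
}

# Serialize as one JSON line, the common JSONL layout for training data.
# ensure_ascii=False keeps accented Italian characters readable.
record = json.dumps(example, ensure_ascii=False)
print(record)
```

Collecting thousands of such records, either translated with care or written natively in Italian, is the core of the data-preparation step.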
Fine-Tuning
Key choices:
- Base model — e.g. LLaMA 2 or Mixtral
- Tokenizer — crucial for non-English languages
- Hyperparameters — learning rate, batch size, epochs
- Efficiency techniques — LoRA, QLoRA, Flash Attention
- Evaluation — automatic and human metrics
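To see why LoRA makes fine-tuning affordable, consider the parameter count: instead of updating a full d_out × d_in weight matrix, LoRA trains two low-rank factors B (d_out × r) and A (r × d_in). A back-of-the-envelope sketch, using a hypothetical 4096 × 4096 attention projection of the kind found in 7B-scale models:

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for one LoRA adapter on a d_out x d_in layer."""
    return d_out * rank + rank * d_in

# Hypothetical layer: 4096 x 4096 projection, LoRA rank 8.
full_params = 4096 * 4096                        # frozen weight matrix
lora_params = lora_trainable_params(4096, 4096, 8)

# LoRA trains only a small fraction of the layer's parameters.
print(f"trainable fraction: {lora_params / full_params:.2%}")
```

QLoRA pushes this further by also quantizing the frozen base weights to 4 bits, so the same trick fits on a single consumer GPU.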
The challenge for Italian is twofold: less training data is available, and tokenization is less efficient than for English, since most tokenizers are trained on English-heavy corpora and therefore split Italian words into more subword tokens.
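One rough way to quantify the tokenization penalty is "fertility": the average number of subword tokens per word. The sketch below uses hypothetical token counts; in practice you would obtain real counts from the model's tokenizer (e.g. via Hugging Face `AutoTokenizer`):

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    return num_tokens / num_words

# Hypothetical counts for the same sentence in both languages.
english = fertility(num_tokens=12, num_words=10)  # 1.2 tokens/word
italian = fertility(num_tokens=17, num_words=10)  # 1.7 tokens/word

# Higher fertility means fewer words fit in the same context window
# and more compute is spent per sentence, in training and at inference.
print(f"relative cost of Italian: {italian / english:.2f}x")
```

This is why the tokenizer choice listed above matters so much for non-English languages: a vocabulary with better Italian coverage lowers fertility directly.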