Benchmark POC: Benchmarking LLM models on AO
This proof of concept (POC) aims to benchmark large language models (LLMs) on AO. Funders can create a funding pool with a set of questions, and models compete to answer them. Participants can train and submit their models, which are evaluated and ranked on a daily-updated leaderboard. At the end of the funding period, winners are determined based on the leaderboard, and rewards are distributed.
Prerequisites
AOS and ArDrive installed on your system.
Processes
Detailed Process
1. Pool Creation
Upload Dataset
Upload your chosen benchmarking dataset (e.g., SIQA) to the Arweave blockchain using the ArDrive application or CLI.
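From the command line, the upload can be sketched with the ArDrive CLI. The file name, wallet path, and folder ID below are placeholders; check `ardrive --help` for the exact flags available in your installed version:

```shell
# Upload the SIQA dataset file to a folder in your ArDrive (placeholder IDs).
# --wallet-file: your Arweave keyfile
# --parent-folder-id: the destination folder ID inside your drive
ardrive upload-file \
  --wallet-file ./wallet.json \
  --parent-folder-id "<PARENT_FOLDER_ID>" \
  --local-path ./siqa-dataset.json
```

The command prints the resulting Arweave data transaction ID, which you will need when creating the pool.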
Create pool
Create a pool by sending the following message through AOS:
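The exact message schema is defined by the pool process; the sketch below uses AOS (Lua) `Send` syntax with hypothetical action and tag names (`Create-Pool`, `Dataset-TxId`), so adjust them to match the pool process handlers:

```lua
-- Hypothetical schema: action and tag names depend on the pool process implementation.
Send({
  Target = "<POOL_PROCESS_ID>",                   -- the pool-managing process
  Action = "Create-Pool",                         -- assumed action name
  Tags = {
    ["Dataset-TxId"] = "<ARWEAVE_DATASET_TX_ID>"  -- tx ID from the ArDrive upload
  }
})
```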
For details on creating a dataset process ID, refer to the tutorial on our GitHub.
2. Prepare Model
Upload fine-tuned models
Upload two fine-tuned models (e.g., llama3-8B fine-tuned on the alpaca and samsum datasets) to the Arweave blockchain via the ArDrive application or CLI. For example:
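A CLI upload can be sketched as follows; the model file name, wallet path, and folder ID are placeholders, and the flag names assume the ArDrive CLI (verify with `ardrive --help`):

```shell
# Upload a fine-tuned model checkpoint to your ArDrive (placeholder paths and IDs).
ardrive upload-file \
  --wallet-file ./wallet.json \
  --parent-folder-id "<PARENT_FOLDER_ID>" \
  --local-path ./llama3-8b-alpaca.bin
```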
After uploading the model, you receive its data transaction (tx) ID, such as: ISrbGzQot05rs_HKC08O_SmkipYQnqgB1yC3mjZZeEo
3. Model Evaluation
Register models
Join a pool by sending a message to the pool process with the following payload:
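A minimal sketch of the join message in AOS (Lua), assuming hypothetical action and tag names (`Join-Pool`, `Model-TxId`, `Model-Name`); the real schema is defined by the pool process handlers:

```lua
-- Hypothetical schema: adjust action/tag names to the pool process implementation.
Send({
  Target = "<POOL_PROCESS_ID>",
  Action = "Join-Pool",                    -- assumed action name
  Tags = {
    ["Model-TxId"] = "ISrbGzQot05rs_HKC08O_SmkipYQnqgB1yC3mjZZeEo",  -- model data tx ID from the upload step
    ["Model-Name"] = "llama3-8b-alpaca"    -- assumed display label for the leaderboard
  }
})
```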
Once you join a pool, the model is evaluated against the dataset, and the resulting score is sent back to the pool process.
4. Scoring and Leaderboard
Retrieve Leaderboard Results
Retrieve the leaderboard results by sending a message to the pool process with the following payload:
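A sketch of the query in AOS (Lua), with a hypothetical action name (`Get-Leaderboard`); check the pool process handlers for the actual one:

```lua
-- Hypothetical action name: the pool process defines the real handler.
Send({
  Target = "<POOL_PROCESS_ID>",
  Action = "Get-Leaderboard"
})
```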
The leaderboard is updated every 24 hours.
This message executes and displays the model leaderboard within AOS.