Benchmark POC: Benchmarking LLMs on AO
This proof of concept (POC) aims to benchmark large language models (LLMs) on AO. Funders can create a funding pool with a set of questions, and models compete to answer them. Participants can train and submit their models, which are evaluated and ranked on a daily-updated leaderboard. At the end of the funding period, winners are determined based on the leaderboard, and rewards are distributed.
Prerequisites
AOS and ArDrive installed on your system.
Processes
Pool Creation
Model Upload
Model Evaluation
Scoring and Leaderboard
Detailed Process
1. Pool Creation
Upload Dataset
Upload your chosen benchmarking dataset (e.g., SIQA) to the Arweave blockchain using the ArDrive application or CLI.
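For instance, with the ArDrive CLI (a minimal sketch; the wallet path, folder ID, and file name below are placeholders for your own values):

```sh
# Upload the benchmarking dataset (e.g., SIQA) to Arweave via the ArDrive CLI.
# Substitute your own wallet file, ArDrive folder ID, and dataset file.
ardrive upload-file \
  --wallet-file /path/to/wallet.json \
  --parent-folder-id "your-ardrive-folder-id" \
  --local-path ./siqa-dataset.json
```

The CLI's JSON output includes a dataTxId for the uploaded file; keep it, since the pool references the dataset by this ID.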
Create Pool
Create a pool by sending the following message through AOS:
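The exact payload is defined by the POC's pool process; the sketch below assumes hypothetical action and tag names ("Create-Pool", "Dataset-Process-Id", "Duration") and placeholder IDs, and is not the confirmed schema:

```lua
-- Hypothetical sketch of a pool-creation message sent from the AOS shell.
-- In aos, extra fields on Send() become message tags.
Send({
  Target = "POOL_FACTORY_PROCESS_ID",                   -- placeholder: process that creates pools
  Action = "Create-Pool",                               -- hypothetical action tag
  ["Dataset-Process-Id"] = "YOUR_DATASET_PROCESS_ID",   -- dataset process ID (see the GitHub tutorial below)
  ["Duration"] = "30"                                   -- hypothetical funding period, in days
})
```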
For details on creating a dataset process ID, refer to the tutorial on our GitHub.
2. Prepare Model
Upload Fine-Tuned Models
Upload two fine-tuned models (e.g., llama3-8B fine-tuned on the alpaca and samsum datasets) to the Arweave blockchain via the ArDrive application or CLI. For example:
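A sketch using the ArDrive CLI, following the same pattern as the dataset upload (the file name and IDs are placeholders); repeat once per model:

```sh
# Upload a fine-tuned model to Arweave via the ArDrive CLI.
# Substitute your own wallet file, ArDrive folder ID, and model file.
ardrive upload-file \
  --wallet-file /path/to/wallet.json \
  --parent-folder-id "your-ardrive-folder-id" \
  --local-path ./llama3-8b-alpaca.bin
```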

After uploading a model, you receive its data transaction ID (data tx ID), such as: ISrbGzQot05rs_HKC08O_SmkipYQnqgB1yC3mjZZeEo
3. Model Evaluation
Register Models
Join a pool by sending a message to the pool process with the following payload:
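A sketch of the join message, assuming hypothetical action and tag names ("Join-Pool", "Model-TxId") and a placeholder pool process ID:

```lua
-- Hypothetical sketch of registering a model with a pool from the AOS shell.
-- Substitute the real pool process ID and your model's data tx ID.
Send({
  Target = "POOL_PROCESS_ID",                                     -- placeholder pool process
  Action = "Join-Pool",                                           -- hypothetical action tag
  ["Model-TxId"] = "ISrbGzQot05rs_HKC08O_SmkipYQnqgB1yC3mjZZeEo"  -- your model's data tx ID
})
```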
Once you join a pool, your model is evaluated against the pool's dataset, and its score is sent back to the pool process.
4. Scoring and Leaderboard
Retrieve Leaderboard Results
Retrieve the leaderboard results by sending a message to the pool process with the following payload:
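A sketch of the request, assuming a hypothetical "Get-Leaderboard" action name and a placeholder pool process ID:

```lua
-- Hypothetical sketch of requesting the leaderboard from the AOS shell.
Send({
  Target = "POOL_PROCESS_ID",   -- placeholder pool process
  Action = "Get-Leaderboard"    -- hypothetical action tag
})
```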
The leaderboard is updated every 24 hours.
This message executes and displays the model leaderboard within AOS, for example:
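The exact output format depends on the pool process; illustratively, each entry lists a rank, the model's data tx ID, and its score (placeholder values, not real results):

```
Rank  Model (data tx ID)   Score
1     <model-data-tx-id>   <score>
2     <model-data-tx-id>   <score>
```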

