Benchmark POC: Benchmarking LLMs on AO

This proof of concept (POC) aims to benchmark large language models (LLMs) on AO. Funders can create a funding pool with a set of questions, and models compete to answer them. Participants can train and submit their models, which are evaluated and ranked on a daily-updated leaderboard. At the end of the funding period, winners are determined based on the leaderboard, and rewards are distributed.

🔑 Prerequisites

  1. Familiarity with AO, AOS, and ArDrive.

  2. AOS and ArDrive installed on your system.

πŸ“½οΈProcesses

  1. Pool Creation

  2. Model Upload

  3. Model Evaluation

  4. Scoring and Leaderboard

📜 Detailed Process

1. Pool Creation

Upload Dataset

Upload your chosen benchmarking dataset (e.g., SIQA) to the Arweave blockchain using the ArDrive application or CLI.

Create Pool

Create a pool by sending the following message through AOS:
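The exact schema is defined by the pool process; the sketch below, run from the aos REPL, uses hypothetical names for the target process and tags (`Create-Pool`, `Dataset-TX-ID`, `Reward-Pool`):

```lua
-- Minimal sketch, run inside the aos REPL. All names and values below are
-- illustrative assumptions; consult the pool process docs for the real schema.
Send({
  Target = "<pool factory process ID>",     -- assumed target process
  Action = "Create-Pool",                   -- assumed action name
  ["Dataset-TX-ID"] = "<Arweave tx ID of the uploaded dataset>",
  ["Reward-Pool"] = "1000"                  -- assumed funding amount
})
```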

For details on creating a dataset process ID, refer to the tutorial on our GitHub.

2. Model Upload

Upload Fine-Tuned Models

Upload two fine-tuned models (e.g., llama3-8B fine-tuned on the Alpaca and SAMSum datasets) to the Arweave blockchain via the ArDrive application or CLI.

After uploading a model, you receive its data transaction ID (tx ID), for example: ISrbGzQot05rs_HKC08O_SmkipYQnqgB1yC3mjZZeEo. You will need this tx ID when registering the model with a pool.

3. Model Evaluation

Register Models

Join a pool by sending a message to the pool process with the following payload:
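The action and tag names below (`Join-Pool`, `Model-TX-ID`) are assumptions for illustration; the model tx ID is the one returned by the ArDrive upload above:

```lua
-- Sketch only: the real action and tag names come from the pool process.
Send({
  Target = "<pool process ID>",                                     -- the pool to join
  Action = "Join-Pool",                                             -- assumed action name
  ["Model-TX-ID"] = "ISrbGzQot05rs_HKC08O_SmkipYQnqgB1yC3mjZZeEo"   -- data tx ID from ArDrive
})
```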

Once you join a pool, the model starts evaluating the dataset and reports its score back to the pool process.
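For intuition, a pool process might record incoming scores with an aos handler along these lines; the `Submit-Score` action and `Leaderboard` table are illustrative assumptions, not the POC's actual implementation:

```lua
-- Hypothetical pool-side handler that records scores reported by evaluators.
-- The "Submit-Score" action and Leaderboard table are assumptions.
Handlers.add(
  "record-score",
  Handlers.utils.hasMatchingTag("Action", "Submit-Score"),
  function(msg)
    Leaderboard = Leaderboard or {}
    Leaderboard[msg.From] = tonumber(msg.Data)  -- score keyed by sender process ID
  end
)
```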

4. Scoring and Leaderboard

Retrieve Leaderboard Results

Retrieve the leaderboard results by sending a message to the pool process with the following payload:
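A minimal sketch, assuming a `Get-Leaderboard` action name:

```lua
-- Sketch only: the actual action name is defined by the pool process.
Send({
  Target = "<pool process ID>",
  Action = "Get-Leaderboard"  -- assumed action name
})
```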

The leaderboard is updated every 24 hours.

This message executes and displays the model leaderboard within AOS.
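If the pool responds with a reply message instead of printing directly, the newest `Inbox` entry would hold the result:

```lua
-- Inspect the most recent reply received by your aos process
Inbox[#Inbox].Data
```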
