Zoonk.AI.Evals.EvalFiles (Zoonk v0.1.0-dev)

Utility functions to store and retrieve evaluation results for AI models and prompts.

This module helps persist outputs and scores from LLM evaluations, avoiding duplicate processing and enabling comparison between models and prompts.

It organizes the results in a structured directory format under priv/evals.
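
For orientation, a rough end-to-end sketch follows showing how these functions might be combined. The "outputs" directory name, the file names, and the run_llm/2 helper are illustrative assumptions, not part of this module's API.

alias Zoonk.AI.Evals.EvalFiles

model = "openai/gpt-4.1-mini"
prompt = :recommend_courses

# Only call the LLM when its output hasn't been stored yet.
# run_llm/2 is a hypothetical helper standing in for the actual request code.
if not EvalFiles.file_exists?(:model, model, prompt, "outputs", "test_1.json") do
  output = run_llm(model, prompt)
  EvalFiles.store_results(:model, model, prompt, "outputs", "test_1.json", output)
end

# Once scores are computed, publish them to the leaderboard JSON and markdown files.
%{average: 7.76, median: 9.0}
|> EvalFiles.update_leaderboard_json(prompt, model)
|> EvalFiles.update_leaderboard_markdown(prompt)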

Summary

Functions

file_exists?(eval_type, model, prompt, results_dir, filename)
Checks if a model file exists.

load_model_outputs(prompt, model)
Loads all output files for a model and prompt.

load_prompt_outputs(prompt)
Loads all score files for a prompt.

store_results(eval_type, model, prompt, results_dir, filename, data)
Stores results generated by an AI model.

update_leaderboard_json(model_scores, prompt, model)
Updates the leaderboard JSON file with model scores.

update_leaderboard_markdown(leaderboard, prompt_name)
Updates the markdown file with the model leaderboard.

update_scores_markdown(map, prompt_name)
Updates the markdown file with the calculated scores for a prompt.

Types

eval_type()

@type eval_type() :: :model | :prompt

Functions

file_exists?(eval_type, model, prompt, results_dir, filename)

@spec file_exists?(eval_type(), String.t(), atom(), String.t(), String.t()) ::
  boolean()

Checks if a model file exists.

This is useful to avoid sending duplicate requests to LLMs when we've already stored the results.

Examples

iex> file_exists?(:model, "openai/gpt-4.1-mini", :recommend_courses, "outputs", "test_1.json")
true

iex> file_exists?(:prompt, "openai/gpt-4.1-mini", :recommend_courses, "scores", "test_1.json")
false

load_model_outputs(prompt, model)

@spec load_model_outputs(atom() | String.t(), String.t()) :: [map()]

Loads all output files for a model and prompt.

This function reads all JSON files from the outputs directory for a given model and prompt and returns their parsed content.

Examples

iex> load_model_outputs(:recommend_courses, "deepseek-chat-v3-0324")
[%{"usage" => %{...}, "steps" => [...]}, ...]

load_prompt_outputs(prompt)

@spec load_prompt_outputs(atom() | String.t()) :: [map()]

Loads all score files for a prompt.

This function reads all JSON files from the scores directory for a given prompt and returns their parsed content.

Examples

iex> load_prompt_outputs(:recommend_courses)
[%{"usage" => %{...}, "steps" => [...]}, ...]

store_results(eval_type, model, prompt, results_dir, filename, data)

@spec store_results(eval_type(), String.t(), atom(), String.t(), String.t(), map()) ::
  :ok

Stores results generated by an AI model.

We use these results to evaluate the model's performance and to compare it against other models.

Examples

iex> store_results(:model, "openai/gpt-4.1-mini", :recommend_courses, "outputs", "test_1.json", %{})
:ok

iex> store_results(:prompt, "openai/gpt-4.1-mini", :recommend_courses, "scores", "test_1.json", %{})
:ok

update_leaderboard_json(model_scores, prompt, model)

@spec update_leaderboard_json(map(), atom() | String.t(), String.t()) :: map()

Updates the leaderboard JSON file with model scores.

This function creates or updates a JSON file at priv/evals/{prompt_name}_leaderboard.json with the model scores.

Examples

iex> update_leaderboard_json(%{average: 7.76, median: 9.0}, :recommend_courses, "deepseek-chat-v3-0324")
%{"deepseek-chat-v3-0324" => %{average: 7.76, median: 9.0}}

update_leaderboard_markdown(leaderboard, prompt_name)

@spec update_leaderboard_markdown(map(), atom() | String.t()) :: :ok

Updates the markdown file with the model leaderboard.

This function creates or updates the leaderboard section in the markdown file for a given prompt with sorted model scores.

Examples

iex> update_leaderboard_markdown(%{"model1" => %{average: 8.0, median: 9.0}}, :recommend_courses)
:ok

update_scores_markdown(map, prompt_name)

@spec update_scores_markdown(
  %{average: float(), median: float()},
  atom() | String.t()
) :: :ok

Updates the markdown file with the calculated scores for a prompt.

This function creates or updates a markdown file at priv/evals/{prompt_name}.md with the average and median scores.

Examples

iex> update_scores_markdown(%{average: 7.76, median: 9.0}, :recommend_courses)
:ok
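
A minimal sketch of computing those aggregates before writing the file; the score list is made up for illustration, and the median calculation assumes an odd number of scores.

alias Zoonk.AI.Evals.EvalFiles

# Hypothetical per-test scores for a prompt.
scores = [9.0, 9.0, 9.0, 6.0, 5.8]

average = Enum.sum(scores) / length(scores)

# Middle element of the sorted list (simplified; assumes an odd-length list).
median = scores |> Enum.sort() |> Enum.at(div(length(scores), 2))

EvalFiles.update_scores_markdown(%{average: average, median: median}, :recommend_courses)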