Zoonk.AI.Evals.EvalFiles (Zoonk v0.1.0-dev)
Utility functions to store and retrieve evaluation results for AI models and prompts.
This module helps persist outputs and scores from LLM evaluations, avoiding duplicate processing and enabling comparison between models and prompts.
It organizes the results in a structured directory format under priv/evals.
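For example, a minimal sketch of the intended flow, where a check-then-store pattern avoids duplicate LLM calls; run_eval/3 is a hypothetical stand-in for the actual LLM request:

alias Zoonk.AI.Evals.EvalFiles

model = "openai/gpt-4.1-mini"
prompt = :recommend_courses
file = "test_1.json"

# Only call the LLM when this result hasn't been stored yet.
# `run_eval/3` is hypothetical, not part of this module.
if not EvalFiles.file_exists?(:model, model, prompt, "outputs", file) do
  result = run_eval(model, prompt, file)
  EvalFiles.store_results(:model, model, prompt, "outputs", file, result)
end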
Summary
Functions
Checks if a model file exists.
Loads all output files for a model and prompt.
Loads all score files for a prompt.
Stores results generated by an AI model.
Updates the leaderboard JSON file with model scores.
Updates the markdown file with the model leaderboard.
Updates the markdown file with the calculated scores for a prompt.
Functions
Checks if a model file exists.
This is useful to avoid sending duplicated requests to LLMs when we've already stored the results.
Examples
iex> file_exists?(:model, "openai/gpt-4.1-mini", :recommend_courses, "outputs", "test_1.json")
true
iex> file_exists?(:prompt, "openai/gpt-4.1-mini", :recommend_courses, "scores", "test_1.json")
false
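For instance, a sketch that filters a batch of test cases down to the ones still missing stored outputs; the file names here are illustrative:

# Keep only the test cases that still need an LLM run.
pending =
  Enum.reject(["test_1.json", "test_2.json"], fn file ->
    file_exists?(:model, "openai/gpt-4.1-mini", :recommend_courses, "outputs", file)
  end)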
Loads all output files for a model and prompt.
This function reads all JSON files from the outputs directory for a given model and prompt and returns their parsed content.
Examples
iex> load_model_outputs(:recommend_courses, "deepseek-chat-v3-0324")
[%{"usage" => %{...}, "steps" => [...]}, ...]
Loads all score files for a prompt.
This function reads all JSON files from the scores directory for a given prompt and returns their parsed content.
Examples
iex> load_prompt_outputs(:recommend_courses)
[%{"usage" => %{...}, "steps" => [...]}, ...]
Stores results generated by an AI model.
We use these results to evaluate the model's performance and to compare it against other models.
Examples
iex> store_results(:model, "openai/gpt-4.1-mini", :recommend_courses, "outputs", "test_1.json", %{})
:ok
iex> store_results(:prompt, "openai/gpt-4.1-mini", :recommend_courses, "scores", "test_1.json", %{})
:ok
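As a sketch, a grader's score can be stored alongside an existing output; the shape of the score map is an assumption:

# Persist a score next to the output it evaluates.
score = %{score: 9.0, feedback: "Relevant suggestions"}
:ok = store_results(:prompt, "openai/gpt-4.1-mini", :recommend_courses, "scores", "test_1.json", score)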
Updates the leaderboard JSON file with model scores.
This function creates or updates a JSON file in priv/evals/{prompt_name}_leaderboard.json with the model scores.
Examples
iex> update_leaderboard_json(%{average: 7.76, median: 9.0}, :recommend_courses, "deepseek-chat-v3-0324")
%{"deepseek-chat-v3-0324" => %{average: 7.76, median: 9.0}}
Updates the markdown file with the model leaderboard.
This function creates or updates the leaderboard section in the markdown file for a given prompt with sorted model scores.
Examples
iex> update_leaderboard_markdown(%{"model1" => %{average: 8.0, median: 9.0}}, :recommend_courses)
:ok
Updates the markdown file with the calculated scores for a prompt.
This function creates or updates a markdown file in priv/evals/{prompt_name}.md with the average and median scores.
Examples
iex> update_scores_markdown(:recommend_courses, %{average: 7.76, median: 9.0})
:ok
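As a sketch, the summary map can be derived from stored score files before writing the markdown; the "score" key inside each file is an assumption about its shape:

values =
  :recommend_courses
  |> load_prompt_outputs()
  |> Enum.map(& &1["score"])

summary = %{
  average: Enum.sum(values) / max(length(values), 1),
  # Upper median; naive but fine for a sketch.
  median: values |> Enum.sort() |> Enum.at(div(length(values), 2))
}

update_scores_markdown(:recommend_courses, summary)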