LLM Evals

LLM Evals is a custom, LLM-evaluated metric, meaning its score is calculated using an LLM. An LLMEvalMetric is the most versatile type of metric deepeval has to offer, and is capable of evaluating almost any use case.

Required Parameters

To use the LLMEvalMetric, you'll have to provide the following parameters when creating an LLMTestCase:

  • input
  • actual_output

You'll also need to supply any additional parameters, such as expected_output and context, if your evaluation criteria depends on them.
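
For example, a test case for a metric whose criteria refers to the expected output and context might look like the following sketch (the strings are placeholder values, not taken from deepeval's documentation):

from deepeval.test_case import LLMTestCase

# expected_output and context are only needed when your evaluation criteria
# refers to them; the values here are placeholders.
test_case = LLMTestCase(
    input="Who wrote 'Pride and Prejudice'?",
    actual_output="'Pride and Prejudice' was written by Jane Austen.",
    expected_output="Jane Austen",
    context=["'Pride and Prejudice' is an 1813 novel by Jane Austen."]
)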

Example

To create a custom metric that uses LLMs for evaluation, simply instantiate an LLMEvalMetric class and define an evaluation criteria in everyday language:

from deepeval.metrics import LLMEvalMetric
from deepeval.test_case import LLMTestCaseParams

summarization_metric = LLMEvalMetric(
    name="Summarization",
    criteria="Summarization - determine if the actual output is an accurate and concise summarization of the input.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    # evaluation_steps=["Check whether the 'actual output' has omitted any detail from 'input'"],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

There are three mandatory and three optional parameters when instantiating an LLMEvalMetric class:

  • name: name of metric
  • criteria: a description outlining the specific evaluation aspects for each test case.
  • evaluation_params: a list of type LLMTestCaseParams. Include only the parameters that are relevant for evaluation.
  • [Optional] evaluation_steps: a list of strings outlining the exact steps the LLM should take for evaluation. You can only provide either evaluation_steps or criteria, and not both.
  • [Optional] threshold: the passing threshold, defaulted to 0.5.
  • [Optional] model: the model name. This defaults to 'gpt-4-1106-preview', and we currently only support models from (Azure) OpenAI. See the sketch after the danger note below for the optional parameters in use.
danger

For accurate and valid results, only the parameters mentioned in criteria should be included as members of evaluation_params.
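
Putting this together, here is a minimal sketch of a metric that sets the optional threshold and model parameters, and whose evaluation_params list only the parameters its criteria actually mentions. The metric name, criteria wording, and threshold value below are illustrative choices, not values prescribed by deepeval:

from deepeval.metrics import LLMEvalMetric
from deepeval.test_case import LLMTestCaseParams

# The criteria mentions the actual output and the expected output, so those
# two parameters (and only those two) appear in evaluation_params.
correctness_metric = LLMEvalMetric(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,  # stricter than the 0.5 default
    model="gpt-4-1106-preview",  # the default model, stated explicitly here
)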

As mentioned in the metrics introduction section, all of deepeval's metrics return a score ranging from 0 to 1, and a metric is only successful if the evaluation score is equal to or greater than its threshold. An LLMEvalMetric is no exception. You can access the score and reason of each individual LLMEvalMetric:

from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(
    input="The sun is shining bright today",
    actual_output="The weather's getting really hot."
)

summarization_metric.measure(test_case)
print(summarization_metric.score)
print(summarization_metric.reason)
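
If you only care about the pass/fail outcome rather than the raw score, a brief sketch using the metric's is_successful() method and deepeval's assert_test helper (assert_test is typically used inside test files executed with deepeval test run):

from deepeval import assert_test

# After measure() has run, is_successful() reports whether score >= threshold
print(summarization_metric.is_successful())

# Inside a test function, fail the test whenever the metric fails
assert_test(test_case, [summarization_metric])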
note

Remember, you can configure deepeval to use Azure OpenAI for all LLM-based metrics.