Metrics

Quick Summary

In deepeval, a metric serves as a standard of measurement for evaluating the performance of an LLM output based on a specific criterion of interest. Essentially, while the metric acts as the ruler, a test case represents the thing you're trying to measure. deepeval offers a range of default metrics for you to quickly get started with, such as:

  • Hallucination
  • Summarization
  • Faithfulness
  • Answer Relevancy
  • Contextual Relevancy
  • Contextual Precision
  • Contextual Recall
  • Ragas
  • Toxicity
  • Bias

deepeval also offers you a straightforward way to develop your own custom evaluation metrics. All metrics are measured on a test case. Visit the test cases section to learn how to apply any metric on test cases for evaluation.
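
For instance, default metrics are imported from the deepeval.metrics module. Here is a minimal sketch; HallucinationMetric is assumed to follow the same naming convention as AnswerRelevancyMetric, which is used later on this page:

from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric

# Each default metric is instantiated with a passing threshold (explained below)
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
hallucination = HallucinationMetric(threshold=0.5)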

Types of Metrics

A custom metric is a type of metric you can easily create by implementing abstract methods and properties of base classes provided by deepeval. They are extremely versatile and integrate seamlessly with Confident AI without requiring any additional setup. As you'll see later, a custom metric can either be an LLM-Eval (LLM evaluated) or a classic metric. A classic metric is a type of metric whose criteria aren't evaluated using an LLM.
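
For example, a classic custom metric might look something like the minimal sketch below. The LengthMetric criterion is hypothetical, and the sketch assumes that BaseMetric's abstract members are measure(), is_successful(), and the __name__ property:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    # Hypothetical classic metric: passes when the actual output is at
    # least `min_length` characters long (no LLM involved).
    def __init__(self, min_length: int = 10):
        self.min_length = min_length
        self.threshold = 1.0  # score is binary, so only a full pass succeeds

    def measure(self, test_case: LLMTestCase) -> float:
        self.score = 1.0 if len(test_case.actual_output) >= self.min_length else 0.0
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"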

deepeval also offers default metrics. Most default metrics offered by deepeval are LLM-Evals, which means they are evaluated using LLMs. This is deliberate, because LLM-Evals are versatile in nature and align better with human expectations than traditional model-based approaches.

deepeval's LLM-Evals are a step up from other implementations because they:

  • are more reliable, since LLMs are only used for extremely specific tasks during evaluation, which greatly reduces stochasticity and flakiness in scores.
  • provide a comprehensive reason for the scores computed.

All of deepeval's default metrics output a score between 0 and 1, and require a threshold argument to instantiate. A default metric is only successful if the evaluation score is equal to or greater than threshold.

info

All GPT models from OpenAI are available for LLM-Evals (metrics that use LLMs for evaluation). You can switch between models by providing a string corresponding to OpenAI's model names via the optional model argument when instantiating an LLM-Eval.
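
For example, assuming your OpenAI account has access to the model in question, you could pass a model string when instantiating a metric (a minimal sketch using AnswerRelevancyMetric, introduced below):

from deepeval.metrics import AnswerRelevancyMetric

# Use a specific OpenAI model for this LLM-Eval instead of the default
metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4")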

Using OpenAI

To use OpenAI for deepeval's LLM-Evals (metrics evaluated using an LLM), supply your OPENAI_API_KEY in the CLI:

export OPENAI_API_KEY=<your-openai-api-key>

Alternatively, if you're working in a notebook environment (Jupyter or Colab), set your OPENAI_API_KEY in a cell:

%env OPENAI_API_KEY=<your-openai-api-key>

note

Please do not include quotation marks when setting your OPENAI_API_KEY if you're working in a notebook environment.

Using Azure OpenAI

deepeval also allows you to use Azure OpenAI for metrics that are evaluated using an LLM. Run the following command in the CLI to configure your deepeval environment to use Azure OpenAI for all LLM-based metrics.

deepeval set-azure-openai --openai-endpoint=<endpoint> \
--openai-api-key=<api_key> \
--deployment-name=<deployment_name> \
--openai-api-version=<openai_api_version> \
--model-version=<model_version>

Note that the model-version is optional. If you ever wish to stop using Azure OpenAI and move back to regular OpenAI, simply run:

deepeval unset-azure-openai

Measuring a Metric

All metrics in deepeval, including custom metrics that you create:

  • can be executed via the metric.measure() method
  • can have their scores accessed via metric.score
  • can have their statuses accessed via metric.is_successful()
  • can be used to evaluate test cases or entire datasets, with or without Pytest.
  • have a threshold that acts as the passing criterion: metric.is_successful() is only true if metric.score >= threshold.

In addition, most LLM-Evals in deepeval offer a reason for their score, which can be accessed via metric.reason.

Here's a quick example.

export OPENAI_API_KEY=<your-openai-api-key>

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Initialize a test case
test_case = LLMTestCase(input="...", actual_output="...")

# Initialize metric with threshold
metric = AnswerRelevancyMetric(threshold=0.5)

Using this metric, you can either evaluate a test case using deepeval test run:

test_file.py
from deepeval import assert_test
...

def test_answer_relevancy():
    assert_test(test_case, [metric])

deepeval test run test_file.py

Or using the evaluate function:

from deepeval import evaluate
...

evaluate([test_case], [metric])

Or execute the metric directly and get its score:

...

metric.measure(test_case)
print(metric.score)
print(metric.reason)
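
Since the threshold was set to 0.5, you can also check whether the metric passed directly:

print(metric.is_successful())  # True only if metric.score >= 0.5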

For more details on how a metric evaluates a test case, refer to the test cases section.