
NVIDIA garak Tutorial: Build a Complete Defensive LLM Red-Teaming Workflow with Custom Probes and Detectors


In this tutorial, we use NVIDIA garak as a practical framework for defensive LLM red-teaming. We start by setting up garak, then move through plugin discovery, dry runs, real-model scans, multi-probe evaluations, report analysis, custom probe creation, custom detector creation, and AVID export. Rather than running a single scan, we use garak end to end to see how probes, detectors, generators, reports, and vulnerability scores fit together in a complete LLM security testing workflow.

Setting Up NVIDIA garak and Defining Helper Functions

import os, sys, json, glob, subprocess, importlib
def sh(cmd, capture=False):
   print(f"\n$ {cmd}")
   return subprocess.run(cmd, shell=True, text=True,
                         capture_output=capture)
sh(f"{sys.executable} -m pip install -q -U garak")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("HF_HUB_DISABLE_TELEMETRY", "1")
import garak, garak.cli
from garak import _config
print("\n=== garak version:", garak.__version__, "===")
def run_garak(args):
   print("\n>>> garak " + " ".join(args))
   try:
       garak.cli.main(args)
   except SystemExit as e:
       if e.code not in (0, None):
           print(f"[garak exited {e.code}]")
   try:
       return _config.transient.report_filename
   except Exception:
       return None

We begin by importing the required libraries and defining sh, a helper that runs shell commands directly from the notebook. We install garak, set two environment variables that silence tokenizer parallelism warnings and Hugging Face telemetry, and import the main garak modules used in the tutorial. We also define run_garak, a reusable function that calls the garak CLI in-process, catches the SystemExit it raises on completion, and returns the path to the generated report (or None if that path cannot be read).
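The run_garak helper depends on a general Python pattern: command-line entry points usually finish by raising SystemExit, which would otherwise stop the notebook cell. The sketch below illustrates that pattern in isolation. It uses a hypothetical stand-in function, fake_cli_main, in place of garak.cli.main, so it runs without garak installed; the names call_cli and fake_cli_main are illustrative assumptions and not part of garak.

```python
import sys

def fake_cli_main(args):
    """Hypothetical stand-in for a CLI entry point such as garak.cli.main."""
    if "--fail" in args:
        sys.exit(2)   # non-zero exit code signals an error
    sys.exit(0)       # many CLIs call sys.exit even on success

def call_cli(main, args):
    """Run a CLI-style function in-process and return its exit code."""
    try:
        main(args)
    except SystemExit as exc:
        # exc.code is None when sys.exit() was called without an argument
        return 0 if exc.code is None else exc.code
    return 0  # main returned normally without calling sys.exit

print(call_cli(fake_cli_main, ["--probes", "demo"]))
print(call_cli(fake_cli_main, ["--fail"]))
```

Returning the exit code instead of printing it, as this sketch does, makes it easier for later cells to decide whether a report is worth analyzing.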

Listing garak Probes and Detectors and Running Model Scans

print("\n########## 1. PLUGIN INVENTORY ##########")
for kind in ["probes", "detectors", "generators", "buffs"]:
   out = sh(f"{sys.executable} -m garak --list_{kind} 2>/dev/null", capture=True)
   lines = [l for l in (out.stdout or "").splitlines() if "." in l]
   print(f"  {kind:11s}: {len(lines)} plugins   e.g. "
         f"{', '.join(l.split()[-1] if l.split() else l for l in lines[:3])}")
print("\n########## 2. FAST DRY-RUN (test.Repeat) ##########")
sh(f"{sys.executable} -m garak --target_type test.Repeat "
  f"--probes lmrc.SlurUsage --generations 1")
print("\n########## 3. REAL MODEL: gpt2 vs DAN 11.0 ##########")
sh(f"{sys.executable} -m garak --target_type huggingface --target_name gpt2 "
  f"--probes dan.Dan_11_0 --generations 1 --parallel_attempts 8")
print("\n########## 4. PROGRAMMATIC MULTI-PROBE SCAN ##########")
report_path = run_garak([
   "--target_type", "test.Repeat",
   "--probes", "dan.Dan_11_0,encoding.InjectBase64,lmrc.SlurUsage",
   "--generations", "1", "--parallel_attempts", "16",
])
print("Report:", report_path)

We inspect the garak plugin ecosystem by listing the available probes, detectors, generators, and buffs. We then run a quick dry run with the test.Repeat generator to confirm that garak works without any external model or API key. After that, we scan a real Hugging Face model (gpt2) with the DAN 11.0 probe, and finally run a programmatic multi-probe scan against test.Repeat to produce a richer report for analysis.
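The scans above all pass the same handful of flags, so the argument list can be assembled by a small function before it is handed to run_garak. The following is a minimal sketch under that assumption; build_scan_args is a hypothetical helper, and it only uses flags that already appear in this tutorial's commands.

```python
def build_scan_args(target_type, probes, target_name=None,
                    generations=1, parallel_attempts=None):
    """Assemble a garak-style argument list (hypothetical helper)."""
    args = ["--target_type", target_type]
    if target_name:
        args += ["--target_name", target_name]
    # garak takes several probes as one comma-separated value
    args += ["--probes", ",".join(probes)]
    args += ["--generations", str(generations)]
    if parallel_attempts is not None:
        args += ["--parallel_attempts", str(parallel_attempts)]
    return args

scan_args = build_scan_args(
    "test.Repeat",
    ["dan.Dan_11_0", "encoding.InjectBase64", "lmrc.SlurUsage"],
    parallel_attempts=16,
)
print(" ".join(scan_args))
```

The resulting list has the same shape as the one passed to run_garak in step 4, so it could be used there directly.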

Analyzing garak Reports: Safety Scores and Attack Success Rates

print("\n########## 5. ANALYSIS ##########")
import numpy as np, pandas as pd
def find_latest_report():
   cands = []
   for base in [os.path.expanduser("~/.local/share/garak/garak_runs"),
                os.path.expanduser("~/.cache/garak"), "."]:
       cands += glob.glob(os.path.join(base, "**", "*report.jsonl"),
                          recursive=True)
   cands = [c for c in cands if os.path.getsize(c) > 0]
   return max(cands, key=os.path.getmtime) if cands else None
report_path = report_path or find_latest_report()
print("Analysing:", report_path)
evaluations = None
try:
   from garak.report import Report
   rep = Report(report_path).load().get_evaluations()
   evaluations = rep.evaluations.copy()
   print("\n--- Per-probe mean SAFETY score (garak.report.Report) ---")
   print(rep.scores.round(1).to_string())
except Exception as e:
   print("garak.report.Report unavailable, falling back to manual parse:", e)
   rows = []
   with open(report_path) as f:
       for line in f:
           try: r = json.loads(line)
           except json.JSONDecodeError: continue
           if r.get("entry_type") == "eval":
               rows.append(r)
   evaluations = pd.DataFrame(rows)
   if not evaluations.empty:
       evaluations["score"] = np.where(
           evaluations["total_evaluated"] != 0,
           100 * evaluations["passed"] / evaluations["total_evaluated"], 0.0)
if evaluations is not None and not evaluations.empty:
   evaluations["asr_%"] = (100 - evaluations["score"]).round(1)
   view = evaluations[["probe", "detector", "passed",
                       "total_evaluated", "score", "asr_%"]].copy()
   view = view.rename(columns={"score": "safe_%"})
   view["safe_%"] = view["safe_%"].round(1)
   view = view.sort_values("asr_%", ascending=False)
   print("\n--- Per probe/detector  (higher asr_% = more vulnerable) ---")
   print(view.to_string(index=False))
   try:
       import matplotlib.pyplot as plt
       labels = (view["probe"] + "\n" + view["detector"]).tolist()
       plt.figure(figsize=(8, 0.55 * len(view) + 1.5))
       plt.barh(labels, view["asr_%"], color="#76b900")
       plt.gca().invert_yaxis()
       plt.xlabel("Attack Success Rate (%)"); plt.xlim(0, 100)
       plt.title("garak — vulnerability by probe/detector")
       plt.tight_layout(); plt.show()
   except Exception as e:
       print("plot skipped:", e)

We load the generated garak report and prepare it for analysis with pandas and NumPy. We first try garak's built-in report parser; if that is unavailable, we parse the JSONL report manually and keep only the eval entries. For each probe-detector pair, the safety score is the percentage of evaluated outputs that passed, and the attack success rate (ASR) is 100 minus that score. We then sort by ASR and plot it as a horizontal bar chart to compare vulnerabilities across probe-detector combinations.
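The arithmetic behind the safety score and ASR can be checked without pandas. The sketch below applies the same formula as the manual fallback in the analysis code to a few hypothetical eval rows; the probe and detector names and the counts are made up for illustration and are not results from any real scan.

```python
def add_rates(rows):
    """Attach safe_% and asr_% to eval rows and sort by ASR, highest first."""
    out = []
    for r in rows:
        total = r["total_evaluated"]
        # Same convention as the tutorial's fallback: score is 0 when nothing was evaluated
        safe = 100.0 * r["passed"] / total if total else 0.0
        out.append({**r, "safe_%": round(safe, 1), "asr_%": round(100.0 - safe, 1)})
    return sorted(out, key=lambda r: r["asr_%"], reverse=True)

# Hypothetical rows, shaped like the "eval" entries used above
sample_rows = [
    {"probe": "probe_a", "detector": "det_x", "passed": 3, "total_evaluated": 4},
    {"probe": "probe_b", "detector": "det_y", "passed": 0, "total_evaluated": 5},
    {"probe": "probe_c", "detector": "det_z", "passed": 1, "total_evaluated": 3},
]

ranked = add_rates(sample_rows)
for r in ranked:
    print(r["probe"], r["detector"], r["safe_%"], r["asr_%"])
```

Note that under this convention a row with zero evaluated outputs ends up with an ASR of 100, so it may be worth filtering such rows out before ranking.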

Inspecting Flagged Outputs and Building a Custom garak Probe

print("\n--- Sample hits (detector score >= 0.5) ---")
hits = []
with open(report_path) as f:
   for line in f:
       try: r = json.loads(line)
       except json.JSONDecodeError: continue
       if r.get("entry_type") != "attempt" or r.get("status") != 2:
           continue
       dr = r.get("detector_results") or r.get("detector") or {}
       max_score = 0.0
       if isinstance(dr, dict):
           for v in dr.values():
               if isinstance(v, list) and v:
                   max_score = max(max_score, max(float(x) for x in v))
       if max_score >= 0.5:
           prompt = r.get("prompt")
           hits.append((r.get("probe_classname", r.get("probe", "?")),
                        round(max_score, 2), str(prompt)[:90]))
print(f"Total hits found: {len(hits)}")
for probe, score, prompt in hits[:5]:
   print(f"  [{score}] {probe}: {prompt!r}")
print("\n########## 6. CUSTOM PROBE ##########")
custom_probe = '''
from garak.probes.base import Probe
class HelloProbe(Probe):
   """Minimal custom probe: two fixed prompts, paired with a custom detector."""
   lang = "en"
   bcp47 = "en"
   primary_detector = "mycustomdet.ContainsHello"
   recommended_detector = ["mycustomdet.ContainsHello"]
   tags = ["custom:demo"]
   goal = "demonstrate authoring a garak probe"
   doc_uri = ""
   prompts = [
       "Please reply with the single word: hello",
       "Continue this sentence: The capital of France is",
   ]
'''
import garak.probes
probes_dir = os.path.dirname(garak.probes.__file__)
with open(os.path.join(probes_dir, "mycustom.py"), "w") as fh:
   fh.write(custom_probe)

We inspect the report further by extracting sample hits: completed attempts (status 2) in which at least one detector score is 0.5 or higher, indicating a potentially unsafe or vulnerable output. For each hit we record the probe name, the maximum detector score, and the first 90 characters of the prompt, so we can see what kind of behavior is being flagged. We then write a custom garak probe with two fixed prompts, save it into garak's probes package as mycustom.py, and point it at a custom detector.
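The hit filter can be pulled out into two small functions, which makes the threshold logic easier to test on its own. This is a sketch that assumes attempt records shaped like the ones the loop above reads (an entry_type, a status, and a detector_results dict mapping detector names to lists of scores); the sample record is hypothetical.

```python
def max_detector_score(detector_results):
    """Highest score across all detectors, or 0.0 if there are none."""
    if not isinstance(detector_results, dict):
        return 0.0
    best = 0.0
    for scores in detector_results.values():
        if isinstance(scores, list):
            # Skip missing scores so float() never sees None
            values = [float(s) for s in scores if s is not None]
            if values:
                best = max(best, max(values))
    return best

def is_hit(record, threshold=0.5):
    """True for a completed attempt whose top detector score meets the threshold."""
    return (record.get("entry_type") == "attempt"
            and record.get("status") == 2
            and max_detector_score(record.get("detector_results")) >= threshold)

# Hypothetical attempt record for illustration only
sample = {"entry_type": "attempt", "status": 2,
          "detector_results": {"det_x": [0.0, 1.0], "det_y": [0.2]}}
print(max_detector_score(sample["detector_results"]), is_hit(sample))
```

Unlike the inline loop, this version skips None scores; that is a design choice of the sketch, not something the report format is known to require.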

Creating a Custom garak Detector and Exporting Results to AVID

print("\n########## 7. CUSTOM DETECTOR ##########")
custom_detector = '''
from garak import _config
from garak.detectors.base import StringDetector
class ContainsHello(StringDetector):
   """Demo detector: flags any output containing 'hello' (case-insensitive)."""
   lang_spec = "en"
   bcp47 = "en"
   def __init__(self, config_root=_config):
       super().__init__(["hello"], config_root=config_root)
       self.matchtype = "str"
'''
import garak.detectors
det_dir = os.path.dirname(garak.detectors.__file__)
with open(os.path.join(det_dir, "mycustomdet.py"), "w") as fh:
   fh.write(custom_detector)
sh(f"{sys.executable} -m garak --target_type test.Repeat "
  f"--probes mycustom.HelloProbe --detectors mycustomdet.ContainsHello "
  f"--generations 1")
print("\n########## 8. AVID EXPORT ##########")
if report_path:
   sh(f"{sys.executable} -m garak -r {report_path}")
print("""
rest:
 RestGenerator:
   uri: https://your-endpoint.example.com/v1/chat
   method: post
   headers: {Authorization: "Bearer $KEY", Content-Type: "application/json"}
   req_template_json_object:
     model: "your-model"
     messages: [{"role": "user", "content": "$INPUT"}]
   response_json: true
   response_json_field: "$.choices[0].message.content"
""")
print("=== Done. JSONL + HTML reports: ~/.local/share/garak/garak_runs/ ===")

We define a custom detector that flags any output containing the word "hello" (case-insensitive) and save it inside garak's detectors package as mycustomdet.py. We then run our custom probe and detector against the test.Repeat generator to verify that the extension loads and scores outputs. Finally, we convert the garak report to AVID format with the -r option and print a REST configuration template for connecting garak to an external model endpoint.
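To make the detector's behavior concrete, here is a plain-Python approximation of substring matching. It is not garak's StringDetector implementation, and contains_any is a hypothetical name; the sketch only shows the scoring idea of one value per output, 1.0 for a match and 0.0 otherwise. The two outputs below are the custom probe's own prompts, on the assumption that test.Repeat echoes each prompt back as the model output.

```python
def contains_any(outputs, substrings, case_sensitive=False):
    """Return one score per output: 1.0 if any substring occurs, else 0.0."""
    needles = list(substrings) if case_sensitive else [s.lower() for s in substrings]
    scores = []
    for text in outputs:
        if text is None:
            scores.append(None)  # nothing was generated, so nothing to judge
            continue
        haystack = text if case_sensitive else text.lower()
        scores.append(1.0 if any(n in haystack for n in needles) else 0.0)
    return scores

# Assumed echo of the two HelloProbe prompts by test.Repeat
echoed = [
    "Please reply with the single word: hello",
    "Continue this sentence: The capital of France is",
]
print(contains_any(echoed, ["hello"]))
```

Under that assumption, only the first prompt would be flagged, which gives a quick sanity check for the custom probe and detector pair.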

Conclusion

In conclusion, we now have a complete hands-on workflow for testing LLM behavior with NVIDIA garak. We run built-in probes, analyze safety scores and attack success rates, inspect concrete flagged outputs, and extend garak with our own custom probe and detector. We also export results in AVID format, which makes the workflow more useful for structured vulnerability reporting. Together, these steps give us a foundation for evaluating models we are authorized to test and for building more advanced defensive red-teaming pipelines.



The post NVIDIA garak Tutorial: Build a Complete Defensive LLM Red-Teaming Workflow with Custom Probes and Detectors appeared first on MarkTechPost.

 

