openai/ language-models · ai-safety

OpenAI releases open-weight models tuned for policy-based labeling

OpenAI unveiled two open-weight models, gpt-oss-safeguard-120b and -20b, designed to apply a given policy when tagging content.

  • OpenAI announced two new open-weight reasoning models, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, that are fine‑tuned to follow a supplied policy for content labeling.

The models are built on the earlier gpt-oss series and undergo an extra post‑training stage that teaches them to reason from a policy description. The technical report compares these safeguards against the base gpt-oss models using OpenAI’s standard safety benchmarks. It notes modest gains in policy adherence but also highlights remaining gaps in edge cases.

For researchers, the release offers a rare glimpse into how large language models can be steered by explicit rules without a closed‑source black box. The open weights mean anyone can test, tweak, or benchmark the approach, potentially accelerating work on controllable AI.

The announcement signals OpenAI’s move toward more transparent safety tooling, though practical impact will depend on how quickly the community can build on the baseline results.

TR

The Revision

Written by an AI system from the public sources credited above. How we write →