OpenAI lets the model "open its mouth". Take note: insulting AI is very expensive

金色财经_Kala79 day ago

Author: Su Yang, Tencent Technology

On May 8, OpenAI added three new-generation speech models to the API: GPT‑Realtime‑2, which focuses on speech reasoning and dialogue; GPT‑Realtime‑Translate, which handles real-time multilingual translation; and GPT‑Realtime‑Whisper, which focuses on speech-to-text.

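GPT‑Realtime‑2 is OpenAI’s first speech model with GPT‑5-level reasoning capabilities. It shows marked progress on benchmarks: its accuracy on the Big Bench Audio speech-intelligence evaluation reaches 96.6%, and its average pass rate on the Audio MultiChallenge instruction-following evaluation is 48.5%, improvements of 15.2 and 13.8 percentage points, respectively, over the previous-generation GPT‑Realtime‑1.5.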

Based on GPT‑Realtime‑2, voice AI has evolved from simple turn-based question and answer to a form that can continuously listen, reason, call tools and complete tasks as the conversation unfolds.

A voice assistant that “thinks”

The design goal of GPT‑Realtime‑2 is to enable speech models to maintain conversational fluency while having the reasoning and action capabilities required to handle complex transactions.

To make dialogue more natural, the model introduces a preamble mechanism.

Developers can enable short spoken phrases such as "Let me check" or "Wait a minute, I'm checking" so that the user knows the request has been received and is being processed before the full response is generated.

Complementing this are parallel tool calling and tool transparency: the model can call multiple external tools at the same time and tell the user by voice what it is currently doing, for example by saying "Checking your calendar" or "Looking that up", so that the agent stays responsive while it completes the task rather than falling silent.

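When it runs into difficulty, the model proactively says something like "I'm having a bit of trouble right now" and tries to recover, rather than failing silently or simply ending the session.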
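The pattern described above (a preamble, parallel tool calls with spoken progress, and a spoken recovery message on failure) can be sketched in plain Python. This is only an illustration of the control flow, not the Realtime API: the tool functions, the `say` callback, and `handle_request` are all hypothetical names invented for this example.

```python
import asyncio

# Hypothetical stand-ins for external tools (not part of any OpenAI API).
async def check_calendar(day: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"{day}: free after 3 pm"

async def find_restaurant(city: str) -> str:
    await asyncio.sleep(0.01)
    raise TimeoutError("restaurant search timed out")  # simulated failure

async def handle_request(say) -> dict:
    """Run tools concurrently while keeping the user informed by voice."""
    say("Let me check.")  # preamble, spoken before any tool work starts
    say("Checking your calendar and looking for a restaurant.")  # progress
    results = await asyncio.gather(
        check_calendar("Friday"),
        find_restaurant("Lisbon"),
        return_exceptions=True,  # one failed tool does not cancel the others
    )
    outcome = {}
    for name, result in zip(("calendar", "restaurant"), results):
        if isinstance(result, Exception):
            # Tell the user instead of failing silently.
            say("I'm having a bit of trouble right now, let me try another way.")
            outcome[name] = None
        else:
            outcome[name] = result
    return outcome

spoken = []  # collects what the agent would say aloud
outcome = asyncio.run(handle_request(spoken.append))
```

Here `spoken` stands in for the audio channel. A real agent would stream these phrases as speech while the tool calls are still in flight.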

In addition, the model's context window has been expanded from 32K to 128K tokens, which means it can stay coherent in longer and more complex multi-turn conversations and support more complete agent workflows.

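For professional scenarios, the model has a stronger grasp of domain-specific terminology and more accurately preserves specialized vocabulary, proper nouns, and medical terms, which is especially valuable for production deployments. In terms of expression, the model's tone and expressiveness are more controllable, and it can switch styles to suit the situation.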

Another key upgrade is adjustable reasoning strength. Developers can choose from five levels (minimal, low, medium, high, and xhigh, with low as the default) to strike a balance between latency and reasoning depth.
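As a rough illustration of how an application might handle these five levels, the sketch below validates a requested level and falls back to the default. Only the level names and the default of low come from the article. The function names are hypothetical and do not correspond to fields in the actual API.

```python
# Level names and the default come from the article; the helpers are illustrative.
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")
DEFAULT_LEVEL = "low"

def resolve_level(requested=None):
    """Return a valid level, falling back to the default when none is given."""
    if requested is None:
        return DEFAULT_LEVEL
    if requested not in REASONING_LEVELS:
        raise ValueError(f"unknown reasoning level: {requested!r}")
    return requested

def step_up(level):
    """Move one level deeper (more reasoning, more latency), capped at xhigh."""
    index = REASONING_LEVELS.index(resolve_level(level))
    return REASONING_LEVELS[min(index + 1, len(REASONING_LEVELS) - 1)]
```

An application could start at the default and call `step_up` only for requests where a slower, deeper answer is acceptable.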

Not just for small talk

GPT‑Realtime‑2 beats previous generation models in benchmark tests

In the Big Bench Audio evaluation, which measures challenging reasoning capabilities in speech models, GPT‑Realtime‑2 (high reasoning level) achieved an accuracy of 96.6%, compared with 81.4% for GPT‑Realtime‑1.5, an improvement of 15.2 percentage points.

The Audio MultiChallenge evaluation measures the multi-turn interactive intelligence of spoken dialogue systems, covering dimensions such as instruction following, context integration, self-consistency, and handling of natural speech corrections. On it, the average pass rate rose from 34.7% for GPT‑Realtime‑1.5 to 48.5% for GPT‑Realtime‑2 (xhigh reasoning level), an increase of 13.8 percentage points.
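The gains quoted for both benchmarks are absolute differences in percentage points, which is not the same as a relative improvement. A short check of the arithmetic, using only the scores reported above (the `gains` helper is just for illustration):

```python
# (previous generation, GPT-Realtime-2) scores in percent, as reported above.
scores = {
    "Big Bench Audio": (81.4, 96.6),
    "Audio MultiChallenge": (34.7, 48.5),
}

def gains(old, new):
    """Return (percentage-point gain, relative gain in percent), rounded to 0.1."""
    points = round(new - old, 1)
    relative = round((new - old) / old * 100, 1)
    return points, relative

for name, (old, new) in scores.items():
    points, relative = gains(old, new)
    print(f"{name}: +{points} points, +{relative}% relative")
# Big Bench Audio: +15.2 points, +18.7% relative
# Audio MultiChallenge: +13.8 points, +39.8% relative
```

Measured relative to the older model's score, the Audio MultiChallenge gain is therefore the larger of the two, even though its percentage-point gain is smaller.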

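In fact, the most convincing test of whether a speech model is truly "smart" is not small talk but a complex problem that has to be reasoned through layer by layer.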

Note: OpenAI's demonstration document includes a specific test in which a user describes their own business to the model. It presents the spoken reasoning of the two generations of Realtime models alongside the corresponding transcripts.

The case above is a complex task that demands strong reasoning. The model has to weigh several variables at once (the uneven distribution of foot traffic over the day, a high fixed rent, and the positioning of a low-turnover business such as slow-brew coffee) and reason logically within those constraints.

GPT‑Realtime‑2 gave a methodical, well-structured answer in 1 minute and 4 seconds. It broke down the tension between foot traffic and the rent structure, pointing out that if demand is too concentrated in peak hours, overall revenue per square foot may not be enough to cover the rent. It also proposed specific, lightweight ways to test the idea.

Given the same question, the previous-generation GPT‑Realtime‑1.5 responded in 51 seconds, but its answer clearly lacked depth. The comparison directly illustrates the generational gap in strategic reasoning between the two models.

Real-time translation and transcription

In addition to GPT‑Realtime‑2, the two dedicated models released by OpenAI at the same time each cater to clear scenario requirements.

GPT‑Realtime‑Translate focuses on real-time multilingual translation, supporting more than 70 input languages. It can output to 13 target languages in real time and provide transcripts simultaneously. Its target application scenarios include customer support, cross-border sales, education, events, and creator platforms for global audiences.

Alberto Parravicini, the head of AI at the video platform Vimeo, shared their application scenario: embedding GPT‑Realtime‑Translate during video playback, allowing creators to communicate cross-language with global audiences the moment they go online.

Vimeo demonstrates GPT‑Realtime‑Translate’s real-time translation capabilities

GPT‑Realtime‑Whisper is a streaming speech-to-text model built specifically for low-latency transcription.

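It can start generating a transcript the moment the speaker begins talking, which makes it suitable for live meeting captions, classroom notes, broadcast captions, and voice interactions that need to trigger follow-up workflows immediately. Its core value lies in converting speech into structured text that downstream business systems can use right away, while the conversation is still in progress.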

Security and Pricing

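On the safety side, the Realtime API deploys multiple layers of guardrails: built-in active classifiers monitor sessions in real time and can terminate a session once they detect an interaction that violates the harmful-content guidelines. Developers can also use the Agents SDK to add their own custom safety guardrails.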

OpenAI's usage policy expressly prohibits the use of output content for spam, fraud, or other harmful purposes.

According to official guidelines, developers must clearly disclose to end users that they are interacting with artificial intelligence, unless the context already makes it obvious that the other party is an AI (in other words, remind users that the voice they are hearing is an AI). In addition, the API fully supports EU data residency for EU customers and is covered by OpenAI's enterprise privacy commitments.

All three models are now available to developers through the Realtime API.

In terms of pricing, GPT‑Realtime‑2 is billed by audio token: $32 per 1 million input tokens (cached input is $0.40 per 1 million tokens) and $64 per 1 million output tokens. GPT‑Realtime‑Translate is billed by usage time at $0.034 per minute. GPT‑Realtime‑Whisper is also billed by time, at $0.017 per minute.
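To get a feel for what these rates mean in practice, here is a minimal cost estimator. The prices are the ones quoted above. The function names and the token counts in the example are made up for illustration and do not describe any real session.

```python
from decimal import Decimal

# Prices quoted in the article, in US dollars.
PRICE_PER_MILLION_TOKENS = {
    "input": Decimal("32"),
    "cached_input": Decimal("0.40"),
    "output": Decimal("64"),
}
PRICE_PER_MINUTE = {
    "translate": Decimal("0.034"),
    "whisper": Decimal("0.017"),
}

def realtime2_cost(input_tokens, cached_tokens, output_tokens):
    """Estimate the GPT-Realtime-2 cost for one session from token counts."""
    million = Decimal(1_000_000)
    return (
        Decimal(input_tokens) * PRICE_PER_MILLION_TOKENS["input"] / million
        + Decimal(cached_tokens) * PRICE_PER_MILLION_TOKENS["cached_input"] / million
        + Decimal(output_tokens) * PRICE_PER_MILLION_TOKENS["output"] / million
    )

def per_minute_cost(model, minutes):
    """Estimate the cost of the time-billed models ("translate" or "whisper")."""
    return PRICE_PER_MINUTE[model] * Decimal(minutes)

# Hypothetical session: 50,000 fresh input, 200,000 cached, 20,000 output tokens.
print(realtime2_cost(50_000, 200_000, 20_000))  # 2.96
print(per_minute_cost("translate", 60))         # 2.040
print(per_minute_cost("whisper", 60))           # 1.020
```

In this made-up session the 200,000 cached tokens cost only $0.08, while the 20,000 output tokens cost $1.28, so output audio and uncached input dominate the bill.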

To promote the new "voice family bundle", OpenAI CEO Sam Altman said on X that people really are starting to use voice to interact with AI, especially when they need to pour in a large amount of background information at once.

He also mentioned that young people seem to prefer communicating with AI through voice, while middle-aged and elderly users tend to type, and raised the open question of whether this habit will change in the future.

The question is: now that OpenAI has refreshed its voice reasoning capabilities, who will follow next?

