Gemini 2.5 Flash Native Audio: how Google's AI voice is changing

Last update: 15/12/2025

  • Gemini 2.5 Flash Native Audio improves the naturalness, accuracy, and fluidity of voice conversations with Google's AI.
  • The model refines external function calls, follows complex instructions more reliably, and maintains context better in long dialogues.
  • It incorporates real-time voice-to-voice translation, with support for more than 70 languages and around 2,000 translation pairs, preserving intonation and rhythm.
  • It is already integrated into Google AI Studio, Vertex AI, Gemini Live and Search Live, and is being deployed in Google and third-party products.

Gemini 2.5 Flash Native Audio

Google has taken another step in the evolution of its artificial intelligence ecosystem with a major update to Gemini 2.5 Flash Native Audio, the model designed to understand and generate audio in real time. This technology is geared towards making voice interactions closer to a human conversation, both in everyday life and in professional environments.

Far from simply "putting a voice" on an assistant's responses, and in contrast to many other voice AI options, this model is designed to sustain natural, functional, and contextual dialogues, deciding when to seek additional information and handling complex instructions without breaking the flow of the conversation. With this, Google reinforces its commitment to voice as a primary means of interaction with its AI services.

What is Gemini 2.5 Flash Native Audio and where is it being used?

Gemini 2.5 Flash Native Audio is the latest version of Google's native audio model, capable of listening, understanding, and responding by voice in real time. Unlike previous systems focused solely on speech synthesis, this engine is designed to work with audio as both input and output simultaneously, making it especially suitable for conversational assistants.

The company has already integrated this version into several of its key platforms: Google AI Studio, Vertex AI, Gemini Live, and Search Live. This means that both developers and companies can start building advanced voice agents on the same technology that powers Google's latest conversational AI experiences.

In practice, users will notice these changes in experiences such as Gemini Live (the assistant's voice conversation mode) or in Search Live within the AI mode of the Google app, where spoken responses sound more expressive, clearer, and better contextualized. You can even ask the assistant to speak more slowly, adjusting the pace of the conversation naturally.

Beyond Google's own products, these capabilities have been made available to third parties through Vertex AI and the Gemini API, so that other companies can create autonomous voice agents, virtual receptionists, or assistance tools with the same level of voice sophistication.
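For developers, the entry point is a real-time session with the model. The sketch below is purely illustrative, assuming the google-genai Python SDK and its Live API; the model identifier, API key handling, and audio playback are placeholders for the example, not details confirmed in this article.

    # Minimal sketch of a real-time voice session (google-genai Python SDK).
    # The model ID below is a placeholder; check Google's documentation for the current one.
    import asyncio
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")  # Vertex AI credentials also work
    MODEL = "gemini-2.5-flash-native-audio-preview"  # assumed identifier

    async def main():
        config = types.LiveConnectConfig(response_modalities=["AUDIO"])
        async with client.aio.live.connect(model=MODEL, config=config) as session:
            # A text turn for simplicity; a real agent would stream microphone audio.
            await session.send_client_content(
                turns=types.Content(
                    role="user",
                    parts=[types.Part(text="Greet the caller in one short sentence.")],
                ),
                turn_complete=True,
            )
            async for message in session.receive():
                if message.data:   # raw audio bytes returned by the model
                    pass           # feed them to an audio output stream

    asyncio.run(main())

In a production agent, the same session would keep receiving microphone audio and streaming the model's spoken replies back, which is what makes the interaction feel like a continuous conversation rather than a sequence of requests.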

More accurate external function calls and better benchmark results

Google's voice AI

One of the areas where Gemini 2.5 Flash Native Audio has made the most progress is its ability to call external functions. In simple terms, the model is now more reliable at deciding when it needs to consult real-time services or data, for example to retrieve updated information, check the status of an order, or launch an automated process.


Google points out that this added precision translates into fewer errors when triggering actions, reducing awkward situations where the assistant falls short or acts prematurely. The system can insert the retrieved data into the spoken response without the user perceiving any abrupt cuts in the conversation.
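To make the idea concrete, here is a hypothetical sketch of how a developer might declare such an external function for a live voice session, again assuming the google-genai Python SDK; the check_order_status tool, its parameters, and the configuration are illustrative assumptions rather than an API described in this article.

    # Hypothetical tool declaration for a live voice session (google-genai SDK).
    from google.genai import types

    check_order_tool = types.Tool(
        function_declarations=[
            types.FunctionDeclaration(
                name="check_order_status",  # illustrative function name
                description="Look up the current status of a customer order.",
                parameters=types.Schema(
                    type=types.Type.OBJECT,
                    properties={
                        "order_id": types.Schema(
                            type=types.Type.STRING,
                            description="Identifier of the order to check",
                        ),
                    },
                    required=["order_id"],
                ),
            )
        ]
    )

    # Passed when the session is opened. When the model decides it needs live
    # data, it emits a tool call; your code runs the lookup, returns a function
    # response, and the model weaves the result into its spoken reply.
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        tools=[check_order_tool],
    )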

To measure these advances, the company has subjected the model to tests such as ComplexFuncBench Audio, a benchmark focused on multi-step tasks with constraints. In this scenario, Gemini 2.5 Flash Native Audio has achieved around a 71.5% success rate in executing complex functions, placing it above previous iterations and other competing models for this type of use.

This performance is especially relevant in contexts that require sophisticated automated workflows, such as call centers, technical support, or transaction processing (for example, financial or administrative tasks), where each step depends on the previous one and there is little room for error.

Better instruction following and more coherent conversation threads

Another focus of the update is how the model interprets and respects the instructions it receives from both end users and developers. According to data released by Google, the instruction compliance rate has risen from around 84% to 90% adherence, which means responses that are more in line with what has actually been asked for.

This leap is key in tasks that involve complex instructions, multiple steps, or several conditions: for example, when requesting an explanation in a specific style, asking for a summary with certain time constraints, or setting up a workflow that depends on several linked decisions.
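Developer-side constraints of this kind are usually expressed as a system instruction when the session is configured. The snippet below is a hedged sketch under that assumption, using the google-genai Python SDK; the wording and fields are illustrative rather than taken from this article.

    # Illustrative example: shaping the assistant's behaviour for a voice session.
    from google.genai import types

    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        system_instruction=types.Content(parts=[types.Part(text=(
            "You are a store assistant. Keep answers to two sentences or fewer, "
            "repeat the order number back to the customer, and ask for explicit "
            "confirmation before triggering any refund."
        ))]),
    )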

Related to this, Gemini 2.5 Flash Native Audio has improved at retrieving the context of previous messages. In multi-turn conversations, the model better remembers what has been said, the nuances introduced by the user, and the corrections made throughout the dialogue.

This improvement in conversational memory reduces the need to repeat the same information over and over and makes interactions smoother and less frustrating. The experience is closer to talking to a person who picks up a topic where they left off, rather than starting from scratch with each answer.

Real-world use cases: from e-commerce to financial services

Beyond internal metrics, Google is relying on customer examples to illustrate the practical impact of Gemini 2.5 Flash Native Audio. In the e-commerce sector, Shopify has incorporated these capabilities into its assistant "Sidekick", which helps retailers manage their stores and resolve questions about their business.


According to the company, many users forget that they are talking to an AI after a few minutes of conversation; in one case, a user even thanked the bot after a lengthy inquiry. This type of reaction suggests that advances in naturalness and tone are letting the technology fade quietly into the background.

In the financial sector, the lender United Wholesale Mortgage (UWM) has integrated the model into its "Mia" assistant to manage mortgage-related processes. With the combination of Gemini 2.5 and other internal systems, the company claims to have processed more than 14,000 loans for its partners, relying on automated interactions that require accuracy and regulatory compliance.

For its part, the startup Newo.ai uses Gemini 2.5 Flash Native Audio via Vertex AI to power its virtual receptionists. These voice assistants can identify the main speaker even in noisy environments, switch languages mid-conversation, and maintain a natural voice register with emotional nuance, which is crucial in customer service.

Real-time voice-to-voice translation: more languages and more nuance

One of the most striking additions in this version is live voice-to-voice translation. Initially integrated into the Google Translate app, Gemini 2.5 Flash Native Audio goes beyond simply converting audio to text or offering fragmented translations, enabling a more immersive, simultaneous translation closer to human interpretation.

The system can operate in a continuous listening mode, which allows the user to put on headphones and hear what is happening around them translated into their language, without needing to pause or press buttons for each phrase. This option can be useful when traveling, attending international meetings, or at events where multiple languages are involved.

Two-way conversations have also been considered. For example, if one person speaks English and the other Hindi, the headphones play the English translation in real time, while the phone plays the Hindi translation once the first person finishes speaking. The system automatically switches the output language depending on who is speaking, without the user having to change settings between turns.

One of the most relevant details of this feature is its ability to preserve the speaker's original intonation, rhythm, and tone. This results in translations that sound less robotic and closer to the speaker's voice style, making them easier to understand and the experience more natural.

Language support, automatic detection and noise filtering

In terms of linguistic scope, Gemini 2.5-based voice translation supports over 70 languages and some 2,000 translation pairs. Combining the model's world knowledge with its multilingual and native audio capabilities, it can cover a wide range of language combinations, including many that are not always prioritized by other tools.


The system can manage multilingual input within a single session: it understands more than one language simultaneously without requiring the user to manually adjust the settings each time someone switches languages. This feature is especially useful in conversations where several languages mix naturally.

Thanks to automatic detection of the spoken language, the user does not need to know in advance which language their interlocutor is speaking: the model identifies it and begins to translate on the fly, reducing friction and intermediate steps.

Gemini 2.5 Flash Native Audio also incorporates mechanisms for robustness against noise: it can filter out some of the ambient sound to prioritize the main voice, allowing for more comfortable conversations on busy streets, in open spaces, or in places with background music.

Availability, deployment and prospects for Europe

Live voice translation based on this model is currently available in beta in the Google Translate app for Android devices in markets such as the United States, Mexico, and India. Google has confirmed that the service will be progressively rolled out to more regions and platforms, including other mobile systems.

In parallel, the integration of Gemini 2.5 Flash Native Audio into Gemini Live and Search Live is being rolled out to users of the Google app on Android and iOS, starting in the United States. As these features mature and pass the initial testing and adaptation phases, they are expected to reach more countries, presumably including European markets, where demand for translation and voice assistants is especially high.

Google has also announced its intention to bring this voice and translation experience to other products, including the Gemini API. Over the coming months and years, this would open the door for European companies in sectors such as tourism, logistics, education, and public administration to integrate these capabilities directly into their own services.

The company is presenting these new features as part of a broader strategy to enable developers to build conversational agents with natural voices, taking advantage of both Gemini 2.5 Flash Native Audio and other models in the 2.5 Flash and Pro family geared towards more controlled voice generation (adjusting tone, intent, speed, etc.), as well as frameworks such as the Agentic AI Foundation.
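As a rough illustration of what "more controlled voice generation" can mean in practice, the sketch below uses the speech configuration exposed by the google-genai Python SDK; the model ID and voice name are assumptions made for the example, and the exact controls available may differ.

    # Hypothetical sketch: steering delivery in a speech-generation request.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-tts",  # assumed identifier
        contents="Say slowly and warmly: your order has shipped and arrives on Friday.",
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                )
            ),
        ),
    )

    audio_bytes = response.candidates[0].content.parts[0].inline_data.data  # raw audio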

With this set of improvements, Google reinforces the idea that voice will be one of the main channels of interaction with artificial intelligence: from assistants that handle customer calls and process complex operations, to simultaneous translation systems that facilitate communication between people who do not share a language. Gemini 2.5 Flash Native Audio is at the heart of this endeavor, refining both voice comprehension and expression to make the technology more useful and less intrusive in everyday life, while its full deployment in Europe and other markets is still pending.

Related article: Voice.ai vs ElevenLabs vs Udio: A complete comparison of AI voices