ChatGPT Announces its New Image and Voice Capabilities – Seeing, Hearing and Speaking

Great news! OpenAI is introducing new voice and image features for its AI chatbot, ChatGPT.

Our Staff

Image and Voice Capabilities


  • The latest update to ChatGPT incorporates both image and voice capabilities.
  • This integration of novel features broadens the path for more intuitive interactions with artificial intelligence systems.
  • OpenAI is taking a considerate approach to introducing these advanced capabilities, with an emphasis on ensuring user safety.

You can now interact with ChatGPT using your voice or images. OpenAI is introducing voice and image features that make interaction more intuitive: you can hold a spoken conversation or show images directly to ChatGPT.

How does this enhance your ChatGPT experience, you ask? 

Imagine being on a travel spree. You come across an intriguing landmark. All you have to do is snap a picture and engage ChatGPT in a lively chat about it. Or let’s suppose you’re home. Just take pictures of your fridge and pantry, discuss meal possibilities with ChatGPT, and even request a step-by-step recipe. And guess what? After dinner, you can get help with your child’s math problems. Simply take a photo of the problems and let ChatGPT offer guiding hints. It’s as simple and fun as that!

Engage in Two-Way Conversation with ChatGPT

Converse with your assistant using the new voice function. You can ask it for a story or have it help settle a debate.

Activate this feature under Settings → New Features on the mobile app, then select your preferred voice from five options.

The voice feature is powered by a text-to-speech model that produces lifelike audio and was created in collaboration with professional voice actors. Speech recognition is handled by Whisper, OpenAI’s open-source speech recognition system.

Images Contribute to Visual Context in Discussion

Imagine this: You’re not just telling ChatGPT about your world, but you’re ALSO showing it! That’s right, you can now enlighten your favourite AI companion with one or several images to supplement the context of your conversation. Exciting, isn’t it? 

Let’s paint a little picture, shall we? Perhaps you’ve got a malfunctioning appliance at hand. Instead of twisting your words trying to explain the problem, why not just snap a photo and share it directly with ChatGPT? It’s like your very own pocket-sized technician! And if you’re on a mobile device, you get an added advantage: a nifty drawing tool that lets you circle or point to specific parts of the image. Talk about precision!

So, what’s the secret sauce behind these image-perception capabilities, you ask? Well, it boils down to multimodal versions of the GPT-3.5 and GPT-4 models! These models were fine-tuned to comprehend visual input, lending their reasoning abilities to ChatGPT. Of course, safety is paramount at OpenAI, which is why the image-reading feature was put through rigorous testing to minimize potential risks. Quite the picture-perfect enhancement, don’t you think?

Gradual Deployment of Image and Voice Capabilities for Safety

While the idea of a bot that can see, hear, and speak might excite you, it’s important to understand that this progress isn’t just about adding cool features. Instead, the goal of gradual deployment of image and voice capabilities in ChatGPT has always been about safety and improving user experience. 


With voice integration, the AI doesn’t just read text, but also listens to and interprets spoken language. This offers an entirely different level of engagement, making interactions smoother and more natural. Just imagine, with this feature, you can have a real-time conversation with ChatGPT, just like you would with a human. However, it’s essential to point out that this advancement comes with a layer of caution; the potential for misuse is also real. To address such concerns, strict measures are in place to ensure the ethical application of the voice feature. 

Image Input 

In addition to understanding text and voice, ChatGPT now has the ability to perceive and interpret images. This gives it a sense of ‘vision’, allowing it to provide more context-rich responses. Think of all the times when you wished you could just show something to get your point across. With this function, it’s now possible. Yet, even these enhancements aim at safety above all. Understandably, image inputs may pose privacy concerns for users. Rest assured, as steps have been taken to make this image processing system both useful and safe, ensuring the user’s privacy is always respected. 

Transparency about Model Limitations 

An important part of this gradual deployment also revolves around transparency. It’s crucial to acknowledge that, like any system, ChatGPT has its limits. While the AI strives to understand context and interpret information efficiently, there may be instances where it doesn’t get everything right. This could be a misunderstood word in a voice command or a misinterpreted image. It’s all part of the learning process, which is inherently iterative and gradual. Acknowledging these limitations is the first step towards making ChatGPT a dynamic, responsive, and safe tool for everyone.

Learn more about OpenAI’s safety approach and its collaboration with Be My Eyes in the system card for image input.

Enhancing Visual Perception Safety

ChatGPT vision helps interpret your visual environment for everyday use. 

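Its design was informed by OpenAI’s work with Be My Eyes, a free mobile app for blind and low-vision people. Users there found it valuable to discuss images that happen to include people in the background, such as someone appearing in a TV scene.

Measures have been implemented to limit ChatGPT’s ability to analyze and make direct statements about people, because the model is not always accurate and should respect individuals’ privacy.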

Your feedback is crucial for improving safety protocols and tool usefulness.

How can users make the most out of ChatGPT’s new capabilities?

ChatGPT now offers new ways to interact through images and voice. Use images or visual descriptions for more tailored responses, especially in visually focused fields like fashion or design. Describe sounds, by voice or in text, for help in areas like music or sound troubleshooting. Talk to ChatGPT for a more natural, conversation-like interaction, which suits uses such as language practice.

With these additions, ChatGPT can respond to what you show it and what you say to it, not only to what you type.

For optimal results, provide clear, concise instructions or context when giving visual, audio, or voice inputs. Experiment and iterate with various types of input to better understand and utilize ChatGPT’s capabilities to fit your specific needs.

Expansion of Access in ChatGPT

Guess what? If you’re a Plus or Enterprise user, voice and image features will roll out to you over the next two weeks. OpenAI plans to bring these features to other groups, including developers, in the not-too-distant future.

On a Final Note

The introduction of voice and image capabilities to ChatGPT provides users with a more immersive means of engagement with the AI system. 

In keeping with a cautious strategy, OpenAI is initiating the release of these advancements gradually, restricting early access and certain functions due to potential risks yet to be fully assessed. 

While this is an exciting evolution, it is important to remember ChatGPT’s limits: it should not be relied on for higher-risk uses without proper verification.

