Tutorial - Extract visual information from long videos with Gemini

Meetings are a cornerstone of collaboration, but sifting through long recordings to find specific visual details is time-consuming. Traditional AI meeting assistants often rely on audio transcripts alone and miss critical visual information such as slides, diagrams, or whiteboard notes. Google's Gemini 1.5 Pro addresses this gap: it can process prompts with up to 2 million tokens of context, which is enough to upload a video of an hour or more and extract both its visual and audio content.

This tutorial guides you through using Gemini 1.5 Pro to extract visual information from meeting recordings. You'll learn to upload a video to Google AI Studio, apply a comprehensive template prompt to capture all visual and audio details, and interpret the results to find specific information. This is ideal for professionals who need to review meetings efficiently without missing key visuals.

Although Gemini 2.5 Pro is available as of May 2025, this tutorial focuses on Gemini 1.5 Pro because of its 2 million token context window, which suits long videos. Future versions of Gemini 2.5 Pro may offer a similar context window, so keep an eye on Google's announcements.

Key objectives:

Access and use Google AI Studio for Gemini 1.5 Pro.
Upload and process a long video with Gemini 1.5 Pro.
Understand video token counts and stay within limits.
Use a template prompt to extract comprehensive visual and audio information.

Step 1 - Accessing Google AI Studio

Start by accessing Google AI Studio, the platform for interacting with Gemini 1.5 Pro.

Navigate to Google AI Studio and sign in with your Google account. If you don’t have an account, create one by following the prompts.

Google AI Studio is preferred over other platforms, such as the simpler Gemini app, because it displays token counts for uploaded files, which is crucial for ensuring your video stays within the 2 million token limit. AI Studio also lets you select older models like Gemini 1.5 Pro. Vertex AI is an alternative, but AI Studio’s user-friendly interface makes it a good fit for this task.

Step 2 - Model selection

After logging in to Google AI Studio, choose the right model for your video processing task. Follow these steps to select Gemini 1.5 Pro, which handles long videos well thanks to its large context window.

Choose the Chat Interface: In the left panel of Google AI Studio, click on "Chat" to access the interactive interface for model usage.
Access the Model Selector: Within "Run settings," click on the model selector dropdown to view available model options.
Pick the Model Family: From the dropdown, select the "Gemini 1.5" family, which includes models tailored for various advanced tasks.
Specify the Model: Choose "Gemini 1.5 Pro" from the list, ensuring you have the version designed for long video analysis.

With Gemini 1.5 Pro selected, you’re now set to proceed with uploading your video and extracting the visual information you need. Double-check your selection to ensure optimal performance.

Step 3 - Uploading a long video

With access to AI Studio and the right model selected, upload your meeting recording. It can be up to 1 hour long (possibly closer to 2, going by the token estimates below) and should ideally be in MP4 format, a widely used format that Gemini 1.5 Pro supports.

Direct Upload to AI Studio

In AI Studio, start a new chat or conversation with Gemini 1.5 Pro. Locate the "Upload File" option, a plus sign to the right of the chat field, or drag and drop your MP4 video right into the chat box.

If the direct file upload fails, try the Google Drive method described below.

Right after the upload, AI Studio displays the token count, so you can confirm that the video is within the 2 million token limit.

Token Count for Videos

Gemini 1.5 Pro’s 2 million token context window comfortably holds 1 hour of video. Token count grows with video length; approximate figures:

A 1-minute video uses approximately 18,000 tokens.
A 30-minute video uses around 540,000 tokens.
A 60-minute video uses around 1,080,000 tokens.

All three durations are well within the 2 million token limit. Compressing a video may reduce the token count if it lowers the number of frames, but for accuracy, use the original file unless file size is a concern. Large files take longer to upload, so ensure a stable internet connection.
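The per-minute figure above makes it easy to sanity-check a recording before uploading. The following is a minimal sketch that assumes the tutorial's approximate rate of 18,000 tokens per minute and the 2 million token limit. The function names are illustrative, and the count AI Studio displays after upload remains the authoritative number.

```python
# Rough token estimate for a video, based on the approximate figures in this tutorial.
# Assumption: about 18,000 tokens per minute of video. This is an estimate only;
# the count shown by AI Studio after upload is the number to trust.
TOKENS_PER_MINUTE = 18_000
CONTEXT_LIMIT = 2_000_000


def estimate_video_tokens(minutes: float) -> int:
    """Return the approximate token count for a video of the given length."""
    return round(minutes * TOKENS_PER_MINUTE)


def fits_in_context(minutes: float, prompt_tokens: int = 0) -> bool:
    """Check whether the video plus an optional prompt budget stays within the limit."""
    return estimate_video_tokens(minutes) + prompt_tokens <= CONTEXT_LIMIT


if __name__ == "__main__":
    for length in (1, 30, 60):
        print(f"{length:>3} min -> ~{estimate_video_tokens(length):,} tokens, "
              f"fits: {fits_in_context(length)}")
```

By this rough estimate, about 111 minutes of video (2,000,000 / 18,000) would fit before any prompt text is counted.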

Alternative Method - Using Google Drive

If you have issues with the direct file upload, or your video is already stored in Google Drive, link AI Studio to your Drive. Then browse and select your video file under "My Drive" in the upload menu (plus icon).

Step 4 - Using the template prompt

With your video uploaded, use a single template prompt to extract all visual and audio information, which eliminates the need for multiple prompts. The following prompt is written for clarity and comprehensiveness.

Create a detailed transcript of the video:
Include timestamps for each segment
Identify who is speaking at each timestamp
Describe diagrams, flowcharts, or whiteboard notes in detail during presentations
Extract the full content of any clearly presented slides
Note who shared their screen and when
Mention any physical objects shown or demonstrated via screen share
Capture any other visual cues or information not audible

Paste this prompt into AI Studio’s chat box. Make sure the uploaded video is attached to the message; AI Studio typically attaches it automatically. Click 'Run' to process the prompt with Gemini 1.5 Pro.
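If you reuse the template across recordings, keeping it in a small script makes it easy to add or drop instructions before pasting. This is only a sketch: the names HEADER, INSTRUCTIONS, and build_prompt are illustrative, and the wording is copied from the template prompt in this tutorial.

```python
from typing import Sequence

# Assemble the template prompt from a list of instructions so it can be adjusted per meeting.
# The names here are illustrative; the text comes from the template prompt in this tutorial.
HEADER = "Create a detailed transcript of the video:"

INSTRUCTIONS = [
    "Include timestamps for each segment",
    "Identify who is speaking at each timestamp",
    "Describe diagrams, flowcharts, or whiteboard notes in detail during presentations",
    "Extract the full content of any clearly presented slides",
    "Note who shared their screen and when",
    "Mention any physical objects shown or demonstrated via screen share",
    "Capture any other visual cues or information not audible",
]


def build_prompt(extra_instructions: Sequence[str] = ()) -> str:
    """Join the header, the standard instructions, and any extras, one per line."""
    lines = [HEADER, *INSTRUCTIONS, *extra_instructions]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the prompt so it can be copied into the chat box.
    print(build_prompt())
```

Calling build_prompt with an extra instruction, for example a hypothetical "List action items mentioned on slides", appends it as the last line of the prompt.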

Processing a 30-60 minute video may take several minutes, depending on length and complexity. Monitor any progress indicators in AI Studio. If the process fails, check the token count or try re-uploading the video.

At the time of writing, the maximum output length in the run settings is 8,192 tokens for the Gemini 1.5 Pro model. If you need longer output, try a newer model.

Step 5 - Interpreting the results

Once processed, Gemini 1.5 Pro will generate a detailed transcript including timestamps, speaker identification, slide content, and descriptions of visual elements like diagrams or objects. To find specific information, such as a gross sales number on a slide, search the transcript for keywords like ‘gross sales’ or ‘slide’.

This allows you to locate details quickly without watching the entire video. For extensive transcripts, save the output as a text file for easier searching. You can also ask follow-up questions in AI Studio, such as ‘Provide more details about Slide: FY25 budget targets,’ to refine the results.
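Once the output is saved as a text file, the keyword search can also be scripted. The sketch below assumes a plain-text transcript with one entry per line; the sample text is invented for illustration and is not real model output, and the function name find_mentions is hypothetical.

```python
# Search a saved transcript for a keyword, ignoring case.
# Assumption: the transcript is plain text with one entry per line.
def find_mentions(transcript: str, keyword: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for every line that contains the keyword."""
    needle = keyword.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(transcript.splitlines(), start=1)
        if needle in line.lower()
    ]


# Invented sample text for illustration only; real output will look different.
SAMPLE = """\
[00:12:05] Speaker A shares their screen.
[00:12:30] Slide: FY25 budget targets. Gross sales are listed by region.
[00:15:10] Speaker B asks about the whiteboard diagram.
"""

if __name__ == "__main__":
    for number, line in find_mentions(SAMPLE, "gross sales"):
        print(f"line {number}: {line}")
```

To search a saved file instead of the sample, read its contents into a string and pass that string as the first argument.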

Verify critical information, as AI may occasionally misinterpret complex visuals. If the output is incomplete, try rephrasing the prompt or breaking the video into smaller segments, though this is rarely necessary with Gemini 1.5 Pro’s large context window.
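If you do decide to break a recording into smaller segments, it helps to plan the cut points first. The following sketch only computes the time ranges; the actual cutting is left to whatever video tool you use. The 20-minute default and the function names are assumptions made for illustration.

```python
# Plan time ranges for splitting a long recording into smaller segments.
# This only computes boundaries; it does not cut the video file itself.
def format_timestamp(total_seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def plan_segments(duration_minutes: int, segment_minutes: int = 20) -> list[tuple[str, str]]:
    """Return (start, end) timestamps covering the whole recording without gaps."""
    if duration_minutes <= 0 or segment_minutes <= 0:
        raise ValueError("Durations must be positive")
    segments = []
    for start in range(0, duration_minutes, segment_minutes):
        # The last segment is shortened so it ends exactly at the end of the recording.
        end = min(start + segment_minutes, duration_minutes)
        segments.append((format_timestamp(start * 60), format_timestamp(end * 60)))
    return segments


if __name__ == "__main__":
    for start, end in plan_segments(50):
        print(f"{start} - {end}")
```

Each segment can then be uploaded and processed with the same template prompt.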

Conclusion

This tutorial has equipped you to use Gemini 1.5 Pro to extract visual information from long meeting recordings, streamlining your review process. By accessing Google AI Studio, uploading a video, applying a template prompt, and interpreting the results, you can efficiently find critical details like slide content or visual cues.

Recap of steps:

Access Google AI Studio, confirm that Gemini 1.5 Pro is available, and select it.
Upload your MP4 video directly or via Google Drive.
Use the template prompt to extract all visual and audio information.
Search the transcript for specific details like slide content.

Troubleshooting tips:

If the upload fails, check the file format (MP4) and size (under 2 GB).
If it still fails, upload the video to Google Drive first and attach it from there.
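The format and size checks can be run locally before you attempt an upload. This is a minimal sketch under the assumptions in the tips above: an MP4 file smaller than 2 GB. Here 2 GB is interpreted as 2 * 1024**3 bytes, which may not match how the actual limit is defined, and check_video_file is an illustrative name.

```python
import os

# Assumptions taken from the troubleshooting tips: MP4 file, smaller than 2 GB.
# 2 GB is interpreted as 2 * 1024**3 bytes; the real limit may be defined differently.
MAX_BYTES = 2 * 1024**3


def check_video_file(path: str) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    if not os.path.isfile(path):
        return [f"File not found: {path}"]
    problems = []
    # Compare the extension case-insensitively so .MP4 is accepted too.
    if os.path.splitext(path)[1].lower() != ".mp4":
        problems.append("File extension is not .mp4")
    if os.path.getsize(path) >= MAX_BYTES:
        problems.append("File is 2 GB or larger")
    return problems


if __name__ == "__main__":
    # "meeting.mp4" is a placeholder path; replace it with your own recording.
    issues = check_video_file("meeting.mp4")
    print("Ready to upload" if not issues else "\n".join(issues))
```

The check looks only at the file extension and size, so a file that passes can still fail to upload for other reasons.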

Further exploration:

Experiment with prompts to summarize topics or identify frequent speakers.
Apply this technique to webinars or training videos.
Monitor updates for Gemini 2.5 Pro, which may eventually support a 2 million token context window.

By leveraging Gemini 1.5 Pro, you can enhance productivity and gain deeper insights from video content, making your workflow more efficient.
