Build and refine your audio generation end-to-end with Gemini 1.5 Pro

Generative AI is giving people new ways to experience audio content, from podcasts to audio summaries. For example, users are embracing NotebookLM’s recent Audio Overview feature, which turns documents into audio conversations. With one click, two AI hosts start up a lively “deep dive” discussion based on the sources you provide. They summarize your material, make connections between topics, and discuss back and forth. 

While NotebookLM offers incredible benefits for making sense of complex information, some users want more control over generating unique audio experiences, such as creating their own podcasts. Podcasts are an increasingly popular medium for creators, business leaders, and users to listen to what interests them. Today, we’ll share how Gemini 1.5 Pro and the Text-to-Speech API on Google Cloud can help you create conversations with diverse voices and generate podcast scripts with custom prompts.

The approach: Expand your reach with diverse audio formats

A great podcast starts with accessible audio content. Gemini’s multimodal capabilities, combined with our high-fidelity Text-to-Speech API, which offers 380+ voices across 50+ languages and custom voice creation, unlock new ways for users to experience content and expand their reach through diverse audio formats.

This approach also helps content creators reach a wider audience and streamline the content creation process.

Let’s take a look at how. 

The architecture: Gemini 1.5 Pro and Text-to-Speech 

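Our audio overview creation architecture uses two powerful services from Google Cloud: Gemini 1.5 Pro, which analyzes the source document and writes the podcast script, and the Text-to-Speech API, which turns that script into multi-speaker audio.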

How to create an engaging podcast yourself, step-by-step 

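First, use Gemini 1.5 Pro to break the source document into sections and subsections. A Python function that powers this step of our podcast creation process can look as simple as the one below: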

import vertexai
from vertexai.generative_models import GenerativeModel, Part

def extract_sections_and_subsections(document1: Part, project="<your-project-id>", location="us-central1") -> str:
   """
   Extracts hierarchical sections and subsections from a Google Cloud blog post
   provided as a PDF document.
   This function uses the Gemini 1.5 Pro language model to analyze the structure
   of a blog post and identify its key sections and subsections. The extracted
   information is returned in JSON format for easy parsing and use in
   various applications.
   This is particularly useful for:
   * **Large documents:**  Breaking down content into manageable chunks for
     efficient processing and analysis.
   * **Podcast creation:** Generating multi-episode series where each episode
     focuses on a specific section of the blog post.
   Args:
       document1 (Part): A Part object representing the PDF document,
                         typically obtained using `Part.from_uri()`.
                         For example:
                         ```python
                         document1 = Part.from_uri(
                             mime_type="application/pdf",
                             uri="gs://your-bucket/your-pdf.pdf"
                         )
                         ```
       location: The region of your Google Cloud project. Defaults to "us-central1".
       project: The ID of your Google Cloud project. Defaults to "<your-project-id>".
   Returns:
       str: A JSON string representing the extracted sections and subsections.
            Returns an empty string if there are issues with processing or
            the model output.
   """
   vertexai.init(project=project, location=location)  # Initialize Vertex AI
   model = GenerativeModel("gemini-1.5-pro-002")
   prompt = """Analyze the following blog post and extract its sections and subsections. Represent this information in JSON format using the following structure:
   [
     {
       "section": "Section Title",
       "subsections": [
         "Subsection 1",
         "Subsection 2",
         // ...
       ]
     },
     // ... more sections
   ]"""
   try:
       responses = model.generate_content(
           ["""The pdf file contains a Google Cloud blog post required for podcast-style analysis:""", document1, prompt],
           generation_config=generation_config,  # Assumed to be defined elsewhere
           safety_settings=safety_settings,      # Assumed to be defined elsewhere
           stream=True,  # Stream results for better performance with large documents
       )
       response_text = ""
       for response in responses:
           response_text += response.text
       return response_text
   except Exception as e:
       print(f"Error during section extraction: {e}")
       return ""
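
The function above returns raw model text, and models sometimes wrap JSON in a Markdown code fence. The following is a minimal sketch of turning that text into (section, subsection) pairs, assuming the model followed the structure requested in the prompt. The helper names parse_sections and iter_section_pairs are hypothetical and not part of the original walkthrough.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code-fence marker, built without typing it literally


def parse_sections(response_text: str) -> list:
    """Parse the model's JSON answer into a list of section dicts.

    Returns an empty list when the text is empty or is not valid JSON.
    """
    text = response_text.strip()
    # Assumption: the model may wrap its answer in a fenced "json" block.
    match = re.search(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, text, re.DOTALL)
    if match:
        text = match.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []


def iter_section_pairs(sections: list):
    """Yield (section, subsection) pairs, one per podcast segment to generate."""
    for entry in sections:
        for subsection in entry.get("subsections", []):
            yield entry.get("section", ""), subsection
```

Each pair can then be passed to the script-generation step, one call per subsection.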

Then, use Gemini 1.5 Pro to generate the podcast script for each section. Again, provide clear instructions in your prompts, specifying target audience, desired tone, and approximate episode length.
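
The functions in this walkthrough pass generation_config and safety_settings to generate_content() without showing their definitions. A minimal sketch of what they might look like follows. Every value is an assumption chosen for illustration, not the configuration used for this post. The Vertex AI SDK accepts a plain dictionary for generation_config, and passing None for safety_settings leaves the service defaults in place.

```python
# Illustrative values only; tune them for your own content.
generation_config = {
    "max_output_tokens": 8192,  # upper bound on the response length
    "temperature": 1,           # higher values give more varied dialogue
    "top_p": 0.95,
    # Assumption: requesting JSON output reduces the chance of extra prose
    # or code fences around the JSON that the prompts ask for.
    "response_mime_type": "application/json",
}

# None means "use the service's default safety thresholds".
safety_settings = None
```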

For each section and subsection, you can use a function like the one below to generate a script:

def generate_podcast_content(section, subsection, document1:Part, targetaudience, guestname, hostname, project="<your-project-id>", location="us-central1") -> str:
 """Generates a podcast dialogue in JSON format from a blog post subsection.
 This function uses the Gemini model in Vertex AI to create a conversation
 between a host and a guest, covering the specified subsection content. It uses
 a provided PDF as source material and outputs the dialogue in JSON.
 Args:
   section: The blog post's main section (e.g., "Introduction").
   subsection: The specific subsection (e.g., "Benefits of Gemini 1.5").
   document1: A `Part` object representing the source PDF (created using
              `Part.from_uri(mime_type="application/pdf", uri="gs://your-bucket/your-pdf.pdf")`).
   targetaudience: The intended audience for the podcast.
   guestname: The name of the podcast guest.
   hostname: The name of the podcast host.
   project: Your Google Cloud project ID.
   location: Your Google Cloud project location.
 Returns:
   A JSON string representing the generated podcast dialogue.
 """
 print(f"Processing section: {section} and subsection: {subsection}")
 prompt = f"""Create a podcast dialogue in JSON format based on a provided subsection of a Google Cloud blog post (found in the attached PDF).
 The dialogue should be a lively back-and-forth between a host (R) and a guest (S), presented as a series of turns.
 The host should guide the conversation by asking questions, while the guest provides informative and accessible answers.
 The script must fully cover all points within the given subsection.
 Use clear explanations and relatable analogies.
 Maintain a consistently positive and enthusiastic tone (e.g., "Movies, I love them. They're like time machines...").
 Include only one introductory host greeting (e.g., "Welcome to our next episode...").  No music, sound effects, or production directions.
 JSON structure:
 {{
   "multiSpeakerMarkup": {{
     "turns": [
       {{"text": "Podcast script content here...", "speaker": "R"}}, // R for host, S for guest
       // ... more turns
     ]
   }}
 }}
 Input Data:
 Section: "{section}"
 Subsections to cover in the podcast: "{subsection}"
 Target Audience: "{targetaudience}"
 Guest name: "{guestname}"
 Host name: "{hostname}"
 """
 vertexai.init(project=project, location=location)
 model = GenerativeModel("gemini-1.5-pro-002")
 responses = model.generate_content(
     ["""The pdf file contains a Google Cloud blog post required for podcast-style analysis:""", document1, prompt],
     generation_config=generation_config, # Assuming these are defined already
     safety_settings=safety_settings,      # Assuming these are defined already
     stream=True,
 )
 response_text = ""
 for response in responses:
   response_text += response.text
 return response_text
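
Before sending a generated script to Text-to-Speech, it can help to confirm that it matches the JSON structure requested in the prompt. The following is a minimal sketch, assuming any surrounding code fence has already been stripped from the response. The helper name load_script is hypothetical; the R and S speaker labels come from the prompt above.

```python
import json

ALLOWED_SPEAKERS = {"R", "S"}  # R = host, S = guest, as in the prompt


def load_script(response_text: str) -> dict:
    """Parse a generated script and check the structure the prompt asked for.

    Raises ValueError when the text is not JSON or the structure differs.
    """
    try:
        script = json.loads(response_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Script is not valid JSON: {exc}") from exc

    markup = script.get("multiSpeakerMarkup") if isinstance(script, dict) else None
    turns = markup.get("turns") if isinstance(markup, dict) else None
    if not isinstance(turns, list) or not turns:
        raise ValueError("Script has no non-empty 'multiSpeakerMarkup.turns' list")

    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            raise ValueError(f"Turn {index} is not an object")
        if turn.get("speaker") not in ALLOWED_SPEAKERS:
            raise ValueError(f"Turn {index} has an unexpected speaker")
        if not str(turn.get("text", "")).strip():
            raise ValueError(f"Turn {index} has no text")
    return script
```

The returned dictionary has the shape that the Text-to-Speech step expects as its input.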

Next, feed the script generated by Gemini to the Text-to-Speech API. Choose a voice and language appropriate for your target audience and content.

A function like the one below can generate natural-sounding audio from the script using the Text-to-Speech API in Google Cloud.

from googleapiclient.discovery import build

def generate_audio_from_text(input_json):
   """Generates audio using Google Text-to-Speech API.
   Args:
       input_json: A dictionary containing the 'multiSpeakerMarkup' for the TTS API, parsed from the JSON generated by Gemini 1.5 Pro in the generate_podcast_content() function.
   Returns:
       The base64-encoded MP3 audio content (a string) if successful, None otherwise.
   """
   try:
       # Build the Text-to-Speech service
       service = build('texttospeech', 'v1beta1')
       # Prepare synthesis input
       synthesis_input = {
           'multiSpeakerMarkup': input_json['multiSpeakerMarkup']
       }
       # Configure voice and audio settings
       voice = {
           'languageCode': 'en-US',
           'name': 'en-US-Studio-MultiSpeaker'
       }
       audio_config = {
           'audioEncoding': 'MP3',
           'pitch': 0,
           'speakingRate': 0,
           'effectsProfileId': ['small-bluetooth-speaker-class-device']
       }
       # Make the API request
       response = service.text().synthesize(
           body={
               'input': synthesis_input,
               'voice': voice,
               'audioConfig': audio_config
           }
       ).execute()
       # Extract and return audio content
       audio_content = response['audioContent']
       return audio_content
   except Exception as e:
       print(f"Error during speech synthesis: {e}")
       return None

Finally, the audio content returned by the API is base64-encoded MP3 data. To store it in Google Cloud Storage, you can use the google-cloud-storage Python library: decode the base64 string and upload the resulting bytes directly to a designated bucket, specifying the content type as 'audio/mpeg'.
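
The storage step can be sketched as follows. This is a minimal example under stated assumptions, not the code used for this post: decode_audio and upload_audio are hypothetical helper names, the bucket and object names are supplied by the caller, and credentials come from Application Default Credentials.

```python
import base64


def decode_audio(audio_content_b64: str) -> bytes:
    """Decode the base64 string returned by the synthesize call into MP3 bytes."""
    return base64.b64decode(audio_content_b64)


def upload_audio(audio_content_b64: str, bucket_name: str, blob_name: str) -> str:
    """Upload the decoded MP3 bytes to Cloud Storage and return the gs:// URI."""
    # Imported here so decode_audio() works without google-cloud-storage installed.
    from google.cloud import storage

    client = storage.Client()  # uses Application Default Credentials
    blob = client.bucket(bucket_name).blob(blob_name)
    # audio/mpeg is the registered MIME type for MP3 data.
    blob.upload_from_string(decode_audio(audio_content_b64), content_type="audio/mpeg")
    return f"gs://{bucket_name}/{blob_name}"
```

Calling upload_audio with the value returned by generate_audio_from_text(), a bucket name, and an object name such as "episode-1.mp3" would store one episode per object.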

Hear it for yourself

While the Text-to-Speech API produces high-quality audio, you can further enhance your audio conversation with background music, sound effects, and professional editing tools. Hear it for yourself: download the audio conversation I created from this blog using Gemini 1.5 Pro and the Text-to-Speech API.

To start creating for yourself, explore our full suite of audio generation features using Google Cloud services, such as the Text-to-Speech API and Gemini models, using the free tier. We recommend experimenting with different modalities like text and image prompts to experience Gemini’s potential for content creation.
