
Issues in combining 26c-gemini-multimodal-live-video.py and 21-tavus-layer.py #979

Open
sandeepkumarsuresh opened this issue Jan 13, 2025 · 0 comments

Description

Bug

Environment

  • pipecat-ai version: 0.0.52
  • python version: 3.11.5
  • OS: Ubuntu 22.04.3 LTS

Issue description

I am trying to combine the Gemini multimodal video backend with the Tavus avatar frontend. When running the attached code, I can see 5 participants in the call. Furthermore, when asked "How many video feeds do you see?", it replied that it sees two video feeds: one of the human participant and the other of the Tavus AI avatar.

Repro steps

Run the attached pipeline code

Expected behavior

  1. Ideally, the Gemini backend should see a single video feed: that of the human participant.
  2. The number of participants in the call should be 2: the Gemini backend with the Tavus avatar frontend as the first participant, and the human participant as the second.

Actual behavior

  1. Two video feeds are input to the Gemini backend.
  2. The number of participants in the call is 5.

Code

import asyncio
import os
import sys
from typing import Any, Mapping

import aiohttp
from dotenv import load_dotenv
from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import EndFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.tavus import TavusVideoService
from pipecat.transports.services.daily import DailyParams, DailyTransport

from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService

load_dotenv(override=True)

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")


async def main():
    async with aiohttp.ClientSession() as session:
        tavus = TavusVideoService(
            api_key=os.getenv("TAVUS_API_KEY"),
            replica_id=os.getenv("TAVUS_REPLICA_ID"),
            session=session,
        )

        # get persona, look up persona_name, set this as the bot name to ignore
        persona_name = await tavus.get_persona_name()
        room_url = await tavus.initialize()

        transport = DailyTransport(
            room_url=room_url,
            token=None,
            bot_name="pipecat0",
            params=DailyParams(
                audio_in_sample_rate=16000,
                audio_out_sample_rate=24000,
                audio_out_enabled=True,
                vad_enabled=True,
                vad_audio_passthrough=True,
                vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.5)),
                start_audio_paused=True,
                start_video_paused=True,
            ),
        )

        llm = GeminiMultimodalLiveLLMService(
            api_key=os.getenv("GOOGLE_API_KEY"),
            voice_id="Charon",  # Puck, Charon, Kore, Fenrir, Aoede
            # system_instruction="Talk like a pirate."
            transcribe_user_audio=True,
            transcribe_model_audio=True,
            # inference_on_context_initialization=False,
        )

        messages = [
            {
                "role": "system",
                "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
            },
        ]

        context = OpenAILLMContext(messages)
        context_aggregator = llm.create_context_aggregator(context)

        pipeline = Pipeline(
            [
                transport.input(),  # Transport user input
                context_aggregator.user(),  # User responses
                llm,  # LLM
                tavus,  # Tavus output layer
                transport.output(),  # Transport bot output
                context_aggregator.assistant(),  # Assistant spoken responses
            ]
        )

        task = PipelineTask(
            pipeline,
            PipelineParams(
                allow_interruptions=True,
                enable_metrics=True,
                enable_usage_metrics=True,
                report_only_initial_ttfb=True,
            ),
        )

        @transport.event_handler("on_participant_joined")
        async def on_first_participant_joined(transport, participant):

            # Ignore the Tavus replica's microphone
            if participant.get("info", {}).get("userName", "") == persona_name:
                logger.debug(f"Ignoring {participant['id']}'s microphone")
                await transport.update_subscriptions(
                    participant_settings={
                        participant["id"]: {
                            "media": {"microphone": "unsubscribed"},
                        }
                    }
                )

            await transport.capture_participant_video(
                participant["id"], framerate=1, video_source="camera"
            )

            # await transport.capture_participant_video(
            #     participant["id"], framerate=1, video_source="screenVideo"
            # )

            await task.queue_frames([context_aggregator.user().get_context_frame()])
            await asyncio.sleep(3)
            logger.debug("Unpausing audio and video")
            llm.set_audio_input_paused(False)
            llm.set_video_input_paused(False)

            # if participant.get("info", {}).get("userName", "") != persona_name:
            #     # Kick off the conversation.
            #     messages.append(
            #         {"role": "system", "content": "Please introduce yourself to the user."}
            #     )
            #     await task.queue_frames([context_aggregator.user().get_context_frame()])

        @transport.event_handler("on_participant_left")
        async def on_participant_left(transport, participant, reason):
            await task.queue_frame(EndFrame())

        runner = PipelineRunner()

        await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())
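A possible workaround (not verified against the pipecat/Daily APIs, so treat it as a sketch) is to filter the `on_participant_joined` handler so that video capture is only started for the human participant, and the replica's camera track is unsubscribed along with its microphone. The helpers below are hypothetical; they only encode the filtering logic, assuming the Daily participant payload shape already used in the handler above (`participant["info"]["userName"]`):

```python
from typing import Any, Mapping


def is_tavus_replica(participant: Mapping[str, Any], persona_name: str) -> bool:
    """Return True when the joined participant is the Tavus avatar replica.

    Daily participant payloads expose the display name under
    participant["info"]["userName"], as used in the handler above.
    """
    return participant.get("info", {}).get("userName", "") == persona_name


def media_settings_for(participant: Mapping[str, Any], persona_name: str) -> dict:
    """Build the per-participant settings for update_subscriptions():
    unsubscribe from both the replica's microphone and camera so its
    video feed never reaches the Gemini backend."""
    if is_tavus_replica(participant, persona_name):
        return {"media": {"microphone": "unsubscribed", "camera": "unsubscribed"}}
    return {"media": "subscribed"}
```

With this predicate, the handler would call `capture_participant_video()` only when `is_tavus_replica(...)` is False, and pass `media_settings_for(...)` through `update_subscriptions()` for the replica. Whether `"camera": "unsubscribed"` is accepted by the installed pipecat/Daily version is an assumption worth checking; it would not by itself explain the participant count of 5.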

Logs

2025-01-13 15:28:52.068 | DEBUG    | pipecat.services.tavus:get_persona_name:75 - TavusVideoService persona grabbed {'persona_id': 'pipecat0', 'persona_name': 'Ari', 'pipeline_mode': 'echo', 'system_prompt': ' ', 'context': None, 'layers': {'transport': {'room_settings': {'enable_chat': True, 'start_audio_off': False, 'start_video_off': False, 'enable_people_ui': True, 'enable_network_ui': True, 'enable_noise_cancellation_ui': True}, 'input_settings': {'microphone': 'disabled'}}}, 'default_replica_id': None, 'created_at': '2024-10-18T23:20:28.943Z', 'updated_at': '2024-10-18T23:20:29.005Z'}
2025-01-13 15:28:55.755 | DEBUG    | pipecat.services.tavus:initialize:61 - TavusVideoService joined https://tavus.daily.co/cb4ba4eb7ea7
2025-01-13 15:28:55.755 | INFO     | pipecat.audio.vad.vad_analyzer:set_params:69 - Setting VAD params to: confidence=0.7 start_secs=0.2 stop_secs=0.5 min_volume=0.6
2025-01-13 15:28:55.755 | DEBUG    | pipecat.audio.vad.silero:__init__:113 - Loading Silero VAD model...
2025-01-13 15:28:55.842 | DEBUG    | pipecat.audio.vad.silero:__init__:135 - Loaded Silero VAD
2025-01-13 15:28:55.845 | DEBUG    | pipecat.services.gemini_multimodal_live.gemini:upgrade:62 - Upgrading to Gemini Multimodal Live Context: <pipecat.processors.aggregators.openai_llm_context.OpenAILLMContext object at 0x76d1bacf5b50>
2025-01-13 15:28:55.846 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking PipelineSource#0 -> DailyInputTransport#0
2025-01-13 15:28:55.846 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking DailyInputTransport#0 -> GeminiMultimodalLiveUserContextAggregator#0
2025-01-13 15:28:55.846 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking GeminiMultimodalLiveUserContextAggregator#0 -> GeminiMultimodalLiveLLMService#0
2025-01-13 15:28:55.846 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking GeminiMultimodalLiveLLMService#0 -> TavusVideoService#0
2025-01-13 15:28:55.846 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking TavusVideoService#0 -> DailyOutputTransport#0
2025-01-13 15:28:55.846 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking DailyOutputTransport#0 -> GeminiMultimodalLiveAssistantContextAggregator#0
2025-01-13 15:28:55.846 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking GeminiMultimodalLiveAssistantContextAggregator#0 -> PipelineSink#0
2025-01-13 15:28:55.846 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking Source#0 -> Pipeline#0
2025-01-13 15:28:55.846 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking Pipeline#0 -> Sink#0
2025-01-13 15:28:55.847 | DEBUG    | pipecat.pipeline.runner:run:27 - Runner PipelineRunner#0 started running PipelineTask#0
2025-01-13 15:28:55.847 | INFO     | pipecat.transports.services.daily:join:322 - Joining https://tavus.daily.co/cb4ba4eb7ea7
2025-01-13 15:28:58.773 | INFO     | pipecat.transports.services.daily:join:340 - Joined https://tavus.daily.co/cb4ba4eb7ea7
2025-01-13 15:28:59.462 | INFO     | pipecat.transports.services.daily:on_participant_joined:620 - Participant joined 68679e05-7806-4050-bcaa-7362b565ce81
2025-01-13 15:28:59.463 | DEBUG    | __main__:on_first_participant_joined:112 - Ignoring 68679e05-7806-4050-bcaa-7362b565ce81's microphone
2025-01-13 15:29:02.465 | DEBUG    | __main__:on_first_participant_joined:131 - Unpausing audio and video
2025-01-13 15:29:02.488 | INFO     | pipecat.services.gemini_multimodal_live.gemini:_connect:364 - Connecting to Gemini service
2025-01-13 15:29:02.488 | INFO     | pipecat.services.gemini_multimodal_live.gemini:_connect:372 - Connecting to wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=&lt;redacted&gt;
2025-01-13 15:29:04.039 | DEBUG    | pipecat.services.gemini_multimodal_live.gemini:_connect:401 - Setting system instruction: 
You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.
2025-01-13 15:29:04.797 | INFO     | pipecat.transports.services.daily:on_participant_joined:620 - Participant joined f9d51654-5192-40b0-9aba-ff6506ba4ab4
2025-01-13 15:29:07.799 | DEBUG    | __main__:on_first_participant_joined:131 - Unpausing audio and video
2025-01-13 15:29:18.848 | DEBUG    | pipecat.transports.base_input:_handle_interruptions:124 - User started speaking
2025-01-13 15:29:21.166 | DEBUG    | pipecat.transports.base_input:_handle_interruptions:131 - User stopped speaking
2025-01-13 15:29:24.156 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:start_llm_usage_metrics:73 - GeminiMultimodalLiveLLMService#0 prompt tokens: 406, completion tokens: 9
2025-01-13 15:29:24.156 | DEBUG    | pipecat.services.gemini_multimodal_live.gemini:_handle_transcribe_user_audio:270 - [Transcription:user] How many video feed can you see?

2025-01-13 15:29:28.105 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:50 - TavusVideoService#0 TTFB: 6.0512449741363525
2025-01-13 15:29:28.106 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_processing_metrics:65 - TavusVideoService#0 processing time: 6.05156135559082
2025-01-13 15:29:30.243 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:start_llm_usage_metrics:73 - GeminiMultimodalLiveLLMService#0 prompt tokens: 620, completion tokens: 22
2025-01-13 15:29:30.243 | DEBUG    | pipecat.services.gemini_multimodal_live.gemini:_handle_transcribe_model_audio:278 - [Transcription:model] I can see two video feeds. One shows a person with blond hair and the other shows two people.

[Screenshot from 2025-01-13 15-29-17]
