I am trying to combine the gemini-multimodal-live backend with the Tavus avatar frontend. When running the attached code, I can see 5 participants in the call. Furthermore, when asked "How many video feeds do you see?", it replied that it sees two video feeds: one of the human participant and one of the Tavus AI avatar.
Repro steps
Run the attached pipeline code
Expected behavior
Ideally, the Gemini backend should see a single video feed: the human participant's.
The number of participants in the call should be 2: the Gemini backend with the Tavus avatar frontend as the first participant, and the human participant as the second.
Actual behavior
Two video feeds are fed into the Gemini backend.
The number of participants in the call is 5.
Code
import asyncio
import os
import sys
from typing import Any, Mapping

import aiohttp
from dotenv import load_dotenv
from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import EndFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.tavus import TavusVideoService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService

load_dotenv(override=True)

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")


async def main():
    async with aiohttp.ClientSession() as session:
        tavus = TavusVideoService(
            api_key=os.getenv("TAVUS_API_KEY"),
            replica_id=os.getenv("TAVUS_REPLICA_ID"),
            session=session,
        )

        # get persona, look up persona_name, set this as the bot name to ignore
        persona_name = await tavus.get_persona_name()
        room_url = await tavus.initialize()

        transport = DailyTransport(
            room_url=room_url,
            token=None,
            bot_name="pipecat0",
            params=DailyParams(
                audio_in_sample_rate=16000,
                audio_out_sample_rate=24000,
                audio_out_enabled=True,
                vad_enabled=True,
                vad_audio_passthrough=True,
                vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.5)),
                start_audio_paused=True,
                start_video_paused=True,
            ),
        )

        llm = GeminiMultimodalLiveLLMService(
            api_key=os.getenv("GOOGLE_API_KEY"),
            voice_id="Charon",  # Puck, Charon, Kore, Fenrir, Aoede
            # system_instruction="Talk like a pirate.",
            transcribe_user_audio=True,
            transcribe_model_audio=True,
            # inference_on_context_initialization=False,
        )

        messages = [
            {
                "role": "system",
                "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
            },
        ]

        context = OpenAILLMContext(messages)
        context_aggregator = llm.create_context_aggregator(context)

        pipeline = Pipeline(
            [
                transport.input(),  # Transport user input
                context_aggregator.user(),  # User responses
                llm,  # LLM
                tavus,  # Tavus output layer
                transport.output(),  # Transport bot output
                context_aggregator.assistant(),  # Assistant spoken responses
            ]
        )

        task = PipelineTask(
            pipeline,
            PipelineParams(
                allow_interruptions=True,
                enable_metrics=True,
                enable_usage_metrics=True,
                report_only_initial_ttfb=True,
            ),
        )

        @transport.event_handler("on_participant_joined")
        async def on_first_participant_joined(transport, participant):
            # async def on_participant_joined(
            #     transport: DailyTransport, participant: Mapping[str, Any]
            # ) -> None:
            # Ignore the Tavus replica's microphone
            if participant.get("info", {}).get("userName", "") == persona_name:
                logger.debug(f"Ignoring {participant['id']}'s microphone")
                await transport.update_subscriptions(
                    participant_settings={
                        participant["id"]: {
                            "media": {"microphone": "unsubscribed"},
                        }
                    }
                )

            await transport.capture_participant_video(
                participant["id"], framerate=1, video_source="camera"
            )
            # await transport.capture_participant_video(
            #     participant["id"], framerate=1, video_source="screenVideo"
            # )

            await task.queue_frames([context_aggregator.user().get_context_frame()])
            await asyncio.sleep(3)
            logger.debug("Unpausing audio and video")
            llm.set_audio_input_paused(False)
            llm.set_video_input_paused(False)

            # if participant.get("info", {}).get("userName", "") != persona_name:
            #     # Kick off the conversation.
            #     messages.append(
            #         {"role": "system", "content": "Please introduce yourself to the user."}
            #     )
            #     await task.queue_frames([context_aggregator.user().get_context_frame()])

        @transport.event_handler("on_participant_left")
        async def on_participant_left(transport, participant, reason):
            await task.queue_frame(EndFrame())

        runner = PipelineRunner()
        await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())
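A likely contributor to the two-feed behavior (an assumption, not confirmed): in the code above, `capture_participant_video` sits outside the persona-name check, so the handler captures the camera of every participant that joins, including the Tavus replica. The logs bear this out, since "Unpausing audio and video" fires for both join events. A minimal, self-contained sketch of gating the capture on the persona name (`should_capture_video` is a hypothetical helper written here for illustration, not part of pipecat; the participant dicts mimic the shape Daily reports):

```python
def should_capture_video(participant: dict, persona_name: str) -> bool:
    """Return True only for participants whose userName differs from the
    Tavus persona name, i.e. human participants whose camera should be
    fed to Gemini."""
    user_name = participant.get("info", {}).get("userName", "")
    return user_name != persona_name


# Example participants shaped like the Daily transport's participant dicts.
human = {"id": "abc", "info": {"userName": "Alice"}}
replica = {"id": "def", "info": {"userName": "Ari"}}

print(should_capture_video(human, "Ari"))    # → True
print(should_capture_video(replica, "Ari"))  # → False
```

In the handler, the `capture_participant_video` call would then run only when this check passes, so the replica's camera never reaches the Gemini backend.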
Logs
2025-01-13 15:28:52.068 | DEBUG | pipecat.services.tavus:get_persona_name:75 - TavusVideoService persona grabbed {'persona_id': 'pipecat0', 'persona_name': 'Ari', 'pipeline_mode': 'echo', 'system_prompt': ' ', 'context': None, 'layers': {'transport': {'room_settings': {'enable_chat': True, 'start_audio_off': False, 'start_video_off': False, 'enable_people_ui': True, 'enable_network_ui': True, 'enable_noise_cancellation_ui': True}, 'input_settings': {'microphone': 'disabled'}}}, 'default_replica_id': None, 'created_at': '2024-10-18T23:20:28.943Z', 'updated_at': '2024-10-18T23:20:29.005Z'}
2025-01-13 15:28:55.755 | DEBUG | pipecat.services.tavus:initialize:61 - TavusVideoService joined https://tavus.daily.co/cb4ba4eb7ea7
2025-01-13 15:28:55.755 | INFO | pipecat.audio.vad.vad_analyzer:set_params:69 - Setting VAD params to: confidence=0.7 start_secs=0.2 stop_secs=0.5 min_volume=0.6
2025-01-13 15:28:55.755 | DEBUG | pipecat.audio.vad.silero:__init__:113 - Loading Silero VAD model...
2025-01-13 15:28:55.842 | DEBUG | pipecat.audio.vad.silero:__init__:135 - Loaded Silero VAD
2025-01-13 15:28:55.845 | DEBUG | pipecat.services.gemini_multimodal_live.gemini:upgrade:62 - Upgrading to Gemini Multimodal Live Context: <pipecat.processors.aggregators.openai_llm_context.OpenAILLMContext object at 0x76d1bacf5b50>
2025-01-13 15:28:55.846 | DEBUG | pipecat.processors.frame_processor:link:150 - Linking PipelineSource#0 -> DailyInputTransport#0
2025-01-13 15:28:55.846 | DEBUG | pipecat.processors.frame_processor:link:150 - Linking DailyInputTransport#0 -> GeminiMultimodalLiveUserContextAggregator#0
2025-01-13 15:28:55.846 | DEBUG | pipecat.processors.frame_processor:link:150 - Linking GeminiMultimodalLiveUserContextAggregator#0 -> GeminiMultimodalLiveLLMService#0
2025-01-13 15:28:55.846 | DEBUG | pipecat.processors.frame_processor:link:150 - Linking GeminiMultimodalLiveLLMService#0 -> TavusVideoService#0
2025-01-13 15:28:55.846 | DEBUG | pipecat.processors.frame_processor:link:150 - Linking TavusVideoService#0 -> DailyOutputTransport#0
2025-01-13 15:28:55.846 | DEBUG | pipecat.processors.frame_processor:link:150 - Linking DailyOutputTransport#0 -> GeminiMultimodalLiveAssistantContextAggregator#0
2025-01-13 15:28:55.846 | DEBUG | pipecat.processors.frame_processor:link:150 - Linking GeminiMultimodalLiveAssistantContextAggregator#0 -> PipelineSink#0
2025-01-13 15:28:55.846 | DEBUG | pipecat.processors.frame_processor:link:150 - Linking Source#0 -> Pipeline#0
2025-01-13 15:28:55.846 | DEBUG | pipecat.processors.frame_processor:link:150 - Linking Pipeline#0 -> Sink#0
2025-01-13 15:28:55.847 | DEBUG | pipecat.pipeline.runner:run:27 - Runner PipelineRunner#0 started running PipelineTask#0
2025-01-13 15:28:55.847 | INFO | pipecat.transports.services.daily:join:322 - Joining https://tavus.daily.co/cb4ba4eb7ea7
2025-01-13 15:28:58.773 | INFO | pipecat.transports.services.daily:join:340 - Joined https://tavus.daily.co/cb4ba4eb7ea7
2025-01-13 15:28:59.462 | INFO | pipecat.transports.services.daily:on_participant_joined:620 - Participant joined 68679e05-7806-4050-bcaa-7362b565ce81
2025-01-13 15:28:59.463 | DEBUG | __main__:on_first_participant_joined:112 - Ignoring 68679e05-7806-4050-bcaa-7362b565ce81's microphone
2025-01-13 15:29:02.465 | DEBUG | __main__:on_first_participant_joined:131 - Unpausing audio and video
2025-01-13 15:29:02.488 | INFO | pipecat.services.gemini_multimodal_live.gemini:_connect:364 - Connecting to Gemini service
2025-01-13 15:29:02.488 | INFO | pipecat.services.gemini_multimodal_live.gemini:_connect:372 - Connecting to wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=AIzaSyA9hVJRnToRUhSBgM6rTKsTTewLcdQcbns
2025-01-13 15:29:04.039 | DEBUG | pipecat.services.gemini_multimodal_live.gemini:_connect:401 - Setting system instruction:
You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.
2025-01-13 15:29:04.797 | INFO | pipecat.transports.services.daily:on_participant_joined:620 - Participant joined f9d51654-5192-40b0-9aba-ff6506ba4ab4
2025-01-13 15:29:07.799 | DEBUG | __main__:on_first_participant_joined:131 - Unpausing audio and video
2025-01-13 15:29:18.848 | DEBUG | pipecat.transports.base_input:_handle_interruptions:124 - User started speaking
2025-01-13 15:29:21.166 | DEBUG | pipecat.transports.base_input:_handle_interruptions:131 - User stopped speaking
2025-01-13 15:29:24.156 | DEBUG | pipecat.processors.metrics.frame_processor_metrics:start_llm_usage_metrics:73 - GeminiMultimodalLiveLLMService#0 prompt tokens: 406, completion tokens: 9
2025-01-13 15:29:24.156 | DEBUG | pipecat.services.gemini_multimodal_live.gemini:_handle_transcribe_user_audio:270 - [Transcription:user] How many video feed can you see?
2025-01-13 15:29:28.105 | DEBUG | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:50 - TavusVideoService#0 TTFB: 6.0512449741363525
2025-01-13 15:29:28.106 | DEBUG | pipecat.processors.metrics.frame_processor_metrics:stop_processing_metrics:65 - TavusVideoService#0 processing time: 6.05156135559082
2025-01-13 15:29:30.243 | DEBUG | pipecat.processors.metrics.frame_processor_metrics:start_llm_usage_metrics:73 - GeminiMultimodalLiveLLMService#0 prompt tokens: 620, completion tokens: 22
2025-01-13 15:29:30.243 | DEBUG | pipecat.services.gemini_multimodal_live.gemini:_handle_transcribe_model_audio:278 - [Transcription:model] I can see two video feeds. One shows a person with blond hair and the other shows two people.
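If the goal is for the bot to never receive the replica's media at all, another option (an untested sketch; it assumes `update_subscriptions` accepts a `camera` key alongside `microphone` in the media settings, which should be verified against the Daily docs) is to extend the existing microphone unsubscribe to the camera as well. The payload-building part is plain dict construction:

```python
def build_unsubscribe_settings(participant_id: str) -> dict:
    """Build a participant_settings payload that unsubscribes from both the
    microphone and the camera of one participant. The shape is assumed to
    match daily-python's update_subscriptions; verify before relying on it."""
    return {
        participant_id: {
            "media": {
                "microphone": "unsubscribed",
                "camera": "unsubscribed",
            }
        }
    }


# In the handler this would replace the microphone-only payload:
# await transport.update_subscriptions(
#     participant_settings=build_unsubscribe_settings(participant["id"])
# )
```

Note that this only stops the transport from subscribing to the replica's tracks; it does not by itself explain the participant count of 5, which may come from other clients connected to the Tavus-managed room.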