Remote fetch latency from LFS

  1. What happens:
    Two identical agent runs on the same event page using the same model (llama-3.1-8b-instant-groq) returned the same output, but with a massive performance difference.

Run ID: d00e8da7 took ~35 seconds to return the first token.
Run ID: a0386e03 completed in ~1.2 seconds.

  2. Intended behavior:
    Consistent response latency across equivalent inputs. Remote file fetch or model queueing shouldn’t introduce ~30 seconds of variability.

  3. Agent ID:
    d0f9126b-d4bb-4356-8283-50ae866c9ee6

  4. Attachment:
    See screenshot attached.

Thanks for getting in touch. Could you please also send this through to support@mindstudio.ai so we can take a deeper look into this?

Thank you!

Sent, thanks!

Hi there, I took a deeper look and it appears this was a service interruption caused by Groq—it looks like they had a network issue from which it took a moment to recover. You can see in the attached logs screenshot (these are logs I pulled directly from Groq, not logs from MindStudio) that the first request fails entirely, then is automatically retried and takes a while to complete.
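As a side note on why one failed request can balloon into tens of seconds: a typical retry loop waits out the first attempt’s failure, backs off, then starts over. Here is a generic Python sketch of that pattern; the `send_request` callable, attempt counts, and delays are all illustrative assumptions, not Groq’s or MindStudio’s actual retry logic:

```python
import random
import time

def call_with_retries(send_request, max_attempts=3):
    """Generic retry-with-backoff loop. If the first attempt hangs until a
    timeout before failing, the caller's perceived time-to-first-token is
    that whole timeout, plus the backoff delay, plus the retry's own TTFT."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send_request()  # hypothetical callable hitting the provider
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts; surface the failure
            # Exponential backoff with jitter: ~2s, ~4s, ... between attempts
            time.sleep(2 ** attempt + random.uniform(0, 1))
```

A first attempt that hangs for 20–30 seconds before failing, plus a backoff delay, plus a normal retry would be consistent with the ~35-second first token seen in run d00e8da7.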

Unfortunately, while we try our best to deliver a consistent experience, the model providers are dealing with a lot of demand and we tend to see these sorts of issues from time to time (e.g., take a look at the “API” section on Anthropic’s status page to see how frequently they have outages: https://status.anthropic.com/).

Please let me know if that answers your query or if there is anything else I can do to help! Thanks!

Thanks Sean, totally understand. To clarify, when I see “Received first response token” in the debugger, is this always referencing the model provider?

Correct! You can also see this in the Groq screenshot above as “TTFT” (time to first token). The latency column is then TTFT plus however long it takes to generate the full result. A high TTFT usually means there are availability or other network issues affecting the provider, i.e., the model provider is struggling to get the request started; low TTFT but high latency just means the model is taking a long time to do its work.

“Running model” = “MindStudio has sent the request to the model provider and the model provider has acknowledged that they have received the request”
“Received first response token” = “We’ve started getting result data back from the model provider”
“Received full result…” = “Model provider has finished responding and MindStudio is now processing the result”
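To make those stages concrete, here is a rough Python sketch that times all three for a streaming model call. `stream_chunks` is a hypothetical iterable standing in for whatever streaming client you use (e.g., an OpenAI-compatible SDK pointed at Groq); nothing here is MindStudio’s internal instrumentation:

```python
import time

def measure_stages(stream_chunks):
    """Time the three debugger stages around a streaming model response."""
    sent_at = time.monotonic()  # "Running model": request sent to the provider
    first_token_at = None
    pieces = []
    for chunk in stream_chunks:
        if first_token_at is None:
            first_token_at = time.monotonic()  # "Received first response token"
        pieces.append(chunk)
    done_at = time.monotonic()  # "Received full result...": stream finished

    if first_token_at is None:
        raise RuntimeError("stream produced no chunks")
    ttft = first_token_at - sent_at
    total = done_at - sent_at
    print(f"TTFT: {ttft:.2f}s | total latency: {total:.2f}s | "
          f"generation: {total - ttft:.2f}s")
    return "".join(pieces)
```

High TTFT with a normal generation time points at provider availability or network trouble; low TTFT with high total latency just means a long generation.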
