Voice Tasks

Siri for Sites using MCP

TL;DR

Let people talk to your web app. The SDK records, transcribes in under a second, and dispatches your MCP tools. You only define what each tool does on the server.

Over the past two weeks I built a proof of concept that shows how MCP slots into any website, especially if you have complex workflows, or e-commerce flows where a single call to action limits cross-sell opportunities.

Voice intent has mostly lived on phones with Siri, or desktop apps that transcribe voice into prompts (my daily workflow is Superwhisper to Cursor). I wanted to know if the same back-and-forth could happen inside the browser: not just recording audio, but parsing intent and executing it Siri-style.

MCP ended up being the missing piece. On the web you can blend user context (current page), recent history (last action), and chained tasks ("do this, then that"). That lets someone say "buy this", "take me back", or "add credit, then send me a receipt as PDF", without learning your UI first.

Here I share two demos: the maze below shows raw voice control moving a ball to the goal (try to win, see what happens), and the video above walks through how I navigate the Memoreco prototype, add credit, and send video recording links just by speaking. Both use the same SDK.

On the backend I analyze every attempted command, cluster the misses, and show you which MCP tools to add next. The goal is to keep improving the voice catalog based on what people say.

I am adding more use cases soon. If you want updates, hop on the waitlist or say hi @andupoto.

Hands-on Demo

Move the ball with your voice

Hold P and say "move the ball two boxes down and one left", and watch the MCP move the ball.

Suggested prompts

  • "Move the ball three boxes up."
  • "Move the ball two boxes right."
  • "Move one box down."

Under the hood the maze is powered by the same stack you see on Memoreco's dashboard: the SDK opens an audio-only session, streams it to Speechmatics for sub-second transcription, hands the text to Groq for intent parsing, and calls an MCP server you configure.
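For a concrete picture of that last hop, here is a minimal sketch of what the maze's MCP server could look like, written against the official @modelcontextprotocol/sdk package with zod schemas. The move_ball name, the grid size, and the stdio transport are illustrative assumptions, not the demo's actual implementation; the live demo exposes its server over HTTPS instead.

// Minimal sketch of a maze MCP server (illustrative, not the demo's actual code).
// Assumes the official TypeScript MCP SDK and zod; stdio transport shown for brevity,
// a production setup would use the SDK's HTTP transport behind a public HTTPS endpoint.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const GRID = 8;                  // hypothetical board size
let ball = { x: 0, y: 0 };       // in-memory position, enough for a sketch

const server = new McpServer({ name: "maze-demo", version: "0.1.0" });

server.tool(
  "move_ball",
  "Move the ball a number of boxes horizontally and vertically",
  { dx: z.number().int(), dy: z.number().int() },
  async ({ dx, dy }) => {
    // Clamp to the board so "move ten boxes left" cannot leave the maze
    ball = {
      x: Math.min(GRID - 1, Math.max(0, ball.x + dx)),
      y: Math.min(GRID - 1, Math.max(0, ball.y + dy)),
    };
    // MCP tools reply with structured content the SDK hands back to the page
    return { content: [{ type: "text", text: JSON.stringify(ball) }] };
  }
);

await server.connect(new StdioServerTransport());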

Replay the pipeline timeline

The simulator below replays the five pipeline stages. Click any step to inspect the SDK's state.

  • Go to recordings
  • Create a recording request and send it to john@example.com
  • Take me to billing and add $10 credit
  • Take me back

Press Enter to process the command.

Pipeline

  1. Record: The SDK opens an audio-only session and records a short clip.
  2. Transcribe: Streaming speech-to-text returns words within a few hundred milliseconds.
  3. Parse: AI Assist extracts parameters and the LLM selects a tool.
  4. Execute: Your MCP endpoint receives the tool name and parameters.
  5. Result: A structured payload comes back to your UI for the next step.

Results

After each simulated command, the results panel shows the pipeline status, the spoken command, the transcript, the parsed intent, and the MCP tool result.
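For a command like "Take me to billing and add $10 credit", those structures might look roughly like this. The field names for the parsed intent are illustrative; the result shape simply mirrors what the quick-start example further down reads (result.nextAction, result.data.path).

// Hypothetical parsed intent (stage 3): two chained tool calls.
const parsedIntent = [
  { tool: "navigate", parameters: { path: "/billing" } },
  { tool: "add_credit", parameters: { amountUsd: 10 } },
];

// Hypothetical tool result (stage 5), shaped like the payload handled in the quick start.
const toolResult = {
  result: {
    nextAction: "navigate",
    data: { path: "/billing" },
  },
};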

Voice commands feel natural

Voice commands understand context, not just words. The system knows where the user is, which resource they've selected, and any metadata you decide to pass in. So when someone says "share this recording", it already knows what "this" means.
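Server-side, one simple way to consume that context is to let it arrive as tool parameters. A sketch on the official MCP SDK: the share_recording tool, the recordingId parameter, and createShareLink are hypothetical names for illustration; the idea is that the SDK and LLM fill recordingId in from page context, so the user never has to say it.

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "voice-tasks-demo", version: "0.1.0" });

// Placeholder for your own sharing logic (lookup, signed URL, email send, ...)
async function createShareLink(recordingId, channel) {
  return `https://your-domain.example/share/${recordingId}?via=${channel}`;
}

// Hypothetical context-aware tool: "share this recording".
// recordingId is what "this" resolves to; it comes from the context metadata
// the client forwards, not from the spoken words themselves.
server.tool(
  "share_recording",
  "Share the recording the user is currently viewing",
  {
    recordingId: z.string(),
    channel: z.enum(["email", "link"]).default("link"),
  },
  async ({ recordingId, channel }) => {
    const url = await createShareLink(recordingId, channel);
    return { content: [{ type: "text", text: url }] };
  }
);
// Connect a transport as in the maze sketch above to serve it.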

Every interaction is stored in a lightweight history layer. That's how "undo that" and "go back" work: a simple lookup into that history, paired with the undo logic you provide in your MCP server.
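Nothing about that layer is exotic. Here is a sketch of the idea, independent of the SDK's actual internals: record each executed action together with a function that reverses it, and let "undo that" pop the stack.

// Illustrative history layer; names and shape are assumptions, not the SDK's internals.
const history = [];

// Call this after each successful tool execution, passing the inverse you know how to run.
function remember(tool, params, undoFn) {
  history.push({ tool, params, undoFn, at: Date.now() });
}

// "Undo that" / "go back" is then a lookup into the stack plus your own undo logic.
async function undoLast() {
  const last = history.pop();
  if (!last) return { undone: false, reason: "nothing to undo" };
  await last.undoFn();
  return { undone: true, tool: last.tool };
}

// Example: after add_credit succeeds, register how to reverse it.
remember("add_credit", { amountUsd: 10 }, async () => {
  // call your own reversal endpoint here
});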

As shown in the dashboard walkthrough, the LLM turns speech into a sequence of actions, the SDK pauses when confirmation is needed, and each MCP response can trigger the next step.

Quick start in code

The SDK is currently published as @memoreco/memoreco-js@0.2.3. Drop it into your app, point it at your MCP endpoints, and listen for the structured replies. Reach out for API access.

20-line integration
// using @memoreco/memoreco-js v0.2.3
import { MemorecoProvider, VoiceTasks } from "@memoreco/memoreco-js";

export function VoiceTasksQuickStart({ apiKey, apiBaseUrl, mcpServerUrl, onNavigate }) {
  return (
    <MemorecoProvider
      // Scoped key with voice + transcription permissions only (create it server-side)
      apiKey={apiKey}
      config={{
        voiceTasks: {
          // Push-to-talk mirrors the UX in the demo and keeps the mic cold by default
          activationMode: "push-to-talk",
          // Streaming keeps transcripts under a second while still falling back to bulk automatically
          transcriptionMode: "streaming",
          // Public HTTPS endpoint on YOUR infrastructure that speaks the MCP protocol
          mcpServerUrl,
        },
      }}
    >
      <VoiceTasks
        eventHandlers={{
          // React to structured replies from your MCP tools
          onExecutionComplete: (payload) => {
            if (payload.result?.nextAction === "navigate") {
              // Forward the recommended path to your router (Next.js, Remix, etc.)
              console.info("VoiceTasks navigation", payload.result.data.path);
              onNavigate?.(payload.result.data.path);
            }
          },
        }}
      />
    </MemorecoProvider>
  );
}
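Wiring the component into an app is then one router call. A sketch assuming a Next.js client component; the DashboardVoice name, the scopedVoiceKey prop, and the placeholder URL are mine, not part of the SDK.

"use client";

// Illustrative wiring only; swap useRouter for whatever navigation your framework provides.
import { useRouter } from "next/navigation";
import { VoiceTasksQuickStart } from "./VoiceTasksQuickStart";

export default function DashboardVoice({ scopedVoiceKey }) {
  // scopedVoiceKey: the server-created, voice-only API key from the comment in the quick start
  const router = useRouter();
  return (
    <VoiceTasksQuickStart
      apiKey={scopedVoiceKey}
      mcpServerUrl="https://your-domain.example/mcp" // your public MCP endpoint (placeholder)
      onNavigate={(path) => router.push(path)}       // follow the tool's suggested path
    />
  );
}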

Multi-language support

If you want multi-language support, flip transcriptionMode to "streaming" and specify the language when you initialise the provider. Everything else comes built-in.
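As a rough sketch of what that configuration could look like; the language key below is an assumed option name, so check the SDK docs for the real one.

// Hypothetical config shape for a German-speaking user; "language" is an assumption.
const germanVoiceTasks = {
  voiceTasks: {
    activationMode: "push-to-talk",
    transcriptionMode: "streaming",  // streaming is what enables multi-language here
    language: "de",                  // assumed option name for the spoken language
    mcpServerUrl: "https://your-domain.example/mcp",
  },
};

// Pass it to the provider instead of the inline object from the quick start:
// <MemorecoProvider apiKey={apiKey} config={germanVoiceTasks}>...</MemorecoProvider>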

Let's add voice commands to your product

The SDK is ready, the MCP templates are reusable, and the Speechmatics transcription is live. DM me and we'll set up Voice Tasks for your site, or share the explainer with your team if you want to spark ideas internally.