
feat(insights): add OpenAI provider for transcription and translation#848

Open
xynstr wants to merge 2 commits into mynaparrot:main from xynstr:feat/openai-provider

Conversation


@xynstr xynstr commented Apr 28, 2026

Replaces #844 with the implementation direction you suggested there:

I saw the PR but it depends on your own implementation in another language, which I'm not familiar with. Also, I don't want to introduce another language. We plan to develop an OpenAI API integration, which should solve this type of problem. If you're interested in implementing it using a Go SDK, e.g. https://github.com/openai/openai-go, I'll be happy to work with you.

This is that rework. The provider is pure Go, built on github.com/openai/openai-go/v3. No Python, no helper services, no extra build stages — everything lives inside pkg/insights/providers/openai/.

What it does

  • Transcription: POST /v1/audio/transcriptions, chunked from PCM16 supplied via TranscriptionStream.WriteSample. Each chunk is encoded as an in-memory WAV and uploaded; one final_result event is emitted per chunk. Realtime partials would require the websocket Realtime API, which openai-go does not yet cover and most self-hosted backends do not support, so this PR ships chunked-only on purpose, and Realtime streaming can land as a follow-up once the SDK adds it.
  • Translation: POST /v1/chat/completions with a response_format JSON Schema whose properties are the requested target language codes. One round-trip yields all target translations; Strict mode forces the schema. I deliberately picked Chat Completions over the newer Responses API so the same code path works against any OpenAI-compatible self-hosted backend (LocalAI, vLLM, llama.cpp-server, Ollama, …); Responses is OpenAI-only.
  • TTS / AI chat / batch summarization: stubbed for now; happy to add them in follow-up PRs if there is interest.
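To make the chunk-to-WAV step concrete, here is a minimal standalone sketch of wrapping a PCM16 buffer in a RIFF/WAV container before upload. The helper name `pcm16ToWAV` and its signature are illustrative, not the PR's actual code:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// pcm16ToWAV wraps raw little-endian PCM16 samples in a minimal WAV (RIFF)
// container so a chunk can be uploaded as a single in-memory file.
// Illustrative helper; the PR's real encoder may differ in detail.
func pcm16ToWAV(pcm []byte, sampleRate, channels int) []byte {
	var buf bytes.Buffer
	dataLen := uint32(len(pcm))
	byteRate := uint32(sampleRate * channels * 2) // 2 bytes per PCM16 sample
	blockAlign := uint16(channels * 2)

	buf.WriteString("RIFF")
	binary.Write(&buf, binary.LittleEndian, 36+dataLen) // file size minus 8
	buf.WriteString("WAVE")

	buf.WriteString("fmt ")
	binary.Write(&buf, binary.LittleEndian, uint32(16)) // fmt chunk size
	binary.Write(&buf, binary.LittleEndian, uint16(1))  // audio format: PCM
	binary.Write(&buf, binary.LittleEndian, uint16(channels))
	binary.Write(&buf, binary.LittleEndian, uint32(sampleRate))
	binary.Write(&buf, binary.LittleEndian, byteRate)
	binary.Write(&buf, binary.LittleEndian, blockAlign)
	binary.Write(&buf, binary.LittleEndian, uint16(16)) // bits per sample

	buf.WriteString("data")
	binary.Write(&buf, binary.LittleEndian, dataLen)
	buf.Write(pcm)
	return buf.Bytes()
}

func main() {
	// One second of 16 kHz mono silence: 32,000 bytes of PCM + 44-byte header.
	wav := pcm16ToWAV(make([]byte, 32000), 16000, 1)
	fmt.Println(len(wav))
}
```

With chunk_seconds: 5 and 16 kHz mono input, each upload is roughly 160 kB of PCM plus the fixed 44-byte header.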

Configuration

insights:
  enabled: true
  providers:
    openai:
      - id: "openai-cloud"
        credentials:
          api_key: "sk-..."
        # options.base_url is optional — set it to point at any
        # OpenAI-compatible endpoint (LocalAI, vLLM, llama.cpp-server, ...).
        options:
          chunk_seconds: 5
  services:
    transcription:
      provider: "openai"
      id: "openai-cloud"
      options:
        model: "gpt-4o-mini-transcribe"   # or whisper-1, gpt-4o-transcribe
    translation:
      provider: "openai"
      id: "openai-cloud"
      options:
        model: "gpt-4o-mini"

Knobs

  • providers.openai[].credentials.api_key — required.
  • providers.openai[].options.base_url — optional override; lets the same provider talk to a self-hosted OpenAI-API endpoint.
  • providers.openai[].options.chunk_seconds — transcription chunk duration (default 5).
  • services.<svc>.options.model — per-service model override.

Files

pkg/insights/providers/openai/client.go      168 +++
pkg/insights/providers/openai/transcribe.go  265 +++
pkg/insights/providers/openai/translate.go   145 +++
pkg/insights/providers/openai/languages.go    79 +++
pkg/services/insights/insights_service.go      3 +++  (case "openai")
go.mod / go.sum                                  +    (openai-go v3)

677 lines added, no other files touched. The factory dispatch in insights_service.go mirrors the existing azure / google cases.

Testing

Built and run end-to-end against a self-hosted setup:

  • Transcription via whisper-local running faster-whisper behind FastAPI (exposing OpenAI-compatible /v1/audio/transcriptions). Multipart upload, language hint, JSON response — all working through the new provider.
  • Translation via llama-local running ggml-org/llama.cpp:server with Qwen 2.5 7B Instruct Q4_K_M. JSON Schema strict mode is honoured by llama.cpp: dynamic properties produced by buildTranslationSchema(targetLangs) are populated correctly for multi-target requests (verified de → en + ar).
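For reviewers curious what the dynamic schema looks like, this is a sketch of the idea behind buildTranslationSchema: one object property per target language code, with Strict mode requiring additionalProperties: false. The real implementation builds openai-go parameter types; this standalone version uses plain maps:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildTranslationSchema returns a JSON Schema whose properties are the
// requested target language codes, so one chat completion yields every
// translation at once. Sketch of the approach, not the PR's exact code.
func buildTranslationSchema(targetLangs []string) map[string]any {
	props := map[string]any{}
	for _, lang := range targetLangs {
		props[lang] = map[string]any{
			"type":        "string",
			"description": "translation into " + lang,
		}
	}
	return map[string]any{
		"type":                 "object",
		"properties":           props,
		"required":             targetLangs,
		"additionalProperties": false, // Strict mode requires this
	}
}

func main() {
	b, _ := json.Marshal(buildTranslationSchema([]string{"en", "ar"}))
	fmt.Println(string(b))
}
```

A de → en + ar request, as in the test above, produces a schema with exactly two required string properties, and the model's reply parses directly into a map[string]string.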

I have not yet pointed it at OpenAI cloud directly in this branch — happy to do so if useful for your CI smoke or for a maintainer review.

Notes

  • The previous draft's Dockerfile.local is no longer needed; the provider is pure Go and builds with the existing upstream docker-build/Dockerfile.
  • Closing #844 (feat(insights): add local (self-hosted) transcription & translation provider) in favor of this one to keep the review surface clean — the design pivoted, the diff is unrecognisable from the old branch, and a fresh PR felt more honest than a force-push.

Introduces a new "openai" insights provider implemented entirely in Go
on top of github.com/openai/openai-go/v3. Replaces the third-party-
language helper service experimented with in the previous draft PR;
nothing outside this Go module is required to use it.

Capabilities
- Transcription via POST /v1/audio/transcriptions, chunked from PCM16
  audio supplied through TranscriptionStream.WriteSample. Each chunk is
  encoded as an in-memory WAV and uploaded; one final_result event is
  emitted per chunk. Real-time partials require the Realtime websocket
  API which openai-go does not yet cover and most self-hosted backends
  do not support, so we deliberately ship chunked-only.
- Translation via POST /v1/chat/completions with a JSON Schema response
  format whose properties are the requested target language codes. One
  round-trip yields all target translations; Strict mode forces the
  schema. Chat Completions (rather than the newer Responses API) keeps
  the provider compatible with self-hosted OpenAI-API endpoints
  (LocalAI, vLLM, llama.cpp-server, ...).
- Speech synthesis, AI text chat, and batch summarisation are stubbed
  out for now and can be added in follow-ups.

Configuration
- providers.openai[].credentials.api_key - required.
- providers.openai[].options.base_url - optional, point at any
  OpenAI-compatible endpoint to use a self-hosted backend.
- providers.openai[].options.chunk_seconds - transcription chunk
  duration in seconds (default 5).
- services.transcription.options.model / services.translation.options.model
  override the defaults (whisper-1 and gpt-4o-mini respectively).

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new OpenAI provider for the insights service, enabling transcription and translation capabilities. The implementation includes a chunked audio streaming mechanism for transcription and a JSON-schema-constrained chat completion approach for translation. The reviewer suggested refactoring the language list definition to avoid redundant allocations and replacing the use of 'goto' in the JSON parsing logic with more idiomatic Go control flow.

Comment on lines +13 to +79
var supportedLanguages = map[insights.ServiceType][]plugnmeet.InsightsSupportedLangInfo{
	insights.ServiceTypeTranscription: whisperLanguages(),
	insights.ServiceTypeTranslation:   whisperLanguages(),
}

func whisperLanguages() []plugnmeet.InsightsSupportedLangInfo {
	return []plugnmeet.InsightsSupportedLangInfo{
		{Code: "af", Name: "Afrikaans", Locale: "af"},
		{Code: "ar", Name: "Arabic", Locale: "ar"},
		{Code: "az", Name: "Azerbaijani", Locale: "az"},
		{Code: "be", Name: "Belarusian", Locale: "be"},
		{Code: "bg", Name: "Bulgarian", Locale: "bg"},
		{Code: "bn", Name: "Bengali", Locale: "bn"},
		{Code: "bs", Name: "Bosnian", Locale: "bs"},
		{Code: "ca", Name: "Catalan", Locale: "ca"},
		{Code: "cs", Name: "Czech", Locale: "cs"},
		{Code: "cy", Name: "Welsh", Locale: "cy"},
		{Code: "da", Name: "Danish", Locale: "da"},
		{Code: "de", Name: "German", Locale: "de"},
		{Code: "el", Name: "Greek", Locale: "el"},
		{Code: "en", Name: "English", Locale: "en"},
		{Code: "es", Name: "Spanish", Locale: "es"},
		{Code: "et", Name: "Estonian", Locale: "et"},
		{Code: "fa", Name: "Persian", Locale: "fa"},
		{Code: "fi", Name: "Finnish", Locale: "fi"},
		{Code: "fr", Name: "French", Locale: "fr"},
		{Code: "gl", Name: "Galician", Locale: "gl"},
		{Code: "he", Name: "Hebrew", Locale: "he"},
		{Code: "hi", Name: "Hindi", Locale: "hi"},
		{Code: "hr", Name: "Croatian", Locale: "hr"},
		{Code: "hu", Name: "Hungarian", Locale: "hu"},
		{Code: "hy", Name: "Armenian", Locale: "hy"},
		{Code: "id", Name: "Indonesian", Locale: "id"},
		{Code: "is", Name: "Icelandic", Locale: "is"},
		{Code: "it", Name: "Italian", Locale: "it"},
		{Code: "ja", Name: "Japanese", Locale: "ja"},
		{Code: "kk", Name: "Kazakh", Locale: "kk"},
		{Code: "kn", Name: "Kannada", Locale: "kn"},
		{Code: "ko", Name: "Korean", Locale: "ko"},
		{Code: "lt", Name: "Lithuanian", Locale: "lt"},
		{Code: "lv", Name: "Latvian", Locale: "lv"},
		{Code: "mi", Name: "Maori", Locale: "mi"},
		{Code: "mk", Name: "Macedonian", Locale: "mk"},
		{Code: "mr", Name: "Marathi", Locale: "mr"},
		{Code: "ms", Name: "Malay", Locale: "ms"},
		{Code: "ne", Name: "Nepali", Locale: "ne"},
		{Code: "nl", Name: "Dutch", Locale: "nl"},
		{Code: "no", Name: "Norwegian", Locale: "no"},
		{Code: "pl", Name: "Polish", Locale: "pl"},
		{Code: "pt", Name: "Portuguese", Locale: "pt"},
		{Code: "ro", Name: "Romanian", Locale: "ro"},
		{Code: "ru", Name: "Russian", Locale: "ru"},
		{Code: "sk", Name: "Slovak", Locale: "sk"},
		{Code: "sl", Name: "Slovenian", Locale: "sl"},
		{Code: "sr", Name: "Serbian", Locale: "sr"},
		{Code: "sv", Name: "Swedish", Locale: "sv"},
		{Code: "sw", Name: "Swahili", Locale: "sw"},
		{Code: "ta", Name: "Tamil", Locale: "ta"},
		{Code: "th", Name: "Thai", Locale: "th"},
		{Code: "tl", Name: "Tagalog", Locale: "tl"},
		{Code: "tr", Name: "Turkish", Locale: "tr"},
		{Code: "uk", Name: "Ukrainian", Locale: "uk"},
		{Code: "ur", Name: "Urdu", Locale: "ur"},
		{Code: "vi", Name: "Vietnamese", Locale: "vi"},
		{Code: "zh", Name: "Chinese", Locale: "zh"},
	}
}

medium

To improve efficiency and avoid allocating two identical slices, you can define the list of languages as a single package-level variable and reuse it in the supportedLanguages map. This also makes the code slightly more readable by removing the whisperLanguages() function.

var (
	// whisperLangs is the list of languages supported by Whisper for transcription.
	// We reuse this for translation as modern models support this set.
	whisperLangs = []plugnmeet.InsightsSupportedLangInfo{
		{Code: "af", Name: "Afrikaans", Locale: "af"},
		{Code: "ar", Name: "Arabic", Locale: "ar"},
		{Code: "az", Name: "Azerbaijani", Locale: "az"},
		{Code: "be", Name: "Belarusian", Locale: "be"},
		{Code: "bg", Name: "Bulgarian", Locale: "bg"},
		{Code: "bn", Name: "Bengali", Locale: "bn"},
		{Code: "bs", Name: "Bosnian", Locale: "bs"},
		{Code: "ca", Name: "Catalan", Locale: "ca"},
		{Code: "cs", Name: "Czech", Locale: "cs"},
		{Code: "cy", Name: "Welsh", Locale: "cy"},
		{Code: "da", Name: "Danish", Locale: "da"},
		{Code: "de", Name: "German", Locale: "de"},
		{Code: "el", Name: "Greek", Locale: "el"},
		{Code: "en", Name: "English", Locale: "en"},
		{Code: "es", Name: "Spanish", Locale: "es"},
		{Code: "et", Name: "Estonian", Locale: "et"},
		{Code: "fa", Name: "Persian", Locale: "fa"},
		{Code: "fi", Name: "Finnish", Locale: "fi"},
		{Code: "fr", Name: "French", Locale: "fr"},
		{Code: "gl", Name: "Galician", Locale: "gl"},
		{Code: "he", Name: "Hebrew", Locale: "he"},
		{Code: "hi", Name: "Hindi", Locale: "hi"},
		{Code: "hr", Name: "Croatian", Locale: "hr"},
		{Code: "hu", Name: "Hungarian", Locale: "hu"},
		{Code: "hy", Name: "Armenian", Locale: "hy"},
		{Code: "id", Name: "Indonesian", Locale: "id"},
		{Code: "is", Name: "Icelandic", Locale: "is"},
		{Code: "it", Name: "Italian", Locale: "it"},
		{Code: "ja", Name: "Japanese", Locale: "ja"},
		{Code: "kk", Name: "Kazakh", Locale: "kk"},
		{Code: "kn", Name: "Kannada", Locale: "kn"},
		{Code: "ko", Name: "Korean", Locale: "ko"},
		{Code: "lt", Name: "Lithuanian", Locale: "lt"},
		{Code: "lv", Name: "Latvian", Locale: "lv"},
		{Code: "mi", Name: "Maori", Locale: "mi"},
		{Code: "mk", Name: "Macedonian", Locale: "mk"},
		{Code: "mr", Name: "Marathi", Locale: "mr"},
		{Code: "ms", Name: "Malay", Locale: "ms"},
		{Code: "ne", Name: "Nepali", Locale: "ne"},
		{Code: "nl", Name: "Dutch", Locale: "nl"},
		{Code: "no", Name: "Norwegian", Locale: "no"},
		{Code: "pl", Name: "Polish", Locale: "pl"},
		{Code: "pt", Name: "Portuguese", Locale: "pt"},
		{Code: "ro", Name: "Romanian", Locale: "ro"},
		{Code: "ru", Name: "Russian", Locale: "ru"},
		{Code: "sk", Name: "Slovak", Locale: "sk"},
		{Code: "sl", Name: "Slovenian", Locale: "sl"},
		{Code: "sr", Name: "Serbian", Locale: "sr"},
		{Code: "sv", Name: "Swedish", Locale: "sv"},
		{Code: "sw", Name: "Swahili", Locale: "sw"},
		{Code: "ta", Name: "Tamil", Locale: "ta"},
		{Code: "th", Name: "Thai", Locale: "th"},
		{Code: "tl", Name: "Tagalog", Locale: "tl"},
		{Code: "tr", Name: "Turkish", Locale: "tr"},
		{Code: "uk", Name: "Ukrainian", Locale: "uk"},
		{Code: "ur", Name: "Urdu", Locale: "ur"},
		{Code: "vi", Name: "Vietnamese", Locale: "vi"},
		{Code: "zh", Name: "Chinese", Locale: "zh"},
	}

	// supportedLanguages enumerates the languages we surface for transcription
	// and translation.
	supportedLanguages = map[insights.ServiceType][]plugnmeet.InsightsSupportedLangInfo{
		insights.ServiceTypeTranscription: whisperLangs,
		insights.ServiceTypeTranslation:   whisperLangs,
	}
)

Comment on lines +85 to +97
parsed := map[string]string{}
if err := json.Unmarshal([]byte(raw), &parsed); err != nil {
	// Some local backends ignore Strict mode and wrap output in markdown
	// fences or prose. Try one tolerant unwrap before giving up.
	if cleaned := stripJSONNoise(raw); cleaned != raw {
		if err2 := json.Unmarshal([]byte(cleaned), &parsed); err2 == nil {
			goto OK
		}
	}
	log.WithError(err).WithField("raw", truncate(raw, 256)).Warn("openai translation: failed to parse JSON response")
	return nil, fmt.Errorf("failed to parse translation JSON: %w", err)
}
OK:

medium

While goto is functional here, it's often considered a code smell in modern Go as it can make control flow harder to follow. This block can be refactored to avoid goto by restructuring the error handling, which improves readability and aligns with common Go idioms.

	parsed := map[string]string{}
	err := json.Unmarshal([]byte(raw), &parsed)
	if err != nil {
		// Some local backends ignore Strict mode and wrap output in markdown
		// fences or prose. Try one tolerant unwrap before giving up.
		cleaned := stripJSONNoise(raw)
		if cleaned != raw {
			err = json.Unmarshal([]byte(cleaned), &parsed)
		}
	}

	if err != nil {
		log.WithError(err).WithField("raw", truncate(raw, 256)).Warn("openai translation: failed to parse JSON response")
		return nil, fmt.Errorf("failed to parse translation JSON: %w", err)
	}

Member

jibon57 commented Apr 28, 2026

Thanks for it. I checked it very quickly and found that you've not considered using the realtime websocket API, e.g.

https://developers.openai.com/api/docs/guides/realtime-websocket
https://github.com/openai/openai-go/blob/main/realtime/realtime.go

So, we'll lose partial (delta) results.

Adds a realtime transcription path alongside the existing chunked one,
selectable via the provider account's mode option (default: chunked).

- chunked (existing): WAV chunks POSTed to /v1/audio/transcriptions.
  Works against any OpenAI-compatible HTTP backend (faster-whisper-server,
  whisper.cpp, vLLM, LocalAI).
- realtime (new): WebSocket to /v1/realtime?intent=transcription with
  PCM16 streaming, server VAD, partial_result deltas, and final_result on
  segment completion. Works against OpenAI cloud and Azure OpenAI Realtime.

Also drops the goto fallback in translate.go in favour of a small
parseTranslationJSON helper.
Author

xynstr commented Apr 29, 2026

Good catch. Pushed 8e7c952 adding both modes, switchable via a mode option on the provider:

  • mode: "chunked" (default): the existing /v1/audio/transcriptions path. Kept as default so self-hosted OpenAI-compatible servers (faster-whisper-server, whisper.cpp, vLLM, LocalAI) keep working, since none of them speak the Realtime protocol.
  • mode: "realtime": WebSocket to /v1/realtime?intent=transcription with PCM16 streaming and server VAD, emitting partial_result on deltas and final_result on segment completion. Works against OpenAI cloud and Azure OpenAI Realtime.

Used gorilla/websocket for the transport (already an indirect dep). The openai-go/v3/realtime package is types-only with no transport, so wiring our own felt simpler than mapping the Stainless unions for the few events we actually consume.

Also dropped the goto OK in translate.go in the same commit (Gemini review flagged it).

@jibon57 jibon57 self-assigned this Apr 29, 2026
@jibon57 jibon57 self-requested a review April 29, 2026 07:35
Member

jibon57 commented Apr 29, 2026

Also, please consider other features like TTS, AI chat, and meeting summarization.
