feat(insights): add OpenAI provider for transcription and translation #848
xynstr wants to merge 2 commits into mynaparrot:main
Conversation
Introduces a new "openai" insights provider implemented entirely in Go on top of github.com/openai/openai-go/v3. Replaces the third-party language-helper service experimented with in the previous draft PR; nothing outside this Go module is required to use it.

Capabilities

- Transcription via `POST /v1/audio/transcriptions`, chunked from PCM16 audio supplied through `TranscriptionStream.WriteSample`. Each chunk is encoded as an in-memory WAV and uploaded; one `final_result` event is emitted per chunk. Real-time partials require the Realtime websocket API, which openai-go does not yet cover and most self-hosted backends do not support, so we deliberately ship chunked-only.
- Translation via `POST /v1/chat/completions` with a JSON Schema response format whose properties are the requested target language codes. One round-trip yields all target translations; Strict mode forces the schema. Chat Completions (rather than the newer Responses API) keeps the provider compatible with self-hosted OpenAI-API endpoints (LocalAI, vLLM, llama.cpp-server, ...).
- Speech synthesis, AI text chat, and batch summarisation are stubbed out for now and can be added in follow-ups.

Configuration

- `providers.openai[].credentials.api_key` - required.
- `providers.openai[].options.base_url` - optional; point at any OpenAI-compatible endpoint to use a self-hosted backend.
- `providers.openai[].options.chunk_seconds` - transcription chunk duration in seconds (default 5).
- `services.transcription.options.model` / `services.translation.options.model` - override the defaults (whisper-1 and gpt-4o-mini respectively).
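The chunked transcription path above hinges on wrapping raw PCM16 samples in a WAV container entirely in memory before upload. The helper below is an illustrative sketch of that step, not the PR's actual code; the function name and signature are assumptions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// pcm16ToWAV wraps raw little-endian PCM16 samples in a minimal 44-byte
// RIFF/WAVE header, producing an uploadable in-memory WAV file.
// Hypothetical helper; the PR's real encoder may differ.
func pcm16ToWAV(pcm []byte, sampleRate, channels int) []byte {
	var buf bytes.Buffer
	byteRate := sampleRate * channels * 2 // 2 bytes per 16-bit sample
	blockAlign := channels * 2

	buf.WriteString("RIFF")
	binary.Write(&buf, binary.LittleEndian, uint32(36+len(pcm))) // remaining chunk size
	buf.WriteString("WAVE")

	buf.WriteString("fmt ")
	binary.Write(&buf, binary.LittleEndian, uint32(16)) // fmt chunk size
	binary.Write(&buf, binary.LittleEndian, uint16(1))  // audio format 1 = PCM
	binary.Write(&buf, binary.LittleEndian, uint16(channels))
	binary.Write(&buf, binary.LittleEndian, uint32(sampleRate))
	binary.Write(&buf, binary.LittleEndian, uint32(byteRate))
	binary.Write(&buf, binary.LittleEndian, uint16(blockAlign))
	binary.Write(&buf, binary.LittleEndian, uint16(16)) // bits per sample

	buf.WriteString("data")
	binary.Write(&buf, binary.LittleEndian, uint32(len(pcm)))
	buf.Write(pcm)
	return buf.Bytes()
}

func main() {
	// One 5-second chunk of 16 kHz mono audio: 16000 samples/s * 2 bytes * 5 s.
	chunk := make([]byte, 16000*2*5)
	wav := pcm16ToWAV(chunk, 16000, 1)
	fmt.Println(len(wav) == 44+len(chunk)) // 44-byte header + payload
}
```

Keeping the container in memory avoids temp files and lets each `chunk_seconds` window be posted as soon as it closes.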
Code Review
This pull request introduces a new OpenAI provider for the insights service, enabling transcription and translation capabilities. The implementation includes a chunked audio streaming mechanism for transcription and a JSON-schema-constrained chat completion approach for translation. The reviewer suggested refactoring the language list definition to avoid redundant allocations and replacing the use of 'goto' in the JSON parsing logic with more idiomatic Go control flow.
var supportedLanguages = map[insights.ServiceType][]plugnmeet.InsightsSupportedLangInfo{
	insights.ServiceTypeTranscription: whisperLanguages(),
	insights.ServiceTypeTranslation:   whisperLanguages(),
}

func whisperLanguages() []plugnmeet.InsightsSupportedLangInfo {
	return []plugnmeet.InsightsSupportedLangInfo{
		{Code: "af", Name: "Afrikaans", Locale: "af"},
		{Code: "ar", Name: "Arabic", Locale: "ar"},
		{Code: "az", Name: "Azerbaijani", Locale: "az"},
		{Code: "be", Name: "Belarusian", Locale: "be"},
		{Code: "bg", Name: "Bulgarian", Locale: "bg"},
		{Code: "bn", Name: "Bengali", Locale: "bn"},
		{Code: "bs", Name: "Bosnian", Locale: "bs"},
		{Code: "ca", Name: "Catalan", Locale: "ca"},
		{Code: "cs", Name: "Czech", Locale: "cs"},
		{Code: "cy", Name: "Welsh", Locale: "cy"},
		{Code: "da", Name: "Danish", Locale: "da"},
		{Code: "de", Name: "German", Locale: "de"},
		{Code: "el", Name: "Greek", Locale: "el"},
		{Code: "en", Name: "English", Locale: "en"},
		{Code: "es", Name: "Spanish", Locale: "es"},
		{Code: "et", Name: "Estonian", Locale: "et"},
		{Code: "fa", Name: "Persian", Locale: "fa"},
		{Code: "fi", Name: "Finnish", Locale: "fi"},
		{Code: "fr", Name: "French", Locale: "fr"},
		{Code: "gl", Name: "Galician", Locale: "gl"},
		{Code: "he", Name: "Hebrew", Locale: "he"},
		{Code: "hi", Name: "Hindi", Locale: "hi"},
		{Code: "hr", Name: "Croatian", Locale: "hr"},
		{Code: "hu", Name: "Hungarian", Locale: "hu"},
		{Code: "hy", Name: "Armenian", Locale: "hy"},
		{Code: "id", Name: "Indonesian", Locale: "id"},
		{Code: "is", Name: "Icelandic", Locale: "is"},
		{Code: "it", Name: "Italian", Locale: "it"},
		{Code: "ja", Name: "Japanese", Locale: "ja"},
		{Code: "kk", Name: "Kazakh", Locale: "kk"},
		{Code: "kn", Name: "Kannada", Locale: "kn"},
		{Code: "ko", Name: "Korean", Locale: "ko"},
		{Code: "lt", Name: "Lithuanian", Locale: "lt"},
		{Code: "lv", Name: "Latvian", Locale: "lv"},
		{Code: "mi", Name: "Maori", Locale: "mi"},
		{Code: "mk", Name: "Macedonian", Locale: "mk"},
		{Code: "mr", Name: "Marathi", Locale: "mr"},
		{Code: "ms", Name: "Malay", Locale: "ms"},
		{Code: "ne", Name: "Nepali", Locale: "ne"},
		{Code: "nl", Name: "Dutch", Locale: "nl"},
		{Code: "no", Name: "Norwegian", Locale: "no"},
		{Code: "pl", Name: "Polish", Locale: "pl"},
		{Code: "pt", Name: "Portuguese", Locale: "pt"},
		{Code: "ro", Name: "Romanian", Locale: "ro"},
		{Code: "ru", Name: "Russian", Locale: "ru"},
		{Code: "sk", Name: "Slovak", Locale: "sk"},
		{Code: "sl", Name: "Slovenian", Locale: "sl"},
		{Code: "sr", Name: "Serbian", Locale: "sr"},
		{Code: "sv", Name: "Swedish", Locale: "sv"},
		{Code: "sw", Name: "Swahili", Locale: "sw"},
		{Code: "ta", Name: "Tamil", Locale: "ta"},
		{Code: "th", Name: "Thai", Locale: "th"},
		{Code: "tl", Name: "Tagalog", Locale: "tl"},
		{Code: "tr", Name: "Turkish", Locale: "tr"},
		{Code: "uk", Name: "Ukrainian", Locale: "uk"},
		{Code: "ur", Name: "Urdu", Locale: "ur"},
		{Code: "vi", Name: "Vietnamese", Locale: "vi"},
		{Code: "zh", Name: "Chinese", Locale: "zh"},
	}
}
To improve efficiency and avoid allocating two identical slices, you can define the list of languages as a single package-level variable and reuse it in the supportedLanguages map. This also makes the code slightly more readable by removing the whisperLanguages() function.
var (
// whisperLangs is the list of languages supported by Whisper for transcription.
// We reuse this for translation as modern models support this set.
whisperLangs = []plugnmeet.InsightsSupportedLangInfo{
{Code: "af", Name: "Afrikaans", Locale: "af"},
{Code: "ar", Name: "Arabic", Locale: "ar"},
{Code: "az", Name: "Azerbaijani", Locale: "az"},
{Code: "be", Name: "Belarusian", Locale: "be"},
{Code: "bg", Name: "Bulgarian", Locale: "bg"},
{Code: "bn", Name: "Bengali", Locale: "bn"},
{Code: "bs", Name: "Bosnian", Locale: "bs"},
{Code: "ca", Name: "Catalan", Locale: "ca"},
{Code: "cs", Name: "Czech", Locale: "cs"},
{Code: "cy", Name: "Welsh", Locale: "cy"},
{Code: "da", Name: "Danish", Locale: "da"},
{Code: "de", Name: "German", Locale: "de"},
{Code: "el", Name: "Greek", Locale: "el"},
{Code: "en", Name: "English", Locale: "en"},
{Code: "es", Name: "Spanish", Locale: "es"},
{Code: "et", Name: "Estonian", Locale: "et"},
{Code: "fa", Name: "Persian", Locale: "fa"},
{Code: "fi", Name: "Finnish", Locale: "fi"},
{Code: "fr", Name: "French", Locale: "fr"},
{Code: "gl", Name: "Galician", Locale: "gl"},
{Code: "he", Name: "Hebrew", Locale: "he"},
{Code: "hi", Name: "Hindi", Locale: "hi"},
{Code: "hr", Name: "Croatian", Locale: "hr"},
{Code: "hu", Name: "Hungarian", Locale: "hu"},
{Code: "hy", Name: "Armenian", Locale: "hy"},
{Code: "id", Name: "Indonesian", Locale: "id"},
{Code: "is", Name: "Icelandic", Locale: "is"},
{Code: "it", Name: "Italian", Locale: "it"},
{Code: "ja", Name: "Japanese", Locale: "ja"},
{Code: "kk", Name: "Kazakh", Locale: "kk"},
{Code: "kn", Name: "Kannada", Locale: "kn"},
{Code: "ko", Name: "Korean", Locale: "ko"},
{Code: "lt", Name: "Lithuanian", Locale: "lt"},
{Code: "lv", Name: "Latvian", Locale: "lv"},
{Code: "mi", Name: "Maori", Locale: "mi"},
{Code: "mk", Name: "Macedonian", Locale: "mk"},
{Code: "mr", Name: "Marathi", Locale: "mr"},
{Code: "ms", Name: "Malay", Locale: "ms"},
{Code: "ne", Name: "Nepali", Locale: "ne"},
{Code: "nl", Name: "Dutch", Locale: "nl"},
{Code: "no", Name: "Norwegian", Locale: "no"},
{Code: "pl", Name: "Polish", Locale: "pl"},
{Code: "pt", Name: "Portuguese", Locale: "pt"},
{Code: "ro", Name: "Romanian", Locale: "ro"},
{Code: "ru", Name: "Russian", Locale: "ru"},
{Code: "sk", Name: "Slovak", Locale: "sk"},
{Code: "sl", Name: "Slovenian", Locale: "sl"},
{Code: "sr", Name: "Serbian", Locale: "sr"},
{Code: "sv", Name: "Swedish", Locale: "sv"},
{Code: "sw", Name: "Swahili", Locale: "sw"},
{Code: "ta", Name: "Tamil", Locale: "ta"},
{Code: "th", Name: "Thai", Locale: "th"},
{Code: "tl", Name: "Tagalog", Locale: "tl"},
{Code: "tr", Name: "Turkish", Locale: "tr"},
{Code: "uk", Name: "Ukrainian", Locale: "uk"},
{Code: "ur", Name: "Urdu", Locale: "ur"},
{Code: "vi", Name: "Vietnamese", Locale: "vi"},
{Code: "zh", Name: "Chinese", Locale: "zh"},
}
// supportedLanguages enumerates the languages we surface for transcription
// and translation.
supportedLanguages = map[insights.ServiceType][]plugnmeet.InsightsSupportedLangInfo{
insights.ServiceTypeTranscription: whisperLangs,
insights.ServiceTypeTranslation: whisperLangs,
}
)

parsed := map[string]string{}
if err := json.Unmarshal([]byte(raw), &parsed); err != nil {
	// Some local backends ignore Strict mode and wrap output in markdown
	// fences or prose. Try one tolerant unwrap before giving up.
	if cleaned := stripJSONNoise(raw); cleaned != raw {
		if err2 := json.Unmarshal([]byte(cleaned), &parsed); err2 == nil {
			goto OK
		}
	}
	log.WithError(err).WithField("raw", truncate(raw, 256)).Warn("openai translation: failed to parse JSON response")
	return nil, fmt.Errorf("failed to parse translation JSON: %w", err)
}
OK:
While goto is functional here, it's often considered a code smell in modern Go as it can make control flow harder to follow. This block can be refactored to avoid goto by restructuring the error handling, which improves readability and aligns with common Go idioms.
parsed := map[string]string{}
err := json.Unmarshal([]byte(raw), &parsed)
if err != nil {
// Some local backends ignore Strict mode and wrap output in markdown
// fences or prose. Try one tolerant unwrap before giving up.
cleaned := stripJSONNoise(raw)
if cleaned != raw {
err = json.Unmarshal([]byte(cleaned), &parsed)
}
}
if err != nil {
log.WithError(err).WithField("raw", truncate(raw, 256)).Warn("openai translation: failed to parse JSON response")
return nil, fmt.Errorf("failed to parse translation JSON: %w", err)
}
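The `stripJSONNoise` helper that both versions of this fallback rely on is not shown in the diff. As a rough sketch only (behaviour assumed from the comment about markdown fences and prose, not taken from the PR's actual code), such an unwrap might look like:

```go
package main

import (
	"fmt"
	"strings"
)

// stripJSONNoise trims markdown code fences and surrounding prose so a
// lenient backend's output has a chance of parsing as a bare JSON object.
// Illustrative sketch; the PR's real helper may behave differently.
func stripJSONNoise(raw string) string {
	s := strings.TrimSpace(raw)
	// Drop a ```json ... ``` fence if the whole reply is fenced.
	if strings.HasPrefix(s, "```") {
		if i := strings.Index(s, "\n"); i >= 0 {
			s = s[i+1:]
		}
		if i := strings.LastIndex(s, "```"); i >= 0 {
			s = s[:i]
		}
	}
	// Keep only the outermost {...} span, discarding leading/trailing prose.
	start := strings.Index(s, "{")
	end := strings.LastIndex(s, "}")
	if start >= 0 && end > start {
		s = s[start : end+1]
	}
	return strings.TrimSpace(s)
}

func main() {
	raw := "Here you go:\n```json\n{\"en\": \"hello\"}\n```"
	fmt.Println(stripJSONNoise(raw))
}
```

A single tolerant pass like this keeps the strict-mode happy path allocation-free while still coping with backends that decorate their output.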
Thanks for it. I checked it very quickly and found that you haven't considered using the realtime websocket API, e.g. https://developers.openai.com/api/docs/guides/realtime-websocket. So, we'll lose partial (delta) results.
Adds a realtime transcription path alongside the existing chunked one, selectable via the provider account's `mode` option (default: `chunked`).

- `chunked` (existing): WAV chunks POSTed to `/v1/audio/transcriptions`. Works against any OpenAI-compatible HTTP backend (faster-whisper-server, whisper.cpp, vLLM, LocalAI).
- `realtime` (new): WebSocket to `/v1/realtime?intent=transcription` with PCM16 streaming, server VAD, `partial_result` deltas, and `final_result` on segment completion. Works against OpenAI cloud and Azure OpenAI Realtime.

Also drops the `goto` fallback in translate.go in favour of a small `parseTranslationJSON` helper.
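The two paths can be selected at stream construction time. The dispatch below is a minimal sketch of that idea; the function name, option plumbing, and return shape are assumptions, not the PR's actual factory.

```go
package main

import "fmt"

// newTranscriptionStream picks a transcription backend from the provider
// account's "mode" option, defaulting to the chunked HTTP path.
// Illustrative only; the real factory is shaped differently.
func newTranscriptionStream(opts map[string]string) (string, error) {
	switch mode := opts["mode"]; mode {
	case "", "chunked":
		// WAV chunks via POST /v1/audio/transcriptions; final_result per chunk.
		return "chunked", nil
	case "realtime":
		// PCM16 over the /v1/realtime websocket; partial_result deltas.
		return "realtime", nil
	default:
		return "", fmt.Errorf("openai: unknown transcription mode %q", mode)
	}
}

func main() {
	m, _ := newTranscriptionStream(map[string]string{})
	fmt.Println(m) // defaults to "chunked"
}
```

Defaulting to `chunked` keeps existing self-hosted deployments working without any config change.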
Good catch. Pushed 8e7c952 adding both modes, switchable via the provider account's `mode` option. Also dropped the `goto` fallback in translate.go.
Also please consider other features like:
Replaces #844 with the implementation direction you suggested there:
This is that rework. The provider is pure Go, built on `github.com/openai/openai-go/v3`. No Python, no helper services, no extra build stages; everything lives inside `pkg/insights/providers/openai/`.

What it does
- Transcription: `POST /v1/audio/transcriptions`, chunked from PCM16 supplied via `TranscriptionStream.WriteSample`. Each chunk is encoded as an in-memory WAV and uploaded; one `final_result` event is emitted per chunk. Realtime partials would require the websocket Realtime API, which openai-go does not yet cover and most self-hosted backends do not support, so this PR ships chunked-only on purpose; Realtime streaming can land as a follow-up once the SDK adds it.
- Translation: `POST /v1/chat/completions` with a `response_format` JSON Schema whose properties are the requested target language codes. One round-trip yields all target translations; Strict mode forces the schema. I deliberately picked Chat Completions over the newer Responses API so the same code path works against any OpenAI-compatible self-hosted backend (LocalAI, vLLM, llama.cpp-server, Ollama, ...); Responses is OpenAI-only.

Configuration
Knobs
- `providers.openai[].credentials.api_key` - required.
- `providers.openai[].options.base_url` - optional override; lets the same provider talk to a self-hosted OpenAI-API endpoint.
- `providers.openai[].options.chunk_seconds` - transcription chunk duration (default 5).
- `services.<svc>.options.model` - per-service model override.

Files
677 lines added, no other files touched. The factory dispatch in `insights_service.go` mirrors the existing `azure`/`google` cases.
Built and run end-to-end against a self-hosted setup:
- `whisper-local` running `faster-whisper` behind FastAPI (exposing an OpenAI-compatible `/v1/audio/transcriptions`). Multipart upload, language hint, JSON response: all working through the new provider.
- `llama-local` running `ggml-org/llama.cpp:server` with Qwen 2.5 7B Instruct Q4_K_M. JSON Schema strict mode is honoured by llama.cpp: dynamic `properties` produced by `buildTranslationSchema(targetLangs)` are populated correctly for multi-target requests (verified de → en + ar).

I have not yet pointed it at OpenAI cloud directly in this branch; happy to do so if useful for your CI smoke or for a maintainer review.
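The `buildTranslationSchema(targetLangs)` call mentioned above builds a schema whose properties are the target language codes, so one completion carries every translation. A hedged sketch of how such a schema could be assembled (field names and descriptions here are assumptions, not the PR's exact output):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildTranslationSchema assembles a JSON Schema object with one required
// string property per target language code. With Strict mode, the model
// must return exactly these keys. Sketch based on the PR description;
// the real helper's layout may differ.
func buildTranslationSchema(targetLangs []string) map[string]any {
	props := map[string]any{}
	for _, code := range targetLangs {
		props[code] = map[string]any{
			"type":        "string",
			"description": "translation into " + code,
		}
	}
	return map[string]any{
		"type":                 "object",
		"properties":           props,
		"required":             targetLangs,
		"additionalProperties": false,
	}
}

func main() {
	schema := buildTranslationSchema([]string{"en", "ar"})
	b, _ := json.Marshal(schema)
	fmt.Println(string(b))
}
```

Marking every target as `required` with `additionalProperties: false` is what lets llama.cpp's grammar-constrained decoding guarantee all keys appear.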
Notes
`Dockerfile.local` is no longer needed; the provider is pure Go and builds with the existing upstream `docker-build/Dockerfile`.
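Putting the knobs together, a provider block for a self-hosted backend might look roughly like this. The exact file layout is an assumption inferred from the option paths listed above; only the key names come from the PR.

```yaml
providers:
  openai:
    - credentials:
        api_key: "sk-..."                       # required
      options:
        base_url: "http://localhost:8000/v1"    # optional: any OpenAI-compatible endpoint
        chunk_seconds: 5                        # transcription chunk duration (default 5)
services:
  transcription:
    options:
      model: "whisper-1"       # default
  translation:
    options:
      model: "gpt-4o-mini"     # default
```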