feat(search): add Google Gemini embedding provider#27974
Adds a fourth embedding provider (google) alongside openai/bedrock/djl, using the Generative Language API with a single API key. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7 tasks covering schema change + regen, client implementation, validation tests, error path tests, request shape tests, switch wiring, and final verification. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ient The string "models/" appeared in both DEFAULT_BASE_URL and the buildRequestBody method. Extract it as a named constant per project standards. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ound Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…shape Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t comment Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Adds a new google embedding provider (Google Gemini / Generative Language API) to OpenMetadata’s vector search embedding client framework, alongside the existing bedrock, openai, and djl providers.
Changes:
- Extended the ElasticSearch configuration schema with a new `naturalLanguageSearch.google` block and updated provider description text.
- Implemented `GoogleEmbeddingClient` (HTTP call, request/response JSON handling, error extraction, endpoint override support).
- Wired the new provider into `SearchRepository.createEmbeddingClient` and added a dedicated unit test suite.
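
To make the request-side JSON handling concrete, here is a minimal sketch of the body shape the PR description names (`content.parts[].text`), with the `models/` prefix that one of the commits extracts as a constant. The class and method names (`EmbedRequestBodies`, `build`, `escape`) are hypothetical, and the hand-rolled escaping is only there to keep the sketch dependency-free; the actual client may well use a JSON library instead.

```java
/** Hypothetical illustration of the embedContent request body; not the PR's code. */
final class EmbedRequestBodies {
  // Assumed to mirror the named constant mentioned in the commit history.
  private static final String MODELS_PREFIX = "models/";

  private EmbedRequestBodies() {}

  /** Builds {"model":"models/<id>","content":{"parts":[{"text":"..."}]}}. */
  static String build(String modelId, String text) {
    return "{\"model\":\"" + escape(MODELS_PREFIX + modelId) + "\","
        + "\"content\":{\"parts\":[{\"text\":\"" + escape(text) + "\"}]}}";
  }

  /** Escapes the characters that JSON strings require to be escaped. */
  static String escape(String s) {
    StringBuilder sb = new StringBuilder(s.length() + 16);
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"': sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            // Remaining control characters use the \\uXXXX form.
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }
}
```

Calling `EmbedRequestBodies.build("text-embedding-004", "hello")` yields a single-part body for the default model; the response side (`embedding.values`) is not sketched here.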
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| openmetadata-spec/src/main/resources/json/schema/configuration/elasticSearchConfiguration.json | Adds google provider config block under naturalLanguageSearch and updates provider description. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/client/GoogleEmbeddingClient.java | New embedding client implementation for Gemini (Generative Language API). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java | Adds google case to embedding client provider switch. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/vector/client/GoogleEmbeddingClientTest.java | Adds unit tests for Google embedding client behavior and request construction. |
| docs/superpowers/specs/2026-05-07-google-gemini-embedding-client-design.md | Design spec documenting the provider, config shape, and behavior. |
| docs/superpowers/plans/2026-05-07-google-gemini-embedding-client.md | Implementation plan and step-by-step checklist for the change. |

    private HttpRequest buildRequest(String body) {
      String encodedKey = URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
      String url = endpoint + "?key=" + encodedKey;
      return HttpRequest.newBuilder()
          .uri(URI.create(url))
          .header("Content-Type", "application/json")
          .timeout(Duration.ofSeconds(30))
          .POST(HttpRequest.BodyPublishers.ofString(body))
          .build();

      "default": 768
    },
    "endpoint": {
      "description": "Custom endpoint URL. Leave empty for the default Generative Language API.",

    java.util.concurrent.atomic.AtomicReference<String> captured =
        new java.util.concurrent.atomic.AtomicReference<>();
    java.util.concurrent.atomic.AtomicReference<Throwable> failure =
        new java.util.concurrent.atomic.AtomicReference<>();
    request
        .bodyPublisher()
        .ifPresent(
            publisher -> {
              java.util.concurrent.Flow.Subscriber<java.nio.ByteBuffer> subscriber =
                  new java.util.concurrent.Flow.Subscriber<>() {
                    private final java.io.ByteArrayOutputStream out =
                        new java.io.ByteArrayOutputStream();

                    @Override
                    public void onSubscribe(java.util.concurrent.Flow.Subscription subscription) {
                      subscription.request(Long.MAX_VALUE);
                    }

                    @Override
                    public void onNext(java.nio.ByteBuffer item) {
                      byte[] arr = new byte[item.remaining()];
                      item.get(arr);
                      out.write(arr, 0, arr.length);
                    }

                    @Override
                    public void onError(Throwable throwable) {
                      failure.set(throwable);
                    }

                    @Override
                    public void onComplete() {
                      captured.set(out.toString(java.nio.charset.StandardCharsets.UTF_8));
                    }
                  };
              publisher.subscribe(subscriber);
            });
    if (failure.get() != null) {
      throw new RuntimeException("Body publisher failed", failure.get());
    }
    String body = captured.get();
    if (body == null) {
      throw new IllegalStateException("Request had no body publisher");
    }
    return body;
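
The helper above reads `captured` immediately after `subscribe`, so it only works if the publisher delivers synchronously. A later commit in this PR says the test was changed to await `onComplete` via `CompletableFuture.get(5s)`. The following is a sketch of what such a helper might look like, not the PR's actual code; the class name `RequestBodies` and the choice to wrap failures in `IllegalStateException` are assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.net.http.HttpRequest;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Flow;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical test helper: drains a request's body publisher into a String. */
final class RequestBodies {
  private RequestBodies() {}

  static String extractBody(HttpRequest request) {
    HttpRequest.BodyPublisher publisher =
        request
            .bodyPublisher()
            .orElseThrow(() -> new IllegalStateException("Request had no body publisher"));
    CompletableFuture<String> result = new CompletableFuture<>();
    publisher.subscribe(
        new Flow.Subscriber<ByteBuffer>() {
          private final ByteArrayOutputStream out = new ByteArrayOutputStream();

          @Override
          public void onSubscribe(Flow.Subscription subscription) {
            subscription.request(Long.MAX_VALUE); // ask for every chunk up front
          }

          @Override
          public void onNext(ByteBuffer item) {
            byte[] chunk = new byte[item.remaining()];
            item.get(chunk);
            out.write(chunk, 0, chunk.length);
          }

          @Override
          public void onError(Throwable throwable) {
            result.completeExceptionally(throwable); // surfaced by get() below
          }

          @Override
          public void onComplete() {
            result.complete(out.toString(StandardCharsets.UTF_8));
          }
        });
    try {
      // Wait for completion instead of assuming synchronous delivery.
      return result.get(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while reading request body", e);
    } catch (ExecutionException | TimeoutException e) {
      throw new IllegalStateException("Body publisher failed or timed out", e);
    }
  }
}
```

Routing both `onError` and `onComplete` through one future means the caller sees either the body or the failure, and a publisher that never completes turns into a timeout rather than a `null` body.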
✅ TypeScript Types Auto-Updated
The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.
🔴 Playwright Results: 2 failure(s), 17 flaky
✅ 4009 passed · ❌ 2 failed · 🟡 17 flaky · ⏭️ 86 skipped
Genuine Failures (failed on all attempts) ❌
These were workflow scaffolding (design spec + implementation plan) generated by the superpowers brainstorming/planning flow; they belong in the local development trail, not the PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- GoogleEmbeddingClient.buildRequest: handle endpoint with existing query string by switching the key separator from '?' to '&' as needed; document why the API key travels in the URL (Google Generative Language API requirement, not Bearer-header).
- GoogleEmbeddingClient.extractErrorMessage: replace empty catch block with a trace-level log to comply with the 'no empty catch' standard.
- elasticSearchConfiguration.json: clarify google.endpoint description so operators know it must be the full ':embedContent' URL, not a base URL.
- GoogleEmbeddingClientTest.extractBody: await onComplete via CompletableFuture.get(5s) instead of relying on synchronous publisher delivery; surface onError properly.
- New test: testEndpointWithExistingQueryStringUsesAmpersand verifies the '?' / '&' separator logic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
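
As an illustration of the `'?'` / `'&'` separator logic this commit describes, a sketch along the following lines would cover both cases. The class and method names (`KeyedEndpoint`, `withApiKey`) are hypothetical; the real logic lives inside `GoogleEmbeddingClient.buildRequest` and may differ in detail.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing the separator choice; not the PR's code. */
final class KeyedEndpoint {
  private KeyedEndpoint() {}

  /** Appends the URL-encoded API key, using '&' if the endpoint already has a query string. */
  static URI withApiKey(String endpoint, String apiKey) {
    String encodedKey = URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
    char separator = endpoint.indexOf('?') >= 0 ? '&' : '?';
    return URI.create(endpoint + separator + "key=" + encodedKey);
  }
}
```

With this shape, an endpoint such as `https://example.test/m:embedContent?alt=json` (a made-up URL) becomes `...?alt=json&key=<encoded key>` rather than carrying two `?` characters.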
A custom endpoint missing the `:embedContent` action used to silently produce 404s at runtime. Fail fast at startup with a clear message showing the expected URL form, so misconfiguration surfaces in logs instead of in vector-search failures.
- Update testCustomEndpointConstruction to use a valid full URL.
- Add testCustomEndpointWithoutEmbedContentThrows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
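
A fail-fast check of this kind might look like the sketch below. The class name, the exception type, the wording of the message, and the decision to ignore any query string before checking the suffix are all assumptions made for illustration; only the `:embedContent` requirement itself comes from the commit message.

```java
/** Hypothetical startup validation for a custom endpoint; not the PR's code. */
final class EndpointValidator {
  static final String EMBED_CONTENT_ACTION = ":embedContent";

  private EndpointValidator() {}

  /** Returns the endpoint unchanged, or throws if it does not target the embedContent action. */
  static String requireEmbedContentEndpoint(String endpoint) {
    // Assumption: a trailing query string should not affect the check.
    int queryStart = endpoint.indexOf('?');
    String path = queryStart >= 0 ? endpoint.substring(0, queryStart) : endpoint;
    if (!path.endsWith(EMBED_CONTENT_ACTION)) {
      throw new IllegalArgumentException(
          "Google embedding endpoint must be the full URL ending in '"
              + EMBED_CONTENT_ACTION
              + "', for example"
              + " https://generativelanguage.googleapis.com/v1beta/models/<model>:embedContent;"
              + " got: "
              + endpoint);
    }
    return endpoint;
  }
}
```

Running the check in the client constructor, rather than on the first embedding call, is what moves the failure from a runtime 404 to a startup error.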
Adds a `modelId` property to the natural-language-search `google` block, parallel to how the `openai` block exposes both `modelId` (chat) and `embeddingModelId` (embedding). This enables Gemini-based NLQ filter extraction (chat completions via :generateContent) on top of the existing embedding support. Default: gemini-2.5-flash. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Code Review ✅ Approved · 3 resolved / 3 findings
Implements Google Gemini embedding provider with robust validation and schema updates, addressing previously noted null pointer and error handling concerns. No open issues remain.
✅ 3 resolved
- ✅ Quality: Empty catch block in extractErrorMessage violates guidelines
- ✅ Edge Case: NPE if google config is null in SystemRepository switch
- ✅ Security: API key in URL query string may leak via server access logs



Summary
Adds a fourth embedding provider, Google Gemini via the Generative Language API, alongside the existing `openai`, `bedrock`, and `djl` providers. Operators can now point natural-language / semantic search at Gemini models using a single API key from Google AI Studio (no GCP project, service account, or OAuth setup required).
- … `google` block under `naturalLanguageSearch` in `elasticSearchConfiguration.json` (apiKey, embeddingModelId, embeddingDimension, endpoint).
- … `GoogleEmbeddingClient` mirroring `OpenAIEmbeddingClient`: API key in URL query string, `content.parts[].text` body, `embedding.values` response parsing, error-message extraction from Google's standard error envelope.
- … `SearchRepository.createEmbeddingClient` switch.
- … `HttpClient` stubs (no Mockito), covering construction, validation, success path, HTTP errors, malformed responses, request-shape verification, and URL encoding.

Defaults to `text-embedding-004` / 768 dim; supports `gemini-embedding-001` via configuration.

Test plan
- `mvn test -pl openmetadata-service -Dtest=GoogleEmbeddingClientTest`: 23 tests pass
- `mvn test -pl openmetadata-service -Dtest='*EmbeddingClientTest'`: 53 tests pass, no regressions in OpenAI/EmbeddingClient sibling tests
- `mvn spotless:apply -pl openmetadata-service`: clean

Out of scope
- … `googleVertexAi` provider):
- … `batchEmbedContents`): uses base-class default serial loop, matching siblings

Summary by Gitar
- … `GoogleEmbeddingClient` to support vector embeddings via Google Generative Language API.
- … `google` provider into `openmetadata.yaml` with configurable `apiKey`, `embeddingModelId`, and `endpoint`.
- … `google` configuration block to `elasticSearchConfiguration.json` including `modelId` for query transformation support.
- … `google.endpoint` format and `embeddingDimension` constraints.
- … `SystemRepository` diagnostic to include `google` provider details and error handling.

This will update automatically on new commits.