cld2labs/llama-3.1-8b-instruct by arpannookala-12 · Pull Request #98 · opea-project/Enterprise-Inference

arpannookala-12 · 2026-04-21T19:57:25Z

Summary

- Adds a model card for llama-3.1-8b-instruct (Meta) under third_party/Dell/model-deployment/llama-3.1-8b-instruct/
- Adds a Helm-based guide for deploying llama-3.1-8b-instruct via vLLM on Gaudi and CPU (Xeon) with Keycloak OIDC and APISIX ingress

…ell EI Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

alexsin368 · 2026-05-18T23:39:22Z

Model deployment works. Testing inference is showing a Gateway Timeout error.

vLLM pod is fine, but ingress-nginx-controller is giving an upstream timeout:

2026/05/18 23:31:52 [error] 265195#265195: *10094624 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.17.23.1, server: api.example.com, request: "POST /Llama-3.1-8B-Instruct-vllmcpu/v1/completions HTTP/2.0", upstream: "http://10.233.104.80:9080/Llama-3.1-8B-Instruct-vllmcpu/v1/completions", host: "api.example.com"
172.17.23.1 - - [18/May/2026:23:31:52 +0000] "POST /Llama-3.1-8B-Instruct-vllmcpu/v1/completions HTTP/2.0" 504 160 "-" "curl/7.81.0" 1233 60.001 [auth-apisix-auth-apisix-gateway-80] [] 10.233.104.80:9080 0 60.000 504 b90b75191948d3bf2aff518dc7b72510

alexsin368 · 2026-05-22T23:51:30Z

Inference is functional after increasing the ingress and APISIX timeouts to 300s.

alexsin368 · 2026-05-26T15:40:57Z

+kubectl get ingress -A | grep <model-name>
+```
+
+Then annotate each ingress:


Suggested change

Then annotate each ingress:

Then annotate **EACH** ingress:

Let's emphasize EACH, since two ingresses are created.
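
Because more than one ingress matches the model name, a loop is one way to make sure none is missed. The following is a minimal sketch, not the guide's actual commands: the function name `annotate_model_ingresses` is hypothetical, and it assumes the standard ingress-nginx `proxy-read-timeout` and `proxy-send-timeout` annotations are the ones being raised. The APISIX timeout is configured separately and is not covered here.

```shell
# Hypothetical helper: apply ingress-nginx timeout annotations to every
# ingress whose namespace/name line contains the given model name.
# Assumes kubectl is already configured for the target cluster.
annotate_model_ingresses() {
  model="$1"
  timeout="${2:-300}"   # seconds; 300 is the value reported to work in this PR
  # List ingresses across all namespaces as "<namespace> <name>" pairs.
  kubectl get ingress -A --no-headers \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name |
    grep -i -- "$model" |
    while read -r ns name; do
      # --overwrite lets the command be re-run safely on annotated ingresses.
      kubectl annotate ingress "$name" -n "$ns" --overwrite \
        "nginx.ingress.kubernetes.io/proxy-read-timeout=$timeout" \
        "nginx.ingress.kubernetes.io/proxy-send-timeout=$timeout"
    done
}
```

Calling `annotate_model_ingresses <model-name> 300` would then annotate each matching ingress in turn.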

alexsin368 · 2026-05-26T15:43:02Z

+
+**Cause:**
+
+CPU-based model inference (`vllm-cpu`) generates tokens at ~0.3-0.4 tokens/s. Responses requiring more than ~24 tokens exceed the default 60s upstream timeout enforced by ingress-nginx and APISIX.


The performance differs for every Xeon SKU and will change over time. Let's keep the note generic by mentioning only that the root cause is the response taking longer than the default 60s upstream timeout.
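
For context, the "~24 tokens" figure in the quoted draft is just rate multiplied by timeout. The sketch below reproduces that arithmetic; `token_budget` is a hypothetical helper, and the rates are the draft's numbers, which (as noted above) vary by SKU and configuration.

```shell
# Hypothetical helper: how many tokens fit inside an upstream timeout
# at a given generation rate.
token_budget() {
  rate="$1"      # tokens per second
  timeout="$2"   # upstream timeout in seconds
  awk -v r="$rate" -v t="$timeout" 'BEGIN { printf "%.0f\n", r * t }'
}
```

With the draft's figures, `token_budget 0.4 60` prints 24 and `token_budget 0.3 60` prints 18, so any longer response would hit the 60s limit.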

alexsin368 · 2026-05-26T15:43:40Z

+**Notes:**
+
+- The nginx ingress annotation takes effect immediately; no pod restart required.
+- For GPU-based deployments this timeout is rarely needed as throughput is significantly higher (30-50 tokens/s vs 0.3-0.4 tokens/s on CPU).


Same here: let's remove the performance numbers, as they will vary by SKU, by config, and over time.

feat: add llama-3.1-8b-instruct model card and deployment guide for D…

d68922a

…ell EI Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

alexsin368 self-requested a review April 29, 2026 04:18

Harika added 3 commits May 5, 2026 11:58

updated llama 3.1 8b instruct deployment.md

9e3c97d

updating llama 3.1 8b instruct deployment.md

806461e

update llama 3.1 8b instruct deployment.md

7bcf2bc

alexsin368 reviewed May 22, 2026

Comment thread third_party/Dell/model-deployment/llama-3.1-8b-instruct/xeon-deployment.md Outdated

Harika added 3 commits May 26, 2026 09:21

Add model deployment troubleshooting guide for 504 gateway timeout

8eaa078

Remove em dashes from troubleshooting guide

3410c1d

update troubleshooting.md

1d52bc5

alexsin368 reviewed May 26, 2026

alexsin368 requested review from AhmedSeemalK, mdfaheem-intel and psurabh May 26, 2026 17:31

Remove README.md from model-deployment folder

971441a

AhmedSeemalK approved these changes Jun 19, 2026

mdfaheem-intel approved these changes Jun 19, 2026

cld2labs/llama-3.1-8b-instruct#98

arpannookala-12 wants to merge 8 commits into opea-project:main from cld2labs:cld2labs/llama-3.1-8b-instruct

4 participants

