1 change: 1 addition & 0 deletions docs/en/TOC.md
@@ -37,6 +37,7 @@
+ Advanced
- [Accelerate Data Access by MEM or SSD](samples/accelerate_data_by_mem_or_ssd.md)
- [Alluxio Tieredstore Configuration](samples/tieredstore_config.md)
- [Alluxio S3 High-Concurrency Read Tuning](samples/alluxio_s3_high_concurrency.md)
- [Pod Scheduling Optimization](operation/pod_schedule_optimization.md)
- [Pod Scheduling Base on Runtime Tiered Locality](operation/tiered_locality_schedule.md)
- [Set FUSE clean policy](samples/fuse_clean_policy.md)
193 changes: 193 additions & 0 deletions docs/en/samples/alluxio_s3_high_concurrency.md
@@ -0,0 +1,193 @@
# Alluxio S3 High-Concurrency Read Tuning

This document provides a tuning profile for high-concurrency read workloads that use AlluxioRuntime with an S3-compatible backend.

This profile was validated while investigating [issue #5802](https://github.com/fluid-cloudnative/fluid/issues/5802), where fio reads over an S3-backed AlluxioRuntime could hang at high concurrency. It does not change Alluxio internals; the configuration is applied entirely through `spec.properties` and FUSE args.

## Scenario

The issue was reproduced in an environment close to the following:

- Kubernetes v1.26.7
- Fluid v1.0.8 and Fluid master at the time of investigation
- Alluxio 2.9.5
- SeaweedFS 3.80 as an S3-compatible backend
- One Alluxio master, one worker, and FUSE
- 64 files in S3, each about 5GiB
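
To prepare a comparable dataset, something like the following sketch can generate and upload the 64 test files. The bucket name, object path, and `$S3_ENDPOINT` variable below are placeholders, not values from the issue; the AWS CLI is assumed to be available:

```bash
# Generate and upload 64 files of ~5GiB each (names and endpoint are placeholders).
for i in $(seq 0 63); do
  dd if=/dev/urandom of="file${i}" bs=1M count=5120
  aws s3 cp "file${i}" "s3://test-bucket/data/file${i}" --endpoint-url "$S3_ENDPOINT"
  rm -f "file${i}"
done
```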

The fio command was:

```bash
FILES=$(seq -f "/data/file%g" 0 63 | paste -sd:)
fio -iodepth=1 -rw=read -ioengine=libaio -bs=256k \
-numjobs=<numjobs> -group_reporting -size=5G \
--filename="$FILES" -name=read_test --readonly -direct=1 --runtime=60
```

Observed behavior without this tuning profile:

- `numjobs=8` and `numjobs=16` completed.
- Higher concurrency, such as `numjobs=32` or `numjobs=64`, could hang.
- The test Pod could fail to delete normally after the hang.
- Force deletion could leave fio or FUSE state stuck on the node.

The validation suggests that this tuning mainly mitigates pressure on the Alluxio 2.9.5 FUSE/client read path under high-concurrency S3 reads. In the reproduced environment, JNI-FUSE could hit path-lock timeout symptoms; when using JNR/libfuse2 instead, S3 thread/client-pool tuning and disabling direct memory IO were also required to make repeated `numjobs=64` runs stable.
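
To confirm which FUSE implementation is actually in use, the mount options and FUSE pod logs can be inspected. A sketch, assuming the Fluid-generated FUSE pod follows the usual `<runtime-name>-fuse-*` naming (the exact pod name below is illustrative):

```bash
# On the node: check the mounted FUSE filesystem and its options.
mount | grep -i fuse

# From the cluster: find the FUSE pod and scan its logs for libfuse
# version messages and timeout symptoms.
kubectl get pods | grep my-s3-fuse
kubectl logs my-s3-fuse-abcde | grep -iE "libfuse|timeout|DeadlineExceeded"
```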

## Recommended Runtime Configuration

Use this profile only for S3 or S3-compatible high-concurrency read workloads. Keep the default behavior for other workloads unless you have validated the same tuning in your own environment.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: my-s3
spec:
  replicas: 1
  master:
    resources:
      requests:
        cpu: 8
        memory: 32Gi
      limits:
        cpu: 8
        memory: 32Gi
  worker:
    resources:
      requests:
        cpu: 8
        memory: 32Gi
      limits:
        cpu: 8
        memory: 64Gi
  fuse:
    jvmOptions:
      - "-Xmx16G"
      - "-Xms16G"
      - "-XX:+UseG1GC"
      - "-XX:MaxDirectMemorySize=32g"
      - "-XX:+UnlockExperimentalVMOptions"
      - "-XX:ActiveProcessorCount=16"
    resources:
      requests:
        cpu: 16
        memory: 32Gi
      limits:
        cpu: 16
        memory: 64Gi
    args:
      - fuse
      - --fuse-opts=kernel_cache,rw,allow_other,entry_timeout=60,attr_timeout=60,max_background=256,congestion_threshold=256
  properties:
    alluxio.fuse.jnifuse.enabled: "false"
    alluxio.fuse.jnifuse.libfuse.version: "2"
    alluxio.underfs.s3.threads.max: "2048"
    alluxio.user.block.worker.client.pool.max: "8192"
    alluxio.user.block.size.bytes.default: "64MB"
    alluxio.user.streaming.reader.chunk.size.bytes: "64MB"
    alluxio.user.local.reader.chunk.size.bytes: "64MB"
    alluxio.worker.network.reader.buffer.size: "64MB"
    alluxio.user.direct.memory.io.enabled: "false"
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /home/work/fluid_test
        quota: 100G
        high: "0.95"
        low: "0.6"
```

Important details:

- Set `alluxio.fuse.jnifuse.enabled=false` and `alluxio.fuse.jnifuse.libfuse.version=2` to use JNR/libfuse2.
- Remove `max_idle_threads=*` from FUSE args when using libfuse2. `max_idle_threads` is a libfuse3 option.
- Increase S3 threads and worker client pool size for high-concurrency reads.
- Use larger read chunks and buffers to reduce request fragmentation.
- Set `alluxio.user.direct.memory.io.enabled=false`. In the reproduced environment, this was required for repeated `numjobs=64` stability.
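
After the runtime is applied, it is worth double-checking that these properties actually reached the Alluxio cluster. A sketch using Alluxio's `getConf` CLI, assuming the Fluid-generated master pod is named `my-s3-master-0` with an `alluxio-master` container (names may differ in your deployment):

```bash
# Verify the tuned properties inside the running cluster.
kubectl exec my-s3-master-0 -c alluxio-master -- \
  alluxio getConf alluxio.fuse.jnifuse.enabled
kubectl exec my-s3-master-0 -c alluxio-master -- \
  alluxio getConf alluxio.underfs.s3.threads.max
```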

## Dataset Example

Store access keys in a Kubernetes Secret instead of hardcoding them in YAML.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: my-s3
spec:
  mounts:
    - mountPoint: s3://<bucket-name>/<path-to-data>/
      name: s3
      options:
        alluxio.underfs.s3.endpoint: <s3-endpoint>
        alluxio.underfs.s3.endpoint.region: <s3-endpoint-region>
      encryptOptions:
        - name: aws.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: aws.accessKeyId
        - name: aws.secretKey
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: aws.secretKey
```
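
The Secret referenced by `encryptOptions` can be created ahead of time; the key names must match those referenced above, and the values below are placeholders:

```bash
# Create the Secret holding the S3 credentials (values are placeholders).
kubectl create secret generic mysecret \
  --from-literal=aws.accessKeyId=<ACCESS_KEY_ID> \
  --from-literal=aws.secretKey=<SECRET_KEY>
```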

## Test Pod Example

Mount the dataset and run fio from `/data`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fio-reader
spec:
  restartPolicy: Never
  containers:
    - name: client
      image: alluxio/alluxio:2.9.5
      securityContext:
        runAsUser: 0
      command: ["/bin/bash", "-lc", "sleep infinity"]
      volumeMounts:
        - mountPath: /data
          name: data
          readOnly: true
          subPath: s3
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-s3
        readOnly: true
```
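
Once the Pod is running, the fio test from the Scenario section can be run inside it. A sketch, assuming fio is installed in the container image:

```bash
# Run the high-concurrency read test from inside the Pod
# (mirrors the fio command shown in the Scenario section).
kubectl exec -it fio-reader -- /bin/bash -lc '
  FILES=$(seq -f "/data/file%g" 0 63 | paste -sd:)
  fio -iodepth=1 -rw=read -ioengine=libaio -bs=256k \
      -numjobs=64 -group_reporting -size=5G \
      --filename="$FILES" -name=read_test --readonly -direct=1 --runtime=60'
```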

## Validation Result

In the validation environment, after applying the above profile through the Fluid-generated AlluxioRuntime configuration, the results were:

```text
numjobs=8: passed
numjobs=16: passed
numjobs=32: passed
numjobs=64: passed
repeat numjobs=64: passed
test Pod deletion: passed
Alluxio master/worker/fuse restart count: 0
```

The following error symptoms were not observed after applying the profile:

- `DeadlineExceededRuntimeException`
- `Timer expired`
- `OutOfDirectMemoryError`

`TempBlockMeta not found` warnings could still appear in Alluxio logs, but fio completed successfully, test Pods were deleted normally, and Runtime components stayed healthy in the validation environment.
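
A quick post-run health check can confirm the restart counts and scan for the symptoms above; the pod and container names below are illustrative:

```bash
# Check restart counts for the Alluxio master/worker/fuse pods.
kubectl get pods | grep my-s3

# Scan worker logs for the error symptoms listed above.
kubectl logs my-s3-worker-0 -c alluxio-worker \
  | grep -iE "DeadlineExceeded|Timer expired|OutOfDirectMemory|TempBlockMeta" \
  | tail -n 20
```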

## Risks and Scope

- This is a tuning/configuration profile, not an upstream Alluxio internal fix.
- The values were validated for the reproduced S3-compatible workload in issue #5802. Different S3 backends, object sizes, network latency, and concurrency levels may still require tuning.
- Disabling direct memory IO improves stability for this workload, but it may affect performance.
- If the same symptoms continue after applying this profile, collect FUSE logs, worker logs, node process states, mount information, and kubelet logs before force-deleting Pods, as sketched below.
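
A minimal evidence-collection sketch (pod, container, and node paths are illustrative):

```bash
# Collect diagnostics before force-deleting a stuck test Pod.
kubectl describe pod fio-reader > pod-describe.log
kubectl logs my-s3-fuse-abcde > fuse.log 2>&1
kubectl logs my-s3-worker-0 -c alluxio-worker > worker.log 2>&1

# On the affected node:
ps aux | grep -E "fio|fuse" > node-procs.log
mount | grep -i alluxio > node-mounts.log
journalctl -u kubelet --since "1 hour ago" > kubelet.log
```
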
2 changes: 1 addition & 1 deletion docs/zh/TOC.md
@@ -42,6 +42,7 @@
+ Advanced Usage
- [Accelerate Data with Memory or SSD](samples/accelerate_data_by_mem_or_ssd.md)
- [AlluxioRuntime Tiered Storage Configuration](samples/tieredstore_config.md)
- [Alluxio S3 High-Concurrency Read Tuning](samples/alluxio_s3_high_concurrency.md)
- [Optimize Pod Scheduling via the Webhook Mechanism](operation/pod_schedule_optimization.md)
- [Pod Scheduling Based on Runtime Tiered Locality](operation/tiered_locality_schedule.md)
- [How to Enable FUSE Auto-Recovery](samples/fuse_recover.md)
@@ -84,4 +85,3 @@
- [How to Create and Delete Fluid Resources with the Go Client](dev/use_go_create_resource.md)
- [How to Use Other (Non-Go) Clients](dev/multiple-client-support.md)
- [Access via the REST API](samples/api_proxy.md)
