
How to Build a Cheap and Scalable Feature Store on S3 with 1000× Acceleration

Using Parquet on S3 as a lightweight feature store is becoming common, but querying petabyte-scale data lakes directly from cloud object storage remains painfully slow: latencies often reach hundreds of milliseconds, and performance is inconsistent at scale.

In this talk, we’ll walk through how to turn your S3-based Parquet data lake into a high-performance feature store—without rearchitecting your stack, rewriting data, or buying expensive hardware.

We present a system architecture co-designed with Alluxio, acting as a high-throughput, low-latency S3 proxy. This layer delivers sub-millisecond Time-to-First-Byte (TTFB)—comparable to Amazon S3 Express—while remaining fully compatible with existing S3 APIs. In production benchmarks, a 50-node Alluxio cluster achieves over 1 million S3 ops/sec—50× the throughput of S3 Express—at predictable latency and low cost.
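Because the proxy keeps the S3 wire protocol, adopting it is mostly a matter of pointing existing clients at a different endpoint. The sketch below shows the idea with boto3; the proxy address, credentials, bucket, and object key are placeholder assumptions rather than details from the talk.

    # Minimal sketch, assuming a deployment where the Alluxio proxy exposes an
    # S3-compatible endpoint. The endpoint URL, credentials, bucket, and key
    # below are illustrative placeholders, not values from the talk.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://alluxio-proxy.internal:39999",  # assumed proxy address
        aws_access_key_id="placeholder",       # credential handling is deployment-specific
        aws_secret_access_key="placeholder",
    )

    # The application keeps using plain S3 calls; reads are served from the
    # proxy's cache instead of going to S3 directly.
    obj = s3.get_object(Bucket="feature-store", Key="features/user_embeddings.parquet")
    data = obj["Body"].read()
    print(f"fetched {len(data)} bytes through the proxy")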

To further optimize feature lookups and point queries, we introduce pluggable Parquet pre-processing inside the Alluxio proxy. This offloads index scans and row filtering from the query engine, enabling record-level lookups at 0.3 ms latency and 3,000 QPS per core, 100× faster than traditional approaches.
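For reference, the sketch below shows the kind of scan-and-filter work being offloaded: a record-level Parquet lookup with column projection and a key predicate, written with pyarrow and run on the client purely for illustration. In the architecture described above this filtering happens inside the proxy; the file path, column names, and endpoint here are assumptions, not the talk's plugin API.

    # Minimal sketch of the point-lookup pattern: project two columns and
    # filter on a key so Parquet statistics prune most row groups. Paths,
    # column names, and the endpoint are assumed, not taken from the talk.
    import pyarrow.dataset as ds
    import pyarrow.fs as pafs

    s3fs = pafs.S3FileSystem(
        endpoint_override="alluxio-proxy.internal:39999",  # assumed S3-compatible endpoint
        scheme="http",
        access_key="placeholder",
        secret_key="placeholder",
    )

    dataset = ds.dataset(
        "feature-store/features/user_embeddings.parquet",
        format="parquet",
        filesystem=s3fs,
    )

    # Record-level lookup expressed as a predicate plus column projection.
    table = dataset.to_table(
        filter=ds.field("user_id") == 123456,
        columns=["user_id", "embedding"],
    )
    print(table.to_pydict())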

This talk is ideal for teams building ML platforms or feature stores on top of cloud-native storage who want speed without the spend.

Bin Fan

Founding Engineer, Alluxio
