Unleashing GenAI: How a Next-Gen Data Format is Revolutionizing AI Data Storage

As excitement about GenAI's potential grows, it’s clear that while many focus on applications, models, and implementation details, data is the driving force behind success. For most organizations today, that data is unstructured and stored as files. GenAI and classic AI use cases depend on a mix of internal proprietary unstructured data and external open data—long-form text, video, and image files—to train sophisticated models.
Technology leaders increasingly acknowledge the critical need for a strategic approach to identifying, classifying, and securely unifying unstructured data. Such an approach maximizes model performance and minimizes the risk of overspending on infrastructure.
Apache Parquet has emerged as an industry standard for column-oriented data storage on object storage. Yet, its design introduces challenges for modern AI/ML workloads and unstructured data use cases. Many systems spend up to 30% of processing time on reading data, partly because Parquet’s fixed encoding schemes and rigid metadata handling hinder performance—especially when handling point lookups in semantic search or managing wide tables with thousands of columns.
Nimble, an open-source project led by Meta, addresses these issues by reducing I/O overheads, improving read speeds, and lowering memory usage. Its innovative design and extensible architecture make it ideally suited for the evolving demands of GenAI, feature engineering, and unstructured data processing.
Key Use Cases
AI/ML & Semantic Search
GenAI and semantic search often rely on point lookups that access only a few rows from massive, unstructured datasets. Parquet’s encodings require an entire page to be loaded even for a single row, hurting performance. Nimble’s sliceable, efficient encodings are designed to overcome this.
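The difference is easy to see with a toy simulation. The sketch below (illustrative only, not Nimble's or Parquet's actual on-disk layout) contrasts a page compressed as one unit, where a point lookup forces a full decompress-and-decode, with a fixed-width buffer that can be sliced at an offset:

```python
import struct
import zlib

rows = list(range(1000))

# Page-style encoding: the whole page is compressed as one unit, so
# reading a single value means decompressing and decoding everything.
page = zlib.compress(struct.pack(f"<{len(rows)}q", *rows))

def read_row_from_page(page: bytes, i: int) -> int:
    decoded = zlib.decompress(page)  # must inflate the entire page
    return struct.unpack_from("<q", decoded, i * 8)[0]

# Sliceable encoding: fixed-width values are addressable by offset, so a
# point lookup touches only the bytes it needs.
sliceable = struct.pack(f"<{len(rows)}q", *rows)

def read_row_sliced(buf: bytes, i: int) -> int:
    return struct.unpack_from("<q", buf, i * 8)[0]
```

Both reads return the same value, but the second one never pays the cost of the other 999 rows.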
Feature Engineering for Wide Datasets
Workloads with thousands of columns, standard in training models and feature engineering, benefit from Nimble’s lightweight metadata and flexible, cascading encoding mechanisms, which reduce the overhead of loading unnecessary schema information.
High-Performance, Parallel Environments
Nimble is architected to exploit modern hardware, such as SIMD instruction sets and GPUs, ensuring rapid decoding in parallel processing environments and paving the way for scalable, high-performance data processing.
Challenges with Existing Formats
1. I/O Overheads and Metadata Layout
Multiple I/O Operations: Traditional formats like Parquet require several I/O calls—one for the file footer, additional ones for stripe or page footers, and further calls for each data stream. This approach is particularly inefficient for point lookups, where an entire page must be loaded to access a single row.
Inefficient Metadata Management for Wide Datasets: Unstructured data use cases often involve thousands of columns. Parquet forces readers to load complete schema metadata for all columns, even when only a small subset is needed, creating unnecessary overhead.
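A toy layout shows why per-column metadata helps. In this sketch (a hypothetical layout, not Parquet's or Nimble's actual metadata format), each column's metadata is a separate blob with an offset index, so a reader interested in one column parses one small slice rather than the schema for all 5,000 columns:

```python
import json

def write_metadata(columns: dict) -> tuple:
    """Serialize per-column metadata blobs plus a name -> (offset, length) index."""
    blobs, index, offset = [], {}, 0
    for name, meta in columns.items():
        blob = json.dumps(meta).encode()
        index[name] = (offset, len(blob))
        blobs.append(blob)
        offset += len(blob)
    return b"".join(blobs), index

def read_column_metadata(buf: bytes, index: dict, name: str) -> dict:
    # Only the requested column's bytes are sliced and parsed.
    off, length = index[name]
    return json.loads(buf[off:off + length])

# A wide, feature-engineering-style table with 5,000 columns.
cols = {f"feature_{i}": {"type": "float", "id": i} for i in range(5000)}
buf, index = write_metadata(cols)
```

With a monolithic schema, the same lookup would require deserializing metadata for every column first.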
Decoding Entire Blocks: Since Parquet’s encodings are not sliceable, even a single value retrieval can require decompressing and decoding a whole block, significantly increasing latency.
2. Encoding and Compression Limitations
Fixed Encoding Options: Parquet supports only a fixed set of data encoders, limiting flexibility in optimizing for unstructured data scenarios where new encoding styles might offer significant performance and storage benefits.
Metadata Storage Constraints: Parquet restricts metadata storage to within pages. However, unstructured data workloads—such as semantic search with embeddings or images—could benefit from more flexible metadata storage at the row or file level.
Coupled Encoding and Compression: The tight integration of encoding and compression in existing formats limits the ability to optimize each independently, thereby hindering both efficient compression and selective decoding.
3. Memory Constraints During Writes
Handling Wide Data Efficiently: Writing data purely in a columnar fashion requires large in-memory buffers. To mitigate this, data is split into stripes or row groups, which increases complexity and metadata overhead—particularly challenging for wide, unstructured datasets.
How Nimble Addresses These Issues
1. Optimized Data Layout and Lighter Metadata
Integrated Metadata in Streams: Nimble embeds encoding information directly within each data stream, eliminating bulky stripe or page footers and reducing the file footer to only essential location data.
Efficient Metadata Access with FlatBuffers: By utilizing FlatBuffers instead of older serialization formats such as Thrift or Protobuf, Nimble efficiently manages large metadata sections—a critical advantage for datasets with thousands of columns.
Block Encoding Strategy: Nimble uses block encoding over traditional stream encoding to ensure predictable memory usage during decoding, making it especially beneficial for wide datasets containing large unstructured columns.
2. Advanced, Extensible, and Cascading Encodings
Nested (Cascading) Encodings: Nimble supports recursive or composite encodings that allow developers to build encoding trees tailored to specific data types. For example, dictionary encoding can be optimized by separately managing the alphabet and indices—overcoming the limitations of fixed encoders.
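The dictionary example can be sketched as a small encoding tree: dictionary-encode the column, then run-length-encode the resulting index stream. This is an illustration of the cascading idea in plain Python, not Nimble's encoding implementation:

```python
from itertools import groupby

def dictionary_encode(values):
    """Split a column into an alphabet of distinct values plus an index stream."""
    alphabet, indices, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(alphabet)
            alphabet.append(v)
        indices.append(seen[v])
    return alphabet, indices

def rle_encode(indices):
    """Second encoding level: run-length-encode the child index stream."""
    return [(k, len(list(g))) for k, g in groupby(indices)]

def decode(alphabet, runs):
    """Walk the tree back: expand runs, then map indices through the alphabet."""
    return [alphabet[k] for k, n in runs for _ in range(n)]

col = ["cat", "cat", "cat", "dog", "dog", "cat"]
alphabet, indices = dictionary_encode(col)
runs = rle_encode(indices)
```

Because the alphabet and the index stream are separate children, each can be encoded (or re-encoded) with whatever scheme suits it best.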
Pluggable Encoding Selection Policies: With support for a variety of selection strategies—from brute-force to heuristic or machine learning–based methods—Nimble dynamically selects the optimal encoding for the workload.
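A brute-force policy is the simplest of these strategies: encode the column every candidate way and keep the smallest result. The encoder names and policy interface below are illustrative, not Nimble's API:

```python
import zlib

def plain_encode(values):
    return ",".join(values).encode()

def dict_encode(values):
    # Alphabet and index stream serialized together for size comparison.
    alphabet = sorted(set(values))
    lookup = {v: i for i, v in enumerate(alphabet)}
    return (";".join(alphabet) + "|" + ",".join(str(lookup[v]) for v in values)).encode()

def zlib_encode(values):
    return zlib.compress(plain_encode(values))

CANDIDATES = {"plain": plain_encode, "dictionary": dict_encode, "zlib": zlib_encode}

def choose_encoding(values):
    """Brute-force selection: try every candidate, keep the fewest bytes."""
    encoded = {name: fn(values) for name, fn in CANDIDATES.items()}
    best = min(encoded, key=lambda name: len(encoded[name]))
    return best, encoded[best]

best, blob = choose_encoding(["alpha"] * 500 + ["beta"] * 500)
```

Heuristic or learned policies would replace the exhaustive loop with a cheaper prediction, but expose the same "pick an encoding for this data" interface.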
Decoupled Metadata Storage: Nimble’s design decouples metadata storage from the physical layout, allowing metadata to be stored flexibly (e.g., at the row or file level) to support more efficient data access for unstructured workloads.
3. Core Design Principles Enhancing Usability and Performance
Wide: Nimble is optimized for workloads with thousands of columns, making it particularly effective for feature engineering and training tables in AI/ML scenarios.
Extensible: Recognizing that data encoding techniques evolve faster than file layouts, Nimble decouples stream encoding from the underlying physical layout. This approach enables users to extend or customize encodings recursively without breaking the file format.
Parallel: Designed with modern hardware in mind, Nimble is built to be SIMD and GPU friendly. Future releases will expose metadata to help optimize decoding trees and schedule parallel processing kernels.
Unified: Rather than being just a specification, Nimble is delivered as a unified library. This discourages fragmented implementations and promotes high-quality language bindings across platforms.
4. Enhanced Write and Read Strategies
Row-Based Data Accumulation: Nimble accumulates row data in memory and applies encoding and compression only after reaching a threshold, allowing for more informed encoding decisions and avoiding the overhead of decompressing large, irrelevant blocks.
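A minimal writer sketch makes the buffering step concrete. The class below (illustrative buffering logic with made-up names and thresholds, not Nimble's writer API) accumulates rows and only pivots them to columns and encodes once a threshold is reached, so the encoder sees a full buffer of values before committing to an encoding:

```python
class BufferedWriter:
    def __init__(self, flush_threshold_rows=1000):
        self.flush_threshold_rows = flush_threshold_rows
        self.buffer = []          # rows held in memory, not yet encoded
        self.flushed_chunks = []  # encoded column chunks written so far

    def write_row(self, row: dict):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_threshold_rows:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Pivot buffered rows into columns, then "encode" each column.
        # With a full buffer in hand, a real writer could inspect the
        # values before choosing an encoding.
        columns = {k: [r[k] for r in self.buffer] for k in self.buffer[0]}
        self.flushed_chunks.append({k: repr(v).encode() for k, v in columns.items()})
        self.buffer = []

w = BufferedWriter(flush_threshold_rows=2)
w.write_row({"id": 1, "label": "a"})
w.write_row({"id": 2, "label": "b"})  # hits the threshold, triggers a flush
w.write_row({"id": 3, "label": "c"})
w.flush()                             # flush the remainder
```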
Shared Dictionaries: By introducing shared dictionaries across and within stripes, Nimble ensures that common encoding alphabets are stored only once—improving storage efficiency and CPU cache locality.
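The sharing idea can be sketched in a few lines: two stripes of the same column encode against one dictionary object, so the alphabet is stored once rather than duplicated per stripe. This is a conceptual illustration, not how Nimble's encoding layer is actually implemented:

```python
class SharedDictionary:
    """One alphabet reused by every stripe that encodes against it."""

    def __init__(self):
        self.alphabet = []
        self._lookup = {}

    def index_of(self, value):
        if value not in self._lookup:
            self._lookup[value] = len(self.alphabet)
            self.alphabet.append(value)
        return self._lookup[value]

def encode_stripe(values, shared: SharedDictionary):
    # Each stripe stores only indices; the alphabet lives in the shared dictionary.
    return [shared.index_of(v) for v in values]

shared = SharedDictionary()
stripe1 = encode_stripe(["us", "uk", "us"], shared)
stripe2 = encode_stripe(["uk", "de", "us"], shared)
```

Without sharing, each stripe would carry its own copy of the overlapping alphabet ("us", "uk") on disk and in cache.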
Extensibility APIs: Nimble offers robust APIs that allow users to extend and integrate new encodings and file format features as needed, ensuring that the format can evolve alongside emerging data storage techniques.
Conclusion
While Apache Parquet has long been the industry standard for columnar data storage, its limitations—such as inefficient handling of point lookups, challenges with wide and unstructured data, fixed encoding options, and rigid metadata storage—pose significant hurdles for modern AI/ML and semantic search applications. Nimble addresses these issues with a forward-thinking design that reduces I/O overhead, employs advanced nested and cascading encoding strategies, and leverages lightweight metadata management with FlatBuffers. Designed to be wide, extensible, parallel, and unified, Nimble provides a robust and scalable foundation that meets the evolving demands of GenAI and other data-intensive applications.
That said, while Nimble shows great promise, it is currently challenging to find comprehensive implementation documentation. Integration must be simplified for enterprises outside of Meta to adopt Nimble as part of their modern data stack and cloud architecture. An abstraction layer that eases interoperability with existing technologies—such as Object Stores, Apache Iceberg, and others—is essential, much like the current ecosystem surrounding ORC, Parquet, and Avro. Such an integration layer would be critical in driving broader adoption and ensuring that Nimble can be seamlessly incorporated into existing data infrastructures. Although still in its early days and not yet fully supported across broader use cases, this next-gen format is one to watch closely.