JuiceFS, a fast-growing file storage for massive volumes of data ~ File Storage Technologies (FST)

At the 68th IT Press Tour, Joe Zhou, Developer Relations Engineer at Juicedata, the company behind JuiceFS, presented an update on the company's vision for cloud-native storage and the rapidly evolving role of object storage in modern data infrastructure. Rather than focusing solely on JuiceFS itself, Zhou framed the discussion around a broader industry transition in which object storage has become the foundational storage layer for AI, analytics, databases, and cloud-native applications. His central argument was that object storage has become the dominant persistence layer because of its economics and scalability, yet remains too primitive for most enterprise workloads when used directly. As a result, an entire generation of software, including distributed databases, vector stores, streaming platforms, and file systems, is emerging to provide richer interfaces while continuing to use immutable object storage as the underlying medium.

According to Zhou, object storage has evolved from being a simple archival technology into the de facto backend of modern cloud infrastructure. He pointed to a growing list of platforms that are now fundamentally built on object storage rather than block or file storage. Examples include AI databases such as LanceDB, Chroma, and Milvus; cloud databases including Neon, which was recently acquired by Databricks; distributed SQL platforms such as TiDB; streaming systems like WarpStream; analytics engines; and newer cloud-native storage systems such as turbopuffer, used by companies including Anthropic and Notion. JuiceFS itself belongs to this category of technologies that leverage object storage while presenting applications with more familiar interfaces. The breadth of these examples illustrated that object storage is no longer confined to backup or archival use cases but has become the persistence layer underpinning modern data platforms.

The appeal of object storage, Zhou explained, is based on several structural advantages that are difficult for traditional storage architectures to match. Object stores expose an extremely simple API built around operations such as PUT, GET, and Compare-and-Swap. This simplicity enables hyperscale cloud providers to deliver extraordinary scalability while maintaining operational efficiency. Unlike conventional file systems, object stores employ a flat namespace rather than hierarchical directories, allowing virtually unlimited scaling without the metadata bottlenecks associated with traditional storage architectures. Public cloud object storage services also routinely advertise eleven nines (99.999999999%) of data durability and support multi-region availability for high resilience. Combined with features such as immutability and exceptionally low storage costs—typically around two cents per gigabyte per month in major cloud regions—object storage has become the most economical and reliable long-term storage platform available.
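The small API surface described above can be illustrated with a toy in-memory model. This is a minimal sketch, not any provider's real API: the class name `ToyObjectStore`, its method names, and the use of a content hash as the version tag are all assumptions made for illustration.

```python
import hashlib


class ToyObjectStore:
    """In-memory stand-in for an object store with a flat key space (hypothetical)."""

    def __init__(self):
        self._objects = {}  # key -> (etag, data); no directories, just keys

    @staticmethod
    def _etag(data: bytes) -> str:
        # Assumption: the version tag is a content hash. Real services define their own.
        return hashlib.sha256(data).hexdigest()

    def put(self, key: str, data: bytes) -> str:
        """Store a whole object, replacing any previous version."""
        etag = self._etag(data)
        self._objects[key] = (etag, data)
        return etag

    def get(self, key: str):
        """Return (etag, data) for a key; raises KeyError if it does not exist."""
        return self._objects[key]

    def compare_and_swap(self, key: str, expected_etag, data: bytes) -> bool:
        """Write only if the stored version still matches expected_etag.

        Pass expected_etag=None to mean "create only if the key is absent".
        """
        current = self._objects.get(key)
        current_etag = current[0] if current else None
        if current_etag != expected_etag:
            return False  # someone else changed the object first
        self.put(key, data)
        return True
```

Because every operation addresses one whole object by key, there is nothing here that resembles a partial update or a directory, which is the trade-off the following paragraphs discuss.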

Despite these strengths, Zhou argued that object storage is fundamentally unsuitable for many application workloads when used directly. Most enterprise software expects richer file system semantics than object stores provide. Objects cannot be modified in place, meaning even minor updates often require entire files to be rewritten. Directory hierarchies do not actually exist, instead being simulated through indexed prefixes that become increasingly expensive to manage as environments scale. Batch metadata operations such as renaming large directory trees are slow and costly because they involve manipulating enormous numbers of object keys. Object stores also exhibit higher latency than conventional file systems, cannot execute applications directly, and perform poorly when managing structured datasets composed of many related files. These limitations create friction for AI pipelines, software development environments, analytics platforms, and enterprise applications originally designed around POSIX file systems.
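To make the point about simulated directories concrete, the sketch below derives a "directory listing" from a flat list of keys using a prefix and a delimiter. The function name `list_dir` and the sample keys are hypothetical; the sketch only mirrors the general prefix-and-delimiter idea and is not a client for any real service.

```python
def list_dir(keys, prefix, delimiter="/"):
    """Simulate listing one directory level over a flat key space.

    Returns (files, subdirs): keys directly under the prefix, and the
    distinct next-level prefixes that play the role of subdirectories.
    """
    files, subdirs = [], set()
    for key in keys:  # every key must be examined: there is no real directory to open
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the key equals the prefix itself
        head, sep, _ = rest.partition(delimiter)
        if sep:
            subdirs.add(prefix + head + delimiter)  # deeper key, collapsed to one entry
        else:
            files.append(key)
    return sorted(files), sorted(subdirs)


keys = ["data/a.txt", "data/train/1.bin", "data/train/2.bin", "logs/x.log"]
files, subdirs = list_dir(keys, "data/")
```

The "directory" `data/train/` exists only because two keys happen to share that prefix; deleting both keys makes it vanish, and there is no single record that could be renamed or given permissions.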

According to Zhou, these shortcomings explain why many modern cloud-native systems are effectively rebuilding traditional interfaces on top of object storage. Instead of abandoning POSIX or relational database interfaces, companies increasingly expose familiar APIs while using object storage solely as the persistence layer. JuiceFS provides POSIX compatibility, Neon delivers PostgreSQL semantics over object storage, and numerous modern databases perform similar abstraction for their respective workloads. The trend reflects an industry consensus that object storage offers compelling economics but requires additional software layers before it becomes practical for mainstream computing.

A significant portion of the presentation examined Amazon Web Services' S3 Files service, released in April 2026. Zhou described the product as "a decent approach" that validates JuiceFS's overall architectural direction but argued that AWS's implementation remains constrained by important design decisions. S3 Files enables customers to mount an S3 bucket as a POSIX-compatible NFS file system by placing Amazon Elastic File System (EFS) in front of S3. Within this architecture, EFS acts as both the metadata layer and high-performance cache while S3 remains the authoritative copy of all data.

The system employs a strict one-to-one relationship between files and objects. Small files, particularly those below the default threshold of 128 kilobytes, remain optimized for low-latency access through the EFS layer. Write operations are initially committed to EFS before being synchronized back to S3 after approximately sixty seconds. Although this approach improves responsiveness compared with accessing S3 directly, Zhou argued that it introduces additional complexity and several significant limitations.

One of the most important concerns involves write amplification. Because each file corresponds directly to a single object, modifying even a tiny portion of a large file requires substantial data movement. For example, appending only a few bytes to a two-gigabyte video file requires retrieving the complete object, merging the new data, and rewriting the entire object through Amazon's multi-stage append workflow. This process increases latency while consuming additional network bandwidth and storage operations.
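The arithmetic behind this example is simple enough to sketch. The function below is a rough model that assumes the whole object is read once and rewritten once per append, as the paragraph describes; it ignores request overhead and any provider-specific optimizations. The function name and the choice of 2 GiB and 16 bytes are illustrative assumptions.

```python
def whole_object_append_cost(file_size: int, append_size: int) -> dict:
    """Bytes moved when an append forces a full read and rewrite of one object."""
    bytes_read = file_size                   # retrieve the complete existing object
    bytes_written = file_size + append_size  # write back the old data plus the new bytes
    return {
        "bytes_read": bytes_read,
        "bytes_written": bytes_written,
        # Ratio of bytes physically written to bytes the application asked to add.
        "amplification": bytes_written / append_size,
    }


GIB = 1024 ** 3
cost = whole_object_append_cost(2 * GIB, 16)  # assumed sizes: 2 GiB file, 16-byte append
```

Under these assumptions, adding 16 bytes means writing a little over 2 GiB, so the amplification factor is on the order of a hundred million.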

Metadata operations also become increasingly expensive at scale. Zhou illustrated this using a simple rename command. Renaming a directory containing one million files in a conventional POSIX file system typically requires only metadata updates. Under S3 Files, however, such an operation eventually triggers background rewrites of every corresponding object because each object's key incorporates the file path. Consequently, operations that appear trivial to applications may generate substantial backend activity, increasing operational costs and execution time.
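The contrast can be sketched with two toy layouts. In the first, the path is part of every object key, so a rename touches every object; in the second, a directory is a single entry pointing at an inode number, so a rename touches one record. Both functions and their operation counts are illustrative assumptions, not a model of how AWS or any file system actually accounts for work.

```python
def rename_prefix(objects: dict, old_prefix: str, new_prefix: str) -> int:
    """Path-keyed layout: each object under the prefix is copied, then deleted."""
    ops = 0
    for key in [k for k in objects if k.startswith(old_prefix)]:
        objects[new_prefix + key[len(old_prefix):]] = objects.pop(key)
        ops += 2  # counted as one copy plus one delete per object
    return ops


def rename_entry(directory: dict, old_name: str, new_name: str) -> int:
    """Inode-style layout: only the parent directory's entry changes."""
    directory[new_name] = directory.pop(old_name)
    return 1  # a single metadata update, regardless of how many files are inside


# 10,000 files keep the demo quick; the text's example uses one million.
objects = {f"old/{i}.dat": b"" for i in range(10_000)}
object_ops = rename_prefix(objects, "old/", "new/")

root = {"old": 42}  # directory name -> inode number (hypothetical)
metadata_ops = rename_entry(root, "old", "new")
```

The first count grows linearly with the number of files, while the second stays constant, which is the gap the rename example is meant to highlight.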

Additional trade-offs include batching delays introduced by asynchronous synchronization between EFS and S3, conflict resolution policies that always treat S3 as the authoritative source, and an architecture limited exclusively to Amazon Web Services. Customers cannot extend the solution across multiple public clouds or integrate alternative object storage platforms. Pricing also becomes more complicated because organizations pay separately for EFS capacity, EFS read and write operations, synchronization processes, and underlying S3 storage and requests. According to Zhou's comparison, S3 Files successfully delivers POSIX compatibility and directory hierarchies but continues to struggle with in-place updates, efficient metadata operations, application execution, and workloads involving structured datasets.

JuiceFS approaches these challenges through a fundamentally different architecture centered on strict separation of data and metadata. Rather than mapping one file to one object, JuiceFS splits file data into immutable blocks of four megabytes by default, each stored as its own object. Metadata, including directory structure, permissions, and file-to-block mappings, is maintained independently within a dedicated metadata engine. Because only modified blocks require rewriting, appending data to a large file typically touches only the final block rather than recreating the entire file as one object. Similarly, renaming a directory becomes a lightweight metadata transaction instead of triggering extensive object rewrites.
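The idea of splitting a file into fixed-size immutable pieces, with a separate metadata record that lists them, can be sketched as follows. This is a deliberately simplified model and not JuiceFS's actual on-disk layout or API: the class `BlockFile`, its fields, and the object key format are hypothetical, and the real system's internal structures are more elaborate.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4 * 1024 * 1024  # four megabytes, the figure given in the text


@dataclass
class BlockFile:
    block_size: int = BLOCK_SIZE
    objects: dict = field(default_factory=dict)  # stand-in object store: key -> bytes
    blocks: list = field(default_factory=list)   # metadata: ordered object keys of the file
    size: int = 0                                # metadata: logical file length
    puts: int = 0                                # number of objects written so far
    _next_id: int = 0

    def _put(self, data: bytes) -> str:
        key = f"blk-{self._next_id}"  # fresh key each time: objects are never overwritten
        self._next_id += 1
        self.objects[key] = data
        self.puts += 1
        return key

    def append(self, data: bytes) -> None:
        if not data:
            return
        tail = b""
        if self.blocks and self.size % self.block_size:
            # A partial last piece is replaced by a new object; the old one
            # becomes garbage that a real system would clean up later.
            tail = self.objects[self.blocks.pop()]
        buf = tail + data
        for off in range(0, len(buf), self.block_size):
            self.blocks.append(self._put(buf[off:off + self.block_size]))
        self.size += len(data)

    def read(self) -> bytes:
        return b"".join(self.objects[k] for k in self.blocks)


f = BlockFile(block_size=4)  # tiny block size so the behaviour is easy to see
f.append(b"abcdefghij")      # 10 bytes -> three objects
puts_after_first_write = f.puts
f.append(b"k")               # only the last piece is written again
```

In this sketch the second append writes a single small object and updates the metadata list, however large the file already is; renaming the file would not touch `objects` at all because the keys do not contain the path.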

The Community Edition of JuiceFS is released under the Apache 2.0 open-source license and supports multiple external metadata databases, including Redis, TiKV, MySQL, PostgreSQL, and FoundationDB. Users can combine these metadata engines with virtually any mainstream object storage platform, including Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, Alibaba Cloud OSS, Tencent Cloud COS, Ceph, and MinIO. Applications access the storage through several interfaces, including POSIX via FUSE, Java and Python software development kits, a Kubernetes CSI driver for container environments, and an S3 Gateway providing compatibility with existing object-based applications.

Recent Community Edition enhancements include configurable storage-class tiering. Administrators can now automatically place directories or individual files into different cloud storage classes such as Amazon S3 Standard-Infrequent Access, Intelligent-Tiering, or Glacier Instant Retrieval. This feature allows organizations to optimize storage costs according to workload requirements without altering application behavior.

The Enterprise Edition expands considerably on the open-source platform. Instead of relying on external databases, it introduces a proprietary distributed metadata engine based on the Raft consensus protocol. This metadata layer scales horizontally while maintaining three-copy redundancy for fault tolerance. Enterprise deployments also gain access to a distributed cache architecture shared across thousands of clients, enabling improved performance for large distributed AI clusters and analytics environments.

Additional Enterprise Edition capabilities include native cross-region replication and multi-cloud mirroring. Organizations may choose cache-only mirrors, where data remains stored centrally while caches are positioned closer to compute resources, or full mirrors that replicate complete datasets across multiple cloud regions or providers. These capabilities support hybrid cloud architectures while reducing latency for globally distributed workloads.

Scalability represents another major Enterprise Edition enhancement. JuiceFS recently increased the supported limit from one hundred billion to five hundred billion files within a single volume. Zhou noted that one existing customer deployment already stores approximately 1.47 pebibytes of data and more than 404 billion inodes, demonstrating that the architecture has moved beyond theoretical scalability into real-world production environments.

Several customer examples illustrated how these capabilities are being used in practice. MiniMax, one of China's leading artificial intelligence laboratories, operates JuiceFS in a hybrid cloud architecture where GPU clusters remain inside the company's own data center while object storage resides elsewhere. Cache-only mirrors position frequently accessed data closer to the GPUs to minimize latency, and the organization is evaluating full replication across multiple locations. JuiceFS argues that this architecture provides the flexibility to balance infrastructure costs against AI training performance without requiring duplicate storage management.

Beyond MiniMax, Zhou highlighted a growing list of customers spanning AI, cloud infrastructure, robotics, internet services, and software development. Named adopters include HeyGen, GMI, PixVerse, Momenta, Horizon Robotics, Xiaomi, Lovart, NAVER, Trip.com, fal, D-Robotics, Cerebrium, Fly.io, and Jerry. These deployments demonstrate the platform's applicability across generative AI, autonomous driving, cloud-native applications, and enterprise software development.

Finally, Zhou discussed JuiceFS's commercial model. Enterprise Edition pricing is based solely on the amount of source-region storage capacity managed by the platform rather than the number of connected clients or compute nodes. This allows organizations to expand large GPU clusters without incurring additional JuiceFS licensing costs. Equally important, JuiceFS does not impose its own data transfer charges because clients communicate directly with the underlying object storage rather than routing traffic through JuiceFS-managed infrastructure. Customers therefore pay only the standard data transfer fees charged by their chosen cloud provider.

Overall, the presentation positioned JuiceFS as part of a broader shift in enterprise storage architecture. Rather than replacing object storage, the company argues that the future lies in enhancing it with higher-level interfaces that preserve cloud economics while eliminating operational limitations. By separating metadata from data, chunking files into immutable blocks, and supporting multiple clouds and storage providers, JuiceFS aims to provide the performance and flexibility of a distributed POSIX file system while retaining the scalability, durability, and low cost that have made object storage the foundation of modern AI, analytics, and cloud-native infrastructure.

Tuesday, June 23, 2026
