Revolutionizing Big Data Processing: Pinterest's Moka Unveiled
In a recent article, Pinterest laid out its plans for large-scale data processing with a new system called Moka. The platform moves Pinterest's core workloads from an aging Hadoop deployment to a Kubernetes-based architecture hosted on Amazon EKS. Apache Spark serves as the primary processing engine, and the team plans to integrate additional frameworks in the near future.
In a two-part blog series, engineers Soam Acharya, Rainie Li, William Tom, and Ang Zhang detail the process followed by the Pinterest Big Data Platform team. As the limitations of their existing Hadoop-based system (internally referred to as Monarch) became increasingly evident, the team evaluated alternatives for a next-generation data processing solution. That evaluation culminated in Moka, a cloud-native data processing platform designed to handle Pinterest's production workloads at scale. The first installment covers the overall design and the application layer, while the second examines what the authors call the "infrastructure-focused aspects of Moka," including lessons learned and future plans.
This transition to Kubernetes is framed in very practical terms, highlighting a significant industry trend where major tech companies are beginning to view Kubernetes not merely as a stateless service platform but as a crucial control plane for data management. As the popularity of Kubernetes surges within the Big Data community, Pinterest's team identified Kubernetes-based systems as the most promising successor to Hadoop 2.x. Any new platform they considered had to fulfill stringent criteria regarding scalability, security, cost-effectiveness, and the capacity to support multiple processing engines. Moka exemplifies how organizations can upgrade their data platforms from the Hadoop era while still leveraging existing investments in Spark.
A key focus of the second article is the operationalization of Spark on a massive scale within Kubernetes. The authors elaborate on their efforts to enhance Moka with logging, metrics, and job history services, enabling engineers to troubleshoot and optimize jobs without needing intricate knowledge of the underlying cluster topology. They standardized log collection using Fluent Bit and implemented uniform metrics through OpenTelemetry along with Prometheus-compatible endpoints, thus providing both infrastructure and application teams with a cohesive understanding of system health.
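To make the idea of a Prometheus-compatible endpoint concrete, the following is a minimal sketch of how job metrics could be rendered in the Prometheus text exposition format. It is not Pinterest's implementation: the metric names, label names, and the cluster name "moka-a" are hypothetical, and label-value escaping is left out for brevity.

```python
from typing import Dict, List, Tuple

# Hypothetical sample shape: (metric name, labels, value).
Sample = Tuple[str, Dict[str, str], float]


def render_prometheus(samples: List[Sample]) -> str:
    """Render samples as Prometheus text exposition lines.

    Simplification: label values are assumed to contain no quotes,
    backslashes, or newlines, so no escaping is performed.
    """
    lines = []
    for name, labels, value in samples:
        # Sort labels so the output is deterministic.
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_text:
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics for one Spark application.
samples = [
    ("spark_executor_count", {"app": "daily_agg", "cluster": "moka-a"}, 40),
    ("spark_job_failed_tasks", {"app": "daily_agg", "cluster": "moka-a"}, 2),
]
print(render_prometheus(samples))
```

Because every application exposes the same line-oriented format, a single scraper can collect metrics from infrastructure components and Spark jobs alike, which is what makes a shared view of system health possible.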
Additionally, Pinterest has prioritized reproducibility in their platform through a robust infrastructure-as-code approach. The blog explains how tools like Terraform and Helm are utilized to establish EKS clusters, manage networking and security configurations, and deploy essential components such as the Spark History Server.
The engineering team also faced the challenge of accommodating different hardware architectures. They created multi-architecture images to ensure that their data workloads could efficiently run on both Intel and ARM-based instances, including AWS Graviton. This initiative is tied to broader goals related to cost savings and operational efficiency at scale. A summary of the project by InfoQ editor Eran Stiller emphasizes that Moka "offers container-level isolation, supports ARM architecture, incorporates YuniKorn scheduling, and achieves substantial cost reductions through workload consolidation and auto-scaling across various instance types." Such developments align with a growing trend among cloud users aiming to reduce infrastructure expenses without compromising performance.
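As an illustration of how a multi-architecture image might be paired with node placement, here is a small sketch that builds Spark-on-Kubernetes settings for a chosen CPU architecture. The image name is hypothetical and the helper functions are invented for this example; the sketch relies on Spark's `spark.kubernetes.container.image` and `spark.kubernetes.node.selector.[labelKey]` settings and on the standard `kubernetes.io/arch` node label. How Pinterest actually routes jobs to Graviton nodes is not described here.

```python
from typing import Dict, List

# Hypothetical image reference. A multi-architecture tag resolves to the
# matching build on each node, so one reference serves both CPU types.
SPARK_IMAGE = "registry.example.com/spark:multiarch"

# Common values of the kubernetes.io/arch node label:
# amd64 for Intel/AMD nodes, arm64 for ARM nodes such as AWS Graviton.
SUPPORTED_ARCHS = {"amd64", "arm64"}


def spark_conf_for_arch(arch: str) -> Dict[str, str]:
    """Return Spark settings that pin pods to nodes of one architecture."""
    if arch not in SUPPORTED_ARCHS:
        raise ValueError(f"unsupported architecture: {arch}")
    return {
        "spark.kubernetes.container.image": SPARK_IMAGE,
        "spark.kubernetes.node.selector.kubernetes.io/arch": arch,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten settings into spark-submit style --conf arguments."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(to_submit_args(spark_conf_for_arch("arm64")))
```

Only the node selector changes between architectures in this sketch; the image reference stays the same, which is the practical benefit of publishing a single multi-architecture tag.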
The choice of processing engines extends beyond Spark. In a separate LinkedIn post, Acharya remarks that while Spark remains the main processing workhorse, Moka's success has led Pinterest to adopt other engines on the platform: Flink Batch is already in production, Ray is following closely behind, and Flink Streaming is planned for later this year. Technical comparisons of Spark and Flink help explain these choices: Spark excels at large batch and interactive analytics workloads, while Flink is built for real-time, stateful stream processing that handles events one at a time. Moka is thus positioned as a foundation that can host different processing engines according to workload demands, rather than being limited to Spark.
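The difference between the two processing models can be sketched in plain Python. This is a conceptual illustration only and uses neither the Spark nor the Flink API; the function names and the sample events are made up for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def batch_counts(events: Iterable[str]) -> Dict[str, int]:
    """Batch model: read the complete dataset, then emit one final result."""
    return dict(Counter(events))


def streaming_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Streaming model: keep per-key state and emit an update for each event."""
    state: Dict[str, int] = defaultdict(int)
    for key in events:
        state[key] += 1
        yield key, state[key]


events = ["pin", "board", "pin"]
print(batch_counts(events))
print(list(streaming_counts(events)))
```

The batch function produces nothing until all input has been seen, while the streaming function emits a running count after every event and must carry state between events. A real stream processor also has to make that state durable and fault tolerant, which this sketch leaves out.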
External observers have drawn lessons from Pinterest's experience. The ML Engineer newsletter highlights the Moka article as a practical example of setting up EKS clusters, Fluent Bit logging, OpenTelemetry (OTEL) metrics pipelines, image management, and a custom Moka UI for Spark on Kubernetes. That places it among other leading case studies in modern data infrastructure, suggesting that Moka may serve as a reference architecture for a new generation of cloud-native data systems.
However, the Pinterest team views the migration as ongoing rather than complete. In the blog series and in additional LinkedIn posts, they reflect on their "learnings and future direction" and recount how initial proofs of concept paved the way for a gradual transition away from Hadoop as confidence in the new stack grew. Acharya underscores that "the best problems emerge at scale" and acknowledges the hurdles the team encountered as it shifted real workloads onto the new platform. For many organizations, this may be the most important takeaway: replicating the technical choices of Kubernetes, EKS, and Spark may be relatively straightforward, but the real challenge lies in moving off legacy systems and committing to sustained work on observability, automation, and multi-engine support.