Hello, Apache Hudi enthusiasts!
Welcome to the first edition of the Hudi Newsletter, your monthly digest of all things Apache Hudi. This new newsletter is brought to you by Onehouse.ai with the aim of bringing the vibrant Hudi community even closer together.
Why “Hudi Newsletter” now?
The Apache Hudi project has grown exponentially, becoming a cornerstone for engineers and architects building universal data lakehouse platforms. With this rapid growth, there is always something new that our diverse community is working on. In each edition of the Hudi Newsletter, we will reflect on everything that has happened in the Hudi world over the past month and look at some of the interesting updates on the horizon, to keep users and enthusiasts informed.
Here's what you can expect from the Hudi Newsletter:
Project Updates
Community Highlights
Ecosystem Updates
Upcoming Events
Curated Hudi Resources
Let’s get started!
Project Updates
GitHub ❤️⭐️ https://github.com/apache/hudi
A new payload, OverwriteWithGreaterRecordPayload, was added to preserve the existing record when the ordering fields of the incoming and existing records are equal.
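For context, here is a minimal sketch of opting into a custom payload on the Spark write path. The fully qualified class name below assumes the payload lives in Hudi's common model package; verify the exact path against your release.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("payload-example").getOrCreate()
df = spark.read.json("s3://my-bucket/incoming/")  # hypothetical source

(df.write.format("hudi")
   .option("hoodie.table.name", "my_table")
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombine.field", "ts")
   # Assumed package path; check your Hudi version.
   .option("hoodie.datasource.write.payload.class",
           "org.apache.hudi.common.model.OverwriteWithGreaterRecordPayload")
   .mode("append")
   .save("s3://my-bucket/hudi/my_table"))
```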
Three new configs were added to support parallel listing when running meta-sync to the AWS Glue catalog (see the sketch after this list):
hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism
hoodie.datasource.meta.sync.glue.changed_partitions_read_parallelism
hoodie.datasource.meta.sync.glue.partition_change_parallelism
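As a rough sketch, these could be set alongside the usual Glue sync options on a Spark write; the tool class below is the Glue sync implementation shipped in the hudi-aws module, and the parallelism values are purely illustrative.

```python
glue_sync_opts = {
    # Enable meta sync and route it through the AWS Glue sync tool.
    "hoodie.datasource.meta.sync.enable": "true",
    "hoodie.meta.sync.client.tool.class":
        "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool",
    # New parallel-listing knobs (illustrative values; tune to your
    # partition counts).
    "hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism": "10",
    "hoodie.datasource.meta.sync.glue.changed_partitions_read_parallelism": "10",
    "hoodie.datasource.meta.sync.glue.partition_change_parallelism": "10",
}

# Passed along with the rest of the Hudi write options:
# df.write.format("hudi").options(**glue_sync_opts)...
```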
Added a new MetricsReporter implementation, org.apache.hudi.metrics.m3.M3MetricsReporter, which emits metrics to M3. Users of M3 (https://m3db.io) can enable reporting by setting hoodie.metrics.reporter.type to M3 and providing the corresponding host address/port in hoodie.metrics.m3.host / hoodie.metrics.m3.port.
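A minimal sketch of wiring this up on the writer side; the host and port values are placeholders.

```python
m3_metrics_opts = {
    "hoodie.metrics.on": "true",           # turn on metrics reporting
    "hoodie.metrics.reporter.type": "M3",  # use the new M3 reporter
    "hoodie.metrics.m3.host": "m3.internal.example.com",  # placeholder host
    "hoodie.metrics.m3.port": "9052",                     # placeholder port
}
```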
When writing Apache Flink TIMESTAMP data, Hudi can now use the local time zone, thanks to this PR.
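For Flink SQL users, here is a minimal PyFlink sketch. The 'write.utc-timezone' option name is our assumption for the switch involved; verify it against the linked PR and your Hudi version.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical table; setting the assumed 'write.utc-timezone' option to
# 'false' would write TIMESTAMP values in the local time zone instead of UTC.
t_env.execute_sql("""
    CREATE TABLE hudi_events (
        id STRING,
        event_time TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'hudi',
        'path' = 's3://my-bucket/hudi/events',
        'write.utc-timezone' = 'false'
    )
""")
```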
Community Blogs/Socials
Blogs
Navigating the Future: The Evolutionary Journey of Upstox’s Data Platform - Manish Gaurav | Upstox
In this blog, Manish details the evolution of Upstox’s Data Platform (UDP): how the team tackled scalability and cost challenges while addressing fundamental needs such as ACID support, data lake management, data processing, and varied data access patterns, using open source technologies like Apache Hudi, Airflow, and dbt. The blog focuses on Hudi’s cleaner service, auto file sizing, and indexing capabilities, which helped them rein in storage costs, avoid small-file problems, and achieve faster updates.
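To make that concrete, here is a rough sketch of the kinds of knobs involved; the values are illustrative, not Upstox’s settings.

```python
storage_tuning_opts = {
    # Auto file sizing: route new inserts into files below the small-file
    # limit until they reach a healthy size.
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # 100 MB
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),     # 120 MB
    # Cleaner: automatically reclaim older file versions to cap storage growth.
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.commits.retained": "10",
}
```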
Cost Optimization Strategies for scalable Data Lakehouse - Halodoc Engineering

Halodoc discusses critical strategies for optimizing costs in a scalable data lakehouse architecture. The blog highlights the importance of managing S3 non-current versions, cleaning up incomplete multipart uploads, and keeping S3 API costs in check by using Apache Hudi features like partition pruning, metadata table indices, and the cleaner/archival services. Implementing these strategies led to significant savings: a 66.3% decrease in S3 API requests, a 41.5% cut in S3 costs, and a 34.2% reduction in EMR costs.
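On the Hudi side, those features map to a handful of table configs; here is a rough sketch with illustrative values (not Halodoc’s settings).

```python
cost_control_opts = {
    # The metadata table avoids expensive S3 LIST calls; column stats
    # power data skipping alongside partition pruning.
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    # Archival keeps the active timeline bounded.
    "hoodie.keep.min.commits": "20",
    "hoodie.keep.max.commits": "30",
}
```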
Enabling near real-time data analytics on the data lake - Shi Kai Ng, Shuguang Xiang | Grab Engineering
Grab’s engineering team elaborates on their approach to enabling near-real-time data analytics on their data lake. They detail the transition from a traditional data lake to the Apache Hudi table format for minimal data latency, giving analysts faster access to fresh data. The improvements have enabled fresher business metrics and quicker fraud detection, showcasing the impact on operational efficiency and response times.
HoodieStreamer - a "Swiss Army knife" for ingestion - Shiyan Xu | Onehouse
This ninth piece in a ten-part series offers an in-depth look at HoodieStreamer, a data ingestion tool closely coupled with Hudi’s lakehouse platform that lets users configure source data, define schemas, schedule table services, and keep data catalogs in sync, among other things. The blog breaks down HoodieStreamer’s technical components and is a must-read for users who want to understand the tool’s capabilities in depth.
Open Table Format: Apache Hudi (Hadoop Upserts Deletes and Incrementals) - Vivek Alex | Samta.ai
This introductory blog dives into Apache Hudi's lakehouse platform and the components it offers beyond a mere table format. Vivek notes, "While the functionality provided by a table format is just one layer in the Hudi software stack, Hudi encompasses a complete system with tools that facilitate smarter data management." The post walks through the components of the Hudi stack and illustrates how features like record-level updates and incremental processing not only cut down significantly on compute resources but also offer a more streamlined way to handle data within data lakes.
Socials
Data Engineer Interview Experience - Apache Hudi with Spark
A great set of data engineering interview questions on Apache Hudi and Spark, by Anurag.
Apache Hudi for Data Engineers
In this post, Shashank talks about some of the critical challenges (data freshness, pipeline management, data reliability) that Apache Hudi addresses with capabilities like faster updates/deletes and time travel.
Problems in a Data Lakehouse & how Hudi addresses them
What problems start to surface when you move from POC to production with a lakehouse architecture? This post touches on small-file issues, S3 object store throttling, and data co-locality, and how Hudi tackles each of them.
Mastering Incremental ETL with DeltaStreamer and SQL-Based Transformer
In this blog post, Soumil walks through incremental ETL with DeltaStreamer and the SQL-based transformer, focusing on joining Hudi tables with other Glue tables to build denormalized Hudi tables (a rough sketch of the idea follows below).
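The gist is that the transformer SQL is passed as a job property, with <SRC> standing in for the incoming batch. The property key below is the one used by Hudi's SqlQueryBasedTransformer (newer releases also accept the hoodie.streamer.* prefix); the table and column names are hypothetical.

```python
# Properties one might supply to a DeltaStreamer/HoodieStreamer run that uses
# org.apache.hudi.utilities.transform.SqlQueryBasedTransformer.
transformer_props = {
    "hoodie.deltastreamer.transformer.sql": (
        "SELECT s.*, c.customer_name "     # hypothetical columns
        "FROM <SRC> s "                    # <SRC> = the incoming micro-batch
        "LEFT JOIN glue_db.customers c "   # hypothetical Glue table
        "ON s.customer_id = c.customer_id"
    ),
}
```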
Hudi in the Ecosystem
Modern Datalakes with Hudi, MinIO, and HMS - MinIO
In this hands-on tutorial, Brenna Buuck from MinIO demonstrates how to integrate a Hive Metastore (HMS) Catalog with Apache Hudi on top of a MinIO data lake. The repo can be cloned from here.
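At a high level, the setup boils down to pointing Spark's S3A client at MinIO and syncing the table to HMS. Here is a minimal sketch with placeholder endpoints and credentials; see the tutorial for the exact steps.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-minio-hms")
         # Point the S3A filesystem at a MinIO endpoint (placeholder values).
         .config("spark.hadoop.fs.s3a.endpoint", "http://minio.local:9000")
         .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
         .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

hms_sync_opts = {
    # Sync the Hudi table into the Hive Metastore over its thrift endpoint.
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.metastore.uris": "thrift://hms.local:9083",
}
```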
Combine Transactional Integrity and Data Lake Operations with YugabyteDB and Apache Hudi - YugabyteDB
The YugabyteDB team discusses the combined benefits of pairing YugabyteDB with Hudi: transactional integrity alongside efficient data lake operations. The integration brings together Hudi’s ability to efficiently manage large-scale data in data lakes and YugabyteDB’s robust transactional capabilities.
How to use StarRocks and Apache Hudi to build a robust open data lakehouse architecture - StarRocks
StarRocks is an MPP OLAP database that supports real-time, highly concurrent data analysis, and it can also query open table formats such as Apache Hudi directly on the data lake. This detailed tutorial serves as a go-to guide for users looking to try their hand at using Hudi with StarRocks' compute (a quick sketch follows below).
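As a taste of what the tutorial covers, here is a hedged sketch of registering a Hudi catalog in StarRocks from Python over its MySQL-compatible protocol; the catalog name, metastore URI, and connection details are all placeholders.

```python
import pymysql  # StarRocks speaks the MySQL wire protocol on the FE query port

conn = pymysql.connect(host="starrocks-fe.local", port=9030, user="root")
try:
    with conn.cursor() as cur:
        # Register a Hudi external catalog backed by a Hive Metastore
        # (placeholder URI; see the tutorial for your setup).
        cur.execute("""
            CREATE EXTERNAL CATALOG hudi_catalog
            PROPERTIES (
                "type" = "hudi",
                "hive.metastore.type" = "hive",
                "hive.metastore.uris" = "thrift://hms.local:9083"
            )
        """)
finally:
    conn.close()
```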
Latest Community Sync & Hudi Live Sessions
Our Journey with Apache Hudi: Lessons Learned and the Road Ahead - Funding Circle
Why Apache Hudi + Apache Doris is an ace for data lake analytics
Boost Presto query speeds with Hudi's metadata table & clustering service
Tip of the Month
Have specific questions on Hudi? Book a 1:1 consultation with a Hudi PMC member for a dedicated discussion and tips.
Upcoming Events in April
Building an Open Lakehouse on AWS S3 with Apache Hudi & PrestoDB - Led by Hudi and Presto contributors, this workshop on 24th April (9 AM - 12 PM PST) will help attendees learn how to build an open lakehouse architecture using Hudi as the table format on an S3 data lake, with Presto as the compute engine for ad hoc, interactive analytics at scale.
Hudi Resources
Getting started 🏁
If you are just getting started with Apache Hudi, here are some quick guides to delve into the practical aspects.
Apache Spark: https://hudi.apache.org/docs/next/quick-start-guide
Apache Flink: https://hudi.apache.org/docs/next/flink-quick-start-guide
Docker Demo: https://hudi.apache.org/docs/next/docker_demo
Official docs 📗
Join Slack 🤝
Discuss issues, help others, and learn from the community. Our Slack channel is home to 4,000+ Hudi users.
Socials 📱
Join our social channels to stay in the know about everything from deep technical concepts to tips, tricks, and interesting happenings in the community.
Twitter/X: https://twitter.com/apachehudi
Weekly Office Hours 💼
Hudi PMC members and committers hold weekly office hours to answer questions interactively, on a first-come, first-served basis. This is a great opportunity to bring any questions you have.
Interested in Contributing to Hudi?👨🏻💻
The Apache Hudi community welcomes contributions from everyone! Here are a few ways you can get involved.
Rest of the Data Ecosystem
OneTable is now “Apache XTable™ (Incubating)” - Dipankar Mazumdar, JB Onofré
E-commerce Funnel Analysis with StarRocks: 87 Million Records, Apache Hudi, Apache Iceberg, Delta Lake - Albert Wong
Databases Are Commodities. Now What? - Chris Riccomini
Making Most Recent Value Queries Hundreds of Times Faster - Nga Tran, Andrew Lamb
[Paper: VLDB 2024] - POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance - Justen et al.
Have any feedback on documentation, content ideas or the project? Drop us a message!