Uber Develops HiveSync for Cross-Region Data Synchronization and Disaster Recovery
Uber has introduced HiveSync, a sharded batch replication system that synchronizes Hive and HDFS data across regions and processes millions of events daily. It improves data consistency and supports disaster recovery while reducing the cost of idle hardware. Its components include the HiveSync Replication Service and the Data Reparo Service, and a Hive Metastore Event Listener captures changes in real time. Uber plans to extend HiveSync to cloud replication as its analytics and machine learning workloads move to Google Cloud.

Uber developed HiveSync, a sharded batch replication system for synchronizing Hive and HDFS data across regions, processing millions of events daily. It enhances data consistency, supports disaster recovery, and reduces idle regional hardware costs.
Initially based on Airbnb's ReAir project, HiveSync features sharding, DAG-based orchestration, and separation of the control and data planes. This lets ETL jobs run in the primary data center while data is replicated across regions in near real time. The system comprises the HiveSync Replication Service and the Data Reparo Service; a Hive Metastore Event Listener captures changes in real time, and replication itself runs as asynchronous jobs. Uber plans to extend HiveSync to cloud replication as analytics and ML workloads migrate to Google Cloud.
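
To make the sharding and DAG-based orchestration ideas more concrete, here is a minimal Python sketch. It is not Uber's implementation: the `ChangeEvent` shape, the shard count, the step names in `REPLICATION_DAG`, and the helper functions are all hypothetical, chosen only to illustrate how captured change events might be routed to shards and how a replication job might be ordered as a dependency graph.

```python
import hashlib
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+

NUM_SHARDS = 4  # hypothetical shard count, not a HiveSync setting


@dataclass(frozen=True)
class ChangeEvent:
    """A captured metastore change (hypothetical shape for illustration)."""
    database: str
    table: str
    kind: str  # e.g. "ADD_PARTITION"


def shard_for(event: ChangeEvent, num_shards: int = NUM_SHARDS) -> int:
    """Send every event for one table to the same shard via a stable hash."""
    key = f"{event.database}.{event.table}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(events, num_shards: int = NUM_SHARDS):
    """Group events into per-shard queues, preserving arrival order."""
    queues = {shard: [] for shard in range(num_shards)}
    for event in events:
        queues[shard_for(event, num_shards)].append(event)
    return queues


# Hypothetical replication job: each step maps to the steps it depends on.
REPLICATION_DAG = {
    "copy_files": set(),
    "verify_checksums": {"copy_files"},
    "update_target_metastore": {"verify_checksums"},
}


def plan(dag):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


events = [
    ChangeEvent("rides", "trips", "ADD_PARTITION"),
    ChangeEvent("rides", "trips", "ALTER_TABLE"),
    ChangeEvent("eats", "orders", "ADD_PARTITION"),
]
queues = route(events)
steps = plan(REPLICATION_DAG)
```

Hashing on the table name keeps all changes to one table on a single shard, so they can be applied in order, while unrelated tables can be processed in parallel on other shards.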



