Data lakehouse vendor Onehouse aims to use new funding to expand both its commercial and open source efforts to enable interoperable data lake technology.
Today, the company announced a $35 million Series B funding round led by Craft Ventures, with participation from Addition and Greylock Partners. The goal of the funding is to accelerate product development and market penetration. This latest round brings the company’s total funding to $68 million. It follows an initial seed round of $8 million and a $25 million Series A announced in February 2023. Onehouse has its roots in the open source Apache Hudi technology, an open data lake table format originally developed at ride-sharing company Uber.
While Apache Hudi is a competitive alternative to the open source Apache Iceberg and Delta Lake table formats, Onehouse’s focus is on interoperability, not competition. In November 2023, Microsoft and Google joined Onehouse to back the OneTable open source data lake table format interoperability technology. The effort was then transferred to the Apache Software Foundation (ASF) and rebranded as Apache XTable.
With the new funding, Onehouse will continue to contribute to the development of XTable as well as evolve its Universal Data Lakehouse platform, which provides an interoperable platform that enables organizations to use different table formats, data catalogs, query engines, and cloud providers.
“We’re query-engine and cloud-neutral,” Onehouse CEO and founder Vinoth Chandar told VentureBeat. “Our job is to take the data, optimize it, transform it, and present it to any engine, any catalog that the user chooses.”
Apache XTable Extends Open Source Interoperable Data Lake
The proliferation of data lake table formats poses a challenge for organizations, and XTable (formerly the OneTable project) helps solve it.
Chandar noted that interoperability and usage have grown since the effort received backing from Microsoft and Google in 2023. Even at this relatively early stage, he said, XTable offers 360-degree interoperability across data lake table metadata.
Microsoft’s use of XTable, in particular, has grown significantly. At the Microsoft Build 2024 conference, Chandar said, the company revealed that Microsoft Fabric’s integration capabilities use XTable as a key component for translating Apache Iceberg writes from Snowflake into Delta Lake reads, and vice versa.
Apache XTable is also a core element of Onehouse’s commercial Universal Data Lakehouse platform. Universal Data Lakehouse is a product managed by Onehouse that aims to provide a neutral, efficient, and interoperable solution for data management. Chandar explained that data is ingested and transformed using Apache Hudi, then stored in a vendor-neutral format such as Apache Parquet that can be accessed by any query engine. This supports interoperability, as customers can query data stored in different table formats without performance degradation.
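To make the approach concrete, Apache XTable performs metadata-only translation: based on the project’s documented sync utility, a conversion job is driven by a small config file naming a source format, one or more target formats, and the tables to sync. The bucket path and table name below are illustrative, not from the article.

```yaml
# Hypothetical XTable sync config (paths and table names are made up).
sourceFormat: HUDI        # format the table was originally written in
targetFormats:
  - DELTA
  - ICEBERG               # expose the same table to Delta and Iceberg readers
datasets:
  - tableBasePath: s3://example-bucket/lakehouse/trips
    tableName: trips
```

Running XTable’s sync utility against a config like this rewrites only the table metadata for each target format; the underlying Parquet data files are shared by all formats, which is what lets any query engine read the same data without copies.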
Next-generation Apache Hudi brings vector support to data lakes
While interoperability between different data lake table formats is important to Onehouse, the evolution of the Apache Hudi technology that underpins its proprietary platform is equally important.
Work is currently underway on a new Apache Hudi 1.0 release that will introduce a new concurrency model and support both unstructured and structured data. Chandar said that the upcoming beta release of Apache Hudi 1.0 will include a new secondary index system that will enable indexing of non-primary keys and filtering queries using those indexes.
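As a sketch of what that secondary index system could look like from a query engine: Hudi already exposes SQL DDL through Spark, so indexing a non-primary-key column and filtering on it might resemble the following. The exact 1.0 syntax, index type name, and table schema here are assumptions, not confirmed by the article.

```sql
-- Hypothetical sketch against Hudi 1.0's proposed secondary indexes;
-- table, column, and index names are illustrative.
CREATE INDEX idx_city ON trips USING secondary_index (city);

-- A filter on the indexed non-primary-key column could then be served
-- by the index rather than a full scan of the table's data files:
SELECT trip_id, fare FROM trips WHERE city = 'SF';
```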
“What’s even more interesting is that we are working on adding support for vector search indexes to our extensible indexing subsystem, which will enable vector and text searches over data in the data lake,” Chandar said. “The goal is to bring Hudi closer to the database layer by improving indexing and query planning, and providing a database-like experience on top of the data lake.”
Chandar said he expects Apache Hudi 1.0 to be generally available within the next few months.
