Onehouse wins $8 million in funding for Apache Hudi managed service • The Register

Tiny California startup Onehouse has raised $8 million in seed funding, with which it hopes to build a business capable of taking on the data engineering giants.

In a move paralleling that of Snowflake, the multibillion-dollar cloud data warehouse vendor, the company, which was founded out of an Uber engineering project, offers data lake technology as a managed service in the cloud, which it claims is a first. The goal is to make data lake projects faster, cheaper, and easier than before.

Analysts agree that the minnow has a good chance, but it will face stiff competition and the challenge of cementing the concept of the "lakehouse" in the minds of users.

The Onehouse service is based on Apache Hudi, which stands for Hadoop Upserts Deletes and Incrementals. It is an open source framework developed at Uber in 2016 to manage the storage of large datasets on distributed file systems. Onehouse founder and CEO Vinoth Chandar worked on this project.

Talking to The Register, Chandar explained: "In 2015 and 2016, Uber was growing really fast and launching in new cities almost every day. We were running a data warehouse [based on HPE Vertica, now part of Micro Focus], but we were starting to reach the limits of how much data we could store and serve cost-effectively in the warehouse.

"We also had a lot of data-driven or machine-learning-driven products: we wanted to predict ETAs, we wanted to flag rides for safety, and new use cases were coming along all the time."

The team was familiar with data lake technologies for storing and managing data for machine learning, but it also wanted to work with transactional data.

"We wanted to be able to replicate all the transactional data, which is actually very structured, and bring it into the data lake very quickly. And when we looked at the warehouse, it had scaling issues, but it had all the transactional capabilities you needed: you could take trip records from an upstream database and simply replicate the changes into the data warehouse. There was nothing like this to do that on data lakes," Chandar said.

Hudi (pronounced hoodie) is now used by major corporations and internet companies including Walmart, GE Aviation and HSBC. The goal of the open-source project is to be able to replicate large amounts of transactional data or event streams into a data lake without hassle, Chandar said.
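Hudi's central primitive is the upsert: incoming change records are merged into the table by record key, updating existing rows and inserting new ones, rather than being blindly appended. The following toy Python sketch illustrates that merge semantic only; it is not Hudi's actual API (which is typically driven through Spark or Flink writers), and the `trip_id` key and record shapes are invented for illustration.

```python
# Toy illustration of upsert semantics on a keyed record store.
# Hudi performs an analogous merge at table/file level: a batch of
# change records is merged into existing data by record key.

def upsert(table: dict, changes: list, key: str = "trip_id") -> dict:
    """Merge change records into `table`, keyed by `key`.

    A record whose key already exists replaces the old version
    (an update); an unseen key becomes a new row (an insert).
    """
    for record in changes:
        table[record[key]] = record
    return table

trips = {}
upsert(trips, [{"trip_id": 1, "status": "started"},
               {"trip_id": 2, "status": "started"}])
upsert(trips, [{"trip_id": 1, "status": "completed"},  # update of trip 1
               {"trip_id": 3, "status": "started"}])   # insert of trip 3

assert trips[1]["status"] == "completed"  # change replicated, not duplicated
assert len(trips) == 3                    # three distinct trips, not four rows
```

This is exactly the operation that is trivial in a transactional warehouse but historically awkward on append-oriented data lake storage, which is the gap Hudi was built to close.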

"We're trying to do data warehousing and data science on the same system. By bringing in transactions, we just made traditional business intelligence and warehousing workloads work much better and more cost-effectively, even on the data lake."

The approach allows users to store data in open cloud formats and choose their own query engines. Onehouse offers Hudi-as-a-service, dramatically reducing the engineering time needed to get projects up and running, Chandar said. "We regularly see that it still takes companies six months to a year to hire a team of engineers, train them in all these new technologies, build a lot of these pipelines, and bring the lake to life. That's too long, right? We want companies to be able to do that now."

Mainstays of the data warehousing space may point out that they already support machine learning on systems built for transactional data. Teradata, Oracle, and Snowflake all have their own stories to tell here.

But Chandar argued that it is much more efficient, in terms of the underlying storage, to route transactional data to the data lake. While warehousing companies support programming APIs and some of the data frameworks used in data science, the requirements of more ambitious projects go far beyond that.

"It might be good for a segment of users, but Apache Spark or Flink access cloud storage directly," he said. "They don't need intermediate servers, and the architecture itself is much more scalable. If you look at warehouses, you typically deploy dozens of servers. If you look at a deep learning data science pipeline, you'll see that they routinely run hundreds of such nodes. There's an intrinsic cost/performance/scale barrier here that prevents people from running these more serious data science workloads on data warehouses."

Meanwhile, query patterns are also fundamentally different. Optimizing a data warehouse is all about reducing the amount of data the system examines, with 30 years of work from companies such as Oracle and Teradata on the problem. Data science is more about looking for patterns across all the data, he argued.
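The contrast Chandar draws can be sketched with a toy example: a warehouse-style query uses its predicate to prune away data before scanning, while a data-science-style job reads every record, for example to train a model. The partitioned layout and function names below are invented purely for illustration.

```python
# Toy contrast: warehouse-style pruned query vs data-science full scan.
# Hypothetical layout: trip records partitioned by date.

partitions = {
    "2022-01-01": [{"fare": 10}, {"fare": 12}],
    "2022-01-02": [{"fare": 8}],
    "2022-01-03": [{"fare": 15}, {"fare": 9}],
}

def pruned_total(partitions: dict, date: str) -> int:
    """Warehouse-style: the date predicate selects one partition,
    so only that partition's rows are ever read."""
    return sum(r["fare"] for r in partitions.get(date, []))

def full_scan(partitions: dict) -> list:
    """Data-science-style: every record is read, as when building
    a training set, so pruning buys nothing."""
    return [r["fare"] for rows in partitions.values() for r in rows]

assert pruned_total(partitions, "2022-01-02") == 8  # touched 1 of 5 rows
assert len(full_scan(partitions)) == 5              # touched all 5 rows
```

Decades of warehouse optimization target workloads like `pruned_total`; pipelines shaped like `full_scan` see little benefit from that machinery, which is the core of the argument.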

The lakehouse concept is not new, however. The idea of bringing data warehousing and BI-like problems to data lakes has been promoted for over two years by Databricks, which was first built around Apache Spark. It now supports SQL, the lingua franca of BI, to boot.

Chandar said: "I think it's similar, but there are a few key differences. First, we're not trying to make a paid version of Hudi for businesses. The problem we're trying to solve is to introduce a new model of managed data lakes; we're trying to introduce this category of product. Second, we want to keep Hudi more open: not just the format, but all of the services in it."

Analyst reaction to Onehouse’s approach has been largely positive, with some caveats.

Andy Thurai, vice president and principal analyst at Constellation Research, said the company has a chance of replicating Snowflake's success in building a managed services business, if Apache Hudi gains momentum. "There is precedent, with many unicorns establishing this model, from Red Hat to Snowflake to Confluent. However, if adoption picks up, so will the competition. Nothing prevents other companies from trying to offer an Apache Hudi managed service, especially the big boys like AWS and Google."

Hyoun Park, CEO and chief analyst at Amalgam Insights, said there was some truth to the claim that the lakehouse could handle storage more efficiently than a warehouse, but that it would depend on what the user was trying to do.

"Data warehouses typically hold processed data that has been cleaned, while lakehouses consist of larger, less refined pools of data. From a practical perspective, data lakes are more beneficial because they can more easily access data from the original source without as much effort spent cleaning, formatting, and prioritizing data.

"The flip side is that the warehouse will have better-performing data and will likely be smaller. Whether one is ultimately more efficient than the other from a storage perspective depends on the type of data involved, data retention, the nature of machine learning efforts, and other data management, analytics, and model considerations that could tip pure TCO one way or the other."

Park added that there were parallels with Snowflake’s approach to creating a new category of managed services, but Onehouse would face greater hurdles.

"Snowflake has benefited from both support for an existing business paradigm and exceptional sales execution. Onehouse faces a slightly tougher challenge in that the data lake is not as prevalent as the data warehouse was in the enterprise, so there is still some general education needed and some agreement on what a 'standard' data lakehouse should look like.

“However, Onehouse benefits from the fact that enterprises generally see the value in having a data lakehouse for a long-term data repository, but generally lack the expertise to set up the lakehouse. By providing a managed service, Onehouse will quickly be able to tap into analytics and data resellers who can help businesses with a data lake approach.

"Being first to market with a managed data lakehouse approach and with deep Apache Hudi experience, Onehouse has an offering that could quickly reach billion-dollar unicorn status, both by helping growing businesses take on large data lake workloads and by taking over as Hudi project deployments become unmanageable."

Onehouse is a small tech company with $8 million in new funding and 15 employees. But with Databricks' IPO expected this quarter, that event could help cement the lakehouse concept on which Onehouse hopes to build its managed services business. ®