Kubernetes may be known as a key component for managing thousands of small container pods, but some see an even broader future in enterprise big data for the open source orchestrator.

Sean Suchter, CTO at Pepperdata, thinks Kubernetes is an ideal candidate for shepherding Apache’s big data processing platform Spark. This could, in turn, allow enterprises to more seamlessly integrate big data processes into their operations.

Spark is an open source cluster-computing framework designed to deal with large-scale data processing. It interfaces with Hadoop as a big data processing engine. Suchter worked with Hadoop when he was at Yahoo. He was part of the team that developed Hadoop to handle the company’s efforts around search. He took those efforts to Microsoft before helping to found Pepperdata.

Suchter explained that Hadoop was created in a way that separated it from traditional IT departments. “[Hadoop] was not really developed with traditional IT in mind,” he said.

Bringing Kubernetes and Spark together is a way to create a bridge between big data and traditional IT.

Organizations typically run Spark on clusters of thousands of nodes. This is similar to how container nodes are used, and thus Suchter thinks Kubernetes is an ideal candidate to provide orchestration for Spark.

Suchter explained that with organizations becoming more familiar with Kubernetes-managed container deployments, it should be an easy move to add big data management into the mix.

“When Kubernetes first came onto the scene, we looked at it and noted that while it was really designed for services and core IT, it was also possible to write big data to run natively on it,” Suchter said. This provides an advantage because Kubernetes can process data that comes out of big data clusters and also bridge the gap between big data and traditional IT.

“With Kubernetes, these big data projects can be viewed by enterprises as just another application that goes on a cluster,” Suchter said. “It’s just one more thing to add and is no big deal.”

Strong Interest

Suchter and Pepperdata are not the only ones to see the connection. Suchter said when Pepperdata began working on a native implementation of Spark on Kubernetes, it found a community of companies also looking at the move.

“It was really neat to see that others had similar thoughts around this,” Suchter said. “It showed we were not the only ones seeing a possible connection.”

The Kubernetes community recently released version 1.8 of the orchestrator, which included workload support for running Apache Spark natively on Kubernetes as a way to process big data sets.

Suchter noted that while Kubernetes is evolving, its ability to support Spark has been within the system since at least version 1.4.

“We didn’t need a new Kubernetes version to make it work,” Suchter said. “We did find some deployments that would work better if there were some Kubernetes improvements, so it’s cool to see that the community has made it easier to work with.”

Despite the interest, Suchter said one problem so far is that no organization has tried to run Spark on Kubernetes in a production environment. He said all of Pepperdata’s testing has shown it’s possible, but that he can’t say for sure until it’s actually done.

“I think the technology is there,” Suchter said. “But we can’t prove it’s there because there are not a lot of companies doing it.”

Suchter said the next milestone is to get the combination into the upstream Spark community. He noted that the Spark project management committee has said it is committed to the move.

“There is a real big ecosystem around Kubernetes and around big data, so there is a lot of market opportunity,” Suchter said. “I am not sure how many zeros we are talking about, but I think it’s a few.”

Pepperdata earlier this year launched its Application Profiler – a platform that developers can use to understand how to optimize the capacity of applications running within their networks. The product is essentially a big data production platform that is accompanied by a cluster analyzer, a capacity optimizer, and a policy enforcer.

More recently, the company scored a deal with Hewlett Packard Enterprise (HPE) to supply its product suite to automatically optimize HPE’s Hadoop clusters and improve monitoring, tuning, and troubleshooting.

<<< This article was originally published on SDxCentral’s website here. >>>