Introducing One Ring — an open-source pipeline for all your Spark applications

If you use Apache Spark, you probably have a few applications that consume data from external sources and produce intermediate results, which are in turn consumed by applications further down the processing chain, and so on until you get the final result.

We suspect this because we have a similar pipeline, with lots of processes like this one:

Each rectangle is a Spark application with its own set of execution parameters, and each arrow is an equally parametrized dataset (externally stored ones are highlighted in color; note the number of intermediate ones). This example is not the most complex of our Processes; it's a fairly simple one. And we don't assemble such workflows manually; we generate them from Process Templates (outlined as groups on this flowchart).

So here comes One Ring, a Spark pipelining framework with very robust configuration abilities, which makes it easier to compose and execute even the most complex Process as a single large Spark job.

And we have just made it open source. Perhaps you're interested in the details.

Let us speak about Spark applications. This document explains how to use them in parametrized chains.

Each Application, in terms of One Ring, is an Operation: an entity with its own configuration space and input/output data streams, described with a minimal set of metadata. That abstraction allows a developer to flexibly compose complex Process pipelines from a set of standard subroutines without writing more glue code, with just a simple configuration in a DSL, or even to generate very complex ones from shorter Template fragments.

Let us start with a description of how to build and extend One Ring, then proceed to the configuration and composition stuff.

One Ring to unify them all

To build One Ring you need Apache Maven, version 3.5 or higher.

Make sure you've cloned this repo with all submodules:

git clone --recursive https://github.com/PastorGL/OneRing.git

One Ring CLI

If you're planning to compute on an EMR cluster, just cd to the OneRing directory and execute Maven in the default profile:

mvn clean package

The ./TaskWrapper/target/one-ring-cli.jar is a fat executable JAR targeted at the Spark environment provided by EMR version 5.23.0.

If you're planning to build a local artifact with full Spark built in, execute the build in the 'local' profile:

mvn clean package -Plocal

Make sure the resulting JAR is around 100 MB in size.

Skipping tests in the build process isn't recommended, but if a Spark instance is already running locally on your build machine, it will interfere with the tests, so you could add -DskipTests to the Maven command line.
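
For example, a local build without tests would be:

mvn clean package -Plocal -DskipTests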

After you've built your CLI artifact, look into './RESTWrapper/docs' for the automatically generated documentation of available Packages and Operations (in Markdown format).

One Ring Dist

The ./DistWrapper/target/one-ring-dist.jar is a fat executable JAR that generates a Hadoop distcp or EMR s3-dist-cp script to copy the source data from external storage (namely, S3) to the cluster's internal HDFS, and the computation's result back.
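
The generated script boils down to commands along these lines (the bucket and paths here are illustrative):

s3-dist-cp --src s3://source-bucket/input/data/ --dest hdfs:///input/data/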

It is an optional component, documented a bit later.

One Ring REST

The ./RESTWrapper/target/one-ring-rest.jar is a fat executable JAR that serves a REST-ish back end for the not-yet-implemented but much-wanted Visual Template Editor. It also serves the docs via a dedicated endpoint.

One Ring is designed with extensibility in mind.

Extend Operations

To extend One Ring's built-in set of Operations, you have to implement an Operation class according to a set of conventions described in this doc.

First off, you should create a Maven module. Place it at the same level as the root pom.xml, and include your module in the root project's <modules> section. You can freely choose the group and artifact IDs you like.

To make One Ring know your module, include its artifact reference in TaskWrapper's pom.xml <dependencies>. To make your module know One Ring, include a reference to the artifact 'ash.nazg:Commons' in your module's <dependencies> (and its test-jar scope too). For an example, look into Math's pom.xml.
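
Your module's dependency section might then look like this (a minimal sketch; the version property is hypothetical and should match the root project):

<!-- sketch only: the version property here is hypothetical -->
<dependencies>
    <dependency>
        <groupId>ash.nazg</groupId>
        <artifactId>Commons</artifactId>
        <version>${one-ring.version}</version>
    </dependency>
    <dependency>
        <groupId>ash.nazg</groupId>
        <artifactId>Commons</artifactId>
        <version>${one-ring.version}</version>
        <type>test-jar</type>
        <scope>test</scope>
    </dependency>
</dependencies>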

Now you can proceed to create an Operation package and describe it.

By convention, the Operation package must be named your.package.name.operations and have a package-info.java annotated with @ash.nazg.config.tdl.Description. That annotation is required by One Ring to recognize the contents of your.package.name. There's an example.
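
A minimal package-info.java would look like this (the package name and description text are, of course, placeholders):

// package-info.java: a minimal sketch; package name and description are placeholders
@Description("Operations for custom data processing")
package your.package.name.operations;

import ash.nazg.config.tdl.Description;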

Place all your Operations there.

If your module contains a number of Operations that share the same Parameter Definitions, the names of those parameters must be placed in a public final class your.package.name.config.ConfigurationParameters as public static final String constants, each with the same @Description annotation (see the sketch after the following list). An Operation can define its own parameters inside its class following the same convention.

Parameter Definitions and their default value constants have names that depend on their purpose:

References to columns of input DataStreams must end with the _COLUMN suffix, and references to column lists with _COLUMNS.
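
For instance, a shared parameters class could look like this (a sketch: the constant names and String values here are purely illustrative, not part of One Ring itself):

// a sketch of your.package.name.config.ConfigurationParameters;
// constant names and values are illustrative only
public final class ConfigurationParameters {
    @Description("Column with the user ID")
    public static final String DS_USERID_COLUMN = "userid.column";

    @Description("Columns to copy to the output unchanged")
    public static final String DS_OUTPUT_COLUMNS = "output.columns";
}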

An Operation in essence is

public abstract class Operation implements Serializable {
    public abstract Map<String, JavaRDDLike> getResult(Map<String, JavaRDDLike> input) throws Exception;
}

...but enriched with surplus metadata that allows One Ring to configure it flexibly, and to ensure the correctness and consistency of all Operation configurations in the Process chain.

It is up to you, the author, to provide all that metadata. If any required metadata is lacking, the build will fail early, courtesy of One Ring Guardian, so an incomplete class won't fail your extended copy of One Ring CLI at Process execution time.

At least, you must implement the following methods:

You must also override public void setConfig(OperationConfig config) throws InvalidConfigValueException, call its super() at the beginning, and then read all parameters from the configuration space into your Operation class' fields. If any of the parameters has an invalid value, you're obliged to throw an InvalidConfigValueException with a descriptive message about the configuration mistake.
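
In outline, such an override could look like this (a sketch under assumptions: getParameter() stands in for the real accessor on the configuration space, and the parameter itself is hypothetical; consult existing Operations for the actual calls):

@Override
public void setConfig(OperationConfig config) throws InvalidConfigValueException {
    super.setConfig(config);

    // getParameter() is a stand-in for the real configuration accessor
    userIdColumn = getParameter(ConfigurationParameters.DS_USERID_COLUMN);

    if (userIdColumn == null) {
        throw new InvalidConfigValueException("The user ID column must be set");
    }
}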

You absolutely should create a test case for your Operation. See existing tests for a reference.

There are plenty of examples to learn from; just look into the source code of Operation's descendants. For your convenience, here's a list of the most notable ones:

Extend Storage Adapters

To extend One Ring with a custom Storage Adapter, you have to implement a pair of InputAdapter and OutputAdapter interfaces. They're fairly straightforward; just see the existing Adapter sources for reference.

For example, you could create an Adapter for Spark 'parquet' files instead of CSV, if your source data is stored that way.
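
For a sense of what such an Adapter would wrap, reading Parquet into an RDD with Spark's Java API takes just a couple of lines (a generic Spark snippet, not One Ring's actual Adapter interface; the path is illustrative):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// generic Spark code: read Parquet into a Dataset, then expose it as a JavaRDD
SparkSession spark = SparkSession.builder().getOrCreate();
JavaRDD<Row> rows = spark.read().parquet("hdfs:///source/data.parquet").javaRDD();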

A single restriction exists: you can't set your Adapter as the fallback one, as that role is reserved for One Ring's Hadoop Adapter.

Hopefully this bit of information is enough to get you started.

One Ring to bind them

Now entering the configuration space.

There is a domain-specific language named TDL3, which stands for One Ring Task Definition Language. (There are also DSLs named TDL1 and TDL2, but they're fairly subtle.)

For the language's object model, see TaskDefinitionLanguage.java. Note that the main form intended for human use isn't JSON but a simple non-sectioned .ini (or Java .properties) file. We refer to this file as tasks.ini, or just the config.
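
Just to give a taste of the format, a tasks.ini fragment might look roughly like this (a purely hypothetical sketch; the actual key names and layer prefixes are defined by TDL3 and differ from these):

# hypothetical sketch: plain non-sectioned key=value lines, one per setting
task.operations=filter_signals,aggregate_daily
ds.input.path.signals=s3://source-bucket/signals/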