Presto is a great query engine for a variety of SQL workloads. We’ve been offering Presto-as-a-Service for many months now, and one question comes up frequently:
“How can I plug in custom user-defined functions in Presto?”
In this blog post, we will answer this very question. We’ve created a Presto UDF Project on GitHub that simplifies this process considerably. Here’s a layout of the code for ease of exposition.
├── aggregation
│   ├── ArrayAggregation.java
│   └── state
│       ├── ArrayAggregationStateFactory.java
│       ├── ArrayAggregationState.java
│       └── ArrayAggregationStateSerializer.java
├── scalar
│   ├── ExtendedMathFunctions.java
│   └── NumberSystemFunctions.java
When the Presto server launches, it asks the PluginManager to find and load plugins under the $PRESTO_HOME/plugin/ directory. Plugins can contain ConnectorFactory, FunctionFactory, and Types, among other things. For this blog post, we’re interested in FunctionFactory. In this setup, UdfPlugin supplies a FunctionFactory that exposes all the UDFs in the project.
User-defined scalar functions go under the scalar subdirectory and user-defined aggregates go under the aggregation subdirectory. Build a jar file with mvn package and place it under the $PRESTO_HOME/plugin/udfs/ directory. Restart the Presto coordinator and the UDFs should be available for use. In a cluster setup, these steps must be repeated on all worker nodes. To verify the installation, run a query that uses the radians UDF, which is part of the project:
presto:default> select radians(180);
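To make the radians example concrete, here is a minimal sketch of what a scalar UDF boils down to: a static Java method tagged with an annotation that carries its SQL name. The ScalarFunction annotation below is a local stand-in declared only so the sketch compiles without Presto on the classpath, and MathFunctionsSketch and ScalarFunctionLookup are hypothetical names. The project's real classes use Presto's own annotations and may implement radians differently.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for Presto's scalar function annotation, declared
// locally so that this sketch compiles without Presto on the classpath.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ScalarFunction {
    String value(); // the name the function is called by in SQL
}

// Hypothetical example class, not the project's actual source.
final class MathFunctionsSketch {
    private MathFunctionsSketch() {}

    // Assumed implementation: converts degrees to radians.
    @ScalarFunction("radians")
    public static double radians(double degrees) {
        return Math.toRadians(degrees);
    }
}

final class ScalarFunctionLookup {
    private ScalarFunctionLookup() {}

    /** Maps SQL function name to method for every annotated static method of the class. */
    static Map<String, Method> scalarFunctions(Class<?> clazz) {
        Map<String, Method> functions = new TreeMap<>();
        for (Method method : clazz.getDeclaredMethods()) {
            ScalarFunction annotation = method.getAnnotation(ScalarFunction.class);
            if (annotation != null && Modifier.isStatic(method.getModifiers())) {
                functions.put(annotation.value(), method);
            }
        }
        return functions;
    }

    public static void main(String[] args) throws Exception {
        // Look the function up by its SQL name and call it, roughly as an engine would.
        Method radians = scalarFunctions(MathFunctionsSketch.class).get("radians");
        System.out.println(radians.invoke(null, 180.0));
    }
}
```

The annotation is what lets a function be found by name at load time, so adding a new UDF to a class like this does not require registering it anywhere else.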
Now, a little bit of explanation of the internals. The UdfPlugin class provides the UdfFactory to PrestoServer. UdfFactory peeks into its own jar and iterates over all classes to find UDFs. It looks for scalar functions in classes in the com.facebook.presto.udfs.scalar namespace, and finds UDAFs in a similar way. An alternate (and simpler) implementation of UdfFactory could iterate over a static list of classes. But that’s no fun now, is it? 🙂
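The jar-scanning step described above can be sketched as follows. This is not the project's UdfFactory code, only an illustration under stated assumptions: JarClassScanner and classNamesUnder are hypothetical names, and nested classes (entries containing `$`) are skipped for simplicity. It uses only the standard java.util.jar API to list the classes that live under a given namespace.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not part of the project described in this post.
final class JarClassScanner {
    private JarClassScanner() {}

    /**
     * Returns the sorted, fully qualified names of the top-level classes in the
     * jar whose package is packagePrefix or one of its subpackages.
     */
    static List<String> classNamesUnder(Path jar, String packagePrefix) throws IOException {
        // Jar entries use '/' separators, e.g. com/facebook/presto/udfs/scalar/Foo.class
        String dirPrefix = packagePrefix.replace('.', '/') + "/";
        List<String> names = new ArrayList<>();
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            for (JarEntry entry : Collections.list(jarFile.entries())) {
                String name = entry.getName();
                boolean isTopLevelClass = name.endsWith(".class") && !name.contains("$");
                if (name.startsWith(dirPrefix) && isTopLevelClass) {
                    String withoutSuffix = name.substring(0, name.length() - ".class".length());
                    names.add(withoutSuffix.replace('/', '.'));
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: JarClassScanner <path-to-jar> <package-prefix>");
            return;
        }
        for (String className : classNamesUnder(Paths.get(args[0]), args[1])) {
            System.out.println(className);
        }
    }
}
```

A factory could then load each returned name with Class.forName and inspect it for function annotations. A class can usually locate the jar it was loaded from via getProtectionDomain().getCodeSource().getLocation(), which is one way to "peek into its own jar", though whether the project does exactly this is an assumption.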
In the Qubole world, Presto clusters are ephemeral. They are brought up when required, auto-scale, and shut down when not in use. Therefore, you’ll need to install the UDF jars every time the cluster is launched. The Node bootstrap functionality allows you to run arbitrary commands when the Presto cluster is launched. Your script can download the jars from an accessible location (e.g. an S3 bucket) and copy them to the /usr/lib/presto/plugin/udfs/ directory (be sure to create this directory first). Your script can then restart the Presto server on the node using this command:
/usr/lib/presto/bin/presto server restart
For details on how to write Presto UDFs, you can take a look at this documentation and refer to a number of examples in the codebase.
We hope you find this little project useful and we welcome ideas and pull requests! Please send us a note at [email protected] if you’d like to talk to us about it.