A few months back I bought the excellent Type Astronaut’s Guide to Shapeless which is, as its name implies, a comprehensive guide to Shapeless: a Scala library making it easier to do generic programming.
A big part of the book is a practical tutorial on building a JSON encoder. As I read through this example, I was eager to find a similar usecase where I could apply all this Shapeless knowledge.
By the end of the book, I came up with the idea of deriving instances of Spark’s StrucType.
StructType is the way Spark internally represents DataFrame schemas:
The problem is interacting with them.
When reading a DataFrame/Dataset from an external data source the schema, unless specified, has to be inferred.
This is especially the case when reading:
As shown in the code linked above, inferring the schema for your DataFrame translates into looking at every record of all the files we need to read and coming up with a schema that can satisfy every one of these rows (basically the disjunction of all the possible schemas).
As the benchmarks at the end of this post will show, this is a time consuming task which can be avoided by specifying the schema in advance:
However, writing schemas again and again in order to avoid having Spark infer them is a lot of tedious boilerplate, especially when dealing with complex domain-specific data models.
That’s where struct-type-encoder comes into play.
struct-type-encoder provides a boilerplate-free way of specifying a
StructType when reading a
In this section, I’ll detail the few lines of code behind the project.
The first step is to define a
DataTypeEncoder type class which will be responsible of the mapping
between Scala types and Spark DataTypes.
We can now define a couple of primitive instances, e.g. for
and a couple of combinator instances:
Credit to circe for the collection encoder.
The second step is to derive
StructTypes from case classes leveraging Shapeless’
We can now define our instances for
And that’s it! We can now map case classes to
I wanted to measure how much time was saved specifying the schema using struct-type-encoder compared to letting Spark infer it.
In order to do so, I wrote a couple of benchmarks using JMH: one for JSON and one for CSV. Each of these
benchmarks writes a thousand file containing a hundred lines each and measure the time spent reading
The following table sums up the results I gathered:
|CSV||5.936 ± 0.035 s||6.494 ± 0.209 s|
|JSON||5.092 ± 0.048 s||6.019 ± 0.049 s|
As we can see, bypassing the schema inference done by Spark using struct-type-encoder takes 16.7% less time when reading JSON and 8.98% less when reading CSV.
The code for the benchmarks can be found here.
They can be run using the dedicated sbt plugin sbt-jmh with the
jmh:run .*Benchmark command.
If you’d like to use struct-type-encoder it’s available on maven-central as
"com.github.benfradet" %% "struct-type-encoder" % "0.1.0".
The repo can be found on github: https://github.com/BenFradet/struct-type-encoder/.
Before reading Dave Gurnell’s book, I had heard about Shapeless but never really felt the need to dig deeper. This small project really made me understand the value that Shapeless brings to the table.
About half-way through the project, I discovered frameless, a library which aims to bring back type safety to Spark SQL. It turns out that they use a similar mechanism in TypedEncoder. The goal now is to expose the functionality proposed by struct-type-encoder in frameless.
Additionally, for 0.2.0, I want to start experimenting with coproducts and see what’s possible on Spark’s side of things.