Domain-specific languages (DSLs) are everywhere; you just might not recognize them. Regex is a DSL, as are SQL and HTML. C# has LINQ for queries. Even the HTML-like templates in React are written in a DSL called JSX.
DSLs are excellent at what they do - picking the ideal language is like picking the right tool from the toolbox; not every problem needs the Python hammer. That said, there is little support for DSLs within existing languages, and we often end up throwing them into string literals hoping for the best at runtime.
Instead, I'd like to introduce a new model for implementing embedded DSLs called syntax macros, which allow extending the syntax of the language to support the DSL. This results in syntax like the following, where the DSL actually looks like its own language:
db.execute(#sql { SELECT * FROM users WHERE name = ${name} });
This article is a more approachable summary of my honors thesis that highlights key ideas and elaborates on points that presumed a background in programming languages. The full thesis is available through the UF Library Catalog, and I believe it's pretty straightforward as far as academic papers go.
DSLs are often embedded within larger applications written in another language, called the host language. The most common method for embedding DSLs is to simply use strings, such as the following SQL query for looking up a user by their name in a database.
db.execute(
    "SELECT * FROM users " +
    "WHERE name = '" + name + "'"
);
However, by using string concatenation for variables this query becomes vulnerable to SQL injection - if name contained malicious SQL provided by the user, it would be executed as part of the query by the database. A common solution to this is prepared statements, which use a template and arguments to set values separate from the query itself.
stmt = db.prepare(
    "SELECT * FROM users " +
    "WHERE name = ?"
);
stmt.set(0, name);
stmt.execute();
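To make the difference concrete, here is a minimal JavaScript sketch that needs no database. The input value and the `{ text, params }` shape are assumptions for illustration only, not the API of any particular driver.

```javascript
// Hypothetical user input containing malicious SQL.
const maliciousName = "' OR '1'='1";

// String concatenation: the input becomes part of the query text itself.
const concatenated =
  "SELECT * FROM users " +
  "WHERE name = '" + maliciousName + "'";
console.log(concatenated);
// SELECT * FROM users WHERE name = '' OR '1'='1'

// Prepared statement (assumed shape): the query text is fixed and the
// input travels separately, so it can never change the query's structure.
const prepared = {
  text: "SELECT * FROM users WHERE name = ?",
  params: [maliciousName],
};
console.log(prepared.text);
```

In the concatenated version the WHERE clause now matches every row, while the prepared version keeps the same query text no matter what the input contains.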
Though this does resolve the SQL injection issue, there are still some major limitations to this approach:
- The query is still a string, and is thus restricted by the syntax of the host language (for example, using quotes in the DSL would require escaping).
- The separation of the query, arguments, and execution hurts readability, especially in more complex examples.
- Static analysis becomes harder, especially with ensuring the type-safety of arguments.
Ideally, we would be able to write the embedded SQL in the same way we would SQL itself, plus support for static analysis and type-safe arguments. Effectively, an embedded DSL should be just that - an embedded language, with all the benefits that come with it.
The most common approach for DSLs is to use strings, as with the SQL above and with regex. Libraries are another approach, like database ORMs or Kotlin's HTML builder.
However, one of the most promising solutions that is becoming popular is custom format strings, as seen in JavaScript's tagged templates and Scala's interpolators. These transform a string with interpolated values into a function call that can process the string as the DSL and handle the values appropriately. An example of this in JavaScript is shown below using our previous SQL query.
db.execute(sql` SELECT * FROM users WHERE name = ${name} `);
This method is extremely versatile and is currently the best option for DSLs (after all, that's what these features were designed for). Even so, there are limitations: the syntax is still restricted (the DSL can't contain the string's closing quote without escaping it), and static analysis isn't possible without external tooling (which might need to be created).
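As a rough illustration of how a tagged template can process the DSL, here is a minimal sketch of an `sql` tag function. The function and the `{ text, params }` result shape are hypothetical; real libraries differ in what they return.

```javascript
// A hypothetical `sql` tag. JavaScript calls it with the literal pieces of
// the template and the interpolated values as separate arguments, so the
// values never get spliced into the query text.
function sql(strings, ...values) {
  return {
    text: strings.join("?"), // one placeholder per interpolated value
    params: values,
  };
}

const name = "O'Brien";
const query = sql`SELECT * FROM users WHERE name = ${name}`;

console.log(query.text);
// SELECT * FROM users WHERE name = ?
console.log(query.params.length);
// 1
```

Note that the quote inside the value causes no trouble here, but a backtick inside the query itself would still need escaping, which is the syntax restriction mentioned above.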
A solution that meets our goals has two key requirements.
- The syntax of the DSL should not be unnecessarily restricted by the host language, such as with string literals. This means that DSLs have to be handled at the level of the parser.
- The DSL should integrate with the host language for accessing values, which is commonly used for templating or passing arguments. This works similarly to string interpolation in existing languages.
To allow a DSL to have unrestricted syntax, we must be able to change the grammar used during the parsing process. This means that the parser must use lazy lexing and be cautious of how much lookahead it uses to avoid accidentally lexing the DSL using the grammar of the host language.
That's a pretty term-heavy explanation, so an example here is easiest. Consider two languages with different (and incompatible) styles for variable names:
snake_case = 0; //only snake_case allowed
#dsl {
    kebab-case = 1; //only kebab-case allowed
}
snake_case = 2; //only snake_case allowed (again)
If the parser isn't careful, it could end up lexing the kebab-case variable using the grammar of the snake_case language, which would be incorrect. Likewise, the parser needs to know when the DSL ends to avoid lexing later lines using the kebab-case grammar.
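Here is a small JavaScript sketch of the lazy lexing idea for this two-language scenario. Everything in it (the `GRAMMARS` table, `LazyLexer`, `parseProgram`, and the tiny `name = number;` statement form) is made up for illustration and is not how Rhovas is implemented; comments are left out of the input to keep the lexer short.

```javascript
// Hypothetical grammars: the two languages only differ in variable names.
const GRAMMARS = {
  host: /^[a-z]+(?:_[a-z]+)*/, // snake_case only
  dsl: /^[a-z]+(?:-[a-z]+)*/, // kebab-case only
};

// Lazy lexer: nothing is tokenized ahead of time. The parser asks for one
// token at a time and supplies a pattern from the currently active grammar.
class LazyLexer {
  constructor(source) {
    this.source = source;
    this.index = 0;
  }

  skipWhitespace() {
    while (this.index < this.source.length && /\s/.test(this.source[this.index])) {
      this.index++;
    }
  }

  atEnd() {
    this.skipWhitespace();
    return this.index >= this.source.length;
  }

  // Returns the matched text and advances past it, or null if the pattern
  // does not match at the current position.
  match(pattern) {
    this.skipWhitespace();
    const result = pattern.exec(this.source.slice(this.index));
    if (result === null) return null;
    this.index += result[0].length;
    return result[0];
  }

  expect(pattern) {
    const token = this.match(pattern);
    if (token === null) {
      throw new Error(`Expected ${pattern} at index ${this.index}`);
    }
    return token;
  }
}

function parseProgram(source) {
  const lexer = new LazyLexer(source);
  const statements = [];

  // Parses `name = number;` with the name lexed by the given grammar.
  function parseStatement(language) {
    const name = lexer.expect(GRAMMARS[language]);
    lexer.expect(/^=/);
    const value = Number(lexer.expect(/^[0-9]+/));
    lexer.expect(/^;/);
    statements.push({ language, name, value });
  }

  while (!lexer.atEnd()) {
    if (lexer.match(/^#dsl/) !== null) {
      lexer.expect(/^\{/);
      // Use the DSL grammar until the closing brace ends the DSL.
      while (lexer.match(/^\}/) === null) parseStatement("dsl");
    } else {
      parseStatement("host");
    }
  }
  return statements;
}

const program = parseProgram(`
  snake_case = 0;
  #dsl {
    kebab-case = 1;
  }
  snake_case = 2;
`);
console.log(program.map((s) => `${s.language}: ${s.name}`));
```

Because tokens are only produced on request, the kebab-case name is never seen by the host grammar. Writing `kebab-case = 1;` outside of the DSL block fails instead, since the host grammar stops at the hyphen.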
The second problem is ensuring that the DSL can access values from the host language, which can be done using interpolation, like with `name = ${name}` in JavaScript. The catch here is that interpolation needs to return parsing to the host language so it uses the same grammar as the values it's trying to access. For example, continuing our variable names scenario from above:
snake_case = 0;
#dsl {
    kebab-case = ${snake_case}; //kebab-case in DSL, snake_case in interpolation
}
Since snake_case is not a valid variable within the kebab-case DSL, parsing has to switch back to the snake_case language to handle the variable properly.
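The hand-off can be sketched in JavaScript as two small parsers sharing one position in the input. The `Cursor` class, both parse functions, and the node shapes are hypothetical and only cover this one statement form.

```javascript
// Shared position in the source, so either parser can pick up where the
// other one stopped.
class Cursor {
  constructor(source) {
    this.source = source;
    this.index = 0;
  }

  // Skips whitespace, then returns the matched text or null.
  match(pattern) {
    while (/\s/.test(this.source[this.index] ?? "")) this.index++;
    const result = pattern.exec(this.source.slice(this.index));
    if (result === null) return null;
    this.index += result[0].length;
    return result[0];
  }

  expect(pattern) {
    const token = this.match(pattern);
    if (token === null) {
      throw new Error(`Expected ${pattern} at index ${this.index}`);
    }
    return token;
  }
}

// Host parser: in this sketch it only understands snake_case variables.
function parseHostExpression(cursor) {
  return { kind: "host-variable", name: cursor.expect(/^[a-z]+(?:_[a-z]+)*/) };
}

// DSL parser: kebab-case names, assigned either a number or an interpolation.
function parseDslStatement(cursor) {
  const target = cursor.expect(/^[a-z]+(?:-[a-z]+)*/);
  cursor.expect(/^=/);
  let value;
  if (cursor.match(/^\$\{/) !== null) {
    // Interpolation: hand parsing back to the host language...
    value = parseHostExpression(cursor);
    // ...and resume the DSL once the closing brace is reached.
    cursor.expect(/^\}/);
  } else {
    value = { kind: "dsl-number", value: Number(cursor.expect(/^[0-9]+/)) };
  }
  cursor.expect(/^;/);
  return { kind: "dsl-assignment", target, value };
}

const assignment = parseDslStatement(new Cursor("kebab-case = ${snake_case};"));
console.log(assignment.target, "=", assignment.value.name);
```

With this setup, `${kebab-case}` is rejected: the host parser stops at the hyphen, so the closing brace isn't found where the DSL parser expects it.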
Syntax macros are a new model for implementing embedded DSLs that meets both of these requirements. Unlike regular macros, which work with the existing AST, syntax macros are applied during parsing so that the proper grammar is used for the DSL. Furthermore, syntax macros use a standardized system of interpolation to access values within the host language.
In Rhovas, this looks like the following (which is what I've also been using in the examples):
db.execute(#sql { SELECT * FROM users WHERE name = ${name} });
The # prefix is used to identify compile-time operations, like macros. When the parser reaches this point, it delegates to an sql parser for input within the braces. Finally, ${} is used for interpolation, which delegates parsing back to Rhovas to handle the variable. The sql DSL can then convert this to a prepared statement for use at runtime.
If you're interested in implementation details, see the full thesis. There's a lot more information about how everything works, as well as a neat tie-in with pattern matching in Elixir.
There are a few other areas to explore moving forward:
- Syntax macros with arguments (like #regex(:pcre) { ... }), which conflict with the syntax for regular macros and trailing lambdas.
- Rules for managing control flow, scope, errors, and other effects, helping to isolate the DSL from having unexpected impact on the rest of the code.
- A dedicated API for creating syntax macros, as currently a new parser needs to be created and manually integrated into the Rhovas compiler.
These are all tricky problems that will take some time to solve, but as I continue working on Rhovas I hope to make good progress with them. Keep an eye on the #rhovas channel in my Discord for updates!
Syntax macros are a new model for implementing embedded DSLs that can extend the grammar of the host language, allowing different syntax to be used for the DSL. A generalized system of interpolation further allows DSLs to better integrate with the host language.
Feel free to reach out with questions or comments, and I'd love to hear feedback on this system and any insight into the points in future work.
- Email: WillBAnders@gmail.com
- Discord: Tag me in the #blog channel!
Thanks for reading!
~Blake Anderson
P.S. Happy 3rd Anniversary Rhovas 🎉