Introducing Syntax Macros

July 11, 2021
~9m read time

Domain-specific languages (DSLs) are everywhere; you just might not recognize them. Regex is a DSL, as are SQL and HTML. C# has LINQ for queries. Even React's HTML-like templates are written in a DSL called JSX.

DSLs are excellent at what they do - picking the ideal language is like picking the right tool from the toolbox; not every problem needs the Python hammer. That said, there is little support for DSLs within existing languages, and we often end up throwing them into string literals hoping for the best at runtime.

Instead, I'd like to introduce a new model for implementing embedded DSLs called syntax macros, which allow extending the syntax of the language to support the DSL. This results in syntax like the following, where the DSL actually looks like its own language:

db.execute(#sql {
    SELECT * FROM users
    WHERE name = ${name}
});

This article is a more approachable summary of my honors thesis that highlights key ideas and elaborates on points that presumed a background in programming languages. The full thesis is available through the UF Library Catalog, and I believe it's pretty straightforward as far as academic papers go.

Motivation

DSLs are often embedded within larger applications written in another language, called the host language. The most common method for embedding DSLs is to simply use strings, such as the following SQL query for looking up a user by their name in a database.

db.execute(
    "SELECT * FROM users " +
    "WHERE name = '" + name + "'"
);

However, because the query is built with string concatenation, it is vulnerable to SQL injection - if name contained malicious SQL provided by the user, the database would execute it as part of the query. A common solution to this is prepared statements, which use a template with placeholders and set the argument values separately from the query itself.

stmt = db.prepare(
    "SELECT * FROM users " +
    "WHERE name = ?"
);
stmt.set(0, name);
stmt.execute();
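
To make the difference concrete, here's a minimal JavaScript sketch of the two approaches. Both helper functions and the { text, values } shape are hypothetical and only for illustration; they don't correspond to any particular database driver.

```javascript
// Hypothetical helpers for illustration; not part of any real database driver.

// Unsafe: the value is spliced directly into the query text.
function concatenatedQuery(name) {
    return "SELECT * FROM users WHERE name = '" + name + "'";
}

// Safer: the query text is fixed and the value travels separately.
function parameterizedQuery(name) {
    return { text: "SELECT * FROM users WHERE name = ?", values: [name] };
}

const malicious = "' OR '1'='1";

console.log(concatenatedQuery(malicious));
// SELECT * FROM users WHERE name = '' OR '1'='1'

console.log(parameterizedQuery(malicious).text);
// SELECT * FROM users WHERE name = ?
```

With concatenation the input changes the structure of the query itself, while the parameterized version keeps the query text constant no matter what name contains.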

Though this does resolve the SQL injection issue, there are still some major limitations to this approach:

The query is still a string, and is thus restricted by the syntax of the host language (for example, using quotes in the DSL would require escaping).
The separation of the query, arguments, and execution hurts readability, especially in more complex examples.
Static analysis becomes harder, especially with ensuring the type-safety of arguments.

Ideally, we would be able to write the embedded SQL in the same way we would SQL itself, plus support for static analysis and type-safe arguments. Effectively, an embedded DSL should be just that - an embedded language, with all the benefits that come with it.

Existing Solutions

The most common approach for DSLs is to use strings, as with the SQL above and with regex. Libraries are another approach, like database ORMs or Kotlin's HTML builder.

However, one of the most promising solutions that is becoming popular is custom format strings, as seen in JavaScript's tagged templates and Scala's interpolators. These transform a string with interpolated values into a function call that can process the string as the DSL and handle the values appropriately. An example of this in JavaScript is shown below using our previous SQL query.

db.execute(sql`
    SELECT * FROM users
    WHERE name = ${name}
`);

This method is extremely versatile and is currently the best option for DSLs (after all, it's designed for them). Even so, there are limitations - syntax is still restricted (a backtick in the DSL would end the literal unless escaped) and static analysis isn't possible without external tooling (which might need to be created).
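
For reference, here's a rough sketch of what a tag function like sql could look like. The function name and the { text, values } result shape are my own assumptions for illustration; real libraries define their own formats.

```javascript
// Hypothetical `sql` tag: builds a prepared-statement-style object
// instead of a plain string. The { text, values } shape is an assumption.
function sql(strings, ...values) {
    // `strings` holds the literal pieces around each ${...};
    // joining them with "?" leaves one placeholder per value.
    return { text: strings.join("?"), values };
}

const name = "alice";
const query = sql`SELECT * FROM users WHERE name = ${name}`;

console.log(query.text);      // SELECT * FROM users WHERE name = ?
console.log(query.values[0]); // alice
```

The tag receives the literal text and the interpolated values separately, which is what lets it treat the values as arguments rather than as part of the query.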

Embedded DSL Requirements

There are two key requirements needed for a solution that meets our goals.

The syntax of the DSL should not be unnecessarily restricted by the host language, such as with string literals. This means that DSLs have to be handled at the level of the parser.
The DSL should integrate with the host language for accessing values, which is commonly used for templating or passing arguments. This works similarly to string interpolation in existing languages.

Unrestricted Syntax

To allow a DSL to have unrestricted syntax, we must be able to change the grammar used during the parsing process. This means that the parser must use lazy lexing and be cautious of how much lookahead it uses to avoid accidentally lexing the DSL using the grammar of the host language.

That's a pretty term-heavy explanation, so an example here is easiest. Consider two languages with different (and incompatible) styles for variable names:

snake_case = 0; //only snake_case allowed
#dsl {
    kebab-case = 1; //only kebab-case allowed
}
snake_case = 2; //only snake_case allowed (again)

If the parser isn't careful, it could end up lexing the kebab-case variable using the grammar of the snake_case language, which would be incorrect. Likewise, the parser needs to know when the DSL ends to avoid lexing later lines using the kebab-case grammar.
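
Here's a toy JavaScript sketch of that idea, assuming two made-up grammars that only differ in their identifier rules. Everything in it (the lex function, the token shape, the "host"/"dsl" modes) is hypothetical and far simpler than what Rhovas actually does; the point is just that tokens are produced one at a time using whichever grammar is currently active.

```javascript
// Illustrative only: two toy grammars that differ in their identifier rules.
const HOST_IDENT = /^[a-z]+(?:_[a-z]+)*/; // snake_case
const DSL_IDENT = /^[a-z]+(?:-[a-z]+)*/;  // kebab-case

// Lexes one token at a time, choosing the identifier rule from the
// current mode rather than tokenizing the whole input up front.
function lex(source) {
    const tokens = [];
    let mode = "host";
    let rest = source.trimStart();
    while (rest.length > 0) {
        let type;
        let text;
        if (mode === "host" && rest.startsWith("#dsl")) {
            type = "macro";
            text = "#dsl";
        } else if (/^[a-z]/.test(rest)) {
            type = "identifier";
            text = rest.match(mode === "host" ? HOST_IDENT : DSL_IDENT)[0];
        } else if (/^[0-9]/.test(rest)) {
            type = "number";
            text = rest.match(/^[0-9]+/)[0];
        } else {
            type = "symbol";
            text = rest[0];
        }
        const previous = tokens[tokens.length - 1];
        tokens.push({ mode, type, text });
        // Switch grammars at the macro's braces (this toy DSL has no nested braces).
        if (text === "{" && previous?.type === "macro") mode = "dsl";
        else if (text === "}" && mode === "dsl") mode = "host";
        rest = rest.slice(text.length).trimStart();
    }
    return tokens;
}

const source = "snake_case = 0; #dsl { kebab-case = 1; } snake_case = 2;";
const identifiers = lex(source)
    .filter((token) => token.type === "identifier")
    .map((token) => `${token.mode}:${token.text}`);
console.log(identifiers.join(", "));
// host:snake_case, dsl:kebab-case, host:snake_case
```

If the same lexer saw kebab-case while still in host mode, it would split it into kebab, -, and case, which is exactly the kind of mistake an eager lexer would make.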

Host Language Access

The second problem is ensuring that the DSL can access values from the host language, which can be done using interpolation, like with `name = ${name}` in JavaScript. The catch here is that interpolation needs to return parsing to the host language so it uses the same grammar as the values it's trying to access. For example, continuing our variable names scenario from above:

snake_case = 0;
#dsl {
    kebab-case = ${snake_case}; //kebab-case in DSL, snake_case in interpolation
}

Since snake_case is not a valid variable within the kebab-case DSL, parsing has to switch back to the snake_case language to handle the variable properly.
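
As a small illustration, the sketch below splits a DSL body into DSL text and host-language interpolations. The splitInterpolations helper and the part shape are hypothetical, and counting braces is a shortcut I'm assuming for this example; a real implementation would hand control back to the host parser instead.

```javascript
// Splits a DSL body into DSL text and host-language interpolations.
// Hypothetical helper: a real parser would delegate the span after "${"
// to the host parser rather than counting braces.
function splitInterpolations(body) {
    const parts = [];
    let index = 0;
    while (index < body.length) {
        const start = body.indexOf("${", index);
        if (start === -1) {
            // No more interpolations: the remainder is DSL text.
            parts.push({ kind: "dsl", text: body.slice(index) });
            break;
        }
        if (start > index) {
            parts.push({ kind: "dsl", text: body.slice(index, start) });
        }
        // Scan to the matching "}", tracking nesting so host code
        // containing its own braces stays in one piece.
        let depth = 1;
        let end = start + 2;
        while (end < body.length && depth > 0) {
            if (body[end] === "{") depth++;
            else if (body[end] === "}") depth--;
            end++;
        }
        if (depth !== 0) throw new SyntaxError("Unterminated interpolation");
        parts.push({ kind: "host", text: body.slice(start + 2, end - 1) });
        index = end;
    }
    return parts;
}

for (const part of splitInterpolations("kebab-case = ${snake_case};")) {
    console.log(part.kind, JSON.stringify(part.text));
}
// dsl "kebab-case = "
// host "snake_case"
// dsl ";"
```

Each host part would then be parsed with the host grammar, while the dsl parts stay with the DSL's grammar.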

Syntax Macro Model

Syntax macros are a new model for implementing embedded DSLs that meets both of these requirements. Unlike regular macros, which work with the existing AST, syntax macros are applied during parsing to use the proper grammar for the DSL. Furthermore, syntax macros use a standardized system of interpolation to access values within the host language.

In Rhovas, this looks like the following (which is what I've also been using in the examples):

db.execute(#sql {
    SELECT * FROM users
    WHERE name = ${name}
});

The # prefix is used to identify compile-time operations, like macros. When the parser reaches this point, it delegates to an sql parser for input within the braces. Finally, ${} is used for interpolation, which delegates parsing back to Rhovas to handle the variable. The sql DSL can then convert this to a prepared statement for use at runtime.
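
To give a feel for the delegation step, here's a loose JavaScript sketch. It is not how the Rhovas compiler is implemented; the macros registry, the expandMacro function, and the part and result shapes are all assumptions I'm making for illustration.

```javascript
// Hypothetical registry mapping macro names to DSL handlers.
const macros = new Map();

// Toy `sql` handler: each host interpolation becomes a "?" placeholder,
// and its source is kept as an argument to evaluate at runtime.
macros.set("sql", (parts) => {
    const text = parts.map((p) => (p.kind === "host" ? "?" : p.text)).join("");
    const args = parts.filter((p) => p.kind === "host").map((p) => p.text);
    return { text, args };
});

// Called when the parser sees `#name { ... }`: delegate to the handler.
function expandMacro(name, parts) {
    const handler = macros.get(name);
    if (!handler) throw new Error(`Unknown syntax macro: #${name}`);
    return handler(parts);
}

const statement = expandMacro("sql", [
    { kind: "dsl", text: "SELECT * FROM users WHERE name = " },
    { kind: "host", text: "name" },
]);
console.log(statement.text);    // SELECT * FROM users WHERE name = ?
console.log(statement.args[0]); // name
```

The important part is that the host parser only needs to know the macro's name; everything inside the braces is the DSL's responsibility.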

If you're interested in implementation details, see the full thesis. There's a lot more information about how everything works, as well as a neat tie-in with pattern matching in Elixir.

Future Work

There are a few other areas to explore moving forward:

Syntax macros with arguments (like #regex(:pcre) { ... }), which conflict with the syntax for regular macros and trailing lambdas.
Rules for managing control flow, scope, errors, and other effects, helping to isolate the DSL from having unexpected impact on the rest of the code.
A dedicated API for creating syntax macros, as currently a new parser needs to be created and manually integrated into the Rhovas compiler.

These are all tricky problems that will take some time to solve, but as I continue working on Rhovas I hope to make good progress with them. Keep an eye on the #rhovas channel in my Discord for updates!

Closing Thoughts...

Syntax macros are a new model for implementing embedded DSLs that can extend the grammar of the host language, allowing different syntax to be used for the DSL. A generalized system of interpolation further allows DSLs to better integrate with the host language.

Feel free to reach out with questions or comments, and I'd love to hear feedback on this system and any insight into the points in future work.

Email: WillBAnders@gmail.com
Discord: Tag me in the #blog channel!

Thanks for reading!
~Blake Anderson

P.S. Happy 3rd Anniversary Rhovas 🎉