Semantic Analysis Abstractions

June 25, 2022
~11m read time

In my last post, I introduced the idea of syntax macros for parsing embedded domain-specific languages (DSLs) like regex, SQL, and HTML. DSLs are an excellent tool for modeling many types of specialized problems, and improving language-level support for them is something I'm hoping to achieve in my own programming language Rhovas.

This time, I'm going to address the harder challenge of semantic analysis - variables/function resolution, macro hygiene, type checking, and so on. This process has two goals: transform the current AST of the DSL into a representation that is compatible with the host language, and perform any additional compile-time validation that would otherwise be lost during the transformation.

For example, the code below shows an SQL DSL being transformed into a PreparedStatement. This process also validates the interpolated name variable, checking both that the variable is defined and that it has the appropriate type (presuming the database schema is available).

db.execute(#sql {
    SELECT * FROM users
    WHERE name = ${name}
});
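A minimal sketch of what this transformation could produce is below. The function toPreparedSql and its names are illustrative, not the actual Rhovas compiler API: the DSL's literal fragments are joined with ? placeholders while the interpolated arguments are collected separately for parameter binding, matching how a PreparedStatement is used.

```kotlin
// Illustrative sketch (not the actual compiler API): join the DSL's
// literal fragments with `?` placeholders and collect the interpolated
// arguments for later parameter binding.
fun toPreparedSql(literals: List<String>, arguments: List<Any?>): Pair<String, List<Any?>> {
    // n interpolated arguments split the DSL source into n+1 literals
    require(literals.size == arguments.size + 1)
    val sql = StringBuilder(literals.first())
    for (i in arguments.indices) {
        sql.append('?').append(literals[i + 1])
    }
    return Pair(sql.toString(), arguments)
}

fun main() {
    val name = "alice"
    // #sql { SELECT * FROM users WHERE name = ${name} } lowers to roughly:
    val (sql, params) = toPreparedSql(
        listOf("SELECT * FROM users\nWHERE name = ", ""),
        listOf(name)
    )
    println(sql)    // the parameterized query, with ? in place of ${name}
    println(params) // the arguments to bind, e.g. via PreparedStatement.setString
}
```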

This article is a more approachable summary of my Master's Thesis that highlights key ideas and elaborates on points that presumed a background in programming languages. The full thesis is available through the UF Library Catalog, and I believe it's pretty straightforward as far as academic papers go.

DSL Transformations

As part of semantic analysis, the first goal is to transform the AST of the DSL (which is custom due to syntax macros) into the AST of the host language. Many languages already support similar behavior through interpolation and macro systems, so the problem itself is pretty well-defined.

Interpolator Transformations

The easiest way to implement a DSL is by transforming the DSL into a function call with a combination of string literals (DSL code) and interpolated arguments. Both Scala interpolators and JavaScript tagged templates are strong examples of this method.

The example below demonstrates this for SQL, which already uses string literals and as such has an existing runtime to process them. Note the inclusion of newlines and the second literal argument, since ultimately the DSL becomes just a multiline string literal.

#sql {
    SELECT * FROM users
    WHERE name = ${name}
}
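To make the interpolator form concrete, here's a sketch of the function call this DSL could lower to, in the style of JavaScript tagged templates. The sql function here is hypothetical; the naive concatenation it performs is also exactly the limitation discussed below.

```kotlin
// Sketch of a hypothetical `sql` interpolator function: the DSL body
// becomes a list of string literals plus the interpolated arguments,
// which are naively joined back together (no validation).
fun sql(literals: List<String>, vararg arguments: Any?): String =
    literals.first() + arguments.indices.joinToString("") { i ->
        arguments[i].toString() + literals[i + 1]
    }

fun main() {
    val name = "alice"
    // The #sql macro above lowers to roughly this call, with the DSL
    // split into two literals around the interpolated argument:
    val query = sql(listOf("SELECT * FROM users\nWHERE name = ", ""), name)
    println(query)
}
```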

This approach is a great starting point for getting DSLs working quickly, especially when there's an existing runtime available (as is often the case with regex, SQL, and HTML) or any string processing is relatively trivial. However, these are ultimately still string literals and thus aren't really validating syntax or interpolated variables.

Host AST/IR Transformations

After interpolators, the next step is to support compile-time validation. Using syntax macros, DSLs can be represented with a custom AST which then has to be lowered into the host language. This can be done by transforming to the host AST (before semantic analysis) or the host IR (after semantic analysis):

  • AST transformations are like rewriting the DSL using the host language's syntax. This is functional and conceptually straightforward, but because AST transformations occur before semantic analysis, the DSL must rely on the host language for any validation, which can present issues with macro hygiene.
  • IR transformations are the most powerful, and unlike the AST they have access to contextual information like types which can be used for validation. Most macro systems in modern languages incorporate some degree of contextual information (generally scopes), like Racket, Rust, and Julia. Since the DSL performs its own validation, it can transform to simplified representations like string literals or use tricky optimizations while also ensuring type safety.

AST transformations are convenient but a bit limiting. The ability to perform IR transformations is more interesting, as access to type information is what allows DSLs to support validation and strong error handling. The example below demonstrates how this can be used to typecheck HTML attributes:

val id = 1;
#html {
    <h1 id=${id}>Hello, World!</h1>
}

Note that the DSL here has access to type information from the host language that it can make use of, which is something that many existing macro systems don't offer - they rely on type checking after the transformation, which is more limiting and moves error handling outside the DSL (leading to errors akin to C++ templates).
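As a rough illustration of the idea (the names and schema here are hypothetical, not the actual Rhovas implementation), the #html macro's IR transformation could look up the host-provided type of each interpolated expression and check it against a known attribute schema before emitting any IR:

```kotlin
// Hypothetical attribute schema: maps attribute names to expected types.
val htmlAttributeTypes = mapOf("id" to "String", "hidden" to "Boolean")

// During IR transformation, the host analyzer supplies the type of the
// interpolated expression; the DSL validates it and reports its own
// error rather than deferring to post-expansion type checking.
fun checkAttribute(name: String, interpolatedType: String): String? {
    val expected = htmlAttributeTypes[name]
        ?: return "unknown attribute '$name'"
    return if (expected != interpolatedType) {
        "attribute '$name' expects $expected, but the interpolated value has type $interpolatedType"
    } else null
}

fun main() {
    // val id = 1; <h1 id=${id}> ... the interpolated `id` is an Integer:
    println(checkAttribute("id", "Integer"))
}
```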

Other Macro Features

There are two other key features of macro systems worth mentioning, since they have a major impact on what is or is not possible with DSLs:

  • Hygiene, which ensures identifiers inside/outside the macro are kept separated. For DSLs, this means that the DSL itself must be analyzed in a distinct scope but interpolated arguments still have to be analyzed in the original called scope. There are also cases where DSLs actually want to break hygiene, such as defining variables that are also accessible outside the DSL.
  • Splicing, which allows a macro to return a list of ASTs that are merged into the calling location (like spreading varargs or Fragment components in many webdev frameworks).
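Hygiene is commonly implemented by renaming, which can be sketched as follows (the gensym helper is illustrative): identifiers defined inside the macro get fresh names so they cannot collide with or capture identifiers from the call site, while interpolated arguments skip renaming and keep resolving in the caller's scope.

```kotlin
// Illustrative sketch of hygienic renaming: each macro-internal
// definition gets a globally unique name.
var counter = 0

fun gensym(name: String): String = "$name#${counter++}"

fun main() {
    // The macro defines `tmp`; the caller also has a `tmp`.
    val macroTmp = gensym("tmp") // fresh name, e.g. "tmp#0"
    val callerTmp = "tmp"        // untouched, resolves in the caller's scope
    println(macroTmp != callerTmp) // the two can never collide
}
```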

For instance, both of these features can be used to allow a TypeScript DSL to define new variables accessible in the host language:

#typescript {
    const x = 1;
    const y = 2;
    const z = 3;
}
print(x, y, z);
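The example above can be sketched as splicing with intentionally broken hygiene (all names here are illustrative): the macro expands to a list of declarations, which are merged into the surrounding scope so that x, y, and z become visible to the caller.

```kotlin
// Illustrative model of splicing: a macro expansion returns a *list*
// of declarations rather than a single AST node.
data class Declaration(val name: String, val value: Int)

fun expandTypescriptMacro(): List<Declaration> =
    listOf(Declaration("x", 1), Declaration("y", 2), Declaration("z", 3))

fun main() {
    val scope = mutableMapOf<String, Int>() // the caller's scope
    // Splice: merge each declaration into the calling location. Since
    // hygiene is intentionally broken, the names are NOT renamed.
    expandTypescriptMacro().forEach { scope[it.name] = it.value }
    println(scope["x"]!! + scope["y"]!! + scope["z"]!!) // print(x, y, z)
}
```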

Context Analysis

The second goal of semantic analysis is to perform the analysis itself, aka compile-time validation. To implement this, the analyzer manages state identifying context-sensitive information about the program - things like variable/function scopes, labels, and control flow. This information is modified as analysis progresses (thus, mutable state), and DSLs generally need to access/modify this information for things like typechecking.

Modeling this kind of data is not easy, but the solution I've come up with, which I call the Push-Pull Context System, has worked surprisingly well for me in practice. Not only is it useful for DSLs, but even just for Rhovas itself I've found it to be a successful model that makes it easier to keep track of this mutable state.

Push-Pull Context System

The primary challenge for managing context is supporting conditional branching, as with if/match statements. There are two considerations here:

  • Each branch must be analyzed in a separate child context. This context is initially the same for each branch, and any updates to contextual information that occur need to be kept separate from the context of other branches.
  • After each branch is analyzed, any updates to contextual information must be merged and then applied to the parent context to be used for later analysis.

The context system is designed around modeling these operations as either pushing data from the parent context to a child context or pulling data from all children to the parent (merging as appropriate). The ContextItem interface defines these two operations as child (push) and merge (pull). An example implementation for tracking labels is shown below using two sets, one for available labels from the parent and another for accessed labels that is merged via union.

data class LabelContext(
    val available: MutableSet<String>,
    val accessed: MutableSet<String>,
) : ContextItem<LabelContext> {

    // Push: each branch starts with the parent's available labels and no accesses.
    override fun child(): LabelContext {
        return LabelContext(available.toMutableSet(), mutableSetOf());
    }

    // Pull: the labels accessed in any branch are unioned into the parent.
    override fun merge(children: List<LabelContext>) {
        children.forEach { child ->
            accessed.addAll(child.accessed);
        }
    }

}
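For reference, the ContextItem interface the example implements can be sketched as below (a minimal reconstruction; the actual Rhovas definition may differ). A toy implementation is included so the push-pull flow around a branch can be seen end-to-end.

```kotlin
// Minimal reconstruction of the ContextItem interface: `child` pushes
// data from the parent into a fresh per-branch context, and `merge`
// pulls the results of all branches back into the parent.
interface ContextItem<T : ContextItem<T>> {
    fun child(): T
    fun merge(children: List<T>)
}

// Toy example: a context counting analyzed expressions, merged by summing.
data class CountContext(var count: Int) : ContextItem<CountContext> {
    override fun child() = CountContext(0)
    override fun merge(children: List<CountContext>) {
        children.forEach { count += it.count }
    }
}

fun main() {
    val parent = CountContext(0)
    val branches = listOf(parent.child(), parent.child()) // push
    branches[0].count = 2 // analysis of the first branch
    branches[1].count = 3 // analysis of the second branch
    parent.merge(branches) // pull
    println(parent.count)
}
```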

Labels are a simple example as they fit the push-pull model cleanly, but I also want to show how this system can be used to model control flow which is much more sophisticated with merging.

Control Flow Context

The primary challenges of tracking control flow are conditional evaluation and jumps. Conditional evaluation involves managing control flow across multiple branches and requires a method of merging these contexts. Jumps may occur explicitly through statements or implicitly as the result of thrown exceptions.

For manual jumps, the approach used by Rhovas is to store a set of possible targets (labels) that can be jumped to within a given block. Statements that cause jumps add the appropriate label to the set while those that are possible targets clear the set if it contains the corresponding label. When conditional branches are merged, each set is unioned together unless one of the sets is empty (corresponding to normal control flow) in which case the set is cleared and normal control flow resumes. This merge algorithm is shown below:

val jumps: MutableSet<String>,

fun merge(children: List<JumpContext>) {
    if (children.any { it.jumps.isEmpty() }) {
        jumps.clear();
    } else {
        children.forEach { jumps.addAll(it.jumps) };
    }
}
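To make the two merge cases concrete, here's a self-contained sketch (the class wrapper and demo are added here so the snippet runs standalone; the merge logic is the same as above):

```kotlin
// Standalone sketch of the jump context with its two merge cases.
data class JumpContext(val jumps: MutableSet<String>) {
    fun merge(children: List<JumpContext>) {
        if (children.any { it.jumps.isEmpty() }) {
            // Some branch completes normally, so normal control flow resumes.
            jumps.clear()
        } else {
            // Every branch jumps; control may leave via any of the targets.
            children.forEach { jumps.addAll(it.jumps) }
        }
    }
}

fun main() {
    // Case 1: if (cond) { break } else { /* falls through */ }
    val parent = JumpContext(mutableSetOf())
    parent.merge(listOf(JumpContext(mutableSetOf("break")), JumpContext(mutableSetOf())))
    println(parent.jumps) // cleared: normal flow resumes

    // Case 2: if (cond) { break } else { continue }
    parent.merge(listOf(JumpContext(mutableSetOf("break")), JumpContext(mutableSetOf("continue"))))
    println(parent.jumps) // unioned: all branches jump
}
```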

Exceptions are significantly harder due to their power (they can come from any function call, not just statements) and use of types. It's a bit out of scope for this summary post, but if you're interested you can review the full thesis and the Rhovas implementation (at time of writing) as well.

Future Work

There are a few other areas to explore moving forward:

  • A user-facing macro system for creating embedded DSLs without interfacing directly with the compiler. This also includes defining a series of standard macros for common operations, such as creating ASTs/IRs, which will remove large amounts of the semantic analysis burden from the DSLs.
  • The limitations of the push-pull context system, such as Java's 'effectively final' validation for variables accessed in lambda expressions and forwarding context changes into catch blocks when exceptions are thrown. Both appear to require a method of long-term tracking and storage of the context that is not affected by context changes as semantic analysis progresses.
  • Integrating this model with other language standards such as the Language Server Protocol or LLVM to allow DSLs to employ existing compiler toolchains and backends to manage most of the work. However, these models are not designed for DSLs, and thus functionality like interpolated arguments and contextual information present challenges that likely require significant workarounds.

These are all tricky problems that will take some time to solve, but as I continue working on Rhovas I hope to make good progress with them. Keep an eye on the #rhovas channel in my Discord for updates!

Closing Thoughts...

The process of semantic analysis is hard enough for one language, let alone trying to work with embedded DSLs. The abstractions introduced above have already been successfully used in Rhovas, the language I work on, and can also provide benefits for other general-purpose languages for their own semantic analysis.

If you're interested in the more in-depth version, definitely take a look at the full thesis for additional details, examples, and edge cases left out of this post. The main paper is about 20 pages double-spaced and is pretty approachable with some background knowledge in programming languages and macros.

Feel free to reach out with questions or comments, and I'd love to hear feedback on this system and any insight into the points in future work.

  • Email: WillBAnders@gmail.com
  • Discord: Tag me in the #blog channel!

Thanks for reading!
~Blake Anderson