Syntax Highlighter From Scratch

Posted at 2024-11-09 # programming # tooling # neovim

Why make a syntax highlighter?

I started this project because I am a big fan of using Neovim as my text editor, and as part of my master's degree I took a module on multi-agent systems. That module has had a profound impact on how I think about programming languages, and I am hoping to write more on logic programming and constraint programming in the future. As part of the coursework we wrote programs in a language called ASTRA. The only problem was that the tooling for this language existed exclusively for VSCode. So, instead of doing my assignments, I thought: what better way to procrastinate than to create my own parser for the language?

What is tree-sitter?

‘Tree-sitter is a parser generator tool and an incremental parsing library’. It lets you build a parser that turns a source file into a syntax tree and then updates that tree efficiently as the file is edited. I had heard of tree-sitter before because it is built into Neovim and used there to parse several languages.

Tree-sitter also comes with a small declarative language for writing queries, which extract information from your syntax tree so it can be used in other ways. To explain it better, here is a paragraph taken from the Neovim documentation.

…a query consists of one or more patterns. A pattern is defined over node types in the syntax tree. A match corresponds to specific elements of the syntax tree which match a pattern…A capture allows you to associate names with a specific node in a pattern…

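Neovim defines a set of standard capture names for common programming language constructs, such as @keyword or @variable.builtin. A language's highlight queries attach these names to nodes in the syntax tree, and your favourite colorscheme decides how each name is colored.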

Examples of Queries

For example, the popular catppuccin colorscheme for Neovim defines a color for built-in variables. It applies to every node that a query captures as @variable.builtin:

["@variable.builtin"] = { fg = C.red, style = O.styles.properties or {} }

So how do you write the parser?

To write the parser, you can install tree-sitter-cli and set up a project.

Once you have the project set up, you will see a grammar.js file in the root directory of your project. It's here that we will write the parser for our language:

module.exports = grammar({
  name: 'YOUR_LANGUAGE_NAME',

  rules: {
    // TODO: add the actual grammar rules
    source_file: $ => 'hello'
  }
});

To write a basic grammar for your language, you can start with something almost every language has: keywords. Keywords are usually a fixed set of reserved words. Take the example below. To match a keyword, we put a name for the rule on the left-hand side and, on the right-hand side, the text that must be found in the source code.

{
  // ...
  keyword_if: $ => 'if',
  keyword_while: $ => 'while',
  keyword_for: $ => 'for',
  // ...
}

But this would get tedious, and we aren't trying to parse the whole language after all: we just want to highlight the code, not compile it. So instead of repeating ourselves, we can use another useful feature of tree-sitter, its small but rich domain-specific language (DSL). It gives us some useful functions to simplify our parser. Let's revisit the code above.

We want to group all the keywords in the language into a single keywords rule, because they will all be colored the same. So instead of writing three separate rules, let's just use the choice function from the tree-sitter DSL.

{
  // ...
  keywords: $ => choice('if', 'while', 'for'),
  // ...
}

The above code should be self-explanatory, but just to clarify: it tells our parser that anything matching one of ‘if’, ‘while’ or ‘for’ is considered a keyword.
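
To make that a little more concrete, here is a rough sketch in plain JavaScript of what a choice rule boils down to. This is not tree-sitter's actual implementation. The choice function below is a simplified stand-in written for this example, and matchesChoice is a hypothetical helper that only exists for illustration. The object shape is modelled on the grammar.json file that tree-sitter generates from grammar.js, so treat the details as an approximation rather than a reference.

```javascript
// Simplified stand-in for tree-sitter's `choice`; NOT the library's real code.
// It only mimics the shape of the rule objects in a generated grammar.json.
function normalize(rule) {
  // In a grammar, a bare string stands for a literal token.
  return typeof rule === 'string' ? { type: 'STRING', value: rule } : rule;
}

function choice(...members) {
  return { type: 'CHOICE', members: members.map(normalize) };
}

// Same layout as the `rules` object in grammar.js.
const rules = {
  keywords: $ => choice('if', 'while', 'for'),
};

// Hypothetical helper: is `word` one of the literal alternatives of `rule`?
function matchesChoice(rule, word) {
  return rule.members.some(m => m.type === 'STRING' && m.value === word);
}

// This rule never refers to other rules, so an empty `$` is enough here.
const keywords = rules.keywords({});

console.log(JSON.stringify(keywords, null, 2));
console.log(matchesChoice(keywords, 'while')); // true
console.log(matchesChoice(keywords, 'print')); // false
```

The point of the sketch is that a rule is just data: choice collects the alternatives into one node, and the generated parser accepts any one of them wherever the keywords rule is expected.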

Okay but what about the syntax highlighting?

Good question but, unfortunately, this article is getting pretty long, so that will have to wait for a future post. I hope you enjoyed this introduction to tree-sitter. In the future I am hoping to learn more about the theory of parsing and lexical analysis so I can write parsers for more things. But that will have to wait.