How to create your own programming language?

A pragmatic guide

In an age of hundreds of programming languages, creating your own language may seem pointless, since languages for all kinds of purposes exist, and finding one for your use is as easy as Googling it, downloading the compiler/interpreter and getting started.

If you are someone who regularly writes code, you may have wondered how does your favourite language really work. In case you have just started programming, you may sometimes wonder about the syntax errors thrown when trying to run the code. In the latter case, going through the language reference is often helpful, but the very first time you do so, you may see something like this:

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "R" | "U" | "f" | "F"
                        | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq

The language reference contains the rules which are valid in the language. These rules constitute the grammar of the language. The above grammar defines the valid syntax for string literals in Python. According to the first grammar rule, a ‘stringliteral’ is defined as an optional ‘stringprefix’ followed by either a ‘shortstring’ or a ‘longstring’. This is followed by rules describing ‘shortstring’, ‘longstring’ and rules used within them.

Writing your own language can also be fun, especially if you love taking on programming challenges. Designing a grammar for your language is the most interesting part of the process. Writing a compiler or an interpreter may be a more challenging task, but in the end, seeing your programming language accept input and produce results gives a true sense of accomplishment.

What is a program really?

A program is defined as a sequence of instructions given to a computer for execution.1 The meaning of a statement depends on the programming language which you are writing in, but there are situations wherein code which you write for one language may be valid in another language. Here’s an example, if you have never seen it before:

#include<cstdio>
#define print(x) int main(){printf(x);}
print("Hello World!")

Source: Reddit

This code can be compiled by a C++ compiler and also executed by the Python interpreter to print the same result. This type of program is called a polyglot, since it is a valid set of instructions in more than one language.

High-level languages like Python allow programmers to write their instructions in English-like statements, which are eventually converted to machine-level instructions for execution by the computer. Learning how your code is processed by a compiler or interpreter helps gain new insights while using the language. As an example, consider the fact that most programming languages are designed to ignore whitespace characters and comments. I only learned about this when I was tasked with implementing a tool to read comments.

The syntax of a language defines how programmers can write statements to perform tasks. The grammar of a language contains syntax rules defined in some type of formal notations. Most programming languages have their syntax defined using variants of a formal notation called the Backus—Naur Form (BNF). Here’s a grammar for a simple calculator in Extended BNF notation:

program: instruction+;

instruction: expression;

expression: expression '*' expression
            | expression '/' expression
            | expression '+' expression
            | expression '-' expression
            | number;

number: [1-9]([0-9]+)?('.'[0-9]+)? ;

whitespace: [\r?\n \t]+ ;

Every line in the above grammar describes a syntactical rule for a statement or a part of a statement. The text on the LHS of the colon is the name of the rule, and the content on the RHS is the definition of the rule. If the RHS contains the pipe character(|) between two rule names or definitions, it denotes an alternative, i.e. only one of the two rules/definitions can match. In case of the expression rule, an expression can contain *, /, + or - between two expressions. The first rule, program, is the starting rule, since it does not reference in any other rule. It defines the program as a set of one or more instruction statements. The instruction rule is simply defined as an expression. However, the expression rule is defined in terms of itself, i.e. the RHS contains references to the expression rule.

The above EBNF grammar can be used to create a calculator program, which reads instructions in the given format and prints the value of each instruction. Before that, the source code must be read and checked for validity. This task is composed of two subtasks; lexical analysis (lexing) and syntax analysis (parsing). There is also a third step, semantic analysis, but that won’t be covered today. Lexing is the process where tokens in the source program are identified from the input file. A token is a string with an assigned meaning.2 In the expression 3 * 4, 3 and 4 are identified as tokens belonging to the expression rule, which will then be resolved to the number rule since 3 and 4 satisfy the number alternative in the expression rule. In syntax analysis, the grammar is used to identify whether a statement is valid in the language or not.

Lexing is performed by a class of a programs called lexers and parsing is performed by another class of programs called parsers. In the early days of programming languages, parsers and lexers were written manually for each programming language, but with time, parser and lexer generator tools became available. ANTLR (ANother Tool for Language Recognition) is a lexer and parser generator tool which I will be using to create the lexer and parser for my programming language. ANTLR will generate the lexer and parser classes when given the specification of the syntax of the language in an EBNF-like format.

However, ANTLR alone won’t help you create your own programming language. A grammar only defines which statements are valid in the language, but it does not define the tasks which must be performed when a statement is encountered. Even after generating the lexer and the parser, implementing the actions to be performed for each statement is the role of the developer.

An action is something which must be done in response to a statement. For example, upon finding a variable initialization statement in the source code, the compiler needs to store it into a data structure for resolving its value later on. Implementing code for such tasks is entirely the responsibility of the developer.

Prerequisites

Since my intention is to help you understand the process of creating your own programming language as quickly as possible, I won’t be explaining the theoretical stuff. You will also need to have some experience writing code in an object-oriented programming language, such as Java or Python. I’ll be using ANTLR4 to generate the parser and lexer. The tool supports multiple language targets, such Python, Java, Javascript and a few others. This means parser and lexer code will be generated in one of these languages.

Note

Explaining the entire process of creating your own programming language cannot be covered in a single post, so I’ll be explaining the grammar and actions for a query language. Understanding the relationship between grammar and actions will elucidate the basics of creating a programming language.

Installing ANTLR4

To create a parser and lexer, ANTLR4 must be installed. Follow the official guide to install ANTLR on your system. In case you are using a linux distribution, you can also check the repositories for a package which sets up the shell aliases automatically and installs ANTLR4 for you. If you are using Arch Linux, you can install the antlr4 package. Another thing to keep in mind is that ANTLR4 must be installed; this post is based on ANTLR4 only.

After setting up ANTLR, I urge you to try replicating the Hello.g4 grammar given in the official guide, just to get a hang of using ANTLR4.

Installing Python3

If you are comfortable using Java, you can skip this step, but for those who have a predilection for Python code, download and install Python (python3), if you haven’t done so yet. After installing Python, you would also require the antlr4-python3-runtime package. To install this package, you can either use pip or conda.

To install using pip, type the following code into a new terminal/CMD/Powershell window:

python -m pip install antlr4-python3-runtime

To install using conda, type:

conda install -c conda-forge antlr4-python3-runtime

Although I have selected Python as my language target for ANTLR4, I’ll be first generating the parser and lexer with Java as the target, since visualizing the parse tree is not possible when Python is the target language. This is because the TestRig utility provided by ANTLR can only run Java code. A parse tree is defined as an ordered, rooted tree that represents the syntactic structure of a string according to a grammar.3 Visualizing the parse tree helps determine whether the input is being correctly recognized and which tokens are being matched. For the calculator grammar, with the following program, here is the parse tree generated by ANTLR4:

2 + 3
4 + 5
20 + 3 * 5
Parse tree generated by ANTLR for a specific input to parser

A 3—step guide to building a parser and lexer

To generate the lexer and parser, as well as test the recognition abilities of the parser, I would recommend following the below generic steps:

  1. Generate the lexer and parser in Java

    antlr4 *.g4 -visitor -listener
  2. Compile all the Java code in the directory

    javac *.java
  3. Use grun alias to see the parse tree for a given input. Either save the input to a file (like input.calc for the Calculator grammar) and provide the filepath as a command line argument to grun or paste/type the input after running grun.

    grun Calc program -gui input.calc
    Or execute like this:
    
    grun Calc program -gui
    -> 2 + 3
    -> 4 + 5
    -> 20 + 3 * 5
    The -> represents text to be typed by you. After entering the program, press Enter. Then press Ctrl + Z (in Windows) or Ctrl + D in Linux/Mac to send the input to the parser.

About the language

The language which I have created is not a programming language in the real sense. It’s a query language to search the web using Google, and store the results as a markdown file. The language is named Prushn after the Hindi word for ‘question’. The input to the language consists of questions like these:

Q1. What is the full form of JSP ?
Q2. Which is the hottest place on Earth ?

Here is the ANTLR4 grammar for the language:

grammar Prushn;

program: question+;
question: QUESTION_BEGIN text QUESTION_MARK NEWLINE;

text: (WORD SPACE)+;

QUESTION_BEGIN: 'Q' NUMBER '.' SPACE;
NEWLINE: ('\r'?'\n')+;
QUESTION_MARK: '?';
fragment OPERATOR: '/' | '*' | '+' | '-' ;
fragment EXPONENT: '^';
fragment MONEY: '$' | '₹';
fragment BLOCK: '(' | ')' | '[' | ']' | '{' | '}';
fragment OTHER: '!' | '@' | '#' | '<' | '>' | '/' | ',' | '.' | '|' | '\\' | '=' | '_' | '`' | '~';
SPECIAL: OPERATOR | EXPONENT | MONEY | BLOCK | OTHER ;

ALPHABET: [a-zA-Z];
LETTER: ALPHABET | SPECIAL;
DIGIT: [0-9];
NUMBER: ([0-9]+)? '.'? DIGIT+;
WORD: (LETTER | NUMBER | SPECIAL)+;
SPACE: ' '+;

The grammar notation used by ANTLR4 is based on EBNF, but has a few differences. The first line

grammar Prushn;

defines the name of the grammar. The grammar for Prushn is stored in a file named Prushn.g4. The filename must always be same as the grammar name.

Rule names in uppercase denote terminal symbols, i.e. those tokens which can be present at the leaf nodes in a parse tree. Fragments, like

fragment EXPONENT: '^';

are terminal symbols which are only used when defining other terminal symbols.

The parser looks for the rule program when matching tokens provided by the lexer. A “program” is defined as one or more sequence of “question” statements. The question rule simply allows a query such as “Q1 What is the full form of JSP?” to be recognized, so that some action can be performed. In this case, the action is querying Google using the “text” part of the question.

Here are some queries for the above described query language.


Q1. What is the full form of JSP ?
Q2. Which is the hottest place on Earth ?
Q3. Restaurants near me ?
Q4. 4WD vs FWD ?

These queries will work, because ultimately, Google will be providing the search results. It’s also possible to use another search engine, because the search queries only contain simple text for now. It’s also possible to detect search modifiers such as “site:somesite.com”, but that would require making changes to the grammar which might be complicate things a bit. I’ll be showing how to do that later.

Implementation

Using the grammar for the language, ANTLR can generate suitable lexer and parser classes. Ensure that you save the grammar file with the same name as the grammar. Run the following command in the terminal/CMD/Powershell window. Ensure that the grammar file Prushn.g4 is in the same directory.

antlr4 Prushn.g4 -Dlanguage=Python3 -visitor -listener

Running the above command will generate Python files containing visitor and listener classes. A visitor class allows the programmer to use the visitor design pattern in programming. The visitor design pattern is a way of separating an algorithm from the object of the data structure on which it operates, but without modifying the data structure.4

The listener classes allow the programmer to perform some tasks when the nodes of the parse tree are visited in a depth first manner. I’ll be using the listener classes instead of the visitor, since queries need to be fired sequentially. The visitor pattern can be used, if some specific child nodes of the parse tree need to be visited after reaching a particular node.

Before continuing, fetch a copy of the source code from GitHub so that you can view the files. Filenames ending with .interp and .tokens have been generated by ANTLR along with the PrushnParser.py, PrushnLexer.py, PrushnListener.py and PrushnVisitor.py files. Prushn.g4, Search.py, PrushnReader.py contain the grammar, the code for searching the web, and the actions to be performed when reading the queries respectively. Prushn.py is the main Python file which can be executed by the user.

The Prushn.py file writes the search results to the file Uttar.md for now. PrushnReader.py creates an object of GoogleSearch class and uses it to search the web. It does this by overriding the functions of PrushnListener.py class which allows the programmer to perform some actions when the text part of the question rule is visited. The GoogleSearch class uses the requests module and BeautifulSoup to send queries and extract the URLs from the result page.

To run the Prushn query parser, type:

python Prushn filename.prn

where filename.prn is the file containing the queries in the format shown before. This will search each query within the the .prn file and write the results to the Uttar.md file. The resultant Uttar.md file is a markdown file which can be easily read in an editor, and contains the query followed by the list of links shown in the result.

Enfin

Although I tried not to go into the nitty-gritty details of how programming languages can be implemented, most of the basics of grammars have been covered. I would highly recommend reading the book The Definitive ANTLR 4 Reference by Terence Parr to understand ANTLR in more detail.

References

  1. Definition of a computer program
  2. Lexical_analysis#Token
  3. Parse tree
  4. Visitor pattern

One thought on “How to create your own programming language?

Leave a comment