Question about GLSL grammar compared to desired AST output

In the 4.20 GLSL Grammar, there are these productions:

function_call:
   function_call_or_method

function_call_or_method:
   function_call_generic

function_call_generic:
     ....

I’m building a parser for GLSL. I’ve found that “function_call_or_method” and “function_call_generic” are only used here, nowhere else. I’m fairly new to parsing in general. When I’m building an AST, should I follow the rules in the GLSL spec and produce an AST that contains {function_call: { function_call_or_method: { function_call_generic: {...}}} or is it ok to simply produce a function call?

I don’t understand why the grammar has these rules that only produce a function_call_generic. What is the utility of these intermediate rules?

If I use the GLSL parser on astexplorer . net - it doesn’t produce the intermediate nodes. It produces a “call.” I don’t think the JS parser here is the most robust, so I’m not sure if I should be comparing against this.

This is my first time building a parser of this size. In general I’m trying to figure out the reasoning of which rules from the grammar I should include my parser, and which are OK to collapse.

If the answer is “it depends on what you want your parser to do,” the end goal of my parser is to be able to:

  • Re-print the exact same program from the input program
  • Find “significant” sections of the program - variables, functions, and be able to arbitrarily define what is “significant” - maybe some walker cares about addition operators, for example
  • Ideally be able to perform the same type of vertex/fragment validation the browser does on a shader program

I looked at various versions of the GLSL specification, and note that up to and including 4.20 revision 6, the grammar definition said

function_call_or_method:
    function_call_generic
    postfix_expression DOT function_call_generic

By 4.20 revision 11, this had changed to

function_call_or_method:
    function_call_generic

So I’m guessing it was related to how the grammar handled the .length() method. All versions have:

postfix_expression:
   ...
   postfix_expression DOT FIELD_SELECTION

so changing the definition of FIELD_SELECTION to include length() as well as bare identifiers would cover that case. The specification doesn’t formally define the lexical syntax; it only lists the tokens. Section 3 informally defines the lexical syntax, but doesn’t mention length() at all.

Ah, I didn’t realize there was a newer version of the GLSL spec than the spec I was working from. That makes it easy knowing that production was removed entirely in 4.60.7.

I have the same general question for other productions like function_header_with_parameters. Should my output AST have this node and match the productions in the grammar 1:1? If not, what is the general strategy of deciding when not to match the productions 1:1?

The specification doesn’t formally define the lexical syntax; it only lists the tokens.

I don’t fully follow this, are you referring to function_header_with_parameters as a “token” in this case? I was under the impression a lexer wouldn’t produce tokens like function_header_with_parameters, that decision would be up to the parser to produce semantically meaningful nodes like this.

The tokens are the upper-case identifiers such as DOT and FIELD_SELECTION. These are the output from the lexical analyser and the input to the parser. The grammar specification starts with a list of tokens, but the specification of their syntax (in terms of individual characters) is somewhat less formal. Sections 3 and 4 contain BNF specifications for identifer, integer-constant and floating-constant which presumably correspond to the IDENTIFIER, FLOATCONSTANT/DOUBLECONSTANT, and INTCONSTANT/UINTCONSTANT tokens. Most of the tokens are keywords or individual characters. But a few of them (such as FIELD_SELECTION) require some guesswork. AFAICT, if lexical analysis of length() doesn’t result in a FIELD_SELECTION token, parsing will fail, as I can’t see any other way for foo.length() to parse with the updated function_call_or_method production rule.

I note that the 4.6 version of the grammar specification has the comment

// Methods (.length), subroutine array calls, and identifiers are recognized through postfix_expression.

I’m creating a PEG grammar which sort of combines lexing and parsing, making this generally easier for me.

After further research I don’t think the AST I produce needs to be 1:1 with the productions in the grammar. Meaning I don’t need to have, for example, an explicit function_header_with_parameters node in my AST. I can just build a function_call AST node with a parameters key that is empty if there are no parameters. This is/was the core of my question: Does my AST need to match the Khronos grammar 1:1?

My current belief is that the Khronos grammar is just fighting the limitations of not using a more expressive syntax, for example they can’t use ? to represent optional parameters, so they have to have this kind of gnarly production branching like function_call_or_method. So when I’m building an AST, I can interpret the semantic meaning (like “_with_parameters”) into an AST node that has an optional parameters key.

For any practical purpose, no. For validation, the only requirement is that the grammar accepts exactly the same set of input sequences as the grammar in the specification. For compilation, the only requirement is that the AST is in a form useful to the code generator. E.g. any rule which can’t be invoked recursively can have its body “inlined” wherever it appears (possibly at the cost of increasing the total number of rules).

It isn’t necessary for the AST to have a distinct node type for every non-terminal symbol. Infix expressions typically use a distinct non-terminal symbol for each combination of precedence level and associativity (left or right) but the AST could use a single node type to encompass application of an operator to its arguments, with the operator as a parameter (although distinct types for each operator is likely to simplify code generation in the case where the operators correspond to primitive instructions). If you merged the rules for infix expressions into a single “expression operator expression” rule, the grammar would still accept the same set of inputs but the AST wouldn’t reflect the correct order of evaluation, so you’d have to build the precedence rules into the code generator.