LXR Language Description

The generic parser uses a very elementary algorithm to tokenise files. All that is needed is to break the file into homogeneous regions, such as strings, comments or code. Some of these regions undergo second-level processing to extract identifiers, which are then looked up in a dictionary.

Consequently, LXR does not need the full complexity of a real compiler's parser.

In the parser configuration file, every language description is a curly brace-enclosed, comma-separated list of key/value pairs:

{ LXR_language_name =>
    { 'langid'   => unique_numeric_id
    , 'identdef' => pattern
    , 'flags'    => [ list_of_flags ]
    , 'reserved' => [ list_of_keywords ]
    , 'include'  => { include_management }
    , 'spec'     => [ parser_configuration ]
    , 'typemap'  => { ectags_correspondance_table }
    }
}

The most important parameter is 'spec', which tells the parser how to break the file into regions. 'spec' must define the named regions as:

'spec' => { region_name => [ start_pattern, end_pattern, escape_pattern ] , region_name => [ ... ] }

region_name is one of 'comment', 'string' or 'include'. Any chunk not captured by an explicitly defined region is considered 'code'.
start_pattern describes the sequence of characters for the beginning sentinel.
end_pattern describes the sequence of characters for the ending sentinel.
The optional escape_pattern is only necessary if some sequence of characters could be misinterpreted as the ending sentinel.

Examples:

/* ... */ comment in C

{ 'comment' => [ '/\*', '\*/' ] }

// comment in C++ (\$ stands for the end of line)

{ 'comment' => [ '//', '\$' ] }

String in C: we must stay inside the string when we meet escaped characters (otherwise we may detect the end of the string too early and get out of sync with the source).

Caution: this one is very tricky; read your Perl documentation thoroughly.

{ 'string' => [ '"', '"', '\\\\.' ] }
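To make the role of the three patterns concrete, here is a minimal sketch in Perl of how a region splitter could use them. The function split_regions and the flat [ name, start, end, escape ] table are assumptions made for this illustration; they are not LXR's actual parser code. Note that inside Perl single quotes '\\\\.' becomes the three characters \\. which, as a regular expression, match a backslash followed by any character.

```perl
use strict;
use warnings;

# Hypothetical region table: [ name, start, end, optional escape ].
# The patterns are the C examples above, quoted as in the configuration file.
my @spec = (
    [ 'comment', '/\*', '\*/' ],
    [ 'string',  '"',   '"', '\\\\.' ],
);

# Split $text into [ region_name, chunk ] pairs; anything outside an
# explicitly defined region is reported as 'code'.
sub split_regions {
    my ($text, @regions_spec) = @_;
    my @out;
    my $pos = 0;
    my $len = length $text;

    while ($pos < $len) {
        # Look for the earliest start sentinel among all region types.
        my ($best, $best_at, $best_end);
        for my $r (@regions_spec) {
            my $start = $r->[1];
            pos($text) = $pos;
            if ($text =~ /$start/g) {
                if (!defined $best_at || $-[0] < $best_at) {
                    ($best, $best_at, $best_end) = ($r, $-[0], $+[0]);
                }
            }
        }

        # No region starts any more: the rest of the text is code.
        if (!defined $best) {
            push @out, [ 'code', substr($text, $pos) ];
            last;
        }
        push @out, [ 'code', substr($text, $pos, $best_at - $pos) ]
            if $best_at > $pos;

        # Scan for the end sentinel. The escape pattern is tried first at
        # each position, so an escaped quote is skipped instead of being
        # taken as the end of the string.
        my ($name, undef, $end, $esc) = @$best;
        my $scan = defined $esc ? qr/(?:$esc)|($end)/s : qr/($end)/s;
        my $stop = $len;    # unterminated region: run to the end of the text
        pos($text) = $best_end;
        while ($text =~ /$scan/g) {
            if (defined $1) { $stop = $+[0]; last; }
        }
        push @out, [ $name, substr($text, $best_at, $stop - $best_at) ];
        $pos = $stop;
    }
    return @out;
}

my $src = 'int a; /* c "x" */ char *s = "a\"b"; end';
printf "%-8s|%s|\n", @$_ for split_regions($src, @spec);
```

On the sample line this yields five chunks (code, comment, code, string, code). The quotes inside the comment do not open a string, and the escaped quote stays inside the string region.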

The second important parameter is 'identdef', used inside code regions to find identifiers and keywords:

'identdef' => keyword_pattern

Example:

Catchall for many languages (covers identifiers and special C preprocessor keywords)

'identdef' => '([\w~]|\#\s*)[\w]*'
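As an illustration, the following Perl sketch applies this pattern to a small code fragment. The helper find_tokens is hypothetical and only shows what the pattern matches; it is not how LXR itself scans code regions.

```perl
use strict;
use warnings;

# Pattern copied from the example above.
my $identdef = '([\w~]|\#\s*)[\w]*';

# Return every token of a code region that matches 'identdef'.
# The outer parentheses capture the whole token; the pattern's own
# group only covers the first character (or the '#' prefix).
sub find_tokens {
    my ($code) = @_;
    my @tokens;
    while ($code =~ /($identdef)/g) {
        push @tokens, $1;
    }
    return @tokens;
}

my @found = find_tokens("#if DEBUG\nint count = total;\n#endif");
print join(', ', @found), "\n";
# prints: #if, DEBUG, int, count, total, #endif
```

Preprocessor keywords come out with their leading '#', which is why the C reserved list can contain entries such as '#if'.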

Next, you give the list of reserved keywords which will not be considered for lookup:

'reserved' => [ keyword_list ]

In some languages, keywords are case-insensitive. A single list is valid for any case variant if the flag 'case_insensitive' is provided:

'flags' => [ 'case_insensitive' ]

Example:

Part of C table

'reserved' => [ 'auto', 'break', 'case', 'char', 'const', '#if' ]
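The effect of 'case_insensitive' can be sketched in Perl as follows. The helper reserved_checker and the second language description (with 'begin' and 'end') are made up for this example; only the key names 'reserved' and 'flags' come from the description above.

```perl
use strict;
use warnings;

# Build a reserved-word test from a (hypothetical) language description.
sub reserved_checker {
    my ($lang) = @_;
    # Number of times the flag appears; non-zero means case-insensitive.
    my $ci = grep { $_ eq 'case_insensitive' } @{ $lang->{flags} || [] };
    my %reserved;
    for my $kw (@{ $lang->{reserved} || [] }) {
        # Store keywords in lower case when case does not matter.
        $reserved{ $ci ? lc($kw) : $kw } = 1;
    }
    return sub {
        my ($word) = @_;
        return exists($reserved{ $ci ? lc($word) : $word }) ? 1 : 0;
    };
}

# Case-sensitive list: part of the C table above.
my $is_c_reserved = reserved_checker(
    { 'reserved' => [ 'auto', 'break', 'case', 'char', 'const', '#if' ] }
);

# Invented case-insensitive language, for illustration only.
my $is_ci_reserved = reserved_checker(
    { 'flags' => [ 'case_insensitive' ], 'reserved' => [ 'begin', 'end' ] }
);

print $is_c_reserved->('break'),  "\n";   # 1
print $is_c_reserved->('Break'),  "\n";   # 0
print $is_ci_reserved->('BEGIN'), "\n";   # 1
```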

Finally, you give a plain-text explanation of the ctags flags so that the cross-reference listings can label identifiers with human-readable descriptions. Refer to the ctags man page for the complete list applicable to a given language.

'typemap' => { letter => category , letter => category }

Example:

Part of C table

'typemap' => { 'c' => 'class' , 'd' => 'macro (un)definition' , 'f' => 'function definition' , 'v' => 'variable definition' }

If the language can "import" sub-files (don't worry about C/C++, their rules are built in), you give LXR rules to transform the language-form file designation into an OS-form file reference, so that it can attach a clickable link to the sub-file:

'include' =>
{ 'directive' => pattern
, 'separator' => string
, 'pre'       => [ target, replacement ]
, 'global'    => [ target, replacement ]
, 'post'      => [ target, replacement ]
}

Prior to release 1.2, 'pre' and 'post' were named 'first' and 'last' respectively. 'separator' appeared in 1.2.

'directive' defines a regular expression that splits the statement into 5 components, namely:

statement keyword(s) or prefix,
spacer,
left delimiter,
file name,
right delimiter.

Example for Perl use or require:

'directive' => '([\w]+)(\s+)()([\w:]+\b)()' # no delimiters in Perl but regexp MUST define 5 components
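The five components can be seen by applying the pattern directly. The following Perl sketch uses a hypothetical helper, split_directive, and invented hash key names; it only demonstrates the captures and is not LXR's own code.

```perl
use strict;
use warnings;

# Pattern copied from the Perl example above.
my $directive = '([\w]+)(\s+)()([\w:]+\b)()';

# Return the five components as a hash reference, or nothing when the
# statement does not match. The key names are chosen for this example.
sub split_directive {
    my ($stmt) = @_;
    return unless $stmt =~ /$directive/;
    my %parts = (
        keyword => $1,   # statement keyword(s) or prefix
        spacer  => $2,
        ldelim  => $3,   # empty for Perl
        file    => $4,
        rdelim  => $5,   # empty for Perl
    );
    return \%parts;
}

my $parts = split_directive('use File::Basename;');
print "$parts->{keyword} -> $parts->{file}\n";   # use -> File::Basename
```

The two empty groups () are what keep the group count at five, so the file name is always found in the fourth capture.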

'separator' optionally defines the language-specific path separator in filenames. It is replaced by the OS separator before trying to access the file.

'pre', 'global' and 'post' are optional substitution rules (target is a pattern and replacement is substituted in case of a match). 'pre' is applied only once at the beginning. 'global' is then repeatedly applied until there is no more match. 'separator' is replaced by the OS separator. 'post' is applied only once after the other rules.

Example for Perl use or require:

, 'separator' => '::'       # Perl package separator
, 'post' => [ '$', '.pm' ]  # Add Perl OS-extension at the end of the filename

It is more efficient than the equivalent:

, 'global' => [ '::', '/' ]  # Repeatedly replace :: Perl delimiter by / OS delimiter
, 'post' => [ '$', '.pm' ]   # Add Perl OS-extension at the end of the filename
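The order in which the rules are applied can be sketched in Perl as below. This is an illustration under stated assumptions, not LXR's implementation: the function include_to_path is hypothetical, the OS separator is taken to be '/', and each replacement is treated as a literal string.

```perl
use strict;
use warnings;

# Turn a language-form file name into an OS-form path, following the
# order described above: 'pre' once, 'global' until no more match,
# 'separator' replacement, then 'post' once.
sub include_to_path {
    my ($name, $rules) = @_;

    if (my $pre = $rules->{pre}) {
        my ($target, $repl) = @$pre;
        $name =~ s/$target/$repl/;
    }
    if (my $global = $rules->{global}) {
        my ($target, $repl) = @$global;
        # Loops for ever if the replacement reintroduces the target.
        1 while $name =~ s/$target/$repl/;
    }
    my $sep = $rules->{separator};
    if (defined $sep && length $sep) {
        # The separator is a plain string, so quote it; '/' is assumed
        # to be the OS separator.
        $name =~ s{\Q$sep\E}{/}g;
    }
    if (my $post = $rules->{post}) {
        my ($target, $repl) = @$post;
        $name =~ s/$target/$repl/;
    }
    return $name;
}

my %with_separator = ( 'separator' => '::', 'post' => [ '$', '.pm' ] );
my %with_global    = ( 'global' => [ '::', '/' ], 'post' => [ '$', '.pm' ] );

print include_to_path('File::Basename', \%with_separator), "\n";
print include_to_path('File::Basename', \%with_global),    "\n";
# both print: File/Basename.pm
```

In this sketch the 'separator' form does its work in a single global substitution, whereas the 'global' form restarts the match after every replacement, which is consistent with the efficiency remark above.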

In release 1.0 and higher, include rules are built into the parser for C/C++, Perl, Python and Ruby. Java and Make have their own parsers starting with release 1.1. These parsers no longer use 'include'.