Language definition guide
=========================

Highlighting overview
---------------------

Programming language code consists of parts with different parsing rules: keywords like ``for`` or ``if``
don't make sense inside strings, strings may contain backslash-escaped symbols like ``\"``,
and comments usually don't contain anything interesting except the end of the comment.

In highlight.js such parts are called "modes".

Each mode consists of:

* starting condition
* ending condition
* list of contained sub-modes
* lexing rules and keywords
* …exotic stuff like another language inside a language

The parser's job is to look for modes and their keywords.
When it finds one, it wraps it in ``<span class="...">...</span>`` markup
and uses the name of the mode ("string", "comment", "number")
or a keyword group name ("keyword", "literal", "built-in") as the span's class name.

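The wrapping step can be pictured with a toy highlighter. This is only a sketch, not the real parser: ``toyHighlight``, ``KEYWORDS`` and the token regex are hypothetical names made up for the illustration, and the only "modes" it knows are double-quoted strings and three keywords.

```javascript
// Illustrative only: a toy highlighter, not highlight.js's real parser.
// KEYWORDS, wrap and toyHighlight are hypothetical names.
const KEYWORDS = new Set(['for', 'if', 'while']);

function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function wrap(className, text) {
  return '<span class="' + className + '">' + escapeHtml(text) + '</span>';
}

function toyHighlight(code) {
  // A string is tried before a word, so keywords inside a string
  // are never reported as keywords.
  const token = /"(?:\\.|[^"\\])*"|[a-zA-Z][a-zA-Z0-9_]*/g;
  let result = '';
  let last = 0;
  let match;
  while ((match = token.exec(code)) !== null) {
    result += escapeHtml(code.slice(last, match.index));
    const text = match[0];
    if (text[0] === '"') {
      result += wrap('string', text);
    } else if (KEYWORDS.has(text)) {
      result += wrap('keyword', text);
    } else {
      result += escapeHtml(text);
    }
    last = match.index + text.length;
  }
  return result + escapeHtml(code.slice(last));
}
```

For the input ``if x "for"`` this yields ``<span class="keyword">if</span> x <span class="string">"for"</span>``: the ``for`` inside the string is not marked as a keyword.
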
General syntax
--------------

A language definition is a JavaScript object describing the default parsing mode for the language.
This default mode contains sub-modes, which in turn contain other sub-modes, effectively making the language definition a tree of modes.

Here's an example:

::

  {
    case_insensitive: true, // language is case-insensitive
    keywords: 'for if while',
    contains: [
      {
        className: 'string',
        begin: '"', end: '"'
      },
      hljs.COMMENT(
        '/\\*', // begin
        '\\*/', // end
        {
          contains: [
            {
              className: 'doc', begin: '@\\w+'
            }
          ]
        }
      )
    ]
  }

Usually the default mode accounts for the majority of the code and describes all language keywords.
A notable exception is XML, where the default mode is just user text that doesn't contain any keywords,
and most of the interesting parsing happens inside tags.

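To make the "tree of modes" idea concrete, here is a small sketch that walks a definition and lists its named modes by nesting depth. ``listModes`` is a hypothetical helper, not part of highlight.js, and the comment mode is written by hand as a plain object (an assumption that stands in for the ``hljs.COMMENT`` call above) so that the snippet runs without the library.

```javascript
// Hypothetical helper: lists the class names in a definition,
// indented by nesting depth. Not part of highlight.js.
function listModes(mode, depth = 0, out = []) {
  for (const sub of mode.contains || []) {
    if (sub === 'self') continue; // a self-reference is not a new node
    out.push('  '.repeat(depth) + (sub.className || '(unnamed)'));
    listModes(sub, depth + 1, out);
  }
  return out;
}

// Hand-written stand-in for the example definition.
const definition = {
  case_insensitive: true,
  keywords: 'for if while',
  contains: [
    { className: 'string', begin: '"', end: '"' },
    {
      className: 'comment', begin: '/\\*', end: '\\*/',
      contains: [{ className: 'doc', begin: '@\\w+' }]
    }
  ]
};

console.log(listModes(definition).join('\n'));
```

The output lists ``string`` and ``comment`` at the top level, with ``doc`` indented under ``comment``.
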
Keywords
--------

In the simple case, language keywords are defined in a string, separated by spaces:

::

  {
    keywords: 'else for if while'
  }

Some languages have different kinds of "keywords" that might not be called such by the language spec
but are very close to them from the point of view of a syntax highlighter. These are all sorts of "literals", "built-ins", "symbols" and such.
To define such keyword groups, the ``keywords`` attribute becomes an object, each property of which defines its own group of keywords:

::

  {
    keywords: {
      keyword: 'else for if while',
      literal: 'false true null'
    }
  }

The group name then becomes a class name in the generated markup, enabling different styling for different kinds of keywords.

To detect keywords, highlight.js breaks the processed chunk of code into separate words, a process called lexing.
A "word" here is defined by the regexp ``[a-zA-Z][a-zA-Z0-9_]*``, which works for keywords in most languages.
Different lexing rules can be defined with the ``lexemes`` attribute:

::

  {
    lexemes: '-[a-z]+',
    keywords: '-import -export'
  }

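The effect of ``lexemes`` can be seen in a short sketch. ``findKeywords`` is a hypothetical helper that only imitates the lexing step described above; it is not a highlight.js API.

```javascript
// Sketch only: findKeywords is a hypothetical helper, not a highlight.js API.
// It splits code into "words" with the lexemes regexp and keeps the known ones.
function findKeywords(code, keywords, lexemes = '[a-zA-Z][a-zA-Z0-9_]*') {
  const known = new Set(keywords.split(' '));
  const words = code.match(new RegExp(lexemes, 'g')) || [];
  return words.filter((word) => known.has(word));
}

const source = '-import foo -export bar';

// Default lexing yields "import" and "export" without the dash,
// so neither matches the keyword list.
const withDefault = findKeywords(source, '-import -export');

// Custom lexing keeps the leading dash as part of the word.
const withCustom = findKeywords(source, '-import -export', '-[a-z]+');

console.log(withDefault, withCustom);
```

With the default rule no keywords are found, while the custom rule finds both ``-import`` and ``-export``.
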
Sub-modes
---------

Sub-modes are listed in the ``contains`` attribute:

::

  {
    keywords: '...',
    contains: [
      hljs.QUOTE_STRING_MODE,
      hljs.C_LINE_COMMENT,
      { ... custom mode definition ... }
    ]
  }

A mode can reference itself in the ``contains`` array by using the special keyword ``'self'``.
This is commonly used to define nested modes:

::

  {
    className: 'object',
    begin: '{', end: '}',
    contains: [hljs.QUOTE_STRING_MODE, 'self']
  }

Note: ``self`` may not be used in the root level ``contains`` of a language. The root level mode is special and may not be self-referential.

Comments
--------

To define custom comments, it is recommended to use the built-in helper function ``hljs.COMMENT`` instead of describing the mode directly, as it also defines a few default sub-modes that improve language detection and do other nice things.

Parameters for the function are:

::

  hljs.COMMENT(
    begin,      // begin regex
    end,        // end regex
    extra       // optional object with extra attributes to override defaults
                // (for example {relevance: 0})
  )

Markup generation
-----------------

Modes usually generate actual highlighting markup: ``<span>`` elements with specific class names that are defined by the ``className`` attribute:

::

  {
    contains: [
      {
        className: 'string',
        // ... other attributes
      },
      {
        className: 'number',
        // ...
      }
    ]
  }

Names are not required to be unique; it's quite common to have several definitions with the same name.
For example, many languages have various syntaxes for strings, comments, etc.

Sometimes modes are defined only to support specific parsing rules and aren't needed in the final markup.
A classic example is an escape sequence inside strings, which allows them to contain an ending quote.

::

  {
    className: 'string',
    begin: '"', end: '"',
    contains: [{begin: '\\\\.'}],
  }

For such modes the ``className`` attribute should be omitted so they won't generate excessive markup.

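Why the unnamed escape mode matters can be shown with a small scanner. This is a sketch under simplifying assumptions: ``findStringEnd`` is a hypothetical function, not how highlight.js is implemented, and it only handles double-quoted strings.

```javascript
// Hypothetical scanner: returns the index just past the closing quote of the
// string that starts at `start`, or -1 if the string never closes.
function findStringEnd(code, start, skipEscapes) {
  // With skipEscapes, a backslash plus the following character is consumed
  // as one unit, which is the role played by the {begin: '\\\\.'} sub-mode.
  const step = skipEscapes ? /\\.|"/g : /"/g;
  step.lastIndex = start + 1; // skip the opening quote
  let match;
  while ((match = step.exec(code)) !== null) {
    if (match[0] === '"') return match.index + 1;
  }
  return -1;
}

// The source text being scanned is: "say \"hi\"" + x
const code = '"say \\"hi\\"" + x';

console.log(code.slice(0, findStringEnd(code, 0, false))); // stops at the first \"
console.log(code.slice(0, findStringEnd(code, 0, true)));  // covers the whole string
```

Without the escape rule the string appears to end at index 7, right after the first escaped quote; with it, the scanner reaches the real closing quote and returns 12.
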
Mode attributes
---------------

Other useful attributes are defined in the :doc:`mode reference </reference>`.

.. _relevance:

Relevance
---------

Highlight.js tries to automatically detect the language of a code fragment.
The heuristic is essentially simple: it tries to highlight a fragment with all the language definitions,
and the one that yields the most specific modes and keywords wins. The job of a language definition
is to help this heuristic by hinting at the relative relevance (or irrelevance) of modes.

This is best illustrated by example. Python has special kinds of strings defined by prefix letters before the quotes:
``r"..."``, ``u"..."``. If a code fragment contains such strings, there is a good chance that it's in Python.
So these string modes are given high relevance:

::

  {
    className: 'string',
    begin: 'r"', end: '"',
    relevance: 10
  }

On the other hand, conventional strings in plain single or double quotes aren't specific to any language,
and it makes sense to bring their relevance to zero to lessen statistical noise:

::

  {
    className: 'string',
    begin: '"', end: '"',
    relevance: 0
  }

The default value for relevance is 1. When setting an explicit value, it's recommended to use either 10 or 0.

Keywords also influence relevance. Each of them usually has a relevance of 1, but there are some unique names
that aren't likely to be found outside of their languages, even in the form of variable names.
For example, just having ``reinterpret_cast`` somewhere in the code is a good indicator that we're looking at C++.
It's worth setting the relevance of such keywords a bit higher. This is done with a pipe:

::

  {
    keywords: 'for if reinterpret_cast|10'
  }

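The pipe syntax can be read as "keyword, then its relevance". The sketch below parses such a string and sums the relevance of the keywords found in a fragment. ``parseKeywords`` and ``keywordScore`` are hypothetical helpers, and the real scoring in highlight.js is more involved (it also counts modes, for instance), so treat this as an illustration of the idea only.

```javascript
// Hypothetical helpers; the real scoring in highlight.js is more involved.
function parseKeywords(spec) {
  const relevance = new Map();
  for (const entry of spec.split(' ')) {
    const [word, score] = entry.split('|');
    // A keyword without an explicit "|N" suffix gets the default relevance of 1.
    relevance.set(word, score === undefined ? 1 : Number(score));
  }
  return relevance;
}

function keywordScore(code, spec) {
  const relevance = parseKeywords(spec);
  const words = code.match(/[a-zA-Z][a-zA-Z0-9_]*/g) || [];
  return words.reduce((sum, word) => sum + (relevance.get(word) || 0), 0);
}

const spec = 'for if reinterpret_cast|10';
console.log(keywordScore('if (x) for (;;) {}', spec));
console.log(keywordScore('if (x) y = reinterpret_cast<int*>(p);', spec));
```

The first fragment scores 2 (``if`` and ``for``), while the second scores 11 because a single ``reinterpret_cast`` contributes 10.
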
Illegal symbols
---------------

Another way to improve language detection is to define illegal symbols for a mode.
For example, in Python the first line of a class definition (``class MyClass(object):``) cannot contain the symbol ``{`` or a newline.
The presence of these symbols clearly shows that the language is not Python, and the parser can drop this attempt early.

Illegal symbols are defined as a single regular expression:

::

  {
    className: 'class',
    illegal: '[${]' // matches a literal "$" or "{"; a newline would need an explicit \\n
  }

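A minimal sketch of the early-exit check, assuming the mode object from the example above. ``violatesMode`` is a hypothetical function, not the parser's actual code.

```javascript
// Hypothetical check, not the real parser: report whether text found inside
// a mode matches that mode's `illegal` pattern.
function violatesMode(mode, text) {
  return mode.illegal !== undefined && new RegExp(mode.illegal).test(text);
}

const classMode = { className: 'class', illegal: '[${]' };

console.log(violatesMode(classMode, 'class MyClass(object):')); // false
console.log(violatesMode(classMode, 'class MyClass {'));        // true
```

A detector built this way could abandon the Python attempt as soon as the second line is seen, instead of scoring the whole fragment.
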
Pre-defined modes and regular expressions
-----------------------------------------

Many languages share common modes and regular expressions. Such expressions are defined in core highlight.js code
at the end under "Common regexps" and "Common modes" titles. Use them when possible.

Regular Expression Features
---------------------------

The goal of Highlight.js is to support whatever regex features JavaScript itself supports. You're using real regular expressions, so use them responsibly. That said, due to the design of the parser, there are some caveats. These are addressed below.

Things we support now that we did not always:

* look-ahead regex matching for ``begin`` (#2135)
* look-ahead regex matching for ``end`` (#2237)
* look-ahead regex matching for ``illegal`` (#2135)
* back-references within your regex matches (#1897)
* look-behind matching (when JS supports it) for ``begin`` (#2135)

Things we currently know are still issues:

* look-behind matching (when JS supports it) for ``end`` matchers

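For readers less familiar with these features, here is what each looks like in plain JavaScript, independent of highlight.js. The patterns and variable names are made up for illustration; they are not taken from any shipped language definition.

```javascript
// Look-ahead: match a name only when "(" follows; the "(" is not consumed.
const callName = /[a-zA-Z_]\w*(?=\()/;

// Back-reference: \1 must repeat whichever quote character opened the string.
const quoted = /(['"]).*?\1/;

// Look-behind: match digits only when preceded by "$".
// Requires a JavaScript engine with look-behind support (ES2018 or later).
const price = /(?<=\$)\d+/;

console.log('foo(bar)'.match(callName)[0]);
console.log('say "it\'s" now'.match(quoted)[0]);
console.log('cost $42'.match(price)[0]);
```

The three matches are ``foo``, ``"it's"`` and ``42``. Patterns like the first two could be used as a ``begin`` value; per the list above, look-behind is only known to work for ``begin``, not for ``end``.
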
Contributing
------------

Follow the :doc:`contributor checklist </language-contribution>`.