Monday, April 26, 2021

My Own Syntax Highlighter

Let me start off by saying that there are a lot of great syntax highlighters out there, and we certainly don't need another one. That being said, I did create one, for no other reason than that I wanted to see what was involved. It is not a complete highlighter, in that not all languages are covered. I only covered the languages I use most often: HTML, PHP, CSS, JavaScript, Delphi and XML. You can try out a demo of the highlighter by clicking on the link below.

Syntax Highlighter Demo

The code shown in this post was generated by this syntax highlighter.

I started off working on the HTML and decided to go the old C++ route and parse the script one character at a time. This worked out quite well, as it was easy to identify the tags, tag names and attributes without having to iterate through a list of all the HTML tags and attributes. It also works well for XML, where there aren't any standard tag names.

First we check if the script is HTML or XML and load the appropriate configuration file. This file essentially sets the coloring of the script.

/* start of process for HTML and XML */
if($_POST['script']=="html" || $_POST['script']=="xml") {
  if($_POST['script']=="html") include('configHTML.php');
  if($_POST['script']=="xml") include('configXML.php');
  /* ... character-by-character parsing follows here ... */
}

Then we analyze each character one at a time using a switch statement. Here is an example of how a space character is handled. First, are we inside a tag? If not, then just record the character and move on. If we are inside a tag, has the tag name been set yet? If it has not, the space ends the tag name, so close its span and check whether an attribute is coming up. If the tag name has already been set, check whether we are in a comment, script or quote and adjust accordingly.

case ' ':
  if($isTag) {
    if(!$isTagged) { /* tag name not color coded yet */
      $colorCode = $colorCode . '</span> ';
      $isTagged = true;
      if(ctype_alnum($rawCode[$x+1])) { /* if next character is digit or letter then set attribute flag */
        $colorCode = $colorCode . ' <span style="color:'.$SCRIPT['attributecolor'].'">';
        $isAttr = true;
      }
    } else { /* tag name has already been color coded so this is further into tag */
      if($isComment || $isScript) {
        $colorCode = $colorCode . $thisChar;
      } else {
        if($isQuote && $quoteChar=='') {
          $colorCode = $colorCode . '</span> ';
          $isQuote = false;
        } else {
          $colorCode = $colorCode . $thisChar;
        }
        if($isAttr) { /* if this is an attribute then close attribute flag and adjust in next condition */
          $colorCode = $colorCode . '</span>';
          $isAttr = false;
        }
        if(!$isAttr) {
          if(ctype_alnum($rawCode[$x+1])) { /* if next character is digit or letter then set attribute flag */
            $colorCode = $colorCode . ' <span style="color:'.$SCRIPT['attributecolor'].'">';
            $isAttr = true;
          } else {
            $colorCode = $colorCode . $thisChar;
          }
        }
      }
    }
  } else {
    /* check if previous character was space and use an HTML entity to keep the extra space */
    if($rawCode[$x-1] == ' ') {
      $colorCode = $colorCode . '&nbsp;';
    } else {
      $colorCode = $colorCode . $thisChar;
    }
  }
  break;

And this goes on and on for the various characters that can be encountered in HTML or XML. See the source code for the complete script. It's fairly easy to figure out.
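To give a feel for the overall shape of the character-by-character approach, here is a much smaller sketch. It is not the highlighter's actual code: the function name `highlightTags`, the single `$state` variable (instead of separate flags) and the two colors are all assumptions made for illustration, and it ignores comments and embedded scripts entirely.

```php
<?php
// Hypothetical, simplified sketch: colors only tag names and attribute names.
function highlightTags(string $rawCode): string
{
    $tagColor  = '#000080'; // assumed color for tag names
    $attrColor = '#008000'; // assumed color for attribute names
    $escape    = ['&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;'];

    $colorCode = '';
    $state     = 'text';   // text, tagname, intag, attr or quote
    $quoteChar = '';
    $len       = strlen($rawCode);

    for ($x = 0; $x < $len; $x++) {
        $thisChar = $rawCode[$x];
        $esc      = strtr($thisChar, $escape); // HTML-safe version of the character

        switch ($state) {
            case 'text':
                if ($thisChar === '<') {
                    $colorCode .= '&lt;<span style="color:' . $tagColor . '">';
                    $state = 'tagname';
                } else {
                    $colorCode .= $esc;
                }
                break;

            case 'tagname':
                if ($thisChar === ' ') {          // a space ends the tag name
                    $colorCode .= '</span> ';
                    $state = 'intag';
                } elseif ($thisChar === '>') {    // tag with no attributes
                    $colorCode .= '</span>&gt;';
                    $state = 'text';
                } else {
                    $colorCode .= $esc;
                }
                break;

            case 'intag':
                if ($thisChar === '>') {
                    $colorCode .= '&gt;';
                    $state = 'text';
                } elseif ($thisChar === '"' || $thisChar === "'") {
                    $quoteChar = $thisChar;
                    $colorCode .= $esc;
                    $state = 'quote';
                } elseif (ctype_alnum($thisChar)) { // start of an attribute name
                    $colorCode .= '<span style="color:' . $attrColor . '">' . $esc;
                    $state = 'attr';
                } else {
                    $colorCode .= $esc;
                }
                break;

            case 'attr':
                if (ctype_alnum($thisChar) || $thisChar === '-') {
                    $colorCode .= $esc;
                } else {
                    $colorCode .= '</span>';
                    $state = 'intag';
                    $x--; // look at this character again in the "intag" state
                }
                break;

            case 'quote':
                $colorCode .= $esc;
                if ($thisChar === $quoteChar) {
                    $state = 'intag';
                }
                break;
        }
    }

    // Close a span left open by input that stops in the middle of a tag.
    if ($state === 'tagname' || $state === 'attr') {
        $colorCode .= '</span>';
    }
    return $colorCode;
}

echo highlightTags('<a href="x">hi</a>');
```

Because nothing here depends on a list of known tag names, the same loop handles XML without changes, which is the main attraction of this route.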

When tackling the other scripting languages, I decided to use regular expressions. I imagine this approach would also work well with HTML and XML, but I had already figured those out another way. A really good explanation of how to tackle syntax highlighting can be found at the PhobosLab website. The approach is the same for all the languages; only the configuration file changes. The code is also quite small. Except for the configuration file and a few helper functions, the listing further down is the entire code that handles the highlighting.

Special attention was needed for comments and quotes. As comments or quotes could contain script keywords, we need to ensure these words don't get highlighted twice. This can be accomplished by either stripping nested highlighting (span tags), or by stripping the comments and quotes from the text to be further highlighted, and then putting them back in after we are all finished. I elected to use the second approach.
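A minimal sketch of that second approach might look like the following. The function names, the colors, the keyword list and the single combined pattern are assumptions made for illustration; only the placeholder shape (`##`, fourteen letters or digits, `##`) is taken from the real code. It also assumes the input contains no HTML that needs escaping.

```php
<?php
// Hypothetical sketch: swap comments and quoted strings for placeholders,
// highlight what is left, then put the stored pieces back.
function stashCommentsAndQuotes(string $code, array &$tokens): string
{
    // One pattern for both, so whichever starts first in the text wins
    // (a "//" inside a string stays part of the string).
    $pattern = '~//[^\n]*|"[^"\n]*"~';

    return preg_replace_callback($pattern, function (array $m) use (&$tokens) {
        $color = ($m[0][0] === '/') ? '#808080' : '#a31515'; // comment or string
        // Placeholder of the form ##00000000000000##.
        $key = '##' . str_pad((string) count($tokens), 14, '0', STR_PAD_LEFT) . '##';
        $tokens[$key] = '<span style="color:' . $color . '">' . $m[0] . '</span>';
        return $key;
    }, $code);
}

function highlightSketch(string $code): string
{
    $tokens = [];
    $code = stashCommentsAndQuotes($code, $tokens);

    // Keywords are only highlighted in what remains after the stash step.
    $code = preg_replace('~\b(if|else|while)\b~', '<span style="color:#0000ff">$1</span>', $code);

    // Put the already-colored comments and strings back.
    return $tokens ? strtr($code, $tokens) : $code;
}

echo highlightSketch('if (x) y = "if"; // if only');
```

In this example only the first `if` is treated as a keyword; the `if` inside the string and the one inside the comment are hidden behind placeholders while the keyword pattern runs.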

if($_POST['script']=="php") include('configPHP.php');
if($_POST['script']=="css") include('configCSS.php');
if($_POST['script']=="jscript") include('configJSCRIPT.php');
if($_POST['script']=="delphi") include('configDELPHI.php');

$colorCode = $rawCode;

/* replace brackets on any html tags in code */
$colorCode = str_replace('<','&lt;',$colorCode);
$colorCode = str_replace('>','&gt;',$colorCode);

/* do highlighting as per regexList */
$colorCode = preg_replace( array_keys($regexList), array_values($regexList), $colorCode );
/* paste comments and quotes back in... loop to get nested comments and quotes */
while (preg_match("/##[a-zA-Z0-9]{14}##/",$colorCode)) {
  $colorCode = str_replace( array_keys($tokens), array_values($tokens), $colorCode);
}

/* replace tabs and double spaces with character code */
$tab = '';
for($t=0;$t<$tabspaces;$t++) { $tab = $tab . '&nbsp;'; }
$colorCode = str_replace(chr(0x09),$tab,$colorCode);
$colorCode = str_replace('  ','&nbsp;&nbsp;',$colorCode);

/* change newlines to <br> and remove any nested span tags */
$colorCode = nl2br($colorCode);
$colorCode = stripNestedSpan($colorCode);

$colorCode = numberLines($colorCode);

/* add default color and copyright */
$colorCode = '<code><span style="font: normal 0.9em consolas, \'trebuchet ms\', arial, helvetica, sans-serif; color:'.$COLOR['default'].'">' . $copyright . $colorCode . "</span></code>";
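The `$regexList` array used above comes from the configuration file, which is not shown here. As a rough idea of its shape, here is a hypothetical stand-in: the patterns, the `keyword` and `number` keys and the colors are assumptions, not the contents of the real config files. Each key is a regular expression and each value is the replacement that wraps the match in a colored span.

```php
<?php
// Hypothetical stand-in for a config file such as configPHP.php.
$COLOR = ['keyword' => 'blue', 'number' => 'teal'];

$regexList = [
    // whole numbers
    '/\b\d+\b/' => '<span style="color:' . $COLOR['number'] . '">$0</span>',
    // a few keywords
    '/\b(function|return|if|else)\b/' => '<span style="color:' . $COLOR['keyword'] . '">$1</span>',
];

$rawCode   = 'if (x) return 42;';
$colorCode = preg_replace(array_keys($regexList), array_values($regexList), $rawCode);

echo $colorCode;
```

Note that `preg_replace` applies the patterns one after another, so a later pattern also sees the span markup added by an earlier one. The named colors in this sketch were chosen so that no pattern can match inside that markup; with a longer list this gets harder to guarantee, which is presumably where a cleanup step for nested spans earns its keep.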

There are a few other helper functions not shown here. Again, check out the source code if you're interested. The highlighter is not perfect, but it works as expected about 98% of the time. I'm sure some of the regular expressions can be improved, and if anyone has any suggestions for improvement, please pass them along. As I said at the beginning, this was simply a self-learning project and was not intended as a finished product.

