English Syntax Highlighting (evanhahn.github.io)
248 points by azdle on March 16, 2016 | 105 comments



It would be interesting to see the major parts of speech (nouns, verbs, adjectives) colored. Instead this is a coloring of fairly random words. A bunch of short words are grey, but they don't belong to any particular part of speech. They include some articles, prepositions, conjunctions and a few verbs...


Those are stop words. You usually remove them from a text before analysing word frequencies because they are so common to all English text that they don't tell you anything specific about the current text.
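
For example, a minimal sketch of that preprocessing step in Python with NLTK (assuming nltk plus its stopwords and punkt data are installed):

    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    STOP = set(stopwords.words("english"))

    def content_word_frequencies(text):
        # Drop stop words before counting, so "the"/"of"/"and"
        # don't dominate the frequency table.
        tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
        return Counter(t for t in tokens if t not in STOP)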

I don't know what the rationale for greying them out is, but that's the category.


Sure, but "stop words" is a computing concept, as opposed to "parts of speech", which is a linguistic one. I think that's related to his point.

https://en.wikipedia.org/wiki/Stop_words

https://en.wikipedia.org/wiki/Part_of_speech


I think we'd need to solve computational linguistics before a completely accurate parser could tag the words properly. That said, the current state of the science works for something like 80% of cases (the "easy" ones).


Sorry, but POS tagging is pretty much solved. It's already at 97+% [1]. Current papers are now mostly improving it by less than one percent.

1. http://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf


It's true that POS tagging works fairly well. But consider that a sentence involves more than one word. Even at 97% accuracy per word, the probability of correctly tagging every word in a short sentence of only ten words is just 0.97^10 ≈ 0.74. And sentences are generally longer than ten words.

And as POS tagging is usually only done as preprocessing for some other task, like syntactically parsing a text (which itself is usually preprocessing for yet another task), 97% accuracy per word is not as good as it sounds. Parsers have to work with wrong data for every second or third sentence.
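
A quick back-of-the-envelope check of that compounding effect (a sketch, assuming errors are independent across tokens):

    # Probability that an entire sentence is tagged correctly,
    # given a fixed per-token accuracy and independent errors.
    per_token = 0.97
    for length in (10, 20, 30):
        print(length, round(per_token ** length, 2))
    # 10 -> 0.74, 20 -> 0.54, 30 -> 0.4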


Indeed: the first paragraph of the linked paper says "Current good taggers have sentence accuracies around 55–57%".

(This surprises me. I would expect accuracy for different words in a sentence to be correlated: you either make no errors or several.)


Whoops, I didn't even look into the paper. Kind of makes my comment superfluous …


For the record, 97% on canonical test datasets with little recent progress doesn't mean that it's a solved problem. Admittedly, part of the problem is that elementary school POS categories aren't a great model of natural language.

More generally, for true syntax highlighting I think you do need the parse tree, and parsers (as opposed to taggers) definitely aren't at 97%.


They definitely aren't at 97%, but they aren't that bad either. For English they are at around 92% (see for example http://arxiv.org/pdf/1603.04351.pdf) in labelled attachment score (right head + right label).

If you are interested only in the syntactic tag, not in the structure of the tree, the number is somewhat higher.
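
For a feel of what those parsers produce, a minimal sketch with spaCy (assumed installed along with an English model; this is not the parser from the linked paper):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # any installed English model works here
    doc = nlp("Rome destroyed Carthage.")
    for token in doc:
        # Labelled attachment = right head *and* right dependency label.
        print(token.text, token.dep_, token.head.text)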


Yes, you can get 97-98%, but only when evaluating on data from the same corpus as you trained on. If you evaluate on data from a different corpus, you immediately get a pretty big drop in performance. Thus one person in the field I've talked to even went so far as to say that competing in this part of the field (state-of-the-art performance, basically) is fundamentally a question of "who is the best at overfitting".

There's basically no part of NLP that's a solved problem. Even something as superficially simple as segmenting running text into sentences and tokens is decidedly non-trivial.
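
A classic illustration of the tokenization point: naive period-splitting trips over abbreviations, which is why sentence segmenters are themselves trained models. A sketch (assuming NLTK's punkt data is downloaded):

    import re
    import nltk

    text = "Dr. Smith arrived at 5 p.m. on Monday. He left on Tuesday."
    # Naive splitting mangles the abbreviations:
    print(re.split(r"(?<=\.)\s+", text))
    # A trained sentence tokenizer usually gets this right:
    print(nltk.sent_tokenize(text))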


To be fair, I'm not a computational linguist (some of my friends did their PhDs in the field though). From what I remember, one of the most glaring issues in the field is that the most used corpus is a bunch of issues of the Wall Street Journal (which is a very specific data set).

The 80% figure was quoted from some talk I heard a few years ago, so I concede it's almost certainly improved since then.


On the other hand, 97% sounds impressive, but it would also mean, on average, slightly less than one error in your post (and more than one in this one).

Graded as a school exercise, I think 97% wouldn't be that good.


I think a naive approach would do quite well. We're not looking to comprehend the sentence entirely, just to reach fairly good accuracy, so that the reader usually only needs to focus on the highlighted words. For example, serifed, bolded verbs and nouns would retain most of the information, while filler words like "the" and "and" might not be necessary.

The resulting highlighted parts remind me a lot of Chinese, where essentially every word is "important" and there are few filler words, and hence there is often a lot of contextual information.


Looks like http://parts-of-speech.info/ does a pretty good job at detection; too bad the styling is so garish.


One site I use to improve my prose is

http://hemingwayapp.com/

not quite what you're proposing, but useful nonetheless



I thought of that article too, and while this highlighter isn't quite as colourful as that example, it still felt more distracting to read than monochrome text.


You can stop spamming that blog now. The author has a wrong understanding of what syntax highlighting tries to achieve, and because of that he comes to questionable conclusions.

It's really just a way to get people to visit his blog by making a bold statement.


Why is that spamming? Just because you don't share the writer's opinion doesn't mean it doesn't fit here, right?


There's a cool app on iOS that does this: iA Writer. I would love to know how they go about doing it.


In German (and other languages) you capitalize every noun. When I was younger I found it confusing that English didn't do that. It seems that this kind of syntax highlighting allows your brain to read text a little bit faster, since it instantly knows that the word will be a noun. It's also a little annoying when people write German and don't properly capitalize: your brain just doesn't expect the word to be a noun if it's not capitalized.

* http://www.ruediger-weingarten.de/Texte/Capitalization.pdf (pg 4ff, last paragraph)

* https://mindmodeling.org/cogsci2013/papers/0462/paper0462.pd...

* http://linguistics.stackexchange.com/questions/699/does-capi...


English used to do this. You can still see evidence of it in things like the US Constitution, where the capitalization seems random until you remember English is a Germanic-family language and did this as recently as a couple hundred years ago:

We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America

I.

All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.

The House of Representatives shall be composed of Members chosen every second Year by the People of the several States, and the Electors in each State shall have the Qualifications requisite for Electors of the most numerous Branch of the State Legislature.

etc.


Why is "defence" not capitalized then? A typo?


"common defense" is being used as an elision of "common defense of the People". The entire phrase is a noun phrase. But " defense" is a verb within the noun phrase.


That was bothering me too.

At least it's not a typo by ubernostrum; it's spelled "defence" in the original [1].

Moreover, that's the British spelling of "defense" (is there a link with the lack of a capital?).

Also note that, instead of "Blessings", it is "Bleſsings" in the original. But they are roughly equivalent if you apply a compatibility decomposition in your Unicode normalization (NFKD/NFKC).

So where is the edit button for that constitution? Or do they only accept pull requests?

[1] http://www.archives.gov/exhibits/charters/charters_downloads...


Someone asked the same question here

https://www.quora.com/Why-is-the-word-defence-the-only-uncap...

but at least in my browser, I can't see the actual answer (it says "2 Answers" but I can only see one, which just confirms that the word is not capitalized).


Maybe defense is considered active in a way that Welfare and Tranquility are not?


In primary school, I was taught to capitalise nouns in titles.

So, an essay title might be "The Man and his Dog" rather than "The man and his dog".


But that rule doesn't apply only to nouns: it applies to all words that aren't pronouns, conjunctions and articles. Hence, "To Kill a Mockingbird" ('kill' is a verb), "Malone Dies," etc.


There isn't one uniform rule for headings: https://en.wikipedia.org/wiki/Letter_case#Headings_and_publi...


That rule (the one I learned in primary school) does apply only to nouns. I'm not suggesting my primary school teacher's rule is the correct, most popular or best rule :)

Personally, though, I find many US newspapers' headlines jarring due to excessive capitalisation.


Other languages also have more complex conjugation and declension rules that add additional structure. (And for writing systems, cf. the use of hiragana in Japanese for things like particles and verb endings.)


Syntax highlighting traditionally chooses a different color for each token in this Java statement:

    final String id = leader(NAMES_AND_SCORES);
If I try to translate this statement into English:

    Given a global list of names and scores, determine the leader's 
    id. (Ensure that id is a string of characters.) I'll use "id" to
    refer to that leader throughout this paragraph.
If our traditional highlighting approach is generally correct for the code, shouldn't I be highlighting each sentence and/or phrase wholly with one color, rather than highlighting per part of speech?

Or in other words, does the analogy being proposed really hold?

Or another take: speed readers take in whole sentences at a time. Colorizing parts of speech this way would only seem to slow them down, whereas syntax highlighting code speeds my reading. I'm sure there's an analysis here; final, String, leader() are not parts of speech; each is a separate semantic statement.


We don't highlight each line of code separately, which would be the analogue to highlighting natural language at the sentence level. We highlight tokens based on their syntactic type. Strings are all colored the same. Operators are colored the same. That's pretty much the same idea as coloring all common parts of speech the same.

Parsing is the act of taking a linear string of tokens and building a tree out of them. That means reading in a string of tokens and applying the parsing rules (which may be encoded as a set of fuzzy correlations when humans learn those rules). When the rules are not solidly codified or slow to apply due to unfamiliarity, it helps to have hints to orient/validate yourself.

You do this parsing routine with your own natural language, too. You're just much more comfortable doing so and do not need hinting on what each word's role is. Just like a lot of old-school unix guys of lore are more comfortable reading code without the spectra of colors we commonly apply today. I could see natural language syntax highlighting being very useful for language learners, though. Color is used in Chinese language learning to indicate tonalities for learners, since most have no native/intuitive way to transcribe the pitch contours. I'm not convinced that the syntax highlighting presented in the article is really what you'd want, but I'm interested in the direction it's headed.

As an aside, speed readers don't take in a whole sentence at a time. An entire sentence simply doesn't fit within your fovea, but they have optimized their eye tracking to boost their speed. I do imagine that having lots of colors would disrupt and distract from the text and harm their speed / comprehension, but it may be possible that a different highlighting scheme could work for them.


I think you make a wonderful comment here, but your analogy doesn't hold, I'd say. While I applaud the effort going into this, I don't think this works for me. I read "code" for humans completely differently than I do code meant to describe computation (NLS text versus programs, respectively). IOW, there can be no analogy for me.

I think it stems from the way we read Sherlock Holmes as an experience whereas we read a program as an explanation. You cannot substitute an explanation for an experience.


> does the analogy being proposed really hold?

It's not an analogy, it IS syntax highlighting. You just encode a lot of semantics into syntax in Java, so syntax highlighting is more useful for determining the semantics.

I don't think the OP was claiming this is useful for English in the way syntax highlighting is useful for Java.


Syntax highlighters for natural language would help against garden path sentences:

For example:

    The old man the boat.
... is not ambiguous if written as:

    {SUBJECT}[The old] {VERB}man {OBJECT}[the boat]
I think the reason syntax highlighting is important in programming is that garden-path-style sentences are more common with the pedantically strict grammars that programming languages require.
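
Out of curiosity, you can check what an off-the-shelf tagger makes of that sentence (a sketch with NLTK, assuming its tagger models are installed; taggers frequently mis-tag this very example):

    import nltk

    tokens = nltk.word_tokenize("The old man the boat.")
    # The correct reading has "old" as a noun ("the old") and
    # "man" as a verb; many taggers produce adjective + noun instead.
    print(nltk.pos_tag(tokens))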


The equivalent of garden-path sentences in formal languages would be shift-reduce conflicts, and most pedantically strict grammars are specifically designed to avoid that kind of problem, so that they can be parsed efficiently.


Of course the compiler has no problem understanding them, but humans aren't that good at parsing.

The purpose of a syntax highlighter is so that the human knows that the computer agrees on what the sentence/code structure is.


So you're saying English would be better as a statically typed language /s


Let's have different colors for dialog spoken by different characters, such as Holmes and Watson. That way I won't have to reread the text to figure out who said what.


That would be nice. With long conversations in novels, it's often really hard to tell who is speaking. There's a tendency to leave the speaker implied rather than explicit in long conversations, and color would be a good solution. I wonder if this is ever used in cinema and/or theater.


It's not common, but some anime fansub groups make the subtitle text match the hair color of the character that is speaking.


This would be fantastic for reading to the kids. At the moment I have to scan ahead to work out which voice to put on.


"Even when reading a novel myself, it bothers me when I don't learn the speaker until after the I have read the quoted phrase, like in this comment" said WillAbides


This is actually one of the big issues with RSVP techniques like Spritz. When you can't see the paragraph structure, it can be very difficult to track who's talking.


I was taught that the protocol is for the speakers to alternate if no other information is given in the lines. In all other cases you need to restart the speech block with something like "Watson continued, "..."".


Sure, but sometimes I get a little lost - my eyes wandered away from the page, I got distracted and started thinking about something else, or there were too many back-and-forths for me to keep it straight.


We generally add periods and commas to emphasize parts of the text, to slow down, to stop the reading... Try reading the text out loud without the syntax highlighting, then with it. See? It doesn't work: you're going to emphasize and stop on highlighted words. What would have worked is to highlight everything and split the highlighting according to punctuation, not words.

As someone else pointed out, different colors for different protagonists talking would be neat as well. But there is not much you can do without deforming the original intent of the writer.

Another thing that could be interesting would be to give these tools to the writer instead of highlighting the text automatically. But then, as someone else pointed out, these tools already exist and are rarely used because of the noise they add (bold, italic, underline). There are also quotes, quads, uppercase... There are many ways to help the reader follow the text.


Sorry, but when we speak of "syntax highlighting" for natural language, what is the actual goal?

Are we trying to color verbs, nouns, adverbs and prepositions differently, so that the "goal" is to properly decide whether "lie" is a verb or a noun?

Or are we trying to colour the subject, verb, object and other elements of the sentence, so that "Rome destroyed Carthago" will have a different color for Rome than the sentence "Hannibal tried to destroy Rome"?

In general, code has "reserved words" and the rest is either a "name" (variables, constants, literals) or a ... Well... "Verb". Like functions, procedures, methods.

In some rare cases (function pointers, closures) you have "verbs" that can be used as "nouns", but you completely lack concepts like dative, accusative and so on.

I think that this really breaks down as an analogy when you try to adapt syntax parsing to natural language.


I can tell you right away that it would be of huge help when reading a new language, especially in a new script. And who is the biggest set of users that might benefit from this? (Clue: they haven't said their first words yet.)


Which one? Sorting out nouns, verbs, adjectives etc., or finding out what "role" each word is playing? (Sorry, English is not my language, so I do not know the proper technical terms for this... In my culture the first is called "grammatical analysis" while the latter is called "logical analysis".)


Syntax highlighting works great for code, because code itself is highly structured, mostly normalized data (hence BNF being a thing). Even though we call them "languages", they're much closer to spreadsheets than natural languages. That's why it's so useful. Code isn't meant to be read from left to right, top to bottom. There's a lot of back-and-forth skimming to understand what the code is doing. Natural languages are meant to be read from beginning to end. We do skim, but it's almost entirely contextual, and only slightly syntactic. They're just not alike enough for things like this to make sense.


As an armchair linguist, I get frustrated by how computer nerd types keep comparing programming languages with natural languages. The two have very little in common other than some superficial similarities. The way each is acquired, used, and evolved is very different from the other. One of the worst examples of this bad analogy is sigils in Perl, which Larry insists are supposed to be good because they mirror plurals in English.


The analysis itself could be IMMENSELY useful in things like typesetting. There's a whole assortment of stylistic rules based on rather subtle things (e.g. the space between initials is usually slightly smaller) and they're rather hard to automate. If it were possible to parse and tag text like that, it may give way to advanced typesetting algorithms.

By the way, syntax highlighting works for code only as long as you see the same color scheme. If the scheme changes, the benefit is lost.


This is interesting. I wonder if any publishing <font color="blue">house</font> would print a book with this idea?


Putting “house” in blue reminds me of House of Leaves, a book that plays with formatting & typography in unusual ways…

very minor spoilers

…as it descends into madness (blank pages, words in spirals, backwards characters, single-character pages, overlaid paragraphs…). The very first unusual formatting in the book, and spit-take surprising to me as I wasn't expecting anything unusual at all, was simply printing the word “house” in blue. A fun read, thanks for reminding me of it!

https://en.wikipedia.org/wiki/House_of_Leaves#Colors


Yes, that was the intent.


My startup [1], which also uses color in order to increase text readability, is launching a pilot program with an on-demand textbook publisher. But your point is well-taken — it has been a long slog to find our first couple publishing partners, even with solid independent research showing benefits for students and other readers.

1: http://www.BeeLineReader.com


It's a cute idea, but it feels distracting. Maybe a nicer color scheme would work better.


To be fair, I've started to feel the same way about syntax highlighting in code. I've slowly moved to only highlighting two things: string literals and comments.


Nowadays I'm using a theme in Emacs where everything is black on white; keywords and comments are bold, whereas code is regular weight. I'm happier with this than with syntax colouring. I've also removed some colour from Org: headlines are bold and black. Again, I guess the fewer the colours the better here. I do use colours with parens: they're pale by default, then highlighted with highlight-parenthesis mode, denoting nesting via tones of red.

I'm even doing some HTML-CSS-JS-m4 work for a relative's business website nowadays, and I have not missed highlighting even with such a complex mess; instead, I'm happier.


I like to have language keywords accented too, since that helps me scan the shape of the code without having to read it closely, but yes - most of the benefit is just in knowing what context you should read the characters in.


Same here. I do not highlight anything though: just white text on black background.


I found it to be very distracting. The reading process stopped on every colour change - especially punctuation and conjunctions, etc.

Larger blocks of cited text might work.


You find that if you scroll down, but yes, darkening the interjections is not helpful.


I saw it. I could not really tell if it was good - as it was now with the dark text.

I can't help thinking that this was done by someone who does not read that much.

Younger people seem to prefer videos to READMEs, and I had not really understood why until I saw somewhere that speed-reading skills have apparently been falling drastically among young people. I mean, why watch a video for 30 minutes to see if a tool or framework is worth trying, when you can scan the equivalent text in 30 seconds?


The first book I read was "The Neverending Story", which used green for text that took place in the "real" world, and red for text that took place in the fantasy world in the book the protagonist was reading. That's not syntax highlighting, but structural story highlighting.

I would actually be interested to read a novel which did something similar: one color for narration (probably black, as it will be most common), and then a different color for each person speaking. That would be useful in a similar way that I find syntax highlighting useful: I could instantly look at text, and without even reading it, know who said it.

Narration text could also take on different colors, similar to in "The Neverending Story". How it is done, and how obvious its meaning, could even be a part of the art. That would be far more interesting to me than English syntax highlighting, to the point that if anyone knows of a book that does this, please tell me, because I would read it just to experience it. The point here is that in fiction, it's not the parts of speech that matter to readers, that's just a means to tell the story. What matters are the elements of the story, and communicating those elements visually could be interesting and useful.


Wow that was hard to read. We don't need to highlight prose. We don't read prose the same way we read code.


I found that difficult to read. Most editors go way over the top doing syntax highlighting and to me it makes the code harder to read.

I switched to a gray scale theme (emacs tao theme) two weeks ago and it is so much better.


Interesting, this follows the approach of sentence blocks, rather than syntax. I've written an Emacs plugin for lisp blocks (https://github.com/istib/rainbow-blocks) and one for English syntax (https://github.com/istib/wordsmith-mode) using NLP tools.


I'm getting this error when trying to install wordsmith-mode using package-install: `http://melpa.org/packages/wordsmith-mode-20140203.427.el: Not found`


I made a little thing for Emacs that does something similar, tagging based on the parsed parts of speech instead of just tokens. It requires you to select the text first instead of highlighting automatically, though. It uses CoreNLP, which was pretty lovely to work with.

https://github.com/cosmicexplorer/speech-tagger


Highlighting the quotes reminds me of a gripe I have with quoting styles in print: When a quote consists of two paragraphs, the first paragraph does not get an ending quote:

    He said, "The first sentence.

    "The second sentence," he continued. 
Somehow my mind gets triggered pretty intensely by these unbalanced quotes.

Does anyone have some background?


I believe the reasoning for this is that if you had two people talking, and there were end quotes after the first sentence, it would be parsed as the second sentence being said by the second person.


It might be. Most of the time, however, it is just a single person getting a longer quote.


This isn't actually mine, just something I stumbled on. Before I go and pull this apart to do it myself, does anyone know of an extension that will do this to arbitrary text in Firefox?

I found this because someone was using it as an argument against syntax highlighting for code, but I actually find that it lets me read significantly faster.


You may be interested in BeeLine Reader [1]. It puts a color gradient on alternating lines of text to hopefully let you read faster. I think it works a little bit. When I really want to read an entire article but don't want to invest too much time, especially if it's somewhat fluffy, I'll use Spritzlet set to 700 wpm [2].

1: http://www.beelinereader.com/

2: http://www.spritzlet.com/


Interestingly, when people see BeeLine Reader for the first time, many of them think (incorrectly) that it's a sentence-based algorithm instead of a line-based algorithm. Their belief often persists even after being told (by me, the creator) that it is in fact line-based. We've thought about doing something that's sentence-based, or syntactically or semantically aware, but as others have pointed out those tasks are much more complex.


I've experimented with this for a while, the goal being easier comprehension of new text. I found highlighting parts of speech and syntactic groups didn't work for me. The only thing that did work was highlighting keywords (that are specific to the text) and maybe named entities that refer to the same thing. Interestingly, there is some research indicating highlighting keywords may help people with dyslexia.

I've written a Chrome extension for highlighting keywords; too bad I don't currently have time to give it the love it deserves:

https://chrome.google.com/webstore/detail/highlit/cooahmcpma...


It'd be interesting to integrate http://www.beelinereader.com/ style gradients. I had a hard time with the quick change from gray to white.


I see a lot of negative feedback here, but I must say I tried reading it and it looks like a noticeable improvement. I can't put my finger on what exactly improved, but somehow I think it reads better than plain text.


For me the point would be to make reading the main point of the text faster and more precise. Why not render based on the meaning of the words? E.g. "love" and "sex" could be red, strong words in bold, weak ones in gray, "emotion" in a tone based on the particular emotion, etc. Grammatical terms are not even that interesting to me, and any syntactic sugar can be made softer.


If you want a Mac app that does this, iA Writer has a feature called Syntax Control: https://ia.net/writer/updates/ia-writer-3-1-comes-in-colors

I use it sometimes; it works pretty well but occasionally gets confused.


Interesting concept, although IMHO poorly executed: I found it harder to read than non-highlighted text, as critical ligature words (like "and") were faded. These should not be faded but emphasized (a bit like ampersands in programming languages).


On a somewhat related note, has anyone on here successfully mastered the art of speed reading? I'm a CS student at uni right now, and compared to non-engineering students, I've noticed I tend to read much slower. Any tips on upping my reading speed?


If you try to read all the words, just faster, you won't get much farther. The trick is to take a more active approach and optimize how you spend your time towards the goal of comprehension.

One thing that worked well for me was shifting from "reading" to "interrogating". Don't just try to read through a text, think carefully (and jot down) what questions you need to answer from the text, and jump around the text as necessary to answer the questions. If you don't have a sense of what questions to ask, do a quick skim of the text and any other relevant material to get the questions first, then dive in. Iterate and refine your questions and answers.


I'd focus on understanding rather than reading speed. If reading faster compromises your understanding, back off. One more thing: the best way to feel the need to read fast is to be overwhelmed by browser tabs. When you are, try to close each one after reading it. With that many tabs, you'll want to get through them quickly.


In the presented example, gray highlights within white phrases negatively affected my reading, but the white highlights within the yellow text helped, I guess. Nonetheless, congratulations on experimenting.


Language is already highlighted. It's called Bold, Italics, Underline.


The problem is, if you use italics/bold/underline remotely as often as necessary to convey intonation or emphasis, you read like you're a loon or crank. Bolding is the green ink of the Internet.


That's a good point. Also, syntax highlighting helps me navigate my code by sight, something I don't need to do when I'm reading.


This is a very interesting idea! How about syntax highlighting for proper nouns, using a preset cycling color table to give proper nouns consistent colors?
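
A minimal sketch of that idea (hypothetical palette; a stable hash keeps the colors consistent across runs, since Python's built-in hash() is salted per process):

    import zlib

    PALETTE = ["#e6194b", "#3cb44b", "#4363d8", "#f58231", "#911eb4"]

    def color_for(proper_noun):
        # Same name -> same palette slot, document-wide.
        return PALETTE[zlib.crc32(proper_noun.encode("utf-8")) % len(PALETTE)]

    for name in ["Holmes", "Watson", "Holmes"]:
        print(name, color_for(name))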


It's an interesting idea to highlight English. However, I believe it needs to be a little more intelligent than highlight.js.


In some early books, the open class words are capitalized. This style lives on in titles.


I would totally install a Sublime plugin or something like that for this.


Similar to Red Letter Bibles, I think this gets in the way of the prose.


This is amazing. Thanks so much, author.


Please make this a Chrome extension that wraps all p tags in my browser.

I'm very curious.


Red periods were a bad idea. Also ew.


I don't know. That was my first reaction too, but I started reading and after a couple of paragraphs I found them to be helpful. The red periods actually let me read faster because they were easier to spot.


In British English, periods are always red. Full stops are usually black when printed.


I'm wondering if the visual pun was deliberate.


This isn't particularly interesting as art, and it's beyond incorrect as far as any formal theory of natural language goes, but you do you, playa



