
IMHO a text-based browser isn't exactly in the "challenging" category, as it basically amounts to stripping all the HTML tags out and doing some very simple transformations (like replacing <br>'s with newlines.) Then again, one of the things I've been working on intermittently for the past few years is a graphical (CSS2+) browser, which is definitely in the challenging category. There are some other public efforts too:

https://github.com/lexborisov/Modest

https://github.com/litehtml

https://github.com/ArthurHub/HTML-Renderer

Along the same lines, some other challenging projects I recommend are to write decoders/renderers for existing formats like MP3, MP4, PDF, etc.
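For anyone curious what the "strip the tags and replace <br> with newlines" approach looks like in practice, here is a minimal Python sketch. The function name `naive_text` is made up for illustration, and the regexes are deliberately simplistic: they ignore scripts, comments, and the tokenizer edge cases that come up further down the thread.

```python
import html
import re


def naive_text(markup: str) -> str:
    """Very rough HTML-to-text conversion (illustrative only)."""
    # Turn <br>, <br/> and <br /> into newlines, case-insensitively.
    text = re.sub(r"<br\s*/?>", "\n", markup, flags=re.IGNORECASE)
    # Drop everything else that looks like a tag.
    text = re.sub(r"<[^>]*>", "", text)
    # Decode character references such as &amp;
    return html.unescape(text)


print(naive_text("Hello<br>world &amp; <b>friends</b>"))
```

This is only the "very simple transformations" part; it makes no attempt at layout.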




To this list I would add "Web Browser Engineering" [0] which is a textbook / browser engine that is currently being written by Dr. Pavel Panchekha at the University of Utah. The code for the book and browser is available on GitHub [1] and a more current bleeding edge draft is also published [2].

The book guides the reader in implementing a graphical web browser, starting with HTTP and HTML then moving on to the layout, the box model, CSS, browser chrome, forms, and scripts.

[0] https://browser.engineering

[1] https://github.com/pavpanchekha/emberfox

[2] https://browser.engineering/draft/


Thanks, I will add that book to the post! It looks really good.


> IMHO a text-based browser isn't exactly in the "challenging" category, as it basically amounts to [...]

All my projects start with me thinking like that, then many hours, days or months later me thinking "hey it was more complex than I thought".

For 2021 I want to build a personal finance app for myself. The usual me thinks it will take a couple months. The realist me wonders if it will be finished in this decade :)


There's a difference between scope creep and difficulty.


It looks straightforward until you hit a couple of edge cases. Examples:

test <1 becomes test 1

Test< 2 becomes test 2

Test <a becomes test

Test < b becomes test b

(From memory)

What about: Test <fakeTag>?

In the tests I did, "test " was expected, but "test <fakeTag>" came through as plain text, which suggests there is a list of valid tags filtering the behavior.


That's because '<' needs to be followed by [!/?a-zA-Z] to be recognised as a tag start. Otherwise it is a literal '<'.

The full details are in here somewhere: https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.ht...
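A rough Python sketch of that rule, assuming a regex is good enough for illustration (the real tokenizer is a state machine; `TAG` and `strip_tags` are names I made up). Note that it keeps a literal '<' in the output, so it will not reproduce the from-memory results quoted upthread, where the '<' itself disappeared.

```python
import re

# '<' only starts a tag when the next character is '!', '/', '?'
# or an ASCII letter. Anything else leaves the '<' as literal text.
# The trailing '>' is optional so an unclosed tag at end of input
# is still swallowed.
TAG = re.compile(r"<[!/?a-zA-Z][^>]*>?")


def strip_tags(text: str) -> str:
    """Remove tag-like spans, keeping any literal '<' characters."""
    return TAG.sub("", text)


for sample in ["test <1", "Test < b", "Test <a", "Test <fakeTag>"]:
    print(repr(sample), "->", repr(strip_tags(sample)))
```

It does not handle '>' inside quoted attribute values, comments, or script content, which is where the state machine earns its keep.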


I have been stuck on edge cases like these for almost 15 years while building my own HTML parser.

It always works on all the HTML files I have, but then people create new HTML files with other issues.


Doing proper table layouts (including rowspans, colspans) is a little more than stripping and replacing tags.
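To give a feel for why, here is a small Python sketch of just the first step: working out which grid slots each cell occupies once rowspans and colspans are involved. The `place_cells` function and its input format are hypothetical, and it is a simplification (it only checks the first slot of a cell for collisions and does no width or height calculation at all).

```python
def place_cells(rows):
    """Map grid slots to cell labels.

    rows: list of rows, each a list of (label, rowspan, colspan).
    Returns a dict of (row, col) -> label for every occupied slot.
    """
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for label, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Claim every slot the cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = label
            c += colspan
    return grid


# "A" spans two rows, "B" spans two columns.
table = [
    [("A", 2, 1), ("B", 1, 2)],
    [("C", 1, 1), ("D", 1, 1)],
]
for slot, label in sorted(place_cells(table).items()):
    print(slot, label)
```

Column widths, text wrapping inside cells, and drawing borders all come after this, and none of it is tag stripping.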



