People should keep in mind that you do not always have the benefit of green-field development; sometimes you have a project already written that does not have much of a server-side component, and has no budget or time left to add real pages or do graceful degradation (much less progressive enhancement). In those cases, using a spider is pretty much your only choice. Some notes/recommendations:
- I'd recommend PhantomJS (there are some other packages built on top of it, but for my custom needs, using the original was better)
- If you spider a whole site, especially a somewhat complicated one, log what you're spidering and watch if and where it hangs. I started getting PhantomJS hangs after ~100 URLs. If that happens, it can be a good idea to do multiple spidering runs from different entry points (I use command-line options to exclude URL patterns I know were spidered during previous runs)
- If you're spidering sites that use script loaders (like require.js), pay careful attention to console errors; if you notice things aren't loading, you may have to increase your load timeout to compensate. Using a "gnomon" (indicator) CSS selector is very helpful here (see the sketch after this list)
- Add a link back to your static version for no-JS people in case Google/Bing serves up the static version. This only seemed to be a problem shortly after spidering, but it's worth doing regardless (later, the search engines seemed to start serving the real version)
- For those wondering how to keep the static version up-to-date, set up a cron job to re-run the spider, then cp/rsync/whatever the latest output into your "static" directory.
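Since the "gnomon" bit tends to raise questions, here's roughly what that wait looks like. This is a minimal sketch, not my actual spider; the script name, default selector, timeout, and polling interval are made up for illustration, but the PhantomJS calls themselves (page.open, page.evaluate, page.content, fs.write) are the real API:

    // render-static.js -- minimal sketch of "render one URL to static HTML"
    // Usage: phantomjs render-static.js <url> <output.html> [gnomonSelector]
    var page = require('webpage').create();
    var system = require('system');
    var fs = require('fs');

    var url = system.args[1];
    var out = system.args[2];
    var gnomon = system.args[3] || '#app-rendered'; // selector that only exists once the app has rendered (hypothetical default)
    var timeout = 15000;                            // bump this if your script loader is slow

    // Surface in-page console output and errors -- this is how you spot require.js load failures.
    page.onConsoleMessage = function (msg) { console.log('page console: ' + msg); };
    page.onError = function (msg) { console.log('page error: ' + msg); };

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('failed to open ' + url);
            phantom.exit(1);
        }
        var start = Date.now();
        var poll = setInterval(function () {
            // Wait until the gnomon selector appears, i.e. the client-side app has finished rendering.
            var ready = page.evaluate(function (sel) {
                return !!document.querySelector(sel);
            }, gnomon);
            if (ready) {
                clearInterval(poll);
                fs.write(out, page.content, 'w'); // dump the rendered DOM as static HTML
                phantom.exit(0);
            } else if (Date.now() - start > timeout) {
                clearInterval(poll);
                console.log('timed out waiting for ' + gnomon + ' on ' + url);
                phantom.exit(1);
            }
        }, 250);
    });

The actual spider also walks the links it finds in page.content and queues them, but the wait-for-selector loop above is the part that keeps script-loader-heavy pages from being snapshotted half-rendered.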
One thing I'd like to add: I wish PhantomJS supported more of what node.js offers (since some of its API is modeled after node's), particularly synchronous versions of many functions. That aside, it's an incredibly useful piece of software.