Easily index your Single Page Application thanks to PhantomJS

Google provides a way to index your Single Page Application (SPA), built with Angular.js, Backbone.js, or any other JavaScript framework (Ember.js, Knockout.js, and others). In this article, we will first see how it works and how much work the webmaster has to do, and then we will come up with an automatic solution.

Throughout this article, I will use tools and examples from this website, vickev.com (which is built with Angular.js and Node.js), to demonstrate the relevance of the method.

The magic of Google crawl

image

Google indexes websites very well, but without executing any JavaScript code. If a website is well designed, it should display its information whether JavaScript is enabled or not. To verify that, the webmaster needs to test every single page with the JavaScript engine of the browser disabled, and check that the website still provides the correct information and remains browsable. It's a good and well-known practice.

However, in the case of an SPA, the very foundation is JavaScript. It can be very difficult, if not impossible, to port it to static HTML. Faced with the emergence of JavaScript applications, Google needed to propose a method to index them, just like static websites.

So, Google came up with a solution for developers. Here are the main principles:

  • Google will only index links with an exclamation mark after the hash in anchor-based URLs (the #! "hashbang").
  • Googlebot will then make a request to the same server, putting everything that follows the #! into a GET parameter named _escaped_fragment_.
  • The HTML returned by this request should be the content that the user would normally get after the execution of the JavaScript. Google will index this HTML under the original anchor-based URL.
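Concretely, the rewriting that Googlebot applies to hashbang URLs can be sketched in a few lines of JavaScript (the helper name below is mine, for illustration only; the official guide also describes URL-encoding of special characters in the fragment):

```javascript
// Sketch of the URL rewriting Googlebot performs on hashbang URLs.
// (toEscapedFragmentUrl is a made-up name, not part of any API.)
function toEscapedFragmentUrl(url) {
  var parts = url.split('#!');
  if (parts.length < 2) return url; // no hashbang: nothing to rewrite
  // Everything after the #! becomes the _escaped_fragment_ parameter.
  return parts[0] + '?_escaped_fragment_=' + parts[1];
}

console.log(toEscapedFragmentUrl(
  'https://vickev.com/#!/article/grunt-the-perfect-tool-for-require-js'
));
// https://vickev.com/?_escaped_fragment_=/article/grunt-the-perfect-tool-for-require-js
```

This is exactly the transformation we will reproduce by hand with Lynx in the next section.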

If this is not clear (I'm not sure it is, actually), please consult the official guide, which is pretty clear in my opinion. You can also try it out on this website (see the next sections)!

Try it yourself!

image

We can play a little bit with vickev.com. This website is an SPA, and it is configured to be indexed by Google. First of all, let's see what Googlebot actually sees, by using a text-based browser such as Lynx.

We are going to try to display a previous article about Grunt. If we run this command in our Linux terminal:

lynx 'https://vickev.com/#!/article/grunt-the-perfect-tool-for-require-js'

we get this result:

image

As we can see, we do not get the content of the article, but only the bottom of the page. That's because this kind of browser does not execute JavaScript (just like Googlebot).

Now, let's do what Googlebot does, by making this specific request to the server:

lynx 'https://vickev.com/?_escaped_fragment_=/article/grunt-the-perfect-tool-for-require-js'

image

That's much better! We see the HTML that the user sees when JavaScript is enabled, and that's what Google will index. (Note that I still have some optimization to do here, such as removing the iframe tag at the beginning of the page for better indexing.)

Now, I hope you get the idea. Let's see how to implement this.

Generating static HTML for Google

Our program needs to respond to this specific request by generating the HTML and sending it back to the crawler. But, of course, we are developers: we are lazy (by definition). So, we want an automatic way to do it.

PhantomJS, a headless browser

image

PhantomJS is a headless WebKit browser, scriptable in JavaScript, that can return the HTML of any web page after running its JavaScript code, just as a regular WebKit browser would (that's the point!). It is usually used for functional and unit testing.

However, for us, the ability to load a URL and run its JavaScript is exactly what we need!

ExpressJS route

ExpressJS is a simple MVC framework for Node.js. Here, we want to create a specific route to handle the _escaped_fragment_ parameter: it will run PhantomJS to load the corresponding URL of our own website, and return the resulting HTML to the client. It looks like this:

var phantom = require('node-phantom');
// ...
app.get('/', function(req, res){
  // If there is _escaped_fragment_ option, it means we have to
  // generate the static HTML that should normally return the Javascript
  if(typeof(req.query._escaped_fragment_) !== "undefined") {
    phantom.create(function(err, ph) {
      return ph.createPage(function(err, page) {
        // We open phantomJS at the proper page.
        return page.open("https://vickev.com/#!" + req.query._escaped_fragment_, function(status) {
          return page.evaluate((function() {
            // We grab the content inside <html> tag...
            return document.getElementsByTagName('html')[0].innerHTML;
          }), function(err, result) {
            // ... and we send it to the client.
            res.send(result);
            return ph.exit();
          });
        });
      });
    });
  }
  else {
    // If there is no _escaped_fragment_, we return the normal index template.
    res.render('index');
  }
});

The comments in this code should be clear enough to understand what's going on. Note that in order to drive PhantomJS from Node.js, we use node-phantom.

This code is very simple, yet not the most efficient. An interesting improvement would be to cache the result of PhantomJS, to avoid processing the same request over and over again (which is quite expensive).
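Such a cache could be as simple as keeping the rendered HTML in memory with an expiration time. Here is a minimal sketch (all names and the TTL value are my own assumptions, not part of the route above):

```javascript
// Minimal in-memory cache sketch for rendered snapshots.
// Keyed by the _escaped_fragment_ value; entries expire after a TTL.
var cache = {};
var TTL = 60 * 60 * 1000; // one hour (arbitrary choice)

function getCachedHtml(fragment) {
  var entry = cache[fragment];
  if (entry && Date.now() - entry.time < TTL) {
    return entry.html; // fresh enough: reuse the snapshot
  }
  return null; // missing or stale: caller should re-run PhantomJS
}

function setCachedHtml(fragment, html) {
  cache[fragment] = { html: html, time: Date.now() };
}
```

In the route above, we would check getCachedHtml(req.query._escaped_fragment_) before spawning PhantomJS, and call setCachedHtml with the result afterwards.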

Results

image

We can see that Google was able to find the proper <title> and the proper meta description of the article (those values are normally generated dynamically by JavaScript). We can also notice the actual anchor-based URL that has been indexed for this page.

Useful tool: Google Webmaster Tools

To double-check that Google gets the proper content of our dynamic pages, we can use Google Webmaster Tools. From there (after registering the website), we can go to Fetch as Google, and give a URL for Google to fetch.

image

In this picture, I didn't include the whole output, but we can see some clues that prove it actually worked: the title and meta description are correctly generated.

Conclusion

The method introduced in this article provides a complete way to get your dynamic pages indexed by Google. However, I have only covered the technical aspect: good indexing comes first of all from good practices (sitemap, relevant content at the top of the page, title, meta tags, etc.). I hope this tutorial helped.

I wish you a happy indexing!

~Kevin


Written by
April 28, 2013 5:22am
Tags: Javascript, Trick
