Cloud PhantomJS on Amazon EC2

Joining the buzzword-laden crowd, here I’d like to say that PhantomJS goes to the cloud.

Back in the real world, this blog post shows how easy it is to build and deploy PhantomJS on a Linux instance of Amazon EC2. If you are not familiar with EC2, it’s the Elastic Compute Cloud platform from Amazon Web Services, essentially computing resources you can rent and scale up or down as needed. EC2 is quite popular: it powers various consumer-oriented services, from Amazon.com itself to Netflix.

Artwork credit: Internet cloud, Cartoon ghost.

There are two keys to enabling PhantomJS on EC2: the improved build workflow and the true headless feature. Assuming you have an instance running, it’s a matter of running the following commands:

sudo yum install gcc-c++ git chrpath openssl-devel freetype-devel fontconfig-devel
git clone https://github.com/ariya/phantomjs.git && cd phantomjs
git checkout 1.5
./build.sh --jobs 1

This was tested on a 64-bit image with the following /etc/system-release:

Amazon Linux AMI release 2011.09

Note: With Amazon Linux AMI release 2012.03, make is also needed, i.e. sudo yum install make.

As expected, there is no need to have any sort of GUI to run PhantomJS. Pure headless.

For some tweaks and other notes, read the complete PhantomJS build instructions. Please note that the build may take a long time: the Linux Micro Instance (free usage tier) took about 28 hours to complete the entire process. You may also switch to another Linux image, or even build locally first on a beefy machine and then upload the resulting build. In fact, you could also use the included script deploy/package-linux-dynamic.sh to pack the build into a tarball and transport it somewhere else, e.g. further AMI instances. The package will be self-contained as far as Qt is concerned (the Qt libraries resolve from the package’s own lib directory); the proof is in the result of running ldd on the binary:

linux-vdso.so.1 =>  (0x00007fff02dff000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fc9d266b000)
libQtWebKit.so.4 => /home/ec2-user/deploy/phantomjs/bin/../lib/libQtWebKit.so.4 (0x00007fc9d0cf3000)
libQtGui.so.4 => /home/ec2-user/deploy/phantomjs/bin/../lib/libQtGui.so.4 (0x00007fc9d01d6000)
libQtNetwork.so.4 => /home/ec2-user/deploy/phantomjs/bin/../lib/libQtNetwork.so.4 (0x00007fc9cfe92000)
libQtCore.so.4 => /home/ec2-user/deploy/phantomjs/bin/../lib/libQtCore.so.4 (0x00007fc9cf93e000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc9cf722000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007fc9cf41c000)
libm.so.6 => /lib64/libm.so.6 (0x00007fc9cf197000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fc9cef81000)
libc.so.6 => /lib64/libc.so.6 (0x00007fc9cebe0000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc9d2877000)
libfreetype.so.6 => /usr/lib64/libfreetype.so.6 (0x00007fc9ce942000)
libfontconfig.so.1 => /usr/lib64/libfontconfig.so.1 (0x00007fc9ce70c000)
librt.so.1 => /lib64/librt.so.1 (0x00007fc9ce504000)
libexpat.so.1 => /lib64/libexpat.so.1 (0x00007fc9ce2db000)
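
To check a listing like this mechanically instead of by eye, the ldd output can be split by where each library resolves. The following is only a minimal sketch in plain JavaScript (it runs under Node.js, for example); classifyLdd and its two parameters are hypothetical names chosen for illustration and are not part of PhantomJS or its deploy scripts.

```javascript
// Splits ldd output into libraries that resolve inside packageDir
// and libraries that resolve anywhere else on the system.
// classifyLdd is a hypothetical helper, not something PhantomJS ships.
function classifyLdd(lddOutput, packageDir) {
    var result = { bundled: [], system: [] };
    lddOutput.split('\n').forEach(function (line) {
        // Matches lines such as "libm.so.6 => /lib64/libm.so.6 (0x...)".
        var m = /^\s*(\S+)\s+=>\s+(\/\S+)/.exec(line);
        if (!m) {
            return; // e.g. linux-vdso or the dynamic loader line, which have no path after "=>"
        }
        var bucket = m[2].indexOf(packageDir) === 0 ? 'bundled' : 'system';
        result[bucket].push(m[1]);
    });
    return result;
}
```

Fed the listing above with /home/ec2-user/deploy/phantomjs as packageDir, the four Qt libraries end up in bundled, while libc, libfreetype, libfontconfig and the rest end up in system, so the target machine still has to provide those.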

Now that you have something wandering around in the cloud, what can you do with it? There are a few example uses of PhantomJS which may inspire you. Personally, what I’d like to see appear someday are a screenshot service and a next-generation network monitoring service.

For the screenshot service, it’s necessary to combine PhantomJS with other web stack frameworks. Basically PhantomJS is just the back-end; its screen capture is driven by a separate middleware layer. There are examples of such an implementation using Perl Dancer (Screenshot), Node.js (screenshot-app), Python/Flask (bookmarking service), and Play2 (screenshot-webservice). For an example of a commercial screenshot service, take a look at URL2PNG, which seems to capture the web page using the Linux version of Chromium 11 (that’s a release from a year ago). Using Chromium might give better rendering fidelity, although a headless, optimized PhantomJS should be lighter on resources and CPU.
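
As a rough idea of the PhantomJS side of such a service, here is a minimal capture script that the middleware could invoke once per request. Treat it as a sketch under assumptions: the file name capture.js, the helper outputNameFor and the 1024x768 viewport are all made up for illustration, and phantom.args is the argument list as exposed by the 1.x releases.

```javascript
// capture.js (hypothetical name). Intended usage:
//   phantomjs capture.js http://example.com/ example.png

// Derives a file-system-safe PNG name from a URL; handy when the caller
// gives no output name. outputNameFor is an illustrative helper.
function outputNameFor(url) {
    return url.replace(/^[a-z]+:\/\//i, '')   // drop the scheme
              .replace(/[^a-z0-9]+/gi, '-')   // collapse unsafe characters
              .replace(/^-+|-+$/g, '') + '.png';
}

// The PhantomJS-specific part only runs when the phantom global exists.
if (typeof phantom !== 'undefined') {
    if (phantom.args.length < 1) {
        console.log('Usage: capture.js URL [output.png]');
        phantom.exit(1);
    } else {
        var url = phantom.args[0];
        var output = phantom.args[1] || outputNameFor(url);
        var page = require('webpage').create();
        page.viewportSize = { width: 1024, height: 768 }; // assumed default
        page.open(url, function (status) {
            if (status !== 'success') {
                console.log('Unable to load ' + url);
                phantom.exit(1);
            } else {
                page.render(output); // writes the screenshot to disk
                phantom.exit(0);
            }
        });
    }
}
```

The web framework in front then only has to spawn this script, wait for the exit code, and serve the resulting file.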

One underrated feature of PhantomJS is its ability to track network activity, i.e. every single network request and response along with its timing information. This is used in, for example, confess.js. An export to HAR format is almost trivial. Now imagine you build an advanced network traffic and monitoring service based on this feature. You can enrich the report with tons of useful (and useless) metrics and stats: HTTP header analysis, a detailed breakdown of asset sizes, a complete network waterfall diagram, optimization opportunities, and many more. Maybe even a screen capture of the monitored site. If your client focuses on interactive web pages or rich internet apps (RIA), you can even report the code coverage and full execution trace by leveraging my other project, Esprima.
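
To make the network tracking a bit more concrete, here is a minimal sketch that pairs each request with its response and emits simplified, HAR-like entries. The createRecorder helper is a hypothetical name, and the output is deliberately incomplete: a real HAR log also needs fields such as version, creator, headers and timings, which are left out here.

```javascript
// Pairs requests and responses by id and produces HAR-like entries.
// createRecorder is an illustrative helper, not a PhantomJS API.
function createRecorder() {
    var resources = {};
    return {
        requested: function (req) {
            resources[req.id] = { request: req, startReply: null, endReply: null };
        },
        received: function (res) {
            var r = resources[res.id];
            if (!r) { return; }
            if (res.stage === 'start') { r.startReply = res; }
            if (res.stage === 'end') { r.endReply = res; }
        },
        entries: function () {
            var out = [];
            Object.keys(resources).forEach(function (id) {
                var r = resources[id];
                if (!r.startReply || !r.endReply) { return; } // still in flight
                out.push({
                    startedDateTime: r.request.time.toISOString(),
                    time: r.endReply.time - r.request.time, // milliseconds
                    request: { method: r.request.method, url: r.request.url },
                    response: {
                        status: r.endReply.status,
                        statusText: r.endReply.statusText,
                        bodySize: r.startReply.bodySize
                    }
                });
            });
            return out;
        }
    };
}

// The PhantomJS-specific wiring only runs when the phantom global exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var recorder = createRecorder();
    page.onResourceRequested = recorder.requested;
    page.onResourceReceived = recorder.received;
    page.open(phantom.args[0], function (status) {
        console.log(JSON.stringify({ log: { entries: recorder.entries() } }, null, 2));
        phantom.exit(status === 'success' ? 0 : 1);
    });
}
```

Keeping the recorder free of PhantomJS calls means the pairing logic can be exercised with hand-made events, and the metrics mentioned above can be computed from the entries afterwards.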

Do I hear a startup?
