Pjscrape: A web-scraping framework written in JS using PhantomJS and jQuery

weego · on Aug 13, 2011

I use PhantomJS + jQuery for my own scraping/testing engine, so I thought I would chuck in some extra information for those not familiar with PhantomJS.

The one thing that sets PhantomJS apart is that it is a full headless WebKit browser rather than just an HTML parsing engine, which is what most other solutions are. The big win is that you can scrape and test Comet/JavaScript-heavy apps without having to mock the polling or the submits/responses.

I run it like a bot controlled by Node.js, with NowJS sending commands to it and PhantomJS returning the test results, though I believe there are plans to get process-to-process communication working, which would make controlling it and pushing data out easier.

robterrell · on Aug 13, 2011

I, too, use a nodejs server to control multiple phantomjs processes. There's a patch that lets your script read from stdin -- last weekend I modified it to support my platform's preferred line ending. I also added commands for mousemove/mousedown/mouseup; they stuff actual mouse events in the Qt event queue, so you don't have to worry about the edge cases where javascript-faked mouse events fail.

https://github.com/robterrell/phantomjs
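
A minimal sketch of the line-based control idea described above, not the actual patch: the controller writes one command per line, and the receiving side buffers incoming chunks until it sees a complete line. The helper names (`createLineFramer`, `parseCommand`) and the `name x y` command format are assumptions made up for illustration; only the general approach (commands over stdin, configurable line ending) comes from the comment. It runs under plain Node.js with no PhantomJS involved.

```javascript
// Hypothetical helper: turns a stream of text chunks into complete lines.
// The delimiter is configurable because platforms disagree on line endings.
function createLineFramer(delimiter = '\n') {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    const parts = buffer.split(delimiter);
    // The last element is an unfinished line, or '' if the chunk
    // ended exactly on a delimiter. Keep it for the next chunk.
    buffer = parts.pop();
    return parts;
  };
}

// Hypothetical command format: "<name> <x> <y>", e.g. "mousemove 10 20".
// Arguments are assumed to be numeric.
function parseCommand(line) {
  const [name, ...args] = line.trim().split(/\s+/);
  return { name, args: args.map(Number) };
}

// Simulate a command arriving split across two chunks, using CRLF endings.
const push = createLineFramer('\r\n');
const commands = [
  ...push('mousemove 10 20\r\nmouse'),
  ...push('down 10 20\r\n'),
].map(parseCommand);

console.log(commands);
```

Buffering matters because a pipe delivers arbitrary chunks rather than whole lines, so a command may arrive in pieces. In a real setup the chunks would come from the child process's stdin rather than from string literals.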

bryanh · on Aug 13, 2011

While this is awesome, anyone that needs to do about the same thing but with a Python stack should look at pyquery as an alternative.

davej · on Aug 13, 2011

Has anybody used PhantomJS with a client-side testing framework? I'd be very interested in hearing experiences.

ma2rten · on Aug 13, 2011

I am also working on a jQuery/JS scraping framework of my own. I think this is the way to go, because no library is used more than jQuery to extract data from HTML. It also enables you to scrape JS code on the page.

I have used node+jsdom so far. I will have a look at PhantomJS.

camwest · on Aug 13, 2011

How does Pjscrape handle logins, SSL, and redirects?

jqueryin · on Aug 13, 2011

PhantomJS recently closed a pull request on some basic patching to support SSL.

There was also a fix for self-signed or invalid certs:

https://github.com/ariya/phantomjs/pull/40

AltIvan · on Aug 13, 2011

If it does what you guys say it does... you are full of awesome!