Pjscrape: A web-scraping framework written in JS using PhantomJS and jQuery (nrabinowitz.github.com)
79 points by jamesjyu on Aug 13, 2011 | 8 comments


I use PhantomJS + jQuery for my own scraping/testing engine, so I thought I would chuck in some extra information for those not familiar with PhantomJS.

The one thing that sets PhantomJS apart is that it is a full headless WebKit browser rather than just an HTML parsing engine, which is what most other solutions are. The big win is that you can scrape and test Comet/JavaScript-heavy apps without having to mock the polling or the submits/responses.

I run it like a bot controlled by Node.js, with NowJS sending commands to it and it returning the results of tests, though I believe there are plans to get process-to-process communication working, which would make controlling it and pushing data out easier.


I, too, use a Node.js server to control multiple PhantomJS processes. There's a patch that lets your script read from stdin -- last weekend I modified it to support my platform's preferred line ending. I also added commands for mousemove/mousedown/mouseup; they stuff actual mouse events into the Qt event queue, so you don't have to worry about the edge cases where JavaScript-faked mouse events fail.

https://github.com/robterrell/phantomjs
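
For anyone curious what a line-based command channel like that might involve, here is a rough sketch in plain JavaScript (runnable under a current Node.js). It is not the code from the patch: the wire format ("name x y", one command per line) and the helper names `createLineFramer` and `parseMouseCommand` are assumptions made up for illustration. It only shows the two fiddly bits, which are buffering partial chunks from stdin and accepting either "\n" or "\r\n" as the line ending.

```javascript
// Hypothetical wire format: one command per line, e.g. "mousemove 120 45".
// The real patch's format is not described in the thread; this is an assumption.
const MOUSE_COMMANDS = new Set(["mousemove", "mousedown", "mouseup"]);

// Buffers raw input chunks and calls onLine once per complete line.
// Accepts both "\n" and "\r\n" line endings; blank lines are skipped.
function createLineFramer(onLine) {
  let buffer = "";
  return function push(chunk) {
    buffer += chunk;
    let newline;
    while ((newline = buffer.indexOf("\n")) !== -1) {
      let line = buffer.slice(0, newline);
      buffer = buffer.slice(newline + 1);
      if (line.endsWith("\r")) line = line.slice(0, -1);
      if (line.length > 0) onLine(line);
    }
  };
}

// Parses "name x y" into { type, x, y }; returns null for anything else.
function parseMouseCommand(line) {
  const parts = line.trim().split(/\s+/);
  if (parts.length !== 3 || !MOUSE_COMMANDS.has(parts[0])) return null;
  const x = Number(parts[1]);
  const y = Number(parts[2]);
  if (!Number.isFinite(x) || !Number.isFinite(y)) return null;
  return { type: parts[0], x, y };
}

// Usage: chunks can split a command anywhere, as stdin reads often do.
const received = [];
const push = createLineFramer((line) => received.push(parseMouseCommand(line)));
push("mousemove 10 20\r\nmouse");
push("down 10 20\n");
console.log(received);
```

Keeping the framing separate from the parsing means the same framer can feed other commands later, and unknown commands come back as null instead of throwing, so the controller can decide how to report them.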


While this is awesome, anyone who needs to do much the same thing with a Python stack should look at pyquery as an alternative.


Has anybody used PhantomJS with a client-side testing framework? I'd be very interested in hearing experiences.


I am also working on a jQuery/JS scraping framework of my own. I think this is the way to go, because no library is used more than jQuery to extract data from HTML. It also enables you to scrape JS code on the page.

I have used node + jsdom so far. I will have a look at PhantomJS.


How does Pjscrape handle logins, SSL, and redirects?


PhantomJS recently closed a pull request with some basic patching to support SSL.

There was also a fix for self-signed or invalid certs:

https://github.com/ariya/phantomjs/pull/40


If it does what you guys say it does... you are full of awesome!



