A couple of weeks ago, a colleague of mine showed me this cool tool called PhantomJS. It's a headless browser that you can drive with JavaScript to do almost anything you'd do in a regular browser, just without rendering anything to the screen. That can be really useful for tasks like running UI tests on a project you've created, or crawling a set of web pages looking for something.

...So, this is exactly what I did! There's a great site I know of that has a ton of great ebooks ready to download, but the problem is that it shows only 2 results on each page, and the search never finds anything! Realizing that this site has a very simple URL structure (e.g. website/page/#), I just created a quick JavaScript file telling PhantomJS to go through the first 50 pages and search for a list of keywords that interest me. If I find something interesting, it saves the name of the book along with the page link into a text file so I can download them all later. :)

Here's the script:

    var page;
    var fs = require('fs');
    var pageCount = 0;

    scanPage(pageCount);

    function scanPage(pageIndex) {
        // dispose of the previous page before moving on
        // (page.close() replaces the deprecated page.release())
        if (typeof page !== 'undefined') page.close();

        // dispose of phantomjs if we're done; pageIndex pages have
        // already been scanned at this point, so stop after 50
        if (pageIndex >= 50) {
            phantom.exit();
            return;
        }
        pageIndex++;

        // start crawling...
        page = require('webpage').create();
        var currentPage = 'your-favorite-ebook-site-goes-here/page/' + pageIndex;
        page.open(currentPage, function (status) {
            if (status === 'success') {
                // give the page 3 seconds to finish loading its scripts
                window.setTimeout(function () {
                    console.log('crawling page ' + pageIndex);
                    var booksNames = page.evaluate(function () {
                        // there are 2 book titles on each page, just put these in an array
                        // (this relies on the site itself loading jQuery)
                        return [
                            $($('h2 a')[0]).attr('title'),
                            $($('h2 a')[1]).attr('title')
                        ];
                    });
                    checkBookName(booksNames[0], currentPage);
                    checkBookName(booksNames[1], currentPage);
                    scanPage(pageIndex);
                }, 3000);
            } else {
                console.log('error crawling page ' + pageIndex);
                page.close();
                // exit explicitly, otherwise phantomjs keeps running forever
                phantom.exit(1);
            }
        });
    }

    // checks for interesting keywords in the book title,
    // and saves the link for us if necessary
    function checkBookName(bookTitle, bookLink) {
        // skip entries whose title attribute was missing
        if (!bookTitle) return;

        // keywords are lowercase because the title is lowercased before matching
        var interestingKeywords = ['c#', 'java', 'nhibernate', 'windsor', 'ioc',
            'dependency injection', 'inversion of control', 'mysql'];
        for (var i = 0; i < interestingKeywords.length; i++) {
            if (bookTitle.toLowerCase().indexOf(interestingKeywords[i]) !== -1) {
                // save the book title and link, one entry per line
                var a = bookTitle + ' => ' + bookLink + ';\n';
                fs.write('books.txt', a, 'a');
                console.log(a);
                break;
            }
        }
    }
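The keyword check is the part of the script that is easiest to get subtly wrong, since it has to cope with letter case and with titles that come back empty from the page. Here is a minimal sketch that isolates it as a pure function, so it can be tried under plain Node.js without a browser. The name `findKeyword` and the sample titles are hypothetical, introduced only for illustration; they are not part of the original script.

```javascript
// Hypothetical helper (not in the original script): returns the first
// keyword found in the title, or null when nothing matches.
function findKeyword(bookTitle, keywords) {
    // a missing title attribute can come back as undefined or null
    if (typeof bookTitle !== 'string') {
        return null;
    }
    var title = bookTitle.toLowerCase();
    for (var i = 0; i < keywords.length; i++) {
        // lowercase the keyword too, so 'C#' in the list still matches 'c#'
        if (title.indexOf(keywords[i].toLowerCase()) !== -1) {
            return keywords[i];
        }
    }
    return null;
}

// made-up titles, just to exercise the function
var keywords = ['C#', 'java', 'mysql'];
console.log(findKeyword('Pro C# 5.0 and the .NET Framework', keywords)); // C#
console.log(findKeyword('Learning Python', keywords));                   // null
console.log(findKeyword(undefined, keywords));                           // null
```

Keep in mind this is plain substring matching, so a keyword like 'java' will also flag titles about JavaScript. For a quick personal crawl that is probably fine, but it is worth knowing when reading the results.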
Reposted from: http://dzemb.baihongyu.com/