Having fun web crawling with phantomJs
Published: 2019-05-10


A couple of weeks ago, a colleague of mine showed me this cool tool called PhantomJS.
This is a headless browser that can run JavaScript to do almost anything you would want from a regular browser, just without rendering anything to the screen.
This could be really useful for tasks like running UI tests on a project you created, or crawling a set of web pages looking for something.
...So, this is exactly what I did!
There's a great site I know of that has a ton of great ebooks ready to download, but the problem is that they show you only 2 results on each page, and the search never finds anything!
Realizing that this site has a very simple URL structure (e.g. website/page/#), I just created a quick JavaScript file telling PhantomJS to go through the first 50 pages and search for a list of keywords that interest me. If I find something interesting, it saves the name of the book along with the page link into a text file so I can download them all later. :)
Here's the script:
var page;
var fs = require('fs');
var pageCount = 0;

scanPage(pageCount);

function scanPage(pageIndex) {
    // dispose of the previous page before moving on
    if (typeof page !== 'undefined')
        page.release();

    // dispose of phantomjs once the first 50 pages have been crawled
    if (pageIndex >= 50) {
        phantom.exit();
        return;
    }

    pageIndex++;

    // start crawling...
    page = require('webpage').create();
    var currentPage = 'your-favorite-ebook-site-goes-here/page/' + pageIndex;
    page.open(currentPage, function(status) {
        if (status === 'success') {
            // wait 3 seconds before reading the page and moving on
            window.setTimeout(function() {
                console.log('crawling page ' + pageIndex);

                var booksNames = page.evaluate(function() {
                    // there are 2 book titles on each page, just put these in an array
                    // (this relies on the site itself loading jQuery as $)
                    return [ $($('h2 a')[0]).attr('title'), $($('h2 a')[1]).attr('title') ];
                });
                checkBookName(booksNames[0], currentPage);
                checkBookName(booksNames[1], currentPage);

                scanPage(pageIndex);
            }, 3000);
        }
        else {
            console.log('error crawling page ' + pageIndex);
            // skip the failed page after the same 3 second pause, so the script
            // does not hang; scanPage disposes of the failed page for us
            window.setTimeout(function() {
                scanPage(pageIndex);
            }, 3000);
        }
    });
}

// checks for interesting keywords in the book title,
// and saves the link for us if necessary
function checkBookName(bookTitle, bookLink) {
    // a page may have fewer than 2 titles
    if (!bookTitle)
        return;

    // keywords must be lowercase, because the title is lowercased before matching
    var interestingKeywords = ['c#','java','nhibernate','windsor','ioc','dependency injection',
        'inversion of control','mysql'];
    for (var i=0; i<interestingKeywords.length; i++) {
        if (bookTitle.toLowerCase().indexOf(interestingKeywords[i]) !== -1) {
            // save the book title and link, one entry per line
            var a = bookTitle + ' => ' + bookLink + ';';
            fs.write('books.txt', a + '\n', 'a');
            console.log(a);
            break;
        }
    }
}
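The keyword check is the part of the script that is easiest to try out without PhantomJS, so here is a minimal standalone sketch of it. The helper name findKeyword and the sample titles are made up for illustration and are not part of the script above. Because the title is lowercased before the comparison, a keyword only matches when it is lowercase as well, so this sketch lowercases both sides.

```javascript
// Hypothetical helper (not part of the original script): returns the first
// keyword found in a title, or null when nothing matches.
function findKeyword(title, keywords) {
    if (typeof title !== 'string') {
        return null; // e.g. the page had no second book title
    }
    var lowerTitle = title.toLowerCase();
    for (var i = 0; i < keywords.length; i++) {
        // lowercase the keyword too, so 'C#' in the list still matches 'c#'
        if (lowerTitle.indexOf(keywords[i].toLowerCase()) !== -1) {
            return keywords[i];
        }
    }
    return null;
}

// made-up sample titles, for illustration only
var keywords = ['C#', 'java', 'mysql'];
console.log(findKeyword('Pro C# and the .NET Platform', keywords)); // C#
console.log(findKeyword('Learning MySQL', keywords));               // mysql
console.log(findKeyword('Gardening for Beginners', keywords));      // null
console.log(findKeyword(undefined, keywords));                      // null
```

Note that this is plain substring matching, so a keyword like 'java' also matches a title containing 'JavaScript'. For a quick personal crawl that is probably fine, but it is worth knowing when reading the results.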
And this is what the script looks like when running:
[Screenshot: console output of the script while running]
Just some notes on the script:
  • I added comments to try to make it as clear as possible. Feel free to contact me if it isn't.
  • I hid the real website name from the script for obvious reasons. This technique could be useful for a variety of things, but you should check first whether it is legal for the site you have in mind.
  • I also added an interval of 3 seconds between each page crawl, as another precaution against putting too much load on their site.
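The 3 second pause from the last note can be pulled out into a small reusable pattern. The following is only a sketch in plain JavaScript, not code from the script: the name visitSequentially and its schedule parameter are assumptions made for illustration. The schedule parameter defaults to setTimeout and exists only so the helper can be exercised without actually waiting.

```javascript
// Hypothetical helper: visits items one at a time and waits delayMs after
// each visit before moving on. When every item has been visited, the
// optional done callback is called.
function visitSequentially(items, delayMs, visit, done, schedule) {
    schedule = schedule || setTimeout;
    var index = 0;

    function next() {
        if (index >= items.length) {
            if (done) {
                done();
            }
            return;
        }
        visit(items[index]);
        index++;
        // never start the next visit before the delay has passed
        schedule(next, delayMs);
    }

    next();
}

// Example usage (left commented out so that nothing runs on load):
// var urls = ['your-favorite-ebook-site-goes-here/page/1',
//             'your-favorite-ebook-site-goes-here/page/2'];
// visitSequentially(urls, 3000, function (url) {
//     console.log('would crawl ' + url);
// });
```

Keeping the delay in one place makes it easy to raise it if the site turns out to be slow or sensitive to traffic.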
In order to use this script, or something like it, just go to the PhantomJS homepage, download it, and run this at the command line:
C:\your-phantomjs-lib\phantomjs your-script.js
Enjoy! :)

Reposted from: http://dzemb.baihongyu.com/
