How do I crawl an AngularJS site?

2015-03-08 23:18:46 +08:00
 fy
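As the title says: I want to crawl an entire wiki to keep as offline reference material, but it is built on this damned AngularJS...

The first difficulty is that pages can only be scraped by clicking links, so I have to simulate a user's browsing, going forward and back, to cover the whole site. Navigating directly to a URL makes the entire page take 5-20 seconds to reload.

The second difficulty is that there is no way to tell when AngularJS has finished loading. At first I naively wrote a small Chrome extension to confirm when loading was done. After painstakingly getting past the JS sandbox that Google imposes, I found that it does work, but it is not compatible with selenium... so I was stuck again.

Has anyone done this before? Please point me in the right direction!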
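
On the second difficulty (knowing when AngularJS has finished loading), one possible approach that stays inside selenium is to ask Angular itself through its testability hook. The sketch below is only an illustration, not something from the thread. It assumes the site runs AngularJS 1.3 or later (where `angular.getTestability` exists) and that the app is bootstrapped on `<html>` or `<body>`, neither of which has been verified for this site. `wait_for_angular` is a hypothetical helper name.

```python
# JavaScript executed inside the page. Selenium passes a callback as the
# last argument of an async script; calling it ends the wait.
WAIT_FOR_ANGULAR_JS = """
var done = arguments[arguments.length - 1];
if (!window.angular || !angular.getTestability) {
    // Assumption failed: no AngularJS 1.3+ on this page.
    done('no-angular');
    return;
}
// Assumption: the app root is <body> or one of its ancestors.
angular.getTestability(document.body).whenStable(function () {
    done('stable');
});
"""


def wait_for_angular(driver, timeout_seconds=30):
    """Block until Angular reports no outstanding requests.

    `driver` is any selenium WebDriver instance. Returns 'stable' on
    success, or 'no-angular' if the testability API is not available.
    """
    driver.set_script_timeout(timeout_seconds)
    return driver.execute_async_script(WAIT_FOR_ANGULAR_JS)
```

With a real driver this would be called right after each simulated click, for example `link.click()` followed by `wait_for_angular(driver)`, and only then would the page source be saved.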
4079 views
Node: Q&A
15 replies
AWSAM
2015-03-08 23:26:46 +08:00
phantomjs
14
2015-03-08 23:28:34 +08:00
Isn't its data fetched through ajax requests?
fy
2015-03-08 23:30:45 +08:00
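@AWSAM Any more details? How would I do that?

@14 Yes, but a single page makes hundreds of requests. Even if I could piece them together, it would be hard to know whether loading had finished!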
14
2015-03-08 23:38:02 +08:00
@fy Could you share the URL...
kmvan
2015-03-08 23:57:09 +08:00
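@fy "A single page makes hundreds of requests. Even if I could piece them together, it would be hard to know whether loading had finished!"
With that many requests, doesn't the server fall over under high concurrency?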
crazyxin1988
2015-03-09 00:04:09 +08:00
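Whatever the front end is, you should look at the network traffic and analyze the HTTP requests and responses.
I once simulated a single sign-on flow with all kinds of redirects. Once the network requests were untangled, everything became clear.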
frankzeng
2015-03-09 00:10:46 +08:00
For pages loaded by JS, look at the HTTP requests they make and parse the responses directly.
evlos
2015-03-09 00:18:27 +08:00
Even if a page makes a lot of requests, only a few of them will be API calls. Find its API address and the rest is very simple.
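
One way to act on the "look at the network requests" advice in the replies above is to save the browser's network log as a HAR file and list the requests that returned JSON. This is a rough sketch rather than anyone's actual method in this thread. It assumes the capture was exported in the standard HAR layout (`log.entries[].request.url` and `log.entries[].response.content.mimeType`). The function names and the file path in the usage note are hypothetical.

```python
import json


def load_har(path):
    """Read a HAR capture exported from the browser's network panel."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def json_api_urls(har):
    """Return the distinct request URLs whose response was JSON.

    Order of first appearance is preserved, so the list reads in the
    same order as the page issued the requests.
    """
    urls = []
    for entry in har["log"]["entries"]:
        mime = entry["response"]["content"].get("mimeType", "")
        url = entry["request"]["url"]
        if "json" in mime and url not in urls:
            urls.append(url)
    return urls
```

Usage would look like `json_api_urls(load_har("capture.har"))`. Scripts, images and stylesheets are filtered out, which makes it easier to see how many distinct API endpoints the page really depends on.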
fy
2015-03-09 00:29:27 +08:00
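@14 Sure: http://atenc.totalwar.com/#

@kmvan Maybe not quite that many, but it is still not the kind you can parse easily...

@evlos Unfortunately... they are all API requests. Separating the front end from the back end is the hallmark of an Angular site.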
randyzhao
2015-03-09 00:41:54 +08:00
phantomjs casperjs
fy
2015-03-09 01:14:34 +08:00
@randyzhao That's not really any different from selenium, is it...
13k
2015-03-09 01:24:39 +08:00
Keywords: phantomjs or selenium
fy
2015-03-09 08:06:37 +08:00
@13k Come on, please read the post first... Never mind, though. I have an idea and will see whether it works.
jason52
2015-03-09 08:43:42 +08:00
My videos haven't covered this yet...
azuginnen
2015-03-09 10:11:00 +08:00
What comes back is JSON
_______________________________________________________

GET /twe_at_en/att_rom_levis_armaturae?_nonce=jrZO5C6sdYKfyDcU HTTP/1.1
Host: attila-db.totalwar.com
Proxy-Connection: keep-alive
Accept: application/json
Origin: http://atenc.totalwar.com
User-Agent: Safari/536.36
Content-Type: application/json
Referer: http://atenc.totalwar.com/
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4,ja;q=0.2

_______________________________________________________

{
"_id": "att_rom_levis_armaturae",
"_rev": "1-392fb8fd27191a085ac7d3c5a2d95493",
"index": 421,
"campaign": "main_attila",
"additional_picture": "",
"name": "Levis Armaturae",
"next_unit": "att_rom_matiarii",
"picture": "att_rom_levis_armaturae.png",
"prev_unit": "att_rom_funditores",
"requires_region": [
"",
"",
"",
"",
"",
"",
// delete some
"",

""
],
"ability_block": [
"",
"enc_text_manual_battle_conflict_attributes_fatigue",
"enc_text_manual_battle_conflict_attributes_scrub",
"",
"",
"",
"",
"",
"",
"",
""
],
"ability_link": [
"",
"0086_enc_page_battle_play_phase_conflict_attributes",
"0086_enc_page_battle_play_phase_conflict_attributes",
"",
"",
"",
"",
"",
"",
"",
""
],
"ability_text": [
"As these men are inexperienced sailors. Caught up in sea combat, they suffer various penalties.",
"Fatigue has less of an effect on this unit.",
"This unit can hide in forests until enemy units get too close.",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
""
],
"ability_title": [
"Sea Sickness",
"Resistant to Fatigue",
"Hide (forest)",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
""
],
"class": "Missile Infantry",
"class_description": "Long range units who provide support for melee units, but are themselves very weak in close combat.",
"faction": [
"att_fact_illyricum",
"att_fact_macedonia",
"att_fact_aegyptus",
"att_fact_hispania",
"att_fact_gallia",
"att_fact_italia",
"att_fact_britannia",
"att_fact_dacia",
"att_fact_septem_provinciem",
"att_fact_africa",
"att_fact_eastern_roman_empire",
"att_fact_ostrogothi",
"att_fact_western_roman_empire",
"att_fact_oriens",
"att_fact_pontus",
"att_fact_asia",
"",
"",
"",
"",
"",
"",
""
],
"description": [
"Levis armaturae' literally translates as 'lightly armoured'. These flexible, sparsely-clad infantry formed a large part of late Roman Legions, acting primarily as a skirmishing force. They were used to harass the enemy and hurled slingshot, javelins and plumbatae - deadly lead darts - before falling back through the maniples of heavy infantry as they advanced. They could then continue skirmishing on the enemy's flanks. In this way, levis armaturae never gave the enemy pause to regroup or breathe; successful skirmishers maintained a defensive screen as an army manoeuvred, able to keep pressure on the enemy and an eye on their Legion's flanks.",
""
],
"next_key": "att_rom_matiarii",
"prev_key": "att_rom_funditores",
"requires_building_faction_id": [
"att_fact_eastern_roman_empire",
"att_fact_western_roman_empire",
"",
"",
"",
"",
"",
""
],
"requires_building_id": {
"1": {
"1": "att_bld_roman_east_military_1att_cult_romanatt_sub_cult_roman_east",
"2": "",
"3": "",
"4": ""
},
"2": {
"1": "att_bld_roman_west_military_1att_cult_romanatt_sub_cult_roman_west",
"2": "",
"3": "",
"4": ""
},
"3": {
"1": "",
"2": "",
"3": "",
"4": ""
},
"4": {
"1": "",
"2": "",
"3": "",
"4": ""
},
"5": {
"1": "",
"2": "",
"3": "",
"4": ""
},
"6": {
"1": "",
"2": "",
"3": "",
"4": ""
},
"7": {
"1": "",
"2": "",
"3": "",
"4": ""
},
"8": {
"1": "",
"2": "",
"3": "",
"4": ""
}
},
"strengths_and_weaknesses_title": "Strengths & Weaknesses",
"strengths_and_weaknesses": [
"Excellent Rate of Fire",
"Very Poor Armour",
"Low Ammunition"
],
"stat_label": [
"Recruitment Cost",
"Upkeep Cost",
"Melee Attack",
"Melee Damage",
"Charge Bonus",
"Melee Defence",
"Armour",
"Health",
"Morale",
"Speed",
"Missile Damage",
"Ammunition",
"Capture Power",
"Missile Block Chance",
"Rate of Fire",
"Spotting",
"Range",
"Hiding"
],
"stat_percentage": [
"",
"",
"4.16667",
"8",
"0.333333",
"25.8333",
"6.66667",
"22.2857",
"22.6667",
"33.3333",
"90",
"16",
"40",
"20",
"30.5",
"50",
"16",
"50"
],
"stat_value": [
"300",
"150",
"5",
"6",
"1",
"31",
"8",
"78",
"34",
"40",
"90",
"8",
"10",
"20",
"61",
"500",
"80",
"1"
],
"game": "at_lb",
"tag": "654012",
"typeof": "Units",
"collection": "Units",
"modifiedBy": "martin.haynes",
"date_created": "2015-03-03T08:50:46+00:00",
"operation": "updated",
"data_updated": "2015-03-03T08:50:46+00:00"
}
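
Since each document above carries a `next_unit` field, one possible way to collect the whole unit list without a browser is to fetch a document and keep following that link. The sketch below is only an illustration under several assumptions: the base URL is copied from the captured request, the `_nonce` query parameter is left out (whether the server requires it is unknown), and the `next_unit` chain is assumed to pass through every unit. `fetch_unit` and `walk_units` are hypothetical names.

```python
import json
import urllib.request

# Taken from the captured request above; may no longer be reachable.
BASE_URL = "http://attila-db.totalwar.com/twe_at_en/"


def fetch_unit(unit_id):
    """Download one document from the API and decode it as JSON."""
    request = urllib.request.Request(
        BASE_URL + unit_id,
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def walk_units(start_id, fetch=fetch_unit, limit=1000):
    """Follow next_unit links from start_id, returning {id: document}.

    Stops at an empty link, at an id that was already visited (the
    chain may loop back to the start), or after `limit` documents.
    """
    seen = {}
    unit_id = start_id
    while unit_id and unit_id not in seen and len(seen) < limit:
        document = fetch(unit_id)
        seen[unit_id] = document
        unit_id = document.get("next_unit") or document.get("next_key")
    return seen
```

Calling `walk_units("att_rom_levis_armaturae")` would start from the unit shown above. The `fetch` parameter is there so the walk can be tried against saved responses before hitting the real server, and adding a short delay between requests would be kinder to it.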

https://www.v2ex.com/t/175412
