Web scraping: a question about selecting tags with BeautifulSoup

2015-06-14 10:38:46 +08:00

redhatping

from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<table>
<tr>
<th> biaoti </th>
<td> hi </td>
</tr>

</table>

<table>

<tr>
<th> biaoti2  </th>
<th>  biaoti3 </th>
</tr>

<tr>
<td>  hello </td>
<td>  <a  href= 'http://www.sina.com/blog'> world </a>  </td>

</tr>



</table>


</body>
</html>
"""

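soup = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly
print(soup.td.find_all('a'))   # soup.td is the first td; how do I choose the second td?

print(soup.find_all(href=re.compile("blog")))  # match any tag whose href contains "blog"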

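The second approach finds the <a> tag easily.

The first approach obviously looks only at the first td, which of course contains no a. Is there another way to select it, without using re?

Also: how do I get Python syntax highlighting in V2EX Markdown?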


21 replies

lxy

2015-06-14 10:54:03 +08:00

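Attribute-style (dot) access only returns the first tag with that name. Calling soup.find_all('a') directly finds all a tags. The official site has detailed documentation in Chinese.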
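
A minimal sketch of the difference, assuming Python 3 and a trimmed-down, hypothetical version of the table from the question. Indexing the result of find_all('td') is one way to reach the second td:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modeled on the second table in the question.
html_doc = """
<table>
<tr><td>hello</td><td><a href="http://www.sina.com/blog">world</a></td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

first_td = soup.td               # dot access: only the first <td>
all_tds = soup.find_all("td")    # list-like ResultSet holding every <td>
second_td = all_tds[1]           # pick the second cell by index

links_in_first = first_td.find_all("a")    # nothing: the first cell has no link
links_in_second = second_td.find_all("a")  # the link lives in the second cell
all_links = soup.find_all("a")             # every <a> in the document

print(links_in_first, links_in_second, all_links)
```

Indexing by position is fragile when the page layout changes, so this is mostly useful for seeing why soup.td.find_all('a') comes back empty.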

redhatping

2015-06-14 11:01:04 +08:00

@lxy Right, but what I mean is: I want a way to find <a> tags that sit under a td.
If an <a> is not under a td, I don't need it.

lxy

2015-06-14 11:10:45 +08:00

The approach I can think of is link = soup.find_all('a'), then loop over it and check whether link[i].parent.name is 'td'.
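
The parent check could be sketched roughly like this, assuming Python 3. The markup and the names links and td_links are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link outside any table, one directly inside a <td>.
html_doc = """
<p><a href="http://example.com/outside">outside</a></p>
<table><tr><td><a href="http://www.sina.com/blog">world</a></td></tr></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all("a")

# Keep only the links whose immediate parent tag is a <td>.
td_links = [
    link for link in links
    if link.parent is not None and link.parent.name == "td"
]

print(td_links)
```

This only matches an <a> whose direct parent is a td. If the link may be wrapped in other tags inside the cell, checking link.find_parent("td") instead looks at any ancestor.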

redhatping

2015-06-14 11:13:18 +08:00

@lxy Thanks, that's one approach. Is there a slicker, faster way, anyone?

bianzhifu

2015-06-14 11:25:05 +08:00

I prefer pyquery.

zeroten

2015-06-14 12:23:51 +08:00

pyquery feels handier to me.

zztt168

2015-06-14 13:18:00 +08:00

Thanks @bianzhifu @zeroten for recommending pyquery.

imn1

2015-06-14 13:32:49 +08:00

Always lxml + xpath...

Well, actually always regex, and occasionally lxml.

Tink

2015-06-14 13:37:05 +08:00

xpath works well.
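
For comparison, a rough sketch of the same "a under td" query with lxml and xpath, assuming Python 3 with lxml installed. The markup is a hypothetical, shortened version of the one in the question:

```python
from lxml import html as lxml_html

# Hypothetical markup: one link outside the table, one directly inside a <td>.
html_doc = """
<html><body>
<p><a href="http://example.com/outside">outside</a></p>
<table><tr><td>hello</td><td><a href="http://www.sina.com/blog">world</a></td></tr></table>
</body></html>
"""

tree = lxml_html.fromstring(html_doc)

# //td/a selects every <a> that is a direct child of a <td>;
# appending /@href returns the attribute values instead of the elements.
hrefs = tree.xpath("//td/a/@href")

print(hrefs)
```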

ca1n

2015-06-14 14:01:42 +08:00

@lxy
@redhatping
findall['td'].findall['a'] Read the docs more.

redhatping

2015-06-14 16:07:27 +08:00

@ca1n Heh, that's wrong:
AttributeError: 'ResultSet' object has no attribute 'find_all'
The object has no such attribute.

realityone

2015-06-14 16:14:21 +08:00

@redhatping
@lxy
findAll

redhatping

2015-06-14 16:58:50 +08:00

@realityone There are only the find_all and find methods, no findall.

realityone

2015-06-14 17:26:24 +08:00

@redhatping Ah... so it's already bs4...

xjx0524

2015-06-14 17:30:05 +08:00

@imn1
@Tink
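I quite like xpath, but there's one problem and I'm not sure how you two handle it.
If an element is guaranteed to appear exactly once, I can just use a[0].
But when crawling pages in bulk, some page may not have that element, so I usually write
if a:
    a[0]  # ... do something with it
to check first. With many such cases the code gets ugly.
What's a good way to handle this?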
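
One possible way to cut down the repeated if checks is a small helper that returns the first item or a default. This is only a sketch: first is a hypothetical name, and plain lists stand in for the lists an xpath query returns:

```python
def first(items, default=None):
    """Return the first element of items, or default when items is empty."""
    return next(iter(items), default)


# Stand-ins for xpath results: one page has the element, the other does not.
found = ["title text"]
missing = []

title = first(found)                   # first element of a non-empty result
fallback = first(missing, default="")  # empty result: the default is returned
nothing = first(missing)               # empty result, no default given: None

print(title, repr(fallback), nothing)
```

The calling code still has to decide what a missing element means, but the check lives in one place instead of being repeated at every lookup.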

ca1n

2015-06-14 21:09:31 +08:00

@redhatping So I really have to throw the code in your face before you're happy

>>> for i in bs(html).find_all('td'):
... for x in i.find_all('a'):
... print x

There you go.

ca1n

2015-06-14 21:14:38 +08:00

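# bs is BeautifulSoup and html is the page source
for i in bs(html).find_all('td'):
    for x in i.find_all('a'):
        print(x)

The formatting above got mangled. If you tell me this one doesn't run, that's purely your own problem.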
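
The nested loop works where chaining failed because find_all returns a list-like ResultSet, and find_all itself is a method of each tag inside it. A self-contained sketch of the same idea as a list comprehension, assuming Python 3 and hypothetical, shortened markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the table in the question.
html_doc = """
<table>
<tr><td>hello</td><td><a href="http://www.sina.com/blog">world</a></td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

cells = soup.find_all("td")  # ResultSet: behaves like a list of Tag objects

# Call find_all on each cell and flatten the per-cell results into one list.
links = [a for td in cells for a in td.find_all("a")]

print(links)
```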

redhatping

2015-06-14 21:58:51 +08:00

@ca1n Of course it works.

runningteeth

2015-06-14 22:30:36 +08:00

soup.select('td > a')

http://stackoverflow.com/questions/25084485/beautifulsoup-findall-on-parent-child-tags
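
A short sketch of the CSS selector approach, assuming Python 3 and bs4 with select support. The markup is invented to show the difference between the child combinator and the descendant selector:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link outside any cell, one directly inside a <td>,
# and one wrapped in a <span> inside a <td>.
html_doc = """
<p><a href="/outside">outside</a></p>
<table><tr>
<td><a href="/direct">direct</a></td>
<td><span><a href="/nested">nested</a></span></td>
</tr></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

direct = soup.select("td > a")  # <a> tags that are direct children of a <td>
anywhere = soup.select("td a")  # <a> tags anywhere inside a <td>

direct_hrefs = [a["href"] for a in direct]
anywhere_hrefs = [a["href"] for a in anywhere]

print(direct_hrefs, anywhere_hrefs)
```

For the markup in the original question both selectors would match the same single link; the descendant form is the more forgiving one when cells contain extra wrappers.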

secondwtq

2015-06-14 23:08:59 +08:00

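Just passing by... I've been doing front-end work for the last couple of days (Polymer and so on), and my first reaction to this thread was document.querySelector...

I've used both soup and pyquery a little. pyquery is pleasant to use, but I seem to remember it has a pitfall, something like newlines not being handled correctly...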

Page 1 of 2


https://www.v2ex.com/t/198405
