A question about using findAll in BeautifulSoup

2016-09-10 12:17:36 +08:00

xucuncicero

The page I'm parsing contains both

<p class="left title"><a href="xxxx">xxxx</a></p>

and

<p class="title"><a href="xxxx">xxxx</a></p>

If I use

soup.findAll('p', {'class': 'title'})

both elements are returned. I only want the one whose class is exactly "title". How should I write this?


8 replies

assassinleo

2016-09-10 12:45:24 +08:00

Try this and see if it works: soup.find_all('p', class="title") or soup.find_all('p', class=re.compile("title"))

ref: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all

xucuncicero

2016-09-10 15:37:49 +08:00

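@assassinleo That doesn't work. It's just a different way of writing the same query, and the second one has a syntax error (it should be class_). The other variants I tried give the same result.

The problem is that any find_all or findAll argument that matches "title" will also match "left title". Is there a way to write the query so that a particular string is excluded?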
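For reference, a minimal sketch of what is going on here. Beautiful Soup treats class as a multi-valued attribute, so a search for "title" matches any tag whose class list contains it. One way to require an exact match is to pass find_all a function that inspects the whole tag. The link texts "first" and "second" and the helper name only_title are made up for illustration; this is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; link texts are hypothetical.
html = """
<p class="left title"><a href="xxxx">first</a></p>
<p class="title"><a href="xxxx">second</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches both tags, because "title" appears in each tag's class list.
loose = soup.find_all("p", {"class": "title"})

def only_title(tag):
    """Hypothetical helper: True for <p> tags whose class list is exactly ['title']."""
    return tag.name == "p" and tag.get("class") == ["title"]

# Passing a function makes find_all test each tag as a whole.
exact = soup.find_all(only_title)
```

Here loose holds both paragraphs, while exact holds only the second one.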

caspartse

2016-09-10 16:19:17 +08:00

soup.select('p[class=title]')
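A small self-contained sketch of why this selector is stricter than a class selector: the attribute form p[class=...] compares against the whole class string, while p.title only checks that "title" is one of the classes. The sample markup and the link texts are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; link texts are hypothetical.
html = """
<p class="left title"><a href="xxxx">first</a></p>
<p class="title"><a href="xxxx">second</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Class selector: matches any <p> that has "title" among its classes.
by_class = soup.select("p.title")

# Attribute selector: matches only when the full class value equals "title".
by_attr = soup.select('p[class="title"]')
```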

xiahei

2016-09-10 16:23:10 +08:00

You don't have to use `find_all()`; `select()` works too.

7sDream

2016-09-10 16:26:24 +08:00

http://7sdream-rikka-demo.daoapp.io/files/2016-09-10-919783703

If you can't see the image, copy the link and open it in a browser.

I tried this and it works.

Hope it helps.

judyApple

2016-09-11 04:29:44 +08:00

Adding an if statement seems to work. itDic.attrs returns a dict; requiring the value of its 'class' entry to have length 1 filters out "left title".
for itDic in soup.findAll("p", {"class": "title"}):
    if len(itDic.attrs['class']) == 1:
        print(itDic.attrs)
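The same idea as a self-contained sketch, written as a list comprehension. It relies on find_all having already required "title", so a class list of length 1 can only be ['title']. The sample markup and link texts are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; link texts are hypothetical.
html = """
<p class="left title"><a href="xxxx">first</a></p>
<p class="title"><a href="xxxx">second</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only the tags whose class list has a single entry.
exact = [
    tag
    for tag in soup.find_all("p", {"class": "title"})
    if len(tag.attrs["class"]) == 1
]
```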

aihimmel

2016-09-11 10:32:11 +08:00

XPath is the way to go.
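The reply doesn't name a library, so the following is only a sketch under assumptions: it uses the limited XPath subset in the standard library's xml.etree.ElementTree, which needs well-formed markup (real-world HTML would usually call for a dedicated HTML parser with XPath support). The wrapping <div> and the link texts are added for illustration. An attribute predicate such as [@class='title'] compares the whole attribute string, so "left title" is not matched.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup modeled on the question; the <div> wrapper
# and the link texts are hypothetical.
markup = (
    "<div>"
    '<p class="left title"><a href="xxxx">first</a></p>'
    '<p class="title"><a href="xxxx">second</a></p>'
    "</div>"
)
root = ET.fromstring(markup)

# The predicate compares the full class attribute value with "title".
matches = root.findall(".//p[@class='title']")
texts = [p.find("a").text for p in matches]
```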

xucuncicero

2016-09-11 11:24:17 +08:00

@caspartse
@xiahei
@7sDream Thanks a lot, simple and efficient.

@judyApple Thanks to you as well.

@aihimmel Looks like I need to learn more.

https://www.v2ex.com/t/305253

V2EX 是创意工作者们的社区，是一个分享自己正在做的有趣事物、交流想法，可以遇见新朋友甚至新机会的地方。

V2EX is a community of developers, designers and creative people.