Help: UnicodeEncodeError: 'ascii' codec can't encode character '\uff0c' in position 96: ordinal not in range(128)

2017-04-26 00:38:05 +08:00
 dinmshi001
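I've recently started learning Python to write crawlers, and this problem has been bugging me for a long time; I can't work out what is going wrong. The code comes from the open-source project IPProxyPool, deployed on Linux... It runs fine locally, but fails as soon as I deploy it to the server. Is it because the server's request headers are set to ASCII...? The code is below: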

# coding:utf-8
from gevent import monkey
monkey.patch_all()

import sys
import time
import gevent

from gevent.pool import Pool
from multiprocessing import Queue, Process, Value

from api.apiServer import start_api_server
from config import THREADNUM, parserList, UPDATE_TIME, MINNUM
from db.DataStore import store_data, sqlhelper
from spider.HtmlDownloader import Html_Downloader
from spider.HtmlPraser import Html_Parser
from validator.Validator import validator, getMyIP, detect_from_db

'''
This class describes the crawler's logic
'''

default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)

def startProxyCrawl(queue, db_proxy_num):
    crawl = ProxyCrawl(queue, db_proxy_num)
    crawl.run()


class ProxyCrawl(object):
    proxies = set()

    def __init__(self, queue, db_proxy_num):
        self.crawl_pool = Pool(THREADNUM)
        self.queue = queue
        self.db_proxy_num = db_proxy_num

    def run(self):
        while True:
            self.proxies.clear()
            str = 'IPProxyPool----->>>>>>>>beginning'
            ## the traceback points to the line below
            sys.stdout.write(str + "\r\n")
            ## the traceback points to the line above
            sys.stdout.flush()
            proxylist = sqlhelper.select()
            myip = getMyIP()
            spawns = []
            for proxy in proxylist:
                spawns.append(gevent.spawn(detect_from_db, myip, proxy, self.proxies))
            gevent.joinall(spawns)
            self.db_proxy_num.value = len(self.proxies)
            str = 'IPProxyPool----->>>>>>>>db exists ip:%d' % len(self.proxies)

            if len(self.proxies) < MINNUM:
                str += '\r\nIPProxyPool----->>>>>>>>now ip num < MINNUM,start crawling...'
                sys.stdout.write(str + "\r\n")
                sys.stdout.flush()
                self.crawl_pool.map(self.crawl, parserList)
            else:
                str += '\r\nIPProxyPool----->>>>>>>>now ip num meet the requirement , wait UPDATE_TIME...'
                sys.stdout.write(str + "\r\n")
                sys.stdout.flush()

            time.sleep(UPDATE_TIME)

    def crawl(self, parser):
        html_parser = Html_Parser()
        for url in parser['urls']:
            response = Html_Downloader.download(url)
            if response is not None:
                proxylist = html_parser.parse(response, parser)
                if proxylist is not None:
                    for proxy in proxylist:
                        proxy_str = '%s:%s' % (proxy['ip'], proxy['port'])
                        if proxy_str not in self.proxies:
                            self.proxies.add(proxy_str)
                            self.queue.put(proxy)


if __name__ == "__main__":
    DB_PROXY_NUM = Value('i', 0)
    q1 = Queue()
    q2 = Queue()
    p0 = Process(target=start_api_server)
    p1 = Process(target=startProxyCrawl, args=(q1, DB_PROXY_NUM))
    p2 = Process(target=validator, args=(q1, q2))
    p3 = Process(target=store_data, args=(q2, DB_PROXY_NUM))

    p0.start()
    p1.start()
    p2.start()
    p3.start()

    # spider = ProxyCrawl()
    # spider.run()
2650 views
Node: Python
8 replies
sagaxu
2017-04-26 00:39:37 +08:00
Life is short; stay away from Python 2.
dinmshi001
2017-04-26 00:40:45 +08:00
@sagaxu I'm using Python 3.5.1 = =!
sagaxu
2017-04-26 00:42:21 +08:00
@dinmshi001 You're calling sys.setdefaultencoding in Python 3.5?
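For context: on Python 3 the default string encoding is always UTF-8 and `sys.setdefaultencoding` no longer exists, so the `if` guard in the posted code never fires there. It is also a separate setting from `sys.stdout.encoding`, which follows the environment. A quick check, assuming a Python 3 interpreter (the helper name `describe_default_encoding` is made up for illustration):

```python
import sys


def describe_default_encoding():
    """Return (default encoding, whether sys.setdefaultencoding exists)."""
    return sys.getdefaultencoding(), hasattr(sys, "setdefaultencoding")


encoding, has_setter = describe_default_encoding()
# The default encoding is not what stdout uses; print both to compare.
print(encoding, has_setter, sys.stdout.encoding)
```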
sagaxu
2017-04-26 00:44:17 +08:00
dinmshi001
2017-04-26 00:58:08 +08:00
@sagaxu How strange... I removed sys.setdefaultencoding and it still failed. I wanted to see what str was, so I added print(str) right above the failing line, and then it worked...
raysonx
2017-04-26 01:37:33 +08:00
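Because on Python 3 sys.stdout.write takes a str and encodes it with stdout's encoding, which is ASCII here. Write UTF-8 bytes to the underlying buffer instead:
sys.stdout.buffer.write((str + "\r\n").encode("utf-8"))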
mec
2017-04-26 09:19:38 +08:00
Change the locale, or change the code as the reply above suggests.
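Both suggestions address the same cause: `sys.stdout` is a text stream that takes its encoding from the environment, and under a C/POSIX locale on Python 3.5 that encoding is ASCII. A minimal sketch of a code-side fix, assuming you would rather not touch the server locale. `utf8_text_stream` is a hypothetical helper, and the `io.BytesIO` objects stand in for `sys.stdout.buffer` so the example is self-contained:

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")


message = "now ip num < MINNUM\uff0cstart crawling...\r\n"

# Stand-in for an ASCII-only stdout, as seen under a C/POSIX locale.
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii", newline="")
try:
    ascii_out.write(message)
    ascii_out.flush()
    failed = False
except UnicodeEncodeError:
    failed = True

# The same text goes through once the wrapper encodes as UTF-8.
raw = io.BytesIO()
utf8_out = utf8_text_stream(raw)
utf8_out.write(message)
utf8_out.flush()
written = raw.getvalue()
```

In the real script this would amount to `sys.stdout = utf8_text_stream(sys.stdout.buffer)` near the top. The environment-side alternative is to start the process with `PYTHONIOENCODING=utf-8`, or with a UTF-8 locale if one is installed on the server.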
romanticbao
2017-05-11 03:36:19 +08:00
Could you print the data with repr and take a look?
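One caveat on Python 3: `repr()` keeps printable non-ASCII characters as they are, so printing its result to an ASCII stdout can fail in the same way, whereas the built-in `ascii()` escapes them. A small sketch for locating the offending characters, where `non_ascii_positions` is a hypothetical helper and the sample string is invented to contain the fullwidth comma from the error message:

```python
def non_ascii_positions(text):
    """Return (index, escaped character) pairs for every non-ASCII character."""
    return [(i, ascii(ch)) for i, ch in enumerate(text) if ord(ch) > 127]


sample = "now ip num < MINNUM\uff0cstart crawling..."
# ascii() output contains only ASCII, so it is safe to print anywhere.
print(ascii(sample))
print(non_ascii_positions(sample))
```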

https://www.v2ex.com/t/357361
