
Help: UnicodeEncodeError: 'ascii' codec can't encode character '\uff0c' in position 96: ordinal not in range(128)

    dinmshi001 · 2017-04-26 00:38:05 +08:00 · 2635 views
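    I've recently been learning Python to write web crawlers, and this problem has been bothering me for a long time; I still can't tell where it comes from. The code is from the open-source project IPProxyPool, deployed on Linux. It runs fine on my local machine, but it fails as soon as I deploy it to the server. Is it because the server's request headers are set to ASCII? The code is below: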

    # coding:utf-8
    from gevent import monkey
    monkey.patch_all()

    import sys
    import time
    import gevent

    from gevent.pool import Pool
    from multiprocessing import Queue, Process, Value

    from api.apiServer import start_api_server
    from config import THREADNUM, parserList, UPDATE_TIME, MINNUM
    from db.DataStore import store_data, sqlhelper
    from spider.HtmlDownloader import Html_Downloader
    from spider.HtmlPraser import Html_Parser
    from validator.Validator import validator, getMyIP, detect_from_db

    '''
    This class describes the crawler's logic
    '''

    default_encoding = 'utf-8'
    if sys.getdefaultencoding() != default_encoding:
        reload(sys)
        sys.setdefaultencoding(default_encoding)

    def startProxyCrawl(queue, db_proxy_num):
        crawl = ProxyCrawl(queue, db_proxy_num)
        crawl.run()


    class ProxyCrawl(object):
        proxies = set()

        def __init__(self, queue, db_proxy_num):
            self.crawl_pool = Pool(THREADNUM)
            self.queue = queue
            self.db_proxy_num = db_proxy_num

        def run(self):
            while True:
                self.proxies.clear()
                str = 'IPProxyPool----->>>>>>>>beginning'
                ## The traceback says the line below is where it fails
                sys.stdout.write(str + "\r\n")
                ## The traceback says the line above is where it fails
                sys.stdout.flush()
                proxylist = sqlhelper.select()
                myip = getMyIP()
                spawns = []
                for proxy in proxylist:
                    spawns.append(gevent.spawn(detect_from_db, myip, proxy, self.proxies))
                gevent.joinall(spawns)
                self.db_proxy_num.value = len(self.proxies)
                str = 'IPProxyPool----->>>>>>>>db exists ip:%d' % len(self.proxies)

                if len(self.proxies) < MINNUM:
                    str += '\r\nIPProxyPool----->>>>>>>>now ip num < MINNUM,start crawling...'
                    sys.stdout.write(str + "\r\n")
                    sys.stdout.flush()
                    self.crawl_pool.map(self.crawl, parserList)
                else:
                    str += '\r\nIPProxyPool----->>>>>>>>now ip num meet the requirement , wait UPDATE_TIME...'
                    sys.stdout.write(str + "\r\n")
                    sys.stdout.flush()

                time.sleep(UPDATE_TIME)

        def crawl(self, parser):
            html_parser = Html_Parser()
            for url in parser['urls']:
                response = Html_Downloader.download(url)
                if response is not None:
                    proxylist = html_parser.parse(response, parser)
                    if proxylist is not None:
                        for proxy in proxylist:
                            proxy_str = '%s:%s' % (proxy['ip'], proxy['port'])
                            if proxy_str not in self.proxies:
                                self.proxies.add(proxy_str)
                                self.queue.put(proxy)


    if __name__ == "__main__":
        DB_PROXY_NUM = Value('i', 0)
        q1 = Queue()
        q2 = Queue()
        p0 = Process(target=start_api_server)
        p1 = Process(target=startProxyCrawl, args=(q1, DB_PROXY_NUM))
        p2 = Process(target=validator, args=(q1, q2))
        p3 = Process(target=store_data, args=(q2, DB_PROXY_NUM))

        p0.start()
        p1.start()
        p2.start()
        p3.start()

        # spider = ProxyCrawl()
        # spider.run()
    8 replies    2017-05-11 03:36:19 +08:00
    #1 · sagaxu · 2017-04-26 00:39:37 +08:00
    Life is short; stay away from Python 2.
    #2 · dinmshi001 (OP) · 2017-04-26 00:40:45 +08:00
    @sagaxu I'm using Python 3.5.1 = =!
    #3 · sagaxu · 2017-04-26 00:42:21 +08:00
    @dinmshi001 You're calling sys.setdefaultencoding in Python 3.5?
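Some context for the question above, assuming a Python 3 interpreter as the OP reports: there `sys.getdefaultencoding()` always returns `'utf-8'`, so the `if` guard in the posted script is never entered and the Python 2-only `reload(sys)` / `sys.setdefaultencoding` lines never run. That default is also separate from the encoding used by `sys.stdout`. A minimal sketch that checks this (the helper name `default_encoding_hack_is_needed` is made up for illustration):

```python
import sys


def default_encoding_hack_is_needed(target="utf-8"):
    """Return True only if the interpreter's default string encoding differs from target."""
    return sys.getdefaultencoding() != target


# On Python 3 the default string encoding is fixed at UTF-8, so the guard is False ...
needed = default_encoding_hack_is_needed()
# ... and the Python 2 hook for changing it is not available on the sys module.
has_hook = hasattr(sys, "setdefaultencoding")

print("hack needed:", needed)
print("sys.setdefaultencoding available:", has_hook)
```

If this reading is right, removing those lines would not be expected to change the behaviour, since they were never executed in the first place.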
    #5 · dinmshi001 (OP) · 2017-04-26 00:58:08 +08:00
    @sagaxu How strange... I removed sys.setdefaultencoding and it still failed. I wanted to see what str contained, so I added print(str) just above the failing line, and then it worked...
    #6 · raysonx · 2017-04-26 01:37:33 +08:00 via iPad
    Because sys.stdout.write takes bytes as its argument, not a string.
    sys.stdout.write((str + "\r\n").encode("utf-8"))
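A note on the suggestion above, assuming Python 3 as the OP reports: there `sys.stdout` is a text stream whose `write()` expects `str` (passing `bytes` raises `TypeError`), and pre-encoded bytes go through `sys.stdout.buffer` instead. The sketch below simulates an ASCII-configured stdout with `io.TextIOWrapper`; `write_utf8` is a hypothetical helper, not part of the thread or of IPProxyPool, and the message is an illustrative string containing the fullwidth comma from the error.

```python
import io


def write_utf8(stream, text):
    """Write text as UTF-8 bytes to the binary layer underneath a text stream.

    `stream` must be a text stream that exposes `.buffer`, such as sys.stdout.
    (Hypothetical helper for illustration.)
    """
    stream.flush()  # keep ordering with anything already written as text
    stream.buffer.write(text.encode("utf-8"))
    stream.buffer.flush()


# Simulate a server whose stdout was opened with the ASCII codec.
# newline="" disables newline translation so the bytes are platform independent.
raw = io.BytesIO()
ascii_stdout = io.TextIOWrapper(raw, encoding="ascii", newline="")

message = "IPProxyPool\uff0cbeginning\r\n"  # contains a fullwidth comma (U+FF0C)

# Writing the text directly fails, just like the traceback in the question.
try:
    ascii_stdout.write(message)
    ascii_stdout.flush()
    text_write_failed = False
except UnicodeEncodeError:
    text_write_failed = True

# Writing UTF-8 bytes to the underlying buffer bypasses the ASCII codec.
write_utf8(ascii_stdout, message)
```

With the real stream this would be called as `write_utf8(sys.stdout, str + "\r\n")`. Whether the terminal or log file then displays the bytes correctly still depends on how it interprets UTF-8.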
    #7 · mec · 2017-04-26 09:19:38 +08:00
    Change the locale, or change the code as the reply above suggests.
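One way to act on the locale suggestion, sketched under the assumption that the server starts Python with an ASCII-only locale (for example `LANG=C`): first inspect which encodings are in effect, then, if the environment cannot be changed, rewrap the binary stdout in a UTF-8 text stream. The helper names `describe_output_encoding` and `utf8_text_stream` are made up for this example.

```python
import io
import locale
import sys


def describe_output_encoding():
    """Report the encodings that decide how text written to stdout is encoded."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "locale": locale.getpreferredencoding(False),
    }


def utf8_text_stream(binary_stream):
    """Wrap a binary stream in a text stream that always encodes as UTF-8."""
    return io.TextIOWrapper(
        binary_stream, encoding="utf-8", newline="", line_buffering=True
    )


print(describe_output_encoding())

# Demonstrate on an in-memory buffer rather than the real stdout.
raw = io.BytesIO()
out = utf8_text_stream(raw)
out.write("\uff0c")  # the fullwidth comma from the error message
out.flush()
```

In a real script the wrapper would be installed once at startup with `sys.stdout = utf8_text_stream(sys.stdout.buffer)`. The environment-level alternatives are to export a UTF-8 locale before launching (for example `LANG=en_US.UTF-8`, assuming that locale is installed on the server) or to set the `PYTHONIOENCODING=utf-8` environment variable.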
    #8 · romanticbao · 2017-05-11 03:36:19 +08:00
    Could you print the data with repr and take a look?
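A sketch of that inspection step, with one caveat for Python 3: `repr()` leaves printable non-ASCII characters as they are, so printing it to an ASCII stdout can fail in the same way, while the built-in `ascii()` escapes them. The sample string and the helper `non_ascii_positions` are illustrative assumptions, not the OP's actual data.

```python
import unicodedata


def non_ascii_positions(text):
    """Return (index, Unicode name) for every character outside the ASCII range."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(text)
        if ord(char) > 127
    ]


# Illustrative message containing the character named in the error, U+FF0C.
sample = "now ip num < MINNUM\uff0cstart crawling..."

escaped = ascii(sample)  # ASCII-only, safe to print on any stdout
positions = non_ascii_positions(sample)

print(escaped)
print(positions)
```

Run against the real string, the reported index should line up with the "position 96" in the traceback, which would show exactly which character the ASCII codec rejects.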