Analysing my own browsing habits in Chrome with Python

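Today is the second day of the Lunar New Year. I got back from a day out with nothing to do, and while browsing around it struck me that I'm never focused enough when I'm programming: I keep clicking here one moment and there the next. So I wanted to look through my browsing history and see which pages I visit regularly and which few I visit the most. I use Chrome, but its history page doesn't seem to offer anything like that, so I decided to write it myself.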
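Chrome keeps its history under ~/.config/google-chrome/Default/. Opening the History file there in an editor shows gibberish; it turned out to be stored as an SQLite db, which makes things easy. I opened the History file and listed its tables. There are quite a few, and one called urls caught my eye: it looked like the place where visited URLs are stored, and it even has a visit_count column. That's the one, so time to select. select * from urls order by visit_count desc limit 0,10? Most of the results turned out to be pages from the same site. Fine, so I also wanted to see the totals per host. That means first extracting the host from each url, then summing the counts of identical hosts, then sorting and printing.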
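The exploration described above can be sketched with the standard sqlite3 module. This is only a minimal sketch, not the exact commands I typed at the time: the helper names list_tables and top_urls are made up for illustration, and it assumes the urls table has url and visit_count columns as observed above. Note the DESC in the query: without it the rows come back least-visited first.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path."""
    cn = sqlite3.connect(db_path)
    try:
        rows = cn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        cn.close()
    return [r[0] for r in rows]


def top_urls(db_path, n=10):
    """Return the n most-visited (url, visit_count) rows, highest count first."""
    cn = sqlite3.connect(db_path)
    try:
        return cn.execute(
            "SELECT url, visit_count FROM urls "
            "ORDER BY visit_count DESC LIMIT ?",
            (n,),
        ).fetchall()
    finally:
        cn.close()
```

Calling top_urls on the expanded path of ~/.config/google-chrome/Default/History should give the ten most-visited pages. If Chrome is running, the file may be locked; querying a copy of it is a common workaround.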

#!/usr/bin/env python3
'''
Analyse the user's Chrome browsing behaviour.
'''
import sqlite3
from urllib.parse import urlparse


class AnalyseChrome:
    '''
    Chrome stores its history in an SQLite database. On Ubuntu the default
    location is ~/.config/google-chrome/Default/History.
    '''

    def __init__(self, db="/home/lijun/.config/google-chrome/Default/History"):
        '''Init AnalyseChrome with the path to the Chrome history db.'''
        self.cn = sqlite3.connect(db)
        self.cu = self.cn.cursor()

    def get_sql_res(self, sql, params=()):
        '''Run a query; return (rows, "") on success, (None, errmsg) on failure.'''
        try:
            self.cu.execute(sql, params)
        except sqlite3.Error as e:
            print(e)
            return None, str(e)
        return self.cu.fetchall(), ""

    def show_table(self, name="%"):
        '''Show the tables in the History db whose name matches `name`.'''
        # Bind the pattern as a parameter instead of formatting it into the SQL.
        sql = "SELECT * FROM sqlite_master WHERE type='table' AND name LIKE ?;"
        return self.get_sql_res(sql, (name,))

    def clear(self):
        '''Close the database connection.'''
        self.cn.close()

    def top_n(self, n, orderby="host"):
        '''
        Return the top n urls or hosts the user visits most often.
        Groups by host by default.
        '''
        if orderby not in ("host", "url"):
            return None, "error argv in top_n"
        sql = "select url,visit_count from urls;"
        res, errmsg = self.get_sql_res(sql)
        if res is None:
            return None, errmsg
        # First select every (url, visit_count) row from the urls table,
        # then build the per-host (or per-url) totals myself,
        # sort them in Python and return the top n.
        # The totals go into a dict rather than comparing adjacent rows:
        # sorted by url, the http:// and https:// pages of one host are
        # not adjacent, so the host would be counted twice.
        # Maybe not the quickest or simplest way. A max heap? My history
        # is not that big.
        counts = {}
        for url, visit_count in res:
            key = urlparse(url).netloc if orderby == "host" else url
            if not key:
                continue
            counts[key] = counts.get(key, 0) + visit_count
        uniq_res = sorted(counts.items(), key=lambda x: x[1], reverse=True)
        return [list(i) for i in uniq_res[:n]], ""


if __name__ == "__main__":
    ac = AnalyseChrome()

    tb, errormsg = ac.show_table('urls')
    if tb:
        for i in tb:
            print(i)

    res, errormsg = ac.top_n(20, "host")
    if res:
        for no, i in enumerate(res, 1):
            print(no, i)
    else:
        print(errormsg)
    ac.clear()
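The comment in top_n wonders whether a heap would do better than sorting everything. As a rough sketch (not part of the script above), the standard heapq.nlargest picks out the n largest totals without sorting the whole list. The function name top_n_hosts and its input, an iterable of (url, visit_count) pairs, are assumptions made for illustration.

```python
import heapq
from collections import Counter
from urllib.parse import urlparse


def top_n_hosts(rows, n):
    """rows: iterable of (url, visit_count) pairs.

    Return the n (host, total) pairs with the largest totals, largest first.
    """
    totals = Counter()
    for url, visit_count in rows:
        host = urlparse(url).netloc
        if host:  # skip URLs without a network location, e.g. file:///...
            totals[host] += visit_count
    # nlargest keeps only n candidates in a heap while scanning the totals.
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])
```

For a history as small as mine the difference is unlikely to matter; the plain sort is easier to read.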
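That's a start. Later on I could also work out each host's share of the visits, how the visits break down over a given period of time, and so on.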
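Both follow-ups can be sketched briefly. The names host_shares and chrome_time_to_datetime are hypothetical. The second function assumes that timestamps in the History db are microseconds since 1601-01-01 UTC, which is how Chrome is generally described as storing them; check a few values against your own database before relying on it.

```python
from datetime import datetime, timedelta, timezone

# Assumed epoch of Chrome's timestamps: 1601-01-01 00:00:00 UTC.
CHROME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def chrome_time_to_datetime(microseconds):
    """Convert a Chrome timestamp (assumed microseconds since 1601) to a datetime."""
    return CHROME_EPOCH + timedelta(microseconds=microseconds)


def host_shares(host_counts):
    """host_counts: list of (host, count) pairs, e.g. the result of top_n.

    Return a list of (host, percent), where percent is the host's share of
    the total count across the given pairs.
    """
    total = sum(count for _, count in host_counts)
    if total == 0:
        return []
    return [(host, 100.0 * count / total) for host, count in host_counts]
```

With the conversion in place, visits in a given period could be selected by comparing converted timestamps against the start and end of that period.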
