How should I clean up garbled text like this?

2019-12-30 11:22:53 +08:00
 RicardoY

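I have tried many encodings and none of them displays the text correctly. Judging from the context, I suspect these are emoji or something similar.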

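Is there a good way to handle garbled text like this? If I want to delete it, I first have to define what it is... One simple idea is to check whether each character is in ASCII and delete it if it is not. Is there a better way?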

1722 views
Node: Q&A
9 replies
mayx
2019-12-30 11:28:43 +08:00
Use a regular expression, I'd say.
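A minimal sketch of the non-ASCII filter from the question, written as a regular expression. The names `NON_ASCII` and `strip_non_ascii` are made up for illustration, and the sample string is built by hand rather than taken from the poster's file.

```python
import re

# Matches runs of characters outside the 7-bit ASCII range (U+0000 to U+007F).
NON_ASCII = re.compile(r'[^\x00-\x7F]+')


def strip_non_ascii(text: str) -> str:
    """Delete every non-ASCII character from text."""
    return NON_ASCII.sub('', text)


# Hand-built sample: the UTF-8 bytes of an emoji, misread as ISO-8859-1.
sample = 'all the time in ur\u00f0\u009f\u0093\u00b1!!!'
print(strip_non_ascii(sample))
```

This is a blunt tool: it also deletes legitimate accented letters and correctly decoded emoji, so it only suits cases where pure ASCII output is acceptable.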
lululau
2019-12-30 11:34:37 +08:00
Post the text.
lqs
2019-12-30 11:36:43 +08:00
My guess is that the emoji were encoded as UTF-8 and then decoded as ISO-8859-1. Can you post the garbled text so we can take a look?
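To illustrate that guess, here is a small sketch of how one emoji would turn into four junk characters under that assumption. The variable names are arbitrary, and the emoji is just an example, not taken from the file.

```python
emoji = '\U0001F4F1'  # MOBILE PHONE emoji, a single code point

raw = emoji.encode('utf-8')         # four bytes: f0 9f 93 b1
garbled = raw.decode('iso-8859-1')  # each byte becomes one character

print(len(emoji), len(raw), len(garbled))  # 1 4 4
print(repr(garbled))

# The damage is reversible as long as no bytes were lost along the way.
restored = garbled.encode('iso-8859-1').decode('utf-8')
print(restored == emoji)  # True
```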
chairuosen
2019-12-30 11:52:40 +08:00
Scraped from Twitter, right? Whatever follows the exclamation marks is surely emoji.
RicardoY
2019-12-30 12:43:34 +08:00
@lululau @lqs

The text is here:

Link: https://pan.baidu.com/s/1rtelRvHyHldPmB9a-7W0Lg Extraction code: wm3f
RicardoY
2019-12-30 12:43:52 +08:00
@lqs I opened it as UTF-8.
lqs
2019-12-30 13:05:49 +08:00
@RicardoY

Just as I guessed.

$ head train_E6oV3lV.csv |iconv -f utf8 -t iso-8859-1
id,label,tweet
1,0, @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
3,0, bihday your majesty
4,0,#model i love u take with u all the time in ur📱!!! 😙😎👄👅💦💦💦



>>> print(open('train_E6oV3lV.csv').read(1000).decode('utf8').encode('iso-8859-1').decode('utf8'))
id,label,tweet
1,0, @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
3,0, bihday your majesty
4,0,#model i love u take with u all the time in ur📱!!! 😙😎👄👅💦💦💦
5,0, factsguide: society now #motivation
6,0,[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo
7,0, @user camping tomorrow @user @user @user @user @user @user @user danny…
8,0,the next school year is the year for exams.😯 can't think about that 😭 #school #exams #hate #imagine #actorslife #revolutionschool #girl
9,0,we won!!! love the land!!! #allin #cavs #champions #cleveland #clevelandcavaliers …
10,0, @user @user welcome here ! i'm it's so #gr8 !
11,0, ↝ #ireland consumer price index (mom)
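The one-liner above relies on Python 2, where `str` has a `.decode` method. A rough Python 3 equivalent might look like the sketch below. `fix_mojibake` and `fix_file` are hypothetical helper names, and the fallback of leaving unrepairable text untouched is an assumption, not something from the thread.

```python
def fix_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes decoded as ISO-8859-1'; return text unchanged on failure."""
    try:
        # Map each character back to its original byte, then decode properly.
        return text.encode('iso-8859-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or only partly damaged): leave it alone.
        return text


def fix_file(src: str, dst: str) -> None:
    """Repair a file line by line; both paths are supplied by the caller."""
    with open(src, encoding='utf-8', newline='') as fin, \
         open(dst, 'w', encoding='utf-8', newline='') as fout:
        for line in fin:
            fout.write(fix_mojibake(line))


# Hand-built demonstration string, not read from the real CSV.
damaged = 'in ur\U0001F4F1!!!'.encode('utf-8').decode('iso-8859-1')
print(fix_mojibake(damaged))
```

Working line by line keeps one unrepairable line from aborting the whole file. A call such as `fix_file('train_E6oV3lV.csv', 'train_fixed.csv')` would write the repaired copy, where the output name is made up.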
RicardoY
2019-12-30 13:20:57 +08:00
@lqs
I'd like to ask a bit more about what causes this. Is it because UTF-8 and ISO-8859-1 support different character sets?
ipwx
2019-12-30 13:27:20 +08:00
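@RicardoY As I recall, that ISO encoding is a Western European encoding with a character set of 256 characters. In other words, any encoded binary text, whatever its real encoding, can be read as if it were Western European text. Those 256 characters were then encoded as UTF-8 again, since every Western European character is included in the Unicode code table...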

The above is just my guess. I haven't even looked at your sample, since I don't have a computer at hand.
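The guess above can be checked with a few lines of Python. This sketch relies on Python's `iso-8859-1` codec, which maps all 256 byte values one-to-one onto U+0000 through U+00FF; the variable names are arbitrary.

```python
all_bytes = bytes(range(256))

# Any byte sequence decodes as ISO-8859-1 without error, and losslessly.
as_text = all_bytes.decode('iso-8859-1')
print([ord(c) for c in as_text] == list(range(256)))  # True
print(as_text.encode('iso-8859-1') == all_bytes)      # True

# UTF-8 is stricter: arbitrary bytes are usually not valid UTF-8.
try:
    all_bytes.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False

# Re-encoding the misread text as UTF-8 gives the "double-encoded" file:
original = '\U0001F62D'.encode('utf-8')                 # 4 bytes
double = original.decode('iso-8859-1').encode('utf-8')  # 8 bytes
print(len(original), len(double))  # 4 8
```

So the character sets differing is not really the cause. The wrong decoder simply never complains, and the result is then saved as perfectly valid UTF-8.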

https://www.v2ex.com/t/633493