Diffing two versions of a Chinese (Classical Chinese) text: is there an existing solution?

121 days ago
 garywill

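I have two versions of the same Chinese article (in Classical Chinese) and want a program that diffs them. Are there any existing or related tools?

I also want differences like the following to be ignored:

For example, version 1 is entirely in simplified characters, has no punctuation, has spaces between the characters, and breaks the line after every clause: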

床 前 看 月 光

疑 是 地 上 霜

举 头 望 山 月

低 头 思 故 乡

Version 2

牀前明月光,疑是地上霜。

舉頭望明月,低頭思故鄉。

The comparison result I would like the software to produce is:

  1. 看<->明

  2. 山<->明

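The example above is five-character verse, so every line has the same number of characters. I also need to compare texts whose lines differ in length.

Are there any existing or related tools?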

736 views
Node: Programming
6 replies
vacuitym
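121 days ago
This should be fairly easy to implement yourself: first convert both texts to simplified (or both to traditional), then normalize the punctuation to the same form, and then diff them directly.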
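The steps in this reply can be sketched with the standard library alone. This is only a rough illustration: `VARIANTS` is a hypothetical four-entry table standing in for a full simplified/traditional/variant mapping, `normalize` and `differences` are made-up names, and instead of converting punctuation to a common form the sketch simply drops it, which is what the original question asks for.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny variant table (traditional/variant -> simplified).
# A real run would need a complete mapping from a conversion library or dictionary.
VARIANTS = {"牀": "床", "舉": "举", "頭": "头", "鄉": "乡"}

def normalize(text):
    """Drop punctuation, symbols, spaces and control characters, then fold variants."""
    kept = []
    for ch in text:
        # Unicode general categories: P* punctuation, Z* separators,
        # C* control characters such as "\n", S* symbols.
        if unicodedata.category(ch)[0] in "PZCS":
            continue
        kept.append(VARIANTS.get(ch, ch))
    return "".join(kept)

def differences(text1, text2):
    """Return (old, new) pairs for every stretch that does not match."""
    a, b = normalize(text1), normalize(text2)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

v1 = "床 前 看 月 光\n疑 是 地 上 霜\n举 头 望 山 月\n低 头 思 故 乡"
v2 = "牀前明月光,疑是地上霜。\n舉頭望明月,低頭思故鄉。"
print(differences(v1, v2))  # [('看', '明'), ('山', '明')]
```

Because the variants are folded before the comparison, the reported characters are in normalized form; mapping them back to the original spelling would need extra bookkeeping.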
garywill
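121 days ago
Here is a harder example, with different sentence breaks and complex punctuation. (Some versions are even missing a few sentences in the middle.)

Version 1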
帝曰:「有其年已老而有子者,何也?」岐伯曰:「此其天壽過度,氣脈常通,而腎氣有餘也。此雖有子,男不過盡八八,女不過盡七七,而天地之精氣皆竭矣。」

Version 2
帝曰有其年已老而有子者何也
岐伯曰此其天壽過度氣脈常通而腎氣有餘也
此雖有子男不過盡八八女不過盡七七
而天地之精氣皆竭矣
superychen
121 days ago
Is the character count always the same? Ask GPT and it can generate some Python code for you.
superychen
121 days ago
```python
import opencc
import re
from difflib import SequenceMatcher

PATTERN_CHINESE = re.compile(r'[\u4e00-\u9fa5]')
CONVERTER = opencc.OpenCC("t2s")

# Keep only Chinese characters
def clean(text):
    return ''.join(PATTERN_CHINESE.findall(text))

# Convert traditional to simplified
def simplify(text):
    return CONVERTER.convert(text)

# Compare the two texts
def compare_text(text1, text2):
    text1 = clean(text1)
    text2 = clean(text2)
    text1a = simplify(text1)
    text2a = simplify(text2)
    matcher = SequenceMatcher(None, text1a, text2a)
    diffs = matcher.get_opcodes()
    index = 0
    for tag, i1, i2, j1, j2 in diffs:
        if tag == 'replace':
            index += 1
            # The indices come from the simplified copies; they line up with the
            # cleaned originals only if conversion keeps the character count.
            print(f'{index}. {text1[i1:i2]} <-> {text2[j1:j2]}')

# Sample input: the simplified version and the traditional version
simplified_text = '''床 前 看 月 光

疑 是 地 上 霜

举 头 望 山 月

低 头 思 故 乡'''
traditional_text = '''牀前明月光,疑是地上霜。

舉頭望明月,低頭思故鄉。'''

compare_text(simplified_text, traditional_text)
```
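The script above prints only `replace` opcodes, so a sentence that is missing from one version would go unreported. A possible extension for the harder example, as a sketch only: `report` and `version3` are hypothetical names, `version3` is a made-up variant of version 2 with one clause removed, and simplified/traditional conversion is left out because both sample versions are traditional.

```python
import re
from difflib import SequenceMatcher

# Basic CJK Unified Ideographs block; rarer characters in the extension
# blocks would need a wider pattern.
HAN = re.compile(r"[\u4e00-\u9fff]")

def clean(text):
    """Keep only Han characters, dropping punctuation, quotes and line breaks."""
    return "".join(HAN.findall(text))

def report(text1, text2):
    """Return (tag, old, new) for every replace, delete and insert opcode."""
    a, b = clean(text1), clean(text2)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

version1 = ("帝曰:「有其年已老而有子者,何也?」岐伯曰:「此其天壽過度,氣脈常通,"
            "而腎氣有餘也。此雖有子,男不過盡八八,女不過盡七七,而天地之精氣皆竭矣。」")
version2 = """帝曰有其年已老而有子者何也
岐伯曰此其天壽過度氣脈常通而腎氣有餘也
此雖有子男不過盡八八女不過盡七七
而天地之精氣皆竭矣"""

# The two versions differ only in punctuation and line breaks.
print(report(version1, version2))  # []

# Hypothetical third version with one clause missing.
version3 = version2.replace("氣脈常通", "")
print(report(version1, version3))  # [('delete', '氣脈常通', '')]
```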
geelaw
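121 days ago
This can be modeled with edit distance.

Preparation: find a dictionary that records every punctuation mark, whitespace character, and Chinese character, along with the different written forms of the same character (simplified, traditional, and variant forms).

1. Remove all punctuation and whitespace from both strings, keeping only the Chinese characters.
2. Compute the edit script with the minimum edit distance: replacing a character with one of its other written forms, deleting a character, and inserting a character can each be given a cost of 1 (so changing a character into an unrelated one costs 2).

Step 2 is a standard dynamic programming problem.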
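A minimal sketch of the two steps, using the costs suggested in this reply. The names `CANONICAL`, `strip_to_letters`, `same_character` and `min_edits` are hypothetical, and `CANONICAL` is a tiny stand-in for the dictionary of written forms; a real run would need a complete one.

```python
import unicodedata

# Hypothetical mini-dictionary: maps a written form to one canonical form.
CANONICAL = {"牀": "床", "舉": "举", "頭": "头", "鄉": "乡"}

def strip_to_letters(text):
    """Step 1: keep letter-class characters (Han characters are category 'Lo')."""
    return "".join(ch for ch in text if unicodedata.category(ch).startswith("L"))

def same_character(x, y):
    """True if x and y are written forms of the same character."""
    return CANONICAL.get(x, x) == CANONICAL.get(y, y)

def min_edits(text1, text2):
    """Step 2: return (distance, edits) with variant, delete and insert each costing 1."""
    a, b = strip_to_letters(text1), strip_to_letters(text2)
    n, m = len(a), len(b)
    # dist[i][j] = minimum cost of turning a[:i] into b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1)  # delete, insert
            if a[i - 1] == b[j - 1]:
                best = min(best, dist[i - 1][j - 1])            # identical, free
            elif same_character(a[i - 1], b[j - 1]):
                best = min(best, dist[i - 1][j - 1] + 1)        # other written form
            dist[i][j] = best
    # Walk back from the corner to recover one optimal edit script.
    edits, i, j = [], n, m
    while i > 0 or j > 0:
        diag = i > 0 and j > 0
        if diag and a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif (diag and a[i - 1] != b[j - 1] and same_character(a[i - 1], b[j - 1])
              and dist[i][j] == dist[i - 1][j - 1] + 1):
            edits.append(("variant", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            edits.append(("insert", "", b[j - 1]))
            j -= 1
        else:
            edits.append(("delete", a[i - 1], ""))
            i -= 1
    edits.reverse()
    return dist[n][m], edits

v1 = "床 前 看 月 光\n疑 是 地 上 霜\n举 头 望 山 月\n低 头 思 故 乡"
v2 = "牀前明月光,疑是地上霜。\n舉頭望明月,低頭思故鄉。"
distance, edits = min_edits(v1, v2)
print(distance)
# Variant edits are the differences the question wants ignored.
print([e for e in edits if e[0] != "variant"])
```

With these costs an unrelated substitution shows up as a `delete` followed by an `insert`, so 看<->明 would be reported as two entries; pairing adjacent deletes and inserts back into one line is left out of the sketch.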
https://www.v2ex.com/t/1007060