Beginner question: how do I extract the plain text from rich-text editor content and then put it back?

cwj14

2022-01-17 15:46:31 +08:00

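Here is the requirement: my company needs me to do translation in code, and I have integrated Baidu's API. I want to extract the plain text from the rich-text editor content, translate it, and then put it back. The language is PHP. Any pointers would be much appreciated. Putting the text back must leave the original styling unchanged.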

qwertyzzz

2022-01-17 15:51:01 +08:00

I think there's a function called htmlspecialchars_decode?

shapl

2022-01-17 15:53:38 +08:00

Extracting it is easy. Putting it back exactly as it was is the hard part.

2i2Re2PLMaDnghL

2022-01-17 15:54:12 +08:00

Parse the HTML properly, send the text nodes to the translation API, and replace them with the results.

jslang

2022-01-17 15:54:29 +08:00

Replace with a regex over the HTML
textVar.replace(/>([^<]+)</g, ($, text) => { console.log(text); return '>--------' + text + '-----------<'; })

cwj14

2022-01-17 15:55:53 +08:00

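@shapl I'm thinking that after extracting, I could use each piece of text as an array key and its translation as the value, then replace them back across the whole text.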
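A minimal sketch of that key/value idea, assuming PHP 7.4+ and using the regex from earlier in the thread. `fake_translate_map` is a hypothetical stand-in for the real Baidu API call (it only uppercases), and `translate_with_map` is a name made up for illustration. The replacement runs only on text sitting between `>` and `<`, so tag names and attributes are left alone.

```php
<?php
// Hypothetical stand-in for the translation API: returns original => translated.
function fake_translate_map(array $texts): array
{
    $map = [];
    foreach ($texts as $text) {
        $map[$text] = strtoupper($text); // placeholder "translation"
    }
    return $map;
}

function translate_with_map(string $html): string
{
    // 1. Collect the text runs between ">" and "<", skipping whitespace-only ones.
    preg_match_all('/>([^<]+)</', $html, $found);
    $texts = array_unique(array_filter($found[1], fn($t) => trim($t) !== ''));

    // 2. Build the original => translated lookup table.
    $map = fake_translate_map($texts);

    // 3. Swap each text run for its translation, restoring the ">" and "<"
    //    that the pattern consumed. Runs missing from the map stay unchanged.
    return preg_replace_callback(
        '/>([^<]+)</',
        fn($m) => '>' . ($map[$m[1]] ?? $m[1]) . '<',
        $html
    );
}

echo translate_with_map('<p class="note">hello <b>world</b></p>');
```

Note that the regex approach assumes the fragment starts and ends with a tag and that no attribute value contains a literal `>`; a real HTML parser avoids those limits.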

ctro15547

2022-01-17 15:57:30 +08:00

Write a script using re (regex)?

ctro15547

2022-01-17 15:58:28 +08:00

BS4 should be able to get the text too

cwj14

2022-01-17 16:00:06 +08:00

@jslang Do you have a PHP version? This one works, but it's JS.

imicksoft

2022-01-17 16:01:28 +08:00

Parse it as HTML or XML, find the nodes whose nodeType is Text, translate their text content, and write it back.
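A rough sketch of that node-walking approach with PHP's built-in DOM extension. `fake_translate_node` is a hypothetical placeholder for the translation API, and the wrapper `<div>` plus the `<?xml encoding="UTF-8">` prefix are assumptions of this sketch (a common workaround so `DOMDocument::loadHTML` reads the fragment as UTF-8), not something from the thread.

```php
<?php
// Hypothetical placeholder for the real translation call.
function fake_translate_node(string $text): string
{
    return strtoupper($text);
}

// Recursively visit child nodes and rewrite only non-empty text nodes.
function translate_text_nodes(DOMNode $node): void
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            if (trim($child->nodeValue) !== '') {
                $child->nodeValue = fake_translate_node($child->nodeValue);
            }
        } elseif ($child->hasChildNodes()) {
            translate_text_nodes($child);
        }
    }
}

function translate_html_dom(string $html): string
{
    libxml_use_internal_errors(true); // editor HTML is often not strictly valid
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    // The wrapper is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    translate_text_nodes($wrapper);

    // Serialize only the wrapper's children, i.e. the original fragment.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo translate_html_dom('<p style="color:red">hello <b>world</b></p>');
```

Because tags and attributes are never touched, the styling survives; the serializer may still normalize the markup slightly (for example attribute quoting).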

jslang

2022-01-17 16:11:38 +08:00

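Try this; I don't have a PHP environment
preg_split("/>([^<]+)</", $textVar, -1, PREG_SPLIT_DELIM_CAPTURE)
Take the odd-indexed elements, translate them, then join everything back together, restoring the ">" before and the "<" after each piece of text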
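The split-and-rejoin idea could look roughly like this. It assumes the `PREG_SPLIT_DELIM_CAPTURE` flag so that the captured text lands at the odd indexes; `fake_translate_piece` is a hypothetical placeholder for the API call, and `translate_by_split` is an illustrative name.

```php
<?php
// Hypothetical placeholder for the translation API: just wraps the text in brackets.
function fake_translate_piece(string $text): string
{
    return '[' . $text . ']';
}

function translate_by_split(string $html): string
{
    // Even indexes: tag fragments (without the ">" / "<" the pattern consumed).
    // Odd indexes: the captured text between tags.
    $parts = preg_split('/>([^<]+)</', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Leave whitespace-only runs alone, translate the rest,
            // and restore the surrounding ">" and "<".
            $text = trim($part) === '' ? $part : fake_translate_piece($part);
            $parts[$i] = '>' . $text . '<';
        }
    }
    return implode('', $parts);
}

echo translate_by_split('<p>Hi <b>there</b></p>');
```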

jslang

2022-01-17 16:29:04 +08:00

I tried it in an online sandbox; this should work
```
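$textVar = '......';

// Collect every text run that sits between a ">" and the next "<".
$arr = [];
preg_match_all("/>([^<]+)</", $textVar, $arr);
var_dump($arr[1]);

// Split on the same pattern; PREG_SPLIT_DELIM_CAPTURE keeps the captured text
// at the odd indexes, with the tag fragments at the even indexes.
print_r(preg_split("/>([^<]+)</", $textVar, -1, PREG_SPLIT_DELIM_CAPTURE));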
```

cwj14

2022-01-17 17:14:25 +08:00

@jslang Thank you so much, I really appreciate it

Rache1

2022-01-17 18:14:41 +08:00

Packagist
https://packagist.org/?query=html%20parser

Find a suitable parser, traverse the nodes, and you're done

chengxiao

2022-01-17 23:27:34 +08:00

You need to learn a bit of XPath; after that it's just a matter of iterating over the nodes and calling the API
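For illustration, an XPath version might look like the sketch below. It selects every non-blank text node outside `<script>` and `<style>`, sends them to the API in one batch, and writes the results back. `fake_translate_batch` is a hypothetical stand-in for the API, and the wrapper `<div>` and `<?xml encoding="UTF-8">` prefix are assumptions of this sketch.

```php
<?php
// Hypothetical batch translator: returns translations in the same order as the input.
function fake_translate_batch(array $texts): array
{
    return array_map('strtoupper', $texts);
}

function translate_html_xpath(string $html): string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    // Non-blank text nodes that are not inside <script> or <style>.
    $xpath = new DOMXPath($doc);
    $nodes = iterator_to_array($xpath->query(
        '//text()[normalize-space() and not(ancestor::script) and not(ancestor::style)]'
    ));

    // One API round trip for all texts, then write each result back to its node.
    $translated = fake_translate_batch(array_map(fn($n) => $n->nodeValue, $nodes));
    foreach ($nodes as $i => $node) {
        $node->nodeValue = $translated[$i];
    }

    // Serialize only the wrapper's children, i.e. the original fragment.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo translate_html_xpath('<p>abc <i>def</i></p>');
```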

RickyC

2022-01-18 13:16:25 +08:00

```
function text_from_html($html)
{
    // Remove the HTML tags
    $html = strip_tags($html);

    // Convert HTML entities to single characters
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
    $html_len = mb_strlen($html, 'UTF-8');

    // Make the string the desired number of characters
    // Note that substr is not good as it counts by bytes and not characters
    $html = mb_substr($html, 0, $html_len, 'UTF-8');

    return $html;
}

$content = htmlentities(text_from_html($content));
```
-----
You can try this for the extraction part. Putting it back is too hard.