How to implement pattern-based text similarity clustering

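I want to build a service that clusters business logs. The logs follow recognizable patterns, but the patterns can't be enumerated exhaustively, so I'd like a service that clusters the logs to make further analysis easier.

My current thinking: since this is clustering, I need to pick a way to map log text to feature vectors under which similarity is meaningful (a text embedding), plus a clustering algorithm.

What I'm stuck on is how to choose the text embedding.

I haven't done anything like this before. I've been reading up recently, but perhaps I'm searching the wrong way, because I haven't found an implementation or algorithm I could borrow from.

I'm not sure whether my description is clear. If anyone has done related work, I'd appreciate some pointers~

If my approach is flawed, please tell me that as well~~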

widewing

2018-06-05 17:48:10 +08:00

I want to do this too. Bookmarking.

fffflyfish

2018-06-05 18:26:17 +08:00

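Train word2vec on the tokenized logs, then sum the vectors of all tokens in a text to get a vector for that text, and compare those vectors to get the similarity between texts.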
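A minimal sketch of the idea above, not a tested recipe. The word vectors in `TOY_VECTORS` are made up for illustration; in practice they would come from a word2vec model trained on your own logs. `text_vector` and `cosine` are hypothetical helper names, and whitespace splitting stands in for a real tokenizer.

```python
import math

# Hypothetical toy vectors; a real setup would load these from a trained word2vec model.
TOY_VECTORS = {
    "connection": [0.9, 0.1, 0.0],
    "timeout": [0.8, 0.2, 0.1],
    "refused": [0.7, 0.3, 0.0],
    "user": [0.0, 0.9, 0.2],
    "login": [0.1, 0.8, 0.3],
}


def text_vector(text, vectors=TOY_VECTORS):
    """Sum the vectors of all known tokens in the text; None if no token is known."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return None
    dim = len(vectors[tokens[0]])
    total = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(vectors[token]):
            total[i] += value
    return total


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a = text_vector("connection timeout")
    b = text_vector("connection refused")
    c = text_vector("user login")
    print(cosine(a, b), cosine(a, c))
```

With these toy vectors the two connection errors score much closer to each other than to the login message. The resulting text vectors could then be fed to whatever clustering algorithm you pick.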

ipwx

2018-06-05 18:35:34 +08:00

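Word embeddings may not work well for patterns: a pattern carries too little information, so word embeddings overfit easily.

Have a look at the DeepLog paper. I haven't tried it myself, but it seems pretty strong.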

shiznet

2018-06-05 20:30:55 +08:00

@ipwx I read the abstract of "DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning", and it doesn't seem to quite match what I need.

```
Anomaly detection is a critical step towards building a secure and
trustworthy system. The primary purpose of a system log is to
record system states and significant events at various critical points
to help debug system failures and perform root cause analysis. Such
log data is universally available in nearly all computer systems.
Log data is an important and valuable resource for understanding
system status and performance issues; therefore, the various system
logs are naturally excellent source of information for online
monitoring and anomaly detection. We propose DeepLog, a deep
neural network model utilizing Long Short-Term Memory (LSTM),
to model a system log as a natural language sequence. This allows
DeepLog to automatically learn log patterns from normal execution,
and detect anomalies when log patterns deviate from the model
trained from log data under normal execution. In addition, we
demonstrate how to incrementally update the DeepLog model in
an online fashion so that it can adapt to new log patterns over time.
Furthermore, DeepLog constructs workflows from the underlying
system log so that once an anomaly is detected, users can diagnose
the detected anomaly and perform root cause analysis effectively.
Extensive experimental evaluations over large log data have shown
that DeepLog has outperformed other existing log-based anomaly
detection methods based on traditional data mining methodologies.
```

takato

2018-06-05 20:55:29 +08:00

@ipwx The LSTM part still has a lot of room for optimization. The currently preferred approach is to drop RNNs entirely.

WildCat

2018-06-05 20:58:08 +08:00

@takato If not RNNs, what architecture would you use? 1D CNN?

takato

2018-06-05 20:59:54 +08:00

@WildCat CNNs work, and they can be parallelized too.
You just need to handle long-range dependencies properly.

takato

2018-06-05 21:00:57 +08:00

@WildCat Actually, just apply the same line of improvement that led from vanilla RNN to LSTM one more time. That naturally takes you outside the RNN family.

ETiV

2018-06-05 22:49:44 +08:00

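I've been using Google Cloud Functions these days. It has an error summary page
where large numbers of errors of the same type are grouped together. Isn't that what the OP wants?

I think it's fairly simple to implement: group by error stack.

The OP could also consider adding the current module, file, and line number to the log output; those values would then be enough for grouping~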
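A rough sketch of that kind of grouping, assuming each log record has already been parsed into a dict. The field names `module`, `file`, `line`, and `message` are hypothetical, as is the helper name `group_by_location`.

```python
from collections import defaultdict


def group_by_location(records):
    """Group log messages by the (module, file, line) that emitted them."""
    groups = defaultdict(list)
    for record in records:
        key = (record["module"], record["file"], record["line"])
        groups[key].append(record["message"])
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        {"module": "pay", "file": "order.py", "line": 42, "message": "order 1001 failed"},
        {"module": "pay", "file": "order.py", "line": 42, "message": "order 1002 failed"},
        {"module": "auth", "file": "login.py", "line": 7, "message": "bad token"},
    ]
    for location, messages in group_by_location(sample).items():
        print(location, len(messages))
```

This only works if the log format exposes the source location as separate fields; it does not help when the location is buried inside the free-text message.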

shiznet

2018-06-06 09:10:36 +08:00

@ETiV

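Modules and files are independent, so those can be told apart. But a single module may emit different logs: for example, method A may print exception stacks in several places, and each stack may differ slightly. The line number appears as a variable inside the log message, so it can't be used directly as an identifier.

Still, I can follow this line of thinking:
first group by module, then cluster further within each module.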
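One way the two-stage idea could look, as a sketch under several assumptions rather than a recommended design. Records are assumed to be `(module, message)` pairs. Inside a module, numbers and hex values are masked with a regex (a stand-in for real variable detection), and messages are greedily merged when their masked templates are similar according to `difflib.SequenceMatcher`. The threshold `0.8` and all function names are arbitrary choices for illustration.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Assumption: variables in a message are hex values or plain numbers (e.g. line numbers, IDs).
_VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def to_template(message):
    """Replace variable-looking parts with a placeholder."""
    return _VARIABLE.sub("<*>", message)


def cluster_module(messages, threshold=0.8):
    """Greedy clustering: join the first cluster whose template is similar enough."""
    clusters = []  # list of (representative_template, member_messages)
    for message in messages:
        template = to_template(message)
        for representative, members in clusters:
            if SequenceMatcher(None, representative, template).ratio() >= threshold:
                members.append(message)
                break
        else:
            clusters.append((template, [message]))
    return clusters


def cluster_logs(records, threshold=0.8):
    """Stage 1: split by module. Stage 2: cluster messages inside each module."""
    by_module = defaultdict(list)
    for module, message in records:
        by_module[module].append(message)
    return {m: cluster_module(msgs, threshold) for m, msgs in by_module.items()}


if __name__ == "__main__":
    # Made-up sample logs for illustration only.
    sample = [
        ("pay", "NullPointerException at Order.java:42"),
        ("pay", "NullPointerException at Order.java:57"),
        ("pay", "timeout after 3000 ms calling bank"),
        ("auth", "user 1001 login failed"),
    ]
    for module, clusters in cluster_logs(sample).items():
        for template, members in clusters:
            print(module, template, len(members))
```

The greedy pass compares each message against every existing cluster, so it is quadratic in the worst case; for large volumes a proper clustering algorithm over feature vectors would likely be a better fit.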

shiznet

2018-06-06 09:13:20 +08:00

@takato

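Do you have any insights on this problem? I haven't built anything like this before. Could you elaborate on how LSTM and RNN would apply to this scenario?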

takato

2018-06-08 13:48:12 +08:00

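@shiznet Uh... put that way, I honestly don't know where to start.
Could you give me a hint? For example, have you worked on any algorithm-related projects before?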

https://www.v2ex.com/t/460641